AdaBoost Algorithm For Machine Learning

1. Objective

In this Machine Learning blog, we will study the Boosting algorithm AdaBoost. We will also cover the main concepts related to Adaptive Boosting, with AdaBoost examples.

2. What is AdaBoost

AdaBoost is short for Adaptive Boosting. It was the first really successful boosting algorithm developed for binary classification, and it is the best starting point for understanding boosting. Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.

Generally, AdaBoost is used with short decision trees. After the first tree is created, its performance on each training instance is used to weight how much attention the next tree should pay to each training instance. Training data that is hard to predict is given more weight, whereas easy-to-predict instances are given less weight.

3. Learning – AdaBoost Model

Learn AdaBoost Model from Data

AdaBoost is best used to boost the performance of decision trees on binary classification problems.
AdaBoost was originally called AdaBoost.M1 by its authors. More recently it may be referred to as discrete AdaBoost, because it is used for classification rather than regression.
AdaBoost can be used to boost the performance of any machine learning algorithm, but it is best used with weak learners.

Each instance in the training dataset is weighted. The initial weight is set to:
weight(xi) = 1/n
Where xi is the i’th training instance and n is the number of training instances.
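The initial weighting can be sketched in a few lines of Python. The function name `initial_weights` is only an illustrative choice, not something from a particular library.

```python
def initial_weights(n):
    """Return n equal weights of 1/n, one per training instance."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return [1.0 / n] * n


# Four training instances each start with a weight of 1/4.
print(initial_weights(4))  # [0.25, 0.25, 0.25, 0.25]
```

Because every instance starts with the same weight, the first weak learner treats all training instances as equally important.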

4. How To Train One Model

A weak classifier (a decision stump) is prepared on the training data using the weighted samples. Only binary classification problems are supported, so each decision stump makes one decision on one input variable and outputs a +1.0 or -1.0 value for the first or second class value.
The misclassification rate is calculated for the trained model. Traditionally, this is calculated as:
error = (N - correct) / N
Where error is the misclassification rate, correct is the number of training instances predicted correctly by the model, and N is the total number of training instances.

Example 1

If the model predicted 78 of 100 training instances correctly, the error or misclassification rate would be (100 - 78) / 100 or 0.22.
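As a quick check of the unweighted misclassification rate, here is a minimal Python sketch. The helper name `misclassification_rate` is made up for illustration.

```python
def misclassification_rate(correct, n):
    """Fraction of the n training instances that were predicted incorrectly."""
    return (n - correct) / n


# 78 of 100 training instances predicted correctly.
print(misclassification_rate(78, 100))  # 0.22
```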
This is modified to use the weighting of the training instances:
error = sum(w(i) * terror(i)) / sum(w)
This is the weighted sum of the misclassification rate, where w(i) is the weight for training instance i and terror(i) is the prediction error for training instance i, which is 1 if misclassified and 0 if correctly classified.

Example 2

Suppose we had 3 training instances with the weights 0.01, 0.5 and 0.2.
If the predicted values were -1, -1 and -1, and
the actual output variables in the instances were -1, 1 and -1, then the terrors would be 0, 1, and 0.
The misclassification rate would be calculated as:
error = (0.01*0 + 0.5*1 + 0.2*0) / (0.01 + 0.5 + 0.2)
or
error = 0.704
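The weighted calculation can be reproduced with a short Python sketch. The function name `weighted_error` is an illustrative assumption; the numbers are the ones from the example.

```python
def weighted_error(weights, actual, predicted):
    """Weighted misclassification rate: sum(w(i) * terror(i)) / sum(w)."""
    # terror is 1 for a misclassified instance and 0 for a correct one.
    terrors = [0 if y == p else 1 for y, p in zip(actual, predicted)]
    return sum(w * t for w, t in zip(weights, terrors)) / sum(weights)


error = weighted_error(
    weights=[0.01, 0.5, 0.2],
    actual=[-1, 1, -1],
    predicted=[-1, -1, -1],
)
print(round(error, 3))  # 0.704
```

Only the second instance is misclassified, but it carries most of the total weight, which is why the weighted error is so much higher than the unweighted rate of 1/3.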
A stage value is calculated for the trained model, which provides a weighting for any predictions that the model makes. The stage value for a trained model is calculated as follows:
stage = ln((1-error) / error)
Where stage is the stage value used to weight predictions from the model, ln() is the natural logarithm and error is the misclassification error for the model. The effect of the stage weight is that more accurate models have more weight in the final prediction.
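A small Python sketch of the stage value follows. The function name `stage_value` and the guard against errors of exactly 0 or 1 are assumptions added here, since the formula is undefined at those two values.

```python
import math


def stage_value(error):
    """stage = ln((1 - error) / error) for a weighted error strictly between 0 and 1."""
    if not 0.0 < error < 1.0:
        raise ValueError("error must be strictly between 0 and 1")
    return math.log((1.0 - error) / error)


print(round(stage_value(0.22), 3))  # 1.266
print(stage_value(0.5))             # 0.0
```

A model with an error of 0.5 is no better than guessing and gets a stage value of zero, while a model with a lower error gets a larger stage value.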
The training weights are updated, giving more weight to incorrectly predicted instances and less weight to correctly predicted instances.

Example 3

The weight of one training instance (w) is updated using:
w = w * exp(stage * terror)
Where w is the weight for a specific training instance,
exp() is the numerical constant e or Euler’s number raised to a power,
stage is the stage value for the weak classifier (calculated from its misclassification rate) and
terror is the error the weak classifier made predicting the output, evaluated as:
terror = 0 if(y == p), otherwise 1
Where y is the output variable for the training instance and p is the prediction from the weak learner.
This has the effect of not changing the weight if the training instance was classified correctly, and making the weight larger if the weak learner misclassified the instance.
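The update rule can be sketched in Python as below. The function name `update_weight` and the example numbers are illustrative assumptions, not values from the text.

```python
import math


def update_weight(w, stage, y, p):
    """Apply w = w * exp(stage * terror) to one training instance."""
    terror = 0 if y == p else 1  # 0 when the weak learner was right, 1 when wrong
    return w * math.exp(stage * terror)


# Correctly classified: the weight is unchanged.
print(update_weight(0.2, 1.0, y=-1, p=-1))  # 0.2
# Misclassified with a positive stage value: the weight grows.
print(update_weight(0.2, 1.0, y=1, p=-1) > 0.2)  # True
```

Since the weighted error divides by the sum of the weights, the weights do not have to be rescaled to sum to one for the next round in this sketch.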

5. AdaBoost Ensemble

Weak models are added sequentially, each trained using the weighted training data.
The process continues until a pre-set number of weak learners have been created.
Once completed, you are left with a pool of weak learners, each with a stage value.
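Putting the steps together, the sequential training loop might look like the following sketch. It is not a reference implementation: the exhaustive stump search, the dictionary layout, the function names and the clamping of the error away from 0 and 1 are all assumptions made for illustration.

```python
import math


def train_stump(X, y, weights):
    """Search every feature, threshold and polarity for the lowest weighted error."""
    total = sum(weights)
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] < threshold else -polarity for row in X]
                err = sum(w for w, p, t in zip(weights, preds, y) if p != t) / total
                if best is None or err < best["error"]:
                    best = {"error": err, "feature": j,
                            "threshold": threshold, "polarity": polarity}
    return best


def stump_predict(stump, row):
    """Return +1 or -1 from a single comparison on one input variable."""
    if row[stump["feature"]] < stump["threshold"]:
        return stump["polarity"]
    return -stump["polarity"]


def train_adaboost(X, y, n_rounds):
    """Build a list of (stage, stump) pairs, one per boosting round."""
    n = len(X)
    weights = [1.0 / n] * n  # initial weight(xi) = 1/n
    ensemble = []
    for _ in range(n_rounds):
        stump = train_stump(X, y, weights)
        # Assumption: clamp the error so ln((1 - error) / error) stays finite.
        error = min(max(stump["error"], 1e-10), 1.0 - 1e-10)
        stage = math.log((1.0 - error) / error)
        ensemble.append((stage, stump))
        # w = w * exp(stage * terror), with terror = 1 only for misclassified rows.
        weights = [
            w * math.exp(stage * (0 if stump_predict(stump, row) == t else 1))
            for w, row, t in zip(weights, X, y)
        ]
    return ensemble


# Toy data (made up): one input variable, labels that no single stump separates.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1, 1, -1, -1, 1, 1]
for stage, stump in train_adaboost(X, y, n_rounds=3):
    print(round(stage, 3), stump["feature"], stump["threshold"], stump["polarity"])
```

Each round re-fits a stump against the current weights, so later stumps concentrate on the instances that earlier ones got wrong.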

6. Making Predictions with AdaBoost

Predictions are made by calculating the weighted sum of the weak classifiers' predictions. For a new input instance, each weak learner calculates a predicted value as either +1.0 or -1.0. The predicted values are weighted by each weak learner's stage value. The prediction for the ensemble model is taken as the sum of the weighted predictions. If the sum is positive, the first class is predicted; if negative, the second class is predicted.

For example:
5 weak classifiers may predict the values 1.0, 1.0, -1.0, 1.0, -1.0. From a majority vote, it looks like the model will predict a value of 1.0 or the first class. These same 5 weak classifiers may have the stage values 0.2, 0.5, 0.8, 0.2 and 0.9 respectively. Calculating the weighted sum of these predictions results in an output of -0.8, which would be an ensemble prediction of -1.0 or the second class.
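The weighted vote from this example can be checked with a short Python sketch. The function name `ensemble_predict` is illustrative, and treating a sum of exactly zero as the second class is an assumption, since the text only covers positive and negative sums.

```python
def ensemble_predict(predictions, stages):
    """Return the weighted sum of the votes and the resulting class label."""
    total = sum(s * p for s, p in zip(stages, predictions))
    # Assumption: a sum of exactly zero falls to the second class (-1.0).
    label = 1.0 if total > 0 else -1.0
    return total, label


total, label = ensemble_predict(
    predictions=[1.0, 1.0, -1.0, 1.0, -1.0],
    stages=[0.2, 0.5, 0.8, 0.2, 0.9],
)
print(round(total, 1), label)  # -0.8 -1.0
```

Three of the five weak learners vote for the first class, but the two dissenting learners have the largest stage values, so the weighted sum is negative.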

7. Data Preparation for AdaBoost

This section lists some heuristics for best preparing your data for AdaBoost.

Quality Data:
Because the ensemble method attempts to correct misclassifications in the training data, you need to be careful that the training data is high-quality.

Outliers:
Outliers will force the ensemble down the rabbit hole of working hard to correct for cases that are unrealistic. These could be removed from the training dataset.

Noisy Data:
Noisy data, specifically noise in the output variable, can be problematic. If possible, attempt to isolate and clean these from your training dataset.