XGBoost Algorithm – Applied Machine Learning

1. XGBoost

In this Machine Learning blog, we will cover an introduction to XGBoost, how to code the XGBoost algorithm, the advanced functionality of XGBoost, and its General Parameters, Booster Parameters, Linear Booster Specific Parameters, and Learning Task Parameters. Furthermore, we will study building models and tuning the parameters of XGBoost.

2. Introduction to XGBoost Algorithm

Basically, XGBoost is an algorithm that has recently been dominating applied machine learning. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. At its core, it is a software library that you can download and install on your machine and then access from a variety of interfaces.

3. Main Interfaces for Working with the XGBoost Algorithm

  • C++, Java and JVM languages.
  • Julia.
  • Command Line Interface.
  • Python interface, along with a model integrated in scikit-learn.
  • R interface, as well as a model in the caret package (a quick R setup sketch follows this list).
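
For example, in R the library can be installed from CRAN and then loaded like any other package. Below is a minimal setup sketch (the package name is the one published on CRAN):

# install the xgboost R package once
install.packages("xgboost")
# load it for the current session
library(xgboost)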

4. Preparation of Data for using XGBoost Algorithm

Let's assume you have a dataset named 'campaign' and you want to convert all of its categorical variables into flags (dummy variables), except the response variable. Here is how you do it:

sparse_matrix <- sparse.model.matrix(response ~ .-1, data = campaign)

Now let’s break down this code as follows:

"sparse.model.matrix" is the command, and all other inputs inside the parentheses are its parameters.

The "response" on the left of the formula tells the statement to leave the "response" variable out of the matrix.
"-1" removes the intercept column which this command would otherwise create as the first column.
And finally, you specify the dataset name.

To convert the target variable as well, you can use the following code:
output_vector = df[,response] == "Responder"

Here is what the code does (a training sketch follows below):

set output_vector to 0 by default;
set output_vector to 1 for the rows where response is "Responder";
return output_vector.
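
Once both pieces are ready, the sparse matrix and the 0/1 target vector can be fed directly to xgboost(), which accepts sparse matrices from the Matrix package. A minimal sketch (the nrounds and objective values here are illustrative assumptions, not taken from the campaign example):

# minimal sketch: train on the prepared sparse matrix and binary target
# nrounds and objective are illustrative choices
bst <- xgboost(data = sparse_matrix, label = output_vector,
               nrounds = 10, objective = "binary:logistic")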

5. Building a Model with the XGBoost Algorithm in R

Here are the simple steps you can use to tackle a data problem using the XGBoost algorithm:

Step 1: Load all the libraries

library(xgboost)
library(readr)
library(stringr)
library(caret)
library(car)

Step 2: Load the dataset

(Here I use bank data where we need to find whether a customer is eligible for a loan or not.)
set.seed(100)
setwd("C:\\Users\\ts93856\\Desktop\\datasource")
# load data
df_train = read_csv("train_users_2.csv")
df_test = read_csv("test_users.csv")
# load labels of train data
labels = df_train['labels']
df_train = df_train[-grep('labels', colnames(df_train))]
# combine train and test data
df_all = rbind(df_train, df_test)

Step 3: Data Cleaning & Feature Engineering

# clean variables: here I clean records of people with age less than 14 or more than 100
df_all[df_all$age < 14 | df_all$age > 100, 'age'] <- -1
df_all$age[df_all$age < 0] <- mean(df_all$age[df_all$age > 0])
# one-hot-encoding categorical features
ohe_feats = c('gender', 'education', 'employer')
dummies <- dummyVars(~ gender + education + employer, data = df_all)
df_all_ohe <- as.data.frame(predict(dummies, newdata = df_all))
df_all_combined <- cbind(df_all[, -c(which(colnames(df_all) %in% ohe_feats))], df_all_ohe)
df_all_combined$agena <- as.factor(ifelse(df_all_combined$age < 0, 1, 0))

I am using a list of variables in "features_selected" to be used by the model. I have shared a quick and smart way to choose variables later in this article.

df_all_combined <- df_all_combined[, c('id', features_selected)]
# split train and test
X = df_all_combined[df_all_combined$id %in% df_train$id, ]
y <- recode(labels$labels, "'True'=1; 'False'=0")
X_test = df_all_combined[df_all_combined$id %in% df_test$id, ]


Step 4: Tune and Run the model

xgb <- xgboost(data = data.matrix(X[,-1]),
               label = y,
               eta = 0.1,
               max_depth = 15,
               nrounds = 25,
               subsample = 0.5,
               colsample_bytree = 0.5,
               seed = 1,
               eval_metric = "merror",
               objective = "multi:softprob",
               num_class = 12,
               nthread = 3
)
Step 5: Score the Test Population

And that's it! You now have an object xgb, which is an XGBoost model. Here is how you score a test population:
# predict values in test set
y_pred <- predict(xgb, data.matrix(X_test[,-1]))
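
Because the objective is multi:softprob with num_class = 12, predict() returns a single long vector that holds 12 class probabilities per test row. A minimal sketch of reshaping it into a probability matrix and picking the most likely class (the byrow reshape assumes the per-row probabilities are stored contiguously):

# reshape the flat prediction vector into an n x 12 probability matrix
pred_matrix <- matrix(y_pred, ncol = 12, byrow = TRUE)
# predicted class index (1 to 12) for each test row
pred_class <- max.col(pred_matrix)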

6. XGBoost Algorithm – Parameters

a. General Parameters

Following are the general parameters used in the XGBoost algorithm (a short usage sketch follows the list):

  • silent: The default value is 0. You need to specify 0 for printing running messages, 1 for silent mode.
  • booster: The default value is gbtree. You need to specify the booster to use: gbtree (tree based) or gblinear (linear function).
  • num_pbuffer: This is set automatically by the XGBoost algorithm and does not need to be set by the user. Read the xgboost documentation for more details.
  • num_feature: This is set automatically by the XGBoost algorithm and does not need to be set by the user.
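
In the R interface these general parameters are usually collected in a params list and passed to xgb.train() together with an xgb.DMatrix. A minimal sketch, reusing the X and y objects built earlier (the nrounds value is illustrative):

# minimal sketch: passing general parameters through a params list
dtrain <- xgb.DMatrix(data = data.matrix(X[,-1]), label = y)
params <- list(booster = "gbtree",  # tree-based booster (the default)
               silent = 0)          # 0 = print running messages
bst <- xgb.train(params = params, data = dtrain, nrounds = 25)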


b. Booster Parameters
Below we discuss the tree-specific booster parameters in the XGBoost algorithm (a tuning sketch follows the list):

  • eta: The default value is set to 0.3. You need to specify the step size shrinkage used in an update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative. The range is 0 to 1. A low eta value makes the model more robust to overfitting.
  • gamma: The default value is set to 0. You need to specify the minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be. The range is 0 to ∞.
  • max_depth: The default value is set to 6. You need to specify the maximum depth of a tree. The range is 1 to ∞.
  • min_child_weight: The default value is set to 1. You need to specify the minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weights less than min_child_weight, the building process gives up further partitioning. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger it is, the more conservative the algorithm will be. The range is 0 to ∞.
  • max_delta_step: The default value is set to 0. This is the maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression, especially when a class is extremely imbalanced. Setting it to a value of 1-10 might help control the update. The range is 0 to ∞.
  • subsample: The default value is set to 1. You need to specify the subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow trees, which helps prevent overfitting. The range is 0 to 1.
  • colsample_bytree: The default value is set to 1. You need to specify the subsample ratio of columns when constructing each tree. The range is 0 to 1.
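
Tree booster parameters such as eta, max_depth and subsample are the usual targets for tuning. A minimal cross-validation sketch with xgb.cv(), reusing the dtrain object from the sketch above (the fold count and parameter values are illustrative assumptions):

# minimal sketch: 5-fold cross-validation over one candidate setting of booster parameters
cv <- xgb.cv(params = list(eta = 0.1,
                           max_depth = 6,
                           subsample = 0.8,
                           colsample_bytree = 0.8,
                           objective = "multi:softprob",
                           num_class = 12,
                           eval_metric = "merror"),
             data = dtrain, nrounds = 50, nfold = 5)
# cv$evaluation_log holds the train/test error per boosting round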


c. Linear Booster Specific Parameters

These are the linear booster specific parameters in the XGBoost algorithm (a short usage sketch follows the list):

  • lambda and alpha: These are the L2 and L1 regularization terms on the weights. The default value of lambda is 1 and of alpha is 0.
  • lambda_bias: L2 regularization term on bias and has a default value of 0.
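
With the linear booster, these terms go into the same params list together with booster = "gblinear". A minimal sketch (the objective and regularization values are illustrative, and dtrain is the matrix built in the earlier sketch):

# minimal sketch: a linear booster with regularization on the weights
params_lin <- list(booster = "gblinear",
                   lambda = 1,              # L2 regularization term on weights
                   alpha = 0,               # L1 regularization term on weights
                   objective = "reg:linear")
bst_lin <- xgb.train(params = params_lin, data = dtrain, nrounds = 25)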


d. Learning Task Parameters

Following are the learning task parameters in the XGBoost algorithm (a short usage sketch follows the list):

  • base_score: The default value is set to 0.5. You need to specify the initial prediction score of all instances, global bias.
  • objective: The default value is set to reg:linear. You need to specify the type of learning task you want, such as linear regression, logistic regression, Poisson regression, etc.
  • eval_metric: You need to specify the evaluation metrics for validation data. And a default metric will be assigned according to the objective.
  • seed: As always, you specify the seed here to reproduce the same set of outputs (in the R package this is typically controlled with set.seed()).
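
These task parameters sit in the same params list as the booster parameters. A minimal sketch for a binary classification task (the objective and metric values are illustrative, and sparse_matrix / output_vector are the objects prepared in section 4):

# minimal sketch: learning task parameters for a binary classifier
params_task <- list(objective = "binary:logistic",  # type of learning task
                    eval_metric = "auc",             # validation metric
                    base_score = 0.5)                # initial prediction score (global bias)
set.seed(100)  # in the R package, reproducibility is controlled with set.seed()
bst_bin <- xgboost(data = sparse_matrix, label = output_vector,
                   params = params_task, nrounds = 25)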

7. Advanced functionality of XGBoost Algorithm

We can say XGBoost is simple in comparison to other machine learning techniques. If you followed everything we have done so far, you already have a model.

Let’s take it one step further and try to find the variable importance in the model and subset our variable list.

# Let's start with finding what the actual trees look like
model <- xgb.dump(xgb, with_stats = TRUE)
model[1:10] # this statement prints the first 10 lines (nodes) of the model dump
# Get the real feature names
names <- dimnames(data.matrix(X[,-1]))[[2]]
# Compute the feature importance matrix
importance_matrix <- xgb.importance(names, model = xgb)
# Nice graph
xgb.plot.importance(importance_matrix[1:10, ])
# In case the last step does not work for you because of a version issue, you can try the following:
barplot(importance_matrix$Gain[1:10], names.arg = importance_matrix$Feature[1:10], las = 2)
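
One quick and practical way to use the importance matrix is to keep only the top-ranked features and retrain on that subset. A minimal sketch (keeping 20 features is an arbitrary illustrative cutoff):

# minimal sketch: subset the variable list to the most important features and retrain
top_features <- head(importance_matrix$Feature, 20)
X_top <- X[, c("id", top_features)]
xgb_top <- xgboost(data = data.matrix(X_top[,-1]), label = y,
                   eta = 0.1, max_depth = 15, nrounds = 25,
                   objective = "multi:softprob", num_class = 12,
                   eval_metric = "merror")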
