Random Forest
In general, the performance of random forest is similar to boosting, but it is much simpler to train and tune. As a result of this ease of use and strong out-of-the-box performance, random forests are a very popular machine learning algorithm and are implemented in a variety of packages, including the widely used "randomForest" package, which we will use throughout these notes.
The basic idea behind random forest is the same as bagging: both are ensembles of trees trained on bootstrapped samples of the training data.
However, the random forest algorithm makes a small tweak to the way the decision trees are built that leads to better performance. The key difference is that when training the trees that make up the ensemble, we add a bit of extra randomness to the model, hence the name random forest. At each split in a tree, rather than considering all features (input variables) for the split, we sample a subset of the features and consider only those few variables as candidates for the split.
This technique of sampling variables from the feature space is also known as feature bagging or the random subspace method.
At first glance, this may sound counterintuitive: we have fewer variables to choose from, which means there is less information available to the model, so how can that lead to better performance? Adding this extra bit of randomness produces a collection of trees that are further decorrelated (more different) from one another.
So random forests improve upon bagging by reducing the correlation between the trees in the ensemble.
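To make the contrast concrete, here is a small illustrative sketch on the built-in iris data (not the dataset used later in these notes): bagging is simply a random forest whose mtry equals the total number of predictors, while the random forest default samples only a few of them at each split.
library(randomForest)
set.seed(42)
p <- ncol(iris) - 1                                             # 4 predictor variables
bag_model <- randomForest(Species ~ ., data = iris, mtry = p)   # bagging: all predictors considered at each split
rf_model  <- randomForest(Species ~ ., data = iris)             # random forest: default mtry = floor(sqrt(4)) = 2
bag_model$err.rate[nrow(bag_model$err.rate), "OOB"]             # final OOB error, bagging
rf_model$err.rate[nrow(rf_model$err.rate), "OOB"]               # final OOB error, random forest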
To run the random forest algorithm, we will use the "randomForest" function from the "randomForest" package.
Let's take a look at the randomForest function in R.
library(randomForest)
?randomForest
The syntax for training a random forest model using the randomForest package follows standard conventions for modelling in R. The user specifies the input variables and the dependent variable using the familiar formula interface, and the training data is passed to the "data" argument. The number of trees is specified through the "ntree" argument and defaults to 500, which is usually a good place to start; we can always add more trees to improve the performance of the ensemble, and more trees almost always means better performance in a random forest. There are a handful of other model hyperparameters that are useful to know about, and we will go into greater detail about them later in these notes.
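As a minimal illustration of this interface (again on the built-in iris data rather than the dataset introduced below), a call might look like this:
library(randomForest)
set.seed(1)
# formula interface: predict Species from all other columns; grow 1000 trees instead of the default 500
toy_rf <- randomForest(formula = Species ~ ., data = iris, ntree = 1000)
print(toy_rf)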
Creating Data Partition
We will use the "caret" package to partition the data into a training set and a test set.
# we are using the metabolomics dataset
data1 <- read.csv("m-metabolomics.csv", header = T, sep = ",")

set.seed(3033)

# creating the data partition
library(caret)
intrain <- createDataPartition(y = data1$Group, p = 0.7, list = FALSE)
training <- data1[intrain, ]
testing <- data1[-intrain, ]

dim(training)
## [1] 78 101
dim(testing)
## [1] 32 101
Train a Random Forest Model
Here we will use the randomForest() function from the randomForest package to train a random forest classifier to predict Alzheimer's disease from the metabolomics dataset.
set.seed(12345)  # for reproducibility
library(randomForest)
AD_model <- randomForest(formula = Group ~ ., data = training)

# print the model output
print(AD_model)
## 
## Call:
##  randomForest(formula = Group ~ ., data = training) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 10
## 
##         OOB estimate of  error rate: 23.08%
## Confusion matrix:
##         AD Control class.error
## AD      31       8   0.2051282
## Control 10      29   0.2564103
We have just trained a model on the metabolomics dataset. Let's take a look at the output of the model.
Looking at the output, it presents:
The type of random forest (either classification or regression); for this model it is classification.
The number of trees used (the default is 500).
The number of variables tried at each split, which in random forest parlance is called "mtry". The default "mtry" is determined dynamically based on the number of input variables in the training set: this dataset has 100 covariates, and in classification forests the default "mtry" value is the square root of the number of features (a quick check of this default follows the list below).
We also see the "out-of-bag" or "OOB" estimate of the error rate for the model. This is the error rate computed across the samples that were not selected into the bootstrapped training sets (we will look at this in detail later).
Since this is a classification problem, we also see the confusion matrix based on the out-of-bag samples.
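As a quick check of the default mtry mentioned above (this dataset has 100 predictors):
# Default mtry for a classification forest: floor of the square root of the number of predictors
floor(sqrt(100))  # = 10, matching "No. of variables tried at each split" in the output above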
Since each tree in the random forest is trained on a bootstrapped sample of the original training set, some of the samples will be duplicated in a tree's training set and some will be absent. The absent samples are what we call "out-of-bag" samples.
One of the nice things about the random forest algorithm is that it provides us with a built-in validation set without any extra work.
Since the out-of-bag samples were not used to train a given tree, they can be used to evaluate that tree's performance on unseen data.
The classification error across all the out-of-bag samples is called the "out-of-bag error". The OOB error matrix is stored in the random forest model, and it has one row per tree in the forest.
# Look at the OOB error matrix
err <- AD_model$err.rate
head(err)
##            OOB         AD   Control
## [1,] 0.1481481 0.15384615 0.1428571
## [2,] 0.2666667 0.13636364 0.3913043
## [3,] 0.2142857 0.06896552 0.3703704
## [4,] 0.2656250 0.25000000 0.2812500
## [5,] 0.2535211 0.16666667 0.3428571
## [6,] 0.2837838 0.21052632 0.3611111
The i-th row reports the OOB error rate for all trees up to and including the i-th tree. The first column shows the error across all the classes, and the additional columns give the per-class OOB error.
If we look at the last row, we get the final out-of-bag error. This is the same value printed in the model output.
oob_err <- err[nrow(err), "OOB"]
print(oob_err)
##       OOB 
## 0.2307692
When you plot a random forest model object, it shows a plot of the OOB error rates as a function of the number of trees in the forest.
plot(AD_model)
legend(x = "right", legend = colnames(err), fill = 1:ncol(err))
This plot helps us decide how many trees to include in the ensemble. As we can see, with fewer than 50 trees the OOB error remains quite high, but it drops and starts to flatten out between 400 and 500 trees.
After a certain point, including more trees in the model will not buy us any additional performance.
There is nothing wrong with using "too many" trees; however, computing the predictions for each tree takes time, so we don't want to include more trees than we actually need.
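One simple way to check this is to look at where the cumulative OOB error from the err matrix above bottoms out (an illustrative check; the exact tree count will vary with the random seed):
# Number of trees at which the cumulative OOB error is lowest
best_ntree <- which.min(err[, "OOB"])
best_ntree
err[best_ntree, "OOB"]  # OOB error at that point, versus the full 500-tree forest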
Evaluate model performance on a test set
We will use the "confusionMatrix" function from the "caret" package to compute test set accuracy and generate a confusion matrix, and then compare the test set accuracy to the OOB accuracy.
# Generate class predictions for the test set using the AD_model object
class_prediction <- predict(object = AD_model,    # model object
                            newdata = testing,    # test dataset
                            type = "class")       # return classification labels

# Calculate the confusion matrix for the test set
cm <- confusionMatrix(data = class_prediction,    # predicted classes
                      reference = testing$Group)  # actual classes
print(cm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction AD Control
##    AD      10       5
##    Control  6      11
##                                           
##                Accuracy : 0.6562          
##                  95% CI : (0.4681, 0.8143)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.05509         
##                                           
##                   Kappa : 0.3125          
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.6250          
##             Specificity : 0.6875          
##          Pos Pred Value : 0.6667          
##          Neg Pred Value : 0.6471          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3125          
##    Detection Prevalence : 0.4688          
##       Balanced Accuracy : 0.6562          
##                                           
##        'Positive' Class : AD              
## 
We then compare the test set accuracy reported from the confusion matrix to the OOB accuracy.
# Compare test set accuracy to OOB accuracy
paste0("Test Accuracy: ", cm$overall[1])
## [1] "Test Accuracy: 0.65625"
paste0("OOB Accuracy: ", 1-oob_err)
## [1] "OOB Accuracy: 0.769230769230769"
OOB Error Versus Test Set Error
We have already discussed the out-of-bag error and its use in random forest as a quick, built-in snapshot of model performance. We might wonder how the OOB error compares to the test set error and which one we should use to evaluate the generalization error of our model. One of the nicest things about the random forest algorithm is that it provides a built-in validation set, so we don't need to sacrifice any of our training data for validation. Another advantage of the randomForest package in particular is that the OOB error computation is already built in, so we don't need to write any extra code to evaluate model performance.
There are, however, some disadvantages:
Although it is technically possible to compute other metrics such as AUC and log loss on the out-of-bag samples, this is not built into the randomForest package, so there is no ready-made way to do it.
The randomForest package does not keep track of which observations were part of the out-of-bag sample in each tree, so there is no easy way to calculate these metrics after the fact.
If you are comparing the performance of your random forest to another type of model, such as a GLM or a support vector machine, then you would want to score each of these models on the same validation set to compare performance (a small sketch of this kind of comparison follows this list).
Although the out-of-bag error rate can be used to compare several random forests, you won't be able to compare against any other type of model using the out-of-bag estimate.
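As a hedged sketch of the like-for-like comparison described above, we could fit a plain logistic regression (GLM) on the same training data and score both models on the same held-out test set. This assumes Group is a factor with levels "AD" and "Control", as in the forest above; with 100 predictors and only 78 training samples the GLM will be rank-deficient and will warn, so a penalized model would be a better real-world baseline.
# Fit a logistic regression on the same training data (illustrative only)
glm_model <- glm(Group ~ ., data = training, family = binomial)
glm_prob  <- predict(glm_model, newdata = testing, type = "response")  # P(Group == "Control")
glm_class <- factor(ifelse(glm_prob > 0.5, "Control", "AD"),
                    levels = levels(training$Group))
rf_class  <- predict(AD_model, newdata = testing, type = "class")

# Test set accuracy of each model, computed on the same held-out observations
mean(glm_class == testing$Group)
mean(rf_class == testing$Group)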
Evaluate test set AUC
We will now compute the test set AUC for the random forest model, using the auc() function (here assumed to come from the "Metrics" package).
# Generate predicted probabilities on the test set
pred <- predict(object = AD_model,
                newdata = testing,
                type = "prob")

# "pred" is a matrix with one column of probabilities per class
class(pred)
head(pred)

# Compute AUC with auc() from the Metrics package
# (actual must be a binary 1/0 numeric vector)
library(Metrics)
auc(actual = ifelse(testing$Group == "AD", 1, 0),
    predicted = pred[, "AD"])
Tuning a Random Forest Model
Like any machine learning algorithm, the key to getting good performance is tuning the model hyperparameters. Random forest is one of the easier algorithms to tune because only a handful of hyperparameters have a big impact on the performance of the model.
Compared with tuning a support vector machine or a deep neural network, tuning a random forest is a walk in the park. This makes random forest a great machine learning algorithm for beginners: you get good performance out of the box, with little tuning and no expert knowledge required.
Next, we will review some of the most important hyperparameters for the random forest algorithm.
mtry = number of variables randomly sampled as candidates at each split.
sampsize = the number of samples to train each tree on. The default for sampsize is 63.2% of the number of training examples; this precise number is used because 63.2% is the expected proportion of unique observations in a bootstrapped sample (see the short check after this list).
nodesize = minimum size (number of samples) of the terminal nodes.
maxnodes = maximum number of terminal nodes.
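The 63.2% figure mentioned for sampsize can be checked directly: the expected proportion of unique observations in a bootstrap sample of size n is 1 - (1 - 1/n)^n, which approaches 1 - e^(-1), roughly 0.632, as n grows.
n <- nrow(training)   # 78 training examples in this dataset
1 - (1 - 1/n)^n       # expected proportion of unique observations, roughly 0.632
1 - exp(-1)           # limiting value as n grows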
nodesize and maxnodes are both parameters that control the complexity of the trees. When nodesize is small, deeper and more complex trees can be grown; maxnodes is another way to limit tree growth and avoid overfitting.
Keep in mind that each random forest implementation can use different names for these same parameters; these are the names that the randomForest package uses.
One of the most important model hyperparameters in the random forest algorithm is "mtry". At each split in a tree, we consider some number of predictor variables and, from this group, choose the variable that splits the data most purely. "mtry" is the number of predictor variables that we sample at each split. The randomForest package has a built-in function for tuning the "mtry" parameter called "tuneRF", which tunes the model based on OOB error.
# execute the tuning process
set.seed(1)
res <- tuneRF(x = subset(training, select = -Group),
              y = training$Group,
              ntreeTry = 500)
## mtry = 10  OOB error = 23.08% 
## Searching left ...
## mtry = 5   OOB error = 26.92% 
## -0.1666667 0.05 
## Searching right ...
## mtry = 20  OOB error = 24.36% 
## -0.05555556 0.05
Rather than iterating over a preset list of mtry values, the tuneRF function starts with the default value of "mtry" and changes it by the factor specified in the "stepFactor" argument. The search stops when the OOB error no longer improves by the amount given in the "improve" argument. Keep in mind that the specialized "tuneRF" function is just one way to tune a random forest.
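For example, the same search with the step size and stopping criterion spelled out (these values are the package defaults, shown here only to illustrate the arguments):
set.seed(1)
res2 <- tuneRF(x = subset(training, select = -Group),
               y = training$Group,
               ntreeTry = 500,
               stepFactor = 2,    # multiply or divide mtry by this factor at each step
               improve = 0.05)    # stop when the relative OOB improvement falls below this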
Now that we know how to tune a random forest with the "tuneRF" function, we will also look at a manual grid search from scratch. A manual grid search gives us more control over the search space, lets us evaluate the random forest using metrics other than OOB error, and allows us to include other model hyperparameters in the grid, such as nodesize and sampsize.
Tuning a Random Forest via mtry
Here we use randomForest::tuneRF() to tune "mtry" by training several models. This function is a specialized utility for tuning "mtry" based on OOB error; the more general approach, a manual grid search over several hyperparameters, follows in the next section.
# Execute the tuning process
set.seed(12345)
res <- tuneRF(x = subset(training, select = -Group),
              y = training$Group,
              ntreeTry = 500)
## mtry = 10  OOB error = 23.08% 
## Searching left ...
## mtry = 5   OOB error = 24.36% 
## -0.05555556 0.05 
## Searching right ...
## mtry = 20  OOB error = 21.79% 
## 0.05555556 0.05 
## mtry = 40  OOB error = 21.79% 
## 0 0.05
# look at the results
print(res)
##        mtry  OOBError
## 5.OOB     5 0.2435897
## 10.OOB   10 0.2307692
## 20.OOB   20 0.2179487
## 40.OOB   40 0.2179487
# Finding the mtry value that minimizes OOB error
mtry_opt <- res[, "mtry"][which.min(res[, "OOBError"])]
print(mtry_opt)
## 20.OOB 
##     20
# If you just want to return the best RF model (rather than the results matrix),
# you can set `doBest = TRUE` in `tuneRF()` to return the best RF model
# instead of the performance matrix.
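For example (a small sketch; the object returned is an ordinary randomForest fit built with the best mtry found):
set.seed(12345)
best_rf <- tuneRF(x = subset(training, select = -Group),
                  y = training$Group,
                  ntreeTry = 500,
                  doBest = TRUE)   # refit and return the forest at the best mtry
print(best_rf)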
Tuning a Random Forest via a Manual Grid Search
We will create a manual grid of hyperparameters using the expand.grid() function and write code that trains and evaluates a model for each row of the grid in a loop. The grid will contain mtry, nodesize and sampsize values, and we will identify the "best model" as the model from our grid that minimizes OOB error.
Keep in mind that there are other ways to select a best model from a grid, such as choosing the model with the best validation AUC. However, here we will use the built-in OOB error calculations instead of a separate validation set.
# Establish a list of possible values for mtry, nodesize and sampsize
mtry <- seq(4, ncol(training) * 0.8, 2)
nodesize <- seq(3, 8, 2)
sampsize <- nrow(training) * c(0.7, 0.8)
# Create a data frame containing all combinations
hyper_grid <- expand.grid(mtry = mtry, nodesize = nodesize, sampsize = sampsize)

# Create an empty vector to store OOB error values
oob_err <- c()
# Loop over the rows of hyper_grid to train the grid of models
for (i in 1:nrow(hyper_grid)) {

  # train a random forest model
  model <- randomForest(formula = Group ~ .,
                        data = training,
                        mtry = hyper_grid$mtry[i],
                        nodesize = hyper_grid$nodesize[i],
                        sampsize = hyper_grid$sampsize[i])

  # store OOB error for the model
  oob_err[i] <- model$err.rate[nrow(model$err.rate), "OOB"]
}

# Identify the optimal set of hyperparameters based on OOB error
opt_i <- which.min(oob_err)
print(hyper_grid[opt_i, ])
##   mtry nodesize sampsize
## 1    4        3     54.6
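As a hedged follow-up (not part of the original output), we can refit the forest with the winning hyperparameters from the grid and check its accuracy on the held-out test set.
set.seed(12345)
# Refit using the best combination found by the grid search
best_model <- randomForest(formula = Group ~ .,
                           data = training,
                           mtry = hyper_grid$mtry[opt_i],
                           nodesize = hyper_grid$nodesize[opt_i],
                           sampsize = hyper_grid$sampsize[opt_i])

# Score the refit model on the same test set used earlier
best_pred <- predict(best_model, newdata = testing, type = "class")
confusionMatrix(data = best_pred, reference = testing$Group)$overall["Accuracy"]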