First, we load the required libraries and read our data into data frames. The following data (and most of the code) come from kaggle's introductory "Digit Recognizer" optical character recognition problem.
library(randomForest)
trainData <- read.csv("data/digits/train.csv")
trainData$label <- as.factor(trainData$label)
testData <- read.csv("data/digits/test.csv")
Let's verify that the data look properly loaded, and inspect the format.
dim(trainData); dim(testData); str(trainData[, 1:6])
'data.frame':	42000 obs. of  6 variables:
 $ label : Factor w/ 10 levels "0","1","2","3",..: 2 1 2 5 1 1 8 4 6 4 ...
 $ pixel0: int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel1: int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel2: int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel3: int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel4: int  0 0 0 0 0 0 0 0 0 0 ...
We see that the first column (label) is the actual digit represented by an image, and the remaining 784 columns are greyscale values for each pixel in the 28x28 images.
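As a quick sanity check (a sketch only; the flipping step is just so the digit appears right side up), we can reshape the first row back into its 28x28 grid and draw it:
digit <- matrix(unlist(trainData[1, -1]), nrow = 28, byrow = TRUE)  # drop the label, rebuild the pixel grid
image(t(apply(digit, 2, rev)), col = grey.colors(256), axes = FALSE)  # orient so row 1 sits at the top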
After setting the seed to keep our results repeatable, we select a random subsample of our data and separate the predictors from the outcome.
set.seed(0)
numTrain <- 10000
rows <- sample(1:nrow(trainData), numTrain)
labels <- as.factor(trainData[rows,1])
subtrain <- trainData[rows,-1]
Finally, we build a random forest model on the subset of the train data set, computing the predicted outcomes of the test set along the way. The actual randomForest function is called with four arguments, though only the first is necessary.
randomForest(x, y=NULL, xtest=NULL, ntree=500)
There are over a dozen additional parameters you can pass, including mtry, the number of predictors randomly sampled as candidates at each split.
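For instance (a sketch only, not used in what follows; rf_wide is just an illustrative name), you could override that sampling size directly when calling randomForest:
rf_wide <- randomForest(subtrain, labels, ntree = 25, mtry = 56)  # consider 56 candidate predictors per split instead of the default (the square root of the predictor count)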
numTrees <- 25
rf <- randomForest(subtrain, labels, xtest=testData, ntree=numTrees)
predictions <- data.frame(
ImageId=1:nrow(testData),
Label=levels(labels)[rf$test$predicted]
)
head(predictions)
|   | ImageId | Label |
|---|---------|-------|
| 1 | 1       | 2     |
| 2 | 2       | 0     |
| 3 | 3       | 9     |
| 4 | 4       | 4     |
| 5 | 5       | 2     |
| 6 | 6       | 7     |
This went rather smoothly. But we partitioned the data by hand, made no attempt to tune the model's hyperparameters, and have no resampled estimate of how well the model should perform on new data. caret helps with all of these things and more.
Let's improve upon kaggle's example model by applying some of caret's functionality. We begin by loading the caret package. We will simultaneously load a parallel processing package doMC and tell it how many cores we're rocking (the Mac on which I wrote this has four cores with two threads each). For those packages that implement some form of parallelization, caret does not interfere. randomForest is definitely one of those packages.
See the caret documentation for additional information.
library(caret)
library(doMC)
registerDoMC(8)
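If you'd rather not hard-code the core count, you can ask R for it (a small aside, not part of the original benchmark code):
library(parallel)
registerDoMC(detectCores())  # register every core/thread the machine reports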
### createDataPartition

The first function we will want to learn is caret's data partitioning function. Here is the function call from the documentation:
createDataPartition(
y,
times = 1,
p = 0.5,
list = TRUE,
groups = min(5, length(y))
)
This function draws times samples from your data vector y, each containing proportion p of the data. If your data are discrete, createDataPartition will automatically take a representative sample from each level as best it can; otherwise, you can use groups to help caret partition a continuous variable. The returned values are the chosen indices of y.
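As a small illustration (using a made-up, imbalanced factor rather than the digit data), here is how times, p, and list interact:
toy <- factor(rep(c("a", "b", "c"), times = c(50, 30, 20)))  # toy outcome with unequal classes
idx <- createDataPartition(toy, times = 2, p = 0.6, list = FALSE)
head(idx)                          # a matrix: one column of row indices per requested sample
prop.table(table(toy[idx[, 1]]))   # class proportions roughly match the full vector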
### train

This poorly-named function trains your model. Again from the slimmed-down docs:
train(
x,
y,
method = "rf",
...
)
This call returns a train object, which is basically a list. The contained model is built by applying the given method (in this case "rf" means random forest) to the predictors in the data frame x and the associated vector of outcomes y. As caret is really just a wrapper for the underlying packages that implement the various methods, we can pass additional arguments through the ellipsis as needed.
Let's have a look at an example. These lines are nearly identical to those from kaggle's "benchmark" code. A few things are different:

- We use createDataPartition to take our subsample rather than picking rows by hand.
- We ask randomForest to keep track of variable importance, which it does not do by default.

You can see that we pass list=FALSE to createDataPartition; as we only take one sample, we'd like to have our row numbers in a vector so that we can easily subset our data with it. We also use the formula interface to train rather than slicing the data frame via train(naiveData[, -1], naiveData$label, ...).
set.seed(0)
inTrain <- createDataPartition(trainData$label, p=1/40, list=FALSE)
naiveData <- trainData[inTrain, ]
naiveModel <- train(
label ~ .,
data = naiveData,
method="rf",
ntree=25,
importance=TRUE
)
### varImp

Since we've asked randomForest to keep track of importance, let's have a look at it. The varImp function computes importance on a scale from 0 to 100 by default (set scale=FALSE to return the raw scores instead).
varImp(naiveModel)
rf variable importance
variables are sorted by maximum importance across the classes
only 20 most important variables shown (out of 784)
0 1 2 3 4 5 6 7 8 9
pixel541 61.52 64.81 64.07 13.23 72.30 35.56 100.00 59.12 51.84 66.12
pixel378 82.61 98.46 84.90 88.62 77.01 58.12 61.93 82.00 74.71 68.00
pixel318 59.12 86.95 82.19 94.40 62.16 64.72 85.19 76.38 34.27 68.28
pixel515 59.12 69.80 73.71 76.49 48.60 66.84 92.61 45.11 53.24 78.41
pixel598 67.47 59.12 62.48 67.49 70.48 59.12 34.97 67.26 74.25 91.30
pixel430 63.26 72.91 57.07 42.38 89.54 64.20 70.00 79.75 59.12 74.07
pixel290 58.78 81.45 59.12 89.32 21.37 42.88 59.12 50.41 58.28 74.77
pixel292 72.61 62.05 86.97 63.86 51.44 40.57 52.25 37.19 50.69 63.69
pixel181 43.79 59.12 45.78 48.01 86.27 49.53 63.57 79.18 25.34 52.70
pixel571 67.50 40.24 68.22 51.51 70.20 21.37 86.12 52.71 71.37 56.82
pixel210 59.12 63.00 13.15 57.05 85.81 67.51 80.77 58.51 43.88 75.57
pixel239 59.12 66.86 28.99 13.31 73.31 24.20 76.92 70.71 85.81 43.44
pixel205 59.12 85.22 76.80 71.36 13.00 40.24 47.93 67.17 50.17 65.50
pixel345 52.69 70.57 59.12 85.15 81.31 67.91 71.98 61.08 41.78 64.93
pixel346 68.41 59.12 67.19 35.88 84.92 66.33 64.59 61.17 67.00 61.12
pixel183 41.52 59.12 59.12 59.12 72.69 84.44 21.37 72.09 59.12 62.63
pixel406 69.33 84.25 62.54 35.23 44.88 52.74 59.12 78.32 71.59 33.57
pixel403 59.12 84.25 64.01 60.60 22.15 73.38 59.25 61.14 40.91 69.38
pixel405 82.24 77.73 67.10 57.99 59.61 31.07 21.37 84.19 81.21 53.45
pixel408 67.90 40.24 38.61 67.83 67.94 84.18 47.64 40.24 21.37 62.63
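If the raw table is hard to scan, the varImp object also has a plot method (a small aside; top sets how many variables to show):
plot(varImp(naiveModel), top = 20)  # dotplot of the 20 most important pixels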
### featurePlot

A wrapper for various lattice plots. Once more, the call string from the documentation:
featurePlot(
x,
y,
plot = if(is.factor(y)) "strip" else "scatter",
...
)
As before, x holds the predictor data frame and y holds the outcome vector. plot is a string corresponding to the type of plot you want (e.g., "pairs"). ... implies that you can add additional arguments to be passed down to the lattice plot call.
featurePlot(
x = naiveData[, c(320, 380, 432, 543, 600, 1)],
y = naiveData$label,
plot = "pairs",
alpha = 1/20,
auto.key = list(columns = 10)
)
As an optional argument to pass to train, tuneGrid allows you to pass in various combinations of hyperparameters to your model in an effort to optimize them. The caret documentation has a nice example that demonstrates how to build a simple grid of hyperparameter combinations with expand.grid, save it as a named data frame, and pass that in as the tuneGrid argument.
If you want to know what hyperparameters a particular method takes, simply call the modelLookup function (e.g., modelLookup("rf")). It lists each hyperparameter by name and description, along with some indicators of its intended use. For additional details, you'll need to check the documentation of the underlying package.
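For example, looking up the random forest method (the exact columns printed may vary a little by caret version):
modelLookup("rf")
# expect a single row: parameter mtry ("#Randomly Selected Predictors"),
# flagged as usable for both regression and classification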
Note: You must name your tuning grid! caret will get angry if you try to pass in a call to expand.grid.
For randomForest (method="rf"), there is only one hyperparameter: mtry. This tells randomForest how many of the predictors to try and split on at each node. By default, randomForest takes a random sample of the square root of the total and tries to split on those. In our case, 28 x 28 = 784 pixels means the default is 28 pixels chosen at each split. But what if that isn't best?
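Just to confirm that arithmetic in code (a throwaway check, not part of the benchmark):
floor(sqrt(ncol(trainData) - 1))  # 784 predictors once the label is dropped, so the default mtry is 28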
set.seed(12345)
inTrain <- createDataPartition(trainData$label, p=0.5, list=FALSE)
fitGrid <- expand.grid(
mtry = (1:8) * 10 - 2
)
rfModel <- train(
label ~ .,
data = trainData[inTrain, ],
method="rf",
tuneGrid=fitGrid,
ntree=25
)
If you ask R to print the train object, it outputs a nice summary that includes (within reason) a list of the parameter combinations and the resulting 'quality' metrics (these can be changed).
print(rfModel)
Random Forest 

21003 samples
  784 predictor
   10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 21003, 21003, 21003, 21003, 21003, 21003, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
   8    0.9358967  0.9287435  0.003123861  0.003469168
  18    0.9424117  0.9359874  0.002589643  0.002877086
  28    0.9445155  0.9383264  0.002878136  0.003198734
  38    0.9450507  0.9389212  0.002622172  0.002913991
  48    0.9453394  0.9392426  0.002658911  0.002953523
  58    0.9460275  0.9400069  0.002009262  0.002232896
  68    0.9458082  0.9397635  0.002404315  0.002670009
  78    0.9459638  0.9399367  0.003060001  0.003401450

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 58.
Similarly, if you plot a train object, you get a graph of your metric against your hyperparameter(s).
plot(rfModel)
Do you like ggplot2? So does caret!
ggplot(rfModel)
In this final section, we'd like to look at a few tools that can help validate and analyze your models. For a classification task, the most obvious such tool is the confusion matrix. Perhaps unsurprisingly, caret has an aptly-named helper function.
caret's confusionMatrix function comes in two forms, the first of which applies to a train object. Assuming that the outcomes of the method call were explicitly discrete, calling confusionMatrix(myModel) returns a simple table showing how frequently each level was guessed correctly or confused for a different level. This is simply a finer level of detail on the accuracy score we've already seen by printing the train object directly.
confusionMatrix(rfModel)
Bootstrapped (25 reps) Confusion Matrix
(entries are percentages of table totals)
Reference
Prediction 0 1 2 3 4 5 6 7 8 9
0 9.6 0.0 0.1 0.0 0.0 0.1 0.1 0.0 0.0 0.1
1 0.0 10.9 0.0 0.0 0.0 0.0 0.0 0.1 0.1 0.0
2 0.0 0.1 9.4 0.2 0.0 0.0 0.0 0.1 0.1 0.0
3 0.0 0.0 0.1 9.5 0.0 0.2 0.0 0.0 0.2 0.1
4 0.0 0.0 0.1 0.0 9.2 0.0 0.0 0.1 0.1 0.2
5 0.0 0.0 0.0 0.2 0.0 8.4 0.1 0.0 0.1 0.1
6 0.0 0.0 0.1 0.0 0.1 0.1 9.7 0.0 0.1 0.0
7 0.0 0.0 0.1 0.1 0.0 0.0 0.0 9.9 0.0 0.1
8 0.1 0.0 0.1 0.1 0.1 0.1 0.0 0.0 8.9 0.1
9 0.0 0.0 0.0 0.1 0.3 0.1 0.0 0.2 0.2 9.2
What if we want to know how the model performs on data it wasn't trained on? We'll need to apply it to other data we've been holding back (the point of createDataPartition), and compare that to truth values for those data. Once again to the docs:
predict(
object,
newdata = NULL,
...
)
Here, object is the train object we're using to predict, and newdata is a data frame containing the withheld data. As with other caret functions, this is essentially a wrapper for the prediction functions of the various packages, so additional arguments are occasionally necessary and can be passed through the ellipsis.
rfValidData <- predict(rfModel, trainData[-inTrain, ])
Now that we have used our model to predict the outcomes for new data, we'll want to compare them to the known truth values. This is the other (perhaps more useful) version of confusionMatrix. As usual, the docs:
confusionMatrix(
data,
reference,
...
)
Here, data is a vector of newly predicted values and reference is the vector of truth values.
confusionMatrix(rfValidData, trainData[-inTrain, "label"])
Confusion Matrix and Statistics
Reference
Prediction 0 1 2 3 4 5 6 7 8 9
0 2019 0 10 8 2 5 16 2 5 9
1 0 2298 5 1 5 3 2 14 23 6
2 4 14 1985 31 2 2 4 31 14 8
3 5 7 16 2011 0 54 1 7 28 34
4 4 3 10 3 1956 4 10 10 13 45
5 10 3 3 53 3 1789 15 2 16 11
6 12 3 12 4 4 14 2002 0 10 3
7 3 7 26 16 2 2 1 2098 6 20
8 8 6 15 33 9 13 17 8 1891 22
9 1 1 6 15 53 11 0 28 25 1936
Overall Statistics
Accuracy : 0.9518
95% CI : (0.9488, 0.9547)
No Information Rate : 0.1115
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9464
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
Sensitivity 0.97725 0.9812 0.95067 0.92460 0.96071 0.94307
Specificity 0.99699 0.9968 0.99418 0.99192 0.99462 0.99393
Pos Pred Value 0.97254 0.9750 0.94749 0.92973 0.95044 0.93911
Neg Pred Value 0.99752 0.9976 0.99455 0.99129 0.99578 0.99434
Prevalence 0.09840 0.1115 0.09944 0.10359 0.09697 0.09035
Detection Rate 0.09616 0.1094 0.09454 0.09578 0.09316 0.08520
Detection Prevalence 0.09887 0.1123 0.09978 0.10301 0.09801 0.09073
Balanced Accuracy 0.98712 0.9890 0.97243 0.95826 0.97766 0.96850
Class: 6 Class: 7 Class: 8 Class: 9
Sensitivity 0.96809 0.95364 0.93107 0.92455
Specificity 0.99672 0.99558 0.99309 0.99259
Pos Pred Value 0.96996 0.96194 0.93521 0.93256
Neg Pred Value 0.99651 0.99458 0.99262 0.99165
Prevalence 0.09849 0.10478 0.09673 0.09973
Detection Rate 0.09535 0.09992 0.09006 0.09220
Detection Prevalence 0.09830 0.10387 0.09630 0.09887
Balanced Accuracy 0.98240 0.97461 0.96208 0.95857
This is going rather well, but we have not yet considered how the model validates itself as it trains. By default, caret uses bootstrap resampling; this can be changed, however. Much like tuneGrid, there is a trControl argument to the train function that takes the output of the trainControl function (as documented here):
trainControl(
method = "boot",
number = ifelse(method %in% c("cv", "repeatedcv"), 10, 25),
repeats = ifelse(method %in% c("cv", "repeatedcv"), 1, number),
...
)
You can set method to be a string like "repeatedcv" to change the resampling method, and pass additional parameters that suit your method. I've mentioned the ones that have to do with repeated cross-validation, but there are many others in the docs if you are interested.
set.seed(2967)
inTrain <- createDataPartition(trainData$label, p = 0.5, list = FALSE)
fitControl <- trainControl(
method = "repeatedcv",
number = 5,
repeats = 3
)
# hyperparameters must be passed through the tuneGrid argument, even if constant
fitGrid <- expand.grid(mtry = 58)
finalModel <- train(
label ~ .,
data = trainData[inTrain, ],
method = "rf",
trControl = fitControl,
tuneGrid = fitGrid
)
How did we do this time?
print(finalModel)
Random Forest 

21003 samples
  784 predictor
   10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 16803, 16803, 16802, 16803, 16801, 16802, ... 
Resampling results

  Accuracy   Kappa      Accuracy SD  Kappa SD   
  0.9585135  0.9538893  0.003605268  0.004006638

Tuning parameter 'mtry' was held constant at a value of 58
confusionMatrix(finalModel)
Cross-Validated (5 fold, repeated 3 times) Confusion Matrix
(entries are percentages of table totals)
Reference
Prediction 0 1 2 3 4 5 6 7 8 9
0 9.7 0.0 0.1 0.0 0.0 0.1 0.1 0.0 0.0 0.1
1 0.0 10.9 0.0 0.0 0.0 0.0 0.0 0.1 0.1 0.0
2 0.0 0.1 9.5 0.2 0.0 0.0 0.0 0.1 0.0 0.0
3 0.0 0.0 0.1 9.7 0.0 0.1 0.0 0.0 0.1 0.2
4 0.0 0.0 0.1 0.0 9.3 0.0 0.0 0.0 0.1 0.2
5 0.0 0.0 0.0 0.2 0.0 8.6 0.1 0.0 0.1 0.0
6 0.0 0.0 0.1 0.0 0.1 0.1 9.6 0.0 0.0 0.0
7 0.0 0.0 0.1 0.1 0.0 0.0 0.0 10.0 0.0 0.1
8 0.1 0.0 0.1 0.1 0.0 0.1 0.0 0.0 9.1 0.1
9 0.0 0.0 0.0 0.1 0.2 0.0 0.0 0.1 0.1 9.3
Is that just overfitting? How about on the other, reserved half of the data?
validData <- predict(finalModel, trainData[-inTrain, ])
confusionMatrix(validData, trainData[-inTrain, "label"])
Confusion Matrix and Statistics
Reference
Prediction 0 1 2 3 4 5 6 7 8 9
0 2029 0 9 4 3 10 9 1 2 8
1 0 2309 5 8 3 1 3 7 19 5
2 2 13 2012 38 5 3 2 28 9 9
3 1 7 15 2035 1 20 0 3 16 28
4 3 4 9 3 1957 4 5 14 8 36
5 3 2 1 26 0 1818 16 0 15 9
6 15 1 6 3 12 15 2025 0 12 2
7 0 4 16 14 0 1 0 2110 2 21
8 12 2 13 31 5 10 8 7 1926 10
9 1 0 2 13 50 15 0 30 22 1966
Overall Statistics
Accuracy : 0.9614
95% CI : (0.9587, 0.964)
No Information Rate : 0.1115
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9571
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
Sensitivity 0.98209 0.9859 0.96360 0.93563 0.96120 0.95836
Specificity 0.99757 0.9973 0.99424 0.99517 0.99546 0.99623
Pos Pred Value 0.97783 0.9784 0.94861 0.95720 0.95791 0.96190
Neg Pred Value 0.99804 0.9982 0.99597 0.99258 0.99583 0.99587
Prevalence 0.09840 0.1115 0.09944 0.10359 0.09697 0.09035
Detection Rate 0.09663 0.1100 0.09582 0.09692 0.09320 0.08658
Detection Prevalence 0.09882 0.1124 0.10101 0.10125 0.09730 0.09001
Balanced Accuracy 0.98983 0.9916 0.97892 0.96540 0.97833 0.97729
Class: 6 Class: 7 Class: 8 Class: 9
Sensitivity 0.97921 0.9591 0.94830 0.93887
Specificity 0.99651 0.9969 0.99483 0.99296
Pos Pred Value 0.96844 0.9732 0.95158 0.93664
Neg Pred Value 0.99773 0.9952 0.99447 0.99323
Prevalence 0.09849 0.1048 0.09673 0.09973
Detection Rate 0.09644 0.1005 0.09173 0.09363
Detection Prevalence 0.09959 0.1033 0.09639 0.09997
Balanced Accuracy 0.98786 0.9780 0.97157 0.96592

There is plenty more that caret can handle:

- caret is capable of handling custom models.
- Rather than splitting with createDataPartition on the outcome, you can split on the predictors using (for example) maximum dissimilarity.
- caret can up- or down-sample so that underrepresented classes carry more weight in model training.
- caret can help pre-process your data, often from right inside the train function (the last two items are sketched below).
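Here is a hedged sketch of those last two features, reusing inTrain, fitControl, and fitGrid from the final model above; ppModel and balanced are just illustrative names, and the argument values are not tuned. The "nzv" filter drops the near-constant border pixels so that scaling is well defined.
# pre-process from inside train(): drop near-zero-variance pixels, then centre and scale the rest
ppModel <- train(
    label ~ .,
    data = trainData[inTrain, ],
    method = "rf",
    preProcess = c("nzv", "center", "scale"),
    trControl = fitControl,
    tuneGrid = fitGrid
)

# rebalance classes by down-sampling the larger ones before training
balanced <- downSample(
    x = trainData[inTrain, -1],
    y = trainData[inTrain, "label"],
    yname = "label"
)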