Data Modeling with caret

Classification And REgression Training

caret logo

Part 1: Machine Learning in R Without caret

Crying bunny

A simple machine learning example that does not take advantage of caret

In [1]:
library(randomForest)
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.

First, we load the required libraries and read our data into data frames. The following data (and most of the code) come from Kaggle's introductory "Digit Recognizer" optical character recognition problem.

In [2]:
trainData <- read.csv("data/digits/train.csv")
trainData$label <- as.factor(trainData$label)
testData <- read.csv("data/digits/test.csv")

Let's verify that the data look properly loaded, and inspect the format.

In [3]:
dim(trainData); dim(testData); str(trainData[, 1:6])
Out[3]:
  1. 42000
  2. 785
Out[3]:
  1. 28000
  2. 784
'data.frame':	42000 obs. of  6 variables:
 $ label : Factor w/ 10 levels "0","1","2","3",..: 2 1 2 5 1 1 8 4 6 4 ...
 $ pixel0: int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel1: int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel2: int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel3: int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel4: int  0 0 0 0 0 0 0 0 0 0 ...

We see that the first column (label) is the actual digit represented by an image, and the remaining 784 columns are greyscale values for each pixel in the 28x28 images.
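To make that layout concrete, here is the reshaping written out. This is a minimal sketch using a synthetic pixel vector; with the real data, as.numeric(trainData[1, -1]) would take its place.

```r
# A synthetic stand-in for one image row: 784 greyscale values
pixels <- seq_len(28 * 28)

# Rebuild the 28x28 grid; byrow = TRUE because the columns run
# pixel0, pixel1, ... in row-major order
img <- matrix(pixels, nrow = 28, ncol = 28, byrow = TRUE)

dim(img)      # 28 28
img[2, 1]     # 29: the first pixel of the second row
```

With the real data, image(img) would then display the digit.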

After setting the seed to keep our results reproducible, we select a random subsample of our data and separate the predictors from the outcome.

In [4]:
set.seed(0)
numTrain <- 10000
rows <- sample(1:nrow(trainData), numTrain)
labels <- as.factor(trainData[rows,1])
subtrain <- trainData[rows,-1]

Finally, we build a random forest model on the subset of the train data set, computing the predicted outcomes of the test set along the way. The actual randomForest function is called with four arguments, though only the first is necessary.

randomForest(x, y=NULL, xtest=NULL, ntree=500)

There are over a dozen additional optional parameters, including mtry, the number of predictors to randomly sample as candidates at each split.

In [5]:
numTrees <- 25
rf <- randomForest(subtrain, labels, xtest=testData, ntree=numTrees)
predictions <- data.frame(
    ImageId=1:nrow(testData),
    Label=levels(labels)[rf$test$predicted]
)
head(predictions)
Out[5]:
  ImageId Label
1       1     2
2       2     0
3       3     9
4       4     4
5       5     2
6       6     7

This went rather smoothly. But:

  • What if I want to reserve my own data set for validation before predicting on the test set?
  • What if I want further details on factor selection done by the model?
  • What if I simply want to try a different model?

caret helps with all of these things and more.

Part 2: Data and Model Exploration in caret

Carrot selection

Variable importance and parameter tuning

In [6]:
library(caret)
library(doMC)
registerDoMC(8)
library(RColorBrewer)
Loading required package: lattice
Loading required package: ggplot2
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel

Let's improve upon Kaggle's example model by applying some of caret's functionality. We begin by loading the caret package, along with the parallel processing package doMC, telling it how many cores we're rocking (the Mac on which I wrote this has four cores with two threads each). caret does not interfere with packages that implement their own parallelization, and randomForest is definitely one of those packages.

See the caret documentation for additional information.

createDataPartition

The first function we will want to learn is caret's data partitioning function. Here is the function call from the documentation:

createDataPartition(
    y, 
    times = 1,
    p = 0.5,
    list = TRUE,
    groups = min(5, length(y))
)

This function draws times samples from your data vector y, each of proportion p. If your data are discrete, createDataPartition will automatically take a representative sample from each level as best it can; otherwise, you can use groups to help caret bin and stratify a continuous variable.

The values returned are the chosen indices into y.
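For a quick, self-contained illustration of the stratified sampling (using the built-in iris data rather than the digits, so it runs on its own):

```r
library(caret)

set.seed(1)
# One 60% partition of a factor outcome, returned as an index vector
idx <- createDataPartition(iris$Species, p = 0.6, list = FALSE)

nrow(idx)                   # 90: 60% of 150 rows
table(iris$Species[idx])    # 30 rows drawn from each of the 3 species
```

Because iris has exactly 50 rows per species, each level contributes the same number of rows to the partition.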

train

This poorly-named function trains your model. Again, from the slimmed-down docs:

train(
    x,
    y, 
    method = "rf",
    ... 
)

This call returns a train object, which is basically a list. The model contained is built by applying the given method (here "rf" means random forest) to the predictors in the data frame x and the associated vector of outcomes y. As caret is really just a wrapper around the underlying packages that implement the various methods, we can pass additional arguments through the ellipsis as needed.

Let's have a look at an example. These lines are nearly identical to those from Kaggle's "benchmark" code. A few things are different:

  • I want to plot soon, so I reduced from a sample of 10,000 to one of about 1,000
  • I asked randomForest to keep track of variable importance, which it does not do by default

You can see that we pass list=FALSE to createDataPartition; since we only have one sample, we'd like our row numbers in a vector so that we can easily subset our data with it. We also used the formula interface of the train function rather than slicing the data frame via train(naiveData[, -1], naiveData$label, ...).

In [8]:
set.seed(0)
inTrain <- createDataPartition(trainData$label, p=1/40, list=FALSE)
naiveData <- trainData[inTrain, ]
naiveModel <- train(
    label ~ .,
    data = naiveData,
    method="rf",
    ntree=25,
    importance=TRUE
)

varImp

Since we've asked randomForest to keep track of importance, let's have a look at it. The varImp function computes importance on a scale from 0 to 100 by default; set scale=FALSE to return the raw score instead.

In [9]:
varImp(naiveModel)
Out[9]:
rf variable importance

  variables are sorted by maximum importance across the classes
  only 20 most important variables shown (out of 784)

             0     1     2     3     4     5      6     7     8     9
pixel541 61.52 64.81 64.07 13.23 72.30 35.56 100.00 59.12 51.84 66.12
pixel378 82.61 98.46 84.90 88.62 77.01 58.12  61.93 82.00 74.71 68.00
pixel318 59.12 86.95 82.19 94.40 62.16 64.72  85.19 76.38 34.27 68.28
pixel515 59.12 69.80 73.71 76.49 48.60 66.84  92.61 45.11 53.24 78.41
pixel598 67.47 59.12 62.48 67.49 70.48 59.12  34.97 67.26 74.25 91.30
pixel430 63.26 72.91 57.07 42.38 89.54 64.20  70.00 79.75 59.12 74.07
pixel290 58.78 81.45 59.12 89.32 21.37 42.88  59.12 50.41 58.28 74.77
pixel292 72.61 62.05 86.97 63.86 51.44 40.57  52.25 37.19 50.69 63.69
pixel181 43.79 59.12 45.78 48.01 86.27 49.53  63.57 79.18 25.34 52.70
pixel571 67.50 40.24 68.22 51.51 70.20 21.37  86.12 52.71 71.37 56.82
pixel210 59.12 63.00 13.15 57.05 85.81 67.51  80.77 58.51 43.88 75.57
pixel239 59.12 66.86 28.99 13.31 73.31 24.20  76.92 70.71 85.81 43.44
pixel205 59.12 85.22 76.80 71.36 13.00 40.24  47.93 67.17 50.17 65.50
pixel345 52.69 70.57 59.12 85.15 81.31 67.91  71.98 61.08 41.78 64.93
pixel346 68.41 59.12 67.19 35.88 84.92 66.33  64.59 61.17 67.00 61.12
pixel183 41.52 59.12 59.12 59.12 72.69 84.44  21.37 72.09 59.12 62.63
pixel406 69.33 84.25 62.54 35.23 44.88 52.74  59.12 78.32 71.59 33.57
pixel403 59.12 84.25 64.01 60.60 22.15 73.38  59.25 61.14 40.91 69.38
pixel405 82.24 77.73 67.10 57.99 59.61 31.07  21.37 84.19 81.21 53.45
pixel408 67.90 40.24 38.61 67.83 67.94 84.18  47.64 40.24 21.37 62.63

featurePlot

A wrapper for various lattice plots. Once more, the call string from documentation:

featurePlot(
    x,
    y, 
    plot = if(is.factor(y)) "strip" else "scatter",
    ...
)

As before, x holds the predictor data frame and y the outcome vector. plot is a string naming the type of plot you want (e.g., "pairs"), and the ellipsis passes any additional arguments down to the underlying lattice call.

In [26]:
featurePlot(
    x = naiveData[, c(320, 380, 432, 543, 600, 1)],
    y = naiveData$label,
    plot = "pairs",
    alpha = 1/20,
    auto.key = list(columns = 10)
)

train(tuneGrid)

As an optional argument to train, tuneGrid lets you pass various combinations of hyperparameters to your model in an effort to optimize them. The caret documentation has a nice example that demonstrates how to build a simple grid of hyperparameter combinations, save it as a named object, and pass that in as the tuneGrid argument.

If you want to know what hyperparameters a particular method takes, simply call the modelLookup function (e.g., modelLookup("rf")). The return value lists each hyperparameter by name and description, with some indicators of its intended use. For additional details, you'll need to check the documentation of the underlying package.
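For instance (the exact label text may vary between caret versions):

```r
library(caret)

# Show the tunable hyperparameters for the random forest method;
# for "rf" this is a single row describing mtry, flagged as usable
# for both regression and classification
modelLookup("rf")
```

The result is an ordinary data frame, so you can inspect it programmatically before building a tuning grid.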

Note: You must name your tuning grid! caret will get angry if you try to pass in a call to expand.grid.

For randomForest (method="rf"), there is only one tunable hyperparameter: mtry, the number of predictors randomly sampled as split candidates at each node. By default, randomForest uses the square root of the total number of predictors; in our case, 28 x 28 = 784 pixels means 28 candidate pixels at each split. But what if that isn't best?

In [5]:
set.seed(12345)
inTrain <- createDataPartition(trainData$label, p=0.5, list=FALSE)
fitGrid <- expand.grid(
    mtry = (1:8) * 10 - 2  # candidate values 8, 18, ..., 78, bracketing the default of 28
)
rfModel <- train(
    label ~ .,
    data = trainData[inTrain, ],
    method="rf",
    tuneGrid=fitGrid,
    ntree=25
)

If you ask R to print the train object, it outputs a nice summary that includes (within reason) a list of the parameter combinations and the resulting 'quality' metrics (these can be changed).

In [6]:
print(rfModel)
Random Forest 

21003 samples
  784 predictor
   10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 21003, 21003, 21003, 21003, 21003, 21003, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
   8    0.9358967  0.9287435  0.003123861  0.003469168
  18    0.9424117  0.9359874  0.002589643  0.002877086
  28    0.9445155  0.9383264  0.002878136  0.003198734
  38    0.9450507  0.9389212  0.002622172  0.002913991
  48    0.9453394  0.9392426  0.002658911  0.002953523
  58    0.9460275  0.9400069  0.002009262  0.002232896
  68    0.9458082  0.9397635  0.002404315  0.002670009
  78    0.9459638  0.9399367  0.003060001  0.003401450

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 58. 

Similarly, if you plot a train object, you get a graph of your metric against your hyperparameter(s).

In [7]:
plot(rfModel)

Do you like ggplot2? So does caret!

In [8]:
ggplot(rfModel)

Part 3: Model Validation

carrot show

Tuning and performance

In this final section, we'd like to look at a few tools that can help validate and analyze your models. For a classification task, the most obvious such tool is the confusion matrix. Perhaps unsurprisingly, caret has an aptly-named helper function.

confusionMatrix.train

caret's confusionMatrix function comes in two variants, and the first applies to a train object. Assuming that the outcomes of the method call were explicitly discrete, calling confusionMatrix(myModel) will return a simple table that shows how frequently each level was predicted correctly or confused for a different level. This is a finer level of detail on the accuracy score we've already been shown by printing the train object directly.

In [9]:
confusionMatrix(rfModel)
Out[9]:
Bootstrapped (25 reps) Confusion Matrix 

(entries are percentages of table totals)
 
          Reference
Prediction    0    1    2    3    4    5    6    7    8    9
         0  9.6  0.0  0.1  0.0  0.0  0.1  0.1  0.0  0.0  0.1
         1  0.0 10.9  0.0  0.0  0.0  0.0  0.0  0.1  0.1  0.0
         2  0.0  0.1  9.4  0.2  0.0  0.0  0.0  0.1  0.1  0.0
         3  0.0  0.0  0.1  9.5  0.0  0.2  0.0  0.0  0.2  0.1
         4  0.0  0.0  0.1  0.0  9.2  0.0  0.0  0.1  0.1  0.2
         5  0.0  0.0  0.0  0.2  0.0  8.4  0.1  0.0  0.1  0.1
         6  0.0  0.0  0.1  0.0  0.1  0.1  9.7  0.0  0.1  0.0
         7  0.0  0.0  0.1  0.1  0.0  0.0  0.0  9.9  0.0  0.1
         8  0.1  0.0  0.1  0.1  0.1  0.1  0.0  0.0  8.9  0.1
         9  0.0  0.0  0.0  0.1  0.3  0.1  0.0  0.2  0.2  9.2

predict

What if we want to know how the model performs on data it wasn't trained on? We'll need to apply it to other data we've been holding back (the point of createDataPartition), and compare that to truth values for those data. Once again to the docs:

predict(
    object,
    newdata = NULL,
    ...
)

Here, object is the train object we're using to predict, and newdata is a data frame containing the withheld data. As with other caret functions, this is essentially a wrapper for the prediction functions of the various underlying packages, so additional arguments are occasionally necessary and can be passed through the ellipsis.

In [10]:
rfValidData <- predict(rfModel, trainData[-inTrain, ])

confusionMatrix

Now that we have used our model to predict the outcomes for new data, we'll want to compare that to the known truth values. This is the other (perhaps more useful) version of confusionMatrix. As usual, the docs:

confusionMatrix(
    data,
    reference,
    ...
)

Here, data is a vector of newly predicted values and reference is the corresponding vector of truth values.

In [12]:
confusionMatrix(rfValidData, trainData[-inTrain, "label"])
Out[12]:
Confusion Matrix and Statistics

          Reference
Prediction    0    1    2    3    4    5    6    7    8    9
         0 2019    0   10    8    2    5   16    2    5    9
         1    0 2298    5    1    5    3    2   14   23    6
         2    4   14 1985   31    2    2    4   31   14    8
         3    5    7   16 2011    0   54    1    7   28   34
         4    4    3   10    3 1956    4   10   10   13   45
         5   10    3    3   53    3 1789   15    2   16   11
         6   12    3   12    4    4   14 2002    0   10    3
         7    3    7   26   16    2    2    1 2098    6   20
         8    8    6   15   33    9   13   17    8 1891   22
         9    1    1    6   15   53   11    0   28   25 1936

Overall Statistics
                                          
               Accuracy : 0.9518          
                 95% CI : (0.9488, 0.9547)
    No Information Rate : 0.1115          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9464          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
Sensitivity           0.97725   0.9812  0.95067  0.92460  0.96071  0.94307
Specificity           0.99699   0.9968  0.99418  0.99192  0.99462  0.99393
Pos Pred Value        0.97254   0.9750  0.94749  0.92973  0.95044  0.93911
Neg Pred Value        0.99752   0.9976  0.99455  0.99129  0.99578  0.99434
Prevalence            0.09840   0.1115  0.09944  0.10359  0.09697  0.09035
Detection Rate        0.09616   0.1094  0.09454  0.09578  0.09316  0.08520
Detection Prevalence  0.09887   0.1123  0.09978  0.10301  0.09801  0.09073
Balanced Accuracy     0.98712   0.9890  0.97243  0.95826  0.97766  0.96850
                     Class: 6 Class: 7 Class: 8 Class: 9
Sensitivity           0.96809  0.95364  0.93107  0.92455
Specificity           0.99672  0.99558  0.99309  0.99259
Pos Pred Value        0.96996  0.96194  0.93521  0.93256
Neg Pred Value        0.99651  0.99458  0.99262  0.99165
Prevalence            0.09849  0.10478  0.09673  0.09973
Detection Rate        0.09535  0.09992  0.09006  0.09220
Detection Prevalence  0.09830  0.10387  0.09630  0.09887
Balanced Accuracy     0.98240  0.97461  0.96208  0.95857

trainControl

This is going rather well, but we have not yet considered how the model validates itself as it trains. By default, caret uses bootstrap resampling; this can be changed, however. Much like tuneGrid, there is a trControl argument to the train function that takes the output of the trainControl function (as documented here):

trainControl(
    method = "boot", 
    number = ifelse(method %in% c("cv", "repeatedcv"), 10, 25),
    repeats = ifelse(method %in% c("cv", "repeatedcv"), 1, number),
    ...
)

You can set method to be a string like "repeatedcv" to change the resampling method, and pass additional parameters that suit your method. I've mentioned the ones that have to do with repeated cross-validation, but there are many others in the docs if you are interested.

In [12]:
set.seed(2967)
inTrain <- createDataPartition(trainData$label, p = 0.5, list = FALSE)
fitControl <- trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 3
)
# hyperparameters must be passed through the tuneGrid argument, even if constant
fitGrid <- expand.grid(mtry = 58)
finalModel <- train(
    label ~ .,
    data = trainData[inTrain, ],
    method = "rf",
    trControl = fitControl,
    tuneGrid = fitGrid
)

How did we do this time?

In [13]:
print(finalModel)
Out[13]:
Random Forest 

21003 samples
  784 predictor
   10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 16803, 16803, 16802, 16803, 16801, 16802, ... 
Resampling results

  Accuracy   Kappa      Accuracy SD  Kappa SD   
  0.9585135  0.9538893  0.003605268  0.004006638

Tuning parameter 'mtry' was held constant at a value of 58
 
In [14]:
confusionMatrix(finalModel)
Out[14]:
Cross-Validated (5 fold, repeated 3 times) Confusion Matrix 

(entries are percentages of table totals)
 
          Reference
Prediction    0    1    2    3    4    5    6    7    8    9
         0  9.7  0.0  0.1  0.0  0.0  0.1  0.1  0.0  0.0  0.1
         1  0.0 10.9  0.0  0.0  0.0  0.0  0.0  0.1  0.1  0.0
         2  0.0  0.1  9.5  0.2  0.0  0.0  0.0  0.1  0.0  0.0
         3  0.0  0.0  0.1  9.7  0.0  0.1  0.0  0.0  0.1  0.2
         4  0.0  0.0  0.1  0.0  9.3  0.0  0.0  0.0  0.1  0.2
         5  0.0  0.0  0.0  0.2  0.0  8.6  0.1  0.0  0.1  0.0
         6  0.0  0.0  0.1  0.0  0.1  0.1  9.6  0.0  0.0  0.0
         7  0.0  0.0  0.1  0.1  0.0  0.0  0.0 10.0  0.0  0.1
         8  0.1  0.0  0.1  0.1  0.0  0.1  0.0  0.0  9.1  0.1
         9  0.0  0.0  0.0  0.1  0.2  0.0  0.0  0.1  0.1  9.3

Is that just overfitting? How about on the other, reserved half of the data?

In [15]:
validData <- predict(finalModel, trainData[-inTrain, ])
confusionMatrix(validData, trainData[-inTrain, "label"])
Out[15]:
Confusion Matrix and Statistics

          Reference
Prediction    0    1    2    3    4    5    6    7    8    9
         0 2029    0    9    4    3   10    9    1    2    8
         1    0 2309    5    8    3    1    3    7   19    5
         2    2   13 2012   38    5    3    2   28    9    9
         3    1    7   15 2035    1   20    0    3   16   28
         4    3    4    9    3 1957    4    5   14    8   36
         5    3    2    1   26    0 1818   16    0   15    9
         6   15    1    6    3   12   15 2025    0   12    2
         7    0    4   16   14    0    1    0 2110    2   21
         8   12    2   13   31    5   10    8    7 1926   10
         9    1    0    2   13   50   15    0   30   22 1966

Overall Statistics
                                         
               Accuracy : 0.9614         
                 95% CI : (0.9587, 0.964)
    No Information Rate : 0.1115         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.9571         
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
Sensitivity           0.98209   0.9859  0.96360  0.93563  0.96120  0.95836
Specificity           0.99757   0.9973  0.99424  0.99517  0.99546  0.99623
Pos Pred Value        0.97783   0.9784  0.94861  0.95720  0.95791  0.96190
Neg Pred Value        0.99804   0.9982  0.99597  0.99258  0.99583  0.99587
Prevalence            0.09840   0.1115  0.09944  0.10359  0.09697  0.09035
Detection Rate        0.09663   0.1100  0.09582  0.09692  0.09320  0.08658
Detection Prevalence  0.09882   0.1124  0.10101  0.10125  0.09730  0.09001
Balanced Accuracy     0.98983   0.9916  0.97892  0.96540  0.97833  0.97729
                     Class: 6 Class: 7 Class: 8 Class: 9
Sensitivity           0.97921   0.9591  0.94830  0.93887
Specificity           0.99651   0.9969  0.99483  0.99296
Pos Pred Value        0.96844   0.9732  0.95158  0.93664
Neg Pred Value        0.99773   0.9952  0.99447  0.99323
Prevalence            0.09849   0.1048  0.09673  0.09973
Detection Rate        0.09644   0.1005  0.09173  0.09363
Detection Prevalence  0.09959   0.1033  0.09639  0.09997
Balanced Accuracy     0.98786   0.9780  0.97157  0.96592

Part 4: Other Features

carrot anatomy

Additional Models

  • Is a random forest not appropriate for your modeling task? There are over 200 other models caret can handle.
  • Don't see what you want? Well, you'll get no help from me, but caret is capable of handling custom models.

Additional additions

  • Instead of createDataPartition using the outcome, you can split on the predictors using (for example) maximum dissimilarity.
  • You can also address class imbalance by having caret up- or down-sample so that underrepresented classes carry more weight in model training.
  • caret can help pre-process your data, often from right inside the train function.
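As a small, self-contained sketch of that last point (again on iris, assuming only that caret is installed): preProcess estimates a transformation from the data, and predict applies it.

```r
library(caret)

# Estimate centering and scaling parameters from the predictors...
pp <- preProcess(iris[, 1:4], method = c("center", "scale"))

# ...and apply them (the same pp object can transform new data too)
scaled <- predict(pp, iris[, 1:4])

round(colMeans(scaled), 10)   # all 0 after centering
apply(scaled, 2, sd)          # all 1 after scaling
```

Inside train, the equivalent is simply passing preProcess = c("center", "scale"), and the transformation learned on the training folds is applied to the held-out folds automatically.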