# Predicting large and imbalanced data set using the R package tidymodels

## Introduction

The super easy way, at least for me, to deploy machine learning models is by making use of the R package **tidymodels**, which is a collection of many packages that makes the workflow steps for your project very smooth and tightly connected to each other and easily managable in a well-structured manner.
The core packages contained in tidymodels are:

- rsample: for data splitting and rsampling.
- parsnip: Unified interface to the most common machine learning models.
- recipes: unified interface to the most common pre-processing tools for feature engineering.
- workflows: bundle the workflow steps together.
- tune: for optimization of the hyperparameters.
- yardstick: provides the most common performance metrics.
- broom: converts the outputs into user friendly formats such as tibble.
- dials: provides tools for parameter grids.
- infer: provides tools for statistical inferences.

In addition to the above apackages tidymodels contains also some classical packages such as: dplyr, ggplot2, purrr, tibble. For more detail click here.

In order to widely explore and understand the tidymodels, we should look for a noisy dataset that has large number of variables with missing values. Fortunately, I found an open source dataset that fulfils these requirements and in addition, it is highly imbalanced. This data is about **scania trucks** and can be downloaded from UCI machine learning repository with an extra file for its description.

the target variable of this data is the air pressure system **APS** in the truck that generates the pressurized air that are utilized in various function in the truck. It has two classes: positive **pos** if a component failures due to a failures in the APS system, negative **neg** if a component failures are not related to the APS system. This means that we are dealing with binary classification problem.

## Data exploration

The data is already separated into training and testing set from the source, so let’s call the packages that we need and the data.

```
ssh <- suppressPackageStartupMessages
ssh(library(readr))
ssh(library(caret))
ssh(library(themis))
ssh(library(tidymodels))
train <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_training_set.csv", skip = 20)
test <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_test_set.csv", skip = 20)
```

Notice that the data is a tibble where the first twenty rows are a mix of rows that contain some text descritpion and empty rows. and the 21th row contains the column names. That is why we have set the **skip** argument equals to **20**, and for the 21th column by default has been read as colnames `col_names = TRUE`

.

### Summary of the variables

First let’s check the dimension of the two sets to be aware of what we are dealing with.

`dim(train)`

`[1] 60000 171`

`dim(test)`

`[1] 16000 171`

The training set has 60000 rows and 171 variables which is moderately large dataset. Inspecting thus this data by the usual functions such as **summary**, **str** would give heavy and not easily readable outputs. the best alternative However is by extracting the most important information that is required for building any machine learning model in aggregated way, for instance, what type of variables it has, some statistics about the variable values, the missing values..etc.

```
map_chr(train, typeof) %>%
tibble() %>%
table()
```

```
.
character double
170 1
```

Strangely, all the variables but one are characters, which is in contradiction with the description of this data from the file description. To figure out what is going on we display some few rows and some few columns.

`train[1:5,1:7]`

```
# A tibble: 5 x 7
class aa_000 ab_000 ac_000 ad_000 ae_000 af_000
<chr> <dbl> <chr> <chr> <chr> <chr> <chr>
1 neg 76698 na 2130706438 280 0 0
2 neg 33058 na 0 na 0 0
3 neg 41040 na 228 100 0 0
4 neg 12 0 70 66 0 10
5 neg 60874 na 1368 458 0 0
```

I think, the problem is that the missing values in the data indicated by **na** are not recognized as missing values, instead they are treated as characters and this what makes the function **read_csv** coerces every variable that has this **na** values to character type. To fix this problem we can either go back and set the **na** argument to **“na”**, or we set the missing values by hand as follows.

```
train[-1] <- train[-1] %>%
modify(~replace(., .=="na", NA)) %>%
modify(., as.double)
```

Now let’s check again

```
map_chr(train, typeof) %>%
tibble() %>%
table()
```

```
.
character double
1 170
```

The first column excluded above is our target variable **class**. We should not forget to do the same transformation to the test set.

```
test[-1] <- test[-1] %>%
modify(~replace(., .=="na", NA)) %>%
modify(., as.double)
```

If we try to apply the **summary** function on the entire variables (170), we will spent a lot of time to read the summary of each variable without much gain. Instead, we try to get an automated way to get only the information needed to build efficiently our model. To decide whether we should normalize the data or not, for instance, we display the standard deviances of all the variable in decreasing order.

**Note**: with tree based models we do not need neither normalize the data nor converting factors to dummies.

```
map_dbl(train[-1], sd, na.rm=TRUE) %>%
tibble(sd = .) %>%
arrange(-sd)
```

```
# A tibble: 170 x 1
sd
<dbl>
1 794874918.
2 97484780.
3 42746746.
4 40404413.
5 40404412.
6 40404411.
7 11567771.
8 10886737.
9 10859905.
10 10859904.
# ... with 160 more rows
```

We have very large variability, which means that the data should be normalized for any machine learning model that uses gradient descent or based on class distances.

Another thing we can check is if some variabels have small number of unique values which can hence be converted to factor type.

```
map(train[-1], unique) %>%
lengths(.) %>%
sort(.) %>%
head(5)
```

```
cd_000 ch_000 as_000 ef_000 ab_000
2 3 22 29 30
```

To make things simple we consider only the first two ones to be converted to factor type.

the first one is constant which is of type zero variance because its variance equals to zero, and the second one should be converted to factor type with two levels (for the two sets), but since it has large missing values we will decide about it later on. Notice that we do not apply theses transformations here because they will be combined at once with all the required transformations as what will be shown shortly.

### Missing values

The best way to deal with missing values depends on their number compared to the dataset size. if we have small number then it would be easier to completely remove them from the data, if in contrast we have large number then the best choice is to impute them using one of the common methods designed for this type of issue.

`dim(train[!complete.cases(train),])`

`[1] 59409 171`

As we see almost every row contains at least one missing value in some columns. Let’s check the distribution of missing values within columns.

```
df <- modify(train[-1], is.na) %>%
colSums() %>%
tibble(names = colnames(train[-1]),missing_values=.) %>%
arrange(-missing_values)
df
```

```
# A tibble: 170 x 2
names missing_values
<chr> <dbl>
1 br_000 49264
2 bq_000 48722
3 bp_000 47740
4 bo_000 46333
5 ab_000 46329
6 cr_000 46329
7 bn_000 44009
8 bm_000 39549
9 bl_000 27277
10 bk_000 23034
# ... with 160 more rows
```

I think the best strategy is to first remove columns that have a large number of missing values then we impute the rest, thereby we reduce the number of predictors and the number of missing values at ones. The following script keep the predictors that have a number of missing values less than **10000**.

```
names <- modify(train[-1], is.na) %>%
colSums() %>%
tibble(names = colnames(train[-1]), missing_values=.) %>%
filter(missing_values < 10000)
train1 <- train[c("class",names$names)]
test1 <- test[c("class",names$names)]
```

An important thing should be noted here is that, if we use imputation methods that use information from all other columns and/or rows to predict the current missing value, therefore the data must be first split between training and testing sets before any imputation, to abide by the crucial rule of machine learning: the test data should never be seen by the model during training process.
Fortunately, our data is already split so that the imputation can be done separately. However, the imputation methods will be implemented later on by the help of the **recipes** package where we bundle all the pre-processing steps together.
**Note**: the above **ch_000** was removed since it did not fulfill the required threshold.

### imbalanced data

Another important issue that we face when predicting this data is the **imbalanced** problem.

`prop.table(table(train1$class))`

```
neg pos
0.98333333 0.01666667
```

This data is highly imbalanced, which tends to make even the worst machine learning model gives very high accuracy rate. In other words, if we do not use any model and predict every class as the largest class label (in our case negative) the accuracy rate will be approximately equal to the proportion of the largest class (in our case 98%), which is very big misleading result. Moreover, this misleading result can be catastrophic if we are more interested to predict the small class (in our case positive) such as detecting fraudulent credit cards. If you would like to get more detail about how to deal with imbalanced data please check this article.

## building the recipe

Our initial model will be the **random forest** wich is the most popular one . So the first step to build our model is by defining our model with the **engine**, which is the method (or the package) used to fit this model, and the **mode** with two possible values **classification** or **regression**. In our case, for instance, there exists two available engines: **randomForest** or **ranger**. Notice that the **parsnip** package who provides these settings. For more detail about all the models available click here.

**Note**: To speed up the computation process we restrict the forest to **100** trees instead of the default **500**.

```
rf <- rand_forest(trees = 100) %>%
set_engine("ranger", num.threads=3, seed = 123) %>%
set_mode("classification")
```

Most machine learning models require pre-processed data with some feature engineering. Traditionally, R has (and some other packages such as dplyr and stringr) provides a wide range of functions such that we can do almost every kind of feature engineering. However, if we have many different transformations to perform then they will be done separately and it will be a little cumbersome to repeat the same scripts again for testing set for instance. Therefore, the **recipes** package provides an easy way to combine all the transformations and other features related to the model, such as selecting the predictors that should be included, identifiers, …etc, as a single block that can be used for any other subset of the data.

For our case we will apply the following transformations:

- Imputing the missing values by the median of the corresponding variable since we have only numeric variables (for simplicity).
- removing variables that have zero variance (variable that has one unique value).
- removing highly correlated predictor using threshold
**0.8**. - Normalizing the data (even we do not need it in this model but we add this step since this recipe will be used with other models that use gradient decent or distances calculations).
- using the subsampling method
**smote**to create a balanced data. Notice that the**smote**method is provided by the package**themis**

To combine all these operations together we call the function **recipe**.

```
data_rec <- recipe(class~., data=train1) %>%
step_medianimpute(all_predictors() , seed_val = 111) %>%
step_zv(all_predictors()) %>%
step_corr(all_predictors(), threshold = 0.8) %>%
step_normalize(all_predictors()) %>%
step_smote(class)
```

As you see everything combined nicely and elegantly. However, this recipe transformed nothing yet, it just recorded the formula, the predictors and the transformations that should be applied. This means that we can update, at any time before fitting our model, the formula, add or remove some steps. The super interesting feature of recipe is that we can apply it to any other data (than that mentioned above, train) provided that has the same variable names. In case you want to apply these transformations to the training data use the **prep** function, and to retrieve the results use the function **juice**, and for other data use **bake** after **prep** to be able to apply some parameters from the training data, for instance, when we normalize the data this function lets us use the mean of predictors computed from the training data rather than from the testing data. However, in our case, we will combine everything until the model fitting step.

For more detail about all the steps available click here.

## Building the workflow

To well organize our workflow in a structured and smoother way, we use the **workflow** package that is one of the tidymodels collection.

```
rf_wf <- workflow() %>%
add_model(rf) %>%
add_recipe(data_rec)
rf_wf
```

```
== Workflow =======================
Preprocessor: Recipe
Model: rand_forest()
-- Preprocessor -------------------
5 Recipe Steps
* step_medianimpute()
* step_zv()
* step_corr()
* step_normalize()
* step_smote()
-- Model --------------------------
Random Forest Model Specification (classification)
Main Arguments:
trees = 100
Engine-Specific Arguments:
num.threads = 3
seed = 123
Computational engine: ranger
```

## random forest model

Now we can run everything at once, the recipe and the model, notice that here we can also update, add or remove some elements before going ahead and fit the model.

### model training

Everything now is ready to run our model with the default values.

```
model_rf <- rf_wf %>%
fit(data = train1)
```

We can extract the summary of this model as follows

`model_rf %>% pull_workflow_fit()`

```
parsnip model object
Fit time: 55.7s
Ranger result
Call:
ranger::ranger(formula = ..y ~ ., data = data, num.trees = ~100, num.threads = ~3, seed = ~123, verbose = FALSE, probability = TRUE)
Type: Probability estimation
Number of trees: 100
Sample size: 118000
Number of independent variables: 95
Mtry: 9
Target node size: 10
Variable importance mode: none
Splitrule: gini
OOB prediction error (Brier s.): 0.003998112
```

This model has created 100 trees and has chosen randomly 9 predictors with each tree. with these settings thus we do obtain very low oob error rate which is 0.4% (accuracy rate 99.6% ). However, be cautious with such high accuracy rate, since, in practice, This result may highly related to an overfitting problem. Last thing I want to mention about this output, by looking at the confusion matrix, is the fact that we have now balanced data.

### model evaluation

The best way to evaluate our model is by using the testing set. Notice that the **yardstick** provides bunch of metrics to use, but let’s use the most popular one for classification problems **accuracy**.

```
model_rf %>%
predict( new_data = test1) %>%
bind_cols(test1["class"]) %>%
accuracy(truth= as.factor(class), .pred_class)
```

```
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.990
```

with this model we get high accuracy which is very closer to the previous one. However, we should not forget that we are dealing with imbalanced data, and even though we have used subsampling methods (like smote method used here), they do not completely solve this issue, they can only minimize it at certain level and this is the reason why we have many of these methods. Therefore, it is better to use the confusion matrix from the **caret** package since it gives more information.

`caret::confusionMatrix(as.factor(test1$class), predict(model_rf, new_data = test1)$.pred_class)`

```
Confusion Matrix and Statistics
Reference
Prediction neg pos
neg 15532 93
pos 64 311
Accuracy : 0.9902
95% CI : (0.9885, 0.9917)
No Information Rate : 0.9748
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.7934
Mcnemar's Test P-Value : 0.02544
Sensitivity : 0.9959
Specificity : 0.7698
Pos Pred Value : 0.9940
Neg Pred Value : 0.8293
Prevalence : 0.9748
Detection Rate : 0.9708
Detection Prevalence : 0.9766
Balanced Accuracy : 0.8828
'Positive' Class : neg
```

As said shortly, the specificity rate related to the minor class **78%** is very low compared to the major class **99%**, and You can think of this as a partial overfitting towards the major class. So if we are more interested to the minor class (which is often the case) then we have go back to our model and try tuning our model or try another subsampling method.

### Model tuning:

For model tuning we try other values for some arguments rather than the default vaues. and leave the tuning for some others to the **dials** package. So let’s try the following argument values:

- num.trees = 100. The default is 500.
- num.threads = 3. The default is 1.

And tune the following:

- mtry = tune(). The default is square root of the number of the variables.
- min_n = tune(). The default is 1.

First, we define the model with these new arguments.

```
model_tune <- rand_forest(trees= 100, mtry=tune(), min_n = tune()) %>%
set_engine("ranger", num.threads=3, seed=123) %>%
set_mode("classification")
```

Since in grid search the two arguments mtry and min_n are data dependent, then we should at least specify their ranges.

```
grid <- grid_regular(mtry(range = c(9,15)), min_n(range = c(5,40)), levels = 3)
grid
```

```
# A tibble: 9 x 2
mtry min_n
<int> <int>
1 9 5
2 12 5
3 15 5
4 9 22
5 12 22
6 15 22
7 9 40
8 12 40
9 15 40
```

By setting the levels equal to 3 we get 9 combinations and hence 9 models will be trained. The above recipe has steps that should not be repeated many times when tuning the model, we apply therefore the recipe to the training data in order to get the transformed data, and do not forget to apply the recipe to the testing data.

```
train2 <- prep(data_rec) %>%
juice()
test2 <- prep(data_rec) %>%
bake(test1)
```

To tune our model we use cross validation technique. since we have large data set we use only 3 folds.

```
set.seed(111)
fold <- vfold_cv(train2, v = 3, strata = class)
```

Now we bundle our recipe with the specified model.

```
tune_wf <- workflow() %>%
add_model(model_tune) %>%
add_formula(class~.)
```

To fit these models across the folds we use the **tune_grid** function instead of **fit**.

```
tune_rf <- tune_wf %>%
tune_grid(resamples = fold, grid = grid)
```

For classification problems this function uses two metrics: accuracy and area under the ROC curve. SO we can extract the metric values as follows.

`results <- tune_rf %>% collect_metrics()`

To get the best model we have to choose one of the two metrics, so let’s go ahead with the accuracy rate.

```
best_param <-
tune_rf %>% select_best(metric = "accuracy")
best_param
```

```
# A tibble: 1 x 3
mtry min_n .config
<int> <int> <chr>
1 15 5 Model3
```

we can finalize the workflow with the new parameter values.

```
tune_wf2 <- tune_wf %>%
finalize_workflow(best_param)
tune_wf2
```

```
== Workflow =======================
Preprocessor: Formula
Model: rand_forest()
-- Preprocessor -------------------
class ~ .
-- Model --------------------------
Random Forest Model Specification (classification)
Main Arguments:
mtry = 15
trees = 100
min_n = 5
Engine-Specific Arguments:
num.threads = 3
seed = 123
Computational engine: ranger
```

Now we fit the model with the best parameter values to the entire training data.

```
best_model <- tune_wf2 %>%
fit(train2)
best_model
```

```
== Workflow [trained] =============
Preprocessor: Formula
Model: rand_forest()
-- Preprocessor -------------------
class ~ .
-- Model --------------------------
Ranger result
Call:
ranger::ranger(formula = ..y ~ ., data = data, mtry = ~15L, num.trees = ~100, min.node.size = ~5L, num.threads = ~3, seed = ~123, verbose = FALSE, probability = TRUE)
Type: Probability estimation
Number of trees: 100
Sample size: 118000
Number of independent variables: 95
Mtry: 15
Target node size: 5
Variable importance mode: none
Splitrule: gini
OOB prediction error (Brier s.): 0.00359659
```

Let’s get the confusion matrix

`caret::confusionMatrix(as.factor(test2$class), predict(best_model, new_data = test2)$.pred_class)`

```
Confusion Matrix and Statistics
Reference
Prediction neg pos
neg 15538 87
pos 67 308
Accuracy : 0.9904
95% CI : (0.9887, 0.9918)
No Information Rate : 0.9753
P-Value [Acc > NIR] : <2e-16
Kappa : 0.7951
Mcnemar's Test P-Value : 0.1258
Sensitivity : 0.9957
Specificity : 0.7797
Pos Pred Value : 0.9944
Neg Pred Value : 0.8213
Prevalence : 0.9753
Detection Rate : 0.9711
Detection Prevalence : 0.9766
Balanced Accuracy : 0.8877
'Positive' Class : neg
```

As we see we do not get any improvement for the specificity rate. so let’s try another subsampling method, say **Rose** method.

```
rf_rose <- rand_forest(trees = 100, mtry=9, min_n = 5) %>%
set_engine("ranger", num.threads=3, seed = 123) %>%
set_mode("classification")
data_rec2 <- recipe(class~., data=train1) %>%
step_medianimpute(all_predictors() , seed_val = 111) %>%
step_zv(all_predictors()) %>%
step_corr(all_predictors(), threshold = 0.8) %>%
step_normalize(all_predictors()) %>%
step_rose(class)
rf_rose_wf <- workflow() %>%
add_model(rf_rose) %>%
add_recipe(data_rec2)
model_rose_rf <- rf_rose_wf %>%
fit(data = train1)
caret::confusionMatrix(as.factor(test1$class), predict(model_rose_rf, new_data = test1)$.pred_class)
```

```
Confusion Matrix and Statistics
Reference
Prediction neg pos
neg 15522 103
pos 140 235
Accuracy : 0.9848
95% CI : (0.9828, 0.9867)
No Information Rate : 0.9789
P-Value [Acc > NIR] : 2.437e-08
Kappa : 0.6514
Mcnemar's Test P-Value : 0.02092
Sensitivity : 0.9911
Specificity : 0.6953
Pos Pred Value : 0.9934
Neg Pred Value : 0.6267
Prevalence : 0.9789
Detection Rate : 0.9701
Detection Prevalence : 0.9766
Balanced Accuracy : 0.8432
'Positive' Class : neg
```

The rose method is much worse than smote method since the specificity rate has doped down to 69%.

## logistic regression model

The logistic regression is another model to fit data with binary outcome. As before we use the first recipe with smote method.

```
logit <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
logit_wf <- workflow() %>%
add_model(logit) %>%
add_recipe(data_rec)
set.seed(123)
model_logit <- logit_wf %>%
fit(data = train1)
caret::confusionMatrix(as.factor(test1$class), predict(model_logit, new_data = test1)$.pred_class)
```

```
Confusion Matrix and Statistics
Reference
Prediction neg pos
neg 15327 298
pos 59 316
Accuracy : 0.9777
95% CI : (0.9753, 0.9799)
No Information Rate : 0.9616
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6282
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9962
Specificity : 0.5147
Pos Pred Value : 0.9809
Neg Pred Value : 0.8427
Prevalence : 0.9616
Detection Rate : 0.9579
Detection Prevalence : 0.9766
Balanced Accuracy : 0.7554
'Positive' Class : neg
```

with this model we do not get better rate for minority class than random forest model.

## Session information

`sessionInfo()`

```
R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] yardstick_0.0.7 workflows_0.2.0 tune_0.1.1 tidyr_1.1.2
[5] tibble_3.0.3 rsample_0.0.8 purrr_0.3.4 parsnip_0.1.3
[9] modeldata_0.0.2 infer_0.5.3 dials_0.0.9 scales_1.1.1
[13] broom_0.7.1 tidymodels_0.1.1 themis_0.1.2 recipes_0.1.13
[17] dplyr_1.0.2 caret_6.0-86 ggplot2_3.3.2 lattice_0.20-41
[21] readr_1.3.1
loaded via a namespace (and not attached):
[1] nlme_3.1-149 lubridate_1.7.9 doParallel_1.0.15
[4] DiceDesign_1.8-1 tools_4.0.1 backports_1.1.10
[7] utf8_1.1.4 R6_2.4.1 rpart_4.1-15
[10] colorspace_1.4-1 nnet_7.3-14 withr_2.3.0
[13] prettyunits_1.1.1 tidyselect_1.1.0 curl_4.3
[16] compiler_4.0.1 parallelMap_1.5.0 cli_2.0.2
[19] bookdown_0.20 checkmate_2.0.0 stringr_1.4.0
[22] digest_0.6.25 rmarkdown_2.4 unbalanced_2.0
[25] pkgconfig_2.0.3 htmltools_0.5.0 lhs_1.1.0
[28] rlang_0.4.7 rstudioapi_0.11 BBmisc_1.11
[31] FNN_1.1.3 generics_0.0.2 ModelMetrics_1.2.2.2
[34] magrittr_1.5 ROSE_0.0-3 Matrix_1.2-18
[37] fansi_0.4.1 Rcpp_1.0.5 munsell_0.5.0
[40] GPfit_1.0-8 lifecycle_0.2.0 furrr_0.1.0
[43] stringi_1.5.3 pROC_1.16.2 yaml_2.2.1
[46] MASS_7.3-53 plyr_1.8.6 grid_4.0.1
[49] parallel_4.0.1 listenv_0.8.0 crayon_1.3.4
[52] splines_4.0.1 hms_0.5.3 knitr_1.30
[55] mlr_2.17.1 pillar_1.4.6 ranger_0.12.1
[58] reshape2_1.4.4 codetools_0.2-16 stats4_4.0.1
[61] fastmatch_1.1-0 glue_1.4.2 evaluate_0.14
[64] ParamHelpers_1.14 blogdown_0.20 data.table_1.13.0
[67] vctrs_0.3.4 foreach_1.5.0 gtable_0.3.0
[70] RANN_2.6.1 future_1.19.1 assertthat_0.2.1
[73] xfun_0.18 gower_0.2.2 prodlim_2019.11.13
[76] e1071_1.7-3 class_7.3-17 survival_3.2-7
[79] timeDate_3043.102 iterators_1.0.12 hardhat_0.1.4
[82] lava_1.6.8 globals_0.13.0 ellipsis_0.3.1
[85] ipred_0.9-9
```