Bayesian hyperparameters optimization

Tuning the hyperparameters of random forest and deep learning models by bayesian optimization

Last updated on Jun 11, 2020 R

1 Introduction

Machine learning models are called by this name because of their ability to learn the best parameter values that are as closest as possible to the right values of the optimum objective function (or loss function). However, since all models require some assumptions (like linearity in linear regression models), parameters (like the cost C in svm models), and settings (like the number of layers in deep learning models) to be prespecified before training the model (in most cases are set by default ), the name of machine learning is not fully justified. Theses prespecified parameters are called hyperparameters, and should be defined in such a way that the corresponding model reaches its best performance (conditionally on the data at hand).

The search of the best hyperparameters called tuning, which simply is training the model with each combination of hyperparameter values many times, using some techniques like cross-validation to make sure that the resulted loss functions highly likely depends on this specific combination values (not by random chances), and then pick the best one that gives the optimum value for the objective function. This means that if our model requires long computational time (due to fewer hardware resources, or large dataset) we cannot try a large number of combinations, and hence we will be likely far away from the best result.

The main problem, however, is how do we define the space of these hyperparameters to choose from. Many methods are available:

grid search: Using this method the modeler provides the values for each parameter to evaluate and then pick the combination that gives the best result. However, this method based entirely on the experience of the modeler with the model in question and the data at hand. So with not enough experience, which is often the case, the choice in most cases would be arbitrary.
Random search: with this method, we choose randomly the values for each hyperparameter, and it turns out that this method is sometimes more accurate than the previous one. However, this method also suffers from the arbitrary choice of values.
Bayesian hyperparameters: This method uses Bayesian optimization to guide a little bit the search strategy to get the best hyperparameter values with minimum cost (the cost is the number of models to train). We will briefly discuss this method, but if you want more detail you can check the following great article.

we will focus on the best one which is Bayesian hyperparameters, but we first start by briefly introducing the others.

To well understand these methods we will make use of small dataset with a small number of predictors, and we will use two models, the machine learning model Random forest and the deep learning model feedforward neural network.

2 Bayesian optimization

The main idea behind this method is very simple, at the first iteration we pick a point at random, then at each iteration, and based on Bayes rule, we make a trade-off between choosing the point that has the highest uncertainty (known as active learning) or choosing the point within the region that has already the best result (optimum objective function) until the current iteration. In more detail, let’s say we are dealing with maximization problem such as maximizing the accuracy rate for a classification problem, then, at each iteration, the Bayes optimization method should decide between focusing the search on the region that contains the best point (that resulted in the maximum objective function) until the current iteration (called exploitation), or inspecting another region with the highest uncertainty (called exploration). The question, however, is how can this method decide between these two options. Well, this method uses what is called acquisition function, which is a function that helps to decide what region should we inspect in the next iteration.

2.1 Acquisition functions

Since this problem is always partially random, there exist many forms for this function, and we will discuss briefly the most common ones.

2.1.1 Upper confidence bound

Using this function, the following located point will be that that has the highest upper confidence bound, and, assuming the Gaussian process, this bound will be computed as follows:

\[UCB(x)=\mu(x)+\kappa\sigma(x)\]

Where \(\mu\) and \(\sigma\) are the mean and the standard deviation defined by the Gaussian process, and \(\kappa\) is an exploration parameter, the larger the values the more the exploration.

2.1.2 Probability of improvement PI

this acquisition function chooses the next point that has the highest probability of improvement over the current maximum objective function \(f_{max}\) obtained from the points evaluated previously.

Assuming the Gaussian process, the new point then will be the one that has the highest following probability:

\[PI(x)=\Phi\left(\frac{\mu(x)-f{max}-\varepsilon}{\sigma(x)}\right)\] Where \(\varepsilon\) plays the role for trading off between exploration and exploitation, such that larger values result in more exploration than exploitation

2.1.3 Expected improvement EI

This acquisition function, unlike the previous one, tries to quantify how much improvement we get with the new point, the new point that has the maximum expected value will be chosen.

Again, by assuming the Gaussian process, this function can be computed as follows:

\[EI(x) = \left(\mu(x)-f_{max}\right)\Phi\left(\frac{\mu(x)-f{max}-\varepsilon}{\sigma(x)}\right)+\sigma(x)\phi\left(\frac{\mu(x)-f{max}-\varepsilon}{\sigma(x)}\right)\]

3 Data preparation

Let’s first call the packages needed along this article.

options(warn=-1)
library(readr)
library(tidymodels)
library(themis)
library(plot3D)

For our illustration, we will use data downloaded from the link described in the below script. It is already split between a training set and a testing set. The target variable is a highly imbalanced binary variable, and the data has a very large number of missing values. To make our analysis simpler, however, we will reduce the size of this data by first, removing the variables with a large number of missing values, then removing completely the remaining ones, and lastly correcting the imbalance distribution using downsampling method . For more detail about this data check my previous article.

options(warn=-1)
train <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_training_set.csv", skip = 20)
test <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_test_set.csv", skip = 20)

This data has 171 variables and a total of 76000 instances, 60000 in the training set, and 16000 in the testing sets. Notice that the missing values in the data are represented by lower case na values and not NA and hence are not recognized by R as missing values, That is why the read_csv function converted all the variables that have these values to the character type.

map_chr(train, typeof) %>% 
  tibble() %>% 
  table()

.
character    double 
      170         1

To fix this problem, therefore, we replace na by NA then converting back the corresponding variable to the numeric type.

train[-1] <- train[-1] %>% 
  modify(~replace(., .=="na", NA)) %>%
  modify(., as.double)

test[-1] <- test[-1] %>% 
  modify(~replace(., .=="na", NA)) %>%
  modify(., as.double)

Then we keep only the predictors that have less than 600 missing values, and thereafter, we keep only rows without missing values.

names <- modify(train[-1], is.na) %>% 
  colSums() %>%
  tibble(names = colnames(train[-1]), missing_values=.) %>% 
  filter(missing_values < 600) %>% 
  select(1)
train1 <- train[c("class",names$names)] %>% 
  .[complete.cases(.),]
test1 <- test[c("class",names$names)] %>% 
  .[complete.cases(.),]
dim(train1)

[1] 58888    11

dim(test1)

[1] 15728    11

As we see the data now has been reduced to 11 variables. The last thing to check is the distribution of the target variable.

prop.table(table(train1$class))


       neg        pos 
0.98376579 0.01623421

As we see, the data is highly imbalanced, so we correct this problem by downsamplig using the themis package. But first, we create a recipe that defines the formula of our model, normalizes the predictors, and down sample the data. Then we execute the recipe to the data using prep function, and lastly we retrieve the transformed data by juice function.

train2 <- recipe(class~., data=train1) %>%
  step_normalize(all_predictors()) %>% 
  step_downsample(class, seed = 111) %>%
  prep() %>% 
  juice()

For the testing set, however, we can transform it by the same recipe, but at the end we make use of the bake function instead of juice that uses the data defined in the recipe.

test2 <- recipe(class~., data=train1) %>%
  step_normalize(all_predictors()) %>%
  themis::step_downsample(class, seed = 111) %>% 
  prep() %>%
  bake(test1)

It should be noted here that the step_downsample is not needed for the testing set, that is why, the bake does not execute this step by default applied to any new data. We can check this by the dimension that still the same.

dim(test1); dim(test2)

[1] 15728    11

[1] 15728    11

Note: The testing set is not needed, since our purpose is the hyperparameter tuning methods, which requires only the training set. But it is included only to show how it is processed with the work flow of tidymodels package.

4 Random forest model

Since we are dealing with a classification problem, our objective function will be the area under the ROC curve roc_area. And for the model, we will use the most popular one, Random forest model with the two hyperparameters to tune:

mtry: The number of sampled predictors at each step.
min_n: The minimum number of instances in the node to split further.

4.1 The true distribution of the hyperparameters

With this small data, this model, in the sense of computational time, is not expensive, so we will try a wide range of values for the above hyperparameters, then we can use this as the true distribution to compare the performance of the hyperparameter tuning methods discussed above.

Using tidymodels package we define the model and we leave the mtry and min_n for tuning. To speed up, however, computation process we restrict the number of trees to 100 instead of the default fo 500 trees.

model_tune <- rand_forest(mtry = tune(), min_n = tune(), trees = 100L) %>% 
  set_engine("ranger", seed=222) %>% 
  set_mode("classification")

To avoid the pitfalls of random samples, we use the cross-validation technique.

set.seed(1234)
folds <- vfold_cv(train2, v=5, strata = class)

Then we use the following work flow:

tune_wf <- workflow() %>% 
  add_model(model_tune) %>% 
  add_formula(class~.)

Now we are ready to train the model using a wide range of combination values, and then we assume that these combinations are the true distribution of our model. Note that we have 100 combinations, so if the model takes hours to train, then it will require days for tuning. To prevent the model to rerun when rendering this document we save the results in csv file then we load it again. If you want, however, to run this model then uncomment the script

#tuned <- tune_wf %>% 
#   tune_grid(resamples = folds, 
#            grid =expand.grid(mtry=1:10, min_n=1:10),
#            metrics=metric_set(roc_auc))

To extract the results we use the collect_metrics function.

#df <- tuned %>% collect_metrics()
#write_csv(df, "C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df.csv")
df <- read_csv("C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df.csv")
df <- df %>% arrange(-mean) %>% tibble(rank=seq_len(nrow(df)), .)

df

# A tibble: 100 x 8
    rank  mtry min_n .metric .estimator  mean     n std_err
   <int> <dbl> <dbl> <chr>   <chr>      <dbl> <dbl>   <dbl>
 1     1     1     2 roc_auc binary     0.984     5 0.00198
 2     2     3     6 roc_auc binary     0.984     5 0.00204
 3     3     3     7 roc_auc binary     0.983     5 0.00269
 4     4     3     8 roc_auc binary     0.983     5 0.00304
 5     5     2    10 roc_auc binary     0.983     5 0.00200
 6     6     3     4 roc_auc binary     0.983     5 0.00246
 7     7     4     4 roc_auc binary     0.983     5 0.00224
 8     8     2     1 roc_auc binary     0.983     5 0.00170
 9     9     3     5 roc_auc binary     0.983     5 0.00271
10    10     4     5 roc_auc binary     0.983     5 0.00246
# ... with 90 more rows

We reached the maximum value for the objective function 0.9837565 with the following hyperparameter values mtry = 1, and min_n = 2.

Note: we will ignore the overfitting problem, since we are only making comparison using the same training set.

We can plot this distribution as follows:

scatter3D(x=df$mtry, y=df$min_n, z=df$mean, phi = 0, bty = "g",  type = "h", 
          ticktype = "detailed", pch = 19, cex = 0.5, 
          main="the true distribution", xlab="mtry",
          ylab="min_n", zlab="roc_auc")

4.2 random search

Now suppose that we are allowed to try only 10 combinations. For random search strategy, we use the grid_random function from dials package (embedded in tidymodels package).

#tuned_random <- tune_wf %>% 
#  tune_grid(resamples = folds, 
#            grid = grid_random(mtry(range = c(1,10)), min_n(range = c(1,10)), #size = 10),
#            metrics=metric_set(roc_auc))
#df_r <- tuned_random %>% collect_metrics()

#write_csv(df_r, "C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_r.csv")
df_r <- read_csv("C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_r.csv")

df_r %>% arrange(-mean) %>% 
  head(1) %>% 
  inner_join(df, by="mean") %>%
  select(rank, mtry.x, min_n.x, mean)

# A tibble: 1 x 4
   rank mtry.x min_n.x  mean
  <int>  <dbl>   <dbl> <dbl>
1    13      3      10 0.983

The maximum value obtained by this method is 0.9826900, which corresponds to the 13th rank compared to the maximum value of the the true distribution, and the associated hyperparameters values are mtry=3, min_n=10

4.3 bayesian optimization UCB

For the Bayesian optimization method, we make use of the tune_bayes function from tune package that provides all the acquisition functions discussed above. So we start by UCB function, and we set the argument kappa equals to 2, which controls the trade-off between exploitation and exploration such that larger values lead to more exploration. Notice that, by default, this function explores 10 combinations, if you want another number change the argument iter.

#tuned_UCB <- tune_wf %>% 
#  tune_bayes(resamples = folds,
#          param_info=parameters(mtry(range = c(1,10)), min_n(range = c(1,10))),
#             metrics=metric_set(roc_auc),
#             objective=conf_bound(kappa = 2))

#df_UCB <- tuned_UCB %>% collect_metrics()
#write_csv(df_UCB,"C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_UCB.csv")
df_UCB <- read_csv("C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_UCB.csv")

df_UCB %>% arrange(-mean) %>% 
  head(1) %>% 
  inner_join(df, by="mean") %>%
  select(rank, mtry.x, min_n.x, mean)

# A tibble: 1 x 4
   rank mtry.x min_n.x  mean
  <int>  <dbl>   <dbl> <dbl>
1     2      3       6 0.984

With this acquisition function, we obtained better result than the random search 0.9837529 , which occupies the second position.

4.4 bayesian optimization PI

This time we will explore the acquisition functions PI discussed above. Note that larger values for the argument trade-off lead to more exploration than exploitation.

#tuned_PI <- tune_wf %>% 
#  tune_bayes(resamples = folds,
#         param_info=parameters(mtry(range = c(1,10)), min_n(range = c(1,10))),
#             metrics=metric_set(roc_auc),
#             objective=prob_improve(trade_off = 0.01))

#df_PI <- tuned_PI %>% collect_metrics()
#write_csv(df_PI, "C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_PI.csv")
df_PI <- read_csv("C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_PI.csv")

df_PI %>% arrange(-mean) %>% 
  head(1) %>% 
  inner_join(df, by="mean") %>%
  select(rank, mtry.x, min_n.x, mean)

# A tibble: 1 x 4
   rank mtry.x min_n.x  mean
  <int>  <dbl>   <dbl> <dbl>
1    11      2       9 0.983

As we see, with this acquisition function, we obtained the 11th position which is more worst than the previous one, but still better than random search.

4.5 bayesian optimization EI

Now we try another acquisition function, the expected improvement function.

#tuned_EI <- tune_wf %>% 
#  tune_bayes(resamples = folds,
#             param_info=parameters(mtry(range = c(1,10)), 
#             min_n(range = c(1,10))),
#             metrics=metric_set(roc_auc),
#             objective=exp_improve(trade_off = 0.01))

#df_EI <- tuned_EI %>% collect_metrics()

#write_csv(df_EI, "C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_EI.csv")
df_EI <- read_csv("C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_EI.csv")

df_EI %>% arrange(-mean) %>% 
  head(1) %>% 
  inner_join(df, by="mean") %>%
  select(rank, mtry.x, min_n.x, mean)

# A tibble: 1 x 4
   rank mtry.x min_n.x  mean
  <int>  <dbl>   <dbl> <dbl>
1     2      3       6 0.984

Here also we get the same result as UCB method.

4.6 Contrast the results

df_rf <- tibble(names=c("true", "random", "UCB", "PI", "EI"),
                mtry=c(df[df$mean==max(df$mean),1, drop=TRUE],
                  df_r[df_r$mean==max(df_r$mean),1, drop=TRUE],
                     df_UCB[df_UCB$mean==max(df_UCB$mean),1, drop=TRUE],
                     df_PI[df_PI$mean==max(df_PI$mean),1, drop=TRUE],
                     df_EI[df_EI$mean==max(df_EI$mean),1, drop=TRUE]),
             min_n=c(df[df$mean==max(df$mean),2, drop=TRUE],
               df_r[df_r$mean==max(df_r$mean),2, drop=TRUE],
                     df_UCB[df_UCB$mean==max(df_UCB$mean),2, drop=TRUE],
                     df_PI[df_PI$mean==max(df_PI$mean),2, drop=TRUE],
                     df_EI[df_EI$mean==max(df_EI$mean),2, drop=TRUE]),
             roc_auc=c(max(df$mean), max(df_r$mean),max(df_UCB$mean),
                       max(df_PI$mean),max(df_EI$mean)),
             std_err=c(df[df$mean==max(df$mean),ncol(df), drop=TRUE],
                    df_r[df_r$mean==max(df_r$mean),ncol(df_r), drop=TRUE],
                     df_UCB[df_UCB$mean==max(df_UCB$mean),ncol(df_UCB), drop=TRUE],
                     df_PI[df_PI$mean==max(df_PI$mean),ncol(df_PI), drop=TRUE],
                     df_EI[df_EI$mean==max(df_EI$mean),ncol(df_EI), drop=TRUE]))
df_rf %>% arrange(-roc_auc)

# A tibble: 5 x 5
  names   mtry min_n roc_auc std_err
  <chr>  <dbl> <dbl>   <dbl>   <dbl>
1 true       1     1   0.984 0.00198
2 UCB        3     6   0.984 0.00204
3 EI         3     6   0.984 0.00204
4 PI         2     9   0.983 0.00290
5 random     3    10   0.983 0.00275

As we see, the Bayesian optimization method performs better than random search whatever acquisition function used. However, since the acquisition functions have their own hyperparameters (to trade-off between exploration and exploitation), their performance thus may differ strongly from each other. Moreover, the difference can be larger if we use a larger dataset.

5 deep learning model

As we did with the random forest model we explore a wide range of hyperparameter values assuming that these values represent the true distribution, then we apply all the above tuning methods. For the model architecture, we use a deep learning model with a single layer, and for tuning, we use the two hyperparameters, the number of nodes, and the number of epochs. The last thing we have to do is to save what we need in csv file and load it again since each time we rerun the model we get a different result, and besides, it takes a lot of time.

model_tune1 <- mlp(hidden_units = tune(), epochs = tune()) %>% 
  set_engine("keras") %>% 
  set_mode("classification")

tune_wf1 <- workflow() %>% 
  add_model(model_tune1) %>% 
  add_formula(class~.)

#tuned_deepl <- tune_wf1 %>% 
#  tune_grid(resamples = folds, 
#            grid = expand.grid(hidden_units=5:15, epochs=10:20),
#            metrics=metric_set(roc_auc))

#df1 <- tuned_deepl %>% collect_metrics()


#write_csv(df1, "C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df1.csv")

df1 <- read_csv("C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df1.csv")

df1 %>% arrange(-mean)

# A tibble: 121 x 7
   hidden_units epochs .metric .estimator  mean     n std_err
          <dbl>  <dbl> <chr>   <chr>      <dbl> <dbl>   <dbl>
 1           12     18 roc_auc binary     0.982     5 0.00255
 2           11     20 roc_auc binary     0.982     5 0.00265
 3            9     14 roc_auc binary     0.982     5 0.00274
 4            8     16 roc_auc binary     0.982     5 0.00207
 5           14     18 roc_auc binary     0.982     5 0.00241
 6           14     20 roc_auc binary     0.982     5 0.00244
 7           12     17 roc_auc binary     0.982     5 0.00265
 8           10     16 roc_auc binary     0.982     5 0.00230
 9           10     19 roc_auc binary     0.981     5 0.00231
10           12     20 roc_auc binary     0.981     5 0.00227
# ... with 111 more rows

With 12 units in the hidden layer and 18 epochs we reached the maximum value 0.9819806 for the area under the ROC curve.

scatter3D(x=df1$hidden_units, y=df1$epochs, z=df1$mean, 
          phi = 0, bty = "g",  type = "h", 
          ticktype = "detailed", pch = 19, cex = 0.5, 
          main="the true distribution", xlab="units",
          ylab="epochs", zlab="roc_auc")

5.1 Random search

#tuned_random1 <- tune_wf1 %>% 
#  tune_grid(resamples = folds, 
#            grid = grid_random(hidden_units(range = c(5,15)), epochs(range = #c(10,20)), size = 10),
#            metrics=metric_set(roc_auc))
#df_r1 <- tuned_random1 %>% collect_metrics()

#write_csv(df_r1, "C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_#r1.csv")
df_r1 <- read_csv("C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_r1.csv")

df_r1 %>% arrange(-mean)

# A tibble: 10 x 7
   hidden_units epochs .metric .estimator  mean     n std_err
          <dbl>  <dbl> <chr>   <chr>      <dbl> <dbl>   <dbl>
 1           14     11 roc_auc binary     0.982     5 0.00230
 2           14     17 roc_auc binary     0.981     5 0.00240
 3            8     15 roc_auc binary     0.981     5 0.00266
 4           11     19 roc_auc binary     0.981     5 0.00268
 5           14     16 roc_auc binary     0.981     5 0.00221
 6           10     12 roc_auc binary     0.980     5 0.00259
 7            7     13 roc_auc binary     0.980     5 0.00229
 8            5     16 roc_auc binary     0.980     5 0.00241
 9            9     10 roc_auc binary     0.980     5 0.00244
10            5     13 roc_auc binary     0.979     5 0.00270

Since deep learning models strongly depend on an internal random process (like when initializing the weights), and besides, the small difference between results, due to the small size of the data, it is harder to contrast between different methods. To alleviate this problem, however, we use this time the standard errors to highlight the significance differences between the methods.

5.2 Bayesian optimization UCB

#tuned_UCB1 <- tune_wf1 %>% 
#  tune_bayes(resamples = folds,
#             param_info=parameters(hidden_units(range = c(5L,15L)), 
#             epochs(range = c(10L,20L))),
#             metrics=metric_set(roc_auc),
#             objective=conf_bound(kappa = 2))

#df_UCB1 <- tuned_UCB1 %>% collect_metrics()

#write_csv(df_UCB1, "C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_UCB1.csv")
df_UCB1 <- read_csv("C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_UCB1.csv")

df_UCB1 %>% arrange(-mean)

# A tibble: 15 x 8
   hidden_units epochs .iter .metric .estimator  mean     n std_err
          <dbl>  <dbl> <dbl> <chr>   <chr>      <dbl> <dbl>   <dbl>
 1           12     19     8 roc_auc binary     0.981     5 0.00237
 2           12     17     0 roc_auc binary     0.981     5 0.00250
 3            9     19    10 roc_auc binary     0.981     5 0.00203
 4           12     18     2 roc_auc binary     0.981     5 0.00236
 5           11     17     3 roc_auc binary     0.981     5 0.00248
 6           13     19     9 roc_auc binary     0.981     5 0.00246
 7           13     18     6 roc_auc binary     0.981     5 0.00260
 8           13     17     1 roc_auc binary     0.981     5 0.00254
 9           10     19     0 roc_auc binary     0.981     5 0.00230
10           11     19     7 roc_auc binary     0.981     5 0.00221
11           13     15     0 roc_auc binary     0.981     5 0.00257
12           11     18     5 roc_auc binary     0.981     5 0.00269
13           12     16     4 roc_auc binary     0.981     5 0.00213
14            5     11     0 roc_auc binary     0.979     5 0.00233
15            9     12     0 roc_auc binary     0.979     5 0.00288

5.3 Bayesian optimization PI

#tuned_PI1 <- tune_wf1 %>% 
#  tune_bayes(resamples = folds,
#             param_info=parameters(hidden_units(range = c(5L,15L)), 
#             epochs(range = c(10L,20L))),
#             metrics=metric_set(roc_auc),
#             objective=prob_improve(trade_off = 0.01))

#df_PI1 <- tuned_PI1 %>% collect_metrics()

#write_csv(df_PI1, "C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_PI1.csv")
df_PI1 <- read_csv("C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_PI1.csv")

df_PI1 %>% arrange(-mean)

# A tibble: 15 x 8
   hidden_units epochs .iter .metric .estimator  mean     n std_err
          <dbl>  <dbl> <dbl> <chr>   <chr>      <dbl> <dbl>   <dbl>
 1            7     20     6 roc_auc binary     0.981     5 0.00236
 2           10     10     5 roc_auc binary     0.981     5 0.00237
 3            9     17     0 roc_auc binary     0.981     5 0.00216
 4           15     16     8 roc_auc binary     0.981     5 0.00213
 5           14     20     0 roc_auc binary     0.981     5 0.00251
 6           11     20     4 roc_auc binary     0.981     5 0.00282
 7           15     10     3 roc_auc binary     0.981     5 0.00272
 8            6     10     0 roc_auc binary     0.981     5 0.00286
 9           13     20    10 roc_auc binary     0.981     5 0.00253
10           12     14     0 roc_auc binary     0.981     5 0.00268
11            8     12     0 roc_auc binary     0.980     5 0.00220
12           13     10     7 roc_auc binary     0.980     5 0.00273
13            8     10     1 roc_auc binary     0.980     5 0.00305
14            5     20     2 roc_auc binary     0.980     5 0.00227
15            5     12     9 roc_auc binary     0.979     5 0.00234

5.4 Bayesian optimization EI

#tuned_EI1 <- tune_wf1 %>% 
#  tune_bayes(resamples = folds,
#             param_info=parameters(hidden_units(range = c(5L,15L)), 
#             epochs(range = c(10L,20L))),
#             metrics=metric_set(roc_auc),
#             objective=exp_improve(trade_off = 0.01))

#df_EI1 <- tuned_EI1 %>% collect_metrics()

#write_csv(df_EI1, "C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_EI1.csv")
df_EI1 <- read_csv("C:/Users/dell/Documents/new-blog/content/bayes/hyper_bayes/df_EI1.csv")

df_EI1 %>% arrange(-mean)

# A tibble: 15 x 8
   hidden_units epochs .iter .metric .estimator  mean     n std_err
          <dbl>  <dbl> <dbl> <chr>   <chr>      <dbl> <dbl>   <dbl>
 1           10     20     0 roc_auc binary     0.981     5 0.00229
 2           15     18     1 roc_auc binary     0.981     5 0.00263
 3            7     16     0 roc_auc binary     0.981     5 0.00286
 4           12     20     8 roc_auc binary     0.981     5 0.00276
 5            8     19     6 roc_auc binary     0.981     5 0.00231
 6           14     14     0 roc_auc binary     0.981     5 0.00229
 7           13     14     7 roc_auc binary     0.981     5 0.00253
 8            5     20    10 roc_auc binary     0.981     5 0.00205
 9            6     13     5 roc_auc binary     0.981     5 0.00244
10           12     10     0 roc_auc binary     0.980     5 0.00266
11           11     14     4 roc_auc binary     0.980     5 0.00191
12           10     10     9 roc_auc binary     0.980     5 0.00284
13            5     12     0 roc_auc binary     0.980     5 0.00274
14            5     10     2 roc_auc binary     0.980     5 0.00261
15            9     14     3 roc_auc binary     0.979     5 0.00243

5.5 Contrast the results

df_deep <- tibble(names=c("true", "random", "UCB", "PI", "EI"),
                  hidden_units=c(df1[df1$mean==max(df1$mean),1, drop=TRUE],
                    df_r1[df_r1$mean==max(df_r1$mean),1, drop=TRUE],
                     df_UCB1[df_UCB1$mean==max(df_UCB1$mean),1, drop=TRUE],
                     df_PI1[df_PI1$mean==max(df_PI1$mean),1, drop=TRUE],
                     df_EI1[df_EI1$mean==max(df_EI1$mean),1, drop=TRUE]),
                  epochs=c(df1[df1$mean==max(df1$mean),2, drop=TRUE],
                    df_r1[df_r1$mean==max(df_r1$mean),2, drop=TRUE],
                     df_UCB1[df_UCB1$mean==max(df_UCB1$mean),2, drop=TRUE],
                     df_PI1[df_PI1$mean==max(df_PI1$mean),2, drop=TRUE],
                     df_EI1[df_EI1$mean==max(df_EI1$mean),2, drop=TRUE]),
                  roc_auc=c(max(df1$mean), max(df_r1$mean), max(df_UCB1$mean),
                       max(df_PI1$mean),max(df_EI1$mean)),
                  std_err=c(df1[df1$mean==max(df1$mean),ncol(df1), drop=TRUE],
                    df_r1[df_r1$mean==max(df_r1$mean),ncol(df_r1), drop=TRUE],
                     df_UCB1[df_UCB1$mean==max(df_UCB1$mean),ncol(df_UCB1), drop=TRUE],
                     df_PI1[df_PI1$mean==max(df_PI1$mean),ncol(df_PI1), drop=TRUE],
                     df_EI1[df_EI1$mean==max(df_EI1$mean),ncol(df_EI1), drop=TRUE]))
             
df_deep

# A tibble: 5 x 5
  names  hidden_units epochs roc_auc std_err
  <chr>         <dbl>  <dbl>   <dbl>   <dbl>
1 true             12     18   0.982 0.00255
2 random           14     11   0.982 0.00230
3 UCB              12     19   0.981 0.00237
4 PI                7     20   0.981 0.00236
5 EI               10     20   0.981 0.00229

ggplot(df_deep[-1, ]%>% arrange(-roc_auc), aes(x=names, y=roc_auc))+
         geom_point(size=2, color= "red")+
         geom_hline(yintercept = df_deep[df_deep$names=="true", 4, drop=TRUE], 
                    color= "blue", lty="dashed")+
         geom_errorbar(aes(ymin=roc_auc-1.92*std_err, ymax=roc_auc+1.92*std_err))

As we see, we do not obtain any significance difference between these methods, which is essentially due to the small size of the data and the small range used for the hyperparameter values. However, with more complex and large data set the difference will be obvious.

6 Conclusion

The main purpose of this article is more to show how implementing these methods in practice rather than highlighting the difference of performance between each other. However, the Bayesian optimization is known as the most efficient method, but with the downside that it also requires defining some hyperparameters like the acquisition function, sometimes arbitrarily. In other words, we cannot waste, say the limited budget, that should be concentrated to search for the optimum, to instead try different acquisition functions.

7 Session info

sessionInfo()

R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] plot3D_1.3       themis_0.1.2     yardstick_0.0.7  workflows_0.2.0 
 [5] tune_0.1.1       tidyr_1.1.2      tibble_3.0.3     rsample_0.0.8   
 [9] recipes_0.1.13   purrr_0.3.4      parsnip_0.1.3    modeldata_0.0.2 
[13] infer_0.5.3      ggplot2_3.3.2    dplyr_1.0.2      dials_0.0.9     
[17] scales_1.1.1     broom_0.7.1      tidymodels_0.1.1 readr_1.3.1     

loaded via a namespace (and not attached):
 [1] lubridate_1.7.9    doParallel_1.0.15  DiceDesign_1.8-1   tools_4.0.1       
 [5] backports_1.1.10   utf8_1.1.4         R6_2.4.1           rpart_4.1-15      
 [9] colorspace_1.4-1   nnet_7.3-14        withr_2.3.0        tidyselect_1.1.0  
[13] compiler_4.0.1     parallelMap_1.5.0  cli_2.0.2          labeling_0.3      
[17] bookdown_0.20      checkmate_2.0.0    stringr_1.4.0      digest_0.6.25     
[21] rmarkdown_2.4      unbalanced_2.0     pkgconfig_2.0.3    htmltools_0.5.0   
[25] lhs_1.1.0          rlang_0.4.7        rstudioapi_0.11    BBmisc_1.11       
[29] FNN_1.1.3          farver_2.0.3       generics_0.0.2     magrittr_1.5      
[33] ROSE_0.0-3         Matrix_1.2-18      Rcpp_1.0.5         munsell_0.5.0     
[37] fansi_0.4.1        GPfit_1.0-8        lifecycle_0.2.0    furrr_0.1.0       
[41] stringi_1.5.3      pROC_1.16.2        yaml_2.2.1         MASS_7.3-53       
[45] plyr_1.8.6         misc3d_0.9-0       grid_4.0.1         parallel_4.0.1    
[49] listenv_0.8.0      crayon_1.3.4       lattice_0.20-41    splines_4.0.1     
[53] hms_0.5.3          knitr_1.30         mlr_2.17.1         pillar_1.4.6      
[57] tcltk_4.0.1        codetools_0.2-16   fastmatch_1.1-0    glue_1.4.2        
[61] evaluate_0.14      ParamHelpers_1.14  blogdown_0.20      data.table_1.13.0 
[65] vctrs_0.3.4        foreach_1.5.0      gtable_0.3.0       RANN_2.6.1        
[69] future_1.19.1      assertthat_0.2.1   xfun_0.18          gower_0.2.2       
[73] prodlim_2019.11.13 class_7.3-17       survival_3.2-7     timeDate_3043.102 
[77] iterators_1.0.12   lava_1.6.8         globals_0.13.0     ellipsis_0.3.1    
[81] ipred_0.9-9

bayes hyperparameters