<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>xgboost | Modeling with R and Python</title>
    <link>https://www.metalesaek.com/tag/xgboost/</link>
      <atom:link href="https://www.metalesaek.com/tag/xgboost/index.xml" rel="self" type="application/rss+xml" />
    <description>xgboost</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Sun, 05 Jan 2020 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://www.metalesaek.com/images/icon_hu0b7a4cb9992c9ac0e91bd28ffd38dd00_9727_512x512_fill_lanczos_center_2.png</url>
      <title>xgboost</title>
      <link>https://www.metalesaek.com/tag/xgboost/</link>
    </image>
    
    <item>
      <title>Xgboost model</title>
      <link>https://www.metalesaek.com/post/xgboost/xgboost/</link>
      <pubDate>Sun, 05 Jan 2020 00:00:00 +0000</pubDate>
      <guid>https://www.metalesaek.com/post/xgboost/xgboost/</guid>
      <description>
&lt;script src=&#34;https://www.metalesaek.com/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-preparation&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-visualization&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Data visualization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-partition&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Data partition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#model-training&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Model training&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#fine-tune-the-hyperparameters&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Fine tune the hyperparameters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#conclusion&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Conclusion:&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#session-information&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8&lt;/span&gt; Session information&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;A decision tree&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; is a model that recursively splits the input space into regions and fits a local model to each resulting region. However, a single decision tree fitted to complex data rarely yields accurate predictions, which is why it is often described as a &lt;a href=&#34;http://rob.schapire.net/papers/strengthofweak.pdf&#34;&gt;weak learner&lt;/a&gt;. Combining multiple decision trees (also called &lt;strong&gt;ensemble models&lt;/strong&gt;) using techniques such as bagging and boosting can greatly improve the model accuracy. The &lt;a href=&#34;https://xgboost.readthedocs.io/en/latest/R-package/index.html&#34;&gt;Xgboost&lt;/a&gt; (short for extreme gradient boosting) model is a tree-based algorithm built on boosting. It can be used for both &lt;strong&gt;classification&lt;/strong&gt; and &lt;strong&gt;regression&lt;/strong&gt;.
In this post we learn how to implement this model to predict survival in the well-known Titanic data, as we did in the previous posts using different kinds of models.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Data preparation&lt;/h1&gt;
&lt;p&gt;We start by loading the required packages and the Titanic data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suppressPackageStartupMessages(library(tidyverse))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suppressPackageStartupMessages(library(caret))
data &amp;lt;- read_csv(&amp;quot;../train.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Parsed with column specification:
## cols(
##   PassengerId = col_double(),
##   Survived = col_double(),
##   Pclass = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   SibSp = col_double(),
##   Parch = col_double(),
##   Ticket = col_character(),
##   Fare = col_double(),
##   Cabin = col_character(),
##   Embarked = col_character()
## )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s take a look at this data using the &lt;strong&gt;dplyr&lt;/strong&gt; function &lt;strong&gt;glimpse&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;glimpse(data)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 891
## Columns: 12
## $ PassengerId &amp;lt;dbl&amp;gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Survived    &amp;lt;dbl&amp;gt; 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
## $ Pclass      &amp;lt;dbl&amp;gt; 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
## $ Name        &amp;lt;chr&amp;gt; &amp;quot;Braund, Mr. Owen Harris&amp;quot;, &amp;quot;Cumings, Mrs. John Bradley ...
## $ Sex         &amp;lt;chr&amp;gt; &amp;quot;male&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;male&amp;quot;, &amp;quot;male&amp;quot;, &amp;quot;...
## $ Age         &amp;lt;dbl&amp;gt; 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 1...
## $ SibSp       &amp;lt;dbl&amp;gt; 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
## $ Parch       &amp;lt;dbl&amp;gt; 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
## $ Ticket      &amp;lt;chr&amp;gt; &amp;quot;A/5 21171&amp;quot;, &amp;quot;PC 17599&amp;quot;, &amp;quot;STON/O2. 3101282&amp;quot;, &amp;quot;113803&amp;quot;, ...
## $ Fare        &amp;lt;dbl&amp;gt; 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
## $ Cabin       &amp;lt;chr&amp;gt; NA, &amp;quot;C85&amp;quot;, NA, &amp;quot;C123&amp;quot;, NA, NA, &amp;quot;E46&amp;quot;, NA, NA, NA, &amp;quot;G6&amp;quot;,...
## $ Embarked    &amp;lt;chr&amp;gt; &amp;quot;S&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;Q&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;S&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;S&amp;quot;, ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For prediction purposes some variables should be removed, such as PassengerId, Name, Ticket, and Cabin, while others should be converted to a more suitable type. The following script performs these transformations; for more detail you can refer to my previous post on logistic regression.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata&amp;lt;-data[,-c(1,4,9,11)]
mydata$Survived&amp;lt;-as.integer(mydata$Survived)
mydata&amp;lt;-modify_at(mydata,c(&amp;quot;Pclass&amp;quot;,&amp;quot;Sex&amp;quot;,&amp;quot;Embarked&amp;quot;,&amp;quot;SibSp&amp;quot;,&amp;quot;Parch&amp;quot;), as.factor)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s check the summary of the transformed data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(mydata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##     Survived      Pclass      Sex           Age        SibSp   Parch  
##  Min.   :0.0000   1:216   female:314   Min.   : 0.42   0:608   0:678  
##  1st Qu.:0.0000   2:184   male  :577   1st Qu.:20.12   1:209   1:118  
##  Median :0.0000   3:491                Median :28.00   2: 28   2: 80  
##  Mean   :0.3838                        Mean   :29.70   3: 16   3:  5  
##  3rd Qu.:1.0000                        3rd Qu.:38.00   4: 18   4:  4  
##  Max.   :1.0000                        Max.   :80.00   5:  5   5:  5  
##                                        NA&amp;#39;s   :177     8:  7   6:  1  
##       Fare        Embarked  
##  Min.   :  0.00   C   :168  
##  1st Qu.:  7.91   Q   : 77  
##  Median : 14.45   S   :644  
##  Mean   : 32.20   NA&amp;#39;s:  2  
##  3rd Qu.: 31.00             
##  Max.   :512.33             
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, the Age variable has 177 missing values and Embarked has 2. There are two strategies for handling missing values: removing the affected rows from the analysis, at the cost of losing a lot of data, or imputing them with one of the available imputation methods. Since the number of missing values is large relative to the total number of examples, the latter strategy is preferable. Fortunately, the &lt;a href=&#34;https://cran.r-project.org/web/packages/mice/mice.pdf&#34;&gt;mice&lt;/a&gt; package is very powerful for this purpose and provides many imputation methods for all variable types.
We will opt for the random forest method since it is often a good choice. However, in order to respect the most important rule in machine learning, never touch the test data during the training process, we will apply this imputation after splitting the data.&lt;/p&gt;
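&lt;p&gt;The rule above can be illustrated with a minimal sketch. It assumes a toy data frame and uses a plain median fill as a simple stand-in for the random forest imputation of mice; the names toy, train_toy, test_toy and age_fill are hypothetical and not part of the analysis. The only point is that the fill-in value is estimated from the training rows and then reused, never re-estimated, on the held-out rows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Toy illustration (not the mice workflow): learn the fill-in value on training rows only.
set.seed(1234)
toy &amp;lt;- data.frame(Age = c(22, 38, NA, 35, NA, 54, 2, 27, 14, NA))
idx &amp;lt;- sample(nrow(toy), 7)                       # 7 of the 10 rows go to training
train_toy &amp;lt;- toy[idx, , drop = FALSE]
test_toy  &amp;lt;- toy[-idx, , drop = FALSE]
age_fill &amp;lt;- median(train_toy$Age, na.rm = TRUE)   # estimated from the training rows only
train_toy$Age[is.na(train_toy$Age)] &amp;lt;- age_fill
test_toy$Age[is.na(test_toy$Age)]   &amp;lt;- age_fill   # reused as is on the held-out rows&lt;/code&gt;&lt;/pre&gt;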
&lt;/div&gt;
&lt;div id=&#34;data-visualization&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Data visualization&lt;/h1&gt;
&lt;p&gt;Besides modeling, tools such as visualization help us investigate the relationships between variables. We can visualize the relationship between each predictor and the target variable using the ggplot2 package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)
ggplot(mydata,aes(Sex,Survived,color=Sex))+
  geom_point()+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://www.metalesaek.com/post/xgboost/xgboost_files/figure-html/unnamed-chunk-6-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The left side of the plot shows that a higher fraction of females survived, whereas the right side shows the reverse for males, most of whom died. We can infer from this plot that, ceteris paribus, this predictor is likely to be relevant for prediction.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(Pclass,Survived,color=Pclass))+
  geom_point()+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://www.metalesaek.com/post/xgboost/xgboost_files/figure-html/unnamed-chunk-7-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In this plot most of the first class passengers survived, in contrast with the third class passengers, most of whom died. The second class, however, seems roughly balanced. Again, this predictor is likely to be relevant.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(SibSp,Survived,color=SibSp))+
  geom_point()+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://www.metalesaek.com/post/xgboost/xgboost_files/figure-html/unnamed-chunk-8-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This predictor refers to the number of siblings and spouses a passenger had aboard. It seems to be similarly distributed across the two values of the target variable, and hence is probably of little relevance. In other words, knowing the number of siblings and spouses of a particular passenger does not help much to predict whether this passenger survived or died.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(Parch,Survived,color=Parch))+
  geom_point()+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://www.metalesaek.com/post/xgboost/xgboost_files/figure-html/unnamed-chunk-9-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This predictor refers to the number of parents and children a passenger had aboard. It seems to be slightly discriminative if we look closely at level 0, that is, passengers with no parents or children aboard.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(Embarked,Survived,color=Embarked))+
  geom_point()+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://www.metalesaek.com/post/xgboost/xgboost_files/figure-html/unnamed-chunk-10-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We see that a passenger who embarked at port &lt;strong&gt;S&lt;/strong&gt; is slightly more likely to have died, while the other ports seem to be roughly balanced.&lt;/p&gt;
&lt;p&gt;For numeric variables we use the empirical density given the target variable as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata[complete.cases(mydata),], aes(Age,fill=as.factor(Survived)))+
  geom_density(alpha=.5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://www.metalesaek.com/post/xgboost/xgboost_files/figure-html/unnamed-chunk-11-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We see significant overlap between the two conditional distributions, which may indicate that this variable is less relevant.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata, aes(Fare,fill=as.factor(Survived)))+
  geom_density(alpha=.5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://www.metalesaek.com/post/xgboost/xgboost_files/figure-html/unnamed-chunk-12-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;For this variable the conditional distributions are different: we see a spike close to zero, reflecting the higher death rate among third class passengers.&lt;/p&gt;
&lt;p&gt;We can also plot two predictors against each other. For instance, let’s try with the two predictors Sex and Pclass:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(mydata,aes(Sex,Pclass,color=as.factor(Survived)))+
  geom_point(col=&amp;quot;green&amp;quot;,pch=16,cex=7)+
  geom_jitter()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://www.metalesaek.com/post/xgboost/xgboost_files/figure-html/unnamed-chunk-13-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The majority of the females who survived (blue points on the left) came from the first and the second class, while the majority of the males who died (red points on the right) came from the third class.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-partition&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Data partition&lt;/h1&gt;
&lt;p&gt;We take 80% of the data as the training set, and the remaining 20% will serve as the testing set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
index&amp;lt;-createDataPartition(mydata$Survived,p=0.8,list=FALSE)
train&amp;lt;-mydata[index,]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: The `i` argument of ``[`()` can&amp;#39;t be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test&amp;lt;-mydata[-index,]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we are ready to impute the missing values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suppressPackageStartupMessages(library(mice))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;mice&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;imput_train&amp;lt;-mice(train,m=3,seed=111, method = &amp;#39;rf&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: Number of logged events: 30&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train2&amp;lt;-complete(imput_train,1)
summary(train2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From this output we see that we do not have missing values any more.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;model-training&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Model training&lt;/h1&gt;
&lt;p&gt;The xgboost model expects the predictors to be of numeric type, so we convert the factors to dummy variables with the help of the &lt;strong&gt;Matrix&lt;/strong&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suppressPackageStartupMessages(library(Matrix))
train_data&amp;lt;-sparse.model.matrix(Survived ~. -1, data=train2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that the -1 added to the formula prevents an intercept column of ones from being added to our data. We can take a look at the structure of the data as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;str(train_data)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Formal class &amp;#39;dgCMatrix&amp;#39; [package &amp;quot;Matrix&amp;quot;] with 6 slots
##   ..@ i       : int [1:3570] 1 3 5 8 17 20 23 24 27 28 ...
##   ..@ p       : int [1:21] 0 178 329 713 1173 1886 2062 2086 2100 2114 ...
##   ..@ Dim     : int [1:2] 713 20
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:713] &amp;quot;1&amp;quot; &amp;quot;2&amp;quot; &amp;quot;3&amp;quot; &amp;quot;4&amp;quot; ...
##   .. ..$ : chr [1:20] &amp;quot;Pclass1&amp;quot; &amp;quot;Pclass2&amp;quot; &amp;quot;Pclass3&amp;quot; &amp;quot;Sexmale&amp;quot; ...
##   ..@ x       : num [1:3570] 1 1 1 1 1 1 1 1 1 1 ...
##   ..@ factors : list()&lt;/code&gt;&lt;/pre&gt;
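&lt;p&gt;To see what the -1 in the formula does, here is a small sketch with the base R function model.matrix, which follows the same formula rules as sparse.model.matrix but returns a dense matrix. The data frame toy and its columns are hypothetical and only serve the illustration.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical toy data: one response, two factors and one numeric predictor.
toy &amp;lt;- data.frame(
  y   = c(0, 1, 1, 0),
  cls = factor(c(&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;, &amp;quot;c&amp;quot;, &amp;quot;a&amp;quot;)),
  sex = factor(c(&amp;quot;m&amp;quot;, &amp;quot;f&amp;quot;, &amp;quot;f&amp;quot;, &amp;quot;m&amp;quot;)),
  age = c(22, 38, 26, 35)
)
with_int &amp;lt;- model.matrix(y ~ ., data = toy)      # keeps the intercept column
no_int   &amp;lt;- model.matrix(y ~ . - 1, data = toy)  # drops the intercept column
colnames(with_int)  # &amp;quot;(Intercept)&amp;quot; &amp;quot;clsb&amp;quot; &amp;quot;clsc&amp;quot; &amp;quot;sexm&amp;quot; &amp;quot;age&amp;quot;
colnames(no_int)    # &amp;quot;clsa&amp;quot; &amp;quot;clsb&amp;quot; &amp;quot;clsc&amp;quot; &amp;quot;sexm&amp;quot; &amp;quot;age&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Without the intercept, the first factor gets one dummy column per level, which is why Pclass1, Pclass2 and Pclass3 all appear in the output above, while the following factors still drop their first level (only Sexmale appears).&lt;/p&gt;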
&lt;p&gt;We know that many machine learning algorithms require the inputs to be in a specific type. The input types supported by xgboost algorithm are: matrix, &lt;strong&gt;dgCMatrix&lt;/strong&gt; object rendered from the above package &lt;strong&gt;Matrix&lt;/strong&gt;, or the xgboost class &lt;strong&gt;xgb.DMatrix&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suppressPackageStartupMessages(library(xgboost))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;xgboost&amp;#39; was built under R version 4.0.2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We should first store the dependent variable in a separate vector, let’s call it &lt;strong&gt;train_label&lt;/strong&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_label&amp;lt;-train$Survived
dim(train_data)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 713  20&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;length(train$Survived)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 713&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we bind the predictors, contained in the train_data , with the train_label vector as &lt;strong&gt;xgb.DMatrix&lt;/strong&gt; object as follows&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_final&amp;lt;-xgb.DMatrix(data = train_data,label=train_label)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To train the model we must provide the input data and set any arguments whose default values we do not want to keep. The main arguments are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;objective: for binary classification we use &lt;strong&gt;binary:logistic&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;eta (default=0.3): The learning rate.&lt;/li&gt;
&lt;li&gt;gamma (default=0): also called min_split_loss, the minimum loss reduction required to split a particular node further.&lt;/li&gt;
&lt;li&gt;max_depth (default=6): the maximum depth of the tree.&lt;/li&gt;
&lt;li&gt;min_child_weight (default=1): the minimum sum of instance weights (hessian) required in a child node; a split that would produce a child below this value is not made. For squared-error regression this amounts to a minimum number of instances per node.&lt;/li&gt;
&lt;li&gt;subsample (default=1): with the default the model uses all the data for each tree; with 0.7, for instance, the model randomly samples 70% of the data at each iteration, which helps to fight overfitting.&lt;/li&gt;
&lt;li&gt;colsample_bytree (default=1, select all columns): subsample ratio of columns used when building each tree.&lt;/li&gt;
&lt;li&gt;nthread: number of CPU threads used in parallel processing; if not set, xgboost uses all available threads.&lt;/li&gt;
&lt;li&gt;nrounds : the number of boosting iterations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can see the full list of parameters by typing &lt;strong&gt;?xgboost&lt;/strong&gt;.&lt;/p&gt;
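&lt;p&gt;To give some intuition for what eta and nrounds control, here is a toy sketch of gradient boosting in base R. It is not the xgboost algorithm (which, among other things, uses second-order gradient information and regularization): it is plain boosting for squared error with one-split trees on a simulated numeric predictor. The helper names fit_stump, predict_stump and boost are hypothetical.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Fit a one-split tree (stump) to the residuals r using a single numeric predictor x.
fit_stump &amp;lt;- function(x, r) {
  xs &amp;lt;- sort(unique(x))
  cuts &amp;lt;- (head(xs, -1) + tail(xs, -1)) / 2   # candidate split points (midpoints)
  sse &amp;lt;- sapply(cuts, function(cut) {
    left &amp;lt;- r[x &amp;lt;= cut]
    right &amp;lt;- r[x &amp;gt; cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best &amp;lt;- cuts[which.min(sse)]
  list(cut = best, left = mean(r[x &amp;lt;= best]), right = mean(r[x &amp;gt; best]))
}
predict_stump &amp;lt;- function(s, x) ifelse(x &amp;lt;= s$cut, s$left, s$right)

# Each round fits a stump to the current residuals and adds a fraction eta of it.
boost &amp;lt;- function(x, y, nrounds = 40, eta = 0.3) {
  pred &amp;lt;- rep(mean(y), length(y))
  mse &amp;lt;- numeric(nrounds)
  for (i in seq_len(nrounds)) {
    s &amp;lt;- fit_stump(x, y - pred)
    pred &amp;lt;- pred + eta * predict_stump(s, x)
    mse[i] &amp;lt;- mean((y - pred)^2)              # training error after round i
  }
  mse
}

set.seed(1)
x &amp;lt;- runif(200)
y &amp;lt;- sin(2 * pi * x) + rnorm(200, sd = 0.2)
mse &amp;lt;- boost(x, y, nrounds = 40, eta = 0.3)
round(mse[c(1, 10, 40)], 4)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this toy setting the training error never increases from one round to the next; a smaller eta makes each step more cautious, so more rounds are needed to reach the same training error.&lt;/p&gt;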
&lt;p&gt;Note that the input data can be fed to the model in two ways:
If the data is of class &lt;strong&gt;xgb.DMatrix&lt;/strong&gt;, which contains both the predictors and the label as in our case, then we do not use the &lt;strong&gt;label&lt;/strong&gt; argument. Otherwise, with any other class, we provide both the data and label arguments.&lt;/p&gt;
&lt;p&gt;Our first attempt uses 40 iterations and the default values for the other arguments.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mymodel &amp;lt;- xgboost(data=train_final, objective = &amp;quot;binary:logistic&amp;quot;,
                   nrounds = 40)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1]  train-error:0.148668 
## [2]  train-error:0.133240 
## [3]  train-error:0.130435 
## [4]  train-error:0.137447 
## [5]  train-error:0.127630 
## [6]  train-error:0.117812 
## [7]  train-error:0.115007 
## [8]  train-error:0.109397 
## [9]  train-error:0.102384 
## [10] train-error:0.103787 
## [11] train-error:0.103787 
## [12] train-error:0.102384 
## [13] train-error:0.100982 
## [14] train-error:0.098177 
## [15] train-error:0.098177 
## [16] train-error:0.096774 
## [17] train-error:0.096774 
## [18] train-error:0.098177 
## [19] train-error:0.093969 
## [20] train-error:0.091164 
## [21] train-error:0.086957 
## [22] train-error:0.085554 
## [23] train-error:0.085554 
## [24] train-error:0.082749 
## [25] train-error:0.082749 
## [26] train-error:0.082749 
## [27] train-error:0.079944 
## [28] train-error:0.075736 
## [29] train-error:0.074334 
## [30] train-error:0.074334 
## [31] train-error:0.072931 
## [32] train-error:0.072931 
## [33] train-error:0.070126 
## [34] train-error:0.070126 
## [35] train-error:0.070126 
## [36] train-error:0.068724 
## [37] train-error:0.067321 
## [38] train-error:0.061711 
## [39] train-error:0.061711 
## [40] train-error:0.063114&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can plot the error rates as follows&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt; mymodel$evaluation_log %&amp;gt;%   
  ggplot(aes(iter, train_error))+
  geom_point()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://www.metalesaek.com/post/xgboost/xgboost_files/figure-html/unnamed-chunk-22-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;To evaluate the model we use the test data, which must go through the same steps as the training data except for the imputation. Since the test set is only used to evaluate the model, we simply remove its rows with missing values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test1 &amp;lt;- test[complete.cases(test),]
test2&amp;lt;-sparse.model.matrix(Survived ~. -1,data=test1)
test_label&amp;lt;-test1$Survived
test_final&amp;lt;-xgb.DMatrix(data = test2, label=test_label)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we use the predict function and the confusionMatrix function from the caret package. Since the predicted values are probabilities, we convert them to predicted classes using a threshold of 0.5 as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(mymodel, test_final)
pred&amp;lt;-ifelse(pred&amp;gt;.5,1,0)
confusionMatrix(as.factor(pred),as.factor(test_label))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 81 13
##          1 11 36
##                                           
##                Accuracy : 0.8298          
##                  95% CI : (0.7574, 0.8878)
##     No Information Rate : 0.6525          
##     P-Value [Acc &amp;gt; NIR] : 2.379e-06       
##                                           
##                   Kappa : 0.6211          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.8383          
##                                           
##             Sensitivity : 0.8804          
##             Specificity : 0.7347          
##          Pos Pred Value : 0.8617          
##          Neg Pred Value : 0.7660          
##              Prevalence : 0.6525          
##          Detection Rate : 0.5745          
##    Detection Prevalence : 0.6667          
##       Balanced Accuracy : 0.8076          
##                                           
##        &amp;#39;Positive&amp;#39; Class : 0               
## &lt;/code&gt;&lt;/pre&gt;
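&lt;p&gt;As a check on how to read this output, the main statistics can be recomputed by hand from the four counts of the confusion matrix. The sketch below uses base R only and copies the counts from the output above; the object names are arbitrary. Note that caret treats class 0 as the positive class here.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Counts copied from the output above (rows: prediction, columns: reference).
cm &amp;lt;- matrix(c(81, 11, 13, 36), nrow = 2,
             dimnames = list(Prediction = c(&amp;quot;0&amp;quot;, &amp;quot;1&amp;quot;), Reference = c(&amp;quot;0&amp;quot;, &amp;quot;1&amp;quot;)))
accuracy    &amp;lt;- sum(diag(cm)) / sum(cm)                            # (81 + 36) / 141
sensitivity &amp;lt;- cm[&amp;quot;0&amp;quot;, &amp;quot;0&amp;quot;] / sum(cm[, &amp;quot;0&amp;quot;])    # 81 / 92
specificity &amp;lt;- cm[&amp;quot;1&amp;quot;, &amp;quot;1&amp;quot;] / sum(cm[, &amp;quot;1&amp;quot;])    # 36 / 49
balanced    &amp;lt;- (sensitivity + specificity) / 2
round(c(accuracy, sensitivity, specificity, balanced), 4)
# 0.8298 0.8804 0.7347 0.8076&lt;/code&gt;&lt;/pre&gt;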
&lt;p&gt;With the default values we obtain a fairly good accuracy of about 83% on the test set. In the next step we fine tune the hyperparameters using &lt;strong&gt;cross validation&lt;/strong&gt; with the help of the caret package.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;fine-tune-the-hyperparameters&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Fine tune the hyperparameters&lt;/h1&gt;
&lt;p&gt;For the hyperparameters we try a grid of values for the above arguments as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;eta: seq(0.2,1,0.2)&lt;/li&gt;
&lt;li&gt;max_depth: seq(2,6,1)&lt;/li&gt;
&lt;li&gt;min_child_weight: 1 (kept fixed in the grid below)&lt;/li&gt;
&lt;li&gt;colsample_bytree : seq(0.6,1,0.1)&lt;/li&gt;
&lt;li&gt;nrounds : c(50,200 ,50)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The resulting grid has 375 rows, each of which is evaluated by cross validation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grid_tune &amp;lt;- expand.grid(
  nrounds = c(50,200,50),
  max_depth = seq(2,6,1),
  eta = seq(0.2,1,0.2),
  gamma = 0,
  min_child_weight = 1,
  colsample_bytree = seq(0.6,1,0.1),
  subsample = 1
  )&lt;/code&gt;&lt;/pre&gt;
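&lt;p&gt;The size of the grid can be checked directly. The sketch below rebuilds the same grid under the hypothetical name grid_check and counts its rows. It also shows that c(50,200,50) is a vector of three values in which 50 appears twice, so only two distinct values of nrounds are actually explored, whereas seq(50,200,50) would give four.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grid_check &amp;lt;- expand.grid(
  nrounds = c(50, 200, 50),
  max_depth = seq(2, 6, 1),
  eta = seq(0.2, 1, 0.2),
  gamma = 0,
  min_child_weight = 1,
  colsample_bytree = seq(0.6, 1, 0.1),
  subsample = 1
)
nrow(grid_check)            # 3 * 5 * 5 * 5 = 375 rows
nrow(unique(grid_check))    # 250 distinct rows, since 50 is repeated in nrounds
length(seq(50, 200, 50))    # 4 values: 50, 100, 150, 200&lt;/code&gt;&lt;/pre&gt;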
&lt;p&gt;Then we use 5-fold cross validation as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;control &amp;lt;- trainControl(
  method = &amp;quot;repeatedcv&amp;quot;,
  number = 5,
  allowParallel = TRUE
)&lt;/code&gt;&lt;/pre&gt;
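&lt;p&gt;As a reminder of what 5-fold cross validation does, the following base R sketch assigns each of the 713 training rows to one of five folds and lists, for each round, how many rows are used for fitting and how many are held out. This only illustrates the idea; caret builds its own folds internally, and the object names here are hypothetical.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
n &amp;lt;- 713                                      # number of training rows in this post
folds &amp;lt;- sample(rep(1:5, length.out = n))     # random fold label for every row
for (k in 1:5) {
  holdout &amp;lt;- which(folds == k)                # rows used for validation in round k
  fit_on  &amp;lt;- which(folds != k)                # rows used for fitting in round k
  cat(&amp;quot;fold&amp;quot;, k, &amp;quot;: fit on&amp;quot;, length(fit_on),
      &amp;quot;rows, validate on&amp;quot;, length(holdout), &amp;quot;rows\n&amp;quot;)
}&lt;/code&gt;&lt;/pre&gt;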
&lt;p&gt;Now, instead of the xgboost function, we use the &lt;strong&gt;train&lt;/strong&gt; function from caret to train the model and we specify the method as &lt;strong&gt;xgbTree&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_data1 &amp;lt;- as.matrix(train_data)
train_label1 &amp;lt;- as.factor(train_label)
#mymodel2 &amp;lt;- train(
#  x = train_data1,
#  y = train_label1,
#  trControl = control,
#  tuneGrid = grid_tune,
#  method = &amp;quot;xgbTree&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This model took several minutes to train, and we do not want it to be rerun every time this document is rendered. That is why I have commented out the above script, saved the results in a csv file, and reloaded that file to continue the analysis. If you would like to run this model you can just uncomment the script.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# results &amp;lt;- mymodel2$results
# write_csv(results, &amp;quot;xgb_results.csv&amp;quot;)
results &amp;lt;- read_csv(&amp;quot;xgb_results.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Parsed with column specification:
## cols(
##   eta = col_double(),
##   max_depth = col_double(),
##   gamma = col_double(),
##   colsample_bytree = col_double(),
##   min_child_weight = col_double(),
##   subsample = col_double(),
##   nrounds = col_double(),
##   Accuracy = col_double(),
##   Kappa = col_double(),
##   AccuracySD = col_double(),
##   KappaSD = col_double()
## )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s now check the best hyperparameter values:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results %&amp;gt;% 
  arrange(-Accuracy) %&amp;gt;% 
  head(5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 x 11
##     eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
##   &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;            &amp;lt;dbl&amp;gt;            &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
## 1   0.2         4     0              0.6                1         1      50
## 2   0.2         6     0              0.6                1         1      50
## 3   0.8         2     0              0.8                1         1      50
## 4   0.4         3     0              0.6                1         1      50
## 5   0.2         3     0              1                  1         1     200
## # ... with 4 more variables: Accuracy &amp;lt;dbl&amp;gt;, Kappa &amp;lt;dbl&amp;gt;, AccuracySD &amp;lt;dbl&amp;gt;,
## #   KappaSD &amp;lt;dbl&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see the highest accuracy rate is about 81.34% with the related hyperparameter values as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results %&amp;gt;% 
  arrange(-Accuracy) %&amp;gt;% 
  head(1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 x 11
##     eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
##   &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;            &amp;lt;dbl&amp;gt;            &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
## 1   0.2         4     0              0.6                1         1      50
## # ... with 4 more variables: Accuracy &amp;lt;dbl&amp;gt;, Kappa &amp;lt;dbl&amp;gt;, AccuracySD &amp;lt;dbl&amp;gt;,
## #   KappaSD &amp;lt;dbl&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we apply these values to the final model using the whole data loaded at the beginning from the train.csv file, and then we load the test.csv file of the Titanic data to submit our predictions to the Kaggle competition.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;imput_mydata&amp;lt;-mice(mydata,m=3,seed=111, method = &amp;#39;rf&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: Number of logged events: 15&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mydata_imp&amp;lt;-complete(imput_mydata,1)
my_data&amp;lt;-sparse.model.matrix(Survived ~. -1, data = mydata_imp)
mydata_label&amp;lt;-mydata$Survived
data_final&amp;lt;-xgb.DMatrix(data = my_data,label=mydata_label)
final_model &amp;lt;- xgboost(data=data_final, objective = &amp;quot;binary:logistic&amp;quot;,
                   nrounds = 50, max_depth = 4, eta = 0.2, gamma = 0,
                   colsample_bytree = 0.6, min_child_weight = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and we get the following result&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred &amp;lt;- predict(mymodel, data_final)
pred&amp;lt;-ifelse(pred&amp;gt;.5,1,0)
confusionMatrix(as.factor(pred),as.factor(mydata_label))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 518  60
##          1  31 282
##                                          
##                Accuracy : 0.8979         
##                  95% CI : (0.8761, 0.917)
##     No Information Rate : 0.6162         
##     P-Value [Acc &amp;gt; NIR] : &amp;lt; 2.2e-16      
##                                          
##                   Kappa : 0.7806         
##                                          
##  Mcnemar&amp;#39;s Test P-Value : 0.003333       
##                                          
##             Sensitivity : 0.9435         
##             Specificity : 0.8246         
##          Pos Pred Value : 0.8962         
##          Neg Pred Value : 0.9010         
##              Prevalence : 0.6162         
##          Detection Rate : 0.5814         
##    Detection Prevalence : 0.6487         
##       Balanced Accuracy : 0.8840         
##                                          
##        &amp;#39;Positive&amp;#39; Class : 0              
## &lt;/code&gt;&lt;/pre&gt;
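&lt;p&gt;As a sanity check, the headline statistics can be recomputed by hand from the four cell counts printed above. This is only a minimal sketch in base R: the counts are copied from the table, and the object names (cm, accuracy, sensitivity, specificity) are chosen here for illustration.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Cell counts copied from the confusion matrix above
# (rows = prediction, columns = reference; the positive class is 0).
cm &amp;lt;- matrix(c(518, 31, 60, 282), nrow = 2,
             dimnames = list(Prediction = c(&amp;quot;0&amp;quot;, &amp;quot;1&amp;quot;),
                             Reference  = c(&amp;quot;0&amp;quot;, &amp;quot;1&amp;quot;)))

accuracy    &amp;lt;- sum(diag(cm)) / sum(cm)                  # (518 + 282) / 891
sensitivity &amp;lt;- cm[&amp;quot;0&amp;quot;, &amp;quot;0&amp;quot;] / sum(cm[, &amp;quot;0&amp;quot;])   # 518 / 549
specificity &amp;lt;- cm[&amp;quot;1&amp;quot;, &amp;quot;1&amp;quot;] / sum(cm[, &amp;quot;1&amp;quot;])   # 282 / 342

# Rounded to 4 digits these give 0.8979, 0.9435 and 0.8246,
# matching the Accuracy, Sensitivity and Specificity lines above.
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 4)&lt;/code&gt;&lt;/pre&gt;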
&lt;p&gt;The accuracy with these values is about 90%. Note that it is measured on the same data used to fit the model, so it is an optimistic estimate.
Now let us apply this model to the test.csv file.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kag&amp;lt;-read_csv(&amp;quot;../test.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Parsed with column specification:
## cols(
##   PassengerId = col_double(),
##   Pclass = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   SibSp = col_double(),
##   Parch = col_double(),
##   Ticket = col_character(),
##   Fare = col_double(),
##   Cabin = col_character(),
##   Embarked = col_character()
## )&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kag1&amp;lt;-kag[,-c(3,8,10)]
kag1 &amp;lt;- modify_at(kag1,c(&amp;quot;Pclass&amp;quot;, &amp;quot;Sex&amp;quot;, &amp;quot;Embarked&amp;quot;, &amp;quot;SibSp&amp;quot;, &amp;quot;Parch&amp;quot;), as.factor)
summary(kag1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   PassengerId     Pclass      Sex           Age        SibSp       Parch    
##  Min.   : 892.0   1:107   female:152   Min.   : 0.17   0:283   0      :324  
##  1st Qu.: 996.2   2: 93   male  :266   1st Qu.:21.00   1:110   1      : 52  
##  Median :1100.5   3:218                Median :27.00   2: 14   2      : 33  
##  Mean   :1100.5                        Mean   :30.27   3:  4   3      :  3  
##  3rd Qu.:1204.8                        3rd Qu.:39.00   4:  4   4      :  2  
##  Max.   :1309.0                        Max.   :76.00   5:  1   9      :  2  
##                                        NA&amp;#39;s   :86      8:  2   (Other):  2  
##       Fare         Embarked
##  Min.   :  0.000   C:102   
##  1st Qu.:  7.896   Q: 46   
##  Median : 14.454   S:270   
##  Mean   : 35.627           
##  3rd Qu.: 31.500           
##  Max.   :512.329           
##  NA&amp;#39;s   :1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We have 86 missing values for Age and one for Fare. We borrow an idea from a kaggler named &lt;strong&gt;Harrison Tietze&lt;/strong&gt;, who suggested treating passengers with missing values as likely to have died; for instance, he replaced the missing ages by the mean age of the passengers who died in the train data. We go even further and label every row with missing values as not survived.&lt;br /&gt;
Additionally, the summary above shows an extra level (9) in the factor &lt;strong&gt;Parch&lt;/strong&gt; that does not exist in the train data, so the model cannot handle it. Since this level has only two cases, we approximate it by the closest level, which is 6, and then drop level 9 from the factor.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kag1$Parch[kag1$Parch==9]&amp;lt;-6
kag1$Parch &amp;lt;- kag1$Parch %&amp;gt;% forcats::fct_drop()
kag_died &amp;lt;- kag1[!complete.cases(kag1),]
kag2 &amp;lt;- kag1[complete.cases(kag1),]&lt;/code&gt;&lt;/pre&gt;
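&lt;p&gt;A level mismatch like the one handled above can also be detected programmatically. The following is a minimal sketch on toy factors; train_parch and test_parch are hypothetical stand-ins, not the real Titanic columns. Note that, instead of fct_drop, it rebuilds the test factor on the training levels, which is a slightly different technique that also keeps training levels absent from the test data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Toy factors standing in for Parch in the train and test data
# (hypothetical values chosen for illustration).
train_parch &amp;lt;- factor(c(0, 1, 2, 6, 0))
test_parch  &amp;lt;- factor(c(0, 9, 1, 6, 9))

# Levels present in the test data but never seen in training
unseen &amp;lt;- setdiff(levels(test_parch), levels(train_parch))
unseen

# Recode the unseen level to the closest training level, then rebuild
# the factor on the training levels so both factors share one level set.
test_chr &amp;lt;- as.character(test_parch)
test_chr[test_chr %in% unseen] &amp;lt;- &amp;quot;6&amp;quot;
test_parch &amp;lt;- factor(test_chr, levels = levels(train_parch))
table(test_parch)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Sharing one level set matters because the dummy columns of the model matrix are derived from the factor levels, so the train and test matrices should then line up column by column.&lt;/p&gt;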
&lt;p&gt;Only the complete rows in kag2 are passed to the model for prediction; the rows in kag_died are set aside and labelled as not survived later.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;DP&amp;lt;-sparse.model.matrix(PassengerId~.-1,data=kag2)
head(DP)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 6 x 20 sparse Matrix of class &amp;quot;dgCMatrix&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    [[ suppressing 20 column names &amp;#39;Pclass1&amp;#39;, &amp;#39;Pclass2&amp;#39;, &amp;#39;Pclass3&amp;#39; ... ]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##                                                   
## 1 . . 1 1 34.5 . . . . . . . . . . . .  7.8292 1 .
## 2 . . 1 . 47.0 1 . . . . . . . . . . .  7.0000 . 1
## 3 . 1 . 1 62.0 . . . . . . . . . . . .  9.6875 1 .
## 4 . . 1 1 27.0 . . . . . . . . . . . .  8.6625 . 1
## 5 . . 1 . 22.0 1 . . . . . 1 . . . . . 12.2875 . 1
## 6 . . 1 1 14.0 . . . . . . . . . . . .  9.2250 . 1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predkag&amp;lt;-predict(final_model,DP)
head(predkag)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.10634395 0.17170778 0.09650294 0.12390183 0.60250586 0.11714594&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As we see, the output is the predicted probability for each instance, so we should convert these probabilities to class labels:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predkag&amp;lt;-ifelse(predkag&amp;gt;.5,1,0)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we first cbind PassengerId with the predicted labels, named Survived; next we rbind the result with the set kag_died, whose rows are all labelled 0:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predkag2K&amp;lt;-cbind(kag2[,1],Survived=predkag)
kag_died$Survived &amp;lt;- 0
predtestk &amp;lt;- rbind(predkag2K,kag_died[, c(1,9)])&lt;/code&gt;&lt;/pre&gt;
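&lt;p&gt;The same split, predict and recombine pattern can be sketched on a self-contained toy example. Everything here is hypothetical: the data frame toy, the made-up probabilities in prob, and the final ordering by PassengerId are illustrative assumptions, not part of the original workflow.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical toy data: six passengers, two with a missing predictor
toy &amp;lt;- data.frame(PassengerId = 1:6, Age = c(22, NA, 35, 41, NA, 19))
complete_rows &amp;lt;- toy[complete.cases(toy), ]
missing_rows  &amp;lt;- toy[!complete.cases(toy), ]

# Made-up numbers standing in for model probabilities on the complete rows
prob &amp;lt;- c(0.10, 0.60, 0.45, 0.90)
predicted &amp;lt;- data.frame(PassengerId = complete_rows$PassengerId,
                        Survived = as.integer(prob &amp;gt; 0.5))

# Rows with missing values get the default label 0
defaulted &amp;lt;- data.frame(PassengerId = missing_rows$PassengerId,
                        Survived = 0L)

# Stack both parts and restore the original passenger order
submission &amp;lt;- rbind(predicted, defaulted)
submission &amp;lt;- submission[order(submission$PassengerId), ]
submission&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before writing the file, it is worth checking that the combined table has one row per passenger of the test data and no missing labels.&lt;/p&gt;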
&lt;p&gt;Finally, we save the result as a csv file to submit it to kaggle and check our rank:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;write_csv(predtestk,&amp;quot;predxgbkag.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Conclusion:&lt;/h1&gt;
&lt;p&gt;Xgboost is one of the most powerful machine learning algorithms nowadays, thanks to its ability to predict a wide range of data from various domains. Several winning solutions in &lt;strong&gt;kaggle&lt;/strong&gt; competitions and elsewhere have been built on this model. It can handle large and complex data with ease. Its large number of hyperparameters gives the modeler many possibilities to tune the model to the data at hand, as well as to address other problems such as overfitting, feature selection, etc.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;session-information&#34; class=&#34;section level1&#34; number=&#34;8&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;8&lt;/span&gt; Session information&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessionInfo()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## R version 4.0.1 (2020-06-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] xgboost_1.2.0.1 Matrix_1.2-18   mice_3.11.0     caret_6.0-86   
##  [5] lattice_0.20-41 forcats_0.5.0   stringr_1.4.0   dplyr_1.0.2    
##  [9] purrr_0.3.4     readr_1.3.1     tidyr_1.1.2     tibble_3.0.3   
## [13] ggplot2_3.3.2   tidyverse_1.3.0
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-149         fs_1.5.0             lubridate_1.7.9     
##  [4] httr_1.4.2           tools_4.0.1          backports_1.1.10    
##  [7] utf8_1.1.4           R6_2.4.1             rpart_4.1-15        
## [10] DBI_1.1.0            colorspace_1.4-1     nnet_7.3-14         
## [13] withr_2.3.0          tidyselect_1.1.0     compiler_4.0.1      
## [16] cli_2.0.2            rvest_0.3.6          xml2_1.3.2          
## [19] labeling_0.3         bookdown_0.20        scales_1.1.1        
## [22] randomForest_4.6-14  digest_0.6.25        rmarkdown_2.4       
## [25] pkgconfig_2.0.3      htmltools_0.5.0      dbplyr_1.4.4        
## [28] rlang_0.4.7          readxl_1.3.1         rstudioapi_0.11     
## [31] generics_0.0.2       farver_2.0.3         jsonlite_1.7.1      
## [34] ModelMetrics_1.2.2.2 magrittr_1.5         Rcpp_1.0.5          
## [37] munsell_0.5.0        fansi_0.4.1          lifecycle_0.2.0     
## [40] stringi_1.5.3        pROC_1.16.2          yaml_2.2.1          
## [43] MASS_7.3-53          plyr_1.8.6           recipes_0.1.13      
## [46] grid_4.0.1           blob_1.2.1           crayon_1.3.4        
## [49] haven_2.3.1          splines_4.0.1        hms_0.5.3           
## [52] knitr_1.30           pillar_1.4.6         reshape2_1.4.4      
## [55] codetools_0.2-16     stats4_4.0.1         reprex_0.3.0        
## [58] glue_1.4.2           evaluate_0.14        blogdown_0.20       
## [61] data.table_1.13.0    modelr_0.1.8         vctrs_0.3.4         
## [64] foreach_1.5.0        cellranger_1.1.0     gtable_0.3.0        
## [67] assertthat_0.2.1     xfun_0.18            gower_0.2.2         
## [70] prodlim_2019.11.13   broom_0.7.1          e1071_1.7-3         
## [73] class_7.3-17         survival_3.2-7       timeDate_3043.102   
## [76] iterators_1.0.12     lava_1.6.8           ellipsis_0.3.1      
## [79] ipred_0.9-9&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;Kevin P. Murphy (2012)&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
