<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Titanic data | Modeling with R and Python</title>
    <link>https://www.metalesaek.com/tag/titanic-data/</link>
      <atom:link href="https://www.metalesaek.com/tag/titanic-data/index.xml" rel="self" type="application/rss+xml" />
    <description>Titanic data</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Mon, 16 Dec 2019 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://www.metalesaek.com/images/icon_hu0b7a4cb9992c9ac0e91bd28ffd38dd00_9727_512x512_fill_lanczos_center_2.png</url>
      <title>Titanic data</title>
      <link>https://www.metalesaek.com/tag/titanic-data/</link>
    </image>
    
    <item>
      <title>knn model</title>
      <link>https://www.metalesaek.com/post/2015-07-23-r-rmarkdown/</link>
      <pubDate>Mon, 16 Dec 2019 00:00:00 +0000</pubDate>
      <guid>https://www.metalesaek.com/post/2015-07-23-r-rmarkdown/</guid>
      <description>
&lt;script src=&#34;https://www.metalesaek.com/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#classification&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;2&lt;/span&gt; Classification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-partition&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;3&lt;/span&gt; Data partition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#train-the-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;4&lt;/span&gt; Train the model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#prediction-and-confusion-matrix&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;5&lt;/span&gt; Prediction and confusion matrix&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#fine-tuning-the-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;6&lt;/span&gt; Fine tuning the model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#comparison-between-knn-and-svm-model&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;7&lt;/span&gt; Comparison between knn and svm model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#regression&#34;&gt;&lt;span class=&#34;toc-section-number&#34;&gt;8&lt;/span&gt; Regression&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;style type=&#34;text/css&#34;&gt;
strong {
  color: Navy;
}

h1,h2, h3, h4 {
  font-size:28px;
  color:DarkBlue;
}
&lt;/style&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34; number=&#34;1&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;1&lt;/span&gt; Introduction&lt;/h1&gt;
&lt;p&gt;In this paper we explore the &lt;strong&gt;k nearest neighbors&lt;/strong&gt; model using two data sets. The first is the &lt;strong&gt;Titanic&lt;/strong&gt; data, to which we fit the model for classification; the second is the &lt;strong&gt;BostonHousing&lt;/strong&gt; data (from the &lt;strong&gt;mlbench&lt;/strong&gt; package), which we use to fit a regression model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;classification&#34; class=&#34;section level1&#34; number=&#34;2&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;2&lt;/span&gt; Classification&lt;/h1&gt;
&lt;p&gt;We do not repeat the whole process of data preparation and missing-value imputation. You can click &lt;a href=&#34;https://github.com/Metalesaek/svm-model&#34;&gt;here&lt;/a&gt; to see all the details in my paper about the &lt;strong&gt;support vector machine&lt;/strong&gt; model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-partition&#34; class=&#34;section level1&#34; number=&#34;3&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;3&lt;/span&gt; Data partition&lt;/h1&gt;
&lt;p&gt;All the code for the first steps is grouped in one chunk. Note that we use the same parameter values and seed numbers so that we can compare the results of the two models, &lt;strong&gt;svm&lt;/strong&gt; and &lt;strong&gt;knn&lt;/strong&gt;, for &lt;strong&gt;classification&lt;/strong&gt; (using the Titanic data) and for regression (using the BostonHousing data).&lt;/p&gt;
&lt;p&gt;This plot shows how the knn model works. With k=5 the model chooses the 5 closest points, inside the dashed circle, and hence the blue point will be predicted to be red by majority vote (3 red and 2 black). With k=9 the neighborhood grows to the solid circle and the blue point will be predicted to be black (5 black and 4 red).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(plotrix)
plot(train$Age[10:40],pch=16,train$Fare[10:40],
     col=train$Survived,ylim = c(0,50))
points(x=32,y=20,col=&amp;quot;blue&amp;quot;,pch=8)
draw.circle(x=32,y=20,nv=1000,radius = 5.5,lty=2)
draw.circle(x=32,y=20,nv=1000,radius = 10)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://www.metalesaek.com/post/2015-07-23-r-rmarkdown_files/figure-html/unnamed-chunk-3-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
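&lt;p&gt;The majority vote described above can be sketched in a few lines of base R. This is only an illustration on made-up points, not the Titanic data and not the code that &lt;strong&gt;caret&lt;/strong&gt; runs internally; the function name &lt;code&gt;knn_vote&lt;/code&gt; and the toy coordinates are assumptions introduced for the example.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Toy illustration of the majority vote (hypothetical points, not the Titanic data)
knn_vote &amp;lt;- function(train_x, train_y, query, k) {
  # Euclidean distance from the query point to every training point
  d &amp;lt;- sqrt(rowSums(sweep(train_x, 2, query)^2))
  # labels of the k closest points
  nearest &amp;lt;- train_y[order(d)[seq_len(k)]]
  # most frequent label among the neighbours (ties go to the first label)
  names(which.max(table(nearest)))
}

train_x &amp;lt;- rbind(c(1, 1), c(1, 2), c(2, 1), c(5, 5), c(5, 6), c(6, 5), c(6, 6))
train_y &amp;lt;- c(&amp;quot;red&amp;quot;, &amp;quot;red&amp;quot;, &amp;quot;red&amp;quot;, &amp;quot;black&amp;quot;, &amp;quot;black&amp;quot;, &amp;quot;black&amp;quot;, &amp;quot;black&amp;quot;)

knn_vote(train_x, train_y, query = c(2, 2), k = 3)  # &amp;quot;red&amp;quot;: the 3 closest points are all red
knn_vote(train_x, train_y, query = c(2, 2), k = 7)  # &amp;quot;black&amp;quot;: 4 black against 3 red&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As in the plot, the predicted class of the same point changes once k is large enough to pull in the other group.&lt;/p&gt;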
&lt;p&gt;The last steps before training the model are to convert the factors to numeric variables and to standardize all the predictors in both sets (train and test), using the centering and scaling values estimated on the training set. Finally, we rename the levels of the target variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# funs() is deprecated since dplyr 0.8.0; the function can be passed directly
train1 &amp;lt;- train %&amp;gt;% mutate_at(c(2,3,8), as.numeric)
test1 &amp;lt;- test %&amp;gt;% mutate_at(c(2,3,8), as.numeric)

processed&amp;lt;-preProcess(train1[,-1],method = c(&amp;quot;center&amp;quot;,&amp;quot;scale&amp;quot;))
train1[,-1]&amp;lt;-predict(processed,train1[,-1])
test1[,-1]&amp;lt;-predict(processed,test1[,-1])

train1$Survived &amp;lt;- fct_recode(train1$Survived,died=&amp;quot;0&amp;quot;,surv=&amp;quot;1&amp;quot;)
test1$Survived &amp;lt;- fct_recode(test1$Survived,died=&amp;quot;0&amp;quot;,surv=&amp;quot;1&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
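&lt;p&gt;The idea behind centering and scaling both sets with values estimated on the training set can be sketched in base R as follows. The numbers are made up for illustration, and this is a simplified view of what &lt;strong&gt;preProcess&lt;/strong&gt; does for a single numeric predictor, not its actual implementation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical values for one predictor (not the Titanic data)
train_x &amp;lt;- c(10, 20, 30, 40)
test_x  &amp;lt;- c(25, 50)

mu    &amp;lt;- mean(train_x)  # training mean, here 25
sigma &amp;lt;- sd(train_x)    # training standard deviation

train_z &amp;lt;- (train_x - mu) / sigma
test_z  &amp;lt;- (test_x - mu) / sigma  # the test set reuses the training mean and sd

mean(train_z)  # 0 up to rounding error
sd(train_z)    # 1
test_z[1]      # 0, because 25 equals the training mean&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reusing the training statistics keeps information from the test set out of the preprocessing step, and it puts all predictors on a comparable scale before distances are computed.&lt;/p&gt;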
&lt;/div&gt;
&lt;div id=&#34;train-the-model&#34; class=&#34;section level1&#34; number=&#34;4&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;4&lt;/span&gt; Train the model&lt;/h1&gt;
&lt;p&gt;The big advantage of the &lt;strong&gt;k nearest neighbors&lt;/strong&gt; model is that it has a single parameter, which makes the tuning process very fast. Here again we use the same seed as we did with the &lt;strong&gt;svm&lt;/strong&gt; model. For the resampling process we stick with the default bootstrap method with 25 resampling iterations.&lt;/p&gt;
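&lt;p&gt;As a rough sketch of what one bootstrap iteration looks like, the base R lines below draw a resample of the same size as the training set (714 rows here) with replacement, and collect the rows that were never drawn, which can serve as a hold-out set for that iteration. The seed and the variable names are assumptions made for the example; &lt;strong&gt;caret&lt;/strong&gt; manages its own resampling indices.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)  # arbitrary seed, only for this illustration
n &amp;lt;- 714   # number of training rows

# one bootstrap resample: n row indices drawn with replacement
idx &amp;lt;- sample(n, size = n, replace = TRUE)

# rows never drawn in this resample (out-of-bag rows)
oob &amp;lt;- setdiff(seq_len(n), idx)

length(idx)      # same size as the training set
length(oob) / n  # share of rows left out, typically a bit more than one third&lt;/code&gt;&lt;/pre&gt;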
&lt;p&gt;Let’s now launch the model and get the summary.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
modelknn &amp;lt;- train(Survived~., data=train1,
                method=&amp;quot;knn&amp;quot;,
                tuneGrid=expand.grid(k=1:30))
modelknn&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## k-Nearest Neighbors 
## 
## 714 samples
##   7 predictor
##   2 classes: &amp;#39;died&amp;#39;, &amp;#39;surv&amp;#39; 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 714, 714, 714, 714, 714, 714, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.7717650  0.5165447
##    2  0.7688433  0.5088538
##    3  0.7820906  0.5370428
##    4  0.7881072  0.5487894
##    5  0.8003926  0.5733224
##    6  0.7992870  0.5711806
##    7  0.8046907  0.5827968
##    8  0.8104254  0.5950159
##    9  0.8093172  0.5927121
##   10  0.8098395  0.5937574
##   11  0.8110456  0.5957105
##   12  0.8103966  0.5942937
##   13  0.8100784  0.5939193
##   14  0.8115080  0.5960496
##   15  0.8146848  0.6026109
##   16  0.8125027  0.5979064
##   17  0.8147065  0.6015528
##   18  0.8142485  0.6002677
##   19  0.8146543  0.6003686
##   20  0.8124733  0.5960520
##   21  0.8100367  0.5906732
##   22  0.8102084  0.5893078
##   23  0.8094241  0.5873995
##   24  0.8103509  0.5891549
##   25  0.8106517  0.5895533
##   26  0.8116000  0.5909129
##   27  0.8090177  0.5853052
##   28  0.8102358  0.5882055
##   29  0.8114371  0.5905057
##   30  0.8127604  0.5937279
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 17.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The metric used to select the best parameter value is the &lt;strong&gt;accuracy&lt;/strong&gt; rate, for which the best value is about 81.47%, obtained at k=17. We can also read these values from the plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(modelknn)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://www.metalesaek.com/post/2015-07-23-r-rmarkdown_files/figure-html/unnamed-chunk-6-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;For the contributions of the predictors, the importance measure, scaled from 0 to 100, shows that the most important one is by far &lt;strong&gt;Sex&lt;/strong&gt;, followed by &lt;strong&gt;Fare&lt;/strong&gt; and &lt;strong&gt;Pclass&lt;/strong&gt;, and that the least important one is &lt;strong&gt;SibSp&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;varImp(modelknn)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ROC curve variable importance
## 
##          Importance
## Sex         100.000
## Fare         62.476
## Pclass       57.192
## Embarked     17.449
## Parch        17.045
## Age           4.409
## SibSp         0.000&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;prediction-and-confusion-matrix&#34; class=&#34;section level1&#34; number=&#34;5&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;5&lt;/span&gt; Prediction and confusion matrix&lt;/h1&gt;
&lt;p&gt;Let’s now use the test set to evaluate the model performance.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-predict(modelknn,test1)
confusionMatrix(as.factor(pred),as.factor(test1$Survived))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction died surv
##       died   99   26
##       surv   10   42
##                                           
##                Accuracy : 0.7966          
##                  95% CI : (0.7297, 0.8533)
##     No Information Rate : 0.6158          
##     P-Value [Acc &amp;gt; NIR] : 1.87e-07        
##                                           
##                   Kappa : 0.5503          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.01242         
##                                           
##             Sensitivity : 0.9083          
##             Specificity : 0.6176          
##          Pos Pred Value : 0.7920          
##          Neg Pred Value : 0.8077          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5593          
##    Detection Prevalence : 0.7062          
##       Balanced Accuracy : 0.7630          
##                                           
##        &amp;#39;Positive&amp;#39; Class : died            
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see that the accuracy has slightly decreased, from the resampling estimate of 81.47% to 79.66% on the test set. The closeness of these rates is a good sign that we do not face an &lt;strong&gt;overfitting&lt;/strong&gt; problem.&lt;/p&gt;
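&lt;p&gt;To make the reported statistics concrete, the following base R sketch recomputes the accuracy, sensitivity and specificity by hand from the four counts of the confusion matrix above, with &lt;code&gt;died&lt;/code&gt; as the positive class. The object names are chosen only for this example.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm &amp;lt;- matrix(c(99, 10, 26, 42), nrow = 2,
             dimnames = list(Prediction = c(&amp;quot;died&amp;quot;, &amp;quot;surv&amp;quot;),
                             Reference  = c(&amp;quot;died&amp;quot;, &amp;quot;surv&amp;quot;)))

accuracy    &amp;lt;- sum(diag(cm)) / sum(cm)                       # (99 + 42) / 177
sensitivity &amp;lt;- cm[&amp;quot;died&amp;quot;, &amp;quot;died&amp;quot;] / sum(cm[, &amp;quot;died&amp;quot;])  # 99 / 109
specificity &amp;lt;- cm[&amp;quot;surv&amp;quot;, &amp;quot;surv&amp;quot;] / sum(cm[, &amp;quot;surv&amp;quot;])  # 42 / 68

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
# accuracy 0.7966, sensitivity 0.9083, specificity 0.6176&lt;/code&gt;&lt;/pre&gt;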
&lt;/div&gt;
&lt;div id=&#34;fine-tuning-the-model&#34; class=&#34;section level1&#34; number=&#34;6&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;6&lt;/span&gt; Fine tuning the model&lt;/h1&gt;
&lt;p&gt;To seek improvements we can change the metric. The &lt;strong&gt;twoClassSummary&lt;/strong&gt; function reports three important metrics for each resampling iteration: &lt;strong&gt;sensitivity&lt;/strong&gt;, &lt;strong&gt;specificity&lt;/strong&gt; and the area under the &lt;strong&gt;ROC&lt;/strong&gt; curve. We keep the same grid search for the number of neighbors, from 1 to 30.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;control &amp;lt;- trainControl(classProbs = TRUE,
                        summaryFunction = twoClassSummary)

set.seed(123)
modelknn1 &amp;lt;- train(Survived~., data=train1,
                method = &amp;quot;knn&amp;quot;,
                trControl = control,
                tuneGrid = expand.grid(k=1:30))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning in train.default(x, y, weights = w, ...): The metric &amp;quot;Accuracy&amp;quot; was not
## in the result set. ROC will be used instead.&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modelknn1&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## k-Nearest Neighbors 
## 
## 714 samples
##   7 predictor
##   2 classes: &amp;#39;died&amp;#39;, &amp;#39;surv&amp;#39; 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 714, 714, 714, 714, 714, 714, ... 
## Resampling results across tuning parameters:
## 
##   k   ROC        Sens       Spec     
##    1  0.7637394  0.8092152  0.7114938
##    2  0.7959615  0.8102352  0.7013654
##    3  0.8212495  0.8217986  0.7180595
##    4  0.8351414  0.8302266  0.7201146
##    5  0.8455418  0.8448702  0.7283368
##    6  0.8543141  0.8441066  0.7269378
##    7  0.8564044  0.8477382  0.7350766
##    8  0.8590356  0.8526960  0.7421475
##    9  0.8617600  0.8511745  0.7414201
##   10  0.8611361  0.8512356  0.7424516
##   11  0.8621287  0.8546357  0.7399914
##   12  0.8633050  0.8542288  0.7392237
##   13  0.8647328  0.8526082  0.7407331
##   14  0.8656300  0.8572596  0.7369673
##   15  0.8663956  0.8612937  0.7388392
##   16  0.8657711  0.8595923  0.7359633
##   17  0.8658168  0.8652505  0.7322408
##   18  0.8659659  0.8657088  0.7301132
##   19  0.8667079  0.8685106  0.7261585
##   20  0.8668361  0.8657052  0.7252522
##   21  0.8673051  0.8641660  0.7212182
##   22  0.8672610  0.8701453  0.7118060
##   23  0.8675945  0.8703195  0.7094977
##   24  0.8677684  0.8724153  0.7087639
##   25  0.8681884  0.8733028  0.7080003
##   26  0.8681201  0.8768128  0.7048740
##   27  0.8680570  0.8748635  0.7011357
##   28  0.8685130  0.8745234  0.7047600
##   29  0.8686459  0.8756557  0.7055821
##   30  0.8681316  0.8754088  0.7094507
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 29.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This time we use the area under the &lt;strong&gt;ROC&lt;/strong&gt; curve to choose the best model, which gives a different value, k=29, with an area of about 0.8686.&lt;/p&gt;
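&lt;p&gt;The area under the ROC curve can be read as the proportion of (positive, negative) pairs in which the positive case gets the higher score. The base R sketch below computes it with the rank-based formula on made-up scores; the function name &lt;code&gt;auc_rank&lt;/code&gt; and the data are assumptions for illustration, and this is not necessarily how &lt;strong&gt;caret&lt;/strong&gt; computes the value internally.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical scores (not model output): a higher score means
# the case looks more like the positive class
auc_rank &amp;lt;- function(score, is_positive) {
  r     &amp;lt;- rank(score)  # average ranks are used for tied scores
  n_pos &amp;lt;- sum(is_positive)
  n_neg &amp;lt;- sum(!is_positive)
  # rank-sum of the positives, shifted and scaled to lie between 0 and 1
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

score       &amp;lt;- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)
is_positive &amp;lt;- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

auc_rank(score, is_positive)  # 8/9: in 8 of the 9 pairs the positive case scores higher&lt;/code&gt;&lt;/pre&gt;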
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-predict(modelknn1,test1)
confusionMatrix(pred,test1$Survived)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction died surv
##       died   99   29
##       surv   10   39
##                                           
##                Accuracy : 0.7797          
##                  95% CI : (0.7113, 0.8384)
##     No Information Rate : 0.6158          
##     P-Value [Acc &amp;gt; NIR] : 2.439e-06       
##                                           
##                   Kappa : 0.5085          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.003948        
##                                           
##             Sensitivity : 0.9083          
##             Specificity : 0.5735          
##          Pos Pred Value : 0.7734          
##          Neg Pred Value : 0.7959          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5593          
##    Detection Prevalence : 0.7232          
##       Balanced Accuracy : 0.7409          
##                                           
##        &amp;#39;Positive&amp;#39; Class : died            
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using the &lt;strong&gt;ROC&lt;/strong&gt; metric we get a worse result for the test accuracy rate, which has decreased from 79.66% to 77.97%.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;comparison-between-knn-and-svm-model&#34; class=&#34;section level1&#34; number=&#34;7&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;7&lt;/span&gt; Comparison between knn and svm model&lt;/h1&gt;
&lt;p&gt;Now let’s train an svm model with the same resampling method and compare the two models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;control&amp;lt;-trainControl(method=&amp;quot;boot&amp;quot;,number=25,
                      classProbs = TRUE,
                      summaryFunction = twoClassSummary)

modelsvm&amp;lt;-train(Survived~., data=train1,
                method=&amp;quot;svmRadial&amp;quot;,
                trControl=control)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning in train.default(x, y, weights = w, ...): The metric &amp;quot;Accuracy&amp;quot; was not
## in the result set. ROC will be used instead.&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;modelsvm&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Support Vector Machines with Radial Basis Function Kernel 
## 
## 714 samples
##   7 predictor
##   2 classes: &amp;#39;died&amp;#39;, &amp;#39;surv&amp;#39; 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 714, 714, 714, 714, 714, 714, ... 
## Resampling results across tuning parameters:
## 
##   C     ROC        Sens       Spec     
##   0.25  0.8703474  0.8735475  0.7602162
##   0.50  0.8706929  0.8858278  0.7456306
##   1.00  0.8655619  0.8941179  0.7327856
## 
## Tuning parameter &amp;#39;sigma&amp;#39; was held constant at a value of 0.2282701
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.2282701 and C = 0.5.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And let’s get the confusion matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-predict(modelsvm,test1)
confusionMatrix(pred,test1$Survived)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Confusion Matrix and Statistics
## 
##           Reference
## Prediction died surv
##       died  101   27
##       surv    8   41
##                                           
##                Accuracy : 0.8023          
##                  95% CI : (0.7359, 0.8582)
##     No Information Rate : 0.6158          
##     P-Value [Acc &amp;gt; NIR] : 7.432e-08       
##                                           
##                   Kappa : 0.5589          
##                                           
##  Mcnemar&amp;#39;s Test P-Value : 0.002346        
##                                           
##             Sensitivity : 0.9266          
##             Specificity : 0.6029          
##          Pos Pred Value : 0.7891          
##          Neg Pred Value : 0.8367          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5706          
##    Detection Prevalence : 0.7232          
##       Balanced Accuracy : 0.7648          
##                                           
##        &amp;#39;Positive&amp;#39; Class : died            
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see that the accuracy of this model, 80.23%, is higher than that of the knn model, 77.97% (&lt;strong&gt;modelknn1&lt;/strong&gt;).
If we have a large number of models to compare, &lt;strong&gt;caret&lt;/strong&gt; provides a function called &lt;strong&gt;resamples&lt;/strong&gt; for this purpose, but the models should have the same trainControl parameter values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;comp&amp;lt;-resamples(list( svm = modelsvm,
                         knn = modelknn1))

summary(comp)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## summary.resamples(object = comp)
## 
## Models: svm, knn 
## Number of resamples: 25 
## 
## ROC 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA&amp;#39;s
## svm 0.8472858 0.8617944 0.8691093 0.8706929 0.8744979 0.9043001    0
## knn 0.8298966 0.8577167 0.8670815 0.8686459 0.8792487 0.9135638    0
## 
## Sens 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA&amp;#39;s
## svm 0.8117647 0.8666667 0.8870056 0.8858278 0.9030303 0.9559748    0
## knn 0.8266667 0.8523490 0.8816568 0.8756557 0.8950617 0.9117647    0
## 
## Spec 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA&amp;#39;s
## svm 0.6774194 0.7096774 0.7428571 0.7456306 0.7714286 0.8425926    0
## knn 0.5865385 0.6741573 0.6989247 0.7055821 0.7252747 0.8191489    0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also plot the models’ metric values together.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dotplot(comp,metric=&amp;quot;ROC&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://www.metalesaek.com/post/2015-07-23-r-rmarkdown_files/figure-html/unnamed-chunk-14-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;regression&#34; class=&#34;section level1&#34; number=&#34;8&#34;&gt;
&lt;h1&gt;&lt;span class=&#34;header-section-number&#34;&gt;8&lt;/span&gt; Regression&lt;/h1&gt;
&lt;p&gt;First we load the &lt;strong&gt;BostonHousing&lt;/strong&gt; data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(mlbench)
data(&amp;quot;BostonHousing&amp;quot;)
glimpse(BostonHousing)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Rows: 506
## Columns: 14
## $ crim    &amp;lt;dbl&amp;gt; 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.088...
## $ zn      &amp;lt;dbl&amp;gt; 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5...
## $ indus   &amp;lt;dbl&amp;gt; 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87,...
## $ chas    &amp;lt;fct&amp;gt; 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nox     &amp;lt;dbl&amp;gt; 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.5...
## $ rm      &amp;lt;dbl&amp;gt; 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.6...
## $ age     &amp;lt;dbl&amp;gt; 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9...
## $ dis     &amp;lt;dbl&amp;gt; 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9...
## $ rad     &amp;lt;dbl&amp;gt; 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4,...
## $ tax     &amp;lt;dbl&amp;gt; 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311,...
## $ ptratio &amp;lt;dbl&amp;gt; 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2,...
## $ b       &amp;lt;dbl&amp;gt; 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396...
## $ lstat   &amp;lt;dbl&amp;gt; 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17...
## $ medv    &amp;lt;dbl&amp;gt; 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9,...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We will fit a knn model to this data using the continuous variable &lt;strong&gt;medv&lt;/strong&gt; as the target.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)
index&amp;lt;-sample(nrow(BostonHousing),size = floor(0.8*(nrow(BostonHousing))))
train&amp;lt;-BostonHousing[index,]
test&amp;lt;-BostonHousing[-index,]

scaled&amp;lt;-preProcess(train[,-14],method=c(&amp;quot;center&amp;quot;,&amp;quot;scale&amp;quot;))
trainscaled&amp;lt;-predict(scaled,train)
testscaled&amp;lt;-predict(scaled,test)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We are ready now to train our model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)
modelknnR &amp;lt;- train(medv~., data=trainscaled,
                method = &amp;quot;knn&amp;quot;,
                tuneGrid = expand.grid(k=1:60))
modelknnR&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## k-Nearest Neighbors 
## 
## 404 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 404, 404, 404, 404, 404, 404, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    1  4.711959  0.7479439  3.047925
##    2  4.600795  0.7545325  3.010235
##    3  4.554112  0.7583915  3.001404
##    4  4.416511  0.7733563  2.939100
##    5  4.414384  0.7736985  2.953741
##    6  4.405364  0.7758010  2.962082
##    7  4.375360  0.7799181  2.955250
##    8  4.409134  0.7773310  2.975489
##    9  4.427529  0.7770847  2.973016
##   10  4.414577  0.7804842  2.957983
##   11  4.447188  0.7787709  2.968389
##   12  4.475134  0.7767642  2.984709
##   13  4.489486  0.7760909  3.000489
##   14  4.518792  0.7746895  3.026858
##   15  4.554107  0.7717809  3.043645
##   16  4.583672  0.7694136  3.058097
##   17  4.599290  0.7695640  3.067001
##   18  4.632439  0.7671729  3.079895
##   19  4.670589  0.7643210  3.098643
##   20  4.708318  0.7614855  3.118593
##   21  4.736963  0.7596509  3.137784
##   22  4.756688  0.7590899  3.151654
##   23  4.781692  0.7577281  3.166203
##   24  4.813669  0.7554223  3.186575
##   25  4.843954  0.7533415  3.200120
##   26  4.872096  0.7513071  3.224031
##   27  4.896463  0.7502052  3.238489
##   28  4.920242  0.7497138  3.252959
##   29  4.944899  0.7484320  3.269227
##   30  4.966726  0.7479621  3.282756
##   31  4.996149  0.7460973  3.303607
##   32  5.024602  0.7438775  3.321013
##   33  5.055147  0.7420656  3.338457
##   34  5.083713  0.7403972  3.360867
##   35  5.108994  0.7388352  3.373694
##   36  5.132420  0.7372288  3.389177
##   37  5.156841  0.7354463  3.409025
##   38  5.175413  0.7349417  3.422294
##   39  5.196438  0.7340164  3.434986
##   40  5.225990  0.7314822  3.452499
##   41  5.249335  0.7299159  3.467267
##   42  5.275185  0.7281473  3.484101
##   43  5.300558  0.7263045  3.502388
##   44  5.322795  0.7251719  3.519220
##   45  5.349383  0.7232707  3.539266
##   46  5.376209  0.7210830  3.560509
##   47  5.398400  0.7199706  3.580476
##   48  5.424020  0.7180096  3.595497
##   49  5.445069  0.7166620  3.609308
##   50  5.469650  0.7145816  3.625718
##   51  5.492104  0.7127439  3.644329
##   52  5.515714  0.7107894  3.659286
##   53  5.535354  0.7092366  3.672172
##   54  5.562260  0.7063225  3.690854
##   55  5.581394  0.7049997  3.705917
##   56  5.600579  0.7036881  3.720464
##   57  5.623071  0.7018951  3.739874
##   58  5.645828  0.6999889  3.755824
##   59  5.662777  0.6990085  3.771570
##   60  5.682182  0.6976068  3.787733
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The best model is obtained with k=7, for which the minimum RMSE is about 4.3754.&lt;/p&gt;
&lt;p&gt;We can also get the importance of the predictors.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(varImp(modelknnR))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://www.metalesaek.com/post/2015-07-23-r-rmarkdown_files/figure-html/unnamed-chunk-18-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Then we get the prediction and the root mean squared error &lt;strong&gt;RMSE&lt;/strong&gt; as follows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred&amp;lt;-predict(modelknnR,testscaled)
head(pred)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 24.94286 29.88571 20.67143 20.31429 19.18571 20.28571&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;RMSE(pred,test$medv)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 4.416328&lt;/code&gt;&lt;/pre&gt;
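&lt;p&gt;For reference, the root mean squared error is the square root of the average squared difference between predicted and observed values. A minimal base R sketch on made-up numbers is given below; the helper name &lt;code&gt;rmse&lt;/code&gt; and the vectors are assumptions for the example, not the BostonHousing predictions.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hand-written RMSE, shown only to make the definition explicit
rmse &amp;lt;- function(predicted, observed) sqrt(mean((predicted - observed)^2))

predicted &amp;lt;- c(24, 30, 21)  # hypothetical predictions
observed  &amp;lt;- c(22, 33, 21)  # hypothetical observed values

# errors are 2, -3 and 0, so the result is sqrt((4 + 9 + 0) / 3)
rmse(predicted, observed)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the errors are squared before averaging, a few large errors weigh more heavily on the RMSE than on the mean absolute error.&lt;/p&gt;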
&lt;p&gt;The RMSE on the test set is about &lt;strong&gt;4.4163&lt;/strong&gt;, which is slightly greater than the resampled RMSE obtained on the training set, &lt;strong&gt;4.3754&lt;/strong&gt;.
Finally, we can plot the predicted values against the observed values to get insight into their relationship.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(data.frame(predicted=pred,observed=test$medv),aes(predicted,observed))+
  geom_point(col=&amp;quot;blue&amp;quot;)+
  geom_abline(col=&amp;quot;red&amp;quot;)+
  ggtitle(&amp;quot;actual values vs predicted values&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://www.metalesaek.com/post/2015-07-23-r-rmarkdown_files/figure-html/unnamed-chunk-20-1.svg&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
