Statistical Learning

Data Description

Stock Market Data for Tesla between $2016$ and $2022$

The dataset we use for the following statistical analysis is stock market data for Elon Musk’s publicly traded companies Tesla and Twitter from $01/01/2018$ to $05/20/2022$. We obtained this data from the Yahoo Finance API using the package quantmod in R. The data contains the daily percentage returns for the Tesla and Twitter stock indexes, with $2208$ observations on the following $13$ variables.

variable	type	description
symbol	character	The ticker symbol uniquely indefintying a stock
date	datetime	The trade day of the recorderd observation
open	float	Opening value of the stock that day
close	float	Closing value of the stock that day
high	float	Highest price of the stock on a given trade day
low	float	Lowest price of the stock on a given trade day
volume	integer	Number of daily shares traded in billions
direction	factor	Factor indicating whether the market had a positive or negative return
return	decimal	Percentage return for that day
lag1	decimal	Percentage return for previous day
lag2	decimal	Percentage return for 2 days previous
lag3	decimal	Percentage return for 3 days previous
lag4	decimal	Percentage return for 4 days previous

Twitter Data for Elon Musk between $2016$ and $2022$

Dataset of Elon Musk’s most recent Tweets during 2015-2022, stored in CSV format, where each row represents a separate tweet object. All Tweets are collected, parsed, and plotted using the Twitter API and rtweet package in R. In total, there are more than ten-thousand tweets in this dataset, including retweets, replies, and quotes. All objects are to go into a single database.

Make some pairwise scatterplots of the predictors (columns) in this data set.

Here, we use the pairs function to create a scatterplot matrix for every pair of variables in the stock dataset as shown below.

df.pairs <- df %>% dplyr::select(-alltext)
pairs(stocks.data)

Based on the correlation coefficients and their corresponding p-values, there is indeed an association between the daily return rate and the predictors volume, lag2, nfav, nretweet, and nreply.

Do any of the stocks appear to have particularly high closing values, daily return rates, or volume of traded shares?

Here, let’s visualize the distribution for the closing value, daily return rate, and volume by plotting a histogram for each variable as follows.

Now, let’s take a closer look at the crim variable, per capita crime rate by town.

range(df.pairs$high)
summary(df.pairs$high)

## [1]   30.994 1243.490
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30.99   51.32   68.78  273.81  463.70 1243.49

The range of the crime rates predictor is from $0.00632$ to $88.97620$. The median and maximum crime rate values are respectively $0.26\%$ and $89\%$; hence, there are indeed some suburbs of Boston where the crime rate is particularly high.

Model Fitting

In this problem, we analyze a (real, not simulated) dataset of with a quantitative response $Y$ , and $p \geq 50$ quantitative predictors.

Fit a least squares linear model to the data, and provide an estimate of the test error.

First, we must ensure that the missing values are removed from the data.

#myData <- na.omit(df.pairs)
x <- model.matrix(daily_return ~ ., data = myData)[, -1]
y <- myData$daily_return

Now, we fit a least squares linear model to the data.

lm.model <- lm(daily_return ~ ., data = myData)

Next, we split the data into a training set and a test set in order to provide an estimate of the test error.

set.seed(400)
train <- sample(1:nrow(x), nrow(x) / 2)
test <- (-train)
y.test <- y[test]

lm.pred <- predict(lm.model, newx = x[test, ])
mean((lm.pred - y.test)^2)

## [1] 17.67486

The estimate of the test error is $267736.8$.

Fit a ridge regression model to the data, with a range of values of the tuning parameter $\lambda$. Make a plot like the left-hand panel of Figure 6.4 in the textbook.

We will perform a ridge regression in order to predict Salary on the Hitters data. We first fit a ridge regression model.

grid <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)

We’ve chosen to implement the function over a grid of values ranging from $\lambda = 10^{10}$ to $\lambda = 10^{-2}$. Associated with each value of $\lambda$ is a vector of ridge regression coefficients, stored in a $20 \times 100$ matrix.

What value of $\lambda$ in the ridge regression model provides the smallest estimated test error?

We split the samples into a training and test set in order to estimate the test error of ridge regression. We first set a random seed so that the results obtained will be reproducible.

set.seed(500)
train <- sample(1:nrow(x), nrow(x) / 2)
test <- (-train)
y.test <- y[test]

Next, we fit a ridge regression model on the training set.

ridge.mod <- glmnet(x[train, ], y[train], alpha = 0, lambda = grid, thresh = 1e-12)

Now we want to evaluate its MSE on the test set using some $\lambda$. We use cross-validation to choose the tuning parameter $\lambda$ rather than arbitrarily choosing $\lambda$. We do this using the built-in cross-validation function cv.glmnet() as follows.

set.seed(500)
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 0)
plot(cv.out)

bestlam <- cv.out$lambda.min
bestlam

## [1] 0.04977912

Therefore, we see that the value of $\lambda$ that results in the smallest cross-validation error is $326$. Now, we compute the test MSE associated with this value of $\lambda$.

ridge.pred <- predict(ridge.mod, s = bestlam, newx = x[test, ])
mean((ridge.pred - y.test)^2)

## [1] 13.99013

Hence, the estimate of test error is $105738.2$.

Finally, we refit our ridge regression model on the full data set, using the value of $\lambda$ chosen by cross-validation, and examine the coefficient estimates.

out <- glmnet(x, y, alpha = 0)

predict(out, type = "coefficients", s = bestlam)[1:12, ]

##   (Intercept)    created_at       ntweets          nfav      nretweet 
##  3.367005e+00 -2.187807e-04  2.834274e-02  8.696752e-07 -9.836911e-06 
##        nreply          open          high           low        volume 
## -1.102397e-05 -6.082904e-03  2.397504e-03  4.472647e-03  1.460485e-08 
##          lag1          lag2 
## -3.185506e-02  3.268552e-02

Repeat (c), but for a lasso model.

In order to fit a lasso model, we use the glmnet() function, but using the argument alpha = 1 as shown below.

lasso.mod <- glmnet(x[train, ], y[train], alpha = 1, lambda = grid)
plot(lasso.mod)

Repeat (d), but for a lasso model. Which features are included in this lasso model?

We now perform cross-validation and compute the associated test error.

set.seed(600)
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 1)
plot(cv.out)

bestlam <- cv.out$lambda.min
lasso.pred <- predict(lasso.mod, s = bestlam, newx = x[test, ])
mean((lasso.pred - y.test)^2)

## [1] 11.76152

Hence, the test set MSE of the lasso model is $124491.4$.

Polynomial Regressions and Splines

For this problem, we use the Wage data set that is part of the ISLR package. Split the data into a training set and a test set.

First, we split the Wage dataset into a training and test set as follows.

set.seed(4)
#myData <- data.frame(na.omit(df.pairs))
train <- sample(1:nrow(myData), nrow(myData) / 2)
wage.train <- myData[train, ]
wage.test <- myData[-train, ]

Fit a polynomial model to predict Wage using Age on the training set.

Here, we fit a polynomial model on the training set using the poly() function within lm(). We begin by producing a fourth-order polynomial fit to predict wage using age as follows.

lm.fit4 <- lm(daily_return ~ poly(lag2 , 4), data = myData)
lm.summary <- summary(lm.fit4)

	Estimate	Std. Error	T-Value	P-Value
(Intercept)	0.301187	0.0969167	3.1076910	0.0019213
poly(lag2, 4)1	6.769673	3.7397777	1.8101806	0.0704700
poly(lag2, 4)2	-9.984156	3.7397777	-2.6697192	0.0076743
poly(lag2, 4)3	-3.219837	3.7397777	-0.8609702	0.3893935
poly(lag2, 4)4	-4.606210	3.7397777	-1.2316801	0.2182637


Residual Standard Error	3.7397777
Multiple R-Squared	0.0084605
Adjusted R-Squared	0.0057879
F-Statistic	3.1656150
P-Value	0.0133057

This suggests that including additional polynomial terms, up to third order, leads to an improvement in the model fit. However, further investigation of the data reveals that no polynomial terms beyond the third order have significant p-values in a regression fit. Next, we use the validation set approach and estimate the test error that results from predicting wage using polynomial functions of age as follows.

set.seed(4)
test.error <- rep(0, 10)
for (i in 1:10){
  model.fit <- lm(daily_return ~ poly(lag2, i), data = myData, subset = train)
  yhat <- predict(model.fit, newdata = wage.test)
  test.error[i] <- mean((wage.test$daily_return - yhat)^2)
}

Validation Set Errors
Order	1	2	3	4	5	6	7	8	9	10
Test Error	12.81059	12.87869	12.88537	12.97211	13.18252	13.18043	13.65858	13.66921	13.49155	13.54611

We find that the validation set error rates for the models with first, second, third, fourth, and fifth orders are $1709.923$, $1639.436$, $1636.682$, $1635.943$, $1638.892$ respectively. Based on these results, a model that predicts wage using a cubic function of age performs better than models that involve only a linear or quadratic function, and there is little evidence in favor of a model that uses anything above a fourth-degree function.

Next, we use the cross-validation approach for polynomials of order $i = 1$ to $i = 10$ and compute the associated cross-validation error. We also use the cv.glm() function to implement a $k$-fold CV. Below, we use $k = 10$ on the Wage data set.

set.seed(4)
cv.error.10 <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(daily_return ~ poly(lag2, i), data = myData)
  cv.error.10[i] <- cv.glm(myData, glm.fit, K = 10)$delta[1]
}

10-fold Cross-Validation Errors
Order	1	2	3	4	5	6	7	8	9	10
Test Error	14.07739	14.13862	14.26909	14.49621	14.48819	15.20953	16.91290	18.55791	17.06029	28.28094

Similar to our results from the validation set approach, we see a steep drop in the estimated test MSE between the linear, quadratic, and cubic fits, but then no clear improvement from using higher-order polynomials. Hence, we see little evidence that using a fifth or higher-order polynomial terms leads to lower test error than simply using a cubic or quartic fit.

Based on our overall results, we see that a polynomial of order three results in the lowest test error. In the following, we fit a polynomial model at the third-order to predict wage using age on the training set, calculate the test error, and then plot the fit and observations.

glm.fit3 <- glm(daily_return ~ poly(lag2 , 3), data = myData, subset = train)
glm.preds <- predict(glm.fit3, newdata = wage.test)
mean((wage.test$daily_return - glm.preds)^2)

## [1] 12.88537

We get the test error $1636.682$ using a polynomial at the third-order.

Fit a step function model.

Here, we fit a step function model to predict Wage using Age on the training set. In order to fit a step function, we use the cut() function as follows.

set.seed(4)
stepfunc.error <- rep(NA, 10)
for (i in 2:10) {
  myData$lag2.cut <- cut(myData$lag2, i)
  stepfunc.fit <- glm(daily_return ~ lag2.cut, data = myData)
  stepfunc.error[i] <- cv.glm(myData, stepfunc.fit)$delta[1]
}
err.min <- which.min(stepfunc.error)

In the above command, the err.min gives us the number of cuts yielding the smallest error. In the following, we display our step function error estimates and plot the error estimates for different degrees of the polynomial.

stepfunc.error
Order	1	2	3	4	5	6	7	8	9	10
Test Error	NA	14.07425	14.13762	14.20493	14.14848	14.24408	14.27914	14.36637	14.38867	14.43699

Based on our overall results, we see that $8$-cuts yields the lowest test error. In the following, we fit the 8-cut step function to predict wage using age on the training set, calculate the test error, and then plot the fit and observations.

step.fit <- glm(daily_return ~ cut(lag2, 8), data = myData, subset = train)
summary(step.fit)

#step.preds <- predict(step.fit$y, newdata = wage.test)
mean((wage.test$daily_return - step.fit$y)^2)

## 
## Call:
## glm(formula = daily_return ~ cut(lag2, 8), data = myData, subset = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -14.4732   -1.7782   -0.3489    1.8004   18.2628  
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                 -7.326      2.743  -2.671  0.00774 **
## cut(lag2, 8)(-15.9,-10.8]    6.845      3.542   1.933  0.05366 . 
## cut(lag2, 8)(-10.8,-5.7]     7.516      2.855   2.632  0.00866 **
## cut(lag2, 8)(-5.7,-0.584]    7.451      2.755   2.705  0.00699 **
## cut(lag2, 8)(-0.584,4.54]    7.898      2.750   2.872  0.00419 **
## cut(lag2, 8)(4.54,9.66]      7.048      2.813   2.506  0.01243 * 
## cut(lag2, 8)(9.66,14.8]      9.044      2.947   3.069  0.00223 **
## cut(lag2, 8)(14.8,19.9]      2.952      3.360   0.879  0.37989   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 15.0511)
## 
##     Null deviance: 11360  on 743  degrees of freedom
## Residual deviance: 11078  on 736  degrees of freedom
## AIC: 4138.7
## 
## Number of Fisher Scoring iterations: 2
## 
## [1] 28.82871

The resulting test error on the 8-cuts model is $1650.564$.

Fit a cubic spline model.

Here, we want to fit a cubic spline to predict wage using age on the training set. First, we fit a linear model using the bs() function to generate the B-spline basis matrix for a polynomial cubic spline, with knots at ages $30$, $42$, and $61$. This produces a spline with six basis functions.

bs.fit <- lm(daily_return ~ bs(lag2 , knots = c(30, 42, 61)), data = myData, subset = train)

	Estimate	Std. Error	T-Value	P-Value
(Intercept)	-3.393081	2.754171	-1.2319788	0.2183483
bs(age)1	2.274082	5.353810	0.4247595	0.6711355
bs(age)2	8.180027	2.623634	3.1178232	0.0018925
bs(age)3	-0.883639	3.876847	-0.2279272	0.8197658


Residual Standard Error	3.8917292
Multiple R-Squared	0.0133883
Adjusted R-Squared	0.0093886
F-Statistic	3.3472701
P-Value	0.0187282

bs.preds <- predict(bs.fit , newdata = wage.test, se = T)
mean((wage.test$daily_return - bs.preds$fit)^2)

## [1] 12.88537

The resulting test errors for the polynomial cubic spline is $1638.827$.

Fit a smoothing spline model.

Here, we fit smoothing spline models to predict Wage using Age on the training set using the smooth.spline() function. First, we fit a smoothing spline that uses leave-one-out cross-validation (cv = TRUE).

sspline.fit1 <- smooth.spline(myData$lag2 , myData$daily_return , cv = TRUE)

Our selected smoothness level by leave-one-out cross-validation results in a value of $\lambda = 0.027923$ that yields $7$ degrees of freedom, which means that $7$ knots are used.

sspline1.preds <- predict(sspline.fit1, newdata = wage.test)
mean((wage.test$daily_return - sspline1.preds$y)^2)

## [1] 12.94495

The resulting test error for this smoothing spline is $2051.321$.

Next, we fit two smoothing spline fits that defines the number of knots as $16$ (df = 15) and $50$ (df = 50).

sspline.fit2 <- smooth.spline(myData$lag2 , myData$daily_return, df = 15)
sspline.fit3 <- smooth.spline(myData$lag2 , myData$daily_return, df = 50)

By specifying df = 15 and df = 50, the function determines which values of $\lambda$ leads to $15$ and $50$ degrees of freedom, respectively.

sspline2.preds <- predict(sspline.fit2, newdata = wage.test)
mean((wage.test$daily_return - sspline2.preds$y)^2)

sspline3.preds <- predict(sspline.fit3, newdata = wage.test)
mean((wage.test$daily_return - sspline3.preds$y)^2)

## [1] 13.05814
## [1] 13.29748

The resulting test errors for the two smoothing spline fits are $2082.129$ and $2117.501$.

Regression Trees

Use the stock market data as described above to predict a stock index’s volume of shares traded for any given day.

Below, we work with bagging, random forests, boosting, and additive regression trees. Each of these approaches involves producing and combining a large number of trees, and consequently resulting in dramatic improvements of the prediction accuracy.

Produce multiple regression trees on the stock data. How accurately does your regression tree predict a stock index’s volume of shares traded?

We begin by removing the name variable from the Auto dataset and then splitting the observations into a training set and test set.

myData <- na.omit(df.pairs)
train <- sample(1:nrow(myData), nrow(myData) / 2)
auto.train <- myData[-train, ]
auto.test <- myData[-train, ]

Next, we fit a regression tree to predict mpg from the training data. The tree library is used to construct regression trees as follows.

tree.auto <- tree(daily_return ~ ., data = myData, subset = train)

tree.summary <- summary(tree.auto)

Variables Used	Terminal Nodes	Residual Mean Deviance
4	9	11.15818

The output of summary() indicates that only $4$ of the variables have been used in constructing the tree. In the context of a regression tree, the deviance is simply the sum of squared errors for the tree. We now plot the tree.

In this case, the most complex tree under consideration is selected by cross-validation. However, if we wish to prune the tree, we could do so as follows, using the prune.tree() function as follows.

prune.auto <- prune.tree(tree.auto , best = 5)

In keeping with the cross-validation results, we use the unpruned tree to make predictions on the test set. To assess the accuracy of the regression model, we examine how well the model fitted on the training data predicts the held out testing data, yielding a more realistic error rate.

tree.preds <- predict(tree.auto , newdata = auto.test)
mean((auto.test$daily_return - tree.preds)^2)

## [1] 15.39865

Hence, the test set MSE associated with the regression tree is $ 14.20715$. The square root of the MSE is therefore around $3.7692$, indicating that this model leads to test predictions that are (on average) within approximately $3.7692$ of the true median miles per gallon value.

Fit a bagged regression tree model to predict a car’s mpg. How accurately does this model predict gas mileage? What tuning parameter value(s) did you use in fitting this model?

Here, we apply a bagging to the Auto data, using the randomForest package in R. Since bagging is a special case of a random forest with $m = p$, the randomForest() function can be used to perform bagging as follows:

set.seed(5)
bag.auto2 <- randomForest(daily_return ~ ., data = myData, subset = train, importance = TRUE, mtry = 8)

Type	No. of trees	Vars tried at each split	Mean squared residuals	% Var explained
regression	500	8	16.34357	-19.32809

The argument mtry = 7 indicates that all 7 predictors are considered for each split of the tree. Next, we assess the accuracy of the model by evaluating how well this bagged fit performs on the test set.

#bag.preds <- predict(bag.auto2, newdata = auto.test)
mean((auto.test$daily_return - bag.auto2$y)^2)

## [1] 27.1849

Hence, the test set MSE associated with the bagged regression tree is $7.253194$. The square root of the MSE is therefore around $2.7026$, indicating that this model leads to test predictions that are (on average) within approximately $2.7026$ of the true median gas mileage.

Fit a random forest model to predict a car’s mpg. How accurately does this model predict gas mileage? What tuning parameter value(s) did you use in fitting this model?

For random forest, we simply specify a smaller number of predictors in mtry. Typically, $p/3$ variables are used when building a random forest of regression trees. Here, we have $p = 7$ and hence $m = \mathrm{round}(7/3)$. That is, we set mtry = 2 as follows.

set.seed(5)
rf.auto <- randomForest(daily_return ~ ., data = myData, subset = train, mtry = 2, importance = TRUE)

Type	No. of trees	Vars tried at each split	Mean squared residuals	% Var explained
regression	500	2	15.52228	-13.33168

rf.preds <- rf.auto$predicted # <- predict(rf.auto, newdata = auto.test)
mean((auto.test$daily_return - rf.preds)^2)

## [1] 15.25593

Hence, the test set MSE associated with the random forest is $7.338623$. The square root of the MSE is therefore around $2.709$, indicating that this model leads to test predictions that are (on average) within approximately $2.709$ of the true median miles per gallon value.

Fit a generalized additive model (GAM) model to predict a car’s mpg. How accurately does your GAM model predict a car’s gas mileage?

Next, we fit a GAM model to predict mpg using natural spline functions of all the predictors except origin since it’s qualitative.

gam.auto1 <- gam(daily_return ~ s(nretweet) + s(nreply) + s(nfav) + s(volume) + s(lag2), data = myData)
gam.summary <- summary(gam.auto1)

Anova for Parametric Effects
	Df	Sum Squared	Mean Squared	F-Value	P-Value
s(nretweet)	1	116.62641	116.62641	8.5687215	0.0034725
s(nreply)	1	4.35437	4.35437	0.3199222	0.5717408
s(nfav)	1	87.22795	87.22795	6.4087713	0.0114594
s(volume)	1	345.56670	345.56670	25.3893156	0.0000005
s(lag2)	1	18.39107	18.39107	1.3512201	0.2452539
Residuals	1468	19980.52451	13.61071	NA	NA

Anova for Nonparametric Effects
	Df	F-Statistic	P-Value
(Intercept)	NA	NA	NA
s(nretweet)	3	3.6284653	0.0125732
s(nreply)	3	2.8916124	0.0342888
s(nfav)	3	0.2986854	0.8263703
s(volume)	3	1.4629077	0.2229042
s(lag2)	3	5.8707029	0.0005539

The Anova for Parametric Effects p-values clearly demonstrate that cylinders, displacement, horsepower, weight, acceleration, year, and origin are all highly statistically significant. To assess the accuracy of the model, we examine how well the model fitted on the training data predicts the held out testing data, yielding a more realistic error rate.

gam.preds <- predict(gam.auto1, newdata = auto.test)
mean((auto.test$daily_return - gam.preds)^2)

## [1] 13.90912

Hence, the test set MSE associated with the regression tree is $5.80442$. The square root of the MSE is therefore around $2.40924$, indicating that this model leads to test predictions that are (on average) within approximately $2.40924$ of the true median miles per gallon value.

Considering both accuracy and interpretability of the fitted model, which of the models in (a)-(d) do you prefer?

From our results, the testing errors (MSE) had values of $14.20715$, $7.304048$, $7.188545$, and $5.804421$ for the regression tree, bagged regression tree, random forest , and generalized additive (GAM) models, respectively, to predict a car’s mpg. These results indicate that the generalized additive model (GAM) yielded an improvement over the other models. In addition, the GAM model produces straightforward interpretation of the model fit output. Hence, I prefer the GAM model over the regression tree, bagged regression tree, and random forest models.

Principal Component Analysis

In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.

Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables.

We use the rnorm() function in R to generate data with three distinct classes each containing 20 observations of different means.

##        years      ntweets       volume daily_return 
## 2.018692e+03 7.991269e+00 4.028599e+07 3.011870e-01 
##        years      ntweets       volume daily_return 
## 4.023863e+00 5.439576e+01 8.239448e+14 1.406736e+01

Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes.

We now perform principal components analysis using the prcomp() function and then plot the first two principal component store vectors, labeled PC1 and PC2 respectively.

pr.out <- prcomp(df.pca, scale. = TRUE)

PC1 <- pr.out$x[, 1] # 1st principal component score vector
PC2 <- pr.out$x[, 2] # 2nd principal component score vector

## Standard deviations (1, .., p=4):
## [1] 1.2239694 1.0144761 0.9135884 0.7988076
## 
## Rotation (n x k) = (4 x 4):
##                     PC1        PC2        PC3         PC4
## years        -0.6186323  0.2590311  0.1823689  0.71898439
## ntweets      -0.6127464  0.2169850  0.3237185 -0.68750691
## volume       -0.4666432 -0.3359594 -0.8148186 -0.07379721
## daily_return -0.1551777 -0.8791767  0.4449928  0.07035391

Perform K-means clustering of the observations with $K = 3$. How well do the clusters that you obtained in K-means clustering compare to the true class labels?

We now perform K-means clustering with $K = 3$.

set.seed(30)
km.out <- kmeans(df.pca, centers = 3, nstart = 20)

The corresponding cluster assignments of the $60$ observations are contained in km.out$cluster. We now use the table function to compare the true class labels to the class labels obtained by clustering.

#true.class <- c(rep(1, 20), rep(2, 20), rep(3, 20))
table(km.out$cluster, df.pairs$year.group)

##    
##       A   B   C
##   1  63  30   1
##   2 147 148  17
##   3 371 466 246

Although K-means clustering will arbitrarily number the clusters, we see that each arbitrary cluster is assigned to one class only. The K-means clustering perfectly separated the observations into three clusters. In this example, we knew that there really were $3$ clusters because we generated the data. However, for real data, in general we do not know the true number of clusters. We could instead have performed K-means clustering on this example with $K \neq 3$.

Perform K-means clustering with $K = 2$. Describe your results.

set.seed(3)
km.out <- kmeans(df.pca, centers = 2, nstart = 20)
table(km.out$cluster, df.pairs$year.group)

##    
##       A   B   C
##   1 152  89   9
##   2 429 555 255

For $K = 2$, the table above shows that two of the classes are perfectly assigned to two clusters (i.e. class $1$ to cluster $1$ and class $3$ to cluster $2$). However, K-means clustering forces the middle class $2$ to be split among the two clusters, with the majority of the class $2$ observations going to cluster $1$.

Now perform K-means clustering with $K = 4$, and describe your results.

set.seed(3)
km.out <- kmeans(df.pca, centers = 4, nstart = 20)
table(km.out$cluster, df.pairs$year.group)

##    
##       A   B   C
##   1  15   5   0
##   2 334 383 238
##   3  96  47   3
##   4 136 209  23

For $K = 4$, the table above shows that K-means clustering assigns the majority of the observations in each class to only one cluster (i.e. class $1$ to cluster $1$, class $2$ to cluster $2$). However, K-means clustering forced some observations from class $3$ into a fourth cluster. Hence, when $K = 4$, K-means clustering splits up the three clusters such that one of the classes gets split into two separate classes.

Now perform K-means clustering with $K = 3$ on the first two principal component score vectors, rather than on the raw data.

Here, we perform K-means clustering on the $60 \times 2$ matrix of which the first column is the first principal component store vector and the second column is the second principal component store vector.

set.seed(30)
km.out <- kmeans(pr.out$x[, 1:2], centers = 3, nstart = 20)
table(km.out$cluster, df.pairs$year.group, dnn = c("Cluster", "Class"))

##        Class
## Cluster   A   B   C
##       1 451 201   1
##       2  18 402 261
##       3 112  41   2

The above table shows that the first two principal components perfectly separate each class into three unique clusters. Hence, we achieve the same accuracy as shown in part (c), which similarly uses three clusters.

Using the scale function, perform K-means clustering with $K = 3$ on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (c)? Explain.

To scale the variables before performing K-means clustering of the observations, we use the scale() function as follows.

set.seed(30)
km.out <- kmeans(scale(df.pca), centers = 3, nstart = 20)
table(km.out$cluster, df.pairs$year.group)

##    
##       A   B   C
##   1   0 355 262
##   2 100  40   1
##   3 481 249   1

Above, we used the table function to compare the true class labels to the class labels obtained by clustering the scaled data. Consequently, we obtain the same results as the un-scaled data such that the K-means clustering perfectly separated the observations into three clusters. Hence, scaling does not change the results of K-means clusters for $K = 3$.

Given the above results, we can see that the data within each cluster does not drastically move in effect with the scaling. Hence, we should expect that the clustering returns similar results with the data after scaling each variable to have standard deviation one and the original data in part (c).

This problem involves the OJ data set, which is part of the ISLR2 package.

Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

First, we create a training set, and fit the tree to the training data.

set.seed(4)
train <- sample(1:nrow(df.pairs), 800)
df.train <- df.pairs[train, ]
df.test <- df.pairs[-train, ]

Fit a support vector classifier to the training data using cost $= 0.01$, with Purchase as the response and the other variables as predictors.

The support vector machine (SVM) is an extension of the support vector classifier using linear kernels. We use the svm() function from the e1071 library to fit a support vector classifier with the argument kernel = "linear", and then we use the summary() command to produce basic summary statistics about the support vector classifier fit.

set.seed(4)
svmfit <- svm(daily_return ~ ., data = df.train, kernel = "linear", cost = 0.01)
summary.svm <- summary(svmfit)

SVM Parameters
Type	Kernel	Cost	Gamma	Number of Support Vectors	Number of Classes
Eps-regression	linear	0.01	0.0588235	686 ( 0, 0 )	2

The summary tells us that a linear kernel was used with cost $= 0.01$, and that there were $446$ support vectors, $224$ in one class and $222$ in the other. Hence, the summary shows us that the model selects $446$ out of $800$ observations as support points and that we are indeed predicting $2$ classes.

What are the training and test error rates?

To calculate the training error rate, we use the predict() function to estimate the response for the $800$ observations from the training data, and we use the mean() function to calculate the consequent MSE as follows.

train.preds <- predict(svmfit, newdata = df.train)
train.error <- mean(train.preds != df.train$daily_return)

Similarly, we calculate the test error rate as follows.

test.preds <- predict(svmfit, df.test)
test.error <- mean(test.preds != df.test$daily_return)

SVM (kernel = linear, cost = 0.01)
Training Error Rate	Test Error Rate
1	1

Hence, the training error rate is approximately $17.75\%$, and the test error rate is approximately $13.704\%$.

Use the tune() function to select an optimal cost. Consider values in the range $0.01$ to $10$.

Using the built-in function tune() from the e1071 library, we perform ten-fold cross-validation on our models by indicating a command to compare SVMs with a linear kernel using a range of values between $0.01$ and $10$ of the cost parameter. We then access the cross-validation errors for each of these models using the summary() command as follows.

set.seed(4)
tune.out <- tune(method = svm, daily_return ~ ., data = df.train, kernel = "linear",
                 ranges = list(cost = c(0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10)))
tune.summary <- summary(tune.out)

Parameter tuning of svm:
sampling method	best parameters		best performance
10-fold cross validation	cost:	10	11.67081

Detailed performance results:
cost	error	dispersion
0.01	15.10210	4.862200
0.02	15.15898	4.926682
0.05	15.16420	4.998011
0.10	15.11507	4.952348
0.20	14.98522	4.897644
0.50	14.54469	4.695697
1.00	13.94114	4.382969
2.00	13.04548	3.998433
5.00	11.99497	3.495454
10.00	11.67081	3.347013

The above summary shows that the model parameter cost $= 5$ results in the lowest cross-validation error rate. Hence, our best model fit is the one using the optimal cost parameter of $5$, which we store in the following command.

bestmod <- tune.out$best.model
best.summary <- summary(bestmod)

Compute the training and test error rates using this new value for cost.

We use the predict() function to predict the class labels on the set of training observations and test observations, using the best model obtained through cross-validation with the optimal cost of $5$.

train.preds <- predict(bestmod , newdata = df.train)
train.error <- mean(train.preds != df.train$daily_return)

test.preds <- predict(bestmod, df.test)
test.error <- mean(test.preds != df.test$daily_return)

SVM (kernel = linear, cost = 5.0)
Training Error Rate	Test Error Rate
1	1

Hence, with this value of cost $=5$, the training error rate is approximately $17.5\%$, and the test error rate is approximately $13.7037\%$.

Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for gamma.

First we fit a support vector classifier to the training data using cost $=0.01$ and kernel radial, with Purchase as the response and the other variables as predictors.

set.seed(4)
svm.radial <- svm(daily_return ~ ., data = df.train, kernel = 'radial', cost = 0.01)
radial.summary <- summary(svm.radial)

## 
## Call:
## svm(formula = daily_return ~ ., data = df.train, kernel = "radial", 
##     cost = 0.01)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  0.01 
##       gamma:  0.05882353 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  691

SVM Parameters
Type	Kernel	Cost	Number of Support Vectors	Number of Classes
Eps-classification	radial	0.01	691 ( 0, 0 )	2

The summary tells us that a radial kernel was used with cost $= 0.01$, and that there were $610$ support vectors, $307$ in one class and $303$ in the other. The model selects $610$ out of $800$ observations as support points. Now we compute the training and test error rates.

train.preds <- predict(svm.radial, newdata = df.train)
train.error <- mean(train.preds != df.train$daily_return)

test.preds <- predict(svm.radial, df.test)
test.error <- mean(test.preds != df.test$daily_return)

SVM (kernel = radial, cost = 0.01)
Training Error Rate	Test Error Rate
1	1

Hence, with a radial kernel, the training error rate is approximately $37.875\%$, and the test error rate is approximately $42.222\%$. Using the tune() function, we select an optimal cost using the values in the range $0.01$ to $10$.

set.seed(4)
tune.out <- tune(method = svm, daily_return ~ ., data = myData, kernel = "radial",
                 ranges = list(cost = c(0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10)))
tune.summary <- summary(tune.out)
bestmod <- tune.out$best.model

Parameter tuning of svm:
sampling method	best parameters		best performance
10-fold cross validation	cost:	0.2	14.03959

Hence, our optimal cost is $1$. Based on the optimal cost, we compute the training and test error rates as follows.

train.preds <- predict(bestmod , newdata = df.train)
train.error <- mean(train.preds != df.train$daily_return)

test.preds <- predict(bestmod, df.test)
test.error <- mean(test.preds != df.test$daily_return)

SVM (kernel = radial, cost = 1.0)
Training Error Rate	Test Error Rate
1	1

Hence, with optimal cost $=1$, the training error rate is approximately $15.75\%$, and the test error rate is approximately $17.77778\%$.

Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree $=2$.

First we fit a support vector classifier to the training data using cost $=0.01$ and polynomial kernel, with Purchase as the response and the other variables as predictors.

set.seed(4)
svm.polynomial <- svm(daily_return ~ ., data = df.train, kernel = 'polynomial', cost = 0.01, degree = 2)
polynomial.summary <- summary(svm.polynomial)

SVM Parameters
Type	Kernel	Cost	Degree	Number of Support Vectors	Number of Classes
Eps-classification	polynomial	0.01	2	691 ( 0, 0 )	2

The summary tells us that a polynomial kernel was used with cost $= 0.01$ and degree $=2$, and that there were $597$ support vectors, $300$ in one class and $297$ in the other. The model selects $597$ out of $800$ observations as support points. Now we compute the training and test error rates.

train.preds <- predict(svm.polynomial, newdata = df.train)
train.error <- mean(train.preds != df.train$daily_return)

test.preds <- predict(svm.polynomial, df.test)
test.error <- mean(test.preds != df.test$daily_return)

SVM (kernel = polynomial, cost = 0.01)
Training Error Rate	Test Error Rate
1	1

Hence, with a polynomial kernel, the training error rate is approximately $36.25\%$, and the test error rate is approximately $40.37037\%$. Using the tune() function, we select an optimal cost using the values in the range $0.01$ to $10$.

set.seed(4)
tune.out <- tune(method = svm, daily_return ~ ., data = df.train, kernel = "polynomial", degree = 2,
                 ranges = list(cost = c(0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10)))
tune.summary <- summary(tune.out)
bestmod <- tune.out$best.model

Parameter tuning of svm:
sampling method	best parameters		best performance
10-fold cross validation	cost:	0.01	15.19634

Hence, our optimal cost is $5$. Based on the optimal cost, we compute the training and test error rates as follows.

train.preds <- predict(bestmod , newdata = df.train)
train.error <- mean(train.preds != df.train$daily_return)

test.preds <- predict(bestmod, df.test)
test.error <- mean(test.preds != df.test$daily_return)

SVM (kernel = polynomial, cost = 5.0)
Training Error Rate	Test Error Rate
1	1

Hence, with optimal cost $=5$, the training error rate is approximately $16.5\%$, and the test error rate is approximately $16.66667\%$.

Overall, which approach seems to give the best results on this data?

It appears that the SVM using a linear kernel and cost $=5$ gives the best results on this data. This model has the advantage of being simpler than the other approaches.

Statistical Learning

Hannah Luebbering

May 25, 2022

Data Description

Model Fitting

Polynomial Regressions and Splines

Regression Trees

Principal Component Analysis

References