
Problem 1


The baseball dataset consists of the statistics of 263 players in Major League Baseball during the 1986 season. The dataset (hitters.csv) consists of 20 variables.

In this problem, we use Salary as the response variable and the remaining 19 variables as predictors/covariates, which measure each player's performance in the 1986 season and over his whole career. Write R functions to perform variable selection using best subset selection paired with the BIC (Bayesian Information Criterion):


1) Starting from the null model, apply the forward stepwise selection algorithm to produce a sequence of sub-models iteratively, and select a single best model using the BIC. Plot the “BIC vs Number of Variables” curve. Present the selected model with the corresponding BIC. 
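A minimal sketch of what part 1) could look like, assuming the leaps package is available and that hitters.csv contains exactly the 20 variables described above (the assignment expects your own implementation of the algorithm; regsubsets() with method = "forward" is shown here only as a way to cross-check results):

# Forward stepwise selection scored by BIC -- a cross-checking sketch,
# assuming hitters.csv holds the 20 variables described above.
library(leaps)

hitters <- na.omit(read.csv("hitters.csv"))  # drop any incomplete rows (263 remain)

fwd <- regsubsets(Salary ~ ., data = hitters, nvmax = 19, method = "forward")
fwd.sum <- summary(fwd)

# "BIC vs Number of Variables" curve
plot(fwd.sum$bic, type = "b", xlab = "Number of Variables", ylab = "BIC")

k <- which.min(fwd.sum$bic)     # model size minimizing BIC
points(k, fwd.sum$bic[k], col = "red", pch = 19)
coef(fwd, id = k)               # the selected model's coefficients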


2) Starting from the full model (that is, the one obtained from minimizing the MSE/RSS using all the predictors), apply the backward stepwise selection algorithm to produce a sequence of sub-models iteratively, and select a single best model using the BIC. Plot the “BIC vs Number of Variables” curve. Present the selected model with the corresponding BIC. 
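Analogously, a hedged sketch for part 2), again leaning on leaps::regsubsets (which the assignment would have you reimplement) and the same assumptions about hitters.csv:

# Backward stepwise selection scored by BIC -- a cross-checking sketch.
library(leaps)

hitters <- na.omit(read.csv("hitters.csv"))

bwd <- regsubsets(Salary ~ ., data = hitters, nvmax = 19, method = "backward")
bwd.sum <- summary(bwd)

plot(bwd.sum$bic, type = "b", xlab = "Number of Variables", ylab = "BIC")
k <- which.min(bwd.sum$bic)
coef(bwd, id = k)    # compare with the forward-selected model in part 1)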


3) Are the selected models from 1) and 2) the same? 


Problem 2 

In this problem, we fit ridge regression on the same dataset as in Problem 1. First, standardize the variables so that they are on the same scale. Next, choose a grid of λ values ranging from λ = 10^10 to λ = 10^-2, essentially covering the full range of scenarios from the null model containing only the intercept to the least squares fit. For example:

> grid = 10^seq(10, -2, length=100)


1) Write an R function to do the following: for each value of λ, compute the vector of ridge regression coefficients (including the intercept), and store the results in a 20 × 100 matrix, with 20 rows (one for each of the 19 predictors, plus one for the intercept) and 100 columns (one for each value of λ).
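One possible shape for such a function, sketched under the assumption that the predictors have already been expanded to a numeric matrix x (e.g., via model.matrix) with response y; ridge_path is an illustrative name, and the reported coefficients are on the standardized scale:

# Sketch of a ridge coefficient path; ridge_path is a hypothetical name.
# x: n x 19 numeric predictor matrix, y: response vector, lambdas: grid of λ values.
ridge_path <- function(x, y, lambdas) {
  xs <- scale(x)                              # standardize each predictor
  yc <- y - mean(y)                           # center the response
  p  <- ncol(xs)
  coefs <- sapply(lambdas, function(l) {
    b <- solve(crossprod(xs) + l * diag(p), crossprod(xs, yc))
    c(mean(y), b)                             # intercept first, then 19 slopes
  })
  rownames(coefs) <- c("(Intercept)", colnames(x))
  coefs                                       # 20 x length(lambdas) matrix
}

# e.g. with x <- model.matrix(Salary ~ ., hitters)[, -1]; y <- hitters$Salary:
# B <- ridge_path(x, y, grid); dim(B)   # 20 x 100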


2) To find the “best” λ, use ten-fold cross-validation to choose the tuning parameter from the previous grid of values. Set a random seed first, set.seed(1), so that your results are reproducible, since the choice of the cross-validation folds is random. Plot the “Cross-Validation Error versus λ” curve, and report the selected λ.
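If the glmnet package is acceptable as a cross-check (an assumption; alpha = 0 requests the ridge penalty), the cross-validation step could look like:

library(glmnet)

hitters <- na.omit(read.csv("hitters.csv"))
x <- model.matrix(Salary ~ ., hitters)[, -1]   # drop the intercept column
y <- hitters$Salary
grid <- 10^seq(10, -2, length = 100)

set.seed(1)                      # fix the random fold assignment
cv.out <- cv.glmnet(x, y, alpha = 0, lambda = grid, nfolds = 10)
plot(cv.out)                     # CV error versus log(lambda)
bestlam <- cv.out$lambda.min     # the selected lambda
bestlam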


3) Finally, refit the ridge regression model on the full dataset, using the value of λ chosen by cross-validation, and report the coefficient estimates.
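Continuing the cross-validation sketch from part 2) (reusing x, y, grid, and bestlam defined there), the final refit might be:

# Reuses x, y, grid, and bestlam from the part 2) sketch above.
out <- glmnet(x, y, alpha = 0, lambda = grid)
predict(out, type = "coefficients", s = bestlam)   # all 20 estimates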


Remark: You should expect that none of the coefficients are exactly zero; ridge regression shrinks coefficients toward zero but does not perform variable selection.

