The file bestsubset.py defines a function called bestsubset that takes two arguments: a DataFrame X and an array y. If X has k columns, then there are k candidate variables to include in our model y=f(X). We consider only linear models in this assignment, and bestsubset uses 5-fold cross-validation to compare every possible linear model that includes one or more of the predictors in X. The function returns a list with two elements: a list of the column indices included in the best model, and an array of coefficient estimates for that model. You should assume that X does NOT have a column for the intercept, so you should have LinearRegression fit the intercept (i.e., use the default setting).
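To make the description above concrete, here is a minimal sketch of what such a function might look like. It is not the contents of bestsubset.py: the name bestsubset_sketch, the use of scikit-learn's cross_val_score, and the choice of negative mean squared error as the comparison score are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def bestsubset_sketch(X, y, cv=5):
    """Hypothetical stand-in for bestsubset; the real file may differ."""
    k = X.shape[1]
    best_score, best_cols = -np.inf, None
    # Outer loop: subset size. Inner loop: every subset of that size.
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            model = LinearRegression()  # fits the intercept by default
            # Assumed criterion: negative MSE, averaged over the folds.
            scores = cross_val_score(
                model, X.iloc[:, list(cols)], y,
                cv=cv, scoring="neg_mean_squared_error",
            )
            if scores.mean() > best_score:
                best_score, best_cols = scores.mean(), list(cols)
    # Refit the winning model on all rows to obtain its coefficients.
    final = LinearRegression().fit(X.iloc[:, best_cols], y)
    return [best_cols, final.coef_]
```

The returned list follows the shape described above: column indices first, then the coefficient array for those columns (the intercept is fitted but not included in the array).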
The function has one serious drawback: as the number of predictors grows, it becomes very slow.
The inner loop runs about 2^k times in total, since there are 2^k - 1 nonempty combinations of
k predictors. Thus, this function is O(2^k), considerably slower than any algorithm we discussed in our
unit on algorithmic efficiency.
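The growth in the count above can be checked by brute force. The short sketch below enumerates the nonempty subsets for a few values of k; the helper name count_nonempty_subsets is made up for this illustration, and the factor of 5 assumes one model fit per cross-validation fold.

```python
from itertools import combinations


def count_nonempty_subsets(k):
    """Count candidate models by enumerating every nonempty subset."""
    return sum(
        1
        for size in range(1, k + 1)
        for _ in combinations(range(k), size)
    )


for k in (5, 10, 15):
    n_models = count_nonempty_subsets(k)  # equals 2**k - 1
    # With 5-fold cross-validation, each model is fit 5 times.
    print(k, n_models, 5 * n_models)
```

Each additional predictor roughly doubles the number of models, so going from k = 10 to k = 15 multiplies the work by about 32.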