We will revisit the Lending Club data for this week’s assignment. The company has existed since 2007 and has provided millions of personal loans since then. Lending Club announced its IPO in December 2014, after which the company came into the limelight through negative publicity. Lending Club officials were accused of taking aggressive risks by lending money to borrowers with poor creditworthiness. You are asked to study this phenomenon and determine whether the data provides evidence for the claim that Lending Club behaved irresponsibly.
You are given a single combined file of “approved” loan data spanning six years, which supposedly cover the periods before and after the controversy.
Step 1 (30 Points)
The first step is to create two new columns as follows:
a) Comb_Risk_One:
Create a binary column by combining loan grades A and B into one category (Low Risk) and all the remaining grades into another (High Risk).
b) Comb_Risk_Two:
Create a binary column by combining loan grades A, B and C into one category (Low Risk) and all the remaining grades into another (High Risk).
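The two binary columns described above can be sketched in base R as follows. This is a minimal illustration on a toy data frame; the column name `grade` is an assumption about how the loan grade is stored in the file.

```r
# Toy data frame; the column name `grade` is an assumption about the loan file.
loans <- data.frame(grade = c("A", "B", "C", "D", "E", "F", "G"),
                    stringsAsFactors = FALSE)

# Comb_Risk_One: grades A and B are Low Risk, everything else is High Risk.
loans$Comb_Risk_One <- factor(
  ifelse(loans$grade %in% c("A", "B"), "Low", "High"),
  levels = c("Low", "High")
)

# Comb_Risk_Two: grades A, B and C are Low Risk, everything else is High Risk.
loans$Comb_Risk_Two <- factor(
  ifelse(loans$grade %in% c("A", "B", "C"), "Low", "High"),
  levels = c("Low", "High")
)

# Cross-tabulate to confirm each grade landed in the intended category.
print(table(loans$grade, loans$Comb_Risk_One))
print(table(loans$grade, loans$Comb_Risk_Two))
```

Storing the result as a factor with "Low" as the first level means most R modeling functions will treat "High" as the event being predicted.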
Now split the file into two files: one containing the data for 2012, 2013 and 2014, and the other containing the data for 2015, 2016 and 2017.
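One possible way to perform this split in base R is shown below. The column name `issue_year` and the output file names are hypothetical; in the real file the year may need to be extracted from a date column first.

```r
# Toy data; `issue_year` and `loan_amnt` are assumed column names.
loans <- data.frame(
  issue_year = c(2012, 2013, 2014, 2015, 2016, 2017, 2014, 2016),
  loan_amnt  = c(5000, 12000, 8000, 20000, 15000, 7000, 10000, 9000)
)

# Keep the 2012 to 2014 loans in one data frame and the 2015 to 2017 loans in another.
pre_period  <- subset(loans, issue_year %in% 2012:2014)
post_period <- subset(loans, issue_year %in% 2015:2017)

# Write each period to its own file (directory and file names are placeholders).
out_dir <- tempdir()
write.csv(pre_period,  file.path(out_dir, "loans_2012_2014.csv"), row.names = FALSE)
write.csv(post_period, file.path(out_dir, "loans_2015_2017.csv"), row.names = FALSE)
```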
Step 2 (70 Points)
The primary objective is to use the classification techniques learnt so far. Each loan is graded (A to G) based on its risk, with A being the least risky and G the most risky category. You are asked to predict the Low and High risk categories (for each of the two new response variables) using modeling techniques such as Naïve Bayes, KNN, Logistic Regression, and CART. Make sure to look for the following:
a. Outliers based on the independent columns (predictors)
b. Multicollinearity
c. Scaling and standardization of the predictors
d. Train-test split for both files; compare the confusion matrices on the test sets
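Checks a to d above could be approached roughly as in the following base R sketch. The data are simulated, the predictor names are hypothetical stand-ins, and the 1.5 × IQR rule, the 0.9 correlation cutoff, and the 70/30 split are illustrative choices rather than requirements of the assignment.

```r
set.seed(42)  # reproducible toy data and split

# Simulated predictors; all column names are hypothetical stand-ins.
n <- 200
loans <- data.frame(
  loan_amnt  = round(runif(n, 1000, 35000)),
  annual_inc = round(rlnorm(n, meanlog = 11, sdlog = 0.5)),
  dti        = runif(n, 0, 40)
)
# Built to be nearly collinear with loan_amnt, to give check b something to find.
loans$installment <- loans$loan_amnt / 36 + rnorm(n, sd = 5)
predictors <- c("loan_amnt", "annual_inc", "dti", "installment")

# a. Outliers: count values beyond 1.5 * IQR from the quartiles, per predictor.
iqr_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}
outlier_counts <- sapply(loans[predictors], iqr_outliers)

# b. Multicollinearity: flag predictor pairs with |correlation| above 0.9.
cor_mat  <- cor(loans[predictors])
high_cor <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)

# d. Train-test split (70/30), done before scaling so the test set stays unseen.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- loans[train_idx, ]
test  <- loans[-train_idx, ]

# c. Standardize using the training means and standard deviations only.
train_scaled <- scale(train[predictors])
test_scaled  <- scale(test[predictors],
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))

print(outlier_counts)
print(high_cor)
```

Scaling the test set with the training statistics keeps information from the test rows out of the fitted models, which matters most for distance-based methods such as KNN.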
Produce a “well documented and explained” knitted R Markdown file that analyzes the data and reports findings for the model with the highest classification ability. Also describe the features of the loans that are not classified correctly: create a confusion matrix and run descriptive statistics on the misclassified cases. Provide any EDA and visuals needed to support your analysis.
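As one illustration of the confusion matrix and the misclassification summary, the sketch below fits a logistic regression with base R's `glm` on simulated data. The predictor names `int_rate` and `dti`, the simulated response, and the 0.5 probability cutoff are all assumptions made for the example.

```r
set.seed(1)  # reproducible toy data and split

# Simulated data; column names are hypothetical.
n <- 300
loans <- data.frame(
  int_rate = runif(n, 5, 25),
  dti      = runif(n, 0, 40)
)
# Simulated response: higher interest rates make "High" risk more likely.
p_high <- plogis(-6 + 0.4 * loans$int_rate)
loans$Comb_Risk_One <- factor(ifelse(runif(n) < p_high, "High", "Low"),
                              levels = c("Low", "High"))

train_idx <- sample(seq_len(n), size = 210)
train <- loans[train_idx, ]
test  <- loans[-train_idx, ]

# Logistic regression; glm models the probability of the second level ("High").
fit <- glm(Comb_Risk_One ~ int_rate + dti, data = train, family = binomial)
prob_high <- predict(fit, newdata = test, type = "response")
test$predicted <- factor(ifelse(prob_high > 0.5, "High", "Low"),
                         levels = c("Low", "High"))

# Confusion matrix on the test set: rows are actual, columns are predicted.
conf_mat <- table(Actual = test$Comb_Risk_One, Predicted = test$predicted)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Descriptive statistics for the loans the model got wrong.
misclassified <- test[test$predicted != test$Comb_Risk_One, ]
print(conf_mat)
print(accuracy)
print(summary(misclassified[c("int_rate", "dti")]))
```

The same confusion-matrix and misclassification steps apply to the other models; for instance, `naiveBayes` from the e1071 package, `knn` from the class package, and `rpart` from the rpart package could supply the predictions in place of `glm`.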