Assignment #4
Due: 11 pm, Friday, November 20th
This assignment focuses on material from the end of Chapter 7 and the beginning of Chapter 9. In Chapter 8, various models for classification were introduced, but it is not practical to try using any of them manually or in Excel. Nonetheless, the concepts and ideas behind these models are important to understand.
This assignment is based upon the student surveys from fall 2019 and 2020. You looked at this data in Assignment 2. In particular, you looked at pivot tables that involved percentages. These percentages can be interpreted as probabilities. We will revisit these analyses, but manually.
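As a rough illustration of reading a pivot-table percentage as a probability, the sketch below divides one cell count by its row total to estimate a conditional probability. The counts, the "Yes"/"No" response, and the function name are hypothetical and are not taken from the survey file.

```python
# Hypothetical pivot-table counts, for illustration only.
# Keys are (row category, column category); values are counts.
counts = {
    ("Nova Scotia", "Yes"): 30,
    ("Nova Scotia", "No"): 10,
    ("International", "Yes"): 12,
    ("International", "No"): 28,
}


def conditional_probability(table, row, col):
    """Estimate P(col | row) as the cell count divided by the row total."""
    row_total = sum(n for (r, _), n in table.items() if r == row)
    return table[(row, col)] / row_total


p_ns = conditional_probability(counts, "Nova Scotia", "Yes")
p_intl = conditional_probability(counts, "International", "Yes")
print(f"P(Yes | Nova Scotia)   = {p_ns:.2f}")
print(f"P(Yes | International) = {p_intl:.2f}")
```

A "percent of row total" entry in a pivot table is this same ratio, expressed as a percentage.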
You also looked at average summer earnings and found some differences depending upon where students were from. We will look at this topic again, but using regression analysis with nominal variables. Missing values interfere with the necessary calculations, so regression analysis does not permit missing values for any variable in the model. For this reason, I have “cleaned” the file by removing records for students who did not report earnings or had earnings of zero. I have also removed or edited some records that had values that appeared to be invalid.
The edited file is called Survey 2019 and 2020 Assignment4.xlsx.
You will observe that approximately 80 observations have been deleted because of missing, zero, or invalid summer earnings.
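The kind of filtering described above might look like the following minimal sketch. The records, the field name "earnings", and the helper function are assumptions made for illustration; this is not the procedure actually used to prepare the file, and it does not attempt the "invalid value" edits.

```python
# Hypothetical records; the field name "earnings" is an assumption.
records = [
    {"id": 1, "earnings": 4200.0},
    {"id": 2, "earnings": None},    # did not report earnings
    {"id": 3, "earnings": 0.0},     # reported zero earnings
    {"id": 4, "earnings": 6100.0},
]


def has_usable_earnings(record):
    """Keep a record only if earnings are present and greater than zero."""
    value = record["earnings"]
    return value is not None and value > 0


cleaned = [r for r in records if has_usable_earnings(r)]
removed = len(records) - len(cleaned)
print(f"kept {len(cleaned)} records, removed {removed}")
```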
In Assignment #2 we observed that summer earnings were lower for international students and highest for Nova Scotian students. Home was a nominal variable with many values. Even reducing it to 3 values (Nova Scotia, Canada, and International) makes it a new challenge to include in a regression model. Two new variables have been created.
Intl = 1 if the student is from outside Canada and 0 otherwise.
Canada = 1 if the student is Canadian but from outside Nova Scotia and 0 otherwise.
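The coding of the two variables can be sketched as below, assuming Home has already been reduced to the three labels "Nova Scotia", "Canada", and "International". The function name is hypothetical. Under this coding, Nova Scotia students have both variables equal to 0, so they act as the baseline group in a regression.

```python
def home_dummies(home_group):
    """Return (Intl, Canada) for one of the three reduced Home values.

    Assumes home_group is "Nova Scotia", "Canada", or "International".
    Nova Scotia is the baseline: both indicators are 0.
    """
    intl = 1 if home_group == "International" else 0
    canada = 1 if home_group == "Canada" else 0
    return intl, canada


for group in ["Nova Scotia", "Canada", "International"]:
    intl, canada = home_dummies(group)
    print(f"{group}: Intl={intl}, Canada={canada}")
```

Only two indicators are needed for three categories, because the third category is identified by both indicators being 0.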
We had also observed that summer earnings appeared to be lower in 2020 than in 2019.
Class01 = 1 if the student is in the 2020 class and 0 if s/he is in the 2019 class.
Often when looking at earnings data, we observe that earnings are lower for females.
Gender01 = 1 if female and 0 if male.
Two numeric variables with few missing values were age and high school average. Older students are likely to have more work experience and thus may have higher earnings. Some may actually be working full time and studying part time.
Does high school average have any relationship with summer earnings? I don’t know, but I chose to look and see.
1. In Assignment 3, you chose the first variable by looking at the correlation between the outcome and the input variables. Several of the input variables are binary. Correlation measures linear relationships. Does it make sense to calculate correlations with binary variables? If you recall, our first model for classification had a binary outcome, but we still fit the model. What happens here?
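One way to explore this question is to compute an ordinary Pearson correlation between a 0/1 variable and a numeric outcome and compare it with the difference in group means. The sketch below uses made-up values for Intl and earnings, so the numbers say nothing about the survey itself. It also computes the slope of a simple regression of earnings on the 0/1 variable, which equals the difference between the two group means.

```python
from math import sqrt

# Hypothetical data for illustration only; not taken from the survey file.
intl = [0, 0, 0, 0, 1, 1, 1, 1]
earnings = [6000, 5500, 7000, 6500, 4000, 4500, 3500, 5000]


def sums_of_squares(x, y):
    """Return (Sxy, Sxx, Syy), the centered sums of cross-products and squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy, sxx, syy


sxy, sxx, syy = sums_of_squares(intl, earnings)
r = sxy / sqrt(sxx * syy)   # Pearson correlation
slope = sxy / sxx           # slope of earnings regressed on the 0/1 variable

# Difference in mean earnings between the group coded 1 and the group coded 0.
group1 = [e for i, e in zip(intl, earnings) if i == 1]
group0 = [e for i, e in zip(intl, earnings) if i == 0]
diff = sum(group1) / len(group1) - sum(group0) / len(group0)

print(f"correlation = {r:.3f}")
print(f"regression slope = {slope:.1f}, difference in means = {diff:.1f}")
```

With a 0/1 input, the sign of the correlation matches the sign of the difference in group means, which is one way to interpret a correlation with a binary variable.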