1 Introduction and Summary
1.1 Objective
We are given a dataset comprising certain characteristics of 27,504 individuals from the United States.
These characteristics include the following:
Continuous variables: age, capital_gain, and hrs_wkly.
Categorical variables: workclass, education, marital_status, occupation, relationship, race and gender.
Response variable (binary): Earnings.
The variable Earnings takes value 1 if an individual earns more than $50k in a year and 0 otherwise. Our
aim is to use the remaining ten variables to predict if an individual makes more than $50k or not.
We use R to come up with a satisfactory logistic regression model for Earnings in terms of the other
variables. In doing so, we aim to obtain as parsimonious a model as possible – one that is simple, fits the
data well and can be easily interpreted.
1.2 Summary of Results
We present our proposed model as an equation involving the true probability of an individual earning
more than $50k, the linear predictor , the intercept, the explanatory variables and their coefficients
For each of the categorical variables education, relationship and occupation, we have merged some levels
together and given them new names. (See Section 2.2.1 for more details.) Using the merged levels, we
obtain the following estimates for the coefficients:
Get Free Quote!
320 Experts Online