We are given a dataset comprising certain characteristics of 27,504 individuals from the United States.

statistics

Description

1 Introduction and Summary 

1.1 Objective 

We are given a dataset comprising certain characteristics of 27,504 individuals from the United States. These characteristics include the following: Continuous variables: age, capital_gain, and hrs_wkly. Categorical variables: workclass, education, marital_status, occupation, relationship, race and gender. Response variable (binary): Earnings. The variable Earnings takes value 1 if an individual earns more than $50k in a year and 0 otherwise. Our aim is to use the remaining ten variables to predict if an individual makes more than $50k or not. We use R to come up with a satisfactory logistic regression model for Earnings in terms of the other variables. In doing so, we aim to obtain as parsimonious a model as possible – one that is simple, fits the data well and can be easily interpreted. 


1.2 Summary of Results 

We present our proposed model as an equation involving the true probability  of an individual earning more than $50k, the linear predictor , the intercept, the explanatory variables and their coefficients

For each of the categorical variables education, relationship and occupation, we have merged some levels together and given them new names. (See Section 2.2.1 for more details.) Using the merged levels, we obtain the following estimates for the coefficients: 


Related Questions in statistics category