Hackernoon logoLearning to understand possible effects variables such as income and physical activity have on… by@emmanuels

Learning to understand possible effects variables such as income and physical activity have on…

Emmanuel Sibanda Hacker Noon profile picture

@emmanuelsEmmanuel Sibanda

I am an aspiring data scientist and entrepreneur

Mental health is increasingly becoming a topical issue that has traditionally been glossed over. We have began to understand the impact mental health has on productivity, general wellbeing, relationships and physical health and placed greater focus on mental wellbeing. Even employers have began placing greater emphasis on providing work environments and conditions that keep their employees as happy and healthy as possible, with progressive companies providing perks such as monthly massages, catered meals, free and subsidised gym and unlimited vacation. In light of this and as an individual interested in developing my programming and data science skills I was also partially motivated by a data visualisation and analysis project on a Coursera Statistics specialisation I was studying. I decided to play around with data from the Centers for Disease Control and Prevention (CDC), extracting the 2013 Behavioral Risk Factor Surveillance System data (BRFSS). This data is collected through monthly telephonic interviews and is one of the largest datasets focused on understanding trends in preventative health practices and risk behaviours that link to chronic diseases injuries and preventable infectious diseases affecting adults in the US. The goal of my analysis is to understand the relationship between income level and one's reported mental health and the impact how much you earn per year, how physically active you are and the amount of fruits and vegetables you eat has on your mental health.

Extracting data and understanding structure

load("~/Documents/R Stats Assignments/brfss2013.RData")

After uploading the necessaries libraries and loading my data, I wanted to look at the general shape and structure of the dataset to understand the number of dimensions, the types of data in each column and the clean up I would need to run to get my data in a consistent format to facilitate my analysis.

> dim(brfss2013)
[1] 491775 330
#check structure of columns
#check for NA values

After reviewing the provided codebook/index, I had a good idea of the columns I would need to interact with and decided to focus specifically on these columns. This included columns with data pertaining to minutes of physical activity per week, fruits eaten per week, vegetables consumed per week, BMI, annual income ranges and of-course mental health measured by the number of days a person reported to not have been in good mental health in a given period. Since NA values can obfuscate the accuracy of an analysis and potentially introduce a non-response bias, I looked for NA values in columns these ranged from 4.8% to 34% of the data. I assumed that these were non ignorable missing values and could not simply be dropped from my analysis. For numerical values in cases where 34% of the data was missing, I made the assumption that simply filling in with the standard mean or median would skew my analysis and reduce its accuracy. I opted instead for an easy to implement and relatively accurate imputation method to replace my NA values.

Multiple imputation by bayesian linear regression

One of the quickest ways to impute values accurately is through a linear regression algorithm that attempts to predict potential values based on a pre-defined list of of predictors. With the frequentist approach (ordinary linear regression) we assume a linear relationship between the independent variable, additionally making an assumption that there are enough values to make an accurate prediction. However, for this imputation I opted for the Bayesian model, with this approach we assume both the response and predictors are random (factoring in uncertainty in our imputations),try to look for the probability of a particular event, update the probability based on prior data and make predictions based on these probabilities. The ability to incorporate prior information to update probabilities makes the bayesian model more flexible and possibly more accurate in making predictions, particularly for smaller samples.

Multiple imputation lets me run through the imputation model n times providing different possible imputations based on the prediction model used. Typically this is a three stage process, that involves the actual imputation, substituting each NA with a possible prediction and then the analysis and pooling of all results.

brfss2 <- brfss2013[,c('X_bmi5','X_drnkmo4','X_frutsum','X_vegesum','fc60_','pa1min_')]
imp_model <- mice(brfss2,method="norm",m=5)
#fill na function
fillna_fun <- function(data,columns){
df <- setNames(data.frame(rowMeans(squeeze(imp_model$imp[[columns]], bounds = c(0,max(brfss2013[[columns]]))))),"col2")
brf <- setNames(data.frame(data[[columns]]),"col2")
brf$col1 <- rownames(brf)
df$col1 <- rownames(df)
setDT(brf)[df,col2 :=i.col2,on=.(col1)]
brfss2013$pa1min_ <- fillna_fun(brfss2,"pa1min_")
brfss2013$X_bmi5 <- fillna_fun(brfss2,"X_bmi5")
brfss2013$X_drnkmo4 <- fillna_fun(brfss2,"X_drnkmo4")
brfss2013$X_frutsum <- fillna_fun(brfss2,"X_frutsum")
brfss2013$X_vegesum <- fillna_fun(brfss2,"X_vegesum")

The Multivariate Imputation by Chained Equations (mice) package allows for easy imputation using the bayesian linear regression. I opted to narrow down the imputation to apply to columns with numerical values that would be required for my analysis and ran the imputations the default number of times -5.

I noticed that bayesian linear regression returned negative numbers for some of the imputations which was impossible given the data these columns where measuring. For example it is impossible for a person to spend -300 minutes a week working out. To address this I used the squeeze function to add a constraint to limit the predictions to be within a plausible custom-defined range.

Contrary to the mice documentation, I opted to get the mean of all 5 imputations and use that to fill my NA values. I completely understand that this is not how machine learning should be used, that I missed the essential step of pooling my combined imputations and that some would argue that I would be better off picking one imputation than the average. Despite this, I adopted this option for ease of application and to hopefully receive further feedback from readers about this option.

I finished off the function by coercing the vectors in my function to a table using the setDT function and joining the vectors by index since the mean imputations vector only had values for indexes corresponding to NA values in my original dataframe.


An important step of my analysis involved mapping the numerical values for some columns into groups and ranges. For example, instead of seeing BMI numbers I wanted to see which group/range a particular BMI belonged to (e.g 25 = normal).

brfss2013$income2 <- as.character(brfss2013$income2)
brfss2013$X_bmi5 <- brfss2013$X_bmi5/100
brfss2013$healtheat <- (brfss2013$X_frutsum+brfss2013$X_vegesum)/100
labels <- c('Excellent','Good','Ok','Bad','Very Bad')
breaks <- c(0,5,10,15,25,10000)
bmiLabs <- c('10','20','30','40','50','60','>60')
bmiBreaks <-c(0,10,20,30,40,50,60,10000)
activLabs <-c('0-200','200-500','500-1000','1000-2000','2000-4000','4000-10000','>10000')
activBreaks <-c(0,200,500,1000,2000,4000,10000,100000)
brfss2013 <- brfss2013 %>%
mutate(mentalHealth = cut(menthlth,breaks=breaks,labels=labels,include.lowest=TRUE)) %>%
mutate(bmiLev = cut(X_bmi5, breaks=bmiBreaks,labels=bmiLabs,include.lowest = TRUE)) %>%
mutate(physLev = cut(pa1min_, breaks=activBreaks,labels=activLabs,include.lowest = TRUE)) %>%
mutate(incomeLev = case_when(grepl("15|20",income2)~"0-$20k",
income2 %in% "Less than $75,000" ~ "50-$75k",

Since cut is designed to take a numerical vector and split it into bins according to set of custom breakpoints, I decided to use the cut function, defining my breaks and labels to map the data according to my labels definitions. For the income level column, I converted the values to characters to enable me to use grepl. This is a pattern matching function that I use to create income ranges by looking for keywords that match the custom words I used to identify each row's income range.

It is important to note is that I made a few assumptions regarding mental health. I assumed that a person who only reports having mental health issues 0–5 times in a given period would be classified as being in excellent mental health, 5–10 in good health, 10–15 ok, 15–25 bad and any thing exceeding that very bad. This was based on observing the data and finding the ideal break points to make such classifications.

The relationship between mental health and annual income

I identified the stacked bar-plot as the most effective and appealing way to visualise the relationship between mental health and range of annual income range.

#replace NA with missing
brfss2013$mentalHealth <- forcats::fct_explicit_na(brfss2013$mentalHealth, na_level = "Missing")
#convert income back to factor
brfss2013$incomeLev <- as.factor(brfss2013$incomeLev)
brfss2013 <- subset(brfss2013, !is.na(incomeLev))
brfss2013 %>%
add_count(incomeLev) %>%
rename(count_inc = n) %>%
count(incomeLev, mentalHealth, count_inc) %>%
rename(count_mentalHealth = n) %>%
mutate(percent= count_mentalHealth / count_inc) %>%
mutate(incomeLev = factor(incomeLev,
ggplot(aes(x= incomeLev,
y= count_mentalHealth,
group= mentalHealth)) +
xlab('Annual Income')+ylab('Number of People')+
# Using the scales package does the percent formatting for me
geom_text(aes(label = scales::percent(percent)),position = position_stack(vjust = 0.5))+

Using the dplyr package, I used the %>% operator to write multiple operations that would tally income levels and mental health by group, find the relative percentage of reported mental health numbers by income level, group the barplots according to my customisation and visualise the stacked barplot and adding percentage labels on each plot.

Mental Health Distribution by Annual Income Range

Using the assumptions I have created about what constitutes good mental health, the data appears to confirm what one would expect, the more money you make the more likely you are to be in a better mental state.

What do I do with NA values in income level?

From the mental health distribution by annual income range visualisation it is evident that there are a lot of NA values in the income level column. While I could simply drop the NA values, assume they will not alter the accuracy of my analysis and move on, I believe that a lot of people may not have been willing to talk about their income level possibly. Meaning a significant portion of NAs came from people who were not comfortable sharing their annual income over the phone. It could be the case that people who are not comfortable sharing information about their earnings are on the lower income range (0–20k) or very high earners (over 70k), leaving out these rows which are over 15% of income data, could easily introduce a non response bias. I would like to impute the potential values of these NAs through machine learning. This time I opted to impute my ordered data through a polytomous regression model. This method uses the proportional odds logistic regression model, the mechanics of which are discussed in a bit more detail in following paragraph.

#filling na values of income level column
brfss <- brfss2013[,c('incomeLev','healtheat','X_age_g','employ1','renthom1','sex','physLev')]
ordered_brfss <-mice(brfss, m=1, method='polr', maxit=1)
fillna_inc <- function(data,columns){
df <- setNames(data.frame(ordered_brfss$imp[[columns]]),"col2")
brf <- setNames(data.frame(data[[columns]]),"col2")
brf$col1 <- rownames(brf)
df$col1 <- rownames(df)
setDT(brf)[df,col2 :=i.col2,on=.(col1)]
brfss2013$incomeLev_ <- fillna_inc(brfss2013,"incomeLev")

For the purposes of this imputation I focus on a few columns that I believed would act as the biggest predictors of potential income. I focused specifically on whether the individual reported renting or owning a home, whether s/he was employed, the individual's gender, age, level of physical activity and the amount of fruits and vegetables the respondent reported. I opted to run a single imputation, purely for ease of application.

Visualising mental distribution by income level

We all love data viz!

I wanted to see how my mental health distribution by income level would look after this imputation and built another visualisation.

Mental Health Distribution by Annual Income Range

Proportional Odds Ratio

The act of mapping out mental health into groups such as good, ok, excellent, transformed this column to a factor with 5 levels. These factored are ordered according to levels. Given the structure of this column I opted to use the proportional odds logistic regression. For the purposes of my analysis, I need to choose a model that would help me understand the probability of an individual’s mental health falling under a certain category given the independent variables I have chosen and understanding what impact these variables have on probability. While linear regression may, to a certain extent, work for binary classification problems (e.g if the response variable is whether an email is spam or not) it falls apart when you have multiple categories for your response variable. Ordinal logistic regression transforms that linear model to cater for ordered response categories using the logit function, ensuring that the probabilities returned fall within the 0 to 1 range. I could have instead opted for the more flexible multinomial logistic regression model, however this relies on the assumption that the categories in the response variable cannot be ordered in any meaningful way- an assumption that I believe does not apply to my mental health response variable.

My model relies on calculating the proportional odds ratio. Put in simple terms the proportional odds ratio is a tool used in ordinal logistic regression to help make assumptions about relationships between independent variables and ordered response factors/categories by predicting the conditional probability of belonging the response variable falling under a particular category given the values in the independent variables being observed.

Probability Odds Ratio with multiple explanatory variables

J in this instance represents the categories/factors we are trying to predict for. The formula stated above takes in n number of factors with n dependent on the number of factors I have in my dependent variable. The numbers refer to the independent variables I am using to build the model.

No Math

I use this formula to understand the independent variables that have the greatest relationship with our outcome(mental health)- and try to understand the extent the independent variables have an effect on mental health.

brfss2_model = polr(mentalHealth ~ incomeLev+bmiLev+X_drnkmo4+healtheat+physLev,data=brfss2013,Hess=TRUE) #Hessian used to get standard errors

I used the polr() function to fit a proportional odds logistics regression to the response variable(mental health) and predictors variables. In this case I decided to include BMI, total sum of fruits and vegetables consumed in a month, physical activity level, range of annual income and alcoholic beverages drank over a monthly period. I included the observed information matrix by adding the TRUE argument to the Hessian in order to get the standard errors and try and assess the applicability of this model to the data.

#calculating p_value by comparing the t-value against the stnd norm distr similar to a z-test
p<-pnorm(abs(ctable[, "t value"]),lower.tail = FALSE)*2
#combining p-value
(ctable<-cbind(ctable,"p value"=p))

I proceeded to find the regression coefficients and calculate the p_value to determine the significance of these results. The regression coefficients ranges from 0.001 for alcoholic beverages per month to a little over 2.55 for a group in BMI level and the p-values indicated statistical significant. From my understanding, the closer the regression coefficient is to zero the more it indicates that the distribution of predictors is exactly the same for each level of our response variable. For coefficients approaching zero this loosely means that the effect of adding more units to an independent variable on the response variable is close to zero.

Visualising the effects of the independent variables

To clearly show the effects of the independent variables on mental health, I decided to use the effects package. Since the polr function returns estimated regression coefficients, I used this package to help me interpret my polr results.

plot(Effect(focal.predictors = c("incomeLev","bmiLev"),brfss2_model))

After a trying a few combinations I opted to show the annual income and BMI effect plot. I used these two variables as the focal predictors, holding the other predictors with typical values (fixed and averaged).

Income Level * BMI effect on Mental Health

According to the result we see that the probability of being in excellent mental health decreases as your BMI increases regardless of your income range. Interestingly, the decrease seems to taper off as you approach the $75 000 per year income mark. We can see a similar effect in the very bad mental health state to a smaller extent, the wealthier you are, the less of an effect an increase in BMI may have on your mental health. If you are going to gain weight you may be in a better off mental state if you do so while increasing your income.

One of the most important lessons I am learning, aside from how little I know, is that data analysis and data science is an iterative process. You try out different variables and models, clean up methods and visualisations based on changing assumptions and test what impact this has on your result and its accuracy.

Feel free to reach out or send any feedback on Twitter @Emmoemm


Join Hacker Noon

Create your free account to unlock your custom reading experience.