Regression Analysis on Life Expectancy

@jnyh James N

perpetual student | fitness enthusiast | passionate data scientist

Models used: Linear, Ridge, LASSO, Polynomial Regression
Python code is available on my GitHub

I was exploring the dengue trend in Singapore after a recent spike in dengue cases, especially in the Dengue Red Zone where I live. However, I was unable to scrape raw data from the NEA website.

I was wondering:
Has dengue affected the life expectancy of people in any country?
Do people in rich nations live longer?
What are the factors affecting life expectancy of a country?

So I explored life expectancy and looked for data on the following aspects (features):
Birth Rate
Cancer Rate
Dengue Cases
Environmental Performance Index (EPI)
Gross Domestic Product (GDP)
Health Expenditure
Heart Disease Rate
Population Density
Stroke Rate

The target is Life Expectancy, measured in years.

The assumptions are:
1. All values are country-level averages
2. There is no distinction between male and female

First, I checked for multicollinearity between features. There seems to be some strong collinearity, denoted by the boxes in dark red and dark blue.

For example, countries that spend more on health tend to have a higher EPI score; where health expenditure is higher, the stroke rate is lower; and countries with a larger area tend to have a larger population.
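The collinearity check above can be sketched with a plain correlation matrix. This is only an illustration on synthetic data: the column names, the relationships between them, and the 0.8 threshold are assumptions, not the data or settings used in the analysis.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; names and relationships are made up for illustration.
rng = np.random.default_rng(0)
n = 200
health_exp = rng.normal(size=n)
df = pd.DataFrame({
    "health_expenditure": health_exp,
    "epi": 0.9 * health_exp + 0.3 * rng.normal(size=n),          # moves with spending
    "stroke_rate": -0.8 * health_exp + 0.4 * rng.normal(size=n),  # moves against it
    "birth_rate": rng.normal(size=n),                             # unrelated
})

corr = df.corr()  # Pearson correlation between every pair of columns


def strong_pairs(corr_matrix, threshold=0.8):
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    cols = list(corr_matrix.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr_matrix.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, round(float(r), 2)))
    return pairs


print(corr.round(2))
print(strong_pairs(corr))
```

Pairs returned by `strong_pairs` correspond to the dark red (positive) and dark blue (negative) boxes one would look for in a heatmap of `corr`.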

How about the correlation between features and target?
To live a long life, you should have a low stroke rate, high health expenditure, take good care of the environment, and have fewer babies (according to the correlation chart).

Let's look at the initial pair plot.

There seems to be a need to remove outliers in many features, e.g. Dengue Cases, GDP, Population, Area, and Population Density.

Each outlier is replaced by the next highest value in the column. After removing the outliers, the plots are still skewed to the right (points are heavily concentrated on the left side), which suggests that some transformation might be needed.

Another way to deal with outliers is to apply the LOG function, which compresses the extreme values and spreads out the data concentrated on the left.
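The two outlier treatments described above might look like the sketch below. The numbers are made up, `cap_top_outlier` is a hypothetical helper name, and the natural log is an assumption since the base of the LOG function is not stated.

```python
import numpy as np
import pandas as pd


def cap_top_outlier(series):
    """Replace the largest value with the next highest distinct value."""
    distinct = np.sort(series.unique())
    if len(distinct) < 2:
        return series.copy()  # nothing to cap against
    return series.clip(upper=distinct[-2])


# Made-up column with one extreme value.
gdp = pd.Series([1.2, 3.5, 2.8, 4.1, 950.0], name="gdp")
capped = cap_top_outlier(gdp)
print(capped.tolist())

# A log transform on a right-skewed (lognormal) synthetic sample.
rng = np.random.default_rng(0)
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=500))
logged = np.log(raw)  # requires strictly positive values
print("skew before:", raw.skew(), "skew after:", logged.skew())
```

Capping changes only the extreme rows, while the log transform rescales every row, which is why it also helps with the remaining right skew.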

Feature Selection

To look for significant features, I dropped one feature at a time to see its impact on the simple regression model. Looking at the R² score, I chose these 3 features (Birth Rate, EPI, Stroke Rate), because the model is adversely affected without them.
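A minimal sketch of the drop-one-feature comparison, assuming a simple train/validation split and synthetic data. The feature names, coefficients, and the helper `r2_without` are hypothetical and only illustrate the idea.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: three informative features and one pure-noise feature.
rng = np.random.default_rng(1)
names = ["birth_rate", "epi", "stroke_rate", "noise_feature"]
X = rng.normal(size=(300, 4))
y = 70 - 3 * X[:, 0] + 4 * X[:, 1] - 2 * X[:, 2] + rng.normal(size=300)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)


def r2_without(drop_idx=None):
    """Validation R² of a linear model fitted without column drop_idx."""
    keep = [i for i in range(X.shape[1]) if i != drop_idx]
    model = LinearRegression().fit(X_tr[:, keep], y_tr)
    return model.score(X_va[:, keep], y_va)


baseline = r2_without()  # all features kept
drops = {name: baseline - r2_without(i) for i, name in enumerate(names)}

# Features whose removal costs the most R² come first.
for name, loss in sorted(drops.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} R² loss when dropped: {loss:.3f}")
```

A feature whose removal barely changes R² is a candidate for dropping; a large loss marks it as one to keep.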

Next, I removed outliers and reviewed the p-values in Statsmodels, which gave me one more significant feature (Population Density). I chose 5% as the significance level, so a feature is considered significant when its p-value is less than 0.05.
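To show what those p-values measure, here is a sketch that computes two-sided t-test p-values for ordinary least squares coefficients by hand with NumPy and SciPy. It is a stand-in for the Statsmodels summary rather than the code used in the analysis; the data is synthetic and `ols_p_values` is a hypothetical helper.

```python
import numpy as np
from scipy import stats

# Synthetic data: columns 0 and 1 matter, column 2 is pure noise.
rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 3))
y = 65 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)


def ols_p_values(X, y):
    """Two-sided t-test p-values for OLS coefficients (intercept first)."""
    A = np.column_stack([np.ones(len(X)), X])      # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares fit
    resid = y - A @ beta
    dof = A.shape[0] - A.shape[1]                  # residual degrees of freedom
    sigma2 = resid @ resid / dof                   # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)          # coefficient covariance
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stats), dof)


p = ols_p_values(X, y)
significant = p[1:] < 0.05  # skip the intercept; 5% significance level
print(p.round(4), significant)
```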

After that, I applied the LOG function to all features and gained several more significant features (GDP, Heart Disease Rate, Population, Area).

I also tried other transformations (e.g. Reciprocal, Power 2, Square Root), but they gave no further improvement.

Features can also be selected using the LassoCV estimator in scikit-learn.
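A minimal sketch of LassoCV-based selection, assuming standardised features and synthetic data. The feature names and the 5-fold setting are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two columns drive the target.
rng = np.random.default_rng(3)
names = ["epi", "stroke_rate", "gdp", "area", "population", "dengue_cases"]
X = rng.normal(size=(250, 6))
y = 70 + 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=250)

# The L1 penalty is scale-sensitive, so standardise the features first.
X_std = StandardScaler().fit_transform(X)

# LassoCV picks the penalty strength alpha by cross-validation.
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)

# Features with a non-zero coefficient are the ones LASSO kept.
selected = [name for name, c in zip(names, lasso.coef_) if abs(c) > 1e-8]
print("chosen alpha:", lasso.alpha_)
print("selected features:", selected)
```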

Finally, I looked at the pair plot again with all significant features. The scatter plots are now nicely spread out with some clear trends.

Model Selection

I am now ready to fit the following models on the training data set:

Linear Regression (a straight line which approximates the relationship between the independent variables, i.e. the features, and the dependent target variable)

Ridge Regression (this reduces model complexity by shrinking coefficients while keeping all of them in the model, using an L2 penalty)

LASSO Regression (Least Absolute Shrinkage and Selection Operator reduces model complexity by shrinking coefficients, some of them to exactly zero, using an L1 penalty)

Degree 2 Polynomial Regression (a curve which approximates the relationship between the independent variables and the dependent target variable)

I also validated their performance on the validation data set. The simple linear regression model looks like the best performing model.
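The four models can be fitted and compared on a validation split roughly as follows. This is a sketch on synthetic data; the penalty strengths (`alpha=1.0`, `alpha=0.1`) and the 80/20 split are assumptions, not the settings used in the analysis.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a truly linear relationship.
rng = np.random.default_rng(4)
X = rng.normal(size=(250, 4))
y = 70 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=250)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),   # L2 penalty
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),   # L1 penalty
    "poly2": make_pipeline(
        StandardScaler(), PolynomialFeatures(degree=2), LinearRegression()
    ),
}

# Fit on the training split, score (R²) on the validation split.
val_r2 = {name: m.fit(X_tr, y_tr).score(X_va, y_va) for name, m in models.items()}
for name, score in val_r2.items():
    print(f"{name:7s} validation R²: {score:.3f}")
```

When the underlying relationship is close to linear, as in this synthetic example, the extra flexibility of the polynomial model and the penalties of Ridge and LASSO tend to add little over plain linear regression.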

This is confirmed by cross-validation using KFold (with 5 splits).
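A short sketch of 5-split KFold cross-validation with scikit-learn, again on synthetic data. Shuffling and the fixed random seed are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data.
rng = np.random.default_rng(5)
X = rng.normal(size=(250, 4))
y = 70 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=250)

# Each of the 5 folds serves once as the held-out set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R² per fold:", scores.round(3))
print("mean R²:", scores.mean().round(3), "std:", scores.std().round(3))
```

Changing `n_splits` gives other fold counts; comparing the mean and spread of the fold scores across models is what supports picking one over another.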

Finally, I checked the residuals against the model assumptions. The residuals should be normally distributed with equal variance around a mean of zero. The normal quantile-quantile (Q-Q) plot also looks acceptably normal.
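These residual checks can be approximated numerically, without plotting. The sketch below uses synthetic data; the split into low and high fitted values is a crude stand-in for inspecting a residual plot, not a formal test.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data with normally distributed noise.
rng = np.random.default_rng(6)
X = rng.normal(size=(250, 3))
y = 70 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=250)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# 1. Mean of residuals: essentially zero for OLS with an intercept.
mean_resid = residuals.mean()

# 2. Normality: probplot returns the Q-Q points and a straight-line fit;
#    r close to 1 means the points lie near the line.
(theo_q, ordered), (slope, intercept, r) = stats.probplot(residuals, dist="norm")

# 3. Equal variance: compare residual spread for low vs high fitted values.
order = np.argsort(fitted)
low, high = residuals[order[:125]], residuals[order[125:]]
spread_ratio = low.std() / high.std()

print("mean residual:", mean_resid)
print("Q-Q correlation r:", r)
print("spread ratio (low/high fitted):", spread_ratio)
```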

Since I only have 250 rows (data limited by the number of countries in the world), I used the entire data set to simulate the test data set (note: this is done for academic purposes only and is not good practice, as it leads to data leakage). I used KFold cross-validation with 10 splits to evaluate the model performance.

How do we interpret the model?

With no contribution from the features (the intercept), your life expectancy is 62 years.
If your country has a low birth rate, add 5 more years to your life.
If the EPI (Environmental Performance Index) is high, add 8 more years to your life.
If you live in a rich country, add half a year to your life.
Finally, for every unit (or rather LOG unit) decrease in stroke rate, 5 more years could be added to your life.
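To make the "LOG unit" in the last point concrete, here is a small worked example. It assumes the natural logarithm (the base is not stated above) and uses the rounded figure of 5 years per unit decrease; `years_gained` is a hypothetical helper.

```python
import math

# Rounded value quoted above: years gained per one-unit DECREASE in log(stroke rate).
YEARS_PER_LOG_UNIT = 5.0


def years_gained(old_rate, new_rate):
    """Predicted change in life expectancy when stroke rate moves from
    old_rate to new_rate, assuming a natural-log transformed feature."""
    return YEARS_PER_LOG_UNIT * (math.log(old_rate) - math.log(new_rate))


# One log unit corresponds to dividing the stroke rate by e (about 2.718).
print(years_gained(100, 100 / math.e))

# Halving the stroke rate is ln(2), about 0.69 log units.
print(years_gained(100, 50))
```

Under these assumptions the effect depends on the ratio of the two rates, not their difference, so halving the stroke rate has the same predicted effect whatever the starting level.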

Next Steps

I could collect more data by expanding the scope to cities instead of countries, and explore other features (factors) affecting life expectancy. I could also split the data into male and female categories for this life expectancy regression analysis.

To conclude, here are some interesting insights:

1. Japan has the highest life expectancy (83.7 years). Central African Republic (49.5 years) and many countries on the African continent are at the bottom of the scale. Singapore is ranked #5 (82.7 years).

2. Take good care of the environment. It has the largest coefficient (impact) on the country's life expectancy.

The Python code for the above analysis is available on my GitHub. Do feel free to refer to it.

Thank you for reading.

