Ever had a shop assistant walk up and ask, “Can I help you with something?”
Humans can guess instinctively. We observe a person's clothing, body language, confidence, and even how they speak, then guide them toward choices that seem to fit their style or needs. All of these are features about a person; we analyze them to deduce the budget and style they are likely to be interested in.
But can a machine learning algorithm do the same? Yes, it can. In the example above, the shop assistant was trying to recall whether they had seen a similar individual before and infer my wants from that. We have a similar algorithm: K-Nearest Neighbors, a simple yet powerful machine learning algorithm that makes predictions based on the behavior of "nearby" data points. Instead of learning complex patterns or weights like other algorithms, KNN looks at the K most similar data points (its "neighbors") and lets them vote to decide the outcome.
TL;DR
Can Your Demographics Reveal Your Paycheck?
Ethical Implications: Bias, Fairness & Inequality
While using demographic data can improve personalization and predictions, it also raises serious ethical concerns that must be addressed.
Bias in Data
Just like human judgment, machine learning can be biased. For example, a shop assistant who regularly sees a certain type of customer may begin to make assumptions based on past patterns, assumptions that may not apply in a different location or context.
This is exactly what happens in machine learning when models are trained on biased datasets: they inherit patterns that may not generalize well, leading to systematic errors or misinformed decisions.
Fairness
Machine learning models can produce unfair outcomes, often unintentionally. We've seen real-world examples, from biased hiring tools to facial recognition systems criticized for unfair treatment of certain groups.
These outcomes can negatively impact individuals, communities, and society at large.
Inequality
When predictions are linked to pricing or access, they can reinforce existing inequalities. For instance, if a system predicts someone's willingness to pay based on their profile, it might show higher prices to some users, creating an unfair experience.
Take Uber’s surge pricing as an example: while it helps motivate drivers during high-demand times (e.g., during rain or rush hour), it can feel exploitative to riders who have no other choice.
As an example, we can look at the UCI Adult (Census Income) dataset:
- Origin: UCI Machine Learning Repository
- Size: 48,842 instances, 14 attributes
- Ethics: The dataset is publicly available for research and educational purposes.
It contains anonymized census data, with demographic attributes and income labels, and its use complies with ethical standards for open data.
Objective
The main objective is to predict whether a person earns more than $50K/year based on attributes such as age, education, occupation, and hours worked per week.
Algorithm for the solution
k-Nearest Neighbors Algorithm: kNN is a non-parametric, instance-based learning algorithm that can be used for both classification and regression. Here we use it as a classifier to predict the income class from the attributes at hand.
Euclidean distance
The most common way to measure the distance between two points in space: the "straight-line" distance, like using a ruler to measure how far apart two dots are on a piece of paper.
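For two feature vectors p and q, the Euclidean distance is d(p, q) = sqrt((p1 - q1)^2 + ... + (pn - qn)^2). A tiny sketch with made-up numbers (the points are illustrative, not drawn from the dataset):
import numpy as np
p = np.array([25, 40])  # e.g., (age, hours-per-week) for person A -- made-up values
q = np.array([30, 45])  # the same two features for person B
dist = np.sqrt(np.sum((p - q) ** 2))  # equivalent to np.linalg.norm(p - q)
print(dist)  # ~7.07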
Steps:
- Choose k (here, k=5).
- Compute the Euclidean distance from the query point to all training points.
- Select the k nearest neighbors.
- Assign the class most common among the neighbors (see the sketch after the assumptions below).
Assumptions:
- All features contribute equally (hence, normalization is important).
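To make these steps concrete, here is a minimal from-scratch sketch. This is an illustration only; later we use scikit-learn's implementation, and this version assumes NumPy arrays of already-normalized features:
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    # Step 2: Euclidean distance from the query point to every training point
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Step 3: indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Step 4: majority vote among the neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]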
Data import and exploration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, roc_curve, classification_report
# Load dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
'hours-per-week', 'native-country', 'income']
# skipinitialspace strips the leading blank, so missing values appear as '?'
df = pd.read_csv(url, names=columns, na_values='?', skipinitialspace=True)
# Initial exploration
print(df.head())
print(df.info())
print(df.describe())
Output:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32561 non-null int64
1 workclass 32561 non-null object
2 fnlwgt 32561 non-null int64
3 education 32561 non-null object
4 education-num 32561 non-null int64
5 marital-status 32561 non-null object
6 occupation 32561 non-null object
7 relationship 32561 non-null object
8 race 32561 non-null object
9 sex 32561 non-null object
10 capital-gain 32561 non-null int64
11 capital-loss 32561 non-null int64
12 hours-per-week 32561 non-null int64
13 native-country 32561 non-null object
14 income 32561 non-null object
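Because the '?' placeholders are read in as NaN, it is worth checking how many rows are affected before dropping them. A quick check (exact counts depend on how the file was loaded):
print(df.isnull().sum())
print(df.shape)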
Feature Descriptions:
- age: The age of the individual (integer).
- workclass: Type of employment (e.g., Private, Self-emp-not-inc, Federal-gov).
- fnlwgt: Final weight, a census-specific numeric value representing the number of people the entry is meant to represent.
- education: Highest level of education attained (e.g., Bachelors, HS-grad, Masters).
- education-num: Numeric representation of education level (e.g., 9 for HS-grad, 13 for Bachelors).
- marital-status: Marital status (e.g., Married-civ-spouse, Never-married, Divorced).
- occupation: Type of job or profession (e.g., Tech-support, Craft-repair, Exec-managerial).
- relationship: Family relationship within the household (e.g., Husband, Wife, Not-in-family).
- race: Race of the individual (e.g., White, Black, Asian-Pac-Islander).
- sex: Gender of the individual (Male or Female).
- capital-gain: Income from investment sources other than wages (integer).
- capital-loss: Losses from investment sources other than wages (integer).
- hours-per-week: Number of hours worked per week (integer).
- native-country: Country of origin (e.g., United-States, Mexico, India).
- income: Target variable; income class (<=50K or >50K per year).
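Before encoding, it helps to see how many distinct levels each categorical column has, since one-hot encoding will create a column per level. An illustrative check on a few columns:
for col in ['workclass', 'education', 'occupation', 'native-country']:
    print(col, df[col].nunique(), 'distinct values')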
We will drop null values, encode categorical columns and normalize features.
# Drop rows with missing values
df = df.dropna()
# One-hot encode categorical columns (drop_first avoids redundant dummy columns)
df = pd.get_dummies(df, drop_first=True)
from sklearn.preprocessing import StandardScaler
# Separate the features from the target, then standardize the features
scaler = StandardScaler()
X = df.drop('income_>50K', axis=1)
y = df['income_>50K']
X_scaled = scaler.fit_transform(X)
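One caveat worth noting: fitting the scaler on the full dataset before splitting lets test-set statistics leak into training. A leakage-safer alternative, sketched here with scikit-learn's Pipeline (an aside, not the approach used in the rest of this post), learns scaling statistics from the training data only:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Scaling statistics come from whatever data the pipeline is fit on
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# pipe.fit(X_train_raw, y_train); pipe.predict(X_test_raw)  # with unscaled splits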
plt.figure(figsize=(8,4))
sns.histplot(df['age'], bins=30)
plt.title('Age Distribution')
plt.show()
Age is right-skewed (skewness: 0.56).
The distribution is concentrated in the prime working years, which suggests the dataset captured a healthy share of working-age individuals.
Class distribution:
Around 8,000 people in the dataset earn more than $50K, while roughly 25,000 earn at most that amount. This is an imbalanced dataset.
sns.countplot(x='income_>50K', data=df)
plt.title('Class Distribution')
plt.xlabel('Income >50K')
plt.ylabel('Count')
plt.show()
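The exact counts can also be read off directly rather than estimated from the bar chart:
print(df['income_>50K'].value_counts())  # roughly 25,000 False vs. 8,000 True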
As always, we need training and test data; an 80:20 split is a healthy choice. The classifier's parameters are summarized below.
| Parameter | Meaning |
|---|---|
| `n_neighbors=5` | The classifier will look at the 5 nearest neighbors (based on distance) to decide how to classify a new data point. This is the "K" in KNN. |
| `metric='minkowski'` | Specifies the distance metric to use. Minkowski is a general form that can represent multiple types of distance (like Euclidean or Manhattan) depending on `p`. |
| `p=2` | When `p=2`, the Minkowski distance is equivalent to the Euclidean distance. |
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))
acc = accuracy_score(y_test, y_pred)
print(f'Accuracy: {acc:.2f}')
y_prob = knn.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
print(f'ROC-AUC: {roc_auc_score(y_test, y_prob):.2f}')
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
plt.title('Confusion Matrix')
plt.show()
Output:
precision recall f1-score support
False 0.87 0.90 0.89 4942
True 0.65 0.57 0.61 1571
accuracy 0.82 6513
macro avg 0.76 0.74 0.75 6513
weighted avg 0.82 0.82 0.82 6513
Accuracy: 0.82
ROC-AUC: 0.84
The model is much better at identifying people who do not earn more than $50K (high recall and precision for "False") than those who do (lower recall and precision for "True"). This is common in imbalanced datasets. The overall accuracy is good, but the model may miss a significant portion of high-income individuals.
0.82 means that this kNN model correctly predicted the income class for 82% of the individuals in our test dataset. In other words, out of all predictions made, 82% matched the actual labels. This indicates good overall performance.
A ROC-AUC score of 0.84 indicates that, on average, there is an 84% chance that the model will rank a randomly chosen positive instance (earns >$50K) higher than a randomly chosen negative instance (does not earn >$50K). This suggests this model performs well in terms of overall classification and is effective at separating the two income classes.
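The choice of k=5 was somewhat arbitrary. Since cross_val_score is already imported, a quick sweep over candidate values of k can check whether a different neighborhood size performs better. A sketch (the candidate list is illustrative, and each fit is slow on a dataset this size):
for k in [3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print(f'k={k}: mean CV accuracy = {scores.mean():.3f}')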
Given a person's demographic information, we can predict with reasonable accuracy whether they earn more than $50K, much like a human making an informed guess from similar details.
Next steps
Deploying the model on AWS SageMaker would demonstrate how insights from a survey can translate into real-world, scalable applications.
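As a first step toward that, the fitted model (ideally wrapped with its preprocessing in a single pipeline) can be serialized for serving; the filename here is hypothetical:
import joblib
joblib.dump(knn, 'knn_income_model.joblib')  # hypothetical artifact name
# Later, e.g. inside an inference script: knn = joblib.load('knn_income_model.joblib')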