Ever had a shop assistant walk up and ask, "Can I help you with something?" Humans can guess instinctively. We observe a person's clothing, body language, confidence, and even how they speak, then guide them toward choices that seem to fit their style or needs. All of these are features about a person. We analyze them and infer a budget and a style the person is likely interested in. But can a machine learning algorithm do the same? Yes, it can. In the example above, the shopkeeper was effectively asking, "Have I seen someone like this before?" and deducing my wants from those past encounters. We have a similar algorithm: K-Nearest Neighbors, a simple yet powerful machine learning algorithm that makes predictions based on the behavior of "nearby" data points. Instead of learning complex patterns or weights like other algorithms, KNN looks at the K most similar data points (its "neighbors") and lets them vote to decide the outcome.

TL;DR: Can your demographics reveal your paycheck?

## Ethical Implications: Bias, Fairness & Inequality

While using demographic data can improve personalization and predictions, it also raises serious ethical concerns that must be addressed.

### Bias in Data

Just like human judgment, machine learning can be biased. For example, a shop assistant who regularly sees a certain type of customer may begin to make assumptions based on past patterns, assumptions that may not apply in a different location or context. This is exactly what happens in machine learning when models are trained on biased datasets: they inherit patterns that may not generalize well, leading to systematic errors or misinformed decisions.

### Fairness

Machine learning models can produce unfair outcomes, often unintentionally. We've seen real-world examples, from biased hiring tools to facial recognition systems criticized for unfair treatment of certain groups. These outcomes can negatively impact individuals, communities, and society at large.

### Inequality

When predictions are linked to pricing or access, they can reinforce existing inequalities. For instance, if a system predicts someone's willingness to pay based on their profile, it might show higher prices to some users, creating an unfair experience. Take Uber's surge pricing as an example: while it helps motivate drivers during high-demand times (e.g., during rain or rush hour), it can feel exploitative to riders who have no other choice.

As an example, we can look at the Adult dataset.

## Adult Dataset

- Origin: UCI Machine Learning Repository
- Size: 48,842 instances, 14 attributes
- Ethics: The dataset is publicly available for research and educational purposes. It contains anonymized census data, and its use complies with ethical standards for open data.

The Adult dataset contains demographic information and income labels.

## Objective

The main objective is to predict whether a person earns more than $50K/year based on attributes such as age, education, occupation, and hours worked per week.

## Algorithm

k-Nearest Neighbors: kNN is a non-parametric, instance-based learning algorithm that can be used for both classification and regression. Here we use it for classification, predicting the income class from the attributes at hand.
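To make the voting idea concrete, here is a minimal from-scratch sketch of a single kNN prediction. The points and labels are invented purely for illustration; scikit-learn's `KNeighborsClassifier`, used later in this post, does all of this for us.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=5):
    # Euclidean distance from the query point to every training point
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote among the k neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example: two made-up features (say, age and hours-per-week), binary label
X_train = np.array([[25, 40], [47, 60], [52, 55], [23, 20], [38, 45]])
y_train = np.array([0, 1, 1, 0, 1])
print(knn_predict(X_train, y_train, np.array([45, 50]), k=3))  # -> 1
```

Note there is no "training" step in the usual sense: the model simply stores the data and defers all the work to prediction time.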
## Euclidean Distance

The most common way to measure the distance between two points in space is Euclidean distance: the "straight-line" distance between them, as if you used a ruler to measure how far apart two dots are on a piece of paper.

### Steps

1. Choose k (here, k=5).
2. Compute the Euclidean distance from the query point to all training points.
3. Select the k nearest neighbors.
4. Assign the class most common among the neighbors.

### Assumptions

- All features contribute equally (hence, normalization is important).

## Data Import and Exploration

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, roc_curve, classification_report

# Load dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
df = pd.read_csv(url, names=columns, na_values=' ?', skipinitialspace=True)

# Initial exploration
print(df.head())
print(df.info())
print(df.describe())
```

Output:

```
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             32561 non-null  int64
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64
 11  capital-loss    32561 non-null  int64
 12  hours-per-week  32561 non-null  int64
 13  native-country  32561 non-null  object
 14  income          32561 non-null  object
```
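Before preprocessing, it is worth quantifying how many values are missing and how the raw target is distributed. A quick sketch, reusing the `df` loaded above:

```python
# Count missing values per column (entries parsed as NaN during loading)
print(df.isna().sum())

# Raw target distribution before any encoding
print(df['income'].value_counts())
```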
### Feature Descriptions

- `age`: The age of the individual (integer).
- `workclass`: Type of employment (e.g., Private, Self-emp-not-inc, Federal-gov).
- `fnlwgt`: Final weight, a census-specific numeric value representing the number of people the entry is meant to represent.
- `education`: Highest level of education attained (e.g., Bachelors, HS-grad, Masters).
- `education-num`: Numeric representation of education level (e.g., 9 for HS-grad, 13 for Bachelors).
- `marital-status`: Marital status (e.g., Married-civ-spouse, Never-married, Divorced).
- `occupation`: Type of job or profession (e.g., Tech-support, Craft-repair, Exec-managerial).
- `relationship`: Family relationship within the household (e.g., Husband, Wife, Not-in-family).
- `race`: Race of the individual (e.g., White, Black, Asian-Pac-Islander).
- `sex`: Gender of the individual (Male or Female).
- `capital-gain`: Income from investment sources other than wages (integer).
- `capital-loss`: Losses from investment sources other than wages (integer).
- `hours-per-week`: Number of hours worked per week (integer).
- `native-country`: Country of origin (e.g., United-States, Mexico, India).
- `income`: Target variable; income class (<=50K or >50K per year).
We will drop null values, encode categorical columns, and normalize features.

```python
from sklearn.preprocessing import StandardScaler

# Drop rows with missing values and one-hot encode categorical columns
df = df.dropna()
df = pd.get_dummies(df, drop_first=True)

# Separate features and target, then standardize the features
# (note: fitting the scaler on all rows before splitting leaks test-set
# statistics; fitting on the training split alone would be stricter)
scaler = StandardScaler()
X = df.drop('income_>50K', axis=1)
y = df['income_>50K']
X_scaled = scaler.fit_transform(X)

plt.figure(figsize=(8, 4))
sns.histplot(df['age'], bins=30)
plt.title('Age Distribution')
plt.show()
```

Age is right-skewed (skewness: 0.56). Ages are concentrated in the prime working years, which is great: our dataset seems to capture a healthy concentration of working individuals.

Class distribution: around 8,000 people in the dataset earn more than $50K, while roughly 25,000 earn less. This is an imbalanced dataset.

```python
sns.countplot(x='income_>50K', data=df)
plt.title('Class Distribution')
plt.xlabel('Income >50K')
plt.ylabel('Count')
plt.show()
```

As always, we need train and test data; 80:20 is a healthy split. The classifier takes the following parameters:

| Parameter | Meaning |
|---|---|
| `n_neighbors=5` | The classifier will look at the 5 nearest neighbors (based on distance) to decide how to classify a new data point. This is the "K" in KNN. |
| `metric='minkowski'` | Specifies the distance metric to use. Minkowski is a general form that can represent multiple types of distance (like Euclidean or Manhattan) depending on `p`; see the sketch after this table. |
| `p=2` | When `p=2`, the Minkowski metric becomes Euclidean distance. This is the most common type of distance, the "straight line" between two points. |
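To see how `p` changes the metric, here is a small sketch computing the Minkowski distance directly with NumPy (the two points are made up for illustration):

```python
import numpy as np

def minkowski(u, v, p):
    # General Minkowski distance: (sum |u_i - v_i|^p)^(1/p)
    return (np.abs(u - v) ** p).sum() ** (1 / p)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(a, b, p=1))  # 7.0 -> Manhattan distance
print(minkowski(a, b, p=2))  # 5.0 -> Euclidean distance
```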
```python
# 80:20 train/test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Fit kNN with k=5 and Euclidean distance
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print(classification_report(y_test, y_pred))
acc = accuracy_score(y_test, y_pred)
print(f'Accuracy: {acc:.2f}')

# ROC curve from predicted probabilities
y_prob = knn.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
print(f'ROC-AUC: {roc_auc_score(y_test, y_prob):.2f}')

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
plt.title('Confusion Matrix')
plt.show()
```

Output:

```
              precision    recall  f1-score   support

       False       0.87      0.90      0.89      4942
        True       0.65      0.57      0.61      1571

    accuracy                           0.82      6513
   macro avg       0.76      0.74      0.75      6513
weighted avg       0.82      0.82      0.82      6513

Accuracy: 0.82
ROC-AUC: 0.84
```

The model is much better at identifying people who do not earn more than $50K (high recall and precision for "False") than those who do (lower recall and precision for "True"). This is common with imbalanced datasets: the overall accuracy looks good, but the model misses a significant portion of high-income individuals.

An accuracy of 0.82 means this kNN model correctly predicted the income class for 82% of the individuals in our test dataset. In other words, out of all predictions made, 82% matched the actual labels, which indicates good overall performance.

A ROC-AUC score of 0.84 indicates that, on average, there is an 84% chance that the model will rank a randomly chosen positive instance (earns >$50K) higher than a randomly chosen negative instance (does not earn >$50K). This suggests the model performs well overall and is effective at separating the two income classes.

Given a person's demographic information, we can reliably predict whether they earn more than $50K, much like a human might make an informed guess based on similar details.
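The imports at the top include `cross_val_score`, which we have not used yet. A natural follow-up is to check how sensitive the model is to the choice of k. A sketch, with the candidate k values chosen arbitrarily:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Compare cross-validated accuracy for a few odd values of k
# (odd values avoid ties in the binary majority vote)
for k in [3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print(f'k={k}: mean CV accuracy = {scores.mean():.3f}')
```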
## Next Steps

Deploying the model on AWS SageMaker would demonstrate how insights from a survey can translate into real-world, scalable applications.
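As a first step toward any deployment, the fitted scaler and model need to be persisted so a serving process can reload them. A minimal sketch using joblib (the file names are illustrative, not part of any SageMaker convention):

```python
import joblib

# Persist the fitted scaler and classifier
joblib.dump(scaler, 'scaler.joblib')
joblib.dump(knn, 'knn_model.joblib')

# At serving time, reload both and apply them to new, already-encoded rows
scaler = joblib.load('scaler.joblib')
knn = joblib.load('knn_model.joblib')
```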