How I helped a city estimate energy use from structural data — explained like you're starting from zero
The story
I work as a data scientist for a small urban planning consultancy. A European city aiming for carbon neutrality by 2050 came to us with a problem: they need to predict the total energy consumption of commercial buildings using only structural information — size, age, type, location.
Why? Because measuring actual energy consumption is expensive: inspectors have to physically visit each building with specialized equipment. But the city already has structural data from building permits and tax records. If we can predict energy use from that, inspectors only need to visit buildings where the model flags unusually high consumption.
The city gave us data from ~3,800 non-residential buildings measured in 2016. Our mission:
- Clean and filter the data
- Understand which features matter
- Build and compare regression models
- Tell the city what drives energy consumption
Regression vs Classification
CLASSIFICATION: predict a CATEGORY
"Will this employee leave?" → Yes / No
Metrics: Precision, Recall, F1
REGRESSION: predict a NUMBER
"How much energy will this building use?" → 347,000 kBtu
Metrics: R², MAE, RMSE
This project is regression — we predict a continuous number.
What you'll learn
| Part | Topic | Why it matters |
|---|---|---|
| 0 | Environment setup | Reproducibility |
| 1 | Load & explore the data | Understand what we have |
| 2 | Explore the target variable | Know what we're predicting |
| 3 | Filter non-compliant & outliers | Remove bad data |
| 4 | Handle missing values | Models can't handle NaN |
| 5 | Visual exploration (EDA) | See patterns before modeling |
| 6 | Feature engineering (4 new features) | Help the model find patterns |
| 7 | Data leakage (the #1 mistake) | Avoid fake good results |
| 8 | Correlation matrix | Remove redundant features |
| 9 | Encode categorical features | ML needs numbers |
| 10 | Train/test split | Reliable evaluation |
| 11 | Helper function for evaluation | Don't repeat yourself |
| 12 | Model 1: Ridge (linear) | Baseline |
| 13 | Model 2: Random Forest (bagging) | Non-linear |
| 14 | Model 3: Gradient Boosting (boosting) | Usually best |
| 15 | Cross-validation (5-fold) | Mean +/- std |
| 16 | GridSearchCV | Tune hyperparameters |
| 17 | Feature importance (2 methods) | What drives consumption? |
| 18 | Final summary | Business conclusions |
Every concept is explained before the code. Every code block does one thing.
Before we start: What IS Machine Learning?
Machine Learning (ML) is teaching a computer to find patterns in data so it can make predictions on new data.
The 3 types (simplified):
1. SUPERVISED LEARNING (this project!)
You GIVE the model examples with known answers.
"Here are 3,000 buildings with their actual energy use.
Learn the pattern, then predict energy for NEW buildings."
→ Regression: predict a NUMBER (energy = 347,000 kBtu)
→ Classification: predict a CATEGORY (employee will leave: yes/no)
2. UNSUPERVISED LEARNING
You DON'T give answers. The model finds groups on its own.
"Here are 3,000 buildings. Group similar ones together."
3. REINFORCEMENT LEARNING
The model learns by trial and error (like training a dog).
The workflow we'll follow:
Raw data → Clean → Explore → Engineer features → Split → Train models → Evaluate → Interpret
📦 🧹 🔍 🔧 ✂️ 🏋️ 📊 💡
Each step has a reason. Skipping one usually means bad results.
Part 0 — Setting up your tools
What is a pyproject.toml?
Think of it as a recipe card for your project. It lists every Python package (ingredient) needed. Anyone wanting to reproduce your work reads this file and installs everything with one command.
[project]
name = "energy-prediction"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "pandas>=2.0",
    "numpy>=1.24",
    "matplotlib>=3.7",
    "seaborn>=0.12",
    "scikit-learn>=1.3",
]
Create it with `uv init` in your terminal. Now let's import our tools — one at a time, with explanations.
# pandas: work with tables (like Excel, but in Python)
import pandas as pd
# Quick check: can we create a DataFrame?
print(type(pd.DataFrame()))
# numpy: fast math on arrays of numbers
import numpy as np
# Quick check
print('2 + 3 =', np.add(2, 3))
# matplotlib: the base plotting library
import matplotlib.pyplot as plt
# seaborn: prettier charts built on top of matplotlib
import seaborn as sns
print('Visualization libraries ready!')
# Suppress non-critical warnings (keeps output clean)
import warnings
warnings.filterwarnings('ignore')
print('Warnings suppressed')
Scikit-learn imports (one group at a time)
Scikit-learn (sklearn) is THE machine learning library. We import many tools — each will be explained when we use it.
# Split data into training and test sets
from sklearn.model_selection import train_test_split
print('train_test_split: splits data 80/20 (or any ratio)')
# Cross-validation: test on multiple splits for reliable scores
from sklearn.model_selection import cross_validate
# KFold: defines HOW to split for cross-validation
from sklearn.model_selection import KFold
# GridSearchCV: try many hyperparameter combinations
from sklearn.model_selection import GridSearchCV
print('Cross-validation and GridSearch loaded')
# StandardScaler: normalize features to mean=0, std=1
# (needed for linear models, not for tree-based models)
from sklearn.preprocessing import StandardScaler
# OneHotEncoder: convert categories to binary columns
from sklearn.preprocessing import OneHotEncoder
# Pipeline: chain preprocessing + model in one object
from sklearn.pipeline import Pipeline
print('Preprocessing tools loaded')
The 3 model categories
The project requires at least 3 different categories of models:
| Category | How it works | Example | Needs scaling? |
|---|---|---|---|
| Linear | Draws the best straight line | Ridge, Lasso | Yes |
| Bagging (trees) | Many trees vote independently | Random Forest | No |
| Boosting (trees) | Trees learn from each other's mistakes | Gradient Boosting | No |
BAGGING (Random Forest):
Tree 1 ──┐
Tree 2 ──┤──▶ AVERAGE all predictions ──▶ Final answer
Tree 3 ──┤
Tree 4 ──┘
(trees work INDEPENDENTLY, in parallel)
BOOSTING (Gradient Boosting):
Tree 1 ─▶ errors ─▶ Tree 2 ─▶ errors ─▶ Tree 3 ─▶ ... ─▶ Final answer
(each tree CORRECTS the mistakes of the previous one)
# Linear model
from sklearn.linear_model import Ridge
# Bagging (tree-based, parallel)
from sklearn.ensemble import RandomForestRegressor
# Boosting (tree-based, sequential)
from sklearn.ensemble import GradientBoostingRegressor
print('3 model categories loaded: Ridge, RandomForest, GradientBoosting')
Regression metrics
Suppose we predict energy for 3 buildings:
Actual: [100,000 200,000 300,000]
Predicted: [110,000 180,000 350,000]
Errors: [+10,000 -20,000 +50,000]
MAE = average of |errors|
= (10,000 + 20,000 + 50,000) / 3 = 26,667 kBtu
→ "On average, we're off by ~27,000 kBtu"
RMSE = sqrt(average of errors²)
= sqrt((10k² + 20k² + 50k²) / 3) = 31,623 kBtu
→ "Like MAE but penalizes BIG errors more"
R² = 1 - (sum of errors²) / (sum of (actual - mean)²)
→ "How much variance does the model explain?"
→ R²=1.0 means perfect, R²=0 means useless
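The arithmetic above can be checked directly with scikit-learn. A quick sketch, using the same three hypothetical buildings:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual = np.array([100_000, 200_000, 300_000])
predicted = np.array([110_000, 180_000, 350_000])

# MAE: average absolute error
mae = mean_absolute_error(actual, predicted)
# RMSE: square root of the mean squared error
rmse = np.sqrt(mean_squared_error(actual, predicted))
# R²: fraction of variance explained
r2 = r2_score(actual, predicted)

print(f'MAE:  {mae:,.0f} kBtu')   # 26,667
print(f'RMSE: {rmse:,.0f} kBtu')  # 31,623
print(f'R²:   {r2:.2f}')          # 0.85
```

Note that RMSE ≥ MAE always; the bigger the gap, the more your errors are dominated by a few large misses.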
# Regression metrics
from sklearn.metrics import (
mean_absolute_error, # MAE
mean_squared_error, # Used to compute RMSE
r2_score, # R²
make_scorer, # Convert metric to scorer
)
# Feature importance
from sklearn.inspection import permutation_importance
print('Metrics and feature importance loaded')
# Global settings for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
# Prettier charts
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
pd.set_option('display.max_columns', None)
print(f'All set! Random state: {RANDOM_STATE}')
Part 1 — Loading the dataset
In a real project, you'd load the CSV: pd.read_csv('seattle_buildings_2016.csv'). Since we can't share the real data, we generate synthetic data that mimics it. We build each column separately so you see exactly what goes in.
Step 1.1 — Building IDs
Every row needs a unique identifier.
n_buildings = 3800
building_ids = [f'BLD{str(i).zfill(5)}' for i in range(1, n_buildings + 1)]
# Verify
print(f'Total: {len(building_ids)}')
print(f'First 3: {building_ids[:3]}')
print(f'Last 3: {building_ids[-3:]}')
Step 1.2 — Building type
All non-residential: offices, retail, hotels, etc. In the real project, residential buildings must be excluded — that's a hard requirement.
building_types = [
'Office', 'Retail', 'Hotel', 'Warehouse', 'School',
'Hospital', 'Restaurant', 'Supermarket', 'Parking', 'Other'
]
primary_use = np.random.choice(building_types, n_buildings,
p=[0.25, 0.15, 0.08, 0.12, 0.10, 0.05, 0.08, 0.05, 0.07, 0.05])
# Let's see the distribution
print('Building type distribution:')
for bt in building_types:
    n = (primary_use == bt).sum()
    pct = n / n_buildings * 100
    bar = '█' * int(pct)
    print(f' {bt:15s} {n:4d} ({pct:5.1f}%) {bar}')
Step 1.3 — Year built
year_built = np.random.normal(1970, 25, n_buildings).clip(1900, 2020).astype(int)
print(f'Oldest: {year_built.min()}')
print(f'Newest: {year_built.max()}')
print(f'Median: {int(np.median(year_built))}')
Step 1.4 — Number of floors
n_floors = np.random.exponential(3, n_buildings).clip(1, 50).astype(int)
print(f'Range: {n_floors.min()} to {n_floors.max()} floors')
print(f'Most common: {np.bincount(n_floors).argmax()} floor(s)')
Step 1.5 — Gross floor area
Different building types have very different typical sizes: a hospital is much bigger than a restaurant. We encode this with a base area per type.
base_area = {
'Office': 50000, 'Retail': 20000, 'Hotel': 80000,
'Warehouse': 40000, 'School': 60000, 'Hospital': 120000,
'Restaurant': 5000, 'Supermarket': 30000, 'Parking': 70000,
'Other': 25000
}
# Show the base areas
print('Base floor area by type:')
for bt, area in sorted(base_area.items(), key=lambda x: -x[1]):
    print(f' {bt:15s} {area:>8,} sq ft')
# Actual area = base * random variation * floor bonus
gross_floor_area = np.array([
base_area[pu] * np.random.lognormal(0, 0.5) * (nf ** 0.3)
for pu, nf in zip(primary_use, n_floors)
]).clip(1000, 2000000).round(0)
print(f'Floor area range: {gross_floor_area.min():,.0f} to {gross_floor_area.max():,.0f} sq ft')
print(f'Median: {np.median(gross_floor_area):,.0f} sq ft')
Step 1.6 — Other columns
# 10 neighborhoods
neighborhoods = [f'Zone_{chr(65+i)}' for i in range(10)]
neighborhood = np.random.choice(neighborhoods, n_buildings)
print(f'{len(np.unique(neighborhood))} neighborhoods')
# Number of property uses (1=single, 2+=mixed use)
n_property_uses = np.random.choice([1, 2, 3, 4], n_buildings,
p=[0.60, 0.25, 0.10, 0.05])
vals, cnts = np.unique(n_property_uses, return_counts=True)
for v, c in zip(vals, cnts):
    print(f' {v} use(s): {c} buildings')
Step 1.7 — Compliance and Outlier flags
The real dataset has two critical columns:
- ComplianceStatus: did the building pass inspection? Non-compliant → remove
- Outlier: was this data point flagged as suspicious? Yes → remove
Using these columns to filter is required by the project.
compliance = np.random.choice(
['Compliant', 'Non-Compliant'], n_buildings, p=[0.85, 0.15])
outlier_flag = np.random.choice(
['No', 'Yes'], n_buildings, p=[0.92, 0.08])
print(f'Compliant: {(compliance == "Compliant").sum()}')
print(f'Non-Compliant: {(compliance == "Non-Compliant").sum()}')
print(f'Outlier=Yes: {(outlier_flag == "Yes").sum()}')
Step 1.8 — Energy columns (leakage risks!)
The real dataset has individual energy sources. We create them here but will remove them later — using them to predict total energy is cheating (data leakage, explained in Part 7).
# EnergyStarScore (1-100): efficiency rating
energy_star = np.random.normal(60, 20, n_buildings).clip(1, 100).round(0)
print(f'EnergyStarScore: {energy_star.min():.0f} to {energy_star.max():.0f}')
# Individual energy sources (these ARE total energy's components!)
electricity_kbtu = (gross_floor_area * np.random.uniform(10, 40, n_buildings)).round(0)
natural_gas_kbtu = (gross_floor_area * np.random.uniform(2, 20, n_buildings)).round(0)
steam_kbtu = np.where(
np.random.random(n_buildings) < 0.3,
gross_floor_area * np.random.uniform(1, 10, n_buildings), 0).round(0)
print(f'Electricity: mean={electricity_kbtu.mean():,.0f} kBtu')
print(f'Natural gas: mean={natural_gas_kbtu.mean():,.0f} kBtu')
print(f'Steam (30% of buildings): mean of nonzero={steam_kbtu[steam_kbtu > 0].mean():,.0f} kBtu')
Step 1.9 — The TARGET: total site energy use
# Total = sum of all sources + noise
site_energy = (electricity_kbtu + natural_gas_kbtu + steam_kbtu
+ np.random.normal(0, 5000, n_buildings)).clip(10000).round(0)
# Make flagged outliers extreme (realistic)
outlier_mask = outlier_flag == 'Yes'
site_energy[outlier_mask] *= np.random.uniform(2, 5, outlier_mask.sum())
print(f'SiteEnergyUse_kBtu:')
print(f' Min: {site_energy.min():>12,.0f}')
print(f' Median: {np.median(site_energy):>12,.0f}')
print(f' Mean: {site_energy.mean():>12,.0f}')
print(f' Max: {site_energy.max():>12,.0f}')
# EUI and CO2 (also leakage)
site_eui = (site_energy / gross_floor_area).round(2)
co2_emissions = (site_energy * np.random.uniform(0.00005, 0.00015, n_buildings)).round(2)
print(f'SiteEUI range: {site_eui.min():.2f} to {site_eui.max():.2f} kBtu/sf')
print(f'CO2 range: {co2_emissions.min():.2f} to {co2_emissions.max():.2f} metric tons')
Step 1.10 — Inject missing values (like real data)
energy_star_dirty = energy_star.copy().astype(float)
energy_star_dirty[np.random.random(n_buildings) < 0.05] = np.nan
print(f'Missing EnergyStarScore: {np.isnan(energy_star_dirty).sum()}')
year_built_dirty = year_built.copy().astype(float)
year_built_dirty[np.random.random(n_buildings) < 0.02] = np.nan
print(f'Missing YearBuilt: {np.isnan(year_built_dirty).sum()}')
Step 1.11 — Assemble the full DataFrame
df = pd.DataFrame({
'BuildingID': building_ids,
'PrimaryPropertyType': primary_use,
'YearBuilt': year_built_dirty,
'NumberOfFloors': n_floors,
'GrossFloorArea_sf': gross_floor_area,
'Neighborhood': neighborhood,
'NumberOfPropertyUses': n_property_uses,
'ComplianceStatus': compliance,
'Outlier': outlier_flag,
'EnergyStarScore': energy_star_dirty,
'Electricity_kBtu': electricity_kbtu,
'NaturalGas_kBtu': natural_gas_kbtu,
'Steam_kBtu': steam_kbtu,
'SiteEnergyUse_kBtu': site_energy,
'SiteEUI_kBtu_sf': site_eui,
'CO2Emissions_tons': co2_emissions,
})
print(f'Dataset: {df.shape[0]} rows x {df.shape[1]} columns')
# List all columns with their types
print('All columns:')
for i, (col, dtype) in enumerate(df.dtypes.items()):
    print(f' {i+1:2d}. {col:30s} {str(dtype):10s}')
df.head(3)
What is a DataFrame?
Think of a DataFrame as a spreadsheet table inside Python. It has:
- Rows (one per building, employee, or observation)
- Columns (one per variable: age, floor area, energy...)
- Index (row number, starting from 0)
FloorArea YearBuilt Energy
BLD001 50000 1985 320000
BLD002 12000 2001 95000
BLD003 85000 1962 580000
↑ index ↑ col 1 ↑ col 2 ↑ col 3
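The same toy table can be built in a couple of lines (values taken from the diagram above, purely for illustration):

```python
import pandas as pd

# Three toy buildings, indexed by building ID
demo = pd.DataFrame(
    {
        'FloorArea': [50_000, 12_000, 85_000],
        'YearBuilt': [1985, 2001, 1962],
        'Energy': [320_000, 95_000, 580_000],
    },
    index=['BLD001', 'BLD002', 'BLD003'],
)
print(demo)
print(demo.shape)                    # (3, 3): 3 rows, 3 columns
print(demo.loc['BLD002', 'Energy'])  # 95000: one cell, looked up by label
```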
What do .shape, .info(), .describe() tell us?
| Command | What it shows | Why you need it |
|---|---|---|
| `df.shape` | (rows, columns) | Know the size |
| `df.info()` | Column names, types, non-null count | Spot wrong types and missing data |
| `df.describe()` | Mean, std, min, max, quartiles | Spot outliers and weird distributions |
| `df.isnull().sum()` | Number of missing values per column | Know what needs fixing |
Part 2 — Exploring the raw data
The 3 essential checks for ANY dataset
Every time you load data, ask these 3 questions:
- How big is it? → `df.shape`
- What types are the columns? → `df.info()`
- How much data is missing? → `df.isnull().sum()`
# Check 1: Size
print(f'Rows: {df.shape[0]}')
print(f'Columns: {df.shape[1]}')
# Check 2: Types and non-null counts
df.info()
How to read .describe() output
FloorArea YearBuilt Energy
count 3800.0 3800.0 3800.0 ← how many non-null values
mean 52341.0 1970.0 420000.0 ← average
std 45200.0 25.0 380000.0 ← spread (bigger = more varied)
min 1000.0 1900.0 10000.0 ← smallest value
25% 18000.0 1952.0 150000.0 ← 25% of buildings are below this
50% 38000.0 1970.0 310000.0 ← MEDIAN (middle value)
75% 72000.0 1990.0 560000.0 ← 75% are below this
max 2000000.0 2020.0 5000000.0 ← largest value
What to look for:
- Big difference between mean and 50% (median)? → Data is skewed (has extreme values)
- min very different from 25%? → Possible outliers at the bottom
- max very different from 75%? → Possible outliers at the top
- count less than total rows? → Missing values in that column
# Check 3: Missing values
missing = df.isnull().sum()
print('Columns with missing values:')
print(missing[missing > 0])
print(f'\nTotal missing cells: {missing.sum()} out of {df.shape[0]*df.shape[1]}')
# Descriptive stats for numeric columns
df.describe().round(2)
What is a "target variable"?
In supervised learning, we always have:
- Features (X): the information we GIVE to the model (floor area, year built, ...)
- Target (y): what we want the model to PREDICT (energy consumption)
Features (X) Target (y)
┌───────────┬───────────┬─────────┐ ┌────────────────┐
│ FloorArea │ YearBuilt │ Floors │ │ EnergyUse_kBtu │
├───────────┼───────────┼─────────┤ ├────────────────┤
│ 50,000 │ 1985 │ 3 │ → │ 320,000 │
│ 12,000 │ 2001 │ 1 │ → │ 95,000 │
│ 85,000 │ 1962 │ 7 │ → │ 580,000 │
└───────────┴───────────┴─────────┘ └────────────────┘
The model learns the relationship: X → y
Then predicts y for NEW buildings where we don't know y.
Explore the target variable FIRST
Always look at what you're trying to predict before anything else.
target_col = 'SiteEnergyUse_kBtu'
print(f'Target: {target_col}')
print(f' Count: {df[target_col].count()}')
print(f' Mean: {df[target_col].mean():>12,.0f} kBtu')
print(f' Median: {df[target_col].median():>12,.0f} kBtu')
print(f' Std: {df[target_col].std():>12,.0f} kBtu')
print(f' Min: {df[target_col].min():>12,.0f} kBtu')
print(f' Max: {df[target_col].max():>12,.0f} kBtu')
print(f' Skew: {df[target_col].skew():>12.2f}')
Why is the distribution "skewed" and what does log-transform do?
Most buildings use moderate energy, but a few use ENORMOUS amounts. This creates a "long tail" on the right:
Raw distribution: Log-transformed:
│█ │ █
│█ │ ███
│██ │ █████
│████ │ ███████
│████████ │████████
│██████████████░░░░░░── │████████
└────────────────────── Energy └──────────── log(Energy)
Most buildings Long tail More symmetric!
are here (outliers)
The log function compresses big numbers more than small numbers, making the distribution more symmetric. Some models (especially linear ones) work better with roughly symmetric data.
Note: we use np.log1p(x) instead of np.log(x) because log1p(0) = 0, whereas log(0) is undefined (NumPy returns -inf with a warning, which would poison later calculations).
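A tiny check of the compression effect (the values are arbitrary, chosen so each one is roughly 10x the previous):

```python
import numpy as np

vals = np.array([0, 9, 99, 999, 9_999])
print(np.log1p(vals))  # ≈ [0, 2.30, 4.61, 6.91, 9.21]: each 10x jump adds only ~2.3
# log1p(0) = 0, whereas np.log(0) returns -inf with a warning
```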
# Visualize: raw distribution vs log-transformed
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].hist(df[target_col], bins=50, color='steelblue', edgecolor='white')
axes[0].set_xlabel('Site Energy Use (kBtu)')
axes[0].set_ylabel('Number of buildings')
axes[0].set_title('Raw Distribution (very right-skewed)')
axes[1].hist(np.log1p(df[target_col]), bins=50, color='coral', edgecolor='white')
axes[1].set_xlabel('log(Site Energy Use)')
axes[1].set_title('Log-transformed (more symmetric)')
plt.tight_layout()
plt.show()
What does "filtering" mean in pandas?
Filtering means keeping only the rows that match a condition. It's like applying a filter in Excel.
# English: "Keep only rows where ComplianceStatus equals 'Compliant'"
# Python:
df = df[df['ComplianceStatus'] == 'Compliant']
# How does this work, step by step?
# 1. df['ComplianceStatus'] == 'Compliant'
# → creates a column of True/False for every row
# → [True, True, False, True, False, ...]
#
# 2. df[True/False column]
# → keeps only rows where the value is True
Important: always keep track of how many rows you have before and after each filter. If you lose too many, you might not have enough data to train a model.
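The two-step mechanics are easiest to see on a toy table (a hypothetical 3-row DataFrame, not the project data):

```python
import pandas as pd

toy = pd.DataFrame({
    'ComplianceStatus': ['Compliant', 'Non-Compliant', 'Compliant'],
    'Energy': [100, 200, 300],
})

# Step 1: the comparison produces a True/False Series, one value per row
mask = toy['ComplianceStatus'] == 'Compliant'
print(mask.tolist())  # [True, False, True]

# Step 2: indexing with the mask keeps only the True rows
filtered = toy[mask]
print(f'{len(toy)} rows before, {len(filtered)} after')  # 3 rows before, 2 after
```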
Part 3 — Filtering the data
Why filter?
The dataset contains buildings that shouldn't be included:
- Non-compliant buildings: their data is unreliable
- Flagged outliers: suspicious measurements
The project requires using ComplianceStatus and Outlier columns to filter.
Track how many rows we keep at each step
print(f'Starting count: {len(df)} buildings')
# What's the compliance distribution?
print(df['ComplianceStatus'].value_counts())
# Remove non-compliant
df = df[df['ComplianceStatus'] == 'Compliant'].copy()
print(f'After removing non-compliant: {len(df)} buildings')
# What's the outlier distribution?
print(df['Outlier'].value_counts())
# Remove flagged outliers
df = df[df['Outlier'] == 'No'].copy()
print(f'After removing outliers: {len(df)} buildings')
# These columns served their purpose — drop them
df.drop(columns=['ComplianceStatus', 'Outlier'], inplace=True)
print(f'Columns remaining: {df.shape[1]}')
Part 4 — Handling missing values
Why median, not mean?
The mean is pulled by extreme values. The median is the middle value and ignores outliers.
Values: [2, 3, 3, 4, 100]
Mean: 22.4 ← pulled by 100
Median: 3 ← the actual middle value
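The same five numbers in code:

```python
import numpy as np

values = np.array([2, 3, 3, 4, 100])
print(np.mean(values))    # 22.4 (pulled up by the single 100)
print(np.median(values))  # 3.0 (the middle value, unaffected)
```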
# What's still missing?
print('Missing values after filtering:')
missing = df.isnull().sum()
print(missing[missing > 0])
# Fill YearBuilt with its median
median_year = df['YearBuilt'].median()
before_null = df['YearBuilt'].isnull().sum()
df['YearBuilt'] = df['YearBuilt'].fillna(median_year)  # assignment instead of inplace (reliable in pandas 2.x)
print(f'YearBuilt: filled {before_null} missing values with median={median_year:.0f}')
# Fill EnergyStarScore with its median
median_star = df['EnergyStarScore'].median()
before_null = df['EnergyStarScore'].isnull().sum()
df['EnergyStarScore'] = df['EnergyStarScore'].fillna(median_star)
print(f'EnergyStarScore: filled {before_null} missing values with median={median_star:.0f}')
# Verify: no more nulls
print(f'Total remaining nulls: {df.isnull().sum().sum()}')
Remove extreme statistical outliers
Even after removing flagged outliers, some buildings may have unrealistic energy values. We use percentiles to trim the extremes.
Q01 = df['SiteEnergyUse_kBtu'].quantile(0.01)
Q99 = df['SiteEnergyUse_kBtu'].quantile(0.99)
print(f'1st percentile: {Q01:>12,.0f} kBtu')
print(f'99th percentile: {Q99:>12,.0f} kBtu')
before = len(df)
df = df[(df['SiteEnergyUse_kBtu'] >= Q01) & (df['SiteEnergyUse_kBtu'] <= Q99)].copy()
print(f'Removed {before - len(df)} extreme buildings')
print(f'Remaining: {len(df)} buildings')
Part 5 — Visualizations
Which chart for which comparison?
| What you want to see | Chart type | Example |
|---|---|---|
| Two numbers together | Scatterplot | Floor area vs energy |
| Number across categories | Boxplot | Energy by building type |
| Distribution of one thing | Histogram | Year built |
| Target after filtering | Histogram | Energy distribution |
All charts must relate to the target variable.
How to read a scatterplot
Each dot is one building. The position tells you two things at once:
Energy (y)
│ ●
│ ● ●
│ ●●● "Pattern: dots go UP from left to right"
│ ●●● → POSITIVE correlation
│ ●● → Bigger floor area = more energy
│●●
└──────────────── FloorArea (x)
What to look for:
- Dots going up-right? → Positive relationship (both increase together)
- Dots going down-right? → Negative relationship (one up, other down)
- Random cloud? → No relationship
- Dots far from the main group? → Outliers
# SCATTERPLOT: Floor area vs Energy (the most important relationship)
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(df['GrossFloorArea_sf'], df['SiteEnergyUse_kBtu'],
alpha=0.3, s=15, color='steelblue')
ax.set_xlabel('Gross Floor Area (sq ft)')
ax.set_ylabel('Site Energy Use (kBtu)')
ax.set_title('Bigger buildings use more energy (obvious but important)')
plt.tight_layout()
plt.show()
How to read a boxplot
A boxplot shows the distribution of a numeric variable at a glance:
┌──────┐
─────┤ ├───── ← Whiskers (range without outliers)
├──────┤
│ │ ← Box = middle 50% of data (Q1 to Q3)
│──────│ ← Line inside box = MEDIAN (middle value)
│ │
├──────┤
─────┤ ├─────
└──────┘
● ← Dots = outliers (values far from the rest)
What to look for: if one building type has a much higher median than another, that type tends to use more energy.
# BOXPLOT: Energy by building type
fig, ax = plt.subplots(figsize=(14, 5))
order = df.groupby('PrimaryPropertyType')['SiteEnergyUse_kBtu'].median().sort_values(ascending=False).index
sns.boxplot(data=df, x='PrimaryPropertyType', y='SiteEnergyUse_kBtu',
order=order, ax=ax, palette='husl')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
ax.set_ylabel('Site Energy Use (kBtu)')
ax.set_title('Energy Use by Building Type (sorted by median)')
plt.tight_layout()
plt.show()
# HISTOGRAM: Year built
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(df['YearBuilt'], bins=40, color='coral', edgecolor='white')
ax.set_xlabel('Year Built')
ax.set_ylabel('Count')
ax.set_title('When were these buildings constructed?')
plt.tight_layout()
plt.show()
# SCATTERPLOT: Number of floors vs Energy
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(df['NumberOfFloors'], df['SiteEnergyUse_kBtu'],
alpha=0.3, s=15, color='coral')
ax.set_xlabel('Number of Floors')
ax.set_ylabel('Site Energy Use (kBtu)')
ax.set_title('More floors → more energy (generally)')
plt.tight_layout()
plt.show()
# SCATTERPLOT: EnergyStarScore vs Energy (before we drop it)
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(df['EnergyStarScore'], df['SiteEnergyUse_kBtu'],
alpha=0.3, s=15, color='green')
ax.set_xlabel('EnergyStarScore')
ax.set_ylabel('Site Energy Use (kBtu)')
ax.set_title('EnergyStarScore vs Energy — but this column is LEAKAGE!')
plt.tight_layout()
plt.show()
Why does feature engineering help?
Think of it this way. You're trying to predict someone's running speed. The model has:
- `leg_length` = 90 cm
- `body_height` = 180 cm

These raw numbers are useful, but the ratio `leg_length / body_height` = 0.50 is directly informative — it tells you about body proportions in one number instead of two.
Same idea here: instead of giving the model GrossFloorArea = 100,000 and NumberOfFloors = 10 separately, we create AreaPerFloor = 10,000 which directly tells the model "each floor is 10,000 sq ft". The model doesn't have to figure out this relationship on its own.
Rules for feature engineering:
- New features should make physical sense (age of building, area per floor, etc.)
- NEVER use the target to create a feature — that's data leakage
- At least 3 new features are required for this project
- Verify each new feature looks reasonable (check min, max, mean)
Part 6 — Feature Engineering (at least 3 new features required)
What is feature engineering?
Creating new columns from existing ones to help the model find patterns.
Analogy: predicting how fast a car goes. You have engine_horsepower and car_weight. Creating power_to_weight_ratio = horsepower / weight gives the model a shortcut.
Golden rule: NEVER use the target to create features (that's data leakage).
Feature 1: Building age
# Older buildings often have worse insulation → more energy
df['BuildingAge'] = 2016 - df['YearBuilt']
print(f'BuildingAge:')
print(f' Min: {df["BuildingAge"].min():.0f} years (newest)')
print(f' Max: {df["BuildingAge"].max():.0f} years (oldest)')
print(f' Mean: {df["BuildingAge"].mean():.0f} years')
Feature 2: Area per floor
# A 10-floor building with 100k sqft has different characteristics
# than a 2-floor building with 100k sqft
df['AreaPerFloor'] = df['GrossFloorArea_sf'] / df['NumberOfFloors']
print(f'AreaPerFloor:')
print(f' Min: {df["AreaPerFloor"].min():>10,.0f} sq ft')
print(f' Max: {df["AreaPerFloor"].max():>10,.0f} sq ft')
print(f' Mean: {df["AreaPerFloor"].mean():>10,.0f} sq ft')
Feature 3: Has multiple uses (binary flag)
# Mixed-use buildings (office + retail) may have different energy profiles
df['HasMultipleUses'] = (df['NumberOfPropertyUses'] > 1).astype(int)
print(f'Single use: {(df["HasMultipleUses"] == 0).sum()} buildings')
print(f'Multiple uses: {(df["HasMultipleUses"] == 1).sum()} buildings')
Feature 4: Old building flag
# Pre-1960 buildings often lack modern insulation
df['IsOldBuilding'] = (df['YearBuilt'] < 1960).astype(int)
print(f'Old (pre-1960): {df["IsOldBuilding"].sum()} buildings')
print(f'Newer (1960+): {(df["IsOldBuilding"] == 0).sum()} buildings')
print(f'\nTotal new features created: 4 (minimum 3 required)')
Part 7 — Data Leakage: the most common beginner mistake
What is data leakage?
Your model accidentally gets information it shouldn't have. Like taking an exam with the answer key.
In our dataset:
SiteEnergyUse_kBtu = Electricity + NaturalGas + Steam + other
^^^^^^^^^^^ ^^^^^^^^^^ ^^^^^
COMPONENTS of total energy!
If we use Electricity_kBtu as a FEATURE to predict SiteEnergyUse_kBtu,
the model just learns to add up the components → R² = 0.99+
Looks amazing... but completely USELESS on new buildings where
we DON'T have energy measurements!
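To see the effect numerically, here is a small synthetic sketch (all numbers made up for illustration, not the project data). One model gets only floor area; the other gets the energy components themselves:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
floor_area = rng.uniform(1_000, 100_000, n)
electricity = floor_area * rng.uniform(10, 40, n)   # component of the target
natural_gas = floor_area * rng.uniform(2, 20, n)    # component of the target
total_energy = electricity + natural_gas + rng.normal(0, 5_000, n)

# Honest model: only structural data (floor area)
X_honest = floor_area.reshape(-1, 1)
Xtr, Xte, ytr, yte = train_test_split(X_honest, total_energy, random_state=42)
r2_honest = LinearRegression().fit(Xtr, ytr).score(Xte, yte)

# Leaky model: the components of the target as features
X_leaky = np.column_stack([electricity, natural_gas])
Xtr, Xte, ytr, yte = train_test_split(X_leaky, total_energy, random_state=42)
r2_leaky = LinearRegression().fit(Xtr, ytr).score(Xte, yte)

print(f'Honest R² (floor area only):   {r2_honest:.3f}')
print(f'Leaky  R² (energy components): {r2_leaky:.3f}')  # near-perfect, and useless
```

The leaky model scores almost perfectly because it only has to learn "add the components", which is exactly the information we won't have for new buildings.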
Columns that cause leakage:
| Column | Why it leaks |
|---|---|
| `Electricity_kBtu` | Component of total energy |
| `NaturalGas_kBtu` | Component of total energy |
| `Steam_kBtu` | Component of total energy |
| `CO2Emissions_tons` | Computed FROM total energy |
| `EnergyStarScore` | Derived FROM energy consumption |
| `SiteEUI_kBtu_sf` | = SiteEnergy / FloorArea (contains the target!) |
Also drop BuildingID — it's just a label, not a feature.
leakage_cols = [
'Electricity_kBtu', 'NaturalGas_kBtu', 'Steam_kBtu',
'CO2Emissions_tons', 'EnergyStarScore', 'SiteEUI_kBtu_sf',
]
drop_cols = leakage_cols + ['BuildingID']
print(f'Dropping {len(drop_cols)} columns:')
for col in drop_cols:
    print(f' ✗ {col}')
Data leakage — another example to make sure you get it
Imagine predicting a student's final exam grade. You have:
✓ Hours_Studied = 45 ← GOOD feature (known before the exam)
✓ Attendance_Rate = 0.90 ← GOOD feature (known before the exam)
✗ Exam_Score = 85 ← THIS IS WHAT WE'RE PREDICTING! Can't use it!
✗ Grade_Letter = "B+" ← DERIVED from the exam score! Leakage!
✗ Class_Rank = 12 ← COMPUTED after all grades are known! Leakage!
Same logic in our project:
✓ GrossFloorArea ← GOOD: structural data we have before measurement
✓ YearBuilt ← GOOD: structural data
✗ Electricity_kBtu ← BAD: component of total energy (the thing we predict!)
✗ SiteEUI_kBtu_sf ← BAD: = SiteEnergy / FloorArea (contains the answer!)
Rule: if a column COULD NOT EXIST without measuring the target, drop it.
df.drop(columns=drop_cols, inplace=True)
print(f'\nColumns remaining: {df.shape[1]}')
for i, col in enumerate(df.columns):
    marker = '← TARGET' if col == 'SiteEnergyUse_kBtu' else ''
    print(f' {i+1:2d}. {col:30s} {marker}')
What is correlation?
Correlation measures how two variables move together. It ranges from -1 to +1:
r = +1.0 → Perfect positive: when A goes up, B always goes up
r = +0.7 → Strong positive: A up → B mostly up
r = 0.0 → No relationship
r = -0.7 → Strong negative: A up → B mostly down
r = -1.0 → Perfect negative
Example:
FloorArea and Energy: r = +0.85 (bigger buildings use more energy)
EnergyStarScore and Energy: r = -0.40 (efficient buildings use less)
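A quick sketch with made-up series shows how r behaves in practice (series `a`, `b`, `c` are synthetic, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=2_000)
b = 2 * a + rng.normal(size=2_000)   # built to move with a
c = rng.normal(size=2_000)           # independent of a

r_ab = np.corrcoef(a, b)[0, 1]
r_ac = np.corrcoef(a, c)[0, 1]
print(f'r(a, b) = {r_ab:+.2f}')  # strong positive (~+0.89 by construction)
print(f'r(a, c) = {r_ac:+.2f}')  # near zero: no relationship
```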
Why drop highly correlated FEATURES?
If GrossFloorArea and AreaPerFloor have r = 0.95, they tell the model almost the same thing. Keeping both:
- Wastes space
- Can confuse linear models (they fight over which one to use)
Rule of thumb: if |r| > 0.70 between two features (NOT the target), drop one.
Part 8 — Correlation matrix
Why check correlations?
If two features are very similar (|r| > 0.70), they carry the same information. Keeping both:
- Wastes space
- Can make linear models unstable
A correlation of 1.0 = perfect positive relationship. -1.0 = perfect negative. 0 = no relationship.
num_df = df.select_dtypes(include=[np.number])
corr = num_df.corr()
print(f'Correlation matrix: {corr.shape[0]} x {corr.shape[1]} features')
# Visualize as heatmap
plt.figure(figsize=(12, 9))
mask = np.triu(np.ones_like(corr, dtype=bool)) # Hide upper triangle (redundant)
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f',
cmap='coolwarm', center=0, square=True,
linewidths=0.3, annot_kws={'size': 8})
plt.title('Correlation Matrix — look for |r| > 0.70')
plt.tight_layout()
plt.show()
# Automatically find highly correlated pairs
target_name = 'SiteEnergyUse_kBtu'
print('Feature pairs with |r| > 0.70 (excluding target):')
found = False
for i in range(len(corr)):
    for j in range(i + 1, len(corr)):
        fi = corr.index[i]
        fj = corr.columns[j]
        r = corr.iloc[i, j]
        if abs(r) > 0.70 and fi != target_name and fj != target_name:
            print(f' {fi} <-> {fj}: r = {r:.3f}')
            found = True
if not found:
    print(' None found — no redundant features to drop!')
Why can't ML use text directly?
Imagine you give the model this:
PrimaryPropertyType = "Office"
The model does math: multiplications, additions, comparisons. You can multiply numbers, but you can't multiply "Office" by 0.5. So we need to convert text to numbers.
OneHot encoding — the most common method
Instead of one column with 10 text values, we create one binary column per category, then drop 1 to avoid redundancy (so 10 categories become 9 columns):
BEFORE:              AFTER (with drop='first'):
Type                 Type_Hotel  Type_Office  Type_Retail  ...
────────             ──────────  ───────────  ───────────
Hospital                  0           0            0       ← all zeros = Hospital
Office                    0           1            0
Retail                    0           0            1
Hotel                     1           0            0
We drop the first category because if all the others are 0, the model already knows the value.
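A quick way to see this mechanic is pandas' get_dummies (the sklearn OneHotEncoder used below does the same thing and integrates with pipelines; the category names here are just the ones from the diagram):

```python
import pandas as pd

types = pd.Series(['Hospital', 'Office', 'Retail', 'Hotel', 'Hospital'], name='Type')
dummies = pd.get_dummies(types, prefix='Type', drop_first=True).astype(int)

print(dummies.columns.tolist())  # 4 categories -> 3 columns (Hospital, first alphabetically, is dropped)
print(dummies)                   # the Hospital rows are all zeros
```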
Part 9 — Encode categorical features
Why?
ML algorithms only understand numbers. Text columns like 'Office' or 'Zone_A' must be converted.
Which encoding method?
| Method | When to use | Example |
|---|---|---|
| OneHotEncoder | No natural order | Department, Neighborhood |
| Ordinal mapping | Natural order | Junior < Mid < Senior |
Building type and neighborhood have no natural order, so we use OneHotEncoder.
# Which columns are text?
cat_cols = df.select_dtypes(include=['object']).columns.tolist()
print(f'Categorical columns: {cat_cols}')
for col in cat_cols:
print(f' {col}: {df[col].nunique()} unique values')
# Create encoder
ohe = OneHotEncoder(
sparse_output=False, # Dense array (easier to work with)
drop='first', # Drop 1st category to avoid multicollinearity
handle_unknown='infrequent_if_exist'
)
print('OneHotEncoder created with drop="first"')
print('(drop="first" means if we have [A, B, C], we create cols for B and C only)')
print('(because if B=0 and C=0, the model knows it must be A)')
Concrete example: before and after OneHotEncoding
BEFORE (3 buildings):
┌──────────────────┐
│ PrimaryProperty │
├──────────────────┤
│ Office │
│ Hotel │
│ Office │
└──────────────────┘
AFTER (with drop='first'; categories come from the full column, sorted alphabetically, so Hospital is the one dropped — these 3 rows are just a sample):
┌────────────┬───────────┬────────────┬───────────┐
│ Type_Hotel │ Type_Off │ Type_Ret │ Type_Ware │ ...
├────────────┼───────────┼────────────┼───────────┤
│ 0 │ 1 │ 0 │ 0 │ ← Office
│ 1 │ 0 │ 0 │ 0 │ ← Hotel
│ 0 │ 1 │ 0 │ 0 │ ← Office
└────────────┴───────────┴────────────┴───────────┘
Each building becomes a row of 0s and 1s. The model can now do math with these.
# Fit and transform
encoded = ohe.fit_transform(df[cat_cols])
enc_df = pd.DataFrame(
encoded,
columns=ohe.get_feature_names_out(cat_cols),
index=df.index
)
print(f'Created {enc_df.shape[1]} binary columns:')
for col in enc_df.columns[:8]:
print(f' {col}')
if enc_df.shape[1] > 8:
print(f' ... and {enc_df.shape[1] - 8} more')
# Combine: drop originals, add encoded
df_encoded = pd.concat([df.drop(columns=cat_cols), enc_df], axis=1)
print(f'Final dataset: {df_encoded.shape[0]} rows x {df_encoded.shape[1]} columns')
The train/test split — explained simply
Imagine you're studying for an exam:
- You study from a textbook (= training data)
- You take the exam with new questions (= test data)
If you memorize the textbook word-for-word (= overfitting), you might fail the exam because you haven't learned the patterns, just the specific examples.
The train/test split simulates this:
- Train set (80%): the model studies these buildings
- Test set (20%): new buildings the model has NEVER seen
If it predicts well on the test set, it has learned real patterns — not just memorized.
Part 10 — Separate X and y, then split into train/test
Why split?
Full data (100%)
├── Train (80%): model learns from this
└── Test (20%): we pretend these are "new" buildings
and check if predictions are good
If the model performs well on TEST data it has never seen,
we can trust it on truly new buildings.
target = 'SiteEnergyUse_kBtu'
# X = everything except the target
X = df_encoded.drop(columns=[target])
# y = just the target
y = df_encoded[target]
print(f'X shape: {X.shape} (features)')
print(f'y shape: {y.shape} (target)')
print(f'Target mean: {y.mean():,.0f} kBtu')
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% for test
random_state=RANDOM_STATE # Reproducible split
)
print(f'Train: {len(X_train)} buildings ({len(X_train)/len(X)*100:.0f}%)')
print(f'Test: {len(X_test)} buildings ({len(X_test)/len(X)*100:.0f}%)')
Part 11 — A reusable evaluation function
We'll train 3 models. Instead of copying evaluation code 3 times, we write it once in a function.
Build → Test → Use (the same pattern as .apply())
def evaluate_model(model, X_tr, y_tr, X_te, y_te, name):
"""Train a regression model and print all metrics."""
# 1. Train
model.fit(X_tr, y_tr)
# 2. Predict on both sets
pred_tr = model.predict(X_tr)
pred_te = model.predict(X_te)
# 3. Compute metrics
r2_tr = r2_score(y_tr, pred_tr)
r2_te = r2_score(y_te, pred_te)
mae_te = mean_absolute_error(y_te, pred_te)
rmse_te = np.sqrt(mean_squared_error(y_te, pred_te))
# 4. Print results
print(f'\n{name}:')
print(f' R² train: {r2_tr:.4f} R² test: {r2_te:.4f}')
print(f' MAE: {mae_te:>12,.0f} kBtu')
print(f' RMSE: {rmse_te:>12,.0f} kBtu')
print(f' → On average, the model is off by ~{mae_te:,.0f} kBtu')
# Check for overfitting
if r2_tr - r2_te > 0.10:
print(f' ⚠ Possible overfitting: train R² much higher than test R²')
return {'name': name, 'R2_train': r2_tr, 'R2_test': r2_te,
'MAE': mae_te, 'RMSE': rmse_te, 'model': model}
# Test the function exists
print('evaluate_model() is ready to use!')
print('It takes: model, X_train, y_train, X_test, y_test, name')
What does a regression model actually DO?
In the simplest case (1 feature), it finds the best line through the data:
Energy
│ ●
│ ● /
│ ● / ← best line
│ ● /
│ ● /
│ ●/
│/●
└──────────── FloorArea
The model learns: Energy = slope × FloorArea + intercept
With many features, it becomes: Energy = w₁×FloorArea + w₂×Age + w₃×Floors + ... + bias
Each weight (w) tells the model how important each feature is. Ridge adds a penalty to keep weights small, which prevents overfitting.
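Here is a tiny sketch of that penalty in action on synthetic data (the dataset and the deliberately large alpha are made up to exaggerate the effect):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=50)  # true slope = 3

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=100.0).fit(X, y)  # large penalty to make the shrinkage visible

print(f'OLS weight:   {ols.coef_[0]:.2f}')  # near 3
print(f'Ridge weight: {ridge.coef_[0]:.2f}')  # pulled toward 0 by the penalty
```

With a realistic alpha (like the 1.0 used below), the shrinkage is gentler — just enough to stop weights from exploding when features are correlated.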
What is overfitting?
UNDERFITTING:          GOOD FIT:            OVERFITTING:
Too simple             Just right           Too complex
──────                 ╱╲  ╱╲               ╱╲╱╲╱╲╱╲
(misses pattern)      ╱  ╲╱  ╲              (memorizes noise,
                      (captures trend)       fails on new data)
Part 12 — Model 1: Ridge Regression (linear)
What does "linear" mean?
The model tries to find the best formula like:
Energy = a₁ × FloorArea + a₂ × Age + a₃ × Floors + ... + bias
It's a weighted sum. Ridge adds a small penalty to keep the weights small (prevents overfitting).
Why does it need scaling?
Feature A: FloorArea ranges from 1,000 to 2,000,000
Feature B: NumberOfFloors ranges from 1 to 50
Without scaling, the model thinks FloorArea is 40,000x more important
just because its numbers are bigger! Scaling fixes this.
StandardScaler makes every feature: mean=0, std=1
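A minimal sketch of what StandardScaler actually does (the numbers are invented to mimic the ranges above):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1_000.0, 1.0],
              [500_000.0, 10.0],
              [2_000_000.0, 50.0]])  # col 0: floor area, col 1: number of floors

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6))  # ~[0, 0]
print(X_scaled.std(axis=0).round(6))   # [1, 1] — both features now on equal footing
```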
We use a Pipeline to chain scaler + model. This ensures the scaler is fitted on TRAIN data only (no leakage).
What is a Pipeline?
A Pipeline chains multiple steps into one object. Instead of:
# Without Pipeline (manual, error-prone):
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit + transform on TRAIN
X_test_scaled = scaler.transform(X_test) # transform only on TEST (no fit!)
model = Ridge()
model.fit(X_train_scaled, y_train) # train
predictions = model.predict(X_test_scaled) # predict
With a Pipeline, everything happens in one line:
# With Pipeline (clean, safe):
pipe = Pipeline([('scaler', StandardScaler()), ('model', Ridge())])
pipe.fit(X_train, y_train) # scaler fits on train, then model trains
pipe.predict(X_test) # scaler transforms test, then model predicts
Why is this better?
- No leakage risk: the scaler is fitted on TRAIN only, never on TEST
- Cleaner code: one object to manage instead of two
- Works with cross-validation: GridSearchCV handles the scaler automatically
ridge_pipe = Pipeline([
('scaler', StandardScaler()), # Step 1: normalize
('model', Ridge(alpha=1.0, random_state=RANDOM_STATE)) # Step 2: train
])
print('Pipeline created: StandardScaler → Ridge')
res_ridge = evaluate_model(
ridge_pipe, X_train, y_train, X_test, y_test,
'Ridge Regression (linear)')
How to interpret these results
Let's say the output was:
Ridge Regression (linear):
R² train: 0.7200 R² test: 0.7100
MAE: 102,000 kBtu
RMSE: 145,000 kBtu
What this means in plain English:
- R² = 0.71 → The model explains 71% of the variation in energy consumption. Not bad, but 29% is still unexplained (noise, missing features, non-linear patterns).
- R² train ≈ R² test → No overfitting! The model generalizes well. If train was 0.99 and test was 0.50, that would mean overfitting (memorized training data).
- MAE = 102,000 kBtu → On average, when the model predicts energy for a building, it's wrong by about 102,000 kBtu. Is that good? Depends on context — if buildings average 400,000 kBtu, being off by 102,000 is a ~25% error.
- RMSE ≥ MAE → This is always true; they're equal only when every error has the same magnitude. RMSE penalizes large errors more, so a big gap between RMSE and MAE means some predictions are way off (the model struggles with certain buildings).
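You can verify the RMSE-vs-MAE behavior with two invented error vectors that have the same average absolute error but different spread:

```python
import numpy as np

errors_even  = np.array([10.0, 10.0, 10.0, 10.0])  # every prediction off by 10
errors_spiky = np.array([1.0, 1.0, 1.0, 37.0])     # one huge miss, same MAE

for name, e in [('even', errors_even), ('spiky', errors_spiky)]:
    mae = np.mean(np.abs(e))
    rmse = np.sqrt(np.mean(e ** 2))
    print(f'{name:6s} MAE = {mae:.1f}  RMSE = {rmse:.1f}')
# even:  MAE = 10, RMSE = 10  (all errors equal -> they match)
# spiky: MAE = 10, RMSE jumps well above 10 (the one big miss dominates)
```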
What is a decision tree?
Before understanding a forest, let's understand one tree:
          Is FloorArea > 50,000?
            /              \
          YES               NO
           |                 |
     Is Age > 40?      Is Floors > 5?
       /      \          /      \
     YES       NO      YES       NO
      |         |       |         |
  400,000   250,000  150,000   80,000   (kBtu)
The tree asks yes/no questions, splitting data into smaller groups until it reaches a prediction. It's like a flowchart.
A Random Forest builds 200 of these on random subsets and averages their predictions. This is called bagging (Bootstrap AGGregating). The averaging makes the model much more stable.
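To see that a tree really is a flowchart with a fixed set of leaf answers, here's a tiny sketch on synthetic data (the numbers are made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
area = rng.uniform(1_000, 100_000, size=(200, 1))
energy = 4 * area[:, 0] + rng.normal(scale=20_000, size=200)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(area, energy)
n_distinct = np.unique(tree.predict(area)).size
print(f'Distinct predictions from a depth-2 tree: {n_distinct}')  # at most 4 (one per leaf)
```

A single shallow tree can only answer with a handful of values — which is exactly why averaging hundreds of trees built on different random subsets produces much smoother, more stable predictions.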
Part 13 — Model 2: Random Forest (bagging, tree-based)
How does it work?
- Take a random sample of the data
- Build a decision tree on it
- Repeat 200 times (n_estimators=200)
- Average all 200 predictions
Key advantages:
- Captures non-linear patterns (curves, interactions)
- Does NOT need scaling (trees split on values, not distances)
- Less prone to overfitting than a single tree (averaging reduces variance)
rf = RandomForestRegressor(
n_estimators=200, # Build 200 trees
random_state=RANDOM_STATE,
n_jobs=-1 # Use all CPU cores (faster)
)
print('RandomForest created: 200 trees, all CPUs')
res_rf = evaluate_model(
rf, X_train, y_train, X_test, y_test,
'Random Forest (bagging)')
Part 14 — Model 3: Gradient Boosting (boosting, tree-based)
How is it different from Random Forest?
Random Forest: 200 trees work INDEPENDENTLY → average
Gradient Boost: 200 trees work SEQUENTIALLY → each fixes the last one's errors
learning_rate controls how much each tree contributes. Smaller = more cautious but needs more trees.
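The "each tree fixes the last one's errors" idea can be sketched by hand with shallow trees on synthetic data (GradientBoostingRegressor, used below, is a refined version of this loop):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 10 * np.sin(X[:, 0]) + rng.normal(scale=0.5, size=300)

learning_rate = 0.5
pred = np.full_like(y, y.mean())              # start from a constant guess
mse_start = np.mean((y - pred) ** 2)

for _ in range(20):
    residuals = y - pred                      # what is still unexplained
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += learning_rate * stump.predict(X)  # each tree nudges the prediction

mse_end = np.mean((y - pred) ** 2)
print(f'MSE before boosting: {mse_start:.2f}, after 20 trees: {mse_end:.2f}')
```

Each iteration fits a small tree to the current residuals and adds a fraction (learning_rate) of its correction — the error shrinks step by step.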
gb = GradientBoostingRegressor(
n_estimators=200,
learning_rate=0.1, # How much each tree contributes
max_depth=5, # How deep each tree can grow
random_state=RANDOM_STATE
)
print('GradientBoosting: 200 sequential trees, lr=0.1, depth=5')
res_gb = evaluate_model(
gb, X_train, y_train, X_test, y_test,
'Gradient Boosting')
Quick comparison so far
Reading the comparison table
name R2_train R2_test MAE RMSE
Ridge (linear) 0.72 0.71 102,000 145,000
Random Forest (bagging) 0.95 0.82 78,000 110,000
Gradient Boosting 0.91 0.85 68,000 98,000
What to notice:
- Gradient Boosting wins on test set (highest R², lowest MAE/RMSE)
- Random Forest overfits slightly (train R²=0.95 is much higher than test R²=0.82)
- Ridge is the weakest but simplest — it can't capture non-linear patterns
- Non-linear models (trees) beat linear — energy consumption has complex relationships with features
This is why we try multiple model categories: you can't know which will work best beforehand.
# Show all 3 results side by side
quick = pd.DataFrame([res_ridge, res_rf, res_gb]).drop(columns=['model'])
print(quick.to_string(index=False))
Part 15 — Cross-Validation: getting reliable estimates
Why is one train/test split not enough?
One split can be lucky or unlucky. Imagine asking ONE person if a movie is good vs asking FIVE people.
5-fold cross-validation splits the data 5 different ways and averages:
Fold 1: [TEST] [train] [train] [train] [train] → R² = 0.82
Fold 2: [train] [TEST] [train] [train] [train] → R² = 0.79
Fold 3: [train] [train] [TEST] [train] [train] → R² = 0.84
Fold 4: [train] [train] [train] [TEST] [train] → R² = 0.81
Fold 5: [train] [train] [train] [train] [TEST] → R² = 0.80
Final: R² = 0.81 +/- 0.02 (MUCH more reliable!)
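The diagram above maps directly onto sklearn's KFold. With 10 pretend buildings (no shuffling here, just to keep the output readable) you can see each building land in the test fold exactly once:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # 10 pretend buildings

all_test = []
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(X), start=1):
    print(f'Fold {fold}: test buildings = {test_idx.tolist()}')
    all_test.extend(test_idx.tolist())

print(sorted(all_test))  # every building appears in a test fold exactly once
```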
kf = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
scoring = {
'R2': 'r2',
'MAE': 'neg_mean_absolute_error',
}
print('KFold: 5 splits, shuffled')
print('Scoring: R² and MAE')
# Cross-validate all 3 models
for model, name, needs_scaling in [
(Ridge(alpha=1.0, random_state=RANDOM_STATE),
'Ridge (linear)', True),
(RandomForestRegressor(n_estimators=200, random_state=RANDOM_STATE, n_jobs=-1),
'Random Forest (bagging)', False),
(GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=5, random_state=RANDOM_STATE),
'Gradient Boosting', False),
]:
if needs_scaling:
pipe = Pipeline([('scaler', StandardScaler()), ('model', model)])
else:
pipe = Pipeline([('model', model)])
cv = cross_validate(pipe, X, y, cv=kf, scoring=scoring,
return_train_score=True, n_jobs=-1)
r2_vals = cv['test_R2']
mae_vals = -cv['test_MAE'] # sklearn returns NEGATIVE MAE
print(f'\n{name}:')
print(f' R² = {r2_vals.mean():.4f} +/- {r2_vals.std():.4f}')
print(f' MAE = {mae_vals.mean():>10,.0f} +/- {mae_vals.std():>8,.0f} kBtu')
Parameters vs Hyperparameters
This distinction confuses many beginners:
| | Parameters | Hyperparameters |
|---|---|---|
| What | The weights the model LEARNS | Settings YOU choose BEFORE training |
| Who sets them | The model (during .fit()) | You (before calling .fit()) |
| Example | The weight of FloorArea in Ridge | alpha in Ridge, n_estimators in a forest |
| Can be optimized? | Automatically by training | By GridSearchCV |
Analogy: If the model is a recipe...
- Parameters = the amounts of each ingredient (the model figures these out)
- Hyperparameters = oven temperature and cooking time (you choose these before baking)
GridSearchCV tries many "oven temperatures" and picks the one that produces the best cake.
Part 16 — Hyperparameter Optimization (GridSearchCV)
What are hyperparameters?
Settings you choose before training. Different values → different performance. Finding the best combination manually is tedious, so GridSearch automates it.
What we're tuning:
| Parameter | What it does | Values to try |
|---|---|---|
| n_estimators | Number of trees | 100, 200, 300 |
| max_depth | How deep each tree grows | 3, 5, 7, 10 |
| learning_rate | Step size for each tree | 0.05, 0.1, 0.2 |
| min_samples_split | Min samples to split a node | 2, 5, 10 |
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [3, 5, 7, 10],
'learning_rate': [0.05, 0.1, 0.2],
'min_samples_split': [2, 5, 10],
}
total_combos = 3 * 4 * 3 * 3 # = 108
print(f'Total combinations: {total_combos}')
print(f'With 5-fold CV: {total_combos * 5} model fits')
print(f'This may take a few minutes...')
What GridSearchCV does, step by step
You tell it: "Try these combinations"
n_estimators: [100, 200, 300] → 3 options
max_depth: [3, 5, 7, 10] → 4 options
learning_rate:[0.05, 0.1, 0.2] → 3 options
min_samples: [2, 5, 10] → 3 options
─────────
Total: 3×4×3×3 = 108 combinations
For EACH combination, it does 5-fold cross-validation:
Combo 1: n_est=100, depth=3, lr=0.05, samples=2
→ Train on folds 1-4, test on fold 5 → R² = 0.78
→ Train on folds 1-3+5, test on fold 4 → R² = 0.80
→ ... (3 more folds)
→ Average: R² = 0.79
Combo 2: n_est=100, depth=3, lr=0.05, samples=5
→ (same 5-fold process)
→ Average: R² = 0.77
... (106 more combos)
Combo 108: n_est=300, depth=10, lr=0.2, samples=10
→ Average: R² = 0.84 ← BEST!
GridSearchCV picks Combo 108 and retrains on ALL training data.
Total model fits: 108 combos × 5 folds = 540 models trained! This is why it takes time.
gs = GridSearchCV(
GradientBoostingRegressor(random_state=RANDOM_STATE),
param_grid,
cv=KFold(5, shuffle=True, random_state=RANDOM_STATE),
scoring='r2',
n_jobs=-1,
verbose=1,
return_train_score=True
)
gs.fit(X_train, y_train)
print(f'\nDone! Best R² (CV): {gs.best_score_:.4f}')
# What parameters won?
print('Best parameters:')
for k, v in gs.best_params_.items():
print(f' {k}: {v}')
# Evaluate the optimized model on the test set
best_model = gs.best_estimator_
y_pred_best = best_model.predict(X_test)
r2_opt = r2_score(y_test, y_pred_best)
mae_opt = mean_absolute_error(y_test, y_pred_best)
rmse_opt = np.sqrt(mean_squared_error(y_test, y_pred_best))
print(f'Optimized Gradient Boosting on TEST set:')
print(f' R²: {r2_opt:.4f}')
print(f' MAE: {mae_opt:>10,.0f} kBtu')
print(f' RMSE: {rmse_opt:>10,.0f} kBtu')
print(f' → The model is off by ~{mae_opt:,.0f} kBtu on average')
Feature importance — the intuition
Imagine you're predicting house prices. Someone asks "what matters most: location, size, or age?"
Permutation importance answers this with an experiment:
- Take a trained model that predicts well (R² = 0.80)
- Randomly shuffle the FloorArea column (destroy its information)
- Re-measure: R² drops to 0.45
- Conclusion: FloorArea was very important (R² dropped by 0.35)
Repeat for every feature. The features that cause the biggest drop when shuffled are the most important.
Feature R² without it Drop Importance
───────── ───────────── ──── ──────────
FloorArea 0.45 -0.35 ★★★★★
BuildingAge 0.70 -0.10 ★★
NumberOfFloors 0.75 -0.05 ★
Neighborhood 0.78 -0.02 ·
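That experiment takes only a few lines to run by hand on synthetic data (sklearn's permutation_importance, used below, automates it and averages over many repeats):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)      # a feature that truly drives y
noise_feat = rng.normal(size=n)  # a feature with no effect
y = 5 * signal + rng.normal(scale=0.5, size=n)

X = np.column_stack([signal, noise_feat])
model = LinearRegression().fit(X, y)
base_r2 = r2_score(y, model.predict(X))

drops = {}
for i, name in enumerate(['signal', 'noise_feat']):
    X_shuffled = X.copy()
    X_shuffled[:, i] = rng.permutation(X_shuffled[:, i])  # destroy this column's information
    drops[name] = base_r2 - r2_score(y, model.predict(X_shuffled))
    print(f'{name:10s}: R² drop = {drops[name]:.3f}')
```

Shuffling the informative feature wrecks the score; shuffling the noise feature barely moves it — that gap is the importance.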
Part 17 — Feature Importance: what drives energy consumption?
Why this matters
The city doesn't just want predictions — they want to know why. Which building characteristics drive high energy consumption? This informs policy.
Two methods:
| Method | How it works | Speed |
|---|---|---|
| Built-in (tree) | Measures how much each feature reduces error in tree splits | Fast |
| Permutation | Shuffles each feature and measures how much the score drops | Slower but more reliable |
# Method 1: Built-in feature importance
fi_df = pd.DataFrame({
'Feature': X.columns,
'Importance': best_model.feature_importances_
}).sort_values('Importance', ascending=False)
print('Top 10 features (built-in importance):')
print(fi_df.head(10).to_string(index=False))
# Plot top 15
top15 = fi_df.head(15)
fig, ax = plt.subplots(figsize=(10, 7))
ax.barh(top15['Feature'][::-1], top15['Importance'][::-1], color='steelblue')
ax.set_xlabel('Feature Importance (built-in)')
ax.set_title('Top 15 Features — Gradient Boosting Built-in')
plt.tight_layout()
plt.show()
# Method 2: Permutation importance
print('Computing permutation importance (30 repeats)...')
perm = permutation_importance(
best_model, X_test, y_test,
n_repeats=30,
random_state=RANDOM_STATE,
n_jobs=-1,
scoring='r2'
)
print('Done!')
How to read a feature importance barplot
GrossFloorArea_sf ████████████████████ 0.35 ← BY FAR the most important
BuildingAge ████████ 0.12 ← Second
AreaPerFloor ██████ 0.09 ← Our engineered feature works!
NumberOfFloors █████ 0.07
HasMultipleUses ███ 0.04 ← Another engineered feature
Type_Hospital ██ 0.03 ← Hospitals use lots of energy
IsOldBuilding ██ 0.02
...
How to interpret this for the client:
- Floor area dominates — bigger buildings use more energy (obvious but confirmed)
- Building age matters — older buildings have worse insulation
- Our engineered features appear in the top — AreaPerFloor and HasMultipleUses added value
- Building type matters — hospitals and hotels use more than offices
- Some features barely matter — could be dropped to simplify the model
# Permutation results
pi_df = pd.DataFrame({
'Feature': X.columns,
'Importance': perm.importances_mean
}).sort_values('Importance', ascending=False)
print('Top 10 features (permutation importance):')
print(pi_df.head(10).to_string(index=False))
# Plot top 15
top15_perm = pi_df.head(15)
fig, ax = plt.subplots(figsize=(10, 7))
ax.barh(top15_perm['Feature'][::-1], top15_perm['Importance'][::-1], color='coral')
ax.set_xlabel('Permutation Importance (R² decrease)')
ax.set_title('Top 15 Features — Permutation Importance')
plt.tight_layout()
plt.show()
Compare both methods side by side
comp = pd.DataFrame({
'Feature': X.columns,
'Tree_Builtin': best_model.feature_importances_,
'Permutation': perm.importances_mean,
}).sort_values('Permutation', ascending=False).head(12)
print('Top 12 features — two methods compared:')
print(comp.to_string(index=False))
Part 18 — Final Summary
# Collect all results
all_results = pd.DataFrame([
res_ridge, res_rf, res_gb,
{'name': 'Optimized GB',
'R2_train': r2_score(y_train, best_model.predict(X_train)),
'R2_test': r2_opt, 'MAE': mae_opt, 'RMSE': rmse_opt}
]).drop(columns=['model'], errors='ignore')
# Sort by test R²
all_results = all_results.sort_values('R2_test', ascending=False)
print('='*72)
print('FINAL MODEL COMPARISON')
print('='*72)
print(all_results.to_string(index=False))
What we'd tell the client
Here's how a data scientist would present these results to the city:
"We built a model that predicts building energy consumption using only structural data you already have — floor area, age, type, number of floors.
The model explains about X% of the variation in energy use (R² = X.XX), and its predictions are off by about Y kBtu on average.
The biggest driver is building size: floor area alone explains most of the variation. Building age is the second factor — older buildings tend to use more energy, likely due to poorer insulation.
Practical application: instead of measuring every building, you can use this model to flag buildings where predicted consumption seems abnormally high, and send inspectors only to those. This could cut inspection costs by 80% while still catching the worst offenders.
Limitations: the model can't capture everything — unusual building configurations, recent renovations, or operational differences are not in the data. We recommend adding renovation history and HVAC system type if available in the future."
This is the storytelling part. Raw numbers don't convince decision-makers — narratives do.
# Business translation
best_r2 = all_results.iloc[0]['R2_test']
best_mae = all_results.iloc[0]['MAE']
best_name = all_results.iloc[0]['name']
print(f'\nBest model: {best_name}')
print(f' Explains {best_r2*100:.1f}% of variance in energy consumption')
print(f' Average prediction error: ~{best_mae:,.0f} kBtu')
print(f'')
print(f'Business insight:')
print(f' Instead of measuring every building, the city can use this model')
print(f' to estimate energy consumption from structural data alone,')
print(f' with an average error of ~{best_mae:,.0f} kBtu.')
Key Takeaways
- Filter first: remove non-compliant buildings (ComplianceStatus) and flagged outliers (Outlier).
- Data leakage is the #1 mistake: never use energy source columns to predict total energy.
- Feature engineering adds value: building age, area per floor, multi-use flags capture real patterns.
- Correlation matrix: drop one from each pair with |r| > 0.70.
- 3 model categories: linear (Ridge), bagging (Random Forest), boosting (Gradient Boosting).
- Scale linear models: use StandardScaler in a Pipeline. Tree models don't need scaling.
- Cross-validate everything: mean +/- std is far more reliable than a single split.
- GridSearchCV: systematic tuning over 3+ hyperparameters.
- Feature importance: two methods (built-in + permutation) to tell the city what matters.
- Storytelling: "the model is off by ~X kBtu" beats "MAE = X".
Happy predicting!
