How I helped a city estimate energy use from structural data — explained like you're starting from zero
The story
I work as a data scientist for a small urban planning consultancy. A European city aiming for carbon neutrality by 2050 came to us with a problem: they need to predict the total energy consumption of commercial buildings using only structural information — size, age, type, location.
Why? Because measuring actual energy consumption is expensive: inspectors have to physically visit each building with specialized equipment. But the city already has structural data from building permits and tax records. If we can predict energy use from that, inspectors only need to visit buildings where the model flags unusually high consumption.
The city gave us data from ~3,800 non-residential buildings measured in 2016. Our mission:
- Clean and filter the data
- Understand which features matter
- Build and compare regression models
- Tell the city what drives energy consumption
Regression vs Classification
CLASSIFICATION: predict a CATEGORY
"Will this employee leave?" → Yes / No
Metrics: Precision, Recall, F1
REGRESSION: predict a NUMBER
"How much energy will this building use?" → 347,000 kBtu
Metrics: R², MAE, RMSE
This project is regression — we predict a continuous number.
What you'll learn
| Part | Topic | Why it matters |
|---|---|---|
| 0 | Environment setup | Reproducibility |
| 1 | Load & explore the data | Understand what we have |
| 2 | Explore the target variable | Know what we're predicting |
| 3 | Filter non-compliant & outliers | Remove bad data |
| 4 | Handle missing values | Models can't handle NaN |
| 5 | Visual exploration (EDA) | See patterns before modeling |
| 6 | Feature engineering (4 new features) | Help the model find patterns |
| 7 | Data leakage (the #1 mistake) | Avoid fake good results |
| 8 | Correlation matrix | Remove redundant features |
| 9 | Encode categorical features | ML needs numbers |
| 10 | Train/test split | Reliable evaluation |
| 11 | Helper function for evaluation | Don't repeat yourself |
| 12 | Model 1: Ridge (linear) | Baseline |
| 13 | Model 2: Random Forest (bagging) | Non-linear |
| 14 | Model 3: Gradient Boosting (boosting) | Usually best |
| 15 | Cross-validation (5-fold) | Mean +/- std |
| 16 | GridSearchCV | Tune hyperparameters |
| 17 | Feature importance (2 methods) | What drives consumption? |
| 18 | Final summary | Business conclusions |
Every concept is explained before the code. Every code block does one thing.
Before we start: What IS Machine Learning?
Machine Learning (ML) is teaching a computer to find patterns in data so it can make predictions on new data.
The 3 types (simplified):
1. SUPERVISED LEARNING (this project!)
You GIVE the model examples with known answers.
"Here are 3,000 buildings with their actual energy use.
Learn the pattern, then predict energy for NEW buildings."
→ Regression: predict a NUMBER (energy = 347,000 kBtu)
→ Classification: predict a CATEGORY (employee will leave: yes/no)
2. UNSUPERVISED LEARNING
You DON'T give answers. The model finds groups on its own.
"Here are 3,000 buildings. Group similar ones together."
3. REINFORCEMENT LEARNING
The model learns by trial and error (like training a dog).
The workflow we'll follow:
Raw data → Clean → Explore → Engineer features → Split → Train models → Evaluate → Interpret
📦 🧹 🔍 🔧 ✂️ 🏋️ 📊 💡
Each step has a reason. Skipping one usually means bad results.
Part 0 — Setting up your tools
What is a pyproject.toml?
Think of it as a recipe card for your project. It lists every Python package (ingredient) needed. Anyone wanting to reproduce your work reads this file and installs everything with one command.
[project]
name = "energy-prediction"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "pandas>=2.0",
    "numpy>=1.24",
    "matplotlib>=3.7",
    "seaborn>=0.12",
    "scikit-learn>=1.3",
]
Create it with `uv init` in your terminal. Now let's import our tools — one at a time, with explanations.
# pandas: work with tables (like Excel, but in Python)
import pandas as pd
# Quick check: can we create a DataFrame?
print(type(pd.DataFrame()))
# numpy: fast math on arrays of numbers
import numpy as np
# Quick check
print('2 + 3 =', np.add(2, 3))
# matplotlib: the base plotting library
import matplotlib.pyplot as plt
# seaborn: prettier charts built on top of matplotlib
import seaborn as sns
print('Visualization libraries ready!')
# Suppress non-critical warnings (keeps output clean)
import warnings
warnings.filterwarnings('ignore')
print('Warnings suppressed')
Scikit-learn imports (one group at a time)
Scikit-learn (sklearn) is THE machine learning library. We import many tools — each will be explained when we use it.
# Split data into training and test sets
from sklearn.model_selection import train_test_split
print('train_test_split: splits data 80/20 (or any ratio)')
# Cross-validation: test on multiple splits for reliable scores
from sklearn.model_selection import cross_validate
# KFold: defines HOW to split for cross-validation
from sklearn.model_selection import KFold
# GridSearchCV: try many hyperparameter combinations
from sklearn.model_selection import GridSearchCV
print('Cross-validation and GridSearch loaded')
# StandardScaler: normalize features to mean=0, std=1
# (needed for linear models, not for tree-based models)
from sklearn.preprocessing import StandardScaler
# OneHotEncoder: convert categories to binary columns
from sklearn.preprocessing import OneHotEncoder
# Pipeline: chain preprocessing + model in one object
from sklearn.pipeline import Pipeline
print('Preprocessing tools loaded')
The 3 model categories
The project requires at least 3 different categories of models:
| Category | How it works | Example | Needs scaling? |
|---|---|---|---|
| Linear | Draws the best straight line | Ridge, Lasso | Yes |
| Bagging (trees) | Many trees vote independently | Random Forest | No |
| Boosting (trees) | Trees learn from each other's mistakes | Gradient Boosting | No |
BAGGING (Random Forest):
Tree 1 ──┐
Tree 2 ──┤──▶ AVERAGE all predictions ──▶ Final answer
Tree 3 ──┤
Tree 4 ──┘
(trees work INDEPENDENTLY, in parallel)
BOOSTING (Gradient Boosting):
Tree 1 ─▶ errors ─▶ Tree 2 ─▶ errors ─▶ Tree 3 ─▶ ... ─▶ Final answer
(each tree CORRECTS the mistakes of the previous one)
# Linear model
from sklearn.linear_model import Ridge
# Bagging (tree-based, parallel)
from sklearn.ensemble import RandomForestRegressor
# Boosting (tree-based, sequential)
from sklearn.ensemble import GradientBoostingRegressor
print('3 model categories loaded: Ridge, RandomForest, GradientBoosting')
Regression metrics
Suppose we predict energy for 3 buildings:
Actual: [100,000 200,000 300,000]
Predicted: [110,000 180,000 350,000]
Errors: [+10,000 -20,000 +50,000]
MAE = average of |errors|
= (10,000 + 20,000 + 50,000) / 3 = 26,667 kBtu
→ "On average, we're off by ~27,000 kBtu"
RMSE = sqrt(average of errors²)
= sqrt((10k² + 20k² + 50k²) / 3) = 31,623 kBtu
→ "Like MAE but penalizes BIG errors more"
R² = 1 - (sum of errors²) / (sum of (actual - mean)²)
→ "How much variance does the model explain?"
→ R²=1.0 means perfect, R²=0 means useless
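The arithmetic above can be checked directly with scikit-learn. A quick sketch, using the same three hypothetical buildings:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual = np.array([100_000, 200_000, 300_000])
predicted = np.array([110_000, 180_000, 350_000])

# MAE: average absolute error
mae = mean_absolute_error(actual, predicted)
# RMSE: square root of the mean squared error
rmse = np.sqrt(mean_squared_error(actual, predicted))
# R²: fraction of variance explained
r2 = r2_score(actual, predicted)

print(f'MAE:  {mae:,.0f} kBtu')   # 26,667
print(f'RMSE: {rmse:,.0f} kBtu')  # 31,623
print(f'R²:   {r2:.2f}')          # 0.85
```

Note that RMSE ≥ MAE always; the bigger the gap, the more your errors are dominated by a few large misses.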
# Regression metrics
from sklearn.metrics import (
mean_absolute_error, # MAE
mean_squared_error, # Used to compute RMSE
r2_score, # R²
make_scorer, # Convert metric to scorer
)
# Feature importance
from sklearn.inspection import permutation_importance
print('Metrics and feature importance loaded')
# Global settings for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
# Prettier charts
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
pd.set_option('display.max_columns', None)
print(f'All set! Random state: {RANDOM_STATE}')
Part 1 — Loading the dataset
In a real project, you'd load the CSV: pd.read_csv('seattle_buildings_2016.csv'). Since we can't share the real data, we generate synthetic data that mimics it. We build each column separately so you see exactly what goes in.
Step 1.1 — Building IDs
Every row needs a unique identifier.
n_buildings = 3800
building_ids = [f'BLD{str(i).zfill(5)}' for i in range(1, n_buildings + 1)]
# Verify
print(f'Total: {len(building_ids)}')
print(f'First 3: {building_ids[:3]}')
print(f'Last 3: {building_ids[-3:]}')
Step 1.2 — Building type
All non-residential: offices, retail, hotels, etc. In the real project, residential buildings must be excluded — that's a hard requirement.
building_types = [
'Office', 'Retail', 'Hotel', 'Warehouse', 'School',
'Hospital', 'Restaurant', 'Supermarket', 'Parking', 'Other'
]
primary_use = np.random.choice(building_types, n_buildings,
p=[0.25, 0.15, 0.08, 0.12, 0.10, 0.05, 0.08, 0.05, 0.07, 0.05])
# Let's see the distribution
print('Building type distribution:')
for bt in building_types:
    n = (primary_use == bt).sum()
    pct = n / n_buildings * 100
    bar = '█' * int(pct)
    print(f' {bt:15s} {n:4d} ({pct:5.1f}%) {bar}')
Step 1.3 — Year built
year_built = np.random.normal(1970, 25, n_buildings).clip(1900, 2020).astype(int)
print(f'Oldest: {year_built.min()}')
print(f'Newest: {year_built.max()}')
print(f'Median: {int(np.median(year_built))}')
Step 1.4 — Number of floors
n_floors = np.random.exponential(3, n_buildings).clip(1, 50).astype(int)
print(f'Range: {n_floors.min()} to {n_floors.max()} floors')
print(f'Most common: {np.bincount(n_floors).argmax()} floor(s)')
Step 1.5 — Gross floor area
Different building types have very different typical sizes: a hospital is much bigger than a restaurant. We encode this with a base area per type.
base_area = {
'Office': 50000, 'Retail': 20000, 'Hotel': 80000,
'Warehouse': 40000, 'School': 60000, 'Hospital': 120000,
'Restaurant': 5000, 'Supermarket': 30000, 'Parking': 70000,
'Other': 25000
}
# Show the base areas
print('Base floor area by type:')
for bt, area in sorted(base_area.items(), key=lambda x: -x[1]):
    print(f' {bt:15s} {area:>8,} sq ft')
# Actual area = base * random variation * floor bonus
gross_floor_area = np.array([
base_area[pu] * np.random.lognormal(0, 0.5) * (nf ** 0.3)
for pu, nf in zip(primary_use, n_floors)
]).clip(1000, 2000000).round(0)
print(f'Floor area range: {gross_floor_area.min():,.0f} to {gross_floor_area.max():,.0f} sq ft')
print(f'Median: {np.median(gross_floor_area):,.0f} sq ft')
Step 1.6 — Other columns
# 10 neighborhoods
neighborhoods = [f'Zone_{chr(65+i)}' for i in range(10)]
neighborhood = np.random.choice(neighborhoods, n_buildings)
print(f'{len(np.unique(neighborhood))} neighborhoods')
# Number of property uses (1=single, 2+=mixed use)
n_property_uses = np.random.choice([1, 2, 3, 4], n_buildings,
p=[0.60, 0.25, 0.10, 0.05])
vals, cnts = np.unique(n_property_uses, return_counts=True)
for v, c in zip(vals, cnts):
    print(f' {v} use(s): {c} buildings')
Step 1.7 — Compliance and Outlier flags
The real dataset has two critical columns:
- ComplianceStatus: did the building pass inspection? Non-compliant → remove
- Outlier: was this data point flagged as suspicious? Yes → remove
Using these columns to filter is required by the project.
compliance = np.random.choice(
['Compliant', 'Non-Compliant'], n_buildings, p=[0.85, 0.15])
outlier_flag = np.random.choice(
['No', 'Yes'], n_buildings, p=[0.92, 0.08])
print(f'Compliant: {(compliance == "Compliant").sum()}')
print(f'Non-Compliant: {(compliance == "Non-Compliant").sum()}')
print(f'Outlier=Yes: {(outlier_flag == "Yes").sum()}')
Step 1.8 — Energy columns (leakage risks!)
The real dataset has individual energy sources. We create them here but will remove them later — using them to predict total energy is cheating (data leakage, explained in Part 7).
# EnergyStarScore (1-100): efficiency rating
energy_star = np.random.normal(60, 20, n_buildings).clip(1, 100).round(0)
print(f'EnergyStarScore: {energy_star.min():.0f} to {energy_star.max():.0f}')
# Individual energy sources (these ARE total energy's components!)
electricity_kbtu = (gross_floor_area * np.random.uniform(10, 40, n_buildings)).round(0)
natural_gas_kbtu = (gross_floor_area * np.random.uniform(2, 20, n_buildings)).round(0)
steam_kbtu = np.where(
np.random.random(n_buildings) < 0.3,
gross_floor_area * np.random.uniform(1, 10, n_buildings), 0).round(0)
print(f'Electricity: mean={electricity_kbtu.mean():,.0f} kBtu')
print(f'Natural gas: mean={natural_gas_kbtu.mean():,.0f} kBtu')
print(f'Steam (30% of buildings): mean of nonzero={steam_kbtu[steam_kbtu > 0].mean():,.0f} kBtu')
Step 1.9 — The TARGET: total site energy use
# Total = sum of all sources + noise
site_energy = (electricity_kbtu + natural_gas_kbtu + steam_kbtu
+ np.random.normal(0, 5000, n_buildings)).clip(10000).round(0)
# Make flagged outliers extreme (realistic)
outlier_mask = outlier_flag == 'Yes'
site_energy[outlier_mask] *= np.random.uniform(2, 5, outlier_mask.sum())
print(f'SiteEnergyUse_kBtu:')
print(f' Min: {site_energy.min():>12,.0f}')
print(f' Median: {np.median(site_energy):>12,.0f}')
print(f' Mean: {site_energy.mean():>12,.0f}')
print(f' Max: {site_energy.max():>12,.0f}')
# EUI and CO2 (also leakage)
site_eui = (site_energy / gross_floor_area).round(2)
co2_emissions = (site_energy * np.random.uniform(0.00005, 0.00015, n_buildings)).round(2)
print(f'SiteEUI range: {site_eui.min():.2f} to {site_eui.max():.2f} kBtu/sf')
print(f'CO2 range: {co2_emissions.min():.2f} to {co2_emissions.max():.2f} metric tons')
Step 1.10 — Inject missing values (like real data)
energy_star_dirty = energy_star.copy().astype(float)
energy_star_dirty[np.random.random(n_buildings) < 0.05] = np.nan
print(f'Missing EnergyStarScore: {np.isnan(energy_star_dirty).sum()}')
year_built_dirty = year_built.copy().astype(float)
year_built_dirty[np.random.random(n_buildings) < 0.02] = np.nan
print(f'Missing YearBuilt: {np.isnan(year_built_dirty).sum()}')
Step 1.11 — Assemble the full DataFrame
df = pd.DataFrame({
'BuildingID': building_ids,
'PrimaryPropertyType': primary_use,
'YearBuilt': year_built_dirty,
'NumberOfFloors': n_floors,
'GrossFloorArea_sf': gross_floor_area,
'Neighborhood': neighborhood,
'NumberOfPropertyUses': n_property_uses,
'ComplianceStatus': compliance,
'Outlier': outlier_flag,
'EnergyStarScore': energy_star_dirty,
'Electricity_kBtu': electricity_kbtu,
'NaturalGas_kBtu': natural_gas_kbtu,
'Steam_kBtu': steam_kbtu,
'SiteEnergyUse_kBtu': site_energy,
'SiteEUI_kBtu_sf': site_eui,
'CO2Emissions_tons': co2_emissions,
})
print(f'Dataset: {df.shape[0]} rows x {df.shape[1]} columns')
# List all columns with their types
print('All columns:')
for i, (col, dtype) in enumerate(df.dtypes.items()):
    print(f' {i+1:2d}. {col:30s} {str(dtype):10s}')
df.head(3)
What is a DataFrame?
Think of a DataFrame as a spreadsheet table inside Python. It has:
- Rows (one per building, employee, or observation)
- Columns (one per variable: age, floor area, energy...)
- Index (row number, starting from 0)
FloorArea YearBuilt Energy
BLD001 50000 1985 320000
BLD002 12000 2001 95000
BLD003 85000 1962 580000
↑ index ↑ col 1 ↑ col 2 ↑ col 3
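The same toy table can be built in a couple of lines (values taken from the diagram above, purely for illustration):

```python
import pandas as pd

# Three toy buildings, indexed by building ID
demo = pd.DataFrame(
    {
        'FloorArea': [50_000, 12_000, 85_000],
        'YearBuilt': [1985, 2001, 1962],
        'Energy': [320_000, 95_000, 580_000],
    },
    index=['BLD001', 'BLD002', 'BLD003'],
)
print(demo)
print(demo.shape)                    # (3, 3): 3 rows, 3 columns
print(demo.loc['BLD002', 'Energy'])  # 95000: one cell, looked up by label
```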
What do .shape, .info(), .describe() tell us?
| Command | What it shows | Why you need it |
|---|---|---|
| `df.shape` | (rows, columns) | Know the size |
| `df.info()` | Column names, types, non-null count | Spot wrong types and missing data |
| `df.describe()` | Mean, std, min, max, quartiles | Spot outliers and weird distributions |
| `df.isnull().sum()` | Number of missing values per column | Know what needs fixing |
Part 2 — Exploring the raw data
The 3 essential checks for ANY dataset
Every time you load data, ask these 3 questions:
- How big is it? → `df.shape`
- What types are the columns? → `df.info()`
- How much data is missing? → `df.isnull().sum()`
# Check 1: Size
print(f'Rows: {df.shape[0]}')
print(f'Columns: {df.shape[1]}')
# Check 2: Types and non-null counts
df.info()
How to read .describe() output
FloorArea YearBuilt Energy
count 3800.0 3800.0 3800.0 ← how many non-null values
mean 52341.0 1970.0 420000.0 ← average
std 45200.0 25.0 380000.0 ← spread (bigger = more varied)
min 1000.0 1900.0 10000.0 ← smallest value
25% 18000.0 1952.0 150000.0 ← 25% of buildings are below this
50% 38000.0 1970.0 310000.0 ← MEDIAN (middle value)
75% 72000.0 1990.0 560000.0 ← 75% are below this
max 2000000.0 2020.0 5000000.0 ← largest value
What to look for:
- Big difference between mean and 50% (median)? → Data is skewed (has extreme values)
- min very different from 25%? → Possible outliers at the bottom
- max very different from 75%? → Possible outliers at the top
- count less than total rows? → Missing values in that column
# Check 3: Missing values
missing = df.isnull().sum()
print('Columns with missing values:')
print(missing[missing > 0])
print(f'\nTotal missing cells: {missing.sum()} out of {df.shape[0]*df.shape[1]}')
# Descriptive stats for numeric columns
df.describe().round(2)
What is a "target variable"?
In supervised learning, we always have:
- Features (X): the information we GIVE to the model (floor area, year built, ...)
- Target (y): what we want the model to PREDICT (energy consumption)
Features (X) Target (y)
┌───────────┬───────────┬─────────┐ ┌────────────────┐
│ FloorArea │ YearBuilt │ Floors │ │ EnergyUse_kBtu │
├───────────┼───────────┼─────────┤ ├────────────────┤
│ 50,000 │ 1985 │ 3 │ → │ 320,000 │
│ 12,000 │ 2001 │ 1 │ → │ 95,000 │
│ 85,000 │ 1962 │ 7 │ → │ 580,000 │
└───────────┴───────────┴─────────┘ └────────────────┘
The model learns the relationship: X → y
Then predicts y for NEW buildings where we don't know y.
Explore the target variable FIRST
Always look at what you're trying to predict before anything else.
target_col = 'SiteEnergyUse_kBtu'
print(f'Target: {target_col}')
print(f' Count: {df[target_col].count()}')
print(f' Mean: {df[target_col].mean():>12,.0f} kBtu')
print(f' Median: {df[target_col].median():>12,.0f} kBtu')
print(f' Std: {df[target_col].std():>12,.0f} kBtu')
print(f' Min: {df[target_col].min():>12,.0f} kBtu')
print(f' Max: {df[target_col].max():>12,.0f} kBtu')
print(f' Skew: {df[target_col].skew():>12.2f}')
Why is the distribution "skewed" and what does log-transform do?
Most buildings use moderate energy, but a few use ENORMOUS amounts. This creates a "long tail" on the right:
Raw distribution: Log-transformed:
│█ │ █
│█ │ ███
│██ │ █████
│████ │ ███████
│████████ │████████
│██████████████░░░░░░── │████████
└────────────────────── Energy └──────────── log(Energy)
Most buildings Long tail More symmetric!
are here (outliers)
The log function compresses big numbers more than small numbers, making the distribution more symmetric. Some models (especially linear ones) work better with roughly symmetric data.
Note: we use np.log1p(x) instead of np.log(x) because log1p(0) = 0, whereas log(0) is undefined (NumPy returns -inf with a warning, which would poison later calculations).
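A tiny check of the compression effect (the values are arbitrary, chosen so each one is roughly 10x the previous):

```python
import numpy as np

vals = np.array([0, 9, 99, 999, 9_999])
print(np.log1p(vals))  # ≈ [0, 2.30, 4.61, 6.91, 9.21]: each 10x jump adds only ~2.3
# log1p(0) = 0, whereas np.log(0) returns -inf with a warning
```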
# Visualize: raw distribution vs log-transformed
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].hist(df[target_col], bins=50, color='steelblue', edgecolor='white')
axes[0].set_xlabel('Site Energy Use (kBtu)')
axes[0].set_ylabel('Number of buildings')
axes[0].set_title('Raw Distribution (very right-skewed)')
axes[1].hist(np.log1p(df[target_col]), bins=50, color='coral', edgecolor='white')
axes[1].set_xlabel('log(Site Energy Use)')
axes[1].set_title('Log-transformed (more symmetric)')
plt.tight_layout()
plt.show()
What does "filtering" mean in pandas?
Filtering means keeping only the rows that match a condition. It's like applying a filter in Excel.
# English: "Keep only rows where ComplianceStatus equals 'Compliant'"
# Python:
df = df[df['ComplianceStatus'] == 'Compliant']
# How does this work, step by step?
# 1. df['ComplianceStatus'] == 'Compliant'
# → creates a column of True/False for every row
# → [True, True, False, True, False, ...]
#
# 2. df[True/False column]
# → keeps only rows where the value is True
Important: always keep track of how many rows you have before and after each filter. If you lose too many, you might not have enough data to train a model.
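The two-step mechanics are easiest to see on a toy table (a hypothetical 3-row DataFrame, not the project data):

```python
import pandas as pd

toy = pd.DataFrame({
    'ComplianceStatus': ['Compliant', 'Non-Compliant', 'Compliant'],
    'Energy': [100, 200, 300],
})

# Step 1: the comparison produces a True/False Series, one value per row
mask = toy['ComplianceStatus'] == 'Compliant'
print(mask.tolist())  # [True, False, True]

# Step 2: indexing with the mask keeps only the True rows
filtered = toy[mask]
print(f'{len(toy)} rows before, {len(filtered)} after')  # 3 rows before, 2 after
```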
Part 3 — Filtering the data
Why filter?
The dataset contains buildings that shouldn't be included:
- Non-compliant buildings: their data is unreliable
- Flagged outliers: suspicious measurements
The project requires using ComplianceStatus and Outlier columns to filter.
Track how many rows we keep at each step
print(f'Starting count: {len(df)} buildings')
# What's the compliance distribution?
print(df['ComplianceStatus'].value_counts())
# Remove non-compliant
df = df[df['ComplianceStatus'] == 'Compliant'].copy()
print(f'After removing non-compliant: {len(df)} buildings')
# What's the outlier distribution?
print(df['Outlier'].value_counts())
# Remove flagged outliers
df = df[df['Outlier'] == 'No'].copy()
print(f'After removing outliers: {len(df)} buildings')
# These columns served their purpose — drop them
df.drop(columns=['ComplianceStatus', 'Outlier'], inplace=True)
print(f'Columns remaining: {df.shape[1]}')
Part 4 — Handling missing values
Why median, not mean?
The mean is pulled by extreme values. The median is the middle value and ignores outliers.
Values: [2, 3, 3, 4, 100]
Mean: 22.4 ← pulled by 100
Median: 3 ← the actual middle value
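The same five numbers in code:

```python
import numpy as np

values = np.array([2, 3, 3, 4, 100])
print(np.mean(values))    # 22.4 (pulled up by the single 100)
print(np.median(values))  # 3.0 (the middle value, unaffected)
```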
# What's still missing?
print('Missing values after filtering:')
missing = df.isnull().sum()
print(missing[missing > 0])
# Fill YearBuilt with its median
median_year = df['YearBuilt'].median()
before_null = df['YearBuilt'].isnull().sum()
df['YearBuilt'] = df['YearBuilt'].fillna(median_year)  # assignment instead of inplace (reliable in pandas 2.x)
print(f'YearBuilt: filled {before_null} missing values with median={median_year:.0f}')
# Fill EnergyStarScore with its median
median_star = df['EnergyStarScore'].median()
before_null = df['EnergyStarScore'].isnull().sum()
df['EnergyStarScore'] = df['EnergyStarScore'].fillna(median_star)
print(f'EnergyStarScore: filled {before_null} missing values with median={median_star:.0f}')
# Verify: no more nulls
print(f'Total remaining nulls: {df.isnull().sum().sum()}')
Remove extreme statistical outliers
Even after removing flagged outliers, some buildings may have unrealistic energy values. We use percentiles to trim the extremes.
Q01 = df['SiteEnergyUse_kBtu'].quantile(0.01)
Q99 = df['SiteEnergyUse_kBtu'].quantile(0.99)
print(f'1st percentile: {Q01:>12,.0f} kBtu')
print(f'99th percentile: {Q99:>12,.0f} kBtu')
before = len(df)
df = df[(df['SiteEnergyUse_kBtu'] >= Q01) & (df['SiteEnergyUse_kBtu'] <= Q99)].copy()
print(f'Removed {before - len(df)} extreme buildings')
print(f'Remaining: {len(df)} buildings')
Part 5 — Visualizations
Which chart for which comparison?
| What you want to see | Chart type | Example |
|---|---|---|
| Two numbers together | Scatterplot | Floor area vs energy |
| Number across categories | Boxplot | Energy by building type |
| Distribution of one thing | Histogram | Year built |
| Target after filtering | Histogram | Energy distribution |
All charts must relate to the target variable.
How to read a scatterplot
Each dot is one building. The position tells you two things at once:
Energy (y)
│ ●
│ ● ●
│ ●●● "Pattern: dots go UP from left to right"
│ ●●● → POSITIVE correlation
│ ●● → Bigger floor area = more energy
│●●
└──────────────── FloorArea (x)
What to look for:
- Dots going up-right? → Positive relationship (both increase together)
- Dots going down-right? → Negative relationship (one up, other down)
- Random cloud? → No relationship
- Dots far from the main group? → Outliers
# SCATTERPLOT: Floor area vs Energy (the most important relationship)
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(df['GrossFloorArea_sf'], df['SiteEnergyUse_kBtu'],
alpha=0.3, s=15, color='steelblue')
ax.set_xlabel('Gross Floor Area (sq ft)')
ax.set_ylabel('Site Energy Use (kBtu)')
ax.set_title('Bigger buildings use more energy (obvious but important)')
plt.tight_layout()
plt.show()
How to read a boxplot
A boxplot shows the distribution of a numeric variable at a glance:
┌──────┐
─────┤ ├───── ← Whiskers (range without outliers)
├──────┤
│ │ ← Box = middle 50% of data (Q1 to Q3)
│──────│ ← Line inside box = MEDIAN (middle value)
│ │
├──────┤
─────┤ ├─────
└──────┘
● ← Dots = outliers (values far from the rest)
What to look for: if one building type has a much higher median than another, that type tends to use more energy.
# BOXPLOT: Energy by building type
fig, ax = plt.subplots(figsize=(14, 5))
order = df.groupby('PrimaryPropertyType')['SiteEnergyUse_kBtu'].median().sort_values(ascending=False).index
sns.boxplot(data=df, x='PrimaryPropertyType', y='SiteEnergyUse_kBtu',
order=order, ax=ax, palette='husl')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
ax.set_ylabel('Site Energy Use (kBtu)')
ax.set_title('Energy Use by Building Type (sorted by median)')
plt.tight_layout()
plt.show()
# HISTOGRAM: Year built
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(df['YearBuilt'], bins=40, color='coral', edgecolor='white')
ax.set_xlabel('Year Built')
ax.set_ylabel('Count')
ax.set_title('When were these buildings constructed?')
plt.tight_layout()
plt.show()
# SCATTERPLOT: Number of floors vs Energy
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(df['NumberOfFloors'], df['SiteEnergyUse_kBtu'],
alpha=0.3, s=15, color='coral')
ax.set_xlabel('Number of Floors')
ax.set_ylabel('Site Energy Use (kBtu)')
ax.set_title('More floors → more energy (generally)')
plt.tight_layout()
plt.show()
# SCATTERPLOT: EnergyStarScore vs Energy (before we drop it)
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(df['EnergyStarScore'], df['SiteEnergyUse_kBtu'],
alpha=0.3, s=15, color='green')
ax.set_xlabel('EnergyStarScore')
ax.set_ylabel('Site Energy Use (kBtu)')
ax.set_title('EnergyStarScore vs Energy — but this column is LEAKAGE!')
plt.tight_layout()
plt.show()
Why does feature engineering help?
Think of it this way. You're trying to predict someone's running speed. The model has:
- `leg_length` = 90 cm
- `body_height` = 180 cm

These raw numbers are useful, but the ratio `leg_length / body_height` = 0.50 is directly informative — it tells you about body proportions in one number instead of two.
Same idea here: instead of giving the model GrossFloorArea = 100,000 and NumberOfFloors = 10 separately, we create AreaPerFloor = 10,000 which directly tells the model "each floor is 10,000 sq ft". The model doesn't have to figure out this relationship on its own.
Rules for feature engineering:
- New features should make physical sense (age of building, area per floor, etc.)
- NEVER use the target to create a feature — that's data leakage
- At least 3 new features are required for this project
- Verify each new feature looks reasonable (check min, max, mean)
Part 6 — Feature Engineering (at least 3 new features required)
What is feature engineering?
Creating new columns from existing ones to help the model find patterns.
Analogy: predicting how fast a car goes. You have engine_horsepower and car_weight. Creating power_to_weight_ratio = horsepower / weight gives the model a shortcut.
Golden rule: NEVER use the target to create features (that's data leakage).
Feature 1: Building age
# Older buildings often have worse insulation → more energy
df['BuildingAge'] = 2016 - df['YearBuilt']
print(f'BuildingAge:')
print(f' Min: {df["BuildingAge"].min():.0f} years (newest)')
print(f' Max: {df["BuildingAge"].max():.0f} years (oldest)')
print(f' Mean: {df["BuildingAge"].mean():.0f} years')
Feature 2: Area per floor
# A 10-floor building with 100k sqft has different characteristics
# than a 2-floor building with 100k sqft
df['AreaPerFloor'] = df['GrossFloorArea_sf'] / df['NumberOfFloors']
print(f'AreaPerFloor:')
print(f' Min: {df["AreaPerFloor"].min():>10,.0f} sq ft')
print(f' Max: {df["AreaPerFloor"].max():>10,.0f} sq ft')
print(f' Mean: {df["AreaPerFloor"].mean():>10,.0f} sq ft')
Feature 3: Has multiple uses (binary flag)
# Mixed-use buildings (office + retail) may have different energy profiles
df['HasMultipleUses'] = (df['NumberOfPropertyUses'] > 1).astype(int)
print(f'Single use: {(df["HasMultipleUses"] == 0).sum()} buildings')
print(f'Multiple uses: {(df["HasMultipleUses"] == 1).sum()} buildings')
Feature 4: Old building flag
# Pre-1960 buildings often lack modern insulation
df['IsOldBuilding'] = (df['YearBuilt'] < 1960).astype(int)
print(f'Old (pre-1960): {df["IsOldBuilding"].sum()} buildings')
print(f'Newer (1960+): {(df["IsOldBuilding"] == 0).sum()} buildings')
print(f'\nTotal new features created: 4 (minimum 3 required)')
Part 7 — Data Leakage: the most common beginner mistake
What is data leakage?
Your model accidentally gets information it shouldn't have. Like taking an exam with the answer key.
In our dataset:
SiteEnergyUse_kBtu = Electricity + NaturalGas + Steam + other
^^^^^^^^^^^ ^^^^^^^^^^ ^^^^^
COMPONENTS of total energy!
If we use Electricity_kBtu as a FEATURE to predict SiteEnergyUse_kBtu,
the model just learns to add up the components → R² = 0.99+
Looks amazing... but completely USELESS on new buildings where
we DON'T have energy measurements!
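To see the effect numerically, here is a small synthetic sketch (all numbers made up for illustration, not the project data). One model gets only floor area; the other gets the energy components themselves:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
floor_area = rng.uniform(1_000, 100_000, n)
electricity = floor_area * rng.uniform(10, 40, n)   # component of the target
natural_gas = floor_area * rng.uniform(2, 20, n)    # component of the target
total_energy = electricity + natural_gas + rng.normal(0, 5_000, n)

# Honest model: only structural data (floor area)
X_honest = floor_area.reshape(-1, 1)
Xtr, Xte, ytr, yte = train_test_split(X_honest, total_energy, random_state=42)
r2_honest = LinearRegression().fit(Xtr, ytr).score(Xte, yte)

# Leaky model: the components of the target as features
X_leaky = np.column_stack([electricity, natural_gas])
Xtr, Xte, ytr, yte = train_test_split(X_leaky, total_energy, random_state=42)
r2_leaky = LinearRegression().fit(Xtr, ytr).score(Xte, yte)

print(f'Honest R² (floor area only):   {r2_honest:.3f}')
print(f'Leaky  R² (energy components): {r2_leaky:.3f}')  # near-perfect, and useless
```

The leaky model scores almost perfectly because it only has to learn "add the components", which is exactly the information we won't have for new buildings.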
Columns that cause leakage:
| Column | Why it leaks |
|---|---|
| `Electricity_kBtu` | Component of total energy |
| `NaturalGas_kBtu` | Component of total energy |
| `Steam_kBtu` | Component of total energy |
| `CO2Emissions_tons` | Computed FROM total energy |
| `EnergyStarScore` | Derived FROM energy consumption |
| `SiteEUI_kBtu_sf` | = SiteEnergy / FloorArea (contains the target!) |
Also drop BuildingID — it's just a label, not a feature.
leakage_cols = [
'Electricity_kBtu', 'NaturalGas_kBtu', 'Steam_kBtu',
'CO2Emissions_tons', 'EnergyStarScore', 'SiteEUI_kBtu_sf',
]
drop_cols = leakage_cols + ['BuildingID']
print(f'Dropping {len(drop_cols)} columns:')
for col in drop_cols:
    print(f' ✗ {col}')
Data leakage — another example to make sure you get it
Imagine predicting a student's final exam grade. You have:
✓ Hours_Studied = 45 ← GOOD feature (known before the exam)
✓ Attendance_Rate = 0.90 ← GOOD feature (known before the exam)
✗ Exam_Score = 85 ← THIS IS WHAT WE'RE PREDICTING! Can't use it!
✗ Grade_Letter = "B+" ← DERIVED from the exam score! Leakage!
✗ Class_Rank = 12 ← COMPUTED after all grades are known! Leakage!
Same logic in our project:
✓ GrossFloorArea ← GOOD: structural data we have before measurement
✓ YearBuilt ← GOOD: structural data
✗ Electricity_kBtu ← BAD: component of total energy (the thing we predict!)
✗ SiteEUI_kBtu_sf ← BAD: = SiteEnergy / FloorArea (contains the answer!)
Rule: if a column COULD NOT EXIST without measuring the target, drop it.
df.drop(columns=drop_cols, inplace=True)
print(f'\nColumns remaining: {df.shape[1]}')
for i, col in enumerate(df.columns):
    marker = '← TARGET' if col == 'SiteEnergyUse_kBtu' else ''
    print(f' {i+1:2d}. {col:30s} {marker}')
What is correlation?
Correlation measures how two variables move together. It ranges from -1 to +1:
r = +1.0 → Perfect positive: when A goes up, B always goes up
r = +0.7 → Strong positive: A up → B mostly up
r = 0.0 → No relationship
r = -0.7 → Strong negative: A up → B mostly down
r = -1.0 → Perfect negative
Example:
FloorArea and Energy: r = +0.85 (bigger buildings use more energy)
EnergyStarScore and Energy: r = -0.40 (efficient buildings use less)
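A quick sketch with made-up series shows how r behaves in practice (series `a`, `b`, `c` are synthetic, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=2_000)
b = 2 * a + rng.normal(size=2_000)   # built to move with a
c = rng.normal(size=2_000)           # independent of a

r_ab = np.corrcoef(a, b)[0, 1]
r_ac = np.corrcoef(a, c)[0, 1]
print(f'r(a, b) = {r_ab:+.2f}')  # strong positive (~+0.89 by construction)
print(f'r(a, c) = {r_ac:+.2f}')  # near zero: no relationship
```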
Why drop highly correlated FEATURES?
If GrossFloorArea and AreaPerFloor have r = 0.95, they tell the model almost the same thing. Keeping both:
- Wastes space
- Can confuse linear models (they fight over which one to use)
Rule of thumb: if |r| > 0.70 between two features (NOT the target), drop one.
Part 8 — Correlation matrix
Why check correlations?
If two features are very similar (|r| > 0.70), they carry the same information. Keeping both:
- Wastes space
- Can make linear models unstable
A correlation of 1.0 = perfect positive relationship. -1.0 = perfect negative. 0 = no relationship.
num_df = df.select_dtypes(include=[np.number])
corr = num_df.corr()
print(f'Correlation matrix: {corr.shape[0]} x {corr.shape[1]} features')
# Visualize as heatmap
plt.figure(figsize=(12, 9))
mask = np.triu(np.ones_like(corr, dtype=bool)) # Hide upper triangle (redundant)
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f',
cmap='coolwarm', center=0, square=True,
linewidths=0.3, annot_kws={'size': 8})
plt.title('Correlation Matrix — look for |r| > 0.70')
plt.tight_layout()
plt.show()
# Automatically find highly correlated pairs
target_name = 'SiteEnergyUse_kBtu'
print('Feature pairs with |r| > 0.70 (excluding target):')
found = False
for i in range(len(corr)):
    for j in range(i + 1, len(corr)):
        fi = corr.index[i]
        fj = corr.columns[j]
        r = corr.iloc[i, j]
        if abs(r) > 0.70 and fi != target_name and fj != target_name:
            print(f' {fi} <-> {fj}: r = {r:.3f}')
            found = True
if not found:
    print(' None found — no redundant features to drop!')
Why can't ML use text directly?
Imagine you give the model this:
PrimaryPropertyType = "Office"
The model does math: multiplications, additions, comparisons. You can multiply numbers, but you can't multiply "Office" by 0.5. So we need to convert text to numbers.
OneHot encoding — the most common method
Instead of one column with 10 text values, we create one binary column per category, then drop 1 to avoid redundancy (so 10 categories become 9 columns):
BEFORE:              AFTER (with drop='first'):
Type                 Type_Hotel  Type_Office  Type_Retail  ...
────────             ──────────  ───────────  ───────────
Hospital                  0           0            0       ← all zeros = Hospital
Office                    0           1            0
Retail                    0           0            1
Hotel                     1           0            0
We drop the first category because if all the others are 0, the model already knows the value.
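A quick way to see this mechanic is pandas' get_dummies (the sklearn OneHotEncoder used below does the same thing and integrates with pipelines; the category names here are just the ones from the diagram):

```python
import pandas as pd

types = pd.Series(['Hospital', 'Office', 'Retail', 'Hotel', 'Hospital'], name='Type')
dummies = pd.get_dummies(types, prefix='Type', drop_first=True).astype(int)

print(dummies.columns.tolist())  # 4 categories -> 3 columns (Hospital, first alphabetically, is dropped)
print(dummies)                   # the Hospital rows are all zeros
```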
Part 9 — Encode categorical features
Why?
ML algorithms only understand numbers. Text columns like 'Office' or 'Zone_A' must be converted.
Which encoding method?
| Method | When to use | Example |
|---|---|---|
| OneHotEncoder | No natural order | Department, Neighborhood |
| Ordinal mapping | Natural order | Junior < Mid < Senior |
Building type and neighborhood have no natural order, so we use OneHotEncoder.
# Which columns are text?
cat_cols = df.select_dtypes(include=['object']).columns.tolist()
print(f'Categorical columns: {cat_cols}')
for col in cat_cols:
print(f' {col}: {df[col].nunique()} unique values')
# Create encoder
ohe = OneHotEncoder(
sparse_output=False, # Dense array (easier to work with)
drop='first', # Drop 1st category to avoid multicollinearity
handle_unknown='infrequent_if_exist'
)
print('OneHotEncoder created with drop="first"')
print('(drop="first" means if we have [A, B, C], we create cols for B and C only)')
print('(because if B=0 and C=0, the model knows it must be A)')
Concrete example: before and after OneHotEncoding
BEFORE (3 buildings):
┌──────────────────┐
│ PrimaryProperty │
├──────────────────┤
│ Office │
│ Hotel │
│ Office │
└──────────────────┘
AFTER (with drop='first'; categories come from the full column, sorted alphabetically, so Hospital is the one dropped — these 3 rows are just a sample):
┌────────────┬───────────┬────────────┬───────────┐
│ Type_Hotel │ Type_Off │ Type_Ret │ Type_Ware │ ...
├────────────┼───────────┼────────────┼───────────┤
│ 0 │ 1 │ 0 │ 0 │ ← Office
│ 1 │ 0 │ 0 │ 0 │ ← Hotel
│ 0 │ 1 │ 0 │ 0 │ ← Office
└────────────┴───────────┴────────────┴───────────┘
Each building becomes a row of 0s and 1s. The model can now do math with these.
# Fit and transform
encoded = ohe.fit_transform(df[cat_cols])
enc_df = pd.DataFrame(
encoded,
columns=ohe.get_feature_names_out(cat_cols),
index=df.index
)
print(f'Created {enc_df.shape[1]} binary columns:')
for col in enc_df.columns[:8]:
print(f' {col}')
if enc_df.shape[1] > 8:
print(f' ... and {enc_df.shape[1] - 8} more')
# Combine: drop originals, add encoded
df_encoded = pd.concat([df.drop(columns=cat_cols), enc_df], axis=1)
print(f'Final dataset: {df_encoded.shape[0]} rows x {df_encoded.shape[1]} columns')
The train/test split — explained simply
Imagine you're studying for an exam:
- You study from a textbook (= training data)
- You take the exam with new questions (= test data)
If you memorize the textbook word-for-word (= overfitting), you might fail the exam because you haven't learned the patterns, just the specific examples.
The train/test split simulates this:
- Train set (80%): the model studies these buildings
- Test set (20%): new buildings the model has NEVER seen
If it predicts well on the test set, it has learned real patterns — not just memorized.
Part 10 — Separate X and y, then split into train/test
Why split?
Full data (100%)
├── Train (80%): model learns from this
└── Test (20%): we pretend these are "new" buildings
and check if predictions are good
If the model performs well on TEST data it has never seen,
we can trust it on truly new buildings.
target = 'SiteEnergyUse_kBtu'
# X = everything except the target
X = df_encoded.drop(columns=[target])
# y = just the target
y = df_encoded[target]
print(f'X shape: {X.shape} (features)')
print(f'y shape: {y.shape} (target)')
print(f'Target mean: {y.mean():,.0f} kBtu')
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% for test
random_state=RANDOM_STATE # Reproducible split
)
print(f'Train: {len(X_train)} buildings ({len(X_train)/len(X)*100:.0f}%)')
print(f'Test: {len(X_test)} buildings ({len(X_test)/len(X)*100:.0f}%)')
Part 11 — A reusable evaluation function
We'll train 3 models. Instead of copying evaluation code 3 times, we write it once in a function.
Build → Test → Use (the same pattern as .apply())
def evaluate_model(model, X_tr, y_tr, X_te, y_te, name):
"""Train a regression model and print all metrics."""
# 1. Train
model.fit(X_tr, y_tr)
# 2. Predict on both sets
pred_tr = model.predict(X_tr)
pred_te = model.predict(X_te)
# 3. Compute metrics
r2_tr = r2_score(y_tr, pred_tr)
r2_te = r2_score(y_te, pred_te)
mae_te = mean_absolute_error(y_te, pred_te)
rmse_te = np.sqrt(mean_squared_error(y_te, pred_te))
# 4. Print results
print(f'\n{name}:')
print(f' R² train: {r2_tr:.4f} R² test: {r2_te:.4f}')
print(f' MAE: {mae_te:>12,.0f} kBtu')
print(f' RMSE: {rmse_te:>12,.0f} kBtu')
print(f' → On average, the model is off by ~{mae_te:,.0f} kBtu')
# Check for overfitting
if r2_tr - r2_te > 0.10:
print(f' ⚠ Possible overfitting: train R² much higher than test R²')
return {'name': name, 'R2_train': r2_tr, 'R2_test': r2_te,
'MAE': mae_te, 'RMSE': rmse_te, 'model': model}
# Test the function exists
print('evaluate_model() is ready to use!')
print('It takes: model, X_train, y_train, X_test, y_test, name')
What does a regression model actually DO?
In the simplest case (1 feature), it finds the best line through the data:
Energy
│ ●
│ ● /
│ ● / ← best line
│ ● /
│ ● /
│ ●/
│/●
└──────────── FloorArea
The model learns: Energy = slope × FloorArea + intercept
With many features, it becomes: Energy = w₁×FloorArea + w₂×Age + w₃×Floors + ... + bias
Each weight (w) tells the model how important each feature is. Ridge adds a penalty to keep weights small, which prevents overfitting.
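Here is a tiny sketch of that penalty in action on synthetic data (the dataset and the deliberately large alpha are made up to exaggerate the effect):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=50)  # true slope = 3

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=100.0).fit(X, y)  # large penalty to make the shrinkage visible

print(f'OLS weight:   {ols.coef_[0]:.2f}')  # near 3
print(f'Ridge weight: {ridge.coef_[0]:.2f}')  # pulled toward 0 by the penalty
```

With a realistic alpha (like the 1.0 used below), the shrinkage is gentler — just enough to stop weights from exploding when features are correlated.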
What is overfitting?
UNDERFITTING:          GOOD FIT:            OVERFITTING:
Too simple             Just right           Too complex
──────                 ╱╲  ╱╲               ╱╲╱╲╱╲╱╲
(misses pattern)      ╱  ╲╱  ╲              (memorizes noise,
                      (captures trend)       fails on new data)
Part 12 — Model 1: Ridge Regression (linear)
What does "linear" mean?
The model tries to find the best formula like:
Energy = a₁ × FloorArea + a₂ × Age + a₃ × Floors + ... + bias
It's a weighted sum. Ridge adds a small penalty to keep the weights small (prevents overfitting).
Why does it need scaling?
Feature A: FloorArea ranges from 1,000 to 2,000,000
Feature B: NumberOfFloors ranges from 1 to 50
Without scaling, the model thinks FloorArea is 40,000x more important
just because its numbers are bigger! Scaling fixes this.
StandardScaler makes every feature: mean=0, std=1
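A minimal sketch of what StandardScaler actually does (the numbers are invented to mimic the ranges above):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1_000.0, 1.0],
              [500_000.0, 10.0],
              [2_000_000.0, 50.0]])  # col 0: floor area, col 1: number of floors

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6))  # ~[0, 0]
print(X_scaled.std(axis=0).round(6))   # [1, 1] — both features now on equal footing
```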
We use a Pipeline to chain scaler + model. This ensures the scaler is fitted on TRAIN data only (no leakage).
What is a Pipeline?
A Pipeline chains multiple steps into one object. Instead of:
# Without Pipeline (manual, error-prone):
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit + transform on TRAIN
X_test_scaled = scaler.transform(X_test) # transform only on TEST (no fit!)
model = Ridge()
model.fit(X_train_scaled, y_train) # train
predictions = model.predict(X_test_scaled) # predict
With a Pipeline, everything happens in one line:
# With Pipeline (clean, safe):
pipe = Pipeline([('scaler', StandardScaler()), ('model', Ridge())])
pipe.fit(X_train, y_train) # scaler fits on train, then model trains
pipe.predict(X_test) # scaler transforms test, then model predicts
Why is this better?
- No leakage risk: the scaler is fitted on TRAIN only, never on TEST
- Cleaner code: one object to manage instead of two
- Works with cross-validation: GridSearchCV handles the scaler automatically
ridge_pipe = Pipeline([
('scaler', StandardScaler()), # Step 1: normalize
('model', Ridge(alpha=1.0, random_state=RANDOM_STATE)) # Step 2: train
])
print('Pipeline created: StandardScaler → Ridge')
res_ridge = evaluate_model(
ridge_pipe, X_train, y_train, X_test, y_test,
'Ridge Regression (linear)')
How to interpret these results
Let's say the output was:
Ridge Regression (linear):
R² train: 0.7200 R² test: 0.7100
MAE: 102,000 kBtu
RMSE: 145,000 kBtu
What this means in plain English:
- R² = 0.71 → The model explains 71% of the variation in energy consumption. Not bad, but 29% is still unexplained (noise, missing features, non-linear patterns).
- R² train ≈ R² test → No overfitting! The model generalizes well. If train was 0.99 and test was 0.50, that would mean overfitting (memorized training data).
- MAE = 102,000 kBtu → On average, when the model predicts energy for a building, it's wrong by about 102,000 kBtu. Is that good? Depends on context — if buildings average 400,000 kBtu, being off by 102,000 is a ~25% error.
- RMSE ≥ MAE → This is always true; they're equal only when every error has the same magnitude. RMSE penalizes large errors more, so a big gap between RMSE and MAE means some predictions are way off (the model struggles with certain buildings).
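You can verify the RMSE-vs-MAE behavior with two invented error vectors that have the same average absolute error but different spread:

```python
import numpy as np

errors_even  = np.array([10.0, 10.0, 10.0, 10.0])  # every prediction off by 10
errors_spiky = np.array([1.0, 1.0, 1.0, 37.0])     # one huge miss, same MAE

for name, e in [('even', errors_even), ('spiky', errors_spiky)]:
    mae = np.mean(np.abs(e))
    rmse = np.sqrt(np.mean(e ** 2))
    print(f'{name:6s} MAE = {mae:.1f}  RMSE = {rmse:.1f}')
# even:  MAE = 10, RMSE = 10  (all errors equal -> they match)
# spiky: MAE = 10, RMSE jumps well above 10 (the one big miss dominates)
```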
What is a decision tree?
Before understanding a forest, let's understand one tree:
          Is FloorArea > 50,000?
            /              \
          YES               NO
           |                 |
     Is Age > 40?      Is Floors > 5?
       /      \          /      \
     YES       NO      YES       NO
      |         |       |         |
  400,000   250,000  150,000   80,000   (kBtu)
The tree asks yes/no questions, splitting data into smaller groups until it reaches a prediction. It's like a flowchart.
A Random Forest builds 200 of these on random subsets and averages their predictions. This is called bagging (Bootstrap AGGregating). The averaging makes the model much more stable.
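To see that a tree really is a flowchart with a fixed set of leaf answers, here's a tiny sketch on synthetic data (the numbers are made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
area = rng.uniform(1_000, 100_000, size=(200, 1))
energy = 4 * area[:, 0] + rng.normal(scale=20_000, size=200)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(area, energy)
n_distinct = np.unique(tree.predict(area)).size
print(f'Distinct predictions from a depth-2 tree: {n_distinct}')  # at most 4 (one per leaf)
```

A single shallow tree can only answer with a handful of values — which is exactly why averaging hundreds of trees built on different random subsets produces much smoother, more stable predictions.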
Part 13 — Model 2: Random Forest (bagging, tree-based)
How does it work?
- Take a random sample of the data
- Build a decision tree on it
- Repeat 200 times (n_estimators=200)
- Average all 200 predictions
Key advantages:
- Captures non-linear patterns (curves, interactions)
- Does NOT need scaling (trees split on values, not distances)
- Less prone to overfitting than a single tree (averaging reduces variance)
rf = RandomForestRegressor(
n_estimators=200, # Build 200 trees
random_state=RANDOM_STATE,
n_jobs=-1 # Use all CPU cores (faster)
)
print('RandomForest created: 200 trees, all CPUs')
res_rf = evaluate_model(
rf, X_train, y_train, X_test, y_test,
'Random Forest (bagging)')
Part 14 — Model 3: Gradient Boosting (boosting, tree-based)
How is it different from Random Forest?
Random Forest: 200 trees work INDEPENDENTLY → average
Gradient Boost: 200 trees work SEQUENTIALLY → each fixes the last one's errors
learning_rate controls how much each tree contributes. Smaller = more cautious but needs more trees.
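The "each tree fixes the last one's errors" idea can be sketched by hand with shallow trees on synthetic data (GradientBoostingRegressor, used below, is a refined version of this loop):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 10 * np.sin(X[:, 0]) + rng.normal(scale=0.5, size=300)

learning_rate = 0.5
pred = np.full_like(y, y.mean())              # start from a constant guess
mse_start = np.mean((y - pred) ** 2)

for _ in range(20):
    residuals = y - pred                      # what is still unexplained
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += learning_rate * stump.predict(X)  # each tree nudges the prediction

mse_end = np.mean((y - pred) ** 2)
print(f'MSE before boosting: {mse_start:.2f}, after 20 trees: {mse_end:.2f}')
```

Each iteration fits a small tree to the current residuals and adds a fraction (learning_rate) of its correction — the error shrinks step by step.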
gb = GradientBoostingRegressor(
n_estimators=200,
learning_rate=0.1, # How much each tree contributes
max_depth=5, # How deep each tree can grow
random_state=RANDOM_STATE
)
print('GradientBoosting: 200 sequential trees, lr=0.1, depth=5')
res_gb = evaluate_model(
gb, X_train, y_train, X_test, y_test,
'Gradient Boosting')
Quick comparison so far
Reading the comparison table
name R2_train R2_test MAE RMSE
Ridge (linear) 0.72 0.71 102,000 145,000
Random Forest (bagging) 0.95 0.82 78,000 110,000
Gradient Boosting 0.91 0.85 68,000 98,000
What to notice:
- Gradient Boosting wins on test set (highest R², lowest MAE/RMSE)
- Random Forest overfits slightly (train R²=0.95 is much higher than test R²=0.82)
- Ridge is the weakest but simplest — it can't capture non-linear patterns
- Non-linear models (trees) beat linear — energy consumption has complex relationships with features
This is why we try multiple model categories: you can't know which will work best beforehand.
# Show all 3 results side by side
quick = pd.DataFrame([res_ridge, res_rf, res_gb]).drop(columns=['model'])
print(quick.to_string(index=False))
Part 15 — Cross-Validation: getting reliable estimates
Why is one train/test split not enough?
One split can be lucky or unlucky. Imagine asking ONE person if a movie is good vs asking FIVE people.
5-fold cross-validation splits the data 5 different ways and averages:
Fold 1: [TEST] [train] [train] [train] [train] → R² = 0.82
Fold 2: [train] [TEST] [train] [train] [train] → R² = 0.79
Fold 3: [train] [train] [TEST] [train] [train] → R² = 0.84
Fold 4: [train] [train] [train] [TEST] [train] → R² = 0.81
Fold 5: [train] [train] [train] [train] [TEST] → R² = 0.80
Final: R² = 0.81 +/- 0.02 (MUCH more reliable!)
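The diagram above maps directly onto sklearn's KFold. With 10 pretend buildings (no shuffling here, just to keep the output readable) you can see each building land in the test fold exactly once:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # 10 pretend buildings

all_test = []
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(X), start=1):
    print(f'Fold {fold}: test buildings = {test_idx.tolist()}')
    all_test.extend(test_idx.tolist())

print(sorted(all_test))  # every building appears in a test fold exactly once
```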
kf = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
scoring = {
'R2': 'r2',
'MAE': 'neg_mean_absolute_error',
}
print('KFold: 5 splits, shuffled')
print('Scoring: R² and MAE')
# Cross-validate all 3 models
for model, name, needs_scaling in [
(Ridge(alpha=1.0, random_state=RANDOM_STATE),
'Ridge (linear)', True),
(RandomForestRegressor(n_estimators=200, random_state=RANDOM_STATE, n_jobs=-1),
'Random Forest (bagging)', False),
(GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=5, random_state=RANDOM_STATE),
'Gradient Boosting', False),
]:
if needs_scaling:
pipe = Pipeline([('scaler', StandardScaler()), ('model', model)])
else:
pipe = Pipeline([('model', model)])
cv = cross_validate(pipe, X, y, cv=kf, scoring=scoring,
return_train_score=True, n_jobs=-1)
r2_vals = cv['test_R2']
mae_vals = -cv['test_MAE'] # sklearn returns NEGATIVE MAE
print(f'\n{name}:')
print(f' R² = {r2_vals.mean():.4f} +/- {r2_vals.std():.4f}')
print(f' MAE = {mae_vals.mean():>10,.0f} +/- {mae_vals.std():>8,.0f} kBtu')
Parameters vs Hyperparameters
This distinction confuses many beginners:
| | Parameters | Hyperparameters |
|---|---|---|
| What | The weights the model LEARNS | Settings YOU choose BEFORE training |
| Who sets them | The model (during .fit()) | You (before calling .fit()) |
| Example | The weight of FloorArea in Ridge | alpha in Ridge, n_estimators in a forest |
| Can be optimized? | Automatically by training | By GridSearchCV |
Analogy: If the model is a recipe...
- Parameters = the amounts of each ingredient (the model figures these out)
- Hyperparameters = oven temperature and cooking time (you choose these before baking)
GridSearchCV tries many "oven temperatures" and picks the one that produces the best cake.
Part 16 — Hyperparameter Optimization (GridSearchCV)
What are hyperparameters?
Settings you choose before training. Different values → different performance. Finding the best combination manually is tedious, so GridSearch automates it.
What we're tuning:
| Parameter | What it does | Values to try |
|---|---|---|
| n_estimators | Number of trees | 100, 200, 300 |
| max_depth | How deep each tree grows | 3, 5, 7, 10 |
| learning_rate | Step size for each tree | 0.05, 0.1, 0.2 |
| min_samples_split | Min samples to split a node | 2, 5, 10 |
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [3, 5, 7, 10],
'learning_rate': [0.05, 0.1, 0.2],
'min_samples_split': [2, 5, 10],
}
total_combos = 3 * 4 * 3 * 3 # = 108
print(f'Total combinations: {total_combos}')
print(f'With 5-fold CV: {total_combos * 5} model fits')
print(f'This may take a few minutes...')
What GridSearchCV does, step by step
You tell it: "Try these combinations"
n_estimators: [100, 200, 300] → 3 options
max_depth: [3, 5, 7, 10] → 4 options
learning_rate:[0.05, 0.1, 0.2] → 3 options
min_samples: [2, 5, 10] → 3 options
─────────
Total: 3×4×3×3 = 108 combinations
For EACH combination, it does 5-fold cross-validation:
Combo 1: n_est=100, depth=3, lr=0.05, samples=2
→ Train on folds 1-4, test on fold 5 → R² = 0.78
→ Train on folds 1-3+5, test on fold 4 → R² = 0.80
→ ... (3 more folds)
→ Average: R² = 0.79
Combo 2: n_est=100, depth=3, lr=0.05, samples=5
→ (same 5-fold process)
→ Average: R² = 0.77
... (106 more combos)
Combo 108: n_est=300, depth=10, lr=0.2, samples=10
→ Average: R² = 0.84 ← BEST!
GridSearchCV picks Combo 108 and retrains on ALL training data.
Total model fits: 108 combos × 5 folds = 540 models trained! This is why it takes time.
gs = GridSearchCV(
GradientBoostingRegressor(random_state=RANDOM_STATE),
param_grid,
cv=KFold(5, shuffle=True, random_state=RANDOM_STATE),
scoring='r2',
n_jobs=-1,
verbose=1,
return_train_score=True
)
gs.fit(X_train, y_train)
print(f'\nDone! Best R² (CV): {gs.best_score_:.4f}')
# What parameters won?
print('Best parameters:')
for k, v in gs.best_params_.items():
print(f' {k}: {v}')
# Evaluate the optimized model on the test set
best_model = gs.best_estimator_
y_pred_best = best_model.predict(X_test)
r2_opt = r2_score(y_test, y_pred_best)
mae_opt = mean_absolute_error(y_test, y_pred_best)
rmse_opt = np.sqrt(mean_squared_error(y_test, y_pred_best))
print(f'Optimized Gradient Boosting on TEST set:')
print(f' R²: {r2_opt:.4f}')
print(f' MAE: {mae_opt:>10,.0f} kBtu')
print(f' RMSE: {rmse_opt:>10,.0f} kBtu')
print(f' → The model is off by ~{mae_opt:,.0f} kBtu on average')
Feature importance — the intuition
Imagine you're predicting house prices. Someone asks "what matters most: location, size, or age?"
Permutation importance answers this with an experiment:
- Take a trained model that predicts well (R² = 0.80)
- Randomly shuffle the FloorArea column (destroy its information)
- Re-measure: R² drops to 0.45
- Conclusion: FloorArea was very important (R² dropped by 0.35)
Repeat for every feature. The features that cause the biggest drop when shuffled are the most important.
Feature R² without it Drop Importance
───────── ───────────── ──── ──────────
FloorArea 0.45 -0.35 ★★★★★
BuildingAge 0.70 -0.10 ★★
NumberOfFloors 0.75 -0.05 ★
Neighborhood 0.78 -0.02 ·
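That experiment takes only a few lines to run by hand on synthetic data (sklearn's permutation_importance, used below, automates it and averages over many repeats):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)      # a feature that truly drives y
noise_feat = rng.normal(size=n)  # a feature with no effect
y = 5 * signal + rng.normal(scale=0.5, size=n)

X = np.column_stack([signal, noise_feat])
model = LinearRegression().fit(X, y)
base_r2 = r2_score(y, model.predict(X))

drops = {}
for i, name in enumerate(['signal', 'noise_feat']):
    X_shuffled = X.copy()
    X_shuffled[:, i] = rng.permutation(X_shuffled[:, i])  # destroy this column's information
    drops[name] = base_r2 - r2_score(y, model.predict(X_shuffled))
    print(f'{name:10s}: R² drop = {drops[name]:.3f}')
```

Shuffling the informative feature wrecks the score; shuffling the noise feature barely moves it — that gap is the importance.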
Part 17 — Feature Importance: what drives energy consumption?
Why this matters
The city doesn't just want predictions — they want to know why. Which building characteristics drive high energy consumption? This informs policy.
Two methods:
| Method | How it works | Speed |
|---|---|---|
| Built-in (tree) | Measures how much each feature reduces error in tree splits | Fast |
| Permutation | Shuffles each feature and measures how much the score drops | Slower but more reliable |
# Method 1: Built-in feature importance
fi_df = pd.DataFrame({
'Feature': X.columns,
'Importance': best_model.feature_importances_
}).sort_values('Importance', ascending=False)
print('Top 10 features (built-in importance):')
print(fi_df.head(10).to_string(index=False))
# Plot top 15
top15 = fi_df.head(15)
fig, ax = plt.subplots(figsize=(10, 7))
ax.barh(top15['Feature'][::-1], top15['Importance'][::-1], color='steelblue')
ax.set_xlabel('Feature Importance (built-in)')
ax.set_title('Top 15 Features — Gradient Boosting Built-in')
plt.tight_layout()
plt.show()
# Method 2: Permutation importance
print('Computing permutation importance (30 repeats)...')
perm = permutation_importance(
best_model, X_test, y_test,
n_repeats=30,
random_state=RANDOM_STATE,
n_jobs=-1,
scoring='r2'
)
print('Done!')
How to read a feature importance barplot
GrossFloorArea_sf ████████████████████ 0.35 ← BY FAR the most important
BuildingAge ████████ 0.12 ← Second
AreaPerFloor ██████ 0.09 ← Our engineered feature works!
NumberOfFloors █████ 0.07
HasMultipleUses ███ 0.04 ← Another engineered feature
Type_Hospital ██ 0.03 ← Hospitals use lots of energy
IsOldBuilding ██ 0.02
...
How to interpret this for the client:
- Floor area dominates — bigger buildings use more energy (obvious but confirmed)
- Building age matters — older buildings have worse insulation
- Our engineered features appear in the top — AreaPerFloor and HasMultipleUses added value
- Building type matters — hospitals and hotels use more than offices
- Some features barely matter — could be dropped to simplify the model
# Permutation results
pi_df = pd.DataFrame({
'Feature': X.columns,
'Importance': perm.importances_mean
}).sort_values('Importance', ascending=False)
print('Top 10 features (permutation importance):')
print(pi_df.head(10).to_string(index=False))
# Plot top 15
top15_perm = pi_df.head(15)
fig, ax = plt.subplots(figsize=(10, 7))
ax.barh(top15_perm['Feature'][::-1], top15_perm['Importance'][::-1], color='coral')
ax.set_xlabel('Permutation Importance (R² decrease)')
ax.set_title('Top 15 Features — Permutation Importance')
plt.tight_layout()
plt.show()
Compare both methods side by side
comp = pd.DataFrame({
'Feature': X.columns,
'Tree_Builtin': best_model.feature_importances_,
'Permutation': perm.importances_mean,
}).sort_values('Permutation', ascending=False).head(12)
print('Top 12 features — two methods compared:')
print(comp.to_string(index=False))
Part 18 — Final Summary
# Collect all results
all_results = pd.DataFrame([
res_ridge, res_rf, res_gb,
{'name': 'Optimized GB',
'R2_train': r2_score(y_train, best_model.predict(X_train)),
'R2_test': r2_opt, 'MAE': mae_opt, 'RMSE': rmse_opt}
]).drop(columns=['model'], errors='ignore')
# Sort by test R²
all_results = all_results.sort_values('R2_test', ascending=False)
print('='*72)
print('FINAL MODEL COMPARISON')
print('='*72)
print(all_results.to_string(index=False))
What we'd tell the client
Here's how a data scientist would present these results to the city:
"We built a model that predicts building energy consumption using only structural data you already have — floor area, age, type, number of floors.
The model explains about X% of the variation in energy use (R² = X.XX), and its predictions are off by about Y kBtu on average.
The biggest driver is building size: floor area alone explains most of the variation. Building age is the second factor — older buildings tend to use more energy, likely due to poorer insulation.
Practical application: instead of measuring every building, you can use this model to flag buildings where predicted consumption seems abnormally high, and send inspectors only to those. This could cut inspection costs by 80% while still catching the worst offenders.
Limitations: the model can't capture everything — unusual building configurations, recent renovations, or operational differences are not in the data. We recommend adding renovation history and HVAC system type if available in the future."
This is the storytelling part. Raw numbers don't convince decision-makers — narratives do.
# Business translation
best_r2 = all_results.iloc[0]['R2_test']
best_mae = all_results.iloc[0]['MAE']
best_name = all_results.iloc[0]['name']
print(f'\nBest model: {best_name}')
print(f' Explains {best_r2*100:.1f}% of variance in energy consumption')
print(f' Average prediction error: ~{best_mae:,.0f} kBtu')
print(f'')
print(f'Business insight:')
print(f' Instead of measuring every building, the city can use this model')
print(f' to estimate energy consumption from structural data alone,')
print(f' with an average error of ~{best_mae:,.0f} kBtu.')
Key Takeaways
- Filter first: remove non-compliant buildings (ComplianceStatus) and flagged outliers (Outlier).
- Data leakage is the #1 mistake: never use energy source columns to predict total energy.
- Feature engineering adds value: building age, area per floor, multi-use flags capture real patterns.
- Correlation matrix: drop one from each pair with |r| > 0.70.
- 3 model categories: linear (Ridge), bagging (Random Forest), boosting (Gradient Boosting).
- Scale linear models: use StandardScaler in a Pipeline. Tree models don't need scaling.
- Cross-validate everything: mean +/- std is far more reliable than a single split.
- GridSearchCV: systematic tuning over 3+ hyperparameters.
- Feature importance: two methods (built-in + permutation) to tell the city what matters.
- Storytelling: "the model is off by ~X kBtu" beats "MAE = X".
Happy predicting!
