# How I helped a city estimate energy use from structural data — explained like you're starting from zero

## The story

I work as a data scientist for a small urban planning consultancy. A European city aiming for carbon neutrality by 2050 came to us with a problem: they need to **predict the total energy consumption of commercial buildings** using only structural information — size, age, type, location.

Why? Because measuring actual energy consumption is expensive: inspectors have to physically visit each building with specialized equipment. But the city already has structural data from building permits and tax records. If we can predict energy use from that, inspectors only need to visit buildings where the model flags unusually high consumption.

The city gave us data from ~3,800 non-residential buildings measured in 2016. Our mission:

1. Clean and filter the data
2. Understand which features matter
3. Build and compare regression models
4. Tell the city what drives energy consumption

## Regression vs Classification

```
CLASSIFICATION: predict a CATEGORY
  "Will this employee leave?" → Yes / No
  Metrics: Precision, Recall, F1

REGRESSION: predict a NUMBER
  "How much energy will this building use?" → 347,000 kBtu
  Metrics: R², MAE, RMSE
```

This project is regression — we predict a continuous number.
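A regression model really does output a number, not a label. A minimal sketch (toy values borrowed from the illustrations later in this post; the 40,000 sq ft query building is made up):

```python
# Toy regression: learn energy from floor area, then predict a NUMBER
from sklearn.linear_model import LinearRegression

# Three toy buildings: [floor area in sq ft] and their energy use in kBtu
X = [[12_000], [50_000], [85_000]]
y = [95_000, 320_000, 580_000]

model = LinearRegression().fit(X, y)
prediction = model.predict([[40_000]])[0]
print(f'Predicted energy for a 40,000 sq ft building: {prediction:,.0f} kBtu')
```

The output is a continuous value somewhere in the range of the training targets, which is exactly what distinguishes regression from classification.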
## What you'll learn

| Part | Topic | Why it matters |
|---|---|---|
| 0 | Environment setup | Reproducibility |
| 1 | Load & explore the data | Understand what we have |
| 2 | Explore the target variable | Know what we're predicting |
| 3 | Filter non-compliant & outliers | Remove bad data |
| 4 | Handle missing values | Models can't handle NaN |
| 5 | Visual exploration (EDA) | See patterns before modeling |
| 6 | Feature engineering (4 new features) | Help the model find patterns |
| 7 | Data leakage (the #1 mistake) | Avoid fake good results |
| 8 | Correlation matrix | Remove redundant features |
| 9 | Encode categorical features | ML needs numbers |
| 10 | Train/test split | Reliable evaluation |
| 11 | Helper function for evaluation | Don't repeat yourself |
| 12 | Model 1: Ridge (linear) | Baseline |
| 13 | Model 2: Random Forest (bagging) | Non-linear |
| 14 | Model 3: Gradient Boosting (boosting) | Usually best |
| 15 | Cross-validation (5-fold) | Mean +/- std |
| 16 | GridSearchCV | Tune hyperparameters |
| 17 | Feature importance (2 methods) | What drives consumption? |
| 18 | Final summary | Business conclusions |

Every concept is explained **before** the code. Every code block does **one thing**.

## Before we start: What IS Machine Learning?

Machine Learning (ML) is teaching a computer to **find patterns in data** so it can make predictions on new data.

The 3 types (simplified):

```
1. SUPERVISED LEARNING (this project!)
   You GIVE the model examples with known answers.
   "Here are 3,000 buildings with their actual energy use.
    Learn the pattern, then predict energy for NEW buildings."
   → Regression: predict a NUMBER (energy = 347,000 kBtu)
   → Classification: predict a CATEGORY (employee will leave: yes/no)

2. UNSUPERVISED LEARNING
   You DON'T give answers. The model finds groups on its own.
   "Here are 3,000 buildings. Group similar ones together."

3. REINFORCEMENT LEARNING
   The model learns by trial and error (like training a dog).
```

The workflow we'll follow:

```
Raw data → Clean → Explore → Engineer features → Split → Train models → Evaluate → Interpret
   📦       🧹       🔍            🔧            ✂️         🏋️           📊          💡
```

Each step has a reason. Skipping one usually means bad results.

## Part 0 — Setting up your tools

### What is a pyproject.toml?

Think of it as a **recipe card** for your project. It lists every Python package (ingredient) needed. Anyone wanting to reproduce your work reads this file and installs everything with one command.

```toml
[project]
name = "energy-prediction"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "pandas>=2.0",
    "numpy>=1.24",
    "matplotlib>=3.7",
    "seaborn>=0.12",
    "scikit-learn>=1.3",
]
```

Create it with `uv init` in your terminal.

Now let's import our tools — one at a time, with explanations.

```python
# pandas: work with tables (like Excel, but in Python)
import pandas as pd

# Quick check: can we create a DataFrame?
print(type(pd.DataFrame()))
```

```python
# numpy: fast math on arrays of numbers
import numpy as np

# Quick check
print('2 + 3 =', np.add(2, 3))
```

```python
# matplotlib: the base plotting library
import matplotlib.pyplot as plt

# seaborn: prettier charts built on top of matplotlib
import seaborn as sns

print('Visualization libraries ready!')
```

```python
# Suppress non-critical warnings (keeps output clean)
import warnings
warnings.filterwarnings('ignore')
print('Warnings suppressed')
```

### Scikit-learn imports (one group at a time)

Scikit-learn (sklearn) is THE machine learning library. We import many tools — each will be explained when we use it.
```python
# Split data into training and test sets
from sklearn.model_selection import train_test_split
print('train_test_split: splits data 80/20 (or any ratio)')
```

```python
# Cross-validation: test on multiple splits for reliable scores
from sklearn.model_selection import cross_validate

# KFold: defines HOW to split for cross-validation
from sklearn.model_selection import KFold

# GridSearchCV: try many hyperparameter combinations
from sklearn.model_selection import GridSearchCV

print('Cross-validation and GridSearch loaded')
```

```python
# StandardScaler: normalize features to mean=0, std=1
# (needed for linear models, not for tree-based models)
from sklearn.preprocessing import StandardScaler

# OneHotEncoder: convert categories to binary columns
from sklearn.preprocessing import OneHotEncoder

# Pipeline: chain preprocessing + model in one object
from sklearn.pipeline import Pipeline

print('Preprocessing tools loaded')
```

### The 3 model categories

The project requires at least **3 different categories** of models:

| Category | How it works | Example | Needs scaling? |
|---|---|---|---|
| Linear | Draws the best straight line | Ridge, Lasso | Yes |
| Bagging (trees) | Many trees vote independently | Random Forest | No |
| Boosting (trees) | Trees learn from each other's mistakes | Gradient Boosting | No |

```
BAGGING (Random Forest):
  Tree 1 ──┐
  Tree 2 ──┤──▶ AVERAGE all predictions ──▶ Final answer
  Tree 3 ──┤
  Tree 4 ──┘
  (trees work INDEPENDENTLY, in parallel)

BOOSTING (Gradient Boosting):
  Tree 1 ─▶ errors ─▶ Tree 2 ─▶ errors ─▶ Tree 3 ─▶ ... ─▶ Final answer
  (each tree CORRECTS the mistakes of the previous one)
```

```python
# Linear model
from sklearn.linear_model import Ridge

# Bagging (tree-based, parallel)
from sklearn.ensemble import RandomForestRegressor

# Boosting (tree-based, sequential)
from sklearn.ensemble import GradientBoostingRegressor

print('3 model categories loaded: Ridge, RandomForest, GradientBoosting')
```

### Regression metrics

Suppose we predict energy for 3 buildings:

```
Actual:    [100,000  200,000  300,000]
Predicted: [110,000  180,000  350,000]
Errors:    [+10,000  -20,000  +50,000]

MAE  = average of |errors|
     = (10,000 + 20,000 + 50,000) / 3 = 26,667 kBtu
     → "On average, we're off by ~27,000 kBtu"

RMSE = sqrt(average of errors²)
     = sqrt((10k² + 20k² + 50k²) / 3) = 31,623 kBtu
     → "Like MAE but penalizes BIG errors more"

R²   = 1 - (sum of errors²) / (sum of (actual - mean)²)
     → "How much variance does the model explain?"
     → R²=1.0 means perfect, R²=0 means useless
```
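We can check those hand calculations with scikit-learn's metric functions (same three made-up buildings):

```python
# Verify the worked example with scikit-learn's metrics
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual = np.array([100_000, 200_000, 300_000])
predicted = np.array([110_000, 180_000, 350_000])

mae = mean_absolute_error(actual, predicted)
rmse = np.sqrt(mean_squared_error(actual, predicted))  # RMSE = sqrt(MSE)
r2 = r2_score(actual, predicted)

print(f'MAE:  {mae:,.0f} kBtu')   # ≈ 26,667
print(f'RMSE: {rmse:,.0f} kBtu')  # ≈ 31,623
print(f'R²:   {r2:.2f}')          # 0.85
```

Note that RMSE is computed as `sqrt(MSE)` here, which works across scikit-learn versions.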
```python
# Regression metrics
from sklearn.metrics import (
    mean_absolute_error,  # MAE
    mean_squared_error,   # Used to compute RMSE
    r2_score,             # R²
    make_scorer,          # Convert metric to scorer
)

# Feature importance
from sklearn.inspection import permutation_importance

print('Metrics and feature importance loaded')
```

```python
# Global settings for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Prettier charts
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
pd.set_option('display.max_columns', None)

print(f'All set! Random state: {RANDOM_STATE}')
```

## Part 1 — Loading the dataset

In a real project, you'd load the CSV: `pd.read_csv('seattle_buildings_2016.csv')`. Since we can't share the real data, we generate synthetic data that mimics it. We build each column separately so you see exactly what goes in.

### Step 1.1 — Building IDs

Every row needs a unique identifier.
```python
n_buildings = 3800
building_ids = [f'BLD{str(i).zfill(5)}' for i in range(1, n_buildings + 1)]

# Verify
print(f'Total: {len(building_ids)}')
print(f'First 3: {building_ids[:3]}')
print(f'Last 3: {building_ids[-3:]}')
```

### Step 1.2 — Building type

All non-residential: offices, retail, hotels, etc. In the real project, **residential buildings must be excluded** — that's a hard requirement.

```python
building_types = [
    'Office', 'Retail', 'Hotel', 'Warehouse', 'School',
    'Hospital', 'Restaurant', 'Supermarket', 'Parking', 'Other'
]
primary_use = np.random.choice(
    building_types, n_buildings,
    p=[0.25, 0.15, 0.08, 0.12, 0.10, 0.05, 0.08, 0.05, 0.07, 0.05])
```

```python
# Let's see the distribution
print('Building type distribution:')
for bt in building_types:
    n = (primary_use == bt).sum()
    pct = n / n_buildings * 100
    bar = '█' * int(pct)
    print(f'  {bt:15s} {n:4d} ({pct:5.1f}%) {bar}')
```

### Step 1.3 — Year built

```python
year_built = np.random.normal(1970, 25, n_buildings).clip(1900, 2020).astype(int)

print(f'Oldest: {year_built.min()}')
print(f'Newest: {year_built.max()}')
print(f'Median: {int(np.median(year_built))}')
```

### Step 1.4 — Number of floors

```python
n_floors = np.random.exponential(3, n_buildings).clip(1, 50).astype(int)

print(f'Range: {n_floors.min()} to {n_floors.max()} floors')
print(f'Most common: {np.bincount(n_floors).argmax()} floor(s)')
```

### Step 1.5 — Gross floor area

Larger buildings and different types have very different sizes. A hospital is much bigger than a restaurant. We encode this with a base area per type.

```python
base_area = {
    'Office': 50000, 'Retail': 20000, 'Hotel': 80000,
    'Warehouse': 40000, 'School': 60000, 'Hospital': 120000,
    'Restaurant': 5000, 'Supermarket': 30000, 'Parking': 70000,
    'Other': 25000
}

# Show the base areas
print('Base floor area by type:')
for bt, area in sorted(base_area.items(), key=lambda x: -x[1]):
    print(f'  {bt:15s} {area:>8,} sq ft')
```

```python
# Actual area = base * random variation * floor bonus
gross_floor_area = np.array([
    base_area[pu] * np.random.lognormal(0, 0.5) * (nf ** 0.3)
    for pu, nf in zip(primary_use, n_floors)
]).clip(1000, 2000000).round(0)

print(f'Floor area range: {gross_floor_area.min():,.0f} to {gross_floor_area.max():,.0f} sq ft')
print(f'Median: {np.median(gross_floor_area):,.0f} sq ft')
```

### Step 1.6 — Other columns

```python
# 10 neighborhoods
neighborhoods = [f'Zone_{chr(65+i)}' for i in range(10)]
neighborhood = np.random.choice(neighborhoods, n_buildings)
print(f'{len(np.unique(neighborhood))} neighborhoods')
```

```python
# Number of property uses (1=single, 2+=mixed use)
n_property_uses = np.random.choice(
    [1, 2, 3, 4], n_buildings, p=[0.60, 0.25, 0.10, 0.05])

vals, cnts = np.unique(n_property_uses, return_counts=True)
for v, c in zip(vals, cnts):
    print(f'  {v} use(s): {c} buildings')
```

### Step 1.7 — Compliance and Outlier flags

The real dataset has two **critical** columns:

- `ComplianceStatus`: did the building pass inspection? Non-compliant → remove
- `Outlier`: was this data point flagged as suspicious? Yes → remove

Using these columns to filter is **required** by the project.
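For reference, the filter applied in Part 3 boils down to two boolean masks. A minimal sketch on a stand-in DataFrame (toy rows, not the real data):

```python
# Sketch: keep only compliant, non-outlier rows (toy stand-in data)
import pandas as pd

toy = pd.DataFrame({
    'ComplianceStatus': ['Compliant', 'Non-Compliant', 'Compliant'],
    'Outlier': ['No', 'No', 'Yes'],
})

# Combine both conditions with & (element-wise AND on boolean Series)
mask = (toy['ComplianceStatus'] == 'Compliant') & (toy['Outlier'] == 'No')
filtered = toy[mask]
print(f'Kept {len(filtered)} of {len(toy)} rows')  # keeps only row 0
```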
```python
compliance = np.random.choice(
    ['Compliant', 'Non-Compliant'], n_buildings, p=[0.85, 0.15])
outlier_flag = np.random.choice(
    ['No', 'Yes'], n_buildings, p=[0.92, 0.08])

print(f'Compliant:     {(compliance == "Compliant").sum()}')
print(f'Non-Compliant: {(compliance == "Non-Compliant").sum()}')
print(f'Outlier=Yes:   {(outlier_flag == "Yes").sum()}')
```

### Step 1.8 — Energy columns (leakage risks!)

The real dataset has individual energy sources. We create them here but will **remove them later** — using them to predict total energy is cheating (data leakage, explained in Part 7).

```python
# EnergyStarScore (1-100): efficiency rating
energy_star = np.random.normal(60, 20, n_buildings).clip(1, 100).round(0)
print(f'EnergyStarScore: {energy_star.min():.0f} to {energy_star.max():.0f}')
```

```python
# Individual energy sources (these ARE total energy's components!)
electricity_kbtu = (gross_floor_area * np.random.uniform(10, 40, n_buildings)).round(0)
natural_gas_kbtu = (gross_floor_area * np.random.uniform(2, 20, n_buildings)).round(0)
steam_kbtu = np.where(
    np.random.random(n_buildings) < 0.3,
    gross_floor_area * np.random.uniform(1, 10, n_buildings),
    0).round(0)

print(f'Electricity: mean={electricity_kbtu.mean():,.0f} kBtu')
print(f'Natural gas: mean={natural_gas_kbtu.mean():,.0f} kBtu')
print(f'Steam (30% of buildings): mean of nonzero={steam_kbtu[steam_kbtu > 0].mean():,.0f} kBtu')
```

### Step 1.9 — The TARGET: total site energy use

```python
# Total = sum of all sources + noise
site_energy = (electricity_kbtu + natural_gas_kbtu + steam_kbtu +
               np.random.normal(0, 5000, n_buildings)).clip(10000).round(0)

# Make flagged outliers extreme (realistic)
outlier_mask = outlier_flag == 'Yes'
site_energy[outlier_mask] *= np.random.uniform(2, 5, outlier_mask.sum())

print(f'SiteEnergyUse_kBtu:')
print(f'  Min:    {site_energy.min():>12,.0f}')
print(f'  Median: {np.median(site_energy):>12,.0f}')
print(f'  Mean:   {site_energy.mean():>12,.0f}')
print(f'  Max:    {site_energy.max():>12,.0f}')
```

```python
# EUI and CO2 (also leakage)
site_eui = (site_energy / gross_floor_area).round(2)
co2_emissions = (site_energy * np.random.uniform(0.00005, 0.00015, n_buildings)).round(2)

print(f'SiteEUI range: {site_eui.min():.2f} to {site_eui.max():.2f} kBtu/sf')
print(f'CO2 range: {co2_emissions.min():.2f} to {co2_emissions.max():.2f} metric tons')
```

### Step 1.10 — Inject missing values (like real data)

```python
energy_star_dirty = energy_star.copy().astype(float)
energy_star_dirty[np.random.random(n_buildings) < 0.05] = np.nan
print(f'Missing EnergyStarScore: {np.isnan(energy_star_dirty).sum()}')
```

```python
year_built_dirty = year_built.copy().astype(float)
year_built_dirty[np.random.random(n_buildings) < 0.02] = np.nan
print(f'Missing YearBuilt: {np.isnan(year_built_dirty).sum()}')
```

### Step 1.11 — Assemble the full DataFrame

```python
df = pd.DataFrame({
    'BuildingID': building_ids,
    'PrimaryPropertyType': primary_use,
    'YearBuilt': year_built_dirty,
    'NumberOfFloors': n_floors,
    'GrossFloorArea_sf': gross_floor_area,
    'Neighborhood': neighborhood,
    'NumberOfPropertyUses': n_property_uses,
    'ComplianceStatus': compliance,
    'Outlier': outlier_flag,
    'EnergyStarScore': energy_star_dirty,
    'Electricity_kBtu': electricity_kbtu,
    'NaturalGas_kBtu': natural_gas_kbtu,
    'Steam_kBtu': steam_kbtu,
    'SiteEnergyUse_kBtu': site_energy,
    'SiteEUI_kBtu_sf': site_eui,
    'CO2Emissions_tons': co2_emissions,
})
print(f'Dataset: {df.shape[0]} rows x {df.shape[1]} columns')
```

```python
# List all columns with their types
print('All columns:')
for i, (col, dtype) in enumerate(df.dtypes.items()):
    print(f'  {i+1:2d}. {col:30s} {str(dtype):10s}')
```

```python
df.head(3)
```

### What is a DataFrame?

Think of a DataFrame as a **spreadsheet table** inside Python. It has:

- **Rows** (one per building, employee, or observation)
- **Columns** (one per variable: age, floor area, energy...)
- **Index** (row number, starting from 0)
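In code, a miniature DataFrame with those three parts (using the same made-up buildings as the illustrations in this post):

```python
# A tiny DataFrame: 3 rows (buildings), 3 columns, a labeled index
import pandas as pd

mini = pd.DataFrame(
    {
        'FloorArea': [50000, 12000, 85000],
        'YearBuilt': [1985, 2001, 1962],
        'Energy': [320000, 95000, 580000],
    },
    index=['BLD001', 'BLD002', 'BLD003'],  # the index labels each row
)

print(mini)
print(f'Shape: {mini.shape}')  # (3, 3) → 3 rows, 3 columns
```

The index does not have to be row numbers; here it carries the building IDs, so `mini.loc['BLD002']` retrieves a row by name.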
```
         FloorArea  YearBuilt   Energy
BLD001       50000       1985   320000
BLD002       12000       2001    95000
BLD003       85000       1962   580000
↑ index    ↑ col 1    ↑ col 2  ↑ col 3
```

### What do .shape, .info(), .describe() tell us?

| Command | What it shows | Why you need it |
|---|---|---|
| `df.shape` | (rows, columns) | Know the size |
| `df.info()` | Column names, types, non-null count | Spot wrong types and missing data |
| `df.describe()` | Mean, std, min, max, quartiles | Spot outliers and weird distributions |
| `df.isnull().sum()` | Number of missing values per column | Know what needs fixing |

## Part 2 — Exploring the raw data

### The 3 essential checks for ANY dataset

Every time you load data, ask these 3 questions:

1. **How big is it?** → `df.shape`
2. **What types are the columns?** → `df.info()`
3. **How much data is missing?** → `df.isnull().sum()`

```python
# Check 1: Size
print(f'Rows: {df.shape[0]}')
print(f'Columns: {df.shape[1]}')
```

```python
# Check 2: Types and non-null counts
df.info()
```

### How to read .describe() output

```
       FloorArea  YearBuilt     Energy
count     3800.0     3800.0     3800.0  ← how many non-null values
mean     52341.0     1970.0   420000.0  ← average
std      45200.0       25.0   380000.0  ← spread (bigger = more varied)
min       1000.0     1900.0    10000.0  ← smallest value
25%      18000.0     1952.0   150000.0  ← 25% of buildings are below this
50%      38000.0     1970.0   310000.0  ← MEDIAN (middle value)
75%      72000.0     1990.0   560000.0  ← 75% are below this
max    2000000.0     2020.0  5000000.0  ← largest value
```

**What to look for:**

- Big difference between mean and 50% (median)? → Data is skewed (has extreme values)
- min very different from 25%? → Possible outliers at the bottom
- max very different from 75%? → Possible outliers at the top
- count less than total rows? → Missing values in that column
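The first check (mean far above the median signals a long right tail) is easy to see on made-up numbers:

```python
# One extreme value drags the mean up but barely moves the median
import numpy as np

energy = np.array([100_000, 120_000, 110_000, 105_000, 5_000_000])

print(f'Mean:   {energy.mean():,.0f}')      # pulled up by the outlier
print(f'Median: {np.median(energy):,.0f}')  # stays near the typical value
```

Four of the five values sit near 110,000, yet the single extreme building pushes the mean an order of magnitude higher: a classic sign of skew.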
→ Data is skewed (has extreme values) Big difference between mean and 50% (median)? min very different from 25%? → Possible outliers at the bottom min very different from 25%? max very different from 75%? → Possible outliers at the top max very different from 75%? count less than total rows? → Missing values in that column count less than total rows? # Check 3: Missing values missing = df.isnull().sum() print('Columns with missing values:') print(missing[missing > 0]) print(f'\nTotal missing cells: {missing.sum()} out of {df.shape[0]*df.shape[1]}') # Check 3: Missing values missing = df.isnull().sum() print('Columns with missing values:') print(missing[missing > 0]) print(f'\nTotal missing cells: {missing.sum()} out of {df.shape[0]*df.shape[1]}') # Descriptive stats for numeric columns df.describe().round(2) # Descriptive stats for numeric columns df.describe().round(2) What is a "target variable"? In supervised learning, we always have: Features (X): the information we GIVE to the model (floor area, year built, ...) Target (y): what we want the model to PREDICT (energy consumption) Features (X): the information we GIVE to the model (floor area, year built, ...) Features Target (y): what we want the model to PREDICT (energy consumption) Target Features (X) Target (y) ┌───────────┬───────────┬─────────┐ ┌────────────────┐ │ FloorArea │ YearBuilt │ Floors │ │ EnergyUse_kBtu │ ├───────────┼───────────┼─────────┤ ├────────────────┤ │ 50,000 │ 1985 │ 3 │ → │ 320,000 │ │ 12,000 │ 2001 │ 1 │ → │ 95,000 │ │ 85,000 │ 1962 │ 7 │ → │ 580,000 │ └───────────┴───────────┴─────────┘ └────────────────┘ The model learns the relationship: X → y Then predicts y for NEW buildings where we don't know y. 
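The three checks can be exercised on a tiny stand-in frame (toy data matching the example above, not the real city dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the city dataset, with one missing value on purpose
toy = pd.DataFrame({
    'FloorArea': [50000, 12000, 85000],
    'YearBuilt': [1985, 2001, np.nan],
    'Energy':    [320000, 95000, 580000],
}, index=['BLD001', 'BLD002', 'BLD003'])

print(toy.shape)                            # (3, 3) — rows, columns
print(toy.isnull().sum().sum())             # 1 — one missing cell (YearBuilt)
print(toy['Energy'].describe().loc['50%'])  # 320000.0 — the median row of .describe()
```

The same three calls work unchanged on the real 3,800-row frame; only the numbers change.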
Explore the target variable FIRST

Always look at what you're trying to predict before anything else.

```python
target_col = 'SiteEnergyUse_kBtu'
print(f'Target: {target_col}')
print(f'  Count:  {df[target_col].count()}')
print(f'  Mean:   {df[target_col].mean():>12,.0f} kBtu')
print(f'  Median: {df[target_col].median():>12,.0f} kBtu')
print(f'  Std:    {df[target_col].std():>12,.0f} kBtu')
print(f'  Min:    {df[target_col].min():>12,.0f} kBtu')
print(f'  Max:    {df[target_col].max():>12,.0f} kBtu')
print(f'  Skew:   {df[target_col].skew():>12.2f}')
```

Why is the distribution "skewed" and what does log-transform do?

Most buildings use moderate energy, but a few use ENORMOUS amounts. This creates a "long tail" on the right:

```
Raw distribution:                    Log-transformed:
│█                                   │    █
│█                                   │   ███
│██                                  │  █████
│████                                │ ███████
│████████                            │████████
│██████████████░░░░░░──              │████████
└────────────────────── Energy       └──────────── log(Energy)
 Most buildings  Long tail            More symmetric!
 are here        (outliers)
```

The log function compresses big numbers more than small numbers, making the distribution more symmetric. Some models (especially linear ones) work better with roughly symmetric data.

Note: we use np.log1p(x) instead of np.log(x) because log1p computes log(1 + x), so it handles zero values safely (log(0) is undefined; np.log(0) returns -inf and poisons everything downstream).

```python
# Visualize: raw distribution vs log-transformed
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(df[target_col], bins=50, color='steelblue', edgecolor='white')
axes[0].set_xlabel('Site Energy Use (kBtu)')
axes[0].set_ylabel('Number of buildings')
axes[0].set_title('Raw Distribution (very right-skewed)')

axes[1].hist(np.log1p(df[target_col]), bins=50, color='coral', edgecolor='white')
axes[1].set_xlabel('log(Site Energy Use)')
axes[1].set_title('Log-transformed (more symmetric)')

plt.tight_layout()
plt.show()
```

What does "filtering" mean in pandas?

Filtering means keeping only the rows that match a condition. It's like applying a filter in Excel.
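A minimal runnable sketch of that masking mechanic, on toy data (the 'Error' value is illustrative, not from the city's dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    'ComplianceStatus': ['Compliant', 'Compliant', 'Error', 'Compliant'],
    'Energy': [320000, 95000, 580000, 150000],
})

# Step 1: the comparison builds a Series of True/False, one per row
mask = toy['ComplianceStatus'] == 'Compliant'
print(mask.tolist())               # [True, True, False, True]

# Step 2: indexing with that Series keeps only the True rows
kept = toy[mask]
print(len(toy), '->', len(kept))   # 4 -> 3
```

The same two steps happen in one line in the real code below; the intermediate mask is just never named.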
```python
# English: "Keep only rows where ComplianceStatus equals 'Compliant'"
# Python:
df = df[df['ComplianceStatus'] == 'Compliant']

# How does this work, step by step?
# 1. df['ComplianceStatus'] == 'Compliant'
#    → creates a column of True/False for every row
#    → [True, True, False, True, False, ...]
#
# 2. df[True/False column]
#    → keeps only rows where the value is True
```

Important: always keep track of how many rows you have before and after each filter. If you lose too many, you might not have enough data to train a model.

Part 3 — Filtering the data

Why filter? The dataset contains buildings that shouldn't be included:

- Non-compliant buildings: their data is unreliable
- Flagged outliers: suspicious measurements

The project requires using the ComplianceStatus and Outlier columns to filter. Track how many rows we keep at each step.

```python
print(f'Starting count: {len(df)} buildings')

# What's the compliance distribution?
print(df['ComplianceStatus'].value_counts())

# Remove non-compliant
df = df[df['ComplianceStatus'] == 'Compliant'].copy()
print(f'After removing non-compliant: {len(df)} buildings')

# What's the outlier distribution?
print(df['Outlier'].value_counts())

# Remove flagged outliers
df = df[df['Outlier'] == 'No'].copy()
print(f'After removing outliers: {len(df)} buildings')

# These columns served their purpose — drop them
df.drop(columns=['ComplianceStatus', 'Outlier'], inplace=True)
print(f'Columns remaining: {df.shape[1]}')
```

Part 4 — Handling missing values

Why median, not mean? The mean is pulled by extreme values. The median is the middle value and ignores outliers.

```
Values: [2, 3, 3, 4, 100]
Mean:   22.4  ← pulled by 100
Median:  3    ← the actual middle value
```

```python
# What's still missing?
print('Missing values after filtering:')
missing = df.isnull().sum()
print(missing[missing > 0])

# Fill YearBuilt
# (assignment form: column-level fillna(inplace=True) is deprecated in pandas 2.x)
median_year = df['YearBuilt'].median()
before_null = df['YearBuilt'].isnull().sum()
df['YearBuilt'] = df['YearBuilt'].fillna(median_year)
print(f'YearBuilt: filled {before_null} missing values with median={median_year:.0f}')

# Fill EnergyStarScore
median_star = df['EnergyStarScore'].median()
before_null = df['EnergyStarScore'].isnull().sum()
df['EnergyStarScore'] = df['EnergyStarScore'].fillna(median_star)
print(f'EnergyStarScore: filled {before_null} missing values with median={median_star:.0f}')

# Verify: no more nulls
print(f'Total remaining nulls: {df.isnull().sum().sum()}')
```

Remove extreme statistical outliers

Even after removing flagged outliers, some buildings may have unrealistic energy values. We use percentiles to trim the extremes.
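The mean-vs-median toy example from Part 4 checks out numerically:

```python
import numpy as np

values = [2, 3, 3, 4, 100]
print(np.mean(values))    # 22.4 — dragged up by the single 100
print(np.median(values))  # 3.0 — unaffected by the outlier
```

This is exactly why filling missing values with the median is the safer default on skewed data like ours.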
```python
Q01 = df['SiteEnergyUse_kBtu'].quantile(0.01)
Q99 = df['SiteEnergyUse_kBtu'].quantile(0.99)
print(f'1st percentile:  {Q01:>12,.0f} kBtu')
print(f'99th percentile: {Q99:>12,.0f} kBtu')

before = len(df)
df = df[(df['SiteEnergyUse_kBtu'] >= Q01) & (df['SiteEnergyUse_kBtu'] <= Q99)].copy()
print(f'Removed {before - len(df)} extreme buildings')
print(f'Remaining: {len(df)} buildings')
```

Part 5 — Visualizations

Which chart for which comparison?

| What you want to see | Chart type | Example |
|---|---|---|
| Two numbers together | Scatterplot | Floor area vs energy |
| Number across categories | Boxplot | Energy by building type |
| Distribution of one thing | Histogram | Year built |
| Target after filtering | Histogram | Energy distribution |

All charts must relate to the target variable.

How to read a scatterplot

Each dot is one building. The position tells you two things at once:

```
Energy (y)
│                  ●
│              ● ●
│          ●●●        "Pattern: dots go UP from left to right"
│       ●●●            → POSITIVE correlation
│    ●●                → Bigger floor area = more energy
│●●
└──────────────── FloorArea (x)
```

What to look for:

- Dots going up-right? → Positive relationship (both increase together)
- Dots going down-right? → Negative relationship (one up, other down)
- Random cloud? → No relationship
- Dots far from the main group? → Outliers
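The "direction" of a dot cloud is exactly what a correlation coefficient measures. A quick numerical sketch on synthetic data (fixed seed, nothing from the city dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 500)

up = 2 * x + rng.normal(0, 20, 500)   # dots trend up-right
cloud = rng.normal(0, 20, 500)        # random cloud, unrelated to x

print(round(np.corrcoef(x, up)[0, 1], 2))     # close to +1 (strong positive)
print(round(np.corrcoef(x, cloud)[0, 1], 2))  # close to 0 (no relationship)
```

When a scatterplot looks like an up-right band, expect r well above 0; when it looks like static, expect r near 0.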
```python
# SCATTERPLOT: Floor area vs Energy (the most important relationship)
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(df['GrossFloorArea_sf'], df['SiteEnergyUse_kBtu'],
           alpha=0.3, s=15, color='steelblue')
ax.set_xlabel('Gross Floor Area (sq ft)')
ax.set_ylabel('Site Energy Use (kBtu)')
ax.set_title('Bigger buildings use more energy (obvious but important)')
plt.tight_layout()
plt.show()
```

How to read a boxplot

A boxplot shows the distribution of a numeric variable at a glance:

```
      ┌──────┐
 ─────┤      ├─────   ← Whiskers (range without outliers)
      ├──────┤
      │      │        ← Box = middle 50% of data (Q1 to Q3)
      │──────│        ← Line inside box = MEDIAN (middle value)
      │      │
      ├──────┤
 ─────┤      ├─────
      └──────┘
         ●            ← Dots = outliers (values far from the rest)
```

What to look for: if one building type has a much higher median than another, that type tends to use more energy.

```python
# BOXPLOT: Energy by building type
fig, ax = plt.subplots(figsize=(14, 5))
order = (df.groupby('PrimaryPropertyType')['SiteEnergyUse_kBtu']
           .median().sort_values(ascending=False).index)
sns.boxplot(data=df, x='PrimaryPropertyType', y='SiteEnergyUse_kBtu',
            order=order, ax=ax, palette='husl')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
ax.set_ylabel('Site Energy Use (kBtu)')
ax.set_title('Energy Use by Building Type (sorted by median)')
plt.tight_layout()
plt.show()

# HISTOGRAM: Year built
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(df['YearBuilt'], bins=40, color='coral', edgecolor='white')
ax.set_xlabel('Year Built')
ax.set_ylabel('Count')
ax.set_title('When were these buildings constructed?')
plt.tight_layout()
plt.show()

# SCATTERPLOT: Number of floors vs Energy
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(df['NumberOfFloors'], df['SiteEnergyUse_kBtu'],
           alpha=0.3, s=15, color='coral')
ax.set_xlabel('Number of Floors')
ax.set_ylabel('Site Energy Use (kBtu)')
ax.set_title('More floors → more energy (generally)')
plt.tight_layout()
plt.show()

# SCATTERPLOT: EnergyStarScore vs Energy (before we drop it)
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(df['EnergyStarScore'], df['SiteEnergyUse_kBtu'],
           alpha=0.3, s=15, color='green')
ax.set_xlabel('EnergyStarScore')
ax.set_ylabel('Site Energy Use (kBtu)')
ax.set_title('EnergyStarScore vs Energy — but this column is LEAKAGE!')
plt.tight_layout()
plt.show()
```

Why does feature engineering help?

Think of it this way. You're trying to predict someone's running speed. The model has:

- leg_length = 90 cm
- body_height = 180 cm

These raw numbers are useful, but the ratio leg_length / body_height = 0.50 is directly informative — it tells you about body proportions in one number instead of two.

Same idea here: instead of giving the model GrossFloorArea = 100,000 and NumberOfFloors = 10 separately, we create AreaPerFloor = 10,000, which directly tells the model "each floor is 10,000 sq ft". The model doesn't have to figure out this relationship on its own.

Rules for feature engineering:

- New features should make physical sense (age of building, area per floor, etc.)
- NEVER use the target to create a feature — that's data leakage
- At least 3 new features are required for this project
- Verify each new feature looks reasonable (check min, max, mean)

Part 6 — Feature Engineering (at least 3 new features required)

What is feature engineering? Creating new columns from existing ones to help the model find patterns.

Analogy: predicting how fast a car goes. You have engine_horsepower and car_weight. Creating power_to_weight_ratio = horsepower / weight gives the model a shortcut.

Golden rule: NEVER use the target to create features (that's data leakage).
Feature 1: Building age

```python
# Older buildings often have worse insulation → more energy
df['BuildingAge'] = 2016 - df['YearBuilt']
print('BuildingAge:')
print(f'  Min:  {df["BuildingAge"].min():.0f} years (newest)')
print(f'  Max:  {df["BuildingAge"].max():.0f} years (oldest)')
print(f'  Mean: {df["BuildingAge"].mean():.0f} years')
```

Feature 2: Area per floor

```python
# A 10-floor building with 100k sqft has different characteristics
# than a 2-floor building with 100k sqft
# (assumes NumberOfFloors >= 1; replace zeros first if your data has them)
df['AreaPerFloor'] = df['GrossFloorArea_sf'] / df['NumberOfFloors']
print('AreaPerFloor:')
print(f'  Min:  {df["AreaPerFloor"].min():>10,.0f} sq ft')
print(f'  Max:  {df["AreaPerFloor"].max():>10,.0f} sq ft')
print(f'  Mean: {df["AreaPerFloor"].mean():>10,.0f} sq ft')
```

Feature 3: Has multiple uses (binary flag)

```python
# Mixed-use buildings (office + retail) may have different energy profiles
df['HasMultipleUses'] = (df['NumberOfPropertyUses'] > 1).astype(int)
print(f'Single use:    {(df["HasMultipleUses"] == 0).sum()} buildings')
print(f'Multiple uses: {(df["HasMultipleUses"] == 1).sum()} buildings')
```

Feature 4: Old building flag

```python
# Pre-1960 buildings often lack modern insulation
df['IsOldBuilding'] = (df['YearBuilt'] < 1960).astype(int)
print(f'Old (pre-1960): {df["IsOldBuilding"].sum()} buildings')
print(f'Newer (1960+):  {(df["IsOldBuilding"] == 0).sum()} buildings')
print(f'\nTotal new features created: 4 (minimum 3 required)')
```

Part 7 — Data Leakage: the most common beginner mistake

What is data leakage? Your model accidentally gets information it shouldn't have. Like taking an exam with the answer key.

In our dataset:

```
SiteEnergyUse_kBtu = Electricity + NaturalGas + Steam + other
                     ^^^^^^^^^^^   ^^^^^^^^^^   ^^^^^
                     COMPONENTS of total energy!

If we use Electricity_kBtu as a FEATURE to predict SiteEnergyUse_kBtu,
the model just learns to add up the components → R² = 0.99+

Looks amazing... but completely USELESS on new buildings where
we DON'T have energy measurements!
```

Columns that cause leakage:

| Column | Why it leaks |
|---|---|
| Electricity_kBtu | Component of total energy |
| NaturalGas_kBtu | Component of total energy |
| Steam_kBtu | Component of total energy |
| CO2Emissions_tons | Computed FROM total energy |
| EnergyStarScore | Derived FROM energy consumption |
| SiteEUI_kBtu_sf | = SiteEnergy / FloorArea (contains the target!) |
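The "R² = 0.99+" claim is easy to reproduce on purely synthetic data (all numbers below are made up for illustration; the least-squares fit stands in for any regression model):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300
area = rng.uniform(5_000, 100_000, n)
elec = 4 * area + rng.normal(0, 50_000, n)   # electricity roughly tracks size
gas = 2 * area + rng.normal(0, 50_000, n)
energy = elec + gas                          # target = sum of its components

def r2(X, y):
    """R² of an ordinary least-squares fit with intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    return 1 - resid.var() / y.var()

# Leaky features: the model just re-adds the components
print(round(r2(np.column_stack([elec, gas]), energy), 3))  # 1.0

# Honest structural feature only: good, but not suspiciously perfect
print(round(r2(area.reshape(-1, 1), energy), 3))
```

An R² at or near 1.0 on a noisy real-world target is almost always a leakage alarm, not a triumph.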
Also drop BuildingID — it's just a label, not a feature.

```python
leakage_cols = [
    'Electricity_kBtu',
    'NaturalGas_kBtu',
    'Steam_kBtu',
    'CO2Emissions_tons',
    'EnergyStarScore',
    'SiteEUI_kBtu_sf',
]
drop_cols = leakage_cols + ['BuildingID']
print(f'Dropping {len(drop_cols)} columns:')
for col in drop_cols:
    print(f'  ✗ {col}')
```

Data leakage — another example to make sure you get it

Imagine predicting a student's final exam grade. You have:

```
✓ Hours_Studied   = 45    ← GOOD feature (known before the exam)
✓ Attendance_Rate = 0.90  ← GOOD feature (known before the exam)
✗ Exam_Score      = 85    ← THIS IS WHAT WE'RE PREDICTING! Can't use it!
✗ Grade_Letter    = "B+"  ← DERIVED from the exam score! Leakage!
✗ Class_Rank      = 12    ← COMPUTED after all grades are known! Leakage!
```

Same logic in our project:

```
✓ GrossFloorArea   ← GOOD: structural data we have before measurement
✓ YearBuilt        ← GOOD: structural data
✗ Electricity_kBtu ← BAD: component of total energy (the thing we predict!)
✗ SiteEUI_kBtu_sf  ← BAD: = SiteEnergy / FloorArea (contains the answer!)
```

Rule: if a column COULD NOT EXIST without measuring the target, drop it.

```python
df.drop(columns=drop_cols, inplace=True)
print(f'\nColumns remaining: {df.shape[1]}')
for i, col in enumerate(df.columns):
    marker = '← TARGET' if col == 'SiteEnergyUse_kBtu' else ''
    print(f'  {i+1:2d}. {col:30s} {marker}')
```

What is correlation?

Correlation measures how two variables move together.
It ranges from -1 to +1:

```
r = +1.0 → Perfect positive: when A goes up, B always goes up
r = +0.7 → Strong positive: A up → B mostly up
r =  0.0 → No relationship
r = -0.7 → Strong negative: A up → B mostly down
r = -1.0 → Perfect negative
```

Example:

- FloorArea and Energy: r = +0.85 (bigger buildings use more energy)
- EnergyStarScore and Energy: r = -0.40 (efficient buildings use less)

Why drop highly correlated FEATURES?

If GrossFloorArea and AreaPerFloor have r = 0.95, they tell the model almost the same thing. Keeping both:

- Wastes space
- Can confuse linear models and make them unstable (they fight over which one to use)

Rule of thumb: if |r| > 0.70 between two features (NOT the target), drop one.

Part 8 — Correlation matrix

Why check correlations? If two features are very similar (|r| > 0.70), they carry the same information, so we apply the rule of thumb above and drop one of each redundant pair.
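A toy illustration of the |r| > 0.70 rule, using a hypothetical unit-duplicate column (Area_m2 is just Area_sf rescaled, so the pair is perfectly redundant):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
area = rng.uniform(1_000, 100_000, 200)

toy = pd.DataFrame({
    'Area_sf': area,
    'Area_m2': area * 0.0929,                 # same information, different unit
    'YearBuilt': rng.integers(1900, 2020, 200),
})

corr = toy.corr()
print(round(corr.loc['Area_sf', 'Area_m2'], 2))  # 1.0 — redundant pair, drop one
```

Real redundant pairs are rarely this blatant (r = 0.8 to 0.95 is typical), but the decision is the same: keep one column, drop the other.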
```python
num_df = df.select_dtypes(include=[np.number])
corr = num_df.corr()
print(f'Correlation matrix: {corr.shape[0]} x {corr.shape[1]} features')

# Visualize as heatmap
plt.figure(figsize=(12, 9))
mask = np.triu(np.ones_like(corr, dtype=bool))  # Hide upper triangle (redundant)
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, square=True, linewidths=0.3, annot_kws={'size': 8})
plt.title('Correlation Matrix — look for |r| > 0.70')
plt.tight_layout()
plt.show()

# Automatically find highly correlated pairs
target_name = 'SiteEnergyUse_kBtu'
print('Feature pairs with |r| > 0.70 (excluding target):')
found = False
for i in range(len(corr)):
    for j in range(i + 1, len(corr)):
        fi = corr.index[i]
        fj = corr.columns[j]
        r = corr.iloc[i, j]
        if abs(r) > 0.70 and fi != target_name and fj != target_name:
            print(f'  {fi} <-> {fj}: r = {r:.3f}')
            found = True
if not found:
    print('  None found — no redundant features to drop!')
```

Why can't ML use text directly?
Imagine you give the model this:

```
PrimaryPropertyType = "Office"
```

The model does math: multiplications, additions, comparisons. You can multiply numbers, but you can't multiply "Office" by 0.5. So we need to convert text to numbers.

OneHot encoding — the most common method

Instead of one column with 10 text values, we create 10 binary columns (but drop 1 to avoid redundancy):

```
BEFORE:              AFTER (with drop='first'):
Type                 Type_Hotel  Type_Office  Type_Retail ...
────────             ──────────  ───────────  ───────────
Hospital                 0            0            0
Office                   0            1            0
Retail                   0            0            1
Hotel                    1            0            0
Hospital (again)         0            0            0   ← if all zeros, model knows it's Hospital
```

We drop the first category because if all the others are 0, the model already knows the value.

Part 9 — Encode categorical features

Why? ML algorithms only understand numbers. Text columns like 'Office' or 'Zone_A' must be converted.

Which encoding method?

| Method | When to use | Example |
|---|---|---|
| OneHotEncoder | No natural order | Department, Neighborhood |
| Ordinal mapping | Natural order | Junior < Mid < Senior |

Building type and neighborhood have no natural order, so we use OneHotEncoder.

```python
# Which columns are text?
cat_cols = df.select_dtypes(include=['object']).columns.tolist()
print(f'Categorical columns: {cat_cols}')
for col in cat_cols:
    print(f'  {col}: {df[col].nunique()} unique values')

# Create encoder
ohe = OneHotEncoder(
    sparse_output=False,                  # Dense array (easier to work with)
    drop='first',                         # Drop 1st category to avoid multicollinearity
    handle_unknown='infrequent_if_exist'
)
print('OneHotEncoder created with drop="first"')
print('(drop="first" means if we have [A, B, C], we create cols for B and C only)')
print('(because if B=0 and C=0, the model knows it must be A)')
```

Concrete example: before and after OneHotEncoding

```
BEFORE (3 buildings):
┌──────────────────┐
│ PrimaryProperty  │
├──────────────────┤
│ Office           │
│ Hotel            │
│ Office           │
└──────────────────┘

AFTER (with drop='first', keeping alphabetical order, first dropped = Hospital):
┌────────────┬───────────┬────────────┬───────────┐
│ Type_Hotel │ Type_Off  │ Type_Ret   │ Type_Ware │ ...
├────────────┼───────────┼────────────┼───────────┤
│     0      │     1     │     0      │     0     │ ← Office
│     1      │     0     │     0      │     0     │ ← Hotel
│     0      │     1     │     0      │     0     │ ← Office
└────────────┴───────────┴────────────┴───────────┘
```

Each building becomes a row of 0s and 1s. The model can now do math with these.

```python
# Fit and transform
encoded = ohe.fit_transform(df[cat_cols])
enc_df = pd.DataFrame(
    encoded,
    columns=ohe.get_feature_names_out(cat_cols),
    index=df.index
)
print(f'Created {enc_df.shape[1]} binary columns:')
for col in enc_df.columns[:8]:
    print(f'  {col}')
if enc_df.shape[1] > 8:
    print(f'  ... and {enc_df.shape[1] - 8} more')
```
and {enc_df.shape[1] - 8} more') # Combine: drop originals, add encoded df_encoded = pd.concat([df.drop(columns=cat_cols), enc_df], axis=1) print(f'Final dataset: {df_encoded.shape[0]} rows x {df_encoded.shape[1]} columns') # Combine: drop originals, add encoded df_encoded = pd.concat([df.drop(columns=cat_cols), enc_df], axis=1) print(f'Final dataset: {df_encoded.shape[0]} rows x {df_encoded.shape[1]} columns') The train/test split — explained simply Imagine you're studying for an exam: You study from a textbook (= training data) You take the exam with new questions (= test data) You study from a textbook (= training data) textbook You take the exam with new questions (= test data) exam If you memorize the textbook word-for-word (= overfitting), you might fail the exam because you haven't learned the patterns, just the specific examples. patterns The train/test split simulates this: Train set (80%): the model studies these buildings Test set (20%): new buildings the model has NEVER seen Train set (80%): the model studies these buildings Train set (80%) Test set (20%): new buildings the model has NEVER seen Test set (20%) If it predicts well on the test set, it has learned real patterns — not just memorized. Part 10 — Separate X and y, then split into train/test Why split? Full data (100%) ├── Train (80%): model learns from this └── Test (20%): we pretend these are "new" buildings and check if predictions are good If the model performs well on TEST data it has never seen, we can trust it on truly new buildings. Full data (100%) ├── Train (80%): model learns from this └── Test (20%): we pretend these are "new" buildings and check if predictions are good If the model performs well on TEST data it has never seen, we can trust it on truly new buildings. 
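Before splitting the real data, here's a tiny illustration of the exam analogy on synthetic numbers (toy data, not the city's buildings): an unconstrained decision tree "memorizes the textbook" and scores perfectly on its training set, but slips on the held-out test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(300, 1))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(0, 2.0, size=300)  # linear trend + noise

Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# An unconstrained tree can memorize every training point exactly...
tree = DecisionTreeRegressor(random_state=42).fit(Xtr, ytr)
print(f'Train R²: {tree.score(Xtr, ytr):.2f}')  # 1.00 — "memorized the textbook"
print(f'Test  R²: {tree.score(Xte, yte):.2f}')  # noticeably lower on unseen data
```

Only the score on the unseen 20% tells us whether the model learned the pattern or just the examples.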
```python
target = 'SiteEnergyUse_kBtu'

# X = everything except the target
X = df_encoded.drop(columns=[target])
# y = just the target
y = df_encoded[target]

print(f'X shape: {X.shape} (features)')
print(f'y shape: {y.shape} (target)')
print(f'Target mean: {y.mean():,.0f} kBtu')
```

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,              # 20% for test
    random_state=RANDOM_STATE   # Reproducible split
)
print(f'Train: {len(X_train)} buildings ({len(X_train)/len(X)*100:.0f}%)')
print(f'Test:  {len(X_test)} buildings ({len(X_test)/len(X)*100:.0f}%)')
```

Part 11 — A reusable evaluation function

We'll train 3 models. Instead of copying evaluation code 3 times, we write it once in a function. Build → Test → Use (the same pattern as .apply()).

```python
def evaluate_model(model, X_tr, y_tr, X_te, y_te, name):
    """Train a regression model and print all metrics."""
    # 1. Train
    model.fit(X_tr, y_tr)

    # 2. Predict on both sets
    pred_tr = model.predict(X_tr)
    pred_te = model.predict(X_te)

    # 3. Compute metrics
    r2_tr = r2_score(y_tr, pred_tr)
    r2_te = r2_score(y_te, pred_te)
    mae_te = mean_absolute_error(y_te, pred_te)
    rmse_te = np.sqrt(mean_squared_error(y_te, pred_te))

    # 4. Print results
    print(f'\n{name}:')
    print(f'  R² train: {r2_tr:.4f}   R² test: {r2_te:.4f}')
    print(f'  MAE:  {mae_te:>12,.0f} kBtu')
    print(f'  RMSE: {rmse_te:>12,.0f} kBtu')
    print(f'  → On average, the model is off by ~{mae_te:,.0f} kBtu')

    # Check for overfitting
    if r2_tr - r2_te > 0.10:
        print(f'  ⚠ Possible overfitting: train R² much higher than test R²')

    return {'name': name, 'R2_train': r2_tr, 'R2_test': r2_te,
            'MAE': mae_te, 'RMSE': rmse_te, 'model': model}
```

```python
# Confirm the function is defined
print('evaluate_model() is ready to use!')
print('It takes: model, X_train, y_train, X_test, y_test, name')
```

What does a regression model actually DO?

In the simplest case (1 feature), it finds the best line through the data:

```
Energy │           ●
       │        ● /
       │      ● /   ← best line
       │    ● /
       │   ● /
       │  ●/
       │/●
       └──────────── FloorArea
```

The model learns: Energy = slope × FloorArea + intercept

With many features, it becomes: Energy = w₁×FloorArea + w₂×Age + w₃×Floors + ... + bias

Each weight (w) tells the model how important each feature is. Ridge adds a penalty to keep weights small, which prevents overfitting.

What is overfitting?

```
UNDERFITTING:      GOOD FIT:           OVERFITTING:
Too simple         Just right          Too complex
──────             ╱╲  ╱╲              ╱╲╱╲╱╲╱╲
(misses pattern)   ╱ ╲_╱ ╲ ╱ ╲         (memorizes noise)
                   (captures trend)    (fails on new data)
```

Part 12 — Model 1: Ridge Regression (linear)

What does "linear" mean? The model tries to find the best formula like:

Energy = a₁ × FloorArea + a₂ × Age + a₃ × Floors + ... + bias

It's a weighted sum. Ridge adds a small penalty to keep the weights small (prevents overfitting).

Why does it need scaling?

- Feature A: FloorArea ranges from 1,000 to 2,000,000
- Feature B: NumberOfFloors ranges from 1 to 50

Without scaling, the model thinks FloorArea is 40,000x more important just because its numbers are bigger! Scaling fixes this: StandardScaler gives every feature mean=0, std=1.

We use a Pipeline to chain scaler + model. This ensures the scaler is fitted on TRAIN data only (no leakage).

What is a Pipeline?

A Pipeline chains multiple steps into one object. Instead of:

```python
# Without Pipeline (manual, error-prone):
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform on TRAIN
X_test_scaled = scaler.transform(X_test)        # transform only on TEST (no fit!)
model = Ridge()
model.fit(X_train_scaled, y_train)              # train
predictions = model.predict(X_test_scaled)      # predict
```

With a Pipeline, everything happens in one line:

```python
# With Pipeline (clean, safe):
pipe = Pipeline([('scaler', StandardScaler()), ('model', Ridge())])
pipe.fit(X_train, y_train)   # scaler fits on train, then model trains
pipe.predict(X_test)         # scaler transforms test, then model predicts
```

Why is this better?

- No leakage risk: the scaler is fitted on TRAIN only, never on TEST
- Cleaner code: one object to manage instead of two
- Works with cross-validation: GridSearchCV handles the scaler automatically

```python
ridge_pipe = Pipeline([
    ('scaler', StandardScaler()),                           # Step 1: normalize
    ('model', Ridge(alpha=1.0, random_state=RANDOM_STATE))  # Step 2: train
])
print('Pipeline created: StandardScaler → Ridge')
```

```python
res_ridge = evaluate_model(
    ridge_pipe, X_train, y_train, X_test, y_test,
    'Ridge Regression (linear)')
```

How to interpret these results

Let's say the output was:

```
Ridge Regression (linear):
  R² train: 0.7200   R² test: 0.7100
  MAE:  102,000 kBtu
  RMSE: 145,000 kBtu
```

What this means in plain English:

- R² = 0.71 → The model explains 71% of the variation in energy consumption. Not bad, but 29% is still unexplained (noise, missing features, non-linear patterns).
- R² train ≈ R² test → No overfitting! The model generalizes well. If train was 0.99 and test was 0.50, that would mean overfitting (memorized training data).
- MAE = 102,000 kBtu → On average, when the model predicts energy for a building, it's wrong by about 102,000 kBtu. Is that good? Depends on context — if buildings average 400,000 kBtu, being off by 102,000 is a ~25% error.
- RMSE ≥ MAE → This always holds (they're only equal when every error is the same size). RMSE penalizes large errors more, so a big gap between RMSE and MAE means some predictions are way off (the model struggles with certain buildings).

What is a decision tree?

Before understanding a forest, let's understand one tree:

```
            Is FloorArea > 50,000?
               /            \
             YES             NO
             /                 \
      Is Age > 40?        Is Floors > 5?
       /       \            /       \
     YES       NO         YES       NO
     /           \        /           \
400,000 kBtu  250,000  150,000   80,000 kBtu
```

The tree asks yes/no questions, splitting data into smaller groups until it reaches a prediction. It's like a flowchart.

A Random Forest builds 200 of these on random subsets and averages their predictions. This is called bagging (Bootstrap AGGregating). The averaging makes the model much more stable.

Part 13 — Model 2: Random Forest (bagging, tree-based)

How does it work?

1. Take a random sample of the data
2. Build a decision tree on it
3. Repeat 200 times (n_estimators=200)
4. Average all 200 predictions

Key advantages:

- Captures non-linear patterns (curves, interactions)
- Does NOT need scaling (trees split on values, not distances)
- Harder to overfit than a single tree (averaging reduces variance)

```python
rf = RandomForestRegressor(
    n_estimators=200,            # Build 200 trees
    random_state=RANDOM_STATE,
    n_jobs=-1                    # Use all CPU cores (faster)
)
print('RandomForest created: 200 trees, all CPUs')
```

```python
res_rf = evaluate_model(
    rf, X_train, y_train, X_test, y_test,
    'Random Forest (bagging)')
```

Part 14 — Model 3: Gradient Boosting (boosting, tree-based)

How is it different from Random Forest?

```
Random Forest:  200 trees work INDEPENDENTLY → average
Gradient Boost: 200 trees work SEQUENTIALLY  → each fixes the last one's errors
```

learning_rate controls how much each tree contributes. Smaller = more cautious but needs more trees.
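To see "each tree fixes the last one's errors" concretely, here's a hand-rolled boosting loop on synthetic data. It's only a sketch of the idea — GradientBoostingRegressor does this internally, with many refinements — but the core loop really is this small:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_b = rng.uniform(0, 6, size=(400, 1))
y_b = 3 * np.sin(X_b[:, 0]) + rng.normal(0, 0.3, size=400)  # wavy signal + noise

learning_rate = 0.1
pred = np.full_like(y_b, y_b.mean())          # Round 0: just predict the mean
mse_before = np.mean((y_b - pred) ** 2)

for _ in range(50):
    residual = y_b - pred                     # What's still unexplained
    stump = DecisionTreeRegressor(max_depth=2).fit(X_b, residual)
    pred += learning_rate * stump.predict(X_b)  # Small step toward fixing it

mse_after = np.mean((y_b - pred) ** 2)
print(f'MSE with mean only:  {mse_before:.3f}')
print(f'MSE after 50 rounds: {mse_after:.3f}')
```

Each shallow tree is trained on the residuals of everything before it, and learning_rate shrinks its contribution — which is exactly why a smaller learning_rate needs more trees.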
```python
gb = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=0.1,          # How much each tree contributes
    max_depth=5,                # How deep each tree can grow
    random_state=RANDOM_STATE
)
print('GradientBoosting: 200 sequential trees, lr=0.1, depth=5')
```

```python
res_gb = evaluate_model(
    gb, X_train, y_train, X_test, y_test,
    'Gradient Boosting')
```

Quick comparison so far

Reading the comparison table:

| name                    | R2_train | R2_test | MAE     | RMSE    |
|-------------------------|----------|---------|---------|---------|
| Ridge (linear)          | 0.72     | 0.71    | 102,000 | 145,000 |
| Random Forest (bagging) | 0.95     | 0.82    | 78,000  | 110,000 |
| Gradient Boosting       | 0.91     | 0.85    | 68,000  | 98,000  |

What to notice:

- Gradient Boosting wins on the test set (highest R², lowest MAE/RMSE)
- Random Forest overfits slightly (train R²=0.95 is much higher than test R²=0.82)
- Ridge is the weakest but simplest — it can't capture non-linear patterns
- Non-linear models (trees) beat linear — energy consumption has complex relationships with features

This is why we try multiple model categories: you can't know which will work best beforehand.
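A quick arithmetic check of the RMSE-vs-MAE columns in the table, using made-up error vectors: RMSE equals MAE when every error is the same size, and shoots past it as soon as one prediction is way off.

```python
import numpy as np

errors = np.array([10.0, 10.0, 10.0, 10.0])   # four equal errors
mae = np.mean(np.abs(errors))                 # 10.0
rmse = np.sqrt(np.mean(errors ** 2))          # 10.0 — equal when all errors match

errors2 = np.array([1.0, 1.0, 1.0, 37.0])     # same MAE, but one big miss
mae2 = np.mean(np.abs(errors2))               # 10.0
rmse2 = np.sqrt(np.mean(errors2 ** 2))        # ≈ 18.5 — the outlier dominates

print(f'Equal errors: MAE={mae:.1f}, RMSE={rmse:.1f}')
print(f'One outlier:  MAE={mae2:.1f}, RMSE={rmse2:.1f}')
```

So a model whose RMSE sits far above its MAE is making a few very large mistakes, even if its average error looks fine.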
```python
# Show all 3 results side by side
quick = pd.DataFrame([res_ridge, res_rf, res_gb]).drop(columns=['model'])
print(quick.to_string(index=False))
```

Part 15 — Cross-Validation: getting reliable estimates

Why is one train/test split not enough? One split can be lucky or unlucky. Imagine asking ONE person if a movie is good vs asking FIVE people.

5-fold cross-validation splits the data 5 different ways and averages:

```
Fold 1: [TEST]  [train] [train] [train] [train] → R² = 0.82
Fold 2: [train] [TEST]  [train] [train] [train] → R² = 0.79
Fold 3: [train] [train] [TEST]  [train] [train] → R² = 0.84
Fold 4: [train] [train] [train] [TEST]  [train] → R² = 0.81
Fold 5: [train] [train] [train] [train] [TEST]  → R² = 0.80

Final: R² = 0.81 +/- 0.02  (MUCH more reliable!)
```
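If the fold diagram feels abstract, this little sketch prints what KFold actually produces, on 10 toy samples (indices only, not our dataset):

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)    # 10 fake 'buildings'
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

all_test_idx = []
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(X_demo), start=1):
    print(f'Fold {fold}: train={train_idx.tolist()}  test={test_idx.tolist()}')
    all_test_idx.extend(test_idx.tolist())

# Across the 5 folds, every sample serves as test data exactly once
print(sorted(all_test_idx))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Each fold holds out a different 20%, so every building gets one turn as "new" data — that's where the mean +/- std comes from.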
```python
kf = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

scoring = {
    'R2': 'r2',
    'MAE': 'neg_mean_absolute_error',
}
print('KFold: 5 splits, shuffled')
print('Scoring: R² and MAE')
```

```python
# Cross-validate all 3 models
for model, name, needs_scaling in [
    (Ridge(alpha=1.0, random_state=RANDOM_STATE),
     'Ridge (linear)', True),
    (RandomForestRegressor(n_estimators=200, random_state=RANDOM_STATE, n_jobs=-1),
     'Random Forest (bagging)', False),
    (GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=5,
                               random_state=RANDOM_STATE),
     'Gradient Boosting', False),
]:
    if needs_scaling:
        pipe = Pipeline([('scaler', StandardScaler()), ('model', model)])
    else:
        pipe = Pipeline([('model', model)])

    cv = cross_validate(pipe, X, y, cv=kf, scoring=scoring,
                        return_train_score=True, n_jobs=-1)
    r2_vals = cv['test_R2']
    mae_vals = -cv['test_MAE']   # sklearn returns NEGATIVE MAE

    print(f'\n{name}:')
    print(f'  R²  = {r2_vals.mean():.4f} +/- {r2_vals.std():.4f}')
    print(f'  MAE = {mae_vals.mean():>10,.0f} +/- {mae_vals.std():>8,.0f} kBtu')
```

Parameters vs Hyperparameters

This distinction confuses many beginners:

|                   | Parameters                       | Hyperparameters                     |
|-------------------|----------------------------------|-------------------------------------|
| What              | The weights the model LEARNS     | Settings YOU choose BEFORE training |
| Who sets them     | The model (during .fit())        | You (before calling .fit())         |
| Example           | The weight of FloorArea in Ridge | n_estimators=200 (how many trees)   |
| Can be optimized? | Automatically by training        | By GridSearchCV                     |

Analogy: If the model is a recipe...

- Parameters = the amounts of each ingredient (the model figures these out)
- Hyperparameters = oven temperature and cooking time (you choose these before baking)

GridSearchCV tries many "oven temperatures" and picks the one that produces the best cake.

Part 16 — Hyperparameter Optimization (GridSearchCV)

What are hyperparameters? Settings you choose before training. Different values → different performance. Finding the best combination manually is tedious, so GridSearch automates it.

What we're tuning:

| Parameter         | What it does                | Values to try  |
|-------------------|-----------------------------|----------------|
| n_estimators      | Number of trees             | 100, 200, 300  |
| max_depth         | How deep each tree grows    | 3, 5, 7, 10    |
| learning_rate     | Step size for each tree     | 0.05, 0.1, 0.2 |
| min_samples_split | Min samples to split a node | 2, 5, 10       |

```python
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.05, 0.1, 0.2],
    'min_samples_split': [2, 5, 10],
}

total_combos = 3 * 4 * 3 * 3   # = 108
print(f'Total combinations: {total_combos}')
print(f'With 5-fold CV: {total_combos * 5} model fits')
print(f'This may take a few minutes...')
```

What GridSearchCV does, step by step

You tell it: "Try these combinations"

```
n_estimators:  [100, 200, 300]   → 3 options
max_depth:     [3, 5, 7, 10]     → 4 options
learning_rate: [0.05, 0.1, 0.2]  → 3 options
min_samples:   [2, 5, 10]        → 3 options
                                   ─────────
                Total: 3×4×3×3 = 108 combinations
```

For EACH combination, it does 5-fold cross-validation:

```
Combo 1: n_est=100, depth=3, lr=0.05, samples=2
  → Train on folds 1-4, test on fold 5   → R² = 0.78
  → Train on folds 1-3+5, test on fold 4 → R² = 0.80
  → ... (3 more folds)
  → Average: R² = 0.79

Combo 2: n_est=100, depth=3, lr=0.05, samples=5
  → (same 5-fold process)
  → Average: R² = 0.77

... (106 more combos)

Combo 108: n_est=300, depth=10, lr=0.2, samples=10
  → Average: R² = 0.84 ← BEST!

GridSearchCV picks Combo 108 and retrains on ALL training data.
```

Total model fits: 108 combos × 5 folds = 540 models trained (plus one final refit on the full training set)! This is why it takes time.

```python
gs = GridSearchCV(
    GradientBoostingRegressor(random_state=RANDOM_STATE),
    param_grid,
    cv=KFold(5, shuffle=True, random_state=RANDOM_STATE),
    scoring='r2',
    n_jobs=-1,
    verbose=1,
    return_train_score=True
)
gs.fit(X_train, y_train)
print(f'\nDone! Best R² (CV): {gs.best_score_:.4f}')
```

```python
# What parameters won?
print('Best parameters:')
for k, v in gs.best_params_.items():
    print(f'  {k}: {v}')
```

```python
# Evaluate the optimized model on the test set
best_model = gs.best_estimator_
y_pred_best = best_model.predict(X_test)

r2_opt = r2_score(y_test, y_pred_best)
mae_opt = mean_absolute_error(y_test, y_pred_best)
rmse_opt = np.sqrt(mean_squared_error(y_test, y_pred_best))

print(f'Optimized Gradient Boosting on TEST set:')
print(f'  R²:   {r2_opt:.4f}')
print(f'  MAE:  {mae_opt:>10,.0f} kBtu')
print(f'  RMSE: {rmse_opt:>10,.0f} kBtu')
print(f'  → The model is off by ~{mae_opt:,.0f} kBtu on average')
```

Feature importance — the intuition

Imagine you're predicting house prices. Someone asks "what matters most: location, size, or age?" Permutation importance answers this with an experiment:

1. Take a trained model that predicts well (R² = 0.80)
2. Randomly shuffle the FloorArea column (destroy its information)
3. Re-measure: R² drops to 0.45
4. Conclusion: FloorArea was very important (R² dropped by 0.35)

Repeat for every feature. The features that cause the biggest drop when shuffled are the most important.

```
Feature          R² without it   Drop    Importance
─────────        ─────────────   ────    ──────────
FloorArea            0.45        -0.35   ★★★★★
BuildingAge          0.70        -0.10   ★★
NumberOfFloors       0.75        -0.05   ★
Neighborhood         0.78        -0.02   ·
```

Part 17 — Feature Importance: what drives energy consumption?

Why this matters: the city doesn't just want predictions — they want to know why. Which building characteristics drive high energy consumption? This informs policy.
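Here's that shuffle experiment done by hand on synthetic data (toy features, not the building dataset) — sklearn's permutation_importance automates exactly this loop:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Feature 0 drives the target; feature 1 is pure noise
X_syn = rng.normal(size=(500, 2))
y_syn = 5 * X_syn[:, 0] + rng.normal(0, 0.5, size=500)
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(X_syn, y_syn, random_state=1)

rf_syn = RandomForestRegressor(n_estimators=50, random_state=1).fit(Xs_tr, ys_tr)
base = r2_score(ys_te, rf_syn.predict(Xs_te))

drops = {}
for col, name in [(0, 'signal feature'), (1, 'noise feature')]:
    X_shuf = Xs_te.copy()
    rng.shuffle(X_shuf[:, col])               # Destroy this feature's information
    drops[name] = base - r2_score(ys_te, rf_syn.predict(X_shuf))
    print(f'{name}: R² drops by {drops[name]:.3f}')
# Shuffling the signal feature wrecks the score; shuffling noise barely moves it.
```

The size of the drop is the feature's importance — no retraining needed, just shuffle and re-score.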
Two methods:

| Method | How it works | Speed |
|---|---|---|
| Built-in (tree) | Measures how much each feature reduces error in tree splits | Fast |
| Permutation | Shuffles each feature and measures how much the score drops | Slower but more reliable |

```python
# Method 1: Built-in feature importance
fi_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': best_model.feature_importances_
}).sort_values('Importance', ascending=False)

print('Top 10 features (built-in importance):')
print(fi_df.head(10).to_string(index=False))
```

```python
# Plot top 15
top15 = fi_df.head(15)
fig, ax = plt.subplots(figsize=(10, 7))
ax.barh(top15['Feature'][::-1], top15['Importance'][::-1], color='steelblue')
ax.set_xlabel('Feature Importance (built-in)')
ax.set_title('Top 15 Features — Gradient Boosting Built-in')
plt.tight_layout()
plt.show()
```

```python
# Method 2: Permutation importance
print('Computing permutation importance (30 repeats)...')
perm = permutation_importance(
    best_model, X_test, y_test,
    n_repeats=30, random_state=RANDOM_STATE,
    n_jobs=-1, scoring='r2'
)
print('Done!')
```

How to read a feature importance barplot:

```
GrossFloorArea_sf ████████████████████ 0.35  ← BY FAR the most important
BuildingAge       ████████ 0.12              ← Second
AreaPerFloor      ██████ 0.09                ← Our engineered feature works!
NumberOfFloors    █████ 0.07
HasMultipleUses   ███ 0.04                   ← Another engineered feature
Type_Hospital     ██ 0.03                    ← Hospitals use lots of energy
IsOldBuilding     ██ 0.02
...
```
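Permutation importance is simple enough to reimplement by hand, which makes the idea concrete: shuffle one column, re-score, and the drop in R² is that feature's importance. A minimal sketch on synthetic data (the column names, coefficients, and model here are invented for illustration, not the tutorial's actual dataset, which uses `permutation_importance` from scikit-learn):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
n = 500
# Synthetic stand-in for the building data: 'area' drives the target,
# 'age' contributes weakly, 'noise' is pure distraction
X = pd.DataFrame({
    'area': rng.uniform(1_000, 50_000, n),
    'age': rng.uniform(0, 100, n),
    'noise': rng.normal(size=n),
})
y = 5 * X['area'] + 200 * X['age'] + rng.normal(0, 1_000, n)

model = GradientBoostingRegressor(random_state=42).fit(X, y)
baseline = r2_score(y, model.predict(X))

# Permutation importance by hand: shuffle one column at a time
# and record the average drop in R²
drops = {}
for col in X.columns:
    scores = []
    for _ in range(10):
        X_perm = X.copy()
        X_perm[col] = rng.permutation(X_perm[col].values)
        scores.append(baseline - r2_score(y, model.predict(X_perm)))
    drops[col] = float(np.mean(scores))
    print(f'{col:>6}: mean R² drop = {drops[col]:.4f}')
```

Shuffling `area` should wreck the score; shuffling `noise` should barely move it. That asymmetry is exactly what the permutation barplot reports.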
How to interpret this for the client:

- **Floor area dominates** — bigger buildings use more energy (obvious but confirmed)
- **Building age matters** — older buildings have worse insulation
- **Our engineered features appear in the top** — AreaPerFloor and HasMultipleUses added value
- **Building type matters** — hospitals and hotels use more than offices
- **Some features barely matter** — could be dropped to simplify the model

```python
# Permutation results
pi_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': perm.importances_mean
}).sort_values('Importance', ascending=False)

print('Top 10 features (permutation importance):')
print(pi_df.head(10).to_string(index=False))
```

```python
# Plot top 15
top15_perm = pi_df.head(15)
fig, ax = plt.subplots(figsize=(10, 7))
ax.barh(top15_perm['Feature'][::-1], top15_perm['Importance'][::-1], color='coral')
ax.set_xlabel('Permutation Importance (R² decrease)')
ax.set_title('Top 15 Features — Permutation Importance')
plt.tight_layout()
plt.show()
```

Compare both methods side by side:

```python
comp = pd.DataFrame({
    'Feature': X.columns,
    'Tree_Builtin': best_model.feature_importances_,
    'Permutation': perm.importances_mean,
}).sort_values('Permutation', ascending=False).head(12)

print('Top 12 features — two methods compared:')
print(comp.to_string(index=False))
```

Part 18 — Final Summary

```python
# Collect all results
all_results = pd.DataFrame([
    res_ridge, res_rf, res_gb,
    {'name': 'Optimized GB',
     'R2_train': r2_score(y_train, best_model.predict(X_train)),
     'R2_test': r2_opt, 'MAE': mae_opt, 'RMSE': rmse_opt}
]).drop(columns=['model'], errors='ignore')

# Sort by test R²
all_results = all_results.sort_values('R2_test', ascending=False)

print('=' * 72)
print('FINAL MODEL COMPARISON')
print('=' * 72)
print(all_results.to_string(index=False))
```

What we'd tell the client

Here's how a data scientist would present these results to the city:

"We built a model that predicts building energy consumption using only structural data you already have — floor area, age, type, number of floors. The model explains about X% of the variation in energy use (R² = X.XX), and its predictions are off by about Y kBtu on average.
The biggest driver is building size: floor area alone explains most of the variation. Building age is the second factor — older buildings tend to use more energy, likely due to poorer insulation.

Practical application: instead of measuring every building, you can use this model to flag buildings where predicted consumption seems abnormally high, and send inspectors only to those. This could cut inspection costs by 80% while still catching the worst offenders.

Limitations: the model can't capture everything — unusual building configurations, recent renovations, or operational differences are not in the data. We recommend adding renovation history and HVAC system type if available in the future."

This is the storytelling part. Raw numbers don't convince decision-makers — narratives do.
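The inspection-targeting idea from the pitch is easy to sketch: compare each building's measured consumption to the model's prediction and flag the big positive gaps. A minimal illustration with made-up numbers (the building IDs, the values, and the 1.5× threshold are all hypothetical choices, not anything the city specified):

```python
import pandas as pd

# Hypothetical measured vs predicted annual consumption (kBtu)
df = pd.DataFrame({
    'building_id': ['B1', 'B2', 'B3', 'B4', 'B5'],
    'measured':  [310_000, 420_000, 980_000, 150_000, 510_000],
    'predicted': [300_000, 400_000, 450_000, 160_000, 500_000],
})

# Flag buildings consuming more than 1.5x what the model predicts:
# these are the ones worth sending an inspector to
df['inspect'] = df['measured'] > 1.5 * df['predicted']

print(df[df['inspect']])
```

Only B3 gets flagged here: one inspection instead of five, which is the cost saving the pitch is about. In practice the threshold would be tuned against inspection capacity.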
```python
# Business translation
best_r2 = all_results.iloc[0]['R2_test']
best_mae = all_results.iloc[0]['MAE']
best_name = all_results.iloc[0]['name']

print(f'\nBest model: {best_name}')
print(f'  Explains {best_r2*100:.1f}% of variance in energy consumption')
print(f'  Average prediction error: ~{best_mae:,.0f} kBtu')
print(f'')
print(f'Business insight:')
print(f'  Instead of measuring every building, the city can use this model')
print(f'  to estimate energy consumption from structural data alone,')
print(f'  with an average error of ~{best_mae:,.0f} kBtu.')
```

Key Takeaways

1. **Filter first**: remove non-compliant buildings (ComplianceStatus) and flagged outliers (Outlier).
2. **Data leakage is the #1 mistake**: never use energy source columns to predict total energy.
3. **Feature engineering adds value**: building age, area per floor, and multi-use flags capture real patterns.
4. **Correlation matrix**: drop one feature from each pair with |r| > 0.70.
5. **3 model categories**: linear (Ridge), bagging (Random Forest), boosting (Gradient Boosting).
6. **Scale linear models**: use StandardScaler in a Pipeline. Tree models don't need scaling.
7. **Cross-validate everything**: mean +/- std is far more reliable than a single split.
8. **GridSearchCV**: systematic tuning over 3+ hyperparameters.
9. **Feature importance**: two methods (built-in + permutation) to tell the city what matters.
10. **Storytelling**: "the model is off by ~X kBtu" beats "MAE = X".
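The scaling takeaway in code: putting StandardScaler and Ridge inside a Pipeline means the scaler is re-fit on each training fold during cross-validation, so no test-fold statistics leak into training. A minimal sketch on synthetic building-like features (the columns, coefficients, and noise level are invented for illustration):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300
# Three features on very different scales, like floor area vs floor count
X = np.column_stack([
    rng.uniform(1_000, 50_000, n),   # gross floor area (sq ft)
    rng.integers(1, 40, n),          # number of floors
    rng.uniform(0, 100, n),          # building age (years)
])
y = 5 * X[:, 0] + 2_000 * X[:, 1] + 300 * X[:, 2] + rng.normal(0, 5_000, n)

# The scaler lives inside the pipeline, so each CV fold fits
# its own scaler on the training portion only -- no leakage
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print(f'R² = {scores.mean():.3f} +/- {scores.std():.3f}')
```

Scaling a separate StandardScaler on the full dataset before splitting would quietly leak test statistics into training; the pipeline makes that mistake impossible.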
Happy predicting!