These are the (Unofficial) Lecture Notes of the Fast.ai Machine Learning for Coders MOOC.
You can find the Official Thread here
This is Part 1/12 Lecture Notes.
Crestle
Pricing: 3 cents an hour (approx.)
The Jupyter notebook opens in the browser.
Assuming you have your GPU and Anaconda Setup (Preferably CUDA ≥9):
$ git clone https://github.com/fastai/fastai
$ cd fastai
$ conda env update
$ wget -O - files.fast.ai/setup/paperspace | bash
Watch first and follow along later (a loose recommendation). If you try to code along while watching, you might miss some important information; afterwards you can experiment at your own pace.
“Hey I Just Learned this Concept, and I’ll share about it”
%load_ext autoreload
%autoreload 2
Normally, if you modify the source code of an imported module, you have to restart the kernel for the changes to take effect.
These two lines automatically reload the imported modules in the notebook in case you change their source.
%matplotlib inline
To plot Figures inline
from fastai.imports import *
Data Science is not Software Engineering. Prototyping models needs things to be done interactively.
import * makes everything available in the namespace, so we don't need to decide up front exactly which names we need.
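fn_name: shows what the function is and which module it comes from.
?fn_name: shows the documentation.
??fn_name: shows the source code.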
Kaggle: Problems posted by a company or institute. These are really close to real-world problems and allow you to check yourself against other competitors.
TL;DR: Perfect place to check your skillset.
Jeremy: “I’ve learnt more from Kaggle competitions than anything else I’ve done in my Entire Life”
OR
OR
Note: Prefer techniques that will also work for downloading data to your cloud compute instance. Crestle and Paperspace will have most of the datasets pre-downloaded.
Good practice: Create a data folder for all of your data.
To run Bash commands in Jupyter:
!BASH_COMMAND
To use Python values inside a Bash command, wrap them in braces:
!BASH_COMMAND {python_expression}
Goal:
The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data is sourced from auction result postings and includes information on usage and equipment configurations.
Fast Iron is creating a “blue book for bull dozers,” for customers to value what their heavy equipment fleet is worth at auction.
!head data/bulldozers/Train.csv
Gives the First Few lines.
Structured data (unofficial definition): columns of data having varying types.
Pandas: The best library for working with tabular data.
df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False, parse_dates=["saledate"])
low_memory=False: makes pandas read the whole file into memory before deciding each column's type, instead of processing it in chunks.
var = 'abc'
f'ABC {var}'
This allows Python to interpret Code inside the {}
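As a small illustration of the f-string behaviour described above, here is a minimal sketch. The values of PATH and n_rows are made up for the example; only the `{}` mechanism itself comes from the notes.

```python
# Hypothetical values, chosen only for illustration.
PATH = "data/bulldozers/"
n_rows = 3

# Anything inside {} is evaluated as a Python expression
# and the result is inserted into the string.
csv_path = f"{PATH}Train.csv"
message = f"Reading {n_rows * 2} rows from {csv_path}"
print(message)
```

This is why the read_csv call can be written with f'{PATH}Train.csv': the path is assembled from the variable at the moment the string is created.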
df_raw
Simply writing this would truncate the output
display_all()
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        display(df)
This allows The Complete df to be printed.
display_all(df_raw.tail().T)
Since there are a lot of columns, we have taken Transpose.
The competition metric is RMSLE, the root mean squared log error between the actual and predicted auction prices.
Because of that, we work with the logarithm of the sale price rather than the raw price.
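To make the metric concrete, here is a minimal sketch of RMSLE computed as a plain RMSE on log prices. The prices are made up, and the helper is only an illustration of the formula. In a notebook, the equivalent step would presumably be replacing the SalePrice column with its log before training; that is an assumption, since the notes do not show the line.

```python
import numpy as np

def rmse(pred, actual):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

# Made-up sale prices, for illustration only.
actual_price = np.array([10000.0, 20000.0, 40000.0])
predicted_price = np.array([11000.0, 18000.0, 44000.0])

# RMSLE is RMSE computed on the logs of the prices, so it
# penalises relative (percentage) errors rather than absolute ones.
rmsle = rmse(np.log(predicted_price), np.log(actual_price))
print(rmsle)
```

Each prediction above is off by about 10%, so all three contribute similarly to the score even though the absolute errors differ by a factor of four.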
TL;DR: It’s a great Start.
The greater the number of columns, the emptier the mathematical space becomes, and the data points end up sitting near the edges (a mathematical property of high dimensions).
This leads to the distance between points becoming meaningless.
In General, False.
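The edge-crowding property described above can be checked numerically. The sketch below is an illustration only, assuming points spread uniformly in a unit hypercube and an arbitrary margin of 0.05; neither assumption comes from the lecture.

```python
import numpy as np

def edge_fraction(n_dims, margin=0.05):
    """Fraction of a unit hypercube's volume within `margin` of its boundary."""
    return 1.0 - (1.0 - 2.0 * margin) ** n_dims

rng = np.random.default_rng(0)
for n_dims in (1, 10, 100):
    # Draw uniform random points and count those close to any face.
    points = rng.random((10_000, n_dims))
    near_edge = ((points < 0.05) | (points > 0.95)).any(axis=1).mean()
    print(n_dims, round(edge_fraction(n_dims), 3), round(float(near_edge), 3))
```

The fraction of points near an edge grows quickly with the number of dimensions, which is the geometric fact behind the claim; whether it hurts a model in practice is a separate question.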
There is no Universal kind of Model that works well for all kinds of Dataset.
In general, we look at data that was created by some cause or structure. There are actually techniques that work well for nearly all of the general datasets that we work with. Ensembles of decision trees are the most widely used such technique.
ValueError: could not convert string to float: 'Conventional'
SKLearn isn’t the Best library, but it’s good for our purposes.
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
Regressor: predicts continuous values.
Classifier: predicts categories (classes).
Note: Regression!=Linear Regression.
The RandomForest Algorithm expects numerical data.
DataSet:
Categorical:
- Numbers
- Strings
- Dates
df_raw.saledate
Information inside a Date:
??add_datepart
To look at the source code.
This grabs the field "fldname".
Note: df.fldname would literally look up a field named fldname.
df[fldname] is the safer option in general. It's a safe bet and doesn't give weird errors in case we make a mistake. Don't get lazy and write df.fldname.
Also, df[fldname] returns a series.
The function loops through a list of attribute names (strings); for each one, it looks inside the datetime object and fetches the attribute with that name. This is done to create every column that might be relevant to our case. (It is the exact opposite of worrying about the curse of dimensionality: we are creating more columns.)
There is no harm in adding more columns to your data.
Link getattr()
Pandas groups type-specific methods under accessor attributes.
All of the datetime-specific ones live under the .dt accessor of a Series, for example df[fldname].dt.year.
Finally we drop the column.
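The steps described above can be sketched as follows. This is not the fastai source, only a simplified illustration under stated assumptions: the function name add_datepart_sketch, the short attribute list, and the column naming scheme are all hypothetical.

```python
import pandas as pd

def add_datepart_sketch(df, fldname, attrs=("year", "month", "day", "dayofweek")):
    """Expand a datetime column into numeric columns, then drop it (in place)."""
    fld = df[fldname]  # bracket lookup uses the value stored in fldname
    for attr in attrs:
        # getattr(fld.dt, "year") is the same as writing fld.dt.year
        df[f"{fldname}_{attr}"] = getattr(fld.dt, attr)
    # The original date column is no longer needed.
    df.drop(columns=[fldname], inplace=True)

# Made-up dates, for illustration only.
sales = pd.DataFrame({"saledate": pd.to_datetime(["2011-11-16", "2004-03-26"])})
add_datepart_sketch(sales, "saledate")
print(sales)
```

Using getattr with a list of names keeps the loop short: adding another date feature is a matter of adding one more string to the list.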
train_cats
Creates categorical variables for string columns. Each such column then stores numbers, along with the mapping between the strings and those numbers.
Make sure you use the same mapping for training dataset and testing dataset.
Since a decision tree works by splitting on the column values, it is better to give the categories a "logical" order.
A random forest consists of trees that make splits. With an ordered category, the splits could be High vs. Low+Medium, followed by Low vs. Medium.
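The ideas above (string-to-number mapping, reusing the mapping across datasets, and giving categories a logical order) can be sketched with plain pandas. This is an illustration rather than the train_cats implementation; the toy frames and the UsageBand values are made up for the example.

```python
import pandas as pd

# Made-up data, for illustration only.
train = pd.DataFrame({"UsageBand": ["Low", "High", "Medium", None]})
valid = pd.DataFrame({"UsageBand": ["Medium", "Low"]})

# Turn the strings into an ordered categorical with a logical order.
train["UsageBand"] = pd.Categorical(
    train["UsageBand"], categories=["High", "Medium", "Low"], ordered=True
)

# Reuse the training categories so both sets share one string-to-number mapping.
valid["UsageBand"] = pd.Categorical(
    valid["UsageBand"],
    categories=train["UsageBand"].cat.categories,
    ordered=True,
)

print(list(train["UsageBand"].cat.codes))  # missing values get the code -1
print(list(valid["UsageBand"].cat.codes))
```

If the validation set built its own categories instead, "Medium" could end up with a different number than in training, and the model's splits would no longer mean the same thing.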
display_all(df_raw.isnull().sum().sort_index()/len(df_raw))
os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/bulldozers-raw')
Feather: Saves the file in a format similar to the one used in RAM. In layman's terms: it's fast.
Pro-Tip: Use Temporary folder for all actions/needs that pop up while you’re working.
proc_df
A function in fastai.structured.
Fix missing values:
- Numeric values: if the column has missing values, create a new Boolean column named Col_na that marks them, and replace the missing values with the median.
- Non-numeric and categorical: replace the values with their category codes and add 1.
df, y, nas = proc_df(df_raw, 'SalePrice')
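The missing-value rules listed above can be sketched with plain pandas. This is an illustration of the idea, not the proc_df source; the column names hours and band and the toy values are hypothetical.

```python
import pandas as pd

# Made-up data, for illustration only.
df = pd.DataFrame({
    "hours": [100.0, None, 300.0, 500.0],  # numeric column with a gap
    "band": pd.Categorical(["Low", None, "High", "Low"]),
})

# Numeric: flag the missing rows, then fill them with the median.
if df["hours"].isnull().any():
    df["hours_na"] = df["hours"].isnull()
    df["hours"] = df["hours"].fillna(df["hours"].median())

# Categorical: use the integer codes plus 1, so missing (-1) becomes 0.
df["band"] = df["band"].cat.codes + 1
print(df)
```

Keeping the Boolean flag means the model can still learn from the fact that a value was missing, even after the gap has been filled.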
m = RandomForestRegressor(n_jobs=-1)
m.fit(df, y)
m.score(df, y)
Random forests are parallelisable.
n_jobs=-1 runs a separate job for every CPU we have.
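The score here is R²: 1 is the best possible score.
0 means the model is no better than always predicting the mean; a model that does worse than that gets a negative score.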
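For a regressor, m.score returns the R² value. A minimal sketch of how that number is computed, on made-up values, helps when reading the scale; the helper name r_squared is hypothetical.

```python
import numpy as np

def r_squared(actual, pred):
    """1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((actual - pred) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up target values, for illustration only.
actual = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(actual, actual))                     # perfect predictions
print(r_squared(actual, np.full(4, actual.mean())))  # always predict the mean
print(r_squared(actual, actual[::-1]))               # worse than predicting the mean
```

The third case shows that the score is not bounded below by 0: predictions that are worse than the mean give a negative value.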
def rmse(x,y): return math.sqrt(((x-y)**2).mean())
def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)
def split_vals(a,n): return a[:n].copy(), a[n:].copy()
n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df) - n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)
X_train.shape, y_train.shape, X_valid.shape
If you’re in the Top Half of the Kaggle LB, it’s a great start.
print_score(m)
[0.09044244804386327, 0.2508166961122146, 0.98290459302099709, 0.88765316048270615]
The second number is the RMSE on the validation set. A score of 0.25 would get a leaderboard position in the top 25%.
Appreciation: Without any thinking or intensive feature engineering, and without defining or worrying about any statistical assumptions, we get a decent score.
If you found this article to be useful and would like to stay in touch, you can find me on Twitter here.