These are the (unofficial) lecture notes of the Fast.ai Machine Learning for Coders MOOC. You can find the official thread here.

This is Part 1/12 of the lecture notes.

Introduction

The course is taught at USF and is available as a MOOC. [The course website is yet to be launched; I'll update the link once it is. Meanwhile, you can find the official thread here.]

Tip: Check for "Cards" on the video.

Alternatives to local setup (with fast.ai support):

- Crestle
  - Pricing: approximately 3 cents an hour
  - The Jupyter notebook opens in the browser
- Paperspace

Local setup instructions

Assuming you have your GPU and Anaconda set up (preferably CUDA ≥ 9):

    $ git clone https://github.com/fastai/fastai
    $ cd fastai
    $ conda env update

Or use the setup script provided here:

    $ wget -O - files.fast.ai/setup/paperspace | bash

Approach to Learning

- Follow along.
- Watch first and follow along later (a loose recommendation). Otherwise you might miss some important information; afterwards you can experiment.

Teaching approach:

- Dive into code.
- Build models.
- Theory comes later, at a point when you are already an effective coder.

Try with more datasets. The more coding you do, the better (recommended by alumni).

Write blog posts: "Hey, I just learned this concept, and I'll share about it."

Good technical blogs:

- Peter Norvig (more here)
- Stephen Merity
- Julia Evans (more here)
- Julia Ferraioli
- Edwin Chen
- Slav Ivanov (fast.ai student)
- Brad Kenstler (fast.ai and USF MSAN student)

Imports

Auto-reload commands:

    %load_ext autoreload
    %autoreload 2

If you modify the source code of an imported module, you would normally have to restart the kernel for the change to take effect. These two lines automatically reload the imported modules in case you change their source.

    %matplotlib inline

Plots figures inline.

    from fastai.imports import *

Data science is not software engineering. Prototyping models needs things to be done interactively. import * makes everything available, so we don't need to work out in advance exactly which names we need.
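To make the auto-reload point concrete, here is a minimal sketch of the manual step it saves you, using only the standard library. The module name scratch_helpers and its greet function are hypothetical, created in a temporary folder just for this demo; the assumption is that %autoreload 2 performs roughly this reload for you before each cell runs.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip bytecode caching so the demo always re-reads the source file.
sys.dont_write_bytecode = True

# Create a throwaway module (hypothetical name) in a temporary directory.
module_dir = Path(tempfile.mkdtemp())
module_file = module_dir / "scratch_helpers.py"
module_file.write_text("def greet():\n    return 'version one'\n")

sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
import scratch_helpers

before = scratch_helpers.greet()

# Edit the source on disk, as you would while prototyping.
module_file.write_text("def greet():\n    return 'the second version'\n")

# Without a reload, the already-imported module still runs the old code.
stale = scratch_helpers.greet()

# importlib.reload re-executes the module from its current source.
importlib.reload(scratch_helpers)
after = scratch_helpers.greet()

print(before, "|", stale, "|", after)
```

In a notebook you would not write any of this yourself; the two magic commands cover it.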
Jupyter Tricks

- fn_name: shows which library the function comes from
- ?fn_name: shows details (documentation) of the function
- ??fn_name: shows the source code of the function

Getting the Data

Kaggle: problems posted by a company or institute. These are really close to real-world problems and allow you to check yourself against other competitors.

TL;DR: The perfect place to check your skill set.

Jeremy: "I've learnt more from Kaggle competitions than anything else I've done in my entire life."

To download a dataset:

- Go to the competition page.
- Accept the terms and conditions.
- Download the dataset.

OR

- Set up the official Kaggle API.
- Use the terminal to download the dataset.

OR

- Use the CurlWget Chrome extension.
- Start the download and cancel it.
- Click on the extension.
- Paste the copied command into a terminal.

Note: Prefer techniques that will also be useful for downloading data to your cloud compute instance. Crestle and Paperspace will have most of the datasets pre-downloaded.

Good practice: Create a data folder for all of your data.

To run Bash commands in Jupyter:

    !BASH_COMMAND

To use Python values inside a Bash command:

    !BASH_COMMAND {python_expression}

Blue Book for Bulldozers

Goal: The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data is sourced from auction result postings and includes information on usage and equipment configurations. Fast Iron is creating a "blue book for bulldozers" for customers to value what their heavy equipment fleet is worth at auction.

Look at the data to get started.

    !head data/bulldozers/Train.csv

Gives the first few lines.

Structured data (unofficial definition): columns of data having varying types of data.

Pandas: the best library for working with tabular data. The fastai imports include pandas by default.

Reading the CSV

    df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False, parse_dates=["saledate"])

low_memory=False lets pandas load more of the file into memory before deciding the type of each column. parse_dates lists the columns that should be parsed as dates.
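To see what parse_dates changes in the read_csv call above, here is a small sketch on an in-memory CSV. The three rows are made up and only mimic a few columns of Train.csv; the real file has many more columns and its own date format.

```python
import io

import pandas as pd

# Made-up rows that only mimic the shape of Train.csv; not real data.
csv_text = (
    "SalesID,SalePrice,saledate,UsageBand\n"
    "1,66000,2006-11-16,Low\n"
    "2,57000,2004-03-26,High\n"
    "3,10000,2004-02-26,\n"
)

# Without parse_dates the date column is read as plain text.
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates it becomes a datetime column.
parsed = pd.read_csv(
    io.StringIO(csv_text), low_memory=False, parse_dates=["saledate"]
)

print(plain["saledate"].dtype)
print(parsed["saledate"].dtype)

# Date parts are only available once the column is a real datetime.
years = parsed["saledate"].dt.year.tolist()
print(years)
```

Only the parsed version exposes the .dt accessor that the date feature engineering relies on later.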
Python 3.6 formatting:

    var = 'abc'
    f'ABC {var}'

This allows Python to interpret code inside the {}.

Display data

    df_raw

Simply writing this would truncate the output.

    def display_all(df):
        with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
            display(df)

display_all() allows the complete df to be printed.

    display_all(df_raw.tail().T)

Since there are a lot of columns, we take the transpose.

Evaluation: Since the metric is RMSLE (root mean squared log error between the actual and predicted auction prices), we work with the logarithm of the prices.

Random Forests

Introduction:

- A universal machine learning technique that can be used for predicting categorical or continuous variables.
- It can work with any kind of columns, including pixel values.
- In general, it doesn't overfit, and it's easy to avoid overfitting.
- It works fine without a separate validation set.
- It requires no statistical assumptions.

TL;DR: It's a great start.

Curse of dimensionality: The greater the number of columns, the emptier the mathematical space becomes, and the data points end up sitting on the edges (a mathematical property). The claim is that this makes the distance between points meaningless.

In general, this is false. Data points still have distances between them even when they sit on the boundaries. Research was more heavily theoretical in the '90s. In practice, building models on lots of columns works really well.

No Free Lunch theorem: There is no universal kind of model that works well for all kinds of datasets.

In general, though, we look at data that was created by some cause or structure. There are actually techniques that work well for nearly all of the datasets that we work with in practice. Ensembles of decision trees are the most widely used such technique.

Fitting a model on the raw data fails, because it still contains strings:

    ValueError: could not convert string to float: 'Conventional'

scikit-learn isn't the best library, but it's good for our purposes.

RandomForest:

    from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

- Regressor: predicts continuous values.
- Classifier: predicts categories.

Note: Regression != Linear Regression.
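Going back to the evaluation metric mentioned above: a minimal sketch of why taking logarithms first turns an ordinary RMSE into RMSLE. The prices are made up, and the sketch assumes the plain natural log (some definitions of RMSLE use log(1 + x) instead).

```python
import math


def rmse(x, y):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def rmsle(actual, predicted):
    """RMSE computed on the natural logs of the values."""
    log_actual = [math.log(a) for a in actual]
    log_predicted = [math.log(p) for p in predicted]
    return rmse(log_actual, log_predicted)


# Made-up auction prices, for illustration only.
actual = [10_000.0, 50_000.0, 100_000.0]
predicted = [11_000.0, 45_000.0, 130_000.0]

print(f"RMSE on raw prices: {rmse(actual, predicted):.1f}")
print(f"RMSLE:              {rmsle(actual, predicted):.4f}")

# Being 10% too high costs the same on a cheap machine as on an expensive
# one, because the log error depends only on the ratio of the two prices.
cheap = rmsle([1_000.0], [1_100.0])
expensive = rmsle([100_000.0], [110_000.0])
print(cheap, expensive)
```

This is why the target can be replaced by its logarithm once up front, after which a plain RMSE on the model's predictions is already the competition metric.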
Feature Engineering

The RandomForest algorithm expects numerical data, so we need to convert everything to numbers.

The dataset contains:

- Continuous variables
- Categorical variables:
  - Numbers
  - Strings
  - Dates

    df_raw.saledate

Information inside a date:

- Is it a holiday?
- Is it a weekend?
- The weather.
- Event(s) information.

    ??add_datepart

To look at the source code.

The function grabs the field "fldname". Note: df.fldname would literally look up a field named fldname. df[fldname] is the safer option in general: it's a safe bet and doesn't give weird errors in case we make a mistake. Don't be lazy and write df.fldname. Also, df[fldname] returns a Series.

The function goes through a list of strings; for each one it looks inside the object and finds the attribute with that name (see getattr()). This has been made to create any column that might be relevant to our case. (This is the exact opposite of the curse of dimensionality: we are creating more columns.) There is no harm in adding more columns to your data.

Pandas splits different groups of methods out into attributes. Everything specific to date-times lives under .dt.___

Finally, we drop the original date column.

Dealing with Strings

UsageBand has the values Low, High, and Medium. Pandas has a categorical data type, but it doesn't use it by default.

    train_cats

Creates categorical variables for strings. It creates a column that stores numbers, and stores the mapping between the strings and the numbers. Make sure you use the same mapping for the training dataset and the testing dataset.

Similar to .dt, .cat gives access to the categorical data.

Since we'll have decision trees that split on the columns, it is better for the categories to have a "logical" order. A RF consists of trees that make splits. The splits could be High vs Low+Medium, followed by Low vs Medium.

Missing Values

    display_all(df_raw.isnull().sum().sort_index()/len(df_raw))

- .isnull() returns True/False for each value, depending on whether it is null.
- .sum() adds up the null values in each column.
- We then sort by column name and divide by the number of rows to get the fraction of missing values in each column.
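The three steps above (date parts via .dt, strings to ordered categories via .cat, and the fraction of missing values) can be sketched in plain pandas. This is not the fastai implementation of add_datepart or train_cats; the rows, the new column names, and the High/Medium/Low ordering are assumptions made for illustration.

```python
import pandas as pd

# Made-up rows shaped loosely like the bulldozers data.
df = pd.DataFrame({
    "saledate": pd.to_datetime(
        ["2006-11-16", "2004-03-26", "2011-05-19", "2009-07-23"]
    ),
    "UsageBand": ["Low", "High", None, "Medium"],
})

# 1. Date parts: look up each attribute by name on the .dt accessor
#    (the getattr idea), then drop the original date column.
for attr in ["year", "month", "day", "dayofweek"]:
    df["sale_" + attr] = getattr(df["saledate"].dt, attr)
df = df.drop(columns="saledate")

# 2. Strings to an ordered categorical. The order is an assumption;
#    it is what gives the numeric codes a "logical" sequence.
df["UsageBand"] = pd.Categorical(
    df["UsageBand"], categories=["High", "Medium", "Low"], ordered=True
)
codes = df["UsageBand"].cat.codes  # missing values get the code -1

# 3. Fraction of missing values per column, sorted by column name.
missing_fraction = df.isnull().sum().sort_index() / len(df)

print(df)
print(codes.tolist())
print(missing_fraction)
```

Reusing the same categories list for the test data is one simple way to keep the string-to-number mapping identical across both datasets.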
Saving

    os.makedirs('tmp', exist_ok=True)
    df_raw.to_feather('tmp/bulldozers-raw')

Feather: saves the file in a format similar to the one it has in RAM. In layman's terms, it's fast.

Pro tip: Use a temporary folder for all the actions and needs that pop up while you're working.

Final Steps

    proc_df

A function inside fastai.structured. It:

- Grabs a copy of the df.
- Grabs the dependent column.
- Drops the dependent column from the df.
- Fixes the missing values.

Fix missing:

- Numeric values: if the column does have missing values, create a new Boolean column named Col_na and replace the missing values with the median.
- Non-numeric and categorical values: replace with the category code and add 1.

    df, y, nas = proc_df(df_raw, 'SalePrice')

Running the Regressor

    m = RandomForestRegressor(n_jobs=-1)
    m.fit(df, y)
    m.score(df, y)

Random forests are parallelisable: n_jobs=-1 creates a separate process for every CPU we have. The three lines create a model, fit it, and return the score.

The score is R². 1 is the best score. 0 means the model is no better than always predicting the mean, and the score can go below 0 for models that do worse than that.

    def rmse(x, y): return math.sqrt(((x-y)**2).mean())

    def print_score(m):
        # train RMSE, validation RMSE, train R², validation R²
        res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
               m.score(X_train, y_train), m.score(X_valid, y_valid)]
        if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
        print(res)

Checking Overfitting

We can create a validation dataset. With the data sorted by date, the most recent 12,000 rows will be the validation set.

    def split_vals(a, n): return a[:n].copy(), a[n:].copy()

    n_valid = 12000  # same as Kaggle's test set size
    n_trn = len(df)-n_valid
    raw_train, raw_valid = split_vals(df_raw, n_trn)
    X_train, X_valid = split_vals(df, n_trn)
    y_train, y_valid = split_vals(y, n_trn)

    X_train.shape, y_train.shape, X_valid.shape

Final Score

If you're in the top half of the Kaggle leaderboard, it's a great start.

    print_score(m)
    [0.09044244804386327, 0.2508166961122146, 0.98290459302099709, 0.88765316048270615]

A validation RMSE of 0.25 would get a leaderboard position in the top 25%.

Appreciation: Without any thinking or intensive feature engineering, and without defining or worrying about any statistical assumptions, we get a decent score.
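The numeric missing-value fix and the date-ordered split described above can be tried on a toy frame. This is a hedged sketch in plain pandas, not the fastai proc_df code: fix_missing_numeric is a hypothetical helper, and the MachineHours and YearMade columns hold made-up values.

```python
import numpy as np
import pandas as pd


def fix_missing_numeric(df, col):
    """Add a Boolean <col>_na column and fill gaps with the column median."""
    out = df.copy()
    if out[col].isnull().any():
        out[col + "_na"] = out[col].isnull()
        out[col] = out[col].fillna(out[col].median())
    return out


def split_vals(a, n):
    """First n rows for training, the remaining rows for validation."""
    return a[:n].copy(), a[n:].copy()


# Made-up rows, assumed to be already sorted by sale date.
raw = pd.DataFrame({
    "MachineHours": [100.0, np.nan, 300.0, 500.0, np.nan, 700.0],
    "YearMade": [1999, 2001, 2003, 2005, 2007, 2009],
})

fixed = fix_missing_numeric(raw, "MachineHours")

# Hold out the most recent rows, mirroring the 12,000-row split above.
n_valid = 2
n_trn = len(fixed) - n_valid
X_train, X_valid = split_vals(fixed, n_trn)

print(fixed)
print(X_train.shape, X_valid.shape)
```

Keeping the _na flag lets a tree split on "was this value missing" separately from the filled-in median, so the information is not lost.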
If you found this article to be useful and would like to stay in touch, you can find me on Twitter here.
