2020-02-03 | Product Manager | Machine Learning Practitioner | UI/UX Designer/Preacher | Full-Stack Developer
Deep Learning and Embedding Layers.
fastai.tabular
from fastai import *
from fastai.tabular import *
We read the training set into train_df, and this will be the DataFrame we’ll mainly be working on. We will also read in test_df, which is the test set.
len(train_df), len(test_df)
(401125, 12457)
train_df = train_df.sort_values(by='saledate', ascending=False)
train_df = train_df.reset_index(drop=True)
train_df.SalePrice = np.log(train_df.SalePrice)
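Taking the log of SalePrice, as in the line above, means that an ordinary RMSE on the transformed target measures relative rather than absolute error. The following is a minimal sketch with made-up prices (the helper name rmse_on_logs is hypothetical and not part of fast.ai) showing that being off by a factor of two costs the same for a cheap machine as for an expensive one.

```python
import math

def rmse_on_logs(actual_prices, predicted_prices):
    """RMSE computed on log-transformed prices (illustrative helper)."""
    squared_errors = [
        (math.log(p) - math.log(a)) ** 2
        for a, p in zip(actual_prices, predicted_prices)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up prices: both predictions are exactly twice the actual price.
actual = [1_000.0, 100_000.0]
predicted = [2_000.0, 200_000.0]

error = rmse_on_logs(actual, predicted)
# Both rows contribute the same squared error, (log 2)^2, despite the
# hundredfold difference in absolute price.
print(error, math.log(2))
```

On the raw price scale the second row would dominate the error completely; on the log scale both rows count equally.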
The only feature engineering we do is on saledate. This is the advantage of a Deep Learning approach: it requires far less feature engineering and depends less on domain knowledge. We’ll use fast.ai’s add_datepart function to add some more features related to the sale date.
# The only feature engineering we do is add some meta-data from the sale date column, using 'add_datepart' function in fast.ai
add_datepart(train_df, "saledate", drop=False)
add_datepart(test_df, "saledate", drop=False)
What add_datepart does is take the saledate column and add a bunch of other columns such as day of week, day of month, and whether the date is the start or end of a month, quarter or year. These added features offer more insight into the date and are relevant to user purchasing behavior. For example, at the end of the year, the company will usually run promotions and prices will usually decrease.
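To make the idea concrete, here is a minimal pandas sketch of this kind of date expansion. It is not fast.ai's implementation of add_datepart; the function name expand_date and the toy dates are assumptions for illustration, and only the column names mirror the sale* columns used later in this post.

```python
import pandas as pd

def expand_date(df, col, prefix="sale"):
    """Add calendar features derived from a date column (illustrative only)."""
    dates = pd.to_datetime(df[col])
    out = df.copy()
    out[prefix + "Year"] = dates.dt.year
    out[prefix + "Month"] = dates.dt.month
    out[prefix + "Day"] = dates.dt.day
    out[prefix + "Dayofweek"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
    out[prefix + "Dayofyear"] = dates.dt.dayofyear
    out[prefix + "Is_month_end"] = dates.dt.is_month_end
    out[prefix + "Is_month_start"] = dates.dt.is_month_start
    out[prefix + "Is_quarter_end"] = dates.dt.is_quarter_end
    out[prefix + "Is_year_end"] = dates.dt.is_year_end
    out[prefix + "Is_year_start"] = dates.dt.is_year_start
    # Seconds since 1970-01-01, a single continuous "time" feature.
    out[prefix + "Elapsed"] = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
    return out

# Toy data: the last day of one year and the first day of the next.
toy = pd.DataFrame({"saledate": ["2011-12-31", "2012-01-01"]})
expanded = expand_date(toy, "saledate")
print(expanded.T)
```

The boolean flags are what let a model pick up effects like year-end promotions without anyone hand-coding them.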
# check and see whether all date related meta data is added.
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        display(df)

display_all(train_df.tail(10).T)
# Defining pre-processing we want for our fast.ai DataBunch
procs=[FillMissing, Categorify, Normalize]
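As a rough mental model of what these three pre-processing steps do, here is a pandas sketch under simplifying assumptions. It is not fast.ai's code: the function name preprocess, the median fill, the "_na" flag column and the "+ 1" code offset are assumptions chosen for illustration, and the toy values are made up.

```python
import numpy as np
import pandas as pd

def preprocess(df, cat_names, cont_names):
    """Illustrative stand-in for fill-missing, categorify and normalize."""
    out = df.copy()
    # 1. Fill missing continuous values with the median and flag them.
    for name in cont_names:
        if out[name].isna().any():
            out[name + "_na"] = out[name].isna()
            out[name] = out[name].fillna(out[name].median())
    # 2. Turn each categorical column into integer codes.
    #    Missing values get code 0, real categories start at 1.
    for name in cat_names:
        out[name] = out[name].astype("category").cat.codes + 1
    # 3. Standardize continuous columns to zero mean and unit variance.
    for name in cont_names:
        out[name] = (out[name] - out[name].mean()) / (out[name].std() + 1e-7)
    return out

toy = pd.DataFrame({
    "UsageBand": ["Low", "High", None, "Low"],
    "MachineHoursCurrentMeter": [100.0, np.nan, 300.0, 200.0],
})
processed = preprocess(toy, ["UsageBand"], ["MachineHoursCurrentMeter"])
print(processed)
```

The integer codes produced in step 2 are what the embedding layers later use as row indices.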
We put all categorical columns into a list cat_vars and all continuous columns into a list cont_vars. These two variables will also be used to construct the fast.ai DataBunch.
# prepare categorical and continuous data columns for building the Tabular DataBunch.
cat_vars = ['SalesID', 'YearMade', 'MachineID', 'ModelID', 'datasource', 'auctioneerID', 'UsageBand', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries', 'fiModelDescriptor', 'ProductSize',
'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc', 'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control', 'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics', 'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size', 'Coupler', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow',
'Track_Type', 'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb', 'Pattern_Changer', 'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type', 'Travel_Controls', 'Differential_Type', 'Steering_Controls',
'saleYear', 'saleMonth', 'saleWeek', 'saleDay', 'saleDayofweek', 'saleDayofyear', 'saleIs_month_end', 'saleIs_month_start', 'saleIs_quarter_end', 'saleIs_quarter_start', 'saleIs_year_end',
'saleIs_year_start'
]
cont_vars = ['MachineHoursCurrentMeter', 'saleElapsed']
# rearrange the training set before feeding it into the DataBunch
dep_var = 'SalePrice'
df = train_df[cat_vars + cont_vars + [dep_var,'saledate']].copy()
# Look at the time period of test set, make sure it's more recent
test_df['saledate'].min(), test_df['saledate'].max()
# Calculate where we should cut the validation set. We pick the most recent 'n' records in training set where n is the number of entries in test set.
cut = train_df['saledate'][(train_df['saledate'] == train_df['saledate'][len(test_df)])].index.max()
cut
12621
# specify the valid_idx variable as the cut out range.
valid_idx = range(cut)
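The cut computed above can be hard to read in one line, so here is the same logic as a small sketch on toy data. The helper name recent_validation_indices and the toy dates are hypothetical; the steps (sort newest first, find the date at position n, cut at the last row sharing that date) follow the code above.

```python
import pandas as pd

def recent_validation_indices(df, date_col, n):
    """Sort newest-first and return (sorted frame, validation row range)."""
    ordered = df.sort_values(by=date_col, ascending=False).reset_index(drop=True)
    boundary_date = ordered[date_col][n]  # date of the n-th newest row
    # Last row position that still shares the boundary date.
    cut = ordered.index[ordered[date_col] == boundary_date].max()
    return ordered, range(cut)

toy = pd.DataFrame({
    "saledate": pd.to_datetime([
        "2011-12-27", "2011-12-29", "2011-12-30",
        "2011-12-29", "2011-12-28", "2011-12-29",
    ]),
})
ordered, valid_idx = recent_validation_indices(toy, "saledate", n=2)
print(list(valid_idx))
print(ordered.loc[list(valid_idx), "saledate"])
```

Because the validation rows are the newest ones, validating on them mimics predicting a test set that lies in the future relative to the training data.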
# Use fast.ai datablock api to put our training data into the DataBunch, getting ready for training
data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
        .split_by_idx(valid_idx)
        .label_from_df(cols=dep_var, label_cls=FloatList)
        .databunch())
We then create a tabular learner from the DataBunch we just created. We want to limit the predicted price range to the historical sale price range, so we need to calculate y_range. Note that we multiplied the maximum of SalePrice by 1.2 so that when we apply the sigmoid, the upper limit is also covered. This is a small trick to squeeze a bit more performance out of the model.
max_y = np.max(train_df['SalePrice'])*1.2
y_range = torch.tensor([0, max_y], device=defaults.device)
y_range
tensor([ 0.0000, 14.2363], device='cuda:0')
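The reason for the 1.2 factor is easier to see with the sigmoid written out. The sketch below assumes the final output is squashed as low + (high - low) * sigmoid(x); the function name scaled_sigmoid and the value 12.0 for the largest log price are made up for illustration.

```python
import math

def scaled_sigmoid(x, low, high):
    """Map a raw network output x into the interval between low and high."""
    return low + (high - low) / (1.0 + math.exp(-x))

max_log_price = 12.0                  # hypothetical largest log(SalePrice)
low, high = 0.0, max_log_price * 1.2  # 20% headroom above the maximum

# With headroom, the maximum is reached when sigmoid(x) = 1/1.2 = 5/6,
# which happens at x = log(5), a perfectly ordinary activation value.
for raw in (0.0, math.log(5), 10.0):
    print(raw, scaled_sigmoid(raw, low, high))
```

Without the headroom, predicting the maximum price would require the sigmoid to output exactly 1, which mathematically needs an infinitely large raw activation.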
# Create our tabular learner. The dense part is a two-layer NN with 1000 and 500 units. We also use dropout.
learn = tabular_learner(data, layers=[1000,500], ps=[0.001,0.01], emb_drop=0.04, y_range=y_range, metrics=rmse)
The key feature of tabular_learner is the use of embedding layers for categorical data. This is the ‘secret sauce’ that enables Deep Learning to be competitive on tabular data. With one embedding layer for each categorical variable, we introduce good interaction among the categorical variables and leverage Deep Learning’s biggest strength: Automatic Feature Extraction. We also use dropout on both the dense layers and the embedding layers for better regularization. The learner’s metric is RMSE, since we’ve already taken the log of SalePrice. Let’s look at the model.
TabularModel(
  (embeds): ModuleList(
    (0): Embedding(388505, 600)
    (1): Embedding(72, 18)
    (2): Embedding(331868, 600)
    (3): Embedding(5155, 192)
    ...
    (60): Embedding(3, 3)
    (61): Embedding(2, 2)
    (62): Embedding(3, 3)
  )
  (emb_drop): Dropout(p=0.04, inplace=False)
  (bn_cont): BatchNorm1d(2, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): Linear(in_features=2102, out_features=1000, bias=True)
    (1): ReLU(inplace=True)
    (2): BatchNorm1d(1000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Dropout(p=0.001, inplace=False)
    (4): Linear(in_features=1000, out_features=500, bias=True)
    (5): ReLU(inplace=True)
    (6): BatchNorm1d(500, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): Dropout(p=0.01, inplace=False)
    (8): Linear(in_features=500, out_features=1, bias=True)
  )
)
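Two things in the printout above are worth unpacking. First, the embedding widths (600, 18, 192, 3, 2) are consistent with the heuristic min(600, round(1.6 * n^0.56)), which as far as we know is the default sizing rule in fastai v1; treat that attribution as an assumption. Second, an embedding layer is just a trainable lookup table whose rows are concatenated with the continuous features before the first dense layer. The NumPy sketch below illustrates both; emb_size, the table contents and the toy codes are all hypothetical.

```python
import numpy as np

def emb_size(n_categories, cap=600):
    """Heuristic embedding width, consistent with the sizes printed above."""
    return min(cap, round(1.6 * n_categories ** 0.56))

for n in (388505, 72, 5155, 3, 2):
    print(n, emb_size(n))

# An embedding is a lookup table: one row of numbers per category.
rng = np.random.default_rng(0)
n_states, width = 5, 3                     # hypothetical: 5 states, 3-dim vectors
state_table = rng.normal(size=(n_states, width))

state_codes = np.array([0, 3, 3])          # integer codes for three rows of data
state_vectors = state_table[state_codes]   # row lookup, shape (3, 3)

# Continuous features are appended to the looked-up vectors, and the
# result is what the first dense layer receives.
continuous = np.array([[0.1, -1.2], [0.5, 0.3], [-0.7, 2.0]])
dense_input = np.concatenate([state_vectors, continuous], axis=1)
print(dense_input.shape)  # (3, 5)
```

During training the rows of each table are updated by gradient descent like any other weight, which is how categories that behave similarly end up with similar vectors.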
learn.lr_find()
learn.recorder.plot()
learn.fit_one_cycle(2, 1e-2, wd=0.2)
learn.fit_one_cycle(5, 3e-4, wd=0.2)
And if you circle some states in the embedding space and the same states on the actual map, you’ll find that they are scarily similar. The embedding layer actually discovered geography.