In this article, I’m going to tell the story of how and why I started to learn “data science” even though I was an embedded software developer with very limited spare time.
In addition, I will demonstrate the progress I made in such a short period of time via a “machine learning” tutorial.
I believe this article will give some inspiration to others who want to learn data science but don’t have enough time or motivation.
Here is my story…
Last year, I decided to do affiliate marketing as a side project by building a niche website specializing in juicing to earn some extra money.
Nine months later I finally started to sell some products.
The money I have earned so far certainly hasn’t made me rich, but it kept me motivated to keep working on the project.
Although it slowed down in the past few months, the project is still going on.
Besides making money online, there was another benefit that I got from the project.
Thankfully, I met data science.
Here is how it happened.
Most of the articles I published on the site were boring “stuff”.
One day, I came across a brilliant idea when I was reading an article on medium.com.
In that article, the author analyzed every post about the recommended books on programming languages and ordered them by their popularity. He also published this work on an affiliate website. This small project became very successful.
I followed the same principle on my own niche project.
I collected over 10,000 smoothie recipes, analyzed them, ranked the most popular ingredients and published the results.
I also published another article on medium.com about the motivation behind the article on my niche and how I did it.
In the article, I showed how I used my Python programming skills to publish an easy-to-promote and informative article.
Python helped me collect and analyze genuine data about smoothie recipes, and in the end this helped increase traffic to my website.
After this small success, I realized the power of big data and that I had the ability to analyze it.
Information is the oil of the 21st century, and analytics is the combustion engine.
- Peter Sondergaard, Gartner Research
Then I gradually fell in love with data science.
After that, I decided to learn more in this field.
However, I had very limited free time, as I had a daytime job and three little monsters (kids) waiting for me at home.
But I was dedicated. I worked while they were sleeping, early in the morning and late at night.
I took related online courses on Udemy, Coursera and dataquest.io.
I read the notebooks on kaggle.com.
I also joined a community in Turkey in which we discuss “deep learning” by following Andrew Ng’s courses on Coursera.
The most important part: I practiced a lot on what I learned.
In theory, theory and practice are the same. In practice, they’re not.
- Yogi Berra
I believe I made decent progress in learning data science in this five-month period.
Now, I am able to use the following tools and libraries to some extent:
Python libraries: Numpy, Pandas, matplotlib, seaborn, sklearn, keras…
Tools: Jupyter Notebook, PyCharm IDE
I am sure that my progress is only a tiny fraction of what I need to do to become a fairly good data scientist.
However, I am trying hard to close the huge gap as fast as possible.
Because I love the field, the time I spend learning data science never feels boring.
Now, here is the tutorial I promised at the beginning.
The data I used comes from my side project, the juicer niche.
I captured the data from amazon.com using the Product Advertising API and saved it in CSV format. As it is off topic, I will not cover how I got the data from Amazon.
I will try to find out how “price”, “brand name”, “juicer type” and “color” affect the “sales rank”, and to build a model using that data and Python machine learning libraries.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import pandas as pd
import re
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from collections import Counter
#from nltk.corpus import stopwords
import string
import operator
import seaborn as sns
from itertools import groupby
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
%matplotlib inline
df_all = pd.read_csv("../out/data.csv")
df_all.index = df_all['Unnamed: 0'].values
df_all.drop('Unnamed: 0', axis=1, inplace=True)
df_all.head()
df_all.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1017 entries, B00004R93U to B0739LW3XS
Data columns (total 7 columns):
brand         1010 non-null object
category      1017 non-null object
color          839 non-null object
features      1017 non-null object
price          968 non-null object
sales_rank    1017 non-null object
title         1017 non-null object
dtypes: object(7)
memory usage: 63.6+ KB
“Brand”, “category” and “color” are categorical features, while “price” and “sales_rank” are numeric features. I am going to exclude the “features” column, as it is too complicated for me to use in modelling right now.
This will be a classification problem: I will try to build a model that can predict the most successful products in terms of sales rank.
Let’s check whether there is any null data.
df_all.isnull().sum()
brand           7
category        0
color         178
features        0
price          49
sales_rank      0
title           0
dtype: int64
Yes, there is some null data. There is also some hidden null data stored as the string 'None' in the sales_rank and color columns.
(df_all=='None').sum()
brand           0
category        0
color           1
features        0
price           0
sales_rank    252
title           0
dtype: int64
Convert this hidden null data to float 'NaN' so that the isnull() function can detect it:
df_all[df_all == 'None'] = float('NaN')
df_all.isnull().sum()
brand           7
category        0
color         179
features        0
price          49
sales_rank    252
title           0
dtype: int64
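As a side note, pandas’ built-in replace can do the same conversion in one line. This is an equivalent alternative I’d suggest, not the original code:
# Replace every exact 'None' string with a real NaN in one call
df_all.replace('None', np.nan, inplace=True)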
Let’s convert the price column to float:
def getPrice(x):
    if type(x) == str:
        # Treat the comma as a decimal separator before extracting the number
        x = x.replace(',', '.')
        res = re.findall(r'\d+\.\d+', x)
        if res:
            return float(res[0])
        else:
            return float('NaN')
    else:
        return x
df_all['price'] = df_all['price'].apply(getPrice)
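Note that getPrice treats a comma as the decimal separator before extracting the number. A quick check with hypothetical input values shows how it behaves:
getPrice('$29,99')        # -> 29.99 (comma used as decimal separator)
getPrice('$149.95')       # -> 149.95
getPrice(float('NaN'))    # non-strings pass through unchanged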
Convert the sales_rank column to float. Normally, I would use int; however, as pandas doesn’t have the ability to store NaN values as int, I used float (ref. https://stackoverflow.com/a/21290084).
df_all['sales_rank'] = df_all['sales_rank'].astype(float)
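As an aside, recent pandas versions ship a nullable integer dtype that can hold NaN; if you are on such a version, this alternative of my own would keep the column integer-typed:
# Nullable integer dtype ('Int64') keeps NaN while staying integer-typed
# df_all['sales_rank'] = df_all['sales_rank'].astype(float).astype('Int64')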
Convert the color column to lower case, and correct the typos:
df_all['color'] = df_all['color'].str.lower()
def correct_typo(s):
    if type(s) == str:
        typo = {"sliver": "silver", "golden": "gold", "balck": "black", "sless": "stainless"}
        for k, v in typo.items():
            s = s.replace(k, v)
    return s
df_all['color'] = df_all['color'].apply(correct_typo)
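A quick sanity check of the correction, with a hypothetical input:
correct_typo('balck sliver')  # -> 'black silver'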
Correct the typos in the ‘brand’ column, too:
def correct_brands(s):
    if type(s) == str:
        typo = {"Breville Juicer": "Breville", "Omega Juicers": "Omega"}
        for k, v in typo.items():
            s = s.replace(k, v)
    return s
df_all['brand'] = df_all['brand'].apply(correct_brands)
Investigate the relationship between sales rank and price.
sns.lmplot(data=df_all, x='price', y='sales_rank')
df_all[['sales_rank', 'price']].corr()
sns.lmplot(data=df_all[df_all['price'] > 100], x='price', y='sales_rank')
df_all[df_all['price'] > 100][['sales_rank', 'price']].corr()
There is a positive correlation between price and sales rank, especially for prices greater than $100. (Remember that a higher sales rank number means a worse-selling product.)
Divide the sales rank into four categories:
srank_qcut = pd.qcut(df_all['sales_rank'], 4, labels=['0', '1', '2', '3'])
srank_qcut.head()
B00004R93U    3
B00004R93V    2
B00004S8FH    1
B00004S8FI    1
B00004S8FJ    1
Name: sales_rank, dtype: category
Categories (4, object): [0 < 1 < 2 < 3]
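pd.qcut cuts the values at the quartiles, so each bin should hold roughly the same number of ranked products; a quick check:
srank_qcut.value_counts()  # expect four bins of roughly equal size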
Count the juicers of each color in each sales rank category:
def getColors(carr):
    clist = []
    for c in carr:
        if type(c) == str:
            lis = re.split(r" |/|-|\.|,|\&", c)
            for l in lis:
                if l == '' or l == 'and' or l == 'steel':
                    continue
                if l == 'sliver':
                    new_l = 'silver'
                elif l == 'golden':
                    new_l = 'gold'
                elif l == 'balck':
                    new_l = 'black'
                elif l == 'sless':
                    new_l = 'stainless'
                else:
                    new_l = l
                clist.append(new_l)
    return clist
serilist = []
for i in range(4):
    colors = df_all.loc[srank_qcut[srank_qcut == str(i)].index.tolist()]['color'].values
    color_list = getColors(colors)
    color_freq = dict(Counter(color_list))
    serilist.append(pd.Series(color_freq))
    #ordered_colors = sorted(color_freq.items(), key=operator.itemgetter(1), reverse=True)
df = pd.concat(serilist, axis=1)
df.dropna(inplace=True)
df
top_color_list = df.index.tolist()
plt.figure(figsize=(5,5))
sns.heatmap(df, cmap='magma')
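One optional refinement of my own, not in the original analysis: normalizing each column turns the heatmap from raw counts into color shares per rank category, which is easier to compare across bins:
plt.figure(figsize=(5,5))
sns.heatmap(df / df.sum(), cmap='magma')  # each column now sums to 1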
Silver, black and white are the most frequent colors in the better sales rank categories.
Explore the relationship between ‘category’ and ‘sales_rank’ columns:
df_all.pivot_table(index='category', values='sales_rank')
plt.figure(figsize=(15,8))
sns.violinplot(x='category', y='sales_rank', data=df_all)
Determine brand distributions for each sales rank category:
serilist = []
for i in range(4):
    brands = df_all.loc[srank_qcut[srank_qcut == str(i)].index.tolist()]['brand'].values.tolist()
    brand_freq = dict(Counter(brands))
    serilist.append(pd.Series(brand_freq))
df = pd.concat(serilist, axis=1)
plt.figure(figsize=(5,15))
sns.heatmap(df, cmap='magma')
As can be clearly seen, the Omega brand is in the best position.
Get the top 10 brands and convert them into dummy variables:
top_brand_list = df.sort_values(by=0, ascending=False).index.tolist()[:10]
arr_brand_dummy = np.zeros(df_all.shape[0] * len(top_brand_list)).reshape(df_all.shape[0], len(top_brand_list))
df_brand_dummy = pd.DataFrame(arr_brand_dummy, columns=top_brand_list, index=df_all.index)
for idx, row in df_all.iterrows():
    b = row['brand']
    # Only flag brands among the top 10; a single .loc call avoids both a
    # KeyError for other brands and unreliable chained .loc[][] assignment
    if type(b) == str and b in top_brand_list:
        df_brand_dummy.loc[idx, b] = 1
df_brand_dummy.head()
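A shorter route to the same matrix, assuming each product carries exactly one brand string, is pandas’ built-in get_dummies; a sketch:
# Equivalent, more concise encoding: rows with a missing or non-top-10 brand
# simply end up all-zero, as in the manual version above
df_brand_dummy = pd.get_dummies(df_all['brand']).reindex(columns=top_brand_list, fill_value=0)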
Convert the top colors (those in top_color_list from the heatmap above) into dummy variables:
arr_color_dummy = np.zeros(df_all.shape[0] * len(top_color_list)).reshape(df_all.shape[0], len(top_color_list))
df_color_dummy = pd.DataFrame(arr_color_dummy, columns=top_color_list, index=df_all.index)
for idx, row in df_all.iterrows():
    c = row['color']
    if type(c) == str:
        # Flag every top color mentioned in the product's color string
        colors = list(set(top_color_list).intersection(re.split(r" |/|-|\.|,|\&", c)))
        if colors:
            # Single .loc call instead of unreliable chained indexing
            df_color_dummy.loc[idx, colors] = 1
df_color_dummy.head()
Create dummy variables for the category column, too:
cat_dummy = pd.get_dummies(df_all['category'])
cat_dummy.head()
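One optional tweak here (my note, not in the original): dropping one dummy level avoids perfect collinearity between the category columns in linear models, although scikit-learn’s regularized classifiers tolerate the full set:
# Optional: avoid the dummy-variable trap for plain linear models
# cat_dummy = pd.get_dummies(df_all['category'], drop_first=True)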
Now it is time to create the target (sales_rank categories). I will assign 1 to category 0 (the best-selling quartile) and 0 to the remaining categories, so that the target has two classes.
target_sales_rank = pd.Series(np.zeros(df_all.shape[0]), index=df_all.index, name='target_sales_rank')
findex = srank_qcut[srank_qcut == '0'].index
target_sales_rank[findex] = 1
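By construction of the quartiles, roughly a quarter of the products land in the positive class (slightly less here, since unranked products fall into class 0 as well); it is worth checking the balance, because the accuracy scores below should be read against this baseline:
target_sales_rank.value_counts(normalize=True)  # share of each class; positives are the minority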
Now, create the final DataFrame:
df_final = pd.concat([cat_dummy, df_brand_dummy, df_color_dummy, df_all['price'], target_sales_rank], axis=1)
Drop null data:
df_final.dropna(inplace=True)
df_final.shape
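Dropping rows is the simplest way to handle the remaining missing prices; an alternative sketch of my own would impute the median price instead, keeping those rows:
# Alternative to dropna: fill missing prices with the median price
# df_final['price'] = df_final['price'].fillna(df_final['price'].median())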
Get input variables and output target:
X = df_final.drop('target_sales_rank', axis=1)
y = df_final['target_sales_rank']
Split train and test data:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
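The split is random, so the scores below will vary from run to run; passing a fixed random_state (a small addition of mine) makes the results reproducible:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)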
Use the following models:
lm = linear_model
models = {'Logistic Regression': lm.LogisticRegression,
          'Logistic Regression CV': lm.LogisticRegressionCV,
          'Ridge': lm.RidgeClassifier,
          'Random Forest': RandomForestClassifier,
          'KNN': KNeighborsClassifier}
Run a loop to fit and test all the models, and list the sorted scores:
scores = {}
for name, model in models.items():
    clf = model()
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    scores.update({name: score})
sorted_scores = sorted(scores.items(), key=operator.itemgetter(1), reverse=True)
df = pd.DataFrame(sorted_scores, columns=['models', 'scores'])
df
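Because a single 80/20 split gives a noisy estimate, a more robust comparison (an extension of my own, not part of the original run) is k-fold cross-validation:
from sklearn.model_selection import cross_val_score

# Average accuracy over 5 folds instead of a single random split
cv_scores = {name: cross_val_score(model(), X, y, cv=5).mean()
             for name, model in models.items()}
pd.Series(cv_scores).sort_values(ascending=False)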
Find the feature importance values for the best model, Logistic Regression, and sort them to find out how the features affect the results.
clf = linear_model.LogisticRegression()
clf.fit(X_train, y_train)
#https://stackoverflow.com/questions/34052115/how-to-find-the-importance-of-the-features-for-a-logistic-regression-model
# Scale each coefficient by its feature's standard deviation so they are comparable
# (.values is needed because a pandas Series does not support .reshape)
importance_arr = np.std(X.values, 0).reshape(1, X.shape[1]) * clf.coef_
featurelist = X.columns.tolist()
importance = {}
for i, imp in enumerate(importance_arr[0]):
    importance.update({featurelist[i]: imp})
ordered_importance = sorted(importance.items(), key=operator.itemgetter(1), reverse=True)
df = pd.DataFrame(ordered_importance, columns=['features', 'importance'])
df
Final outcome: if Omega develops a new masticating juicer in silver or black at a low price, its success is guaranteed :)