
How I Met and Fell in Love with Data Science

by Yahya Civelek

In this article, I’m going to narrate how and why I started to learn “data science” even though I was an embedded software developer and had very limited spare time.

In addition, I will demonstrate the progress I made in such a short period of time via a “machine learning” tutorial.

I believe this article will give some inspiration to others who are willing to learn data science but don’t have enough time or motivation.

Here is my story…

Last year, I decided to do affiliate marketing as a side project by building a niche website specializing in juicing to earn some extra money.

Nine months later I finally started to sell some products.

The money I have earned so far certainly hasn’t made me rich, but it kept me motivated to keep working on the project.

Although it slowed down in the past few months, the project is still going on.

Besides making money online, there was another benefit that I got from the project.

Thankfully, I met data science.

Here is how it happened.

Most of the articles I published on the site were boring “stuff”.

One day, I came across a brilliant idea when I was reading an article on medium.com.

In that article, the author analyzed every post about the recommended books on programming languages and ordered them by their popularity. He also published this work on an affiliate website. This small project became very successful.

I followed the same principle on my own niche project.

I collected over 10,000 smoothie recipes and analyzed them, ordered the most popular ingredients and published the results.

I also published another article on medium.com explaining the motivation behind the article on my niche site and how I did it.

In the article, I showed how I used my Python programming skills to publish an easy-to-promote and informative article.

Python helped me collect and analyze genuine data about smoothie recipes, and in the end, this helped increase traffic to my website.

After this small success, I realized the power of big data and that I had the ability to analyze it.

Information is the oil of the 21st century, and analytics is the combustion engine.
- Peter Sondergaard, Gartner Research

Then I gradually fell in love with data science.

After that, I decided to learn more in this field.

However, I had very limited free time, as I had a daytime job and three little monsters (kids) waiting for me at home.

But I was dedicated. I worked while they were sleeping early in the morning and late at night.

I took related online courses from Udemy, Coursera, and dataquest.io.

I read the notebooks on kaggle.com.

I also joined a community in Turkey in which we discuss “deep learning” while following Andrew Ng’s courses on Coursera.

The most important part: I practiced a lot on what I learned.

In theory, theory and practice are the same. In practice, they’re not.
- Yogi Berra

I believe I made decent progress in learning data science in this five-month period.

Now, I am able to use the following tools and libraries to some extent:

Python libraries: NumPy, pandas, matplotlib, seaborn, sklearn, Keras…

Tools: Jupyter Notebook, PyCharm IDE

I am sure that my progress is only a tiny portion of what I need to do to become a fairly good data scientist.

However, I am trying hard to close the huge gap as fast as possible.

Because I love the field, the time I spend learning data science never bores me.

Now, here is the tutorial that I promised at the beginning.

The Tutorial

The data I used comes from my side project, the juicer niche.

I captured the data from amazon.com using the Product Advertising API and saved it in CSV format. As it is off-topic, I will not cover how I got the data from Amazon.

I will try to determine how “price”, “brand name”, “juicer type” and “color” affect the “sales rank”, and build a model using that data and Python machine learning libraries.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import pandas as pd
import re
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from collections import Counter
#from nltk.corpus import stopwords
import string
import operator
import seaborn as sns
from itertools import groupby
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
%matplotlib inline
# Load the scraped product data; the unnamed first column holds the Amazon
# product IDs, so use it as the index.
df_all = pd.read_csv("../out/data.csv")
df_all.index = df_all['Unnamed: 0'].values
df_all.drop('Unnamed: 0', axis=1, inplace=True)
df_all.head()

Data Preparation

df_all.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1017 entries, B00004R93U to B0739LW3XS
Data columns (total 7 columns):
brand         1010 non-null object
category      1017 non-null object
color          839 non-null object
features      1017 non-null object
price          968 non-null object
sales_rank    1017 non-null object
title         1017 non-null object
dtypes: object(7)
memory usage: 63.6+ KB

“Brand”, “category” and “color” are categorical features; “price” and “sales_rank” are numeric. I am going to exclude the “features” column, as it is too complicated for me to use in modelling right now.

This will be a classification problem: I will try to build a model that can predict the most successful products in terms of sales rank.

Let’s check if there are any null values.

df_all.isnull().sum()
brand           7
category        0
color         178
features        0
price          49
sales_rank      0
title           0
dtype: int64

Yes, there are some null values. There are also hidden nulls stored as the string 'None' in the sales_rank and color columns.

(df_all=='None').sum()
brand           0
category        0
color           1
features        0
price           0
sales_rank    252
title           0
dtype: int64

Convert these to float NaN so that the isnull() function can detect them.

df_all[df_all == 'None'] = float('NaN')
df_all.isnull().sum()
brand           7
category        0
color         179
features        0
price          49
sales_rank    252
title           0
dtype: int64

Let’s convert the price column to float:

def getPrice(x):
    # Prices arrive as strings; normalize commas to dots and extract the
    # first decimal number. Unparsable strings become NaN; non-strings
    # (already-missing values) pass through unchanged.
    if type(x) == str:
        x = x.replace(',', '.')
        res = re.findall(r'\d+\.\d+', x)
        if res:
            return float(res[0])
        else:
            return float('NaN')
    else:
        return x

df_all['price'] = df_all['price'].apply(getPrice)

Convert the sales_rank column to float. Normally, I would use int; however, as pandas doesn’t have the ability to store NaN values in an int column, I used float (see https://stackoverflow.com/a/21290084).

df_all['sales_rank'] = df_all['sales_rank'].astype(float)
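
Why float and not int? A minimal demonstration of the problem (a sketch; the exact error message varies by pandas version):

# Sketch: converting a series containing NaN to int fails, because
# integer arrays cannot represent missing values.
s = pd.Series([1.0, float('NaN')])
try:
    s.astype(int)
except ValueError as err:
    print(err)  # e.g. "Cannot convert non-finite values (NA or inf) to integer"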

Convert the color column to lower case, and correct the typos:

df_all['color'] = df_all['color'].str.lower()

def correct_typo(s):
    # Map known misspellings in the color column to their correct forms.
    if type(s) == str:
        typo = {"sliver": "silver", "golden": "gold", "balck": "black", "sless": "stainless"}
        for k, v in typo.items():
            s = s.replace(k, v)
    return s

df_all['color'] = df_all['color'].apply(correct_typo)

Correct typos for ‘brand’ column, too:

def correct_brands(s):
    # Normalize brand-name variants so the same brand is counted once.
    if type(s) == str:
        typo = {"Breville Juicer": "Breville", "Omega Juicers": "Omega"}
        for k, v in typo.items():
            s = s.replace(k, v)
    return s

df_all['brand'] = df_all['brand'].apply(correct_brands)

Exploratory Data Visualization

Investigate the relationship between sales rank and price.

sns.lmplot(data=df_all, x='price', y='sales_rank')
df_all[['sales_rank', 'price']].corr()
sns.lmplot(data=df_all[df_all['price'] > 100], x='price', y='sales_rank')
df_all[df_all['price'] > 100][['sales_rank', 'price']].corr()

There is a positive correlation between price and sales rank, especially for prices greater than $100. Since a lower sales rank number means better sales, this suggests that more expensive juicers tend to sell worse.

Divide the sales rank into four categories (quartiles):

srank_qcut = pd.qcut(df_all['sales_rank'], 4, labels=['0', '1', '2', '3'])
srank_qcut.head()
B00004R93U    3
B00004R93V    2
B00004S8FH    1
B00004S8FI    1
B00004S8FJ    1
Name: sales_rank, dtype: category
Categories (4, object): [0 < 1 < 2 < 3]
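
If you are curious where those quartile boundaries fall, qcut can also return the bin edges; a quick check (the exact numbers depend on the data):

# retbins=True makes qcut return the bin edges alongside the categories.
_, bins = pd.qcut(df_all['sales_rank'], 4, labels=['0', '1', '2', '3'], retbins=True)
print(bins)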

Calculate the number of juicers of each color in each sales rank category:

def getColors(carr):
    # Split free-form color strings (e.g. "black/silver") into individual
    # color words, skipping filler tokens and fixing known misspellings.
    clist = []
    for c in carr:
        if type(c) == str:
            lis = re.split(r" |/|-|\.|,|\&", c)
            for l in lis:
                if l == '' or l == 'and' or l == 'steel':
                    continue
                if l == 'sliver':
                    new_l = 'silver'
                elif l == 'golden':
                    new_l = 'gold'
                elif l == 'balck':
                    new_l = 'black'
                elif l == 'sless':
                    new_l = 'stainless'
                else:
                    new_l = l
                clist.append(new_l)
    return clist

# Count color frequencies separately for each sales-rank quartile.
serilist = []
for i in range(4):
    colors = df_all.loc[srank_qcut[srank_qcut == str(i)].index.tolist()]['color'].values
    color_list = getColors(colors)
    color_freq = dict(Counter(color_list))
    serilist.append(pd.Series(color_freq))

df = pd.concat(serilist, axis=1)
df.dropna(inplace=True)
df
top_color_list = df.index.tolist()
plt.figure(figsize=(5,5))
sns.heatmap(df, cmap='magma')

Silver, black, and white are the most frequently found colors in the better sales rank categories.

Explore the relationship between ‘category’ and ‘sales_rank’ columns:

df_all.pivot_table(index='category', values='sales_rank')
plt.figure(figsize=(15,8))
sns.violinplot(x='category', y='sales_rank', data=df_all)

Feature Engineering

Determine brand distributions for each sales rank category:

# Count brand frequencies separately for each sales-rank quartile.
serilist = []
for i in range(4):
    brands = df_all.loc[srank_qcut[srank_qcut == str(i)].index.tolist()]['brand'].values.tolist()
    brand_freq = dict(Counter(brands))
    serilist.append(pd.Series(brand_freq))

df = pd.concat(serilist, axis=1)
plt.figure(figsize=(5,15))
sns.heatmap(df, cmap='magma')

As can be clearly seen, the Omega brand is in the best position.

Get the top 10 brands and convert them into dummy variables:

top_brand_list = df.sort_values(by=0, ascending=False).index.tolist()[:10]
arr_brand_dummy = np.zeros(df_all.shape[0] * len(top_brand_list)).reshape(df_all.shape[0], len(top_brand_list))
df_brand_dummy = pd.DataFrame(arr_brand_dummy, columns=top_brand_list, index=df_all.index)
for row in df_all.iterrows():
    b = row[1]['brand']
    # Only flag brands that are in the top-10 list; everything else stays 0.
    if type(b) == str and b in top_brand_list:
        df_brand_dummy.loc[row[0], b] = 1
df_brand_dummy.head()
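
For reference, pandas can build the same dummy matrix more concisely; a sketch of an equivalent approach with get_dummies (reindex keeps only, and orders, the top-10 brand columns):

# Equivalent sketch: one-hot encode the brand column, then keep only the
# columns of the top 10 brands; rows with other (or missing) brands are all 0.
df_brand_dummy_alt = pd.get_dummies(df_all['brand']).reindex(columns=top_brand_list, fill_value=0)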

Get the top 10 colors and convert them into dummy variables:

arr_color_dummy = np.zeros(df_all.shape[0] * len(top_color_list)).reshape(df_all.shape[0], len(top_color_list))
df_color_dummy = pd.DataFrame(arr_color_dummy, columns=top_color_list, index=df_all.index)
for row in df_all.iterrows():
    c = row[1]['color']
    if type(c) == str:
        # A product may list several colors; flag every top color it mentions.
        colors = list(set(top_color_list).intersection(set(re.split(r" |/|-|\.|,|\&", c))))
        if colors:
            df_color_dummy.loc[row[0], colors] = 1
df_color_dummy.head()

Create dummy variables for the category column, too:

cat_dummy = pd.get_dummies(df_all['category'])
cat_dummy.head()

Now it is time to create the target (sales_rank categories). I will assign 1 to category 0 (the best quartile) and 0 to the remaining categories, so that the target has two classes.

target_sales_rank = pd.Series(np.zeros(df_all.shape[0]), index=df_all.index, name='target_sales_rank')
findex = srank_qcut[srank_qcut == '0'].index
target_sales_rank[findex] = 1

Now, create the final DataFrame:

df_final = pd.concat([cat_dummy, df_brand_dummy, df_color_dummy, df_all['price'], target_sales_rank], axis=1)

Drop null data:

df_final.dropna(inplace=True)
df_final.shape

Working on the models

Get input variables and output target:

X = df_final.drop('target_sales_rank', axis=1)
y = df_final['target_sales_rank']

Split train and test data:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
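
Note that without a fixed seed the scores below will change from run to run. A reproducible variant (random_state is an arbitrary value; stratify keeps the class proportions equal in both sets):

# Reproducible, stratified version of the split above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)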

Use the following models:

lm = linear_model
models = {
    'Logistic Regression': lm.LogisticRegression,
    'Logistic Regression CV': lm.LogisticRegressionCV,
    'Ridge': lm.RidgeClassifier,
    'Random Forest': RandomForestClassifier,
    'KNN': KNeighborsClassifier
}

Run a loop to fit and test all the models, and list the sorted scores:

scores = {}
for name, model in models.items():
    clf = model()
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    scores.update({name: score})

sorted_scores = sorted(scores.items(), key=operator.itemgetter(1), reverse=True)
df = pd.DataFrame(sorted_scores, columns=['models', 'scores'])
df
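
Since only one of the four quartiles is labeled 1, the classes are imbalanced roughly 3:1, so the accuracy scores above should be compared against the majority-class baseline; a quick sanity check:

# A model that always predicts 0 already scores the share of zeros in the
# test set; a useful classifier must beat this baseline.
baseline = (y_test == 0).mean()
print('majority-class baseline accuracy: {:.3f}'.format(baseline))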

Find the feature importance values for the best model, Logistic Regression, and sort them in order to find out how the features affect the results.

clf = linear_model.LogisticRegression()
clf.fit(X_train, y_train)
# https://stackoverflow.com/questions/34052115/how-to-find-the-importance-of-the-features-for-a-logistic-regression-model
# Scale each coefficient by its feature's standard deviation so that
# features on different scales become comparable.
importance_arr = np.std(X.values, 0).reshape(1, X.shape[1]) * clf.coef_
featurelist = X.columns.tolist()
importance = {}
for i, imp in enumerate(importance_arr[0]):
    importance.update({featurelist[i]: imp})

ordered_importance = sorted(importance.items(), key=operator.itemgetter(1), reverse=True)
df = pd.DataFrame(ordered_importance, columns=['features', 'importance'])
df
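
The std-times-coefficient trick above compensates for the features living on different scales (price versus 0/1 dummies). An alternative sketch is to standardize the inputs first, after which the logistic regression coefficients are directly comparable:

# Sketch: standardize the features, refit, and read the importance straight
# from the coefficients (preprocessing was imported at the top).
scaler = preprocessing.StandardScaler().fit(X_train)
clf_std = linear_model.LogisticRegression()
clf_std.fit(scaler.transform(X_train), y_train)
importance_std = pd.Series(clf_std.coef_[0], index=X.columns)
print(importance_std.sort_values(ascending=False))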

Conclusions

  • As can be clearly seen, the biggest influencer is the price. That is not surprising, right? A lower price is a means of getting a better sales rank in the juicer market, too.
  • As for the brand, Omega is number one. The company has been in the market for a long time and has lots of successful products, so they clearly have credibility and trust with their customers.
  • In terms of color, silver and stainless steel still seem to be the most in-demand colors. It is worth mentioning that the black and white options are close behind.
  • As for the category, all of the juicer types affect the sales rank negatively; however, the masticating type has the least negative effect. This means that masticating juicers tend to achieve better sales ranks than the centrifugal and citrus types. This outcome is aligned with the recent popularity of the slow juicing concept.

Final outcome: if Omega develops a new masticating juicer in silver or black at a low price, their success is guaranteed :)
