Who Has the Best Prices for Tech’s Top 100 Products of the Year? A Machine Learning Analysis

by @low-wei-hong

December 19th 2018

If you haven’t read my friend’s post for part 1, please do so first, as this post continues with the remaining part of the project.

We would like to predict whether each search result from iprice matches one of the top 100 coolest electronic gadgets.

The features we considered including are:

- `dist_jw`: Jaro–Winkler distance
- `price_diff_ratio`: (`price` - `refer_price`) / `refer_price`
- `discount`: Discount percentage

“In computer science and statistics, the Jaro-Winkler distance is a string metric for measuring the edit distance between two sequences.

Informally, the Jaro distance between two words is the minimum number of single-character transpositions required to change one word into the other.

The Jaro-Winkler distance uses a prefix scale which gives more favorable ratings to strings that match from the beginning for a set prefix length”

— Source:Wikipedia.

We will be diving into the Math so that it is easier for us to fully understand!

Jaro distance is defined as:

$$d_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)$$

Wow, this seems quite complicated…. I would prefer to sleep… Nah, I promise you will fully understand after these few examples.

But first, we need to understand what these terms mean.

- dj: Jaro distance
- m: Number of matching characters, i.e. characters which appear in both s1 and s2
- t: Half the number of transpositions (compare the i-th matching character of s1 with the i-th matching character of s2, count the pairs that differ, and divide by 2)
- |s1|: Length of the first string
- |s2|: Length of the second string

Let's use an example to explain the math.

How do we calculate the Jaro distance between **Facebook** and **Firebook**?

matching characters : F, e, b, o, o, k -> 6 characters -> m = 6

no transposition needed : t = 0

length of the 1st string : Facebook -> 8 characters -> |s1| = 8

length of the 2nd string : Firebook -> 8 characters -> |s2| = 8

dj = (1/3) * ((6/8) + (6/8) + ((6-0)/6))

dj ~= 0.83

Jaro distance = 83%
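To double-check the hand calculation, here is a minimal pure-Python sketch of the Jaro score (what this post calls the Jaro distance; higher means more similar). It is not the code used in our project. It also assumes the usual matching-window convention, which the definitions above do not spell out: two equal characters only count as a match if their positions are at most max(|s1|, |s2|) // 2 - 1 apart.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro score between two strings (1.0 means identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Assumed convention: equal characters match only if their
    # positions are at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)

    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # Matched characters in their original order; t is half the number
    # of positions where the two sequences disagree.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2

    return (m / len1 + m / len2 + (m - t) / m) / 3


print(round(jaro("Facebook", "Firebook"), 2))  # 0.83
```

For Facebook vs Firebook this finds m = 6 and t = 0, the same values as in the worked example.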

Now that we know how to calculate the Jaro distance, it’s time to understand how to calculate the *Jaro-Winkler distance*!

$$d_w = d_j + l \cdot p \cdot (1 - d_j)$$

**l**: Length of the common prefix at the start of the strings, up to a maximum of 4 characters.

**p**: Constant scaling factor for how much the score is adjusted upwards for having common prefixes. Normally we use p = 0.1.

Continuing the previous example of Firebook vs Facebook:

dj : 0.83

prefix : character F -> 1 character -> l = 1

p : 0.1

dw = 0.83 + 1 * 0.1 * (1 - 0.83)

dw = 0.847

Jaro-Winkler distance = 84.7%
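The prefix adjustment is small enough to sketch in a few lines. The helper below is hypothetical (its name is ours, not from any library): it takes an already computed Jaro score `dj` and applies the bonus for a common prefix of at most 4 characters.

```python
def jaro_winkler_from_jaro(dj: float, s1: str, s2: str, p: float = 0.1) -> float:
    """Apply the Winkler prefix bonus to an already computed Jaro score dj."""
    # l = length of the common prefix, capped at 4 characters
    l = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return dj + l * p * (1 - dj)


# Same numbers as the worked example: dj = 0.83, common prefix "F" (l = 1)
dw = jaro_winkler_from_jaro(0.83, "Facebook", "Firebook")
print(round(dw, 3))  # 0.847
```

Because the bonus is scaled by (1 - dj), the adjusted score can never exceed 1.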

The intuition behind `price_diff_ratio` is that if the price of a product is much higher or lower than the price of the *top 100 coolest product* (the **keyword**), then the product probably does not match the keyword we are looking for.

For instance, take one of our keywords: **Apple Ipad Pro**.

The `refer_price` of Apple Ipad Pro is around SGD 1081 (using an exchange rate of 1 USD = 1.37 SGD). For a search result priced at SGD 51, the price difference ratio is (51 - 1081)/1081 ~= -0.95.

In other words, the product's `price` is about 95% lower than `refer_price`, so there is a high probability that the product does not match our keyword, in this case Apple Ipad Pro.
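As a quick sanity check of the arithmetic, the ratio can be written as a tiny function. The function name is ours, for illustration only; the numbers are the ones from the Apple Ipad Pro example.

```python
def price_diff_ratio(price: float, refer_price: float) -> float:
    """Relative gap between a listing's price and the reference price."""
    return (price - refer_price) / refer_price


ratio = price_diff_ratio(51, 1081)
print(round(ratio, 2))  # -0.95
```

A ratio near 0 means the listing is priced close to the reference price, while a ratio near -1 means it is far cheaper, as is typical for accessories such as cases or cables.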

import Levenshtein as L  # assumption: L is the python-Levenshtein package

for index, row in data.iterrows():
    data.loc[index, 'dist_jw'] = L.jaro_winkler(row['name'], row['refer_name'])

data['price_diff_ratio'] = (data['price'] - data['refer_price']) / data['refer_price']

With the code below, you should be able to reproduce our plot of `dist_jw` against `price_diff_ratio`.

import seaborn as sns

sns.scatterplot(data=data, x='dist_jw', y='price_diff_ratio', hue='status').set_title("Relationship between Jaro-Winkler distance and price difference ratio")

Wow! It seems there is a boundary that separates the search results which match the keyword (**status = 1**) from those which do not (**status = 0**).

Using the code below, we can find the best horizontal line separating status = 0 from status = 1 and visualize it.

import numpy as np
import matplotlib.pyplot as plt

# Try every candidate threshold and count how many rows it classifies correctly
count_dict = {}
for x in np.arange(data['price_diff_ratio'].min(), 2, 0.001):
    # Guess status = 1 when the ratio is at or above the threshold
    guess = (data['price_diff_ratio'] >= x).astype(int)
    count_dict[x] = (guess == data['status']).sum()

# Threshold with the highest number of correct guesses
boundary_const = max(count_dict, key=count_dict.get)

ax = sns.scatterplot(
    data=data[data['price_diff_ratio'] <= 1],
    x='dist_jw', y='price_diff_ratio', hue='status')
plt.axhline(y=boundary_const, color='r', linestyle='-')
plt.show()

From the figure above, we can see that the best horizontal line separating the two statuses lies at around -0.55.

The classification rule is:

- Below -0.55 will be classified as status = 0.
- Above -0.55 will be classified as status = 1.

Based on the above classification rule, we get roughly **94%** accuracy! We shouldn’t take this number too seriously, because the boundary was chosen using the whole dataset; to avoid data leakage, it should be fitted on the training data only and evaluated on held-out data. So this number just gives us a rough idea of what to expect from the machine learning models later.
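To make the data leakage point concrete, here is a minimal sketch of fitting the cut-off on a training split only and then scoring it on held-out data. The function names and the toy numbers are made up for illustration; they are not our dataset.

```python
def rule_accuracy(values, labels, c):
    """Accuracy of the rule: predict 1 when value >= c, else 0."""
    return sum((v >= c) == bool(y) for v, y in zip(values, labels)) / len(values)


def best_threshold(values, labels):
    """Pick the cut-off with the highest accuracy on the given data.

    The rule's predictions only change at observed values, so it is
    enough to try each distinct value as a candidate cut-off.
    """
    candidates = sorted(set(values))
    return max(candidates, key=lambda c: rule_accuracy(values, labels, c))


# Toy price_diff_ratio values and statuses (illustrative only)
train_x = [-0.95, -0.80, -0.70, -0.60, -0.40, -0.20, 0.00, 0.10]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
test_x = [-0.90, -0.45, -0.30, 0.05]
test_y = [0, 1, 1, 1]

c = best_threshold(train_x, train_y)          # fitted on training data only
print(c)                                      # -0.4
print(rule_accuracy(test_x, test_y, c))       # 0.75
```

The accuracy on the held-out points is the honest estimate; the accuracy on the data used to pick the cut-off will usually look better than it should.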

This observation gives us the intuition to build a machine learning model that finds the best boundary, so that we gain predictive power!

There are several ways to select the features to include in our model. In our case, we use the **p-value** to select suitable features.

What is a p-value?

The probability of observing a value at least as extreme as the one observed, given that the null hypothesis is true.

If you want to learn more about **p-value**, do visit the links here:

If the **p-value** of a variable is smaller than the significance level, the variable is statistically significant, and vice versa. We choose **0.05** as our significance level.

First, we run a **logit** model on the `dist_jw`, `price_diff_ratio` and `discount` variables. Do refer to the links below for a **detailed** explanation of the **Logit model**.

Run the code below to get started!

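import statsmodels.api as sm

# Note: no intercept is added here; sm.add_constant(...) would add one
logit_model = sm.Logit(data['status'], data[["dist_jw", "price_diff_ratio", "discount"]])
result = logit_model.fit()
print(result.summary2())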

We can see that the p-value of `discount` (0.8549) is **larger than 0.05**, so the **discount variable** is statistically **insignificant**. The p-values of the other variables, `dist_jw` and `price_diff_ratio`, are **less than 0.05**, so they are statistically **significant** and **will be included** in our machine learning models later.
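The selection rule itself is a one-line filter. In the sketch below, only the `discount` p-value comes from our output; the other two numbers are placeholders that merely respect "less than 0.05", and the function name is ours.

```python
def select_significant(p_values: dict, alpha: float = 0.05) -> list:
    """Keep the variables whose p-value is below the significance level."""
    return [name for name, p in p_values.items() if p < alpha]


p_values = {
    "dist_jw": 0.001,           # placeholder, not the value from our output
    "price_diff_ratio": 0.001,  # placeholder, not the value from our output
    "discount": 0.8549,         # reported above
}
selected = select_significant(p_values)
print(selected)  # ['dist_jw', 'price_diff_ratio']
```

With statsmodels, the fitted result exposes the p-values as `result.pvalues`, so `select_significant(result.pvalues.to_dict())` should give the same list for the real model.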

There are three basic machine learning models which we will be using:

- Logistic regression for classification
- Support Vector Machine for classification @Rohith Gandhi
- Random Forest for classification @Niklas Donges

Wait… We still need to split our data into training data and test data before training our models. In our case, we use **80%** of the data for training and **20%** for testing.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data[["dist_jw", "price_diff_ratio"]], data['status'], test_size=0.2, random_state=0)

Let us begin to train and predict our first machine learning model.

**Logistic regression for classification**

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Wow!!! Using just two features, we get 92% accuracy without fine-tuning our machine learning model. Good stuff! This 92% accuracy will act as the benchmark for our other machine learning models.

To visualize this further, we plot the decision boundary calculated by logistic regression!

import numpy as np
import matplotlib.pyplot as plt

# Evaluate P(y = 1) on a dense grid covering both features
xx, yy = np.mgrid[-5:5:.01, -5:5:.01]
grid = np.c_[xx.ravel(), yy.ravel()]
probs = logreg.predict_proba(grid)[:, 1].reshape(xx.shape)

f, ax = plt.subplots(figsize=(8, 6))
contour = ax.contourf(xx, yy, probs, 25, cmap="RdBu",
                      vmin=0, vmax=1)
ax_c = f.colorbar(contour)
ax_c.set_label("$P(y = 1)$")
ax_c.set_ticks([0, .25, .5, .75, 1])

# Overlay the training points, coloured by their true status
ax.scatter(X_train.iloc[:,0], X_train.iloc[:,1], c=y_train, s=50,
           cmap="RdBu", vmin=-.2, vmax=1.2,
           edgecolor="white", linewidth=1)

ax.set(aspect="equal",
       xlim=(-5, 5), ylim=(-5, 5),
       xlabel="$X_1$", ylabel="$X_2$")

After visualizing the boundary, we proceed to plot the confusion matrix. I referred to this link to plot our confusion matrix.

import itertools
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

confusion_mat = confusion_matrix(y_test, y_pred)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(confusion_mat, classes=['Incorrect', 'Correct'], title='Confusion matrix, without normalization')

To conclude, 220 out of 239 test samples are predicted correctly. Neither class has a noticeably higher misclassification rate.
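The headline figure follows directly from those two counts, and the per-class error rates can be read off any 2x2 confusion matrix in the same way. The sketch below assumes the `[[TN, FP], [FN, TP]]` layout that scikit-learn's `confusion_matrix` returns for labels 0 and 1; the matrix cells are made-up numbers, since only the totals are quoted above.

```python
def summarize(cm):
    """Accuracy and per-class error rates from a [[TN, FP], [FN, TP]] matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "false_positive_rate": fp / (fp + tn),  # true 0 predicted as 1
        "false_negative_rate": fn / (fn + tp),  # true 1 predicted as 0
    }


# Accuracy from the counts reported above
print(round(220 / 239, 2))  # 0.92

# Made-up matrix, for illustration only
toy_cm = [[40, 10], [5, 45]]
stats = summarize(toy_cm)
print(stats["accuracy"])  # 0.85
```

Comparing the two error rates is one way to check whether a model is noticeably worse on one class than on the other.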

Let us further visualize the ROC curve.

In a Receiver Operating Characteristic (ROC) curve the true positive rate (Sensitivity) is plotted in function of the false positive rate (100-Specificity) for different cut-off points.

Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold.

A test with perfect discrimination (no overlap in the two distributions) has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity).

Therefore the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test (Zweig & Campbell, 1993).

— by https://www.medcalc.org/manual/roc-curves.php
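To see what "different cut-off points" means in practice, here is a small pure-Python sketch that sweeps the threshold over every distinct score, collects the (false positive rate, true positive rate) pairs and computes the area with the trapezoidal rule. The labels and scores are toy values, and the sketch assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists, one point per distinct score used as cut-off."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    fpr, tpr = [0.0], [0.0]
    for thr in sorted(set(scores), reverse=True):
        # Everything scored at or above the cut-off is predicted positive
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        tpr.append(tp / pos)
        fpr.append(fp / neg)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))


y_true = [0, 0, 1, 1]           # toy labels
scores = [0.1, 0.4, 0.35, 0.8]  # toy predicted probabilities
fpr, tpr = roc_points(y_true, scores)
print(auc(fpr, tpr))  # 0.75
```

A model that ranks every positive above every negative reaches an area of 1, and the dashed diagonal in a ROC plot corresponds to an area of 0.5.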

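from sklearn.metrics import roc_auc_score, roc_curve

# Note: the area is computed from hard 0/1 predictions here, while the curve
# below uses predicted probabilities; pass logreg.predict_proba(X_test)[:, 1]
# to roc_auc_score to get the area under the plotted curve
logit_roc_auc = roc_auc_score(y_test, logreg.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])

plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()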

We can observe that the area under the ROC curve (0.92) is close to 1 for logistic regression, meaning the model separates the two classes well!

**Support Vector Machine for classification**

from sklearn.svm import SVC

clf = SVC(kernel='rbf')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Accuracy of SVM classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))

The Support Vector Machine classifier beats our 92% benchmark by 2 percentage points!

**Random Forest for classification**

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))

The Random Forest classifier beats our 92% benchmark by 4 percentage points and is the best of the three models we have tested. We can achieve **96%** accuracy without any fine-tuning of the machine learning model. **This means that if we are able to create good features, we do not need to spend a lot of time fine-tuning our model to achieve a desirable accuracy!**

Possible future improvements:

- Include more variables, for example the likes, comments and ratings of each keyword.
- Perform more string manipulation on each keyword to obtain more search results for our analysis and modeling.

If you want us to fine-tune our machine learning models, please let us know by commenting below!

Link to code: Top100Gadgets

Feel free to **reach out to me** too :)

Stay tuned for my next post!
