If you haven't read my friend's post for part 1, please do so, as here we will continue with the remaining part of the project.
We would like to predict whether each search result from iprice matches one of the top 100 coolest electronic gadgets.
The features we considered to be included are:
- dist_jw: Jaro-Winkler distance
- price_diff_ratio: (price - refer_price) / refer_price
- discount: discount percentage
“In computer science and statistics, the Jaro-Winkler distance is a string metric for measuring the edit distance between two sequences.
Informally, the Jaro distance between two words is the minimum number of single-character transpositions required to change one word into the other.
The Jaro-Winkler distance uses a prefix scale which gives more favorable ratings to strings that match from the beginning for a set prefix length”
— Source: Wikipedia.
We will be diving into the math so that it is easier for us to fully understand!
Jaro distance is defined as:
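$$
d_j =
\begin{cases}
0 & \text{if } m = 0 \\[4pt]
\dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise}
\end{cases}
$$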
Wow, this seems quite complicated… I would prefer to sleep… Nah, I promise you will fully understand after these few examples.
But first, we need to understand what these terms mean.
- dj: the Jaro distance
- m: the number of matching characters that appear in both s1 and s2
- t: the number of transpositions (the count of matching characters that appear in a different order in s1 and s2, divided by 2)
- |s1|: the length of the first string
- |s2|: the length of the second string
Let's use an example to explain the math.
How do we calculate the Jaro distance between Facebook and Firebook?
- matching characters: Febook -> 6 characters -> m = 6
- no transpositions needed: t = 0
- length of the 1st string: Facebook -> 8 characters -> |s1| = 8
- length of the 2nd string: Firebook -> 8 characters -> |s2| = 8
dj = (1/3) * ((6/8) + (6/8) + ((6 - 0)/6))
dj ≈ 0.83
Jaro distance = 83%
After knowing how to calculate Jaro distance, it’s time to understand how to calculate Jaro-Winkler distance!
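$$
d_w = d_j + l \cdot p \cdot (1 - d_j)
$$
where: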
- l: length of the common prefix at the start of the strings, up to a maximum of 4 characters
- p: a constant scaling factor for how much the score is adjusted upwards for having common prefixes; normally we use p = 0.1
Continuing with the previous example of Firebook vs Facebook:
- dj: 0.83
- common prefix: character F -> 1 character -> l = 1
- p: 0.1

dw = 0.83 + 1 * 0.1 * (1 - 0.83)
dw ≈ 0.847
Jaro-Winkler distance = 84.7%
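As a quick sanity check, here is a minimal sketch reproducing both numbers with the python-Levenshtein package (assuming this is the library behind the `L` alias we use for feature engineering below):

import Levenshtein as L  # assuming python-Levenshtein

print(L.jaro('Facebook', 'Firebook'))          # ~0.833
print(L.jaro_winkler('Facebook', 'Firebook'))  # 0.85; the library keeps dj unrounded, while we rounded to 0.83 and got 0.847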
The intuition behind the price_diff_ratio feature is that if the price of a product is much higher or lower than the price of the top 100 coolest product (the keyword), then the product probably does not match the keyword we want to find.
For instance, take one of our keywords: Apple Ipad Pro.
Product that MATCH our keyword (Apple Ipad Pro)
Product that DO NOT MATCH our keyword (Apple Ipad Pro)
The refer_price of Apple Ipad Pro is around SGD 1081 (using an exchange rate of 1 USD = 1.37 SGD). For the non-matching product above, priced at SGD 51, the price difference ratio = (51 - 1081)/1081 ≈ -0.95.
In other words, there is a 95% price difference between the product's price and the refer_price, so there is a high probability that this product does not match our keyword, Apple Ipad Pro.
import Levenshtein as L  # assuming python-Levenshtein

# Jaro-Winkler similarity between the scraped product name and the keyword's reference name
for index, row in data.iterrows():
    data.loc[index, 'dist_jw'] = L.jaro_winkler(row['name'], row['refer_name'])

# Price difference ratio relative to the reference price
data['price_diff_ratio'] = (data['price'] - data['refer_price']) / data['refer_price']
With the code below, you should be able to reproduce our result.
import seaborn as sns

sns.scatterplot(data=data, x='dist_jw', y='price_diff_ratio', hue='status') \
   .set_title("Relationship between Jaro-Winkler distance and price difference ratio")
Wow! It seems there exists a boundary that can separate searched products that match the keyword from those that do not (status = 1 or 0).
Using the code below, we can find the best horizontal line separating status = 0 from status = 1 and visualize it.
count_dict = {}
x = data['price_diff_ratio'].min()

# Sweep candidate thresholds and count how many rows each one classifies correctly
while x < 2:
    guess = (data['price_diff_ratio'] >= x).astype(int)
    count_dict[x] = (data['status'] == guess).sum()
    x += 0.001

# Keep the threshold with the most correct guesses
boundary_const = max(count_dict, key=count_dict.get)
import matplotlib.pyplot as plt

ax = sns.scatterplot(data=data[data['price_diff_ratio'] <= 1],
                     x='dist_jw', y='price_diff_ratio', hue='status')
plt.axhline(y=boundary_const, color='r', linestyle='-')
plt.show()
From the figure above, we can see that the best horizontal line separating the statuses sits at around -0.55.
The classification rule is: predict a match (status = 1) when price_diff_ratio >= boundary_const ≈ -0.55, and no match (status = 0) otherwise.
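In pandas, this rule is a two-liner (a sketch; rule_pred is a hypothetical column name):

# Predict a match whenever price_diff_ratio is above the learned boundary
data['rule_pred'] = (data['price_diff_ratio'] >= boundary_const).astype(int)
print('Rule accuracy:', (data['rule_pred'] == data['status']).mean())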
Based on the above classification rule, we get roughly 94% accuracy! We shouldn't take this number too seriously, though: to avoid data leakage, the rule should really be fitted on training data only. So this number just gives us a rough idea of what the machine learning models might achieve later.
This observation gives us the intuition to build a machine learning model that finds the best boundary, so that we have real predictive power!
There are several ways to select the features to include in a model; in our case, we use p-values to select suitable features.
What is a p-value?
It is the probability of obtaining a result at least as extreme as the one observed, given that the null hypothesis is true.
If you want to learn more about p-values, do visit the links here:
If the p-value of a variable is smaller than the significance level, then the variable is statistically significant, and vice versa. We choose 0.05 as our significance level.
First, we run a logit model on the dist_jw, price_diff_ratio, and discount variables. Do refer to the links below for a detailed explanation of the logit model.
Run the code below to get started!
import statsmodels.api as sm

logit_model = sm.Logit(data['status'], data[["dist_jw", "price_diff_ratio", "discount"]])
result = logit_model.fit()
print(result.summary2())
We can see that the p-value of discount (0.8549) is larger than 0.05, so the discount variable is statistically insignificant. The p-values of dist_jw and price_diff_ratio are smaller than 0.05, so those variables are statistically significant; they are the ones we will include in our machine learning models later.
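We can also pull the significant variables out of the fitted result programmatically (a sketch; the significant variable name is ours):

# Keep only variables whose p-value is below our 0.05 significance level
significant = result.pvalues[result.pvalues < 0.05].index.tolist()
print(significant)  # expected here: ['dist_jw', 'price_diff_ratio']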
There are three basic machine learning models which we will be using: logistic regression, support vector machine (SVM), and random forest.
Wait… We still need to split our data into a training set and a test set before training our models. In our case, we use 80% of the data for training and 20% for testing.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data[["dist_jw", "price_diff_ratio"]], data['status'],
    test_size=0.2, random_state=0)
Let us begin by training our first machine learning model and making predictions.
Logistic regression for classification
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(
    logreg.score(X_test, y_test)))
Wow!!! Using just two features, we get 92% accuracy without any fine-tuning. Good stuff! This 92% accuracy will act as the benchmark for our other machine learning models.
To visualize this further, we plot the decision boundary calculated by logistic regression!
import numpy as np

# Evaluate P(y = 1) over a grid of feature values
xx, yy = np.mgrid[-5:5:.01, -5:5:.01]
grid = np.c_[xx.ravel(), yy.ravel()]
probs = logreg.predict_proba(grid)[:, 1].reshape(xx.shape)

f, ax = plt.subplots(figsize=(8, 6))
contour = ax.contourf(xx, yy, probs, 25, cmap="RdBu", vmin=0, vmax=1)
ax_c = f.colorbar(contour)
ax_c.set_label("$P(y = 1)$")
ax_c.set_ticks([0, .25, .5, .75, 1])

ax.scatter(X_train.iloc[:, 0], X_train.iloc[:, 1], c=y_train, s=50,
           cmap="RdBu", vmin=-.2, vmax=1.2, edgecolor="white", linewidth=1)

ax.set(aspect="equal", xlim=(-5, 5), ylim=(-5, 5),
       xlabel="$X_1$", ylabel="$X_2$")
After visualizing the boundary, we proceed to plot the confusion matrix. I referred to this link to plot ours.
import itertools

def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix', cmap=plt.cm.Blues):
    """Print and plot the confusion matrix.
    Normalization can be applied by setting `normalize=True`."""
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
from sklearn.metrics import confusion_matrix

confusion_mat = confusion_matrix(y_test, y_pred)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(confusion_mat, classes=['Incorrect', 'Correct'],
                      title='Confusion matrix, without normalization')
To conclude, 220 out of 239 test samples are predicted correctly, and neither class suffers from a noticeably higher misprediction rate.
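To check the per-class claim, here is a quick sketch computing each class's accuracy from the confusion matrix (the per_class_acc name is ours):

# Rows are true classes; the diagonal holds the correct predictions
per_class_acc = confusion_mat.diagonal() / confusion_mat.sum(axis=1)
print(per_class_acc)  # one accuracy per class (status = 0 and status = 1)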
Let us further visualize the ROC curve.
In a Receiver Operating Characteristic (ROC) curve the true positive rate (Sensitivity) is plotted in function of the false positive rate (100-Specificity) for different cut-off points.
Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold.
A test with perfect discrimination (no overlap in the two distributions) has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity).
Therefore the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test (Zweig & Campbell, 1993).
from sklearn.metrics import roc_auc_score, roc_curve

logit_roc_auc = roc_auc_score(y_test, logreg.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:, 1])

plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
We can observe that the area under the ROC curve (0.92) is close to 1 for logistic regression, meaning the model discriminates between the two classes well!
Support Vector Machine for classification
from sklearn.svm import SVC

clf = SVC(kernel='rbf')
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print('Accuracy of SVM classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))
The support vector machine classifier beat our 92% benchmark by 2%!
Random Forest for classification
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))
Random forest classification beat our 92% benchmark by 4%, making it the best of the three models we tested. We achieved 96% accuracy without any fine-tuning. In other words, if we create good features, we do not need to spend a lot of time tuning our model to reach a desirable accuracy!
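To recap, here is a minimal sketch that fits and scores all three models in one loop (the models dict is ours):

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Fit and score each classifier on the same 80/20 split
models = {
    'Logistic Regression': LogisticRegression(),
    'SVM (RBF kernel)': SVC(kernel='rbf'),
    'Random Forest': RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print('{}: {:.2f}'.format(name, model.score(X_test, y_test)))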
Who Carries Tech's Top 100 Products of the Year? A Machine Learning Analysis - hackernoon.com
If you want us to fine tune our machine learning model, please let us know by commenting below!
Link to code: Top100Gadgets
Feel free to reach out to me too:)
Stay tuned for my next post!