If you haven't read my friend's post for part 1, please do so, as here we will continue with the remaining part of the project.
We would like to predict whether each search result from iprice matches one of the top 100 coolest electronic gadgets.
The features we considered to be included are:
- dist_jw: Jaro-Winkler distance
- price_diff_ratio: (price - refer_price) / refer_price
- discount: discount percentage
“In computer science and statistics, the Jaro-Winkler distance is a string metric for measuring the edit distance between two sequences.
Informally, the Jaro distance between two words is the minimum number of single-character transpositions required to change one word into the other.
The Jaro-Winkler distance uses a prefix scale which gives more favorable ratings to strings that match from the beginning for a set prefix length”
— Source: Wikipedia.
We will be diving into the math so that it is easier for us to fully understand!
Jaro distance is defined as:
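$$
d_j =
\begin{cases}
0 & \text{if } m = 0 \\[4pt]
\dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise}
\end{cases}
$$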
Wow, this seems quite complicated… I would prefer to sleep… Nah, I promise you will fully understand after these few examples.
But first, we need to understand what these terms mean.
- dj: the Jaro distance
- m: the number of matching characters that appear in both s1 and s2
- t: the number of transpositions (the count of matching characters that appear in a different order in s1 and s2, divided by 2)
- |s1|: the length of the first string
- |s2|: the length of the second string
Let's use an example to explain the math.
How do we calculate the Jaro distance between Facebook and Firebook?
- matching characters: Febook -> 6 characters -> m = 6
- no transpositions needed: t = 0
- length of the 1st string: Facebook -> 8 characters -> |s1| = 8
- length of the 2nd string: Firebook -> 8 characters -> |s2| = 8
dj = (1/3) * ((6/8) + (6/8) + ((6 - 0)/6))
dj ≈ 0.83
Jaro distance = 83%
After knowing how to calculate Jaro distance, it’s time to understand how to calculate Jaro-Winkler distance!
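$$
d_w = d_j + l \cdot p \cdot (1 - d_j)
$$
where: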
- l: length of the common prefix at the start of the strings, up to a maximum of 4 characters
- p: a constant scaling factor for how much the score is adjusted upwards for having common prefixes; normally we use p = 0.1
Continuing with the previous example of Firebook vs Facebook:
- dj: 0.83
- common prefix: character F -> 1 character -> l = 1
- p: 0.1

dw = 0.83 + 1 * 0.1 * (1 - 0.83)
dw ≈ 0.847
Jaro-Winkler distance = 84.7%
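As a quick sanity check, here is a minimal sketch reproducing both numbers with the python-Levenshtein package (assuming this is the library behind the `L` alias we use for feature engineering below):

import Levenshtein as L  # assuming python-Levenshtein

print(L.jaro('Facebook', 'Firebook'))          # ~0.833
print(L.jaro_winkler('Facebook', 'Firebook'))  # 0.85; the library keeps dj unrounded, while we rounded to 0.83 and got 0.847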
The intuition behind the price_diff_ratio feature is that if the price of a product is much higher or lower than the price of the top 100 coolest product (the keyword), then the product probably does not match the keyword we want to find.
For instance, take one of our keywords: Apple Ipad Pro.
Product that MATCH our keyword (Apple Ipad Pro)
Product that DO NOT MATCH our keyword (Apple Ipad Pro)
The refer_price of Apple Ipad Pro is around SGD 1081 (using an exchange rate of 1 USD = 1.37 SGD). For the non-matching product above, priced at SGD 51, the price difference ratio = (51 - 1081)/1081 ≈ -0.95.
In other words, there is a 95% price difference between the product's price and the refer_price, so there is a high probability that this product does not match our keyword, Apple Ipad Pro.
import Levenshtein as L  # assuming python-Levenshtein

# Jaro-Winkler similarity between the scraped product name and the keyword's reference name
for index, row in data.iterrows():
    data.loc[index, 'dist_jw'] = L.jaro_winkler(row['name'], row['refer_name'])

# Price difference ratio relative to the reference price
data['price_diff_ratio'] = (data['price'] - data['refer_price']) / data['refer_price']
With the code below, you should be able to reproduce our result.
import seaborn as sns

sns.scatterplot(data=data, x='dist_jw', y='price_diff_ratio', hue='status') \
   .set_title("Relationship between Jaro-Winkler distance and price difference ratio")
Wow! It seems there exists a boundary that can separate searched products that match the keyword from those that do not (status = 1 or 0).
Using the code below, we can find the best horizontal line separating status = 0 from status = 1 and visualize it.
count_dict = {}
x = data['price_diff_ratio'].min()

# Sweep candidate thresholds and count how many rows each one classifies correctly
while x < 2:
    guess = (data['price_diff_ratio'] >= x).astype(int)
    count_dict[x] = (data['status'] == guess).sum()
    x += 0.001

# Keep the threshold with the most correct guesses
boundary_const = max(count_dict, key=count_dict.get)
import matplotlib.pyplot as plt

ax = sns.scatterplot(data=data[data['price_diff_ratio'] <= 1],
                     x='dist_jw', y='price_diff_ratio', hue='status')
plt.axhline(y=boundary_const, color='r', linestyle='-')
plt.show()
From the figure above, we can see that the best horizontal line separating the statuses sits at around -0.55.
The classification rule is: predict a match (status = 1) when price_diff_ratio >= boundary_const ≈ -0.55, and no match (status = 0) otherwise.
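In pandas, this rule is a two-liner (a sketch; rule_pred is a hypothetical column name):

# Predict a match whenever price_diff_ratio is above the learned boundary
data['rule_pred'] = (data['price_diff_ratio'] >= boundary_const).astype(int)
print('Rule accuracy:', (data['rule_pred'] == data['status']).mean())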
Based on the above classification rule, we get roughly 94% accuracy! We shouldn't take this number too seriously, though: to avoid data leakage, the rule should really be fitted on training data only. So this number just gives us a rough idea of what the machine learning models might achieve later.
This observation gives us the intuition to build a machine learning model that finds the best boundary, so that we have real predictive power!
There are several ways to select the features to include in a model; in our case, we use p-values to select suitable features.
What is a p-value?
It is the probability of obtaining a result at least as extreme as the one observed, given that the null hypothesis is true.
If you want to learn more about p-values, do visit the links here:
If the p-value of a variable is smaller than the significance level, then the variable is statistically significant, and vice versa. We choose 0.05 as our significance level.
First, we run a logit model on the dist_jw, price_diff_ratio, and discount variables. Do refer to the links below for a detailed explanation of the logit model.
Run the code below to get started!
import statsmodels.api as sm

logit_model = sm.Logit(data['status'], data[["dist_jw", "price_diff_ratio", "discount"]])
result = logit_model.fit()
print(result.summary2())
We can see that the p-value of discount (0.8549) is larger than 0.05, so the discount variable is statistically insignificant. The p-values of dist_jw and price_diff_ratio are smaller than 0.05, so those variables are statistically significant; they are the ones we will include in our machine learning models later.
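We can also pull the significant variables out of the fitted result programmatically (a sketch; the significant variable name is ours):

# Keep only variables whose p-value is below our 0.05 significance level
significant = result.pvalues[result.pvalues < 0.05].index.tolist()
print(significant)  # expected here: ['dist_jw', 'price_diff_ratio']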
There are three basic machine learning models which we will be using: logistic regression, support vector machine (SVM), and random forest.
Wait… We still need to split our data into a training set and a test set before training our models. In our case, we use 80% of the data for training and 20% for testing.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data[["dist_jw", "price_diff_ratio"]], data['status'],
    test_size=0.2, random_state=0)
Let us begin by training our first machine learning model and making predictions.
Logistic regression for classification
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(
    logreg.score(X_test, y_test)))
Wow!!! Using just two features, we get 92% accuracy without any fine-tuning. Good stuff! This 92% accuracy will act as the benchmark for our other machine learning models.
To visualize this further, we plot the decision boundary calculated by logistic regression!
import numpy as np

# Evaluate P(y = 1) over a grid of feature values
xx, yy = np.mgrid[-5:5:.01, -5:5:.01]
grid = np.c_[xx.ravel(), yy.ravel()]
probs = logreg.predict_proba(grid)[:, 1].reshape(xx.shape)

f, ax = plt.subplots(figsize=(8, 6))
contour = ax.contourf(xx, yy, probs, 25, cmap="RdBu", vmin=0, vmax=1)
ax_c = f.colorbar(contour)
ax_c.set_label("$P(y = 1)$")
ax_c.set_ticks([0, .25, .5, .75, 1])

ax.scatter(X_train.iloc[:, 0], X_train.iloc[:, 1], c=y_train, s=50,
           cmap="RdBu", vmin=-.2, vmax=1.2, edgecolor="white", linewidth=1)

ax.set(aspect="equal", xlim=(-5, 5), ylim=(-5, 5),
       xlabel="$X_1$", ylabel="$X_2$")
After visualizing the boundary, we proceed to plot the confusion matrix. I referred to this link to plot ours.
import itertools

def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix', cmap=plt.cm.Blues):
    """Print and plot the confusion matrix.
    Normalization can be applied by setting `normalize=True`."""
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
from sklearn.metrics import confusion_matrix

confusion_mat = confusion_matrix(y_test, y_pred)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(confusion_mat, classes=['Incorrect', 'Correct'],
                      title='Confusion matrix, without normalization')
To conclude, 220 out of 239 test samples are predicted correctly, and neither class suffers from a noticeably higher misprediction rate.
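To check the per-class claim, here is a quick sketch computing each class's accuracy from the confusion matrix (the per_class_acc name is ours):

# Rows are true classes; the diagonal holds the correct predictions
per_class_acc = confusion_mat.diagonal() / confusion_mat.sum(axis=1)
print(per_class_acc)  # one accuracy per class (status = 0 and status = 1)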
Let us further visualize the ROC curve.
In a Receiver Operating Characteristic (ROC) curve the true positive rate (Sensitivity) is plotted in function of the false positive rate (100-Specificity) for different cut-off points.
Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold.
A test with perfect discrimination (no overlap in the two distributions) has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity).
Therefore the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test (Zweig & Campbell, 1993).
from sklearn.metrics import roc_auc_score, roc_curve

logit_roc_auc = roc_auc_score(y_test, logreg.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:, 1])

plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
We can observe that the area under the ROC curve (0.92) is close to 1 for logistic regression, meaning the model discriminates between the two classes well!
Support Vector Machine for classification
from sklearn.svm import SVC

clf = SVC(kernel='rbf')
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print('Accuracy of SVM classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))
The support vector machine classifier beat our 92% benchmark by 2%!
Random Forest for classification
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))
Random forest classification beat our 92% benchmark by 4%, making it the best of the three models we tested. We achieved 96% accuracy without any fine-tuning. In other words, if we create good features, we do not need to spend a lot of time tuning our model to reach a desirable accuracy!
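To recap, here is a minimal sketch that fits and scores all three models in one loop (the models dict is ours):

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Fit and score each classifier on the same 80/20 split
models = {
    'Logistic Regression': LogisticRegression(),
    'SVM (RBF kernel)': SVC(kernel='rbf'),
    'Random Forest': RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print('{}: {:.2f}'.format(name, model.score(X_test, y_test)))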
Who Carries Tech's Top 100 Products of the Year? A Machine Learning Analysis - hackernoon.com
If you want us to fine tune our machine learning model, please let us know by commenting below!
Link to code: Top100Gadgets
Feel free to reach out to me too:)
Stay tuned for my next post!