Predicting Cost of Tender with 99.24% Accuracy: Miracle!

by Uddeshya Singh, September 2nd, 2018
Data Science is reaching new levels and so are the models. But reaching a whopping 99.24% accuracy using simple feature engineering and a simple Decision Tree Classifier?

That’s new!

Hello everyone, today I am going to present to you my model, which can predict the value range of a tender in the Seattle Trade Permits dataset with a whopping accuracy of 99.24% (with some obvious caveats, which I will discuss at the end).

My Kernel: Yet Another Value Prediction

The code reference for this blog

Basic EDA

This time around, I am going to use the plotly library in Python. It is one of the best options for interactive plots, and if you visit the kernel, you will see why.
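The original notebook's setup isn't reproduced in this post, so here is a minimal sketch of the assumed boilerplate: plotly's offline mode inside a Jupyter notebook, with the permit data loaded into a DataFrame called df (the CSV filename below is just a placeholder, not the kernel's actual path).

import pandas as pd
import plotly.graph_objs as go
from plotly.graph_objs import Bar
from plotly.offline import init_notebook_mode, iplot

init_notebook_mode(connected=True)  # render plotly charts inline in the notebook

# placeholder filename; point this at your local copy of the Seattle Trade Permits CSV
df = pd.read_csv('seattle-trade-permits.csv')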










mySummingGroup = (df.drop(columns=['Longitude', 'Latitude', 'Application/Permit Number'])
                    .groupby(by='Contractor')
                    .agg({'Value': sum}))
x = mySummingGroup['Value'].nlargest(10)

data1 = [Bar(y=x,
             x=x.keys(),
             marker=dict(color='rgba(25, 82, 1, .9)'),
             name="Contractor's amount earned per project")]

layout1 = go.Layout(title="Top Grossing Contractors",
                    xaxis=dict(title='Contractor',
                               titlefont=dict(family='Courier New, monospace',
                                              size=18,
                                              color='#7f7f7f')),
                    yaxis=dict(title='Total Amount Earned',
                               titlefont=dict(family='Courier New, monospace',
                                              size=18,
                                              color='#7f7f7f')))

myFigure2 = go.Figure(data=data1, layout=layout1)
iplot(myFigure2)

First of all, we will focus on the Top Grossing Contractors in the Seattle area, the ones who have earned the most from their tender acquisitions.

Next, let's look at how the tenders are distributed across categories:

catCount = df.groupby('Category')['Permit Type'].count()

fig = {
    "data": [{
        "values": catCount,
        "labels": catCount.keys(),
        "domain": {"x": [0, 1]},
        "name": "Categories",
        "hoverinfo": "label+percent+name",
        "hole": .4,
        "type": "pie",
        "textinfo": "value"
    }],
    "layout": {
        "title": "Categorical Distribution of Tenders",
        "annotations": [{
            "font": {"size": 15},
            "showarrow": False,
            "text": "DISTRIBUTION",
            "x": 0.5,
            "y": 0.5
        }]
    }
}

trace = go.Pie(labels=catCount.keys(), values=catCount,
               textinfo='value', hoverinfo='label+percent', textfont=dict(size=15))
iplot(fig)

Similarly, one could plot another graph for the amount earned per project. But another thing that caught my eye was the very high share of the SINGLE FAMILY/DUPLEX category (about 75% of the data).
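As a quick sanity check on that 75% figure (this snippet is not from the original kernel), the same catCount series can be turned into percentages directly:

# percentage share of each permit category, largest first
category_share = (catCount / catCount.sum() * 100).sort_values(ascending=False)
print(category_share.round(2))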

Well, with this basic EDA out of the way, let's move on to feature engineering!

Feature Engineering

First of all, we will encode the Value of a tender with the encoder listed below, which is simple enough to understand: it just buckets the value into five categories.

# My Value Encoder
def valueEncoder(value):
    if value > 10000000:
        return 4
    elif value > 100000:
        return 3
    elif value > 10000:
        return 2
    elif value > 100:
        return 1
    else:
        return 0

df['ValueLabel'] = df['Value'].apply(valueEncoder)

After this, we move on to one-hot encoding our Category variable.
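A note on the snippet below: it refers to a CategoryLabel column and a fitted encoder named genLabel_cat, which the kernel creates beforehand. A minimal sketch of that assumed step, using scikit-learn's LabelEncoder, would be:

from sklearn.preprocessing import LabelEncoder

# assumed prerequisite: map the Category strings to integer labels
genLabel_cat = LabelEncoder()
df['CategoryLabel'] = genLabel_cat.fit_transform(df['Category'])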




from sklearn.preprocessing import OneHotEncoder

cat_ohe = OneHotEncoder()
cat_feature_arr = cat_ohe.fit_transform(df[['CategoryLabel']]).toarray()
cat_feature_labels = list(genLabel_cat.classes_)
cat_features = pd.DataFrame(cat_feature_arr, columns=cat_feature_labels)

cat_features.head(10)

The next step was to simply binary encode the Status column. After that, if we look at the Action Type column, it contains 21 unique entries, and binary encoding 21 more features isn't plausible. How about using feature hashing?
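The binary encoding of Status isn't shown in this post; a minimal sketch of what it could look like (the 'Permit Issued' string is a placeholder, and df2 is the working copy that the later snippets use) is:

# assumed step: reduce the Status column to a 0/1 flag on a working copy of the data
df2 = df.copy()
df2['StatusLabel'] = (df2['Status'] == 'Permit Issued').astype(int)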

from sklearn.feature_extraction import FeatureHasher

fh = FeatureHasher(n_features=5, input_type='string')
hashed_features = fh.fit_transform(df2['Action Type'])
hashed_features = hashed_features.toarray()
df2 = pd.concat([df2, pd.DataFrame(hashed_features)], axis=1).dropna()

Example for feature hashing

Now, since all our features are ready, it's time for model development!

Model Programming

We will make a simple Decision Tree Classifier with max_depth = 5 to prevent overfitting!
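The classifier below expects X_train, X_test, y_train and y_test, which the post doesn't show being built. A plausible sketch, assuming the numeric engineered features of df2 are used with ValueLabel as the target (the column choices here are illustrative, not copied from the kernel):

from sklearn.model_selection import train_test_split

# assumed feature/target assembly from the engineered frame
X = df2.drop(columns=['Value', 'ValueLabel']).select_dtypes(include='number')
y = df2['ValueLabel']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)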




from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

myClassifier2 = DecisionTreeClassifier(max_depth=5, min_samples_leaf=2)
myClassifier2.fit(X_train, y_train)
predictions2 = myClassifier2.predict(X_test)

cnf2 = confusion_matrix(y_test, predictions2)
score2 = accuracy_score(y_test, predictions2)

print ("Confusion Matrix for our Decision Tree classifier is :\n ", cnf2)

print("While the accuracy score for the same is %.2f percent" % (score2 * 100))

As you can see, it worked out pretty well for me. A whopping 99.24% accuracy!

Caveats!

Now, in the Telegram discussion, many people were astounded by the accuracy score. The recall and precision worked out fine for this model. But where could it go wrong?

I would like to draw your attention to the fact that this dataset is largely dominated by a single category, SINGLE FAMILY/DUPLEX, and after a little digging into the dataset, I found that the value label distribution was just as skewed:

About 99.15% of the dataset consists of tenders with a single value label.

So, truth be told, even if my model had just kept shooting arrows in the dark, it would still have had about a 98% chance of stating the correct label, given that the input is a subset of this dataset!
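For anyone who wants to check that imbalance themselves, a one-line inspection of the engineered label (not part of the original post) is enough:

# relative frequency of each value label; one class dominates this dataset
print(df['ValueLabel'].value_counts(normalize=True) * 100)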

Conclusion

The only conclusion left to draw is that it is really easy to get a high-performing classifier on a biased dataset like this one, provided your choice of algorithm is right. (You may need to visit the kernel to understand why I say this.)