First of all, a short disclaimer: I’m not an expert in machine learning at all. In fact, I’m at a rather early stage of the learning process, have only basic knowledge, and this project is my first practical, hands-on piece of ML. I’ve done the machine learning course by Pedro Domingos at the University of Washington, the Intro to Machine Learning courses by Udacity and Google, and the Machine Learning 1 lecture at Karlsruhe Institute of Technology, all of which I can really recommend.

After having gathered all that theoretical knowledge, I wanted to try something practical on my own. I decided to train a simple classifier on chat messages from my Telegram messenger history: a program that, given a chat message, tells who the sender of that message is. I was further inspired by the papers behind Facebook’s fastText text classification algorithm. In their examples, they classify Wikipedia abstracts / descriptions into DBpedia classes, or news article headlines into their respective news categories, based only on plain, natural words. These problems are very similar to mine, so I decided to give it a try.

Since many text classifiers are trained using the Naive Bayes algorithm (especially popular in spam detection and part of SpamAssassin), and since it’s really easy to understand, I decided to go with that one, too. Inspired by this article, where the sentiment of tweets is analyzed, I also chose the Natural Language Toolkit (NLTK) for Python. Another option would have been sklearn, but NLTK provides some useful utilities beyond the pure ML scope.
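To recap why Naive Bayes is so easy to understand: assuming all features are conditionally independent given the class (the “naive” part), the classifier simply picks the sender y with the highest posterior probability, which in LaTeX notation reads:

    \hat{y} \;=\; \underset{y}{\arg\max}\; P(y)\prod_{i=1}^{n} P(f_i \mid y)

Here, P(y) is a sender’s prior probability and the P(f_i | y) are the per-feature likelihoods, both estimated from simple counts over the training messages.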
All of my code is available on GitHub.
Each message is represented as a vector of boolean features, one per vocabulary word, indicating whether that word occurs in the message. For example, given the vocabulary

['in', 'case', 'of', 'fire', 'coffee', 'we', 'trust']

the resulting feature vector for the message "in coffee we trust" would be ('in'=True, 'case'=False, 'of'=False, 'fire'=False, 'coffee'=True, 'we'=True, 'trust'=True).
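In code, such a feature extractor is only a few lines. Here is a minimal sketch; the function name and the plain whitespace tokenization are my own simplifications (my actual code is on GitHub):

    def extract_features(message, vocabulary):
        """Map a message to a boolean bag-of-words feature dict."""
        words = set(message.lower().split())
        return {word: (word in words) for word in vocabulary}

    vocabulary = ['in', 'case', 'of', 'fire', 'coffee', 'we', 'trust']
    print(extract_features('in coffee we trust', vocabulary))
    # {'in': True, 'case': False, 'of': False, 'fire': False, 'coffee': True, 'we': True, 'trust': True}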
Using that basic initial setup on a set of 37,257 messages (7,931 from M, 9,795 from P, 9,314 from F and 10,217 from J), I ended up with an accuracy of 0.58. There seemed to be room for optimization.
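For reference, that basic setup boils down to very little code with NLTK. A minimal sketch, assuming the chat history has already been parsed into (text, sender) pairs; the toy data below is obviously made up:

    import nltk

    # Toy stand-in for the real Telegram export; the actual data set had ~37k messages.
    labeled_messages = [
        ('in coffee we trust', 'M'),
        ('in case of fire grab the coffee', 'P'),
        ('we trust the fire alarm', 'F'),
        ('coffee first, questions later', 'J'),
    ]

    # Build the vocabulary from all words seen in the corpus.
    vocabulary = sorted({w for text, _ in labeled_messages for w in text.lower().split()})

    def extract_features(message):
        words = set(message.lower().split())
        return {word: (word in words) for word in vocabulary}

    feature_sets = [(extract_features(text), sender) for text, sender in labeled_messages]

    # With the real data one would of course hold out a test set;
    # the toy corpus is reused here for both training and evaluation.
    classifier = nltk.NaiveBayesClassifier.train(feature_sets)
    print(nltk.classify.accuracy(classifier, feature_sets))
    classifier.show_most_informative_features(5)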
With these optimizations, I ended up with an accuracy of 0.61 after a training time of 348 seconds (I didn’t log testing time at that point).
Certainly, 61 % accuracy doesn’t make for a really good classifier, but it is at least significantly better than random guessing (a chance of 1/4 in this case). For comparison, I also trained a fastText classifier on my data as a baseline (a sketch of that setup follows at the end of this post), and even it reached only 60 % accuracy, although with a much better training time of only 0.66 seconds. My intuitive explanation for these rather poor results is the complexity of the problem itself: given only a set of words, without any context or semantics, it’s hard not only for a machine to predict a message’s sender, but for a human, too. Given more training data (I’d need a longer message history) and more computing power to handle larger feature sets, the accuracy might improve slightly further. Admittedly, the practical relevance of this project isn’t very high, but it was good practice for me to get into the basics of ML, and it was really fun!
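For the record, here is roughly what the fastText baseline looks like with the official Python bindings (the file names are placeholders); fastText expects one message per line, prefixed with its label:

    import fasttext

    # train.txt / test.txt contain lines like:
    # __label__M in coffee we trust
    model = fasttext.train_supervised(input='train.txt')

    # test() returns (number of samples, precision@1, recall@1);
    # with exactly one label per message, precision@1 equals accuracy.
    n, precision, recall = model.test('test.txt')
    print(f'accuracy over {n} messages: {precision:.2f}')

    print(model.predict('in coffee we trust'))  # (labels, probabilities)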
Please leave me some feedback if you like.
Originally published at ferdinand-muetsch.de.