Intro

First of all, a short disclaimer: I’m not an expert in machine learning at all. In fact, I’m at a rather early stage of the learning process, have only basic knowledge, and this project is more or less my first practical, hands-on ML work. I’ve done the machine learning course by Pedro Domingos at University of Washington, Intro to Machine Learning by Udacity and Google and the Machine Learning 1 lecture at Karlsruhe Institute Of Technology, all of which I can really recommend.

After having gathered all that theoretical knowledge, I wanted to try something practical on my own. I decided to train a simple classifier for chat messages from my Telegram messenger history. I wanted to build a program that can, given a chat message, tell who the sender of that message is. I got further inspired after reading the papers related to fastText, Facebook’s text classification algorithm. In their examples they classify Wikipedia abstracts / descriptions to DBPedia classes, or news article headlines to their respective news categories, based only on plain, natural words. These problems are basically very similar to mine, so I decided to give it a try.

Since I found that many text classifiers are trained using the Naive Bayes algorithm (especially popular in spam detection and part of SpamAssassin) and it’s really easy to understand, I decided to go for that one, too. Inspired by this article, where the sentiment of tweets is analyzed, I chose to also use the natural language toolkit (NLTK) for Python. Another option would have been sklearn, but NLTK also provides some useful utilities beyond the pure ML scope.

All of my code is available on GitHub.

Basic Steps

The very first step was to download the data, namely the chat messages. Luckily, Telegram has an open API. However, it’s not a classical REST API; instead they’re using the MTProto protocol.
I found vysheng/tg as a cool C++-written command-line client on GitHub, as well as tvdstaaij/telegram-history-dump as a Ruby script to automate the history download for a set of users / chat partners. I told the script (run in a Docker container, since I didn’t want to install Ruby) to fetch at most 40,000 messages for my top three chat partners (let’s call them M, P and J). The outcome was three JSON Lines files.

To pre-process these files as needed for my learning algorithm, I wrote a Python script that extracted only the message text and sender from all incoming messages and dumped this data to a JSON file. Additionally, I extracted the same information for all outgoing messages, i.e. all messages where the sender was me (F). Consequently, there are four classes: C = { M, P, J, F }

Another data pre-processing step was to convert the JSON objects, which have class names as keys for message arrays, into one large list of tuples of the form (text, label), where label is the name of the message’s sender and text is the respective message text. In this step I also discarded words with a length of less than 2 characters and converted everything to lower case.

The next step was to extract the features. In text classification, there is often one binary (contains / does not contain) feature for every possible word. So if all messages together comprise X different words, there will be an X-dimensional feature vector.

The last step before actually training the classifier is to compute the feature vector for every message. For example, if the total feature set is ['in', 'case', 'of', 'fire', 'coffee', 'we', 'trust'], the resulting feature vector for a message "in coffee we trust" would be ('in'=True, 'case'=False, 'of'=False, 'fire'=False, 'coffee'=True, 'we'=True, 'trust'=True).

One more minor thing: shuffle the feature set so that the order of messages and message senders is random.
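The pre-processing and feature steps above can be sketched roughly as follows. This is only a minimal illustration, not the code from my repository: the function names (tokenize, to_labeled_messages, build_vocabulary, extract_features), the whitespace tokenization and the tiny sample data are assumptions made for the example.

```python
import random


def tokenize(text):
    """Lower-case a message and drop words shorter than 2 characters."""
    return [word for word in text.lower().split() if len(word) >= 2]


def to_labeled_messages(messages_by_sender):
    """Flatten {sender: [message, ...]} into a list of (text, label) tuples."""
    return [
        (text, sender)
        for sender, texts in messages_by_sender.items()
        for text in texts
    ]


def build_vocabulary(labeled_messages):
    """Collect every distinct word that occurs in any message."""
    vocabulary = set()
    for text, _label in labeled_messages:
        vocabulary.update(tokenize(text))
    return sorted(vocabulary)


def extract_features(text, vocabulary):
    """One binary feature per vocabulary word: does the message contain it?"""
    words = set(tokenize(text))
    return {word: (word in words) for word in vocabulary}


# Hypothetical sample data standing in for the real chat history.
messages_by_sender = {
    "M": ["In coffee we trust"],
    "P": ["In case of fire"],
}

labeled = to_labeled_messages(messages_by_sender)
vocabulary = build_vocabulary(labeled)
feature_set = [
    (extract_features(text, vocabulary), label) for text, label in labeled
]

# Shuffle so that the order of messages and senders is random
# (a fixed seed is used here only to keep the example reproducible).
random.Random(42).shuffle(feature_set)
```

A list of (feature dictionary, label) pairs like feature_set is the shape of input that NLTK’s Naive Bayes classifier expects for training.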
Then divide the feature set into training and test data, where the test data contains about 10 % of the number of messages in the training data.

Train the classifier with nltk.NaiveBayesClassifier. This is really just one line of code.

Use the returned classifier to predict classes for the test messages, validate them and compute the accuracy.

Using that basic initial setup on a set of 37257 messages (7931 from M, 9795 from P, 9314 from F and 10217 from J), I ended up with an accuracy of 0.58. There seemed to be room for optimization.

Optimizations

Inspired by fastText, I decided to include n-grams. This seemed reasonable to me, because intuitively I’d say that single words are way less characteristic of a person’s writing style than certain phrases. I extended the feature list from the feature extraction step by all possible bi- and tri-grams, which are easy to compute with NLTK. Actually, I’m not taking ALL bi- and tri-grams, and I’m not even taking all single words as features. The reason is that there were approx. 35k different words in the dataset. Together with the n-grams, this would make an extremely high-dimensional feature vector and, as it turned out, it was way too complex for my 16 GB MacBook Pro to compute. Consequently, I only took the top 5000 single words, bigrams and trigrams, ranked descending by their overall frequency.

Since NLTK already provides a corpus of stopwords (like “in”, “and”, “of”, etc.), which are obviously not characteristic of a person’s style of chatting, I decided to remove them (the German ones) from the message set during pre-processing.

With these optimizations, I ended up with an accuracy of 0.61 after a training time of 348 seconds (I didn’t log testing time at that point).

Conclusion

Certainly, 61 % accuracy doesn’t make for a really good classifier, but it is at least significantly better than random guessing (a chance of 1/4 in this case). However, I also trained a fastText classifier on my data as a comparison baseline, and it only reached 60 % accuracy (but with a much better training time of only 0.66 seconds).
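Looking back at the optimizations, the n-gram feature selection and the train/test split can be sketched like this. It is a dependency-free illustration under a few assumptions rather than my actual code: the helper names (ngrams, message_terms, top_terms, split_train_test) are made up, the top k terms are taken from one combined pool of words, bigrams and trigrams, and the split takes 10 % of all messages as test data, which is slightly different from 10 % of the training set.

```python
import random
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def message_terms(tokens):
    """Single words plus bigrams and trigrams of one tokenized message."""
    return tokens + ngrams(tokens, 2) + ngrams(tokens, 3)


def top_terms(tokenized_messages, k):
    """The k most frequent terms over all messages, most frequent first."""
    counts = Counter()
    for tokens in tokenized_messages:
        counts.update(message_terms(tokens))
    return [term for term, _count in counts.most_common(k)]


def split_train_test(feature_set, test_ratio=0.1, seed=0):
    """Shuffle a copy of the feature set and split off a test portion."""
    shuffled = list(feature_set)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical tokenized messages; the real data set is much larger
# and would use k=5000.
tokenized = [["in", "coffee", "we", "trust"], ["in", "coffee"], ["in", "case"]]
selected_terms = top_terms(tokenized, k=5)
```

With NLTK installed, the training itself is then nltk.NaiveBayesClassifier.train(train_set) on a list of (feature dictionary, label) pairs, and nltk.classify.accuracy(classifier, test_set) computes the accuracy on the held-out messages.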
My intuitive explanation for these rather bad results is the complexity of the problem itself. Given only a set of words without any context or semantics, it’s hard to predict a message’s sender, not only for a machine but also for a human. That said, with more training data (I’d need a longer message history) and more computing power to handle larger feature sets, the accuracy might improve slightly further. The practical relevance of this project isn’t very high anyway, but it was good practice for me to get into the basics of ML, and it’s really fun!

Please leave me feedback if you like.

Originally published at ferdinand-muetsch.de.