What is topic modelling? There are numerous definitions available online, but what I take from these two words is simply topic + modelling: creating, collecting, or modelling topics. What are these topics, and why do we need to model them?
Imagine that, while talking to a friend, you get a little distracted and catch only a few things. When asked for your opinion, you try to recall the topics or main points of the conversation. This is called building up a theme.
Let’s take another example. I am sure many of you have sat exams (who hasn’t?). In most English (or any language) exams, we get unseen passages and are asked to answer questions about them. In a one- or two-hour exam it is really difficult to read the whole passage, as you have other questions to attempt too. I always used a trick here: read the question, search the passage for specific and meaningful topics related to that question, and finally answer it.
Topic modelling is finding the thematic structure of a large collection of documents, so that the relevant themes can be retrieved when the collection is queried.
When we go to Google and search for Anaconda, we might see results for the Anaconda software, Nicki Minaj’s song “Anaconda”, and images of anaconda snakes.
These are all quite different topics. How did this happen? Were you expecting only one kind of result? Quite frustrating, isn’t it?
The big question here is -
For successful, meaningful search results, the first step is topic modelling.
Topic modelling is a text mining approach. Most of the time we get unstructured data, e.g. articles, newspapers, books, online posts, etc., and after applying topic modelling algorithms we get a set of topics.
Each topic contains its top-ranked terms and references to associated or relevant documents.
We start with text preprocessing. Documents across the web and in different databases are mostly in an unstructured format; the style of writing, the language, the quality of the text, the vocabulary, etc. can vary.
Now, the question is how we will identify the topics or themes of these documents. Documents are written and stored in textual format. Each term in a document has its own significance for that document. We will create a dictionary containing all the words in the documents, along with the documents in which each term appeared. It is similar to the index page of a book, where you can find the page number of a particular topic.
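To make the book-index idea concrete, here is a minimal sketch of such a dictionary. The document ids, the sample sentences, and the helper name `build_index` are made up for illustration, and splitting on whitespace is only a stand-in for proper tokenization.

```python
from collections import defaultdict

# Hypothetical two-document collection, just for illustration.
docs = {
    "doc1": "anaconda is a large snake",
    "doc2": "anaconda is a python distribution",
}

def build_index(docs):
    """Map every term to the set of document ids it appears in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

index = build_index(docs)
print(sorted(index["anaconda"]))  # ['doc1', 'doc2']
print(sorted(index["snake"]))     # ['doc1']
```

Just like a book index, looking up a term tells us straight away which documents to open.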
First we need to split the text of the input documents into individual tokens, each corresponding to a single term of the document. This process is known as “tokenization”.
Once tokenization is done, we need a reference to the documents in which these terms appeared. This process is called creating a “bag of words”. In a bag of words, every document is represented as a term vector that records the number of times each term appeared in that document.
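A very simple tokenizer can be sketched with a regular expression. This is only one possible rule (lowercase everything, keep runs of letters and digits); the function name `tokenize` is my own, and real tokenizers handle punctuation, hyphens, and other languages far more carefully.

```python
import re

def tokenize(text):
    """Lowercase the text and return its runs of letters/digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

tokens = tokenize("Topic modelling finds themes in documents.")
print(tokens)
# ['topic', 'modelling', 'finds', 'themes', 'in', 'documents']
```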
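Here is a small sketch of a bag of words built with Python's `collections.Counter`. The two sample documents and the names `bag_of_words`, `vocabulary`, and `matrix` are assumptions for illustration; whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# Hypothetical sample documents.
docs = {
    "doc1": "the snake ate the mouse",
    "doc2": "the software installs python packages",
}

def bag_of_words(text):
    """Count how many times each term occurs in one document."""
    return Counter(text.lower().split())

bags = {doc_id: bag_of_words(text) for doc_id, text in docs.items()}

# Shared vocabulary: every distinct term across all documents, sorted.
vocabulary = sorted(set().union(*bags.values()))

# One term vector per document, aligned with the vocabulary order.
matrix = {doc_id: [bag[term] for term in vocabulary] for doc_id, bag in bags.items()}

print(vocabulary)
# ['ate', 'installs', 'mouse', 'packages', 'python', 'snake', 'software', 'the']
print(matrix["doc1"])  # [1, 0, 1, 0, 0, 1, 0, 2]
print(matrix["doc2"])  # [0, 1, 0, 1, 1, 0, 1, 1]
```

Each row is one document and each column is one term, so stacking the rows gives a document-term matrix of raw counts.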
Further in text preprocessing we also perform a few steps such as stopword filtering (removing terms that do not convey any meaning about the topic, e.g. at, the, as, an), removing words with very low or very high frequency (as they add little value to a topic), and stemming (reducing words to a common stem, since the same word can appear in different tenses or plurals, e.g. computer, computing, compute).
Now that we have the documents and the terms that appeared in them, we need to decide whether the terms are important enough; this improves the usefulness of the document-term matrix. One common approach is TF-IDF. It combines two weights: TF (term frequency), the number of times a term appears in a given document, and IDF (inverse document frequency), which is based on the number of distinct documents containing a term and down-weights terms that appear in almost every document.
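These three steps can be sketched together as follows. Everything here is illustrative: the stopword list is tiny, `toy_stem` is a crude suffix stripper and not a real stemming algorithm (a library stemmer such as Porter's would normally be used), and the `min_df` / `max_df_ratio` parameter names are my own.

```python
from collections import Counter

STOPWORDS = {"at", "the", "as", "an", "a", "is", "of", "in"}  # tiny illustrative list
SUFFIXES = ("ing", "ers", "er", "es", "ed", "s", "e")  # toy rules, not a real stemmer

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(docs, min_df=1, max_df_ratio=1.0):
    """Remove stopwords, stem, then drop terms that are too rare or too common."""
    stemmed = [
        [toy_stem(t) for t in doc.lower().split() if t not in STOPWORDS]
        for doc in docs
    ]
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in stemmed for term in set(doc))
    n_docs = len(docs)
    keep = {t for t, c in df.items() if c >= min_df and c / n_docs <= max_df_ratio}
    return [[t for t in doc if t in keep] for doc in stemmed]

docs = [
    "the computer is computing",
    "an engineer computes at the lab",
    "snakes live in the forest",
]
print(preprocess(docs, min_df=2))
# [['comput', 'comput'], ['comput'], []]
```

With `min_df=2`, only the stem that occurs in at least two documents survives, which shows how aggressive frequency filtering can be on a small collection.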
Document-term matrix
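As a rough sketch, TF-IDF weights can be computed by hand like this. I am assuming the plain variant, raw term count multiplied by log(N / document frequency); libraries often use smoothed or normalised versions, so their numbers will differ. The sample documents are already tokenized and are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized documents.
docs = [
    ["anaconda", "snake", "jungle"],
    ["anaconda", "software", "python"],
    ["anaconda", "song", "music", "song"],
]

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of distinct documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

weights = tf_idf(docs)
print(weights[0]["anaconda"])  # 0.0, because the term appears in every document
print(weights[2]["song"] > weights[2]["music"])  # True, "song" occurs twice
```

Notice that "anaconda" gets a weight of zero: it appears in every document, so it does not help tell the topics apart.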
Once we have the document-term matrix, we can apply a machine learning algorithm to it and explore the data, e.g. search for the answer to a particular query, just as we do in a Google search.
I hope you now have a basic understanding of topic modelling. More details on the algorithm will be covered in the next post. Subscribe and stay tuned! #DataScience #MachineLearning #TextAnalytics #NLP #TopicModelling