What is Topic Modelling? There are numerous definitions available online, but what I understood from these two words is topic + modelling: creating, collecting, or modelling topics. What are these topics and why do we need to model them? Imagine that while talking to your friend you got a little distracted and heard only a few things; when asked for your opinion, you try to remember the topics or main things they were saying during the conversation. This is called building up a theme. Let's take another example. I am sure many of you have taken exams (who hasn't?). In most English (or any language) exams, we get unseen passages and are asked to answer questions about them. In a one- or two-hour exam it is really difficult to read the whole passage, as you might have other questions to attempt too. I always used a trick here: read the question, search the passage for specific, meaningful topics related to that question, and finally answer it.
Topic modelling is finding the thematic structure of a large collection of documents, so that it can be queried by topic.
When we go on Google and search for "Anaconda", we might see results for the Anaconda software, Nicki's "Anaconda" song and images of anaconda snakes -
These are all quite different topics. How did this happen? Were you expecting only one kind of result? Quite frustrating, isn't it?
The big question here is: how do we make sense of all these different results? For successful and meaningful search results, the first step is topic modelling.
Topic modelling is a text-mining approach. Most of the time we get unstructured data, e.g. articles, newspapers, books, online posts etc., and after applying topic modelling algorithms we get a set of topics.
Each topic contains its top-ranked terms and references to the associated or relevant documents.
Starting with text preprocessing: documents across the web and in different databases are mostly in unstructured format. The style of writing, the language, the quality of the text and the vocabulary can all vary.
Now, the question is how we will identify the topics or themes of these documents. Documents are written and stored in textual format, and each term in a document has its own significance for that document. We will create a dictionary containing all the words in the documents and the documents in which each term appears. It is similar to the index page of a book, where you can find the page numbers for a particular topic.
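The book-index idea above can be sketched as a small inverted index in Python, mapping each term to the documents it appears in (the document names and texts here are made up for illustration):

```python
# Toy corpus: document id -> text.
docs = {
    "doc1": "anaconda is a large snake",
    "doc2": "anaconda is a python distribution",
}

# Build the inverted index: term -> set of documents containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

print(sorted(index["anaconda"]))  # ['doc1', 'doc2']
print(sorted(index["snake"]))     # ['doc1']
```

Just like a book index points to page numbers, looking up a term here immediately tells us which documents mention it.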
First we need to split the text of the input documents into individual tokens, each corresponding to a single term of the document. This process is known as "tokenization".
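A minimal tokenizer sketch, assuming we simply lowercase the text and split on non-alphanumeric characters (real tokenizers handle punctuation, contractions and more):

```python
import re

def tokenize(text):
    # Lowercase, split on runs of non-word characters, drop empties.
    return [t for t in re.split(r"\W+", text.lower()) if t]

print(tokenize("Topic Modelling finds themes, not just words!"))
# ['topic', 'modelling', 'finds', 'themes', 'not', 'just', 'words']
```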
Once tokenization is done, we need a reference back to the documents in which these terms appeared. This process is called creating a "bag of words". In a bag-of-words representation, every document is represented as a term vector along with the number of times each term appears in that document.
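A bag of words can be sketched with Python's `collections.Counter`, which maps each term in a tokenized document to its count:

```python
from collections import Counter

def bag_of_words(tokens):
    # Each document becomes a term -> count mapping; word order is lost.
    return Counter(tokens)

bow = bag_of_words(["the", "snake", "ate", "the", "rat"])
print(bow["the"])  # 2
print(bow["rat"])  # 1
```

Note that the bag of words keeps counts but deliberately throws away word order, which is what makes it a "bag".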
Further in text preprocessing we also perform a few steps like stopword filtering (removing terms that do not convey any meaning for a topic, e.g. at, the, as, an), removing words with very low or very high frequency (as they don't add any value to a topic) and stemming (reducing inflected forms of a word to a common stem, since the same word can appear with different tenses or plurals, e.g. computer, computing, compute).
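These steps might look like the following sketch; the stopword list and the suffix-stripping "stemmer" here are toy stand-ins (a real pipeline would use a full stopword list and an algorithm such as Porter stemming):

```python
# Toy stopword list for illustration only.
STOPWORDS = {"at", "the", "as", "an", "a", "is", "of"}

def crude_stem(word):
    # Toy stemmer: strip a few common suffixes from long-enough words.
    for suffix in ("ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(tokens):
    # Drop stopwords, then reduce remaining words to their stems.
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess(["the", "computers", "computing", "at", "night"]))
# ['comput', 'comput', 'night']
```

Notice how "computers" and "computing" collapse to the same stem, so they count as one term in the later steps.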
Now that we have the documents and the terms that appear in them, we need to decide how important each term is; this improves the usefulness of the document-term matrix. One common approach is TF-IDF, a combination of two weights: TF (term frequency), the number of times a term appears in a given document, and IDF (inverse document frequency), which down-weights terms that appear in almost every document by taking into account the number of distinct documents containing the term.
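A small sketch of this weighting, using the common `tf * log(N / df)` formulation (the toy documents are made up for illustration):

```python
import math

# Toy corpus of already-tokenized documents.
docs = [
    ["anaconda", "snake", "snake"],
    ["anaconda", "software"],
    ["snake", "python"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)               # frequency in this document
    df = sum(1 for d in docs if term in d)        # documents containing the term
    idf = math.log(len(docs) / df)                # rarer terms get higher weight
    return tf * idf

# "anaconda" appears in 2 of 3 documents, so its IDF is low;
# "software" appears in only 1, so it carries more weight.
print(tf_idf("anaconda", docs[0], docs))
print(tf_idf("software", docs[1], docs))
```

A term that occurs in every document gets an IDF of `log(1) = 0`, so it contributes nothing, which is exactly the behaviour we want for near-stopwords.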
Hope you now have a basic understanding of topic modelling. More details on the algorithms will be covered in the next post. Subscribe and stay tuned! #DataScience #MachineLearning #TextAnalytics #NLP #TopicModelling