Github Project Link: Click here

When we skim or scan an article, keywords are the most important indicator for getting a basic understanding of the topics the article covers. When we want our computer to do the same task, we have to teach it how to get the keywords from an article and decide which keywords are more important than the others.

The easiest way is to find the "keywords" from a predefined list, and that list can be extracted from the largest free online encyclopedia: Wikipedia. I downloaded the Wikipedia dump from here and created a local database of Wikipedia. You can refer to Robert's instruction to import the entire Wikipedia into your local database.

The next step is to create a list of keywords from the title of each Wikipedia article, since Wikipedia seems to explain every single term in its encyclopedia. Different ways of writing a term are also captured in Wikipedia, such as "World War I", "WW1" or "World War One".

Now we have the list of keywords (Chinese version) here, and we can use a simple script to extract the keywords from an article:

```python
import re

keyword_article = []
for k in keywords:
    k = re.sub("\r\n", "", k)   # strip the line ending from each keyword
    if k in article:
        keyword_article.append(k)
```

However, this is not the end of the step: the list of extracted keywords may contain overlapping keywords, such as "World", "War" and "World War" when the keyword "World War I" is matched. These overlapping keywords are filtered out:

```python
keyword_overlap = []
for g in keyword_article:
    for h in keyword_article:
        if g != h and h in g:   # h is a substring of a longer keyword g
            keyword_overlap.append(h)

wiki_terms = list(set(keyword_article) - set(keyword_overlap))
```

The next step is to identify the importance of the keywords the program extracts from the article. TF-IDF (term frequency–inverse document frequency) is a scoring method that calculates the frequency of a particular term in the target article while taking into account the scarcity of that term in other articles.
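Putting the extraction and overlap-filtering steps together, here is a minimal runnable sketch of the idea. The `keywords` list and `article` string below are toy stand-ins for the Wikipedia title list and the real article text, which come from the database in the actual project.

```python
import re

# Toy stand-ins: in the real project, `keywords` comes from Wikipedia
# article titles and `article` is the text being analysed.
keywords = ["World\r\n", "War\r\n", "World War\r\n", "World War I\r\n", "Peace\r\n"]
article = "The article discusses World War I and its aftermath."

# Step 1: keep only keywords that literally appear in the article.
keyword_article = []
for k in keywords:
    k = re.sub("\r\n", "", k)          # strip the line endings from the keyword file
    if k in article:
        keyword_article.append(k)

# Step 2: drop any keyword that is a substring of a longer matched keyword.
keyword_overlap = []
for g in keyword_article:
    for h in keyword_article:
        if g != h and h in g:
            keyword_overlap.append(h)

wiki_terms = list(set(keyword_article) - set(keyword_overlap))
print(wiki_terms)  # only the longest match survives: ['World War I']
```

Note that the substring check is quadratic in the number of matched keywords, which is acceptable here because only a handful of keywords match any single article.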
The meaning of TF-IDF can be simply illustrated as below:

TF: The more frequent the term in the article, the higher the score
IDF: The more common the term across all articles, the lower the score

Let's have an example. We select the most-read New York Times story of 2016, "Why You Will Marry the Wrong Person". It contains around 96 occurrences of "the" compared to 15 occurrences of "marriage". Since "the" is very common in every article, its IDF score is very low compared to that of "marriage". In this example, the TF-IDF score did its job of identifying the important keywords.

Calculating the TF from the Articles

```python
from decimal import Decimal

term_no = []
term_sum = 0
wordcount = {}
tfidf = {}

# Count how many times each keyword appears in the article.
for i in xrange(len(wiki_terms)):
    term = articles.count(wiki_terms[i])
    term_no.append(term)

# Total number of keyword occurrences.
for i in term_no:
    term_sum = term_sum + i

# Term frequency: occurrences of the keyword divided by the total count.
for i in xrange(len(wiki_terms)):
    tf = Decimal(term_no[i]) / Decimal(term_sum)
    wordcount[wiki_terms[i]] = tf
```

Calculating the IDF in your local Wikipedia database

```python
for k in wiki_terms:
    # Look up the pre-computed IDF of the keyword in the local database.
    x2.execute("select key_idf from key_cn where key_term = %s", (k,))
    idf = x2.fetchone()
    if idf:
        tfidf_value = float(wordcount[k]) * idf[0]
        if tfidf_value > 0.1:
            tfidf[k] = tfidf_value
        # If the keyword appears in the header, it is important.
        if k in articles_header:
            tfidf[k] = 1
```

This is not the end of the story: when we want to link stories together, TF-IDF itself does not calculate the similarity between articles. In the next section, we will explore how we can link relevant articles using the extracted keywords.

For any comments, please feel free to leave them here or drop me an email at adam.kc.chin@gmail.com.
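As an appendix, the scoring idea can be made concrete without a Wikipedia database. Below is a toy TF-IDF computation in the same spirit; the three-document corpus and the helper functions `tf` and `idf` are invented for illustration (the real project reads pre-computed IDF values from the local database instead of computing them on the fly).

```python
import math

# Toy corpus: the last document plays the role of the target article.
corpus = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "the couple discussed marriage before the wedding",
]
target = corpus[-1].split()

def tf(term, words):
    # Term frequency: occurrences of `term` divided by total words.
    return words.count(term) / len(words)

def idf(term, docs):
    # Inverse document frequency: rarer across documents -> higher score.
    containing = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / containing)

# "the" appears in every document, so its IDF (and therefore TF-IDF) is zero;
# "marriage" appears in only one document, so it scores higher.
score_the = tf("the", target) * idf("the", corpus)
score_marriage = tf("marriage", target) * idf("marriage", corpus)
print(score_the, score_marriage)  # 0.0 vs about 0.157
```

This mirrors the "the" versus "marriage" example above: the common word is scored to zero while the distinctive word stands out.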