GitHub Project Link: Click here
When we skim or scan an article, keywords are the most important clues for building a basic understanding of its topics. If we want a computer to do the same task, we have to teach it how to extract keywords from an article and decide which keywords are more important than others.
The easiest way is to match “keywords” against a predefined list, and that list can be extracted from the largest free online encyclopedia — Wikipedia.
I downloaded the Wikipedia dump from here and created a local database of Wikipedia. You can refer to Robert’s instructions on importing the entire Wikipedia into your local database.
The next step is to build a list of keywords from the title of each Wikipedia article, because Wikipedia seems to explain every single term in its encyclopedia. Different ways of writing a term are also captured in Wikipedia, such as “World War I”, “WW1”, or “World War One”.
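As a sketch of how the title list could be pulled from the local database — assuming a MySQL import that kept the standard MediaWiki schema (a page table with page_title and page_namespace columns, where namespace 0 holds the articles) and the MySQLdb driver; the connection details below are placeholders to adjust for your own setup:

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="wiki", passwd="wiki", db="wikipedia", charset="utf8")
cur = conn.cursor()

# Namespace 0 contains the encyclopedia articles; alternative titles
# such as "WW1" exist there as redirect pages, so this query catches them too.
cur.execute("select page_title from page where page_namespace = 0")

# MediaWiki stores titles with underscores instead of spaces.
keywords = [row[0].replace("_", " ") for row in cur.fetchall()]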
Now we have the list of keywords here (Chinese version), and we can use a simple script to extract the keywords from an article.
import re

keyword_article = []
for k in keywords:
    # Strip line endings left over from the keyword file
    k = re.sub("\r\n", "", k)
    if k in article:
        keyword_article.append(k)
However, we are not done yet: the list of extracted keywords may contain overlapping entries, such as “World”, “War”, and “World War” picked up alongside “World War I”. These overlapping keywords have to be filtered out.
keyword_overlap = []
for g in keyword_article:
    for h in keyword_article:
        if g != h:
            # h is a substring of g, e.g. "World War" inside "World War I"
            if h in g:
                keyword_overlap.append(h)

wiki_terms = list(set(keyword_article) - set(keyword_overlap))
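For instance, with a toy input (hypothetical values), the filter keeps only the longest match:

keyword_article = ["World", "War", "World War", "World War I"]
# After the loops above, every keyword that is a substring of another
# ends up in keyword_overlap, so:
# wiki_terms == ["World War I"]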
The next step is to gauge the importance of the keywords the program extracts from the article.
TF-IDF (term frequency–inverse document frequency) is a scoring method that weighs how frequently a term appears in the target article against how rare that term is across other articles.
The meaning of TF-IDF can be summarized as follows:
TF: The more frequent the term, the higher the score
IDF: The more common the term, the lower the score
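A minimal sketch of one common TF-IDF formulation (the function and its arguments are illustrative, not from the project; real variants differ in smoothing and normalization):

import math

def tf_idf(term, doc, corpus):
    # TF: how often the term occurs, relative to the document length
    words = doc.split()
    tf = words.count(term) / float(len(words))
    # IDF: log of corpus size over the number of documents containing the term
    df = sum(1 for d in corpus if term in d.split())
    idf = math.log(len(corpus) / float(1 + df))
    return tf * idf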
Let’s have an example. Take one of the most-read New York Times stories of 2016 — “Why You Will Marry the Wrong Person”. It contains around 96 occurrences of “the” compared to 15 of “marriage”. Since “the” is very common in every article, its IDF score is very low compared to that of “marriage”. In this example, the TF-IDF score does its job of identifying the important keywords.
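To make the effect concrete, we can plug in some made-up corpus statistics (only the counts of 96 and 15 come from the article; the corpus size, document frequencies, and article length below are purely illustrative):

import math

N = 10000.0                 # hypothetical corpus size
words_in_article = 3000.0   # hypothetical article length

# "the" appears in essentially every document; "marriage" in roughly 1%
idf_the = math.log(N / 10000.0)      # = 0.0
idf_marriage = math.log(N / 100.0)   # ~ 4.6

print (96 / words_in_article) * idf_the        # 0.0
print (15 / words_in_article) * idf_marriage   # ~ 0.023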
from decimal import Decimal

term_no = []
term_sum = 0
wordcount = {}
tfidf = {}

# Count how many times each Wikipedia term appears in the article
for i in xrange(len(wiki_terms)):
    term = articles.count(wiki_terms[i])
    term_no.append(term)

for i in term_no:
    term_sum = term_sum + i

# TF: occurrences of each term relative to all keyword occurrences
for i in xrange(len(wiki_terms)):
    tf = Decimal(term_no[i]) / Decimal(term_sum)
    wordcount[wiki_terms[i]] = tf

# Look up each term's precomputed IDF in the local keyword table
for k in wiki_terms:
    x2.execute("select key_idf from key_cn where key_term = %s", (k,))
    idf = x2.fetchone()
    if idf:
        tfidf_value = float(wordcount[k]) * idf[0]
        if tfidf_value > 0.1:
            tfidf[k] = tfidf_value
        # If the keyword appears in the header, it is important.
        if k in articles_header:
            tfidf[k] = 1
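A quick way to inspect the result is to sort the scored keywords from highest to lowest:

# Print the extracted keywords in descending order of TF-IDF score
for term, score in sorted(tfidf.items(), key=lambda kv: kv[1], reverse=True):
    print term, score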
This is not the end of the story. When we want to link stories together, TF-IDF by itself does not measure how similar two articles are. In the next section, we will explore how to link relevant articles using the extracted keywords.
For any comments, please feel free to leave them here or drop me an email at [email protected].