paint-brush
Finding The Most Important Sentences Using NLP & TF-IDFby@autumnskerritt
25,857 reads
25,857 reads

Finding The Most Important Sentences Using NLP & TF-IDF

by autumnAugust 24th, 2018
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

We’re going to use Term Frequency — Inverse Document Frequency (TF-IDF) to find the most important sentences in a BBC news article. Then we are going to implement this algorithm into a quick & easy Firefox extension.

People Mentioned

Mention Thumbnail

Companies Mentioned

Mention Thumbnail
Mention Thumbnail
featured image - Finding The Most Important Sentences Using NLP & TF-IDF
autumn HackerNoon profile picture

Photo by Pankaj Patel on Unsplash

We’re going to use Term Frequency — Inverse Document Frequency (TF-IDF) to find the most important sentences in a BBC news article. Then we are going to implement this algorithm into a quick & easy Firefox extension.

Click here to view an updated version of this article:


Finding The Most Important Sentences Using NLP & TF-IDF_We're going to create a summary of BBC News Articles and place them at the top using a Firefox extension. This article…_skerritt.blog

We’re going to create a summary of BBC News Articles and place them at the top using a Firefox extension. This article is about the gnarly algorithm Term Frequency-Inverse Document Frequency (TF-IDF). We’re going to create a real-world usage in the form of a Firefox extension. I know what you’re thinking. “TF-IDF? Yawn 😴” but bare with me, it’s quite interesting!

When we’re done, it’ll look like this:

News article found here. Red box highlights the most important sentences. The red box is not in the final product, is only used for illustration purposes.

Term frequency * Inverse Document Frequency

Don’t worry, the name of the algorithm makes me fall asleep every time I hear it said out loud too. This algorithm is 2 algorithms multiplied together. Let’s see how both of these work:

Term Frequency

Term frequency (TF) is how often a word appears in a document, divided by how many words there are.

The Term Frequency (TF) of a term, t, and a document, d.

Let’s say you’re reading a news article on Brexit. The word “Brexit” will appear a lot, so the term frequency of the word “Brexit” is high.

Quite often, we would want to build a dictionary (hashmap) of term frequencies alongside the term. Like {word: term frequency of that word} and then iterate through this dictionary to find out which word appears the most times.

Now, what if I told you that the term frequency dictionary would look a little like this:

{"and": 0.87, "the": 0.73}

You can see how these common English words aren’t useful to us. Of course, most English texts have these words in them, but we call English words like these stopwords. Stopwords usually refer to the most common words in a language, although there isn’t one singular definition. You have to choose stopwords per usage. You have to decide on what words to use. Before processing some text you’ll usually want to remove stopwords to better process the text.

Words with capital letters in them differ from words without capitals. In programming, “Africa” and “africa” are two different things. Because of this, we want to turn everything into lowercase or uppercase to better process our text. We’re going to turn all words into lowercase.

Given a string, we want to remove stop words and turn it into lowercase. Our extension will give us a string of all text on a BBC news article. Don’t worry about where we get the text from yet, that’s done later in the Firefox extension section. For now, assume we have text that looks like this:


... These are external links and will open in a new windowA neat feature of many modern laptops is the ability to power them up through the USB port. Unlike the rectangular USB ports of old, the newer type - USB-C - can carry enough power to charge yourmachine.That’s great news: it means ...

The above text is shortened to prevent the reader from falling asleep.

This is the function which will “prettify” our documents. Line 3 is an array of stopwords I found on StackOverflow. I added “share” and “linkthese” since these are commons words in the news article we don’t want.

Line 5 is Regex. The square brackets mean or. [,.] means “activate on a comma or a full stop ”. /g means global. Once you find one ‘,’ or ‘.’ don’t stop, continue searching the string. The empty string is what we’re replacing it with. If we find a full stop or a comma, replace it with nothing — delete it. This is because the words “Africa.” and “Africa” would be classified as two different words without this.

From my Instagram, @Brandon.codes

Line 4 splits the document up into separate words. The map function applies a function to every element in an array. Once the string is split up into an array of words we apply the toLowerCase() method to every element. It makes every word lowercase.

From my Instagram, @Brandon.codes. The arrows are Arrow Functions (Lambda / Anonymous functions).


We then return the lowercase words once we’ve filtered out stop words. Filter() creates a new array with only the elements for which the function inside returns True.If a word is a stop word, it’ll result in True which means we’ll get a new array of only the stopwords in the document. We use the negation operator “!” to get the opposite, which is what we want. To return a list of words with no stopwords in it.

Now we want to count how many times each word appears in the document. This will be useful for both Term Frequency and Inverse Document Frequency. First, we want to get all the unique words from an array of words.

We convert the array into a set because sets have no repetitions. This lets us get only the unique words in the array. Sets also don’t have an order, so we can’t use array indices to access elements. We need to turn it straight back into an array. For more about Set Theory, check out this article I wrote.

Okay, now time to count how many times a word appears in the words array.

This function goes through every single unique word and counts how many times that word appears in the array of words. The Term frequency function is quite long, so I’m going to break it down.

Line 6 divides the document up into sentences. Sometimes sentences have white space before them. “Brandon. Dogs.” Has whitespace before “Dogs.” we apply the trim() method to each item to get rid of these trailing whitespace.

Regarding line 7, The first 146 characters of the first word is social media links. The rest of that word is a title or sub-title. Here, look:

Share this withEmailFacebookMessengerMessengerTwitterPinterestWhatsAppLinkedInCopy this linkThese are external links and will open in a new windowRyanair is tightening the rules on what passengers pay to take luggage onto the plane, in order to "speed up boarding".

This is annoying, as the title is an essential part of the story and needs to be taken into account. So we remove the first 146 characters of the first word to get:

Ryanair is tightening the rules on what passengers pay to take luggage onto the plane, in order to "speed up boarding"

Remember this formula?

The Term Frequency (TF) of a term, t, and a document, d.

The variable “TFVals” is calculating this formula. If we run the sentence “Hello, my name is Brandon. Brandon Brandon. The elephant jumps over the moon” through the term frequency function, we’ll get something that looks like this:

TF only returns 1 “Brandon”, so not every Brandon has a TF score. For this example, bare in mind that TF only returns 1 Brandon because we are calculating the singular term (word) frequency.

We have the term frequencies of words, but we want to calculate the most important sentences, not words. To do that, we go through every single sentence and see what words come up in that sentence which are in TFVals.

Remember earlier how I said we don’t count all 3 “Brandon”’s? I’ve done that now. Also, note that we’re splitting this up into sentences. There are 3 sentences in this image.

We just need to add them all up and divide by how many words we have. Since we’re only adding up the TF values of non stop words, it’s only fair if we divide by how many non stop words there are, instead of how many words there are in a sentence. If we don’t divide by how many words we have, long sentences have an advantage over shorter ones.

The sentence with the highest TF is “Brandon Brandon.”

This is what line 20 onwards does below. We go through every single sentence and calculate the TF values of each sentence, just as we did above.

And that’s it. But we have a problem with using only Term Frequency. As you may have seen earlier, “Brandon Brandon” was the highest scoring TF out of all 3 sentences we looked at.

Popularity isn’t enough. We don’t want sentences that have the most keywords in them as they may not make sense, or they may be repetitions of each other. Such as in the “Brandon” Brandon” sentence. It has a high TF value but doesn’t hold much content.

It doesn’t contain much information and isn’t useful. We want a sentence that is both rare, unique and contains keywords common in the article. This is where inverse document frequency comes in.

Inverse document frequency

Term frequency is how common a word is, inverse document frequency (IDF) is how unique or rare a word is. The formula for IDF is:

t is the term and d is the documents. We have multiple documents, we’re treating each sentence as its own document.

IDF used over many documents, whereas TF is built for one document. You can decide what a document is. In this article, each sentence is its own document.

The first few steps of IDF is the same as TF. We prettify the document, count the words in the document and get all the unique words.

Lines 1–6 is nothing new. The for loop on line 17 loops through every sentence in the document. Since each sentence is a new ‘document’, we need to count the words of each sentence individually. We have to prettify them to get rid of the stopwords and turn them into an array of words. We push the wordcount object of each new sentence into wordCountSentences.

There are 3 sentences here. We don’t care how many times Brandon appears over the whole document. I tried to pick the words to give you something interesting without making the maths too large, but in the process I developed narcissism.

We’re now going to go through every single word and count how many times that word appears in every sentence and calculate the IDF score using the below formula.

I use log 10 here. Note: all logarithms are constantly scaled, so it doesn’t matter much on what logarithm you use.

A small example of how IDF is used.

Now we just do this for every non stop word.

And the code for this is:

Now we want to get the IDF Values of all the sentences, we use the same code from TF here but replace some things to make it work.

If I’m being truthful with you, I did a simple “find and replace” the variables. Instead of “TF” in the comments, I replaced them with IDF. Instead of “TFVals”, I replaced it with “IDFVals”. Nothing important has happened here, so feel free to skip this part.

Now that we have written code to calculate the IDF value of sentences; we have completed the formula from earlier. “is a dog” has the highest IDF value.

We now know how unique or rare a sentence is. This isn’t that useful since we want the sentence to be information-rich too. We want some way to combine the popularity of TF with the uniqueness of IDF. This leads us to our next section…

TF-IDF revisited

“Are we done yet? Can I wake up now?” Yes, very nearly done! Gif from Giphy.

We now have TF and IDF functions implemented. The only thing left to do is to multiply them together.

The objects TF and IDF both stem from the same data, so TF isn’t going to contain something that isn’t in IDF. Because of this, we can iterate through one object and use the same key. We multiply the value in TFVals by the value from in IDFVals.

Our next step is to calculate the 3 most important sentences in our TF-IDF Object. Iterating over the [key, value] of the object with a couple of if statements works perfectly.

You’ll see at the bottom we return the formatted string. We format it so it looks nice when we insert it into the webpage. Each <br> is a line break, a space in the text. The black dots are bullet points. We’re now going to implement this algorithm into a Firefox extension. 🔥🦊

Getting & changing text in a BBC news article

Go to any BBC news article, right-click and press “inspect element”. You’ll see a nice box at the bottom of the screen. Use the element selector tool in the top left-hand corner and hover over the article. We can see that the whole article is encompassed within a CSS class of ‘story-body’.

If we go further in, we can see that all the actual text in the article is encompassed by paragraph tags, inside this CSS class.

We’re going to use JQuery to select the text.

This line selects all the <p> tags within the story-body class. Now we want to get the text, we do this by applying the method .text().

We want to add our text to the top of the article. JQuery has a method called prepend which lets us prepend data to the top of an object.

And we’re done! We can now identify the most important sentences in a BBC News article and display them right at the top. Just time to turn it into an extension.

Firefox extension basics

Firefox extensions have 2 main parts. The Javascript you wrote and the manifest.json file which tells Mozilla what your extension does. We’re going over manifest.json now.

manifest_version tells Firefox what version of manifest you are using. Name tells Firefox what the name of your extension is. Version tells Firefox what version number your extension is. These 3 are mandatory.

description tells Firefox what your extension does.

content_scripts tells Firefox what scripts to load when the URL matches what you have inputted. In order for the scripts you have specified to run, the current URL must match at least one of the URLs you have specified. You can use 2 special characters here:

  1. *” Matches zero or more characters. In this instance, I don’t know whether the user will load HTTP or HTTPS so I have it step to load both. I also don’t know what exact article the user will look at, so I have it set to activate on any article.

2. “?” matches exactly one character.

The Mozilla Developer Network has a nice explanation of these:

For example: "*na?i" would match "illuminati" and "annunaki", but not "sagnarelli".

Since we’re going to be using jQuery, we’ll import the jQuery JS file as well into the website, before our script executes. You can grab the jQuery file from here. Copy and paste into a file named “jquery.js”.

Enter “about:debugging” into your Firefox URL to load this page:

From here, click “Load Temporarary Add-on…” and then click any of the files in the extension. Once you do, you should see this:

Mozilla have a nice article on the basics of Firefox extensions, here.

Now load any BBC news article to play with it!

Conclusion

You’ve now seen the awesome power of TF-IDF and a real-world application for it. This idea came to me because I have email anxiety. I get so nervous about reading emails that I wanted a quick summary of them to calm my thoughts. Alas, this is my first time ever writing Javascript. I started out on something easier like BBC news articles.

Here are some ways you can improve upon this code, if you so wish:

  • Dynamically select how many sentences in a summary you want. You can find out the average TF*IDF value in the whole article and anything over X you can include in the summary. This makes it so long articles are treated as equally as shorter articles.
  • Extending this to work on any other websites you wish.

The GitHub repo is here.

Did you like this article? Connect with me on Social Media to discuss all things computer science related 😁

Twitter| Instagram| LinkedIn

I didn’t get paid for writing this article. If you want to support me or enjoyed this article, feel free to buy me a cup of tea or something below 😍✨


Pay Brandon Skerritt using PayPal.Me_Go to paypal.me/BrandonSkerritt and type in the amount. Since it's PayPal, it's easy and secure. Don't have a PayPal…_www.paypal.me


Pay Brandon instantly through Monzo.me_Tap the link to pay Brandon. You don't need to create an account and it's totally free._monzo.me