Photo by [Jason Yu](https://unsplash.com/@jason_yu?utm_source=medium&utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&utm_medium=referral)

I have been playing these last few days with some tools to analyze online texts, and I have been using NLTK (Natural Language Toolkit), a platform for building Python programs that work with human language data.

**NLP**, or **Natural Language Processing**, is the science of enabling the computer to understand human language, derive meaning from it, and generate natural language. It sits at the intersection of **computer science,** **artificial intelligence,** and **linguistics.**

Lately I have been working on a quite similar project to implement an intelligent chatbot on my Raspberry Pi. I will be posting that experiment on my [blog](http://eon01.com/blog), but for now, let's learn **how to use NLTK to analyze text**.

The analysis in this tutorial is based on **348 files**, but it is still approximate and is intended as an educational tool to learn the basics, nothing more. NLTK is a great tool, but it remains software that is designed to evolve over time and become more efficient. You may find some small classification errors, but they are negligible compared to the overall result.

The texts were downloaded from [this site](http://www.americanrhetoric.com/barackobamaspeeches.htm). I have not checked the content of every speech, but the overwhelming majority of these texts were delivered by Obama himself. Anything said by somebody else is negligible compared to the overall result of this analysis.

### Downloading Content

The first step is downloading the content. I used this simple Python script:

```python
#!/usr/bin/env python
# coding: utf8
from goose import Goose
import urllib
import lxml.html
import codecs


def get_links(url, domain):
    connection = urllib.urlopen(url)
    dom = lxml.html.fromstring(connection.read())
    # select the url in href for all a tags (links)
    for link in dom.xpath('//a/@href'):
        if link.startswith("speech") and link.endswith("htm"):
            yield domain + link


def get_text(url):
    g = Goose()
    article = g.extract(url=url)
    with codecs.open(article.link_hash + ".speech", "w", "utf-8-sig") as text_file:
        text_file.write(article.cleaned_text)


if __name__ == "__main__":
    link = "http://www.americanrhetoric.com/barackobamaspeeches.htm"
    domain = "http://www.americanrhetoric.com/"
    for i in get_links(link, domain):
        get_text(i)
```

Concatenating the downloaded files is the second step:

```python
import os

for file in os.listdir("."):
    if file.endswith(".speech"):
        os.system("cat " + file + " >> all.speeches")
```

Then it is recommended to create what we call tokens in NLTK jargon:

```python
import codecs
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

english_stopwords = stopwords.words("english")

with codecs.open("all.speeches", "r", "utf-8-sig") as text_file:
    r = text_file.read()

# Remove punctuation
tokenizer = RegexpTokenizer(r'\w+')
_tokens = tokenizer.tokenize(r)

# Get clean tokens
tokens = [t for t in _tokens if t.lower() not in english_stopwords]
```
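The NLTK features used in the rest of this tutorial (stop words, the POS tagger, the named entity chunker) rely on data packages that ship separately from the library itself. If they are not installed yet, a one-time download along these lines should be enough; this is only a sketch, and the resource names may vary slightly between NLTK versions:

```python
import nltk

# One-time downloads of the NLTK data packages used in this tutorial.
nltk.download("stopwords")                   # stop word lists
nltk.download("punkt")                       # tokenizer models
nltk.download("averaged_perceptron_tagger")  # POS tagger
nltk.download("maxent_ne_chunker")           # named entity chunker
nltk.download("words")                       # word list used by the chunker
```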
### Analyzing Content

#### The Lexical Diversity

According to Wikipedia, the **lexical diversity** of a given text is defined as the ratio of the total number of words to the number of different unique word stems. The code below measures it as the percentage of unique tokens among all tokens:

```python
import numpy as np
import matplotlib.pyplot as plt

# Process lexical diversity
st = len(set(tokens))
lt = len(tokens)
y = [st * 100 / lt]
print(y)

fig = plt.figure()
ax = fig.add_subplot(111)

# necessary variables
N = 1
ind = np.arange(N)
width = 0.7
rect = ax.bar(ind, y, width, color='black')

# axes and labels
ax.set_xlim(-width, len(ind) + width)
ax.set_ylim(0, 100)
ax.set_ylabel('Score')
ax.set_title('Lexical Diversity')
xTickMarks = ['Lexical Diversity Meter']
ax.set_xticks(ind + width)
xtickNames = ax.set_xticklabels(xTickMarks)
plt.setp(xtickNames, rotation=45, fontsize=10)

# add a legend
ax.legend((rect[0],), ('Lexical Diversity',))
plt.show()
```

#### POS Tags Frequency

As explained in the official NLTK documentation, the process of classifying words into their **parts of speech** and labeling them accordingly is known as **part-of-speech tagging**, **POS-tagging**, or simply **tagging**. In simple words, a word can be a noun, a verb, or an adjective, and NLTK will help us determine which:

```python
from collections import Counter
import nltk
import numpy as np

# get tagged tokens
tagged = nltk.pos_tag(tokens)

# top words by tag (verb, noun, etc.)
counts = Counter(tag for word, tag in tagged)

# counter data, counter is your counter object
keys = counts.keys()
y_pos = np.arange(len(keys))

# get the counts for each key
p = [counts[k] for k in keys]
error = np.random.rand(len(keys))
```

Here is a list of POS tags with a description:

| POS Tag | Description | Example |
|---------|-------------|---------|
| CC | coordinating conjunction | and |
| CD | cardinal number | 1, third |
| DT | determiner | the |
| EX | existential there | there is |
| FW | foreign word | d'hoevre |
| IN | preposition/subordinating conjunction | in, of, like |
| JJ | adjective | big |
| JJR | adjective, comparative | bigger |
| JJS | adjective, superlative | biggest |
| LS | list marker | 1) |
| MD | modal | could, will |
| NN | noun, singular or mass | door |
| NNS | noun, plural | doors |
| NNP | proper noun, singular | John |
| NNPS | proper noun, plural | Vikings |
| PDT | predeterminer | both the boys |
| POS | possessive ending | friend's |
| PRP | personal pronoun | I, he, it |
| PRP$ | possessive pronoun | my, his |
| RB | adverb | however, usually, naturally, here, good |
| RBR | adverb, comparative | better |
| RBS | adverb, superlative | best |
| RP | particle | give up |
| TO | to | to go, to him |
| UH | interjection | uhhuhhuhh |
| VB | verb, base form | take |
| VBD | verb, past tense | took |
| VBG | verb, gerund/present participle | taking |
| VBN | verb, past participle | taken |
| VBP | verb, sing. present, non-3rd person | take |
| VBZ | verb, 3rd person sing. present | takes |
| WDT | wh-determiner | which |
| WP | wh-pronoun | who, what |
| WP$ | possessive wh-pronoun | whose |
| WRB | wh-adverb | where, when |

And here is what we got:

* More than 16k nouns
* Almost 10k adjectives
* Almost no predeterminers
* etc.

#### Common Words

To figure out the 60 most used words (without the stop words) in Obama's speeches, this code will do the work:

```python
# Top 60 words
dist = nltk.FreqDist(tokens)
dist.plot(60, cumulative=False)
```
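If you prefer raw numbers to a plot, `FreqDist` can also return them directly. A quick sketch, assuming the same `dist` object as above:

```python
# Print the ten most frequent tokens with their counts
# instead of (or in addition to) plotting them.
for word, count in dist.most_common(10):
    print(word, count)
```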
#### Common Expressions

Collocations are expressions of multiple words which commonly co-occur. In this example I limited the number to 60:

```python
import nltk

text = nltk.Text(_tokens)
collocation = text.collocations(num=60)
```

Well, the common expressions are:

> United States; make sure; health care; middle class; American people; God bless; White House; young people; years ago; 21st century; Middle East; long term; Prime Minister; making sure; clean energy; climate change; health insurance; national security; Governor Romney; law enforcement; nuclear weapons; little bit; private sector; Wall Street; international community; Affordable Care; nuclear weapon; every single; small businesses; Social Security; four years; human rights; civil society; move forward; Supreme Court; Care Act; bin Laden; New York; every day; United Nations; tax cuts; even though; first time; World War; insurance companies; status quo; two years; Cold War; last year; federal government; economic growth; global economy; come together; whole bunch; good news; Asia Pacific; Good afternoon; new jobs; took office; common sense

#### Extracting Nouns, Locations, Organizations And Other Stuff

The code:

```python
from nltk import ne_chunk
from nltk.tree import Tree

# Entity types recognized by the NLTK chunker, with examples:
# ORGANIZATION  Georgia-Pacific Corp., WHO
# PERSON        Eddy Bonte, President Obama
# LOCATION      Murray River, Mount Everest
# DATE          June, 2008-06-29
# TIME          two fifty a m, 1:30 p.m.
# MONEY         175 million Canadian Dollars, GBP 10.40
# PERCENT       twenty pct, 18.75 %
# FACILITY      Washington Monument, Stonehenge
# GPE           South East Asia, Midlothian

nouns = [chunk for chunk in ne_chunk(tagged) if isinstance(chunk, Tree)]

persons = []
locations = []
organizations = []
dates = []
times = []
percents = []
facilities = []
gpes = []

for tree in nouns:
    if tree.label() == "PERSON":
        persons.append(' '.join(c[0] for c in tree.leaves()))
    if tree.label() == "LOCATION":
        locations.append(' '.join(c[0] for c in tree.leaves()))
    if tree.label() == "ORGANIZATION":
        organizations.append(' '.join(c[0] for c in tree.leaves()))
    if tree.label() == "DATE":
        dates.append(' '.join(c[0] for c in tree.leaves()))
    if tree.label() == "TIME":
        times.append(' '.join(c[0] for c in tree.leaves()))
    if tree.label() == "PERCENT":
        percents.append(' '.join(c[0] for c in tree.leaves()))
    if tree.label() == "FACILITY":
        facilities.append(' '.join(c[0] for c in tree.leaves()))
    if tree.label() == "GPE":
        gpes.append(' '.join(c[0] for c in tree.leaves()))
```

The result is the frequency of every person name, location, or organization name that appeared in the speeches.
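The original charts for these frequencies are not reproduced here, but the counts behind them are easy to get with `collections.Counter`. A minimal sketch, assuming the lists built above:

```python
from collections import Counter

# Count how often each extracted entity appears in the speeches
# and show the most frequent ones per category.
for label, values in [("Persons", persons),
                      ("Locations", locations),
                      ("Organizations", organizations),
                      ("GPEs", gpes)]:
    print(label)
    for name, count in Counter(values).most_common(10):
        print("  %s: %d" % (name, count))
```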
#### Finding Other Possibilities

Using n-grams can help us generate word sequences of two words (bigrams), three words (trigrams), or any other length. These sequences are built from all consecutive combinations of words in the speeches. Let's see this example:

```python
from nltk.util import bigrams, trigrams, everygrams

bi = bigrams(tokens)
tri = trigrams(tokens)
every = everygrams(_tokens, min_len=20, max_len=20)

bilist = list(bi)[:120]
for element in bilist:
    print(element[0] + " " + element[1])
```

_P.S. I am using only the odd indexes of the list in my real example._

And the result is:

> Thank everybody
> Well thank
> Janice thanks
> everybody coming
> beautiful day
> Welcome White
> House three
> weeks ago
> federal government
> shut Affordable
> Care Act
> health insurance
> marketplaces opened
> business across
> country Well
> gotten government
> back open
> American people
> today want
> talk going
> get marketplaces
> running full
> steam well
> joined today
> folks either
> benefited Affordable
> Care Act
> already helping
> fellow citizens
> learn law
> means get
> covered course
> probably heard
> new website
> people apply
> health insurance
> browse buy
> affordable plans
> states worked
> smoothly supposed
> work number
> people visited
> site overwhelming
> aggravated underlying
> problems Despite
> thousands people
> signing saving
> money speak
> Many Americans
> preexisting condition
> like Janice
> discovering finally
> get health
> insurance like
> everybody else
> today want
> speak every
> American looking
> get affordable
> health insurance

#### Let's Generate A Speech

We can use the 348 speech files to generate our own speech, and since a speech is composed of sentences, the next code generates random sentences using a Markov chain:

```python
from pymarkovchain import MarkovChain

mc = MarkovChain()
mc.generateDatabase(r)

for i in range(1, 20):
    g = mc.generateString()
    print(g)
```

And here is a list of generated sentences:

> They didn't simply embrace the American ideal, they lived it
> We will never send you into harm's way unless it's absolutely necessary
> In Africa — kingdoms come and say, enough; we've suffered too much time jogging in place
> And of course, there is a man's turn
> And so it makes no sense
> And we had an impact on families, and reduces the deficit — three or fourfold, and help make sure that when firms seek new markets that already and saved their lives so that a nuclear weapon with one that protects both the heartbreak and destruction in Mumbai
> As Senator McCain fought long and contentious issues the Court is different from any disruptions or damage
> Now — Now let me say as this process
> They lead nations to work than the intelligence services had alerted U
> For good reasons, we don't lay this new century — and Congressman Andre Carson from the United States and other countries — South Korea -– while also addressing its causes
> We're also announcing a new United Nations was born on 3rd base, thinking you hit a lot of you know
> And if there's disagreement
> You've got time for a ceasefire in the memory of those $716 billion in sensible spending cuts to education
> No, it's not just one of those who struggled at times to have that kind of politics
> And when that happens, progress stalls
> And based on facts and assumptions — those things where I stood before you faced these same fears and insecurities, and admitting when we're not betting on the ground
> Thank you very much
> And we can avoid prescribing a medication that could fall into three areas that are more efficient
> So they want is these
> There are setbacks and false starts, and we want to extend Bush tax cuts for small businesses; a tax cut — for the vote
> The pundits, the pundits have pointed out correctly that production of clean energy to the job done
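pymarkovchain hides the mechanics, so as a rough illustration of what the chain is actually doing, here is a hand-rolled sketch of a word-level Markov generator. The `build_chain` and `generate` helpers are hypothetical, not part of the article's code, and the sketch assumes the raw text `r` from the tokenization step:

```python
import random
from collections import defaultdict


def build_chain(words):
    """Map each word to the list of words that follow it in the corpus."""
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain


def generate(chain, length=20):
    """Random-walk the chain to produce a pseudo-sentence."""
    word = random.choice(list(chain.keys()))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break
        word = random.choice(followers)
        out.append(word)
    return " ".join(out)


# Example usage: build the chain from the raw speech text split on whitespace.
chain = build_chain(r.split())
print(generate(chain))
```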
Processing natural language is fun. All of the above code needs more optimization, but as an introduction to NLTK, it was a good exercise.

### Connect Deeper

If this article resonated with you, please subscribe to [DevOpsLinks](http://devopslinks.com): an online community of diverse and passionate DevOps engineers, sysadmins, and developers from all over the world.

You can find me on [Twitter](https://twitter.com/eon01), [Clarity](https://clarity.fm/aymenelamri/), or my [blog](http://eon01.com/blog), and you can also check out my books: [SaltStack For DevOps](http://saltstackfordevops.com), [The Jumpstart Up](http://thejumpstartup.com) & [Painless Docker](http://painlessdocker.com).

If you liked this post, please recommend it and share it with your followers.