Photo by Jason Yu on Unsplash

I have been playing these last days with some tools to analyze online texts, and I have been using NLTK (Natural Language ToolKit), a platform for building Python programs to work with human language data.

Natural language processing, or NLP, is the science of enabling the computer to understand human language, derive meaning and generate natural language. It is the intersection of computer science, artificial intelligence and linguistics.

Lately I have been working on a quite similar project to implement an intelligent chatbot on my Raspberry Pi. I will be posting that experiment on my blog, but for now, let's learn how to use NLTK to analyze text.

The analysis provided in this tutorial is approximate and is intended to be an educational tool to learn the basics, nothing more. NLTK is a great tool, but it is still software designed to evolve over time and become more efficient. Maybe you will find some small classification errors, but they remain negligible compared to the overall result.

The texts (348 files) have been downloaded from this site. I have not read the content of every speech, but the overwhelming majority of these texts were spoken by Obama himself; anything said by somebody else is negligible compared to the overall result of this analysis.

Downloading Content

The first thing is downloading the content. I used this simple Python script:

#!/usr/bin/env python
# coding: utf8

from goose import Goose
import urllib
import lxml.html
import codecs

def get_links(url, domain):
    connection = urllib.urlopen(url)
    dom = lxml.html.fromstring(connection.read())
    # select the url in href for all a tags (links)
    for link in dom.xpath('//a/@href'):
        if link.startswith("speech") and link.endswith("htm"):
            yield domain + link

def get_text(url):
    g = Goose()
    article = g.extract(url=url)
    with codecs.open(article.link_hash + ".speech", "w", "utf-8-sig") as text_file:
        text_file.write(article.cleaned_text)

if __name__ == "__main__":
    link = "http://www.americanrhetoric.com/barackobamaspeeches.htm"
    domain = "http://www.americanrhetoric.com/"
    for i in get_links(link, domain):
        get_text(i)

Concatenating the downloaded files is the second step:

import os

for file in os.listdir("."):
    if file.endswith(".speech"):
        os.system("cat " + file + " >> all.speeches")

Then it is recommended to create what we call tokens in NLTK jargon:

import codecs
from nltk.tokenize import RegexpTokenizer

with codecs.open("all.speeches", "r", "utf-8-sig") as text_file:
    r = text_file.read()

# Remove punctuation
tokenizer = RegexpTokenizer(r'\w+')
_tokens = tokenizer.tokenize(r)

Now get clean tokens by removing the stop words:

# english_stopwords is never defined in the original snippet;
# NLTK's built-in English stop word list is the natural choice
from nltk.corpus import stopwords  # requires nltk.download('stopwords')
english_stopwords = set(stopwords.words('english'))

tokens = [t for t in _tokens if t.lower() not in english_stopwords]

Analyzing Content

The Lexical Diversity

According to Wikipedia, the lexical diversity of a given text is defined as the ratio of the number of different unique word stems to the total number of words.
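To make that ratio concrete before plotting it, here is a minimal sketch (not from the original code) that computes the score on a toy token list:

# Lexical diversity = unique tokens / total tokens, expressed as a percentage
toy_tokens = ["yes", "we", "can", "yes", "we", "can", "change"]
score = 100.0 * len(set(toy_tokens)) / len(toy_tokens)
print(score)  # ~57: about 57% of the tokens are unique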
The code below computes this score for the speeches and plots it:

import numpy as np
import matplotlib.pyplot as plt

# Process lexical diversity
st = len(set(tokens))
lt = len(tokens)
y = [st * 100 / lt]
print(y)

fig = plt.figure()
ax = fig.add_subplot(111)

# necessary variables
N = 1
ind = np.arange(N)
width = 0.7
rect = ax.bar(ind, y, width, color='black')

# axes and labels
ax.set_xlim(-width, len(ind) + width)
ax.set_ylim(0, 100)
ax.set_ylabel('Score')
ax.set_title('Lexical Diversity')
xTickMarks = ['Lexical Diversity Meter']
ax.set_xticks(ind + width)
xtickNames = ax.set_xticklabels(xTickMarks)
plt.setp(xtickNames, rotation=45, fontsize=10)

# add a legend
ax.legend((rect[0],), ('',))
plt.show()

POS Tags Frequency

As explained in the official NLTK documentation, the process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging.

In simple words, a word can be a noun, a verb, an adjective, and so on. NLTK will help us determine this:

from collections import Counter
import nltk

# get tagged tokens
tagged = nltk.pos_tag(tokens)

# count top words by tag (verb, noun, etc.)
counts = Counter(tag for word, tag in tagged)

# counter data, counter is your counter object
keys = counts.keys()
y_pos = np.arange(len(keys))

# get the counts for each key
p = [counts[k] for k in keys]
error = np.random.rand(len(keys))

Here is a list of POS tags with a description:

POS Tag | Description | Example
CC | coordinating conjunction | and
CD | cardinal number | 1, third
DT | determiner | the
EX | existential there | there is
FW | foreign word | d'hoevre
IN | preposition/subordinating conjunction | in, of, like
JJ | adjective | big
JJR | adjective, comparative | bigger
JJS | adjective, superlative | biggest
LS | list marker | 1)
MD | modal | could, will
NN | noun, singular or mass | door
NNS | noun, plural | doors
NNP | proper noun, singular | John
NNPS | proper noun, plural | Vikings
PDT | predeterminer | both the boys
POS | possessive ending | friend's
PRP | personal pronoun | I, he, it
PRP$ | possessive pronoun | my, his
RB | adverb | however, usually, naturally, here, good
RBR | adverb, comparative | better
RBS | adverb, superlative | best
RP | particle | give up
TO | to | to go, to him
UH | interjection | uhhuhhuhh
VB | verb, base form | take
VBD | verb, past tense | took
VBG | verb, gerund/present participle | taking
VBN | verb, past participle | taken
VBP | verb, sing. present, non-3rd person | take
VBZ | verb, 3rd person sing. present | takes
WDT | wh-determiner | which
WP | wh-pronoun | who, what
WP$ | possessive wh-pronoun | whose
WRB | wh-adverb | where, when

And here is what we got: more than 16k nouns, almost 10k adjectives, almost no predeterminers, etc.

Common Words

To figure out the 60 most used words (excluding stop words) in Obama's speeches, this code will do the work:

# Top 60 words
dist = nltk.FreqDist(tokens)
dist.plot(60, cumulative=False)
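If you only need the numbers rather than a chart, FreqDist also exposes most_common; here is a minimal sketch (not in the original article) that prints the ten most frequent tokens:

# Print the ten most frequent tokens with their counts
for word, count in dist.most_common(10):
    print("%s: %d" % (word, count))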
Common Expressions

Collocations are expressions of multiple words which commonly co-occur. In this example I limited the number to 60:

text = nltk.Text(_tokens)
collocation = text.collocations(num=60)

Well, the common expressions are:

United States; make sure; health care; middle class; American people; God bless; White House; young people; years ago; 21st century; Middle East; long term; Prime Minister; making sure; clean energy; climate change; health insurance; national security; Governor Romney; law enforcement; nuclear weapons; little bit; private sector; Wall Street; international community; Affordable Care; nuclear weapon; every single; small businesses; Social Security; four years; human rights; civil society; move forward; Supreme Court; Care Act; bin Laden; New York; every day; United Nations; tax cuts; even though; first time; World War; insurance companies; status quo; two years; Cold War; last year; federal government; economic growth; global economy; come together; whole bunch; good news; Asia Pacific; Good afternoon; new jobs; took office; common sense

Extracting Nouns, Locations, Organizations And Other Stuff

The code:

# Named entity types recognized by NLTK's chunker, with examples:
# ORGANIZATION : Georgia-Pacific Corp., WHO
# PERSON : Eddy Bonte, President Obama
# LOCATION : Murray River, Mount Everest
# DATE : June, 2008-06-29
# TIME : two fifty a m, 1:30 p.m.
# MONEY : 175 million Canadian Dollars, GBP 10.40
# PERCENT : twenty pct, 18.75 %
# FACILITY : Washington Monument, Stonehenge
# GPE : South East Asia, Midlothian

from nltk import ne_chunk
from nltk.tree import Tree

nouns = [chunk for chunk in ne_chunk(tagged) if isinstance(chunk, Tree)]

persons = []
locations = []
organizations = []
dates = []
times = []
percents = []
facilities = []
gpes = []

for tree in nouns:
    if tree.label() == "PERSON":
        person = ' '.join(c[0] for c in tree.leaves())
        persons.append(person)
    if tree.label() == "LOCATION":
        location = ' '.join(c[0] for c in tree.leaves())
        locations.append(location)
    if tree.label() == "ORGANIZATION":
        organization = ' '.join(c[0] for c in tree.leaves())
        organizations.append(organization)
    if tree.label() == "DATE":
        date = ' '.join(c[0] for c in tree.leaves())
        dates.append(date)
    if tree.label() == "TIME":
        time = ' '.join(c[0] for c in tree.leaves())
        times.append(time)
    if tree.label() == "PERCENT":
        percent = ' '.join(c[0] for c in tree.leaves())
        percents.append(percent)
    if tree.label() == "FACILITY":
        facility = ' '.join(c[0] for c in tree.leaves())
        facilities.append(facility)
    if tree.label() == "GPE":
        gpe = ' '.join(c[0] for c in tree.leaves())
        gpes.append(gpe)

The result is actually the frequency of every person name, location or organization name that appeared in the speeches.

Finding Other Possibilities

Using n-grams, we can extract word sequences of length two (bigrams), three (trigrams) or any other length. These are simply all the contiguous sequences of that many words found in the speeches. Let's see this example:

from nltk.util import bigrams, trigrams, everygrams

bi = bigrams(tokens)
tri = trigrams(tokens)
every = everygrams(_tokens, min_len=20, max_len=20)

bilist = list(bi)[:120]
for element in bilist:
    print(element[0] + " " + element[1])

PS: I am using only the odd indexes of the list in my real example.
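For reference, keeping only the odd indexes mentioned in the note above is just a slice; this is my reading of that note, not code from the article:

# Keep every second bigram, starting at index 1 (the odd indexes)
odd_bigrams = bilist[1::2]
for first, second in odd_bigrams[:10]:
    print(first + " " + second)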
And the result is:

Thank everybody
Well thank
Janice thanks
everybody coming
beautiful day
Welcome White
House three
weeks ago
federal government
shut Affordable
Care Act
health insurance
marketplaces opened
business across
country Well
gotten government
back open
American people
today want
talk going
get marketplaces
running full
steam well
joined today
folks either
benefited Affordable
Care Act
already helping
fellow citizens
learn law
means get
covered course
probably heard
new website
people apply
health insurance
browse buy
affordable plans
states worked
smoothly supposed
work number
people visited
site overwhelming
aggravated underlying
problems Despite
thousands people
signing saving
money speak
Many Americans
preexisting condition
like Janice
discovering finally
get health
insurance like
everybody else
today want
speak every
American looking
get affordable
health insurance

Let's Generate A Speech

We can use the 348 speech files to generate our own speech. Since a speech is composed of sentences, the next code generates random sentences using a Markov chain:

from pymarkovchain import MarkovChain

mc = MarkovChain()
mc.generateDatabase(r)
for i in range(1, 20):
    g = mc.generateString()
    print(g)

And here is a list of generated sentences:

They didn't simply embrace the American ideal, they lived it We will never send you into harm's way unless it's absolutely necessary In Africa — kingdoms come and say, enough; we've suffered too much time jogging in place And of course, there is a man's turn And so it makes no sense And we had an impact on families, and reduces the deficit — three or fourfold, and help make sure that when firms seek new markets that already and saved their lives so that a nuclear weapon with one that protects both the heartbreak and destruction in Mumbai As Senator McCain fought long and contentious issues the Court is different from any disruptions or damage Now — Now let me say as this process They lead nations to work than the intelligence services had alerted U For good reasons, we don't lay this new century — and Congressman Andre Carson from the United States and other countries — South Korea — while also addressing its causes We're also announcing a new United Nations was born on 3rd base, thinking you hit a lot of you know And if there's disagreement You've got time for a ceasefire in the memory of those $716 billion in sensible spending cuts to education No, it's not just one of those who struggled at times to have that kind of politics And when that happens, progress stalls And based on facts and assumptions — those things where I stood before you faced these same fears and insecurities, and admitting when we're not betting on the ground Thank you very much And we can avoid prescribing a medication that could fall into three areas that are more efficient So they want is these There are setbacks and false starts, and we want to extend Bush tax cuts for small businesses; a tax cut — for the vote The pundits, the pundits have pointed out correctly that production of clean energy to the job done

Processing natural language is fun. All of the above code needs more optimization, but as an introduction to NLTK, it was a good exercise.

Connect Deeper

If you resonated with this article, please subscribe to DevOpsLinks: an online community of diverse and passionate DevOps engineers, SysAdmins and developers from all over the world.

You can find me on Twitter, Clarity, or my blog, and you can also check my books: SaltStack For DevOps, The Jumpstart Up and Painless Docker.

If you liked this post, please recommend and share it with your followers.