Over the last few days I have been playing with some tools to analyze online texts, and I have been using NLTK (Natural Language Toolkit), a platform for building Python programs that work with human language data.
NLP, or natural language processing, is the science of enabling computers to understand human language, derive meaning from it, and generate natural language. It sits at the intersection of computer science, artificial intelligence, and linguistics.
Lately I have been working on a similar project: implementing an intelligent chatbot on my Raspberry Pi. I will be posting that experiment on my blog, but for now, let's learn how to use NLTK to analyze text.
The analysis in this tutorial is based on 348 files, but it is still approximate and intended as an educational exercise to learn the basics, nothing more. NLTK is a great tool, but it remains software that keeps evolving to become more accurate. You may find some small classification errors, but they are negligible compared to the overall result.
The texts were downloaded from this site. I have not read the content of every speech, but the overwhelming majority of these texts were spoken by Obama himself; anything said by somebody else is negligible compared to the overall result of this analysis.
The first step is downloading the content. I used this simple Python script:
#!/usr/bin/env python
# coding: utf8

from goose import Goose
import urllib
import lxml.html
import codecs

def get_links(url, domain):
    connection = urllib.urlopen(url)
    dom = lxml.html.fromstring(connection.read())
    # Select the url in href for all <a> tags (links)
    for link in dom.xpath('//a/@href'):
        if link.startswith("speech") and link.endswith("htm"):
            yield domain + link

def get_text(url):
    g = Goose()
    article = g.extract(url=url)
    with codecs.open(article.link_hash + ".speech", "w", "utf-8-sig") as text_file:
        text_file.write(article.cleaned_text)

if __name__ == "__main__":
    link = "http://www.americanrhetoric.com/barackobamaspeeches.htm"
    domain = "http://www.americanrhetoric.com/"
    for i in get_links(link, domain):
        get_text(i)
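A quick note on the script above: it targets Python 2 (it calls urllib.urlopen directly) and relies on the Goose article extractor and lxml to find and clean the speech pages. Each speech is saved to a .speech file named after the article's link hash.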
Concatenating is the second step:
import os

for file in os.listdir("."):
    if file.endswith(".speech"):
        os.system("cat " + file + " >> all.speeches")
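If you are not on a system that has cat, the same concatenation can be done in pure Python. Here is a minimal sketch that assumes the same .speech extension and the same all.speeches output file:

import codecs
import os

# Concatenate every downloaded .speech file into all.speeches without shelling out
with codecs.open("all.speeches", "w", "utf-8-sig") as out_file:
    for name in os.listdir("."):
        if name.endswith(".speech"):
            with codecs.open(name, "r", "utf-8-sig") as in_file:
                out_file.write(in_file.read())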
The next step is to create what we call tokens in NLTK jargon:
import codecs
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# Assumes the NLTK stopwords corpus has been downloaded (nltk.download('stopwords'))
english_stopwords = stopwords.words("english")

with codecs.open("all.speeches", "r", "utf-8-sig") as text_file:
    r = text_file.read()

# Remove punctuation by keeping only word characters
tokenizer = RegexpTokenizer(r'\w+')
_tokens = tokenizer.tokenize(r)

tokens = [t for t in _tokens if t.lower() not in english_stopwords]
According to Wikipedia, the lexical diversity of a given text is the ratio of unique word stems (types) to the total number of words (tokens).
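Before plotting it for the speeches, here is a tiny made-up example of that ratio, just to make the formula concrete (the sample sentence is invented for illustration):

sample = "we the people of the united states".split()

# 7 tokens, 6 unique word types -> about 85% lexical diversity
diversity = len(set(sample)) * 100 / len(sample)
print(diversity)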
import numpy as np
import matplotlib.pyplot as plt

# Process lexical diversity
st = len(set(tokens))
lt = len(tokens)
y = [st * 100 / lt]
print(y)

fig = plt.figure()
ax = fig.add_subplot(111)
N = 1
ind = np.arange(N)
width = 0.7
rect = ax.bar(ind, y, width, color='black')

ax.set_xlim(-width, len(ind) + width)
ax.set_ylim(0, 100)
ax.set_ylabel('Score')
ax.set_title('Lexical Diversity')
xTickMarks = ['Lexical Diversity Meter']
ax.set_xticks(ind + width)
xtickNames = ax.set_xticklabels(xTickMarks)
plt.setp(xtickNames, rotation=45, fontsize=10)

ax.legend((rect[0],), ('Lexical Diversity',))
plt.show()
As explained in the official NLTK documentation, the process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging.
In simple words, a word can be a noun, a verb, an adjective, and so on. NLTK will help us determine this:
import nltk
import numpy as np
from collections import Counter

# Get tagged tokens
tagged = nltk.pos_tag(tokens)

# Count how many times each POS tag appears
counts = Counter(tag for word, tag in tagged)

keys = counts.keys()
y_pos = np.arange(len(keys))
p = [counts[k] for k in keys]
error = np.random.rand(len(keys))
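The snippet above only prepares the counts. A minimal sketch of how they can be plotted with matplotlib, using the keys, y_pos, p, and error variables built above (the exact chart style is my assumption, not the original figure):

import matplotlib.pyplot as plt

# Horizontal bar chart of POS tag frequencies
plt.barh(y_pos, p, xerr=error, align='center', alpha=0.4)
plt.yticks(y_pos, list(keys))
plt.xlabel('Occurrences')
plt.title('Part-of-speech tags in the speeches')
plt.show()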
Here is a list of POS tags, each with a description and an example:
POS Tag | Description | Example
CC | coordinating conjunction | and
CD | cardinal number | 1, third
DT | determiner | the
EX | existential there | there is
FW | foreign word | d'hoevre
IN | preposition/subordinating conjunction | in, of, like
JJ | adjective | big
JJR | adjective, comparative | bigger
JJS | adjective, superlative | biggest
LS | list marker | 1)
MD | modal | could, will
NN | noun, singular or mass | door
NNS | noun, plural | doors
NNP | proper noun, singular | John
NNPS | proper noun, plural | Vikings
PDT | predeterminer | both the boys
POS | possessive ending | friend's
PRP | personal pronoun | I, he, it
PRP$ | possessive pronoun | my, his
RB | adverb | however, usually, naturally, here, good
RBR | adverb, comparative | better
RBS | adverb, superlative | best
RP | particle | give up
TO | to | to go, to him
UH | interjection | uhhuhhuhh
VB | verb, base form | take
VBD | verb, past tense | took
VBG | verb, gerund/present participle | taking
VBN | verb, past participle | taken
VBP | verb, sing. present, non-3rd person | take
VBZ | verb, 3rd person sing. present | takes
WDT | wh-determiner | which
WP | wh-pronoun | who, what
WP$ | possessive wh-pronoun | whose
WRB | wh-adverb | where, when
And here is what we got:
Common Words
To figure out the 60 most used words (excluding stop words) in Obama's speeches, this code will do the work:
# Top 60 words
dist = nltk.FreqDist(tokens)
dist.plot(60, cumulative=False)
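If you prefer raw numbers to a plot, FreqDist also exposes the counts directly, for example:

# Print the 10 most frequent tokens with their counts
for word, count in dist.most_common(10):
    print(word, count)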
Collocations are expressions of multiple words which commonly co-occur. In this example I limited the number to 60:
text = nltk.Text(_tokens)
collocation = text.collocations(num=60)
Well, the most common expressions are:
United States; make sure; health care; middle class; American people; God bless; White House; young people; years ago; 21st century; Middle East; long term; Prime Minister; making sure; clean energy; climate change; health insurance; national security; Governor Romney; law enforcement; nuclear weapons; little bit; private sector; Wall Street; international community; Affordable Care; nuclear weapon; every single; small businesses; Social Security; four years; human rights; civil society; move forward; Supreme Court; Care Act; bin Laden; New York; every day; United Nations; tax cuts; even though; first time; World War; insurance companies; status quo; two years; Cold War; last year; federal government; economic growth; global economy; come together; whole bunch; good news; Asia Pacific; Good afternoon; new jobs; took office; common sense
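Note that Text.collocations() prints its results rather than returning them. If you want the collocations as data you can work with, one option is NLTK's collocations module; this is a sketch of that approach, not the code that produced the list above:

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Score bigrams by likelihood ratio and keep the 60 best ones
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(_tokens)
finder.apply_freq_filter(3)  # Ignore bigrams that appear fewer than 3 times
top_collocations = finder.nbest(bigram_measures.likelihood_ratio, 60)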
The next step is named entity recognition: extracting the people, organizations, locations, and other entities mentioned in the speeches. NLTK's chunker recognizes the following entity types, and the code follows:
# ORGANIZATION   Georgia-Pacific Corp., WHO
# PERSON         Eddy Bonte, President Obama
# LOCATION       Murray River, Mount Everest
# DATE           June, 2008-06-29
# TIME           two fifty a m, 1:30 p.m.
# MONEY          175 million Canadian Dollars, GBP 10.40
# PERCENT        twenty pct, 18.75 %
# FACILITY       Washington Monument, Stonehenge
# GPE            South East Asia, Midlothian
from nltk import ne_chunk
from nltk.tree import Tree

# Keep only the chunks that were recognized as named entities
nouns = [chunk for chunk in ne_chunk(tagged) if isinstance(chunk, Tree)]

persons = []
locations = []
organizations = []
dates = []
times = []
percents = []
facilities = []
gpes = []

for tree in nouns:
    # Join the words of the chunk back into a single entity string
    entity = ' '.join(c[0] for c in tree.leaves())
    if tree.label() == "PERSON":
        persons.append(entity)
    elif tree.label() == "LOCATION":
        locations.append(entity)
    elif tree.label() == "ORGANIZATION":
        organizations.append(entity)
    elif tree.label() == "DATE":
        dates.append(entity)
    elif tree.label() == "TIME":
        times.append(entity)
    elif tree.label() == "PERCENT":
        percents.append(entity)
    elif tree.label() == "FACILITY":
        facilities.append(entity)
    elif tree.label() == "GPE":
        gpes.append(entity)
The result is the frequency of every person, location, or organization name that appears in the speeches:
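The frequencies themselves are straightforward to compute from the lists built above; here is a minimal sketch using collections.Counter (my assumption, not necessarily the exact code behind the charts):

from collections import Counter

# Most frequently mentioned people and organizations
print(Counter(persons).most_common(20))
print(Counter(organizations).most_common(20))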
Using n-grams, we can extract sequences of two words (bigrams), three words (trigrams), or any number of words. An n-gram is simply a contiguous sequence of n tokens from the text; for example, the bigrams of "we the people" are ("we", "the") and ("the", "people").
Let’s see this example:
from nltk import bigrams, trigrams, everygrams

bi = bigrams(tokens)
tri = trigrams(tokens)
every = everygrams(_tokens, min_len=20, max_len=20)

i = 0
bilist = list(bi)[:120]

for element in bilist:
    print(element[0] + " " + element[1])
PS: in my real example I am using only the odd indexes of the list.
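For the record, keeping only the odd indexes is a one-liner with Python's slice syntax (a small sketch of what I mean):

# Keep only the elements at odd indexes (1, 3, 5, ...)
for first, second in bilist[1::2]:
    print(first + " " + second)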
And the result is:
Thank everybody; Well thank; Janice thanks; everybody coming; beautiful day; Welcome White; House three; weeks ago; federal government; shut Affordable; Care Act; health insurance; marketplaces opened; business across; country Well; gotten government; back open; American people; today want; talk going; get marketplaces; running full; steam well; joined today; folks either; benefited Affordable; Care Act; already helping; fellow citizens; learn law; means get; covered course; probably heard; new website; people apply; health insurance; browse buy; affordable plans; states worked; smoothly supposed; work number; people visited; site overwhelming; aggravated underlying; problems Despite; thousands people; signing saving; money speak; Many Americans; preexisting condition; like Janice; discovering finally; get health; insurance like; everybody else; today want; speak every; American looking; get affordable; health insurance
We can use the 348 speech files to generate our own speech. Since a speech is composed of sentences, the next code will generate random sentences using a Markov chain:
from pymarkovchain import MarkovChain

mc = MarkovChain()
# Build the Markov model once from the full text of the speeches
mc.generateDatabase(r)

# Generate a batch of random sentences
for i in range(1, 20):
    g = mc.generateString()
    print(g)
And here is a list of generated sentences:
They didn’t simply embrace the American ideal, they lived it
We will never send you into harm’s way unless it’s absolutely necessary
In Africa — kingdoms come and say, enough; we’ve suffered too much time jogging in place
And of course, there is a man’s turn
And so it makes no sense
And we had an impact on families, and reduces the deficit — three or fourfold, and help make sure that when firms seek new markets that already and saved their lives so that a nuclear weapon with one that protects both the heartbreak and destruction in Mumbai
As Senator McCain fought long and contentious issues the Court is different from any disruptions or damage
Now — Now let me say as this process
They lead nations to work than the intelligence services had alerted U
For good reasons, we don’t lay this new century — and Congressman Andre Carson from the United States and other countries — South Korea -– while also addressing its causes
We’re also announcing a new United Nations was born on 3rd base, thinking you hit a lot of you know
And if there’s disagreement
You’ve got time for a ceasefire in the memory of those $716 billion in sensible spending cuts to education
No, it’s not just one of those who struggled at times to have that kind of politics
And when that happens, progress stalls
And based on facts and assumptions — those things where I stood before you faced these same fears and insecurities, and admitting when we’re not betting on the ground
Thank you very much
And we can avoid prescribing a medication that could fall into three areas that are more efficient
So they want is these
There are setbacks and false starts, and we want to extend Bush tax cuts for small businesses; a tax cut — for the vote
The pundits, the pundits have pointed out correctly that production of clean energy to the job done
Processing natural language is fun. All of the code above needs more optimization, but as an introduction to NLTK, it was a good exercise.
If you resonated with this article, please subscribe to DevOpsLinks: An Online Community Of Diverse & Passionate DevOps, SysAdmins & Developers From All Over The World.
You can find me on Twitter, Clarity, or my blog, and you can also check my books: SaltStack For DevOps, The Jumpstart Up & Painless Docker.
If you liked this post, please recommend and share it with your followers.