Analyzing Obama Speeches Since 2004

by Aymen (@eon01)


Photo by Jason Yu on Unsplash

Over the last few days I have been playing with some tools for analyzing online texts, mainly NLTK (Natural Language Toolkit), a platform for building Python programs that work with human language data.

NLP, or natural language processing, is the science of enabling computers to understand human language, derive meaning from it, and generate natural language. It sits at the intersection of computer science, artificial intelligence, and linguistics.

Lately I have been working on a similar project: an intelligent chatbot running on my Raspberry Pi. I will post that experiment on my blog, but for now, let’s learn how to use NLTK to analyze text.

The analysis in this tutorial is based on 348 files, but it is still approximate and is intended as an educational tool for learning the basics, nothing more. NLTK is a great tool, but it is software that is still evolving and becoming more efficient over time. You may find some small classification errors, but they are negligible compared to the overall result.

The texts were downloaded from this site. I have not read every speech, but the overwhelming majority of the text was spoken by Obama. Anything said by somebody else is negligible compared to the overall result of this analysis.

Downloading Content

The first step is downloading the content. I used this simple Python script:

#!/usr/bin/env python
# coding: utf8
# Python 2 script: urllib.urlopen and the goose package are Python 2 only.
from goose import Goose
import urllib
import lxml.html
import codecs

def get_links(url, domain):
    """Yield the absolute URL of every speech page linked from url."""
    connection = urllib.urlopen(url)
    dom = lxml.html.fromstring(connection.read())
    for link in dom.xpath('//a/@href'):  # select the href of every <a> tag
        if link.startswith("speech") and link.endswith("htm"):
            yield domain + link

def get_text(url):
    """Extract the article text from url and save it to a .speech file."""
    g = Goose()
    article = g.extract(url=url)
    with codecs.open(article.link_hash + ".speech", "w", "utf-8-sig") as text_file:
        # Write the extracted text (this line is missing from the listing above)
        text_file.write(article.cleaned_text)

if __name__ == "__main__":
    link = "http://www.americanrhetoric.com/barackobamaspeeches.htm"
    domain = "http://www.americanrhetoric.com/"
    for i in get_links(link, domain):
        get_text(i)
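
For readers on Python 3 without lxml, the link-filtering step can be approximated with the standard library alone. This is only a sketch: `SpeechLinkParser` and `extract_speech_links` are hypothetical names, and the sample markup is made up rather than taken from the real index page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SpeechLinkParser(HTMLParser):
    """Collect href values that look like speech pages (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Same filter as the script: links starting with "speech", ending in "htm"
        if href and href.startswith("speech") and href.endswith("htm"):
            self.links.append(href)


def extract_speech_links(html, base_url):
    """Return absolute URLs for every matching link found in html."""
    parser = SpeechLinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


# Made-up sample markup, not the real index page
sample = ('<a href="speeches/a.htm">A</a> '
          '<a href="index.htm">Home</a> '
          '<a href="speeches/b.pdf">B</a>')
links = extract_speech_links(sample, "http://www.americanrhetoric.com/")
print(links)
```

Only the first link survives the filter; the other two fail the prefix or suffix check.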

Concatenating the files is the second step:

import os
for file in os.listdir("."):
    if file.endswith(".speech"):
        # Append each speech to one big file (relies on the Unix cat command)
        os.system("cat " + file + " >> all.speeches")
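
If the `cat` command is not available (on Windows, for example), the same step can be done in pure Python. The sketch below is one possible approach, with `concatenate_speeches` as a hypothetical helper name. It also drops the byte order mark that the `utf-8-sig` codec wrote at the start of each downloaded file, which `cat` would keep in the middle of the combined text.

```python
import tempfile
from pathlib import Path


def concatenate_speeches(directory, output_name="all.speeches"):
    """Append every *.speech file in directory to one output file.

    Returns the number of files that were merged.
    """
    directory = Path(directory)
    parts = sorted(directory.glob("*.speech"))  # sorted for a stable order
    with open(directory / output_name, "w", encoding="utf-8") as out:
        for part in parts:
            # utf-8-sig strips the byte order mark written by the downloader
            out.write(part.read_text(encoding="utf-8-sig"))
            out.write("\n")
    return len(parts)


# Demo in a throwaway directory with two made-up files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.speech").write_text("Hello.", encoding="utf-8-sig")
    Path(tmp, "b.speech").write_text("Goodbye.", encoding="utf-8-sig")
    merged = concatenate_speeches(tmp)
    combined = Path(tmp, "all.speeches").read_text(encoding="utf-8")
    print(merged, repr(combined))
```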

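Next, we split the text into what NLTK calls tokens:

import codecs
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

# Assumed definition: the original does not show how english_stopwords is built.
# The stop word corpus must be downloaded first with nltk.download("stopwords").
english_stopwords = set(stopwords.words("english"))

with codecs.open("all.speeches", "r", "utf-8-sig") as text_file:
    r = text_file.read()
# Remove punctuation by keeping only runs of word characters
tokenizer = RegexpTokenizer(r'\w+')
_tokens = tokenizer.tokenize(r)
# Get clean tokens (stop words removed)
tokens = [t for t in _tokens if t.lower() not in english_stopwords]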
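
To see roughly what these two steps do without installing NLTK, here is a small standard-library sketch. The stop word set is a tiny made-up list for illustration (NLTK's English list is much longer), and `tokenize` and `remove_stopwords` are hypothetical helper names.

```python
import re

# Tiny illustrative stop word list (an assumption; NLTK's list is much longer)
STOPWORDS = {"the", "a", "of", "and", "to", "is", "we", "in"}


def tokenize(text):
    """Split text into runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", text)


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stopwords]


raw = "We, the people, believe in the promise of America."
all_tokens = tokenize(raw)
clean_tokens = remove_stopwords(all_tokens)
print(all_tokens)
print(clean_tokens)
```

One side effect of the `\w+` pattern is worth knowing: contractions are split at the apostrophe, so "don't" becomes the two tokens "don" and "t".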

Analyzing Content

The Lexical Diversity

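According to Wikipedia, the lexical diversity of a given text is defined as the ratio of the total number of words to the number of different unique word stems. The code below plots the inverse of that ratio, the number of unique tokens divided by the total number of tokens, as a percentage.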

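import matplotlib.pyplot as plt
import numpy as np

# Process lexical diversity
st = len(set(tokens))  # number of unique tokens
lt = len(tokens)       # total number of tokens
y = [st * 100.0 / lt]  # percentage; 100.0 avoids integer division on Python 2
fig = plt.figure()
ax = fig.add_subplot(111)
N = 1
# necessary variables
ind = np.arange(N)
width = 0.7
rect = ax.bar(ind, y, width, color='black')
# axes and labels
ax.set_title('Lexical Diversity')
xTickMarks = ['Lexical Diversity Meter']
ax.set_xticks(ind)  # one tick per bar, so the label lines up with it
xtickNames = ax.set_xticklabels(xTickMarks)
plt.setp(xtickNames, rotation=45, fontsize=10)
## add a legend
ax.legend((rect[0],), ('Unique tokens (%)',))
plt.show()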
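
Leaving the plotting aside, the number itself is simple to compute. The sketch below returns both forms of the ratio: unique tokens over total tokens as a percentage (what the plot shows) and total tokens over unique tokens. `lexical_diversity` is a hypothetical helper name and the sample tokens are made up.

```python
def lexical_diversity(tokens):
    """Return (unique/total as a percentage, total/unique) for a token list."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    unique = len(set(tokens))
    total = len(tokens)
    return 100.0 * unique / total, total / unique


sample = ["yes", "we", "can", "yes", "we", "can", "change", "now"]
percent_unique, words_per_type = lexical_diversity(sample)
print(percent_unique, words_per_type)
```

The sample has 5 distinct tokens out of 8. Note that the comparison is case-sensitive, so "We" and "we" count as two different tokens unless the list is lowercased first.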

POS Tags Frequency

As the official NLTK documentation explains, the process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging.

In simple terms, a word can be a noun, a verb, an adjective, and so on. NLTK will help us determine this:

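from collections import Counter
import nltk
import numpy as np

# get tagged tokens (the tagger model must be downloaded first with nltk.download)
tagged = nltk.pos_tag(tokens)
# count how many tokens received each tag (verb, noun, etc.)
counts = Counter(tag for word, tag in tagged)
# counter data, counts is the Counter object
keys = list(counts.keys())
y_pos = np.arange(len(keys))
# get the counts for each key
p = [counts[k] for k in keys]
# random values, one per tag (not derived from the data)
error = np.random.rand(len(keys))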
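
The counting step works the same way on any list of (word, tag) pairs. As a minimal sketch, the example below uses a hand-tagged, made-up sample in place of the output of `nltk.pos_tag` and prints each tag with its count and its share of all tokens.

```python
from collections import Counter

# Hand-tagged sample standing in for the output of nltk.pos_tag (made-up data)
tagged_sample = [
    ("young", "JJ"), ("people", "NNS"), ("make", "VBP"),
    ("clean", "JJ"), ("energy", "NN"), ("jobs", "NNS"),
    ("grow", "VBP"), ("economy", "NN"), ("future", "NN"),
]

tag_counts = Counter(tag for _word, tag in tagged_sample)
total = sum(tag_counts.values())

# Most frequent tags first, with their share of all tokens
for tag, count in tag_counts.most_common():
    print(f"{tag:<4} {count:>3} {100.0 * count / total:5.1f}%")
```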

Here is a list of POS tags with their descriptions:

POS Tag | Description | Example
CC | coordinating conjunction | and
CD | cardinal number | 1, third
DT | determiner | the
EX | existential there | there is
FW | foreign word | d’oeuvre
IN | preposition/subordinating conjunction | in, of, like
JJ | adjective | big
JJR | adjective, comparative | bigger
JJS | adjective, superlative | biggest
LS | list marker | 1)
MD | modal | could, will
NN | noun, singular or mass | door
NNS | noun, plural | doors
NNP | proper noun, singular | John
NNPS | proper noun, plural | Vikings
PDT | predeterminer | both the boys
POS | possessive ending | friend’s
PRP | personal pronoun | I, he, it
PRP$ | possessive pronoun | my, his
RB | adverb | however, usually, naturally, here, good
RBR | adverb, comparative | better
RBS | adverb, superlative | best
RP | particle | give up
TO | to | to go, to him
UH | interjection | uhhuhhuhh
VB | verb, base form | take
VBD | verb, past tense | took
VBG | verb, gerund/present participle | taking
VBN | verb, past participle | taken
VBP | verb, sing. present, non-3rd person | take
VBZ | verb, 3rd person sing. present | takes
WDT | wh-determiner | which
WP | wh-pronoun | who, what
WP$ | possessive wh-pronoun | whose
WRB | wh-adverb | where, when

And here is what we got:

  • More than 16k nouns
  • Almost 10k adjectives
  • Almost no predeterminers
  • etc.

Common Words

To find the 60 most used words (excluding stop words) in Obama’s speeches, this code does the job:

# Top 60 words
dist = nltk.FreqDist(tokens)
dist.plot(60, cumulative=False)
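
As far as I know, `FreqDist` in NLTK 3 is built on top of Python's `collections.Counter`, so the same ranking can be sketched with the standard library when no plot is needed. The tokens below are made up.

```python
from collections import Counter

# Made-up tokens, already stripped of stop words
sample_tokens = ["health", "care", "jobs", "health", "insurance",
                 "jobs", "health", "energy"]

word_counts = Counter(sample_tokens)
top_two = word_counts.most_common(2)
print(top_two)
```

With the real data, `most_common(60)` would return the same 60 words that the plot displays, as (word, count) pairs.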

Common Expressions

Collocations are expressions of multiple words which commonly co-occur. In this example I limited the number to 60:

text = nltk.Text(_tokens)
# collocations() prints its result and returns None, so there is nothing to assign
text.collocations(num=60)

Well, the common expressions are:

United States; 
make sure; 
health care; 
middle class; 
American people;
God bless; 
White House; 
young people; 
years ago; 
21st century; 
Middle East; 
long term; 
Prime Minister; 
making sure; 
clean energy; 
climate change; 
health insurance; 
national security; 
Governor Romney; 
law enforcement; 
nuclear weapons; 
little bit; 
private sector; 
Wall Street;
international community; 
Affordable Care; 
nuclear weapon; 
every single; 
small businesses; 
Social Security; 
four years; 
human rights;
civil society; 
move forward; 
Supreme Court; 
Care Act; 
bin Laden; 
New York; 
every day; 
United Nations; 
tax cuts; 
even though; 
first time;
World War; 
insurance companies; 
status quo; 
two years; 
Cold War; 
last year; 
federal government; 
economic growth; 
global economy; 
come together; 
whole bunch; 
good news; 
Asia Pacific; 
Good afternoon;
new jobs; 
took office; 
common sense
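
To get a feel for how such a list can be produced, here is a rough standard-library sketch. It ranks adjacent word pairs by pointwise mutual information (PMI), which is a simpler stand-in and not necessarily the scoring NLTK uses internally. `top_collocations`, its parameters, and the sample sentence are all hypothetical.

```python
import math
from collections import Counter


def top_collocations(tokens, min_count=2, limit=5):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(n_words - 1, 1)
    scored = []
    for (w1, w2), count in pair_counts.items():
        if count < min_count:
            continue  # rare pairs get unreliable PMI scores
        p_pair = count / n_pairs
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        scored.append((math.log2(p_pair / (p_w1 * p_w2)), w1, w2))
    scored.sort(reverse=True)
    return [(w1, w2) for _score, w1, w2 in scored[:limit]]


sample = ("health care costs rise and health care reform helps "
          "middle class families and middle class workers").split()
pairs = top_collocations(sample)
print(pairs)
```

Only the pairs that occur at least twice pass the `min_count` filter, which is why one-off neighbors such as "costs rise" do not show up.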

Extracting Nouns, Locations, Organizations And Other Stuff

The code:

from nltk import ne_chunk
from nltk.tree import Tree

# Entity labels, with examples:
# ORGANIZATION Georgia-Pacific Corp., WHO
# PERSON Eddy Bonte, President Obama
# LOCATION Murray River, Mount Everest
# DATE June, 2008-06-29
# TIME two fifty a m, 1:30 p.m.
# MONEY 175 million Canadian Dollars, GBP 10.40
# PERCENT twenty pct, 18.75 %
# FACILITY Washington Monument, Stonehenge
# GPE South East Asia, Midlothian
# Keep only the subtrees, which are the chunks labeled as named entities
nouns = [chunk for chunk in ne_chunk(tagged) if isinstance(chunk, Tree)]
persons = []
locations = []
organizations = []
dates = []
times = []
percents = []
facilities = []
gpes = []
for tree in nouns:
    # Rebuild the entity text from its (word, tag) leaves
    name = ' '.join(c[0] for c in tree.leaves())
    if tree.label() == "PERSON":
        persons.append(name)
    elif tree.label() == "LOCATION":
        locations.append(name)
    elif tree.label() == "ORGANIZATION":
        organizations.append(name)
    elif tree.label() == "DATE":
        dates.append(name)
    elif tree.label() == "TIME":
        times.append(name)
    elif tree.label() == "PERCENT":
        percents.append(name)
    elif tree.label() == "FACILITY":
        facilities.append(name)
    elif tree.label() == "GPE":
        gpes.append(name)

Counting the entries collected for each label gives the frequency of every person, location, or organization name that appears in the speeches.
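
The grouping and counting can also be written without one list per label. The sketch below is an alternative design, not the code used for this article: it keeps one counter per label in a dictionary, and uses made-up (label, text) pairs as a stand-in for the chunks returned by `ne_chunk`.

```python
from collections import Counter, defaultdict

# Stand-in for the chunks returned by ne_chunk: (label, entity text) pairs.
# The data is made up for illustration.
entities = [
    ("PERSON", "Janice"), ("GPE", "America"), ("ORGANIZATION", "Congress"),
    ("GPE", "America"), ("PERSON", "Janice"), ("GPE", "Iraq"),
]

# One Counter per label, created on first use
by_label = defaultdict(Counter)
for label, name in entities:
    by_label[label][name] += 1

for label in sorted(by_label):
    print(label, by_label[label].most_common())
```

A new entity type then needs no extra list or extra branch; it simply becomes another key in `by_label`.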

Finding Other Possibilities

N-grams are sequences of n consecutive tokens: pairs of words (bigrams), triples (trigrams), or runs of any other length. They are not arbitrary combinations of words; each n-gram is a window that slides over the speech text one token at a time.
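
The sliding-window idea fits in a few lines of plain Python. This is a minimal sketch with a hypothetical `ngrams` helper and a made-up word list; NLTK's own functions return generators instead of lists.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


words = ["health", "insurance", "marketplaces", "opened", "today"]
pairs = ngrams(words, 2)
triples = ngrams(words, 3)
print(pairs)
print(triples)
# Taking every other bigram gives non-overlapping word pairs
print(pairs[::2])
```

A list of k tokens yields k - n + 1 n-grams, so five words give four bigrams and three trigrams.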

Let’s see this example:

from nltk.util import bigrams, trigrams, everygrams

bi = bigrams(tokens)
tri = trigrams(tokens)
every = everygrams(_tokens, min_len=20, max_len=20)
# bigrams() returns a generator, so turn it into a list before slicing
bilist = list(bi)[:120]
for element in bilist:
    print(element[0] + " " + element[1])

PS: I am using only the odd indexes of the list in my real example.

And the result is:

Thank everybody
Well thank
Janice thanks
everybody coming
beautiful day
Welcome White
House three
weeks ago
federal government
shut Affordable
Care Act
health insurance
marketplaces opened
business across
country Well
gotten government
back open
American people
today want
talk going
get marketplaces
running full
steam well
joined today
folks either
benefited Affordable
Care Act
already helping
fellow citizens
learn law
means get
covered course
probably heard
new website
people apply
health insurance
browse buy
affordable plans
states worked
smoothly supposed
work number
people visited
site overwhelming
aggravated underlying
problems Despite
thousands people
signing saving
money speak
Many Americans
preexisting condition
like Janice
discovering finally
get health
insurance like
everybody else
today want
speak every
American looking
get affordable
health insurance

Let’s Generate A Speech

We can use the 348 speech files to generate our own speech. Since a speech is made of sentences, the following code generates random sentences using a Markov chain:

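from pymarkovchain import MarkovChain
mc = MarkovChain()
# Assumed step, not shown in the original: build the chain from the raw text r read earlier
mc.generateDatabase(r)
for i in range(1, 20):
    g = mc.generateString()
    print(g)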

And here is a list of generated sentences:

They didn’t simply embrace the American ideal, they lived it
We will never send you into harm’s way unless it’s absolutely necessary
In Africa — kingdoms come and say, enough; we’ve suffered too much time jogging in place
And of course, there is a man’s turn
And so it makes no sense
And we had an impact on families, and reduces the deficit — three or fourfold, and help make sure that when firms seek new markets that already and saved their lives so that a nuclear weapon with one that protects both the heartbreak and destruction in Mumbai
As Senator McCain fought long and contentious issues the Court is different from any disruptions or damage
Now — Now let me say as this process
They lead nations to work than the intelligence services had alerted U
For good reasons, we don’t lay this new century — and Congressman Andre Carson from the United States and other countries — South Korea -– while also addressing its causes
We’re also announcing a new United Nations was born on 3rd base, thinking you hit a lot of you know
And if there’s disagreement
You’ve got time for a ceasefire in the memory of those $716 billion in sensible spending cuts to education
No, it’s not just one of those who struggled at times to have that kind of politics
And when that happens, progress stalls
And based on facts and assumptions — those things where I stood before you faced these same fears and insecurities, and admitting when we’re not betting on the ground
Thank you very much
And we can avoid prescribing a medication that could fall into three areas that are more efficient
So they want is these
There are setbacks and false starts, and we want to extend Bush tax cuts for small businesses; a tax cut — for the vote
The pundits, the pundits have pointed out correctly that production of clean energy to the job done
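
For readers curious about what a generator like this does under the hood, here is a minimal word-level Markov chain in plain Python. It is a sketch of the general technique, and the library's internals may well differ. The helper names and the training sentences are made up.

```python
import random
from collections import defaultdict


def build_chain(sentences):
    """Map each word to the list of words seen right after it.

    None marks the start of a sentence (as a key) and its end (as a value).
    """
    chain = defaultdict(list)
    for sentence in sentences:
        previous = None
        for word in sentence.split():
            chain[previous].append(word)
            previous = word
        chain[previous].append(None)
    return chain


def generate_sentence(chain, rng, max_words=30):
    """Walk the chain from the start marker until an end marker or max_words."""
    words = []
    current = None
    while len(words) < max_words:
        current = rng.choice(chain[current])
        if current is None:
            break
        words.append(current)
    return " ".join(words)


# Made-up training sentences, not taken from the speeches
training = [
    "we can build a better future",
    "we can create new jobs",
    "a better future starts today",
]
chain = build_chain(training)
rng = random.Random(42)  # seeded so runs are repeatable
sentence = generate_sentence(chain, rng)
print(sentence)
```

Because each word is chosen only from the words that followed the previous one in the training text, every adjacent pair in the output is locally plausible even when the whole sentence drifts, which is exactly the effect visible in the generated sentences above.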

Processing natural language is fun. All of the code above could use more optimization, but as an introduction to NLTK it was a good exercise.


