Exploratory data analysis is one of the most important parts of any machine learning workflow and Natural Language Processing is no different. But which tools should you choose to explore and visualize text data efficiently?

In this article (originally posted by Shahul Es on the Neptune blog), we will discuss and implement nearly all the major techniques that you can use to understand your text data and give you a complete(ish) tour into Python tools that get the job done.

Before we start: Dataset and Dependencies

In this article, we will use a million news headlines dataset from Kaggle. If you want to follow the analysis step-by-step you may want to install the following libraries:

    pip install \
       pandas matplotlib numpy \
       nltk seaborn scikit-learn gensim pyldavis \
       wordcloud textblob spacy textstat

Now, we can take a look at the data.

    import pandas as pd

    news = pd.read_csv('data/abcnews-date-text.csv', nrows=10000)
    news.head(3)

The dataset contains only two columns, the published date and the news headline.

For simplicity, I will be exploring the first 10000 rows from this dataset. Since the headlines are sorted by publish_date, this actually covers about 2 months, from February/19/2003 until April/07/2003.

Ok, I think we are ready to start our data exploration!

Analyzing text statistics

Text statistics visualizations are simple but very insightful techniques. They include word frequency analysis, sentence length analysis, average word length analysis, etc.

Those really help explore the fundamental characteristics of the text data.

To do so, we will be mostly using histograms (continuous data) and bar charts (categorical data).

First, I'll take a look at the number of characters present in each sentence. This can give us a rough idea about the news headline length.

    news['headline_text'].str.len().hist()

The histogram shows that news headlines range from 10 to 70 characters, and most of them fall between 25 and 55 characters.
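If you want numbers to go with the histogram, a minimal sketch like the one below may help. It uses only the standard library and a few made-up headlines; `toy_headlines` and `length_summary` are hypothetical names for illustration, not part of the original analysis.

```python
from statistics import mean, median

# Hypothetical toy headlines; in the real analysis you would pass
# news['headline_text'] instead.
toy_headlines = [
    "police probe fatal crash",
    "council to review water rates",
    "rain helps dampen bushfires",
]

def length_summary(headlines):
    """Return min, max, mean and median character length of the headlines."""
    lengths = [len(h) for h in headlines]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }

summary = length_summary(toy_headlines)
print(summary)
```

With pandas, `news['headline_text'].str.len().describe()` should give a similar summary in one call.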
Now, we will move on to data exploration at a word-level. Let's plot the number of words appearing in each news headline.

    news['headline_text'].str.split().\
        map(lambda x: len(x)).\
        hist()

It is clear that the number of words in news headlines ranges from 2 to 12 and mostly falls between 5 and 7 words.

Up next, let's check the average word length in each sentence.

    import numpy as np

    news['headline_text'].str.split().\
        apply(lambda x: [len(i) for i in x]).\
        map(lambda x: np.mean(x)).hist()

The average word length ranges between 3 and 9, with 5 being the most common length. Does it mean that people are using really short words in news headlines? Let's find out.

One reason why this may not be true is stopwords. Stopwords are the words that are most commonly used in any language, such as "the", "a", "an" etc. As these words are usually short, they may have pulled the averages in the above graph toward smaller values.

Analyzing the amount and the types of stopwords can give us some good insights into the data.

To get the corpus containing stopwords you can use the nltk library. Nltk contains stopwords from many languages. Since we are only dealing with English news I will filter the English stopwords from the corpus.

    import nltk
    from nltk.corpus import stopwords

    nltk.download('stopwords')
    stop = set(stopwords.words('english'))

Now, we'll create the corpus.

    from collections import defaultdict

    # flatten the headlines into one long list of words
    new = news['headline_text'].str.split()
    new = new.values.tolist()
    corpus = [word for i in new for word in i]

    # count how often each stopword occurs
    dic = defaultdict(int)
    for word in corpus:
        if word in stop:
            dic[word] += 1

and plot the top stopwords.

We can evidently see that stopwords such as "to", "in" and "for" dominate in news headlines.

So now that we know which stopwords occur frequently in our text, let's inspect which words other than these stopwords occur frequently.
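To make the stopword counting step concrete, here is a small self-contained sketch. It assumes a tiny hand-picked stopword set (`toy_stop`) instead of NLTK's English list and two made-up headlines, so the names and data are illustrative only.

```python
from collections import Counter

# Assumption: a tiny hand-picked stopword set standing in for NLTK's English list.
toy_stop = {"to", "in", "for", "the", "of", "on"}

# Hypothetical headlines used only for this illustration.
toy_headlines = [
    "police to appeal to court in sydney",
    "council to vote in favour for plan",
]

def top_stopwords(headlines, stop_words, k=3):
    """Return the k most frequent stopwords as (word, count) pairs."""
    words = [w for h in headlines for w in h.split()]
    counts = Counter(w for w in words if w in stop_words)
    return counts.most_common(k)

top = top_stopwords(toy_headlines, toy_stop)
words, counts = zip(*top)  # two parallel sequences, ready for a bar chart
print(top)
```

The two sequences can then be handed to a bar chart, for example `sns.barplot(x=list(counts), y=list(words))`.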
We will use the Counter function from the collections library to count and store the occurrences of each word in a list of tuples. This is a very useful function when we deal with word-level analysis in natural language processing.

    from collections import Counter
    import seaborn as sns

    counter = Counter(corpus)
    most = counter.most_common()

    x, y = [], []
    for word, count in most[:40]:
        if word not in stop:
            x.append(word)
            y.append(count)

    sns.barplot(x=y, y=x)

Wow! "us", "iraq" and "war" dominate the headlines during this period.

Here "us" could mean either the USA or us (you and me). "us" is not a stopword, but when we observe other words in the graph they are all related to the US-Iraq war, so "us" here probably indicates the USA.

Ngram exploration

Ngrams are simply contiguous sequences of n words. For example "river bank", "The three musketeers" etc. If the number of words is two, it is called a bigram. For 3 words it is called a trigram and so on.

Looking at the most frequent n-grams can give you a better understanding of the context in which the word was used.

To implement n-grams we will use the ngrams function from nltk.util. For example:

    from nltk.util import ngrams
    list(ngrams(['I', 'went', 'to', 'the', 'river', 'bank'], 2))

Now that we know how to create n-grams let's visualize them.

To build a representation of our vocabulary we will use CountVectorizer. CountVectorizer is a simple method used to tokenize, vectorize and represent the corpus in an appropriate form. It is available in sklearn.feature_extraction.text.

So with all this, we will analyze the top bigrams in our news headlines.
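Before handing the job to a vectorizer, it may help to see what n-gram counting boils down to. The sketch below is a dependency-free approximation: it splits on whitespace only, so its tokenization will not match CountVectorizer exactly, and `count_ngrams` and the toy headlines are hypothetical.

```python
from collections import Counter

def count_ngrams(headlines, n=2):
    """Count contiguous n-word sequences across all headlines."""
    counts = Counter()
    for headline in headlines:
        tokens = headline.lower().split()
        # zipping n shifted copies of the token list yields the n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

# Hypothetical headlines used only for this illustration.
toy = [
    "anti war protest in sydney",
    "anti war rally planned",
    "war protest grows",
]

bigrams = count_ngrams(toy, 2)
trigrams = count_ngrams(toy, 3)
print(bigrams.most_common(2))
```

Each key is a tuple of words, so `" ".join(key)` turns it back into a readable label for plotting.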
The function below returns the most frequent n-grams of a given size:

    from sklearn.feature_extraction.text import CountVectorizer

    def get_top_ngram(corpus, n=None):
        # learn the n-gram vocabulary and count occurrences per document
        vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
        bag_of_words = vec.transform(corpus)
        # total count of each n-gram over the whole corpus
        sum_words = bag_of_words.sum(axis=0)
        words_freq = [(word, sum_words[0, idx])
                      for word, idx in vec.vocabulary_.items()]
        words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
        return words_freq[:10]

and

    top_n_bigrams = get_top_ngram(news['headline_text'], 2)[:10]
    x, y = map(list, zip(*top_n_bigrams))
    sns.barplot(x=y, y=x)

We can observe that bigrams such as "anti war" and "killed in" that are related to war dominate the news headlines.

How about trigrams?

    top_tri_grams = get_top_ngram(news['headline_text'], n=3)
    x, y = map(list, zip(*top_tri_grams))
    sns.barplot(x=y, y=x)

We can see that many of these trigrams are some combinations of "to face court" and "anti war protest". It means that we should put some effort into data cleaning and see if we are able to combine those synonym terms into one clean token.

Topic Modeling exploration with pyLDAvis

Topic modeling is the process of using unsupervised learning techniques to extract the main topics that occur in a collection of documents.

Latent Dirichlet Allocation (LDA) is an easy to use and efficient model for topic modeling. Each document is represented by the distribution of topics and each topic is represented by the distribution of words.

Once we categorize our documents in topics we can dig into further data exploration for each topic or topic group.

But before getting into topic modeling we have to pre-process our data a little. We will:

tokenize: the process by which sentences are converted to a list of tokens or words.
remove stopwords
lemmatize: reduces the inflectional forms of each word into a common base or root.
convert to the bag of words: Bag of words is a dictionary where the keys are words (or ngrams/tokens) and values are the number of times each word occurs in the corpus.

With NLTK you can tokenize and lemmatize easily:

    import nltk
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    # newer NLTK releases may ask for the 'punkt_tab' resource instead of 'punkt'
    nltk.download('punkt')
    nltk.download('wordnet')

    def preprocess_news(df):
        corpus = []
        lem = WordNetLemmatizer()
        for news in df['headline_text']:
            # tokenize and drop stopwords
            words = [w for w in word_tokenize(news) if (w not in stop)]
            # lemmatize and drop very short tokens
            words = [lem.lemmatize(w) for w in words if len(w) > 2]
            corpus.append(words)
        return corpus

    corpus = preprocess_news(news)

Now, let's create the bag of words model using gensim

    import gensim

    dic = gensim.corpora.Dictionary(corpus)
    bow_corpus = [dic.doc2bow(doc) for doc in corpus]

and we can finally create the LDA model:

    lda_model = gensim.models.LdaMulticore(bow_corpus,
                                           num_topics=4,
                                           id2word=dic,
                                           passes=10,
                                           workers=2)
    lda_model.show_topics()

Topic 0 indicates something related to the Iraq war and police. Topic 3 shows the involvement of Australia in the Iraq war.

You can print all the topics and try to make sense of them but there are tools that can help you run this data exploration more efficiently. One such tool is pyLDAvis which visualizes the results of LDA interactively.

    import pyLDAvis
    # recent pyLDAvis releases expose this module as pyLDAvis.gensim_models
    import pyLDAvis.gensim

    pyLDAvis.enable_notebook()
    vis = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dic)
    vis

On the left side, the area of each circle represents the importance of the topic relative to the corpus. As there are four topics, we have four circles.

The distance between the centers of the circles indicates the similarity between the topics. Here you can see that topic 3 and topic 4 overlap, which indicates that these topics are more similar.

On the right side, the histogram of each topic shows the top 30 relevant words. For example, in topic 1 the most relevant words are police, new, may, war, etc.

So in our case, we can see a lot of words and topics associated with war in the news headlines.

Wordcloud

Wordcloud is a great way to represent text data.
The size and color of each word that appears in the wordcloud indicate its frequency or importance.

Creating a wordcloud in python with wordcloud is easy but we need the data in the form of a corpus. Luckily, I prepared it in the previous section.

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud, STOPWORDS

    stopwords = set(STOPWORDS)

    def show_wordcloud(data):
        wordcloud = WordCloud(
            background_color='white',
            stopwords=stopwords,
            max_words=100,
            max_font_size=30,
            scale=3,
            random_state=1)

        wordcloud = wordcloud.generate(str(data))

        fig = plt.figure(1, figsize=(12, 12))
        plt.axis('off')

        plt.imshow(wordcloud)
        plt.show()

    show_wordcloud(corpus)

Again, you can see that the terms associated with the war are highlighted which indicates that these words occurred frequently in the news headlines.

There are many parameters that can be adjusted. Some of the most prominent ones are:

stopwords: The set of words that are blocked from appearing in the image.
max_words: Indicates the maximum number of words to be displayed.
max_font_size: maximum font size.

There are many more options to create beautiful word clouds. For more details, you can refer to the wordcloud documentation.

Sentiment analysis

Sentiment analysis is a very common natural language processing task in which we determine if the text is positive, negative or neutral. This is very useful for finding the sentiment associated with reviews and comments, which can get us some valuable insights out of text data.

There are many projects that will help you do sentiment analysis in python. I personally like TextBlob and Vader Sentiment.

Textblob

Textblob is a python library built on top of nltk. It has been around for some time and is very easy and convenient to use.

The sentiment function of TextBlob returns two properties:

polarity: is a floating-point number that lies in the range of [-1,1] where 1 means a positive statement and -1 means a negative statement.
subjectivity: refers to how someone's judgment is shaped by personal opinions and feelings. Subjectivity is represented as a floating-point value which lies in the range of [0,1].

I will run this function on our news headlines.

    from textblob import TextBlob
    TextBlob('100 people killed in Iraq').sentiment

TextBlob claims that the text "100 people killed in Iraq" is negative and is not an opinion or feeling but rather a factual statement. I think we can agree with TextBlob here.

Now that we know how to calculate those sentiment scores we can visualize them using a histogram and explore data even further.

    def polarity(text):
        return TextBlob(text).sentiment.polarity

    news['polarity_score'] = news['headline_text'].\
        apply(lambda x: polarity(x))
    news['polarity_score'].hist()

You can see that the polarity mainly ranges between 0.00 and 0.20. This indicates that the majority of the news headlines are neutral.

Let's dig a bit deeper by classifying the news as negative, positive and neutral based on the scores.

    def sentiment(x):
        if x < 0:
            return 'neg'
        elif x == 0:
            return 'neu'
        else:
            return 'pos'

    news['polarity'] = news['polarity_score'].\
        map(lambda x: sentiment(x))

    plt.bar(news.polarity.value_counts().index,
            news.polarity.value_counts())

Yep, 70% of the news is neutral, with only 18% positive and 11% negative.

Let's take a look at some of the positive and negative headlines.

    news[news['polarity'] == 'pos']['headline_text'].head()

Positive news headlines are mostly about some victory in sports.

    news[news['polarity'] == 'neg']['headline_text'].head()

Yep, pretty negative news headlines indeed.

Vader Sentiment Analysis

The next library we are going to discuss is VADER. Vader works better in detecting negative sentiment. It is very useful in the case of social media text sentiment analysis.

VADER or Valence Aware Dictionary and Sentiment Reasoner is a rule/lexicon-based, open-source sentiment analyzer pre-built library, protected under the MIT license.

The VADER sentiment analysis class returns a dictionary that contains the proportions of the text rated negative, neutral and positive, plus a normalized compound score. Then we can pick the sentiment with the highest of the three proportions.

We will do the same analysis using VADER and check if there is much difference.

    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download('vader_lexicon')
    sid = SentimentIntensityAnalyzer()

    def get_vader_score(sent):
        # Polarity score returns dictionary
        ss = sid.polarity_scores(sent)
        #return ss
        # drop the last entry (compound) and take the largest of neg/neu/pos
        return np.argmax(list(ss.values())[:-1])

    news['polarity'] = news['headline_text'].\
        map(lambda x: get_vader_score(x))
    polarity = news['polarity'].replace({0: 'neg', 1: 'neu', 2: 'pos'})

    plt.bar(polarity.value_counts().index,
            polarity.value_counts())

Yep, there is a slight difference in distribution. Even more headlines are classified as neutral (85%) and the number of negative news headlines has increased (to 13%).

Named Entity Recognition

Named entity recognition is an information extraction method in which entities that are present in the text are classified into predefined entity types like "Person", "Place", "Organization", etc. By using NER we can get great insights about the types of entities present in the given text dataset.

For example, given a news article that mentions them, the named entity recognition model should be able to identify entities such as RBI as an organization, Mumbai and India as places, etc.

There are three standard libraries to do Named Entity Recognition: Stanford NER, spaCy and NLTK.

In this tutorial, I will use spaCy which is an open-source library for advanced natural language processing tasks. It is written in Cython and is known for its industrial applications. Besides NER, spaCy provides many other functionalities like POS tagging, word to vector transformation, etc.
SpaCy's named entity recognition has been trained on the OntoNotes 5 corpus and it supports entity types such as PERSON, ORG, GPE and DATE.

There are three pre-trained models for English in spaCy. I will use en_core_web_sm for our task but you can try other models. To use it we have to download it first:

    python -m spacy download en_core_web_sm

Now we can initialize the language model:

    import spacy
    nlp = spacy.load("en_core_web_sm")

One of the nice things about spaCy is that we only need to apply the nlp function once; the entire background pipeline will return the objects we need.

    doc = nlp('India and Iran have agreed to boost the economic viability \
    of the strategic Chabahar port through various measures, \
    including larger subsidies to merchant shipping firms using the facility, \
    people familiar with the development said on Thursday.')

    [(x.text, x.label_) for x in doc.ents]

We can see that India and Iran are recognized as geopolitical entities (GPE), Chabahar as Person and Thursday as Date.

We can also visualize the output using the displacy module in spaCy.

    from spacy import displacy
    displacy.render(doc, style='ent')

This creates a very neat visualization of the sentence with the recognized entities where each entity type is marked in different colors.

Now that we know how to perform NER we can explore the data even further by doing a variety of visualizations on the named entities extracted from our dataset.

First, we will run the named entity recognition on our news headlines and store the entity types.

    def ner(text):
        doc = nlp(text)
        return [X.label_ for X in doc.ents]

    ent = news['headline_text'].\
        apply(lambda x: ner(x))
    # flatten the per-headline lists into one list of labels
    ent = [x for sub in ent for x in sub]

    counter = Counter(ent)
    count = counter.most_common()

Now, we can visualize the entity frequencies:

    x, y = map(list, zip(*count))
    sns.barplot(x=y, y=x)

Now we can see that GPE and ORG dominate the news headlines followed by the PERSON entity.
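The flatten-and-count step is easy to try without loading a spaCy model. Here is a minimal sketch that assumes the per-headline labels have already been extracted; `toy_labels` is made-up data shaped like the output of ner(), and `label_frequencies` is a hypothetical helper.

```python
from collections import Counter
from itertools import chain

# Hypothetical per-headline entity labels, shaped like the output of ner().
toy_labels = [["GPE", "ORG"], [], ["GPE"], ["PERSON", "GPE"]]

def label_frequencies(labels_per_doc):
    """Flatten the nested label lists and count each entity type."""
    return Counter(chain.from_iterable(labels_per_doc)).most_common()

freq = label_frequencies(toy_labels)
# split the (label, count) pairs into two lists for plotting
labels, counts = map(list, zip(*freq))
print(freq)
```

Headlines without any entities simply contribute an empty list, so they do not affect the counts.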
We can also visualize the most common tokens per entity. Let's check which places appear the most in news headlines.

    def ner(text, ent="GPE"):
        doc = nlp(text)
        return [X.text for X in doc.ents if X.label_ == ent]

    gpe = news['headline_text'].apply(lambda x: ner(x))
    gpe = [i for x in gpe for i in x]
    counter = Counter(gpe)

    x, y = map(list, zip(*counter.most_common(10)))
    sns.barplot(x=y, y=x)

I think we can confirm the fact that "us" means the USA in news headlines.

Let's also find the most common names that appeared in news headlines.

    per = news['headline_text'].apply(lambda x: ner(x, "PERSON"))
    per = [i for x in per for i in x]
    counter = Counter(per)

    x, y = map(list, zip(*counter.most_common(10)))
    sns.barplot(x=y, y=x)

Saddam Hussein and George Bush were the presidents of Iraq and the USA during wartime. Also, we can see that the model is far from perfect, classifying "vic govt" or "nsw govt" as a person rather than a government agency.

Exploration through Parts of Speech Tagging in python

Parts of speech (POS) tagging is a method that assigns part of speech labels to words in a sentence. There are eight main parts of speech:

Noun (NN)- Joseph, London, table, cat, teacher, pen, city
Verb (VB)- read, speak, run, eat, play, live, walk, have, like, are, is
Adjective (JJ)- beautiful, happy, sad, young, fun, three
Adverb (RB)- slowly, quietly, very, always, never, too, well, tomorrow
Preposition (IN)- at, on, in, from, with, near, between, about, under
Conjunction (CC)- and, or, but, because, so, yet, unless, since, if
Pronoun (PRP)- I, you, we, they, he, she, it, me, us, them, him, her, this
Interjection (UH)- Ouch! Wow! Great! Help! Oh! Hey! Hi!

This is not a straightforward task, as the same word may be used in different sentences in different contexts. However, once you do it, there are a lot of helpful visualizations that you can create that can give you additional insights into your dataset.
I will use nltk to do the parts of speech tagging but there are other libraries that do a good job (spacy, textblob).

Let's look at an example.

    import nltk
    from nltk.tokenize import word_tokenize

    # tagger model used by nltk.pos_tag (the resource name may differ across NLTK versions)
    nltk.download('averaged_perceptron_tagger')

    sentence = "The greatest comeback stories in 2019"
    tokens = word_tokenize(sentence)
    nltk.pos_tag(tokens)

Note: You can also visualize the sentence parts of speech and its dependency graph with the spacy.displacy module.

    doc = nlp('The greatest comeback stories in 2019')
    displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

We can observe various dependency tags here. For example, the det tag denotes the relationship between the determiner "the" and the noun "stories".

You can check the list of dependency tags and their meanings in the spaCy documentation.

Ok, now that we know what POS tagging is, let's use it to explore our headlines dataset.

    def pos(text):
        # keep only the tag from each (word, tag) pair
        return [tag for _, tag in nltk.pos_tag(word_tokenize(text))]

    tags = news['headline_text'].apply(lambda x: pos(x))
    tags = [x for l in tags for x in l]
    counter = Counter(tags)

    x, y = list(map(list, zip(*counter.most_common(7))))
    sns.barplot(x=y, y=x)

We can clearly see that the noun (NN) dominates in news headlines followed by the adjective (JJ). This is typical for news articles, while for artistic forms a higher adjective (JJ) frequency could happen quite a lot.

You can dig deeper into this by investigating which singular nouns occur most commonly in news headlines. Let us find out.

    def get_nouns(text):
        nouns = []
        pos = nltk.pos_tag(word_tokenize(text))
        for word, tag in pos:
            if tag == 'NN':
                nouns.append(word)
        return nouns

    words = news['headline_text'].apply(lambda x: get_nouns(x))
    words = [x for l in words for x in l]
    counter = Counter(words)

    x, y = list(map(list, zip(*counter.most_common(7))))
    sns.barplot(x=y, y=x)

Nouns such as "war", "iraq", "man" dominate in the news headlines. You can visualize and examine other parts of speech using the above function by changing the tag.

Exploring through text complexity

It can be very informative to know how readable (difficult to read) the text is and what type of reader can fully understand it. Do we need a college degree to understand the message or can a first-grader clearly see what the point is?

You can actually put a number called a readability index on a document or text. A readability index is a numeric value that indicates how difficult (or easy) it is to read and understand a text.

There are many readability score formulas available for the English language. One of the most prominent ones is the Flesch Reading Ease index.

Textstat is a cool Python library that provides an implementation of these text statistics calculation methods. Let's use Textstat to implement the Flesch Reading Ease index.

Now, you can plot a histogram of the scores and visualize the output.

    from textstat import flesch_reading_ease

    reading = news['headline_text'].\
        apply(lambda x: flesch_reading_ease(x))
    reading.hist()

Almost all of the readability scores fall above 60. On the Flesch scale, scores of 60 and above correspond to text that 13- to 15-year-old students can easily understand, so the news headlines are easy to read. Let's check all news headlines that have a readability score below 5.

    x = [i for i in range(len(reading)) if reading[i] < 5]
    news.iloc[x]['headline_text'].head()

You can see some of the complex words being used in news headlines like "capitulation", "interim", "entrapment" etc. These words may have caused the scores to fall under 5.

Final Thoughts

In this article, we discussed and implemented various exploratory data analysis methods for text data. Some are common, some lesser-known, but all of them could be a great addition to your data exploration toolkit.

Hopefully, you will find some of them useful in your current and future projects.

To make data exploration even easier, I have created an "Exploratory Data Analysis for Natural Language Processing Template" that you can use for your work.

Also, for every chart in this article, there is a code snippet that creates it.

Happy exploring!