Data crawler engineer helping beginners learn programming.
I am not a big fan of Donald Trump. In fact, I don’t like him at all. However, he has a charismatic, sensational effect that keeps him in newspapers and on social media all the time. People’s attitudes towards him are dramatic and polarized. The words used to describe him are either highly positive or highly negative, which makes them perfect material for text mining and sentiment analysis. The goal of this workshop is to use a web scraper to pull tweets about Donald Trump, then use a combination of text mining and visualization techniques to analyze the public voice about him.
This workshop is easy to follow. Even if you don’t know anything about programming, you should feel comfortable as you read this article. Feel free to copy the code and try it yourself. If you are a beginner, I recommend writing your own code first and then comparing it with the code in this workshop.
Let’s start with web scraping. I need an effective web scraper tool to do all the boring work for me. Any web scraper tool would work. I recommend Octoparse since it is free with no limitation on the number of pages.
I downloaded it from its official website and finished registration by following the instructions. After I logged in, I opened the built-in Twitter template.
Octoparse Scraping Templates
The scraping rule on a template is pre-set with data extraction fields such as Name, ID, Content, Comments, and so on.
I entered “Donald Trump” in the parameter field to tell the crawler the keyword. It was as simple as it seemed, and I got about 10k tweets. You can scrape as many tweets as you like. There are also other ways to crawl the data, and you can probably get a better result than mine. You are welcome to share your crawling experience with me. I am always a passionate learner :)
After getting the tweets, export the data as a text file and name it “data.txt”.
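Reading the export back into Python could look like the minimal sketch below. It assumes that “data.txt” holds one tweet per line and is UTF-8 encoded, which depends on how the export was configured. The helper names `non_empty_tweets` and `read_tweets` and the sample lines are made up for illustration.

```python
def non_empty_tweets(lines):
    """Strip whitespace from each line and drop lines that end up empty."""
    stripped = (line.strip() for line in lines)
    return [tweet for tweet in stripped if tweet]


def read_tweets(path, encoding="utf-8"):
    """Read one tweet per line from a text export (assumed layout)."""
    with open(path, encoding=encoding) as handle:
        return non_empty_tweets(handle)


# Made-up lines standing in for the contents of data.txt.
sample_lines = ["Trump is great!\n", "\n", "   \n", "So many lies...\n"]
tweets = non_empty_tweets(sample_lines)
print(len(tweets), "tweets kept")
```

With a real export, `read_tweets("data.txt")` would return the same kind of list. Dropping blank lines early matters because tweets without text cannot contribute any opinion words.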
Then we use two opinion word lists to analyze the scraped tweets. You can download them from here. The two lists contain positive and negative words (sentiment words) compiled by Minqing Hu and Bing Liu from their research on opinion words in social media.
The idea is to take each opinion word from the lists, go back to the tweets, and count how often that word appears. The result is the set of opinion words found in the tweets, each with its count.
First, I created a positive list and a negative list (line 5 and line 13 of the code) from the two downloaded word lists. They store all the words parsed from the text files.
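Loading the two word lists might be sketched as follows. This is not the code from the gist. It assumes each file holds one word per line, that header comment lines start with a semicolon, and that the files are named “positive-words.txt” and “negative-words.txt” with Latin-1 encoding. Check your download before relying on any of that. `parse_word_list` and `load_word_list` are hypothetical helper names.

```python
def parse_word_list(lines):
    """Collect the words from a word list, skipping blanks and ';' comments."""
    words = set()
    for line in lines:
        word = line.strip().lower()
        if word and not word.startswith(";"):
            words.add(word)
    return words


def load_word_list(path, encoding="latin-1"):
    """Parse a word-list file; the encoding default is an assumption."""
    with open(path, encoding=encoding) as handle:
        return parse_word_list(handle)


# Made-up lines imitating the assumed file layout.
sample_file = ["; header comment\n", "\n", "great\n", "Cool\n", "wow\n"]
positive_words = parse_word_list(sample_file)
print(sorted(positive_words))
```

On the real files this would be `load_word_list("positive-words.txt")` and `load_word_list("negative-words.txt")`. A set is used here instead of a list because checking whether a token is an opinion word is then a constant-time lookup.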
Then I cleaned the text by removing all punctuation, symbols, and numbers.
As a result, the data consisted only of tokenized words, which makes it easier to analyze. (This is the blog I found useful about text preprocessing in data science.)
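One way to do this kind of cleanup is sketched below. It is a simplified stand-in, not the original preprocessing code: it lowercases the text, replaces every character that is not a letter or whitespace with a space, and splits on whitespace. The function name `tokenize` and the sample sentence are made up.

```python
import re


def tokenize(text):
    """Lowercase text, blank out non-letters, and split into words."""
    lowered = text.lower()
    # Replace punctuation, symbols, and digits with spaces so that
    # neighbouring words do not get glued together.
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    return letters_only.split()


tokens = tokenize("Wow!!! That rally was GREAT, 100% #cool")
print(tokens)
```

One side effect of this approach is that hyphenated or apostrophized words are split into pieces, so any opinion word containing a hyphen would no longer match after tokenizing.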
Afterward, create three dictionaries: word_count_dict, word_count_positive, and word_count_negative.
Next, I filled each dictionary. If an opinion word exists in the data, count it by increasing its value in word_count_dict by 1.
After counting, we need to decide whether a word is positive or negative. If it is a positive word, its value in word_count_positive increases by 1; otherwise the positive dictionary stays the same. Likewise, word_count_negative either increases its value or stays the same. If the word is in neither the positive nor the negative list, it is skipped.
For a complete version of the code, you can download it here (https://gist.github.com/octoparse/fd9e0006794754edfbdaea86de5b1a51).
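The counting logic described above can be sketched like this. The three dictionary names come from the article; the function name `count_opinion_words`, the tiny word sets, and the sample tokens are made up, and the complete version in the gist may be organized differently.

```python
def count_opinion_words(tokens, positive_words, negative_words):
    """Count opinion words overall and per polarity."""
    word_count_dict = {}
    word_count_positive = {}
    word_count_negative = {}
    for word in tokens:
        if word in positive_words:
            # A word found in both lists would be counted as positive here.
            word_count_dict[word] = word_count_dict.get(word, 0) + 1
            word_count_positive[word] = word_count_positive.get(word, 0) + 1
        elif word in negative_words:
            word_count_dict[word] = word_count_dict.get(word, 0) + 1
            word_count_negative[word] = word_count_negative.get(word, 0) + 1
        # Words in neither list are skipped.
    return word_count_dict, word_count_positive, word_count_negative


# Toy inputs, not the real lexicon or the scraped tweets.
sample_tokens = ["great", "illegal", "great", "the", "lies"]
total, positive, negative = count_opinion_words(
    sample_tokens, {"great"}, {"illegal", "lies"}
)
print(total, positive, negative)
```

Using `dict.get(word, 0)` avoids having to pre-fill the dictionaries with every word in the lists, so only opinion words that actually occur in the tweets end up as keys.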
5352 negative words and 3894 positive words
As the graph shows, the use of positive words is narrow. Only 404 distinct positive words are used.
The most frequent are, for example, “like”, “great” and “right”. Most word choices are basic and colloquial, like “wow” and “cool”. The use of negative words, by contrast, is much more varied. There are 809 distinct negative words, and most of them are formal and advanced. The most frequently used are “illegal”, “lies”, and “racist”. Other advanced words such as “delinquent”, “inflammatory” and “hypocrites” are also present.
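Figures like these (number of distinct words, total occurrences, most frequent words) can be read off a word-count dictionary. The sketch below uses `collections.Counter` from the standard library. The function name `summarize` is hypothetical and the counts are toy numbers, not the results reported in this article.

```python
from collections import Counter


def summarize(word_counts, top_n=3):
    """Return distinct-word count, total occurrences, and the top words."""
    counts = Counter(word_counts)
    return {
        "distinct": len(counts),
        "total": sum(counts.values()),
        "top": counts.most_common(top_n),
    }


# Toy counts for illustration only.
toy_positive = {"like": 5, "great": 3, "right": 2, "wow": 1}
summary = summarize(toy_positive, top_n=2)
print(summary)
```

Running the same function on word_count_positive and word_count_negative would give the two sides of the comparison: how many different words each side uses, and which ones dominate.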
The choice of words suggests that those who support him have a lower level of education than those who disapprove. Apparently, Donald Trump is not so welcome among Twitter users.
In this article, we talked about how to scrape tweets from Twitter using Octoparse. We also discussed text mining and sentiment analysis using Python.
There are some limitations to this research. I scraped 15K tweets. However, 5K of them either had no text content or contained no opinion words. As a result, the sentiment analysis is open to debate. Also, the analysis in this article only looked at polarized opinions (either negative or positive). A fine-grained sentiment analysis that grades opinions by degree (very positive, positive, neutral, negative, very negative) would be more precise.
Finally, I would love to share some thoughts on the result. The word “illegal” is the top negative word associated with Donald Trump. It’s not surprising that it ranks number one, because Donald Trump has focused on immigration since taking office. However, I am amazed by how freely people misuse this word. I pulled out the tweets containing it, and most of them say “illegal immigrants” and “illegal aliens.” That got me thinking: since when is “undocumented” equivalent to “illegal”?
In closing, I want to quote Elie Wiesel, “You who are so-called illegal aliens must know that no human being is illegal. That is a contradiction in terms. The human being can be beautiful or more beautiful, they can be fat or skinny, they can be right or wrong, but illegal? How can a human being be illegal?”