In honor of today being April 20th, I thought it would be interesting to do some NLP on Reddit comments about marijuana (shout out to Yufeng Guo for this idea!). My teammate Felipe Hoffa has conveniently made all Reddit comments available in BigQuery, so I took a subset of those comments and ran them through Google’s Natural Language API.
What did the volume of pot-related comments throughout November look like?
We see a big spike in the number of comments around the election, which makes sense: seven states changed their marijuana laws during the 2016 election.
The Natural Language API’s syntax analysis method tells us the part of speech of each word in a sentence, which makes it easy to get all of the adjectives used in these comments. Which adjectives were used most frequently in conjunction with ‘marijuana’ and ‘cannabis’?
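Pulling out those adjectives takes only a few lines. Here’s a minimal sketch using the google-cloud-language Python client (the client library shown here postdates this analysis, and the sample sentence is illustrative):

```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

def extract_adjectives(text):
    # Ask the NL API for a full syntactic parse of the text.
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    response = client.analyze_syntax(request={"document": document})
    # Keep only the tokens tagged as adjectives.
    return [
        token.text.content
        for token in response.tokens
        if token.part_of_speech.tag == language_v1.PartOfSpeech.Tag.ADJ
    ]

print(extract_adjectives("Medical marijuana is legal in many states."))
# e.g. ['Medical', 'legal', 'many']
```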
I also wondered if the top adjectives used each day changed throughout the month. Making use of BigQuery’s handy window functions (the PARTITION BY inside ROW_NUMBER() below), I used this query to find the top 3 adjectives used each day:
```sql
SELECT * FROM (
  SELECT
    day,
    adjective,
    COUNT(*) c,
    ROW_NUMBER() OVER (PARTITION BY day ORDER BY COUNT(*) DESC) seqnum
  FROM `sara-bigquery.reddit_pot_nlp.adj_day`
  GROUP BY 1, 2
  ORDER BY day, c DESC
)
WHERE seqnum <= 3
ORDER BY day
```
Here’s a snippet of the output from November 5th to the 10th (the 4th column is the number of mentions for a word, and the last column is seqnum from the query above):
We can see that ‘good’, ‘more’, ‘other’, and ‘legal’ were part of the vocabulary throughout the month, but during the election the adjectives ‘medical’ and ‘recreational’ came into play.
Which people, places, and things were mentioned in these comments? The Natural Language API’s analyzeEntities endpoint can tell us! It will extract any known entities from our text, along with their Wikipedia URLs when they exist. Here are the most common entities mentioned in conjunction with marijuana on Reddit:
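Here’s what that call looks like with the Python client — again a sketch, with an illustrative sentence standing in for a real comment:

```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

document = language_v1.Document(
    content="Colorado legalized recreational marijuana in 2012.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
response = client.analyze_entities(request={"document": document})

for entity in response.entities:
    # metadata carries a wikipedia_url key for entities with a Wikipedia page
    wiki = entity.metadata.get("wikipedia_url", "n/a")
    print(entity.name, entity.type_.name, wiki)
```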
Want to do this sort of thing on text that’s not related to marijuana? That’s cool too. This NL analysis can be applied to any sort of text — news articles, customer service feedback, speech transcriptions, and more.
You can try out the Natural Language API directly in the browser with your own text. For example, here are the results of syntax annotation on one of the comments from my dataset above:
If you want to follow the approach I used in this analysis, here are the steps: pull the comments from the public Reddit dataset in BigQuery, run them through the Natural Language API, and write the annotations back to BigQuery for querying.
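Here’s a minimal end-to-end sketch of that pipeline in Python. The Reddit table name, the regex filter, and the adj_day schema are my assumptions, not necessarily what the original pipeline used:

```python
from google.cloud import bigquery, language_v1

bq = bigquery.Client()
nl = language_v1.LanguageServiceClient()

# 1. Pull a sample of pot-related comments from the public Reddit dataset.
comments = bq.query("""
    SELECT body, DATE(TIMESTAMP_SECONDS(created_utc)) AS day
    FROM `fh-bigquery.reddit_comments.2016_11`
    WHERE REGEXP_CONTAINS(LOWER(body), r'marijuana|cannabis')
    LIMIT 100
""").result()

# 2. Run each comment through the NL API and keep the adjectives.
rows_to_insert = []
for comment in comments:
    document = language_v1.Document(
        content=comment.body, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    response = nl.analyze_syntax(request={"document": document})
    for token in response.tokens:
        if token.part_of_speech.tag == language_v1.PartOfSpeech.Tag.ADJ:
            rows_to_insert.append(
                {"day": str(comment.day), "adjective": token.text.content}
            )

# 3. Write the annotations back to BigQuery for querying.
errors = bq.insert_rows_json(
    "sara-bigquery.reddit_pot_nlp.adj_day", rows_to_insert
)
assert not errors, errors
```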
Go forth and process your text!
Machine learning APIs are pretty magical in that they take your data and give you back detailed information about it — all without requiring you to think about what’s happening under the hood.
You can think of it kind of like ordering a pastry: you hand over money and a bakery gives you back a chocolate chip cookie. Much of the time that gets you a delicious cookie, but sometimes you may decide you want white chocolate chips or toffee bits instead. In that case you’re going to have to customize it, which likely means baking it yourself, even if that ruins some of the magic. Once you’ve gotten over this fear, you can either use a recipe someone else has written (BYO ingredients) or live dangerously and write your own.
The machine learning equivalent of making your cookie from scratch is building and training a model on your own data. I’ve recently started exploring this side of machine learning by looking at word2vec: a model for learning the relationships between words in a dataset. As the name implies, it takes words as input and outputs word vectors: vector representations of the words in a body of text.
Using the word2vec sample on GitHub and this tutorial, I fed the model a subset of the Reddit comment data (~55k words, 8k unique) to generate a visualization of word embeddings:
Word embeddings map each word in a large text dataset to a vector, positioning words that appear in similar contexts near one another. We can see in the visualization above that the model did a pretty good job clustering related words together: ‘rights’, ‘laws’, and ‘legalization’ are grouped together, as are ‘marijuana’ and ‘cannabis’.
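The visualization above comes from the TensorFlow word2vec sample, but if you just want to poke at embeddings, gensim’s word2vec implementation (a different library, shown here as a quick alternative) trains in a few lines:

```python
from gensim.models import Word2Vec

# Toy corpus standing in for the tokenized Reddit comments.
sentences = [
    ["medical", "marijuana", "is", "legal", "in", "many", "states"],
    ["cannabis", "legalization", "was", "on", "the", "ballot"],
    ["recreational", "cannabis", "laws", "changed", "after", "the", "election"],
]

# sg=1 selects the skip-gram architecture, the same one the TensorFlow
# sample implements; min_count=1 keeps every word in this tiny corpus.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)

# Words nearest to 'cannabis' in the learned embedding space.
print(model.wv.most_similar("cannabis", topn=3))
```

On a real corpus you’d raise min_count to filter out rare words and bump vector_size to give the model more capacity.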
Next steps? I could use this trained model to predict the topic of a comment or generate a new comment. I’m just getting started with custom models, so I’d love to hear your feedback and suggestions! Leave a comment or find me (Sara Robinson) on Twitter.
My colleague Allen Day has made genomics data for 1000 cannabis strains available in BigQuery. Details here, or head over to BigQuery to check out the public dataset.