It’s officially winter, so what could be better than drinking hot chocolate while querying the new Stack Overflow dataset in BigQuery? It has every Stack Overflow question, answer, comment, and more — which means endless possibilities of data crunching. Inspired by Felipe Hoffa’s post on how response time varies by tag, I wanted to look at the comments table (53 million rows!).
To measure happy comments I looked at comments with “thank you”, “thanks”, “awesome” or “:)” in the body. I limited the analysis to tags with more than 500,000 comments. Here’s the query:
#standardSQL
SELECT
  tag,
  ROUND((COUNT(case
    when comment_text like '%thanks%'
      or comment_text like '%:)%'
      or comment_text like '%thank you%'
      or comment_text like '%awesome%'
    then 1 end) / COUNT(*)) * 100, 2) as percent_happy,
  COUNT(*) total_comments
FROM (
  SELECT
    LOWER(a.text) as comment_text,
    SPLIT(b.tags, '|') as tags
  FROM `bigquery-public-data.stackoverflow.comments` a
  JOIN `bigquery-public-data.stackoverflow.posts_questions` b
  ON a.post_id = b.id
  UNION ALL
  SELECT
    LOWER(b.text) as comment_text,
    SPLIT(c.tags, '|') as tags
  FROM `bigquery-public-data.stackoverflow.posts_answers` a
  JOIN (SELECT post_id, text FROM `bigquery-public-data.stackoverflow.comments`) b
  ON a.id = b.post_id
  JOIN `bigquery-public-data.stackoverflow.posts_questions` c
  ON c.id = a.parent_id
), UNNEST(tags) tag
GROUP BY 1
HAVING total_comments > 500000
ORDER BY percent_happy DESC
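The only part of this query that depends on the search terms is the condition inside the CASE expression. As a rough sketch (not part of the original analysis), a small Python helper could generate that condition from a list of terms, which makes it easy to try other word lists. The function name `like_condition` and the `HAPPY_TERMS` list are hypothetical names introduced here, and the helper assumes the terms contain no single quotes or LIKE wildcards (`%`, `_`).

```python
# Hypothetical helper: builds the OR-joined LIKE condition used in the query's
# CASE expression. Assumes terms contain no single quotes, '%' or '_'.
HAPPY_TERMS = ["thanks", ":)", "thank you", "awesome"]


def like_condition(terms, column="comment_text"):
    """Return e.g. "comment_text like '%a%' or comment_text like '%b%'"."""
    # The query lowercases the comment text, so lowercase the terms as well.
    clauses = [f"{column} like '%{term.lower()}%'" for term in terms]
    return " or ".join(clauses)


if __name__ == "__main__":
    # Paste the printed condition between "case when" and "then 1 end".
    print(like_condition(HAPPY_TERMS))
```

Passing a different list of terms produces the matching condition for another sentiment without touching the rest of the query.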
Here’s the result in BigQuery:
And the chart:
R, Ruby, HTML / CSS, and iOS are the communities with the happiest commenters according to this list. People who ask questions about XML and regular expressions also seem particularly thankful for help. If you’re curious, here are the 15 highest-scoring happy comments that were short enough to fit in a screenshot (and their associated tags):
But because people sometimes get angry on the internet, you’re probably wondering…
For angry comments, I counted those with “wrong”, “horrible”, “stupid”, or “:(” in the body. The SQL is the same as above with the search terms swapped out. Here’s the result:
And the chart:
Clearly the angriest comments are those related to C derivatives. Many programming concepts also wound up here: multithreading, arrays, algorithms, and strings. And here are the highest-scoring angry comments:
This analysis is not perfect, as the comment “that one’s so stupid it underflows and becomes awesome” appears in both lists. That’s where a machine learning tool like the Natural Language API would come in handy.
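To see why a comment can land in both lists, here is a minimal Python sketch of the same substring matching the query performs with LIKE. It is an illustration only, not the code used for the analysis; the names `matches_any` and `label` are hypothetical, and the term lists are copied from the post.

```python
# Substring matching equivalent to the query's "like '%term%'" checks.
HAPPY_TERMS = ("thank you", "thanks", "awesome", ":)")
ANGRY_TERMS = ("wrong", "horrible", "stupid", ":(")


def matches_any(text, terms):
    """True if the lowercased text contains at least one of the terms."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def label(text):
    """Flag a comment as happy and/or angry; the two flags are independent."""
    return {
        "happy": matches_any(text, HAPPY_TERMS),
        "angry": matches_any(text, ANGRY_TERMS),
    }


if __name__ == "__main__":
    comment = "that one's so stupid it underflows and becomes awesome"
    print(label(comment))  # both flags are True for this comment
```

Because each list is checked independently, any comment containing a term from each list is counted twice, once as happy and once as angry.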
Between the two lists there were only a few tag overlaps. The most excitable tags (I’m interpreting tags that showed up in both the happy and angry lists as ‘excitable’) are ios, iphone, objective-c, and regex. And while the internet may seem like a dark place sometimes, there appear to be roughly six happy comments for every angry one.
Dive into the Stack Overflow dataset, or check out some of these awesome posts to get inspired:
If you have comments or ideas for future analysis, find me on Twitter @SRobTweets.