Using Machine Learning to Visualize Customer Preferences

by John Coogan, July 28th, 2017

With a case study on the Hyperloop launch announcement

In 2012, I saw the most amazing visualization in the New York Times. It was created by one of my favorite engineers, Mike Bostock (https://bost.ocks.org/mike/), and his team of data visualization specialists to give readers a deeper look into the most polarizing issues of Obama’s reelection campaign.

When we launched Soylent, we were flooded with comments on Hacker News, Twitter, and Reddit. There was so much feedback, both positive and negative, we actually couldn’t read through every comment. Faced with the problem of having too much customer feedback (an amazing problem to have), I built a solution that let us highlight the most important issues to both our supporters and detractors. The analysis generalizes pretty well, so I’m going to outline how I designed the system, and then use Elon Musk’s Hyperloop announcement and the discussion associated with it as an example.

How the NYTimes visualization actually works:

A comparison of how often speakers at the two presidential nominating conventions used different words and phrases, based on an analysis of transcripts from the Federal News Service.

Basically, the bigger the bubble in the NYT visualization, the more times that word was mentioned in speeches, just like a traditional word cloud. The NYT visualization is different, though, because it actually gives you a useful perspective on where an issue sits between the two parties. Normal word clouds just highlight popular topics and don’t really teach you anything.

Example of a useless word cloud:

This word cloud really teaches you nothing about a topic except the buzzwords that are broadly associated with it. The different colors don’t actually correlate to any information. There are many great articles about the pitfalls of standard word clouds and how they can be misleading. I’m partial to this one, in which a New York Times senior software architect describes them as the “mullets of the Internet.”

How a “weighted” word cloud is different:

The weighted word cloud shown in the NYT example is different because it shows you not only which concepts are popular, but also where they sit between the two sides of an issue (here, political stance). This is relevant to customer feedback analysis because the feedback you receive about your product falls on a continuum similar to the political spectrum. Just as Democrats and Republicans highlight different policy topics in their convention speeches, supporters and detractors highlight different product features in their comments. When launching a new product, especially one that is receiving a mixed response or has sparked controversy, it is incredibly important to treat the feedback of your supporters differently from that of your detractors, and a weighted word cloud helps you do just that. Using machine learning / natural language processing, we can automatically classify comments as positive, negative, or neutral with sentiment analysis and then find exactly what issue is at the heart of each comment using entity resolution. If you’d like to learn more about sentiment analysis and how it works, Walaa Medhat, Ahmed Hassan, and Hoda Korashy wrote a great overview of the field as it stood in 2014.

Building an automated system to do this analysis:

I thought this would be a cool way to look at the data available online about various products, so I built an automation around this type of analysis using the HackerNews API, Google’s Natural Language API, and D3.js. All of this is wired together using Python. I would like to build a web interface to this and make it publicly available, but I’m a bit worried about the cost of all the API calls if people start using it frequently.

The HackerNews API is hosted on Firebase and documented on GitHub. It allows us to pull all the comments about a story very easily. Once we have these comments, we can use Google’s Cloud Natural Language API for entity resolution and sentiment analysis. Once we have aggregated all of the most frequently mentioned entities and their associated sentiment scores, plugging that data into a weighted word-cloud is trivial.

The code essentially does the following:

  1. Use the HackerNews ID to get comment IDs from the HackerNews API.
  2. Traverse the comment threads to get all of the related story comments.
  3. Submit each comment to the Natural Language API and store the results.
  4. Transform the results into a JSON object for visualization in D3.
  5. Use D3 to draw a colored bubble for each entity according to sentiment.
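
Steps 1 and 2 above can be sketched in Python roughly as follows. This is a minimal illustration, not the original code: the names `fetch_item` and `collect_comments` are hypothetical, and the sketch assumes the public HackerNews item endpoint, where each item lists its child IDs under `kids` and each comment carries its HTML body under `text`.

```python
import json
from urllib.request import urlopen

# Public HackerNews item endpoint (hosted on Firebase).
HN_ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"


def fetch_item(item_id):
    """Fetch one story or comment from the HackerNews API as a dict."""
    with urlopen(HN_ITEM_URL.format(item_id)) as response:
        return json.load(response)


def collect_comments(story_id, fetch=fetch_item):
    """Return the text of every comment under a story, walking all replies.

    `fetch` is injectable so the traversal can be exercised without
    network access (hypothetical design choice for this sketch).
    """
    story = fetch(story_id) or {}  # the API returns null for unknown IDs
    pending = list(story.get("kids", []))  # top-level comment IDs
    comments = []
    while pending:
        item = fetch(pending.pop()) or {}
        # Deleted or dead comments have no usable text but may still have replies.
        if item.get("text") and not item.get("deleted") and not item.get("dead"):
            comments.append(item["text"])
        pending.extend(item.get("kids", []))
    return comments
```

Calling `collect_comments` with a story's numeric ID returns the comment bodies in no particular order. The bodies are HTML fragments, so they would likely need tags and entities stripped before being sent for analysis.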

One of the most critical steps here is the sentiment classification of each comment as either positive or negative. This is where the machine learning actually comes into play and what makes it possible to separate out comments into two categories (supporters and detractors) that map well onto the weighted word cloud. In the screenshot, you can see some example results from Google’s Natural Language API. Using this API essentially removes the need to dive into all the complexities that come with training and using a new machine learning model. Google’s model is highly accurate, reliable, and most importantly, available at affordable rates via a simple API call. In the API example, the entire document is classified as expressing positive sentiment (with a score of 0.3 on a range of -1.0 to 1.0). Additionally, nine different entities are identified, three of which have sentiments associated with them.
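
The split into supporters and detractors described above can be sketched as a simple threshold on the document score. This is an assumption-laden illustration rather than the original code: the function names are hypothetical, the 0.1 neutral band is an arbitrary choice, and the field names (`documentSentiment`, `score`, `entities`, `name`) are assumed to match the JSON shape of the Natural Language API response.

```python
def classify_comment(score, neutral_band=0.1):
    """Map a document sentiment score in [-1.0, 1.0] to a label.

    The width of the neutral band is an assumption for this sketch.
    """
    if score > neutral_band:
        return "supporter"
    if score < -neutral_band:
        return "detractor"
    return "neutral"


def summarize_response(response):
    """Reduce one API response (as a dict) to the fields stored per comment."""
    score = response.get("documentSentiment", {}).get("score", 0.0)
    # Lowercase entity names so "Hyperloop" and "hyperloop" count as one entity.
    entities = [entity["name"].lower() for entity in response.get("entities", [])]
    return {"score": score, "label": classify_comment(score), "entities": entities}
```

With the example from the screenshot, a document score of 0.3 would be labeled as a supporter under this threshold.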

As each comment is passed to the Natural Language API, the document sentiment score and the entities identified within the comment are stored. These form the basis of our weighted word cloud. Entities that are mentioned most frequently will have larger bubbles, and the bubbles will be shaded according to their sentiment. Instead of the red-for-Republican and blue-for-Democrat color scheme, the sentiment-weighted word cloud uses red for detractors and green for supporters. Lastly, this visualization is drawn with the excellent JavaScript library D3.js, using the force-directed graph layout.
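
The aggregation step can be sketched as below. It is a hypothetical sketch, not the original implementation: `build_bubbles`, the record fields, and the choice to average the document score of every comment that mentions an entity are all assumptions made for illustration. The output is a plain list that can be serialized with `json.dumps` and handed to D3.

```python
import json
from collections import defaultdict


def build_bubbles(comments, min_mentions=1):
    """Aggregate per-comment results into one bubble record per entity.

    `comments` is a list of dicts, each with a `score` (document sentiment)
    and a list of `entities` (names found in that comment).
    """
    totals = defaultdict(lambda: {"mentions": 0, "score_sum": 0.0})
    for comment in comments:
        # Count an entity once per comment, even if it appears several times.
        for name in set(comment["entities"]):
            totals[name]["mentions"] += 1
            totals[name]["score_sum"] += comment["score"]

    bubbles = []
    for name, stats in totals.items():
        if stats["mentions"] < min_mentions:
            continue
        mean_score = stats["score_sum"] / stats["mentions"]
        bubbles.append({
            "name": name,
            "mentions": stats["mentions"],      # drives bubble size
            "sentiment": round(mean_score, 3),  # drives the red-to-green shade
        })
    # Most-mentioned entities first; ties are broken alphabetically.
    bubbles.sort(key=lambda b: (-b["mentions"], b["name"]))
    return bubbles


if __name__ == "__main__":
    sample = [
        {"score": 0.5, "entities": ["cost", "elon musk"]},
        {"score": -0.5, "entities": ["cost"]},
    ]
    print(json.dumps(build_bubbles(sample), indent=2))
```

Raising `min_mentions` is one way to keep rarely mentioned entities from cluttering the chart.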

Using this analysis on the Hyperloop launch announcement

Concept imagery from Hyperloop Alpha

I thought it would be apt to revisit the Hyperloop launch announcement with a case study using this type of analysis for a few reasons. First, Elon Musk just announced on Tuesday that his newest venture, The Boring Company, will be building a fully functional Hyperloop. Second, the Hyperloop manifesto is perhaps the most consummate pre-launch teaser I can think of in recent memory. Venture capitalists have been funding pre-launch companies for decades, providing the most rudimentary of market signals (“does a VC think this is a good idea”). Recently, Kickstarter and crowdfunding more broadly have created a wave of companies that have benefited from being able to test their theories with potential users before investing heavily in development of the actual product, but the Hyperloop announcement took that idea to a new level. By announcing the idea of a product (high-speed trains in vacuum tubes) four years before announcing an actual plan to build that product, Elon was able to popularize the idea and learn what was most important to potential consumers with very little upfront investment.

Here is what HackerNews comments on the original Hyperloop launch look like when visualized using a sentiment-weighted word cloud.

A few key takeaways:

  • Detractors are most worried about cost and safety. Infrastructure projects are notorious for going over budget and taking longer than expected (see the Big Dig), and this was a common thread in the comments.
  • Supporters think Elon Musk is a genius who can achieve anything. His star power is really significant on HackerNews in general, but I think this announcement put him in a bit of a new category.
  • Political factors are a significant issue for both sides. Commenters love finding ways around regulation but didn’t seem to be able to come to a consensus on how the government would affect the Hyperloop.
  • Supporters focus more on specific design criteria like temperature, pressure, tolerances, and pylons. The 58-page paper is quite detailed, and supporters were very engaged with the technical details of the project.
  • Detractors seem to focus on the feasibility of building the Hyperloop and associated costs like contractors, farms, power, and investors. This is related to cost but touches on some of the sub-problems in planning something of this magnitude.

This is just the tip of the iceberg though, and there are many interesting insights from reading selections of comments that center around a common theme. This type of analysis can help companies understand what issues are most important to their customers, even before they have launched.

How we wound up using this at Soylent:

At Soylent, doing this type of analysis led us to realize that our customers were most excited about the fact that the product was healthy, convenient, and affordable. When we looked at what was available before Soylent, it really seemed like you always had to pick just two of the three. Healthy and cheap meals could be obtained by growing and cooking food, but that wasn’t convenient. Cheap and convenient meals could be bought from any fast food joint, but they weren’t healthy. Convenient and healthy options were available at restaurants, but not affordably. These three key messages gave us a direction when marketing Soylent at the early stages.

The most important thing at the early stage of a company is talking to your customers. If you’re blessed with a buzz-filled launch that has lots of people talking, I highly recommend that you take advantage of that and do whatever you can to separate the signal from the noise. The features and ideas that your supporters have latched on to can serve as clear product development guidelines. Similarly, the discussion topics that your detractors are centering around might be worth addressing in a blog post or future marketing campaigns; you might be able to win some of them over.

If you would like to discuss this further or need help parsing your customer data, feel free to reach out to me. I’m happy to help.

Thanks for reading! Feel free to hit the recommend button below if you found this piece interesting!

You can connect with me on Twitter: www.twitter.com/johncoogan