If you’ve ever participated in a brainstorming session, you may have been in a room with a wall that looks like the image above. Usually, the session starts with a prompt or a problem statement of some sort. The team will write down as many ideas/thoughts around it as possible in a predetermined time frame and eventually throw them up on the wall. Taking a step back, patterns start to emerge and the sticky notes can start to be clustered into themed camps, creating order out of the chaos of random ideas.
I was a part of an ideation session like this recently where the goal was to create a brand platform for a company. This creative exercise inspired me to think programmatically about what we were doing. My thought was that there were computational methods that would be able to organize and interpret the ideas we were coming up with. Clustering, specifically, is an analysis that I’ve come across frequently in my research into machine learning algorithms. Maybe there’s a way to utilize clustering to find themes in our ideas that we might’ve missed, and maybe we could use machine learning to spot them.
Understanding the data is an important first step in figuring out the best solution. There’s a wide variety of algorithms out there, and some are better than others at specific tasks. Some work better on particularly small or particularly large datasets. The content I’m working with is single sentences or short ideas, similar to a set of tweets. I’m also starting with zero assumptions about how to cluster the data, letting the algorithm draw its own conclusions without a training data set.
Topic modeling is a statistical technique for discovering hidden semantic patterns in an unstructured collection of documents. This is exactly what I was looking for to parse and make sense of the seemingly random thoughts. It’s also unsupervised, in that it doesn’t need training data. There are two topic modeling algorithms I explored: Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF).
Without getting too deep into the math behind these algorithms, LDA is a probabilistic model in that it calculates both the probability of selecting a word when sampling a topic and the probability of selecting a topic when sampling a document. A “document,” in this case, is a single sticky note. This is repeated for all words on each sticky note, and then repeated on the entire collection. NMF, instead, relies on linear algebra. Given a fixed starting point, it is deterministic and arrives at a single representation of the collection of sticky notes.
Even though both of these algorithms are unsupervised, we still need to decide beforehand how many topics to sort into. In my experience, the best approach is to adjust this number and compare the results on the data. Both LDA and NMF also take a “bag of words” model as input: each sticky note is represented as a row in a matrix, and each column corresponds to a word, with the value representing that word’s importance in the note relative to the overall collection. For this post, we’ll be using NMF as our algorithm because I’ve found that it works best on our smaller dataset.
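To make the bag-of-words matrix concrete, here is a minimal sketch in plain Python. The three notes are made up for illustration, and the values are raw word counts rather than importance weights:

```python
from collections import Counter

# Hypothetical sticky notes, invented for this illustration.
notes = [
    "feed hungry kids",
    "grow food locally",
    "share food with hungry neighbors",
]

# Split each note into lowercase words.
tokenized = [note.lower().split() for note in notes]

# One column per distinct word in the whole collection.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per sticky note; each cell counts how often the word appears.
matrix = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```

Word order inside a note is thrown away, which is where the “bag” in the name comes from.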
To create the bag of words for our matrix and get the representation of importance, we can use TF-IDF from the text feature extraction module of Python’s scikit-learn library. TF-IDF is a weighting scheme that multiplies a term’s frequency (the number of times a term appears in a sticky note divided by the total number of terms in the sticky note) by its inverse document frequency (the log of the total number of sticky notes divided by the number of sticky notes containing the term).
Consider a sticky note containing 10 words where the word ‘hungry’ appears 2 times.
The term frequency (tf) for ‘hungry’ is TF = (2 / 10) = 0.2.
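Assume we have 1000 sticky notes and the word ‘hungry’ appears in 10 of these. Then, using a base-10 log, the inverse document frequency (idf) is calculated as IDF = log10(1000 / 10) = 2.
The TF-IDF weight is the product of these quantities: TF-IDF = 0.2 * 2 = 0.4. Note that scikit-learn’s TfidfVectorizer uses a natural log with smoothing and normalizes each row by default, so its numbers will differ from this hand calculation.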
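The same arithmetic can be checked in a few lines of Python. This is only the textbook formula from the example, assuming a base-10 log (which is what makes log(1000 / 10) come out to 2), and the variable names are my own:

```python
import math

words_in_note = 10        # total words on the sticky note
hungry_in_note = 2        # times 'hungry' appears on that note
total_notes = 1000        # sticky notes in the collection
notes_with_hungry = 10    # notes that contain 'hungry'

tf = hungry_in_note / words_in_note
idf = math.log10(total_notes / notes_with_hungry)
tfidf = tf * idf

print(f"tf={tf:.2f} idf={idf:.2f} tf-idf={tfidf:.2f}")
# prints: tf=0.20 idf=2.00 tf-idf=0.40
```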
NMF takes this matrix and generates a set number of topics that represent weighted sets of co-occurring terms. These discovered topics form a basis that will provide a representation of the original sticky notes.
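To give a feel for the factorization itself, here is a toy sketch in plain Python using multiplicative updates, one common way to fit NMF. It is not necessarily how a library does it internally. The matrix V is made up, and nmf, matmul, transpose, and squared_error are hypothetical helper names:

```python
import random

def matmul(A, B):
    # Plain nested-list matrix product.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def squared_error(V, W, H):
    # Sum of squared differences between V and its reconstruction W x H.
    WH = matmul(W, H)
    return sum((v - a) ** 2 for vr, ar in zip(V, WH) for v, a in zip(vr, ar))

def nmf(V, n_topics, n_iter=500, seed=0, eps=1e-9):
    # Factor V (notes x terms) into W (notes x topics) and H (topics x terms),
    # keeping every entry non-negative.
    rng = random.Random(seed)
    W = [[rng.random() for _ in range(n_topics)] for _ in range(len(V))]
    H = [[rng.random() for _ in range(len(V[0]))] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H

# Made-up weights: notes 0-1 share terms 0-1, notes 2-3 share terms 2-3.
V = [
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 3.0, 3.0],
]

W, H = nmf(V, n_topics=2)
print("reconstruction error:", squared_error(V, W, H))
```

Each row of H is a topic expressed as term weights, and each row of W says how strongly a note loads on each topic.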
Building It Out
For the purposes of this post, I’m using a small collection of sticky notes I came across from a brainstorming session on ending world hunger. There would normally be much more data to use, but this is enough to illustrate how the algorithm works.
I added all of these sticky notes to a CSV file and took the liberty of separating out some sentences where there may have been two or more ideas.
After importing the CSV and creating an array of the ideas, we first need to vectorize the sticky notes with TF-IDF. Using scikit-learn, we can also filter out stop words (commonly used words such as ‘the’) and drop words based on how often or how rarely they appear. We can then set up NMF with our desired number of topics. Next, we can compute the two lower-rank matrices that NMF factors the input into. Matrix W holds a score for each sticky note on each generated topic, and Matrix H holds a score for each term on each generated topic, which lets us list the main terms per topic.
Below is the basic code implementation for this process:
Here’s the output for the 3 topics we generated:
Since we used such a small number of sticky notes, it’s tough to really draw any conclusions from this method based on the above output, but I’ve had success generating topics from company brainstorming sessions where there are well over 150 sticky notes with specific solutions and ideas. Feel free to reach out if you have any ideas on how to improve the methods I mention above! Tweet at me @joezeoli or check out 20nine https://www.20nine.com