A few years ago, I was binge-reading the Game of Thrones books and found myself having a hard time keeping track of all the characters in my head. (This is not surprising — there are more than 150 named characters in the series!) I was going back and forth between chapters or constantly looking up the A Song of Ice and Fire wiki to remember plotlines. I needed a mental map — surely there was a better way to visualize these characters?
Pictured here is a sample network graph from Wikipedia that illustrates the contributions of Wikipedia editors to different languages. Using this example, here are some basics (or a quick refresher, if you’re already familiar) of graph theory concepts:
Circles representing the languages in which articles were written are the “vertices” of the graph (interchangeably, the “nodes”).
The “edges” are the lines connecting each pair of vertices. Each edge in the graph is determined through an incidence function that maps a pair of vertices to an edge.
In this example, each edge represents (by line weight, or thickness) the number of editors who have contributed to both the languages that the line connects. This is what we call an undirected simple graph. “Undirected” meaning {en--> fr} and {fr --> en} are identical, and “simple” means no more than one edge connects each pair of vertices. The graph is also “weighted,” meaning that the thickness of the edges is relative to the strength of the relationship between the vertices. In this example, the weighted incidence function could look something like this:
While the visual representation of graphs in this way is an intuitive approach to quickly show relationships so that they are easy to comprehend, there are even richer insights we can derive from representing a dataset as a graph object.
“In data science, 80 percent of time spent is preparing data, 20 percent of time is spent complaining about the need to prepare data.”
Data scientists may not agree on everything — but we agree that the most difficult part of any project is getting the data. Lucky for us, that part is behind us for this article. There is a nice clean dataset of Hamilton lyrics readily available on Kaggle that you can simply download and start graphing.
This is what the Hamilton dataset looks like.
There is one line of record per character/song/line of lyric.
To build a network graph of all the Hamilton speakers, the following must be defined:
Nodes (list of speakers)
Edges (to connect each pair of speakers)
Incidence function to map each pair of vertices to an edge (with an optional weight)
The incidence function I’ve chosen is the Number of songs each pair of speakers appears in together. My assumption is that the more songs two characters appear together in, the stronger their relationship.
Weight {speaker,x, speaker,y} = #songs that feature both speaker,x and speaker,y
Using R’s dplyr, I can transform my original dataset into an **{src, dest, weight}**
entity, and then convert that into an adjacency matrix. I can then use graph.adjacency in R’s igraph package to create a “graph object” from this adjacency matrix, which I can then use for plotting and other analyses.
The graph_obj can be visualized using the plot.igraph function. Because this function has many custom layouts to choose from, I start by rendering the same graph using the “star” layout.
The result is technically a network plot. But is it possible to do even better? The chart above seems to suggest that all vertices and edges have equal importance — but that undermines the whole point of visualizing a social network. Some characters are indeed more “significant,” and some speakers have stronger relationships relative to others.
How can this graph reflect that?
This is where edge weight and vertex degree come into play. I start by playing around with the parameters of the plot.igraph
function to make edge.width
(i.e., the thickness of the edge in the plot) relative to the weight, and vertex.label.cex
(i.e., the font size of the vertices) relative to degree.
Much better! Characters with a higher degree are visually larger, and the distinction between strong and weak relationships is also apparent from the darkness of the lines. This iteration is much more intuitive and lets the viewer immediately grasp the relationships among characters. It's also fitting that King George is a lone node, considering that his songs are always (very funny) monologues.
You could also use visNetwork library in R to make an interactive network graph. The library makes it possible to zoom in and out of multiple parts of the graph (especially useful with a particularly large graph), and has support for Shiny.
Centrality is a key concept in graph theory to identify the significance of nodes:
Degree centrality: This is a measure of the number of edges connected to each node.
Eigen centrality: This represents a measure of how “well connected” a node is, how many links connections share, and so on through the network. It identifies nodes with influence over the whole network, not just those directly connected to it.
Betweenness centrality: This is literally how much a given node is between other nodes and acts as a “bridge” among various clusters of networks. It’s a measure of the “influence” of each of the vertices on the rest of the network.
I can use igraph’s degree(), betweenness(), and eigen_centrality() functions to get centrality for the generated graph:
It looks like Aaron Burr has the highest betweenness centrality (the “bridge”) in our graph, while Hamilton has the highest eigenvector centrality (the “influencer”). Make what you will of that.
Business applications of network graphs are numerous:
Social networking sites make use of network graphs to create communities of similar users and offer targeted recommendations. A rudimentary implementation of the algorithm behind a “suggested friends” feature could look something like this: “Nine out of ten of Alice’s immediate friends are also friends with Bob -> recommend Bob as a potential friend for Alice.”
Applications that map the shortest distance from place X to place Y (such as maps, ride-sharing services, supply chain and logistics for delivery trucks, and so on) likely use variants of “shortest path” algorithms, popularly known in computer science as the traveling salesman problem.
Network theory is a crucial component of lexical and semantic processing within natural language processing (NLP), in turn used among chatbots and virtual assistants like Alexa, Cortana, Siri, and even IBM’s Watson winning Jeopardy!, a game of puns and wordplay that is far from straightforward.
Name-dropping party games like Six Degrees of Kevin Bacon use network graphs.
In epidemiology, centrality measures may be used in identifying the origins of pandemics or “super spreader” events.
If you think about it, the Internet is simply one gargantuan network of different websites. Search engines make use of knowledge graph measures to return the most relevant pages for a particular search query.
Fun as they are, it is important to note that network graphs are not without drawbacks when employed in production. For example, they can be resource-intensive. As is the case with any matrix operations, scalability and performance sometimes take a hit. There is also a “cold start” problem — if your dataset is too sparse or there aren’t really many relationships among entities, a network graph is not an effective solution. Used correctly and in the right context, however, they can be valuable to business.
Code:https://github.com/iswaryam/hamilton/•
Dataset credit: https://www.kaggle.com/lbalter/hamilton-lyrics#
If you're a Potterhead, check out my GitHub - I've also graphed the characters of Harry Potter with a similar method.