Isn’t it great how Instagram’s “Explore” section displays content that matches your interests? When you open the app, the content and recommendations shown are almost always relevant to your specific likes, interests, and connections. While it may be fun to think we’re the center of the Instagram universe, the reality is that personalized, relevant content is also curated for 400 million other people daily. With 400M active users and 80M photos posted daily, how does Instagram decide what to put in your Explore section? Let’s look at the key factors Instagram likely uses to score posts in your timeline and Explore section.
Before we get into the nitty-gritty, here are some features Instagram uses to determine what content to serve up:
Now, let’s use these features to build our own Instagram discovery engine. In order to query data from Instagram I am going to use the very cool, yet unofficial, Instagram API written by Pasha Lev. For Mac users, the following should get you up and running. All other libraries are pip installable, and all Python code was run within a Jupyter notebook.
To get up and running, run the following in your terminal:
Then run jupyter notebook in your terminal, which will open in your default browser. I would also recommend verifying your Instagram phone number before continuing. This will prevent some unexpected redirects.
Now on to the good stuff. Let’s start with finding my social network and a bit of graph analysis.
If all goes well you should get a ‘Login success!’ response.
We can now build a true social network by finding everyone I follow as well as everyone they follow. For a quick intro on social network analysis and personalized pagerank, take a look at this blog post.
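As a rough illustration of the "everyone I follow, plus everyone they follow" idea, here is a minimal sketch in plain Python. The fetch_following callable and the FAKE_FOLLOWS data are hypothetical stand-ins for whatever API call returns the accounts a user follows; nothing here talks to Instagram.

```python
from typing import Callable, Dict, Iterable, List


def build_two_hop_graph(
    me: str,
    fetch_following: Callable[[str], Iterable[str]],
) -> Dict[str, List[str]]:
    """Return {user: [accounts that user follows]} for me and everyone I follow."""
    graph: Dict[str, List[str]] = {me: list(fetch_following(me))}
    for friend in graph[me]:
        if friend not in graph:  # skip repeats and a self-follow of `me`
            graph[friend] = list(fetch_following(friend))
    return graph


# Hypothetical data; a real run would query the API inside fetch_following.
FAKE_FOLLOWS = {
    "me": ["alice", "bob"],
    "alice": ["bob", "carol"],
    "bob": ["dave", "me"],
}

graph = build_two_hop_graph("me", lambda user: FAKE_FOLLOWS.get(user, []))

# Every account that appears as a source or a destination is a node.
nodes = set(graph) | {dst for follows in graph.values() for dst in follows}
print(sorted(nodes))
```

Only the first-degree accounts are expanded, so the node count grows with the followings of each friend, which is why a few dozen follows can turn into thousands of nodes.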
Before stepping into the code, let’s take a look at my own profile to see what we’re trying to analyze.
As you can see, I follow 42 people, who are considered my immediate network, which isn’t too many. If we start to look at 2nd degree connections that number quickly grows. In my case, if we look at 2nd degree connections the number of nodes reaches over 24,000. A nice visualization of this can be seen in step 2.
Cool, now let’s get that into a nicely formatted Pandas DataFrame.
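One way to shape the data, sketched here with plain Python so it runs without extra dependencies: flatten a follow graph into one row per edge. The column names src_id and dst_id and the example_graph data are assumptions for illustration. With Pandas installed, passing the resulting list to pandas.DataFrame gives a two-column frame.

```python
from typing import Dict, List


def to_edge_rows(graph: Dict[str, List[str]]) -> List[dict]:
    """Flatten {user: [followed accounts]} into one dict per directed edge."""
    rows = []
    for src, follows in graph.items():
        for dst in follows:
            rows.append({"src_id": src, "dst_id": dst})
    return rows


# Hypothetical follow graph: "me" follows two accounts, one of which follows a third.
example_graph = {"me": ["alice", "bob"], "alice": ["carol"]}
edges = to_edge_rows(example_graph)

for row in edges:
    print(row["src_id"], "->", row["dst_id"])

# With pandas available: df = pandas.DataFrame(edges)
```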
While it’s not essential to visualize your network in order to build your own discovery engine, it is pretty interesting and may help with understanding personal pageranks. I’m going to use one of my new favorite graph visualization libraries, Graphistry (check them out sometime). However, if you don’t want to wait around for an API key (though I got a same day response), there are lots of other good libraries such as Lightning and NetworkX.
For this example, I’m going to display the src_id and dst_id to give my friends a bit of privacy, though it is pretty fun to display usernames (which is what the below code will do). The first graph only displays edges that are sourced from me and filtered using the built in tools in Graphistry.
The second graph shows all of my extended network.
Isn’t that cool? You can already see a couple interesting features such as the few external centroids and how they interact with the rest of my social network.
It’s now time to grab the most recent images from everyone and rate them by how relevant they are to me. Since there are about 24,000 nodes, it may take a while to download all the data.
Let’s do a quick trial run of only the 44 people I immediately follow to make sure we’re on the right track. Based on what I thought might determine the relative score of Instagram posts, we need to grab the # of likes, # of comments and the time the photo was taken for all recent photos of people I follow (in this example I considered recent equivalent to one week and cut off photos older than that). It would also be useful to grab how many times I’ve ‘liked’ that user’s posts and how connected that person is to me. Everything besides “how connected” that user is to me is a simple sum. To calculate the “connected” piece, we’ll use a personalized pagerank. Once we’ve compiled that information, we can define an importance metric like:
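To make the ingredients concrete, here is a minimal sketch that computes a personalized pagerank by power iteration and then combines it with likes, comments, post age, and my own like count. The exact way the terms are combined in importance() is an assumption chosen for illustration, not necessarily the metric used to produce the results in this post, and all the sample data is made up.

```python
from datetime import datetime, timedelta


def personalized_pagerank(graph, source, damping=0.85, iterations=50):
    """Power iteration where every random restart jumps back to `source`."""
    nodes = set(graph) | {dst for follows in graph.values() for dst in follows}
    rank = {node: 0.0 for node in nodes}
    rank[source] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in nodes}
        for node in nodes:
            out = graph.get(node, [])
            if out:
                share = damping * rank[node] / len(out)
                for dst in out:
                    nxt[dst] += share
            else:
                # Accounts with no known followings send their rank back to the source.
                nxt[source] += damping * rank[node]
        nxt[source] += 1.0 - damping  # restart mass (ranks always sum to 1)
        rank = nxt
    return rank


def importance(post, now, rank, my_likes):
    """Assumed metric: engagement x affinity x connectedness, decayed by age."""
    age_hours = (now - post["taken_at"]).total_seconds() / 3600.0
    engagement = post["likes"] + post["comments"]
    affinity = 1 + my_likes.get(post["user"], 0)
    return engagement * affinity * rank.get(post["user"], 0.0) / (1.0 + age_hours)


# Hypothetical data for illustration only.
follow_graph = {"me": ["alice", "bob"], "alice": ["bob"], "bob": []}
now = datetime(2017, 1, 8, 12, 0)
posts = [
    {"user": "alice", "likes": 100, "comments": 10, "taken_at": now - timedelta(hours=2)},
    {"user": "bob", "likes": 100, "comments": 10, "taken_at": now - timedelta(hours=2)},
    {"user": "bob", "likes": 5000, "comments": 90, "taken_at": now - timedelta(days=10)},
]
my_likes = {"alice": 3, "bob": 3}  # how often I have liked each user's posts

rank = personalized_pagerank(follow_graph, "me")
recent = [p for p in posts if now - p["taken_at"] <= timedelta(days=7)]  # one-week cutoff
ranked = sorted(recent, key=lambda p: importance(p, now, rank, my_likes), reverse=True)
for post in ranked:
    print(post["user"], round(importance(post, now, rank, my_likes), 2))
```

In this toy graph bob is followed by both me and alice, so he ends up with a higher personalized pagerank than alice, and his otherwise identical post ranks first. The ten-day-old post is dropped by the one-week cutoff regardless of its like count.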
Alright, now that we have that defined, let’s see how it works! I apologize for the big chunk of code coming up, but don’t you worry…there is a picture of my new puppy at the end!
Which gives me:
This actually looks very similar to my personal timeline — cool! Now that we know we’re onto something, let’s tackle the discovery section.
We can take the same approach as before by calculating the relative score of each photo of friends of friends. To do this, we’d start with the first social graph that we calculated…but that has over 24K nodes and I’m too lazy to wait for all the data. Instead, let’s grab photos of friends of friends whose posts I’ve ‘liked’. This drops the number of nodes down to just over 1,500 which, depending on your internet speed, is the perfect amount of time for a coffee break.
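The pruning step can be sketched with a few set operations. This assumes one reading of the sentence above: keep only the accounts I follow whose posts I’ve liked, then collect the accounts they follow that I don’t follow yet. The graph and liked_authors values are hypothetical.

```python
def discovery_candidates(graph, me, liked_authors):
    """Accounts followed by friends whose posts I've liked, excluding people I already follow."""
    direct = set(graph.get(me, []))
    liked_friends = direct & set(liked_authors)
    candidates = set()
    for friend in liked_friends:
        candidates.update(graph.get(friend, []))
    return candidates - direct - {me}


# Hypothetical follow graph and like history.
graph = {
    "me": ["alice", "bob"],
    "alice": ["carol", "bob"],
    "bob": ["dave", "me"],
}
liked_authors = {"alice"}  # I have liked alice's posts, but never bob's

candidates = discovery_candidates(graph, "me", liked_authors)
print(sorted(candidates))
```

Because bob’s posts were never liked, his followings are not expanded, which is the kind of cut that shrinks the node count before any photos are downloaded.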
There are a couple minor tweaks to the above code that are needed to deal with the extended user base, but most of the code is the same.
The results ended up showing a lot of images from National Geographic and Red Bull, which I currently don’t follow, but might start following now!
Interests haven’t been taken into account yet. A nice aspect of Instagram is its rich set of #hashtags used to describe photos. Let’s see if we can discover my interests by using the hashtags of photos I’ve ‘liked’, and photos that I’ve been tagged in. While Instagram most likely uses click data alongside ‘like’ data, we don’t have access to clicks, so we’re going to stick with likes only.
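Counting hashtags is straightforward once the captions are in hand. Here is a minimal sketch that assumes the captions of liked and tagged photos have already been collected into a list of strings (with None for photos that have no caption); the sample captions are made up.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def top_hashtags(captions, n=5):
    """Return the n most common hashtags (lowercased) across all captions."""
    counts = Counter()
    for caption in captions:
        # Photos without a caption contribute nothing.
        counts.update(tag.lower() for tag in HASHTAG.findall(caption or ""))
    return counts.most_common(n)


# Hypothetical captions from photos I've liked or been tagged in.
captions = [
    "Sunrise hike #mountains #Hiking",
    "New trail #hiking #dog",
    None,
    "#dog #puppy #hiking",
]

interests = top_hashtags(captions, n=2)
print(interests)
```

Lowercasing the tags keeps #Hiking and #hiking from being counted as two separate interests.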
Now let’s grab the most popular images for each of those tags:
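Here is a minimal sketch of the selection step, assuming each hashtag feed has already been downloaded into a list of post dictionaries with a likes count. The feeds structure and its field names are assumptions for illustration, not the shape of any real API response.

```python
def most_popular_per_tag(feeds):
    """Pick the post with the most likes from each hashtag feed."""
    best = {}
    for tag, posts in feeds.items():
        if posts:  # an empty feed has no winner
            best[tag] = max(posts, key=lambda post: post["likes"])
    return best


# Hypothetical feeds keyed by hashtag.
feeds = {
    "hiking": [{"id": "h1", "likes": 120}, {"id": "h2", "likes": 950}],
    "dog": [{"id": "d1", "likes": 40}],
    "puppy": [],
}

top_by_tag = most_popular_per_tag(feeds)
for tag, post in top_by_tag.items():
    print(tag, post["id"], post["likes"])
```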
Now that we have the most popular image from each hashtag feed, we can display them.
Now let’s combine these two techniques.
You may have noticed I was saving all the collected image data to top_graph_img and images_top_tags. Let’s combine them using a fairly naive technique, random sampling:
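A minimal sketch of that random sampling, assuming top_graph_img and images_top_tags are plain lists of image identifiers (the values below are made up, and the real objects may hold richer records):

```python
import random


def mix_recommendations(graph_images, tag_images, k, seed=None):
    """Randomly sample k distinct images from the union of both sources."""
    rng = random.Random(seed)
    # dict.fromkeys removes duplicates while preserving first-seen order.
    pool = list(dict.fromkeys(graph_images + tag_images))
    return rng.sample(pool, min(k, len(pool)))


# Hypothetical image identifiers from the two techniques.
top_graph_img = ["g1", "g2", "g3", "shared"]
images_top_tags = ["t1", "t2", "shared"]

explore = mix_recommendations(top_graph_img, images_top_tags, k=4, seed=42)
print(explore)
```

Uniform sampling ignores the scores computed earlier, which is what makes it naive; sampling with weights proportional to each image’s score would be one natural refinement.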
That’s not too shabby! I personally find some of those photos pretty cool, but it definitely could be better.
Ways to improve the discovery engine:
This is by no means an exhaustive list, so if you have any other ideas please let me know!
This is a collaboration from the team at GetStream.io, led by Balazs Horanyi, Data Scientist at GetStream.io. The original blog post can be found at https://getstream.io/blog/building-instagram-discovery-engine-step-step-tutorial/.