This is what we did so far
We want to factorize our user-item interaction matrix into a user matrix and an item matrix. To do that, we will use the Alternating Least Squares (ALS) algorithm. We could write our own implementation of ALS, as has been done in this post or this post, or we can use the fast implementation already available from Ben Frederickson. The ALS model used here comes from the implicit package, which can be installed with pip or, if you use Anaconda, with conda.
import implicit
model = implicit.als.AlternatingLeastSquares(factors=10, iterations=20, regularization=0.1, num_threads=4)
model.fit(user_item.T)
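Here, we called ALS with the following parameters: 10 latent factors (factors), 20 iterations (iterations), a regularization factor of 0.1 (regularization) and 4 threads for fitting (num_threads).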
One thing to note is that the input for the ALS model is a movie-user interaction matrix, so we just have to pass the transpose of our user-item matrix to the model's fit function.
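To make the "alternating" idea concrete, here is a minimal NumPy sketch of plain ALS on a small dense matrix. It is a simplification for illustration only: it treats every entry as observed and is not the confidence-weighted implicit-feedback variant that the implicit package implements. The function name `als_dense` and the toy matrix `R` are assumptions made for this example; the parameter names mirror the ones passed to the model above.

```python
import numpy as np

def als_dense(R, factors=10, iterations=20, regularization=0.1, seed=0):
    """Factorize R (users x items) as U @ V.T with plain alternating least squares."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, factors))
    V = rng.normal(scale=0.1, size=(n_items, factors))
    reg = regularization * np.eye(factors)
    for _ in range(iterations):
        # Hold the item factors fixed and solve a ridge regression for all users.
        U = np.linalg.solve(V.T @ V + reg, V.T @ R.T).T
        # Hold the user factors fixed and solve the same problem for all items.
        V = np.linalg.solve(U.T @ U + reg, U.T @ R).T
    return U, V

# Toy interaction matrix (made up): two groups of users with different tastes.
R = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]], dtype=float)
U, V = als_dense(R, factors=2, iterations=20, regularization=0.1)
print(np.round(U @ V.T, 2))  # low-rank reconstruction of R
```

Each half-step is an ordinary regularized least squares problem with a closed-form solution, which is why the algorithm alternates instead of optimizing both matrices at once.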
It’s time to get some results. We want to find similar movies for a selected title. The implicit module offers a ready-to-use method that returns similar items given the movie’s index in the movie-user matrix. However, we need to translate between that index and the movie ID in the movies table.
movies_table = pd.read_csv("data/ml-latest-small/movies.csv")
movies_table.head()
def similar_items(item_id, movies_table, movies, N=5):
    """
    Input
    item_id: int
        MovieID in the movies table
    movies_table: DataFrame
        DataFrame with movie ids, movie title and genre
    movies: np.array
        Mapping between movieID in the movies_table and id in the item user matrix
    N: int
        Number of similar movies to return

    Output
    recommendation: DataFrame
        DataFrame with selected movie in first row and similar movies for N next rows
    """
    # Position of the movie in the item user matrix (works for a list or a NumPy array)
    user_item_id = list(movies).index(item_id)
    # Ask for N+1 items because the selected movie itself is returned first
    similars = model.similar_items(user_item_id, N=N+1)
    # Keep the item indices and drop the similarity scores
    indices = [item[0] for item in similars]
    ids = [movies[i] for i in indices]
    ids = pd.DataFrame(ids, columns=['movieId'])
    recommendation = pd.merge(ids, movies_table, on='movieId', how='left')
    return recommendation
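Conceptually, finding similar items in a factor model comes down to comparing item vectors. The sketch below ranks items by cosine similarity using NumPy only. The helper `top_similar` and the toy `item_factors` matrix are hypothetical stand-ins for the learned item factors, and the library's own implementation may differ in its details. It also shows why the function above asks for N+1 results: the closest item to a movie is the movie itself.

```python
import numpy as np

def top_similar(item_factors, item_index, N=5):
    """Return (index, score) pairs for the N items closest to item_index by cosine similarity."""
    norms = np.linalg.norm(item_factors, axis=1)
    norms[norms == 0] = 1e-12  # guard against all-zero vectors
    scores = item_factors @ item_factors[item_index] / (norms * norms[item_index])
    best = np.argsort(-scores)[:N]  # highest score first
    return [(int(i), float(scores[i])) for i in best]

# Toy item factors (made up): items 0 and 1 point one way, items 2 and 3 another.
item_factors = np.array([[1.0, 0.0],
                         [0.9, 0.1],
                         [0.0, 1.0],
                         [0.1, 0.9]])
result = top_similar(item_factors, 0, N=2)
print(result)  # the first pair is item 0 itself, followed by its nearest neighbour
```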
Let’s try it!
Let’s see which similar movies we get for a James Bond movie, GoldenEye:
df = similar_items(10, movies_table, movies, 5)
df
Interesting recommendations. One thing to notice is that all the recommended movies are also in the Action genre, even though the ALS algorithm was given no information about movie genres. Let’s try another example:
df = similar_items(500, movies_table, movies, 5)
df
The selected movie is a comedy, and so are the recommendations. Another interesting thing to note is that the recommended movies are from the same period (the 90s).
df = similar_items(1, movies_table, movies, 5)
df
This is a case where the recommendations are not relevant. Recommending Silence of the Lambs to a user who just watched Toy Story does not seem like a good idea.
So far, the recommendations are displayed in a DataFrame. Let’s make it fancier by showing the movie posters instead of just the titles. This might help us later when we deploy our model and separate the work into Front End and Back End. To do that, we will download movie metadata that I found on Kaggle. We will need the following data:
metadata = pd.read_csv('data/movies_metadata.csv')
metadata.head(2)
From this metadata file we only need the imdb_id and poster_path columns.
image_data = metadata[['imdb_id', 'poster_path']]
image_data.head()
We want to merge the poster_path column into the movies table. To do so, we need the links file, which maps between imdbId and movieId.
links = pd.read_csv("data/links.csv")
links.head()
links = links[['movieId', 'imdbId']]
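Merging the ids will be done in 2 steps: we merge the poster data with the links table on imdbId, then merge the result with the movies table on movieId.
Before that, we need to remove missing imdb ids and extract the integer ID: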
image_data = image_data[~image_data.imdb_id.isnull()]
def app(x):
    # imdb_id looks like 'tt0114709': drop the 'tt' prefix and parse the number
    try:
        return int(x[2:])
    except ValueError:
        # Unparsable id: print it and return None so the row is dropped below
        print(x)
image_data['imdbId'] = image_data.imdb_id.apply(app)
image_data = image_data[~image_data.imdbId.isnull()]
image_data.imdbId = image_data.imdbId.astype(int)
image_data = image_data[['imdbId', 'poster_path']]
image_data.head()
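The same cleanup can also be written without a helper function. The sketch below uses the pandas string accessor and `pd.to_numeric` with `errors="coerce"`, so that anything that cannot be parsed becomes NaN and is dropped in one step. The small DataFrame is a made-up stand-in for the real metadata; only the column names are taken from the code above.

```python
import pandas as pd

# Toy stand-in for the metadata columns; the real values come from movies_metadata.csv.
image_data = pd.DataFrame({
    "imdb_id": ["tt0114709", "tt0113497", None, "0"],
    "poster_path": ["/a.jpg", "/b.jpg", "/c.jpg", "/d.jpg"],
})

# Strip the two-character "tt" prefix, then turn anything unparsable into NaN.
image_data["imdbId"] = pd.to_numeric(image_data["imdb_id"].str[2:], errors="coerce")

# Drop rows without a usable id and store the remaining ids as integers.
image_data = image_data.dropna(subset=["imdbId"]).astype({"imdbId": int})
image_data = image_data[["imdbId", "poster_path"]]
print(image_data)
```

Unlike the loop with a try/except, this version does not print the offending values, so it is a trade-off between brevity and visibility into bad rows.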
posters = pd.merge(image_data, links, on='imdbId', how='left')
posters = posters[['movieId', 'poster_path']]
posters = posters[~posters.movieId.isnull()]
posters.movieId = posters.movieId.astype(int)
posters.head()
movies_table = pd.merge(movies_table, posters, on='movieId', how='left')
movies_table.head()
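One thing worth checking with a left merge is that it cannot silently duplicate movie rows. The sketch below replays the two merge steps on made-up toy tables (only the column names are taken from the code above) and assumes, for illustration, that the poster data might contain more than one row for the same movie. It keeps one poster per movie and lets pandas verify the one-to-one relationship through the `validate` argument of `pd.merge`.

```python
import pandas as pd

# Toy tables; the values are made up for illustration.
movies_table = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
links = pd.DataFrame({"movieId": [1, 2, 3], "imdbId": [111, 222, 333]})
image_data = pd.DataFrame({"imdbId": [111, 222, 222],
                           "poster_path": ["/a.jpg", "/b.jpg", "/b2.jpg"]})

# Step 1: attach a movieId to each poster through the shared imdbId.
posters = pd.merge(image_data, links, on="imdbId", how="left")[["movieId", "poster_path"]]
# Keep one poster per movie so that step 2 cannot duplicate movie rows.
posters = posters.drop_duplicates(subset="movieId", keep="first")

# Step 2: attach the poster path to each movie; movies without a poster get NaN.
merged = pd.merge(movies_table, posters, on="movieId", how="left", validate="one_to_one")
print(merged)
```

If the keys were not unique on both sides, `validate="one_to_one"` would raise an error instead of quietly returning extra rows.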
Now that we have the poster paths, we need to fetch the posters themselves from a website. One way to do it is to use the TMDB API to get movie posters. However, we would have to create an account on the website, apply to use the API and wait for approval to get a token ID. We don’t have time for that, so we’ll improvise.
All movie posters can be accessed through a base URL plus the movie poster path that we got, and using the HTML class from IPython.display we can display them directly in a Jupyter Notebook.
from IPython.display import HTML, display
def display_recommendations(df):
    images = ''
    for ref in df.poster_path:
        # Skip movies without a poster: empty strings or NaN left by the merge
        if isinstance(ref, str) and ref != '':
            link = 'http://image.tmdb.org/t/p/w185/' + ref
            images += ("<img style='width: 120px; margin: 0px; "
                       "float: left; border: 1px solid black;' src='%s' />" % link)
    display(HTML(images))
df = similar_items(500, movies_table, movies, 5)
display_recommendations(df)
Recommendations for `Mrs Doubtfire`
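Since the plan is to split the work into Front End and Back End later, it can help to separate building the HTML from displaying it. The sketch below uses only the standard library and returns the markup as a string, which makes it usable outside a notebook. The function name `poster_html` and the `width` parameter are assumptions made for this example; the base URL is the one used above.

```python
from html import escape

BASE_URL = "http://image.tmdb.org/t/p/w185"  # same base URL as in the function above

def poster_html(poster_paths, width=120):
    """Build one <img> tag per usable poster path and return the HTML as a string."""
    tags = []
    for ref in poster_paths:
        # Missing values from a left merge arrive as NaN (a float), not as ''.
        if not isinstance(ref, str) or not ref:
            continue
        src = BASE_URL + "/" + ref.lstrip("/")
        tags.append(
            "<img style='width: %dpx; margin: 0px; float: left; "
            "border: 1px solid black;' src='%s' />" % (width, escape(src, quote=True))
        )
    return "".join(tags)

# Made-up paths: two usable entries, one NaN and one empty string.
html_snippet = poster_html(["/a.jpg", float("nan"), "", "b.jpg"])
print(html_snippet)
```

In a notebook, the returned string can be passed to `display(HTML(...))` exactly as before.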
Let’s put all of it into one small function:
def similar_and_display(item_id, movies_table, movies, N=5):
    df = similar_items(item_id, movies_table, movies, N=N)
    display_recommendations(df)
similar_and_display(10, movies_table, movies, 5)
Recommendations for `GoldenEye`
In this post we implemented ALS through the implicit module to find similar movies. Additionally, we did some hacking to display the movie posters instead of just a DataFrame. In the next post we will see how to make recommendations for users depending on what movies they’ve seen. We will also see how we can set up an evaluation scheme and optimize the ALS parameters.
Stay tuned!