Flix-Finder: Building a Movie Recommendation App With Django and Bright Data [Part 2/3]

Written by infinity | Published 2023/07/05
Tech Story Tags: web-development | programming | software-engineering | web-data | django | computer-science | application-development | tutorial

TL;DR: This is the second part of a three-part series. In the first part, we used pre-curated IMDB rating data from Bright Data; the same is now required for Rotten Tomatoes. Since no pre-curated dataset exists for it, we build a custom Bright Data collector, a tool that can be used to collect web data at large scales.

You are reading the second part of a three-part series. It is highly encouraged to read the first article here before proceeding!

We started working on “Flix-Finder” with the vision of building a software solution to the problem of picking a movie based on numerous rating parameters. By the end of the previous part of this series, Flix-Finder could filter films based on release year and minimum IMDB rating. But as we argued, our application will only fulfill its purpose if it can filter movies across several rating parameters.

Today, in the second part of this series, we will add support for filtering movies based on Rotten Tomatoes ratings as well!

With this feature implemented, our application will be able to search for movies based on two different rating parameters. Additionally, with the method outlined in this article, you (the reader) can add as many other rating parameters as you want!

Let us get coding…

Rotten Tomatoes Rating Data

Our software utilizes information gathered from multiple rating agencies. In the first part, we used pre-curated IMDB rating data from Bright Data. We now need the same for Rotten Tomatoes.

A quick search indicates that Rotten Tomatoes does not offer a developer API for retrieving such information. Additionally, Bright Data does not offer a pre-curated dataset for Rotten Tomatoes.

Does this imply that we are at a dead end? No! We have a route out in the form of Bright Data’s custom web scrapers!

Bright Data’s Custom Collectors

What, then, is a Collector? Bright Data’s collector is a tool that can be used to collect web data at large scales. It provides an integrated development environment (IDE) that enables the automation of any data-collecting procedure. This environment comes with a sizable number of pre-defined functions to make some of the typical data-collecting tasks easier.

Before writing our own custom collector for Rotten Tomatoes rating data, we should first understand some of the terminology surrounding these collectors.

Collector Stages: A "stage" is a particular phase of data collecting. When users wish to gather data from several places or want to carry out numerous activities while collecting data, stages can be added. Users may more easily control and monitor the development of their data-gathering efforts by segmenting the total data collection process into several stages. For instance, a YouTube data collector may contain steps like searching, playing each video that comes up in the search results, and retrieving all of the comments.

Interaction Code: In Bright Data IDE, the "interaction code" refers to the code that is executed during the data collection process to interact with the web page. It is a code written in JavaScript that runs within the context of the web page being scraped. The interaction code can be used to perform a wide range of actions, such as clicking buttons, scrolling through the page, and filling out forms.

Parser Code: This contains the logic for parsing the HTML produced by the interactions in the interaction code. The parser code is invoked by calling the parse() method from the interaction code.

And that is all we need to know to get started with writing our first collector!

Collector For Rotten Tomatoes

Having understood the theory, let us now move on to the practical.

We will be using data from Rotten Tomatoes’ all-time lists. These lists contain a large number of movie ratings (~2,000) across various genres.

So this is how we will structure stages for our collector:

  • The first stage would be responsible for opening various lists from the all-time lists page.
  • The second stage would be responsible for collecting movie names, ratings, and years of release.

Stage 1: Opening Movie List

According to the description above, this stage should be able to navigate to the all-time lists page, capture the URLs of all the lists on the page, and pass them on to the next stage one by one.

This is what the interaction code for this stage looks like.

let url = new URL('https://editorial.rottentomatoes.com/all-time-lists/') // (1)
navigate(url) // (2) 
let hrefArray1 = parse().hrefArray; // (3) 
hrefArray1.forEach(href => next_stage({  // (4)
  url: href
}))

// Repeat the same steps for the second page of the all-time lists
let url2 = new URL('https://editorial.rottentomatoes.com/all-time-lists/?wpv_view_count=1773-CATTR9a61a7f98d435e1c32de073e05574776&wpv_paged=2');
navigate(url2);
let hrefArray2 = parse().hrefArray;
hrefArray2.forEach(href => next_stage({
  url: href
}))

Let us understand this code, line-by-line.

  1. We create a new URL object. This class is provided by default in the IDE by Bright Data. We provide the URL for the all-time list as a parameter.
  2. We navigate to this URL using another function provided. This is essentially equivalent to opening a web page.
  3. We next call the parser code that will parse the data and return a list of URLs. These URLs correspond to different lists on the page.
  4. We then call the next stage for each of the lists. Yet again, IDE provides us with the next_stage method to send inputs to the next stage.

We repeat these same steps for another URL that corresponds to the second page of the all-time list.

Next, let us look at the parser code for this stage.

var hrefArray = []; // create an empty array to store href values

$("div.col-sm-8.newsItem.col-full-xs a.unstyled.articleLink").each(function() {
  var href = $(this).attr("href"); // get the href attribute value of the current 'a' tag
  hrefArray.push(href); // push the href value to the array
});

return {hrefArray};

The parser logic is pretty simple. We have written a selector for selecting all the URLs from the different lists available on the page.

Stage 2: Collecting Movie Data

The second stage receives a URL as input. This URL refers to the web page containing a list of movies. The responsibility of this stage should be to navigate to the provided URL and collect all the movie data available on that page.

Let us look at the interaction code for it.

navigate(input.url);
let {movieData} = parse();
movieData.forEach(movie => collect(movie))

As described, we navigate to the provided URL and call the parser. The parser returns a list of movie data. We call the collect method to “collect“ that data!

This is what the parser code looks like.

let movieData = []
$('div[id^="row-index-"]').each(function() {
  let posterSrc = $(this).find("div > a.article_movie_poster > div > img").attr("src");
  let movieName = $(this).find('div.article_movie_title div > h2 > a').text();
  let movieYear = $(this).find('span.subtle.start-year').text().slice(1, -1);
  let movieRating = $(this).find('span.tMeterScore').text();
  movieData.push({posterSrc, movieName, movieYear, movieRating})
});

return {movieData}

A quick look at the parser code reveals that we are selecting four properties for each of the movies on the page. We then create a list of all such movies and return that from the parser.

Great! We just finished writing our custom collector! To test that everything is working, click on the preview button (as highlighted in the image below).

Once the preview is successful, you can click on the Save button on the top-right. Our collector is now saved and ready to use.

Running our Collector

Select the MyScrapers tab from the Datasets & WebScraper IDE menu. Your recently created collector will be visible here.

Click on the three dots towards the left of your collector and select initiate manually.

Click ‘Start’ on the ‘Initiate manually’ menu. This will start the collector. Wait for the process to finish.

Once the run is finished, we can download the JSON file under Actions. Check that this file contains roughly 2,000 movies!

Rename the downloaded file to rtm.json and place it under the static folder of the recommender Django app. This folder should already contain the IMDB data file and the CSS file for our application.

With all this, the data for our application is ready, and we can begin updating our application!

Adding Extra Fields to Our Django Form

Continuing from the form we created in the last part, we will add three new fields to our form. These are the fields:

(1) Minimum Rotten Tomatoes Rating: Sets a threshold for the minimum Rotten Tomatoes rating.

(2) Include Empty IMDB Ratings: A checkbox to indicate whether we want to include movies for which Rotten Tomatoes ratings exist and meet the threshold, but IMDB ratings are not present.

(3) Include Empty Rotten Tomatoes Ratings: A checkbox to indicate whether we want to include movies for which IMDB ratings exist and meet the threshold, but Rotten Tomatoes ratings are not present.

Here is what the updated class looks like.

from django import forms

class MovieSearchForm(forms.Form):
    imdb_rating = forms.DecimalField(min_value=0.0, max_value=9.9, label='Minimum IMDB Rating')
    rotten_tomato_rating = forms.DecimalField(min_value=0.0, max_value=9.9, label='Minimum Rotten Tomato Rating')
    include_empty_rotten_tomato_ratings = forms.BooleanField(required=False)  # required=False so an unchecked box still validates
    include_empty_imdb_ratings = forms.BooleanField(required=False)
    release_year = forms.IntegerField(required=True)

With this change, you should be able to see the updated form on the webpage!
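
If you want to verify that the checkboxes behave as expected, a quick check in the Django shell (python manage.py shell) helps: an unchecked checkbox never appears in the submitted data, which is why required=False is needed on those fields. The import path and values below are assumptions for illustration.

from recommender.forms import MovieSearchForm  # adjust to wherever your form is defined

# Simulate a submission in which both checkboxes were left unchecked.
form = MovieSearchForm({
    'imdb_rating': '7.0',
    'rotten_tomato_rating': '8.0',
    'release_year': '1994',
})
print(form.is_valid())                                  # True
print(form.cleaned_data['include_empty_imdb_ratings'])  # False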

Updating the View

With both the data and the form in place, what remains is to update our view so that recommendations can be made considering the Rotten Tomato ratings as well.

Let us first write a method that can read the static JSON file and create a list of Python objects representing the Rotten Tomatoes rating data.

# recommender/views.py

import json

ROTTEN_TOMATO_DATA_FILE_PATH = './recommender/static/rtm.json'

class RottenTomatoMovieData:
    def __init__(self, movie_title, rating, release_year) -> None:
        self.movie_title = movie_title
        try: 
            self.rating = float(rating.strip('%')) / 10
        except ValueError:
            self.rating = 0
        try:
            self.release_year = int(release_year)
        except ValueError:
            self.release_year = None

def _getRottenTomatoMovieDataList():
    rotten_tomato_movie_list = []
    data_file = open(ROTTEN_TOMATO_DATA_FILE_PATH, encoding="utf8")
    json_data = json.load(data_file)
    for movie_json_data in json_data:
        if 'movieName' not in movie_json_data:
            continue
        rotten_tomato_movie_data = RottenTomatoMovieData(movie_json_data['movieName'], movie_json_data['movieRating'], movie_json_data['movieYear'])
        rotten_tomato_movie_list.append(rotten_tomato_movie_data)
    data_file.close()
    return rotten_tomato_movie_list

We created a new class, RottenTomatoMovieData, to represent the rating data obtained from Rotten Tomatoes. The parsing code is pretty simple: we read the JSON file and convert each JSON object, one by one, into a RottenTomatoMovieData instance.
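
To make the conversion concrete, here is what one hypothetical record from rtm.json could look like, given the fields our Stage 2 parser collects, and what RottenTomatoMovieData does with it (the values are made up for illustration):

# A hypothetical record from rtm.json; the fields match what the Stage 2
# parser collects, but the values are made up for illustration.
sample = {
    "posterSrc": "https://example.com/poster.jpg",
    "movieName": "Casablanca",
    "movieYear": "1942",
    "movieRating": "99%",
}

movie = RottenTomatoMovieData(sample["movieName"], sample["movieRating"], sample["movieYear"])
print(movie.rating)        # 9.9 -- the '%' is stripped and the value divided by 10
print(movie.release_year)  # 1942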

The next thing we need to do is filter this list based on the values that the user entered in the form. Let's create another function that accepts the form data and the list of Rotten Tomatoes movie ratings and returns the filtered list.

# recommender/views.py

def _filter_rotten_tomato_movie_list(rotten_tomato_movie_list, movie_search_form: MovieSearchForm):
    filtered_list = []
    for movie in rotten_tomato_movie_list:
        if movie_search_form.cleaned_data['rotten_tomato_rating'] and movie.rating < movie_search_form.cleaned_data['rotten_tomato_rating']:
            continue
        if movie.release_year and movie.release_year != movie_search_form.cleaned_data['release_year']:
            continue
        filtered_list.append(movie)
    return filtered_list

Again, the filtering logic is straightforward. We iterate over the list of movies and filter out any movie whose rating does not meet the threshold or whose release year differs from the requested one.

Merging Movie Data

From Part 1, we already have a list of movies built from the IMDB rating data; this list was used to present the results on our website. Using the Rotten Tomatoes ratings, we have now produced another list of movies. These two lists must be combined into a single list that can be used to display results on the results page.

# recommender/views.py

from collections import defaultdict  # needed for the name-and-year lookup map below

class MovieData:
    def __init__(self, movie_title, imdb_rating, rtm_rating, release_year):
        self.movie_title = movie_title
        self.imdb_rating = imdb_rating
        self.rtm_rating = rtm_rating
        self.release_year = release_year
    def __str__(self) -> str:
        return f"{self.movie_title} {self.imdb_rating} {self.rtm_rating} {self.release_year}"

def _merge_movie_lists(imdb_movie_list, rtm_movie_list, form: MovieSearchForm):
    merged_movie_list = []
    movie_name_movie_year_to_movie_data = defaultdict(lambda: {})
    for imdb_movie in imdb_movie_list:
        movie_name = imdb_movie.movie_title
        imdb_rating = imdb_movie.rating
        rtm_rating = None
        release_year = imdb_movie.release_date.year
        movie_name_movie_year_to_movie_data[movie_name.lower()][release_year] = MovieData(movie_name, imdb_rating, rtm_rating, release_year)
    for rtm_movie in rtm_movie_list:
        movie_name = rtm_movie.movie_title
        rtm_rating = rtm_movie.rating
        imdb_rating = None
        release_year = rtm_movie.release_year
        existing_object = movie_name_movie_year_to_movie_data[movie_name.lower()].get(release_year, None)
        if not existing_object:
            movie_name_movie_year_to_movie_data[movie_name.lower()][release_year] = MovieData(movie_name, imdb_rating, rtm_rating, release_year)
        else:
            existing_object.rtm_rating = rtm_rating
            movie_name_movie_year_to_movie_data[movie_name.lower()][release_year] = existing_object
    for movie_year_to_movie_data in movie_name_movie_year_to_movie_data.values():
        for movie_data in movie_year_to_movie_data.values():
            if not form.cleaned_data['include_empty_rotten_tomato_ratings'] and movie_data.rtm_rating is None:
                continue
            if not form.cleaned_data['include_empty_imdb_ratings'] and movie_data.imdb_rating is None:
                continue
            merged_movie_list.append(movie_data)
    return merged_movie_list

This code might look overwhelming at first sight! But let us break it down to understand it better.

(1) We first create a map from the movie name and movie release year to actual movie data.

(2) We fill this map for IMDB movie data first.

(3) Next, we try to fill this map for Rotten Tomato movie data.

(4) Once the map is populated with data from both lists, we filter the entries in this map based on the values provided in the form by the user.

(5) We finally return the merged list.
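
For completeness, here is a minimal sketch of how the view handling the search form might wire all of these helpers together. The IMDB-side helper names, the view name, and the template names are assumptions; adapt them to match the code you wrote in Part 1.

# recommender/views.py

from django.shortcuts import render

def recommend_movies(request):
    if request.method == 'POST':
        form = MovieSearchForm(request.POST)
        if form.is_valid():
            # Filter the IMDB list exactly as in Part 1 (helper names assumed).
            imdb_movies = _filter_imdb_movie_list(_getIMDBMovieDataList(), form)
            # Filter the Rotten Tomatoes list with the new helper.
            rtm_movies = _filter_rotten_tomato_movie_list(_getRottenTomatoMovieDataList(), form)
            # Merge both lists, honouring the "include empty ratings" checkboxes.
            movies = _merge_movie_lists(imdb_movies, rtm_movies, form)
            return render(request, 'recommender/results.html', {'movies': movies})
    else:
        form = MovieSearchForm()
    return render(request, 'recommender/search.html', {'form': form})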

Final Result

Okay… with all that work done, what do we have?

We can try searching for movies based on two different ratings, and our app WORKS!

With the methods described in this article, you can add many more such ratings. It is all up to the reader's imagination!

Code For This Series

All code created as a part of this series will be available on this GitHub repository. The final result of this part is contained in this branch named Part2.

It is always a good idea to cross-check your code against the code in the repo in case something goes wrong.


Concluding Remarks

With this, we come to the end of the second part of this three-part series. In the next part, we will automate the process of data delivery so that our application stays up to date with all the new data.

See you next time… Till then, Happy Learning! 🙂

