How to Win a Kaggle Competition: Box Office Prediction Competition

Introduction

The film industry is in a constant growth trend. The global box office was worth $41.7 billion in 2018. Hollywood has the world’s largest box office revenue, with 2.6 billion tickets sold and around 2000 films produced annually. One of the main interests of film studios and related stakeholders is predicting the revenue that a new movie can generate from a few given input attributes.

Background

Starting in 1929, during the Great Depression and the Golden Age of Hollywood, an insight began to evolve about the consumption of movie tickets: even in that bad economic period, the film industry kept growing. The phenomenon repeated in the 2008 recession.

The primary goal is to build a machine-learning model that predicts the revenue of a new movie from features such as cast, crew, keywords, budget, release dates, languages, production companies, and countries. EDA was the first step, followed by an initial linear model that is compared with other models at the end of the process. The data covers 7398 movies collected from The Movie Database (TMDB) as part of the kaggle.com Box Office Prediction Competition. A train/test split is provided for building and evaluating the model.

The Challenge

Consumer behaviours have changed over the years: the MeToo movement, as well as other social developments, have surfaced in our society, and that is reflected in movie scripts. However, some of the preferences that were relevant 50 years ago are still relevant today. An analysis based on the last few decades of movie production is therefore appropriate and can serve any stakeholder interested in predicting the revenue of a new movie.
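The listings in this article use two DataFrames named train and test without showing how they are created. A minimal sketch of that step, under stated assumptions, follows: the helper name load_sets and the tiny in-memory tables are made up for illustration, and the real competition files are assumed to be ordinary CSV files readable with pandas.

```python
import io

import pandas as pd


def load_sets(train_src, test_src):
    """Read the train and test tables from paths or file-like objects."""
    train = pd.read_csv(train_src)
    test = pd.read_csv(test_src)
    return train, test


# Made-up stand-ins for the real files: only the train table carries 'revenue'.
train_csv = io.StringIO("id,budget,revenue\n1,1000,5000\n2,0,300\n")
test_csv = io.StringIO("id,budget\n3,2000\n")

train, test = load_sets(train_csv, test_csv)
print(train.shape, test.shape)
```

With the downloaded data, the same function would be called with the paths of the two CSV files instead of the in-memory buffers.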
The packages that I used in this exercise:

    import warnings
    warnings.filterwarnings('ignore')
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline  # Jupyter magic: render plots inside the notebook
    import pandas as pd
    from scipy import stats, special
    from sklearn import model_selection, metrics, linear_model, datasets, feature_selection
    from sklearn import neighbors
    from sklearn.preprocessing import StandardScaler
    import time
    from scipy import io
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, KFold
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import cross_validate
    from sklearn.preprocessing import LabelEncoder
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    import json
    import seaborn as sns
    import ast

EDA

Pictures are the best way to present the first findings from the dataset. Begin the exploration with a scatter plot of ‘Revenue vs Budget’ to view the upper-end data points:

    plt.figure(figsize=(8, 6))
    plt.scatter(train['budget'], train['revenue'])
    plt.title('Revenue vs Budget')
    plt.xlabel('Budget')
    plt.ylabel('Revenue')
    plt.show()

Continue with a scatter plot of the log10 values, which brings out the lower-end points. The plot of ‘Revenue vs Budget’ changes:

    plt.figure(figsize=(8, 6))
    plt.scatter(np.log10(train['budget']), np.log10(train['revenue']))
    plt.title('Revenue vs Budget')
    plt.xlabel('Budget [log10]')
    plt.ylabel('Revenue [log10]')
    plt.show()

Next, check the scatter plot of ‘Revenue vs Popularity’:
    plt.figure(figsize=(8, 6))
    plt.scatter(np.log10(train['popularity']), np.log10(train['revenue']))
    plt.title('Revenue vs popularity')
    plt.xlabel('Popularity [log]')
    plt.ylabel('Revenue [log]')
    plt.show()

Comparing the movies with the biggest budget values:

    train.sort_values('budget', ascending=False).head(10).plot(
        x='original_title', y='budget', kind='barh')
    plt.xlabel('Budget [USD]');

Comparing the movies with the biggest revenue values:

    train.sort_values('revenue', ascending=False).head(10).plot(
        x='original_title', y='revenue', kind='barh')
    plt.xlabel('Revenue [USD]');

Comparing the movies with the biggest profit values (profit is revenue minus budget):

    train.assign(profit=lambda df: df['revenue'] - df['budget']).sort_values(
        'profit', ascending=False).head(10).plot(
        x='original_title', y='profit', kind='barh')
    plt.xlabel('Profit [USD]');

Moving ahead to explore the mean revenue by ‘genres’:

    train.groupby('genres')['revenue'].mean().sort_values().plot(kind='barh')
    plt.xlabel('Revenue [USD]');

The column ‘belongs_to_collection’ was converted to a True/False column named ‘collection’ that states whether a movie belongs to a collection of movies. A simple box plot reveals that movies that belong to a collection benefit from higher revenue: both the median and the interquartile range (25th to 75th percentile) of the orange (right) box are higher.
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.set_yscale('symlog')
    sns.boxplot(x='collection', y='revenue', data=train, ax=ax);

Define a function (named ‘parse_json’) that parses the first ‘name’ value from a column whose cells hold a list of dictionaries:

    def parse_json(x):
        try:
            # The cells use single quotes, so swap them before json.loads
            return json.loads(x.replace("'", '"'))[0]['name']
        except Exception:
            # Missing or malformed cells fall back to an empty string
            return ''

Applying ‘parse_json’ to the ‘production_companies’ column leaves the first company name for each movie. Visualizing the production companies with the highest mean revenue yields the plot:

    train['production_companies'] = train['production_companies'].apply(parse_json)
    train.groupby('production_companies')['revenue'].mean().sort_values(
        ascending=False).head(20).plot(kind='barh')
    plt.xlabel('Revenue [USD]');

Data Preparation

Start with a sentiment analysis of the columns ‘overview’ and ‘tagline’, which contain a short verbal overview of the movie and its tagline. I used the vaderSentiment package and its ‘compound’ value to explore the question: is the sentiment score correlated with the revenue column?

    # Use SentimentIntensityAnalyzer from the vaderSentiment package
    # to score the sentiment of each film's 'overview' and 'tagline'
    analyser = SentimentIntensityAnalyzer()
    # Fill the NaN values in 'overview' and 'tagline'
    # with an empty string ('') before computing the analyser scores
    train['overview'] = train['overview'].fillna('')
    train['tagline'] = train['tagline'].fillna('')
    # Assumed step (missing from the original listing): the 'compound' scores
    train['sentiment'] = train['overview'].apply(
        lambda t: analyser.polarity_scores(t)['compound'])
    train['tag_sentiment'] = train['tagline'].apply(
        lambda t: analyser.polarity_scores(t)['compound'])
    # There is (almost) no correlation between the 'compound' value
    # (a composite sentiment score) of 'overview' and 'tagline' and the revenue
    train[['tag_sentiment', 'sentiment']].corrwith(train['revenue'])

Continue with a helper function that converts data given as a string into a list; for example, it turns the string ‘[1,2,3,4]’ into the list [1,2,3,4].
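The string-to-list conversion described above can be tried in isolation with ast.literal_eval from the standard library, which evaluates Python literals only and does not run arbitrary code. This is a small sketch; the sample cell below is made up in the shape of a list-of-dictionaries column.

```python
import ast

# Made-up sample in the shape of a 'genres' cell (a list of dicts stored as text).
raw = "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]"

parsed = ast.literal_eval(raw)            # str -> list of dicts
names = [entry['name'] for entry in parsed]

print(type(parsed).__name__, names)
print(ast.literal_eval('[1,2,3,4]'))      # the example from the text
```

Because the cells are Python literals with single quotes, literal_eval reads them directly, with no need to swap quote characters first.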
    # Helper function to parse text and convert given strings to lists
    def text_to_list(x):
        if pd.isna(x):
            return ''
        else:
            return ast.literal_eval(x)

The next step is to combine the train and test sets into one combined set. All the preparations are done on the combined set, which is split again later.

    # Row counts are needed later to split the combined set back
    # (these two definitions are missing from the original listing)
    ntrain = train.shape[0]
    ntest = test.shape[0]
    combined = pd.concat((train, test), sort=False)

Drop the columns that will not contribute to predicting the revenue:

    combined.drop(columns=['id', 'imdb_id', 'poster_path', 'title', 'original_title'],
                  inplace=True)

In preparation for the parsing step, apply the ‘text_to_list’ function to the relevant columns:

    for col in ['genres', 'production_companies', 'production_countries', \
                'spoken_languages', 'Keywords', 'cast', 'crew']:
        combined[col] = combined[col].apply(text_to_list)

Convert the ‘belongs_to_collection’ column to a zero/one column. Every cell that holds a value (the movie belongs to a collection) becomes 1, and every NaN (the movie does not belong to a collection) becomes 0.

As a reminder, the sentiment analysis revealed no correlation between the ‘overview’ and ‘tagline’ columns and the ‘revenue’ column (our predicted column). Hence, we create a binary label for each movie that states whether it has a ‘tagline’, and likewise for ‘homepage’. The second step will be a new feature holding the character count of the overview.

    # Reconstructed from the description above; not in the original listing
    combined['belongs_to_collection'] = 1*(~combined['belongs_to_collection'].isna())
    combined['tagline'] = 1*(~combined['tagline'].isna())
    combined['homepage'] = 1*(~combined['homepage'].isna())

Create a new feature that holds the number of characters in each movie’s overview:
    # New feature: the number of characters in each movie's overview
    combined['overview'] = combined['overview'].str.len()
    # Any movie without an overview (NaN) is set to zero
    combined['overview'] = combined['overview'].fillna(0)

Use combined['overview'].head() to inspect the new feature.

Create a new feature that contains the number of genres for each movie. Then move on to parse the genre names from the ‘genres’ column. Some movies have more than one genre while others have no genre at all. For this purpose, a helper function named ‘parse_genre’ parses the first three genres of a movie (if they exist) into three new columns of the combined dataset: ‘genres1’, ‘genres2’, ‘genres3’.

    def parse_genre(x):
        # Missing values arrive as '' from text_to_list
        if type(x) == str:
            return pd.Series(['', '', ''], index=['genres1', 'genres2', 'genres3'])
        if len(x) == 1:
            return pd.Series([x[0]['name'], '', ''],
                             index=['genres1', 'genres2', 'genres3'])
        if len(x) == 2:
            return pd.Series([x[0]['name'], x[1]['name'], ''],
                             index=['genres1', 'genres2', 'genres3'])
        if len(x) > 2:
            return pd.Series([x[0]['name'], x[1]['name'], x[2]['name']],
                             index=['genres1', 'genres2', 'genres3'])

Apply the function to create 3 new columns and drop the original ‘genres’ column:

    # Same apply-and-drop pattern as the other parse_* helpers in this article
    combined[['genres1', 'genres2', 'genres3']] = \
        combined['genres'].apply(parse_genre)
    combined.drop(columns='genres', inplace=True)

Create a new column with the number of production companies related to each movie:

    combined['production_company_number'] = \
        combined['production_companies'].apply(lambda x: len(x))

Build a function to parse the production companies of a movie. A few movies have no production company value and some have more than one, so the function parses only the first 3 production companies (if they exist) into three new columns of the combined dataset: ‘prod1’, ‘prod2’, ‘prod3’.
    def parse_production_companies(x):
        # Missing values arrive as '' from text_to_list
        if type(x) == str:
            return pd.Series(['', '', ''], index=['prod1', 'prod2', 'prod3'])
        if len(x) == 1:
            return pd.Series([x[0]['name'], '', ''],
                             index=['prod1', 'prod2', 'prod3'])
        if len(x) == 2:
            return pd.Series([x[0]['name'], x[1]['name'], ''],
                             index=['prod1', 'prod2', 'prod3'])
        if len(x) > 2:
            return pd.Series([x[0]['name'], x[1]['name'], x[2]['name']],
                             index=['prod1', 'prod2', 'prod3'])

Apply the function to create 3 new columns and drop the original ‘production_companies’ column:

    # Same apply-and-drop pattern as the other parse_* helpers in this article
    combined[['prod1', 'prod2', 'prod3']] = \
        combined['production_companies'].apply(parse_production_companies)
    combined.drop(columns='production_companies', inplace=True)

Create a new column with the number of production countries related to each movie:

    combined['production_country_number'] = \
        combined['production_countries'].apply(lambda x: len(x))

A few movies have no production country value and some have more than one. A helper function parses the production countries of a movie. It takes only the first 3 production countries (if they exist) and creates three new columns in the combined dataset: ‘country1’, ‘country2’, ‘country3’.
    def parse_production_countries(x):
        # Missing values arrive as '' from text_to_list
        if type(x) == str:
            return pd.Series(['', '', ''], index=['country1', 'country2', 'country3'])
        if len(x) == 1:
            return pd.Series([x[0]['name'], '', ''],
                             index=['country1', 'country2', 'country3'])
        if len(x) == 2:
            return pd.Series([x[0]['name'], x[1]['name'], ''],
                             index=['country1', 'country2', 'country3'])
        if len(x) > 2:
            return pd.Series([x[0]['name'], x[1]['name'], x[2]['name']],
                             index=['country1', 'country2', 'country3'])

Apply the function to create 3 new columns and drop the original ‘production_countries’ column:

    combined[['country1', 'country2', 'country3']] = \
        combined['production_countries'].apply(parse_production_countries)
    combined.drop(columns='production_countries', inplace=True)

The ‘release_date’ column needs to be parsed and its NaN values filled, which is done with the following code:

    # Parse and break down the date column ('release_date' column).
    # Note: with '%y', two-digit years 00-68 are read as 2000-2068,
    # so releases before 1969 can end up with a future year.
    combined['release_date'] = pd.to_datetime(combined['release_date'], format='%m/%d/%y')
    # Parse 'weekday'
    combined['weekday'] = combined['release_date'].dt.weekday
    # Fill NaN in 'weekday' with the most common weekday value - 4
    combined['weekday'] = combined['weekday'].fillna(4)
    # Parse 'month'
    combined['month'] = combined['release_date'].dt.month
    # Fill NaN in 'month' with the most common month value - 9
    combined['month'] = combined['month'].fillna(9)
    # Parse 'year'
    combined['year'] = combined['release_date'].dt.year
    # Fill NaN in 'year' with the median value of the 'year' column
    combined['year'] = combined['year'].fillna(combined['year'].median())
    # Parse 'day'
    combined['day'] = combined['release_date'].dt.day
    # Fill NaN in 'day' with the most common day value - 1
    combined['day'] = combined['day'].fillna(1)
    # Drop the original 'release_date' column
    combined.drop(columns=['release_date'], inplace=True)

Fill the NaN values in the ‘runtime’ column with the median
value:

    combined['runtime'] = combined['runtime'].fillna(combined['runtime'].median())

Create a new column with the number of spoken languages for each movie:

    combined['spoken_languages_number'] = \
        combined['spoken_languages'].apply(lambda x: len(x))

A few movies have no spoken language value and some have more than one. A helper function parses the spoken languages of a movie. It takes only the first 3 spoken languages (if they exist) and creates three new columns in the combined dataset: ‘lang1’, ‘lang2’, ‘lang3’:

    def parse_spoken_languages(x):
        # Missing values arrive as '' from text_to_list
        if type(x) == str:
            return pd.Series(['', '', ''], index=['lang1', 'lang2', 'lang3'])
        if len(x) == 1:
            return pd.Series([x[0]['name'], '', ''],
                             index=['lang1', 'lang2', 'lang3'])
        if len(x) == 2:
            return pd.Series([x[0]['name'], x[1]['name'], ''],
                             index=['lang1', 'lang2', 'lang3'])
        if len(x) > 2:
            return pd.Series([x[0]['name'], x[1]['name'], x[2]['name']],
                             index=['lang1', 'lang2', 'lang3'])

Apply the function to create 3 new columns and drop the original ‘spoken_languages’ column:

    combined[['lang1', 'lang2', 'lang3']] = \
        combined['spoken_languages'].apply(parse_spoken_languages)
    combined.drop(columns='spoken_languages', inplace=True)

Most of the ‘status’ column values are ‘Released’; hence, the NaN values in this column are set to ‘Released’:

    combined['status'] = combined['status'].fillna('Released')

Create a new column with the number of keywords for each movie:

    combined['keywords_number'] = \
        combined['Keywords'].apply(lambda x: len(x))

A few movies have no keyword value and some have more than one. The helper function parses only the first 3 keywords (if they exist) and creates three new columns in the combined dataset: ‘key1’, ‘key2’, ‘key3’.
    def parse_keywords(x):
        # Missing values arrive as '' from text_to_list
        if type(x) == str:
            return pd.Series(['', '', ''], index=['key1', 'key2', 'key3'])
        if len(x) == 1:
            return pd.Series([x[0]['name'], '', ''],
                             index=['key1', 'key2', 'key3'])
        if len(x) == 2:
            return pd.Series([x[0]['name'], x[1]['name'], ''],
                             index=['key1', 'key2', 'key3'])
        if len(x) > 2:
            return pd.Series([x[0]['name'], x[1]['name'], x[2]['name']],
                             index=['key1', 'key2', 'key3'])

Apply the function to create 3 new columns and drop the original ‘Keywords’ column:

    combined[['key1', 'key2', 'key3']] = \
        combined['Keywords'].apply(parse_keywords)
    combined.drop(columns='Keywords', inplace=True)

Create 3 new features that count the cast members with gender values 0, 1, and 2 for each movie:

    combined['gender_0_number'] = combined['cast'].apply(
        lambda row: sum([x['gender'] == 0 for x in row]))
    combined['gender_1_number'] = combined['cast'].apply(
        lambda row: sum([x['gender'] == 1 for x in row]))
    combined['gender_2_number'] = combined['cast'].apply(
        lambda row: sum([x['gender'] == 2 for x in row]))

Use head() on one of the new columns to inspect a sample.

Create a new column with the number of cast members for each movie:

    combined['cast_number'] = \
        combined['cast'].apply(lambda x: len(x))

The next step parses the cast column.
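The helpers for genres, production companies, production countries, spoken languages, and keywords all share one shape: take up to three ‘name’ values and pad the rest with empty strings. A single generic helper could express that once. The sketch below is an illustration in plain Python, not the author's code; the name first_names and the sample values are made up.

```python
def first_names(items, n=3, fill=''):
    """Return the 'name' of the first n dicts in items, padded with fill.

    A missing cell arrives as '' (an empty string), which yields only padding.
    An empty list is handled the same way.
    """
    if isinstance(items, str):
        items = []
    names = [entry['name'] for entry in items[:n]]
    return names + [fill] * (n - len(names))


sample = [{'id': 1, 'name': 'Pixar'}, {'id': 2, 'name': 'Walt Disney Pictures'}]
print(first_names(sample))   # ['Pixar', 'Walt Disney Pictures', '']
print(first_names(''))       # ['', '', '']
print(first_names([]))       # ['', '', '']
```

Inside a pandas apply, the returned list could be wrapped as pd.Series(first_names(x), index=['key1', 'key2', 'key3']) to produce the same three columns as the dedicated helpers.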
Take the first five cast members, identified by their ‘id’ values, and create five new cast-related columns:

    def parse_cast(x):
        myindx = ['cast1', 'cast2', 'cast3', 'cast4', 'cast5']
        out = [-1]*5  # -1 marks a missing cast slot
        if type(x) != str:
            for i in range(min([5, len(x)])):
                out[i] = x[i]['id']
        return pd.Series(out, index=myindx)

Apply the function to create 5 new columns and drop the original ‘cast’ column:

    combined[['cast1', 'cast2', 'cast3', 'cast4', 'cast5']] = \
        combined['cast'].apply(parse_cast)
    combined.drop(columns='cast', inplace=True)

Create a new column with the number of crew members for each movie:

    combined['crew_number'] = \
        combined['crew'].apply(lambda x: len(x))

A function to parse the Director and Producer from the ‘crew’ column:

    def parse_crew(x):
        myindx = ['Director', 'Producer']
        out = [-1]*2  # -1 marks a missing director or producer
        if type(x) != str:
            for item in x:
                if item['job'] == 'Director':
                    out[0] = item['id']
                elif item['job'] == 'Producer':
                    out[1] = item['id']
        return pd.Series(out, index=myindx)

Apply the function to create 2 new columns and drop the original ‘crew’ column:

    combined[['Director', 'Producer']] = combined['crew'].apply(parse_crew)
    combined.drop(columns='crew', inplace=True)

Create two new features from the two numeric columns (‘budget’, ‘popularity’) using np.log1p, which calculates log(1 + x); a plain log would fail on zero values, since the log of zero does not exist. RandomForest and LightGBM models can use both the original and the log features without a conflict. Moreover, these two new features contribute to the models’ accuracy.

    combined['budget_log'] = np.log1p(combined['budget'])
    combined['pop_log'] = np.log1p(combined['popularity'])

Apply LabelEncoder to the five new groups of generated feature columns: fit each encoder first, then transform.
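Each group of three columns (for example ‘genres1’, ‘genres2’, ‘genres3’) gets one encoder fitted on the union of their values, so the same name maps to the same integer whichever column it appears in. The sketch below imitates that idea in plain Python; it is a stand-in for illustration, not scikit-learn's implementation, and the function name and sample rows are made up.

```python
def fit_shared_codes(rows):
    """Map every distinct value across all columns to one integer code.

    The vocabulary is sorted, mirroring how LabelEncoder orders its classes.
    """
    vocab = sorted({value for row in rows for value in row})
    return {value: code for code, value in enumerate(vocab)}


# Made-up rows in the shape of (genres1, genres2, genres3); '' marks no genre.
rows = [('Drama', 'Comedy', ''), ('Comedy', '', ''), ('Action', 'Drama', 'Comedy')]

codes = fit_shared_codes(rows)
encoded = [tuple(codes[value] for value in row) for row in rows]

print(codes)     # {'': 0, 'Action': 1, 'Comedy': 2, 'Drama': 3}
print(encoded)   # [(3, 2, 0), (2, 0, 0), (1, 3, 2)]
```

Fitting a separate encoder per column would instead give ‘Comedy’ a different code in each column, which hides the fact that it is the same genre.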
    # Encode each three-column group with one LabelEncoder shared by its columns
    groups = [['genres1', 'genres2', 'genres3'],
              ['prod1', 'prod2', 'prod3'],
              ['country1', 'country2', 'country3'],
              ['lang1', 'lang2', 'lang3'],
              ['key1', 'key2', 'key3']]
    for cols in groups:
        allitems = list(set(combined[cols].values.ravel().tolist()))
        labeler = LabelEncoder()
        labeler.fit(allitems)
        combined[cols] = combined[cols].apply(lambda x: labeler.transform(x))

Label-encode the two remaining categorical columns:

    combined_dummy = combined.copy()
    cat_col = combined.select_dtypes('object').columns
    combined_dummy[cat_col] = combined_dummy[cat_col].apply(
        lambda x: LabelEncoder().fit_transform(x))

Split the combined dataset back into train and test sets:

    train_data = combined_dummy.iloc[:ntrain]
    test_data = combined_dummy.iloc[-ntest:]

Another three steps of preparation:

    # Drop the 'revenue' column, it holds the values to predict
    X_train = train_data.drop(columns='revenue').values
    # The log transformation of the revenue gives better results, hence, we will use it
    y_train = np.log1p(train_data['revenue'].values)
    # Drop the 'revenue' column, it will be filled at the end when the model is ready
    X_test = test_data.drop(columns='revenue').values

Model Building

Start with a basic linear regression model.
    # Imports needed by the model listings (missing from the package list)
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_predict
    import lightgbm as lgb

    kf = KFold(n_splits=5, shuffle=True, random_state=123)
    lr = LinearRegression()
    y_pred = cross_val_predict(lr, X_train, y_train, cv=kf)
    # log1p of a revenue cannot be negative, so clip negative predictions
    y_pred[y_pred < 0] = 0

Continue with a random forest regression model (an improved result compared with the LinearRegression attempt).

    rf = RandomForestRegressor(max_depth=20, random_state=123, n_estimators=100)
    y_pred = cross_val_predict(rf, X_train, y_train, cv=kf)
    y_pred[y_pred < 0] = 0

View the feature importances of the random forest model in a bar plot, dropping the revenue column first.

    rf.fit(X_train, y_train)
    imp = pd.Series(rf.feature_importances_,
                    index=train_data.drop(columns='revenue').columns)
    imp.sort_values(ascending=False).plot(kind='barh', figsize=(8, 10))

Continue with an LGBMRegressor model (fast execution); the results improved compared with the RandomForestRegressor attempt. About the parameters: feature_fraction=0.4 means that each of the 1500 trees (n_estimators) is built on a random 40% of the features. max_depth=-1 sets no depth limit, but the tree size is still restricted by the number of leaves (20).

    lgb_model = lgb.LGBMRegressor(num_leaves=20, max_depth=-1, learning_rate=0.01,
                                  metrics='rmse', n_estimators=1500,
                                  feature_fraction=0.4)
    y_pred = cross_val_predict(lgb_model, X_train, y_train, cv=kf)

View the feature importances of the LGBMRegressor model in a bar plot, dropping the revenue column first. According to this model, the year is the most important feature in predicting the revenue, and that makes sense: as the years pass, revenue increases (across all industries). The next most important features according to this model are the production company, the budget, and the director. The choices of this model are relevant and lead to a better prediction outcome compared with the previous two models that I tried.
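The model listings produce cross-validated predictions but do not show how they are scored. Since the target is log1p(revenue), the root mean squared error of those predictions is a log-scale error, and np.expm1 (or math.expm1) converts a prediction back to a revenue. The following is a minimal sketch with the standard library only; the function names and the numbers are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def to_revenue(y_log):
    """Invert the log1p transform to get back to the revenue scale."""
    return [math.expm1(v) for v in y_log]


# Made-up revenues and a pretend model that is off by 0.5 on the log scale.
revenue = [1_000_000.0, 250_000.0, 0.0]
y_true = [math.log1p(r) for r in revenue]
y_pred = [v + 0.5 for v in y_true]

print(round(rmse(y_true, y_pred), 3))          # 0.5
print([round(r) for r in to_revenue(y_true)])
```

The same rmse function accepts the y_train and y_pred arrays from the listings, which makes the three models directly comparable on one number.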
    # Fit on the full training set and plot the LightGBM feature importances
    lgb_model.fit(X_train, y_train)
    imp = pd.Series(lgb_model.feature_importances_,
                    index=train_data.drop(columns='revenue').columns)
    imp.sort_values(ascending=False).plot(kind='barh', figsize=(8, 10))

License

I open-sourced this for all to use as an entry point to the competition. If you make progress and develop a better-performing model, please let me know; it will help me understand better and grow. Thank you.

This article, along with any associated source code and files, is licensed under GPL (GPLv3).