Predicting IMDb Ratings of New Movies
Introduction
As part of the MGSC 661: Multivariate Statistics course in my Fall term, I worked on a project to predict IMDb ratings of new movies based on their features.
It was a fun project to work on and it gave me a chance to apply the concepts I learned in the course around linear regression to a real-world dataset.
Data
The dataset used for this project was provided by the course professor and contained roughly 2,000 movies with five types of features:
- Identifiers
- Rating
- Movie characteristics
- Cast characteristics
- Production characteristics
Exploratory Data Analysis
I started off by exploring the dataset and looking at the distribution of the features.
The distribution of the target variable, imdb_score, is shown below:
From the above, we can see that most of the movies in the dataset have a rating between 5 and 8 on IMDb.
I also looked at the distributions of several other features and at the bivariate relationships between those features and the target variable.
I will show the plots first to let you go through them and then discuss the insights I found.
- There does not appear to be a significant relationship between movie budget and IMDb rating. However, apart from a few outliers, higher-budget movies rarely receive low ratings.
- Most movies have a duration between 90 and 150 minutes. Longer-running movies tend to have higher ratings, but only up to a certain point, suggesting a 'sweet spot' for duration.
- The distribution of ratings by genre is similar for most genres, except for the "Drama" genre, which has a higher median rating.
- The number of movies released is highest in the months of January and October, coinciding with the holiday season.
- Among the top 10 distributors, the distribution of ratings seems fairly consistent, except for "Miramax", which has a higher median rating.
Feature Engineering
I created some new features based on the existing features in the dataset:
- plot_keywords: Created dummy variables for the top 10 keywords in the dataset
- distributors: Created dummy variables for the top 5 distributors in the dataset
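The dummy-variable step described above could look roughly like the following. This is a minimal sketch, not the project's actual code: the helper `top_n_dummies`, the `kw_` prefix, and the toy keyword lists are all hypothetical, and it assumes each movie comes with a list of keyword labels.

```python
from collections import Counter

def top_n_dummies(rows, n, prefix="kw_"):
    """Build 0/1 indicator columns for the n most frequent labels.

    rows: one list of labels per movie (hypothetical input layout).
    Returns the chosen labels and one dict of indicators per movie.
    """
    # Count each label at most once per movie, preserving input order.
    counts = Counter(label for labels in rows for label in dict.fromkeys(labels))
    top = [label for label, _ in counts.most_common(n)]
    dummies = [
        {f"{prefix}{label}": int(label in labels) for label in top}
        for labels in rows
    ]
    return top, dummies

# Toy data for illustration only (not from the project's dataset).
movies = [
    ["love", "friend"],
    ["murder", "love"],
    ["friend", "love", "police"],
    ["robot"],
]
top, dummies = top_n_dummies(movies, n=2)
print(top)
print(dummies[1])
```

Restricting the indicators to the most frequent labels keeps the number of predictors manageable; every label outside the top n is implicitly lumped into an "other" baseline.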
Modeling
For the modeling part, the first step was to run individual linear regression models for each feature and to address model issues such as heteroscedasticity, non-linearity, multicollinearity, and outliers.
With that done, I ran three different models:
- Model 1: Multiple linear regression with all relevant features
- Model 2: Polynomial regression with all relevant features and non-linear predictors
- Model 3: Spline regression with all relevant features and non-linear predictors
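To make the difference between the three model types concrete, here is a minimal sketch on a single synthetic predictor. It is not the project's code: the data is simulated, the function names are hypothetical, and the spline uses a simple degree-1 truncated power basis with one assumed knot, whereas the actual project may well have used a different spline basis and many more predictors.

```python
import numpy as np

def linear_design(x):
    # Columns: intercept, x
    return np.column_stack([np.ones_like(x), x])

def polynomial_design(x, degree=2):
    # Columns: intercept, x, x^2, ..., x^degree
    return np.column_stack([x ** d for d in range(degree + 1)])

def linear_spline_design(x, knots):
    # Degree-1 truncated power basis: intercept, x, one hinge term per knot
    hinges = [np.maximum(x - k, 0.0) for k in knots]
    return np.column_stack([np.ones_like(x), x, *hinges])

def fit_ols(X, y):
    # Ordinary least squares; returns coefficients and in-sample MSE
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    mse = float(np.mean((y - X @ beta) ** 2))
    return beta, mse

# Synthetic data (an assumption for illustration): duration in hours,
# with a rating that rises and then flattens out.
rng = np.random.default_rng(0)
hours = rng.uniform(1.0, 3.5, size=300)
score = 3.0 + 3.0 * hours - 0.6 * hours ** 2 + rng.normal(0.0, 0.5, size=300)

_, mse_linear = fit_ols(linear_design(hours), score)
_, mse_poly = fit_ols(polynomial_design(hours, degree=2), score)
_, mse_spline = fit_ols(linear_spline_design(hours, knots=[2.0]), score)
print(mse_linear, mse_poly, mse_spline)
```

Because the polynomial and spline designs both contain the linear model's columns, their in-sample MSE can never be worse than the linear model's. That is why a fair comparison of such models usually relies on held-out or cross-validated error.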
The performance of the three models is shown below:
Metric | Linear | Polynomial | Spline |
MSE | 0.76 | 0.68 | 0.67 |
RMSE | 0.87 | 0.83 | 0.82 |
R2 | 0.38 | 0.45 | 0.46 |
Number of predictors | 52 | 60 | 61 |
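The spline regression model performed the best, with a lower MSE and RMSE and a higher R2 value than the other two models, although its margin over the polynomial model is small (an MSE of 0.67 versus 0.68). In terms of interpretability, the polynomial regression model offered the better trade-off: it is nearly as accurate as the spline model while using fewer predictors (60 versus 61). The linear model has the fewest predictors (52) but fits noticeably worse. Hence, I chose the polynomial regression model as the final model.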
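For reference, the three metrics in the table can be computed as in the sketch below. The function name and the toy numbers are hypothetical, and the post does not say whether the reported values were measured in-sample or on held-out data, so this only illustrates the formulas.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R2 for paired actual/predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1.0 - ss_res / ss_tot}

# Toy ratings for illustration only (not from the project's dataset).
actual = [6.0, 7.0, 8.0, 5.0]
predicted = [6.5, 6.5, 7.5, 5.5]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

RMSE is in the same units as the rating itself, so an RMSE of about 0.8 means predictions are typically off by a little under one IMDb point.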
Predictions
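The predicted ratings for the test set are shown below. Note that three of the spline model's predictions exceed 10, the maximum possible IMDb rating, whereas the linear and polynomial predictions all stay within the valid range: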
# | Movie Title | Release Date | Linear | Polynomial | Spline
1 | Pencils vs Pixels | 2023-11-07 | 6.19 | 5.87 | 3.44 |
2 | The Dirty South | 2023-11-10 | 6.45 | 5.41 | 3.40 |
3 | The Marvels | 2023-11-10 | 6.82 | 7.08 | 9.41 |
4 | The Holdovers | 2023-11-10 | 7.22 | 7.66 | 8.70 |
5 | Next Goal Wins | 2023-11-17 | 6.78 | 7.43 | 8.59 |
6 | Thanksgiving | 2023-11-17 | 6.31 | 7.13 | 9.66 |
7 | The Hunger Games: The Ballad of Songbirds and Snakes | 2023-11-17 | 7.02 | 7.90 | 10.38 |
8 | Trolls Band Together | 2023-11-17 | 7.01 | 7.71 | 8.88 |
9 | Leo | 2023-11-21 | 6.92 | 6.41 | 5.08 |
10 | Dream Scenario | 2023-11-22 | 6.43 | 7.10 | 8.28 |
11 | Wish | 2023-11-22 | 6.86 | 7.83 | 10.40 |
12 | Napoleon | 2023-11-22 | 7.57 | 8.32 | 10.52 |
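Since IMDb ratings are bounded between 1 and 10, a quick sanity check on the predictions is to flag any that fall outside that range. The sketch below does this for the spline column of the table above. The helper names are hypothetical, and clipping is shown only as one possible remedy, not something the project is known to have done.

```python
RATING_MIN, RATING_MAX = 1.0, 10.0  # valid IMDb rating scale

# Spline predictions copied from the table above.
spline_predictions = {
    "Pencils vs Pixels": 3.44,
    "The Dirty South": 3.40,
    "The Marvels": 9.41,
    "The Holdovers": 8.70,
    "Next Goal Wins": 8.59,
    "Thanksgiving": 9.66,
    "The Hunger Games: The Ballad of Songbirds and Snakes": 10.38,
    "Trolls Band Together": 8.88,
    "Leo": 5.08,
    "Dream Scenario": 8.28,
    "Wish": 10.40,
    "Napoleon": 10.52,
}

def out_of_range(predictions):
    # Keep only predictions that fall outside the valid rating scale.
    return {
        title: value
        for title, value in predictions.items()
        if not RATING_MIN <= value <= RATING_MAX
    }

def clip_rating(value):
    # One possible remedy (an assumption): clamp to the valid scale.
    return min(max(value, RATING_MIN), RATING_MAX)

flagged = out_of_range(spline_predictions)
print(sorted(flagged))
```

Out-of-range predictions like these are a common symptom of a flexible model extrapolating beyond the data it was fitted on.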
Conclusion
In this project, I built a model using regression techniques to predict IMDb ratings of new movies based on their features, and analyzed the importance of each feature in determining the rating. I also used the model to predict the ratings of 12 new movies to be released in November 2023.
Based on the model, the top 3 features that determine the IMDb rating of a movie are:
- Duration: Audiences appreciate a well-paced, substantial movie, but there is a point of diminishing returns. Longer movies initially receive higher IMDb scores, but this advantage tapers off for exceedingly long durations. Aim for a ‘Goldilocks’ duration that’s just right.
- Movie budget: Contrary to popular belief, throwing money at a project doesn’t necessarily make it better. Once the other features are accounted for, the model indicates that higher budgets are actually associated with slightly lower IMDb scores. This suggests a focus on resourceful filmmaking could be more beneficial.
- Number of news articles: The number of news articles about a movie shows a quadratic relationship with IMDb scores. Initial media coverage boosts ratings, but the effect plateaus. A well-planned PR strategy that avoids overexposure could be the key.
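The 'diminishing returns' pattern described for duration and news coverage is what a quadratic term with a negative coefficient produces. As a hedged illustration, for a fitted curve of the form score = b0 + b1 * x + b2 * x^2 with b2 < 0, the predicted score peaks at x = -b1 / (2 * b2). The coefficients below are made up for the example and are not the model's actual estimates.

```python
def quadratic_peak(b1, b2):
    """Location of the maximum of b0 + b1*x + b2*x**2 (requires b2 < 0)."""
    if b2 >= 0:
        raise ValueError("no interior maximum unless the quadratic coefficient is negative")
    return -b1 / (2.0 * b2)

# Hypothetical coefficients for illustration only.
b1, b2 = 0.04, -0.0002
peak = quadratic_peak(b1, b2)
print(round(peak, 6))
```

Beyond the peak, a pure quadratic predicts declining scores rather than a plateau, so it should be read with care at the extremes of the data.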
Code
The code for this project can be found on GitHub.
Thanks for reading! If you have any questions or feedback, please feel free to reach out to me on Twitter. 👋