Correlating Netflix and IMDB Datasets

Over this Christmas break I correlated the Netflix dataset (17,700 movies) against the IMDB database. It was a fun, if fairly tedious, proof of concept for one of my courses. I got what I wanted and moved on, but perhaps someone else will find my results useful.

The idea was simple: I wanted to play with item-based collaborative filtering algorithms, and I needed some domain-specific knowledge to do so. Computing movie similarities solely on ratings is a poor strategy that can't capture much of the underlying relationships, hence I opted to bring IMDB into the picture. For each movie, I defined a 'feature space' which consisted of a union of:

Directors + Writers + Producers + Composers + Cast + Country + Genres + Year of Release (eleven 10-year intervals, from 1900 to 2005)

The algorithm itself was a simple two-pass scan over IMDB: the first pass collects the attribute values and assigns them unique identifiers, and the second pass constructs the feature vector for each movie. In the end, this technique produced close to 230,000 unique features, most of which were contributed by the cast. Plotting co-occurrence counts (how many times each feature appears across the entire space) gave me:
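The two-pass idea can be sketched roughly as follows. The tiny in-memory `movies` dict, the attribute names, and the `decade` bucketing are stand-ins of my own invention; the real scripts run against SQL imports of the IMDB data.

```python
# Two-pass feature extraction sketch: pass 1 assigns a unique id to every
# attribute value seen anywhere, pass 2 builds each movie's feature-id vector.
# 'movies' is a hypothetical stand-in for the real IMDB records.

movies = {
    1: {"director": ["Lang"], "cast": ["Abel", "Helm"], "genre": ["Sci-Fi"], "year": [1927]},
    2: {"director": ["Lang"], "cast": ["Lorre"], "genre": ["Thriller"], "year": [1931]},
}

def decade(year):
    # Bucket release years into 10-year intervals spanning 1900-2005.
    return "decade:%d" % (min(max(year, 1900), 2005) // 10 * 10)

def feature_key(field, value):
    # Namespace each value by its attribute so 'France the country'
    # and 'France the person' never collide.
    return decade(value) if field == "year" else "%s:%s" % (field, value)

# Pass 1: collect every attribute value and assign it a unique integer id.
feature_ids = {}
for attrs in movies.values():
    for field, values in attrs.items():
        for v in values:
            key = feature_key(field, v)
            if key not in feature_ids:
                feature_ids[key] = len(feature_ids)

# Pass 2: build the sorted feature-id vector for each movie.
vectors = {}
for movie_id, attrs in movies.items():
    vec = set()
    for field, values in attrs.items():
        for v in values:
            vec.add(feature_ids[feature_key(field, v)])
    vectors[movie_id] = sorted(vec)
```

Shared attributes (here, the director) map to the same id in both vectors, which is what later makes the vectors comparable.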

Of course, we don't have to keep all 230,000 features in our space. We can refine this semantic space by removing features with low co-occurrence counts. Infrequent features carry little mutual information (a single occurrence carries none; movie 'stars' count for more than 2-time appearances), hence we can safely prune our space without losing much underlying information. On top of that, the power-law relationship significantly reduces the size of the space every time we bump the threshold. It's almost too good to be true! A quick calculation shows:
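The pruning step itself is a one-liner over the co-occurrence counts. A minimal sketch, using made-up feature vectors in place of the real 230,000-feature space:

```python
from collections import Counter

# Hypothetical movie -> feature-id vectors standing in for the real data.
vectors = {
    1: [0, 1, 2],
    2: [0, 2, 3],
    3: [0, 4],
}

# Co-occurrence count: in how many vectors does each feature appear?
counts = Counter(f for vec in vectors.values() for f in vec)

def prune(vectors, counts, threshold):
    # Keep only features appearing in at least `threshold` movies.
    keep = {f for f, c in counts.items() if c >= threshold}
    return {m: [f for f in vec if f in keep] for m, vec in vectors.items()}

pruned = prune(vectors, counts, 2)
```

Bumping `threshold` from 2 to 25 is exactly the difference between the two downloadable maps below.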

Motivation for semantic enhancements

If you are not convinced of the usefulness of integrating semantic knowledge, consider these benefits:

  • Semantic attributes often provide additional clues about the underlying reasons for which a user may or may not be interested in a certain production.

  • In cases where little or no rating data is available, you can still use semantic attributes to provide reasonable recommendations.
  • Considering item-to-item similarities often allows you to work with a much smaller feature-space.
  • You can effectively address the problem of data-sparsity. More importantly, there are no missing values; a value of 0 has meaning in a semantic space. (ex: actor xyz did not appear in production #1234)
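To make the item-to-item point concrete, similarity over these binary feature vectors reduces to set arithmetic. A sketch with hypothetical feature-id sets (cosine similarity is one common choice; it is not the only option):

```python
import math

def cosine(a, b):
    # Cosine similarity between two binary feature-id sets:
    # |intersection| / sqrt(|a| * |b|). A feature id that is absent
    # simply contributes 0 -- there are no missing values.
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

movie_a = {1, 2, 3, 4}   # hypothetical feature ids for one movie
movie_b = {1, 5, 6}      # shares only feature 1 with movie_a

sim = cosine(movie_a, movie_b)
```

Note how the semantics of 0 do the work here: an actor id missing from both sets is a genuine (and informative) non-event, not a gap to be imputed.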

Kosher semantic enhancements in one download

Below you can find two 'pickled' python hash-maps which map each movie id to an array of unique feature id's. The first file contains a map of all features with 2+ occurrences, and the second of all with 25+. There is no particular reason I chose 25; it was a combination of what I believe to be a reasonable cutoff value and a good-sized feature space.

For those not versed in the ways of python, using these maps is very easy (below). If you wish, you can load these maps once in python and convert them into another format you are comfortable with:

import pickle

print("Loading hash...")
# pickled files must be opened in binary mode
with open('name.pickle', 'rb') as fID:
    features = pickle.load(fID)

features[movie_id]  # array of unique feature id's for this movie
...
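If you do want the data in another format, flattening the map to CSV takes only a few lines. A sketch — the sample map and the filenames ('name.pickle', 'features.csv') are placeholders; point them at whichever downloaded file you are using:

```python
import csv
import pickle

# Build a tiny sample map (a stand-in for the downloaded pickle) so the
# sketch is self-contained, then flatten it to (movie_id, feature_id) rows.
sample = {1: [10, 20], 2: [20, 30]}
with open('name.pickle', 'wb') as fID:
    pickle.dump(sample, fID)

# Load the pickled movie -> features map...
with open('name.pickle', 'rb') as fID:
    features = pickle.load(fID)

# ...and write one CSV row per (movie, feature) pair.
with open('features.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['movie_id', 'feature_id'])
    for movie_id, feature_list in features.items():
        for feature_id in feature_list:
            writer.writerow([movie_id, feature_id])
```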

Note: SVD Recommendation System in Ruby shows one good strategy to go with this data.

Short disclaimer

As far as I understand, Netflix does not object to using external data sources to improve on their algorithm. However, IMDB does object to the use of their database for commercial purposes. If you plan to use this data for the Netflix competition, the responsibility to clear all the legal spaghetti is solely yours.

Python scripts to generate your own maps

Just as the heading implies, you can find my 'two-pass' scripts for generating the maps in the zip file. I do not claim that they are the most efficient, or well refactored/optimized. Both scripts assume that you have imported the Netflix dataset into SQL, and that you also have a local copy of the IMDB database.

imdbMatrix.zip - Python scripts

Ilya Grigorik

Ilya Grigorik is a web performance engineer and developer advocate at Google, where his focus is on making the web fast and driving adoption of performance best practices at Google and beyond.