Collaborative Filtering with Ensembles

One of the most interesting insights from the results of the Netflix challenge is that while the algorithms, the psychology, and good knowledge of statistics goes a long way, it was ultimately the cross-team collaboration that ended the contest. "The Ensemble" team, appropriately named for the technique they used to merge their results consists of over 30 people. Likewise, the runner up team ("BellKor") is a collaborative effort of several distinct groups that merged their results. It is easy to overlook this fact, except that it is not a one-time occurrence. The leaderboard for the recent GitHub contest also clearly shows over half of the top ten entries as ensemble techniques!

Few people would argue against the idea of consulting several experts on any complex subject - as the saying goes: "two heads are better than one". Having said that, the ensemble techniques which leverage this exact strategy are rarely a hot topic in the machine learning communities, or at least, up until now. Given their recent success this is likely to change, but perhaps even more importantly, their effectiveness may actually force the 'collaborative filtering' space to become, well, a lot more collaborative.

GitHub Contest: The Good, The Bad

Religious arguments about the merits of each distributed version control system aside, one thing GitHub does really well is allow users to follow and discover new open source projects. In that light, a competition for a recommendations engine makes perfect sense. As part of the challenge you get access to records of over 56K users, 120K repositories, and 440K relationships between each. Great data set, immediate feedback via post-commit hooks, all results available in the open, and a few people even shared their code!

If there was one thing to improve about the contest, then it would have to be the ranking methodology. The goal was to "guess" the 4,788 removed repositories, which I would argue, is optimizing for the wrong thing. The goal should have been not to guess what users are already watching, but to produce a more general predictor for what users should be watching - the former is a subset of the latter. To see why, think of the case where a general predictor might actually have a set of better recommendations (based on other user's patterns) than the few repos that the GitHub team hid from the dataset for any user.

Ensemble Techniques & Machine Learning

Ranking methodology aside, perhaps the most interesting result of the GitHub contest is due to how it was setup: each submission is a push into a public Git repo, which means that each entry is automatically in the public domain. Unlike the Netflix contest, where all submissions were private and teams had to agree to merge their results, the GitHub contest became a free-for-all and a fertile crowd for testing ensembles!

While the topic of Ensembles is a rich one, the general idea is remarkably simple: instead of attempting to build one general model to capture all the subtleties of the data, use multiple predictors and combine their results. For an illustrative example, assume that we were trying to build a model that would separate the data in the diagram above. After a brief examination, we form a simple hypothesis (call this hypothesis family H): we could separate the data using simple geometric shapes such as a circle (H1), or a square (H2). Visually we can clearly see that neither one by itself will do a good job, but applying them both gives us a pretty good approximation while the model remains very simple (we want our model to be simple, for computational reasons). So what do we do? We train two independent classifiers and then merge their results, aka, use an ensemble of classifiers.

The description above sweeps a lot of details under the rug, but core of the idea is there. Boosting, bagging, consensus aggregation, dynamic classifier selection, and dozens of other techniques have been developed on top of this process and have proven to be very successful. The overarching principle of ensemble techniques is to make each predictor as unique as possible: use a different learning algorithm (decision trees, svm, svd), or use a different feature (random subspace method). Then, once you have many individual classifiers, determine a mechanism to join the results (simple method: each predictor votes on each point, then tally up the votes) and make a final prediction using the new classifier.

GitHub Ensembles

Perhaps the most illustrative example of this simple technique is John Rowell’s entry into the GitHub contest, which is best described by a snippet right from the source:

require 'nokogiri'
require 'open-uri'

class Crowdsource
  def initialize
    load_leaderboard  # scrape github contest leaders
    parse_leaders     # find their top performing results
    fetch_results     # download best results
    cleanup_leaders   # cleanup missing or incorrect data
    crunchit          # build an ensemble
ghcontest-crowdsourced.git - John Rowell's ensemble predictor

Because all the entries in the GitHub contest are in the public domain (actually, a few people caught on and started deleting their repos immediately after their submissions because of this technique), we can download the best predictions of other users, construct an ensemble, and then submit our own set of results.

Collaborative, Collaborative Filtering: CCF!

While on first examination the ensemble technique may have a flavor of cheating, I think it can actually have a significant effect on how future challenges of this sort could be structured. Don't fight it, embrace it.

Instead of focusing on individual effort or insight, it is not hard to imagine much more collaborative, collaborative-filtering competitions (CCF's). Some teams can focus on building great independent predictors, while others can focus on developing and improving the methods for constructing ensembles (a mashup, of sorts). In other words, collaborative filtering has the potential to become a lot more collaborative! And in the meantime, I wish the GitHub guys luck in trying to determine the winner.

Update: GitHub removed most ensemble submissions / released the code to power the challenge.

Ilya GrigorikIlya Grigorik is a web performance engineer at Google, co-chair of the W3C Web Performance working group, and author of High Performance Browser Networking (O'Reilly) book — follow on Twitter, Google+.