Choice and the Long Tail at Netflix

By Ilya Grigorik on January 24, 2007

A popular article from Wired magazine by Chris Anderson, later turned into the book 'The Long Tail', prompted an interesting debate on Hits vs. Blockbusters and the future of digital retail economics as a whole. In his book, Chris proposes an updated model of the Pareto distribution (the 80/20 rule) for the era of 'economic abundance'. He argues that the rise of micro-markets, composed of small customer niches, will shift demand from the head of the distribution to the tail. This idea is more simply explained through a diagram:

Looking at Netflix data

A few weeks ago Chris posted a link to my analysis of the Netflix data released as part of their recommendation contest. In the same post, he also left an interesting remark:

"Meanwhile, Anita Elberse from Harvard Business School has been doing the best Netflix data crunching around (she and I have proprietary data from Netflix, which gives us some unique insights that you can't get from the public data), and you can read an interview with her about this research here."

I'm rather skeptical of this argument.

For all intents and purposes, I would argue, the publicly released data must be equivalent to anything Chris and Anita Elberse have on their hands. The competition rules openly state that the released sample is representative of the underlying dataset. Thus, while I don't dispute that Chris and Anita might have a bigger and more complete subset, I have every reason to believe that the released data is large enough to contain all the same patterns.

Mining Netflix data

Shortly after realizing that I was sitting on a goldmine of valuable information, I decided to dig deeper into the patterns. I was primarily interested in seeing how, if at all, the long tail has changed over the years. Here are the results:

At first glance, it looks like Chris is right: the Top 100 movies (by number of ratings) have progressively claimed less and less market share. However, using an absolute number such as the Top 100 is slightly misleading.

Between 2000 and 2005, Netflix increased its inventory nearly four-fold, going from 4,461 movies to 17,761. Consequently, the underlying scales have changed. In 2000, looking at the Top 100 movies was equivalent to considering the top 2.24% by popularity. By 2005 this number was as low as the top 0.56%. Year after year we are looking at a progressively narrower band of popular movies. Without a doubt, the data shows an interesting dynamic that appears to confirm Chris's theory, but are we seeing the whole picture?

I would argue that the discrepancy between the fixed, absolute size of the sample and the growing size of the dataset distorts the underlying picture. Instead, we need to normalize the scale across the years and look at equivalent segments.
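The percentages quoted above follow directly from the catalog sizes. As a quick sanity check, here is a minimal Python sketch of the arithmetic; the function and variable names are my own, chosen for illustration, and only the two catalog sizes come from the text.

```python
# Catalog sizes quoted in the text; everything else here is illustrative.
catalog_size = {2000: 4461, 2005: 17761}
TOP_N = 100


def top_n_as_percent(n, total):
    """Return the percentage of a catalog of `total` titles covered by the top `n`."""
    return 100.0 * n / total


for year, total in sorted(catalog_size.items()):
    pct = top_n_as_percent(TOP_N, total)
    print(f"{year}: Top {TOP_N} = top {pct:.2f}% of {total} titles")
```

A fixed Top 100 cutoff therefore covers roughly a quarter as much of the catalog in 2005 as it did in 2000, which is why the year-over-year comparison is not like-for-like.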

Looking at normalized popularity slices

Once we fix our scale to consider percent segments (Top 1%, Top 2%, etc.), instead of absolute rankings, a very different picture emerges:

It is immediately obvious that the head does not get shorter; if anything, it gets more pronounced over time. Furthermore, if you look carefully at the graphs, you'll see that the total demand from the tail (in %) is, in fact, steadily decreasing, while the Top 1% still attracts 20% to 25% of total viewership. Year 2000 seems to be closest to Chris's model. It appears that the early adopters (movie-buffs) really did drive demand down the tail, which may also answer an open question I posed in my earlier analysis:

"One pattern that you cannot observe here is that the mean (average rating) tends to drift to the right as the number of users grows - in 1998 the average was 3.4, by 2005 it steadily moved to 3.8. I wonder what accounts for the drift? Are early adopters (techies) more discerning in their ratings/choices?"

It would certainly make sense: movie-buffs watch more movies from the tail, and are generally more discerning in their ratings. However, once early adopters are settled in and the mainstream audience joins, we revert back to the old 80/20 rule, and once again the demand shifts to the head.

Netflix-2000-to-2005.xls - Excel Spreadsheet
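The difference between a fixed cutoff (Top 100) and a normalized cutoff (Top 1%) discussed in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the spreadsheet's actual calculation: the function names are hypothetical, and the demo uses synthetic Zipf-like rating counts rather than Netflix data. To reproduce the real numbers you would pass in the per-movie rating counts for each year.

```python
import math


def top_share(rating_counts, n):
    """Fraction of all ratings that belong to the n most-rated titles."""
    ranked = sorted(rating_counts, reverse=True)
    total = sum(ranked)
    return sum(ranked[:n]) / total if total else 0.0


def top_percent_share(rating_counts, percent):
    """Fraction of all ratings that belong to the top `percent`% of titles."""
    # Round up so that even a tiny catalog contributes at least one title.
    n = max(1, math.ceil(len(rating_counts) * percent / 100.0))
    return top_share(rating_counts, n)


def synthetic_catalog(size):
    """Hypothetical Zipf-like rating counts (1/rank). NOT Netflix data."""
    return [1.0 / rank for rank in range(1, size + 1)]


# Compare the two cutoffs on synthetic catalogs of the sizes quoted in the text.
for size in (4461, 17761):
    counts = synthetic_catalog(size)
    print(f"{size} titles: "
          f"fixed Top 100 share = {top_share(counts, 100):.3f}, "
          f"Top 1% share = {top_percent_share(counts, 1):.3f}")
```

On the synthetic input the fixed Top 100 share shrinks as the catalog grows, simply because the same 100 titles cover a smaller slice of a longer list. That says nothing about the real distribution; it only shows why the two measures can tell different stories about the same data.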

Scaling the 80/20 rule

Chris's argument still stands: the era of mega-blockbusters may be passing. But the model he proposes is slightly misleading. Simply adding more choice does not seem to shift demand to the tail. Rather, and I would claim that this is non-obvious, the long tail distribution scales proportionately as new choices appear. In absolute numbers this does mean that there is more space at the top (and at the bottom). The Top 1% of 1,000 movies makes for 10 mega-blockbusters, but the Top 1% of 10,000 movies gives us 100 hits. I don't find this surprising. We're simply discovering that many of us do, in fact, differ in our movie-watching preferences, and more choice facilitates this dynamic. However, the top 1% of the movies still claims just as much viewership year after year.

There may be more niches, but you still need to hit it 'big' to leave an impact. Perhaps 'big' is not as large as it used to be ten years ago (which makes it easier), but the growing number of choices is segmenting consumer attention at an unprecedented rate (which makes it harder). What niches gain by crowding at the top, they lose through smaller market share and more challenging marketing schemes. Perhaps 'Long Tail' economics is not as revolutionary as it may seem at first glance. As I was looking at the data, I couldn't help but recall a great quote by the American humorist Evan Esar: "Statistics: The only science that enables different experts using the same figures to draw different conclusions." Once again, I was reminded that every theory should be taken with a grain of salt.

Note: Dissecting the Netflix Dataset has more analysis of the same dataset.

Ilya Grigorik is a web ecosystem engineer, author of High Performance Browser Networking (O'Reilly), and Principal Engineer at Shopify.