Reinventing RSS Readers

My RSS reader exploded once again. I can't help it: when I find a great blog I feel compelled to save it as a reference. Behind each feed stands a passionate and dedicated person; it's the proverbial 'signal' in a stream of pure noise. Whether this signal is relevant to my key interest areas is another story, but I like to keep my horizons as wide as possible, and so the author gets another subscription.

Microsoft Research Asia published some very interesting results from their RSS Usage Study. Their tabulated results reveal some interesting trends, but we carbon-based life forms are much better with visuals, hence I took the liberty of graphing some of their data:

 

Use patterns by time of day are not surprising, but they do raise some interesting questions. For example, what is the optimal time to publish your content? Should it be in late afternoon, before most people open their readers? Should it be early morning? Does it matter at all? Personally, I'm a subscriber to the early morning theory but for a completely different set of reasons - I write my posts in the afternoon, sleep on them, and check and publish in the morning.

Read frequency shows a very peculiar pattern - it seems to be normally distributed, except that a very large number of people form a singularity at one of the tails. Those who read RSS, read it a lot - 36% of people check their readers more than 5 times a day!

 

Now, it seems that the Microsoft researchers underestimated the range of choices they needed to offer for both of these questions. As with most population measures, we would expect both of these graphs to be roughly normally distributed - except they're not! Both capture only one side of the distribution, which really hinders our ability to draw any meaningful conclusions.

The disturbing pattern is the number of feeds we're subscribed to. It seems that ~100 might be the mean, if not more. Compound that with the time spent and a quick (conservative) calculation will show that an average user is expected to check their reader 3 times a day, with expected time spent of 30 minutes per session. That's 1.5 hours a day! Sounds about right? The pattern fits in my case.

Information overload, or just poor habits?

Back in 2003, Wired magazine had an interesting article on RSS information overload. It appears that this is not a new problem; we've had it for a while. What is truly surprising is that no one has successfully solved Ryan's dilemma:

I want to solve the question of 'I don't have any time and I subscribe to 500 feeds. I just got off the plane. What do I need to read?' (Kevin Burton).

Keyword filtering is not the panacea

Web 2.0 brought an explosion of RSS scrapers, splicers, filters, you name it! Bursting at the seams with AJAX and cool-looking reflections, they all seem to revolve around the same model: keyword filtering, feed splicing and some sort of feed ranking. Keyword filtering is heralded as the solution to all RSS problems - not. Let's step back. Keyword filtering solves one instance of the information overload problem. I use keyword monitoring to alert me to new product announcements and to track public mentions of certain people or products. It does work in that instance. However, the very act of specifying a keyword presupposes a priori knowledge of the topic I would like to track.

The reason I cast my net so wide across so many sources is exactly because I don't know what I'm looking for - I just want to make sure that once something significant happens, I'll know about it! Not only does keyword filtering fail here, but the act of choosing the keywords is extremely hard in itself. It is almost impossible to select a combination of words that will not result in either an absurdly broad category (finance, economics, stocks) or an exceedingly narrow one (finance, stocks, apple). What if I simply want a notification when an entire industry sector experiences a shake-up, without specifying the sector a priori? What about semantic disambiguation? I specified 'car', the announcement I was hoping to catch used 'auto' - what now?
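The 'car' versus 'auto' miss is easy to reproduce. Here is a minimal sketch in Python, assuming a naive whole-word keyword filter; the function name and the sample titles are made up for illustration and do not come from any particular reader:

```python
def matches_keywords(title, keywords):
    """Return True if any keyword appears as a whole word in the title."""
    # Lowercase each word and strip surrounding punctuation before comparing.
    words = {w.strip(".,!?:;'\"").lower() for w in title.split()}
    return any(k.lower() in words for k in keywords)


# Hypothetical post titles, invented for this example.
titles = [
    "New car models announced for spring",
    "Auto industry faces a major shake-up",
]

# Only the title that literally contains 'car' survives the filter.
hits = [t for t in titles if matches_keywords(t, ["car"])]
print(hits)
```

The second title is arguably the one I cared about, and the filter silently drops it because it never mentions the exact word I happened to pick.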

Collaborative filtering is over-hyped

How do you expect to make a meaningful prediction for additional sources (i.e. suggest another feed not in my list) when you can't even tell me what I should be paying attention to in the feeds I already specified? Sure, you can use collaborative filtering - you can pair me with similar users and offer me some 'similar' feeds to read. But this only makes my problem worse! I'm drowning in information I already have, and the 'competitive advantage' of your RSS reader is to crank up the pressure coming out of the hose? That is absurd. Google has it right: their front page is about what you need to do now, minus all the distractions. It's not about what they think you should be searching for, it's about your goals.
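To make the complaint concrete, here is a rough sketch in Python of the kind of feed suggestion described above. It assumes Jaccard similarity over subscription sets as the way users are paired; that choice, the function names and the sample data are all illustrative assumptions, not a description of any actual reader:

```python
def jaccard(a, b):
    """Overlap between two sets of feeds: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_feeds(mine, others):
    """Suggest the feeds of the most similar user that I don't have yet."""
    best = max(others.values(), key=lambda feeds: jaccard(mine, feeds))
    return sorted(best - mine)


# Hypothetical subscription lists.
mine = {"feed-a", "feed-b", "feed-c"}
others = {
    "user1": {"feed-a", "feed-b", "feed-d", "feed-e"},
    "user2": {"feed-x", "feed-y"},
}

suggestions = suggest_feeds(mine, others)
print(suggestions)
```

Whatever similarity measure is used, the output has the same shape: more feeds on top of the ones I already can't keep up with, and nothing about which existing posts deserve my attention.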

Reinventing the RSS process

Matt started a very interesting conversation in his post 'Taming the RSS Beast'. Among his proposals are: author decides (separate feed), community decides (interestingness measure), reader's friends decide, RSS client decides. The first and third proposals are interesting, but flawed. Placing the burden on the author is unlikely to show any results; we've seen this with XFN and other semantic web projects. Likewise, letting my friends pick what I read could work, but it's too much of a burden on them (and, even more so, it presupposes we all use the same system).

Build a smarter RSS client

Shifting the burden onto the RSS reader could yield some improvements. Clustering, correlation and prior reading patterns can all be taken into account when information is sifted through the filters. The problem with this approach is the startup phase - the perpetual chicken-and-egg problem of all collaborative filtering systems. You need to convince the user to invest a lot of time and energy up-front while the system is in cold-start mode - it can't make meaningful predictions yet, it needs to observe and learn. The key here is usability: it has to be simple and elegant. If you can make me want to use your reader even while you're in cold-start, and later you tweak and learn from my patterns, then you're in the money. Unfortunately, I have yet to see an RSS reader that exhibits any of these properties - they are always either feature-laden and not usable, or simply functionally crippled.

Rank the posts, not the feeds!

And last, but in my opinion the most promising approach: community decides, or what I dub separating the wheat from the chaff. No, this is not collaborative filtering; it assumes the existence of an 'interestingness' measure. Conceptually, if we want to answer the original question from Wired (given 500 feeds, what do I need to read if I have 5 minutes?) then we need to prioritize every post by its impact and make our way down the list. One problem: how do you measure impact?

We can develop objective and subjective measures to capture interestingness. Subjective measures are customized to the user; these are much harder to implement and have the downside of the aforementioned cold-start problem. On the other hand, objective measures don't always yield the best results; after all, what you want might be different from what everybody else wants. However, on average objective measures work rather well (e.g. Google's PageRank). There are plenty of details we can capture to develop a formulation of interestingness for blog posts: number of views, number of comments, etc. It's feasible.
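As a rough illustration of what an objective measure could look like, here is a small Python sketch that combines views and comments into a single score. The formula, the weight given to comments and the logarithmic damping are all assumptions made up for this example; they are not an established formulation:

```python
import math


def interestingness(views, comments, comment_weight=5.0):
    """Toy objective score built from view and comment counts.

    Assumptions (hypothetical): a comment signals more engagement than a
    view, so it is weighted more heavily, and log1p dampens very large
    counts so a single viral post does not dwarf everything else.
    """
    return math.log1p(views + comment_weight * comments)


# A post nobody has seen scores zero; comments lift the score at equal views.
print(interestingness(0, 0))
print(interestingness(100, 0) < interestingness(100, 10))
```

Any real formulation would need tuning against actual reading behaviour; the point is only that the inputs are cheap to collect and the score is the same for every reader, so there is no cold-start phase.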

However, the tradeoff with this approach is time. Google can't capture trends in real-time, it has a delay, and the same problem surfaces here. There needs to be a delay before we can reliably capture the number of views or comments. But, if you're not reading 80% of your feeds every day, is time really a problem? Imagine - you open your reader and scroll down to the section you only check once a week, and in it posts are prioritized by their 'impact' since you last checked. If you only have a few minutes, then you only read the stuff you should be reading! Would you even read past the first 10 posts if you had such an 'impact' guarantee? Do you ever use the 2nd and 3rd pages of Google's search results? I think the landscape of our RSS usage habits would change dramatically.
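The 'top posts since I last checked' view could be sketched roughly as follows in Python. It assumes each post already carries a precomputed impact score; the dictionary fields, the function name and the sample posts are hypothetical:

```python
from datetime import datetime


def top_posts(posts, last_checked, limit=10):
    """Return the highest-impact posts published after last_checked."""
    fresh = [p for p in posts if p["published"] > last_checked]
    return sorted(fresh, key=lambda p: p["impact"], reverse=True)[:limit]


# Hypothetical posts with made-up impact scores.
posts = [
    {"title": "Old but popular", "published": datetime(2024, 1, 1), "impact": 9.0},
    {"title": "Minor update", "published": datetime(2024, 1, 8), "impact": 1.5},
    {"title": "Big announcement", "published": datetime(2024, 1, 9), "impact": 7.2},
]

reading_list = top_posts(posts, last_checked=datetime(2024, 1, 5), limit=2)
titles = [p["title"] for p in reading_list]
print(titles)
```

The older post is skipped even though it has the highest score, because it predates the last visit; everything newer is ordered by impact rather than by date, so the first few entries are the ones worth the few minutes available.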

Part 2: PostRank RSS Filtering - Empirical results of PostRank filtering


Ilya Grigorik

Ilya Grigorik is a web performance engineer and developer advocate at Google, where his focus is on making the web fast and driving adoption of performance best practices at Google and beyond.