PostRank RSS Filtering

By Ilya Grigorik on December 29, 2006

My earlier post on 'Reinventing RSS Readers' generated a lot of very insightful feedback. For one, Adam Kalsey made a very good argument for human filtering (Solving feed overload). I dismissed 'friend filtering' all too easily in my initial analysis - I was hamstrung by the image of a delicious-style 'for:myfriend' scheme where friends manually forward articles. Adam is right: blogs, by their very nature, serve as human filters. Relaxing my definition, we can easily see that each blogger acts, in effect, as your friend. Some are better than others, but the reason we subscribe to their thoughts is exactly that they provide a distilled version of their interests - an industry digest of sorts. Of course, this simply turns the question on its head: I have so many friends! (Yay!) But whom should I pay attention to most at any given point in time?

Sampling 'Friend-Space'

Several other readers pointed to services like Memigo, Tailrank and Findory in response to my thoughts. I tried each one, and I think it's worthwhile to make a distinction about the underlying model. Each of these services insists on sampling the entire blogosphere, effectively removing the need to specify your own feeds. I'm not saying this is bad; in fact, the end result may be just what you're looking for. Personally, though, I prefer to keep human friends (other bloggers) as industry-specific filters. Here is a quick illustration of the currently used aggregation model to explain what I mean:

Now, machine learning is an interest of mine, so you would think I'd favor the aggregator approach - not so. I still think that humans, especially ones dedicated to a niche, can do a much better job at aggregating and filtering the news. The problem is, even after they deliver filtered results, the sheer number of my subscriptions makes the resulting 'news-space' unmanageable. My solution - define a gradient to rank delivered results. My proposed model looks something like:

The two models are not very different; 'blog-gradient' simply takes a slightly different approach to the same problem. More specifically, I force the user to specify the information sources to be tracked instead of sampling the 'news space' for them. In turn, I have more control over the information presented, at the cost of possibly missing some sectors. What's interesting, however, is that a 'successful' aggregator will eventually revert to the PostRank model once it learns my preferences - with time, it should converge on the news that is 'relevant' to me, but it will do so without my having to specify the feeds upfront. Even then, the aggregator still stands to benefit from the PostRank scheme - it may have learned the sources, but it still needs a concept of 'quality' to distinguish between individual items. Personally, I consider the 'training' step unnecessary: I like the control of finding and specifying my own feeds.

Introducing PostRank

After a few days of 'I wonder if a PostRank scheme would really work?', I rolled up my sleeves and put together a quick Ruby application to test my theory. Here is how it works: you specify the feed URL, I parse the results, scour the Internet for information on every post in your feed, rank each post by its combined score, and return the ranked results. Here is a sample run on my own feed:

How did I come up with the score? My ranking function is simple: I look at the number of comments, the number of bookmarks visitors made, and the number of trackbacks. I collect this information from the Internet and then normalize each post against the average for the blog in question - if you always get 15 comments, then getting 17 comments doesn't affect the ranking as much as, say, getting 15 comments when you usually get 2.
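To make the normalization idea concrete, here is a minimal Ruby sketch. It is not the code behind the sample run: the Post struct, its field names, the equal weighting of the three metrics, and the choice to divide each count by the blog's mean are all assumptions made for illustration.

```ruby
# Hypothetical post record; the field names are assumptions for this sketch.
Post = Struct.new(:title, :comments, :bookmarks, :trackbacks)

# Arithmetic mean of a list of numbers (0.0 for an empty list).
def mean(values)
  return 0.0 if values.empty?
  values.sum.to_f / values.size
end

# Score every post relative to its own blog: each metric is divided by the
# blog-wide average for that metric, and the three ratios are added up.
# Returns [post, score] pairs sorted from highest to lowest score.
def rank_posts(posts)
  metrics = %i[comments bookmarks trackbacks]
  averages = metrics.to_h { |m| [m, mean(posts.map { |post| post[m] })] }

  scored = posts.map do |post|
    score = metrics.sum do |m|
      avg = averages[m]
      # A metric the blog never receives carries no signal, so it adds 0.
      avg.zero? ? 0.0 : post[m] / avg
    end
    [post, score]
  end

  scored.sort_by { |_, score| -score }
end
```

Under these assumptions, a post with 15 comments on a blog that averages 2 gets a comment ratio of 7.5, while a post with 17 comments on a blog that averages 15 gets a ratio of only about 1.13. That is the relative effect described above.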

We can certainly dispute the results; my ranking function is very simple. However, I would claim that, at least for my own blog, the results are a very good representation of the 'significance' of each of my posts (#1 made it to the digg/delicious front pages, #2 created a buzz in the Rails community, #3 resulted in a lot of bookmarks, and so on). If you're a dedicated reader of either modernlife or blogmaverick, please do let me know about the 'quality' of my two other rankings.

Avoiding 'Absolute Rankings'

An important consequence of considering each post with respect to the originating blog is that we are not imposing a global measure on the author; when each post is ranked with respect to the author (assuming the ranking is meaningful), we simply see the 'quality' gradient. There is no concept of 'universal importance' here - a problem that services like Findory and Tailrank absolutely have to address. It is almost like cherry picking - given all gradients of all feeds you're subscribed to, let others decide what percolates to the top, and read only the stuff at the top (or more, if you have the time).
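The 'cherry picking' step can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: the input shape (a hash of feed name to [title, relative score] pairs) and the method name top_across_feeds are hypothetical, and the scores are assumed to be already normalized per blog.

```ruby
# feeds: hypothetical hash of feed name => array of [title, relative_score]
# pairs, where each score is already relative to its own blog's average.
# Returns the `limit` highest-scoring [feed, title, score] triples.
def top_across_feeds(feeds, limit)
  feeds
    .flat_map { |feed, items| items.map { |title, score| [feed, title, score] } }
    .sort_by { |_, _, score| -score }
    .first(limit)
end
```

Because every score is relative to its own blog, a small niche blog's standout post can outrank a routine post from a very popular one. No global measure of importance is needed.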

Perhaps it is time we introduce PostRank into the blogosphere.

Ilya Grigorik is a web ecosystem engineer, author of High Performance Browser Networking (O'Reilly), and Principal Engineer at Shopify.
