<?xml version="1.0" encoding="utf-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>
<channel>
	<title>Comments on: Collaborative Filtering with Ensembles</title>
	<atom:link href="http://www.igvita.com/2009/09/01/collaborative-filtering-with-ensembles/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.igvita.com/2009/09/01/collaborative-filtering-with-ensembles/</link>
	<description>A goal is a dream with a deadline.</description>
	<pubDate>Mon, 15 Mar 2010 00:47:21 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7.1</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<item>
		<title>By: Ennuyer.net &#187; Blog Archive &#187; Rails Reading - Sept 10, 2009</title>
		<link>http://www.igvita.com/2009/09/01/collaborative-filtering-with-ensembles/comment-page-1/#comment-211854</link>
		<dc:creator>Ennuyer.net &#187; Blog Archive &#187; Rails Reading - Sept 10, 2009</dc:creator>
		<pubDate>Thu, 10 Sep 2009 19:38:37 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/?p=609#comment-211854</guid>
		<description>[...]  Collaborative Filtering with Ensembles - igvita.com  [...]</description>
		<content:encoded><![CDATA[<p>[...]  Collaborative Filtering with Ensembles - igvita.com  [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Github contest: Winning the bottle of Pappy &#8212; Some French Guy</title>
		<link>http://www.igvita.com/2009/09/01/collaborative-filtering-with-ensembles/comment-page-1/#comment-211355</link>
		<dc:creator>Github contest: Winning the bottle of Pappy &#8212; Some French Guy</dc:creator>
		<pubDate>Sun, 06 Sep 2009 01:49:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/?p=609#comment-211355</guid>
		<description>[...] Netflix taught us that ensembles win. Ilya Grigorik submitted an entry exploiting that fact and wrote about it Collaborative Filtering with Ensembles. [...]</description>
		<content:encoded><![CDATA[<p>[...] Netflix taught us that ensembles win. Ilya Grigorik submitted an entry exploiting that fact and wrote about it Collaborative Filtering with Ensembles. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ilya Grigorik</title>
		<link>http://www.igvita.com/2009/09/01/collaborative-filtering-with-ensembles/comment-page-1/#comment-211153</link>
		<dc:creator>Ilya Grigorik</dc:creator>
		<pubDate>Fri, 04 Sep 2009 02:17:07 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/?p=609#comment-211153</guid>
		<description>John: Completely agree with your analysis of the incorrect setup of the problem for the GitHub contest. Ultimately, that's why I ended up passing on the contest - couldn't figure out why / how I could optimize the algorithm I had in mind for what seemed to be a random scoring process.</description>
		<content:encoded><![CDATA[<p>John: Completely agree with your analysis of the incorrect setup of the problem for the GitHub contest. Ultimately, that&#8217;s why I ended up passing on the contest - couldn&#8217;t figure out why / how I could optimize the algorithm I had in mind for what seemed to be a random scoring process.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John Rowell</title>
		<link>http://www.igvita.com/2009/09/01/collaborative-filtering-with-ensembles/comment-page-1/#comment-211138</link>
		<dc:creator>John Rowell</dc:creator>
		<pubDate>Fri, 04 Sep 2009 00:44:28 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/?p=609#comment-211138</guid>
		<description>Honored by the fork, Ilya! I talk about the crowdsourced entry here: http://bit.ly/ghcontest

About the 18% score, guys (Jerod): I had two entries. My "regular" one got 18%; the crowdsourced one got 50%+ (or as much as the winning one if you change the weight constant to 2 :P).</description>
		<content:encoded><![CDATA[<p>Honored by the fork, Ilya! I talk about the crowdsourced entry here: <a href="http://bit.ly/ghcontest" rel="nofollow">http://bit.ly/ghcontest</a></p>
<p>About the 18% score, guys (Jerod): I had two entries. My &#8220;regular&#8221; one got 18%; the crowdsourced one got 50%+ (or as much as the winning one if you change the weight constant to 2 :P).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Flow &#187; Blog Archive &#187; Daily Digest for September 3rd - The zeitgeist daily</title>
		<link>http://www.igvita.com/2009/09/01/collaborative-filtering-with-ensembles/comment-page-1/#comment-211002</link>
		<dc:creator>Flow &#187; Blog Archive &#187; Daily Digest for September 3rd - The zeitgeist daily</dc:creator>
		<pubDate>Thu, 03 Sep 2009 08:29:50 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/?p=609#comment-211002</guid>
		<description>[...] Shared Collaborative Filtering with Ensembles &#8211; igvita.com [...]</description>
		<content:encoded><![CDATA[<p>[...] Shared Collaborative Filtering with Ensembles &#8211; igvita.com [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Haran</title>
		<link>http://www.igvita.com/2009/09/01/collaborative-filtering-with-ensembles/comment-page-1/#comment-210893</link>
		<dc:creator>Daniel Haran</dc:creator>
		<pubDate>Wed, 02 Sep 2009 14:09:15 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/?p=609#comment-210893</guid>
		<description>Jochen: most of us wrote utilities to extract our own test sets, e.g.:
http://github.com/danielharan/github_resys/tree/master data/prepare_training_set.rb

The Netflix prize was very well designed. They prevented users from submitting more than once every 24 hours. Further, they had a dev-test-set and *2* test-sets - one for public scoring, another for awarding prizes.

It was only by participating in the GitHub contest that I realized how well the Netflix prize was organized; many non-specialists would have made the very same mistakes. Not having a dev-test-set was a small issue compared to the temptation to hill climb and overfit to the test-set.</description>
		<content:encoded><![CDATA[<p>Jochen: most of us wrote utilities to extract our own test sets, e.g.:<br />
<a href="http://github.com/danielharan/github_resys/tree/master" rel="nofollow">http://github.com/danielharan/github_resys/tree/master</a> data/prepare_training_set.rb</p>
<p>The Netflix prize was very well designed. They prevented users from submitting more than once every 24 hours. Further, they had a dev-test-set and *2* test-sets - one for public scoring, another for awarding prizes.</p>
<p>It was only by participating in the GitHub contest that I realized how well the Netflix prize was organized; many non-specialists would have made the very same mistakes. Not having a dev-test-set was a small issue compared to the temptation to hill climb and overfit to the test-set.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dr. Jochen L. Leidner</title>
		<link>http://www.igvita.com/2009/09/01/collaborative-filtering-with-ensembles/comment-page-1/#comment-210892</link>
		<dc:creator>Dr. Jochen L. Leidner</dc:creator>
		<pubDate>Wed, 02 Sep 2009 13:53:17 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/?p=609#comment-210892</guid>
		<description>The GitHub competition is a nice idea, but methodologically unsound. In machine learning, data sets are usually split into three buckets: a training (sometimes called "development") set, a dev-test set, and a test set.
The training set is inspected to learn about the task and its class distributions, and to hypothesize about features of potential value. Once a classifier is built (by training it on the training set), it can be tested on the dev-test set, which is used for evaluation multiple times during the development cycle to obtain interim results that are more meaningful than evaluation on the training set would be (cf. Erik's comment about overfitting above). The test set is not touched during this process: it ought to remain "unseen data" until the system is considered good enough, is used only once for a final evaluation, and is never inspected.</description>
		<content:encoded><![CDATA[<p>The GitHub competition is a nice idea, but methodologically unsound. In machine learning, data sets are usually split into three buckets: a training (sometimes called &#8220;development&#8221;) set, a dev-test set, and a test set.<br />
The training set is inspected to learn about the task and its class distributions, and to hypothesize about features of potential value. Once a classifier is built (by training it on the training set), it can be tested on the dev-test set, which is used for evaluation multiple times during the development cycle to obtain interim results that are more meaningful than evaluation on the training set would be (cf. Erik&#8217;s comment about overfitting above). The test set is not touched during this process: it ought to remain &#8220;unseen data&#8221; until the system is considered good enough, is used only once for a final evaluation, and is never inspected.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ilya Grigorik</title>
		<link>http://www.igvita.com/2009/09/01/collaborative-filtering-with-ensembles/comment-page-1/#comment-210888</link>
		<dc:creator>Ilya Grigorik</dc:creator>
		<pubDate>Wed, 02 Sep 2009 13:30:22 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/?p=609#comment-210888</guid>
		<description>bd: When you're making a prediction, there should always be a score or a confidence interval assigned to it. Hence, if you were to build a general predictor, you would just sort your predictions in order of their score and output the top 10. The thing is, the 10 that GitHub omitted may not be the actual top 10! That's where the problem lies.

Erik: Overfitting is definitely a pitfall, but that's exactly where bagging and subspace methods shine. Likewise, if the predictors are 'independent', then you're less likely to run into this problem.

Anony: Yes, exactly. Though the more I think about it, the more I'm convinced that it would be very interesting to try a two-tier system. On the first tier, teams can build predictors; on the second, we can optimize the ensembles.

Jerod: Actually, if you just rerun his code, you'll get over 50% accuracy now. I think the low score was representative of the poor predictors available at the time when he tried it.

Cole: Yep, their blog is a treasure trove of information.

Daniel: My thoughts exactly.</description>
		<content:encoded><![CDATA[<p>bd: When you&#8217;re making a prediction, there should always be a score or a confidence interval assigned to it. Hence, if you were to build a general predictor, you would just sort your predictions in order of their score and output the top 10. The thing is, the 10 that GitHub omitted may not be the actual top 10! That&#8217;s where the problem lies.</p>
<p>Erik: Overfitting is definitely a pitfall, but that&#8217;s exactly where bagging and subspace methods shine. Likewise, if the predictors are &#8216;independent&#8217;, then you&#8217;re less likely to run into this problem.</p>
<p>Anony: Yes, exactly. Though the more I think about it, the more I&#8217;m convinced that it would be very interesting to try a two-tier system. On the first tier, teams can build predictors; on the second, we can optimize the ensembles.</p>
<p>Jerod: Actually, if you just rerun his code, you&#8217;ll get over 50% accuracy now. I think the low score was representative of the poor predictors available at the time when he tried it.</p>
<p>Cole: Yep, their blog is a treasure trove of information.</p>
<p>Daniel: My thoughts exactly.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Haran</title>
		<link>http://www.igvita.com/2009/09/01/collaborative-filtering-with-ensembles/comment-page-1/#comment-210852</link>
		<dc:creator>Daniel Haran</dc:creator>
		<pubDate>Wed, 02 Sep 2009 05:08:49 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/?p=609#comment-210852</guid>
		<description>For increasing accuracy, there's probably more low-hanging fruit in blending.

It would have been cool to have 2 parallel competitions: one for the blenders, one for those developing base algorithms. That way you might have won a bottle of whiskey :)</description>
		<content:encoded><![CDATA[<p>For increasing accuracy, there&#8217;s probably more low-hanging fruit in blending.</p>
<p>It would have been cool to have 2 parallel competitions: one for the blenders, one for those developing base algorithms. That way you might have won a bottle of whiskey :)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Collaborative Filtering with Ensembles &#8211; igvita.com &#171; Netcrema &#8211; creme de la social news via digg + delicious + stumpleupon + reddit</title>
		<link>http://www.igvita.com/2009/09/01/collaborative-filtering-with-ensembles/comment-page-1/#comment-210829</link>
		<dc:creator>Collaborative Filtering with Ensembles &#8211; igvita.com &#171; Netcrema &#8211; creme de la social news via digg + delicious + stumpleupon + reddit</dc:creator>
		<pubDate>Tue, 01 Sep 2009 22:45:08 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/?p=609#comment-210829</guid>
		<description>[...] Collaborative Filtering with Ensembles &#8211; igvita.com [...]</description>
		<content:encoded><![CDATA[<p>[...] Collaborative Filtering with Ensembles &#8211; igvita.com [...]</p>
]]></content:encoded>
	</item>
</channel>
</rss>
