<?xml version="1.0" encoding="utf-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>
<channel>
	<title>Comments on: Correlating Netflix and IMDB Datasets</title>
	<atom:link href="http://www.igvita.com/2007/01/27/correlating-netflix-and-imdb-datasets/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.igvita.com/2007/01/27/correlating-netflix-and-imdb-datasets/</link>
	<description>A goal is a dream with a deadline.</description>
	<pubDate>Thu, 18 Mar 2010 00:25:10 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7.1</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Pavel</title>
		<link>http://www.igvita.com/2007/01/27/correlating-netflix-and-imdb-datasets/comment-page-1/#comment-169497</link>
		<dc:creator>Pavel</dc:creator>
		<pubDate>Sat, 14 Feb 2009 15:28:11 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/blog/2007/01/27/correlating-netflix-and-imdb-datasets/#comment-169497</guid>
		<description>Yeah, I tried something similar myself: an exact string match against every entry on the results page, followed by a comparison of the release year. But this only worked for 48% of the Netflix titles. Adding some heuristic rules (like moving 'the' to the front, removing non-alphanumeric characters, etc.) got me another 20%.</description>
		<content:encoded><![CDATA[<p>Yeah, I tried something similar myself: an exact string match against every entry on the results page, followed by a comparison of the release year. But this only worked for 48% of the Netflix titles. Adding some heuristic rules (like moving &#8216;the&#8217; to the front, removing non-alphanumeric characters, etc.) got me another 20%.</p>
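<p>A minimal sketch of the heuristics described above, assuming titles arrive as plain strings and years as integers. The helper names normalize_title and same_movie are hypothetical and are not taken from the imdbMatrix script; non-ASCII letters are simply dropped here, which a real matcher would want to handle better.</p>

```python
import re


def normalize_title(title):
    """Hypothetical helper: canonicalize a title before comparing."""
    title = title.lower().strip()
    # Move a trailing article to the front: "matrix, the" becomes "the matrix".
    match = re.match(r"^(.*),\s*(the|a|an)$", title)
    if match:
        title = match.group(2) + " " + match.group(1)
    # Drop everything except letters, digits and spaces, then collapse spaces.
    title = re.sub(r"[^a-z0-9 ]", "", title)
    return re.sub(r"\s+", " ", title).strip()


def same_movie(netflix, imdb):
    """Each argument is a (title, year) tuple; title and year must both agree."""
    return (normalize_title(netflix[0]) == normalize_title(imdb[0])
            and netflix[1] == imdb[1])
```

<p>Running every entry on the results page through same_movie, instead of taking the top hit, is one way to apply the exact-match-plus-year check.</p>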
]]></content:encoded>
	</item>
	<item>
		<title>By: Ilya Grigorik</title>
		<link>http://www.igvita.com/2007/01/27/correlating-netflix-and-imdb-datasets/comment-page-1/#comment-169489</link>
		<dc:creator>Ilya Grigorik</dc:creator>
		<pubDate>Sat, 14 Feb 2009 15:12:32 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/blog/2007/01/27/correlating-netflix-and-imdb-datasets/#comment-169489</guid>
		<description>Pavel, you're right, I used a naive matching technique. It's been a while since I've done this, but I thought I went one step further by comparing the year the movie was released. Once again, this won't give you a 100% match, but it will eliminate some false positives.

Atul, I would recommend that you take a look at pyimdb or some other libraries which have already solved those mysteries. ;)</description>
		<content:encoded><![CDATA[<p>Pavel, you&#8217;re right, I used a naive matching technique. It&#8217;s been a while since I&#8217;ve done this, but I thought I went one step further by comparing the year the movie was released. Once again, this won&#8217;t give you a 100% match, but it will eliminate some false positives.</p>
<p>Atul, I would recommend that you take a look at pyimdb or some other libraries which have already solved those mysteries. ;)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Atul</title>
		<link>http://www.igvita.com/2007/01/27/correlating-netflix-and-imdb-datasets/comment-page-1/#comment-168938</link>
		<dc:creator>Atul</dc:creator>
		<pubDate>Fri, 13 Feb 2009 06:58:05 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/blog/2007/01/27/correlating-netflix-and-imdb-datasets/#comment-168938</guid>
		<description>I have been trying to read a few IMDB datasets, but their layout is really absurd. Do you have any documentation that helped you understand the structure of the various .list files?</description>
		<content:encoded><![CDATA[<p>I have been trying to read a few IMDB datasets, but their layout is really absurd. Do you have any documentation that helped you understand the structure of the various .list files?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Pavel</title>
		<link>http://www.igvita.com/2007/01/27/correlating-netflix-and-imdb-datasets/comment-page-1/#comment-164946</link>
		<dc:creator>Pavel</dc:creator>
		<pubDate>Sat, 31 Jan 2009 14:12:43 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/blog/2007/01/27/correlating-netflix-and-imdb-datasets/#comment-164946</guid>
		<description>Looking at your imdbMatrix script, it seems that you matched each Netflix movie with the top search result on IMDB. However, that seems like a very poor method. Just by taking the first few films in the Netflix dataset and searching for their names on the IMDB site, I get quite a few results that in fact refer to a different film. For example, Netflix id 17 is "7 seconds", but the first IMDB search result for it is "Gone in 60 seconds". The same seems to be true for quite a few other films.</description>
		<content:encoded><![CDATA[<p>Looking at your imdbMatrix script, it seems that you matched each Netflix movie with the top search result on IMDB. However, that seems like a very poor method. Just by taking the first few films in the Netflix dataset and searching for their names on the IMDB site, I get quite a few results that in fact refer to a different film. For example, Netflix id 17 is &#8220;7 seconds&#8221;, but the first IMDB search result for it is &#8220;Gone in 60 seconds&#8221;. The same seems to be true for quite a few other films.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: DNAHelix.org: Netflix Prize</title>
		<link>http://www.igvita.com/2007/01/27/correlating-netflix-and-imdb-datasets/comment-page-1/#comment-157991</link>
		<dc:creator>DNAHelix.org: Netflix Prize</dc:creator>
		<pubDate>Thu, 08 Jan 2009 00:44:29 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/blog/2007/01/27/correlating-netflix-and-imdb-datasets/#comment-157991</guid>
		<description>[...] Pickled Python data http://www.igvita.com/2007/01/27/correlating-netflix-and-imdb-datasets/ [...]</description>
		<content:encoded><![CDATA[<p>[...] Pickled Python data <a href="http://www.igvita.com/2007/01/27/correlating-netflix-and-imdb-datasets/" rel="nofollow">http://www.igvita.com/2007/01/27/correlating-netflix-and-imdb-datasets/</a> [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Netflix Prize Progress &#171; Social Mode</title>
		<link>http://www.igvita.com/2007/01/27/correlating-netflix-and-imdb-datasets/comment-page-1/#comment-107404</link>
		<dc:creator>Netflix Prize Progress &#171; Social Mode</dc:creator>
		<pubDate>Sun, 27 Jul 2008 18:59:36 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/blog/2007/01/27/correlating-netflix-and-imdb-datasets/#comment-107404</guid>
		<description>[...] I used the python pickles from Ilya G.  He did the hard work of creating feature vectors (coded descriptions) from IMDB data for things like release date, genre, cast and so forth.  This makes the KNN algorithms more interesting than just going off of the movie title, rating and year. [...]</description>
		<content:encoded><![CDATA[<p>[...] I used the python pickles from Ilya G.  He did the hard work of creating feature vectors (coded descriptions) from IMDB data for things like release date, genre, cast and so forth.  This makes the KNN algorithms more interesting than just going off of the movie title, rating and year. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Diarmuid</title>
		<link>http://www.igvita.com/2007/01/27/correlating-netflix-and-imdb-datasets/comment-page-1/#comment-101782</link>
		<dc:creator>Diarmuid</dc:creator>
		<pubDate>Wed, 05 Mar 2008 22:40:04 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/blog/2007/01/27/correlating-netflix-and-imdb-datasets/#comment-101782</guid>
		<description>Actually, now that I look at the data more carefully, I see that the numbers map 1 to 1. Sorry about that. 

thanks

D</description>
		<content:encoded><![CDATA[<p>Actually, now that I look at the data more carefully, I see that the numbers map 1 to 1. Sorry about that. </p>
<p>thanks</p>
<p>D</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Diarmuid</title>
		<link>http://www.igvita.com/2007/01/27/correlating-netflix-and-imdb-datasets/comment-page-1/#comment-101780</link>
		<dc:creator>Diarmuid</dc:creator>
		<pubDate>Wed, 05 Mar 2008 22:21:24 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/blog/2007/01/27/correlating-netflix-and-imdb-datasets/#comment-101780</guid>
		<description>Hi,
Great job on the extract. I have loaded the dictionary and see that I can get the features for a movie by passing in its ID. However, the IDs run from 1 to 17770, so how do they relate to the movie IDs in the Netflix data, and more specifically to the data generated by pyflix?
Also, do you have any info on which real-world feature each numeric feature corresponds to? I have assumed that 1030 to 1035 encode the rating, as they are common to all movies.
I have inverted your data to produce a dictionary of features and their corresponding movies.
cheers

Diarmuid</description>
		<content:encoded><![CDATA[<p>Hi,<br />
Great job on the extract. I have loaded the dictionary and see that I can get the features for a movie by passing in its ID. However, the IDs run from 1 to 17770, so how do they relate to the movie IDs in the Netflix data, and more specifically to the data generated by pyflix?<br />
Also, do you have any info on which real-world feature each numeric feature corresponds to? I have assumed that 1030 to 1035 encode the rating, as they are common to all movies.<br />
I have inverted your data to produce a dictionary of features and their corresponding movies.<br />
cheers</p>
<p>Diarmuid</p>
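<p>The inversion mentioned above could look roughly like this. It is only a sketch: it assumes the loaded data is a plain dict mapping each movie ID to a list of feature IDs, and the function name invert is made up for illustration.</p>

```python
from collections import defaultdict


def invert(features):
    """Turn a movie_id to feature_ids mapping into feature_id to movie_ids."""
    inverted = defaultdict(list)
    for movie_id, feature_ids in features.items():
        for feature_id in feature_ids:
            inverted[feature_id].append(movie_id)
    # Return a plain dict so missing features raise KeyError as usual.
    return dict(inverted)
```

<p>Looking up a feature ID in the result then lists every movie that carries that feature.</p>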
]]></content:encoded>
	</item>
	<item>
		<title>By: Ilya Grigorik</title>
		<link>http://www.igvita.com/2007/01/27/correlating-netflix-and-imdb-datasets/comment-page-1/#comment-48550</link>
		<dc:creator>Ilya Grigorik</dc:creator>
		<pubDate>Fri, 29 Jun 2007 13:30:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/blog/2007/01/27/correlating-netflix-and-imdb-datasets/#comment-48550</guid>
		<description>Pickle is part of the Python standard library, so you can deserialize the contents fairly easily. But to answer your question, the file itself is a dictionary in the following format: movie_id =&gt; [feature_id, feature_id,...]. The file does not say what each feature_id means, but you don't need to know that either: all you care about is the feature_id overlap with other movies.</description>
		<content:encoded><![CDATA[<p>Pickle is part of the Python standard library, so you can deserialize the contents fairly easily. But to answer your question, the file itself is a dictionary in the following format: movie_id => [feature_id, feature_id,...]. The file does not say what each feature_id means, but you don&#8217;t need to know that either: all you care about is the feature_id overlap with other movies.</p>
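<p>As a rough illustration of that format, the sketch below round-trips a tiny stand-in dictionary through pickle and counts shared feature IDs. The two entries are copied from the dump quoted in the question; the overlap function is a hypothetical helper, and for the real data you would open the pickle file in binary mode and call pickle.load on it instead.</p>

```python
import pickle

# Tiny stand-in for the real file: movie_id maps to a list of feature_ids.
features = {1: [737, 885, 997, 998, 999, 1035], 2: [886, 1035]}

# Protocol 0 is the text format shown in the question ("(dp0", "I1", ...).
blob = pickle.dumps(features, protocol=0)
loaded = pickle.loads(blob)


def overlap(a, b):
    """Number of feature_ids that movies a and b have in common."""
    return len(set(loaded[a]).intersection(loaded[b]))


print(overlap(1, 2))  # movies 1 and 2 share only feature 1035, so this prints 1
```

<p>A raw overlap count is the simplest possible similarity measure; dividing it by the size of the union of both feature sets is one common way to normalize it.</p>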
]]></content:encoded>
	</item>
	<item>
		<title>By: guest</title>
		<link>http://www.igvita.com/2007/01/27/correlating-netflix-and-imdb-datasets/comment-page-1/#comment-48519</link>
		<dc:creator>guest</dc:creator>
		<pubDate>Fri, 29 Jun 2007 07:12:22 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/blog/2007/01/27/correlating-netflix-and-imdb-datasets/#comment-48519</guid>
		<description>Many thanks for your work on the IMDB data, but I don't quite understand the two features-map.pickle files.

Would you mind explaining the file a little: what is the meaning of each line in a segment like this:

(dp0
I1
(lp1
I737
aI885
aI997
aI998
aI999
aI1035
asI2
(lp2
I886
aI1035
asI3
(lp3
I887



Or would you mind posting the whole pickle.py source, so that we can use it to convert the features into other formats?</description>
		<content:encoded><![CDATA[<p>Many thanks for your work on the IMDB data, but I don&#8217;t quite understand the two features-map.pickle files.</p>
<p>Would you mind explaining the file a little: what is the meaning of each line in a segment like this:</p>
<p>(dp0<br />
I1<br />
(lp1<br />
I737<br />
aI885<br />
aI997<br />
aI998<br />
aI999<br />
aI1035<br />
asI2<br />
(lp2<br />
I886<br />
aI1035<br />
asI3<br />
(lp3<br />
I887</p>
<p>Or would you mind posting the whole pickle.py source, so that we can use it to convert the features into other formats?</p>
]]></content:encoded>
	</item>
</channel>
</rss>
