<?xml version="1.0" encoding="utf-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Ruby Screen-Scraper in 60 Seconds</title>
	<atom:link href="http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/</link>
	<description>A goal is a dream with a deadline.</description>
	<pubDate>Fri, 29 Aug 2008 04:32:18 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
		<item>
		<title>By: Daddy Fixes Everything &#187; Blog Archive &#187; Tutorial: Data Updates using YAML Files in Rails</title>
		<link>http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/#comment-103116</link>
		<dc:creator>Daddy Fixes Everything &#187; Blog Archive &#187; Tutorial: Data Updates using YAML Files in Rails</dc:creator>
		<pubDate>Sat, 10 May 2008 04:10:33 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/blog/2007/02/04/ruby-screen-scraper-in-60-seconds/#comment-103116</guid>
		<description>[...] the parts you want. I tend to use XPaths because it&#8217;s so easy to find them with firefox (see: hpricot/firefox/firebug screen scraping tutorial). It finds my target table (by css selector), then parses through each element looking for the data [...]</description>
		<content:encoded><![CDATA[<p>[...] the parts you want. I tend to use XPaths because it&#8217;s so easy to find them with firefox (see: hpricot/firefox/firebug screen scraping tutorial). It finds my target table (by css selector), then parses through each element looking for the data [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ilya Grigorik</title>
		<link>http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/#comment-102622</link>
		<dc:creator>Ilya Grigorik</dc:creator>
		<pubDate>Tue, 22 Apr 2008 03:44:26 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/blog/2007/02/04/ruby-screen-scraper-in-60-seconds/#comment-102622</guid>
		<description>Thanks Jimmy, I updated the link.</description>
		<content:encoded><![CDATA[<p>Thanks Jimmy, I updated the link.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jimmy Z</title>
		<link>http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/#comment-102613</link>
		<dc:creator>Jimmy Z</dc:creator>
		<pubDate>Mon, 21 Apr 2008 17:59:50 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/blog/2007/02/04/ruby-screen-scraper-in-60-seconds/#comment-102613</guid>
		<description>Harish Mallipeddi's Python scraping URL has changed to:

http://mallipeddi.tumblr.com/post/29578374</description>
		<content:encoded><![CDATA[<p>Harish Mallipeddi&#8217;s Python scraping URL has changed to:</p>
<p><a href="http://mallipeddi.tumblr.com/post/29578374" rel="nofollow">http://mallipeddi.tumblr.com/post/29578374</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ilya Grigorik</title>
		<link>http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/#comment-102320</link>
		<dc:creator>Ilya Grigorik</dc:creator>
		<pubDate>Mon, 07 Apr 2008 11:19:22 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/blog/2007/02/04/ruby-screen-scraper-in-60-seconds/#comment-102320</guid>
		<description>Thanks for sharing the tip, Nick. What you're seeing is expected behavior - that's what XPath was designed to do!</description>
		<content:encoded><![CDATA[<p>Thanks for sharing the tip, Nick. What you&#8217;re seeing is expected behavior - that&#8217;s what XPath was designed to do!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Nick</title>
		<link>http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/#comment-102317</link>
		<dc:creator>Nick</dc:creator>
		<pubDate>Sun, 06 Apr 2008 22:11:29 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/blog/2007/02/04/ruby-screen-scraper-in-60-seconds/#comment-102317</guid>
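		<description>I just started using this technique as part of a scraping/teach-myself-Ruby project, and I really like it. One issue I've come across that hasn't been addressed in the comments: Firebug won't specify the index of the first element in a series, even if it's necessary. So if your HTML is:
&lt;code&gt;
&amp;lt;div&amp;gt;
  &amp;lt;div&amp;gt;&amp;lt;h1&amp;gt;Test 1&amp;lt;/h1&amp;gt;&amp;lt;/div&amp;gt;
  &amp;lt;div&amp;gt;&amp;lt;h1&amp;gt;Test 2&amp;lt;/h1&amp;gt;&amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;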

(not sure if this will display properly - it's a div, with 2 nested divs inside, each with an H1 tag inside)
and you ask Firebug for the XPath of the first H1 tag, it will say "html/body/div/div/h1". Ask Hpricot for the inner_html, like this:

puts((doc/"html/body/div/div/h1").inner_html)

and you get "Test 1Test 2" - the joined contents of both H1 tags. In order to get the first H1, the one you want, you need to change the XPath to "html/body/div/div[1]/h1".

Hope that's helpful to someone - thanks for a great technique!</description>
		<content:encoded><![CDATA[<p>I just started using this technique as part of a scraping/teach-myself-Ruby project, and I really like it. One issue I&#8217;ve come across that hasn&#8217;t been addressed in the comments: Firebug won&#8217;t specify the index of the first element in a series, even if it&#8217;s necessary. So if your HTML is:</p>
<p><code>&lt;div&gt;<br />
  &lt;div&gt;&lt;h1&gt;Test 1&lt;/h1&gt;&lt;/div&gt;<br />
  &lt;div&gt;&lt;h1&gt;Test 2&lt;/h1&gt;&lt;/div&gt;<br />
&lt;/div&gt;</code></p>
<p>(not sure if this will display properly - it&#8217;s a div, with 2 nested divs inside, each with an H1 tag inside)<br />
and you ask Firebug for the XPath of the first H1 tag, it will say "html/body/div/div/h1". Ask Hpricot for the inner_html, like this:</p>
<p>puts((doc/"html/body/div/div/h1").inner_html)</p>
<p>and you get "Test 1Test 2" - the joined contents of both H1 tags. In order to get the first H1, the one you want, you need to change the XPath to "html/body/div/div[1]/h1".</p>
<p>Hope that&#8217;s helpful to someone - thanks for a great technique!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ilya Grigorik</title>
		<link>http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/#comment-101675</link>
		<dc:creator>Ilya Grigorik</dc:creator>
		<pubDate>Thu, 28 Feb 2008 02:54:06 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/blog/2007/02/04/ruby-screen-scraper-in-60-seconds/#comment-101675</guid>
		<description>Tim, it could be due to a broken DOM tree. Firefox, just like any other browser, often 'magically' rebuilds broken DOM trees on the fly. This could be the case here - Firebug is picking up the path from the repaired tree, whereas the actual source code does not contain that node. That would be my best guess.</description>
		<content:encoded><![CDATA[<p>Tim, it could be due to a broken DOM tree. Firefox, just like any other browser, often &#8216;magically&#8217; rebuilds broken DOM trees on the fly. This could be the case here - Firebug is picking up the path from the repaired tree, whereas the actual source code does not contain that node. That would be my best guess.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: TIm</title>
		<link>http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/#comment-101615</link>
		<dc:creator>TIm</dc:creator>
		<pubDate>Fri, 22 Feb 2008 10:05:40 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/blog/2007/02/04/ruby-screen-scraper-in-60-seconds/#comment-101615</guid>
		<description>Hey,
Thanks for the tip. I do have a question though. It looks as though the following code, using the full XPath retrieved by Firebug, returns nil:

puts (doc/"/html/body/div[2]/div/div/blockquote/p").inner_html

So in this case, Firebug gave an XPath that didn't actually function as predicted.

The code you used in your program (using a less specific XPath and selecting the first element from the elements of the array) returns the text as expected.

puts (doc/"blockquote/p").first.inner_html

But both should work. Any explanation?
Thanks,
Tim</description>
		<content:encoded><![CDATA[<p>Hey,<br />
Thanks for the tip. I do have a question though. It looks as though the following code, using the full XPath retrieved by Firebug, returns nil:</p>
<p>puts (doc/"/html/body/div[2]/div/div/blockquote/p").inner_html</p>
<p>So in this case, Firebug gave an XPath that didn&#8217;t actually function as predicted.</p>
<p>The code you used in your program (using a less specific XPath and selecting the first element from the elements of the array) returns the text as expected.</p>
<p>puts (doc/"blockquote/p").first.inner_html</p>
<p>But both should work. Any explanation?<br />
Thanks,<br />
Tim</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Scraping the SAQ: Hpricot and Xpath gotchas &#8212; Some French Guy</title>
		<link>http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/#comment-101605</link>
		<dc:creator>Scraping the SAQ: Hpricot and Xpath gotchas &#8212; Some French Guy</dc:creator>
		<pubDate>Wed, 20 Feb 2008 18:53:14 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/blog/2007/02/04/ruby-screen-scraper-in-60-seconds/#comment-101605</guid>
		<description>[...] play. Building a scraper to retrieve data from those files should be just as easy - see e.g. Ruby Screen-Scraper in 60 Seconds. Basically, copy the XPath for the data you want from firebug, and paste inside your script - and [...]</description>
		<content:encoded><![CDATA[<p>[...] play. Building a scraper to retrieve data from those files should be just as easy - see e.g. Ruby Screen-Scraper in 60 Seconds. Basically, copy the XPath for the data you want from firebug, and paste inside your script - and [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ilya Grigorik</title>
		<link>http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/#comment-100991</link>
		<dc:creator>Ilya Grigorik</dc:creator>
		<pubDate>Sat, 12 Jan 2008 19:05:42 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/blog/2007/02/04/ruby-screen-scraper-in-60-seconds/#comment-100991</guid>
		<description>Jefe, I would start on Wikipedia (&lt;a href="http://en.wikipedia.org/wiki/Screen_scraping" rel="nofollow"&gt;Screen Scraping&lt;/a&gt;), and follow the links from there.</description>
		<content:encoded><![CDATA[<p>Jefe, I would start on Wikipedia (<a href="http://en.wikipedia.org/wiki/Screen_scraping" rel="nofollow">Screen Scraping</a>), and follow the links from there.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jefe</title>
		<link>http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/#comment-100941</link>
		<dc:creator>jefe</dc:creator>
		<pubDate>Wed, 09 Jan 2008 20:22:25 +0000</pubDate>
		<guid isPermaLink="false">http://www.igvita.com/blog/2007/02/04/ruby-screen-scraper-in-60-seconds/#comment-100941</guid>
		<description>OK, I am not getting this. I have been trying to teach myself how to parse a table for info I can use, or scrape a screen, and haven't grasped it yet. Does anyone know a tutorial that includes every step of parsing or scraping? PHP would be OK; I don't have any background in Ruby or Python.</description>
		<content:encoded><![CDATA[<p>OK, I am not getting this. I have been trying to teach myself how to parse a table for info I can use, or scrape a screen, and haven&#8217;t grasped it yet. Does anyone know a tutorial that includes every step of parsing or scraping? PHP would be OK; I don&#8217;t have any background in Ruby or Python.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
