An Introduction to Screen Scraping
Screen scraping is a technique for using software to grab data from websites. Many websites we use every day are poorly designed and difficult to use; just look at most internet banking sites. Fortunately, programmatically accessing information on these sites can be relatively easy.
Toolkit
The essential elements of a screen scraper’s toolkit are:
- Firebug, the Firefox plugin (Safari’s Web Inspector is also good)
- A scripting language (I’m going to use Ruby here)
- A flexible HTML/XML parser that tolerates messy markup (I usually use hpricot)
- Libraries for dealing with HTTP
Typical Workflow
1. Identify the form or GET parameters you want to manipulate
2. Identify where the results appear in the page using Firebug (right-click the element and select Inspect Element)
3. Attempt to trigger the resulting page successfully
4. Attempt to process the results and extract the data
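For a GET-based search, identifying the parameters and triggering the results page mostly comes down to building the right URL. Here is a minimal sketch using Ruby's standard uri library. The endpoint and the q parameter are made up for illustration, so substitute whatever you found when inspecting the real form.

```ruby
require 'uri'

# Hypothetical endpoint, for illustration only.
BASE_URL = 'http://www.example.com/search'

# Append the given parameters to the base URL as an escaped query string.
def build_search_url(base, params)
  uri = URI.parse(base)
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

# 'q' is an assumed parameter name; use the one the real form submits.
url = build_search_url(BASE_URL, { 'q' => 'ruby books' })
puts url
```

Letting URI.encode_www_form do the escaping avoids hand-rolled string concatenation, which tends to break as soon as a search term contains a space or an ampersand.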
Some sites rely on cookies, which can make step 3 an exercise in patience. Firebug's Net panel shows the requests and cookies the browser sends, which makes reverse engineering pages and HTTP requests much easier.
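When a site insists on a session cookie, one common approach is to request an entry page first and replay the cookies it sets on the request you actually care about. The sketch below uses Ruby's standard net/http library. The helper names and the two-request flow are assumptions for illustration, and real sites may need more than this (login forms, redirects, expiring sessions).

```ruby
require 'net/http'
require 'uri'

# Collapse a response's Set-Cookie header values into one Cookie header value,
# keeping only the name=value part of each and dropping attributes such as Path.
def cookie_header(set_cookie_fields)
  Array(set_cookie_fields).map { |field| field.split(';').first.strip }.join('; ')
end

# Hypothetical helper: visit entry_url to pick up cookies, then fetch target_url
# with those cookies attached. Needs network access, so it is not called here.
def fetch_with_session(entry_url, target_url)
  entry = Net::HTTP.get_response(URI.parse(entry_url))
  cookies = cookie_header(entry.get_fields('Set-Cookie'))

  target = URI.parse(target_url)
  request = Net::HTTP::Get.new(target.request_uri)
  request['Cookie'] = cookies unless cookies.empty?

  Net::HTTP.start(target.host, target.port, use_ssl: target.scheme == 'https') do |http|
    http.request(request).body
  end
end

puts cookie_header(['session=abc123; Path=/; HttpOnly', 'lang=en; Path=/'])
```

Comparing the Cookie header this produces against the one Firebug shows for the same request is a quick way to spot what the script is still missing.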
Simple Example
Let’s scrape a shopping site for search results. I want to get the prices of search results from Play.com. Play.com works like this:
- Search uses HTTP GET and a searchstring parameter. Let's use Ruby's open-uri library to make this easy
- The results are in div.info HTML blocks with h5 or h6 headers. hpricot will fetch these elements with a CSS selector: “.info h6”
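The two points above can be sketched roughly as follows. This is not a working Play.com scraper: the search URL is a placeholder, the sample markup is invented to match the div.info / h6 structure described, and REXML (which ships with Ruby) stands in for hpricot so the example runs without extra gems. REXML only accepts well-formed markup, so for real pages you would want a forgiving parser such as hpricot, where the equivalent query would be the “.info h6” CSS selector.

```ruby
require 'open-uri'
require 'uri'
require 'rexml/document'

# Placeholder endpoint (assumption); substitute the site's real search URL.
SEARCH_BASE = 'http://www.example.com/Search.html'

# Build the GET URL using the searchstring parameter.
def search_url(term)
  "#{SEARCH_BASE}?#{URI.encode_www_form({ 'searchstring' => term })}"
end

# Download the page body with open-uri (needs network access; not called here).
def fetch(url)
  URI.open(url, &:read)
end

# Return the text of every h6 found inside a div with class "info".
# This XPath plays the role of the ".info h6" CSS selector.
def result_headers(html)
  doc = REXML::Document.new(html)
  REXML::XPath.match(doc, "//div[@class='info']//h6").map { |h| h.text.to_s.strip }
end

# Invented sample markup that mimics the structure described above.
SAMPLE = <<~HTML
  <div id="results">
    <div class="info"><h6>First result</h6></div>
    <div class="info"><h6>Second result</h6></div>
    <div class="other"><h6>Not a result</h6></div>
  </div>
HTML

puts search_url('doctor who')
puts result_headers(SAMPLE)
```

Against a live site you would pass fetch(search_url(term)) to the parsing step instead of SAMPLE, and most likely swap the XPath for the parser's own CSS selector support.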
Further Reading
There are dedicated scraping libraries out there: try scrubyt for starters. There are also loads of Python and Perl libraries for scraping: many Ruby libraries are derived from Perl’s WWW::Mechanize.

