Screen Scraping: Cookies, Headers
This is a follow-up to yesterday’s article, An Introduction to Screen Scraping. I’m going to show you the basics behind reverse engineering headers and cookies.
Note: These posts might not display the examples in your feed reader, so try visiting Quite Useful in a browser to see them.
HTTP POST
Some sites only accept a POST for a particular form. Yesterday’s example used the Ruby library open-uri to perform a GET request. Whilst this is very clean, it’s also less flexible than the basic HTTP libraries that come bundled with Ruby.
Here’s an example of an HTTP POST:
This attempts a login on a service I run; I don’t mind if this example is run. The code itself is simple:
- The uri library is used to manipulate a URL, grabbing the hostname and path as required
- net/http is used to create an HTTP POST request with custom headers
- The full response is displayed with a message that determines the outcome of the login attempt
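A minimal sketch of the steps above might look like this. The URL (`http://example.com/login`), the form field names (`email`, `password`), the User-Agent string and the "redirect means success" check are all placeholders I’ve assumed for illustration, not the real service’s values, so adjust them to match the form you’re targeting.

```ruby
require 'uri'
require 'net/http'

# Placeholder URL: substitute the address the real form posts to.
LOGIN_URL = 'http://example.com/login'

# Build a POST request with form fields and a custom header.
# The field names 'email' and 'password' are assumptions.
def build_login_request(uri, email, password)
  headers = { 'User-Agent' => 'Mozilla/5.0 (compatible; ExampleScraper/1.0)' }
  request = Net::HTTP::Post.new(uri.path, headers)
  # set_form_data URL-encodes the fields and sets the Content-Type header.
  request.set_form_data('email' => email, 'password' => password)
  request
end

# Send the login POST and return the Net::HTTPResponse.
def post_login(url, email, password)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port) do |http|
    http.request(build_login_request(uri, email, password))
  end
end

# Assumption: the site redirects after a successful login.
# Many sites behave differently, so check what yours returns.
def login_succeeded?(response)
  response.is_a?(Net::HTTPRedirection)
end

# Print the status line, headers and body, then the outcome.
def show_response(response)
  puts "#{response.code} #{response.message}"
  response.each_header { |name, value| puts "#{name}: #{value}" }
  puts response.body
  puts login_succeeded?(response) ? 'Login succeeded' : 'Login failed'
end
```

To try it, call `show_response(post_login(LOGIN_URL, 'you@example.com', 'secret'))` with your own details.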
Cookies
Many sites use cookies to track sessions, to guard against cross-site request forgery, or just to make life difficult for us scrapers. Some people are probably using Microsoft tools to create sites and don’t even know what’s going on, so don’t get too angry if a site is hard to manipulate.
Let’s use cookies with the previous example to actually log in and request a document from Helipad. You’ll probably want to create an account to test this out yourself.
This example fetches the create document page and tests the result for an ID that I know appears on that page.
- Note that http.get2 has been used to pass headers to the server
- The cookie (session ID in this case) was captured from the initial login POST and has been used in the GET request
- If the Preview ID is found in the page, then the create document page has been successfully fetched while logged in
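The cookie handling described above can be sketched roughly as follows. The helper names, the host and path you pass in, and the `preview` ID are hypothetical stand-ins; the only library calls used are `get_fields` and `get2` from net/http.

```ruby
require 'net/http'

# Turn Set-Cookie values into a single Cookie header value.
# Only the leading "name=value" pair is kept; attributes such as
# path or expires are dropped.
def cookie_header(set_cookie_values)
  set_cookie_values.map { |value| value.split(';').first.strip }.join('; ')
end

# Extract the cookies from a response (for example, the login POST).
# get_fields returns nil when the header is absent.
def cookies_from(response)
  cookie_header(response.get_fields('Set-Cookie') || [])
end

# GET a page, passing the captured cookie back to the server.
def fetch_with_cookie(host, path, cookie, port = 80)
  Net::HTTP.start(host, port) do |http|
    http.get2(path, 'Cookie' => cookie)
  end
end

# Check whether an element ID appears in the page source.
# A plain substring match is crude but enough for a quick test.
def page_contains_id?(html, id)
  html.include?(%(id="#{id}"))
end
```

With a login response in hand, something like `page = fetch_with_cookie(host, path, cookies_from(login_response))` followed by `page_contains_id?(page.body, 'preview')` tells you whether the page was fetched while logged in.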
Reverse Engineering Headers
Sometimes headers must be cloned. This can be extra work, but Firebug makes it easy.
- Enable Firebug for the site you want to reverse engineer
- Fill out the form you’re targeting and submit it
- In Firebug, click Net then HTTP and look at the Request Headers
- Compare this against the headers you’re sending in your script
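The comparison in the last step can be done by eye, but a small helper makes the gaps obvious. This is only a sketch: `header_differences` is a name I’ve made up, and the header values in the usage note are invented examples rather than anything copied from a real site.

```ruby
# Lower-case header names so 'User-Agent' and 'user-agent' match;
# HTTP header names are case-insensitive.
def normalise_headers(headers)
  headers.each_with_object({}) do |(name, value), out|
    out[name.downcase] = value
  end
end

# Return the browser's headers that the script is either not sending
# or sending with a different value.
def header_differences(browser_headers, script_headers)
  browser = normalise_headers(browser_headers)
  script = normalise_headers(script_headers)
  browser.reject { |name, value| script[name] == value }
end
```

For instance, passing the headers copied from Firebug as the first argument and your script’s headers as the second returns a hash of whatever is still missing or different, such as a `referer` the script never set.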
You might find mechanisation libraries that streamline some of this work, but once you’ve got working code that sends POST requests, sets headers and handles cookies, most sites can be scraped successfully.

