How To: Build a Simple Data Scraper

We’ve seen how to use spreadsheets to work with data, but what if that data is in a table, not a spreadsheet? You COULD type it all up … or you could write a simple data scraper! Relax – if you know a tiny bit about spreadsheets, it’s pretty easy.

Quick Review of Formulas (in Google Drive):

  • Formulas here work just like in Excel
  • Two parts: Formula (=SUM) and Parameters (in the parentheses)
    • Can have one parameter (A1:A20) or more (B1:B10, “Y”)
    • Parameters can be either strings (in quotes) or numbers
    • Example: =SUM(A1:A20) will add up all the values in the range of cells from A1 to A20.
    • Other formulas:
      • =COUNT(A1:A?)
      • =AVERAGE(A1:A?)
      • There are also IF formulas (COUNTIF, SUMIF, AVERAGEIF) that take two parameters: =COUNTIF(A1:A?, >100) counts all cells in a range that have values greater than 100.

Recipe for a Simple Data Scraper (using Google Drive):

  1. Find a website of interest that includes data in a table (Wikipedia has lots)
    1. Anywhere you’d expect a table or list (e.g. Oscar winners, baseball stats, lists of prison)
    2. You can check in source code by searching <table, <ul, or <ol (don’t include the >)
  2. Note the following:
    1. The URL
    2. The index of the table (e.g., “1” for first table on the page, etc.)
  3. In new Google Docs spreadsheet, paste the following 3-parameter formula:
    1. =ImportHTML(url,query,index)
      1. url is the website’s URL [string]
      2. query is the HTML tag you want (e.g., “table” or “list”) [string]
      3. index is the ranking of that query on the page [number]
      4. Oscar winners example: =ImportHTML(“http://en.wikipedia.org/wiki/Oscars”, “table”, 2)
  1. You can customize it, too:
    1. Delete your formula from A1 and re-paste in B1
    2. Enter the three parameters in A1 (url), A2 (query), and A3 (index)
    3. Paste the formula in B1 and replace parameters with A1, A2, and A3
    4. This allows you to change parameters more easily and visibly
  2. ADVANCED: There’s an =ImportXML formula as well
    1. XML is heavily structured and uses more specific tags, like <book>
    2. Example from openlylocal.com, an open gov’t site: =importXML(“http://openlylocal.com/councils.xml&#8221;,”councils/council”) – Example drawn from Bradshaw’s book (see below)
  3. Want more? For $15, you can download Paul Bradshaw’s “Scraping for Journalists,” an excellent PDF book (first chapter is free): https://leanpub.com/scrapingforjournalists

Having problems?

  • Don’t forget to start the formula with an = sign
  • Check to see if you included quotes around your URL and query
  • Try a different index
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: