Lab 2 - Data Collecting

Created Monday 17 November 2014

By the end of this lab we should be able to extract data from human-readable documents (web pages, PDFs, etc.) into machine-readable formats such as CSV, Excel and JSON.

  • Download files from a website to the local file system
  • Scrape tables from a website into CSV
  • Scrape tables from a PDF into CSV
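
As an illustration of the second objective, here is a minimal sketch that pulls tables out of an HTML page and writes them as CSV, using only the Python standard library. The names TableParser, tables_from_html and rows_to_csv are made up for this example, and the sample page is invented; real pages with nested tables or merged cells would need more care.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects every <table> in a page as a list of rows of cell text.

    Assumption: tables are not nested inside each other.
    """

    def __init__(self):
        super().__init__()
        self.tables = []     # finished tables
        self._table = None   # rows of the table being read
        self._row = None     # cells of the row being read
        self._cell = None    # text fragments of the cell being read

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None and self._table is not None:
            self._table.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._close_row()   # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()
        elif tag == "table" and self._table is not None:
            self._close_row()
            self.tables.append(self._table)
            self._table = None


def tables_from_html(html):
    """Return all tables in the HTML text as lists of rows."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


def rows_to_csv(rows):
    """Serialise one table (a list of rows) as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Invented sample page, only to show the shape of the result.
SAMPLE = """
<table>
  <tr><th>Ward</th><th>Park</th></tr>
  <tr><td>1</td><td>Park A, North</td></tr>
</table>
"""
print(rows_to_csv(tables_from_html(SAMPLE)[0]))
```

To use this on a live page, the HTML text would first be fetched (for example with urllib.request.urlopen) and the CSV text written to a file.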

Sample Exercise Questions

  • Get the list of schools and colleges under BBMP (Bruhat Bengaluru Mahanagara Palike)
  • Scrape the list of parks by ward under BBMP
  • Download all the Assets and Liabilities documents from the BBMP website
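
For the last exercise, a rough sketch of a bulk download in Python might look like the following. It finds every link on a page that points to a document and saves each one locally. The function names (document_links, download), the list of file extensions and the example.org address in the usage note are placeholders chosen for this example, not details of the BBMP website.

```python
import os
import shutil
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def document_links(html, base_url, extensions=(".pdf", ".xls", ".xlsx")):
    """Return absolute URLs of linked documents, without duplicates.

    `extensions` must be a tuple; the default list is an assumption
    about which file types are of interest.
    """
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        path = urlparse(url).path.lower()
        if path.endswith(extensions) and url not in links:
            links.append(url)
    return links


def download(url, dest_dir="."):
    """Save the file at `url` into `dest_dir` and return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    name = os.path.basename(urlparse(url).path)
    path = os.path.join(dest_dir, name)
    with urllib.request.urlopen(url) as resp, open(path, "wb") as out:
        shutil.copyfileobj(resp, out)
    return path
```

A typical run would fetch the page text, call document_links(page_text, "https://example.org/accounts/") with the real page address in place of the placeholder, and then call download(url, "downloads") for each URL returned. Browser add-ons such as DownThemAll do the same job without any code.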

Example Data Sets for Lab

Tools

DownThemAll

PDF Tables

Tabula

  • Tabula
  • Extracts tables from most PDFs (except scanned PDFs, where the tables are embedded as images)
  • Open source

Table Capture

Table2Clipboard

Other Tools

  • ScraperWiki - a platform with scraping tools you can use yourself; it also has a team of data scientists who can do the job for you.
  • iMacros - a Firefox add-on that automates repetitive browser tasks such as visiting the same sites every day, filling out forms and remembering passwords.