Lab 2 - Data Collecting
Created Monday 17 November 2014
By the end of this lab we should be able to extract data from human-readable documents (web pages, PDFs, etc.) into machine-readable formats such as CSV, Excel and JSON.
- Download files from a website to the local file system
- Scrape tables from a website into CSV
- Scrape tables from a PDF into CSV
Sample Exercise Questions
- Get the list of schools and colleges under BBMP
- Scrape the list of parks by ward under BBMP
- Download all the Assets and Liabilities documents from the BBMP website
Example Data Sets for Lab
- BBMP Schools info
- BBMP Park Info
- Assets and Liabilities of BBMP Councillors
Tools
DownThemAll
- DownThemAll Mozilla Addon
- DownThemAll website for more information
- Can download all the files of a given format linked from a page (for example, every PDF)
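For anyone who prefers a script to the add-on, the same "download by format" idea can be sketched in Python using only the standard library. This is a minimal sketch, not how DownThemAll itself works: the names `LinkCollector`, `links_by_extension` and `download_all`, the sample HTML and the `example.org` URL are all made up for illustration.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlretrieve


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def links_by_extension(html, base_url, extensions):
    """Return absolute URLs whose path ends with one of the extensions."""
    collector = LinkCollector()
    collector.feed(html)
    wanted = tuple(ext.lower() for ext in extensions)
    urls = []
    for href in collector.links:
        url = urljoin(base_url, href)  # resolve relative links
        if urlparse(url).path.lower().endswith(wanted):
            urls.append(url)
    return urls


def download_all(urls, target_dir):
    """Save each URL into target_dir, named after its last path segment."""
    os.makedirs(target_dir, exist_ok=True)
    for url in urls:
        filename = os.path.basename(urlparse(url).path)
        urlretrieve(url, os.path.join(target_dir, filename))


# Hypothetical page content and base URL, for illustration only.
SAMPLE_HTML = """
<a href="/docs/ward1.pdf">Ward 1</a>
<a href="docs/ward2.PDF">Ward 2</a>
<a href="/about.html">About</a>
"""
pdf_urls = links_by_extension(SAMPLE_HTML, "http://example.org/list/", [".pdf"])
print(pdf_urls)
```

On a real page, the HTML would first be fetched (for example with `urllib.request.urlopen`) and the resulting list passed to `download_all(pdf_urls, "downloads")`; that call is left out here so the sketch runs without network access.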
PDF Tables
- PDF Tables
- Web-based service, nothing to install
Tabula
- Tabula
- Extracts tables from most PDFs (except the ones whose pages are embedded images, such as scans)
- Open source
Table Capture
- Table Capture Addon for Google Chrome
- Exports HTML tables on a web page to a CSV file
Table2Clipboard
- Table2Clipboard Addon for Firefox
- Exports HTML tables on a web page to a CSV file
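What these table add-ons do can also be approximated in a few lines of Python. The sketch below uses only the standard library and assumes simple, non-nested tables; `TableParser`, `table_to_csv` and the sample ward/park rows are hypothetical names and data, not taken from either add-on or from the BBMP website.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text.

    Assumes tables are not nested inside one another.
    """

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def table_to_csv(rows):
    """Serialise a list of rows to CSV text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Made-up sample table, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>Ward</th><th>Park</th></tr>
  <tr><td>1</td><td>Sample Park, East</td></tr>
</table>
"""
parser = TableParser()
parser.feed(SAMPLE_HTML)
csv_text = table_to_csv(parser.tables[0])
print(csv_text)
```

The `csv` module quotes the cell that contains a comma, which is the main reason to use it rather than joining cells by hand.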
Other Tools
- ScraperWiki - a platform with scraping tools you can use yourself; its team of data scientists can also do the job for you.
- iMacros - a Firefox add-on that automates repetitive browser tasks such as visiting the same sites every day, filling out forms and remembering passwords.