Scraping National Air Quality Data

We live in Bengaluru/Bangalore, and as a family we suffer due to the air quality. Given this concern, I wanted a data set to analyze what has happened over the years and what to expect, and to compare the city with other cities. The National Air Quality Index has this data, but it's available only for major cities, and even cities like Bengaluru have very few stations. Bangalore has ten sensors across the city, which is not a lot. Anyhow, this is the only official data set available, and unfortunately the site doesn't allow you to download a larger set for personal use. Hence the scraper.

The scraper has three scripts and a SQLite db at cpcbccr/data/db/data.sqlite3. The scripts and some data are available on GitHub. The code is under GNU GPL v3. Link to this blog post if you'd like to give credit. I appreciate it, even though it's not a requirement :)

1. To start using the scripts, clone the git repo https://github.com/thejeshgn/cpcbccr

 git clone https://github.com/thejeshgn/cpcbccr 

2. Get the sites for which you want data from https://app.cpcbccr.com/AQI_India/, as in the screenshot below. Open data.sqlite3 and add the sites to the sites table manually; I use the DB Browser for SQLite GUI client. If you prefer to script this step, see the sketch after the screenshots. The db is at

cpcbccr/data/db/data.sqlite3
Website – get site information

Screenshot: the sites table
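If you'd rather script this step than use a GUI, here is a minimal sketch using the dataset library (installed in step 4). The column names and sample values are my guesses for illustration; check the actual schema of the sites table in the repo first.

# Sketch: add a site row programmatically instead of via DB Browser.
# NOTE: the column names (site_id, site_name, city, state) and the
# sample values are assumptions; match them to the real sites schema.
import dataset

db = dataset.connect("sqlite:///cpcbccr/data/db/data.sqlite3")
sites = db["sites"]
sites.insert({
    "site_id": "site_XXX",  # hypothetical id; copy the real one from the website
    "site_name": "BTM Layout, Bengaluru - CPCB",
    "city": "Bengaluru",
    "state": "Karnataka",
})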

3. Update fromDate and endDate in setup_pull.py to the date range for which data needs to be pulled

	fromDate = "1-11-2018"  #TODO 2: starting date
	endDate = "31-10-2019"   #TODO 3: ending date, will be next day after that day

4. Install the dependencies and run setup_pull.py. This sets up all the requests that need to be run.

pip install dataset requests
python setup_pull.py

5. Run pull.py. This executes the requests and pulls the JSON data. You can stop and restart it at any time. Run it as slowly as possible, and in batches; as of now it sends one request every five seconds. All the pulled data, along with the request status, is stored in the request_status_data table in data.sqlite3. A sketch of the pattern follows the command below.

python pull.py
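The resumability comes from tracking every request's status in the db, so a re-run just skips whatever already succeeded. Here is a minimal sketch of that pattern, not the exact code; apart from the request_status_data table name, the columns and fields are my assumptions:

# Sketch of the resumable, rate-limited pull loop. The columns
# (id, url, payload, status, response) are illustrative assumptions;
# see pull.py in the repo for the real implementation.
import time
import dataset
import requests

db = dataset.connect("sqlite:///cpcbccr/data/db/data.sqlite3")
table = db["request_status_data"]

for row in list(table.find(status="pending")):
    resp = requests.post(row["url"], data=row["payload"])
    table.update({
        "id": row["id"],
        "status": "done" if resp.ok else "error",
        "response": resp.text,
    }, ["id"])
    time.sleep(5)  # be polite: one request every five seconds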

6. Run parse.py. It parses the JSON data and puts it into a flat table called data (sketched below).

python parse.py
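Conceptually it walks the stored responses and writes one flat row per record. A rough sketch; the JSON layout and field names below are guesses for illustration, and the real structure is whatever parse.py handles:

# Sketch of flattening the pulled JSON into the `data` table.
# ASSUMPTION: the response layout and field names are illustrative only.
import json
import dataset

db = dataset.connect("sqlite:///cpcbccr/data/db/data.sqlite3")
data = db["data"]

for row in db["request_status_data"].find(status="done"):
    payload = json.loads(row["response"])
    for record in payload.get("data", []):  # hypothetical key
        data.insert({
            "station": record.get("station"),
            "from_date": record.get("fromDate"),
            "to_date": record.get("toDate"),
            "parameter": record.get("parameter"),
            "value": record.get("value"),
        })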

7. Export the data from the data table. It's straightforward; see the sketch after the screenshot.

Screenshot: the data table, which contains the final AQ data
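For instance, with the same dataset library you can dump the table to CSV in a few lines:

# Sketch: export the final `data` table to a CSV file.
# Assumes the table is non-empty.
import csv
import dataset

db = dataset.connect("sqlite:///cpcbccr/data/db/data.sqlite3")
rows = list(db["data"].all())

with open("aq_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)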

Note: As of now I am pulling only PM10 and PM2.5, but you can pull other parameters too. To do so, update the following line in setup_pull.py:

		prompt_both='{"draw":2,"columns":[{"data":0,"name":"","searchable":true,"orderable":false,"search":{"value":"","regex":false}}],"order":[],"start":10,"length":10,"search":{"value":"","regex":false},"filtersToApply":{"parameter_list":[{"id":0,"itemName":"PM2.5","itemValue":"parameter_193"},{"id":1,"itemName":"PM10","itemValue":"parameter_215"}],"criteria":"4 Hours","reportFormat":"Tabular","fromDate":"'+fromDate+'","toDate":"'+toDate+'","state":"'+state+'","city":"'+city+'","station":"'+site+'","parameter":["parameter_193","parameter_215"],"parameterNames":["PM2.5","PM10"]},"pagination":1}'

You can refer to the requests the browser sends on this page to form this or a similar query.
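To add another pollutant, extend parameter_list and keep the parameter and parameterNames arrays in sync. A sketch follows; I have only verified the PM2.5 (parameter_193) and PM10 (parameter_215) ids from the query above, so the NO2 id is a placeholder you need to look up in the browser's network tab:

# Sketch: build the filter pieces as Python data instead of a raw string.
# "parameter_XXX" is a placeholder; find the real NO2 id in the browser.
import json

params = [
    {"itemName": "PM2.5", "itemValue": "parameter_193"},
    {"itemName": "PM10", "itemValue": "parameter_215"},
    {"itemName": "NO2", "itemValue": "parameter_XXX"},  # placeholder id
]
filters = {
    "parameter_list": [dict(p, id=i) for i, p in enumerate(params)],
    "parameter": [p["itemValue"] for p in params],
    "parameterNames": [p["itemName"] for p in params],
}
print(json.dumps(filters))  # merge these keys into filtersToApply in prompt_both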

I have used this code for more than a year now and it seems to work well. Let me know what you think. In the near future I will turn the scripts into Digdag tasks so they can be scheduled easily.


2 Responses

  1. Rajesh Shenoy says:

    Thanks for this post. I found it very easy to download air pollution data!

    Got one question: in setup_pull.py, I am trying to modify the query to pull something different – hourly data instead of 4-hour data, plus some additional parameters. In the developer tools I couldn't view the query formed by the browser in the page's POST request, as the parameters are encrypted, if my understanding is correct. Is there any other way to view the query? Please suggest.

  2. nikhil says:

    Hi, how do I find the "itemName"/"itemValue" pairs (like "itemName":"PM2.5", "itemValue":"parameter_193") for other parameters like NO2 in the POST JSON request?
