Mapping Local Government Directory to WikiData
The Indian government maintains a directory and hierarchy of local governments and administrative areas in India called the Local Government Directory (LGD). Recently I wanted to map its entries to WikiData items, that is, to map each administrative area in the LGD to the corresponding item on WikiData and update the items where necessary.
As a first step, I downloaded the data from the LGD. It's painful but possible. If you are okay with slightly old data, you can also use this git repository. I loaded some of those sheets into SQLite so that the data is easy for me to work on and publish. My work is in the project repository at
https://github.com/datameet/india-local-government-directory
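Loading a downloaded sheet into SQLite can be sketched roughly as below. This is only an illustration, not necessarily how the repository does it: the table name `states`, the helper `load_csv_into_sqlite`, and the sample rows are assumptions made up for the example. Only the column names come from the LGD data discussed in this post.

```python
import csv
import io
import sqlite3


def load_csv_into_sqlite(conn, table, csv_file):
    """Create `table` from the CSV header and insert every row as text."""
    reader = csv.reader(csv_file)
    header = next(reader)
    # Quote identifiers because LGD-style headers contain parentheses.
    columns = ", ".join('"{}" TEXT'.format(name.replace('"', '""')) for name in header)
    conn.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, columns))
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), reader
    )
    conn.commit()
    return len(header)


# Hypothetical sample standing in for a downloaded LGD sheet (illustrative values).
SAMPLE = io.StringIO(
    "StateCode,StateName(InEnglish),Census2011Code\n"
    "29,KARNATAKA,29\n"
    "33,TAMIL NADU,33\n"
)

conn = sqlite3.connect(":memory:")
column_count = load_csv_into_sqlite(conn, "states", SAMPLE)
rows = conn.execute(
    'SELECT "StateName(InEnglish)" FROM states ORDER BY 1'
).fetchall()
print(rows)
```

Storing every column as text keeps leading zeros in codes intact, which matters when codes are later compared as strings.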
In the first round, I wanted to match only states and districts. States were easy: get the list of officially valid states from WikiData and match the WikiData label with "StateName(InEnglish)" in the LGD. Do the same with union territories (UTs). If there are spelling differences, map them manually. There are 36 States/UTs in India.
The SPARQL query to get the states of India (wd:Q12443800) is below. I also fetch the property wdt:P5578, which is the 2011 Indian census code. It helps me in mapping.
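The label matching with a manual fallback could look something like the sketch below. The function names, the normalization rules, the sample names, and the placeholder IDs (`QID_...`) are all assumptions for illustration. Real WikiData QIDs would come from the query results.

```python
import unicodedata


def normalize(name):
    """Lowercase, strip accents, and keep only letters and digits."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    lowered = ascii_only.lower().replace("&", "and")
    return "".join(ch for ch in lowered if ch.isalnum())


def match_by_label(lgd_names, wikidata_items, manual=None):
    """Return (matches, unmatched); matches maps an LGD name to an item ID."""
    manual = manual or {}
    by_label = {normalize(label): qid for qid, label in wikidata_items.items()}
    matches, unmatched = {}, []
    for name in lgd_names:
        # Manual overrides win over automatic label matching.
        qid = manual.get(name) or by_label.get(normalize(name))
        if qid:
            matches[name] = qid
        else:
            unmatched.append(name)
    return matches, unmatched


# Hypothetical sample data; the IDs are placeholders, not real QIDs.
lgd_names = ["KARNATAKA", "JAMMU & KASHMIR", "THE EXAMPLE TERRITORY"]
wikidata_items = {
    "QID_KARNATAKA": "Karnataka",
    "QID_JK": "Jammu and Kashmir",
    "QID_EXAMPLE": "Example Territory",
}

auto_matches, auto_unmatched = match_by_label(lgd_names, wikidata_items)
# Names that fail automatic matching get an entry in the manual table.
manual = {"THE EXAMPLE TERRITORY": "QID_EXAMPLE"}
final_matches, final_unmatched = match_by_label(lgd_names, wikidata_items, manual)
print(auto_unmatched, final_unmatched)
```

With only 36 States/UTs, keeping the manual table as a small dictionary is enough. Anything still left in the unmatched list is worth checking by hand.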
SELECT DISTINCT ?S ?SLabel ?SDescription ?Indian_census_area_code__2011_ WHERE {
# ?S is an instance of "state of India", aka wd:Q12443800
?S wdt:P31 wd:Q12443800.
# exclude items that have a dissolution date (wdt:P576), bound here as ?dt
FILTER(NOT EXISTS { ?S wdt:P576 ?dt. })
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
OPTIONAL { ?S wdt:P5578 ?Indian_census_area_code__2011_. }
}
The SPARQL query to get the UTs of India (wd:Q467745) is below. I again fetch the property wdt:P5578, which is the 2011 Indian census code. It helps me in mapping.
SELECT DISTINCT ?S ?SLabel ?SDescription ?Indian_census_area_code__2011_ WHERE {
# where s is "union territory of India" aka wd:Q467745
?S wdt:P31 wd:Q467745.
FILTER(NOT EXISTS { ?S wdt:P576 ?dt. })
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
OPTIONAL { ?S wdt:P5578 ?Indian_census_area_code__2011_. }
}
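The state and UT queries differ only in the class ID, so they can be generated from one template and run from a script. The sketch below assumes the public WikiData query endpoint at `https://query.wikidata.org/sparql` and the standard SPARQL JSON result format. The function names, the User-Agent string, and the sample payload are made up for illustration, and `fetch` is defined but not called here because it needs network access.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY_TEMPLATE = """
SELECT DISTINCT ?S ?SLabel ?SDescription ?Indian_census_area_code__2011_ WHERE {
  ?S wdt:P31 wd:%s.
  FILTER(NOT EXISTS { ?S wdt:P576 ?dt. })
  OPTIONAL { ?S wdt:P5578 ?Indian_census_area_code__2011_. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def build_query(class_qid):
    """Fill the class ID (for example Q12443800 or Q467745) into the template."""
    return QUERY_TEMPLATE % class_qid


def fetch(class_qid):
    """Send the query over HTTP and return the decoded JSON (needs network)."""
    params = urllib.parse.urlencode({"query": build_query(class_qid), "format": "json"})
    request = urllib.request.Request(
        ENDPOINT + "?" + params,
        headers={"User-Agent": "lgd-wikidata-mapping-example/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def parse_bindings(payload):
    """Flatten SPARQL JSON bindings into plain dictionaries."""
    rows = []
    for binding in payload["results"]["bindings"]:
        rows.append({
            # The item arrives as a full entity URI; keep only the trailing ID.
            "qid": binding["S"]["value"].rsplit("/", 1)[-1],
            "label": binding.get("SLabel", {}).get("value"),
            "census2011": binding.get("Indian_census_area_code__2011_", {}).get("value"),
        })
    return rows


# Hand-written payload in the SPARQL JSON shape; the values are placeholders.
SAMPLE_PAYLOAD = {
    "results": {"bindings": [
        {"S": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0"},
         "SLabel": {"type": "literal", "value": "Example State"},
         "Indian_census_area_code__2011_": {"type": "literal", "value": "00"}},
        {"S": {"type": "uri", "value": "http://www.wikidata.org/entity/Q00"},
         "SLabel": {"type": "literal", "value": "Example UT"}},
    ]}
}

parsed = parse_bindings(SAMPLE_PAYLOAD)
print(parsed)
```

Because the census code is fetched with OPTIONAL, some rows come back without it, which is why the parser uses `.get` and returns `None` for that field.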
Matching the districts was not easy, though I followed the same procedure. First I got all the districts of India (wd:Q1149652) from WikiData using the query below.
SELECT DISTINCT ?S ?SLabel ?SDescription ?Indian_census_area_code__2011_ WHERE {
?S wdt:P31 wd:Q1149652.
FILTER(NOT EXISTS { ?S wdt:P576 ?dt. })
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
OPTIONAL { ?S wdt:P5578 ?Indian_census_area_code__2011_. }
}
It returns 744 items, whereas according to the LGD there are only 735 districts. Wikipedia gives yet another number: it says there are 741. So I had to figure out the invalid districts. I did some string matching and SQL magic and came up with this list.
WikiDataId | Label | Description | Comments |
---|---|---|---|
Q955977 | South Arcot | Former district in Tamil Nadu, India | Needs to be marked as dissolved in WikiData |
Q1900496 | Bangalore | Former district in Karnataka, India | Needs to be marked as dissolved in WikiData |
Q1606061 | Andaman | Former district of the Andaman and Nicobar Islands | Needs to be marked as dissolved in WikiData |
Q24949801 | Shahbazwan | District of Bihar in India | Is this the same as GOPALGANJ district? Marked as a district by mistake in WikiData. Should be removed as a district. |
Q6007135 | Imphal | Wikimedia disambiguation page | Ex-district that was split. Needs to be marked as dissolved in WikiData |
Q48731903 | Noklak | District in India, Nagaland | New district. LGD needs update. January 20, 2021. |
Q61746013 | Narayanapet | District of Telangana, India | There seem to be a duplicate Narayanpet district (Q85787759); but Q61746013 was created earlier. DataCommons also uses the same. It also has |
Q29025081 | East Karbi Anglong | District of Assam, India | When KARBI ANGLONG was split, the western part became the new "West Karbi Anglong" and the rest remained part of "Karbi Anglong". There is no "East Karbi Anglong" as such. Should be removed in WikiData? |
Q101088203 | Bajali | District of Assam, India | New district formed on 12 January 2021. LGD needs an update |
DONT KNOW | Vijayanagara | District of Karnataka in India | New district formed in 2020/21. Needs an addition to LGD. Maybe mark Q1611788 as a district in WikiData? |
DONT KNOW | Chachaura | District of Madhya Pradesh | Missing on LGD, WikiData and OSM. No gazette yet |
DONT KNOW | Maihar | District of Madhya Pradesh | Missing on LGD, WikiData and OSM. No gazette yet |
DONT KNOW | Nagda | District of Madhya Pradesh | Missing on LGD and WikiData. No gazette yet. |
Q61439260 | Pakke-Kessang | district of Arunachal Pradesh in India | It was missing from WikiData query results. Because it was not tagged as district. I updated WikiData. |
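
The kind of SQL used to surface such mismatches can be sketched as a pair of anti-joins. The schema (`lgd_districts`, `wd_districts`), the trailing " district" cleanup rule, the sample rows, and the placeholder IDs `QID_GOPALGANJ` and `QID_KARBI_ANGLONG` are assumptions for illustration. The other two IDs are taken from the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lgd_districts (name TEXT);
CREATE TABLE wd_districts (qid TEXT, label TEXT);

-- Hypothetical sample rows.
INSERT INTO lgd_districts VALUES ('GOPALGANJ'), ('KARBI ANGLONG'), ('PAKKE KESSANG');
INSERT INTO wd_districts VALUES
  ('QID_GOPALGANJ', 'Gopalganj district'),
  ('QID_KARBI_ANGLONG', 'Karbi Anglong district'),
  ('Q955977', 'South Arcot'),
  ('Q29025081', 'East Karbi Anglong');
""")

# WikiData items with no LGD counterpart: compare lowercased names after
# dropping a " district" suffix from the WikiData label.
wd_only = conn.execute("""
    SELECT w.qid, w.label
    FROM wd_districts AS w
    LEFT JOIN lgd_districts AS l
      ON lower(l.name) = replace(lower(w.label), ' district', '')
    WHERE l.name IS NULL
    ORDER BY w.qid
""").fetchall()

# LGD districts with no WikiData counterpart (the reverse anti-join).
lgd_only = conn.execute("""
    SELECT l.name
    FROM lgd_districts AS l
    LEFT JOIN wd_districts AS w
      ON lower(l.name) = replace(lower(w.label), ' district', '')
    WHERE w.qid IS NULL
    ORDER BY l.name
""").fetchall()

print(wd_only)
print(lgd_only)
```

Exact matching after normalization only narrows the candidates. Each leftover row still needs a manual look, as the comments column in the table shows.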
Since Chachaura, Maihar and Nagda are not gazetted, they are not officially districts yet, so I have not added them. I have synced the rest in my DB. My list now has 738 districts, because I have added Noklak, Bajali and Vijayanagara to the LGD list. I will push the changes to WikiData once I get some confirmation from the community.
If everything is okay, I will update WikiData. I will also add Census2011Code, StateCode and DistrictCode to the WikiData items.
I have deployed the SQLite database as a Datasette project on Glitch. You can explore the states and districts tables there without downloading anything, or you can follow along on GitHub.
Do let me know what you think.
DataMeet discussions on this are happening here.