Extract Data from HTML/XML/JSON using Xidel

As you may know, I scrape a lot of web pages as a Data Archivist at DataMeet. I usually use BS4 (Beautiful Soup 4) for this; it's beautiful, simple, and it works. But often I don't want to write a Python script just for that, and I need a simple tool to get data out of HTML.

I want to curl or wget the HTML and pipe it to a tool that pulls out the data I want. That is when I use Xidel. Xidel is a CLI tool, around since 2012, that helps you extract data from HTML, XML, or JSON documents. It also goes one step further: you don't even need curl or wget, because Xidel can fetch the page for you.

This quick feature list explains what it can do. Visit the tutorial page to learn more.

  • Extract expressions:
    • CSS 3 Selectors: to extract simple elements
    • XPath 3.0: to extract values and calculate things with them
    • XQuery 3.0: to create new documents from the extracted values
    • JSONiq: to work with JSON apis
    • Templates: to extract several expressions in an easy way using an annotated version of the page for pattern-matching
    • XPath 2.0/XQuery 1.0: compatibility mode for the old XPath/XQuery version

  • Following:
    • HTTP Codes: Redirections like 30x are automatically followed, while keeping things like cookies
    • Links: It can follow all links on a page as well as some extracted values
    • Forms: It can fill in arbitrary data and submit the form

  • Output formats:
    • Adhoc: just prints the data in a human readable format
    • XML: encodes the data as XML
    • HTML: encodes the data as HTML
    • JSON: encodes the data as JSON
    • bash/cmd: exports the data as shell variables

  • Connections: HTTP / HTTPS as well as local files or stdin

  • Systems: Windows (using wininet), Linux (using synapse+openssl), Mac (synapse)

Here is an example. I want to get the Total Vaccination Count from the COVID dashboard on mygov.in. This specific data item comes embedded in the HTML, and for Xidel, that's not a problem.

Firefox developer console view - Copy CSS Selector

I open the developer console, select the HTML node, right-click it, and copy its CSS selector.

Get the data using CSS selectors.

$ xidel https://www.mygov.in/covid-19 --css="div.total-vcount:nth-child(5) > strong:nth-child(1)"
**** Retrieving (GET): https://www.mygov.in/covid-19 ****
**** Processing: https://www.mygov.in/covid-19 ****

I don't want any extra logs. Here you go:

$ xidel https://www.mygov.in/covid-19 --css="div.total-vcount:nth-child(5) > strong:nth-child(1)" --silent
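
A selector copied from the browser can look opaque, so here is a rough Python sketch of what `div.total-vcount:nth-child(5) > strong:nth-child(1)` asks for. The HTML snippet and the value in it are made up for illustration (this is not the real mygov.in markup), and the helper name `select_total_vcount` is hypothetical. It uses only the standard library's ElementTree, so it needs well-formed markup, and it is not a description of how Xidel works internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the dashboard markup (not the real page).
SAMPLE = """
<section>
  <div class="a">one</div>
  <div class="b">two</div>
  <div class="c">three</div>
  <div class="d">four</div>
  <div class="total-vcount"><strong>12345</strong> doses</div>
</section>
"""


def select_total_vcount(root):
    """Emulate: div.total-vcount:nth-child(5) > strong:nth-child(1)."""
    matches = []
    for parent in root.iter():
        children = list(parent)
        # :nth-child(5) -> the div must be the 5th child of its parent.
        if len(children) < 5:
            continue
        div = children[4]
        classes = div.get("class", "").split()
        if div.tag != "div" or "total-vcount" not in classes:
            continue
        # "> strong:nth-child(1)" -> a <strong> that is the div's first child.
        if len(div) > 0 and div[0].tag == "strong":
            matches.append(div[0].text)
    return matches


root = ET.fromstring(SAMPLE)
print(select_total_vcount(root))
```

The point of the sketch is that the match depends on position: if the div stops being the fifth child of its parent, the selector no longer finds it, so it is worth re-checking the selector whenever the page layout changes.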

I want JSON output, no problem.

$ xidel https://www.mygov.in/covid-19 --css="div.total-vcount:nth-child(5) > strong:nth-child(1)" --silent --output-format=json-wrapped

XML output

$ xidel https://www.mygov.in/covid-19 --css="div.total-vcount:nth-child(5) > strong:nth-child(1)" --silent --output-format=xml-wrapped
<?xml version="1.0" encoding="UTF-8"?>

HTML output

$ xidel https://www.mygov.in/covid-19 --css="div.total-vcount:nth-child(5) > strong:nth-child(1)" --silent --output-format=html
<!DOCTYPE html>

Reading and parsing JSON is also easy. Here I am using JSONiq notation to extract a single value:

$ xidel https://www.mygov.in/sites/default/files/covid/covid_state_counts_ver1.json --input-format=json --extract='$json("Death")("15")' --silent

Note: "15" is a string used as a key. It's not an array index.
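
The difference between a string key and an array index is easy to see in a few lines of Python. This is only a sketch: the sample below copies the shape implied by `$json("Death")("15")` (an object inside an object), and its keys and numbers are made up, not real dashboard data.

```python
import json

# Hypothetical sample mirroring only the shape implied by $json("Death")("15");
# the keys and numbers here are made up, not real dashboard data.
SAMPLE = '{"Death": {"14": 10, "15": 20, "16": 30}}'

data = json.loads(SAMPLE)

# "Death" maps to a JSON object, so "15" is looked up as a string key.
by_key = data["Death"]["15"]
print(by_key)

# The integer 15 is a different thing from the string "15" and is not a key.
try:
    data["Death"][15]
except KeyError:
    print("15 (integer) is not a key in the object")

# Had the values been stored in an array, an integer position would be used instead.
as_array = [10, 20, 30]
print(as_array[1])
```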

Using dot notation

$ xidel https://www.mygov.in/sites/default/files/covid/covid_state_counts_ver1.json --input-format=json --extract='($json).Death.15' --silent

There are lots of features that I am yet to explore. I will write more later.

