Liberate PDFs

Created Tuesday 03 March 2015
This tutorial was written as a help material for Liberate PDF event. I have tried to make the tutorial as generic as possible so it can be used as a base material for any PDF scpraing process. Some of the examples used here are real use cases.

Simple Tools

PDF Tables

  • PDF Tables
  • Webservice, nothing to install
  • Free as in free beer
  • For extracting tables from PDFs

Tabula

  • Tabula
  • Runs locally as a webservice. Can be installed for the organization or an institution
  • FOSS
  • Scrapes most pdfs (except the ones which have embedded images)

Smart PDF

Dig deeper

iText

PDFMiner

  • PDFMiner
  • Python library
  • Can be used to convert PDF to HTML
  • Can be used to obtain the exact location of text in a page, as well as other information such as fonts or lines
  • FOSS, on GitHub
  • Command line tool and can also be used as a python library
  • Tutorial

pdftables

  • pdftables
  • Python library
  • Can be used to extract tables
  • FOSS
  • Command line tool and can also be used as a python library

Apache PDFBox

  • Apache PDFBox
  • Java Library
  • Can extract text from PDF
  • Can convert PDFs to images
  • Can work with Unicode
  • FOSS

PDFTOHTML

  • pdftohtml
  • pdftohtml is a utility which converts PDF files into HTML and XML formats.
  • FOSS
  • Best for text pdf conversion
  • Primarly command line

pypdfocr

  • pypdfocr
  • http://virantha.github.io/pypdfocr/html/
  • Converts a scanned PDF into an OCR'ed pdf using Tesseract-OCR and Ghostscript
  • FOSS
  • Take a scanned PDF file and run OCR on it (using the Tesseract OCR software from Google), generating a searchable PDF
  • Command line

OCR - tesseract-ocr