Liberate PDFs
Created Tuesday 03 March 2015
This tutorial was written as a help material for Liberate PDF event. I have tried to make the tutorial as generic as possible so it can be used as a base material for any PDF scpraing process. Some of the examples used here are real use cases.
Simple Tools
PDF Tables
- PDF Tables
- Webservice, nothing to install
- Free as in free beer
- For extracting tables from PDFs
Tabula
- Tabula
- Runs locally as a webservice. Can be installed for the organization or an institution
- FOSS
- Scrapes most pdfs (except the ones which have embedded images)
Smart PDF
- http://smallpdf.com/pdf-to-jpg
- All kind of interconversion
Dig deeper
iText
- iText
- Libraries available for Java and CSharp
- Comercial and AGPL Licensing
- FreeBook available
PDFMiner
- PDFMiner
- Python library
- Can be used to convert PDF to HTML
- Can be used to obtain the exact location of text in a page, as well as other information such as fonts or lines
- FOSS, on GitHub
- Command line tool and can also be used as a python library
- Tutorial
pdftables
- pdftables
- Python library
- Can be used to extract tables
- FOSS
- Command line tool and can also be used as a python library
Apache PDFBox
- Apache PDFBox
- Java Library
- Can extract text from PDF
- Can convert PDFs to images
- Can work with Unicode
- FOSS
PDFTOHTML
- pdftohtml
- pdftohtml is a utility which converts PDF files into HTML and XML formats.
- FOSS
- Best for text pdf conversion
- Primarly command line
pypdfocr
- pypdfocr
- http://virantha.github.io/pypdfocr/html/
- Converts a scanned PDF into an OCR'ed pdf using Tesseract-OCR and Ghostscript
- FOSS
- Take a scanned PDF file and run OCR on it (using the Tesseract OCR software from Google), generating a searchable PDF
- Command line
OCR - tesseract-ocr
- Simple and open source
- Supports Kannada
- Download the intial language files
- Train it for the improvement
- https://code.google.com/p/tesseract-ocr/wiki/ReadMe
- https://code.google.com/p/tesseract-ocr/downloads/list
- https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3