Excalibur is a free and open-source tool that helps you easily extract tabular data from PDFs. I originally wrote this post for my website.
Borrowing the first three paragraphs from my previous blog post since they perfectly explain why extracting tables from PDFs is hard.
The PDF (Portable Document Format) was born out of The Camelot Project to create “a universal way to communicate documents across a wide variety of machine configurations, operating systems and communication networks”. Basically, the goal was to make documents viewable on any display and printable on any modern printer. PDF was built on top of PostScript (a page description language), which had already solved this “view and print anywhere” problem. PDF encapsulates the components required to create a “view and print anywhere” document. These include characters, fonts, graphics and images.
A PDF file defines instructions to place characters (and other components) at precise x,y coordinates relative to the bottom-left corner of the page. Words are simulated by placing some characters closer than others. Similarly, spaces are simulated by placing words relatively far apart. How are tables simulated then? You guessed it correctly — by placing words as they would appear in a spreadsheet.
The PDF format has no internal representation of a table structure, which makes it difficult to extract tables for analysis. Sadly, a lot of open data is stored in PDFs, a format that was never designed for tabular data in the first place!
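To make this concrete, here is a minimal sketch that dumps each character on a page together with its coordinates. It uses the third-party pdfminer.six library (not Excalibur or Camelot), and "report.pdf" is just a placeholder file name. Notice that nothing in the output says "row", "column" or "table":

# Minimal sketch using pdfminer.six (pip install pdfminer.six); "report.pdf" is a placeholder.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

for page_layout in extract_pages("report.pdf", maxpages=1):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for char in text_line:
                    if isinstance(char, LTChar):
                        # Each character only knows its glyph and bounding box;
                        # there is no cell, row or table object anywhere in the file.
                        x0, y0, _, _ = char.bbox
                        print(f"{char.get_text()!r} at ({x0:.1f}, {y0:.1f})")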
Excalibur is a web interface for extracting tabular data from PDFs, written in Python 3! It is powered by Camelot. You can check out the documentation at Read the Docs and follow the development on GitHub.
Note: Excalibur only works with text-based PDFs and not scanned documents. (As Tabula explains, “If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based”.)
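If you are not sure which kind of PDF you have, a rough programmatic check (again assuming the third-party pdfminer.six library and a placeholder file name) is to see whether any text can be extracted at all:

from pdfminer.high_level import extract_text

# Scanned pages usually yield no extractable text at all.
text = extract_text("report.pdf", maxpages=1)
if text.strip():
    print("Looks text-based; Excalibur should be able to read it.")
else:
    print("No extractable text on page 1; probably a scanned, image-based PDF.")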
After installing Ghostscript (see the install instructions), you can simply use pip to install Excalibur:
$ pip install excalibur-py
Note: You can also download executables for Windows and Linux from the releases page and run them directly!
After installation with pip, you can initialize the metadata database using:
$ excalibur initdb
And then start the webserver using:
$ excalibur webserver
That’s it! Now you can go to http://localhost:5000 and start extracting tabular data from your PDFs.
Camelot, the Python library that powers Excalibur, implements two methods to extract tables from two different types of table structures: Lattice, for tables formed with lines, and Stream, for tables formed with whitespace. Lattice has given nice results since v0.1.0 because it can detect different tables on a single PDF page, in contrast to Stream, which treated the whole page as a single table.
But last week, Camelot v0.4.0 was released to fix that problem. #206 adds an implementation of the table detection algorithm described in Anssi Nurminen’s master’s thesis, which can detect multiple Stream-type tables on a single PDF page (most of the time)! You can see the difference in the following images.
[Image comparison: both Stream-type tables detected in v0.4.0, as compared to the whole page being treated as a single table in v0.3.0]
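If you want to try the same extraction from Python directly, here is a minimal sketch of Camelot's API, which Excalibur wraps; the file name and page number are placeholders:

import camelot

# Lattice (the default flavor): for tables whose cells are drawn with ruling lines.
lattice_tables = camelot.read_pdf("report.pdf", pages="1", flavor="lattice")

# Stream: for tables laid out with whitespace instead of lines.
stream_tables = camelot.read_pdf("report.pdf", pages="1", flavor="stream")

print(lattice_tables.n, stream_tables.n)     # number of tables detected on the page
print(stream_tables[0].df)                   # each table is exposed as a pandas DataFrame
stream_tables.export("tables.csv", f="csv")  # also supports json, excel and html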
Excalibur was voted #1 on Labworm in the second week of November! Labworm is a platform that guides scientists to the best online resources for their research and helps mediate knowledge exchange by promoting open science.
There are both open-source (Tabula, pdfplumber) and closed-source (Smallpdf, Docparser) tools that are widely used to extract tables from PDFs. They either give a nice output or fail miserably; there is no in-between. This is not helpful, since everything in the real world, including PDF table extraction, is fuzzy.
Excalibur uses Camelot under the hood, which was created to offer users complete control over table extraction. If you can’t get your desired output with the default settings, you can tweak the “Advanced” settings and get the job done!
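The same kind of control is available as keyword arguments in Camelot's Python API. As a hedged sketch, here are a few of the knobs you can turn for a Stream-type table; the file name and all coordinates below are made-up placeholders that you would replace with values from your own PDF:

import camelot

tables = camelot.read_pdf(
    "report.pdf",
    pages="1",
    flavor="stream",
    table_areas=["50,700,550,100"],  # limit detection to a region (left-top and right-bottom in PDF coordinates)
    columns=["100,200,300,400"],     # explicit x coordinates of column separators
    row_tol=10,                      # tolerance used when grouping text into rows
)
print(tables[0].parsing_report)      # accuracy and whitespace metrics for the parsed table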
For a more detailed account of why Camelot was created, you should also check out “The longer read” section of my previous blog post. Use Ctrl + F.
To reiterate from “The longer read” section I mentioned above: it was a pain to see open-source tools fail to give a nice table extraction output every time, and it was frustrating to see paywalls on closed-source tools. I think that paywalls should not block the way to open science. I believe that Camelot was a successful attempt by us, at SocialCops, to address the problem of accurately extracting tables from text-based PDFs. Excalibur has made it even easier for anyone to access Camelot’s goodness with a nice web interface.
But there’s still a lot of open data trapped inside images and image-based PDFs. And state-of-the-art optical character recognition software is locked behind paywalls.
‘At this time, proprietary OCR software drastically outperforms free and open source OCR software and as such could be worth a public agency’s investment depending on the amount and type of OCR jobs the public agency is needing to perform.’ — How to Open Data — Working with PDFs
So the next step is to make it easy for anyone to extract tables (or any other type of data for that matter) from images or image-based PDFs by adding OCR support to Camelot and Excalibur. If you would like to contribute your ideas towards this, do add your comments on #101. You can also check out the Contributor’s Guide for guidelines around contributing code, documentation or tests, reporting issues and proposing enhancements.
If Excalibur has helped you extract tables from PDFs, please consider supporting its development by becoming a backer or a sponsor on OpenCollective!
Also, stop publishing open data as PDFs and keep looking up! :)
Thanks to Christine Garcia for providing feedback and suggesting edits.