I originally wrote this post for the SocialCops engineering blog.
Photo by Carles Rabada on Unsplash
The PDF (Portable Document Format) was born out of The Camelot Project to create “a universal way to communicate documents across a wide variety of machine configurations, operating systems and communication networks”. Basically, the goal was to make documents viewable on any display and printable on any modern printer. PDF was built on top of PostScript (a page description language), which had already solved this “view and print anywhere” problem. PDF encapsulates the components required to create a “view and print anywhere” document. These include characters, fonts, graphics and images.
A PDF file defines instructions to place characters (and other components) at precise x,y coordinates relative to the bottom-left corner of the page. Words are simulated by placing some characters closer than others. Similarly, spaces are simulated by placing words relatively far apart. How are tables simulated then? You guessed it correctly — by placing words as they would appear in a spreadsheet.
The PDF format has no internal representation of a table structure, which makes it difficult to extract tables for analysis. Sadly, a lot of open data is stored in PDFs, which was not designed for tabular data in the first place!
Today, we’re pleased to announce the release of Camelot, a Python library and command-line tool that makes it easy for anyone to extract data tables trapped inside PDF files! You can check out the documentation at Read the Docs and follow the development on GitHub.
Installation is easy! After installing the dependencies, you can install Camelot using pip (the recommended tool for installing Python packages):
$ pip install camelot-py
Extracting tables from a PDF using Camelot is very simple. Here’s how you do it. (Here’s the PDF used in the following example.)
>>> import camelot>>> tables = camelot.read_pdf('foo.pdf')>>> tables<TableList n=1>>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html>>> tables[0]<Table shape=(7, 7)>>>> tables[0].parsing_report{'accuracy': 99.02,'whitespace': 12.24,'order': 1,'page': 1}>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html>>> tables[0].df # get a pandas DataFrame!
You can also check out the command-line interface.
Many people use open (Tabula, pdf-table-extract) and closed-source (smallpdf, pdftables) tools to extract tables from PDFs. But they either give a nice output or fail miserably. There is no in between. This is not helpful since everything in the real world, including PDF table extraction, is fuzzy. This leads to the creation of ad-hoc table extraction scripts for each type of PDF table.
We created Camelot to offer users complete control over table extraction. If you can’t get your desired output with the default settings, you can tweak them and get the job done!
You can check out a comparison of Camelot’s output with other open-source PDF table extraction libraries.
We’ve often needed to extract data trapped inside PDFs.
The first tool that we tried was Tabula, which has nice user and command-line interfaces, but it either worked perfectly or failed miserably. When it failed, it was difficult to tweak the settings — such as the image thresholding parameters, which influence table detection and can lead to a better output.
We also tried closed-source tools like smallpdf and pdftables, which worked slightly better than Tabula. But then again, they also didn’t allow tweaking and cost money. (We wrote a blog post about how we went about extracting tables from PDFs back in 2015, titled “PDF is evil”.)
When these full-blown PDF table extraction tools didn’t work, we tried pdftotext (an open-source command-line utility). pdftotext extracts text from a PDF while preserving the layout, using spaces. After getting the text, we had to write Python scripts with complicated regexes (regular expressions) to convert the text into tables. This wasn’t scalable, since we had to change the regexs for each new table layout.
We clearly needed a tweakable PDF table extraction tool, so we started developing one in December 2015. We started with the idea of giving the tool back to the community, which had given us so many open-source tools to work with.
We knew that Tabula classifies PDF tables into two classes. It has two methods to extract these different classes: Lattice (to extract tables with clearly defined lines between cells) and Stream (to extract tables with spaces between cells). We named Camelot’s table extraction flavors, Lattice and Stream, after Tabula’s methods.
For Lattice, Tabula uses Hough Transform, an image processing technique to detect lines. Since we wanted to use Python, OpenCV was the obvious choice to do image processing. However, OpenCV’s Hough Line Transform returned only line equations. After more exploration, we settled on morphological transformations, which gave the exact line segments. From here, representing the table trapped inside a PDF was straightforward.
To get more information on how Lattice and Stream work in Camelot, check out the “How It Works” section of the documentation.
We’ve battle tested Camelot by using it in a variety of projects, both for one-off and automated table extraction.
Earlier this year, we developed our UN SDG Solution to help organizations track and measure their contribution to Agenda 2030. For India, we identified open data sources (primarily PDF reports) for each of the 17 Sustainable Development Goals. For example, one of our sources for Goal 3 (“Good Health and Well-Being for People”) is the National Family Health Survey (NFHS) report released by IIPS. To get data from these PDF sources, we created an internal web interface built on top of Camelot, where our data analysts could upload PDF reports and extract tables in their preferred format.
Note: We became finalists for the UN SDG Action Awards in February 2018.
We also set up an ETL workflow using Apache Airflow to track disease outbreaks in India. The workflow scrapes the Integrated Disease Surveillance Programme (IDSP) website for weekly PDFs of disease outbreak data, and then it extracts tables from the PDFs using Camelot, sends alerts to our team, and loads the data into a data warehouse.
Camelot has some limitations. (We’re developing solutions!) Here are a couple of them:
You can check out the GitHub repository for more information.
You can help too — every contribution counts! Check out the Contributor’s Guide for guidelines around contributing code, documentation or tests, reporting issues and proposing enhancements. You can also head to the issue tracker and look for issues labeled “help wanted” and “good first issue”.
We urge organizations to release open data in a “data friendly” format like the CSV. But while tables are trapped inside PDF files, there’s Camelot :)