I originally wrote this post for the SocialCops engineering blog.

Photo by Carles Rabada on Unsplash

The PDF (Portable Document Format) was born out of The Camelot Project to create “a universal way to communicate documents across a wide variety of machine configurations, operating systems and communication networks”. Basically, the goal was to make documents viewable on any display and printable on any modern printer. PDF was built on top of PostScript (a page description language), which had already solved this “view and print anywhere” problem. PDF encapsulates the components required to create a “view and print anywhere” document. These include characters, fonts, graphics and images.

A PDF file defines instructions to place characters (and other components) at precise x,y coordinates relative to the bottom-left corner of the page. Words are simulated by placing some characters closer than others. Similarly, spaces are simulated by placing words relatively far apart. How are tables simulated then? You guessed it correctly: by placing words as they would appear in a spreadsheet.

The PDF format has no internal representation of a table structure, which makes it difficult to extract tables for analysis. Sadly, a lot of open data is stored in PDFs, a format that was not designed for tabular data in the first place!

Camelot: PDF table extraction for humans

Today, we’re pleased to announce the release of Camelot, a Python library and command-line tool that makes it easy for anyone to extract data tables trapped inside PDF files! You can check out the documentation at Read the Docs and follow the development on GitHub.

How to install Camelot

Installation is easy! After installing the dependencies, you can install Camelot using pip (the recommended tool for installing Python packages):

    $ pip install camelot-py

How to use Camelot

Extracting tables from a PDF using Camelot is very simple. Here’s how you do it. (Here’s the PDF used in the following example.)

    >>> import camelot
    >>> tables = camelot.read_pdf('foo.pdf')
    >>> tables
    <TableList n=1>
    >>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html
    >>> tables[0]
    <Table shape=(7, 7)>
    >>> tables[0].parsing_report
    {
        'accuracy': 99.02,
        'whitespace': 12.24,
        'order': 1,
        'page': 1
    }
    >>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html
    >>> tables[0].df # get a pandas DataFrame!

You can also check out the command-line interface.

Why use Camelot?

- Camelot gives you complete control over table extraction by letting you tweak its settings.
- Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
- Each table is a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows.
- You can export tables to multiple formats, including CSV, JSON, Excel and HTML.

Okay, but why another PDF table extraction library?

TL;DR: Total control for better table extraction

Many people use open-source (Tabula, pdf-table-extract) and closed-source (smallpdf, pdftables) tools to extract tables from PDFs. But they either give a nice output or fail miserably. There is no in between. This is not helpful since everything in the real world, including PDF table extraction, is fuzzy. This leads to the creation of ad-hoc table extraction scripts for each type of PDF table.

We created Camelot to offer users complete control over table extraction. If you can’t get your desired output with the default settings, you can tweak them and get the job done!

You can check out a comparison of Camelot’s output with other open-source PDF table extraction libraries.

The longer read

We’ve often needed to extract data trapped inside PDFs.

The first tool that we tried was Tabula, which has nice user and command-line interfaces, but it either worked perfectly or failed miserably. When it failed, it was difficult to tweak the settings, such as the image thresholding parameters, which influence table detection and can lead to a better output.

We also tried closed-source tools like smallpdf and pdftables, which worked slightly better than Tabula. But then again, they also didn’t allow tweaking and cost money. (We wrote a blog post about how we went about extracting tables from PDFs back in 2015, titled “PDF is evil”.)

When these full-blown PDF table extraction tools didn’t work, we tried pdftotext (an open-source command-line utility). pdftotext extracts text from a PDF while preserving the layout, using spaces. After getting the text, we had to write Python scripts with complicated regexes (regular expressions) to convert the text into tables. This wasn’t scalable, since we had to change the regexes for each new table layout.

We clearly needed a tweakable PDF table extraction tool, so we started developing one in December 2015. We started with the idea of giving the tool back to the community, which had given us so many open-source tools to work with.

We knew that Tabula classifies PDF tables into two classes. It has two methods to extract these different classes: Lattice (to extract tables with clearly defined lines between cells) and Stream (to extract tables with spaces between cells). We named Camelot’s table extraction flavors, Lattice and Stream, after Tabula’s methods.

For Lattice, Tabula uses Hough Transform, an image processing technique to detect lines. Since we wanted to use Python, OpenCV was the obvious choice to do image processing. However, OpenCV’s Hough Line Transform returned only line equations. After more exploration, we settled on morphological transformations, which gave the exact line segments. From here, representing the table trapped inside a PDF was straightforward.

To get more information on how Lattice and Stream work in Camelot, check out the “How It Works” section of the documentation.

How we use Camelot

We’ve battle tested Camelot by using it in a variety of projects, both for one-off and automated table extraction.

Earlier this year, we developed our UN SDG Solution to help organizations track and measure their contribution to Agenda 2030. For India, we identified open data sources (primarily PDF reports) for each of the 17 Sustainable Development Goals. For example, one of our sources for Goal 3 (“Good Health and Well-Being for People”) is the National Family Health Survey (NFHS) report released by IIPS. To get data from these PDF sources, we created an internal web interface built on top of Camelot, where our data analysts could upload PDF reports and extract tables in their preferred format.

Note: We became finalists for the UN SDG Action Awards in February 2018.

We also set up an ETL workflow using Apache Airflow to track disease outbreaks in India. The workflow scrapes the Integrated Disease Surveillance Programme (IDSP) website for weekly PDFs of disease outbreak data, and then it extracts tables from the PDFs using Camelot, sends alerts to our team, and loads the data into a data warehouse.

To infinity and beyond!

Camelot has some limitations. (We’re developing solutions!) Here are a couple of them:

- When using Stream, tables aren’t autodetected. Stream treats the whole page as a single table, which gives bad output when there are multiple tables on the page.
- Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, “If you can click-and-drag to select text in your table in a PDF viewer… then your PDF is text-based”.)

You can check out the GitHub repository for more information.

You can help too: every contribution counts! Check out the Contributor’s Guide for guidelines around contributing code, documentation or tests, reporting issues and proposing enhancements. You can also head to the issue tracker and look for issues labeled “help wanted” and “good first issue”.

We urge organizations to release open data in a “data friendly” format like CSV. But while tables are trapped inside PDF files, there’s Camelot :)
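As a closing illustration of the point made under “Why use Camelot?” about discarding bad tables by accuracy and whitespace, here is a minimal sketch of what such a filter could look like. It only relies on each table exposing a parsing_report dict with the keys shown in the usage example. The helper name keep_good_tables and both threshold defaults are assumptions made up for this example, not part of Camelot, and the stand-in objects exist only so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=30.0):
    """Return the tables whose parsing report passes both thresholds.

    Hypothetical helper: each item must expose a `parsing_report` dict with
    'accuracy' and 'whitespace' keys. The default thresholds are illustrative
    assumptions, not values recommended by Camelot.
    """
    good = []
    for table in tables:
        report = table.parsing_report
        accurate_enough = report["accuracy"] >= min_accuracy
        dense_enough = report["whitespace"] <= max_whitespace
        if accurate_enough and dense_enough:
            good.append(table)
    return good


# Stand-in objects that mimic the parsing_report shown earlier, so the
# sketch runs without Camelot or a PDF on disk.
fake_tables = [
    SimpleNamespace(parsing_report={"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 61.5, "whitespace": 70.0, "order": 2, "page": 1}),
]

kept = keep_good_tables(fake_tables)
print([t.parsing_report["order"] for t in kept])  # prints [1]
```

With Camelot installed, the same function should work on the tables returned by camelot.read_pdf, assuming that result can be iterated like a list. Sensible thresholds depend on your PDFs, so treat the defaults above as a starting point to tune.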


An Open-Source Tool to Extract Tables from PDFs into CSVs

Announcing Camelot, a Python Library to Extract Tabular Data from PDFs
