Excalibur is a free and open-source tool that can help you easily extract tabular data from PDFs.

I originally wrote this post for my website.

Photo by Patrick Tomasso on Unsplash

I am borrowing the first three paragraphs from my previous blog post, since they explain well why extracting tables from PDFs is hard.

The PDF (Portable Document Format) was born out of The Camelot Project to create “a universal way to communicate documents across a wide variety of machine configurations, operating systems and communication networks”. Basically, the goal was to make documents viewable on any display and printable on any modern printer. PDF was built on top of PostScript (a page description language), which had already solved this “view and print anywhere” problem. PDF encapsulates the components required to create a “view and print anywhere” document. These include characters, fonts, graphics and images.

A PDF file defines instructions to place characters (and other components) at precise x,y coordinates relative to the bottom-left corner of the page. Words are simulated by placing some characters closer than others. Similarly, spaces are simulated by placing words relatively far apart. How are tables simulated then? You guessed it correctly: by placing words as they would appear in a spreadsheet.

The PDF format has no internal representation of a table structure, which makes it difficult to extract tables for analysis. Sadly, a lot of open data is stored in PDFs, a format that was not designed for tabular data in the first place!

Excalibur: Extract tables from PDFs into CSVs

Excalibur is a web interface to extract tabular data from PDFs, written in Python 3! It is powered by Camelot. You can check out the documentation at Read the Docs and follow the development on GitHub.

Note: Excalibur only works with text-based PDFs and not scanned documents. (As Tabula explains, “If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based”.)
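To make the earlier point about PDFs storing only positioned characters more concrete, here is a minimal Python sketch of how rows and cells might be recovered from bare x,y placements. It is an illustration only, not how Camelot or Excalibur work internally: the function names, the fixed glyph width, the gap threshold and the sample data are all assumptions made up for this example.

```python
from collections import defaultdict

# Assumption for this sketch: every glyph is exactly 6 units wide.
GLYPH_WIDTH = 6


def place(text, x, y):
    """Return one (char, x, y) tuple per character, laid out left to right."""
    return [(ch, x + i * GLYPH_WIDTH, y) for i, ch in enumerate(text)]


def rows_from_glyphs(glyphs, cell_gap=20):
    """Group glyphs into rows by y, then split each row into cells at wide gaps.

    Simplification: glyphs on the same row are assumed to share the exact
    same y value. Real PDFs would need a tolerance here.
    """
    by_y = defaultdict(list)
    for ch, x, y in glyphs:
        by_y[y].append((x, ch))

    rows = []
    # y is measured from the bottom-left corner, so larger y is higher up.
    for y in sorted(by_y, reverse=True):
        cells, current, prev_x = [], "", None
        for x, ch in sorted(by_y[y]):
            # Horizontal distance between the end of the previous glyph
            # and the start of this one.
            gap = 0 if prev_x is None else x - (prev_x + GLYPH_WIDTH)
            if gap >= cell_gap:
                # A wide gap is treated as a column boundary.
                cells.append(current)
                current = ""
            elif gap > 0:
                # A small gap is treated as a space between words.
                current += " "
            current += ch
            prev_x = x
        cells.append(current)
        rows.append(cells)
    return rows


# Made-up sample page: no space characters and no table structure,
# only characters at coordinates.
demo_glyphs = (
    place("Item", 50, 700) + place("Qty", 200, 700)
    + place("Red", 50, 680) + place("apple", 74, 680) + place("3", 200, 680)
)

table = rows_from_glyphs(demo_glyphs)
for row in table:
    print(row)
```

Even in this toy case, the result depends entirely on the chosen gap threshold, which hints at why real-world table extraction is fuzzy.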
How to install Excalibur

After installing ghostscript (see install instructions), you can simply use pip to install Excalibur:

    $ pip install excalibur-py

Note: You can also download executables for Windows and Linux from the releases page and run them directly!

How to use Excalibur

After installation with pip, you can initialize the metadata database using:

    $ excalibur initdb

And then start the webserver using:

    $ excalibur webserver

That’s it! Now you can go to http://localhost:5000 and start extracting tabular data from your PDFs.

1. Upload a PDF and enter the page numbers you want to extract tables from.
2. Go to each page and select the table by drawing a box around it. (You can choose to skip this step since Excalibur can automatically detect tables on its own. Click on “Autodetect tables” to see what Excalibur sees.)
3. Choose a flavor (Lattice or Stream) from “Advanced”: Lattice, for tables formed with lines, or Stream, for tables formed with whitespace.
4. Click on “View and download data” to see the extracted tables.
5. Select your favorite format (CSV/Excel/JSON/HTML) and click on “Download”!

A table detection upgrade

Camelot, the Python library that powers Excalibur, implements two methods to extract tables from two different types of table structures: Lattice, for tables formed with lines, and Stream, for tables formed with whitespace. Lattice has given nice results since v0.1.0 because it was able to detect different tables on a single PDF page, in contrast to Stream, which treated the whole page as a table.

But last week, Camelot v0.4.0 was released to fix that problem. #206 adds an implementation of the table detection algorithm described in Anssi Nurminen’s master’s thesis, which is able to detect multiple Stream-type tables on a single PDF page (most of the time)! You can see the difference in the following images.
[Images: both Stream-type tables detected in v0.4.0, as compared to the whole page being treated as a table in v0.3.0]

Voted #1 on Labworm

Excalibur was voted #1 on Labworm in the second week of November! Labworm is a platform that guides scientists to the best online resources for their research and helps mediate knowledge exchange by promoting open science.

Why another PDF table extraction tool?

There are both open-source (Tabula, pdfplumber) and closed-source (Smallpdf, Docparser) tools that are widely used to extract data tables from PDFs. They either give a nice output or fail miserably. There is no in between. This is not helpful since everything in the real world, including PDF table extraction, is fuzzy.

Excalibur uses Camelot under the hood, which was created to offer users complete control over table extraction. If you can’t get your desired output with the default settings, you can tweak the “Advanced” settings and get the job done!

For a more detailed account of why Camelot was created, you should also check out “The longer read” section of my previous blog post. Use Ctrl + F.

The road ahead

Reiterating from “The longer read” section I talked about above, it was a pain to see open-source tools not give a nice table extraction output every time. And it was frustrating to see paywalls on closed-source tools. I think that paywalls should not block the way to open science. I believe that Camelot was a successful attempt by us, at SocialCops, to address the problem of extracting tables from text-based PDFs accurately. Excalibur has made it even easier for anyone to access Camelot’s goodness with a nice web interface.

But there’s still a lot of open data trapped inside images and image-based PDFs. And state-of-the-art optical character recognition software is locked behind paywalls.
“At this time, proprietary OCR software drastically outperforms free and open source OCR software and as such could be worth a public agency’s investment depending on the amount and type of OCR jobs the public agency is needing to perform.”
Source: How to Open Data - Working with PDFs

So the next step is to make it easy for anyone to extract tables (or any other type of data for that matter) from images or image-based PDFs by adding OCR support to Camelot and Excalibur. If you would like to contribute your ideas towards this, do add your comments on #101. You can also check out the Contributor’s Guide for guidelines around contributing code, documentation or tests, reporting issues and proposing enhancements.

If Excalibur has helped you extract tables from PDFs, please consider supporting its development by becoming a backer or a sponsor on OpenCollective!

Also, stop publishing open data as PDFs and keep looking up! :)

Thanks to Christine Garcia for providing feedback and suggesting edits.