How to Run LibreOffice in AWS Lambda for Dirty-Cheap PDFs at Scale

Written by vladholubiev | Published 2017/11/07
Tech Story Tags: serverless | aws-lambda | aws | pdf

Did you know you can convert almost 100 document formats to PDF with LibreOffice? Did you know you can now do this in Lambda? TL;DR: there is an online demo, and the sources are on GitHub.

This short post answers two simple questions: Why? and How? If you are in a hurry, skip to How.

Why?

To not ruin your expectations, let’s clarify some things from the start.

This is not something you can run in production, at least not yet.

OK, so let’s go on. If you have ever had to convert office documents to PDFs, you probably know the pain. There are not many open-source, or even commercial, tools for this job.

Inspired by a recent story of running Google Chrome in Lambda, I was pumped to repeat a similar feat for something I desperately need. LibreOffice is one of these things. I had been running it inside a Docker container for a year, and it requires a lot of care. LibreOffice doesn’t work well when you run several instances in one container. Processes often become zombies and eat all the memory. There is a way to run it as a “server” via a socket, but it’s even more unstable.

I ranted about this for some time until a salesman from Accusoft eventually reached out, trying to sell me their solution for document conversion.

The only problem: the price is too high for me, at $7,400 per server or $10,000 per million documents.

This is so ridiculous I decided to build it myself. Here is how.

How?

So after some quick calculations, I ended up with a price of $150 per million documents. Quite an improvement over ten grand, isn’t it? The estimate later proved to be correct.

The main driver of the price is S3, not Lambda, which I find funny.

Approximate price estimate: $150 for 1 million files. Worth trying!
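To put the two quotes side by side, here is a tiny sketch of the arithmetic. It only uses the two per-million figures mentioned in this post; the function names are made up for illustration, and it does not attempt to break the $150 down into S3 and Lambda parts.

```javascript
// Per-million prices quoted in this post (USD).
const COMMERCIAL_PER_MILLION = 10000; // vendor quote
const LAMBDA_PER_MILLION = 150;       // estimate for the Lambda approach

// Price of a single conversion, given a per-million price.
function costPerDocument(pricePerMillion) {
  return pricePerMillion / 1e6;
}

// How many times cheaper `cheap` is compared to `expensive`.
function savingsFactor(expensive, cheap) {
  return expensive / cheap;
}

console.log('Lambda, per document: $' + costPerDocument(LAMBDA_PER_MILLION));
console.log('Vendor, per document: $' + costPerDocument(COMMERCIAL_PER_MILLION));
console.log('Savings factor: ' + savingsFactor(COMMERCIAL_PER_MILLION, LAMBDA_PER_MILLION).toFixed(1));
```

In other words, the Lambda route comes out roughly 67 times cheaper per document than the per-million vendor quote.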

So there are two problems that stand in the way of bringing LibreOffice to Lambda:

  1. Size. Installed on the system, it takes gigabytes of disk space.
  2. Portability. It depends on a lot of exotic libraries not available in Lambda.

Size

Given 512 MB of disk space in Lambda, this is quite a challenge. Fortunately, you can disable a ton of crap during the compilation process. God, why would I need a JDBC driver or KDE extensions? That’s how the LibreOffice size went down from 2 GB to 0.45 GB.

450 MB is already small enough to fit in Lambda. But we can do more. strip **/* is a magical command that removes symbols from shared objects (.so files).
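If you would rather script this from node.js, here is a rough sketch of a narrower variant: instead of stripping every file like the glob does, it walks a folder, picks only the .so files, and runs strip on each. The function names are made up for illustration, and it assumes the binutils strip command is on the PATH of your build machine.

```javascript
const fs = require('fs');
const path = require('path');
const { execFileSync } = require('child_process');

// Recursively collect regular files named like libfoo.so or libfoo.so.1.
// Symlinks are skipped, so each real file is listed only once.
function findSharedObjects(dir) {
  const found = [];
  for (const name of fs.readdirSync(dir)) {
    const full = path.join(dir, name);
    const stat = fs.lstatSync(full);
    if (stat.isDirectory()) {
      found.push(...findSharedObjects(full));
    } else if (stat.isFile() && /\.so(\.\d+)*$/.test(name)) {
      found.push(full);
    }
  }
  return found;
}

// Run `strip` on every shared object under `dir`.
// Returns the total number of bytes saved.
function stripAll(dir) {
  let saved = 0;
  for (const file of findSharedObjects(dir)) {
    const before = fs.statSync(file).size;
    execFileSync('strip', [file]); // assumes binutils `strip` is installed
    saved += before - fs.statSync(file).size;
  }
  return saved;
}

// Example (the path is hypothetical):
// console.log(stripAll('/path/to/compiled/libreoffice') + ' bytes saved');
```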

So now the size is 340 MB, which is 110 MB gzipped 🎉
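For a quick sanity check of the numbers, here is a small sketch that compares each build size from this section against the 512 MB of disk space. The stage names are mine, and "2 GB" is rounded to 2048 MB for the sake of the example.

```javascript
const TMP_LIMIT_MB = 512; // disk space available in Lambda, as stated above

// Build sizes mentioned in this section, in MB.
const stages = [
  { name: 'default build', sizeMb: 2048 }, // "2 GB", rounded
  { name: 'trimmed build', sizeMb: 450 },
  { name: 'stripped build', sizeMb: 340 },
];

// Does an unpacked build of this size fit on disk?
function fitsInTmp(sizeMb, limitMb = TMP_LIMIT_MB) {
  return sizeMb <= limitMb;
}

// Space left over for input and output documents.
function headroomMb(sizeMb, limitMb = TMP_LIMIT_MB) {
  return limitMb - sizeMb;
}

for (const stage of stages) {
  const verdict = fitsInTmp(stage.sizeMb)
    ? 'fits, ' + headroomMb(stage.sizeMb) + ' MB left'
    : 'does not fit';
  console.log(stage.name + ' (' + stage.sizeMb + ' MB): ' + verdict);
}
```

The headroom matters because the documents you convert and the resulting PDFs have to share the same disk space with LibreOffice itself.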

Portability

In order to compile anything for Lambda, you need to build it in the same environment Lambda runs in. Currently, this is the 2017.03 Amazon Linux AMI. If it runs there, then it runs in Lambda.

For the compilation, you will need a beefy server. I took a 16-core C5 one.

I won’t dive into the details of the compilation process, but holy crap, that was hard. I am a n00b in everything that is not node.js. Imagine me reading cryptic C++ errors while trying to compile this beast.

The codebase is 32 years old 😰

I collected all the scripts needed for a smooth compilation in the GitHub repo, so you don’t have to go through this yourself.

https://github.com/vladgolubev/serverless-libreoffice

Summary

Once you have a compiled LibreOffice folder, you can zip it with your Lambda function and deploy. On startup, you unzip it into the /tmp folder and spawn a process.

P.S. Check out a more detailed step-by-step guide and an NPM library.

Also, I put a demo so you could try it asap. Give it a spin!

https://vladholubiev.com/serverless-libreoffice

If this kind of content was interesting for you, check out my previous post on Running Docker Containers in AWS Lambda.


Published by HackerNoon on 2017/11/07