Maintaining an open source C library: lessons learnt from librdkafka

Written by edenhill | Published 2016/12/14
Tech Story Tags: programming | apache-kafka | distributed-systems | c | open-source


For about four years now I’ve maintained librdkafka, the open source Apache Kafka C/C++ client library. This blog post outlines what I’ve learned from its little open source odyssey, covering everything from bit-level compatibility to interpreting broken English.

Countless (if you are really bad at counting, like me) applications have been written using librdkafka, and the community has written language bindings for almost all of the prevalent programming languages on top of it, such as the two I wrote for Confluent: confluent-kafka-python and confluent-kafka-go. See this somewhat complete list for more.

Let’s get down to business.

Open source community

Open source is great, and all that, and relentless. I’ve spent on average 1–2 hours per day over the last three years providing free support to the online open source community. Most of it is answering questions, explaining concepts, or pointing out that the answer is already in the official documentation. Then there’s the inevitable and countless troubleshooting and guidance of users whose assumptions are not in line with the application <-> library contract. And last, least, and most important: issues of varying quality filed for actual bugs in the library.

There isn’t much of an alternative; this is how open source works.

If I could give users seeking help a few tips, it would be:

  • You’re a programmer, I’m a programmer: please give me enough information to understand and debug your problem post mortem. Give me logs, configs, code snippets, and backtraces. Please point out where in the logs things start to go wrong (I’ve received gigabyte-sized log files with no indication of what to look for). I’m pretty good at guessing, but I never guess right. Ask yourself: what information would you need to troubleshoot the issue?
  • Please always try to reproduce your issue with the latest version of the library. But if you can’t reproduce it, that’s fine; logs from a one-time occurrence are useful too.
  • I will spend time investigating your issue and I will spend time helping you, and everyone else, for free. Please put some effort into your issue, it will help us all, especially you.
  • Before filing an issue: read the FAQ, read the documentation, search existing (open and closed) issues. If you are still out of luck; file the issue.
  • Give back: improve documentation, write blog posts, answer StackOverflow questions, answer other users’ issues on GitHub.
  • Feed back: as with any support channel you never hear of the success stories, when things go as planned. Everyone loves to hear that their projects come to actual use, so please drop an email briefly outlining your use-case and if you’re happy with the project, it makes my day.

But, this is probably the wrong medium (he he) for users.

So here are some guidelines for you as an open source library maintainer instead:

  • Be polite, be respectful, treat every user as a paying customer — and if you do, it is much more likely that they’ll some day be a paying customer (whatever that means in your case).
  • Language barrier — you will see a lot of issues written in very poor English. If you’re not an English tutor there isn’t much you can do about it, and these issues are no less valuable than an eloquent native speaker’s essay. Take a deep breath and dig in.
  • Take pride in your work. The quality of your project is evenly distributed between actual run-time characteristics, code quality, and the quality of community communication. If you don’t tend to issues, or you tell people they are idiots (directly or indirectly), your project will suffer, rightfully so. Linus might get away with it, but he really shouldn’t.
  • A lot of companies, especially big corporations, are not allowed to discuss internal systems in public. For this reason you should provide your email address on the project’s GitHub page. You’ll eventually want to move the issue to GitHub anyway for tracking purposes, leaving out anything that can identify the user or the company.
  • Code style guide — write one, it can be really bad, but it is still better than mixed styles in your code. Make sure no PR is accepted without adhering to it, including your own.
  • Write documentation — only you really understand how your library works. Make sure your assumptions are in writing, and make sure the application <-> library contract is clear. This is tedious and boring work, but it pays off. Not that most users will read your documentation before filing issues (that’s too much effort), but it allows you to quickly reply with a link to the relevant section. Make a habit of updating the documentation whenever you find yourself explaining something in an issue, otherwise you are bound to do it again.
  • There are no good in-line code documentation frameworks; just pick one and be consistent.
  • Be interactive — some issues and questions are easier to vet in real time. Join an IRC channel or get a free Gitter channel to support your users interactively. Here’s librdkafka’s Gitter channel.
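Whichever in-line documentation framework you pick, consistency is what pays off. As a sketch, Doxygen-style comments (one common choice for C) on a small, purely illustrative helper might look like this:

```c
#include <stddef.h>
#include <string.h>

/**
 * @brief Copy at most \p size - 1 bytes of \p src into \p dst,
 *        always NUL-terminating the result.
 *
 * @param dst  Destination buffer, must not be NULL.
 * @param src  NUL-terminated source string.
 * @param size Size of \p dst in bytes, must be > 0.
 *
 * @returns The full length of \p src: a return value >= \p size
 *          means the copy was truncated.
 *
 * @remark Unlike strncpy(), \p dst is always NUL-terminated.
 */
static size_t my_strlcpy(char *dst, const char *src, size_t size) {
        size_t srclen = strlen(src);
        size_t n = srclen < size ? srclen : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
        return srclen;
}
```

Documenting the contract (NUL-termination, truncation signalling) right at the declaration keeps the application <-> library contract next to the code, where it is most likely to stay current.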

Use-cases — or, what you don’t know

You probably had a use-case in mind when you decided to create your project, and it will most likely have shaped the design and interfaces of the project.

But you’re not only making this software for yourself, you also want everyone else to use it. Adoption is the goal of open source projects, after all.

If there is one thing I can tell you about users’ use-cases, it is that you have no idea what they are. Sure, they might surface every now and then in issues and emails, but you are only seeing the tip of the iceberg.

With this in mind it is imperative that you try your absolute best not to constrain the use of your software:

  • Don’t make assumptions about performance or latency — implement the fastest and most efficient design you can conceive. A slow library will be okay for most initial users, but a no-go for demanding use-cases. A fast and efficient library will work for everyone from the start. And benchmarks are such a convenient selling point: everyone loves fast.
  • Be frugal — the CPU and memory are not yours. Write green code. Or as Cheryl puts it: be tasteful, not wasteful.
  • If your library shuffles a lot of user data, provide optional zero-copy interfaces where applicable.
  • Write a functional API (not as in FP but as in focus on functionality, not technicality). Abstract the underlying hardships and protocol or implementation details, provide an API that does the right thing. You are the domain expert, not your users, they just want a bit of functionality to add to their program.
  • Don’t require users to understand threading. Make the library thread-safe, this makes it usable in both threaded and non-threaded environments. For example: allow methods to be called from different user threads (where it makes sense), and avoid calling callbacks from your internal threads, always call them from the user’s thread (through a service poll call), otherwise you’ll force the user to understand and correctly implement concurrency — which is hard.
  • Save users from themselves. You’re a fantastic programmer; some of your users might not be. To make the library accessible, stay away from complex APIs or paradigms: don’t require the user to understand low-level threading primitives, distributed systems, et al. Just give them what they want and keep them safe.
  • Don’t do synchronous or blocking interfaces, just don’t. They’re the snakes of our precious garden. Example: quite a few of librdkafka’s users have asked for a synchronous/blocking produce() API that produces a message, waits for delivery, and then returns. This is very practical: you just call produce() and when it returns you know everything is okay. So you try this out on your development machine and it seems to perform pretty well, with sub-millisecond latencies. You’re able to fire off thousands of calls per second; this will be fine, let’s push to production. The production machines are beefier, so things run even faster now. Then the load increases, latencies start to rise, networks act up, and all of a sudden your previously snappy produce() call takes 10–50 ms to complete. Your throughput plummets from 3000 msgs/s to 30 msgs/s, back-pressure trickles up your data pipeline, and queues start building up. Data sources stop accepting new input and boom, your system grinds to a halt, auxiliaries falling like dominoes throughout your platform. If instead you had provided an asynchronous API with a future or callback to indicate status, allowing multiple operations in flight simultaneously, you would be effectively unaffected by increased latencies and your throughput would not suffer. For this reason, instead of providing a sync produce(), I chose to explain why it is bad and steer users in the right direction, saving them from future problems. If you know better, share that knowledge. For people who do understand the problems and still want it (they shouldn’t, really), provide a viable alternative (e.g., I provided a wiki page with an example).
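To illustrate the shape of such an asynchronous API, here is a minimal, self-contained sketch in C. All names (produce(), poll_events(), and so on) are hypothetical and greatly simplified compared to a real client like librdkafka; the point is only that produce() never blocks on delivery, and delivery callbacks fire from the caller’s thread during a poll call:

```c
#include <stddef.h>

/* Hypothetical sketch of an asynchronous produce API: produce()
 * only enqueues and returns immediately; delivery callbacks fire
 * from the caller's thread during poll_events(), never from
 * internal library threads. */

#define MAXQ 64

typedef void (*dr_cb_t)(int msgid, int err, void *opaque);

struct producer {
        dr_cb_t dr_cb;     /* user's delivery-report callback */
        void *opaque;      /* passed back to the callback */
        int pending[MAXQ]; /* msgids awaiting a delivery report */
        int npending;
};

/* Enqueue a message; never blocks waiting for delivery.
 * Returns 0, or -1 if the queue is full (back-pressure signal). */
static int produce(struct producer *p, int msgid) {
        if (p->npending == MAXQ)
                return -1;
        p->pending[p->npending++] = msgid;
        return 0;
}

/* Serve delivery reports from the caller's thread. In a real client
 * the outcomes would arrive from a background network thread; here
 * everything pending simply succeeds. Returns the number served. */
static int poll_events(struct producer *p) {
        int i, served = p->npending;
        for (i = 0; i < p->npending; i++)
                p->dr_cb(p->pending[i], 0 /* no error */, p->opaque);
        p->npending = 0;
        return served;
}

/* Example callback: count successful deliveries. */
static int n_delivered;
static void count_dr(int msgid, int err, void *opaque) {
        (void)msgid; (void)opaque;
        if (err == 0)
                n_delivered++;
}
```

Because produce() only enqueues, latency spikes fill the queue (surfacing as back-pressure the caller can act on) instead of stalling the calling thread.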

API and ABI guarantees

Any C library worth its name must guarantee both API and ABI stability. SONAME bumps are frowned upon and must be avoided as far as possible since it’ll likely break existing applications.

API stability means that recompiling an existing program with a new version of the library will succeed and the program’s interaction with the library remains unchanged.

ABI stability means the same thing without recompiling the program. While API stability allows for changing public types, etc., ABI stability does not: the library must be completely binary compatible, at the bit level, in all its public interfaces, with the previous version.

  • Make all your structs private. Yes, all of them: I have yet to add a public struct that did not eventually need to be extended or modified, and you simply can’t do that without breaking existing applications.
  • Add accessor methods to your private types: getters and setters as needed.
  • Don’t leak internal symbols; use a linker script to expose only your public symbols in the built library. librdkafka uses a well-defined format for its public header files and a small Python script that extracts all public functions and writes them to a linker script file, which is then passed to the linker (where supported, e.g. gcc and clang) when the library is built.
  • Defines are final too; use global getter functions for anything that might change. One example is RD_KAFKA_DEBUG_CONTEXTS, a nice little CSV string listing librdkafka’s supported debugging contexts, e.g. “topic,broker,cgrp,metadata,…”. I thought it would be convenient to provide this as a string through a define, but it turned out this breaks the ABI: a program compiled with an older version using this define would not pick up new debug contexts after the library was replaced (but the program not recompiled), so it is now provided through a function as well. The only notable exception is the library version, e.g. #define RD_KAFKA_VERSION 0x00090300, which you must tell your users to use only at compile time for API discovery.
  • One convenient way to provide backwards-compatible and future-proof interfaces is to use var-arg functions that take a sentinel-terminated list of tuples, where each tuple is made up of a tuple id and that tuple’s specific arguments. To ease use, provide per-tuple macros that take the expected number of arguments, optionally (depending on toolchain support, e.g. gcc and clang) verifying the types of the arguments and, most importantly, casting the provided args to their expected types (since an uncast int can’t correctly be read as an int64_t inside the function). See rdkafka.h for more information.
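The private-struct advice can be sketched as follows; the names are hypothetical, but the pattern (a forward-declared handle in the public header, full definition and accessors in the implementation) is the standard C idiom for keeping struct layout out of the ABI:

```c
#include <stdlib.h>

/* Sketch of the opaque-handle pattern; names are hypothetical.
 * Applications only ever see a pointer to my_conf_t, so fields
 * can be added later without breaking binary compatibility. */

/* --- public header (mylib.h) --- */
typedef struct my_conf_s my_conf_t;  /* opaque: no layout exposed */

my_conf_t *my_conf_new(void);
void my_conf_destroy(my_conf_t *conf);
void my_conf_set_timeout(my_conf_t *conf, int timeout_ms);
int my_conf_timeout(const my_conf_t *conf);

/* --- private implementation (mylib.c) --- */
struct my_conf_s {
        int timeout_ms;
        /* New fields can be appended in later versions: applications
         * never see sizeof(my_conf_t) or its layout. */
};

my_conf_t *my_conf_new(void) {
        my_conf_t *conf = calloc(1, sizeof(*conf));
        if (conf)
                conf->timeout_ms = 1000; /* default */
        return conf;
}

void my_conf_destroy(my_conf_t *conf) {
        free(conf);
}

void my_conf_set_timeout(my_conf_t *conf, int timeout_ms) {
        conf->timeout_ms = timeout_ms;
}

int my_conf_timeout(const my_conf_t *conf) {
        return conf->timeout_ms;
}
```

Since the application never knows sizeof(my_conf_t), the library allocates and frees the struct itself, which is why constructor and destructor functions come along with the pattern.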

Configuration

Design your configuration properties so they reflect desired behaviour, not how to get there. Let them be the contract for what the application wants the library to achieve, leaving the mechanics of achieving it to the library.
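As a hypothetical illustration: let the application state the behaviour it wants (here, a total time budget per operation) and derive the mechanics (retry counts, backoff) inside the library, instead of exposing those mechanics as knobs:

```c
/* Hypothetical sketch: behaviour-oriented configuration.
 * The user expresses intent ("an operation may take at most this
 * long"); the library derives the mechanics from it. */

struct conf {
        int message_timeout_ms; /* desired behaviour: time budget */
};

/* Internally derive how many attempts fit in the budget, given an
 * estimated per-attempt cost. The user never configures retries. */
static int max_attempts(const struct conf *c, int attempt_cost_ms) {
        int n = c->message_timeout_ms / attempt_cost_ms;
        return n > 0 ? n : 1; /* always allow at least one attempt */
}
```

When the implementation changes (say, attempts get cheaper), the behavioural contract still holds and no user has to retune mechanism-level settings.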

Portability

Portability takes the fun out of programming.

Portability effectively means that:

  • You can’t use any language feature added in the last 15 or so years. Yes, people are stuck on old toolchains on old enterprise systems; you will get issues for 13-year-old GCCs on RHEL 2.1. You’ll have to draw the line somewhere, and librdkafka’s fuzzy line is somewhere between C89 and C99, with no outspoken support for vendor toolchains (but they might work anyway).
  • POSIX is great and mostly portable, stick to a minimal feature set, sit down in the boat.
  • Use auto configuration (but for god’s sake, not autoconf). mklove is my non-bloated alternative (mklove as in “make love, not war”, but replacing war with its synonym autoconf).
  • Win32 support — urgh, what a mess. Microsoft has done pretty much everything wrong in its (sort-of) POSIX-compliant APIs: weird or buggy behaviour, underscore-prefixing everything, adding _s to indicate that things are safe (what?!). Great idea! What this means in practice is that you should abstract network and I/O code from the start to make Win32 porting easier. For threads I can strongly recommend tinycthread, a C11-like thread abstraction for POSIX and Win32. Visual Studio is pretty nice though.
  • Use CIs to get quick feedback on portability issues. Travis CI provides Linux and OS X builds; AppVeyor provides Win32. Also look at the SUSE Open Build Service, which allows you to build for pretty much any Linux distro (but is very poorly integrated with the modern web, e.g. GitHub). Andreas Smas’ CI project doozer.io also looks promising.
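In practice, the abstraction advice above often takes the form of a thin portability header that the rest of the code includes; a minimal sketch (names hypothetical, covering only a couple of the differences mentioned):

```c
/* port.h: hypothetical sketch of a thin portability layer that
 * confines platform differences to one header. */
#ifdef _WIN32
#include <windows.h>
#define my_sleep_ms(ms)  Sleep(ms)
#define my_snprintf      _snprintf  /* pre-C99 MSVC spelling */
typedef CRITICAL_SECTION my_mutex_t;
#else
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#define my_sleep_ms(ms)  usleep((ms) * 1000)
#define my_snprintf      snprintf
typedef pthread_mutex_t my_mutex_t;
#endif
```

With every platform quirk funneled through one header, a Win32 port touches a handful of macros and typedefs rather than every file in the tree.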

Releases

For optimal reach, announce releases on Tuesday mornings (EST).

Treat yourself to a beer.

It is hard for a small team to maintain multiple releases, so try to avoid multiple release branches with backported fixes; the increased test matrix and maintenance cost won’t pay off.

Some truths

  • gettimeofday() is very cheap on Linux (kernel-mapped memory, no syscall).
  • don’t check the return value of malloc(constant size) (et al.) for NULL, unless you are allocating large chunks (megabytes) or a variable size (you don’t know the size beforehand, such as st.st_size). Modern platforms over-subscribe memory, so it is not the malloc call that will fail but the later page fault when virtual memory needs to be mapped to physical pages.
  • be careful that calculated malloc sizes, e.g. malloc(n * m), won’t wrap.
  • uncontended mutexes are cheap on Linux.
  • atomics, on the other hand, are sometimes more expensive than expected, but no worse than a mutex.
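One of the truths above deserves a code note: if a calculated size n * m wraps around size_t, malloc() happily returns a too-small buffer and later writes run off its end. A checked helper is one way to guard against it (calloc() performs the same check for you):

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate room for n elements of m bytes each, refusing sizes
 * whose product would wrap around size_t. */
static void *malloc_array(size_t n, size_t m) {
        if (m != 0 && n > SIZE_MAX / m)
                return NULL; /* n * m would overflow */
        return malloc(n * m);
}
```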

WIP

This post is a work in progress, more notes will be added over time (unless this is really all there is to open source projects, then I’m done).


Published by HackerNoon on 2016/12/14