In this blog post I would like to address documentation: why it is important that we produce a good body of docs for our projects, whether they are open source or private.
One of the most important things that we can do to advance our careers and those of our colleagues is to learn to share our knowledge. One of the most direct ways to do that is to write about it: document our code, and explain which tradeoffs we made when we implemented a new feature, fixed a bug, or decided to include a new library or tool in our project.
Sometimes it’s hard to know what to document. In the same way that we need to prioritize what features we implement, or which bugs we fix first, we also need to prioritize what we write docs about. We could address this problem by asking questions about our project, or the current issue we are working on, and then try to answer them with documentation.
Below I list some questions that we should ask about our project, most of them taken from my time as a RabbitMQ core developer. I provide some example answers as well.
Photo credit: https://flic.kr/p/944A8a
Here are the questions, in no particular order.
At RabbitMQ we used our own logger implementation, even though Erlang already provides one. Why? The one provided with the language makes it difficult to rotate logs. Was there any problem with reinventing the wheel? Yes. When a new Erlang version came out that changed the logging API, RabbitMQ logging would crash.
Here’s another example from RabbitMQ. The broker used to have a build infrastructure implemented with Make. It was a bit cumbersome, with little documentation, and hard to extend. Also, it was homemade, while Erlang had at least two well-known build tools: rebar and erlang.mk. While rebar was probably the more popular one inside the Erlang community, we went with erlang.mk. Why? Our existing build tool was already using Make and had lots of features. We used it to build .deb packages, .rpm packages, self-contained OS X tarballs, and Windows releases, to run tests, to compile the broker and RabbitMQ plugins, to manage dependencies from Git/Mercurial, and so on. We didn’t want to port all that functionality to rebar. As an extra bonus, the author of erlang.mk was working with us at the time, so all in all, erlang.mk looked like the right tradeoff for a build tool.
One of my tasks at RabbitMQ was to reimplement parts of the queue persistence layer. Why? We had an issue where RabbitMQ would become slow and block for a while when messages started to accumulate in the queue. Yes, RabbitMQ was very slow when it had to queue things (the irony is strong in this one). After running a couple of benchmarks and going through several debugging sessions, I managed to pin the problem down to how messages were being persisted to the file system. So, I had to reimplement some of the logic there. This whole issue led to the creation of Lazy Queues.
In RabbitMQ many features couldn’t be implemented because AMQP was a constraint on what the broker could do. It mandated what should happen with a message, how it should traverse the internal parts of the broker, when and how a message should be discarded depending on properties set by the publisher, and constraints set by the consumers while subscribing to a queue, and so on.
Sometimes you don’t have a protocol to adhere to, like AMQP, but you have time constraints, or economic ones. Maybe the law doesn’t allow you to collect data from your users (for very good reasons), or you don’t have enough developers on your team who would be able to maintain the new feature after it is deployed to production. I’m sure you can come up with many more examples based on your own line of work. Long story short, there are always tradeoffs.
Of course there could be many more questions like these. The ones presented above are just some examples taken from my experience while working at RabbitMQ.
Once we have the docs in place, we have to make sure our colleagues know that the documentation exists. You don’t want to be writing stuff just so it can go die a solemn death on the company’s wiki.
What I try to do is to just email the team with a link to the new document, but I think that’s not enough. Something better would be to give a presentation to our team explaining how we implemented our new feature. Granted, we can’t have a tech talk for every new feature in our project, but perhaps we could get used to giving talks about the major ones. Let me give you an example.
When I worked on fixing the performance problems in RabbitMQ’s queues, I tried to make sure that everybody in the team knew what I was working on. I tried to explain the problems I was facing while debugging the issue, and later, once I had it pinned down, I tried to explain to everybody in the team what was going on with the queue implementation we had back then. Later on, I produced a couple of documents explaining how RabbitMQ’s queues worked internally. Apart from getting everybody on the same page about how RabbitMQ works internally, the goal was also to make sure we didn’t make the same mistakes again.
You could ask if we could just solve this communication problem by doing pair programming. I don’t think pair programming is the right answer here. There’s a reason why societies collect their lore in written form so it becomes history. Oral tradition gets lost. Besides, we need a solution that scales beyond small teams and that works for remote teams as well. Pair programming is great for sharing ideas and coming up with solutions together, but there has to be a way to solidify the knowledge acquired during pair programming sessions. That’s what documentation and tech talks are for.
Another reason for having good documentation is that not everybody is a code archeologist. We can’t allow every bug to become an archeological research project into our code, where whoever is assigned to the ticket has to dig through the commit history just to understand why the buggy line of code got there in the first place.
We could bring another example from RabbitMQ here. For performance reasons RabbitMQ discontinued support for the immediate flag in AMQP, but RabbitMQ should be a faithful implementation of AMQP, right? Well, maybe that’s true in a perfect world, where Turing machines have infinite tape and network partitions don’t happen, but in our world we have to make tradeoffs. Business goals will be compromised in order to be able to deliver a product. User stories will be split. Features that were planned will be discarded. We need a place where all this is documented, if possible. Keep in mind that team members change, project managers come and go, and we can’t have the same discussion over and over about why feature X is needed, only to realize once again that it can’t be implemented.
Another good reason for writing docs is that sooner or later our team will have new people joining it. We want to make the on-boarding of those new members as smooth as possible. We don’t want them going from desk to desk trying to fish out what the right compiler flags are to get the project to build, or where they can find the credentials to deploy the project to production.
Whether you are working on an open source project or for a private company, there might be remote people contributing to the project. These people might even be in completely different time zones from us. We can’t pretend that the whole team will be available for questions when the other person is working. So having easy-to-access documentation is a must for these kinds of projects. Also, even if you don’t have remote colleagues, people will still take holidays, get sick, or simply be unavailable for questions for whatever reason. So, write those docs please.
Finally I would like to talk a bit about the hero culture that’s well established in the software industry. I think we place too much value on the experience of those who have had to deal with bad architectures in the past. We consider their battle scars to be worth a lot and to deserve reverence.
It doesn’t have to be like that. We can try to build a culture that systematizes the act of learning from these incidents, so newcomers know what’s going on and what could happen if they implement this or that, without suffering so many scars along the way. Yes, they will still need their baptism of fire some Friday evening when they get called in to salvage production because the cache machines have gone rogue, while everybody else is enjoying the start of the weekend. Those incidents teach us to respect good practices and to understand they are there for a reason. They should also teach us to respect our colleagues and understand that their time after work is as valuable as ours. Just because someone is being paid extra to be on call doesn’t mean I don’t have to pay attention to the code I put in production.
If you fix a hard bug, I want you to explain it to everyone, so we learn from the experience: we learn how the system works, we learn how to debug things, and we learn what worked and what didn’t while we were fixing the bug. People will also learn that fixing things the right way™ takes time, and that what we do is not trivial. If we skip all these steps, we are making things harder for our future selves.
If, when we leave a team, we have project knowledge no one else has, then that’s a failure. Let’s work together to prevent that.