Algorithms in the Wild: Reusable Insights and Strategies from the Open Source Community

Written by kharabass | Published 2024/02/21
Tech Story Tags: open-source-software | system-design | software-architecture | what-is-linux | kubernetes-explained | understanding-postgresql | container-orchestration | data-structures-and-algorithms

TL;DR: If you've ever wondered what algorithms turn the millstones of great open source projects, this article is about to whet your appetite even more. Let's hunt for some algorithms and design approaches that can be useful in our daily tasks.

The open source ideology serves as a powerful mechanism not only for uniting individuals with similar interests but also for exchanging experiences and ideas, embodying the principle that sharing is indeed caring. Linux is a great example of how a community of enthusiasts can change the world. Its usage statistics are highly impressive, and we can claim that most of the Internet runs on Linux's horsepower.

"Bad programmers worry about the code. Good programmers worry about data structures and their relationships." (Linus Torvalds)

Managing an open-source project carries significant weight, particularly as its popularity grows: it puts the spotlight on technical expertise and architectural choices. Popular projects can be a treasure trove of knowledge for anyone who wants to boost their skills. Let's delve into a few notable projects.

Why focus on open source?

First of all, it embodies top-tier coding standards, integrating best practices championed by industry leaders. I firmly believe that genuine programmers are fueled by passion and curiosity. Why not learn something new and integrate it into our daily practice?

Secondly, hacking on real code can take your skills to a whole new level if you are consistent and tenacious.

Last but not least, it can help you jump right into a project you like: find new buddies, brand yourself as a skilled developer, and get the job of your dreams. Help the community solve its problems, and the reward will not keep you waiting.

Grab your cup of coffee, and let's get straight to work.

Linux Kernel

The very heart of the Linux system is its kernel. Indeed, this is not merely a single project, but rather a constellation of diverse projects unified by a central idea: tying the different pieces of hardware together and providing a stable, secure API to user space.

The Linux codebase is huge and contains a lot of vendor-specific code, but it is well-organized and structured. I strongly recommend that everyone interested read this. It might be a bit outdated, but it will foster a solid understanding of how the kernel internals work. The kernel version I have on the table is 6.3.

Radix tree for memory management

At the very beginning of main.c we can spot calls to functions like radix_tree_init, which is responsible for initializing the data structure used in memory management.

In the codebase, the radix_tree_root struct is nowadays just an alias for the xarray struct. It holds a spinlock (xa_lock); the gfp_t allocation flags (xa_flags), which tell how memory allocations will be performed; and the pointer xa_head.

struct xarray {
	spinlock_t	xa_lock;
/* private: The rest of the data structure is not to be used directly. */
	gfp_t		xa_flags;
	void __rcu *	xa_head;
};

typedef unsigned int __bitwise gfp_t;	/* a bitmask of allocation flags,
					 * e.g. GFP_KERNEL, GFP_ATOMIC,
					 * __GFP_HIGHMEM, __GFP_HIGH */

If all of the entries in the array are NULL, xa_head is a NULL pointer. If the only non-NULL entry in the array is at index 0, xa_head is that entry. If any other entry in the array is non-NULL, xa_head points to an xa_node, the internal struct that holds the offset, counts, slots, and other fields.

struct xa_node {
	unsigned char	shift;		/* Bits remaining in each slot */
	unsigned char	offset;		/* Slot offset in parent */
	unsigned char	count;		/* Total entry count */
	unsigned char	nr_values;	/* Value entry count */
	struct xa_node __rcu *parent;	/* NULL at top of tree */
	struct xarray	*array;		/* The array we belong to */
	union {
		struct list_head private_list;	/* For tree user */
		struct rcu_head	rcu_head;	/* Used when freeing node */
	};
	void __rcu	*slots[XA_CHUNK_SIZE];
	union {
		unsigned long	tags[XA_MAX_MARKS][XA_MARK_LONGS];
		unsigned long	marks[XA_MAX_MARKS][XA_MARK_LONGS];
	};
};

The radix tree data structure contains "nodes" that link to each other based on an integer key (the index). Each node contains several slots, and each slot can hold a pointer.

The main idea here is to keep the tree wide rather than deep, which makes addressing fast: with the usual 64 slots per node, each level resolves 6 bits of the index, so even large indices are reached in just a few hops.



When the radix_tree_insert function is called by a user (it can be a device driver), the following actions occur:

  • __radix_tree_create creates a slot in the tree if necessary (the passed index determines whether the tree has to grow), or returns an error if no memory is available;
  • insert_entries places the item into the slot;
  • a few assertions check that offsets are applied correctly to the data structure.

Whenever a user of the tree wants to look up a specific page, they look it up by index. The lookup phase is quite efficient thanks to the broad shape of the structure, and it is driven by the key, which is maintained on the user's side. Take a look at the article for more insight into how it works.
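
To make this concrete, here is a minimal sketch of how a driver-like user might employ the classic radix tree API from <linux/radix-tree.h>. The API calls (RADIX_TREE, radix_tree_insert, radix_tree_lookup, radix_tree_delete) are real kernel functions; the surrounding "page cache" scenario and all the names around them are made up for illustration:

#include <linux/radix-tree.h>
#include <linux/gfp.h>
#include <linux/mm_types.h>

/* A hypothetical per-device page cache, keyed by page index. */
static RADIX_TREE(my_pages, GFP_KERNEL);	/* declares and initializes the root */

static int cache_page(unsigned long index, struct page *page)
{
	/* May allocate internal nodes; callers serialize updates themselves. */
	return radix_tree_insert(&my_pages, index, page);
}

static struct page *find_page(unsigned long index)
{
	/* Walks a few wide nodes instead of a long chain of comparisons. */
	return radix_tree_lookup(&my_pages, index);
}

static void drop_page(unsigned long index)
{
	radix_tree_delete(&my_pages, index);
}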

Users of this data structure include GPU and network device drivers, which maintain memory page pointers in a structured format rather than juggling them. It's like having a pocket dimension! This is where Linus's earlier quote comes into play.

Consider the essence of the data you handle; aim for efficiency, compactness, and cleanliness in its usage, regardless of the language or project you're involved in.

In 2021, Liam Howlett (no, another Liam) and Matthew Wilcox introduced the maple tree data structure. I encourage you to read more about the intentions, motivation, and implementation details here and here.

Kubernetes

Let's now turn to architectural design. Kubernetes is one of the most famous orchestrators, created at Google and spread around the world at rocket pace. There are a lot of fairly recent resources out there covering the internal details; for example, here is a repository that you can start exploring as you delve deeper into the details.

The whole point of this article is to get some useful insights that answer not just "how" but "why". Understanding the motivation is crucial; in my opinion, that is how we "catch the mindset" needed to re-implement all that cool stuff in our daily jobs.

Let's briefly recall what Kubernetes is. It is a system that manages the containers that make up applications and provides orchestration features such as scaling and high availability. Kubernetes' main components are the API server, the scheduler, etcd (the key-value store), the controller manager, and the agent, the kubelet, which runs on each node to manage containers. Engineers at Google developed it to utilize the compute resources of their cloud as efficiently as possible; their main intention was a "great container management system".

Architectural highlights

The very first thing I'd like to highlight is the distributed nature of execution. To synchronize the states of all user entities, the orchestrator uses etcd, the key-value store that keeps information in a tree-like structure. The whole system uses this store as the source of truth for maintaining the desired state of the cluster. Having a single source of truth to manage state shared between multiple components is a common pattern, with one slight difference here: the store is also used to "instruct" all of the components about the target state of the cluster.
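
The shape of that pattern is easy to sketch. The snippet below is a deliberately simplified illustration of the control loop in C; all names here are hypothetical, and the real controllers are Go programs that watch the API server rather than poll:

#include <stdbool.h>
#include <unistd.h>

/* Hypothetical view of cluster state: what was requested vs. what runs now. */
struct cluster_state {
	int replicas;
};

/* Stubs standing in for reads from the store (etcd, via the API server)
 * and from the nodes, plus the actions that change the real world. */
extern struct cluster_state read_desired_state(void);
extern struct cluster_state read_observed_state(void);
extern void start_replica(void);
extern void stop_replica(void);

/* The control loop: keep converging the observed state toward the stored one. */
static void reconcile_forever(void)
{
	for (;;) {
		struct cluster_state want = read_desired_state();
		struct cluster_state have = read_observed_state();

		if (have.replicas < want.replicas)
			start_replica();
		else if (have.replicas > want.replicas)
			stop_replica();

		sleep(5);	/* real controllers react to watch events instead */
	}
}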

One more nice architectural decision is the CRI (Container Runtime Interface).

The concept aims to loosen the coupling between the kubelet agent and the runtime that manages containers. It allows Kubernetes to work not just with Docker but with many other runtimes (containerd, CRI-O) by defining an interaction protocol, much like an RFC does. The kubelet agent uses gRPC to talk to the runtime, so anyone who would like to implement their own runtime has to implement that exact interface on their side.
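
The underlying design choice is the classic "program against an interface, not an implementation". In C terms it would look like a table of callbacks that every runtime has to fill in; this is only an illustration of the idea, since the real CRI is a gRPC/protobuf service with RPCs such as RunPodSandbox, CreateContainer, and StartContainer:

/* Illustrative only: a C-flavoured view of a container runtime interface.
 * Any runtime that provides these callbacks could be plugged under the agent. */
struct container_runtime {
	int (*run_pod_sandbox)(const char *pod_config);
	int (*create_container)(const char *sandbox_id, const char *container_config);
	int (*start_container)(const char *container_id);
	int (*stop_container)(const char *container_id);
};

/* The agent side only ever talks through the interface, never to a
 * concrete runtime, so runtimes can be swapped without touching it. */
static int launch(const struct container_runtime *rt,
		  const char *pod_config, const char *container_config)
{
	if (rt->run_pod_sandbox(pod_config) != 0)
		return -1;
	if (rt->create_container("sandbox-0", container_config) != 0)
		return -1;
	return rt->start_container("container-0");
}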

Finally, let's briefly look at the "sidecar" pattern, which was popularized by third-party services like Istio and has recently been implemented within core Kubernetes functionality.

The whole idea is to attach a helper service to the main application to handle specific concerns such as proxying or load balancing. Kubernetes manages this container as a separate entity within the pod: it can influence the status of the entire pod, it can be restarted and updated independently of the main application, and it shares resources with the main app.

The main point here is that the main application may know nothing about the sidecar, while the behavior of the entire service can be changed.

Bless Open Source, we can trace how new functionality was implemented by the community.

PostgreSQL

PostgreSQL is not just a database but a whole platform. After a client connects to the server, a new backend process is spawned. A query goes through many stages while being executed: parsing, the rewrite system, planning, and execution.

The PSQL server has a bunch of interfaces that allow developers to interact with different internal components. For example, we can write our own scan provider aimed at specific optimizations, or create an index add-on. Even though the overall process requires specific knowledge and the documentation can be a bit knotty, the loosely coupled architecture empowers developers to vastly expand the functionality.

There are many extensions that we can use within the platform, and the "bloom index" implementation is one of them.

Bloom index

I faced the problem of growing indexes back in the day. The task was to implement a custom search filter: every column in the database, and any combination of columns, could be a key for a search. Building an index for each and every column would be a bad decision because of the complexity and size.

The very first idea that comes to mind is to reduce the number of possible variations by discarding query values that we are sure are not in the database. For example, if we are sure that item == 2 is false, there is no reason to check all the other fields. I remembered the Bloom filter: a probabilistic data structure that can return false-positive results (because of hash collisions) but is completely sure when the data is absent. As I've mentioned earlier, PSQL has an extension for this data structure.

A Bloom filter in a nutshell: it's just a bit array. Every item is hashed with a certain number of different functions, each of which translates it to an index, and that index sets the corresponding bit in the filter's array. With 3 different functions, an item sets 3 different bits. At the lookup phase, the query is hashed the same way and compared against the bits in the array. If at least one of the bits is not set, we can be sure the item is absent. If all of them are set, it is still possible that the item was never inserted, because each bit we found could have been set by other items; that is where collisions come in. Check out this great explanation of how it works with examples.
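
As a toy illustration (my own minimal sketch, not the PostgreSQL code), the whole mechanism fits in a bit array plus a few salted hash functions:

#include <stdbool.h>
#include <stdint.h>

#define FILTER_BITS	1024
#define NUM_HASHES	3

static uint8_t filter[FILTER_BITS / 8];

/* Cheap FNV-1a style hash, salted per hash function (illustrative only). */
static uint32_t bloom_hash(const char *key, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;
	while (*key) {
		h ^= (uint8_t) *key++;
		h *= 16777619u;
	}
	return h % FILTER_BITS;
}

static void bloom_add(const char *key)
{
	for (uint32_t i = 0; i < NUM_HASHES; i++) {
		uint32_t bit = bloom_hash(key, i);
		filter[bit / 8] |= (uint8_t) (1u << (bit % 8));
	}
}

/* false: definitely absent; true: possibly present (false positives happen). */
static bool bloom_maybe_contains(const char *key)
{
	for (uint32_t i = 0; i < NUM_HASHES; i++) {
		uint32_t bit = bloom_hash(key, i);
		if (!(filter[bit / 8] & (1u << (bit % 8))))
			return false;
	}
	return true;
}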

To start using the bloom index in PSQL, we need to enable the extension (CREATE EXTENSION bloom) and create the index properly:

CREATE INDEX bloom_idx_bar ON foo.bar USING bloom (id,dept_id,zipcode) WITH (length=80, col1=4, col2=2, col3=4);

where "length" is the length of the signature (the array of bits) for each row, and the "col" numbers describe how many bits are set for each column (how many hash functions are applied). By increasing these numbers, we can reduce the number of false-positive hits.
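
The effect of those knobs can be estimated with the standard Bloom filter formula p ~ (1 - e^(-k*n/m))^k, where m is the signature length in bits, k is how many bits each value sets, and n is how many values are hashed into one signature. A quick back-of-the-envelope check (the numbers are mine, not from the docs):

#include <math.h>
#include <stdio.h>

/* Rough false-positive estimate for a Bloom signature:
 * m = bits in the signature, k = bits set per value, n = values hashed in. */
static double bloom_fp_rate(double m, double k, double n)
{
	return pow(1.0 - exp(-k * n / m), k);
}

int main(void)
{
	/* e.g. an 80-bit signature with 4 bits per value and ~10 values per row */
	printf("estimated false-positive rate: %.3f\n", bloom_fp_rate(80, 4, 10));
	return 0;
}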

The creation of an index in PSQL is a bit complicated, and we should understand the internal data structures and protocols, so let's try to keep the overview short and sweet.

The blhandler function exposes the data structure's functions as an IndexAmRoutine pointer, a pointer to the struct that represents the API for index access methods.
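
Roughly speaking, the handler just fills in a table of callbacks and hands it back to the core. Here is an abridged sketch of the pattern (the full version in contrib/bloom/blutils.c sets many more fields):

Datum
blhandler(PG_FUNCTION_ARGS)
{
	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);

	/* Wire up the access-method callbacks the executor will call. */
	amroutine->ambuild = blbuild;
	amroutine->aminsert = blinsert;
	amroutine->ambeginscan = blbeginscan;
	amroutine->amgetbitmap = blgetbitmap;
	amroutine->amendscan = blendscan;
	/* ...plus cost estimation, options, vacuum callbacks, and so on. */

	PG_RETURN_POINTER(amroutine);
}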

The blbuild function creates the new index: it allocates memory pages, scans the table for tuples to be indexed, and keeps the per-index information in the BloomState structure.

typedef struct BloomState
{
	FmgrInfo	hashFn[INDEX_MAX_KEYS];
	Oid			collations[INDEX_MAX_KEYS];
	BloomOptions opts;			/* copy of options on index's metapage */
	int32		nColumns;

	/*
	 * sizeOfBloomTuple is index-specific, and it depends on reloptions, so
	 * precompute it
	 */
	Size		sizeOfBloomTuple;
} BloomState;

When we insert data into the index, we pass it as an array of values together with an array of booleans that marks NULL values, so they can be skipped at the indexing phase. The BloomFormTuple function forms a BloomTuple struct from the data; this is the part where the data gets signed. After that, the tuple is placed onto a memory page.

BloomTuple *
BloomFormTuple(BloomState *state, ItemPointer iptr, Datum *values, bool *isnull)
{
	int			i;
	BloomTuple *res = (BloomTuple *) palloc0(state->sizeOfBloomTuple);

	res->heapPtr = *iptr;

	/* Blooming each column */
	for (i = 0; i < state->nColumns; i++)
	{
		/* skip nulls */
		if (isnull[i])
			continue;

		signValue(state, res->sign, values[i], i);
	}

	return res;
}

At the scan phase, blbeginscan forms the IndexScanDesc struct, which contains the index's data, and blgetbitmap builds a TIDBitmap struct (the internal data structure that holds item pointers) from it by reading the index page by page.

I hope the explanation isn't too confusing; PSQL internals are quite specific.

The Bloom filter is a pretty common data structure, and PSQL uses the same approach for some internal purposes. It is also broadly used in many other situations: from cache lookups to kernel packet-filter implementations.

More to know

In this article, I've just lifted the veil on the shiny open-source world. There are many more great implementations you can meet in the wild: CFS and CFQ, if you'd like to catch the magic of "fair" resource allocation; operational transforms and CRDTs, to understand simultaneous editing in Google Docs.

The whole software world is in motion: sidecars and operators are claimed to be the new model for software delivery; eBPF is in a hurry to take over the service-mesh post.

Take a look at the elegance of the great things’ internals.


Written by kharabass | Algorithmic enthusiasts, IaaS/PaaS builder.
Published by HackerNoon on 2024/02/21