The open source ideology serves as a powerful mechanism not only for uniting individuals with similar interests but also for exchanging experiences and ideas, embodying the principle that sharing is indeed caring. Linux is a great example of how a community of enthusiasts can change the world.
“Bad programmers worry about the code. Good programmers worry about data structures and their relationships.” — Linus Torvalds
Managing an open-source project carries significant weight, particularly as its popularity grows: it puts the spotlight on technical expertise and architectural choices. Popular projects can be a treasure trove of knowledge for anyone who wants to boost their skills. Let's delve into a few notable projects and see why they are worth studying.
First of all, such a codebase embodies top-tier coding standards, integrating best practices championed by industry leaders. I firmly believe that genuine programmers are fueled by passion and curiosity. Why not learn something new and integrate it into our daily practice?
Secondly, hacking code could take your skills to a whole new level if you are consistent and tenacious.
Last but not least, it could help you jump right into the project you like: find new buddies, brand yourself as a skilled developer, and get the job of your dreams. Help the community to solve their problems, and the reward will not keep you waiting.
Grab your cup of coffee, and let's get straight to work.
The very heart of the Linux system is its kernel. Indeed, this is not merely a singular project, but rather a constellation of diverse projects unified by a central idea: tying different pieces of hardware together and providing a stable, secure API for user space.
The Linux codebase is huge and contains a lot of vendor-specific code, but it is well organized and structured. I strongly recommend this read to everyone interested: it might be a bit outdated, but it will foster a solid understanding of how the kernel internals work. The kernel version I have on the table is 6.3.
At the very beginning of init/main.c we can spot definitions of some core functions, such as start_kernel().
In the codebase, the radix_tree_root struct is simply an alias for the xarray struct:
struct xarray {
    spinlock_t  xa_lock;
/* private: The rest of the data structure is not to be used directly. */
    gfp_t       xa_flags;
    void __rcu  *xa_head;
};
/*
 * gfp_t is not an enum but a bit-mask typedef; allocation flags such as
 * GFP_KERNEL, GFP_ATOMIC, __GFP_HIGHMEM, and __GFP_HIGH are OR-ed into it.
 */
typedef unsigned int __bitwise gfp_t;
If all of the entries in the array are NULL, xa_head is a NULL pointer. If the only non-NULL entry in the array is at index 0, xa_head is that entry. If any other entry in the array is non-NULL, xa_head points to an xa_node — the internal struct that handles offset, counts, slots, and other fields.
struct xa_node {
    unsigned char   shift;      /* Bits remaining in each slot */
    unsigned char   offset;     /* Slot offset in parent */
    unsigned char   count;      /* Total entry count */
    unsigned char   nr_values;  /* Value entry count */
    struct xa_node __rcu *parent;   /* NULL at top of tree */
    struct xarray   *array;     /* The array we belong to */
    union {
        struct list_head private_list;  /* For tree user */
        struct rcu_head rcu_head;       /* Used when freeing node */
    };
    void __rcu  *slots[XA_CHUNK_SIZE];
    union {
        unsigned long tags[XA_MAX_MARKS][XA_MARK_LONGS];
        unsigned long marks[XA_MAX_MARKS][XA_MARK_LONGS];
    };
};
The radix tree data structure contains “nodes” that link with each other based on the unsigned long key (index). Each node contains several slots, each of which can hold a pointer.
The main idea here is to keep the tree wide, not deep: each node fans out into many slots (XA_CHUNK_SIZE, usually 64), so even large indices are reachable in just a few hops, which makes addressing fast.
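The "wide, not deep" idea can be sketched in a few lines. This is an illustration only, not kernel code: the `toy_` names are mine, every node holds 64 slots, and the index is consumed 6 bits at a time, so a 2-level tree already covers 64 * 64 = 4096 entries with just a couple of pointer hops per lookup.

```c
#include <stdlib.h>

#define TOY_CHUNK_BITS 6
#define TOY_CHUNK_SIZE (1UL << TOY_CHUNK_BITS)  /* 64 slots per node */

struct toy_node {
    void *slots[TOY_CHUNK_SIZE];
};

/* "shift" is how many index bits remain below the current level. */
static void toy_insert(struct toy_node *root, unsigned int shift,
                       unsigned long index, void *entry)
{
    struct toy_node *node = root;

    while (shift) {
        unsigned long offset = (index >> shift) & (TOY_CHUNK_SIZE - 1);

        /* Allocate intermediate nodes lazily on the way down. */
        if (!node->slots[offset])
            node->slots[offset] = calloc(1, sizeof(struct toy_node));
        node = node->slots[offset];
        shift -= TOY_CHUNK_BITS;
    }
    node->slots[index & (TOY_CHUNK_SIZE - 1)] = entry;
}

static void *toy_lookup(struct toy_node *root, unsigned int shift,
                        unsigned long index)
{
    struct toy_node *node = root;

    while (node && shift) {
        node = node->slots[(index >> shift) & (TOY_CHUNK_SIZE - 1)];
        shift -= TOY_CHUNK_BITS;
    }
    return node ? node->slots[index & (TOY_CHUNK_SIZE - 1)] : NULL;
}
```

The real kernel structure adds RCU protection, marks, and clever entry tagging on top of this skeleton, but the addressing scheme is the same.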
Whenever a user of the tree wants to look up a specific page, they do it by index. The lookup phase is quite efficient because of the broad nature of the structure: the index supplied by the user is consumed a few bits at a time, each chunk selecting a slot at the next level. Take a look at the radix_tree_lookup() function.
Users of this data structure include GPU and network device drivers, which maintain memory page pointers in a structured format rather than juggling them. It's like having a pocket dimension! This is where Linus's earlier quote comes into play.
Consider the essence of the data you handle; aim for efficiency, compactness, and cleanliness in its usage, regardless of the language or project you're involved in.
In 2021 Liam Howlett (no, another Liam) and Matthew Wilcox introduced the maple tree, an RCU-safe, range-based tree that has since replaced the red-black tree for tracking virtual memory areas.
Let's direct our steps to the point of architectural design. One of the most famous orchestrators, invented by Google and spread around the world at rocket pace, is Kubernetes. There are a lot of fairly new resources out there covering the internal details.
The whole point of this article is to extract useful insights that answer not just “how” but “why”. Understanding the motivation is crucial: in my opinion, that is how we “catch the mindset” and can re-implement all this cool stuff in our daily jobs.
Let’s briefly recall what Kubernetes is. It is a system that manages containerized applications and provides orchestration features such as scaling and high availability. Kubernetes’ main components are the API server, the scheduler, etcd (the key-value storage), the controller manager, and the agent service — the “kubelet” — running on each node to manage containers. The folks at Google developed it to utilize the compute resources of their cloud efficiently.
The very first thing I’d like to highlight is the distributed nature of execution. To synchronize the states of all the user’s entities, the orchestrator uses etcd, a key-value store that keeps the cluster state consistent across the control plane via the Raft consensus algorithm.
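The control plane built on top of that storage is level-triggered: controllers repeatedly compare the desired state recorded in etcd with the observed state of the cluster and nudge the latter toward the former. Here is a minimal sketch of that reconciliation idea; the `toy_` names and the single `replicas` field are my own simplification, not Kubernetes code.

```c
struct toy_state {
    int replicas;
};

/* One reconcile step; returns 1 while there is still work to do. */
static int toy_reconcile(const struct toy_state *desired,
                         struct toy_state *observed)
{
    if (observed->replicas < desired->replicas) {
        observed->replicas++;   /* "start a pod" */
        return 1;
    }
    if (observed->replicas > desired->replicas) {
        observed->replicas--;   /* "stop a pod" */
        return 1;
    }
    return 0;                   /* converged: desired == observed */
}
```

The point of this design is resilience: a controller that crashes mid-way loses nothing, because the next iteration starts again from the current difference between desired and observed state.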
One more nice architectural decision is the Container Runtime Interface (CRI).
The concept aims to loosen the coupling between the kubelet agent and the runtime which manages containers. It allows Kubernetes to work not just with Docker but with many other runtimes (containerd, CRI-O) by defining a gRPC API that any runtime can implement.
At the end, let’s briefly look at the “sidecar” pattern that was popularized by third-party services such as the Istio and Linkerd service meshes.
The whole idea is to embed a helper container alongside the main application to handle specific concerns such as proxying or load balancing. Kubernetes manages this container as a separate entity within the pod: the container can share the pod’s network and storage with the main application while being built and updated independently.
The main point here is that the main application may know nothing about the sidecar, while the behavior of the entire service can be changed.
Bless Open Source, we can open the source code and see everything for ourselves.
PostgreSQL is not just a database but a whole platform. After a client connects to the server, a new backend process is forked to serve that session.
The PSQL server has a bunch of built-in index access methods: B-tree, hash, GiST, SP-GiST, GIN, and BRIN.
There are also many extensions that we can use within the platform, and the “bloom index” is one of them.
I faced the problem of growing indexes back in the day. The task was to implement a custom search filter: every column in the database, and any combination of them, could be a search key. Building an index for each and every combination of columns would be a bad decision because of the complexity and size.
The very first idea that comes to mind is to reduce the number of possible variations by removing the query values that we are sure are not in the database. For example, if we are sure that item == 2 is false, there is no reason to check all the other fields. I remembered the Bloom filter.
A Bloom filter in a nutshell: it’s just a bit array. Every item is hashed with a certain number of different functions, each of which translates it to some index; that index sets the corresponding bit in the filter’s array. With 3 different functions, an item sets 3 different bits. At the lookup phase, the query is hashed the same way and compared against the bits in the array. If at least one of the bits is not set, we can be sure the item is absent. If all of them are set, it is still possible that the item was never inserted — each bit we found could have been set by other items, i.e. a collision takes place.
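The description above fits in a few lines of C. This is a toy sketch, not the PostgreSQL implementation: the `toy_` names are mine, the filter is a single 64-bit word, and the k = 3 hash functions are derived from one seeded FNV-1a hash.

```c
#include <stdint.h>

#define TOY_FILTER_BITS 64
#define TOY_NUM_HASHES  3

/* FNV-1a with a seed mixed in; any decent hash family works here. */
static uint64_t toy_hash(const char *s, uint64_t seed)
{
    uint64_t h = 1469598103934665603ULL ^ seed;

    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Set TOY_NUM_HASHES bits for the item. */
static void toy_bloom_add(uint64_t *filter, const char *item)
{
    for (uint64_t i = 0; i < TOY_NUM_HASHES; i++)
        *filter |= 1ULL << (toy_hash(item, i) % TOY_FILTER_BITS);
}

/* Returns 0 only when the item is definitely absent; 1 means "maybe". */
static int toy_bloom_maybe(uint64_t filter, const char *item)
{
    for (uint64_t i = 0; i < TOY_NUM_HASHES; i++)
        if (!(filter & (1ULL << (toy_hash(item, i) % TOY_FILTER_BITS))))
            return 0;
    return 1;
}
```

Note the asymmetry: a negative answer is always exact, while a positive one only says “worth checking”, which is exactly what a lossy index needs.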
To start using the bloom index within the PSQL we need to create it properly:
CREATE INDEX bloom_idx_bar ON foo.bar USING bloom (id,dept_id,zipcode) WITH (length=80, col1=4, col2=2, col3=4);
where “length” is the number of signature bits stored for each row, and the “col” numbers describe how many bits are set for each column (that is, how many hash functions are applied to it). By increasing these numbers we can reduce the number of false positives.
The process of index creation in PSQL is a bit complicated, and we should understand the internal data structures and protocols, so let’s try to keep the overview short and sweet.
The blhandler function registers the access method: it fills an IndexAmRoutine struct with callbacks such as blbuild, blinsert, and blgetbitmap.
The BloomState struct caches everything the access method needs to form and store index tuples:
typedef struct BloomState
{
    FmgrInfo    hashFn[INDEX_MAX_KEYS];
    Oid         collations[INDEX_MAX_KEYS];
    BloomOptions opts;      /* copy of options on index's metapage */
    int32       nColumns;

    /*
     * sizeOfBloomTuple is index-specific, and it depends on reloptions, so
     * precompute it
     */
    Size        sizeOfBloomTuple;
} BloomState;
When we insert data into the index, we pass it as an array of values together with an array of booleans that marks NULL values, so they can be skipped at the indexing phase. The BloomFormTuple function forms a BloomTuple struct from the data — this is the part where the values get “signed” (hashed into the signature). After that, the tuple is placed on an index memory page.
BloomTuple *
BloomFormTuple(BloomState *state, ItemPointer iptr, Datum *values, bool *isnull)
{
    int         i;
    BloomTuple *res = (BloomTuple *) palloc0(state->sizeOfBloomTuple);

    res->heapPtr = *iptr;

    /* Blooming each column */
    for (i = 0; i < state->nColumns; i++)
    {
        /* skip nulls */
        if (isnull[i])
            continue;

        signValue(state, res->sign, values[i], i);
    }

    return res;
}
At the scan phase, the blgetbitmap function walks the whole index, compares each tuple’s signature with the signature built from the query keys, and adds the matching heap pointers to a bitmap; since the index is lossy, the executor then rechecks the candidates against the actual rows.
I hope the explanation isn't too confusing; PSQL internals are quite specific.
The Bloom filter is a pretty common data structure, and PSQL uses the same approach for some of its internals as well.
In this article, I’ve just lifted the veil on the shiny open-source world. There are many more great implementations you could meet in the wild.
The whole software world is in motion: sidecars and operators, radix trees and Bloom filters are only a small sample of the ideas worth borrowing.
Take a look at the elegance of these great things’ internals.