A look at two databases that have made claims to the Kubernetes native label: TiDB and DataStax Astra DB.
The cloud computing revolution has inspired and benefitted from multiple interrelated trends. The availability of self-service, public cloud infrastructure has helped to drive the adoption of microservice architectures and DevOps practices, including automation and observability.
The drive toward containerization and container orchestration has led to the widespread adoption of Kubernetes as an environment for managing cloud-native applications.
But one of the lagging areas in this revolution has been data and data infrastructure. For too long, data has been something that has lived outside of Kubernetes, leading to a lot of extra effort and complexity for developers in deploying cloud-native applications.
One oft-repeated axiom in the early years of Kubernetes was that it was not yet ready for stateful workloads. Thankfully, a major shift has been quietly underway and has reached a point of maturity.
The transformation began slowly, with efforts to containerize existing databases. This worked relatively well for small databases that ran on a single compute node, or for databases that had been designed in a cloud-native world, like Apache Cassandra and DynamoDB, but challenges remained.
Over the past two to three years, a new generation of databases has emerged. These “Kubernetes native” databases have been designed from the ground up to run on this open-source orchestration system.
Here, we’ll define the qualities that make a database Kubernetes native and the benefits of adopting a Kubernetes native database. To do that, we’ll look at two databases claiming the Kubernetes native label: TiDB and DataStax Astra DB.
First, let’s examine a database with a relational emphasis: TiDB (short for Titanium Database). TiDB is an open-source system built by PingCAP that provides a MySQL-compatible database and a columnar database to support hybrid transactional and analytical processing (HTAP).
As shown in Figure 1 below, TiDB has a microservice design. The TiDB query layer, TiKV row-oriented key-value stores, TiFlash columnar databases, Spark nodes, and metadata management are each deployed as scalable microservices in their own clusters. This design separates compute-intensive work from storage-intensive work, as the query and database layers are independently scalable.
One critical commitment the TiDB creators made was that the database only runs on Kubernetes.
Is that enough to make it Kubernetes native?
Let’s dig a bit deeper.
First, TiDB is deployed and managed by a Kubernetes operator using custom resource definitions (CRDs). The TiDB CRDs include TidbCluster, which enables you to specify the scaling and configuration of each microservice and how the database layer components use storage through Kubernetes Persistent Volumes. Additional CRDs are used to deploy monitoring tools and manage operational tasks like backup and restore.
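To make the idea of a declarative cluster description concrete, here is a minimal sketch in Python that assembles a TidbCluster resource as a plain dictionary and prints it as JSON. The field names (baseImage, replicas, requests.storage) and the apiVersion are recalled from the operator's published basic example and should be checked against the CRD version you install; the cluster name, version string, replica counts, and storage sizes are placeholders, and the helper function is hypothetical.

```python
import json


def tidb_cluster_manifest(name, version, tikv_replicas, tikv_storage):
    """Build a TidbCluster custom resource as a plain dict (illustrative only)."""

    def component(image, replicas, storage=None):
        spec = {"baseImage": image, "replicas": replicas, "config": {}}
        if storage is not None:
            # Storage requests are satisfied through Kubernetes Persistent Volumes.
            spec["requests"] = {"storage": storage}
        return spec

    return {
        # Assumed group/version and kind; verify against the installed CRD.
        "apiVersion": "pingcap.com/v1alpha1",
        "kind": "TidbCluster",
        "metadata": {"name": name},
        "spec": {
            "version": version,
            # Metadata management, storage, and query layers scale independently.
            "pd": component("pingcap/pd", 3, "1Gi"),
            "tikv": component("pingcap/tikv", tikv_replicas, tikv_storage),
            "tidb": component("pingcap/tidb", 2),
        },
    }


# Placeholder values chosen for illustration.
manifest = tidb_cluster_manifest("demo", "v6.5.0", tikv_replicas=3, tikv_storage="10Gi")
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, output like this could in principle be applied to a cluster where the operator is installed. The point of the sketch is that scaling the storage layer becomes a one-field change that the operator reconciles.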
TiDB also has an optional scheduler extension that interfaces with the default K8s scheduler to make more application-aware scheduling decisions. This emphasis on using existing Kubernetes capabilities where available is the mark of a Kubernetes native database.
Now, let’s look at another Kubernetes native database and note some similarities and differences.
Cassandra is a highly scalable NoSQL database that was one of the first to claim to be cloud native, but what does it look like to deploy Cassandra in Kubernetes?
DataStax Astra DB is a version of Cassandra that has been factored into microservices, as shown in Figure 2.
Like TiDB, the database includes microservices concerned with query processing and data storage, as well as services for identity and access control, data repair, and backup/restore.
The data services are particularly interesting in their use of storage, with Kubernetes Persistent Volumes used only for caching and object storage used for longer-term persistence. Separating compaction into its own service enables this compute-intensive processing to happen in the background without affecting the performance of data services serving read and write traffic.
Astra DB is offered as a managed service available in multiple cloud regions. Each region contains a data plane consisting of the services mentioned above, managed by a Kubernetes operator, as well as infrastructure services, including the kube-prometheus stack for observability and etcd for metadata management.
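The storage split described above can be pictured with a small conceptual sketch. This is not Astra DB's implementation: the TieredStore class, its method names, and the use of an in-memory dictionary as a stand-in for object storage are all assumptions made for illustration, and compaction is not modeled. It only shows the general pattern of a bounded local cache in front of a durable tier.

```python
from collections import OrderedDict


class TieredStore:
    """Write-through, read-through cache over a durable object store (sketch)."""

    def __init__(self, object_store, cache_capacity):
        self.object_store = object_store   # stand-in for an object storage bucket
        self.cache = OrderedDict()         # stand-in for a Persistent Volume cache
        self.cache_capacity = cache_capacity

    def _cache_put(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        while len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry

    def write(self, key, value):
        self.object_store[key] = value      # the durable copy is written first
        self._cache_put(key, value)

    def read(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)     # mark as recently used
            return self.cache[key]
        value = self.object_store[key]      # cache miss: fetch from the durable tier
        self._cache_put(key, value)
        return value


bucket = {}
store = TieredStore(bucket, cache_capacity=2)
for k in ("a", "b", "c"):
    store.write(k, k.upper())
```

In this sketch, losing the cache loses no data, because every write lands in the durable tier. That is what makes it reasonable to treat the local volume as disposable.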
The data planes are managed by a control plane that can run in one or more clouds to manage customer accounts and databases and provision Kubernetes clusters in new regions.
One novel aspect of Astra DB is its multi-tenant architecture, in which multiple user databases can share the same microservices and supporting infrastructure, lowering unit costs for smaller-scale users.
As users grow their applications, they can move to dedicated resources to achieve optimal performance at scale, all on a “pay-as-you-go” basis.
Based on our observations of TiDB and Astra DB, we can derive some ideas of what makes a database Kubernetes native. Many of these correspond to a list of principles for cloud-native data, which I described in an earlier article.
Databases and other data infrastructure that faithfully adopt these principles will yield benefits, including high performance at optimal cost at all scales, lower operational complexity resulting in faster time to market, and standards-compliant solutions that meet today’s high availability and security demands.
Much progress is still to be made, and it’s not limited to databases alone. Kubernetes native principles can be applied to other types of data infrastructure, including streaming, analytics, and machine learning.
Kubernetes native solutions will continue to make strides in multicluster and multi-cloud deployments to scale globally and will adopt multitenancy and serverless principles for better cost optimization.
Kubernetes itself has room for improvement in adding more flexibility to StatefulSets and support for multicluster federation.
The key to continued progress is open collaboration. The Data on Kubernetes Community is a highly active group of data geeks bringing together builders of data-intensive applications and the infrastructure that supports them.
Join us to talk about ideas like developing reusable operators that can manage multiple databases or defining a common set of CRDs for concepts like backup/restore and data loading. Together we’ll continue to push the horizon of cloud computing for the benefit of all.
Learn more about Kubernetes native databases at the Cassandra Forward digital summit on March 14, 2023.
This article is based on Chapter 7, “The Kubernetes Native Database,” from the O’Reilly book “Managing Cloud Native Data on Kubernetes” by Jeff Carpenter and Patrick McFadin.
By Jeff Carpenter, DataStax
Jeff Carpenter has worked as a software engineer and architect in multiple industries and as a developer advocate at DataStax, helping engineers succeed with Apache Cassandra. He's involved in multiple open source projects in the Cassandra and Kubernetes ecosystems, including Stargate and K8ssandra. He is co-author of the O’Reilly books "Cassandra: The Definitive Guide" and "Managing Cloud Native Data on Kubernetes."