Cassandra: Highly Scalable Database Out of the Box

Cassandra is a distributed, decentralized, scalable, and highly available wide-column database.

In terms of the CAP theorem, Cassandra is classified as an AP system (availability and partition tolerance).

This means that Cassandra keeps serving requests even when some nodes are unavailable or a partial network failure occurs. However, it also means that the consistency of the data can be compromised in favor of availability and partition tolerance: users will see the data, but it may be stale for a while.

Cassandra is designed for high throughput and fast write operations.

It is precisely this willingness to relax consistency that allows Cassandra to be highly available.

By default, Cassandra is eventually consistent: replicas converge over time, but a read may briefly return stale data. This makes Cassandra suitable for applications where consistency is not a critical requirement. However, it is possible to configure Cassandra to provide strong consistency by choosing read and write consistency levels whose replica sets overlap, although this can impact performance.
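The trade-off between eventual and strong consistency can be made concrete with a little arithmetic. The sketch below is not Cassandra code. It only checks the commonly cited rule that a read is guaranteed to see the latest write when the number of replicas acknowledging the read plus the number acknowledging the write exceeds the replication factor. The function names are hypothetical and chosen for illustration.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True when every set of read replicas must overlap every set of write replicas."""
    return read_acks + write_acks > replication_factor


rf = 3
print(quorum(rf))                                             # 2
print(reads_see_latest_write(rf, write_acks=1, read_acks=1))  # False
print(reads_see_latest_write(rf, quorum(rf), quorum(rf)))     # True
```

With a replication factor of 3, reading and writing at a single replica is fast but may return stale data, while quorum reads and writes overlap on at least one replica at the cost of waiting for more nodes.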

Cassandra, being a NoSQL database, does not support table joins or foreign keys, and it restricts the WHERE clause: without a secondary index or ALLOW FILTERING, queries can filter only on primary key columns. These limitations should be taken into consideration before choosing to use Cassandra.

Cassandra building blocks

Column: the fundamental unit of data, a key-value pair (a column name and its value).
Row: a container for columns, identified by the primary key.
Keyspace: a container for tables whose data spans one or more Cassandra nodes.
Cluster: a container of keyspaces, made up of one or more nodes.
Node: a computer system that runs an instance of Cassandra. A node can be a physical host, a machine instance in the cloud, or even a Docker container.

How Cassandra stores data

Cassandra stores the data as a column family. It serves as a container for columns that are referenced by a primary key.

A row of a column family holds several columns, each with a name (key) and a value, and the row key serves as the primary key.

A column family can store a different set of columns for each row key.

Cassandra does not store columns with null values, which helps save significant storage space.
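As a rough mental model (not Cassandra's actual storage engine), a column family can be pictured as a map from row key to that row's own set of columns. The data and helper names below are made up for illustration.

```python
# Toy model: a column family as a mapping from row key to that row's columns.
column_family = {
    "user:1": {"name": "Alice", "email": "alice@example.com", "city": "Paris"},
    "user:2": {"name": "Bob"},  # no email or city: those columns are simply not stored
}


def get_column(row_key, column_name):
    """Return the column value, or None when the row has no such column."""
    return column_family.get(row_key, {}).get(column_name)


def stored_cells(family):
    """Count the cells that actually occupy space (missing columns cost nothing)."""
    return sum(len(columns) for columns in family.values())


print(get_column("user:2", "email"))  # None
print(stored_cells(column_family))    # 4
```

A fixed-width table with three columns would reserve six cells for these two rows. The sparse model stores only the four that have values.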

What is the primary key in Cassandra?

The primary key uniquely identifies each row of a table. In Cassandra, the primary key has two parts: a mandatory partition key and optional clustering columns (the clustering key).

The partition key determines which node stores the data, while the clustering key determines how the data is sorted within a partition. For instance, consider a table with PRIMARY KEY (city_id, event_id). This primary key consists of two parts, represented by the two columns:

1. city_id serves as the partition key, meaning that data will be partitioned based on the city_id field, resulting in all rows with the same city_id being stored on the same node.

2. event_id acts as the clustering key. Within each partition, the data is organized and stored in sorted order based on the event_id column.

It is possible to have multiple clustering columns: any columns listed after the partition key in the primary key are clustering columns. Together they define the order in which rows are stored within a partition on a node.

Every row with partition key = "Paris" will be stored on the same node, ordered by the value of the event_id column.
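The effect of PRIMARY KEY (city_id, event_id) can be imitated in a few lines of plain Python. This is only a sketch of the grouping and ordering. The sample rows and the build_partitions helper are invented for illustration, and event_id is assumed to be an integer.

```python
from collections import defaultdict

rows = [
    {"city_id": "Paris", "event_id": 30, "name": "Jazz night"},
    {"city_id": "Rome", "event_id": 7, "name": "Food fair"},
    {"city_id": "Paris", "event_id": 12, "name": "Book fair"},
]


def build_partitions(rows):
    """Group rows by partition key, then sort each group by clustering key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["city_id"]].append(row)      # partition key picks the group
    for partition in partitions.values():
        partition.sort(key=lambda r: r["event_id"])  # clustering key orders the group
    return dict(partitions)


partitions = build_partitions(rows)
print([r["event_id"] for r in partitions["Paris"]])  # [12, 30]
```

Because the rows of a partition are already sorted, a query for a range of event_id values within one city reads a contiguous slice rather than scanning the whole table.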

Data partitioning out of the box

Cassandra provides data partitioning based on Consistent Hashing to reduce latency in read/write operations and increase the throughput of the system when the amount of data stored in the database becomes large.

The partitioner in Cassandra is responsible for deciding how data is distributed across the consistent hash ring. When data is inserted into a Cassandra cluster, the partitioner applies a hashing algorithm to the partition key. The result of this hashing algorithm, called a token, determines the range in which the data falls and therefore the node on which the data will be stored.
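A minimal consistent hash ring can be sketched as follows. Several simplifications are assumptions of this example rather than Cassandra behavior: each node gets a single token derived from its name, and MD5 from the standard library stands in for the hash function (Cassandra's default partitioner uses Murmur3). The Ring class and token_for function are hypothetical names.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Hash a key to a 64-bit token (MD5 is a stand-in chosen for convenience)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    def __init__(self, nodes):
        # One token per node, sorted so the list can be read clockwise.
        self._entries = sorted((token_for(node), node) for node in nodes)
        self._tokens = [token for token, _ in self._entries]

    def owner(self, partition_key: str) -> str:
        """Return the node whose token is the first at or after the key's token."""
        idx = bisect.bisect_left(self._tokens, token_for(partition_key))
        return self._entries[idx % len(self._entries)][1]  # wrap around the ring


ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("Paris"))
```

The useful property is that removing or adding a node only moves the keys in that node's range. Keys owned by the other nodes stay where they are.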

Coordinator node

In Cassandra, each node learns the token assignments of the other nodes through the gossip protocol, so any node can accept a request for any other node's range. Therefore, a client can connect to any node to initiate read or write queries.

The node that receives the request is known as the coordinator and can be any node in the cluster. If a key does not belong to the coordinator's range, the request is forwarded to the replicas responsible for that range.
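The coordinator's routing decision can be illustrated with a toy lookup. The token table below is a hypothetical, hard-coded stand-in for what nodes would learn through gossip, and a single owner per range is assumed (no replication) to keep the sketch short.

```python
import bisect

# Hypothetical token assignments known to every node, in clockwise order.
TOKEN_TABLE = [(100, "node-a"), (200, "node-b"), (300, "node-c")]
TOKENS = [token for token, _ in TOKEN_TABLE]


def owner_of(token: int) -> str:
    """A node owns the range that ends at its token; past the last token, wrap."""
    idx = bisect.bisect_left(TOKENS, token) % len(TOKEN_TABLE)
    return TOKEN_TABLE[idx][1]


def coordinate(receiving_node: str, token: int) -> str:
    """Describe what the coordinator does with a request for the given token."""
    owner = owner_of(token)
    if owner == receiving_node:
        return f"{receiving_node} serves token {token} locally"
    return f"{receiving_node} forwards token {token} to {owner}"


print(coordinate("node-a", 150))  # node-a forwards token 150 to node-b
print(coordinate("node-b", 150))  # node-b serves token 150 locally
```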

Replication

Cassandra replicates data across multiple nodes to ensure high availability. Each node in Cassandra serves as a replica for a specific data range. By spreading multiple copies of the data across different replicas, Cassandra enables other replicas to handle queries for that data range in case one node is unavailable. Two settings affect the replication process: the replication factor and the replication strategy.

The replication factor determines how many nodes will store copies of the same data. In a cluster with a replication factor of 3, each row will be stored on three different nodes.

Each keyspace in Cassandra can have a different replication factor.

In Cassandra, the first replica is assigned to the node that owns the range based on the hash of the partition key. The remaining replicas are then placed on other nodes found by moving clockwise around the ring. Cassandra offers two replication strategies to determine which nodes will be responsible for the replicas:

Simple replication strategy

In this strategy, the first replica is placed on a node determined by the partitioner, and the succeeding replicas are placed on the subsequent nodes in a clockwise manner.

This replication strategy is applicable only for a single data center cluster.
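The clockwise placement described for the simple strategy fits in a few lines. This sketch assumes the nodes are already listed in clockwise token order and that the index of the owning node is known. The function name is hypothetical.

```python
def simple_strategy_replicas(ring_nodes, owner_index, replication_factor):
    """Pick the owner and the next nodes clockwise, wrapping around the ring."""
    count = min(replication_factor, len(ring_nodes))  # cannot exceed the cluster size
    return [ring_nodes[(owner_index + i) % len(ring_nodes)] for i in range(count)]


ring_nodes = ["node-a", "node-b", "node-c", "node-d"]
print(simple_strategy_replicas(ring_nodes, owner_index=2, replication_factor=3))
# ['node-c', 'node-d', 'node-a']
```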

Network topology strategy

To ensure resilience against complete loss of data, additional replicas within the same data center are placed by moving clockwise along the ring until reaching the first node in a different rack. This arrangement helps mitigate the impact of simultaneous failures that typically occur within the same rack due to power, cooling, or network issues.

When it comes to multi-datacenter configurations, you should consider the network topology strategy. This approach allows for the specification of varying replication factors for each data center, enabling control over the number of replicas to be placed in each specific location.
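A simplified sketch of topology-aware placement is shown below. It assumes the rack-aware behavior described in the Cassandra documentation: within each data center, the walk around the ring prefers nodes on racks that do not yet hold a replica, and reuses racks only when there are fewer racks than replicas. The ring layout, node names, and function name are invented for illustration, and real placement logic handles more cases than this.

```python
def network_topology_replicas(ring, start, rf_per_dc):
    """ring: list of (node, datacenter, rack) tuples in clockwise token order."""
    n = len(ring)
    walk = [ring[(start + i) % n] for i in range(n)]  # clockwise from the owner
    replicas = {}
    for dc, rf in rf_per_dc.items():
        chosen, racks_used, skipped = [], set(), []
        for node, node_dc, rack in walk:
            if node_dc != dc:
                continue
            if rack in racks_used:
                skipped.append(node)   # same rack as an earlier replica: keep as fallback
            else:
                chosen.append(node)
                racks_used.add(rack)
        # Fall back to already-used racks only if distinct racks run out.
        replicas[dc] = (chosen + skipped)[:rf]
    return replicas


ring = [
    ("a1", "dc1", "r1"), ("b1", "dc2", "r1"), ("a2", "dc1", "r1"),
    ("b2", "dc2", "r2"), ("a3", "dc1", "r2"), ("b3", "dc2", "r1"),
]
print(network_topology_replicas(ring, start=0, rf_per_dc={"dc1": 2, "dc2": 1}))
# {'dc1': ['a1', 'a3'], 'dc2': ['b1']}
```

In dc1 the second replica skips a2, which shares rack r1 with a1, and lands on a3 in rack r2. Each data center gets its own replica count.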

When to use Cassandra

Cassandra excels in applications that require handling large volumes of data and prioritize data availability over consistency. It is well-suited for:

1. Internet of Things (IoT) Applications: Cassandra is an ideal choice for IoT environments, as it can handle massive amounts of data generated by devices and sensors. Its distributed architecture enables management of geographically dispersed, large-scale data.

2. Time-Series Data: Applications dealing with time-series data, such as metrics, monitoring systems, and telemetry data, benefit from Cassandra's efficient write operations and horizontal scalability. These capabilities are crucial for storing and managing extensive volumes of time-stamped data.

3. Web and Mobile Applications: Cassandra offers high throughput and low-latency data access, making it suitable for web and mobile platforms with large user bases generating significant amounts of data. This includes social media platforms, gaming apps, and e-commerce sites.

4. Real-Time Big Data Analytics: Cassandra supports real-time processing of big data, making it a valuable choice for applications requiring immediate insights from large datasets. Examples include recommendation engines and fraud detection systems.

5. Distributed Data Warehouses: Enterprises with large, distributed datasets can leverage Cassandra as a data warehouse solution. Its ability to replicate data across multiple data centers ensures high availability and disaster recovery.

6. Messaging Systems: Cassandra's high write and read throughput makes it well-suited for messaging systems that handle high data volumes, such as event logging, audit trails, or message queues.

7. Personalization and Content Management Systems: Applications requiring personalized content delivery at scale, such as content management systems, benefit from Cassandra's speed and scalability in delivering customized content to a large number of users simultaneously.

8. Geographically Distributed Applications: Cassandra's support for multiple data centers makes it an excellent choice for applications requiring geographically distributed data. It ensures low-latency data access across different regions and provides high resilience.

When not to use Cassandra

While Apache Cassandra is powerful and scalable, it may not be suitable for every application or use case. It is less suitable for transaction-heavy applications, complex querying, and scenarios that require strong consistency or rapid schema changes. Traditional relational database management systems (RDBMS) or other NoSQL solutions may be more appropriate in such cases. Here are several scenarios where Cassandra might not be the optimal choice:

Small-Scale Projects: Cassandra's complexity and resource requirements can be excessive for small-scale projects or applications with limited datasets. Simpler database solutions may offer a more cost-effective and manageable alternative.
Transactional Systems Requiring ACID Properties: Cassandra does not fully support ACID (Atomicity, Consistency, Isolation, Durability) properties. If your application requires complex transactions typically found in banking or financial systems, a traditional RDBMS might be a better fit.
Join Heavy Queries and Aggregations: If your application heavily relies on joins, subqueries, or complex aggregations, Cassandra may not be the most suitable choice. It is designed for fast writes and reads but not for complex query processing.
Data with Strong Consistency Requirements: Cassandra is eventually consistent by default. Stronger guarantees can be configured, but at a cost in latency and availability, so it may not be suitable for use cases that require strong consistency for every read and write operation.
Low-Latency Reads and Writes on a Single Node or Small Cluster: While Cassandra performs well in multi-node distributed environments, it may not be the optimal choice for single-node or small cluster deployments that require low-latency reads and writes.
Blob Storage: Cassandra is not optimized for storing large binary objects (blobs) such as images or videos. Other storage solutions are better suited for efficiently handling large blobs.
Applications Requiring Ad-hoc Querying: Cassandra's query capabilities are limited compared to full-fledged SQL databases. It is not well-suited for applications that heavily rely on ad-hoc querying and reporting. In Cassandra, the design of tables is closely connected to the way data will be accessed, emphasizing the query patterns rather than solely focusing on the relationships between data entities. This differs from the approach in RDBMS, where schema design is based on normalization principles.
Rapid Schema Evolution: If your application requires frequent changes to the database schema, Cassandra's schema management may be less flexible compared to traditional RDBMS systems or other NoSQL solutions.
Data Warehouse Applications that involve complex queries, joins, and historical data analysis: While Cassandra is well-suited for write-heavy workloads and real-time data access, it may not be the most suitable choice for data warehousing scenarios that require complex queries, joins, and historical data analysis.

Summary

This article provides an overview of Cassandra, a highly scalable and distributed wide-column database. Cassandra is designed to prioritize availability and partition tolerance, making it suitable for applications where consistency is not a critical requirement. It supports high throughput and fast write operations.

The building blocks of Cassandra include columns, rows, keyspaces, clusters, and nodes. Columns represent key-value pairs, rows act as containers for columns referenced by the primary key, keyspaces serve as containers for tables spanning multiple nodes, clusters contain keyspaces, and nodes refer to computer systems running Cassandra instances.

Cassandra stores data in column families, which are containers for columns referenced by a primary key. Data partitioning is achieved through consistent hashing, allowing for reduced latency and increased throughput. The partitioner distributes data across the Consistent Hash ring, and a coordinator node handles read and write queries.

Cassandra provides replication for high availability. Replicas of data are stored on multiple nodes, ensuring that queries can be handled by replicas if a node becomes unavailable. Replication factors and strategies determine the number of replicas and the nodes responsible for them.

While Cassandra offers benefits such as scalability and high availability, it has limitations. It does not support table joins or foreign keys, and without a secondary index or ALLOW FILTERING, queries can filter only on primary key columns in the WHERE clause.

Overall, Cassandra is a powerful database solution for highly scalable applications, particularly those that prioritize availability and partition tolerance over strong consistency.

There are several interesting aspects of Cassandra that I will cover in my next article. Subscribe so you don't miss it!

Cheers!