Apache Cassandra—the immensely popular open-source, distributed NoSQL database management system—is approaching the general availability of its 5.0 version (the 5.0 beta is already available).
Whether you’re an existing Cassandra user or considering a migration to the open-source database with the release of this new version, there’s a lot to be excited about with 5.0. I spoke with Mo Ansari, a Product Manager at Instaclustr by NetApp, about what developers, engineers, DBAs, and other users should know about Cassandra 5.0—from database upgrades to new use cases to open-source community support.
Mo knows Cassandra particularly well (Instaclustr has long had a managed platform around the fully open-source version of the database, which Mo works on).
Here’s what Mo had to say about Cassandra 5.0:
Cassandra 5.0 adds trie-based (prefix tree) memtables and SSTables, which gives the new version of the open source NoSQL database significant potential for improved performance and memory efficiency.
Read and modification operations, along with Cassandra’s ability to size structures correctly to the data, benefit from these storage formats, which make use of tries and byte-comparable representations of database keys.
Trie memtables and trie-indexed SSTables also reduce memory management overhead and garbage collection pressure, delivering even more benefits for Cassandra users (especially those running the database at scale).
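To make the idea of tries over byte-comparable keys more concrete, here is a minimal Python sketch. It is only an illustration of the general technique, not Cassandra's actual implementation: the `Trie` class, the `encode_key` helper, and the `user:` key layout are all hypothetical. It shows two properties: keys with a common prefix share nodes, and fixed-width big-endian encoding makes byte order match numeric order.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # next byte (int) -> child TrieNode
        self.terminal = False  # True if a key ends at this node
        self.value = None      # payload stored for that key


class Trie:
    """Toy byte-wise trie (illustrative only, not Cassandra's format)."""

    def __init__(self):
        self.root = TrieNode()
        self.node_count = 1  # counts the root

    def insert(self, key: bytes, value):
        node = self.root
        for b in key:
            if b not in node.children:
                node.children[b] = TrieNode()
                self.node_count += 1
            node = node.children[b]
        node.terminal = True
        node.value = value

    def get(self, key: bytes):
        node = self.root
        for b in key:
            node = node.children.get(b)
            if node is None:
                return None
        return node.value if node.terminal else None

    def items(self, node=None, prefix=b""):
        """Yield (key, value) pairs in byte-lexicographic key order."""
        if node is None:
            node = self.root
        if node.terminal:
            yield prefix, node.value
        for b in sorted(node.children):
            yield from self.items(node.children[b], prefix + bytes([b]))


def encode_key(user_id: int) -> bytes:
    # Hypothetical byte-comparable encoding: a fixed-width big-endian
    # integer sorts the same way as the number it encodes.
    return b"user:" + user_id.to_bytes(4, "big")


trie = Trie()
for uid in (300, 2, 70000):
    trie.insert(encode_key(uid), f"row-{uid}")

ordered_values = [value for _, value in trie.items()]
```

In this toy example the three 9-byte keys need far fewer nodes than 27 separate bytes because the shared `user:` prefix and leading zero bytes are stored once, and iterating the trie returns rows in numeric order without a separate comparison function.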
Developers will want to get their hands on Cassandra 5.0’s vector support, including Vector Search for locating content in large datasets. They can also use new CQL vector functions and a new vector data type designed for saving and retrieving embedding vectors. These improvements position Cassandra 5.0 well as a data layer for AI/ML application development, since many ML models work by comparing similarities among data and putting data connections in context.
Embedding vectors power similarity comparisons: each one is an array of floating-point numbers that represents an object, so the distance between two vectors indicates how similar the objects are to each other. With Cassandra 5.0, developers now have a powerful (and open source) database with that specific functionality that’s so important to AI/ML applications.
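As a rough illustration of how embedding vectors support similarity comparisons, the following Python sketch computes cosine similarity between small arrays of floats and ranks items against a query. The three-dimensional vectors and the item names are made up for the example; real embeddings come from a model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def nearest(query, embeddings, k=1):
    """Return the names of the k embeddings most similar to the query."""
    ranked = sorted(
        embeddings,
        key=lambda name: cosine_similarity(query, embeddings[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy embeddings, invented purely for illustration.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.85, 0.15, 0.05],
    "truck": [0.0, 0.2, 0.95],
}

closest_to_cat = nearest(embeddings["cat"], embeddings, k=2)
```

This brute-force ranking compares the query with every stored vector. A vector index exists precisely to avoid that full scan on large datasets by finding approximate nearest neighbors instead.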
Developers should also be excited about storage-attached indexing (SAI), which makes secondary indexes far more usable and efficient. On a Cassandra table, developers can now easily create one or more secondary indexes, each based on a single column of their choice. The result is massively scalable, globally distributed indexing with high search throughput (useful with Vector Search), modular extensibility (also demonstrated by Vector Search), and indexing functionality that captures semantics via queries and content (including large documents and images).
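The sketch below shows, in hedged form, what the CQL for a vector column, storage-attached indexes, and an approximate-nearest-neighbor query might look like. It only builds statement strings in Python and does not connect to a cluster. The keyspace `demo`, the table `documents`, the index names, and the helper functions are hypothetical, and the statement syntax should be checked against the Cassandra 5.0 documentation for the release you run.

```python
def vector_literal(values):
    """Format a sequence of numbers as a CQL vector literal, e.g. [0.1, 0.2]."""
    return "[" + ", ".join(repr(float(v)) for v in values) + "]"


def build_schema_statements(dimensions):
    """Return CQL strings for a table with a vector column and two SAI indexes.

    Keyspace, table, and index names are placeholders for illustration.
    """
    create_table = (
        "CREATE TABLE IF NOT EXISTS demo.documents ("
        "id int PRIMARY KEY, "
        "category text, "
        f"embedding vector<float, {dimensions}>)"
    )
    # One storage-attached index per column, each on a single column.
    create_category_index = (
        "CREATE INDEX IF NOT EXISTS documents_category_idx "
        "ON demo.documents (category) USING 'sai'"
    )
    create_embedding_index = (
        "CREATE INDEX IF NOT EXISTS documents_embedding_idx "
        "ON demo.documents (embedding) USING 'sai'"
    )
    return [create_table, create_category_index, create_embedding_index]


def ann_query(query_vector, limit):
    """Return a CQL string asking for the rows nearest to query_vector."""
    return (
        "SELECT id, category FROM demo.documents "
        f"ORDER BY embedding ANN OF {vector_literal(query_vector)} "
        f"LIMIT {limit}"
    )


statements = build_schema_statements(3)
query = ann_query([0.1, 0.2, 0.3], 5)
```

In a real application these strings would be passed to a Cassandra driver session, with the vector dimension matching the embedding model in use. Keeping statement construction separate from execution, as here, is just a convenience for the example.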
Last but not least, developers using Cassandra 5.0 will like the ability to build their own user-defined functions, and to utilize the range of useful new native CQL aggregation and math functions available with the new version.
Cassandra 5.0 is bringing significant enhancements to the developer experience. The new features open the door to a vast number of use cases and make for a more balanced developer experience. The new version focuses on ease of use, performance, and security, and includes features such as storage-attached indexes that make queries on non-primary key columns more efficient. This reduces the complexity and overhead associated with secondary indexes.
Additionally, support for ACID transactions is anticipated in version 5.1, which will bring SQL-like functionality to Cassandra and make it more approachable for developers familiar with relational databases. Furthermore, the new version has more guardrails and enhanced tooling, including a new virtual table for viewing system logs, which will aid in the development process.
Overall, Cassandra 5.0 is packed with features that will empower developers to work faster, create more efficient queries, and manage data with increased security and control!
Yes, with a new vector data type and storage-attached indexes that improve the performance of vector queries, Apache Cassandra is positioning itself as a contender in the AI market. Over the years, Cassandra’s exceptional write throughput and ability to handle large volumes of data made it ideal for several use cases and applications, such as scalable web applications, messaging systems, and event logging and monitoring systems.
However, Cassandra 5.0 is going to significantly alter the landscape. Beyond the use cases Cassandra has suited in the past, version 5.0 will extend its applicability to more complex, transactional, and analytical applications, bridging the gap between NoSQL flexibility and the rigorous demands of modern data-intensive applications.
I see several new use cases, such as analytics and machine learning, niche financial services requiring ACID transactions, applications requiring complex querying capabilities, etc. As the Apache Cassandra project calls it, “moving towards an AI-driven future.”
Apache Cassandra has always been designed with distributed, fault-tolerant computing and scaling in mind, and it has always been a good choice for cloud-based workloads. While not introducing specific features under the “cloud-native” label, Cassandra 5.0 continues to support and enhance deployments in cloud and containerized environments through its scalable, resilient design and its community-driven projects.
Features included in the 4.1 version release have already paved the way for a more cloud-native future for Cassandra. The K8ssandra project is another great example of a commitment to a cloud-native future.
The transition to Cassandra 5.0 does bring a learning curve, but it’s designed to be as smooth as possible. I anticipate gradual adoption as developers familiarize themselves with new features and improvements. I expect the adoption will be pretty quick in dev and testing environments; however, it will be gradual in production environments.
Given the focus on cost reduction and AI readiness, there are many more reasons, use cases, and motivations driving adoption. Given its range of capabilities, my guess is that Apache Cassandra will become the preferred choice for AI workloads.
For those looking to migrate, my advice is to leverage the extensive resources available on the Apache website, including its documentation and community forums. Planet Cassandra is a fantastic place to look at use cases. I would also advise developers to engage with the community for shared insights and best practices through mailing lists, project town halls, and other events such as contributor meetings.
Apache Cassandra has a large and vibrant open-source community, which has played a crucial role in the development of 5.0. Members have been actively involved in proposing new features and improving existing ones based on real-world use cases and future needs. Feedback was collected through various channels, such as mailing lists, Jira tickets, and Cassandra Enhancement Proposals (CEPs), and it has been instrumental in shaping the release to meet the evolving needs of users.
The community has been actively contributing to many more features that are currently in the draft stage and will bring many more benefits to Cassandra in the future.