Should You Be Using NoSQL?

by Heroku, April 15th, 2020

NoSQL attracted a lot of hype a few years back. It was going to solve your scaling, uptime, and speed problems. There were trade-offs, of course, but for a brief moment, seemingly everything we knew about storing and querying data was up for grabs.

So, now that the dust has settled, what’s the outcome? Should you be using NoSQL in your next project? Let’s answer that through the lens of three of NoSQL’s promises: scale, fault tolerance, and different data models.

Scale

NoSQL scale comes in three flavours:

how many concurrent operations you need to handle
how much data you’re storing
the size of the individual assets you’re storing

In most cases, when people talk about scale and NoSQL databases they mean the ability to deal with unpredictable usage patterns. Some, but not all, NoSQL databases are good at dealing with peaks and troughs in demand because they run across multiple nodes. When things are busy, add more nodes. When they’re quieter, spin a few down.

For planned scaling to handle a predictable increase in demand, the same easy clustering comes into play. Add more nodes and you can handle both more demand and more data.

Such scaling comes at a cost, though. Simple scaling models suit simple data models. One of the things that makes scaling relational databases harder, though not at all impossible, is the infrastructure that supports querying. Databases such as Amazon’s DynamoDB or the open source Riak make it super-easy to scale by giving up, or at least relaxing, three things you’d get in a relational database: rich indexing, normalization, and strict consistency.

Couchbase, a document database, boasts SQL-like querying while spreading the workload across multiple nodes. But it can’t easily solve the problem of denormalization. That is, there’s a good chance you’ll have more than one copy of the same data in your database, and those copies could get out of sync even if you are very diligent.
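To make the denormalization risk concrete, here is a minimal sketch in plain Python. It does not use Couchbase or any real database: the `customers` and `orders` dictionaries, the field names, and the `update_email` helper are all hypothetical, and simply stand in for documents that each carry their own copy of a customer’s email address.

```python
# Hypothetical document-store contents: the customer's email is copied
# into every order document instead of living in one place.
customers = {"c1": {"name": "Ada", "email": "ada@old.example"}}
orders = {
    "o1": {"customer_id": "c1", "customer_email": "ada@old.example", "total": 40},
    "o2": {"customer_id": "c1", "customer_email": "ada@old.example", "total": 15},
}


def update_email(customer_id, new_email, order_ids_to_fix):
    """Update the customer and only the order copies the caller knows about."""
    customers[customer_id]["email"] = new_email
    for order_id in order_ids_to_fix:
        orders[order_id]["customer_email"] = new_email


def stale_orders(customer_id):
    """Return ids of orders whose copied email no longer matches the customer."""
    current = customers[customer_id]["email"]
    return sorted(
        order_id
        for order_id, order in orders.items()
        if order["customer_id"] == customer_id
        and order["customer_email"] != current
    )


# The calling code only remembers order o1, so the copy in o2 is left behind.
update_email("c1", "ada@new.example", ["o1"])
print(stale_orders("c1"))  # ['o2']
```

In a normalized relational schema the email would live in one row and orders would reference it by key, so there would be no second copy to forget.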

If you need to store large assets, you should probably have a conversation about why you’re not putting the files in cloud storage and keeping only their metadata in your database.

Some NoSQL databases can help you scale easily but at the cost of queryability, data structure, and data integrity. In the early stages of a new project, speed of development and ease of maintenance are probably more valuable than being able to scale at a moment’s notice.

And, besides, if you do hit the limits of a relational database then perhaps you should turn to NoSQL as an aid rather than a replacement. Use Redis to take care of ephemeral data, for example, while using a relational database as your main data store.
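One way to picture that split is the sketch below. It is an illustration under stated assumptions, not a recommended implementation: SQLite (from Python’s standard library) stands in for the relational main store, and the hypothetical `EphemeralStore` class is an in-memory stand-in for the kind of expiring key-value storage Redis provides. The table, key names, and 30-minute lifetime are invented for the example.

```python
import sqlite3
import time


class EphemeralStore:
    """In-memory key-value store with per-key expiry (a stand-in for Redis)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}

    def set(self, key, value, ttl_seconds):
        # Remember the value together with the moment it stops being valid.
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: drop it and report a miss
            return None
        return value


# Durable data lives in the relational store (SQLite standing in here).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)")
db.execute("INSERT INTO users (id, email) VALUES (1, 'ada@example.com')")
db.commit()

# Short-lived session data lives in the ephemeral store.
sessions = EphemeralStore()
sessions.set("session:abc123", 1, ttl_seconds=1800)

# Resolve the session to a user id, then read the durable record.
user_id = sessions.get("session:abc123")
row = db.execute("SELECT email FROM users WHERE id = ?", (user_id,)).fetchone()
print(row[0])  # ada@example.com
```

Losing the session store means users log in again. Losing the `users` table is a much bigger problem, which is why the two kinds of data are kept in different places.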

Fault tolerance

Uptime was the flipside of NoSQL’s scalability promise. Most NoSQL databases scale by distributing multiple copies of data across a cluster of nodes. If a single node fails, those remaining can step in.

This can be a nightmare for consistency, and there are computer scientists dedicating their academic lives to solving the problem through techniques such as conflict-free replicated data types (CRDTs). Essentially, the question comes down to this: if two copies of the same data item are updated at more or less the same time, which one is correct?
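As a small illustration of the CRDT idea, here is a sketch of one of the simplest such types, a grow-only counter (often called a G-Counter). Each node increments only its own slot, and replicas merge by taking the per-node maximum, so concurrent updates never conflict. The class and node names are made up for this example, and real systems layer networking and persistence on top.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node, merged by per-node maximum."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        # A node only ever writes to its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Taking the maximum per node makes merging commutative,
        # associative, and idempotent, so replicas converge.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


# Two replicas accept writes independently, with no coordination.
a = GCounter("node-a")
b = GCounter("node-b")
a.increment(3)
b.increment(2)

# They later exchange state in either order and reach the same answer.
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # 5 5
```

The catch is that this only works because a counter that never decreases has an obvious merge rule. Arbitrary records, such as a customer’s address edited in two places at once, do not.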

Now, relational databases tend to solve this problem by having just one writable instance of each table. You can have multiple read-only replicas, meaning that read fault tolerance is pretty much solved, but write fault tolerance depends on a single database server being available.

The question to balance is: how much pain will I get from an unlikely situation such as my Postgres primary going offline versus having to deal with data inconsistencies?

Semi-structured data

We haven’t mentioned MongoDB yet. Mongo’s great promise is that it is JSON native. Before it came MarkLogic, an XML data store. Then there’s Cassandra, a wide-column store that is often used for time series data. Or Neo4j, which works with a graph of connections between data points. At the other end is DynamoDB, a key-value store that will blindly store anything you throw at it with no special understanding of what it is.

Database systems like MongoDB excel at doing away with the mismatch between the data we work with and the needs of the relational model. We send JSON, we receive JSON, why don’t we store JSON?

Or what about all that big data? Maybe you’re getting readings from IoT devices or audience data from an advertising network.

Again, it comes down to whether the trade-offs are worth it. Postgres can store and query JSON natively, so is it worth switching to something like MongoDB just to look after that JSON? If you do have large volumes of unstructured data, then consider something like Redis or DynamoDB just for that data, with a more traditional relational database taking care of the rest.
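The sketch below shows the general idea of keeping JSON inside a relational table and filtering on a field within it. To stay runnable without a server it uses SQLite and its `json_extract` function rather than Postgres, and it assumes your Python build’s SQLite includes JSON support (most recent ones do). In Postgres you would more likely use a `jsonb` column and the `->>` operator. The table and field names are invented for the example.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
# The payload column holds raw JSON text alongside ordinary relational columns.
db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)")

readings = [
    {"device": "sensor-1", "temp_c": 21.5},
    {"device": "sensor-2", "temp_c": 30.1},
    {"device": "sensor-1", "temp_c": 22.0},
]
db.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [(json.dumps(reading),) for reading in readings],
)

# Filter and project on fields inside the JSON document using plain SQL.
rows = db.execute(
    "SELECT json_extract(payload, '$.temp_c') FROM events "
    "WHERE json_extract(payload, '$.device') = ? ORDER BY id",
    ("sensor-1",),
).fetchall()
temps = [row[0] for row in rows]
print(temps)  # [21.5, 22.0]
```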

Maybe yes but not right away

Our ability to build and, most importantly, deploy software is enriched by the great variety that the NoSQL revolution brought to data stores. However, let’s be honest, most development projects are essentially CRUD-shaped.

The reliable querying, data consistency, and familiarity of relational databases are more than “nice to haves”. They’re essential in situations where delivering value to end users is the goal. For our CRUD data patterns, Postgres, MySQL, and SQL Server are hard to beat.

NoSQL comes into its own when we start to hit the limitations of relational databases. If you’re taking in location data from a fleet of trucks every few seconds, do you really need to wait for Postgres to update its indexes and write to disk before you can close the API call? Or would it be better to pipe that inbound data into Redis for asynchronous processing? If you’re recording how many people have liked a video on your social network, maybe a simple key-value store such as RocksDB would be more efficient than throwing the might of SQL Server at the problem.
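The truck example can be sketched roughly as follows. This is a simplified, single-process illustration: Python’s `queue.Queue` stands in for Redis as the fast inbound buffer, SQLite stands in for Postgres, and `handle_api_call`, `drain`, and the `positions` table are hypothetical names chosen for the example.

```python
import queue
import sqlite3
import threading

inbound = queue.Queue()  # stand-in for a Redis list or stream
STOP = object()          # sentinel telling the worker to finish

# check_same_thread=False lets the worker thread use this connection;
# the main thread only touches it again after the worker has finished.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE positions (truck_id TEXT, lat REAL, lon REAL)")


def handle_api_call(truck_id, lat, lon):
    """Accept a reading and return immediately, without touching the database."""
    inbound.put((truck_id, lat, lon))
    return "accepted"


def drain(batch_size=100):
    """Move queued readings into the database in batches, off the request path."""
    batch = []
    while True:
        item = inbound.get()
        if item is STOP:
            break
        batch.append(item)
        if len(batch) >= batch_size:
            db.executemany("INSERT INTO positions VALUES (?, ?, ?)", batch)
            db.commit()
            batch = []
    if batch:  # flush whatever is left over
        db.executemany("INSERT INTO positions VALUES (?, ?, ?)", batch)
        db.commit()


worker = threading.Thread(target=drain)
worker.start()

for i in range(250):
    handle_api_call(f"truck-{i % 5}", 51.5 + i * 0.001, -0.12)

inbound.put(STOP)
worker.join()

count = db.execute("SELECT COUNT(*) FROM positions").fetchone()[0]
print(count)  # 250
```

The trade-off is durability: anything still sitting in the buffer when the process dies is lost, which may be acceptable for a position reading that will be superseded a few seconds later.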

Perhaps one of the failings of the NoSQL movement was to suggest that non-relational databases were a replacement for, rather than a complement to, relational data stores. If you’re embarking on a new project, then there’s every chance that a NoSQL database will be useful at some point in the journey, but you’re likely to get more done if you have a database system such as Postgres at the core.