. SQL is great, but sometimes you may need something else By and large, the prevalent type of data that data engineers deal with on a regular basis is . Tables in a data warehouse, transactional data in (OLTP) databases — they can all be . But does it mean that NoSQL is irrelevant for data engineering? In this article, we’ll investigate for which . relational Online Transactional Processing queried and accessed using SQL use cases data engineers may need to interact with NoSQL data stores The reasons behind NoSQL These days, data is stored at such a ( ) that many relational database systems can’t keep up. Historically, this was the main reason why large tech companies developed their own NoSQL solutions to mitigate those issues. In 2006, Google published their paper which laid the foundations for the open-source NoSQL data store and GCP’s . In 2007, Amazon offered their alternative solution with paper. velocity, volume, and variety in short: Big Data Bigtable HBase Cloud Bigtable Dynamo From that time on, other NoSQL distributed database systems kept emerging. All of them have been mainly trying to mitigate the issue of scale ( ) that is hard to achieve with the traditional vertically-scalable RDBMS. Instead of scaling vertically , they operate on horizontally-scalable distributed clusters allowing them to increase the capacity by simply adding more nodes to the cluster. data volume (buying more RAM and CPU for your single server) Apart from scalability and addressing the sheer data volume, many distributed NoSQL database systems address the challenge of data coming in many , i.e., , which is difficult to achieve in relational databases. I’ve seen many creative solutions in my career when people tried to serialize and blob non-relational data to a relational database. Still, such self-built systems tend to be hard to operate and manage at scale. with such as MongoDB allow much more flexibility in that regard. different formats (highly nested JSON-like structure, XML, time-series, images, videos, audio…) variety Document databases schema-on-read Finally, with the of data, continuously streaming in real-time, many NoSQL databases help address that issue. For instance, handles data streaming from a variety of IoT devices in a very intuitive and easy way. velocity Amazon Timestream Why do I need to know NoSQL data stores as a data engineer? One of a data engineer responsibilities is and data from various sources and providing it to data consumers consistently and reliably so that this data can be used for analytics and building data products. This means building data pipelines that ( ) and store it in a data lake or data warehouse. Due to the above-mentioned data , data lakes are particularly useful for storing data from NoSQL data stores. integrating consolidating pull data from among others NoSQL databases variety NoSQL use cases for data engineering Now that we know why NoSQL is important for data engineering let’s look at the typical use cases that we may encounter. 1. Analysis of log streams These days, many applications are sending logs to Elasticsearch and can be analyzed and visualized using Kibana. Thanks to the dynamic schema and indexing, Elaticsearch is a handy NoSQL data store in your data engineering toolbox, especially for container monitoring and log analytics. 2. Extracting time-series data from IoT devices and real-time applications Relational database systems are usually not intended to be used with thousands of open connections used simultaneously. Imagine that you manage a large fleet of servers or IoT devices that keep streaming metrics such as CPU utilization to some centralized data store so that this information can be used for a real-time dashboard and for anomaly detection analysis. If you would use a relational database for that, you would likely encounter connection issues ( ) since RDBMS is not designed for thousands of short-lived connections. In contrast, NoSQL distributed data stores such as can handle that easily. (showing the health of the system) leading to some data loss AWS Timestream 3. Extracting data from NoSQL backend systems NoSQL databases are used as a backend data store for so many applications, ranging from e-commerce, content management platforms (blogs), mobile apps, websites, web analytics, clickstream, shopping carts, and many more. All of this data can be used for analytics once you established data pipelines that extract or stream this data into your data lake or data warehouse. 4. Caching data for fast retrieval If you’ve ever built a dashboard with some BI tools such as Tableau, you probably preferred working with extracts rather than with live data, provided that the dataset was fairly large. Making connections to relational databases and waiting till they process and retrieve the data is often too slow to build really performant dashboards ( ). For this purpose, in-memory NoSQL data stores such as are great. If you use those to that you need for your dashboards, you can make sure that you experience no lag and provide a good user experience. that don’t hang when you apply a filter Redis cache recent data Demo: extracting data from NoSQL Let’s demonstrate , one of the most popular ( ) NoSQL data stores on AWS. I created a table called demo with partition key index. This index will be equivalent to my Pandas dataframe’s index. how we can extract data from DynamoDB serverless Now we can use a combination of boto3 and awswrangler to load a sample dataset. Then, starting from line 34, we build a simple ETL to retrieve data that has been inserted since yesterday. Additionally, we could use the PartiQL query editor in the management console to validate that our ETL returns correct data. DynamoDB PartiQL query editor — image by author How to keep an eye on your DynamoDB resources? To monitor the read and write capacity units used by our table, it’s helpful to make use of an observability platform that provides you more details about your infrastructure and your applications’ health. For example, with you can track any that occurred when reading or writing your data, the average of your read and write operations, whether a continuous is enabled for a quick , and how many are consumed by your resources. Dashbird errors latency backup point-in-time recovery read (RCU) and write capacity units (WCU) The “demo” DynamoDB table in Dashbird — image by author The average latency in your read and write operations observed using Dashbird — image by author Drawbacks of NoSQL databases The previous sections demonstrated the use cases where NoSQL data stores shine in data engineering. But now to the drawbacks. The major drawback of NoSQL data stores is that , the developers need to some proprietary vocabulary, API, or interface to access data. Timescale DB put it nicely on their blog: without a full-fledged SQL interface learn ( again ) “SQL is back. Not just because writing glue code to kludge together NoSQL tools is annoying. Not just because retraining workforces to learn a myriad of new languages is hard. Not just because standards can be a good thing. But also because the world is filled with data. […] Either we can live in a world of brittle systems and a million interfaces. Or we can continue to embrace SQL.” — Source Even though many NoSQL data stores offer SQL interfaces on top, they will where you can create complex queries ( ). never be as powerful as a full-fledged relational database not that complex queries are always a good thing So purely from a , it’s best to those NoSQL systems . Use them when there is a use case that clearly justifies the need for NoSQL’s scale and flexibility coming from schema-on-read. data engineering perspective work with rather than against them If data comes in a format that is not suitable to be stored in a data warehouse, we can always extract it, store it in a data lake ( ), and . These days, many tools allow us to make sense of data stored in a data lake. When leveraging serverless SQL query engines such as Presto ( ) or BigQuery, you can analyze even semi-structured nested data. ELT-approach provide it for analytics in its native raw format or AWS version: Amazon Athena Conclusion To answer the question from the title: yes, NoSQL data stores are important for data engineering. Even though they often lack a friendly SQL-like interface to retrieve the information you need, there are still very beneficial in various use cases. Given that for data retrieval ( ), we should expect that the trend will continue. It’s possible that one day, we may be able to write SQL queries to retrieve data the same way regardless of the backend data store under the hood. In fact, if we look at federated queries with Presto, this future may not be that far away. more and more vendors realize the need for a SQL interface such as PartiQL from AWS, the “good old” Hive, Spark SQL, CQL in Cassandra, SQL queries in Timestream Previously published at https://dashbird.io/blog/nosql-database-data-engineering/