In a previous post, we discussed some of the features of MapR-DB that make this distributed database especially interesting. In this post, we continue that effort by presenting a specific use case.

The Problem

The problem to be solved can be described as follows. A series of messages is coming through a stream. Each message has an _id_ and a _count_. For each _id_, we must update the existing _count_ in the database by incrementing its value by the _count_ coming in the stream.

The following image shows an example of the problem.

[Image: The problem in an image.]

There are various ways to solve this problem. One is to read the current state of a given _id_ stored in the database, update its _count_ using the values on the stream, and finally save the updated values back to the database.

Let’s see this process in detail.

[Image: Loading, merging, and saving.]

As we can see in the image above, in order to update the state in the database, we first need to load every piece of data from it, join it with the incoming stream to calculate the new state, and finally save the new state back to persistent storage.

In general, we have to follow this same process regardless of the database technology we choose, whether it is MapR-DB or any other persistent store.

The following code shows how we could implement this idea using Apache Spark integrated with MapR-ES (the MapR Streams technology) and MapR-DB.

It is important to notice that on every streaming interval, we load from MapR-DB (or any other database, for that matter) and merge the loaded data frame with the stream. Then we save the newly calculated state back to the database.

This process makes total sense, and in most databases out there, there is no way around it. However, these operations are costly to execute every time we receive data on the stream.

MapR-DB Mutations

When others fall short, MapR-DB shines.
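For reference, the load, merge, and save cycle described in the previous section can be sketched in a few lines. This is only an illustration under stated assumptions: a plain Python dictionary stands in for the database, a list of tuples stands in for one batch of the stream, and the function name load_merge_save is hypothetical. It does not use Spark, MapR-ES, or MapR-DB.

```python
from collections import Counter

# Hypothetical in-memory stand-in for the database table: _id -> count.
database = {"a": 10, "b": 5}


def load_merge_save(db, batch):
    """Apply one batch of (_id, count) messages using load, merge, save."""
    # Aggregate the incoming batch so each _id appears once.
    incoming = Counter()
    for _id, count in batch:
        incoming[_id] += count

    # Load: read the current state of every _id touched by the batch.
    loaded = {_id: db.get(_id, 0) for _id in incoming}

    # Merge: combine the loaded state with the incoming counts.
    merged = {_id: loaded[_id] + incoming[_id] for _id in incoming}

    # Save: write the new state back to the store.
    db.update(merged)
    return merged


# One batch of the stream (assumed shape: a list of (_id, count) pairs).
batch = [("a", 1), ("b", 2), ("a", 3), ("c", 7)]
new_state = load_merge_save(database, batch)
print(database)
```

Note that every batch pays for a read of the existing state before it can write anything, which is the cost the rest of this section sets out to avoid.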
MapR-DB is able to incrementally update documents without the need to load them first. More specifically, it is possible to update only some of the fields of a document without touching anything else at the document level.

A reasonable question that quickly arises is: are these updates applied atomically, given the distributed nature of MapR-DB? The answer is YES.

Let’s first write some code to prove the last statement, and then we will move on to solving the problem in question using this concept.

We can start with the following code snippet.

As we can see, the _run_ function will increment the _count_ for a given _id_, specifically, _times_ times.

Based on this code, we can create another function that does the same in parallel by creating a number of threads and executing _UpdateSameId.run_ on each thread.

We can run the above code in the following way.

This will print out the final state of each document in MapR-DB, and from that, we can confirm that updates are applied atomically. The entire code is part of the Reactor project that you can find here.

As we can see, even when running in a multi-threaded or multi-processor environment, MapR-DB guarantees that values are updated consistently (atomically). We could write similar code using Apache Spark in order to increase parallelism, but the results would be the same; we tried it out.

Based on these findings, we can improve our original app to solve the problem we first stated. Let’s start by removing the parts that first load from MapR-DB and replacing them with updates instead.

Let’s review a few important changes from our original app.

First, there is no need to load the current state from the database at all. Instead, we create the necessary mutations and apply them in parallel to MapR-DB.

Secondly, we have significantly reduced the number of tasks that Spark has to execute, increasing overall performance while reducing the load on our database.
Thirdly, the code is simpler, more elegant, and easier to understand, which improves maintainability and makes other people’s lives easier.

Finally, we create one connection per partition, so we avoid any Spark serialization issues while increasing parallelism, since each partition’s operations run at the same time (Spark magic at work here).

After the process runs, we can rest assured that our data is correctly and efficiently updated by our latest approach.

Conclusions

As discussed before, MapR-DB is a NoSQL database with very interesting features that are worth exploring. These features not only make it a top pick for highly performant, distributed workloads but also make our lives easier as problem solvers.

Despite the intrinsically distributed nature of MapR-DB, we can abstract ourselves from many of the issues that complex systems like this one have and use MapR-DB with ease to solve the most complicated business problems in the market.

Please read Interacting with MapR-DB for a more comprehensive view of how to use MapR-DB and the tooling around it.
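As a closing illustration of the atomicity argument made in the MapR-DB Mutations section, the following sketch simulates many threads incrementing the same document without ever loading it. Everything here is an assumption made for illustration: InMemoryStore is a hypothetical stand-in that uses a lock to mimic a store with atomic increments, and run and run_in_parallel only mirror the shape of the functions described in the post. It says nothing about how MapR-DB implements atomicity and does not use its API.

```python
import threading


class InMemoryStore:
    """Hypothetical stand-in for a document store with atomic increments."""

    def __init__(self):
        self._docs = {}
        self._lock = threading.Lock()

    def increment(self, _id, field, by=1):
        # The read-modify-write happens inside the store, under a lock,
        # so callers never load the document themselves.
        with self._lock:
            doc = self._docs.setdefault(_id, {"_id": _id})
            doc[field] = doc.get(field, 0) + by

    def find_by_id(self, _id):
        # Return a copy so callers cannot mutate the stored document.
        with self._lock:
            return dict(self._docs.get(_id, {}))


def run(store, _id, times):
    """Increment the count of one document `times` times, one by one."""
    for _ in range(times):
        store.increment(_id, "count", 1)


def run_in_parallel(store, _id, times, n_threads):
    """Execute `run` against the same document from several threads."""
    threads = [
        threading.Thread(target=run, args=(store, _id, times))
        for _ in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


store = InMemoryStore()
run_in_parallel(store, "doc-1", times=1000, n_threads=8)
final_doc = store.find_by_id("doc-1")
print(final_doc)
```

If increments are applied atomically, the final count equals the number of threads multiplied by the number of increments per thread; a lost update would show up as a smaller value.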

MapR-DB Atomic Document Updates
