Logging in Apache Spark comes very easily, since Spark offers access to a log object out of the box; only some configuration setup needs to be done. In a previous post we looked at how to do this while pointing out some problems that may arise. However, the solution presented there can become a problem the moment we want to collect the logs, since they are distributed across the entire cluster. Even if we use the YARN log aggregation capabilities, there will be contention that might affect performance or, even worse, in some cases we could end up with interleaved log entries that corrupt the very nature of a log: the time-ordered property it should present.

In order to solve these problems, a different approach needs to be taken: a functional one.

The Monad Writer

I do not intend to go over the details of monads or, in this particular case, the writer monad. If you are interested in learning more, take a look at this link on functors, applicatives, and monads, which is very informative on the topic.

Just to put things in context, the writer monad is a container that holds the current value of a computation in addition to the history (log) of that value, that is, the set of transformations applied to it. Because of its monadic properties, the writer lets us chain functional transformations, and we will soon see how everything sticks together.

A Simplistic Log

A simplistic approach is to log around our transformations. The only thing to note is that the logging actually happens on the Spark driver, so we don't have synchronization or contention problems. Everything starts to get complicated once we start distributing our computations: the same logging calls placed inside distributed transformations won't work (read the previous post to know why).

A solution to this was also presented in the previous post, but it requires extra work to manage the logs. Once we start logging on each node of the cluster, we need to go to each node and collect each log file in order to make sense of whatever is in the logs. Hopefully you are using some kind of tool to help you with this task, such as Splunk, Datadog, etc., yet you still need to know a lot of things to get those logs into your system.

Our Data Set

Our data set is a collection of instances of the class Person that is going to be transformed while keeping a unified log of the operations on the data set. Let's suppose we want our data set to be loaded, then filtered to the people whose age is less than 20 years, and finally have each person's name extracted. It is a very silly example, but it will demonstrate how the logs are produced. You could replace these computations with your own; the idea of building a unified log remains the same.

Getting the Writer

We are going to use the Typelevel Cats library to import the writer monad. To do this, we add the Cats dependency line to our build.sbt file.

Playing with our data

Now, let's define the transformations we are going to use. First, let's load the data. Here the loading operation is built with a small ~> operation, defined via an implicit conversion (see the sketch below). If you look closely, our loading operation is not returning an RDD; in fact, it returns the monad writer, which keeps track of the logs.

Next, let's define the filter that we want to apply over the collection of users. Again, we apply the same ~> function to keep track of this transformation. Lastly, we define the mapping, which follows the same pattern we just saw.

Putting it together

So far we have only defined our transformations, but we need to stick them together. Scala's for comprehension is a very convenient way to work with monadic structures. Let's see how.
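The original snippets are not reproduced on this page, so here is a minimal sketch of the whole chain under some assumptions: cats-core on the classpath (for example, `libraryDependencies += "org.typelevel" %% "cats-core" % "2.10.0"` in build.sbt; the original post may have used an older Cats artifact), and illustrative choices for the Person fields, the sample data, the log messages, and the ~> helper.

```scala
import cats.data.Writer
import cats.implicits._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object UnifiedLog {

  // Illustrative data set; the actual fields of Person are an assumption.
  final case class Person(name: String, age: Int)

  // Hypothetical ~> helper: pairs any value with one log line inside the writer.
  implicit class Loggable[A](val value: A) {
    def ~>(msg: String): Writer[List[String], A] = Writer(List(msg), value)
  }

  val spark: SparkSession =
    SparkSession.builder().master("local[*]").appName("unified-log").getOrCreate()

  // Each step returns Writer[List[String], RDD[...]] instead of a bare RDD.
  def load(): Writer[List[String], RDD[Person]] =
    spark.sparkContext.parallelize(Seq(
      Person("Ana", 17), Person("Bob", 42), Person("Eve", 19)
    )) ~> "loaded the people data set"

  def filterByAge(people: RDD[Person]): Writer[List[String], RDD[Person]] =
    people.filter(_.age < 20) ~> "kept people whose age is less than 20"

  def names(people: RDD[Person]): Writer[List[String], RDD[String]] =
    people.map(_.name) ~> "extracted the names"

  // Putting it together: the for comprehension chains the writers,
  // concatenating their logs while the RDD transformations stay lazy.
  val result: Writer[List[String], RDD[String]] =
    for {
      people  <- load()
      younger <- filterByAge(people)
      nameRdd <- names(younger)
    } yield nameRdd
}
```

The ~> helper only pairs a (lazy) RDD with a log line inside the writer, so the log is accumulated on the driver; nothing is written from the executors.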
Please note that result is of type Writer[List[String], RDD[String]]. Calling result.run will give us the log (a List[String]) and the final computation expressed by rdd (an RDD[String]).

At this point we could use the Spark logger to write down the log generated by the chain of transformations (a minimal sketch of this write-out follows the conclusions). Note that this operation is executed on the Spark driver, which implies that a single log file will be created with all the log information. We also remove potential contention problems during the log writes, and because we create and write to the file in a serial way, there is no need to lock it, avoiding the performance issues that would otherwise appear.

Conclusions

We have improved how we log on Apache Spark by using the Monad Writer. This functional approach allows us to distribute the creation of logs along with our computations, something Spark knows how to do well. However, instead of writing the logs on each worker node, we collect them back to the driver to write them down. This mechanism has certain advantages over our previous implementation: we now control exactly how and when our logs are written down, we boost performance by removing I/O operations on the worker nodes, we remove synchronization issues by writing the logs in a serial way, and we avoid the hazard of fishing for logs across our entire cluster.
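For reference, the driver-side write-out mentioned above could look like the sketch below. It reuses the hypothetical UnifiedLog object from the earlier example and an SLF4J logger (the logger name is illustrative), which Spark's own logging configuration will pick up.

```scala
import org.slf4j.LoggerFactory

object UnifiedLogApp {
  def main(args: Array[String]): Unit = {
    val logger = LoggerFactory.getLogger("unified-log") // handled by Spark's logging config

    // run unpacks the writer: the accumulated log plus the final RDD, both on the driver.
    val (log, rdd) = UnifiedLog.result.run

    // One serial pass over the unified log; nothing was written on the worker nodes.
    log.foreach(line => logger.info(line))

    rdd.collect().foreach(name => logger.info(s"result: $name"))
    UnifiedLog.spark.stop()
  }
}
```

Because the whole list is written in one serial pass on the driver, the single log file preserves the order of the transformations.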