Working with Spark Datasets has been quite interesting and, most of the time, rewarding in our current project. The Dataset API is simple yet powerful, and it abstracts away the need to hand-code complex transformations and computations. To be honest, we also have a fairly straightforward use case: a few domain entities and even fewer transformations, all based on simple joins.

However, a few things have been counterproductive for us, and I am going to focus on one of them: the lack of type safety in some operations, particularly joins.

    dataSetA.join(dataSetB, "columnA")

The above code fails at runtime if dataSetA or dataSetB (or both) does not have a "columnA" column. This is a waste of resources at multiple levels, from precious CPU cycles to developer time. In the remainder of this blog, we will add compile-time safety to join operations and learn a lot in the process.

Before we proceed, a disclaimer: this is not an unsolved problem. Frameless does a fantastic job of providing type safety for Datasets. However, it is a very evolved and complete framework built around a newer abstraction, TypedDataset, and we really did not want to add an external dependency when we only wanted type safety in a select few Dataset methods. The solution we are going to formulate follows what Frameless does, which in turn leverages generic programming using the awesome Shapeless.

Problem Statement

Let's come up with the goals we want to achieve by the end of this post:

- When we access a column by name, compilation should fail if the column does not exist in the dataset.
- When we join two datasets, compilation should fail if the joining column is missing from either dataset or, if present in both, is not of the same type.
- A good DSL for doing the above never hurts!

Step 0: Basics

For any Dataset of type T (a case class/Product type), we need to know all the properties of type T along with their types.
This means that we want to move from a specialized T to a generalized list of properties with their types, and this, in a very simplified explanation, is what Shapeless provides. It provides a conversion to and from a case class and a heterogeneous list (HList), plus a bouquet of functions to apply on that list. The best material to read about Shapeless is this, and I strongly suggest giving it a thorough read.

For now, it is enough to know that Shapeless provides LabelledGeneric, which performs this conversion. It can be illustrated as below:

    import shapeless.LabelledGeneric

    case class Person(name: String, age: Int, isEmployee: Boolean)
    //defined class Person

    val generic = LabelledGeneric[Person]
    //generic: shapeless.LabelledGeneric[Person]{type Repr = shapeless.::[String with shapeless.labelled.KeyTag[Symbol with shapeless.tag.Tagged[String("name")],String],shapeless.::[Int with shapeless.labelled.KeyTag[Symbol with shapeless.tag.Tagged[String("age")],Int],shapeless.::[Boolean with shapeless.labelled.KeyTag[Symbol with shapeless.tag.Tagged[String("isEmployee")],Boolean],shapeless.HNil]]]}

    //usage:
    val person = Person("John Doe", 32, true)
    val hlist = generic.to(person)
    //hlist: generic.Repr = John Doe :: 32 :: true :: HNil
    generic.from(hlist)
    //res0: Person = Person(John Doe,32,true)

Step 1: Property Exists?

Given a type T, if there exists a property of name PName and type PType, then our condition is satisfied.

https://medium.com/media/1a1b7bfc9b765cc9cef244fc438d206c/href

Let's break down the gist line by line:

- We define a trait PropertyExists for type T which also expects the types PName (for Property Name) and PType (for Property Type). We don't care about properties or methods on the trait: the mere existence of an instance is the proof that our condition holds.
- We define an apply method which accepts a Witness and implicitly expects an instance of PropertyExists for a certain PType.
Witness is one of the utility abstractions of Shapeless which, given a Symbol, returns a handle to its type and value.

But how do we pass the implicit PropertyExists parameter? And where are we looking for the properties? Well, the implicit is provided by implicitProvider, which relies on the LabelledGeneric we introduced above. It takes a couple more implicitly created parameters. Let's dissect them:

implicit gen: LabelledGeneric.Aux[T, H]
gen provides the heterogeneous list (HList) representation of type T. It uses the Aux pattern (another must-read for type-level programming!) to forward the result type to the resolution of the next implicit parameter.

selector: Selector.Aux[H, PName, PType]
Selector is one of the simpler abstractions of Shapeless: it provides the PType, provided it finds the property name PName in the record H.

So, in simpler terms, implicitProvider says the following: for a given type T, if you are able to create an HList of type H from LabelledGeneric[T], and if you are also able to select from that HList H a property of name PName with type PType, then go ahead and provide a PropertyExists instance for T, PName and PType.

Step 2: First Test of Type Safety

Now that we have our PropertyExists, let's take our first stab at type safety: creating a Column instance from a key, and failing at compile time if the key doesn't exist.

https://medium.com/media/6a7c1d1a083839a5c180a2d7827b684e/href

We define a RichDataset abstraction which extends the Spark Dataset with type checking. We add an apply method which takes a Symbol and implicitly tries to get a PropertyExists instance for the column type column.T (the Aux pattern is at play here too!). As always, this compiles only if the column exists in A.
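To make the description above concrete, here is a rough sketch of how PropertyExists and a symbol-based lookup might fit together. It is a minimal sketch, assuming Shapeless 2.3.x on Scala 2.12, and it is not the code from the embedded gists. Spark is left out entirely: `ColumnLookup` and `columnName` are hypothetical stand-ins for RichDataset and its apply method (they return the column name as a String rather than a Spark Column), and the `implicitNotFound` message is an assumption modelled on the compiler error quoted in this post.

```scala
import scala.annotation.implicitNotFound
import shapeless.{HList, LabelledGeneric, Witness}
import shapeless.ops.record.Selector

// Evidence that type T has a property named PName of type PType.
// The message text is an assumption, not taken from the original gist.
@implicitNotFound("${PName} not found in ${T}")
trait PropertyExists[T, PName, PType]

object PropertyExists {
  // Derives the evidence: T must have a labelled HList representation H,
  // and H must contain a field keyed by PName whose value type is PType.
  implicit def implicitProvider[T, H <: HList, PName, PType](
      implicit
      gen: LabelledGeneric.Aux[T, H],
      selector: Selector.Aux[H, PName, PType]
  ): PropertyExists[T, PName, PType] = new PropertyExists[T, PName, PType] {}
}

// Hypothetical stand-in for RichDataset: fixes T first so that PType can be inferred.
final class ColumnLookup[T] {
  // Compiles only when T has a field whose name matches the symbol's singleton type.
  def apply[PType](column: Witness.Lt[Symbol])(
      implicit exists: PropertyExists[T, column.T, PType]
  ): String = column.value.name
}

object ColumnLookup {
  def columnName[T]: ColumnLookup[T] = new ColumnLookup[T]
}

case class Person(name: String, age: Int, isEmployee: Boolean)

object PropertyExistsDemo {
  import ColumnLookup.columnName

  // Compiles, because Person has an `age` field.
  val age: String = columnName[Person]('age)

  // Would be rejected by the compiler, because Person has no `namesss` field:
  // columnName[Person]('namesss)
}
```

Splitting the type parameters across `columnName[T]` and `apply[PType]` is a design choice of this sketch: the caller names only the case class, and the compiler works out PType while resolving the implicit.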
If we take our Person case class from above, the following behaviour should be observed:

    val personDs = Seq(persons).toDS().enriched
    val ageColumn: Column = personDs('age) //compiles
    val nameColumn: Column = personDs('namesss)
    //Error:(36, 56) Symbol with shapeless.tag.Tagged[String("namesss")] not found in Person
    //val nameColumn: Column = personDs('namesss)

And that is our first milestone!

PS: we need to expose enriched because otherwise the compiler will pick the apply method of Dataset and not that of RichDataset.

Step 3: Let's Join

Now that we have established how PropertyExists is used, let's formulate the DSL we would like to use for our joins:

    //for left join
    //natural join, single key reference
    datasetA.leftJoin(datasetB).withKey('key)
    //natural join, multiple keys
    datasetA.leftJoin(datasetB).on('key1, 'key2)
    //for joins that are not natural
    datasetA.leftJoin(datasetB) where { datasetA('keyA) === datasetB('keyB) }

Seems pretty OK. Let's dive in!

https://medium.com/media/dfc60e490507fa5ff7974e83defa5b2b/href

We introduce a JoinDataSet which provides the syntactic sugar for the actual join operations. JoinDataSet also provides the final join methods we decided on in the DSL: withKey, on and where.

.withKey

https://medium.com/media/95bb40f4d57e43b49a61b9211dd767ff/href

As we can see, withKey is identical to what we achieved in Step 2, with a couple of notable differences. For a Symbol column, we check that PropertyExists holds for both Dataset[L] and Dataset[R], and that for both datasets the type is K. This enforces that not only the column name is the same, but also the column type.

.where

https://medium.com/media/091a0ca8c34421dc764296539ea1ef8a/href

.where is even simpler.
It takes a nullary function which returns a Column, and it leverages the way we already express conditions on Columns. To obtain each Column we use the apply method we created in Step 2.

.on

https://medium.com/media/ee19c89565a711052eed172ea768bc6f/href

As one can observe, .on is not a function at all! If we think about this and about our definition of the on method in the DSL, what we need to work with is a varargs of Symbol, and for each such symbol we need a PropertyExists created. Unfortunately, there is no direct way to convert varargs to an HList: varargs arrive as a Seq, and a Seq is not a Product (case class type), so the number and the individual types of its elements are not known at compile time. For this, Shapeless provides a sugar abstraction, SingletonProductArgs, which uses Scala's Dynamic trait together with a macro to build an HList from the arguments. The applyProduct is really an apply method on the "on" object, and it allows us to achieve our syntax.

Here's how the above code pans out: given that I have to dynamically apply varargs to a method named "apply" which gives out an HList V, and I can generate an implicit instance which gives out a List of Symbols from it, and also, for both the Datasets, a PropertiesExists of type K for the HList V: do the join.

Here's the complete code:

https://medium.com/media/adf2bc599111e677b9da8a85f86cae27/href

PropertiesExists?

For matching multiple properties, we create another trait like PropertyExists. While PropertyExists worked with a single property PName, PropertiesExists needs to work with an HList. So we get our trait as:

    trait PropertiesExists[T, PName <: HList, PType]

Now, like a List, an HList has three basic building blocks: Head, Tail and Nil (in this case HNil), where HList = Head :: Tail and the Tail is itself an HList that eventually ends in HNil. So all we need to do now is define implicit providers for HNil, Tail and Head. Since the head is essentially a single property, PropertyExists fits just fine! For the tail, we recursively try to create an implicit provider, as we would for any List.
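The recursion described above might look roughly like the following. This is a minimal sketch, assuming Shapeless 2.3.x on Scala 2.12, and it is not the code from the embedded gist. The provider names `forHNil` and `forHCons` are hypothetical, as is the `Salary` class, and the sketch assumes that PType is itself an HList holding the type of each named property in order, which is one plausible reading of the trait signature rather than something the post states.

```scala
import shapeless.{::, HList, HNil, LabelledGeneric, Witness}
import shapeless.ops.record.Selector

// Single-property evidence, as described in Step 1.
trait PropertyExists[T, PName, PType]

object PropertyExists {
  implicit def implicitProvider[T, H <: HList, PName, PType](
      implicit
      gen: LabelledGeneric.Aux[T, H],
      selector: Selector.Aux[H, PName, PType]
  ): PropertyExists[T, PName, PType] = new PropertyExists[T, PName, PType] {}
}

// Multi-property evidence. PName is an HList of property names; in this sketch
// PType is assumed to be the HList of the matching property types.
trait PropertiesExists[T, PName <: HList, PType]

object PropertiesExists {
  // HNil: no names left to check, so the evidence always exists.
  implicit def forHNil[T]: PropertiesExists[T, HNil, HNil] =
    new PropertiesExists[T, HNil, HNil] {}

  // Head :: Tail: the head is checked with PropertyExists, the tail recursively.
  implicit def forHCons[T, NameH, NameT <: HList, TypeH, TypeT <: HList](
      implicit
      head: PropertyExists[T, NameH, TypeH],
      tail: PropertiesExists[T, NameT, TypeT]
  ): PropertiesExists[T, NameH :: NameT, TypeH :: TypeT] =
    new PropertiesExists[T, NameH :: NameT, TypeH :: TypeT] {}
}

// Hypothetical domain classes that share the keys `name` and `age`.
case class Person(name: String, age: Int, isEmployee: Boolean)
case class Salary(name: String, age: Int, amount: Double)

object PropertiesExistsDemo {
  // Witnesses give us the singleton types of the symbols.
  val name = Witness('name)
  val age  = Witness('age)
  type Keys = name.T :: age.T :: HNil

  // Both lines compile only because each class has `name: String` and `age: Int`.
  val personHasKeys = implicitly[PropertiesExists[Person, Keys, String :: Int :: HNil]]
  val salaryHasKeys = implicitly[PropertiesExists[Salary, Keys, String :: Int :: HNil]]
}
```

Requiring the same PType for both sides of a join is what would force the key columns to agree on type as well as on name.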
https://medium.com/media/17a5c19eca9d2c0565461a09aa8b7195/href

With that, we can complete our RichDataSet as below:

https://medium.com/media/a3a077534c4e0ca0a42013b506605cc9/href

And that's it!

Pursuing type safety goes a long way in optimizing the development flow: it catches issues early (even before execution!) and, most importantly, helps in writing meaningful unit tests. Apart from type safety, I also wanted to share how Shapeless (and generic type-level programming in general) can help in writing succinct, type-safe code that is checked at compile time, and I hope I was able to do some justice to how awesome Shapeless and Frameless (for Spark Datasets) are!

Thank you for reading! If you liked this article, please share/recommend. If not, please comment/critique so I can improve and learn more! :)