Spark Packages, from Xml to Json

The Apache Spark community has put a lot of effort into extending Spark so we can all benefit from the computing capabilities it brings. Recently, we have been interested in transforming XML datasets into something easier to query. Our main interest lies in doing data exploration on top of the billions of transactions that we receive every day.

XML is a well-known format, but it can get complicated to work with. In Apache Hive, for instance, we could define the schema of our XML and then query it using SQL, which is important to us. However, it is hard for us to keep up with changes to the XML structure, so we discarded that option.

We use Spark Streaming to bring these transactions into our cluster, and we were thinking of doing the required transformations within Spark. However, the same problem remains: we would have to change our Spark application every time the XML structure changes.

There must be another way!

There is an Apache Spark package from the community that we can use to solve these problems. We start the Spark Shell with the XML package added to our Spark environment. The package can, of course, also be added as a dependency when writing a Spark application and packaging it into a JAR file.

Using the package, we can read any XML file into a DataFrame. When loading the DataFrame, we could specify the schema of our data, but avoiding that was our main concern in the first place, so we let Spark infer it. Schema inference is a very powerful feature: since we no longer need to know the schema in advance, it can change at any time.

Once the XML files are loaded into a DataFrame, printing the DataFrame schema gives us an idea of what the inference system has done.

At this point, we can use any SQL tool to query our XML through Spark SQL. Please read the related post to learn more about Spark SQL.
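The loading step described above can be sketched roughly as follows. This is a minimal sketch, not the author's original snippet: it assumes Spark 2.x (where the shell predefines `spark`), the Databricks `spark-xml` package, and a hypothetical input in which each record is a `<transaction>` element. The package coordinates, the row tag, the path, and the view name are all assumptions chosen for illustration.

```scala
// Start the shell with the package on the classpath. The coordinates below are an
// assumption; pick the build that matches your Spark and Scala versions:
//   spark-shell --packages com.databricks:spark-xml_2.11:0.4.1

// Read XML into a DataFrame without supplying a schema, so that it is inferred.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "transaction")   // XML element treated as one row (assumed tag name)
  .load("/data/transactions.xml")    // hypothetical path

// Show what the inference produced.
df.printSchema()

// Expose the DataFrame to Spark SQL under a hypothetical view name.
df.createOrReplaceTempView("transactions")
spark.sql("SELECT COUNT(*) AS total FROM transactions").show()
```

Because no schema is passed to the reader, a new element or attribute in the XML shows up as a new column the next time the data is loaded, without any change to the code.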
Related post: Apache Spark as a Distributed SQL Engine

Going a step further, we might want to use tools that read the JSON format. Having JSON datasets is especially useful if you have something like Apache Drill, from MapR.

As we would expect, Spark lets us do any kind of transformation, but there is no need to write a fancy JSON encoder because Spark already supports this. Let's convert our DataFrame to JSON and save it to our file system.

When applying the toJSON function to the DataFrame, we get an RDD[String] with the JSON representation of our data. Then we save the RDD as a plain text file.

Now we can use Drill to read and query our new dataset, and of course we can always go back to Spark if we need more complicated operations or transformations.

Conclusions

Transforming our dataset from XML to JSON is an easy task in Spark, and the advantages of JSON over XML are significant for us. We can now rest assured that XML schema changes will not affect us: we have freed ourselves from the burden of changing our application for every XML change. We can also use powerful tools such as Apache Drill to query our JSON dataset in a schema-free fashion, while our clients can report on our data using SQL.

Thanks to the community for the awesome tools being built.
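The conversion step (toJSON, then saving as text) can be sketched as below. This is a hedged illustration for the Spark 2.x shell, not the author's original code: the small DataFrame stands in for the one loaded from XML, and its column names, values, and output location are hypothetical. Note that on Spark 1.x `toJSON` returns an `RDD[String]` directly, while on Spark 2.x and later it returns a `Dataset[String]`, so `.rdd` is used here to get back to an RDD.

```scala
import spark.implicits._

// Stand-in for the DataFrame loaded from XML (hypothetical columns and values).
val df = Seq((1L, "EUR", 9.99), (2L, "USD", 25.0)).toDF("id", "currency", "amount")

// One JSON document per row.
val jsonLines = df.toJSON.rdd

// Write to a fresh directory; saveAsTextFile fails if the target already exists.
val out = java.nio.file.Files.createTempDirectory("xml2json").resolve("out").toString
jsonLines.saveAsTextFile(out)

// Reading the files back shows that the structure survived the conversion.
spark.read.json(out).printSchema()
```

The output directory holds one JSON object per line, which is a layout that tools reading JSON files can generally consume. On Spark 2.x, `df.write.json(path)` is a shorter route to the same result.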
