Too Long; Didn't Read
In the last post, <a href="https://medium.com/@anicolaspp/apache-spark-as-a-distributed-sql-engine-4373e254e0f9#.x4kyh8jqr" target="_blank">Apache Spark as a Distributed SQL Engine</a>, we explained how to use SQL to query data stored in Hadoop. Our engine reads <strong>CSV</strong> files from a distributed file system, auto-discovers their schema, and exposes them as tables through the <em>Hive</em> metastore. All of this lets us connect standard SQL clients to the engine and explore our data sets without manually defining the schema of our files, avoiding ETL work.
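The core of that flow can be sketched in a few lines of Spark code. This is a minimal illustration, not the exact code from the previous post: it assumes a running <em>SparkSession</em> (here called <code>spark</code>) with Hive support enabled, and the file path and table name are hypothetical.

```scala
// Read a CSV file from the distributed file system, letting Spark
// auto-discover (infer) the schema from the data itself.
// The path "hdfs:///data/sales.csv" is illustrative only.
val df = spark.read
  .option("header", "true")       // first line holds column names
  .option("inferSchema", "true")  // sample the data to infer column types
  .csv("hdfs:///data/sales.csv")

// Persist the DataFrame as a table in the Hive metastore, so that
// standard SQL clients connected to the engine can query it directly.
df.write.saveAsTable("sales")
```

With the table registered, any JDBC/ODBC client pointed at the engine can run plain SQL such as <code>SELECT * FROM sales</code>, with no ETL step in between.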