So far I’ve written articles on Google BigQuery (1,2,3,4,5) , on cloud-native economics(1,2), and even on ephemeral VMs (1). One product that really excites me is Google Cloud Dataproc — Google’s managed Hadoop, Spark, and Flink offering. In what seems to be a fully commoditized market at first glance, Dataproc manages to create significant differentiated value that bodes to transform how folks think about their Hadoop workloads.
Jobs-first Hadoop+Spark, not Clusters-first
Typical mode of operation of Hadoop — on premise or in cloud — require you deploy a cluster, and then you proceed to fill up said cluster with jobs, be it MapReduce jobs, Hive queries, SparkSQL, etc. Pretty straightforward stuff.
The standard way of running Hadoop and Spark.
Services like Amazon EMR go a step further and let you run ephemeral clusters, enabled by separation of storage and compute through EMRFS and S3. This means that you can discard your cluster while keeping state on S3 after the workload is completed.
Google Cloud Platform has two critical differentiating characteristics:
When your clusters start in well under 90 seconds (under 60 seconds is not unusual), and when you do not have to worry about wasting that hard-earned cash on your cloud provider’s pricing inefficiencies, you can flip this cluster->jobs equation on its head. You start with a job, and you acquire a cluster as a step in job execution.
If you have a MapReduce job, as long as you’re okay with paying the 60 second initial boot-up tax, rather than submitting the job to an already-deployed cluster, you submit the job to Dataproc, which creates a cluster on your behalf on-demand. A cluster is now a means to an end for job execution.
Demonstration of my exquisite art skills, plus illustration of the jobs before clusters concept realized with Dataproc.
Again, this is only possible with Google Dataproc, only because of:
Operational and economic benefits are obvious and easily realized:
I’m sure I’m forgetting others. Feel free to leave a comment here to add color. Best response gets a collectors’ edition Google Cloud Android figurine!
Dataproc is as close as you can get to serverless and cloud-native pay-per-job with VM-based architectures — across the entire cloud space. There’s nothing even close to it in that regard.
Dataproc does have a 10-minute minimum for pricing. Add the sub-90 second cluster creation timer, and you rule out many relatively lightweight ad-hoc workloads. In other words, this works for big serious batch jobs, not ad-hoc SQL queries that you want to run in under 10 seconds. I write on this topic here.(do let us know if you have a compelling use case that leaves you asking for less than a 10-minute minimum).
The rest of the Dataproc goodies
Google Cloud doesn’t stop there. There’s a few other benefits of Dataproc that truly make your life easier and your pockets fuller:
Dataproc for stateful clusters
Now if you are running a stateful cluster with, say Impala and Hbase on HDFS, Dataproc is a nice offering here too, if for some reason you don’t want to run Bigtable + BigQuery.
If you are after the biggest baddest disk performance on the market, why not go with something that resembles RAM more than SSD in terms of performance — Google’s Local SSD? Mr. Dinesh does a great job comparing Amazon’s and Google’s offerings here. Cliff notes — Local SSD is really, really, really good — really.
Finally, Google’s Sustained Use Discounts automatically rewards folks who run their VMs for longer periods of time, up to 30% off. No contracts and no commitments. And, thank goodness, no managing your Reserved Instance bills.
You win if you use Google’s VMs for short bursts, and you win when you use Google for longer periods.
Economics of Dataproc
We discussed how Google’s VMs are typically much cheaper through Preemptible VMs, Custom VMs, Sustained Use Discounts, and even lower list pricing. Some folks find the difference to be 50% cheaper!
Two things that studying Economics taught me (put down your pitchforks, I also did Math) — the difference between soft and hard sciences, and the ability to tell a story with two-dimensional charts.
Let’s assume a worst-case scenario, in which EMR and Dataproc VM prices are equal. We get this chart, which hopefully requires no explanation:
Which line would you rather be on?
If you believe our good friend thehftguy’s claims that Google is 50% cheaper (after things like Preemptible VMs, Custom VMs, Sustained Use Discounts, etc), you get this compelling chart:
Same chart, but with some more aggressive assumptions.
When you’re dishing out your precious shekels to your cloud provider, think of all this extra blue area that you’re volunteering to pay that’s entirely spurious. This is why many of Dataproc’s customers don’t mind paying egress from their non-Google cloud vendors to GCS!
Summary
Google Cloud has the advantage of a second-comer. Things are simpler, cheaper, and faster. Lower-level services like instances (GCE) and storage (GCS) are more powerful and easier to use. This, in turn, lets higher-level services like Dataproc be more effective:
Fundamentally, Dataproc lets you think in terms of jobs, not clusters. You start with a job, and you get a cluster as just another step in job execution. This is a very different mode of thinking, and we feel that you’ll find it compelling.
You don’t have to take my word for it — good folks at O’Reilly had this to say about Dataproc and EMR.
Find me on twitter at @thetinot . Happy to chat further!