So far I’ve written articles on Google BigQuery (1, 2, 3, 4, 5), on cloud-native economics (1, 2), and even on ephemeral Hadoop VMs (1). One product that really excites me is Google Cloud Dataproc — Google’s managed Hadoop, Spark, and Flink offering. In what seems at first glance to be a fully commoditized market, Dataproc manages to create significant differentiated value that promises to transform how folks think about their workloads.

Jobs-first Hadoop+Spark, not Clusters-first

The typical mode of operating Hadoop — on premise or in the cloud — requires you to deploy a cluster, and then fill up said cluster with jobs, be it MapReduce jobs, Hive queries, SparkSQL, etc. Pretty straightforward stuff — the standard way of running Hadoop and Spark. Services like Amazon EMR go a step further and let you run ephemeral clusters, enabled by the separation of storage and compute through EMRFS and S3. This means you can discard your cluster after the workload completes while keeping state on S3.

Google Cloud Platform has two critical differentiating characteristics:

- Per-minute billing (Azure has this as well)
- Very fast VM boot-up times

When your clusters start in well under 90 seconds (under 60 seconds is not unusual), and when you do not have to worry about wasting that hard-earned cash on your cloud provider’s pricing inefficiencies, you can flip this clusters-then-jobs equation on its head. You start with a job, and you acquire a cluster as a step in job execution. If you have a MapReduce job, as long as you’re okay with paying the 60-second initial boot-up tax, rather than submitting the job to an already-deployed cluster, you submit the job to Dataproc, which creates a cluster on your behalf, on demand. A cluster is now a means to an end for job execution.

(Image: Demonstration of my exquisite art skills, plus illustration of the jobs-before-clusters concept realized with Dataproc.)
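To make the jobs-first flow concrete, here is a minimal sketch of the three steps as gcloud invocations, built as command lists in Python. The cluster name, jar path, and worker count are hypothetical placeholders, and flag spellings should be checked against the current Dataproc CLI docs:

```python
def jobs_first_commands(cluster, jar, num_workers=10):
    """Build the 'cluster as a step in job execution' sequence:
    create a cluster, submit the job to it, then tear the cluster down."""
    return [
        ["gcloud", "dataproc", "clusters", "create", cluster,
         "--num-workers", str(num_workers)],
        ["gcloud", "dataproc", "jobs", "submit", "spark",
         "--cluster", cluster, "--jar", jar],
        ["gcloud", "dataproc", "clusters", "delete", cluster, "--quiet"],
    ]

# Hypothetical names; each command could be run in order with
# subprocess.run(cmd, check=True).
commands = jobs_first_commands("ephemeral-1", "gs://my-bucket/job.jar",
                               num_workers=4)
```

The point is that the cluster appears and disappears inside the job’s lifecycle, rather than the job being slotted into a long-lived cluster.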
Again, this is only possible with Google Dataproc, because of:

- high granularity of billing (per-minute)
- a very low tax on initial boot-up times
- separation of storage and compute (ditching HDFS as the primary store)

The operational and economic benefits are obvious and easily realized:

- Resource segregation through tenancy segregation — avoids non-obvious bottlenecks and resource contention between jobs.
- Simplicity of management — no need to actually manage the cluster, or resource allocation and priorities through things like the YARN resource manager. Your dev/stage/prod workloads are now intrinsically separate — and what a pain that is to resolve and manage elsewhere!
- Simplicity of pricing — no need to worry about rounding up to the nearest hour.
- Simplicity of cluster sizing — to get the job done faster, simply ask Dataproc to deploy more resources for the job. When you pay per minute, you can start thinking in terms of VM-minutes.
- Simplicity of troubleshooting — resources are isolated, so you can’t blame the other tenants for your problems.

I’m sure I’m forgetting others. Feel free to leave a comment here to add color. Best response gets a collectors’ edition Google Cloud Android figurine!

Dataproc is as close as you can get to serverless, cloud-native, pay-per-job with VM-based architectures — across the entire cloud space. There’s nothing even close to it in that regard.

Dataproc does have a 10-minute minimum for pricing. Add the sub-90-second cluster creation timer, and you rule out many relatively lightweight ad-hoc workloads. In other words, this works for big, serious batch jobs, not ad-hoc SQL queries that you want to run in under 10 seconds. I write on this topic here (do let us know if you have a compelling use case that leaves you asking for less than a 10-minute minimum).

The rest of the Dataproc goodies

Google Cloud doesn’t stop there.
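The 10-minute minimum is easy to model. A toy function, assuming per-minute billing of cluster time with that floor, shows why short ad-hoc queries are a poor fit while batch jobs pay exactly for what they use:

```python
def dataproc_billed_minutes(job_minutes):
    """Dataproc bills cluster time per minute, with a 10-minute minimum."""
    return max(job_minutes, 10)

# A 3-minute ad-hoc query is still billed as 10 minutes (the minimum bites)...
short_query = dataproc_billed_minutes(3)   # 10
# ...while a 45-minute batch job pays for exactly the minutes it used.
batch_job = dataproc_billed_minutes(45)    # 45
```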
There are a few other benefits of Dataproc that truly make your life easier and your pockets fuller:

- Custom VMs — if you know the typical resource utilization profile of your job in terms of CPU/RAM, you can tailor-make your own instances with that CPU/RAM profile. This is really, really cool, you guys. I wrote on this topic recently.
- Preemptible VMs — Google’s alternative to Spot instances is just great. A flat 80% off, and Dataproc is smart enough to repair your jobs in case instances go away. I beat this topic to death in the blog post, and in my biased opinion it’s worth a read on its own.
- Best pricing in town — Google Compute Engine is the industry price leader for comparably-sized VMs. In some cases, up to 40% less than EC2.
- Gobs of ephemeral capacity — Yes, you can run your Spark jobs on thousands of Preemptible VMs, and we won’t make you sign a big commitment, as this gentleman found out (TL;DR: running 25,000 Preemptible VMs).
- GCS is fast, fast, fast — When ditching HDFS in favor of object stores, what matters is the overall pipe between storage and instances. Mr. Jornson details the performance characteristics of GCS and comparable offerings here.

Dataproc for stateful clusters

Now, if you are running a stateful cluster with, say, Impala and HBase on HDFS, Dataproc is a nice offering here too, if for some reason you don’t want to run Bigtable + BigQuery. If you are after the biggest, baddest disk performance on the market, why not go with something that resembles RAM more than SSD in terms of performance — Google’s Local SSD? Mr. Dinesh does a great job comparing Amazon’s and Google’s offerings here. Cliff notes — Local SSD is really, really, really good. Really.

Finally, Google’s Sustained Use Discounts automatically reward folks who run their VMs for longer periods of time, up to 30% off. No contracts and no commitments. And, thank goodness, no managing your Reserved Instance bills.
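To see how the preemptible discount compounds across a cluster, here is a small illustrative calculation. The flat 80% discount is the figure quoted above; the $0.10/hour list price and the 2-standard/8-preemptible split are made-up numbers for the example:

```python
def preemptible_price(on_demand_hourly):
    """Preemptible VMs at a flat 80% off the on-demand rate (per the
    article; actual rates vary by machine type and over time)."""
    return on_demand_hourly * 0.20

def blended_cluster_rate(n_standard, n_preemptible, on_demand_hourly):
    """Hourly rate for a cluster mixing standard and preemptible workers."""
    return (n_standard * on_demand_hourly
            + n_preemptible * preemptible_price(on_demand_hourly))

# 2 standard + 8 preemptible workers at an illustrative $0.10/hour list price:
rate = blended_cluster_rate(2, 8, 0.10)  # about $0.36/hour vs $1.00 all-standard
```

Since Dataproc repairs jobs when preemptible instances disappear, skewing the worker mix toward preemptibles is the easy lever for batch workloads.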
You win if you use Google’s VMs for short bursts, and you win when you use Google for longer periods.

Economics of Dataproc

We discussed how Google’s VMs are typically much cheaper through Preemptible VMs, Custom VMs, Sustained Use Discounts, and even lower list pricing. Some folks find the difference to be 50% cheaper!

Two things that studying Economics taught me (put down your pitchforks, I also did Math) — the difference between soft and hard sciences, and the ability to tell a story with two-dimensional charts.

Let’s assume a worst-case scenario, in which EMR and Dataproc VM prices are equal. We get this chart, which hopefully requires no explanation:

(Chart: Which line would you rather be on?)

If you believe our good friend thehftguy’s claims that Google is 50% cheaper (after things like Preemptible VMs, Custom VMs, Sustained Use Discounts, etc.), you get this compelling chart:

(Chart: Same chart, but with some more aggressive assumptions.)

When you’re dishing out your precious shekels to your cloud provider, think of all that extra blue area you’re volunteering to pay for — it’s entirely spurious. This is why many of Dataproc’s customers don’t mind paying egress from their non-Google cloud vendors to GCS!

Summary

Google Cloud has the advantage of a second-comer. Things are simpler, cheaper, and faster. Lower-level services like instances (GCE) and storage (GCS) are more powerful and easier to use. This, in turn, lets higher-level services like Dataproc be more effective:

- Cheaper — per-minute billing, Custom VMs, Preemptible VMs, Sustained Use Discounts, and cheaper VM list prices.
- Faster — rapid cluster boot-up times, best-in-class object storage, best-in-class networking, and RAM-like performance characteristics of Local SSDs.
- Easier — lots of capacity, a less fragmented instance type offering, VPC-by-default, and images that closely follow Apache releases.

Fundamentally, Dataproc lets you think in terms of jobs, not clusters.
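The two charts can be reproduced numerically. This sketch assumes an illustrative $1.00/hour cluster rate: the EMR line rounds up to full hours, the Dataproc line is per-minute, and the second chart additionally applies the ~50% effective discount claimed above:

```python
import math

def emr_spend(minutes, rate_per_hour=1.0):
    """EMR line: every started hour is billed in full (step function)."""
    return math.ceil(minutes / 60) * rate_per_hour

def dataproc_spend(minutes, rate_per_hour=1.0, discount=0.0):
    """Dataproc line: per-minute billing; set discount=0.5 for the
    'more aggressive assumptions' chart."""
    return minutes * rate_per_hour * (1 - discount) / 60

minutes = 150  # a 2.5-hour workload
worst_case = emr_spend(minutes) - dataproc_spend(minutes)                # $0.50 extra
aggressive = emr_spend(minutes) - dataproc_spend(minutes, discount=0.5)  # $1.75 extra
```

The gaps (`worst_case`, `aggressive`) are exactly the "extra blue area" between the lines: what hourly rounding and higher rates cost you over the same workload.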
You start with a job, and you get a cluster as just another step in job execution. This is a very different mode of thinking, and we feel that you’ll find it compelling.

You don’t have to take my word for it — see what the good folks at O’Reilly had to say about Dataproc and EMR here.

Find me on Twitter at @thetinot. Happy to chat further!