Cassie Kozyrkov

@kozyrkov

What do you call AI without the boring bits?

Kubeflow.

If you read on, you might just fall in love with Kubeflow. It’s not just a cuddly way to get going with machine learning, it stands for something huge for the data science community. You’ve been warned.

The Hard Way

Let’s talk about an ancient phenomenon: worshipping The Hard Way. Many of us have been guilty of it, but it’s especially common among STEM folk who’ve undergone decades of hazing. The kind of hazing that measures your worth in how well you can reinvent a wheel, how well you can do every bit yourself, and how little help you accept. When you’re in the thick of this mentality, you think the difficulty or complexity of a task matters more than its value. (And perhaps one day you’ll use your vacation to build a computer and its operating system from scratch just to make sure you really understand the everything of everything.)

The Hard Way means swimming through a sticky syrup of drudgery, taking forever to get things done.

I had this attitude in my wild youth, but today I’m embarrassed on behalf of forces that make life difficult for no good reason. Progress tends to move in the direction of making things easier. I’d prefer to celebrate progress instead of being proud I can still do things the hard way. Sure, I can invert a matrix with pen and paper… but why would I ever do it? Because all the world’s computers might be raptured?

Still waiting for this tech to make a comeback. Any day now. I’d better practice my backpropagation by hand.

Let’s welcome the reduction in toil that technology brings. Bad tools and processes that slow everyone down belong on the endangered species list — let’s start being proud of doing what’s valuable instead of what’s difficult, and if you insist on tackling a real challenge, why not work to make difficult things easy for everyone else?

Kubeflow embodies taking a stand against doing things The Hard Way. It’s a ski lift for your mountain of chores.

Among the tools that embody a stand against The Hard Way, Kubeflow is a project that’s close to my heart — David Aronchick and I wrote some of the original blueprints in a manic fever right after our first cup of coffee together. I’m pretty sure I didn’t get up for sustenance that day, not while the first draft of the design doc was still on the inside of our skulls.

What is Kubeflow?

So, what is Kubeflow and why did it inspire all that intense passion? It’s a ski lift to help you with the mountain of machine learning setup chores. Let’s unpack that…

The document David and I wrote started under a different name. Eventually it was codenamed Grace, but we called the first version Beautiful Machine Learning. Okay, maybe not my most original naming moment, but cut me some slack. We statistician folk like a thing to do exactly what it says on the tin. (Thank goodness someone with some creativity fixed it later, right?)

That’s the point of Kubeflow: beautiful machine learning. Specifically, the beauty of the data scientist’s experience while wrestling into submission the beast that is machine learning in multicloud hybrid environments as an entire stack. Which, if you’ve tried to DIY it in the pre-Kubeflow days, was anything but beautiful.

In case you prefer videos to reading stuff, here’s me summarizing David’s talk on Kubeflow from Next 2018.

If you’re like me, you barely tolerate the boring parts of the data science process and it’s very hard to work with a song in your heart while you do the bits involving minimal cunning and maximal drudgery. (A personal exception is my perverse love of data cleaning, which I enjoy as a meditative activity in the same category as playing 2048. Mmmm… injecting order into chaos. Delicious.)

What data scientists want

You want to work on those interesting models, you want to focus on testing your hypotheses, you want to make gorgeous plots (are you the kind whose heart beats a little faster when I say interactive, shiny, animated?), and you want to get to the actionable insights. Yeah, me too. But first we have to spend an eternity on setup and operating systems and scaling and all kinds of painfully boring stuff. I mean, come on, does anyone actually enjoy giving Big Data special treatment relative to small data?

As far as most data scientists are concerned, it would be great if all the code we wrote were the same for big or small, laptop or cloud, prototype or production…

Deep down inside we know that the only reason that this isn’t the case is that we live in the dark ages where our tools suck.

So how about if we make them unsuck? And, while we’re at it, how about if we could have the ideal data science workflow with every dream bell and whistle all laid out so that using them was effortless and we could just get on with the fun parts of our work?

Kubeflow is about giving data scientists the experience they’d have if they got rid of all the fiddly bits they don’t like.

That is what beautiful machine learning, and ultimately Kubeflow, is all about. It’s a blueprint for living the dream, for giving data scientists the experience they’d have if they got rid of all the fiddly bits they don’t like. Kubeflow is not perfect yet (it first greeted the world at the end of last year), but it’s moving rapidly in the right direction.

Making our tools unsuck

If we want to go about fixing things, where should we start? The Kubeflow team picked composability, portability, and scalability.

  • Composable ML accommodates all the various data science tools out there.
  • Portable ML means it’s easy to go from prototyping models on your laptop to running them unchanged in production.
  • Scalable ML means it’s effortless to go from your small data prototype to huge data pipelines.

You know what’s really good at composability, portability, and scalability? Containers and Kubernetes!

Ugh, the gotcha. For most data scientists, even talking about these tech buzzwords is an exercise in improvisation. If you’re not already an expert, using machine learning on Kubernetes presents a world of pain — not only because there are so many things you have to become an expert in that are unrelated to your comfort zone, but also because all the primers are written for a different kind of engineer. You’ll need to earn a second black belt in addition to your machine learning one. Aaaand we’re face-to-face with the problem again!

A second black belt?

Most data scientists consider learning all that stuff a special kind of torture, and those who are excited by it might simply not have the spare time. Can’t someone else go learn Kubernetes for us so we can get on with our actual jobs?

Kubeflow is a ski lift to help you with the mountain of machine learning setup chores.

Consider Kubeflow your volunteer buddy here. The whole point of it is to make it easy for everyone to develop, deploy, and manage portable, distributed machine learning on Kubernetes. No extra black belts needed. The goal is to have your entire stack made unannoying and complete with every beautiful toy you want. We’re not there yet, but we’re accelerating fast. For example, Kubeflow v0.2 gets you fully set up with one (!) single line of code.

You can deploy Kubeflow with the one-step script included in the download. For more info, see David’s talk.

It takes only one line of effort to get Jupyter notebooks, distributed training, and model serving fit for the hybrid cloud environment. Oh, and there’s customizable ksonnet packaging too, H2O.ai tooling, and powerful hyperparameter tuning to boot. Data scientists, think on these additional things, look me in the eye, and tell me you want to figure out autoscaling based on job submission, cloud-specific VMs, and data exfiltration prevention. No? Well, luckily you don’t have to. Congratulations on waiting it out long enough to have it taken care of for you, kind of like you don’t need to build your own computer anymore.

A glimpse of what hyperparameter tuning with Katib looks like. What’s a hyperparameter? It’s that dial on your toaster. When your machine learning algorithm has a lot of knobs and dials, figuring out what to set them to can be annoying — luckily tools like Katib are here to help.
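If the toaster dial feels abstract, here's a minimal sketch of the chore that tuning tools take off your hands: try a bunch of dial settings, score each one, keep the best. To be clear, this is not Katib's API. The function names, the made-up loss surface, and the dial values are all hypothetical, and a real objective would train a model instead of evaluating a formula.

```python
import itertools


def toy_validation_loss(learning_rate, batch_size):
    """Hypothetical stand-in for 'train a model, report validation loss'."""
    # Made-up bowl-shaped surface whose minimum sits at
    # learning_rate=0.1 and batch_size=64.
    return (learning_rate - 0.1) ** 2 + ((batch_size - 64) / 100) ** 2


def grid_search(objective, grid):
    """Try every combination of dial settings and return the best one."""
    names = list(grid)
    best_params, best_loss = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        loss = objective(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


# Two dials with four settings each: 16 combinations to try.
search_space = {
    "learning_rate": [0.001, 0.01, 0.1, 1.0],
    "batch_size": [16, 32, 64, 128],
}

best_params, best_loss = grid_search(toy_validation_loss, search_space)
print(best_params, best_loss)
```

Two dials with four settings each is already 16 training runs, and every extra dial multiplies that count. That's the annoying part you'd rather hand to a tool.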

An attitude shift towards inclusion

Data science is the discipline of making data useful, and since the world is generating data like never before, the work data scientists do is becoming more necessary than ever. We all want a world where information is used to make life better instead of collecting cobwebs. There’s so much data and so much work to be done that we need the barriers to entry into analytics lowered as quickly as possible. I’m proud of the role Kubeflow is playing in that.

We’re entering an era that is about empowering everyone to stand on the shoulders of giants.

Actually, by talking about how lovably shiny Kubeflow is, I might have distracted you from the bigger point: what it represents. It’s an attitude shift towards inclusion and it’s one of the early steps our community is taking in an era that accelerates everyone’s ability to stand on the shoulders of giants. It points to a bright and exciting future!

Imagine a world where the tools are so easy that absolutely everyone can use them. That’s where we’re headed!

I’m proud that as a community of data professionals we’re starting to shun The Hard Way and stepping up to make fiddly things easy for everyone else. We’re beginning to say, “Don’t know the gory behind-the-scenes details? That’s okay! We refuse to leave you behind. Come have a seat at our table and join us in making incredible things.”

That smells like progress to me — let’s have more of it!

If you’re interested in trying out Kubeflow, get started here. Or learn more in David’s fabulous talk.
