You’re a startup founder. You know you need to have a “data play” (or worse, “AI play”). Investors and clients are asking about machine learning (or worse, deep learning). The question is no longer why, but when.
So, you hire your first data scientist. A fresh graduate (those with industry experience are too expensive and hard to recruit). They have a PhD (so you can nonchalantly drop “I have my best PhDs looking into this” into conversations with potential clients and VCs). They’re excited about doing machine learning, and so are you. This is going to rock!
Your data scientist starts, ready to take your data and build models from it, just like you agreed. Back in academia, they worked with datasets that several generations of grad students had been using. They’re not naive, though: they know this time around they’ll have to create their own. No problem! Your startup has a year’s worth of logs in S3, ready to be mined and used as training data.
There’s no infrastructure for data analysis, so everything has to be done from scratch. Your data scientist is trying to install various tools incompatible with your stack or architecture and, for now, is doing one-off log parsing to answer every question. The engineering team doesn’t feel it’s safe to give them prod access so they provide an offline copy of your database.
“There!” you say to the data scientist. “You have access to all the data you need!”
Except… data in that offline database is, by design, not structured in a way that’s easy to analyze or combine with the data in the logs. Values don’t make sense or are missing for weeks at a time, the only events instrumented are those relevant to the ops team, and even the simplest query takes forever to run.
Everybody is frustrated.
The data scientist wanted to work on machine learning, those elegant algorithms they spent their career thinking and publishing about. Yes, they expected to put some effort into gathering and cleaning data, but they didn’t expect it to be so complex and messy, with so much missing or hard to access. They didn’t expect to spend so much time in meetings or Slack with questions about how the data was gathered and what, exactly, is in that JSON blob. They didn’t expect the rest of the company to care so little about how each redesign messes up month-to-month comparisons. They didn’t realize nobody logged all the options a user saw before they clicked, nor the version of the UI at the time, so that one year of training data is not exactly usable.
You hired a data scientist to work on machine learning. They’ve been warned about spending 80% of their time cleaning data. That sounds like a distant dream now. Instead, they spend 80% of their time begging for data to be created, accessed, moved, or explained. They’re spending the other 20% lobbying for data science-friendly tools, security policies, and infrastructure, or looking for a new job.
The engineering team is frustrated, too. They have to take time away from their own work to do a ton of thankless grunt work for the scientist who sits around, does nothing useful and still manages to constantly complain about the data not being good enough.
You, the founder, are even more frustrated. It’s been two months, and the data scientist hasn’t even produced a decent dashboard, much less the magic machine learning everybody keeps raving about. Plus, they seem like a bad culture fit, slowing everybody else down for no reason. Deep down, you start suspecting machine learning is snake oil and AI is just hype.
You keep this to yourself, though, and tell your investors and clients about your AI play and your crazy smart PhD data scientist who’s super excited to work on machine learning any day now.
If you’re wondering how to avoid this painful but all-too-frequent scenario, here’s the Quora answer that inspired this post: “What are the challenges of building a data team at a startup?”