Mostapha Benhenda

@mostafab

No, Kaggle is unsuitable to study AI & ML. A reply to Ben Hamner

Artificial intelligence and machine learning are driving billions of dollars in revenue across industries. To benefit from these growing opportunities, students are flocking to this field, and many of them wonder how to acquire those precious skills. They can feel a little bit lost and overwhelmed by the abundance of online sources.

In a recent Quora session, Kaggle CTO Ben Hamner outlined his advice on how to study machine learning.

In fact, Ben Hamner mixes good advice with promotional material for Kaggle, which adds to the confusion. The truth is that Kaggle is not so useful for real-world problems, especially if you want to take a bite of this multi-billion dollar market.

In this post, I propose an alternative method of study, more useful and realistic. At the end of the post, I suggest an alternative platform, Startcrowd, to build real-world AI products, instead of statistical models.

Start with online courses

Start with online courses, but move on quickly!

If you start from scratch, with no coding skills or data science experience, I personally recommend the Python course on Codecademy, the Andrew Ng ML course on Coursera, the Intro to Data Science on Udacity, and the Stanford courses on Convolutional Neural Networks and NLP. There are many other good courses, and new ones appear every day, but do not get stuck in this warm-up phase. Jump to practice as soon as you can.

Find a problem you like, and build a quick-and-dirty solution

Pick ideas from various sources

I mostly agree with the first two steps outlined by Ben in his post. It is important to start with a problem you like, in order to keep motivation over time. There are many ways to find inspiration: see how AI can solve your own problems. Read the news, Quora. Skim through academic papers. Look at the work of AI startups on AngelList and F6S. And yes, to some extent, have a look at Kaggle.

Second, build a quick-and-dirty solution. It is always better to avoid re-inventing the wheel, so GitHub and Stack Overflow are your best friends. It is a good exercise to try building upon the work of others, even with a vanishingly small contribution. Grab whatever you can find, and there can be some nice stuff on Kaggle.

Improve your initial solution with customer feedback

I strongly disagree with Ben’s third step. Improving the performance of your initial solution should not be your next step. It is a big mistake.

Instead, it is time to get effective: prepare a demo. Show your solution to potential users. Wrap your model into a web application, a visualization, a video clip, a blog post, whatever. For example, I prepared a simple demo of facial recognition, based on the OpenFace library:

Show a demo to get feedback
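As a rough illustration of wrapping a model into a web application, here is a minimal sketch using only the Python standard library. The `predict` function is a hypothetical stand-in (it just averages its inputs) and is not the OpenFace demo mentioned above; the names `app`, `call_app`, and `serve` and the port number are assumptions made for this example.

```python
import json
from io import BytesIO
from wsgiref.simple_server import make_server


def predict(features):
    """Hypothetical stand-in for a trained model: returns the mean of the inputs."""
    return sum(features) / len(features)


def app(environ, start_response):
    """WSGI app: POST a JSON list of numbers, get a JSON prediction back."""
    try:
        size = int(environ.get("CONTENT_LENGTH") or 0)
        features = json.loads(environ["wsgi.input"].read(size))
        body = json.dumps({"prediction": predict(features)})
        status = "200 OK"
    except (ValueError, TypeError, ZeroDivisionError):
        # Malformed JSON, non-numeric input, or an empty list.
        body = json.dumps({"error": "send a non-empty JSON list of numbers"})
        status = "400 Bad Request"
    payload = body.encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(payload)))])
    return [payload]


def call_app(raw_body):
    """Call the app in-process (no network); returns (status, parsed JSON)."""
    captured = {}
    environ = {"REQUEST_METHOD": "POST",
               "CONTENT_LENGTH": str(len(raw_body)),
               "wsgi.input": BytesIO(raw_body)}
    chunks = app(environ, lambda status, headers: captured.update(status=status))
    return captured["status"], json.loads(b"".join(chunks))


def serve(port=8000):
    """Serve the demo on http://localhost:<port> until interrupted."""
    make_server("", port, app).serve_forever()
```

Calling `serve()` starts a local server that potential users can try from a browser form or a small script. Swapping the body of `predict` for a real model call is the only change needed to turn this into a first demo.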

Communicating with users might require additional skills besides data science: other coding skills (HTML, JavaScript, SQL, devops…), storytelling, or just human social skills. Teaming up with other people can be effective here.

So for improvements, listen to user feedback. You want a customer-centric solution, not a data-centric one. Does the Kaggle leaderboard matter that much? Is it gonna pay your bills?

In the case of my facial recognition demo, feedback was dominated by privacy issues: many people still find facial recognition intrusive and creepy. Few were bothered that OpenFace was only 91% accurate.

Find out why your solution is not adopted: is it customer reach (marketing…)? User experience (design…)? Timing and usefulness (product-market fit…)? Or is it poor model performance?

If performance is really the issue, then you can follow Ben’s third piece of advice: acquire more data, improve data cleaning, or optimize the model like a Kaggle player.

Iterate quickly and build your portfolio of real-world projects

If you hit product-market fit, congratulations, keep going. Otherwise, be persistent. Or try again with another product, or with another market, depending on your mood.

To iterate even faster, follow the ‘sell first, build second’ method: focus first on marketing and sales, throw up a landing page with no product of your own at all (like I did here), and if you attract customer attention, build a prototype. Before following this method, it might be better to acquire prototype-building skills.

If all your initiatives fail and you run out of cash, then you can start interviewing for machine learning positions. You have now built an amazing CV by working with the best employer in the world: you.

At job interviews, recruiters will appreciate your real-world experience and your deep understanding of the AI industry. It will be more impressive than over-engineered Kaggle solutions.

Do not waste your time on Kaggle

Kaggle is crowded. Giving your best shot at a Kaggle competition, against thousands of participants, is a terrible waste, and has a tremendous opportunity cost: there are so many original problems out there, with nobody working on them. Be the first at your own competition. You should force yourself to seek out those opportunities, instead of waiting to be spoon-fed by Kaggle. That’s the best way to take a shot at the billions of dollars in AI that everybody is talking about.

Science would be ruined if (like sports) it were to put competition above everything else. Benoit Mandelbrot

Moreover, the expected earnings of the average Kaggler are less than $2/hour, given the value of sponsors' prizes and the huge number of competitors.
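An estimate of this kind comes down to dividing the prize money by the total hours all competitors put in. The sketch below shows that arithmetic; the function name and the input figures are purely illustrative assumptions, not actual Kaggle data.

```python
def expected_hourly_rate(prize_pool_usd, competitors, hours_per_competitor):
    """Average prize money earned per hour of work, across all competitors."""
    total_hours = competitors * hours_per_competitor
    return prize_pool_usd / total_hours


# Hypothetical competition: $50,000 in prizes, 1,000 competitors, 40 hours each.
rate = expected_hourly_rate(50_000, 1_000, 40)
print(f"${rate:.2f}/hour")
```

Plugging in the prize pool and participation numbers of a competition you are considering gives a quick sanity check before you commit your time.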

Wake up and get out of this cunning exploitative scheme.

Should you join an alternative platform to study ML & AI?

Platforms should be community-owned, to avoid acquisitions by big corporations

Kaggle is definitely not the home for data science: maybe the stadium for data science, or just its sandbox. Data science is homeless: the field is too broad to be confined to a single platform.

However, I still think that studying AI can be facilitated by appropriate platforms. After all, GitHub and Stack Overflow are really helpful platforms, which fill this need to some extent.

In the alternative studying process that I outlined, there are many pain points: it is hard to get market visibility, team up with other people, and so on. In general, incubators and accelerators are supposed to address those pain points. However, they are often insufficient. That’s why I suggest a new platform to build AI products. In my opinion, this platform should be:

  • community-owned, and unacquirable by design.
  • truly collaborative: accumulating contributions should be at the heart of the platform, not a by-product like in Kaggle.
  • incentive-aligned: it should reward quality contributions, and avoid the ‘lemon problem’.

I prepared a very preliminary prototype; check it out here.

Build AI products collaboratively at www.startcrowd.club
