The HBO show Silicon Valley released a real AI app that identifies hotdogs — and not hotdogs — like the one shown on season 4’s 4th episode (the app is now available on Android as well as iOS!)
body[data-twttr-rendered="true"] {background-color: transparent;}.twitter-tweet {margin: auto !important;}
Larry King loves Not Hotdog 😊 https://t.co/sbDiw3ZHyT
function notifyResize(height) {height = height ? height : document.documentElement.offsetHeight; var resized = false; if (window.donkey && donkey.resize) {donkey.resize(height); resized = true;}if (parent && parent._resizeIframe) {var obj = {iframe: window.frameElement, height: height}; parent._resizeIframe(obj); resized = true;}if (window.webkit && window.webkit.messageHandlers && window.webkit.messageHandlers.resize) {window.webkit.messageHandlers.resize.postMessage(height); resized = true;}return resized;}twttr.events.bind('rendered', function (event) {notifyResize();}); twttr.events.bind('resize', function (event) {notifyResize();});if (parent && parent._resizeIframe) {var maxWidth = parseInt(window.frameElement.getAttribute("width")); if ( 500 < maxWidth) {window.frameElement.setAttribute("width", "500");}}
To achieve this, we designed a bespoke neural architecture that runs directly on your phone, and trained it with Tensorflow, Keras & Nvidia GPUs.
While the use-case is farcical, the app is an approachable example of both deep learning, and edge computing. All AI work is powered 100% by the user’s device, and images are processed without ever leaving their phone. This provides users with a snappier experience (no round trip to the cloud), offline availability, and better privacy. This also allows us to run the app at a cost of $0, even under the load of a million users, providing significant savings compared to traditional cloud-based AI approaches.
The author’s development setup with the attached eGPU used to train Not Hotdog’s AI.
The app was developed in-house by the show, by a single developer, running on a single laptop & attached GPU, using hand-curated data. In that respect, it may provide a sense of what can be achieved today, with a limited amount of time & resources, by non-technical companies, individual developers, and hobbyists alike. In that spirit, this article attempts to give a detailed overview of steps involved to help others build their own apps.
body[data-twttr-rendered="true"] {background-color: transparent;}.twitter-tweet {margin: auto !important;}
The hotdog/not hotdog is to machine learning what the Utah teapot is to CGI. #WWDC17 #WWDC
function notifyResize(height) {height = height ? height : document.documentElement.offsetHeight; var resized = false; if (window.donkey && donkey.resize) {donkey.resize(height); resized = true;}if (parent && parent._resizeIframe) {var obj = {iframe: window.frameElement, height: height}; parent._resizeIframe(obj); resized = true;}if (window.webkit && window.webkit.messageHandlers && window.webkit.messageHandlers.resize) {window.webkit.messageHandlers.resize.postMessage(height); resized = true;}return resized;}twttr.events.bind('rendered', function (event) {notifyResize();}); twttr.events.bind('resize', function (event) {notifyResize();});if (parent && parent._resizeIframe) {var maxWidth = parseInt(window.frameElement.getAttribute("width")); if ( 500 < maxWidth) {window.frameElement.setAttribute("width", "500");}}
From Prototype to Production ☛V0: Prototype ☞V1: Tensorflow, Inception & Transfer Learning ☞V2: Keras & SqueezeNet ☞
The DeepDog Architecture ☛Training ☞Running Neural Networks on Mobile Phones ☞Changing App Behavior by Injecting Neural Networks on The fly ☞What We Would Do Differently ☞
If you haven’t seen the show or tried the app (you should!), the app lets you snap a picture and then tells you whether it thinks that image is of a hotdog or not. It’s a straightforward use-case, that pays homage to recent AI research and applications, in particular ImageNet.
While we’ve probably dedicated more engineering resources to recognizing hotdogs than anyone else, the app still fails in horrible and/or subtle ways.
body[data-twttr-rendered="true"] {background-color: transparent;}.twitter-tweet {margin: auto !important;}
If there's ketchup, it's a hotdog @FunnyAsianDude #nothotdog #NotHotdogchallenge
function notifyResize(height) {height = height ? height : document.documentElement.offsetHeight; var resized = false; if (window.donkey && donkey.resize) {donkey.resize(height); resized = true;}if (parent && parent._resizeIframe) {var obj = {iframe: window.frameElement, height: height}; parent._resizeIframe(obj); resized = true;}if (window.webkit && window.webkit.messageHandlers && window.webkit.messageHandlers.resize) {window.webkit.messageHandlers.resize.postMessage(height); resized = true;}return resized;}twttr.events.bind('rendered', function (event) {notifyResize();}); twttr.events.bind('resize', function (event) {notifyResize();});if (parent && parent._resizeIframe) {var maxWidth = parseInt(window.frameElement.getAttribute("width")); if ( 500 < maxWidth) {window.frameElement.setAttribute("width", "500");}}
Conversely, it’s also sometimes able to recognize hotdogs in complex situations… According to Engadget, “It’s incredible. I’ve had more success identifying food with the app in 20 minutes than I have had tagging and identifying songs with Shazam in the past two years.”
body[data-twttr-rendered="true"] {background-color: transparent;}.twitter-tweet {margin: auto !important;}
Can confirm that the 'Not Hotdog' app is not easily fooled. #SiliconValleyHBO @SiliconHBO
function notifyResize(height) {height = height ? height : document.documentElement.offsetHeight; var resized = false; if (window.donkey && donkey.resize) {donkey.resize(height); resized = true;}if (parent && parent._resizeIframe) {var obj = {iframe: window.frameElement, height: height}; parent._resizeIframe(obj); resized = true;}if (window.webkit && window.webkit.messageHandlers && window.webkit.messageHandlers.resize) {window.webkit.messageHandlers.resize.postMessage(height); resized = true;}return resized;}twttr.events.bind('rendered', function (event) {notifyResize();}); twttr.events.bind('resize', function (event) {notifyResize();});if (parent && parent._resizeIframe) {var maxWidth = parseInt(window.frameElement.getAttribute("width")); if ( 500 < maxWidth) {window.frameElement.setAttribute("width", "500");}}
Have you ever found yourself reading Hacker News, thinking “they raised a 10M series A for that? I could build it in one weekend!” This app probably feels a lot like that, and the initial prototype was indeed built in a single weekend using Google Cloud Platform’s Vision API, and React Native. But the final app we ended up releasing on the app store required months of additional (part-time) work, to deliver meaningful improvements that would be difficult for an outsider to appreciate. We spent weeks optimizing overall accuracy, training time, inference time, iterating on our setup & tooling so we could have a faster development iterations, and spent a whole weekend optimizing the user experience around iOS & Android permissions (don’t even get me started on that one).
All too often technical blog posts or academic papers skip over this part, preferring to present the final chosen solution. In the interest of helping others learn from our mistake & choices, we will present an abridged view of the approaches that didn’t work for us, before we describe the final architecture we ended up shipping in the next section.
Example image & corresponding API output from Google Cloud Vision’s documentation
We chose React Native to build the prototype as it would give us an easy sandbox to experiment with, and would help us quickly support many devices. The experience ended up being a good one and we kept React Native for the remainder of the project: it didn’t always make things easy, and the design for the app was purposefully limited, but in the end React Native got the job done.
The other main component we used for the prototype — Google Cloud’s Vision API was quickly abandoned. There were 3 main factors:
For these reasons, we started experimenting with what’s trendily called “edge computing”, which for our purposes meant that after training our neural network on our laptop, we would move export it and embed it directly into our mobile app, so that the neural network execution phase (or inference) would run directly inside the user’s phone.
Through a chance encounter with Pete Warden of the TensorFlow team, we had become aware of its ability to run TensorFlow directly embedded on an iOS device, and started exploring that path. After React Native, TensorFlow became the second fixed part of our stack.
It only took a day of work to integrate TensorFlow’s Objective-C++ camera example in our React Native shell. It took slightly longer to use their transfer learning script, which helps you retrain the Inception architecture to deal with a more specific image problem. Inception is the name of a family of neural architectures built by Google to deal with image recognition problems. Inception is available “pre-trained” which means the training phase has been completed and the weights are set. Most often for image recognition networks, they have been trained on ImageNet, a yearly competition to find the best neural architecture at recognizing over 20,000 different types of objects (hotdogs are one of them). However, much like Google Cloud’s Vision API, the competition rewards breadth as much as depth here, and out-of-the-box accuracy on a single one of the 20,000+ categories can be lacking. As such, retraining (also called “transfer learning”) aims to take a full-trained neural net, and retrain it to perform better on the specific problem you’d like to handle. This usually involves some degree of “forgetting”, either by excising entire layers from the stack, or by slowly erasing the network’s ability to distinguish a type of object (e.g. chairs) in favor of better accuracy at recognizing the one you care about (i.e. hotdogs).
While the network (Inception in this case) may have been trained on the 14M images contained in ImageNet, we were able to retrain it on a just a few thousand hotdog images to get drastically enhanced hotdog recognition.
The big advantage of transfer learning are you will get better results much faster, and with less data than if you train from scratch. A full training might take months on multiple GPUs and require millions of images, while retraining can conceivably be done in hours on a laptop with a couple thousand images.
One of the biggest challenges we encountered was understanding exactly what should count as a hotdog and what should not. Defining what a “hotdog” is ends up being surprisingly difficult (do cut up sausages count, and if so, which kinds?) and subject to cultural interpretation.
Similarly, the “open world” nature of our problem meant we had to deal with an almost infinite number of inputs. While certain computer-vision problems have relatively limited inputs (say, x-rays of bolts with or without a mechanical default), we had to prepare the app to be fed selfies, nature shots and any number of foods.
Suffice to say, this approach was promising, and did lead to some improved results, however, it had to be abandoned for a couple of reasons.
First The nature of our problem meant a strong imbalance in training data: there are many more examples of things that are not hotdogs, than things that are hotdogs. In practice this means that if you train your algorithm on 3 hotdog images and 97 non-hotdog images, and it recognizes 0% of the former but 100% of the latter, it will still score 97% accuracy by default! This was not straightforward to solve out of the box using TensorFlow’s retrain tool, and basically necessitated setting up a deep learning model from scratch, import weights, and train in a more controlled manner.
At this point we decided to bite the bullet and get something started with Keras, a deep learning library that provides nicer, easier-to-use abstractions on top of TensorFlow, including pretty awesome training tools, and a class_weights option which is ideal to deal with this sort of dataset imbalance we were dealing with.
We used that opportunity to try other popular neural architectures like VGG, but one problem remained. None of them could comfortably fit on an iPhone. They consumed too much memory, which led to app crashes, and would sometime takes up to 10 seconds to compute, which was not ideal from a UX standpoint. Many things were attempted to mitigate that, but in the end it these architectures were just too big to run efficiently on mobile.
SqueezeNet vs. AlexNet, the grand-daddy of computer vision architectures. Source: SqueezeNet paper.
To give you a context out of time, this was roughly the mid-way point of the project. By that time, the UI was 90%+ done and very little of it was going to change. But in hindsight, the neural net was at best 20% done. We had a good sense of challenges & a good dataset, but 0 lines of the final neural architecture had been written, none of our neural code could reliably run on mobile, and even our accuracy was going to improve drastically in the weeks to come.
The problem directly ahead of us was simple: if Inception and VGG were too big, was there a simpler, pre-trained neural network we could retrain? At the suggestion of the always excellent Jeremy P. Howard (where has that guy been all our life?), we explored Xception, Enet and SqueezeNet. We quickly settled on SqueezeNet due to its explicit positioning as a solution for embedded deep learning, and the availability of a pre-trained Keras model on GitHub (yay open-source).
So how big of a difference does this make? An architecture like VGG uses about 138 million parameters (essentially the number of numbers necessary to model the neurons and values between them). Inception is already a massive improvement, requiring only 23 million parameters. SqueezeNet, in comparison only requires 1.25 million.
This has two advantages:
There are tradeoffs of course:
A smaller neural architecture has less available “memory”: it will not be as efficient at handling complex cases (such as recognizing 20,000 different objects), or even handling complex subcases (like say, appreciating the difference between a New York-style hotdog and a Chicago-style hotdog)As a corollary, smaller networks are usually less accurate overall than big ones. When trying to recognize ImageNet’s 20,000 different objects, SqueezeNet will only score around 58%, whereas Vgg will be accurate 72% of the time.
Data Augmentation example from the Keras Blog.
During this phase, we started experimenting with tuning the neural network architecture. In particular, we started using Batch Normalization and trying different activation functions.
After adding Batch Normalization and ELU to SqueezeNet, we were able to train neural network that achieve 90%+ accuracy when training from scratch, however, they were relatively brittle meaning the same network would overfit in some cases, or underfit in others when confronted to real-life testing. Even adding more examples to the dataset and playing with data augmentation failed to deliver a network that met expectations.
So while this phase was promising, and for the first time gave us a functioning app that could work entirely on an iPhone, in less than a second, we eventually moved to our 4th & final architecture.
<a href="https://medium.com/media/c1e0cd4f6cae1d304ad614d5c6f50c17/href">https://medium.com/media/c1e0cd4f6cae1d304ad614d5c6f50c17/href</a>
Our final architecture was spurred in large part by the publication on April 17 of Google’s MobileNets paper, promising a new neural architecture with Inception-like accuracy on simple problems like ours, with only 4M or so parameters. This meant it sat in an interesting sweet spot between a SqueezeNet that had maybe been overly simplistic for our purposes, and the possibly overwrought elephant-trying-to-squeeze-in-a-tutu of using Inception or VGG on Mobile. The paper introduced some capacity to tune the size & complexity of network specifically to trade memory/CPU consumption against accuracy, which was very much top of mind for us at the time.
With less than a month to go before the app had to launch we endeavored to reproduce the paper’s results. This was entirely anticlimactic as within a day of the paper being published a Keras implementation was already offered publicly on GitHub by Refik Can Malli, a student at Istanbul Technical University, whose work we had already benefitted from when we took inspiration from his excellent Keras SqueezeNet implementation. The depth & openness of the deep learning community, and the presence of talented minds like R.C. is what makes deep learning viable for applications today — but they also make working in this field more thrilling than any tech trend we’ve been involved with.
Our final architecture ended up making significant departures from the MobileNets architecture or from convention, in particular:
So how does this stack work exactly? Deep Learning often gets a bad rap for being a “black box”, and while it’s true many components of it can be mysterious, the networks we use often leak information about how some of their magic work. We can look at the layers of this stack and how they activate on specific input images, giving us a sense of each layer’s ability to recognize sausage, buns, or other particularly salient hotdog features.
Data quality was of the utmost importance. A neural network can only be as good as the data that trained it, and improving training set quality was probably one of the top 3 things we spent time on during this project. The key things we did to improve this were:
Example distortion introduced by moiré and a flash. Original photo: Wikimedia Commons.
The final composition of our dataset was 150k images, of which only 3k were hotdogs: there are only so many hotdogs you can look at, but there are many not hotdogs to look at. The 49:1 imbalance was dealt with by saying a Keras class weight of 49:1 in favor of hotdogs. Of the remaining 147k images, most were of food, with just 3k photos of non-food items, to help the network generalize a bit more and not get tricked into seeing a hotdog if presented with an image of a human in a red outfit.
Our data augmentation rules were as follows:
These numbers were derived intuitively, based on experiments and our understanding of the real-life usage of our app, as opposed to careful experimentation.
The final key to our data pipeline was using Patrick Rodriguez’s multiprocess image data generator for Keras. While Keras does have a built-in multi-threaded and multiprocess implementation, we found Patrick’s library to be consistently faster in our experiments, for reasons we did not have time to investigate. This library cut our training time to a third of what it used to be.
The network was trained using a 2015 MacBook Pro and attached external GPU (eGPU), specifically an Nvidia GTX 980 Ti (we’d probably buy a 1080 Ti if we were starting today). We were able to train the network on batches of 128 images at a time. The network was trained for a total of 240 epochs, meaning we ran all 150k images through the network 240 times. This took about 80 hours.
We trained the network in 3 phases:
UPDATED: a previous version of this chart contained inaccurate learning rates.
While learning rates were identified by running the linear experiment recommended by the CLR paper, they seem to intuitively make sense, in that the max for each phase is within a factor of 2 of the previous minimum, which is aligned with the industry standard recommendation of halving your learning rate if your accuracy plateaus during training.
In the interest of time we performed some training runs on a Paperspace P5000 instance running Ubuntu. In those cases, we were able to double the batch size, and found that optimal learning rates for each phase were roughly double as well.
Even having designed a relatively compact neural architecture, and having trained it to handle situations it may find in a mobile context, we had a lot of work left to make it run properly. Trying to run a top-of-the-line neural net architecture out of the box can quickly burns hundreds megabytes of RAM, which few mobile devices can spare today. Beyond network optimizations, it turns out the way you handle images or even load TensorFlow itself can have a huge impact on how quickly your network runs, how little RAM it uses, and how crash-free the experience will be for your users.
This was maybe the most mysterious part of this project. Relatively little information can be found about it, possibly due to the dearth of production deep learning applications running on mobile devices as of today. However, we must commend the Tensorflow team, and particularly Pete Warden, Andrew Harp and Chad Whipkey for the existing documentation and their kindness in answering our inquiries.
Instead of using TensorFlow on iOS, we looked at using Apple’s built-in deep learning libraries instead (BNNS, MPSCNN and later on, CoreML). We would have designed the network in Keras, trained it with TensorFlow, exported all the weight values, re-implemented the network with BNNS or MPSCNN (or imported it via CoreML), and loaded the parameters into that new implementation. However, the biggest obstacle was that these new Apple libraries are only available on iOS 10+, and we wanted to support older versions of iOS. As iOS 10+ adoption and these frameworks continue to improve, there may not be a case for using TensorFlow on device in the near future.
If you think injecting JavaScript into your app on the fly is cool, try injecting neural nets into your app! The last production trick we used was to leverage CodePush and Apple’s relatively permissive terms of service, to live-inject new versions of our neural networks after submission to the app store. While this was mostly done to help us quickly deliver accuracy improvements to our users after release, you could conceivably use this approach to drastically expand or alter the feature set of your app without going through an app store review again.
<a href="https://medium.com/media/215a0e4e1b6c6aafdda637d5abc80e7b/href">https://medium.com/media/215a0e4e1b6c6aafdda637d5abc80e7b/href</a>
There are a lot of things that didn’t work or we didn’t have time to do, and these are the ideas we’d investigate in the future:
Finally, we’d be remiss not to mention the obvious and important influence of User Experience, Developer Experience and built-in biases in developing an AI app. Each probably deserve their own post (or their own book) but here are the very concrete impacts of these 3 things in our experience.
UX (User Experience) is arguably more critical at every stage of the development of an AI app than for a traditional application**.** There are no Deep Learning algorithms that will give you perfect results right now, but there are many situations where the right mix of Deep Learning + UX will lead to results that are indistinguishable from perfect. Proper UX expectations are irreplaceable when it comes to setting developers on the right path to design their neural networks, setting the proper expectations for users when they use the app, and gracefully handling the inevitable AI failures. Building AI apps without a UX-first mindset is like training a neural net without Stochastic Gradient Descent: you will end up stuck in the local minima of the Uncanny Valley on your way to building the perfect AI use-case.
Source: New Scientist.
DX (Developer Experience) is extremely important as well, because deep learning training time is the new horsing around while waiting for your program to compile. We suggest you heavily favor DX first (hence Keras), as it’s always possible to optimize runtime for later runs (manual GPU parallelization, multi-process data augmentation, TensorFlow pipeline, even re-implementing for caffe2 / pyTorch).
body[data-twttr-rendered="true"] {background-color: transparent;}.twitter-tweet {margin: auto !important;}
Made a few changes to modernize .@xkcdComic's #1 reason for programmers to slack off :)
function notifyResize(height) {height = height ? height : document.documentElement.offsetHeight; var resized = false; if (window.donkey && donkey.resize) {donkey.resize(height); resized = true;}if (parent && parent._resizeIframe) {var obj = {iframe: window.frameElement, height: height}; parent._resizeIframe(obj); resized = true;}if (window.webkit && window.webkit.messageHandlers && window.webkit.messageHandlers.resize) {window.webkit.messageHandlers.resize.postMessage(height); resized = true;}return resized;}twttr.events.bind('rendered', function (event) {notifyResize();}); twttr.events.bind('resize', function (event) {notifyResize();});if (parent && parent._resizeIframe) {var maxWidth = parseInt(window.frameElement.getAttribute("width")); if ( 500 < maxWidth) {window.frameElement.setAttribute("width", "500");}}
Even projects with relatively obtuse APIs & documentation like TensorFlow greatly improve DX by providing a highly-tested, highly-used, well-maintained environment for training & running neural networks.
For the same reason, it’s hard to beat both the cost as well as the flexibility of having your own local GPU for development. Being able to look at / edit images locally, edit code with your preferred tool without delays greatly improves the development quality & speed of building AI projects.
Most AI apps will hit more critical cultural biases than ours, but as an example, even our straightforward use-case, caught us flat-footed with built-in biases in our initial dataset, that made the app unable to recognize French-style hotdogs, Asian hotdogs, and more oddities we did not have immediate personal experience with. It’s critical to remember that AI do not make “better” decisions than humans — they are infected by the same human biases we fall prey to, via the training sets humans provide.
Please do get in touch if you have any questions or comments: [email protected]
Thanks to: Mike Judge, Alec Berg, Clay Tarver, Todd Silverstein, Jonathan Dotan, Lisa Schomas, Amy Solomon, Dorothy Street & Rich Toyon, and all the writers of the show — the app would simply not exist without them.Meaghan, Dana, David, Jay, and everyone at HBO. Scale Venture Partners & GitLab. Rachel Thomas and Jeremy Howard & Fast AI for all that they have taught me, and for kindly reviewing a draft of this post. Check out their free online Deep Learning course, it’s awesome! JP Simard for his help on iOS. And finally, the TensorFlow team & r/MachineLearning for their help & inspiration.
… And thanks to everyone who used & shared the app! It made staring at pictures of hotdogs for months on end totally worth it 😅
body[data-twttr-rendered="true"] {background-color: transparent;}.twitter-tweet {margin: auto !important;}
Best costume today at #baytobreakers! @SiliconHBO
function notifyResize(height) {height = height ? height : document.documentElement.offsetHeight; var resized = false; if (window.donkey && donkey.resize) {donkey.resize(height); resized = true;}if (parent && parent._resizeIframe) {var obj = {iframe: window.frameElement, height: height}; parent._resizeIframe(obj); resized = true;}if (window.webkit && window.webkit.messageHandlers && window.webkit.messageHandlers.resize) {window.webkit.messageHandlers.resize.postMessage(height); resized = true;}return resized;}twttr.events.bind('rendered', function (event) {notifyResize();}); twttr.events.bind('resize', function (event) {notifyResize();});if (parent && parent._resizeIframe) {var maxWidth = parseInt(window.frameElement.getAttribute("width")); if ( 500 < maxWidth) {window.frameElement.setAttribute("width", "500");}}