The Carvana Image Masking Challenge hosted on Kaggle has attracted a lot of attention from the Deep Learning community. Currently, more than 600 teams are registered. The task is to build a model that segments a car out of the scene background.
Original and target images
Conceptually, the task seems well defined and simple, especially in comparison with, say, full recognition of a road scene for safe self-driving.
Indeed, 396 teams have already achieved a score above 0.99. All further fighting will be over the third and fourth decimal places of the final score.
In general, the Kaggle community is extremely creative, and very non-trivial solutions are born as a result of tough competition. For instance, take a look at the winning solution of the Taxi Prediction Challenge.
However, when it comes to semantic segmentation problems, non-trivial approaches are difficult to utilize. Ensembles of UNet-like architectures trained at different resolutions prevail in most top-scoring solutions. In situations where very similar approaches compete with each other, chance plays a huge role.
And the following question arises:
“Is there any other way to get a competitive advantage?”
We think the answer is yes, especially if we look at the task from a different perspective: attack the data rather than the model.
In this post we describe how we managed to generate synthetic Carvana images (plus ground truth) that are very similar to the real ones, i.e. the training data provided by the challenge organizers. What's even more important is that our synthetic training set is freely available, and everyone may use it to obtain a higher score in the challenge.
Check out those two cars above. One of them is real, and one is synthetically generated using GTA V. Which is which?
Modern computer games have an interesting connection with Deep Learning. They are engaging and, more importantly, look realistic. Take GTA V, for instance. Rockstar North has, over the years, put a huge amount of effort into making the gameplay as close to reality as possible. So, potentially, one may consider the game an infinite training set with all possible and impossible road scene configurations.
Here we narrow the general approach mentioned above: we use GTA V to obtain car images and segmentation masks under different camera views. The idea is not new; see, for instance, the Playing for Data dataset from 2016.
Unfortunately, there is no straightforward way to do this (GTA V provides no such API), but conceptually it's possible, so a bit of reverse engineering can help us. We will not focus much on the reverse engineering procedure itself (if someone is interested, let us know in the comments); rather, we will describe the process in general (and show a lot of cool pictures).
Some intermediate magic in action
After we’ve successfully injected our DLL in GTA process, we programmable place every vehicle available in GTA into garage. Well, not every — after some filtering we kept only 154 models that make sense for Carvana challenge, because airship does not. Then, we rotate our model per 10° with several different camera angles. Finally, we change car color: we chose black and white.
Okay, now we can take nice screenshots like the one above, but there is no ground truth available. That's bad. Luckily, we can hook into DirectX API calls and manipulate objects in the scene. After a few broken keyboards, we found a way to highlight the car:
As you can see, there are no windows. That's because windows are completely separate objects in GTA V. So we also highlight just the windows:
Now that’s something! We actually got both ground truth mask and a car image. But we also need to extract and place our model on Carvana scene and make the final result as close to reality, as possible. Because of that, we also want to extract a car shadow from GTA:
As you can see, we’ve failed to make the floor exactly white and plain. But don’t worry: Photoshop is here to help us!
What kind of people use Photoshop for machine learning? Well, we do.
Actually, Photoshop has a lot to offer. But most people don’t know it’s possible to use good ol’ JavaScript to automate every action. That’s what we did.
We start with a screenshot from the game:
First, the easy part: we combine the car and window ground truth to obtain the final mask:
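In scripting terms this is just a union of two binary images. Below is a minimal ExtendScript sketch, assuming the two highlight screenshots have already been reduced to near-binary black-and-white images; the file names are hypothetical:

// Union of the car and window highlight masks (file names are hypothetical).
var carDoc = app.open(new File("~/carvana-gta/car_mask.png"));
var winDoc = app.open(new File("~/carvana-gta/windows_mask.png"));

// Copy the window mask into the car mask document as a new top layer.
winDoc.artLayers[0].duplicate(carDoc, ElementPlacement.PLACEATBEGINNING);
winDoc.close(SaveOptions.DONOTSAVECHANGES);
app.activeDocument = carDoc;

// SCREEN keeps a pixel white if it is white in either mask, i.e. the union.
carDoc.activeLayer = carDoc.artLayers[0];
carDoc.activeLayer.blendMode = BlendMode.SCREEN;
carDoc.flatten();

// Threshold forces a clean binary ground truth mask.
carDoc.artLayers[0].threshold(128);
carDoc.saveAs(new File("~/carvana-gta/final_mask.png"), new PNGSaveOptions(), true);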
Now we can cut the car out of the screenshot and place it on the empty stage we made before:
As you can see, the car is too dark. That's because it was shot in a darker place. Luckily, Photoshop has Auto Tone and Auto Color:
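In the scripting DOM, Auto Tone roughly corresponds to autoLevels() on the car layer (Auto Color itself is not exposed in the DOM and typically needs Action Manager calls), so a sketch of this step could look like:

// Brighten the pasted car layer; roughly what Auto Tone does in the UI.
var doc = app.activeDocument;
doc.activeLayer.autoLevels();   // per-channel auto levels
doc.activeLayer.autoContrast(); // monochromatic contrast stretch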
Much better! But the car is floating in the air because there is no shadow. It is possible to generate shadows in Photoshop, but it's hard, because we would need to keep the model rotation angle in mind. So we take the shadow directly from GTA. We load the screenshot with the white (kinda) floor and make some manipulations:
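One simple way to pull the shadow off the almost-white floor is to clip the near-white pixels with a levels adjustment, then composite the result in Multiply mode, so that white becomes neutral and only the dark shadow affects the stage. A sketch under those assumptions (file and document names are hypothetical):

// Extract the shadow from the near-white floor screenshot.
var shadowDoc = app.open(new File("~/carvana-gta/shadow_floor.png"));

// Map everything that is "almost white" (input >= 235) to pure white, keeping the shadow.
shadowDoc.artLayers[0].adjustLevels(0, 235, 1.0, 0, 255);

// Multiply the cleaned shadow into the already-open composited scene:
// white pixels leave the stage untouched, dark pixels darken it.
var sceneDoc = app.documents.getByName("carvana_stage.psd"); // hypothetical name
var shadowLayer = shadowDoc.artLayers[0].duplicate(sceneDoc, ElementPlacement.PLACEATBEGINNING);
shadowLayer.blendMode = BlendMode.MULTIPLY;
shadowDoc.close(SaveOptions.DONOTSAVECHANGES);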
But there are still no windows! Let’s fix that by generating windows using some gradients:
And finally, we enlarge the car to fit the scene:
All those manipulations are done programmatically using Photoshop JS scripting and pre-recorded actions. If you think this would be an interesting topic for a tutorial, please leave your opinion in the comments.
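To give a flavor of how such a pipeline is driven end to end, here is a hedged skeleton of the batch loop; the folder layout and action names are our assumptions for illustration, not the exact scripts we used:

// Hypothetical batch driver: one composited training image per screenshot.
var shots = new Folder("~/carvana-gta/screenshots").getFiles("*.png");
for (var i = 0; i < shots.length; i++) {
    var doc = app.open(shots[i]);

    // Pre-recorded Photoshop actions do the heavy lifting
    // ("CutOutCar" and "PlaceOnStage" are hypothetical action names).
    app.doAction("CutOutCar", "CarvanaGTA");
    app.doAction("PlaceOnStage", "CarvanaGTA");

    // Scale the (assumed active) car layer up to fit the Carvana stage.
    doc.activeLayer.resize(140, 140, AnchorPosition.MIDDLECENTER);

    doc.flatten();
    doc.saveAs(new File("~/carvana-gta/out/" + doc.name), new PNGSaveOptions(), true);
    doc.close(SaveOptions.DONOTSAVECHANGES);
}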
We have made this dataset publicly available on our training data platform, Supervise.ly. Check out this post on Medium if you want to know more about it. Follow these simple steps to get the data:
1. Sign up on Supervise.ly. It's free and takes just a minute.
2. Open the Import → Datasets library and click on the "CarvanaGTA5" dataset. Enter a project name (for example, "Carvana"), then click Next and Upload. After the import task completes, you will see your new dataset on the Projects page.
Datasets library
You can check out the images in the Annotation tool by clicking on the dataset, or look at the statistics.
Annotation tool
Now you can download the dataset to your computer using the Export tool. Export is a powerful feature of Supervise.ly that uses JSON configurations to filter, resize, augment, split into train and validation sets, and combine multiple datasets into one, and then saves your results in formats that popular frameworks can train on directly.
Go to the Export page and paste the following config into the editor:
[{"action": "data","src": ["<Your project name>/*"],"dst": "$sample","settings": {"classes_mapping": "default"}},{"action": "tag","src": ["$sample"],"dst": "$sample2","settings": {"tag": "train"}},{"action": "background","src": ["$sample2"],"dst": "$sample3","settings": {"class": "bg"}},{"action": "segmentation","src": ["$sample3"],"dst": "Carvana","settings": {"gt_machine_color": {"car": [255, 255, 255],"bg": [0, 0, 0]},"tag2part": {"train": "train"},"txt_generation": {"prefix": "."}}}]
Here we define an array of sequential transformations of the data: we tag every image as "train", pass it to the background layer to generate the bg class, and finally use the segmentation layer to produce the ground truth images. You can read more about Export in the documentation.
Now click the Start Exporting button and enter a name (optional).
Supervise.ly will prepare your archive, and after some time a Download button will appear in Tasks:
Done! If you have some time, check out our other tutorials on Supervise.ly, like Number plate detection; the platform has a lot to offer.
We live in an era of unprecedented democratization of Deep Learning technologies: the academic community and businesses openly publish research and frameworks for building neural networks. However, when it comes to training data, the situation is very different. In terms of data availability, industry giants (Google, Facebook, Amazon) have a huge advantage over other companies.
The following chart from Andrew Ng is very illustrative:
Or in words: the quality of intellectual products based on Deep Learning is determined by the amount of available training data.
Increasing the availability of training data is our company's main priority. We approach the problem from two sides:
Please let us know whether our synthetic dataset helps you achieve higher scores.