How Data Selection Impacts Model Performance: An AMA with SiaSearch

Written by limarc | Published 2021/04/09
Tech Story Tags: computer-vision | ama | slogging | slack-blogging | artificial-intelligence | machine-learning | data-science | hackernoon-top-story

TLDR SiaSearch is a Berlin-based AI startup on a mission to accelerate computer vision application development. It provides a data management tool for engineers working on self-driving cars and other computer vision applications. We discuss computer vision technology, the future of autonomous vehicles, and the SiaSearch data management platform. We’re eager to hear any questions you might have about: Computer vision model development, data selection & curation, and how data selection accounts for a large share of the actual model performance rather than large amounts of hyperparameter tuning.

SiaSearch is a Berlin-based AI startup on a mission to accelerate computer vision application development.
In this Slack AMA with the SiaSearch team, we discuss computer vision technology, the future of autonomous vehicles, and the SiaSearch data management platform.
This discussion occurred in Slogging's official #amas channel and has been edited for readability.
Note: SiaSearch will be holding a free session at NVIDIA #GTC21 on April 13. For more details, check out: https://www.siasearch.io/blog/nvidia-gtc21
Mark Pfeiffer Apr 8, 2021, 7:02 AM
Hey everyone!

My name is Mark Pfeiffer and I’m the Co-Founder and CTO of SiaSearch (https://www.siasearch.io/), a Berlin-based AI startup that provides a data management tool for engineers working on self-driving cars and other computer vision applications.

Our Head of Product Armaghan Khan and I will be hosting an AMA later today, April 8 at 12pm MT. We’re eager to hear any questions you might have about:
🖼 Computer vision model development
🤖 Autonomous vehicles
💾 Data bottlenecks in machine learning
💾 Data selection & curation

Looking forward to an interactive session!
Limarc Ambalina Apr 8, 2021, 3:16 PM
Hey Mark! Thanks to you and the team for doing this AMA on Slogging. To start us off, why is a separate tool built just for data management purposes necessary?

Also, is this tool targeted just at engineers working on computer vision applications or is that just the largest side of the market?
Mark Pfeiffer Apr 8, 2021, 6:02 PM
First of all, thanks for having us! Really looking forward to this session and we’ll try to cover the questions as fast as possible! 🙂
Mark Pfeiffer Apr 8, 2021, 6:10 PM
That’s a very important question. Of course, there’s a lot of ML tooling around. Many tools, however, focus on model training, versioning, monitoring, deployment or annotation. We’ve seen that in most real-world applications, data selection accounts for a large share of the actual model performance, more so than small model adjustments or large amounts of hyperparameter tuning. Therefore, with SiaSearch we wanted to provide a tool which makes it as easy as possible for users to select the right data for the right applications and build better models in less time.

Currently the tool is quite tailored to the computer vision use case. A lot of other ML applications deal with more structured data which is also challenging but easier to handle and select. For computer vision applications the data selection is particularly hard as the content of images is hard to access and therefore a lot of manual screening is required to select the right data. Currently we see the largest value add of SiaSearch in this application and focus on that with our team!
richard-kubina Apr 8, 2021, 3:58 PM
Hello Mark 👋

One thing I've read about with ML/AI is that it is difficult to trace back why a model came up with its final verdict or weights. It can be hard to debug the layers/network to know why, for example, an image of a speed limit sign with tape over part of it made the model determine the number it saw. I imagine users want to know which parameters to tweak (without too much trial and error) that'll nudge the probabilities in the right direction.

Is this something your software addresses or is this an ongoing challenge in the field?
Mark Pfeiffer Apr 8, 2021, 6:22 PM
Thanks for the question Richard! In general in ML we need to get 2 elements right: (1) The model and (2) the data. With SiaSearch we heavily focus on the latter one. We still focus on model performance though. However, instead of analyzing which model elements to tune we try to make it easy for the user to select the right data. With SiaSearch you can easily figure out under which conditions the model still has problems. With these insights you can then adapt training datasets in order to improve overall model performance.

Also adding an example regarding your question: You might realize that your model has problems detecting traffic lights under sunny conditions while it works well in the dark or rain. This is an interesting insight and tells you that you should probably get more data of sunny intersections with traffic lights annotated. So as a summary, improving model performance is a core element of SiaSearch, but we look at it from an I/O perspective rather than at raw model weights.
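Editor's note: the traffic light example above amounts to slicing evaluation results by a condition tag and comparing per-slice recall. The following is a minimal sketch of that idea, not SiaSearch's implementation. The records, the `weather` tag, and the function names are all hypothetical and only illustrate the reasoning.

```python
from collections import defaultdict

# Hypothetical evaluation records: one entry per annotated traffic light.
# "weather" is the condition tag, "detected" says whether the model found it.
records = [
    {"weather": "sunny", "detected": False},
    {"weather": "sunny", "detected": True},
    {"weather": "sunny", "detected": False},
    {"weather": "sunny", "detected": False},
    {"weather": "rain", "detected": True},
    {"weather": "rain", "detected": True},
    {"weather": "dark", "detected": True},
    {"weather": "dark", "detected": True},
    {"weather": "dark", "detected": False},
    {"weather": "dark", "detected": True},
]


def recall_by_condition(records, key):
    """Group records by a metadata key and compute recall for each group."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        hits[record[key]] += int(record["detected"])
    return {cond: hits[cond] / totals[cond] for cond in totals}


def weakest_condition(recalls):
    """Return the condition with the lowest recall."""
    return min(recalls, key=recalls.get)


recalls = recall_by_condition(records, "weather")
for cond, rec in sorted(recalls.items(), key=lambda kv: kv[1]):
    print(f"{cond}: recall={rec:.2f}")
print("collect more data for:", weakest_condition(recalls))
```

With this toy data the sunny slice has the lowest recall, which is the signal that more sunny intersections should be added to the training set.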
Natasha Nel Apr 8, 2021, 4:20 PM
Hey Mark! Wow cool stuff - thanks for doing this AMA! 👏 I'm curious about how you got to that product market fit for your software. Which came first - the data management tool or the drive to help power the development of autonomous vehicles? Curious to hear a little more about SiaSearch's origin story.
Clemens Viernickel Apr 8, 2021, 6:14 PM
Great question! Mark and I both worked in the domain of self-driving and faced this data management challenge firsthand: Mark while at the self-driving lab at Berkeley and later during his PhD, and I while working on consulting projects with big German automotive companies. The goal was definitely to get these systems to work better, but it turned out to be very complex and manual to work with the raw data, which became a big bottleneck to improving models.

For data-driven development, of which computer vision and self-driving are a subset, there are just very few tools so far that make the work of developers simple and easy. We wanted to change that. We envision a future where building data-driven products is as easy as building software today.

And I believe the industry is currently taking on a similar perspective. During a recent conference, Andrew Ng urged developers and companies to take on a more data-centric approach to ML (https://scale.com/events/transform/videos/big-data-to-good-data?validation=big-data-to-good-data). One of the big challenges in doing so is better tooling, and that is our mission! 🛠 🛠 🛠
radhikaa kapoor Apr 8, 2021, 4:33 PM
Hello Mark! Thanks for the AMA. The autonomous driving tool is a really helpful one, but can you give a little more insight into how exactly it works and what the basic principle behind it is?
Armaghan Khan Apr 8, 2021, 6:18 PM
Thanks for the question, Radhikaa. Applications like autonomous driving, robotics, and aerial imagery produce tons of raw data. (Fun fact: an autonomous vehicle can produce up to 15 TB of data per hour.) Here’s how SiaSearch helps manage this raw data:

1. Intelligent algorithms are applied to extract useful information, e.g. whether the car was making a turn, what the weather was like, and how many people were in view.

2. This information (which we call metadata) is populated into a proprietary database, which allows super fast queries on PB-scale data.

3. To make it super easy for the user, an SDK and a GUI are provided, where they can easily search, select, and visualize data as needed.

You can dive into more depth at https://www.siasearch.io/product and can also experience the product for yourself here: https://public.sia-search.com/
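Editor's note: the three steps described above (extract metadata, store it in a database, query it) can be sketched roughly as follows. This is an illustration under stated assumptions, not the SiaSearch SDK: an in-memory SQLite table stands in for the proprietary database, and the segment fields, the `TURN_THRESHOLD` value, and all function names are hypothetical.

```python
import sqlite3

# Hypothetical raw drive segments. In practice these values would come from
# sensor logs and perception models, not hand-written dictionaries.
raw_segments = [
    {"id": "seg-001", "yaw_rate": 0.30, "weather": "sunny", "num_people": 4},
    {"id": "seg-002", "yaw_rate": 0.01, "weather": "rain", "num_people": 0},
    {"id": "seg-003", "yaw_rate": -0.25, "weather": "sunny", "num_people": 1},
    {"id": "seg-004", "yaw_rate": 0.02, "weather": "sunny", "num_people": 7},
]

TURN_THRESHOLD = 0.1  # rad/s, an assumed cutoff for "the car is turning"


def extract_metadata(segment):
    """Step 1: derive searchable tags from a raw segment."""
    return (
        segment["id"],
        int(abs(segment["yaw_rate"]) > TURN_THRESHOLD),
        segment["weather"],
        segment["num_people"],
    )


def build_index(segments):
    """Step 2: store the tags in a queryable table (SQLite as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadata ("
        "segment_id TEXT PRIMARY KEY, is_turning INTEGER, "
        "weather TEXT, num_people INTEGER)"
    )
    conn.executemany(
        "INSERT INTO metadata VALUES (?, ?, ?, ?)",
        [extract_metadata(s) for s in segments],
    )
    return conn


def search(conn, weather, min_people):
    """Step 3: select segments by tag instead of screening footage by hand."""
    rows = conn.execute(
        "SELECT segment_id FROM metadata "
        "WHERE weather = ? AND num_people >= ? ORDER BY segment_id",
        (weather, min_people),
    )
    return [row[0] for row in rows]


conn = build_index(raw_segments)
print(search(conn, weather="sunny", min_people=2))
```

The point of the design is that the expensive step (looking inside the images) happens once at indexing time, so later selection is an ordinary structured query.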
Katarina Apr 8, 2021, 5:39 PM
Your application within retail is quite a new concept for me, but I find it very interesting.

Would love to know more about how you are able to improve consumer experiences.
Armaghan Khan Apr 8, 2021, 6:25 PM
Hi Katarina! Retail is indeed a super interesting use case. While SiaSearch isn’t directly used in a consumer-facing role, it does empower the emerging self-checkout technology (similar to Amazon Go).

The most popular approach to self-checkout technologies involves the use of multiple cameras. Using the video feeds, the self-checkout software stack recognizes inventory and buyers and can associate the two. Naturally, these algorithms need data to be trained, which is where SiaSearch comes in. Using our product, a developer can easily get a subset of situations, e.g. a buyer fetching a yoghurt pack from the refrigerator. They can use this subset to train the right model and improve its performance more quickly.
Limarc Ambalina Apr 8, 2021, 6:27 PM
Armaghan Khan you said "Intelligent algorithms are applied to extract useful information e.g. whether the car was making a turn, what was the weather like, how many people were in view"

That's super interesting. So in a way, SiaSearch can provide some of the initial annotation itself without the need for human annotators?

If so, I'd see that as a huge value-add. Have you been marketing it as both an automatic annotation platform + data management platform?
Armaghan Khan Apr 8, 2021, 6:34 PM
Great question! Yes, the algorithms can indeed be used to auto-annotate data, but we don’t see this as a replacement for high-quality, low-error human annotations. The auto-tagging, as we call it, is a step before the human annotation which helps to make the job of the annotator faster and simpler.
Armaghan Khan Apr 8, 2021, 6:35 PM
For example, you get 100 hours of video recording from a car and you are interested in left turns. There are two ways to go about it:

1. Without SiaSearch: send all data for human annotation i.e. time and cost intensive

2. With SiaSearch: extract the left turns and only get those portions annotated i.e. faster and cheaper
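Editor's note: to make the left-turn example concrete, here is a minimal sketch of how turn segments might be pulled out of a yaw-rate signal so that only those portions go to annotation. It is an assumption-laden illustration, not how SiaSearch detects turns: the sign convention (positive yaw rate means turning left), the threshold, the minimum length, and the sample values are all hypothetical.

```python
def find_left_turns(yaw_rates, threshold=0.1, min_len=3):
    """Return (start, end) index pairs, end exclusive, of sustained left turns.

    Assumes positive yaw rate means turning left. A run only counts as a
    turn if it stays above the threshold for at least min_len samples.
    """
    segments = []
    start = None
    for i, rate in enumerate(yaw_rates):
        if rate > threshold:
            if start is None:
                start = i  # a candidate turn begins here
        else:
            if start is not None and i - start >= min_len:
                segments.append((start, i))
            start = None
    # Close a turn that is still running when the recording ends.
    if start is not None and len(yaw_rates) - start >= min_len:
        segments.append((start, len(yaw_rates)))
    return segments


# Hypothetical yaw-rate samples in rad/s: a left turn, a right turn,
# a short blip that is too brief to count, and a final left turn.
yaw = [0.0, 0.0, 0.2, 0.3, 0.25, 0.0, -0.3, -0.2, -0.3, 0.0,
       0.15, 0.0, 0.2, 0.2, 0.2, 0.2]

turns = find_left_turns(yaw)
frames_to_annotate = sum(end - start for start, end in turns)
print(turns)
print(f"annotate {frames_to_annotate} of {len(yaw)} samples")
```

On this toy signal two left turns are found, covering 7 of the 16 samples, so less than half of the recording would need to be sent for annotation.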
Clemens Viernickel Apr 8, 2021, 6:36 PM
Great observation though. There are lots of synergies with data annotation, which is why we’ll soon add this to our offering as well!
Limarc Ambalina Apr 8, 2021, 6:39 PM
Ah I get it. So the tool itself provides more of an automated filtering mechanism (which is incredibly useful of course), meaning if you want to annotate stop signs, the tool can return all of the video frames that have a stop sign in it, but we still need a human annotator to actually draw the bounding box around the stop sign. Am I sort of understanding that correctly?
Clemens Viernickel Apr 8, 2021, 6:40 PM
Precisely!
Clemens Viernickel Apr 8, 2021, 6:42 PM
You can think of this as a cycle: (1) train the model, (2) identify model failures, (3) find better data to fix those failures, (4) annotate, then back to training the model. SiaSearch helps with 2 and 3, which we sometimes summarize as training data management.
Limarc Ambalina Apr 8, 2021, 6:44 PM
Sorry I don't mean to hijack this AMA and ask all the questions, but I did content writing for a year and a half in the machine learning/training data space so a lot of this is coming back to me and reigniting my interest. So let's say we do steps 1 - 3, for step 4, does SiaSearch have a built-in data annotation tool or does the engineer need to then import that data into a separate tool for annotating?

If not, is that a feature you're looking to add in the future or have you purposefully stayed away from that feature so as not to compete with the already existing tools?
Clemens Viernickel Apr 8, 2021, 6:47 PM
Haha, great to go deeper there! So far, we just easily connect to many common annotation companies via API. This still makes it easy for the developer to get from the data they collected in SiaSearch to triggering annotation. However, step 4 is definitely a feature we’re looking to add going forward!
Limarc Ambalina Apr 8, 2021, 6:55 PM
So going in a more speculative direction...SiaSearch can automate the filtering of data. We also have some early-stage tools that can automate some data annotation tasks. But as you said before, we still can't beat the low error rate of human annotation.

Since your company has worked to solve the data filtering problem, how long do you think it'll be before we are able to solve the data annotation problem? When do you think we'll have algorithms that can annotate data as well as humans can? Now that the training data industry has become quite huge, with millions around the world contributing to data annotation projects, I imagine the answer to that question could change the entire industry.
Clemens Viernickel Apr 8, 2021, 7:01 PM
That’s kind of the million dollar question 🙂
Mark Pfeiffer Apr 8, 2021, 7:01 PM
That’s an interesting question. Of course it would be ideal to automate the whole process, but if we already had models which can annotate, then the major part of finding such models would be done already, right? So I think there will always be some human labor required. Of course we can use models which have no real-time requirements for annotation, but ultimately a human will be more precise. So we really have to focus on building the right tooling in order to use human labor as efficiently as possible.
Afifa Apr 8, 2021, 7:01 PM
When will autonomous vehicles become reality in India?
Mark Pfeiffer Apr 8, 2021, 7:02 PM
Afifa, isn’t the answer to this question always “Next year”? 😉
Clemens Viernickel Apr 8, 2021, 7:02 PM
I think the past couple of years have taught us to be careful with estimates in that domain, but we’re working hard to make it happen as soon as possible! Let us know if you’re working on a self-driving project in India, we might be able to help!
Afifa Apr 8, 2021, 7:06 PM
Mark Pfeiffer, I'm excited about it. Hope for the best.
Mark Pfeiffer Apr 8, 2021, 7:06 PM
Thanks everyone for all the great questions! If you have any more coming up, don’t hesitate to reach out to us either here or by contacting me at mailto:[email protected]! Also, if you wanna try out SiaSearch, you can sign up for our research version at https://www.siasearch.io/open-data.
Limarc Ambalina Apr 8, 2021, 7:07 PM
Thanks Mark Pfeiffer, Clemens Viernickel, and Armaghan Khan for joining us here today! We wish you the best of luck throughout the rest of 2021.

Written by limarc | HackerNoon's Editorial Ambassador by day, VR Gamer and Anime Binger by night.
Published by HackerNoon on 2021/04/09