What is Data-Centric AI?

Date: 2022-07-09

What makes GPT-3 and DALL·E powerful is exactly the same thing: data.

Data is crucial in our field, and our models are extremely data-hungry. These large models, whether language models like GPT-3 or image models like DALL·E, all require the same thing: way too much data.

The more data you have, the better it is. So you need to scale up those models, especially for real-world applications.

Bigger models can use bigger datasets to improve only if the data is of high quality.

Feeding a model images that do not represent the real world will be of no use and can even worsen its ability to generalize. This is where data-centric AI comes into play...

Learn more in the video:

Video Transcript

what makes gpt-3 and dall-e powerful is

exactly the same thing data data is

crucial in our field and our models are

extremely data hungry these large models

either language models for gpt or image

models for dall-e all require the same

thing

way too much data unfortunately the more

data you have the better it is so you

need to scale up those models especially

for real world applications bigger

models can use bigger datasets to

improve only if the data is of high

quality feeding images that do not

represent the real world will be of no

use and even worsen the model's ability

to generalize this is where data centric

ai comes into play data centric ai also

referred to as software 2.0 is just a

fancy way of saying that we optimize our

data to maximize the model's

performances instead of model-centric

where you will just tweak the model's

parameters on a fixed dataset of course

both need to be done to have the best

results possible but data is by far the

bigger player here in this video in

partnership with snorkel i will cover

what data centric ai is and review some

big advancements in the field you will

quickly understand why data is so

important in machine learning which is

snorkel's mission taking a quote from

their blog post linked below teams will

often spend time writing new models

instead of understanding their problem

and its expression in data more deeply

writing a new model is a beautiful

refuge to hide from the mess of

understanding the real problems and this

is what this video aims to combat in one

sentence the goal of data centric ai is

to encode knowledge from our data into

the model by maximizing the data's

quality and model's performance it all

started in 2016 at stanford with a paper

called data programming creating large

training sets quickly introducing a

paradigm for labeling training data sets

programmatically rather than by hand

this was an eternity ago in terms of ai

research age as you know the best

approaches to date use supervised

learning a process in which models train

on data and labels and learn to

reproduce the labels when given the data

for example you'd feed a model many

images of ducks and cats with their

respective labels and ask the model to

find out what is in the picture then use

back propagation to train the model

based on how well it succeeds if you are

unfamiliar with back propagation i

invite you to pause the video to watch

my one minute explanation and return

where you left off as data sets are

getting bigger and bigger it becomes

increasingly difficult to curate them

and remove hurtful data to allow for the

model to focus on only relevant data you

don't want to train your model to detect

a cat when it's a skunk it could end

badly when i refer to data keep in mind

that it can be any sort of data tabular

images text videos etc now that you can

easily download a model for any task the

shift to data improvement and

optimization is inevitable model

availability the scale of recent data

sets and the data dependencies models

have are why such a paradigm for

labeling training data sets

programmatically becomes essential

now the main problem comes with having

labels for our data it's easy to have

thousands of images of cats and dogs but

it's much harder to know which images

have a dog and which images have a cat

and even harder to have their exact

locations in the image for segmentation

tasks for example

the first paper introduces a data

programming framework where the user

either ml engineer or data scientist

expresses weak supervision strategies as

labeling functions using a generative

model that labels subsets of the data

and found that data programming may be

an easier way for non-experts to create

machine learning models when training

data is limited or unavailable in short

they show how improving data without

much additional work while keeping the

model the same improves results which is

a now evident but essential stepping

stone it's a really interesting

foundation paper in this field and worth

the read
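To make the idea of a labeling function more concrete, here is a minimal sketch in plain Python. It is not code from the paper: the spam example, the function names, and the label constants are all assumptions made for illustration. Each function encodes one weak heuristic and either votes for a label or abstains, so it only labels a subset of the data.

```python
# Hypothetical label constants, chosen for this sketch only.
ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(text):
    # Heuristic: messages containing a URL are often spam.
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text):
    # Heuristic: prize or winner announcements are often spam.
    lowered = text.lower()
    return SPAM if "prize" in lowered or "winner" in lowered else ABSTAIN


def lf_short_reply(text):
    # Heuristic: very short messages are usually genuine replies.
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_reply]


def apply_labeling_functions(texts, lfs=LABELING_FUNCTIONS):
    """Return one row of votes per text, one column per labeling function."""
    return [[lf(text) for lf in lfs] for text in texts]


def coverage(label_matrix):
    """Fraction of rows that received at least one non-abstain vote."""
    if not label_matrix:
        return 0.0
    covered = sum(1 for row in label_matrix if any(v != ABSTAIN for v in row))
    return covered / len(label_matrix)


texts = [
    "You are a winner, claim your prize at http://example.com",
    "thanks, see you tomorrow",
    "Please review the attached quarterly report before our meeting on Friday",
]
votes = apply_labeling_functions(texts)
print(votes)            # one row per text, one column per labeling function
print(coverage(votes))  # share of texts that got at least one vote
```

The third message gets no vote at all, which illustrates why these heuristics count as weak supervision: they are noisy, they can disagree, and they leave part of the data unlabeled.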

the second paper we cover here is called

snorkel rapid training data creation

with weak supervision this paper

published a year later also from

stanford university presents a flexible

interface layer to write labeling

functions based on experience continuing

on the idea that training data is

increasingly large and difficult to

label causing a bottleneck in models

performances they introduce snorkel a

system that implements the previous

paper in an end-to-end system the system

allowed knowledge experts the people

that best understand the data to easily

define labeling functions to

automatically label data instead of

doing hand annotation building models up

to 2.8 times faster while also

increasing predictive performance by an

average of 45.5 percent so again instead

of writing labels the users or knowledge

experts write labeling functions these

functions simply give insights to the

models on patterns to look for or

anything the expert will use to classify

the data helping the model follow the

same process then the system applies the

newly written labeling functions over

our unlabeled data and learns a

generative model to combine the output

labels into probabilistic labels which

are then used to train our final deep

neural network snorkel does all this by

itself facilitating this whole process

for the first time
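The step of combining the outputs of labeling functions into probabilistic labels can be sketched as follows. This is a deliberately simplified stand-in, not Snorkel's method: instead of learning a generative model that estimates how reliable each labeling function is, it gives every function equal weight and normalizes the vote counts. The vote matrix, the -1 abstain convention, and the function name are assumptions made for this example.

```python
ABSTAIN = -1  # hypothetical convention: -1 means the labeling function did not vote


def probabilistic_labels(label_matrix, num_classes=2):
    """Turn each row of votes into a probability distribution over classes.

    Stand-in for a learned generative model: every labeling function gets
    equal weight, and rows with no votes fall back to a uniform distribution.
    """
    result = []
    for row in label_matrix:
        counts = [0] * num_classes
        for vote in row:
            if vote != ABSTAIN:
                counts[vote] += 1
        total = sum(counts)
        if total == 0:
            result.append([1.0 / num_classes] * num_classes)
        else:
            result.append([count / total for count in counts])
    return result


# One row per data point, one column per labeling function (made-up votes).
votes = [
    [1, 1, ABSTAIN],
    [ABSTAIN, 0, 1],
    [ABSTAIN, ABSTAIN, ABSTAIN],
    [0, 0, 1],
]
soft_labels = probabilistic_labels(votes)
for row, probs in zip(votes, soft_labels):
    print(row, "->", probs)
```

Rows where the functions agree get a confident label, while conflicting or empty rows stay close to uniform. A final model could then be trained against these soft labels, for instance with a cross-entropy loss that accepts probability targets.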

our last paper also from stanford

another year later introduces software

2.0 this one page paper is once again

pushing forward with the same deep

learning data centric approach using

labeling functions to produce training

labels for large unlabeled data sets and

train our final model which is

particularly useful for huge internet

scraped data sets like the ones used in

google applications such as google ads

gmail youtube etc tackling the lack of

hand labeled data of course this is just

an overview of the progress and

direction of data centric ai and i

strongly invite you to read the

information in the description below to

have a complete view of data centric ai

where it comes from and where it's

heading i also want to thank snorkel for

sponsoring this video and i invite you

to check out their website for more

information if you haven't heard of

snorkel before you've still already used

their approach in many products like

youtube google ads gmail and other big

applications

thank you for watching the video until

the end
