
An Introduction to Active Learning

by Louis Bouchard (@whatsai)

I explain Artificial Intelligence terms and news to non-experts.

3 min read · 2023/06/18
TL;DR

Active learning aims to optimize the annotation of a dataset and train the best possible model with the least amount of training data. It is a supervised learning approach that iterates between model predictions and data. By annotating fewer images overall, you can save time and money while still ending up with an optimized model.


In today's world, we have access to massive amounts of data, thanks to powerful AI models such as ChatGPT, along with vision models and other similar technologies. However, what these models depend on is not just the quantity of data but also its quality. Building a good dataset quickly and at scale can be a challenging and costly task.


This is where active learning comes in.

Put simply, active learning aims to optimize the annotation of your dataset and train the best possible model with the least amount of training data.


It is a supervised learning approach that iterates between your model's predictions and your data. Instead of waiting for a complete dataset, you can start with a small batch of curated, annotated data and use it to train your model.


Then, using active learning, you can run your model on unseen data, evaluate how accurate its predictions are, and select the next set of data to annotate based on an acquisition function.
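
To make this iteration concrete, here is a minimal sketch in Python (not the author's code). It assumes a scikit-learn-style classifier; `acquisition` and `annotate` are hypothetical callables standing in for the acquisition function and the human labeling step.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_lab, y_lab, X_pool, acquisition, annotate,
                         rounds=5, batch_size=32):
    """Alternate between training on labeled data and annotating new batches."""
    model = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        model.fit(X_lab, y_lab)                    # train on what we have
        scores = acquisition(model, X_pool)        # score the unlabeled pool
        picked = np.argsort(scores)[-batch_size:]  # most informative examples
        y_new = annotate(X_pool[picked])           # human labeling step
        X_lab = np.vstack([X_lab, X_pool[picked]])
        y_lab = np.concatenate([y_lab, y_new])
        X_pool = np.delete(X_pool, picked, axis=0)
    return model
```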


One advantage of active learning is that you can analyze how confident your model is in its predictions.


If a prediction comes with low confidence, the model requests additional labeled images of that type. Predictions with high confidence, on the other hand, need no further data. By annotating fewer images overall, you save time and money while still ending up with an optimized model. Active learning is a very promising approach for working with large-scale datasets.


The performance of active learning. Image from Kumar et al.



There are a few key points to keep in mind about active learning.

First, it involves human annotation, which gives you control over the quality of the model's predictions. The model is no longer a black box trained on millions of images: you actively take part in its development and help improve its performance. This is what makes active learning important and interesting, even though it can add costs compared with unsupervised approaches. However, the time saved in training and deploying the model usually outweighs those costs.


In addition, you can use automatic annotation tools and correct their output manually, reducing expenses further.
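
As a rough sketch of that idea (a hypothetical helper, not any specific tool's API): the model drafts labels, and only the uncertain ones are routed to a human review queue.

```python
def prelabel_with_review(model, X_new, confidence_threshold=0.9):
    """Auto-annotate confident predictions; queue the rest for manual correction."""
    probs = model.predict_proba(X_new)
    draft_labels = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= confidence_threshold
    auto_labeled = [(i, draft_labels[i]) for i in range(len(X_new)) if confident[i]]
    needs_review = [i for i in range(len(X_new)) if not confident[i]]
    return auto_labeled, needs_review  # humans correct only the review queue
```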


In active learning, you have a labeled set of data that your model trains on, while the unlabeled set holds candidate data that has not yet been annotated. A key concept is the query strategy, which decides which data gets labeled. There are several ways to find the most informative subset within a large pool of unlabeled data. For example, uncertainty sampling tests your model on the unlabeled data and selects the least confidently classified examples for annotation.
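
As a sketch of uncertainty sampling under the same assumptions as the loop above, the acquisition function can score each unlabeled example by how unsure the model's top prediction is, or by the entropy of its predicted class distribution:

```python
import numpy as np

def least_confidence(model, X_pool):
    """Higher score = less confident top prediction = more worth annotating."""
    probs = model.predict_proba(X_pool)  # shape: (n_samples, n_classes)
    return 1.0 - probs.max(axis=1)

def predictive_entropy(model, X_pool):
    """Entropy of the predicted distribution; higher = more uncertain."""
    probs = model.predict_proba(X_pool)
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

# Either one plugs into the loop sketched earlier, e.g.:
# model = active_learning_loop(X_lab, y_lab, X_pool, least_confidence, annotate)
```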


Representation of active learning using the query-by-committee method. Image from Kumar et al.



Another technique in active learning is query by committee (QBC), in which several models form a committee, each trained on a different subset of the labeled data. The models end up with different views of the classification problem, much as people with different experiences understand concepts differently. The data to annotate is chosen where the committee's models disagree most, a sign of its difficulty. This iterative process continues as the selected data is annotated.
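
A minimal sketch of QBC, again assuming scikit-learn-style models: each committee member trains on a bootstrap sample of the labeled data, and disagreement is measured with vote entropy, so the most disputed examples get annotated first.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_committee(X_lab, y_lab, n_members=5, seed=0):
    """Train each member on a different bootstrap sample of the labeled data."""
    rng = np.random.default_rng(seed)
    committee = []
    for _ in range(n_members):
        idx = rng.choice(len(X_lab), size=len(X_lab), replace=True)
        committee.append(DecisionTreeClassifier().fit(X_lab[idx], y_lab[idx]))
    return committee

def vote_entropy(committee, X_pool):
    """Disagreement score: entropy of the committee's vote distribution."""
    votes = np.stack([m.predict(X_pool) for m in committee])  # (members, samples)
    scores = np.empty(votes.shape[1])
    for j in range(votes.shape[1]):
        _, counts = np.unique(votes[:, j], return_counts=True)
        p = counts / counts.sum()
        scores[j] = -(p * np.log(p)).sum()
    return scores  # annotate the highest-entropy (most disputed) examples first
```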


This is only a basic explanation of active learning, showing a single example of a query strategy.

If you're interested, I can share more information or videos about other machine learning strategies. A real-world example of active learning is answering captchas on Google: by doing so, you help identify complex images and build datasets from the collective input of many users, which ensures dataset quality and human verification. So the next time you encounter a captcha, remember that you're contributing to the progress of AI models!


To learn more and see a practical example built with a great tool from my friends at Encord, watch the video:


Video transcript:

In today's world, we have access to large amounts of data thanks to the superpowers of large models, including the famous ChatGPT, but also vision models and all the other types you may be working with. Indeed, the secret behind those models is not only the large amount of data they are trained on, but also the quality of that data. What does this mean? It means we need lots of very well-balanced and varied data, and as data scientists, we all know how complicated and painful it can be to build such a good dataset fast, at large scale, and maybe with a limited budget. What if we could have help building it, or even automated help? Well, that is where active learning comes in.

In one sentence, the goal of active learning is to use the least amount of training data to optimize the annotation of your whole dataset and train the best possible model. It's a supervised learning approach that goes back and forth between your model's predictions and your data. What I mean here is that you may start with a small batch of curated, annotated data and train your model with it. You don't have to wait for your whole dataset of millions of images to be ready; just push it out there. Then, using active learning, you can run your model on your unseen data and get human annotators to label it. But that is not all: we can also evaluate how accurate the predictions are and, using a variety of acquisition functions (functions used to select the next unseen data to annotate), quantify the impact of labeling a larger volume of data, or of improving the accuracy of the generated labels, on the model's performance. Thanks to how you train the models, you can analyze the confidence they have in their predictions. Predictions with low confidence will automatically request additional images of this type to be labeled, and predictions with high confidence won't need additional data. So you will basically save a lot of time and money by annotating fewer images in the end, and have the most optimized model possible. How cool is that? Active learning is one of the most promising approaches to working with large-scale datasets.

There are a few important key notions to remember with active learning. The most important is that it uses humans, which you can clearly see in the middle of this great presentation of active learning. It will still require humans to annotate data, which has the plus side of giving you full control over the quality of your model's predictions. It's not a complete black box trained with millions of images anymore; you iteratively follow its development and help it get better when it fails. Of course, it does have the downside of increased costs versus unsupervised approaches, where you don't need anyone, but it allows you to limit those costs by only training where the model needs it, instead of feeding it as much data as possible and hoping for the best. Moreover, the reduction in the time taken to train the model and put it into production often outweighs these costs, and you can use some automatic annotation tools and manually correct the results afterward, again reducing the costs.

Then, obviously, you will have your labeled dataset. The labeled set of data is what your current model is being trained on, and the unlabeled set is the data you could eventually use but that hasn't been annotated yet. Another key notion is actually the answer to the most important question you may already have in mind: how do you find the bad data to annotate and add to the training set? The solution here is called query strategies, and they are essential to any active learning algorithm, deciding which data to label and which not to. There are multiple possible approaches to finding the most informative subsets in our large pool of unlabeled data, the ones that will most help our model by being annotated, like uncertainty sampling, where you test your current model on your unlabeled data and draw the least confidently classified examples to annotate.

Another technique shown here is the query-by-committee, or QBC, approach. Here we have multiple models, our committee models. They will all be trained on a different subset of our labeled data and thus have a different understanding of our problem. These models will each have a hypothesis on the classification of our unlabeled data that should be somewhat similar but still different, because they basically see the world differently, just like us, who have had different life experiences and seen different animals in our lives but still share the same concepts of a cat and a dog. Then it's easy: the data to be annotated is simply the data our models most disagree on, which means it is complicated to understand, and we start over by feeding the selected data to our experts for annotation.

This is of course a basic explanation of active learning with only one example of a query strategy. Let me know if you'd like more videos on other machine learning strategies like this one. A clear example of the active learning process is when you answer captchas on Google. It helps them identify complex images and build datasets using you and many other people as a committee jury for annotation, building cheap and great datasets while verifying you are a human, serving two purposes. So next time you are annoyed by a captcha, just think that you are helping AI models progress.

But we have enough theory for now. I thought it would be great to partner with some friends from Encord, a great company I have known for a while now, to showcase a real example of active learning. Since we are on this topic: it's for sure the best platform I have seen yet for active learning, and the team is amazing. Before diving into a short practical example, I just wanted to mention that I will be at CVPR in person this year, and so will Encord. If you are attending in person too, let me know, and go check out their booth; it's booth 1310.

Here's a quick demo we put together exploring one of Encord's products that perfectly fits this episode: Encord Active. It is basically an active learning platform where you can perform everything we talked about in this video without any coding, with a great visual interface. Here's what you would see in a classic visual task like segmentation. Once you open up your project, you directly have relevant information and statistics about your data. You'll see all the outlier characteristics of your data, which will help you figure out what causes the issues in your task. For example, here we see that blur is one of those outliers that has been automatically identified. If we check out the worst images for that category, we can easily find some problematic images and tag them for review, like here, where the image is super saturated. You can also visualize groups of data thanks to their embeddings, just like the CLIP embeddings you might have heard a lot about these days, and those embeddings can easily be compared and grouped when similar, helping you find problematic groups all at once instead of going through your data one by one.

Then, once you are satisfied with the images you have identified for review, you can simply export them to the Encord platform, where you can do your annotation directly. When you have your annotations and get back to the Encord Active platform, you can visualize what the data looks like with labels. You can see how the embedding plots have changed, now with the different classes attached. Here again, you can look at different subgroups of data to find problematic ones. For example, you can look at images containing school buses. This can be done using natural language to search for any information in images, metadata, or classes, something quite necessary these days if you want to say that you are working in AI.

When you cannot easily find any more problems with your data, you train your model and come back to the platform to analyze its performance. Once again, you have access to a ton of valuable information about how well your model is performing. For example, if we take a look at the object area and see that small objects seem problematic, we can easily filter them out and create a new sub-dataset using only our problematic small-object images. The project is created in your Encord Active dashboard with all the same statistics you had, but for only this set of data, if you want to have a closer look or run experiments with this more complicated part of the data, like using it for training one of your committee models. And you repeat this loop over and over, annotating problematic data and improving your model as efficiently as possible. It will both reduce the need to pay expert annotators, especially if you work with medical applications as I do, or other applications where experts are quite expensive, and maximize the results of your model.

I hope you can now see how valuable active learning can be, and maybe you'll even try it out with your own application; it can all be done with a single product. If you do, let me know! But before ending this video, I just wanted to thank Encord for sponsoring this week's episode with a great example of active learning and an amazing product. I also wanted to point out that they had a webinar on June 14th on how to build a semantic search for visual data using ChatGPT and CLIP, hosted on Encord Active, with a recording available if you want to check it out; it's definitely worthwhile and super interesting. I hope you enjoyed this episode format as much as I enjoyed making it. Thank you for watching!

