An Overview of Active Learning

Louis Bouchard (@whatsai)

I explain Artificial Intelligence terms and news to non-experts.

3 min read · 2023/06/18

Too Long; Didn't Read

Active learning aims to optimize dataset annotation and to train the best possible model with the least amount of training data. It is a supervised learning approach that involves an iterative process between the model's predictions and the data. By reducing the number of images that need to be annotated overall, you save time and money while still ending up with an optimized model.

In today's world, we have access to vast amounts of data thanks to powerful AI models like ChatGPT, vision models, and other similar technologies. However, what matters is not only the quantity of data these models rely on, but also its quality. Building a good dataset quickly and at scale can be a difficult and costly task.


That's where active learning comes in.

In one sentence, the goal of active learning is to optimize the annotation of your dataset and train the best possible model with the least amount of training data.


It's a supervised learning approach that involves an iterative process between your model's predictions and your data. Instead of waiting for a complete dataset, you can start with a small batch of curated, annotated data and train your model with it.


Then, using active learning, you can leverage the model to label unseen data, evaluate how accurate its predictions are, and select the next set of data to annotate based on an acquisition function.
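
To make this loop concrete, here is a minimal sketch in Python. It assumes a scikit-learn-style classifier; `X_labeled`, `y_labeled`, and `X_pool` stand for the curated starting batch and the unlabeled pool, and `annotate` and `acquisition_fn` are hypothetical placeholders for the human-labeling step and the acquisition function (a concrete one is sketched further below):

```python
# A minimal sketch of the active-learning loop, not an exact pipeline.
# `annotate` and `acquisition_fn` are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
for _ in range(5):                                   # a few annotation rounds
    model.fit(X_labeled, y_labeled)                  # train on the data labeled so far
    query_idx = acquisition_fn(model, X_pool, k=16)  # choose the next batch to label
    y_new = annotate(X_pool[query_idx])              # human annotation (placeholder)
    X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
    y_labeled = np.concatenate([y_labeled, y_new])
    X_pool = np.delete(X_pool, query_idx, axis=0)    # queried items leave the pool
```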


One of the advantages of active learning is that it lets you analyze the confidence level of the model's predictions.


When a prediction has low confidence, the model requests that additional images of that type be labeled. Highly confident predictions, on the other hand, need no additional data. By reducing the overall number of images to annotate, you save time and money while still ending up with an optimized model. Active learning is a very promising approach for working with large-scale datasets.


A representation of active learning. Image from Kumar et al.



There are a few key points to keep in mind about active learning.

First, it involves human annotation, which gives you control over the quality of the model's predictions. The model is no longer a black box trained on millions of images: you actively participate in its development and help improve its performance. This aspect can increase costs compared to unsupervised approaches, but it is also what makes active learning important and interesting. That said, the time saved in training and deploying the model often outweighs these costs.


Furthermore, you can use automatic annotation tools and correct their output manually, reducing expenses even further.
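
As one illustration of that idea, a simple way to combine automatic pre-annotation with manual correction is to keep the model's confident predictions as draft labels and route only the uncertain ones to annotators. A sketch, with an assumed confidence threshold:

```python
# Sketch: model-assisted pre-annotation. The 0.95 threshold is an
# assumption to tune per task, not a recommended value.
probs = model.predict_proba(X_pool)      # class probabilities for unlabeled data
confidence = probs.max(axis=1)
draft_labels = probs.argmax(axis=1)

needs_review = confidence < 0.95         # uncertain drafts go to human annotators
# X_pool[needs_review] is sent for manual correction, while confident
# drafts only get a lighter spot-check before joining the training set.
```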


In active learning, there is a labeled dataset used to train the model, while the unlabeled set contains candidate data that hasn't been annotated yet. The key concept is the query strategy, which decides which data to label. There are various approaches to finding the most informative subset of a large pool of unlabeled data. For example, uncertainty sampling consists of testing the model on unlabeled data and selecting the least confidently classified examples for annotation.
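
For instance, a least-confidence version of uncertainty sampling could serve as the `acquisition_fn` in the loop sketched earlier; this is one common formulation, not the only one:

```python
def least_confidence(model, X_pool, k=16):
    """Return indices of the k pool samples the model is least confident about."""
    probs = model.predict_proba(X_pool)   # class probabilities per sample
    confidence = probs.max(axis=1)        # confidence in the top prediction
    return np.argsort(confidence)[:k]     # lowest-confidence samples first
```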


A representation of active learning with the query-by-committee approach. Image from Kumar et al.



Another active learning technique is query by committee (QBC), in which multiple models, each trained on a different subset of the labeled data, form a committee. Just as people with different experiences understand a given concept differently, these models have different perspectives on the classification problem. The data to annotate is selected based on the disagreement between the committee models, which signals complexity. This iterative process continues, with the selected data being annotated at each round.
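
Here is a small sketch of the idea, assuming scikit-learn decision trees as committee members trained on bootstrap subsets, with disagreement measured by vote entropy (one of several common disagreement measures):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def qbc_query(X_labeled, y_labeled, X_pool, n_models=5, k=16, seed=0):
    """Pick the k pool samples the committee disagrees on most (vote entropy)."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_models):
        # Each member sees a different bootstrap subset of the labeled data.
        idx = rng.choice(len(X_labeled), size=len(X_labeled), replace=True)
        member = DecisionTreeClassifier().fit(X_labeled[idx], y_labeled[idx])
        votes.append(member.predict(X_pool))
    votes = np.stack(votes)                           # shape: (n_models, n_pool)

    # Vote entropy per sample: higher means more disagreement.
    entropy = np.zeros(votes.shape[1])
    for c in np.unique(y_labeled):
        p = (votes == c).mean(axis=0)                 # fraction of members voting c
        entropy -= np.where(p > 0, p * np.log(p), 0.0)
    return np.argsort(entropy)[-k:]                   # most-disputed samples
```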


This is a basic explanation of active learning, with just one example of a query strategy.

If you're interested, I can provide more information and videos on other machine-learning strategies. A real-world example of active learning is answering Google's captchas: in doing so, you help identify complex images and build datasets from the collective input of many users, ensuring both dataset quality and human verification. So the next time you run into a captcha, remember that you're helping AI models progress.


To learn more and see a hands-on example using a great tool built by our friends at Encord, check out the video:


Video transcript:

…amounts of data, thanks to the superpowers of large models, including the famous ChatGPT, but also vision models and all other types you may be working with. Indeed, the secret behind those models is not only the large amount of data they are being trained on, but also the quality of that data. What does this mean? It means we need lots of very good, balanced, and varied data, and as data scientists, we all know how complicated and painful it can be to build such a good dataset fast, at large scale, and maybe with a limited budget. What if we could have help building it, or even automated help? Well, that is where active learning comes in.

In one sentence, the goal of active learning is to use the least amount of training data to optimize the annotation of your whole dataset and train the best possible model. It's a supervised learning approach that goes back and forth between your model's predictions and your data. What I mean here is that you may start with a small batch of curated, annotated data and train your model with it. You don't have to wait for your whole dataset of millions of images to be ready; just push it out there. Then, using active learning, you can use your model on your unseen data and get human annotators to label it. But that is not all: we can also evaluate how accurate the predictions are, and, using a variety of acquisition functions, which are functions used to select the next unseen data to annotate, we can quantify the impact of labeling a larger volume of data, or of improving the accuracy of the generated labels, on the model's performance.

Thanks to how you train the models, you can analyze the confidence they have in their predictions. Predictions with low confidence will automatically request additional images of this type to be labeled, and predictions with high confidence won't need additional data. So you basically save a lot of time and money by annotating fewer images in the end, and you get the most optimized model possible. How cool is that?

Active learning is one of the most promising approaches to working with large-scale datasets, and there are a few important key notions to remember. The most important is that it uses humans, which you can clearly see here in the middle of this great presentation of active learning. It will still require humans to annotate data, which has the plus side of giving you full control over the quality of your model's predictions. It's not a complete black box trained with millions of images anymore; you iteratively follow its development and help it get better when it fails. Of course, it does have the downside of increasing costs versus unsupervised approaches, where you don't need anyone, but it allows you to limit those costs by only training where the model needs it, instead of feeding it as much data as possible and hoping for the best. Moreover, the reduction in the time taken to train the model and put it into production often outweighs these costs, and you can use automatic annotation tools and manually correct their output afterwards, again reducing the costs.

Then, obviously, you will have your labeled dataset. The labeled set is what your current model is being trained on, and the unlabeled set is the data you could potentially use but that hasn't been annotated yet. Another key notion is actually the answer to the most important question you may already have in mind: how do you find the bad data to annotate and add to the training set? The solution here is called query strategies, and they are essential to any active learning algorithm, deciding which data to label and which not to. There are multiple possible approaches to finding the most informative subsets in our large pool of unlabeled data, the ones that will most help our model by being annotated, like uncertainty sampling, where you test your current model on your unlabeled data and draw the least confidently classified examples to annotate.

Another technique, shown here, is the query-by-committee, or QBC, approach. Here we have multiple models, our committee models. They will all be trained on a different subset of our labeled data and thus have a different understanding of our problem. These models will each have a hypothesis on the classification of our unlabeled data that should be somewhat similar but still different, because they basically see the world differently, just like us, who have different life experiences and have seen different animals in our lives but still share the same concepts of a cat and a dog. Then it's easy: the data to be annotated is simply the data our models most disagree on, which means it is complicated to understand, and we start over by feeding the selected data to our experts for annotation.

This is, of course, a basic explanation of active learning, with only one example of a query strategy. Let me know if you'd like more videos on other machine learning strategies like this one. A clear example of the active learning process is when you answer captchas on Google: it helps identify complex images and build datasets using you and many other people as a committee jury for annotation, building cheap and great datasets while ensuring you are a human, serving two purposes. So next time you are annoyed by a captcha, just think that you are helping AI models progress.

But we have enough theory for now. I thought it would be great to partner with some friends from Encord, a great company I have known for a while now, to showcase a real example of active learning. Since we are on this theme: it's for sure the best platform I have seen yet for active learning, and the team is amazing. Before diving into a short practical example, I just wanted to mention that I will be at CVPR in person this year, and so will Encord. If you are attending in person too, let me know, and go check out their booth: it's booth 1310.

Here's a quick demo we put together for exploring one of Encord's products that perfectly fits this episode: Encord Active. It is basically an active learning platform where you can perform everything we talked about in this video without any coding, with a great visual interface. Here's what you would see in a classic visual task like segmentation. Once you open up your project, you directly have relevant information and statistics about your data. You'll see all the outlier characteristics of your data, which will help you figure out what causes the issues in your task. For example, here we see that blur is one of those outliers that has been automatically identified. If we check out the worst images for that category, we can easily find some problematic images and tag them for review, like here, where the image is super saturated. You can also visualize groups of data thanks to their embeddings, just like the CLIP embeddings you might have heard a lot about these days, and those embeddings can easily be compared and grouped when similar, helping you find problematic groups all at once instead of going through your data one by one.

Then, once you are satisfied with the images you identified for review, you can simply export them to the Encord platform, where you can do your annotation directly. When you have your annotations and get back to the Encord Active platform, you can visualize what the data looks like with labels. You can see how the embedding plots have changed, now with the different classes attached. Here again, you can look at different subgroups of data to find problematic ones; for example, you can look at images containing school buses. This can be done using natural language to search for any information in image metadata or classes, something quite necessary these days if you want to say that you are working in AI. When you cannot easily find any more problems with your data, you train your model and come back to the platform to analyze its performance. Once again, you have access to a ton of valuable information about how well your model is performing. For example, if we take a look at the object area, where we see that small objects seem problematic, we can easily filter them out and create a new sub-dataset using only our problematic small-object images. The project is created in your Encord Active dashboard with all the same statistics you had, but for only this set of data, if you want to have a closer look or run experiments with this more complicated part of the data, like using it for training one of your committee models. And you repeat this loop over and over, annotating problematic data and improving your model as efficiently as possible. It will both reduce the need for paying expert annotators, especially if you work with medical applications as I do, or other applications where experts are quite expensive, and maximize the results of your model.

I hope you can now see how valuable active learning can be, and maybe even try it out with your own application; it can all be done with a single product if you want to. Let me know if you do. But before ending this video, I just wanted to thank Encord for sponsoring this week's episode with a great example of active learning and an amazing product. I also wanted to point out that they had a webinar on June 14th on how to build a semantic search for visual data using ChatGPT and CLIP, hosted on Encord Active, with a recording available if you want to check it out; it's definitely worthwhile and super interesting. I hope you enjoyed this episode format as much as I enjoyed making it. Thank you for watching!
