This is an opinion piece based on the author's POV and does not necessarily reflect the views of HackerNoon.
This writer has a vested interest, be it monetary, business, or otherwise, in one or more of the products or companies mentioned within.
In today's world, we have access to vast amounts of data thanks to powerful AI models like ChatGPT, vision models, and other similar technologies. However, it is not only the quantity of data these models rely on that matters, but also its quality. Building a great dataset quickly and at scale can be a difficult and costly task.
Put simply, the goal of active learning is to optimize the annotation of your dataset and train the best possible model using the least amount of training data.

It is a supervised learning approach that involves an iterative process between your model's predictions and your data. Instead of waiting for a complete dataset, you can start with a small batch of curated, annotated data and use it to train your model.

Then, with active learning, you can leverage the model to label unseen data, evaluate how accurate its predictions are, and select the next set of data to annotate based on an acquisition function.
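To make that loop concrete, here is a minimal sketch in Python. It is not a library API: `model`, `acquire_labels`, and the least-confidence acquisition function are placeholders you would swap for your own training, annotation, and selection steps.

```python
import numpy as np

def active_learning_loop(model, labeled, unlabeled, acquire_labels,
                         rounds=10, batch_size=100):
    """Iteratively train, score unlabeled data, and annotate the most useful batch.

    `model` must expose fit(X, y) and predict_proba(X);
    `acquire_labels` stands in for your (human) annotation step.
    """
    X_train, y_train = labeled
    for _ in range(rounds):
        model.fit(X_train, y_train)

        # Score every unlabeled example. Here the acquisition function is
        # least-confidence: 1 minus the maximum class probability.
        probs = model.predict_proba(unlabeled)
        uncertainty = 1.0 - probs.max(axis=1)

        # Pick the batch the model is least sure about...
        query_idx = np.argsort(uncertainty)[-batch_size:]
        new_X = unlabeled[query_idx]

        # ...send it to human annotators, then fold it into the training set.
        new_y = acquire_labels(new_X)
        X_train = np.concatenate([X_train, new_X])
        y_train = np.concatenate([y_train, new_y])
        unlabeled = np.delete(unlabeled, query_idx, axis=0)
    return model
```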
One advantage of active learning is that you can analyze the confidence level of the model's predictions.

When a prediction has low confidence, the model requests that additional images of that type be labeled; high-confidence predictions, on the other hand, do not need additional data. By reducing the overall number of images to annotate, you save time and money while still ending up with an optimized model. Active learning is a very promising approach for working with large-scale datasets.
A representation of active learning. Image from Kumar et al.
First, active learning involves human annotation, which gives you control over the quality of the model's predictions. It is not a black box trained on millions of images: you actively participate in its development and help improve its performance. This aspect can increase costs compared to unsupervised approaches, but it is also what makes active learning valuable and interesting. Moreover, the time saved in training and deploying the model often outweighs these costs.

In addition, you can use automatic annotation tools and correct their output manually, reducing expenses even further.
In active learning, there is a labeled dataset used to train the model, and an unlabeled set containing potential data that has not yet been annotated. A key concept is the query strategy, which decides which data to label. There are various approaches to finding the most informative subset within a large pool of unlabeled data. For example, uncertainty sampling involves testing your model on the unlabeled data and selecting the least confidently classified examples for annotation.
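Here is what least-confidence uncertainty sampling might look like with scikit-learn; the random data is just a stand-in for your own labeled and unlabeled pools.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def least_confident(model, X_unlabeled, n_queries=10):
    """Return the indices of the examples the model is least confident about."""
    probs = model.predict_proba(X_unlabeled)
    confidence = probs.max(axis=1)             # probability of the predicted class
    return np.argsort(confidence)[:n_queries]  # lowest confidence first

# Toy usage with random placeholder data.
rng = np.random.default_rng(0)
X_labeled, y_labeled = rng.normal(size=(50, 8)), rng.integers(0, 2, 50)
X_unlabeled = rng.normal(size=(500, 8))

model = LogisticRegression().fit(X_labeled, y_labeled)
to_annotate = least_confident(model, X_unlabeled, n_queries=20)
print(to_annotate)  # send these examples to your annotators
```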
A representation of active learning with the query-by-committee approach. Image from Kumar et al.
Another active learning technique is query by committee (QBC), in which multiple models, each trained on a different subset of the labeled data, form a committee. Just as people with different experiences understand a given concept differently, these models have different perspectives on the classification problem. The data to annotate is selected based on the disagreement between the committee models, which signals complexity. This iterative process continues as the selected data keeps being annotated.
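Below is a minimal sketch of query by committee, using vote entropy as the disagreement measure. Training each member on a different bootstrap sample of the labeled data is one simple way to build the committee, not the only one, and the random data is again a placeholder.

```python
import numpy as np
from scipy.stats import entropy
from sklearn.tree import DecisionTreeClassifier

def qbc_disagreement(committee, X_unlabeled):
    """Vote entropy: how much the committee members disagree on each example."""
    votes = np.stack([m.predict(X_unlabeled) for m in committee])  # (models, samples)
    n_models = len(committee)
    scores = []
    for sample_votes in votes.T:
        _, counts = np.unique(sample_votes, return_counts=True)
        scores.append(entropy(counts / n_models))
    return np.array(scores)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 8)), rng.integers(0, 3, 200)
X_unlabeled = rng.normal(size=(1000, 8))

# Each committee member sees a different bootstrap sample of the labeled data.
committee = []
for _ in range(5):
    idx = rng.integers(0, len(X), len(X))
    committee.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

scores = qbc_disagreement(committee, X_unlabeled)
to_annotate = np.argsort(scores)[-20:]  # most disagreement = most informative
```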
If you are interested, I can provide more information and videos on other machine learning strategies. A real-world example of active learning is when you answer Google's captchas: by doing so, you help identify complex images and build datasets using the collective input of many users, ensuring both dataset quality and human verification. So the next time you encounter a captcha, remember that you are contributing to the progress of AI models.
In today's world, we have access to huge amounts of data thanks to the superpowers of large models, including the famous ChatGPT, but also vision models and all the other types you may be working with. Indeed, the secret behind those models is not only the large amount of data they are trained on, but also the quality of that data. What does this mean? It means we need lots of well-balanced and varied data, and as data scientists, we all know how complicated and painful it can be to build such a good dataset fast, at large scale, and maybe on a limited budget. What if we could have help building it, or even automated help? Well, that is where active learning comes in.

In one sentence, the goal of active learning is to use the least amount of training data to optimize the annotation of your whole dataset and train the best possible model. It's a supervised learning approach that goes back and forth between your model's predictions and your data. What I mean here is that you can start with a small batch of curated, annotated data and train your model with it. You don't have to wait for your whole set of millions of images to be ready; just push it out there.
Then, using active learning, you can use your model on your unseen data and get human annotators to label it. But that is not all: we can also evaluate how accurate the predictions are, and using a variety of acquisition functions, which are functions used to select the next unseen data to annotate, we can quantify the impact of labeling a larger volume of data, or of improving the accuracy of the generated labels, on the model's performance. Thanks to how you train the models, you can analyze the confidence they have in their predictions. Predictions with low confidence will automatically request additional images of this type to be labeled, and predictions with high confidence won't need additional data. So you will basically save a lot of time and money by annotating fewer images in the end, and still have the most optimized model possible. How cool is that? Active learning is one of the most promising approaches to working with large-scale datasets, and there are a few important key notions to remember.

The most important is that active learning uses humans, which you can clearly see in the middle of this representation of active learning. It will still require humans to annotate data, which has the plus side of giving you full control over the quality of your model's predictions. It's not a complete black box trained on millions of images anymore: you iteratively follow its development and help it get better when it fails. Of course, it does have the downside of increased costs versus unsupervised approaches, where you don't need anyone, but it allows you to limit those costs by only training where the model needs it, instead of feeding it as much data as possible and hoping for the best. Moreover, the reduction in the time it takes to train the model and put it into production often outweighs these costs, and you can use some automatic annotation tools and manually correct their output afterwards, again reducing the costs.
Then, obviously, you will have your labeled dataset. The labeled set is what your current model is being trained on, and the unlabeled set is the data you could potentially use but that hasn't been annotated yet. Another key notion is actually the answer to the most important question you may already have in mind: how do you find the data that most needs to be annotated and added to the training set? The solution here is called query strategies, and they are essential to any active learning algorithm, deciding which data to label and which not to. There are multiple possible approaches to finding the most informative subsets in our large pool of unlabeled data, the ones that will help our model most by being annotated, like uncertainty sampling, where you test your current model on your unlabeled data and draw the least confident classified examples to annotate.
Another technique, shown here, is the query by committee, or QBC, approach. Here we have multiple models, our committee models. They will all be trained on a different subset of our labeled data and thus have a different understanding of our problem. These models will each have a hypothesis on the classification of our unlabeled data; those hypotheses should be somewhat similar but still different, because the models basically see the world differently, just like us, who have different life experiences and have seen different animals in our lives, but still share the same concepts of a cat and a dog. Then it's easy: the data to be annotated is simply the data our models most disagree on, which means it is complicated to understand, and we start over by feeding the selected data to our experts for annotation.
This is, of course, a basic explanation of active learning, with only one example of a query strategy; let me know if you'd like more videos on other machine learning strategies like this one. A clear example of the active learning process is when you answer captchas on Google. It helps identify complex images and build datasets, using you and many other people as a committee jury for annotation, building cheap and great datasets while ensuring you are a human: you are serving two purposes. So next time you are annoyed by a captcha, just think that you are helping AI models progress.
But we have enough theory for now. I thought it would be great to partner with some friends from Encord, a great company I have known for a while now, to showcase a real example of active learning. Since we are on this theme, it's for sure the best platform I have seen yet for active learning, and the team is amazing. Before diving into a short practical example, I just wanted to mention that I will be at CVPR in person this year, and so will Encord. If you are attending in person too, let me know, and go check out their booth; it's booth 1310. Here's a quick demo we put together exploring one of Encord's products that perfectly fits this episode, Encord Active. It is basically an active learning platform where you can perform everything we talked about in this video without any coding, with a great visual interface.
Here's what you would see in a classic visual task like segmentation. Once you open up your project, you directly have relevant information and statistics about your data. You'll see all the outlier characteristics of your data, which will help you figure out what causes issues in your task. For example, here we see that blur is one of the outliers that has been automatically identified. If we check out the worst images for that category, we can easily find some problematic images and tag them for review, like here, where the image is super saturated.
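Encord Active computes these quality metrics for you, but if you wanted to flag blurry images yourself, one common heuristic is the variance of the Laplacian. A rough sketch with OpenCV; the threshold and file list are assumptions, not Encord's actual implementation:

```python
import cv2

def blur_score(path: str) -> float:
    """Lower variance of the Laplacian = fewer sharp edges = blurrier image."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

# Placeholder file list; flag images under an empirically chosen threshold.
image_paths = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]
blurry = [p for p in image_paths if blur_score(p) < 100.0]
print(blurry)  # candidates to tag for review
```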
You can also visualize groups of data thanks to their embeddings, just like the CLIP embeddings you might have heard a lot about these days. Those embeddings can easily be compared together and grouped when similar, helping you find problematic groups all at once instead of going through your data one by one.
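Under the hood, grouping by embeddings comes down to comparing vectors. A minimal sketch, assuming you already have one embedding per image (CLIP or otherwise) in a NumPy array; the random embeddings here are placeholders:

```python
import numpy as np

def group_similar(embeddings: np.ndarray, threshold: float = 0.9):
    """Greedily group images whose cosine similarity exceeds a threshold."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T  # pairwise cosine similarity
    unassigned = set(range(len(embeddings)))
    groups = []
    while unassigned:
        i = unassigned.pop()
        group = [i] + [j for j in list(unassigned) if sims[i, j] > threshold]
        unassigned -= set(group)
        groups.append(group)
    return groups

# e.g. 1,000 images with 512-dimensional embeddings (random placeholders here)
embeddings = np.random.default_rng(0).normal(size=(1000, 512))
clusters = group_similar(embeddings)
print(len(clusters), "groups found")
```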
Then, once you are satisfied with the images you've identified for review, you can simply export them to the Encord platform, where you can do your annotation directly. When you have your annotations and you get back to the Encord Active platform, you can now visualize what your data looks like with labels. You can see how the embedding plots have changed, now with the different classes attached. Here again, you can look at different subgroups of data to find problematic ones; for example, you can look at images containing school buses. This can be done using natural language to look for any information in images, metadata, or classes, something quite necessary these days if you want to say that you are working in AI.
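Encord Active provides this search out of the box, but the underlying idea is easy to reproduce with CLIP: embed the text query and the images in the same space and rank by similarity. A sketch with the Hugging Face transformers library; the model checkpoint and file names are just examples:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["bus1.jpg", "street2.jpg", "dog3.jpg"]  # placeholder file names
images = [Image.open(p) for p in image_paths]

inputs = processor(text=["an image of a school bus"], images=images,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text[0] holds the similarity of the query to each image.
ranking = outputs.logits_per_text[0].argsort(descending=True)
print([image_paths[i] for i in ranking])  # most bus-like images first
```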
When you cannot easily find any more problems with your data, you train your model and come back to the platform to analyze its performance. Once again, you have access to a ton of valuable information about how well your model is performing. For example, if we take a look at the object area metric and see that small objects seem problematic, we can easily filter them out and create a new sub-dataset using only our problematic small-object images. The project is created in your Encord Active dashboard with all the same statistics you had, but for only this set of data, in case you want to have a closer look or run experiments with this more complicated part of the data, like using it for training one of your committee models. And you repeat this loop over and over, annotating problematic data and improving your model as efficiently as possible. It will both reduce the need for paying expert annotators, especially if you work with medical applications, as I do, or other applications where experts are quite expensive, and maximize the results of your model.
I hope you can now see how valuable active learning can be, and maybe even try it out with your own application; it can all be done with a single product if you want to. Let me know if you do!

Before ending this video, I just wanted to thank Encord for sponsoring this week's episode with a great example of active learning and an amazing product. I also wanted to point out that they had a webinar on June 14th on how to build a semantic search for visual data using ChatGPT and CLIP, hosted on Encord Active, with a recording available if you want to check it out; it's definitely worthwhile and super interesting. I hope you enjoyed this episode format as much as I enjoyed making it. Thank you for watching!