I explain Artificial Intelligence terms and news to non-experts.
In today's world, we have access to huge amounts of data thanks to powerful AI models such as ChatGPT, as well as vision models and other similar technologies. However, what these models rely on is not only the quantity of data but also its quality. Building a good dataset quickly and at scale can be a challenging and costly task.
Put simply, active learning aims to optimize the annotation of your dataset and train the best possible model with the least amount of training data.
It is a supervised learning approach that iterates between your model's predictions and your data. Instead of waiting for the complete dataset, you can start with a small batch of curated, annotated data and use it to train your model.
Then, with active learning, you use your model to label unseen data, evaluate how accurate the predictions are, and select the next batch of data to annotate based on an acquisition function.
One of the advantages of active learning is that you can analyze the confidence of your model's predictions.
If a prediction has low confidence, the model requests additional images of that type to be labeled. Predictions with high confidence, on the other hand, do not need more data. By annotating fewer images overall, you save time and money while still ending up with an optimized model. Active learning is a very promising approach for working with large-scale datasets.
Representation of active learning. Image from Kumar et al.
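To make this loop concrete, here is a minimal sketch in Python with scikit-learn. It is not tied to any specific tool: `X_pool`, `oracle_label` (your human annotation step), the model choice, and the batch size are placeholders you would replace with your own data and workflow.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def least_confidence(proba):
    """Uncertainty score: 1 minus the probability of the most likely class."""
    return 1.0 - proba.max(axis=1)

def active_learning_loop(X_labeled, y_labeled, X_pool, oracle_label,
                         rounds=5, batch_size=20):
    """Iterate: train, score the unlabeled pool, annotate the most uncertain samples."""
    model = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        model.fit(X_labeled, y_labeled)
        proba = model.predict_proba(X_pool)                         # confidence on unseen data
        query = np.argsort(least_confidence(proba))[-batch_size:]   # most uncertain samples
        y_new = oracle_label(X_pool[query])                         # the human annotation step
        X_labeled = np.vstack([X_labeled, X_pool[query]])
        y_labeled = np.concatenate([y_labeled, y_new])
        X_pool = np.delete(X_pool, query, axis=0)                   # shrink the unlabeled pool
    return model
```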
First of all, it involves human annotation, which gives you control over the quality of your model's predictions. It is not a black box trained on millions of images: you actively take part in its development and help improve its performance. This aspect makes active learning important and interesting, even though it can increase costs compared to unsupervised approaches. However, the time saved in training and deploying the model usually outweighs those costs.
Moreover, you can use automatic annotation tools and correct their output manually, reducing expenses even further.
In active learning, you have a labeled set of data used to train your model, while the unlabeled set contains candidate data that has not been annotated yet. A key concept is the query strategy, which decides which data gets labeled. There are several ways to find the most informative subset within a large pool of unlabeled data. For example, uncertainty sampling involves testing your model on the unlabeled data and selecting the examples classified with the least confidence for annotation.
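As an illustration, the most common uncertainty measures can be computed directly from the class probabilities your model outputs. The functions below are a generic sketch, not any specific library's API; whichever you pick, the highest-scoring examples go to the annotators first.

```python
import numpy as np

def least_confidence(proba):
    # High when the model is unsure even about its best guess
    return 1.0 - proba.max(axis=1)

def margin(proba):
    # High when the gap between the two most likely classes is small (ambiguous example)
    top2 = np.sort(proba, axis=1)[:, -2:]
    return 1.0 - (top2[:, 1] - top2[:, 0])

def entropy(proba):
    # High when the probability mass is spread over many classes
    return -(proba * np.log(proba + 1e-12)).sum(axis=1)

# `proba` is the (n_samples, n_classes) output of model.predict_proba(X_unlabeled).
```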
Representation of active learning using the query-by-committee approach. Image from Kumar et al.
Another technique in active learning is query by committee (QBC), where several models form a committee, each trained on a different subset of the labeled data. These models have different views of the classification problem, just as people with different experiences understand some concepts differently. The data to annotate is selected based on the disagreement among the committee models, which indicates its complexity. This iterative process continues as the selected data is annotated round after round.
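Here is a minimal QBC sketch, assuming a small scikit-learn committee trained on bootstrap samples and integer class labels; the disagreement measure shown, vote entropy, is one common choice among several.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def train_committee(X_labeled, y_labeled, seed=0):
    """Each committee member is trained on a different bootstrap sample of the labeled data."""
    rng = np.random.default_rng(seed)
    members = [LogisticRegression(max_iter=1000),
               RandomForestClassifier(n_estimators=50),
               SVC()]
    committee = []
    for member in members:
        idx = rng.choice(len(X_labeled), size=len(X_labeled), replace=True)
        committee.append(clone(member).fit(X_labeled[idx], y_labeled[idx]))
    return committee

def vote_entropy(committee, X_unlabeled, n_classes):
    """Disagreement score: entropy of the committee's votes (assumes labels 0..n_classes-1)."""
    votes = np.stack([m.predict(X_unlabeled) for m in committee])  # (n_members, n_samples)
    scores = np.zeros(votes.shape[1])
    for c in range(n_classes):
        frac = (votes == c).mean(axis=0)        # fraction of members voting for class c
        scores -= frac * np.log(frac + 1e-12)   # accumulate vote entropy
    return scores

# Send the samples with the highest vote_entropy scores to the annotators, e.g.:
# query = np.argsort(vote_entropy(committee, X_pool, n_classes))[-batch_size:]
```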
If you are interested, I can share more information or videos on other machine learning strategies. A real-world example of active learning is when you answer CAPTCHAs on Google. By doing so, you help them identify complex images and build datasets from the collective input of many users, ensuring dataset quality and human verification. So the next time you run into a CAPTCHA, remember that you are contributing to the progress of AI models!
In today's world, we have access to enormous amounts of data thanks to the superpowers of large models, including the famous ChatGPT, but also vision models and all the other types you may be working with. Indeed, the secret behind those models is not only the large amount of data they are trained on but also the quality of that data. What does this mean? It means we need lots of very good, balanced, and varied data, and as data scientists we all know how complicated and painful it can be to build such a good dataset fast, at large scale, and maybe on a limited budget. What if we could get help building it, or even automated help? Well, that is where active learning comes in.

In one sentence, the goal of active learning is to use the least amount of training data to optimize the annotation of your whole dataset and train the best possible model. It's a supervised learning approach that goes back and forth between your model's predictions and your data. What I mean here is that you can start with a small batch of curated, annotated data and train your model with it; you don't have to wait for your whole set of millions of images to be ready, just push it out there. Then, using active learning, you can run your model on unseen data and get human annotators to label it. But that is not all: we can also evaluate how accurate the predictions are and, using a variety of acquisition functions, which are the functions used to select the next unseen data to annotate, we can quantify the impact of labeling a larger volume of data, or of improving the accuracy of the generated labels, on the model's performance.

Thanks to how you train the models, you can analyze the confidence they have in their predictions. Predictions with low confidence will automatically trigger a request for additional images of that type to be labeled, while predictions with high confidence won't need more data. So you basically save a lot of time and money by annotating fewer images in the end, and you still get the most optimized model possible. How cool is that?
Active learning is one of the most promising approaches to working with large-scale datasets, and there are a few important notions to remember. The most important is that it uses humans, which you can clearly see in the middle of this representation of active learning. It will still require humans to annotate data, which has the upside of giving you full control over the quality of your model's predictions. It's not a complete black box trained on millions of images anymore; you iteratively follow its development and help it get better when it fails. Of course, it does have the downside of increasing costs versus unsupervised approaches, where you don't need anyone, but it allows you to limit those costs by only training where the model needs it, instead of feeding it as much data as possible and hoping for the best. Moreover, the reduction in the time it takes to train the model and put it into production often outweighs these costs, and you can use automatic annotation tools and manually correct their output afterward, reducing the costs even further.

Then, obviously, you have your labeled dataset: the labeled set of data is what your current model is being trained on, and the unlabeled set is the data you could eventually use but that hasn't been annotated yet. Another key notion is actually the answer to the most important question you may already have in mind: how do you find the right data to annotate and add to the training set? The solution here is called query strategies, and they are essential to any active learning algorithm, deciding which data to label and which not to. There are multiple possible approaches to finding the most informative subset of our large pool of unlabeled data, the subset that will help our model most once annotated, like uncertainty sampling, where you test your current model on your unlabeled data and pick the least confidently classified examples to annotate.

Another technique, shown here, is the query-by-committee, or QBC, approach. Here we have multiple models, our committee models, each trained on a different subset of our labeled data and thus with a different understanding of our problem. These models each have a hypothesis on the classification of our unlabeled data that should be somewhat similar but still different, because they basically see the world differently, just like us humans, who have different life experiences and have seen different animals in our lives but still share the same concepts of a cat and a dog. Then it's easy: the data to be annotated is simply the data our models most disagree on, which signals that it is complicated to understand, and we start over by feeding the selected data to our experts for annotation. This is of course a basic explanation of active learning with only one example of a query strategy; let me know if you'd like more videos on other machine learning strategies like this.
A clear example of the active learning process is when you answer CAPTCHAs on Google. It helps them identify complex images and build datasets, using you and many other people as a committee jury for annotation, building cheap and great datasets while confirming you are human, so you serve two purposes. So next time you are annoyed by a CAPTCHA, just think that you are helping AI models progress.

But we have had enough theory for now. I thought it would be great to partner with some friends from Encord, a great company I have known for a while now, to showcase a real example of active learning. Since we are on the topic, it's for sure the best platform I have seen yet for active learning, and the team is amazing. Before diving into a short practical example, I just wanted to mention that I will be at CVPR in person this year, and so will Encord. If you are attending in person too, let me know and go check out their booth; it's booth 1310. Here's a quick demo we put together exploring one of Encord's products that perfectly fits this episode, Encord Active. It is basically an active learning platform where you can perform everything we talked about in this video without any coding, through a great visual interface.
Here's what you would see in a classic visual task like segmentation. Once you open up your project, you directly get relevant information and statistics about your data. You'll see all the outlier characteristics of your data, which will help you figure out what causes the issues in your task. For example, here we see that blur is one of those outliers that has been automatically identified. If we check out the worst images for that category, we can easily find some problematic images and tag them for review, like here, where the image is super saturated.
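Encord Active computes these quality metrics for you, but as a rough illustration of how an automatic blur check can work, a common stand-in is the variance of the Laplacian; the folder path and the number of images to review below are assumptions, not part of the product.

```python
import glob
import cv2

def blur_score(image_path):
    """Variance of the Laplacian: low values usually indicate blurry images."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

# Flag the blurriest images for review (glob pattern and count are illustrative).
scores = {p: blur_score(p) for p in glob.glob("dataset/images/*.jpg")}
to_review = sorted(scores, key=scores.get)[:25]   # 25 blurriest images, lowest variance first
```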
You can also visualize groups of data thanks to their embeddings, just like the CLIP embeddings you might have heard a lot about these days. Those embeddings can easily be compared with each other and grouped when similar, helping you find problematic groups all at once instead of going through your data one by one.
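This is not Encord's implementation, just a rough sketch of the idea: embed images with the openly available CLIP model from Hugging Face and cluster the embeddings so that similar problem images surface together. The model name, folder, and cluster count are assumptions you would adapt.

```python
import glob
import torch
from PIL import Image
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    """Return L2-normalized CLIP image embeddings for a list of image files."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1).numpy()

# Group visually similar images so a whole problematic cluster can be tagged at once.
image_paths = sorted(glob.glob("dataset/images/*.jpg"))   # placeholder folder
embeddings = embed_images(image_paths)
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(embeddings)
```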
Then, once you are satisfied with the images you have identified for review, you can simply export them to the Encord platform, where you can do your annotation directly. When you have your annotations and you get back to the Encord Active platform, you can visualize what the data looks like with labels. You can see how the embedding plots have changed, now with the different classes attached. Here again, you can look at different subgroups of data to find problematic ones; for example, you can look at images containing school buses. This can be done using natural language to search for any information in the images, metadata, or classes, something quite necessary these days if you want to say that you are working in AI.
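Again, purely as a sketch of the underlying idea rather than the product's own mechanics, a natural-language search can reuse the CLIP model and image embeddings from the previous snippet: embed the text query and rank images by cosine similarity. The query string and `top_k` are illustrative.

```python
import torch

def search_images(query, image_embeddings, paths, top_k=10):
    """Rank images by similarity between a text query and normalized CLIP image embeddings."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_feat = model.get_text_features(**inputs)
    text_feat = torch.nn.functional.normalize(text_feat, dim=-1)
    sims = text_feat @ torch.as_tensor(image_embeddings).T     # cosine similarity, shape (1, n)
    best = sims[0].argsort(descending=True)[:top_k]
    return [paths[int(i)] for i in best]

# e.g. search_images("school bus", embeddings, image_paths)
```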
When you cannot easily find any more problems with your data, you train your model and come back to the platform to analyze its performance. Once again, you have access to a ton of valuable information about how well your model is performing. For example, if we take a look at the object area metric and see that small objects seem problematic, we can easily filter them out and create a new sub-dataset using only our problematic small-object images.
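Outside of any platform, the same kind of slice can be reproduced with a few lines over COCO-style annotations; the file path and the 32x32-pixel cut-off below are assumptions you would adapt to your own labels.

```python
import json

# Collect the images whose annotations contain suspiciously small objects
# (32*32 px is the usual COCO "small object" cut-off, used here only as an example).
with open("annotations.json") as f:          # COCO-style annotation file (placeholder path)
    coco = json.load(f)

small_object_image_ids = {
    ann["image_id"] for ann in coco["annotations"] if ann["area"] < 32 * 32
}
subset = [img for img in coco["images"] if img["id"] in small_object_image_ids]
print(f"{len(subset)} images with small objects out of {len(coco['images'])}")
```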
The project is created in your Encord Active dashboard with all the same statistics you had, but for only this set of data, in case you want to take a closer look or run experiments with this more complicated part of the data, like using it to train one of your committee models. You then repeat this loop over and over, annotating problematic data and improving your model as efficiently as possible. It will both reduce the need to pay expert annotators, especially if you work with medical applications as I do, or other applications where experts are quite expensive, and maximize the results of your model. I hope you can now see how valuable active learning can be, and maybe you'll even try it out with your own application; it can all be done with a single product if you want to, so let me know if you do.

But before ending this video, I just wanted to thank Encord for sponsoring this week's episode with a great example of active learning and an amazing product. I also wanted to point out that they had a webinar on June 14th on how to build a semantic search for visual data using ChatGPT and CLIP hosted on Encord Active, with a recording available if you want to check it out; it's definitely worthwhile and super interesting. I hope you enjoyed this episode format as much as I enjoyed making it. Thank you for watching.