I explain Artificial Intelligence terms and news to non-experts.
This is an opinion piece based on the author's POV and does not necessarily reflect the views of HackerNoon.
This writer has a vested interest, be it monetary, business, or otherwise, in one or more of the products or companies mentioned within.
In today's world, powerful AI models such as ChatGPT, vision models, and similar technologies give us access to enormous amounts of data. With these models, however, what matters is not only the quantity of the data but also its quality. Building a good dataset quickly and at scale can be a challenging and costly task.
Put simply, active learning aims to optimize the annotation of your dataset and train the best possible model with the least amount of training data.
It is a supervised learning approach that involves an iterative process between your model's predictions and your data. Instead of waiting for a complete dataset, you can start with a small batch of curated, annotated data and train your model on it.
With active learning, you can then use your model to label unseen data, evaluate how accurate its predictions are, and use acquisition functions to select the next batch of data to annotate.
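The loop described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method of any particular platform: the toy one-dimensional nearest-centroid model, the `oracle` function standing in for a human annotator, and all function names are hypothetical.

```python
def oracle(x):
    """Stand-in for a human annotator (hypothetical rule: label 1 when x >= 5)."""
    return 1 if x >= 5.0 else 0

def fit(labeled):
    """'Train' a toy model: one centroid (mean) per class. Needs both classes present."""
    means = {}
    for cls in (0, 1):
        xs = [x for x, y in labeled if y == cls]
        means[cls] = sum(xs) / len(xs)
    return means

def predict_confidence(model, x):
    """Return (predicted class, confidence in [0.5, 1]) from distances to the centroids."""
    d0, d1 = abs(x - model[0]), abs(x - model[1])
    if d0 + d1 == 0:
        return 0, 0.5
    p1 = d0 / (d0 + d1)  # the closer x is to centroid 1, the higher p1
    return (1, p1) if p1 >= 0.5 else (0, 1.0 - p1)

def acquire(model, pool, batch_size):
    """Acquisition function: pick the pool items the model is least confident about."""
    return sorted(pool, key=lambda x: predict_confidence(model, x)[1])[:batch_size]

def active_learning(pool, seed_labeled, rounds=3, batch_size=2):
    """Alternate between training, selecting a batch, and asking the oracle for labels."""
    labeled, pool = list(seed_labeled), list(pool)
    for _ in range(rounds):
        model = fit(labeled)
        batch = acquire(model, pool, batch_size)
        labeled += [(x, oracle(x)) for x in batch]   # human annotation step
        pool = [x for x in pool if x not in batch]   # remove what was just labeled
    return fit(labeled), labeled, pool

if __name__ == "__main__":
    pool = [0.5 * i for i in range(21)]              # unlabeled values 0.0 .. 10.0
    model, labeled, rest = active_learning(pool, [(1.0, 0), (9.0, 1)])
    print(len(labeled), "labeled,", len(rest), "still unlabeled")
```

Starting from two seed labels, each round sends only the examples closest to the current decision boundary to the annotator, which is the point of the approach: the labeling budget goes where the model is least sure.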
One advantage of active learning is that you can analyze the confidence level of your model's predictions.
When a prediction has low confidence, the model requests that additional images of that type be labeled. Predictions with high confidence, on the other hand, do not require more data. By annotating fewer images overall, you save time and money while still obtaining an optimized model. Active learning is a promising approach for working with large datasets.
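As a rough illustration of this confidence check, the sketch below splits predictions into those that should go to annotators and those the model already handles confidently. The `triage` function, the image IDs, and the 0.8 threshold are assumptions made for the example; a sensible threshold depends on your model and task.

```python
def triage(predictions, threshold=0.8):
    """Split items by the model's top class probability.

    predictions maps an item ID to a list of class probabilities.
    Items below the threshold are queued for human annotation.
    """
    needs_labels, confident = [], []
    for item_id, probs in predictions.items():
        confidence = max(probs)  # probability of the predicted class
        if confidence < threshold:
            needs_labels.append(item_id)
        else:
            confident.append(item_id)
    return needs_labels, confident

if __name__ == "__main__":
    # Hypothetical softmax outputs for three images and three classes.
    predictions = {
        "img_001": [0.98, 0.01, 0.01],
        "img_002": [0.40, 0.35, 0.25],
        "img_003": [0.55, 0.45, 0.00],
    }
    to_annotate, done = triage(predictions)
    print("annotate:", to_annotate)
    print("confident:", done)
```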
Illustration of active learning. Image by Kumar et al.
First, active learning involves human annotation, which gives you control over the quality of your model's predictions. It is not a black box trained on millions of images. You take an active part in its development and help improve its performance. This aspect makes active learning important and interesting, even though it can increase costs compared to unsupervised approaches. However, the time saved in training and deploying the model often outweighs those costs.
In addition, you can use automatic annotation tools and correct their output manually, which reduces costs further.
In active learning, you have a labeled dataset on which your model is trained, while the unlabeled set contains candidate data that has not yet been annotated. A key concept is the query strategy, which determines which data should be labeled. There are several approaches to finding the most informative subsets in the large pool of unlabeled data. Uncertainty sampling, for example, runs your model on the unlabeled data and selects the least confidently classified examples for annotation.
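Uncertainty sampling can be scored in several common ways. The sketch below shows three of them (least confidence, margin, and entropy) and picks the top-k most uncertain items from a pool. The function names and the example probabilities are illustrative assumptions, and in practice the probabilities would come from your own model.

```python
import math

def least_confidence(probs):
    """1 minus the top probability: higher means more uncertain."""
    return 1.0 - max(probs)

def margin_uncertainty(probs):
    """1 minus the gap between the two most likely classes: higher means more uncertain."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return 1.0 - (top1 - top2)

def entropy(probs):
    """Shannon entropy of the predicted distribution: higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(pool_probs, k, score=entropy):
    """Return the IDs of the k pool items with the highest uncertainty score."""
    ranked = sorted(pool_probs, key=lambda item_id: score(pool_probs[item_id]), reverse=True)
    return ranked[:k]

if __name__ == "__main__":
    # Hypothetical class probabilities for three unlabeled items.
    pool_probs = {"a": [0.9, 0.1], "b": [0.5, 0.5], "c": [0.7, 0.3]}
    print(select_most_uncertain(pool_probs, k=2))
```

For two-class problems the three scores rank items identically. They can diverge with more classes, where margin looks only at the top two candidates while entropy considers the whole distribution.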
Illustration of active learning with the Query by Committee approach. Image by Kumar et al.
Another active learning technique is Query by Committee (QBC), in which several models, each trained on a different subset of the labeled data, form a committee. These models have different perspectives on the classification problem, just as people with different experiences understand certain concepts differently. The data to annotate is selected based on the disagreement among the committee models, which indicates complexity. This iterative process continues as the selected data is annotated.
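One common way to quantify committee disagreement is vote entropy, sketched below. The three threshold classifiers are hand-written stand-ins for models trained on different subsets of the labeled data, and all names here are hypothetical. Vote entropy is one possible disagreement measure, not necessarily the one used in the figure.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the committee's votes: 0 when all members agree, higher when they split."""
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())

def query_by_committee(committee, pool, k):
    """Return the k pool items on which the committee members disagree the most."""
    def disagreement(x):
        return vote_entropy([member(x) for member in committee])
    return sorted(pool, key=disagreement, reverse=True)[:k]

if __name__ == "__main__":
    # Stand-ins for three models that learned slightly different decision boundaries.
    committee = [
        lambda x: int(x >= 4),
        lambda x: int(x >= 5),
        lambda x: int(x >= 6),
    ]
    pool = [1, 4.5, 5.5, 9]
    print(query_by_committee(committee, pool, k=2))  # items near the disputed boundary
```

Items far from every member's boundary get unanimous votes and a disagreement of zero, so they are never prioritized for annotation.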
If you are interested, I can provide more information or videos on other machine learning strategies. A real-world example of active learning is answering captchas on Google. By doing so, you help them identify complex images and build datasets from the combined input of many users, ensuring both dataset quality and human verification. So the next time you come across a captcha, remember that you are contributing to the progress of AI models!
[In today's world, we have access to enormous] amounts of data thanks to the superpowers of large models, including the famous ChatGPT but also vision models and all the other types you may be working with. Indeed, the secret behind those models is not only the large amount of data they are trained on but also the quality of that data. But what does this mean? It means we need lots of very good, balanced, and varied data. As data scientists, we all know how complicated and painful it can be to build such a good dataset fast, at large scale, and maybe with a limited budget. What if we could have help building it, or even automated help? Well, that is where active learning comes in.
In one sentence, the goal of active learning is to use the least amount of training data to optimize the annotation of your whole dataset and train the best possible model. It's a supervised learning approach that goes back and forth between your model's predictions and your data. What I mean here is that you may start with a small batch of curated, annotated data and train your model with it. You don't have to wait for your whole dataset of millions of images to be ready. Just push it out there.
Then, using active learning, you can run your model on your unseen data and get human annotators to label it. But that is not all. We can also evaluate how accurate the predictions are and, using a variety of acquisition functions, which are functions used to select the next unseen data to annotate, quantify the impact on the model's performance of labeling a larger volume of data or of improving the accuracy of the generated labels. Thanks to how you train the models, you can analyze the confidence they have in their predictions.
Predictions with low confidence will automatically request additional images of this type to be labeled, and predictions with high confidence won't need additional data. So you will basically save a lot of time and money by having to annotate fewer images in the end, and still have the most optimized model possible. How cool is that? Active learning is one of the most promising approaches to working with large-scale datasets.
There are a few key notions to remember with active learning. The most important is that it uses humans, which you can clearly see here in the middle of this great representation of active learning. It will still require humans to annotate data, which has the upside of giving you full control over the quality of your model's predictions. It's not a complete black box trained on millions of images anymore. You iteratively follow its development and help it get better when it fails. Of course, it does have the downside of increasing costs versus unsupervised approaches, where you don't need anyone, but it allows you to limit those costs by only training where the model needs it, instead of feeding it as much data as possible and hoping for the best. Moreover, the reduction in the time taken to train the model and put it into production often outweighs these costs, and you can use automatic annotation tools and manually correct their output afterward, again reducing the costs.
Then, obviously, you will have your labeled dataset. The labeled set is what your current model is being trained on, and the unlabeled set is the data you could potentially use but that hasn't been annotated yet. Another key notion is actually the answer to the most important question you may already have in mind: how do you find the best data to annotate and add to the training set?
The solution here is called query strategies, and they are essential to any active learning algorithm, deciding which data to label and which not to. There are multiple possible approaches to finding the most informative subsets in our large pool of unlabeled data, the ones that will most help our model by being annotated, like uncertainty sampling, where you test your current model on your unlabeled data and draw the least confidently classified examples to annotate.
Another technique, shown here, is the Query by Committee, or QBC, approach. Here we have multiple models, our committee models. They will each be trained on a different subset of our labeled data and thus have a different understanding of our problem. These models will each have a hypothesis on the classification of our unlabeled data that should be somewhat similar but still different, because they basically see the world differently, just like us: we have different life experiences and have seen different animals in our lives, but still share the same concepts of a cat and a dog. Then it's easy: the data to be annotated is simply the data our models most disagree on, which means it is complicated to understand. And we start over by feeding the selected data to our experts for annotation.
This is, of course, a basic explanation of active learning with only one example of a query strategy. Let me know if you'd like more videos on other machine learning strategies like this. A clear example of the active learning process is when you answer captchas on Google. It helps them identify complex images and build datasets using you and many other people as a committee jury for annotation, building cheap and great datasets while ensuring you are a human, serving two purposes. So next time you are annoyed by a captcha, just think that you are helping AI models progress.
But we have had enough theory for now. I thought it would be great to partner with some friends from Encord, a great company I have known for a while now, to showcase a real example of active learning. Since we are on this theme, it's for sure the best platform I have seen yet for active learning, and the team is amazing. Before diving into a short practical example, I just wanted to mention that I will be at CVPR in person this year, and so will Encord. If you are attending in person too, let me know and go check out their booth. It's booth 1310.
Here's a quick demo we put together to explore one of Encord's products that perfectly fits this episode: Encord Active. It is basically an active learning platform where you can perform everything we talked about in this video without any coding, with a great visual interface.
Here's what you would see in a classic visual task like segmentation. Once you open up your project, you directly have relevant information and statistics about your data. You'll see all the outlier characteristics of your data, which will help you figure out what causes the issues in your task. For example, here we see that blur is one of those outliers that has been automatically identified. If we check out the worst images for that category, we can easily find some problematic images and tag them for review, like here, where the image is super saturated. You can also visualize groups of data thanks to their embeddings, just like the CLIP embeddings you might have heard a lot about these days. Those embeddings can easily be compared and grouped when similar, helping you find problematic groups all at once instead of going through your data one by one.
Then, once you are satisfied with the images you have identified for review, you can simply export them to the Encord platform, where you can do your annotation directly. When you have your annotations and get back to the Encord Active platform, you can visualize what the data looks like with labels. You can see how the embedding plots have changed, now with the different classes attached. Here again, you can look at different subgroups of data to find problematic ones. For example, you can look at images containing school buses. This can be done using natural language to look for any information in images, metadata, or classes, something quite necessary these days if you want to say that you are working in AI.
When you cannot easily find any more problems with your data, you train your model and come back to the platform to analyze its performance. Once again, you have access to a ton of valuable information about how well your model is performing. For example, if we take a look at the object area, where we see that small images seem problematic, we can easily filter them out and create a new sub-dataset using only our problematic small-object images. The project is created in your Encord Active dashboard with all the same statistics you had, but for only this set of data, if you want to have a closer look or run experiments with this more complicated part of the data, like using it to train one of your committee models.
And you repeat this loop over and over, annotating problematic data and improving your model as efficiently as possible. It will both reduce the need for paying expert annotators, especially if you work with medical applications as I do, or other applications where experts are quite expensive, and maximize the results of your model. I hope you can now see how valuable active learning can be and maybe even try it out with your own application. It can all be done with a single product if you want to. Let me know if you do so.
But before ending this video, I just wanted to thank Encord for sponsoring this week's episode with a great example of active learning and an amazing product. I also wanted to point out that they had a webinar on June 14th on how to build a semantic search for visual data using ChatGPT and CLIP that is housed on Encord Active, with a recording available if you want to check it out. It's definitely worthwhile and super interesting. I hope you enjoyed this episode format as much as I enjoyed making it. Thank you for watching.
An Introduction to Active Learning | HackerNoon