In today's world, we have access to enormous amounts of data thanks to powerful AI models like ChatGPT, as well as vision models and other similar technologies. However, it is not only the quantity of data these models rely on that matters, but also its quality. Building a good dataset quickly and at scale can be a challenging and costly endeavor.
Simply put, active learning aims to optimize the annotation of your dataset and to train the best possible model using the least amount of training data.
It is a supervised learning approach that involves an iterative process between your model's predictions and your data. Instead of waiting for a complete dataset, you can start with a small batch of curated, annotated data and train your model on it.
Then, using active learning, you can leverage your model to label unseen data, evaluate the accuracy of its predictions, and select the next batch of data to annotate based on acquisition functions.
One advantage of active learning is that you can analyze the confidence of your model's predictions.
If a prediction has low confidence, the model will request additional images of that type to be labeled. On the other hand, high-confidence predictions won't require more data. Overall, by annotating fewer images, you save time and money while still obtaining an optimized model. Active learning is a very promising approach for working with large-scale datasets.
A representation of active learning. Image from Kumar et al.
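To make this iterative process concrete, here is a minimal sketch of an active learning loop in Python with scikit-learn. Everything in it is a placeholder: the dataset is synthetic, and the known labels stand in for the human annotator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical data: a small labeled seed set and a large "unlabeled" pool.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
labeled_idx = list(range(50))         # start with 50 annotated examples
pool_idx = list(range(50, len(X)))    # the rest is the unlabeled pool

model = LogisticRegression(max_iter=1000)

for _ in range(10):                   # 10 annotation rounds
    # 1. Train on the currently labeled data.
    model.fit(X[labeled_idx], y[labeled_idx])

    # 2. Predict on the unlabeled pool and measure confidence.
    probs = model.predict_proba(X[pool_idx])
    confidence = probs.max(axis=1)

    # 3. Query the 20 least confident examples (uncertainty sampling).
    query_positions = np.argsort(confidence)[:20]

    # 4. "Annotate" them: the known labels stand in for the human
    #    annotator; pop from the back so earlier positions stay valid.
    for pos in sorted(query_positions, reverse=True):
        labeled_idx.append(pool_idx.pop(pos))
```

Each round retrains the model on the growing labeled pool and requests annotations only for the examples it is least confident about, which is exactly the behavior described above.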
A first key aspect is that it involves human annotation, which gives you control over the quality of your model's predictions. It is no longer a black box trained on millions of images: you actively participate in its development and help improve its performance. Although this increases costs compared to unsupervised approaches, it is also what makes active learning important and interesting. Moreover, the time saved in training and deploying the model often outweighs these costs.
Additionally, you can use automatic annotation tools and correct their output manually, reducing expenses even further.
In active learning, you have a labeled dataset that your model is trained on, while the unlabeled set contains the potential data that has not been annotated yet. A key concept is query strategies, which determine which data gets labeled. There are various approaches to finding the most informative subsets in the large pool of unlabeled data. For example, uncertainty sampling involves testing your model on the unlabeled data and selecting the least confidently classified examples for annotation.
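To illustrate what such query strategies actually compute, here is a generic sketch of three common uncertainty scores over a model's predicted class probabilities. The function names and the usage line are illustrative, not tied to any specific library.

```python
import numpy as np

def least_confidence(probs: np.ndarray) -> np.ndarray:
    # 1 minus the probability of the most likely class.
    return 1.0 - probs.max(axis=1)

def margin(probs: np.ndarray) -> np.ndarray:
    # A small gap between the two most likely classes means high uncertainty.
    top_two = np.sort(probs, axis=1)[:, -2:]
    return 1.0 - (top_two[:, 1] - top_two[:, 0])

def entropy(probs: np.ndarray) -> np.ndarray:
    # Shannon entropy of the predicted class distribution.
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

# Usage sketch: probs = model.predict_proba(X_unlabeled), then annotate
# the highest-scoring rows, e.g. np.argsort(entropy(probs))[-100:].
```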
A representation of active learning with the Query-by-Committee approach. Image from Kumar et al.
Another technique in active learning is Query-by-Committee (QBC), in which multiple models, each trained on a different subset of the labeled data, form a committee. Just as people with different experiences have different understandings of certain concepts, these models have different perspectives on the classification problem. The data to annotate is selected based on the disagreement among the committee models, which indicates complexity. This iterative process continues as the selected data keeps being annotated.
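Here is a rough sketch of the disagreement measure QBC relies on, using vote entropy across a hypothetical committee of already-trained classifiers (any objects with a scikit-learn-style `predict` method would do).

```python
import numpy as np

def vote_entropy(committee, X_unlabeled, n_classes):
    # Each committee member votes for a class on every unlabeled example
    # (assuming integer class labels 0..n_classes-1).
    votes = np.stack([m.predict(X_unlabeled) for m in committee])
    # Fraction of the committee voting for each class, per example.
    vote_frac = np.stack(
        [(votes == c).mean(axis=0) for c in range(n_classes)], axis=1
    )
    # High entropy = strong disagreement = worth annotating.
    return -(vote_frac * np.log(vote_frac + 1e-12)).sum(axis=1)

# Usage sketch: train each member on a different subset of the labeled
# data, then query the examples the committee disagrees on most, e.g.
# np.argsort(vote_entropy(committee, X_pool, n_classes=10))[-20:].
```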
If you're interested, I can provide more information or videos on other machine learning strategies. A real-life example of active learning is answering captchas on Google. By doing so, you help identify complex images and build datasets from the combined input of many users, providing both dataset quality and human verification. So the next time you come across a captcha, remember that you are contributing to the progress of AI models!
Video Transcript:
We have access to huge amounts of data thanks to the superpowers of large models, including the famous ChatGPT, but also vision models and all the other types you may be working with. Indeed, the secret behind those models is not only the large amount of data they are trained on but also the quality of that data. What does this mean? It means we need lots of well-balanced and varied data, and as data scientists, we all know how complicated and painful it can be to build such a good dataset fast, at large scale, and maybe on a limited budget. What if we could have help building it, or even automated help? Well, that is where active learning comes in.
In one sentence, the goal of active learning is to use the least amount of training data to optimize the annotation of your whole dataset and train the best possible model. It's a supervised learning approach that goes back and forth between your model's predictions and your data. What I mean here is that you may start with a small batch of curated, annotated data and train your model with it. You don't have to wait for your whole set of millions of images to be ready; just push it out there.
Then, using active learning, you can run your model on your unseen data and get human annotators to label it. But that is not all of it: we can also evaluate how accurate the predictions are and, using a variety of acquisition functions (functions used to select the next unseen data to annotate), quantify the impact of labeling a larger volume of data, or of improving the accuracy of the generated labels, on the model's performance. Thanks to how you train the models, you can analyze the confidence they have in their predictions. Predictions with low confidence will automatically trigger a request for additional images of that type to be labeled, while predictions with high confidence won't need additional data. So you basically save a lot of time and money by having to annotate fewer images in the end, while still getting the most optimized model possible. How cool is that?
Active learning is one of the most promising approaches to working with large-scale datasets, and there are a few important key notions to remember with active learning. The most important is that it uses humans, as you can clearly see in the middle of this representation of active learning. It will still require humans to annotate data, which has the plus side of giving you full control over the quality of your model's predictions. It's not a complete black box trained with millions of images anymore: you iteratively follow its development and help it get better when it fails. Of course, it does have the downside of increasing costs versus unsupervised approaches, where you don't need anyone, but it allows you to limit those costs by only training where the model needs it instead of feeding it as much data as possible and hoping for the best. Moreover, the reduction in the time taken to train the model and put it into production often outweighs these costs, and you can use some automatic annotation tools and manually correct their output afterwards, again reducing the costs.
Then, obviously, you will have your labeled dataset. The labeled set is the data your current model is being trained on, and the unlabeled set is the data you could eventually use but that hasn't been annotated yet. Another key notion is actually the answer to the most important question you may already have in mind: how do you find the bad data to annotate and add to the training set? The solution here is called query strategies, and they are essential to any active learning algorithm, deciding which data to label and which not to. There are multiple possible approaches to finding the most informative subsets in our large pool of unlabeled data, the ones that will help our model most by being annotated, like uncertainty sampling, where you test your current model on your unlabeled data and draw out the least confidently classified examples to annotate.
Another technique, shown here, is the query-by-committee, or QBC, approach. Here we have multiple models, our committee models, which will each be trained on a different subset of our labeled data and thus have a different understanding of our problem. These models will each have a hypothesis on the classification of our unlabeled data that should be somewhat similar but still different, because they basically see the world differently, just like us, who have different life experiences and have seen different animals in our lives but still share the same concepts of a cat and a dog. Then it's easy: the data to be annotated is simply the examples our models most disagree on, which means they are complicated to understand, and we start over by feeding the selected data to our experts for annotation. This is, of course, a basic explanation of active learning with only one example of a query strategy; let me know if you'd like more videos on other machine learning strategies like this.
Here, a clear example of the active learning process is when you answer captchas on Google: it helps them identify complex images and build datasets using you and many other people as a committee jury for annotation, building cheap and great datasets while ensuring you are a human, serving two purposes. So next time you are annoyed by a captcha, just think that you are helping AI models progress!
But we have enough theory for now. I thought it would be great to partner with some friends from Encord, a great company I have known for a while now, to showcase a real example of active learning. Since we are on this theme, it's for sure the best platform I have seen yet for active learning, and the team is amazing. Before diving into a short practical example, I just wanted to mention that I will be at CVPR in person this year, and so will Encord. If you are attending in person too, let me know and go check out their booth, booth 1310.
Here's a quick demo we put together exploring one of Encord's products that perfectly fits this episode: Encord Active. It is basically an active learning platform where you can do everything we talked about in this video, without any coding, through a great visual interface. Here's what you would see in a classic visual task like segmentation. Once you open up your project, you directly have relevant information and statistics about your data. You'll see all the outlier characteristics of your data, which will help you figure out what causes the issues in your task. For example, here we see that blur is one of those outliers that has been automatically identified. If we check out the worst images for that category, we can easily find some problematic images and tag them for review, like here, where the image is super saturated. You can also visualize groups of data thanks to their embeddings, just like the CLIP embeddings you might have heard a lot about these days. Those embeddings can easily be compared and grouped when similar, helping you find problematic groups all at once instead of going through your data one by one.
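Encord Active handles this grouping without code, but if you wanted to sketch the underlying idea yourself, it could look roughly like this; the embedding and score files, cluster count, and blur metric are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical inputs: one embedding per image (e.g. precomputed CLIP
# embeddings) and a per-image quality score such as a blur metric.
embeddings = np.load("image_embeddings.npy")  # shape (n_images, dim), placeholder
blur_scores = np.load("blur_scores.npy")      # shape (n_images,), placeholder

# Group visually similar images together.
n_clusters = 20
clusters = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(embeddings)

# Rank groups by average blur so problematic clusters surface all at once.
mean_blur = [blur_scores[clusters == c].mean() for c in range(n_clusters)]
for c in np.argsort(mean_blur)[::-1]:
    print(f"cluster {c}: mean blur {mean_blur[c]:.3f}")
```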
Then, once you are satisfied with the images you've identified for review, you can simply export them to the Encord platform, where you can do your annotation directly. When you have your annotations and get back to the Encord Active platform, you can visualize what the data looks like with labels. You can see how the embedding plots have changed, now with the different classes attached. Here again, you can look at different subgroups of data to find problematic ones; for example, you can look at images containing school buses. This can be done using natural language to search for any information in the images, metadata, or classes, something quite necessary these days if you want to say you are working in AI. When you cannot easily find any more problems with your data, you train your model and come back to the platform to analyze its performance. Once again, you have access to a ton of valuable information about how well your model is performing. For example, if we take a look at the object area, where we see that small objects seem problematic, we can easily filter them out and create a new sub-dataset using only our problematic small-object images. The project is created in your Encord Active dashboard with all the same statistics you had, but for only this subset of the data.
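Again, the platform does this filtering visually, but the same kind of slicing can be sketched in a few lines of pandas; the file, column names, and area threshold below are hypothetical.

```python
import pandas as pd

# Hypothetical annotation table: one row per labeled object, with bounding
# box and image dimensions (file and column names are placeholders).
df = pd.read_csv("annotations.csv")

# Relative object area as a fraction of the image.
df["rel_area"] = (df["bbox_w"] * df["bbox_h"]) / (df["img_w"] * df["img_h"])

# Slice out the problematic small objects; the 1% threshold is an assumption.
small = df[df["rel_area"] < 0.01]
small.to_csv("small_objects_subset.csv", index=False)
print(f"{len(small)} of {len(df)} annotations go into the new subset")
```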
That is handy if you want to have a closer look or run experiments with this more complicated part of the data, like using it to train one of your committee models. And you repeat this loop over and over, annotating problematic data and improving your model as efficiently as possible. It will both reduce the need to pay expert annotators, especially if you work with medical applications as I do, or other applications where experts are quite expensive, and maximize the results of your model. I hope you can now see how valuable active learning can be, and maybe you'll even try it out with your own application. It can all be done with a single product if you want to; let me know if you do!
But before ending this video, I just wanted to thank Encord for sponsoring this week's episode with a great example of active learning and an amazing product. I also wanted to point out that they had a webinar on June 14th on how to build semantic search for visual data using ChatGPT and CLIP, hosted on Encord Active, with a recording available if you want to check it out. It's definitely worthwhile and super interesting. I hope you enjoyed this episode format as much as I enjoyed making it. Thank you for watching!