Have you ever dreamed of a good transcription tool that would accurately understand what you say and write it down? Not like the automatic YouTube translation tools… I mean, they are good but far from perfect. Just try it out and turn the feature on for the video, and you’ll see what I’m talking about.
Luckily, OpenAI just released and open-sourced a pretty powerful AI model just for that: Whisper.
It understands stuff I can’t even comprehend, not being a native English speaker (listen in the video) and it works for language translation too! Learn more in the video below...
►Read the full article: https://www.louisbouchard.ai/whisper/
►Paper: Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C. and Sutskever, I., 2022. Robust Speech Recognition via Large-Scale Weak Supervision.
►Project link: https://openai.com/blog/whisper/
►Code: https://github.com/openai/whisper
►Google Colab notebook: https://colab.research.google.com/github/openai/whisper/blob/master/notebooks/LibriSpeech.ipynb
►YouTube Whisperer app: https://huggingface.co/spaces/jeffistyping/Youtube-Whisperer
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/
Have you ever dreamed of a good transcription tool that would accurately understand what you say and write it down? Not like the automatic YouTube translation tools... I mean, they are good but far from perfect. Just try it out and turn the feature on for this video, and you'll see what I'm talking about. Well, OpenAI just released and open-sourced a pretty powerful AI model just for that: Whisper. It even understands stuff I can't comprehend, not being a native English speaker.
[Audio demo from the video:] "This is the Micro Machine Man presenting the most miniature motorcade of Micro Machines. Each one has dramatic details, terrific trim, precision paint jobs, plus incredible Micro Machine Pocket Play Sets."

It works for language translation too! [A French audio clip follows, which Whisper translates into English automatically.]
The results and precision are incredible, but what's even cooler is how it works. Let's dive into it! But first, let me introduce this episode's sponsor, which is highly related to this research: AssemblyAI.

AssemblyAI is the API platform for state-of-the-art AI models. From startups to Fortune 500 companies, developers and product teams around the world leverage AssemblyAI to build better AI-based products and features. If you are building a meeting summarizer, podcast analyzer, or really anything related to audio or video and want to leverage AI to power transcription or insights at scale, definitely check out their API platform. More specifically, I wanted to share their summarization model, which I find really cool. As the name says, with this model you can build tools that automatically summarize your audio and video files. The model is flexible to fit your use case and can be customized to different summary types: bullets, paragraphs, headlines, or a gist. It all works through simple API calls, and you can find all the information you need about the summarization model and AssemblyAI with the first link below.
When it comes to the model itself, Whisper is pretty classic: it is built on the Transformer architecture, stacking encoder blocks and decoder blocks with the attention mechanism propagating information between both. It takes the audio recording, splits it into 30-second chunks, and processes them one by one. For each 30-second recording, it encodes the audio using the encoder section, saving the position of each word said, and leverages this encoded information to find what was said using the decoder. The decoder predicts what we call tokens from all this information, which are basically each word being said. Then it repeats this process for the next word, using all the same information as well as the previously predicted word, helping it to guess the next one that will make more sense. As I said, the overall architecture is a classic encoder-decoder, similar to GPT-3 and other language models, which I covered in multiple videos that I invite you to check out for more architectural details.

This works because it was trained on 680,000 hours of multilingual and multitask supervised data collected from the web, meaning that they trained their audio model in a similar way as GPT-3, with data available on the internet, making it a large and general audio model. It also makes the model way more robust than others. In fact, they mention that Whisper approaches human-level robustness thanks to being trained on such a diverse set of data, ranging from clips, TED Talks, podcasts, interviews, and more, which all represent real-world-like data, some of it transcribed using machine-learning-based models and not humans. Using such imperfect data certainly reduces the possible precision, but I will argue it helps robustness when used sparingly compared to purely human-curated audio datasets with perfect transcriptions.
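The chunk-and-decode loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Whisper's real API: the 16 kHz sample rate and 30-second window come from the paper, while `encode` and `decode_step` are hypothetical stand-ins for the Transformer encoder and decoder.

```python
# Sketch of Whisper's processing loop: split the recording into
# 30-second chunks, encode each chunk once, then predict tokens
# one at a time with the decoder, feeding each prediction back in.
# SAMPLE_RATE and CHUNK_SECONDS follow the paper; encode/decode_step
# are illustrative stand-ins, not Whisper's real API.

SAMPLE_RATE = 16_000   # Whisper resamples all audio to 16 kHz
CHUNK_SECONDS = 30     # fixed 30-second processing window

def split_into_chunks(samples):
    """Split raw audio samples into consecutive 30-second chunks."""
    size = SAMPLE_RATE * CHUNK_SECONDS
    return [samples[i:i + size] for i in range(0, len(samples), size)]

def transcribe(samples, encode, decode_step, end_token="<eot>"):
    """Greedy autoregressive decoding, one chunk at a time."""
    tokens = []
    for chunk in split_into_chunks(samples):
        features = encode(chunk)  # encoder: audio chunk -> hidden features
        prev = "<sot>"            # start-of-transcript token
        while True:
            # decoder: predict the next token from the audio features,
            # everything decoded so far, and the previous token
            prev = decode_step(features, tokens, prev)
            if prev == end_token:
                break
            tokens.append(prev)   # feed the prediction back in
    return tokens
```

With stub functions for `encode` and `decode_step`, the loop runs as-is; in the real model, those two steps are the encoder and decoder stacks.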
Having such a general model isn't very powerful in itself, as it will be beaten at most tasks by smaller, more specific models adapted to the task at hand, but it has other benefits: you can take this kind of pre-trained model and fine-tune it on your task, meaning that you retrain a part of it, or the entire thing, with your own data. This technique has been shown to produce much better models than training from scratch with your data alone. And what's even cooler is that OpenAI open-sourced the code and everything, instead of just an API, so you can use Whisper as a pre-trained foundation architecture to build upon and create more powerful models for yourself.
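Since everything is open-sourced, trying it out takes only a couple of commands. A minimal sketch, assuming the `openai-whisper` pip package, `ffmpeg` on your PATH, and a local file named `audio.mp3` as a placeholder for your own recording:

```shell
# Install the open-sourced package released by OpenAI.
pip install -U openai-whisper

# Transcribe a local recording (the "small" checkpoint downloads on first run).
whisper audio.mp3 --model small

# The same model can translate non-English speech into English text.
whisper audio.mp3 --model small --task translate
```

Larger checkpoints (`medium`, `large`) trade speed for accuracy, which is worth trying if your audio is noisy or heavily accented.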
Some people have already released tools like the YouTube Whisperer on Hugging Face by jeffistyping, which takes a YouTube link and generates transcriptions, and which I found thanks to Yannic Kilcher. They also released a Google Colab notebook to play with right away. While some would say competition is key, I'm glad OpenAI is releasing some of its work to the public; I'm convinced such collaborations are the best way to advance our field. Let me know what you think: whether you'd like to see more public releases from OpenAI, or whether you prefer the final products they build, like DALL·E. As always, you can find more information about Whisper in the paper and code linked below. I hope you've enjoyed this video, and I will see you next week with another amazing paper!