If you wonder how a Tesla car can not only see but also navigate the roads with other vehicles, this is the video you've been waiting for. A couple of days ago, Tesla held its first AI Day, where Andrej Karpathy, the Director of AI at Tesla, and others presented how Tesla's Autopilot works, from image acquisition through the car's eight cameras to the navigation process on the roads.
This week, I cover Andrej Karpathy's talk at Tesla AI Day on how Tesla's autopilot works.
If you wonder how a Tesla car can not only see but also navigate the roads with other vehicles, this is the video you were waiting for. A couple of days ago was the first Tesla AI Day, where Andrej Karpathy, the Director of AI at Tesla, and others presented how Tesla's Autopilot works, from the image acquisition through their eight cameras to the navigation process on the roads.

Tesla's cars have eight cameras, as in this illustration, allowing the vehicle to see its surroundings and far in front.
Unfortunately, you cannot simply take all the information from these eight cameras and send it directly to an AI that will tell you what to do, as this would be way too much information to process at once, and our computers aren't this powerful yet. Just imagine trying to do this yourself, having to process everything all around you. Honestly, I find it difficult to turn left when there are no stop signs and you need to check both sides multiple times before making a decision. Well, it's the same for neural networks, or more precisely, for computing devices like CPUs and GPUs.

To attack this issue, we have to compress the data while keeping the most relevant information, similar to what our brain does with the information coming from our eyes. To do this, Tesla transfers these eight cameras' data into a smaller space they call the much smaller vector space. This space is a three-dimensional space that looks just like this and contains all the relevant information in the world, like the road signs, cars, people, lines, etc. This new space is then used for many different tasks the car will have to do, like object detection, traffic lights, lane prediction, etc.
But how do they go from eight cameras, which means eight three-dimensional inputs composed of red, green, and blue images, to a single output in three dimensions? This is achieved in four steps, done in parallel for all eight cameras, making it super efficient.

At first, the images are sent into a rectification model, which takes the images and calibrates them by translating them into a virtual representation. This step dramatically improves the Autopilot's performance because it makes the images look more similar to each other when nothing is happening, allowing the network to compare the images more easily and focus on essential components that aren't part of the typical background.
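To make this calibration step a bit more concrete, here is a minimal sketch of what such a per-camera rectification could look like, assuming each camera's intrinsics and its rotation into a shared virtual camera are known from offline calibration. The function name and the pure-rotation homography are my own illustration; Tesla's actual rectification pipeline is not public.

```python
import cv2
import numpy as np

def rectify_to_virtual_camera(image: np.ndarray,
                              K_cam: np.ndarray,      # 3x3 intrinsics of the real camera
                              R: np.ndarray,          # rotation from real to virtual camera
                              K_virtual: np.ndarray   # 3x3 intrinsics shared by all 8 views
                              ) -> np.ndarray:
    """Warp one camera image into a common 'virtual camera' frame so that
    all eight views share the same intrinsics and orientation convention."""
    # Pure-rotation homography: x_virtual ~ K_virtual * R * K_cam^-1 * x_real
    H = K_virtual @ R @ np.linalg.inv(K_cam)
    h, w = image.shape[:2]
    return cv2.warpPerspective(image, H, (w, h))
```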
Then, these new versions of the images are sent into a first network called RegNet. This RegNet is just an optimized version of the convolutional neural network (CNN) architecture. If you are not familiar with this kind of architecture, you should pause the video and quickly watch the simple explanation I made, appearing in the top right corner right now. Basically, it takes these newly made images and compresses the information iteratively, like a pyramid: the start of the network is composed of a few neurons representing some variations of the images, focusing on specific objects and telling us precisely where they are. Then, the deeper we get, the smaller these images will be, but they will represent the overall images while also focusing on specific objects. So at the end of this pyramid, you will end up with many neurons, each telling you general information about the overall picture, whether it contains a car, a road sign, etc. In order to have the best of both worlds, we extract the information at multiple levels of this pyramid, which can also be seen as image representations at different scales focusing on specific features in the original image. We end up with local and general information, all of it together telling us what the images are composed of and where it is.
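If it helps to picture this pyramid, here is a toy multi-scale backbone in PyTorch. It is not RegNet itself, just a minimal stand-in showing the idea of keeping feature maps from several depths, from high-resolution local detail to low-resolution global context; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Toy stand-in for the RegNet backbone: each stage halves the resolution,
    and we keep the output of every stage as one level of the 'pyramid'."""
    def __init__(self, in_ch: int = 3):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x: torch.Tensor):
        f1 = self.stage1(x)   # high resolution: local details (edges, small objects)
        f2 = self.stage2(f1)  # medium resolution
        f3 = self.stage3(f2)  # low resolution: global context about the whole image
        return [f1, f2, f3]   # features extracted at multiple levels of the pyramid

# The eight camera images can be processed in parallel as one batch.
images = torch.randn(8, 3, 256, 512)            # dummy batch: 8 cameras, RGB
multi_scale_features = TinyBackbone()(images)   # three feature maps per camera
```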
Then, this information is sent into a model called BiFPN, which forces this information from different scales to talk together and extracts the most valuable knowledge among the general and specific information it contains. The output of this network will be the most interesting and useful information from all these different scales of the eight cameras' information. So it contains both the general information about the images, which is what they contain, and the specific information, such as where an object is, its size, etc. For example, it will use the context coming from the general knowledge of deep features extracted at the top of the pyramid to understand that, since these two blurry lights are on the road between two lanes, they are probably attached to a specific object that was identified by one camera in the early layers of the network. Using both this context and knowing it is part of a single object, one could successfully guess that these blurry lights are attached to a car.
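The real BiFPN repeats bidirectional top-down and bottom-up passes with learned fusion weights; the sketch below keeps only the core idea of letting the scales "talk together", projecting every level to a common width and mixing them with learned weights. The channel counts match the toy backbone above and are otherwise arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleScaleFusion(nn.Module):
    """Heavily reduced BiFPN-style block: project each pyramid level to a common
    channel width, resize everything to one shared resolution, and combine the
    levels with learned, normalized weights."""
    def __init__(self, in_channels=(32, 64, 128), out_ch: int = 64):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_channels])
        self.fusion_weights = nn.Parameter(torch.ones(len(in_channels)))

    def forward(self, feats):
        target_hw = feats[1].shape[-2:]  # fuse everything at the middle scale
        resized = [
            F.interpolate(proj(f), size=target_hw, mode="bilinear", align_corners=False)
            for proj, f in zip(self.proj, feats)
        ]
        w = torch.softmax(self.fusion_weights, dim=0)  # how much each scale contributes
        return sum(wi * fi for wi, fi in zip(w, resized))

# Dummy pyramid for one batch of 8 camera images (shapes from the toy backbone above).
feats = [torch.randn(8, 32, 128, 256), torch.randn(8, 64, 64, 128), torch.randn(8, 128, 32, 64)]
fused = SimpleScaleFusion()(feats)  # (8, 64, 64, 128): general + specific information combined
```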
So now we have the most useful information coming from different scales for all eight cameras. We need to compress this information so we don't have eight different data inputs, and this is done using a transformer block. If you are not familiar with transformers, I invite you to watch my video covering them in vision applications. In short, this block will take the condensed information from the eight different pictures and transfer it into the three-dimensional space we want: the vector space. It will take this general and spatial information, here called the key, calculate the query, which has the dimensions of our vector space, and will try to find what goes where. For example, one of these queries could be seen as a pixel of the resulting vector space looking for a specific part of the car in front of us. The value will merge both of these accordingly, telling us what is where in this new vector space. This transformer can be seen as the bridge between the eight cameras and this new 3D space, understanding all the interrelations between the cameras.
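Here is a minimal sketch of that bridge using standard cross-attention, assuming the fused image features have already been flattened into one sequence of tokens. Each learned query corresponds to one cell of the top-down vector space and attends over the keys/values coming from all eight cameras; the grid size, feature width, and head count are placeholders, not Tesla's values.

```python
import torch
import torch.nn as nn

class CameraToVectorSpace(nn.Module):
    """Cross-attention from learned 'vector space' queries to image features:
    each query asks "what belongs in this cell of the top-down space?", while
    the keys and values are the fused features from all eight cameras."""
    def __init__(self, feat_dim: int = 64, grid_h: int = 50, grid_w: int = 50):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(grid_h * grid_w, feat_dim))
        self.attn = nn.MultiheadAttention(embed_dim=feat_dim, num_heads=4, batch_first=True)
        self.grid_hw = (grid_h, grid_w)

    def forward(self, cam_tokens: torch.Tensor) -> torch.Tensor:
        # cam_tokens: (batch, n_tokens, feat_dim), all cameras flattened together
        b = cam_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(query=q, key=cam_tokens, value=cam_tokens)
        h, w = self.grid_hw
        return out.transpose(1, 2).reshape(b, -1, h, w)  # (batch, feat_dim, grid_h, grid_w)

# Dummy input: 8 cameras x 512 spatial tokens each, 64 features per token.
cam_tokens = torch.randn(1, 8 * 512, 64)
vector_space = CameraToVectorSpace()(cam_tokens)  # one 3D-space representation for all cameras
```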
05:37
that we have finally condensed your data
05:40
into a 3d representation we can start
05:42
the real work this is a space where they
05:45
annotate the data they use for training
05:47
their navigation network as the space is
05:49
much less complex than 8 cameras and
05:52
easier to annotate ok so we have an
05:54
efficient way of representing all our 8
05:57
cameras now but we still have a problem
05:59
single camera inputs are not intelligent
06:02
if a car on the opposite side is
06:04
occluded by another car we need the
06:06
autopilot to know it is still there and
06:08
it hasn't disappeared because another
06:10
car went in front of it for a second to
06:13
fix this we have to use time information
06:16
or in other words use multiple frames
They chose to use a feature queue and a video module. The feature queue will take a few frames and save them in a cache. Then, for every meter the car travels, or every 27 milliseconds, it will send the cached frames to the model. Here, they use both a time and a distance measure to cover the cases where the car is moving and where it is stopped.
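Below is a toy version of such a queue. The talk describes both a time-based and a space-based queue; this sketch folds the two triggers into one class for brevity. Only the 27 ms and one-meter figures come from the video; everything else (names, cache length) is made up for illustration.

```python
import collections
import time
from typing import Optional

class FeatureQueue:
    """Toy feature queue: keep the last few feature maps and push a new one
    either every ~27 ms or every metre travelled, whichever triggers first."""
    def __init__(self, maxlen: int = 20, time_step_s: float = 0.027, dist_step_m: float = 1.0):
        self.cache = collections.deque(maxlen=maxlen)  # oldest frames drop out automatically
        self.time_step_s = time_step_s
        self.dist_step_m = dist_step_m
        self.last_time = float("-inf")
        self.last_odometer = float("-inf")

    def maybe_push(self, features, odometer_m: float, now_s: Optional[float] = None) -> list:
        """Cache `features` if enough time has passed or enough distance was
        travelled, then return the frames that would feed the video module."""
        now_s = time.monotonic() if now_s is None else now_s
        if (now_s - self.last_time >= self.time_step_s
                or odometer_m - self.last_odometer >= self.dist_step_m):
            self.cache.append(features)
            self.last_time = now_s
            self.last_odometer = odometer_m
        return list(self.cache)
```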
Then, these 3D representations of the frames we just processed are merged with their corresponding positions and kinematic data containing the car's acceleration and velocity, informing us how it is moving at each frame. All this precious information is then sent into the video module. This video module uses these to understand the car itself and its environment in the present and the past few frames. This understanding process is done using a recurrent neural network that processes all the information iteratively over all frames to better understand the context and finally build this well-defined map you can see. If you are not familiar with recurrent neural networks, I will again point you to a video I made explaining them. Since it uses past frames, the network now has much more information to better understand what is happening, which is necessary for temporary occlusions.
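As a rough illustration of that temporal step, here is a tiny recurrent module that consumes the cached frames together with per-frame kinematics. The real video module operates spatially over the whole vector-space grid; here each frame is assumed to already be pooled into a single feature vector, and the dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class VideoModule(nn.Module):
    """Toy video module: a GRU walks over the cached frames (plus kinematics)
    and keeps a hidden state summarizing the recent past, which is what lets
    the model remember a car through a brief occlusion."""
    def __init__(self, feat_dim: int = 64, kin_dim: int = 2, hidden_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(input_size=feat_dim + kin_dim, hidden_size=hidden_dim, batch_first=True)

    def forward(self, frame_feats: torch.Tensor, kinematics: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, n_frames, feat_dim)  pooled features of each cached frame
        # kinematics:  (batch, n_frames, kin_dim)   e.g. velocity and acceleration per frame
        x = torch.cat([frame_feats, kinematics], dim=-1)
        out, _ = self.rnn(x)
        return out[:, -1]  # context-aware summary of the current frame

# 20 cached frames, 64 features each, plus velocity/acceleration.
summary = VideoModule()(torch.randn(1, 20, 64), torch.randn(1, 20, 2))  # (1, 128)
```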
This is the final architecture of the vision process, with its output on the right, and below, you can see some of these outputs translated back into the images to show what the car sees in our representation of the world, or rather the eight cameras' representation of it. We finally have this video module output that we can send in parallel to all the car's tasks, such as object detection, lane prediction, traffic lights, etc.
If we summarize this architecture: we first have the eight cameras taking pictures. Then, they are calibrated and sent into a CNN condensing the information, which extracts features from them efficiently and merges everything before sending this into a transformer architecture that fuses the information coming from all eight cameras into one 3D representation. Finally, this 3D representation is saved in a cache over a few frames and then sent into an RNN architecture that uses all these frames to better understand the context and outputs the final version of the 3D space to send to our tasks, which can finally be trained individually and all work in parallel to maximize performance and efficiency.
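To show what "trained individually and run in parallel" can look like in practice, here is a hypothetical set of task heads sharing one representation. The head names and output sizes are invented for illustration; they are not the actual Autopilot heads.

```python
import torch
import torch.nn as nn

class TaskHeads(nn.Module):
    """Several task-specific heads reading the same shared representation, so
    each task can be trained on its own labels while they all run in parallel."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.heads = nn.ModuleDict({
            "objects": nn.Linear(feat_dim, 10),         # hypothetical output sizes
            "lanes": nn.Linear(feat_dim, 4),
            "traffic_lights": nn.Linear(feat_dim, 3),
        })

    def forward(self, shared_features: torch.Tensor) -> dict:
        # One shared input, one output per task.
        return {name: head(shared_features) for name, head in self.heads.items()}

predictions = TaskHeads()(torch.randn(1, 128))  # {'objects': ..., 'lanes': ..., 'traffic_lights': ...}
```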
As you can see, the biggest challenge for such a task is an engineering one: making a car understand the world around us as efficiently as possible through cameras and speed sensors, so it can all run in real time and with close-to-perfect accuracy for many complicated human tasks.

Of course, this was just a simple explanation of how Tesla's Autopilot sees our world. I strongly recommend watching the amazing video on Tesla's YouTube channel, linked in the description below, for more technical details about the models they use, the challenges they face, the data labeling and training process with their simulation tool, their custom software and hardware, and the navigation. It is definitely worth the time. Thank you for watching!
►Read the full article: https://www.louisbouchard.ai/tesla-autopilot-explained-tesla-ai-day/
►"Tesla AI Day", Tesla, August 19th, 2021,
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/