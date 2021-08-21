Tesla AI Day: How Does Tesla's Autopilot Work

This week, I cover Andrej Karpathy's talk at Tesla AI Day on how Tesla's Autopilot works.

Learn more in this short video

Video Transcript

If you wonder how a Tesla car can not only see but also navigate the roads alongside other vehicles, this is the video you were waiting for. A couple of days ago was the first Tesla AI Day, where Andrej Karpathy, the Director of AI at Tesla, and others presented how Tesla's Autopilot works, from image acquisition through its eight cameras to the navigation process on the roads.

Tesla's cars have eight cameras, like in this illustration, allowing the vehicle to see its surroundings and far in front.

Unfortunately, you cannot simply take all the information from these eight cameras and send it directly to an AI that will tell you what to do, as this would be way too much information to process at once, and our computers aren't this powerful yet. Just imagine trying to do this yourself, having to process everything all around you. Honestly, I find it difficult to turn left when there are no stop signs and you need to check both sides multiple times before making a decision. Well, it's the same for neural networks, or more precisely, for computing devices like CPUs and GPUs.

To attack this issue, we have to compress the data while keeping the most relevant information, similar to what our brain does with the information coming from our eyes. To do this, Tesla transfers the data from these eight cameras into a much smaller space they call the vector space. This space is a three-dimensional space that looks just like this and contains all the relevant information in the world, like the road signs, cars, people, lines, etc. This new space is then used for the many different tasks the car has to do, like object detection, traffic lights, lane prediction, etc.

But how do they go from eight cameras, which means eight three-dimensional inputs composed of red-green-blue images, to a single output in three dimensions? This is achieved in four steps and done in parallel for all eight cameras, making it super efficient.

At first, the images are sent into a rectification model, which takes the images and calibrates them by translating them into a virtual representation. This step dramatically improves the Autopilot's performance because it makes the images look more similar to each other when nothing is happening, allowing the network to compare the images more easily and focus on essential components that aren't part of the typical background.

Then, these new versions of the images are sent into a first network called RegNet. This RegNet is just an optimized version of the convolutional neural network (CNN) architecture. If you are not familiar with this kind of architecture, you should pause the video and quickly watch the simple explanation I made, appearing in the top right corner right now. Basically, it takes these newly made images and compresses the information iteratively, like a pyramid, where the start of the network is composed of a few neurons representing some variations of the images, focusing on specific objects and telling us where they are spatially. Then, the deeper we get, the smaller these images will be, but they will represent the overall images while also focusing on specific objects. So at the end of this pyramid, you will end up with many neurons, each telling you general information about the overall picture: whether it contains a car, a road sign, etc.

In order to have the best of both worlds, we extract the information at multiple levels of this pyramid, which can also be seen as image representations at different scales focusing on specific features in the original image. We end up with local and general information, all of it together telling us what the images are composed of and where it is.
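To make the pyramid idea a bit more concrete, here is a minimal sketch in plain Python. It is not Tesla's RegNet: the learned convolutional stages are replaced by simple 2x2 average pooling, and the names `avg_pool_2x2` and `feature_pyramid`, as well as the toy 8x8 image, are assumptions made only for illustration. It shows how keeping the output of every level gives both fine, local maps (where the object is) and coarse, global ones (what the picture contains overall).

```python
def avg_pool_2x2(image):
    """Halve height and width by averaging each 2x2 block.

    This is only a stand-in for a learned CNN stage (assumption for illustration).
    """
    rows, cols = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def feature_pyramid(image, levels):
    """Return one map per scale, from fine (local detail) to coarse (global context)."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(avg_pool_2x2(pyramid[-1]))
    return pyramid


# Hypothetical 8x8 single-channel image containing one bright 2x2 "object".
image = [[0.0] * 8 for _ in range(8)]
for r in (2, 3):
    for c in (4, 5):
        image[r][c] = 1.0

pyramid = feature_pyramid(image, levels=4)
shapes = [(len(level), len(level[0])) for level in pyramid]
print(shapes)  # sizes shrink by half at every level
```

The early levels still tell us where the bright object sits, while the last level has collapsed to a single number summarizing the whole image. Reading from several levels at once is what gives the "best of both worlds" mentioned above.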

Then, this information is sent into a model called BiFPN, which will force this information from different scales to talk together and extract the most valuable knowledge among the general and specific information it contains. The output of this network will be the most interesting and useful information from all these different scales of the eight cameras' information. So it contains both the general information about the images, which is what they contain, and the specific information, such as where it is, its size, etc. For example, it will use the context coming from the general knowledge of deep features extracted at the top of the pyramid to understand that since these two blurry lights are on the road between two lanes, they are probably attached to a specific object that was identified from one camera in the early layers of the network. Using both this context and knowing it is part of a single object, one could successfully guess that these blurry lights are attached to a car.

So now we have the most useful information coming from different scales for all eight cameras. We need to compress this information so we don't have eight different data inputs, and this is done using a transformer block. If you are not familiar with transformers, I invite you to watch my video covering them in vision applications. In short, this block will take the condensed information of the eight different pictures and transfer it into the three-dimensional space we want, the vector space. It will take this general and spatial information, here called the key, calculate the query, which has the dimensions of our vector field, and try to find what goes where. For example, one of these queries could be seen as a pixel of the resulting vector space looking for a specific part of the car in front of us. The value will merge both of these accordingly, telling us what is where in this new vector space. This transformer can be seen as the bridge between the eight cameras and this new 3D space to understand all interrelations between the cameras.

Now that we have finally condensed the data into a 3D representation, we can start the real work. This is the space where they annotate the data they use for training their navigation network, as the space is much less complex than eight cameras and easier to annotate.

OK, so we have an efficient way of representing all our eight cameras now, but we still have a problem: single camera inputs are not intelligent.
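Before moving on to that problem, the query, key, and value description of the transformer block given earlier can be sketched in a few lines. This is standard scaled dot-product attention written in plain Python, not Tesla's implementation. The function names and the toy numbers are assumptions: the three keys and values play the role of camera features, and the two queries play the role of cells of the output vector space asking "what goes here?".

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys, values):
    """For each query, blend the values weighted by query-key similarity."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        blended = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy data: three camera features and two output cells.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
queries = [[4.0, 0.0], [0.0, 4.0]]

outputs, weights = cross_attention(queries, keys, values)
for cell, row in zip(outputs, weights):
    print([round(x, 2) for x in cell], [round(w, 2) for w in row])
```

Each output cell ends up as a mix of the camera features that best match its query, which is the sense in which the block bridges several camera views and a single shared space.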

If a car on the opposite side is occluded by another car, we need the Autopilot to know it is still there and that it hasn't disappeared because another car went in front of it for a second. To fix this, we have to use time information, or in other words, use multiple frames.

They chose to use a feature queue and a video module. The feature queue will take a few frames and save them in the cache. Then, for every meter the car travels or every 27 milliseconds, it will send the cached frames to the model. Here, they use both a time and a distance measure to cover when the car is moving and when it is stopped.
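The time-or-distance rule can be illustrated with a small sketch. It is a simplification under stated assumptions, not Tesla's code: a single queue with an "either condition" trigger, a hypothetical `FeatureQueue` class, and a made-up cache size of three frames. Only the 27 millisecond and one meter thresholds come from the talk as summarized above.

```python
from collections import deque


class FeatureQueue:
    """Cache of recent frame features, stored on elapsed time OR travelled distance."""

    def __init__(self, max_frames=3, time_step_ms=27.0, distance_step_m=1.0):
        # max_frames is a hypothetical cache size; oldest frames drop out automatically.
        self.frames = deque(maxlen=max_frames)
        self.time_step_ms = time_step_ms
        self.distance_step_m = distance_step_m
        self.last_time_ms = None
        self.last_position_m = None

    def maybe_push(self, features, time_ms, position_m):
        """Store the frame if enough time or distance has passed; return True if stored."""
        first = self.last_time_ms is None
        due_by_time = not first and time_ms - self.last_time_ms >= self.time_step_ms
        due_by_distance = (
            not first and position_m - self.last_position_m >= self.distance_step_m
        )
        if first or due_by_time or due_by_distance:
            self.frames.append(features)
            self.last_time_ms = time_ms
            self.last_position_m = position_m
            return True
        return False


queue = FeatureQueue(max_frames=3)
pushed = [
    queue.maybe_push("f0", time_ms=0.0, position_m=0.0),   # first frame is always stored
    queue.maybe_push("f1", time_ms=10.0, position_m=0.2),  # too soon and too close: skipped
    queue.maybe_push("f2", time_ms=15.0, position_m=1.1),  # moved more than 1 m: stored
    queue.maybe_push("f3", time_ms=45.0, position_m=1.1),  # stopped, but 30 ms elapsed: stored
    queue.maybe_push("f4", time_ms=72.0, position_m=1.1),  # exactly 27 ms elapsed: stored
]
print(pushed, list(queue.frames))
```

The distance trigger keeps the cache meaningful while the car is moving, and the time trigger keeps it fresh while the car is stopped, for example at a red light.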

Then, these 3D representations of the frames we just processed are merged with their corresponding positions and kinematic data containing the car's acceleration and velocity, informing us how it is moving at each frame. All this precious information is then sent into the video module. This video module uses it to understand the car itself and its environment in the present and past few frames. This understanding process is done using a recurrent neural network that processes all the information iteratively over all frames to understand the context better and finally build this well-defined map you can see. If you are not familiar with recurrent neural networks, I will again orient you to a video I made explaining them. Since it uses past frames, the network now has much more information to understand better what is happening, which will be necessary for temporary occlusions.

This is the final architecture of the vision process, with its output on the right, and below you can see some of these outputs translated back into the images to show what the car sees in our representation of the world, or rather the eight cameras' representation of it. We finally have this video module output that we can send in parallel to all the car's tasks, such as object detection, lane prediction, traffic lights, etc.

If we summarize this architecture, we first have the eight cameras taking pictures. Then, they are calibrated and sent into a CNN condensing the information, which extracts information from them efficiently and merges everything before sending this into a transformer architecture that will fuse the information coming from all eight cameras into one 3D representation.

Finally, this 3D representation will be saved in the cache over a few frames and then sent into an RNN architecture that will use all these frames to better understand the context and output the final version of the 3D space to send to our tasks, which can finally be trained individually and may all work in parallel to maximize performance and efficiency.

As you can see, the biggest challenge for such a task is an engineering challenge: make a car understand the world around us as efficiently as possible through cameras and speed sensors so it can all run in real time and with close-to-perfect accuracy for many complicated human tasks.

Of course, this was just a simple explanation of how Tesla Autopilot sees our world. I strongly recommend watching the amazing video on Tesla's YouTube channel, linked in the description below, for more technical details about the models they use, the challenges they face, the data labeling and training process with their simulation tool, their custom software and hardware, and the navigation. It is definitely worth the time. Thank you for watching.

References

►Read the full article: https://www.louisbouchard.ai/tesla-autopilot-explained-tesla-ai-day/

►"Tesla AI Day", Tesla, August 19th, 2021, https://youtu.be/j0z4FweCy4M

►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/