With LASR, you can generate 3D models of humans or animals moving from only a short video as input.
This task is called 3D reconstruction, and Google Research, along with Carnegie Mellon University, just published a paper called LASR: Learning Articulated Shape Reconstruction from a Monocular Video.
Learn more about the project below.
References:►Read the full article: https://www.louisbouchard.ai/3d-reconstruction-from-videos►Gengshan Yang et al., (2021), LASR: Learning Articulated Shape Reconstruction from a Monocular Video, CVPR, https://lasr-google.github.io/
How hard is it for a machine to understand an image?
Researchers have made a lot of progress in image classification, image detection, and
These three tasks iteratively deepen our understanding of what's going on in an image.
In the same order, classification tells us what's in the image.
Detection tells us where it is approximately, and segmentation precisely tells us where
Now, an even more complex step would be to represent this image in the real world.
In other words, it would be to represent an object taken from an image or video into a
3D surface, just like GANverse3D can do for inanimate objects, as I showed in a recent
This demonstrates a deep understanding of the image or video by the model, representing
the complete shape of an object, which is why it is such a complex task.
Even more challenging is to do the same thing on nonrigid shapes.
Or rather, on humans and animals, objects that can be weirdly shaped and even deformed
to a certain extent.
This task of generating a 3D model based on a video or images is called 3D reconstruction,
and Google Research, along with Carnegie Mellon University just published a paper called LASR:
Learning Articulated Shape Reconstruction from a Monocular Video.
As the name says, this is a new method for generating 3D models of humans or animals
moving from only a short video as input.
Indeed, it actually understands that this is an odd shape, that it can move, but still
needs to stay attached as this is still one "object" and not just many objects together.
Typically, 3D modeling techniques needed data prior.
In this case, the data prior was an approximate shape of the complex objects, which looks
As you can see, it had to be quite similar to the actual human or animal, which is not
With LASR, you can produce even better results.
With no prior at all, it starts with just a plain sphere whatever the object to reconstruct.
You can imagine what this means for generalizability and how powerful this can be when you don't
have to explicitly tell the network both what the object is and how it "typically" looks.
This is a significant step forward!
But how does it work?
As I said, it only needs a video, but there are still some pre-processing steps to do.
These steps are quite well-understood in computer vision.
As you may recall, I mentioned image segmentation at the beginning of the video.
We need this segmentation of an object that can be done easily using a trained neural
Then, we need the optical flow for each frame, which is the motion of objects between consecutive
frames of the video.
This is also easily found using computer vision techniques and improved with neural networks,
as I covered not even a year ago on my channel.
They start the rendering process with a sphere assuming it is a rigid object, so an object
that does not have articulations.
With this assumption, they optimize the shape and the camera viewpoint understanding of
their model iteratively for 20 epochs.
This rigid assumption is shown here with the number of bones equal to zero, meaning that
nothing can move separately.
Then, we get back to real life, where the human is not rigid.
Now, the goal is to have an accurate 3d model that can move realistically.
This is achieved by increasing the number of bones and vertices to make the model more
and more precise.
Here the vertices are 3-dimensional pixels where the lines and volumes of the rendered
object connect, and the bones are, well, they are basically bones.
These bones are all the parts of the objects that move during the video with either translations
Both the bones and vertices are incrementally augmented until we reach stage 3, where the
model has learned to generate a pretty accurate render of the current object.
Here, they also need a model to render this object, which is called a differentiable renderer.
I won't dive into how it works as I already covered it in previous videos, but basically,
it is a model able to create a 3-dimensional representation of an object.
It has the particularity to be differentiable.
Meaning that you can train this model in a similar way as a typical neural network with
Here, everything is trained together, optimizing the results following the four stages we just
saw improving the rendered result at each stage.
The model then learns just like any other machine learning model using gradient descent
and updating the model's parameters based on the difference between the rendered output
and the ground-truth video measurements.
So it doesn't even need to see a ground-truth version of the rendered object.
It only needs the video, segmentation, and optical flow results to learn by transforming
back the rendered object into a segmented image and its optical flow and comparing it
to the input.
What is even better is that all this is done in a self-supervised learning process.
Meaning that you give the model the videos to train on with their corresponding segmentation
and optical flow results, and it iteratively learns to render the objects during training.
No annotations are needed at all!
And Voilà, you have your complex 3D renderer without any special training or ground truth
If gradient descent, epoch, parameters, or self-supervised learning are still unclear
concepts to you, I invite you to watch the series of short videos I made explaining the
basics of machine learning.
As always, the full article is available on my website louisbouchard.ai, with many other
great papers explained and information.
Thank you for watching.