

One of the biggest and most elusive pieces of the augmented reality puzzle is occlusion. In other words, the ability to hide virtual objects behind real things.
This post is about why occlusion in AR is so hard and why deep learning might be the key to solving it in the future.
When you look at a purely real or a purely virtual world, you tend to accept the βrulesβ of that world or suspend disbelief, as long as it satisfies some basic notions of reality like gravity, lighting, shadows etc. Youβll notice when these rules are broken because itβs jarring and feels like something βdoesnβt look rightβ. Thatβs why itβs so instinctual to cringe at bad special effects in movies.
In VR, itβs actually quite easy to achieve suspension of disbelief because you have complete control over all elements in the scene. Unfortunately, as an AR developer, you donβt have this luxury because most of your appβs screen real estate (i.e. the real world) is totally out of your control.
As an AR developer, most of your appβs screen real estate (i.e. the real world) is totally out of yourΒ control.
In the mobile world, Appleβs ARKit has achieved incredibly fast motion tracking as well as realistic lighting and shadows, but itβs still lacking when it comes to occlusion.
Does the screenshot below look strange to you? Itβs because the dragon looks like itβs further away from the chair but still appears in front of the chair.
This isnβt just a problem with mobile AR. Itβs also a problem on every headset available today.
The goal of occlusion is to preserve the rules of line-of-sight when creating AR scenes. That means any virtual object that is behind a real object, should be βoccludedβ or hidden behind that real object.
So how is this done in ARΒ ? Basically, we selectively prevent parts of the virtual scene from rendering on the screen based some knowledge of the 3D structure of the real world.
Assuming you have a good reconstruction of all the objects in your real environment, occlusion involves simply rendering that model as a transparent mask in your scene. Thatβs the easy part. Itβs getting to that point where things start to get unwieldy.
Consider a common street scene. There are people, vehicles, trees and all kinds of objects at various distances from you. Further away, there are larger structures like bridges and buildings, each with their own unique features.
The hardest thing about creating a realistic occlusion mask is actually reconstructing a good enough model of the real world to apply that mask.
Thatβs because no AR device available today has the ability to perceive its environment precisely or quickly enough for realistic occlusion.
Sensing 3D structure really boils down to one important abilityβββdepth sensing. Depth sensors come in many flavors, the common ones being Structured Light, Time of Flight and Stereo Cameras.
In terms of hardware, Structured light and Time of Flight involve an Infrared projector and sensor pair, while Stereo requires two cameras at a fixed distance from each other, pointing in the same direction.
At a high level, hereβs how they work:
Structured Light sensing works by projecting an IR light pattern onto a 3D surface and using the distortions to reconstruct surface contours.
This sensor works by emitting rapid pulses of IR light that are reflected by objects in its field of view. The delay in the reflected light is used by an image sensor to calculate the depth at each pixel.
Stereo cameras simulate human binocular vision by measuring the displacement of pixels between the two cameras placed a fixed distance apart and use that to triangulate distances to points in the scene.
Of course, all these sensors have their limitations. IR based sensors like have a harder time functioning outdoors because bright sunlight (lots of IR) can wash out or add noise to the measurements. Stereo cameras have no problems working outdoors and consume less power, but they work best in well-lit areas with a lot of features and stark contrast.
All you need to do to confuse a stereo camera is point it at a flat whiteΒ wall.
Since all these sensors work on pixel-based measurements, any noise or error in the measurements creates holes in the depth image. Also, at the size and capacity of phones and headset devices today, the maximum range achieved so far has been about 3β4 meters.
The image below is an example of a depth map created with a stereo camera. The colors represent distance from the camera. See how the measurements are good at a close range while further objects are too noisy or ignored?
3D perception doesnβt end at depth sensing. The next step is to take the 2D depth image and turn it into a 3D point cloud model where each pixel in the depth image gets a 3D position relative to the camera.
Next, all the camera relative point clouds are fused with an estimate of camera motion to create a 3D point cloud map of the world around the sensor.
The video below illustrates the complete point cloud mapping process.
Now that you understand the complete pipeline of 3D perception, letβs look at how this translates to implementing occlusion.
Thereβs a few ways 3D depth information can be used to occlude virtual objects.
Directly use the 2D depth map coming in from the sensor.
In this method, we align the camera image and the depth map and hide parts of the scene that should be behind any pixels of the depth map. This method doesnβt really need a full 3D reconstruction since it just uses the depth image.
This makes it faster but has a few problems:
The video below is an example of depth map based occlusion. Notice the irregularities in the Occlusion mask as the red cube is moved around.
Re-construct and use the 3D point cloud model.
Since the point cloud is a geometrically accurate map of the real world, we can use it to create an occlusion mask. Note that a point cloud itself isnβt sufficient for occlusion, but point clouds can be processed to create meshes that essentially fit a surface onto the point map (like a blanket covering your 3D point cloud).
Meshes are much less computationally intensive than point clouds and are the go-to mechanism for calculations like detecting collisions in 3DΒ games.
This mesh can now be used to create the transparent mask we need to occlude virtual elements in our scene.
Well that sounds like we have a good enough solution for Occlusion! So whatβs the problem?
The 3 AR devices that I think have the most impressive tracking and mapping capabilities today are Google Tango, Microsoft Hololens, and Apple iPhone X. Hereβs how their sensors stack up against each other.
Google Tango (Discontinued by Google)
Depth SensorβββIR time-of-flight
Rangeβββ4m
Microsoft Hololens
Depth SensorβββIR time-of-flight
Rangeβββ4m
Apple iPhone X
Forward facing depth sensorβββIR Structured light
Back facing depth sensorβββStereo Cameras
Rangeβββ4m
The main problem with all the above systems is that in terms of depth sensing, they have:
Generating a mesh from a point cloud, currently, isnβt fast enough for real-time occlusion on any tablet or headsetΒ device.
So how does a developer today hack together a reasonable solution to get around these issues?
Perfect occlusion is an elusive target, but we can get close to it in some situations, especially when we can relax the real-time constraint.
If the application allows pre-mapping the environment, itβs possible to use a pre-built mesh as an occlusion mask for the larger prominent objects in the scene, provided they donβt move.
This means that youβre not limited to the 4m range of the depth sensor, at least for occlusion behind static objects.
Moving objects are still a problem and the only solution right now it to use the depth map masking method for close range moving objects like your hands.
Now itβs clear from the example mesh above that a big problem with pre-built meshes is that although theyβre lighter than point clouds, they can cause more than a ten-fold increase in the complexity of your 3D content.
The way to simply a 3D mesh is to approximate its structure with simpler objects like walls and blocks that envelope complex structures.
At Placenote, weβve built guided tours of large museums in AR and the way we hacked occlusion was to manually draw planes to cover specific walls in the space that might get in the way of our virtual content.
Of course, this method assumes that either the developer or the user will take the time to map the environment before the AR session.
Since this might be a bit overwhelming for the average user, it likely works best in location-based AR experiences where the map can be pre-built by the developer.
In an extreme scenario, you might want to occlude an AR experience at a much larger scale, like rendering a dinosaur walking among buildings in New York City. Perhaps, the way to do this is to use known 3D models of buildings from services like Google Maps or Mapbox to create occlusion surfaces at the city scale.
Our friends at Sturfee have built a unique way of creating city scale augmented reality experiences, using satellite imagery to reconstruct large buildings and static structures in 3D. Sheng Huang at Sturfee has written about their platform here.
Of course, this means you need to be able to accurately localize the device in 3D, which is quite challenging at that scale. GPS position is simply not good enough for occlusion since itβs slow (1Hz) and highly inaccurate (measurement error of 5β20 meters).
In fact, centimeter-level position tracking indoors and outdoors is a critical component of occlusion and through our work with Placenote, weβre working towards a cloud-based visual positioning system that can solve some of these problems.
While pre-built meshes are great for AR experiences tied to a single location, occluding moving objects still requires instant depth measurements at a range greater than just 4m.
Whatβs needed to create a realistic AR experience is a sensor that produces a high-resolution depth map with near infiniteΒ range.
Improvements in sensing hardware can certainly help squeeze greater resolution and range from IR or Stereo sensors, but these improvements will likely hit a ceiling and produce diminishing returns in the near future.
Interestingly, an alternative approach has emerged in 3D sensing research, that turns this hardware problem into a software problem by leveraging deep learning to improve the speed and quality of 3D reconstruction.
Neural networks might be the key to solving occlusion in theΒ future.
This method uses neural networks that can pick out visual cues in the scene to estimate 3D structure, much like the way we as humans estimate distance. (i.e. guessing distance by using our general knowledge of the sizes of things in the real world. The networks are trained on a large dataset of images and are capable of segmenting out objects in a scene and then recognizing them to estimate depth.
That means, if we can design the neural networks and train them on a good enough dataset, we might be able to bypass a lot of limitations in resolution and range present in current depth sensing technologies, with no added hardware costs.
The image above is from a paper that explores methods to segment and label scenes using neural networks in combination with depth sensors to improve the quality of generated maps.
You can find the full text of the paper here.
If youβre a new AR developer looking to build compelling AR experiences, donβt let occlusion stop you. Remember Pokemon Go? Poor occlusion in Pokemon Go resulted in some hilarious AR screenshots that spread all over the internet and helped with the meteoric rise of the game.
So have fun with it!
If you want to build amazing AR experiences on iOS or Unity, partner with us, or join our team, letβs connect!
Contact me at neil@vertical.ai
Weβre building an SDK for persistent, shared augmented reality experiences. We call it Placenote SDK.
Special thanks to Sheng Huang, Dominikus Baur, David Smooke and Peter Feld for your help with reviewing this article!
Create your free account to unlock your custom reading experience.