Akshay Kore


A machine and human’s perception of the world in Augmented Reality

Augmented reality is the culmination of machines helping us see an enhanced view of the world. However, in order for machines to understand our reality and vice versa, it is imperative to look at how each of these groups perceives the world.

A computer’s understanding of space for Augmented Reality

The goal of Augmented Reality is to superimpose the computer’s perception of space with human’s understanding of it. In computer science, space is simply a metaphor for commonly agreed and scientifically validated concepts of space, time and matter. A computer’s understanding of space is nothing more than a mathematically defined 3D representation of objects, location and matter. It can be simply understood by means of coordinate systems without the need of confusing jargon like hyper-realties or alternate universes. Although these are definitely interesting thought experiments. A virtual space is nothing but a computer’s understanding of the real world as provided by humans.

Humans are spatial beings. We interact with and understand a large portion of our realities in three dimensions. As Augmented reality tries to simulate virtual worlds into human reality, it is important to understand the basic aspects of virtual 3D spaces.

Visual space and object space

What we perceive as location of objects in the environment is the reconstruction of light patterns on the retina. A visual space in computer graphics can be defined as perceived space or a visual scene of a 3D virtual space being experienced by a participant.

The virtual space in which the object exists is called the object space. It is a direct counterpart of the visual space.

Each eye sees the visual space differently. This is a critical challenge of computer graphics for binocular virtual devices or smart glasses. In order to design for virtual worlds, it is important to have a common understanding of the position and orientation of virtual objects in the real world. Common co-ordinate and orientation systems greatly help here.

Position and coordinates

Three types of coordinate systems are used for layout and programming of virtual and augmented reality applications:

Cartesian Coordinates

The Cartesian coordinate system is used mainly for it’s simplicity and familiarity and most virtual spaces are defined by it. The x-y-z based coordinate system is precise for specifying location of 3D objects in virtual space. The three coordinate planes are perpendicular to each other. Distances and locations are specified from the point of origin which is the point where the three planes intersect with each other. This system is mainly used for defining visual coordinates of 3D objects.

Cartesian Coordinates

Spherical Polar Coordinates

The Cartesian system defines the positions of 3D objects often with respect to an origin point. A system of spherical polar coordinates is used when locating objects and features with respect to the users’ position. This system is used mainly for mapping of a virtual sound source, or the mapping of spherical video in the case of first person based immersive VR. The Spherical coordinate system is based on perpendicular planes bisecting a sphere and consists of three elements: azimuth, elevation and distance. Azimuth is the angle from the origin point in the horizontal/ground plane, while the elevation is the angle in the vertical plane. Distance is the magnitude or range from the origin.

Spherical Polar Coordinates

Cylindrical Coordinates

This system is mainly used in VR applications for viewing 360 degree panoramas. The cylindrical system allows for precise mapping and alignment of still images to overlap for edge stitching in panoramas. The system consists of a central reference axis (L) with an origin point (O). The radial distance (ρ) is defined from the origin (O). The angular coordinate (φ) is defined for the radial distance (ρ) along with a height (z). Although this system is good for scenarios that require rotational symmetry, it is limited in terms of it’s vertical view.

Cylindrical Coordinates

Defining orientation and rotation

It is necessary to define the orientation and rotation of user viewpoints and objects along with their position in the virtual space. Knowing this information is especially important when tracking where the user is looking at or for knowing the orientation of virtual objects with respect to the visual space.

Six degrees of freedom (6 DOF)

In virtual and augmented reality, it is common to define orientation and rotation with three independent values. These are referred as roll (x), pitch (y) and yaw (z) and are know an Tait-Byan angles. A combination of position (x-y-z) and orientation (roll-pitch-yaw) is referred to as six degrees of freedom (6 DOF).

Orientation and Rotation


Navigation and way finding are two of the most complex concepts in virtual space especially for VR and AR. It can be handled by either physical movement of the user in real space or by use of consoles for traversing larger distances. For example, the physical movement might refer to movement of your hands and legs for shooting in a game like Call of duty while virtual movement would refer to the player going to an enemy base. There are a large number of devices that enable virtual movement from keyboards, game controllers to multi-directional treadmills. A single universal interface to navigate both virtual and physical space could possibly be the holy grail for navigation controller design.

The human eye’s understanding of space for Augmented Reality

From an engineering standpoint, our eyes are precision optical sensors and the structure of it has long been used as a blueprint to design cameras. Light enters through a series of optical elements. Refraction and focus happens here. The diaphragm controls the amount of light passing through an aperture. The subsequent light pattern falls on an image plane from where electrical signals are sent to the brain. These signals are then decoded as images. Although the functioning of the eye is an interesting topic in itself, for the sake of simplicity, I have taken the liberty to omit a large portion of how the eye works. But I’m sure you get the general idea. What is more interesting and relevant is what this amazing organic optical sensor enables humans.

The Visible Spectrum

Human sight exists due to light. A form of electromagnetic radiation, light is the key stimulus for human vision. This radiation moves through space in the form of waves and is capable of stimulating the retina to produce a sense of vision. Electromagnetic radiation is classified according to it’s wavelength which is the distance between two crests of a wave. Although the the entire electromagnetic spectrum includes radio waves, infrared, x-rays and other wave types, the human eye is only sensitive to a narrow band between 380–740 nanometers*. This is known as the visible spectrum.
*1 nanometer = billionth of 1 meter

Superman's X-Ray vision
The idea of Superman is that of an augmented human. X-Ray vision is one of his powers. This is nothing but a simplistic way of saying that his retina is sensitive to a much wider spectrum of electromagnetic waves (Sorry to spoil the romance of the idea). If Augmented reality devices could enable us to view a wider spectrum of electromagnetic radiation, it would bring us one step closer to becoming Super People.

Basic properties of the human eye

Field of View (FOV)

It is defined as the total angular size of the image visible to both the eyes. On an average, the horizontal binocular FOV is 200 deg out of which 120 deg is a binocular overlap. The binocular overlap is especially important for stereopsis and other depth cues discussed further. The vertical FOV is approximately 130 deg.

Inter-pupillary distance (IPD)

As the name suggests, it is the distance between the pupils of the eyes and is an extremely important consideration for binocular viewing systems. This distance varies from person to person, by gender and ethnicity. An inaccurate IPD consideration can result in poor eye-lens alignment, image distortion, strain on the eyes and headache. Mean IPD for adults is around 63mm with majority in the 50–75mm range. The minimum IPD for children is around 40mm.

Eye relief

This is the distance from the cornea of the eye to the surface of the first optical element. It defines the distance at which the user can obtain full viewing angles. This is an important consideration especially for people who wear corrective lenses or spectacles. Eye relief for spectacles is approximately 12mm. Enabling users to adjust the eye relief is extremely important for head mounted displays.

Exit pupil

This is the diameter of light transmitted to the eye by an optical system.

Eye box

This is the volume within which users can place their pupils to experience the visuals wholly.

Spatial vision and depth cues

Literally billions of signals sent to the cerebral cortex for analysis to form an image. There are a number of spatial and depth cues that enable the human brain to decipher these light signals to create our visible reality.

Extra-retinal cues

These type of cues are a result of physiological processes rather than those derived from light patterns entering the eye.

Accommodation (shifting focus)
Accommodation is an extra-retinal cue that helps the eye to shift focus between the objects in the foreground and background. The ciliary muscle encircles the iris to help an observer rapidly shift focus between different depth of fields. The optical power of the eye lens changes. When the eye is looking at objects at a comfortable distance, these muscles are relaxed. Ciliary muscles contract or accommodate when the eye needs to focus on objects nearby. This is the reason why it is advisable to look into the distance when you need to relax your eyes.

This is a fairly simple idea where both eyeballs rotate towards the centre when looking at objects far away in order to align the image for the brain to process. The eyeballs converge slightly towards each other when the object is near. This synchronized rotation is called vergence.

Why are accommodation and vergence important concepts for AR?
A number of AR glasses users often complain about headaches and eye strain. This is caused due to the eyes being focused on the flat panel within inches of the eye. Even though the 3D objects appear far, the illusion of depth is only simulated. Addition to this, there is a mismatch in the sensory cues provided by the eye to the brain. The eye has to accommodate and converge for a large amount of time which is the main cause of this discomfort. Reducing the effects of accommodation and vergence while simulating real depth is an interesting problem to be solved.

Retinal or Depth cues

These cues are derived from the light patterns entering the eye. These cues are either binocular (having two eyeballs has an effect on these cues) or monocular (these can be observed even with one functioning eyeball).

We have two eyes that are separate by an average distance of about 2.5 inches. Each eyeball captures a slightly different image. The perception depth observed due to the slight offset in both images being processed by the brain is called stereopsis. Stereopsis is especially important for immersive head-mounted VR displays. The two separate images shown to each eye and even a slight displacement in the images would lead to a loss of stereopsis making the VR experience feel unnatural.

Place your eyes closer to the screen such that one eye can see only a single cube. If you are on a phone, make sure it is in landscape mode for optimum experience

Stereopsis is the only binocular cue that is discussed here. Rest of the cues are all monocular.

Motion Parallax
This is a strong depth cue where objects which are closer appear of be moving much faster than those that are far away, even when both the objects are moving at same speeds. The reason for this is that, objects closer will through your field of view quicker than far off objects. This information is important to simulate relative depths between moving objects in a 3D environment.

Occlusion or interposition
These cues are observed when one object blocks the view of another object. The brain registers the blocking object to be closer than the object which is being blocked. Simulating effects of occlusion is especially difficult for AR scenarios where the computer has to know the position of near and far objects in the view. Using depth sensing cameras would be a viable solution to solve this issue for nearby objects.

Deletion and Accretion
This cue is an extension of motion parallax and depth cues. Deletion occurs when an object moves behind another object while accretion occurs when the object reveals itself in the observer’s viewpoint. If the deleting and accretion happen quickly, then the object is registered as being closer to the blocking object. Deletion and accretion occurs slowly if the two objects are farther away.

Linear perspective
This depth cue is a result of convergence of lines toward a single point in the distance. Parallel lines appear to recede into the distance. The more lines converge, the farther they appear.

Left: Linear perspective. Right: Kinetic depth effect from a series of silhouettes (Source: Wikipedia)

Kinetic depth effect
This effect is the perception of an object’s structure from it’s motion. This effect is especially useful in showing an object’s complex structure even when other depth cues are missing.

Familiar size
This depth cue helps us in estimating the size of an object with respect to surrounding elements. This can especially be useful in data visualizations where showing a relative size gives the user a perspective of the data.

The green house appears smaller than the yellow house

Relative size
Two objects of similar size but at different distances are perceived as different sizes relative to their distance from the observer. Two houses of the same size but at different distances cast a different retinal image which is perceived as a distance cue.

Relative size and Relative height

Relative height
In most normal settings, objects near your field of vision are seen on the lower portion of the retinal field, while those farther way are viewed on the higher portion.

Atmospheric or aerial perspective
This depth cue is a result of light being scattered by particles such as vapor and smoke. As the distance increases, the contrast between the object and the background decreases.

Atmospheric perspective

Texture gradient
This is an important cue where a gradual change in the texture of objects (normally from fine to coarse) gives a perception of depth. The density of a unit of texture or height of a unit or the reducing distance between textures gives a perception of distance.

Texture gradient

Lighting, Shade and Shadows
This is one of the most common and commonly used depth cues by artists and architects. The angle and sharpness of a shadow influence depth perception. Crisp and clearly defined shadows indicate a closer proximity while a fuzzy one may indicate greater depth. Also the way in which light interacts with irregularly shaped objects might reveal significant information about the object.

Optical expansion
This cue is an extension of the relative size cue and occlusion. As an object’s retinal image increases in size, it appears to be moving closer and starts occluding objects in it’s path.

The ball appears to be moving closer

Understanding how the human eye works can help us build more natural AR experiences and contribute significantly to the future of virtual environments.

Building blocks for augmented vision

Most early and popular ideas around AR focus on augmenting human vision and technology was developed to support these ideas. The camera plays the main role in this type of Augmented Reality (AR). A camera paired with a computer (smartphone) uses computer vision(CV)to scan it’s surroundings and content is superimposed on the camera view. A large number of modern AR applications readily use the smartphone’s camera to show 3D objects in real space without having to use special markers. This method is sometimes called marker-less AR. However, this was not always the standard. There are number of techniques used to augment content on the camera view.

Fiducial markers and images

Fiducial markers are black and white patterns often printed on a plane surface. The computer vision algorithm uses these markers to scan the image to place and scale the 3D object in the camera view accordingly. Earlier AR solutions regularly relied on fiducial markers. As an alternative, images too can be used instead of fiducial markers. Fiducial markers are the most accurate mechanisms for AR content creation and are regularly used in motion capture (MOCAP) in the film industry.

3D depth sensing

With ‘You are the controller’ as it’s tagline, Microsoft’s Kinect was a revolutionary device for Augmented reality research. It is a 3D depth-sensing camera which recognizes and maps spatial data. 3D depth sensing was available much before the Kinect, however the Kinect made the technology a lot more accessible. It changed the way regular computers see and augment natural environments. Depth sensing cameras analyze and map spatial environments to place 3D objects in the camera view. A more mainstream depth sensing camera in the recent times would be iPhone X’s front camera.

Simultaneous localization and mapping (SLAM)

For a robot or a computer to be able to move through or augment an environment, it needs to map the environment and understand it’s location within it. Simultaneous localization and mapping (SLAM) is a technology that enables just that. It was originally built for robots to navigate complex terrains and is even used by Google’s self-driving car. As the name suggests, SLAM enables realtime mapping of environment to generate a 3D map with the help of a camera and a few sensors. This 3D map can be used by the computer to place multimedia content in the environment.

Point cloud

3D depth sensing cameras like Microsoft’s Kinect and Intel’s real sense, and SLAM generate a set of datapoints in space known as a point cloud. Point clouds are referenced by the computer to place content in 3D environments. Once mapped to an environment, they enable the system to remember where a 3D object is placed in an environment or even at a particular GPS location.

Machine learning + Normal camera

Earlier AR methods relied on a multitude of sensors in addition to the camera. Software libraries like OpenCV, Vuforia, ARCore, ARKit, MRKit have enabled AR on small computing devices like the smartphone with surprising accuracy. These libraries use machine learning algorithms to place 3D objects in the environment and require only a digital camera for input. The frugality of these algorithms in terms of sensor requirements have largely been responsible for the ensuing excitement around AR in recent times.

We have made great strides in image recognition, machine learning, 3D graphics optimization and a whole host of other technical challenges to have this first wave of AR within hands reach. Most of us have seen a 3D architectural model of a building in an AR environment being showcased as the pinnacle of AR’s potential. This is only the beginning.

We have started experiencing a major part of our lives through little portals we call computers. These portals are available under a variety of different names; laptops, iPads, smartphones, smart watches, smart speakers, etcetera etcetera. Even then, there seems to be a barrier between us and our technology. We cannot physically interact with what’s there on the other side. The idea of augmenting reality brings an interesting promise and could possibly be a fundamental shift in how we interact with computers.

If you liked this article, please click the 👏 button (once, twice or more). 
Share to help others find it!

More by Akshay Kore

Topics of interest

More Related Stories