Augmented Reality (AR) has come a long way in the past few years. However, there isn’t much information out there that helps developers understand which technologies power AR and which skills are necessary to develop solid AR apps. I hope to shed some light on this area and convince you that it is not that hard. I am a full stack web developer, and it took me three weeks to get the hang of AR and build an MVP for my startup. If you are a game developer, you might get started even faster.
In essence, AR allows digital information to be overlaid on top of the real world. However, depending on the use case, very different stacks of technology are used to achieve the desired effect.
For outdoor, navigation-based AR such as Pokémon Go, established and easy-to-apply GPS technology is used. The current coordinates are sent to the server, and the nearby Pokémon are returned. When you choose a Pokémon to catch, a 3D render of that character is overlaid on top of the scene. This delivers the complete AR experience. Such technology is neither hard to implement nor very impressive. The app makes no attempt to understand the scene; it just needs to position you on the earth. However, the game has made it easier to convey the meaning of AR to a non-tech-savvy audience.
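To make the "send coordinates, get nearby characters back" step concrete, here is a minimal sketch of what such a server-side lookup could look like. It is not how Pokémon Go actually works; the function names, the 200 m radius and the sample spawn points are all assumptions made up for illustration. It uses the standard haversine formula for the distance between two GPS coordinates.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearby(player, spawns, radius_m=200.0):
    """Return the names of spawns within radius_m of the player, nearest first.

    player is a (lat, lon) tuple; spawns is a list of (name, lat, lon) tuples.
    Both shapes are hypothetical, chosen only for this sketch.
    """
    hits = []
    for name, lat, lon in spawns:
        d = haversine_m(player[0], player[1], lat, lon)
        if d <= radius_m:
            hits.append((d, name))
    return [name for _, name in sorted(hits)]


# Made-up sample data: two spawns close to the player, one about a kilometre away.
player = (1.3000, 103.8000)
spawns = [
    ("far", 1.3100, 103.8000),
    ("east", 1.3000, 103.8010),
    ("north", 1.3005, 103.8000),
]
print(nearby(player, spawns))
```

Note that nothing here looks at the camera image. The only input is the device position, which is why this flavour of AR is the easiest to build.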
On the other end of the spectrum are AR technologies that apply computer vision to understand and identify points of interest in the scene and overlay information on specific places. Depending on the purpose, different kinds and levels of sophistication of computer vision technology are used. Snapchat uses facial feature recognition and tracking on a real-time video feed for its social filters. At this point, it is the most valuable AR company. Unlike Pokémon Go, Snapchat needs to locate your facial features in every frame before any graphics can be overlaid. It is fairly impressive, as it doesn’t require any predefined models and works for all faces. However, the detection doesn’t require high recall, as the graphics are overlaid for social and entertainment purposes.
The next level of computer vision scans the surroundings to understand the scene. This demands building a 3D model of the scene from the camera input. As shown in the image above, the ring below the robot is rendered on top of the table, so the device can figure out that the table can be used as a planar base. What’s most impressive, and elicits a “wow” response from everyone, is when you cannot differentiate the real world from the digital one. This level of experience in real time requires very high processing power (for computer vision and graphics), bigger power sources and advanced cameras (for 3D reconstruction), which currently can’t fit in a mobile device. For now, they are available in the form of headwear, such as Microsoft HoloLens. Such devices are extremely expensive and bulky, and therefore not mainstream yet.
When you move around the scene, you do not expect the overlaid information to move with you. It needs to stay stationary with respect to the real world regardless of your own movement. This requires being able to “track” the scene. At each frame, within milliseconds, we need to know the position and orientation of the 3D world space we are in. We can go further in the AR experience. Instead of just tracking the scene as one unit, we can track individual objects in it. So if we are trying out a new hairstyle and clothing in AR in the same view, we need to track the head, body and arms simultaneously in the same frame. For the most convincing experience, we need to track multiple objects of different shapes and sizes in the 3D world.
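Here is a small sketch of what "staying stationary with respect to the real world" means in numbers. The overlay keeps fixed world coordinates, and at every frame we re-express it in the frame of the moving camera. To keep it short, I assume the camera only turns about the vertical (y) axis, and the function name is my own invention, not part of any AR library.

```python
import math


def world_to_camera(cam_yaw_deg, cam_pos, anchor_world):
    """Express a fixed world-space anchor in the moving camera's frame.

    Assumes the camera pose is a yaw rotation about the y axis plus a
    position, i.e. p_world = R_y(yaw) * p_camera + cam_pos.
    """
    c = math.cos(math.radians(cam_yaw_deg))
    s = math.sin(math.radians(cam_yaw_deg))
    # Offset from the camera to the anchor, still in world axes.
    dx, dy, dz = (a - p for a, p in zip(anchor_world, cam_pos))
    # Apply the inverse (transpose) of the yaw rotation to that offset.
    return (c * dx - s * dz, dy, s * dx + c * dz)


anchor = (0.0, 0.0, 2.0)  # the overlay never changes its world position

# Frame 1: camera at the origin, looking straight at the anchor.
print(world_to_camera(0.0, (0.0, 0.0, 0.0), anchor))
# Frame 2: the user steps one unit to the right; the anchor shifts left in view.
print(world_to_camera(0.0, (1.0, 0.0, 0.0), anchor))
```

The anchor's world coordinates are identical in both frames. Only its camera-space coordinates change, and that is what makes the overlay appear pinned to the real world as you move.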
So the next question is, how do we find out the origin at each frame? First, we need to define what we are tracking. Are we tracking the face of a person? Just the face or each facial feature? Or hands? A 3D object? An image? A barcode? Then, we need to write an algorithm that returns the position and orientation of the desired target with respect to the plane of the lens. If we just get the position, we won’t know which way is up, so we need the orientation as well. Both can be represented together in a pose matrix. Then you have to draw your augmented 3D models with respect to this matrix. You can just pass this matrix to the 3D rendering engine, but it’s good to know the 3D rendering pipeline, which is based on linear algebra.
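To show what a pose matrix looks like, here is a minimal sketch in plain Python. It packs an orientation and a position into one 4x4 matrix and uses it to place a model point. For brevity, I assume the orientation is a single rotation about the vertical (y) axis; a real tracker returns a full 3D rotation. The helper names are hypothetical.

```python
import math


def pose_matrix(yaw_deg, tx, ty, tz):
    """4x4 pose: a rotation about the y axis, then a translation (tx, ty, tz)."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    return [
        [c,   0.0, s,   tx],
        [0.0, 1.0, 0.0, ty],
        [-s,  0.0, c,   tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def transform(pose, point):
    """Map a point from the target's local frame into the camera frame."""
    v = (point[0], point[1], point[2], 1.0)  # homogeneous coordinates
    return tuple(sum(pose[r][c] * v[c] for c in range(4)) for r in range(3))


# A target 5 units in front of the lens, turned 90 degrees about the vertical axis.
pose = pose_matrix(90.0, 0.0, 0.0, 5.0)
print(transform(pose, (0.0, 0.0, 0.0)))  # the target's own origin
print(transform(pose, (1.0, 0.0, 0.0)))  # a model vertex one unit along local x
```

The upper-left 3x3 block holds the orientation and the last column holds the position, which is why one matrix is enough to answer both "where is it?" and "which way is up?".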
So let’s summarize the AR pipeline for one frame. The camera input is passed to the detection engine, which returns the pose matrix. This pose matrix is used as the center of the 3D world, and your custom models get rendered around this center. And bam, you have successfully augmented the camera input.
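The one-frame pipeline can be sketched end to end in a few lines. Everything here is a stand-in: `detect_pose` is a hypothetical stub that always reports the same pose instead of running real computer vision, and the "renderer" is just a pinhole projection with made-up camera parameters (focal length 800 px, image centre at 320, 240, z axis pointing forward). A real app would hand the pose to a 3D engine instead.

```python
def detect_pose(frame):
    """Hypothetical detection engine: pretends the target is always found
    2 units in front of the camera with no rotation, ignoring the frame."""
    return [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 2.0],
        [0.0, 0.0, 0.0, 1.0],
    ]


def to_camera(pose, p):
    """Move a model point from the target's frame into the camera frame."""
    v = (p[0], p[1], p[2], 1.0)
    return tuple(sum(pose[r][c] * v[c] for c in range(4)) for r in range(3))


def project(p_cam, focal=800.0, cx=320.0, cy=240.0):
    """Pinhole projection of a camera-space point to pixel coordinates."""
    x, y, z = p_cam
    return (focal * x / z + cx, focal * y / z + cy)


def augment_frame(frame, model_points):
    """One pass of the pipeline: detect the pose, then place the model around it."""
    pose = detect_pose(frame)
    return [project(to_camera(pose, p)) for p in model_points]


# The frame is unused by the stub, so None is enough for this sketch.
pixels = augment_frame(None, [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)])
print(pixels)
```

The model's origin lands at the image centre because the stub places the target straight ahead. Swap in a real detector and the same two steps, pose then render, repeat for every frame.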
The following challenges must be overcome for AR to become mainstream:
There are incredible efforts being made all around the world to solve these hard problems. If you just want to develop AR-enabled apps, you don’t have to worry about solving all of them. As long as you know how to use what’s out there, you are ready to build amazing AR applications. Of course, you have to keep in mind that these problems haven’t been completely solved, so be prepared to make some compromises along the way. Snapchat and Pokémon Go are examples of great applications that made the right compromises.