So, my last article was relevant for about 3 days.

Shortly after I mused on how Hide and Seek was AR’s greatest hurdle, Niantic, the developer of Pokémon Go, revealed an Occlusion mode in its AR engine. Pikachu is dodging around people and behind bushes. Pikachu is playing Hide and Seek. Bravo!

In 2017, a team of scientists from University College London specializing in the fields of Computer Vision and Machine Learning formed Matrix Mill, to create “machines that think around occlusions.” A year later, they joined Niantic with a working model of this technology.

What is this dark magic? One of the main ingredients of computer vision is the convolutional neural network (CNN). Think of CNNs as an elaborate method of connect-the-dots, a way for the computer to infer and rebuild what it’s seeing through its camera.

Deep CNNs work by consecutively modeling small pieces of information and combining them deeper in the network. One way to understand them is that the first layer will try to detect edges and form templates for edge detection. Then subsequent layers will try to combine them into simpler shapes and eventually into templates of different object positions, illumination, scales, etc. The final layers will match an input image with all the templates, and the final prediction is like a weighted sum of all of them. So, deep CNNs are able to model complex variations and behaviour, giving highly accurate predictions (1).
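To make the "first layer detects edges" idea concrete, here is a minimal sketch in plain Python. It slides a single hand-written 3x3 filter (a Sobel kernel) over a tiny image. The names `convolve2d`, `SOBEL_X`, and `IMAGE` are made up for this illustration, and a real CNN would learn its filters from training data rather than have them typed in by hand.

```python
# Toy illustration only: one fixed edge-detecting filter, not a trained CNN.

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products.

    This is the cross-correlation that deep learning libraries usually
    call "convolution".
    """
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for row in range(out_h):
        out_row = []
        for col in range(out_w):
            total = 0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    total += image[row + i][col + j] * kernel[i][j]
            out_row.append(total)
        output.append(out_row)
    return output


# Sobel kernel that responds to dark-to-bright changes from left to right.
SOBEL_X = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

# 5x6 image: dark (0) on the left half, bright (1) on the right half.
IMAGE = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

edge_map = convolve2d(IMAGE, SOBEL_X)
for row in edge_map:
    print(row)
```

The output is zero over the flat regions and non-zero only where the window straddles the dark-to-bright boundary. That is the kind of "edge template" response the quote describes, which deeper layers would then combine into shapes.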

Visual Effects artists have long been working with neural networks without realizing it. They are integral to image analysis techniques such as 3D motion tracking, advanced motion blur, and time remapping. Photogrammetry, the technique of building a 3D model from several photographs, is the perfect example of utilizing CNNs.

I see photogrammetry as the static form of what Niantic and Matrix Mill are doing with their Real World AR Occlusion. The game’s input video probably requires a bit of scene analysis to detect a ground plane and effectively build up a rough 3D model of the scene. Computer vision that can anticipate what will happen within a scene based on context would then be used (in the case of the Niantic demo) to handle a person walking through the frame in a public area.

A Computer Vision evaluated scene. Numbers are percentage of accuracy.

What’s particularly amazing about the Niantic Real World Occlusion prototype is that it’s working in real time, on a mobile device. Scenes assessed by computer vision are typically post-processed. What we’re experiencing here is live masking, all while continuing to lock to a ground plane AND rendering CG models with active lighting.
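For a sense of what "live masking" means per pixel, here is a rough sketch in plain Python. It rests on an assumption of mine, not on anything Niantic has published: that the system has some estimate of how far away each real-world pixel is. The function `composite_with_occlusion` and the tiny one-row "frame" are made up for illustration.

```python
# Hypothetical sketch: depth-test compositing of a CG character into a camera
# frame. Assumes a per-pixel depth estimate of the real scene is available.

def composite_with_occlusion(camera, scene_depth, render, render_depth):
    """Show the rendered character only where it is closer than the real scene.

    camera       -- 2D grid of camera pixels
    scene_depth  -- 2D grid of estimated real-world distances
    render       -- 2D grid of CG pixels (None where the character is absent)
    render_depth -- 2D grid of CG distances (None where the character is absent)
    """
    output = []
    for y in range(len(camera)):
        row = []
        for x in range(len(camera[0])):
            cg_distance = render_depth[y][x]
            visible = cg_distance is not None and cg_distance < scene_depth[y][x]
            row.append(render[y][x] if visible else camera[y][x])
        output.append(row)
    return output


# One-row frame: background 10 m away, a passer-by 2 m away on the right.
camera = [["bg", "bg", "person", "person"]]
scene_depth = [[10.0, 10.0, 2.0, 2.0]]

# The character stands 5 m away and overlaps the two middle pixels.
render = [[None, "pika", "pika", None]]
render_depth = [[None, 5.0, 5.0, None]]

frame = composite_with_occlusion(camera, scene_depth, render, render_depth)
print(frame)
```

The character shows up in front of the far background but is hidden where the passer-by is closer to the camera. The comparison itself is trivial. The hard part, and the impressive part of the prototype, is producing a believable depth or mask estimate for every frame on a phone.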

The occlusion prototype is in its infancy, but it is showing great potential to break down the barrier between augmented and mixed reality. With photography, the best camera is the one that’s always in your hand. The same will hold true for the burgeoning field of extended reality.

(1) Aarshay Jain, Deep Learning for Computer Vision – Introduction to Convolution Neural Networks


Neural Networks: Magic Little Elves.