An oft-underappreciated feature of human sight is the ability to perceive depth in flat images, and while computers can “see” thanks to camera technology, that kind of depth perception has been a challenge — but perhaps not anymore.
Humans can naturally see depth in photographs from a single point of view, even if the image is flat. Computers struggle with this, but researcher scientists from Simon Fraser University in Canada have developed a way to help computer vision see depth in flat images.
“When we look at a picture, we can tell the relative distance of objects by looking at their size, position, and relation to each other,” says Mahdi Miangoleh, an MSc student who worked on the project. “This requires recognizing the objects in a scene and knowing what size the objects are in real life. This task alone is an active research topic for neural networks.”
The process is called monocular depth estimation and it involves teaching computers how to see depth by using machine learning. Basically, the process uses contextual cues such as the relative sizes of objects to estimate the structure of the scene. The team started with an existing model that would take the highest resolution image available and resize it for the model, but this method had largely failed to provide consistent results.
The team realized that while neural network technology has advanced significantly over the years, they still have a relatively small capacity to generate many details at once. Additionally, the networks are limited by how much of the scene the network can analyze at a time. It’s why feeding just a high-resolution image into a neural network wasn’t able to work properly in previous attempts at depth mapping.
The team explains that during the process, they found that there were strange effects in how the neural network saw images at different resolutions. High-resolution images would result in details of subjects being visible, but some of the depth structure would be lost. Low-resolution images would lose the detail, but that depth information would return.
The researchers explain that a small network that generates the structure of a complex scene cannot also generate fine details. At a low resolution, every pixel can see the edges of a structure in a photo, so the network can judge that it is, for example, a flat wall. But at a high resolution, some pixels do not receive any contextual information due to the limitations of the network’s processing capability. This results in large structural inconsistencies.
So rather than rely on one resolution, the team decided to build a method that uses multiple.
“Our method analyzes an image and optimizes the process by looking at the image content according to the limitations of current architectures,” explains PhD student Sebastian Dille. “We give our input image to our neural network in many different forms, to create as many details as the model allows while preserving a realistic geometry.”
The team published its full report here and produced an easy-to-understand video explanation above.