Recent advances in computer vision have revolutionized many areas of research, including robotics, automation, and self-driving vehicles. The self-driving car industry has grown markedly in recent years, enabled in no small part by state-of-the-art computer vision techniques. However, many challenges remain in the field. One of the most difficult problems in autonomous driving is perception. Once autonomous vehicles have an accurate perception of the world around them, planning and control become easier. This article focuses primarily on perception, and on the capabilities of computer vision and neural networks for use in fully autonomous self-driving vehicles.
Autonomous driving is a very challenging problem. Researchers and the auto industry have been working on developing autonomous vehicles for decades. One such company is General Motors, which in 1958 produced a self-driving car guided by a radio-controlled electromagnetic field. A number of car companies improved the technology based on this idea, but the challenge of achieving full autonomy remained. The journey reached an interesting point in 2005, when a few teams of scientists completed the DARPA Grand Challenge, a desert course of roughly 212 kilometres. Continuous efforts from many scientists have since made it possible to trial autonomous vehicles on public roads.
Following the DARPA challenge in 2005, researchers highlighted the criticality of perceiving the world around the vehicle. Since then, many companies have begun developing autonomous cars, focusing primarily on perception using vision. Autonomous car companies have varying strategies for achieving perception around an autonomous car. Most companies use some combination of RADAR, LIDAR, SONAR, and cameras. Tesla is the only large company that does not use LIDAR in its autonomous cars; it focuses primarily on RADAR and cameras, and also uses SONAR to detect near-field objects. Despite the variation between companies, almost all of them place computer vision technologies at the forefront.
Despite recent progress, autonomous driving still faces great challenges in representing the 3D world around a vehicle using computer vision only. It is difficult to achieve an accurate representation because cameras generate 2D images that do not directly provide the depth of objects. Although many papers have been published on 3D reconstruction from multiple 2D images taken from cameras at different locations, 3D reconstruction is computationally expensive [1]. Therefore, some companies use RADAR and LIDAR for depth perception of objects in the scene.
RADAR is cheap, but only gives us the range of an object. LIDAR, on the other hand, is expensive but provides a 3D point cloud around the vehicle with great accuracy. One benefit of LIDAR is that it has better resolution than RADAR. A disadvantage is that its performance degrades in media that scatter or absorb light; LIDAR cannot be relied on in foggy weather, for example.
Another shortcoming of LIDAR is that it is sometimes difficult or impossible to determine exactly what a detected object is. For example, if LIDAR sees a lightweight object on the road, such as a plastic bag, it gives us just the point cloud; it might be difficult to tell whether this is a plastic bag, a rock, or some other heavy object. The action taken by an autonomous vehicle will differ significantly depending on what it judges the object to be. We do not want our vehicle to hit a heavy rock, but in the case of a plastic bag on the road, the vehicle does not even need to slow down. An advantage of using computer vision is that it makes it possible to detect the difference between a plastic bag and a rock.
Let us look at a specific scenario: a biker is riding in the right lane and looking to the left to see if a car is approaching from behind. The vehicle might be able to understand from the LIDAR point cloud that this is a biker in the right lane. However, the advantage of using vision is that it could additionally tell us which direction the biker is looking in. If the vehicle knows that the biker is looking at the left lane, it might be able to predict that the biker is planning to merge left, and that it needs to slow down to leave enough space for the biker. Vision could also detect whether a pedestrian is distracted by their phone and drifting toward the vehicle's lane.
Computer vision can give us a lot of information. However, accurate depth perception is still a challenge. There are some techniques for depth estimation and 3D reconstruction from vision only: using multiple 2D images, it is possible to reconstruct a scene in 3D. One of these approaches is called multi-view stereo (MVS). First, multiple 2D images are analyzed and, using structure from motion (SfM), the camera pose of each image is estimated. SfM also produces a sparse 3D point cloud. Multi-view stereo then uses this point cloud and the different camera poses to build a dense 3D point cloud. Research has shown that scenes can be represented well in 3D using these techniques [2].
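To make the SfM step more concrete, the sketch below shows a minimal two-view version of the idea using OpenCV: match features between two overlapping frames, estimate the relative camera pose, and triangulate the matches into a sparse point cloud. The image file names and the intrinsic matrix K are placeholders for illustration, and a real pipeline would repeat this over many views before an MVS densification stage.

```python
# Minimal two-view structure-from-motion sketch (illustrative, not a full pipeline).
import cv2
import numpy as np

# Assumed camera intrinsics; in practice these come from calibration.
K = np.array([[700.0,   0.0, 640.0],
              [  0.0, 700.0, 360.0],
              [  0.0,   0.0,   1.0]])

img1 = cv2.imread("frame_0.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input frames
img2 = cv2.imread("frame_1.png", cv2.IMREAD_GRAYSCALE)

# 1. Detect and match local features between the two views.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)

pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# 2. Estimate the relative camera pose (the structure-from-motion step).
E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

# 3. Triangulate the matched points into a sparse 3D point cloud.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])  # first camera at the origin
P2 = K @ np.hstack([R, t])                         # second camera pose
pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
points3d = (pts4d[:3] / pts4d[3]).T                # homogeneous -> 3D coordinates

print(f"Recovered {points3d.shape[0]} sparse 3D points")
# A multi-view stereo stage would densify this sparse cloud across many views.
```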
Depth perception can also be achieved using neural networks. The authors of [3] proposed SfM-Net to estimate the depth of objects given a sequence of frames. SfM-Net can be trained with various degrees of supervision, e.g., self-supervised by the reprojection photometric error, supervised by ego-motion, or supervised by depth. A self-supervised network can be trained on raw video without any labels and still learn depth [4]: the network predicts depth for every pixel in every frame, and the training objective is for these predictions to be consistent (and correct) over time.
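As a rough illustration of the self-supervised idea, the sketch below implements a simplified photometric reprojection loss in PyTorch: pixels of a target frame are back-projected with the predicted depth, moved into an adjacent source frame with the (estimated or given) relative pose, and the warped source image is compared with the target. This is a generic sketch of the reprojection principle rather than the exact SfM-Net loss; the intrinsics K and the pose T_target_to_source are assumed inputs here.

```python
# Simplified self-supervised photometric reprojection loss (illustrative sketch).
import torch
import torch.nn.functional as F

def reprojection_loss(depth, target, source, K, K_inv, T_target_to_source):
    """depth: (B,1,H,W) predicted depth for the target frame;
    target, source: (B,3,H,W) adjacent video frames;
    K, K_inv: (B,3,3) camera intrinsics and their inverse;
    T_target_to_source: (B,4,4) relative camera motion (ego-motion)."""
    B, _, H, W = depth.shape
    device = depth.device

    # Pixel grid in homogeneous coordinates: (B, 3, H*W).
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).float().view(1, 3, -1).expand(B, -1, -1)

    # Back-project pixels to 3D with the predicted depth, then move them
    # into the source camera using the relative pose.
    cam_points = depth.view(B, 1, -1) * (K_inv @ pix)                      # (B,3,H*W)
    cam_points = torch.cat([cam_points,
                            torch.ones(B, 1, H * W, device=device)], dim=1)
    src_points = (T_target_to_source @ cam_points)[:, :3]                 # (B,3,H*W)

    # Project into the source image and normalise to [-1, 1] for grid_sample.
    proj = K @ src_points
    px = proj[:, 0] / (proj[:, 2] + 1e-7)
    py = proj[:, 1] / (proj[:, 2] + 1e-7)
    grid = torch.stack([2 * px / (W - 1) - 1, 2 * py / (H - 1) - 1], dim=-1)
    grid = grid.view(B, H, W, 2)

    # Warp the source frame into the target view and compare photometrically.
    # A depth network trained to minimise this loss needs no depth labels.
    warped = F.grid_sample(source, grid, padding_mode="border", align_corners=True)
    return torch.abs(warped - target).mean()
```

In a full training setup, the relative pose would itself typically be predicted by a second network from the same raw video, so that both depth and ego-motion are learned jointly from the photometric error alone.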
In summary, in order to achieve fully autonomous vehicles, accurate computer vision is a necessity, and the neural networks used must provide a complete representation of the surrounding environment. There have been huge advances in recent years, but there is still much work to be done.
[1] SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels, J. Xiao et al. (2013).
[2] Building Rome in a Day, S. Agarwal et al. (2011).
[3] SfM-Net: Learning of Structure and Motion from Video, S. Vijayanarasimhan et al. (2017).
[4] Tesla Autonomy Day presentation, A. Karpathy (2019).