Self-driving car technology. A Yandex lecture

Yandex continues to develop its self-driving car technology. Today we are publishing a lecture by one of the leaders of this project, Anton Slesarev. Anton gave this talk at Yandex's Data-Ёлка event at the end of 2017 and described one of the key components of the technology stack needed to make a self-driving car work.


- My name is Anton Slesarev. I am responsible for everything that runs inside the self-driving car, and for the algorithms that prepare the car for a trip.

I’ll try to tell you what technologies we use. Here is a brief block diagram of what happens in a car.

You could say this scheme emerged in 2007, when the DARPA Urban Challenge was held in the USA, a competition on how a car can drive in urban conditions. Several top American universities competed, such as Carnegie Mellon, Stanford, and MIT. Carnegie Mellon won, as I recall. The participating teams published excellent, detailed reports on how they built their cars and how those cars drove in an urban setting. In terms of components, everyone described roughly the same thing, and this scheme is still relevant.

There is a perception component, responsible for understanding the world around the car. There are maps and localization, responsible for where the car is in the world. Both of these components feed into the motion planning component, which decides where to go and which trajectory to build, taking the surrounding world into account. Finally, motion planning passes the trajectory to the vehicle control component, which executes it taking the physics of the car into account. Vehicle control is more about physics.
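As a rough illustration of that data flow (all names and types below are hypothetical stubs, not Yandex's actual code), the classic pipeline from the Urban Challenge reports can be sketched like this:

```python
from dataclasses import dataclass
from typing import List, Tuple

# A deliberately simplified sketch: perception and localization feed motion
# planning, which feeds vehicle control. Everything here is illustrative.

@dataclass
class Obstacle:
    x: float        # meters, in the car's frame
    y: float
    vx: float       # estimated velocity, m/s
    vy: float
    kind: str       # "car", "pedestrian", ...

def perceive(sensor_frame) -> List[Obstacle]:
    """Perception: build a model of the world around the car."""
    return []  # stub

def localize(sensor_frame, hd_map) -> Tuple[float, float, float]:
    """Maps & localization: (x, y, heading) of the car on the map."""
    return (0.0, 0.0, 0.0)  # stub

def plan(obstacles, pose, hd_map) -> List[Tuple[float, float]]:
    """Motion planning: a trajectory (list of waypoints) that avoids obstacles."""
    return [(0.0, 0.0)]  # stub

def control(trajectory, pose) -> Tuple[float, float]:
    """Vehicle control: turn the trajectory into (steering, throttle) commands,
    taking the physics of the car into account."""
    return (0.0, 0.0)  # stub

def drive_one_tick(sensor_frame, hd_map):
    obstacles = perceive(sensor_frame)
    pose = localize(sensor_frame, hd_map)
    trajectory = plan(obstacles, pose, hd_map)
    return control(trajectory, pose)
```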

Today we will focus on the perception component, since it is more about data analysis and, in my opinion, it will be the most challenging part of the whole front of work on self-driving cars in the near future. The remaining components are also insanely important, but the better we recognize the world around us, the easier it is to do the rest.

I'll show you a different approach first. Many have heard that there are end-to-end architectures and, more specifically, so-called behavior cloning, where we collect datasets of how a driver drives and try to clone that behavior. There are several papers describing the simplest ways to do this. For example, one option uses just three cameras to augment the data so that the car does not always drive along exactly the same path. All of this is fed into a single neural network that says which way to turn the steering wheel. And it somehow works, but as the current state of affairs shows, end-to-end is still at the research stage.

We tried it too. One of our people quickly trained an end-to-end model. We were even a little scared that we would have to let the rest of the team go, because in a month he achieved results that had taken a lot of people three months. But the problem is that it is hard to move further. We learned to drive around one building, but driving around the same building in the opposite direction is already much more difficult. There is still no way to represent everything as a single neural network so that it works more or less robustly. Therefore, everything that drives in real conditions usually works on the classical approach, where perception explicitly builds a model of the surrounding world.


How does perception work? First you need to understand what data and what information flow into the car. The car has many sensors. The most widely used are cameras, radars, and lidars.

Radar is already a production sensor; it is actively used in adaptive cruise control. It is a sensor that tells you at what angle and distance a car is located. It works very well on metal objects such as cars; on pedestrians it works worse. A distinctive feature of radar is that it gives not only position but also speed: thanks to the Doppler effect, we can find out the radial velocity.
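As a side note, that radial velocity comes straight out of the Doppler shift of the reflected signal. A toy calculation (the numbers below are made up for illustration):

```python
C = 3.0e8  # speed of light, m/s

def radial_velocity(doppler_shift_hz: float, carrier_hz: float) -> float:
    """Radial velocity of a reflector seen by a radar.

    The signal travels to the target and back, so the Doppler shift is
    f_d = 2 * v_r * f0 / c, hence v_r = f_d * c / (2 * f0).
    A positive value means the target is approaching.
    """
    return doppler_shift_hz * C / (2.0 * carrier_hz)

# Example: a 77 GHz automotive radar seeing a ~5.2 kHz Doppler shift
# corresponds to roughly 10 m/s of closing speed.
print(radial_velocity(5.2e3, 77e9))  # ~10.1 m/s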


Cameras give, of course, an ordinary video picture.


Lidar is more interesting. Anyone who has done home renovations is familiar with the laser rangefinder that you hold up to a wall. Inside is a stopwatch that measures how long the light takes to fly there and back, and from that we measure the distance.

In reality the physical principles are more complex, but the bottom line is that there are many laser rangefinders stacked vertically. The whole assembly spins and scans the surrounding space.
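The time-of-flight idea behind each of those rangefinders is simply distance = (speed of light × round-trip time) / 2. A minimal illustration:

```python
C = 3.0e8  # speed of light, m/s

def distance_from_time_of_flight(round_trip_seconds: float) -> float:
    """The light travels to the obstacle and back, so halve the path."""
    return C * round_trip_seconds / 2.0

# A return after ~200 nanoseconds means the surface is about 30 meters away.
print(distance_from_time_of_flight(200e-9))  # 30.0
```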

Here's a picture from a 32-beam lidar. It is a very cool sensor: a person can be recognized at a distance of several meters. Even naive approaches work: find the ground plane, and everything above it is an obstacle. That is why everyone loves lidar; it is a key component of self-driving cars.
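That naive approach can be sketched in a few lines: assume (or fit) a ground plane and treat every point sufficiently far above it as an obstacle. A toy version under a flat-ground assumption:

```python
import numpy as np

def obstacle_points(cloud_xyz: np.ndarray, ground_z: float = 0.0,
                    threshold_m: float = 0.3) -> np.ndarray:
    """Return lidar points that stick out above the (assumed flat) ground plane.

    cloud_xyz: (N, 3) array of points in the car's frame, z pointing up.
    In practice the plane is estimated (e.g. with RANSAC) rather than assumed.
    """
    return cloud_xyz[cloud_xyz[:, 2] > ground_z + threshold_m]

# Example: two ground returns and one point 1.5 m high; only the tall point survives.
cloud = np.array([[5.0, 0.0, 0.02], [7.0, 1.0, -0.01], [6.0, -2.0, 1.5]])
print(obstacle_points(cloud))
```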

There are several problems with lidar. First, it is quite expensive. Second, it spins all the time, and sooner or later it will wear out; reliability leaves much to be desired. Some lidar makers promise devices with no moving parts and lower prices, while others promise to do everything with computer vision using cameras alone. Who will win is the most interesting question.

There are several sensors, and each of them generates some kind of data. There is a classic pipeline for training machine learning algorithms on it.

The data needs to be collected and poured into some kind of cloud. In the case of a car, we collect data from the cars, upload it to the cloud, somehow label it, come up with a model, tune its parameters, choose the best model, and retrain. An important nuance is that the model then needs to be put back into the car so that it works very quickly.

Suppose the data has been collected in the cloud and we want to label it.

The already mentioned Toloka is my favorite Yandex service; it allows you to label a large amount of data very cheaply. You can create a GUI as a web page and hand it out for markup. In the case of a car detector, it is enough to outline the cars with rectangles, and this is done simply and cheaply.
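The result of such crowdsourced markup is essentially a list of rectangles per image. The exact Toloka output format is different, but conceptually it boils down to something like this (the field names below are made up for illustration):

```python
# One labeled frame, roughly as it might look after aggregating crowd answers.
# The schema is illustrative, not Toloka's real output format.
labeled_frame = {
    "image": "frames/000123.jpg",
    "boxes": [
        {"label": "car",        "x": 412, "y": 230, "w": 180, "h": 95},
        {"label": "pedestrian", "x": 640, "y": 210, "w": 45,  "h": 120},
    ],
}

def to_training_sample(frame):
    """Turn one labeled frame into (image_path, [(class, x, y, w, h), ...])."""
    boxes = [(b["label"], b["x"], b["y"], b["w"], b["h"]) for b in frame["boxes"]]
    return frame["image"], boxes

print(to_training_sample(labeled_frame))
```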

Then we choose some machine learning method. There are many fast methods: SSD, YOLO, and their modifications.

Then it needs to be put into the car. There are many cameras, 360 degrees must be covered, and it must work very fast in order to react in time. A variety of techniques are used: inference engines such as TensorRT, specialized hardware such as Drive PX, architectures like FuseNet. Several algorithms share a single backbone, so the convolutions are run only once. This is a fairly common technique.
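The "single backbone, convolutions run once" trick means several task heads reuse the same feature maps instead of each running its own full network. A minimal PyTorch-style sketch (purely illustrative, not the actual production model):

```python
import torch
import torch.nn as nn

class SharedBackboneNet(nn.Module):
    """Toy model: one convolutional backbone, several lightweight heads.

    The expensive convolutions run once per frame; each task head
    (here: object detection and segmentation) reads the same features.
    Real detectors (SSD, YOLO) are far more involved.
    """
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Detection head: per-cell class scores plus box offsets, SSD-style.
        self.det_head = nn.Conv2d(64, num_classes + 4, 1)
        # Segmentation head: per-pixel class scores on the downscaled grid.
        self.seg_head = nn.Conv2d(64, num_classes, 1)

    def forward(self, image):
        features = self.backbone(image)  # run the convolutions once
        return self.det_head(features), self.seg_head(features)

model = SharedBackboneNet()
det_out, seg_out = model(torch.randn(1, 3, 256, 512))
print(det_out.shape, seg_out.shape)
```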

Object detection works something like this:


Here, in addition to cars, we also detect pedestrians, and we also estimate direction. The arrow shows the direction estimate from the camera alone; right now it makes mistakes. This is an algorithm that works on a large number of cameras in real time on the car.

As for object detection in images, it is a solved task: many people can do it, there are plenty of algorithms, plenty of competitions, plenty of datasets. Well, not that many, but they exist.

With lidars it is much more complicated: there is only one more or less relevant dataset, the KITTI dataset. Everything else we have to label from scratch.

The process of labeling a point cloud is a fairly non-trivial procedure. Ordinary people work in Toloka, and explaining to them how 3D projections work and how to find cars in a point cloud is a rather non-trivial task. We spent a certain amount of effort and have more or less managed to set up a stream of this kind of data.

How do we work with it? These are point clouds, and neural networks are the best at detection, so you need to understand how to feed a point cloud with 3D coordinates around the car into a network.

It all comes down to how you represent it. We experimented with an approach where you project the points into a top-down view and cut it into cells. If there is at least one point in a cell, the cell is considered occupied.

You can go further: make vertical slices and, if there is at least one point in a given vertical cube, write some characteristic into it. For example, writing the height of the highest point in the cube works well. The slices are fed to the input of a neural network; it is essentially an analogue of images. We have 14 input channels and work in much the same way as SSD. A signal from a network trained for detection also comes in here: an image is fed to the input of that network, and the whole thing trains end-to-end. At the output, we predict 3D boxes, their classes and positions.
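A minimal sketch of that bird's-eye-view representation (the grid sizes, ranges and channel layout below are illustrative assumptions, not the exact ones used): project the points onto a top-down grid, add one occupancy channel per vertical slice, plus a max-height channel.

```python
import numpy as np

def lidar_to_bev(cloud_xyz: np.ndarray,
                 x_range=(0.0, 70.0), y_range=(-35.0, 35.0), z_range=(-2.0, 3.0),
                 cell=0.25, n_slices=13) -> np.ndarray:
    """Rasterize a lidar point cloud into a bird's-eye-view tensor.

    Channels: n_slices vertical occupancy slices + 1 max-height channel
    (e.g. 13 + 1 = 14 input channels, as mentioned in the talk).
    Returns an array of shape (n_slices + 1, H, W).
    """
    H = int((x_range[1] - x_range[0]) / cell)
    W = int((y_range[1] - y_range[0]) / cell)
    bev = np.zeros((n_slices + 1, H, W), dtype=np.float32)

    x, y, z = cloud_xyz[:, 0], cloud_xyz[:, 1], cloud_xyz[:, 2]
    keep = (x >= x_range[0]) & (x < x_range[1]) & \
           (y >= y_range[0]) & (y < y_range[1]) & \
           (z >= z_range[0]) & (z < z_range[1])
    x, y, z = x[keep], y[keep], z[keep]

    row = ((x - x_range[0]) / cell).astype(int)
    col = ((y - y_range[0]) / cell).astype(int)
    sl = ((z - z_range[0]) / (z_range[1] - z_range[0]) * n_slices).astype(int)
    sl = np.clip(sl, 0, n_slices - 1)

    bev[sl, row, col] = 1.0                                   # occupancy per slice
    np.maximum.at(bev[n_slices], (row, col), z - z_range[0])  # max height per cell
    return bev

# Example: random points in front of the car.
cloud = np.random.uniform([0, -35, -2], [70, 35, 3], size=(1000, 3))
print(lidar_to_bev(cloud).shape)  # (14, 280, 280)
```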

Here are results from a month ago on the KITTI dataset. At that time Multi-View 3D was state of the art. Our algorithm was similar in quality in terms of precision, but it worked several times faster, and we could deploy it on a real car. The speedup was achieved mainly by simplifying the representation.

Then it needs to be deployed on the car again. Here is an example of how it works.


We have to be careful here: this is the training set, but it also works on the test set. Cars are marked with green parallelepipeds.


Segmentation is another algorithm that can be used to understand what is in the picture. Segmentation tells which class each pixel belongs to. In this particular picture there is the road and the lane markings. The edges of the road are highlighted in green, and the cars are in a slightly different color, purple.


Who sees the disadvantage of segmentation from the point of view of feeding it into motion planning? Everything merges. If cars are parked next to each other, we get one big purple blob of cars and do not know how many there are. Therefore, there is another wonderful problem statement, instance segmentation, where you still have to cut the different entities apart. We are doing this too: a colleague entered the top 5 on the Cityscapes instance segmentation leaderboard last week. We wanted to take first place; so far it has not worked out, but that task exists too.
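The difference between the two problem statements in a toy form: a semantic mask merges touching cars into one blob, so counting cars by connected components fails, while an instance mask keeps a separate id per car. A purely illustrative example:

```python
import numpy as np
from scipy import ndimage

# Tiny toy example: two parked cars standing bumper to bumper.
# Semantic segmentation gives one "car" blob; an instance mask keeps them apart.
semantic = np.zeros((6, 12), dtype=int)
semantic[2:5, 1:6] = 1    # car A, class "car"
semantic[2:5, 6:11] = 1   # car B, class "car", touching car A

instance = np.zeros((6, 12), dtype=int)
instance[2:5, 1:6] = 1    # car A, instance id 1
instance[2:5, 6:11] = 2   # car B, instance id 2

# Counting cars from the semantic mask via connected components fails here:
_, n_blobs = ndimage.label(semantic == 1)
print("cars according to semantic mask:", n_blobs)         # 1, a merged blob
print("cars according to instance mask:", instance.max())  # 2
```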

We try as many diverse approaches and hypotheses as possible. Our goal is not to write the world's best object detection. That matters too, but first of all, new sensors and new approaches keep appearing, and the task is to try them and put them to work in real conditions as quickly as possible. We work on everything that slows us down. If data labeling is slow, we build a system that labels it with active use of the Toloka service. If deployment to the car is a problem, we figure out how to speed it up in a unified way.

It seems that the winner will not be the one who has the most experience right now, but the one who moves forward faster. That is what we are focused on; we want to try everything as quickly as possible.

Here's a video that we showed recently: driving in winter conditions. It is an advertising video, but you can clearly see how self-driving cars drive in current conditions (since then another video has appeared - editor's note). Thank you.