Yandex, robots, and Siberia: how we built search by uploaded image

Today Yandex launched search by uploaded image. In this post we want to talk about the technology behind this service and how it was built.

Inside Yandex the technology was named Siberia, after CBIR: Content-Based Image Retrieval.

Of course, the task itself is not new, and much research has been devoted to it. But building a prototype that works on an academic collection and building an industrial system that handles billions of images and a heavy stream of queries are very different stories.

What is all this for?

There are three scenarios in which people search by an uploaded image, and we needed to learn how to handle all of them.

  • A person needs the same or a similar picture, but at a different resolution (usually the largest available), in a different color, in better quality, or uncropped.
  • A person needs to understand what is in the picture. In this case a short description next to the image is enough to make clear who or what is shown in it.
  • A person needs to find a site where the same picture appears: for example, to read about what is shown, to browse similar pictures, or to buy the depicted item, in which case shops are needed.

How does it work?

There are different approaches to finding similar images. The most common represents an image as a set of visual words: quantized local descriptors computed at interest points. Interest points are the points that remain most stable under image transformations; they are found by running the image through special filters. A descriptor is a numerical description of the region around such a point. To turn descriptors into visual words, a visual vocabulary is used. It is obtained by clustering all the descriptors computed over a representative set of images. Each newly computed descriptor is then assigned to the nearest cluster; this is how quantized descriptors, i.e. visual words, are obtained.
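The quantization step described above can be sketched in a few lines. This is a minimal illustration, not the production code: the vocabulary here is a made-up set of centroids, whereas a real one is learned by clustering descriptors from a representative image collection, and real descriptors are high-dimensional (e.g. 128-D SIFT) rather than 3-D.

```python
# Hypothetical tiny visual vocabulary: each centroid is a visual word.
vocabulary = [
    (0.0, 0.0, 0.0),   # visual word 0
    (1.0, 1.0, 1.0),   # visual word 1
    (1.0, 0.0, 0.0),   # visual word 2
]

def visual_word(descriptor, vocabulary):
    """Quantize a descriptor: index of the nearest vocabulary centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(vocabulary)),
               key=lambda i: sq_dist(descriptor, vocabulary[i]))

# An image thus becomes a "bag" of visual words:
descriptors = [(0.1, 0.0, 0.2), (0.9, 1.1, 0.8), (1.2, 0.1, -0.1)]
print([visual_word(d, vocabulary) for d in descriptors])   # → [0, 1, 2]
```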

The process of searching a large collection by an uploaded picture is, as a rule, built in the following order:
  1. Compute the set of visual words for the uploaded picture.
  2. Retrieve candidates from an inverted index, which maps each visual word to the list of images containing it.
  3. Verify the relative positions of the matching descriptors in the query picture and in each candidate image. This is the candidate validation and ranking stage; traditionally, clustering of Hough transform votes or RANSAC is used.
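The candidate-retrieval stage (step 2) can be sketched with a toy inverted index. Everything here is illustrative: the collection, the word ids, and the vote threshold are made up, and a production index covers billions of images with far more machinery around it.

```python
from collections import Counter, defaultdict

# Toy collection: image id -> visual words found in that image.
collection = {
    "img_a": [3, 17, 17, 42, 99],
    "img_b": [5, 17, 42],
    "img_c": [7, 8, 9],
}

# Inverted index: visual word -> posting list of image ids.
index = defaultdict(list)
for image_id, words in collection.items():
    for w in set(words):          # index each word once per image
        index[w].append(image_id)

def candidates(query_words, min_matches=2):
    """Pull posting lists for the query's visual words and keep images
    sharing at least `min_matches` words with the query."""
    votes = Counter()
    for w in set(query_words):
        for image_id in index.get(w, []):
            votes[image_id] += 1
    return sorted((img for img, v in votes.items() if v >= min_matches),
                  key=lambda img: (-votes[img], img))

print(candidates([17, 42, 63]))   # → ['img_a', 'img_b']
```

The images that survive this filter then go on to the geometric verification of step 3.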

The ideas behind this approach are described in:
  • J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object Retrieval with Large Vocabularies and Fast Spatial Matching. In CVPR, 2007.
  • J. Sivic and A. Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. In ICCV, 2003.
  • J. Philbin, J. Sivic, and A. Zisserman. Geometric Latent Dirichlet Allocation on a Matching Graph for Large-Scale Image Datasets (sections 3.1, 3.2).

The method we use to search for images by a query picture is similar to the traditional approach above. However, verifying the relative positions of local features requires significant computational resources, and in order to search the huge collection of images stored in the Yandex search index, we had to find more efficient ways to solve the problem. Our indexing method significantly reduces the number of images that have to be considered as possibly relevant to the query.

The key idea in our candidate search is the move from indexing visual words to indexing more discriminative features, together with a special index structure.

Two methods proved equally good for selecting candidates:
  1. Indexing higher-level features, or visual phrases: combinations of visual words together with parameters describing the relative positions and other relative characteristics of the corresponding local features of the image.
  2. A multi-index whose key is built from quantized parts of descriptors (product quantization). The method has been published.
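The second method's key construction can be sketched as follows. This is a hedged illustration of product quantization, not the published system: the descriptor is split into sub-vectors, each quantized against its own small codebook, and the tuple of sub-codeword ids serves as the multi-index key. The codebooks below are hand-made toys; real ones are learned (e.g. by k-means per sub-space).

```python
def pq_key(descriptor, codebooks):
    """Return a tuple of sub-codeword ids used as the multi-index key."""
    m = len(codebooks)
    sub_len = len(descriptor) // m
    key = []
    for i, codebook in enumerate(codebooks):
        sub = descriptor[i * sub_len:(i + 1) * sub_len]
        # nearest codeword by squared Euclidean distance
        best = min(range(len(codebook)),
                   key=lambda j: sum((a - b) ** 2
                                     for a, b in zip(sub, codebook[j])))
        key.append(best)
    return tuple(key)

# Two toy codebooks of 4 codewords each, for a 4-D descriptor split in half.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)],
    [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)],
]
print(pq_key([0.9, 1.1, 0.1, -0.2], codebooks))   # → (1, 0)
```

Because each part of the key carries independent information about the descriptor, the combined key is far more discriminative than a single visual word, which keeps posting lists short.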

During validation, we use our own implementation of clustering of transformations between images.
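The idea behind this kind of validation can be sketched with a Hough-style voting scheme: every pair of matched keypoints proposes a transformation, and a genuine match produces one dominant cluster of consistent proposals, while random matches scatter. This toy keeps only translation (the real system must also handle scale and rotation), and all coordinates below are hypothetical.

```python
from collections import Counter

def dominant_transform(matches, bin_size=8):
    """matches: list of ((qx, qy), (ix, iy)) matched keypoint pairs.
    Returns (inlier_count, binned translation) of the biggest cluster."""
    votes = Counter()
    for (qx, qy), (ix, iy) in matches:
        dx, dy = ix - qx, iy - qy
        # quantize the proposed shift so nearby proposals share a bin
        votes[(round(dx / bin_size), round(dy / bin_size))] += 1
    (tx, ty), count = votes.most_common(1)[0]
    return count, (tx * bin_size, ty * bin_size)

# Three matches agree on a shift of roughly (40, -16); one is an outlier.
matches = [((10, 10), (51, -5)), ((30, 50), (70, 33)),
           ((5, 80), (44, 65)), ((60, 60), (200, 300))]
print(dominant_transform(matches))   # → (3, (40, -16))
```

The size of the dominant cluster serves as a geometric consistency score for ranking the candidate.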

Now let's go up one level and look at the product as a whole.

  1. The user's picture lands in temporary storage.
  2. From there, a reduced copy of the image goes to a daemon, which computes the image's descriptors and visual words and forms a search query from them.
  3. The query is sent to the mid-tier metasearch, and from there it is fanned out to the base searches. Each base search may hold many images.
  4. The found images are returned to the mid-tier metasearch. There the results are merged, and the resulting ranked list is shown to the user.
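Steps 3 and 4 above boil down to a fan-out/merge pattern, sketched below under heavy simplification: shard contents are invented, the "score" is just the number of shared visual words, and the fan-out is sequential where production would use parallel RPCs.

```python
def base_search(shard, query_words):
    """One base search: score its slice of the index against the query."""
    return [(img, len(set(words) & set(query_words)))
            for img, words in shard.items()
            if set(words) & set(query_words)]

def metasearch(shards, query_words, top_n=3):
    """Mid-tier metasearch: fan out, then merge into one ranked list."""
    merged = []
    for shard in shards:                       # in production: parallel RPCs
        merged.extend(base_search(shard, query_words))
    merged.sort(key=lambda r: (-r[1], r[0]))   # rank by score, then by id
    return [img for img, _ in merged[:top_n]]

shards = [
    {"img_a": [1, 2, 3], "img_b": [2, 9]},     # shard 1's part of the index
    {"img_c": [1, 2, 5], "img_d": [7, 8]},     # shard 2's part of the index
]
print(metasearch(shards, query_words=[1, 2, 3]))   # → ['img_a', 'img_c', 'img_b']
```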

Each base search works with its own part of the index. This makes the system scalable: as the index grows, new fragments and new replicas of the base search are added. Fault tolerance is provided by duplicating base searches and index fragments.

And one more important nuance. When we built our image search index, to increase search efficiency we used the knowledge we already had about duplicate images: from each group of duplicates we took only one representative and included it in the index. The logic is simple: if a picture is relevant to a query, then all its copies are equally relevant answers.
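The dedup trick amounts to picking one representative per duplicate group before indexing, as in this minimal sketch (file names and group ids are hypothetical; how duplicates are detected is a separate problem not shown here):

```python
# Hypothetical precomputed mapping: image -> id of its duplicate group.
duplicate_groups = {
    "photo1.jpg": "g1", "photo1_copy.jpg": "g1", "photo1_big.jpg": "g1",
    "photo2.jpg": "g2",
}

def representatives(groups):
    """Pick the first-seen image of each duplicate group for the index."""
    reps = {}
    for image, group in groups.items():
        reps.setdefault(group, image)   # keep only the first member
    return sorted(reps.values())

print(representatives(duplicate_groups))   # → ['photo1.jpg', 'photo2.jpg']
```

At serving time, every copy in a group inherits the relevance of its indexed representative.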

In this version we expected to find only copies of the image, either matching the uploaded picture completely or containing matching fragments. But it turns out our solution can generalize: sometimes it finds not just the same picture, but a different picture containing the same object. This shows up often with architecture, for example.





This is our first step in searching images by content. Of course, we will keep developing these technologies and building new products on top of them. Whatever else we may lack, we have no shortage of ideas and desire for that.