Cabbage soup, or Recognition of 330 million faces at a speed of 400 photos / sec

Face recognition in 2018 surprises no one: every university student, and perhaps even a schoolkid, has done it. Things get a little more complicated, though, when you have not one million users but:


  • 330 million user accounts;
  • 20 million user photos uploaded per day;
  • a maximum processing time of 0.2 seconds per photo;
  • a limited amount of hardware to solve the problem.


In this article we share our experience of developing and launching a face recognition system for user photos on the Odnoklassniki social network and cover everything from A to Z:


  • mathematical apparatus;
  • technical implementation;
  • launch results;
  • and the StarFace campaign, which we used to promote our solution.


The task


More than 330 million accounts are registered on Odnoklassniki, and those accounts contain more than 30 billion photos.


OK users upload 20 million photos per day. Faces are present in 9 million of them, and a total of 23 million faces are detected, that is, an average of about 2.5 faces per photo that contains at least one face.


Users can tag people in photos, but they are usually too lazy to do so. We decided to automate the search for friends in photos in order to make users more aware of the photos they appear in and to increase the amount of feedback on uploaded photos.


image


For the author to be able to confirm friends immediately after uploading a photo, photo processing must fit within 200 milliseconds even in the worst case.


Social Recognition System


Face recognition in uploaded photos


A user uploads a photo from any client (a browser or the iOS/Android mobile apps); it goes to the detector, whose task is to find faces and align them.


After the detector, the cropped and preprocessed faces are passed to a neural-network recognizer, which builds a characteristic vector of the user's face. Then the most similar profile is looked up in the database. If the similarity exceeds a threshold value, the user is recognized automatically and we send them a notification that they appear in the photo.


image


Figure 1. User recognition in the photo
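Conceptually, the matching step in Figure 1 is a nearest-centroid lookup with a similarity cut-off. A minimal sketch of that logic (the threshold value and data layout here are illustrative, not the production ones):

```python
import numpy as np

def match_face(embedding, candidate_profiles, threshold=0.7):
    """Return the id of the most similar candidate profile, or None.

    embedding          -- L2-normalized face vector from the recognizer
    candidate_profiles -- dict {user_id: L2-normalized profile centroid}
    threshold          -- illustrative similarity cut-off, not the production value
    """
    best_id, best_sim = None, -1.0
    for user_id, centroid in candidate_profiles.items():
        sim = float(np.dot(embedding, centroid))  # cosine similarity for unit vectors
        if sim > best_sim:
            best_id, best_sim = user_id, sim
    return best_id if best_sim >= threshold else None
```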


Before automatic recognition can start, we need to build a profile for each user and populate the database.


Building user profiles


For face recognition algorithms a single photo, for example the avatar, would be enough. But will the avatar actually contain the owner's face? Users put pictures of celebrities on their avatars, and profiles are full of memes or contain only group photos.


image


Figure 2. Difficult profile


Consider a user profile consisting only of group photos.
The account owner (Fig. 2) can still be determined if we take into account their gender and age, as well as friends whose profiles were built earlier.


image


Figure 3. Building user profiles


We built the user profile as follows (Fig. 3):


1) Selected the highest-quality user photos


If there were too many photos, we used no more than one hundred of the best ones.
Photo quality was determined based on:


  • the presence of manual user tags (photo pins) in the photo;
  • photo meta-information (uploaded from a mobile phone, shot on the front camera, taken on vacation, ...);
  • whether the photo was used as the avatar.

2) Found faces in these photos


  • it is not a problem if some of them belong to other users (we filter them out in step 4)

3) Computed the characteristic face vector


  • such a vector is called an embedding

4) Clustered the vectors


The goal of this clustering is to determine which subset of vectors belongs to the account owner. The main difficulty is the presence of friends and relatives in the photographs. For clustering we use the DBSCAN algorithm.
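For illustration, clustering the face embeddings of one account with scikit-learn's DBSCAN could look roughly like this; the eps and min_samples values here are placeholders (eps was actually chosen by a heuristic based on the number of faces, as described later):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_faces(embeddings, eps=0.3, min_samples=2):
    """Group the face embeddings found in one account's photos.

    embeddings -- array of shape (n_faces, dim), L2-normalized
    Returns {cluster_label: list of face indices}; label -1 is DBSCAN's noise bucket.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(embeddings)
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(int(label), []).append(idx)
    return clusters
```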


5) Determined the lead cluster


For each cluster, we calculated the weight based on:


  • cluster size;
  • the quality of the photos whose embeddings went into the cluster;
  • the presence of photo pins attached to faces from the cluster;
  • how well the gender and age of the faces in the cluster match the profile information;
  • the proximity of the cluster centroid to friends' profiles built earlier.

The coefficients of the features used to calculate the cluster weight are fitted by linear regression. Determining the true gender and age of a profile is a separate, difficult task that we discuss later.
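A sketch of what fitting and applying such a weight could look like; the feature encoding and the toy training rows are our own illustration, the article only says that the coefficients are fitted by linear regression on a labeled training set:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training rows, one per cluster, with the features listed above:
# [size, mean photo quality, photo pins, gender/age match, proximity to friends' profiles]
X_train = np.array([
    [40, 0.9, 5, 1.0, 0.8],   # a typical "owner" cluster
    [ 6, 0.7, 0, 0.2, 0.3],   # a friend or relative
    [ 2, 0.4, 0, 0.0, 0.1],   # noise
])
y_train = np.array([1.0, 0.2, 0.0])          # target weights from the labeled set

model = LinearRegression().fit(X_train, y_train)

def cluster_weight(features):
    """Score one cluster; the cluster that beats the runner-up by a margin becomes the leader."""
    return float(model.predict(np.asarray(features, dtype=float).reshape(1, -1))[0])
```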


For a cluster to be considered the leader, its weight must exceed that of its closest competitor by a constant calculated on a training set. If no leader is found, we return to step 2 and use a larger number of photos. For some users we kept two clusters: this happens with joint profiles, since some families share a common account.


6) Computed the user embedding from its clusters


  • Finally, we build a vector that characterizes the appearance of the account owner: the "user embedding".

The user embedding is the centroid of the cluster selected for the user.
Centroids can be built in many different ways, but after numerous experiments we returned to the simplest one: averaging the vectors in the cluster.
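The averaging step itself is a one-liner; here it is for L2-normalized embeddings (re-normalizing the mean back to unit length is our assumption):

```python
import numpy as np

def cluster_centroid(embeddings):
    """User embedding as the mean of the cluster's face vectors, re-normalized to unit length."""
    c = np.mean(embeddings, axis=0)
    return c / np.linalg.norm(c)
```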


Like clusters, a user can have several embeddings.


In one full iteration we processed eight billion photos, went through 330 million profiles, and built embeddings for three hundred million accounts. On average, 26 photos were processed to build one profile. Even a single photo is enough to build a vector, but the more photos there are, the more confident we are that the constructed profile belongs to the account owner.


We ran the process of building all profiles on the portal several times, since information about friends improves the quality of cluster selection.
The amount of data needed to store the vectors is ~300 GB.


Face detector


The first version of OK's face detector was launched in 2013 on the basis of a third-party solution similar in characteristics to a Viola-Jones detector. Over five years this solution became outdated; modern solutions based on MTCNN show roughly twice the accuracy. Therefore, we decided to follow the trend and built our own cascade of convolutional neural networks (MTCNN).


The old detector used more than 100 aging CPU servers. Almost all modern face detection algorithms are based on convolutional neural networks, which run most efficiently on GPUs. We could not buy a large number of video cards for objective reasons: they were expensive, and miners had bought up everything. So we decided to start with a CPU detector (we were not going to throw those servers away, after all).


To detect faces in newly uploaded photos, we use a cluster of 30 machines (the rest were handed over for scrap). Detection during the construction of user vectors (iteration over accounts) is done on 1000 low-priority virtual cores in our cloud. The cloud solution is described in detail in a talk by Oleg Anastasiev: One-cloud, a data-center-level OS at Odnoklassniki.


When analyzing the detector's running time, we encountered the following worst case: the top-level network passes too many candidates to the next level of the cascade, and the detector runs for a long time. For example, the search time reaches 1.5 seconds on photographs like these:


image


Figure 4. Examples of a large number of candidates after the first network in a cascade


To optimize this case, we relied on the assumption that a photograph usually contains only a few faces. Therefore, after the first stage of the cascade we keep no more than 200 candidates, ranked by the network's confidence that the candidate is a face.
This optimization reduced the worst-case time to 350 ms, i.e. by a factor of 4.
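A minimal sketch of that cap; the 200-candidate limit comes from the text above, while the data layout is assumed:

```python
import numpy as np

def keep_top_candidates(boxes, scores, max_candidates=200):
    """Keep at most max_candidates proposals with the highest first-stage confidence.

    boxes  -- array (n, 4) of candidate windows from the first network of the cascade
    scores -- array (n,) with the network's confidence that each window contains a face
    """
    if len(scores) <= max_candidates:
        return boxes, scores
    top = np.argsort(scores)[::-1][:max_candidates]   # indices of the most confident windows
    return boxes[top], scores[top]
```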


By applying a couple more optimizations (for example, replacing Non-Maximum Suppression after the first stage of the cascade with filtering based on blob detection), we sped the detector up by another factor of 1.4 without loss of quality.
Progress has not stood still either, and faces are now being found by more elegant methods, see FaceBoxes. We do not rule out moving to something similar in the near future.


Face recognition


When developing the recognizer, we experimented with several architectures: Wide ResNet, Inception-ResNet, and Light CNN.
Inception-ResNet performed slightly better than the others, so we settled on it.


The algorithm needs a trained neural network: you can find one on the Internet, buy one, or train your own. Training a neural network requires a dataset on which training and validation take place. Since face recognition is a well-known task, ready-made datasets already exist: MSCeleb, VGGFace/VGGFace2, MegaFace. However, harsh reality comes into play here: the generalization ability of modern neural networks in face identification tasks (and in general) leaves much to be desired.
And the faces on our portal are very different from what can be found in open datasets:


  • a different age distribution: our photos contain children;
  • a different distribution of ethnic groups;
  • many faces of very low quality and resolution (photos taken on a phone 10 years ago, group photos).

The third point is easy to overcome by artificially reducing the resolution and adding JPEG artifacts, but the other two cannot be emulated well.
Therefore, we decided to build our own dataset.
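The low-resolution / JPEG degradation mentioned above is easy to emulate for augmentation; a sketch with Pillow (the scale and quality values are illustrative):

```python
from io import BytesIO
from PIL import Image

def degrade(img, scale=0.25, jpeg_quality=20):
    """Emulate low-resolution, heavily compressed photos for augmentation."""
    w, h = img.size
    small = img.resize((max(1, int(w * scale)), max(1, int(h * scale))), Image.BILINEAR)
    buf = BytesIO()
    small.convert("RGB").save(buf, format="JPEG", quality=jpeg_quality)  # add JPEG artifacts
    buf.seek(0)
    return Image.open(buf).resize((w, h), Image.BILINEAR)  # upscale back to the original size
```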


Through trial and error, we arrived at the following procedure for building the dataset:


  1. We download photos from ~100k open profiles.
    We select profiles randomly, minimizing the number of profile owners who are friends with each other. As a result, we assume that each person in the dataset appears in only one profile.
  2. We build face vectors (embeddings).
    To build the embeddings we use a pre-trained open-source neural network (we took it from here).
  3. We cluster the faces within each account.
    A couple of obvious observations:


    • We don't know how many different people appear in an account's photos. Therefore, the clusterer must not require the number of clusters as a hyperparameter.
    • Ideally, for faces of the same person we would like to get very similar vectors forming dense spherical clusters. But the Universe does not care about our aspirations, and in practice these clusters take on intricate shapes (for example, for a person with and without glasses the cluster usually consists of two clumps). Therefore, centroid-based methods will not help here; density-based ones are needed.

      For these two reasons, and based on experimental results, DBSCAN was chosen. Hyperparameters were selected by hand and validated by eye, all standard here. For the most important of them, eps in scikit-learn terms, we came up with a simple heuristic based on the number of faces in the account.

  4. We filter the clusters.
    The main sources of dataset pollution and how we fought them:


    • Sometimes faces of different people merge into one cluster (due to imperfections of the recognizer network and the density-based nature of DBSCAN).
      The simplest precaution helped: if two or more faces in a cluster came from the same photo, we threw the whole cluster out just in case.
      This means that lovers of selfie collages did not make it into our dataset, but it was worth it: the number of false merges decreased significantly.
    • The opposite also happens: the same face forms several clusters (for example, when there are photographs with and without glasses, with and without makeup, etc.).
      Common sense and experimentation led us to the following: we measure the distance between the centroids of a pair of clusters. If it is below a certain threshold, we merge the clusters; if it is small but does not pass the threshold, we throw one of the clusters away just to be safe.
    • It also happens that the detector makes a mistake, and the clusters end up containing things that are not faces.
      Fortunately, the recognizer network is easily made to filter out such false positives. More on this below.

  5. We train the neural network on the result and go back to step 2 with it.
    We repeat this 3-4 times until done.
    The network gradually gets better, and on the last iterations the need for our filtering heuristics disappears altogether.

Having decided that more diversity is better, we mixed something else into our brand-new dataset (3.7M faces, 77K people; code name: OKFace).
The most useful "something else" was VGGFace2: quite large and complex (poses, lighting). As usual, it is made up of celebrity photos found via Google and is, unsurprisingly, very "dirty". Fortunately, cleaning it with a neural network trained on OKFace is a trivial matter.
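One straightforward way to do such cleaning is sketched below: embed all faces of a claimed identity with the OKFace-trained network and drop the ones that are far from the identity's centroid (the distance threshold is our assumption):

```python
import numpy as np

def clean_identity(embeddings, max_dist=0.5):
    """Drop faces that lie too far from their identity's centroid.

    embeddings -- (n, dim) L2-normalized vectors of faces labeled as one person
    Returns the indices of the faces to keep.
    """
    centroid = embeddings.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    cos = embeddings @ centroid                      # cosine similarity to the centroid
    return np.nonzero(1.0 - cos <= max_dist)[0]      # keep faces within the distance threshold
```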


Loss function


A good loss function for embedding learning is still an open problem. We approached it from the following position: the loss function should be as close as possible to how the model will be used after training.


And our network will be used in the most standard way.


When a face is found in a photo, its embedding $x$ is compared by cosine similarity with the centroids $c_1, ..., c_n$ from the candidates' profiles (the uploader and their friends). If $\cos(x, c_i) \ge m_p$, we declare that candidate number $i$ is in the photo.


Accordingly, we want to:


  • for the "right" candidate $t$, $\cos(x,c_t)$ to exceed the threshold $m_p$;
  • for all the others, to stay below $m_p$, and preferably with a margin, below $m_n$.

Deviation from this ideal is penalized quadratically, because that is what everyone does and, empirically, it turned out better. The same in formula form:


$loss(x,t) = \max(m_p - \cos(x,c_t), 0)^2 + \sum_{i \ne t} \max(\cos(x,c_i) - m_n, 0)^2$



And the centroids $c_1,...,c_n$ themselves are simply parameters of the neural network; they are trained, like everything else, by gradient descent.
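To make the formula concrete, here is a toy NumPy evaluation of this loss for a single face; the margin values are made up, and in real training the centroids would be trainable parameters of the network rather than fixed arrays:

```python
import numpy as np

def face_loss(x, centroids, t, m_p=0.8, m_n=0.4):
    """Toy evaluation of the loss above (margins m_p, m_n are illustrative).

    x         -- L2-normalized face embedding, shape (d,)
    centroids -- L2-normalized candidate centroids c_1..c_n, shape (n, d)
    t         -- index of the "right" candidate
    """
    cos = centroids @ x                        # cosine similarities for unit vectors
    pos = max(m_p - cos[t], 0.0) ** 2          # push cos(x, c_t) above m_p
    neg = np.square(np.maximum(cos - m_n, 0.0))
    neg[t] = 0.0                               # the sum runs only over i != t
    return float(pos + neg.sum())
```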


This loss function has its own problems. First, it is not suitable for training from scratch. Second, tuning two parameters, $m_p$ and $m_n$, is rather tiresome. Nevertheless, fine-tuning with it achieved higher accuracy than the other functions known to us: Center Loss, Contrastive-Center Loss, A-Softmax (SphereFace), LMCL (CosFace).


And was it worth it?


            LFW                              OKFace test set
            Accuracy        TP@FP=0.001      Accuracy        TP@FP=0.001
Before      0.992 ± 0.003   0.977 ± 0.006    0.941 ± 0.007   0.476 ± 0.022
After       0.997 ± 0.002   0.992 ± 0.004    0.994 ± 0.003   0.975 ± 0.012

Figures in the table are averages over 10 measurements ± standard deviation.


An important metric for us is TP@FP: what percentage of faces we recognize at a fixed rate of false positives (here 0.1%).
With an error limit of 1 in 1000 and without fine-tuning the neural network on our dataset, we could recognize only about half of the faces on the portal.
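TP@FP can be measured directly from similarity scores of matching and non-matching pairs; a small sketch of the measurement (tie handling and edge cases are simplified):

```python
import numpy as np

def tp_at_fp(pos_scores, neg_scores, fp_rate=0.001):
    """Fraction of same-person pairs accepted at the threshold giving the requested FP rate.

    pos_scores -- similarity scores of same-person pairs
    neg_scores -- similarity scores of different-person pairs
    """
    neg_sorted = np.sort(np.asarray(neg_scores))[::-1]
    k = max(1, int(len(neg_sorted) * fp_rate))
    threshold = neg_sorted[k - 1]              # highest threshold letting through ~fp_rate negatives
    return float(np.mean(np.asarray(pos_scores) >= threshold))
```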


Minimize false positives


The detector sometimes finds faces where there are none, and it does so fairly often on user photos (4% of false positives).
It is rather unpleasant when such "garbage" gets into the training dataset.
It is very unpleasant when we insistently ask our users to "tag a friend" in a bouquet of roses or in the texture of a carpet.
The problem can be solved, and the most obvious way is to collect more non-faces and run them through the recognizer network.


As usual, we decided to start with a quick crutch:


  1. We grab from the Internet a dozen images that, in our opinion, should contain no faces
    image
  2. We take random crops from them, build embeddings, and cluster them. We ended up with only 14 clusters.
  3. If the embedding of a tested "face" is close to the centroid of one of these clusters, we consider the "face" to be a non-face (see the sketch below).
  4. We rejoice at how well our method works.
  5. We realize that the described scheme amounts to a two-layer neural network (with 14 units in the hidden layer) on top of the embeddings, and we become a little sad.
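The check in step 3 then boils down to roughly the following (the distance threshold is illustrative):

```python
import numpy as np

def is_non_face(embedding, junk_centroids, max_dist=0.4):
    """Reject a detection whose embedding lands close to one of the "non-face" cluster centroids.

    embedding      -- L2-normalized vector from the recognizer
    junk_centroids -- (14, dim) centroids built from random crops of face-free images
    max_dist       -- illustrative cosine-distance threshold
    """
    dists = 1.0 - junk_centroids @ embedding   # cosine distances to each non-face centroid
    return bool(dists.min() <= max_dist)
```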

What is interesting here is that the recognizer network maps all the diversity of non-faces into just a few regions of the embedding space, even though no one taught it to do so.


Everyone lies, or determining real age and gender on a social network


Users often do not indicate their age or indicate it incorrectly, so we estimate a user's age from their friends graph. Clustering the ages of friends helps here: in the general case, the user's age lies in the largest cluster of their friends' ages. Gender was determined with the help of first names and surnames.
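A toy version of that idea, using a simple sliding-window density estimate over friends' ages instead of a real clustering algorithm (the window size is arbitrary; the talk linked below describes the actual approach):

```python
import numpy as np

def estimate_age(friend_ages, window=3):
    """Guess a user's age as the densest region of their friends' ages.

    friend_ages -- ages of the user's friends (non-empty list)
    window      -- +/- range in years treated as one "cluster"
    """
    ages = np.asarray(friend_ages)
    candidates = np.arange(ages.min(), ages.max() + 1)
    counts = [(np.abs(ages - a) <= window).sum() for a in candidates]
    densest = candidates[int(np.argmax(counts))]
    return float(ages[np.abs(ages - densest) <= window].mean())
```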


Vitaliy Khudobakhshov: "How to find out the age of a person on a social network, even if it is not indicated"


Solution Architecture


Since the entire internal infrastructure of OK is built in Java, we wrap all the components in Java as well. Inference for the detector and the recognizer runs in TensorFlow through its Java API. The detector runs on CPUs, since it meets our requirements there and uses existing hardware. For the recognizer we installed 72 GPU cards, since running Inception-ResNet on CPUs is not reasonable in terms of resources.


We use Cassandra as the database for storing user vectors.
Since the total volume of vectors for all portal users is ~300 GB, we add a cache for quick access to them. The cache is implemented off-heap; details can be found in the article by Andrey Pangin: "Using shared memory in Java and off-heap caching".


The resulting architecture withstands a load of up to 1 billion photos per day when iterating over user profiles, while the processing of newly uploaded photos (~20 million per day) continues in parallel.


image


Figure 6. Solution architecture


Results


As a result, we built a system trained on real data from a social network that delivers good results with limited resources.


The recognition quality on a dataset built from real OK profiles is TP = 97.5% at FP = 0.1%. The average processing time for one photograph is 120 ms, and the 99th percentile fits within 200 ms. The system is self-improving: the more often a user is tagged in photos, the more accurate their profile becomes.


Now, after photos are uploaded, the users found in them receive notifications and can confirm that they are in the photo or remove the tag if they do not like the photo.


image


Automatic recognition led to a twofold increase in impressions of photo-tag events in the feed, and the number of clicks on these events tripled. Users' interest in the new feature is obvious, but we plan to increase activity even further by improving the UX and adding new applications, such as StarFace.


Video of the user experience


The StarFace flash mob


To acquaint users of the social network with the new functionality, OK announced a contest: users upload their photos with Russian sports stars, show-business celebrities, and popular bloggers who maintain accounts on Odnoklassniki, and receive a badge for their avatar or a subscription to paid services. Details here: https://insideok.ru/blog/odnoklassniki-zapustili-raspoznavanie-lic-na-foto-na-osnove-neyrosetey


In the first days of the campaign, users uploaded more than 10 thousand photos with celebrities: selfies and photos with stars, photos in front of posters and, of course, "photoshops". Photos of users who received VIP status:


image


Plans


Since most of the time is spent in the detector, further speed optimization should focus on the detector: replacing it or moving it to GPUs.


We will also try combinations of different recognition models if that significantly improves quality.
From a user perspective, the next step is recognizing people in videos. We also plan to inform users when copies of their profile exist on the network, with the ability to report a clone.


Submit your ideas for using face recognition in the comments.