Getting to know Nirvana, Yandex’s universal computing platform

Machine learning has become a buzzword, but for anyone working with large amounts of data it has been a vital necessity for many years. Yandex processes over 200 million requests daily! Once there were so few sites on the Internet that the best of them fit into a catalog; now the relevance of links to pages is determined by complex formulas that keep learning from new data. This work falls to so-called pipelines: regular processes that train and monitor these formulas.

Today we want to share with the Habr community our experience in creating the Nirvana computing platform, which, among other things, is used for machine learning tasks.

Nirvana is a general-purpose cloud platform for managing computing processes, in which applications are launched in the order specified by the user. Nirvana stores everything the processes need: descriptions, links, process blocks, and related data. Processes are laid out as acyclic graphs.

Developers, analysts and managers from various Yandex departments use Nirvana to solve computational problems, because far from everything can be computed on a laptop (we will cover the other reasons at the end of the article, when we move on to examples of using Nirvana).

We will describe the problems we ran into with the previous solution, walk through the key components of Nirvana, and explain why the platform got this name. Then we will look at a screenshot and move on to the tasks for which the platform is useful.

How Nirvana came about

Training ranking formulas is a constant and voluminous task. Yandex now works with the CatBoost and MatrixNet technologies; in both cases, building ranking models requires significant computing resources and an intuitive interface.

The FML service (Friendly Machine Learning) was in its time a big step in automation and simplification: it put machine learning work on a production footing. FML gave easy access to tools for configuring training parameters, analyzing results, and managing hardware resources for distributed launches on a cluster.

But since users received FML as a ready-made tool, every interface improvement and every new feature fell on the shoulders of the team. At first this seemed convenient: we add only the necessary features to FML, keep to the release cycle, immerse ourselves in the users’ domain, and build a truly friendly service.

But along with these advantages, we got poor development scalability. The flow of requests for fixes and improvements to FML exceeded all our expectations, and to handle them all quickly we would have had to expand the team without limit.

FML was created as an internal service for Search, but developers from other departments whose work also involved MatrixNet and machine learning quickly learned about it. It turned out that FML’s capabilities reach far beyond search tasks and that demand significantly exceeded our resources. We were at a dead end: how do you develop a popular service if it requires the team to grow in proportion?

We found our answer in open architecture. In developing Nirvana, we deliberately avoided tying it to any subject area. Hence the name: the platform is indifferent to the tasks you bring to it, just as a development environment is indifferent to what your program is about and a graphics editor does not care which picture you are editing right now.

And what does matter to Nirvana? Executing, accurately and quickly, an arbitrary process configured as a graph whose vertices are blocks with operations and whose connections between blocks follow the data.
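
To make the idea of a graph of blocks connected through data more concrete, here is a minimal sketch in Python. It is not Nirvana's actual code or API: the function `run_graph` and the block names are hypothetical, and the standard-library `graphlib` module stands in for whatever scheduling the platform really does.

```python
from graphlib import TopologicalSorter


def run_graph(blocks, inputs_of):
    """Run each block once every block it depends on has produced its data.

    blocks:    block name -> callable that takes the upstream results as arguments
    inputs_of: block name -> list of upstream block names (the data links)
    Both structures are hypothetical, chosen only for this illustration.
    """
    dependencies = {name: inputs_of.get(name, []) for name in blocks}
    results = {}
    # static_order() yields upstream blocks first and raises CycleError
    # if the graph turns out not to be acyclic.
    for name in TopologicalSorter(dependencies).static_order():
        args = [results[dep] for dep in inputs_of.get(name, [])]
        results[name] = blocks[name](*args)
    return results


# A three-block chain: load -> sort -> top
blocks = {
    "load": lambda: [3, 1, 2],
    "sort": lambda data: sorted(data),
    "top": lambda data: data[-1],
}
inputs_of = {"sort": ["load"], "top": ["sort"]}
results = run_graph(blocks, inputs_of)
```

The executor here knows nothing about what the blocks compute; it only follows the data links, which is the property the paragraph above describes.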

Since Nirvana appeared in the company, developers, analysts and managers from various Yandex departments have taken an interest in it, and not only those involved in machine learning (other examples are at the end of the article). Nirvana processes millions of operation blocks per week. Some of them are run from scratch, and some are restored from the cache: if a process is in regular production and its graph is restarted often, it is likely that some deterministic blocks do not need to be rerun, and the result already obtained by such a block in another graph can be reused.
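
One way to picture the reuse of results from deterministic blocks is a cache keyed by the operation, its parameters and its inputs. The sketch below is an assumption about how such a cache could look, not a description of Nirvana's implementation; `ResultCache` and all its method names are hypothetical.

```python
import hashlib
import json


class ResultCache:
    """Toy result cache: same operation + parameters + inputs -> same key."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def key(operation_id, params, inputs):
        # Inputs and parameters must be JSON-serializable in this sketch.
        payload = json.dumps([operation_id, params, inputs], sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def run(self, operation_id, func, params, inputs=(), deterministic=True):
        k = self.key(operation_id, params, list(inputs))
        if deterministic and k in self._store:
            self.hits += 1
            return self._store[k]      # reuse the earlier result
        result = func(*inputs, **params)
        if deterministic:              # non-deterministic blocks are never cached
            self._store[k] = result
        return result


cache = ResultCache()
calls = []


def scale(values, factor):
    calls.append(factor)               # record each real execution
    return [v * factor for v in values]


first = cache.run("scale", scale, {"factor": 2}, inputs=[[1, 2, 3]])
second = cache.run("scale", scale, {"factor": 2}, inputs=[[1, 2, 3]])
```

The second call does not execute `scale` again; it is served from the cache, even if it comes from a different graph that happens to use the same operation on the same data.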

Nirvana did not just make machine learning more accessible; it became a meeting place. A manager creates a project and calls in a developer, the developer assembles the process and launches it, and after many runs, which the manager watches, an analyst comes in to interpret the results. Nirvana made it possible to reuse operations (or entire graphs!) created and maintained by other users, so that nobody has to do the same work twice. Graphs vary widely, from a few blocks to several thousand operations and data objects. They can be assembled in the graphical interface (a screenshot is at the end of the article) or through the API services.

How Nirvana works

Nirvana has three large sections: Projects (large business tasks, or groups in which people work on shared tasks), Operations (a library of ready-made components, plus the ability to create a new one), and Data (a library of all objects uploaded to Nirvana, plus the ability to upload a new one).

Users assemble graphs in the Editor. You can clone someone else’s successful process and edit it, or build your own from scratch by dragging blocks with operations or data onto the canvas and connecting them with links (in Nirvana, the links between blocks go through the data).

First, let’s talk about the architecture of the system. We suspect that among our readers there are fellow backend developers who are curious to peek into our kitchen. We usually cover this at interviews so that the candidate is prepared for how Nirvana is built.

Then we will move on to a screenshot of the interface and real-life examples.

Users usually start with Nirvana’s graphical interface (a single-page application); over time, many regular processes move to the API services. In general, Nirvana does not care which interface is used: the graphs run the same way. But the more production processes move to Nirvana, the more noticeable it becomes that most graphs are launched through the API. The UI remains for experiments and initial setup, and for making changes as needed.

On the backend side is Data Management: a model and storage for information about graphs, operations and results, plus a services layer that serves the frontend and the API.

A level below sits another important component, the Workflow Processor. It ensures that graphs are executed without knowing anything about the kinds of operations they consist of. It initializes blocks, works with the operations cache, and tracks dependencies. Executing the operations themselves, however, is not the Workflow Processor’s job. That is done by separate external components, which we call processors.

Processors bring functionality from a particular domain into Nirvana. They are developed by the users themselves (although we maintain the core processors ourselves). Processors have access to our distributed storage: they read the input data for operations from it and write the results back to it.

From Nirvana’s point of view, a processor is an external service that implements a specified API, so you can write your own processor without changing either Nirvana or the existing processors. There are three main methods: start a task, stop it, and get its status. Once Nirvana (or rather, the Workflow Processor) has made sure that all incoming dependencies of an operation in the graph are ready, it sends a start request to the processor specified in the task, passing the configuration and links to the input data. It then periodically requests the execution status and, when the task is ready, moves on to the dependent operations.
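
The three-method contract and the polling loop can be sketched as follows. This is only an illustration under assumptions: the real processor API is not shown in this article, so the class `Processor`, the method names `start`, `stop` and `status`, the state strings and the helper `run_block` are all hypothetical.

```python
import itertools
import time
from abc import ABC, abstractmethod


class Processor(ABC):
    """Hypothetical shape of the three-method processor contract."""

    @abstractmethod
    def start(self, config, input_links):
        """Begin a task and return its id."""

    @abstractmethod
    def stop(self, task_id):
        """Cancel a task."""

    @abstractmethod
    def status(self, task_id):
        """Return "running", "done" or "stopped"."""


class EchoProcessor(Processor):
    """Toy processor: reports "running" twice, then "done"."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._polls_left = {}

    def start(self, config, input_links):
        task_id = next(self._ids)
        self._polls_left[task_id] = 2
        return task_id

    def stop(self, task_id):
        self._polls_left.pop(task_id, None)

    def status(self, task_id):
        if task_id not in self._polls_left:
            return "stopped"
        if self._polls_left[task_id] > 0:
            self._polls_left[task_id] -= 1
            return "running"
        return "done"


def run_block(processor, config, input_links, poll_interval=0.0, max_polls=100):
    """Start a task, poll its status, and stop it if it never finishes."""
    task_id = processor.start(config, input_links)
    for _ in range(max_polls):
        state = processor.status(task_id)
        if state != "running":
            return state
        time.sleep(poll_interval)
    processor.stop(task_id)
    return "stopped"


final_state = run_block(EchoProcessor(), config={}, input_links=[])
```

Because the caller depends only on the three methods, a new processor can be added as a separate class (or, in the real system, a separate service) without touching the polling code.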

The main processor maintained by the Nirvana team is called the Job processor. It lets you run an arbitrary executable file on Yandex’s extensive cluster (using the scheduler and the resource management system). A distinctive feature of this processor is that applications are launched in full isolation, so parallel launches work strictly within the resources allocated to them.

In addition, the application can, if necessary, be run on several servers in distributed mode (this is how MatrixNet works). The user only needs to upload the executable file and specify the command line to run and the required amount of computing resources. The platform takes care of the rest.

Another key component of Nirvana is the key-value store, which holds both the results of operations and uploaded executable files or other resources. We built into Nirvana’s architecture the ability to work with several locations and storage implementations at once, which lets us improve the efficiency and structure of data storage and carry out the necessary migrations without interrupting user processes. Over the platform’s lifetime we have lived with the Ceph file system and with YT, our MapReduce and data storage technology, and eventually moved to MDS, another internal storage system.

Any storage system has limitations. The first is the maximum amount of data stored. With the number of Nirvana users and processes constantly growing, we risk filling up any repository, even the largest. But we consider most of the data in the system temporary, which means it can be deleted. Because the structure of an experiment is known, any result can be obtained again by restarting the corresponding graph. And if a user needs a data object forever, they can deliberately save it in the Nirvana repository with an infinite TTL to protect it from deletion. We also have a quota system that lets us divide storage among different business tasks.
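
The TTL idea can be illustrated with a small in-memory store. This is a sketch under assumptions, not Nirvana's storage layer: the class `TtlStore`, its methods and the default TTL of one week are invented for the example, and quotas are left out entirely.

```python
import math

WEEK_SECONDS = 7 * 24 * 3600  # arbitrary default TTL chosen for this example


class TtlStore:
    """Toy key-value store in which every object expires unless pinned."""

    def __init__(self):
        self._items = {}  # key -> (value, expires_at)

    def put(self, key, value, now, ttl=WEEK_SECONDS):
        self._items[key] = (value, now + ttl)

    def pin(self, key):
        # Infinite TTL: the object is never selected for deletion.
        value, _ = self._items[key]
        self._items[key] = (value, math.inf)

    def get(self, key):
        return self._items[key][0]

    def evict_expired(self, now):
        expired = [k for k, (_, expires_at) in self._items.items() if expires_at <= now]
        for k in expired:
            del self._items[k]
        return expired


store = TtlStore()
store.put("intermediate-table", [1, 2, 3], now=0, ttl=10)
store.put("final-model", "model-bytes", now=0, ttl=10)
store.pin("final-model")                 # the user wants to keep this one forever
removed = store.evict_expired(now=11)    # only the unpinned object is removed
```

The evicted intermediate result is not lost for good: as the paragraph above notes, it can be recomputed by restarting the graph that produced it.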

What Nirvana looks like and why it is useful

So that you can picture what our service’s interface looks like, we have attached an example of a graph that prepares and runs a quality evaluation of a formula using CatBoost technology.

Why are Yandex services and developers using Nirvana? Here are some examples.

1. The process of selecting ads for the Advertising Network with MatrixNet is implemented as Nirvana graphs. Machine learning makes it possible to improve the formula by adding new factors to it. Nirvana lets you visualize the training process, reuse results, set up regular training runs and, if necessary, make changes to the process.

2. The Weather team uses Nirvana for ML tasks. Because the predicted values vary by season, the model has to be retrained constantly, with the most recent data added to the training set. In Nirvana there is a graph that automatically clones itself through the API and launches new versions on fresh data in order to recompute and regularly update the model.

The Weather team also assembles experiments in Nirvana to improve the current production solution: it tests new features and compares ML algorithms with each other, choosing the right settings. Nirvana guarantees the reproducibility of experiments, provides capacity for large-scale computation, works with other internal and external products (YT, CatBoost, etc.), and removes the need to install frameworks locally.

3. With the help of Nirvana, the computer vision team can search over the hyperparameters of a neural network by running a hundred copies of a graph with different parameters and selecting the best one. Thanks to Nirvana, a new classifier for any task can, if necessary, be created “at the push of a button” without the help of computer vision specialists.

4. The Directory team runs thousands of assessments per day through Toloka and assessors, using Nirvana to automate this pipeline. For example, this is how photographs of organizations are filtered and how new ones are collected through mobile Toloka. Nirvana helps to cluster organizations (find duplicates and merge them). And most importantly, automatic processes for completely new assessments can be built in literally hours.

5. All assessment processes that rely on assessors and Toloka run on Nirvana, not only those important to the Directory. For example, Nirvana helps organize and configure the work of Pedestrians who update maps, the work of technical support, and testing by assessors.

Shall we discuss it?

Special Yandex meetups are regularly held in our offices. At one of them we talked a little about Nirvana (there are videos about how Nirvana is built and about the use of Nirvana in machine learning), and it aroused great interest. For now Nirvana is available only to Yandex employees, but we would like to hear your opinion of the design we have described and of the tasks of yours for which it would be useful. We would also be grateful if you told us in the comments about systems similar to ours. Perhaps your companies already use similar computing platforms; we would welcome advice, feedback, and stories from your own practice.