"We realized we had to write it ourselves. So we sat down and wrote it": the life of the developers of the laboratory cluster for super-large data sets at Sbertech
There is a myth that banks are ossified structures with no room for experimentation. To refute it, we conducted a short interview with Valery Vybornov, head of the department that develops the laboratory cluster for super-large data sets at Sberbank Technologies (Sbertech). His team is not afraid to use the full power of Scala, Akka, Hadoop, and Spark, and even writes prototypes in Rust.
- Discussion of an example of a pilot project (working with a social graph) with technical details;
- Used languages and technologies (Scala, Akka, Hadoop, Spark, Rust, etc.);
- Is it possible to join Sbertech directly in a managerial position? How is everything organized inside, and what are the grades?
- How does an ordinary developer live? Details of how Sbergile is implemented;
- Could you tell us a little about yourself?
- I came to Sbertech almost three years ago and worked on Big Data from the start: I took part in building the big data infrastructure. We worked as a single team that included specialists from Sberbank itself.
What tasks did we have to solve? All kinds, including recruiting. The department now employs 47 people. I hired employees and organized the work of the department. I also took part in building the cluster and developing the pilot and the prototype (this was the first phase of the Big Data build-out).
- Where did you come to Sberbank from?
- From the company Video International. There we essentially provided technical support for online advertising. We built an ad server that was used there for some time. I also worked on Big Data there and took part in creating a platform that was later spun off and until recently was known on the market as Amber Data. It has now been bought by NMG.
- When a person joins a large company directly in a managerial position, the question usually arises: what key skills made it possible to reach such heights?
- It is hard to say which one mattered most: different skills came in handy to varying degrees. First, of course, experience of working with people, management experience. One way or another, I have been in leadership positions for more than ten years. The second thing that helped is that over all those ten years I managed not to lose touch with technology. I still have not forgotten how to do things with my own hands.
- Did you do this in your free time? How much of your time went to programming?
- Let's put it this way: at Sberbank, in the early stages, when it all started, up to 80% of my time. Once the department appeared, it gradually fell to the current twenty percent, or even lower.
- Don't you miss programming?
- There is a little nostalgia. On the other hand, the tasks at this higher level, the management tasks, are also quite interesting.
- When you were engaged in programming, what tasks did you like more? You didn’t just have a unit, but a Big Data unit. Does this have anything to do with personal preferences?
- Of course. As the division developed, it became what it is now: the IT Area of Big Data application development.
- What is an IT Area?
- The Sbergile matrix structure. Vertically there are “tribes”, horizontally - “IT Areas” and business chapters. In fact, IT Area is a group of people of the same IT competence working in different teams.
- When people hear about Sbertech, they are interested in more than specific technologies. Besides, your technologies are hard to reuse elsewhere, because your whole stack is very specific. Let's discuss the hyped topics, including agile. After German Oskarovich's talk, only the lazy have not heard about it. So how do you feel about it (if you have an opinion at all)?
- This is a difficult topic with many interesting problems. What specifically should I talk about?
- What are the prospects for introducing agile in such a large company? Before, we somehow lived with ordinary project management. German Oskarovich talks about high-level abstractions, but what does this mean for specific people? For specific developers? For a specific head of a unit? In short, we need an insider's view of how agile is rolled out at scale.
- Specifically, it means several things. For example, debossing. Leaders are no longer as much "leaders" as they were before.
- "Debossing." We take the boss and reduce it. We eliminate!
- Yes :-) This is what is happening now. Perhaps it is better to formulate it differently. IT professionals used to sit separately from the business, and now we are sitting with the business. This changes our working conditions and some tasks. Together we work to understand customer needs and improve products.
- How does this affect the life of a particular developer? Rules become stricter, or vice versa, softer, or - what?
- Neither stricter nor softer. The degree of responsibility of each team and each individual employee is increasing. There is no one to pass a question up to and no one to offload coordination onto: you make decisions yourself and you bear responsibility for them.
If before Sbergile IT specialists were in the same location, in the same office, now even in Moscow they are scattered across several offices. If before people were sitting inside the department, now they are sitting on teams with the business.
- Who is your product manager?
- There is a product owner, and if we are talking about business projects, then usually, this is a person from business. He is engaged in the development of this product, prioritizes backlog, helps solve problems that the team is facing. The concept is that a team is created along with the product for the entire life of the product. Previously, the team was formed only for the duration of the project.
- Is it possible for an ordinary employee to switch from one product to another? How are things with internal mobility? For example, you have worked for three years on one product, can you somehow move from there ...
- Of course. If a person has reached the limit of development within the current team, he can initiate a rotation and choose another team.
- Cool. What about people to learn from? People whose level is an order of magnitude above your current knowledge and skills, so that you progress not only in your career but also in knowledge.
- Of course, there are experts in all IT Areas. In terms of knowledge, the gap between a grade 12 expert and a novice grade 8 developer is exactly the one you mentioned. We try to stimulate such processes in every possible way at the IT Area level, first of all in the form of exchanging knowledge and competencies. Just yesterday there was a meeting on this topic: we decided to launch regular meetings on the main stages of the products in which our IT Area participates. We also decided to launch our own internal mailing list to stimulate knowledge exchange and the subsequent growth of competencies.
- By the way, how do you feel about holding a couple of meetups in Moscow? Could you find interesting speakers with interesting stories for them?
- We are positive about it. There are speakers. Our people will also speak at conferences such as JBreak and JPoint. We often have internal talks, and their level is high enough to take them public.
- Returning to the topic, how does a developer progress in the company? You said that grade 8 is a novice developer, what is called an ordinary "engineer", without the prefixes "senior" or "chief". What career can be built from there?
- As competencies grow, a developer can move up to grades 9 and 10: these are specialists without leadership duties but with higher competencies. As the grade grows, compensation grows. From grade 11 on, things get more serious. Grade 11 used to be called "development manager", that is, team leader. With the transition to Sbergile everything has changed a little. Now there is debossing, and there are no team leaders as such. In practice, these are the people through whom architectural policy is carried out; their role is very large, in the sense that they can tell other employees how to do things correctly. Grade 12 is for the top experts, who can organize and synchronize several teams.
- Is there life after grade 12?
- There is grade 13: a leading head of a direction and an expert strategist in development. The grades above that are about managerial competencies.
- We have discussed Sbergile and organizational issues. Tell us about your department. What do you do?
- We develop applications that do various kinds of data processing on the Hadoop platform: MapReduce, Spark, Hive, and so on. And machine learning too, of course. We take tasks and turn them into code that runs on these platforms.
- You said that you have an IT area of application development?
- Yes, an IT Area is essentially an association of people with the same competencies. It is what a "department" in the Competence Center turns into under Sbergile.
- That is, you develop applications that automate some kind of business processes?
- These are applications that simply do something for the business or for the infrastructure and solve specific problems for them. Perhaps it is a somewhat artificial term for our team. Initially we had two large departments in the competence center, divided by a very simple attribute: the main development tool. The platforms are the same, but the programming language is different. Mine uses Scala, and Vadim Surpin's department uses Java. His unit is called the IT Area of Big Data platform development, and mine is called the IT Area of application development. But this division comes from our internal matters, and both IT Areas work on approximately the same tasks. For example, we differ slightly in business clients: I have "security", and he has "risks". But then again, this border is now being erased.
- And which tribe?
- Our employees work in different tribes. The tribe is the other axis. Mine are in the infrastructure and corporate tribes, and Vadim's are in risks and retail, I think.
- Now it is clear where you are in the overall coordinates: we have talked about the horizontal axis of chapters and about the tribes. Now let's talk a bit about technology. You said that you use Scala everywhere. Why Scala and not Java? Doesn't everyone usually use Java?
- I am not sure that everyone uses only Java. There are fairly large companies where Scala is one of the main development tools, for example Tinkoff, QIWI…
- It's just that at Sberbank the main tool is Java, it is prescribed everywhere. And you ended up with Scala. How did that happen?
- Actually, no one demanded that I use Java. As head of what was then a department, I had to choose a tool that my people would be interested in using and in growing their competencies with. Scala was chosen for many reasons. As a development language, it is very convenient.
Plus, I was recruiting precisely the people who wanted to develop in this direction: the idea was to create a Scala development community that could solve our problems.
- And why exactly Scala? What do you like about it? How is it good for you as a department head, for example?
- First, it is compatible with Java, and Hadoop and Spark all run on the JVM. This is a serious requirement that greatly narrows the set of options. So it had to be either Java or a JVM-compatible language. There are not many JVM-compatible languages. Clojure, for example, was also interesting to us, but in reality it did not take off, because it has a peculiarity: it is hard to write really large applications in it with a large team. In addition, Scala was chosen as the most dynamically developing language that provides the capabilities we need.
- There are also Groovy and Kotlin. What was wrong with them?
- Kotlin, you could say, did not exist yet back then; it was in its infancy. Groovy did exist, but by its original purpose Groovy is a scripting language, and its applicability to large projects raised questions.
- Did you choose Scala by elimination, because everything else fell away? Or because Scala has some cool features?
- Cool features. I had been watching it for a long time, starting back when I worked at Beeline. At my previous job at Video International I thought about using it, but at that time there were not enough specialists on the labor market. After I came to Sberbank, I realized it was time to start, because a lot of specialists had already appeared.
- Have you thought about holding your own events dedicated to Scala? To carry the word of Scala to the corporate masses inside the company. Or even for the whole of Russia.
- We are thinking about it, and most likely we will do it.
- How can a programmer, say a novice grade 8 developer, grow when he is handed Scala? What is best to learn? Maybe some frameworks ...
- First, we have fairly intensive training on this topic. There are in-person courses and there are online courses. We take people who do not even know Scala but are eager to learn it and have a development background. Online courses can be any, starting with Coursera, and they can be taken at the expense of the company's training budget. The main training, of course, happens in the workplace: completing tasks, reading documentation, and sharing experience, that is, what senior colleagues tell you. You come, you are given a task, and you do it, learning as you go.
- Are there things whose understanding would help a lot in a developer's life? For example, learning some specific technology, a specific part of Hadoop. The whole of Hadoop probably cannot be learned, because it is huge.
- Sure. Experience shows that even if a person thoroughly learns something small, such as how things work inside Spark, he advances greatly both in his competencies in general and in the eyes of his colleagues.
- I remember your presentation in Innopolis. It was about social graphs.
- Yes, that was a year ago. We had the laboratory cluster project, a Big Data pilot project in which management was deciding whether Big Data was mature enough for us to use or whether we had to wait. One of the prototypes was related to the social graph, and it came with difficulties that the whole team had to tackle. It involved working with a very large graph in interactive mode. A prototype was built, and thanks to it the Laboratory Cluster project went ahead and was recognized as successful, and after that what is happening now began. That is the background.
- What is the scale of the task, and what does it consist of?
- From the point of view of technology, the task is simple. There is a social graph: people with their connections in social networks. There are several social networks, entities from different networks have to be matched against each other, and you need to understand which of them are one and the same person.
- What is the business task?
- Very simple: searching for people on social networks to help resolve problem debts.
- Is it just about bad debts, or are there other applications?
- There are many different uses. But when we solved the problem, the client was the department for working with distressed assets. It was a prototype.
- How big is the task? Roughly speaking, do you need to analyze ten people, or the whole of Facebook as a whole?
- No, actually, the whole problem is that the graph is big. There are billions of nodes. A significant part of a large social network.
- Did you connect data across different social networks, or did you stay inside a single network?
- There were several networks, and we tried to match the data from them.
- It's just that if you use something like GraphQL on Facebook, you can run queries yourself for free while staying inside it. Here you had to write your own adapter for each social network and bring everything to one universal representation, right?
- No. Here is how it was done: at the beginning there was one big graph in which each person had several vertices, each vertex corresponding to one social network. We compare them, connect them, and get one person who is present in several networks.
- And what came first: did you take the existing Sberbank database and mine social networks for each of its elements, or were all the social networks analyzed for matches in the abstract?
- Both the social networks and the Sberbank database took part.
- So it "participated" rather than served as the source? It was an equal part of this graph?
- What data was collected?
- Data of public profiles. There was no classified information.
- On the whole, was this prototype successful?
- On the whole, yes. It was recognized as successful: we managed to achieve interactivity, and a user interface was written. When we showed the result and the interface, everyone said that it had turned out great and that we needed to continue. But the continuation has not started yet :-)
- How is this technically implemented?
- A binary graph file was built from the sources, and this binary data was then loaded into RAM through the JVM's Unsafe interface. A JVM array is indexed by an int, so it holds about two billion elements at most, while we had far more values, so we had to put them off-heap. Then we added Akka and developed our own messaging model, similar to Bulk Synchronous Parallel (BSP), which is used in Giraph and HANA. We implemented a distributed version of Dijkstra's algorithm, the one by Crauser, Meyer, Mehlhorn, and Sanders. With it we achieved very good performance. The task looked like this: given point A and point B, find the shortest path between them interactively (that is, quickly enough). We managed to achieve interactivity.
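The int-index limit described above can be illustrated with a minimal Java sketch. It is not the team's code: the interview says they used the JVM's Unsafe interface, whereas this sketch swaps in chunked direct ByteBuffers from the standard library to get a long-indexed array that lives outside the Java heap. The class name `OffHeapLongArray` and the chunk size are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical long-indexed array of 64-bit values stored outside the Java heap. */
final class OffHeapLongArray {
    private final ByteBuffer[] chunks; // each chunk is a direct (off-heap) buffer
    private final int chunkElems;      // number of elements per full chunk
    private final long size;           // total number of elements

    OffHeapLongArray(long size, int chunkElems) {
        if (size < 0 || chunkElems <= 0 || chunkElems > Integer.MAX_VALUE / Long.BYTES) {
            throw new IllegalArgumentException("bad size or chunk size");
        }
        this.size = size;
        this.chunkElems = chunkElems;
        int chunkCount = (int) ((size + chunkElems - 1) / chunkElems);
        chunks = new ByteBuffer[chunkCount];
        for (int i = 0; i < chunkCount; i++) {
            // The last chunk may be shorter than the others.
            long remaining = size - (long) i * chunkElems;
            int elems = (int) Math.min(chunkElems, remaining);
            chunks[i] = ByteBuffer.allocateDirect(elems * Long.BYTES).order(ByteOrder.nativeOrder());
        }
    }

    long size() {
        return size;
    }

    long get(long index) {
        check(index);
        // A long index is split into a chunk number and an int offset inside that chunk.
        return chunks[(int) (index / chunkElems)].getLong((int) (index % chunkElems) * Long.BYTES);
    }

    void set(long index, long value) {
        check(index);
        chunks[(int) (index / chunkElems)].putLong((int) (index % chunkElems) * Long.BYTES, value);
    }

    private void check(long index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
    }

    public static void main(String[] args) {
        // Tiny sizes for demonstration: 10 elements in chunks of 4 (4 + 4 + 2).
        OffHeapLongArray array = new OffHeapLongArray(10, 4);
        array.set(9, 42L);
        System.out.println(array.get(9));
    }
}
```

With a realistic chunk size (for example, 128 million elements per chunk) the same indexing scheme addresses far more than two billion values, limited by available memory rather than by the int index.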
- You usually talk about different algorithms, Dijkstra, Bidirectional Shortest Path, etc. How are they used on this large graph?
- These are all variations on the topic of how to find the shortest distance between two points, and as soon as possible. These are solutions to the same problem.
- And how does this help in finding people who are interconnected?
- To see by what shortest path they are connected. For example, if we see two points on the graph and the distance between them is one, then most likely it is the same person, because there is a single connection. There are many aspects here, and the analysts are very good at this.
- These "spaces" on which the graph is built, are they first and last names, or something else?
- “Spaces”? The graph has vertices and connections.
- OK, what is distance in this graph?
- We have edges (ties), and each has a weight. The shortest path is the set of edges between two vertices that has the smallest total weight.
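The definition of the shortest path given above can be sketched in code. The following is a minimal single-threaded Dijkstra in Java, not the distributed Crauser-Meyer-Mehlhorn-Sanders variant the team implemented. The class and method names, the adjacency-list layout, and the choice of undirected ties with non-negative weights are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedList;
import java.util.List;
import java.util.PriorityQueue;

final class ShortestPathSketch {
    /** A weighted tie to another vertex. */
    static final class Edge {
        final int to;
        final double weight;

        Edge(int to, double weight) {
            this.to = to;
            this.weight = weight;
        }
    }

    /** Builds adjacency lists from {from, to, weight} triples, adding each tie in both directions. */
    static List<List<Edge>> undirected(int vertexCount, double[][] ties) {
        List<List<Edge>> graph = new ArrayList<>();
        for (int i = 0; i < vertexCount; i++) graph.add(new ArrayList<>());
        for (double[] t : ties) {
            graph.get((int) t[0]).add(new Edge((int) t[1], t[2]));
            graph.get((int) t[1]).add(new Edge((int) t[0], t[2]));
        }
        return graph;
    }

    /** Returns the vertices of a minimum-total-weight path, or an empty list if target is unreachable. */
    static List<Integer> shortestPath(List<List<Edge>> graph, int source, int target) {
        double[] dist = new double[graph.size()];
        int[] prev = new int[graph.size()];
        Arrays.fill(dist, Double.POSITIVE_INFINITY);
        Arrays.fill(prev, -1);
        dist[source] = 0.0;
        // Queue entries are {distance, vertex}; outdated entries are skipped when polled.
        PriorityQueue<double[]> queue =
                new PriorityQueue<>(Comparator.comparingDouble((double[] e) -> e[0]));
        queue.add(new double[] {0.0, source});
        while (!queue.isEmpty()) {
            double[] entry = queue.poll();
            int u = (int) entry[1];
            if (entry[0] > dist[u]) continue; // a shorter route to u was already found
            if (u == target) break;           // the target's distance is final
            for (Edge e : graph.get(u)) {
                double candidate = dist[u] + e.weight;
                if (candidate < dist[e.to]) {
                    dist[e.to] = candidate;
                    prev[e.to] = u;
                    queue.add(new double[] {candidate, e.to});
                }
            }
        }
        if (Double.isInfinite(dist[target])) return List.of();
        LinkedList<Integer> path = new LinkedList<>();
        for (int v = target; v != -1; v = prev[v]) path.addFirst(v);
        return path;
    }

    public static void main(String[] args) {
        List<List<Edge>> graph =
                undirected(4, new double[][] {{0, 1, 1.0}, {1, 2, 1.0}, {0, 2, 5.0}});
        // The route through vertex 1 (total weight 2.0) beats the direct tie (weight 5.0).
        System.out.println(shortestPath(graph, 0, 2));
    }
}
```

The method returns the whole path rather than only its length, which matches the point made in the interview that the analysts need to look at the path itself.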
- How is the weight of an edge determined? What does it consist of?
- This is determined on the basis of, for example, how reliable this connection is for us, from what source we received it. There may be several weights, and you need to look depending on the characteristics.
- In the end, these weights are reduced to one number, or remain in the form of a vector?
- No, we search for a path and then examine it. We need the whole path. If the task requires collapsing the weights into one number, that can be done.
- The word "interactive" has come up several times. Can you explain what it means?
- It is when a person sits in front of a computer, sets a task, and expects an answer within a time acceptable to him: a delay measured in seconds, a minute at most. Clearly this differs from how Hadoop usually works, with batch jobs, large batches that can run for hours or days. It is important to understand that here we had to move away from the main Hadoop paradigm.
- But does Hadoop still answer the requests?
- No, Spark is used here only for data preparation. It produces a binary file that is loaded into RAM. What runs after that can in principle also be considered Big Data, but it does not use Hadoop; the main thing used there is Akka.
- Hadoop works as an Internet crawler, do I understand correctly?
- On the whole, yes. It prepares, collects, and processes the data ...
- Pulls from the Internet and puts in that structure?
- Let's just say that our Hadoop does not directly pull them from the Internet. It was a difficult multi-step combination, there were other automated systems involved, I can’t tell you the exact picture now.
- Who did the crawling?
- A contractor. We did not crawl anything ourselves. That is not really within our competence. The question was raised, but in the end we decided to focus on our own specialty: Big Data processing.
- Did any interesting problems come up while implementing all this?
- Sure, they did. At first we simply took a graph database from one large vendor "out of the box", and it turned out that the waiting time there was hours.
- Waiting time in which scenario?
- Say an employee of the department for working with distressed assets, who is searching for affiliated companies, sees two points on the graph and wants to check the connections between them. He sets the task of finding the shortest path, and it runs for several hours. Of course, this is no longer interactive. This approach was rejected, and after going through several solutions we realized that we had to write our own. So we sat down and wrote it. The JVM solution was written second. The prototype was written in Rust.
- Why did you abandon Rust?
- First, Hadoop is not written in Rust. And you cannot say that we abandoned it completely, because we may still write in it in the future. Writing Hadoop applications in Rust is inconvenient, to put it mildly, because it is not a JVM language. In this case Rust was used simply because our colleagues who wrote the prototype were very good at Rust.
- Still, you need to get data out of the Hadoop stack and hand it over to Rust. How is the interop organized?
- The data is prepared by Spark. The Spark application, of course, was not written in Rust. It was written in Scala. The prepared data is then loaded into RAM.
- That is, they communicated with each other using a file?
- Yes. Spark generated a very large binary file, and then the application, first written in Rust and later rewritten for the JVM in Scala with Akka, picked up this file, loaded it, and worked on it.
- When switching from Rust to Scala and JVM, did the execution speed change?
- It changed, but only slightly. It is impossible to say with full certainty, because the application originally written in Rust did not exist for very long. It was a proof of concept, to make sure that this approach works in principle. Nobody really benchmarked it against the finished Scala application. It became clear that if we were to build a production solution for this problem, we would still do it in Scala, because we do not have people with Rust experience. Even then there was a question of how available such people are on the labor market. That question remains today: the language is developing very dynamically, but it is still not clear how available this competence is on the market. So we simply agreed that, since we needed to test the idea quickly, we would do it in Rust and later redo it so that other people could come and maintain it, in line with the accepted standards for development tools.
- You said that you might use it again in the future. What did you like about it so much that you single it out among all these experiments?
- It is a continuation of the same branch that C and C++ sit on: systems development. But it is free of most of the significant problems that C and C++ have. It also lacks a number of significant problems of its more modern competitors, such as Golang. It has better performance, there is no overhead from a GC, and there are language abstractions that allow large applications to be built effectively. If you suddenly need to do low-level systems development, it will most likely be in Rust.
- Imagine that there were as many Rust specialists as Scala specialists. Rust vs. Scala: which is better, and when?
- The question does not arise, because there is Hadoop. The main thing for us, after all, is not the development tool but the platform. If Hadoop and Spark are rewritten in Rust, then such a question will indeed come up. But so far this has not happened, so there is no question.
- As a result of the project, which we are now talking about, what happened except for the prototype? There was some kind of FastGraph. What is this, a few words?
- Yes, there was an application that quickly searched for the shortest path, and a user interface was written (it was called, I think, "the workplace of a distressed assets researcher"; you can ask Vadim Surpin for details). It was delivered as part of the laboratory cluster project. The decision to build a large production cluster has already been made. Management decided that the Big Data line had matured and needed to be launched.
The prototype under discussion has not yet been developed further. There were attempts to launch it somewhere, but for various reasons they did not take off. As far as I understand, the main official reason is other priorities, in favor of more urgent projects, which we are working on now.
- What interesting technical conclusions and solutions did you learn from this whole story?
- We realized that the choice of the stack (including Scala as a tool) was absolutely correct. This tool is mature enough for production development, even in a company as serious and large as Sberbank. The project we have been discussing is waiting in the wings: when the opportunity arises, we will continue developing it right away. In short, we realized that our approach is correct.
- It was such a business level of conclusions. And the technical level? You wrote a system that uses the Dijkstra algorithm on social graphs, and does it really work?
- Yes. Clearly, Dijkstra can be implemented in different ways, but the approach we chose works quite effectively on large graphs.
- And what kind of approach?
- The application architecture, for example. It could have been done differently. We are still asked why we did not use GraphX. Our answer is that GraphX is a batch system, which in general does not give an interactive result. We could have tried to make it work in interactive mode. There were many options, but we chose this one, and it worked.
We also designed for distribution from the start, for very, very large graphs. That question has been postponed for now: the current system runs successfully on a single JVM, but the architecture allows several JVMs to be used with modest changes that do not require a radical revision of the design.
- You brought Akka in. Why is Akka needed, and how has it proven itself?
- It is there just for messaging, to support our message model. So far we have found no problems with it.
- Who exchanges messages with whom?
- The architecture is such that there are workers, each serving some subset of the graph's vertices. The workers exchange messages with each other to carry out distributed computations, for example the Crauser-Meyer-Mehlhorn-Sanders (CMMS) algorithm we talked about. All the cores of the machine it runs on are used.
- Distribution at what level? At the machine level, at the cluster level?
- For now, at the machine level: it is parallelized inside one machine. By design, at the cluster level. We just have not had time to do that yet: one project has ended and the next has not begun.
- Distribution within one machine means that you have pinned sixty-four processes to sixty-four cores, or ...
- No, the process is one, but it uses all the cores. It is multithreaded.
- That is, you use Java threads. Or are you using Akka, and it does everything automatically?
- Of course, we ourselves do not manage threads, this is all happening at the Akka level.
- So I see in your presentations the term “conical messaging model”, what is it?
- Yes, it is that same messaging model. It is a kind of formalization that says how workers should exchange messages in order to organize parallel computation. It is an analogue of Bulk Synchronous Parallel (BSP), a model used in Giraph and HANA.
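The idea of workers that each own a subset of vertices and exchange messages in synchronized rounds can be sketched as follows. This is a generic BSP-style illustration in plain Java, not the team's own messaging model and not Akka: the workers are simulated one after another in a single thread, vertices are assigned to workers by `vertex % workers`, and every name in it is hypothetical. It computes single-source shortest distances by relaxing edges round by round, assuming directed edges with non-negative integer weights.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SuperstepSketch {
    /** Creates one empty message list per worker. */
    private static List<List<long[]>> emptyBoxes(int workers) {
        List<List<long[]>> boxes = new ArrayList<>();
        for (int w = 0; w < workers; w++) boxes.add(new ArrayList<>());
        return boxes;
    }

    /**
     * edges[v] lists the {neighbour, weight} pairs of vertex v.
     * Returns the distance from source to every vertex (Long.MAX_VALUE if unreachable).
     */
    static long[] distances(int[][][] edges, int source, int workers) {
        long[] dist = new long[edges.length];
        Arrays.fill(dist, Long.MAX_VALUE);
        // A message is {vertex, proposedDistance}, delivered to the worker that owns the vertex.
        List<List<long[]>> inbox = emptyBoxes(workers);
        inbox.get(source % workers).add(new long[] {source, 0L});
        boolean pending = true;
        while (pending) {                          // one loop iteration = one superstep
            List<List<long[]>> outbox = emptyBoxes(workers);
            pending = false;
            for (int w = 0; w < workers; w++) {    // local computation of worker w
                for (long[] message : inbox.get(w)) {
                    int v = (int) message[0];
                    long proposed = message[1];
                    if (proposed >= dist[v]) continue; // no improvement, nothing to forward
                    dist[v] = proposed;
                    for (int[] edge : edges[v]) {
                        // Messages sent now become visible only in the next superstep.
                        outbox.get(edge[0] % workers).add(new long[] {edge[0], proposed + edge[1]});
                        pending = true;
                    }
                }
            }
            inbox = outbox;                        // barrier: all workers switch rounds together
        }
        return dist;
    }

    public static void main(String[] args) {
        // Vertex 4 has no incoming edges, so it stays unreachable.
        int[][][] edges = { {{1, 1}, {2, 5}}, {{2, 1}}, {{3, 2}}, {}, {} };
        System.out.println(Arrays.toString(distances(edges, 0, 2)));
    }
}
```

Because a worker only ever updates the vertices it owns and only sees messages from the previous round, the per-worker loop could run on separate threads or actors without locking; the sequential loop here just keeps the sketch deterministic.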
- Is it a clone of BSP?
- In a sense you could consider it a clone, but in fact we wrote everything from scratch and only glanced at BSP. Later our specialists compared the two and concluded that they had turned out to be different things. It would probably be interesting to implement BSP on our platform and see how it performs, but for the same reasons we have not gotten around to it yet.
- You said something about PageRank and MPI. How do they relate to the project?
- These were experimental things that we did on an optional basis to ensure the completeness of the features in the prototype, in order to show this to someone in the future.
- Got it. We need to wrap up, so the last question. For the people reading us on Habr, do you have any parting words or wishes? Maybe you need people to write Scala on your team? Something like that.
- Yes, we always need smart Scala developers, and there are many interesting projects for gaining experience with Big Data and ML. For example, running machine learning models in production. But that is a topic for another discussion!
- Since we have touched on hiring: describe the profile of the developer you would like to see on your team.
- A developer who knows Scala, or someone who wants to learn it and already has development experience. Some knowledge of the Big Data tools, Hadoop and Spark, is desirable. We also need specialists who are ready to work on user interface tasks: we use Scala.js.
- Dedicated specialized mathematicians, data scientists - do you need them?
- Yes, of course. Some interesting products related to machine learning are starting now, in particular "News Monitoring", which picks out news about organizations of interest to us. Machine learning specialists would be very useful to us.
- Thanks, Valery! It was a very eventful interview, which I hope our readers will enjoy. I hope to meet you again at our conferences, for example at JPoint / JBreak / Joker or SmartData. Come with reports!