Overview of the second day of Data Science Weekend 2018. Data Engineering, ETL, search services and much more

A few days ago, we published a review of the first day of Data Science Weekend 2018 , which took place on March 2-3 at the Attic of Rambler & Co. Having studied the practice of using machine learning algorithms, we now turn to a review of the second day of the conference, during which the speakers talked about using various date engineer tools for the needs of data platforms, ETL, search hints services and much more.


The second day of DSWknd2018 was opened by Yuri Babak, the development manager of the machine learning module for the Apache Ignite platform at GridGain, telling about how the company managed to optimize distributed machine learning on the cluster.

Usually, when talking about optimizing the machine learning process, they primarily mean optimizing the training itself. We decided to go from the other side and solve the problem that data is usually stored separately from the system where it is processed and trained.

Therefore, we decided to focus on the long ETL problem (it makes no sense to compete in machine learning performance with Tensorflow), as a result of which we were able to train distributed models on the entire cluster that we have. The result was a new approach in the field of ML, using not a parallel, but a distributed learning system.

We succeeded in achieving this goal due to the fact that we developed a new machine learning module in Apache Ignite, which relies on the functionality that is already in it (streaming, native persistence, and much more). I would like to tell you about two features of the resulting module: distributed key-value storage and collocated computing.

  • Key-value storage. First, we needed distributed replicated caches, which in general have the following structure: we have a cache with some data, it is “spread out” across the cluster, and Apache Ignite takes care of how to balance the data so that it is even distributed across the cluster. The diagram below shows the situation when in each partition we have 3 copies of the data, and even if the cluster collapses and only 4 of the 4 nodes remain, we still have all the data intact.

  • Collocated computing. This is essentially our implementation of the MapReduce concept, when we do not try to aggregate data in one place and make calculations, but, on the contrary, “deliver” the calculations to our data, thereby minimizing the load on the cluster. The scheme is as follows: at the Map step, we distribute our task only to those nodes where there is the necessary data, then it is performed locally on each specific node, so that at the Reduce step we can aggregate the results and send them somewhere further.

You can find more information about Apache Ignite here , here and here .


After that, Alexey Kuznetsov and Mikhail Setkin, graduates of our Data Engineer and Big Data Specialist programs , shared their experience in building a Real-Time Decision Platform (RTDP) based on the Hortonworks Data Platform and Data Flow .

Any organization has data that characterizes events that can be displayed on the timeline, and somehow use them to make real-time decisions. For example, you can imagine a scenario where a bank client has refused a card payment 2 times, then he called the call center, went to the Internet bank, left a request, etc. Of course, we would like to receive some kind of signals based on these events in order to promptly offer the client some services, special offers or help with his problem.

Each RTDM system is based on streaming. , which works in such a way that there are sources from which we can take events and put them in the data bus, and then pick them up from there, and we cannot connect events from different sources and somehow aggregate them.

The next level of abstraction over streaming is Complex Events Processing (CEP), which differs from simple streaming in that there are many sources that we try to process at the same time, that is, we see events together, join-s of these events appear, we can somehow aggregate them on the fly, etc.

The last element of abstraction is the RTDM system itself, the main difference between which and CEP is that it has a set of preconfigured decisions and actions that can be taken online: a call from a call center in case of requesting a consultation in the Internet bank, SMS from a special offer in case of replenishment of an account for a large amount and other actions.

How to implement this system? You can go to the vendors and in 99% of cases this is standard practice in the field. On the other hand, we have a team of data engineers who can do everything herself using open-source solutions. The main drawback of most of them is the lack of a user interface.

However, we managed to find a platform suitable for us - the choice fell on HortonWorks Data Flow 3.0, in the new version of which exactly what we needed appeared - Streaming Analytics Manager (SAM), where the graphical interface was implemented, and given that HortonWorks was already in our production, we took the path of least resistance.

Let's move on to the architecture of our RTDM solution. Data from the sources goes to the data bus, where it is aggregated, and then, with the help of HDF, it is collected and stored in Kafka. Then SAM enters, where, using the user interface, the user launches the campaign for execution, the JAR file is immediately compiled and sent to Apache Storm.

The central element of the whole system is SAM, thanks to which all this has become possible. Here's what its interface looks like:
SAM provides the user with great opportunities: there is a choice of data sources, a set of processors with which he processes the data stream, a group of filters, brancheting, the choice of an action for a specific client, whether it is an absolutely personalized offer or a general action for a specific group of people .


Our Data Science Weekend was in full swing and next in line was Igor Mosyagin, another graduate of the Data Engineer program and R&D developer at Lamoda. Igor talked about how they optimized search suggestions on the site and tried to make friends with Apache Solr , Golang and Airflow .

On our site, like on many others, there is a search field where people enter something and prompts appear along the way with which we try to predict what the user needs. At that time, we already had some ready-made solution, but it did not quite suit us. For example, the system did not respond if the user confused the layout.

Apache Solr is the center of the entire resulting system, we used it in the old solution, but now we decided to implement Airflow, which I learned about on the Data Engineer program. As a result, the request that comes from the user gets into our service written in Go, which preprocesses it and sends it to Solr, and then receives a response and returns it to the user. In this case, regularly, at some predetermined time, Airflow is launched, which crawls into the database and, if necessary, starts importing data into Solr. The main thing in all of this is the fact that 50 ms pass from the user's request to the response, of which the lion's share - 40 ms - is a request to Solr and receiving a response from it.

Generally speaking, Apache Solr is such a big “colossus” with good documentation, it has a lot of sugasters who work according to different logic: they can return an answer only by exact coincidence, or there are options when finding a line far from the beginning of a word and t .d. In total there are 7 or 8 options for sugesters, but we used only 3, since the rest in most cases worked out very slowly.

It is also important to update weights from time to time, as request frequencies change. For example, if this month one of the most popular queries is winter boots, then, of course, this must be taken into account.

Aligned Research Group

Next came the turn of Nikolai Markov, who is a Senior Data Engineer at Aligned Research Group, and also lectures on our Big Data Specialist and Data Engineer programs . Nikolay spoke about the advantages and disadvantages of the Hadoop ecosystem and why command-line data analysis and processing can be a good alternative.

If you look at Hadoop from a modern engineering point of view, you can find not only many advantages, but also a number of disadvantages. For example, MapReduce is a general-purpose paradigm. In fact, it all comes down to the fact that you spend a lot of time on transferring your algorithm to MapReduce, so that something is counted there, while possibly making a lot of errors in the process. Sometimes you need to pack a lot of MapReduce to do another thing, and it turns out that instead of writing business logic, you spend time on MapReduce.

The benefit of Hadoop, of course, is its support for Python. It is good for everyone, I write on it and recommend it to everyone, but the problem is that when we write in Python under Hadoop, we need a lot of engineering support to make it work in production: all analytic packages (Pandas, Numpy, etc.) .d.) should stand on final nodes, all this should be developed automatically. As a result, it turns out that we either adapt to a specific vendor that allows us to put our versions there, or we need a configuration management system that will be engaged in deployment.

Naturally, one of the main disadvantages of Hadoop and at the same time the main reason for the appearance of Spark is that Hadoop always writes the results to the disk, reads from it and writes there. In fact, even if we scatter this process into many nodes, it will still work at the disk speed (average).

You can solve the speed problem, of course, by scaling Hadoop, that is, simply "throw money". However, there are more effective alternatives. One of them is the analysis and processing of data on the command line . It is really possible to solve serious analytical problems in it, and it will be several times faster than on Hadoop. The only negative is presented below:

Кому-то это может показаться нечитабельным, однако эта штука работает на порядок быстрее, чем скрипт на python , поэтому отлично, если у вас есть инженер, способный писать такие вещи в командной строке.

Also, I don’t quite understand why in companies your business logic must be tied to a relational database, because nothing prevents you from taking in some modern form some kind of non-relational database (the same MongoDB). It’s not worth it to justify the fact that you have a bunch of join-there and it is impossible to do without SQL. There are a lot of databases today, and you can choose for yourself which one is closer to you.

If you still can’t do without SQL, you can try Presto - It is an extensible engine for distributed work with many data sources at once. That is, you can write a plugin for your data source and essentially retrieve anything you like using SQL. In principle, Hive has the same logic, but it is tied to the Hadoop infrastructure, and Presto is an independent development. Integration with Apache Zeppelin is a plus - it is such a beautiful front-end where you can write SQL queries and immediately receive graphs.

Rambler & Co

To finish our productive weekend, it was an honor for Alexander Shorin, a teacher at the Data Engineer program and a senior Python development engineer at Rambler & Co, who continued the story of his colleagues from the previous day of the conference about how they recognized the gender and age of movie theater visitors using computer vision. This time the main attention was paid to the engineering part of the project.

The source pipeline is as follows:
The cameras transfer photos to WebDAV, then the task pulls out new photos from Airflow and sends them to the API, which forms all this into separate tasks, and then uploads it to RabbitMQ. Workers take these tasks from the “rabbit”, make some transformations with them, and send the results back.

How can we scale this process from a technical point of view? How many more cars do we need to cope with the whole flow of photos? To answer this question, take a profiler. We decided to take PyFlame , made by Uber, which is actually a wrapper over Ptrace that hooks to a process in Linux, looks at what it is doing, and writes down what and how many times it happened.

We launched a test dataset consisting of 472 photos, and it miscalculated in 293 seconds. Is it a lot or a little? The PyFlame report looks like this in the following beautiful way:

Here we see a “gorge” of loading models, there is a “valley” of segmentation and other interesting things. This report shows that our code slows down because a huge bar in the center of the image refers to it.

In fact, it turned out that we needed to change only one line in the Jupyter laptop in order to optimize segmentation: the duration of the process fell from 293 to 223 seconds. We also switched from PIL, which only the lazy did not scold for slowness, to OpenCV, due to which the total operating time was reduced by another 20 seconds. Finally, the use of Pillow-SIMD, which was described by Alexander Karpinsky in his speech at the Piter Py # 4 conference, for image processing, reduced the task execution time to 183 seconds. True, this affected PyFlame slightly:

As you can see, PyTorch is still standing out here , so we will “kick” it. what can we do with him? When PyTorch sends data to a video card, it first goes through preprocessing, and then throws it into the DataLoader .

Having studied the principles of DataLoader, we saw that it raises workers, processes the dataset, and then kills the workers. The question arises: why do we need to constantly raise and kill workers, if we have few people in the cinema, and photo processing takes about a second? Why raise and kill two processes every second if this is ineffective?

It was possible to optimize the DataLoader due to its modification: now it does not kill workers and does not use them if there are less than 24 people in the room (the number is taken more or less from the ceiling). At the same time, such optimization did not give a significant increase in processing speed, however, the average CPU utilization decreased from ~ 600% to ~ 200%, that is, 3 times.

Finally, other improvements include facilitating the implementation of Conv2d , removing extra lambdas from the neural network model, and converting Image to np.array for ToTensor .

Finally, some more feedback from our speakers and listeners about our conference:
“A very comfortable party, where you can on the sidelines and talk about the industry as a whole and catch speakers with questions. As the speaker I note the professionalism of the organizers, it is clear that everything is done so that both the speakers and the listeners are as comfortable as possible. ” - Igor Mosyagin, speaker, R&D developer, Lamoda.

“I really enjoyed performing. Friendly audience, clever questions. ” - Mikhail Setkin, speaker, Product Manager, Raiffeisenbank.

You can watch the full speeches of all the speakers on our Facebook page . See you soon at our other events!