Effective data compression methods for training neural networks. A lecture at Yandex

Not long ago Gennady Pekhimenko, a professor at the University of Toronto who earned his PhD at Carnegie Mellon University, visited Yandex. He gave a lecture on coding algorithms that can work around GPU memory limits when training deep neural networks.

I am a member of several groups at the University of Toronto. One of them is the Computer Systems and Networking Group; there is also my own group, the EcoSystem Group. As the names suggest, I am not a machine learning specialist as such. But neural networks are very popular now, and those of us who work on computer architecture, networks, and computer systems have to deal with these applications on an ongoing basis. So for the last year and a half to two years I have also been closely engaged with this topic.

I will tell you how to compress data properly in memory and in the processor cache. That was the topic of my doctoral dissertation in America, at Carnegie Mellon. It will also help explain what problems we ran into when we wanted to apply similar mechanisms to other applications, such as neural networks.

One of the main problems in computer architecture is that we want high-performance systems, whether graphics cards, coprocessors, phones, or laptops, with good energy efficiency.

At the moment it is easy to get fairly high performance if you place no limits on resources. You can build an exaflop computer if necessary; the question is how much electricity it will consume. So one of the main challenges is to get good performance from the available resources without upsetting the balance of energy efficiency.

One of the main obstacles on the path to energy efficiency is that many key applications used in cloud computing and on mobile devices have serious data costs, both for moving data and for storing it: modern databases, graphics workloads, and of course machine learning. All of this demands very serious resources at every level of the stack, from the core down to network resources.

Here is one of the key difficulties in carrying out such optimizations: you can, in effect, trade one type of resource for another. In computer architecture, computation itself (addition, multiplication, arithmetic operations) used to be very expensive. That has changed in recent years, because processor cores improved much faster than memory access speed.

As a result, one integer addition costs you about 1 picojoule of energy, while one floating-point operation costs about 20 picojoules. But if you want to read 4 or 8 bytes from memory, it will cost you at least two orders of magnitude more energy. That is a significant problem: any work with memory is quite expensive, and it does not matter which device we are talking about. The situation is the same for mobile phones and for large clusters and supercomputers.

It follows that many resources even in a current mobile phone cannot be fully used because of energy constraints. Take a modern phone, Android or iPhone, it does not matter: we can use at most about 50% of the peak bandwidth between memory and the core. If we exceed that, the phone would overheat, or rather no one will let it overheat: the bus frequency between memory and core will be throttled, and performance will drop as well.

Many resources are now impossible to use unless tricky optimizations are applied.

One way to deal with scarce resources at different levels is data compression. It is not a new optimization; it has been applied successfully to networks and to storage, and we have all used the utilities. On Linux, say, many have used gzip or bzip2. All of these have been applied very successfully at that level, usually with algorithms based on Huffman coding or Lempel-Ziv.

All of these algorithms usually require a large dictionary, and the problem is that they are highly sequential and do not map well onto modern, highly parallel hardware. If you look at existing hardware, compression at the level of memory, cache, or processor was practically unused, at least until some of our first work five years ago.

I will explain why this happened and what can be done to make compression available at various levels, starting with compression directly in the cache. Cache compression means that compression happens directly in hardware: part of the logic of the processor cache itself changes. I will briefly describe what benefits this brings. Then I will talk about compression in memory and the problems there. It may seem like the same thing, but implementing compression effectively in memory requires completely different techniques than in the cache. I will also talk about our work with Nvidia, where we did bandwidth compression in real hardware for modern GPUs; the optimization we built shipped in the latest generation of GPUs, Volta. Finally, I will describe a completely radical optimization, where execution happens directly on compressed data with no decompression at all.

A few words about cache compression. This was a paper at the PACT conference in 2012, done jointly with Intel. To make the main problem clear: suppose you want to turn your 2 MB or 4 MB processor cache into an effective 4 MB or 8 MB by compressing its contents. What's the problem?

Roughly speaking, when you have a memory access operation (a load or store, if we are talking about the x86 architecture) and the data is not in registers, we go to the first-level cache. That usually takes three or four clock cycles on a modern processor. If the data is there, it goes back to the CPU. If not, the request goes further down the hierarchy and reaches the L2 cache.

The L2 cache on most Intel processors takes 15-20 clock cycles to access, depending on its size. If the data is found in L2 and you did not have to go to memory, it goes back to the processor and is also saved in the L1 cache, in case you keep reusing it and want it closer to the processor.

The problem is that if the data is compressed, then no matter how you optimize the compression process, decompression is always on the critical path. If access to the second-level cache used to take 15 cycles, any decompression delay is added to the latency of your request. And this limitation holds for almost all applications of compression: both in memory and in real applications such as neural network training, decompression is always on the critical path, and its latency is critical.

What does this mean for us? If cache latency is on the order of 15 cycles, decompression must be extremely optimized; we can afford only a few processor cycles. To understand how little that is: a single addition takes about two cycles. So you cannot do anything overly complicated.

Largely because of this, Intel at some point halted its work on cache compression. They had a whole group working on it; in 2005-2006 they developed an algorithm whose decompression took about 5 cycles. That increased cache latency by about 30%, but made the cache effectively almost twice as large. Their designers looked at most applications, however, and decided it was too expensive.

When I started working on this topic in 2011, they said: if you can do it in 1-2 cycles, then it is worth trying in real hardware.

I tried different algorithms, and one of the reasons the algorithms already in the literature could not be used is that they were all originally designed in software. Software has different constraints; people use large dictionaries and the like. If you implement these techniques in real hardware, they are quite slow. IBM built the Lempel-Ziv algorithm, exactly as in gzip, entirely in hardware, and decompression took 64 cycles. Clearly you cannot use that in a cache, only perhaps in memory.

I tried to change the strategy. Instead of trying to optimize software algorithms, I decided to look at what data is actually stored in the cache and try to make an algorithm that will work well for this data.

I saw that, paradoxically, there are many zeros: from 20% to 30%. If you take a large suite of applications from Intel, some 200 different applications people use for computing, there are a lot of zeros. They come from initialization, from matrices full of zeros, from null pointers. There are many reasons.

Duplicate values are also very common: in a small piece of memory, in the cache, very similar values are often repeated. For example, if you are working with graphics, you have a bunch of pixels, and if part of the picture has the same color, a whole run of pixels will be identical. In addition, there are narrow values: single-byte and two-byte values stored in 2, 4, or 8 bytes. Why does this happen, and whose fault is it? Where does such redundancy come from?
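To make these patterns concrete, here is a small software sketch (the function name and word size are illustrative; the real detection is hardware logic, not Python) that classifies a cache line by the simple patterns just described:

```python
import struct

def classify_line(line: bytes, word: int = 4) -> str:
    """Classify a cache line by the simple redundancy patterns described
    above. A hypothetical software sketch; real detection is hardware logic."""
    words = [line[i:i + word] for i in range(0, len(line), word)]
    zero = b"\x00" * word
    if all(w == zero for w in words):
        return "zeros"                      # the whole line is zero
    if all(w == words[0] for w in words):
        return "repeated"                   # one value duplicated across the line
    if all(w[1:] == zero[1:] for w in words):
        return "narrow"                     # wide words holding small values (little-endian)
    return "other"

line = struct.pack("<8I", *[7] * 8)  # eight identical 32-bit words
print(classify_line(line))           # repeated
```

A line of all zeros, a line of one repeated value, and a line of small integers stored in 4-byte slots each fall into one of the three cheap-to-detect buckets.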

The redundancy is related to how we write code. We use some language, for example C++. Suppose we want to allocate memory for an object, say an integer array, and imagine we store in it statistics about some events. These events can be very frequent, say billions of occurrences, or singular, for example memory accesses by a particular instruction. Most instructions are hardly ever executed, but some can execute billions of times during a run.

The programmer is forced to allocate an array of eight-byte numbers, because in the worst case the integer values can be large. But this is redundant: most of that width is not really needed, and many entries will be mostly zeros, with long runs of leading zeros in front.

In addition, many other values exhibit other kinds of redundancy. Pointers, for example. Anyone who has debugged code and looked at pointers will notice that they are quite large values. But if your pointers refer to roughly the same region of memory, most of their leading bits will be identical. This type of redundancy shows up as well.

I saw many types of redundancy. The first question is: how common are they?

Here is an experiment in which I periodically sampled the second-level cache, saved a snapshot of it, and looked at how many zeros and repeated values were there. On the X axis are various applications from the SPEC2006 suite, which is actively used in computer architecture, as well as other applications from Intel: databases and various web workloads, such as the Apache server. The experiment assumes a 2 MB L2 cache.

You may notice great variability in redundancy across applications, but even these very simple patterns are quite common: they alone cover 43% of all cache lines, of all the data stored in the cache.

You can come up with something simple enough that will cover these and maybe some other patterns and give us good compression performance.

However, the question is what makes these patterns related? Will I have to make some kind of compression algorithm that will work specifically for each of these patterns or can I do something in common?

The common thread in these observations is this: whether the values are large or small, they differ from one another very little. Roughly speaking, the dynamic range of values within each particular cache line is small. So the values stored in, say, a 32-byte cache line can be represented simply using Base + Delta encoding.

Take, for example, the first value as the base, and represent all the others as offsets from that base. Since in most cases the values differ little from one another, each delta fits in one byte, and instead of 32 bytes we can manage with only 12, saving about 20 bytes of space.
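The encoding in this example can be sketched in a few lines of Python (an illustration of the idea only, not the hardware implementation; the function name and the sample values are made up):

```python
def bdi_compress(values, delta_bytes=1):
    """Base + Delta sketch: the first value becomes the base, the rest
    become small signed offsets from it. Returns None when a delta does
    not fit, i.e. this line is incompressible with the chosen delta size."""
    base = values[0]
    deltas = [v - base for v in values]
    limit = 1 << (8 * delta_bytes - 1)       # signed range of one delta
    if any(not -limit <= d < limit for d in deltas):
        return None
    return base, deltas

# eight 4-byte values that differ only slightly: 4 bytes of base
# plus 8 one-byte deltas = 12 bytes instead of 32
line = [0xC04039C0 + 8 * i for i in range(8)]
assert bdi_compress(line) == (0xC04039C0, [0, 8, 16, 24, 32, 40, 48, 56])
```

The `None` case matters: a hardware scheme must detect incompressible lines quickly and store them uncompressed.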

I will not go into all the details of implementing this in real hardware. We built a real prototype, wrote it in Verilog, ran it on modern FPGAs, and discussed implementation with Intel. It is possible to build an algorithm based on this idea that needs only one or two clock cycles for decompression. The algorithm is practical, and it compresses well.

The best previous schemes used in caches gave about 50% additional space. This is not the raw compression ratio, which can be much higher, but the real benefit of effective compression: how much larger the cache appears to the user. There are various problems with fragmentation and so on that have to be addressed.

Our compression is on par with the best previous mechanisms Intel had, but the main win, in the middle of the slide, is decompression. The best previous algorithms needed 5-9 cycles for decompression; we managed it in 1-2 cycles, while compression remained quite effective.

Algorithms of this kind can be built in real hardware and used in the cache or, for example, in memory.

Applying such an optimization in the cache often makes it look almost twice as large to the user. What does that mean? If you look at a photo of a modern processor die, the cores themselves take up surprisingly little of it. Processor caches easily occupy 40-50% of the area, at both IBM and Intel. Intel cannot simply double the cache; there is no room for more. So optimizations like this, which cost only a few percent of the core area, are of course very interesting.

In a second piece of work, which I will not cover today, we dealt with various follow-up problems, such as how to manage caches where cache lines can now have different sizes. All of these problems were successfully resolved.

I want to talk about our third work, which was also done with Intel, about how to compress memory.

What are the problems there?

The main problem is this: if I have a 4 KB memory page on Linux or Windows and I want to compress it, I have to solve the problem of how data addresses within that page change. Initially you have 4 KB, and each cache line in it is 64 bytes, so finding the offset of any cache line within the page is trivial: multiply the line index by 64.

However, if you apply compression, each line is now of a potentially different size. And if you now need to read some page from the memory in the cache, you do not have these offsets.

You might say: just store the offsets somewhere. But where? Either in memory again, or in the cache. If you want to store offsets for every memory page, no cache is big enough; you need on the order of hundreds of MB of metadata to cover all of memory. You cannot keep that much on chip, but you do not want to keep it in memory either, because then every memory access turns into several: first you fetch the metadata, then the real data.

The second problem, which everyone who has worked with an OS has encountered, is data fragmentation. Here it becomes extremely awkward, because each page now also occupies a different amount of memory. In the virtual address space all pages are still 4 KB, but after compression they occupy quite different sizes. The question is how to make use of the freed space. The OS does not know we have made the pages smaller; it simply does not see these fragmented pieces. Without changing anything else, we get no benefit from compression.

What did we propose to solve this? Compression with a linear coefficient. We imposed a set of restrictions, described in detail in the paper, but the essence is this: when compressing a page, I use an algorithm that guarantees every cache line on the page is either compressed with one fixed ratio, say 4:1 or 3:1, or not compressed at all.

We potentially lose something in terms of data compression, because we impose additional restrictions, but the main bonus is that the design is very simple, it can be implemented, which we have successfully done, for example, in Linux.

Linearly Compressed Pages (LCP), the technique we proposed, solves the main problem: the addresses of all the data are now simple to compute. A small metadata block is stored on the page itself, alongside the compressed data and a so-called exception storage, which holds the cache lines we could not compress at the chosen ratio.
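The address-computation payoff is easy to see in a sketch (illustrative only; the metadata size and layout here are assumptions, not the paper's exact format):

```python
def lcp_offset(index, compressed_line_size, metadata_size=64):
    """Linearly Compressed Pages sketch: because every line on the page is
    compressed with the same fixed ratio, a line's offset is again a single
    multiply, just as with an uncompressed page. Sizes here are illustrative."""
    return metadata_size + index * compressed_line_size

# uncompressed page: line 10 lives at 10 * 64 = 640
# page compressed 4:1 (64 -> 16 bytes per line): line 10 lives at 64 + 10 * 16
print(lcp_offset(10, 16))  # 224
```

Because the line size is uniform per page, the hardware needs no per-line offset table; lines that do not fit the ratio go to the exception storage instead.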

Based on these ideas, we achieved good data compression, and most importantly, it compares well with the best previous work, done mainly by IBM, who also had good compression algorithms.

Our compression ratio was not much better than theirs, but crucially, we got more memory without paying for it in performance. Roughly speaking, they got more memory but suffered a loss in performance, small but real, and a loss in energy efficiency. We managed to come out ahead on everything: more memory, better performance, and lower energy costs.

Briefly about the last piece of work that is in real hardware: our work with Nvidia. The problem was energy efficiency when transporting data over the channels from memory to the chip. In graphics cards the memory itself is much smaller, but its bandwidth is much higher, easily 5-6 times that of CPU memory on most cards, because many graphics applications need enormous bandwidth to process large volumes of data efficiently.

The problem Nvidia faced when I interned there in 2014 was this. They had implemented compression in real hardware, and it gave good gains on almost all applications. But on some applications a strange effect appeared: a mechanism called DVFS kicked in. It noticed that transporting data was consuming an enormous amount of energy, and a controller on the communication channel, seeing power consumption exceed acceptable thresholds, reduced the data-transfer frequency.

When the frequency drops, the wires connecting the memory to the chip cool down. The problem is that when the frequency drops, your transfer rate drops too, so compression ultimately hurt performance. It was clear this was somehow connected with compression, but it did not always happen, and it was unclear how to deal with it. With this problem they came to the Nvidia research department, to one of my mentors, with whom we investigated it.

To understand this problem you need not only the low-level mechanism but also the physics of the process. A simple example. Imagine you sent, say, 0011 over four wires: four bits, one per wire. Now you send 0101. From the point of view of the memory link, if a wire carried 0 and carries 0 again, practically no energy is spent: there is no so-called bit toggle, no change of voltage on that pin. Energy, roughly speaking, is proportional to the square of the voltage, and it is spent when you switch from 0 to 1 or from 1 to 0. If many bits on the pins are switching, you spend a lot of energy; if the bits do not change, you spend almost none. Engineers know this phenomenon well, but programmers usually do not think about it.

These so-called toggles are the reason we spend substantial energy on data transfer. In fact, for most protocols, for example for networks-on-chip or for DRAM, the energy spent sending data is directly proportional to the number of toggles.
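Counting toggles for a pair of consecutive transfers is a one-liner (a software sketch of what the energy model counts, not Nvidia's hardware):

```python
def toggles(prev: int, cur: int) -> int:
    """Bit flips between two consecutive bus transfers: XOR marks the
    positions that changed, and counting the ones gives the toggle count."""
    return bin(prev ^ cur).count("1")

# the example above: sending 0011 and then 0101 flips two wires
print(toggles(0b0011, 0b0101))  # 2
```

Summing this over every consecutive pair of transfers gives the quantity that, per the proportionality above, tracks the link's energy cost.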

What is going on? Why can compression, seemingly out of nowhere, make things so much worse?

Here is an example of one piece of data from a real Nvidia application. The data is sent in chunks, usually 32 or 16 bytes; these chunks are called flits. How are they sent? First one part, then the next, pin by pin. If we XOR consecutive flits, we see only two bits flipping per 16 or 8 bytes, because the data was laid out very regularly.

Now apply some compression algorithm, for example frequent pattern compression, an algorithm from Intel, and voila: almost every second bit flips. The number of flips can grow 15-16 times. Compression fundamentally changes the entropy, the amount of information carried by each bit, and completely breaks alignment: bytes that were laid out very regularly are now laid out essentially randomly. You send fewer bits, but each transfer costs far more energy. This is the problem Nvidia ran into. I analyzed their applications, and the problematic ones were exactly those with a wild excess over the average toggle rate: it might have been 2% before and become 40%. That is why energy efficiency fell.

We investigated this problem and found a solution. We found good metrics that allowed us, without going into details, to add a small block to the processor that estimates in advance how many toggles a transfer will cause and how much compression will save, and makes sure we send data compressed only when it is profitable.

That solution made it into real hardware a few years later. I have told this story to give a sense of the problems you face when applying compression this close to the compute core. You need very low latencies; you often need to study your applications carefully rather than hope that standard algorithms will save you, because the best compression algorithms are those specially tuned to your data structures. And you must remember that not only performance matters but energy too. Roughly speaking, compression may cost you nothing, it may be literally free, yet it can transform the data so that you spend more energy.

About a year and a half ago I began applying this expertise to other applications. I was then working at Microsoft Research in Redmond, and one of the main applications was machine learning.

I will tell you where I applied it, what results we managed to get, which interesting problems we solved, and which remain. The main focus of this part is the very popular DNNs, and specifically training. There are many other interesting algorithms, but DNNs are what we encountered in Microsoft's Azure data centers: a lot of users train different kinds of networks, and many of them train, to put it mildly, inefficiently. By that I mean they get less than 10% of the hardware's potential performance. We wanted to see why this happens, what users are doing wrong, and what the creators of the popular frameworks TensorFlow, CNTK, and MXNet are doing wrong. In my opinion, those are not optimized enough either, despite the resources invested in them.

The focus on DNNs comes from the fact that they were the application occupying most of the GPU servers.

DNNs are used for a large number of applications: translation, speech recognition, recommendation systems, object detection, image classification. Different types of networks are used. In my opinion there is still less science here and a lot of voodoo magic, but these networks are trained on real machines and consume real resources. Whatever they are, we must optimize them effectively.

My work focuses mainly on training rather than on deploying networks, although both problems, training and inference, are very important.

When I started tackling this problem, Google already had the first version of the TPU in real hardware, and it seemed uninteresting to compete with several companies already doing inference in real hardware, so I focused on training, which was very important for Microsoft.

How are training and inference fundamentally different? Why are the optimizations and the hardware for them potentially different? If you saw the announcements from Google, the first-generation TPU was generally suitable only for inference; the second is partially suitable for training, but still not ideal. The first step of training is the same as inference: the so-called forward pass. You go through all the layers of the network and generate an answer, and for inference that is the end. The result is your network's prediction. In training it does not end there: you need the so-called backward pass, propagating gradients back through the layers to minimize your error, and you repeat this process many times over all your input data.

This is the main difference, especially for the problem we face: these networks consume a large amount of memory. Memory is consumed because intermediate results are generated between the layers, usually called feature maps or activations in the English literature. In inference this data is also computed, but it is consumed by the next layer and then no longer needed. With backpropagation, the typical algorithm used to train neural networks, all of these results are needed on the backward pass to compute the gradients correctly. Roughly speaking, if I have a layer L2, I have to keep its output through the whole forward pass and most of the backward pass, because it will be needed right at the end. There is no way around it: this data is required to compute the gradient correctly.

If I have a deep network, and we know there are networks with 100 or 200 layers, for example ResNet, used for image classification, all this intermediate data must be stored somewhere. The result is that even for ResNet-101, a standard network, if you want a more or less normal mini-batch, say 64 images at a time, you often need more memory than fits on one graphics card: more than 16 GB. Even the most expensive cards, the P100 (Pascal) or even the V100 (Volta), have no more than 16 GB. And if the data does not fit in the memory of one card, several must be used.

For a company like Facebook it is not a problem to buy many cards. But if a card is used only for its memory, its computing resources go underutilized, and it is wasteful to use many cards just to cover the missing memory.

One of the key issues is that neural networks consume large amounts of memory, and if you start using many GPUs, computing resources are used inefficiently: you simply split the work across two cards, and each of them uses less than half of its resources.

As a first step toward analyzing where this memory goes, we profiled popular networks used for image classification. On the X axis: AlexNet, with which this DNN revolution began, and other popular networks: OverFeat, VGG16, Inception (also known as GoogLeNet v3, one of the popular networks Google uses). The brackets show the mini-batch size, the parameter indicating how many images we process in parallel. The dataset used for this test is the classic ImageNet: classifying images of roughly 224 by 224 pixels with depth 3 into 1000 output classes: cats, dogs, people, and so on.

Where does the memory go? Even on the relatively small, older AlexNet, memory consumption was already several GB, and that was 2011, when cards were even smaller. And the deeper we make the networks, the larger the share of memory consumed by these feature maps, the intermediate results.

Why not the weights? The weights really are important, but they do not grow linearly with mini-batch size. When you process more and more images in parallel, the same weights are used, but activation memory grows linearly with the number of inputs. Why do we increase the number of images processed in parallel? To make good use of GPU resources. A modern GPU can easily run 8000 threads in parallel, and to load it fully you need to process many images in parallel; otherwise the GPU will not be fully loaded, and then why use it at all? It would be no more efficient than a CPU or an ASIC.
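The linear growth is easy to see with a back-of-the-envelope estimate (the layer shapes below are illustrative, not taken from any of the profiled networks):

```python
def activation_bytes(shapes, batch, dtype_bytes=4):
    """Rough estimate of feature-map memory for one forward pass: every
    layer's output must be kept for the backward pass. Shapes are
    (channels, height, width) per layer; float32 by default."""
    return sum(batch * c * h * w * dtype_bytes for c, h, w in shapes)

shapes = [(64, 112, 112), (256, 56, 56), (512, 28, 28), (1024, 14, 14)]
print(activation_bytes(shapes, batch=64) // 2**20, "MiB")  # 539 MiB
# doubling the batch doubles activation memory; the weights would not change
assert activation_bytes(shapes, 128) == 2 * activation_bytes(shapes, 64)
```

Even these four made-up layers at batch 64 cost over half a gigabyte, and a real 100-layer network multiplies that many times over.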

To use an accelerator effectively, it must be fully loaded, and here memory becomes one of the main problems. When we began studying it, the first thing we ran into was that in my area, computer architecture and efficient systems, everyone was focused on DNN inference. At our key conferences, such as ISCA and MICRO, there were about 15 papers on deep neural networks in 2015, and every one of them dealt with inference, optimizing only the weights.

First, inference is easier for architects to optimize: the weights have already been found, you just need one forward pass, there are no large memory costs, and you can optimize the convolutional layers to your heart's content and enjoy the results.

The problem is that solving inference does not help much with training. All the methods suggested for inference, such as pruning some weights or quantization (taking a 32-bit float and turning it into 16, 8, or 4 bits), do not work for training, because computational precision is lost and the stochastic gradient descent algorithm typically used simply stops converging. We reviewed these algorithms; none of them applies directly to training. They also do not map well onto modern GPUs: all these papers propose building a new ASIC, something like the TPU, or at best using programmable logic like an FPGA. On the GPU these techniques did not fit well.

We decided to see what new and interesting things we could do specifically for training these networks. One of our main observations concerns how the data that sits in memory for so long is actually used. It is generated in one place, say at layer Lx, and the next layer consumes it. After that, another hundred layers may run before this data needs to be reused, and for the sake of that reuse we keep it around the whole time. But the generated results are used in only two places: immediately, and much later on the backward pass.

So there is a long stretch of time during which this data is not used at all. We propose applying new encoding algorithms to this data and keeping it compressed across that gap, decompressing it only when it is needed, at that particular layer. Moreover, neural networks are very well suited to this optimization: I never need all the data at once. At each moment I process one layer, one after another, so I never have to decompress everything; I work with at most one or two layers at a time. That is what lets me compress and decompress the data.
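The scheme can be sketched as follows (an illustration of the idea only: `zlib` stands in for the specialized layer-aware encodings, and the class and layer names are made up):

```python
import zlib

class CompressedStash:
    """Sketch of the idea above: compress a layer's output once the next
    layer has consumed it, keep only the compressed bytes until the
    backward pass, and decompress on demand."""
    def __init__(self):
        self._store = {}

    def put(self, layer: str, data: bytes) -> None:
        self._store[layer] = zlib.compress(data)

    def get(self, layer: str) -> bytes:
        return zlib.decompress(self._store[layer])

stash = CompressedStash()
fmap = bytes(4096)                 # a highly redundant feature map (all zeros)
stash.put("conv1", fmap)
assert stash.get("conv1") == fmap  # lossless round trip for the backward pass
```

Because only one or two layers are live at any moment, compression and decompression never happen for the whole network at once, only at the boundary of the current layer.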

Everything seems fine, but the question is how to do this compression and decompression efficiently. After all, almost all neural networks store data as floating-point values, which do not compress well, and when they do compress, it is usually lossy. Our main idea is that we should not try to apply existing algorithms, even good ones. We did try them, and none worked efficiently enough. We need new algorithms, built on a good understanding of what exactly happens inside the neural network.

Here is the first thing we noticed while working on this study: most of these activations, the feature maps, have very specific types. Although there can be many layers, there are very few distinct kinds of layers in today's neural networks. These are, for example, convolutional layers; ReLU layers, which simply compute a rectified linear function; pooling layers, essentially a small filter that slides over the image or an intermediate result; so-called normalization layers; and so on.

So although there can be hundreds of layers, there are not many distinct kinds, and certain ones dominate. We noticed that all of these networks contain a very large number of recurring pairs: a layer whose output feeds into a ReLU, a ReLU feeding into a pooling layer, layers followed by convolutional layers, and many others.

Interestingly, ReLU was originally adopted instead of, say, the hyperbolic tangent or some other nonlinear function because it is computed quickly and well. But it has another wonderful property: its gradient is also computed very quickly.

In fact, if we use ReLU, the only reason we store the data at all is to compute its gradient on the backward pass. But ReLU is a very simple function: it passes the value through unchanged if it was positive, and outputs 0 if it was negative. So its gradient depends only on the sign of the input, plus or minus.

We understood this almost by accident. Analyzing the distribution of values during training, we noticed that it is bimodal: it tends toward two values. At first there seemed to be no mathematical reason why it should tend toward two values during training; it turned out to be simply the sign, plus or minus. So before doing any sophisticated optimizations, we just took the 32 bits and threw away all but one bit, and saw that training proceeded perfectly, nothing changed.
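The sign-bit trick can be sketched in plain NumPy (standing in for the actual CUDA kernels): for the backward pass, keep only one bit per activation instead of 32.

```python
import numpy as np

def relu_forward(x):
    y = np.maximum(x, 0.0)
    # Only "was the input positive" matters for the gradient, so pack the
    # 32-bit floats down to 1 bit each: a 32x smaller backward-pass copy.
    sign_bits = np.packbits(y > 0)
    return y, sign_bits

def relu_backward(grad_out, sign_bits):
    # Unpack the stored bits and zero the gradient where the input was <= 0
    mask = np.unpackbits(sign_bits, count=grad_out.size).astype(bool)
    return np.where(mask, grad_out, 0.0)
```

The full-precision `y` is consumed immediately by the next layer; only `sign_bits` needs to survive until the backward pass.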

In fact, we store the full 32 bits when only one of them is needed. And there are many such tricks, all stemming from the fact that the layers involved are very simple. For example, a pooling layer is often a 2x2 or 3x3 window from which you select the maximum value. The full values are stored and the maximum among them is recomputed on the backward pass, but in fact all you need is the index of the maximum value. Today all the values are stored in full in TensorFlow, in CNTK, in every framework I looked at. When I talked to people doing machine learning, they said that this is how it should be done, but they usually don't build the frameworks themselves. And those who build the frameworks usually don't worry about the mathematics: they call some gradient function and don't care whether a linear function or a tangent sits behind it.
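A sketch of the pooling trick (illustrative helper names, NumPy instead of CUDA): store the 2-bit argmax index of each 2x2 window rather than the full 32-bit inputs, since the backward pass only routes the gradient to the winning position.

```python
import numpy as np

def maxpool2x2_forward(x):
    """x: (H, W) with even H, W. Returns pooled values and per-window argmax indices."""
    h, w = x.shape
    windows = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3).reshape(-1, 4)
    idx = windows.argmax(axis=1).astype(np.uint8)  # values 0..3: really 2 bits each
    return windows.max(axis=1), idx

def maxpool2x2_backward(grad_out, idx, shape):
    # Scatter each gradient back to the position that won the max
    h, w = shape
    grads = np.zeros((grad_out.size, 4), dtype=grad_out.dtype)
    grads[np.arange(grad_out.size), idx] = grad_out.ravel()
    return grads.reshape(h // 2, w // 2, 2, 2).transpose(0, 2, 1, 3).reshape(h, w)
```

For simplicity the indices sit in a uint8 each; a real implementation would pack four of them per byte.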

Because, roughly speaking, these two communities don't really talk to each other, a large number of opportunities for optimizing training turned out to be available. ReLU layers can be compressed 32 times, and pooling layers 8 times. Everyone complains about memory consumption, including the TensorFlow team, yet there are great opportunities to compress memory without any loss of quality: none of these optimizations loses anything in accuracy.

We used standard binarization, storing 1 bit instead of 32. There is one detail: the next layer needs the full generated value immediately, so we cannot encode the data right away. Therefore, when a result is generated, we create two copies. One is the full 32-bit value, which is consumed immediately by the next layer, after which its memory can be freed. For the second use, we store only the one-bit copy.

This optimization requires only one bit per value, and it is lossless: there is no loss of quality.

The second idea is based on the fact that ReLU, by its very semantics, generates lots of zeros: all negative results turn into zeros. We took measurements on one of the networks, VGG16, trained it for 10 epochs, and looked. The total fraction of zeros is shown here in green; the higher, the better. 1x corresponds to 100% zeros and 0.6 to 60% zeros. That is how many zeros are stored in these activations.

Further along the chart are the different layers, say Relu_1, Relu_2, and so on, with the fraction of zeros in each. You can see there are many, more than 60%, closer to 70%. And although the graph does not always show it clearly, the longer training lasts, the more zeros appear at many layers: the fraction of zeros grows epoch by epoch and then stabilizes.

What does this mean? If we have that many zeros, they can be represented more efficiently. But without new hardware, everything has to be done on the GPU: you need to write your own CUDA code to carry out these optimizations. It sounds easy, but there was no good sparse matrix representation in CUDA. We tried what NVIDIA offers, and it is tuned for sparsity above 99%; the whole design is built around that. We had to build our own representation that works well in the range of 50% to 80% zeros, where the zeros can be removed efficiently using GPU instructions. Ideally all these optimizations would be done in real hardware, or, say, in the cuDNN library, which we proposed to NVIDIA. For now we do it in CUDA code, and even that works quite efficiently.
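The kind of format that pays off at moderate sparsity can be sketched as a bitmask plus packed non-zeros (a CPU stand-in for the actual GPU representation, which the lecture does not detail): one presence bit per element, unlike CSR-style index lists that only win at very high sparsity.

```python
import numpy as np

def zvc_compress(x):
    """Bitmask + packed non-zeros: effective in the 50-80% zeros range."""
    mask = x != 0
    return np.packbits(mask), x[mask]

def zvc_decompress(bits, nonzeros, n, dtype=np.float32):
    # Rebuild the dense array: unpack the mask, scatter the stored values
    mask = np.unpackbits(bits, count=n).astype(bool)
    out = np.zeros(n, dtype=dtype)
    out[mask] = nonzeros
    return out
```

At 70% zeros, a 40-byte float32 array shrinks to a 2-byte mask plus 12 bytes of values.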

Finally, we made several observations about when you can use lossy optimizations, so-called lossy encoding, and how this must be done carefully. In the previous literature, the IBM group in particular suggested taking a certain layer, say L1, that generated some results, turning its 32 bits into 8 or 16, and then somehow dealing with the error that accumulates.

The problem is that if you replace 32-bit numbers with 16-bit ones, the error accumulates over time: it starts small, then grows. In the end, say with AlexNet, if you replace 32 bits with 16 and do honest training, not just one epoch as in their article, but training to the proper validation accuracy, you see that it stops training normally and the optimization does not work. Yes, nothing crashes, but the network ceases to train properly. And all because when the loss of quality is introduced this early, it propagates through the entire network, which is inefficient.

But in fact you do not need this loss of quality so early. What you need is to take the original version, uncompressed, and pass it to the next layer. Remember, there were two uses, the first and the second. The first use should happen without any loss of quality, so no error occurs on the forward pass at all. Only the version used on the second pass, the backward pass, should be stored in compressed form. The error is then introduced only on the backward pass, and it propagates far less.
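A sketch of this delayed lossy scheme (the `produce`/`fetch_for_backward` names and the `stash` dictionary are hypothetical, and fp16 stands in for the 16- or 8-bit encodings discussed): the forward consumer gets the exact fp32 tensor, and only the stashed backward-pass copy is narrowed.

```python
import numpy as np

stash = {}

def produce(idx, y):
    # The next layer consumes y at full fp32 precision: no forward-pass error.
    stash[idx] = y.astype(np.float16)  # only the backward-pass copy is lossy
    return y

def fetch_for_backward(idx):
    # The rounding error enters here, on the backward pass only
    return stash.pop(idx).astype(np.float32)
```

Compare this with quantizing at the source, where every downstream layer of the forward pass would see the rounded values.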

What did we manage to get?

Here I show how training proceeds with various optimizations. Our baseline is traditional AlexNet training with FP32, 32 bits. If you do what IBM suggested, All-FP16, training everything in 16 bits, then in our results it does not train. You can keep selected data structures in 16 bits and it will work, but if everything, both computation and storage, is done in 16 bits, at some point it stops training.

But if you do the training as we propose, so that the error occurs only on the backward pass, or you use our technique of delaying the introduction of the error, then you can train with 16 bits, and even with 8. The training error falls in an almost identical way; these three curves are nearly the same.

The bottom line: if you want to apply a lossy optimization, delay it for as long as possible. It makes no sense to apply it immediately, because the error is then introduced at the initial steps and accumulates over time.

We proposed a system that we called Gist; the English word literally means "essence". We take the DNN and the execution graph used for its computation. Given the specific description of the neural network (in our experiments we mainly used CNTK, because the work was done at Microsoft), our system finds the various opportunities for data compression and inserts our CUDA code in those places. We get a new, modified execution graph; TensorFlow, MXNet and the others all use this computation model. We simply add small GPU kernels that compress and decompress where needed. This requires no change to the hardware you use, and not even a change to the library.

We have shown in our papers that if you do this inside a library, for example cuDNN, which is commonly used for neural networks, you can get many more benefits, but getting into the product takes time, and we are working on that too. Note that memory allocation for the new structures is sometimes done not by the framework itself, such as TensorFlow: Gist allocates some data structures on its own.

The question is what benefit we can get from this without losing training quality, and that is a crucial point. It took a lot of resources to show that after all of this we still train for the full 90-100 epochs with the same quality as without our optimizations, and that it does not take longer. We show on five popular networks that we reduce memory consumption by up to a factor of two, so you can train models twice as large or twice as deep on the same hardware. The slowdown in training is 6% to 8%, depending on which optimizations you use, which is usually acceptable, since these networks often train for weeks. That is the main part of the optimizations we did recently, while I was still at Microsoft.

Now I'll talk about what I have been doing with my students at the University of Toronto. The main problem I ran into while working at Microsoft, and during my PhD, is that for DNN training, despite everyone talking about it and working on it, there are no good benchmarks in the public domain, and even inside a company access to much of it is limited. When I worked at Microsoft, I asked for the best LSTM network they use for speech, because I wanted to see what opportunities there were. It was impossible to get any usable code from anyone, even though everyone seems to be working on it.

This problem is significant. To do useful research at universities, you need to work on problems that are currently relevant. In my community everyone works mainly on image classification, everyone optimizes AlexNet, ResNet and so on, and looks at nothing else. Image classification is a very important problem; it is clear that image recognition matters, say, for Facebook to use competently. But there are tons of other interesting problems, and machine learning does not consist of that alone.

We found that the conclusions, once proper benchmarks are done, can often change dramatically. Here is a telling case. The developers of MXNet, which is actively backed by Amazon, say in their marketing that they are faster than TensorFlow. If we take ResNet-50, then indeed, measuring on Pascal cards, they are 40% faster than TensorFlow. Does this mean MXNet is faster than TensorFlow? On this application, yes. But a benchmark does not consist of a single application. We took an LSTM, one implementation for MXNet and an equivalent one for TensorFlow with the same parameters, and not ones we wrote ourselves, but the official ones from Amazon and from Google, and on translation TensorFlow is 2.5 times faster. So not everything is so simple; you cannot draw conclusions from a single benchmark result.

This is one of the problems we wanted to solve. It requires an enormous amount of resources, but we are slowly moving forward. We selected several problems that we consider interesting, and I would be interested to hear the opinion of those of you involved in machine learning about what other applications should be added to these benchmarks.

We took, of course, image classification; a benchmark is unthinkable without it. We took object detection, where the best algorithms we found use so-called Faster R-CNNs; we use the latest models, from 2016-2017, from Google and from MIT. For translation we use two different groups of models. The first is the old classic LSTM approach, though these implementations are only about six months old: NMT, Google's top network, which is open source, and the very similar Sockeye network from Amazon. We compared them.

We also used the new Transformer model, which is built on attention layers. Attention has been in use since about 2013, but this is the first network I know of that abandons recurrent layers entirely and relies on attention layers alone. Its developers at Google claimed that it maps very well onto a modern GPU and trains well, and it was very interesting to check all this on real benchmarks.

We have speech recognition, where we also use LSTM. We use adversarial networks, and we have reinforcement learning, which Google recently applied so effectively in AlphaGo. We have both supervised and unsupervised learning. And image classification, of course. That is what we are trying to cover ourselves, with all models from 2016-2017, as good as we could find.

Naturally, it is interesting to compare them across platforms. We look at TensorFlow, CNTK, and MXNet. I won't show comparative results, since we haven't finished everything, but it is in our plans. We take a model on TensorFlow that is as efficient as possible and compare it honestly with CNTK or MXNet. That means we make all hyperparameter values identical wherever it is in our power; where a parameter exists in only one framework, we choose it optimally for that network.

We look at different hardware. You can of course use the GPU; in my view, using a CPU for most of these networks is inefficient. In academia the realistic options are GPU or FPGA, because no one will give us direct access to Google's ASIC. Perhaps eventually they will share virtual machines with their TPUs, but for most people that time has not yet come.

We use different metrics: performance, energy efficiency, and how well these networks train, which is also important; you need to compare them fairly, since each trains to a different level.

A very large number of requests, including from companies, came down to wanting to understand where the memory goes. Remember, we built a profiler showing where the memory went in CNTK; it took almost a month. It is not so easy to track where each data structure ends up. If you take TensorFlow and look for where it allocates memory, you will find 2,000 places, and tracking where the allocations then go, through which pool, is quite difficult. There are no good profiling tools for this. We are building them slowly; we have them for CNTK and for MXNet. TensorFlow is a very user-friendly framework but very difficult to change; it is hard to track everything there, but we are slowly trying to do this together with Google. You start training your network, and we can immediately show where your memory went and into which data structures, so you can understand how to optimize it.

I will show a couple of interesting examples of what we found.

I mentioned that we compared TensorFlow with MXNet on these models: the original implementations, one called NMT from Google and the other Sockeye. They ship with different parameters, but to compare them fairly we took the same number of LSTM layers, the same batch size, that is, all the hyperparameters the same, and the same training algorithm.

Interestingly, we were surprised ourselves. This graph shows training steps, mini-epochs, on the X axis, and on the Y axis the BLEU score, a metric of how good the translation is, how close it comes to a human translation. In this case the task is translation from English into Vietnamese.

What do we notice? The models are different, but if you look only at training steps and forget about time, then up to a BLEU score of about 20 they train almost identically. Our hypothesis is that if you choose all the hyperparameters correctly, the mathematics is very much the same. They use the same SGD. Yes, it was implemented by different people and some fluctuations are possible, but it behaves very similarly almost up to the maximum.

After that they behave a little differently, because each changes its learning rate in its own way, so the curves diverge. But at first the training, in terms of mathematics and computation, is the same. It would seem this means there is no difference, but in fact, if you look at wall-clock time, one trains two and a half times faster than the other. That is because TensorFlow is currently much better optimized for LSTM layers: it uses a good variant of cuDNN, while MXNet uses its own library for LSTM. Mathematically they do the same thing, but the performance differs significantly.

We ran an analysis (one of my students is doing this now) of how efficient TensorFlow itself is, given that it appears two and a half times more efficient than MXNet. It turned out that if you look at the GPU while LSTM runs on it, only about 30% of the GPU cores are used; 70% are simply idle. And even on the cores that are used, not all of the arithmetic units for addition and multiplication are used effectively. So the efficiency even on TensorFlow is about 10-15%, no higher, and it is still two and a half times faster than MXNet. The remaining headroom is very substantial.

We are working on improving LSTM, but so are others: a week ago NVIDIA announced a new version of cuDNN on which LSTM should be much faster. As always with marketing, all sorts of tricks are used. The press release their PR people put out says cuDNN is now 2.5 times faster on LSTM. But if you look carefully at the chart, it compares the Pascal P100 card on cuDNN version 8 against the Volta V100 card on cuDNN version 9, and the two cards by themselves differ in speed by a factor of two. So the speedup comes mainly not from the library but from the card; in marketing such details are usually glossed over.

We are also actively exploring reinforcement learning, where there are no benchmarks at all. We are slowly working through what people publish at leading conferences such as NIPS or ICLR and trying to use their models. One of the problems is that machine learning researchers very often prototype in something like MATLAB and don't bother producing a proper implementation. Then the hyperparameters they tuned into their networks are not available: you take their network exactly as described in the paper, start training, and nothing trains. It takes a tremendous amount of time. This happens even with the people at Google, though at least they answer questions quickly.

Very often it comes down to these hyperparameters: you change one of them by a factor of two here or there, and the network suddenly starts to train. A lot of interesting things happen.

We look at reinforcement learning, compare MXNet with TensorFlow, and build interesting graphs along the way.

In the longer term we have more aggressive optimizations, which I will talk about now. All the data compression optimizations I have described assumed that compression is useful, but that its benefit comes with the need to decompress the data, and that always costs energy, latency, and performance.

In fact, there is an interesting idea that I started exploring in my dissertation and recently tried on real applications: you can run code directly on compressed data for some algorithms, without decompressing it. What does that mean? We have a cache, memory, a processor. If everything is stored compressed, then before you do anything you must first unpack the data, and afterwards pack it again and put it back. It is like a tax on computation that you pay all the time. Google, for example, has said that compression and decompression in their cloud can account for 5-7% of the computation.

In total their cloud tax reaches 20-25%: data transfer, copying, compression, and encryption. For any computation you run on Google Cloud, 25% is side work unrelated to your actual computation.

Part of that tax is compression. The question is whether you can extract some benefit from the fact that your data is compressed: not just avoid decompression, but perhaps even do better than before.

I took an example from databases, since it is easy to describe. A database often holds an array of data; or, in stream processing, new data constantly arrives from some number of servers. There are keys under which values are stored, and your database can be laid out by columns or by rows. If locality and performance matter to you and you do a large number of scans, you store the values separately from the keys, because then you iterate over them; for example, for a query "where value = 10" you must visit every value. And if you want to use memory well, it is better to store them together.

Say we have an array of values, each occupying 4 bytes in memory.

Suppose we used one of the algorithms I talked about, Base + Delta encoding, which carries some metadata, and suppose it compressed this data well, down to 1 byte per value, a factor of four. Everything is fine.

Now suppose you need to search this data. Before, without compression, what did you do? Searching an array takes a linear number of comparisons. Handled naively, with compression you must first decompress the data, spending energy, and then again make n four-byte comparisons. But if you look closely, the transformations performed by certain kinds of compression preserve a very important property of the data. The values change, but with Base + Delta encoding the order is preserved: roughly speaking, a comparison of two values gives the same result when they are compared in Base + Delta form.

What does this mean? When I search for a value, instead of decompressing the large array, I convert the search key itself into the compressed representation: I try to express it using the existing base and some offset.

I take the value I am looking for, take from the metadata the base that was used to compress the data, and see what offset remains when I subtract this base from the value.

If it turns out that the resulting delta does not fit in 1 byte, the search is over at once: the key cannot match any of the stored values, because all the stored deltas are single-byte. So either one check settles it, or you are less lucky and still only need to compare single-byte deltas. And if you are familiar with vector programming, you can take Intel's SIMD instructions and pack more of these comparisons into each one: instead of four-byte comparisons you can do four, or eight, single-byte comparisons at once as a vector, cutting the number of comparison instructions by a factor of 4 or 8.
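The whole search-on-compressed-data idea can be sketched as follows (illustrative function names; NumPy's vectorized byte comparison stands in for the SIMD instructions):

```python
import numpy as np

def bd_compress(values):
    """Base + Delta: one 4-byte base plus 1-byte unsigned deltas."""
    base = values.min()
    deltas = values - base
    assert deltas.max() < 256, "deltas must fit in one byte"
    return base, deltas.astype(np.uint8)

def bd_contains(base, deltas, query):
    # Convert the query into the compressed representation instead of
    # decompressing the array.
    off = query - base
    if off < 0 or off > 255:
        return False  # query is not representable: no stored value can match
    # One vectorized byte-wide comparison over the deltas, SIMD-style
    return bool(np.any(deltas == np.uint8(off)))
```

The early return is the "one comparison and you're done" case; otherwise the scan touches 1-byte deltas instead of 4-byte values.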

And this is a pure bonus with almost no added latency: we can increase comparison throughput by an order of magnitude and avoid decompression entirely. Now we are working on applying such tricks to neural networks and other applications. That is, we want to avoid decompressing data at all and run code directly on the compressed data. That's all.