How we started building a cloud in 2009, and where we went wrong

In October 2009, we double-checked everything. We needed to build a data center with 800 racks. We based this on our intuition, market forecasts and the situation in America. It sounded logical, but it was scary.

Back then there was no cloud computing in Russia, and no cloud hosting either. The word itself was hardly used in the market. But we had already seen that such installations were in demand in America. We had big projects behind us building 500-node HPC clusters for aircraft designers, and we believed that a cloud was just another big computing cluster.

Our mistake was that in 2009 nobody thought clouds would be used for anything other than distributed computing. Everyone needs CPU time, we thought. So we began to build the architecture the way we had built HPC clusters for research institutes.

Do you know the fundamental difference between such a cluster and modern cloud infrastructure? The cluster makes very few disk accesses, and nearly all of its reads are sequential. A single job is submitted, it is split into pieces, and each machine works on its own piece. At the time, nobody seriously considered that the load profile on the disk subsystem is fundamentally different for HPC clusters and for the cloud: in the first case it is sequential reads and writes, in the second it is fully random. And this was not the only problem we had to face.
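To make the difference between the two load profiles concrete, here is a minimal sketch. It is not taken from our platform: the function names, the 4 KiB block size and the 100 GiB disk size are all assumptions chosen for illustration. It generates request offsets for a streaming HPC-style read and for a random cloud-style load, then measures how many requests start exactly where the previous one ended.

```python
import random

BLOCK = 4096  # assumed request size in bytes (illustrative)

def sequential_offsets(n_requests, block_size=BLOCK, start=0):
    """Offsets of a streaming read: each request follows the previous one."""
    return [start + i * block_size for i in range(n_requests)]

def random_offsets(n_requests, disk_size, block_size=BLOCK, seed=0):
    """Offsets of a mixed multi-tenant load: block-aligned positions picked at random."""
    rng = random.Random(seed)
    n_blocks = disk_size // block_size
    return [rng.randrange(n_blocks) * block_size for _ in range(n_requests)]

def contiguous_fraction(offsets, block_size=BLOCK):
    """Share of requests that start exactly where the previous request ended."""
    if len(offsets) < 2:
        return 1.0
    hits = sum(1 for prev, cur in zip(offsets, offsets[1:])
               if cur == prev + block_size)
    return hits / (len(offsets) - 1)

seq = sequential_offsets(10_000)
rnd = random_offsets(10_000, disk_size=100 * 2**30)

print(contiguous_fraction(seq))  # 1.0: every request continues the previous one
print(contiguous_fraction(rnd))  # close to zero: almost every request is a jump
```

A storage system sized for the first pattern can rely on read-ahead and streaming throughput. The second pattern gives it almost nothing to predict.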

Network architecture

The first important choice was InfiniBand or Ethernet for the network inside the main cloud platform. We spent a long time comparing them and chose InfiniBand. Why? First, because, I repeat, we thought of the cloud as an HPC cluster, and second, because at the time everything was built from 10Gb links. InfiniBand promised wonderful speeds, simpler support and lower network operating costs.

The first segment, in 2010, ran on 10G Ethernet. At the time we were, once again, the first to run on our platform the world's first SDN solution, from Nicira, which VMware later bought for a lot of money and which is now called VMware NSX. Just as we were learning to build clouds, the Nicira team was learning to do SDN. Of course there were problems, and a couple of times everything went down quite noticeably. The network cards of the day would “drop off” after long uptime, which only added to the excitement. In short, it was brutal. For a long while after one major Nicira update, the operations team lived on valerian. However, by the time 56G InfiniBand was launched, we and our colleagues at Nicira had fixed some of the problems, the storm had died down and everyone breathed a sigh of relief.

If we were designing the cloud today, we would probably build it on Ethernet, because that is the direction architecture ultimately took. But it was InfiniBand that gave us huge advantages that we were able to use later.

First growth

The first stage of growth began in 2011-2012. “We want it like Amazon, but cheaper, in Russia” was the first category of customers. “We want special magic” was the second. Because everyone at the time advertised clouds as a miracle tool for uninterrupted infrastructure, we had some misunderstandings with customers. The whole market quickly got hit over the head by large customers, because they were used to near-zero downtime on physical infrastructure. A server goes down, and the head of the department gets a reprimand. And a cloud, with its extra virtualization layer and orchestration layer, runs slightly less reliably than physical servers.

Nobody wanted to deal with VM failures, because everything in the cloud was set up by hand and nobody used the automation and clustering solutions that could have improved the situation. Amazon says, “Everything in the cloud can fail,” but the market did not like that. Customers believed the cloud was magic: everything should work without interruption, and virtual machines should migrate between data centers by themselves. They all ran a single server instance on a single virtual machine. And IT maturity at the time was such that automation was lacking: everything was done once, by hand, following the “if it works, don't touch it” philosophy. So when a physical host restarted, all of its virtual machines had to be brought up manually. Our support team did this too for a number of customers. It was one of the first things we solved with an internal service.

Who came to the cloud? All sorts of people. Distributed online stores were among the first to arrive. Then people began to deploy business-critical services in a proper architecture. Many treated the cloud as a failover platform, something like a backup data center. Later they moved to it as their main site, but kept the second site as a backup. Most of the customers who built on such an architecture from the start are still very satisfied. A properly configured migration scheme for failures was our pride: it was very cool to watch a major accident happen in Moscow while customer services automatically migrated and deployed to the backup site.

Disks and Flash

The first growth was very fast, faster than we could have predicted when designing the architecture. We bought hardware quickly, but at some point we hit a ceiling on the disks. Just then we were building our third data center, the second one for the cloud: the future Compressor, certified Tier III by the Uptime Institute.

In 2014, very large customers appeared and we faced a new problem: degraded storage performance. When you have 7 banks, 5 retail chains, a travel company and a research institute doing geological exploration, their load peaks can all suddenly coincide.

The typical storage architecture of the time did not assume that users had write-speed quotas. Reads and writes were simply queued first come, first served, and the storage system processed them in that order. Then came a “Black Friday” sale, and we saw storage performance for users drop almost 30-fold: the retailers' requests had consumed almost all of the write capacity. A medical clinic's website went down, with pages taking 15 minutes to open. Something had to be done urgently.
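The effect of a shared queue without quotas can be shown with a toy model. This is only a sketch: the tenant names, request counts and service rate are invented for illustration, and all requests are assumed to arrive at the same moment.

```python
from collections import deque

def fifo_finish_times(bursts, service_rate):
    """Serve requests strictly in arrival order at a fixed rate.

    bursts: list of (tenant, request_count), all arriving at t=0 in list order
            (a deliberate simplification).
    service_rate: requests the storage can complete per second.
    Returns {tenant: seconds until that tenant's last request completes}.
    """
    queue = deque()
    for tenant, count in bursts:
        queue.extend([tenant] * count)

    finished = {}
    served = 0
    while queue:
        tenant = queue.popleft()
        served += 1
        finished[tenant] = served / service_rate
    return finished

# A quiet day: the clinic's 100 requests are alone in the queue.
quiet = fifo_finish_times([("clinic", 100)], service_rate=1000)

# A sale: the same 100 requests are stuck behind a retailer's burst.
busy = fifo_finish_times([("retail", 2900), ("clinic", 100)], service_rate=1000)

print(quiet["clinic"])  # 0.1 seconds
print(busy["clinic"])   # 3.0 seconds, for exactly the same clinic workload
```

Nothing in the clinic's own load changed between the two runs. With a single first-come, first-served queue, its latency is decided entirely by whoever got in line first.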

Even the highest-performance disk arrays, which are usually expensive, offered no way to differentiate performance priorities. That is, customers could still affect one another. We had to either rewrite the driver in the hypervisor or invent something else, and urgently.

We solved the problem by purchasing all-flash arrays rated for a million IOPS. That worked out to 100,000 IOPS per virtual disk. The performance was more than enough, but we still needed to come up with a limit on reads and writes. At the disk array level, the problem was unsolvable at the time (end of 2014). Our cloud platform is built on non-proprietary KVM, and we were free to dig into its code. Over about 9 months, we carefully rewrote and tested the functionality.
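One common way to cap the I/O rate of a single virtual disk is a token bucket. The sketch below illustrates that general technique only. It is not the code we wrote for KVM, and the class name, the parameters and the burst size of 1,000 operations are assumptions made for the example.

```python
class TokenBucket:
    """Per-disk I/O limiter: about `rate` operations per second, bursts up to `burst`.

    Hypothetical illustration. Time is passed in explicitly so the behaviour
    is deterministic and easy to test.
    """

    def __init__(self, rate, burst, now=0.0):
        self.rate = float(rate)      # tokens added per second
        self.burst = float(burst)    # bucket capacity
        self.tokens = float(burst)   # start with a full bucket
        self.last = now

    def allow(self, now, ops=1):
        """Consume `ops` tokens and return True if the request may run at `now`."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= ops:
            self.tokens -= ops
            return True
        return False  # caller should queue or delay the request


# One bucket per virtual disk, so a burst on one disk cannot starve another.
disk_a = TokenBucket(rate=100_000, burst=1_000)
disk_b = TokenBucket(rate=100_000, burst=1_000)

accepted_a = sum(disk_a.allow(now=0.0) for _ in range(1_500))
print(accepted_a)              # 1000: the remaining 500 requests must wait
print(disk_b.allow(now=0.0))   # True: disk_b is unaffected by disk_a's burst
```

The point of the design is isolation: each disk spends only its own tokens, so the limit holds regardless of what the neighbours are doing.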

At this point, the combination of InfiniBand and all-flash gave us something completely wild: we were the first in our market to offer disks with guaranteed performance, backed by the most severe penalties written into the SLA. Competitors in the market stared at us wide-eyed. We said: “We give 100,000 IOPS per disk.” They said: “That's impossible...” We said: “And on top of that, we guarantee it.” They said: “What are you, nuts? Have you lost your minds?” For the market it was a shock. Of 10 major tenders, we won 8 because of the disks. Then we pinned medals on our chests.

16 arrays of 40 terabytes each, each delivering a million IOPS! They are still directly connected to the servers via InfiniBand. And it blew up where nobody expected at all. We had run them in testing for six months without even a hint of trouble.
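For a sense of scale, the figures above can be combined in a quick back-of-the-envelope calculation. This is a sketch under loud assumptions: it treats the rated numbers as fully usable and ignores redundancy, controller overhead and oversubscription, so real capacity planning would come out lower. The function name is made up for the example.

```python
def fleet_totals(arrays=16, iops_per_array=1_000_000, tb_per_array=40,
                 guaranteed_iops_per_disk=100_000):
    """Aggregate figures for the array fleet, using rated numbers only."""
    # How many disks one array can serve if every disk uses its full guarantee.
    disks_per_array = iops_per_array // guaranteed_iops_per_disk
    return {
        "total_iops": arrays * iops_per_array,
        "total_tb": arrays * tb_per_array,
        "full_speed_disks_per_array": disks_per_array,
        "full_speed_disks_total": arrays * disks_per_array,
    }

totals = fleet_totals()
print(totals["total_iops"])                  # 16000000
print(totals["total_tb"])                    # 640
print(totals["full_speed_disks_per_array"])  # 10
print(totals["full_speed_disks_total"])      # 160
```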

The thing is, when an array controller fails on InfiniBand, the routes take about 30 seconds to rebuild. You can cut this to 15 seconds, but no further, because of limitations in the protocol itself. It turned out that once a certain number of virtual disks was reached (customers created them for themselves), a rare heisenbug appeared in the all-flash storage controller. When asked to create a new disk, the controller could go haywire, hit 100% load, go into thermal shutdown and trigger that same 30-second switchover. The disks dropped off the virtual machines. We were stuck. For several months we hunted for the bug together with the storage vendor. In the end they found it and fixed the microcode of the array controllers for us. In the meantime, we wrote a whole layer of our own around these arrays to work around the problem. We also had to rewrite almost the entire management stack.

Demotivational posters about the arrays still hang in the support team's room.

Our days

Then there were problems with the software for remote workstations. The solution there was proprietary, and the dialogue with the vendor went like this:
- Could you help us?
- No.
- You have some damn nerve. We will complain about you.
- Go ahead.
At this point, we decided that we should abandon proprietary components. We then covered that need with our own development. We now invest in open source projects. As in the story of how we once provided almost six months of budget for ALT Linux, our requests sometimes drastically speed up the development of features we need. Riding this wave, we brought our own development to a state that, as our European colleagues put it, is “damn amazing.”

Today we look at the cloud with an experienced eye and understand how to develop it several years ahead. And, again, we know that we can do anything with KVM, because we have the development resources.