A couple of stories about RAID-style lawlessness
On air: the continuation of our Friday column about failures, outages, and other fiascos. If you missed our previous stories, here are the links: one, two, three. Today we'll talk about RAID troubles in one "small but proud" data center.
A story of data inconsistency
It all began when one of the disks in a RAID 5 array failed. Disks fail; it happens. The array started rebuilding, and then a second disk failed — also a classic RAID 5 scenario. As a result, the entire disk pool containing this mdisk went offline. For those unfamiliar with the term: an mdisk is a small RAID group, and a disk pool is built from a set of mdisks. We decided to fail over to the backup data center. Everything went smoothly: the servers came up normally, with no errors. Everything seemed to be working. While the data was being served from the backup site, we rebuilt the failed mdisk in the primary data center. It showed no errors: the array lit up green, and data replicated between the arrays at the primary and backup sites.
We switch back — and discover that some of the data comes up fine, while the database servers, for example, come up with errors. On them we see data inconsistency.
We run integrity checks and find a pile of errors. Strange: in the backup data center the data was fine, yet after reverse replication to the primary site it arrived corrupted. Baffling. At first we suspected the array, but its logs were clean — not a single error.
Then we started to suspect the replication procedure, because the firmware versions on the arrays in the primary and backup data centers differed by two or three minor releases. The vendor's documentation did mention replication errors between mismatched firmware versions. Using a proprietary vendor utility, we checked replication consistency: did every frame sent from one array actually reach the other? Everything checked out — yet as soon as we transferred data from the backup data center back to the primary one, the inconsistency reappeared: some blocks arrived corrupted.
Next we tried simply copying data over the network. We took a database dump and wrote it to the array. The dump arrived with different checksums — apparently data was being lost somewhere along the way. We tried sending it over different networks; the result was the same. So the problem was in the array after all, even though it cheerfully reported trouble-free operation.
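The check that caught this can be sketched as an end-to-end checksum comparison: hash the dump on the source host, hash it again after it lands on the array, and compare. A minimal sketch (the file names are hypothetical, and this is the generic technique, not the exact tool we used):

```python
import hashlib

def sha256_of(path, chunk=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks, so a
    multi-gigabyte database dump never has to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

# Hypothetical paths: compare the dump at the source with the copy
# that landed on the array. Any mismatch means corruption in transit
# or at rest.
# assert sha256_of("dump_source.sql") == sha256_of("dump_on_array.sql")
```

If the hashes differ on every network path you try, as in our case, the transport is exonerated and suspicion shifts to the storage itself.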
We decided to rebuild the disk pool. We moved the data that had been running in the primary data center back to the backup one, freed up the disk array, formatted it completely, and reassembled the whole pool. Only then did the data inconsistency disappear.
The cause turned out to be a faulty element in the RAID controller's logic. One of the mdisks — that is, one RAID group — did not work correctly after the rebuild. The array considered it healthy, but it wasn't: whenever blocks landed on that mdisk, it wrote them incorrectly. File servers, for example, don't write all that much, so they showed no inconsistency. A database, by contrast, changes constantly, with blocks written all over the disk pool. That is why the inconsistency surfaced on the database servers.
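The database-like access pattern that exposed the bad mdisk — many small writes scattered across the pool — can be imitated with a simple write-and-verify probe. This is a hypothetical sketch of the technique, not the diagnostic we actually ran: write random blocks at random offsets, flush to disk, then read everything back and compare.

```python
import hashlib
import os
import random

def verify_random_writes(path, block_size=4096, blocks=64, span=1024):
    """Write random blocks at random offsets within a file, then read
    them back and verify. Returns the list of offsets whose contents
    came back wrong (empty on healthy storage)."""
    random.seed(0)  # reproducible offsets for repeated runs
    written = {}
    with open(path, "wb") as f:
        f.truncate(span * block_size)  # pre-size the file with zeros
    with open(path, "r+b") as f:
        for _ in range(blocks):
            off = random.randrange(span) * block_size
            data = os.urandom(block_size)
            f.seek(off)
            f.write(data)
            # Repeated offsets overwrite both the file and the record,
            # so the last write always wins in both places.
            written[off] = hashlib.sha256(data).hexdigest()
        f.flush()
        os.fsync(f.fileno())  # force the blocks through to the array
    bad = []
    with open(path, "rb") as f:
        for off, digest in written.items():
            f.seek(off)
            if hashlib.sha256(f.read(block_size)).hexdigest() != digest:
                bad.append(off)
    return bad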
About four hours passed from the moment the disk pool failed to the moment the system was back up, and all that time one of the customer's critical business systems was down. The financial losses were not that large; the main damage was to reputation.
In our practice this was the only failure of its kind. Though with RAID 5 the underlying pattern is not uncommon: when one drive dies, the load on the remaining drives increases during the rebuild, and that finishes off another drive. So here's a tip: update firmware in a timely manner and don't let versions drift out of sync between sites.
Here is another remarkable story — the first of its kind for us. The cast: the same arrays and platforms, again a primary and a backup data center. In the backup data center one of the controllers in the array malfunctioned: it hung and would not come back up. We power-cycled it; the controller came up and the array worked. But at that very moment, I/O on the physical Linux servers at the primary site dropped out.
The surprising part: the controller in the backup data center and the I/O in the primary data center are completely isolated from each other. The only thing connecting them is the perimeter FC routers, which simply wrap FC traffic in IP and carry it between sites to replicate data between the storage arrays. In other words, the controller and the I/O share nothing at all: different SANs, different physical sites, different physical servers.
It turned out that when the controller came back up at the backup site, it considered itself a SCSI initiator instead of a SCSI target. Replication runs from the primary data center to the backup one, so in theory the controller should be a SCSI target, with all frames coming to it. But it decided it was better suited to an active role in life and started trying to send data of its own.
At this point the multipath driver on the Linux servers running Red Hat 7 misbehaved. It took those commands at face value: despite being an initiator itself, it saw another initiator appear and decided, just in case, to disable all paths to the disks. And since these were boot disks, they simply dropped off — for a full four minutes. Then the paths came back, but the customer suffered a short-term dip in business transactions: for those four minutes, the retailer could not sell its goods anywhere in the country. With two thousand retail outlets, every minute of downtime translates into a decent sum of money, not to mention the reputational damage.
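One general mitigation for brief all-paths-down events — our assumption about the mechanism, not the fix applied in this incident — is to tell device-mapper-multipath to queue I/O while paths are gone instead of failing it immediately, so a short flap never surfaces as I/O errors to the OS. A sketch of the relevant `multipath.conf` fragment:

```
defaults {
    # Queue I/O indefinitely while all paths are down instead of
    # failing it; a brief path flap then stalls I/O rather than
    # dropping the disks. Risky for boot disks if paths never return.
    no_path_retry    queue
    # How often (seconds) to re-check failed paths.
    polling_interval 5
}
```

This is a trade-off: queuing masks short outages but can hang a host whose paths genuinely never come back, so it needs to be weighed per environment.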
Perhaps the incident was caused by overlapping bugs. Or maybe the two technologies simply don't get along: this particular storage array and the multipath driver in Red Hat, which behaved strangely either way. After all, even if another SCSI initiator appears, the driver should just note that there is now another initiator and keep working — not shoot down the disks. That should never happen.
That's all for today. Fewer bugs to you!
Alexander Marchuk, Chief Engineer, Jet Infosystems Service Center