Architecture and High Availability Overview of SQL Database (SQL Azure)

Windows Azure offers both NoSQL repositories and relational SQL repositories. NoSQL storage includes, for example, Windows Azure Tables (key/value) and blobs (binary data such as photos, videos, documents, etc.). Relational storage is provided by SQL Database (formerly SQL Azure).



This post is based on Inside Windows Azure SQL Database, which I decided to split into two parts. In the first part, I will cover the architecture of SQL Database and how high availability is provided (failure detection, reconfiguration). In the second part, I will talk about scalability (performance management, load balancing), as well as recommendations for developers.

SQL Database Architecture Overview


SQL Database is a shared cloud database; you could say it is Database as a Service (DaaS). Microsoft data centers host large numbers of SQL Server instances running on standard, commodity hardware. Each SQL Server instance in the data center contains several different client (logical) databases, i.e., it operates in shared mode. Automatic load balancing and routing of network connections are used to access the data.

It is worth noting that physically the data is not stored in a single database but is replicated. Data is replicated across three SQL Server databases distributed over three physical servers within the same data center: one primary replica and two secondary replicas. All read and write operations are performed on the primary replica, and any changes are asynchronously replicated to the secondary replicas. These replicas provide high availability for SQL Database. Most Microsoft data centers contain hundreds of computers running hundreds of SQL Server instances that host SQL Database replicas, so it is extremely unlikely that the primary and secondary replicas of a given SQL Database will end up on the same computer.

A logical server is the SQL Database server that you see in the Windows Azure Management Portal. The SQL Database Gateway acts as a proxy server, redirecting tabular data stream (TDS) requests to the logical server. The gateway is a security boundary that provides credential verification, firewall enforcement, and protection of the SQL Server instances behind it against denial-of-service attacks. The gateway consists of many computers, each of which accepts network connection requests from clients, validates the connection information, and forwards the TDS request to the appropriate physical server based on the database name specified in the connection.


The physical distribution of the databases that belong to the same logical SQL Server instance means that each network connection is tied to a single database rather than to a SQL Server instance. If USE were allowed on a connection, the tabular data stream would have to be redirected to a completely different physical computer in the data center. For this reason, the USE command is not supported for SQL Database connections.
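In practice this means that, instead of switching databases with USE, you open a separate connection whose connection string names the target database. A minimal sketch using pyodbc; the server name, database names, and credentials are placeholders, not values from this article:

```python
import pyodbc

def connect(database):
    # The database is fixed for the lifetime of the connection,
    # because USE is not supported by SQL Database.
    return pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=tcp:myserver.database.windows.net,1433;"  # placeholder logical server
        f"DATABASE={database};"
        "UID=myuser;PWD=mypassword;"                      # placeholder credentials
        "Encrypt=yes;"
    )

# Instead of `USE SalesDb` on an existing connection,
# open a new connection that names SalesDb directly.
orders_conn = connect("OrdersDb")
sales_conn = connect("SalesDb")
```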



Network topology

The application uses the client tier to communicate directly with SQL Database. The client tier can be located in your own on-premises data center or hosted in Windows Azure. Any protocol that produces a tabular data stream (TDS) for transmission over the network is supported, so you can use familiar tools and libraries to build client applications that work with data in the cloud.
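For example, any standard SQL Server client stack that speaks TDS will work. A small sketch using pyodbc; the server name, database, and credentials are again placeholders:

```python
import pyodbc

# Standard TDS connection to a SQL Database logical server (placeholder values).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=tcp:myserver.database.windows.net,1433;"
    "DATABASE=mydb;UID=myuser;PWD=mypassword;"
    "Encrypt=yes;"  # the gateway expects encrypted connections
)

cursor = conn.cursor()
cursor.execute("SELECT name FROM sys.objects WHERE type = 'U'")  # list user tables
for row in cursor.fetchall():
    print(row.name)
```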



Service layer

The service layer consists of computers that host the gateway services, which provide routing, provisioning, metering, and billing. These services are supported by four groups of computers.



The front-end cluster contains the physical gateway computers. Application-layer computers authorize server and database requests and handle billing. Service platform computers monitor and manage the health of the SQL Server instances in the data center. The main cluster computers keep track of which replicas of which databases physically exist on each SQL Server instance in the data center.

The numbered steps in the figure show how a client connection is validated and established:
  1. A gateway in the front-end cluster receives the request for a new incoming tabular data stream (TDS) connection and establishes a connection with the client. A parser with a minimal set of supported functions checks whether the received command is valid for forwarding to the database. Commands such as CREATE DATABASE are not forwarded, because they must be processed by the application layer.
  2. The gateway performs the SSL handshake with the client. If the client refuses to use SSL, the gateway drops the connection; all traffic must be encrypted. The protocol parser also includes denial-of-service protection that tracks requests by IP address. If an excessive number of requests arrives from an IP address or address range, subsequent connection attempts from those addresses are rejected.
  3. The server name and login credentials provided by the user are verified. A firewall check ensures that only connections from the IP ranges configured for the server are allowed (see the sketch after this list).
  4. After the server is verified, the main cluster maps the database name used by the client to the internal database name. The main cluster is a set of computers that maintain this mapping information. In SQL Database, the concept of a partition has a completely different meaning than in on-premises SQL Server: a partition is the part of a SQL Server database in the data center that maps to a single SQL Database. For example, in the figure each database contains three partitions, since each of them hosts three SQL Database databases.
  5. After the database is located, the user name is authenticated; if authentication fails, the connection is terminated. The gateway also verifies that the database the user wants to connect to actually exists.
  6. Once all connection parameters have been verified, a new connection can be opened.
  7. The new connection is established between the user's computer and the internal server node.
  8. After the connection is established, the gateway acts as a proxy, forwarding packets between the client and the data platform.
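As a concrete illustration of the firewall check in step 3: server-level firewall rules for a SQL Database server are managed from the master database, for example with the sp_set_firewall_rule procedure. A hedged sketch using pyodbc; the server name, credentials, rule name, and IP range are placeholders:

```python
import pyodbc

# Connect to the logical server's master database (placeholder values).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=tcp:myserver.database.windows.net,1433;"
    "DATABASE=master;UID=myadmin;PWD=mypassword;Encrypt=yes;",
    autocommit=True,
)

# Allow connections from a single address range; only addresses inside
# configured ranges pass the gateway's firewall check (step 3 above).
conn.execute(
    "EXEC sp_set_firewall_rule @name = N'OfficeRange', "
    "@start_ip_address = '203.0.113.1', @end_ip_address = '203.0.113.254'"
)
```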

Platform layer

The platform layer consists of the computers that host the physical SQL Server databases in the data center. These computers are called data nodes. The figure shows the internal organization of a data node. Each data node runs a single SQL Server instance, and each instance has a single user database divided into partitions. Each partition contains one SQL Database client database, in the form of either a primary or a secondary replica.



A typical SQL Server database in the data center can contain up to 650 partitions. These databases are managed in the same way as on-premises SQL Server databases; the only difference is that routine maintenance and backups are performed by data center staff. All databases hosted on a data node share the same log file, which improves logging performance through sequential batch I/O. Unlike on-premises databases, SQL Database logs are stored in pre-allocated, zeroed disk space, which avoids write pauses caused by automatic log file growth.

Another distinguishing feature of log management in the SQL Database data center is the requirement of a quorum commit. This means that the primary replica and at least one of the secondary replicas must confirm that the log records were written before the transaction is considered committed.

The figure shows that each computer in a data node also runs a set of processes collectively referred to as the fabric. Fabric processes handle the following tasks:
  1. Failure detection: monitors the availability of primary and secondary replicas; if a replica becomes unavailable, the reconfiguration agent may be started.
  2. Reconfiguration agent: manages the re-creation of primary or secondary replicas after a node failure.
  3. Partition manager location: routes messages to the partition manager.
  4. Engine throttling: prevents a logical server from monopolizing the node's resources or exceeding its physical limits.
  5. Ring topology: organizes the cluster computers in a logical ring, so that each computer has two neighbors that can detect when it goes down unexpectedly.

High Availability in SQL Database


The Microsoft SQL Database platform provides 99.9% availability for subscriber databases. This is achieved through the use of commodity hardware, which can be replaced simply and quickly when a computer or drive fails, and by maintaining replicas of each SQL Database (one primary and two secondary replicas).
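To put 99.9% into perspective, here is the back-of-the-envelope downtime budget that availability level implies (illustrative arithmetic only, not a statement of the SLA terms):

```python
# Downtime allowed by a 99.9% availability target.
availability = 0.999
minutes_per_month = 30 * 24 * 60   # 43,200 minutes in a 30-day month
minutes_per_year = 365 * 24 * 60   # 525,600 minutes in a year

print(f"per month: {(1 - availability) * minutes_per_month:.1f} minutes")   # ~43.2 minutes
print(f"per year:  {(1 - availability) * minutes_per_year / 60:.1f} hours")  # ~8.8 hours
```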

Failure detection

The system needs to detect not only outright computer failures, but also gradual declines in performance and disruptions in communication. The quorum commit concept described above helps solve both problems. First, a transaction is not considered committed unless the primary replica and at least one secondary replica confirm that the transaction was written to the log. Second, even when the primary and a secondary replica confirm a successful write, minor malfunctions can be detected that do not prevent the transaction from committing but could lead to more serious problems.

Reconfiguration

The procedure for replacing failed replicas is called reconfiguration. Reconfiguration may be required after hardware failures or an abnormal shutdown of the operating system, as well as when a SQL Server instance fails. It is also used when upgrading the operating system, SQL Server, or the SQL Database platform.

The health of each node is monitored by six peer nodes located in different racks, called its neighbors. A failure is recorded by one of the failed node's neighbors, and a reconfiguration procedure is performed for each database that had a replica on the failed node. Each computer holds replicas of hundreds of SQL Database databases, some primary and some secondary, so a node failure triggers hundreds of reconfiguration operations. These operations are not prioritized: the partition manager picks a replica at random, processes it, then picks the next one, and so on until every replica from the failed node has been handled.

If a node goes down because of a reboot, this is considered a clean failure, because the node's neighbors receive notification of the shutdown.

Another possibility is that communication with a computer stops for an unknown reason, in which case an undetermined failure is recorded. In this situation an arbitration procedure is applied to reliably establish that the node has in fact failed.

In addition to handling the failure of an individual replica, the system detects and repairs the consequences of whole-node failures. A node consists of an entire SQL Server instance with multiple partitions containing replicas of up to 650 different databases, some primary and some secondary. When a node fails, the procedure described above is performed for each affected database: for databases whose primary replica failed, the arbitration process selects a new primary from the existing secondary replicas, and for databases whose secondary replica failed, a new secondary replica is created.

A majority of a SQL Database's replicas must confirm each commit. User databases currently maintain three replicas, so a quorum commit requires confirmation from two of them. The metadata store, which is part of the data center's gateway components, maintains five replicas and needs three confirmations for a quorum. The main cluster, which maintains seven replicas, needs confirmation from four of them to commit a transaction. The main cluster's information can be restored even if all seven replicas fail; there are mechanisms for automated recovery of the main cluster after such large-scale failures.
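The quorum sizes above follow the usual majority rule; a tiny sketch of that arithmetic:

```python
# Majority quorum: more than half of the replicas must acknowledge the commit.
def quorum(replicas: int) -> int:
    return replicas // 2 + 1

for n in (3, 5, 7):  # user database, metadata store, main cluster
    print(f"{n} replicas -> {quorum(n)} acknowledgements")
# 3 -> 2, 5 -> 3, 7 -> 4, matching the figures above
```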

Primary Replica Failure
All read and write operations are performed on the primary replica first, so a failure of the primary replica is detected immediately and blocks further work. During the reconfiguration that follows a primary replica failure, the partition manager selects one of the secondary replicas and promotes it to primary; typically, the secondary replica on the node with the lightest workload is chosen. Promoting a secondary replica to primary does not cause database downtime and is not noticeable to most users. The gateway sends a disconnect message to the client application, which should then immediately attempt to reconnect. Propagating information about the new primary replica to all gateway servers can take up to 30 seconds, so it is recommended to retry the connection several times, pausing briefly after each failed attempt.
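A minimal sketch of such retry logic, assuming pyodbc and placeholder connection details; the number of attempts and the pause length are illustrative choices, not prescribed values:

```python
import time
import pyodbc

CONNECTION_STRING = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=tcp:myserver.database.windows.net,1433;"  # placeholder values
    "DATABASE=mydb;UID=myuser;PWD=mypassword;Encrypt=yes;"
)

def connect_with_retry(attempts=5, delay=5.0):
    # Retry a few times with short pauses while the gateways learn
    # about the newly promoted primary replica (can take up to ~30 s).
    last_error = None
    for _ in range(attempts):
        try:
            return pyodbc.connect(CONNECTION_STRING)
        except pyodbc.Error as err:
            last_error = err
            time.sleep(delay)  # brief pause before the next attempt
    raise last_error

conn = connect_with_retry()
```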

Secondary Replica Failure
If a secondary replica fails, the database has only two replicas left for the quorum commit. The situation is similar to what happens after a primary replica failure, when one of the secondary replicas is promoted to primary: in both cases only one secondary replica remains. After a short wait, the partition manager tries to determine whether the failure is permanent in order to decide whether to create a new secondary replica.

In some cases, for example during an operating system failure or upgrade, the failure of a secondary replica may be only apparent: a secondary replica on a failed node may be down only temporarily. For this reason a new replica is not created immediately. If the secondary replica returns to an operational state, data verification commands (checkdisk, etc.) are run to confirm that the replica is healthy.

If the replica remains unavailable for more than two hours, the partition manager creates a new replica to take its place. In some cases this fixed delay is not optimal, for example when a computer goes down because of an unrecoverable hardware failure. Newer releases of the SQL Database platform may add the ability to distinguish between different types of replica failures and to recover from unrecoverable failures more quickly.

When an unrecoverable node failure occurs, a cluster computer with sufficient spare disk space and CPU capacity is selected to host the new secondary replica. The database is copied from the primary replica, and the copy is then joined to the existing configuration. The time required to copy the entire contents of the database is one of the factors limiting the maximum size of the databases that SQL Database can manage.

All computers in the data center are commodity systems with mid-range performance and component quality: at the time of writing, 32 GB of RAM, an eight-core processor, and 12 disks, at a cost of about $3,500 per machine. Such an inexpensive, readily available configuration makes it easy to replace computers quickly after fatal failures. Windows Azure uses the same commodity hardware, which makes all computers in the data center interchangeable regardless of whether they support SQL Database or Windows Azure.

Overall, distributing database replicas across different servers and using efficient algorithms for promoting secondary replicas to primary guarantee availability even if 15% of all data center computers fail simultaneously; with up to 15% of the computers down, the supported workload level does not decrease.

This is not the end of the story about SQL Database; there will be a continuation (the continuation is now available).

P.S. If you liked the title picture, here is a link to a large poster.