An open-source solution for a tenfold reduction in Apache Cassandra read latency

Original author: Instagram Engineering


Instagram runs one of the largest Apache Cassandra deployments in the world. The project began using Cassandra in 2012 to replace Redis and to support application features such as the fraud detection system, Feed, and Direct. At first the Cassandra clusters ran in AWS, but engineers later migrated them to the Facebook infrastructure along with all other Instagram systems. Cassandra performed very well in terms of reliability and fault tolerance; at the same time, its read latency clearly left room for improvement.

Last year, Instagram's Cassandra support team started a project aimed at significantly reducing Cassandra read latency, which the engineers called Rocksandra. In this article, the author explains what prompted the team to take on this project, the difficulties that had to be overcome, and the performance results measured both in the internal infrastructure and in a public cloud environment.

Motivation for the change


Instagram uses Apache Cassandra heavily as a general key-value storage service. Most Instagram requests are served online, so to provide a reliable and pleasant experience for hundreds of millions of Instagram users, the SLAs on system performance are very demanding.

Instagram targets five-nines reliability, meaning the failure rate at any given time cannot exceed 0.001%. To improve performance, engineers actively monitor the throughput and latency of the various Cassandra clusters and make sure that 99% of all requests complete within a defined threshold (the P99 latency).

Below is a graph showing the client-side read latency for one of the production Cassandra clusters. Blue shows the average read latency (5 ms), and orange shows the P99 read latency, which ranged from 25 to 60 ms and varied heavily with client traffic.





Investigation showed that the sharp latency spikes were largely caused by the JVM garbage collector. Engineers introduced a metric called “GC stall percentage” to measure the share of time a Cassandra server spent in stop-the-world garbage collection pauses, during which it could not serve client requests. Another chart showed this GC stall percentage for one of the production Cassandra servers: it ranged from 1.25% during the quietest periods to 2.5% at peak load.

In other words, this Cassandra server instance could spend 2.5% of its time collecting garbage instead of serving client requests. These stop-the-world pauses clearly had a significant impact on the P99 latency, so it became obvious that if the GC stall percentage could be reduced, the P99 latency could be reduced significantly as well.
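As a rough illustration of what such a metric captures, the sketch below computes the share of wall-clock time spent in garbage collection over a one-minute window using standard JMX beans. This only approximates the idea: getCollectionTime() covers all GC work rather than strictly stop-the-world pauses, and it is not Instagram's actual implementation.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Rough sketch of a "GC stall percentage"-style metric built on standard JMX beans.
// Approximation only: getCollectionTime() counts total GC time, not strictly
// stop-the-world pauses, and this is not Instagram's internal implementation.
public class GcStallMonitor {
    public static void main(String[] args) throws InterruptedException {
        long intervalMs = 60_000;
        long gcBefore = totalGcTimeMs();
        Thread.sleep(intervalMs);
        long gcAfter = totalGcTimeMs();

        double stallPercent = 100.0 * (gcAfter - gcBefore) / intervalMs;
        System.out.printf("Time spent in GC over the last minute: %.2f%%%n", stallPercent);
    }

    private static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime()); // -1 means "undefined"
        }
        return total;
    }
}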

The solution


Apache Cassandra is a distributed database written in Java, with its own storage engine based on LSM trees. Engineers found that engine components such as the memtable, compaction, the read/write paths, and others created many objects on the Java heap, which forced the JVM to do a lot of extra garbage collection work. To reduce the impact of the storage engine on the garbage collector, the support team considered various approaches and ultimately decided to develop a C++ engine and replace the existing one with it.

The engineers did not want to build everything from scratch, so they decided to use RocksDB as the foundation.

RocksDB is a high-performance, open-source, embeddable key-value store. It is written in C++ and provides official API bindings for C++, C, and Java. RocksDB is optimized for high performance, especially on fast storage such as SSDs. It is widely used in industry as the storage engine for MySQL, MongoDB, and other popular databases.
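To give a feel for the kind of key-value API RocksDB exposes, here is a minimal sketch using the official Java bindings (RocksJava); the database path and keys are arbitrary and unrelated to Rocksandra itself.

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Minimal RocksJava usage: open a database, write a key, read it back.
public class RocksDbHello {
    static {
        RocksDB.loadLibrary(); // loads the native C++ library
    }

    public static void main(String[] args) throws RocksDBException {
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/rocksdb-demo")) {
            // RocksDB only deals with byte[] keys and values
            db.put("user:1".getBytes(), "alice".getBytes());
            byte[] value = db.get("user:1".getBytes());
            System.out.println(new String(value)); // prints "alice"
        }
    }
}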

Difficulties


While implementing the new storage engine on top of RocksDB, the engineers ran into three difficult problems and solved them.

The first difficulty was that Cassandra did not yet have a pluggable storage engine architecture, so the existing engine was tightly coupled to other database components. To strike a balance between large-scale refactoring and fast iteration, the engineers defined a new storage engine API, including the most common read, write, and streaming interfaces. That way, the team could implement the new storage logic behind the API and inject it into the relevant code paths inside Cassandra, as sketched below.
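The article does not publish the actual API, but conceptually the abstraction might look something like the following sketch; the interface and method names are hypothetical and only illustrate the read/write/streaming split.

import java.nio.ByteBuffer;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch of a pluggable storage engine contract; names and
// signatures are illustrative and are not Rocksandra's real interfaces.
public interface StorageEngine {

    // Write path: persist an encoded row under a partition key.
    void apply(ByteBuffer partitionKey, ByteBuffer encodedRow);

    // Read path: iterate over the encoded rows of a single partition.
    Iterator<Map.Entry<ByteBuffer, ByteBuffer>> read(ByteBuffer partitionKey);

    // Streaming path: export and import a token range when nodes join or leave.
    Iterable<ByteBuffer> streamOut(long startToken, long endToken);
    void streamIn(Iterable<ByteBuffer> encodedRows);
}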

The second difficulty was that Cassandra supports structured data types and table schemas, while RocksDB exposes only a key-value interface. The engineers carefully defined encoding and decoding algorithms to map the Cassandra data model onto RocksDB data structures, and made sure that query semantics remained consistent between the two databases.
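The post does not spell out the exact encoding, but the general idea is to serialize the partition key, clustering values, and column name into a single RocksDB key, with the column value stored as the RocksDB value. A simplified, hypothetical sketch of such an encoder:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified key encoding: partitionKey | clustering | columnName,
// each component length-prefixed so it can be parsed back unambiguously.
// A real encoding must also preserve clustering order under RocksDB's comparator
// and carry type information, timestamps, tombstones, and so on.
public final class KeyCodec {

    public static byte[] encode(String partitionKey, String clustering, String column) {
        byte[] pk = partitionKey.getBytes(StandardCharsets.UTF_8);
        byte[] ck = clustering.getBytes(StandardCharsets.UTF_8);
        byte[] col = column.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(3 * Short.BYTES + pk.length + ck.length + col.length);
        buf.putShort((short) pk.length).put(pk);
        buf.putShort((short) ck.length).put(ck);
        buf.putShort((short) col.length).put(col);
        return buf.array();
    }
}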

The third difficulty concerned streaming, a component critical to any distributed database. Whenever a node is added to or removed from a Cassandra cluster, data must be redistributed correctly across nodes to balance load within the cluster. The existing implementations of these mechanisms relied on internal details of the existing storage engine, so the engineers had to decouple them, create an abstraction layer, and implement a new streaming path on top of the RocksDB API. To achieve high streaming throughput, the team now writes the incoming data into temporary SST files first and then uses a dedicated RocksDB API to ingest those files, loading them into the RocksDB instance in bulk.
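That ingestion step maps onto RocksDB's external SST file API. A minimal sketch of the pattern with RocksJava follows; the paths and keys are made up, and Rocksandra's streaming code is naturally far more involved.

import java.util.Arrays;
import org.rocksdb.EnvOptions;
import org.rocksdb.IngestExternalFileOptions;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.SstFileWriter;

// Sketch of "build an SST file, then ingest it" bulk loading with RocksJava.
public class SstIngestDemo {
    static {
        RocksDB.loadLibrary();
    }

    public static void main(String[] args) throws RocksDBException {
        String sstPath = "/tmp/streamed-data.sst";

        // Step 1: write the streamed rows into a temporary SST file.
        try (Options options = new Options();
             EnvOptions envOptions = new EnvOptions();
             SstFileWriter writer = new SstFileWriter(envOptions, options)) {
            writer.open(sstPath);
            // Keys must be added in sorted order when building an SST file.
            writer.put("key-001".getBytes(), "value-1".getBytes());
            writer.put("key-002".getBytes(), "value-2".getBytes());
            writer.finish();
        }

        // Step 2: ingest the file into the live instance, bypassing the memtable.
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/rocksdb-demo");
             IngestExternalFileOptions ingestOptions = new IngestExternalFileOptions()) {
            db.ingestExternalFile(Arrays.asList(sstPath), ingestOptions);
        }
    }
}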

Performance results


After almost a year of development and testing, the engineers completed the first version of the implementation and successfully rolled it out to several production Instagram Cassandra clusters. On one of the production clusters, the P99 latency dropped from 60 ms to 20 ms. Observations also showed that GC stalls in that cluster fell from 2.5% to 0.3%, an almost tenfold reduction.

The engineers also wanted to check whether Rocksandra could perform well in a public cloud environment. The team set up a Cassandra cluster in AWS using three i3.8xlarge EC2 instances, each with 32 CPU cores, 244 GB of RAM, and a RAID 0 array of four NVMe flash drives.

For the comparative tests, the engineers used NDBench with its default table schema:

CREATE TABLE emp (
  emp_uname text PRIMARY KEY,
  emp_dept text,
  emp_first text,
  emp_last text
);

The engineers pre-loaded 250 million rows of 6 KB each (about 500 GB of data stored on each server), and then configured 128 readers and writers in NDBench.

The support team tested various workloads and measured the average/P99/P999 read and write latencies. The graphs below show that Rocksandra delivered significantly lower and more stable read and write latencies.





The engineers also tested a read-only workload and found that, at the same P99 read latency (2 ms), Rocksandra delivers a more than tenfold increase in read throughput (300 K reads per second for Rocksandra versus 30 K reads per second for C* 3.0).





Plans for the future


The Instagram support team has open-sourced the Rocksandra code and the performance benchmarking framework. You can download them from GitHub and try them in your own environment. Be sure to tell us how it goes!

As a next step, the team is actively working on broader support for C* functionality, such as secondary indexes, repairs, and more. In addition, the engineers are developing a pluggable storage engine architecture in C* in order to contribute these developments back to the Apache Cassandra community.
