Highly loaded WebSocket service development

How do you build a web service that interacts with users in real time while supporting several hundred thousand simultaneous connections?

Hello everyone, my name is Andrey Klyuyev, I'm a developer. Recently I faced exactly this task: building an interactive service where users receive instant bonuses for their actions. The job was complicated by the fact that the project had rather strict load requirements and an extremely tight deadline.

In this article I will explain how I chose a WebSocket server implementation that fits the project's demanding requirements, what problems I ran into during development, and I will also say a few words about how Linux kernel tuning can help reach these goals.

The article concludes with useful links to development, testing, and monitoring tools.

Tasks and requirements


Requirements for the project functionality:


  • track the user's presence on the site and their viewing time;
  • provide fast messaging between the client and the server, since the time for the user to claim a bonus is strictly limited;
  • create a dynamic, interactive interface that keeps all actions in sync when the user works with the service through several tabs or devices at the same time.

Load requirements:


  • The application must withstand at least 150 thousand users online.

The implementation period is 1 month.

Technology selection


Having weighed the tasks and requirements of the project, I came to the conclusion that WebSocket was the most suitable technology. It provides a persistent connection to the server, eliminating the overhead of establishing a new connection for every message that is inherent in AJAX and long-polling implementations. This gives the required high messaging speed combined with reasonable resource consumption, which is very important under high load.

Also, because connection and disconnection are two distinct events, it becomes possible to accurately track how long a user stays on the site.

Given the rather limited project timeline, I decided to build on top of a WebSocket framework. I studied several options; the most interesting seemed to be PHP ReactPHP, PHP Ratchet, Node.js websockets/ws, PHP Swoole, PHP Workerman, Go Gorilla, and Elixir Phoenix. I tested their capabilities under load on a laptop with an Intel Core i5 processor and 4 GB of RAM (these resources were quite enough for the research).

PHP Workerman is an asynchronous, event-oriented framework. Its capabilities are limited to the simplest websocket server implementation plus the ability to work with the libevent library, which is needed to process asynchronous event notifications. The code is written for PHP 5.3 and does not follow any coding standards. For me, the main drawback was that the framework does not allow building highly loaded projects: on the test bench, a Hello World-level application could not hold even a few thousand connections.

ReactPHP and Ratchet are broadly comparable to Workerman in their capabilities. Ratchet depends on ReactPHP internally; both work through libevent, and neither allows building a solution for high loads.

Swoole is an interesting framework written in C that plugs in as a PHP extension and provides tools for parallel programming. Unfortunately, I found it not stable enough: on the test bench it dropped every second connection.

Next I looked at Node.js ws. This framework showed good results: about 5 thousand connections on the test bench without any additional tuning. However, my project implied noticeably higher loads, so I settled on the Go Gorilla + Echo framework and Elixir Phoenix. I tested these two options in more detail.

Stress Testing


For testing I used artillery, Gatling, and the flood.io service.

The purpose of the testing was to study CPU and memory consumption. The machine was the same: an Intel Core i5 processor and 4 GB of RAM. The tests were run against the simplest chat examples for Go and Phoenix.
Below are the artillery and Gatling scenarios used for the tests:

config:
  target: "ws://127.0.0.1:8080/ws"
  phases:
    -
      duration: 6
      arrivalCount: 10000
  ws:
    rejectUnauthorized: false
scenarios:
  -
    engine: "ws"
    flow:
      -
        send: "hello"
      -
        think: 2
      -
        send: "world"

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class LoadSimulation extends Simulation {

  val users    = Integer.getInteger("threads", 30000)
  val rampup   = java.lang.Long.getLong("rampup", 30L)
  val duration = java.lang.Long.getLong("duration", 1200L)

  val httpConf = http
    .wsBaseURL("ws://8.8.8.8/socket")

  val scn = scenario("WebSocket")
    .exec(ws("Connect WS").open("/websocket?vsn=2.0.0"))
    .exec(
      ws("Auth")
        .sendText("""["1", "1", "my:channel", "php_join", {}]""")
    )
    .forever() {
      exec(
        ws("Heartbeat").sendText("""[null, "2", "phoenix", "heartbeat", {}]""")
      )
      .pause(30)
    }
    .exec(ws("Close WS").close)

  setUp(scn.inject(rampUsers(users) over (rampup seconds)))
    .maxDuration(duration)
    .protocols(httpConf)
}

Test launches showed that everything ran smoothly on a machine of this capacity with a load of 25-30 thousand users.

CPU consumption charts: Phoenix vs. Gorilla.

RAM consumption with a load of 20 thousand connections reached 2 GB for both frameworks (charts: Phoenix vs. Gorilla).

At the same time, Go outperforms Elixir in raw performance, but the Phoenix Framework provides far more features. The network traffic charts below show that the Phoenix test transmits about 1.5 times more messages. That is because Phoenix ships with a heartbeat mechanism (periodic synchronizing signals) out of the box, whereas in Gorilla it would have to be implemented by hand. Given the tight deadline, any extra work avoided was a strong argument in favor of Phoenix.

Network traffic charts: Phoenix vs. Gorilla.

About Phoenix Framework


Phoenix is a classic MVC framework quite similar to Rails, which is not surprising: José Valim, a longtime Ruby on Rails core team member and the creator of the Elixir language, is one of its developers. Some similarity can be seen even in the syntax.

Phoenix:

defmodule Benchmarker.Router do
  use Phoenix.Router
  alias Benchmarker.Controllers
  
  get "/:title", Controllers.Pages, :index, as: :page
end

Rails:

Benchmarker::Application.routes.draw do
  root to: "pages#index"
  get "/:title", to: "pages#index", as: :page
end

Mix - an automation utility for Elixir projects


When working with Phoenix and Elixir, a significant part of the workflow goes through the Mix utility. It is a build tool that handles many tasks: creating, compiling and testing an application, managing its dependencies, and more.
Mix is a key part of any Elixir project. The utility is neither worse nor better than its analogues from other languages, but it does its job perfectly. And because Elixir code runs on the Erlang virtual machine, any library from the Erlang world can be added as a dependency. On top of that, the Erlang VM gives you convenient and safe concurrency as well as high fault tolerance.
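As an illustration (the package names and versions below are just an example, not taken from this project), a mix.exs dependency list can freely mix Hex packages written in Elixir and in Erlang:

defp deps do
  [
    {:phoenix, "~> 1.4"},   # web framework written in Elixir
    {:cowboy, "~> 2.6"},    # HTTP server written in Erlang
    {:jsx, "~> 2.9"}        # JSON library written in Erlang
  ]
end

After that, mix deps.get fetches the dependencies, while mix compile, mix test and mix phx.server cover the everyday workflow.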

Problems and Solutions


For all its advantages, Phoenix has its drawbacks. One of them is the difficulty of tracking active users on a site under high load.
The thing is that users can connect to different application nodes, and each node only knows about its own clients. To display the list of active users, you would have to poll all the application nodes.
To solve this, Phoenix has the Presence module, which lets the developer track active users in just three lines of code. It relies on the heartbeat mechanism and conflict-free replication within the cluster, as well as a PubSub server for exchanging messages between nodes.
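To give a sense of how little code this takes, here is a minimal sketch of Presence usage inside a channel (the module names and the user_id assign are illustrative, not the project's actual code):

defmodule MyApp.RoomChannel do
  use Phoenix.Channel
  alias MyApp.Presence

  def join("room:lobby", _params, socket) do
    # defer Presence registration until the join has completed
    send(self(), :after_join)
    {:ok, socket}
  end

  def handle_info(:after_join, socket) do
    # start tracking this user; Presence replicates the entry across the cluster
    {:ok, _} = Presence.track(socket, socket.assigns.user_id, %{
      online_at: System.system_time(:second)
    })

    # push the full presence state to the newly joined client
    push(socket, "presence_state", Presence.list(socket))
    {:noreply, socket}
  end
end

Here MyApp.Presence is assumed to be defined with use Phoenix.Presence, otp_app: :my_app, pubsub_server: MyApp.PubSub.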


It sounds good, but in practice it turns out roughly as follows. Hundreds of thousands of connecting and disconnecting users generate millions of synchronization messages between nodes, which pushes CPU consumption beyond all acceptable limits, and even plugging in Redis PubSub does not save the situation. The list of users is duplicated on every node, and computing the diff for each new connection becomes more and more expensive, given that the computation is performed on every existing node.


In such a situation, even the 100 thousand users mark becomes unattainable. I could not find other ready-made solutions for this task, so I decided to do the following: delegate the responsibility for tracking users' online presence to the database.

At first glance this is a fine idea with nothing complicated about it: just store a last-activity field in the database and update it periodically. Unfortunately, for high-load projects this is not an option: once the number of users reaches several hundred thousand, the system will not cope with the millions of heartbeats coming from them.

I chose a less trivial but more performant solution. When a user connects, a unique row is created for them in the table, storing their identifier, the exact join time and the list of nodes they are connected to. The list of nodes is stored in a JSONB field, and in case of a row conflict it only has to be updated.

create table watching_times (
  id serial not null constraint watching_times_pkey primary key,
  user_id integer,
  join_at timestamp,
  terminate_at timestamp,
  nodes jsonb
);

create unique index watching_times_not_null_uni_idx
  on watching_times (user_id, terminate_at)
  where (terminate_at IS NOT NULL);
 
create unique index watching_times_null_uni_idx
  on watching_times (user_id)
  where (terminate_at IS NULL);

This query handles user login:

INSERT INTO watching_times (
  user_id,
  join_at,
  terminate_at,
  nodes
)
VALUES (1, NOW(), NULL, '{"nl@192.168.1.101": 1}')
ON CONFLICT (user_id)
  WHERE terminate_at IS NULL
  DO UPDATE SET nodes = watching_times.nodes ||
      CONCAT(
        '{"nl@192.168.1.101": ',
        COALESCE(watching_times.nodes->>'nl@192.168.1.101', '0')::int + 1,
        '}'
      )::JSONB
RETURNING id;

The list of nodes works as follows. If the user opens the service in a second window or on another device, they may land on a different node, which is then also added to the list. If they land on the same node as in the first window, the counter next to that node's name is incremented instead. The counter reflects the number of the user's active connections to a particular node, so after opening three tabs across two nodes the field might contain, for example, {"nl@192.168.1.101": 2, "nl@192.168.1.102": 1}.

Here is the query that goes to the database when the session is closed:

UPDATE watching_times
SET nodes = (
  CASE WHEN
    (
      CONCAT(
        '{"nl@192.168.1.101": ',
        COALESCE(watching_times.nodes ->> 'nl@192.168.1.101', '0')::INT - 1,
        '}'
      )::JSONB ->> 'nl@192.168.1.101'
    )::INT <= 0
  THEN
    (watching_times.nodes - 'nl@192.168.1.101')
  ELSE
    watching_times.nodes ||
    CONCAT(
      '{"nl@192.168.1.101": ',
      COALESCE(watching_times.nodes ->> 'nl@192.168.1.101', '0')::INT - 1,
      '}'
    )::JSONB
  END
),
terminate_at = (CASE WHEN ... = '{}'::JSONB THEN NOW() ELSE NULL END)
WHERE id = 1;

When a session on a specific node is closed, the connection counter for that node is decremented by one; when it reaches zero, the node is removed from the list. Once the list of nodes becomes completely empty, that moment is recorded as the user's final exit time.
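For clarity, here is a rough sketch (not the project's actual code) of where these queries are issued: the login UPSERT runs when the user's channel is joined, and the closing UPDATE runs in the terminate callback. MyApp.WatchingTimes is a hypothetical module wrapping the two SQL statements above:

defmodule MyApp.PresenceChannel do
  use Phoenix.Channel

  def join("user:" <> user_id, _params, socket) do
    # runs the INSERT ... ON CONFLICT query with the current Erlang node name
    MyApp.WatchingTimes.register_join(user_id, to_string(node()))
    {:ok, assign(socket, :user_id, user_id)}
  end

  def terminate(_reason, socket) do
    # runs the UPDATE query: decrement the counter, drop the node, set terminate_at
    MyApp.WatchingTimes.register_leave(socket.assigns.user_id, to_string(node()))
    :ok
  end
end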

This approach made it possible not only to track users' online presence and viewing time, but also to filter those sessions by various criteria.
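For example, the total viewing time of a user can be computed straight from this table; here is a sketch assuming an Ecto repo named MyApp.Repo (the function and repo names are illustrative):

import Ecto.Query

def total_viewing_time(user_id) do
  # sum the duration of all completed sessions, in seconds
  from(w in "watching_times",
    where: w.user_id == ^user_id and not is_nil(w.terminate_at),
    select: sum(fragment("EXTRACT(EPOCH FROM (? - ?))", w.terminate_at, w.join_at))
  )
  |> MyApp.Repo.one()
end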

There is only one drawback to all this: if a node goes down, all of its users get stuck "online". To handle this, we have a daemon that periodically cleans such records out of the database, although so far it has not been needed. The load analysis and cluster monitoring carried out after the project went into production showed no node failures, so this mechanism has not been used.

There were other difficulties too, but they are more specific, so let's move on to the question of application performance.

Linux kernel configuration for better performance


Writing a good application in a performant language is only half the battle; without competent DevOps it is impossible to achieve any really high results.
The first barrier on the way to the target load was the Linux network stack. It took some tuning to use its resources more rationally.
Each open socket in Linux is a file descriptor, and their number is limited. The reason for the limit is that for every open file the kernel creates a C structure that occupies unreclaimable kernel memory.

To make the most of the available memory, I set the maximum sizes of the receive and transmit buffers quite high and also increased the TCP socket memory limits. Note that net.ipv4.tcp_mem is specified not in bytes but in memory pages, usually 4 KB each. I also raised the maximum backlog of sockets waiting to be accepted to 15 thousand, a reasonable value for highly loaded servers.

File Descriptor Limits:


#!/usr/bin/env bash
sysctl -w 'fs.nr_open=10000000' # maximum number of open file descriptors

sysctl -w 'net.core.rmem_max=12582912' # maximum receive buffer size for all socket types
sysctl -w 'net.core.wmem_max=12582912' # maximum transmit buffer size for all socket types

sysctl -w 'net.ipv4.tcp_mem=10240 87380 12582912' # TCP stack memory limits (in pages)
sysctl -w 'net.ipv4.tcp_rmem=10240 87380 12582912' # TCP receive buffer sizes (min, default, max)
sysctl -w 'net.ipv4.tcp_wmem=10240 87380 12582912' # TCP transmit buffer sizes (min, default, max)

sysctl -w 'net.core.somaxconn=15000' # maximum backlog of sockets waiting to be accepted

If you run nginx in front of the Cowboy server, you should also think about raising its limits: the worker_connections and worker_rlimit_nofile directives are responsible for that.

The second barrier is less obvious. If you run such an application in distributed mode, you will notice a sharp growth in CPU consumption as the number of connections increases. The problem is that by default Erlang uses the poll system call. Since version 2.6 the Linux kernel has provided epoll, which offers a more efficient mechanism for applications handling a large number of concurrently open connections: O(1) complexity as opposed to O(n) for poll.

Fortunately, epoll mode is enabled with a single flag: +K true. I also recommend increasing the maximum number of processes spawned by your application and the maximum number of open ports using the +P and +Q flags, respectively.

Poll vs. Epoll


#!/usr/bin/env bash
elixir --name ${MIX_NODE_NAME}@${MIX_HOST} --erl "-config sys.config -setcookie ${ERL_MAGIC_COOKIE} +K true +Q 500000 +P 4194304" -S mix phx.server

The third problem is more individual, and not everyone will run into it. On this project, automatic deployment and dynamic scaling were organized with Chef and Kubernetes. Kubernetes lets you quickly deploy Docker containers to a large number of hosts, which is very convenient; however, you cannot know the IP address of a new host in advance, and without registering it in the Erlang config you cannot connect a new node to the distributed application.

Fortunately, the libcluster library exists to solve this problem. It talks to Kubernetes through its API, learns about the creation of new nodes in real time and registers them in the Erlang cluster.

config :libcluster,
  topologies: [
    k8s: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [
        kubernetes_selector: "app=my-backend",
        kubernetes_node_basename: "my-backend"]]]
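In recent libcluster versions the configured topologies also have to be started under the application's supervision tree; here is a minimal sketch (the application and endpoint module names are illustrative):

defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    topologies = Application.get_env(:libcluster, :topologies, [])

    children = [
      # supervisor that discovers Kubernetes pods and connects the Erlang nodes
      {Cluster.Supervisor, [topologies, [name: MyApp.ClusterSupervisor]]},
      MyAppWeb.Endpoint
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end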

Results and Prospects


The chosen framework, combined with proper server tuning, made it possible to reach all of the project's goals: to develop, within the set time frame of one month, an interactive web service that communicates with users in real time and withstands loads of 150 thousand connections and more.

Monitoring after the launch in production showed the following results: at a peak of 800 thousand connections, CPU consumption reaches 45%; the average load is 29% at 600 thousand connections.


Memory consumption was measured while running in a cluster of 10 machines, each with 8 GB of RAM (graphs not included here).


As for the main tools in this project, Elixir and the Phoenix Framework, I have every reason to believe that in the coming years they will become as popular as Ruby and Rails once were, so it makes sense to start mastering them now.
Thank you for your attention!

References


Development:
elixir-lang.org
phoenixframework.org
Load testing:
gatling.io
flood.io
Monitoring:
prometheus.io
grafana.com