FPGA accelerators go to the clouds

FPGA accelerators go to the clouds, an article by Selectel

The arrival of FPGA accelerators that can be reprogrammed as many times as needed, and in a high-level language such as C, was a real breakthrough for high-performance computing. No less of a breakthrough is the ability to use FPGA technology without buying these very expensive adapters (prices in Russia start at 250 thousand rubles), simply by renting a dedicated server with an accelerator in a provider's cloud.



Introduction, or FPGA chips in three paragraphs


An FPGA (field-programmable gate array) is an integrated circuit (IC) that can be reconfigured for computing tasks of any complexity. Industry needs specialized chips (ASICs, application-specific integrated circuits) for everything from controlling spacecraft to calculating financial models. Before FPGAs, however, the strength and at the same time the weakness of specialized chips was the rigid functionality built into them, along with the high complexity of design and the cost of starting production. If the functionality later had to change even slightly, or if errors crept in at the design stage, an essentially new IC had to be created.


FPGA accelerator with an Intel Altera Arria 10 chip and 10GE ports for the PCI Express bus


FPGA accelerators that can be reprogrammed as many times as needed, and in a high-level language such as C, shortened both development time and time to market. They also opened entirely new opportunities for hardware developers, including those who work on specialized integrated circuits such as ASICs.


FPGAs have already passed through two stages of accessibility and are now actively entering a third. The first FPGAs appeared in 1985, but programming them required knowledge of low-level languages, comparable in level to assembler. The second stage began around 2013, when, thanks to Altera's efforts, it became possible to program them in a high-level C-like language. This dramatically widened the applicability of FPGAs, but the high cost of the chips still limited the circle of customers who could afford the technology.


Traditionally, the FPGA design and verification flow is extremely labor-intensive and demands a high degree of specialization; in complexity it approaches ASIC design. This limits the use of FPGAs by developers, especially in computing applications, where the people involved (programmers, mathematicians, algorithm designers) want to focus on their problem and not on its hardware implementation. To address this, in 2013 Altera brought to market FPGAs that support OpenCL, the open programming standard for heterogeneous computing platforms. This opened the hardware to developers of computing applications who are not familiar with FPGA hardware, HDL languages, or design and verification flows. One problem remained, however: expensive hardware and design tools.


Finally, from about 2016 we can speak of a third stage, marked by ready-made servers (physical and virtual) with FPGAs becoming available to a wide range of customers in the clouds of the largest data center operators: Amazon Web Services (AWS), Alibaba Cloud, and Huawei Cloud. In Russia, dedicated servers with FPGAs first became available in 2017, in the Selectel data center.


Why might FPGA accelerators be needed? On the one hand, data flows keep growing; on the other, it is hard to add computing power without increasing the size and power consumption of the computing system. An application usually has control tasks and resource-intensive data processing tasks. It makes sense to leave the control tasks on the CPU and send the processing tasks to a specialized resource. "On-the-fly configuration" for a given task is also a very useful property. Synthesizing a computing resource on the FPGA for a specific task should also pay off both in higher performance and in lower power consumption. In addition, an FPGA has fast internal memory and a well-developed (and reconfigurable) communication fabric that makes it possible to implement practically all known I/O protocols, along with resources such as hash memory, hardware DSP blocks, memory controllers, and so on. In other words, it is a full-fledged system on a chip with the ability to synthesize a specific computing core for each task.

The basic differences between FPGAs, CPUs, and GPUs


What types of accelerators are available today? There are multi-core CPUs (such as Xeon), GPUs, and FPGAs; each is considered below.


Each type of processor, whether general-purpose (CPU), graphics (GPU), or FPGA, has its own advantages; otherwise it would simply not be produced. CPUs offer good performance with the highest degree of versatility and applicability: about 99% of all existing programs are written to run on a CPU. GPUs have many more cores, a vector architecture, and high-speed memory and I/O. FPGAs deliver the highest performance per watt thanks to the nature of the hardware, but require very careful and time-consuming programming.


These differences in a little more detail:


  • General-purpose CPUs are essentially the workhorses of the IT industry. They can be used for a wide variety of tasks, but because of their architecture they are not very efficient at parallel computing. In recent years this problem has been partially solved by putting multiple cores on the processor die. Even in the most powerful CPUs, however, the number of cores is still measured in the tens.
  • Graphics processors (GPUs) spent many years confined to the niche of displaying information on the screen. Only relatively recently have they been used for high-performance computing tasks, including cryptocurrency mining. Treating graphics as vector workloads shaped the GPU architecture in a way that also suits parallel computing. As a result, a modern GPU can speed vectorized data through its pipelines, where a CPU would have to push the same data through many other logic blocks with a corresponding loss of performance. Modern GPUs contain several thousand processor cores per chip.
  • FPGAs, unlike general-purpose and graphics processors, can be reprogrammed to match the computing problem being solved on them. In effect, a specialized processor is synthesized for a specific task. Other important differences are lower power consumption per unit of computing power and an architecture that executes many vector operations in parallel, the so-called massively parallel fine-grained architecture. The number of cores in an FPGA chip can reach one million or more.

An FPGA accelerator is, as a rule, a device in one of several form factors (VPX, COM Express, PCIe, etc.) that, in addition to one or more FPGA chips, carries SRAM and DRAM on the board, including the very new HBM (high-bandwidth memory), and high-speed I/O interfaces such as the popular 10/40/100 GE and PCI Express. FPGA accelerators are also available in the SOM form factor (system on a module, a single-board computer) for embedded systems, which is popular in video analytics and industrial applications.
SOM FPGA Accelerator


Each FPGA chip contains an array of up to 5 million logic elements (lookup tables and flip-flops) that can be reprogrammed for different functions. In addition, there are hard-wired resources: cache memory, signal processors, digital processing units, and interface units.
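
To make the idea of a reprogrammable logic element concrete, here is a minimal Python sketch. It assumes a simplified element made of one 4-input lookup table plus one flip-flop; real devices use more elaborate cells, and the class and function names are illustrative only.

```python
# Toy model of one FPGA logic element: a 4-input lookup table (LUT) feeding
# a flip-flop. The 4-input size and all names are illustrative assumptions.

class LogicElement:
    def __init__(self, truth_table):
        # A 4-input LUT stores one output bit for each of the 16 input patterns.
        if len(truth_table) != 16:
            raise ValueError("a 4-input LUT needs exactly 16 entries")
        self.truth_table = list(truth_table)
        self.register = 0  # flip-flop state

    def evaluate(self, a, b, c, d):
        # Combinational output: the four inputs form an index into the table.
        index = (a << 3) | (b << 2) | (c << 1) | d
        return self.truth_table[index]

    def clock(self, a, b, c, d):
        # On a clock edge the flip-flop captures the current LUT output.
        self.register = self.evaluate(a, b, c, d)
        return self.register


def make_table(func):
    # Build the 16-entry configuration from an ordinary Python function.
    return [func((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1)
            for i in range(16)]


# "Program" the same kind of element as a 4-input AND and as a parity gate.
and4 = LogicElement(make_table(lambda a, b, c, d: a & b & c & d))
xor4 = LogicElement(make_table(lambda a, b, c, d: a ^ b ^ c ^ d))
```

Changing the function means changing only the 16 stored bits, not the circuit itself; an FPGA configuration does this for a very large number of such elements and for the routing between them.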


Why does an FPGA beat an ASIC in performance? The answer is very simple: thanks to more advanced chip manufacturing processes. FPGAs are made on 20 nm and even 14 nm processes, while ASICs are made on older processes at around 60 nm. As a result, an FPGA can fit several times more logic cells than an ASIC on the same die area, and this is what provides the performance gain.

FPGA Applications


From its invention to the present day, one of the basic uses of the FPGA has been chip prototyping, along with small- and medium-volume products for which producing ASICs is not economically feasible.


At the beginning of 2018, according to the Russian company Almaz-SP, FPGA accelerator applications broke down as follows:


  • 50% - special applications in military electronics,
  • 20% - telecommunications (equipment of GSM base stations, etc.),
  • 10% - processing of video streams (video studios, video analytics),
  • 10% - industrial use,
  • 10% - prototyping and more (including scientific calculations).

However, despite the predominantly military use in the past, civilian use of FPGA accelerators is now growing much faster. In 2015, Intel acquired Altera, one of the largest FPGA manufacturers. Altera's designs are now produced in silicon under the Intel brand, and a new line of FPGA chips, Intel Cyclone 10, was not long in coming. Cyclone 10 GX models show very high performance (up to 134 GFLOPS) and have advanced I/O capabilities. They connect to other devices through a 10GE network port or a PCI Express x4 bus. These FPGA chips are designed for machine vision, surveillance, video broadcasting, and robotics. The junior model, Cyclone 10 LP, is used as a computing core for engineering systems: control of sensor arrays, motor controllers, and so on.


In addition to the Cyclone line, Intel's product range includes other FPGA series inherited from Altera: MAX, Arria, and Stratix. The last two are the most powerful FPGA chips on the market; in 2018 they are expected to be updated to Arria 10 and Stratix 10. Stratix 10 will be built on the HyperFlex architecture and deliver 10 teraflops (that is, almost two orders of magnitude more than Cyclone 10).
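
As a quick check of the gap between the two chips, the short Python sketch below compares the figures quoted in the text (up to 134 GFLOPS for Cyclone 10 GX and 10 teraflops for Stratix 10). The figures are taken as given and are not independently verified; the variable names are illustrative.

```python
import math

# Figures as quoted in the article; treated as given, not verified.
cyclone10_gx_gflops = 134.0        # "up to 134 GFLOPS"
stratix10_gflops = 10.0 * 1000.0   # "10 teraflops" expressed in GFLOPS

ratio = stratix10_gflops / cyclone10_gx_gflops
orders_of_magnitude = math.log10(ratio)

print(f"ratio: {ratio:.1f}x, orders of magnitude: {orders_of_magnitude:.2f}")
```

With these inputs the ratio comes out at roughly 75 times, a little under two orders of magnitude.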


The Cyclone, MAX, Arria, and Stratix series partially overlap in performance, but Intel positions each series separately. Arria is aimed at signal processing in instrumentation; Stratix, at high-performance computing in data centers and telecommunications. We have already covered the applications of the Cyclone series, the only one to be updated in 2017, but one more is definitely worth mentioning: the Internet of Things (IoT).


More than 50% of FPGA accelerator use is in military and industrial electronics, but civilian applications and scientific computing are growing fast.

The concept of an image in FPGA technology


Above, we listed Intel's currently popular FPGA chip series, but to use them in servers you need to buy FPGA accelerator cards and program the chip logic on the adapter for a specific application. Adapter cards are available from Intel partners in the FPGA Design Solutions Network. In Russia, one such partner is Almaz-SP LLC (also a participant in the Euler Project), which supplies both original Intel adapters and boards of its own design with the latest generations of FPGAs.


Demonstration of a server with an Almaz-SP FPGA accelerator at the SelectelTechDay #2 conference; in the center is Anton Visto, a representative of Almaz-SP LLC


Demo zone of hardware innovations at SelectelTechDay #2 (the first stand on the left is the FPGA server from Almaz-SP)


If you want to skip the hardware design flow and focus on the computing task, you can use OpenCL and the Intel FPGA SDK for OpenCL. For this you need a board support package (BSP), which hides the complexities of building a system on a chip (memory controllers, PCIe, interfaces, clock domains, timing constraints, partial reconfiguration, etc.). Such a package is provided if the board supports OpenCL (an OpenCL BSP). With it you get a software developer's environment: a platform model, a function to accelerate, a runtime support library, a memory model, and special extensions for increasing throughput. From there, the work is writing code, profiling, and optimizing.
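
The split between a host program and a "function for acceleration" can be illustrated with a pure-Python emulation of the OpenCL execution model. This is only a sketch of the idea, not the Intel SDK or the OpenCL API: the helper `enqueue_nd_range` and the kernel name are hypothetical, and a plain loop stands in for the device.

```python
# Pure-Python emulation of the OpenCL execution model: the host launches one
# work-item per element and every work-item runs the same kernel function.
# This is not the Intel SDK API; all names here are illustrative.

def vector_add_kernel(global_id, a, b, result):
    # In OpenCL C this index would come from get_global_id(0).
    result[global_id] = a[global_id] + b[global_id]


def enqueue_nd_range(kernel, global_size, *buffers):
    # A real device processes work-items in parallel (on an FPGA, typically
    # as a deep pipeline); this stand-in iterates over them sequentially.
    for global_id in range(global_size):
        kernel(global_id, *buffers)


# Host side: prepare buffers, launch the kernel, read back the result.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
result = [0.0] * len(a)
enqueue_nd_range(vector_add_kernel, len(a), a, b, result)
```

In a real project the kernel body would be written in OpenCL C and compiled by the SDK for the board, while the host code would allocate device buffers and enqueue the kernel through the OpenCL runtime.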


The SDK and BSP produce a single configuration file (bitstream) with which the FPGA is configured, yielding a complete system on a chip for a specific computing task. The result of programming is firmware that solves a specific application problem (for example, solving matrix equations, converting video formats, etc.). Such firmware is called an FPGA image. Quite often the term "IP core" is used instead of "image".


An FPGA image is the controlling firmware for an FPGA chip, developed and debugged to perform a specialized computing task.

Barriers to FPGA technology for customers


Despite the attractive promise of "the highest performance for a specific computing task", two objective factors impede the widespread adoption of FPGAs: the high cost of an adapter with an FPGA chip, and a shortage of developers with practical experience in programming and debugging FPGA cores.


In addition to the accelerator, you must also buy a license for the Intel FPGA SDK for OpenCL; without it you can only run already compiled kernels, not compile them. The requirements for the developer's machine are also high: the recommended RAM is 18 to 48 GB. On a machine with an 8-core CPU and 32 GB of memory, compiling a kernel that computes the Mandelbrot set takes about 2 hours. If CPU utilization exceeds 90%, compilation may take a day or more, and with less than 16 GB of memory it may not be possible at all.
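
For orientation, here is a minimal Python sketch of the escape-time calculation that a Mandelbrot kernel typically performs for each pixel. It is not the source of the demo project mentioned in the article; the image size, coordinate bounds, and iteration limit are assumptions chosen for illustration.

```python
# Escape-time computation for one pixel: the kind of per-work-item body a
# Mandelbrot kernel usually contains. Image size, bounds and the iteration
# limit are assumptions chosen for illustration.

def mandelbrot_iterations(cx, cy, max_iter=100):
    # Iterate z = z*z + c from z = 0 and count steps until |z| exceeds 2.
    zx, zy = 0.0, 0.0
    for i in range(max_iter):
        if zx * zx + zy * zy > 4.0:
            return i
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
    return max_iter


def render(width, height, max_iter=100):
    # Map each pixel to a point in [-2, 1] x [-1, 1] and evaluate it.
    rows = []
    for py in range(height):
        cy = -1.0 + 2.0 * py / (height - 1)
        row = []
        for px in range(width):
            cx = -2.0 + 3.0 * px / (width - 1)
            row.append(mandelbrot_iterations(cx, cy, max_iter))
        rows.append(row)
    return rows


image = render(31, 11)
```

Every pixel is independent of the others, which is why the problem maps well onto a device that processes many work-items at once.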


As a result, potential customers are actively interested in the technology but in no hurry to buy FPGA accelerators. The main fears are that the cost of the accelerator(s) will weigh heavily on the IT budget and that the in-house team will not be able to master programming and debugging FPGA images at the required level.


FPGA Cloud Computing


FPGA cloud services emerged as a response to the high cost of accelerator boards with an FPGA chip. Customers are offered physical and/or virtual servers for rent with FPGA accelerators already installed. As a rule, this is a joint product of a manufacturer (for example, Intel) and a data center acting as an IaaS provider.


An FPGA server with an accelerator from Almaz-SP can be tested for free in the Selectel data center


One way to make the technology accessible for mass use is to rent FPGA-based computing power. At Selectel, the service provides access to a server with an Euler accelerator installed, made by the Euler Project and based on an Intel Arria 10 FPGA. The SDK and BSP needed to develop, debug, and compile OpenCL kernels are deployed on the server, along with development tools for writing host applications (Visual Studio). As an introductory demonstration, the Mandelbrot set example mentioned earlier is offered: the project is supplied as source code and is configured for compilation.


The Euler Project offers an OpenCL programming course for FPGAs that is open to everyone. The course is designed specifically for a Russian audience: engineers, researchers, and students of technical universities. It incorporates material from official Intel training and allows a step-by-step study of the technology, from building the simplest application to applying specific optimization methods that are sometimes essential for optimal performance.


In this form, FPGA technology becomes more attractive to customers, since they no longer need to buy hardware directly and capital costs are replaced by operating expenses. Accordingly, the range of companies that can afford FPGA-accelerated computing for their projects expands significantly.
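
The trade-off between buying and renting can be put as a simple break-even calculation. In the Python sketch below, only the purchase price (from 250 thousand rubles) comes from the article; the monthly rental rate is a hypothetical placeholder and not a Selectel price, and the function name is illustrative.

```python
# Break-even between buying an accelerator and renting a server with one.
# Only the purchase price comes from the article; the rental rate below is
# a hypothetical placeholder, not an actual Selectel price.

PURCHASE_PRICE_RUB = 250_000              # "from 250 thousand rubles"
HYPOTHETICAL_RENT_RUB_PER_MONTH = 25_000  # assumption for illustration only


def break_even_months(purchase_price, rent_per_month):
    # Whole months after which cumulative rent reaches the purchase price.
    if rent_per_month <= 0:
        raise ValueError("rent must be positive")
    return -(-purchase_price // rent_per_month)  # ceiling division


months = break_even_months(PURCHASE_PRICE_RUB, HYPOTHETICAL_RENT_RUB_PER_MONTH)
```

The sketch deliberately ignores the cost of the host server, software licenses, and staff, all of which would push the break-even point for a purchase further out.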


The cloud model of using servers with FPGA accelerators opens this technology to many new customers who would like to try out "how it works" on their own projects and computing tasks.

FPGA Image Store Concept


Creating an efficient FPGA image for a specific application is a laborious and time-consuming task. A well-coordinated team may need up to a couple of months to program an image, and less experienced customers will spend much more time or may not manage the task at all.


The idea of an image store therefore suggests itself, by analogy with the existing application stores for platforms such as macOS, Windows, or Android. Developers could publish working images they have created for various tasks, and customers could buy them and load them onto their servers with FPGA accelerators if the images match the computing tasks in their projects.


In 2018, Selectel began work on such a store of FPGA images that could be used on Selectel's rented servers with this technology. This would significantly shorten the development cycle of new projects for customers, while the programmers (teams of authors) would earn income from work already done and be protected from pirated distribution of their images without their consent.


Useful link:


Free testing of dedicated server with FPGA adapter in Selectel Labs