Low-level code optimization on the Elbrus platform: vector addition of uint16_t using intrinsics



In this article we will talk about the low-level optimizations that can be done on Elbrus processors.

In principle, optimizations at this level are not a mandatory stage of development for Elbrus. For most computational operations requiring high performance, you can and should use the functions from the EML library.

However, the current version of EML lacks some functions we needed, so we decided to write them ourselves.

For this we used intrinsics. Intrinsics are constructs that look like ordinary functions to the programmer, but whose calls the compiler replaces in place with highly efficient code. Most often, intrinsics are needed when you want to use vector processor extensions that perform the same operation on a register containing several data elements at once. Even an optimizing compiler cannot always guess that such a transformation will speed up your code.

Previously, in such cases, if there was no suitable optimized library, you had to resort to assembler. But the performance of assembler code depends significantly on how efficiently registers are used, on accounting for ALU delays and other wonderful things. Moreover, Elbrus has a VLIW architecture, which means that if we want to write in assembler, we have to track the formation of wide instruction words ourselves. Optimizing compilers, on the other hand, are made precisely for such subtleties. Switching to intrinsics lets the work be divided sensibly between the programmer and the compiler. Intrinsic code can also be easily ported between systems that support all the intrinsics involved. So in our situation intrinsics are clearly the best solution.


The Elbrus-4C and Elbrus-8C microprocessors support vector operations on a 64-bit register. Using such a register, you can simultaneously process two 32-bit, four 16-bit, or eight 8-bit integers. The Elbrus set of intrinsics includes operations for data conversion, initialization of vector elements, arithmetic, bitwise logic, and permutation of vector elements, and it seemed to us quite similar to the SSE/SSE2 instruction set of x86 processors.
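
As a toy illustration of working with this 64-bit register, here is a small stand-alone program (our own sketch, not code from EML) that packs four 16-bit values into a __di and adds them with a single intrinsic call; we assume the compiler provides the __di type and the __builtin_e2k_paddh intrinsic that we use later in this article.


#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
  uint16_t a[4] = {1, 2, 3, 4};
  uint16_t b[4] = {10, 20, 30, 40};
  uint16_t c[4];

  __di va, vb, vc;                      // __di holds four 16-bit lanes
  memcpy(&va, a, sizeof(va));           // pack the inputs
  memcpy(&vb, b, sizeof(vb));
  vc = __builtin_e2k_paddh(va, vb);     // four 16-bit additions at once
  memcpy(c, &vc, sizeof(c));

  printf("%u %u %u %u\n", c[0], c[1], c[2], c[3]);  // prints: 11 22 33 44
  return 0;
}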


So, let's start optimizing. Take a piece of code that adds two arrays of type uint16_t and writes the result to a third array (there is no such operation in EML yet):


// Variant 0
// eml_16u *src1 - pointer to the first summand array
// eml_16u *src2 - pointer to the second summand array
// eml_16u *dst - pointer to the result array
// len - length of the arrays
for (size_t i = 0; i < len; ++i)
  dst[i] = src1[i] + src2[i];

Now we rewrite it using intrinsics. For simplicity, we assume that the length of the arrays len is divisible by 4; the remaining elements of the arrays are processed separately (a sketch of this tail handling is given below). Then it will look something like this:


// Variant 1
// eml_16u *src1 - pointer to the first summand array
// eml_16u *src2 - pointer to the second summand array
// eml_16u *dst - pointer to the result array
// len - length of the arrays
static const size_t block_size = sizeof(eml_64u) / sizeof(eml_16u);
for (size_t i = 0; i < len;
     i += block_size, src1 += block_size, src2 += block_size, dst += block_size)
  *(__di*)dst = __builtin_e2k_paddh(*(__di*)src1, *(__di*)src2);

Here __di is a 64-bit data type, and __builtin_e2k_paddh is an intrinsic that performs element-wise addition of 16-bit integers (four per 64-bit register) without saturation.
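
For completeness, here is a sketch of our own (it is not part of the article's listing) of how the tail could be handled when len is not a multiple of 4: run the vector loop only up to the largest multiple of 4, then finish the remaining elements with a scalar loop, reusing the names from Variant 1.


// Sketch of the tail handling (assumed, not shown above)
size_t vec_len = len & ~(block_size - 1);   // largest multiple of 4 not exceeding len
size_t i = 0;
for (; i < vec_len; i += block_size, src1 += block_size,
                    src2 += block_size, dst += block_size)
  *(__di*)dst = __builtin_e2k_paddh(*(__di*)src1, *(__di*)src2);
for (size_t k = 0; k < len - vec_len; ++k)  // 0..3 leftover elements
  dst[k] = src1[k] + src2[k];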


However, this code is not optimal, because to load an unaligned 64-bit value r located at address p, the processor has to perform the following elementary operations:


  1. Determine the offset s of the address p from the 64-bit alignment boundary (see Fig. 1). The address of the aligned 64-bit word containing the beginning of r is p - s. The address of the next aligned word, which contains the end of r, is p - s + 8.


  2. Load from memory the two 64-bit words r1 and r2 that contain r, at the aligned addresses.


  3. Reconstruct r from r1, r2 and s.


Fig. 1. Scheme of unaligned loading of 64-bit data from memory.
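
Purely for illustration, the same three steps can be spelled out in portable C on a little-endian machine (Elbrus is little-endian). This helper is our own sketch and not what the real code uses; in the real code the E2K_* macros shown next map onto dedicated operations.


#include <stdint.h>
#include <stddef.h>

// Illustration only: an unaligned 64-bit load built from two aligned loads
static uint64_t load_unaligned_u64(const void *p) {
  uintptr_t addr = (uintptr_t)p;
  size_t s = addr & 7u;                               // step 1: offset from the boundary
  const uint64_t *p_aligned = (const uint64_t *)(addr - s);
  uint64_t r1 = p_aligned[0];                         // step 2: first aligned word
  if (s == 0)
    return r1;                                        // p was already aligned
  uint64_t r2 = p_aligned[1];                         //         second aligned word
  return (r1 >> (8 * s)) | (r2 << (64 - 8 * s));      // step 3: merge r1, r2 and s
}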


For further optimization, we will write this explicitly using the Elbrus macros:


__di s = E2K_BYTES_FROM_ALIGN(p, 8);        // offset of p from the 8-byte boundary
__di tmp;
E2K_PREPARE_ALIGN(s, tmp);                  // precompute the merge parameter from s
const __di *p_aligned = (__di *)E2K_ALIGN_PTR_BACK(p, 8);  // p rounded down to the boundary
__di r1 = *p_aligned;                       // two aligned 64-bit loads...
__di r2 = *(p_aligned + 1);
__di r;
E2K_ALIGN_DATA(r1, r2, r, tmp);             // ...merged into the unaligned value r

Such code will perform 6 memory accesses at each iteration of the loop, just like the original version with unaligned loads. However, explicitly accessing memory at aligned addresses makes it possible to use the array prefetch buffer available in the Elbrus architecture and thus increase the efficiency of memory accesses (by the way, this buffer was also used in the code without intrinsics).


You can easily reduce the number of memory accesses per iteration to 3 by keeping the values loaded at the previous iteration of the loop. In addition, we will use only aligned writes to the result array, processing the initial part of the arrays separately. Thanks to this, both reads from and writes to memory become more efficient.


// Variant 2
// eml_16u *src1 - pointer to the first summand array
// eml_16u *src2 - pointer to the second summand array
// eml_16u *dst - pointer to the result array
// len - length of the arrays
size_t i = 0;
// Find the number of elements before the 64-bit alignment boundary for dst and process them
size_t offset = E2K_BYTES_TO_ALIGN(dst, sizeof(eml_64u)) / sizeof(eml_16u);
for (; i < offset; ++i)
  dst[i] = src1[i] + src2[i];
// Process the main part of the arrays
__di spec0, spec1;
__di tmp0, tmp1;
__di align1 = E2K_BYTES_FROM_ALIGN(src1 + offset, sizeof(eml_64u));
E2K_PREPARE_ALIGN(align1, spec0);
__di align2 = E2K_BYTES_FROM_ALIGN(src2 + offset, sizeof(eml_64u));
E2K_PREPARE_ALIGN(align2, spec1);
const __di *v1 = (__di *)E2K_ALIGN_PTR_BACK(src1 + offset, 8);
const __di *v2 = (__di *)E2K_ALIGN_PTR_BACK(src2 + offset, 8);
__di *v3 = (__di*)(dst + offset);
__di d01, d11;
__di d00 = *v1;
__di d10 = *v2;
++v1;
++v2;
static const size_t block_size = sizeof(eml_64u) / sizeof(eml_16u);
size_t effective_len = offset + ((len - offset) & ~(block_size - 1));
for (; i < effective_len; i += block_size, ++v1, ++v2, ++v3) {
  d01 = *v1;
  d11 = *v2;
  E2K_ALIGN_DATA(d00, d01, tmp0, spec0);
  E2K_ALIGN_DATA(d10, d11, tmp1, spec1);
  *v3 = __builtin_e2k_paddh(tmp0, tmp1);
  d00 = d01;
  d10 = d11;
}
// Process the remaining elements sequentially, if there are any

At this point it might seem that nothing more can be done.


However, as we recall, modern Elbrus processors have 6 execution channels, into which up to 24 instructions can be packed, and all of them will be executed in one clock cycle. Of these instructions, only 6 can be integer arithmetic, since each channel has only one vector ALU (the other instructions may be floating-point arithmetic, loads/stores, etc.). In addition, there is another subtlety: these 6 ALUs are not identical, and each arithmetic instruction can be executed only in certain channels. Only channels 0 and 3 are suitable for addition without saturation. Therefore, in one clock cycle we can perform no more than 2 such additions. To tell the cautious compiler that these two additions are independent (that is, the result of the first is not used in the second), we unroll the loop. This can be done manually or using the compiler directive:


#pragma unroll(2)
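
If we were to unroll by hand instead of relying on the pragma, the main loop of Variant 2 might look roughly like this. This is our own sketch: it reuses the declarations from Variant 2, processes 8 elements per iteration, and leaves any leftover block to the scalar tail.


// Manual unrolling by 2 (a sketch): two independent paddh per iteration
for (; i + 2 * block_size <= effective_len;
     i += 2 * block_size, v1 += 2, v2 += 2, v3 += 2) {
  __di a0 = v1[0], a1 = v1[1];              // next two aligned words of src1
  __di b0 = v2[0], b1 = v2[1];              // next two aligned words of src2
  __di s0, s1, s2, s3;                      // unaligned source words for the two sums
  E2K_ALIGN_DATA(d00, a0, s0, spec0);
  E2K_ALIGN_DATA(d10, b0, s1, spec1);
  E2K_ALIGN_DATA(a0, a1, s2, spec0);
  E2K_ALIGN_DATA(b0, b1, s3, spec1);
  v3[0] = __builtin_e2k_paddh(s0, s1);      // first addition
  v3[1] = __builtin_e2k_paddh(s2, s3);      // second, independent addition
  d00 = a1;                                 // carry the last loaded words over
  d10 = b1;
}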


In addition, you can tell the compiler the expected number of loop iterations. For example, a value of about 1024 is suitable for a loop over an image row (this is a reasonable estimate of the linear size of the images we recognize, and colleagues from MCST recommended this value; the general idea is that the number should be large enough for the compiler to consider it worthwhile to apply special loop optimizations):


#pragma loop count(1024)


Of course, for loops that are known to be short, the compiler should be given the opposite hint (see below).


// Variant 3
// eml_16u *src1 - pointer to the first summand array
// eml_16u *src2 - pointer to the second summand array
// eml_16u *dst - pointer to the result array
// len - length of the arrays
size_t i = 0;
// Find the number of elements before the 64-bit alignment boundary for dst and process them
size_t offset = E2K_BYTES_TO_ALIGN(dst, sizeof(eml_64u)) / sizeof(eml_16u);
#pragma loop count(3)
for (; i < offset; ++i) {
  dst[i] = src1[i] + src2[i];
}
// Process the main part of the arrays
__di spec1, spec2;
__di tmp0, tmp1;
__di align1 = E2K_BYTES_FROM_ALIGN(src1 + offset, sizeof(eml_64u));
E2K_PREPARE_ALIGN(align1, spec1);
__di align2 = E2K_BYTES_FROM_ALIGN(src2 + offset, sizeof(eml_64u));
E2K_PREPARE_ALIGN(align2, spec2);
const __di *v1 = (__di *)E2K_ALIGN_PTR_BACK(src1 + offset, sizeof(eml_64u));
const __di *v2 = (__di *)E2K_ALIGN_PTR_BACK(src2 + offset, sizeof(eml_64u));
__di *v3 = (__di*)(dst + offset);
__di d01, d11, d02, d12;
__di d00 = *v1;
__di d10 = *v2;
++v1;
++v2;
size_t effective_len = offset + ((len - offset) & ~0x03);
#pragma unroll(2)
#pragma loop count(1024)
for (; i < effective_len; i += 4, ++v1, ++v2, ++v3) {
  d01 = *v1;
  d11 = *v2;
  E2K_ALIGN_DATA(d00, d01, tmp0, spec1);
  E2K_ALIGN_DATA(d10, d11, tmp1, spec2);
  *v3 = __builtin_e2k_paddh(tmp0, tmp1);
  d00 = d01;
  d10 = d11;
}
// Process the remaining elements sequentially, if there are any

Now let us give the measurement results. We measured the time to add two arrays of length 10^5; the average over 10^5 iterations is shown in the table below.
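
The measurement harness itself is not shown here; a minimal sketch of how such an average could be collected is given below. The function add_u16 is a hypothetical wrapper around whichever variant is being timed.


#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LEN   100000    // array length, 10^5 elements
#define ITERS 100000    // number of repetitions, 10^5

// add_u16 is a hypothetical wrapper around the variant being measured
void add_u16(const uint16_t *src1, const uint16_t *src2, uint16_t *dst, size_t len);

int main(void) {
  uint16_t *a = malloc(LEN * sizeof *a);
  uint16_t *b = malloc(LEN * sizeof *b);
  uint16_t *c = malloc(LEN * sizeof *c);
  if (!a || !b || !c)
    return 1;
  for (size_t i = 0; i < LEN; ++i) {
    a[i] = (uint16_t)i;
    b[i] = (uint16_t)(3 * i);
  }

  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (size_t it = 0; it < ITERS; ++it)
    add_u16(a, b, c, LEN);
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double total_us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) * 1e-3;
  printf("average: %.1f us per call\n", total_us / ITERS);

  free(a); free(b); free(c);
  return 0;
}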


Variant    Average array addition time, μs
0          219.0
1          250.7
2           62.6
3           31.4

Our optimization made the addition about 7 times faster! We see that if you set out to squeeze out maximum performance and spend a little time studying the features of Elbrus, you can achieve significant results.


Many thanks to the MCST staff, who repeatedly advised us while this was being written!