Profiling: measurement and analysis

Original author: Tony Albrecht


Hi, I'm Tony Albrecht, an engineer at Riot. I like to profile and optimize things. In this article I'll cover the basics of profiling and analyze some example C++ code as we profile it on a Windows machine. We'll start simple and gradually dig deeper into the guts of the CPU. When we spot opportunities to optimize, we'll implement the changes, and in the next article we'll work through real examples from the League of Legends codebase. Let's go!


Code Review


First, let's take a look at the code we're going to profile. Our program is a small, simple OpenGL renderer with an object-oriented, hierarchical scene tree. I imaginatively named the base object Object: everything in the scene inherits from it. In our code, only Node, Cube, and Modifier inherit from Object.



Cube is an Object that renders itself on screen as a cube. Modifier is an Object that "lives" in the scene tree and, when Updated, transforms the Objects added to it. Node is an Object that can contain other Objects.


The system is designed so that you can build a hierarchy of objects by adding cubes to nodes, as well as nodes to other nodes. If you transform a node (via a modifier), all the objects it contains are transformed along with it. Using this simple system, I created a tree of cubes orbiting one another.
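

To make the structure concrete, here is a minimal sketch of what such a hierarchy might look like. This is not the article's actual code: everything beyond the names Object, Node, Cube, Modifier, Update, and mObjects is an illustrative assumption, Matrix4 is a stand-in for the Bullet Vectormath type, and whether Modifier really derives via Node is a guess.

#include <array>
#include <vector>

using Matrix4 = std::array<float, 16>;   // stand-in for the Vectormath matrix type

class Object                             // base class for everything in the scene
{
public:
    virtual ~Object() {}
    virtual void Update() {}
protected:
    Matrix4 mTransform;                  // local transform (assumed layout)
    Object* mParent = nullptr;
};

class Cube : public Object               // renders itself as a cube
{
public:
    void Render(const Matrix4& mvp);     // assumed signature
};

class Node : public Object               // may contain other Objects
{
public:
    void AddObject(Object* obj) { mObjects.push_back(obj); }
    void Update() override
    {
        for (Object* obj : mObjects)     // update the whole subtree
            obj->Update();
    }
protected:
    std::vector<Object*> mObjects;
};

class Modifier : public Node             // transforms the Objects added to it
{
public:
    void Update() override;              // discussed later in the article
};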



I'll admit, this is not the best implementation of a scene tree, but that's fine: this code exists specifically to be optimized later. In fact, it is a direct port of the PlayStation®3 example I wrote in 2009 for the Pitfalls of Object Oriented Programming presentation. You can compare today's article with that 9-year-old one and see whether the lessons we learned back then on the PS3 still apply to modern hardware.


But back to our cubes. The gif above shows about 55 thousand rotating cubes. Note that I'm not profiling the rendering of the scene, only the animation and the culling pass before rendering. The example uses the Dear ImGui and Bullet Vectormath libraries, both free. For profiling, I used AMD CodeXL and a simple instrumented profiler built in haste for this article.


Before we get down to business


Units


First, I want to discuss performance measurement. In games, frames per second (FPS) is often used as a metric. It's a fine indicator of overall performance, but useless when analyzing parts of a frame or comparing improvements from different optimizations. Say "the game now runs 20 frames per second faster!": how much faster is that, really?


It depends. If we had 60 FPS (or 1000/60 = 16.666 milliseconds per frame) and now have 80 FPS (1000/80 = 12.5 ms per frame), the improvement is 16.666 ms - 12.5 ms = 4.166 ms per frame. That's a nice gain.


But if we had 20 FPS and now have 40 FPS, the improvement is (1000/20 - 1000/40) = 50 ms - 25 ms = 25 ms per frame! That's a huge performance boost, one that can take a game from "unplayable" to "acceptable." The problem with the FPS metric is that it is relative, so we will always use milliseconds. Always.
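

As a quick sanity check, a tiny helper makes the relative-versus-absolute point obvious (a sketch; the numbers reproduce the two examples above):

#include <cstdio>

// Convert frames per second to milliseconds per frame.
double FpsToMs(double fps) { return 1000.0 / fps; }

int main()
{
    // The same "+20 FPS" can mean very different absolute savings:
    std::printf("60 -> 80 FPS saves %.3f ms per frame\n",
                FpsToMs(60.0) - FpsToMs(80.0));    // ~4.167 ms
    std::printf("20 -> 40 FPS saves %.3f ms per frame\n",
                FpsToMs(20.0) - FpsToMs(40.0));    // 25.000 ms
    return 0;
}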


Taking measurements


There are several types of profilers, each with its own strengths and weaknesses.


Instrumented Profilers


With an instrumented profiler, the programmer manually marks the pieces of code whose performance should be measured. These profilers record the start and end times of each profiled fragment, identified by a unique marker. For example:


void Node::Update()
{
    FT_PROFILE_FN
    for(Object* obj : mObjects)
    {
        obj->Update();
    }
}

Here, FT_PROFILE_FN creates an object that records the time of its creation and then of its destruction when it falls out of scope. These timestamps, together with the function name, are stored in an array for later analysis and visualization. With some effort you can implement the visualization in code, or, a little more easily, use a tool like chrome://tracing.
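

As an illustration, here is a minimal sketch of how such a macro can be implemented with an RAII timer. Only the macro names come from the article; everything else (the class name, printing instead of storing records into an array) is an assumption:

#include <chrono>
#include <cstdio>

// Scoped timer: records its creation time and, on destruction at the end
// of the scope, reports the elapsed time. A real profiler would append
// {name, begin, end} records to an array for later visualization.
struct ScopedProfile
{
    const char* name;
    std::chrono::steady_clock::time_point begin;

    explicit ScopedProfile(const char* n)
        : name(n), begin(std::chrono::steady_clock::now()) {}

    ~ScopedProfile()
    {
        auto end = std::chrono::steady_clock::now();
        double us = std::chrono::duration<double, std::micro>(end - begin).count();
        std::printf("%s: %.1f us\n", name, us);
    }
};

#define FT_PROFILE(name) ScopedProfile ftProfileScope(name)
#define FT_PROFILE_FN    ScopedProfile ftProfileScope(__FUNCTION__);

FT_PROFILE_FN carries its own trailing semicolon so it can be used as a bare statement, as in Node::Update() above.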


Instrumented profiling is great for visualizing code performance and spotting spikes. When an application's performance is presented hierarchically, you can see at a glance which functions are slowest overall, which call the most other functions, which vary in execution time, and so on.



In this illustration, each colored bar corresponds to a function. Bars sitting directly below other bars are functions called by the functions above them. The length of a bar is proportional to the duration of the function.


Although instrumented profiling provides valuable visual information, it has flaws. It slows the program down: the more you measure, the slower the program becomes, so when writing an instrumented profiler, try to minimize its impact on application performance. If you skip a slow function, a large gap appears in the profile. You also get no information about the speed of individual lines of code: it's only practical to mark whole scopes, and the overhead of instrumentation usually swamps the contribution of individual lines, so measuring them is pointless.


Sampling Profilers


Sampling profilers periodically sample the execution state of the process being profiled. Each sample saves the program counter (PC), which shows which instruction is currently executing, as well as the call stack, so you can see which functions called the function containing the current instruction. All of this is useful because the functions or lines that collect the most samples are the slowest ones. The longer a sampling profiler runs, the more instruction and stack samples it collects, and the better the results.



Sampling profilers collect very low-level performance data without the manual markup an instrumented profiler requires, and they automatically cover the entire execution state of your program. They have two main drawbacks: they aren't much use for finding per-frame spikes, and they can't tell you when a particular function was called relative to other functions. In other words, we get less information about the call hierarchy than a good instrumented profiler provides.


Specialized Profilers


These profilers provide information specific to a process. They are usually tied to hardware, such as the CPU or the video card, which can generate specific events when something of interest occurs, for example a cache miss or a branch misprediction. Hardware vendors build in the ability to count these events to make it easier for us to find the causes of poor performance, so understanding these profiles requires some knowledge of the underlying hardware.


Game-specific profilers


At a much higher level, game-specific profilers can count, say, the number of minions on screen or the number of visible particles in a character's field of view. Such profilers are also very useful and help identify high-level errors in game logic.


Profiling


Profiling an application without a baseline to compare against is pointless, so when optimizing you must have a reliable test case at hand. That is not as simple as it sounds. Modern computers run not just your application but dozens, if not hundreds, of other processes, constantly switching between them. Those other processes can slow down the process you're profiling by competing for device access (for example, several processes trying to read from disk) or for CPU and GPU resources. To get a good baseline, you need to run the code through several profiling passes before you even start the real work. If the results vary a lot between runs, you'll have to figure out why and eliminate, or at least reduce, the variability.


Once you've achieved the smallest spread you can, remember that small improvements (smaller than the remaining variability) will be hard to measure, because they get lost in the system's "noise." Suppose a particular scene in the game renders in 14-18 ms per frame, 16 ms on average. You spend two weeks optimizing a function, re-profile, and get 15.5 ms per frame. Is it actually faster? To know for sure, you'd have to run the game many times, profiling that scene, computing the arithmetic mean, and plotting the trend. In the application described here, we measure hundreds of frames and average the results to get a reliably stable value.
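

The averaging itself is straightforward. Here's a sketch (the helper below is hypothetical; in the real application the per-frame times come from the instrumented profiler):

#include <algorithm>
#include <numeric>
#include <vector>

struct FrameStats { double meanMs, minMs, maxMs; };

// Summarize several hundred frame times (in ms): the mean is the figure
// we compare between runs; min/max show the spread ("noise") around it.
// Assumes the vector is non-empty.
FrameStats Summarize(const std::vector<double>& frameTimesMs)
{
    double sum = std::accumulate(frameTimesMs.begin(), frameTimesMs.end(), 0.0);
    auto [mn, mx] = std::minmax_element(frameTimesMs.begin(), frameTimesMs.end());
    return { sum / frameTimesMs.size(), *mn, *mx };
}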


In addition, many games run multiple threads whose scheduling is determined by your hardware and OS, which can lead to non-deterministic behavior or at least varying execution times. Don't forget to account for these factors.


With all that in mind, I put together a small test case for profiling and optimization. It is easy to understand, yet complex enough to leave room for significant performance improvements. Note that for simplicity I disabled rendering while profiling, so we see only the computational costs on the CPU.


Profiling the code


Below is the code we're going to optimize. Remember that this single example only teaches the profiling process. You will certainly run into unexpected difficulties when profiling your own code, and I hope this article helps you build your own diagnostic framework.


{
    FT_PROFILE("Update");
    mRootNode->Update();
}

{
    FT_PROFILE("GetWBS");
    BoundingSphere totalVolume = mRootNode->GetWorldBoundingSphere(Matrix4::identity());
}

{
    FT_PROFILE("cull");
    uint8_t clipFlags = 63;
    mRootNode->Cull(clipFlags);
}

{
    FT_PROFILE("Node Render");
    mRootNode->Render(mvp);
}

I wrapped different scopes in the instrumentation macro FT_PROFILE() to measure how long each part of the code takes. Below we'll discuss what each fragment is for.


When I ran the code and dumped the instrumented profile data, I got the following picture in chrome://tracing:



This is the profile of a single frame. Here we see the relative duration of each function call, and note that the execution order is visible too. If I had instrumented the functions called by these calls, they would appear beneath their parent functions' bars. For example, I instrumented Node::Update() and got this recursive call structure:



The execution time of a single frame of this code varies by a couple of milliseconds between measurements, so we take the arithmetic mean of at least several hundred frames and compare it to the baseline. In this case 297 frames were measured; the average was 17.5 ms, with some frames taking up to 19 ms and others slightly under 16.5 ms, even though each frame did almost exactly the same work. Such is the inherent variability of frames. Repeated runs consistently give about 17.5 ms, so we can treat that value as a reliable baseline.



If we disable the instrumentation in the code and run it through the AMD CodeXL sampling profiler, we get the following picture:



Analyzing the five most-sampled functions, we get:



It appears the slowest function is the matrix multiply. That sounds plausible: all those rotations require a pile of calculations. If you look closely at the stack hierarchy a couple of images above, you can see that the matrix multiplication operator is called from Modifier::Update(), Node::Update(), GetWorldBoundingSphere(), and Node::Render(). It is called often, and from many places, so this operator looks like a good candidate for optimization.


Matrix::operator*


If we use the sampling profiler to analyze the code responsible for the multiplication, we can find out the "cost" of each line.



Unfortunately, the matrix multiplication code is only one line long (for the sake of efficiency), so this result doesn't tell us much. Or does it?


If we look at the assembly, we can pick out the function's prologue and epilogue.



This is the inherent cost of a function call. The prologue sets up new stack space (ESP is the current stack pointer, EBP the base pointer for the current stack frame), and the epilogue cleans up and returns. Every time a function that is not inlined and uses some stack space (that is, has local variables) is called, these instructions are inserted and executed.


Let's expand the rest of the function and see what the matrix multiplication actually does.



Wow, that's a lot of code! And this is just the first page. The full function is more than a kilobyte of code, some 250-300 instructions! Let's analyze the beginning of the function.



The line highlighted in blue above takes about 10% of the function's total execution time. Why does it run so much slower than its neighbors? This MOVSS instruction takes a floating point value from memory at eax+34h and puts it in the xmm4 register. The line above it does the same with the xmm1 register, but much faster. Why?


It's all about cache misses.


Let's take a closer look. Instruction-level samples should be read with care. Modern processors execute several instructions at once, and many instructions can retire in a single clock cycle. Even event-based sampling can attribute events to the wrong instruction, so when reading sampled assembly you need to apply some judgement. In our example, the most-sampled instruction may not actually be the slowest; we can only say with reasonable confidence that something around this line is slow. And since the processor is executing a series of MOVs to and from memory, let's suppose those MOVs are to blame for the poor performance. To verify this, we could run a profile with event-based sampling enabled for cache misses and look at the result. For now, though, we'll trust our instincts and proceed on the cache-miss hypothesis.


An L3 cache miss costs more than 100 cycles (in some cases several hundred), and an L2 miss around 40 cycles, though it all depends on the processor. For comparison, x86 instructions take from 1 to roughly 100 cycles, and most take fewer than 20 (some division instructions are quite slow on some hardware). On my Core i7, the multiply, add, and even divide instructions took just a few cycles. Instructions are pipelined, so several are processed simultaneously. This means that in the time it takes to service a single L3 miss, a load straight from memory, hundreds of instructions could complete. Simply put, reading from memory is very slow.
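

If you want to feel this on your own machine, you can time a chain of dependent loads that defeats the prefetcher. This is a rough sketch, not a rigorous benchmark; build with optimizations on, and expect the numbers to vary by CPU:

#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main()
{
    const size_t N = 1 << 24;                  // 16M entries (~128 MB with 8-byte size_t)
    std::vector<size_t> next(N);
    std::iota(next.begin(), next.end(), size_t(0));

    // Sattolo's algorithm: build a single-cycle permutation so the walk
    // visits every element exactly once, in a cache-hostile random order.
    std::mt19937_64 rng(42);
    for (size_t k = N - 1; k > 0; --k)
    {
        std::uniform_int_distribution<size_t> dist(0, k - 1);
        std::swap(next[k], next[dist(rng)]);
    }

    size_t i = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t step = 0; step < N; ++step)
        i = next[i];                           // each load depends on the previous one
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / N;
    std::printf("~%.1f ns per dependent load (walk ended at %zu)\n", ns, i);
    return 0;
}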



Modifier::Update()


So we've seen that memory access is slowing down our code. Let's go back and see what in the code leads to it. The instrumented profiler shows that Node::Update() runs slowly, and from the sampling profiler's stack report it is clear that Modifier::Update() is especially sluggish. That's where we'll start optimizing.


void Modifier::Update()
{
    for(Object* obj : mObjects)
    {
        Matrix4 mat = obj->GetTransform();
        mat = mTransform*mat;
        obj->SetTransform(mat);
    }
}

Modifier::Update() walks a vector of pointers to Objects, gets each one's transform matrix, multiplies it by the Modifier's own mTransform matrix, and then applies the resulting transform back to the Object. In the code above, the transform is copied from the object to the stack, multiplied, and then copied back.


GetTransform() just returns a copy of mTransform, while SetTransform() copies the new matrix into mTransform and sets this object's mDirty state:


void SetDirty(bool dirty)
{
    if (dirty && (dirty != mDirty))
    {
        if (mParent)
            mParent->SetDirty(dirty);
    }
    mDirty = dirty; 
}
void SetTransform(Matrix4& transform)
{
    mTransform = transform; 
    SetDirty(true);
}
inline const Matrix4 GetTransform() const { return mTransform; } 

The internal data layout of an Object looks like this:


protected: 
    Matrix4 mTransform; 
    Matrix4 mWorldTransform;
    BoundingSphere mBoundingSphere;
    BoundingSphere mWorldBoundingSphere; 
    bool m_IsVisible = true; 
    const char* mName; 
    bool mDirty = true; 
    Object* mParent; 
};

For clarity, I color-coded the fields of a Node object as laid out in memory:



The first entry is the virtual table pointer. This is part of the C++ implementation of inheritance: a pointer to an array of function pointers that serve as the virtual functions for this particular polymorphic object. Object, Node, Modifier, and any other class derived from the base each have their own virtual table.


After that 4-byte pointer comes a 64-byte array of floats, the mTransform matrix, followed by the mWorldTransform matrix and then the two bounding spheres. Note that the next field, m_IsVisible, is a single byte yet occupies a full 4 bytes. This is expected, since the field after it is a pointer that requires at least 4-byte alignment. If we put another Boolean after m_IsVisible, it would be packed into the 3 spare bytes. Next comes the mName pointer (4-byte aligned), then the Boolean mDirty (also loosely packed), then the pointer to the parent Object. All of this is Object data. The mObjects vector that follows belongs to Node and occupies 12 bytes on this platform, though it may be a different size on others.
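

You can verify this kind of packing with sizeof and offsetof. Below is a sketch using a simplified, standard-layout stand-in for the Object data (offsetof on polymorphic types is not portable, so the vtable pointer is omitted, and the bounding spheres are assumed to be 4 floats each):

#include <cstddef>
#include <cstdio>

// Simplified stand-in mirroring the field order listed above.
struct ObjectData
{
    float mTransform[16];          // 64 bytes
    float mWorldTransform[16];     // 64 bytes
    float mBoundingSphere[4];      // assumed: center (3 floats) + radius
    float mWorldBoundingSphere[4];
    bool  m_IsVisible;             // 1 byte, padded so the next field aligns
    const char* mName;
    bool  mDirty;                  // padded again before the next pointer
    ObjectData* mParent;
};

int main()
{
    std::printf("sizeof(ObjectData) = %zu\n", sizeof(ObjectData));
    std::printf("m_IsVisible at %zu, mName at %zu (padding in between)\n",
                offsetof(ObjectData, m_IsVisible), offsetof(ObjectData, mName));
    std::printf("mDirty at %zu, mParent at %zu\n",
                offsetof(ObjectData, mDirty), offsetof(ObjectData, mParent));
    return 0;
}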


If we look again at the code of Modifier::Update(), we can see what might be causing the cache misses.


void Modifier::Update()
{
    for(Object* obj : mObjects)
    {

First, note that the mObjects vector is an array of pointers to Objects that were each dynamically allocated. Iterating over the vector itself is cache-friendly (red arrows in the illustration below), since the pointers sit one after another. There will be a few misses, but that's about as cache-friendly as iteration gets. Since each Object was allocated separately with new, however, the Objects themselves can be anywhere in memory.



Once we have a pointer to an Object, we call GetTransform():


Matrix4 mat = obj->GetTransform();

This inline function just returns a copy of the Object's mTransform, so the previous line is equivalent to:


Matrix4 mat = obj->mTransform;

As the diagram shows, the Objects referenced by the pointers in mObjects are scattered throughout memory. Each time we move to the next Object and call GetTransform(), we almost certainly take a cache miss when loading mTransform to copy it onto the stack. On the hardware I'm using, a cache line is 64 bytes, so if we're lucky and an object happens to start 4 bytes before a 64-byte boundary, mTransform will be loaded into the cache in one go. More likely, though, loading mTransform causes two cache misses. From the sampling profile of Modifier::Update() it is obvious that matrix alignment is effectively arbitrary.
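

The arithmetic is easy to check: with 64-byte cache lines, a 64-byte matrix fits in a single line only if it starts exactly on a line boundary. A small sketch (assuming the 64-byte line size mentioned above):

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Number of 64-byte cache lines touched by `size` bytes starting at `addr`.
size_t CacheLinesTouched(std::uintptr_t addr, size_t size)
{
    const std::uintptr_t kLine = 64;
    return static_cast<size_t>((addr + size - 1) / kLine - addr / kLine + 1);
}

int main()
{
    // mTransform aligned exactly to a line boundary: one line, one miss.
    std::printf("%zu line(s)\n", CacheLinesTouched(0x1000, 64));      // 1
    // The same matrix at +4 (just past the vtable pointer of an
    // arbitrarily aligned Object): two lines, so up to two misses.
    std::printf("%zu line(s)\n", CacheLinesTouched(0x1004, 64));      // 2
    return 0;
}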



In this fragment, edx holds the Object's address, and as we know, mTransform starts 4 bytes from the start of the object. So this code copies mTransform onto the stack (MOVUPS copies 4 unaligned floating point values into a register). Note that 7% of the samples fall on three of the MOVUPS instructions, which suggests that these MOVs also incur cache misses. I don't know why the first MOVUPS to the stack doesn't take as much time as the rest; I suspect its cost is simply attributed to the subsequent MOVUPS instructions due to instruction pipelining. In any case, we now have evidence that memory access is expensive here, so that's what we'll attack.


void Modifier::Update()
{
    for(Object* obj : mObjects)
    {
        Matrix4 mat = obj->GetTransform();
        mat = mTransform*mat;
        obj->SetTransform(mat);
    }
}

After the matrices are multiplied, we call Object::SetTransform(), which takes the result of the multiplication (freshly placed on the stack) and copies it into the Object instance. The copy is fast because the transform is now in the cache, but SetDirty() is slow because it reads the mDirty flag, which is probably not in the cache. So to test and possibly set this one byte, the processor has to read in the surrounding 63 bytes of its cache line.


Conclusion


If you've read this far: well done! I know it can be hard going at first, especially if you're not familiar with assembly. But I strongly recommend taking the time to see what compilers do with the code you write. You can use Compiler Explorer for that.


We've gathered evidence that the main cause of the performance problems in our sample code is memory access through pointers. Next, we'll minimize the cost of those memory accesses and then measure performance again to see whether we actually got an improvement. That's what we'll do next time.