Firstborn/Library/PackageCache/com.unity.burst@1.7.3/Documentation~/docs/OptimizationGuidelines.md

Optimization Guidelines

Loop Vectorization

Loop vectorization is one of the ways that Burst improves performance. Let's say you have code like this:

[MethodImpl(MethodImplOptions.NoInlining)]
private static unsafe void Bar([NoAlias] int* a, [NoAlias] int* b, int count)
{
    for (var i = 0; i < count; i++)
    {
        a[i] += b[i];
    }
}

public static unsafe void Foo(int count)
{
    var a = stackalloc int[count];
    var b = stackalloc int[count];

    Bar(a, b, count);
}

The compiler is able to convert that scalar loop in Bar into a vectorized loop. Instead of looping over a single value at a time, the compiler generates code that loops over multiple values at the same time, producing faster code essentially for free. Here is the x64 assembly generated for AVX2 for the loop in Bar above:

.LBB1_4:
    vmovdqu    ymm0, ymmword ptr [rdx + 4*rax]
    vmovdqu    ymm1, ymmword ptr [rdx + 4*rax + 32]
    vmovdqu    ymm2, ymmword ptr [rdx + 4*rax + 64]
    vmovdqu    ymm3, ymmword ptr [rdx + 4*rax + 96]
    vpaddd     ymm0, ymm0, ymmword ptr [rcx + 4*rax]
    vpaddd     ymm1, ymm1, ymmword ptr [rcx + 4*rax + 32]
    vpaddd     ymm2, ymm2, ymmword ptr [rcx + 4*rax + 64]
    vpaddd     ymm3, ymm3, ymmword ptr [rcx + 4*rax + 96]
    vmovdqu    ymmword ptr [rcx + 4*rax], ymm0
    vmovdqu    ymmword ptr [rcx + 4*rax + 32], ymm1
    vmovdqu    ymmword ptr [rcx + 4*rax + 64], ymm2
    vmovdqu    ymmword ptr [rcx + 4*rax + 96], ymm3
    add        rax, 32
    cmp        r8, rax
    jne        .LBB1_4

As can be seen above, the loop has been unrolled and vectorized so that it is 4 vpaddd instructions, each calculating 8 integer additions, for a total of 32 integer additions per loop iteration.

This is great! However, loop vectorization is notoriously brittle. As an example, let's introduce a seemingly innocuous branch like this:

[MethodImpl(MethodImplOptions.NoInlining)]
private static unsafe void Bar([NoAlias] int* a, [NoAlias] int* b, int count)
{
    for (var i = 0; i < count; i++)
    {
        if (a[i] > b[i])
        {
            break;
        }

        a[i] += b[i];
    }
}

Now the assembly changes to this:

.LBB1_3:
    mov        r9d, dword ptr [rcx + 4*r10]
    mov        eax, dword ptr [rdx + 4*r10]
    cmp        r9d, eax
    jg        .LBB1_4
    add        eax, r9d
    mov        dword ptr [rcx + 4*r10], eax
    inc        r10
    cmp        r8, r10
    jne        .LBB1_3

This loop is completely scalar and only has 1 integer addition per loop iteration. This is not good! In this simple case, an experienced developer would probably spot that adding the branch will break auto-vectorization. But in more complex real-life code it can be difficult to spot.

To help with this problem, Burst currently includes experimental intrinsics (Loop.ExpectVectorized() and Loop.ExpectNotVectorized()) that let you state your loop vectorization assumptions and have them validated at compile-time. For example, we can change the original Bar implementation to:

[MethodImpl(MethodImplOptions.NoInlining)]
private static unsafe void Bar([NoAlias] int* a, [NoAlias] int* b, int count)
{
    for (var i = 0; i < count; i++)
    {
        Unity.Burst.CompilerServices.Loop.ExpectVectorized();

        a[i] += b[i];
    }
}

Burst will now validate, at compile-time, that the loop has indeed been vectorized. If the loop is not vectorized, Burst will emit a compiler error. For example, if we do this:

[MethodImpl(MethodImplOptions.NoInlining)]
private static unsafe void Bar([NoAlias] int* a, [NoAlias] int* b, int count)
{
    for (var i = 0; i < count; i++)
    {
        Unity.Burst.CompilerServices.Loop.ExpectVectorized();

        if (a[i] > b[i])
        {
            break;
        }

        a[i] += b[i];
    }
}

then Burst will emit the following error at compile-time:

LoopIntrinsics.cs(6,9): Burst error BC1321: The loop is not vectorized where it was expected that it is vectorized.

As these intrinsics are experimental, they need to be enabled with the UNITY_BURST_EXPERIMENTAL_LOOP_INTRINSICS preprocessor define.

Note that these loop intrinsics should not be used inside if statements. Burst does not currently prevent this from happening, but in a future release this will be a compile-time error.
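The counterpart intrinsic, Loop.ExpectNotVectorized(), can document the opposite expectation. The following is a minimal sketch, assuming UNITY_BURST_EXPERIMENTAL_LOOP_INTRINSICS is defined and that the method is reached from Burst-compiled code; the class and method names are illustrative and not part of Burst:

```csharp
using Unity.Burst;
using Unity.Burst.CompilerServices;

// Hypothetical container class, used only for illustration.
public static unsafe class LoopExpectations
{
    // Hypothetical helper: the early exit is expected to keep this loop scalar,
    // and the intrinsic asks Burst to verify that expectation at compile-time.
    public static void AddUntilGreater([NoAlias] int* a, [NoAlias] int* b, int count)
    {
        for (var i = 0; i < count; i++)
        {
            // Placed directly in the loop body, not inside an if statement.
            Loop.ExpectNotVectorized();

            if (a[i] > b[i])
            {
                break;
            }

            a[i] += b[i];
        }
    }
}
```

If a later refactoring removes the early exit and the loop becomes vectorizable, the expectation should fail at compile-time, prompting you to update it to Loop.ExpectVectorized().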

Compiler Options

When compiling a job, you can change the behavior of the compiler:

  • Using a different accuracy for the math functions (sin, cos...)
  • Allowing the compiler to re-arrange the floating point calculations by relaxing the order of the math computations.
  • Forcing a synchronous compilation of the Job (only for the Editor/JIT case)
  • Using internal compiler options (not yet detailed)

These flags can be set through the [BurstCompile] attribute, for example [BurstCompile(FloatPrecision.Medium, FloatMode.Fast)].
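As a rough illustration of how these options fit together on a job, here is a minimal sketch. The job name and its fields are hypothetical; the attribute arguments and the CompileSynchronously property are the options listed above:

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

// Hypothetical job: medium-precision math functions, relaxed floating point ordering,
// and synchronous compilation (which only matters for the Editor/JIT case).
[BurstCompile(FloatPrecision.Medium, FloatMode.Fast, CompileSynchronously = true)]
public struct SineJob : IJob
{
    [ReadOnly] public NativeArray<float> Input;
    public NativeArray<float> Output;

    public void Execute()
    {
        for (var i = 0; i < Input.Length; i++)
        {
            // math.sin is one of the functions affected by FloatPrecision.
            Output[i] = math.sin(Input[i]);
        }
    }
}
```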

FloatPrecision

The accuracy is defined by the following enumeration:

    public enum FloatPrecision
    {
        /// <summary>
        /// Use the default target floating point precision - <see cref="FloatPrecision.Medium"/>.
        /// </summary>
        Standard = 0,
        /// <summary>
        /// Compute with an accuracy of 1 ULP - highly accurate, but increased runtime as a result, should not be required for most purposes.
        /// </summary>
        High = 1,
        /// <summary>
        /// Compute with an accuracy of 3.5 ULP - considered acceptable accuracy for most tasks.
        /// </summary>
        Medium = 2,
        /// <summary>
        /// Compute with an accuracy lower than or equal to <see cref="FloatPrecision.Medium"/>, with some range restrictions (defined per function).
        /// </summary>
        Low = 3,
    }

Currently, the implementation provides the following accuracy levels:

  • FloatPrecision.Standard is equivalent to FloatPrecision.Medium providing an accuracy of 3.5 ULP. This is the default value.
  • FloatPrecision.High provides an accuracy of 1.0 ULP.
  • FloatPrecision.Medium provides an accuracy of 3.5 ULP.
  • FloatPrecision.Low has an accuracy defined per function, and functions may specify a restricted range of valid inputs.

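The FloatPrecision.Standard accuracy should be more than sufficient for most games.

A ULP (unit in the last place, or unit of least precision) is the spacing between adjacent floating-point numbers, i.e., the value the least significant digit represents if it is 1. Because floating-point numbers are spaced further apart as their magnitude grows, the absolute size of one ULP depends on the value being measured.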
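To make the definition concrete, the sketch below measures ULPs for 32-bit floats in plain C#. It is independent of Burst, and the class and method names are illustrative assumptions. It relies on the fact that, for finite non-negative floats, adjacent values have adjacent bit patterns:

```csharp
using System;

// Hypothetical helper class, not part of Burst or Unity.Mathematics.
public static class UlpDemo
{
    // Size of one ULP at x: the gap between |x| and the next larger representable float.
    // Intended for finite inputs smaller in magnitude than float.MaxValue.
    public static float UlpOf(float x)
    {
        float magnitude = Math.Abs(x);
        int bits = BitConverter.SingleToInt32Bits(magnitude);
        float next = BitConverter.Int32BitsToSingle(bits + 1);
        return next - magnitude;
    }

    // Number of representable steps between two finite, non-negative floats.
    public static int UlpDistance(float a, float b)
    {
        int bitsA = BitConverter.SingleToInt32Bits(a);
        int bitsB = BitConverter.SingleToInt32Bits(b);
        return Math.Abs(bitsA - bitsB);
    }
}
```

For example, UlpDemo.UlpOf(1.0f) is 2^-23, while UlpDemo.UlpOf(1024.0f) is 2^-13. An accuracy of 3.5 ULP therefore permits a larger absolute error for larger results.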

Note: The FloatPrecision Enum was named Accuracy in early versions of the Burst API.

FloatPrecision.Low

The following table describes the precision and range restrictions for using the FloatPrecision.Low mode. Any function not described in the table will inherit the ULP requirement from FloatPrecision.Medium.


| Function | Precision | Range |
|---|---|---|
| Unity.Mathematics.math.sin(x) | 350.0 ULP | |
| Unity.Mathematics.math.cos(x) | 350.0 ULP | |
| Unity.Mathematics.math.exp(x) | 350.0 ULP | |
| Unity.Mathematics.math.exp2(x) | 350.0 ULP | |
| Unity.Mathematics.math.exp10(x) | 350.0 ULP | |
| Unity.Mathematics.math.log(x) | 350.0 ULP | |
| Unity.Mathematics.math.log2(x) | 350.0 ULP | |
| Unity.Mathematics.math.log10(x) | 350.0 ULP | |
| Unity.Mathematics.math.pow(x, y) | 350.0 ULP | Negative `x` to the power of a fractional `y` are not supported. |

Compiler floating point math mode

The compiler floating point math mode is defined by the following enumeration:

    /// <summary>
    /// Represents the floating point optimization mode for compilation.
    /// </summary>
    public enum FloatMode
    {
        /// <summary>
        /// Use the default target floating point mode - <see cref="FloatMode.Strict"/>.
        /// </summary>
        Default = 0,
        /// <summary>
        /// No floating point optimizations are performed.
        /// </summary>
        Strict = 1,
        /// <summary>
        /// Reserved for future.
        /// </summary>
        Deterministic = 2,
        /// <summary>
        /// Allows algebraically equivalent optimizations (which can alter the results of calculations), it implies :
        /// <para/> optimizations can assume results and arguments contain no NaNs or +/- Infinity and treat sign of zero as insignificant.
        /// <para/> optimizations can use reciprocals - 1/x * y  , instead of  y/x.
        /// <para/> optimizations can use fused instructions, e.g. madd.
        /// </summary>
        Fast = 3,
    }

  • FloatMode.Default defaults to FloatMode.Strict.
  • FloatMode.Strict: The compiler does not re-arrange calculations and is careful to respect special floating point values (denormals, NaN...). This is the default value.
  • FloatMode.Fast: The compiler can re-arrange instructions and/or use dedicated, less precise hardware SIMD instructions.
  • FloatMode.Deterministic: Reserved for future use, when Burst will provide support for a deterministic mode.

For example, some hardware can perform a multiply and an add (e.g. mad, a * b + c) in a single instruction. FloatMode.Fast allows the compiler to use such optimizations. Reordering or fusing these operations can change the result of a calculation and may lower its accuracy.

The FloatMode.Fast compiler floating point math mode can be used for many scenarios where the exact order of the calculation and the uniform handling of NaN values are not strictly required.
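To see why re-arranging a calculation can alter its result, consider the plain C# sketch below. It does not involve Burst at all and the names are illustrative; it only shows that two algebraically equivalent groupings of the same sum can produce different floating point values, which is the kind of change FloatMode.Fast permits the compiler to make:

```csharp
using System;

// Hypothetical demo class, not part of Burst.
public static class ReorderingDemo
{
    // Evaluates (a + b) + c.
    public static float SumLeftToRight(float a, float b, float c)
    {
        return (a + b) + c;
    }

    // Evaluates a + (b + c), which is algebraically the same sum.
    public static float SumRegrouped(float a, float b, float c)
    {
        return a + (b + c);
    }

    public static void Main()
    {
        float a = 1e20f, b = -1e20f, c = 1f;

        // a and b cancel exactly first, so c survives.
        Console.WriteLine(SumLeftToRight(a, b, c)); // prints 1

        // c is far smaller than one ULP of b, so b + c rounds back to b.
        Console.WriteLine(SumRegrouped(a, b, c));   // prints 0
    }
}
```

Code that depends on one particular grouping, or on NaN and infinity propagating in a specific way, is a candidate for staying on FloatMode.Strict.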

AssumeRange Attribute

Being able to tell the compiler that an integer lies within a certain range can open up optimization opportunities. The AssumeRange attribute allows users to tell the compiler that a given scalar-integer lies within a certain constrained range:

[return:AssumeRange(0u, 13u)]
static uint WithConstrainedRange([AssumeRange(0, 26)] int x)
{
    return (uint)x / 2u;
}

The above code makes two promises to the compiler:

  • That the variable x is in the closed-interval range [0..26], or more plainly that x >= 0 && x <= 26.
  • That the return value from WithConstrainedRange is in the closed-interval range [0..13], or more plainly that result >= 0 && result <= 13, where result is the returned value.

These assumptions are fed into the optimizer and allow for better codegen as a result. There are some restrictions:

  • You can only place these on scalar-integer (signed or unsigned) types.
  • The type of the range arguments must match the type being attributed.

We've also added in some deductions for the .Length property of NativeArray and NativeSlice to tell the optimizer that these always return non-negative integers.

static bool IsLengthNegative(NativeArray<float> na)
{
    // The compiler will always replace this with the constant false!
    return na.Length < 0;
}

Let's assume you have your own container:

struct MyContainer
{
    public int Length;
    
    // Some other data...
}

And you wanted to tell Burst that Length was always a non-negative integer. You would do that like so:

struct MyContainer
{
    private int _length;

    [return: AssumeRange(0, int.MaxValue)]
    private int LengthGetter()
    {
        return _length;
    }

    public int Length
    {
        get => LengthGetter();
        set => _length = value;
    }

    // Some other data...
}

Hint Intrinsics

Burst has some Hint intrinsics that provide a way for developers to tell the optimizer additional information that could aid in optimizations:

  • Unity.Burst.CompilerServices.Hint.Likely lets developers tell Burst that a boolean condition is likely to be true.
  • Unity.Burst.CompilerServices.Hint.Unlikely lets developers tell Burst that a boolean condition is unlikely to be true.
  • Unity.Burst.CompilerServices.Hint.Assume lets developers tell Burst that a boolean condition can be assumed to be true.

The likely intrinsic is most useful to tell Burst which branch condition has a high probability of being taken, and thus the optimizer can focus on the branch in question for optimization purposes:

if (Unity.Burst.CompilerServices.Hint.Likely(b))
{
    // Any code in here will be optimized by Burst with the assumption that we'll probably get here!
}
else
{
    // Whereas the code in here will be kept out of the way of the optimizer.
}

Conversely, the unlikely intrinsic tells the compiler the opposite - the condition is very unlikely to be true, and it should optimize against it:

if (Unity.Burst.CompilerServices.Hint.Unlikely(b))
{
    // Whereas the code in here will be kept out of the way of the optimizer.
}
else
{
    // Any code in here will be optimized by Burst with the assumption that we'll probably get here!
}

These two intrinsics ensure that the code most likely to be hit will be placed after the branching condition in the binary, meaning that it will have a very high probability of being in the instruction cache. Also, the compiler can hoist code out of the likely branch if profitable, spend extra time optimizing the likely branch, and also not spend as much time looking at the unlikely code - since the developer has told the compiler it probably won't be hit.

A classic example of an unlikely branch is a check on whether the result of an allocation is valid. The allocation will be valid nearly all the time, so you want the code to be fast under that assumption, but you still need some sort of error case to fall back to.
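The allocation check described above could look roughly like the following sketch. The class and method are hypothetical, and it assumes that UnsafeUtility.Malloc reports failure by returning null, which you should confirm for your Unity version:

```csharp
using Unity.Burst.CompilerServices;
using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;

// Hypothetical helper class, used only for illustration.
public static unsafe class ScratchBuffer
{
    // Fills a temporary buffer with 0..count-1 and returns the sum of its elements,
    // or -1 if the allocation failed.
    public static long SumWithScratch(int count)
    {
        var data = (int*)UnsafeUtility.Malloc(
            (long)count * sizeof(int), UnsafeUtility.AlignOf<int>(), Allocator.Temp);

        // Assumption: a failed allocation yields null. This path is expected to be rare.
        if (Hint.Unlikely(data == null))
        {
            return -1;
        }

        long sum = 0;
        for (var i = 0; i < count; i++)
        {
            data[i] = i;
            sum += data[i];
        }

        UnsafeUtility.Free(data, Allocator.Temp);
        return sum;
    }
}
```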

The assume intrinsic is powerful and dangerous - telling the compiler that a condition is always true:

Unity.Burst.CompilerServices.Hint.Assume(b);

if (b)
{
    // The compiler has been told that b is always true, so this branch will always be taken.
}
else
{
    // Any code in here will be removed from the program because b is always true!
}

The power of the assume intrinsic is that it allows you to arbitrarily tell the compiler that something is true. A developer could tell the compiler to assume that a loop end is always a multiple of 16, meaning that it can provide perfect vectorization without any scalar spilling for that loop. A developer could tell the compiler that a value isn't NaN, is negative, etc - the sky is really the limit here.
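The multiple-of-16 case mentioned above might be written like the sketch below. The class and method names are illustrative, and whether the scalar remainder loop actually disappears depends on the Burst version and target, so it is worth confirming in the Burst Inspector:

```csharp
using Unity.Burst;
using Unity.Burst.CompilerServices;

// Hypothetical helper class, used only for illustration.
public static unsafe class AssumeExample
{
    public static void AddInPlace([NoAlias] int* a, [NoAlias] int* b, int count)
    {
        // Promise to the compiler: callers always pass a positive multiple of 16.
        // Burst does NOT verify this. If the promise is broken, behavior is undefined.
        Hint.Assume(count > 0 && (count % 16) == 0);

        for (var i = 0; i < count; i++)
        {
            a[i] += b[i];
        }
    }
}
```

Because nothing checks the promise at run time, a debug-only assertion at the call site is a reasonable safeguard while developing.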

The danger with the intrinsic, though, is that the compiler will assume the condition is true without checking that it really is. You as the developer have promised the compiler that it must be true, and Burst is a trusting compiler: it trusts that the promise is kept! As a result, this intrinsic should be one of the last tools left on the shelf. It is useful and powerful, but care must be taken.

Constant Intrinsic

Burst has an intrinsic Unity.Burst.CompilerServices.Constant.IsConstantExpression that will return true if a given expression is known to be constant at compile-time:

using static Unity.Burst.CompilerServices.Constant;

var somethingWhichWillBeConstantFolded = math.pow(42.0f, 42.0f);

if (IsConstantExpression(somethingWhichWillBeConstantFolded))
{
    // The compiler knows that somethingWhichWillBeConstantFolded is a compile-time constant!
}

This is useful when you want to be certain that Burst always constant folds some complex expression. You could even use it to add special-case optimizations for a known constant value. For example, let's say we wanted to implement our own pow-like function for integer powers:

using static Unity.Burst.CompilerServices.Constant;

public static float MyAwesomePow(float f, int i)
{
    if (IsConstantExpression(i) && (2 == i))
    {
        return f * f;
    }
    else
    {
        return math.pow(f, (float)i);
    }
}

Because of the IsConstantExpression check, the compiler will always remove the branch when i is not constant, since the if condition is then false. When i is constant and equal to 2, the cheaper single multiply is used instead.

Note that constant folding will only take place during optimizations, so if you have disabled optimizations the intrinsic will return false.

Unity.Mathematics

The Unity.Mathematics package provides vector types (float4, float3...) that are directly mapped to hardware SIMD registers.

Many functions from the math type are also mapped directly to hardware SIMD instructions.

Note that currently, for optimal usage of this library, it is recommended to use the 4-wide SIMD types (float4, int4, bool4...).
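As a small illustration of the 4-wide types, the sketch below uses math.mad and math.dot on float4 values. The class and method names are illustrative, and the exact instructions generated depend on the target CPU:

```csharp
using Unity.Mathematics;

// Hypothetical helper class, used only for illustration.
public static class Float4Example
{
    // Computes values * scale + offset for four lanes at once.
    public static float4 ScaleAndOffset(float4 values, float4 scale, float4 offset)
    {
        return math.mad(values, scale, offset);
    }

    // Multiplies the lanes pairwise and sums the four products.
    public static float WeightedSum(float4 values, float4 weights)
    {
        return math.dot(values, weights);
    }
}
```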

Generic Jobs

As described in AOT vs JIT, there are currently two modes in which Burst compiles a Job:

  • When in the Editor, it will compile the Job when it is scheduled (sometimes called JIT mode).
  • When building a Standalone Player, it will compile the Job as part of the player build (AOT mode).

If the Job is a concrete type (not using generics), the Job will be compiled correctly in both modes.

A generic Job, however, can behave in less expected ways.

While Burst supports generics, it has limited support for generic Jobs and Function pointers. You might notice that a job scheduled in the Editor runs at full speed with Burst, but not when used in a Standalone player. This is usually a problem related to generic Jobs.

A generic Job can be defined like this:

// Direct Generic Job
[BurstCompile]
struct MyGenericJob<TData> : IJob where TData : struct { 
    public void Execute() { ... }
}

or can be nested:

// Nested Generic Job
public class MyGenericSystem<TData> where TData : struct {
    [BurstCompile]
    struct MyGenericJob  : IJob { 
        public void Execute() { ... }
    }

    public void Run()
    {
        var myJob = new MyGenericJob(); // implicitly MyGenericSystem<TData>.MyGenericJob
        myJob.Schedule();    
    }
}

Suppose the previous Jobs are used like this:

// Direct Generic Job
var myJob = new MyGenericJob<int>();
myJob.Schedule();

// Nested Generic Job
var myJobSystem = new MyGenericSystem<float>();
myJobSystem.Run();

In both cases, in a standalone-player build, the Burst compiler will be able to detect that it has to compile MyGenericJob<int> and MyGenericSystem<float>.MyGenericJob, because the generic job (or, for the nested job, the type surrounding it) is used with fully resolved generic arguments (int and float).

But if these jobs are used indirectly through a generic parameter, the Burst compiler won't be able to detect the Jobs it has to compile at standalone-player build time:

public static void GenericJobSchedule<TData>() where TData: struct {
    // Generic argument: Generic Parameter TData
    // This Job won't be detected by the Burst Compiler at standalone-player build time.
    var job = new MyGenericJob<TData>();
    job.Schedule();
}

// The implicit MyGenericJob<int> will run at Editor time in full Burst speed
// but won't be detected at standalone-player build time.
GenericJobSchedule<int>();

The same restriction applies when the Job is declared using a generic parameter that comes from an enclosing type:

// Generic Parameter TData
public class SuperJobSystem<TData>
{
    // Generic argument: Generic Parameter TData
    // This Job won't be detected by the Burst Compiler at standalone-player build time.
    public MyGenericJob<TData> MyJob;
}

In summary, if you are using generic jobs, they need to be used directly with fully-resolved generic arguments (e.g. int, MyOtherStruct). They can't be used through a generic parameter indirection (e.g. MyGenericJob<TContext>).

Function pointers are more restricted still: you can't use a generic delegate through a function pointer with Burst:

public delegate void MyGenericDelegate<TData>(ref TData data) where TData: struct;

var myGenericDelegate = new MyGenericDelegate<int>(MyIntDelegateImpl);
// Will fail to compile this function pointer.
var myGenericFunctionPointer = BurstCompiler.CompileFunctionPointer<MyGenericDelegate<int>>(myGenericDelegate);

This restriction is due to a limitation in how the .NET runtime interoperates with such delegates.
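One way to stay within this restriction is to declare a concrete, non-generic delegate for each type you need. The sketch below is an assumption about how that might look rather than a recommendation from the Burst documentation; the class, delegate and method names are hypothetical:

```csharp
using Unity.Burst;

// Hypothetical container class. Burst function pointers are compiled from
// static methods marked [BurstCompile] inside a [BurstCompile] type.
[BurstCompile]
public static class IntDelegates
{
    // Concrete (non-generic) delegate standing in for MyGenericDelegate<int>.
    public delegate void MyIntDelegate(ref int data);

    [BurstCompile]
    public static void MyIntDelegateImpl(ref int data)
    {
        data += 1;
    }

    // Compiles the static method into a Burst function pointer.
    public static FunctionPointer<MyIntDelegate> Compile()
    {
        return BurstCompiler.CompileFunctionPointer<MyIntDelegate>(MyIntDelegateImpl);
    }
}
```

The resulting function pointer can then be called through its Invoke property, for example with a local int passed by ref.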

SkipLocalsInit Attribute

In C# all local variables are initialized to zero by default. This is a great feature because it means an entire class of bugs surrounding undefined data disappears. But this can come at some cost to runtime performance, because initializing this data to zero is not free:

static unsafe int DoSomethingWithLUT(int* data);

static unsafe int DoSomething(int size)
{
    int* data = stackalloc int[size];

    // Initialize every field of data to be an incrementing set of values.
    for (int i = 0; i < size; i++)
    {
        data[i] = i;
    }

    // Use the data elsewhere.
    return DoSomethingWithLUT(data);
}

The x64 assembly for this is:

        push    rbp
        .seh_pushreg rbp
        push    rsi
        .seh_pushreg rsi
        push    rdi
        .seh_pushreg rdi
        mov     rbp, rsp
        .seh_setframe rbp, 0
        .seh_endprologue
        mov     edi, ecx
        lea     r8d, [4*rdi]
        lea     rax, [r8 + 15]
        and     rax, -16
        movabs  r11, offset __chkstk
        call    r11
        sub     rsp, rax
        mov     rsi, rsp
        sub     rsp, 32
        movabs  rax, offset burst.memset.inline.X64_SSE4.i32@@32
        mov     rcx, rsi
        xor     edx, edx
        xor     r9d, r9d
        call    rax
        add     rsp, 32
        test    edi, edi
        jle     .LBB0_7
        mov     eax, edi
        cmp     edi, 8
        jae     .LBB0_3
        xor     ecx, ecx
        jmp     .LBB0_6
.LBB0_3:
        mov     ecx, eax
        and     ecx, -8
        movabs  rdx, offset __xmm@00000003000000020000000100000000
        movdqa  xmm0, xmmword ptr [rdx]
        mov     rdx, rsi
        add     rdx, 16
        movabs  rdi, offset __xmm@00000004000000040000000400000004
        movdqa  xmm1, xmmword ptr [rdi]
        movabs  rdi, offset __xmm@00000008000000080000000800000008
        movdqa  xmm2, xmmword ptr [rdi]
        mov     rdi, rcx
        .p2align        4, 0x90
.LBB0_4:
        movdqa  xmm3, xmm0
        paddd   xmm3, xmm1
        movdqu  xmmword ptr [rdx - 16], xmm0
        movdqu  xmmword ptr [rdx], xmm3
        paddd   xmm0, xmm2
        add     rdx, 32
        add     rdi, -8
        jne     .LBB0_4
        cmp     rcx, rax
        je      .LBB0_7
        .p2align        4, 0x90
.LBB0_6:
        mov     dword ptr [rsi + 4*rcx], ecx
        inc     rcx
        cmp     rax, rcx
        jne     .LBB0_6
.LBB0_7:
        sub     rsp, 32
        movabs  rax, offset "DoSomethingWithLUT"
        mov     rcx, rsi
        call    rax
        nop
        mov     rsp, rbp
        pop     rdi
        pop     rsi
        pop     rbp
        ret

The important bit to note is the movabs rax, offset burst.memset.inline.X64_SSE4.i32@@32 line: Burst has had to inject a memset to zero out the data. In the above example the developer knows that the array will be entirely initialized by the following loop, but the compiler doesn't know that. To fix exactly this sort of problem, the Burst attribute Unity.Burst.CompilerServices.SkipLocalsInitAttribute can be placed on a method to tell the compiler that stack allocations within it do not have to be initialized to zero. Let's see that in action:

using Unity.Burst.CompilerServices;

static unsafe int DoSomethingWithLUT(int* data);

[SkipLocalsInit]
static unsafe int DoSomething(int size)
{
    int* data = stackalloc int[size];

    // Initialize every field of data to be an incrementing set of values.
    for (int i = 0; i < size; i++)
    {
        data[i] = i;
    }

    // Use the data elsewhere.
    return DoSomethingWithLUT(data);
}

And the assembly after adding the [SkipLocalsInit] on the method is:

        push    rbp
        .seh_pushreg rbp
        mov     rbp, rsp
        .seh_setframe rbp, 0
        .seh_endprologue
        mov     edx, ecx
        lea     eax, [4*rdx]
        add     rax, 15
        and     rax, -16
        movabs  r11, offset __chkstk
        call    r11
        sub     rsp, rax
        mov     rcx, rsp
        test    edx, edx
        jle     .LBB0_7
        mov     r8d, edx
        cmp     edx, 8
        jae     .LBB0_3
        xor     r10d, r10d
        jmp     .LBB0_6
.LBB0_3:
        mov     r10d, r8d
        and     r10d, -8
        movabs  rax, offset __xmm@00000003000000020000000100000000
        movdqa  xmm0, xmmword ptr [rax]
        mov     rax, rcx
        add     rax, 16
        movabs  rdx, offset __xmm@00000004000000040000000400000004
        movdqa  xmm1, xmmword ptr [rdx]
        movabs  rdx, offset __xmm@00000008000000080000000800000008
        movdqa  xmm2, xmmword ptr [rdx]
        mov     r9, r10
        .p2align        4, 0x90
.LBB0_4:
        movdqa  xmm3, xmm0
        paddd   xmm3, xmm1
        movdqu  xmmword ptr [rax - 16], xmm0
        movdqu  xmmword ptr [rax], xmm3
        paddd   xmm0, xmm2
        add     rax, 32
        add     r9, -8
        jne     .LBB0_4
        cmp     r10, r8
        je      .LBB0_7
        .p2align        4, 0x90
.LBB0_6:
        mov     dword ptr [rcx + 4*r10], r10d
        inc     r10
        cmp     r8, r10
        jne     .LBB0_6
.LBB0_7:
        sub     rsp, 32
        movabs  rax, offset "DoSomethingWithLUT"
        call    rax
        nop
        mov     rsp, rbp
        pop     rbp
        ret

Note that the call to memset is gone, because the developer has promised the compiler that skipping the zero-initialization is safe. This is a power user feature for experienced developers who are certain they won't run into undefined behaviour bugs as a result of this change.