2025-03-02

Performance optimization, and how to do it wrong

I recently tried to optimize convolutions using SIMD instructions, but what I thought would be a simple task ended up taking me days, with issue after issue popping up. Some of them make sense in hindsight, but others were utterly baffling. While the specific examples are for direct convolution, these considerations apply to pretty much any code with a hot loop.

Background

I work on burn and recently wanted to optimize direct convolution on the burn-ndarray CPU backend.
For convolutions you need to move a two-dimensional kernel across an input feature map and sum all the values across all input channels. This is repeated for each output channel. The input can have pixels of zero-padding around the actual data, and the kernel can move in a strided manner (e.g. two pixels at a time).

There are many algorithms with different tradeoffs, but I decided to go with direct convolution, since it has no memory overhead and is still very efficient when implemented correctly. The basic outline is a set of deeply nested loops, some bounds checks and a very frequently executed fused multiply-add (FMA) instruction. My initial implementation looked something like this1 (simplified):

iter_range_par!(0, batch_size * oc_blocks).for_each(|k| {
  for oh in 0..out_height {
    for ow_block in 0..ow_blocks {
      seq!(N in 0..8 {
        let mut acc~N = bias;
      });

      for ic in 0..in_channels {
        for kh in 0..k_height {
          if !in_bounds_h {
            continue;
          }

          for kw in 0..k_width {
            let f0 = vload(&weights[[ic, kh, kw, oc]]);
            seq!(N in 0..8 {
              if in_bounds_w {
                let i~N = splat(x[[ic, ih, iw + N]]);
                acc~N = E::vmuladd(simd, i~N, f0, acc~N);
              }
            });
          }
        }
      }

      seq!(N in 0..8 {
        if ow + N >= out_width {
          continue;
        }
        vstore(&mut out[[ow + N, oc]], acc~N);
      });
    }
  }
});
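
In case the seq! blocks look unfamiliar: judging by the `acc~N` syntax, this is the seq! macro from the seq_macro crate (an assumption on my part). Here is a tiny self-contained sketch of what it does:

// The body of seq! is repeated once per value of N, and `~N` is pasted into
// identifiers, so the accumulators become eight separate local variables,
// each of which can live in its own SIMD register.
use seq_macro::seq;

fn main() {
  let bias = 1.0f32;
  seq!(N in 0..8 {
    let mut acc~N = bias;
  });
  // Later seq! blocks can refer to the same identifiers:
  seq!(N in 0..8 {
    acc~N += N as f32;
  });
  seq!(N in 0..8 {
    println!("acc{} = {}", N, acc~N);
  });
}

The point of blocking like this is that eight independent accumulators give the CPU eight independent FMA dependency chains, which helps hide the FMA latency.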

In this implementation I use several techniques. In addition to SIMD loads and FMAs, I use the optimized loop order and register-blocking techniques from this paper (the register blocking is done with the `seq!` macro). I finished the implementation, executed a benchmark, and... it's slower. More than two times slower than a naive unvectorized2 implementation, in fact (~670ms vs ~300ms).

Starting to investigate

To investigate, I tried various profilers: cargo-flamegraph, samply and, after a lot of desperation, AMD μProf. After a few days of trying to get useful information out of these profilers (and getting μProf to work at all), I realized it wasn't getting me anywhere. The flamegraphs and hotspots just didn't seem to make any sense at all.

So what's the next step?

Reducing the problem

Ok, none of my attempts to profile led to any success. So let's try to reduce the code to only what's actually needed in the benchmark. The benchmark uses unpadded, unstrided, undilated and ungrouped convolutions, so I stripped all padding checks and all stride/dilation calculations - it was faster, but still slow.

There was one branch left to eliminate: the check for border pixels in the register-blocking loop.

Just to check, I shortened the loop to only consider pixels up to the last multiple of 8. This yields incorrect results, but should help with debugging the performance.

let ow_blocks = out_width / ow_b; // changed from `div_ceil`
//...

for ow_block in 0..ow_blocks {
  seq!(N in 0..8 {
    let mut acc~N = bias;
  });

  for ic in 0..in_channels {
    for kh in 0..k_height {
      for kw in 0..k_width {
        let f0 = vload(&weights[[ic, kh, kw, oc]]);
        seq!(N in 0..8 {
          let i~N = splat(x[[ic, ih, iw + N]]);
          acc~N = E::vmuladd(simd, i~N, f0, acc~N);
        });
      }
    }
  }

  seq!(N in 0..8 {
    vstore(&mut out[[ow + N, oc]], acc~N);
  });
}

Executing the benchmark, the code is now significantly faster on a single thread than the old code with multiple threads!

Benchmarking - conv2d-input_16x512x512_weight_16x3x3_stride_1
―――――――― Result ―――――――――
  Timing      full
  Samples     40
  Mean        205.12ms
  Variance    69.420µs
  Median      203.24ms
  Min         201.12ms
  Max         207.23ms
―――――――――――――――――――――――――

The problem seems to have been a combination of register spilling (as someone previously focused on GPUs, I was shocked to find modern CPUs only have 16 of them) and too many branches. This is also why the profilers didn't lead me anywhere useful: branching in modern CPUs is just too complicated to be meaningfully represented by a profiler hotspot.

This is probably the biggest takeaway from this article: branches are much worse than you think, because the CPU can't predict more than one3 branch per cycle4. A single if statement inside a loop is enough to stop any further instructions from being decoded in that cycle. Optimal performance requires 2 FMA instructions per cycle (they take 1 cycle with a 5 cycle latency, and Zen 4 has 2 FMA units), so having a branch next to every FMA massively hamstrings performance. This may be different on Zen 5, but remember, we still have other branches in addition to the ow bounds check that need predicting. So in practice it's worse than 50% of peak.
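
To put rough numbers on that (treating the Zen 4 figures above as assumptions):

// Back-of-envelope: peak is two 256-bit FMAs per cycle; if every FMA comes
// with a branch and only one branch can be predicted per cycle, we sustain
// roughly one FMA per cycle instead.
fn main() {
  let fma_units = 2u32;     // Zen 4: two FMA pipes
  let lanes = 8u32;         // f32 lanes per 256-bit register
  let flops_per_fma = 2u32; // one multiply + one add per lane

  let peak = fma_units * lanes * flops_per_fma; // 32 flops/cycle
  let branch_limited = lanes * flops_per_fma;   // ~16 flops/cycle
  println!("peak: {peak} flops/cycle, branch-limited: ~{branch_limited} flops/cycle");
}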

Alright, now we have a good place to start. Let's start adding things back again and see where the performance starts getting bad.

First, we need to deal with the remaining pixels after the register-blocking code. To do this, we're going to use a technique that we'll be using several more times:

Why have one loop, when you can have two?

I mentioned before that I shortened the loop to only deal with clean multiples of the register-blocking factor. So the way to deal with these remaining pixels is to just... add another loop.

We add a second unblocked loop that starts at the end of the first loop, and runs until the edge of the feature map. Since it's not unrolled, we don't need to add any bounds checks.

for ow_block in 0..ow_blocks {
  //...
}
for ow in ow_blocks * 8..out_width {
  let mut acc = bias;

  for ic in 0..in_channels {
    for kh in 0..k_height {
      for kw in 0..k_width {
        let f0 = vload(&weights[[ic, kh, kw, oc]]);
        let i0 = splat(x[[ic, ih, iw]]);
        acc = E::vmuladd(simd, i0, f0, acc);
      }
    }
  }

  vstore(&mut out[[ow, oc]], acc);
}

Running the benchmarks, it's still fast - yay! It's much more efficient to run two loops than to check if we're in bounds on every iteration.

Adding back the other variables

To add back padding, stride and dilation without tanking the performance again, I decided to use compile-time monomorphization to specialize the common cases of zero padding and unit stride/dilation. For stride and dilation I use a technique I saw in the original convolution implementation, added by Justin Moore, to enable auto-vectorization for unit-stride convolutions: the inner loop is extracted into a separate, inlined function, and an if-statement checks whether stride and dilation are all 1. Both branches call the same function, but in the first branch the compiler can constant propagate the known 1s into the inlined copy, eliminating the stride/dilation arithmetic. This trick is what allows unstrided convolution to be auto-vectorized in the original, non-SIMD implementation.

for ow_block in 0..ow_blocks {
  let ow = ow_block * ow_b + ow_start;

  // Trick the compiler into constant propagating stride/dilation
  #[allow(clippy::if_same_then_else)]
  if (1, 1, 1, 1) == (stride_h, stride_w, dilate_h, dilate_w) {
    conv2d_inner(
      simd, &x, &weights, &mut out, bias, oh, ow, oc, ic_off, stride_h, stride_w,
      dilate_h, dilate_w, k_height, k_width, pad_h, pad_w,
    );
  } else {
    conv2d_inner(
      simd, &x, &weights, &mut out, bias, oh, ow, oc, ic_off, stride_h, stride_w,
      dilate_h, dilate_w, k_height, k_width, pad_h, pad_w,
    );
  }
}

Padding support is added back via a const generic bool: when it is false, the padding is set to 0, which allows the compiler to, once again, constant propagate it.

fn run_conv2d</*...*/, const PAD: bool>(/*...*/) {
  //...
  if !PAD {
    pad_h = 0;
    pad_w = 0;
  }
}
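
The reason this works might be clearer with a self-contained toy (names made up, not the actual burn code):

// Toy version of the const-generic trick: when PAD is false, `pad` is forced
// to a compile-time 0, so the compiler can fold the checks away entirely in
// that instantiation.
fn sum_interior<const PAD: bool>(data: &[f32], mut pad: usize) -> f32 {
  if !PAD {
    pad = 0;
  }
  let mut acc = 0.0;
  for (i, x) in data.iter().enumerate() {
    // With PAD = false this condition is always true and disappears.
    if i >= pad && i < data.len() - pad {
      acc += *x;
    }
  }
  acc
}

fn main() {
  let data = [1.0f32; 16];
  // Dispatch once at the call site instead of branching in the hot loop:
  println!("{}", sum_interior::<true>(&data, 2));  // 12
  println!("{}", sum_interior::<false>(&data, 2)); // 16
}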

Nice and easy.

Let's run the benchmark again!

Benchmarking - conv2d-input_16x512x512_weight_16x3x3_stride_1
―――――――― Result ―――――――――
  Timing      full
  Samples     40
  Mean        8.136s (+3868%)
  Variance    75.115µs
  Median      8.042s (+3861%)
  Min         8.020s (+3890%)
  Max         8.341s (+3929%)
―――――――――――――――――――――――――

Oh. Oh dear. What happened?

When the compiler gets it wrong

To explain what just happened, I need to add another small background detail I didn't mention before. To use modern SIMD features, the code relies on runtime feature selection with pulp. The way this works is that pulp annotates a function with something like #[target_feature(enable = "avx2")], based on the features detected at runtime. This tells the compiler it's allowed to use avx2 instructions in that function, even if the compilation target wouldn't normally include avx2. However, only functions that get inlined into the annotated function inherit these features; non-inlined function calls do not (this is foreshadowing).
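
For what it's worth, here is a minimal sketch of that mechanism written against plain std (the general shape, not pulp's actual API; all function names are made up):

// Detect CPU features at runtime, then rely on #[target_feature] plus
// inlining for everything inside the dispatched path.
#[cfg(target_arch = "x86_64")]
fn conv2d_dispatch() {
  if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
    // SAFETY: the required features were just checked at runtime.
    unsafe { conv2d_avx2() }
  } else {
    conv2d_fallback()
  }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn conv2d_avx2() {
  // Everything that gets #[inline(always)]'d into this function is compiled
  // with avx2/fma enabled; anything that is *not* inlined is compiled with
  // the crate's baseline features instead.
  conv2d_kernel();
}

#[inline(always)]
fn conv2d_kernel() {
  // hot loops would live here
}

fn conv2d_fallback() {
  // scalar fallback path
}

fn main() {
  #[cfg(target_arch = "x86_64")]
  conv2d_dispatch();
}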

This is where samply starts actually being useful. Running it lets me see the assembly for each function and find the hotspots, and this time they are actually meaningful! samply tells me I'm spending all my time on the line that calls the FMA and in the FMA itself. So I take a look at the assembly and... oh no!

movaps xmm6, xmmword [rsp + 0x2e0]
movaps xmm7, xmmword [rsp + 0x2f0]
movaps xmmword [rsp + 0x170], xmm15
movaps xmmword [rsp + 0x160], xmm14
movaps xmm0, xmmword [rsp + 0x180]
movaps xmmword [rsp + 0xb0], xmm0
movaps xmm0, xmmword [rsp + 0x190]
movaps xmmword [rsp + 0xa0], xmm0
mov rcx, rbx
lea rbx, qword [rsp + 0x320]
mov rdx, rbx
mov r8, rdi
mov r9, r14
call 0x4edb0

  • Why is it using SSE (xmm) registers instead of AVX (ymm) registers?
  • Why is it spilling them onto the stack?
  • Why is it using call to execute the _mm256_fmadd_ps intrinsic?

Turns out, these things are linked. What follows is what I think is a pretty accurate guess of what happened here.

See, the compiler has a size limit for inlining functions. #[inline(always)] tells the compiler to ignore that limit, and almost all of my functions were marked #[inline(always)]. However, the outermost function was not.

These are the steps I think happened next:

  • Adding these inlined branches caused the size of the function to exceed Rust's inlining limit.
  • Rust outlined (is that a word?) the top-level function from pulp's wrapper function, the one that is marked with #[target_feature].
  • The compiler now treats my function as a regular function, causing it to fall back to the default feature set (x86-64-v1).
  • This means 256-bit registers are no longer available in my function.
  • Since _mm256_fmadd_ps requires FMA/AVX target features and 256-bit registers that the baseline doesn't include, the compiler can no longer inline the intrinsic; it emits an actual call and moves the data via the stack. This is slow. Very slow.

I'm somewhat unsure about that last step; maybe someone with more knowledge of compiler internals can enlighten me on the actual reason the intrinsic is no longer inlined.

Fortunately, the solution was much simpler than this chain of events: Annotate the top-level function with #[inline(always)].
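
Concretely, that just means something like this (I'm assuming run_conv2d from earlier is the outermost function in question):

#[inline(always)] // keep the whole kernel inlined into pulp's #[target_feature] wrapper
fn run_conv2d</*...*/, const PAD: bool>(/*...*/) {
  //...
}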

vbroadcastss ymm9, dword [r12 + r11 * 1]
lea rcx, qword [r11 + r12 * 1]
vfmadd231ps ymm6, ymm8, ymm10

Much better. And the benchmark?

Benchmarking - conv2d-input_16x512x512_weight_16x3x3_stride_1
―――――――― Result ―――――――――
  Timing      full
  Samples     40
  Mean        230.12ms
  Variance    69.420µs
  Median      232.24ms
  Min         224.12ms
  Max         236.23ms
―――――――――――――――――――――――――

Nice!

Finishing up the optimizations

Performance was good for unpadded convolutions, but would still have been lackluster for padded ones, since we'd need to check whether we're in the padding region on every single loop iteration. To solve this, we can use the same technique we used for out_width earlier: all pixels that are more than the padding away from the edges are guaranteed to be in bounds, so we can run one loop from pad_h to out_height - pad_h and from pad_w to out_width - pad_w without bounds checks, then a second loop for the border pixels that does do bounds checks. This is much faster than checking every pixel, since most pixels are always in bounds.
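
The interior loop then only covers the guaranteed-in-bounds region, roughly like this for the unit-stride case (a sketch using the variable names from above; the real code still does the register blocking over ow), while the border pixels are handled separately as in the snippet below:

// Interior region: every output pixel at least `pad` away from the edges only
// reads valid input, so no padding checks are needed here.
for oh in pad_h..out_height.saturating_sub(pad_h) {
  for ow in pad_w..out_width.saturating_sub(pad_w) {
    //...
  }
}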

if (pad_h, pad_w) != (0, 0) {
  let v_borders = (0..pad_h)
    .chain(out_height.saturating_sub(pad_h)..out_height)
    .cartesian_product(0..out_width);
  let h_borders = (0..out_height)
    .cartesian_product((0..pad_w).chain(out_width.saturating_sub(pad_w)..out_width));

  for (oh, ow) in v_borders.chain(h_borders) {
    //...
  }
}

Final thoughts

Modern CPUs are weird, and performance is not always obvious. Inlining is fragile, and adding a single line of code can completely change the way your program is compiled without you even noticing. Profilers aren't always helpful, especially when your problem is more complex than something like calling a slow function or allocating memory too often. My tip is to figure out just when things start getting bad, and maybe learn some basic assembly so you can spot things like way too many stack loads/stores (which indicate register spilling).

I hope this helps someone deal with their performance issues a little bit more quickly than I did.

Also big shout out once again to samply. Even when the performance data didn't mean much, being able to easily view assembly for any given function was very useful.

The final version of the code for this implementation can be found here. And just for fun, here is the final benchmark after all optimizations, using multi-threading:

Benchmarking - conv2d-input_16x512x512_weight_16x3x3_stride_1
―――――――― Result ―――――――――
  Timing      full
  Samples     40
  Mean        42.731ms
  Variance    5.115µs
  Median      42.906ms
  Min         38.162ms
  Max         47.554ms
―――――――――――――――――――――――――