So, a few weeks ago, I started working on a research project that involved writing a chemical reaction simulator with tons of loops and calculations. The complexity of the code grew exponentially as the problem’s variables and dimensions increased. I tried optimizing the code itself, yet I wasn’t satisfied enough. So as any intelligent engineer would do, I decided to dive deep and discover what exactly happens to the code we write in a high-level language. Looking at the compiled assembly code, I realized I could make the code more efficient.
The number of instructions the CPU has to run for each task is one of the main factors in the performance of the code. We can use techniques like loop unrolling and loop tiling to minimize the number of instructions for a loop.
Part 1: Loop unrolling
Loop unrolling is a technique where you manually duplicate the loop body to reduce the number of times the loop control instructions in loop header are executed. By doing so, the code could run faster due to fewer instructions being executed by the processor. Take this C++ code for instance:
|
|
The assembly1 code of the for
loop is:
‍
|
|
The first block labeled L_1
checks if the condition i < 10
is met by subtracting the value stored for i
by 10
and checking if it’s greater than (implicit) zero. In that case it will branch to L_4
and continue with the rest of the code. L_2
block is the code necessary for our task which is to add the value of entry i
of our array
to sum
.
BTW, you can run gcc -S your_file.cpp
to compile the assembly code.
We have 8 instructions in the header and 7 in the body of our loop. To run this loop our CPU has to run 150 instructions in total (except the last 3 in L_1
after the last iteration, before jumping to L_4
).
Now let’s take a look at this version:
|
|
It is ugly. Yes. Here we add not only the i
th entry of arr
to sum
but the next one too. We’ve unrolled the loop by a factor of 2. Let’s see the instructions:
|
|
We’re looping through L_2
only 5 times instead of 10, but each iteration of the loop does two things instead of one. The differences are from lines 15 to 20 which is for adding the next entry to sum
, and on line 25 to increment i
by 2 after each iteration. This code runs (13 + 8) * 5 = 105 instructions instead of 150 and we expect ~66% improvement.
The issue on the other hand (other than the obvious ugly less-readable code) is that we can’t always be sure the number of iterations is divisible by 2 or other factors. The worst case is when the length of our iteration is a prime number. One workaround for this could be checking the modulus of the factor of unrolling and using the corresponding one.
I’ll talk about loop tiling in the next part. Now that I mentioned modulo maybe I write about the costs of different types of instructions too.
This is an arm64 assembly compiled with clang 1400. ↩︎