So, a few weeks ago, I started working on a research project that involved writing a chemical reaction simulator with tons of loops and calculations. The complexity of the code grew exponentially as the problem’s variables and dimensions increased. I tried optimizing the code itself, yet I wasn’t satisfied enough. So as any intelligent engineer would do, I decided to dive deep and discover what exactly happens to the code we write in a high-level language. Looking at the compiled assembly code, I realized I could make the code more efficient.

Nested Loops

The number of instructions the CPU has to run for each task is one of the main factors in the performance of the code. We can use techniques like loop unrolling and loop tiling to minimize the number of instructions for a loop.

Part 1: Loop unrolling

Loop unrolling is a technique where you manually duplicate the loop body to reduce the number of times the loop control instructions in loop header are executed. By doing so, the code could run faster due to fewer instructions being executed by the processor. Take this C++ code for instance:

1
2
3
4
5
6
7
int arr[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
int sum;

// Original loop
for (int i = 0; i < 10; i++) {
    sum += arr[i];
}

The assembly1 code of the for loop is: ‍

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
L_1:					; loop header for i < 10 (termination condition check)
	ldr	w8, [sp, #4]
	subs	w8, w8, #10
	b.ge	L_4			; branch to L4
	b	L_2			; branch to L2

L_2:					; loop body
	ldrsw	x9, [sp, #4]		; load i to register x9
	add	x8, sp, #16		; load arr[i] to w9
	ldr	w9, [x8, x9, lsl #2]
	ldr	w8, [sp, #8]		; sum +=
	add	w8, w8, w9
	str	w8, [sp, #8]
	b	L_3			; branch to L3

L_3:                    		; loop header for i++
	ldr	w8, [sp, #4]
	add	w8, w8, #1
	str	w8, [sp, #4]
	b	L_1			; branch to L1, the termination check

L_4:					; rest of the program

The first block labeled L_1 checks if the condition i < 10 is met by subtracting the value stored for i by 10 and checking if it’s greater than (implicit) zero. In that case it will branch to L_4 and continue with the rest of the code. L_2 block is the code necessary for our task which is to add the value of entry i of our array to sum.

BTW, you can run gcc -S your_file.cpp to compile the assembly code.

We have 8 instructions in the header and 7 in the body of our loop. To run this loop our CPU has to run 150 instructions in total (except the last 3 in L_1 after the last iteration, before jumping to L_4).

Now let’s take a look at this version:

1
2
3
4
5
// Loop unrolled by a factor of 2
for (int i = 0; i < 10; i+=2) {
    sum += arr[i];
    sum += arr[i+1];
}

It is ugly. Yes. Here we add not only the ith entry of arr to sum but the next one too. We’ve unrolled the loop by a factor of 2. Let’s see the instructions:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
L_1:					; loop header for i < 10 (termination condition check)
	ldr	w8, [sp, #4]
	subs	w8, w8, #10
	b.ge	L_4			; branch to L4
	b	L_2			; branch to L2

L_2:					; loop body

	ldrsw	x9, [sp, #4]		; load i to register x9
	add	x8, sp, #16		; load arr[i] to w10
	ldr	w10, [x8, x9, lsl #2]
	ldr	w9, [sp, #8]		; sum +=
	add	w9, w9, w10
	str	w9, [sp, #8]
	ldr	w9, [sp, #4]		; load i+1 to register w9
	add	w9, w9, #1		; load arr[i+1] to w9
	ldr	w9, [x8, w9, sxtw #2]
	ldr	w8, [sp, #8]		; sum +=
	add	w8, w8, w9
	str	w8, [sp, #8]
	b	L_3			; branch to L3

L_3:                    		; loop header for i+2
	ldr	w8, [sp, #4]
	add	w8, w8, #2
	str	w8, [sp, #4]
	b	L_1			; branch to L1, the termination check

L_4:					; rest of the program

We’re looping through L_2 only 5 times instead of 10, but each iteration of the loop does two things instead of one. The differences are from lines 15 to 20 which is for adding the next entry to sum, and on line 25 to increment i by 2 after each iteration. This code runs (13 + 8) * 5 = 105 instructions instead of 150 and we expect ~66% improvement.

The issue on the other hand (other than the obvious ugly less-readable code) is that we can’t always be sure the number of iterations is divisible by 2 or other factors. The worst case is when the length of our iteration is a prime number. One workaround for this could be checking the modulus of the factor of unrolling and using the corresponding one.

I’ll talk about loop tiling in the next part. Now that I mentioned modulo maybe I write about the costs of different types of instructions too.


  1. This is an arm64 assembly compiled with clang 1400. ↩︎