Moving forward with optimizing and uglifying for loops, the next technique I want to write about here is loop tiling, also known as loop blocking. If you’re coding a program that relies heavily on for loops, this method can greatly improve its performance. As the name suggests, the technique divides large loops into smaller tiles or blocks that fit better into the processor cache. The problem with large loops over big data sets is that their working set quickly outgrows our precious CPU cache, which is crucial for fast processing. As you know, the CPU cache is not directly manageable from our code, but if our loops reuse the data that is already cached instead of constantly evicting it, we can increase the processing speed. I’d like to explore the implementation and benefits of this technique with a few simple examples in C++.
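Before getting to the real examples, here is the general shape of the transformation in a minimal sketch. I’m using an out-of-place matrix transpose as a toy case (the names M, T, and TILE are mine, and the tile size is only a starting guess to tune on your own machine): an untiled transpose writes T column by column, dragging a new cache line in on almost every iteration, while the tiled version keeps both matrices inside a small, cache-sized window.

#include <vector>
#include <algorithm>

int main() {
    const int n = 1024;
    const int TILE = 32; // tile edge; tune for your cache

    std::vector<std::vector<int>> M(n, std::vector<int>(n));
    std::vector<std::vector<int>> T(n, std::vector<int>(n));

    // Process one TILE x TILE block at a time so both M and T
    // touch only a cache-sized window of memory
    for (int i0 = 0; i0 < n; i0 += TILE)
        for (int j0 = 0; j0 < n; j0 += TILE)
            for (int i = i0; i < std::min(i0 + TILE, n); i++)
                for (int j = j0; j < std::min(j0 + TILE, n); j++)
                    T[j][i] = M[i][j];
}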

Nested Loops

Example 1: Matrix Multiplication

Matrix multiplication is a classic and practical example where this technique comes in handy. Let’s say we have two n x n square matrices, A and B, and we want to find their product. Let’s code this task (C = A.B) in an intuitive, naive fashion with three nested for loops:

#include <vector>

// Define the matrices using the std::vector container
int n = 1000;
std::vector<std::vector<int>> A(n, std::vector<int>(n));
std::vector<std::vector<int>> B(n, std::vector<int>(n));
std::vector<std::vector<int>> C(n, std::vector<int>(n)); // zero-initialized

// Original loop
for(int i=0;i<n;i++) {
    for(int j=0;j<n;j++) {
        for(int k=0;k<n;k++) {
            C[i][j]+=A[i][k]*B[k][j];
        }
    }
}

The problem with this code is its memory access pattern: B is read column by column, so consecutive iterations of the k loop touch memory locations that are far apart, and there is a high probability of a cache miss in each iteration of the outer loops. A cache miss occurs when the data the CPU needs is not in the cache, forcing the CPU to fetch it from the farther and slower RAM. Now, let’s break the loop up with loop tiling.

// Loop-tiling version
const int block = 10; // block size; chosen so it evenly divides n = 1000
for(int i0=0;i0<n;i0+=block){
    for(int j0=0;j0<n;j0+=block){
        for(int k0=0;k0<n;k0+=block){
            for(int i=i0;i<i0+block;i++){
                for(int j=j0;j<j0+block;j++){
                    for(int k=k0;k<k0+block;k++){
                        C[i][j]+=A[i][k]*B[k][j];
                    }
                }
            }
        }
    }
}

Note that I used the std::vector container class to define the matrices. Using plain 2D arrays for large matrices is bad practice: a 1000 x 1000 int array declared on the stack can easily overflow it and crash the program somewhere along the way. With std::vector, memory allocation and deallocation are handled automatically, which makes it more convenient and safer than managing memory manually with dynamic allocation.
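As a side note, a vector of vectors stores each row as a separate heap allocation, so the rows themselves are not contiguous in memory. If you want to squeeze out even more locality, a common alternative is a single flat vector with manual indexing. Here’s a minimal sketch of the naive multiplication in that layout (not part of the original example):

#include <vector>

int n = 1000;
// One contiguous allocation instead of n separate rows
std::vector<int> A(n * n), B(n * n), C(n * n);

// Element (i, j) lives at offset i * n + j (row-major order)
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i * n + j] += A[i * n + k] * B[k * n + j];

The tiled version works exactly the same way; only the indexing changes.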

Checking how many cache misses occur in your code is also possible. You can use the PAPI library to initialize hardware performance counters and measure cache misses directly (see the sketch after the timing code below). For the sake of simplicity, I benchmark my code blocks using the chrono library. This is how:

#include <iostream>
#include <chrono>

int main() {
    // Start the timer
    auto start = std::chrono::high_resolution_clock::now();

    // Define and run a block of code to be benchmarked
    // ...
    // ...

    // Stop the timer
    auto stop = std::chrono::high_resolution_clock::now();

    // Calculate the elapsed time in milliseconds
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);

    // Print the elapsed time
    std::cout << "Elapsed time: " << elapsed.count() << " ms" << std::endl;
}
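And if you do want actual cache-miss counts, below is a rough sketch of how that looks with PAPI’s low-level API. This assumes PAPI is installed on your system and that the PAPI_L2_DCM preset (L2 data cache misses) is available on your CPU; error handling is omitted for brevity.

#include <papi.h>
#include <cstdio>

int main() {
    // Initialize the PAPI library
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;

    // Create an event set that counts L2 data cache misses
    int eventSet = PAPI_NULL;
    PAPI_create_eventset(&eventSet);
    PAPI_add_event(eventSet, PAPI_L2_DCM);

    long long misses = 0;
    PAPI_start(eventSet);

    // ... run the loop you want to profile here ...

    PAPI_stop(eventSet, &misses);
    std::printf("L2 data cache misses: %lld\n", misses);
}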

Back to the matrix example: we’ve divided the loops into tiles of size block x block. By keeping each tile small enough to fit into the cache, we can reduce cache misses and improve performance. Here I’ve chosen a block size of 10 so that it evenly divides n, but the optimal block size may vary depending on the specific hardware and software configuration.
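Since the best tile size is hardware-dependent, it’s worth sweeping a few candidates with the timer above. Here’s a minimal sketch; the candidate sizes are arbitrary picks of mine that all divide n = 1000 evenly, so no edge handling is needed:

#include <chrono>
#include <iostream>
#include <vector>

int main() {
    const int n = 1000;
    std::vector<std::vector<int>> A(n, std::vector<int>(n, 1));
    std::vector<std::vector<int>> B(n, std::vector<int>(n, 1));

    // Try several block sizes and time each tiled multiplication
    const int candidates[] = {8, 10, 20, 40, 50, 100};
    for (int block : candidates) {
        std::vector<std::vector<int>> C(n, std::vector<int>(n));
        auto start = std::chrono::high_resolution_clock::now();
        for (int i0 = 0; i0 < n; i0 += block)
            for (int j0 = 0; j0 < n; j0 += block)
                for (int k0 = 0; k0 < n; k0 += block)
                    for (int i = i0; i < i0 + block; i++)
                        for (int j = j0; j < j0 + block; j++)
                            for (int k = k0; k < k0 + block; k++)
                                C[i][j] += A[i][k] * B[k][j];
        auto stop = std::chrono::high_resolution_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
        std::cout << "block " << block << ": " << ms.count() << " ms\n";
    }
}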

Example 2: Image Convolution

Image convolution is another common computation that can benefit from loop tiling. Let’s say we have an image I of dimensions w x h, and we want to apply a convolution kernel K of size k x k to each pixel in the image. The naive implementation of convolution involves four nested loops:

// Assumes I is padded to (h + k - 1) x (w + k - 1) so the reads stay in bounds
for(int y=0;y<h;y++)
    for(int x=0;x<w;x++)
        for(int i=0;i<k;i++)
            for(int j=0;j<k;j++)
                conv[y][x]+=I[y+i][x+j]*K[i][j];

Like matrix multiplication, this implementation suffers from poor cache performance due to the large number of cache misses. To improve performance, we can divide the loops into smaller tiles that fit into the cache. Here’s an example implementation of loop tiling for image convolution:

const int B=16; // block size

for(int y0=0;y0<h;y0+=B)
    for(int x0=0;x0<w;x0+=B)
        for(int y=y0;y<std::min(y0+B,h);y++)
            for(int x=x0;x<std::min(x0+B,w);x++)
                for(int i=0;i<k;i++)
                    for(int j=0;j<k;j++)
                        conv[y][x]+=I[y+i][x+j]*K[i][j];

In this implementation, we’ve divided the loops into tiles of size B x B, where B is the chosen block size. We’ve also added std::min bounds to handle the edge cases where the tile size doesn’t evenly divide the image dimensions. By keeping the tiles small enough to fit into the cache, we can reduce cache misses and improve performance.

In our tests, we found that the tiled implementation was up to 4x faster than the untiled one, depending on the block size and hardware configuration.