Moving forward with optimizing and uglifying for loops, the next technique I want to write about here is loop tiling, also known as loop blocking. If you’re coding a program that relies heavily on for loops, this method can greatly improve its performance. As the name suggests, the technique divides large loops into smaller tiles, or blocks, that fit better into the processor’s cache. The problem with large loops and heavy code blocks is that their data quickly outgrows our precious CPU cache, which is crucial for fast processing. As you know, the CPU’s cache can’t be managed programmatically, but if our code works on a smaller chunk of data at a time, more of that data stays in the cache and processing speed goes up. I’d like to explore the implementation and benefits of this technique with a few simple examples in C++.
Example 1: Matrix Multiplication
Matrix multiplication is a classic and practical example where this technique comes in handy. Suppose we have two n × n square matrices, A and B, and we want to find their product. Let’s code this task (C = A·B) in an intuitive, naive fashion with three nested for loops:
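Here’s a minimal sketch of what that naive triple loop looks like (the Matrix alias and function names are my own, and the matrices are stored as vectors of vectors, as discussed below):

```cpp
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Naive multiplication: c = a * b for n x n matrices.
void multiplyNaive(const Matrix& a, const Matrix& b, Matrix& c, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += a[i][k] * b[k][j];  // b is walked column by column
            c[i][j] = sum;
        }
}
```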
The problem with this code is that it uses the cache poorly, and there is a high probability of a cache miss in each iteration of the inner loop: B is traversed column by column, so for large n each access lands on a cache line that gets evicted before it can be reused. A cache miss occurs when the data the CPU asks for is no longer in the cache, because it has been overwritten by other data, and the CPU has to fetch it again from the farther and slower main memory (RAM). Now, let’s break the loops into blocks with loop tiling.
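A sketch of the tiled version: three outer loops walk the matrices block by block, and the three inner loops multiply the current pair of blocks (B is the block size; std::min keeps the last, partial blocks inside the matrix):

```cpp
#include <algorithm>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Tiled multiplication: c = a * b, computed in B x B blocks so the
// working set of the inner loops fits in cache. c must start zeroed,
// because partial products are accumulated across the kk blocks.
void multiplyTiled(const Matrix& a, const Matrix& b, Matrix& c, int n, int B) {
    for (int ii = 0; ii < n; ii += B)
        for (int jj = 0; jj < n; jj += B)
            for (int kk = 0; kk < n; kk += B)
                for (int i = ii; i < std::min(ii + B, n); ++i)
                    for (int j = jj; j < std::min(jj + B, n); ++j) {
                        double sum = 0.0;
                        for (int k = kk; k < std::min(kk + B, n); ++k)
                            sum += a[i][k] * b[k][j];
                        c[i][j] += sum;  // accumulate across kk blocks
                    }
}
```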
Note that I used the vector container class to define the matrices. Using plain 2D arrays for large matrices is bad practice, since a large array can easily exhaust the available stack memory and crash the program somewhere along the way with an allocation fault. With vector, memory allocation and deallocation are handled automatically by the std::vector class, which makes it more convenient and safer than managing memory manually with dynamic allocation.
It’s also possible to check how many cache misses your code incurs: the PAPI library lets you initialize hardware performance counters and measure them directly. For the sake of simplicity, though, I benchmark my code blocks with the chrono library. This is how:
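A sketch of the timing harness, reusing the two functions above (the matrix size and block size are just example values):

```cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>

int main() {
    const int n = 1024;
    Matrix a(n, std::vector<double>(n, 1.0));
    Matrix b(n, std::vector<double>(n, 2.0));
    Matrix c(n, std::vector<double>(n, 0.0));

    auto t0 = std::chrono::steady_clock::now();
    multiplyNaive(a, b, c, n);
    auto t1 = std::chrono::steady_clock::now();
    std::cout << "naive: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms\n";

    for (auto& row : c) std::fill(row.begin(), row.end(), 0.0);  // reset the result

    t0 = std::chrono::steady_clock::now();
    multiplyTiled(a, b, c, n, 16);
    t1 = std::chrono::steady_clock::now();
    std::cout << "tiled: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms\n";
}
```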
In this implementation, we’ve divided the loops into tiles of size B x B, where B is a chosen block size. By keeping the tiles small enough to fit into the cache, we can reduce cache misses and improve performance. In our example, we’ve chosen a block size of 16: the three active 16 x 16 blocks of doubles occupy about 3 x 16 x 16 x 8 bytes ≈ 6 KB, comfortably inside a typical 32 KB L1 data cache. The optimal block size, however, may vary depending on the specific hardware and software configuration.
Example 2: Image Convolution
Image convolution is another common computation that can benefit from loop tiling. Let’s say we have an image I of dimensions w x h, and we want to apply a convolution kernel K of size k x k to each pixel in the image. The naive implementation of convolution involves four nested loops:
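Something along these lines, assuming a single-channel float image stored row by row and a clamp-to-edge border (the Image alias and names are illustrative):

```cpp
#include <algorithm>
#include <vector>

using Image = std::vector<std::vector<float>>;

// Naive convolution: two loops over the pixels, two over the kernel.
// Out-of-bounds reads are clamped to the nearest edge pixel.
Image convolveNaive(const Image& img, const Image& K, int w, int h, int k) {
    Image out(h, std::vector<float>(w, 0.0f));
    const int r = k / 2;  // kernel radius
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float sum = 0.0f;
            for (int ky = 0; ky < k; ++ky)
                for (int kx = 0; kx < k; ++kx) {
                    const int sy = std::clamp(y + ky - r, 0, h - 1);
                    const int sx = std::clamp(x + kx - r, 0, w - 1);
                    sum += img[sy][sx] * K[ky][kx];
                }
            out[y][x] = sum;
        }
    return out;
}
```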
Like matrix multiplication, this implementation suffers from poor cache performance due to the large number of cache misses. To improve performance, we can divide the loops into smaller tiles that fit into the cache. Here’s an example implementation of loop tiling for image convolution:
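A sketch of the tiled variant, reusing the Image alias from above: the pixel loops are split into B x B tiles, and std::min trims the tiles that hang over the right and bottom edges of the image:

```cpp
// Tiled convolution: identical arithmetic, but pixels are visited
// tile by tile so each tile's pixels and the kernel stay cache-resident.
Image convolveTiled(const Image& img, const Image& K,
                    int w, int h, int k, int B) {
    Image out(h, std::vector<float>(w, 0.0f));
    const int r = k / 2;
    for (int ty = 0; ty < h; ty += B)
        for (int tx = 0; tx < w; tx += B)
            for (int y = ty; y < std::min(ty + B, h); ++y)
                for (int x = tx; x < std::min(tx + B, w); ++x) {
                    float sum = 0.0f;
                    for (int ky = 0; ky < k; ++ky)
                        for (int kx = 0; kx < k; ++kx) {
                            const int sy = std::clamp(y + ky - r, 0, h - 1);
                            const int sx = std::clamp(x + kx - r, 0, w - 1);
                            sum += img[sy][sx] * K[ky][kx];
                        }
                    out[y][x] = sum;
                }
    return out;
}
```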
In this implementation, we’ve divided the loops into tiles of size B x B, where B is a chosen block size. We’ve also added some additional logic to handle the edge cases where the tile size doesn’t evenly divide the image size. By keeping the tiles small enough to fit into the cache, we can reduce cache misses and improve performance.
In our tests, we found that the tiled implementation was up to 4x faster than the untiled implementation, depending on the block size and hardware configuration.