## Efficient convolution of multi-dimensional data

For the first real content post I will introduce a method for improving the memory bandwidth efficiency when performing basic convolution operations on multidimensional data (eg. 2D/3D/4D datasets). I presented the basic method for this as a side note in the publication below, but without much technical details -- which we can look at in this post instead.

M. Broxvall, K. Emilsson, P. Thunberg. Fast GPU Based Adaptive Filtering of 4D Echocardiography. IEEE Transactions on Medical Imaging , 2011, DOI 10.1109/TMI.2011.2179308

I will assume that you know how a convolution operation is defined and the purposes it has in image processing. Furthermore, I will assume that the convolution filter size (to avoid confusion I will call convolution kernels as convolution filters as to avoid a confusion with the notion of kernels from OpenCL) is much smaller than the image size that you are trying to convolve.

Assume that we have the convolution filter $$f$$ of size $$\vec k$$, an image $$M$$ and want to calculate the convolution $$f * M$$. A typical approach to this would be to launch one OpenCL kernel for each output point $$\mbox{out}$$ and to perform the computation below:

$$\mbox{out}(\vec p) = 0$$
for $$j_1$$ = 0 ... $$k_1$$ do
...
for $$j_n$$ = 0 to $$k_n$$ do
$$\mbox{out}(\vec p) \leftarrow \mbox{out}(\vec p) + f(\vec j) M(\vec p + \vec j)$$

With this formulation we will need to perform $$\prod_i k_i$$ multipllications and additions, as well as the same number of memory fetch operations reading from the input image $$M$$. If we would perform, for instance, a convolution of size 10 over a 3D dataset we thus require 1000 memory fetches and 1000 floating point multiply-add operations for each generated output.

The multiply-add operations can often be performed faster than the equivalent two floating point operations by using dedicated hardware, and is exposed in the OpenCL standard through the operations mad or mad24.

Now, assuming that our hardware does not cache the input image we will have a problem. This is the case on many GPUs when you are using global memory buffers as opposed to OpenCL image objects + samplers. The reason for this has to do with how the hardware is optimised to texture memory accesses and a bit outside the scope of this post.

Although modern GPUs have high bandwidth to onboard memory on the order of 160 GB/s (ATI 6950) they have even higher floating point capacities (2.25Tflops) the convolution operation above is memory bound. Assuming that the $$out$$ and $$f$$ variables are stored in on-board memory the naive  algorithm above would be capable of at most $$4 \times 10^{10}$$ steps of the innermost loop and thus only utilize 1-2% of the total computational capacity.

In order to optimize on this and gain better performance we note that the same input image samples will contribute to many different output values, but will be multiplied with different filter coefficients before doing so. An obvious interpolation is thus to re-use the same input image values and use them to compute multiple output values.

The extreme case of this would be to load the whole image into high-speed memory (GPU registers or shared memory) -- but this would obviously not be possible for anything but trivial image sizes. However, we can do this by redesigning our compute kernels to compute $$k_o$$ number of outputs for each kernel and to re-use the same input image values for multiple output computations. We extend here the convolution filter $$f$$ to contain zeroes at all points outside the original convolution filter (in the code we can do this more flexible).

$$\mbox{out}(\vec p + \overline{(0,...,0,0)}) = 0$$
...
$$\mbox{out}(\vec p + \overline{(0,...,0,k_o)}) = 0$$
for $$j_1$$ = 0 ... $$k_1$$ do
...
for $$j_n$$ = 0 to $$k_{n-1}$$ do
for $$j_n$$ = 0 to $$k_n + k_o - 1$$ do
$$v \leftarrow M(\vec p+\vec j)$$
$$\mbox{out}(\vec p + \overline{(0,...,0,0)}) \leftarrow \mbox{out}(\vec p + \overline{(0,...,0,0)}) + v f(j - \overline{(0,...,0,0)})$$
...
$$\mbox{out}(\vec p + \overline{(0,...,0,k_o)}) \leftarrow \mbox{out}(\vec p + \overline{(0,...,0,0)}) + v f(j - \overline{(0,...,0,k_o)})$$

The above code requires $$k_1 ... k_n-1 (k_n + k_o - 1)$$ memory accesses to compute $$k_o$$ number of outputs. We thus gain a (theoretical) speed up of a factor of $$k_n k_o / (k_n k_o - 1)$$ times, which for large sizes of the convolution kernel goes to $$k_o$$.

So, now the obvious question: do we realy get a speedup factor of $$k_o$$ and where does the speedup tap out.

The short answer to this is: it depends on the number of available hardware registers. With larger values of $$k_o$$ we with exhaust the number of available hardware registers and the GPU will schedule fewer kernel invokations in parallel on each compute device.

Coming next: code + numbers demonstrating this speedup.