Tuesday, 10 July 2012

Efficient convolution of multi-dimensional data


For the first real content post I will introduce a method for improving the memory bandwidth efficiency when performing basic convolution operations on multidimensional data (eg. 2D/3D/4D datasets). I presented the basic method for this as a side note in the publication below, but without much technical details -- which we can look at in this post instead.

M. Broxvall, K. Emilsson, P. Thunberg. Fast GPU Based Adaptive Filtering of 4D Echocardiography. IEEE Transactions on Medical Imaging , 2011, DOI 10.1109/TMI.2011.2179308

I will assume that you know how a convolution operation is defined and the purposes it has in image processing. Furthermore, I will assume that the convolution filter size (to avoid confusion I will call convolution kernels as convolution filters as to avoid a confusion with the notion of kernels from OpenCL) is much smaller than the image size that you are trying to convolve.

Assume that we have the convolution filter \( f \) of size \( \vec k \), an image \( M \) and want to calculate the convolution \( f * M \). A typical approach to this would be to launch one OpenCL kernel for each output point \( \mbox{out} \) and to perform the computation below:

\(\mbox{out}(\vec p) = 0\)
for \(j_1\) = 0 ... \(k_1\) do
   ...
   for \(j_n\) = 0 to \(k_n\) do
      \(\mbox{out}(\vec p) \leftarrow \mbox{out}(\vec p) + f(\vec j) M(\vec p + \vec j)\)

With this formulation we will need to perform \(\prod_i k_i\) multipllications and additions, as well as the same number of memory fetch operations reading from the input image \(M\). If we would perform, for instance, a convolution of size 10 over a 3D dataset we thus require 1000 memory fetches and 1000 floating point multiply-add operations for each generated output.

The multiply-add operations can often be performed faster than the equivalent two floating point operations by using dedicated hardware, and is exposed in the OpenCL standard through the operations mad or mad24.

Now, assuming that our hardware does not cache the input image we will have a problem. This is the case on many GPUs when you are using global memory buffers as opposed to OpenCL image objects + samplers. The reason for this has to do with how the hardware is optimised to texture memory accesses and a bit outside the scope of this post.

Although modern GPUs have high bandwidth to onboard memory on the order of 160 GB/s (ATI 6950) they have even higher floating point capacities (2.25Tflops) the convolution operation above is memory bound. Assuming that the \(out\) and \(f\) variables are stored in on-board memory the naive  algorithm above would be capable of at most \(4 \times 10^{10}\) steps of the innermost loop and thus only utilize 1-2% of the total computational capacity.

In order to optimize on this and gain better performance we note that the same input image samples will contribute to many different output values, but will be multiplied with different filter coefficients before doing so. An obvious interpolation is thus to re-use the same input image values and use them to compute multiple output values.

The extreme case of this would be to load the whole image into high-speed memory (GPU registers or shared memory) -- but this would obviously not be possible for anything but trivial image sizes. However, we can do this by redesigning our compute kernels to compute \(k_o\) number of outputs for each kernel and to re-use the same input image values for multiple output computations. We extend here the convolution filter \(f\) to contain zeroes at all points outside the original convolution filter (in the code we can do this more flexible).




\(\mbox{out}(\vec p + \overline{(0,...,0,0)}) = 0\)
...
\(\mbox{out}(\vec p + \overline{(0,...,0,k_o)}) = 0\)
for \(j_1\) = 0 ... \(k_1\) do
   ...
   for \(j_n\) = 0 to \(k_{n-1}\) do  
     for \(j_n\) = 0 to \(k_n + k_o - 1\) do 
       \(v \leftarrow M(\vec p+\vec j) \)
       \(\mbox{out}(\vec p + \overline{(0,...,0,0)}) \leftarrow \mbox{out}(\vec p + \overline{(0,...,0,0)}) + v f(j - \overline{(0,...,0,0)}) \)
      ...
       \(\mbox{out}(\vec p + \overline{(0,...,0,k_o)}) \leftarrow \mbox{out}(\vec p + \overline{(0,...,0,0)}) + v f(j - \overline{(0,...,0,k_o)}) \)

The above code requires \(k_1 ... k_n-1 (k_n + k_o - 1)\) memory accesses to compute \(k_o\) number of outputs. We thus gain a (theoretical) speed up of a factor of \( k_n k_o / (k_n k_o - 1) \) times, which for large sizes of the convolution kernel goes to \(k_o\).

So, now the obvious question: do we realy get a speedup factor of \(k_o\) and where does the speedup tap out.

The short answer to this is: it depends on the number of available hardware registers. With larger values of \(k_o\) we with exhaust the number of available hardware registers and the GPU will schedule fewer kernel invokations in parallel on each compute device.

Coming next: code + numbers demonstrating this speedup.

Monday, 9 July 2012

Purpose of this blog

Hello World\n


The purpose of this blog is to collect some thought, notes and algorithms of mine regarding GPGPU computations in general and some specific neat OpenCL code. We'll start with a small summary of the concepts and links to a few relevant sources to get you started with writing OpenCL code -- the intention of which is mostly for me remember the links and not to make a complete getting started tutorial in OpenCL.

First of, GPGPU stands for general purpose GPU (computation) where GPU in term stands for graphic processing unit. These units are the very powerful massively parallel processors that are in use today mainly for rendering graphics, but which incidentally also happens to be good for solving other parallelizable tasks requiring mostly floating point operations. When looking at the raw computational performance of these devices they are often one or two orders of magnitude (10-100) times faster than modern CPU's. However, the power of these devices come mainly from a higher level of parallelism rather than the clock frequency. As such, programming these devices efficiently is a challenge and require very different development tools as well as algorithms in order to be efficient. 

CUDA is a language specific toolkit developed by NVidia that allows the programmer to perform general purpose computations (convolutions, fourier transforms, signal processing, bitcoin mining, ...) of NVidia cards. Since CUDA was one of the first widespread toolkits dedicated to GPGPU computations it has a large adoption in the high-performance computing community and there exists many GPU cluster machines (supercomputers build from GPU's) that perform computationally heavy tasks.

The main drawback (IMHO) of CUDA is that it restricts the user to NVidia specific platforms, and that it lacks support for some features. The main advantage is the ease of use coming from the mature toolkit and the integration into the programming language itself. (The later is also a drawback with it...)

OpenCL is an open standard maintained by Khronos (the same group that develops OpenGL) that serves to give a standardized interface for compute devices to be used from any programming language through a standardized API interfacing to one or more drivers. A compute device can here be a GPU or any other device that can perform parallel computations such as a multi-core CPU or a Cell-processor.

The main advantage (IMHO) with OpenCL is that is an Open Standard, that it works with a wide range of devices and that is (host) language agnostic. The main drawback, is that it is language agnostic. Using OpenCL directly from your C/C++/Python etc. code means a significant overhead in glue code - but once  you have this in place it gives you alot of flexibility as well as very good interfacing with OpenGL.

For the purpose of this blog I will write mainly on the development of algorithms and neat tricks for OpenCL and I use mainly AMD/ATI's OpenCL drivers and to some lesser extent Intel's OpenCL drivers.


  • AMD 5870 / 5870M / 6870 cards running on AMD APP drivers
  • AMD CPUs (x4, x6)  running on AMD APP drivers
  • Intel core i7 CPUs (x4)  running on AMD APP drivers
  • Intel core i7 CPUs (x4)  running on Intel's OpenCL drivers


In addition to the above device/driver combinations a common other combination is to use NVidia's OpenCL drivers. Currently I don't have access to modern NVidia hardware, but perhaps I'll get one for measuring differences at a later point.