Accelerating Fourier transforms using the GPU

Andrew Holme is well known to regular blog readers, as the creator of the awesome (and fearsomely clever) homemade GPS receiver. Over the last few months he’s been experimenting with writing general purpose code for the VideoCore IV graphics processing unit (GPU) in the BCM2835, the microchip at the heart of the Raspberry Pi, to create an accelerated fast Fourier transform library. Taking the Fourier transform of a function yields its frequency spectrum (i.e. the pure harmonic functions which can be added together to reconstruct the original function). In the following example, shamelessly lifted from Wikipedia, we have a function which oscillates roughly three times per second, and whose Fourier transform unsurprisingly has a peak around 3Hz.

Being able to perform lots of Fourier transforms quickly is useful for all sorts of audio and radio applications including, unsurprisingly, GPS. Ham radio enthusiasts will also find Andrew’s work very useful. In this guest post, Andrew talks about his Fourier transform library for the Pi.

Last October, Eben attended the Radio Society of Great Britain (RSGB) Convention, where radio amateurs told him they wanted a speedy fast Fourier transform (FFT) library to do Software Defined Radio (SDR) projects on the Pi.

GPU_FFT is an FFT library for the Raspberry Pi which exploits the BCM2835 SoC V3D hardware to deliver ten times the performance that is possible on the 700 MHz ARM. Kernels are provided for all power-of-2 FFT lengths from 256 to 131,072 points inclusive.

GPU_FFT uses single-precision floating point for data and twiddle factors, so it does not compete on accuracy with double-precision libraries; however, the relative root-mean-square (rms) error for a 2048-point transform is less than one part per million, which is not bad.

The library runs on dedicated 3D hardware in the BCM2835 SoC, and communication between ARM and GPU adds 100µs of latency which is much longer than the shortest transform takes to compute! To overcome this, batches of transforms can be executed with a single call. Typical per-transform runtimes in microseconds are:

Points    batch=1    batch=10    batch=50    FFTW    Speedup
256 112 22 16 92 5.8x
512 125 37 26 217 8.3x
1024 136 54 45 482 10.7x
2048 180 107 93 952 10.2x
4096 298 256 240 3002 12.5x
8192 689 624 608 5082 8.4x
16384 1274 1167 1131 12005 10.6x
32768 3397 3225 3196 31211 9.8x
65536 6978 6703 6674 82769 12.4x
131072 16734 16110 16171 183731 11.4x

To get GPU_FFT enter the following at the command prompt:

sudo rpi-update && sudo reboot

To build and run the example program:

cd /opt/vc/src/hello_pi/hello_fft
sudo mknod char_dev c 100 0
sudo ./hello_fft.bin

API documentation can be found in the hello_fft folder.