Scaling Performance of FFT Computation on an Industrial Integrated GPU Co-processor: Experiments with Algorithm Adaptation

Share it now

Recent Intel processors (IvyBridge, Haswell) contain an embedded on-chip GPU unit, in addition to the main CPU processor. In this work we consider the issue of efficiently mapping Fast Fourier Transform computation onto such coprocessor units. To achieve this we pursue three goals:

First, we want to study half-systematic ways to adjust the actual variant of the FFT algorithm, for a given size, to best fit the local memory capacity (the registers of a given GPU block) and perform computations without intermediate calls to distant memory;

Second, we want to study, by extensive experimentation, whether the remaining data transfers between memories (initial loads and final stores after each FFT computation) can be sustained by local interconnects at a speed matching the integrated GPU computations, or conversely if they have a negative impact on performance when computing FFTs on GPUs ”at full blast”;

Third, we want to record the energy consumption as observed in the previous experiments, and compare it to similar FFT implementations on the CPU side of the chip.

We report our work in this short paper and its companion poster, showing graphical results on a range of experiments. In broad terms, our findings are that GPUs can compute FFTs of a typical size faster than internal on-chip interconnects can provide them with data (by a factor of roughly 2), and that energy consumption is far smaller than on the CPU side.

Publication Date: 
Tue, 2014-04-01
Share it now