I believe the theoretical peak performance of the ARM CPU is only 350 Mflops double precision at 700 MHz. I'm not sure, but I would guess that fused multiply-add is not available on the Raspberry Pi, because its floating-point unit is only VFPv2. From the ARM documentation:

The fused multiply-add instructions are only available on NEON or VFP systems that implement the fused multiply-add extension. The VFP system that implements the fused multiply-add extension is VFPv4.

I timed faddd and fmuld (double-precision add and multiply) a while ago, and I think the results were something like:

- faddd: 8 cycles latency, 2 cycles throughput
- fmuld: 9 cycles latency, 2 cycles throughput

So in the best case it still takes 2 cycles per operation, which gives 700 MHz / 2 = 350 Mflops. In the worst case, where the result of the current operation is required for the next one (i.e. pipelining can't be used), each operation takes the full latency of about 8 cycles, and we end up with 700 MHz / 8 = 87.5 Mflops.
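
The arithmetic above can be written out as a small sketch. The function name `mflops` and the constants are made up for illustration, and the cycle counts are just my rough timings from above, not official figures:

```python
# Rough estimate of double-precision throughput on the Pi's VFP unit.
# Cycle counts are the approximate timings quoted above (assumed, not official).

CLOCK_MHZ = 700.0        # ARM core clock
THROUGHPUT_CYCLES = 2    # cycles between independent faddd/fmuld issues
FADDD_LATENCY = 8        # cycles until a faddd result is usable
FMULD_LATENCY = 9        # cycles until a fmuld result is usable


def mflops(clock_mhz, cycles_per_op):
    """Mflops achieved if one operation completes every cycles_per_op cycles."""
    return clock_mhz / cycles_per_op


best = mflops(CLOCK_MHZ, THROUGHPUT_CYCLES)      # fully pipelined, independent ops
dep_add = mflops(CLOCK_MHZ, FADDD_LATENCY)       # chain of dependent adds
dep_mul = mflops(CLOCK_MHZ, FMULD_LATENCY)       # chain of dependent multiplies

print(f"pipelined:        {best:.1f} Mflops")
print(f"dependent faddd:  {dep_add:.1f} Mflops")
print(f"dependent fmuld:  {dep_mul:.1f} Mflops")
```

Real code will usually land somewhere between these bounds, depending on how many independent operations the compiler can interleave.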

The GPU is impressively fast, but I'd guess the 24 Gflops figure is for single precision.