Wrote a quick benchmark that sums a few hundred million dot-products without (explicitly) hitting RAM. The dot-product lives in its own function:
float dotProduct(vec2 a, vec2 b)
{
    return a.x * b.x + a.y * b.y;
}
All three ABIs can pass four 32-bit floats in registers, just in different registers. With the hard ABI, the above function should compile to exactly three instructions.
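The original listing isn't reproduced here; assuming the AAPCS VFP calling convention (a arriving in s0/s1, b in s2/s3, result returned in s0), a plausible three-instruction sequence would be:

```
vmul.f32  s0, s0, s2    @ s0 = a.x * b.x
vmla.f32  s0, s1, s3    @ s0 += a.y * b.y
bx        lr            @ return; result is already in s0
```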
In practice GCC compiles it to five or six instructions depending on whether fast-math is specified. The extra instructions are trivial and redundant. No matter...
When the above function may be inlined, the function call overhead disappears (very significant with softfp), and a loop-invariant optimisation additionally becomes available in the test function (significant for all ABIs).
In softfloat mode, the finished binary is bloated by a complete set of AEABI FP routines, even though only a few are actually used. The size of these routines reflects the amount of work required to replace each VFP instruction using the integer core.
Most striking, however, are the performance differences:
softfloat no-inline: 75s
softfloat inline: 55s
softfp no-inline: 14.5s
softfp inline: 5.5s
hardfloat no-inline: TBD
hardfloat inline: TBD
That's a tenfold difference between softfloat and softfp when the function call overhead is elided, and a fivefold difference even with softfp's artificial handicap.
The key to knowledge is not to rely on other people to teach it to you.