JLLL
Posts: 3
Joined: Tue Jun 25, 2019 9:32 am

Floating Point Clock Cycle

Tue Jun 25, 2019 9:46 am

I tried the following code to see how many clock cycles a floating-point multiplication takes.

The result is 20 sec; when the multiplication is commented out, the time taken is 10 sec.

Meaning that one multiplication takes about 10 clock cycles.

This is surprisingly slow, as I thought the Cortex-A53 has a hardware FPU that can do this in one clock cycle.

Or should I turn on some option for the GCC compiler?

Code: Select all

void delayCycle ( )
{
	unsigned int cc = 1200000000;
	
	float num = 1.000001E30f; 
	
	while (cc--)
	{
		__asm("nop");
		
		__asm("nop");
		__asm("nop");
		__asm("nop");
		__asm("nop");
		__asm("nop");
		__asm("nop");
		
		num = num * 0.999915454854432f;
	}
	//printf("%e\t", num);
}
:D Tested on a Raspberry Pi 3, GCC version 6.3.0.

jahboater
Posts: 4613
Joined: Wed Feb 04, 2015 6:38 pm

Re: Floating Point Clock Cycle

Tue Jun 25, 2019 12:47 pm

JLLL wrote:
Tue Jun 25, 2019 9:46 am
This is surprisingly slow, as I thought the Cortex-A53 has a hardware FPU that can do this in one clock cycle.
It has NEON, which is a pretty fast SIMD unit. It can do four floating-point multiplies per instruction.
Also, NEON is (I believe) quad issue ....

Have you checked the output of the compiler? You never know what it might do.
For example, since your final printf() is commented out, the FP arithmetic achieves nothing, so the compiler
might remove it. Otherwise it might just pre-compute the final result.

Code: Select all

gcc -Os -S -fverbose-asm test.c -o test.s
-O2 or -O3 are OK too.

If you are using buster, it comes with GCC 8.3 that properly supports -fverbose-asm by including the C source code.
Raspbian/Buster was released yesterday.

I would use extended inline assembler and possibly add volatile.

__asm("nop"); should be changed to __asm( "nop" : );

Then the compiler knows exactly what's going on and won't do any extra saving of registers etc. "just in case".
Here you have told it explicitly that no registers or memory are altered by the nop insn. Since you have told it that there are no side effects, you might find volatile is needed to force the compiler to put the nops in the right place.
Finally, you can put multiple insns in each inline asm statement.

Code: Select all

#define asm __asm__ __volatile__

asm( "nop; nop; nop; nop; nop; nop" : );
The compiler options for the Pi3 should be:-

Code: Select all

-mcpu=cortex-a53 -mtune=cortex-a53 -mfpu=neon-fp-armv8

Paeryn
Posts: 2636
Joined: Wed Nov 23, 2011 1:10 am
Location: Sheffield, England

Re: Floating Point Clock Cycle

Tue Jun 25, 2019 1:53 pm

JLLL wrote:
Tue Jun 25, 2019 9:46 am
I tried the following code to see how many clock cycles a floating-point multiplication takes.

The result is 20 sec; when the multiplication is commented out, the time taken is 10 sec.

Meaning that one multiplication takes about 10 clock cycles.

This is surprisingly slow, as I thought the Cortex-A53 has a hardware FPU that can do this in one clock cycle.

Or should I turn on some option for the GCC compiler?

Code: Select all

void delayCycle ( )
{
	unsigned int cc = 1200000000;
	
	float num = 1.000001E30f; 
	
	while (cc--)
	{
		__asm("nop");
		
		__asm("nop");
		__asm("nop");
		__asm("nop");
		__asm("nop");
		__asm("nop");
		__asm("nop");
		
		num = num * 0.999915454854432f;
	}
	//printf("%e\t", num);
}
:D Tested on a Raspberry Pi 3, GCC version 6.3.0.
I take it you didn't have any compiler optimisations on, otherwise the code as given would have num and its calculation optimised away. Uncommenting the printf to prevent that and turning on optimisations yields practically no difference at -O2 and only a 1 second difference at -O1.

I'm using gcc-9.1.
With the multiply line :-

Code: Select all

~/Programming/asm $ gcc ff.c -O1 -o ff1
~/Programming/asm $ time ./ff1
8.288680e-42

real    0m6.067s
user    0m6.056s
sys     0m0.011s

~/Programming/asm $ gcc ff.c -O2 -o ff2
~/Programming/asm $ time ./ff2
8.288680e-42

real    0m5.085s
user    0m5.083s
sys     0m0.002s
Without the multiply line

Code: Select all

~/Programming/asm $ gcc ff.c -O1 -o ff1
~/Programming/asm $ gcc ff.c -O2 -o ff2
~/Programming/asm $ time ./ff1
1.000001e+30

real    0m5.082s
user    0m5.082s
sys     0m0.000s
~/Programming/asm $ time ./ff2
1.000001e+30

real    0m5.079s
user    0m5.069s
sys     0m0.010s
For gcc-6 the times are slightly higher but more equal

Code: Select all

             gcc-6.3                              gcc-9.1
   Multiply, -O1   7.070s,   -O2   7.054s,        -O1   6.067s,    -O2   5.085s
No multiply, -O1   7.065s,   -O2   7.075s,        -O1   5.082s,    -O2   5.079s
Further to what I wrote earlier: without optimisations, gcc will read and write every variable to memory every time. In a loop like yours that amounts to a fair bit of time spent on memory accesses that isn't necessary, because all the values can be kept in hardware registers for the entire loop. The loop code with the multiplication but no optimisation gives (I edited out bits not relevant):

Code: Select all

        ldr     r3, .L4+4       @ Loop counter
        str     r3, [fp, #-8]   @ Loop counter stored in fp-8
        ldr     r3, .L4+8       @ initial value for num
        str     r3, [fp, #-12]  @ num stored in fp-12
        b       .L2
.L3:
        nop
        nop
        nop
        nop
        nop
        nop
        nop
        vldr.32 s15, [fp, #-12]
        vldr.32 s14, .L4        @ Multiplier
        vmul.f32        s15, s15, s14
        vstr.32 s15, [fp, #-12]
.L2:
        ldr     r3, [fp, #-8]
        sub     r2, r3, #1
        str     r2, [fp, #-8]
        cmp     r3, #0
        bne     .L3

.L4:
        .word   1065351798
        .word   1200000000
        .word   1900671703
As you can see, there are 3 reads from memory and 2 writes to memory happening every iteration, whereas with optimisations on:

Code: Select all

        vldr.32 s15, .L5        @ initial value for num
        vldr.32 s14, .L5+4      @ value to multiply by
        ldr     r3, .L5+8       @ loop counter
.L2:
        nop
        nop
        nop
        nop
        nop
        nop
        nop
        subs    r3, r3, #1
        vmul.f32        s15, s15, s14
        bne     .L2

        .align  2
.L5:
        .word   1900671703
        .word   1065351798
        .word   1200000000
Now there isn't any memory access inside the loop, and the instruction count for the loop itself has gone from 16 down to 10. Adding the multiply costs the optimised code only one extra instruction, whereas the unoptimised code needs four extra instructions, three of which involve memory access.
She who travels light — forgot something.

JLLL
Posts: 3
Joined: Tue Jun 25, 2019 9:32 am

Re: Floating Point Clock Cycle

Thu Jun 27, 2019 9:22 am

Thanks for the very detailed explanation.

I tested the code again with different optimization levels.

with -O0

multiplication commented = 4 sec
multiplication uncommented = 12 sec

with -O1

multiplication commented = 2 sec
multiplication uncommented = 4 sec

There is so much difference!

Code: Select all

void delayCycle ( )
{
	unsigned int cc = 1200000000;

	float num = 1.000001E30f;

	while (cc--)
	{
		__asm("nop");

		num = num * 0.999915454854432f;
	}
	printf("%e\n", num);
}

jahboater
Posts: 4613
Joined: Wed Feb 04, 2015 6:38 pm

Re: Floating Point Clock Cycle

Thu Jun 27, 2019 5:09 pm

JLLL wrote:
Thu Jun 27, 2019 9:22 am
There is so much difference !
1.6 nanoseconds for the multiply doesn't seem unreasonable?

dsyleixa123
Posts: 344
Joined: Mon Jun 11, 2018 11:22 am

Re: Floating Point Clock Cycle

Thu Jun 27, 2019 7:51 pm

I had suggested making the float num volatile, tbh... :roll:

JLLL
Posts: 3
Joined: Tue Jun 25, 2019 9:32 am

Re: Floating Point Clock Cycle

Fri Jun 28, 2019 7:36 am

More surprising results coming...

Trying to multiply a few more numbers in the loop.

Without optimization, -O0, the time taken is: 4s (no multiply), 12s (1x), 20s (2x), 28s (3x), and so on.

With optimization, -O1, the time taken is: 2s (no multiply), 4s (1x), 4s (2x), ..., 5s (5x), ..., 6s (7x), ...

The single core is able to do several multiplications at the same time.

Is this called vectorization?
Code: Select all

void delayCycle ( )
{
	unsigned int cc = 1200000000;

	float num1 = 1.000001E30f;
	float num2 = 1.000001E30f;
	float num3 = 1.000001E30f;

	while (cc--)
	{
		__asm("nop");

		num1 = num1 * 0.99987f;
		num2 = num2 * 0.99909f;
		num3 = num3 * 0.999915454854432f;
	}
	printf("%e_%e_%e_\n", num1, num2, num3);
}

jahboater
Posts: 4613
Joined: Wed Feb 04, 2015 6:38 pm

Re: Floating Point Clock Cycle

Fri Jun 28, 2019 8:33 am

JLLL wrote:
Fri Jun 28, 2019 7:36 am
The single core is able to do parallel multiplication at the same time.

Is this called vectorization?
Have you looked at the assembler produced by the compiler? Until you do that, you have no idea what's going on.

The NEON FPU is, I believe, "quad issue", meaning that the decoders can dispatch up to four floating-point instructions at a time. That's not vectorization.

Each NEON instruction can do four single-precision floating-point operations or two double-precision floating-point operations at once, which is called SIMD (single instruction, multiple data). That is vectorization.

Finally, the compiler may fold expressions, remove redundant code, pre-calculate results, vectorize it, etc.

So to see what's happening, look at the assembler.
In this case:

Code: Select all

// try.c:36: num1 = num1 * 0.99987f;
    fmul    s0, s0, s5  // num1, num1, tmp98
// try.c:37: num2 = num2 * 0.99909f;
    fmul    s1, s1, s4  // num2, num2, tmp99
// try.c:38: num3 = num3 * 0.999915454854432f;
    fmul    s2, s2, s3  // num3, num3, tmp100
It's doing three scalar single-precision floating-point multiplies.
You can see that each multiply uses a different set of registers. That means they are independent: the second one doesn't have to wait for the results of the first, and so on. So my guess is the NEON unit will probably execute them all at once - which is not vectorization.

Paeryn
Posts: 2636
Joined: Wed Nov 23, 2011 1:10 am
Location: Sheffield, England

Re: Floating Point Clock Cycle

Fri Jun 28, 2019 11:35 am

The A53 should be capable of issuing an fmul.f32 every cycle as long as the registers are not waiting for previous instructions to write their values, but it takes four cycles before the result is available (going on earlier Cortex timings; I don't think the A53 changed this). There may be hidden register dependencies between each even/odd pair of registers in AArch32 mode due to them sharing a 64-bit register (in AArch64 mode each 32-bit float register is in a separate 64-bit register).
