Result for i7-3770 using 32-bit "Integer" (gcc 5.3).

Found a total of 664579 primes (32-bit)

real 0m1.063s

user 0m1.060s

sys 0m0.000s

Found a total of 664579 primes (32-bit)

real 0m0.975s

user 0m0.972s

sys 0m0.000s

Found a total of 664579 primes (32-bit)

real 0m0.976s

user 0m0.976s

sys 0m0.000s

Sadly, we can't (yet) run the Pi3 in aarch64 mode.

That leaves us with the last result: the 64-bit integers in the 64-bit binary. It's by far the slowest and I have no idea why. Given that the CPU handles 64-bit integers natively, I would expect it to perform at a similar level to the 32-bit integer results. Granted, I'm no compiler expert, so I might be missing something here.

Does anyone have any theories?

I have found in the past that, for larger programs on x86, 64-bit mode gives a modest speed increase, say around 15% or so. I suspect that in this case the program is tiny and spends most of its time in a small inner loop, so a minor effect, such as the 32-bit divide taking fewer clock cycles than the 64-bit one, could be dominant.

"Integer" set to 64-bits:-

```
movq prime(%rip), %rcx
movslq %r8d, %r8
cmpq %r8, %rcx
ja .L4
xorl %edx, %edx
movq %rbx, %rax
divq %rcx
testq %rdx, %rdx
je .L5
```

"Integer" set to 32 bits:-

```
movl prime(%rip), %ecx
cmpl %r8d, %ecx
ja .L4
xorl %edx, %edx
movl %ebx, %eax
divl %ecx
testl %edx, %edx
je .L5
```

Note the divq for 64 bits and the divl for 32 bits. Also note the extra sign-extend instruction (movslq) at the start of the 64-bit version, which must complete before the cmp can execute. The register move and the xor should be eliminated at the register-rename stage in both modes, so they cost essentially nothing.
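For anyone who wants to reproduce the two listings, here is a minimal sketch of the kind of trial-division loop that leads to this sort of code. It is not the original benchmark source: the names `Integer`, `is_prime` and `count_primes` are my own assumptions, and the real program evidently keeps its divisor in a global called `prime`, which this sketch does not. The only point is that changing one typedef switches the remainder test between a 32-bit and a 64-bit divide.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-in for the benchmark's "Integer" type.
   Change to uint64_t to get the 64-bit (divq) variant on x86-64. */
typedef uint32_t Integer;

/* Returns 1 if n is prime, 0 otherwise, using trial division by
   odd divisors. The test d <= n / d stops at sqrt(n) without
   risking overflow in d * d. */
static int is_prime(Integer n)
{
    if (n < 2)
        return 0;
    if (n < 4)
        return 1;               /* 2 and 3 */
    if (n % 2 == 0)
        return 0;
    for (Integer d = 3; d <= n / d; d += 2) {
        if (n % d == 0)         /* the divide that dominates the loop */
            return 0;
    }
    return 1;
}

/* Counts the primes strictly below limit. */
Integer count_primes(Integer limit)
{
    /* This sketch only considers the two widths discussed above. */
    assert(sizeof(Integer) == 4 || sizeof(Integer) == 8);

    Integer count = 0;
    for (Integer n = 2; n < limit; n++)
        count += (Integer)is_prime(n);
    return count;
}
```

Compiling this with `gcc -O2 -S` once per typedef and comparing the inner loops should show the same divl/divq difference, although the exact instructions will vary with compiler version. The figure 664579 in the results above is the number of primes below 10,000,000, so presumably that was the limit used; calling `count_primes(10000000)` from a small `main` would be the way to check.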