jahboater
Posts: 6301
Joined: Wed Feb 04, 2015 6:38 pm
Location: Wonderful West Dorset

Re: Benchmarking a raspberrypi compared to my own PC

Thu Nov 26, 2015 4:44 pm

The superlinear speedup when moving from 1 to 2 cores is interesting. Do you notice this with the i7 builds as well? Even though there are only 4 cores, I wonder if the build time would continue to decrease with -j5 or -j6.
No, with 16GB of memory it is never worth building with less than -j4 on the i7. As for the Pi2, next time I do this (when gcc 5.3 arrives) I'll probably just put the swapfile on a separate fast SSD. Otherwise it's hammering the poor little SD card too much. I tried -j >4 on the i7 but it didn't seem to help. Kudos to the Pi that it can even be done!
Pi4 8GB running PIOS64 Lite

dmc1954
Posts: 17
Joined: Sun Mar 24, 2013 2:41 pm
Location: Austin, Texas, USA

Re: Benchmarking a raspberrypi compared to my own PC

Fri Mar 04, 2016 4:04 pm

Has anyone tried to build a 64 bit arm (AArch64 Cortex A53) gcc compiler on a RPI3, assuming you are one of the few lucky people that were able to get a RPI3?

I don't believe the RPI3 needs a 64-bit OS to access its 1GB of memory or a 16GB SD card filesystem, but it would be nice to have access to the 64-bit floating point registers.

Here is the link http://infocenter.arm.com/help/topic/co ... le_2_0.pdf that got me thinking about a 64bit gcc compiler for the RPI3 to allow access to the 64 bit floating point registers.

What do you all think?

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 27447
Joined: Sat Jul 30, 2011 7:41 pm

Re: Benchmarking a raspberrypi compared to my own PC

Fri Mar 04, 2016 4:51 pm

dmc1954 wrote:Has anyone tried to build a 64 bit arm (AArch64 Cortex A53) gcc compiler on a RPI3, assuming you are one of the few lucky people that were able to get a RPI3?
IIRC 300k Pi3's made before launch, mostly sold.

Quite a few lucky people I would say!
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed.
I've been saying "Mucho" to my Spanish friend a lot more lately. It means a lot to him.

ejolson
Posts: 6032
Joined: Tue Mar 18, 2014 11:47 am

Re: Benchmarking a raspberrypi compared to my own PC

Fri Mar 04, 2016 5:07 pm

dmc1954 wrote:Has anyone tried to build a 64 bit arm (AArch64 Cortex A53) gcc compiler on a RPI3, assuming you are one of the few lucky people that were able to get a RPI3?
I don't think a 32-bit kernel can load or execute a 64-bit binary. However, being able to cross-compile 64-bit binaries on a 32-bit platform may be the first step in creating a 64-bit kernel. It seems that, for now, a 64-bit kernel will have to run headless, as the GPU binary blob is 32-bit. That would still be enough for many applications.

loadbang
Posts: 36
Joined: Mon Aug 13, 2012 4:56 pm

Re: Benchmarking a raspberrypi compared to my own PC

Fri Mar 04, 2016 5:40 pm

Interesting thread. I was wondering where ARMs are today in terms of speed. I usually say 10 years behind x86. ARM chips used to be faster than x86 in the 1990's!

Thought I'd compile and run the primes.

2.3 GHz Core i7 Crystalwell (I7-4850HQ).
Model Identifier: MacBookPro11,3
Processor Name: Intel Core i7
Processor Speed: 2.3 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
L4 Cache: 128 MB
Memory: 16 GB

$ gcc --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.4.0
Thread model: posix



Found a total of 664579 primes!

real 0m1.866s
user 0m1.830s
sys 0m0.005s
Found a total of 664579 primes!

real 0m1.850s
user 0m1.842s
sys 0m0.006s
Found a total of 664579 primes!

real 0m1.875s
user 0m1.868s
sys 0m0.007s

loadbang
Posts: 36
Joined: Mon Aug 13, 2012 4:56 pm

Re: Benchmarking a raspberrypi compared to my own PC

Fri Mar 04, 2016 5:43 pm

This may all be worth adding to the wiki.

Mikael
Posts: 26
Joined: Wed Feb 11, 2015 12:35 pm

Re: Benchmarking a raspberrypi compared to my own PC

Fri Mar 04, 2016 8:58 pm

Tested on Raspberry Pi 3 and Core i7-6700K, just for fun. :) For those who don't know, the i7-6700K is based on Skylake, Intel's latest architecture, and runs at 4.0-4.2 GHz.

Raspberry Pi 3: 3.61 s
(gcc 4.9.2, -O3 -mcpu=cortex-a53)

Core i7-6700K: 0.60 s
(gcc 5.2.1, -O3 -msse2 -mfpmath=sse)
Ubuntu 15.10 32-bit

So the i7 is 500% faster running this (rather simple) load. I'd expect the difference to be even larger in more taxing and memory intensive loads. Using all eight hardware threads on the i7 would of course extend that lead further. Anyway, always fun to compare, even if it's not very useful in this case.
Last edited by Mikael on Sat Mar 05, 2016 8:04 am, edited 1 time in total.

jdb
Raspberry Pi Engineer & Forum Moderator
Posts: 2467
Joined: Thu Jul 11, 2013 2:37 pm

Re: Benchmarking a raspberrypi compared to my own PC

Fri Mar 04, 2016 9:05 pm

Mikael wrote:Tested on Raspberry Pi 3 and Core i7-6700K, just for fun. :) For those who don't know, the i7-6700K is based on Skylake, Intel's latest architecture, and runs at 4.0-4.2 GHz.

Raspberry Pi 3: 3.61 s
(gcc 4.9.2, -O3 -mcpu=cortex-a53)

Core i7-6700K: 0.60 s
(gcc 5.2.1, -O3 -msse2 -mfpmath=sse)

So the i7 is 500% faster running this (rather simple) load. I'd expect the difference to be even larger in more taxing and memory intensive loads. Using all eight hardware threads on the i7 would of course extend that lead further. Anyway, always fun to compare, even if it's not very useful in this case.
Interestingly, the i7 TDP is 95 watts. Total board consumption for the Pi 3 is about 4 watts under multicore benchmarks, so for a 500% gain you're drawing 2375% of the power :D
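The comparison above involves two different ratios, power and energy per run, and a short sketch may help keep them apart. It uses only the figures quoted in this thread (run times of 3.61 s and 0.60 s, a 95 W TDP and about 4 W of board draw). Treating TDP as the actual draw during a single-threaded run is an assumption, and the function names are purely illustrative.

```python
# Figures quoted in the thread. TDP is a design limit, not a measured draw,
# so using it as the i7's power during this run is an assumption.
PI3_SECONDS, I7_SECONDS = 3.61, 0.60
PI3_WATTS, I7_WATTS = 4.0, 95.0


def percent_faster(slow_seconds, fast_seconds):
    """Return how much faster the quicker machine is, as a percentage."""
    return (slow_seconds / fast_seconds - 1.0) * 100.0


def joules_per_run(watts, seconds):
    """Energy for one run, assuming constant power for the whole run."""
    return watts * seconds


if __name__ == "__main__":
    print(f"i7 is {percent_faster(PI3_SECONDS, I7_SECONDS):.0f}% faster")
    print(f"power ratio: {I7_WATTS / PI3_WATTS:.2f}x")
    pi_energy = joules_per_run(PI3_WATTS, PI3_SECONDS)
    i7_energy = joules_per_run(I7_WATTS, I7_SECONDS)
    print(f"energy per run: Pi 3 {pi_energy:.1f} J, i7 {i7_energy:.1f} J")
```

On these numbers the power ratio is 23.75 (the 2375% in the post), while the energy per run differs by roughly a factor of four, because the i7 finishes much sooner.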
Rockets are loud.
https://astro-pi.org

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 27447
Joined: Sat Jul 30, 2011 7:41 pm

Re: Benchmarking a raspberrypi compared to my own PC

Fri Mar 04, 2016 9:22 pm

Cool, comparing with my desktop, an AMD Athlon X2 215 running at 1.5GHz in 64bit Ubuntu.

Found a total of 664579 primes!

real 0m4.400s
user 0m4.389s
sys 0m0.004s
Found a total of 664579 primes!

real 0m4.406s
user 0m4.398s
sys 0m0.000s
Found a total of 664579 primes!

real 0m4.397s
user 0m4.393s
sys 0m0.000s


Almost exactly the same as the Pi3!

Probably need a new desktop, that CPU speed seems a bit low.

GTR2Fan
Posts: 1601
Joined: Sun Feb 23, 2014 9:20 pm
Location: South East UK

Re: Benchmarking a raspberrypi compared to my own PC

Fri Mar 04, 2016 9:31 pm

jamesh wrote:Cool, comparing with my desktop, an AMD Athlon X2 215 running at 1.5GHz in 64bit Ubuntu.
...
Almost exactly the same as the Pi3!
Extrapolating from the result of this one test alone puts my overclocked Pi2B on a par clock-for-clock with an AMD Athlon X2. I shall quietly gloat. :D
Pi2B Mini-PC/Media Centre: ARM=1GHz (+3), Core=500MHz, v3d=500MHz, h264=333MHz, RAM=DDR2-1200 (+6/+4/+4+schmoo). Sandisk Ultra HC-I 32GB microSD card on '50=100' OCed slot (42MB/s read) running Raspbian/KODI16, Seagate 3.5" 1.5TB HDD mass storage.

timrowledge
Posts: 1354
Joined: Mon Oct 29, 2012 8:12 pm
Location: Vancouver Island

Re: Benchmarking a raspberrypi compared to my own PC

Fri Mar 04, 2016 10:55 pm

dmc1954 wrote: I don't believe the RPI3 needs a 64-bit OS to access its 1GB of memory or a 16GB SD card filesystem, but it would be nice to have access to the 64-bit floating point registers.
ARM fp registers have been 64 bit for quite a while. Like, as long as there have been fp registers.
Making Smalltalk on ARM since 1986; making your Scratch better since 2012

ejolson
Posts: 6032
Joined: Tue Mar 18, 2014 11:47 am

Re: Benchmarking a raspberrypi compared to my own PC

Sat Mar 05, 2016 5:38 am

jamesh wrote:Cool, comparing with my desktop, an AMD Athlon X2 215 running at 1.5GHz in 64bit Ubuntu.
Do you recall what compiler version and optimization switches you used?

Mikael
Posts: 26
Joined: Wed Feb 11, 2015 12:35 pm

Re: Benchmarking a raspberrypi compared to my own PC

Sat Mar 05, 2016 8:34 am

jamesh wrote:Cool, comparing with my desktop, an AMD Athlon X2 215 running at 1.5GHz in 64bit Ubuntu.
...
Almost exactly the same as the Pi3!

Probably need a new desktop, that CPU speed seems a bit low.
That's actually far slower than expected. The general performance of AMD's K10 core should be at least 25-30% higher at the same frequency. Probably a decent amount faster still in many loads.

I just tested my old Core 2 Duo T8100 (2.1GHz, 45nm, dual core, Penryn core) with the following results:

Core 2 Duo T8100 (2.1GHz): 1.588 s
(gcc 5.2.1, -O3 -msse2 -mfpmath=sse)
Ubuntu 15.10 32-bit

Given that AMD's K10 is in the same class, maybe 10% slower per clock, a more reasonable score for the Athlon II X2 215 @ 1.5GHz would be around the 2.5 second mark.
GTR2Fan wrote:Extrapolating from the result of this one test alone puts my overclocked Pi2B on a par clock-for-clock with an AMD Athlon X2. I shall quietly gloat. :D
As said above, I think something's up with that result. Average real world performance of the K10 core in the Athlon II can be expected to be at least 70% higher than the Cortex-A7 in the Pi 2, clock-for-clock.

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 27447
Joined: Sat Jul 30, 2011 7:41 pm

Re: Benchmarking a raspberrypi compared to my own PC

Sat Mar 05, 2016 9:18 am

Mikael wrote:
That's actually far slower than expected. The general performance of AMD's K10 core should be at least 25-30% higher at the same frequency. Probably a decent amount faster still in many loads.

I just tested my old Core 2 Duo T8100 (2.1GHz, 45nm, dual core, Penryn core) with the following results:

Core 2 Duo T8100 (2.1GHz): 1.588 s
(gcc 5.2.1, -O3 -msse2 -mfpmath=sse)
Ubuntu 15.10 32-bit

Given that AMD's K10 is in the same class, maybe 10% slower per clock, a more reasonable score for the Athlon II X2 215 @ 1.5GHz would be around the 2.5 second mark.
GTR2Fan wrote:Extrapolating from the result of this one test alone puts my overclocked Pi2B on a par clock-for-clock with an AMD Athlon X2. I shall quietly gloat. :D
As said above, I think something's up with that result. Average real world performance of the K10 core in the Athlon II can be expected to be at least 70% higher than the Cortex-A7 in the Pi 2, clock-for-clock.
There's a lot running on the machine, which probably doesn't help, and the 1.5GHz speed seems low since the 215 should be good for 2.5GHz.

Build line:

cc -O3 prime.c -lm

Compiler

gcc (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4

dmc1954
Posts: 17
Joined: Sun Mar 24, 2013 2:41 pm
Location: Austin, Texas, USA

Re: Benchmarking a raspberrypi compared to my own PC

Sat Mar 05, 2016 5:03 pm

timrowledge wrote:
dmc1954 wrote: I don't believe the RPI3 needs a 64-bit OS to access its 1GB of memory or a 16GB SD card filesystem, but it would be nice to have access to the 64-bit floating point registers.
ARM fp registers have been 64 bit for quite a while. Like, as long as there have been fp registers.
Tim, thank you for pointing this out to me. I wrongly assumed that double-precision floating point operations were being simulated using two single-precision floating point instructions. I also verified the use of the d registers by dumping out the assembly (gcc -S) of a simple double-precision floating point program.

This article https://wiki.debian.org/ArmHardFloatPort/VfpComparison also helped me better understand why double precision is so slow.

jahboater
Posts: 6301
Joined: Wed Feb 04, 2015 6:38 pm
Location: Wonderful West Dorset

Re: Benchmarking a raspberrypi compared to my own PC

Sat Mar 05, 2016 6:01 pm

dmc1954 wrote:
timrowledge wrote: ARM fp registers have been 64 bit for quite a while. Like, as long as there have been fp registers.
Tim, thank you for pointing this out to me. I wrongly assumed that double-precision floating point operations were being simulated using two single-precision floating point instructions. I also verified the use of the d registers by dumping out the assembly (gcc -S) of a simple double-precision floating point program.

This article https://wiki.debian.org/ArmHardFloatPort/VfpComparison also helped me better understand why double precision is so slow.
Perhaps you should be using NEON which is very fast on the Pi3 (full quad issue I think compared to dual issue on the Pi2).

ejolson
Posts: 6032
Joined: Tue Mar 18, 2014 11:47 am

Re: Benchmarking a raspberrypi compared to my own PC

Sat Mar 05, 2016 9:36 pm

jamesh wrote:
There's a lot running on the machine, which probably doesn't help, and the 1.5GHz speed seems low since the 215 should be good for 2.5GHz.

Build line:

cc -O3 prime.c -lm

Compiler

gcc (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4
The Athlon II X2 215 is a 2.7GHz part. The number reported by /proc/cpuinfo is the current speed of the processor that the frequency governor has set. I have a similar vintage 3.1 GHz CPU and get timings consistent with yours. Under load, the governor is supposed to increase the speed as needed. The following script

#!/bin/bash
# Run as root: pin each core's minimum frequency to its maximum.
for i in /sys/devices/system/cpu/cpu[0-9]*/cpufreq
do
    echo "$i"
    cat "$i/cpuinfo_max_freq" > "$i/scaling_min_freq"
done
sets the minimum so that the processors always run at full speed. On my system this didn't make any difference to the timings.
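To check whether the governor is actually holding a core at a low clock, the same sysfs directory can be read rather than written. The sketch below is a minimal example, assuming the standard Linux cpufreq files (scaling_cur_freq, scaling_min_freq, cpuinfo_max_freq, all in kHz) are present; the helper names read_khz and frequency_report are made up for illustration.

```python
from pathlib import Path

# Standard Linux cpufreq sysfs location; values in these files are in kHz.
CPU_ROOT = Path("/sys/devices/system/cpu")


def read_khz(path):
    """Read one integer kHz value from a cpufreq sysfs file."""
    return int(Path(path).read_text().strip())


def frequency_report(cpu_root=CPU_ROOT):
    """Map each cpuN directory to (current, minimum, maximum) frequency in MHz."""
    report = {}
    for freq_dir in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq")):
        values = tuple(
            read_khz(freq_dir / name) // 1000
            for name in ("scaling_cur_freq", "scaling_min_freq", "cpuinfo_max_freq")
        )
        report[freq_dir.parent.name] = values
    return report
```

Calling frequency_report() from a second terminal while the benchmark is running should show whether the current frequency has reached the maximum. No root access is needed for reading on a typical system.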

Mikael
Posts: 26
Joined: Wed Feb 11, 2015 12:35 pm

Re: Benchmarking a raspberrypi compared to my own PC

Sun Mar 06, 2016 10:52 am

Just posted this in another thread, but it's really important for this thread as well:

gcc seems to generate sub-optimal code when compiling this program to a 64-bit binary. I used Ubuntu 15.10 32-bit for my first tests. I just tried it on 64-bit and the result for my 6700 changed "slightly":

Core i7-6700K: 1.82 s (compared to 0.60 s in 32-bit)
(gcc 5.2.1, -O3)
Ubuntu 15.10 64-bit

So, pretty much exactly 1/3 the performance. To compile a 32-bit binary in a 64-bit environment, give the -m32 option to gcc when compiling. Would be interesting to see that Athlon II X2 215 retested with a 32-bit binary...

ejolson
Posts: 6032
Joined: Tue Mar 18, 2014 11:47 am

Re: Benchmarking a raspberrypi compared to my own PC

Mon Mar 07, 2016 5:23 am

Mikael wrote:Just posted this in another thread, but it's really important for this thread as well:

gcc seems to generate sub-optimal code when compiling this program to a 64-bit binary. I used Ubuntu 15.10 32-bit for my first tests. I just tried it on 64-bit and the result for my 6700 changed "slightly":

Core i7-6700K: 1.82 s (compared to 0.60 s in 32-bit)
(gcc 5.2.1, -O3)
Ubuntu 15.10 64-bit

So, pretty much exactly 1/3 the performance. To compile a 32-bit binary in a 64-bit environment, give the -m32 option to gcc when compiling. Would be interesting to see that Athlon II X2 215 retested with a 32-bit binary...
Thanks for figuring out what was going on. I've updated the code for prime.c so that the size of the integers is specified explicitly using the stdint.h header. I've also sorted the table of timings for different processors so that it reflects which timings correspond to the use of 32-bit integers versus 64-bit integers. However, at the moment I'm not distinguishing whether the host kernel is 32-bit or 64-bit as I think the timing differences result from the size of the integers and not the kernel.

It is interesting to note that the size of the integers doesn't seem to make a difference for recent AMD processors. The table is currently missing 32-bit results for the i7-4850HQ, i7-3770k and i5-3570k. Also missing are 64-bit results for the Pi 3B and Pi B+. However, I reran the Pi 2B benchmarks using 64-bit integers by changing uint32_t to uint64_t in the updated source. Surprisingly, the -mcpu=cortex-a7 flag makes no difference for 64-bit integers on the Pi 2B and the results are disappointingly slow. It would be great if someone could run the program using 64-bit integers on the Pi 3B and figure out if there are any compiler options I'm missing that could be used to speed things up.

Mikael
Posts: 26
Joined: Wed Feb 11, 2015 12:35 pm

Re: Benchmarking a raspberrypi compared to my own PC

Mon Mar 07, 2016 8:32 am

ejolson wrote:Thanks for figuring out what was going on. I've updated the code for prime.c so that the size of the integers is specified explicitly using the stdint.h header. I've also sorted the table of timings for different processors so that it reflects which timings correspond to the use of 32-bit integers versus 64-bit integers. However, at the moment I'm not distinguishing whether the host kernel is 32-bit or 64-bit as I think the timing differences result from the size of the integers and not the kernel.
Interesting. I did a quick test on my laptop, using 32/64-bit integers and 32/64-bit kernel. The kernel does not make a difference, as you say. However, compiling a 32-bit or 64-bit binary does make a difference:

Core i5-5300U:

32-bit binary:
uint32: 1.100 s
uint64: 2.192 s

64-bit binary:
uint32: 1.152 s
uint64: 2.772 s

(gcc 5.2.1, -O3)
Ubuntu 15.10 64-bit
ejolson wrote:It is interesting to note that the size of the integers doesn't seem to make a difference for recent AMD processors. The table is currently missing 32-bit results for the i7-4850HQ, i7-3770k and i5-3570k. Also missing are 64-bit results for the Pi 3B and Pi B+. However, I reran the Pi 2B benchmarks using 64-bit integers by changing uint32_t to uint64_t in the updated source. Surprisingly, the -mcpu=cortex-a7 flag makes no difference for 64-bit integers on the Pi 2B and the results are disappointingly slow. It would be great if someone could run the program using 64-bit integers on the Pi 3B and figure out if there are any compiler options I'm missing that could be used to speed things up.
64-bit integers should be much slower than 32-bit ones when executed in a 32-bit binary. You'd need to run a 64-bit OS and binary to speed things up. However, the thing I'm not getting here is the strange results on x86 CPUs like the ones above. The results for the 32-bit binary look plausible, I think (i.e. 64-bit calculations are much slower). The results for 32-bit integers in the 64-bit binary also look okay. 64-bit mode has twice as many general purpose registers compared to 32-bit mode, which may speed up some loads. However, it also increases bandwidth requirements. For the result to remain unchanged when going from 32-bit to 64-bit mode is not uncommon.

That leaves us with the last result: the 64-bit integers in the 64-bit binary. It's by far the slowest and I have no idea why. Given the fact that the CPU natively executes 64-bit integers, I would expect it to perform on a similar level as the 32-bit integer results. Granted, I'm certainly no compiler expert, so I might be missing something here.

Does anyone have any theories?

jahboater
Posts: 6301
Joined: Wed Feb 04, 2015 6:38 pm
Location: Wonderful West Dorset

Re: Benchmarking a raspberrypi compared to my own PC

Mon Mar 07, 2016 8:42 am

Result for i7-3770 using 32-bit "Integer" (gcc 5.3).
Found a total of 664579 primes (32-bit)

real 0m1.063s
user 0m1.060s
sys 0m0.000s
Found a total of 664579 primes (32-bit)

real 0m0.975s
user 0m0.972s
sys 0m0.000s
Found a total of 664579 primes (32-bit)

real 0m0.976s
user 0m0.976s
sys 0m0.000s
Sadly, we can't (yet) run the Pi3 in aarch64 mode.
That leaves us with the last result: the 64-bit integers in the 64-bit binary. It's by far the slowest and I have no idea why. Given the fact that the CPU natively executes 64-bit integers, I would expect it to perform on a similar level as the 32-bit integer results. Granted, I'm certainly no compiler expert, so I might be missing something here.

Does anyone have any theories?
I have found in the past that for larger programs on x86, 64-bit mode gives a modest speed increase, say around 15% or so. I suspect that in this case the program is tiny, spending most of its time in a small inner loop, and some minor effect, such as the reduced number of clock cycles needed for the 32-bit divide, could be dominant.

"Integer" set to 64-bits:-

    movq    prime(%rip), %rcx
    movslq  %r8d, %r8
    cmpq    %r8, %rcx
    ja  .L4
    xorl    %edx, %edx
    movq    %rbx, %rax
    divq    %rcx
    testq   %rdx, %rdx
    je  .L5
"Integer" set to 32 bits:-

    movl    prime(%rip), %ecx
    cmpl    %r8d, %ecx
    ja  .L4
    xorl    %edx, %edx
    movl    %ebx, %eax
    divl    %ecx
    testl   %edx, %edx
    je  .L5
Note the divq for 64 bits and the divl for 32 bits. Note also the extra sign-extend instruction (movslq) at the start, which must complete before the cmp. The register move and the xor will be eliminated by the decoder in both modes.
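For readers who do not read x86 assembly, the loop above appears to be trial division: load a stored prime, stop if it exceeds a limit, divide, and test the remainder. A minimal Python sketch of that reading follows. prime.c itself is not reproduced in this part of the thread, so the overall structure, the function name count_primes, and the use of prime * prime > candidate in place of a precomputed square-root limit are all assumptions.

```python
def count_primes(limit):
    """Count primes below limit by trial division against the primes found so far."""
    primes = []
    for candidate in range(2, limit):
        is_prime = True
        for prime in primes:
            if prime * prime > candidate:  # past sqrt(candidate): no divisor exists
                break
            if candidate % prime == 0:     # the remainder test that follows the divide
                is_prime = False
                break
        if is_prime:
            primes.append(candidate)
    return len(primes)


if __name__ == "__main__":
    print(count_primes(10_000))
```

The count printed in this thread, 664579, equals the number of primes below 10,000,000, which suggests that is the limit the benchmark uses. The sketch is far too slow in Python at that size and is only meant to show where the divide sits in the inner loop.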

ejolson
Posts: 6032
Joined: Tue Mar 18, 2014 11:47 am

Re: Benchmarking a raspberrypi compared to my own PC

Mon Mar 07, 2016 4:51 pm

Mikael wrote:
ejolson wrote:Thanks for figuring out what was going on. I've updated the code for prime.c so that the size of the integers is specified explicitly using the stdint.h header. I've also sorted the table of timings for different processors so that it reflects which timings correspond to the use of 32-bit integers versus 64-bit integers. However, at the moment I'm not distinguishing whether the host kernel is 32-bit or 64-bit as I think the timing differences result from the size of the integers and not the kernel.
Interesting. I did a quick test on my laptop, using 32/64-bit integers and 32/64-bit kernel. The kernel does not make a difference, as you say. However, compiling a 32-bit or 64-bit binary does make a difference:

Core i5-5300U:

32-bit binary:
uint32: 1.100 s
uint64: 2.192 s

64-bit binary:
uint32: 1.152 s
uint64: 2.772 s

(gcc 5.2.1, -O3)
Ubuntu 15.10 64-bit
ejolson wrote:It is interesting to note that the size of the integers doesn't seem to make a difference for recent AMD processors. The table is currently missing 32-bit results for the i7-4850HQ, i7-3770k and i5-3570k. Also missing are 64-bit results for the Pi 3B and Pi B+. However, I reran the Pi 2B benchmarks using 64-bit integers by changing uint32_t to uint64_t in the updated source. Surprisingly, the -mcpu=cortex-a7 flag makes no difference for 64-bit integers on the Pi 2B and the results are disappointingly slow. It would be great if someone could run the program using 64-bit integers on the Pi 3B and figure out if there are any compiler options I'm missing that could be used to speed things up.
64-bit integers should be much slower than 32-bit ones when executed in a 32-bit binary. You'd need to run a 64-bit OS and binary to speed things up. However, the thing I'm not getting here is the strange results on x86 CPUs like the ones above. The results for the 32-bit binary look plausible, I think (i.e. 64-bit calculations are much slower). The results for 32-bit integers in the 64-bit binary also look okay. 64-bit mode has twice as many general purpose registers compared to 32-bit mode, which may speed up some loads. However, it also increases bandwidth requirements. For the result to remain unchanged when going from 32-bit to 64-bit mode is not uncommon.

That leaves us with the last result: the 64-bit integers in the 64-bit binary. It's by far the slowest and I have no idea why. Given the fact that the CPU natively executes 64-bit integers, I would expect it to perform on a similar level as the 32-bit integer results. Granted, I'm certainly no compiler expert, so I might be missing something here.

Does anyone have any theories?
Did you perform each timing using

time ./a.out; time ./a.out; time ./a.out
and take the smallest "real" time out of the three? Maybe the governor of the CPU is not increasing the frequency fast enough. Have you tried running the frequency script posted earlier in this thread as root to set the minimum allowed frequency to the maximum before doing the tests?
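The best-of-three procedure can also be scripted so that every machine is measured the same way. The following is only a sketch: it measures wall-clock time around the whole process, so it corresponds to the "real" figure and includes process start-up, and the helper name best_of is hypothetical.

```python
import subprocess
import sys
import time


def best_of(command, runs=3):
    """Run command several times and return the smallest wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the program's output so that printing does not affect the timing much.
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    # Stand-in command: start the Python interpreter and exit immediately.
    # Replace it with ["./a.out"] to time the benchmark binary.
    print(f"best of 3: {best_of([sys.executable, '-c', 'pass']):.3f} s")
```

Taking the minimum rather than the mean tends to filter out runs where the CPU had not yet clocked up or another process got in the way.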

ejolson
Posts: 6032
Joined: Tue Mar 18, 2014 11:47 am

Re: Benchmarking a raspberrypi compared to my own PC

Tue Mar 08, 2016 12:59 am

Mikael wrote:Does anyone have any theories?
No theories, however, I can now confirm your timings. The 64-bit integers with 32-bit compatible binary surprisingly run about 20% faster than 64-bit integers with 64-bit binary using an i3 550. On the other hand, the situation is reversed for the exact same binaries using an AMD Athlon II X2 255.

                        32-bit binary    64-bit binary
i3 550                      2.058            2.430
Athlon II X2 255            5.301            3.830
The above numbers were obtained as follows. Modify prime.c so that Integer is defined as uint64_t. Compile the program using a 32-bit install of Debian Wheezy and a 64-bit install. The compile command is

$ gcc -O3 -msse2 -mfpmath=sse prime.c -lm
in both cases. Version 4.7.2 of gcc is used in both cases. Note that various -march= options were tried for the 64-bit binary, but none of them led to an executable that ran faster than the 32-bit binary on the i3 platform. Let prime64on32 be the binary created by the 32-bit distribution and prime64on64 be the 64-bit binary. Copy the 32-bit binary to the 64-bit i3 550 and the Athlon II X2 255 systems and then run both programs.

$ grep "model name" /proc/cpuinfo | sort -u
model name  : Intel(R) Core(TM) i3 CPU         550  @ 3.20GHz
$ file prime64on32 
prime64on32: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.26, BuildID[sha1]=0x9a55be5cdecd3251f5407682d64c6bcd079bea26, not stripped
$ time ./prime64on32; time ./prime64on32; time ./prime64on32
Found a total of 664579 primes (64-bit)

real    0m2.066s
user    0m2.056s
sys 0m0.000s
Found a total of 664579 primes (64-bit)

real    0m2.073s
user    0m2.064s
sys 0m0.000s
Found a total of 664579 primes (64-bit)

real    0m2.058s
user    0m2.040s
sys 0m0.008s
$ file prime64on64
prime64on64: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.26, BuildID[sha1]=0x6837b7d0137cbd9448c7976b1e49b98759215b5e, not stripped
$ time ./prime64on64; time ./prime64on64; time ./prime64on64
Found a total of 664579 primes (64-bit)

real    0m2.495s
user    0m2.484s
sys 0m0.000s
Found a total of 664579 primes (64-bit)

real    0m2.430s
user    0m2.420s
sys 0m0.000s
Found a total of 664579 primes (64-bit)

real    0m2.434s
user    0m2.424s
sys 0m0.000s
----------------------------------------------------------------
$ grep "model name" /proc/cpuinfo | sort -u
model name  : AMD Athlon(tm) II X2 255 Processor
$ time ./prime64on32; time ./prime64on32; time ./prime64on32
Found a total of 664579 primes (64-bit)

real    0m5.317s
user    0m5.304s
sys 0m0.000s
Found a total of 664579 primes (64-bit)

real    0m5.308s
user    0m5.296s
sys 0m0.004s
Found a total of 664579 primes (64-bit)

real    0m5.301s
user    0m5.292s
sys 0m0.000s
$ time ./prime64on64; time ./prime64on64; time ./prime64on64
Found a total of 664579 primes (64-bit)

real    0m3.835s
user    0m3.828s
sys 0m0.000s
Found a total of 664579 primes (64-bit)

real    0m3.830s
user    0m3.824s
sys 0m0.004s
Found a total of 664579 primes (64-bit)

real    0m3.836s
user    0m3.828s
sys 0m0.000s
I wonder what differences 64-bit mode will make when it's available for the Pi 3B. Hopefully things will get faster rather than slower!

loadbang
Posts: 36
Joined: Mon Aug 13, 2012 4:56 pm

Re: Benchmarking a raspberrypi compared to my own PC

Tue Mar 08, 2016 12:36 pm

Amazed at how speedy the i7-4850HQ is, not bad for a laptop. Wonder if the eDRAM has any effect?
ejolson wrote:
KarlSplatz wrote:

CPU                     32-bit  64-bit  Compiler
-------------------------------------------------------
i7-6700K 4.0GHz          0.600   1.820  gcc-5.2.1 -O3
i7-4850HQ 2.3GHz                 1.850  LLVM-7.0.2 -O3
i7-3770k 4GHz            0.975   2.146  gcc-5.2 -O3
i5-3570k 3.4GHz                  2.497  gcc-4.8.4 -O3
Xeon E5-2620v3 2.4GHz    1.135   2.545  gcc-4.7.2 -O3
Xeon E5-2650v2 2.6GHz    1.155   2.592  gcc-4.4.7 -O3
AMD A6-5400K 3.6GHz      2.023   2.095  gcc-4.7.2 -O3
Opteron 6212 2.6GHz      3.407   3.421  gcc-5.1.0 -O3
Phenom II X4 3.4GHz      3.473   3.479  gcc-4.7.2 -O3
ARMv8 Pi 3B 1200MHz      3.611          gcc-4.9.2 -O3 \
                                        -mcpu=cortex-a53
Pentium 4 3.4GHz         3.759   5.181  gcc-5.2.1 -O3
AthlonII X2 255 3.1GHz   3.828   3.836  gcc-4.7.2 -O3
Athlon64 X2 5400+ 2.8GHz 4.601   7.893  gcc-4.6.3 -O3
Pentium 4D CPU 2.80GHz   4.612   6.271  gcc-4.7.2 -O3
ARMv7 Pi 2B 900MHz       7.187  74.790  gcc-5.2 -O3 \
                                        -mcpu=cortex-a7
Pentium III 866MHz      14.999  20.169  gcc-4.7.2 -O3
ARMv8 Pi 3B 1200MHz     17.670          gcc-4.9.2 -O3
Pentium III 650MHz      19.891  26.735  gcc-4.7.2 -O3
AMD-K6 3D 350MHz        26.726  45.469  gcc-4.7.2 -O3
ARMv7 Pi 2B 900MHz      27.741  74.987  gcc-4.6.3 -O3
ARMv6 Pi B+ 700MHz      74.027          gcc-4.6.3 -O3
i586 Pentium 75MHz     155.804 303.710  gcc-2.7.2.3 -O3
i486 DX/2 66MHz        282.180 919.130  gcc-2.6.3 -O3
Updated to distinguish timings using 32-bit integers from 64-bit integers.

ejolson
Posts: 6032
Joined: Tue Mar 18, 2014 11:47 am

Re: Benchmarking a raspberrypi compared to my own PC

Sat Apr 07, 2018 4:11 pm

Here is another data point based on the simple prime finding program posted above for an ARMv8 processor running in 64-bit mode. I ran the program on a single board computer called the NanoPi T3 which uses the same clock speed as the Raspberry Pi 3B+ but a slightly slower memory speed. The results are

$ gcc -o prime32 -O3 prime32.c -lm
$ gcc -o prime64 -O3 prime64.c -lm
$ time ./prime32; time ./prime32; time ./prime32
Found a total of 664579 primes (32-bit)

real    0m2.902s
user    0m2.896s
sys 0m0.008s
Found a total of 664579 primes (32-bit)

real    0m2.876s
user    0m2.864s
sys 0m0.012s
Found a total of 664579 primes (32-bit)

real    0m2.878s
user    0m2.872s
sys 0m0.004s
$ time ./prime64; time ./prime64; time ./prime64
Found a total of 664579 primes (64-bit)

real    0m3.114s
user    0m3.108s
sys 0m0.004s
Found a total of 664579 primes (64-bit)

real    0m3.088s
user    0m3.084s
sys 0m0.004s
Found a total of 664579 primes (64-bit)

real    0m3.091s
user    0m3.088s
sys 0m0.004s
$ gcc --version
gcc (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
This places the 1400 MHz ARMv8 processor running in 64-bit mode right above the Opteron 6212 in the previous table when computing with either 32-bit or 64-bit integers. At the moment my Pi 3B+ is serving as a WiFi access point and unavailable for 64-bit testing, but I would expect it to run even faster because of the faster memory speed. If anyone has a 3B+ running a 64-bit operating system and wants to verify this, that would be greatly appreciated.

It is interesting to note that a Pi 3B running at 1200 MHz in 32-bit mode turns in a runtime of 3.611s when using 32-bit integers, while the test above indicates a runtime using 32-bit integers of only 2.878s. The faster clock speed alone would account for a speedup of about 16 percent; however, the measured speedup is about 25 percent. The remaining 9 percentage points are therefore likely due to better code optimization, possibly resulting from the richer set of available registers when running in 64-bit mode as opposed to 32-bit mode.
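The split between clock speed and everything else can be worked through explicitly. The sketch below uses the two run times and clock speeds from this thread and assumes that run time scales inversely with clock frequency, which ignores the differences in memory speed, board, and compiler version between the two tests. The function names are illustrative.

```python
# Figures from this thread: Pi 3B in 32-bit mode versus NanoPi T3 in 64-bit mode.
OLD_SECONDS, NEW_SECONDS = 3.611, 2.878
OLD_MHZ, NEW_MHZ = 1200, 1400


def speedup(old_seconds, new_seconds):
    """Ratio of the old run time to the new one (1.0 means no change)."""
    return old_seconds / new_seconds


def residual_speedup(total, clock_ratio):
    """Speedup left over after dividing out the clock ratio (multiplicative model)."""
    return total / clock_ratio


if __name__ == "__main__":
    total = speedup(OLD_SECONDS, NEW_SECONDS)
    clock = NEW_MHZ / OLD_MHZ
    print(f"measured speedup: {(total - 1) * 100:.1f}%")
    print(f"clock ratio:      {(clock - 1) * 100:.1f}%")
    print(f"additive rest:    {(total - clock) * 100:.1f} points")
    print(f"multiplicative:   {(residual_speedup(total, clock) - 1) * 100:.1f}%")
```

Subtracting the two percentages leaves just under 9 points. If the clock and the remaining factors are treated as multiplying together instead, the part not explained by clock speed comes to about 7.5 percent. Either way the conclusion is the same: something besides the clock is helping.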
