No, with 16GB of memory it is never worth building with less than -j4 on the i7. As for the Pi2, next time I do this (when gcc 5.3 arrives) I'll probably just put the swapfile on a separate fast SSD. Otherwise it's hammering the poor little SD card too much. I tried -j >4 on the i7 but it didn't seem to help.
Kudos to the Pi that it can even be done! The superlinear speedup when moving from 1 to 2 cores is interesting. Do you notice this with the i7 builds as well? Even though there are only 4 cores, I wonder if the build time would continue to decrease with -j5 or -j6.
Re: Benchmarking a raspberrypi compared to my own PC
Pi4 8GB and Pi4 4GB running Raspberry Pi OS 64-bit
Re: Benchmarking a raspberrypi compared to my own PC
Has anyone tried to build a 64 bit arm (AArch64 Cortex A53) gcc compiler on a RPI3, assuming you are one of the few lucky people that were able to get a RPI3?
I don't believe the RPI3 needs a 64 bit OS to access its 1GB of memory or a 16GB SD card filesystem, but it would be nice to have access to the 64 bit floating point registers.
Here is the link http://infocenter.arm.com/help/topic/co ... le_2_0.pdf that got me thinking about a 64bit gcc compiler for the RPI3 to allow access to the 64 bit floating point registers.
What do you all think?
Re: Benchmarking a raspberrypi compared to my own PC
dmc1954 wrote: Has anyone tried to build a 64 bit arm (AArch64 Cortex A53) gcc compiler on a RPI3, assuming you are one of the few lucky people that were able to get a RPI3?

IIRC 300k Pi3's made before launch, mostly sold.
Quite a few lucky people I would say!
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed.
I've been saying "Mucho" to my Spanish friend a lot more lately. It means a lot to him.
Re: Benchmarking a raspberrypi compared to my own PC
dmc1954 wrote: Has anyone tried to build a 64 bit arm (AArch64 Cortex A53) gcc compiler on a RPI3, assuming you are one of the few lucky people that were able to get a RPI3?
I don't believe the RPI3 needs a 64 bit OS to access its 1GB of memory or a 16GB SD card filesystem, but it would be nice to have access to the 64 bit floating point registers.
Here is the link http://infocenter.arm.com/help/topic/co ... le_2_0.pdf that got me thinking about a 64bit gcc compiler for the RPI3 to allow access to the 64 bit floating point registers.
What do you all think?

I don't think a 32-bit kernel can load or execute a 64-bit binary. However, being able to cross compile 64-bit binaries on a 32-bit platform may be the first step in creating a 64-bit kernel. It seems for now a 64-bit kernel will have to run headless as the GPU binary blob is 32-bit. That would still be enough for many applications.
Re: Benchmarking a raspberrypi compared to my own PC
Interesting thread. I was wondering where ARMs are today in terms of speed. I usually say 10 years behind x86. ARM chips used to be faster than x86 in the 1990's!
Thought I'd compile and run the primes.
2.3 GHz Core i7 Crystalwell (I7-4850HQ).
Model Identifier: MacBookPro11,3
Processor Name: Intel Core i7
Processor Speed: 2.3 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
L4 Cache: 128 MB
Memory: 16 GB
$ gcc --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.4.0
Thread model: posix
Found a total of 664579 primes!
real 0m1.866s
user 0m1.830s
sys 0m0.005s
Found a total of 664579 primes!
real 0m1.850s
user 0m1.842s
sys 0m0.006s
Found a total of 664579 primes!
real 0m1.875s
user 0m1.868s
sys 0m0.007s
Re: Benchmarking a raspberrypi compared to my own PC
May be all worth adding to the wiki.
Re: Benchmarking a raspberrypi compared to my own PC
Tested on Raspberry Pi 3 and Core i7-6700K, just for fun.
For those who don't know, the i7-6700K is based on Skylake, Intel's latest architecture, and runs at 4.0-4.2 GHz.
Raspberry Pi 3: 3.61 s
(gcc 4.9.2, -O3 -mcpu=cortex-a53)
Core i7-6700K: 0.60 s
(gcc 5.2.1, -O3 -msse2 -mfpmath=sse)
Ubuntu 15.10 32-bit
So the i7 is 500% faster running this (rather simple) load. I'd expect the difference to be even larger in more taxing and memory intensive loads. Using all eight hardware threads on the i7 would of course extend that lead further. Anyway, always fun to compare, even if it's not very useful in this case.

Last edited by Mikael on Sat Mar 05, 2016 8:04 am, edited 1 time in total.
Re: Benchmarking a raspberrypi compared to my own PC
Mikael wrote: Tested on Raspberry Pi 3 and Core i7-6700K, just for fun.
For those who don't know, the i7-6700K is based on Skylake, Intel's latest architecture, and runs at 4.0-4.2 GHz.
Raspberry Pi 3: 3.61 s
(gcc 4.9.2, -O3 -mcpu=cortex-a53)
Core i7-6700K: 0.60 s
(gcc 5.2.1, -O3 -msse2 -mfpmath=sse)
So the i7 is 500% faster running this (rather simple) load. I'd expect the difference to be even larger in more taxing and memory intensive loads. Using all eight hardware threads on the i7 would of course extend that lead further. Anyway, always fun to compare, even if it's not very useful in this case.

Interestingly, the i7 TDP is 95 watts. Total board consumption for the Pi 3 is about 4 watts under multicore benchmarks, so for a 500% gain you're expending 2375% of the power.
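A rough check of the figures in this exchange, as a sketch only. The 95 W and 4 W numbers are taken from the post at face value; a TDP is a thermal design limit rather than a measured draw, and a single-threaded run is unlikely to reach either figure, so the per-run energy estimate is illustrative.

```latex
\begin{align*}
\text{speed ratio} &= \frac{3.61\,\text{s}}{0.60\,\text{s}} \approx 6.0 \quad (\text{about } 500\% \text{ faster}) \\
\text{power ratio} &= \frac{95\,\text{W}}{4\,\text{W}} = 23.75 \quad (2375\%) \\
E_{\text{i7}} &\approx 95\,\text{W} \times 0.60\,\text{s} = 57\,\text{J} \\
E_{\text{Pi 3}} &\approx 4\,\text{W} \times 3.61\,\text{s} \approx 14.4\,\text{J} \\
\frac{E_{\text{i7}}}{E_{\text{Pi 3}}} &\approx 3.9
\end{align*}
```

On these assumptions the i7 draws about 24 times the power but, because it finishes six times sooner, uses closer to four times the energy per run.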

Rockets are loud.
https://astro-pi.org
Re: Benchmarking a raspberrypi compared to my own PC
Cool, comparing with my desktop, an AMD Athlon X2 215 running at 1.5GHz in 64bit Ubuntu.
Found a total of 664579 primes!
real 0m4.400s
user 0m4.389s
sys 0m0.004s
Found a total of 664579 primes!
real 0m4.406s
user 0m4.398s
sys 0m0.000s
Found a total of 664579 primes!
real 0m4.397s
user 0m4.393s
sys 0m0.000s
Almost exactly the same as the Pi3!
Probably need a new desktop, that CPU speed seems a bit low.
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed.
I've been saying "Mucho" to my Spanish friend a lot more lately. It means a lot to him.
Re: Benchmarking a raspberrypi compared to my own PC
jamesh wrote: Cool, comparing with my desktop, an AMD Athlon X2 215 running at 1.5GHz in 64bit Ubuntu.
...
Almost exactly the same as the Pi3!

Extrapolating from the result of this one test alone puts my overclocked Pi2B on a par clock-for-clock with an AMD Athlon X2. I shall quietly gloat.

Pi2B Mini-PC/Media Centre: ARM=1GHz (+3), Core=500MHz, v3d=500MHz, h264=333MHz, RAM=DDR2-1200 (+6/+4/+4+schmoo). Sandisk Ultra HC-I 32GB microSD card on '50=100' OCed slot (42MB/s read) running Raspbian/KODI16, Seagate 3.5" 1.5TB HDD mass storage.
Re: Benchmarking a raspberrypi compared to my own PC
dmc1954 wrote: I don't believe the RPI3 needs a 64 bit OS to access its 1GB of memory or a 16GB SD card filesystem, but it would be nice to have access to the 64 bit floating point registers.

ARM fp registers have been 64 bit for quite a while. Like, as long as there have been fp registers.
Making Smalltalk on ARM since 1986; making your Scratch better since 2012
Re: Benchmarking a raspberrypi compared to my own PC
jamesh wrote: Cool, comparing with my desktop, an AMD Athlon X2 215 running at 1.5GHz in 64bit Ubuntu.
Found a total of 664579 primes!
real 0m4.400s
user 0m4.389s
sys 0m0.004s
Found a total of 664579 primes!
real 0m4.406s
user 0m4.398s
sys 0m0.000s
Found a total of 664579 primes!
real 0m4.397s
user 0m4.393s
sys 0m0.000s
Almost exactly the same as the Pi3!
Probably need a new desktop, that CPU speed seems a bit low.

Do you recall what compiler version and optimization switches you used?
Re: Benchmarking a raspberrypi compared to my own PC
jamesh wrote: Cool, comparing with my desktop, an AMD Athlon X2 215 running at 1.5GHz in 64bit Ubuntu.
Found a total of 664579 primes!
real 0m4.400s
user 0m4.389s
sys 0m0.004s
Found a total of 664579 primes!
real 0m4.406s
user 0m4.398s
sys 0m0.000s
Found a total of 664579 primes!
real 0m4.397s
user 0m4.393s
sys 0m0.000s
Almost exactly the same as the Pi3!
Probably need a new desktop, that CPU speed seems a bit low.

That's actually far slower than expected. The general performance of AMD's K10 core should be at least 25-30% higher at the same frequency. Probably a decent amount faster still in many loads.
I just tested my old Core 2 Duo T8100 (2.1GHz, 45nm, dual core, Penryn core) with the following results:
Core 2 Duo T8100 (2.1GHz): 1.588 s
(gcc 5.2.1, -O3 -msse2 -mfpmath=sse)
Ubuntu 15.10 32-bit
Given that AMD's K10 is in the same class, maybe 10% slower per clock, a more reasonable score for the Athlon II X2 215 @ 1.5GHz would be around the 2.5 second mark.

GTR2Fan wrote: Extrapolating from the result of this one test alone puts my overclocked Pi2B on a par clock-for-clock with an AMD Athlon X2. I shall quietly gloat.

As said above, I think something's up with that result. Average real world performance of the K10 core in the Athlon II can be expected to be at least 70% higher than the Cortex-A7 in the Pi 2, clock-for-clock.
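The 2.5 second estimate in this post presumably comes from scaling the Core 2 Duo time by the clock ratio and the assumed per-clock deficit. Treating "10% slower per clock" as a factor of 1.10 on the run time is an assumption made here, not something the post spells out:

```latex
t_{\text{Athlon}} \approx 1.588\,\text{s} \times \frac{2.1\,\text{GHz}}{1.5\,\text{GHz}} \times 1.10 \approx 2.45\,\text{s}
```

That is well short of the 4.4 s actually measured, which is why the result looks suspicious.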
Re: Benchmarking a raspberrypi compared to my own PC
Mikael wrote: That's actually far slower than expected. The general performance of AMD's K10 core should be at least 25-30% higher at the same frequency. Probably a decent amount faster still in many loads.
I just tested my old Core 2 Duo T8100 (2.1GHz, 45nm, dual core, Penryn core) with the following results:
Core 2 Duo T8100 (2.1GHz): 1.588 s
(gcc 5.2.1, -O3 -msse2 -mfpmath=sse)
Ubuntu 15.10 32-bit
Given that AMD's K10 is in the same class, maybe 10% slower per clock, a more reasonable score for the Athlon II X2 215 @ 1.5GHz would be around the 2.5 second mark.
As said above, I think something's up with that result. Average real world performance of the K10 core in the Athlon II can be expected to be at least 70% higher than the Cortex-A7 in the Pi 2, clock-for-clock.

There's a lot running on the machine which probably doesn't help, and the 1.5GHz speed seems low since the 215 should be good for 2.5GHz.
Build line:
cc -O3 prime.c -lm
Compiler
gcc (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed.
I've been saying "Mucho" to my Spanish friend a lot more lately. It means a lot to him.
Re: Benchmarking a raspberrypi compared to my own PC
timrowledge wrote: ARM fp registers have been 64 bit for quite a while. Like, as long as there have been fp registers.

Tim, thank you for pointing this out to me. I wrongly assumed that double precision floating point operations were being simulated using two single precision floating point instructions. I also verified the use of the d registers by dumping out the assembly (gcc -S) of a simple double precision floating point program.
This article https://wiki.debian.org/ArmHardFloatPort/VfpComparison also helped me better understand why double precision is so slow.
Re: Benchmarking a raspberrypi compared to my own PC
dmc1954 wrote: Tim, thank you for pointing this out to me. I wrongly assumed that double precision floating point operations were being simulated using two single precision floating point instructions. I also verified the use of the d registers by dumping out the assembly (gcc -S) of a simple double precision floating point program.
This article https://wiki.debian.org/ArmHardFloatPort/VfpComparison also helped me better understand why double precision is so slow.

Perhaps you should be using NEON, which is very fast on the Pi3 (full quad issue, I think, compared to dual issue on the Pi2).
Pi4 8GB and Pi4 4GB running Raspberry Pi OS 64-bit
Re: Benchmarking a raspberrypi compared to my own PC
jamesh wrote: There's a lot running on the machine which probably doesn't help, and the 1.5GHz speed seems low since the 215 should be good for 2.5GHz.
Build line:
cc -O3 prime.c -lm
Compiler
gcc (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4

The Athlon II X2 215 is a 2.7GHz part. The number reported by /proc/cpuinfo is the current speed of the processor that the frequency governor has set. I have a similar vintage 3.1 GHz CPU and get timings consistent with yours. Under load, the governor is supposed to increase the speed as needed. The following script (run as root) raises the governor's minimum frequency to the hardware maximum on every core:
Code: Select all
#!/bin/bash
# Pin each core's minimum frequency to its hardware maximum (needs root).
for i in /sys/devices/system/cpu/cpu[0-9]*/cpufreq
do
    echo "$i"
    cat "$i/cpuinfo_max_freq" > "$i/scaling_min_freq"
done
Re: Benchmarking a raspberrypi compared to my own PC
Just posted this in another thread, but it's really important for this thread as well:
gcc seems to generate sub-optimal code when compiling this program to a 64-bit binary. I used Ubuntu 15.10 32-bit for my first tests. I just tried it on 64-bit and the result for my 6700 changed "slightly":
Core i7-6700K: 1.82 s (compared to 0.60 s in 32-bit)
(gcc 5.2.1, -O3)
Ubuntu 15.10 64-bit
So, pretty much exactly 1/3 the performance. To compile a 32-bit binary in a 64-bit environment, give the -m32 option to gcc when compiling. Would be interesting to see that Athlon II X2 215 retested with a 32-bit binary...
Re: Benchmarking a raspberrypi compared to my own PC
Mikael wrote: Just posted this in another thread, but it's really important for this thread as well:
gcc seems to generate sub-optimal code when compiling this program to a 64-bit binary. I used Ubuntu 15.10 32-bit for my first tests. I just tried it on 64-bit and the result for my 6700 changed "slightly":
Core i7-6700K: 1.82 s (compared to 0.60 s in 32-bit)
(gcc 5.2.1, -O3)
Ubuntu 15.10 64-bit
So, pretty much exactly 1/3 the performance. To compile a 32-bit binary in a 64-bit environment, give the -m32 option to gcc when compiling. Would be interesting to see that Athlon II X2 215 retested with a 32-bit binary...

Thanks for figuring out what was going on. I've updated the code for prime.c so that the size of the integers is specified explicitly using the stdint.h header. I've also sorted the table of timings for different processors so that it reflects which timings correspond to the use of 32-bit integers versus 64-bit integers. However, at the moment I'm not distinguishing whether the host kernel is 32-bit or 64-bit as I think the timing differences result from the size of the integers and not the kernel.
It is interesting to note that the size of the integers doesn't seem to make a difference for recent AMD processors. The table is currently missing 32-bit results for the i7-4850HQ, i7-3770k and i5-3570k. Also missing are 64-bit results for the Pi 3B and Pi B+. However, I reran the Pi 2B benchmarks using 64-bit integers by changing uint32_t to uint64_t in the updated source. Surprisingly, the -mcpu=cortex-a7 flag makes no difference for 64-bit integers on the Pi 2B and the results are disappointingly slow. It would be great if someone could run the program using 64-bit integers on the Pi 3B and figure out if there are any compiler options I'm missing that could be used to speed things up.
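For readers without the updated prime.c to hand, here is a minimal sketch of the idea described in this post: the integer width is fixed through a stdint.h typedef, so the same source can be built for 32-bit or 64-bit arithmetic. This is not ejolson's program. The `Integer` name is borrowed from a later post in the thread, while `count_primes` and the trial-division strategy are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Change uint32_t to uint64_t to benchmark 64-bit integer arithmetic. */
typedef uint32_t Integer;

static_assert(sizeof(Integer) == 4 || sizeof(Integer) == 8,
              "Integer is expected to be 32 or 64 bits wide");

/* Count the primes below `limit` by trial division against the primes
   found so far. Returns 0 if memory cannot be allocated. */
size_t count_primes(Integer limit)
{
    size_t count = 0, stored = 0, cap = 16;
    Integer *divisors = malloc(cap * sizeof *divisors);
    if (divisors == NULL)
        return 0;

    for (Integer n = 2; n < limit; n++) {
        int is_prime = 1;
        for (size_t i = 0; i < stored; i++) {
            Integer p = divisors[i];
            if (p * p > n)          /* no divisor up to sqrt(n): n is prime */
                break;
            if (n % p == 0) {       /* the division whose width is under test */
                is_prime = 0;
                break;
            }
        }
        if (!is_prime)
            continue;
        count++;

        /* Keep n as a future divisor only if n*n < limit, which also
           guarantees that p * p above can never overflow Integer. */
        if (n <= (limit - 1) / n) {
            if (stored == cap) {
                cap *= 2;
                Integer *grown = realloc(divisors, cap * sizeof *grown);
                if (grown == NULL) {
                    free(divisors);
                    return 0;
                }
                divisors = grown;
            }
            divisors[stored++] = n;
        }
    }
    free(divisors);
    return count;
}
```

Calling `count_primes(10000000)` from a small `main` and printing the result should give the 664579 seen throughout the thread. Swapping the typedef to `uint64_t` widens every `%` in the inner loop, which is the change the timings in this thread are measuring.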
Re: Benchmarking a raspberrypi compared to my own PC
ejolson wrote: Thanks for figuring out what was going on. I've updated the code for prime.c so that the size of the integers is specified explicitly using the stdint.h header. I've also sorted the table of timings for different processors so that it reflects which timings correspond to the use of 32-bit integers versus 64-bit integers. However, at the moment I'm not distinguishing whether the host kernel is 32-bit or 64-bit as I think the timing differences result from the size of the integers and not the kernel.

Interesting. I did a quick test on my laptop, using 32/64-bit integers and 32/64-bit kernel. The kernel does not make a difference, as you say. However, compiling a 32-bit or 64-bit binary does make a difference:
Core i5-5300U:
32-bit binary:
uint32: 1.100 s
uint64: 2.192 s
64-bit binary:
uint32: 1.152 s
uint64: 2.772 s
(gcc 5.2.1, -O3)
Ubuntu 15.10 64-bit
ejolson wrote: It is interesting to note that the size of the integers doesn't seem to make a difference for recent AMD processors. The table is currently missing 32-bit results for the i7-4850HQ, i7-3770k and i5-3570k. Also missing are 64-bit results for the Pi 3B and Pi B+. However, I reran the Pi 2B benchmarks using 64-bit integers by changing uint32_t to uint64_t in the updated source. Surprisingly, the -mcpu=cortex-a7 flag makes no difference for 64-bit integers on the Pi 2B and the results are disappointingly slow. It would be great if someone could run the program using 64-bit integers on the Pi 3B and figure out if there are any compiler options I'm missing that could be used to speed things up.

64-bit integers should be much slower than 32-bit ones when executed in a 32-bit binary. You'd need to run a 64-bit OS and binary to speed things up. However, the thing I'm not getting here is the strange results on x86 CPUs like the ones above. The results for the 32-bit binary look plausible, I think (i.e. 64-bit calculations are much slower). The results for 32-bit integers in the 64-bit binary also look okay. 64-bit mode has twice as many general purpose registers compared to 32-bit mode, which may speed up some loads. However, it also increases bandwidth requirements. For the result to remain unchanged when going from 32-bit to 64-bit mode is not uncommon.
That leaves us with the last result: the 64-bit integers in the 64-bit binary. It's by far the slowest and I have no idea why. Given the fact that the CPU natively executes 64-bit integers, I would expect it to perform on a similar level as the 32-bit integer results. Granted, I'm certainly no compiler expert, so I might be missing something here.
Does anyone have any theories?
Re: Benchmarking a raspberrypi compared to my own PC
Result for i7-3770 using 32-bit "Integer" (gcc 5.3):
Code: Select all
Found a total of 664579 primes (32-bit)
real 0m1.063s
user 0m1.060s
sys 0m0.000s
Found a total of 664579 primes (32-bit)
real 0m0.975s
user 0m0.972s
sys 0m0.000s
Found a total of 664579 primes (32-bit)
real 0m0.976s
user 0m0.976s
sys 0m0.000s
"Integer" set to 64 bits:
Code: Select all
movq prime(%rip), %rcx
movslq %r8d, %r8
cmpq %r8, %rcx
ja .L4
xorl %edx, %edx
movq %rbx, %rax
divq %rcx
testq %rdx, %rdx
je .L5
"Integer" set to 32 bits:
Code: Select all
movl prime(%rip), %ecx
cmpl %r8d, %ecx
ja .L4
xorl %edx, %edx
movl %ebx, %eax
divl %ecx
testl %edx, %edx
je .L5
Note the divq for 64 bits and the divl for 32 bits. Also note the extra sign-extend instruction at the start, which must be completed before the cmp. The register move and the xor will be eliminated by the decoder in both modes.
Sadly, we can't (yet) run the Pi3 in aarch64 mode.

Mikael wrote: That leaves us with the last result: the 64-bit integers in the 64-bit binary. It's by far the slowest and I have no idea why. Given the fact that the CPU natively executes 64-bit integers, I would expect it to perform on a similar level as the 32-bit integer results. Granted, I'm certainly no compiler expert, so I might be missing something here.
Does anyone have any theories?

I have found in the past that for larger programs on x86, 64-bit mode gives a modest speed increase, say around 15% or so. I suspect that in this case the program is tiny, spending most of its time in a small inner loop, and some minor effect, such as the reduced number of clock cycles needed for the 32-bit divide, could be dominant.
Pi4 8GB and Pi4 4GB running Raspberry Pi OS 64-bit
Re: Benchmarking a raspberrypi compared to my own PC
Mikael wrote: Interesting. I did a quick test on my laptop, using 32/64-bit integers and 32/64-bit kernel. The kernel does not make a difference, as you say. However, compiling a 32-bit or 64-bit binary does make a difference:
Core i5-5300U:
32-bit binary:
uint32: 1.100 s
uint64: 2.192 s
64-bit binary:
uint32: 1.152 s
uint64: 2.772 s
(gcc 5.2.1, -O3)
Ubuntu 15.10 64-bit
64-bit integers should be much slower than 32-bit ones when executed in a 32-bit binary. You'd need to run a 64-bit OS and binary to speed things up. However, the thing I'm not getting here is the strange results on x86 CPUs like the ones above. The results for the 32-bit binary look plausible, I think (i.e. 64-bit calculations are much slower). The results for 32-bit integers in the 64-bit binary also look okay. 64-bit mode has twice as many general purpose registers compared to 32-bit mode, which may speed up some loads. However, it also increases bandwidth requirements. For the result to remain unchanged when going from 32-bit to 64-bit mode is not uncommon.
That leaves us with the last result: the 64-bit integers in the 64-bit binary. It's by far the slowest and I have no idea why. Given the fact that the CPU natively executes 64-bit integers, I would expect it to perform on a similar level as the 32-bit integer results. Granted, I'm certainly no compiler expert, so I might be missing something here.
Does anyone have any theories?

Did you perform each timing using the command below?
Code: Select all
time ./a.out; time ./a.out; time ./a.out
Re: Benchmarking a raspberrypi compared to my own PC
Mikael wrote: Does anyone have any theories?

No theories; however, I can now confirm your timings. With 64-bit integers, the 32-bit binary surprisingly runs about 20% faster than the 64-bit binary on an i3 550. On the other hand, the situation is reversed for the exact same binaries on an AMD Athlon II X2 255.
Code: Select all
                    32-bit binary   64-bit binary
i3 550                  2.058           2.430
Athlon II X2 255        5.301           3.830
Code: Select all
$ gcc -O3 -msse2 -mfpmath=sse prime.c -lm
Code: Select all
$ grep "model name" /proc/cpuinfo | sort -u
model name : Intel(R) Core(TM) i3 CPU 550 @ 3.20GHz
$ file prime64on32
prime64on32: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.26, BuildID[sha1]=0x9a55be5cdecd3251f5407682d64c6bcd079bea26, not stripped
$ time ./prime64on32; time ./prime64on32; time ./prime64on32
Found a total of 664579 primes (64-bit)
real 0m2.066s
user 0m2.056s
sys 0m0.000s
Found a total of 664579 primes (64-bit)
real 0m2.073s
user 0m2.064s
sys 0m0.000s
Found a total of 664579 primes (64-bit)
real 0m2.058s
user 0m2.040s
sys 0m0.008s
$ file prime64on64
prime64on64: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.26, BuildID[sha1]=0x6837b7d0137cbd9448c7976b1e49b98759215b5e, not stripped
$ time ./prime64on64; time ./prime64on64; time ./prime64on64
Found a total of 664579 primes (64-bit)
real 0m2.495s
user 0m2.484s
sys 0m0.000s
Found a total of 664579 primes (64-bit)
real 0m2.430s
user 0m2.420s
sys 0m0.000s
Found a total of 664579 primes (64-bit)
real 0m2.434s
user 0m2.424s
sys 0m0.000s
----------------------------------------------------------------
$ grep "model name" /proc/cpuinfo | sort -u
model name : AMD Athlon(tm) II X2 255 Processor
$ time ./prime64on32; time ./prime64on32; time ./prime64on32
Found a total of 664579 primes (64-bit)
real 0m5.317s
user 0m5.304s
sys 0m0.000s
Found a total of 664579 primes (64-bit)
real 0m5.308s
user 0m5.296s
sys 0m0.004s
Found a total of 664579 primes (64-bit)
real 0m5.301s
user 0m5.292s
sys 0m0.000s
$ time ./prime64on64; time ./prime64on64; time ./prime64on64
Found a total of 664579 primes (64-bit)
real 0m3.835s
user 0m3.828s
sys 0m0.000s
Found a total of 664579 primes (64-bit)
real 0m3.830s
user 0m3.824s
sys 0m0.004s
Found a total of 664579 primes (64-bit)
real 0m3.836s
user 0m3.828s
sys 0m0.000s
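The summary table at the top of this post appears to report the fastest `real` time from each group of three runs (2.058 from 2.066/2.073/2.058, and so on). A minimal sketch of that reduction follows; the function names `best_of` and `percent_slower` are hypothetical helpers, not part of anything posted in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest of n timing samples in seconds; the caller must pass n >= 1. */
double best_of(const double *samples, size_t n)
{
    assert(samples != NULL && n >= 1);
    double best = samples[0];
    for (size_t i = 1; i < n; i++) {
        if (samples[i] < best)
            best = samples[i];
    }
    return best;
}

/* How much longer `slow` takes than `fast`, as a percentage of `fast`. */
double percent_slower(double slow, double fast)
{
    return (slow / fast - 1.0) * 100.0;
}
```

Fed the `real` times from the i3 550 session, `best_of` gives 2.058 and 2.430, and `percent_slower(2.430, 2.058)` comes to roughly 18, in line with the "about 20%" quoted. Taking the minimum rather than the mean is a common choice for short benchmarks because background activity can only make a run slower.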
Re: Benchmarking a raspberrypi compared to my own PC
ejolson wrote: Updated to distinguish timings using 32-bit integers from 64-bit integers.
Code: Select all
CPU                         32-bit    64-bit   Compiler
----------------------------------------------------------------------------
i7-6700K 4.0GHz              0.600     1.820   gcc-5.2.1 -O3
i7-4850HQ 2.3GHz                       1.850   LLVM-7.0.2 -O3
i7-3770k 4GHz                0.975     2.146   gcc-5.2 -O3
i5-3570k 3.4GHz                        2.497   gcc-4.8.4 -O3
Xeon E5-2620v3 2.4GHz        1.135     2.545   gcc-4.7.2 -O3
Xeon E5-2650v2 2.6GHz        1.155     2.592   gcc-4.4.7 -O3
AMD A6-5400K 3.6GHz          2.023     2.095   gcc-4.7.2 -O3
Opteron 6212 2.6GHz          3.407     3.421   gcc-5.1.0 -O3
Phenom II X4 3.4GHz          3.473     3.479   gcc-4.7.2 -O3
ARMv8 Pi 3B 1200MHz          3.611             gcc-4.9.2 -O3 -mcpu=cortex-a53
Pentium 4 3.4Ghz             3.759     5.181   gcc-5.2.1 -O3
AthlonII X2 255 3.1GHz       3.828     3.836   gcc-4.7.2 -O3
Athlon64 X2 5400+ 2.8GHz     4.601     7.893   gcc-4.6.3 -O3
Pentium 4D CPU 2.80GHz       4.612     6.271   gcc-4.7.2 -O3
ARMv7 Pi 2B 900MHz           7.187    74.790   gcc-5.2 -O3 -mcpu=cortex-a7
Pentium III 866MHz          14.999    20.169   gcc-4.7.2 -O3
ARMv8 Pi 3B 1200MHz         17.670             gcc-4.9.2 -O3
Pentium III 650MHz          19.891    26.735   gcc-4.7.2 -O3
AMD-K6 3D 350MHz            26.726    45.469   gcc-4.7.2 -O3
ARMv7 Pi 2B 900MHz          27.741    74.987   gcc-4.6.3 -O3
ARMv6 Pi B+ 700MHz          74.027             gcc-4.6.3 -O3
i586 Pentium 75MHz         155.804   303.710   gcc-2.7.2.3 -O3
i486 DX/2 66MHz            282.180   919.130   gcc-2.6.3 -O3

Amazed at how speedy the i7-4850HQ is, not bad for a laptop. Wonder if the eDRAM has any effect?
Re: Benchmarking a raspberrypi compared to my own PC
Here is another data point based on the simple prime finding program posted above for an ARMv8 processor running in 64-bit mode. I ran the program on a single board computer called the NanoPi T3 which uses the same clock speed as the Raspberry Pi 3B+ but a slightly slower memory speed. The results are
Code: Select all
$ gcc -o prime32 -O3 prime32.c -lm
$ gcc -o prime64 -O3 prime64.c -lm
$ time ./prime32; time ./prime32; time ./prime32
Found a total of 664579 primes (32-bit)
real 0m2.902s
user 0m2.896s
sys 0m0.008s
Found a total of 664579 primes (32-bit)
real 0m2.876s
user 0m2.864s
sys 0m0.012s
Found a total of 664579 primes (32-bit)
real 0m2.878s
user 0m2.872s
sys 0m0.004s
$ time ./prime64; time ./prime64; time ./prime64
Found a total of 664579 primes (64-bit)
real 0m3.114s
user 0m3.108s
sys 0m0.004s
Found a total of 664579 primes (64-bit)
real 0m3.088s
user 0m3.084s
sys 0m0.004s
Found a total of 664579 primes (64-bit)
real 0m3.091s
user 0m3.088s
sys 0m0.004s
$ gcc --version
gcc (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
This places the 1400 MHz ARMv8 processor running in 64-bit mode right above the Opteron 6212 in the previous table when computing with either 32-bit or 64-bit integers. At the moment my Pi 3B+ is serving as a WiFi access point and unavailable for 64-bit testing, but I would expect it to run even faster because of the faster memory speed. If anyone has a 3B+ running a 64-bit operating system and wants to verify this, that would be greatly appreciated.
It is interesting to note that a Pi 3B running at 1200 MHz in 32-bit mode turns in a runtime of 3.611s when using 32-bit integers, while the test above indicates a runtime using 32-bit integers of only 2.878s. We may attribute 16 percent of this performance increase to the faster clock speed; however, the actual increase is 25 percent. Therefore, the remaining 9 percent is likely due to better code optimization, possibly resulting from the richer set of available registers when running in 64-bit mode as opposed to 32-bit mode.
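The attribution in the last paragraph can be checked with a short calculation. This is a sketch that assumes run time scales inversely with clock speed and ignores every other difference between the two boards (memory speed, compiler version, operating system), so the residual is only indicative:

```latex
\begin{align*}
\text{clock ratio} &= \frac{1400\,\text{MHz}}{1200\,\text{MHz}} \approx 1.167 \\
\text{observed speedup} &= \frac{3.611\,\text{s}}{2.878\,\text{s}} \approx 1.255 \\
\text{residual} &= \frac{1.255}{1.167} \approx 1.075
\end{align*}
```

Subtracting percentages gives the 9 points quoted above; dividing the ratios puts the part not explained by clock speed nearer 7.5 percent. Either way the conclusion is the same: the 64-bit build gains a little beyond what the clock alone would predict.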