ejolson
Posts: 1424
Joined: Tue Mar 18, 2014 11:47 am

Re: Raspberry Pi Benchmarks

Thu Apr 12, 2018 3:06 pm

bensimmo wrote:
Thu Apr 12, 2018 10:14 am
That doesn't show you what the platform can do, other than only showing how well it can handle legacy compiled code.
I see two different ideas here, both with the goal of comparing and increasing performance. In one case you take a particular hardware configuration and tune the software until it solves a given problem as fast as possible. This is the best-effort benchmarking scenario discussed previously. In the other case you take a particular piece of (compiled legacy) software and tune the hardware until it runs that code as fast as possible. This is sometimes called overclocking rather than benchmarking. Thus, in benchmarking you tune the software while keeping the hardware the same, whereas in overclocking you tune the hardware while keeping the software the same.

Tuning the hardware to run particular software as fast as possible is arguably more common (and useful) than traditional benchmarking. The reason is the vast amount of precompiled binary code available, both commercially produced for purchase and free as in beer. For example, since the Raspbian userland has been compiled for ARMv6 compatibility, it is possible to overclock the ARMv8 cores on the Pi 3B and Pi 3B+ and still correctly execute all of userland. By some accounts the stock clock settings for the 3B and 3B+ have already been tuned in this way, but never mind that.

If the hardware has been tuned to execute a particular set of binaries fast, it may crash or produce incorrect results for differently optimized code that was not taken into account during the overclocking. However, the usage scenario in which the binary executables are fixed is so common that overclocking might well be the more important type of performance tuning and basis for comparison, at least in the short term.

bensimmo
Posts: 2599
Joined: Sun Dec 28, 2014 3:02 pm
Location: East Yorkshire

Re: Raspberry Pi Benchmarks

Thu Apr 12, 2018 3:32 pm

I've never heard the term overclocking used in that context.
To me that is taking the hardware, say a Pi1 or a Pi3 or an Intel P5 i4460K or a 6502, and running it beyond its default designated safe parameters.

But if overclocking means taking an ARMv6-optimised FFT and pi calculation, running it on an ARMv8 and seeing how fast it performs, then so be it.

jahboater
Posts: 2477
Joined: Wed Feb 04, 2015 6:38 pm

Re: Raspberry Pi Benchmarks

Thu Apr 12, 2018 4:52 pm

For me "benchmarking" and "overclocking" are two entirely different things. Benchmarking is pure "measurement"; "overclocking" is performance tuning (likely beyond the manufacturer's spec). The two are only related because benchmarking is required to assess the effectiveness of an overclock.

ejolson
Posts: 1424
Joined: Tue Mar 18, 2014 11:47 am

Re: Raspberry Pi Benchmarks

Fri Apr 13, 2018 12:57 pm

jahboater wrote:
Thu Apr 12, 2018 4:52 pm
For me "benchmarking" and "overclocking" are two entirely different things. Benchmarking is pure "measurement"; "overclocking" is performance tuning (likely beyond the manufacturer's spec). The two are only related because benchmarking is required to assess the effectiveness of an overclock.
Given the accuracy with which parts are speed graded, overclocking almost always breaks the hardware in some way. The only reason it works is that the instruction sequences that fail under the overclock happen not to be used in whatever specific software (usually a game) is under consideration.

Benchmarking in general requires tuning the software (or writing new code) for the computer being measured. Many websites that run things like the Sandra CPU test and Doom on a bunch of PCs are not answering the question of how fast a particular computer can solve a particular problem, but rather trying to monetize some sort of online advertising.

ejolson
Posts: 1424
Joined: Tue Mar 18, 2014 11:47 am

Re: Raspberry Pi Benchmarks

Tue Apr 24, 2018 12:12 am

I'm posting a link to a Linpack benchmarking run showing the Pi 3B+ can achieve 6.718 Gflops, which is hopefully closer to a best-effort result than the 605 Mflops reported earlier in this thread. From what I can tell, Roy's results indicate how well modern computers run historical benchmark codes without specialized tuning. Such results are important, in my opinion, because a significant percentage of numerical codes used in production actually fall into this category.

I'm posting here because current usage of the term Linpack generally refers to best-effort attempts to solve problems scaled to the size of available memory using carefully optimized code. As these forum posts show up early on web searches, I believe it is important to include results for comparison which reflect the performance of the Pi 3B+ following the practices currently used for other computers.

I'm also posting here in the hope that someone interested in benchmarks might verify whether 6.718 Gflops really reflects a best-effort Linpack result for the Pi 3B+. In particular, based on relative clock speeds I had expected a speed of over 7 Gflops and would appreciate it if someone else could verify my results.

RoyLongbottom
Posts: 214
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK

Re: Raspberry Pi Benchmarks

Thu Apr 26, 2018 5:06 pm

Raspberry Pi 3B+ Memory Benchmarks

Full details of 32 bit and 64 bit memory benchmarks (and single core tests) are available at ResearchGate in Raspberry Pi 3B+ 32 Bit and 64 Bit Benchmarks and Stress Tests.pdf, from the following link (then click on the down arrow to select download):

https://www.researchgate.net/publicatio ... ress_Tests

This includes 3B+ comparisons with the older Model 3B, and 64 bit versus 32 bit performance. The latter is repeated below for the newer processor (the 3B is similar). The 3B+/3B performance ratio is essentially proportional to the respective CPU MHz speeds where data from caches is processed, but the 3B+ is often shown to be slightly slower with RAM data transfers. The benchmarks are as follows, most of them doubling the data size used at each step to cover caches and RAM, with performance measured in megabytes per second. Example full results and comparisons are provided below.

MemSpeed - carries out the calculations shown in the following table, the first having the same form as the time-dependent function in the Linpack benchmark. Maximum MFLOPS are also shown for these, plus MFLOPS/MHz ratios, which are higher than those for Linpack, mainly due to the smoother data flow and, to a lesser extent, to using L1 cache based results. The best 64 bit performance gains were with double precision floating point, but one result indicates that the older RPi was faster using RAM based data.

Code:

            Memory Reading Speed Test vfpv4 32 Bit Version 1

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S
                                                                          3B+/3B
           Raspberry Pi 3B+ CPU 1400 MHz, SDRAM ?                        Avg Gain

       8    1899   2125   4041   2783   2624   4448   3164   3693   3693  1.17 L1
      16    1901   2128   4058   2791   2628   4462   3177   3703   3707
      32    1852   2049   3817   2686   2508   4161   3186   3715   3711
      64    1796   1959   3574   2542   2367   3855   2945   3347   3347  1.16 L2
     128    1826   1989   3741   2600   2408   4031   3042   3506   3508
     256    1833   1995   3771   2617   2414   4068   2860   3616   3617
     512    1517   1618   2587   2039   1911   2687   2459   2825   2832
    1024     968   1098   1221   1172   1140   1211   1455   1144   1137  0.98 RAM 
    2048     911    980   1060   1038   1026   1062   1013    941    935
    4096     913    993   1064   1047   1038    948    992    902    903
    8192     926   1013   1077   1074   1065   1085    782    784    783

 Max MFLOPS  238    532
    Per MHz 0.17   0.38
    64 bit  0.43   0.52

 #################### Compare 64 bit / 32 bit Pi 3B+ ######################

       8    2.54   1.36   1.08   2.22   1.51   1.09   1.70   1.17   1.17
     256    2.12   1.39   1.05   1.86   1.53   1.06   1.71   1.13   1.13
    8192    0.71   1.19   1.17   1.14   1.03   1.17   1.29   1.38   1.38

#######################################################################
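The Max MFLOPS and Per MHz rows of the MemSpeed table can be reproduced from the MB/S columns. This is a minimal sketch, assuming (it is not stated in the post) that MB/S counts the bytes read from both x and y and that x[m]=x[m]+s*y[m] costs two floating point operations per element. The function name is made up for illustration.

```python
def mflops_from_mbps(mb_per_sec, bytes_per_value, flops_per_element=2, arrays_read=2):
    """Convert a MemSpeed MB/S figure into MFLOPS under the stated assumptions."""
    bytes_per_element = bytes_per_value * arrays_read
    million_elements_per_sec = mb_per_sec / bytes_per_element
    return million_elements_per_sec * flops_per_element

CPU_MHZ = 1400  # Pi 3B+ clock from the table heading

# Best x[m]=x[m]+s*y[m] speeds from the 32 bit table (16 KB row)
double_mflops = mflops_from_mbps(1901, 8)
single_mflops = mflops_from_mbps(2128, 4)

print(round(double_mflops), round(double_mflops / CPU_MHZ, 2))  # 238 0.17
print(round(single_mflops), round(single_mflops / CPU_MHZ, 2))  # 532 0.38
```

Both lines agree with the Max MFLOPS and Per MHz rows above, which suggests the assumptions are at least consistent with how the benchmark reports its figures.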
NeonSpeed - executes the same functions as MemSpeed, but with all floating point calculations using single precision floating point (for compatibility with NEON). Some normal calculations are also included for comparison. The NEON calculations are carried out using NEON intrinsic functions, but the latest compilers convert these into more appropriate vector instructions. This leads to little difference between 32 bit and 64 bit speed, the former being faster in one case. For some reason, 32 bit normal calculations were faster than in MemSpeed, but maximum NEON MFLOPS per MHz were significantly higher.

Code:

   NEON SP Float & Integer Benchmark RPi 3B+ 64 Bit

  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v    3B+/3B
  KBytes   Norm   Neon   Norm   Neon  Float    Int  Avg Gain

      16   2724   5109   3961   4841   5446   5607  1.16 L1
      32   2612   4645   3726   4450   4968   5036 
      64   2523   4247   3540   4150   4521   4519  1.16 L2
     128   2583   4363   3666   4253   4616   4635
     256   2576   4314   3674   4254   4591   4631
     512   1852   2871   2608   2466   2916   2698
    1024   1222   1207   1305   1179   1280   1216  1.08 RAM
    4096   1157   1144   1214   1109   1181   1160
   16384   1175   1245   1244   1134   1191   1180
   65536   1143   1258   1185    909   1144   1260

Max MFLOPS  681   1277
  Per MHz  0.49   0.91
  32 Bit   0.57   0.84

 #################### Compare 64 bit / 32 bit Pi 3B+ ######################

      16   0.86   1.10   0.99   0.99   1.05   1.02
     256   0.88   1.07   0.98   1.01   1.06   1.00
   65536   0.85   0.94   0.88   0.90   0.91   0.93
 
 #######################################################################
BusSpeed - is designed to identify reading of data in bursts over buses and the possible maximum data transfer speed from RAM (using 1 core - see MP version). The program starts by reading a word (4 bytes) with an address increment of 32 words (128 bytes) before reading another word. The increment is halved on successive tests until all data is read. Data is read using inner loops containing 64 AND statements, which appear to generate essentially the same code for 32 bit and 64 bit compilations, with only 32 bit data words being used. Surprisingly, the 64 bit version was slow when reading all data from what should be L1 cache.

Code:

                    BusSpeed 64 Bit  
                                                    
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read  3B+ Gain
  KBytes  Words  Words  Words  Words  Words    All  Read All

      16   3823   4251   4638   4945   5045   3854  1.15 L1
      32   1543   1677   2423   3331   4152   3680
      64    672    694   1306   2169   3300   3577  1.17 L2
     128    635    648   1211   2055   3202   3604
     256    600    615   1163   1971   3152   3612
     512    328    278    695   1272   2256   2978
    1024     94    140    281    543    960   2075  1.12 RAM
    4096     99    128    259    448   1016   1931
   16384    125    129    258    500    898   1863
   65536    125    114    257    500   1015   1898

 #################### Compare 64 bit / 32 bit Pi 3B+ ######################

      16   1.02   1.03   0.98   1.00   0.99   0.76
     256   0.96   0.97   1.00   0.96   0.99   0.90
   65536   0.99   0.88   1.02   1.02   1.01   1.10

 #######################################################################
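The access pattern described for BusSpeed can be sketched as follows. This is an illustration in Python rather than the benchmark's C source: it only models which 4 byte words are touched at each increment, and the function name and the running AND are assumptions based on the description above.

```python
def strided_and(words, increment):
    """AND together every `increment`-th word; return (result, words_read)."""
    result = 0xFFFFFFFF
    count = 0
    for i in range(0, len(words), increment):
        result &= words[i]
        count += 1
    return result, count

# 16 KB of 32 bit words, matching the smallest row of the table
words = [0xFFFFFFFF] * (16 * 1024 // 4)

# Start at 32 words (128 bytes) and halve until every word is read
increment = 32
while increment >= 1:
    _, n = strided_and(words, increment)
    print(f"Inc{increment:<2} reads {n} of {len(words)} words")
    increment //= 2
```

Assuming 64 byte cache lines (16 words), Inc32 and Inc16 each touch a different line on every read, which would fit those two columns being similar in the table and the speed rising as the increment shrinks from there.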
Fast Fourier Transforms - There are two FFT benchmarks, the second one optimised to make better use of burst data transfers, with the procedures dependent on skipped sequential access. FFT sizes vary between 1K and 1024K, covering caches and RAM. Three copies are run using both single and double precision data, with the middle ones used here as the best choice, given the variation in millisecond running times. Because of that variation, 3B/3B+ comparisons are not as constant as for other benchmarks, which is reflected in the different 64/32 bit comparisons provided below.

With running times of the smaller FFTs being less than a millisecond, the first few measurements can be stretched when the CPU MHz scaling governor is set to ondemand. The performance setting is required to produce more acceptable results. An example is shown below.

Code:

                 FFT Benchmarks 
   
    Size  -------- milliseconds --------
       K  Single  Double  Single  Double

                 scaling_governor
             performance      ondemand
  
       1    0.17    0.14    0.40    0.14
       2    0.38    0.32    0.93    0.32
       4    1.07    0.77    1.97    0.75
       8    2.13    1.89    4.64    1.76
      16    4.57    5.83    4.47    5.83 



 #################### Compare 64 bit / 32 bit ######################

                RPi3           RPi3B+ 
        K  Single  Double  Single  Double 
  FFT1
   1 to 8    1.05    0.86    1.06    0.90
  16 to 128  1.17    0.83    1.14    1.06
 256 to 1M   1.26    0.88    1.58    1.13

 FFT3C
   1 to 8    1.24    0.89    1.17    0.88
  16 to 128  1.05    1.04    1.15    1.17
 256 to 1M   1.14    1.01    1.26    1.16

 #######################################################################
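Given the governor effect on the sub-millisecond FFT timings above, it can be worth checking the governor before trusting short measurements. A minimal sketch, assuming the standard Linux cpufreq sysfs file for core 0; the helper names are made up for illustration and are not part of the benchmarks.

```python
from pathlib import Path

# Standard Linux cpufreq sysfs location for core 0
GOVERNOR_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")

def read_governor(path=GOVERNOR_PATH):
    """Return the active scaling governor, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None

def ok_for_short_timings(governor):
    """Treat sub-millisecond timings as reliable only with the clock pinned high."""
    return governor == "performance"

gov = read_governor()
if not ok_for_short_timings(gov):
    print(f"governor is {gov!r}; the first few FFT timings may be inflated")
```

Switching governors needs root, typically by writing performance to the same file.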

jahboater
Posts: 2477
Joined: Wed Feb 04, 2015 6:38 pm

Re: Raspberry Pi Benchmarks

Thu Apr 26, 2018 5:28 pm

RoyLongbottom wrote:
Thu Apr 26, 2018 5:06 pm
NeonSpeed - executes the same functions as MemSpeed, but with all floating point calculations using single precision floating point (for compatibility with NEON).
NEON happily does double precision by the way (even VFP did), or have I missed something?

RoyLongbottom
Posts: 214
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK

Re: Raspberry Pi Benchmarks

Thu Apr 26, 2018 5:55 pm

jahboater wrote:
Thu Apr 26, 2018 5:28 pm
RoyLongbottom wrote:
Thu Apr 26, 2018 5:06 pm
NeonSpeed - executes the same functions as MemSpeed, but with all floating point calculations using single precision floating point (for compatibility with NEON).
NEON happily does double precision by the way (even VFP did), or have I missed something?
I could not find any intrinsics when I wrote the program. See the following, with no sign of double or f64:

https://gcc.gnu.org/onlinedocs/gcc-4.8. ... nsics.html

jahboater
Posts: 2477
Joined: Wed Feb 04, 2015 6:38 pm

Re: Raspberry Pi Benchmarks

Thu Apr 26, 2018 7:00 pm

I don't know why there are no intrinsics for it. Perhaps because there are only two lanes, who knows.
Your link points to a GCC 4.8 document, which is very old. The current version of GCC is 7.3, with 8.1 due to be released next week.

The ARM intrinsics document mentions double precision:
http://infocenter.arm.com/help/topic/co ... cs_ref.pdf

A few seconds' look at the ARMv8 ARM (Architecture Reference Manual) shows that all the floating point instructions, both vector and scalar, accept double precision operands.

Code:

C7.2.42
FADD (vector)
Floating-point Add (vector). This instruction adds corresponding vector elements in the two source SIMD&FP
registers, writes the result into a vector, and writes the vector to the destination SIMD&FP register. All the values
in this instruction are floating-point values.
This instruction can generate a floating-point exception. Depending on the settings in FPCR, the exception results
in either a flag being set in FPSR or a synchronous exception being generated. For more information, see
Floating-point exceptions and exception traps on page D1-1899.
Depending on the settings in the CPACR_EL1, CPTR_EL2, and CPTR_EL3 registers, and the current Security state
and Exception level, an attempt to execute the instruction might be trapped.
Half-precision
ARMv8.2

 31 30 29 28 27 26 25 24 23 22 21 | 20-16 | 15 14 13 12 11 10 | 9-5 | 4-0
  0  Q  0  0  1  1  1  0  0  1  0 |  Rm   |  0  0  0  1  0  1 | Rn  | Rd
        U

Half-precision variant
FADD <Vd>.<T>, <Vn>.<T>, <Vm>.<T>

Decode for this encoding
if !HaveFP16Ext() then UnallocatedEncoding();
integer d = UInt(Rd);
integer n = UInt(Rn);
integer m = UInt(Rm);
integer esize = 16;
integer datasize = if Q == '1' then 128 else 64;
integer elements = datasize DIV esize;
boolean pair = (U == '1');

Single-precision and double-precision

 31 30 29 28 27 26 25 24 23 22 21 | 20-16 | 15 14 13 12 11 10 | 9-5 | 4-0
  0  Q  0  0  1  1  1  0  0 sz  1 |  Rm   |  1  1  0  1  0  1 | Rn  | Rd
        U

Single-precision and double-precision variant
FADD <Vd>.<T>, <Vn>.<T>, <Vm>.<T>

Decode for this encoding
integer d = UInt(Rd);
integer n = UInt(Rn);
integer m = UInt(Rm);
if sz:Q == '10' then ReservedValue();
integer esize = 32 << UInt(sz);
integer datasize = if Q == '1' then 128 else 6
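The decode pseudocode above can be sketched in Python to show the lane counts it implies. The excerpt is cut off in the single/double variant, so the 64 bit datasize and the elements = datasize DIV esize step are taken from the half-precision variant and assumed to carry over. The function name is made up for illustration.

```python
def decode_fadd_vector(sz, q):
    """Return (esize, datasize, elements) for the single/double FADD variant."""
    if (sz, q) == (1, 0):
        # sz:Q == '10' is the reserved encoding in the excerpt
        raise ValueError("reserved: sz=1 with Q=0")
    esize = 32 << sz                  # 32 for single, 64 for double
    datasize = 128 if q == 1 else 64  # assumed same as half-precision variant
    elements = datasize // esize      # assumed: datasize DIV esize
    return esize, datasize, elements

for sz, q in [(0, 0), (0, 1), (1, 1)]:
    print(sz, q, decode_fadd_vector(sz, q))
```

Under those assumptions double precision (sz=1, Q=1) comes out as two 64 bit lanes in a 128 bit register, which matches the two lanes guessed at earlier in this post.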

ejolson
Posts: 1424
Joined: Tue Mar 18, 2014 11:47 am

Re: Raspberry Pi Benchmarks

Fri Apr 27, 2018 6:00 am

RoyLongbottom wrote:
Thu Apr 26, 2018 5:06 pm
The latter is repeated below for the newer processor (3B similar). The 3B+/3B performance is essentially proportional to respective CPU MHz speeds, where date from caches is processed, but 3B+ is often shown to be slightly slower with RAM data transfers.
Recently there has been a change in the default Pi 3B+ RAM settings from 500 MHz down to 450 MHz. I think the schmoo memory timings change as well between the two frequencies, so it's not clear which actually performs better in the end. Did the slower RAM data transfers for the 3B+ occur with the memory clock set to 500 MHz or 450 MHz?

jahboater
Posts: 2477
Joined: Wed Feb 04, 2015 6:38 pm

Re: Raspberry Pi Benchmarks

Fri Apr 27, 2018 7:32 am

ejolson wrote:
Fri Apr 27, 2018 6:00 am
I think the schmoo memory timings change as well between the two frequencies, so it's not clear which actually performs better in the end.
Yes. At 450 MHz schmoo is off. If you set 500 MHz by hand, schmoo gets set to
sdram_schmoo=0x2000020
and you cannot override it ....
Is the schmoo setting increasing the drive level or relaxing the timings?
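As a starting point for that question, the value can at least be broken into its set bits. This says nothing about what each bit controls, which is not documented anywhere in this thread; the helper name is made up for illustration.

```python
def set_bits(value):
    """Return the positions of the bits set in value, lowest first."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]

schmoo = 0x2000020  # value reported above when 500 MHz is set by hand
print(hex(schmoo), set_bits(schmoo))  # 0x2000020 [5, 25]
```

So only two bits are set, bit 5 and bit 25, which hints at two separate fields or flags rather than a single numeric setting.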
