olso4539
Posts: 30
Joined: Mon Feb 03, 2014 9:02 pm

Re: Raspberry Pi Benchmarks

Mon Jun 02, 2014 2:25 pm

The best/easiest thing you can do to reduce the temperature on the Pi is to set it on a heat-conducting surface, or to orient it so that the board is vertical instead of flat (i.e. lay the case on its side). I had my Pi hit 80 degrees while sitting in an Adafruit Pi Case on carpet. I tipped it up on edge (still on the carpet, still in the case) and now I haven't seen it pass 65 degrees (Celsius, of course). The thermal transfer to air is much better with a vertical surface (it causes convective cooling, and actually creates a small breeze as the hot air rises past the surface).

It's not just about what kernel and overclock settings you run or what heat sink you put on it. The orientation and environment (even outside the case) make a difference.
-Jon, Aerospace Engineer

eniccm
Posts: 2
Joined: Sun Aug 31, 2014 6:13 pm
Location: Venezuela

Re: Raspberry Pi Benchmarks

Sun Aug 31, 2014 6:30 pm

Hello there, I'm kind of new to the area of benchmarking, so excuse the lack of knowledge. I ran all of the benchmarks Roy posted on his website on my RasPi, and then compiled them myself on the RasPi. But I have a question: I don't know why, when I ran the Whetstone benchmark without compiling it on the Pi, I got higher results than Roy. He got 390.5 MWIPS at 1000 MHz and I got 400 MWIPS at 1000 MHz. Why is this happening? I would appreciate any answer; I'm doing a research paper from these benchmarks. :D :geek:
Enida Casanova
Barquisimeto,Venezuela.

RoyLongbottom
Posts: 281
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Fri Sep 05, 2014 4:32 pm

eniccm wrote:Hello there, I'm kind of new to the area of benchmarking, so excuse the lack of knowledge. I ran all of the benchmarks Roy posted on his website on my RasPi, and then compiled them myself on the RasPi. But I have a question: I don't know why, when I ran the Whetstone benchmark without compiling it on the Pi, I got higher results than Roy. He got 390.5 MWIPS at 1000 MHz and I got 400 MWIPS at 1000 MHz. Why is this happening? I would appreciate any answer; I'm doing a research paper from these benchmarks. :D :geek:
Particularly with just one CPU core, speed can vary if the CPU has other things to do. Speed can also vary depending on memory address alignment and what programs were run before. I just looked at some of my old results at 700 MHz and they varied between 250 and 270 MWIPS. Scaling by the clock ratio of 1000/700 = 1.43, that would suggest 357 to 386 MWIPS at 1000 MHz, indicating that you can’t really rely on the MHz claims. If you compare 1000/700 MHz speeds of all Whetstone tests in my report, the ratio varies between 1.43 and 1.61.

For your project, I suggest that you run the benchmarks a few times. :ugeek: aged 79

eniccm
Posts: 2
Joined: Sun Aug 31, 2014 6:13 pm
Location: Venezuela

Re: Raspberry Pi Benchmarks

Wed Sep 10, 2014 4:32 pm

Thank you very much for the prompt reply :) !!!!!

I have done what you suggested and have run all benchmarks at least 3 times, but I have not found any literature or book that tells me how many executions are needed for statistical confidence, or whether to then take the arithmetic, geometric or harmonic average. Do you have any suggestions? :roll:

PS: my results do not depart greatly from those you obtained ;)
Enida Casanova
Barquisimeto,Venezuela.

RoyLongbottom
Posts: 281
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Wed Sep 10, 2014 9:27 pm

As the benchmarks generally measure speeds, the harmonic mean would seem to be appropriate. Then, there can always be exceptional external events that lead to unrepresentative conclusions. For statistical confidence, you would probably have to quote such things as 95th percentiles, requiring hundreds of measurements, and that is not on. Traditionally, on asking a supplier to provide benchmark results, they would supply best results. I suppose that I quote typical maximum speeds, or a range of results if there is significant regular variation.
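As an illustration only (a minimal sketch, not my benchmark code), the harmonic mean of n speeds is n divided by the sum of their reciprocals; the MWIPS values here are made up:

Code: Select all

 #include <stdio.h>

 /* Harmonic mean of n speed measurements: n / sum(1/x[i]).
    Appropriate for averaging rates over equal amounts of work. */
 static double harmonic_mean(const double *x, int n)
 {
     double recip = 0.0;
     for (int i = 0; i < n; i++)
         recip += 1.0 / x[i];
     return n / recip;
 }

 int main(void)
 {
     double mwips[] = {390.5, 400.0, 386.0}; /* illustrative values only */
     printf("Harmonic mean %.1f MWIPS\n", harmonic_mean(mwips, 3));
     return 0;
 }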

RoyLongbottom
Posts: 281
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Sun Feb 08, 2015 11:41 am

Raspberry Pi 2 Performance

I have just started playing with my new Raspberry Pi 2, on which I will be running all my benchmarks, hopefully starting next week. As a sweetener, I thought that I should demonstrate MP performance.

MP-MFLOPS arithmetic operations executed are of the form x = (x + a) * b - (x + c) * d + (x + e) * f, with 2 or 32 operations per input data word. Array sizes used are 12.8 KB, 128 KB and 12.8 MB, to test with data in L1 cache, L2 cache and RAM. Each of 1, 2, 4 and 8 threads uses the same calculations but accesses different segments of the data.
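Below is a minimal sketch of the threading arrangement (illustrative only, not the benchmark source, and with made-up constants), where each thread runs the same calculation over its own segment of a shared array:

Code: Select all

 #include <pthread.h>

 #define NTHREADS 4
 #define WORDS 3200             /* 12.8 KB of floats: the L1 cache case */

 static float x[WORDS];

 /* Each thread executes x = (x+a)*b - (x+c)*d + (x+e)*f
    over its own quarter of the data. */
 static void *worker(void *arg)
 {
     int t = *(int *)arg, seg = WORDS / NTHREADS;
     const float a = 0.1f, b = 0.2f, c = 0.3f,
                 d = 0.4f, e = 0.5f, f = 0.6f;
     for (int i = t * seg; i < (t + 1) * seg; i++)
         x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
     return NULL;
 }

 int main(void)
 {
     pthread_t tid[NTHREADS];
     int id[NTHREADS];
     for (int t = 0; t < NTHREADS; t++) {
         id[t] = t;
         pthread_create(&tid[t], NULL, worker, &id[t]);
     }
     for (int t = 0; t < NTHREADS; t++)
         pthread_join(tid[t], NULL);
     return 0;
 }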

Results below demonstrate gains in line with the number of cores used, and performance of 8.3 to 12.2 times original RPi speed.

Code: Select all

 Raspberry Pi 2 900 MHz
 Features:  half thumb fastmult vfp edsp neon vfpv3
          tls vfpv4 idiva idivt vfpd32 lpae evtstrm 

 MP-MFLOPS Linux/ARM v1.0 Sun Feb  8 10:37:43 2015

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      102     147     128     407     406     390
 2T      295     289     250     814     810     778
 4T      418     554     360    1597    1612    1520
 8T      488     450     378    1459    1548    1436

#####################################################

 Raspberry Pi 700 MHz
 Features:  swp half thumb fastmult vfp edsp java tls 

 MP-MFLOPS Linux/ARM v1.0 Sat Jul 27 17:41:13 2013

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T       43      33      31     191     170     161
 2T       44      42      31     192     174     160
 4T       44      43      31     192     176     159
 8T       43      51      31     192     184     160

misal sanjay
Posts: 1
Joined: Sat Oct 17, 2015 4:01 pm

Re: Raspberry Pi Benchmarks

Sat Oct 17, 2015 4:07 pm

I have implemented a DVFS mechanism in a bash script with governors. I'm trying to measure run-time power consumption of the Pi or its processor. How should I do it?

RoyLongbottom
Posts: 281
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Tue Oct 20, 2015 9:44 am

misal sanjay wrote:I have implemented a DVFS mechanism in a bash script with governors. I'm trying to measure run-time power consumption of the Pi or its processor. How should I do it?
Sorry, I can’t help. The nearest I have been is measuring CPU MHz and temperature:

http://www.roylongbottom.org.uk/Raspber ... m#anchor28

Googling for “measure power consumption of raspberry pi” seems to suggest that external meters have to be used.
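For reference, the CPU MHz and temperature figures come from standard Raspbian sysfs files. A minimal sketch of reading them (not my actual monitoring program, and the paths are an assumption from standard Raspbian):

Code: Select all

 #include <stdio.h>

 int main(void)
 {
     long khz = 0, mdeg = 0;
     FILE *f;

     /* Current ARM clock set by the scaling governor, in kHz */
     f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
     if (f) { fscanf(f, "%ld", &khz); fclose(f); }

     /* SoC temperature in millidegrees Celsius */
     f = fopen("/sys/class/thermal/thermal_zone0/temp", "r");
     if (f) { fscanf(f, "%ld", &mdeg); fclose(f); }

     printf("%ld scaling MHz, temp=%.1f'C\n", khz / 1000, mdeg / 1000.0);
     return 0;
 }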

jahboater
Posts: 4452
Joined: Wed Feb 04, 2015 6:38 pm

Re: Raspberry Pi Benchmarks

Tue Oct 20, 2015 3:06 pm

Hi Roy,

Have you done any benchmarks of Thumb2 on the Pi2? It seems to give me about a 25% reduction in code size together with a slight increase in speed (perhaps because more instructions fit in the I-cache).

If you have not, just adding "-mthumb" is enough, and the program runs as before - it's a complete instruction set (unlike Thumb 1).

RoyLongbottom
Posts: 281
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Tue Oct 20, 2015 4:34 pm

jahboater wrote:Hi Roy,

Have you done any benchmarks of Thumb2 on the Pi2? It seems to give me about a 25% reduction in code size together with a slight increase in speed (perhaps because more instructions fit in the I-cache).

If you have not, just adding "-mthumb" is enough, and the program runs as before - it's a complete instruction set (unlike Thumb 1).
I just tried it with the last results posted for MP-MFLOPS, compiled with:

gcc mpmflops.c cpuidc.c -lrt -lc -lm -O3 -mcpu=cortex-a7 -mfloat-abi=hard -mfpu=neon-vfpv4 -lpthread -o MP-MFLOPSPiA7

Adding -mthumb produced a slightly smaller file that was slightly slower. Maybe it would be faster with an integer benchmark.

Later, the Dhrystone Benchmark: 17.6 KB, 1649 VAX MIPS, from

gcc dhry_1.c dhry_2.c cpuidc.c -lm -lrt -O3 -mcpu=cortex-a7 -o dhrystonePiA7

Adding -mthumb: 16.9 KB, 1630 VAX MIPS

jahboater
Posts: 4452
Joined: Wed Feb 04, 2015 6:38 pm

Re: Raspberry Pi Benchmarks

Tue Oct 20, 2015 9:18 pm

Yes, I guess the short thumb instructions are all integer. I think the VFP/NEON floating point instructions remain as is.

Incidentally, when running a benchmark I increase the priority with
"sudo nice --20" and make sure at least one core is free to handle interrupts.

Thanks
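PS: the same can be done from inside a program; a minimal sketch using Linux's sched_setaffinity (the core numbers are just an assumption):

Code: Select all

 #define _GNU_SOURCE
 #include <sched.h>
 #include <stdio.h>

 int main(void)
 {
     /* Restrict this process to cores 1-3, leaving core 0
        free to handle interrupts. */
     cpu_set_t set;
     CPU_ZERO(&set);
     for (int cpu = 1; cpu <= 3; cpu++)
         CPU_SET(cpu, &set);
     if (sched_setaffinity(0, sizeof(set), &set) != 0)
         perror("sched_setaffinity");

     /* ... benchmark loops would run here ... */
     return 0;
 }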

NeilAlexanderHiggins
Posts: 39
Joined: Sun May 25, 2014 10:22 am

Re: Raspberry Pi Benchmarks

Sun Aug 07, 2016 8:13 am

I know this subject is a bit passé, but I thought I would report on [email protected] performance. [email protected] benchmarks each new CPU before sending work to it. It rates my Pi 3 at 748/2461 floating point/integer operations per second, compared with 441/1695 respectively for a Pi 2. The measured CPU temperature of my Pi 3 shoots up to 82 degrees and stays there (regulated by throttling, no doubt). Given another report that measurement by either IR or thermocouple probe exceeds this considerably, I'm not sure what the actual temperature would be. I am not overclocking, and I don't have heat sinks installed. The Pi is installed in a closed box (a PiDP-8 enclosure). I'm going to let it run forever or until it fails.

RoyLongbottom
Posts: 281
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Sun Aug 07, 2016 10:03 am

NeilAlexanderHiggins wrote:I know this subject is a bit passé, but I thought I would report on [email protected] performance. [email protected] benchmarks each new CPU before sending work to it. It rates my Pi 3 at 748/2461 floating point/integer operations per second, compared with 441/1695 respectively for a Pi 2. The measured CPU temperature of my Pi 3 shoots up to 82 degrees and stays there (regulated by throttling, no doubt). Given another report that measurement by either IR or thermocouple probe exceeds this considerably, I'm not sure what the actual temperature would be. I am not overclocking, and I don't have heat sinks installed. The Pi is installed in a closed box (a PiDP-8 enclosure). I'm going to let it run forever or until it fails.
I am currently running all my benchmarks on the RPi 3. Many have been run and reported on by others, so I will shortly be including a summary of results here.

I started a while ago, initially concentrating on my new OpenGL GLUT benchmark. This included stress testing, measuring temperatures and CPU MHz at the same time. See the following thread, which includes examples of throttling and display failures. My RPi 3 has a self-adhesive heatsink and that made little difference.

viewtopic.php?f=68&t=145374

RoyLongbottom
Posts: 281
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Wed Aug 10, 2016 10:24 am

Raspberry Pi 3 Benchmarks

I am currently running all my benchmarks on my Raspberry Pi 3. Some have already been run by others, with reports in various places. My results can be found via the following link, and brief summaries of the gcc 4.8 compiled tests are provided here, including comparisons with the 900 MHz Raspberry Pi 2. For those interested in historic comparisons, links are provided for my results on Windows/Linux PCs, Android devices and RPis, plus original data starting in the 1970s/80s.

http://www.roylongbottom.org.uk/Raspber ... hmarks.htm

The Classic Benchmarks are the first programs that set standards of performance for computers in the 1970s and 1980s. They are Whetstone, Dhrystone, Linpack and Livermore Loops.

Whetstone - comprises eight tests measuring speeds of floating point, integers and mathematical functions, efficient compilation of the latter often determining the overall rating in MWIPS.

In the various areas, average RPi 3 speed was 40% to 47% faster than RPi 2. Floating point and integer tests were faster than a 3 GHz Pentium 4. For detailed results on Windows and Linux based PCs, Android devices and RPis, plus speeds of computers from year dot, see:

http://www.roylongbottom.org.uk/whetstone%20results.htm
http://www.roylongbottom.org.uk/whetstone.htm

Dhrystone - a later benchmark of the same sort as Whetstone, but without floating point. Results are in VAX MIPS or DMIPS (relative to the DEC VAX 11/780 minicomputer) and these are highly dependent on optimisation in a particular compiler.

RPi 3 was 48% faster than RPi 2 and clearly faster than Pentium 3 CPUs. My results and historic speeds are in the following links. The latter provides ratings in Dhrystones Per Second, which need dividing by 1757 for DMIPS (a later variation of the VAX 11/780 score is shown in the PDF file).

http://www.roylongbottom.org.uk/dhrystone%20results.htm
http://www.cs.virginia.edu/~mk2z/cs654/ ... chmark.pdf
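As a worked example of the conversion, the MP Dhrystone single thread result later in this thread of 4229473 Dhrystones per second equates to 4229473 / 1757 = 2407 DMIPS (VAX MIPS).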

ARM reports results in DMIPS/MHz; on this benchmark, the RPi 3 rating is 2.06. ARM's rating is nearly always higher than via my benchmark, but this might be a throwback to the origins, where hardware and software were designed together for the highest benchmark speeds.

Linpack - This has floating point calculations, as in the original, using 100x100 matrices of double precision (DP) numbers, normally L2 cache sized data. Performance depends almost entirely on a function calculating dy = dy + da*dx, suitable for vector type linked add and multiply. A version using NEON intrinsic functions is provided. As this uses single precision (SP), a standard compilation of this is also provided. Speed is measured in Millions of Floating Point Operations Per Second (MFLOPS).
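The function concerned is essentially the classic daxpy loop; a minimal sketch (not the exact benchmark source):

Code: Select all

 /* dy = dy + da*dx over n elements - one linked multiply and add
    per element, which is what the MFLOPS rating reflects. */
 static void daxpy(int n, double da, const double *dx, double *dy)
 {
     for (int i = 0; i < n; i++)
         dy[i] += da * dx[i];
 }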

Performance improvements over RPi 2 were 17% DP, 24% SP and 62% NEON, with RPi 3 measurements of 180, 194 and 486 MFLOPS. The DP result can be vaguely compared to a Pentium III E at 185 MFLOPS.

http://www.roylongbottom.org.uk/linpack%20results.htm
http://netlib.org/benchmark/performance.pdf

Livermore Loops - These comprise 24 kernels from numerical applications, with speed measured in MFLOPS. The original was used to verify performance of the first Cray 1 supercomputer (cost $7 million). The official average is the geometric mean.

Average RPi 3 speed of 186 MFLOPS was 48% faster than RPi 2, 15.6 times that of a Cray 1 supercomputer and similar to a 1700 MHz Pentium 4. The original results were in “The Livermore Fortran Kernels: A Computer Test Of The Numerical Performance Range” by F.H. McMahon. This appears to be available for downloading but, in the case of researchgate.net, you will need approval from the authors (I am still waiting for it). My report includes a few summary results for some CDC and Cray computers.

http://www.roylongbottom.org.uk/livermo ... esults.htm

RoyLongbottom
Posts: 281
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Mon Aug 15, 2016 4:10 pm

Raspberry Pi 3 Memory Benchmarks - up to 3.66 times faster

These benchmarks measure performance of processing data from caches and RAM. Performance improvements over the Raspberry Pi 2, using RAM, can be expected, as the clock speed is double. The benchmarks covered here use ten or eleven data sizes between 8 KB and 65 MB, with results in MB/second. A summary, some detail and comparisons (with the 900 MHz RPi 2) are provided below, with full details in:

http://www.roylongbottom.org.uk/Raspber ... hmarks.htm

MemSpeed - measures speeds carrying out floating point and integer calculations, with one and two operations per word. As shown below, the best RPi 3 improvement is from RAM, at 3.25 times. Relative speeds using cached data can be similar to the CPU MHz difference when calculating with double precision numbers, with single precision and integer tests better, particularly using L2 cache.

NEON MemSpeed - This is MemSpeed with the compiler instructed to use NEON instructions. These are currently not applicable for double precision working, as reflected in similar speeds to MemSpeed. The main advantage is on single precision floating point calculations, particularly using RPi 2.

BusSpeed - this uses data streaming, ANDing integers. It has variable address incrementing to show where burst reading occurs and possibly help to identify maximum speeds. See the above HTM file for details. Here, RPi 3 improvements are shown for reading all data. Although RAM MB/second measurements are the fastest amongst these tests, RPi 3 performance is not much better than the RPi 2 CPU clock difference of 1.33. Gains using L1 and L2 caches were 2.55 and 2.12 times.
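The incrementing approach, as a minimal sketch (illustrative only, not the benchmark source):

Code: Select all

 /* AND together every inc-th word; comparing speeds with inc from
    32 down to 1 shows where burst reads of full cache lines occur. */
 static unsigned read_strided(const unsigned *data, int n, int inc)
 {
     unsigned x = 0xffffffffu;
     for (int i = 0; i < n; i += inc)
         x &= data[i];
     return x;
 }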

NeonSpeed - This carries out the same single precision and integer calculations as MemSpeed, where Norm is compiled to use NEON instructions and Neon is from intrinsic functions (that the compiler might translate into faster code). Raspberry Pi 2 results show little difference between the two methods, but Pi 3 shows faster speeds via intrinsics, leading to faster relative performance. All RAM tests are at least 3.3 times faster on the RPi 3.

Fast Fourier Transforms - Original and optimised single and double precision FFT calculations, for real applications, with sizes from 1K to 1024K. See the HTM file for details. Results are in milliseconds (the lower the better). Performance depends on random or skipped sequential access, which might be why RPi 3 performance gains are similar to the CPU clock speed ratio.

Code: Select all

                   Memory Speed Tests Calculating and Copying

        x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
 Cache   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
 RAM     MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

                                 memspeedPiA7
 RPi 2
 L1      1197   1041   1955   1955   1320   2667   1926   2570   2622
 L2      1096   1005   1549   1556   1115   1859   1245   1226   1225
 RAM      343    333    384    379    349    404    952    693    693

 RPi 3
 L1      1606   1790   3383   2344   2203   3575   2703   3127   3147
 L2      1560   1708   3223   2233   2069   3462   2614   2985   2958
 RAM      893   1043   1250   1146   1089   1238   1038    925    927

 RPi3/2
 L1      1.34   1.72   1.73   1.20   1.67   1.34   1.40   1.22   1.20
 L2      1.42   1.70   2.08   1.44   1.86   1.86   2.10   2.43   2.41
 RAM     2.60   3.13   3.25   3.02   3.12   3.07   1.09   1.33   1.34


                                 memSpdPiNEON
 RPi 2
 L1      1229   1776   2029   2028   2367   2832   2024   2832   2827
 L2      1056   1321   1458   1460   1621   1726   1448   1091   1092
 RAM      329    352    357    355    369    378    771    530    531

 RPi 3
 L1      1608   2346   3387   2348   3112   3717   2691   3144   3140
 L2      1547   2198   3144   2198   2889   3388   2618   3009   3009
 RAM      931   1155   1233   1142   1167   1241   1028    949    954

 RPi3/2
 L1      1.31   1.32   1.67   1.16   1.31   1.31   1.33   1.11   1.11
 L2      1.46   1.66   2.16   1.51   1.78   1.96   1.81   2.76   2.75
 RAM     2.83   3.28   3.45   3.21   3.16   3.29   1.33   1.79   1.80


  Bus/Cache/RAM Reading Speed Test - busspeedPiA7

    Reading Speed 4 Byte Words in MBytes/Second
  Cache Inc32  Inc16   Inc8   Inc4   Inc2   Read          Read
  RAM   Words  Words  Words  Words  Words    All           All
                                                        X RPi 2
 RPi 2
 L1      1095   1414   1535   1721   1684   1710
 L2       377    405    697   1203   1573   1630
 RAM       72     79    159    317    643   1264

 RPi 3
 L1      2650   2985   3431   4321   4348   4362          2.55
 L2       556    559   1015   1781   2747   3462          2.12
 RAM      119    128    246    492    974   1789          1.42


 Vector Reading Speed - NeonSpeed -  MBytes/Second

        Float v=v+s*v  Int v=v+v+s   Neon v=v+v           NEON/Normal
         Norm 1 Neon   Norm 2 Neon  Float    Int            1      2
 Rpi 2
 L1      1906   1965   2041   2273   2326   2771          1.03   1.11
 L2      1449   1470   1543   1611   1635   1826          1.01   1.04
 RAM      358    350    365    314    345    354          0.98   0.86

 Rpi 3
 L1      2659   3854   3364   4052   4283   4535          1.45   1.20
 L2      2495   3457   3159   3591   3724   3909          1.39   1.14
 RAM     1198   1249   1240   1148   1241   1236          1.04   0.93

 RPi3/2
 L1      1.40   1.96   1.65   1.78   1.84   1.64
 L2      1.72   2.35   2.05   2.23   2.28   2.14
 RAM     3.34   3.56   3.39   3.66   3.59   3.50

ejolson
Posts: 3064
Joined: Tue Mar 18, 2014 11:47 am

Re: Raspberry Pi Benchmarks

Wed Aug 17, 2016 5:55 am

RoyLongbottom wrote:Raspberry Pi 3 Benchmarks

Linpack - This has floating point calculations, as in the original, using 100x100 matrices of double precision (DP) numbers, normally L2 cache sized data. Performance depends almost entirely on a function calculating dy = dy + da*dx, suitable for vector type linked add and multiply. A version using NEON intrinsic functions is provided. As this uses single precision (SP), a standard compilation of this is also provided. Speed is measured in Millions of Floating Point Operations Per Second (MFLOPS).

Performance improvements over RPi 2 were 17% DP, 24% SP and 62% NEON, with RPi 3 measurements of 180, 194 and 486 MFLOPS. The DP result can be vaguely compared to a Pentium III E at 185 MFLOPS.

http://www.roylongbottom.org.uk/linpack%20results.htm
http://netlib.org/benchmark/performance.pdf
As discussed in a different thread, versions of linpack compiled with an ARM optimized subroutine library score 1.4 gflops on the Raspberry Pi 2B and 6.4 gflops on the Pi 3 for problem sizes around 8000 by 8000. This results in a 4.5 times speedup when switching between models. Note that 100 by 100 is an extremely small problem size for current computers. Also note without proper cooling that the speedup is only about 2.2 times.

RoyLongbottom
Posts: 281
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Wed Aug 17, 2016 11:58 am

As discussed in a different thread, versions of linpack compiled with an ARM optimized subroutine library score 1.4 gflops on the Raspberry Pi 2B and 6.4 gflops on the Pi 3 for problem sizes around 8000 by 8000. This results in a 4.5 times speedup when switching between models. Note that 100 by 100 is an extremely small problem size for current computers. Also note without proper cooling that the speedup is only about 2.2 times.
That thread is not very convincing that the RPi 3 achieves 6.4 Linpack GFLOPS, 4.5 times faster than the RPi 2, especially as the title is “Pi3 incorrect results under load”. Those residual checks are there for a purpose and must be correct (consistently near expectations) to be able to trust the reported performance. Then, can the clock measurement be trusted over a long time, with overheating occurring? I would like to see a series of complete results with decreasing values of N, to the point where consistent speeds are produced, and some with affinity set to use one CPU core (see my example below).

On the other hand, I am just about to include results for my MP benchmarks, demonstrating more than 6 GFLOPS on the RPi 3. This is using single precision NEON instructions (were those results via SP NEON?). RPi 2 was up to 2.7 GFLOPS, the former being 2.2 times faster.

The source code for that Linpack is not the same as my Linpack 1, which is completely unsuitable for multi-threading, so results cannot be compared. Linpack 1 also depends on large cache size. Results for my NEON Linpack MP benchmark are below, for unthreaded and with 1, 2 and 4 threads. The source code has some slight changes, with threading for selected parts (somebody might be able to do better), but it checks that results such as residuals are the same with and without threading.

Performance via multi-threading is much the same with 1, 2 or 4 threads, as it uses shared data (but different segments) and effectively one core at a time. At N=100, thread start/stop overheads are more significant, producing the worst performance, so N=100 is fastest without threading.

From that reference topic
3. What level of under clocking is safe for running optimized NEON code on a Pi 3B without a heat sink?
Most important issues are running time and number of cores used. See details of my stress tests:

viewtopic.php?f=68&t=145374

Mine has a simple stick on heatsink that makes little difference.

Code: Select all

 Linpack Single Precision MultiThreaded Benchmark
 Using NEON Intrinsics, Mon Aug 15 19:44:30 2016

   MFLOPS 0 to 4 Threads, N 100, 500, 1000

 Threads      None        1        2        4

 N  100     538.46   116.24   113.61   113.47 
 N  500     467.73   335.53   338.61   338.97 
 N 1000     363.87   336.10   336.72   336.22 

 NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

 N              100             500            1000

 NR            2.17            5.42            9.50
 RE  5.16722466e-05  6.46698638e-04  2.26586126e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
 XN -5.06639481e-06 -4.70876694e-06  1.41978264e-04

Thread
 0 - 4 Same Results    Same Results    Same Results

ejolson
Posts: 3064
Joined: Tue Mar 18, 2014 11:47 am

Re: Raspberry Pi Benchmarks

Wed Aug 17, 2016 6:49 pm

RoyLongbottom wrote:On the other hand, I am just about to include results for my MP benchmarks, demonstrating more than 6 GFLOPS on the RPi 3. This is using single precision NEON instructions (were those results via SP NEON?). RPi 2 was up to 2.7 GFLOPS, the former being 2.2 times faster.
The ARM optimized linpack runs include the diagnostic report

Code: Select all

N      :    8000 
NB     :     256 
PMAP   : Row-major process mapping
P      :       1 
Q      :       1 
PFACT  :    Left 
NBMIN  :       2 
NDIV   :       2 
RFACT  :   Right 
BCAST  :   2ring 
DEPTH  :       0 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words
which indicates double-precision arithmetic was used.

While there are multiple reports of people running the benchmark on broken hardware in that thread, there are also many reports of people who resolved their hardware issues with proper cooling and overvolt of the CPU. I think some changes were eventually made to the driver code that switches the CPU in and out of turbo mode to improve stability. Thus, 6.4 double-precision gflops appears to be the correct speed of linpack for 8000 by 8000 sized problems on a properly cooled Pi 3.

For your single precision MP results, getting only a 2.2 times speed-up hints at an overheated Pi 3 running fully throttled at 600 MHz rather than the usual 1200 MHz. It would be interesting to see how the numbers change with a better heatsink and fan.

RoyLongbottom
Posts: 281
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Wed Aug 17, 2016 10:12 pm

For your single precision MP results, getting only a 2.2 times speed-up hints at an overheated Pi 3 running fully throttled at 600 MHz rather than the usual 1200 MHz. It would be interesting to see how the numbers change with a better heatsink and fan.
The benchmark execution time is only 5 seconds, or around 1.25 seconds using four cores, and it increases CPU temperature by less than 2°C. I compiled it to run up to 10 times longer and results are below. You will see that they meet my minimum requirements for believing benchmark results, which is evidence that the maximum speeds might be correct. In this case, there is up to four times the speed with four threads, and the threaded calculations all produce the same numeric answers.

Recorded CPU MHz and temperatures are also shown, at up to 70.9°C with no performance degradation. So I compiled another version aiming for 500 seconds. As shown below, this eventually led to throttling and degraded performance. As I said before, the most important issues are running time and number of cores used.

Code: Select all

 ################### Test 47 seconds ##################

 MP-MFLOPS NEON Intrinsics v2.0 Wed Aug 17 00:59:30 2016

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      595     586     421    1637    1642    1596
 2T     1179    1164     426    3270    3266    3153
 4T     2024    2005     429    6244    6455    5892
 8T     1938    2129     430    6235    6379    5820
 Results x 100000, 12345 indicates ERRORS
 1T    40392   76406   99700   35218   66014   99520
 2T    40392   76406   99700   35218   66014   99520
 4T    40392   76406   99700   35218   66014   99520
 8T    40392   76406   99700   35218   66014   99520  

         End of test Wed Aug 17 01:00:17 2016

 #####################################################

 Temperature and CPU MHz Measurement

 Start at Wed Aug 17 00:59:30 2016

 Using 50 samples at 1 second intervals

  Seconds
    0.0     1200 scaling MHz,   1199 ARM MHz, temp=54.8'C
    1.0     1200 scaling MHz,   1200 ARM MHz, temp=56.4'C
    2.0     1200 scaling MHz,   1200 ARM MHz, temp=56.9'C
    3.1     1200 scaling MHz,   1200 ARM MHz, temp=57.5'C
    4.1     1200 scaling MHz,   1200 ARM MHz, temp=56.9'C
    5.2     1200 scaling MHz,   1200 ARM MHz, temp=57.5'C
    6.2     1200 scaling MHz,   1200 ARM MHz, temp=56.9'C
    7.2     1200 scaling MHz,   1200 ARM MHz, temp=56.9'C
    8.3     1200 scaling MHz,   1199 ARM MHz, temp=57.5'C
    9.3     1200 scaling MHz,   1200 ARM MHz, temp=58.0'C
   10.3     1200 scaling MHz,   1199 ARM MHz, temp=58.0'C
   11.4     1200 scaling MHz,   1199 ARM MHz, temp=58.5'C
   12.4     1200 scaling MHz,   1200 ARM MHz, temp=58.5'C
   13.4     1200 scaling MHz,   1199 ARM MHz, temp=58.5'C
   14.5     1200 scaling MHz,   1200 ARM MHz, temp=58.5'C
   15.5     1200 scaling MHz,   1200 ARM MHz, temp=59.1'C
   16.6     1200 scaling MHz,   1200 ARM MHz, temp=59.1'C
   17.6     1200 scaling MHz,   1200 ARM MHz, temp=59.1'C
   18.6     1200 scaling MHz,   1200 ARM MHz, temp=59.1'C
   19.7     1200 scaling MHz,   1200 ARM MHz, temp=59.1'C
   20.7     1200 scaling MHz,   1200 ARM MHz, temp=59.1'C
   21.7     1200 scaling MHz,   1200 ARM MHz, temp=59.6'C
   22.8     1200 scaling MHz,   1200 ARM MHz, temp=60.1'C
   23.9     1200 scaling MHz,   1200 ARM MHz, temp=62.3'C
   25.1     1200 scaling MHz,   1199 ARM MHz, temp=61.2'C
   26.2     1200 scaling MHz,   1200 ARM MHz, temp=61.2'C
   27.2     1200 scaling MHz,   1200 ARM MHz, temp=61.8'C
   28.2     1200 scaling MHz,   1200 ARM MHz, temp=61.2'C
   29.3     1200 scaling MHz,   1200 ARM MHz, temp=62.3'C
   30.3     1200 scaling MHz,   1200 ARM MHz, temp=62.3'C
   31.3     1200 scaling MHz,   1200 ARM MHz, temp=62.8'C
   32.4     1200 scaling MHz,   1199 ARM MHz, temp=62.3'C
   33.4     1200 scaling MHz,   1199 ARM MHz, temp=63.4'C
   34.5     1200 scaling MHz,   1200 ARM MHz, temp=65.5'C
   36.0     1200 scaling MHz,   1200 ARM MHz, temp=64.5'C
   37.0     1200 scaling MHz,   1200 ARM MHz, temp=65.5'C
   38.0     1200 scaling MHz,   1199 ARM MHz, temp=65.5'C
   39.1     1200 scaling MHz,   1200 ARM MHz, temp=67.7'C
   40.2     1200 scaling MHz,   1200 ARM MHz, temp=68.8'C
   41.2     1200 scaling MHz,   1200 ARM MHz, temp=67.7'C
   42.7     1200 scaling MHz,   1200 ARM MHz, temp=67.7'C
   43.8     1200 scaling MHz,   1200 ARM MHz, temp=68.2'C
   44.8     1200 scaling MHz,   1200 ARM MHz, temp=68.8'C
   45.9     1200 scaling MHz,   1200 ARM MHz, temp=69.8'C
   46.9     1200 scaling MHz,   1200 ARM MHz, temp=70.9'C
   Test Finished
   48.0     1200 scaling MHz,   1199 ARM MHz, temp=67.7'C
   49.0     1200 scaling MHz,   1200 ARM MHz, temp=66.6'C
   50.1     1200 scaling MHz,   1200 ARM MHz, temp=65.5'C
   51.1      600 scaling MHz,    600 ARM MHz, temp=64.5'C
   52.1      600 scaling MHz,    600 ARM MHz, temp=63.4'C
   53.2      600 scaling MHz,    600 ARM MHz, temp=62.8'C

 End at   Wed Aug 17 01:00:23 2016

 ################# Test 502 seconds ####################

 MP-MFLOPS NEON Intrinsics v2.0 Wed Aug 17 21:42:12 2016

 1T      594     584     420    1637    1632    1603
 2T     1175    1171     421    3269    3264    3152
 4T     2234    2227     421    5894    5224    4192
 8T     1801    1921     416    4719    4519    3899
 Results x 100000, 12345 indicates ERRORS
 1T    40015   40392   97075   35216   35218   95363
 2T    40015   40392   97075   35216   35218   95363
 4T    40015   40392   97075   35216   35218   95363
 8T    40015   40392   97075   35216   35218   95363

         End of test Wed Aug 17 21:50:34 2016

 #####################################################

 Temperature and CPU MHz Measurement

 Start at Wed Aug 17 21:42:12 2016

 Seconds
    0.0     1200 scaling MHz,   1200 ARM MHz, temp=56.4'C
    1.0     1200 scaling MHz,   1199 ARM MHz, temp=56.9'C
    2.0     1200 scaling MHz,   1200 ARM MHz, temp=58.0'C
            1200 to
  356.2     1200 scaling MHz,   1200 ARM MHz, temp=79.5'C
  357.3     1200 scaling MHz,   1200 ARM MHz, temp=80.1'C
  358.3     1200 scaling MHz,   1167 ARM MHz, temp=80.6'C
  359.4     1200 scaling MHz,   1154 ARM MHz, temp=80.6'C
  360.4     1200 scaling MHz,   1136 ARM MHz, temp=81.1'C
  Down To
  380.5     1200 scaling MHz,    983 ARM MHz, temp=81.7'C
  381.5     1200 scaling MHz,    992 ARM MHz, temp=82.7'C
  Down To
  400.7     1200 scaling MHz,    857 ARM MHz, temp=83.3'C
  401.8     1200 scaling MHz,    832 ARM MHz, temp=82.7'C
  Test Finished
  503.8     1200 scaling MHz,   1012 ARM MHz, temp=81.7'C
  504.8     1200 scaling MHz,   1116 ARM MHz, temp=80.1'C
  505.9     1200 scaling MHz,   1184 ARM MHz, temp=80.1'C
  506.9      600 scaling MHz,    600 ARM MHz, temp=79.0'C
  508.0      600 scaling MHz,    600 ARM MHz, temp=78.4'C

ejolson
Posts: 3064
Joined: Tue Mar 18, 2014 11:47 am

Re: Raspberry Pi Benchmarks

Thu Aug 18, 2016 7:52 am

RoyLongbottom wrote:Recorded CPU MHz and temperatures are also shown, at up to 70.9°C with no performance degradation.
It does look like you can run for quite a while before the system starts throttling. I'm looking forward to the final MP results.

RoyLongbottom
Posts: 281
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Sun Aug 21, 2016 11:11 am

Raspberry Pi High Performance Linpack Benchmark

I downloaded a precompiled version of High Performance Linpack for Raspberry Pi from the link below and installed it on my Raspberry Pi 2 and 3 systems:

https://www.howtoforge.com/tutorial/hpl ... pberry-pi/

I ran it on the RPi 2 at N = 1000, 2000, 4000 and 8000, using 1, 2 and 4 threads (or cores, via the taskset command), and all ran successfully. The 1 and 2 thread tests ran on the RPi 3 without any problems, but not when attempting to use four cores: occasional correct operation occurred (or appeared to); otherwise errors were reported or the system crashed, using the larger data sizes.

Below are the MFLOPS results, behaving as expected in doubling single thread performance when using two cores. Then, the improvement of four cores over one was up to 13.6 times on the RPi 2 and 34.9 times on the RPi 3, with the highest gains at N=1000. Based on previous MP performance, no better than 3.9 would be expected.

Performance gains of RPi 3 over RPi 2 were up to 3.8 times, or 3.15 times ignoring 4 thread speeds. With only 4 cores, the performance improvement at larger data sizes is (to me) rather surprising.

Does anyone have explanations for the strange performance and what else I could try? My only suspect area is something to do with the shared L2 cache.

Code: Select all

                        MFLOPS       RPi 3/  Gains vs 1 Thread
       N   Threads  Rpi 3    Rpi 2    RPi 2    Rpi 3    Rpi 2

    1000       1       80       74     1.08
               2      159      159     1.00      2.0      2.1
               4     2794     1009     2.77     34.9     13.6
    2000       1      224      162     1.38
               2      505      355     1.42      2.3      2.2
               4     4029     1229     3.28     18.0      7.6
    4000       1      612      284     2.15
               2     1317      584     2.26      2.2      2.1
               4     5425     1429     3.80      8.9      5.0
    8000       1     1119      356     3.14
               2     2268      720     3.15      2.0      2.0
               4     N/A      1514                        4.3
For one test that crashed, I ran it with N=4000 from a remote PC via the PuTTY program. I opened three terminals: one to run the HPL benchmark, one to run vmstat to show system utilisation and one to run my CPU MHz and temperature monitoring program. The results are below, up to where the RPi CPU stopped running after testing for 6 seconds. There was little increase in temperature and no clock throttling, with no strange vmstat performance recorded and 4 cores only fully active for the last two seconds.

Code: Select all

 Start at Fri Aug 19 21:24:07 2016

 Using 40 samples at 1 second intervals

 Boot Settings

 dtparam=audio=on

 Seconds
    0.0      600 scaling MHz,    600 ARM MHz, temp=52.6'C
    1.0      600 scaling MHz,   1200 ARM MHz, temp=52.6'C
    2.0     1200 scaling MHz,   1200 ARM MHz, temp=52.6'C
    3.1     1200 scaling MHz,   1200 ARM MHz, temp=53.7'C
    4.1      600 scaling MHz,    600 ARM MHz, temp=52.6'C
 Start
    5.2     1200 scaling MHz,   1200 ARM MHz, temp=53.7'C
    6.2     1200 scaling MHz,   1200 ARM MHz, temp=55.8'C
    7.2     1200 scaling MHz,   1200 ARM MHz, temp=55.8'C
    8.3     1200 scaling MHz,   1200 ARM MHz, temp=56.9'C
    9.3     1200 scaling MHz,   1200 ARM MHz, temp=61.2'C
   10.4     1200 scaling MHz,   1200 ARM MHz, temp=62.8'C


#############################################################################

pi@raspberrypi:~ $ vmstat 1 40
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy  id wa st
 0  0      0 524292  36648 236528    0    0    45     3  175   69  2  1  97  0  0
 0  0      0 524276  36648 236528    0    0     0     0  612  152  0  0 100  0  0
 0  0      0 524292  36648 236528    0    0     0    72  631  195  0  0 100  0  0
 0  0      0 524324  36648 236528    0    0     0     0  572  104  0  0 100  0  0
 0  0      0 524004  36784 236568    0    0   158    16  964  697  1  2  97  0  0
 0  0      0 523616  36784 236568    0    0     0    12  822  493  2  0  98  0  0
 0  0      0 523376  36784 236568    0    0     0    12  776  417  1  1  98  0  0
 0  0      0 523376  36784 236568    0    0     0    12  794  447  2  1  98  0  0
 0  0      0 523128  36784 236568    0    0     0    12  805  471  2  2  97  0  0
 Start
 1  0      0 499504  36784 236568    0    0     0   128 1035  669 18 15  67  0  0
 1  0      0 458228  36784 236568    0    0     0    12  874  434 25  2  74  0  0
 1  0      0 417184  36784 236568    0    0     0    12  818  370 25  2  73  0  0
 4  0      0 394444  36784 236564    0    0     0    12  928  382 45 18  37  0  0
 4  0      0 393700  36784 236564    0    0     0    12 1080  411 95  6   0  0  0
 4  0      0 393396  36784 236568    0    0     0    56 1126  514 92  8   0  0  0

 Used        129732 KB
 or about 4 x 4 x 8 MB for N = 4000 
I have a 2008 version of the benchmark from Intel. I ran it on a Windows 10 based tablet, with a 1.44 to 1.84 GHz quad core Atom x5-Z8300 processor. Following are results, with the sort of performance differences I would expect but, of course, I can’t say that the Raspberry Pi should follow that behaviour pattern.

Code: Select all

      N   Threads  MFLOPS   Seconds      Residual     Resid(norm)

    1000       1     1350     0.50    1.161404E-12    3.960687E-02
               2     2466     0.27    1.161404E-12    3.960687E-02
               4     4120     0.16    1.161404E-12    3.960687E-02

    2000       1     1540     3.47    4.756195E-12    4.137307E-02
               2     2800     1.91    4.756195E-12    4.137307E-02
               4     4601     1.16    4.756195E-12    4.137307E-02

    4000       1     1633    26.15    1.702119E-11    3.709929E-02
               2     3008    14.19    1.702119E-11    3.709929E-02
               4     5345     7.99    1.702119E-11    3.709929E-02

    8000       1     1641   208.04    5.967551E-11    3.282671E-02
               2     3088   110.58    5.967551E-11    3.282671E-02
               4     4982    68.55    5.967551E-11    3.282671E-02


RoyLongbottom
Posts: 281
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Mon Aug 29, 2016 10:05 am

Raspberry Pi 3 Multithreading Benchmarks

The first ones are attempts to obtain better performance running the Classic Benchmarks. For detailed descriptions and results of all multithreading benchmarks see:

http://www.roylongbottom.org.uk/Raspber ... hmarks.htm

MP Whetstone Benchmarks

As with other multithreading benchmarks, this one runs using 1, 2, 4 and 8 threads, executing multiple copies of the same program code, each thread having dedicated variables. These should all be stored in L1 cache. With no conflicts, as shown below, doubling the number of threads leads to a near doubling of measured performance.

Raspberry Pi 3 overall MWIPS ratings are 1.37 times RPi 2 speeds, with ratios for other tests in the range 1.19 to 1.79, except the last copy test average of 2.73.

Code: Select all

  MP-Whetstone Benchmark Linux/ARM V7A v1.0 Mon Aug 15 19:34:21 2016

                    Using 1, 2, 4 and 8 Threads

      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp   Fixpt     If  Equal
                 1      2      3  MOPS  MOPS    MOPS   MOPS   MOPS

 1T   723.1  517.2  517.0  254.9  12.1   8.8  5853.9 1181.8 1189.8
 2T  1464.7  960.5 1025.1  511.3  24.1  18.5 11899.0 2381.2 2385.7
 4T  2902.3 1696.4 1867.3 1013.4  47.8  36.8 19754.6 4541.3 4687.1
 8T  3004.0 2747.8 2569.0 1066.4  48.6  38.0 25502.9 6075.2 5610.8

   Overall Seconds   4.77 1T,   4.74 2T,   4.88 4T,   9.76 8T

        Comparison With Raspberry Pi 2 - CPU MHz ratio 1.33

 1T    1.37   1.43   1.42   1.38  1.21  1.57    1.77   1.33   2.67
 2T    1.39   1.33   1.41   1.39  1.21  1.65    1.79   1.34   2.68
 4T    1.37   1.23   1.28   1.37  1.19  1.64    1.49   1.27   2.62
 8T    1.37   1.44   1.39   1.32  1.19  1.65    1.45   1.26   2.96
MP Dhrystone Benchmark

This uses shared program code and dedicated memory for arrays, but some read/write variables are shared. This can result in multithreaded performance providing little improvement, or even being worse than a single thread.

Raspberry Pi 3 performance using a single thread, at 1.43 times, is not much faster than model 2 relative to the CPU MHz ratio of 1.33. Then, it appears to perform much better using threads, at up to 3.49 times faster.

Code: Select all

MP-Dhrystone Benchmark Linux/ARM V7A v1.0 Mon Aug 15 19:47:57 2016

                    Using 1, 2, 4 and 8 Threads

 Threads                        1        2        4        8
 Seconds                     0.95     1.12     1.59     3.04
 Dhrystones per Second    4229473  7124952 10091677 10523432
 VAX MIPS rating             2407     4055     5744     5989

         Internal pass count correct all threads

         End of test Mon Aug 15 19:48:04 2016

       Comparison With Raspberry Pi 2 - CPU MHz ratio 1.33

 VAX MIPS rating             1.43     1.51     3.49     2.42
MP Linpack Benchmark

The original Linpack benchmark operates on double precision floating point 100x100 matrices (N = 100). This version uses mainly the same C programming code as that for the single precision floating point. It is run on 100x100, 500x500 and 1000x1000 matrices using 0, 1, 2 and 4 separate threads. Multiple threads each use different segments of shared data arrays.

The code differences were slight changes to allow a higher level of parallelism. The initial 100x100 Linpack benchmark is only of use for measuring performance of single processor systems. The one for shared memory multiple processor systems is a 1000x1000 variety. The programming code for the latter is the same as 100x100, except users are allowed to employ their own linear equation solver.

Performance of this MP benchmark is limited by the overhead of creating and closing threads too frequently, resulting in slower speeds using multiple threads. At 100x100, data size is 40 KB, L2 cache based. With larger matrices, performance becomes more dependent on RAM, but multi-threading overheads have less influence.

Raspberry Pi 3 - At N=100, average speed was 1.73 times that from an RPi 2, with 1.52 to 1.59 times using the larger matrices. These can be compared with a CPU MHz ratio of 1.33.

Code: Select all

Linpack Single Precision MultiThreaded Benchmark
 Using NEON Intrinsics, Mon Aug 15 19:44:30 2016

   MFLOPS 0 to 4 Threads, N 100, 500, 1000

 Threads      None        1        2        4

 N  100     538.46   116.24   113.61   113.47 
 N  500     467.73   335.53   338.61   338.97 
 N 1000     363.87   336.10   336.72   336.22 

 NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

 N              100             500            1000

 NR            2.17            5.42            9.50
 RE  5.16722466e-05  6.46698638e-04  2.26586126e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
 XN -5.06639481e-06 -4.70876694e-06  1.41978264e-04

Thread
 0 - 4 Same Results    Same Results    Same Results
 
 Comparison With Raspberry Pi 2 - CPU MHz ratio 1.33

 Threads      None        1        2        4

 N  100       1.67     1.75     1.75     1.76
 N  500       1.69     1.55     1.57     1.57
 N 1000       1.55     1.52     1.51     1.50

RoyLongbottom
Posts: 281
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Thu Sep 08, 2016 2:27 pm

Memory MP Benchmarks

Next, we have benchmarks that use caches and RAM, with data sizes 12.3 KB for L1 cache, 122.9 KB for L2 cache and 12288 KB for RAM. Details and results in:

http://www.roylongbottom.org.uk/Raspber ... hmarks.htm

MP-BusSpeed Benchmark

This is read only, using AND instructions, with varying address increments of 32 words down to 1 word, to identify where burst reading occurs. All threads read the same data, with each thread starting from a different address, avoiding artificially high RAM performance due to data being in the shared L2 cache.

Raspberry Pi 3 results are shown below. Just considering 1 word address increments (RdAll), with comparisons against the RPi 2 in the last column, the best RAM speed improvements were the same as the memory bus speed difference. Cache speed improvements were around 1.9 times, compared with a CPU MHz ratio of 1.33. MP gains of 4/1 threads averaged 3.65 from caches, but RAM improvements were disappointing.

Code: Select all

    MP-BusSpd ARM V7A v2 Tue Aug 30 13:45:43 2016

   MB/Second Reading Data, 1, 2, 4 and 8 Threads
   Staggered starting addresses to avoid caching
                                                   RPi3/RPi2
   KB     Inc32  Inc16   Inc8   Inc4   Inc2  RdAll   RdAll

  12.3 1T  1565   3749   3718   4078   4385   4160   1.87
       2T  5041   6829   7066   7813   8584   7839   1.89  
       4T  5480  11958  13330  15256  16863  15614   1.92
       8T  6006   8477   8873   7777   8918   8315
 122.9 1T   566    566   1062   1822   2831   3907   1.91
       2T   899    906   1742   2395   5433   7638   1.88
       4T   907    935   1876   3757   7241  13871   1.76
       8T   863    919   1789   3491   6411   9403
 12288 1T   130    136    263    513   1047   2080   1.81
       2T   185    138    276    554   1108   2149   1.71
       4T   131    137    269    536   1169   2383   2.01
       8T   125    133    224    513   1038   2142

         End of test Tue Aug 30 13:45:55 2016
MP RandMem Benchmark

The benchmark has serial and random address selections, using the same program indexing structure, with read and read/write tests involving 32 bit integers. It uses data from the same array for all threads, but starting at different points. The use of shared data, with write back, leads to no increase in throughput using multiple threads. Also, random access speed can be considerably influenced by reading and writing in bursts.

Results below show average Raspberry Pi 3 vs Pi 2 performance ratios. Some are not much better than the CPU MHz increase of 1.33 times, somewhat better on L2 cache serial activities and surprisingly high with serial reading from RAM. Average MP 4/1 thread gains were around 3.8 times on reading L1 cache data and L2 serial read tests, but lower via random reading from L2. Multiple thread random reading from RAM was particularly good, probably due to some data being in the shared L2 cache. Unexpectedly, serial reading from RAM was better than in MP BusSpeed above, with a 4/1 thread improvement of 1.58 times.

Code: Select all

 MP-RandMem Linux/ARM V7A v1.0 Tue Aug 30 14:13:08 2016

  MB/Second Using 1, 2, 4 and 8 Threads

  KB       SerRD SerRDWR   RndRD RndRDWR

 12.3 1T    2930    3791    2918    3791
      2T    5571    3766    5194    3776
      4T   11196    3722   11205    3722
      8T   10063    3685   10051    3702
122.9 1T    2675    3398     681     893
      2T    5124    3387    1256     886
      4T   10041    3387    1916     891
      8T    9593    3367    1952     890
12288 1T    2120     979      54      71
      2T    3255     980     107      71
      4T    3346     979     138      70
      8T    2226     979     143      71

    End of test Tue Aug 30 14:13:54 2016

 RPi3/RPi2 Average

 L1 cache   1.53    1.36    1.47    1.35
 L2 cache   1.86    2.29    1.24    1.31
 RAM        4.46    1.04    1.17    1.25
OpenMP MemSpeed Benchmark

This is the same program as the

http://www.roylongbottom.org.uk/Raspber ... m#anchor10

but uses a simple directive for the compiler to parallelise the code. A new version was produced, with reduced overheads, and this was also compiled not to use OpenMP functions. Average full OMP results are below for Raspberry Pi 2 and 3, then for RPi 3 with OMP using 1 thread and without OpenMP. All MP benchmarks, including source code, are in the following, for anyone to play with (and change if you want).

http://www.roylongbottom.org.uk/Raspber ... hmarks.zip

The benchmark measures speed of the functions shown, using 4 KB to 132 MB, the summaries being average MB/second for data in L1 cache, L2 cache and RAM.
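As an indication of the method (a minimal sketch, not the benchmark source), a single directive is enough to split one of the measured loops across the cores, when compiled with the gcc -fopenmp flag:

Code: Select all

 /* x[m] = x[m] + s*y[m], one of the measured functions; the pragma
    divides the iterations between the available cores. */
 void triad(int n, float s, float *restrict x, const float *restrict y)
 {
     #pragma omp parallel for
     for (int m = 0; m < n; m++)
         x[m] = x[m] + s * y[m];
 }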

Code: Select all

      x[m]=x[m]+s*y[m] Int+  x[m]=x[m]+y[m]         x[m]=y[m]
 Cache  Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
 RAM    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

 Rpi2 4 Threads
 L1     3131   2508    259   3273   2906    281   1188   1705    441
 L2     2805   1781    232   4278   2846    236   1114   1297    290
 RAM     747    646    283    930    964    272   1037   1071    288

 RPi3 4 Threads
 L1     5475   3134   1312  10182   5104   1435  15879   8025   1227
 L2     4772   2916   1247   7935   4607   1329   8277   6223   1294
 RAM    2788   1903   1324   4013   2794   1099   1065   1063   1073

 RPi3 1 Thread
 L1     1539    789    996   2582   1303   1022   4177   2357    653
 L2     1380    745    922   2145   1186    945   3356   2061    633
 RAM     995    653    798   1226    924    813   1189   1184    614

 RPi3 Not Threaded
 L1     1582   2511   3733   2360   3405   3733   2724   2722   2722
 L2     1416   2071   2875   1978   2707   2888   2439   2337   2346
 RAM    1032   1242   1295   1216   1291   1288   1030   1021   1021
There are significant variations in relative performance that can be calculated, but a brief summary is as follows. They all depend on which particular instructions are used with and without threading, and on threading overheads.

Code: Select all

        MP MemSpeed Performance gains Summary 

                                                Average   Min    Max
 
 Raspberry Pi 3 v Raspberry Pi 2 OpenMP           3.77   0.99  13.37
 Raspberry Pi 3 OpenMP v Not Threaded             1.97   0.35   5.83
 Raspberry Pi 3 Not Threaded v OMP 1 Thread       1.92   0.65   4.17

RoyLongbottom
Posts: 281
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Mon Sep 19, 2016 1:56 pm

Maximum Floating Point Speed

The last series of multithreading benchmarks were intended to measure maximum floating point speed. They execute functions where the arithmetic operations are of the form x = (x + a) * b - (x + c) * d + (x + e) * f, with 2 or 32 operations per input single precision floating point data word. Array sizes used cover L1 cache, L2 cache and RAM as separate tests, all at 1, 2, 4 and 8 threads. Each thread uses the same calculations but accesses different segments of the data.

When compiled by GCC and run on Intel based PCs, near maximum performance could be demonstrated. Using parameters to generate SSE instructions, maximum single core MFLOPS could be CPU MHz x 4 (SSE 128 bit registers) x 2 (linked multiply and add). Then, AVX 1 would be twice as fast. So we have 32 and 64 times CPU MHz for a quad core processor. Running on a Quad Core i7 CPU, 23 out of 32 and 45.6 out of 64 times CPU MHz were demonstrated - 90 and 178 GFLOPS via 3.9 GHz CPU.
Raspberry Pi 3 Cortex A53

The Raspberry Pi 3 Cortex-A53 processor is also said to have a maximum speed of 32 x CPU GHz (as Intel SSE), or 38.4 GFLOPS, with double precision at a quarter of this speed.

The later benchmarks are MP-MFLOPSPiA7 and MP-MFLOPSDP, compiled for Cortex-A7, and MP-MFLOPSPiNeon, compiled from the same code for Cortex-A7 with NEON SIMD, where the latter, with fused multiply and add, would be expected to produce maximum speeds. Note that NEON, at this time, only deals with 32 bit single precision operations. Another variation is MP-NeonMFLOPS, this time produced using manually inserted intrinsic functions, with results virtually the same as the compiled “C” version. Finally, OpenMP-MFLOPS, an established PC version, was produced. This has run time parameters for the starting number of data words and repeat passes. As it happens, this has a useful extra set of tests with 8 operations per word. In order to provide a benchmark with no OpenMP or threading overheads, this was recompiled as notOpenMP-MFLOPS, to test a single core. Full details of all of these benchmarks and results are in:

http://www.roylongbottom.org.uk/Raspber ... hmarks.htm

Results are below for MP-MFLOPSPiNeon, plus some one thread speeds from the other programs. Except when limited by RAM speed, MP gains were quite respectable. Average speed improvements are shown to be 1.92 times RPi 2, with an RPi 3 NEON/normal ratio of 2.76. Then, MP-MFLOPS single and double precision speeds were the same, but remember that there are no NEON functions for DP.

Code: Select all

    Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 MP-MFLOPS Compiled NEON v1.0 Mon Aug 15 19:09:46 2016

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      419     782     437    1672    1660    1637
 2T     1324    1529     442    3331    3308    3212
 4T     1903    1574     439    5040    6073    5738
 8T     1613    2204     433    5543    5780    5445
 Results x 100000
 1T    76406   97075   99969   66008   95367   99951
 2T    76406   97075   99969   66008   95367   99951
 4T    76406   97075   99969   66008   95367   99951
 8T    76406   97075   99969   66008   95367   99951

         End of test Mon Aug 15 19:09:52 2016

 1 Thread
 RPi 2   357     451     337     690     688     657
 RPi 3 MP-MFLOPS
 1T SP   168     182     171     691     693     684
 1T DP   143     182     171     678     680     674
The main issue is the maximum speed of 1.67 GFLOPS from 1 core, 17% of that thought to be possible. Disassembled code of the main calculations is included in Raspberry Pi Multithreading Benchmarks.htm, indicating an insufficient number of instructions in a loop at 2 operations per word. With the higher instruction count, the compiler unrolls the loop to execute 128 calculations, four at a time using quad word registers, but there are 32 unnecessary instructions loading variables that limit maximum performance (not enough registers?).

Running the default non-threaded notOpenMP-MFLOPS indicated a slightly faster speed than above, of 1.7 GFLOPS at 8 operations per word. The source code was modified to manually unroll this loop to include 128 arithmetic operations. Experimenting with different data sizes produced a maximum of just over 3 GFLOPS. Here, up to 6.6 GFLOPS might be expected, via 4 way vectors, with 16 add or multiply instructions plus 8 using linked multiply and add or subtract. As shown below, there were also 4 vector loads, 4 vector stores, 4 scalar adds and 3 instructions for loop control.

Code: Select all

L27:
    vld1.32   {q11}, [r3]
    vld1.32   {q14}, [lr]
    vld1.32   {q12}, [r2]
    vadd.f32  q15, q11, q1
    vld1.32   {q13}, [ip]
    vadd.f32  q10, q14, q1
    vadd.f32  q8, q12, q1
    vmul.f32  q15, q15, q4
    vadd.f32  q7, q0, q14
    vadd.f32  q9, q13, q1
    vst1.f32  {d30-d31}, [sp:64]
    vmul.f32  q10, q10, q4
    vadd.f32  q15, q0, q12
    vmul.f32  q8, q8, q4
    vadd.f32  q6, q0, q13
    vfma.f32  q10, q7, q2
    vfma.f32  q8, q15, q2
    vadd.f32  q7, q0, q11
    vmul.f32  q9, q9, q4
    vld1.64   {d30-d31}, [sp:64]
    vfma.f32  q9, q6, q2
    vadd.f32  q14, q3, q14
    vfma.f32  q15, q7, q2
    vadd.f32  q13, q3, q13
    vadd.f32  q12, q3, q12
    vadd.f32  q11, q3, q11
    vfms.f32  q10, q14, q5
    vfms.f32  q9, q13, q5
    vfms.f32  q8, q12, q5
    vfms.f32  q15, q11, q5
    add       r4, r4, #1
    cmp       r4, r5
    vst1.32   {q10}, [lr]
    vst1.32   {q9}, [ip]
    add       lr, lr, #64
    add       ip, ip, #64
    vst1.32   {q8}, [r2]
    vst1.32   {q15}, [r3]
    add       r2, r2, #64
    add       r3, r3, #64
    bcc       .L27
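For comparison, the sort of NEON intrinsics that correspond to the code above, as a minimal sketch of one vectorised pass (not the benchmark source):

Code: Select all

 #include <arm_neon.h>

 /* x = (x+a)*b - (x+c)*d + (x+e)*f on n floats (n a multiple of 4),
    calculated four at a time in quad word registers. */
 static void calc_neon(float *x, int n, float a, float b, float c,
                       float d, float e, float f)
 {
     float32x4_t va = vdupq_n_f32(a), vb = vdupq_n_f32(b);
     float32x4_t vc = vdupq_n_f32(c), vd = vdupq_n_f32(d);
     float32x4_t ve = vdupq_n_f32(e), vf = vdupq_n_f32(f);
     for (int i = 0; i < n; i += 4) {
         float32x4_t vx = vld1q_f32(x + i);
         float32x4_t t  = vmulq_f32(vaddq_f32(vx, va), vb); /* (x+a)*b  */
         t = vmlsq_f32(t, vaddq_f32(vx, vc), vd);           /* -(x+c)*d */
         t = vmlaq_f32(t, vaddq_f32(vx, ve), vf);           /* +(x+e)*f */
         vst1q_f32(x + i, t);
     }
 }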

RoyLongbottom
Posts: 281
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK
Contact: Website

Re: Raspberry Pi Benchmarks

Mon Sep 26, 2016 3:38 pm

Raspberry Pi 3 Stress Tests

I have been running my stress tests on the Raspberry Pi 3. These can comprise four number crunching programs plus one that measures CPU MHz and temperatures, each running in its own terminal window, with typical running times of 15 minutes. Full details and program download links are in:

http://www.roylongbottom.org.uk/Raspber ... 0Tests.htm

The tests can also comprise a graphics program, instead of one of the number crunchers. Such an exercise was carried out on the RPi 3, in conjunction with a new OpenGL GLUT benchmark. Details can be found here:

viewtopic.php?p=958209#p958209

The tests demonstrate the apparently well known fact that the RPi 3 Cortex-A53 CPU can overheat and reduce CPU MHz (throttling) to avoid even higher temperatures. I have run the same compiled code on an Android tablet, with a Snapdragon Cortex-A53, and that continued running at full speed. The main reason for these differences (claimed by someone) is that the RPi 3 Broadcom version is manufactured using a 40 nm process, the tablet having a cooler Snapdragon implementation with 28 nm lithography.

It should be pointed out that it is unlikely that many people will want to execute such demanding code, using all cores, for extended periods.

The stress tests were originally run using a script file, but this did not work. Instead, the script was copied and pasted to a terminal prompt, for example:

Code: Select all

lxterminal --geometry=80x15 -e ./RPiHeatMHz passes 63, seconds 15
lxterminal --geometry=80x15 -e ./burninfpuPi2 Kwds 4 Sect 2 Mins 15 Log 11
lxterminal --geometry=80x15 -e ./burninfpuPi2 Kwds 4 Sect 2 Mins 15 Log 12
lxterminal --geometry=80x15 -e ./burninfpuPi2 Kwds 4 Sect 2 Mins 15 Log 13
lxterminal --geometry=80x15 -e ./burninfpuPi2 Kwds 4 Sect 2 Mins 15 Log 14
This test specified a revised floating point stress test, where four cores can execute nearly 12 GFLOPS. The last exercise was to check out different heatsinks. In the following, Black is the latest from Pi Hut in September 2016, and Copper the rather swish Enzotech BMR-C1, kindly supplied by Doc Watson in September 2016. Then the third test is with the system plastic cover removed. Room temperature was 22°C.

Throttling started at around the reported 80°C, with a maximum of about a 34% reduction in CPU MHz and recorded MFLOPS for both heatsinks, and still 21% with the cover removed.

Code: Select all

 Revised Benchmark Max MFLOPS > 2900 Per Core - New OS Driver Enabled

          Black Heatsink        Copper Heatsink       Copper No Cover

                       4 Core                4 Core                4 Core
  Minute     °C    MHz MFLOPS      °C    MHz MFLOPS      °C    MHz MFLOPS

       0   49.9   1200           41.9   1200           46.2   1200
       1   73.6   1200  11699    65.0   1200  11706    67.1   1200  11720
       2   81.7   1124  11282    73.6   1200  11709    74.1   1200  11709
       3   82.7    977   9489    79.0   1200  11726    79.0   1200  11682
       4   82.7    917   8954    81.7   1038  10322    80.6   1118  11059
       5   83.8    867   8545    82.2    963   9629    81.7   1048  10296
       6   83.8    846   8252    82.7    932   9165    81.7   1015  10073
       7   83.8    830   8085    83.8    876   8832    81.7    991   9812
       8   83.8    809   7991    83.3    867   8558    81.7    991   9684
       9   83.8    816   7860    83.8    842   8318    82.2    963   9556
      10   83.8    795   7738    83.8    824   8146    82.7    965   9369
      11   84.4    782   7663    83.8    821   8051    82.7    968   9342
      12   84.4    787   7625    83.8    813   7966    82.7    953   9241
      13   83.8    844   8212    83.8    812   7879    82.2    956   9203
      14   83.8    827   8177    84.4    796   7780    82.7    948   9194
      15   84.4    830   8133    84.4    794   7710    82.7    949   9109

    min    73.6    782   7625    65.0    794   7710    67.1    948   9109
    max    84.4   1200  11699    84.4   1200  11726    82.7   1200  11720
   Loss
    %             34.8   34.8           33.8   34.2           21.0   22.3
 
