ejolson
Posts: 2759
Joined: Tue Mar 18, 2014 11:47 am

Pi3B+ Hopefully Correct Results Under Load

Sat Mar 17, 2018 5:23 am

Some time ago there was a thread on running the HPL benchmark, which solves systems of linear equations, on the Pi 3B. That thread was entitled "Pi3 incorrect results under load (possibly heat related)", hence the title of this one. In the previous thread it was eventually determined that the incorrect results were not heat related, but caused by power transients that were often fixable by using an over voltage setting. As the 3B+ is clocked almost 17 percent faster than the original 3B (1.4 GHz versus 1.2 GHz), I would expect a Linpack speed of about 7.5 Gflops, provided of course that it doesn't crash instead. Note the 3B attains a speed of 6.4 Gflops on the HPL benchmark.
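
The 7.5 Gflops expectation is simple clock scaling, which can be sketched as below. The function name `scale_gflops` is made up for illustration, and the estimate assumes performance tracks the ARM clock exactly, which ignores memory bandwidth and any throttling.

```shell
#!/bin/bash
# Scale a measured Gflops figure by the ratio of two clock speeds.
# Assumption: performance is purely clock-bound (an optimistic upper bound).
scale_gflops() {
    local base_gflops=$1 base_mhz=$2 new_mhz=$3
    awk -v g="$base_gflops" -v b="$base_mhz" -v n="$new_mhz" \
        'BEGIN { printf "%.2f\n", g * n / b }'
}

# Pi 3B at 1200 MHz scoring 6.4 Gflops, scaled to the 3B+ at 1400 MHz.
scale_gflops 6.4 1200 1400   # prints 7.47
```

Rounded up, that is the "about 7.5 Gflops" quoted in the post.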

Gareth Halfacree recently posted a number of benchmark scores for the 3B+. Unfortunately his reported Linpack result is about 200 Mflops, which is nearly 30 times slower than properly tuned Linpack timings. Is there anyone with a 3B+ and the interest to run the HPL high-performance Linpack benchmark on it? I'm wondering whether improvements to the on-board power regulator and thermal management allow the 3B+ to perform reliably, and whether the speed is really the 7.5 Gflops I expect. You can follow the instructions for compiling and running HPL from here.

tkaiser
Posts: 103
Joined: Fri Aug 05, 2016 1:28 pm

Re: Pi3B+ Hopefully Correct Results Under Load

Sat Mar 17, 2018 10:18 am

ejolson wrote:
Sat Mar 17, 2018 5:23 am
In the previous thread it was eventually determined that the incorrect results were not heat related, but caused by power transients that were often fixable by using an over voltage setting.
Just for the record... what you discovered back then inspired us to use this special Linpack version to develop sane DVFS settings for a couple of other ARM boards in the meantime (Pine64 being the first one, two years ago). It's important to test through all available DVFS operating points, and as such the whole procedure needs to be fully automated: https://github.com/ehoutsma/StabilityTester

After adjusting one or two variables related to sysfs nodes (or replacing with VC4 commands) it should work directly on Pi 3 B and B+.

At least

Code: Select all

CURVOLT=$(cat ${REGULATOR_HANDLER}${REGULATOR_MICROVOLT})
should be replaced with

Code: Select all

CURVOLT=$(vcgencmd measure_volts | cut -f2 -d= | sed 's/000//')
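
One caveat with the replacement line: `sed 's/000//'` only removes three consecutive zeros, so a reading of `volt=1.2000V` becomes `1.2V` while `volt=1.3750V` passes through unchanged. If the surrounding script expects an integer number of microvolts (an assumption, suggested only by the `REGULATOR_MICROVOLT` name), a conversion along these lines may be closer to what is needed. The helper name `volts_to_microvolts` is hypothetical.

```shell
#!/bin/bash
# Convert a "volt=1.3750V" reading into integer microvolts (1375000).
# Assumption: the caller wants microvolts, as the sysfs node name suggests.
volts_to_microvolts() {
    local reading=$1
    reading=${reading#volt=}   # drop the "volt=" prefix
    reading=${reading%V}       # drop the trailing unit
    # Add 0.5 before truncating so floating-point error cannot round down.
    awk -v v="$reading" 'BEGIN { printf "%d\n", v * 1000000 + 0.5 }'
}

volts_to_microvolts "volt=1.2000V"   # prints 1200000
# On a Pi (hypothetical usage):
#   CURVOLT=$(volts_to_microvolts "$(vcgencmd measure_volts)")
```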

ejolson
Posts: 2759
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi3B+ Hopefully Correct Results Under Load

Sun Mar 18, 2018 1:40 am

tkaiser wrote:
Sat Mar 17, 2018 10:18 am
ejolson wrote:
Sat Mar 17, 2018 5:23 am
In the previous thread it was eventually determined that the incorrect results were not heat related, but caused by power transients that were often fixable by using an over voltage setting.
Just for the record... what you discovered back then inspired us to use this special Linpack version to develop sane DVFS settings
Thanks, except I didn't discover it: Vince Weaver at the University of Maine discovered that the Raspberry Pi produced incorrect results when solving systems of linear equations; Kazushige Goto developed the optimized linear algebra subroutine library that became OpenBLAS; and Jack Dongarra created the HPL High-Performance Linpack benchmark used to compare supercomputers.

If the Pi 3B+ achieves 7.5 Gflops, that would place it 29th among the world's fastest supercomputers in 1993. It would then have stayed on the list until 1997, at which point the top 500 supercomputers in the world all became faster than the Raspberry Pi. From my point of view 1997 is not that long ago, though I suppose it really is.

ejolson
Posts: 2759
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi3B+ Hopefully Correct Results Under Load

Mon Apr 23, 2018 5:31 pm

I'm happy to report the Pi 3B+ which I've been using does not lock up at standard clock settings. Moreover, it is able to correctly solve systems of linear equations using the OpenBLAS subroutine library tested with the Linpack benchmark. Note that I have compiled OpenBLAS from source, because the version distributed with Raspbian is slow, having been compiled for ARMv6 compatibility. For the MPI library and Fortran compiler I used

$ apt-get install libopenmpi-dev gfortran

to install the standard binary packages from Raspbian. The results

Code: Select all

================================================================================
HPLinpack 2.2  --  High-Performance Linpack benchmark  --   February 24, 2016
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :    8000 
NB     :     256 
PMAP   : Row-major process mapping
P      :       1 
Q      :       1 
PFACT  :    Left 
NBMIN  :       2 
NDIV   :       2 
RFACT  :   Right 
BCAST  :   2ring 
DEPTH  :       0 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR02R2L2        8000   256     1     1              50.82              6.718e+00
HPL_pdgesv() start time Mon Apr 23 17:00:48 2018

HPL_pdgesv() end time   Mon Apr 23 17:01:39 2018

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0025941 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
indicate that the 3B+ scores 6.718 Gflops, which is about 4.76 percent faster than the original Raspberry Pi 3B at default clock settings. While this is less than the almost 17 percent expected from the clock speed, I am quite happy that correct answers were always produced. It would be great if someone else could confirm this as a best-effort result at default settings or obtain a better one.
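
As a cross-check, the Gflops column can be reproduced from N and Time in the run above. The sketch below assumes the operation count HPL conventionally credits for an order-N solve, 2/3·N³ + 3/2·N²; the function names are made up for illustration.

```shell
#!/bin/bash
# Gflops credited for an order-N solve that took the given number of seconds,
# using the conventional operation count 2/3*N^3 + 3/2*N^2.
hpl_gflops() {
    local n=$1 seconds=$2
    awk -v n="$n" -v t="$seconds" \
        'BEGIN { printf "%.3f\n", (2.0/3.0*n*n*n + 1.5*n*n) / t / 1e9 }'
}

# Percentage by which one score exceeds another.
percent_faster() {
    local new=$1 old=$2
    awk -v a="$new" -v b="$old" 'BEGIN { printf "%.1f\n", (a / b - 1) * 100 }'
}

hpl_gflops 8000 50.82       # prints 6.718, matching the reported score
percent_faster 6.718 6.4    # prints 5.0, using the rounded 3B figure
```

With the rounded 6.4 Gflops figure the speedup comes out near 5 percent; the 4.76 percent quoted in the post presumably rests on an unrounded 3B score.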

At any rate, it is a big improvement compared to my old 3B, which frequently crashes at the default clock settings. No heatsink was used; however, a hairdryer set to cold was directed toward the Pi during the run. As a result there was no throttling, as indicated by the output

Code: Select all

0 frequency(45)=600000000 temp=39.2'C volt=1.2000V throttled=0x0
3 frequency(45)=600000000 temp=38.6'C volt=1.2000V throttled=0x0
6 frequency(45)=600000000 temp=37.6'C volt=1.2000V throttled=0x0
9 frequency(45)=1400002000 temp=39.2'C volt=1.3750V throttled=0x0
12 frequency(45)=1400000000 temp=39.7'C volt=1.3750V throttled=0x0
15 frequency(45)=1400002000 temp=39.7'C volt=1.3750V throttled=0x0
18 frequency(45)=1400000000 temp=49.4'C volt=1.3750V throttled=0x0
21 frequency(45)=1400000000 temp=53.7'C volt=1.3750V throttled=0x0
24 frequency(45)=1400000000 temp=55.8'C volt=1.3750V throttled=0x0
27 frequency(45)=1400000000 temp=58.5'C volt=1.3750V throttled=0x0
30 frequency(45)=1400000000 temp=60.1'C volt=1.3750V throttled=0x0
33 frequency(45)=1400000000 temp=60.1'C volt=1.3813V throttled=0x0
36 frequency(45)=1400000000 temp=60.1'C volt=1.3813V throttled=0x0
39 frequency(45)=1400000000 temp=61.2'C volt=1.3813V throttled=0x0
42 frequency(45)=1400000000 temp=61.2'C volt=1.3813V throttled=0x0
45 frequency(45)=1400000000 temp=61.2'C volt=1.3813V throttled=0x0
48 frequency(45)=1400002000 temp=61.2'C volt=1.3813V throttled=0x0
51 frequency(45)=1399998000 temp=61.2'C volt=1.3813V throttled=0x0
54 frequency(45)=1400000000 temp=61.2'C volt=1.3813V throttled=0x0
57 frequency(45)=1400000000 temp=62.3'C volt=1.3813V throttled=0x0
60 frequency(45)=1400000000 temp=61.2'C volt=1.3813V throttled=0x0
63 frequency(45)=1400000000 temp=60.7'C volt=1.3813V throttled=0x0
66 frequency(45)=1400000000 temp=52.6'C volt=1.3750V throttled=0x0
69 frequency(45)=1400000000 temp=49.4'C volt=1.3750V throttled=0x0
72 frequency(45)=1400000000 temp=47.2'C volt=1.3750V throttled=0x0
75 frequency(45)=600000000 temp=43.5'C volt=1.2000V throttled=0x0
78 frequency(45)=600000000 temp=41.9'C volt=1.2000V throttled=0x0
81 frequency(45)=600000000 temp=40.8'C volt=1.2000V throttled=0x0
84 frequency(45)=600000000 temp=39.7'C volt=1.2000V throttled=0x0
87 frequency(45)=600000000 temp=39.7'C volt=1.2000V throttled=0x0
for the script

Code: Select all

#!/bin/bash
# Log ARM clock, temperature, core voltage and throttle flags every 3 seconds.
# Bits reported by "vcgencmd get_throttled":
# 00 (0x00001): under-voltage
# 01 (0x00002): arm frequency capped
# 02 (0x00004): currently throttled
# 17 (0x20000): arm frequency capped has occurred
# 18 (0x40000): throttling has occurred
t=0
while true
do
    frequency=$(vcgencmd measure_clock arm)
    temp=$(vcgencmd measure_temp)
    volt=$(vcgencmd measure_volts core)
    throttled=$(vcgencmd get_throttled)
    echo "$t $frequency $temp $volt $throttled"
    t=$((t + 3))
    sleep 3
done
No timings were made with the hairdryer set to hot mode, nor will they be made using my Pi!
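
The `throttled=` field in the log can be turned into readable flags with a small helper. This is a minimal sketch that handles only the five bits listed in the script's comments; the function name `decode_throttled` is hypothetical.

```shell
#!/bin/bash
# Decode a "throttled=0x40004" style value into the flags named in the
# monitoring script's comments. Any other bits are ignored here.
decode_throttled() {
    local value=${1#throttled=}   # drop the "throttled=" prefix if present
    local bits=$((value))         # bash arithmetic understands the 0x prefix
    if (( bits == 0 )); then
        echo "no flags set"
        return 0
    fi
    (( bits & 0x00001 )) && echo "under-voltage"
    (( bits & 0x00002 )) && echo "arm frequency capped"
    (( bits & 0x00004 )) && echo "currently throttled"
    (( bits & 0x20000 )) && echo "arm frequency capped has occurred"
    (( bits & 0x40000 )) && echo "throttling has occurred"
    return 0
}

decode_throttled "throttled=0x0"   # prints "no flags set", as in the run above
# On a Pi (hypothetical usage): decode_throttled "$(vcgencmd get_throttled)"
```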

RoyLongbottom
Posts: 253
Joined: Fri Apr 12, 2013 9:27 am
Location: Essex, UK

Re: Pi3B+ Hopefully Correct Results Under Load

Wed Apr 25, 2018 11:52 am

Before comparing Linpack benchmark speeds, we should remember that there are three official versions. Results, currently up to 2014, are available from Netlib in:

http://netlib.org/benchmark/performance.pdf

The first operates on a matrix of order 100 in a Fortran environment at 64-bit floating-point precision. In 1996, Netlib accepted my C version as suitable for PCs (there as LipackPC.c). This is the one I run on Raspberry Pi systems. On the Raspberry Pi 3B+, the double-precision speed obtained was 210 MFLOPS via a 32-bit compilation and 397 MFLOPS at 64 bits (that is useful, isn’t it?). See the following, which also includes SP results up to 605 MFLOPS:

viewtopic.php?f=31&t=44080&start=75#p1300116

The second Linpack benchmark is for solving a system of equations of order 1000, with no restriction on the method or its implementation. Then we have the High-Performance Linpack.

I ran the HPL benchmark on my original RPi 3 and results are at:

viewtopic.php?f=31&t=44080&p=1026831&hilit=hpl#p1026831

The benchmark no longer exists on my Raspbian SD card. Is there one that I can just download and run? At this time, I don’t have time to play with a complicated installation.

To me, there appears to be something seriously wrong with the implementation. I have included results for an Intel Atom that indicate the performance profiles I would expect. I would not expect those increases in MFLOPS on doubling the problem size or, assuming my affinity directives were correct, the number of cores used. Then someone might provide an explanation, justifying the results.

ejolson
Posts: 2759
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi3B+ Hopefully Correct Results Under Load

Wed Apr 25, 2018 5:15 pm

RoyLongbottom wrote:
Wed Apr 25, 2018 11:52 am
The benchmark no longer exists on my Raspbian SD card. Is there one that I can just download and run? At this time, I don’t have time to play with a complicated installation.

To me, there appears to be something seriously wrong with the implementation. I have included results for an Intel Atom that indicate the performance profiles I would expect. I would not expect those increases in MFLOPS on doubling the problem size or, assuming my affinity directives were correct, the number of cores used. Then someone might provide an explanation, justifying the results.
I have a binary that will run on the most recent version of Raspbian. As you mentioned, the compilation and run-time configuration can be complicated. There is also the question of whether vcgencmd, used to monitor temperature and clock speed while the program runs, creates noticeable overhead that slows things down, and whether OpenMPI is the best MPI library to use on the Pi.
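
One rough way to approach the overhead question is to time the monitoring command by itself. The sketch below measures the average wall-clock cost of repeated calls to any command. The helper name `mean_ms_per_call` is hypothetical, it assumes GNU `date` with nanosecond `%N` support, and it says nothing about how a call perturbs a benchmark running at the same time.

```shell
#!/bin/bash
# Average wall-clock milliseconds per invocation of a command.
# Assumption: GNU date, whose %N format yields nanoseconds.
mean_ms_per_call() {
    local runs=$1; shift
    local start end i
    start=$(date +%s%N)
    for ((i = 0; i < runs; i++)); do
        "$@" > /dev/null
    done
    end=$(date +%s%N)
    awk -v ns="$((end - start))" -v n="$runs" \
        'BEGIN { printf "%.3f\n", ns / n / 1e6 }'
}

mean_ms_per_call 20 true
# On a Pi (hypothetical usage): mean_ms_per_call 20 vcgencmd measure_temp
```

Since the monitoring script issues four vcgencmd calls every three seconds, multiplying the per-call figure by four and comparing it with 3000 ms gives a first estimate of the time spent monitoring.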

For these reasons I am not certain my Gflop numbers are the best possible. For the same reasons, it would be a better verification for someone to think through the configuration issues independently and see whether they obtain speeds that are consistent or better. Even so, I'll see about putting my binary someplace for download, as such things have also helped people diagnose stability and cooling issues. It is interesting that your Pi 3B also crashed for matrices where n=8000. From a stability point of view the Pi 3B+ is much improved over the original 3B.

ejolson
Posts: 2759
Joined: Tue Mar 18, 2014 11:47 am

Re: Pi3B+ Hopefully Correct Results Under Load

Fri Aug 24, 2018 4:17 pm

ejolson wrote:
Wed Apr 25, 2018 5:15 pm
For these reasons I am not certain my Gflop numbers are the best possible.
This recent run of the downloadable binary, independently compiled by Dr Weaver and made available here, suggests 6.78 Gflops as a best-effort Linpack score for the 3B+ computer.
