aa3025
Posts: 17
Joined: Fri Sep 14, 2018 8:35 am

Raspberry Pi4 8GB xHPL benchmark results

Wed Jun 03, 2020 7:28 pm

Hi All,

I got my Pi 4 8GB recently and decided to run the HPL benchmark on it. Result: almost 9 GFlops! The tutorial and results are here: https://www.hydromag.eu/~aa3025/rpi/.

Alex P.

arif-ali
Posts: 16
Joined: Tue Jun 02, 2020 9:45 pm
Location: Sheffield, UK

Re: Raspberry Pi4 8GB xHPL benchmark results

Wed Jun 03, 2020 8:39 pm

Thanks for sharing Alex,

This is a benchmark I used to run regularly for HPC systems in my previous role. So great to see someone attempt to run this.

I might also have a go at some point, when I buy my second one.

Have you looked at the scaling frequencies as well as the governor? I know these tend to make some difference. I'm not sure how it works on ARM, but we used to set the min and max frequencies to the maximum available, and the governor to performance. The other thing we used to do was to have two runs within HPL.dat: the first would help bump the memory and CPU to 100%, so the second would potentially see maximum performance.

I would definitely be interested in seeing what I could achieve on the rpi4 with these things in mind.

Arif
You can also find me on IRC: arif-ali@freenode

Current rpi's:
3 x rpi4 4GB with Ubuntu 20.04 64bit
1 x rpi4 8GB with Raspberry Pi OS (beta FW for booting from SSD)

Linpack result:
14.401 GFlops, https://gitlab.arif-ali.co.uk:8543/snippets/26

aa3025
Posts: 17
Joined: Fri Sep 14, 2018 8:35 am

Re: Raspberry Pi4 8GB xHPL benchmark results

Wed Jun 03, 2020 11:30 pm

Hi Arif,

Yes, HPL is my default test tool too, for the compute nodes of the HPC systems I manage at work.

I kept monitoring the throttling state during the test to make sure the frequency was not scaled down, and kept the CPU at about 50 C during the tests, way below the "official" 85 C throttling limit. The Pi's CPU normally sits at 600 MHz at idle, then immediately jumps to 1.5 GHz and stays there when I start the test, so I think it's all fine with the CPU governor.

Alex

jahboater
Posts: 5680
Joined: Wed Feb 04, 2015 6:38 pm
Location: West Dorset

Re: Raspberry Pi4 8GB xHPL benchmark results

Wed Jun 03, 2020 11:49 pm

If you run this after your benchmark run completes:

vcgencmd get_throttled

it should return zero (0x0).
Anything else means it has throttled because of temperature or low voltage.
There are bits to say it's currently throttled and sticky bits to say it has been throttled since the last boot.
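For anyone who wants to see which of those bits are set, here is a minimal Python sketch that decodes the value. The bit meanings follow the Raspberry Pi documentation for `vcgencmd get_throttled`. The function name `decode_throttled` is made up for this example, and the input is assumed to be the usual `throttled=0x...` line the tool prints.

```python
# Bit positions as documented for `vcgencmd get_throttled`.
# Bits 0-3 describe the current state; bits 16-19 are the sticky
# "has happened since boot" flags.
THROTTLE_BITS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output):
    """Turn a line such as 'throttled=0x50000' into a list of readable flags."""
    value = int(output.strip().split("=", 1)[1], 16)
    return [name for bit, name in sorted(THROTTLE_BITS.items()) if value & (1 << bit)]


# Hypothetical example value, not taken from this thread.
print(decode_throttled("throttled=0x50000"))
```

An empty list corresponds to the 0x0 case described above, so a single check after the benchmark run is enough to tell whether the result was affected.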
Pi4 8GB running PIOS64

aa3025
Posts: 17
Joined: Fri Sep 14, 2018 8:35 am

Re: Raspberry Pi4 8GB xHPL benchmark results

Wed Jun 03, 2020 11:51 pm

Yes that's how I monitored throttling state.

jahboater
Posts: 5680
Joined: Wed Feb 04, 2015 6:38 pm
Location: West Dorset

Re: Raspberry Pi4 8GB xHPL benchmark results

Wed Jun 03, 2020 11:53 pm

aa3025 wrote:
Wed Jun 03, 2020 11:51 pm
Yes that's how I monitored throttling state.
Yes, but you said "I kept monitoring throttling state during the test".
I suggest that you do not need to. Just check that it's 0x0 once, after the test has finished.

aa3025
Posts: 17
Joined: Fri Sep 14, 2018 8:35 am

Re: Raspberry Pi4 8GB xHPL benchmark results

Thu Jun 04, 2020 12:01 am

Yes, but if the HPL test takes 30-45 min, you can abort it early if you know that throttling has already occurred. Anyway, with my "improved" case (fan), the temperature did not go above 52 C during the tests.

@Arif: I actually checked whether changing scaling_governor to "performance" (from "ondemand") makes any difference:

Code: Select all

for i in $(seq 0 3); do echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor; done

This way the CPU frequency is kept at 1.5 GHz all the time.

And it makes no difference: I still get the same 8.93 GFlops from HPL.
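To confirm that the governor change actually took effect on every core, the same sysfs directory can be read back. The following is a rough Python sketch, assuming the standard Linux cpufreq files `scaling_governor` and `scaling_cur_freq`. The helper names `cpufreq_status` and `all_pinned` are made up for illustration.

```python
from pathlib import Path


def cpufreq_status(root="/sys/devices/system/cpu"):
    """Return {cpu_name: (governor, current_kHz)} for every core exposing cpufreq."""
    status = {}
    # cpu[0-9]* matches cpu0, cpu1, ... but not sibling entries such as cpuidle.
    for policy in sorted(Path(root).glob("cpu[0-9]*/cpufreq")):
        governor = (policy / "scaling_governor").read_text().strip()
        cur_khz = int((policy / "scaling_cur_freq").read_text())
        status[policy.parent.name] = (governor, cur_khz)
    return status


def all_pinned(status, governor="performance"):
    """True when at least one core was found and all of them use the given governor."""
    return bool(status) and all(g == governor for g, _ in status.values())
```

Calling `all_pinned(cpufreq_status())` before starting xhpl gives a quick yes/no answer. On a Pi 4 at stock clocks the reported frequency should read 1500000 kHz under load.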

Alex

arif-ali
Posts: 16
Joined: Tue Jun 02, 2020 9:45 pm
Location: Sheffield, UK

Re: Raspberry Pi4 8GB xHPL benchmark results

Thu Jun 04, 2020 8:04 am

aa3025 wrote:
Thu Jun 04, 2020 12:01 am
@Arif: I actually checked whether changing scaling_governor to "performance" (from "ondemand") makes any difference:

Code: Select all

for i in $(seq 0 3); do echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor; done

This way the CPU frequency is kept at 1.5 GHz all the time.

And it makes no difference: I still get the same 8.93 GFlops from HPL.
nice one, thanks for checking. Wasn't expecting a quick turnaround :)

Looking around on other forums, it seems people are consistently getting around 6 GFlops. It would be interesting to see results from some of the places that have set up multi-node clusters, like the ones I saw at Supercomputing conferences in the past few years.

ejolson
Posts: 5202
Joined: Tue Mar 18, 2014 11:47 am

Re: Raspberry Pi4 8GB xHPL benchmark results

Fri Jun 05, 2020 3:45 am

I ran a shorter test with N=8000 and obtained 9.1697 GFLOPS.

Code: Select all

$ ./xhpl
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :    8000 
NB     :     256 
PMAP   : Row-major process mapping
P      :       1 
Q      :       1 
PFACT  :    Left 
NBMIN  :       2 
NDIV   :       2 
RFACT  :   Right 
BCAST  :   2ring 
DEPTH  :       0 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR02R2L2        8000   256     1     1              37.23             9.1697e+00
HPL_pdgesv() start time Fri Jun  5 04:10:05 2020

HPL_pdgesv() end time   Fri Jun  5 04:10:42 2020

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.59405429e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
This was using the Pi 4B 2GB model. Note that MPICH was installed to use the slower ch3 network device rather than shared memory, but I think the run was using OpenBLAS threads rather than MPI anyway. Likely the smaller matrix size gives slightly faster results than using the full memory size on the 8GB model. I'll try to optimize the build and then post more details in case anyone wants to continue improving things.
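As a sanity check on the Gflops column, the figure can be recomputed from N and the wall time. HPL credits a solve with 2/3 N^3 + 3/2 N^2 floating-point operations, so a short Python sketch (the function names are made up) looks like this:

```python
def hpl_flops(n):
    """Operation count HPL credits for solving an n x n system."""
    return (2.0 / 3.0) * n**3 + 1.5 * n**2


def hpl_gflops(n, seconds):
    """Rate in GFLOPS for a run of order n that took the given wall time."""
    return hpl_flops(n) / seconds / 1e9


# N and time from the run above; the printed time is rounded to two
# decimals, so the last digits can differ slightly from the HPL output.
print(f"{hpl_gflops(8000, 37.23):.2f}")
```

The small mismatch in the third decimal against the reported 9.1697e+00 comes from the rounded time, not from a different operation count.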
Last edited by ejolson on Fri Jun 05, 2020 5:16 am, edited 5 times in total.

ejolson
Posts: 5202
Joined: Tue Mar 18, 2014 11:47 am

Re: Raspberry Pi4 8GB xHPL benchmark results

Fri Jun 05, 2020 3:56 am

Recompiling hpl with gcc-10.1 but still in 32-bit mode resulted in performance over 10 GFLOPS.

Code: Select all

$ ./xhpl 
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :    8000 
NB     :     256 
PMAP   : Row-major process mapping
P      :       1 
Q      :       1 
PFACT  :    Left 
NBMIN  :       2 
NDIV   :       2 
RFACT  :   Right 
BCAST  :   2ring 
DEPTH  :       0 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR02R2L2        8000   256     1     1              32.78             1.0415e+01
HPL_pdgesv() start time Fri Jun  5 04:53:42 2020

HPL_pdgesv() end time   Fri Jun  5 04:54:15 2020

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.59405429e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
Alex, since I can't run the large matrix size with my 2GB Pi, would it be possible for you to try N=8000 and compare?

ejolson
Posts: 5202
Joined: Tue Mar 18, 2014 11:47 am

Re: Raspberry Pi4 8GB xHPL benchmark results

Fri Jun 05, 2020 4:48 am

Here is a run with OpenBLAS threads turned off using MPICH and the shared memory communication device.

Code: Select all

$ export OPENBLAS_NUM_THREADS=1
$ mpirun -np 4 ./xhpl
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :    8000 
NB     :     192 
PMAP   : Row-major process mapping
P      :       2 
Q      :       2 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4        8000   192     2     2              32.50             1.0505e+01
HPL_pdgesv() start time Fri Jun  5 05:39:32 2020

HPL_pdgesv() end time   Fri Jun  5 05:40:04 2020

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.04717139e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
For some reason the version of MPICH I installed doesn't allow me to specify a rank file, so I can't check if that makes a difference. The result is still greater than 10 GFLOPS. It would be nice to know whether the differences in our timing results are related to the 8GB versus 2GB hardware, the matrix size, or some other software difference such as using the version 10.1 gcc compiler. This could also be one of those weird situations where native 64-bit runs slower than 32-bit compatibility mode.
Last edited by ejolson on Fri Jun 05, 2020 5:24 am, edited 1 time in total.

ejolson
Posts: 5202
Joined: Tue Mar 18, 2014 11:47 am

Re: Raspberry Pi4 8GB xHPL benchmark results

Fri Jun 05, 2020 5:19 am

arif-ali wrote:
Thu Jun 04, 2020 8:04 am
Looking around on other forums, it seems people are consistently getting around 6 GFlops. It would be interesting to see results from some of the places that have set up multi-node clusters, like the ones I saw at Supercomputing conferences in the past few years.
I think 6 GFLOPS is for the Pi 3B. Using default clock speeds, the original 3B gets 6.4 and the 3B+ gets 6.7. More information is in the thread

viewtopic.php?f=63&t=208167

Therefore, in terms of Linpack, the 4B is 57 percent faster than the 3B+ that came before. At the same time, having more memory makes the 4B more powerful.
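The 57 percent figure can be reproduced with one line of arithmetic. This small Python sketch assumes the comparison is between the roughly 10.5 GFLOPS MPICH run earlier in the thread and the 6.7 GFLOPS quoted for the 3B+. That pairing is a guess at which numbers were used.

```python
def speedup_percent(new_gflops, old_gflops):
    """How much faster the new result is than the old one, in percent."""
    return (new_gflops / old_gflops - 1.0) * 100.0


# 10.505 GFLOPS (Pi 4B run above) versus 6.7 GFLOPS (3B+ at default clocks).
print(round(speedup_percent(10.505, 6.7)))
```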

aa3025
Posts: 17
Joined: Fri Sep 14, 2018 8:35 am

Re: Raspberry Pi4 8GB xHPL benchmark results

Fri Jun 05, 2020 7:47 am

@ejolson: Looks like your first run was the SMP code then, and yes, your problem size was much smaller (it took just 30 sec). How much memory did it use? The whole 2GB?

On a 32-bit OS one process obviously cannot address the whole 8GB, which is why I did not try SMP. If I try to launch a single process, it crashes with a segfault after filling 2+GB (25-27%) of memory, but that is what you can expect from a 32-bit OS.
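To put numbers on the address-space argument, here is a rough Python sketch of the matrix footprint. It counts only the N x N matrix of 8-byte doubles and assumes it is split evenly over a P x Q process grid. HPL's workspace and the exact per-process limit of a 32-bit kernel are not modelled. N=28800 with P=2, Q=2 are the settings reported elsewhere in this thread for the 8GB run.

```python
def matrix_gib(n):
    """Size of an n x n double-precision matrix in GiB."""
    return n * n * 8 / 2**30


def per_rank_gib(n, p, q):
    """Approximate share of the matrix held by each rank of a p x q grid."""
    return matrix_gib(n) / (p * q)


print(f"whole matrix: {matrix_gib(28800):.2f} GiB")
print(f"per rank in a 2x2 grid: {per_rank_gib(28800, 2, 2):.2f} GiB")
```

Under these assumptions a single process would need about 6.2 GiB, well past the point where the segfault was observed, while each of four MPI ranks stays around 1.5 GiB.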

How did you compile xhpl for SMP? Just by changing the compiler from mpicc to gcc? Or is it the same MPICH-compiled xhpl, just launched without mpirun?

My last test with OpenMPI showed 9.05 GFlops (when I did not continuously poll the temperature and throttling state). I can try MPICH instead of OpenMPI; I've heard it has better process-affinity behavior, so perhaps it does not require that rankfile witchcraft.

arif-ali
Posts: 16
Joined: Tue Jun 02, 2020 9:45 pm
Location: Sheffield, UK

Re: Raspberry Pi4 8GB xHPL benchmark results

Fri Jun 05, 2020 9:50 am

Thanks to both of you for sharing your experiences

I have compiled OpenBLAS and mpich, and linked them statically with the latest version of hpl. Linking statically tends to help with performance. I didn't use a minimal install, and had other things running on my pi4, so this is probably not a true reflection of performance.

OS: Ubuntu 20.04
Kernel: Linux pi02.arif.local 5.4.0-1011-raspi #11-Ubuntu SMP Fri May 8 07:43:33 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux

Run 1:

P=1 and Q=4
OMP_NUM_THREADS=1
OMP_NUM_THREADS=1 /opt/mpich/3.3.2/bin/mpirun -np 4 --machinefile=machinefile hpl-2.3/bin/rpi4/xhpl

where the contents of the machine file are below

Code: Select all

127.0.0.1:4
The result for this is below (11.58 GFlops)

Code: Select all

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4        8000   192     1     4              29.49             1.1578e+01
HPL_pdgesv() start time Fri Jun  5 09:19:58 2020

HPL_pdgesv() end time   Fri Jun  5 09:20:27 2020

--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time rfact . . . :               1.08
+ Max aggregated wall time pfact . . :               0.27
+ Max aggregated wall time mxswp . . :               0.12
Max aggregated wall time update  . . :              28.28
+ Max aggregated wall time laswp . . :               1.08
Max aggregated wall time up tr sv  . :               0.08
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   4.23928476e-03 ...... PASSED
================================================================================
Run 2:

OMP_NUM_THREADS=4 hpl-2.3/bin/rpi4/xhpl

The result for this is below (11.74 GFlops)

Code: Select all

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4        8000   192     1     1              29.09             1.1737e+01
HPL_pdgesv() start time Fri Jun  5 09:21:46 2020

HPL_pdgesv() end time   Fri Jun  5 09:22:15 2020

--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time rfact . . . :               1.59
+ Max aggregated wall time pfact . . :               0.38
+ Max aggregated wall time mxswp . . :               0.15
Max aggregated wall time update  . . :              27.40
+ Max aggregated wall time laswp . . :               1.89
Max aggregated wall time up tr sv  . :               0.08
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   5.66468717e-03 ...... PASSED
================================================================================
I am sure we can get a bit more by increasing N, as I was only using about 400M of the RAM. Also, I was using 64-bit Ubuntu 20.04, so with a minimal installation of Raspberry Pi OS I may be able to get better performance.

Make.rpi4: https://paste.ubuntu.com/p/VZKrDynpp5/

arif-ali
Posts: 16
Joined: Tue Jun 02, 2020 9:45 pm
Location: Sheffield, UK

Re: Raspberry Pi4 8GB xHPL benchmark results

Fri Jun 05, 2020 10:15 am

A final result from me for the day; this was using 60% of the memory of my 4GB rpi4. From my experience, we wouldn't get much more performance even if we utilised more memory. We might get 1 to 2 percent more, but that is not worth waiting 30 mins for the results to come through.

Code: Select all

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       17280   192     1     1             254.71             1.3507e+01
HPL_pdgesv() start time Fri Jun  5 10:02:20 2020

HPL_pdgesv() end time   Fri Jun  5 10:06:34 2020

--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time rfact . . . :               6.46
+ Max aggregated wall time pfact . . :               1.66
+ Max aggregated wall time mxswp . . :               0.50
Max aggregated wall time update  . . :             247.84
+ Max aggregated wall time laswp . . :               7.79
Max aggregated wall time up tr sv  . :               0.35
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   4.61419323e-03 ...... PASSED
================================================================================
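A common way to pick N for a memory budget is to take the square root of the number of doubles that fit and round down to a multiple of NB. The Python sketch below follows that rule of thumb. The function name is made up, and treating "4GB" as 4 x 10^9 bytes is an assumption. The thread does not say how N=17280 was actually chosen, although this rule happens to reproduce it.

```python
import math


def hpl_problem_size(mem_bytes, percent, nb):
    """Largest multiple of nb whose n x n double matrix fits in percent% of mem_bytes."""
    budget_elements = mem_bytes * percent // 100 // 8  # 8 bytes per double
    n = math.isqrt(budget_elements)
    return n - n % nb


# 60% of a 4GB board with NB=192, as in the run above.
print(hpl_problem_size(4_000_000_000, 60, 192))
```

The same function with 83 percent of 8 x 10^9 bytes gives 28800, the problem size used for the 8GB runs in this thread.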

aa3025
Posts: 17
Joined: Fri Sep 14, 2018 8:35 am

Re: Raspberry Pi4 8GB xHPL benchmark results

Fri Jun 05, 2020 10:19 am

Oh, that's nice!

I wasn't able to produce a working xhpl binary when compiling with stock mpich; the resulting executable segfaulted straight away. When compiled with OpenMPI and gcc 10.1 it still produces 9.07 GFlops with my big (8GB) problem size (N=28800, P=2, Q=2).

With N=8000 I got 7.8 GFlops with OpenMPI. I guess the 64-bit OS (Ubuntu 20) results in better performance.
Last edited by aa3025 on Fri Jun 05, 2020 3:28 pm, edited 1 time in total.

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 26442
Joined: Sat Jul 30, 2011 7:41 pm

Re: Raspberry Pi4 8GB xHPL benchmark results

Fri Jun 05, 2020 10:21 am

So, 6GF is about 40 times faster than the Cray-1.

How times have changed!
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed.
I've been saying "Mucho" to my Spanish friend a lot more lately. It means a lot to him.

arif-ali
Posts: 16
Joined: Tue Jun 02, 2020 9:45 pm
Location: Sheffield, UK

Re: Raspberry Pi4 8GB xHPL benchmark results

Fri Jun 05, 2020 3:09 pm

So, throwing the theory that 32-bit Raspberry Pi OS is the issue out of the window:

Kernel: Linux pi04.arif.local 5.4.44-v7l+ #1320 SMP Wed Jun 3 16:13:10 BST 2020 armv7l GNU/Linux
OS: Raspbian GNU/Linux 10 (buster)

Same HPL.dat for run1 but now on the 8GB version, and the same command.

* recompiled mpich
* recompiled OpenBLAS
* recompiled hpl

I didn't tune anything on the OS, it wasn't a minimal install, and some other tasks were probably running in the background

re-ran the benchmark

Code: Select all

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       17280   192     1     1             258.66             1.3301e+01
HPL_pdgesv() start time Fri Jun  5 15:49:47 2020

HPL_pdgesv() end time   Fri Jun  5 15:54:05 2020

--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time rfact . . . :               7.46
+ Max aggregated wall time pfact . . :               2.49
+ Max aggregated wall time mxswp . . :               0.53
Max aggregated wall time update  . . :             250.75
+ Max aggregated wall time laswp . . :               8.34
Max aggregated wall time up tr sv  . :               0.42
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.33839235e-03 ...... PASSED
================================================================================

aa3025
Posts: 17
Joined: Fri Sep 14, 2018 8:35 am

Re: Raspberry Pi4 8GB xHPL benchmark results

Fri Jun 05, 2020 3:33 pm

Decent! Is it the stock gcc 8 compiler you compiled everything with?
I don't know, maybe your compiled OpenBLAS is magic! (Did you try with the stock libopenblas?)
Last edited by aa3025 on Fri Jun 05, 2020 3:52 pm, edited 1 time in total.

ejolson
Posts: 5202
Joined: Tue Mar 18, 2014 11:47 am

Re: Raspberry Pi4 8GB xHPL benchmark results

Fri Jun 05, 2020 3:37 pm

arif-ali wrote:
Fri Jun 05, 2020 3:09 pm
So, throwing the theory that 32-bit Raspberry Pi OS is the issue out of the window:

Kernel: Linux pi04.arif.local 5.4.44-v7l+ #1320 SMP Wed Jun 3 16:13:10 BST 2020 armv7l GNU/Linux
OS: Raspbian GNU/Linux 10 (buster)

Same HPL.dat for run1 but now on the 8GB version, and the same command.

* recompiled mpich
* recompiled OpenBLAS
* recompiled hpl

I didn't tune anything on the OS, it wasn't a minimal install, and some other tasks were probably running in the background

re-ran the benchmark

Code: Select all

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       17280   192     1     1             258.66             1.3301e+01
HPL_pdgesv() start time Fri Jun  5 15:49:47 2020

HPL_pdgesv() end time   Fri Jun  5 15:54:05 2020

--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time rfact . . . :               7.46
+ Max aggregated wall time pfact . . :               2.49
+ Max aggregated wall time mxswp . . :               0.53
Max aggregated wall time update  . . :             250.75
+ Max aggregated wall time laswp . . :               8.34
Max aggregated wall time up tr sv  . :               0.42
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.33839235e-03 ...... PASSED
================================================================================
A speed of 13.3 GFLOPS seems pretty good and is now about 80 percent faster than previous 3B+ timings. Would you mind posting the full output or, better, your HPL.dat file so we can see the runtime tuning parameters? Also, could you please confirm that you are not overclocking the Pi 4B, as that's a different game.

aa3025
Posts: 17
Joined: Fri Sep 14, 2018 8:35 am

Re: Raspberry Pi4 8GB xHPL benchmark results

Fri Jun 05, 2020 3:46 pm

arif-ali wrote:
Fri Jun 05, 2020 3:09 pm
So, throwing the theory that 32-bit Raspberry Pi OS is the issue out of the window:

Kernel: Linux pi04.arif.local 5.4.44-v7l+ #1320 SMP Wed Jun 3 16:13:10 BST 2020 armv7l GNU/Linux
OS: Raspbian GNU/Linux 10 (buster)
You are on the rpi-5.4.y kernel. On my Pi the latest I have is

Code: Select all

4.19.118-v7l+ #1311 SMP Mon Apr 27 14:26:42 BST 2020 armv7l GNU/Linux


... I see I did not do the firmware update (rpi-update) ... The race is on :)

arif-ali
Posts: 16
Joined: Tue Jun 02, 2020 9:45 pm
Location: Sheffield, UK

Re: Raspberry Pi4 8GB xHPL benchmark results

Fri Jun 05, 2020 4:02 pm

Below is my setup

* OS: Raspberry Pi OS
* Kernel: Linux pi04.arif.local 5.4.44-v7l+ #1320 SMP Wed Jun 3 16:13:10 BST 2020 armv7l GNU/Linux
* Bootloader:

Code: Select all

root@pi04:~# vcgencmd bootloader_version 
May 27 2020 18:47:29
version d648db3968cd31d4948341e09cb8a925c49d2ea1 (release)
timestamp 1590601649
Everything else is stock

* Download latest mpich, and compile using

Code: Select all

tar xfz mpich-3.3.2.tar.gz
cd mpich-3.3.2
./configure --prefix=/opt/mpich/3.3.2
make -j 3
sudo make install
* Download latest OpenBLAS

Code: Select all

unzip OpenBLAS.zip
cd OpenBLAS-develop
make -j 3
* Download latest hpl

Code: Select all

tar xfz hpl-2.3.tar.gz
cd hpl-2.3
HPL.dat: https://paste.ubuntu.com/p/xmq22Th5P2/
Make.rpi4-mpich: https://paste.ubuntu.com/p/yhpStnhzRr/

Then compile using the following command

Code: Select all

make arch=rpi4-mpich
My HPL.dat was in ~/hpl so my PWD was ~/hpl

Code: Select all

OMP_NUM_THREADS=4 ./hpl-2.3/bin/rpi4-mpich/xhpl
Finally, my result from the above environment on the 8GB board is here; it was using 2.82GB of RAM in this instance

https://paste.ubuntu.com/p/YtfY5M8M6v/

If anything is missing or unclear, let me know.

aa3025
Posts: 17
Joined: Fri Sep 14, 2018 8:35 am

Re: Raspberry Pi4 8GB xHPL benchmark results

Fri Jun 05, 2020 4:06 pm

OK, but you ran in SMP (threaded) mode (I ran in MPP mode):
OMP_NUM_THREADS=4 ./hpl-2.3/bin/rpi4-mpich/xhpl
Will it crash if you increase the problem size so that the 32-bit OS can't allocate the required memory?

My HPL.dat for 4 MPI workers and 8GB RAM:

Code: Select all

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any) 
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
28800         Ns
1            # of NBs
192           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0                               Number of additional problem sizes for PTRANS
1200 10000 30000                values of N
0                               number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64        values of NB
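Since HPL.dat is positional (each value is identified by its line number, and the text after the value is only a comment), it is easy to launch with a rank count that does not match the grid. Here is a rough Python sketch that pulls N, NB and the P x Q grids out of the file and reports how many MPI ranks the largest grid needs. The helper names are made up, the line positions follow the file shown above, and `SAMPLE` is just the first twelve lines of that file.

```python
SAMPLE = """\
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
28800         Ns
1            # of NBs
192           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
2            Qs
"""


def leading_ints(line):
    """Collect the integers at the start of a line; the rest is a comment."""
    values = []
    for token in line.split():
        if not token.isdigit():
            break
        values.append(int(token))
    return values


def parse_hpl_dat(text):
    """Extract problem sizes, block sizes and process grids from HPL.dat text."""
    lines = text.splitlines()

    def field(count_line, values_line):
        # Line numbers are 1-based, matching the layout of HPL.dat.
        count = leading_ints(lines[count_line - 1])[0]
        return leading_ints(lines[values_line - 1])[:count]

    ps, qs = field(10, 11), field(10, 12)
    return {"Ns": field(5, 6), "NBs": field(7, 8), "grids": list(zip(ps, qs))}


def ranks_needed(config):
    """MPI ranks required by the largest P x Q grid in the file."""
    return max(p * q for p, q in config["grids"])


config = parse_hpl_dat(SAMPLE)
print(config, ranks_needed(config))
```

For the file above the grid is 2 x 2, so the value passed to `mpirun -np` should be at least 4, which matches the "4 mpi workers" description.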

arif-ali
Posts: 16
Joined: Tue Jun 02, 2020 9:45 pm
Location: Sheffield, UK

Re: Raspberry Pi4 8GB xHPL benchmark results

Fri Jun 05, 2020 4:27 pm

Now in MPP, just for you @aa3025

OMP_NUM_THREADS=1 /opt/mpich/3.3.2/bin/mpirun -np 4 --machinefile=machinefile.mpich ./hpl-2.3/bin/rpi4-mpich/xhpl

machinefile.mpich has the following contents

Code: Select all

127.0.0.1:4
HPL.dat: https://paste.ubuntu.com/p/Dps9BJkqVz/
Results: https://paste.ubuntu.com/p/CbrHss68Nh/

EDIT:

Adding numbers for the same N as @aa3025

HPL.dat: https://paste.ubuntu.com/p/wnD6FdGD3m/
Results: https://paste.ubuntu.com/p/6P3zvMfCCr/
Last edited by arif-ali on Fri Jun 05, 2020 5:29 pm, edited 1 time in total.

aa3025
Posts: 17
Joined: Fri Sep 14, 2018 8:35 am

Re: Raspberry Pi4 8GB xHPL benchmark results

Fri Jun 05, 2020 4:43 pm

Great, thanks! I still got 9 GFlops after the kernel/firmware upgrade to v5.4 with stock openmpi and openblas. I will try to reproduce your compilation of mpich and openblas to see if I can get your 13 GFlops... :D
