Yes, but you said "I kept monitoring throttling state during the test"
Nice one, thanks for checking. Wasn't expecting a quick turnaround.
aa3025 wrote: ↑Thu Jun 04, 2020 12:01 am
@Arif: I actually checked whether changing scaling_governor to "performance" (from "ondemand") makes any difference:
Code: Select all
for i in $(seq 0 3); do echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor; done
This way the CPU frequency is kept at 1.5 GHz all the time.
And it makes no difference: I still get the same 8.93 GFlops from HPL.
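To confirm that the governor change actually took effect on every core, something like the following could be used. It is only a sketch: the `scaling_governor` path is the one from the loop above, while reading `scaling_cur_freq` (reported in kHz by the Linux cpufreq interface) and the helper name `read_cpufreq` are my own additions for illustration.

```python
from pathlib import Path

def read_cpufreq(cpu_root="/sys/devices/system/cpu", cpus=range(4)):
    """Return {cpu index: (governor, current frequency in kHz or None)}.

    cpu_root and cpus are parameters so the helper can be pointed at a
    different core count (a Pi 4 has cores 0..3).
    """
    state = {}
    for i in cpus:
        cpufreq = Path(cpu_root) / f"cpu{i}" / "cpufreq"
        governor = (cpufreq / "scaling_governor").read_text().strip()
        cur = cpufreq / "scaling_cur_freq"
        # scaling_cur_freq may be absent on some systems; report None then.
        khz = int(cur.read_text()) if cur.exists() else None
        state[i] = (governor, khz)
    return state
```

Calling `read_cpufreq()` on the Pi before and during an HPL run should show "performance" for all four cores, with the frequency staying at 1500000 kHz if nothing is throttling.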
Code: Select all
$ ./xhpl
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 8000
NB : 256
PMAP : Row-major process mapping
P : 1
Q : 1
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Right
BCAST : 2ring
DEPTH : 0
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR02R2L2 8000 256 1 1 37.23 9.1697e+00
HPL_pdgesv() start time Fri Jun 5 04:10:05 2020
HPL_pdgesv() end time Fri Jun 5 04:10:42 2020
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 2.59405429e-03 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
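As a sanity check on the Gflops column: HPL derives it from the problem size N and the wall time, using the operation count 2/3*N^3 + 3/2*N^2. A small sketch (the function name `hpl_gflops` is mine) reproduces the figure from the run above to within rounding of the printed time:

```python
def hpl_gflops(n, seconds):
    """Gflops as HPL counts them: (2/3*N^3 + 3/2*N^2) / time / 1e9."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9

# N and Time taken from the result line above (8000, 37.23 s).
print(f"{hpl_gflops(8000, 37.23):.2f}")  # 9.17
```

The small difference from the reported 9.1697e+00 comes from the time being printed with only two decimals.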
Code: Select all
$ ./xhpl
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 8000
NB : 256
PMAP : Row-major process mapping
P : 1
Q : 1
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Right
BCAST : 2ring
DEPTH : 0
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR02R2L2 8000 256 1 1 32.78 1.0415e+01
HPL_pdgesv() start time Fri Jun 5 04:53:42 2020
HPL_pdgesv() end time Fri Jun 5 04:54:15 2020
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 2.59405429e-03 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
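The T/V column packs the tuning parameters into one code, which makes it easy to see at a glance which settings a result line belongs to. The decoder below is a sketch: the field order is inferred by lining up the codes in this thread (WR02R2L2, WR11C2R4) with the parameter listings printed alongside them, and the BCAST labels are the abbreviations used in HPL.dat. It assumes every numeric field is a single digit.

```python
FACT = {"L": "Left", "C": "Crout", "R": "Right"}
BCAST = ["1rg", "1rM", "2rg", "2rM", "Lng", "LnM"]

def decode_variant(tv):
    """Decode an 8-character T/V code such as 'WR11C2R4'.

    Assumes single-digit DEPTH, NDIV and NBMIN, which holds for the
    runs in this thread but would not for, say, NBMIN >= 10.
    """
    if len(tv) != 8 or tv[0] != "W":
        raise ValueError(f"unexpected T/V code: {tv!r}")
    return {
        "pmap": {"R": "Row-major", "C": "Column-major"}[tv[1]],
        "depth": int(tv[2]),
        "bcast": BCAST[int(tv[3])],
        "rfact": FACT[tv[4]],
        "ndiv": int(tv[5]),
        "pfact": FACT[tv[6]],
        "nbmin": int(tv[7]),
    }

print(decode_variant("WR02R2L2"))
```

For WR02R2L2 this gives DEPTH 0, BCAST 2rg, RFACT Right, NDIV 2, PFACT Left and NBMIN 2, which agrees with the parameter list printed in the output above.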
Code: Select all
$ export OPENBLAS_NUM_THREADS=1
$ mpirun -np 4 ./xhpl
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 8000
NB : 192
PMAP : Row-major process mapping
P : 2
Q : 2
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 8000 192 2 2 32.50 1.0505e+01
HPL_pdgesv() start time Fri Jun 5 05:39:32 2020
HPL_pdgesv() end time Fri Jun 5 05:40:04 2020
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 2.04717139e-03 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
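The run above uses `mpirun -np 4` with a 2 x 2 grid; HPL needs P x Q processes, so with 4 ranks the exact-fit choices are limited. A quick sketch for listing them (the helper name `process_grids` is hypothetical):

```python
import math

def process_grids(nprocs):
    """All (P, Q) pairs with P * Q == nprocs and P <= Q."""
    return [(p, nprocs // p)
            for p in range(1, math.isqrt(nprocs) + 1)
            if nprocs % p == 0]

print(process_grids(4))  # [(1, 4), (2, 2)]
```

Both grids show up in this thread, so the results for 2 x 2 and 1 x 4 at the same N can be compared directly.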
I think 6 GFLOPS is for the Pi 3B. Using default clock speeds, 6.4 is the original 3B and 6.7 is the 3B+. More information is in the thread
Code: Select all
127.0.0.1:4
Code: Select all
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 8000 192 1 4 29.49 1.1578e+01
HPL_pdgesv() start time Fri Jun 5 09:19:58 2020
HPL_pdgesv() end time Fri Jun 5 09:20:27 2020
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time rfact . . . : 1.08
+ Max aggregated wall time pfact . . : 0.27
+ Max aggregated wall time mxswp . . : 0.12
Max aggregated wall time update . . : 28.28
+ Max aggregated wall time laswp . . : 1.08
Max aggregated wall time up tr sv . : 0.08
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 4.23928476e-03 ...... PASSED
================================================================================
Code: Select all
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 8000 192 1 1 29.09 1.1737e+01
HPL_pdgesv() start time Fri Jun 5 09:21:46 2020
HPL_pdgesv() end time Fri Jun 5 09:22:15 2020
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time rfact . . . : 1.59
+ Max aggregated wall time pfact . . : 0.38
+ Max aggregated wall time mxswp . . : 0.15
Max aggregated wall time update . . : 27.40
+ Max aggregated wall time laswp . . : 1.89
Max aggregated wall time up tr sv . : 0.08
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 5.66468717e-03 ...... PASSED
================================================================================
Code: Select all
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 17280 192 1 1 254.71 1.3507e+01
HPL_pdgesv() start time Fri Jun 5 10:02:20 2020
HPL_pdgesv() end time Fri Jun 5 10:06:34 2020
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time rfact . . . : 6.46
+ Max aggregated wall time pfact . . : 1.66
+ Max aggregated wall time mxswp . . : 0.50
Max aggregated wall time update . . : 247.84
+ Max aggregated wall time laswp . . : 7.79
Max aggregated wall time up tr sv . : 0.35
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 4.61419323e-03 ...... PASSED
================================================================================
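The detailed timing lines make it possible to see where the time goes. The sketch below parses the "Max aggregated wall time" lines from the N = 17280 run above and computes the share of the total spent in the update phase; the regular expression and the name `parse_timings` are my own, written against the output format shown here.

```python
import re

def parse_timings(text):
    """Map phase name -> seconds from HPL's detailed-timing lines."""
    pattern = re.compile(r"Max aggregated wall time (.+?)[ .]*:\s*([0-9.]+)")
    return {m.group(1): float(m.group(2)) for m in pattern.finditer(text)}

sample = """\
Max aggregated wall time rfact . . . : 6.46
+ Max aggregated wall time pfact . . : 1.66
+ Max aggregated wall time mxswp . . : 0.50
Max aggregated wall time update . . : 247.84
+ Max aggregated wall time laswp . . : 7.79
Max aggregated wall time up tr sv . : 0.35
"""

timings = parse_timings(sample)
total = 254.71  # Time column of the same run
print(f"{timings['update'] / total:.1%}")  # 97.3%
```

With nearly all of the time in the update phase, the result is dominated by the matrix-multiply performance of the BLAS library rather than by panel factorization or communication.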
Code: Select all
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 17280 192 1 1 258.66 1.3301e+01
HPL_pdgesv() start time Fri Jun 5 15:49:47 2020
HPL_pdgesv() end time Fri Jun 5 15:54:05 2020
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time rfact . . . : 7.46
+ Max aggregated wall time pfact . . : 2.49
+ Max aggregated wall time mxswp . . : 0.53
Max aggregated wall time update . . : 250.75
+ Max aggregated wall time laswp . . : 8.34
Max aggregated wall time up tr sv . : 0.42
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 1.33839235e-03 ...... PASSED
================================================================================
A speed of 13.3 GFLOPS seems pretty good, and is now about 80 percent faster than previous 3B+ timings. Would you mind posting the full output or, better, your HPL.DAT file so we can see the runtime tuning parameters? Also, could you please confirm that you are not overclocking the Pi 4B, as that's a different game.
arif-ali wrote: ↑Fri Jun 05, 2020 3:09 pm
So, taking the theory of raspi OS 32 bit issue out of the window
Kernel: Linux pi04.arif.local 5.4.44-v7l+ #1320 SMP Wed Jun 3 16:13:10 BST 2020 armv7l GNU/Linux
OS: Raspbian GNU/Linux 10 (buster)
Same HPL.dat for run1 but now on the 8GB version, and the same command.
* recompiled mpich
* recompiled OpenBLAS
* recompiled hpl
I didn't tune anything on the OS, it wasn't a minimal install, and some other tasks were probably running in the background
re-ran the benchmark
Code: Select all
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 17280 192 1 1 258.66 1.3301e+01
HPL_pdgesv() start time Fri Jun 5 15:49:47 2020
HPL_pdgesv() end time Fri Jun 5 15:54:05 2020
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time rfact . . . : 7.46
+ Max aggregated wall time pfact . . : 2.49
+ Max aggregated wall time mxswp . . : 0.53
Max aggregated wall time update . . : 250.75
+ Max aggregated wall time laswp . . : 8.34
Max aggregated wall time up tr sv . : 0.42
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 1.33839235e-03 ...... PASSED
================================================================================
You are on the rpi-5.4.y kernel. On my Pi, the latest I have is
Code: Select all
4.19.118-v7l+ #1311 SMP Mon Apr 27 14:26:42 BST 2020 armv7l GNU/Linux
Code: Select all
root@pi04:~# vcgencmd bootloader_version
May 27 2020 18:47:29
version d648db3968cd31d4948341e09cb8a925c49d2ea1 (release)
timestamp 1590601649
Code: Select all
tar xfz mpich-3.3.2.tar.gz
cd mpich-3.3.2
./configure --prefix=/opt/mpich/3.3.2
make -j 3
sudo make install
Code: Select all
unzip OpenBLAS.zip
cd OpenBLAS-develop
make -j 3
Code: Select all
tar xfz hpl-2.3.tar.gz
cd hpl-2.3
Code: Select all
make arch=rpi4-mpich
Code: Select all
OMP_NUM_THREADS=4 ./hpl-2.3/bin/rpi4-mpich/xhpl
Will it crash if you increase the problem size and the 32-bit OS can't allocate the required memory?
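A rough way to think about that question is to compare the matrix size with what a single 32-bit process can address. HPL stores an N x N matrix of 8-byte doubles, spread across the P x Q processes. The sketch below only does that arithmetic: it ignores HPL's extra workspace, assumes an even split between processes, and the roughly 3 GiB per-process limit mentioned afterwards is an assumption about a typical 32-bit kernel rather than something measured here.

```python
def matrix_gib(n, nprocs=1):
    """GiB per process for an N x N double-precision matrix split evenly."""
    return n * n * 8 / nprocs / 2**30

# N = 17280 as a single process, and N = 28800 on a 2 x 2 grid;
# both problem sizes appear in this thread.
print(f"{matrix_gib(17280):.2f} GiB")     # 2.22 GiB
print(f"{matrix_gib(28800, 4):.2f} GiB")  # 1.54 GiB
```

So N = 17280 in one process is already close to what a 32-bit process can usually map (about 3 GiB, by assumption), while N = 28800 needs about 6.2 GiB in total and only stays under that per-process figure because it is divided over four MPI ranks.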
Code: Select all
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
28800 Ns
1 # of NBs
192 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
2 Ps
2 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0 Number of additional problem sizes for PTRANS
1200 10000 30000 values of N
0 number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64 values of NB
Code: Select all
127.0.0.1:4