ejolson
Posts: 6320
Joined: Tue Mar 18, 2014 11:47 am

Memory Bandwidth of the Pi 4B

Thu Apr 16, 2020 3:02 am

I ran John McCalpin's classic STREAM benchmark for measuring memory bandwidth, available from

https://www.cs.virginia.edu/stream/

on the Pi 4B with 2GB of RAM and obtained

Code: Select all

$ gcc -O3 -mtune=native -march=native -o stream.10M stream.c 
$ ./stream.10M
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 58407 microseconds.
   (= 58407 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5311.6     0.032188     0.030123     0.048096
Scale:           5581.7     0.029104     0.028665     0.032257
Add:             4562.3     0.052809     0.052605     0.053455
Triad:           4380.8     0.054941     0.054784     0.055248
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
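As a cross-check on how the "Best Rate" column is derived: STREAM counts two arrays of traffic for Copy and Scale and three for Add and Triad, then divides the bytes moved by the minimum time, with 1 MB taken as 10^6 bytes. Here is a minimal sketch of that arithmetic. The helper name stream_rate is made up for illustration and is not part of stream.c.

```bash
# stream_rate ARRAYS ELEMENTS SECONDS
# Bandwidth in MB/s (1 MB = 10^6 bytes) for a kernel that touches
# ARRAYS arrays of ELEMENTS 8-byte doubles in SECONDS seconds.
stream_rate() {
    awk -v arrays="$1" -v n="$2" -v t="$3" \
        'BEGIN { printf "%.1f\n", arrays * 8 * n / t / 1000000 }'
}

# Copy on the Pi 4B 2GB: 2 arrays, min time 0.030123 s
stream_rate 2 10000000 0.030123
# Add on the Pi 4B 2GB: 3 arrays, min time 0.052605 s
stream_rate 3 10000000 0.052605
```

Both calls reproduce the rates in the output above (5311.6 and 4562.3 MB/s), so the reported figure depends only on the single best iteration and not on the average time.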
Similar runs were performed on earlier models of the Pi, a first-generation Ryzen desktop computer and some others. The results, given as the best rate in MB/s for each kernel, are summarized below.

Code: Select all

                   Copy    Scale      Add    Triad
PIII 650MHz       175.2    176.1    203.6    200.8
Athlon 1400MHz    507.1    634.1    692.8    692.7
Pi B+             738.7    222.2    306.5    293.4
Pi Zero           838.5    290.5    398.4    385.7
Pi 2B v1.1       2021.9   1041.3    739.0    581.9
Pi 3B+           2683.8   2500.2   2216.0   1962.2
Pi 4B 2GB        5311.6   5581.7   4562.3   4380.8
Pi 4B 4GB        5511.8   5531.7   4835.8   4829.5
Odroid N2        6868.2   6854.6   6665.7   6715.2
Jetson Nano      7170.4   6753.6   7145.0   7015.7
i3 550 3.2GHz    7992.4   7838.9   8747.0   8692.6
Xeon Gold 6162  10211.3  11678.0  13283.3  13097.5
Ryzen7 1700     28368.6  17044.8  18710.6  18770.3
Do the 1GB and 4GB models of Pi 4B have the same memory bandwidth as the 2GB model tested here?

I find it amazing how the speed of memory in computers has changed over time. I would be very happy if someone could run the same test on the Nvidia Jetson Nano for comparison, along with any other ARM-based single-board computers that are available. Results for IBM Power9, as well as a best-effort Pi 4B overclock, would also be great.

Edit: Updated with results for the Pi Zero, the Odroid N2, the Jetson Nano, the Pi 4B with 4GB RAM and the Pi 2B.
Last edited by ejolson on Fri Apr 17, 2020 7:31 pm, edited 3 times in total.

jahboater
Posts: 6499
Joined: Wed Feb 04, 2015 6:38 pm
Location: Wonderful West Dorset

Re: Memory Bandwidth of the Pi 4B

Thu Apr 16, 2020 3:39 am

Here are some results:

Pi4 4GB (stock speed, GCC 9.3)

Code: Select all

$ gcc -O3 -mtune=native -march=native -o stream.10M stream.c
$ ./stream.10M 
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 59444 microseconds.
   (= 59444 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5511.8     0.029078     0.029029     0.029128
Scale:           5531.7     0.028962     0.028924     0.029019
Add:             4835.8     0.049655     0.049630     0.049684
Triad:           4829.5     0.049721     0.049695     0.049770
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
and for the Odroid N2 (stock speed, GCC 9.3):

Code: Select all

$ gcc -O3 -mtune=native -march=native -o stream.10M stream.c
$ ./stream.10M 
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 33980 microseconds.
   (= 33980 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            6868.2     0.023330     0.023296     0.023350
Scale:           6854.6     0.023356     0.023342     0.023371
Add:             6665.7     0.036027     0.036005     0.036095
Triad:           6715.2     0.035799     0.035740     0.036150
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
$ 
and for the Pi Zero (stock speed, GCC 6.3):

Code: Select all

$ gcc -O3 -mtune=native -march=native -o stream.10M stream.c
$ ./stream.10M       
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 568083 microseconds.
   (= 568083 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:             838.5     0.190989     0.190806     0.191115
Scale:            290.5     0.551285     0.550845     0.551427
Add:              398.4     0.602703     0.602377     0.603549
Triad:            385.7     0.622724     0.622290     0.622985
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
$
Last edited by jahboater on Thu Apr 16, 2020 3:52 am, edited 1 time in total.
Pi4 8GB and Pi4 4GB running Raspberry Pi OS 64-bit

Heater
Posts: 17116
Joined: Tue Jul 17, 2012 3:02 pm

Re: Memory Bandwidth of the Pi 4B

Thu Apr 16, 2020 3:46 am

Code: Select all

$ uname -a
Linux jetson-nano 4.9.140-tegra #1 SMP PREEMPT Mon Dec 9 22:47:42 PST 2019 aarch64 aarch64 aarch64 GNU/Linux
dlinano@jetson-nano:~/stream$ make clean
rm -f stream_f.exe stream_c.exe *.o
dlinano@jetson-nano:~/stream$ make
gcc -O3 stream.c -o stream_c.exe
dlinano@jetson-nano:~/stream$ ./stream_c.exe
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 25575 microseconds.
   (= 25575 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            7170.4     0.023451     0.022314     0.025240
Scale:           6753.6     0.025202     0.023691     0.028262
Add:             7145.0     0.036036     0.033590     0.038900
Triad:           7015.7     0.036885     0.034209     0.038899
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Memory in C++ is a leaky abstraction .

ejolson

Re: Memory Bandwidth of the Pi 4B

Thu Apr 16, 2020 4:46 am

Wow! That was fast! I'll update the first post with the new numbers.

Thanks! It looks like the 4GB Pi 4B might be slightly faster than the 2GB model. I wonder if that is simply random measurement error or whether there is a slight difference with the memory timings. On the other hand, maybe it is the effects of gcc version 9.3. I used the system compiler--gcc version 8.3 in Raspbian--for the tests with the 2GB model.

Hm. I just tried gcc version 9.2 on the Pi 4B with 2GB and the results are the same as for the system compiler. I don't have 9.3 installed.

Heater

Re: Memory Bandwidth of the Pi 4B

Thu Apr 16, 2020 6:21 am

I notice that the code is written so that it can make use of OpenMP. Using OpenMP gives a nice boost:

Code: Select all

$ uname -a
Linux jetson-nano 4.9.140-tegra #1 SMP PREEMPT Mon Dec 9 22:47:42 PST 2019 aarch64 aarch64 aarch64 GNU/Linux
dlinano@jetson-nano:~/stream$ make clean
rm -f stream_f.exe stream_c.exe *.o
dlinano@jetson-nano:~/stream$ make
gcc -O3 -fopenmp stream.c -o stream_c.exe
dlinano@jetson-nano:~/stream$ ./stream_c.exe
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 20647 microseconds.
   (= 20647 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           10828.4     0.016232     0.014776     0.019214
Scale:          10685.1     0.016704     0.014974     0.019135
Add:             8791.6     0.028727     0.027299     0.031163
Triad:           8664.6     0.030757     0.027699     0.032986
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

jahboater

Re: Memory Bandwidth of the Pi 4B

Thu Apr 16, 2020 10:09 am

ejolson wrote:
Thu Apr 16, 2020 4:46 am
Thanks! It looks like the 4GB Pi 4B might be slightly faster than the 2GB model. I wonder if that is simply random measurement error or whether there is a slight difference with the memory timings. On the other hand, maybe it is the effects of gcc version 9.3. I used the system compiler--gcc version 8.3 in Raspbian--for the tests with the 2GB model.
I doubt gcc 9.3 vs 9.2 would make any difference; 9.3 is just a bug-fix release.

I should note that the Pi4 4GB is running the 64-bit kernel, which may be faster with large memory usage. I am also running the experimental 5.4 kernel.
Here is the Pi4 4GB result with the 32-bit kernel.

Code: Select all

$ uname -a
Linux raspberrypi 5.4.29-v7l+ #1304 SMP Tue Apr 7 18:38:20 BST 2020 armv7l GNU/Linux
$ gcc -O3 -mtune=native -march=native -o stream.10M stream.c
$ ./stream.10M 
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 57377 microseconds.
   (= 57377 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5497.0     0.029177     0.029107     0.029200
Scale:           5562.7     0.028820     0.028763     0.028871
Add:             4834.1     0.049665     0.049647     0.049695
Triad:           4831.4     0.049688     0.049675     0.049714
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Larger memory and SD cards may be faster because the storage is split over multiple chips that can be accessed in parallel.

ejolson

Re: Memory Bandwidth of the Pi 4B

Thu Apr 16, 2020 9:35 pm

jahboater wrote:
Thu Apr 16, 2020 10:09 am
ejolson wrote:
Thu Apr 16, 2020 4:46 am
Thanks! It looks like the 4GB Pi 4B might be slightly faster than the 2GB model. I wonder if that is simply random measurement error or whether there is a slight difference with the memory timings. On the other hand, maybe it is the effects of gcc version 9.3. I used the system compiler--gcc version 8.3 in Raspbian--for the tests with the 2GB model.
I doubt gcc 9.3 vs 9.2 would make any difference; 9.3 is just a bug-fix release.

I should note that the Pi4 4GB is running the 64-bit kernel, which may be faster with large memory usage. I am also running the experimental 5.4 kernel.
Larger memory and SD cards may be faster because the storage is split over multiple chips that can be accessed in parallel.
I do not think the SD card would affect this RAM speed test. However, you may be right that the 4GB memory chip has an internal structure that is more parallel by design. It looks like you are getting about 10% more memory bandwidth on the Triad test. Could you verify there are no special RAM or schmoo settings in config.txt?

See, for example,

viewtopic.php?t=6201&start=1000#p893286
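For anyone who wants to check the "about 10%" figure for Triad, here is a minimal sketch of the comparison using the rates already posted for the 2GB and 4GB boards. The helper name pct_gain is hypothetical.

```bash
# pct_gain OLD CUR
# Percentage by which rate CUR exceeds rate OLD (negative if slower).
pct_gain() {
    awk -v old="$1" -v cur="$2" \
        'BEGIN { printf "%.1f\n", (cur / old - 1) * 100 }'
}

# Triad: Pi 4B 2GB (4380.8 MB/s) vs Pi 4B 4GB (4829.5 MB/s)
pct_gain 4380.8 4829.5
# Scale: Pi 4B 2GB (5581.7 MB/s) vs Pi 4B 4GB (5531.7 MB/s)
pct_gain 5581.7 5531.7
```

The difference is not uniform across kernels: Triad comes out a little over 10% faster on the 4GB board, while Scale is slightly slower.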

jahboater

Re: Memory Bandwidth of the Pi 4B

Thu Apr 16, 2020 9:55 pm

ejolson wrote:
Thu Apr 16, 2020 9:35 pm
Could you verify there are no special RAM or schmoo settings in config.txt?
Definitely not ....

I don't think the schmoo stuff is applicable to the Pi4 anyway.

The only memory setting for this headless Pi4 is "gpu_mem=16" which I shouldn't think matters.

I have today received a new 2GB Pi4, so soon I'll try and verify the difference here.

ejolson

Re: Memory Bandwidth of the Pi 4B

Thu Apr 16, 2020 10:22 pm

jahboater wrote:
Thu Apr 16, 2020 9:55 pm
ejolson wrote:
Thu Apr 16, 2020 9:35 pm
Could you verify there are no special RAM or schmoo settings in config.txt?
Definitely not ....

I don't think the schmoo stuff is applicable to the Pi4 anyway.

The only memory setting for this headless Pi4 is "gpu_mem=16" which I shouldn't think matters.

I have today received a new 2GB Pi4, so soon I'll try and verify the difference here.
Thanks for checking. It's too bad there are no schmoo settings for the Pi 4B, as I remember them being very good--better than sliced bread.

Image

Maybe that's something enjoyable to read during the quarantine, which so far has been a disappointment for me due to a persistent cough. Fortunately, my cough seems to be only bronchitis and is now getting better. I wish everyone were as lucky.

As suggested above where
Heater wrote:
Thu Apr 16, 2020 6:21 am
I notice that code is written to be able to make use of OpenMP.
I'm now running stream compiled as

$ gcc -O3 -march=native -mtune=native \
-fopenmp -o streamomp.10M stream.c

with OpenMP enabled and getting some strange results on the 4B. I'll post as soon as I verify what's going on.

Note that STREAM timings with OpenMP for many systems appear on Michael Larabel's OpenBenchmarking.org:

https://openbenchmarking.org/test/pts/stream-1.3.1

In my opinion, it is also important to have the single-core results to understand how total memory bandwidth and cache are matched to the number of cores.
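One way to put the single-core and OpenMP numbers side by side is a simple speedup ratio. The sketch below uses the Jetson Nano Copy rates Heater posted earlier in the thread (7170.4 MB/s without OpenMP, 10828.4 MB/s with four threads). The helper name speedup is made up for illustration.

```bash
# speedup SINGLE MULTI
# Ratio of a multi-threaded STREAM rate to the single-threaded rate.
speedup() {
    awk -v single="$1" -v multi="$2" \
        'BEGIN { printf "%.2f\n", multi / single }'
}

# Jetson Nano Copy: one thread vs four OpenMP threads
speedup 7170.4 10828.4
```

A ratio of about 1.5 on four cores, against an ideal of 4, suggests that a single core already uses a large share of the available memory bandwidth on that board.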

ejolson

Re: Memory Bandwidth of the Pi 4B

Fri Apr 17, 2020 2:02 am

Here are the results I obtained when running the OpenMP version of stream on the Pi 4B 2GB and the Pi 3B+ using taskset to choose the number of cores on which to run. For example,

$ taskset -c 0,1 ./streamomp.10M

runs the program on two cores. Three runs were performed for each test and only the best was plotted. The results are

Image

Image

For many SMP architectures the total aggregate memory bandwidth increases as more cores are selected. Surprisingly, this seems not to be the case for either of the Pi models, except with Add and Triad on the Pi 3B+ when moving from 1 to 2 cores. On the other hand, it is nice to see some decreasing graphs during these days of exponential growth and infection. I suspect the decrease in bandwidth for the Pi is a shared-cache effect. That is, the cache is divided between the different cores, and this results in a larger number of stalls when fetching data from main memory.

Would anyone like to profile the stream program with the Linux performance counters to verify what's happening?
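A possible starting point, assuming the perf tool is installed and the kernel exposes the generic hardware events on the board in question (this has not been checked on a Pi here), is to wrap the taskset invocation in perf stat. The function name profile_stream is hypothetical, and the event list is only a guess at what is useful.

```bash
# profile_stream CPUS [BINARY]
# Run STREAM pinned to the CPU list CPUS under perf stat and count
# cycles, instructions and cache references/misses.
# Assumption: these generic events are supported by the local PMU;
# "perf list" shows what is actually available.
profile_stream() {
    perf stat -e cycles,instructions,cache-references,cache-misses \
        -- taskset -c "$1" "${2:-./streamomp.10M}"
}

# Example (not run here):
#   profile_stream 0
#   profile_stream 0,1,2,3
```

Comparing the cache-miss counts between the one-core and four-core runs would be one way to test the shared-cache explanation.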

It would also be interesting to see what happens when the script

Code: Select all

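#!/bin/bash
# Run STREAM on 1, 2, 3 and then 4 cores by growing the CPU list.
cpuset="0"
for i in 1 2 3 4
do
    echo "Running on CPUs $cpuset ..."
    taskset -c "$cpuset" ./streamomp.10M
    cpuset="$cpuset,$i"    # add the next core for the following pass
done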
is run three times on the Jetson Nano for comparison.
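To reduce several runs of the script to the best rate per kernel and CPU set, as was done for the plots, something like the following awk filter could be used. It is only a sketch: the name best_rates is made up, and it assumes the output format shown in this thread, that is, the "Running on CPUs" lines from the script and the standard STREAM result lines.

```bash
# best_rates
# Read concatenated output of the run script on stdin and print, for
# each CPU set and kernel, the highest "Best Rate" seen in any run.
best_rates() {
    awk '
        BEGIN { cpus = "-" }                  # used if no CPU line was seen
        /^Running on CPUs/ { cpus = $4 }      # e.g. "0,1"
        /^(Copy|Scale|Add|Triad):/ {
            name = $1
            sub(/:/, "", name)                # "Copy:" becomes "Copy"
            key = cpus " " name
            if (!(key in best)) order[++n] = key
            if ($2 + 0 > best[key] + 0) best[key] = $2
        }
        END { for (i = 1; i <= n; i++) print order[i], best[order[i]] }
    '
}

# Example (not run here):
#   { ./run.sh; ./run.sh; ./run.sh; } | best_rates
```

Each output line has the form "CPUS KERNEL RATE", which is convenient for plotting bandwidth against the number of cores.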

Heater

Re: Memory Bandwidth of the Pi 4B

Fri Apr 17, 2020 4:18 am

Code: Select all

$ uname -a
Linux jetson-nano 4.9.140-tegra #1 SMP PREEMPT Mon Dec 9 22:47:42 PST 2019 aarch64 aarch64 aarch64 GNU/Linux
dlinano@jetson-nano:~/stream$ gcc -Wall -O3 -fopenmp -o streamomp.10M stream.c 
stream.c: In function ‘mysecond’:
stream.c:424:13: warning: variable ‘i’ set but not used [-Wunused-but-set-variable]
         int i;
             ^
dlinano@jetson-nano:~/stream$ ./run.sh ; ./run.sh ; ./run.sh 
Running on CPUs 0 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 26298 microseconds.
   (= 26298 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            7325.7     0.022007     0.021841     0.022323
Scale:           7328.7     0.022122     0.021832     0.023336
Add:             7270.3     0.033175     0.033011     0.033609
Triad:           7235.5     0.033546     0.033170     0.035344
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 2
Number of Threads counted = 2
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 21574 microseconds.
   (= 21574 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            9638.6     0.016671     0.016600     0.016836
Scale:           9655.4     0.016620     0.016571     0.016750
Add:             8156.6     0.029746     0.029424     0.031254
Triad:           8116.3     0.029667     0.029570     0.029740
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1,2 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 3
Number of Threads counted = 3
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 18723 microseconds.
   (= 18723 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           10090.2     0.015904     0.015857     0.015953
Scale:          10138.7     0.015853     0.015781     0.016147
Add:             8667.7     0.027854     0.027689     0.028733
Triad:           8655.2     0.027884     0.027729     0.028096
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1,2,3 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 19763 microseconds.
   (= 19763 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           10894.8     0.014873     0.014686     0.015956
Scale:          10967.1     0.014635     0.014589     0.014680
Add:             8884.3     0.027369     0.027014     0.029035
Triad:           8889.6     0.027352     0.026998     0.029121
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 25851 microseconds.
   (= 25851 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            7376.3     0.021878     0.021691     0.022711
Scale:           7407.4     0.021736     0.021600     0.022116
Add:             7318.4     0.033250     0.032794     0.034416
Triad:           7279.8     0.033131     0.032968     0.033497
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 2
Number of Threads counted = 2
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 21566 microseconds.
   (= 21566 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            9705.2     0.016618     0.016486     0.016930
Scale:           9705.2     0.016717     0.016486     0.017850
Add:             8165.8     0.029591     0.029391     0.030675
Triad:           8174.7     0.029525     0.029359     0.030020
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1,2 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 3
Number of Threads counted = 3
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 19167 microseconds.
   (= 19167 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           10102.3     0.015917     0.015838     0.015980
Scale:          10137.4     0.015915     0.015783     0.016778
Add:             8674.6     0.027888     0.027667     0.028522
Triad:           8648.9     0.027829     0.027749     0.027946
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1,2,3 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 20539 microseconds.
   (= 20539 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           10903.7     0.014744     0.014674     0.014797
Scale:          10987.4     0.014682     0.014562     0.015124
Add:             8879.3     0.027226     0.027029     0.028272
Triad:           8900.1     0.027262     0.026966     0.028233
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 25820 microseconds.
   (= 25820 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            7399.8     0.021951     0.021622     0.023257
Scale:           7399.5     0.021755     0.021623     0.022091
Add:             7316.6     0.033276     0.032802     0.035188
Triad:           7282.3     0.033113     0.032957     0.033262
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 2
Number of Threads counted = 2
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 21511 microseconds.
   (= 21511 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            9672.2     0.016599     0.016542     0.016700
Scale:           9683.5     0.016725     0.016523     0.017887
Add:             8190.5     0.029505     0.029302     0.030065
Triad:           8148.6     0.029553     0.029453     0.029721
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1,2 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 3
Number of Threads counted = 3
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 18769 microseconds.
   (= 18769 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           10084.4     0.015916     0.015866     0.016007
Scale:          10140.0     0.015832     0.015779     0.016003
Add:             8655.6     0.027944     0.027728     0.028976
Triad:           8657.6     0.027838     0.027721     0.027922
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1,2,3 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 18656 microseconds.
   (= 18656 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           10894.8     0.014779     0.014686     0.014848
Scale:          11030.6     0.014603     0.014505     0.014679
Add:             8886.6     0.027161     0.027007     0.027520
Triad:           8865.9     0.027461     0.027070     0.029704
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
dlinano@jetson-nano:~/stream$ 
Mo' cores, mo' better.
Memory in C++ is a leaky abstraction .

ejolson
Posts: 6320
Joined: Tue Mar 18, 2014 11:47 am

Re: Memory Bandwidth of the Pi 4B

Fri Apr 17, 2020 4:54 am

Heater wrote:
Fri Apr 17, 2020 4:18 am
Mo' cores, mo' better.
That data looks good. Here is the graph:

Image

Note the Add curve is mostly obscured by the Triad results. For good measure, I tested a 6-core Xeon E5-1650 v3 and obtained another increasing graph.

Image

I wonder why the Pi computers show decreasing bandwidth as more threads are added. Could a dose of hydroxychloroquine or remdesivir be responsible? At least the Xeon E5 is plateauing. Maybe the Pi 4B would also show an increasing trend with the 64-bit kernel or the 4GB model.
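
For anyone who wants to build a similar graph from the raw logs, here is a minimal sketch of how the numbers could be pulled out. It assumes the OpenMP build of stream.c (so each run prints a "Number of Threads counted" line) and that the output of run.sh is available on standard input. The function name stream_rates is my own invention and is not part of STREAM or of the script used in this thread.

```shell
#!/bin/sh
# Hypothetical helper: reduce concatenated STREAM logs to rows of
#   <threads> <kernel> <best rate in MB/s>
# Reads the log on stdin and writes the rows to stdout.
stream_rates() {
    awk '
        # Remember the thread count reported by the most recent run.
        /^Number of Threads counted/ { threads = $NF }
        # Result rows look like "Copy:   10894.8   0.014873 ..."
        /^(Copy|Scale|Add|Triad):/ {
            kernel = $1
            sub(/:$/, "", kernel)   # drop the trailing colon
            print threads, kernel, $2
        }
    '
}
```

Something like `./run.sh | stream_rates > rates.txt` should then give one row per kernel per thread count, which most plotting tools can read directly. Repeated runs simply produce repeated rows, so averaging or taking the maximum is left to the plotting step.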

Heater
Posts: 17116
Joined: Tue Jul 17, 2012 3:02 pm

Re: Memory Bandwidth of the Pi 4B

Fri Apr 17, 2020 8:04 am

Pi 4, 4GB, Raspbian:

Code: Select all

pi@pi4:~/stream $ uname -a
Linux pi4 4.19.97-v7l+ #1294 SMP Thu Jan 30 13:21:14 GMT 2020 armv7l GNU/Linux
pi@pi4:~/stream $ free -h
              total        used        free      shared  buff/cache   available
Mem:          3.8Gi        89Mi       3.6Gi       8.0Mi       101Mi       3.6Gi
Swap:          99Mi          0B        99Mi
pi@pi4:~/stream $ gcc -Wall -O3 -fopenmp -o streamomp.10M stream.c
stream.c: In function ‘mysecond’:
stream.c:424:13: warning: variable ‘i’ set but not used [-Wunused-but-set-variable]
         int i;
             ^
stream.c: In function ‘checkSTREAMresults’:
stream.c:479:42: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 2 has type ‘unsigned int’ [-Wformat=]
   printf("WEIRD: sizeof(STREAM_TYPE) = %lu\n",sizeof(STREAM_TYPE));
                                        ~~^    ~~~~~~~~~~~~~~~~~~~
                                        %u
pi@pi4:~/stream $ ./run.sh
Running on CPUs 0 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 56854 microseconds.
   (= 56854 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5525.8     0.028980     0.028955     0.029035
Scale:           5552.3     0.028866     0.028817     0.028969
Add:             4883.8     0.049168     0.049142     0.049222
Triad:           4875.8     0.049259     0.049223     0.049300
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 2
Number of Threads counted = 2
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 60302 microseconds.
   (= 60302 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            4479.8     0.037653     0.035716     0.039994
Scale:           3701.9     0.043908     0.043221     0.045058
Add:             3990.2     0.060244     0.060148     0.060330
Triad:           4055.9     0.059286     0.059173     0.059328
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1,2 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 3
Number of Threads counted = 3
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 61237 microseconds.
   (= 61237 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            3394.8     0.049149     0.047131     0.050778
Scale:           3356.3     0.049299     0.047672     0.050476
Add:             3413.4     0.070756     0.070311     0.071169
Triad:           3380.4     0.071438     0.070997     0.072346
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1,2,3 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 58586 microseconds.
   (= 58586 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            3670.1     0.044410     0.043596     0.045814
Scale:           3678.2     0.045179     0.043499     0.046334
Add:             3392.6     0.072850     0.070741     0.074430
Triad:           3342.5     0.075828     0.071802     0.082318
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
pi@pi4:~/stream $
Memory in C++ is a leaky abstraction .
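
The log states that the best time for each kernel is used to compute the reported bandwidth. As a sanity check, the Best Rate column can be recomputed from the Min time column. STREAM counts two arrays of traffic for Copy and Scale and three for Add and Triad, at 8 bytes per element, and reports MB as 10^6 bytes. The sketch below assumes exactly that; the helper name stream_rate is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: recompute a STREAM rate from the printed Min time.
# usage: stream_rate KERNEL ARRAY_ELEMENTS MIN_TIME_SECONDS
stream_rate() {
    awk -v kernel="$1" -v n="$2" -v t="$3" 'BEGIN {
        # Copy and Scale touch 2 arrays, Add and Triad touch 3.
        arrays = (kernel == "Copy" || kernel == "Scale") ? 2 : 3
        # 8 bytes per element, result in MB/s with 1 MB = 10^6 bytes.
        printf "%.1f\n", 1.0e-6 * arrays * 8 * n / t
    }'
}

# Single-thread Pi 4 figures from the 32-bit Raspbian run above:
stream_rate Copy 10000000 0.028955   # prints 5525.8
stream_rate Add  10000000 0.049142   # prints 4883.8
```

Both values match the Best Rate column of that run. Because the Min time is printed rounded to six decimals, a recomputed rate may differ from the logged one in the last digit for other rows.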

Technocolour
Posts: 140
Joined: Thu Jul 04, 2019 6:23 pm

Re: Memory Bandwidth of the Pi 4B

Fri Apr 17, 2020 8:37 am

Thank you for your effort with the data, everyone! :)

It's interesting.

Heater
Posts: 17116
Joined: Tue Jul 17, 2012 3:02 pm

Re: Memory Bandwidth of the Pi 4B

Fri Apr 17, 2020 8:42 am

Pi 4, 4GB, Raspbian 64-bit (nspawn):

Code: Select all

$ uname -a
Linux debian-buster-64 4.19.97-v8+ #1294 SMP PREEMPT Thu Jan 30 13:27:08 GMT 2020 aarch64 GNU/Linux
pi@debian-buster-64:~/stream $ free -h
              total        used        free      shared  buff/cache   available
Mem:          3.7Gi       119Mi       3.1Gi       8.0Mi       503Mi       3.5Gi
Swap:          99Mi          0B        99Mi
pi@debian-buster-64:~/stream $ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 8.3.0-6' --with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-8 --program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --disable-libphobos --enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 8.3.0 (Debian 8.3.0-6)
pi@debian-buster-64:~/stream $ gcc -Wall -O3 -fopenmp -o streamomp.10M stream.c
stream.c: In function ‘mysecond’:
stream.c:424:13: warning: variable ‘i’ set but not used [-Wunused-but-set-variable]
         int i;
             ^
pi@debian-buster-64:~/stream $ ./run.sh ; ./run.sh ; ./run.sh
Running on CPUs 0 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 59629 microseconds.
   (= 59629 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5426.1     0.029606     0.029487     0.030237
Scale:           5506.8     0.029117     0.029055     0.029254
Add:             4736.1     0.050715     0.050675     0.050739
Triad:           4694.3     0.051198     0.051126     0.051331
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 2
Number of Threads counted = 2
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 62619 microseconds.
   (= 62619 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            4219.1     0.039184     0.037923     0.040375
Scale:           3711.7     0.043809     0.043107     0.044489
Add:             3908.9     0.061444     0.061398     0.061506
Triad:           3961.5     0.060630     0.060583     0.060691
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1,2 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 3
Number of Threads counted = 3
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 60485 microseconds.
   (= 60485 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            3540.1     0.045948     0.045197     0.047625
Scale:           3473.5     0.046372     0.046063     0.046612
Add:             3268.0     0.073785     0.073440     0.074125
Triad:           3437.7     0.070172     0.069815     0.070442
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1,2,3 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 59525 microseconds.
   (= 59525 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            3451.6     0.046944     0.046355     0.047843
Scale:           3718.8     0.044152     0.043025     0.045274
Add:             3291.4     0.073877     0.072917     0.075240
Triad:           3253.0     0.074533     0.073779     0.075255
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 59698 microseconds.
   (= 59698 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5417.5     0.029607     0.029534     0.029730
Scale:           5504.7     0.029112     0.029066     0.029197
Add:             4725.9     0.050832     0.050784     0.050911
Triad:           4679.5     0.051316     0.051288     0.051430
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 2
Number of Threads counted = 2
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 62061 microseconds.
   (= 62061 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            4147.2     0.039571     0.038580     0.041167
Scale:           3716.6     0.043713     0.043050     0.044059
Add:             3895.3     0.061698     0.061613     0.061753
Triad:           3936.0     0.061027     0.060975     0.061079
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1,2 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 3
Number of Threads counted = 3
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 60931 microseconds.
   (= 60931 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            3362.8     0.049671     0.047580     0.052043
Scale:           3370.5     0.047724     0.047471     0.048038
Add:             3143.8     0.076699     0.076341     0.077099
Triad:           3333.7     0.072607     0.071992     0.073065
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1,2,3 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 61228 microseconds.
   (= 61228 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            3331.0     0.048920     0.048033     0.049906
Scale:           3645.6     0.044595     0.043888     0.045386
Add:             3213.6     0.075035     0.074682     0.075698
Triad:           3204.3     0.075369     0.074900     0.076134
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 59629 microseconds.
   (= 59629 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5413.8     0.029589     0.029554     0.029631
Scale:           5510.8     0.029088     0.029034     0.029151
Add:             4728.4     0.050821     0.050757     0.050915
Triad:           4683.4     0.051375     0.051245     0.052019
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 2
Number of Threads counted = 2
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 62449 microseconds.
   (= 62449 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            3856.4     0.042674     0.041490     0.044648
Scale:           3708.2     0.043976     0.043148     0.044730
Add:             3843.2     0.062549     0.062448     0.062685
Triad:           3853.2     0.062347     0.062286     0.062415
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1,2 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 3
Number of Threads counted = 3
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 60369 microseconds.
   (= 60369 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            3438.7     0.048310     0.046529     0.050204
Scale:           3444.9     0.046806     0.046445     0.047192
Add:             3135.5     0.076741     0.076542     0.077033
Triad:           3322.3     0.072500     0.072239     0.072907
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Running on CPUs 0,1,2,3 ...
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 60132 microseconds.
   (= 60132 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            3558.9     0.047131     0.044958     0.050026
Scale:           3521.2     0.046606     0.045439     0.047983
Add:             3447.1     0.072796     0.069624     0.076257
Triad:           3439.9     0.072690     0.069769     0.074759
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
pi@debian-buster-64:~/stream $
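For reference, the "Best Rate" column follows directly from the "Min time" column. The sketch below redoes that arithmetic for the single-core run above, assuming the byte counts STREAM uses (two arrays touched for Copy and Scale, three for Add and Triad). The function and variable names are illustrative and are not taken from stream.c.

```python
# Arrays touched per kernel, as counted by STREAM:
# Copy and Scale read one array and write one; Add and Triad read two and write one.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def best_rate_mb_s(kernel, n_elements, min_time_s, bytes_per_element=8):
    """Bandwidth in MB/s (1 MB = 10**6 bytes) from the best kernel time."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * bytes_per_element * n_elements
    return 1.0e-6 * bytes_moved / min_time_s


# Min times copied from one of the "Running on CPUs 0" runs above.
single_core_min_times = [
    ("Copy", 0.029534),
    ("Scale", 0.029066),
    ("Add", 0.050784),
    ("Triad", 0.051288),
]

for kernel, t in single_core_min_times:
    print(f"{kernel:6s} {best_rate_mb_s(kernel, 10_000_000, t):8.1f} MB/s")
```

The printed values should agree with the reported rates to within the rounding of the six-digit times.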
Memory in C++ is a leaky abstraction .

Heater
Posts: 17116
Joined: Tue Jul 17, 2012 3:02 pm

Re: Memory Bandwidth of the Pi 4B

Fri Apr 17, 2020 8:53 am

So what is going on here?

This must be the first time ever I have seen performance drop when adding cores.

How come the Pi does that but no other machine tested?

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 27721
Joined: Sat Jul 30, 2011 7:41 pm

Re: Memory Bandwidth of the Pi 4B

Fri Apr 17, 2020 10:15 am

Contention? More cores trying to access memory that has already reached its maximum bandwidth will cause contention. ("Trashing" in CPU terms). Just a guess.
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed.
I've been saying "Mucho" to my Spanish friend a lot more lately. It means a lot to him.

jahboater
Posts: 6499
Joined: Wed Feb 04, 2015 6:38 pm
Location: Wonderful West Dorset

Re: Memory Bandwidth of the Pi 4B

Fri Apr 17, 2020 10:33 am

I have just compared a 2GB Pi4 with a 4GB Pi4 and the results are the same (well within the margin of error).
Same config, same compiler, same kernel.
Pi4 8GB and Pi4 4GB running Raspberry Pi OS 64-bit

Heater
Posts: 17116
Joined: Tue Jul 17, 2012 3:02 pm

Re: Memory Bandwidth of the Pi 4B

Fri Apr 17, 2020 10:36 am

I thought "contention" as well.

But then I could not make any sense out of it. If the job is memory-bandwidth limited, I'd expect the throughput to go up a bit with cores and then saturate as the memory bandwidth is consumed, as seen in the graphs for the other machines and as Amdahl's Law suggests.

But then we have the caches to think about. Perhaps I would expect a teeny weeny shared cache to be "thrashed". What is the cache layout and sizes on the Pi 4? Do the cores share all cache?

Or is it just that scheduling those cores has a massive overhead in this case for some reason?
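
The expectation described above can be written as a simple saturation model. This is only a sketch: the function name is made up for illustration and the per-core and peak figures in the loop are hypothetical round numbers, not measurements.

```python
def expected_aggregate(n_cores, per_core_mb_s, peak_mb_s):
    """Idealised bandwidth-limited scaling: linear growth until the memory bus saturates."""
    return min(n_cores * per_core_mb_s, peak_mb_s)


# Hypothetical figures: each core can pull 2000 MB/s, the bus tops out at 5400 MB/s.
for n in range(1, 5):
    print(n, expected_aggregate(n, 2000.0, 5400.0))
```

Under this model the aggregate rate never decreases as cores are added, which is why the falling curves on the Pi 4B look so odd.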
Last edited by Heater on Fri Apr 17, 2020 11:42 am, edited 2 times in total.

Heater
Posts: 17116
Joined: Tue Jul 17, 2012 3:02 pm

Re: Memory Bandwidth of the Pi 4B

Fri Apr 17, 2020 10:37 am

jahboater wrote:
Fri Apr 17, 2020 10:33 am
I have just compared a 2GB Pi4 with a 4GB Pi4 and the results are the same (well within the margin of error).
Same config, same compiler, same kernel.
That is in agreement with my results above.

Technocolour
Posts: 140
Joined: Thu Jul 04, 2019 6:23 pm

Re: Memory Bandwidth of the Pi 4B

Fri Apr 17, 2020 10:58 am

Heater wrote:
Fri Apr 17, 2020 10:36 am
I thought "contention" as well.

But then I could not make any sense out of it. If the job is memory-bandwidth limited, I'd expect the throughput to go up a bit with cores and then saturate as the memory bandwidth is consumed, as seen in the graphs for the other machines and as Amdahl's Law suggests.

But then we have the caches to think about. Perhaps I would expect a teeny weeny shared cache to be "thrashed". What is the cache layout and sizes on the Pi 4? Do the cores share all cache?

Or is it just that scheduling those cores has a massive overhead in this case for some reason?
It's 1MB of unified L2 iirc, L1 should be standard for the A72 (I'm guessing the wiki link is correct)?

https://en.wikipedia.org/wiki/ARM_Cortex-A72

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 5708
Joined: Wed Aug 17, 2011 7:41 pm
Location: Cambridge

Re: Memory Bandwidth of the Pi 4B

Fri Apr 17, 2020 1:24 pm

SDRAM is complicated.
Pages (rows) have to be opened before using and that is a multi-cycle operation.
You can have a small number of pages open at once, and accessing an open page is much quicker.
I think pages may be 4K or 8K in size.

Running too many threads of a memory test may cause thrashing, where new requests are closing pages that are still wanted.

There is also the concept of banks where switching between banks on consecutive accesses may be cheaper than using the same bank.
Banks are selected through some higher-order address lines, so what matters is the physical memory address.
That is largely hidden under Linux, where you just see virtual memory addresses unrelated to the physical memory.

We did find in early testing that some benchmark results after boot were bimodal - purely dependent on where in physical memory the buffers were allocated. Interestingly, the longer Linux was running, the more blurred these results became as the virtual memory became more scattered in the physical address map.
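
The page-thrashing effect described above can be illustrated with a toy model. This is a deliberately simplified sketch and not a model of the Pi 4B memory controller: it assumes a fixed number of rows can be open at once with least-recently-used closing, a 4 KiB row (the guess mentioned above), 64-byte accesses, and streams that take turns in strict round-robin order. Banks and address mapping are ignored, and all names and parameters are hypothetical.

```python
from collections import OrderedDict


def row_hit_rate(num_streams, open_rows, row_bytes=4096, line_bytes=64,
                 lines_per_stream=1024):
    """Fraction of accesses that land in a row that is already open."""
    open_set = OrderedDict()  # open rows, ordered from least to most recently used
    hits = 0
    total = 0
    for i in range(lines_per_stream):
        for s in range(num_streams):
            # Each stream walks sequentially through its own region of memory,
            # so a row is identified by (stream, row index within that region).
            row = (s, (i * line_bytes) // row_bytes)
            total += 1
            if row in open_set:
                hits += 1
                open_set.move_to_end(row)  # mark as most recently used
            else:
                if len(open_set) >= open_rows:
                    open_set.popitem(last=False)  # close the least recently used row
                open_set[row] = None
    return hits / total


for streams in (2, 4, 5, 8):
    print(streams, row_hit_rate(streams, open_rows=4))
```

In this model the hit rate stays high while the number of concurrent streams fits in the open rows and collapses once it does not. Note that a single STREAM thread already generates two or three streams (one per array), so four threads generate eight to twelve.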

ejolson
Posts: 6320
Joined: Tue Mar 18, 2014 11:47 am

Re: Memory Bandwidth of the Pi 4B

Fri Apr 17, 2020 7:23 pm

jamesh wrote:
Fri Apr 17, 2020 10:15 am
Contention? More cores trying to access memory that has already reached its maximum bandwidth will cause contention. ("Trashing" in CPU terms). Just a guess.
I thought trashing was what happened after shorting out the GPIO pins.

I made a plot which combines the recently posted Raspbian Pi 4B 4G results (only one run apparently) and the 64-bit nspawn environment (best out of the posted three runs in each category) on the same graph.

Image

All the curves are decreasing, but unlike my Pi 4B 2G results, they plateau between 3 and 4 processor cores. As I'm still running the original firmware on the 2G model, I wonder if this reflects a firmware-related change in memory timings. Does anyone know whether any such changes were made in the firmware? Alternatively, since it's been running for 26 days, maybe physical memory is fragmented on my Pi.
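
For anyone wanting to make a similar plot, here is a minimal sketch of how the posted logs could be reduced to the best rate per core count. It assumes the log format shown earlier in the thread ("Running on CPUs ..." followed by the STREAM table). The function names are hypothetical and this is not the script actually used for the graphs.

```python
import re

RUN_RE = re.compile(r"^Running on CPUs ([\d,]+) \.\.\.")
RATE_RE = re.compile(r"^(Copy|Scale|Add|Triad):\s+([\d.]+)")


def parse_stream_log(text):
    """Return {(cores, kernel): [best rate in MB/s for each run found]}."""
    results = {}
    cores = None
    for line in text.splitlines():
        m = RUN_RE.match(line)
        if m:
            # The number of cores is the length of the CPU list, e.g. "0,1,2" -> 3.
            cores = len(m.group(1).split(","))
            continue
        m = RATE_RE.match(line)
        if m and cores is not None:
            results.setdefault((cores, m.group(1)), []).append(float(m.group(2)))
    return results


def best_of_runs(results):
    """Keep the highest rate seen for each (cores, kernel) pair."""
    return {key: max(rates) for key, rates in results.items()}
```

Feeding the whole terminal log to parse_stream_log and then best_of_runs gives one number per kernel and core count, ready for plotting.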

Is the schmoo stuff really gone?

Here are stream results for the 2B version 1.1 with the ARMv7 CPU.

Image

Except for the copy operation, all the other kernels show an increasing trend. This may explain why, when the Pi 2B first came out, I was so impressed with how well parallel codes scaled on it. More information is at

viewtopic.php?f=33&t=102743

I've also updated the table in the original post with the single-core timings. Does anyone with a version 1.2 issue of the Pi 2B want to post results?

ejolson
Posts: 6320
Joined: Tue Mar 18, 2014 11:47 am

Re: Memory Bandwidth of the Pi 4B

Sun Apr 19, 2020 3:17 am

I went to the dog house to consult with the lead developer of FidoBasic about how many rows of memory the RAM subsystem in the Raspberry Pi could have open at one time, but the canine coder only growled and barked something about social distancing and Corona beer. Thinking it better not to drink alone I knocked once more, but the door would not open. However, upon returning home I had an idea: I sent Fido a Zoom invitation on the Raspberry Pi.

https://www.raspberrypi.org/blog/workin ... pberry-pi/

Although the sound quality was satisfactory, the barking was mostly incoherent. In the end I understood that due to a lack of schmoo stuff and paper products, the only way to overclock memory on the Pi 4B was to underclock the CPU and then speed the time component of the whole space-time continuum up by a factor of two. It seemed worth a try, so I edited the config.txt file to set

arm_freq=750

and rebooted. The results were

Image

Note for simplicity (and fear of the good doctor) that I did not attempt to change the flow of time itself and have only reported the weird stuff which happens when the CPU is set to 750 MHz.

Thinking I might have misunderstood the dog developer, I tried again using a factor of 4 by setting

arm_freq=375
arm_freq_min=375

and obtained

Image

At last, here is a result in which the aggregate memory bandwidth of the Pi 4B does not decrease substantially as more cores are added! Now I need an excuse to borrow that sonic screwdriver.
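
When repeating this experiment it is worth confirming that the underclock actually took effect before running stream. A minimal sketch, assuming the standard Linux cpufreq sysfs file (which reports kHz) is present; the helper names are made up for illustration.

```python
from pathlib import Path

# Standard Linux cpufreq sysfs file; the value is reported in kHz.
CUR_FREQ = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq"


def current_mhz(path=CUR_FREQ):
    """Return the CPU frequency in MHz as reported by the cpufreq driver."""
    khz = int(Path(path).read_text().strip())
    return khz / 1000


def matches_setting(arm_freq_mhz, path=CUR_FREQ, tolerance_mhz=1.0):
    """True if the reported clock is within tolerance of the config.txt value."""
    return abs(current_mhz(path) - arm_freq_mhz) <= tolerance_mhz
```

For example, matches_setting(375) would be the check for the second configuration above. The reading is best taken while the CPU is busy, since the governor may report a lower idle clock if arm_freq_min is not also set.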

ejolson
Posts: 6320
Joined: Tue Mar 18, 2014 11:47 am

Re: Memory Bandwidth of the Pi 4B

Tue Apr 21, 2020 1:16 am

Here is another memory bandwidth versus number of cores graph, this time for the Ryzen 7 Pro 1700, a first-generation Ryzen processor released March 2017. To get more accurate timings I increased the memory used for the test tenfold with

$ gcc -DSTREAM_ARRAY_SIZE=100000000 \
-O3 -mtune=native -march=native -fopenmp \
-o streamomp.100M stream.c

and used a similar script as before to obtain the graph

Image

Note that two tests were run: one using only one hardware thread per core (solid line) and the other using all hardware threads (dashed line). Except for the case of using only one core with two threads, higher bandwidth resulted when using only one thread per core. The Copy test shows a 4B-like decrease in bandwidth as more cores are added, with a slight increase again when moving from 4 to 5 cores. This suggests that internally the cores are clustered in two groups of four. This could probably be verified by looking at the marketing literature. The Scale, Add and Triad results plateau early, but again have a slight bump between 4 and 5 cores, hopefully not due to relaxing the shelter-at-home quarantine.
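
The script itself is not reproduced here. A minimal sketch of such a sweep, assuming taskset from util-linux for pinning and the OpenMP OMP_NUM_THREADS variable for the thread count, might look like the following. The function names are hypothetical, and the banner line merely imitates the "Running on CPUs ..." lines in the logs.

```python
import os
import subprocess


def cpu_list(n):
    """CPU list for the first n logical CPUs, e.g. 3 -> "0,1,2"."""
    return ",".join(str(c) for c in range(n))


def sweep_commands(binary, max_cores):
    """Build (extra environment, argv) pairs for 1..max_cores threads."""
    commands = []
    for n in range(1, max_cores + 1):
        env = {"OMP_NUM_THREADS": str(n)}
        argv = ["taskset", "-c", cpu_list(n), binary]
        commands.append((env, argv))
    return commands


def run_sweep(binary, max_cores):
    """Run the benchmark once per core count, pinned to CPUs 0..n-1."""
    for env, argv in sweep_commands(binary, max_cores):
        print(f"Running on CPUs {argv[2]} ...", flush=True)
        subprocess.run(argv, env={**os.environ, **env}, check=True)
```

Calling run_sweep("./streamomp.100M", 8) would step through one to eight threads. On a processor with SMT, which logical CPU numbers share a physical core depends on the system, so the one-thread-per-core case needs a CPU list chosen accordingly rather than simply 0..n-1.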

Do the numbers for the Pi 4B 4GB change if the stream program is compiled as above to use more memory? I can't test this using the 2GB model here because the three arrays would need about 2.2 GiB in total, which is more than it has.
Last edited by ejolson on Tue Apr 21, 2020 3:30 am, edited 1 time in total.
