jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 23346
Joined: Sat Jul 30, 2011 7:41 pm

Re: A Pi Pie Chart

Tue Jun 25, 2019 4:49 pm

ejolson wrote:
Tue Jun 25, 2019 4:36 pm
jamesh wrote:
Tue Jun 25, 2019 4:18 pm
ejolson wrote:
Tue Jun 25, 2019 3:40 pm
Thanks for posting. Your results are similar, though perhaps slightly faster, compared to the graphs that James uploaded which I converted to portable network graphics:
Here "My Computer" refers to the new Raspberry Pi 4B.

Compared to the original Pi B, the Pi 4B is 26.8474 times faster. That's about double the performance of the 3B+ overall; however, I find it surprising that the merge sort timings are actually slower than the 3B+. I wonder if this result is related to the compiler version or an optimization setting. It would be nice to find a set of compiler flags for which the merge-sort timings were faster.
From Eben when I showed him the results for the merge, "Could be expensive line moves between L1s, but I suspect it's actually measuring the cost of forking processes in LPAE."

Which is why some of the other Pie charts were comparing LPAE kernels on the Pi3B+.
My understanding is that the task parallel constructs in modern OpenMP implementations fork a pool of threads at the beginning of the run (which isn't measured by the timing routines) and then use either work stealing or some sort of grand central dispatch to assign parcels of work to the threads in the pool. Maybe the cost of Linux thread synchronization primitives goes up when LPAE is enabled; however, it is strange that the serial version also runs slower.

I wonder if this is a gcc version 8.x compiler regression. Have you tried any compiler flags to remedy the situation?
No, I have not looked at any of this, just did the charts with the default compiler on the pi itself. There's probably some mileage in using the latest compilers.
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed. Here's an example...
"My grief counseller just died, luckily, he was so good, I didn't care."

ejolson
Posts: 3407
Joined: Tue Mar 18, 2014 11:47 am

Re: A Pi Pie Chart

Thu Jun 27, 2019 9:39 pm

jahboater wrote:
Thu Jun 27, 2019 6:41 pm
ejolson wrote:
Thu Jun 27, 2019 5:47 pm
Would you mind compiling the pichart program downloadable from a link on the first post of this thread and reporting the output (both OpenMP and serial) running on the Pi 4B with the new 9.1 compiler?
Here it is (4GB version if it makes any difference).

Code:

[email protected]:~/pichart-30 $ ./pichart-openmp 
pichart -- Raspberry Pi Performance OPENMP version 30

Prime Sieve          P=14630843 Workers=4 Sec=0.514499 Mops=1815.99
Merge Sort           N=16777216 Workers=8 Sec=1.07414 Mops=374.861
Fourier Transform    N=4194304 Workers=8 Sec=1.77694 Mflops=259.645
Lorenz 96            N=32768 K=16384 Workers=4 Sec=0.598412 Mflops=5382.96

My Computer has Raspberry Pi ratio=27.7325
Making pie charts...done.
[email protected]:~/pichart-30 $ ./pichart-serial 
pichart -- Raspberry Pi Performance Serial version 30

Prime Sieve          P=14630843 Workers=2 Sec=2.0762 Mops=450.018
Merge Sort           N=16777216 Workers=2 Sec=3.94588 Mops=102.044
Fourier Transform    N=4194304 Workers=2 Sec=2.95565 Mflops=156.099
Lorenz 96            N=32768 K=16384 Workers=1 Sec=2.1619 Mflops=1490

My Computer has Raspberry Pi ratio=9.027
Making pie charts...done.
[email protected]:~/pichart-30 $ 
[email protected]:~/pichart-30 $ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/arm-linux-gnueabihf/9.1.0/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../configure --enable-languages=c,d,c++,fortran --with-cpu=cortex-a72 --with-fpu=neon-fp-armv8 --with-float=hard --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
Thread model: posix
gcc version 9.1.0 (GCC) 
[email protected]:~/pichart-30 $ 
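As a sanity check on how to read these lines, the Merge Sort and Fourier Transform figures above are consistent with counting N·log2(N) and 5·N·log2(N) operations per run respectively. This is inferred from the printed numbers only, not taken from the pichart source, so the two formulas below are an assumption and the function names are made up for illustration.

```python
import math

def merge_mops(n, sec):
    """Assumed metric: N*log2(N) operations per sort, in millions per second."""
    return n * math.log2(n) / sec / 1e6

def fourier_mflops(n, sec):
    """Assumed metric: 5*N*log2(N) flops per transform, in millions per second."""
    return 5 * n * math.log2(n) / sec / 1e6

# N and Sec values copied from the gcc 9.1 runs on the Pi 4B shown above.
print("merge sort, OpenMP:", round(merge_mops(16777216, 1.07414), 3))
print("merge sort, serial:", round(merge_mops(16777216, 3.94588), 3))
print("Fourier, OpenMP:   ", round(fourier_mflops(4194304, 1.77694), 3))
```

If the assumption holds, the printed values should land very close to the Mops and Mflops columns in the transcript, which makes it easy to recompute a score from a timing alone.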
A summary of the merge sort timings for the Pi 4B and 3B+ is

Code:

Merge Sort (larger is better)
         Serial    OpenMP
Pi 3B+  114.622   362.529   gcc 6.4
Pi 4B   102.044   374.861   gcc 9.1
Edit: The above Pi 4B timings were obtained with an overclock setting of 1600 MHz.

While the OpenMP speeds now show the Pi 4B to be about 3.4% faster than the Pi 3B+ on the parallel merge sort, the single-core speeds for the serial version are still slower.
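For anyone who wants to redo the comparison with their own numbers, here is a minimal sketch of the percentage calculation. The scores are copied from the summary table; the helper name `percent_change` is only illustrative.

```python
# Merge sort throughput in Mops, copied from the summary table above.
merge_sort = {
    "Pi 3B+ (gcc 6.4)": {"serial": 114.622, "openmp": 362.529},
    "Pi 4B (gcc 9.1)": {"serial": 102.044, "openmp": 374.861},
}

def percent_change(new, old):
    """Relative change of new versus old, in percent."""
    return (new / old - 1.0) * 100.0

old = merge_sort["Pi 3B+ (gcc 6.4)"]
new = merge_sort["Pi 4B (gcc 9.1)"]
for mode in ("serial", "openmp"):
    print(f"{mode}: {percent_change(new[mode], old[mode]):+.1f}%")
```

A positive value means the Pi 4B is faster. Keep in mind that the two rows also differ in compiler version and clock setting, so the percentages do not isolate the hardware.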

Is it possible you compiled with -mtune=native -march=native as those flags are in the Makefile and that they reset the --with-cpu and --with-fpu settings to some randomly wrong thing?

Have you verified that no throttling occurred?
Last edited by ejolson on Fri Jun 28, 2019 9:23 pm, edited 1 time in total.

jahboater
Posts: 4601
Joined: Wed Feb 04, 2015 6:38 pm

Re: A Pi Pie Chart

Fri Jun 28, 2019 9:09 am

ejolson wrote:
Thu Jun 27, 2019 9:39 pm
Is it possible you compiled with -mtune=native -march=native as those flags are in the Makefile and that they reset the --with-cpu and --with-fpu settings to some randomly wrong thing?
Yes, I used your Makefile as-is. Recent versions of GCC get the "native" types right.
To check I recompiled with

-mcpu=cortex-a72 -mtune=cortex-a72 -mfpu=neon-fp-armv8

and compared the binary with "cmp", and they were identical.
ejolson wrote:
Thu Jun 27, 2019 9:39 pm
Have you verified that no throttling occurred?
Of course - using "vcgencmd get_throttled".

Code:

[email protected]:~/pichart-30 $ make
gcc -std=gnu99 -O3 -mcpu=cortex-a72 -mtune=cortex-a72 -mfpu=neon-fp-armv8 -Wall -o pichart-serial pichart.c util.c sieve.c merge.c fourier.c lorenz.c -lm
gcc -std=gnu99 -O3 -mcpu=cortex-a72 -mtune=cortex-a72 -mfpu=neon-fp-armv8 -Wall -fopenmp -o pichart-openmp pichart.c util.c sieve.c merge.c fourier.c lorenz.c -lm
[email protected]:~/pichart-30 $ cmp pichart-openmp xopenmp 
[email protected]:~/pichart-30 $ 
[email protected]:~/pichart-30 $ ./pichart-openmp 
pichart -- Raspberry Pi Performance OPENMP version 30

Prime Sieve          P=14630843 Workers=4 Sec=0.515054 Mops=1814.04
Merge Sort           N=16777216 Workers=8 Sec=1.07105 Mops=375.942
Fourier Transform    N=4194304 Workers=8 Sec=1.79254 Mflops=257.386
Lorenz 96            N=32768 K=16384 Workers=4 Sec=0.598754 Mflops=5379.88

My Computer has Raspberry Pi ratio=27.6805
Making pie charts...done.
[email protected]:~/pichart-30 $ ./pichart-serial 
pichart -- Raspberry Pi Performance Serial version 30

Prime Sieve          P=14630843 Workers=2 Sec=2.07626 Mops=450.005
Merge Sort           N=16777216 Workers=2 Sec=3.94452 Mops=102.079
Fourier Transform    N=4194304 Workers=2 Sec=2.92829 Mflops=157.557
Lorenz 96            N=32768 K=16384 Workers=1 Sec=2.16188 Mflops=1490.01

My Computer has Raspberry Pi ratio=9.04874
Making pie charts...done.
[email protected]:~/pichart-30 $ 
[email protected]:~/pichart-30 $ vcgencmd get_throttled
throttled=0x0
[email protected]:~/pichart-30 $ 
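The value printed by `vcgencmd get_throttled` is a bit field, so `0x0` means that no under-voltage, frequency capping or throttling has been flagged since boot. A minimal sketch of a decoder follows; the bit meanings are those given in the Raspberry Pi documentation for this command, while the function name and the second sample value are hypothetical.

```python
# Bit positions as documented for `vcgencmd get_throttled`.
FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}

def decode_throttled(line):
    """Parse a line such as 'throttled=0x0' and list the flags that are set."""
    value = int(line.strip().split("=", 1)[1], 16)
    return [name for bit, name in sorted(FLAGS.items()) if value & (1 << bit)]

print(decode_throttled("throttled=0x0"))      # reading from the run above
print(decode_throttled("throttled=0x50005"))  # hypothetical unhealthy reading
```

An empty list corresponds to the clean result shown in the transcript. The upper bits are sticky, so checking once after the benchmark is enough to catch throttling that happened during the run.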
Aha, one last thing. It doesn't solve the merge-sort anomaly, but I realized there were still some changes in config.txt including a small overclock. Resetting config.txt to stock settings just lowers the speeds proportionately:-

Code:

[email protected]:~/pichart-30 $ ./pichart-openmp 
pichart -- Raspberry Pi Performance OPENMP version 30

Prime Sieve          P=14630843 Workers=4 Sec=0.548289 Mops=1704.08
Merge Sort           N=16777216 Workers=8 Sec=1.13798 Mops=353.833
Fourier Transform    N=4194304 Workers=8 Sec=1.73828 Mflops=265.42
Lorenz 96            N=32768 K=16384 Workers=4 Sec=0.628486 Mflops=5125.37

My Computer has Raspberry Pi ratio=26.7226
Making pie charts...done.
[email protected]:~/pichart-30 $ ./pichart-serial 
pichart -- Raspberry Pi Performance Serial version 30

Prime Sieve          P=14630843 Workers=1 Sec=2.21255 Mops=422.285
Merge Sort           N=16777216 Workers=2 Sec=4.20747 Mops=95.6995
Fourier Transform    N=4194304 Workers=2 Sec=2.98104 Mflops=154.769
Lorenz 96            N=32768 K=16384 Workers=1 Sec=2.30599 Mflops=1396.89

My Computer has Raspberry Pi ratio=8.58486
Making pie charts...done.
[email protected]:~/pichart-30 $ vcgencmd get_throttled
throttled=0x0
[email protected]:~/pichart-30 $ vcgencmd get_config int
arm_freq=1500
audio_pwm_mode=514
config_hdmi_boost=5
core_freq=500
core_freq_min=250
disable_commandline_tags=2
disable_l2cache=1
disable_splash=1
display_hdmi_rotate=-1
display_lcd_rotate=-1
enable_gic=1
force_eeprom_read=1
force_pwm_open=1
framebuffer_depth=16
framebuffer_ignore_alpha=1
framebuffer_swap=1
gpu_freq=500
gpu_freq_min=500
init_uart_clock=0x2dc6c00
lcd_framerate=60
max_framebuffers=1
pause_burst_frames=1
program_serial_random=1
hdmi_force_cec_address:0=65535
hdmi_force_cec_address:1=65535
hdmi_pixel_freq_limit:0=0x11e1a300
hdmi_pixel_freq_limit:1=0x11e1a300
[email protected]:~/pichart-30 $ 
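To check the "proportionately" claim, the serial scores of the two runs can be divided and compared with the clock ratio. This is a rough sketch: the 1600 MHz figure is an assumption taken from the edit note earlier in the thread, and the stock value is the `arm_freq=1500` line in the config dump.

```python
# Serial scores copied from the two runs above (Mops or Mflops).
overclocked = {"Prime Sieve": 450.005, "Merge Sort": 102.079,
               "Fourier Transform": 157.557, "Lorenz 96": 1490.01}
stock = {"Prime Sieve": 422.285, "Merge Sort": 95.6995,
         "Fourier Transform": 154.769, "Lorenz 96": 1396.89}

# Assumption: overclock of 1600 MHz versus the stock arm_freq of 1500 MHz.
clock_ratio = 1600 / 1500

def score_ratios(fast, slow):
    """Per-test ratio of the overclocked score to the stock score."""
    return {name: fast[name] / slow[name] for name in fast}

for name, ratio in score_ratios(overclocked, stock).items():
    print(f"{name:18s} score x{ratio:.3f}   clock x{clock_ratio:.3f}")
```

With these numbers three of the four serial tests follow the clock closely, while the Fourier transform moves much less. One possible reading is that the Fourier test is limited by something other than core frequency, but the thread does not establish that.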

hvz
Posts: 7
Joined: Thu Jun 27, 2019 3:06 pm

Re: A Pi Pie Chart

Fri Jun 28, 2019 10:52 am

Here are my Pi 3B vs Pi 4 numbers. I used the same binary, that I compiled with gcc 6.3 on an older image (because I am having huge performance issues with Raspbian Buster and I wanted to investigate those). For some reason, these performance issues don't show up at all in this pichart-test - but they do in software that I wrote (an old 2017 Raspbian image is more than twice as fast as Buster on that, no idea why, if anyone has an idea: https://www.raspberrypi.org/forums/view ... 8&t=243859 ).

So anyway, here are my numbers for the Pi 3B vs Pi 4:
Prime Sieve: 483 / 1411
Merge Sort: 301 / 540
Fourier: 149 / 258
Lorenz 96: 905 / 4610

Ah. And there's the cause of the MergeSort problem: It broke in gcc.
Here are the same numbers for the Pi 4, but using gcc 8.3 that's on the Buster image:
Prime Sieve: 1690
Merge sort: 328
Fourier: 253
Lorenz 96: 5747

ejolson
Posts: 3407
Joined: Tue Mar 18, 2014 11:47 am

Re: A Pi Pie Chart

Fri Jun 28, 2019 6:21 pm

hvz wrote:
Fri Jun 28, 2019 10:52 am
So anyway, here are my numbers for the Pi 3B vs Pi 4:
Prime Sieve: 483 / 1411
Merge Sort: 301 / 540
Fourier: 149 / 258
Lorenz 96: 905 / 4610

Ah. And there's the cause of the MergeSort problem: It broke in gcc.
Here are the same numbers for the Pi 4, but using gcc 8.3 that's on the Buster image:
Prime Sieve: 1690
Merge sort: 328
Fourier: 253
Lorenz 96: 5747
Thanks for running those tests. The results make me more confident in the engineering behind the Pi 4B and the Cortex-A72. I'm sorry it didn't help with your problem.

The fact that merge sort compiled with gcc version 8.3 has only 60% of the performance of the same code compiled with gcc version 6.3 makes one imagine too much time has been spent optimizing 64-bit at the expense of 32-bit targets. Maybe the regression is due to the mitigation of Spectre-like side-channel vulnerabilities that might leak information about the numbers being sorted when the test is running. Maybe in-kernel side-channel information leakage mitigations are responsible for the slowdown you are experiencing.

It's not all bad. Even though the performance of merge sort decreased, the performance of prime sieve and the Lorenz 96 simulation increased. Therefore the final Pi ratio didn't decrease as much as might be expected. In summary
  • The Pi 4B Pi ratio is 27.4 with gcc version 6.3.
  • The Pi 4B Pi ratio is 26.6 with gcc version 8.3.
  • The Pi 4B Pi ratio is 26.7 with gcc version 9.1.
  • The Pi 4B Pi ratio is 30.6 using the best compiler for each test.
I wonder what the Pi ratio would be with clang LLVM.
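The per-test picture behind that summary can be sketched as follows, using only the Pi 4 scores hvz reported for gcc 6.3 and gcc 8.3. The sketch does not recompute the Pi ratio itself, since the formula pichart uses for it is not given in this thread; the helper name is illustrative.

```python
# Pi 4 scores reported by hvz above (Mops or Mflops, larger is better).
scores = {
    "gcc 6.3": {"Prime Sieve": 1411, "Merge Sort": 540,
                "Fourier": 258, "Lorenz 96": 4610},
    "gcc 8.3": {"Prime Sieve": 1690, "Merge Sort": 328,
                "Fourier": 253, "Lorenz 96": 5747},
}

def best_compiler(scores):
    """For each test, return the compiler with the highest score."""
    tests = next(iter(scores.values())).keys()
    return {t: max(scores, key=lambda c: scores[c][t]) for t in tests}

for test, winner in best_compiler(scores).items():
    ratio = scores["gcc 8.3"][test] / scores["gcc 6.3"][test]
    print(f"{test:12s} 8.3/6.3 = {ratio:.2f}   best: {winner}")
```

The merge sort ratio comes out near 0.6, matching the 60% figure above, while prime sieve and Lorenz 96 favour the newer compiler.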

hvz
Posts: 7
Joined: Thu Jun 27, 2019 3:06 pm

Re: A Pi Pie Chart

Fri Jun 28, 2019 9:50 pm

Also interesting: Has anyone done a 64 bit test? It's too bad that Raspbian is still 32 bit, but there are other images that are 64 bits. On my Pi 3B I saw (on one specific program, and I think we were using different gcc versions too) a 10% increase in performance vs 32 bit.

I'll do some tests next week.

(Btw the other problem is solved, was a sound card speed issue in the older image, it apparently ran at a lower sample rate than selected).

hvz
Posts: 7
Joined: Thu Jun 27, 2019 3:06 pm

Re: A Pi Pie Chart

Mon Jul 01, 2019 3:57 pm

hvz wrote:
Fri Jun 28, 2019 10:52 am
Here are my Pi 3B vs Pi 4 numbers. I used the same binary, that I compiled with gcc 6.3 on an older image (because I am having huge performance issues with Raspbian Buster and I wanted to investigate those). For some reason, these performance issues don't show up at all in this pichart-test - but they do in software that I wrote (an old 2017 Raspbian image is more than twice as fast as Buster on that, no idea why, if anyone has an idea: https://www.raspberrypi.org/forums/view ... 8&t=243859 ).

So anyway, here are my numbers for the Pi 3B vs Pi 4:
Prime Sieve: 483 / 1411
Merge Sort: 301 / 540
Fourier: 149 / 258
Lorenz 96: 905 / 4610

Ah. And there's the cause of the MergeSort problem: It broke in gcc.
Here are the same numbers for the Pi 4, but using gcc 8.3 that's on the Buster image:
Prime Sieve: 1690
Merge sort: 328
Fourier: 253
Lorenz 96: 5747
More numbers, this time with Ubuntu Mate 64 bit, Pi 3B (because there's no Pi 4 version yet). And unfortunately with gcc 7.4, because that's the one that's delivered with Ubuntu Mate...

Prime Sieve: 632 (30% faster than 32 bit, note that gcc 8.3 was already 19% faster than 6.3 so it could be that, leaving about 9% difference assuming that 7.4 was already this fast)
Merge Sort: 262 (Can't really compare this one, looks like it already broke in gcc 7.4)
Fourier: 167 (12% faster than 32 bit)
Lorenz 96: 1292 (42% faster than 32 bit, but gcc 8.3 was 25% faster than 6.3, leaving about 13% difference assuming that 7.4 was already this fast).

My very very unreliable estimate would be that 64 bit is between 10 and 15% faster than 32 bit. Which matches values that I've read elsewhere, and values that I've seen in my own software (compiled with the same compiler version in both 32 and 64 bit). It would be helpful to use the same gcc versions (and I could, I have gcc 8.2 running at both 32 and 64 bit on some other Pi's), but that's too much effort for now. I'm more interested in how it affects my own software than on how it affects a benchmark.
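The estimate above can be reproduced roughly as follows. It rests on the same assumption made in the post, namely that gcc 7.4 already had the speedups seen between gcc 6.3 and gcc 8.3 on the Pi 4, so the result is no more reliable than that assumption. Merge sort is left out for the reason given above, and the function name is illustrative.

```python
# Pi 3B scores from the posts above: 32-bit with gcc 6.3, 64-bit with gcc 7.4.
pi3b_32 = {"Prime Sieve": 483, "Fourier": 149, "Lorenz 96": 905}
pi3b_64 = {"Prime Sieve": 632, "Fourier": 167, "Lorenz 96": 1292}
# Pi 4 scores with gcc 6.3 and gcc 8.3, used to estimate the compiler effect.
pi4_gcc63 = {"Prime Sieve": 1411, "Fourier": 258, "Lorenz 96": 4610}
pi4_gcc83 = {"Prime Sieve": 1690, "Fourier": 253, "Lorenz 96": 5747}

def residual_gain(test):
    """64-bit gain left after dividing out the assumed compiler gain."""
    total = pi3b_64[test] / pi3b_32[test]
    # As in the post, no correction is applied where gcc 8.3 was not faster.
    compiler = max(1.0, pi4_gcc83[test] / pi4_gcc63[test])
    return total / compiler

for test in pi3b_32:
    print(f"{test:12s} residual 64-bit gain: {(residual_gain(test) - 1) * 100:.0f}%")
```

Dividing the ratios instead of subtracting percentages gives slightly different figures from the ones quoted in the list, but the residual gains still fall in or near the 10 to 15% range.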

ejolson
Posts: 3407
Joined: Tue Mar 18, 2014 11:47 am

Re: A Pi Pie Chart

Wed Jul 03, 2019 3:37 am

hvz wrote:
Mon Jul 01, 2019 3:57 pm
My very very unreliable estimate would be that 64 bit is between 10 and 15% faster than 32 bit. Which matches values that I've read elsewhere, and values that I've seen in my own software (compiled with the same compiler version in both 32 and 64 bit). It would be helpful to use the same gcc versions (and I could, I have gcc 8.2 running at both 32 and 64 bit on some other Pi's), but that's too much effort for now. I'm more interested in how it affects my own software than on how it affects a benchmark.
Thanks for the report. It's good to know that running pichart using 64-bit ARM doesn't make the surprising difference it does with sysbench. In a way I share your sentiment about only being interested in how fast 64-bit affects the software you wrote yourself. The only difference is that pichart is my software.

I'm somewhat disappointed that switching to 64-bit didn't solve the performance problems with newer compilers and merge sort. From this post it looks like the Pi 4B will soon run a 64-bit version of Gentoo Linux. I wonder if there is anything that can be done with the current C code to make merge sort run faster with the newer versions of gcc.

Gavinmc42
Posts: 3622
Joined: Wed Aug 28, 2013 3:31 am

Re: A Pi Pie Chart

Thu Jul 04, 2019 2:15 am

Sakaki has also done a dual 32/64 nspawn version of Raspbian that works on the 3B+.
Not sure if that would show a difference between 32-bit and 64-bit.
https://github.com/sakaki-/raspbian-nspawn-64

Will be interesting once Gentoo64 is compiled for A72 to compare that against the A53 code on a Pi4.
How do we tune OSes for A72 cores? Going to need benchmarks.

Wonder how the Pi4 now compares against those other SBCs.
I'm dancing on Rainbows.
Raspberries are not Apples or Oranges

bensimmo
Posts: 4152
Joined: Sun Dec 28, 2014 3:02 pm
Location: East Yorkshire

Re: A Pi Pie Chart

Mon Jul 15, 2019 3:31 pm

Quick first test at the desktop with temps being displayed in the top corner.

(This is a slow fan blowing down-ish over the naked Pi.
It is also a quick overclock test. Worked first time and has been looping a WebGL2 aquarium for over an hour, room temp probably 24ishC)

1.75GHz ARM / 600MHz GPU / +.4 V iirc
Temp. went to touch 60C at end of Sieve with OpenMP
No throttling occurred.

Buster Raspbian as of today.
GCC 8.3.0-6+rpi1

OpenMP v30 (Mops, Buster as of today)
PS= 1972
MS= 385
FT= 280
Lz= 5258
PiRatio = 28.9

Serial v30
PS= 462
MS= 102
FT= 150
Lz= 1628
PiRatio = 9.2

ejolson
Posts: 3407
Joined: Tue Mar 18, 2014 11:47 am

Re: A Pi Pie Chart

Wed Jul 17, 2019 6:32 am

bensimmo wrote:
Mon Jul 15, 2019 3:31 pm
Quick first test at the desktop with temps being displayed in the top corner.

(This is a slow fan blowing down-ish over the naked Pi.
It is also a quick overclock test. Worked first time and has been looping a WebGL2 aquarium for over an hour, room temp probably 24ishC)

1.75GHz ARM / 600MHz GPU / +.4 V iirc
Temp. went to touch 60C at end of Sieve with OpenMP
No throttling occurred.

Buster Raspbian as of today.
GCC 8.3.0-6+rpi1

OpenMP v30 (Mops, Buster as of today)
PS= 1972
MS= 385
FT= 280
Lz= 5258
PiRatio = 28.9

Serial v30
PS= 462
MS= 102
FT= 150
Lz= 1628
PiRatio = 9.2
These seem to be the best scores yet for a single run of pichart on a Raspberry Pi 4B. For the record, there is a nice comparison to a Rock64 here which shows that single-board computer has roughly the same performance as the Pi 3B+. Since both machines use a quad-core Cortex-A53 processor this is not unexpected. However, I still find it interesting.

bensimmo
Posts: 4152
Joined: Sun Dec 28, 2014 3:02 pm
Location: East Yorkshire

Re: A Pi Pie Chart

Wed Jul 17, 2019 7:14 am

Just note the overclock, though. I assume most boards will run at it; it is the one Tom's Hardware used in their announcement, and I just copied it straight off. It is a 17% frequency increase.
It'll probably go faster, as it is not sweating with just a gentle fan blowing over it.

No doubt we'll see these faster speeds if they build in active cooling solutions in, say, a + board in the future. The room is there in the SoC.

ejolson
Posts: 3407
Joined: Tue Mar 18, 2014 11:47 am

Re: A Pi Pie Chart

Tue Sep 17, 2019 1:39 am

ejolson wrote:
Wed Jul 03, 2019 3:37 am
I'm somewhat disappointed that switching to 64-bit didn't solve the performance problems with newer compilers and merge sort.
I clicked with my mouse and created an Amazon EC2 instance with 4 Graviton processors running 64-bit ARM Ubuntu Linux. For reference this is an a1.xlarge instance with 8GB RAM that costs US$ 0.102 per hour on demand to run. I ran the Pi pie-chart program using gcc versions 6.5, 7.4 and 8.3. With

CFLAGS=-march=native -mtune=native -O3 -ffast-math

the best results were obtained with version 8.3 as follows:

Code:

$ ./pichart-openmp ; # Amazon EC2 a1.xlarge instance
pichart -- Raspberry Pi Performance OPENMP version 30

Prime Sieve          P=14630843 Workers=4 Sec=0.395686 Mops=2361.29
Merge Sort           N=16777216 Workers=8 Sec=0.725249 Mops=555.193
Fourier Transform    N=4194304 Workers=8 Sec=0.650543 Mflops=709.213
Lorenz 96            N=32768 K=16384 Workers=4 Sec=0.298065 Mflops=10807.1

My Computer has Raspberry Pi ratio=49.9934
$ ./pichart-serial ; # Amazon EC2 a1.xlarge instance
pichart -- Raspberry Pi Performance Serial version 30

Prime Sieve          P=14630843 Workers=2 Sec=1.57494 Mops=593.246
Merge Sort           N=16777216 Workers=2 Sec=2.85815 Mops=140.879
Fourier Transform    N=4194304 Workers=2 Sec=1.95764 Mflops=235.679
Lorenz 96            N=32768 K=16384 Workers=1 Sec=1.11644 Mflops=2885.27

My Computer has Raspberry Pi ratio=13.7101
Making pie charts...done.
It should be noted that there were no regressions in the merge sort between different versions of the compiler as was seen with the Pi 4B. For comparison the 4B reaches a Pi ratio of about 28. This can be increased to 30 by cherry picking the best compiler for each of the four benchmark calculations.

In order to figure out how much per hour to charge my little brother for using the Pi 4B, I calculated 0.102 * 28 / 50 to obtain US$ 0.05712 per hour.
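The same arithmetic as a small sketch, in case the rates or ratios change. It assumes that price scales linearly with the Pi ratio, which is the premise of the calculation above; the function and parameter names are made up for illustration.

```python
def equivalent_hourly_rate(reference_rate, reference_ratio, my_ratio):
    """Scale an hourly price by relative Pi ratio (linear-pricing assumption)."""
    return reference_rate * my_ratio / reference_ratio

# a1.xlarge: US$ 0.102 per hour at a Pi ratio of about 50; Pi 4B: about 28.
rate = equivalent_hourly_rate(reference_rate=0.102, reference_ratio=50, my_ratio=28)
print(f"US$ {rate:.5f} per hour")  # US$ 0.05712 per hour
```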
