Re: A Pi Pie Chart

Posted: Tue Jun 25, 2019 4:49 pm
by jamesh
ejolson wrote:
Tue Jun 25, 2019 4:36 pm
jamesh wrote:
Tue Jun 25, 2019 4:18 pm
ejolson wrote:
Tue Jun 25, 2019 3:40 pm
Thanks for posting. Your results are similar, though perhaps slightly faster, compared to the graphs that James uploaded which I converted to portable network graphics:
Here "My Computer" refers to the new Raspberry Pi 4B.

Compared to the original Pi B the Pi 4B is 26.8474 times faster. That's about double the performance of the 3B+ overall; however, I find it surprising that the merge sort timings are actually slower than on the 3B+. I wonder if this result is related to the compiler version or an optimization setting. It would be nice to find a set of compiler flags for which the merge-sort timings were faster.
From Eben when I showed him the results for the merge, "Could be expensive line moves between L1s, but I suspect it's actually measuring the cost of forking processes in LPAE."

Which is why some of the other Pie charts were comparing LPAE kernels on the Pi3B+.
My understanding is that the task parallel constructs in modern OpenMP implementations fork a pool of threads at the beginning of the run (which isn't measured by the timing routines) and then use either work stealing or some sort of grand central dispatch to assign parcels of work to the threads in the pool. Maybe the cost of Linux thread synchronization primitives goes up when LPAE is enabled; however, it is strange that the serial version also runs slower.

I wonder if this is a gcc version 8.x compiler regression. Have you tried any compiler flags to remedy the situation?
No, I have not looked at any of this, just did the charts with the default compiler on the pi itself. There's probably some mileage in using the latest compilers.

Re: A Pi Pie Chart

Posted: Thu Jun 27, 2019 9:39 pm
by ejolson
jahboater wrote:
Thu Jun 27, 2019 6:41 pm
ejolson wrote:
Thu Jun 27, 2019 5:47 pm
Would you mind compiling the pichart program downloadable from a link on the first post of this thread and reporting the output (both OpenMP and serial) running on the Pi 4B with the new 9.1 compiler?
Here it is (4GB version if it makes any difference).

Code:

pi@pi4:~/pichart-30 $ ./pichart-openmp 
pichart -- Raspberry Pi Performance OPENMP version 30

Prime Sieve          P=14630843 Workers=4 Sec=0.514499 Mops=1815.99
Merge Sort           N=16777216 Workers=8 Sec=1.07414 Mops=374.861
Fourier Transform    N=4194304 Workers=8 Sec=1.77694 Mflops=259.645
Lorenz 96            N=32768 K=16384 Workers=4 Sec=0.598412 Mflops=5382.96

My Computer has Raspberry Pi ratio=27.7325
Making pie charts...done.
pi@pi4:~/pichart-30 $ ./pichart-serial 
pichart -- Raspberry Pi Performance Serial version 30

Prime Sieve          P=14630843 Workers=2 Sec=2.0762 Mops=450.018
Merge Sort           N=16777216 Workers=2 Sec=3.94588 Mops=102.044
Fourier Transform    N=4194304 Workers=2 Sec=2.95565 Mflops=156.099
Lorenz 96            N=32768 K=16384 Workers=1 Sec=2.1619 Mflops=1490

My Computer has Raspberry Pi ratio=9.027
Making pie charts...done.
pi@pi4:~/pichart-30 $ 
pi@pi4:~/pichart-30 $ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/arm-linux-gnueabihf/9.1.0/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../configure --enable-languages=c,d,c++,fortran --with-cpu=cortex-a72 --with-fpu=neon-fp-armv8 --with-float=hard --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
Thread model: posix
gcc version 9.1.0 (GCC) 
pi@pi4:~/pichart-30 $ 
A summary of the merge sort timings for the Pi 4B and 3B+ is

Code:

Merge Sort (larger is better)
         Serial    OpenMP
Pi 3B+  114.622   362.529   gcc 6.4
Pi 4B   102.044   374.861   gcc 9.1
Edit: The above Pi 4B timings were obtained with an overclock setting of 1600 MHz.

While the OpenMP speeds now show the Pi 4B to be about 3.4% faster than the Pi 3B+ on the parallel merge sort, the single-core speeds for the serial version are still slower.
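For anyone who wants to check the comparison, the relative changes can be recomputed from the Mops figures in the summary table. This is only a minimal sketch; the helper name `percent_change` is made up for illustration and is not part of pichart.

```python
def percent_change(new, old):
    """Percent by which `new` is faster (positive) or slower (negative) than `old`."""
    return (new / old - 1.0) * 100.0

# Merge sort Mops from the summary table above.
pi3_serial, pi3_openmp = 114.622, 362.529   # Pi 3B+, gcc 6.4
pi4_serial, pi4_openmp = 102.044, 374.861   # Pi 4B, gcc 9.1 (overclocked to 1600 MHz)

openmp_change = percent_change(pi4_openmp, pi3_openmp)
serial_change = percent_change(pi4_serial, pi3_serial)

print(f"OpenMP merge sort, Pi 4B vs Pi 3B+: {openmp_change:+.1f}%")
print(f"Serial merge sort, Pi 4B vs Pi 3B+: {serial_change:+.1f}%")
```

A positive number means the Pi 4B is faster and a negative number means it is slower.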

Is it possible you compiled with -mtune=native -march=native as those flags are in the Makefile and that they reset the --with-cpu and --with-fpu settings to some randomly wrong thing?

Have you verified that no throttling occurred?

Re: A Pi Pie Chart

Posted: Fri Jun 28, 2019 9:09 am
by jahboater
ejolson wrote:
Thu Jun 27, 2019 9:39 pm
Is it possible you compiled with -mtune=native -march=native as those flags are in the Makefile and that they reset the --with-cpu and --with-fpu settings to some randomly wrong thing?
Yes, I used your Makefile as-is. Recent versions of GCC get the "native" types right.
To check I recompiled with

-mcpu=cortex-a72 -mtune=cortex-a72 -mfpu=neon-fp-armv8

and compared the binary with "cmp", and they were identical.
ejolson wrote:
Thu Jun 27, 2019 9:39 pm
Have you verified that no throttling occurred?
Of course - using "vcgencmd get_throttled".

Code:

pi@pi4:~/pichart-30 $ make
gcc -std=gnu99 -O3 -mcpu=cortex-a72 -mtune=cortex-a72 -mfpu=neon-fp-armv8 -Wall -o pichart-serial pichart.c util.c sieve.c merge.c fourier.c lorenz.c -lm
gcc -std=gnu99 -O3 -mcpu=cortex-a72 -mtune=cortex-a72 -mfpu=neon-fp-armv8 -Wall -fopenmp -o pichart-openmp pichart.c util.c sieve.c merge.c fourier.c lorenz.c -lm
pi@pi4:~/pichart-30 $ cmp pichart-openmp xopenmp 
pi@pi4:~/pichart-30 $ 
pi@pi4:~/pichart-30 $ ./pichart-openmp 
pichart -- Raspberry Pi Performance OPENMP version 30

Prime Sieve          P=14630843 Workers=4 Sec=0.515054 Mops=1814.04
Merge Sort           N=16777216 Workers=8 Sec=1.07105 Mops=375.942
Fourier Transform    N=4194304 Workers=8 Sec=1.79254 Mflops=257.386
Lorenz 96            N=32768 K=16384 Workers=4 Sec=0.598754 Mflops=5379.88

My Computer has Raspberry Pi ratio=27.6805
Making pie charts...done.
pi@pi4:~/pichart-30 $ ./pichart-serial 
pichart -- Raspberry Pi Performance Serial version 30

Prime Sieve          P=14630843 Workers=2 Sec=2.07626 Mops=450.005
Merge Sort           N=16777216 Workers=2 Sec=3.94452 Mops=102.079
Fourier Transform    N=4194304 Workers=2 Sec=2.92829 Mflops=157.557
Lorenz 96            N=32768 K=16384 Workers=1 Sec=2.16188 Mflops=1490.01

My Computer has Raspberry Pi ratio=9.04874
Making pie charts...done.
pi@pi4:~/pichart-30 $ 
pi@pi4:~/pichart-30 $ vcgencmd get_throttled
throttled=0x0
pi@pi4:~/pichart-30 $ 
Aha, one last thing. It doesn't solve the merge-sort anomaly, but I realized there were still some changes in config.txt including a small overclock. Resetting config.txt to stock settings just lowers the speeds roughly proportionately:

Code:

pi@pi4:~/pichart-30 $ ./pichart-openmp 
pichart -- Raspberry Pi Performance OPENMP version 30

Prime Sieve          P=14630843 Workers=4 Sec=0.548289 Mops=1704.08
Merge Sort           N=16777216 Workers=8 Sec=1.13798 Mops=353.833
Fourier Transform    N=4194304 Workers=8 Sec=1.73828 Mflops=265.42
Lorenz 96            N=32768 K=16384 Workers=4 Sec=0.628486 Mflops=5125.37

My Computer has Raspberry Pi ratio=26.7226
Making pie charts...done.
pi@pi4:~/pichart-30 $ ./pichart-serial 
pichart -- Raspberry Pi Performance Serial version 30

Prime Sieve          P=14630843 Workers=1 Sec=2.21255 Mops=422.285
Merge Sort           N=16777216 Workers=2 Sec=4.20747 Mops=95.6995
Fourier Transform    N=4194304 Workers=2 Sec=2.98104 Mflops=154.769
Lorenz 96            N=32768 K=16384 Workers=1 Sec=2.30599 Mflops=1396.89

My Computer has Raspberry Pi ratio=8.58486
Making pie charts...done.
pi@pi4:~/pichart-30 $ vcgencmd get_throttled
throttled=0x0
pi@pi4:~/pichart-30 $ vcgencmd get_config int
arm_freq=1500
audio_pwm_mode=514
config_hdmi_boost=5
core_freq=500
core_freq_min=250
disable_commandline_tags=2
disable_l2cache=1
disable_splash=1
display_hdmi_rotate=-1
display_lcd_rotate=-1
enable_gic=1
force_eeprom_read=1
force_pwm_open=1
framebuffer_depth=16
framebuffer_ignore_alpha=1
framebuffer_swap=1
gpu_freq=500
gpu_freq_min=500
init_uart_clock=0x2dc6c00
lcd_framerate=60
max_framebuffers=1
pause_burst_frames=1
program_serial_random=1
hdmi_force_cec_address:0=65535
hdmi_force_cec_address:1=65535
hdmi_pixel_freq_limit:0=0x11e1a300
hdmi_pixel_freq_limit:1=0x11e1a300
pi@pi4:~/pichart-30 $ 

Re: A Pi Pie Chart

Posted: Fri Jun 28, 2019 10:52 am
by hvz
Here are my Pi 3B vs Pi 4 numbers. I used the same binary, that I compiled with gcc 6.3 on an older image (because I am having huge performance issues with Raspbian Buster and I wanted to investigate those). For some reason, these performance issues don't show up at all in this pichart-test - but they do in software that I wrote (an old 2017 Raspbian image is more than twice as fast as Buster on that, no idea why, if anyone has an idea: https://www.raspberrypi.org/forums/view ... 8&t=243859 ).

So anyway, here are my numbers for the Pi 3B vs Pi 4:
Prime Sieve: 483 / 1411
Merge Sort: 301 / 540
Fourier: 149 / 258
Lorenz 96: 905 / 4610

Ah. And there's the cause of the MergeSort problem: It broke in gcc.
Here are the same numbers for the Pi 4, but using gcc 8.3 that's on the Buster image:
Prime Sieve: 1690
Merge sort: 328
Fourier: 253
Lorenz 96: 5747

Re: A Pi Pie Chart

Posted: Fri Jun 28, 2019 6:21 pm
by ejolson
hvz wrote:
Fri Jun 28, 2019 10:52 am
So anyway, here are my numbers for the Pi 3B vs Pi 4:
Prime Sieve: 483 / 1411
Merge Sort: 301 / 540
Fourier: 149 / 258
Lorenz 96: 905 / 4610

Ah. And there's the cause of the MergeSort problem: It broke in gcc.
Here are the same numbers for the Pi 4, but using gcc 8.3 that's on the Buster image:
Prime Sieve: 1690
Merge sort: 328
Fourier: 253
Lorenz 96: 5747
Thanks for running those tests. The results make me more confident in the engineering behind the Pi 4B and the Cortex-A72. I'm sorry it didn't help with your problem.

The fact that merge sort compiled with gcc version 8.3 has only 60% of the performance of the same code compiled with gcc version 6.3 makes one imagine too much time has been spent optimizing 64-bit at the expense of 32-bit targets. Maybe the regression is due to the mitigation of Spectre-like side channel vulnerabilities that might leak information about the numbers being sorted when the test is running. Maybe in-kernel side-channel information leakage mitigations are responsible for the slowdown you are experiencing.

It's not all bad. Even though the performance of merge sort decreased, the performance of prime sieve and the Lorenz 96 simulation increased. Therefore the final Pi ratio didn't decrease as much as might be expected. In summary
  • The Pi 4B Pi ratio is 27.4 with gcc version 6.3.
  • The Pi 4B Pi ratio is 26.6 with gcc version 8.3.
  • The Pi 4B Pi ratio is 26.7 with gcc version 9.1.
  • The Pi 4B Pi ratio is 30.6 using the best compiler for each test.
I wonder what the Pi ratio would be with clang LLVM.
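The per-test effect of the compiler change can be tabulated from hvz's Pi 4 numbers quoted above. The sketch below only divides the gcc 8.3 figures by the gcc 6.3 figures; it does not attempt to reproduce the Pi ratio itself, since that formula is not given in this thread.

```python
# Pi 4 results reported by hvz (Mops or Mflops, larger is better).
gcc63 = {"Prime Sieve": 1411, "Merge Sort": 540, "Fourier": 258, "Lorenz 96": 4610}
gcc83 = {"Prime Sieve": 1690, "Merge Sort": 328, "Fourier": 253, "Lorenz 96": 5747}

# Ratio above 1 means gcc 8.3 produced faster code than gcc 6.3 for that test.
ratios = {test: gcc83[test] / gcc63[test] for test in gcc63}

for test, ratio in ratios.items():
    print(f"{test:12s} gcc 8.3 / gcc 6.3 = {ratio:.2f}")
```

The merge sort ratio is the "60%" figure mentioned above, while the prime sieve and Lorenz 96 ratios come out above 1.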

Re: A Pi Pie Chart

Posted: Fri Jun 28, 2019 9:50 pm
by hvz
Also interesting: Has anyone done a 64 bit test? It's too bad that Raspbian is still 32 bit, but there are other images that are 64 bits. On my Pi 3B I saw (on one specific program, and I think we were using different gcc versions too) a 10% increase in performance vs 32 bit.

I'll do some tests next week.

(Btw the other problem is solved: it was a sound card speed issue in the older image, which apparently ran at a lower sample rate than selected.)

Re: A Pi Pie Chart

Posted: Mon Jul 01, 2019 3:57 pm
by hvz
hvz wrote:
Fri Jun 28, 2019 10:52 am
Here are my Pi 3B vs Pi 4 numbers. I used the same binary, that I compiled with gcc 6.3 on an older image (because I am having huge performance issues with Raspbian Buster and I wanted to investigate those). For some reason, these performance issues don't show up at all in this pichart-test - but they do in software that I wrote (an old 2017 Raspbian image is more than twice as fast as Buster on that, no idea why, if anyone has an idea: https://www.raspberrypi.org/forums/view ... 8&t=243859 ).

So anyway, here are my numbers for the Pi 3B vs Pi 4:
Prime Sieve: 483 / 1411
Merge Sort: 301 / 540
Fourier: 149 / 258
Lorenz 96: 905 / 4610

Ah. And there's the cause of the MergeSort problem: It broke in gcc.
Here are the same numbers for the Pi 4, but using gcc 8.3 that's on the Buster image:
Prime Sieve: 1690
Merge sort: 328
Fourier: 253
Lorenz 96: 5747
More numbers, this time with Ubuntu Mate 64 bit, Pi 3B (because there's no Pi 4 version yet). And unfortunately with gcc 7.4, because that's the one that's delivered with Ubuntu Mate...

Prime Sieve: 632 (30% faster than 32 bit, note that gcc 8.3 was already 19% faster than 6.3 so it could be that, leaving about 9% difference assuming that 7.4 was already this fast)
Merge Sort: 262 (Can't really compare this one, looks like it already broke in gcc 7.4)
Fourier: 167 (12% faster than 32 bit)
Lorenz 96: 1292 (42% faster than 32 bit, but gcc 8.3 was 25% faster than 6.3, leaving about 13% difference assuming that 7.4 was already this fast).

My very very unreliable estimate would be that 64 bit is between 10 and 15% faster than 32 bit. Which matches values that I've read elsewhere, and values that I've seen in my own software (compiled with the same compiler version in both 32 and 64 bit). It would be helpful to use the same gcc versions (and I could, I have gcc 8.2 running at both 32 and 64 bit on some other Pi's), but that's too much effort for now. I'm more interested in how it affects my own software than on how it affects a benchmark.
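The back-of-the-envelope separation of the compiler effect from the 64-bit effect can be written out as below. This is a rough sketch under the same assumption made in the post, namely that gcc 7.4 was already as fast as gcc 8.3 and that the compiler gain measured on the Pi 4 carries over to the Pi 3B; the function name `residual_gain` is only illustrative.

```python
# Pi 3B figures from this thread: 32-bit (gcc 6.3) and 64-bit Ubuntu Mate (gcc 7.4).
pi3b_32bit = {"Prime Sieve": 483, "Fourier": 149, "Lorenz 96": 905}
pi3b_64bit = {"Prime Sieve": 632, "Fourier": 167, "Lorenz 96": 1292}

# Pi 4 figures for gcc 6.3 and gcc 8.3, used as a stand-in for the compiler effect.
pi4_gcc63 = {"Prime Sieve": 1411, "Lorenz 96": 4610}
pi4_gcc83 = {"Prime Sieve": 1690, "Lorenz 96": 5747}

def residual_gain(test):
    """64-bit gain that remains after dividing out the assumed compiler gain."""
    raw = pi3b_64bit[test] / pi3b_32bit[test]
    compiler = pi4_gcc83[test] / pi4_gcc63[test]
    return raw / compiler

for test in ("Prime Sieve", "Lorenz 96"):
    raw = pi3b_64bit[test] / pi3b_32bit[test]
    print(f"{test}: raw {raw:.2f}x, after compiler correction {residual_gain(test):.2f}x")

# No compiler correction is applied to the Fourier transform.
fourier_gain = pi3b_64bit["Fourier"] / pi3b_32bit["Fourier"]
print(f"Fourier: raw {fourier_gain:.2f}x")
```

Dividing the ratios rather than subtracting percentages gives slightly different numbers, but they land in the same 10 to 15% ballpark as the estimate above.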

Re: A Pi Pie Chart

Posted: Wed Jul 03, 2019 3:37 am
by ejolson
hvz wrote:
Mon Jul 01, 2019 3:57 pm
My very very unreliable estimate would be that 64 bit is between 10 and 15% faster than 32 bit. Which matches values that I've read elsewhere, and values that I've seen in my own software (compiled with the same compiler version in both 32 and 64 bit). It would be helpful to use the same gcc versions (and I could, I have gcc 8.2 running at both 32 and 64 bit on some other Pi's), but that's too much effort for now. I'm more interested in how it affects my own software than on how it affects a benchmark.
Thanks for the report. It's good to know that running pichart using 64-bit ARM doesn't make the surprising difference it does with sysbench. In a way I share your sentiment about only being interested in how 64-bit affects the speed of the software you wrote yourself. The only difference is that pichart is my software.

I'm somewhat disappointed that switching to 64-bit didn't solve the performance problems with newer compilers and merge sort. From this post it looks like the Pi 4B will soon run a 64-bit version of Gentoo Linux. I wonder if there is anything that can be done with the current C code to make merge sort run faster with the newer versions of gcc.

Re: A Pi Pie Chart

Posted: Thu Jul 04, 2019 2:15 am
by Gavinmc42
Sakaki has also done a dual 32/64 nspawn version of Raspbian that works on the 3B+.
Not sure if that would show a difference between 32 and 64 bit.
https://github.com/sakaki-/raspbian-nspawn-64

Will be interesting once Gentoo64 is compiled for A72 to compare that against the A53 code on a Pi4.
How to tune OSes for A72 cores? Going to need benchmarks.

Wonder how the Pi4 now compares against those other SBCs.

Re: A Pi Pie Chart

Posted: Mon Jul 15, 2019 3:31 pm
by bensimmo
Quick first test at the desktop with temps being displayed in the top corner.

(This is a slow fan blowing down-ish over the naked Pi.
It is also a quick overclock test. Worked first time and has been looping a WebGL2 aquarium for over an hour, room temp probably 24ishC)

1.75GHz ARM / 600MHz GPU / +.4 V iirc
Temp. went to touch 60C at end of Sieve with OpenMP
No throttling occurred.

Buster Raspbian as of today.
GCC 8.3.0-6+rpi1

OpenMP v30 (Mops, Buster as of today)
PS= 1972
MS= 385
FT= 280
Lz= 5258
PiRatio = 28.9

Serial v30
PS= 462
MS= 102
FT= 150
Lz= 1628
PiRatio = 9.2

Re: A Pi Pie Chart

Posted: Wed Jul 17, 2019 6:32 am
by ejolson
bensimmo wrote:
Mon Jul 15, 2019 3:31 pm
Quick first test at the desktop with temps being displayed in the top corner.

(This is a slow fan blowing down-ish over the naked Pi.
It is also a quick overclock test. Worked first time and has been looping a WebGL2 aquarium for over an hour, room temp probably 24ishC)

1.75GHz ARM / 600MHz GPU / +.4 V iirc
Temp. went to touch 60C at end of Sieve with OpenMP
No throttling occurred.

Buster Raspbian as of today.
GCC 8.3.0-6+rpi1

OpenMP v30 (Mops, Buster as of today)
PS= 1972
MS= 385
FT= 280
Lz= 5258
PiRatio = 28.9

Serial v30
PS= 462
MS= 102
FT= 150
Lz= 1628
PiRatio = 9.2
These seem to be the best scores yet for a single run of pichart on a Raspberry Pi 4B. For the record, there is a nice comparison to a Rock64 here which shows that single-board computer has roughly the same performance as the Pi 3B+. Since both machines use a quad-core Cortex-A53 processor this is not unexpected. However, I still find it interesting.

Re: A Pi Pie Chart

Posted: Wed Jul 17, 2019 7:14 am
by bensimmo
Just note the overclock though. I assume most boards will run at it; it's the one Tom's Hardware used in their announcement and I just copied it straight off. It's a 17% frequency increase.
It'll probably go faster as it's not sweating with just a gentle fan blowing over it.

No doubt we'll see these faster speeds if they build in active cooling solutions in, say, a + board in the future. The room is there in the SoC.

Re: A Pi Pie Chart

Posted: Tue Sep 17, 2019 1:39 am
by ejolson
ejolson wrote:
Wed Jul 03, 2019 3:37 am
I'm somewhat disappointed that switching to 64-bit didn't solve the performance problems with newer compilers and merge sort.
I clicked with my mouse and created an Amazon EC2 instance with 4 Graviton processors running 64-bit ARM Ubuntu Linux. For reference this is an a1.xlarge instance with 8GB RAM that costs US$ 0.102 per hour on demand to run. I ran the Pi pie-chart program using gcc versions 6.5, 7.4 and 8.3. With

CFLAGS=-march=native -mtune=native -O3 -ffast-math

the best results were obtained with version 8.3 as follows:

Code:

$ ./pichart-openmp ; # Amazon EC2 a1.xlarge instance
pichart -- Raspberry Pi Performance OPENMP version 30

Prime Sieve          P=14630843 Workers=4 Sec=0.395686 Mops=2361.29
Merge Sort           N=16777216 Workers=8 Sec=0.725249 Mops=555.193
Fourier Transform    N=4194304 Workers=8 Sec=0.650543 Mflops=709.213
Lorenz 96            N=32768 K=16384 Workers=4 Sec=0.298065 Mflops=10807.1

My Computer has Raspberry Pi ratio=49.9934
$ ./pichart-serial ; # Amazon EC2 a1.xlarge instance
pichart -- Raspberry Pi Performance Serial version 30

Prime Sieve          P=14630843 Workers=2 Sec=1.57494 Mops=593.246
Merge Sort           N=16777216 Workers=2 Sec=2.85815 Mops=140.879
Fourier Transform    N=4194304 Workers=2 Sec=1.95764 Mflops=235.679
Lorenz 96            N=32768 K=16384 Workers=1 Sec=1.11644 Mflops=2885.27

My Computer has Raspberry Pi ratio=13.7101
Making pie charts...done.
It should be noted that there were no regressions in the merge sort between different versions of the compiler as was seen with the Pi 4B. For comparison the 4B reaches a Pi ratio of about 28. This can be increased to 30 by cherry picking the best compiler for each of the four benchmark calculations.

In order to figure out how much per hour to charge my little brother for using the Pi 4B, I calculated 0.102 * 28 / 50 to obtain US$ 0.05712 per hour.
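The pricing rule used here is simply the cloud price scaled by the ratio of the two Pi ratios. A minimal sketch, with the function name `equivalent_hourly_price` invented for illustration and the rounded ratios 28 and 50 taken from the post:

```python
def equivalent_hourly_price(reference_price, reference_ratio, my_ratio):
    """Scale a reference hourly price by relative performance (Pi ratio)."""
    return reference_price * my_ratio / reference_ratio

# a1.xlarge: US$ 0.102 per hour at a Pi ratio of about 50; Pi 4B at about 28.
pi4_price = equivalent_hourly_price(0.102, 50, 28)
print(f"Pi 4B equivalent price: US$ {pi4_price:.5f} per hour")
```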

Re: A Pi Pie Chart

Posted: Thu Oct 17, 2019 5:24 am
by ejolson
To see how far things have progressed since I first started using Linux, I ran the Pi Pie Chart program on a 66MHz 486DX2. Since there was only 32MB of RAM on that machine, I divided the array sizes by 256 for the prime sieve and by 128 for the merge sort and Fourier transforms. Because the machine was slow, I also divided the number of time steps for the Lorenz 96 simulation by 256. Other modifications were made so the program would compile with gcc version 2.7.2.3. The results were

Code:

$ ./pichart-serial 
pichart -- Raspberry Pi Performance Serial version L31

Prime Sieve          P=82025 Workers=2 Sec=2.29954 Mops=1.42527
Merge Sort           N=131072 Workers=1 Sec=1.31166 Mops=1.69878
Fourier Transform    N=32768 Workers=1 Sec=1.47467 Mflops=1.66655
Lorenz 96            N=128 K=16384 Workers=1 Sec=2.69302 Mflops=4.67242

My Computer has Raspberry Pi ratio=0.0585114
Making pie charts...done.
This implies the original Raspberry Pi is about 17 times faster than the 486 PC and the Pi 4B is 478 times faster.
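Both multipliers follow from the reported Pi ratio, since the original Raspberry Pi has a ratio of 1 by definition of the benchmark. The sketch below assumes a Pi 4B ratio of roughly 28, as quoted earlier in the thread; keep in mind that the 486 run used scaled-down problem sizes, so the comparison is only approximate.

```python
ratio_486 = 0.0585114   # Pi ratio reported for the 66MHz 486DX2
ratio_pi1 = 1.0         # the original Raspberry Pi is the reference machine
ratio_pi4 = 28.0        # assumed round figure for the Pi 4B

pi1_vs_486 = ratio_pi1 / ratio_486
pi4_vs_486 = ratio_pi4 / ratio_486

print(f"Original Pi vs 486: about {pi1_vs_486:.1f} times faster")
print(f"Pi 4B vs 486:       about {pi4_vs_486:.0f} times faster")
```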

Re: A Pi Pie Chart

Posted: Mon Feb 10, 2020 5:49 am
by ejolson
I spent a little time polishing the Pi Pie Chart program and have fixed the scalable vector graphics output routine so the resulting pichart.svg file can be viewed in a browser. For good measure I then reran the program on the Raspberry Pi 4B with the CPU governor set to performance as

Code:

# echo performance >/sys/devices/system/cpu/cpufreq/policy0/scaling_governor
Compiling with gcc version 8.3.0 on Raspbian gave

Code:

$ ./pichart-openmp -t"Pi 4B gcc"
pichart -- Raspberry Pi Performance OPENMP version 32

Prime Sieve          P=14630843 Workers=4 Sec=0.548457 Mops=1703.56
Merge Sort           N=16777216 Workers=8 Sec=1.18613 Mops=339.467
Fourier Transform    N=4194304 Workers=4 Sec=1.7113 Mflops=269.605
Lorenz 96            N=32768 K=16384 Workers=4 Sec=0.646306 Mflops=4984.05

The Pi 4B has Raspberry Pi ratio=26.3638
Making pie charts...done.
$ ./pichart-serial -t"Pi 4B gcc"
pichart -- Raspberry Pi Performance Serial version 32

Prime Sieve          P=14630843 Workers=1 Sec=2.34654 Mops=398.172
Merge Sort           N=16777216 Workers=2 Sec=4.58568 Mops=87.8066
Fourier Transform    N=4194304 Workers=1 Sec=3.25404 Mflops=141.785
Lorenz 96            N=32768 K=16384 Workers=1 Sec=2.30612 Mflops=1396.82

The Pi 4B has Raspberry Pi ratio=8.09999
Making pie charts...done.
and then compiling with clang version 9.0.0 on Raspbian gave

Code:

$ ./pichart-openmp -t"Pi 4B clang"
pichart -- Raspberry Pi Performance OPENMP version 32

Prime Sieve          P=14630843 Workers=4 Sec=0.53926 Mops=1732.61
Merge Sort           N=16777216 Workers=4 Sec=0.879777 Mops=457.676
Fourier Transform    N=4194304 Workers=4 Sec=1.7769 Mflops=259.651
Lorenz 96            N=32768 K=16384 Workers=4 Sec=0.692296 Mflops=4652.96

My Computer has Raspberry Pi ratio=27.7803
Making pie charts...done.
$ ./pichart-serial -t"Pi 4B clang"
pichart -- Raspberry Pi Performance Serial version 32

Prime Sieve          P=14630843 Workers=1 Sec=2.12951 Mops=438.753
Merge Sort           N=16777216 Workers=2 Sec=2.93432 Mops=137.222
Fourier Transform    N=4194304 Workers=1 Sec=3.0257 Mflops=152.485
Lorenz 96            N=32768 K=16384 Workers=2 Sec=2.24929 Mflops=1432.11

The Pi 4B clang has Raspberry Pi ratio=9.50832
Making pie charts...done.
Not surprisingly, given how badly the optimizer in recent versions of gcc works on the merge sort algorithm, the clang runs are a bit faster. The reason for running all those tests, however, was to obtain new Pi charts using the updated output routines.

Image

Image

Image

Image

Please let me know if you have any difficulty displaying the above scalable vector graphics images directly in your browser. The current version of the code may be downloaded at

http://fractal.math.unr.edu/~ejolson/pi ... urrent.tgz

Re: A Pi Pie Chart

Posted: Tue Feb 11, 2020 5:53 am
by ejolson
I made a change to the background color of the pie charts to distinguish the parallel performance from the serial and also updated the default systems appearing in the chart to include the clang timings for the Raspberry Pi 4B as a replacement for the 3B.

Here are the results for a 12-core Ryzen 1920X Threadripper:

Code:

$ ./pichart-openmp -t"Ryzen 1920X"
pichart -- Raspberry Pi Performance OPENMP version 33

Prime Sieve          P=14630843 Workers=48 Sec=0.0717278 Mops=13026
Merge Sort           N=16777216 Workers=48 Sec=0.100511 Mops=4006.07
Fourier Transform    N=4194304 Workers=24 Sec=0.0830656 Mflops=5554.33
Lorenz 96            N=32768 K=16384 Workers=48 Sec=0.0653516 Mflops=49290.7

The Ryzen 1920X has Raspberry Pi ratio=306.99
Making pie charts...done.
$ ./pichart-serial -t"Ryzen 1920X"
pichart -- Raspberry Pi Performance Serial version 33

Prime Sieve          P=14630843 Workers=1 Sec=0.762902 Mops=1224.7
Merge Sort           N=16777216 Workers=1 Sec=1.57662 Mops=255.39
Fourier Transform    N=4194304 Workers=1 Sec=0.553161 Mflops=834.068
Lorenz 96            N=32768 K=16384 Workers=1 Sec=0.194706 Mflops=16544

The Ryzen 1920X has Raspberry Pi ratio=40.4726
Making pie charts...done.
The resulting pie charts are

Image

Image

Re: A Pi Pie Chart

Posted: Wed Feb 12, 2020 6:07 am
by ejolson
A friend gave me his old desktop computer, an AMD A6-5400K. I decided to make a pie chart to see whether the Pi 4B could be a desktop replacement. The run

Code:

$ ./pichart-openmp -t"A6-5400K"
pichart -- Raspberry Pi Performance OPENMP version 33

Prime Sieve          P=14630843 Workers=4 Sec=1.18758 Mops=786.749
Merge Sort           N=16777216 Workers=4 Sec=1.33993 Mops=300.504
Fourier Transform    N=4194304 Workers=2 Sec=0.969594 Mflops=475.842
Lorenz 96            N=32768 K=16384 Workers=4 Sec=0.524484 Mflops=6141.71

The A6-5400K has Raspberry Pi ratio=25.6007
Making pie charts...done.
and resulting chart

Image

indicates that the Pi 4B was faster on the integer tests Prime Sieve and Merge Sort but slower on the floating-point tests Fourier Transform and Lorenz 96. Overall, the Pi 4B was slightly faster, quieter and more power efficient. I guess it would make a good desktop replacement, at least in this case.

Re: A Pi Pie Chart

Posted: Sat May 09, 2020 2:23 am
by ejolson
Woohoo! It looks like gcc version 10.1 fixes the regression with the merge sort on the Pi 4B. The new compiler gives

Code:

$ ./pichart-openmp 
pichart -- Raspberry Pi Performance OPENMP version 32

Prime Sieve          P=14630843 Workers=4 Sec=0.552507 Mops=1691.07
Merge Sort           N=16777216 Workers=8 Sec=0.742115 Mops=542.575
Fourier Transform    N=4194304 Workers=8 Sec=1.33157 Mflops=346.487
Lorenz 96            N=32768 K=16384 Workers=4 Sec=0.630267 Mflops=5110.89

My Computer has Raspberry Pi ratio=31.7025
Making pie charts...done.
which shows merge sort being about 61 percent faster with version 10.1.

I wonder whether the new version of gcc creates executables which run that much faster on any other architecture or if the improvements are mostly a Pi thing.

Re: A Pi Pie Chart

Posted: Wed May 27, 2020 6:58 am
by ejolson
The Graviton2 processors reviewed at

https://www.anandtech.com/show/15578/cl ... el-and-amd

are now widely available on the EC2 cloud. I decided to make some Pi pie charts for the 4 processor instance which costs US$ 0.154 per hour. I used Ubuntu 20.04LTS with the gcc 9.3.0 compiler. The single-core results were

Code:

$ ./pichart-serial -t Graviton2
pichart -- Raspberry Pi Performance Serial version 33

Prime Sieve          P=14630843 Workers=1 Sec=1.26827 Mops=736.695
Merge Sort           N=16777216 Workers=2 Sec=2.01905 Mops=199.427
Fourier Transform    N=4194304 Workers=2 Sec=0.597464 Mflops=772.22
Lorenz 96            N=32768 K=16384 Workers=1 Sec=0.518345 Mflops=6214.44

The Graviton2 has Raspberry Pi ratio=25.7304
Making pie charts...done.
with the chart

Image

The parallel results were

Code:

$ ./pichart-openmp -t "Graviton2 (4-core)"
pichart -- Raspberry Pi Performance OPENMP version 33

Prime Sieve          P=14630843 Workers=4 Sec=0.318541 Mops=2933.15
Merge Sort           N=16777216 Workers=8 Sec=0.508454 Mops=791.917
Fourier Transform    N=4194304 Workers=4 Sec=0.155877 Mflops=2959.85
Lorenz 96            N=32768 K=16384 Workers=4 Sec=0.136869 Mflops=23535.2

The Graviton2 (4-core) has Raspberry Pi ratio=100.148
Making pie charts...done.
with the chart

Image

With the recent improvements to compiler technology, the Raspberry Pi 4B now has a Pi ratio of 31.7025 as described in an earlier post. This means the 4-processor Graviton2 instance is about 3.16 times faster than the 4B. Compared to the original Graviton instances, the Graviton2 is about 2 times faster but the price is only about 1.5 times higher.

An updated version of the calculation given in

viewtopic.php?f=63&t=227177&start=100#p1537398

implies the price I can charge my little brother for using the Pi 4B has been reduced from US$ 0.05712 to about US$ 0.04873 per hour.

Woohoo! That's still plenty.
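The figures in this post can be recomputed from the Pi ratios and hourly prices quoted in the thread. This is a small sketch with variable names chosen for illustration; the last digits depend on whether the speedup is rounded to 3.16 before dividing.

```python
# Hourly on-demand price (US$) and OpenMP Pi ratio as reported in this thread.
graviton_price, graviton_ratio = 0.102, 49.9934      # original Graviton, a1.xlarge
graviton2_price, graviton2_ratio = 0.154, 100.148    # 4-processor Graviton2 instance
pi4_ratio = 31.7025                                  # Pi 4B with gcc 10.1

speedup_vs_pi4 = graviton2_ratio / pi4_ratio          # Graviton2 relative to the Pi 4B
generation_speedup = graviton2_ratio / graviton_ratio # Graviton2 relative to Graviton
price_increase = graviton2_price / graviton_price     # price relative to Graviton
pi4_price = graviton2_price / speedup_vs_pi4          # fair hourly price for the Pi 4B

print(f"Graviton2 vs Pi 4B:    {speedup_vs_pi4:.2f}x faster")
print(f"Graviton2 vs Graviton: {generation_speedup:.2f}x faster at {price_increase:.2f}x the price")
print(f"Pi 4B equivalent price: US$ {pi4_price:.5f} per hour")
```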

Re: A Pi Pie Chart

Posted: Wed May 27, 2020 11:54 pm
by ejolson
One of those toy power meters that plug into the wall arrived in the mail.

https://www.amazon.com/Poniie-PN1500-El ... 07VPTN8FZ/

Since the pie chart program hasn't been experiencing feature creep and code bloat for a while, I added a new option that allows one to perform a stress test of specified duration for all or any of the four benchmark problems. The archive available from the download link in the first post has been updated.

I took an average of 50 readings of the Pi 4B using both the volt-ampere and watt settings. The results using the new -w10 option were

Code:

                       VA  Watt    Efficiency
Pi 4B Idle           6.04  3.06    0.00 Mops/W
Prime Sieve         10.51  6.01  281.38 Mops/W
Merge Sort           9.75  5.59   97.06 Mops/W
Fourier Transform    8.80  4.96   69.86 Mflops/W
Lorenz 96           12.38  7.45  686.02 Mflops/W
The above measurements were collected with a fan running. The power to run the fan was not included. No heat or power-related throttling occurred. The Pi had a network cable, USB keyboard and monitor connected. WiFi was turned off. Mains voltage was 120V and I used the official power supply. Note that the first time I performed the test, I didn't run the fan and the system began throttling. While I may have been imagining things, it is possible that hot raspberry pies consume more power just before they throttle.
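The efficiency column appears to be throughput divided by average wall power. The sketch below assumes the throughput figures are those from the gcc 10.1 OpenMP run posted earlier in the thread; that pairing is an inference from the numbers, not something stated in the post. It also prints the ratio of watts to volt-amperes, which is the power factor seen by the meter.

```python
# (throughput, average VA, average watts) per test.
# Throughput values are ASSUMED to come from the gcc 10.1 OpenMP run.
measurements = {
    "Prime Sieve":       (1691.07, 10.51, 6.01),
    "Merge Sort":        (542.575,  9.75, 5.59),
    "Fourier Transform": (346.487,  8.80, 4.96),
    "Lorenz 96":         (5110.89, 12.38, 7.45),
}

# Efficiency is throughput per watt; power factor is real power over apparent power.
efficiency = {name: rate / watts for name, (rate, va, watts) in measurements.items()}
power_factor = {name: watts / va for name, (rate, va, watts) in measurements.items()}

for name in measurements:
    print(f"{name:18s} {efficiency[name]:7.2f} per watt, power factor {power_factor[name]:.2f}")
```

Small differences in the last digit are expected because the table averages were rounded before being used here.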

In all cases power usage remained under 8 watts. I wonder how much electricity the Graviton2 processors consume. For reference, the data I took (by hand) from the readout on the power meter was

Code:

       PSVA   PSW  MSVA   MSW  FTVA   FTW L96VA  L96W
MIN   10.29  5.00  9.49  5.28  8.55  4.68 12.17  7.28
AVG   10.51  6.01  9.75  5.59  8.80  4.96 12.38  7.45
STD    0.15  0.26  0.16  0.17  0.15  0.17  0.14  0.14
MAX   10.77  6.33 10.09  5.93  9.04  5.47 12.62  7.70

DATA  10.46  6.18  9.57  5.83  8.88  5.00 12.49  7.30
      10.53  5.62  9.78  5.72  8.82  4.79 12.60  7.50
      10.65  5.93  9.76  5.45  8.66  4.80 12.60  7.56
      10.36  6.22  9.49  5.78  8.99  5.11 12.31  7.29
      10.74  6.18  9.87  5.62  8.55  4.83 12.50  7.30
      10.56  6.10  9.51  5.50  8.74  4.75 12.24  7.29
      10.61  5.95  9.77  5.44  8.63  5.09 12.32  7.54
      10.36  6.09  9.70  5.85  8.61  4.83 12.41  7.52
      10.76  5.96  9.62  5.61  8.89  5.10 12.17  7.30
      10.38  6.28  9.93  5.41  8.73  4.84 12.56  7.60
      10.58  6.03  9.60  5.38  8.77  4.79 12.21  7.58
      10.56  5.95  9.85  5.63  8.95  5.15 12.37  7.48
      10.77  6.27  9.71  5.45  8.63  4.83 12.38  7.29
      10.52  6.00  9.73  5.44  9.03  5.21 12.18  7.30
      10.36  5.96  9.93  5.68  8.71  4.79 12.57  7.29
      10.69  6.31  9.56  5.44  8.81  4.83 12.17  7.30
      10.31  5.95  9.97  5.28  8.90  5.16 12.41  7.40
      10.59  6.14  9.65  5.63  8.62  4.70 12.36  7.64
      10.43  6.07  9.78  5.51  9.03  4.99 12.21  7.31
      10.40  6.32  9.94  5.33  8.67  5.01 12.58  7.37
      10.64  5.97  9.52  5.84  8.86  4.82 12.20  7.32
      10.29  5.97  9.97  5.53  8.88  5.09 12.46  7.55
      10.65  5.96  9.61  5.93  8.66  4.77 12.28  7.62
      10.41  6.27  9.76  5.50  9.04  5.16 12.54  7.56
      10.47  6.24  9.86  5.58  8.62  4.79 12.22  7.30
      10.60  5.97  9.64  5.73  8.90  5.07 12.53  7.32
      10.31  5.27 10.09  5.52  8.82  4.87 12.31  7.60
      10.67  5.99  9.61  5.45  8.65  5.12 12.34  7.61
      10.35  5.96  9.86  5.42  9.03  5.00 12.51  7.34
      10.49  6.33  9.83  5.60  8.91  4.81 12.21  7.32
      10.57  6.10  9.61  5.80  8.64  4.91 12.55  7.35
      10.33  5.96  9.98  5.66  9.00  5.12 12.28  7.59
      10.73  6.15  9.57  5.57  8.60  4.80 12.37  7.48
      10.35  5.97  9.59  5.45  8.93  5.05 12.47  7.34
      10.54  6.25  9.71  5.50  8.73  5.09 12.22  7.60
      10.51  5.00  9.68  5.49  8.70  4.68 12.62  7.36
      10.31  5.97 10.01  5.81  8.95  5.15 12.25  7.36
      10.72  5.96  9.55  5.39  8.60  4.82 12.43  7.66
      10.34  6.25  9.97  5.73  9.01  4.83 12.46  7.67
      10.58  5.96  9.78  5.48  8.70  5.08 12.23  7.66
      10.48  6.00  9.71  5.48  8.79  4.71 12.62  7.65
      10.35  6.25  9.95  5.51  8.90  5.08 12.24  7.37
      10.70  5.96  9.61  5.83  8.62  5.11 12.49  7.70
      10.32  6.06  9.94  5.76  9.01  4.79 12.40  7.39
      10.62  6.02  9.69  5.69  8.67  5.07 12.21  7.28
      10.44  5.96  9.64  5.83  8.83  5.47 12.44  7.37
      10.41  5.96  9.86  5.77  8.91  4.78 12.18  7.66
      10.66  5.20  9.56  5.33  8.62  5.11 12.55  7.38
      10.33  5.97 10.00  5.79  9.03  5.12 12.20  7.37
      10.65  6.27  9.62  5.38  8.68  5.05 12.36  7.39
The variation in the measurements is likely due, in part, to non-uniform scheduling efficiency for the parallel work units during some phases of the computation and the fact that the stress test itself runs in a loop that repeatedly initializes memory using a single thread and then runs the parallel computation.
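
The summary rows of the table can be recomputed from the raw readings in a few lines of Python. This is only a minimal sketch: the function name `summarize` is made up for illustration, only the first five PSW readings are used for brevity, and it assumes the STD row is a sample standard deviation, which the post does not state.

```python
import statistics

def summarize(readings):
    """Return (min, mean, sample standard deviation, max) of the readings."""
    return (min(readings),
            statistics.mean(readings),
            statistics.stdev(readings),  # assumption: sample (n-1) deviation
            max(readings))

# First five PSW (prime sieve, watts) readings from the table above.
psw_subset = [6.18, 5.62, 5.93, 6.22, 6.18]
lo, avg, std, hi = summarize(psw_subset)
print(f"MIN {lo:.2f}  AVG {avg:.2f}  STD {std:.2f}  MAX {hi:.2f}")
```

Feeding in a full 50-row column instead of the subset should come close to the corresponding MIN, AVG, STD and MAX entries, provided the assumption about the deviation holds.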

Re: A Pi Pie Chart

Posted: Thu May 28, 2020 8:02 pm
by ejolson
This year the pea plants in the back yard have already grown taller than at any time last summer. The Earth is healing.

While relating this promising fact to the canine coder, there was an interruption. At first it sounded like barking but eventually I understood that since energy efficiency is part of the new normal, subsequent research should focus on developing a PET-on-a-chip that can be used to scale legacy 8-bit code to sub-milliwatt levels. I wonder if a POC could be advantageous in other ways.

Along different lines, I plugged a Ryzen Threadripper 1920X into the power meter and measured an idle power usage of 90 watts with the powersave governor and 100 watts with performance. Note that this measurement included eight spinning hard disks, that crazy Radeon VII GPU, multiple fans and an NVMe SSD. I then compared the efficiency with the Pi 4B.

Since the Pi 4B has 4 cores, I divided the 1920X into three cpusets each consisting of 4 cores (8 threads), simultaneously ran three copies of the Pi chart program and totaled the output. The results were

Code: Select all

                   Mops/MFlops   Watt    Efficiency
Ryzen 1920X Idle         0.00   100.0    0.00 Mops/W
Prime Sieve          14852.56   243.1   61.09 Mops/W
Merge Sort            4349.82   246.2   17.67 Mops/W
Fourier Transform     7014.84   192.8   36.38 Mflops/W
Lorenz 96           107614.5    265.7  405.02 Mflops/W
Not surprisingly, the efficiency figures are lower than those of the Pi. Then again, since the Pi was running from an SD card, it didn't have to power eight hard disks or a heavy graphics card.
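
For anyone wanting to check the arithmetic, each efficiency entry is the total throughput of the three simultaneous copies divided by the average wattage. A minimal sketch for the prime sieve row follows. The function name `efficiency` is hypothetical, the per-copy Mops are the ones listed later in this post, and the wattage is the rounded PSW average, so the last digit may differ slightly from the table.

```python
def efficiency(throughputs, watts):
    """Total throughput of simultaneous runs divided by measured power draw."""
    return sum(throughputs) / watts

# Prime sieve Mops from the three simultaneous copies on the 1920X
# and the average wattage (PSW AVG) read from the meter.
prime_sieve_mops = [4961.56, 4967.05, 4923.95]
prime_sieve_watts = 243.1

total = sum(prime_sieve_mops)
eff = efficiency(prime_sieve_mops, prime_sieve_watts)
print(f"total {total:.2f} Mops, efficiency {eff:.2f} Mops/W")
```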

This brings up an interesting point about using desktop computers for distributed computing projects such as BOINC: the cost of running all the peripherals decreases the energy efficiency. On the other hand, if the desktop is being used for other things anyway, then the power taken by the peripherals is amortized and only the additional power used for BOINC needs to be counted.

In particular, by subtracting the 100 watt idle power one can find the efficiency obtained when scavenging cycles from an already running machine. While better, it astonishingly tells the same story.

Code: Select all

                   Mops/MFlops  Extra    Scavenging
Ryzen 1920X                     Watts    Efficiency
Prime Sieve          14852.56   143.1  103.79 Mops/W
Merge Sort            4349.82   146.2   29.75 Mops/W
Fourier Transform     7014.84    92.8   75.59 Mflops/W
Lorenz 96           107614.5    165.7  649.45 Mflops/W
Deploying a stand-alone Pi 4B provisioned to run only BOINC appears to be more energy efficient than scavenging cycles from an already running desktop computer. It's also possible my power meter toy doesn't accurately measure power in the 3 watt range. For example, there was a noticeable difference in the volt-ampere readings compared to watts for the Pi while these two quantities were essentially the same for the Ryzen.
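
The gap between the volt-ampere and watt readings can be expressed as a power factor, that is, real power divided by apparent power. The sketch below applies this to the AVG row of the Pi 4B meter table posted earlier. The function name `power_factor` is made up for illustration, and a low value by itself does not show that the watt reading is wrong; it only quantifies the discrepancy.

```python
def power_factor(watts, volt_amperes):
    """Real power divided by apparent power; 1.0 means the two readings agree."""
    return watts / volt_amperes

# AVG row of the Pi 4B meter table: (volt-amperes, watts) per benchmark.
pi4b_avg = {
    "Prime Sieve": (10.51, 6.01),
    "Merge Sort": (9.75, 5.59),
    "Fourier Transform": (8.80, 4.96),
    "Lorenz 96": (12.38, 7.45),
}

for name, (va, w) in pi4b_avg.items():
    print(f"{name:18s} power factor {power_factor(w, va):.2f}")
```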

Of course, if you have the new 8GB model, there is enough memory to scavenge CPU cycles while the Pi is used for something else. In this case the efficiency is

Code: Select all

                  Mops/MFlops  Extra     Scavenging
Pi 4B                          Watts     Efficiency
Prime Sieve          1691.07    3.01   561.82 Mops/W
Merge Sort            542.575   2.59   209.49 Mops/W
Fourier Transform     346.487   1.96   176.78 Mflops/W
Lorenz 96            5110.89    4.45  1148.51 Mflops/W
This result may explain why people are interested in ARM processors both for cloud and high-performance computing. It further makes the 4B attractive for building clusters as well as running microservices in a data center.
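
To make the comparison concrete, here is a minimal sketch of the scavenging calculation for the prime sieve on both machines. The function name `scavenging_efficiency` is hypothetical. The Ryzen figures use the 100 watt idle measurement, while for the Pi the table already lists the extra watts, so idle is passed as zero.

```python
def scavenging_efficiency(throughput, load_watts, idle_watts):
    """Throughput per watt counting only the power drawn above idle."""
    return throughput / (load_watts - idle_watts)

# Ryzen 1920X: total Mops and AVG load watts from the tables above, 100 W idle.
ryzen = scavenging_efficiency(14852.56, 243.1, 100.0)

# Pi 4B: the table lists the extra watts directly, so idle is passed as 0.
pi4b = scavenging_efficiency(1691.07, 3.01, 0.0)

print(f"Ryzen 1920X prime sieve: {ryzen:.2f} Mops/W")
print(f"Pi 4B prime sieve:       {pi4b:.2f} Mops/W")
print(f"Pi 4B advantage:         {pi4b / ryzen:.1f}x")
```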

For reference, the performance data for the 1920X is

Code: Select all

Prime Sieve          P=14630843 Workers=8 Sec=0.188313 Mops=4961.56
Prime Sieve          P=14630843 Workers=8 Sec=0.188105 Mops=4967.05
Prime Sieve          P=14630843 Workers=8 Sec=0.189752 Mops=4923.95

Merge Sort           N=16777216 Workers=16 Sec=0.278942 Mops=1443.5
Merge Sort           N=16777216 Workers=16 Sec=0.278015 Mops=1448.32
Merge Sort           N=16777216 Workers=16 Sec=0.276168 Mops=1458

Fourier Transform    N=4194304 Workers=4 Sec=0.218448 Mflops=2112.06
Fourier Transform    N=4194304 Workers=4 Sec=0.185614 Mflops=2485.65
Fourier Transform    N=4194304 Workers=8 Sec=0.190876 Mflops=2417.13

Lorenz 96            N=32768 K=16384 Workers=16 Sec=0.0878499 Mflops=36667.4
Lorenz 96            N=32768 K=16384 Workers=16 Sec=0.0861906 Mflops=37373.3
Lorenz 96            N=32768 K=16384 Workers=16 Sec=0.0959447 Mflops=33573.8
and the data collected from the meter was

Code: Select all

        PSW    MSW    FTW   L96W
MIN   239.1  245.0  190.1  258.2
AVG   243.1  246.2  192.8  265.7
STD     2.9    0.6    1.7    4.5
MAX   251.5  247.6  199.2  275.5

DATA  241.2  245.8  190.8  273.4
      239.7  245.2  191.5  271.9
      240.1  246.5  192.0  272.7
      239.9  246.4  193.8  271.8
      239.6  246.0  193.7  271.0
      240.4  246.2  196.9  271.1
      239.7  246.3  199.2  271.2
      239.1  245.8  197.0  272.4
      239.2  246.5  194.2  274.2
      239.5  246.3  192.6  275.5
      239.4  246.0  191.7  273.4
      239.9  245.7  191.1  268.4
      240.8  246.4  193.0  268.3
      241.3  246.3  192.5  266.5
      241.1  246.0  191.4  267.1
      240.9  246.1  192.7  268.2
      241.3  246.1  193.1  267.3
      242.0  245.7  193.6  267.0
      241.8  245.2  193.7  267.5
      242.0  245.7  195.3  265.4
      242.6  246.2  193.8  266.2
      242.4  245.5  191.6  267.5
      242.1  245.7  192.8  267.2
      242.5  246.8  192.1  264.1
      243.1  245.5  191.6  264.7
      242.9  245.0  191.0  263.2
      242.2  246.0  191.5  262.3
      243.0  246.5  191.7  263.3
      242.1  245.7  192.5  264.5
      241.9  245.8  191.8  265.0
      243.1  246.4  191.5  265.4
      243.2  246.5  192.2  263.3
      242.9  245.6  192.8  261.9
      243.4  245.7  192.7  263.5
      246.5  246.2  192.3  261.8
      246.4  246.6  190.1  261.9
      247.7  246.1  191.0  260.4
      251.5  247.0  192.3  262.9
      248.8  247.6  193.8  262.4
      246.1  246.4  193.1  261.7
      245.1  245.9  194.4  260.0
      245.9  246.8  194.7  263.1
      246.0  246.9  195.8  262.5
      246.1  246.2  191.3  263.2
      246.3  247.3  191.1  260.4
      246.0  247.4  192.0  259.9
      246.4  246.3  191.3  260.2
      246.5  246.3  192.4  261.5
      246.3  247.2  192.8  260.3
      246.3  247.5  193.0  258.2
After meditating on the data, it seems possible that power consumption may have increased a bit during the run of prime sieve due to a cooling fan turning on and may have decreased during the Lorenz 96 dynamical simulation due to heat-related reduced turbo boost.
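
One rough way to check for such a drift is to compare the mean of the later readings with the mean of the earlier ones. The sketch below does this for the first five and last five readings of two columns. The function name `drift` is made up for illustration, and it assumes the rows of the table are in the order the readings were taken, which the post does not state.

```python
from statistics import mean

def drift(readings):
    """Mean of the second half of the readings minus mean of the first half."""
    half = len(readings) // 2
    return mean(readings[half:]) - mean(readings[:half])

# First five and last five prime sieve (PSW) readings from the table above.
psw = [241.2, 239.7, 240.1, 239.9, 239.6, 246.0, 246.4, 246.5, 246.3, 246.3]
# First five and last five Lorenz 96 (L96W) readings from the same table.
l96w = [273.4, 271.9, 272.7, 271.8, 271.0, 259.9, 260.2, 261.5, 260.3, 258.2]

print(f"prime sieve drift: {drift(psw):+.2f} W")
print(f"Lorenz 96 drift:   {drift(l96w):+.2f} W")
```

A positive value means the later readings were higher and a negative value means they were lower.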

Re: A Pi Pie Chart

Posted: Sat May 30, 2020 6:59 am
by ejolson
As mentioned in

viewtopic.php?f=63&t=271121&p=1669795#p1669795

I had an opportunity to run some tests remotely on the newly announced ODROID-C4 single-board computer. Here are the pie chart results

Image

I find it interesting that the C4 is faster for merge sort and the Fourier transform but slower for prime sieve and the Lorenz 96 simulation. This appears to reflect the combination of a slower processor with more bandwidth compared to the 4B.

For reference, the output was

Code: Select all

$ ./pichart-openmp -t ODROID-C4
pichart -- Raspberry Pi Performance OPENMP version 34

Prime Sieve          P=14630843 Workers=4 Sec=0.881419 Mops=1060.03
Merge Sort           N=16777216 Workers=8 Sec=0.853603 Mops=471.71
Fourier Transform    N=4194304 Workers=4 Sec=1.4914 Mflops=309.356
Lorenz 96            N=32768 K=16384 Workers=4 Sec=1.12385 Mflops=2866.25

The ODROID-C4 has Raspberry Pi ratio=22.9131
Making pie charts...done.
Based on the Pi ratio, the 4B is about 38 percent faster on average.
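
For comparing runs like this across boards, the result lines can be parsed mechanically. The following is a minimal sketch and not part of pichart itself: the regular expression and the name `parse_line` are illustrative, and the pattern assumes the layout shown above, with the benchmark name separated from the parameters by at least two spaces.

```python
import re

# Matches one result line, e.g.
# "Merge Sort           N=16777216 Workers=8 Sec=0.853603 Mops=471.71"
LINE = re.compile(
    r"^(?P<name>.+?)\s{2,}.*Workers=(?P<workers>\d+)\s+"
    r"Sec=(?P<sec>[\d.]+)\s+(?P<unit>Mops|Mflops)=(?P<rate>[\d.]+)$"
)

def parse_line(line):
    """Return a dict for a benchmark result line, or None for any other line."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    return {
        "name": m.group("name"),
        "workers": int(m.group("workers")),
        "sec": float(m.group("sec")),
        "unit": m.group("unit"),
        "rate": float(m.group("rate")),
    }

sample = "Merge Sort           N=16777216 Workers=8 Sec=0.853603 Mops=471.71"
result = parse_line(sample)
print(result)
```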

Re: A Pi Pie Chart

Posted: Sat May 30, 2020 9:28 am
by bensimmo
2GHz A55's
and
DDR4 1.32GHz

I was being nosey and had a look: they have their own benchmarks against the Pi 4 too.

Re: A Pi Pie Chart

Posted: Sun Jul 05, 2020 11:47 pm
by ejolson
In preparation for working with Raspberry Pi OS images on an x86 server, I've been studying the user-mode QEMU binary emulation described in

https://github.com/sakaki-/gentoo-on-rp ... infmt_misc

It occurred to me, as it has to many other people, that the same technique could be used to run 64-bit AMD x86 binaries on a Raspberry Pi. So, I decided to check the performance of the Pi pie chart program while performing such an emulation. To do this I installed

Code: Select all

# apt-get install qemu-user-static
on the Pi 4B. Then, on my PC, I compiled a statically linked 64-bit x86 hello world executable with the command

Code: Select all

$ gcc -static -O3 -s -o hello hello.c
where hello.c contained the lines

Code: Select all

#include <stdio.h>
int main(){
    printf("Hello World!\n");
    return 0;
}
and copied the binary over to the Pi. Finally, I typed

Code: Select all

$ file hello
hello: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 3.2.0, BuildID[sha1]=b703362341d57ea5a503241b4ba62987a6552981, stripped
$ ./hello ; # Running on the Pi 4B
Hello World!
and thought, this is amazingly simple.
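
As far as I understand it, this works because the kernel's binfmt_misc mechanism selects an interpreter by matching bytes in the file header, and for ELF files the field that distinguishes architectures is `e_machine`. The sketch below reads that field directly. It illustrates the header layout only, not how QEMU or the kernel implement the check, and the helper name `elf_machine` is made up.

```python
import struct

# e_machine values from the ELF specification.
MACHINES = {3: "i386", 40: "ARM", 62: "x86-64", 183: "AArch64"}

def elf_machine(header):
    """Return the architecture named by the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # Byte 5 (EI_DATA) gives the byte order: 1 = little-endian, 2 = big-endian.
    endian = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, right after e_type.
    (machine,) = struct.unpack_from(endian + "H", header, 18)
    return MACHINES.get(machine, f"unknown ({machine})")

# Hand-built header for illustration: 64-bit, little-endian, e_machine = 62.
sample = b"\x7fELF\x02\x01\x01" + b"\x00" * 9 + struct.pack("<HH", 2, 62)
print(elf_machine(sample))  # prints x86-64
```

To inspect a real binary, pass the first 20 bytes of the file, for example `elf_machine(open("hello", "rb").read(20))`.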

Returning to the PC I downloaded the latest copy of the Pi pie chart program from the link listed in the first post

viewtopic.php?p=1393365#p1393365

changed the Makefile so it read

Code: Select all

CFLAGS=-std=gnu99 -static -O3 -mtune=native -march=native -Wall -s
and typed make.

Things are never quite as simple as one would hope. The output

Code: Select all

$ make
gcc -std=gnu99 -static -O3 -mtune=native -march=native -Wall -s -o pichart-serial pichart.c util.c sieve.c merge.c fourier.c lorenz.c -lm
gcc -std=gnu99 -static -O3 -mtune=native -march=native -Wall -s -fopenmp -o pichart-openmp pichart.c util.c sieve.c merge.c fourier.c lorenz.c -lm
/usr/bin/ld: /usr/lib/gcc/x86_64-linux-gnu/8/libgomp.a(target.o): in function `gomp_target_init':
(.text+0x328): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
indicated a problem with OpenMP.

The serial version seemed good enough to determine the relative speed between emulated and native code, so I copied it over. Then, back on the 4B I obtained

Code: Select all

$ file pichart-serial-x86_64 
pichart-serial-x86_64: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 3.2.0, BuildID[sha1]=3304010ef0e97272118f32c94ba74cf9f6f516bd, stripped
$ ./pichart-serial-x86_64 ; # qemu-x86_64 on Pi 4B
pichart -- Raspberry Pi Performance Serial version 34

Prime Sieve          P=14630843 Workers=1 Sec=25.126 Mops=37.1857
Merge Sort           N=16777216 Workers=1 Sec=23.4029 Mops=17.2052
Fourier Transform    N=4194304 Workers=1 Sec=47.0241 Mflops=9.81142
Lorenz 96            N=32768 K=16384 Workers=1 Sec=190.244 Mflops=16.9321

My Computer has Raspberry Pi ratio=0.507005
Making pie charts...done.
As indicated by the Pi ratio, emulated x86 code on the Pi 4B executes at about half the speed of native code on the original Raspberry Pi.

For completeness, and thinking that part of the slowness might come from emulating a 64-bit architecture on the 32-bit Raspberry Pi OS, I again compiled the Pi pie chart program, this time for the 32-bit Intel architecture. The results when running the 32-bit Intel i386 code using QEMU emulation on a Pi 4B were

Code: Select all

$ file pichart-serial-i386 
pichart-serial-i386: ELF 32-bit LSB executable, Intel 80386, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.32, BuildID[sha1]=5585668ec69506e97728b7f7cb3d35297e9d4611, stripped
$ ./pichart-serial-i386 ; # qemu-i386 on Pi 4B
pichart -- Raspberry Pi Performance Serial version 34

Prime Sieve          P=14630843 Workers=1 Sec=10.3452 Mops=90.3148
Merge Sort           N=16777216 Workers=1 Sec=18.3665 Mops=21.9233
Fourier Transform    N=4194304 Workers=2 Sec=131.468 Mflops=3.5094
Lorenz 96            N=32768 K=16384 Workers=1 Sec=323.73 Mflops=9.95035

My Computer has Raspberry Pi ratio=0.45533
Making pie charts...done.
$ sudo vcgencmd get_throttled
throttled=0x0
Note that the emulated 32-bit code for the prime sieve was much faster, but the floating-point codes for the Fourier transform and Lorenz 96 dynamical simulation were slower. On average, emulating the 32-bit Intel architecture was slightly slower than emulating the 64-bit AMD architecture.
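
The per-benchmark differences are easier to see as ratios. A minimal sketch, using the rates copied from the two emulated runs above (the function name `relative_speed` is made up for illustration):

```python
# Rates (Mops or Mflops) copied from the two emulated runs above.
x86_64 = {"Prime Sieve": 37.1857, "Merge Sort": 17.2052,
          "Fourier Transform": 9.81142, "Lorenz 96": 16.9321}
i386 = {"Prime Sieve": 90.3148, "Merge Sort": 21.9233,
        "Fourier Transform": 3.5094, "Lorenz 96": 9.95035}

def relative_speed(a, b):
    """Per-benchmark rate of run a divided by the rate of run b."""
    return {name: a[name] / b[name] for name in a}

speed = relative_speed(i386, x86_64)
for name, r in speed.items():
    print(f"{name:18s} i386 / x86_64 = {r:.2f}")
```

A value above 1 means the emulated i386 build was faster for that benchmark and a value below 1 means it was slower.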

For comparison, runs using native ARM 32-bit and 64-bit binaries were

Code: Select all

$ ./pichart-serial ; # Native 32-bit ARM on Pi 4B
pichart -- Raspberry Pi Performance Serial version 34

Prime Sieve          P=14630843 Workers=1 Sec=2.18637 Mops=427.343
Merge Sort           N=16777216 Workers=2 Sec=2.43822 Mops=165.142
Fourier Transform    N=4194304 Workers=2 Sec=3.01506 Mflops=153.023
Lorenz 96            N=32768 K=16384 Workers=1 Sec=2.30807 Mflops=1395.64

My Computer has Raspberry Pi ratio=9.83861
Making pie charts...done.
and

Code: Select all

$ ./pichart-serial-aarch64 ; # Native 64-bit AARCH64 on Pi 4B
pichart -- Raspberry Pi Performance Serial version 34

Prime Sieve          P=14630843 Workers=2 Sec=2.3852 Mops=391.719
Merge Sort           N=16777216 Workers=1 Sec=4.20305 Mops=95.8003
Fourier Transform    N=4194304 Workers=2 Sec=2.92823 Mflops=157.561
Lorenz 96            N=32768 K=16384 Workers=1 Sec=1.84992 Mflops=1741.27

My Computer has Raspberry Pi ratio=8.94451
Making pie charts...done.
In conclusion, the results of using QEMU emulation to run Intel-compatible code on the Pi 4B may be summarized as

Code: Select all

Single-core Pi 4B        Pi Ratio   Percent
Native 32-bit ARM         9.839       100
Native 64-bit AARCH64     8.945        91
Emulated 32-bit i386      0.455       4.6
Emulated 64-bit x86_64    0.507       5.2
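
The percent column is each Pi ratio divided by the native 32-bit ARM ratio. A minimal sketch that recomputes it from the ratios reported by the four runs above, and also prints the corresponding slowdown factor (the variable names are illustrative):

```python
# Pi ratios reported by the four serial runs above.
ratios = {
    "Native 32-bit ARM": 9.83861,
    "Native 64-bit AARCH64": 8.94451,
    "Emulated 32-bit i386": 0.45533,
    "Emulated 64-bit x86_64": 0.507005,
}

baseline = ratios["Native 32-bit ARM"]
percent = {name: 100.0 * r / baseline for name, r in ratios.items()}

for name, r in ratios.items():
    print(f"{name:24s} {percent[name]:6.1f} percent   slowdown {baseline / r:5.1f}x")
```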

Re: A Pi Pie Chart

Posted: Tue Jul 07, 2020 3:16 pm
by bensimmo
Just because I was reading the thread

New processor run (in a laptop)

Code: Select all

root@G3ntleGiraffe:~/pichart-34# ./pichart-openmp -t "Ubuntu WSL2 Win10 i5-9300H"
pichart -- Raspberry Pi Performance OPENMP version 34

Prime Sieve          P=14630843 Workers=8 Sec=0.196982 Mops=4743.21
Merge Sort           N=16777216 Workers=16 Sec=0.307859 Mops=1307.92
Fourier Transform    N=4194304 Workers=8 Sec=0.182756 Mflops=2524.53
Lorenz 96            N=32768 K=16384 Workers=8 Sec=0.0630457 Mflops=51093.5

The Ubuntu WSL2 Win10 i5-9300H has Raspberry Pi ratio=149.345
Making pie charts...done.