Heater
Posts: 13102
Joined: Tue Jul 17, 2012 3:02 pm

Re: 64bit vs 32bit benchmark. Eben was right

Wed Jun 29, 2016 4:07 pm

Very true that.

It's amazing though.

In places where they are into such things a piece of code can have a design document written before it is coded. That design will be reviewed and perhaps amended by two other engineers and signed off by Quality Assurance. Then the code is written. Again reviewed and amended by two others and signed off by QA. Then the unit tests are written, and reviewed of course. Then it's tested, perhaps bugs are found and fixed. Those fixes may involve going through all the above procedure again!

And still, a new boy to the team will look at that code and say "What idiots wrote this garbage?"

Me, I have trouble starting to write anything nowadays. I already know what garbage it will be in the future!

sirspudd
Posts: 6
Joined: Tue Jun 26, 2012 5:57 pm
Location: San Francisco
Contact: Website

Re: 64bit vs 32bit benchmark. Eben was right

Thu Aug 18, 2016 8:50 pm

Firstly, thanks for pointing out that there is an aarch64 build of Fedora available; that is awesome.
Secondly, your test was entirely invalid. The SD card I/O on this Fedora image is deplorable: it is already dire in comparison to a stock Arch armv7 install, and it is going to be doubly poor if the aarch64 binaries are huge.

I have rarely seen compilers used as benchmarks; we need to overladen this bad boy with floating point high jumps and explore the register adjustment. Remember all those muppets who said the iPhone adopting a 64 bit chip was dopey as they only had a gig of ram at the time; remember the egg on their faces as the chip put every other ARM based device on its face in terms of performance? We don't need to repeat these assertions.

Working on cross compiling Qt for the aarch64 to see how it performs.

ejolson
Posts: 3419
Joined: Tue Mar 18, 2014 11:47 am

Re: 64bit vs 32bit benchmark. Eben was right

Fri Aug 19, 2016 12:47 am

sirspudd wrote:Working on cross compiling Qt for the aarch64 to see how it performs.
As far as I know the video drivers in the 64-bit distributions have no hardware acceleration, whereas the 32-bit distributions do.

The linpack benchmark for solving systems of linear equations discussed in this thread achieves about 6.5 double-precision gflops running on a well-cooled Pi 3 in 32-bit mode. As this is a well understood computational problem, it would be interesting to know whether a version optimized for 64-bit would perform any better.

sirspudd
Posts: 6
Joined: Tue Jun 26, 2012 5:57 pm
Location: San Francisco
Contact: Website

Re: 64bit vs 32bit benchmark. Eben was right

Fri Aug 19, 2016 2:09 am

ejolson wrote:
sirspudd wrote:Working on cross compiling Qt for the aarch64 to see how it performs.
As far as I know the video drivers in the 64-bit distributions have no hardware acceleration, whereas the 32-bit distributions do.

The linpack benchmark for solving systems of linear equations discussed in this thread achieves about 6.5 double-precision gflops running on a well-cooled Pi 3 in 32-bit mode. As this is a well understood computational problem, it would be interesting to know whether a version optimized for 64-bit would perform any better.
I don't know of a single reason VC4 will not fly on this 4.7 kernel; it runs nicely on Arch at 32-bit and does not depend on any Broadcom-provided binaries.

sirspudd
Posts: 6
Joined: Tue Jun 26, 2012 5:57 pm
Location: San Francisco
Contact: Website

Re: 64bit vs 32bit benchmark. Eben was right

Sat Aug 20, 2016 8:52 pm

Just to make it clear; weston-launch works and wayland clients function perfectly on aarch64. The distro shipped Qt is also very sane and can run (Qt) applications with full OpenGL ES 2 acceleration using:

./foo -platform eglfs

(Running against the mesa stack, I did not know where to get a 64bit build of /opt/vc from)

as long as the application is not QML based. Those currently burst into flames in a spectacular fashion and I am attempting to investigate why.

sirspudd
Posts: 6
Joined: Tue Jun 26, 2012 5:57 pm
Location: San Francisco
Contact: Website

Re: 64bit vs 32bit benchmark. Eben was right

Wed Aug 24, 2016 8:32 pm

OpenGL (by way of the VC4 stack) works very nicely on this aarch64 build.

I have both written a blog post http://chaos-reins.com/2016-08-20-qt-pi ... a-aarch64/ and uploaded a clip https://www.youtube.com/watch?v=mRHDhYVYq7A of Qt 5.8 running on the aarch64 build. It is running nicely, although I have yet to quantify any gains (from the move to aarch64) by stress testing the device.

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 23366
Joined: Sat Jul 30, 2011 7:41 pm

Re: 64bit vs 32bit benchmark. Eben was right

Thu Aug 25, 2016 8:40 am

sirspudd wrote: Remember all those muppets who said the iPhone adopting a 64 bit chip was dopey as they only had a gig of ram at the time; remember the egg on their faces as the chip put every other ARM based device on its face in terms of performance? We don't need to repeat these assertions.
The 64-bit chip in the iPhone was faster because it was a faster chip. Its 64-bitness did make a difference, but was not the whole story. Apple has an ARM architecture licence, I believe, so it is able to tweak the silicon in various ways. The big one, IIRC, was in the memory subsystem silicon, which made the chips best in class. The chip is of course 64-bit, but it would probably have been very good in 32-bit as well.
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed. Here's an example...
"My grief counseller just died, luckily, he was so good, I didn't care."

sirspudd
Posts: 6
Joined: Tue Jun 26, 2012 5:57 pm
Location: San Francisco
Contact: Website

Re: 64bit vs 32bit benchmark. Eben was right

Thu Aug 25, 2016 8:35 pm

@jamesh:

http://www.anandtech.com/show/7335/the- ... s-review/4

There is a lot more potentially at stake than just address space. As mentioned, I have no demonstrable proof of gains, or of potential gains, but we are now in a position to actually test this with fire and to see whether we can stretch any more out of the chip.

el_Salmon
Posts: 17
Joined: Thu Jan 10, 2013 2:22 pm

Re: 64bit vs 32bit benchmark. Eben was right

Tue Sep 06, 2016 8:27 am

Has anyone tried the openSUSE aarch64 image (JeOS) and compared it with Fedora? I'm looking for an aarch64 Linux image for my new RPi3, which runs a little server at home. Graphical issues on aarch64 are not a problem for me.

Penthux
Posts: 79
Joined: Thu Oct 11, 2012 7:33 am
Location: United Kingdom

Re: 64bit vs 32bit benchmark. Eben was right

Sun Dec 18, 2016 5:34 pm

MarkHaysHarris777 wrote:They're getting their act together at the Pine64 team... yes, the PI came in first in the survey, but the PineA64 came in 7th...
Which survey? It's ok quoting facts and figures but ... ah, forget it. Citation needed.
Penthux
------------
Slackware ARM on a Raspberry Pi - SARPi
http://sarpi.co.uk
"Slackware ARM - it's not for NOOBS!"

ejolson
Posts: 3419
Joined: Tue Mar 18, 2014 11:47 am

Re: 64bit vs 32bit benchmark. Eben was right

Tue Dec 20, 2016 6:07 pm

sirspudd wrote:@jamesh:

http://www.anandtech.com/show/7335/the- ... s-review/4

There is a lot more potentially at stake than just address space. As mentioned, I have no demonstrable proof of gains, or of potential gains, but we are now in a position to actually test this with fire and to see whether we can stretch any more out of the chip.
One obvious place to look for gains is in programs that perform lots of 64-bit integer arithmetic. I would guess that GMP optimized for ARMv8 might use such instructions, as might certain cryptographic libraries and random number generators. A synthetic test of 64-bit processor speed that will likely show a 10x improvement is the Collatz algorithm.
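
For anyone who wants to try this, here is a minimal Python sketch of the Collatz iteration, assuming the plain 3n+1 rule. The function names are made up for illustration. Python integers never overflow, so this only shows the logic and where a trajectory outgrows a given word size; it says nothing about speed, and a real 32-bit versus 64-bit timing test would need the same loop in C with fixed-width integers.

```python
def collatz_steps_and_peak(n):
    """Return (steps needed to reach 1, largest value seen) for a positive integer n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    steps, peak = 0, n
    while n != 1:
        # Odd values go to 3n + 1, even values are halved.
        n = 3 * n + 1 if n % 2 else n // 2
        peak = max(peak, n)
        steps += 1
    return steps, peak


def first_start_exceeding(bits, limit):
    """Smallest start below `limit` whose trajectory exceeds an unsigned `bits`-bit word, or None."""
    bound = (1 << bits) - 1
    for start in range(1, limit):
        if collatz_steps_and_peak(start)[1] > bound:
            return start
    return None


if __name__ == "__main__":
    print(collatz_steps_and_peak(27))
    print(first_start_exceeding(8, 100))
```

Calling first_start_exceeding with bits=32 and a large enough limit shows which starting values force intermediate results past 32 bits, which is the case where a 32-bit build has to fall back on multi-word arithmetic.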

dingo35
Posts: 13
Joined: Sat Oct 26, 2013 8:09 am

Re: 64bit vs 32bit benchmark. Eben was right

Mon Aug 21, 2017 7:08 am

In December 2016 I did some benchmarking, because I was convinced that 64 bits would bring much bigger speed improvements than people suspected. I had this experience when moving from i386 to x86_64, especially with FFDECSA, which is used for the decryption of media streams.
It turned out then that it was not the 64 bits themselves but the larger number of registers in the x86_64 architecture that gave a big win on all kinds of software: the compiler could now keep more variables in registers, and not having to push and pop everything to and from the stack was a BIG win!

For the benchmark I used a 64-bit Fedora image, with kernel: Linux fedora-rpi2 4.6.2-4-main #1 SMP PREEMPT Fri Jun 10 13:47:17 CEST 2016 aarch64 aarch64 aarch64 GNU/Linux

against a 32-bit Fedora image, with kernel: Linux fedora-rpi2 4.6.2-3-main #1 SMP Fri Jun 10 12:57:29 CEST 2016 armv7l armv7l armv7l GNU/Linux

I also used a 32-bit Raspbian image, with kernel: Linux raspberrypi 4.4.34-v7+ #930 SMP Wed Nov 23 15:20:41 GMT 2016 armv7l GNU/Linux

                              Fedora      Fedora      Raspbian
                              ARM64       ARM         ARM
sysbench version              0.4.12      0.4.12      0.4.12
sysbench binary size (bytes)  140592      113560      90212

sysbench --num-threads=1 --test=cpu --cpu-max-prime=20000 --validate run
Total time:                   60.4250s    727.0269s   478.9251s
Min statistic request:        6.03ms      72.66ms     47.88ms
Avg statistic request:        6.04ms      72.70ms     47.89ms
Max statistic request:        6.05ms      76.96ms     69.86ms

sysbench --num-threads=4 --test=cpu --cpu-max-prime=20000 --validate run
Total time:                   15.2510s    183.4330s   119.5340s
Min statistic request:        6.03ms      72.64ms     47.69ms
Avg statistic request:        6.1ms       73.36ms     47.80ms
Max statistic request:        6.42ms      75.76ms     104.88ms

top memory usage              0.5%        0.4%        0.2%

valgrind --tool=massif sysbench --num-threads=4 --test=cpu --cpu-max-prime=2000 --validate run
Max mem_heap_B                82592       26008       5347
With mem_heap_extra_B         4344        3296        3037
Mem_stacks_B                  0           0           0
heap_tree=                    peak        peak        peak

Memtester:
Version:                      4.3.0       4.3.0       4.3.0
Binary size (bytes)           21664       17624       14236
time  memtester 256M 1        14m14.095s  11m12.650s  9m5.4s
valgrind --tool=massif memtester 1M 1
Max mem_heap_B                1052672     1052672     1048576
With mem_heap_extra_B         16          16          8
Mem_stacks_B                  0           0           0
heap_tree=                    peak        peak        empty
Table formatting does not seem to be working on this forum, but I guess it is all readable ...

dingo35
Posts: 13
Joined: Sat Oct 26, 2013 8:09 am

Re: 64bit vs 32bit benchmark. Eben was right

Mon Aug 21, 2017 7:24 am

My conclusions were:
1) Floating point operations are up to 12 times faster in 64bits (see sysbench tests);
2) Binary size (on disk) increased by approx. 22%;
3) Memory usage can vary (in these tests) from no increase (memtester) to 4 times as much (valgrind test) in a 64-bit environment;
4) Performance of non-floating-point, memory-intensive programs can decrease by 30% (memtester) in a 64-bit environment.
The Raspbian binaries are highly optimized compared to the Fedora 32-bit image:
5) Binary file size is reduced by 20% (both sysbench and memtester);
6) Memory usage is reduced by 20% (memtester) or even by a factor of about 5 (valgrind test);
7) Performance improved by 33% (sysbench) and 20% (memtester).
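
The ratios behind these points can be rechecked with a few lines of Python. This is only a sketch: the figures are copied from the table in the previous post, while the dictionary and function names are hypothetical and chosen just for this example.

```python
# Figures copied from the benchmark table in the previous post.
SYSBENCH_1T_SECONDS = {"fedora64": 60.4250, "fedora32": 727.0269, "raspbian32": 478.9251}
SYSBENCH_SIZE_BYTES = {"fedora64": 140592, "fedora32": 113560, "raspbian32": 90212}
MEMTESTER_SIZE_BYTES = {"fedora64": 21664, "fedora32": 17624, "raspbian32": 14236}
# memtester wall-clock times (14m14.095s, 11m12.650s, 9m5.4s) converted to seconds.
MEMTESTER_SECONDS = {
    "fedora64": 14 * 60 + 14.095,
    "fedora32": 11 * 60 + 12.650,
    "raspbian32": 9 * 60 + 5.4,
}


def speedup(slow_seconds, fast_seconds):
    """How many times faster the second run is than the first."""
    return slow_seconds / fast_seconds


def percent_change(old, new):
    """Relative change from old to new, in percent (negative means a reduction)."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    t, s, m = SYSBENCH_1T_SECONDS, SYSBENCH_SIZE_BYTES, MEMTESTER_SECONDS
    print("sysbench speedup, Fedora 64 vs 32: %.2fx" % speedup(t["fedora32"], t["fedora64"]))
    print("sysbench binary size, 64 vs 32: %+.1f%%" % percent_change(s["fedora32"], s["fedora64"]))
    print("memtester run time, 64 vs 32: %+.1f%%" % percent_change(m["fedora32"], m["fedora64"]))
    print("sysbench run time, Raspbian vs Fedora 32: %+.1f%%" % percent_change(t["fedora32"], t["raspbian32"]))
```

The same two helpers work for any other pair of columns in the table, for example the massif heap figures.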

Although a lot more could be tested, I now believe that the 32-bit Raspbian image is so highly optimized that moving to a 64-bit OS on the Raspberry Pi 3 only has advantages for specific software that is heavy on floating point operations. For 90-95% (or more) of Pi users, the 32-bit Raspbian is so well optimized that 64 bits will only reduce performance because of the increased memory usage.

Only when Pis with (much!) more memory become available could the discussion become relevant again...

Just my 2 cents....

jojopi
Posts: 3079
Joined: Tue Oct 11, 2011 8:38 pm

Re: 64bit vs 32bit benchmark. Eben was right

Mon Aug 21, 2017 1:13 pm

dingo35 wrote:
Mon Aug 21, 2017 7:24 am
1) Floating point operations are up to 12 times faster in 64bits (see sysbench tests);
"sysbench --test=cpu" is not a floating point test (except that it unnecessarily calls sqrt).

It artificially favours 64bit systems by using 64bit integer division (remainder) on numbers that would all actually fit in 16bits.
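
A minimal Python sketch of the kind of loop being described, assuming the sysbench 0.4.x CPU test is a plain trial-division prime count. This is a reconstruction for illustration, not the sysbench source, and the function name is hypothetical. It also records the largest operand fed to the remainder operation, so the 16-bit claim can be checked for --cpu-max-prime=20000.

```python
import math


def count_primes_trial_division(max_prime):
    """Count primes in [3, max_prime) by trial division.

    Returns (prime count, largest operand passed to the remainder operation).
    """
    count = 0
    largest_operand = 0
    for c in range(3, max_prime):
        t = math.sqrt(c)  # the only floating point call in the loop
        l = 2
        while l <= t:
            largest_operand = max(largest_operand, c, l)
            if c % l == 0:  # integer remainder: the operation that dominates the run time
                break
            l += 1
        if l > t:  # no divisor found up to sqrt(c), so c is prime
            count += 1
    return count, largest_operand


if __name__ == "__main__":
    count, largest = count_primes_trial_division(20000)
    print(count, largest, largest < 2 ** 16)
```

If the real test declares these counters as 64-bit integers, a 32-bit build has to call a software division routine for every remainder, even though every value here would fit in a 16-bit register.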

dingo35
Posts: 13
Joined: Sat Oct 26, 2013 8:09 am

Re: 64bit vs 32bit benchmark. Eben was right

Mon Aug 21, 2017 2:03 pm

Thanks for correcting that, I'm not an expert on sysbench...

It makes Eben even more right, I guess :-)

bitbank
Posts: 252
Joined: Sat Nov 07, 2015 8:01 am
Location: Sarasota, Florida
Contact: Website

Re: 64bit vs 32bit benchmark. Eben was right

Fri Aug 25, 2017 2:59 pm

I had been holding on to some assumptions about this subject that were just rendered incorrect. I just did some testing and the 32 versus 64-bit argument is not cut and dried even for simple code execution. You're welcome to repeat my results by using my gcc_perf project:

https://github.com/bitbank2/gcc_perf

For my test system, I used a NanoPi-K2 (Amlogic S905 quad-core Cortex-A53) running Linux kernel 3.14.29 in aarch64 mode. GCC 5.4.0 was used to compile both the 32-bit and 64-bit executables. I enabled multiarch support and installed the armhf libraries to allow running both 64-bit and 32-bit executables on the same system. What I observed is that there are cases where 32-bit code runs faster than 64-bit and cases where 64-bit runs faster than 32-bit. The advantage usually lies with the 64-bit code and can be significant. The main advantage of 64-bit code for average C code is the extra general purpose registers. That does not explain, however, the results I got when comparing 32-bit and 64-bit assembly language that did exactly the same operations using the same number of registers.

Here's one of the more interesting test results:

64-bit code:
Float Sum C (bigger than cache) = 1187ms
Float Sum SIMD (bigger than cache) = 921ms
Float Sum ASM (bigger than cache) = 1143ms
Float Sum C (smaller than cache) = 401ms
Float Sum SIMD (smaller than cache) = 261ms
Float Sum ASM (smaller than cache) = 249ms

32-bit code:
Float Sum C (bigger than cache) = 1848ms
Float Sum SIMD (bigger than cache) = 849ms
Float Sum ASM (bigger than cache) = 853ms
Float Sum C (smaller than cache) = 1280ms
Float Sum SIMD (smaller than cache) = 310ms
Float Sum ASM (smaller than cache) = 334ms

The 64-bit code performs significantly better in most cases, but there are some oddities. The C (smaller than cache) case is significantly slower in 32-bit code than in 64-bit code, while the 32-bit SIMD and ASM code (bigger than cache) is slightly faster.
The fastest code is none at all :)
