cleverca22
Posts: 3792
Joined: Sat Aug 18, 2012 2:33 pm

Re: How Slow is Go?

Sun May 16, 2021 10:26 pm

just to see how the cross-compiler in nix compares to things...

Code: Select all

[clever@amd-nixos:~/apps/rpi/realfft]$ nix build .#realfft -o x86-64
[clever@amd-nixos:~/apps/rpi/realfft]$ nix build .#packages.aarch64-linux.realfft -o arm64 --option repeat 0
[clever@amd-nixos:~/apps/rpi/realfft]$ nix build .#packages.armv6l-linux.realfft -o armv6 --option repeat 0
[clever@amd-nixos:~/apps/rpi/realfft]$ nix build .#packages.i686-linux.realfft -o x86-32 --option repeat 0
i built a bunch of versions, native and cross

the native x86-64 build:

Code: Select all

[clever@amd-nixos:~/apps/rpi/realfft]$ egrep 'model name|MHz' /proc/cpuinfo  | head -n2
model name      : AMD FX(tm)-8350 Eight-Core Processor
cpu MHz         : 4013.312
[clever@amd-nixos:~/apps/rpi/realfft]$ ./x86-64/bin/realfft
realfft.go -- Perform real to complex Fourier transform
Version=6; N=4194304

   run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
     1  7.34091862e-13  2.21084714e+00  7.37445718e-13  4.35030770e+00
     2  7.34091862e-13  2.05668330e+00  7.37445718e-13  4.11580396e+00
     3  7.34091862e-13  2.10192871e+00  7.37445718e-13  4.10689902e+00

Best real=2.0567e+00 sec; Mtflops=1.1216e+02
Best complex=4.1069e+00 sec; Mtflops=1.1234e+02
Single-core speed is 1.964 times a Pi 4B
the cross aarch64 build on a pi400:

Code: Select all

root@pi400:/sys/devices/system/cpu/cpufreq/policy0# echo performance > scaling_governor 
root@pi400:/sys/devices/system/cpu/cpufreq/policy0# cat scaling_cur_freq 
1800000
root@pi400:/sys/devices/system/cpu/cpufreq/policy0# uname -a
Linux pi400 5.4.83-v8+ #1379 SMP PREEMPT Mon Dec 14 13:15:14 GMT 2020 aarch64 GNU/Linux
root@pi400:~# /nix/store/iflgwn5igx0a4mlv0imr27kpx5absj8f-realfft-aarch64-unknown-linux-gnu-0.0.1/bin/realfft 
realfft.go -- Perform real to complex Fourier transform
Version=6; N=4194304

   run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
     1  7.14602381e-13  3.58449554e+00  7.17516816e-13  7.11788559e+00
     2  7.14602381e-13  3.45881510e+00  7.17516816e-13  6.95894599e+00
     3  7.14602381e-13  3.49032807e+00  7.17516816e-13  6.89069176e+00

Best real=3.4588e+00 sec; Mtflops=6.6695e+01
Best complex=6.8907e+00 sec; Mtflops=6.6956e+01
Single-core speed is 1.169 times a Pi 4B
the armv6 build on a pi400 (best rpi cpu, but worst compile options, to remain compatible with pi0/pi1)

Code: Select all

root@pi400:~# /nix/store/f6q5gzwdgh6rj83va50bx3nrpcibkdb0-realfft-armv6l-unknown-linux-gnueabihf-0.0.1/bin/realfft 
realfft.go -- Perform real to complex Fourier transform
Version=6; N=4194304
 
   run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
     1  7.34091862e-13  2.42649471e+02  7.37445718e-13  4.74083977e+02
     2  7.34091862e-13  2.43074687e+02  7.37445718e-13  4.74017292e+02
     3  7.34091862e-13  2.42629127e+02  7.37445718e-13  4.69767009e+02
 
Best real=2.4263e+02 sec; Mtflops=9.5078e-01
Best complex=4.6977e+02 sec; Mtflops=9.8213e-01
Single-core speed is 0.01691 times a Pi 4B
wow that was slow!!!

x86-32 native build:

Code: Select all

[clever@amd-nixos:~/apps/rpi/realfft]$ ./x86-32/bin/realfft
realfft.go -- Perform real to complex Fourier transform
Version=6; N=4194304

   run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
     1  7.34091862e-13  1.26721236e+02  7.37445718e-13  2.51482695e+02
     2  7.34091862e-13  1.27384054e+02  7.37445718e-13  2.51402122e+02
     3  7.34091862e-13  1.28058959e+02  7.37445718e-13  2.51280190e+02

Best real=1.2672e+02 sec; Mtflops=1.8204e+00
Best complex=2.5128e+02 sec; Mtflops=1.8361e+00
Single-core speed is 0.03199 times a Pi 4B
and one last test, armv7l build, so it would work on the pi2/pi3/pi4 range

Code: Select all

root@pi400:~# /nix/store/13caf6b14h0kakxx9apk4w49ql5ymkmi-realfft-armv7l-unknown-linux-gnueabihf-0.0.1/bin/realfft 
realfft.go -- Perform real to complex Fourier transform
Version=6; N=4194304
 
   run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
     1  7.34091862e-13  2.42421283e+02  7.37445718e-13  4.69981720e+02
     2  7.34091862e-13  2.41970166e+02  7.37445718e-13  4.69405938e+02
     3  7.34091862e-13  2.42087329e+02  7.37445718e-13  4.69463250e+02
 
Best real=2.4197e+02 sec; Mtflops=9.5337e-01
Best complex=4.6941e+02 sec; Mtflops=9.8289e-01
Single-core speed is 0.01694 times a Pi 4B
the main observation i can see is that any 32bit build is horribly slow, roughly 60x to 70x the time of the 64bit build on the same machine
i think the cause is that it's using 64bit floats, and a 32bit machine has to do that in software, making it far more costly
there is also basically no difference between armv6 and armv7, so you're not losing anything (in this specific case) by supporting pi0/pi1
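
One way to probe the software-float guess in isolation would be a tiny float64 loop built for each target and timed the same way. The sketch below is not part of realfft.go; the function name axpy and the iteration count are made up for illustration. If a 32bit build of this loop also comes out tens of times slower than the 64bit one, that would point at how the floating point code is generated rather than at the FFT itself.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// axpy performs n dependent float64 multiply-adds and returns the
// result so the compiler cannot discard the loop.
// (Hypothetical helper, not taken from realfft.go.)
func axpy(n int) float64 {
	x, a := 0.0, 1.0000001
	for i := 0; i < n; i++ {
		x = x*a + 1.0
	}
	return x
}

func main() {
	const n = 100_000_000 // arbitrary iteration count
	start := time.Now()
	r := axpy(n)
	dt := time.Since(start).Seconds()
	// Two floating point operations (multiply and add) per iteration.
	mflops := 2 * float64(n) / dt / 1e6
	fmt.Printf("%s/%s: %d multiply-adds in %.3f s (%.1f Mflops), result %g\n",
		runtime.GOOS, runtime.GOARCH, n, dt, mflops, r)
}
```

The absolute numbers mean little on their own; the interesting figure is the ratio between the 32bit and 64bit builds on one machine.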

ejolson
Posts: 7244
Joined: Tue Mar 18, 2014 11:47 am

Re: How Slow is Go?

Sun May 16, 2021 10:42 pm

cleverca22 wrote:
Sun May 16, 2021 10:26 pm
i think the cause is that it's using 64bit floats, and a 32bit machine has to do that in software, making it far more costly
there is also basically no difference between armv6 and armv7, so you're not losing anything (in this specific case) by supporting pi0/pi1
As indicated in the post

viewtopic.php?p=1864327#p1864327

you may have to set

$ export GOARM=6

or

$ export GOARM=7

to avoid generating ARMv5 compatible software floating point. Does that help with the cross compiler?
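
For anyone driving the builds by hand rather than through a package manager, the same variables can be set per build from a small Go program. This is only a sketch: the output names, the target list and the source file name passed to go build are assumptions for illustration. GOARM accepts 5, 6 or 7, and GO386 accepts sse2 or softfloat in Go 1.16 and later.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// target describes one cross-compile configuration.
type target struct {
	name string   // output file name (hypothetical)
	env  []string // GOOS/GOARCH plus the sub-architecture variable
}

// targets is an illustrative list, not the set used in this thread.
var targets = []target{
	{"realfft-armv6", []string{"GOOS=linux", "GOARCH=arm", "GOARM=6"}},
	{"realfft-armv7", []string{"GOOS=linux", "GOARCH=arm", "GOARM=7"}},
	{"realfft-arm64", []string{"GOOS=linux", "GOARCH=arm64"}},
	{"realfft-386", []string{"GOOS=linux", "GOARCH=386", "GO386=sse2"}},
}

// buildCmd prepares (but does not run) "go build" for one target.
// Entries appended after os.Environ() override earlier duplicates.
func buildCmd(t target, src string) *exec.Cmd {
	cmd := exec.Command("go", "build", "-o", t.name, src)
	cmd.Env = append(os.Environ(), t.env...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd
}

func main() {
	for _, t := range targets {
		fmt.Println("building", t.name, "with", t.env)
		if err := buildCmd(t, "realfft.go").Run(); err != nil {
			fmt.Fprintln(os.Stderr, "build failed:", err)
		}
	}
}
```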

cleverca22
Posts: 3792
Joined: Sat Aug 18, 2012 2:33 pm

Re: How Slow is Go?

Sun May 16, 2021 11:06 pm

ejolson wrote:
Sun May 16, 2021 10:42 pm
you may have to set

$ export GOARM=6

or

$ export GOARM=7

to avoid generating ARMv5 compatible software floating point. Does that help with the cross compiler?

Code: Select all

[clever@amd-nixos:~/apps/rpi/nixpkgs-test]$ git grep GOARM
pkgs/development/go-packages/generic/default.nix:    GOARM = toString (lib.intersectLists [(stdenv.hostPlatform.parsed.cpu.version or "")] ["5" "6" "7"]);
[clever@amd-nixos:~/apps/rpi/nixpkgs-test]$ nix repl .
Welcome to Nix version 2.4pre20201205_a5d85d0. Type :? for help.

Loading '.'...
Added 14142 variables.

nix-repl> pkgsCross.armv7l-hf-multiplatform.stdenv.hostPlatform.parsed.cpu.version
"7"
nix-repl> pkgsCross.raspberryPi.stdenv.hostPlatform.parsed.cpu.version
"6"
it looks like my package manager handles that already, but i can't 100% confirm if that is working in a cross-compile situation

x86-32bit also runs horribly slowly, implying it's a 32bit problem, not an arm32 problem

ejolson
Posts: 7244
Joined: Tue Mar 18, 2014 11:47 am

Re: How Slow is Go?

Mon May 17, 2021 3:44 am

cleverca22 wrote:
Sun May 16, 2021 11:06 pm
x86-32bit also runs horribly slowly, implying it's a 32bit problem, not an arm32 problem
Here, I think, is a reasonable run of the C and Go codes on a 32-bit Pentium 4.

Code: Select all

$ grep "model name" /proc/cpuinfo | head -n1
model name  : Intel(R) Pentium(R) 4 CPU 2.40GHz
$ ./realfft
realfft.go -- Perform real to complex Fourier transform
Version=6; N=4194304

   run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
     1  7.34091862e-13  1.20251405e+01  7.37445718e-13  2.42526693e+01
     2  7.34091862e-13  1.18629508e+01  7.37445718e-13  2.39169264e+01
     3  7.34091862e-13  1.19520833e+01  7.37445718e-13  2.39350030e+01

Best real=1.1863e+01 sec; Mtflops=1.9446e+01
Best complex=2.3917e+01 sec; Mtflops=1.9291e+01
Single-core speed is 0.3389 times a Pi 4B
$ ./rfft
rfft.c -- Perform real to complex Fourier transform
Version=6; N=4194304

   run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
     1  6.08955051e-13  6.92031200e+00  6.16497093e-13  1.38606970e+01
     2  6.08955051e-13  6.77315600e+00  6.16497093e-13  1.36002750e+01
     3  6.08955051e-13  6.77335000e+00  6.16497093e-13  1.36010130e+01

Best real=6.7732e+00 sec; Mtflops=3.4059e+01
Best complex=1.3600e+01 sec; Mtflops=3.3924e+01
Single-core speed is 0.5165 times a Pi 4B
The ratio of single-core speed for Go versus C from the last line of each run is

X/Y=0.3389/0.5165=0.6561

which shows that Go loses much more performance on 32-bit than C does, but definitely not the 60-fold loss which comes from software floating point.

For reference, what do the 64-bit C runs look like on the

Code: Select all

model name      : AMD FX(tm)-8350 Eight-Core Processor
cpu MHz         : 4013.312
computer which you used for the initial 64-bit Go test?
Last edited by ejolson on Mon May 17, 2021 4:14 am, edited 1 time in total.

cleverca22
Posts: 3792
Joined: Sat Aug 18, 2012 2:33 pm

Re: How Slow is Go?

Mon May 17, 2021 3:57 am

ejolson wrote:
Mon May 17, 2021 3:44 am
For reference, what do the 64-bit C runs look like on the FX-8350 computer which you used for the initial 64-bit Go test?
x86-64:

Code: Select all

rfft.c -- Perform real to complex Fourier transform
Version=6; N=4194304

   run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
     1  7.09627869e-13  2.14898300e+00  7.13269560e-13  4.37379700e+00
     2  7.09627869e-13  2.09731800e+00  7.13269560e-13  4.30577300e+00
     3  7.09627869e-13  2.10579900e+00  7.13269560e-13  4.28145600e+00

Best real=2.0973e+00 sec; Mtflops=1.0999e+02
Best complex=4.2815e+00 sec; Mtflops=1.0776e+02
Single-core speed is 1.654 times a Pi 4B
x86-32:

Code: Select all

rfft.c -- Perform real to complex Fourier transform
Version=6; N=4194304

   run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
     1  6.08938445e-13  2.94907800e+00  6.16520494e-13  5.97557700e+00
     2  6.08938445e-13  2.94315100e+00  6.16520494e-13  5.89734100e+00
     3  6.08938445e-13  2.90891400e+00  6.16520494e-13  5.84949400e+00

Best real=2.9089e+00 sec; Mtflops=7.9303e+01
Best complex=5.8495e+00 sec; Mtflops=7.8874e+01
Single-core speed is 1.202 times a Pi 4B
32bit c did FAR better than 32bit go

ejolson
Posts: 7244
Joined: Tue Mar 18, 2014 11:47 am

Re: How Slow is Go?

Mon May 17, 2021 3:37 pm

cleverca22 wrote:
Mon May 17, 2021 3:57 am
32bit c did FAR better than 32bit go
Upon dividing the Go versus C performance relative to the Pi 4B as

X/Y=1.964/1.654=1.187

it appears that 64-bit Go is pretty well tuned on the FX-8350. Even in real time it is slightly faster than C. If not for my recent tests with the i3 550 I would have thought this indicated something wrong with the underlying C library, but it still seems within the limits of normal.
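
The same division can be laid out for every Go and C pair posted in this thread. The sketch below only repeats the arithmetic on the "Single-core speed" lines quoted above; the type and function names are made up for illustration.

```go
package main

import "fmt"

// run holds the "Single-core speed is X times a Pi 4B" figures
// for one machine, as posted in this thread.
type run struct {
	label   string
	goSpeed float64 // from realfft (Go)
	cSpeed  float64 // from rfft (C)
}

// ratio returns Go speed relative to C; above 1 means Go was faster.
func ratio(r run) float64 { return r.goSpeed / r.cSpeed }

func main() {
	runs := []run{
		{"Pentium 4, 32-bit", 0.3389, 0.5165},
		{"FX-8350, 64-bit", 1.964, 1.654},
		{"FX-8350, 32-bit", 0.03199, 1.202},
	}
	for _, r := range runs {
		fmt.Printf("%-18s Go/C = %.4f\n", r.label, ratio(r))
	}
}
```

The ratios come out at roughly 0.66, 1.19 and 0.027. The 32-bit FX-8350 figure is far below the 32-bit Pentium 4 one, which fits the suspicion that the cross-built 32-bit binary differs in how it does floating point, not merely in being 32-bit.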

The updated 64-bit graph is

[Image: updated 64-bit graph]

Other than the fact that my friend Julia is making some nice plots, it seems the code generator used in Go is optimized for 64-bit CPUs that are 5 to 10 years old and hasn't caught up to GCC for more modern architectures.

The fact that Go doesn't seem to have compiler switches for selecting the subarchitecture and instead relies on environment variables such as GOARM and GO386 may reflect design problems that affect performance on certain systems. Though perhaps necessary, I think it was a mistake to remove 387 floating point support from the x86 compilers. A compiler which produces efficient binaries will eventually have to support a number of machine tuning parameters. Selecting between 387 versus SSE2, AVX2 and AVX512 or equivalently HF, NEON and SVE for ARM is likely one of the simpler cases in which to develop the needed compiler infrastructure.
