ejolson
Posts: 7242
Joined: Tue Mar 18, 2014 11:47 am

How Slow is Go?

Tue May 11, 2021 6:57 am

My understanding is the Go compiler quickly produces slower executables than either GCC or LLVM because it's based on the Plan 9 toolchain. The idea in this thread is to compare the relative slowness of Go on the Pi compared to the relative slowness of Go on other architectures. To do this, near-identical programs which calculate complex to real Fourier transforms using a non-optimal implementation of the divide-and-conquer algorithm known as the FFT were written in both C and Go.

Metrics for the execution speeds of each of these programs were normalized so that each reports a score of 1.0 when run on a Raspberry Pi 4B in 64-bit mode. This was done using gcc 10.3 and go 1.16.3.
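
As a rough sketch of that normalization, the arithmetic at the end of main() in the listings further down can be pulled out into one function. The name score and its parameter list are my own for illustration; the operation count, the factor of two for the real transform and the reference constants (57.15 for Go, 65.81 for C) are taken from the listings.

```go
package main

import (
    "fmt"
    "math"
)

// score mirrors the final arithmetic in the listings. trmin and tcmin are
// the best real and complex round-trip times in seconds, n is the transform
// length and ref is the Mtflops constant the listing divides by
// (57.15 in realfft.go, 65.81 in rfft.c).
func score(trmin, tcmin float64, n int, ref float64) float64 {
    ops := 5 * float64(n) * math.Log2(float64(n)) // 2*N*log2(N)+3*N*log2(N)
    rflops := ops / trmin / 2e6                   // the listings halve the count for the real case
    cflops := ops / tcmin / 1e6
    return math.Sqrt(rflops*cflops) / ref // geometric mean relative to the Pi 4B
}

func main() {
    // Best times printed by the Go run on the Pentium 4 in this post.
    fmt.Printf("Single-core speed is %.4g times a Pi 4B\n",
        score(4.3555, 8.5931, 4194304, 57.15))
}
```

Feeding in the rounded best times from the Pentium 4 run should land very close to the 0.9331 printed by the program itself.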

After this, the same two programs were compiled and run on a number of non-Raspberry Pi computers to see how much the relative speed between C and Go might change depending on architecture. The results of such a comparison can hopefully then be used to infer the maturity of Go on the Pi compared to other platforms.

Here are some preliminary results:

Code: Select all

$ grep "model name" /proc/cpuinfo | head -n1
model name  : Intel(R) Pentium(R) 4 CPU 3.40GHz
$ ./realfft
realfft.go -- Perform real to complex Fourier transform
Version=5; N=4194304

   run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
     1  7.34091862e-13  4.42385340e+00  7.37445718e-13  8.72595406e+00
     2  7.34091862e-13  4.35549045e+00  7.37445718e-13  8.59305906e+00
     3  7.34091862e-13  4.35643291e+00  7.37445718e-13  8.59572983e+00

Best real=4.3555e+00 sec; Mtflops=5.2965e+01
Best complex=8.5931e+00 sec; Mtflops=5.3691e+01
Single-core speed is 0.9331 times a Pi 4B
$ ./rfft 
rfft.c -- Perform real to complex Fourier transform
Version=5; N=4194304

   run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
     1  7.09627869e-13  4.36951400e+00  7.13269560e-13  8.79502800e+00
     2  7.09627869e-13  4.28870000e+00  7.13269560e-13  8.65559600e+00
     3  7.09627869e-13  4.29111500e+00  7.13269560e-13  8.65074200e+00

Best real=4.2887e+00 sec; Mtflops=5.3789e+01
Best complex=8.6507e+00 sec; Mtflops=5.3333e+01
Single-core speed is 0.8139 times a Pi 4B
and

Code: Select all

$ grep "model name" /proc/cpuinfo | head -n1
model name  : AMD EPYC 7702 64-Core Processor
$ ./realfft
realfft.go -- Perform real to complex Fourier transform
Version=5; N=4194304

   run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
     1  7.34091862e-13  1.33623481e+00  7.37445718e-13  2.65256858e+00
     2  7.34091862e-13  1.32155347e+00  7.37445718e-13  2.62144589e+00
     3  7.34091862e-13  1.32319140e+00  7.37445718e-13  2.62295699e+00

Best real=1.3216e+00 sec; Mtflops=1.7456e+02
Best complex=2.6214e+00 sec; Mtflops=1.7600e+02
Single-core speed is 3.067 times a Pi 4B
$ ./rfft
rfft.c -- Perform real to complex Fourier transform
Version=5; N=4194304

   run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
     1  6.86626828e-13  1.02936000e+00  6.89441082e-13  2.06730100e+00
     2  6.86626828e-13  1.01212500e+00  6.89441082e-13  2.02839800e+00
     3  6.86626828e-13  1.01509700e+00  6.89441082e-13  2.03150100e+00

Best real=1.0121e+00 sec; Mtflops=2.2792e+02
Best complex=2.0284e+00 sec; Mtflops=2.2746e+02
Single-core speed is 3.46 times a Pi 4B
From this one sees that the Go compiler does relatively better on the Pentium IV but GCC is relatively better on the EPYC.

For reference the source codes for the two programs are as follows:

Code: Select all

/*  realfft.go -- Perform real to complex Fourier transform
    Written May 10, 2021 by Eric Olson */

package main

import ("fmt"; "os"; "time"; "math/cmplx";
    "math"; "reflect"; "unsafe")

var tictime float64
func tic() {
    now:=time.Now()
    tictime=float64(now.Unix())+1.0E-9*float64(now.Nanosecond())
}
func toc() float64 {
    now:=time.Now()
    return float64(now.Unix())+1.0E-9*float64(now.Nanosecond())-tictime
}

type rstate struct {
    x,w,s uint64
}
var gs=rstate{0,0,0xb5ad4eceda1ce2a9}
func rint32() uint32 {
    gs.x*=gs.x; gs.w+=gs.s
    gs.x+=gs.w; gs.x=(gs.x>>32)|(gs.x<<32)
    return uint32(gs.x)
}
func rseed(x,w,s uint64) {
    gs.x=x; gs.w=w; gs.s=s|1
}

func cfft(xhat,x []complex128,s,n int){
    if n==1 {
        xhat[0]=x[0]
        return
    }
    if n%2!=0 {
        fmt.Printf("Error: cfft called with non-power-of-two argument!\n")
        os.Exit(1)
    }
    n2:=n/2
    cfft(xhat,x,2*s,n2)
    cfft(xhat[n2:],x[s:],2*s,n2)
    for l:=0;l<n2;l++ {
        theta:=-2*math.Pi*float64(l)/float64(n)
        ts,tc:=math.Sincos(theta)
        t1:=xhat[l]
        t2:=complex(tc,ts)*xhat[l+n2]
        xhat[l]=t1+t2; xhat[l+n2]=t1-t2
    }
}

func cfift(x,xhat []complex128,s,n int){
    if n==1 {
        x[0]=xhat[0]
        return
    }
    if n%2!=0 {
        fmt.Printf("Error: cfift called with non-power-of-two argument!\n")
        os.Exit(1)
    }
    n2:=n/2
    cfift(x,xhat,2*s,n2)
    cfift(x[n2:],xhat[s:],2*s,n2)
    for l:=0;l<n2;l++ {
        theta:=2*math.Pi*float64(l)/float64(n)
        ts,tc:=math.Sincos(theta)
        t1:=x[l]
        t2:=complex(tc,ts)*x[l+n2]
        x[l]=t1+t2; x[l+n2]=t1-t2
    }
}

func rfft(xhat []complex128,x []float64,s,n int){
    if n%2!=0 {
        fmt.Printf("Error: rfft called with non-power-of-two argument!\n")
        os.Exit(1)
    }
    n2:=n/2
    var xc []complex128
    xsh:=(*reflect.SliceHeader)(unsafe.Pointer(&x))
    xcsh:=(*reflect.SliceHeader)(unsafe.Pointer(&xc))
    xcsh.Data=xsh.Data
    xcsh.Len=xsh.Len/2; xcsh.Cap=xsh.Len/2
    cfft(xhat,xc,s,n2)
    n4:=n2/2
    t1:=xhat[0]
    xhat[0]=complex(real(t1)+imag(t1),0)
    xhat[s*n2]=complex(real(t1)-imag(t1),0)
    for l:=1;l<=n4;l++ {
        theta:=-2*math.Pi*float64(l)/float64(n)
        ts,tc:=math.Sincos(theta)
        ie:=complex(-ts,tc)
        q1:=xhat[s*l]; q2:=cmplx.Conj(xhat[s*(n2-l)])
        t1:=q1+q2; t2:=q1-q2
        xhat[s*l]=(t1-ie*t2)/2
        xhat[s*(n2-l)]=cmplx.Conj(t1+ie*t2)/2
    }
}

func rfift(x []float64,xhat []complex128,s,n int) {
    if n%2!=0 {
        fmt.Printf("Error: rfift called with non-power-of-two argument!\n")
        os.Exit(1)
    }
    n2:=n/2
    var xc []complex128
    xsh:=(*reflect.SliceHeader)(unsafe.Pointer(&x))
    xcsh:=(*reflect.SliceHeader)(unsafe.Pointer(&xc))
    xcsh.Data=xsh.Data
    xcsh.Len=xsh.Len/2; xcsh.Cap=xsh.Len/2
    n4:=n2/2
    for k:=0; k<=n4; k++ {
        theta:=2*math.Pi*float64(k)/float64(n)
        ts,tc:=math.Sincos(theta)
        ie:=complex(-ts,tc)
        q1:=xhat[s*k]; q2:=cmplx.Conj(xhat[s*(n2-k)])
        t1:=q1+q2; t2:=q1-q2
        xhat[s*k]=t1+ie*t2
        xhat[s*(n2-k)]=cmplx.Conj(t1-ie*t2)
    }
    cfift(xc,xhat,s,n2)
}

const N=4194304
var xr,xrs [N]float64
var    x,xs,xhat [N]complex128
var xrhat [N/2+1]complex128

var trmin,tcmin float64=0,0
var rnorm,cnorm float64=0,0
func dotest() {
    rseed(0,0,0xb5ad4eceda1ce2a9)
    for l:=0;l<N;l++ {
        xr[l]=2*float64(rint32())/(1<<32)-1
        x[l]=complex(xr[l],0)
    }
    tic()
    rfft(xrhat[:],xr[:],1,N)
    rfift(xrs[:],xrhat[:],1,N)
    tr:=toc()
    if trmin==0 || tr<trmin { trmin=tr }
    for l:=0;l<N;l++ { xrs[l]/=N }
    r:=float64(0.0)
    for l:=0;l<N;l++ {
        dx:=xr[l]-xrs[l]
        r+=dx*dx
    }
    r=math.Sqrt(r)
    fmt.Printf(" %15.8e",r)
    if rnorm==0 { rnorm=r 
    } else if rnorm!=r {
        fmt.Printf("\nReal floating point error detected!\n")
        os.Exit(1)
    }
    fmt.Printf(" %15.8e",tr)
    tic()
    cfft(xhat[:],x[:],1,N)
    cfift(xs[:],xhat[:],1,N)
    tc:=toc()
    if tcmin==0 || tc<tcmin { tcmin=tc }
    for l:=0;l<N;l++ { xs[l]/=N }
    r=0
    for l:=0;l<N;l++ {
        dx:=x[l]-xs[l]
        r+=real(dx*cmplx.Conj(dx))
    }
    r=math.Sqrt(r)
    fmt.Printf(" %15.8e",r)
    if cnorm==0 { cnorm=r 
    } else if cnorm!=r {
        fmt.Printf("\nComplex floating point error detected!\n")
        os.Exit(1)
    }
    fmt.Printf(" %15.8e\n",tc)
}

func main(){
    fmt.Printf("realfft.go -- Perform real to complex Fourier transform\n")
    fmt.Printf("Version=%d; N=%d\n\n",6,N)
    fmt.Printf("%6s %15s %15s %15s %15s\n",
        "run","norm(xr-xrs)","real sec","norm(x-xs)","complex sec")
    for w:=0;w<3;w++ {
        fmt.Printf("%6d",w+1)
        dotest()
    }
    ops:=2*N*math.Log2(N)+3*N*math.Log2(N)
    rflops:=ops/trmin/2e6
    cflops:=ops/tcmin/1e6
    fmt.Printf("\nBest real=%.4e sec; Mtflops=%.4e\n",
        trmin,rflops)
    fmt.Printf("Best complex=%.4e sec; Mtflops=%.4e\n",
        tcmin,cflops)
    fmt.Printf("Single-core speed is %.4g times a Pi 4B\n",
        math.Sqrt(rflops*cflops)/57.15)
    os.Exit(0)
}
and

Code: Select all

/*  rfft.c -- Perform real to complex Fourier transform
    Written May 10, 2021 by Eric Olson */

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <complex.h>
#include <math.h>

static struct timeval tic_start;
void tic() {
    gettimeofday(&tic_start,0);
}
double toc() {
    struct timeval tic_stop;
    gettimeofday(&tic_stop,0);
    double sec=tic_stop.tv_sec-tic_start.tv_sec;
    return sec+(tic_stop.tv_usec-tic_start.tv_usec)*1.0e-6;
}

typedef struct {
    uint64_t x,w,s;
} rstate;
rstate gs={0,0,0xb5ad4eceda1ce2a9};
uint32_t rint32(){
    gs.x*=gs.x; gs.w+=gs.s;
    gs.x+=gs.w; gs.x=(gs.x>>32)|(gs.x<<32);
    return (uint32_t)gs.x;
}
void rseed(uint64_t x,uint64_t w,uint64_t s) {
    gs.x=x; gs.w=w; gs.s=s|1;
}

typedef double complex Complex;
typedef double Real;

void cfft(Complex *xhat,Complex *x,int s,int n){
    if(n==1){
        xhat[0]=x[0];
        return;
    }
    if(n%2){
        printf("Error: cfft called with non-power-of-two argument!\n");
        exit(1);
    }
    int n2=n/2;
    cfft(xhat,x,2*s,n2);
    cfft(xhat+n2,x+s,2*s,n2);
    for(int l=0;l<n2;l++){
        Real theta=-2*M_PI*l/n;
        Real ts=sin(theta),tc=cos(theta);
        Complex t1=xhat[l];
        Complex t2=(tc+1i*ts)*xhat[l+n2];
        xhat[l]=t1+t2; xhat[l+n2]=t1-t2;
    }
}

void cfift(Complex *x,Complex *xhat,int s,int n){
    if(n==1){
        x[0]=xhat[0];
        return;
    }
    if(n%2){
        printf("Error: cfift called with non-power-of-two argument!\n");
        exit(1);
    }
    int n2=n/2;
    cfift((Complex *)x,xhat,2*s,n2);
    cfift(x+n2,xhat+s,2*s,n2);
    for(int l=0;l<n2;l++){
        Real theta=2*M_PI*l/n;
        Real ts=sin(theta),tc=cos(theta);
        Complex t1=x[l];
        Complex t2=(tc+1i*ts)*x[l+n2];
        x[l]=t1+t2; x[l+n2]=t1-t2;
    }
}

void rfft(Complex *xhat,Real *x,int s,int n){
    if(n%2){
        printf("Error: rfft called with non-power-of-two argument!\n");
        exit(1);
    }
    int n2=n/2;
    cfft(xhat,(Complex *)x,s,n2);
    int n4=n2/2;
    Complex t1=xhat[0];
    xhat[0]=creal(t1)+cimag(t1);
    xhat[s*n2]=creal(t1)-cimag(t1);
    for(int l=1;l<=n4;l++){
        Real theta=-2*M_PI*l/n;
        Real ts=sin(theta),tc=cos(theta);
        Complex ie=-ts+1i*tc;
        Complex q1=xhat[s*l], q2=conj(xhat[s*(n2-l)]);
        Complex t1=q1+q2, t2=q1-q2;
        xhat[s*l]=(t1-ie*t2)/2;
        xhat[s*(n2-l)]=conj(t1+ie*t2)/2;
    }
}

void rfift(Real *x,Complex *xhat,int s,int n){
    if(n%2){
        printf("Error: rfift called with non-power-of-two argument!\n");
        exit(1);
    }
    int n2=n/2;
    int n4=n2/2;
    for(int k=0;k<=n4;k++){
        Real theta=2*M_PI*k/n;
        Real ts=sin(theta),tc=cos(theta);
        Complex ie=-ts+1i*tc;
        Complex q1=xhat[s*k], q2=conj(xhat[s*(n2-k)]);
        Complex t1=q1+q2, t2=q1-q2;
        xhat[s*k]=t1+ie*t2;
        xhat[s*(n2-k)]=conj(t1-ie*t2);
    }
    cfift((Complex *)x,xhat,s,n2);
}

#define N 4194304

Real xr[N],xrs[N];
Complex x[N],xs[N],xhat[N];
Complex xrhat[N/2+1];

Real trmin=0,tcmin=0;
Real rnorm=0,cnorm=0;
void dotest(){
    rseed(0,0,0xb5ad4eceda1ce2a9);
    for(int l=0;l<N;l++){
        xr[l]=2.0*rint32()/((uint64_t)1<<32)-1;
        x[l]=xr[l];
    }
    tic();
    rfft(xrhat,xr,1,N);
    rfift(xrs,xrhat,1,N);
    double tr=toc();
    if(trmin==0||tr<trmin) trmin=tr;
    for(int l=0;l<N;l++) xrs[l]/=N;
    Real r=0;
    for(int l=0;l<N;l++){
        Real dx=xr[l]-xrs[l];
        r+=dx*dx;
    }
    r=sqrt(r);
    printf(" %15.8e",r);
    if(rnorm==0) rnorm=r;
    else if(rnorm!=r){
        printf("Real floating point error detected!\n");
        exit(1);
    }
    printf(" %15.8e",tr); fflush(stdout);
    tic();
    cfft(xhat,x,1,N);
    cfift(xs,xhat,1,N);
    double tc=toc();
    if(tcmin==0||tc<tcmin) tcmin=tc;
    for(int l=0;l<N;l++) xs[l]/=N;
    r=0;
    for(int l=0;l<N;l++){
        Complex dx=x[l]-xs[l];
        r+=dx*conj(dx);
    }
    r=sqrt(r);
    printf(" %15.8e",r);
    if(cnorm==0) cnorm=r;
    else if(cnorm!=r){
        printf("Complex floating point error detected!\n");
        exit(1);
    }
    printf(" %15.8e\n",tc); fflush(stdout);
}

int main(){
    printf("rfft.c -- Perform real to complex Fourier transform\n");
    printf("Version=%d; N=%d\n\n",6,N);
    printf("%6s %15s %15s %15s %15s\n",
        "run","norm(xr-xrs)","real sec","norm(x-xs)","complex sec");
    for(int w=0;w<3;w++){
        printf("%6d",w+1); fflush(stdout);
        dotest();
    }
    Real ops=2*N*log2(N)+3*N*log2(N);
    Real rflops=ops/trmin/2e6;
    Real cflops=ops/tcmin/1e6;
    printf("\nBest real=%.4e sec; Mtflops=%.4e\n",
        trmin,rflops);
    printf("Best complex=%.4e sec; Mtflops=%.4e\n",
        tcmin,cflops);
    printf("Single-core speed is %.4g times a Pi 4B\n",
        sqrt(rflops*cflops)/65.81);
    exit(0);
}
Additional points of comparison between these two programs for other computers would be greatly appreciated.

Edit: Changed rseed to make sure the generator for the Weyl sequence is odd--this doesn't affect anything except reuse of the random number code in other projects--and updated the version to 6.
Last edited by ejolson on Thu May 13, 2021 2:33 pm, edited 7 times in total.
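
To illustrate why an odd increment matters, here is a small sketch. The Weyl sequence in rint32 is the running sum w += s, which wraps modulo 2^64. The example below shrinks that to 8 bits, purely an assumption on my part so the loop finishes instantly, and counts the steps before the sum returns to zero. The helper name weylPeriod is hypothetical and not part of the benchmark.

```go
package main

import "fmt"

// weylPeriod counts how many steps the 8-bit Weyl sequence w += s takes
// to return to its starting value 0. The 8-bit width is chosen only to
// keep the demonstration short; the listings use 64 bits.
func weylPeriod(s uint8) int {
    var w uint8
    steps := 0
    for {
        w += s // wraps modulo 256
        steps++
        if w == 0 {
            return steps
        }
    }
}

func main() {
    fmt.Println("even increment 4:", weylPeriod(4))
    fmt.Println("forced odd 4|1:  ", weylPeriod(4|1)) // same trick as s|1 in rseed
}
```

An even increment shares a factor of two with the modulus, so the sum cycles through only part of the range, while an odd increment visits every value before repeating.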

jahboater
Posts: 7074
Joined: Wed Feb 04, 2015 6:38 pm
Location: Wonderful West Dorset

Re: How Slow is Go?

Tue May 11, 2021 7:12 am

ejolson wrote:
Tue May 11, 2021 6:57 am
My understanding is the Go compiler quickly produces slower executables than either GCC or LLVM because it's based on the Plan 9 toolchain. The idea in this thread is to compare the relative slowness of Go on the Pi compared to the relative slowness of Go on other architectures. To do this, near-identical programs which calculate complex to real Fourier transforms using a non-optimal implementation of the divide-and-conquer algorithm known as the FFT were written in both the C and Go programming languages.
The default GCC compiler on the Pi should be able to compile go.

Code: Select all

pi@pi:~ $ gcc-8 -v
Using built-in specs.
COLLECT_GCC=gcc-8
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 8.3.0-6' --with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr 
Never tried it.
I am just re-installing GCC-11.1 with go added.

ejolson
Posts: 7242
Joined: Tue Mar 18, 2014 11:47 am

Re: How Slow is Go?

Tue May 11, 2021 7:32 am

jahboater wrote:
Tue May 11, 2021 7:12 am
ejolson wrote:
Tue May 11, 2021 6:57 am
My understanding is the Go compiler quickly produces slower executables than either GCC or LLVM because it's based on the Plan 9 toolchain. The idea in this thread is to compare the relative slowness of Go on the Pi compared to the relative slowness of Go on other architectures. To do this, near-identical programs which calculate complex to real Fourier transforms using a non-optimal implementation of the divide-and-conquer algorithm known as the FFT were written in both the C and Go programming languages.
The default GCC compiler on the Pi should be able to compile go.

Code: Select all

pi@pi:~ $ gcc-8 -v
Using built-in specs.
COLLECT_GCC=gcc-8
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 8.3.0-6' --with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr 
Never tried it.
I am just re-installing GCC-11.1 with go added.
The GCC version of Go is interesting as it doesn't use the Plan 9 backend. I found gccgo about twice as fast for the Barnsley fern compared to the usual Go compiler. However, every other program I've written was slower with gccgo. Note I've not tried version 11.1 of GCC Go.
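
When comparing timings from the two Go implementations, it may help to have each binary report which one built it. A minimal sketch using the standard runtime package follows; the function name toolchainInfo is just a placeholder of mine.

```go
package main

import (
    "fmt"
    "runtime"
)

// toolchainInfo reports which Go implementation and target produced the
// running binary. runtime.Compiler is "gc" for the usual Go compiler and
// "gccgo" for the GCC front end.
func toolchainInfo() string {
    return fmt.Sprintf("compiler=%s version=%s os=%s arch=%s",
        runtime.Compiler, runtime.Version(), runtime.GOOS, runtime.GOARCH)
}

func main() {
    fmt.Println(toolchainInfo())
}
```

Printing this line next to the benchmark output would make it harder to mix up results from the two compilers or from 32-bit and 64-bit builds.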

Heater
Posts: 18252
Joined: Tue Jul 17, 2012 3:02 pm

Re: How Slow is Go?

Tue May 11, 2021 3:29 pm

I'm on a go slow so I don't know.

My first approximation guess would be that there is no reason Go cannot crunch on numbers as fast as C or any other compiled language. Give or take a bit. Especially if its code is generated by the GCC backend. Same as C/C++.

However, some years ago when I looked into Go it was a no go.

The task was consuming XML streams. Extracting this and that from it and passing on the results to other XML streams. Grief it was so stuttery in performance. Latency was all over the place. I presume because of all the garbage collection going on. When I found I could do all that with Javascript under node.js with not so much less overall throughput but much more predictable latency I decided to skip Go and go elsewhere.

I have since read that Go had a bit of a problem with memory management, especially on 32 bit systems as we had then. I have not revisited Go to see if that has improved or not.

In short, the FFT is a great number crunching benchmark but does not say much about real application performance.
Memory in C++ is a leaky abstraction .

ejolson
Posts: 7242
Joined: Tue Mar 18, 2014 11:47 am

Re: How Slow is Go?

Tue May 11, 2021 3:48 pm

Heater wrote:
Tue May 11, 2021 3:29 pm
I'm on a go slow so I don't know.

My first approximation guess would be that there is no reason Go cannot crunch on numbers as fast as C or any other compiled language. Give or take a bit. Especially if its code is generated by the GCC backend. Same as C/C++.

However, some years ago when I looked into Go it was a no go.

The task was consuming XML streams. Extracting this and that from it and passing on the results to other XML streams. Grief it was so stuttery in performance. Latency was all over the place. I presume because of all the garbage collection going on. When I found I could do all that with Javascript under node.js with not so much less overall throughput but much more predictable latency I decided to skip Go and go elsewhere.

I have since read that Go had a bit of a problem with memory management, especially on 32 bit systems as we had then. I have not revisited Go to see if that has improved or not.

In short, the FFT is a great number crunching benchmark but does not say much about real application performance.
Weirdly about half of my real applications use FFT-based spectral methods for the approximation of solutions to differential equations. The particular FFT code here recomputes the trigonometric coefficients as needed by calling sine and cosine in the inner loop. Typical implementations use either a lookup table or an algebraic recurrence to obtain these coefficients. As a benchmark, then, this code not only depends on flops and memory bandwidth but the evaluation speed of the circular transcendental functions.
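
For comparison, here is a hedged sketch of the lookup-table alternative mentioned above. It is not what the benchmark does, since the whole point there is to call sine and cosine in the inner loop. The helpers twiddleTable and twiddle are hypothetical names, and the sketch assumes the sub-transform length m divides the full length n.

```go
package main

import (
    "fmt"
    "math"
    "math/cmplx"
)

// twiddleTable precomputes exp(-2*pi*i*l/n) for l = 0 .. n/2-1.
// Hypothetical helper: the benchmark listings call math.Sincos inside
// the butterfly loop instead of using a table.
func twiddleTable(n int) []complex128 {
    w := make([]complex128, n/2)
    for l := range w {
        s, c := math.Sincos(-2 * math.Pi * float64(l) / float64(n))
        w[l] = complex(c, s)
    }
    return w
}

// twiddle returns exp(-2*pi*i*l/m) for a sub-transform of length m by
// striding through the table built for the full length n = 2*len(w).
// It assumes m divides n and 0 <= l < m/2.
func twiddle(w []complex128, m, l int) complex128 {
    n := 2 * len(w)
    return w[l*(n/m)]
}

func main() {
    w := twiddleTable(1024)
    // Compare one table lookup against direct evaluation for m=64, l=5.
    s, c := math.Sincos(-2 * math.Pi * 5 / 64)
    diff := cmplx.Abs(twiddle(w, 64, 5) - complex(c, s))
    fmt.Println("lookup agrees with direct evaluation:", diff < 1e-12)
}
```

With a table like this every level of the recursion reads the same N/2 precomputed values, so the transcendental functions drop out of the timing almost entirely, which is exactly what the benchmark here avoids.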

My hope is the current code provides a number of opportunities for a compiler to optimise things in the context of a verified computation that guarantees a certain amount of work be performed.

I'm expecting to see a dramatic difference between the 64-bit and 32-bit Go compiler but haven't checked this yet.

At any rate, the idea to compare how the relative performance of near-identical implementations of the same algorithm in different languages depends on computer architecture seems interesting to me no matter what problem is focused on.

Do you have a different real-world problem in mind?

Heater
Posts: 18252
Joined: Tue Jul 17, 2012 3:02 pm

Re: How Slow is Go?

Tue May 11, 2021 4:29 pm

ejolson wrote:
Tue May 11, 2021 3:48 pm
At any rate, the idea to compare how the relative performance of near-identical implementations of the same algorithm in different languages depends on computer architecture seems interesting to me no matter what problem is focused on.
Indeed. I think we have been toying with that idea for some years now.
ejolson wrote:
Tue May 11, 2021 3:48 pm
Do you have a different real-world problem in mind?
In this case not other than my old experiments with parsing and distributing XML streams some years ago. None of which code I have anymore and reproducing the experiments would be far from trivial, requiring a bunch of clients to stimulate the system.

ejolson
Posts: 7242
Joined: Tue Mar 18, 2014 11:47 am

Re: How Slow is Go?

Tue May 11, 2021 8:47 pm

ejolson wrote:
Tue May 11, 2021 6:57 am
Additional points of comparison between these two programs for other computers would be greatly appreciated.
Here is some data in the form of a graph:

Image

Graphed are the ratios X/Y where X and Y are obtained from the output as
  • Single-core speed is X times a Pi 4B (for the Go run);
  • Single-core speed is Y times a Pi 4B (for the C run).
    For example, the output in the first post indicates that X=0.9331 and Y=0.8139 for the Pentium IV which is then plotted as X/Y=1.1465 using logarithmic coordinates as mandated by the dog developer.
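
The arithmetic behind each plotted point can be sketched in a few lines. The scores are copied from the two runs in the first post; the function name ratio is mine. The base-10 logarithm is printed as well because on a logarithmic axis a ratio r and its reciprocal 1/r sit at equal distances on either side of 1.

```go
package main

import (
    "fmt"
    "math"
)

// ratio returns X/Y, the Go score divided by the C score on one machine.
func ratio(goScore, cScore float64) float64 {
    return goScore / cScore
}

func main() {
    // Scores taken from the "Single-core speed" lines in the first post.
    p4 := ratio(0.9331, 0.8139)
    epyc := ratio(3.067, 3.46)
    fmt.Printf("Pentium 4: X/Y=%.4f log10=%+.4f\n", p4, math.Log10(p4))
    fmt.Printf("EPYC 7702: X/Y=%.4f log10=%+.4f\n", epyc, math.Log10(epyc))
}
```

A value above 1 means Go did relatively better than on the Pi 4B and a value below 1 means C did relatively better.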

    It would appear that the EPYC 7702 is the worst processor on which to run Go while a Pentium IV is the best. Interestingly, the relative performance between Go and C on the EPYC 7371 was the closest match to the Raspberry Pi 4B.

    I also ran the programs on an i3 550 and a Gold 6126, but Go did so much better than C on these systems that I became suspicious something was wrong. It's notable, except for the Raspberry Pi, that those two outliers were the only systems running a Debian based distribution in the above set of tests.

    I wonder how the M1 would compare.

    ejolson
    Posts: 7242
    Joined: Tue Mar 18, 2014 11:47 am

    Re: How Slow is Go?

    Wed May 12, 2021 12:06 am

    ejolson wrote:
    Tue May 11, 2021 8:47 pm
    I also ran the programs on an i3 550 and a Gold 6126, but Go did so much better than C on these systems that I became suspicious something was wrong. It's notable, except for the Raspberry Pi, that those two outliers were the only systems running a Debian based distribution in the above set of tests.
    I've installed GCC versions 10.3 and 11.1 on the Gold 6126 system but it's no use. The C code still runs slow. Here is a sample run:

    Code: Select all

    $ grep "model name" /proc/cpuinfo | head -n1
    model name  : Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz
    $ ./realfft
    realfft.go -- Perform real to complex Fourier transform
    Version=5; N=4194304
    
       run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
         1  7.34091862e-13  1.17770433e+00  7.37445718e-13  2.36625028e+00
         2  7.34091862e-13  1.14570308e+00  7.37445718e-13  2.29919839e+00
         3  7.34091862e-13  1.14580536e+00  7.37445718e-13  2.29654408e+00
    
    Best real=1.1457e+00 sec; Mtflops=2.0135e+02
    Best complex=2.2965e+00 sec; Mtflops=2.0090e+02
    Single-core speed is 3.519 times a Pi 4B
    $ ./rfft
    rfft.c -- Perform real to complex Fourier transform
    Version=5; N=4194304
    
       run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
         1  6.86568772e-13  2.27627600e+00  6.89460460e-13  4.60920300e+00
         2  6.86568772e-13  2.25957400e+00  6.89460460e-13  4.59094000e+00
         3  6.86568772e-13  2.25811600e+00  6.89460460e-13  4.55222700e+00
    
    Best real=2.2581e+00 sec; Mtflops=1.0216e+02
    Best complex=4.5522e+00 sec; Mtflops=1.0135e+02
    Single-core speed is 1.546 times a Pi 4B
    
    In this case X/Y=3.519/1.546=2.276, which is way too far from 1 to plot on the graph with the other results.

    According to

    Code: Select all

    $ ldd rfft
        linux-vdso.so.1 (0x00007ffcee3d8000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fecb50e6000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fecb4cf5000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fecb5484000)
    
    the C executable still links to the system libc and libm math libraries. These come from the standard Ubuntu repository for the bionic release. That may be the reason for the unexpected factor-of-two slowdown in execution speed. Ideally the system libraries should be optimized for the particular processor present on the system, but they may have been compiled for a generic x86 processor. If this is the sole cause of the slowdown, I'm astonished it wasn't corrected a long time ago.

    I'll try again with a container, to see if it's possible to obtain better results by switching out the system libraries for something more optimized. I wonder if a similar effect happens with 32-bit Raspberry Pi OS where certain system libraries have been compiled to be ARMv6 compatible even when running on an ARMv7 capable device.

    ejolson
    Posts: 7242
    Joined: Tue Mar 18, 2014 11:47 am

    Re: How Slow is Go?

    Wed May 12, 2021 1:02 am

    ejolson wrote:
    Wed May 12, 2021 12:06 am
    I'll try again with a container, to see if it's possible to obtain better results by switching out the system libraries for something more optimized.
    To make a suitable container I downloaded the Void Linux root filesystem, unpacked it, converted it, entered the container as root and installed gcc.

    Code: Select all

    # wget https://alpha.de.repo.voidlinux.org/live/current/void-x86_64-ROOTFS-20210218.tar.xz
    # mkdir rootfs
    # tar xf void-x86_64-ROOTFS-20210218.tar.xz -C rootfs
    # singularity build --sandbox void rootfs
    # singularity shell -w void
    Singularity> xbps-install -Su xbps
    Singularity> xbps-install -u
    Singularity> xbps-remove base-voidstrap
    Singularity> xbps-install gcc make
    Singularity> exit
    
    Then I entered the container as a user, changed to the directory with the Fourier transform code, compiled it and ran it.

    Code: Select all

    $ singularity shell -w void
    Singularity> cd
    Singularity> cd code/rfft
    Singularity> gcc -O3 -mtune=native -march=native -o rfft rfft.c -lm
    Singularity> grep "model name" /proc/cpuinfo | head -n1
    model name  : Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz
    Singularity> ./rfft 
    rfft.c -- Perform real to complex Fourier transform
    Version=5; N=4194304
    
       run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
         1  6.86626828e-13  1.01068300e+00  6.89441082e-13  2.03566600e+00
         2  6.86626828e-13  9.93252000e-01  6.89441082e-13  1.98907000e+00
         3  6.86626828e-13  9.91678000e-01  6.89441082e-13  1.98545400e+00
    
    Best real=9.9168e-01 sec; Mtflops=2.3262e+02
    Best complex=1.9855e+00 sec; Mtflops=2.3238e+02
    Single-core speed is 3.533 times a Pi 4B
    
    For a small amount of work, a 3.533/1.546=2.285-fold improvement seems pretty good. The balance between Go and C is now

    X/Y=3.519/3.533=.9960

    which is much more reasonable. I think using containers could provide a similar solution in case C under 32-bit Raspberry Pi OS is also slow.

    The updated 64-bit graph is

    Image
    Last edited by ejolson on Wed May 12, 2021 10:34 pm, edited 1 time in total.

    ejolson
    Posts: 7242
    Joined: Tue Mar 18, 2014 11:47 am

    Re: How Slow is Go?

    Wed May 12, 2021 5:02 pm

    ejolson wrote:
    Wed May 12, 2021 1:02 am
    I think using containers could provide a similar solution in case C under 32-bit Raspberry Pi OS is also slow.
    I was walking at the park with the dog developer. In the middle of explaining how comparing Go and C performance can be used to detect hidden toolchain problems on operational computer systems, I nearly fell into what looked like a rabbit hole right next to the footpath. Upon closer examination Fido remarked the hole did not smell like rabbits but more like squirrel or perhaps gopher.

    Indeed it is possible to download a fully functional Go compiler that runs on Windows, MacOS and Linux directly from

    https://golang.org/dl/

    While there are binaries for 64-bit ARM, AMD, PowerPC and System 390 along with 32-bit binaries for x86 and ARMv6, notably missing is an ARMv7 compiler. It's possible a Go compiler targeting ARMv7 can be built from source, but there is also the GCC backend for less common architectures such as the SPARC, Alpha and if Fido is to be believed, soon the BARK™ with three-tier segmented memory. Apparently the different memory tiers make collecting lots of garbage a zero-cost operation.

    The testing procedure is to compile both programs with

    Code: Select all

    $ go build realfft.go
    $ gcc -O3 -march=native -mtune=native -o rfft rfft.c -lm
    
    and then run each one to see if the speed relative to the Pi 4B reported on the last line of the output is nearly the same. If those relative speeds are quite different, then likely something is wrong with either the Plan 9 based Go compiler or more likely the GCC toolchain.
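
The comparison step could be automated along the following lines. This is only a sketch: parseSpeed and suspicious are hypothetical helpers, and the threshold of 1.5 is an arbitrary choice for illustration, not something measured in this thread. The two sample lines are the Gold 6126 results posted earlier, before the container fix.

```go
package main

import (
    "fmt"
    "regexp"
    "strconv"
)

// speedLine matches the final line printed by both benchmark programs.
var speedLine = regexp.MustCompile(`Single-core speed is (\S+) times a Pi 4B`)

// parseSpeed extracts the relative speed from a program's captured output.
func parseSpeed(output string) (float64, error) {
    m := speedLine.FindStringSubmatch(output)
    if m == nil {
        return 0, fmt.Errorf("no speed line found")
    }
    return strconv.ParseFloat(m[1], 64)
}

// suspicious reports whether X/Y falls outside [1/limit, limit].
// The limit is an assumed threshold chosen for illustration.
func suspicious(x, y, limit float64) bool {
    r := x / y
    return r > limit || r < 1/limit
}

func main() {
    goOut := "Single-core speed is 3.519 times a Pi 4B\n"
    cOut := "Single-core speed is 1.546 times a Pi 4B\n"
    x, err := parseSpeed(goOut)
    if err != nil {
        panic(err)
    }
    y, err := parseSpeed(cOut)
    if err != nil {
        panic(err)
    }
    fmt.Printf("X/Y=%.3f suspicious=%v\n", x/y, suspicious(x, y, 1.5))
}
```

In practice the two strings would come from capturing the output of ./realfft and ./rfft rather than being typed in by hand.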

    I was surprised to detect a problem with two Debian based systems that have otherwise proved quite useful. I wonder how the different 32-bit distributions for the Raspberry Pi will fare.

    ejolson
    Posts: 7242
    Joined: Tue Mar 18, 2014 11:47 am

    Re: How Slow is Go?

    Wed May 12, 2021 10:25 pm

    ejolson wrote:
    Wed May 12, 2021 5:02 pm
    If those relative speeds are quite different, then likely something is wrong with either the Plan 9 based Go compiler or more likely the GCC toolchain.
    Here is a preliminary run under 32-bit Raspberry Pi OS on a Pi 4B using a custom-built installation of gcc 10.1 and the ARMv6 binary for Go 1.16.3 downloadable from the golang website.

    Code: Select all

    $ grep "model name" /proc/cpuinfo | head -n1
    model name  : ARMv7 Processor rev 3 (v7l)
    $ ./realfft
    realfft.go -- Perform real to complex Fourier transform
    Version=5; N=4194304
    
       run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
         1  7.34091862e-13  6.63306522e+00  7.37445718e-13  1.31804063e+01
         2  7.34091862e-13  6.50276375e+00  7.37445718e-13  1.28566024e+01
         3  7.34091862e-13  6.47177792e+00  7.37445718e-13  1.28350101e+01
    
    Best real=6.4718e+00 sec; Mtflops=3.5645e+01
    Best complex=1.2835e+01 sec; Mtflops=3.5946e+01
    Single-core speed is 0.6263 times a Pi 4B
    $ ./rfft
    rfft.c -- Perform real to complex Fourier transform
    Version=5; N=4194304
    
       run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
         1  6.86626828e-13  3.98657900e+00  6.89441082e-13  8.06400900e+00
         2  6.86626828e-13  3.81749800e+00  6.89441082e-13  7.72780100e+00
         3  6.86626828e-13  3.81909700e+00  6.89441082e-13  7.72298900e+00
    
    Best real=3.8175e+00 sec; Mtflops=6.0429e+01
    Best complex=7.7230e+00 sec; Mtflops=5.9740e+01
    Single-core speed is 0.913 times a Pi 4B
    
    As can be seen, moving from ARMv8 to the ARMv6 version of Go incurs a greater slowdown than switching to the ARMv7 version of GCC. Given that the C binary still links to the libc and libm system libraries, it's possible that even the C code is running slower than it should.

    At any rate, since

    X/Y=0.6263/0.913=0.6860

    seems a bit far from 1, it's worth further investigation to see whether Go in a Void Singularity container works better on Raspberry Pi OS in a way similar to what happened with C on x86.

    ejolson
    Posts: 7242
    Joined: Tue Mar 18, 2014 11:47 am

    Re: How Slow is Go?

    Thu May 13, 2021 12:37 am

    ejolson wrote:
    Wed May 12, 2021 10:25 pm
    At any rate, since

    X/Y=0.6263/0.913=0.6860

    seems a bit far from 1, it's worth further investigation to see whether Go in a Void Singularity container works better on Raspberry Pi OS in a way similar to what happened with C on x86.
    I followed the same procedure as outlined above in

    viewtopic.php?p=1863864#p1863864

    except substituting void-armv7l-ROOTFS-20210218.tar.xz for the x86 image and, of course, doing everything on a Pi. I was quite happy that everything worked on the Pi in exactly the same way as the x86 system. What a great learning system!

    Running the two FFT programs inside the container resulted in

    Code: Select all

    Singularity> grep "model name" /proc/cpuinfo | head -n1
    model name  : ARMv7 Processor rev 3 (v7l)
    Singularity> ./realfft
    realfft.go -- Perform real to complex Fourier transform
    Version=5; N=4194304
    
       run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
         1  7.34091862e-13  2.91663375e+02  7.37445718e-13  5.63841475e+02
         2  7.34091862e-13  2.90888506e+02  7.37445718e-13  5.63860405e+02
         3  7.34091862e-13  2.90949214e+02  7.37445718e-13  5.63737867e+02
    
    Best real=2.9089e+02 sec; Mtflops=7.9304e-01
    Best complex=5.6374e+02 sec; Mtflops=8.1842e-01
    Single-core speed is 0.0141 times a Pi 4B
    Singularity> ./rfft 
    rfft.c -- Perform real to complex Fourier transform
    Version=5; N=4194304
    
       run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
         1  7.09627869e-13  3.95009300e+00  7.13269560e-13  7.67712800e+00
         2  7.09627869e-13  3.70934200e+00  7.13269560e-13  7.46615300e+00
         3  7.09627869e-13  3.71227700e+00  7.13269560e-13  7.45153500e+00
    
    Best real=3.7093e+00 sec; Mtflops=6.2191e+01
    Best complex=7.4515e+00 sec; Mtflops=6.1917e+01
    Single-core speed is 0.9429 times a Pi 4B
    
    and the unexpected ratio

    X/Y=0.0141/0.9429=0.01495.

    To be off by a factor of about 67 implies something is definitely wrong. Either the ARMv7 build of Go is faulty or the container went wrong. As the container works fine for C, it's tempting to blame Go; however, that implies nobody ever tested the Go compiler in the ARMv7 build of Void Linux. What's going on?

    ejolson
    Posts: 7242
    Joined: Tue Mar 18, 2014 11:47 am

    Re: How Slow is Go?

    Thu May 13, 2021 2:13 am

    ejolson wrote:
    Thu May 13, 2021 12:37 am
    As the container works fine for C, it's tempting to blame Go; however, that implies nobody ever tested the Go compiler in the ARMv7 build of Void Linux. What's going on?
    As I had good luck with Alpine Linux when making a Singularity container for Julia, I decided to download the ARMv7 build of Alpine and check the speed of Go in that image. Alpine has nice Docker images which form a good foundation for a Singularity container.

    Create a file called alpinev7.def containing

    Code: Select all

    BootStrap: docker
    From: alpine@sha256:9663906b1c3bf891618ebcac857961531357525b25493ef717bca0f86f581ad6
    
    %runscript
        echo "This is what happens when you run the container..."
    
    As root build the container and install Go.

    Code: Select all

    # singularity build --sandbox alpinev7 alpinev7.def
    # singularity shell -w alpinev7
    Singularity> apk add go make gcc musl-dev
    Singularity> exit
    
    As a user enter the container and test the Alpine version of Go with

    Code: Select all

    $ singularity shell -w alpinev7
    Singularity> cd
    Singularity> cd code/rfft
    Singularity> gcc -Wall -O3 -mtune=native -march=native -o rfft rfft.c -lm
    Singularity> go build realfft.go
    Singularity> grep "model name" /proc/cpuinfo | head -n1
    model name  : ARMv7 Processor rev 3 (v7l)
    Singularity> ./realfft
    realfft.go -- Perform real to complex Fourier transform
    Version=5; N=4194304
    
       run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
         1  7.34091862e-13  7.30165529e+00  7.37445718e-13  1.26339421e+01
         2  7.34091862e-13  6.20299959e+00  7.37445718e-13  1.23255732e+01
         3  7.34091862e-13  6.20270419e+00  7.37445718e-13  1.23357844e+01
    
    Best real=6.2027e+00 sec; Mtflops=3.7191e+01
    Best complex=1.2326e+01 sec; Mtflops=3.7432e+01
    Single-core speed is 0.6529 times a Pi 4B
    Singularity> ./rfft
    rfft.c -- Perform real to complex Fourier transform
    Version=5; N=4194304
    
       run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
         1  7.07539112e-13  3.37532900e+00  7.10957046e-13  6.82505700e+00
         2  7.07539112e-13  3.22540500e+00  7.10957046e-13  6.51394400e+00
         3  7.07539112e-13  3.22595000e+00  7.10957046e-13  6.51375200e+00
    
    Best real=3.2254e+00 sec; Mtflops=7.1522e+01
    Best complex=6.5138e+00 sec; Mtflops=7.0831e+01
    Single-core speed is 1.082 times a Pi 4B
    
    Note that C got a little faster than before and Go at least is not going any slower than it does natively in Raspberry Pi OS. This makes the ARMv7 build of the Go compiler in Void Linux look like the source of the slowdown in the previous post.
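
    As an aside on reading the listings, the Mtflops lines appear consistent with counting 2.5*N*log2(N) operations for the real transform and 5*N*log2(N) for the complex one. Those constants are inferred from the printed numbers, not taken from the benchmark source, so treat the following Go sketch as an assumption; mtflops is an illustrative name.

```go
package main

import (
	"fmt"
	"math"
)

// mtflops estimates millions of floating-point operations per second for a
// transform of length n that took sec seconds, assuming c*n*log2(n)
// operations in total. The constants c=2.5 (real) and c=5 (complex) are
// inferred from the printed output and are an assumption.
func mtflops(c float64, n int, sec float64) float64 {
	return c * float64(n) * math.Log2(float64(n)) / sec / 1e6
}

func main() {
	const n = 4194304 // N from the listings
	// Best times of the C program in the Alpine container above.
	fmt.Printf("real:    %.4e\n", mtflops(2.5, n, 3.2254))
	fmt.Printf("complex: %.4e\n", mtflops(5.0, n, 6.5138))
}
```

    Small differences in the last digit are expected because the best times in the listings are already rounded.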

    ejolson
    Posts: 7242
    Joined: Tue Mar 18, 2014 11:47 am

    Re: How Slow is Go?

    Thu May 13, 2021 4:52 am

    ejolson wrote:
    Thu May 13, 2021 2:13 am
    This makes the ARMv7 build of the Go compiler in Void Linux look like the source of the slowdown in the previous post.
    I checked the Void Linux package build scripts. It seems they leave out the flag GOARM=7 and build everything for 32-bit ARM with software floating point by accident. As far as I can tell, the version of Go in the ARMv6 build of Void Linux is affected as well. While those developers aren't actively trying to crash the economy, software float sure makes Go slow. As Void Linux is currently my favorite distribution, I've filed a bug report

    https://github.com/void-linux/void-pack ... sues/30827

    and am hopeful things will get fixed.

    ejolson
    Posts: 7242
    Joined: Tue Mar 18, 2014 11:47 am

    Re: How Slow is Go?

    Thu May 13, 2021 5:12 am

    ejolson wrote:
    Thu May 13, 2021 4:52 am
    As Void Linux is currently my favorite distribution, I've filed a bug report

    https://github.com/void-linux/void-pack ... sues/30827

    and am hopeful things will get fixed.
    A workaround to avoid Go being super slow in the 32-bit ARMv7 build of Void Linux is to set GOARM=7 in the environment when compiling. This might also be a good idea for Raspberry Pi OS, since the Go compiler downloaded from the golang website runs in ARMv6 mode by default.
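
    To check which GOARM level a binary was actually built with, something like the following sketch might help. It assumes Go 1.18 or later, where the toolchain records build settings in the binary; that is newer than the Go 1.16.3 used in this thread, so it would not work with the compilers tested here. The helper armLevel is a made-up name.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// armLevel is a hypothetical helper that returns the value stored under the
// "GOARM" key in the build settings, or "" when no such key is present.
func armLevel(settings []debug.BuildSetting) string {
	for _, s := range settings {
		if s.Key == "GOARM" {
			return s.Value
		}
	}
	return ""
}

func main() {
	fmt.Println("GOARCH:", runtime.GOARCH)
	// ReadBuildInfo reports false when no build information is embedded.
	info, ok := debug.ReadBuildInfo()
	if !ok {
		fmt.Println("no build info embedded in this binary")
		return
	}
	if lvl := armLevel(info.Settings); lvl != "" {
		fmt.Println("GOARM:", lvl)
	} else {
		fmt.Println("GOARM not recorded (probably not a 32-bit ARM build)")
	}
}
```

    On a build that reports a GOARM level below 7, slow floating point like that seen in the Void container would be a reasonable thing to suspect.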

    The new result for the Void Singularity container is

    Code: Select all

    $ singularity shell -w void
    Singularity> cd
    Singularity> cd code/rfft
    Singularity> GOARM=7 go build realfft.go
    Singularity> ./realfft
    realfft.go -- Perform real to complex Fourier transform
    Version=5; N=4194304
    
       run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
         1  7.34091862e-13  6.24835443e+00  7.37445718e-13  1.23601818e+01
         2  7.34091862e-13  6.05836201e+00  7.37445718e-13  1.20195305e+01
         3  7.34091862e-13  6.05779099e+00  7.37445718e-13  1.20272152e+01
    
    Best real=6.0578e+00 sec; Mtflops=3.8081e+01
    Best complex=1.2020e+01 sec; Mtflops=3.8385e+01
    Single-core speed is 0.669 times a Pi 4B
    
    Woohoo! That's the fastest Go time on the Pi 4B in 32-bit so far.

    Unfortunately, the problem with 32-bit Go being slow is worse on Intel-compatible systems, since support for 387 floating-point hardware was removed in Go 1.16, the latest major release. In particular, any x86 system that does not support SSE2 instructions will need to stick with Go 1.15.x to avoid slower than slow slowness from software floating point.

    ejolson
    Posts: 7242
    Joined: Tue Mar 18, 2014 11:47 am

    Re: How Slow is Go?

    Thu May 13, 2021 3:50 pm

    ejolson wrote:
    Thu May 13, 2021 5:12 am
    In particular, any x86 system that does not support SSE2 instructions will need to stick with Go 1.15.x to avoid slower than slow slowness from software floating point.
    In this continuing saga of slowness, I cross compiled Go 1.15.12 for 32-bit Intel with 387 hardware floating-point support using the command

    Code: Select all

    GOARCH=386 GO386=387 ./make.bash
    
    and copied this to a 32-bit pre-SSE2 Athlon Thunderbird system running at 1400 MHz. The results were

    Code: Select all

    $ grep "model name" /proc/cpuinfo | head -n1
    model name  : AMD Athlon(tm) Processor
    $ ./realfft
    realfft.go -- Perform real to complex Fourier transform
    Version=6; N=4194304
    
       run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
         1  7.34091862e-13  2.04900169e+01  7.37445718e-13  4.27784855e+01
         2  7.34091862e-13  2.01064801e+01  7.37445718e-13  4.21431892e+01
         3  7.34091862e-13  2.01049879e+01  7.37445718e-13  4.21435165e+01
    
    Best real=2.0105e+01 sec; Mtflops=1.1474e+01
    Best complex=4.2143e+01 sec; Mtflops=1.0948e+01
    Single-core speed is 0.1961 times a Pi 4B
    $ ./rfft
    rfft.c -- Perform real to complex Fourier transform
    Version=6; N=4194304
    
       run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
         1  7.10257641e-13  1.65190140e+01  7.13565150e-13  3.31793650e+01
         2  7.10257641e-13  1.63273700e+01  7.13565150e-13  3.27370210e+01
         3  7.10257641e-13  1.63438060e+01  7.13565150e-13  3.27399130e+01
    
    Best real=1.6327e+01 sec; Mtflops=1.4129e+01
    Best complex=3.2737e+01 sec; Mtflops=1.4093e+01
    Single-core speed is 0.2144 times a Pi 4B
    
    Relative to the Pi 4B that's a performance-balance score of

    X/Y=0.1961/0.2144=0.9146

    Go lost a little more performance than C when running on the Athlon, but since the ratio is reasonably close to 1, it seems nothing went terribly wrong with either the Go or the C compiler.

    Note that GCC 6.3.0 was used for these tests rather than a more recent version. It would be interesting to see whether GCC 11.1.0 would create better or worse binaries for the Athlon. Does anyone know how to cross compile GCC? Is doing that as easy as cross compiling the Go compiler?

    ejolson
    Posts: 7242
    Joined: Tue Mar 18, 2014 11:47 am

    Re: How Slow is Go?

    Thu May 13, 2021 5:06 pm

    ejolson wrote:
    Thu May 13, 2021 3:50 pm
    and copied this to a 32-bit pre-SSE2 Athlon Thunderbird system running at 1400 MHz.
    While working on the 32-bit Go performance chart it seems I crashed the mail server for the dog house. Fido is barking mad.

    Amid all the barking, I finally made out a question: Why did you do that? In reply I explained that I only typed install to load the latest version of Go from the repository. Unfortunately, the install aborted with a kill signal and the system hard crashed when I typed sync. More barking ensued, this time about why never to sync a system when it's in the middle of crashing. I finally calmed the canine by promising to check the motherboard for any bloated or leaky looking capacitors.

    Fortunately, the web server survived as did the backup mail server. This led to results for a Pentium III running at 650 MHz

    Code: Select all

    $ grep "model name" /proc/cpuinfo | head -n1
    model name  : Pentium III (Coppermine)
    $ ./realfft
    realfft.go -- Perform real to complex Fourier transform
    Version=6; N=4194304
    
       run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
         1  7.34091862e-13  4.71503713e+01  7.37445718e-13  9.57537627e+01
         2  7.34091862e-13  4.63328876e+01  7.37445718e-13  9.49039192e+01
         3  7.34091862e-13  4.63296988e+01  7.37445718e-13  9.49112346e+01
    
    Best real=4.6330e+01 sec; Mtflops=4.9792e+00
    Best complex=9.4904e+01 sec; Mtflops=4.8615e+00
    Single-core speed is 0.08609 times a Pi 4B
    $ ./rfft
    rfft.c -- Perform real to complex Fourier transform
    Version=6; N=4194304
    
       run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
         1  6.75497274e-13  2.16380810e+01  6.78874876e-13  4.36386730e+01
         2  6.75497274e-13  2.10003950e+01  6.78874876e-13  4.26650950e+01
         3  6.75497274e-13  2.09944300e+01  6.78874876e-13  4.26508570e+01
    
    Best real=2.0994e+01 sec; Mtflops=1.0988e+01
    Best complex=4.2651e+01 sec; Mtflops=1.0817e+01
    Single-core speed is 0.1657 times a Pi 4B
    
    and a Pentium IV running at 1500 MHz.

    Code: Select all

    $ grep "model name" /proc/cpuinfo | head -n1
    model name  : Intel(R) Pentium(R) 4 CPU 1500MHz
    $ ./realfft
    realfft.go -- Perform real to complex Fourier transform
    Version=6; N=4194304
    
       run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
         1  7.34091862e-13  3.83128171e+01  7.37445718e-13  8.88216171e+01
         2  7.34091862e-13  3.77430017e+01  7.37445718e-13  8.85275037e+01
         3  7.34091862e-13  3.77864454e+01  7.37445718e-13  8.83348999e+01
    
    Best real=3.7743e+01 sec; Mtflops=6.1120e+00
    Best complex=8.8335e+01 sec; Mtflops=5.2230e+00
    Single-core speed is 0.09886 times a Pi 4B
    $ ./rfft
    rfft.c -- Perform real to complex Fourier transform
    Version=6; N=4194304
    
       run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
         1  7.10257641e-13  2.64593550e+01  7.13565150e-13  5.38592710e+01
         2  7.10257641e-13  2.62705750e+01  7.13565150e-13  5.31468280e+01
         3  7.10257641e-13  2.62710930e+01  7.13565150e-13  5.31138690e+01
    
    Best real=2.6271e+01 sec; Mtflops=8.7812e+00
    Best complex=5.3114e+01 sec; Mtflops=8.6865e+00
    Single-core speed is 0.1327 times a Pi 4B
    
    I find it amazing that Go even runs on these old systems.

    More importantly, Go appears to be about as slow relative to C on 32-bit Intel-compatible hardware as it is for ARMv7 on the Pi 4B. While I suspect lack of developers is the main reason 387 floating-point support was discontinued, maybe performance problems were another reason those gophers gave up on non-SSE2 hardware. Hopefully support for hardware float on ARMv7 will continue for some time.

    All I need is the Zero, Pi 2B and Pi 3B to finish the 32-bit chart. Since the 3B no longer functions as the inner firewall, do you think testing it will result in more loud barking noises?

    ejolson
    Posts: 7242
    Joined: Tue Mar 18, 2014 11:47 am

    Re: How Slow is Go?

    Sat May 15, 2021 3:44 am

    ejolson wrote:
    Tue May 11, 2021 8:47 pm
    I also ran the programs on an i3 550 and a Gold 6126, but Go did so much better than C on these systems that I became suspicious something was wrong.
    I updated the i3 as it had the original Devuan distribution and had stopped being able to connect to the package repository.

    Woohoo! The results look good for Go

    Code: Select all

    $ grep "model name" /proc/cpuinfo | head -n1
    model name  : Intel(R) Core(TM) i3 CPU         550  @ 3.20GHz
    $ ./realfft
    realfft.go -- Perform real to complex Fourier transform
    Version=6; N=4194304
    
       run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
         1  7.34091862e-13  1.89957666e+00  7.37445718e-13  3.78571153e+00
         2  7.34091862e-13  1.87443113e+00  7.37445718e-13  3.73208070e+00
         3  7.34091862e-13  1.87445545e+00  7.37445718e-13  3.72957706e+00
    
    Best real=1.8744e+00 sec; Mtflops=1.2307e+02
    Best complex=3.7296e+00 sec; Mtflops=1.2371e+02
    Single-core speed is 2.159 times a Pi 4B
    $ ./rfft
    rfft.c -- Perform real to complex Fourier transform
    Version=6; N=4194304
    
       run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
         1  7.09627869e-13  1.89432600e+00  7.13269560e-13  3.81902400e+00
         2  7.09627869e-13  1.86761800e+00  7.13269560e-13  3.76226400e+00
         3  7.09627869e-13  1.86813000e+00  7.13269560e-13  3.76119300e+00
    
    Best real=1.8676e+00 sec; Mtflops=1.2352e+02
    Best complex=3.7612e+00 sec; Mtflops=1.2267e+02
    Single-core speed is 1.87 times a Pi 4B
    
    with Go pulling ahead of C in real time and way ahead relative to the balance of performance on the Raspberry Pi 4B.

    The updated 64-bit chart is

    Image

    The machine that crashed yesterday is also running again, so hopefully a 32-bit chart will appear soon. If anyone wants to contribute additional timings please post output from running the Go and C programs in the first post along with a description of the computer hardware used.

    I wonder if the relative performance between Go and C stays the same as a Pi 4B is overclocked.

    jahboater
    Posts: 7074
    Joined: Wed Feb 04, 2015 6:38 pm
    Location: Wonderful West Dorset

    Re: How Slow is Go?

    Sat May 15, 2021 4:23 am

    ejolson wrote:
    Sat May 15, 2021 3:44 am
    I wonder if the relative performance between Go and C stays the same as a Pi 4B is overclocked.
    bait taken .... :)
    8GB Pi4 Aarch64 2.1GHz GCC 11.1

    Code: Select all

    $ cat /proc/device-tree/model; echo
    Raspberry Pi 4 Model B Rev 1.4 
    $ uname -m
    aarch64
    $ vcgencmd measure_clock arm
    frequency(48)=2100515584
    $ go version
    go version go1.16.4 linux/arm64
    $ gcc --version | head -n1
    gcc (GCC) 11.1.0
    $
    $ ./realfft
    realfft.go -- Perform real to complex Fourier transform
    Version=6; N=4194304
     
       run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
         1  7.14602381e-13  3.13543606e+00  7.17516816e-13  6.24210477e+00
         2  7.14602381e-13  3.04840231e+00  7.17516816e-13  6.06210685e+00
         3  7.14602381e-13  3.04789257e+00  7.17516816e-13  6.06161308e+00
     
    Best real=3.0479e+00 sec; Mtflops=7.5687e+01
    Best complex=6.0616e+00 sec; Mtflops=7.6114e+01
    Single-core speed is 1.328 times a Pi 4B
    $
    $ ./rfft
    rfft.c -- Perform real to complex Fourier transform
    Version=6; N=4194304
     
       run    norm(xr-xrs)        real sec      norm(x-xs)     complex sec
         1  6.86503721e-13  2.99472300e+00  6.89351350e-13  5.57285600e+00
         2  6.86503721e-13  2.66720800e+00  6.89351350e-13  5.40485100e+00
         3  6.86503721e-13  2.67030000e+00  6.89351350e-13  5.41113600e+00
     
    Best real=2.6672e+00 sec; Mtflops=8.6490e+01
    Best complex=5.4049e+00 sec; Mtflops=8.5363e+01
    Single-core speed is 1.306 times a Pi 4B
    $
    

    ejolson
    Posts: 7242
    Joined: Tue Mar 18, 2014 11:47 am

    Re: How Slow is Go?

    Sat May 15, 2021 6:30 pm

    jahboater wrote:
    Sat May 15, 2021 4:23 am
    ejolson wrote:
    Sat May 15, 2021 3:44 am
    I wonder if the relative performance between Go and C stays the same as a Pi 4B is overclocked.
    bait taken .... :)
    8GB Pi4 Aarch64 2.1GHz GCC 11.1

    Since the default clock is 1500 MHz and

    2100/1500=1.4

    it would seem this reached about

    1.3/1.4=0.93 or roughly 93 percent

    of the maximum possible based on CPU clock speed.
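
    The same estimate can be made from the unrounded scores. This is only a sketch: scalingEfficiency is a made-up helper, and the comparison also mixes in the newer GCC 11.1 and Go 1.16.4 used for the overclocked run, so not all of the difference need come from the clock.

```go
package main

import "fmt"

// scalingEfficiency is a hypothetical helper that compares a measured
// speedup with the ratio of the new clock to the baseline clock. A result
// of 1.0 would mean the run time scaled perfectly with CPU frequency.
func scalingEfficiency(speedup, clockMHz, baseMHz float64) float64 {
	return speedup / (clockMHz / baseMHz)
}

func main() {
	// 1.328 (Go) and 1.306 (C) are the single-core scores posted for 2100 MHz.
	fmt.Printf("Go: %.1f%%\n", 100*scalingEfficiency(1.328, 2100, 1500))
	fmt.Printf("C:  %.1f%%\n", 100*scalingEfficiency(1.306, 2100, 1500))
}
```

    Whatever is left over is the share of the run that did not speed up with the clock, which is where memory speed is one plausible suspect.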

    When I showed the output to Fido, the dog developer's tail started wagging furiously back and forth. Quickly dodging out of the way, I exclaimed please watch that tail around the Raspberry Pi. It's a finely-tuned high-performance electronics device. All that fur might get caught in the fan.

    Ignoring my alarm the canine coder calmly explained, a simple application of Amdog's law can now be used to determine how much of the total work is CPU bound.

    ejolson
    Posts: 7242
    Joined: Tue Mar 18, 2014 11:47 am

    Re: How Slow is Go?

    Sat May 15, 2021 10:32 pm

    ejolson wrote:
    Sat May 15, 2021 6:30 pm
    Ignoring my alarm the canine coder calmly explained, a simple application of Amdog's law can now be used to determine how much of the total work is CPU bound.
    Fido insisted on more data, so I reran the Go program on the Raspberry Pi 4B in 64-bit mode while varying the clock speed from 600 to 1500 MHz.

    The results were

    Code: Select all

    Frequency       Time for Real FFT
       600               9.6641
       900               6.5299
      1200               4.9854
      1500               4.0652
    
    After seeing my results the head of the BARK™ Foundation growled, 1800 MHz is missing. Though I don't approve of overclocking, to avoid the impending event of loud barking, I quickly agreed to set the clock speed to 1800 and try one more run.

    Unfortunately the 4B crashed on boot. Fortunately, it was easy to set the clock speed back to 1500 since it was one of the machines in the 2U server chassis that use rpiboot over USB to perform the initial program load.

    Later in the day I received a calculation performed by the super pet in microPascal. As there were no comments, I asked Fido for the documentation.

    It seems crashing the Pi at 1800 MHz provided only a temporary reprieve from too much barking. After the noise subsided, an email came in through the recently rebuilt mail server.

    I launched the mutt email client

    http://www.mutt.org/

    but much to my surprise it did not display PETSCII. Thus, Amdog's law remains a bit of a mystery. Could HTML-encoded email actually be a better idea?

    At any rate, I ported the Pascal to Julia as

    Code: Select all

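    # Best real-FFT times (seconds) of the Go program at each clock speed
    T=[9.6641,6.5299,4.9854,4.0652,3.0479]
    F=[600.0,900,1200,1500,2100]   # CPU frequency in MHz
    
    # Least-squares fit of T = a*(1500/F) + b
    X=(1500)./F
    M=[X ones(5)]
    ab=M\T
    
    # Evaluate the fitted curve on a fine frequency grid
    Fs=[500:10:2500;];
    Xs=(1500)./Fs;
    Ts=ab[1]*Xs.+ab[2];
    
    using Plots
    scatter(F,T,label="data",xaxis="CPU Frequency",yaxis="Time")
    plot!(Fs,Ts,label="Amdog's law",
        title="Fit of Amdog's Law to Real FFT in Go on Pi 4B")
    
    savefig("amdog.svg")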
    
    and obtained the graph

    Image
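
    Since this thread is about Go, here is a rough equivalent of the fitting step of the Julia code written in Go, without the plotting. It is a sketch under the assumption that the model is T = a*(1500/F) + b solved by ordinary least squares, which is what the backslash operator does for an overdetermined system. The function names fitLine and amdogFit are made up.

```go
package main

import "fmt"

// fitLine returns the ordinary least-squares coefficients a and b for
// t ≈ a*x + b using the closed-form normal equations.
func fitLine(x, t []float64) (a, b float64) {
	n := float64(len(x))
	var sx, st, sxx, sxt float64
	for i := range x {
		sx += x[i]
		st += t[i]
		sxx += x[i] * x[i]
		sxt += x[i] * t[i]
	}
	a = (n*sxt - sx*st) / (n*sxx - sx*sx)
	b = (st - a*sx) / n
	return a, b
}

// amdogFit fits t ≈ a*(1500/f) + b to timings t (seconds) measured at clock
// frequencies f (MHz). Here a is the part of the 1500 MHz run time that
// scales inversely with the clock and b is the part that does not.
func amdogFit(f, t []float64) (a, b float64) {
	x := make([]float64, len(f))
	for i := range f {
		x[i] = 1500 / f[i]
	}
	return fitLine(x, t)
}

func main() {
	// Same data as the Julia code above.
	f := []float64{600, 900, 1200, 1500, 2100}
	t := []float64{9.6641, 6.5299, 4.9854, 4.0652, 3.0479}
	a, b := amdogFit(f, t)
	fmt.Printf("a=%.4f s (scales with clock), b=%.4f s (does not)\n", a, b)
	fmt.Printf("clock-bound share at 1500 MHz: %.1f%%\n", 100*a/(a+b))
}
```

    The constant term b is the interesting one, as it estimates how much of the run would remain no matter how fast the CPU clock is set.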

    Does anyone have a PETSCII plugin for the mutt mail client that will run on the Raspberry Pi? What's the mimetype?
    Last edited by ejolson on Sat May 15, 2021 11:41 pm, edited 2 times in total.

    jahboater
    Posts: 7074
    Joined: Wed Feb 04, 2015 6:38 pm
    Location: Wonderful West Dorset

    Re: How Slow is Go?

    Sat May 15, 2021 11:21 pm

    Nice fit to Amdahl's law!

    You need over_voltage=2 for 1800 MHz by the way.

    #over_voltage=2
    #arm_freq=1800

    #over_voltage=5
    #arm_freq=2000

    over_voltage=6
    arm_freq=2100

    Heater
    Posts: 18252
    Joined: Tue Jul 17, 2012 3:02 pm

    Re: How Slow is Go?

    Sun May 16, 2021 5:36 pm

    jahboater wrote:
    Sat May 15, 2021 11:21 pm
    Nice fit to Amdahl's law!
    Please explain. I don't see how a graph of execution time vs CPU frequency relates to Amdahl's law.
    Memory in C++ is a leaky abstraction .

    jahboater
    Posts: 7074
    Joined: Wed Feb 04, 2015 6:38 pm
    Location: Wonderful West Dorset

    Re: How Slow is Go?

    Sun May 16, 2021 6:45 pm

    Heater wrote:
    Sun May 16, 2021 5:36 pm
    jahboater wrote:
    Sat May 15, 2021 11:21 pm
    Nice fit to Amdahl's law!
    Please explain. I don't see how a graph of execution time vs CPU frequency relates to Amdahl's law.
    Yes, there is only one "part". Ignore my comment :(

    ejolson
    Posts: 7242
    Joined: Tue Mar 18, 2014 11:47 am

    Re: How Slow is Go?

    Sun May 16, 2021 7:47 pm

    Heater wrote:
    Sun May 16, 2021 5:36 pm
    jahboater wrote:
    Sat May 15, 2021 11:21 pm
    Nice fit to Amdahl's law!
    Please explain. I don't see how a graph of execution time vs CPU frequency relates to Amdahl's law.
    I looked it up and
    Wikipedia wrote: In computer architecture, Amdahl's law is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved.
    https://en.wikipedia.org/wiki/Amdahl%27s_law
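
    One possible reading, offered as a sketch rather than as what Fido had in mind: treat the clock as the improved resource, let p be the share of the 1500 MHz run time that scales with the clock and let s=F/1500. Amdahl's formula then gives T(F)=T(1500)*((1-p)+p/s), which has the same form a*(1500/F)+b as the fit in the Julia code. The values of p below are hypothetical.

```go
package main

import "fmt"

// amdahlSpeedup returns the overall speedup predicted by Amdahl's law when
// a fraction p of the original run time is accelerated by a factor s and
// the remaining 1-p is left unchanged.
func amdahlSpeedup(p, s float64) float64 {
	return 1 / ((1 - p) + p/s)
}

func main() {
	// Hypothetical clock-bound fractions, chosen only for illustration.
	for _, p := range []float64{1.0, 0.9, 0.5} {
		fmt.Printf("p=%.1f: 1500 -> 2100 MHz gives %.3fx\n",
			p, amdahlSpeedup(p, 2100.0/1500.0))
	}
}
```

    With p=1 the speedup equals the clock ratio, and anything less than that suggests part of the work is bound by something other than the CPU clock.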

    I asked Fido whether the resource improvement in this case was due to changes in CPU clock speed. At this the dog's head tipped sideways as developers sometimes do, so I explained that I was unable to read the documentation. The furry head tipped the other way.

    I continued that I tried Petmate on my Raspberry Pi

    https://nurpax.github.io/petmate/

    but the reliance on React, Redux and Electron led to a complicated set of cross-platform bugs.

    There was a calm before the storm of barking from which I understood it only takes 2 KB of code to create a fully-functional PETSCII editor on any reasonable computer. Not wanting to get into a discussion of which computers are reasonable, I returned to the original topic and again asked, could you please explain how that graph relates to Amdahl's law?

    In a low growl Fido replied, Amdog's law.
