ejolson
Posts: 1007
Joined: Tue Mar 18, 2014 11:47 am

Cilkplus on RPi2B

Wed Mar 04, 2015 7:08 pm

In 1994 MIT researchers created a parallel programming extension to the C programming language called Cilk. Commercialized and then purchased by Intel in 2009, these extensions are now included in mainline GCC 5.0 and in Intel's commercial compiler. There is also a set of patches that adds Cilk to LLVM clang. However, only 32- and 64-bit Intel hardware is supported by any of these compilers.

In order to use Cilk on the RPi2B for parallel SMP programming, the runtime libcilkrts needs to be ported. For GCC 5.0 one needs an ARM-specific version of the file cilk-abi-vla.c in gcc-5/libcilkrts/runtime/config to implement the function calls that manipulate the program stack so that threads can steal work from other threads. While this is apparently fewer than 200 lines of code, getting it to work on the Raspberry Pi would require detailed knowledge of the stack and register usage for C/C++ function calls on ARM.

Is anyone interested in, or does anyone know of, a project already underway to port the GCC 5.0 Cilk runtime to the Raspberry Pi?

ejolson
Posts: 1007
Joined: Tue Mar 18, 2014 11:47 am

Re: Cilkplus on RPi2B

Fri Mar 27, 2015 8:33 am

I just compiled libcilkrts from gcc-5.0 for the Raspberry Pi 2B and have managed to run a couple of simple tests. It seems to be working fine. I'll detail what I did, starting from the beginning, since I expect most people reading this are not familiar with building gcc.

For clarity, I shall prefix every command run as root with a # sign and every command run as a simple user with a $ sign. Please use "su" or "sudo bash" as appropriate for the way you have configured your system to switch to the root user when needed.

Building gcc needs additional libraries and utilities which may not already be installed in the base Raspbian image. I actually started with Luca Soltoggio's Minibian image

https://minibianpi.wordpress.com/

which is the same as Raspbian but without the extra stuff. As root type

# apt-get install libgmp-dev libmpfr-dev libmpc-dev autogen \
autoconf subversion libisl-dev make gcc flex bison g++

to install the needed utilities and libraries. A 2GB swap file is needed because part of the compiler generation process uses more than 1GB of RAM. To create a swap file type

# dd if=/dev/zero of=/swapfile bs=1M count=2048
# chmod 600 /swapfile
# mkswap /swapfile
# swapon /swapfile
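
The swap file made this way lasts only until the next reboot. If the build will span a reboot, adding a line like the following to /etc/fstab makes the swap file permanent; this is standard Linux practice rather than part of the original recipe.

/swapfile none swap sw 0 0

Otherwise, simply repeat the swapon command after booting.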

Then as a regular user create a work directory and download the latest version of gcc.

$ mkdir work
$ cd work
$ svn checkout svn://gcc.gnu.org/svn/gcc/trunk gcc-src

Currently gcc development is done using autoconf version 2.64; however, Raspbian provides version 2.69 instead. Thus, change two files, gcc-src/configure.ac and gcc-src/config/override.m4, to accept the newer version of autoconf.

In configure.ac change

AC_PREREQ(2.64)

to

AC_PREREQ(2.69)

and in config/override.m4 change

[m4_define([_GCC_AUTOCONF_VERSION], [2.64])])

to

[m4_define([_GCC_AUTOCONF_VERSION], [2.69])])
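
Both edits can also be made from the command line with sed. This is only a shortcut; check the two files by hand afterwards to make sure exactly one line changed in each.

$ sed -i 's/AC_PREREQ(2\.64)/AC_PREREQ(2.69)/' gcc-src/configure.ac
$ sed -i 's/_GCC_AUTOCONF_VERSION], \[2\.64\]/_GCC_AUTOCONF_VERSION], [2.69]/' \
gcc-src/config/override.m4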

Next, enable building of libcilkrts by deleting or commenting out the line UNSUPPORTED=1 in the file libcilkrts/configure.tgt.

While it would be better to create a new arm subdirectory in

libcilkrts/runtime/config

and add appropriate items to the configuration system, I simply put the needed changes directly in libcilkrts/runtime/config/generic.

The first change corrects a typo in generic/cilk-abi-vla.c by changing the second-to-last line of the file from

vla_internal_heap_free(t, full_size);

to

vla_internal_heap_free(p, full_size);

The second change, which is ARM specific, is to generic/os-fence.h. Comment out the line

COMMON_SYSDEP void __cilkrts_fence(void); ///< MFENCE instruction

as

// COMMON_SYSDEP void __cilkrts_fence(void); ///< MFENCE instruction

and then add the define

#define __cilkrts_fence() __asm__ volatile ("DSB")

right above it. Note that the DSB instruction is ARMv7 specific, so the resulting library will not run on the Raspberry Pi B and B+, which are based on ARMv6. I have not thought about how to make a library that works on all versions of the Pi.

Now it is time to build the new compiler and libraries. Change back to the work directory and then type

$ mkdir gcc-build
$ cd gcc-build
$ ../gcc-src/configure -v --enable-languages=c,c++ \
--prefix=/usr/local/gcc-5.0 --with-arch=armv7-a \
--with-fpu=vfp --with-float=hard --build=arm-linux-gnueabihf \
--host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
$ nohup make -j4 >world.log 2>&1 &

The build process takes about 12 hours and proceeds in three stages. You can check which stage is currently being built by typing the following command in the gcc-build directory

$ cat stage_current

You can also use "top" to monitor the build. After everything is done, check the log file "world.log" to make sure there are no errors. Assuming everything went well, switch to root and install by typing

# make install

which copies the new compiler into /usr/local/gcc-5.0. Since this directory is not contained in the execution or library paths, the new compiler won't interfere with the standard Raspbian tools. To test the compiler consider the program fib.c given by

Code: Select all

#include <cilk/cilk.h>
#include <assert.h>

int fib(int n) {
    if (n < 2) return n;
    int a = cilk_spawn fib(n-1);
    int b = fib(n-2);
    cilk_sync;
    return a + b;
}

int main() {
    int result = fib(30);
    assert(result == 832040);
    return 0;
} 
Compile the program using the command

$ /usr/local/gcc-5.0/bin/gcc -fcilkplus fib.c -lcilkrts

and run using

$ LD_LIBRARY_PATH=/usr/local/gcc-5.0/lib ./a.out

The test program silently exits if everything goes well.
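
The number of worker threads is controlled by the CILK_NWORKERS environment variable, which is also used for the benchmarks later in this thread. For example,

$ CILK_NWORKERS=2 LD_LIBRARY_PATH=/usr/local/gcc-5.0/lib ./a.out

runs the same test with two workers.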

ejolson
Posts: 1007
Joined: Tue Mar 18, 2014 11:47 am

Re: Cilkplus on RPi2B

Sun Mar 29, 2015 5:52 pm

I wrote a program to compute vector dot products to test the parallel performance of Cilk on the Raspberry Pi 2B computer. The calculation proceeds with a recursive parallel algorithm based on the fact that the dot product is the sum of two smaller dot products involving vectors of half the length. Recursion stops when the vectors reach a certain minimum length, at which point a standard serial dot product is used. This minimum length was chosen to be 10000 so that each resulting work unit was large enough compared to the parallel overhead for good efficiency.

The environment variable

CILK_NWORKERS

was set to 1, 2, 3 and then 4 to determine the resulting parallel speedup when computing the dot product of vectors of size 10000000. In this case the recursive algorithm creates more than 1000 different work units (roughly 10000000/10000 leaf tasks) that may be executed in parallel, so in theory the test calculation could scale to a 1000-processor shared-memory architecture using the techniques implemented here. Note, however, that a dot product involves only one multiplication and one addition for every two numbers fetched from memory and is therefore memory intensive. Consequently, this kind of calculation poses a significant challenge for memory subsystems shared by multiple cores.

Tests were performed using both single and double precision floating point. The results were

Code: Select all

            Parallel Speedup
      Cores      Single      Double
        2        1.90939     1.84109
        3        2.7171      2.31901     
        4        3.40789     2.28758
[Image: parallel speedup graph for single and double precision dot products]

Memory bandwidth was sufficient to make use of all 4 cores when computing in single precision. The double precision calculation requires twice the memory bandwidth and already saturates the memory bus when 3 cores are active. Thus, parallel speedup for the double precision version was limited to approximately 2.3 fold. For reference, the maximum parallel speedup for double precision on an 8-core Xeon E5-2650 server is about 4 fold.

The Cilk code is pasted below for reference.

Code: Select all

#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cilk/cilk.h>
#define xstr(s) str(s)
#define str(s) #s

#ifdef SINGLE
# define Real float
# define Rerr 1e-7
#else
# define Real double
# define Rerr 1e-15
#endif

#ifndef N
# define N 10000000
#endif

Real x[N],y[N];

static double tic_time;
void tic() {
    struct timeval tp;
    gettimeofday(&tp,0);
    tic_time=(double) tp.tv_sec + (double) tp.tv_usec * 1.e-6;
}
double toc() {
    struct timeval tp;
    gettimeofday(&tp,0);
    return ((double) tp.tv_sec + (double) tp.tv_usec * 1.e-6)-tic_time;
}

double dpserial(int n,Real x[n],Real y[n]){
    double r=0;
    for(int k=n-1;k>=0;k--) r+=x[k]*y[k];
    return r;
}
double dpparallel(int n,Real x[n],Real y[n]){
    if(n<10000) return dpserial(n,x,y);
    int m=n/2;
    double r1,r2;
    r2=cilk_spawn dpparallel(n-m,&x[m],&y[m]);
    r1=dpparallel(m,x,y);
    cilk_sync;
    return r1+r2;
}

typedef double (*dotprod)(int n,Real x[n],Real y[n]);
double bench(dotprod dp){
    double tmin=100.0,ttot=0.0;
    for(int j=0;;j++){
        tic();
        double r=dp(N,x,y);
        double t=toc();
        double e=r-1+1.0/(N+1);
        if(fabs(e)>Rerr){
            fprintf(stderr,"Error %25.15e is too big in dot product!\n",e);
            exit(1);
        }
        putchar('.'); fflush(stdout);
        tmin=(t<tmin)?t:tmin;
        ttot+=t;
        if(ttot>5 && j>3) break;
    }
    printf("\n");
    return tmin;
}
int main(int argc,char *argv[]){
    printf(
        "%s -- Parallel Dot Product Memory Bandwidth Benchmark\n"
        "Precision " xstr(Real) " dot products of size %d\n\n",
        argv[0],N);
    for(int k=0;k<N;k++) x[k]=1.0/(k+1);
    for(int k=0;k<N;k++) y[k]=1.0/(k+2);
    double tserial=bench(dpserial);
    double tparallel=bench(dpparallel);
    printf(
        "\n%24s %24.15e seconds\n"
        "%24s %24.15e seconds\n"
        "%24s %24g fold\n",
        "Serial elapsed time",tserial,
        "Parallel elapsed time",tparallel,
        "Parallel speedup factor",tserial/tparallel);
    return 0;
}
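
The exact compile command is not recorded in this post, but a command along the lines of the later examples should work, where dotprod.c is whatever name the source was saved under and -DSINGLE selects single precision:

$ /usr/local/gcc-5.0/bin/gcc -O3 -Wall -std=gnu99 -fcilkplus -o dotprod dotprod.c -lcilkrts -lm
$ LD_LIBRARY_PATH=/usr/local/gcc-5.0/lib CILK_NWORKERS=4 ./dotprod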

dmc1954
Posts: 17
Joined: Sun Mar 24, 2013 2:41 pm
Location: Austin, Texas, USA

Re: Cilkplus on RPi2B

Mon Apr 06, 2015 2:53 pm

Ejolson,
Thanks for the instructions on building the gcc 5.0 snap shot with the Cilkplus extension. :D

I started with the normal Raspbian distribution, using autoconf version 2.64, instead of the Minibian distribution, so I did not need to modify configure.ac and config/override.m4.

I would add instructions to use raspi-config to disable the graphics desktop, decrease the graphics memory split, and reboot.

3 Enable Boot to Desktop/Scratch
Console Text console, requiring login (default)

8 Advanced Options
A3 Memory Split
Change value 128 to 16

Also, I installed the more recent gcc and g++ 4.8 versions of the compilers.

apt-get install gcc-4.8 g++-4.8

and made a minor change to the configure invocation to use the gcc-4.8 compiler:

CC=gcc-4.8 ../gcc-src/configure -v --enable-languages=c,c++ --prefix=/usr/local/gcc-5.0 --target=arm-linux-gnueabihf --with-arch=armv7-a --with-fpu=vfp --with-float=hard --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf

Started the build Sunday at 12:30 am and went to bed. The build finished without errors after 8 hours.

Later, I did a "make install", set up the environment variables and was able to run "gcc -v":

pi@raspberrypi2 ~/PI $ export PATH=/usr/local/gcc-5.0/bin:$PATH
pi@raspberrypi2 ~/PI $ export LD_LIBRARY_PATH=/usr/local/gcc-5.0/lib:$LD_LIBRARY_PATH
pi@raspberrypi2 ~/PI $ which gcc
/usr/local/gcc-5.0/bin/gcc

pi@raspberrypi2 ~/PI $ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/gcc-5.0/libexec/gcc/arm-linux-gnueabihf/5.0.0/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../gcc-src/configure -v --enable-languages=c,c++ --prefix=/usr/local/gcc-5.0 --target=arm-linux-gnueabihf --with-arch=armv7-a --with-fpu=vfp --with-float=hard --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf
Thread model: posix
gcc version 5.0.0 20150330 (experimental) (GCC)

And then ran the fib.c cilk test case.

pi@raspberrypi2 ~/PI $ gcc -fcilkplus fib.c -o fib
pi@raspberrypi2 ~/PI $ ./fib
pi@raspberrypi2 ~/PI $

I need to convert and test my parallel chudnovsky pi program from openmp to cilk today.

What do you think about building a Raspbian install package for the /usr/local/gcc-5.0 compiler with the Cilkplus extension for the RPI2? This would hopefully allow anyone to start writing parallel code for the RPI2 using Cilkplus, without having to wait for gcc 5.0 to be available in the Raspbian package repository.

Thanks again,
David

ejolson
Posts: 1007
Joined: Tue Mar 18, 2014 11:47 am

Re: Cilkplus on RPi2B

Wed Apr 08, 2015 1:03 am

I'm happy you got gcc-5.0 with the Intel/MIT Cilk parallel extensions compiled and working. A binary package would definitely make it easier for people to try Cilk out. For reference, this tarball of my binary is 97MB and takes 455MB when unpacked. Please unpack it as root using

# cd /usr/local
# unxz <gcc-5.0.tar.xz | tar xf -

if you want to test my build on a Raspberry Pi 2B. Since gcc-5.0 is useful on the older hardware, some work needs to be done to design a proper binary package. I compiled gcc-5.0 using --with-arch=armv7-a; however, if I compile using --with-arch=armv6, then the DSB instruction needed for the fence is not available. Obviously, this instruction is not needed for the original Raspberry Pi, but what happens if it encounters a DSB instruction anyway and how does one create a binary that works for both types of hardware?

ejolson
Posts: 1007
Joined: Tue Mar 18, 2014 11:47 am

Re: Cilkplus on RPi2B

Wed Apr 08, 2015 6:27 pm

Here is a parallel FFT written using MIT/Intel Cilk. A discrete Fourier transform can be expressed in terms of two Fourier transforms each of half the size. This observation leads to an FFT algorithm which is easy to implement as a parallel recursive routine in Cilk. At modest transform lengths parallel speedup is about 3.4 fold.

Code: Select all

./dfft -- Parallel Fast Fourier Transform Benchmark
Precision double FFTs of size 32768

     Serial elapsed time    3.118801116943359e-02 seconds
   Parallel elapsed time    8.972883224487305e-03 seconds
      FFTW3 elapsed time    2.861905097961426e-02 seconds
 Parallel speedup factor                  3.47581 fold
Performance of the recursive implementation is reasonable compared to the FFTW3 software library (in single-threaded estimate mode) as seen in the following graph:

[Image: performance of the recursive FFT compared with FFTW3]

For reference, the source code is

Code: Select all

#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cilk/cilk.h>
#include <complex.h>
#include <fftw3.h>
#define xstr(s) str(s)
#define str(s) #s

#ifdef SINGLE
# define Real float
# define FFTW3(s) fftwf_ ## s
#else
# define Real double
# define FFTW3(s) fftw_ ## s
#endif

typedef Real complex Complex;

#ifndef N
# define N 32768
#endif

Complex x[N],y[N],wn[N/2];
FFTW3(plan) fftxtoy;

static double tic_time;
void tic() {
    struct timeval tp;
    gettimeofday(&tp,0);
    tic_time=(double) tp.tv_sec + (double) tp.tv_usec * 1.e-6;
}
double toc() {
    struct timeval tp;
    gettimeofday(&tp,0);
    return ((double) tp.tv_sec + (double) tp.tv_usec * 1.e-6)-tic_time;
}

void fftserial(int n,int s,Complex x[n*s],Complex y[n]){
    int m=n/2;
    if(m==1){
        y[0]=x[0];
        y[m]=x[s];
    } else {
        fftserial(m,s*2,x,y);
        fftserial(m,s*2,x+s,y+m);
    }
    for(int i=0;i<m;i++) {
        Complex z0=y[i],z1=wn[i*s]*y[i+m];
        y[i]=z0+z1;
        y[i+m]=z0-z1;
    }

}
void fftparallel(int n,int s,Complex x[n*s],Complex y[n]){
    if(s>32||n<1024) {
        fftserial(n,s,x,y);
        return;
    }
    int m=n/2;
    cilk_spawn fftparallel(m,s*2,x,y);
    fftparallel(m,s*2,x+s,y+m);
    cilk_sync;
    cilk_for(int i=0;i<m;i++) {
        Complex z0=y[i],z1=wn[i*s]*y[i+m];
        y[i]=z0+z1;
        y[i+m]=z0-z1;
    }
}
void fftwest(int n,int s,Complex x[n*s],Complex y[n]){
    FFTW3(execute)(fftxtoy);
}

typedef void (*fouriertrans)(int n,int s,Complex x[n],Complex y[n]);
double bench(fouriertrans fft){
    double tmin=100.0,ttot=0.0;
    for(int j=0;;j++){
        tic();
        fft(N,1,x,y);
        double t=toc();
        double e=cimag(y[N-1])-N/2;
        if(fabs(e)>0.5){
            fprintf(stderr,"Error %25.15e is too big in FFT!\n",e);
            exit(1);
        }
        putchar('.'); fflush(stdout);
        tmin=(t<tmin)?t:tmin;
        ttot+=t;
        if(ttot>5 && j>3) break;
    }
    printf("\n");
    return tmin;
}

int main(int argc, char *argv[]){
    printf(
        "%s -- Parallel Fast Fourier Transform Benchmark\n"
        "Precision " xstr(Real) " FFTs of size %d\n\n",
        argv[0],N);
    for(int i=0;i<N;i++) x[i]=sin(2*M_PI*i/N);
    fftxtoy=FFTW3(plan_dft_1d)(N,x,y,FFTW_FORWARD,FFTW_ESTIMATE);
    double theta=-2*M_PI/N;
    for(int i=0;i<N/2;i++) wn[i]=cos(i*theta)+I*sin(i*theta);
    double tserial=bench(fftserial);
    double tparallel=bench(fftparallel);
    double twest=bench(fftwest);
    printf(
        "\n%24s %24.15e seconds\n"
        "%24s %24.15e seconds\n"
        "%24s %24.15e seconds\n"
        "%24s %24g fold\n",
        "Serial elapsed time",tserial,
        "Parallel elapsed time",tparallel,
        "FFTW3 elapsed time",twest,
        "Parallel speedup factor",tserial/tparallel);
    FFTW3(destroy_plan)(fftxtoy);
    return 0;
}
The command

$ /usr/local/gcc-5.0/bin/gcc -O3 -Wall -std=gnu99 -fcilkplus -o dfft fft.c -lcilkrts -lfftw3 -lm

compiles this code using the gcc-5.0 build mentioned in the previous post.

dmc1954
Posts: 17
Joined: Sun Mar 24, 2013 2:41 pm
Location: Austin, Texas, USA

Re: Cilkplus on RPi2B

Thu Apr 09, 2015 9:09 pm

Eric,
Nice scaling of your parallel Cilk FFT code running on the RPI2. I was wondering if there was enough
memory throughput to keep four cores busy. How did you determine the cutoff values "if(s>32||n<1024) {" in
the routine fftparallel? Have you had time to compare your Cilk version of the parallel FFT program to an OpenMP version?

As to your other question:

> Obviously, this instruction is not needed for the original Raspberry Pi, but what happens if it encounters a DSB instruction
> anyway and how does one create a binary that works for both types of hardware?

At some point, I think that Raspbian is going to need two software packages, one for armv6 (RPI1) and one for armv7 (RPI2),
a lot like what was done with the i386 and x86_64 Linux packages. GMP would be another good example of a Raspbian package that needs to support two different ARM architectures, and it has lots of hand-tuned assembly code.

Thanks,
David

ejolson
Posts: 1007
Joined: Tue Mar 18, 2014 11:47 am

Re: Cilkplus on RPi2B

Fri Apr 10, 2015 12:49 am

dmc1954 wrote:How did you determine the cutoff values "if(s>32||n<1024) {" in
the routine fftparallel? Have you had time to compare your Cilk version of the parallel FFT program to an OpenMP version?
I chose s>32 because that limits the parallel recursion depth to 6, which is enough to keep most SMP machines busy. I chose n<1024 so the minimum unit of work was large enough to offset the greater cost of cilk_spawn over a regular function call. Both values are somewhat arbitrary and could be optimized using some sort of auto-tuning for different architectures. As noted by Heater in

viewtopic.php?p=737161#p737161

the depth-first nature of a recursive FFT implementation likely has better memory locality than the essentially breadth-first nature of the standard bit-reversal method. From a teaching point of view, the recursive code is also a win because it looks very similar to the mathematical derivation of the FFT algorithm, exhibits the correct asymptotic run-time scaling and is easy to make parallel.

I have not coded an OpenMP version. Cilk keeps track of pending work behind the scenes using a directed acyclic graph, which makes tuning much easier. The primitive cilk_sync waits only for the work needed to proceed with the computation, while the worker threads in a single pool, one tied to each processor core, steal work from each other using queues. In contrast, omp taskwait is global to all threads in the current pool, and creating new pools is slow and causes cache contention.
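
To make the comparison concrete, here is roughly what the fib example from earlier in this thread would look like written with OpenMP tasks. This is only an untested sketch for contrast, with no serial cutoff, so spawning overhead would dominate; it would be compiled with -fopenmp instead of -fcilkplus.

Code: Select all

#include <assert.h>

int fib(int n) {
    if (n < 2) return n;
    int a, b;
    #pragma omp task shared(a) firstprivate(n)
    a = fib(n-1);             /* child task computes fib(n-1) */
    b = fib(n-2);             /* parent continues with fib(n-2) */
    #pragma omp taskwait      /* wait for the task spawned above */
    return a + b;
}

int main() {
    int result;
    #pragma omp parallel      /* create the worker thread pool */
    #pragma omp single        /* one thread starts the recursion */
    result = fib(30);
    assert(result == 832040);
    return 0;
}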

I can see how two versions of certain software packages would be needed to optimize performance. FFTW uses hybrid and more advanced algorithms that should perform much better than my simple code. However, I suspect the FFTW package in Raspbian has been tuned for the ARMv6 architecture of the original Raspberry Pi and is noticeably slower than it would be if tuned for the ARMv7. Since both architectures are 32-bit, this is more like having different packages tuned for 686 versus 486 processors in the Intel world.

People don't usually swap internal hard drives from one Intel PC to another; however, swapping SD cards between Pis is much more likely. I think one of the design principles behind Raspbian is that a teacher should be able to move a microSD card from any Pi to any other Pi without problems. The presence of ARMv7-specific packages in Raspbian would break this compatibility. While there are other Linux distributions optimized for ARMv7 that should run well on the Pi 2B, it would be nice to have gcc-5.0 with the Intel/MIT Cilk extensions for parallel processing on the standard Raspbian distribution for teaching purposes. I suppose the way to do this is to compile gcc-5.0 using --with-arch=armv6 and then insert some check into __cilkrts_fence for the processor type before issuing a DSB instruction.

ejolson
Posts: 1007
Joined: Tue Mar 18, 2014 11:47 am

Re: Cilkplus on RPi2B

Tue Apr 14, 2015 10:35 am

As suggested by tufty in

viewtopic.php?p=739221#p739221

I changed the define in os-fence.h to use the deprecated armv6 memory barrier instruction and then reconfigured gcc-5.0 using the command

$ ../gcc-src/configure -v --enable-languages=c,c++ \
--prefix=/usr/local/gcc-5.0 --with-arch=armv6 \
--with-fpu=vfp --with-float=hard --build=arm-linux-gnueabihf \
--host=arm-linux-gnueabihf --target=arm-linux-gnueabihf

The resulting compiler now produces executables that run on both the Raspberry Pi 2B and the original Raspberry Pi. Obviously no parallel speedup is obtained on the original Raspberry Pi; however, binary compatibility between all Raspberry Pi models is convenient. Moreover, execution speeds on the Pi 2B are nearly the same as before. To demonstrate this I wrote a program that computes fractal basins of attraction for Newton's method. The execution speed of this program when compiled by the armv7-a build of gcc-5.0 and by the new armv6 compatible version is depicted in the following graph.

[Image: execution speed of the armv7-a and armv6 builds of gcc-5.0]

For either build of gcc-5.0 the parallel speedup of the resulting executables when running on 4 cores was approximately 3.8 fold. For reference the source code is

Code: Select all

#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cilk/cilk.h>
#include <complex.h>
#include <fenv.h>
#define FE_BAD (FE_INVALID|FE_DIVBYZERO|FE_OVERFLOW|FE_UNDERFLOW)

typedef double complex Complex;

#define N 420
#define M 560
#define IMAX 100

Complex Z0[N][M],Z[N][M];
double xm=0.0, ym=0.0;
double delta=2.0/N;

inline Complex f(Complex z){
    return (z*z)*(z*z)*z+1;
}
inline Complex df(Complex z){
    return 5*(z*z)*(z*z);
}
inline Complex phi(Complex z){
    Complex zdelta=f(z)/df(z);
    if(fetestexcept(FE_BAD)){
        feclearexcept(FE_BAD);
        return z;
    }
    return z-zdelta;
}
Complex newton(Complex z){
    for(int i=0;i<IMAX;i++){
        Complex w0=phi(z);
        if(w0==z) return z;
        z=w0;
    }
    return z;
}

static double tic_time;
void tic() {
    struct timeval tp;
    gettimeofday(&tp,0);
    tic_time=1.0e-6*tp.tv_usec+tp.tv_sec;
}
double toc() {
    struct timeval tp;
    gettimeofday(&tp,0);
    return 1.0e-6*tp.tv_usec+tp.tv_sec-tic_time;
}

typedef struct {
    unsigned char r,g,b;
} pixel;
static pixel ztorgb(complex z){
    pixel p;
    double h,s,v,r;
    if((h=carg(z))<0) h+=2*M_PI;
    r=cabs(z);
    if(r<1.0){
        s=1; v=1-r; v=1-v*v;
    } else {
        v=1; s=3/(2+r);
    }
    int i=h*=3/M_PI;
    double f=h-i;
    v*=255;
#define P v*(1-s)
#define Q v*(1-s*f)
#define T v*(1-s*(1-f))
    switch(i%6) {
        case 0:
            p.r=v; p.g=T; p.b=P;
            break;
        case 1:
            p.r=Q; p.g=v; p.b=P;
            break;
        case 2:
            p.r=P; p.g=v; p.b=T;
            break;
        case 3:
            p.r=P; p.g=Q; p.b=v;
            break;
        case 4:
            p.r=T; p.g=P; p.b=v;
            break;
        default:
            p.r=v; p.g=P; p.b=Q;
            break;
    }
#undef T
#undef Q
#undef P
    return p;
}
void ztoppm6(Complex z[N][M],char *p){
    pixel raster[M];
    FILE *fp;
    if(!(fp=fopen(p,"w"))){
        fprintf(stderr,"Can't open %s for write!\n",p);
        exit(6);
    }
    fprintf(fp,"P6\n# Created by ztoppm.\n%d %d\n255\n",M,N);
    for(int n=0;n<N;n++) {
        for(int m=0;m<M;m++) raster[m]=ztorgb(z[n][m]);
        fwrite(raster,sizeof(pixel),M,fp);
    }
    fclose(fp);
}

void juliaserial(Complex z[N][M]){
    for(int n=0;n<N;n++) for(int m=0;m<M;m++){
        z[n][m]=newton(xm+(m-M/2)*delta+I*(ym+(N/2-n)*delta));
    }
}
void juliaparallel(Complex z[N][M]){
    cilk_for(int n=0;n<N;n++) for(int m=0;m<M;m++){
        z[n][m]=newton(xm+(m-M/2)*delta+I*(ym+(N/2-n)*delta));
    }
}

double diff2(Complex z0[N][M],Complex z1[N][M]){
    double r=0;
    for(int n=0;n<N;n++) for(int m=0;m<M;m++){
        Complex t=z0[n][m]-z1[n][m];
        r+=creal(t)*creal(t)+cimag(t)*cimag(t);
    }
    return r;
}
typedef void (*juliafractal)(Complex z[N][M]);
double bench(juliafractal julia){
    double tmin=100.0,ttot=0.0;
    for(int j=0;;j++){
        tic();
        julia(Z);
        double t=toc();
        double e=diff2(Z,Z0);
        if(e!=0) {
            fprintf(stderr,"Error %25.15e is non zero!\n",e);
            exit(1);
        }
        putchar('.'); fflush(stdout);
        tmin=(t<tmin)?t:tmin;
        ttot+=t;
        if(ttot>5 && j>3) break;
    }
    printf("\n");
    return tmin;
}

int main(int argc, char *argv[]){
    printf(
        "%s -- Compute Basin of Attraction for Newton's Method\n"
        "Image size %d x %d with maximum of %d iterations\n\n",
        argv[0],N,M,IMAX);
    feclearexcept(FE_ALL_EXCEPT);
    juliaserial(Z0);
    double tserial=bench(juliaserial);
    double tparallel=bench(juliaparallel);
    printf(
        "\n%24s %24.15e seconds\n"
        "%24s %24.15e seconds\n"
        "%24s %24g fold\n",
        "Serial elapsed time",tserial,
        "Parallel elapsed time",tparallel,
        "Parallel speedup factor",tserial/tparallel);
    ztoppm6(Z0,"julia.ppm");
    return 0;
}
and the fractal that was computed looks like

[Image: basin of attraction fractal for Newton's method]

ejolson
Posts: 1007
Joined: Tue Mar 18, 2014 11:47 am

Re: Cilkplus on RPi2B

Thu Apr 16, 2015 7:59 am

It appears my build of gcc-5.0 with the Intel/MIT Cilk parallel extensions for the Raspberry Pi 2B is working reasonably well. I haven't found any bugs related to the cilkrts runtime, and the resulting parallel speedup has been good. This post describes one more test, this time an implementation of the parallel merge sort algorithm.

In my previous tests effective parallel speedup could be achieved by simply adding cilk_spawn, cilk_sync and cilk_for to the code. Making a merge sort parallel is more complicated because it requires adding a binary search to divide and conquer the merge operation. An additional 15 lines of code were required to make the code parallel. The resulting parallel merge sort runs about 3.5 times faster on the Raspberry Pi 2B than the serial code.

Code: Select all

./merge -- Parallel Mergesort Benchmark
Working with integer arrays of size 1048576.

     Serial elapsed time    4.275419712066650e-01 seconds
   Parallel elapsed time    1.199369430541992e-01 seconds
 Parallel speedup factor                  3.56472 fold
Correct scaling with problem size is demonstrated by the graph

[Image: merge sort run time versus problem size]

For reference the code is

Code: Select all

#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cilk/cilk.h>

#ifndef N
# define N 1048576
#endif

int x[N],y[N],xgold[N];

static double tic_time;
void tic() {
    struct timeval tp;
    gettimeofday(&tp,0);
    tic_time=tp.tv_usec*1.e-6+tp.tv_sec;
}
double toc() {
    struct timeval tp;
    gettimeofday(&tp,0);
    return tp.tv_usec*1.e-6+tp.tv_sec-tic_time;
}

void mergeserial(int m,int x[m],int n, int y[n],int z[n+m]){
    int k=0,i=0,j=0;
    while(i<m&&j<n){
        if(x[i]<=y[j]) z[k++]=x[i++];
        else z[k++]=y[j++];
    }
    while(i<m) z[k++]=x[i++];
    while(j<n) z[k++]=y[j++];
}
void msortserial(int n,int p,int x[n],int y[n]){
    if(n==1) {
        if(p%2) y[0]=x[0];
        return;
    }
    int m=n/2;
    msortserial(m,p+1,x,y);
    msortserial(n-m,p+1,x+m,y+m);
    if(p%2) mergeserial(m,x,n-m,x+m,y);
    else mergeserial(m,y,n-m,y+m,x);
}
int binsearch(int n,int y[n],int r){
    int a=0,b=n-1;
    if(y[a]>r) return 0;
    if(y[b]<r) return n;
    for(int m=(a+b)/2;a!=m;m=(a+b)/2){
        if(y[m]<=r) a=m;
        else b=m;
    }
    return b;
}
void mergeparallel(int m,int *x,int n, int *y,int z[n+m]){
    if(m<n){
        int k=m; m=n; n=k;
        int *p=x; x=y; y=p;
    }
    if(m<4096){
        mergeserial(m,x,n,y,z);
        return;
    }
    int m2=m/2;
    int n2=binsearch(n,y,x[m2]);
    cilk_spawn mergeparallel(m2,x,n2,y,z);
    mergeparallel(m-m2,x+m2,n-n2,y+n2,z+m2+n2);
    cilk_sync;
}
void msortparallel(int n,int p,int x[n],int y[n]){
    if(p>5||n<1024) {
        msortserial(n,p,x,y);
        return;
    }
    int m=n/2;
    cilk_spawn msortparallel(m,p+1,x,y);
    msortparallel(n-m,p+1,x+m,y+m);
    cilk_sync;
    if(p%2) mergeparallel(m,x,n-m,x+m,y);
    else mergeparallel(m,y,n-m,y+m,x);
}

typedef void (*sortalg)(int n,int p,int x[n],int y[n]);
double bench(sortalg sort){
    double tmin=100.0,ttot=0.0;
    for(int j=0;;j++){
        srand(1234);
           for(int k=0;k<N;k++) x[k]=rand();
        tic();
        sort(N,0,x,y);
        double t=toc();
        putchar('.'); fflush(stdout);
        for(int k=0;k<N;k++) if(x[k]!=xgold[k]) {
            fprintf(stderr,"Error in parallel sort at index %d!\n",k);
            exit(1);
        }
        tmin=(t<tmin)?t:tmin;
        ttot+=t;
        if(ttot>5 && j>3) break;
    }
    printf("\n");
    return tmin;
}

int main(int argc, char *argv[]){
    printf(
        "%s -- Parallel Mergesort Benchmark\n"
        "Working with integer arrays of size %d.\n\n",
        argv[0],N);
    srand(1234);
       for(int k=0;k<N;k++) xgold[k]=rand();
    msortserial(N,0,xgold,y);
    double tserial=bench(msortserial);
    double tparallel=bench(msortparallel);
    printf(
        "\n%24s %24.15e seconds\n"
        "%24s %24.15e seconds\n"
        "%24s %24g fold\n",
        "Serial elapsed time",tserial,
        "Parallel elapsed time",tparallel,
        "Parallel speedup factor",tserial/tparallel);
    return 0;
}
Last edited by ejolson on Mon Apr 27, 2015 3:02 pm, edited 1 time in total.

dmc1954
Posts: 17
Joined: Sun Mar 24, 2013 2:41 pm
Location: Austin, Texas, USA

Re: Cilkplus on RPi2B

Mon Apr 20, 2015 5:38 pm

This morning I saw that gcc 5.1 will be released Wednesday, April 22.
https://gcc.gnu.org/ml/gcc/2015-04/msg00214.html

The current release candidate is ftp://gcc.gnu.org/pub/gcc/snapshots/5.0.1-RC-20150418 and it still looks like it needs Tufty's patch to generic/os-fence.h and your patch to generic/cilk-abi-vla.c to enable Cilkplus on the Raspberry Pi.

Why is it called gcc 5.1 and the release candidate called 5.0.1? Well here is the link to the new numbering scheme
https://gcc.gnu.org/develop.html#num_scheme

Later,
David

ejolson
Posts: 1007
Joined: Tue Mar 18, 2014 11:47 am

Re: Cilkplus on RPi2B

Mon Apr 27, 2015 1:03 am

dmc1954 wrote:This morning I saw that gcc 5.1 will be released Wednesday, April 22.
The new GCC appears to be out. I wonder if my patches still work. Have you downloaded the official release and tried to compile a Raspberry Pi2B compatible version of the Cilk runtime library?

I'm a little surprised there has not been more interest in this thread. Clearly the Raspberry Pi isn't intended for big computations; however, it is conceivable that parallel speedup might also be important for applications running in a small form factor. From an educational point of view, the same parallel processing techniques used to perform big computations on large multi-core servers can also be learned using the Raspberry Pi2B. Given the prevalence of multi-core hardware capable of parallel processing, I believe there will be a growing interest in this area as time passes.

Intel/MIT Cilk on the Raspberry Pi2B makes it easy to code parallel algorithms and test with hands-on experience whether those algorithms result in practical speed increases. Even though my patches for the ARM architecture do not fast path the case where swapping stacks is not required to complete a unit of work, the balance between memory bandwidth and processor speed in the Raspberry Pi still allows for effective implementation of many parallel algorithms that scale well with the number of cores.

In this post I'll present one more example: a numerical simulation of the Lorenz 96 model. Note that these are not the equations for the Lorenz butterfly, but a higher-dimensional system, dx(i)/dt = (x(i+1) - x(i-2))*x(i-1) - x(i) + F with the index i taken modulo N, that also exhibits chaotic behavior. Systems with dimension N=40 were originally used as a toy problem to study weather forecasting; however, N=40 represents a very small problem from a parallel processing point of view. To show how parallel efficiency depends on problem size, the number of floating point operations per second versus the problem size for both the serial and parallel codes is graphed below:

[Image: mflops versus problem size for the serial and parallel Lorenz 96 codes]

For problem sizes less than 1024 the startup overhead of the parallel loop dominates the calculation, and the mflops rate actually drops, by as much as a factor of 28, when running on 4 cores. When working in double precision (indicated by blue in the graph), the parallel code is faster than the serial code for problem sizes greater than 2048. However, in single precision (red) the single-core CPU speed is higher and the crossover point occurs later, at 8192. It is worth noting that the maximum speed observed for the Raspberry Pi2B was 1049.07 million single-precision floating point operations per second with an operation mixture consisting of 50% addition and subtraction versus 50% multiplication.

The source code was compiled using the command

Code: Select all

/usr/local/gcc-5.0/bin/gcc -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 \
    -mfloat-abi=hard -ffast-math -Wall -std=gnu99 -fcilkplus \
    -o lorenz lorenz.c -lcilkrts -lm
and is included below for reference.

Code: Select all

#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cilk/cilk.h>
#define xstr(s) str(s)
#define str(s) #s

#ifdef SINGLE
# define Real float
#else
# define Real double
#endif

#ifndef N
# define N 32768
#endif

#define F 8
#define T 10
#define K 2048
Real dt, expmdt, forcdt;
Real xb[N+3],xgoldb[N+3],yb[N+3];
Real *restrict x=xb+2,*restrict xgold=xgoldb+2,*restrict y=yb+2;

static double tic_time;
void tic() {
    struct timeval tp;
    gettimeofday(&tp,0);
    tic_time=(double) tp.tv_sec + (double) tp.tv_usec * 1.e-6;
}
double toc() {
    struct timeval tp;
    gettimeofday(&tp,0);
    return ((double) tp.tv_sec + (double) tp.tv_usec * 1.e-6)-tic_time;
}

void lorenzserial(Real *x){
    for(int k=0;k<K;k++){
        x[-1]=x[N-1]; x[-2]=x[N-2]; x[N]=x[0];
        for(int i=0;i<N;i++){
            y[i]=(x[i]+dt*(x[i+1]-x[i-2])*x[i-1])*expmdt+forcdt;
        }
        Real *t=y; y=x; x=t;
    }
}

void lorenzparallel(Real *x){
    for(int k=0;k<K;k++){
        x[-1]=x[N-1]; x[-2]=x[N-2]; x[N]=x[0];
        cilk_for(int i=0;i<N;i++){
            y[i]=(x[i]+dt*(x[i+1]-x[i-2])*x[i-1])*expmdt+forcdt;
        }
        Real *t=y; y=x; x=t;
    }
}

typedef void (*lorenzsolver)(Real *x);
double bench(lorenzsolver lorenz){
    double tmin=100.0,ttot=0.0;
    for(int j=0;;j++){
        for(int i=0;i<N;i++) x[i]=sin(2*M_PI*i/N);
        tic();
        lorenz(x);
        double t=toc();
        putchar('.'); fflush(stdout);
        for(int i=0;i<N;i++) if(x[i]!=xgold[i]) {
            fprintf(stderr,"Error in parallel solver at index %d!\n",i);
            exit(1);
        }
        tmin=(t<tmin)?t:tmin;
        ttot+=t;
        if(ttot>5 && j>3) break;
    }
    printf("\n");
    return tmin;
}

int main(int argc, char *argv[]){
    printf(
        "%s -- Parallel Lorenz 96 Solver Benchmark\n"
        "Precision " xstr(Real) " ODEs of size %d\n\n",
        argv[0],N);
    dt=(double)(T)/K;
    expmdt=exp(-dt);
    forcdt=2*exp(-dt/2)*sinh(dt/2)*F;
    for(int i=0;i<N;i++) xgold[i]=sin(2*M_PI*i/N);
    lorenzserial(xgold);
    double tserial=bench(lorenzserial);
    double tparallel=bench(lorenzparallel);
    printf(
        "\n%24s %24.15e seconds (mflops %g)\n"
        "%24s %24.15e seconds (mflops %g)\n"
        "%24s %24g fold\n",
        "Serial elapsed time",tserial,6e-6*N*K/tserial,
        "Parallel elapsed time",tparallel,6e-6*N*K/tparallel,
        "Parallel speedup factor",tserial/tparallel);
    return 0;
}

dmc1954
Posts: 17
Joined: Sun Mar 24, 2013 2:41 pm
Location: Austin, Texas, USA

Re: Cilkplus on RPi2B

Mon Apr 27, 2015 4:39 pm

Yes, I built gcc 5.1 with the Cilkplus extension this weekend and uploaded the tar file to github.

This is a Cilkplus-enabled gcc-5.1.0 compiler for the Raspberry Pi (armv6); see viewtopic.php?f=33&t=102743&p=736001 for ejolson's current build instructions.

To install become root and execute the following commands:

pi@raspberrypi2 $ sudo su -
root@raspberrypi2 # cd /usr/local
root@raspberrypi2 # wget https://github.com/davidcarver/RPI2-Cil ... 1.0.tar.xz
root@raspberrypi2 # tar xf gcc-5.1.0.tar.xz

Modify your normal user environment variables (PATH and LD_LIBRARY_PATH) to use gcc 5.1 with the cilkplus extension

pi@raspberrypi2 $ export PATH=/usr/local/gcc-5.1.0/bin:$PATH
pi@raspberrypi2 $ export LD_LIBRARY_PATH=/usr/local/gcc-5.1.0/lib:$LD_LIBRARY_PATH

Now run gcc and check the version

pi@raspberrypi2 $ gcc -v Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/usr/local/gcc-5.1.0/libexec/gcc/arm-linux-gnueabihf/5.1.0/lto-wrapper Target: arm-linux-gnueabihf Configured with: ../gcc-5.1.0/configure -v --enable-languages=c,c++ --prefix=/usr/local/gcc-5.1.0 --with-arch=armv6 --with-fpu=vfp --with-float=hard --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf Thread model: posix gcc version 5.1.0 (GCC)

For a little more information see https://github.com/davidcarver/RPI2-CilkPlus

ejolson
Posts: 1007
Joined: Tue Mar 18, 2014 11:47 am

Re: Cilkplus on RPi2B

Sat May 02, 2015 4:59 am

I told people on the Intel Cilk Plus Developer Zone about my port of the Cilk runtime library to the Raspberry Pi 2B and received some interesting feedback. Apparently the routine with the typo in cilk-abi-vla.c isn't used because variable length arrays defined in functions which cilk_spawn other functions aren't fully supported in GCC yet. I'll write some code to check this later.

In the meantime, I posted another Cilk parallel processing example in reply to a different thread here on the Raspberry Pi forum that illustrates near perfect parallel scaling from one core to four cores. Usually the shared memory bus speed limits parallel scaling; however, the computation in this synthetic benchmark can be done completely in registers and cache.

ejolson
Posts: 1007
Joined: Tue Mar 18, 2014 11:47 am

Re: Cilkplus on RPi2B

Tue May 12, 2015 4:43 pm

dmc1954 wrote:Have you had time to compare your Cilk version of the parallel FFT program to an OpenMP version?
I still haven't compared with OpenMP; however, I just translated my parallel recursive fast Fourier transform into the Go language. The Google Go reference compiler installed from the Raspbian repository using

Code: Select all

# apt-get install golang
was tested along with the GNU Go compiler (gccgo) from gcc 5.1. To install GNU Go I compiled gcc-5.1 for the Raspberry Pi 2B with support for the Go programming language. To do this I used the configuration command

Code: Select all

../gcc-5.1.0/configure -v --enable-languages=c,go --prefix=/usr/local/gcc-5.1 --with-arch=armv6 --with-fpu=vfp --with-float=hard --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
with the same patches described earlier in this thread, along with the change

Code: Select all

// COMMON_SYSDEP void __cilkrts_fence(void); ///< MFENCE instruction
# define __cilkrts_fence() __asm__ __volatile__ ("mcr   p15,0,%[t],c7,c10,4\n" :: [t] "r" (0) : "memory");
to os-fence.h described separately to make the Cilk runtime library backwards compatible with the original Raspberry Pi.

Three executables were produced using commands

Code: Select all

$ go build -o pfft fft.go
$ /usr/local/gcc-5.1/bin/gccgo -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard -ffast-math -Wall -o gfft fft.go
$ /usr/local/gcc-5.1/bin/gcc -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard -ffast-math -Wall -std=gnu99 -fcilkplus -o dfft fft.c -lcilkrts -lfftw3 -lm
Transforms of size N=32768 that require 5 N log_2(N) = 2457600 floating point operations were used to test the performance of the resulting executables. As indicated in the figure below, the performance of Cilk was roughly twice that of Go for these tests.

[Image: FFT performance of Cilk compared with Go]

The performance differences may be a result of my unfamiliarity with the Go programming language. Therefore, I would be happy to know of any changes to my code that result in better performance. Note also that the lack of a built-in parallel for loop makes the Go code slightly more verbose.

Code: Select all

package main

import ("math"; "time"; "fmt"; "os"; "runtime")

const N = 32768
var x,y [N]complex128
var wn [N/2]complex128
var ncpu int

var tic_time float64;
func tic() {
    now := time.Now()
    tic_time = float64(now.Unix())+1.0E-9*float64(now.Nanosecond())
}
func toc()(float64) {
    now := time.Now()
    return float64(now.Unix())+1.0E-9*float64(now.Nanosecond())-tic_time
}

func fftserial(n int,s int,x []complex128,y []complex128){
    m := n/2
    if m==1 {
        y[0] = x[0]
        y[m] = x[s]
    } else {
        fftserial(m,s*2,x,y)
        fftserial(m,s*2,x[s:],y[m:])
    }
    for i:=0;i<m;i++ {
        z0 := y[i]
        z1 := wn[i*s]*y[i+m]
        y[i] = z0+z1
        y[i+m] = z0-z1
    }
}

func _fftparallel(n int,s int,x []complex128,y []complex128,c chan int){
    if s>32||n<1024 {
        fftserial(n,s,x,y)
        c<-0
        return
    }
    m := n/2
    if m==1 {
        y[0] = x[0]
        y[m] = x[s]
    } else {
        sync := make(chan int,2)
        go _fftparallel(m,s*2,x,y,sync)
        go _fftparallel(m,s*2,x[s:],y[m:],sync)
        <-sync; <-sync
    }
    var tm int
    if(m<4096) {
        tm = 1
    } else {
        tm = ncpu
    }
    sem := make(chan int,tm)
    for t:=0;t<tm;t++ {
        go func(t int){
            i0:=t*m/tm
            im:=(t+1)*m/tm
            for i:=i0;i<im;i++ {
                z0 := y[i]
                z1 := wn[i*s]*y[i+m]
                y[i] = z0+z1
                y[i+m] = z0-z1
            }
            sem<-0
        }(t)
    }
    for t:=0;t<tm;t++ {
        <-sem
    }
    c<-0
}

func fftparallel(n int,s int,x []complex128,y []complex128){
    ret := make(chan int,1)
    go _fftparallel(n,s,x,y,ret)
    <-ret
}

type fouriertrans func(int,int,[]complex128,[]complex128)
func bench(fft fouriertrans)(float64){
    tmin:=100.0
    ttot:=0.0
    for j:=0;;j++ {
        tic()
        fft(N,1,x[0:],y[0:])
        t:=toc()
        e:=imag(y[N-1])-N/2
        if math.Abs(e)>0.5 {
            fmt.Printf("Error %25.15e is too big in FFT!\n",e)
            os.Exit(1)
        }
        fmt.Printf(".")
        if t<tmin {
            tmin = t
        }
        ttot = ttot + t
        if ttot>5 && j>3 {
            break
        }
    }
    fmt.Printf("\n")
    return tmin
}

func main(){
    fmt.Printf(
        "%s -- Parallel Fast Fourier Transform Benchmark (in golang)\n" +
        "Precision float64 FFTs of size %d GOMAXPROCS=%d\n\n",
        os.Args[0],N,runtime.GOMAXPROCS(0))
    ncpu = runtime.GOMAXPROCS(0)
    for i:=0;i<N;i++ {
         x[i] = complex(math.Sin(2.0*math.Pi*float64(i)/float64(N)),0)
    }
    theta :=-2*math.Pi/N
    for i:=0;i<N/2;i++ {
        wn[i]=complex(math.Cos(float64(i)*theta),math.Sin(float64(i)*theta))
    }
    tserial := bench(fftserial)
    tparallel := bench(fftparallel)
    fmt.Printf(
        "\n%24s %24.15e seconds\n" +
        "%24s %24.15e seconds\n" +
        "%24s %24g fold\n",
        "Serial elapsed time",tserial,
        "Parallel elapsed time",tparallel,
        "Parallel speedup factor",tserial/tparallel)
    os.Exit(0)
}
Last edited by ejolson on Wed May 13, 2015 8:15 am, edited 6 times in total.

Heater
Posts: 7412
Joined: Tue Jul 17, 2012 3:02 pm

Re: Cilkplus on RPi2B

Tue May 12, 2015 4:51 pm

I cannot really advise on the performance of Go. Only that a year ago or so I wrote the same functionality in Go and in JavaScript running under node.js

To my amazement the JS version was twice as fast as Go! And JS is an interpreted language.

It seemed to me that the Go garbage collector was stopping, starting and generally stuttering things a lot. I did not take the time to figure out if I could reduce garbage collection issues in the Go version. It was so much easier to do what I wanted in JavaScript.

Of course JS would not take advantage of many cores if they are available like Go can.

ejolson
Posts: 1007
Joined: Tue Mar 18, 2014 11:47 am

Re: Cilkplus on RPi2B

Tue Jun 30, 2015 10:11 pm

Here is an example using Intel/MIT Cilk to construct a parallel quadrature routine with adaptive refinement based on Simpson's rule. The algorithm divides [a,b] into 2^N intervals and further subdivides each interval if the estimated error is too large. Each recursive call is done in parallel until the estimated error is within a factor of 1e5 of the desired error, at which point the serial code is used to finish the calculation. As with the other examples, the parallel code is the same as the original serial code with the addition of a few lines involving the cilk_spawn and cilk_sync primitives.

Near perfect parallel scaling is obtained as seen by the output

Code: Select all

$ /usr/local/gcc-5.1/bin/gcc -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 \
  -ffast-math -Wall -fcilkplus -o quad quad.c -lcilkrts -lm
$ ./quad 
./quad -- Parallel Adaptive Numeric Quadrature Benchmark
Refining 2^10 intervals to error 1e-08 CILK_NWORKERS=4

     Serial elapsed time    1.054170105667114e+00 seconds
   Parallel elapsed time    2.657480469970703e-01 seconds
 Parallel speedup factor                   3.9668 fold
For reference the source code is

Code: Select all

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cilk/cilk.h>
#include <cilk/cilk_api.h>

#define N 10
#define EPS 1e-8

static double tic_time;
void tic() {
    struct timeval tp;
    gettimeofday(&tp,0);
    tic_time=tp.tv_sec+tp.tv_usec*1.e-6;
}
double toc() {
    struct timeval tp;
    gettimeofday(&tp,0);
    return tp.tv_sec+tp.tv_usec*1.e-6-tic_time;
}

double f(double x){
    return x*sin(x*x);
}
double F(double x){
    return -cos(x*x)/2;
}
static double simpser(int ln,
    double a,double fa,double b,double fb,double eps){
    double c=(a+b)/2,fc=f(c);
    if(ln<=0){
        double eab=fabs((fa-2*fc+fb)*(b-a)/12);
        if(eab<eps) return (fa+4*fc+fb)*(b-a)/6;
    }
    eps/=2;
    ln--;
    return simpser(ln,a,fa,c,fc,eps)+simpser(ln,c,fc,b,fb,eps);
}
double quadser(int ln,double a,double b,double eps){
    double fa=f(a),fb=f(b);
    return simpser(ln,a,fa,b,fb,eps);
}

static double simppar(int ln,
    double a,double fa,double b,double fb,double eps){
    double c=(a+b)/2,fc=f(c);
    double eab=fabs((fa-2*fc+fb)*(b-a)/12);
    if(ln<=0&&eab<eps) return (fa+4*fc+fb)*(b-a)/6;
    eps/=2;
    ln--;
    if(eab>1e5*eps){
        double r1,r2;
        r1=cilk_spawn simppar(ln,a,fa,c,fc,eps);
        r2=simppar(ln,c,fc,b,fb,eps);
        cilk_sync;
        return r1+r2;
    }
    return simpser(ln,a,fa,c,fc,eps)+simpser(ln,c,fc,b,fb,eps);
}
double quadpar(int ln,double a,double b,double eps){
    double fa=f(a),fb=f(b);
    return simppar(ln,a,fa,b,fb,eps);
}

typedef double (*quadrature)(int n,double a,double b,double eps);
double bench(quadrature quad){
    double tmin=100.0,ttot=0.0;
    for(int j=0;;j++){
        tic();
        double r=quad(N,0,10,EPS);
        double t=toc();
        double e=r-F(10)+F(0);
        if(fabs(e)>EPS){
            fprintf(stderr,"Error %25.15e is too big in quadrature!\n",e);
            exit(1);
        }
        putchar('.'); fflush(stdout);
        tmin=(t<tmin)?t:tmin;
        ttot+=t;
        if(ttot>5 && j>3) break;
    }
    printf("\n");
    return tmin;
}

int main(int argc, char *argv[]){
    int ncpu = __cilkrts_get_nworkers();
    printf(
        "%s -- Parallel Adaptive Numeric Quadrature Benchmark\n"
        "Refining 2^%d intervals to error %g CILK_NWORKERS=%d\n\n",
        argv[0],N,EPS,ncpu);
    double tserial=bench(quadser);
    double tparallel=bench(quadpar);
    printf(
        "\n%24s %24.15e seconds\n"
        "%24s %24.15e seconds\n"
        "%24s %24g fold\n",
        "Serial elapsed time",tserial,
        "Parallel elapsed time",tparallel,
        "Parallel speedup factor",tserial/tparallel);
    return 0;
}

ejolson
Posts: 1007
Joined: Tue Mar 18, 2014 11:47 am

Re: Cilkplus on RPi2B

Sat Jul 04, 2015 12:38 am

Conway's Game of Life https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life is a cellular automaton that involves only integer arithmetic. The simplest parallel algorithm, parallel0, is obtained by changing the outer loop to a parallel cilk_for loop. Since the

Code: Select all

#pragma cilk grainsize=n
directive currently appears not to affect anything in the gcc-5.1 implementation of Intel/MIT Cilk, further code changes are necessary to obtain an optimized routine, parallel1, that gives another 10 percent performance improvement. The results on a 256 by 256 grid are

[Image: parallel speedup for the Game of Life benchmark]

which shows near linear scaling with a maximum parallel speedup of about 3.5 fold.
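
For reference, had the grainsize pragma taken effect, the straightforward parallel routine would only have needed one extra line. The following sketch uses the same cell, M, N and ncpu definitions as the full listing below and is shown only to document the intended syntax, not code that was actually used for the timings:

Code: Select all

static void genpar0grained(cell b[M][N],cell a[M][N]){
    /* Hypothetical: ask the runtime for chunks of roughly M/ncpu rows
       per worker.  In this gcc-5.1 build the pragma appeared to be
       ignored, which is why genpar1 splits the rows by hand instead. */
    #pragma cilk grainsize = M/ncpu
    cilk_for(int i=0;i<M;i++){
        int u=(i+M-1)%M,d=(i+1)%M;
        for(int j=0;j<N;j++){
            int l=(j+N-1)%N,r=(j+1)%N;
            int s=a[u][l]+a[u][j]+a[u][r]
                 +a[i][l]        +a[i][r]
                 +a[d][l]+a[d][j]+a[d][r];
            b[i][j]=s==3?1:s==2?a[i][j]:0;
        }
    }
}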

For reference, the source code is

Code: Select all

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <cilk/cilk.h>
#include <cilk/cilk_api.h>

#define KMAX 256
#define M 256
#define N 256

typedef char cell;

cell x0[M][N];
int cgold,ncpu;

static double tic_time;
void tic(){
    struct timeval tp;
    gettimeofday(&tp,0);
    tic_time=tp.tv_sec+tp.tv_usec*1.e-6;
}
double toc(){
    struct timeval tp;
    gettimeofday(&tp,0);
    return tp.tv_sec+tp.tv_usec*1.e-6-tic_time;
}

static void genpar1sub(int ic,cell b[M][N],cell a[M][N]){
    int i0=ic*M/ncpu,iM=(ic+1)*M/ncpu;
    for(int i=i0;i<iM;i++){
        int u=(i+M-1)%M,d=(i+1)%M;
        for(int j=0;j<N;j++){
            int l=(j+N-1)%N,r=(j+1)%N;
            int s=a[u][l]+a[u][j]+a[u][r]
                 +a[i][l]        +a[i][r]
                 +a[d][l]+a[d][j]+a[d][r];
            b[i][j]=s==3?1:s==2?a[i][j]:0;
        }
    }
}
static void genpar1(cell b[M][N],cell a[M][N]){
    cilk_for(int ic=0;ic<ncpu;ic++) genpar1sub(ic,b,a);
}
void evolvepar1(int kmax,cell y[M][N],cell x[M][N]){
    cell t[M][N],(*ap)[N],(*bp)[N];
    if(kmax%2){
        ap=t; bp=y;
    } else {
        ap=y; bp=t;
    }
    memcpy(ap,x,M*N*sizeof(cell));
    for(int k=0;k<kmax;k++){
        genpar1(bp,ap);
        cell (*tp)[N]=bp;
        bp=ap; ap=tp;
    }
}

static void genpar0(cell b[M][N],cell a[M][N]){
    cilk_for(int i=0;i<M;i++){
        int u=(i+M-1)%M,d=(i+1)%M;
        for(int j=0;j<N;j++){
            int l=(j+N-1)%N,r=(j+1)%N;
            int s=a[u][l]+a[u][j]+a[u][r]
                 +a[i][l]        +a[i][r]
                 +a[d][l]+a[d][j]+a[d][r];
            b[i][j]=s==3?1:s==2?a[i][j]:0;
        }
    }
}
void evolvepar0(int kmax,cell y[M][N],cell x[M][N]){
    cell t[M][N],(*ap)[N],(*bp)[N];
    if(kmax%2){
        ap=t; bp=y;
    } else {
        ap=y; bp=t;
    }
    memcpy(ap,x,M*N*sizeof(cell));
    for(int k=0;k<kmax;k++){
        genpar0(bp,ap);
        cell (*tp)[N]=bp;
        bp=ap; ap=tp;
    }
}

static void genser(cell b[M][N],cell a[M][N]){
    for(int i=0;i<M;i++){
        int u=(i+M-1)%M,d=(i+1)%M;
        for(int j=0;j<N;j++){
            int l=(j+N-1)%N,r=(j+1)%N;
            int s=a[u][l]+a[u][j]+a[u][r]
                 +a[i][l]        +a[i][r]
                 +a[d][l]+a[d][j]+a[d][r];
            b[i][j]=s==3?1:s==2?a[i][j]:0;
        }
    }
}
void evolveser(int kmax,cell y[M][N],cell x[M][N]){
    cell t[M][N],(*ap)[N],(*bp)[N];
    if(kmax%2){
        ap=t; bp=y;
    } else {
        ap=y; bp=t;
    }
    memcpy(ap,x,M*N*sizeof(cell));
    for(int k=0;k<kmax;k++){
        genser(bp,ap);
        cell (*tp)[N]=bp;
        bp=ap; ap=tp;
    }
}

static void doinit(cell a[M][N]){
    srand(12345);
    for(int i=0;i<M;i++) for(int j=0;j<N;j++)
        a[i][j]=rand()<RAND_MAX/3?1:0;
}
static int alive(cell a[M][N]){
    int c=0;
    for(int i=0;i<M;i++) for(int j=0;j<N;j++)
        if(a[i][j]) c++;
    return c;
}

typedef void (*evolver)(int kmax,cell b[M][N],cell a[M][N]);
double bench(evolver evolve){
    double tmin=100.0,ttot=0.0;
    for(int j=0;;j++){
        cell y[M][N];
        tic();
        evolve(KMAX,y,x0);
        double t=toc();
        int c=alive(y);
        if(c!=cgold){
            fprintf(stderr,"Number of cells left alive %d not equal %d!\n",
                c,cgold);
            exit(1);
        }
        putchar('.'); fflush(stdout);
        tmin=(t<tmin)?t:tmin;
        ttot+=t;
        if(ttot>5 && j>3) break;
    }
    printf("\n");
    return tmin;
}

int main(int argc,char *argv[]){
    ncpu = __cilkrts_get_nworkers();
    printf(
        "%s -- Parallel Conway's Game of Life Benchmark\n"
        "Evolving %d steps of a %d by %d depth %d grid CILK_NWORKERS=%d.\n\n",
        argv[0],KMAX,M,N,(int)sizeof(cell)*8,ncpu);
    doinit(x0);
    {
        struct rlimit rlim = {RLIM_INFINITY, RLIM_INFINITY};
        setrlimit(RLIMIT_STACK,&rlim);
        cell y[M][N];
        evolveser(KMAX,y,x0);
        cgold=alive(y);
    }
    double tserial=bench(evolveser);
    double tparallel0=bench(evolvepar0);
    double tparallel1=bench(evolvepar1);
    printf(
        "\n%24s %24.15e seconds\n"
        "%24s %24.15e seconds\n"
        "%24s %24.15e seconds\n"
        "%24s %24g fold\n",
        "Serial elapsed time",tserial,
        "Parallel0 elapsed time",tparallel0,
        "Parallel1 elapsed time",tparallel1,
        "Parallel1 speedup factor",tserial/tparallel1);
    return 0;
}

ripednail
Posts: 9
Joined: Fri Feb 20, 2015 6:10 am

Re: Cilkplus on RPi2B

Fri Sep 25, 2015 10:25 pm

;)
Last edited by ripednail on Sat Apr 01, 2017 7:02 am, edited 1 time in total.

ejolson
Posts: 1007
Joined: Tue Mar 18, 2014 11:47 am

Re: Cilkplus on RPi2B

Sat Nov 14, 2015 8:07 am

I've created a slightly refined Cilk version of the parallel prime finder written in JavaScript and discussed at

viewtopic.php?p=841612#p841612

The algorithm successively divides every odd number n between 5 and 100000 by the odd primes up to sqrt(n) to determine whether it's prime or not. The code

Code: Select all

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <cilk/cilk.h>

#define N 100000
#define P 10000
#define S 8

typedef unsigned long int Integer;
static Integer primes[P]={2,3};
static Integer primep[S][P/S];
static int cold;

static double tic_time;
void tic() {
    struct timeval tp;
    gettimeofday(&tp,0);
    tic_time=(double) tp.tv_sec + (double) tp.tv_usec * 1.e-6;
}
double toc() {
    struct timeval tp;
    gettimeofday(&tp,0);
    return ((double) tp.tv_sec + (double) tp.tv_usec * 1.e-6)-tic_time;
}

static int isprime(Integer n){
    int sqrtn=sqrt((double)n)+0.5;
    for(int k=1;n%primes[k];k++){
        if(primes[k]>=sqrtn) return 1;
    }
    return 0;
}
int primeserial(int nmax){
    int c=2;
    for(Integer n=5;n<=nmax;n+=2){
        if(isprime(n)) primes[c++]=n;
    }
    assert(c<=P);
    return c;
}
int primepsub(int nmin,int i,int nmax){
    int c=0;
    for(Integer n=nmin+2*i;n<=nmax;n+=2*S){
        if(isprime(n)) primep[i][c++]=n;
    }
    assert(c<=P/S);
    return c;
}
int primeparallel(int nmax){
    int sqrtnmax=sqrt((double)nmax)+0.5;
    int c=primeserial(sqrtnmax);
    int cp[S];
    cilk_for(int i=0;i<S;i++) cp[i]=primepsub(primes[c-1]+2,i,nmax);
    for(int i=0;i<S;i++) c+=cp[i];
    return c;
}

typedef int (*primefinder)(int nmax);
double bench(primefinder pprime){
    double tmin=100.0,ttot=0.0;
    for(int j=0;;j++){
        tic();
        int c=pprime(N);
        double t=toc();
        if(c!=cold){
            fprintf(stderr,"Error %d primes instead of %d found!\n",
                c,cold);
            exit(1);
        }
//        putchar('.'); fflush(stdout);
        tmin=(t<tmin)?t:tmin;
        ttot+=t;
        if(ttot>5 && j>3) break;
    }
//    printf("\n");
    return tmin;
}

int main(int argc, char *argv[]){
    printf(
        "%s -- Parallel Prime Finder Benchmark\n\n",argv[0]);
    cold=primeserial(N);
    printf("Found a total of %d primes!\n",cold);
    double tserial=bench(primeserial);
    double tparallel=bench(primeparallel);
    printf(
        "\n%24s %24.15e seconds\n" "%24s %24.15e seconds\n"
        "%24s %24g fold\n",
        "Serial elapsed time",tserial,
        "Parallel elapsed time",tparallel,
        "Parallel speedup factor",tserial/tparallel);
    return 0;
}
leads to the output

Code: Select all

$ ./prime 
./prime -- Parallel Prime Finder Benchmark

Found a total of 9592 primes!

     Serial elapsed time    1.560298184204100e-02 seconds
   Parallel elapsed time    3.988897262573240e-03 seconds
 Parallel speedup factor                   3.9116 fold
which shows near linear scaling on 4 CPUs.

ejolson
Posts: 1007
Joined: Tue Mar 18, 2014 11:47 am

Re: Cilkplus on RPi2B

Thu Dec 31, 2015 7:22 pm

Here is another parallel prime-finding algorithm. I created a parallel version of the Sieve of Eratosthenes which obtains a super-linear 6.4-fold speedup when running on 4 CPUs.

Code: Select all

./sievep -- Parallel Sieve of Eratosthenes Benchmark (in cilk)
Finding primes up to 10000000 CILK_NWORKERS=4

Found 664579 primes.

     Serial elapsed time    6.147608757019043e-01 seconds
   Parallel elapsed time    9.511995315551758e-02 seconds
 Parallel speedup factor                  6.46301 fold
In the parallel algorithm the bit array is broken into Slice=64 pieces which are worked on simultaneously. Since the slices are smaller, there appears to be less cache contention while the program is running. This may account for the super-linear speedup. In particular, running the parallel algorithm on only one processor still yields a 1.6-fold improvement in speed. (A purely serial version of the same slicing, handy for separating the cache effect from the Cilk runtime, is sketched after the code below.)

Code: Select all

$ CILK_NWORKERS=1 ./sievep
./sievep -- Parallel Sieve of Eratosthenes Benchmark (in cilk)
Finding primes up to 10000000 CILK_NWORKERS=1

Found 664579 primes.

     Serial elapsed time    6.154000759124756e-01 seconds
   Parallel elapsed time    3.771290779113770e-01 seconds
 Parallel speedup factor                   1.6318 fold
For reference the code is

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <cilk/cilk.h>
#include <cilk/cilk_api.h>

static double tic_time;
void tic() {
    struct timeval tp;
    gettimeofday(&tp,0);
    tic_time=(double) tp.tv_sec + (double) tp.tv_usec * 1.e-6;
}
double toc() {
    struct timeval tp;
    gettimeofday(&tp,0);
    return ((double) tp.tv_sec + (double) tp.tv_usec * 1.e-6)-tic_time;
}

#define getbit(p,n) (p[n>>3]&(0x01<<(n&0x07)))
#define setbit(p,n) (p[n>>3]|=(0x01<<(n&0x07)))

#define Limit 10000000
#define Slice 64

int cgold;
char notPrimebits[Limit/8+1];
int C[Slice];

static inline int round8(int j){
    j=j+7; j-=j%8;
    return j;
}

int setrange(int imax,int jmin, int jmax){
    int i,c;
    for(i=2;i<=imax;i++){
        if(!getbit(notPrimebits,i)){
            int j=jmin+i-1; j-=j%i;
            for(;j<=jmax;j+=i)
                setbit(notPrimebits,j);
        }
    }
    for(i=jmin,c=0;i<=jmax;i++){
        if(!getbit(notPrimebits,i)) c++;
    }
    return c;
}

int serial(int limit){
    int imax=sqrt((double)limit)+0.5;
    int i,c;
    for(i=2,c=0;i<=imax;i++){
        if(!getbit(notPrimebits,i)){
            c++;
            int j;
            for(j=i*i;j<=limit;j+=i)
                setbit(notPrimebits,j);
        }
    }
    for(i=imax;i<=limit;i++){
        if(!getbit(notPrimebits,i)) c++;
    }
    return c;
}

int parallel(int limit){
    int imax=round8(sqrt((double)limit)+1.5)-1;
    int c=serial(imax);
    memset(C,0,sizeof(int)*Slice);
    cilk_for(int n=0;n<Slice;n++){
        int jmin,jmax;
        jmin=round8(imax+1+n*(Limit-imax)/Slice);
        if(n==Slice-1) jmax=limit;
        else jmax=round8(imax+1+(n+1)*(Limit-imax)/Slice)-1;
        C[n]=setrange(imax,jmin,jmax);
    }
    for(int n=0;n<Slice;n++) c+=C[n];
    return c;
}

typedef int (*primesieve)(int limit);
double bench(primesieve sieve){
    double tmin=10000.0,ttot=0.0;
    for(int j=0;;j++){
        memset(notPrimebits,0,Limit/8+1);
        tic();
        int c=sieve(Limit);
        double t=toc();
        if(c!=cgold){
            fprintf(stderr,"Wrong number %d of primes found!\n",c);
            exit(1);
        }
//        putchar('.'); fflush(stdout);
        tmin=(t<tmin)?t:tmin;
        ttot+=t;
        if(ttot>5 && j>3) break;
    }
//    printf("\n");
    return tmin;
}

int main(int argc,char *argv[]){
    int ncpu=__cilkrts_get_nworkers();
    printf(
        "%s -- Parallel Sieve of Eratosthenes Benchmark (in cilk)\n"
        "Finding primes up to %d CILK_NWORKERS=%d\n\n",
        argv[0],Limit,ncpu);
    memset(notPrimebits,0,Limit/8+1);
    cgold=serial(Limit);
    printf("Found %d primes.\n",cgold);
    double tserial=bench(serial);
    double tparallel=bench(parallel);
    printf(
        "\n%24s %24.15e seconds\n"
        "%24s %24.15e seconds\n"
        "%24s %24g fold\n",
        "Serial elapsed time",tserial,
        "Parallel elapsed time",tparallel,
        "Parallel speedup factor",tserial/tparallel);
    return 0;
}
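The 1.6-fold single-worker figure above still includes whatever overhead the Cilk runtime adds. To separate the cache effect from the runtime entirely, one can sieve the same slices with an ordinary for loop. The following sketch (not part of my original program; the name serial_sliced is just for illustration) is meant to be appended to the sievep source above, since it reuses round8, setrange, C and the Slice constant, and can then be timed with bench(serial_sliced).

Code: Select all

int serial_sliced(int limit){
    /* identical to parallel() above except that the cilk_for is a plain
       for loop, so the work-stealing runtime is never involved and any
       speedup over serial() comes from cache behaviour alone */
    int imax=round8(sqrt((double)limit)+1.5)-1;
    int c=serial(imax);
    memset(C,0,sizeof(int)*Slice);
    for(int n=0;n<Slice;n++){
        int jmin,jmax;
        jmin=round8(imax+1+n*(Limit-imax)/Slice);
        if(n==Slice-1) jmax=limit;
        else jmax=round8(imax+1+(n+1)*(Limit-imax)/Slice)-1;
        C[n]=setrange(imax,jmin,jmax);
    }
    for(int n=0;n<Slice;n++) c+=C[n];
    return c;
}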

HomeBoyLV
Posts: 1
Joined: Mon Feb 08, 2016 8:28 am

Re: Cilkplus on RPi2B

Wed Mar 09, 2016 11:15 pm

I just found this thread. ejolson, I am very interested in the work you are doing. I have a couple of items that will surely be of interest to your work.

HomeBoy Out

ejolson
Posts: 1007
Joined: Tue Mar 18, 2014 11:47 am

Re: Cilkplus on RPi2B

Thu Dec 15, 2016 8:56 am

I've just posted a discussion of optimizing the Collatz algorithm using memoization and parallelization. I've placed a link in this thread, as the code provides another example of Cilk programming. While most of the performance boost came from memoization, parallelization provided an additional 3-fold speed increase.
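For readers who want a self-contained taste of the approach without following the link, here is a minimal sketch in the same spirit (my illustration for this post, not the code in the linked discussion): the step counts for every value below a cutoff CACHE are memoized serially, and a cilk_for then sweeps the remaining range, consulting the read-only table whenever an orbit drops below the cutoff. The LIMIT, CACHE and STRANDS constants are arbitrary choices.

Code: Select all

#include <stdio.h>
#include <cilk/cilk.h>

#define LIMIT   10000000ULL
#define CACHE   (1UL<<20)
#define STRANDS 64

static unsigned short memo[CACHE];        /* steps to reach 1 for n<CACHE */
static unsigned long long best[STRANDS];  /* per-strand maximum step count */

/* count steps for n, switching to the memo table once the orbit drops below CACHE */
static unsigned long long steps(unsigned long long n){
    unsigned long long s=0;
    while(n>=CACHE){ n=(n&1)?3*n+1:n/2; s++; }
    return s+memo[n];
}

int main(void){
    /* fill the memo serially: working upwards, the entry needed is always
       already present once the orbit falls below the current n */
    memo[1]=0;
    for(unsigned long long n=2;n<CACHE;n++){
        unsigned long long m=n,s=0;
        while(m>=n){ m=(m&1)?3*m+1:m/2; s++; }
        memo[n]=(unsigned short)(s+memo[m]);
    }
    /* parallel sweep over the rest of the range; each strand keeps its own
       maximum so no locks or reducers are needed */
    cilk_for(int i=0;i<STRANDS;i++){
        unsigned long long mx=0;
        for(unsigned long long n=CACHE+i;n<LIMIT;n+=STRANDS){
            unsigned long long s=steps(n);
            if(s>mx) mx=s;
        }
        best[i]=mx;
    }
    unsigned long long mx=0;
    for(int i=0;i<STRANDS;i++) if(best[i]>mx) mx=best[i];
    printf("Longest Collatz orbit below %llu takes %llu steps.\n",LIMIT,mx);
    return 0;
}
Replacing the cilk_for with a plain for loop gives the serial baseline for measuring the parallel speedup; the memo table does the bulk of the work either way.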

ejolson
Posts: 1007
Joined: Tue Mar 18, 2014 11:47 am

Re: Cilkplus on RPi2B

Fri Aug 11, 2017 12:55 am

Version 7.1 of gcc now automatically builds for ARM with Cilkplus enabled. However, the commands

Code: Select all

$ rm *.o
$ /usr/local/gcc-6.2/bin/gcc -fcilkplus -O3 \
    -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard \
    -ffast-math -Wall -c -o bench.o bench.c
$ /usr/local/gcc-6.2/bin/gcc -fcilkplus -O3 \
    -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard \
    -ffast-math -Wall dotprod.c bench.o -o dotprod-6.2
$ rm *.o
$ /usr/local/gcc-7.1/bin/gcc -fcilkplus -O3 \
    -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard \
    -ffast-math -Wall -c -o bench.o bench.c
$ /usr/local/gcc-7.1/bin/gcc -fcilkplus -O3 \
    -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard \
    -ffast-math -Wall dotprod.c bench.o -o dotprod-7.1
result in two executables with performance on the Raspberry Pi 3B given by

[Graph: benchmark results for dotprod-6.2 and dotprod-7.1 on the Raspberry Pi 3B]

The above graph indicates that the executable built with gcc version 7.1 runs slower. Note that the serial code runs at the same speed for both compiler versions. Theoretically, CILK_NWORKERS=1 should also run at about the same speed as the serial code. This is, in fact, the case with gcc version 6.2 and, for either compiler version, on the x86 architecture. Thus, the regression appears to come from newly introduced overhead in the Cilk parallel runtime on ARM processors. I don't know what has changed. Maybe the cactus stack is no longer aligned properly, or inefficient code for saving and restoring the floating-point state has been added. As I'm using Cilk to teach parallel processing in my classes, it would be wonderful if someone could look into what has happened.

For reference the C code for the above tests is given by dotprod.c

Code: Select all

#include "bench.h"
#define SINGLE

#ifdef SINGLE
# define Real float
# define Rerr 1e-7
#else
# define Real double
# define Rerr 1e-15
#endif

#ifndef N
# define N 10000000
#endif
Real x[N],y[N];

double rexact=1-1.0/(N+1);

double dpserial(int n,Real x[n],Real y[n]){
    double r=0;
    for(int k=n-1;k>=0;k--) r+=x[k]*y[k];
    return r;
}
double dpparallel(int n,Real x[n],Real y[n]){
    if(n<10000) return dpserial(n,x,y);
    int m=n/2;
    double r1,r2;
    r2=cilk_spawn dpparallel(n-m,&x[m],&y[m]);
    r1=dpparallel(m,x,y);
    cilk_sync;
    return r1+r2;
}
double dpdriver(int ncpu){
    double r,t;
    if(ncpu){
        tic();
        r=dpparallel(N,x,y);
        t=toc();
    } else {
        tic();
        r=dpserial(N,x,y);
        t=toc();
    }
    if(fabs(r-rexact)>Rerr){
        printf("Error!\n");
        exit(1);
    }
    return t;
}

int main(int argc,char *argv[]){
    for(int k=0;k<N;k++) x[k]=1.0/(k+1);
    for(int k=0;k<N;k++) y[k]=1.0/(k+2);
    printf("#%7s %21s %11s %9s\n",
        "workers","elapsed seconds","Mflops","speedup");
    bench(2e-6*N,dpdriver);
    return 0;
}
by bench.h

Code: Select all

#ifndef _BENCH_H
#define _BENCH_H

#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cilk/cilk.h>
#include <cilk/cilk_api.h>

typedef double (*func)();
extern double rexact;
extern void tic();
extern double toc();
extern void bench(double nops,func f);

#endif
and by bench.c

Code: Select all

#include "bench.h"
#include <stdarg.h>
#include <cilk/cilk_api.h>

static double tic_time;
void tic() {
    struct timeval tp;
    gettimeofday(&tp,0);
    tic_time=tp.tv_sec+tp.tv_usec*1e-6;
}
double toc() {
    struct timeval tp;
    gettimeofday(&tp,0);
    return tp.tv_sec+tp.tv_usec*1e-6-tic_time;
}

void bench(double nops,func f){
    int ncpu=__cilkrts_get_nworkers();
    double times[ncpu+1];
    for(int n=0;n<=ncpu;n++){
        __cilkrts_end_cilk();
        if(n){
            char buf[32];
            sprintf(buf,"%d",n);
            int r=__cilkrts_set_param("nworkers",buf);
            if(r!=__CILKRTS_SET_PARAM_SUCCESS){
                fprintf(stderr,"Error: unable to set nworkers to %d!\n",n);
                exit(1);
            }
            __cilkrts_init();
        }
        double tmin=100.0,ttot=0.0;
        for(int j=0;;j++){
            double t=f(n);
            tmin=(t<tmin)?t:tmin;
            ttot+=t;
            if(ttot>5 && j>3) break;
        }
        times[n]=tmin;
    }
    for(int n=0;n<=ncpu;n++){
        printf("%8d %21.14e %11.4e %9.5f\n",
            n,times[n],nops/times[n],times[0]/times[n]);
    }
}
Similar performance issues occur with my original dot product code as well as with any of the other parallel codes in this thread.
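To help narrow down where the extra time goes, a tiny microbenchmark along the following lines (my own sketch, not part of the tests above; the file name spawnbench.c and the SPAWNS constant are made up) times nothing but the spawn and sync machinery. Comparing it between the 6.2 and 7.1 builds with CILK_NWORKERS=1 should show whether the per-spawn cost of the runtime itself has grown on ARM.

Code: Select all

/* spawnbench.c -- measure the cost of cilk_spawn/cilk_sync using a task
   that does essentially no work; compile with -fcilkplus -O3 as above */
#include <stdio.h>
#include <sys/time.h>
#include <cilk/cilk.h>

#define SPAWNS 1000000

static double now(void){
    struct timeval tp;
    gettimeofday(&tp,0);
    return tp.tv_sec+tp.tv_usec*1e-6;
}

static volatile long sink;          /* keeps the empty task from being optimized away */
static void task(long k){ sink=k; }

int main(void){
    double t0=now();
    for(long k=0;k<SPAWNS;k++){
        cilk_spawn task(k);         /* one spawn ... */
        cilk_sync;                  /* ... and one sync per iteration */
    }
    double t=now()-t0;
    printf("%d spawns in %g seconds, about %g ns per spawn/sync pair\n",
        SPAWNS,t,1e9*t/SPAWNS);
    return 0;
}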

ejolson
Posts: 1007
Joined: Tue Mar 18, 2014 11:47 am

Re: Cilkplus on RPi2B

Sat Aug 12, 2017 11:49 pm

I just filed a bug report at the gcc bugzilla site about this performance regression. The reply I received indicated that the MIT/Cilkplus parallel processing extensions to the C programming language, which were added to gcc version 4.9 in 2013, have been deprecated and will be removed from gcc starting with version 8.0. Apparently there was not enough support or expertise to maintain the code that was originally contributed by Intel. Fortunately OpenMP now includes task-based scheduling similar to Cilkplus that will continue to be supported. As a result, conversion of the examples contained in this thread to OpenMP should be straightforward, though preliminary results indicate slightly reduced performance. I'll eventually start a new thread comparing the two.
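For anyone planning ahead, the conversion really is mechanical: cilk_spawn becomes an OpenMP task, cilk_sync becomes taskwait, and the whole computation runs inside a parallel region with a single task generator. Below is a rough sketch of how the recursive dot product from the previous post might look with OpenMP tasks (an illustration only, compiled with -fopenmp instead of -fcilkplus; the file name dotprod_omp.c is made up).

Code: Select all

/* dotprod_omp.c -- OpenMP task version of the recursive dot product */
#include <stdio.h>
#include <omp.h>

#define N 10000000
static float x[N],y[N];

static double dpserial(int n,const float x[],const float y[]){
    double r=0;
    for(int k=n-1;k>=0;k--) r+=x[k]*y[k];
    return r;
}

/* cilk_spawn -> omp task, cilk_sync -> omp taskwait */
static double dpparallel(int n,const float x[],const float y[]){
    if(n<10000) return dpserial(n,x,y);
    int m=n/2;
    double r1,r2;
    #pragma omp task shared(r2)
    r2=dpparallel(n-m,&x[m],&y[m]);
    r1=dpparallel(m,x,y);
    #pragma omp taskwait
    return r1+r2;
}

int main(void){
    for(int k=0;k<N;k++){ x[k]=1.0f/(k+1); y[k]=1.0f/(k+2); }
    double r=0,t0=omp_get_wtime();
    /* one thread generates the recursive task tree; the rest of the team
       executes tasks as they become available */
    #pragma omp parallel
    #pragma omp single
    r=dpparallel(N,x,y);
    printf("dot product %.15g computed in %g seconds\n",r,omp_get_wtime()-t0);
    return 0;
}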
