Heater
Posts: 18198
Joined: Tue Jul 17, 2012 3:02 pm

Re: Julia the Language

Wed May 05, 2021 4:30 am

Gavinmc42 wrote:
Tue May 04, 2021 11:07 pm
Was hoping Julia could run a bit faster.
Compared to what?

Julia is renowned for being damn fast compared to other interpreted language systems.
Memory in C++ is a leaky abstraction .

User avatar
Gavinmc42
Posts: 5633
Joined: Wed Aug 28, 2013 3:31 am

Re: Julia the Language

Wed May 05, 2021 6:27 am

Compared to Mathematica ;)
I'm dancing on Rainbows.
Raspberries are not Apples or Oranges

Heater
Posts: 18198
Joined: Tue Jul 17, 2012 3:02 pm

Re: Julia the Language

Wed May 05, 2021 12:05 pm

Gavinmc42 wrote:
Wed May 05, 2021 6:27 am
Compared to Mathematica ;)
There seems to be some debate about that around the internet. Not conclusive.

Being closed source Mathematic has zero performance around here.
Memory in C++ is a leaky abstraction .

ejolson
Posts: 7233
Joined: Tue Mar 18, 2014 11:47 am

Re: Julia the Language

Wed May 05, 2021 7:30 pm

Heater wrote:
Wed May 05, 2021 4:30 am
Gavinmc42 wrote:
Tue May 04, 2021 11:07 pm
Was hoping Julia could run a bit faster.
Compared to what?

Julia is renowned for being damn fast compared to other interpreted language systems.
Julia seems to be fast compared to other compiled language systems as well. Here are runs of Julia 1.6.0, GCC 10.2 and Go 1.16.3 when making fern fractals on the Raspberry Pi 4B using in 64-bit ARMv8 mode:

Code: Select all

$ JULIA_NUM_THREADS=4 julia barnfast.jl # Julia 1.6.0
barnfast.jl -- Compute Barnsley's Fern (JULIA_NUM_THREADS=4)

Resolution: 3108 x 6292
Iterations: 800000000
Blk Pixels: 4826441

Iteration rate is 2.868098887441608e7 per second.
Total execution time 27.893041049 seconds.
$ ./ompfast # C version 10.2
ompfast.go -- Compute Barnsley's Fern (OMP_NUM_THREADS=4)

Resolution: 3108 x 6292
Iterations: 800000000
Blk Pixels: 4826441

Iteration rate is 2.86507e+07 per second.
Total execution time 27.9225 seconds.
$ ./fernfast # Go version 1.16.3
fernfast.go -- Compute Barnsley's Fern (GOMAXPROCS=4)

Resolution: 3108 x 6292
Iterations: 800000000
Blk Pixels: 4826441

Iteration rate is 2.6789075907738313e+07 per second.
Total execution time 29.862918853759766 seconds.
All languages were within 10 percent with Julia being the fastest, C second fastest and Go in last place. Strangely, on x86 Julia and C are again similar in speed but Go four times slower. Although Julia has a slight advantage because the JIT is not counted in the runtime and the cache is still warm after the JIT, my intention was to make all three codes literal translations of each other as much as possible.

The Julia program barnfast.jl is

Code: Select all

#=  barnfast.jl -- Compute the Barnsley Fern

    Modified to use the middle-square Weyl-sequence random number
    generator described in https://arxiv.org/abs/1704.00358
=#

using StaticArrays
import Base.Threads.@spawn

# const N=8000000
const N=800000000

const A=[SA[0.85 0.04; -0.04 0.85],
    SA[0.20 -0.26; 0.23 0.22],
    SA[-0.15 0.28; 0.26 0.24],
    SA[0.00 0.00; 0.00  0.16]]

const B=[SA[0.00, 1.60],
    SA[0.00, 1.60],
    SA[0.00, 0.44],
    SA[0.00, 0.00]]

const P=[0.85, 0.07, 0.07, Inf]
const cdf=[UInt32(floor(min(sum(P[1:i])*(1<<32),
    Float64(0xFFFFFFFF)))) for i=1:4]

f(i,x)=A[i]*x+B[i]
i(p)=findfirst(x->p<=x,cdf)

const xmin=SA[-2.1820,0]
const xmax=SA[2.6558,9.9983]
const border=0.1
# const scale=100.0
const scale=617.0

const nu=Integer.(floor.(scale*(xmax-xmin.+2*border)))
const image=zeros(Int8,nu...);

function plot()
    count=0
    open("fern.pnm","w") do io
        println(io,"P4")
        println(io,size(image)[1]," ",size(image)[2])
        row=zeros(UInt8,(size(image)[1]+7)÷8)
        for iy=size(image)[2]:-1:1
            rx=1; rb=UInt8(0)
            ib=UInt8(128)
            for ix=1:size(image)[1]
                if image[ix,iy]!=0
                    rb|=ib; count+=1
                end
                ib>>=1
                if ib==0
                    row[rx]=rb; ib=UInt8(128)
                    rb=UInt8(0); rx+=1
                end
            end
            if ib!=0
                row[rx]=rb
            end
            write(io,row)
        end
    end
    return count
end

function point(x)
    i1=Integer(floor(scale*(x[1]-xmin[1]+border)))
    i2=Integer(floor(scale*(x[2]-xmin[2]+border)))
    image[i1,i2]=1
end

mutable struct rstate
    x::UInt64; w::UInt64; s::UInt64
end
function rint32(p)
    p.x*=p.x; p.w+=p.s
    p.x+=p.w; p.x=(p.x>>32)|(p.x<<32)
    return UInt32(p.x&0xFFFFFFFF)
end
function rint64(p)
    r=UInt64(rint32(p))<<32
    return r|UInt64(rint32(p))
end
function rfloat(p)
    return Float64(rint32(p))/(1<<32)
end
const gs=rstate(0,0,0xb5ad4eceda1ce2a9)
function rseed()
    x=rint64(gs); w=rint64(gs); s=rint64(gs)|1
    return rstate(x,w,s)
end

function work(p,jmax,c)
    xn=SA[0.0,0.0]
    point(xn)
    for j=1:jmax
        xn=f(i(rint32(p)),xn)
        point(xn)
    end
    put!(c,0)
end

function main()
    ncpu=Threads.nthreads()
    println("barnfast.jl -- Compute Barnsley's Fern ",
        "(JULIA_NUM_THREADS=$ncpu)")
    println("\nResolution: ",size(image)[1]," x ",
        size(image)[2],"\nIterations: ",N)
    ret=Channel{Int}(ncpu)
    p=Vector{rstate}(undef,ncpu)
    for n=2:ncpu
        p[n]=rseed()
        @spawn work(p[n],N÷ncpu,ret)
    end
    p[1]=rseed()
    work(p[1],N÷ncpu,ret)
    for n=1:ncpu
        take!(ret)
    end
    println("Blk Pixels: ",plot())
end

tsec=@elapsed main()
println("\nIteration rate is ",N/tsec," per second.")
println("Total execution time ",tsec," seconds.")
The Go code is

Code: Select all

/*  fernfast.go -- Compute the Barnsley Fern

    Modified to use the middle-square Weyl-sequence random number
    generator described in https://arxiv.org/abs/1704.00358
 */

package main

import ("fmt"; "os"; "bufio"; "runtime"; "time")

const N=800000000

var tictime float64
func tic() {
    now:=time.Now()
    tictime=float64(now.Unix())+1.0E-9*float64(now.Nanosecond())
}
func toc() float64 {
    now:=time.Now()
    return float64(now.Unix())+1.0E-9*float64(now.Nanosecond())-tictime
}

var A=[4][2][2]float64{
    {{0.85, 0.04}, {-0.04, 0.85}},
    {{0.20, -0.26}, {0.23, 0.22}},
    {{-0.15, 0.28}, {0.26, 0.24}},
    {{0.00, 0.00}, {0.00, 0.16}}}

var B=[4][2]float64{
    {0.00, 1.60},
    {0.00, 1.60},
    {0.00, 0.44},
    {0.00, 0.00}}

var P=[4]float64{0.85, 0.07, 0.07, 0.01 }
var cdf [3]uint32

func sum(p []float64)float64 {
    r:=0.0
    for i:=0; i<len(p); i++ { r+=p[i] }
    return r
}

func f(i int, x [2]float64) [2]float64 {
    var b [2]float64
    for j:=0; j<2; j++ {
        b[j]=B[i][j]
        for k:=0; k<2; k++ {
            b[j]+=A[i][j][k]*x[k]
        }
    }
    return b
}

func i(p uint32) int {
    for j:=0; j<3; j++ {
        if p<cdf[j] {
            return j
        }
    }
    return 3
}

var xmin=[2]float64{-2.1820,0}
var xmax=[2]float64{2.6558,9.9983}
const border=0.1

const scale=617.0
var image [][]byte

func doinit() {
    for i:=0; i<3; i++ {
        cdf[i]=uint32(sum(P[0:i+1])*(1<<32))
    }
    var nu [2]int
    for i:=0; i<2; i++ {
        nu[i]=int(scale*(xmax[i]-xmin[i]+2*border))
    }
    image=make([][]byte,nu[1])
    for i:=0; i<nu[1]; i++ {
        image[i]=make([]byte,nu[0])
    }
}

func plot() uint64 {
    count:=uint64(0)
    io,_:=os.Create("fern.pnm")
    fp:=bufio.NewWriter(io)
    fmt.Fprintf(fp,"P4\n")
    fmt.Fprintf(fp,"%d %d\n",len(image[0]),len(image))
    row:=make([]byte,(len(image[0])+7)/8)
    for iy:=len(image)-1; iy>=0; iy-- {
        rx:=0; rb:=byte(0)
        ib:=byte(128)
        for ix:=0; ix<len(image[0]); ix++ {
            if image[iy][ix]!=0 { 
                rb|=ib; count++
            }
            ib>>=1
            if ib==0 {
                row[rx]=rb; ib=128
                rb=0; rx++
            }
        }
        if ib!=0 { row[rx]=rb }
        fp.Write(row)
    }
    fp.Flush()
    io.Close()
    return count
}

func point(x [2]float64) {
    var coord [2]int
    for i:=0; i<2; i++ {
        coord[i]=int(scale*(x[i]-xmin[i]+border))
    }
    image[coord[1]][coord[0]]=1
}

type rstate struct {
    x,w,s uint64
}
func (p *rstate)rint32() uint32 {
    p.x*=p.x; p.w+=p.s
    p.x+=p.w; p.x=(p.x>>32)|(p.x<<32)
    return uint32(p.x)
}
func (p *rstate)rint64() uint64 {
    r:=uint64(p.rint32())<<32
    return r|uint64(p.rint32())
}
var gs=rstate{0,0,0xb5ad4eceda1ce2a9}
func rseed() rstate {
    var p rstate
    p.x=gs.rint64(); p.w=gs.rint64()
    p.s=gs.rint64()|1
    return p
}

func work(p *rstate,jmax int,c chan int){
    var xn=[2]float64{0.0,0.0}
    point(xn)
    for j:=0; j<jmax; j++ {
        xn=f(i(p.rint32()),xn)
        point(xn)
    }
    c<-0
}

func main(){
    tic()
    ncpu:=runtime.GOMAXPROCS(0)
    doinit()
    fmt.Printf("fernfast.go -- Compute Barnsley's Fern"+
        " (GOMAXPROCS=%d)\n",ncpu)
    fmt.Printf("\nResolution: %d x %d\nIterations: %d\n",
        len(image[0]),len(image),N)
    ret:=make(chan int,ncpu)
    for n:=1;n<ncpu;n++ {
        p:=rseed()
        go work(&p,N/ncpu,ret)
    }
    p:=rseed()
    work(&p,N/ncpu,ret)
    for n:=0;n<ncpu;n++ { <-ret }
    fmt.Printf("Blk Pixels: %d\n",plot())
    tsec:=toc()
    fmt.Printf("\nIteration rate is %g per second.\n",N/tsec)
    fmt.Printf("Total execution time %g seconds.\n",tsec)
    os.Exit(0)
}
And the C code is

Code: Select all

/*  ompfast.c -- Compute the Barnsley Fern

    This is a translation of fernfast.go see also barnfast.jl
 */

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sched.h>
#include "parallel.h"

const int N=800000000;

#ifdef CLOCK_MONOTONIC_RAW
static struct timespec tic_start;
void tic(){
    clock_gettime(CLOCK_MONOTONIC_RAW,&tic_start);
}
double toc(){
    struct timespec tic_stop;
    clock_gettime(CLOCK_MONOTONIC_RAW,&tic_stop);
    double sec=tic_stop.tv_sec-tic_start.tv_sec;
    return sec+(tic_stop.tv_nsec-tic_start.tv_nsec)*1.0e-9;
}
#else
static struct timeval tic_start;
void tic(){
    gettimeofday(&tic_start,0);
}
double toc(){
    struct timeval tic_stop;
    gettimeofday(&tic_stop,0);
    double sec=tic_stop.tv_sec-tic_start.tv_sec;
    return sec+(tic_stop.tv_usec-tic_start.tv_usec)*1.0e-6;
}
#endif

double A[4][2][2]={
    {{0.85, 0.04}, {-0.04, 0.85}},
    {{0.20, -0.26}, {0.23, 0.22}},
    {{-0.15, 0.28}, {0.26, 0.24}},
    {{0.00, 0.00}, {0.00, 0.16}}};

double B[4][2]={
    {0.00, 1.60},
    {0.00, 1.60},
    {0.00, 0.44},
    {0.00, 0.00}};

double P[4]={0.85, 0.07, 0.07, 0.01 };
uint32_t cdf[3];
typedef struct {
    double x[2];
} vector;

double sum(int n,double p[]){
    double r=0.0;
    for(int i=0;i<n;i++){ r+=p[i]; }
    return r;
}
vector f(int i, vector x){
    vector b;
    for(int j=0;j<2;j++){
        b.x[j]=B[i][j];
        for(int k=0;k<2;k++){
            b.x[j]+=A[i][j][k]*x.x[k];
        }
    }
    return b;
}
int ind(uint32_t p){
    for(int j=0;j<3;j++){
        if(p<cdf[j]){
            return j;
        }
    }
    return 3;
}

double xmin[2]={-2.1820,0};
double xmax[2]={2.6558,9.9983};
const double border=0.1;

const double scale=617.0;
uint8_t **image;
int nu[2];

void doinit(){
    for(int i=0;i<3;i++){
        cdf[i]=(uint32_t)(sum(i+1,P)*((uint64_t)1<<32));
    }
    for(int i=0;i<2;i++){
        nu[i]=scale*(xmax[i]-xmin[i]+2*border);
    }
    image=malloc(sizeof(uint8_t *)*nu[1]);
    for(int i=0;i<nu[1];i++){
        image[i]=malloc(sizeof(uint8_t)*nu[0]);
    }
}

uint64_t plot(){
    uint64_t count=0;
    FILE *fp=fopen("fern.pnm","w");
    fprintf(fp,"P4\n");
    fprintf(fp,"%d %d\n",nu[0],nu[1]);
    int rmax=(nu[0]+7)/8;
    uint8_t row[rmax];
    for(int iy=nu[1]-1;iy>=0;iy--){
        int rx=0; uint8_t rb=0;
        uint8_t ib=128;
        for(int ix=0;ix<nu[0];ix++){
            if(image[iy][ix]!=0){ 
                rb|=ib; count++;
            }
            ib>>=1;
            if(ib==0){
                row[rx]=rb; ib=128;
                rb=0; rx++;
            }
        }
        if(ib!=0) row[rx]=rb;
        fwrite(row,rmax,1,fp);
    }
    fclose(fp);
    return count;
}

void point(vector x){
    int coord[2];
    for(int i=0;i<2;i++){
        coord[i]=scale*(x.x[i]-xmin[i]+border);
    }
    image[coord[1]][coord[0]]=1;
}

typedef struct {
    uint64_t x,w,s;
} rstate;

uint32_t rint32(rstate *p){
    p->x*=p->x; p->w+=p->s;
    p->x+=p->w; p->x=(p->x>>32)|(p->x<<32);
    return (uint32_t)p->x;
}
uint64_t rint64(rstate *p){
    uint64_t r=(uint64_t)rint32(p)<<32;
    return r|rint32(p);
}
rstate gs={0,0,0xb5ad4eceda1ce2a9};
rstate rseed(){
    rstate p;
    p.x=rint64(&gs); p.w=rint64(&gs);
    p.s=rint64(&gs)|1;
    return p;
}

void work(rstate *p,int jmax){
    vector xn={{0.0,0.0}};
    point(xn);
    for(int j=0;j<jmax;j++){
        xn=f(ind(rint32(p)),xn);
        point(xn);
    }
}

int main(){
    tic();
    int ncpu=par_ncpu();
    doinit();
    printf("ompfast.c -- Compute Barnsley's Fern"
        " (OMP_NUM_THREADS=%d)\n",ncpu);
    printf("\nResolution: %d x %d\nIterations: %d\n",
        nu[0],nu[1],N);
    par_workers(ncpu){
        for(int n=1;n<ncpu;n++){
            rstate p=rseed();
            par_spawn(,firstprivate(n,p)) work(&p,N/ncpu);
        }
        rstate p=rseed();
        work(&p,N/ncpu);
        par_sync;
    }
    printf("Blk Pixels: %d\n",plot());
    double tsec=toc();
    printf("\nIteration rate is %g per second.\n",N/tsec);
    printf("Total execution time %g seconds.\n",tsec);
    return 0;
}
Given the performance, it looks like the Raspberry Pi is a good match for Julia. I wonder how Free Pascal or Rust would compare?
Last edited by ejolson on Thu May 06, 2021 6:55 am, edited 2 times in total.

ejolson
Posts: 7233
Joined: Tue Mar 18, 2014 11:47 am

Re: Julia the Language

Wed May 05, 2021 7:49 pm

ejolson wrote:
Wed May 05, 2021 7:30 pm
All languages were within 10 percent with Julia being the fastest, C second fastest and Go in last place.
In case anyone is wanting to try this same comparison on their own computer, here is the parallel.h header file that I forgot to include in the previous post.

Code: Select all

#ifndef _PARALLEL_H
#define _PARALLEL_H

/*  These rouines provide compatibility between Cilk/Tapir and OpenMP
    parallel programming support for the C programming language.

The most tricky and irritating wrapper is par_spawn(X,Y) which has to
overcome both syntactical and semantic differences.  An example usage
is given in the parallel loop

    for(int n=0;n<Slice;n++){
        int jmin,jmax;
        jmin=round8(imax+1+(int64_t)n*(Limit-imax)/Slice);
        if(n==Slice-1) jmax=limit;
        else jmax=round8(imax+1+(int64_t)(n+1)*(Limit-imax)/Slice)-1;
        par_spawn(C[n]=,firstprivate(n,imax,jmin,jmax))
            setrange(imax,jmin,jmax);
    }
    par_sync;

Note the variables n, imax, jmin and jmax are declared firstprivate in
OpenMMP even though they are immediately passed by value to setrange.
This is because, setrange may not push it's arguments onto the stack
before the loop iterates to queue work for another thread.  The local
variable n is used to index the array for the return value and for 
the same reason needs to be firstprivate.  All of this complication
is because OpenMP also needs to handle Fortran which doesn't have the
same kind of scoping rules as C.  */

#ifdef _OPENMP
#  include <omp.h>
#  define do_pragma(Z) _Pragma(#Z)
#  define par_spawn(X,Y) do_pragma(omp task default(shared) Y) X
#  define par_for _Pragma("omp taskloop default(shared)") for
#  define par_sync _Pragma("omp taskwait")
#  define par_ncpu(X) omp_get_max_threads(X)
#  define par_workers(X) \
    if(X) omp_set_num_threads(X); \
    else omp_set_num_threads(1); \
    _Pragma("omp parallel") \
    _Pragma("omp single")
#else
#  ifdef _CILK
#    include <cilk/cilk.h>
#    include <cilk/cilk_api.h>
#    define par_spawn(X,Y) X cilk_spawn
#    define par_sync cilk_sync
#    define par_ncpu(X) __cilkrts_get_nworkers(X)
#    define par_workers(X) \
    __cilkrts_end_cilk(); \
    if(X){ \
        char buf[32]; \
        sprintf(buf,"%d",X); \
        int r=__cilkrts_set_param("nworkers",buf); \
        if(r!=__CILKRTS_SET_PARAM_SUCCESS){ \
            fprintf(stderr,"Error: unable to set nworkers to %d!\n",X); \
            exit(1); \
        } \
        __cilkrts_init(); \
    }
#  else
#    define par_spawn(X,Y) X
#    define par_for for
#    define par_sync
#    define par_ncpu(X) 1
#    define par_workers(X)
#  endif
#endif

#endif
Place this file together with the ompfast.c source code and type

Code: Select all

$ gcc -O3 -fopenmp -o ompfast ompfast.c
to build the C binary.
Last edited by ejolson on Thu May 06, 2021 8:04 am, edited 1 time in total.

ejolson
Posts: 7233
Joined: Tue Mar 18, 2014 11:47 am

Re: Julia the Language

Thu May 06, 2021 7:13 am

ejolson wrote:
Wed May 05, 2021 7:30 pm
Given the performance, it looks like the Raspberry Pi is a good match for Julia. I wonder how Free Pascal or Rust would compare?
Here is a run on the Pi 4B using Free Pascal.

Code: Select all

$ ./fpfast # Free Pascal version 3.0.4+dfsg-22
fpfast.pas -- Compute Barnsley's Fern (FPC_THREADS=4)

Resolution: 3108 x 6292
Iterations: 800000000
Blk Pixels: 4826441

Iteration rate is 2.1270912765500072E+007 per second.
Total execution time 37.61004564 seconds.
Although generating an executable that's slightly slower, the Free Pascal code generator is completely independent of GCC, LLVM or the Plan 9 back end used in Go. Just like with potatoes, it's important to avoid a mono-culture with compilers so a single virus can't blight the entire crop.

For reference the Pascal program is

Code: Select all

(*  fpfast.c -- Compute the Barnsley Fern

    This is a translation of ompfast.c, fernfast.go and barnfast.jl
 *)

program fpfast;

{$mode objfpc}

uses cthreads,unixtype,linux,classes,sysutils,dos;
var tic_start:ttimespec;

procedure tic;
begin
    clock_gettime(CLOCK_MONOTONIC_RAW,@tic_start)
end;

function toc:double;
var tic_stop:ttimespec;
    sec:double;
begin
    clock_gettime(CLOCK_MONOTONIC_RAW,@tic_stop);
    sec:=tic_stop.tv_sec-tic_start.tv_sec;
    toc:=sec+(tic_stop.tv_nsec-tic_start.tv_nsec)*1.0e-9
end;

const ncap:int32=800000000;
const acap:array[1..4] of array[1..2] of array[1..2] of double=(
        ((0.85, 0.04), (-0.04, 0.85)),
        ((0.20, -0.26), (0.23, 0.22)),
        ((-0.15, 0.28), (0.26, 0.24)),
        ((0.00, 0.00), (0.00, 0.16)));
    bcap:array[1..4] of array[1..2] of double=(
        (0.00, 1.60),
        (0.00, 1.60),
        (0.00, 0.44),
        (0.00, 0.00));
    pcap:array[1..4] of double=(0.85, 0.07, 0.07, 0.01 );

var cdf:array[1..3] of uint32;
type vector=array[1..2] of double;

function sum(n:int32;
    var p:array of double):double;
var r:double;
    i:int32;
begin
    r:=0.0;
    for i:=0 to n-1 do r:=r+p[i];
    sum:=r
end;
function f(i:int32; var x:vector):vector;
var b:vector;
    j,k:int32;
begin
    for j:=1 to 2 do begin
        b[j]:=bcap[i][j];
        for k:=1 to 2 do begin
            b[j]:=b[j]+acap[i][j][k]*x[k]
        end
    end;
    f:=b
end;
function ind(p:uint32):int32;
var j:int32;
begin
    for j:=1 to 3 do begin
        if p<cdf[j] then begin
            ind:=j;
            exit
        end
    end;
    ind:=4
end;

const xmin:array[1..2] of double=(-2.1820,0);
    xmax:array[1..2] of double=(2.6558,9.9983);
    border:double=0.1;
    scale:double=617.0;

var image:array of array of uint8;
var nu:array[1..2] of int32=(0,0);

procedure doinit;
var i:int32;
begin
    for i:=1 to 3 do begin
        cdf[i]:=trunc(sum(i,pcap)*(uint64(1) shl 32))
    end;
    for i:=1 to 2 do begin
        nu[i]:=trunc(scale*(xmax[i]-xmin[i]+2*border))
    end;
    setlength(image,nu[2]);
    for i:=0 to nu[2]-1 do begin
        setlength(image[i],nu[1])
    end
end;

function plot:uint64;
var count:uint64;
    iy,ix,rmax,rx:int32;
    row:array of uint8;
    rb,ib:uint8;
    sbuf:string[80];
    fp:tfilestream;
begin
    count:=0;
    fp:=tfilestream.create('fern.pnm',fmcreate);
    sbuf:='P4'+chr(10)+inttostr(nu[1])+' '+inttostr(nu[2])+chr(10);
    fp.write(sbuf[1],length(sbuf));
    rmax:=(nu[1]+7) div 8;
    setlength(row,rmax);
    for iy:=nu[2]-1 downto 0 do begin
        rx:=0; rb:=0; ib:=128;
        for ix:=0 to nu[1]-1 do begin
            if image[iy][ix]<>0 then begin
                rb:=rb or ib; count:=count+1
            end;
            ib:=ib shr 1;
            if ib=0 then begin
                row[rx]:=rb; ib:=128;
                rb:=0; rx:=rx+1
            end
        end;
        if ib<>0 then row[rx]:=rb;
        fp.write(row[0],rmax)
    end;
    fp.free;
    plot:=count
end;

procedure point(var x:vector);
var i:int32;
    coord:array[1..2] of int32;
begin
    for i:=1 to 2 do begin
        coord[i]:=trunc(scale*(x[i]-xmin[i]+border));
    end;
    image[coord[2]][coord[1]]:=1
end;

type rstate=record
    x,w,s:uint64;
end;
function rint32(var p:rstate):uint32;
begin
    p.x:=p.x*p.x; p.w:=p.w+p.s;
    p.x:=p.x+p.w; p.x:=(p.x shr 32) or (p.x shl 32);
    rint32:=uint32(p.x)
end;
function rint64(var p:rstate):uint64;
var r:uint64;
begin
    r:=uint64(rint32(p)) shl 32;
    rint64:=r or rint32(p)
end;
var gs:rstate=(x:0; w:0; s:uint64($b5ad4eceda1ce2a9));
function rseed:rstate;
var p:rstate;
begin
    p.x:=rint64(gs); p.w:=rint64(gs);
    p.s:=rint64(gs) or 1;
    rseed:=p
end;

type wargs=record
    p:rstate;
    jmax:int32;
end;
var args:array of wargs;
    wfin:int32=0;    

function work(ap:pointer):ptrint;
var xn:vector;
    j:int32;
    a:^wargs;
begin
    a:=ap;
    xn[1]:=0.0; xn[2]:=0.0;
    point(xn);
    for j:=1 to a^.jmax do begin
        xn:=f(ind(rint32(a^.p)),xn);
        point(xn);
    end;
    interlockedincrement(wfin);
    work:=0
end;

var ncpu,n:int32;
    tsec:double;
begin
    tic();
    try
        ncpu:=strtoint(getenv('FPC_THREADS'))
    except
        ncpu:=4
    end;
    doinit;
    writeln('fpfast.pas -- Compute Barnsley''s Fern'+
        ' (FPC_THREADS=',ncpu,')');
    writeln;
    writeln('Resolution: ',nu[1],' x ',nu[2]);
    writeln('Iterations: ',ncap);
    setlength(args,ncpu);
    wfin:=-ncpu;
    for n:=1 to ncpu-1 do begin
        args[n].p:=rseed;
        args[n].jmax:=ncap div ncpu;
        beginthread(@work,pointer(@args[n]))
    end;
    args[0].p:=rseed;
    args[0].jmax:=ncap div ncpu;
    work(pointer(@args[0]));
    while wfin<0 do sleep(100);
    writeln('Blk Pixels: ',plot);
    tsec:=toc;
    writeln;
    writeln('Iteration rate is',ncap/tsec,' per second.');
    writeln('Total execution time ',tsec:0:8,' seconds.');
    halt(0)
end.
Note that this code launches the threads by hand and then explicitly synchronizes by waiting on a semaphore that was manually constructed using an atomic increment. There are more advanced techniques for parallel processing in Free Pascal, but I had trouble with the documentation.

User avatar
jahboater
Posts: 7034
Joined: Wed Feb 04, 2015 6:38 pm
Location: Wonderful West Dorset

Re: Julia the Language

Thu May 06, 2021 9:35 am

The julia code doesn't seem to cut and paste into the terminal window on my headlesss Pi4 64-bit.

The glypth's looking like division get removed. But if I insert / characters I get:

Code: Select all

Resolution: 3108 x 6292
Iterations: 800000000
ERROR: LoadError: MethodError: no method matching zeros(::Type{UInt8}, ::Float64)
Closest candidates are:
  zeros(::Type{T}, ::Union{Integer, AbstractUnitRange}...) where T at array.jl:499
  zeros(::Type{T}, ::Tuple{}) where T at array.jl:507
  zeros(::Type{T}, ::Tuple{Vararg{Integer, N}}) where {T, N} at array.jl:502
  ...
ejolson wrote:
Wed May 05, 2021 7:30 pm
Julia seems to be fast compared to other compiled language systems as well.
It does look that way.
However, the Julia run seems to take a very long time to get started, presumably doing some effective optimization.
Perhaps. when comparing languages, the actual execution time user see's is the most useful metric.

ejolson
Posts: 7233
Joined: Tue Mar 18, 2014 11:47 am

Re: Julia the Language

Thu May 06, 2021 3:45 pm

jahboater wrote:
Thu May 06, 2021 9:35 am
ejolson wrote:
Wed May 05, 2021 7:30 pm
Julia seems to be fast compared to other compiled language systems as well.
It does look that way.
However, the Julia run seems to take a very long time to get started, presumably doing some effective optimization.
Perhaps. when comparing languages, the actual execution time user see's is the most useful metric.
The way I see it there are two reasonable ways to use Julia:
  • Interactively from the read evaluate print loop. In this case the JIT has already warmed up, the libraries are loaded and there is no delay when running a routine.
    • Noninteractively through the batch scheduling system. In this case JIT warm up is small compared to waiting in the queue. Note that the wait has recently increased as Fido's simulations seem to be going into into an infinite loop based on the effect the recommendation made by the WTO had on the creation of new vaccines to combat the variants.
    Using bash as a programmable shell to run Julia scripts somehow misses the point that the Julia REPL was designed as a programmable shell to run Julia scripts. In fact, the tab completion, built-in editing of multi-line commands, automatically searched history, native unicode support and ability to execute bash scripts among others make the corresponding interactive features of bash look a little primitive.

    ejolson
    Posts: 7233
    Joined: Tue Mar 18, 2014 11:47 am

    Re: Julia the Language

    Thu May 06, 2021 4:45 pm

    jahboater wrote:
    Thu May 06, 2021 9:35 am
    The julia code doesn't seem to cut and paste into the terminal window on my headlesss Pi4 64-bit.

    The glypth's looking like division get removed. But if I insert / characters I get:

    Code: Select all

    Resolution: 3108 x 6292
    Iterations: 800000000
    ERROR: LoadError: MethodError: no method matching zeros(::Type{UInt8}, ::Float64)
    Closest candidates are:
      zeros(::Type{T}, ::Union{Integer, AbstractUnitRange}...) where T at array.jl:499
      zeros(::Type{T}, ::Tuple{}) where T at array.jl:507
      zeros(::Type{T}, ::Tuple{Vararg{Integer, N}}) where {T, N} at array.jl:502
      ...
    
    If using the Julia REPL from an xterm one needs to set two options:

    Code: Select all

    xterm*utf8: 1
    xterm*metaSendsEscape: true
    
    The first option enables the terminal to send and receive unicode characters. The most important of those is the ÷ character which can be entered in the REPL as \div followed by the tab key. Note that ÷ is integer division (like div in Pascal) whereas / always yields a floating point result. Thus, constants written as 1/2 in Julia are equal to 0.5 rather than truncated to 0.

    The second option allows entering multi-line commands (and inserting additional lines while editing a command or function from the history) by holding the alt key while pressing enter or return. If bash had something similar, the GNU readline library might then have to be called readparagraph instead.

    Hopefully, enabling these two options will allow you to cut and paste Julia code to the Pi and conveniently use the REPL.

    ejolson
    Posts: 7233
    Joined: Tue Mar 18, 2014 11:47 am

    Re: Julia the Language

    Thu May 06, 2021 6:15 pm

    ejolson wrote:
    Thu May 06, 2021 3:45 pm
    Note that the wait has recently increased as Fido's simulations seem to be going into into an infinite loop based on the effect the recommendation made by the WTO had on the creation of new vaccines to combat the variants.
    In response to my request to deploy a fair-share scheduling policy in the batch queue, the dog developer implemented a code review and approval process for even being allowed to submit jobs. Unfortunately, the fern programs were denied.

    According to the report, computing Barnsley ferns is extremely important; however, the present algorithm mostly tests how quickly the hardware bounces cache lines between threads and not the relative performance of programming languages. I pointed out that bouncing is better than an infinite loop caused by the demise of the infrastructure needed to develop new vaccines. This led to a moment of quiet followed by a loud noise that sounded like barking.

    Apparently, the cache-friendly way to compute fractals is to divide the image up into separate blocks the way images are rendered on a multi-GPU setup. Then iterate the same random sequence forward separately on each thread while recording only the pixels that fall in the part of the image that each thread is responsible for. As I need more exercise during quarantine anyway, I've decided to stick with the bouncing approach for now.

    Return to “Other programming languages”