User avatar
RichardRussell
Posts: 722
Joined: Thu Jun 21, 2012 10:48 am

Re: Julia the Language

Sat Apr 17, 2021 4:51 pm

jalih wrote:
Sat Apr 17, 2021 1:48 pm
Below is my version written in 8th programming language. I would love to compare performance...
In almost any programming language execution time for the fern fractal will be dominated by the actual pixel plotting (for example in BBC BASIC 94% of the time is spent in the plotting routines) so I'm not sure that tells you much about one language versus another.

Heater
Posts: 18013
Joined: Tue Jul 17, 2012 3:02 pm

Re: Julia the Language

Sat Apr 17, 2021 4:59 pm

Yes. I came back to say that likely the quickest way to do this is to draw the fern into some memory image and then get that whole thing displayed when it is done.
Memory in C++ is a leaky abstraction .

ejolson
Posts: 7094
Joined: Tue Mar 18, 2014 11:47 am

Re: Julia the Language

Sat Apr 17, 2021 5:00 pm

Heater wrote:
Sat Apr 17, 2021 4:42 pm
I presume it looks like the Barnsley Ferns : https://en.wikipedia.org/wiki/Barnsley_fern. Plenty of readable example codes there.

Had ours of fun with that back in the day on my Atari ST 520.
That sounds reasonable. However, right now the main problem is the purpose of the nip function in the Eighth code seems to have been painfully misunderstood by the dog developer.

jalih
Posts: 178
Joined: Mon Apr 15, 2019 3:54 pm

Re: Julia the Language

Sat Apr 17, 2021 6:10 pm

ejolson wrote:
Sat Apr 17, 2021 5:00 pm
Heater wrote:
Sat Apr 17, 2021 4:42 pm
I presume it looks like the Barnsley Ferns : https://en.wikipedia.org/wiki/Barnsley_fern. Plenty of readable example codes there.

Had ours of fun with that back in the day on my Atari ST 520.
That sounds reasonable. However, right now the main problem is the purpose of the nip function in the Eighth code seems to have been painfully misunderstood by the dog developer.
The 8th is a stack based language and purpose of nip is just to remove item after the top of the stack. You might need to download 8th word list, so you can look up stack effects of the words.

ejolson
Posts: 7094
Joined: Tue Mar 18, 2014 11:47 am

Re: Julia my Friend

Sat Apr 17, 2021 9:09 pm

jalih wrote:
Sat Apr 17, 2021 6:10 pm
ejolson wrote:
Sat Apr 17, 2021 5:00 pm
Heater wrote:
Sat Apr 17, 2021 4:42 pm
I presume it looks like the Barnsley Ferns : https://en.wikipedia.org/wiki/Barnsley_fern. Plenty of readable example codes there.

Had ours of fun with that back in the day on my Atari ST 520.
That sounds reasonable. However, right now the main problem is the purpose of the nip function in the Eighth code seems to have been painfully misunderstood by the dog developer.
The 8th is a stack based language and purpose of nip is just to remove item after the top of the stack. You might need to download 8th word list, so you can look up stack effects of the words.
Here is a fern program in Julia

Code: Select all

#  barnsley.jl -- Compute the Barnsley Fern 

const N=8000000

const A=[[0.85 0.04; -0.04 0.85],
    [0.20 -0.26; 0.23 0.22],
    [-0.15 0.28; 0.26 0.24],
    [0.00 0.00; 0.00  0.16]]

const B=[[0.00, 1.60],
    [0.00, 1.60],
    [0.00, 0.44],
    [0.00, 0.00]]

const P=[0.85, 0.07, 0.07, Inf]
const cdf=[sum(P[1:i]) for i=1:4]

f(i,x)=A[i]*x+B[i]
i(p)=findfirst(x->p<=x,cdf)

const xmin=[-2.1820,0]
const xmax=[2.6558,9.9983]
const border=0.1
const scale=100.0

const nu=Integer.(floor.(scale*(xmax-xmin.+2*border)))
const image=zeros(Int8,nu...);

function plot()
    open("fern.pnm","w") do io
        println(io,"P1")
        println(io,size(image)[1]," ",size(image)[2])
        for iy=size(image)[2]:-1:1
            for ix=1:size(image)[1]
                print(io,image[ix,iy]," ")
            end
            println(io)
        end
    end
end

function point(x)
    coord=Integer.(floor.(scale*(x-xmin.+border)))
    image[coord...]=1
end

function main()
    println("barnsley.jl -- Compute Barnsley's Fern")
    xn=zeros(2)
    point(xn)
    for j=1:N
        xn=f(i(rand(1)[1]),xn)
        point(xn)
    end
    plot()
end

tsec=@elapsed main()
println("\nIteration rate is ",N/tsec," per second.")
println("Total execution time ",tsec," seconds.")
The Julia 1.6.0 Singularity container running on the 700 MHz ARMv6-based super-cheap cluster yields the output

Code: Select all

barnsley.jl -- Compute Barnsley's Fern

Iteration rate is 31944.5234331074 per second.
Total execution time 250.434163363 seconds.

real    5m6.408s
user    4m48.460s
sys 0m13.240s
and the image

Image

According to the super pet, speeds are a couple orders of magnitude faster on a Pi 4B running the official Julia 1.6.0 binary in 64-bit mode. I decided to not ask for details since Fido is still having problems with that nip routine.

User avatar
jahboater
Posts: 6940
Joined: Wed Feb 04, 2015 6:38 pm
Location: Wonderful West Dorset

Re: Julia my Friend

Sat Apr 17, 2021 11:07 pm

ejolson wrote:
Sat Apr 17, 2021 9:09 pm
According to the super pet, speeds are a couple orders of magnitude faster on a Pi 4B running the official Julia 1.6.0 binary in 64-bit mode. I decided to not ask for details since Fido is still having problems with that nip routine.

Code: Select all

$ time julia barnsley.jl 
barnsley.jl -- Compute Barnsley's Fern

Iteration rate is 668458.3636382727 per second.
Total execution time 11.967835897 seconds.

real	0m15.073s
user	0m14.607s
sys	0m0.698s

User avatar
Gavinmc42
Posts: 5482
Joined: Wed Aug 28, 2013 3:31 am

Re: Julia my Friend

Sat Apr 17, 2021 11:52 pm

Pi4 32bit, Julia 1.0.3

Code: Select all

barnsley.jl -- Compute Barnsley's Fern

Iteration rate is 484000.0370661748 per second.
Total execution time 16.528924354 seconds.
Seems like 1.6 and 64bit makes a difference.
Is there a 64 bit 1.0.3 in the Ras Debian repo?
I'm dancing on Rainbows.
Raspberries are not Apples or Oranges

ejolson
Posts: 7094
Joined: Tue Mar 18, 2014 11:47 am

Re: Julia my Friend

Sat Apr 17, 2021 11:58 pm

jahboater wrote:
Sat Apr 17, 2021 11:07 pm

Code: Select all

$ time julia barnsley.jl 
barnsley.jl -- Compute Barnsley's Fern

Iteration rate is 668458.3636382727 per second.
Total execution time 11.967835897 seconds.

real	0m15.073s
user	0m14.607s
sys	0m0.698s
Is it possible the 64-bit timing above was done with overclock settings? If so, what was the overclock?

User avatar
jahboater
Posts: 6940
Joined: Wed Feb 04, 2015 6:38 pm
Location: Wonderful West Dorset

Re: Julia my Friend

Sun Apr 18, 2021 1:01 am

ejolson wrote:
Sat Apr 17, 2021 11:58 pm
jahboater wrote:
Sat Apr 17, 2021 11:07 pm

Code: Select all

$ time julia barnsley.jl 
barnsley.jl -- Compute Barnsley's Fern

Iteration rate is 668458.3636382727 per second.
Total execution time 11.967835897 seconds.

real	0m15.073s
user	0m14.607s
sys	0m0.698s
Is it possible the 64-bit timing above was done with overclock settings? If so, what was the overclock?
Sorry, yes, 600MHz over (2.1GHz)
This is the stock speed figure:

Code: Select all

$ time julia barnsley.jl 
barnsley.jl -- Compute Barnsley's Fern

Iteration rate is 508782.90004524693 per second.
Total execution time 15.723798892 seconds.

real	0m19.646s
user	0m19.198s
sys	0m0.674s

ejolson
Posts: 7094
Joined: Tue Mar 18, 2014 11:47 am

Re: Julia my Friend

Sun Apr 18, 2021 2:27 am

Gavinmc42 wrote:
Sat Apr 17, 2021 11:52 pm
Pi4 32bit, Julia 1.0.3

Code: Select all

barnsley.jl -- Compute Barnsley's Fern

Iteration rate is 484000.0370661748 per second.
Total execution time 16.528924354 seconds.
Seems like 1.6 and 64bit makes a difference.
Is there a 64 bit 1.0.3 in the Ras Debian repo?
It looks like Julia 1.0.3 as included with 32-bit Raspberry Pi OS grows ferns a bit faster than the Singularity container I made.

In particular,

Code: Select all

$ time singularity run --bind `pwd` ~/lib/julia160.sif barnsley.jl 
barnsley.jl -- Compute Barnsley's Fern

Iteration rate is 429434.6482249083 per second.
Total execution time 18.629144232 seconds.

real    0m25.469s
user    0m22.073s
sys 0m3.118s
indicates my container is about 11 percent slower on a Pi 4B.

I wonder if this is related to overhead from Singularity, my ARMv6 compatible build process or something else. It would be interesting to create an ARMv7 container with Julia 1.6.0 using the same technique and compare results. It would be even better to figure out why a native build on Raspberry Pi OS segfaults.

User avatar
Gavinmc42
Posts: 5482
Joined: Wed Aug 28, 2013 3:31 am

Re: Julia my Friend

Sun Apr 18, 2021 4:05 am

I should have said Pi400 not Pi4.
1.8GHz verses 1.5GHz
It would be even better to figure out why a native build on Raspberry Pi OS segfaults.
That would be nice, seg fault is common for some of the stuff I compile from source.
Mostly because I don't know enough to optimise cmake/make files?

I do lots of code prototyping and testing GPIO/i2C sensors with Python.
It will be interesting to try that with Julia.
Need to set up Geany to use Julia as that is my default IDE.

That fern example looks like it might port easy to Free Pascal, I want to compare speed to fpc.
Rosetta code got jl/pas code?
I'm dancing on Rainbows.
Raspberries are not Apples or Oranges

ejolson
Posts: 7094
Joined: Tue Mar 18, 2014 11:47 am

Re: Julia my Friend

Sun Apr 18, 2021 4:14 am

Gavinmc42 wrote:
Sun Apr 18, 2021 4:05 am
That fern example looks like it might port easy to Free Pascal, I want to compare speed to fpc.
Here is a translation of the Julia fern code into the Go language.

Code: Select all

/*  barnsley.go -- Compute the Barnsley Fern */

package main

import ("fmt"; "os"; "time"; "math/rand")

const N=8000000

var tictime float64
func tic() {
    now:=time.Now()
    tictime=float64(now.Unix())+1.0E-9*float64(now.Nanosecond())
}
func toc() float64 {
    now:=time.Now()
    return float64(now.Unix())+1.0E-9*float64(now.Nanosecond())-tictime
}

var A=[4][2][2]float64{
    {{0.85, 0.04}, {-0.04, 0.85}},
    {{0.20, -0.26}, {0.23, 0.22}},
    {{-0.15, 0.28}, {0.26, 0.24}},
    {{0.00, 0.00}, {0.00, 0.16}}}

var B=[4][2]float64{
    {0.00, 1.60},
    {0.00, 1.60},
    {0.00, 0.44},
    {0.00, 0.00}}

var P=[4]float64{0.85, 0.07, 0.07, 0.01 }
var cdf [4]float64

func sum(p []float64)float64 {
    r:=0.0
    for i:=0; i<len(p); i++ { r+=p[i] }
    return r
}

func f(i int, x [2]float64) [2]float64 {
    var b [2]float64
    for j:=0; j<2; j++ {
        b[j]=B[i][j]
        for k:=0; k<2; k++ {
            b[j]+=A[i][j][k]*x[k]
        }
    }
    return b
}

func i(p float64) int {
    for j:=0; j<3; j++ {
        if p<=cdf[j] {
            return j
        }
    }
    return 3
}

var xmin=[2]float64{-2.1820,0}
var xmax=[2]float64{2.6558,9.9983}
const border=0.1

const scale=100.0
var image [][]byte

func doinit() {
    for i:=0; i<4; i++ {
        cdf[i]=sum(P[0:i+1])
    }
    var nu [2]int
    for i:=0; i<2; i++ {
        nu[i]=int(scale*(xmax[i]-xmin[i]+2*border))
    }
    image=make([][]byte,nu[1])
    for i:=0; i<nu[1]; i++ {
        image[i]=make([]byte,nu[0])
    }
}

func plot() {
    io,_:=os.Create("fern.pnm")
    fmt.Fprintf(io,"P1\n")
    fmt.Fprintf(io,"%d %d\n",len(image[0]),len(image))
    for iy:=len(image)-1; iy>=0; iy-- {
        for ix:=0; ix<len(image[0]); ix++ {
            fmt.Fprintf(io,"%d ",image[iy][ix])
        }
        fmt.Fprintf(io,"\n")
    }
    io.Close()
}

func point(x [2]float64) {
    var coord [2]int
    for i:=0; i<2; i++ {
        coord[i]=int(scale*(x[i]-xmin[i]+border))
    }
    image[coord[1]][coord[0]]=1
}

func main(){
    tic()
    doinit()
    fmt.Printf("barnsley.go -- Compute Barnsley's Fern\n")
    var xn=[2]float64{0.0,0.0}
    point(xn)
    for j:=0; j<N; j++ {
        xn=f(i(rand.Float64()),xn)
        point(xn)
    }
    plot()
    tsec:=toc()
    fmt.Printf("\nIteration rate is %g per second.\n",N/tsec)
    fmt.Printf("Total execution time %g seconds.\n",tsec)
    os.Exit(0)
}
That might be easier to convert to Pascal.

On a Pi 4B the runtime is

Code: Select all

$ time ./barnsley 
barnsley.go -- Compute Barnsley's Fern

Iteration rate is 1.4567353563057869e+06 per second.
Total execution time 5.491731882095337 seconds.

real    0m5.503s
user    0m3.050s
sys     0m2.517s
which like the energy-conserving IRK scheme, is more than double the speed of the original Julia code. That blue Gopher with those super-short legs is again running faster than Julia. Given the mascot

Image

I suspect Free Pascal will be faster as well!

As I've already put the const keyword into the Julia code, it's possible the performance difference can attributed to the speed (and quality) of the built-in random number generator. It's also possible Julia is allocating and freeing huge quantities of two vectors as it randomly iterates through the fractal.

It seems difficult to have Julia as a friend.

User avatar
Gavinmc42
Posts: 5482
Joined: Wed Aug 28, 2013 3:31 am

Re: Julia my Friend

Sun Apr 18, 2021 6:09 am

I gave up waiting for the Python version to finish plotting.
Let's just say that it's very slow.

Julia is nothing like Pascal, well not close enough for me to translate.
Crashed and burned :oops: ..
There is Delphi fern code, that might be easier.
Free Pascal might be a bit slower than C but not enough to worry my simple uses for it.

Julia turns out to be faster than I expected, after compiling ;)
Like LuaJit? Cpython? but with useful big boy math?
Hmm, have not tried Luajit of Pi4.

What else does the fern?
Nim, hmm I really should spend more time with Nim too.
I'm dancing on Rainbows.
Raspberries are not Apples or Oranges

User avatar
RichardRussell
Posts: 722
Joined: Thu Jun 21, 2012 10:48 am

Re: Julia my Friend

Sun Apr 18, 2021 10:06 am

Do these languages have profilers? I'd love to see how the time is split between the calculations and the plotting. 8,000,000 iterations is overkill in my opinion, the fern graphic is visually unchanging well before that.

Of course programs which build a bitmap in memory and then display it in one hit will take a very different time from those which plot individual pixels (often plotting the same pixel multiple times). But isn't the whole point of the fern fractal to see it 'grow'? Having to wait before anything appears seems quite silly.

Heater
Posts: 18013
Joined: Tue Jul 17, 2012 3:02 pm

Re: Julia my Friend

Sun Apr 18, 2021 11:32 am

RichardRussell wrote:
Sun Apr 18, 2021 10:06 am
But isn't the whole point of the fern fractal to see it 'grow'? Having to wait before anything appears seems quite silly.
Yes. I recall it being quite a magical experience when I first saw it on my Atari 520 ST. As one wonders how does such structure come from a random process? Where is that shape encoded? WTF?

On the other hand Prof. Barnsley, after whom this fern is named and author of the famous "Fractals Everywhere" book, started the Iterated Systems Incorporated to develop fractal image compression systems. Where I presume speed was of the essence. That plan did not work out.
Memory in C++ is a leaky abstraction .

jalih
Posts: 178
Joined: Mon Apr 15, 2019 3:54 pm

Re: Julia my Friend

Sun Apr 18, 2021 12:20 pm

RichardRussell wrote:
Sun Apr 18, 2021 10:06 am
Of course programs which build a bitmap in memory and then display it in one hit will take a very different time from those which plot individual pixels (often plotting the same pixel multiple times). But isn't the whole point of the fern fractal to see it 'grow'? Having to wait before anything appears seems quite silly.
I use separate thread to do calculations and plot pixels into image. Main GUI thread just displays the Image. That gives live update for free...

jalih
Posts: 178
Joined: Mon Apr 15, 2019 3:54 pm

Re: Julia my Friend

Sun Apr 18, 2021 12:25 pm

ejolson wrote:
Sat Apr 17, 2021 9:09 pm
Here is a fern program in Julia
Thanks, I have to try that and Go version when I get back home!

User avatar
RichardRussell
Posts: 722
Joined: Thu Jun 21, 2012 10:48 am

Re: Julia my Friend

Sun Apr 18, 2021 2:50 pm

jalih wrote:
Sun Apr 18, 2021 12:20 pm
I use separate thread to do calculations and plot pixels into image. Main GUI thread just displays the Image. That gives live update for free...
Actually BBC BASIC does the same (calculations in worker thread, plotting in GUI thread) but it's individual pixels that are queued for plotting, not the entire image, so plotting still ends up dominating the execution time.

ejolson
Posts: 7094
Joined: Tue Mar 18, 2014 11:47 am

Re: Julia my Friend

Sun Apr 18, 2021 3:16 pm

RichardRussell wrote:
Sun Apr 18, 2021 10:06 am
Do these languages have profilers? I'd love to see how the time is split between the calculations and the plotting. 8,000,000 iterations is overkill in my opinion, the fern graphic is visually unchanging well before that.

Of course programs which build a bitmap in memory and then display it in one hit will take a very different time from those which plot individual pixels (often plotting the same pixel multiple times). But isn't the whole point of the fern fractal to see it 'grow'? Having to wait before anything appears seems quite silly.
At what point the image of the fractal is unchanging depends on the resolution of the image. As this is a fractal, there are details at all resolutions so no fixed image size gives an accurate portrayal. In the present case resolution was chosen to be low compared to the number of iterations so the time taken to write the image was small compared to the time needed to calculate it. Even at this low resolution the attractor of Barnsley's iterated function system is not complete and still has random holes in it.

It's also worth mentioning, the unbuffered input and output used by default in Go results in one write system call per pixel so that almost half the execution time is spent writing that tiny image. In this respect Julia is much more my friend.

I think you are right that one of the points of the random iterations is to watch the fractal appear and even stop the process at the point where the image is most pleasing before it has converged.

Imagine the case where one is wanting to produce an image on a laser printer at the resolution of 600 dots per inch. Watching the fractal appear pixel by pixel would take lots of paper. Moreover, the number of iterations in the present calculation would likely be far too few to provide a suitable image for printing.

In a way rendering a fully resolved fern fractal at 4800 by 6300 resolution is like the huge Fibonacci challenge: The simple idea of iterating until reaching the answer may be slow and more efficient methods of computation need to be used.

jalih
Posts: 178
Joined: Mon Apr 15, 2019 3:54 pm

Re: Julia my Friend

Sun Apr 18, 2021 4:16 pm

RichardRussell wrote:
Sun Apr 18, 2021 2:50 pm
jalih wrote:
Sun Apr 18, 2021 12:20 pm
I use separate thread to do calculations and plot pixels into image. Main GUI thread just displays the Image. That gives live update for free...
Actually BBC BASIC does the same (calculations in worker thread, plotting in GUI thread) but it's individual pixels that are queued for plotting, not the entire image, so plotting still ends up dominating the execution time.
How about using multiple worker threads by giving each thread it's own image to work with that is initially all transparent? That way no locks are necessary and final image can be constructed by just drawing multiple images on top of each other.

ejolson
Posts: 7094
Joined: Tue Mar 18, 2014 11:47 am

Re: Julia the Language

Sun Apr 18, 2021 4:39 pm

ejolson wrote:
Sun Sep 13, 2020 10:54 pm
[moderator please don't change the subject afterwards!]
After getting Julia 1.6.0 working well on the Raspberry Pi using a Singularity container, I changed the title of this thread to Julia my Friend, even though the build process still segfaults under default Raspberry Pi OS.

It would appear an anonymous moderator wanted to change the title of this thread back to Julia the Language but forgot to capitalize the word language. Moreover, they have locked my original post so I'm unable to fix their sloppy mistake.

Although the rule to not change the title of a thread is new to me, I am fully able to comply with such rules. Therefore, I find it an oppressive abuse of power for a moderator to lock my first post as if I did not have the competence to be trusted with the ability to edit the posts I made earlier.

Any help getting my first post unlocked and the title of this thread properly capitalized would be appreciated.

ejolson
Posts: 7094
Joined: Tue Mar 18, 2014 11:47 am

Re: Julia my Friend

Sun Apr 18, 2021 7:36 pm

ejolson wrote:
Sun Apr 18, 2021 3:16 pm
RichardRussell wrote:
Sun Apr 18, 2021 10:06 am
Of course programs which build a bitmap in memory and then display it in one hit will take a very different time from those which plot individual pixels (often plotting the same pixel multiple times). But isn't the whole point of the fern fractal to see it 'grow'? Having to wait before anything appears seems quite silly.
It's also worth mentioning, the unbuffered input and output used by default in Go results in one write system call per pixel so that almost half the execution time is spent writing that tiny image. In this respect Julia is much more my friend.
I added a write buffer to the Go code and further switched to the binary P4 image format. The speed on the Pi 4B is now

Code: Select all

$ time ./fernbuf 
fernbin.go -- Compute Barnsley's Fern

Iteration rate is 3.8271423178127375e+06 per second.
Total execution time 2.090332508087158 seconds.

real    0m2.102s
user    0m2.075s
sys 0m0.031s
about doubling the speed of the previous Go code. This change has an effect similar to buffering the pixel updates to the screen in a graphical program. It also raises the minimum performance expectations for Julia to reach the speed of a non-interactive language.

After the recent falling out with Julia my friend, the dog developer recommended I start a new topic at the BARK™ Foundation about computing Barnsley ferns using FidoBasic. Rather than that, however, I'm still hoping the Raspberry Pi and Julia can get along. Julia looks like it is becoming popular in all types of STEM disciplines and it seems reasonable people trying to learn those disciplines be able to use a Pi.

For reference the only change in the previous Go code is

Code: Select all

func plot() {
    io,_:=os.Create("fern.pnm")
    fp:=bufio.NewWriter(io)
    fmt.Fprintf(fp,"P4\n")
    fmt.Fprintf(fp,"%d %d\n",len(image[0]),len(image))
    row:=make([]byte,(len(image[0])+7)/8)
    for iy:=len(image)-1; iy>=0; iy-- {
        rx:=0; rb:=byte(0)
        ib:=byte(128)
        for ix:=0; ix<len(image[0]); ix++ {
            if image[iy][ix]!=0 { rb|=ib }
            ib>>=1
            if ib==0 {
                row[rx]=rb; ib=128
                rb=0; rx++
            }
        }
        if ib!=0 { row[rx]=rb }
        fp.Write(row)
    }
    fp.Flush()
    io.Close()
}
In particular, only the output routine changed and the computational part is exactly the same as before. The 2-fold performance improvement results from buffering the output.
Last edited by ejolson on Sun Apr 18, 2021 8:08 pm, edited 6 times in total.

ejolson
Posts: 7094
Joined: Tue Mar 18, 2014 11:47 am

Re: Julia my Friend

Sun Apr 18, 2021 7:52 pm

jalih wrote:
Sun Apr 18, 2021 4:16 pm
RichardRussell wrote:
Sun Apr 18, 2021 2:50 pm
jalih wrote:
Sun Apr 18, 2021 12:20 pm
I use separate thread to do calculations and plot pixels into image. Main GUI thread just displays the Image. That gives live update for free...
Actually BBC BASIC does the same (calculations in worker thread, plotting in GUI thread) but it's individual pixels that are queued for plotting, not the entire image, so plotting still ends up dominating the execution time.
How about using multiple worker threads by giving each thread it's own image to work with that is initially all transparent? That way no locks are necessary and final image can be constructed by just drawing multiple images on top of each other.
Running multiple simulations and then combining them sounds like a good idea. Except for the final reduction to a single image, that technique should scale to as many cores as are available. In addition to multi-core architectures, I think the idea would also scale well on a cluster--even a super-cheap cluster of Pi Zero computers.

ejolson
Posts: 7094
Joined: Tue Mar 18, 2014 11:47 am

Re: Julia my Friend

Sun Apr 18, 2021 8:53 pm

ejolson wrote:
Sun Apr 18, 2021 7:52 pm
jalih wrote:
Sun Apr 18, 2021 4:16 pm
RichardRussell wrote:
Sun Apr 18, 2021 2:50 pm


Actually BBC BASIC does the same (calculations in worker thread, plotting in GUI thread) but it's individual pixels that are queued for plotting, not the entire image, so plotting still ends up dominating the execution time.
How about using multiple worker threads by giving each thread it's own image to work with that is initially all transparent? That way no locks are necessary and final image can be constructed by just drawing multiple images on top of each other.
Running multiple simulations and then combining them sounds like a good idea. Except for the final reduction to a single image, that technique should scale to as many cores as are available. In addition to multi-core architectures, I think the idea would also scale well on a cluster--even a super-cheap cluster of Pi Zero computers.
After thinking about it, I realized since single-byte memory writes are always atomic on the Pi that I could create a multi-threaded Go program without much effort. The result for the Pi 4B

Code: Select all

$ time ./fernpar 
fernpar.go -- Compute Barnsley's Fern (GOMAXPROCS=4)

Iteration rate is 2.4541350919831693e+07 per second.
Total execution time 0.32598042488098145 seconds.

real    0m0.331s
user    0m1.279s
sys 0m0.001s
indicates what appears to be a super-linear performance increase.

What really happened is that by switching to per-thread random number generators I avoided the locking overhead of the default generator. Those gophers are now growing ferns so quickly that I'm not sure Julia will ever catch up. It is also tempting to go for greater resolution and create some print-quality output.

For reference, the parallel Go code is

Code: Select all

/*  fernpar.go -- Compute the Barnsley Fern */

package main

import ("fmt"; "os"; "bufio"; "runtime"; "time"; "math/rand")

const N=8000000

var tictime float64
func tic() {
    now:=time.Now()
    tictime=float64(now.Unix())+1.0E-9*float64(now.Nanosecond())
}
func toc() float64 {
    now:=time.Now()
    return float64(now.Unix())+1.0E-9*float64(now.Nanosecond())-tictime
}

var A=[4][2][2]float64{
    {{0.85, 0.04}, {-0.04, 0.85}},
    {{0.20, -0.26}, {0.23, 0.22}},
    {{-0.15, 0.28}, {0.26, 0.24}},
    {{0.00, 0.00}, {0.00, 0.16}}}

var B=[4][2]float64{
    {0.00, 1.60},
    {0.00, 1.60},
    {0.00, 0.44},
    {0.00, 0.00}}

var P=[4]float64{0.85, 0.07, 0.07, 0.01 }
var cdf [4]float64

func sum(p []float64)float64 {
    r:=0.0
    for i:=0; i<len(p); i++ { r+=p[i] }
    return r
}

func f(i int, x [2]float64) [2]float64 {
    var b [2]float64
    for j:=0; j<2; j++ {
        b[j]=B[i][j]
        for k:=0; k<2; k++ {
            b[j]+=A[i][j][k]*x[k]
        }
    }
    return b
}

func i(p float64) int {
    for j:=0; j<3; j++ {
        if p<=cdf[j] {
            return j
        }
    }
    return 3
}

var xmin=[2]float64{-2.1820,0}
var xmax=[2]float64{2.6558,9.9983}
const border=0.1

const scale=100.0
var image [][]byte

func doinit() {
    for i:=0; i<4; i++ {
        cdf[i]=sum(P[0:i+1])
    }
    var nu [2]int
    for i:=0; i<2; i++ {
        nu[i]=int(scale*(xmax[i]-xmin[i]+2*border))
    }
    image=make([][]byte,nu[1])
    for i:=0; i<nu[1]; i++ {
        image[i]=make([]byte,nu[0])
    }
}

func plot() {
    io,_:=os.Create("fern.pnm")
    fp:=bufio.NewWriter(io)
    fmt.Fprintf(fp,"P4\n")
    fmt.Fprintf(fp,"%d %d\n",len(image[0]),len(image))
    row:=make([]byte,(len(image[0])+7)/8)
    for iy:=len(image)-1; iy>=0; iy-- {
        rx:=0; rb:=byte(0)
        ib:=byte(128)
        for ix:=0; ix<len(image[0]); ix++ {
            if image[iy][ix]!=0 { rb|=ib }
            ib>>=1
            if ib==0 {
                row[rx]=rb; ib=128
                rb=0; rx++
            }
        }
        if ib!=0 { row[rx]=rb }
        fp.Write(row)
    }
    fp.Flush()
    io.Close()
}

func point(x [2]float64) {
    var coord [2]int
    for i:=0; i<2; i++ {
        coord[i]=int(scale*(x[i]-xmin[i]+border))
    }
    image[coord[1]][coord[0]]=1
}

func work(s uint64,jmax int,c chan int){
    gen:=rand.New(rand.NewSource(int64(s)))
    var xn=[2]float64{0.0,0.0}
    point(xn)
    for j:=0; j<jmax; j++ {
        xn=f(i(gen.Float64()),xn)
        point(xn)
    }
    c<-0
}

func main(){
    tic()
    ncpu:=runtime.GOMAXPROCS(0)
    doinit()
    fmt.Printf("fernpar.go -- Compute Barnsley's Fern"+
        " (GOMAXPROCS=%d)\n",ncpu)
    ret:=make(chan int,ncpu)
    for n:=1;n<ncpu;n++ {
        go work(rand.Uint64(),N/ncpu,ret)
    }
    work(rand.Uint64(),N/ncpu,ret)
    for n:=0;n<ncpu;n++ { <-ret }
    plot()
    tsec:=toc()
    fmt.Printf("\nIteration rate is %g per second.\n",N/tsec)
    fmt.Printf("Total execution time %g seconds.\n",tsec)
    os.Exit(0)
}
Last edited by ejolson on Tue Apr 20, 2021 3:35 am, edited 2 times in total.

ejolson
Posts: 7094
Joined: Tue Mar 18, 2014 11:47 am

Re: Julia my Friend

Mon Apr 19, 2021 8:53 pm

ejolson wrote:
Sun Apr 18, 2021 8:53 pm
Those gophers are now growing ferns so quickly that I'm not sure Julia will ever catch up.
Here is a first attempt with Julia to catch up to those Go language gophers. Note there is no parallel processing at the moment and I'm still running the ARMv6 build in a Singularity container on the 700 MHz super-cheap cluster.

The new timing follows:

Code: Select all

barnstat.jl -- Compute Barnsley's Fern

Iteration rate is 235426.10357255733 per second.
Total execution time 33.980938726 seconds.

real    2m54.498s
user    2m32.410s
sys 0m15.750s
Recall from

viewtopic.php?p=1853110#p1853110

that the original Julia code ran at 31945 iterations per second and took 250 seconds to finish. Thus, a few micro-optimizations resulting from minor changes have so far have resulted in more than a 7-fold performance improvement.

A significant performance improvement resulted by rewriting the point plotting function. The old version was

Code: Select all

function point(x)
    coord=Integer.(floor.(scale*(x-xmin.+border)))
    image[coord...]=1
end
while the new one is

Code: Select all

function point(x)
    i1=Integer(floor(scale*(x[1]-xmin[1]+border)))
    i2=Integer(floor(scale*(x[2]-xmin[2]+border)))
    image[i1,i2]=1
end
I'm at a loss to explain why the vectorized code is much slower than the new one. I think there is a way to examine the LLVM assembler output directly from the REPL, but I haven't taken a look to see what crazy things the original routine turned into.

The other improvement resulted from the StaticArrays package. While this led to a performance improvement of about 2-fold, I found it more significant that the package downloaded from the repository and installed without trouble. Given the highly experimental nature of this ARMv6 build I don't expect every package to work, but it seems that many do.

For reference the optimized code is

Code: Select all

#  barnstat.jl -- Compute the Barnsley Fern

using StaticArrays

const N=8000000

const A=[SA[0.85 0.04; -0.04 0.85],
    SA[0.20 -0.26; 0.23 0.22],
    SA[-0.15 0.28; 0.26 0.24],
    SA[0.00 0.00; 0.00  0.16]]

const B=[SA[0.00, 1.60],
    SA[0.00, 1.60],
    SA[0.00, 0.44],
    SA[0.00, 0.00]]

const P=[0.85, 0.07, 0.07, Inf]
const cdf=[sum(P[1:i]) for i=1:4]

f(i,x)=A[i]*x+B[i]
i(p)=findfirst(x->p<=x,cdf)

const xmin=SA[-2.1820,0]
const xmax=SA[2.6558,9.9983]
const border=0.1
const scale=100.0

const nu=Integer.(floor.(scale*(xmax-xmin.+2*border)))
const image=zeros(Int8,nu...);

function plot()
    open("fern.pnm","w") do io
        println(io,"P4")
        println(io,size(image)[1]," ",size(image)[2])
        row=zeros(UInt8,(size(image)[1]+7)÷8)
        for iy=size(image)[2]:-1:1
            rx=1; rb=UInt8(0)
            ib=UInt8(128)
            for ix=1:size(image)[1]
                if image[ix,iy]!=0
                    rb|=ib
                end
                ib>>=1
                if ib==0
                    row[rx]=rb; ib=UInt8(128)
                    rb=UInt8(0); rx+=1
                end
            end
            if ib!=0
                row[rx]=rb
            end
            write(io,row)
        end
    end
end

function point(x)
    i1=Integer(floor(scale*(x[1]-xmin[1]+border)))
    i2=Integer(floor(scale*(x[2]-xmin[2]+border)))
    image[i1,i2]=1
end

function main()
    println("barnstat.jl -- Compute Barnsley's Fern")
    xn=SA[0.0,0.0]
    point(xn)
    for j=1:N
        xn=f(i(rand(1)[1]),xn)
        point(xn)
    end
    plot()
end

tsec=@elapsed main()
println("\nIteration rate is ",N/tsec," per second.")
println("Total execution time ",tsec," seconds.")
I think that's fast enough it's time to try parallelizing the Julia program. I see a number of ways to do this.
  • Create a new ARMv7 container with threading enabled.
    • Create a multi-core aware program that runs in 64-bit mode on a Pi 4B.
      • Create a distributed-memory program using the ARMv6 build.
      As the last option involves the super-cheap computing cluster for learning

      viewtopic.php?p=1851754#p1851754

      I'll follow up the distributed-memory approach there. That leaves the first two options for this thread with details on building Julia for ARMv7 probably the most relevant. At the same time, switching to a 64-bit operating system and using the official Julia 1.6.0 binary sounds easier.

      In the mean time, has anyone figured out how to profile a Julia program?

      Return to “Other programming languages”