Heater
Posts: 12952
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Fri Jan 04, 2019 5:50 pm

Ah, Steve, I was wondering what you meant by using big integers via software interrupts. It seems that instead of having shared libraries one can just plug the functionality into the OS itself. Cunning.

If you present the code I'd sure be tempted to boot RISC OS and try it out.

ejolson
Posts: 3214
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Fri Jan 04, 2019 6:44 pm

Steve Drain wrote:
Fri Jan 04, 2019 5:36 pm
ejolson wrote:
Fri Jan 04, 2019 2:09 pm
Would you mind posting what you have so far?
I may do, but it only applies to BBC BASIC running on a RISC OS machine with the Numbers module loaded.

To clarify for anyone unfamiliar with RISC OS, modules are operating system extensions. They can be written in assembler, C or, just about, in compiled BASIC. They provide software interrupts (SWIs) which can be called from any language. In this case the Numbers module was written by a third party quite a while ago and provides a fairly comprehensive suite of big-number operations.
A version with inline assembler instead of SWIs also sounds nice. Do you know whether the assembler implements an O(n^2) algorithm for the multiplication or one with greater asymptotic efficiency such as Karatsuba?
You mistake my meaning. An assembler version would call the SWIs in the same way, but the overheads would be much reduced.

I do not know how multiplication is done. I have the source, but have not examined it. I do not think it is Karatsuba, because it is a development of algorithms originally written in the 1980s.
Although the Karatsuba algorithm was published in 1962, it is possible the size of the big numbers expected for programs written in BASIC on RISC OS in the 80s was sufficiently small that the asymptotic superiority of Karatsuba multiplication wasn't a clear win over the simplicity of the O(n^2) algorithm. At the same time, computing the 4784969th Fibonacci number on a Pi in 5 minutes using an O(n^2) algorithm--even coded in assembly--sounds almost too good to be true. It would be interesting to know how it is done.

User avatar
Paeryn
Posts: 2604
Joined: Wed Nov 23, 2011 1:10 am
Location: Sheffield, England

Re: Why Avoid BASIC on RPi?

Sat Jan 05, 2019 4:19 am

ejolson wrote:
Fri Jan 04, 2019 6:44 pm
Although the Karatsuba algorithm was published in 1962, it is possible the size of the big numbers expected for programs written in BASIC on RISC OS in the 80s was sufficiently small that the asymptotic superiority of Karatsuba multiplication wasn't a clear win over the simplicity of the O(n^2) algorithm. At the same time, computing the 4784969th Fibonacci number on a Pi in 5 minutes using an O(n^2) algorithm--even coded in assembly--sounds almost too good to be true. It would be interesting to know how it is done.
There have been faster methods over the past 10 years, one by De, Kurur, Saha and Saptharishi, but it doesn't get faster than the Schönhage–Strassen FFT method (the fastest one currently in GMP) until your numbers have 10 ** (10 ** 4797) bits!
She who travels light — forgot something.

ejolson
Posts: 3214
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Sat Jan 05, 2019 4:37 am

Paeryn wrote:
Sat Jan 05, 2019 4:19 am
ejolson wrote:
Fri Jan 04, 2019 6:44 pm
Although the Karatsuba algorithm was published in 1962, it is possible the size of the big numbers expected for programs written in BASIC on RISC OS in the 80s was sufficiently small that the asymptotic superiority of Karatsuba multiplication wasn't a clear win over the simplicity of the O(n^2) algorithm. At the same time, computing the 4784969th Fibonacci number on a Pi in 5 minutes using an O(n^2) algorithm--even coded in assembly--sounds almost too good to be true. It would be interesting to know how it is done.
There have been faster methods over the past 10 years, one by De, Kurur, Saha and Saptharishi but it doesn't get faster than the Schönhage–Strassen FFT method (fastest one currently in GMP) until your numbers have 10 ** (10 ** 4797) bits!
Interesting. I looked up De, Kurur, Saha and Saptharishi big-number multiplication and found
Christoph Lüders wrote:In this diploma thesis, results of an implementation of DKSS multiplication are presented: run-time is about 30 times larger than SSA, while memory requirements are about 3.75 times higher than SSA. A possible crossover point is estimated to be out of reach even if we utilized the whole universe for computer memory.
I remember when 64K was the size of the universe. I guess that goes to show how important efficient expressivity is when coding asymptotically fast big-number multiplication routines. Maybe if Christoph had written the algorithm in line-numbered BASIC the crossover point would have been sooner, though it's possible he was avoiding BASIC. I hear the optimiser results in too much volatility. So how big would n have to be for the nth Fibonacci number to fill the whole universe?

Heater
Posts: 12952
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Sat Jan 05, 2019 7:16 pm

I was having a play with our fibo() under Google benchmark and came across an interesting result.

First we make a benchmark for fibo():

Code: Select all

// Define fibo benchmark
static void BM_fibo(benchmark::State& state) {
    int n = state.range(0);
    for (auto _ : state) {
        if (!fiboInitialized) {
            fiboInit();
            fiboInitialized = true;
        }
        benchmark::DoNotOptimize(fibo(n));
    }
}
BENCHMARK(BM_fibo)
    ->RangeMultiplier(2)
    ->Range(1<<4, 1<<25)
    ->Complexity()
    ->Unit(benchmark::kMillisecond);
Then running this produces output like so:

Code: Select all

$ ./bench
2019-01-05 20:40:56
Running ./bench
Run on (8 X 3401 MHz CPU s)
Load Average: 0.52, 0.58, 0.59
-------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
-------------------------------------------------------------------
BM_fibo/16                   0.001 ms        0.001 ms       640000
BM_fibo/32                   0.002 ms        0.002 ms       320000
BM_fibo/64                   0.004 ms        0.004 ms       154483
BM_fibo/128                  0.009 ms        0.009 ms        74667
BM_fibo/256                  0.017 ms        0.017 ms        40727
BM_fibo/512                  0.035 ms        0.035 ms        20364
BM_fibo/1024                 0.070 ms        0.070 ms         8960
BM_fibo/2048                 0.140 ms        0.138 ms         4978
BM_fibo/4096                 0.284 ms        0.289 ms         2489
BM_fibo/8192                 0.578 ms        0.572 ms         1120
BM_fibo/16384                 1.18 ms         1.17 ms          560
BM_fibo/32768                 2.49 ms         2.46 ms          280
BM_fibo/65536                 5.20 ms         5.16 ms          112
BM_fibo/131072                11.4 ms         11.5 ms           64
BM_fibo/262144                25.5 ms         25.7 ms           28
BM_fibo/524288                59.6 ms         59.7 ms           11
BM_fibo/1048576                144 ms          144 ms            5
BM_fibo/2097152                364 ms          367 ms            2
BM_fibo/4194304                950 ms          938 ms            1
BM_fibo/8388608               2551 ms         2562 ms            1
BM_fibo/16777216              7095 ms         7094 ms            1
BM_fibo/33554432             20137 ms        20141 ms            1
Or one can get a simple CSV output.

I did the same for ejolson's formulation of the fibo doubling that I have implemented in C++. Hmm... that's odd, ejolson's formulation seems to be faster up until some suitably large N. So I plotted both results:
fibo_comparison_1.png
fibo_comparison_2.png
Last edited by Heater on Sat Jan 05, 2019 9:36 pm, edited 3 times in total.

Heater
Posts: 12952
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Sat Jan 05, 2019 7:29 pm

Perhaps this plot shows what is going on more clearly at the high end:
Note 1: I have disabled the memoization of my fibo in all these benchmarks, for an apples-to-apples comparison.
Note 2: These are using my same bint big integer class to get the arithmetic done.
fibo_comparison_3.png

ejolson
Posts: 3214
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Sun Jan 06, 2019 7:10 am

Heater wrote:
Sat Jan 05, 2019 7:29 pm
Perhaps this plot shows what is going on more clearly at the high end:
Note 1: I have disabled the memoization of my fibo in all these benchmarks, for an apples-apples comparison.
Note 2: These are using my same bint big integer class to get the arithmetic done.

fibo_comparison_3.png
Depending on the code, n=2^k or n=2^k+1 are possible special cases for the doubling formulas. It would be interesting to see how straight the resulting lines are when other values of n are selected, perhaps randomly.

Does either line have slope log(3)/log(2)? Maybe plot C*n^(log(3)/log(2)) for some suitably chosen value of C to compare.

Heater
Posts: 12952
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Sun Jan 06, 2019 8:43 am

Hmm... If I understand you correctly....

We can look at the raw data output of the benchmark, at the high end:

Code: Select all

..
"BM_fibo/262144",26,25.634,25.2404,ms,,,,,
"BM_fibo/524288",11,59.3675,59.6591,ms,,,,,
"BM_fibo/1048576",5,143.325,143.75,ms,,,,,
"BM_fibo/2097152",2,364.529,359.375,ms,,,,,
"BM_fibo/4194304",1,945.639,953.125,ms,,,,,
"BM_fibo/8388608",1,2540.48,2546.88,ms,,,,,
"BM_fibo/16777216",1,7060.5,7046.88,ms,,,,,
"BM_fibo/33554432",1,20041.1,20031.2,ms,,,,,
...
"BM_fiboEjOlson/262144",64,11.9017,12.207,ms,,,,,
"BM_fiboEjOlson/524288",19,36.0646,35.3618,ms,,,,,
"BM_fiboEjOlson/1048576",6,108.45,109.375,ms,,,,,
"BM_fiboEjOlson/2097152",2,326.485,328.125,ms,,,,,
"BM_fiboEjOlson/4194304",1,990.295,1000,ms,,,,,
"BM_fiboEjOlson/8388608",1,2983.46,2953.12,ms,,,,,
"BM_fiboEjOlson/16777216",1,8897.75,8906.25,ms,,,,,
"BM_fiboEjOlson/33554432",1,26804.4,26796.9,ms,,,,,
We can calculate the slopes of the lines on a log/log graph as:

Heater curve:
> (Math.log(20031.2) - Math.log(25.2404)) / (Math.log(33554432) - Math.log(262144))
1.3760426230727143

ejolson curve:
> (Math.log(26796.9) - Math.log(12.207)) / (Math.log(33554432) - Math.log(262144))
1.58573453007763

Meanwhile log(3)/log(2) is:

> Math.log10(3) / Math.log10(2)
1.584962500721156

Hmm..bang on for your curve. A bit high for mine.

What does it all mean?

I'll try and find time to test with random numbers later.

ejolson
Posts: 3214
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Sun Jan 06, 2019 10:02 am

Heater wrote:
Sun Jan 06, 2019 8:43 am
Hmm... If I understand you correctly....

We can look at the raw data output of the benchmark, at the high end:

Code: Select all

..
"BM_fibo/262144",26,25.634,25.2404,ms,,,,,
"BM_fibo/524288",11,59.3675,59.6591,ms,,,,,
"BM_fibo/1048576",5,143.325,143.75,ms,,,,,
"BM_fibo/2097152",2,364.529,359.375,ms,,,,,
"BM_fibo/4194304",1,945.639,953.125,ms,,,,,
"BM_fibo/8388608",1,2540.48,2546.88,ms,,,,,
"BM_fibo/16777216",1,7060.5,7046.88,ms,,,,,
"BM_fibo/33554432",1,20041.1,20031.2,ms,,,,,
...
"BM_fiboEjOlson/262144",64,11.9017,12.207,ms,,,,,
"BM_fiboEjOlson/524288",19,36.0646,35.3618,ms,,,,,
"BM_fiboEjOlson/1048576",6,108.45,109.375,ms,,,,,
"BM_fiboEjOlson/2097152",2,326.485,328.125,ms,,,,,
"BM_fiboEjOlson/4194304",1,990.295,1000,ms,,,,,
"BM_fiboEjOlson/8388608",1,2983.46,2953.12,ms,,,,,
"BM_fiboEjOlson/16777216",1,8897.75,8906.25,ms,,,,,
"BM_fiboEjOlson/33554432",1,26804.4,26796.9,ms,,,,,
We can calculate the slopes of the lines on a log/log graph as:

Heater curve:
> (Math.log(20031.2) - Math.log(25.2404)) / (Math.log(33554432) - Math.log(262144))
1.3760426230727143

ejolson curve:
> (Math.log(26796.9) - Math.log(12.207)) / (Math.log(33554432) - Math.log(262144))
1.58573453007763

Meanwhile log(3)/log(2) is:

> Math.log10(3) / Math.log10(2)
1.584962500721156

Hmm..bang on for your curve. A bit high for mine.

What does it all mean?

I'll try and find time to test with random numbers later.
I'm not sure exactly which codes you are comparing, but my guess is

1. The flatter curve has a significant O(n) term that comes from memory allocations.

2. The steeper curve, if it uses the same recursion as the original fibonacci.c program, has a constant C in the asymptotically dominant term C*n^(log(3)/log(2)) that is 2 to 4 times larger than it needs to be.

Heater
Posts: 12952
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Sun Jan 06, 2019 10:39 am

I'm comparing my C++ interpretation of your C fibo() algorithm vs my C++ interpretation of Paeryn's Haskell FiboFast example.
The actual codes used to obtain these results are below. They are now in their own fibo.cpp file here:
https://github.com/ZiCog/fibo_4784969/b ... B/fibo.cpp (Note: the version in the repo uses memoization; what I'm running here does not.) I think my C++ interpretation of your fibo is the same as the current C version.

Do your comments still apply to these codes?:

Code: Select all

// This Fibonacci version derived from ejolson's doubling formula C example.
bint a;
bint b;
void fiboEjOlson(const int n) {
    if (n == 0) {
        a = zero;
        b = one;
        return;
    }
    fiboEjOlson(n / 2);
    bint ta = a;
    if (isEven(n)) {
        a = ta * ((b + b) - ta);
        b = (ta * ta) + (b * b);
    } else {
        a = (a * a) + (b * b);
        b = b * ((ta + ta) + b);
    }
}

void fiboInit() {
    // Initialize the fibo's memo.
    memo.clear();
    memo[0] = zero;
    memo[1] = one;
    memo[2] = one;
}

// This Fibonacci version derived from Paeryn's Haskell example.
const bint fibo (int n) {
    if (memo.find(n) != memo.end()) {
        return memo[n];    
    }

    int k = (n / 2);
    const bint a = fibo(k);
    const bint b = fibo(k - 1);
    if (isEven(n)) {
        return  a * (two * b + a);
    }
    bint twoa = two * a;
    if ((n % 4) == 1) {
        return  (twoa + b) * (twoa - b) + two;
    }
    return  (twoa + b) * (twoa - b) - two;
}

ejolson
Posts: 3214
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Sun Jan 06, 2019 11:52 am

Heater wrote:
Sun Jan 06, 2019 10:39 am
Do your comments still apply to these codes?
Yes, that is the version of the (a,b)-linear recurrence which is 2 to 4 times slower than it needs to be. To provide a frame of reference, the original C program has not been improved since it was first posted.

A more optimal way of writing the same doubling formula can be found in the FreeBASIC code missing the Karatsuba algorithm and, of course, the program written in line-numbered BASIC. Incidentally, it made me happy that you described that particular program as strangely beautiful.

Heater
Posts: 12952
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Sun Jan 06, 2019 8:45 pm

A more optimal way of writing the same doubling formula can be found in the FreeBASIC code...
Hmm... so I thought I'd transcribe that FreeBASIC fibo to C++ and see how it flies. So far it's not working.

But isn't there something wrong with this in fibowork:

Code: Select all

sub fibowork(n as integer,a as bignum,b as bignum)
    if n=0 then
        a.n=0:b.n=1:b.d(1)=1
        return
    end if
Shouldn't we be setting a to zero there:
a.n=1:a.d(1)=0

ejolson
Posts: 3214
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Mon Jan 07, 2019 12:55 am

Heater wrote:
Sun Jan 06, 2019 8:45 pm
A more optimal way of writing the same doubling formula can be found in the FreeBASIC code...
Hmm... so I thought I'd transcribe that FreeBASIC fibo to C++ and see how it flies. So far it's not working.

But isn't there something wrong with this in fibowork:

Code: Select all

sub fibowork(n as integer,a as bignum,b as bignum)
    if n=0 then
        a.n=0:b.n=1:b.d(1)=1
        return
    end if
Aren't we to be setting a to zero there:
a.n=1:a.d(1)=0
Yes, a is zero. The code trims leading zeros from numbers so a number with zero digits is also zero.

Heater
Posts: 12952
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Mon Jan 07, 2019 2:32 am

Ah, OK. In my C++ I treat zero length bints as Not a Number (BINT_NULL).

Anyway: Yay C++ fibo(4784969) is 15% faster again!

So I transcribed the fibo of the FreeBASIC version into C++. It's now about 15% faster than previously: more than twice as fast as my old version with memoization removed, and twice as fast as the previous ejolson-inspired version.

I'll post the code and do some benchmarking on it tomorrow but preliminary results look like this:

Code: Select all

$ ./bench
2019-01-07 04:21:42
Running ./bench
Run on (8 X 3401 MHz CPU s)
Load Average: 0.52, 0.58, 0.59
---------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
---------------------------------------------------------------------
BM_fibo/4784969                 1162 ms         1156 ms            1
BM_fiboEjOlson/4784969          1196 ms         1219 ms            1
BM_fiboEjOlsonNew/4784969        528 ms          516 ms            1
Note: The BM_fibo above is without memoization. Even with memoization this new version beats it by 100ms or so!

ejolson
Posts: 3214
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Mon Jan 07, 2019 4:00 am

Heater wrote:
Mon Jan 07, 2019 2:32 am
Even with memoization this new version beats it by 100ms or so!
That's why I had expected the line-numbered BASIC program to do better than it did. Unfortunately, it seems that language is just not expressive enough to efficiently code Karatsuba multiplication.

ejolson
Posts: 3214
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Mon Jan 07, 2019 10:40 am

I've been optimizing the original fibonacci.c code and created a new program parallel.c which uses OpenMP to obtain additional efficiency when running on multi-core hardware. Results are as follows when running the new program on a Pi 3B in 32-bit mode underclocked with arm_freq=900:

Code: Select all

$ make
/usr/local/gcc-6.4/bin/gcc -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -Wall -o serial parallel.c
/usr/local/gcc-6.4/bin/gcc -fopenmp -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -Wall -o parallel parallel.c
/usr/local/gcc-6.4/bin/gcc -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -Wall -o fibonacci fibonacci.c
$ time ./fibonacci >fibonacci.txt

real    0m52.750s
user    0m52.728s
sys 0m0.010s
$ time ./serial >serial.txt

real    0m20.777s
user    0m20.756s
sys 0m0.020s
$ time ./parallel >parallel.txt

real    0m5.768s
user    0m21.004s
sys 0m0.010s
$ md5sum fibonacci.txt 
c926caa99ed1e30fe0d3c966da901738  fibonacci.txt
$ md5sum serial.txt 
c926caa99ed1e30fe0d3c966da901738  serial.txt
$ md5sum parallel.txt
c926caa99ed1e30fe0d3c966da901738  parallel.txt
I will be posting the new code shortly. For now I would note that optimizations to the original serial code led to a factor of 2.5 improvement while parallelization across the 4 cores gave another factor of 3.6. It is also worth noting that parallel.c took significantly less time to write and debug than the line-numbered BASIC program.

Heater
Posts: 12952
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Mon Jan 07, 2019 10:50 am

Wow. Can't wait to try it out and see the code.

A 3.6x speedup over 4 cores is a pretty good scaling factor.

Heater
Posts: 12952
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Mon Jan 07, 2019 1:49 pm

This is interesting:

ejolson's new fibo() formulation is a clear winner across the board up to fibo(134217728) by a factor of 2 and much more at the low end.

But if we look at the high end we see the original formulation I am using is rising at a lower rate. It will win again somewhere further out...
Attachments
new_2.png
new_1.png

ejolson
Posts: 3214
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Mon Jan 07, 2019 3:07 pm

Heater wrote:
Fri Jan 04, 2019 4:22 pm
How these things are implemented under the hood is of no consequence, as long as they perform well enough. Perhaps the language run time makes use of libs written in other languages, perhaps it does not. We are talking about the language here not any particular implementation.
A programming language which allows for the efficient expression of complicated algorithms is liberating in the sense that the person who learns such a language is no longer constrained by the functionality of built-in features and standard libraries but has the opportunity to develop and implement new algorithms that solve new problems and old problems in new ways. While it might appear that the goal is to compute million-digit Fibonacci numbers, the title of the thread is why avoid BASIC on the Raspberry Pi.

I find it profoundly important to pay attention to what goes on behind the scenes, because for me coding the doubling formulas and Karatsuba multiplication is best seen as an exercise to help determine how expressive various programming languages might be for efficiently implementing algorithms in general. For example, there is good reason to believe that built-in big-number arithmetic has little to do with the yet-to-be-invented non-bipartite womble algorithm.

When a swimmer performs a series of exercises on land, it is clear the goal is not the exercise but building the strength needed for subsequent swimming. When a calculus student is confronted with a series of exercises, if finding answers takes priority over the goal of building the skills and understanding necessary for subsequent work, then many students will notice the answers are easiest to find by consulting the key at the end of the book. When the goal is understanding why to avoid BASIC, then what I want to know is whether BASIC has the ability to efficiently express anything the hardware is capable of. If finding the 4784969th Fibonacci number takes priority, then we forget the purpose of the exercise.

What caused the end of the golden age of personal computing? Was the end caused by the failure of BASIC to liberate a person who learned that language? With the Pi a second age of personal computing has begun. Does Python also lack the ability to efficiently express new algorithms in a way that repeats the mistakes of the past?
Last edited by ejolson on Tue Jan 08, 2019 12:40 am, edited 3 times in total.

ejolson
Posts: 3214
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Mon Jan 07, 2019 3:58 pm

Heater wrote:
Mon Jan 07, 2019 1:49 pm
But if we look at the high end we see the original formulation I am using is rising at a lower rate. It will win again somewhere further out...
It does look like you turned an O(n^1.58) algorithm into an O(n^1.38) one. However, I'd be very surprised if the red and blue lines ever really crossed.

Heater
Posts: 12952
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Mon Jan 07, 2019 9:25 pm

ejolson,
It does look like you turned an O(n^1.58) algorithm into an O(n^1.38) one.
Great. Where do I pick up my Turing award?
However, I'd be very surprised if the red and blue lines ever really crossed.
Dang...There goes fame and fortune... asymptotes are a bitch.
benchmark_results.png

ejolson
Posts: 3214
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Tue Jan 08, 2019 4:13 am

This post includes a new code parallel.c demonstrating the expressiveness of the C programming language with OpenMP extensions in the context of computing the 4784969th Fibonacci number using the doubling formulas and Karatsuba multiplication. This program was inspired by the original fibonacci.c program with the following modifications:

1. Algebraically simplified versions of the doubling formulas given by

F(2k) = F(k)[2F(k+1)-F(k)]
F(2k+1) = F(k+1)[2F(k+1)-F(k)]+(-1)^(k+1)
F(2k+1) = F(k)[2F(k)+F(k+1)]+(-1)^(k)
F(2k+2) = F(k+1)[2F(k)+F(k+1)]

have been used which reduce the number of multiplications per doubling step from three to two.

2. The final return of the doubling formula occurs in a separate routine and only calculates the value of b, since that is the only value needed.

3. All memory is allocated from the execution stack using C99 variable length arrays. The routines for big-number addition and subtraction have been rewritten to take an additional argument specifying the stack-allocated buffers in which to store the big numbers. This change allows lock-less memory allocation for the multi-threaded version, since every thread has its own execution stack.

4. Signed 32-bit integers have been used to store the digits of the big numbers. This is more cache friendly and allows addition and subtraction to be performed as vector addition and vector subtraction on the arrays of digits which represent the big numbers. The resulting loops are vectorizable. The O(n^2) multiplication has been modified to use a 64-bit temporary workspace with a delayed carry that avoids any intermediate divisions until the very end.

5. A horrible bigcarry routine has been written to fix the mess made by the simplified vector addition and subtraction routines.

6. The multiplication, addition, subtraction and carry routines are composable. They were actually composable before, but the difficulty in freeing temporary memory allocated for the return values discouraged use of that feature in the original fibonacci.c code. Now the memory is stack-allocated and automatically freed at just the right time.

The changes itemized above as 1 through 6 result in a program that runs 2.5 times faster than the original fibonacci.c code. Note that most of the speed improvement results from 1 and 2. The final change is to introduce parallelism.

Most of the parallelism comes from the Karatsuba multiplication algorithm; however, parallelism was also introduced into the use of the doubling formula as well. As the Karatsuba algorithm is by nature a recursive divide-and-conquer algorithm, the fork-join model of dynamic parallelism is easy to implement.

The current code relies on a header file that defines the macros spawn and sync. In order to accommodate either OpenMP or the Cilk parallel processing extensions to the C programming language, the syntax of spawn is a bit irritating. The irritation seems worthwhile, however, as it allows the program to be compiled using either OpenMP or Cilk with no further changes.

7. Make copies of the serial versions of the subroutines fibowork and bigmul3 called fiboworks and bigmul3s. Then introduce the spawn and sync keywords to perform the multiplications in parallel and a crossover point to determine when to switch to the serial code.

It should be pointed out that once steps 1 through 6 were completed, adding the parallelism in step 7 took about 5 minutes and worked the first time. It should also be pointed out that addition, subtraction and bigcarry have not been parallelized. The current code scales to two cores almost perfectly and obtains a factor of 3.6 speedup on four cores. Parallel efficiency when scaling to eight cores is only 70 percent. It is likely further changes would have to be made in order to scale well using more than eight cores.

The code is

Code: Select all

/*  parallel.c -- Compute the nth Fibonacci Number
    Written January 7, 2018 by Eric Olson

    This program demonstrates the expressiveness of C with OpenMP
    as measured by explicitly coding a parallel version of the 
    Karatsuba multiplication algorithm for big-number arithmetic 
    and then using the doubling formulas

        F(2k) = F(k)[2F(k+1)-F(k)]
      F(2k+1) = F(k+1)[2F(k+1)-F(k)]+(-1)^(k+1)
      F(2k+1) = F(k)[2F(k)+F(k+1)]+(-1)^(k)
      F(2k+2) = F(k+1)[2F(k)+F(k+1)]

    to compute the nth Fibonacci number.  Note that n is specified
    in the first line of the main routine.
*/

#include <stdio.h>
#include <math.h>
#include <string.h>
#include <strings.h>
#include <stdlib.h>
#include <alloca.h>
#include <sys/time.h>
#include <sys/resource.h>

#define TUNE1 44
#define TUNE2 1024
#define TUNE3 1048576

#include "parallel.h"

typedef unsigned long long llu;
typedef int limb;
typedef unsigned long long limb2;
typedef struct {
    int n; limb *d;
} bignum;

static int rexp,cmult,digits,nproc;
static limb rbase,rcarry;
static limb2 cadd,cmax;

static void bigprint(bignum x){
    char fmt[10];
    sprintf(fmt,"%%0%dllu",rexp);
    if(!x.n){
        printf("0\n");
        return;
    }
    for(int i=x.n-1;i>=0;i--){
        if(i==x.n-1) printf("%llu",(llu)x.d[i]);
        else printf(fmt,(llu)x.d[i]);
    }
    printf("\n");
}

static bignum bigtrim(bignum x){
    for(int k=x.n-1;k>=0;k--){
        if(x.d[k]!=0){
            x.n=k+1;
            return x;
        }
    }
    x.n=0;
    return x;
}

static bignum bigcarry(bignum x){
    int imax=x.n-1;
    for(int i=0;i<imax;i++){
        if(x.d[i]>=rbase){
            do {
                x.d[i]-=rbase; x.d[i+1]++;
            } while(x.d[i]>=rbase);
        } else if(x.d[i]<0){
            do {
                x.d[i]+=rbase; x.d[i+1]--;
            } while(x.d[i]<0);
        }
    }
    if(x.d[imax]<rbase) return bigtrim(x);
    x.n++;
    x.d[imax]-=rbase; x.d[imax+1]=1; 
    while(x.d[imax]>=rbase){
        x.d[imax]-=rbase; x.d[imax+1]++;
    }
    return x;
}

static bignum bigcopy(limb zd[],bignum x){
    bignum z; z.d=zd;
    memcpy(z.d,x.d,sizeof(limb)*x.n);
    z.n=x.n;
    return z;
}

static bignum bigadd3(limb zd[],bignum x,bignum y){ 
    bignum z; z.d=zd;
    if(x.n<y.n){
        for(int i=0;i<x.n;i++) z.d[i]=x.d[i]+y.d[i];
        for(int i=x.n;i<y.n;i++) z.d[i]=y.d[i];
        z.n=y.n;
    } else {
        for(int i=0;i<y.n;i++) z.d[i]=x.d[i]+y.d[i];
        for(int i=y.n;i<x.n;i++) z.d[i]=x.d[i];
        z.n=x.n;
    }
    return z;
}

#ifdef NOTUSED
static bignum bigsub3(limb zd[],bignum x,bignum y){ 
    bignum z; z.d=zd;
    for(int i=0;i<y.n;i++) z.d[i]=x.d[i]-y.d[i];
    for(int i=y.n;i<x.n;i++) z.d[i]=x.d[i];
    z.n=x.n;
    return z;
}
#endif

static bignum bigsub2(bignum x,bignum y){ 
    for(int i=0;i<y.n;i++) x.d[i]-=y.d[i];
    return x;
}

static bignum bigadd2(bignum x,bignum y){
    if(x.n<y.n){
        if(x.n>0) {
            x.d[0]--;
            for(int i=0;i<x.n;i++) x.d[i]+=y.d[i]-rcarry;
            x.d[x.n]=y.d[x.n]+1;
            for(int i=x.n+1;i<y.n;i++) x.d[i]=y.d[i];
            x.n=y.n;
        } else {
            x=bigcopy(x.d,y);
        }
    } else {
        x.d[0]--;
        for(int i=0;i<y.n;i++) x.d[i]+=y.d[i]-rcarry;
        if(y.n==x.n) {
            x.n++;
            x.d[y.n]=1;
        } else {
            x.d[y.n]++;
        }
    }
    return x;
}

static bignum bigdec1(bignum x){ 
    int imax=x.n-1;
    x.d[0]--;
    for(int i=0;i<imax;i++){
        if(x.d[i]>=0) return x;
        x.d[i]+=rbase; x.d[i+1]--;
    }
    return x;
}

static bignum biginc1(bignum x){
    int imax=x.n-1;
    if(imax>=0) {
        x.d[0]++;
        for(int i=0;i<imax;i++){
            if(x.d[i]<rbase) return x;
            x.d[i]-=rbase; x.d[i+1]++;
        }
        if(x.d[imax]<rbase) return x;
        x.d[imax]-=rbase;
    }
    x.n++;
    x.d[imax+1]=1;
    return x;
}

static bignum bigmul3w(limb zd[],bignum x,bignum y){
    bignum z; z.d=zd;
    int imax=x.n+y.n;
    limb2 s[imax];
    bzero(s,sizeof(limb2)*imax);
    imax--;
    for(int i=0;i<x.n;i++){
        for(int j=0;j<y.n;j++){
            s[i+j]+=(limb2)x.d[i]*y.d[j];
        }
        if((i+1)%cmult==0){
            for(int k=0;k<imax;k++){
                if(s[k]<cmax) continue;
                s[k]-=cmax; s[k+1]+=cadd;
            }
        }
    }
    for(int k=0;k<imax;k++){
        s[k+1]+=s[k]/rbase; z.d[k]=s[k]%rbase;
    }
    z.d[imax]=s[imax];
    z.n=imax+1;
    return bigtrim(z);
}

static bignum biglow(bignum x,int n){
    if(x.n>n) x.n=n;
    return x;
}

static bignum bighigh(bignum x,int n){
    if(x.n<n) x.n=0;
    else {
        x.n-=n;
        x.d=&x.d[n];
    }
    return x;
}

static bignum bigmul3s(limb zd[],bignum x,bignum y){
    if(x.n<TUNE1||y.n<TUNE1) return bigmul3w(zd,x,y);
    int M=x.n<y.n?y.n:x.n; int n=M/2;
    bignum z;
    bignum x0=biglow(x,n),x1=bighigh(x,n);
    bignum y0=biglow(y,n),y1=bighigh(y,n);
    limb z0d[M+1],z1d[M+3],z1yd[n+1],z2d[M+3];
    bignum z0,z2,z1x,z1y,z1;
    z0=bigmul3s(z0d,x0,y0);
    z2=bigmul3s(z2d,x1,y1);
    z1x=bigcarry(bigadd3(z1d,x0,x1));
    z1y=bigcarry(bigadd3(z1yd,y0,y1));
    z1=bigmul3s(z1x.d,z1x,z1y);
    z1=bigcarry(bigsub2(bigsub2(z1,z0),z2));
    z=bigcopy(zd,z0);
    z=bigadd3(&z.d[n],(bignum){z.n-n,&z.d[n]},z1);
    z=bigadd2((bignum){z.n-n,&z.d[n]},z2);
    z.n=z.n+2*n; z.d=zd;
    return bigcarry(z);
}

static bignum bigmul3(limb zd[],bignum x,bignum y){
    if(x.n<TUNE1||y.n<TUNE1) return bigmul3w(zd,x,y);
    int M=x.n<y.n?y.n:x.n; int n=M/2;
    if(M<TUNE2) return bigmul3s(zd,x,y);
    bignum z;
    bignum x0=biglow(x,n),x1=bighigh(x,n);
    bignum y0=biglow(y,n),y1=bighigh(y,n);
    limb z0d[M+1],z1d[M+3],z1yd[n+1],z2d[M+3];
    bignum z0,z2,z1x,z1y,z1;
    spawn(z0=) bigmul3(z0d,x0,y0);
    spawn(z2=) bigmul3(z2d,x1,y1);
    z1x=bigcarry(bigadd3(z1d,x0,x1));
    z1y=bigcarry(bigadd3(z1yd,y0,y1));
    z1=bigmul3(z1x.d,z1x,z1y);
    sync;
    z1=bigcarry(bigsub2(bigsub2(z1,z0),z2));
    z=bigcopy(zd,z0);
    z=bigadd3(&z.d[n],(bignum){z.n-n,&z.d[n]},z1);
    z=bigadd2((bignum){z.n-n,&z.d[n]},z2);
    z.n=z.n+2*n; z.d=zd;
    return bigcarry(z);
}

static bignum t1,a,b;
static void fiboworks(int n){
    if(n==0){
        a.n=0; b.n=1; b.d[0]=1;
        return;
    }
    fiboworks(n/2);
    if(n%2==0){
        // [a,b]=[a*(2*b-a),b*(2*b-a)-(-1)^k]
        t1=bigcarry(bigsub2(bigadd3(t1.d,b,b),a));
        a=bigmul3(a.d,a,t1);
        b=bigmul3(b.d,b,t1);
        if(n%4==0) b=bigdec1(b); else b=biginc1(b);
    } else {
        // [a,b]=[a*(2*a+b)+(-1)^k,b*(2*a+b)]
        t1=bigcarry(bigadd2(bigadd3(t1.d,a,a),b));
        b=bigmul3(b.d,b,t1); 
        a=bigmul3(a.d,a,t1); 
        if(n%4==1) a=biginc1(a); else a=bigdec1(a);
    }
    return;
}

static void fibowork(int n){
    if(n==0){
        a.n=0; b.n=1; b.d[0]=1;
        return;
    }
    if(n<TUNE3) fiboworks(n/2);
    else fibowork(n/2);
    if(n%2==0){
        // [a,b]=[a*(2*b-a),b*(2*b-a)-(-1)^k]
        t1=bigcarry(bigsub2(bigadd3(t1.d,b,b),a));
        spawn(a=) bigmul3(a.d,a,t1);
        b=bigmul3(b.d,b,t1);
        if(n%4==0) b=bigdec1(b); else b=biginc1(b);
        sync;
    } else {
        // [a,b]=[a*(2*a+b)+(-1)^k,b*(2*a+b)]
        t1=bigcarry(bigadd2(bigadd3(t1.d,a,a),b));
        spawn(b=) bigmul3(b.d,b,t1); 
        a=bigmul3(a.d,a,t1); 
        if(n%4==1) a=biginc1(a); else a=bigdec1(a);
        sync;
    }
    return;
}

static bignum fibo(int n, limb xd[]){
    b.d=xd;
    if(n<2){
        b.n=1; b.d[0]=n;
        return b;
    }
    limb ad[digits]; a.d=ad;
    fibowork((n-1)/2);
    if(n%2==0){
        // b=b*(2*a+b)
        t1=bigcarry(bigadd2(bigadd3(t1.d,a,a),b));
        b=bigmul3(b.d,b,t1);
    } else {
        // b=b*(2*b-a)-(-1)^k
        t1=bigcarry(bigsub2(bigadd3(t1.d,b,b),a));
        b=bigmul3(b.d,b,t1);
        if(n%4==1) b=bigdec1(b); else b=biginc1(b);
    }
    return b;
}

static int work(int n){
    digits=n*log10(0.5*(1+sqrt(5.0)))/rexp+4;
    limb t1d[digits]; t1.d=t1d;
    limb xd[digits];
    workers(nproc){
        bignum x=fibo(n,xd);
        bigprint(x);
    }
    return 0;
}

int main(){
    int n=4784969;
    setrlimit(RLIMIT_STACK,
        &(const struct rlimit)
        {RLIM_INFINITY,RLIM_INFINITY});
    if(sizeof(limb)*2!=sizeof(limb2)){
        fprintf(stderr,
        "sizeof(limb)=%llu must be half of sizeof(limb2)=%llu!\n",
        (llu)sizeof(limb),(llu)sizeof(limb2));
        return 1;
    }
    limb limbmax=((limb2)1<<(8*sizeof(limb)-1))-1;
    limb2 limb2max=((limb2)1<<(8*sizeof(limb)))-1;
    limb2max=(limb2max<<(8*sizeof(limb)))+limb2max;
    rexp=log(limbmax)/log(10);
    rbase=pow(10,rexp)+0.5; rcarry=rbase-1;
    cmult=(limb2max-2*rbase)/rbase/rbase/2;
    cadd=(limb2)rbase*cmult; cmax=cadd*rbase;
    nproc=get_ncpu();
    return work(n);
}
and the header file is

Code: Select all

/*  parallel.h -- Macros for parallel processing
    Written Jan 7, 2019 by Eric Olson */

#ifndef _PARALLEL_H
#define _PARALLEL_H

#ifdef _OPENMP
#  include <omp.h>
#  define spawn(X) _Pragma("omp task default(shared)") X
#  define sync _Pragma("omp taskwait")
#  define get_ncpu(X) omp_get_num_procs(X)
#  define workers(X) \
    if(X) omp_set_num_threads(X); \
    else omp_set_num_threads(1); \
    _Pragma("omp parallel") \
    _Pragma("omp single")
#else
#  ifdef _CILKPLUS
#    include <cilk/cilk.h>
#    include <cilk/cilk_api.h>
#    define spawn(X) X cilk_spawn
#    define sync cilk_sync
#    define get_ncpu(X) __cilkrts_get_nworkers(X)
#    define workers(X) \
    __cilkrts_end_cilk(); \
    if(X){ \
        char buf[32]; \
        sprintf(buf,"%d",X); \
        int r=__cilkrts_set_param("nworkers",buf); \
        if(r!=__CILKRTS_SET_PARAM_SUCCESS){ \
            fprintf(stderr,"Error: unable to set nworkers to %d!\n",X); \
            exit(1); \
        } \
        __cilkrts_init(); \
    }
#  else
#    define spawn(X) X
#    define sync
#    define get_ncpu(X) 1
#    define workers(X)
#  endif
#endif

#endif
Last edited by ejolson on Tue Jan 08, 2019 9:05 am, edited 1 time in total.

jahboater
Posts: 4585
Joined: Wed Feb 04, 2015 6:38 pm

Re: Why Avoid BASIC on RPi?

Tue Jan 08, 2019 8:20 am

Impressive!
I am getting better than a 4x speed-up with a quad-core PC (hyper-threading turned off).
Compared to the fibo.c version 4:

Code: Select all

$ gcc -O3 fibo.c -o fibo_plain
$ gcc -O3 fibo_gmp.c -o fibo_gmp -lgmp
$ gcc -O3 -fopenmp fibo_par.c -o fibo_par      
$
$ time ./fibo_plain >/dev/null

real	0m1.562s
user	0m1.560s
sys	0m0.004s
$ time ./fibo_gmp >/dev/null

real	0m0.195s
user	0m0.192s
sys	0m0.000s
$ time ./fibo_par >/dev/null

real	0m0.293s
user	0m1.084s
sys	0m0.000s
$ 
$ pc 1.562 / 0.293
5.33105802047782

ejolson
Posts: 3214
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Tue Jan 08, 2019 9:14 am

jahboater wrote:
Tue Jan 08, 2019 8:20 am
Impressive!
I am getting better than a 4x speed-up with a quad-core PC (hyper-threading turned off).
Compared to the fibo.c version 4:

Code: Select all

$ gcc -O3 fibo.c -o fibo_plain
$ gcc -O3 fibo_gmp.c -o fibo_gmp -lgmp
$ gcc -O3 -fopenmp fibo_par.c -o fibo_par      
$
$ time ./fibo_plain >/dev/null

real	0m1.562s
user	0m1.560s
sys	0m0.004s
$ time ./fibo_gmp >/dev/null

real	0m0.195s
user	0m0.192s
sys	0m0.000s
$ time ./fibo_par >/dev/null

real	0m0.293s
user	0m1.084s
sys	0m0.000s
$ 
$ pc 1.562 / 0.293
5.33105802047782
Thanks for testing the new code. I made one minor cosmetic change to the bigcopy routine and have just updated the previous post. The modification doesn't change the speed; it only makes the calling conventions more regular. I had been hoping the new code would match GMP's single-threaded performance, but that seems not to be the case. Maybe one needs a system with more, slower cores for that to happen. Have you tried changing the TUNE1, TUNE2 and TUNE3 parameters for better performance on your system?

Note that, to separate the algorithmic optimisations from the parallel speed-up, you can compile the new code without the -fopenmp flag to obtain a serial version of the code.
Last edited by ejolson on Tue Jan 08, 2019 9:24 am, edited 1 time in total.

Heater
Posts: 12952
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Tue Jan 08, 2019 9:23 am

Awesome. No time to look into that just now but it builds and runs here:

Code: Select all

$ gcc -Wall -Wextra -O3 -o parallel parallel.c -lm
$ time ./parallel | tail   -c 32
4856539211500699706378405156269

real    0m0.919s
user    0m0.844s
sys     0m0.063s
$
$ gcc -Wall -Wextra -O3 -fopenmp -o parallel parallel.c -lm
$ time ./parallel | tail   -c 32
4856539211500699706378405156269

real    0m0.257s
user    0m1.281s
sys     0m0.094s
For a speed-up of about 3.6 using 8 (hyper)threads on 4 cores.

All in all, 2.4 times faster than the single-core C++:

Code: Select all

$ time ../c++/fibo_karatsuba | tail -c 32
4856539211500699706378405156269

real    0m0.640s
user    0m0.609s
sys     0m0.000s
One thing that immediately jumped out at me is this declaration in work();

Code: Select all

    limb t1d[digits]; t1.d=t1d;
t1d looks like an unused variable. But comment it out and it segfaults. Something cunning going on here!

How do we build that with cilk?
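On the Cilk question: parallel.h keys off a user-supplied _CILKPLUS macro rather than anything the compiler predefines, so (assuming a GCC old enough to still ship Cilk Plus, i.e. GCC 4.9 through 7, since support was removed in GCC 8) the build would presumably look something like:

```shell
# Hypothetical build line, untested here: -D_CILKPLUS selects the Cilk
# branch of parallel.h, -fcilkplus enables the cilk_spawn/cilk_sync
# keywords, and -lcilkrts links the Cilk runtime.
gcc -O3 -fcilkplus -D_CILKPLUS parallel.c -o parallel_cilk -lcilkrts -lm
time ./parallel_cilk | tail -c 32
```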
