Although the Karatsuba algorithm was published in 1962, it is possible the size of the big numbers expected for programs written in BASIC on RISC OS in the 80s was sufficiently small that the asymptotic superiority of Karatsuba multiplication wasn't a clear win over the simplicity of the O(n^2) algorithm. At the same time, computing the 4784969th Fibonacci number on a Pi in 5 minutes using an O(n^2) algorithm--even coded in assembly--sounds almost too good to be true. It would be interesting to know how it is done.

Steve Drain wrote: ↑Fri Jan 04, 2019 5:36 pm
I may do, but it only applies to BBC BASIC running on a RISC OS machine with the Numbers module loaded.
To clarify for anyone unfamiliar with RISC OS, modules are operating system extensions. They can be written in assembler, C or, just about, in compiled BASIC. They provide software interrupts (SWIs) which can be called from any language. In this case the Numbers module was written by a third party quite a while ago and provides a fairly comprehensive suite of big number operations.
You mistake my meaning. An assembler version would call the SWIs in the same way, but the overheads would be much reduced.

A version with inline assembler instead of SWIs also sounds nice. Do you know whether the assembler implements an O(n^2) algorithm for the multiplication, or one with greater asymptotic efficiency such as Karatsuba?
I do not know how multiplication is done. I have the source, but have not examined it. I do not think it is Karatsuba, because it is a development of algorithms originally written in the 1980s.
There have been faster methods over the past 10 years, one by De, Kurur, Saha and Saptharishi, but it doesn't get faster than the Schönhage–Strassen FFT method (fastest one currently in GMP) until your numbers have 10 ** (10 ** 4797) bits!

ejolson wrote: ↑Fri Jan 04, 2019 6:44 pm
Although the Karatsuba algorithm was published in 1962, it is possible the size of the big numbers expected for programs written in BASIC on RISC OS in the 80s was sufficiently small that the asymptotic superiority of Karatsuba multiplication wasn't a clear win over the simplicity of the O(n^2) algorithm. At the same time, computing the 4784969th Fibonacci number on a Pi in 5 minutes using an O(n^2) algorithm--even coded in assembly--sounds almost too good to be true. It would be interesting to know how it is done.
Interesting. I looked up De, Kurur, Saha and Saptharishi big-number multiplication and found

Paeryn wrote: ↑Sat Jan 05, 2019 4:19 am
There have been faster methods over the past 10 years, one by De, Kurur, Saha and Saptharishi, but it doesn't get faster than the Schönhage–Strassen FFT method (fastest one currently in GMP) until your numbers have 10 ** (10 ** 4797) bits!
I remember when 64K was the size of the universe. I guess that goes to show how important efficient expressivity is when coding asymptotically fast big-number multiplication routines. Maybe if Christoph had written the algorithm in line-numbered BASIC the crossover point would have been sooner, though it's possible he was avoiding BASIC. I hear the optimiser results in too much volatility. So how big would n have to be for the nth Fibonacci number to fill the whole universe?

Christoph Lüders wrote:
In this diploma thesis, results of an implementation of DKSS multiplication are presented: run-time is about 30 times larger than SSA, while memory requirements are about 3.75 times higher than SSA. A possible crossover point is estimated to be out of reach even if we utilized the whole universe for computer memory.
Code: Select all
// Define the fibo benchmark.
static void BM_fibo(benchmark::State& state) {
    int n = state.range(0);
    for (auto _ : state) {
        if (!fiboInitialized) {
            fiboInit();
            fiboInitialized = true;
        }
        benchmark::DoNotOptimize(fibo(n));
    }
}
BENCHMARK(BM_fibo)
    ->RangeMultiplier(2)
    ->Range(1<<4, 1<<25)
    ->Complexity()
    ->Unit(benchmark::kMillisecond);
Code: Select all
$ ./bench
2019-01-05 20:40:56
Running ./bench
Run on (8 X 3401 MHz CPU s)
Load Average: 0.52, 0.58, 0.59
-------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------
BM_fibo/16 0.001 ms 0.001 ms 640000
BM_fibo/32 0.002 ms 0.002 ms 320000
BM_fibo/64 0.004 ms 0.004 ms 154483
BM_fibo/128 0.009 ms 0.009 ms 74667
BM_fibo/256 0.017 ms 0.017 ms 40727
BM_fibo/512 0.035 ms 0.035 ms 20364
BM_fibo/1024 0.070 ms 0.070 ms 8960
BM_fibo/2048 0.140 ms 0.138 ms 4978
BM_fibo/4096 0.284 ms 0.289 ms 2489
BM_fibo/8192 0.578 ms 0.572 ms 1120
BM_fibo/16384 1.18 ms 1.17 ms 560
BM_fibo/32768 2.49 ms 2.46 ms 280
BM_fibo/65536 5.20 ms 5.16 ms 112
BM_fibo/131072 11.4 ms 11.5 ms 64
BM_fibo/262144 25.5 ms 25.7 ms 28
BM_fibo/524288 59.6 ms 59.7 ms 11
BM_fibo/1048576 144 ms 144 ms 5
BM_fibo/2097152 364 ms 367 ms 2
BM_fibo/4194304 950 ms 938 ms 1
BM_fibo/8388608 2551 ms 2562 ms 1
BM_fibo/16777216 7095 ms 7094 ms 1
BM_fibo/33554432 20137 ms 20141 ms 1
Depending on the code, n=2^k or n=2^k+1 are possible special cases for the doubling formulas. It would be interesting to see how straight the resulting lines are when other values of n are selected, perhaps randomly.

Heater wrote: ↑Sat Jan 05, 2019 7:29 pm
Perhaps this plot shows what is going on more clearly at the high end:
Note 1: I have disabled the memoization of my fibo in all these benchmarks, for an apples-to-apples comparison.
Note 2: These are using my same bint big integer class to get the arithmetic done.
fibo_comparison_3.png
Code: Select all
..
"BM_fibo/262144",26,25.634,25.2404,ms,,,,,
"BM_fibo/524288",11,59.3675,59.6591,ms,,,,,
"BM_fibo/1048576",5,143.325,143.75,ms,,,,,
"BM_fibo/2097152",2,364.529,359.375,ms,,,,,
"BM_fibo/4194304",1,945.639,953.125,ms,,,,,
"BM_fibo/8388608",1,2540.48,2546.88,ms,,,,,
"BM_fibo/16777216",1,7060.5,7046.88,ms,,,,,
"BM_fibo/33554432",1,20041.1,20031.2,ms,,,,,
...
"BM_fiboEjOlson/262144",64,11.9017,12.207,ms,,,,,
"BM_fiboEjOlson/524288",19,36.0646,35.3618,ms,,,,,
"BM_fiboEjOlson/1048576",6,108.45,109.375,ms,,,,,
"BM_fiboEjOlson/2097152",2,326.485,328.125,ms,,,,,
"BM_fiboEjOlson/4194304",1,990.295,1000,ms,,,,,
"BM_fiboEjOlson/8388608",1,2983.46,2953.12,ms,,,,,
"BM_fiboEjOlson/16777216",1,8897.75,8906.25,ms,,,,,
"BM_fiboEjOlson/33554432",1,26804.4,26796.9,ms,,,,,
I'm not sure of the exact codes you are comparing, but my guess is

Heater wrote: ↑Sun Jan 06, 2019 8:43 am
Hmm... If I understand you correctly....
We can look at the raw data output of the benchmark, at the high end:

Code: Select all
..
"BM_fibo/262144",26,25.634,25.2404,ms,,,,,
"BM_fibo/524288",11,59.3675,59.6591,ms,,,,,
"BM_fibo/1048576",5,143.325,143.75,ms,,,,,
"BM_fibo/2097152",2,364.529,359.375,ms,,,,,
"BM_fibo/4194304",1,945.639,953.125,ms,,,,,
"BM_fibo/8388608",1,2540.48,2546.88,ms,,,,,
"BM_fibo/16777216",1,7060.5,7046.88,ms,,,,,
"BM_fibo/33554432",1,20041.1,20031.2,ms,,,,,
...
"BM_fiboEjOlson/262144",64,11.9017,12.207,ms,,,,,
"BM_fiboEjOlson/524288",19,36.0646,35.3618,ms,,,,,
"BM_fiboEjOlson/1048576",6,108.45,109.375,ms,,,,,
"BM_fiboEjOlson/2097152",2,326.485,328.125,ms,,,,,
"BM_fiboEjOlson/4194304",1,990.295,1000,ms,,,,,
"BM_fiboEjOlson/8388608",1,2983.46,2953.12,ms,,,,,
"BM_fiboEjOlson/16777216",1,8897.75,8906.25,ms,,,,,
"BM_fiboEjOlson/33554432",1,26804.4,26796.9,ms,,,,,

We can calculate the slopes of the lines on a log/log graph as:
Heater curve:
> (Math.log(20031.2) - Math.log(25.2404)) / (Math.log(33554432) - Math.log(262144))
1.3760426230727143
ejolson curve:
> (Math.log(26796.9) - Math.log(12.207)) / (Math.log(33554432) - Math.log(262144))
1.58573453007763
Meanwhile log(3)/log(2) is:
> Math.log10(3) / Math.log10(2)
1.584962500721156
Hmm... bang on for your curve. A bit high for mine.
What does it all mean?
I'll try and find time to test with random numbers later.
Code: Select all
// This Fibonacci version derived from ejolson's doubling formula C example.
bint a;
bint b;

void fiboEjOlson(const int n) {
    if (n == 0) {
        a = zero;
        b = one;
        return;
    }
    fiboEjOlson(n / 2);
    bint ta = a;
    if (isEven(n)) {
        a = ta * ((b + b) - ta);
        b = (ta * ta) + (b * b);
    } else {
        a = (a * a) + (b * b);
        b = b * ((ta + ta) + b);
    }
}

void fiboInit() {
    // Initialize fibo's memo.
    memo.clear();
    memo[0] = zero;
    memo[1] = one;
    memo[2] = one;
}
// This Fibonacci version derived from Paeryn's Haskell example.
const bint fibo(int n) {
    if (memo.find(n) != memo.end()) {
        return memo[n];
    }
    int k = n / 2;
    const bint a = fibo(k);
    const bint b = fibo(k - 1);
    if (isEven(n)) {
        return a * (two * b + a);
    }
    bint twoa = two * a;
    if ((n % 4) == 1) {
        return (twoa + b) * (twoa - b) + two;
    }
    return (twoa + b) * (twoa - b) - two;
}
Yes, that is the version of the (a,b)-linear recurrence which is 2 to 4 times slower than it needs to be. To provide a frame of reference, the original C program has not been improved since it was first posted.
Hmm... so I thought I'd transcribe that FreeBASIC fibo to C++ and see how it flies. So far it's not working.

A more optimal way of writing the same doubling formula can be found in the FreeBASIC code...
Code: Select all
sub fibowork(n as integer,a as bignum,b as bignum)
    if n=0 then
        a.n=0:b.n=1:b.d(1)=1
        return
    end if
Yes, a is zero. The code trims leading zeros from numbers, so a number with zero digits is also zero.

Heater wrote: ↑Sun Jan 06, 2019 8:45 pm
Hmm... so I thought I'd transcribe that FreeBASIC fibo to C++ and see how it flies. So far it's not working.

A more optimal way of writing the same doubling formula can be found in the FreeBASIC code...
But isn't there something wrong with this in fibowork:

Code: Select all
sub fibowork(n as integer,a as bignum,b as bignum)
    if n=0 then
        a.n=0:b.n=1:b.d(1)=1
        return
    end if

Aren't we supposed to be setting a to zero there:
a.n=1:a.d(1)=0
Code: Select all
$ ./bench
2019-01-07 04:21:42
Running ./bench
Run on (8 X 3401 MHz CPU s)
Load Average: 0.52, 0.58, 0.59
---------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------
BM_fibo/4784969 1162 ms 1156 ms 1
BM_fiboEjOlson/4784969 1196 ms 1219 ms 1
BM_fiboEjOlsonNew/4784969 528 ms 516 ms 1
That's why I had expected the line-numbered BASIC program to do better than it did. Unfortunately, it seems that language is just not expressive enough to efficiently code Karatsuba multiplication.
Code: Select all
$ make
/usr/local/gcc-6.4/bin/gcc -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -Wall -o serial parallel.c
/usr/local/gcc-6.4/bin/gcc -fopenmp -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -Wall -o parallel parallel.c
/usr/local/gcc-6.4/bin/gcc -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -Wall -o fibonacci fibonacci.c
$ time ./fibonacci >fibonacci.txt
real 0m52.750s
user 0m52.728s
sys 0m0.010s
$ time ./serial >serial.txt
real 0m20.777s
user 0m20.756s
sys 0m0.020s
$ time ./parallel >parallel.txt
real 0m5.768s
user 0m21.004s
sys 0m0.010s
$ md5sum fibonacci.txt
c926caa99ed1e30fe0d3c966da901738 fibonacci.txt
$ md5sum serial.txt
c926caa99ed1e30fe0d3c966da901738 serial.txt
$ md5sum parallel.txt
c926caa99ed1e30fe0d3c966da901738 parallel.txt
A programming language which allows for the efficient expression of complicated algorithms is liberating in the sense that the person who learns such a language is no longer constrained by the functionality of built-in features and standard libraries, but has the opportunity to develop and implement new algorithms that solve new problems and old problems in new ways. While it might appear that the goal is to compute million-digit Fibonacci numbers, the title of the thread is why avoid BASIC on the Raspberry Pi.

Heater wrote: ↑Fri Jan 04, 2019 4:22 pm
How these things are implemented under the hood is of no consequence, as long as they perform well enough. Perhaps the language run time makes use of libs written in other languages, perhaps it does not. We are talking about the language here, not any particular implementation.
It does look like you turned an O(n^1.58) algorithm into an O(n^1.38) one. However, I'd be very surprised if the red and blue lines ever really crossed.
Great. Where do I pick up my Turing award?

It does look like you turned an O(n^1.58) algorithm into an O(n^1.38) one.
Dang... There goes fame and fortune... asymptotes are a bitch.

However, I'd be very surprised if the red and blue lines ever really crossed.
Code: Select all
/* parallel.c -- Compute the nth Fibonacci Number
   Written January 7, 2019 by Eric Olson

   This program demonstrates the expressiveness of C with OpenMP
   as measured by explicitly coding a parallel version of the
   Karatsuba multiplication algorithm for big-number arithmetic
   and then using the doubling formulas

       F(2k)   = F(k)[2F(k+1)-F(k)]
       F(2k+1) = F(k+1)[2F(k+1)-F(k)]+(-1)^(k+1)
       F(2k+1) = F(k)[2F(k)+F(k+1)]+(-1)^(k)
       F(2k+2) = F(k+1)[2F(k)+F(k+1)]

   to compute the nth Fibonacci number. Note that n is specified
   in the first line of the main routine. */
#include <stdio.h>
#include <math.h>
#include <string.h>
#include <strings.h>
#include <stdlib.h>
#include <alloca.h>
#include <sys/time.h>
#include <sys/resource.h>
#define TUNE1 44
#define TUNE2 1024
#define TUNE3 1048576
#include "parallel.h"
typedef unsigned long long llu;
typedef int limb;
typedef unsigned long long limb2;
typedef struct {
    int n; limb *d;
} bignum;
static int rexp,cmult,digits,nproc;
static limb rbase,rcarry;
static limb2 cadd,cmax;
static void bigprint(bignum x){
    char fmt[10];
    sprintf(fmt,"%%0%dllu",rexp);
    if(!x.n){
        printf("0\n");
        return;
    }
    for(int i=x.n-1;i>=0;i--){
        if(i==x.n-1) printf("%llu",(llu)x.d[i]);
        else printf(fmt,(llu)x.d[i]);
    }
    printf("\n");
}

static bignum bigtrim(bignum x){
    for(int k=x.n-1;k>=0;k--){
        if(x.d[k]!=0){
            x.n=k+1;
            return x;
        }
    }
    x.n=0;
    return x;
}

static bignum bigcarry(bignum x){
    int imax=x.n-1;
    for(int i=0;i<imax;i++){
        if(x.d[i]>=rbase){
            do {
                x.d[i]-=rbase; x.d[i+1]++;
            } while(x.d[i]>=rbase);
        } else if(x.d[i]<0){
            do {
                x.d[i]+=rbase; x.d[i+1]--;
            } while(x.d[i]<0);
        }
    }
    if(x.d[imax]<rbase) return bigtrim(x);
    x.n++;
    x.d[imax]-=rbase; x.d[imax+1]=1;
    while(x.d[imax]>=rbase){
        x.d[imax]-=rbase; x.d[imax+1]++;
    }
    return x;
}
static bignum bigcopy(limb zd[],bignum x){
    bignum z; z.d=zd;
    memcpy(z.d,x.d,sizeof(limb)*x.n);
    z.n=x.n;
    return z;
}

static bignum bigadd3(limb zd[],bignum x,bignum y){
    bignum z; z.d=zd;
    if(x.n<y.n){
        for(int i=0;i<x.n;i++) z.d[i]=x.d[i]+y.d[i];
        for(int i=x.n;i<y.n;i++) z.d[i]=y.d[i];
        z.n=y.n;
    } else {
        for(int i=0;i<y.n;i++) z.d[i]=x.d[i]+y.d[i];
        for(int i=y.n;i<x.n;i++) z.d[i]=x.d[i];
        z.n=x.n;
    }
    return z;
}

#ifdef NOTUSED
static bignum bigsub3(limb zd[],bignum x,bignum y){
    bignum z; z.d=zd;
    for(int i=0;i<y.n;i++) z.d[i]=x.d[i]-y.d[i];
    for(int i=y.n;i<x.n;i++) z.d[i]=x.d[i];
    z.n=x.n;
    return z;
}
#endif

static bignum bigsub2(bignum x,bignum y){
    for(int i=0;i<y.n;i++) x.d[i]-=y.d[i];
    return x;
}

static bignum bigadd2(bignum x,bignum y){
    if(x.n<y.n){
        if(x.n>0) {
            x.d[0]--;
            for(int i=0;i<x.n;i++) x.d[i]+=y.d[i]-rcarry;
            x.d[x.n]=y.d[x.n]+1;
            for(int i=x.n+1;i<y.n;i++) x.d[i]=y.d[i];
            x.n=y.n;
        } else {
            x=bigcopy(x.d,y);
        }
    } else {
        x.d[0]--;
        for(int i=0;i<y.n;i++) x.d[i]+=y.d[i]-rcarry;
        if(y.n==x.n) {
            x.n++;
            x.d[y.n]=1;
        } else {
            x.d[y.n]++;
        }
    }
    return x;
}

static bignum bigdec1(bignum x){
    int imax=x.n-1;
    x.d[0]--;
    for(int i=0;i<imax;i++){
        if(x.d[i]>=0) return x;
        x.d[i]+=rbase; x.d[i+1]--;
    }
    return x;
}

static bignum biginc1(bignum x){
    int imax=x.n-1;
    if(imax>=0) {
        x.d[0]++;
        for(int i=0;i<imax;i++){
            if(x.d[i]<rbase) return x;
            x.d[i]-=rbase; x.d[i+1]++;
        }
        if(x.d[imax]<rbase) return x;
        x.d[imax]-=rbase;
    }
    x.n++;
    x.d[imax+1]=1;
    return x;
}
static bignum bigmul3w(limb zd[],bignum x,bignum y){
    bignum z; z.d=zd;
    int imax=x.n+y.n;
    limb2 s[imax];
    bzero(s,sizeof(limb2)*imax);
    imax--;
    for(int i=0;i<x.n;i++){
        for(int j=0;j<y.n;j++){
            s[i+j]+=(limb2)x.d[i]*y.d[j];
        }
        if((i+1)%cmult==0){
            for(int k=0;k<imax;k++){
                if(s[k]<cmax) continue;
                s[k]-=cmax; s[k+1]+=cadd;
            }
        }
    }
    for(int k=0;k<imax;k++){
        s[k+1]+=s[k]/rbase; z.d[k]=s[k]%rbase;
    }
    z.d[imax]=s[imax];
    z.n=imax+1;
    return bigtrim(z);
}

static bignum biglow(bignum x,int n){
    if(x.n>n) x.n=n;
    return x;
}

static bignum bighigh(bignum x,int n){
    if(x.n<n) x.n=0;
    else {
        x.n-=n;
        x.d=&x.d[n];
    }
    return x;
}

static bignum bigmul3s(limb zd[],bignum x,bignum y){
    if(x.n<TUNE1||y.n<TUNE1) return bigmul3w(zd,x,y);
    int M=x.n<y.n?y.n:x.n; int n=M/2;
    bignum z;
    bignum x0=biglow(x,n),x1=bighigh(x,n);
    bignum y0=biglow(y,n),y1=bighigh(y,n);
    limb z0d[M+1],z1d[M+3],z1yd[n+1],z2d[M+3];
    bignum z0,z2,z1x,z1y,z1;
    z0=bigmul3s(z0d,x0,y0);
    z2=bigmul3s(z2d,x1,y1);
    z1x=bigcarry(bigadd3(z1d,x0,x1));
    z1y=bigcarry(bigadd3(z1yd,y0,y1));
    z1=bigmul3s(z1x.d,z1x,z1y);
    z1=bigcarry(bigsub2(bigsub2(z1,z0),z2));
    z=bigcopy(zd,z0);
    z=bigadd3(&z.d[n],(bignum){z.n-n,&z.d[n]},z1);
    z=bigadd2((bignum){z.n-n,&z.d[n]},z2);
    z.n=z.n+2*n; z.d=zd;
    return bigcarry(z);
}

static bignum bigmul3(limb zd[],bignum x,bignum y){
    if(x.n<TUNE1||y.n<TUNE1) return bigmul3w(zd,x,y);
    int M=x.n<y.n?y.n:x.n; int n=M/2;
    if(M<TUNE2) return bigmul3s(zd,x,y);
    bignum z;
    bignum x0=biglow(x,n),x1=bighigh(x,n);
    bignum y0=biglow(y,n),y1=bighigh(y,n);
    limb z0d[M+1],z1d[M+3],z1yd[n+1],z2d[M+3];
    bignum z0,z2,z1x,z1y,z1;
    spawn(z0=) bigmul3(z0d,x0,y0);
    spawn(z2=) bigmul3(z2d,x1,y1);
    z1x=bigcarry(bigadd3(z1d,x0,x1));
    z1y=bigcarry(bigadd3(z1yd,y0,y1));
    z1=bigmul3(z1x.d,z1x,z1y);
    sync;
    z1=bigcarry(bigsub2(bigsub2(z1,z0),z2));
    z=bigcopy(zd,z0);
    z=bigadd3(&z.d[n],(bignum){z.n-n,&z.d[n]},z1);
    z=bigadd2((bignum){z.n-n,&z.d[n]},z2);
    z.n=z.n+2*n; z.d=zd;
    return bigcarry(z);
}
static bignum t1,a,b;

static void fiboworks(int n){
    if(n==0){
        a.n=0; b.n=1; b.d[0]=1;
        return;
    }
    fiboworks(n/2);
    if(n%2==0){
        // [a,b]=[a*(2*b-a),b*(2*b-a)-(-1)^k]
        t1=bigcarry(bigsub2(bigadd3(t1.d,b,b),a));
        a=bigmul3(a.d,a,t1);
        b=bigmul3(b.d,b,t1);
        if(n%4==0) b=bigdec1(b); else b=biginc1(b);
    } else {
        // [a,b]=[a*(2*a+b)+(-1)^k,b*(2*a+b)]
        t1=bigcarry(bigadd2(bigadd3(t1.d,a,a),b));
        b=bigmul3(b.d,b,t1);
        a=bigmul3(a.d,a,t1);
        if(n%4==1) a=biginc1(a); else a=bigdec1(a);
    }
    return;
}

static void fibowork(int n){
    if(n==0){
        a.n=0; b.n=1; b.d[0]=1;
        return;
    }
    if(n<TUNE3) fiboworks(n/2);
    else fibowork(n/2);
    if(n%2==0){
        // [a,b]=[a*(2*b-a),b*(2*b-a)-(-1)^k]
        t1=bigcarry(bigsub2(bigadd3(t1.d,b,b),a));
        spawn(a=) bigmul3(a.d,a,t1);
        b=bigmul3(b.d,b,t1);
        if(n%4==0) b=bigdec1(b); else b=biginc1(b);
        sync;
    } else {
        // [a,b]=[a*(2*a+b)+(-1)^k,b*(2*a+b)]
        t1=bigcarry(bigadd2(bigadd3(t1.d,a,a),b));
        spawn(b=) bigmul3(b.d,b,t1);
        a=bigmul3(a.d,a,t1);
        if(n%4==1) a=biginc1(a); else a=bigdec1(a);
        sync;
    }
    return;
}
static bignum fibo(int n, limb xd[]){
    b.d=xd;
    if(n<2){
        b.n=1; b.d[0]=n;
        return b;
    }
    limb ad[digits]; a.d=ad;
    fibowork((n-1)/2);
    if(n%2==0){
        // b=b*(2*a+b)
        t1=bigcarry(bigadd2(bigadd3(t1.d,a,a),b));
        b=bigmul3(b.d,b,t1);
    } else {
        // b=b*(2*b-a)-(-1)^k
        t1=bigcarry(bigsub2(bigadd3(t1.d,b,b),a));
        b=bigmul3(b.d,b,t1);
        if(n%4==1) b=bigdec1(b); else b=biginc1(b);
    }
    return b;
}

static int work(int n){
    digits=n*log10(0.5*(1+sqrt(5.0)))/rexp+4;
    limb t1d[digits]; t1.d=t1d;
    limb xd[digits];
    workers(nproc){
        bignum x=fibo(n,xd);
        bigprint(x);
    }
    return 0;
}
int main(){
    int n=4784969;
    setrlimit(RLIMIT_STACK,
        &(const struct rlimit)
        {RLIM_INFINITY,RLIM_INFINITY});
    if(sizeof(limb)*2!=sizeof(limb2)){
        fprintf(stderr,
            "sizeof(limb)=%llu must be half of sizeof(limb2)=%llu!\n",
            (llu)sizeof(limb),(llu)sizeof(limb2));
        return 1;
    }
    limb limbmax=((limb2)1<<(8*sizeof(limb)-1))-1;
    limb2 limb2max=((limb2)1<<(8*sizeof(limb)))-1;
    limb2max=(limb2max<<(8*sizeof(limb)))+limb2max;
    rexp=log(limbmax)/log(10);
    rbase=pow(10,rexp)+0.5; rcarry=rbase-1;
    cmult=(limb2max-2*rbase)/rbase/rbase/2;
    cadd=(limb2)rbase*cmult; cmax=cadd*rbase;
    nproc=get_ncpu();
    return work(n);
}
Code: Select all
/* parallel.h -- Macros for parallel processing
Written Jan 7, 2019 by Eric Olson */
#ifndef _PARALLEL_H
#define _PARALLEL_H
#ifdef _OPENMP
# include <omp.h>
# define spawn(X) _Pragma("omp task default(shared)") X
# define sync _Pragma("omp taskwait")
# define get_ncpu(X) omp_get_num_procs(X)
# define workers(X) \
if(X) omp_set_num_threads(X); \
else omp_set_num_threads(1); \
_Pragma("omp parallel") \
_Pragma("omp single")
#else
# ifdef _CILKPLUS
# include <cilk/cilk.h>
# include <cilk/cilk_api.h>
# define spawn(X) X cilk_spawn
# define sync cilk_sync
# define get_ncpu(X) __cilkrts_get_nworkers(X)
# define workers(X) \
__cilkrts_end_cilk(); \
if(X){ \
char buf[32]; \
sprintf(buf,"%d",X); \
int r=__cilkrts_set_param("nworkers",buf); \
if(r!=__CILKRTS_SET_PARAM_SUCCESS){ \
fprintf(stderr,"Error: unable to set nworkers to %d!\n",X); \
exit(1); \
} \
__cilkrts_init(); \
}
# else
# define spawn(X) X
# define sync
# define get_ncpu(X) 1
# define workers(X)
# endif
#endif
#endif
Code: Select all
$ gcc -O3 fibo.c -o fibo_plain
$ gcc -O3 fibo_gmp.c -o fibo_gmp -lgmp
$ gcc -O3 -fopenmp fibo_par.c -o fibo_par
$
$ time ./fibo_plain >/dev/null
real 0m1.562s
user 0m1.560s
sys 0m0.004s
$ time ./fibo_gmp >/dev/null
real 0m0.195s
user 0m0.192s
sys 0m0.000s
$ time ./fibo_par >/dev/null
real 0m0.293s
user 0m1.084s
sys 0m0.000s
$
$ pc 1.562 / 0.293
5.33105802047782
Thanks for testing the new code. I made one minor cosmetic change to the bigcopy routine and just updated the previous post. The modification doesn't change the speed but only makes the calling conventions more regular. I had been hoping the new code would catch the GMP single-thread performance, but that seems not to be the case. Maybe one needs a system with more and slower cores for that to happen. Have you tried changing the TUNE1, TUNE2 and TUNE3 parameters for better performance on your system?

jahboater wrote: ↑Tue Jan 08, 2019 8:20 am
Impressive!
I am getting better than a 4x speed-up with a quad-core PC (hyper-threading turned off).
Compared to the fibo.c version 4 ????????

Code: Select all
$ gcc -O3 fibo.c -o fibo_plain
$ gcc -O3 fibo_gmp.c -o fibo_gmp -lgmp
$ gcc -O3 -fopenmp fibo_par.c -o fibo_par
$
$ time ./fibo_plain >/dev/null
real 0m1.562s
user 0m1.560s
sys 0m0.004s
$ time ./fibo_gmp >/dev/null
real 0m0.195s
user 0m0.192s
sys 0m0.000s
$ time ./fibo_par >/dev/null
real 0m0.293s
user 0m1.084s
sys 0m0.000s
$
$ pc 1.562 / 0.293
5.33105802047782
Code: Select all
$ gcc -Wall -Wextra -O3 -o parallel parallel.c -lm
$ time ./parallel | tail -c 32
4856539211500699706378405156269
real 0m0.919s
user 0m0.844s
sys 0m0.063s
$
$ gcc -Wall -Wextra -O3 -fopenmp -o parallel parallel.c -lm
$ time ./parallel | tail -c 32
4856539211500699706378405156269
real 0m0.257s
user 0m1.281s
sys 0m0.094s
Code: Select all
$ time ../c++/fibo_karatsuba | tail -c 32
4856539211500699706378405156269
real 0m0.640s
user 0m0.609s
sys 0m0.000s
Code: Select all
limb t1d[digits]; t1.d=t1d;