jahboater
Posts: 3495
Joined: Wed Feb 04, 2015 6:38 pm

Re: Why Avoid BASIC on RPi?

Sun Jan 13, 2019 12:47 pm

Heater wrote:
Sun Jan 13, 2019 11:32 am
jahboater,
On my PC it takes several times as long as that just to start node up.
Something is wrong then:
You're right, sorry. I last measured it some time ago.
I tried again, and the start time for hello, world is nearly identical.
This script "qc" is slightly faster than node with optimization off, and slightly slower than node with -O3:-

Code: Select all

#!/bin/dash
# compile the C source given on the command line, then run it if the compile succeeds
if gcc $*
then
  ./a.out
fi
The C script has to load the shell of course.
In the human time frame, C, Javascript, and Python3 all give "instant" results.

Code: Select all

$ time qc hello.c
hello world

real	0m0.024s
user	0m0.008s
sys	0m0.004s
$ time node hello.js
Hello world

real	0m0.040s
user	0m0.032s
sys	0m0.008s
$ time qc -O3 hello.c 
hello world

real	0m0.055s
user	0m0.040s
sys	0m0.004s
$ time python3.7 hello.py
hello, world

real	0m0.022s
user	0m0.020s
sys	0m0.000s

Heater
Posts: 10611
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Sun Jan 13, 2019 1:19 pm

Why do they need to use setjmp/longjmp to implement GOSUB?

jahboater
Posts: 3495
Joined: Wed Feb 04, 2015 6:38 pm

Re: Why Avoid BASIC on RPi?

Sun Jan 13, 2019 1:28 pm

No idea, unless they want to allow co-routines or mutual recursion.
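
A straightforward interpreter only needs a return-address stack for GOSUB. Something like this toy, which is purely hypothetical and has nothing to do with how FreeBASIC actually does it:

Code: Select all

// Toy GOSUB/RETURN using an explicit return stack -- no setjmp/longjmp needed.
// Hypothetical sketch only; not FreeBASIC's real implementation.
// Program: 10 GOSUB 30 / 20 END / 30 PRINT "hello" / 40 RETURN
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    std::size_t pc = 0;                   // index of the current line
    std::vector<std::size_t> returns;     // GOSUB return addresses

    while (pc < 4) {
        switch (pc) {
        case 0:                           // 10 GOSUB 30
            returns.push_back(pc + 1);    // remember the line after the GOSUB
            pc = 2;
            break;
        case 1:                           // 20 END
            pc = 4;
            break;
        case 2:                           // 30 PRINT "hello"
            std::puts("hello");
            pc = 3;
            break;
        case 3:                           // 40 RETURN
            pc = returns.back();          // jump back to the remembered line
            returns.pop_back();
            break;
        }
    }
}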

Heater
Posts: 10611
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Sun Jan 13, 2019 5:39 pm

This is very weird...

I was having another play with profiling fibo() with gprof on the Pi and got some very strange results when using openmp.

The profile results I get when not using OpenMP look like this:

Code: Select all

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 74.59     16.90    16.90   354316     0.00     0.00  bint::o2nMul(bint const&) const
 10.55     19.29     2.39   354297     0.00     0.00  bint::operator-(bint const&) const
  8.94     21.32     2.03   177135     0.00     0.00  bint::shiftAndAdd(bint const&, bint const&, bint const&, int, int) const
  3.93     22.21     0.89   354290     0.00     0.00  bint::operator+(bint const&) const
  0.40     22.30     0.09  2361956     0.00     0.00  bint::deallocate(unsigned long*)
  0.35     22.38     0.08  3011455     0.00     0.00  bint::~bint()
  0.35     22.46     0.08       67     0.00     0.34  bint::operator*(bint const&) const
  0.24     22.51     0.06  1240059     0.00     0.00  bint::bint(unsigned long)
...
...
Which looks reasonable. That multiply is taking most of the time.

When I enable OpenMP it changes dramatically:

Code: Select all

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 74.59     16.90    16.90       17   994.29   994.29  bint::shiftAndAdd(bint const&, bint const&, bint const&, int, int) const
 10.33     19.24     2.34                             bint::operator=(bint const&)
  5.12     20.40     1.16  2125745     0.00     0.00  bint::operator*(bint const&) const
  3.35     21.16     0.76   354295     0.00     0.01  bint::bint(char const*)
  1.54     21.51     0.35                             bint::allocate(unsigned long)
  1.54     21.86     0.35                             bint::low(int) const
  0.95     22.08     0.22                             bint::parseDigits(char const*, int)
  0.40     22.17     0.09                             bint::bint(bint const&)
...
...
That says all the time is going into the shift and add routine and the multiply is taking hardly any time!

Thinking that was suspect, I asked Google about it and came up with the "fact" that gprof does not work well with multi-threaded, pthread code on Linux. Further, there is a workaround for that problem:
http://sam.zoy.org/writings/programming/gprof.html
https://coolaj86.com/articles/super-simple-gprof.html
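
The trick in those articles boils down to an LD_PRELOAD library that wraps pthread_create so every new thread inherits the ITIMER_PROF timer that gprof's sampling depends on. From memory it is roughly this (a sketch, not the exact code at either link):

Code: Select all

// gprof-helper sketch: re-arm gprof's profiling timer in every new thread.
// Build: g++ -shared -fPIC -o gprof-helper.so gprof-helper.cpp -ldl -lpthread
// Run:   LD_PRELOAD=./gprof-helper.so ./your-program   (still compiled with -pg as usual)
#include <dlfcn.h>
#include <pthread.h>
#include <sys/time.h>

namespace {
struct StartArgs {
    void *(*routine)(void *);
    void *arg;
    itimerval timer;                     // ITIMER_PROF setting captured in the parent thread
};

void *wrapper(void *p) {
    StartArgs a = *static_cast<StartArgs *>(p);
    delete static_cast<StartArgs *>(p);
    setitimer(ITIMER_PROF, &a.timer, nullptr);   // gprof only samples where this timer is running
    return a.routine(a.arg);
}
}

// __THROWNL keeps the exception specification identical to glibc's own declaration.
extern "C" int pthread_create(pthread_t *t, const pthread_attr_t *attr,
                              void *(*routine)(void *), void *arg) __THROWNL
{
    using real_t = int (*)(pthread_t *, const pthread_attr_t *, void *(*)(void *), void *);
    static real_t real = reinterpret_cast<real_t>(dlsym(RTLD_NEXT, "pthread_create"));

    auto *a = new StartArgs{routine, arg, {}};
    getitimer(ITIMER_PROF, &a->timer);           // copy the parent's profiling timer
    return real(t, attr, wrapper, a);
}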

Well, that recipe works fine. Except the resulting profile output is about the same. I can't believe it.

Now, the argument is that when using gprof with pthreads it only reports on the activity of the main thread so the results are all wrong. But it seems to me that in this application all threads get to run the same code, in which case having a measure of only what the main thread is doing would be fine.

Can that result really be true?

ejolson
Posts: 2344
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Sun Jan 13, 2019 5:48 pm

Heater wrote:
Sun Jan 13, 2019 10:57 am
Beginners don't need the tedium of C++ compilation or the syntactic complexity of C/C++ etc. They need something simple and basic.
Did you say BASIC?

There are many who know only one language, especially in the States and China. Even in these countries there are plenty of polyglots. Some develop a desire to do something new and useful and thus apply their language skills to the task of writing poetry.

Imagine how sad it would be if, after trying and failing, the reason for failure was the use of the wrong language. What if the only modern languages suitable for writing poetry were Italian, Arabic and Hindi? Then any attempt to write poetry by half the people in the world would be futile due to using the wrong language. That would be discouraging.

It is arguable, due to the universal nature of the human condition, that every human language is suitable for writing poetry. One might think by analogy that every Turing-complete programming language is suitable for writing complex algorithms. If this were true, then there would be no need to even consider the question whether to avoid BASIC.

A child learns to speak, not because it is easy, but because being able to convey their feelings, desires and needs is a great reward. In fact, learning to speak is now considered one of the most effortful accomplishments of nearly every person's life. As with human languages, many people will only master the first programming language they are exposed to. However, unlike human languages, not all programming languages are equally expressive. Not all have the same ability to convey the poetry of complicated algorithms to the computing hardware. Some are not readable enough, some are not productive enough and many are not efficient enough.

While one solution to a lack of expressivity is scripting lyrics for rap music rather than writing poetry, another solution might be to avoid certain languages at the outset. Just as being able to write poetry liberates the human spirit through the expression of a person's innermost thoughts and perceptions, so does being able to efficiently convey complicated algorithms to a computer. Such digital liberation frees the human imagination from built-in features and standard subroutine libraries. It motivates the creation of new algorithms by making their practical implementation and use possible. A new age of personal computing should allow the average person to avoid the servitude of digital feudalism and might also raise the GDP.

The question remains, which languages are to be avoided at the outset and which languages lead to such great rewards that the difficult task of learning them is worthwhile?
Last edited by ejolson on Tue Jan 15, 2019 5:20 pm, edited 3 times in total.

ejolson
Posts: 2344
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Sun Jan 13, 2019 6:08 pm

Heater wrote:
Sun Jan 13, 2019 5:39 pm
This is very weird...

I was having another play with profiling fibo() with gprof on the Pi and got some very strange results when using openmp.

The profile results I get when not using OpenMP look like this:

Code: Select all

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 74.59     16.90    16.90   354316     0.00     0.00  bint::o2nMul(bint const&) const
 10.55     19.29     2.39   354297     0.00     0.00  bint::operator-(bint const&) const
  8.94     21.32     2.03   177135     0.00     0.00  bint::shiftAndAdd(bint const&, bint const&, bint const&, int, int) const
  3.93     22.21     0.89   354290     0.00     0.00  bint::operator+(bint const&) const
  0.40     22.30     0.09  2361956     0.00     0.00  bint::deallocate(unsigned long*)
  0.35     22.38     0.08  3011455     0.00     0.00  bint::~bint()
  0.35     22.46     0.08       67     0.00     0.34  bint::operator*(bint const&) const
  0.24     22.51     0.06  1240059     0.00     0.00  bint::bint(unsigned long)
...
...
Which looks reasonable. That multiply is taking most of the time.

When I enable OpenMP it changes dramatically:

Code: Select all

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 74.59     16.90    16.90       17   994.29   994.29  bint::shiftAndAdd(bint const&, bint const&, bint const&, int, int) const
 10.33     19.24     2.34                             bint::operator=(bint const&)
  5.12     20.40     1.16  2125745     0.00     0.00  bint::operator*(bint const&) const
  3.35     21.16     0.76   354295     0.00     0.01  bint::bint(char const*)
  1.54     21.51     0.35                             bint::allocate(unsigned long)
  1.54     21.86     0.35                             bint::low(int) const
  0.95     22.08     0.22                             bint::parseDigits(char const*, int)
  0.40     22.17     0.09                             bint::bint(bint const&)
...
...
That says all the time is going into the shift and add routine and the multiply is taking hardly any time!

Thinking that was suspect, I asked Google about it and came up with the "fact" that gprof does not work well with multi-threaded, pthread code on Linux. Further, there is a workaround for that problem:
http://sam.zoy.org/writings/programming/gprof.html
https://coolaj86.com/articles/super-simple-gprof.html

Well, that recipe works fine. Except the resulting profile output is about the same. I can't believe it.

Now, the argument is that when using gprof with pthreads it only reports on the activity of the main thread so the results are all wrong. But it seems to me that in this application all threads get to run the same code, in which case having a measure of only what the main thread is doing would be fine.

Can that result really be true?
Since the only routine parallelized is the multiply routine one would expect it to take less time.

From a work-span point of view, the recursive multiply immediately runs on two cores and, after two additions, on a third. Upon sync there are a few subtractions and additions to put the results together. Thus, the O(n^1.58) part of the code distributes to the available processors quickly with good parallel scaling, and all that is left are the O(n) terms. Given the relatively small size of the problem, the O(n) part becomes significant somewhere between four and eight cores.
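
In rough work-span terms, ignoring constants, the work and span of the parallel Karatsuba satisfy

T_1(n) = 3\,T_1(n/2) + \Theta(n) = \Theta(n^{\log_2 3}), \qquad T_\infty(n) = T_\infty(n/2) + \Theta(n) = \Theta(n),

and Brent's bound T_P \le T_1/P + T_\infty says the \Theta(n^{1.585}) work spreads across the P cores while the \Theta(n) span from the serial additions, subtractions and carries does not. That is why the linear terms surface once more than a few cores are in play.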

ejolson
Posts: 2344
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Sun Jan 13, 2019 6:19 pm

Heater wrote:
Sun Jan 13, 2019 11:49 am
ejolson,
I created a script to add volatility to the C intermediate files so the gcc optimizer doesn't clobber things and create a segmentation fault in the resulting executable...
Ouch, what?!!

C/C++ optimizers do not "clobber" anything unless what they are compiling makes use of some undefined behavior of the language (barring compiler bugs). Use of undefined behavior means that anything can happen.

What you are saying is that the FreeBASIC code generator is buggy and its generated C code is not guaranteed to run on all platforms with all standards-compliant compilers, even with optimizations off.

Ideally it would be better to have the FreeBASIC devs fix these bugs rather than hack it with workarounds.

Personally I'd rather skip the middleman and write in C directly. I don't need any more bugs than I already have being introduced by a code generator, especially when it serves no useful purpose.
The gcc backend to FreeBASIC is broken. This and the workaround were posted earlier. I've only followed through with timings for the Pi Zero.

Heater
Posts: 10611
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Sun Jan 13, 2019 7:05 pm

ejolson,
Did you say BASIC?
No, I did not. I said "basic", deliberately, because "simple and basic" does not imply BASIC; there are many other options.

I don't know much about human languages other than English, except a smattering of Finnish. Finnish is to English as Forth is to C.

I guess the peoples of the world get on very well with whatever language they have. I can't believe that they don't have the means to express their inner being as well, or as poorly, as we do.
The question remains, which languages are to be avoided at the outset and which languages lead to such great rewards that the difficult task of learning them is worthwhile?
1) As with all languages, it's helpful to use the language that the community you live in uses. That includes the community of people you deal with as well as the community of computers you are using.

2) For a rank beginner a simple language is better, without the need to comprehend all kinds of complex syntax and semantics before one can get even the simplest things working. That's not to say the language cannot have complex syntax and grammar, but it should be something one can grow into. When kids start to learn their human language we are happy if they can string together a few simple words; we don't expect them to employ the vast complexity of the whole language. Similarly, those new to programming seem to get on fine with C++ in the Arduino environment; they don't need to be aware of the huge, fractal complexity of the whole language.

3) Related to 2): although the language should make doing simple things simple so that the beginner can get started, it should not be limited. BASIC is an example of simple but limited, and ultimately a dead end. C++ can be simple (see Arduino) but has vast capabilities that one may not live long enough to explore fully.

I think that 99.9% of people who learn some programming are never going to be dreaming up new algorithms. In the same way, we learn maths in school but don't invent new maths, we learn about literature in school but never write famous novels, etc, etc. However, they have problems to solve and can make use of the tools (algorithms) software engineers can provide. They should not have to (re)create the tools. If those tools are built into a programming language they can understand, all the better. Think of the big integer support and all those modules available for Python or Javascript.

Then there is the other 0.1%; they will find a way.

Heater
Posts: 10611
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Sun Jan 13, 2019 7:38 pm

ejolson,
Since the only routine parallelized is the multiply routine one would expect it to take less time.
I don't follow you. Thinking about what goes on here gives me headache.

Perhaps I would expect it to take less time but gprof is saying that using four cores for the multiplies has reduced their significance from 75% of the time to only 5%.

So far that does not seem credible to me.

I'm only getting a speed up of 2.8 on the Pi when going parallel.

parallel.c only speeds up by 2.9.

The scaling factor for my 8 hyper threads is even worse.

I'm beginning to think, via a hand waving argument in my mind, that the algorithms we have here are never going to scale past about 3, no matter how many processors one has. There is just too much stuff that has to be done sequentially, at all levels of recursion, in the fibo() and in the multiplies.
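
Putting numbers on the hand waving, this is just Amdahl's law: if a fraction f of the run time parallelizes perfectly then

S(P) = \frac{1}{(1 - f) + f/P}, \qquad S(\infty) = \frac{1}{1 - f},

so if roughly a third of the time is stuck being sequential, the speedup can never pass about 3 no matter how many cores there are.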

jahboater
Posts: 3495
Joined: Wed Feb 04, 2015 6:38 pm

Re: Why Avoid BASIC on RPi?

Sun Jan 13, 2019 8:00 pm

Heater wrote:
Sun Jan 13, 2019 7:05 pm
2) For a rank beginner a simple language is better, without the need to comprehend all kinds of complex syntax and semantics before one can get even the simplest things working. That's not to say the language cannot have complex syntax and grammar, but it should be something one can grow into. When kids start to learn their human language we are happy if they can string together a few simple words; we don't expect them to employ the vast complexity of the whole language. Similarly, those new to programming seem to get on fine with C++ in the Arduino environment; they don't need to be aware of the huge, fractal complexity of the whole language.
Well said.

There is always something new to learn in big languages like C++17, and D for that matter. Learning is fun.

Even simple C.
To my shame, despite their being in the language for 8 years now, I have never used _Generic or any of the atomic stuff.

ejolson
Posts: 2344
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Sun Jan 13, 2019 8:03 pm

Heater wrote:
Sun Jan 13, 2019 7:38 pm
ejolson,
Since the only routine parallelized is the multiply routine one would expect it to take less time.
I don't follow you. Thinking about what goes on here gives me headache.

Perhaps I would expect it to take less time but gprof is saying that using four cores for the multiplies has reduced their significance from 75% of the time to only 5%.

So far that does not seem credible to me.

I'm only getting a speed up of 2.8 on the Pi when going parallel.

parallel.c only speeds up by 2.9.

The scaling factor for my 8 hyper threads is even worse.

I'm beginning to think, via a hand waving argument in my mind, that the algorithms we have here are never going to scale past about 3, no matter how many processors one has. There is just too much stuff that has to be done sequentially, at all levels of recursion, in the fibo() and in the multiplies.
In this code

Code: Select all

            #pragma omp task default(shared)
            z0 = low1 * low2;

            #pragma omp task default(shared)
            z1 = (low1 + high1) * (low2 + high2);

            #pragma omp task default(shared)
            z2 = high1 * high2;

            #pragma omp taskwait
You have sent all the multiplies to worker threads and then wait in the main thread for them to return. Instead, try sending only two of them to workers and performing the third in the main thread. Does something like

Code: Select all

            #pragma omp task default(shared)
            z0 = low1 * low2;

            #pragma omp task default(shared)
            z2 = high1 * high2;

            z1 = (low1 + high1) * (low2 + high2);
            
            #pragma omp taskwait
make a difference?

The Cilk parallel programming extensions to gcc were deprecated about the same time OpenMP got a sophisticated enough scheduler to support the kind of recursive dynamic parallelism being used in the Karatsuba algorithm. Later versions of the 6.x series compiler seem to work fine, as do the 7.x and 8.x versions. What version of gcc are you using?
Last edited by ejolson on Sun Jan 13, 2019 8:30 pm, edited 1 time in total.

Heater
Posts: 10611
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Sun Jan 13, 2019 8:30 pm

Nope.

I commented out the last "#pragma omp task default(shared)".

No noticeable difference when averaged over ten runs on the Pi 3 before and after.

This is gcc version 6.3.0

I conclude OMP is smart enough to understand what I'm saying. I did not say "run this on a thread, run this other thing on another thread,..." I said do this, that, and the other in parallel. OMP knows that only requires 3 threads.

ejolson
Posts: 2344
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Sun Jan 13, 2019 8:42 pm

Heater wrote:
Sun Jan 13, 2019 8:30 pm
Nope.

I commented out the last "#pragma omp task default(shared)".

No noticeable difference when averaged over ten runs on the Pi 3 before and after.
I think Intel Parallel Studio has a parallel performance profiling tool, but I've never used it. I've always taken the old-fashioned approach and tried to think my way through any parallel scaling problems.

Maybe the problem is what I said earlier: too many emulation layers. POSIX threads emulated by glibc using Linux native threads, which are in turn emulated by the Windows Subsystem for Linux using Windows threads.

Oh, wait, the problem is on the Pi as well? Do you have power supply or throttling problems?

ejolson
Posts: 2344
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Mon Jan 14, 2019 4:26 am

Heater wrote:
Sun Jan 13, 2019 8:30 pm
Nope.

I commented out the last "#pragma omp task default(shared)".

No noticeable difference when averaged over ten runs on the Pi 3 before and after.

This is gcc version 6.3.0

I conclude OMP is smart enough to understand what I'm saying. I did not say "run this on a thread, run this other thing on another thread,..." I said do this, that, and the other in parallel. OMP knows that only requires 3 threads.
I have tried parallel.c on an 8-core Cortex-A53 SBC running in 64-bit mode using taskset to limit the number of available cores to 4. With gcc version 6.4.0 I have the following results:

Code: Select all

$ time taskset -c 4-7 ./serial  >/dev/null
Using 1 cores.

real    0m3.047s
user    0m3.040s
sys 0m0.008s
$ time taskset -c 4-7 ./parallel  >/dev/null
Using 4 cores.

real    0m0.901s
user    0m3.088s
sys 0m0.024s
which indicates a speedup of 3.38 for the exact code in GitHub. Slightly better results can be achieved by over-provisioning the available cores with double the worker threads. Simply change the code in the work routine to read as

Code: Select all

    workers(2*nproc){
        bignum x=fibo(n,xd);
        bigprint(x);
    }
and recompile to obtain

Code: Select all

$ time taskset -c 4-7 ./parallel  >/dev/null
Using 4 cores.

real    0m0.858s
user    0m3.076s
sys 0m0.020s
which is now a speedup of 3.58. For reference running on all eight cores results in

Code: Select all

$ time ./parallel  >/dev/null
Using 8 cores.

real    0m0.495s
user    0m3.076s
sys 0m0.048s
which is a 6.15 factor increase in speed. At this point the O(n) part of the algorithm, which was not parallelized, is contributing a significant percentage of the total run time. Although one could attempt to parallelize some of the O(n) routines, such as bigcarry, from another point of view computing million-digit Fibonacci numbers is just too small a problem for more than eight cores.

Finally, I should mention that on this particular SBC (the NanoPC-T3) the controls in /sys/devices/system/cpu/cpufreq/policy0 actually work and were set so that

Code: Select all

$ cat scaling_governor 
performance
$ cat scaling_min_freq 
1400000
$ cat scaling_max_freq 
1400000
to obtain consistent timings. This may or may not work on the Raspberry Pi, and could even be dangerous without adequate cooling.

ejolson
Posts: 2344
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Mon Jan 14, 2019 6:50 am

I thought of one more thing aside from lock contention, throttling and emulated operating systems that might be causing the C++ code to run more slowly. With OpenMP as well as Cilk there is additional overhead related to growing cactus stacks and saving state whenever a subroutine that might spawn parallel work is called whether or not any parallel calls are subsequently made. Since you've lumped the parallel and serial recursion into the same subroutine, the serial part may experience an additional slowdown that could be avoided. Therefore, breaking the parallel and serial recursions into different subroutines as done in parallel.c may increase the resulting efficiency when running on multiple cores.
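
The shape of that split looks something like this toy (a stand-in recursion to show the pattern, not the actual bignum code):

Code: Select all

// Toy illustration of splitting the parallel and serial recursions.
// Only the top few levels spawn OpenMP tasks; below the cutoff a
// pragma-free routine does the work with no task bookkeeping at all.
// Build: g++ -O2 -fopenmp split.cpp
#include <cstdint>
#include <cstdio>

static std::uint64_t fib_serial(int n)            // no pragmas, no spawn overhead
{
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

static std::uint64_t fib_parallel(int n, int depth)
{
    if (depth <= 0 || n < 2)                      // past the cutoff: drop to the serial routine
        return fib_serial(n);
    std::uint64_t a, b;
    #pragma omp task shared(a)
    a = fib_parallel(n - 1, depth - 1);
    b = fib_parallel(n - 2, depth - 1);           // keep one branch in the current thread
    #pragma omp taskwait
    return a + b;
}

int main()
{
    std::uint64_t r = 0;
    #pragma omp parallel
    #pragma omp single
    r = fib_parallel(40, 4);                      // tasks only in the top four levels
    std::printf("%llu\n", static_cast<unsigned long long>(r));
    return 0;
}
The same idea applies to the Karatsuba recursion: spawn tasks only near the top of the tree and call a pragma-free multiply below that.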

Heater
Posts: 10611
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Mon Jan 14, 2019 1:12 pm

ejolson,
...cactus stacks...
What the...?

Hmm...OK... Every new thread needs a new stack, that's clear enough, but it's worse, they get a capture of the current state of visible variables, that's a closure.

http://wiki.c2.com/?CactusStack
https://www.fs.fed.us/wildflowers/plant ... ntea.shtml
...breaking the parallel and serial recursions into different subroutines as done in parallel.c may increase the resulting efficiency...
Sounds like it might. This might take some time...

Meanwhile...

parallel.c on my Pi 3 is only getting a speedup of 2.78 when run on four cores. See the results below. Whereas you show a speedup of 3.38 to 3.58 on your SBC.

Also you have a speed up of 6.15 for 8 cores on that SBC which is twice what I get on my 8 hyper thread PC.

What is up with this picture? If I can't get parallel.c to scale as it should I could be wasting my time tinkering with my C++.

Code: Select all

$ gcc -Wall -Wextra -O3 -o parallel parallel.c
$ time ./parallel > /dev/null
Using 1 cores.

real    0m3.755s
user    0m3.743s
sys     0m0.012s
$ gcc -Wall -Wextra -O3 -fopenmp -o parallel parallel.c
$ time ./parallel > /dev/null
Using 4 cores.

real    0m1.351s
user    0m4.668s
sys     0m0.031s
Over provisioning to 8 cores on the Pi did not change much:

Code: Select all

$ time ./parallel 8 > /dev/null
Using 8 cores.

real    0m1.219s
user    0m4.363s
sys     0m0.035s

jahboater
Posts: 3495
Joined: Wed Feb 04, 2015 6:38 pm

Re: Why Avoid BASIC on RPi?

Mon Jan 14, 2019 2:41 pm

Heater,

You often mention the cost of cache misses.

Here is a useful tool for measurement:
http://valgrind.org/docs/manual/cg-manual.html
While the output may not be all that meaningful, I suspect you could use it before and after any change you make to improve locality of reference and see the difference.

Here are the results from a fibo.c run:-

Code: Select all

$ valgrind --tool=cachegrind ./fibo >/dev/null
==27002== Cachegrind, a cache and branch-prediction profiler
==27002== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==27002== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==27002== Command: ./fibo
==27002== 
--27002-- warning: L3 cache found, using its data for the LL simulation.
==27002== 
==27002== I   refs:      10,166,522,462
==27002== I1  misses:             1,047
==27002== LLi misses:             1,038
==27002== I1  miss rate:           0.00%
==27002== LLi miss rate:           0.00%
==27002== 
==27002== D   refs:       2,168,371,867  (1,376,415,672 rd   + 791,956,195 wr)
==27002== D1  misses:        18,186,355  (   10,035,933 rd   +   8,150,422 wr)
==27002== LLd misses:           160,504  (        2,993 rd   +     157,511 wr)
==27002== D1  miss rate:            0.8% (          0.7%     +         1.0%  )
==27002== LLd miss rate:            0.0% (          0.0%     +         0.0%  )
==27002== 
==27002== LL refs:           18,187,402  (   10,036,980 rd   +   8,150,422 wr)
==27002== LL misses:            161,542  (        4,031 rd   +     157,511 wr)
==27002== LL miss rate:             0.0% (          0.0%     +         0.0%  )
This gives info for individual C functions:

Code: Select all

$ cg_annotate cachegrind.out.27002 
--------------------------------------------------------------------------------
I1 cache:         32768 B, 64 B, 8-way associative
D1 cache:         32768 B, 64 B, 8-way associative
LL cache:         8388608 B, 64 B, 16-way associative
Command:          ./fibo
Data file:        cachegrind.out.27002
Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds:       0.1 100 100 100 100 100 100 100 100
Include dirs:     
User annotated:   
Auto-annotation:  off

--------------------------------------------------------------------------------
            Ir  I1mr  ILmr            Dr       D1mr  DLmr          Dw      D1mw    DLmw 
--------------------------------------------------------------------------------
10,166,522,462 1,047 1,038 1,376,415,672 10,035,933 2,993 791,956,195 8,150,422 157,511  PROGRAM TOTALS

--------------------------------------------------------------------------------
           Ir I1mr ILmr            Dr      D1mr DLmr          Dw      D1mw    DLmw  file:function
--------------------------------------------------------------------------------
9,135,362,240   28   28 1,238,630,994 7,279,021    0 566,723,582    94,528       0  fibo.c:mul_karatsuba.isra.4
  624,145,358    5    5    93,688,083 2,006,900    0  94,876,290   303,169       0  fibo.c:add_bn.isra.2
  185,950,225    4    4     1,992,934       657    0  96,007,860 6,171,101 143,022  /build/glibc-Cl5G7W/glibc-2.23/string/../sysdeps/x86_64/multiarch/../memset.S:__GI_memset
  141,925,231    9    9    21,981,024   686,190  909  21,582,416 1,580,834  13,887  /build/glibc-Cl5G7W/glibc-2.23/string/../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:__memcpy_sse2_unaligned
   42,467,194   54   54    12,933,467        15   15   8,500,118         4       0  /build/glibc-Cl5G7W/glibc-2.23/stdio-common/vfprintf.c:vfprintf
   12,962,152    3    3     1,098,698         1    1     987,586         0       0  /build/glibc-Cl5G7W/glibc-2.23/stdio-common/_itoa.c:_itoa_word
$

Heater
Posts: 10611
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Mon Jan 14, 2019 3:50 pm

jahboater,

Interesting. Thanks for the tip. Here is what I get:

Code: Select all

$ valgrind --tool=cachegrind ./fibo_karatsuba-n-core > /dev/null
...
==26476== I   refs:      3,875,873,122
==26476== I1  misses:            3,025
==26476== LLi misses:            2,559
==26476== I1  miss rate:          0.00%
==26476== LLi miss rate:          0.00%
==26476== 
==26476== D   refs:      1,605,673,155  (1,244,380,205 rd   + 361,292,950 wr)
==26476== D1  misses:       23,426,017  (    9,893,378 rd   +  13,532,639 wr)
==26476== LLd misses:          338,402  (       47,347 rd   +     291,055 wr)
==26476== D1  miss rate:           1.5% (          0.8%     +         3.7%  )
==26476== LLd miss rate:           0.0% (          0.0%     +         0.1%  )
==26476== 
==26476== LL refs:          23,429,042  (    9,896,403 rd   +  13,532,639 wr)
==26476== LL misses:           340,961  (       49,906 rd   +     291,055 wr)
==26476== LL miss rate:            0.0% (          0.0%     +         0.1%  )
I'm not sure what all that means yet but if I cut to the bottom line it says my instruction cache miss rate is 0% and my data cache miss rate is almost 0%.

I'm going to take that as meaning I don't have a cache miss problem here. Unless you can suggest a reason otherwise.

I can't glean anything from the cg_annotate except that whatever is going on is mostly going on in my multiply routine. Which I guess we know already.

Interestingly parallel.c scores 4.6 billion I-refs compared to my 3.8 billion. I find it odd that it should be running so many more instructions.

But parallel.c only has a 0.3% D1 miss rate compared to my 1.5%. Perhaps that is significant.

ejolson
Posts: 2344
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Tue Jan 15, 2019 6:19 am

Heater wrote:
Mon Jan 14, 2019 1:12 pm
ejolson,
...cactus stacks...
What the...?

Hmm...OK... Every new thread needs a new stack, that's clear enough, but it's worse, they get a capture of the current state of visible variables, that's a closure.

http://wiki.c2.com/?CactusStack
https://www.fs.fed.us/wildflowers/plant ... ntea.shtml
...breaking the parallel and serial recursions into different subroutines as done in parallel.c may increase the resulting efficiency...
Sounds like it might. This might take some time...

Meanwhile...

parallel.c on my Pi 3 is only getting a speedup of 2.78 when run on four cores. See the results below. Whereas you show a speedup of 3.38 to 3.58 on your SBC.

Also you have a speed up of 6.15 for 8 cores on that SBC which is twice what I get on my 8 hyper thread PC.

What is up with this picture? If I can't get parallel.c to scale as it should I could be wasting my time tinkering with my C++.

Code: Select all

$ gcc -Wall -Wextra -O3 -o parallel parallel.c
$ time ./parallel > /dev/null
Using 1 cores.

real    0m3.755s
user    0m3.743s
sys     0m0.012s
$ gcc -Wall -Wextra -O3 -fopenmp -o parallel parallel.c
$ time ./parallel > /dev/null
Using 4 cores.

real    0m1.351s
user    0m4.668s
sys     0m0.031s
Over provisioning to 8 cores on the Pi did not change much:

Code: Select all

$ time ./parallel 8 > /dev/null
Using 8 cores.

real    0m1.219s
user    0m4.363s
sys     0m0.035s
Here in the desert we're quite fond of all things to do with cactus.

Looks like over provisioning obtained another 10 percent for a 3-fold performance increase. Still, the numbers make me think your Pi is throttling when multiple cores are busy.

For the x86 computer, behind the scenes there are really only 4 cores, so scaling just under a factor of four is not bad. Historically, hyper-threads were a hardware workaround to increase the responsiveness of the Windows scheduler. They also helped with the long Pentium 4 pipelines that tended to stall. Today hardware threads are mostly a marketing scheme: stalls are much less frequent due to compiler technology and processor design. They do, however, lead to some interesting side-channel security issues.

Here is a scaling study of parallel.c using a dual Xeon E5-2620 twelve-core server.

Image

Note that the scaling is not as good as with the eight-core ARM machine. Since the CPUs are faster it is possible that memory bandwidth becomes an issue sooner.
Last edited by ejolson on Tue Jan 15, 2019 7:46 pm, edited 1 time in total.

Heater
Posts: 10611
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Tue Jan 15, 2019 9:49 am

ejolson,
...the numbers make me think your Pi is throttling when multiple cores are busy...
My "vcgencmd get_throttled" shows no sign of it. However I'm not totally sure vcgencmd works properly on this pi64 operating system.
...behind the scenes there are really only 4 cores,...
I always imagined hyper-threads were not all they were cracked up to be. I guess they are even less effective than I thought, especially when they also have to make a lot of memory accesses.
Here is a scaling study of parallel.c using a dual Xeon E5-2620 twelve-core server...
Wow! Where do I get me one of them?

plugwash
Forum Moderator
Posts: 3286
Joined: Wed Dec 28, 2011 11:45 pm

Re: Why Avoid BASIC on RPi?

Tue Jan 15, 2019 9:57 am

Heater wrote:
Tue Jan 15, 2019 9:49 am
Wow! Where do I get me one of them?
eBay: you can often pick up older-gen server hardware for surprisingly low prices. For example https://www.ebay.co.uk/itm/HP-Proliant- ... :rk:2:pf:1

Heater
Posts: 10611
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Tue Jan 15, 2019 10:03 am

It would cost a fortune to get that kind of thing shipped to Helsinki!

They don't turn up cheap around here very often.

Heater
Posts: 10611
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Tue Jan 15, 2019 7:15 pm

Some interesting results from the Muppet Lab:

I was wondering how my 4-core, 8-hyper-thread machine scales when attacking an "embarrassingly" parallel problem (no memory sharing). So I tweaked my Google Benchmark code a bit. It now runs a simple big integer power function multiple times: using OpenMP, using C++'s std::async, and in serial. (There is no parallelism in the actual big integer multiply here.) The power function looks like this:

Code: Select all

bint power(int n)
{
	bint x = bint("1");
	bint ten = bint("10");
	for (int i = 0; i < n; i++) { 
		x = x * ten;
	}
	return x;
}
It gets run by the benchmark for values of n from 16 to 65536.
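
For reference, the three drivers have roughly this shape; the workload here is a stand-in, not my actual bint power() above, so the sketch stands alone against Google Benchmark:

Code: Select all

// Rough shape of the three benchmark drivers (stand-in workload, not bint).
// Build: g++ -O2 -fopenmp -pthread bench.cpp -lbenchmark
#include <benchmark/benchmark.h>
#include <cstdint>
#include <future>
#include <vector>

static std::uint64_t work(int n)                  // stand-in for power(n)
{
    std::uint64_t x = 1;
    for (long i = 0; i < 20000L * n; i++)
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    return x;
}

static void BM_powerSer(benchmark::State &state)
{
    const int n = static_cast<int>(state.range(0));
    for (auto _ : state)
        for (int i = 0; i < 4; i++)               // four independent calls, one after another
            benchmark::DoNotOptimize(work(n));
}

static void BM_powerOmp(benchmark::State &state)
{
    const int n = static_cast<int>(state.range(0));
    for (auto _ : state) {
        #pragma omp parallel for
        for (int i = 0; i < 4; i++)               // four independent calls across the team
            benchmark::DoNotOptimize(work(n));
    }
}

static void BM_powerAsync(benchmark::State &state)
{
    const int n = static_cast<int>(state.range(0));
    for (auto _ : state) {
        std::vector<std::future<std::uint64_t>> f;
        for (int i = 0; i < 4; i++)               // four independent calls on four async threads
            f.push_back(std::async(std::launch::async, work, n));
        for (auto &r : f)
            benchmark::DoNotOptimize(r.get());
    }
}

BENCHMARK(BM_powerSer)->RangeMultiplier(2)->Range(16, 65536);
BENCHMARK(BM_powerOmp)->RangeMultiplier(2)->Range(16, 65536);
BENCHMARK(BM_powerAsync)->RangeMultiplier(2)->Range(16, 65536);
BENCHMARK_MAIN();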

The results for four threads are interesting:

1) For small values of n, OpenMP is as fast as serial, almost no overheads there.

2) std::async is terribly slow until n = 1024. But after that it becomes faster than OpenMP.

3) At the top end std::async beats OpenMP!

The maximum parallel scale factor, at n = 65536, is 3.23 for OpenMP and 3.47 for std::async.

Who would ever guess std::async could beat OpenMP at its own game?

I ran the same test with 8 parallel threads, but then I find OpenMP and std::async neck and neck, with a speedup of only 3.79. Seems those hyper-threads are pretty useless.

Raw results for the 4 thread case are:

Code: Select all

---------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
---------------------------------------------------------------
BM_powerOmp/16           0.007 ms        0.007 ms       112000
BM_powerOmp/32           0.008 ms        0.008 ms        89600
BM_powerOmp/64           0.011 ms        0.011 ms        56000
BM_powerOmp/128          0.021 ms        0.021 ms        37333
BM_powerOmp/256          0.048 ms        0.048 ms        14933
BM_powerOmp/512          0.153 ms        0.150 ms         4480
BM_powerOmp/1024         0.668 ms        0.670 ms         1120
BM_powerOmp/2048          3.00 ms         2.93 ms          213
BM_powerOmp/4096          18.1 ms         15.3 ms           45
BM_powerOmp/8192           119 ms          103 ms            5
BM_powerOmp/16384          808 ms          797 ms            1
BM_powerOmp/32768         7631 ms         5547 ms            1
BM_powerOmp/65536        49499 ms        48375 ms            1
BM_powerAsync/16         0.291 ms        0.220 ms         2987
BM_powerAsync/32         0.291 ms        0.231 ms         3446
BM_powerAsync/64         0.284 ms        0.223 ms         4480
BM_powerAsync/128        0.314 ms        0.213 ms         2489
BM_powerAsync/256        0.300 ms        0.179 ms         2800
BM_powerAsync/512        0.433 ms        0.234 ms         4480
BM_powerAsync/1024       0.686 ms        0.150 ms         4480
BM_powerAsync/2048        2.48 ms        0.227 ms         3446
BM_powerAsync/4096        14.1 ms        0.219 ms         1000
BM_powerAsync/8192        97.5 ms        0.156 ms          100
BM_powerAsync/16384        745 ms        0.000 ms           10
BM_powerAsync/32768       5490 ms        0.000 ms            1
BM_powerAsync/65536      46078 ms        0.000 ms            1
BM_powerSer/16           0.004 ms        0.004 ms       165926
BM_powerSer/32           0.008 ms        0.008 ms        89600
BM_powerSer/64           0.015 ms        0.015 ms        44800
BM_powerSer/128          0.035 ms        0.035 ms        20364
BM_powerSer/256          0.104 ms        0.105 ms         6400
BM_powerSer/512          0.358 ms        0.353 ms         1948
BM_powerSer/1024          1.52 ms         1.54 ms          498
BM_powerSer/2048          8.12 ms         8.12 ms           75
BM_powerSer/4096          51.1 ms         51.6 ms           10
BM_powerSer/8192           354 ms          352 ms            2
BM_powerSer/16384         2637 ms         2641 ms            1
BM_powerSer/32768        20332 ms        20328 ms            1
BM_powerSer/65536       159968 ms       159969 ms            1

ejolson
Posts: 2344
Joined: Tue Mar 18, 2014 11:47 am

Re: Why Avoid BASIC on RPi?

Tue Jan 15, 2019 7:31 pm

Heater wrote:
Tue Jan 15, 2019 7:15 pm
The maximum parallel scale factor, at n = 65536, is 3.23 for OpenMP and 3.47 for std::async.

Who would ever guess std::async could beat OpenMP at its own game?
I wonder if OpenMP got unlucky or used setaffinity behind the scenes and scheduled multiple workers in hyper-threads on the same core.
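
One way to see where the workers actually land (assuming Linux and glibc) is to print each thread's logical CPU:

Code: Select all

// Quick check of where the OpenMP workers are running (Linux/glibc assumed).
// Which logical CPUs are hyper-thread siblings can be read from
// /sys/devices/system/cpu/cpu*/topology/thread_siblings_list
// Build: g++ -O2 -fopenmp where.cpp
#include <cstdio>
#include <omp.h>
#include <sched.h>

int main()
{
    #pragma omp parallel
    {
        #pragma omp critical
        std::printf("thread %d of %d running on logical cpu %d\n",
                    omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}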

After a simple

# apt-get install mono-vbnc

on the Pi and a few syntactic changes to the original fibo.bas FreeBASIC program, I can now create Fibonacci numbers using the mono virtual machine.

Before posting the resulting visual.bas program, I'll add a Karatsuba multiply routine, because that is still missing.

The vbnc compiler produces executables quickly, especially compared to g++; however, the speed of the Fibonacci code is yet to be seen.

It seems difficult to avoid BASIC around here, so why avoid it?

Heater
Posts: 10611
Joined: Tue Jul 17, 2012 3:02 pm

Re: Why Avoid BASIC on RPi?

Tue Jan 15, 2019 7:56 pm

Looks like my 4-core speedup is about the same as you report for the Xeon E5-2620 machine, judging by your bar chart.

It's good to know that everything can work as expected here, at least in this simple case, and now I have a baseline to aim for.

This BASIC thing has really gone to your head!
