Maatmes
Posts: 9
Joined: Sat Jul 02, 2016 12:49 pm

Benchmarking is weird

Tue Dec 03, 2019 5:51 pm

I dug up an old program I had written back in my Unix days. I do a lot of code doodling (think of a pad of paper and pencil, but with an editor and a terminal). This program is in C (my preferred language). It is based on a challenge from "Nibble" magazine (I think) back when I was learning to program on my Apple II. The challenge was to write a 6502 assembly program to count to one million as fast as possible. Back then the Apple II did not have virtual memory, so the screen was mapped to an area of memory (0x400 I think, but my memory is corrupted some). Anyway, storing the ASCII value for '0' in 6 consecutive memory locations displayed 000000. The winner of this challenge counted to a million in just a few seconds, on a CPU with a 1 MHz clock, which blew my mind. The way he did it was to not loop, but to bump the ones place ten times, then bump the tens place once, repeating this ten times (basically unrolling the loop up to the thousands place, if I remember correctly). Anyway, my code in C does the same thing, but I increased the count as systems have gotten so fast.
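
For the curious, here is a minimal C sketch of the idea (my own illustration, not the original Nibble winner, which wrote straight to screen memory): the count lives in a buffer as ASCII digits, the ones place is handled by straight-line code with no carry test, and a carry ripples into the higher places only once per ten counts.

Code:

#include <stdio.h>

/* print the current digits followed by a carriage return */
#define SHOW(c) (d[6] = (c), fwrite(d, 1, 8, stdout))

static char d[] = "0000000\r";   /* six high digits, the ones place, CR */

int main(void) {
    for (long n = 0; n < 100000; ++n) {     /* 100000 * 10 = one million */
        /* unrolled: the ones place never carries inside this run */
        SHOW('0'); SHOW('1'); SHOW('2'); SHOW('3'); SHOW('4');
        SHOW('5'); SHOW('6'); SHOW('7'); SHOW('8'); SHOW('9');
        /* one ripple carry per ten counts, tens place upward */
        int p = 5;
        while (p >= 0 && ++d[p] > '9')
            d[p--] = '0';
    }
    putchar('\n');
    return 0;
}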

I ran the program on my Pi 4 and it takes about 1.5 seconds. A simple 'for' loop doing the same thing takes about 15 seconds.

Kind of.

Those times were measured over SSH through MobaXterm on my desktop. If I run the same program through a remote xterm, the time becomes 13 seconds.

It's not too hard to figure out why. If I run the program redirecting the output to /dev/null, the result is 0.56 seconds no matter what terminal I use. The weird part is that my Pi 4 beats my Windows 10 desktop, a 3.6 GHz i5, which takes 1010 seconds (redirecting to the NUL device took 73 seconds). Windows seems to expect that nobody uses the command line for anything beyond simple batch scripts.

The deal with benchmarks is that you need to know what's actually going on. Anyway, I thought it might be of interest.

Just a note: this counting optimization is not good programming practice. The hassle of maintaining the code is terrible (but it gives you some bragging rights...). I'm thinking of doing a version that counts in hex and maybe a binary version. I implemented this in curses and the source code is over 1 megabyte long (it's not fast by a long shot, as I suspected). I had hoped to try counting to one billion by extending the loop unrolling in the curses version, but at 10 million bytes of source code both nano and gcc would break.

ejolson
Posts: 3798
Joined: Tue Mar 18, 2014 11:47 am

Re: Benchmarking is weird

Tue Dec 03, 2019 6:40 pm

Maatmes wrote:
Tue Dec 03, 2019 5:51 pm
The deal with benchmarks is that you need to know what's actually going on. Anyway, I thought it might be of interest.
Those are some interesting results. I tried unrolling the loop in the square-by-doubling routine used to find integer powers p^n in my program that solves this coding challenge, and it ran slower. In fact, for the original n=200 problem size, the O(n) method of finding p^n by simply multiplying p by itself n times was about the same speed as the doubling method.
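
For reference, a sketch of the two methods on plain 64-bit integers (my own illustration, not the actual code from the challenge entry, which works with much bigger numbers, so treat this as a shape comparison only):

Code:

#include <stdint.h>
#include <stdio.h>

uint64_t pow_linear(uint64_t p, unsigned n) {
    uint64_t r = 1;
    while (n--)
        r *= p;                 /* n multiplications: O(n) */
    return r;
}

uint64_t pow_doubling(uint64_t p, unsigned n) {
    uint64_t r = 1;
    while (n) {
        if (n & 1)
            r *= p;             /* multiply in the current bit of n */
        p *= p;                 /* square for the next bit */
        n >>= 1;
    }
    return r;                   /* about 2*log2(n) multiplications */
}

int main(void) {
    printf("%llu %llu\n",
           (unsigned long long)pow_linear(3, 20),
           (unsigned long long)pow_doubling(3, 20));  /* both 3486784401 */
    return 0;
}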

At any rate, given the skill you have with 6502 assembly and coding in general, it would be very nice if you could look at the results I obtained using cc65 and an emulated Commodore PET

https://www.raspberrypi.org/forums/view ... 5#p1574107

to see how much faster 6502 assembler and a clever use of page zero would be compared to the C code. Of course, code written in any language that runs on the Raspberry Pi to solve that programming challenge would also be interesting and well received.

I will be posting a new chart similar to

https://www.raspberrypi.org/forums/view ... 0#p1568127

to summarize the runtime, memory used, and lines of code for each entry soon. However, that table can be updated at any time, so there is no need to hurry and no deadline to meet.

Heater
Posts: 13866
Joined: Tue Jul 17, 2012 3:02 pm

Re: Benchmarking is weird

Tue Dec 03, 2019 8:30 pm

Maatmes,

Benchmarking is indeed weird.

With modern-day processors and all their cache memories, pipelining, parallel instruction execution, instruction reordering, branch prediction and so on, it's almost impossible to tell what the fastest way is to do even the simplest things.

What was a good optimization back in the day, like loop unrolling, may actually slow you down today. It was about 15 years ago that I discovered this to my surprise.

Of course, with a simple task like displaying all the numbers from zero to one million, one is basically measuring the speed of one's operating system in writing to the display, rather than the actual counting task. That might not have been the case back in the day, when displaying the numbers was just a matter of writing them to RAM someplace.

Welcome to the club.

As you might have noticed, we have had a run of such coding challenges here this year. It all started with the idea of pitting hand-crafted assembler against a C compiler, or whatever high-level language.

Oh good, now we have a new coding challenge. A really simple one: display all the numbers from zero to one million as fast as possible, in the programming language of one's choice.

I suspect the quickest way to do this is to attach 20 LEDs to GPIO pins and count up in binary (2^20 is just over a million). With a bit more effort one could drive seven-segment LED displays for decimal output. Probably much faster than writing to a terminal window in X Windows!

By way of a kick-off, this naive first attempt in the Rust language is shamefully slow on my x86-64 PC:

Code:

$ cat  src/main.rs

fn main() {
    // print each count, overwriting the same line with a carriage return
    for n in 0..=1_000_000 {
        print!("{}\r", n);
    }
}

$ time cargo run --release
    Finished release [optimized] target(s) in 0.03s
     Running `target/release/count`
1000000
real    0m6.808s
user    0m0.234s
sys     0m0.219s

Redirecting the output to /dev/null brings that down to 0.25 seconds, which is still terrible.
Memory in C++ is a leaky abstraction.

Maatmes
Posts: 9
Joined: Sat Jul 02, 2016 12:49 pm

Re: Benchmarking is weird

Tue Dec 03, 2019 9:01 pm

Back in the day I used to declare register variables in C to fine-tune for performance. It didn't last long before compiler optimization outdid me, to the point that I did not bother any more. Readable, easily maintainable code is much more important (and relative, I'm afraid).
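
For anyone who never saw that style, here is roughly what it looked like (a hypothetical example of mine, not code from back then). Modern compilers ignore the hint and do their own register allocation:

Code:

#include <stdio.h>

long sum_to(long n) {
    /* 'register' asks the compiler to keep these in CPU registers;
     * with any modern optimizer the keyword makes no difference */
    register long i, s = 0;
    for (i = 1; i <= n; ++i)
        s += i;
    return s;
}

int main(void) {
    printf("%ld\n", sum_to(1000000));   /* 500000500000 */
    return 0;
}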

Timing programs under Linux, or any modern operating system with shared memory and multiple processes, can only be trusted so far. My counting program can vary by about 160 ms over a quick test of 4 runs. There are obviously other things going on. I probably need to increase the n in my big O.
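
One way to see that noise (a sketch of mine, not my counting program) is to time the same workload several times with the POSIX monotonic clock and print every result, instead of trusting a single number:

Code:

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

static void workload(void) {
    volatile long s = 0;                 /* volatile: keep the loop alive */
    for (long i = 0; i < 50000000L; ++i)
        s += i;
}

int main(void) {
    for (int run = 1; run <= 4; ++run) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        workload();
        clock_gettime(CLOCK_MONOTONIC, &b);
        double ms = (b.tv_sec - a.tv_sec) * 1e3
                  + (b.tv_nsec - a.tv_nsec) / 1e6;
        printf("run %d: %.1f ms\n", run, ms);   /* run-to-run spread */
    }
    return 0;
}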

My implementation of the 6502 challenge, after counting each instruction's clock cycles, was nowhere close to the winner. Unrolling the loop over the count grants a big gain, but at a cost. The Apple implementation at 1 MHz performed close to what the Pi 4 does, but this is down to the text handling: you do not get much more efficient at updating text than just incrementing a memory location. Then again, you are much safer from a runaway program with virtual memory and the isolation of user space from kernel space. You can get better raw performance, but the cost isn't worth it.

I use the 'time' command to give me an idea of what's going on. If the 'sys' time is large, then I know my program is not being kind to other users.

Maatmes
Posts: 9
Joined: Sat Jul 02, 2016 12:49 pm

Re: Benchmarking is weird

Tue Dec 03, 2019 9:29 pm

For your Rust implementation you would get a big speed-up if you could SSH into the Windows machine from a Pi or Linux box; the text handling on the Windoze command line is terrible.

Heater
Posts: 13866
Joined: Tue Jul 17, 2012 3:02 pm

Re: Benchmarking is weird

Tue Dec 03, 2019 9:54 pm

I did a lot of assembler programming back in the day, for the Z80, 6809 and some others along the way. Strangely enough, never the 6502.

If I understand correctly, one trick for fast code on the 6502 was to keep all your data in "page zero", the first 256 bytes of RAM. Then you could use shorter instructions to access your data, all addresses being only a byte wide.

If I want to SSH from my Pi to my Windows machine, I will first have to SSH from Windows to the Pi to do so. All my Pis are headless!
Memory in C++ is a leaky abstraction.

Heater
Posts: 13866
Joined: Tue Jul 17, 2012 3:02 pm

Re: Benchmarking is weird

Tue Dec 03, 2019 10:10 pm

I forgot to mention: another thing that makes benchmarking weird today is the magic that compilers can do with optimization.

They can unroll your loops. They can vectorize your loops. They can use shifts and adds instead of multiplies. They can use multiplies instead of divides. They can inline functions. They can delete big chunks of your code that turn out not to have an observable result. And so on.
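
That last one is the classic benchmark trap. A sketch (my example, not from this thread): at -O2 the first loop below can be deleted entirely, because its result is never observed, so timing it measures nothing.

Code:

#include <stdio.h>

volatile long sink;   /* writes here count as observable behaviour */

int main(void) {
    long s = 0;
    for (long i = 0; i < 100000000L; ++i)
        s += i;       /* dead: s is never used, the loop can vanish */

    long t = 0;
    for (long i = 0; i < 100000000L; ++i)
        t += i;
    sink = t;         /* observed: the result must be produced, though
                         a clever compiler may still fold the loop into
                         a closed-form expression */
    printf("%ld\n", sink);
    return 0;
}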

It gets kind of hard to make a little benchmark that is representative of anything one may want to do in a real application.
Memory in C++ is a leaky abstraction.

Return to “General programming discussion”