Heater
Posts: 9361
Joined: Tue Jul 17, 2012 3:02 pm

Re: .NET Core - Raspbian or Win10IoT?

Fri May 18, 2018 6:55 pm

Be careful with such benchmarks.

When I compile your code I get:

Code: Select all

$ gcc -o junk junk.c
$ time ./junk

real    0m8.967s
user    0m8.813s
sys     0m0.031s
Which is slower than the node.js version:

Code: Select all

$ time node junk.js

real    0m1.818s
user    0m1.719s
sys     0m0.109s
But wait, that is compiled without optimization. Let's optimize:

Code: Select all

$ gcc -O3 -o junk junk.c
$ time ./junk

real    0m0.016s
user    0m0.000s
sys     0m0.000s
Wow, that is fast!

Well not really. Your code has no output. Turning on optimization actually removes the whole redundant loop and the executable just exits doing nothing!

OK, let's print the final count such that there is some output and the program has work to do. Like so:

Code: Select all

 $ cat junk.c
#include <stdio.h>
#include <stdint.h>
#include <limits.h>
int main()
{
    int max = 2147483637;
    int count = 0;
    while (count < max)
    {
        count += 10;
        count -= 9;
    }
    printf("Count = %d\n", count);
    return 0;
}
$ gcc -O3 -o junk junk.c
$ time ./junk
Count = 2147483637

real    0m0.019s
user    0m0.000s
sys     0m0.000s

Still mighty fast!

Not so. The optimizer is not stupid. It knows what that simple loop does, calculates the result at compile time, and the executable just prints that precomputed constant.

We can fix that by making count "volatile":

Code: Select all

$ cat junk.c
#include <stdio.h>
#include <stdint.h>
#include <limits.h>

volatile int count = 0;
int main()
{
    int max = 2147483637;
    while (count < max)
    {
        count += 10;
        count -= 9;
    }
    printf("Count = %d\n", count);
    return 0;
}
$ gcc -O3 -o junk junk.c
$ time ./junk
Count = 2147483637

real    0m8.563s
user    0m8.453s
sys     0m0.000s
Hmm...we are back to where we started!

From this simple test of yours I conclude that Javascript under node.js is 4.7 times faster than C.

How about that?!

All done on an MS Surface Pro 4, sorry no Pi to hand to try it there.

pauliunas
Posts: 41
Joined: Mon Feb 26, 2018 7:43 am

Re: .NET Core - Raspbian or Win10IoT?

Fri May 18, 2018 7:32 pm

Heater wrote:
Fri May 18, 2018 6:55 pm
Be careful with such benchmarks.

[...]

From this simple test of yours I conclude that Javascript under node.js is 4.7 times faster than C.

How about that?!
lol something's terribly wrong with your gcc :D

pauliunas
Posts: 41
Joined: Mon Feb 26, 2018 7:43 am

Re: .NET Core - Raspbian or Win10IoT?

Fri May 18, 2018 7:35 pm

Or more likely, your node.js stuff just prints a constant, like what gcc did with optimizations. So please work on your JS to make it actually do the calculations and then we'll talk, because this is an apples-to-oranges comparison, really.

Heater
Posts: 9361
Joined: Tue Jul 17, 2012 3:02 pm

Re: .NET Core - Raspbian or Win10IoT?

Fri May 18, 2018 7:42 pm

pauliunas,
...because this is an apples to oranges comparison really.
Quite correct.

That was the whole point of my post. Which started with "Be careful".

Such micro benchmarks always lie to you. Small changes in code, different optimizations, etc, can change the result a lot.

And when you are done, they say nothing about the actual performance you will get in your application.

pauliunas
Posts: 41
Joined: Mon Feb 26, 2018 7:43 am

Re: .NET Core - Raspbian or Win10IoT?

Fri May 18, 2018 8:00 pm

Heater wrote:
Fri May 18, 2018 7:42 pm
pauliunas,
...because this is an apples to oranges comparison really.
Quite correct.

That was the whole point of my post. Which started with "Be careful".

Such micro benchmarks always lie to you. Small changes in code, different optimizations, etc, can change the result a lot.

And when you are done, they say nothing about the actual performance you will get in your application.
You're quite right, I have misinterpreted your post. That's why I wanted to see something closer to the real world :)

jahboater
Posts: 2727
Joined: Wed Feb 04, 2015 6:38 pm

Re: .NET Core - Raspbian or Win10IoT?

Fri May 18, 2018 9:57 pm

Heater,
Well not really. Your code has no output. Turning on optimization actually removes the whole redundant loop and the executable just exits doing nothing!
This is as old as the hills. I remember someone writing a large benchmark for IBM's VS Fortran 35? years ago. Lots of complex calculations but no IO. The compiler removed everything, just like the C compiler did here.
OK, let's print the final count such that there is some output and the program has work to do. Like so:
.................
Not so. The optimizer is not stupid. It knows what that simple loop does, calculates the result at compile time, and the executable just prints that precomputed constant.
Now I think that's rather clever.
Like converting Dennis Ritchie's algorithm for counting the set bits in an integer into a single instruction.
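
The algorithm being referred to is presumably the classic clear-the-lowest-set-bit loop; here is a minimal sketch of it (my illustration, not code from the thread):

```cpp
#include <cstdint>

// Classic set-bit counting: x &= x - 1 clears the lowest set bit, so the
// loop body runs once per set bit. Modern GCC/Clang recognize this pattern
// and can compile the whole function down to a single popcount instruction.
int count_set_bits(uint32_t x) {
    int n = 0;
    while (x != 0) {
        x &= x - 1;  // clear the lowest set bit
        ++n;
    }
    return n;
}
```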

From what I can see this is formalized in C++ by "constexpr", where expressions and simple functions can be evaluated at compile time.
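
For what it's worth, a minimal sketch of that (my illustration, not code from the thread), assuming a C++14 compiler:

```cpp
// C++14 constexpr: when called with a constant argument in a constant
// context, the compiler must evaluate the whole loop at compile time,
// much as the optimizer folded the counting loop earlier in the thread.
constexpr long sum_to(int n) {
    long total = 0;
    for (int i = 1; i <= n; ++i) total += i;
    return total;
}

// "folded" is baked into the binary as 500500; no loop runs at run time.
constexpr long folded = sum_to(1000);
static_assert(folded == 500500, "evaluated by the compiler");
```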

These optimizations happen all the time in C/C++; they are valid and normal, so 0.016s is the real result - for this silly benchmark. I don't think interpreted languages can ever do these things, by the way, because they usually only look at a line or two at a time. For example, to remove redundant code the compiler may have to examine the entire translation unit to prove that the code really is redundant, or that a result is not used.

Heater
Posts: 9361
Joined: Tue Jul 17, 2012 3:02 pm

Re: .NET Core - Raspbian or Win10IoT?

Sat May 19, 2018 2:38 am

OK, what about this piece totally arbitrary benchmark code in C:

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <limits.h>

#define SIZE  10000
#define ITERS 100
/*
  This is xoshiro128** 1.0, our 32-bit all-purpose, rock-solid generator. It
  has excellent (sub-ns) speed, a state size (128 bits) that is large
  enough for mild parallelism, and it passes all tests we are aware of.

  For generating just single-precision (i.e., 32-bit) floating-point
  numbers, xoshiro128+ is even faster.

  The state must be seeded so that it is not everywhere zero.
*/

static inline uint32_t rotl(const uint32_t x, int k) {
    return (x << k) | (x >> (32 - k));
}

volatile static uint32_t s[4] = {938247, 23097423, 52309875, 297340234};

uint32_t next(void) {
    const uint32_t result_starstar = rotl(s[0] * 5, 7) * 9;
    const uint32_t t = s[1] << 9;

    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];

    s[2] ^= t;

    s[3] = rotl(s[3], 11);

    return result_starstar;
}

int main(int argc, char* argv[])
{
    int *a = (int*)calloc(SIZE * SIZE, sizeof(int));

    printf("Seed: %d, %d, %d, %d\n", s[0], s[1], s[2], s[3]);

    for (int iters = 0; iters < ITERS; iters++)
    {
        // Multiply every element by 42, add a random amount and take the modulus 100000
        for (int h = 0; h < SIZE; h++)
        {
            for (int w = 0; w < SIZE; w++)
            {
                a[(h * SIZE) + w] = ((a[(h * SIZE) + w] * 42)  + next()) % 100000;
            }
        }
    }

    // Print some element from somewhere
    printf("Result: %d\n", a[SIZE * SIZE / 3]);
}
Basically it is multiplying every element of a big two dimensional array (10000 by 10000 integers) by 42, adding a random integer and doing it all modulus 100000. And doing all that 100 times.

Let's compile and run it:

Code: Select all

$ g++ -O3 -o junk junk.cpp
$ time ./junk
Seed: 938247, 23097423, 52309875, 297340234
Result: 28247

real    0m53.820s
user    0m53.188s
sys     0m0.391s
I like this little benchmark because it has a lot of fiddly bit twiddling code with the pseudo random number generator, typical of microcontroller kind of things, but it works on a big lot of data, more like what we would see on a Pi.

Now, we could write the equivalent of that code in Javascript but I'm going to cheat and use Emscripten to translate that C source into Javascript. Then run the resulting JS under node.js:

Code: Select all

$ em++ -s ALLOW_MEMORY_GROWTH=1 -o junk.js -O3 junk.cpp
$ time node junk.js
Seed: 938247, 23097423, 52309875, 297340234
Result: 28247

real    1m4.189s
user    1m3.313s
sys     0m0.609s
Well bugger me. The C runs in 53.8 seconds. The interpreted Javascript runs in 64.1 seconds. What's that? The JS is about 16% slower than C. Amazing!

I'd love it if some C# or Java guys could show what they can do with this. Python guys, don't even bother.

JS code attached.
Attachments
junk.js.gz
(7.21 KiB) Downloaded 10 times

Heater
Posts: 9361
Joined: Tue Jul 17, 2012 3:02 pm

Re: .NET Core - Raspbian or Win10IoT?

Sat May 19, 2018 2:56 am

jahboater,
I don't think interpreted languages can ever do these things by the way, because they usually only look at a line or two at a time.
As you see above, the V8 Javascript engine in node.js is not your father's interpreter. It compiles functions to native code as it runs, it optimizes that code using information gathered at run time. Same is true of the JS engines in Firefox and Edge.

It's amazing what those JS engine developers have done in recent years.
For example to remove redundant code the compiler may have to examine the entire translation unit to prove that code really is redundant, or a result is not used.
I always wonder about that. Typically your code is compiled one file at a time and then the objects linked together. I don't see how a C/C++ compiler can do such global optimizations when it only has one file to look at. It's impossible when your program is using precompiled static or shared libraries.

jahboater
Posts: 2727
Joined: Wed Feb 04, 2015 6:38 pm

Re: .NET Core - Raspbian or Win10IoT?

Sat May 19, 2018 6:45 am

Heater wrote:
Sat May 19, 2018 2:38 am
OK, what about this piece totally arbitrary benchmark code in C:
That is truly impressive. Clearly in a different league to Python.

In addition to executing the program itself, it has to load and start the interpreter, parse and translate the code, and optimize it - all as "overhead", i.e. stuff the compiled program does not have to do. So yes, impressive! It is also time constrained: the compiler may take as long as it likes doing its optimization, but the interpreter must include the optimization time within the overall run time.

But this is a small triply nested loop with a single line of code executed 1e10 times - something interpreters love! Java would probably do well here too. Here the interpretation overheads are tiny compared to the total execution time. Harder to do I know, but I would like to see how it does with a larger program without the deep repetition.

I see the em++ has produced a single long line of output.
I'll try and lay it out in a human readable form and see if I can understand it.
It may be (the -O3) is doing some optimization at that stage.

---------------------------

For C, I like this sort of thing that modern compilers do:
In your code, the rotl() function gets translated thus:-

Code: Select all

 136                # try.c:20:     return (x << k) | (x >> (32 - k));
 137 007e C1C007        roll    $7, %eax    
it has understood the "purpose" of the bit twiddling and compiled the entire function down into a single rotate left instruction. Node is probably doing something similar, but you can imagine how many instructions a simpler interpreter like Python would take just to parse that expression.
On the Pi it becomes:

Code: Select all

 136                @ try.c:20:     return (x << k) | (x >> (32 - k));
 137 0078 E22CA0E1      ror r2, r2, #25
:) :) I have no idea what it's doing here; it's been inlined of course, the expression merged with the surrounding code - and then optimized. 32-7 is 25, which probably explains the 25.
Edit: Aha - I see ARM has no rotate left insn, so it's converted it to a rotate right.
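
The identity the compiler is exploiting can be sketched like this (illustrative helper names, not code from the thread):

```cpp
#include <cstdint>

// A left rotate by k is the same as a right rotate by 32 - k (for
// 0 < k < 32), so rotl(x, 7) can legitimately be emitted as a
// rotate-right by 25 on ARM, which has no rotate-left instruction.
static inline uint32_t rotl32(uint32_t x, int k) {
    return (x << k) | (x >> (32 - k));
}

static inline uint32_t rotr32(uint32_t x, int k) {
    return (x >> k) | (x << (32 - k));
}
```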

jahboater
Posts: 2727
Joined: Wed Feb 04, 2015 6:38 pm

Re: .NET Core - Raspbian or Win10IoT?

Sat May 19, 2018 7:31 am

Heater wrote:
Sat May 19, 2018 2:56 am
For example to remove redundant code the compiler may have to examine the entire translation unit to prove that code really is redundant, or a result is not used.
I always wonder about that. Typically your code is compiled one file at a time and then the objects linked together.
By translation unit I meant just the one file.
I don't see how a C/C++ compiler can do such global optimizations when it only has one file to look at. It's impossible when your program is using precompiled static or shared libraries.
You need -flto for that (and related options).

Code: Select all

-flto[=n]
  This option runs the standard link-time optimizer.  When invoked with source code, it
  generates GIMPLE (one of GCC's internal representations) and writes it to special ELF
  sections in the object file.  When the object files are linked together, all the
  function bodies are read from these ELF sections and instantiated as if they had been
  part of the same translation unit.
Last edited by jahboater on Sat May 19, 2018 7:39 am, edited 1 time in total.

pauliunas
Posts: 41
Joined: Mon Feb 26, 2018 7:43 am

Re: .NET Core - Raspbian or Win10IoT?

Sat May 19, 2018 7:32 am

Heater wrote:
Sat May 19, 2018 2:38 am
I'd love it if some C# or Java guys could show what they can do with this.
OK, attaching source code and self-contained binaries to mega, because the binaries were apparently too large for this forum. I could probably just delete like 90% of it because it's some DLLs I don't even use, but nah, I'm too lazy to test if it works afterwards.

https://mega.nz/#!GZ1yETra!g95wJ_PE77XE ... Id80uJuPwk

I basically just copied over the C code in C# syntax and didn't apply any additional optimizations.

jahboater, it's indeed amazing how clever compilers are nowadays. I'm even more amazed how clever an interpreter can be - it managed to do the optimizations at runtime... On the other hand, it probably wouldn't do so well with a more complex piece of code. But in the end, each language has its own strong sides so there probably won't be a clear winner at all.

jahboater
Posts: 2727
Joined: Wed Feb 04, 2015 6:38 pm

Re: .NET Core - Raspbian or Win10IoT?

Sat May 19, 2018 7:44 am

pauliunas wrote:
Sat May 19, 2018 7:32 am
But in the end, each language has its own strong sides so there probably won't be a clear winner at all.
Totally agree.
No matter how slow Python is for example, it definitely has its uses and is very popular here.

Heater
Posts: 9361
Joined: Tue Jul 17, 2012 3:02 pm

Re: .NET Core - Raspbian or Win10IoT?

Sat May 19, 2018 9:01 am

jahboater,
But this is a small triply nested loop with a single line of code executed 1e10 times - something interpreters love!
You spotted my devious plan there!

That's right, interpreters need time to "warm up". They have to do any JIT compilation, instantiate all the objects and so on. In the JS case the code is further optimized using info gained by running it, and that takes a few iterations to get up to speed.

That is why I have a large number of iterations.

On the other hand, I don't call that cheating, most software I write is supposed to run forever. Or a long time at least.

That is not a single line iterated, mind you. Not in our source, anyway. It calls next(), which itself is 10 lines or so.
I see the em++ has produced a single long line of output. I'll try and lay it out in a human readable form and see if I can understand it. It may be (the -O3) is doing some optimization at that stage.
That's right. If I compile it with no optimization it produces human readable output, which takes 3 times longer to run. Here are the timings for unoptimized JS and C:

Code: Select all

$ em++ -s ALLOW_MEMORY_GROWTH=1 -o junk.js -O0 junk.cpp
$ time node junk.js
Seed: 938247, 23097423, 52309875, 297340234
Result: 28247

real    3m6.336s
user    3m5.531s
sys     0m0.703s

$ g++ -O0 -o junk junk.cpp
$ ./junk
Seed: 938247, 23097423, 52309875, 297340234
Result: 28247
$ time ./junk
Seed: 938247, 23097423, 52309875, 297340234
Result: 28247

real    1m54.366s
user    1m53.750s
sys     0m0.438s
STOP PRESS!!!

I owe an apology. Turns out Emscripten has changed a lot since I last used it. It now, by default, compiles the C to WebAssembly byte code and produces a .wasm file that actually contains the code we write. That .wasm is then run from the .js file it produces. The timings above are all for WebAssembly execution. Damn.

In order to get good old Javascript output we need a switch, "-s WASM=0". Here is the execution time for optimized JS:

Code: Select all

$ em++ -s WASM=0 -s ALLOW_MEMORY_GROWTH=1 -o junk.js -O3 junk.cpp
$ time node junk.js
Seed: 938247, 23097423, 52309875, 297340234
Result: 28247

real    1m9.152s
user    1m8.313s
sys     0m0.625s
And unoptimized JS:

Code: Select all

$ em++ -s WASM=0 -s ALLOW_MEMORY_GROWTH=1 -o junk.js -O0 junk.cpp
$ time node junk.js
Warning: Enlarging memory arrays, this is not fast! 16777216,536870912
Seed: 938247, 23097423, 52309875, 297340234
Result: 28247

real    2m54.513s
user    2m53.703s
sys     0m0.563s
So the optimized JS is still only about 30% slower than the optimized C. I'm not sure why we need WebAssembly; it seems like a lot more complication for no gain in performance.

Attached is the unoptimized, human readable Javascript. You can find the _main function in there and so on.
Attachments
junk.js.gz
(71.55 KiB) Downloaded 11 times
Last edited by Heater on Tue May 22, 2018 9:13 am, edited 1 time in total.

Heater
Posts: 9361
Joined: Tue Jul 17, 2012 3:02 pm

Re: .NET Core - Raspbian or Win10IoT?

Sat May 19, 2018 9:24 am

pauliunas,

Hey, thanks for trying that out.

What did you do? Your C# conversion looks good. But that /benchmark/bin/Publish/linux-x64/benchmark is not a CIL image. It's a Linux native executable:

Code: Select all

$ file ./benchmark/bin/Publish/linux-x64/benchmark
./benchmark/bin/Publish/linux-x64/benchmark: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=3d89f68c79b8d30b263657973c41b7d2d5f4556c, stripped
That's OK perhaps but it's not cross-platform.

Worse still, for a native executable the performance sucks:

Code: Select all

$ time ./benchmark/bin/Publish/linux-x64/benchmark
Seed: 938247, 23097423, 52309875, 297340234

real    2m6.544s
user    2m5.344s
sys     0m0.547s
That's half the speed of the C version. Almost half the speed of Javascript! Did you have optimization turned on?

jahboater
Posts: 2727
Joined: Wed Feb 04, 2015 6:38 pm

Re: .NET Core - Raspbian or Win10IoT?

Sat May 19, 2018 10:27 am

Heater wrote:
Sat May 19, 2018 9:01 am
It now, by default, compiles the Javascript to web assembly byte codes and produces a .wasm file that actually contains the code we write. That .wasm is then run from the .js file it produces. The timings above are all for web assembly execution.
Ah that would explain it. I extracted your original file and it converted to 376 lines of gibberish unrelated to the original problem!!!!!

pauliunas
Posts: 41
Joined: Mon Feb 26, 2018 7:43 am

Re: .NET Core - Raspbian or Win10IoT?

Sat May 19, 2018 10:28 am

Heater wrote:
Sat May 19, 2018 9:24 am
pauliunas,

Hey, thanks for trying that out.

[...]

That's half the speed of the C version. Almost half the speed of Javascript! Did you have optimization turned on?
I exported it as a self-contained package so that you can execute it without preinstalling .NET Core on your system. The folder includes the whole .NET Core in itself, along with benchmark.dll, which contains the cross-platform code compiled from my C#.

So it's not a native executable. The executable is just a wrapper for the .NET Core runtime.

I just compiled with default settings in Visual Studio, I believe there are some extra optimization options that can be enabled. I'm not familiar with them because normally I just work with debug builds.

jahboater
Posts: 2727
Joined: Wed Feb 04, 2015 6:38 pm

Re: .NET Core - Raspbian or Win10IoT?

Sat May 19, 2018 10:51 am

Heater,

Trivia! But you get slightly smaller executables with slightly faster startup times if you compile your C program with the C compiler (gcc) and not the C++ compiler (g++).
The C++ build also loads some libraries that C does not need.
Just FYI

Heater
Posts: 9361
Joined: Tue Jul 17, 2012 3:02 pm

Re: .NET Core - Raspbian or Win10IoT?

Sat May 19, 2018 5:17 pm

In my experience compiling a C program with g++ results in exactly the same executable bytes as compiling it with gcc. Even if you rename the source to .cpp.

Yes, the resulting ELF file may be a bit bigger. Because C++ likes to do name mangling and such and the symbols get bigger.

But if you "strip" symbols from the executable they end up exactly the same size. As is the case for the example benchmark we have been discussing here.

There is no overhead to using g++ for C programs.

Even better...some years ago, as an experiment, I wrote some object-oriented code, with some classes and instances of them, in C++. Then the equivalent object-oriented code in C, using structs and passing instance pointers as the first parameter to the C "methods". I was amazed to find that the C++ and C sources compiled to exactly the same instructions. Binary identical.

Sadly I don't have the source to show you anymore.

There is no overhead to C++ unless you ask for it. But then it's not overhead, it's what you want to do.

jahboater
Posts: 2727
Joined: Wed Feb 04, 2015 6:38 pm

Re: .NET Core - Raspbian or Win10IoT?

Sat May 19, 2018 5:35 pm

Take a trivial hello.c program to show the small overhead:-
Identical program, gcc 8.1
gcc hello.c -Os -s -o c_hello
mv hello.c hello.cpp
g++ hello.cpp -Os -s -o c++_hello

Number of instructions executed (to run the program):-
C - 191329
C++ - 3723541

19.46 times as many instructions to print hello,world!

(these are stripped executables)
size c_hello c++_hello
text data bss dec hex filename
1114 544 8 1666 682 c_hello
1267 592 8 1867 74b c++_hello

ldd c_hello

linux-vdso.so.1 => (0x00007fffac1a5000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f39071a5000)
/lib64/ld-linux-x86-64.so.2 (0x000056048c68a000)

ldd c++_hello

linux-vdso.so.1 => (0x00007fff6f3d5000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f8b7a575000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f8b7a26c000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f8b7a056000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f8b79c8b000)
/lib64/ld-linux-x86-64.so.2 (0x0000556cb13d8000)
Last edited by jahboater on Sat May 19, 2018 10:44 pm, edited 4 times in total.

Heater
Posts: 9361
Joined: Tue Jul 17, 2012 3:02 pm

Re: .NET Core - Raspbian or Win10IoT?

Sat May 19, 2018 5:43 pm

pauliunas,
So it's not a native executable. The executable is just a wrapper for .NET Core runtime.
No. It is a native executable. You need different ones for Intel, ARM etc. The Intel one in your package does not work under Linux.

Yes it might contain the interpreter needed to run the .Net byte codes. That is another matter.

Anyway, I fixed up your C# version of my "benchmark" so that it produces the correct final result and compiled it for Linux. With optimization:

Code: Select all

$ csc Program.cs -optimize
Microsoft (R) Visual C# Compiler version 2.6.0.62309 (d3f6b8e7)
Copyright (C) Microsoft Corporation. All rights reserved.

$ time ./Program.exe
Seed: 938247, 23097423, 52309875, 297340234
Result: 28247

real    1m50.947s
user    0m0.000s
sys     0m0.000s
Why is it so slow? Half the speed of Javascript!

Here is the code:

Code: Select all

using System;
using System.Runtime.CompilerServices;

namespace benchmark
{
    class Program
    {
        private const short size = 10000;
        private const byte Iters = 100;

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private static UInt32 rotl(UInt32 x, int k)
        {
            return (x << k) | (x >> (32 - k));
        }

        private volatile static UInt32[] s = { 938247, 23097423, 52309875, 297340234 };

        private static UInt32 next()
        {
            UInt32 result_starstar = rotl(s[0] * 5, 7) * 9;
            UInt32 t = s[1] << 9;

            s[2] ^= s[0];
            s[3] ^= s[1];
            s[1] ^= s[2];
            s[0] ^= s[3];

            s[2] ^= t;

            s[3] = rotl(s[3], 11);

            return result_starstar;
        }

        static void Main(string[] args)
        {
            UInt32[,] a = new UInt32[size, size];

            Console.WriteLine("Seed: {0}, {1}, {2}, {3}", s[0], s[1], s[2], s[3]);

            for (int iters = 0; iters < Iters; iters++)
            {
                // Multiply every element by 42, add a random amount and take the modulus 100000
                for (int h = 0; h < size; h++)
                {
                    for (int w = 0; w < size; w++)
                    {
                        a[h, w] = ((a[h, w] * 42) + next()) % 100000;
                    }
                }
            }

            // Print some element from somewhere
            Console.WriteLine("Result: " + a[size/3, size/3]);
        }
    }
}

Heater
Posts: 9361
Joined: Tue Jul 17, 2012 3:02 pm

Re: .NET Core - Raspbian or Win10IoT?

Sat May 19, 2018 5:57 pm

Now here is the kicker...

A teeny weeny change in the source causes it to run at less than half the speed:

Code: Select all

$ time ./Program.exe
Seed: 938247, 23097423, 52309875, 297340234
Result: 28247

real    4m40.324s
user    0m0.000s
sys     0m0.000s
So whilst we are busy fussing over the performance of this language/runtime system or that, at the end of the day, if the code you write does not play nice with the machine you run it on, it's all a waste of effort.

The teeny weeny change is in the code below. Anyone spot the difference and say why it has such a dramatic effect on performance?

Code: Select all

using System;
using System.Runtime.CompilerServices;

namespace benchmark
{
    class Program
    {
        private const short size = 10000;
        private const byte Iters = 100;

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private static UInt32 rotl(UInt32 x, int k)
        {
            return (x << k) | (x >> (32 - k));
        }

        private volatile static UInt32[] s = { 938247, 23097423, 52309875, 297340234 };

        private static UInt32 next()
        {
            UInt32 result_starstar = rotl(s[0] * 5, 7) * 9;
            UInt32 t = s[1] << 9;

            s[2] ^= s[0];
            s[3] ^= s[1];
            s[1] ^= s[2];
            s[0] ^= s[3];

            s[2] ^= t;

            s[3] = rotl(s[3], 11);

            return result_starstar;
        }

        static void Main(string[] args)
        {
            UInt32[,] a = new UInt32[size, size];

            Console.WriteLine("Seed: {0}, {1}, {2}, {3}", s[0], s[1], s[2], s[3]);

            for (int iters = 0; iters < Iters; iters++)
            {
                // Multiply every element by 42, add a random amount and take the modulus 100000
                for (int h = 0; h < size; h++)
                {
                    for (int w = 0; w < size; w++)
                    {
                        a[w, h] = ((a[w, h] * 42) + next()) % 100000;
                    }
                }
            }

            // Print some element from somewhere
            Console.WriteLine("Result: " + a[size/3, size/3]);
        }
    }
}

jahboater
Posts: 2727
Joined: Wed Feb 04, 2015 6:38 pm

Re: .NET Core - Raspbian or Win10IoT?

Sat May 19, 2018 6:01 pm

Row-major vs column-major array access?

Heater
Posts: 9361
Joined: Tue Jul 17, 2012 3:02 pm

Re: .NET Core - Raspbian or Win10IoT?

Sat May 19, 2018 6:28 pm

Bingo!

Can't catch you out.

It's all about cache memory. Keeping data in cache as much as possible. Avoiding slow fetches of data from external RAM.

It's the second devious feature of my silly benchmark.

I was expecting the performance hit of changing array access order to be even worse. But I guess there is a lot of work going on with each element so that masks the effect.
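
The effect is easy to reproduce in isolation. Here is a minimal sketch (my illustration; the size and fill are arbitrary, not the benchmark's): two traversals doing identical work, differing only in which index varies fastest:

```cpp
const int N = 2000;
static int a[N][N];

// Fill from a run-time seed so the compiler cannot fold the sums
// below into compile-time constants.
void fill(int seed) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            a[i][j] = seed + i + j;
}

// Row-major walk: a[i][j] with j varying fastest steps through memory
// contiguously, so nearly every access is a cache hit.
long sum_row_major() {
    long s = 0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += a[i][j];
    return s;
}

// Column-major walk: same elements, but each step jumps N ints ahead,
// touching a fresh cache line on almost every access.
long sum_col_major() {
    long s = 0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += a[j][i];
    return s;
}
```

Both functions return the same total, but once the array is larger than the cache, timing them (with clock(), say) shows the column-major walk running several times slower.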

ejolson
Posts: 1662
Joined: Tue Mar 18, 2014 11:47 am

Re: .NET Core - Raspbian or Win10IoT?

Mon May 21, 2018 8:49 pm

jahboater wrote:
Sat May 19, 2018 6:01 pm
Row-major vs column-major array access?
Since only the SIZE*SIZE/3 element is printed, you don't need an array at all. Just kick the random number generator SIZE*SIZE-1 times between each iteration and update only the relevant number you need for the answer. If you don't care about the exact sequence of pseudorandom numbers, don't even do that.

Presumably such code, if it ever appeared, would be part of a stochastic simulation in which the apparent randomness of the pseudorandom numbers is important but not the specific deterministic sequence. At the same time, there may be cryptographic applications where the sequence of random numbers is important. Note that since there is no mixing between adjacent elements in the array (such as via an sbox in DES), among other things, the suggested benchmark code likely does not model a cryptographic algorithm.
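
That transformation can be sketched like so, reusing the thread's xoshiro128** generator (run() and its parameters are illustrative names, not code from the thread):

```cpp
#include <cstdint>

static uint32_t st[4] = {938247, 23097423, 52309875, 297340234};

static inline uint32_t rotl(uint32_t x, int k) {
    return (x << k) | (x >> (32 - k));
}

// The xoshiro128** step from the benchmark, minus the volatile.
static uint32_t next(void) {
    const uint32_t result = rotl(st[0] * 5, 7) * 9;
    const uint32_t t = st[1] << 9;
    st[2] ^= st[0]; st[3] ^= st[1];
    st[1] ^= st[2]; st[0] ^= st[3];
    st[2] ^= t;
    st[3] = rotl(st[3], 11);
    return result;
}

// Track only the one element that gets printed, but still advance the
// generator size*size times per pass so the pseudorandom sequence is
// identical to the full-array version.
uint32_t run(int size, int iters) {
    const long target = (long)size * size / 3;
    uint32_t value = 0;
    for (int it = 0; it < iters; ++it)
        for (long i = 0; i < (long)size * (long)size; ++i) {
            const uint32_t r = next();
            if (i == target)
                value = (value * 42u + r) % 100000u;
        }
    return value;
}
```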

Heater
Posts: 9361
Joined: Tue Jul 17, 2012 3:02 pm

Re: .NET Core - Raspbian or Win10IoT?

Mon May 21, 2018 10:00 pm

ejolson,

No.

If all the inputs were constants then whatever result that algorithm produces could be reduced to just pre-computing the result at compile time and printing the result at run time. Which would take no time at all!

But the inputs are not constants. The seed of the random number generator is "volatile". That means it can change at any moment during run time and the compiler cannot optimize away all the work we have asked it to do.

Do not be led astray by the fact that this uses a pseudo random number generator. This has nothing to do with generating random numbers, let alone cryptographically secure random numbers. Just consider it typical of the work we put in our code, whatever our code does. It's just a bunch of logical and arithmetic operations, representative of what we expect programs to do.

And the array is an important part of all this. It's big. It's bigger than the cache memory in your processor. As you probably know, working on data in cache memory is 10 or more times faster than having to fetch that data from external memory.

As such, accessing that array in the right order has a huge impact on performance.

I had thought about presenting results where the JS code would be cache friendly and the C code would not. Thus "proving" that Javascript is faster than C!

I could not bring myself to be so devious.

It's not like I did not think about this "benchmark" a bit before presenting it. There is method in my madness.
