jahboater
Posts: 2928
Joined: Wed Feb 04, 2015 6:38 pm

Re: .NET Core - Raspbian or Win10IoT?

Thu Jun 21, 2018 10:45 am

creativeprojects wrote:
Fri May 18, 2018 5:01 pm

Here's the code in C:

Code:

int main()
{
    int max = 2147483637;
    int count = 0;
    while (count < max)
    {
        count += 10;
        count -= 9;
    }
    return 0;
}
As you can see, there's really nothing.

C (compiled directly on the Pi with the default cc 6.3.0-18+rpi1+deb9u1)
Time elapsed was 21.6s
Something wrong here!
This code (as you say) does nothing.
Therefore the only sensible thing for the compiler to emit is just:

mov r0, #0
bx lr

That is, just the "return 0".
It does not take 21.6 seconds to execute two instructions.
I can't comment about the interpreted languages.
So

Code:

[email protected]:~ $ gcc -O2 -s try.c -o try
[email protected]:~ $ time ./try
real	0m0.005s
user	0m0.001s
sys	0m0.004s
[email protected]:~ $ 
5 milliseconds "real" time (on a Pi3+).

This illustrates one difference between compiled and interpreted languages: the compiler examines the entire program and may then safely conclude that some code can never be used, so it removes it!

Hint: you need to do some I/O in your benchmark, otherwise the compiler will see through it.
Last edited by jahboater on Thu Jun 21, 2018 10:59 am, edited 1 time in total.

pauliunas
Posts: 41
Joined: Mon Feb 26, 2018 7:43 am

Re: .NET Core - Raspbian or Win10IoT?

Thu Jun 21, 2018 10:56 am

jahboater wrote:
Thu Jun 21, 2018 10:45 am
This illustrates one difference between compiled and interpreted languages, the compiler will examine the entire program and may then safely say that xxx code can never be used - so remove it!
but nowadays some interpreted languages are actually JITed, so they can also do the same...

jahboater
Posts: 2928
Joined: Wed Feb 04, 2015 6:38 pm

Re: .NET Core - Raspbian or Win10IoT?

Thu Jun 21, 2018 11:04 am

pauliunas wrote:
Thu Jun 21, 2018 10:56 am
but nowadays some interpreted languages are actually JITed, so they can also do the same...
Yes, I know they can do some pretty clever stuff and loops perform well.

So why is the execution time so long for these languages (74 minutes for Python :) :) )?

Show me any interpreted language that can do this in two machine instructions - it is not possible.

It would take thousands just to parse the source text, let alone optimize it.

pauliunas
Posts: 41
Joined: Mon Feb 26, 2018 7:43 am

Re: .NET Core - Raspbian or Win10IoT?

Thu Jun 21, 2018 11:22 am

jahboater wrote:
Thu Jun 21, 2018 11:04 am
pauliunas wrote:
Thu Jun 21, 2018 10:56 am
but nowadays some interpreted languages are actually JITed, so they can also do the same...
Yes, I know they can do some pretty clever stuff and loops perform well.

So why is the execution time so long for these languages (74 minutes for Python :) :) )?


Show me any interpreted language that can do this in two machine instructions - it is not possible.
Because python is python? I doubt it's even JITed at all lol... and I don't even know if there's any shittier language than that...

For other languages, once the code is JITed, I don't see why it couldn't be 2 instructions. JITing takes time, but in most applications it doesn't really matter that much... once the initial JITing is done, it can run fast enough to just JIT everything else during idle cycles. Of course, I'm not saying these languages should be used for performance-critical number crunching, because then the JIT, GC and so on are just a waste of computing power.

jahboater
Posts: 2928
Joined: Wed Feb 04, 2015 6:38 pm

Re: .NET Core - Raspbian or Win10IoT?

Thu Jun 21, 2018 11:39 am

Creativeprojects,

In your benchmark, as count approaches max it will overflow and go negative.
At least it will in C where "int" on the Pi is a 32-bit signed integer.

This is undefined behavior - a big no-no.
The compiler will assume it does not happen in a correct program, which is probably another reason to elide the entire thing.

As before, count += 10; count -= 9 will not fool a compiler!
Do something with a side effect, such as printing a result at the end that depends on count.
And reduce max a little, to say 2147483627, which will avoid the disastrous overflow.

jahboater
Posts: 2928
Joined: Wed Feb 04, 2015 6:38 pm

Re: .NET Core - Raspbian or Win10IoT?

Thu Jun 21, 2018 12:08 pm

pauliunas wrote:
Thu Jun 21, 2018 11:22 am
For other languages, once the code is JITed, I don't see why it couldn't be 2 instructions. JITing takes time, but in most applications it doesn't really matter that much... once the initial JITing is done, it can run fast enough to just JIT everything else during idle cycles.
Yes, agreed. But don't forget the interpreter itself must be loaded.
You are right that Python is a bad example, but I see the python interpreter is 1090 disk blocks - plus the user's program, and nine libraries (from ldd).
The C compiled program is just 2 disk blocks (which is annoyingly large for such a tiny program).

pauliunas
Posts: 41
Joined: Mon Feb 26, 2018 7:43 am

Re: .NET Core - Raspbian or Win10IoT?

Thu Jun 21, 2018 12:19 pm

jahboater wrote:
Thu Jun 21, 2018 12:08 pm
pauliunas wrote:
Thu Jun 21, 2018 11:22 am
For other languages, once the code is JITed, I don't see why it couldn't be 2 instructions. JITing takes time, but in most applications it doesn't really matter that much... once the initial JITing is done, it can run fast enough to just JIT everything else during idle cycles.
Yes, agreed. But don't forget the interpreter itself must be loaded.
You are right that Python is a bad example, but I see the python interpreter is 1090 disk blocks - plus the user's program, and nine libraries (from ldd).
The C compiled program is just 2 disk blocks (which is annoyingly large for such a tiny program).
I didn't forget, and that's why I said "for most applications" :) Nowadays most machines have at least 8GB of memory, so that extra 50MB or so for the compiler (I mean, 50MB is already a lot tbh) won't be noticed by end users. Especially if the runtime is shared between several applications.

pauliunas
Posts: 41
Joined: Mon Feb 26, 2018 7:43 am

Re: .NET Core - Raspbian or Win10IoT?

Thu Jun 21, 2018 2:24 pm

JustAGeek wrote:
Thu Jun 21, 2018 9:21 am
I have absolutely no idea why I'm biting on this. I'll use it as an excuse to exercise my brain. So while I'll write as though I'm speaking to you, I'm really more interested in just writing down my thoughts. I also stopped reading all the comments when I realized that it would be a REALLY REALLY long read. So I'm about 1/4 down on the second page as it stands now.

I've been many things in life. I have been coding far more than long enough to actually have earned a living in COBOL before it was obsolete. I spent many long hours programming in languages where GOTO statements were the only real option. I've been a senior developer on one of the fastest web browsers ever made for years. I've been a programming language designer, an operating system developer, an old school demo coder and for a very large part a codec developer for one of the companies that more or less dominated the patent pool on a lot of H.26X standards.

I am also currently coding almost entirely in C# for many reasons... though not really the ones mentioned. And remember, I've spent much of my life counting cycles.

I am happy to see the discussion between C# and Javascript as these are two of my favorite playgrounds. I'm friends with some of the people in charge of C# at Microsoft as well as some of the performance oriented developers at Microsoft, Mozilla and Google. I've had lunch with at least one person from each of their Javascript teams in the past two months. My son is actually named for one of them.

I'm also glad that no one is debating firmware/kernel level languages like C and C++. I'm completely over those languages and have absolutely no possible reason to recommend either for anything since Rust came around... except maybe that C's ABI is very simplistic and can be trivial to implement.

You're both right and you're both wrong about very much of what you say. We have a Javascript fanatic and a C# fanatic. I know this isn't fair, but if you go back and read what you've both been writing from a 3rd person perspective, you might consider medication for that. It was getting a bit out of hand.

Let me start by saying, Javascript is faster. There's absolutely no doubt about this and there is no debating it. Using a simple counter program isn't testing language performance. You're comparing the AOT and caching of the engine. It's not realistic and it's not fair. It also doesn't do what real programs do. Real programs allocate and free memory... that's the key thing.

There are some cases where C# will rip the doors off of Javascript. In .NET Core 2.1, there is now support for Span<> and Map<> which are two of the best additions ever made to a modern language. I have a personal request into the .NET development team to add support for memory locking and alignment so that I would no longer have to drop to C++ to manage memory for codecs.

From a language perspective, Javascript is amazing and you should not ever use modern Javascript. Browsers don't like it and it's best to stick with something older. Javascript in version 6 and later is an incredible language. The introduction of classes has been a huge improvement. That said, Javascript as a language is pretty weak overall as a comparison.

Let's start by saying that Javascript was never really supposed to stick. Technically, it's one of the best and the worst languages ever made. The simple fact that something like 'strict' even exists is proof of this. You have basically two languages in one. One language is a shit box that says that pretty much anything you type will do something. The other one is a language where "best practices" are trying to be declared and enforced. But the language is basically a sandbox where pretty much anything goes. There are days where we were sitting there and while writing a test for Javascript we said "wouldn't it be nice to have a new construct to make this easier for this test". Of course that keyword had absolutely no value to the language as a whole, but we said "screw it, let's stick it in" and before you knew it, the standards group over at Ecma wanted to kill us because not only did it become part of the language as a defacto standard, but the bugs in our implementation did too. We spent all our days and nights trying to fight with the crappy parser rules we had to write because of Netscape's original implementation and later Microsoft's half baked alternative that we honestly almost couldn't even call it a programming language parser as opposed to a natural language parser.

C# is a special language because unlike most other languages (Java as a notable example), the developers of that language are willing to rip things out and simply release a new version. This is possible because if you're using .NET there's no particular reason you can't have 5 different versions of .NET on your machine. It means that where Javascript has the 'strict' concept, C# actually makes breaking changes that force massive refactoring of the class libraries and things improve greatly. This is why C# probably has the nicest lambda and async implementations of any language today.

There are endless reasons why each language is better than the other. I can cite that C# looks like a language carefully engineered by language developers who eventually added such insane static code analysis to their compilers that it's difficult to write bad C# code anymore since the compiler will go absolutely crazy and generate pages of warnings over it. Javascript has had probably collectively close to a billion dollars invested purely in a competition of performance which has yielded a runtime environment that made at least 3 separate Javascript engines that generally produce code that is far faster and/or better than any C compiler out there today (when considering real world memory usage).

C# doesn't ever try for the nitty gritty little things like fancy auto-vectorization engines. Javascript doesn't target it, but it's present. C# calls cleanly into native code, Javascript makes this really a gruesome task. As I said, pages and pages and pages of comments can be written on this topic. But in the end, when comparing CapEx (the time it takes to write and maintain code) vs OpEx (the cost of running that code) the two languages even out.

A good point to make as well is that poorly written Javascript in just the right place can actually be seriously detrimental to the environment. For example, if you were to implement a Javascript library which was compiled on client side for a web interface. Let's think in terms of something like Angular. Then a large download and a high complexity would consume massive amounts of power worldwide if it were used on something like Google's home page. Simply minimizing the Javascript on Google's home page probably saves the world massive power costs just because it would reduce the computational complexity (think Big-O) of the lexical analysis of the received code.

The reason I use C# however has to do with class libraries. In the Javascript world, even the simple things depend on code which is provided by the community. If you were to look at the Linux kernel and the absolute trash heap of code that is, (BTW, I love Linux, the code is horrifying though) it's an example of how the community can absolutely ruin the underlying structure of something otherwise beautiful. The Linux kernel has millions of lines of wasteful duplicated code because the kernel base libraries are insufficient for the job. The Linux world constantly creates, recreates and then recreates again the simplest of functions. The worst part is that there's no real central authority building solid tools for Javascript as a base library. We depend on things like NPM which provide toolkits from just too many places at once.

I hate coding in Javascript because the NPM toolbox is a total nightmare. To be honest, I almost never use community contributed libraries from Nuget. Nearly every time I've ever done so, I've regretted it. Instead, I use packages provided by vendors or roll it myself. The one exception to this has been in SSH support where I've been too busy to sit down and start a proper modern implementation of it.

I absolutely love the Microsoft stranglehold on C#. They have a good team working together to make a good product. There are days where I want to offer pull requests to them, but then I realize, filing an issue is more effective.

Let's just say again, you're both right and both wrong. There's no value to choosing one language/platform vs. the other if you're proficient in one or the other already. Unless you're padding your CV/resume it's best to simply program in whatever you like the most. They both have incredible merits.

If you compared either language to C or C++, I'd be totally different. Consider for instance modern Meltdown/Spectre related CPU bugs. All the C/C++ code out there had to be recompiled and systems had to have scheduled downtime and there were reboots and verification testing and then additional patches to fix performance on those patches, etc...

If the code were written in either Javascript or C#, a simple patch to the browser or to the CLR would have altered the way the code was compiled and put an end to those problems. Instead we all took performance hits when our firmwares were updated.

Then there's relocatable memory. I despise any code where memory is allocated by pointer rather than by reference. Referenced memory can be paged, compressed, relocated, etc... this means that the system MMU can have much shorter tables to traverse when performing memory reads and writes since it is possible to use unaligned memory as well as to defragment and garbage collect on idle cycles. The kernel could run a process that did nothing more than to decrease the complexity of the GDT and LDT while idle. That said, we can do that to a limited extent, but because of "C purists", defragmentation on application and kernel level is impossible.

Then there's performance. Using a tracing JIT (we don't do that anymore, but we do something similar) it is possible to compile out conditions for specific branches of execution during runtime that can adjust to provide maximum performance whether running on an Atmel processor or on a 28 core Xeon. And then there's NUMA. When working in a multi-socket server or HPC environment, code written in C has to be manually written to place code near memory. Labs spend hundreds of millions of dollars trying to build servers and storage systems to compensate for C, C++ and CUDA code which doesn't properly distribute active data sets to active computational nodes. Instead, the software just depends on massive local RAM and a ridiculously high performance fabric to distribute data as RDMA operations. If the code were written in Javascript and/or C#, code could be properly distributed to where the data is on the fly which would be much faster than moving the data. It would be little more than simply migrating the process from machine to machine... a few bytes in operation. Then there's core scheduling and cache coherency... all problems compiled languages have that JIT languages don't.

I'm heading to lunch now. Thanks for the opportunity to rant for a while.
That is a quite good rant there, I enjoyed reading it :)
But before saying that C# is "undoubtedly" slower than JS... have you seen this? https://medium.com/@chrisdaviesgeek/net ... a8fd2edff0
The "traditional" .NET Framework is indeed slow. That's OK, because it was very GUI-oriented and that part it does well. But .NET Core, on the other hand... it doesn't have the GUI part and it's fast. In those benchmarks, it's faster than Express.JS or very close behind, except for "Fortunes", whatever it is. The author of that article claims it's faster than JS and he would use it, if not for the lack of 3rd party libraries... but honestly, I think it's gotten much better since 2016.

paddyg
Posts: 2187
Joined: Sat Jan 28, 2012 11:57 am
Location: UK

Re: .NET Core - Raspbian or Win10IoT?

Thu Jun 21, 2018 11:06 pm

Just to throw in some more random info not related to .NET Core or Win10IoT... I wondered what Cython might make of the random number generator test. And then I thought it might be more interesting to see how Rust coped. So, copy-pasting the code from p2, I got

Code:

$ g++ -O3 -o test5 test5.c
$ time ./test5
Seed: 938247, 23097423, 52309875, 297340234
Result: 28247

real	1m2.872s
user	1m2.668s
sys	0m0.192s
That's on my laptop, so pretty close to the original values from @heater. Then I transcribed the code to Rust (see below) and got

Code:

$ cargo build --release
   Compiling test2 v0.1.0 (file:///home/patrick/rust/test2)
    Finished release [optimized] target(s) in 0.96 secs
$ time target/release/test2
Seed: 938247, 23097423, 52309875, 297340234
Result: 28247

real	0m45.532s
user	0m45.308s
sys	0m0.216s
which is quite a significant improvement. However, the Rust version compiles to a file of 4,657,840 bytes, compared with 8,744 bytes for C, which is also a significant difference.

Code:

const SIZE: usize = 10000;
const ITERS: usize = 100;

fn rotl(x: u32, k: usize) -> u32 {
    (x << k) | (x >> (32 - k))
}

// In Rust, mutable access to s[] has to be passed explicitly, and is borrowed only for the duration of this call
fn next(s: &mut Vec<u32>) -> u32 {
    let result_starstar = rotl(s[0] * 5, 7) * 9;
    let t = s[1] << 9;

    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];

    s[2] ^= t;

    s[3] = rotl(s[3], 11);

    result_starstar
}

fn main() {
    let mut s: Vec<u32> = vec![938247, 23097423, 52309875, 297340234];
    let mut a = vec![vec![0; SIZE]; SIZE];

    println!("Seed: {}, {}, {}, {}\n", s[0], s[1], s[2], s[3]);

    for _iters in 0..ITERS {
        for h in 0..SIZE {
            for w in 0..SIZE {
                a[h][w] = (a[h][w] * 42  + next(&mut s)) % 100000;
            }
        }
    }

    let h = SIZE * SIZE / 3;
    let w = h % SIZE;
    let h = h / SIZE; // find the same location as per flattened index
    println!("Result: {}\n", a[h][w]);
}
also https://groups.google.com/forum/?hl=en-GB&fromgroups=#!forum/pi3d

jahboater
Posts: 2928
Joined: Wed Feb 04, 2015 6:38 pm

Re: .NET Core - Raspbian or Win10IoT?

Fri Jun 22, 2018 10:38 am

paddyg wrote:
Thu Jun 21, 2018 11:06 pm
Just to throw in some more random info not related to .NET Core or Win10IoT... I wondered what Cython might make of the random number generator test. And then I thought it might be more interesting to see how Rust coped. So, copy-pasting the code from p2, I got
Please could you try this C version on your laptop? It hasn't had the state variables made "volatile" (which makes it about 2.5x slower).
Also what version of the C compiler are you using?
It is a C program, so perhaps use gcc not g++ (though it may not make much difference).
Here is the original, which is closer to your Rust version:

Code:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define SIZE  10000
#define ITERS 100

static inline uint32_t rotl(const uint32_t x, const int k) {
    return (x << k) | (x >> (32 - k));
}

static uint32_t s[4] = {938247, 23097423, 52309875, 297340234};

static uint32_t next(void) {
    const uint32_t result_starstar = rotl(s[0] * 5, 7) * 9;
    const uint32_t t = s[1] << 9;

    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];

    s[2] ^= t;

    s[3] = rotl(s[3], 11);

    return result_starstar;
}

int main( void )
{
    int * const a = (int*)calloc(SIZE * SIZE, sizeof(int));

    printf("Seed: %d, %d, %d, %d\n", s[0], s[1], s[2], s[3]);

    for (int iters = 0; iters < ITERS; ++iters)
    {
        // Multiply every element by 42, add a random amount and take the modulus 100000
        for (int h = 0; h < SIZE; ++h)
        {
            for (int w = 0; w < SIZE; ++w)
            {
                a[(h * SIZE) + w] = ((a[(h * SIZE) + w] * 42)  + next()) % 100000;
            }
        }
    }

    // Print some element from somewhere
    printf("Result: %d\n", a[SIZE * SIZE / 3]);
}
Last edited by jahboater on Fri Jun 22, 2018 12:04 pm, edited 1 time in total.

Heater
Posts: 9836
Joined: Tue Jul 17, 2012 3:02 pm

Re: .NET Core - Raspbian or Win10IoT?

Fri Jun 22, 2018 11:15 am

JustAGeek,

Wonderful rant, thanks.
Javascript is amazing...
Yes it is.
...and you should not ever use modern Javascript. Browsers don't like it and it's best to stick with something older.
All the browsers I care about handle recent standards of JS just fine. Node JS handles them just fine.
Javascript in version 6 and later is an incredible language.
This is in direct contradiction of your statement above "should not ever use modern Javascript".

Which do you mean?
Javascript has had probably collectively close to a billion dollars invested purely in a competition of performance which has yielded a runtime environment that made at least 3 separate Javascript engines that generally produce code that is far faster and/or better than any C compiler out there today
I call BS on this statement. The performance of JS engines in recent times is very impressive but I challenge you to present an example of a JS program being faster than C or C++.

You could start with trying a JS version of my noddy little benchmark earlier in this thread:
viewtopic.php?f=34&t=206449&start=50#p1318549
I hate coding in Javascript because the NPM toolbox is a total nightmare.
How so? I have been using npm and all kinds of node modules for years. It has worked very nicely. Of course there are a lot of junk packages out there, hardly npm's fault, don't use those.
Consider for instance modern Meltdown/Spectre related CPU bugs. All the C/C++ code out there had to be recompiled and systems had to have scheduled downtime and there were reboots and verification testing and then additional patches to fix performance on those patches, etc...
Yes, let's consider them... your statement is nonsense. Meltdown/Spectre-style problems are independent of any programming language. Indeed they were first demonstrated using Javascript!
Using a tracing JIT (we don't do that anymore, but we do something similar) it is possible to compile out conditions for specific branches of execution during runtime that can adjust to provide maximum performance whether running on an Atmel processor or on a 28 core Xeon.
I have never had an ATMEL processor where it was even possible to run C#, Java, JS etc.

Heater
Posts: 9836
Joined: Tue Jul 17, 2012 3:02 pm

Re: .NET Core - Raspbian or Win10IoT?

Fri Jun 22, 2018 11:43 am

jahboater,
Please could you try this C version that hasn't had the state variables made "volatile"!
I could not recall if I had done that myself before or not. So I just did it (again):

With volatile:

Code:

$ ./junk
Seed: 938247, 23097423, 52309875, 297340234
Result: 28247
$ time ./junk
Seed: 938247, 23097423, 52309875, 297340234
Result: 28247

real    0m48.383s
user    0m47.984s
sys     0m0.375s
Without volatile:

Code:

$ time ./junk
Seed: 938247, 23097423, 52309875, 297340234
Result: 28247

real    0m23.643s
user    0m23.172s
sys     0m0.422s
So, a tad more than twice as fast without volatile. Rust is not winning yet.

jahboater
Posts: 2928
Joined: Wed Feb 04, 2015 6:38 pm

Re: .NET Core - Raspbian or Win10IoT?

Fri Jun 22, 2018 1:22 pm

Thanks Heater.

Rust is described as a "systems programming language" which is always attractive for me.

They need to do something about the executable size which is 533x larger than the C one (and that is -O3 with no "-s"). If they wrote all the Linux apps in Rust, a lot of hard disks would get sold :)

Now this is interesting. On the Pi I get

Code:

[email protected]:~ $ time ./try
Seed: 938247, 23097423, 52309875, 297340234

thread 'main' panicked at 'attempt to multiply with overflow', try.r:10
note: Run with `RUST_BACKTRACE=1` for a backtrace.
This is at:

let result_starstar = rotl(s[0] * 5, 7) * 9

The C version correctly uses uint32_t where the overflow (wrap) is acceptable.
Maybe result_starstar should be u32 or something.

I see it uses LLVM to do the compilation, like Clang.

On the Pi3+ the executable is only 388x larger than the C one! (2,135,756 bytes).
strip makes it a good bit smaller - only 57x larger.

paddyg
Posts: 2187
Joined: Sat Jan 28, 2012 11:57 am
Location: UK

Re: .NET Core - Raspbian or Win10IoT?

Fri Jun 22, 2018 3:00 pm

Yes, that fixes it: 36s with C vs. 45s with Rust. gcc vs. g++ doesn't make any difference. Rust does have a rotate_left() fn, but it's basically identical code, and the compiler seems to optimize automatically, so #[inline] on rotl() has no effect. Also, rotate_left() takes a u32 argument, so my usize was wrong for the DIY function. It seems to run OK on the RPi with either rotate_left() or rotl(x: u32, k: u32) -> u32. It took 8m50s on my RPi2B (but gets the right answer) - thought it had died!

The unstripped size seems to be only a bit bigger for quite complicated projects, so maybe executable size as well as compile time are targets for improvement. I think the main selling point for Rust is that it's almost impossible to make memory leaks, race conditions or unsafe code.

jahboater
Posts: 2928
Joined: Wed Feb 04, 2015 6:38 pm

Re: .NET Core - Raspbian or Win10IoT?

Fri Jun 22, 2018 4:07 pm

paddyg wrote:
Fri Jun 22, 2018 3:00 pm
It took 8m50s on my RPi2B (but gets the right answer) - thought it had died!
You should get a Pi3B+ :) It took 5m34s (actually just the difference between 900 MHz and 1400 MHz).
The C version was 1m57s on the Pi3+ but, to be fair, that was GCC 8.1 the latest compiler release.

I used "rustc -O" to compile the Rust version.

C is helped in the inner loop by the entire rotl() function being converted to a single "rotate" instruction.

ejolson
Posts: 1896
Joined: Tue Mar 18, 2014 11:47 am

Re: .NET Core - Raspbian or Win10IoT?

Sat Jun 23, 2018 8:47 am

JustAGeek wrote:
Thu Jun 21, 2018 9:21 am
I've been a senior developer on one of the fastest web browsers ever made for years.
Welcome to the forum. Would that web browser be Microsoft Edge?

bensimmo
Posts: 3219
Joined: Sun Dec 28, 2014 3:02 pm
Location: East Yorkshire

Re: .NET Core - Raspbian or Win10IoT?

Sat Jun 23, 2018 8:58 am

That Pi2 will run at 1GHz easily, just look in the default official overclock settings.

jahboater
Posts: 2928
Joined: Wed Feb 04, 2015 6:38 pm

Re: .NET Core - Raspbian or Win10IoT?

Sat Jun 23, 2018 10:03 am

paddyg wrote:
Fri Jun 22, 2018 3:00 pm
I think the main selling point for Rust is that it's almost impossible to make memory leaks, race conditions or unsafe code.
That combined with competitive performance.
By the look of it, it is doing overflow checks too.
Impressive speed considering it's doing all that.

On ARM, Rust's performance is worse compared to C than it was on Intel.
I speculate that the cost of overflow checks being much higher on ARM may be the reason.
(ARM instructions rarely set the overflow flag; importantly, multiply does not, so you have to do the check by hand, which is slower).

paddyg
Posts: 2187
Joined: Sat Jan 28, 2012 11:57 am
Location: UK

Re: .NET Core - Raspbian or Win10IoT?

Sat Jun 23, 2018 1:01 pm

@jahboater, yes, looking at the LLVM object code, it has rather bizarrely converted one rotl() to rol but not the other! Comparing with the gcc version is a bit difficult, but it does look to have possibly reduced the number of mov instructions by doing things in a different order. In the tightest loop LLVM has 44 instructions, cf. gcc at 51, but it has a couple of conditional jumps, presumably for safe code checking, which the C code doesn't. Obviously the actual instructions used differ, which probably has an impact on speed; I notice Rust uses three imul whereas C uses two imul and one mul.

Code:

... rust ...
    7960:	49 8b 51 10          	mov    0x10(%r9),%rdx
    7964:	48 39 c2             	cmp    %rax,%rdx
    7967:	0f 86 be 01 00 00    	jbe    7b2b <[email protected]@Base-0x15855>
    796d:	49 8b 09             	mov    (%r9),%rcx
    7970:	44 8b 14 81          	mov    (%rcx,%rax,4),%r10d
    7974:	41 8b 1c 24          	mov    (%r12),%ebx
    7978:	41 8b 54 24 04       	mov    0x4(%r12),%edx
    797d:	89 d5                	mov    %edx,%ebp
    797f:	c1 e5 09             	shl    $0x9,%ebp
    7982:	41 8b 7c 24 08       	mov    0x8(%r12),%edi
    7987:	31 df                	xor    %ebx,%edi
    7989:	41 8b 4c 24 0c       	mov    0xc(%r12),%ecx
    798e:	31 d1                	xor    %edx,%ecx
    7990:	31 fa                	xor    %edi,%edx
    7992:	41 89 54 24 04       	mov    %edx,0x4(%r12)
    7997:	89 da                	mov    %ebx,%edx
    7999:	31 ca                	xor    %ecx,%edx
    799b:	41 89 14 24          	mov    %edx,(%r12)
    799f:	31 ef                	xor    %ebp,%edi
    79a1:	41 89 7c 24 08       	mov    %edi,0x8(%r12)
    79a6:	c1 c1 0b             	rol    $0xb,%ecx
    79a9:	41 89 4c 24 0c       	mov    %ecx,0xc(%r12)
    79ae:	49 8b 51 10          	mov    0x10(%r9),%rdx
    79b2:	48 39 c2             	cmp    %rax,%rdx
    79b5:	0f 86 5f 01 00 00    	jbe    7b1a <[email protected]@Base-0x15866>
    79bb:	8d 0c 9b             	lea    (%rbx,%rbx,4),%ecx
    79be:	c1 e9 19             	shr    $0x19,%ecx #<<<<< why not use rol again? 
    79c1:	c1 e3 07             	shl    $0x7,%ebx
    79c4:	8d 14 9b             	lea    (%rbx,%rbx,4),%edx
    79c7:	09 ca                	or     %ecx,%edx
    79c9:	8d 0c d2             	lea    (%rdx,%rdx,8),%ecx
    79cc:	49 8b 11             	mov    (%r9),%rdx
    79cf:	41 6b fa 2a          	imul   $0x2a,%r10d,%edi
    79d3:	01 cf                	add    %ecx,%edi
    79d5:	89 f9                	mov    %edi,%ecx
    79d7:	c1 e9 05             	shr    $0x5,%ecx
    79da:	48 69 c9 c5 5a 7c 0a 	imul   $0xa7c5ac5,%rcx,%rcx
    79e1:	48 c1 e9 27          	shr    $0x27,%rcx
    79e5:	69 c9 a0 86 01 00    	imul   $0x186a0,%ecx,%ecx
    79eb:	29 cf                	sub    %ecx,%edi
    79ed:	89 3c 82             	mov    %edi,(%rdx,%rax,4)
    79f0:	48 8d 48 01          	lea    0x1(%rax),%rcx
    79f4:	48 89 c8             	mov    %rcx,%rax
    79f7:	48 81 f9 10 27 00 00 	cmp    $0x2710,%rcx
    79fe:	0f 82 5c ff ff ff    	jb     7960 <[email protected]@Base-0x15a20>
    ...
    
    ... C ...
  400510:	44 8b 1d 29 0b 20 00 	mov    0x200b29(%rip),%r11d        # 601040 <s>
  400517:	8b 15 27 0b 20 00    	mov    0x200b27(%rip),%edx        # 601044 <s+0x4>
  40051d:	48 83 c6 04          	add    $0x4,%rsi
  400521:	8b 2d 19 0b 20 00    	mov    0x200b19(%rip),%ebp        # 601040 <s>
  400527:	8b 0d 1b 0b 20 00    	mov    0x200b1b(%rip),%ecx        # 601048 <s+0x8>
  40052d:	6b 46 fc 2a          	imul   $0x2a,-0x4(%rsi),%eax
  400531:	c1 e2 09             	shl    $0x9,%edx
  400534:	31 e9                	xor    %ebp,%ecx
  400536:	89 0d 0c 0b 20 00    	mov    %ecx,0x200b0c(%rip)        # 601048 <s+0x8>
  40053c:	8b 2d 02 0b 20 00    	mov    0x200b02(%rip),%ebp        # 601044 <s+0x4>
  400542:	8b 0d 04 0b 20 00    	mov    0x200b04(%rip),%ecx        # 60104c <s+0xc>
  400548:	31 e9                	xor    %ebp,%ecx
  40054a:	89 0d fc 0a 20 00    	mov    %ecx,0x200afc(%rip)        # 60104c <s+0xc>
  400550:	8b 2d f2 0a 20 00    	mov    0x200af2(%rip),%ebp        # 601048 <s+0x8>
  400556:	8b 0d e8 0a 20 00    	mov    0x200ae8(%rip),%ecx        # 601044 <s+0x4>
  40055c:	31 e9                	xor    %ebp,%ecx
  40055e:	89 0d e0 0a 20 00    	mov    %ecx,0x200ae0(%rip)        # 601044 <s+0x4>
  400564:	8b 2d e2 0a 20 00    	mov    0x200ae2(%rip),%ebp        # 60104c <s+0xc>
  40056a:	8b 0d d0 0a 20 00    	mov    0x200ad0(%rip),%ecx        # 601040 <s>
  400570:	31 e9                	xor    %ebp,%ecx
  400572:	89 0d c8 0a 20 00    	mov    %ecx,0x200ac8(%rip)        # 601040 <s>
  400578:	8b 0d ca 0a 20 00    	mov    0x200aca(%rip),%ecx        # 601048 <s+0x8>
  40057e:	31 ca                	xor    %ecx,%edx
  400580:	89 15 c2 0a 20 00    	mov    %edx,0x200ac2(%rip)        # 601048 <s+0x8>
  400586:	8b 15 c0 0a 20 00    	mov    0x200ac0(%rip),%edx        # 60104c <s+0xc>
  40058c:	c1 c2 0b             	rol    $0xb,%edx
  40058f:	89 15 b7 0a 20 00    	mov    %edx,0x200ab7(%rip)        # 60104c <s+0xc>
  400595:	43 8d 14 9b          	lea    (%r11,%r11,4),%edx
  400599:	c1 c2 07             	rol    $0x7,%edx
  40059c:	8d 0c d2             	lea    (%rdx,%rdx,8),%ecx
  40059f:	01 c1                	add    %eax,%ecx
  4005a1:	89 ca                	mov    %ecx,%edx
  4005a3:	c1 ea 05             	shr    $0x5,%edx
  4005a6:	89 d0                	mov    %edx,%eax
  4005a8:	41 f7 e0             	mul    %r8d
  4005ab:	c1 ea 07             	shr    $0x7,%edx
  4005ae:	69 d2 a0 86 01 00    	imul   $0x186a0,%edx,%edx
  4005b4:	29 d1                	sub    %edx,%ecx
  4005b6:	89 4e fc             	mov    %ecx,-0x4(%rsi)
  4005b9:	48 39 fe             	cmp    %rdi,%rsi
  4005bc:	0f 85 4e ff ff ff    	jne    400510 <main+0x70>
...
also https://groups.google.com/forum/?hl=en-GB&fromgroups=#!forum/pi3d

jahboater
Posts: 2928
Joined: Wed Feb 04, 2015 6:38 pm

Re: .NET Core - Raspbian or Win10IoT?

Sat Jun 23, 2018 1:23 pm

Interesting.

I see the ARM version does a "ror" (rotate right) instead of a rotate left!!

The Intel stuff is hard to predict for speed ...
Register "mov" instructions on Intel are usually free (they are eliminated).
Multiply is pretty fast - around 3 clocks or so. Test/cond jmp are fused into single operations, as now are
the likes of add/cond jump. Of course things like "sub r,r" or "xor r,r" have zero latency - never enter the pipeline.

This is easier to read - inner loop annotated by gcc 8.1 (-O3 -S -fverbose-asm) :-

Code: Select all

.L3:
# try.c:53:     a[(h * SIZE) + w] = ((a[(h * SIZE) + w] * 42)  + next()) % 100000;
    imull   $42, (%rsi), %ecx   #, MEM[base: _123, offset: 0B], tmp136
# try.c:25:     const uint32_t result_starstar = rotl(s[0] * 5, 7) * 9;
    leal    (%rdi,%rdi,4), %eax #, tmp131
# try.c:26:     const uint32_t t = s[1] << 9;
    movl    %r10d, %edx # s_I_lsm.1, t
# try.c:19:     return (x << k) | (x >> (32 - k));
    roll    $7, %eax    #, tmp132
# try.c:28:     s[2] ^= s[0];
    xorl    %edi, %r8d  # s_I_lsm.0, _41
# try.c:26:     const uint32_t t = s[1] << 9;
    sall    $9, %edx    #, t
# try.c:25:     const uint32_t result_starstar = rotl(s[0] * 5, 7) * 9;
    leal    (%rax,%rax,8), %eax #, tmp135
# try.c:29:     s[3] ^= s[1];
    xorl    %r10d, %r9d # s_I_lsm.1, _43
# try.c:30:     s[1] ^= s[2];
    xorl    %r8d, %r10d # _41, s_I_lsm.1
# try.c:53:             a[(h * SIZE) + w] = ((a[(h * SIZE) + w] * 42)  + next()) % 100000;
    addl    %eax, %ecx  # tmp135, tmp138
# try.c:33:     s[2] ^= t;
    xorl    %edx, %r8d  # t, s_I_lsm.2
    addq    $4, %rsi    #, ivtmp.12
# try.c:53:                 a[(h * SIZE) + w] = ((a[(h * SIZE) + w] * 42)  + next()) % 100000;
    movl    %ecx, %edx  # tmp138, tmp139
# try.c:31:     s[0] ^= s[3];
    xorl    %r9d, %edi  # _43, s_I_lsm.0
# try.c:19:     return (x << k) | (x >> (32 - k));
    roll    $11, %r9d   #, s_I_lsm.3
# try.c:53:                 a[(h * SIZE) + w] = ((a[(h * SIZE) + w] * 42)  + next()) % 100000;
    shrl    $5, %edx    #, tmp139
    movl    %edx, %eax  # tmp139, tmp139
    mull    %ebp    # tmp141
    shrl    $7, %edx    #, tmp142
    imull   $100000, %edx, %edx #, tmp142, tmp143
    subl    %edx, %ecx  # tmp143, tmp145
    movl    %ecx, -4(%rsi)  # tmp145, MEM[base: _123, offset: 0B]
# try.c:51:             for (int w = 0; w < SIZE; w++)
    cmpq    %r11, %rsi  # ivtmp.21, ivtmp.12
    jne .L3 #,
24 instructions, three multiplies.
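
The shift, multiply, shift, multiply, subtract sequence at the end of the loop is how both compilers avoid a divide instruction for "% 100000". A rough JavaScript sketch of the same arithmetic, assuming unsigned 32-bit inputs and reusing the constant 0xa7c5ac5 and the total shift of 39 bits (0x27) that are visible in the Rust listing earlier in the thread. The names `divBy100000` and `mod100000` are made up for this illustration, and BigInt stands in for the 64-bit product:

```javascript
// 100000 = 32 * 3125, so first shift right by 5, then divide by 3125
// using multiply-and-shift with MAGIC = ceil(2^39 / 3125).
const MAGIC = 0xa7c5ac5n;

function divBy100000(x) {
    const n = BigInt(x >>> 5);            // x / 32, as an unsigned value
    return Number((n * MAGIC) >> 39n);    // (x / 32) / 3125
}

function mod100000(x) {
    // remainder = x - quotient * divisor, as in the imull/subl pair
    return x - divBy100000(x) * 100000;
}

console.log(mod100000(123456789), 123456789 % 100000);
```

The multiply by 100000 and the subtract then turn the quotient back into a remainder, which accounts for two of the three multiplies counted above.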

Edit: I don't think ARM has a rol instruction.
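
A rotate left by k bits gives the same result as a rotate right by 32 - k bits, so a single ror instruction covers both cases. A small sketch of that identity (the `rotr` helper is added here purely for illustration, and `>>> 0` only forces an unsigned result so the two values can be compared directly):

```javascript
function rotl(x, k) {
    return ((x << k) | (x >>> (32 - k))) >>> 0;
}

// Illustrative counterpart, not part of the benchmark code.
function rotr(x, k) {
    return ((x >>> k) | (x << (32 - k))) >>> 0;
}

const x = 0x12345678;
console.log(rotl(x, 8).toString(16));    // 34567812
console.log(rotr(x, 24).toString(16));   // 34567812
```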

Interesting how both languages changed the "for( int w = 0; w < SIZE; w++ )" into a "do while" which is faster.
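
This transformation is usually called loop inversion: the test moves to the bottom of the loop, so each iteration executes a single conditional branch rather than a test at the top plus a jump back. A hand-written sketch of the two shapes (the function names are illustrative only; the compiler does this on its own and can drop the initial guard when it knows the trip count is greater than zero):

```javascript
// Loop as written in the source: test at the top.
function sumFor(a) {
    let total = 0;
    for (let w = 0; w < a.length; w++) {
        total += a[w];
    }
    return total;
}

// Inverted form: one guard up front, then the test at the bottom.
function sumDoWhile(a) {
    let total = 0;
    let w = 0;
    if (w < a.length) {
        do {
            total += a[w];
            w++;
        } while (w < a.length);
    }
    return total;
}

console.log(sumFor([1, 2, 3]), sumDoWhile([1, 2, 3]));
```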

Heater
Posts: 9836
Joined: Tue Jul 17, 2012 3:02 pm

Re: .NET Core - Raspbian or Win10IoT?

Sat Jun 23, 2018 6:47 pm

JustAGeek does not seem to be up to my little JS speed challenge, so I thought I'd do it myself just for giggles.

C version:

Code: Select all

$ time ./junk1
Seed: 938247, 23097423, 52309875, 297340234
Result: -73449

real    0m24.968s
user    0m24.453s
sys     0m0.453s
JS version:

Code: Select all

$ time node junk1.js
Seed:  938247 23097423 52309875 297340234
Result: -73449

real    2m22.064s
user    2m21.656s
sys     0m0.406s
So, the JS is about 6 times slower than the C version (roughly 142 s against 25 s). This is a new hand-written JS version. The previous JS results were from JS compiled from C with Emscripten, which performed much better at only about half the speed of C!

Interestingly they both use about the same memory when running:

C version - 4.7% (Of my 4 GB machine)
JS version - 4.9%

On the other hand the stripped C binary is 6 times bigger than the JS !

Here is the current code, slightly tweaked from previous versions so results may differ:

The C version:

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define SIZE  10000
#define ITERS 100

static uint32_t s[4] = {938247, 23097423, 52309875, 297340234};

static inline uint32_t rotl(const uint32_t x, int k) {
    return (x << k) | (x >> (32 - k));
}

static uint32_t next(void) {
    const uint32_t result_starstar = rotl(s[0] * 5, 7) * 9;
    const uint32_t t = s[1] << 9;

    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];

    s[2] ^= t;

    s[3] = rotl(s[3], 11);

    return result_starstar;
}

int main( void )
{
    int32_t * const a = (int32_t*)calloc(SIZE * SIZE, sizeof(int));

    printf("Seed: %d, %d, %d, %d\n", s[0], s[1], s[2], s[3]);

    for (int iters = 0; iters < ITERS; ++iters)
    {
        // Multiply every element by 42, add a random amount and take the modulus 100000
        for (int h = 0; h < SIZE; ++h)
        {
            for (int w = 0; w < SIZE; ++w)
            {
                a[(h * SIZE) + w] = (((a[(h * SIZE) + w]) * 42) + (int32_t)next()) % 100000;
            }
        }
    }

    // Print some element from somewhere
    printf("Result: %d\n", a[SIZE * SIZE / 3]);
}
The JS version:

Code: Select all

const SIZE = 10000
const ITERS = 100

const s = new Uint32Array([938247, 23097423, 52309875, 297340234]);

function rotl(x, k) {
    return (x << k) | (x >>> (32 - k));
}

function next() {
    const result_starstar = (rotl(s[0] * 5, 7) * 9)|0;
    const t = s[1] << 9;

    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];

    s[2] ^= t;

    s[3] = rotl(s[3], 11);
    return result_starstar;
}

const main = (function()
{
    let a = new Int32Array(SIZE * SIZE);
    console.log('Seed: ', s[0], s[1], s[2], s[3]);

    for (let iters = 0; iters < ITERS; ++iters)
    {
        // Multiply every element by 42, add a random amount and take the modulus 100000
        for (let h = 0; h < SIZE; ++h)
        {
            for (let w = 0; w < SIZE; ++w)
            {
                a[(h * SIZE) + w] = (((a[(h * SIZE) + w]) * 42) + next()) % 100000;
            }
        }
    }

    // Print some element from somewhere
    console.log('Result:', a[(SIZE * SIZE / 3)|0]);
}());
Last edited by Heater on Sun Jun 24, 2018 11:44 am, edited 2 times in total.

jahboater
Posts: 2928
Joined: Wed Feb 04, 2015 6:38 pm

Re: .NET Core - Raspbian or Win10IoT?

Sun Jun 24, 2018 7:51 am

Just curiosity, what does " )|0 " do in the JS version?

I think you should have left rotl() as a function. :)

Heater
Posts: 9836
Joined: Tue Jul 17, 2012 3:02 pm

Re: .NET Core - Raspbian or Win10IoT?

Sun Jun 24, 2018 9:00 am

jahboater
Just curiosity, what does " )|0 " do in the JS version?
A very good question. Well spotted.

It's a little bit of, perhaps not so well known, Javascript magic.

In JavaScript all numbers are 64-bit floating point quantities. IEEE 754 and all that. Which is good because you have a huge real number range. Also, if you are working in integers you get exact integer values up to 2^53.

But that can cause problems. For example, in my code I index an array with SIZE * SIZE / 3. Which is 10000 * 10000 / 3 in this case, or 33333333.333333332. Well, you can't index an array with a non-integer value. One could use Math.floor or some such to round it to an integer.

But it turns out that if you perform bitwise operations on numbers in JS the result is always truncated to a 32-bit signed integer (">>>" being the exception, it gives an unsigned 32-bit result). Which makes sense, as bitwise ops are going to get done using integer instructions and JS comes from a time of 32-bit machines.

So, a shorthand and performant way to truncate to an int is to use a bitwise operation, for example "|0". Which does nothing but produce a 32-bit result.
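
A few one-liners make the behaviour concrete. This is only a sketch; `toInt32` is a name invented here for the "|0" idiom. Note that "|0" truncates toward zero and wraps at 32 bits, so it matches Math.floor only for non-negative values that fit in 31 bits:

```javascript
// Illustrative wrapper around the "|0" idiom.
function toInt32(x) {
    return x | 0;
}

const SIZE = 10000;
const index = SIZE * SIZE / 3;          // not an integer

console.log(toInt32(index));            // 33333333
console.log(Math.floor(index));         // 33333333
console.log(toInt32(-7.9));             // -7 (truncates toward zero)
console.log(Math.floor(-7.9));          // -8 (rounds down)
console.log(toInt32(2 ** 32 + 5));      // 5 (only the low 32 bits survive)
console.log(toInt32(2 ** 31));          // -2147483648 (wraps to negative)
```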

Modern day JS engines optimize this kind of integer JS code very well. Which is why Emscripten can transpile C into JS and the result is only a factor of 2 or 3 slower.

This trick is especially important here in the random number generator which is designed to work on unsigned 32 bit integers. That "|0" ensures the result of the multiply by 9 does not exceed 32 bits and wraps around as it should.
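
As a rough illustration of that wrap-around (the helper name `times9Wrapped` is made up here): the product is first computed exactly as a double, which is safe because it stays far below 2^53, and "|0" then keeps only the low 32 bits. The built-in Math.imul performs the 32-bit multiply directly and gives the same answer. The result is a signed value, but its bit pattern is the same as the unsigned one the generator expects.

```javascript
// Illustrative helper: multiply by 9 and keep the low 32 bits.
function times9Wrapped(x) {
    return (x * 9) | 0;
}

const big = 0x7fffffff;
console.log(big * 9);               // 19327352823, needs more than 32 bits
console.log(times9Wrapped(big));    // 2147483639, the low 32 bits
console.log(Math.imul(big, 9));     // 2147483639, built-in 32-bit multiply
```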

Note also that I used ">>>" instead of ">>" to ensure a logical shift right, not arithmetic.
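
The difference only shows up when the top bit is set. A short sketch (`rotlWrong` is a deliberately broken variant written for this comparison, it is not from the posted code):

```javascript
console.log(-8 >> 1);     // -4, the sign bit is copied in
console.log(-8 >>> 1);    // 2147483644, a zero is shifted in

function rotl(x, k) {
    return (x << k) | (x >>> (32 - k));
}

// Same as rotl but with an arithmetic shift: sign bits leak into the result.
function rotlWrong(x, k) {
    return (x << k) | (x >> (32 - k));
}

console.log(rotl(0x80000000, 1));       // 1
console.log(rotlWrong(0x80000000, 1));  // -1
```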

Note also my use of a JS typed array "new Int32Array(SIZE * SIZE)". This is a relatively new JS feature that also speeds things up and saves memory space.
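
A small sketch of what a typed array guarantees, using only standard typed array properties: every element occupies a fixed 4 bytes, and every store is converted to a 32-bit integer, so the engine never has to guess the element type.

```javascript
const a = new Int32Array(4);
a[0] = 2 ** 31;     // out of range for int32, wraps on store
a[1] = 3.7;         // fraction is dropped on store
a[2] = -1;

// View the same bytes as unsigned 32-bit values.
const u = new Uint32Array(a.buffer);

console.log(a[0]);                  // -2147483648
console.log(a[1]);                  // 3
console.log(u[2]);                  // 4294967295
console.log(a.BYTES_PER_ELEMENT);   // 4
console.log(a.byteLength);          // 16
```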
I think you should have left rotl() as a function.
I was wondering about that. I had this idea that JS would not inline such a thing and the function call overhead would kill performance. So I inlined it by hand in the C version on the way to creating the JS version.

If I have a minute I'll put rotl() back again and see what impact it has.

jahboater
Posts: 2928
Joined: Wed Feb 04, 2015 6:38 pm

Re: .NET Core - Raspbian or Win10IoT?

Sun Jun 24, 2018 9:19 am

That's very interesting, I did wonder how all that sort of thing was done.

FYI, NEON can do shifts on 64-bit integers and a "double" can be treated as an integer in NEON of course.
Which means that Javascript's << etc should be pretty fast. Also NEON is great at floating-point rounding and conversions and can likely do the "|0" thing in a single instruction.

It seems an understanding of the bits and bytes is still needed, even in a higher level language.

Heater
Posts: 9836
Joined: Tue Jul 17, 2012 3:02 pm

Re: .NET Core - Raspbian or Win10IoT?

Sun Jun 24, 2018 12:10 pm

jahboater,

Just for you I put the rotl() back into both the C and JS versions I posted above (the post is updated).

Amazingly JS performance did not suffer very much at all from introducing that function call.

However, I also made the state variable array into a JS typed array. Which shaved 7 seconds off its run time. So the code now is even faster!

I have no idea how well any of this works on a Pi. I don't have one to hand at the moment.

From what I understand of modern-day JS engines, they will try to optimize "hot" functions at run time. If a function only ever sees numbers for its parameters it will get optimized for numbers. If they are 32-bit numbers it will get optimized to use integer arithmetic. And so on.

As such the "|0" should end up not even being compiled into any code at all. After all it does nothing for 32 bit ints. It is only a message to the compiler telling it that is the type you want.
It seems an understanding of the bits and bytes is still needed, even in a higher level language.
Yep.

At least in JS you know what your numbers are, how many bits they have, and how operations will behave on any platform. As opposed to languages like C and C++, where it is a lottery because so many things are "implementation defined".

Then there is the whole floating point fiasco. To quote the common example in JS:

Code: Select all

> 0.3 + 0.3 + 0.3 == 0.9
false
Which is not a JS problem as such, it's common to all languages using IEEE floats.
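
The usual workaround is to compare with a tolerance instead of "==". A minimal sketch, assuming a relative tolerance of a few machine epsilons is acceptable for the values involved (`nearlyEqual` and its default tolerance are choices made for this example, not a standard function):

```javascript
// Illustrative helper: relative comparison of two doubles.
function nearlyEqual(a, b, relTol = 4 * Number.EPSILON) {
    return Math.abs(a - b) <= relTol * Math.max(Math.abs(a), Math.abs(b));
}

const sum = 0.3 + 0.3 + 0.3;
console.log(sum === 0.9);            // false
console.log(nearlyEqual(sum, 0.9));  // true
```

The right tolerance depends on how much rounding error the preceding arithmetic can accumulate, so the default here is only a starting point.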
