
Re: Why moving to 64bit?

Posted: Mon Sep 02, 2019 3:44 am
by jdonald
Paeryn wrote:
Mon Sep 02, 2019 2:12 am
The representation that CPython uses is PyLong which divides the integer into an array of 30-bit integers (yes 30, not 32) and if more than one word is needed then the absolute value is stored (the sign is encoded in the length of the integer).
Indeed. Looking closer at PyLong's constituent datatypes, it does use some ambiguous unsigned long / long types: ... r.h#L44-56 . But apparently that #elif clause applies only when PYLONG_BITS_IN_DIGIT == 15, which I've confirmed only gets set that way in 32-bit builds.

The unnecessary uses of long that first raised my suspicions are ones like this: ... g_fromlong
It'll even do return PyLong_FromLong(-1); because there is no PyLong_FromInt() function. I don't think these calls can be inlined unless link-time code generation is now a thing on Linux.

But this would be a few extra registers in a limited number of places, and cannot explain the double-digit performance losses.

Having seen the codebase now, I'm realizing that every tiny object is heap-allocated and handled through a PyObject *. It makes sense that doubling the pointer width adds overhead to programs that on the surface appeared compute-bound. In fact it probably affects such programs even more if they do many small char or integer math operations.

Re: Why moving to 64bit?

Posted: Sun Sep 08, 2019 4:29 am
by ejolson
jdonald wrote:
Sun Sep 01, 2019 3:33 pm
Heater, your Rust tests were ARMv6 baselined and thus invalid. Please see my posts above and rerun your tests.
How do you specify ARMv7 and with Cortex-A53 tuning using the Rust compiler?

I have rerun and updated the Python3 timings for the anagram programs here. The performance loss when moving from 32-bit Raspbian to 64-bit Gentoo, at only 16 to 17 percent, is not as significant as it used to be. I suspect Python on Raspbian is also compiled to be backward compatible with the ARMv6 instruction set, so optimal 32-bit performance numbers may be noticeably better than what was used in that comparison.

Re: Why moving to 64bit?

Posted: Sun Sep 08, 2019 5:41 am
by jdonald
ejolson wrote:
Sun Sep 08, 2019 4:29 am
How do you specify ARMv7 and with Cortex-A53 tuning using the Rust compiler?
It's tricky with rustc. I provided guidelines for adding such args a few pages back. Let us know if that gets you anywhere.
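For reference, something along these lines should work (a hypothetical invocation, assuming the 32-bit target is installed via rustup):

```shell
# Add the 32-bit ARMv7 hard-float target once:
rustup target add armv7-unknown-linux-gnueabihf

# Build targeting ARMv7 with Cortex-A53 tuning:
RUSTFLAGS="-C target-cpu=cortex-a53" \
  cargo build --release --target armv7-unknown-linux-gnueabihf

# To see which CPU names the compiler accepts for that target:
rustc --print target-cpus --target armv7-unknown-linux-gnueabihf
```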

Re: Why moving to 64bit?

Posted: Sun Sep 08, 2019 7:09 am
by Heater
Maybe put the loop in the rust program itself to minimize the system overhead?
A very good idea for benchmarking; in fact I have had such timing code in the anagram and other challenge programs for a long time. There are a couple of problems with the idea:

1) It hides the load and start-up times of the programs. That could be significant for the interpreted-language solutions, since they parse the source at start-up. It can also hide the actual output time, which is often very significant.

2) For things like JS with its JIT engine, the code will get faster and faster as it is looped. The JIT engine learns how to optimize it dynamically at run time as it iterates.

All of this means that what we want is not a typical benchmark with many loops over some algorithm, but rather the actual user-observed run time of a single run, from command to result.

As an example, in the case of the Rust anagram finder I put a loop around the actual anagram-finder function, thus timing only the algorithm and excluding the time taken to read the dictionary file and print the output. The result is dramatic:

Code:

$ cargo run  --bin insane-british-anagram --release > /dev/null
   Compiling insane-british-anagram v0.1.3 (/mnt/c/Users/heater/conveqs/insane-british-anagram-rust)
    Finished release [optimized] target(s) in 3.62s
     Running `target/release/insane-british-anagram`
Execution time: 301ms
Execution time: 50ms
Execution time: 51ms
Execution time: 50ms
The second iteration is massively faster than the first. Given that the use case is to run the program just once, a timing loop like this would badly bias the result.

Why does it get so much faster?

Not sure, really. I suspect that the dictionary file, read once at start-up, is not actually read from disk at that time. Rather, Linux lazily reads in pages as the anagram loop accesses them, slowing down the first iteration.

Also the memory allocator I am using is very good at not giving memory back to the OS prematurely.

I would be very interested to see if the C version of the anagram finder also speeds up like this when iterated many times in the same run.

Source here for anyone who wants to play: ... m-rust.git

Re: Why moving to 64bit?

Posted: Sun Sep 08, 2019 12:43 pm
by pica200
There is another factor. Your code easily fits in the L2 cache, and partly in the L1 cache, bypassing the overhead of loading the code again from the slow DRAM. It also caches a good chunk of the input data. For small programs working with small data sets this works fine, but the L2 cache is way too small to compensate for the halved DRAM bus. I bet the A72 would do significantly better without these limitations.

Re: Why moving to 64bit?

Posted: Sun Sep 08, 2019 2:30 pm
by cyclic
Is zfs a reason for needing 64bit?

Re: Why moving to 64bit?

Posted: Sun Sep 08, 2019 3:02 pm
by Heater

Certainly caching can have a huge effect on performance. Mostly we worry about making data cache friendly but code layout can also be unfriendly to the instruction caching.

As you say, this is a small program and the data is not so big so the impact of cache misses is probably not so significant.

I think the impact of instruction cache misses is negligible here. The loop in that code is small and fits in cache; once it has been around the first time there will be no more instruction cache misses. It goes around many times, so the cache loading time is amortized to near zero.

I still feel there is something about the file access that makes that huge difference between the first run and the second. The file is not actually read from disk to memory in that read statement. Rather, the blocks are fetched later, when the algorithm accesses the memory they are mapped to. So the first run carries all the overhead of doing the actual disk reads.

Anyway, I gave up trying to optimize that program further when I saw this timing result. If those numbers are true then there is not much more I can do.


Quite possibly. ZFS sounds like a great idea.

Re: Why moving to 64bit?

Posted: Sun Sep 08, 2019 5:58 pm
by pica200
fread() does a bit of caching internally as well, if I recall correctly. read(), which is basically just a syscall wrapper, only goes through the kernel's filesystem cache. You could use read() and give the kernel hints that you want non-cached reads, for repeatable, realistic results.

Re: Why moving to 64bit?

Posted: Tue Sep 17, 2019 12:09 am
by ejolson
jdonald wrote:
Tue Aug 13, 2019 4:54 am
Once I added -mcpu=cortex-a72, sysbench gets 10x faster in the 32-bit case to become on par with aarch64.
I have tried to reproduce this result here but was unable to. My understanding is that the Cortex-A72 running in 32-bit mode does not have any 64-bit registers and so can't perform 64-bit divisions directly. Did you change the source code so that the 64-bit integers given by "unsigned long long" appear as "unsigned long" 32-bit integers everywhere?

Re: Why moving to 64bit?

Posted: Tue Sep 17, 2019 1:14 am
by jdonald
Thanks for investigating. Today I have been trying to reproduce my earlier result in that same Debian armhf container and have been unable to. I'm positive I didn't modify any of the C source code when examining this last time.

At this point, my best guess as to what happened last month is that I misread numbers on my screen, then drew incorrect "aha" conclusions after seeing that objdump -d sysbench initially lacked udiv instructions and then contained them when compiled for the newer CPU core. Back then I had grepped the assembly of the sysbench binary as a whole, not specifically cpu_execute_event().

So for now I think it's safe to go on the record saying that sysbench --test=cpu is still an order of magnitude faster when compiled for 64-bit, no matter how the 32-bit baseline is tuned. I'll update the other related threads with more details soon.