Hi everybody,
I'm working on a very "simple" thing that wasn't mentioned in the previous posts, and for which I believe running on a 32-bit OS is significantly reducing my performance:
64 bits "double" floating point calculation (into a 3D based point scanning / reconstruction application). I got 2 distance sensors on top of mechanical precision arms, with 250 Hz sampling for the arm (not the CPU

), and 2 KHz for both of the optical distance sensors.
Recording the values, and high-precision timestamping of the values (which one came before or after which one), is fast.
Recording to RAM requires almost no CPU power (0% according to the LXDE usage monitor, 13.5% out of 400% (4 cores) according to the "top" command), and very little RAM in the end.
The multi-core architecture made it possible for me to multi-thread things so that recording is really fast / real-time (no pauses/hangs from main-thread activity). Knowing the fixed sampling period of the sensors also makes it possible to "arrange" the timestamps in a way that perfectly recovers the real-time precision (data can only arrive more or less late once received, but in reality the physical time between samples was constant). The time offset between the robot and the sensors is also taken into account. (Just to say: it works super fine.)
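For context, the recording side is organised roughly like this (a simplified sketch; the Sample struct and names are made up for the example, not my actual code): each acquisition thread writes into a preallocated buffer, so nothing blocks or allocates during the scan.
[code]
// Simplified sketch of the recording side (illustrative names, not my actual code):
// acquisition threads push samples into a preallocated buffer, no locks, no allocation.
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

struct Sample {
    double value;        // raw distance reading
    double timestamp_s;  // host time at reception (corrected later)
};

class Recorder {
public:
    explicit Recorder(std::size_t capacity) : buffer_(capacity), count_(0) {}

    // Called from an acquisition thread.
    void push(double value, double timestamp_s) {
        std::size_t i = count_.fetch_add(1, std::memory_order_relaxed);
        if (i < buffer_.size())
            buffer_[i] = Sample{value, timestamp_s};
    }

    std::size_t size() const { return std::min(count_.load(), buffer_.size()); }
    const Sample& operator[](std::size_t i) const { return buffer_[i]; }

private:
    std::vector<Sample> buffer_;
    std::atomic<std::size_t> count_;
};
[/code]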
But: 3D reconstruction of all the recorded data is too slow. It's a big series of different "for" loops in which the "sensor 1" and "sensor 2" positions and orientations are calculated from the robot position (and angles) and the tool coordinates (offsets and cos/sin calculations). After that, for each distance measurement, we create an "intermediate robot-sensor position", which gives us several measurement points between 2 received robot positions. On those values, we apply the optical distance measurement along the correct XYZ orientation (using cos, sin, and offsets) as an offset to the robot-sensor position.
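To give an idea of the kind of per-sample math involved, here is a very simplified sketch (2D case for brevity; the names and tool geometry are made up for the example, not my real code):
[code]
// Very simplified sketch of the per-sample reconstruction work (2D case,
// illustrative names and geometry, not my real code).
#include <cmath>

struct Pose  { double x, y, theta; };   // robot flange position + angle
struct Point { double x, y; };

// Sensor pose from robot pose and fixed tool offsets (cos/sin rotation).
Pose sensorPose(const Pose& robot, double offX, double offY, double offTheta) {
    double c = std::cos(robot.theta), s = std::sin(robot.theta);
    return { robot.x + c * offX - s * offY,
             robot.y + s * offX + c * offY,
             robot.theta + offTheta };
}

// Intermediate pose between two robot samples, for a distance measurement
// taken at fraction t (0..1) of the robot sampling period.
Pose interpolate(const Pose& a, const Pose& b, double t) {
    return { a.x + t * (b.x - a.x),
             a.y + t * (b.y - a.y),
             a.theta + t * (b.theta - a.theta) };
}

// Project the optical distance along the sensor orientation to get the point.
Point reconstruct(const Pose& sensor, double distance) {
    return { sensor.x + distance * std::cos(sensor.theta),
             sensor.y + distance * std::sin(sensor.theta) };
}
[/code]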
Before that we also run the "timestamp correction" loop, which again involves double calculations on period/frequency, comparisons, and finding the most "on time" values at the beginning and at the end of the scan, so that the initial time and the ideal frequency can be applied to overwrite the "raw" timestamps.
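The idea behind that correction is basically this (again a simplified sketch with made-up names; the real loop also uses the end of the scan to pin down the frequency, but the principle is the same):
[code]
// Simplified sketch of the timestamp correction (illustrative names):
// since the sensor sampling period is fixed, raw reception timestamps are
// replaced by an evenly spaced time base anchored on the scan start.
#include <cstddef>
#include <vector>

void correctTimestamps(std::vector<double>& timestamps_s, double sampleRateHz) {
    if (timestamps_s.empty()) return;
    const double period = 1.0 / sampleRateHz;   // e.g. 1/2000 s for the optical sensors
    const double t0 = timestamps_s.front();     // assumed "on time" start of the scan
    for (std::size_t i = 0; i < timestamps_s.size(); ++i)
        timestamps_s[i] = t0 + static_cast<double>(i) * period;
}
[/code]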
For a 5-second scan, reconstruction is really fast on a modern computer (~0.5 seconds?) but takes more than 10 seconds on the Raspberry Pi 3 B+, which is disappointing (we would have liked reception and reconstruction to run outside the GUI Windows-based PC, so the RPi seemed perfect for those programs).
Functions have been kept really clean to keep every step simple, robot position calculations are skipped when the measured value is "nothing" in order to save CPU time, and I'm reaching the limit of what I'm able to improve... before realising that "double" is 64 bits.
My point is:
It's not impossible that the code could be "improved" by experts in assembler, register access optimisations, or splitting the "for" loops differently with CPU pipelining in mind... but I'll have to compare results and performance with 32-bit "float" to see how much it changes (see the benchmark sketch below). Which is too bad for a 64-bit board.
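Something like this is what I have in mind for that comparison (a minimal benchmark sketch of the kind of cos/sin + multiply-add loop the reconstruction is made of, not the real workload):
[code]
// Minimal double-vs-float timing sketch (not my real workload).
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

template <typename Real>
double runLoop(std::size_t n) {
    std::vector<Real> angles(n);
    for (std::size_t i = 0; i < n; ++i)
        angles[i] = static_cast<Real>(i) * static_cast<Real>(0.001);

    auto start = std::chrono::steady_clock::now();
    Real acc = 0;
    for (Real a : angles)
        acc += std::cos(a) * Real(1.5) + std::sin(a) * Real(0.5);
    auto stop = std::chrono::steady_clock::now();

    std::printf("result %f\n", static_cast<double>(acc));  // keep the loop from being optimised away
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    const std::size_t n = 10000000;
    std::printf("double: %.3f s\n", runLoop<double>(n));
    std::printf("float : %.3f s\n", runLoop<float>(n));
}
[/code]
Compiled with the same optimisation flags as the real program (e.g. -O2) to get a meaningful comparison on the Pi.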
I also noted every 64-bit project I've seen on this forum:
https://wiki.ubuntu.com/ARM/RaspberryPi ... ISO_images
https://github.com/bamarni/pi64
https://wiki.gentoo.org/wiki/Raspberry_ ... it_Install
https://github.com/Crazyhead90/pi64/releases
https://github.com/sakaki-/gentoo-on-rpi3-64bit
https://github.com/jdonald/raspbian-multiarch
https://ubuntu-mate.org/raspberry-pi/
But I'm wary of things that need to be learned again, tried, crashed, corrected, debugged etc., and that might not stay available/updated over the years, so I'll check them carefully.
That's it for the feedback! I'm an average dude for whom "1 cycle" 64-bit double floating point calculation by the CPU would have been useful. Thanks to everyone who participated in this discussion, I hope my feedback can be useful for anyone, someday, somewhere!