Gavinmc42
Posts: 3885
Joined: Wed Aug 28, 2013 3:31 am

Re: Spider-OS a new operating system

Mon Feb 04, 2019 10:57 am

Redox OS is the reason I put rustc on my Pis and PCs.
Looking forward to running it on Pis and RISC-V boards.
I was worried about GPUs for RISC-V, but that is moving ahead now too.
A modern(?) microkernel OS written in a modern language.
Can't call Pis modern, but ARM is a bit newer than x86 ;)
I'm dancing on Rainbows.
Raspberries are not Apples or Oranges

LdB
Posts: 1280
Joined: Wed Dec 07, 2016 2:29 pm

Re: Spider-OS a new operating system

Mon Feb 04, 2019 3:03 pm

I spent my Christmas playing around with Rust 2018, evaluating it for a couple of client companies, and there are still a couple of serious structural problems with the language (2018 has async & await finally implemented). Don't get me wrong, I like some of the language features, and some things are natural to write, but in real code you keep running into issues that already have RFCs open, and it is extremely fuzzy how those issues will be progressed. It has still got some development to do before it goes prime time and makes any serious challenge to C or C++, as it wants to do.

Mozilla itself put out an insights post on Rust 2018 which is worth a read:
https://hacks.mozilla.org/2018/12/rust-2018-is-here/
The bit about embedded development being a wild ride is right, and reading the embedded Rust book is essential.

I had a play with Redox but I don't get what it's trying to be (it's too simple for some things, too large for others), and then the suggestion is that it's just a proof of concept. So I am interested: what are you using Redox for?

Aran
Posts: 36
Joined: Fri Jan 25, 2019 5:55 pm

Re: Spider-OS a new operating system

Mon Feb 04, 2019 8:42 pm

Hello,

Here is my point of view on assembler.

For the creation of the window manager I used the NEON instructions, which allow parallel processing of data. I studied the documentation to see exactly what each instruction does, in order to use the most appropriate one. That is a first optimization. Then, after various tests, it turned out that my code was not optimal in its processing, so I used other specialized instructions with a different algorithm.
By measuring the processing time of each version of the code, you find the fastest solution.

Regarding 32-bit coding: yes, for the moment I'm content with 32 bits, but that suits what I want to do perfectly. 4GB of memory addressing is enough for the Raspberry Pi, and so is the register set.

The goal is not to create a scalable, portable operating system that can survive a change of architecture, but rather to have an optimal system for one precise architecture. I don't think the Raspberry Pi will necessarily evolve enormously, nor the ARM processor: the instruction set has stayed compatible between different ARM CPUs for many years.

Moreover, as suggested by LdB and bzt, I wonder about the usefulness of a HAL if the system is so strongly oriented toward a single architecture.

bzt
Posts: 393
Joined: Sat Oct 14, 2017 9:57 pm

Re: Spider-OS a new operating system

Tue Feb 05, 2019 11:45 am

Hi,
Aran wrote:
Mon Feb 04, 2019 8:42 pm
Then, after various tests, it turned out that my code was not optimal in its processing, so I used other specialized instructions with a different algorithm.
By measuring the processing time of each version of the code, you find the fastest solution.
And that's the point. You, as a programmer, can easily use a different algorithm, something a compiler optimizer will never be able to do. You're on the right track!
Aran wrote: Regarding 32-bit coding: yes, for the moment I'm content with 32 bits, but that suits what I want to do perfectly. 4GB of memory addressing is enough for the Raspberry Pi, and so is the register set.
Perfectly fine, that's your choice. There are more tutorials for 32 bit anyway.
Aran wrote: The goal is not to create a scalable, portable operating system that can survive a change of architecture, but rather to have an optimal system for one precise architecture. I don't think the Raspberry Pi will necessarily evolve enormously, nor the ARM processor: the instruction set has stayed compatible between different ARM CPUs for many years.

Moreover, as suggested by LdB and bzt, I wonder about the usefulness of a HAL if the system is so strongly oriented toward a single architecture.
That single architecture can change too; just think about the appearance of the WiFi adapter. It's a hell of a lot easier to add the wireless driver to your kernel if you already have a network interface abstraction. Also, the RPi supports SD cards and USB storage, so a block device abstraction is very handy.

Maybe you don't see it right now, but in the long run your source would be much more maintainable with a HAL. It's not only about portability, but code organization as well. It is very likely you'll work on this project for years, and it is a real pain to dust off a one- or two-year-old assembly source that is not well structured. But I think you should experience that at least once to fully understand what we're talking about. :-)

Good luck, and I really hope we can hear from Spider-OS soon,
bzt

LdB
Posts: 1280
Joined: Wed Dec 07, 2016 2:29 pm

Re: Spider-OS a new operating system

Tue Feb 05, 2019 1:43 pm

I agree with bzt, and the other problem with obsessing over optimization is that you lose sight of the actual problem.

Try putting your NEON code up against a DMA screen draw, or, even faster, a VC4 GPU pipeline, and see how you go.
You may have the fastest NEON optimization, but there are faster ways to do it :-)

This is Peter Lemon's assembler code to do a DMA text-to-screen draw if you want to test it:
https://github.com/PeterLemon/Raspberry ... ernel8.asm

Code: Select all

adr x1,Font ; X1 = Characters
adr x2,Text ; X2 = Text Offset
adr x3,CB_STRUCT ; X3 = Control Block Data
mov w4,BUS_ADDRESSES_l2CACHE_DISABLED ; W4 = BUS_ADDRESSES_l2CACHE_DISABLED
add w3,w3,w4 ; W3 = BUS_ADDRESSES_l2CACHE_DISABLED + Control Block Data

mov x4,PERIPHERAL_BASE
orr x4,x4,DMA0_BASE ; X4 = DMA 0 Base
mov w5,DMA_ACTIVE ; W5 = DMA Active Bit
mov w6,12 ; W6 = Number Of Text Characters To Print
DrawChars:
  ldrb w7,[x2],1 ; W7 = Next Text Character
  add w7,w1,w7,lsl 6 ; Add Shift To Correct Position In Font (* 64)
  adr x8,CB_SOURCE
  str w7,[x8] ; Store DMA Source Address
  adr x8,CB_DEST
  str w0,[x8] ; Store DMA Destination Address
  str w3,[x4,DMA_CONBLK_AD] ; Store DMA Control Block Data Address

  str w5,[x4,DMA_CS] ; Print Next Text Character To Screen
  DMAWait:
    ldr w7,[x4,DMA_CS] ; Load Control Block Status
    tst w7,w5 ; Test Active Bit
    b.ne DMAWait ; Wait Until DMA Has Finished

  subs w6,w6,1 ; Subtract Number Of Text Characters To Print
  add w0,w0,CHAR_X ; Jump Forward 1 Char
  b.ne DrawChars ; IF (Number Of Text Characters != 0) Continue To Print Characters
Try scrolling the whole screen up one line (1 font height) using your code, then try a DMA bitblt, and tell me which is faster :-)

You might also look at how Linux abstracts a framebuffer into a driver, for ideas:
https://opensourceforu.com/2015/05/writ ... er-driver/

Aran
Posts: 36
Joined: Fri Jan 25, 2019 5:55 pm

Re: Spider-OS a new operating system

Tue Feb 05, 2019 10:09 pm

Yes, of course I have already made comparisons between NEON, DMA and the Videocore. Like you, I thought DMA was faster, and well, no! I created two programs to show you: one for DMA, and the other with NEON (scrolling a 1920*1080, 16-bit screen by one line).

Code: Select all

; DMA copy
	mov r1,screen.fbAdr
	ldr r1,[r1]	
	add r0,r1,1920*2
	imm32 r2,1920*1080*2
	mov r3,1920*2

	str r0,[dmaCb.source_ad]	; source address
	str r1,[dmaCb.dest_ad]		; target address
	str r2,[dmaCb.txfr_len]		; transfer size
	str r3,[dmaCb.stride]		; bytes to add after every line
	
	mov r0,0
	orr r0,DMA_DEST_INC
	orr r0,DMA_SRC_INC
	str r0,[dmaCb.ti]
	
	mov r3,PERIPHERAL_BASE
	orr r3,DMA0_BASE		; channel DMA 0
	
	imm32 r0,dmaCb + BUS_ADDRESSES_l2CACHE_DISABLED
	str r0,[r3,DMA_CONBLK_AD]
	
	mov r0,DMA_ACTIVE
	str r0,[r3,DMA_CS]		; register Control & Status
	
	.DO1:
		ldr r0,[r3,DMA_CS]
		tst r0,DMA_ACTIVE
	bne .DO1
	

Code: Select all

;-------- copy with NEON

; scroll the screen by 1 line (1920*1080, 16 bits/pixel)

mov r0,screen.fbAdr
ldr r0,[r0]				; r0 : framebuffer address
	
add r2,r0,1920*2			; r2 : source address (a lower line)
imm32 r1,1920*2*1080			; size of the screen
	
.DO2:
	vldm r2!,{d0-d15}		; read 64 pixels (128 bytes)
	vstm r0!,{d0-d15}		; write 64 pixels (128 bytes)
	subs r1,128
bhi .DO2

On a Raspberry Pi 3, the DMA copy takes 75ms and the NEON copy 25ms.

The Videocore is even worse here, because displaying a single frame takes a lot of initialization code. On the other hand, the Videocore is unbeatable when it comes to displaying a fast sequence of frames, like a moving 3D object.
In my case, I use NEON mainly for processing one frame at a time, like copying a window. So that's fine, because most processing is done in less than 40ms, that is to say 25 frames/second.
Last edited by Aran on Wed Feb 06, 2019 1:33 am, edited 1 time in total.

LdB
Posts: 1280
Joined: Wed Dec 07, 2016 2:29 pm

Re: Spider-OS a new operating system

Wed Feb 06, 2019 12:17 am

Your answer is a bit misleading in a number of respects.

The GPU pipeline only requires the structure to be set up once ... you seem to be throwing it away each time?

DMA result
1.) In your DMA test you are excluding the fact that you get the ARM back to do actual processing in the meantime.
2.) There are 16 DMA channels, so use 8, say channels 0..7, each doing 1/8th of the screen ... that leaves 8 for other stuff (disk, sound, etc.).
You will need a virtual screen, so when you create the framebuffer, make it twice the height of the screen.
Based on your 75ms, the time should now be closer to 9-10ms, around 2.5x faster than your 25ms NEON copy :-)
The channels are just sitting there; they have nothing else to do.
Make the framebuffer twice the height, adjust your code to use 8 channels like this, and measure:

Code: Select all

/* DMA0-7 enable */
PUT32(BCM283X_DMA_ENABLE, 0xFF);
/* DMA Control Block */
/* DMA Control Blocks: identical setup for all 8 channels, each channel
   copying a 135-line strip (1080 / 8) of the 1920-wide, 32bpp frame */
for (int i = 0; i < 8; i++) {
	dma_cb_data[i].transfer_information = 0x332;	/* Source Address Increment 32, Destination Address Increment 32, 2D Mode */
	dma_cb_data[i].length = (135 << 16) + (1920 * 4); /* 0:15 XLENGTH, 16:29 YLENGTH */
	dma_cb_data[i].tdmode_stride = 0;
	dma_cb_data[i].next_control_block_address = 0;
	dma_cb_data[i].reserved1 = 0;
	dma_cb_data[i].reserved2 = 0;

	/* DMA src ... virtual screen memory just after real screen memory */
	dma_cb_data[i].src_address = (uint32_t)framebuffer + (1080 + i * 135) * 1920 * 4;

	/* DMA dest ... the matching strip of the real screen */
	dma_cb_data[i].dst_address = (uint32_t)framebuffer + i * 135 * 1920 * 4;
}
The really neat part is that, since none of it ever changes, you can set it all up as a const data block and just copy the block to the DMA registers.

So draw some stuff onto the virtual screen, then launch the DMAs, and your virtual screen should appear on the real screen in 10ms.
In your case you end up working like Windows: the individual windows draw to the virtual screen, and the last window to update launches
the physical draw to screen, which stops all the window-redraw flashing. The code above does the whole screen, but you can easily adjust it
to do rectangular sections (align both start and end to 4, so it may draw a couple of pixels more than requested, but since the virtual screen is a mirror of the real screen it matters not).

So I am pretty sure the code will be faster and smaller, and in addition I get the CPU back doing real work for those 10ms. Do I win the optimization challenge?

Aran
Posts: 36
Joined: Fri Jan 25, 2019 5:55 pm

Re: Spider-OS a new operating system

Wed Feb 06, 2019 8:11 pm

LdB,

Yes, you win the optimization challenge :)
I did the test with 8 DMA channels executing, and it takes less than 10ms to copy a 1920*1080 screen in 16 bits. Well done!

As you suggest, I use a hidden screen to manage the overlaying of windows, and also to avoid flicker when refreshing.
But copying raw data with DMA is not enough, because you also have to test whether another window is present at a given spot before drawing over it (to make a backup copy).
For this you have to use specialized instructions, like NEON ;). A link that gives insight into what can be done with this chip: Neon

Beyond that, the ideal is probably to manage the graphical interface directly with the Videocore ...

LdB
Posts: 1280
Joined: Wed Dec 07, 2016 2:29 pm

Re: Spider-OS a new operating system

Thu Feb 07, 2019 1:54 am

So, time to let the cat out of the bag: everything you have done is actually already done in Linux, and they never get out of C to do it.

This is the linux framebuffer documentation
https://elinux.org/Flameman/framebuffer

The function you are playing around with is called
FB_SYS_COPYAREA
Include the sys_copyarea function for generic software area copying.
This is used by drivers that don't provide their own (accelerated)
version and the framebuffer is in system RAM.
In the Linux driver there is a function pointer which defaults to this code when that option is enabled:
https://github.com/torvalds/linux/blob/ ... copyarea.c

The code in it translates to your NEON code .. note the instruction repeated 8 times so that it vectorizes as a NEON array. Saves a lot of assembler writing :-)

Code: Select all

// Main chunk
			n /= bits;
			while (n >= 8) {
				FB_WRITEL(FB_READL(src++), dst++);
				FB_WRITEL(FB_READL(src++), dst++);
				FB_WRITEL(FB_READL(src++), dst++);
				FB_WRITEL(FB_READL(src++), dst++);
				FB_WRITEL(FB_READL(src++), dst++);
				FB_WRITEL(FB_READL(src++), dst++);
				FB_WRITEL(FB_READL(src++), dst++);
				FB_WRITEL(FB_READL(src++), dst++);
				n -= 8;
			}
			while (n--)
				FB_WRITEL(FB_READL(src++), dst++);
If you look at void cfb_copyarea(struct fb_info *p, const struct fb_copyarea *area),
it does some simple checking, works out whether the copy is forward or backward, and calls the appropriate sub-function.

Now, what we did with the 8 DMAs would be the accelerated version, and here is the Linux code for the Pi driver.
It uses only one DMA channel btw, either channel 2 or 8; they don't consider graphics important enough to get 8 channels.
http://www.macs.hw.ac.uk/~hwloidl/hacks ... m2708_fb.c
The function you want is static void bcm2708_fb_copyarea(struct fb_info *info, const struct fb_copyarea *region)

So everything you have done and discussed is already done in the Linux driver, in C, and encapsulated in an abstraction layer :-)

Heater
Posts: 13603
Joined: Tue Jul 17, 2012 3:02 pm

Re: Spider-OS a new operating system

Thu Feb 07, 2019 7:01 am

@LdB,
I am interested what are you using redox for?
Personally I have yet to try Redox, or even Rust, for anything. I just mentioned it as an interesting example of getting away from the decades-old tradition that if one is going to write an OS, one does it in C.

Rust for embedded systems may be a "wild ride" but in my experience developing embedded systems has always been a wild ride in whatever language.

@bzt,
You, as a programmer, can easily use a different algorithm, something a compiler optimizer will never be able to do. You're on the right track!
My experience is the opposite. With a large code base it is far easier to make changes that have a global effect, and huge gains in performance, if one is working in a high-level language. Making such changes to an assembly-language code base can require touching all parts of the code, which soon becomes far too daunting a task.

I do agree, though, that one cannot rely on the optimizer to magically apply speed for us. As Alexander Stepanov said: "We need to learn to do compiler optimization manually."

An example: recently I was entertaining myself by trying to calculate Fibonacci numbers with one million digits. My initial working attempt in C++ took a minute and a half to calculate fibo(4784969). Pretty terrible. But after a process of applying various optimizations it now runs in about 0.5 seconds: a speed-up of about 200x. By changing a few lines, that processing can be spread over multiple cores for another huge speed-up.

The thing is, when you look at the current code it looks mostly like the original version. But some small changes along the way had dramatic effects on performance, for example changing a data type or changing the way memory is allocated. These small changes in the C++ source had a big effect on all of the generated code.

If I were working in assembler this thing would still take a minute and a half. The prospect of having to rewrite all the code for each of those optimization steps would have been too much to attempt.

If anyone wants to attempt the fibo(4784969) challenge in assembler and pit themselves against the other language solutions we have, that would be great. We have a thread about it here: https://www.raspberrypi.org/forums/view ... 3#p1394452

The, rather chaotic, repository of challenge entries in many languages is here: https://github.com/ZiCog/fibo_4784969

Let's see what the assembler language supporters can do.
Memory in C++ is a leaky abstraction.

jahboater
Posts: 4762
Joined: Wed Feb 04, 2015 6:38 pm

Re: Spider-OS a new operating system

Thu Feb 07, 2019 8:40 am

Heater wrote:
Thu Feb 07, 2019 7:01 am
@bzt,
You, as a programmer, can easily use a different algorithm, something a compiler optimizer will never be able to do. You're on the right track!
My experience is the opposite. With a large code base it is far easier to make changes that have a global effect, and huge gains in performance, if one is working in a high-level language. Making such changes to an assembly-language code base can require touching all parts of the code, which soon becomes far too daunting a task.
I think you are agreeing with bzt, the way I read it.
Heater wrote:
Thu Feb 07, 2019 7:01 am
Let's see what the assembler language supporters can do.
I count myself as an "assembler language supporter" because it can be fun. (never at work).
However I also agree totally with all your comments above.

I mostly restrict it now to very small pieces of inline assembler within C programs. Always within conditional compilation, so that the code remains portable. Always in small support functions, so the main algorithm remains readable. It has to be done with care and compared to the compiler-generated version (in case the compiler does the same thing or better). For example, to clear the FPU exception flags, asm( "msr fpsr,xzr" ) is much smaller and faster than feclearexcept( FE_ALL_EXCEPT ).

Mostly I leave micro-optimizations to the compiler.
The fibo thing was a case in point. The multi-precision nested inner loops looked like a hot spot that should be tweaked.
So I changed

Code: Select all

           if( x.d[k] >= base ) {
                x.d[k] -= base;
                c = 1;
            }
            else
            	c = 0;
to

Code: Select all

            c = 0;
            if( x.d[k] >= base ) {
                x.d[k] -= base;
                c = 1;
            }
To remove a jump. Of course, half the time "c" is set to zero unnecessarily, but that is cost-free on Intel (which eliminates the instruction) and possibly on AArch64 (which has a zero register).
Later I spotted that the compiler devs had had the same idea, and the compiler made exactly the same change. Hmm.

Heater
Posts: 13603
Joined: Tue Jul 17, 2012 3:02 pm

Re: Spider-OS a new operating system

Thu Feb 07, 2019 9:53 am

jahboater,
I think you are agreeing with bzt, the way I read it.
You might be right. bzt's statement is somewhat ambiguous. On the one hand it's "...use a different algorithm easily...", which is clearly easier to do in a high-level language; on the other it's "...something a compiler optimizer will never be able to do...", which is a statement against using a high-level language.
I count myself as an "assembler language supporter" because it can be fun. (never at work).
Me too. Even at work when it's essential and I can justify it. I can also be a "goto" supporter at times but that is another story...
Mostly I leave micro optimizations to the compiler. The fibo thing was a case in point...So I changed.... To remove the jump.
And yet, if you recall, my code in that situation does this:

Code: Select all

    for (i = bShift; i < result.width; i++) {
        s = result.value[i] + carry;
        carry = s / BASE;
        s = s % BASE;
        result.value[i] = s;
    }
Which removes the test-and-jump to set the carry, and last time I checked it was faster, even though it uses what seem like expensive div and mod operators. Your mileage may vary depending on your processor.
Later I spotted that the compiler devs had the same idea and the compiler made exactly the same change. Hmm.
I rest my case.
Memory in C++ is a leaky abstraction.

jahboater
Posts: 4762
Joined: Wed Feb 04, 2015 6:38 pm

Re: Spider-OS a new operating system

Thu Feb 07, 2019 10:12 am

We all agree by the looks of it.

FYI there is a long-standing bug in GCC on 32-bit ARM for this:

Code: Select all

        carry = s / BASE;
        s = s % BASE;
On Intel both of these expressions are done in one single instruction (which produces the quotient and the remainder in one go).
On 32-bit ARM it calls a function like __aeabi_uldivmod() twice!

LdB
Posts: 1280
Joined: Wed Dec 07, 2016 2:29 pm

Re: Spider-OS a new operating system

Thu Feb 07, 2019 1:05 pm

It isn't a bug: you are compiling for ARMv6, which is an ARM with no divide instruction .. it can't emit an opcode it doesn't have.
https://community.arm.com/processors/b/ ... nd-conquer
I can also tell you that you must have inlining turned off, or else it would inline the calls rather than make the call.

ARMv7 has divide, as does the Cortex-A53, so it is only an issue when compiling for ARMv6, i.e. for a Pi1 or Pi Zero.

It is well covered here:
https://thinkingeek.com/2013/08/11/arm- ... hapter-15/

So if you compile for ARMv6 (arm1176jzf-s) and expect it not to emit that code, either by call or inline, you are dead out of luck, because it can't do the impossible. That is why it is long-standing and has never been fixed: because it isn't broken :-)

jahboater
Posts: 4762
Joined: Wed Feb 04, 2015 6:38 pm

Re: Spider-OS a new operating system

Thu Feb 07, 2019 3:35 pm

OK, the situation has changed in recent GCC versions.
It still calls __aeabi_uldivmod() twice for 64-bit division/remainder: once for the division and again for the remainder. You would think that "divmod" could return both from one call (that's the point).
But the 32-bit stuff has been improved for GCC 8.2 anyway.
In 32-bit mode it now calls __aeabi_uidivmod() once.

I know there is no divide instruction, and I think inlining has nothing to do with it. (I always have all the inlining options turned on, but usually use -Os).

Heater
Posts: 13603
Joined: Tue Jul 17, 2012 3:02 pm

Re: Spider-OS a new operating system

Thu Feb 07, 2019 4:08 pm

jahboater,
It still calls __aeabi_uldivmod() twice for 64-bit division/remainder.
Not according to godbolt: https://godbolt.org/

Ever since ARM GCC 4.5.4 this:

Code: Select all

#include <stdint.h>
 
#define DIGITS 9
#define BASE 1000000000
 
uint64_t test(uint64_t s)
{
        long carry = s / BASE;
        s = s % BASE;
        return s;
}
compiles to this:

Code: Select all

 
test(unsigned long long):
        stmfd   sp!, {r3, lr}
        ldr     r2, .L2
        mov     r3, #0
        bl      __aeabi_uldivmod
        mov     r0, r2
        mov     r1, r3
        ldmfd   sp!, {r3, pc}
.L2:
        .word   1000000000

At anything -O1 and above, only one call to __aeabi_uldivmod. I can only get it to make two calls with -O0.

Things get better with ARM64 GCC where we get the job done using umulh and msub.

Code: Select all

test(unsigned long):
        lsr     x1, x0, 9
        mov     x2, 23123
        movk    x2, 0xa09b, lsl 16
        movk    x2, 0xb82f, lsl 32
        movk    x2, 0x44, lsl 48
        umulh   x1, x1, x2
        lsr     x1, x1, 11
        mov     x2, 51712
        movk    x2, 0x3b9a, lsl 16
        msub    x0, x1, x2, x0
        ret
No calls or jumps at all.
Memory in C++ is a leaky abstraction.

jahboater
Posts: 4762
Joined: Wed Feb 04, 2015 6:38 pm

Re: Spider-OS a new operating system

Thu Feb 07, 2019 4:23 pm

This (crap) code segment converts an integer to a decimal string and leaves the pointer at the start.

Code: Select all

  do
    *--p = (n % 10) | 48;
  while( n /= 10 );
becomes (-Os)

Code: Select all

.L108:
    mov r2, #10   
    mov r3, #0    
    mov r0, r4  
    mov r1, r5  
    bl  __aeabi_uldivmod         
    orr r2, r2, #48 
    strb    r2, [r6, #-1]!  
    mov r3, #0  
    mov r2, #10 
    mov r0, r4  
    mov r1, r5  
    bl  __aeabi_uldivmod         
    cmp r5, #0      
    cmpeq   r4, #9      
    bhi .L109         
    ... 
    ...
.L109:
    mov r4, r0  
    mov r5, r1  
    b   .L108        
This is interesting: the "while( n /= 10 )" is comparing against 9, because the final digit doesn't need the remainder calculated (who says compilers can't change your algorithm! OK, yes, it won't change bubble sort into quicksort). Pretty clever of it to spot that.

LdB
Posts: 1280
Joined: Wed Dec 07, 2016 2:29 pm

Re: Spider-OS a new operating system

Thu Feb 07, 2019 4:45 pm

The code is contained in the library that GCC is packaged with: libgcc, newlib, bionic, musl or whatever.

This is exactly what GCC does ... line 2552
https://github.com/gcc-mirror/gcc/blob/ ... /arm/arm.c

Code: Select all

 set_optab_libfunc (udivmod_optab, DImode, "__aeabi_uldivmod");
It links the operation to the library function.

So what code you get depends on what library GCC is packaged with, and has next to nothing to do with GCC itself.

All GCC does is optimize whatever junk the library provides and you can always provide your own code.

jahboater
Posts: 4762
Joined: Wed Feb 04, 2015 6:38 pm

Re: Spider-OS a new operating system

Thu Feb 07, 2019 4:56 pm

Interesting, thanks

Heater
Posts: 13603
Joined: Tue Jul 17, 2012 3:02 pm

Re: Spider-OS a new operating system

Thu Feb 07, 2019 5:09 pm

LdB,

I'm not sure what you are suggesting there.

Presumably one could replace __aeabi_uldivmod or whatever maths functions with one's own creation. Not that it is likely to be better than what your compiler provides.

But you don't have such control over the generated code that makes calls to those built in functions.

At the end of the day if your processor is missing mul, div, mod, floating point operations or whatever the compiler has to generate code to call functions to get it done. Hacking the compiler to generate different code is not something most people want to do or even can.

As we see in the examples above, sometimes a single call to __aeabi_uldivmod is generated to get both div and mod at the same time. Other times two calls are made even when we can see that one would do. Presumably the compiler writers could add an optimization to do that. Users generally cannot.
Memory in C++ is a leaky abstraction.

LdB
Posts: 1280
Joined: Wed Dec 07, 2016 2:29 pm

Re: Spider-OS a new operating system

Thu Feb 07, 2019 5:37 pm

You get the compiler to create the versions .. write them in C ... you said it yourself, you won't beat the compiler :-)
https://courses.cs.vt.edu/~cs1104/Divis ... tract.html

Code: Select all

uint32_t udiv64_32(uint64_t *n, uint32_t base)
{
	uint64_t rem = *n;
	uint64_t b = base;
	uint64_t res, d = 1;
	uint32_t high = rem >> 32;

	/* Reduce the thing a bit first */
	res = 0;
	if (high >= base) {
		high = udiv32_32(high, base);   // this is a repeat of this but u32 div u32
		res = (uint64_t) high << 32;
		rem -= (uint64_t) (high*base) << 32;
	}

	while ((int64_t)b > 0 && b < rem) {
		b = b+b;
		d = d+d;
	}

	do {
		if (rem >= b) {
			rem -= b;
			res += d;
		}
		b >>= 1;
		d >>= 1;
	} while (d);

	*n = res;
	return rem;
}
All the new libraries simply wrote these functions in C, knowing GCC will outdo anything done by hand .. go and look in the libraries.

The old libraries are all hand-constructed assembler in an aeabi_uldivmod.S file; an example:
https://github.com/Microsoft/compiler-r ... uldivmod.S
Last edited by LdB on Thu Feb 07, 2019 6:02 pm, edited 1 time in total.

Heater
Posts: 13603
Joined: Tue Jul 17, 2012 3:02 pm

Re: Spider-OS a new operating system

Thu Feb 07, 2019 6:01 pm

Ah well, I did not say anything about whether those built-in functions are written in C or assembler.

I just find it unlikely that many people would be able to improve on what the compiler does. And if they do, they can push their ideas up to the GCC devs, and then GCC does it :)
Memory in C++ is a leaky abstraction.

bzt
Posts: 393
Joined: Sat Oct 14, 2017 9:57 pm

Re: Spider-OS a new operating system

Sat Feb 09, 2019 1:28 pm

Hi,

I'm sorry I wasn't able to answer.
Heater wrote: You might be right. bzt's statement is somewhat ambiguous. On the one hand it's "...use a different algorithm easily...", which is clearly easier to do in a high-level language; on the other it's "...something a compiler optimizer will never be able to do...", which is a statement against using a high-level language.
That's not an ambiguous statement, and your Fibonacci number example is perfect to demonstrate what I meant.

Let's say you have implemented the Fibonacci numbers as

Code:

F(n) = F(n-1) + F(n-2)
No matter how much the compiler wants to optimize that, the results will be limited, O(n) at best, because the compiler only does micro-optimisations; it cannot remove the recursion.

An experienced programmer (and I must stress experienced) on the other hand, using algorithmic or macro-optimisations, can replace the whole thing with

Code:

F(n) = ((1 + sqrt(5))^n - (1 - sqrt(5))^n) / (2^n * sqrt(5))
Which (surprisingly) gives exactly the same correct result, but it is an O(1) algorithm, incomparably faster than any compiler-optimized version of the first algorithm. (It takes the same time to compute regardless of how big the n input is; that's what O(1) means.)

This is totally independent of the high-level versus low-level language perspective. I'm talking about optimizing the algorithm. A programmer can implement such asymptotic optimisations in both C and Assembly (or any other language). From this point of view it doesn't matter that some algorithms are easier to write in a high-level language (which is usually the case anyway; that's why high-level languages were created. But there are exceptions, like using the SHA instructions directly, for example).

Cheers,
bzt

jahboater
Posts: 4762
Joined: Wed Feb 04, 2015 6:38 pm

Re: Spider-OS a new operating system

Sat Feb 09, 2019 3:10 pm

bzt wrote:
Sat Feb 09, 2019 1:28 pm
Let's say you have implemented the Fibonacci numbers as

Code:

F(n) = F(n-1) + F(n-2)
No matter how much the compiler wants to optimize that, the results will be limited, O(n) at best, because the compiler only does micro-optimisations; it cannot remove the recursion.
The compiler does a little better than that to be fair. It deduced the relationship between F(n-1) and F(n-2) and so was able to remove one of the recursive calls.

Here's one I saw in Matt Godbolt's talk:

Code:

int CountSetBits( int a )
{
  int count = 0;
  while( a )
  {
    ++count;
    a &= (a - 1);
  }
  return count;
}
Clang compiles this entire function to one single instruction (on x86).
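This is easy to check from C itself: GCC and Clang expose the operation as the __builtin_popcount builtin, so a quick sketch (the wrapper function here is hypothetical scaffolding) confirms the loop and the builtin agree:

```c
/* The same bit-clearing loop as above, alongside the GCC/Clang
 * builtin; both count set bits, and on targets with a popcount
 * instruction the builtin lowers to a single instruction. */
int CountSetBits( int a )
{
	int count = 0;
	while( a )
	{
		++count;
		a &= (a - 1);   /* clear the lowest set bit */
	}
	return count;
}

int CountSetBitsBuiltin( unsigned a )
{
	return __builtin_popcount(a);
}
```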

Another one from his site:

Code:

#include <stdio.h>
#include <string.h>

static int count_t_letters(const char *str)
{
    size_t len = strlen(str); // optimized with int also

    int count = 0;

    for (size_t i = 0; i < len; ++i) // optimized with int also
    {
        if (str[i] == 't')
            ++count;
    }
    return count;
}

int test() {
    return count_t_letters("test");
}
This program compiles down to "mov eax,2; ret" on most compilers except MSVC (which fails).

Pretty clever I'd say!

Heater
Posts: 13603
Joined: Tue Jul 17, 2012 3:02 pm

Re: Spider-OS a new operating system

Sat Feb 09, 2019 3:31 pm

bzt,

Looks like we are in agreement then. If one wants to make serious optimizations one needs to be looking for algorithmic changes. Certainly we cannot expect our compilers to magically swap our slow fibo for a fast one or a naive Fourier Transform for a Fast Fourier Transform and so on. Simply tinkering around with micro-optimizations or working in assembler is not going to get you orders of magnitude improvements. Of course that presumes there are faster algorithms known or we are smart enough to find new ones.

My claim is simply that if one wants to hack around with ones code, trying out different algorithms and so on it is far easier in a high level language. Further, that the small gains to be had by working in assembler are far outweighed by the increased development time, increased bug rate, lack of portability and so on.

A few comments on your post though:

An experienced programmer would likely be wise enough to look around for algorithms that dramatically improve performance. However, no matter how good a software engineer he is, he may never come up with a good algorithm himself. For that we more often need a mathematician.

An experienced software engineer will evaluate his typical data sets, measure what is going on and select an algorithm appropriately. Typically a "superior" algorithm will only be faster over some size of input. For example: At what size of n is using your suggested equation for fibo faster than the naive approach? A "superior" algorithm may be excellent in general but sub-optimal for the typical data in an application.

Your suggested fibo equation is certainly not O(1). Firstly it requires another two multiplies to get those powers for every increase in n. Secondly it requires having the square root of 5 available. Calculating that to sufficient accuracy to get a million digit result may be a challenge! Then of course all those multiplies take more and more time as the numbers get bigger. See the analysis here: https://bosker.wordpress.com/2011/07/27 ... s-formula/

The SHA instruction example is interesting. There an entire algorithm has been buried in hardware. How is a high level language supposed to deal with that!
Memory in C++ is a leaky abstraction.
