okenido
Posts: 21
Joined: Thu Aug 02, 2018 11:47 am

Bare metal graphics : hardware acceleration ?

Wed Oct 10, 2018 2:18 pm

Hello

I draw my graphics directly into the RPi's framebuffer using my own functions (drawRect, drawLine...). It works pretty well, but since I have very little CPU time to spend drawing the screen, I'm trying to find better ways of doing this.

- Does the GPU provide any hardware rectangle blitting? I spend a lot of time in nested X/Y loops filling rectangles with pixels, which is quite inefficient. I looked at the mailboxes for talking to the GPU, but they don't appear to offer such functions.

- Is it possible to use bare-metal OpenGL to draw things into the framebuffer (quads + a basic shader to draw 2D rectangles), while still being able to do software rendering on top of it?

- I get flickering and tearing when redrawing the screen, since I can't control when the framebuffer data is sent to the display. Is there a way to control that, using something like "begindraw() ... drawing stuff ... enddraw()"? Or some way of doing hardware double buffering?

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 20944
Joined: Sat Jul 30, 2011 7:41 pm

Re: Bare metal graphics : hardware acceleration ?

Wed Oct 10, 2018 3:08 pm

Nothing easy to use IIRC.

Have you written the blitting functions in NEON? That will give a huge improvement in speed. There are probably quite a few examples already out there of NEON based blitting functions, so you might get away with copy and paste.
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Please direct all questions to the forum, I do not do support via PM.

okenido
Posts: 21
Joined: Thu Aug 02, 2018 11:47 am

Re: Bare metal graphics : hardware acceleration ?

Wed Oct 10, 2018 3:48 pm

No, I'm doing it the naive way. However, this page suggests the NEON improvements aren't that significant: http://infocenter.arm.com/help/index.js ... 13544.html

Word by Word memory copy 100%
Load-Multiple memory copy 111%
NEON memory copy 100%
Word by Word memory copy with PLD 76%
Load-Multiple memory copy with PLD 98%
NEON memory copy with PLD 149%
Mixed ARM and NEON memory copy 112%


Except for the 149% (+49%), which is quite good, but I thought it would make an even bigger difference.

It's for the ARM Cortex-A8, so maybe it's not relevant for the RPi?

I found some code for NEON blitting: https://github.com/tranthamp/neon_test


Since I'm using a 16-bit framebuffer, I was thinking about casting the pointers to uint32 and then doing the copy, so that it copies two pixels at a time.
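
A minimal sketch of the two-pixels-per-write idea, assuming a 16-bit (RGB565-style) framebuffer. The name fill_span16 is hypothetical. It uses a 4-byte memcpy for the store instead of dereferencing a cast pointer, because reading or writing a uint16_t buffer through a uint32_t pointer can break C's aliasing and alignment rules; compilers normally turn the small memcpy into a single 32-bit store.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill `count` 16-bit pixels starting at `dst` with `color`,
 * writing two pixels per 32-bit store where possible. */
void fill_span16(uint16_t *dst, size_t count, uint16_t color)
{
    /* Same color in both halves, so byte order does not matter here. */
    const uint32_t pair = ((uint32_t)color << 16) | color;

    /* Write one pixel first if dst is not 4-byte aligned. */
    if (count > 0 && ((uintptr_t)dst & 3u) != 0) {
        *dst++ = color;
        count--;
    }

    /* Main loop: two pixels per iteration. */
    while (count >= 2) {
        memcpy(dst, &pair, sizeof pair);
        dst += 2;
        count -= 2;
    }

    /* Odd pixel left over at the end. */
    if (count > 0)
        *dst = color;
}
```

For a rectangle, this would be called once per row with the row's start address, instead of looping over every X position.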

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 20944
Joined: Sat Jul 30, 2011 7:41 pm

Re: Bare metal graphics : hardware acceleration ?

Wed Oct 10, 2018 7:38 pm

okenido wrote:
Wed Oct 10, 2018 3:48 pm
No, I'm doing it the naive way. However, this page suggests the NEON improvements aren't that significant: http://infocenter.arm.com/help/index.js ... 13544.html

Word by Word memory copy 100%
Load-Multiple memory copy 111%
NEON memory copy 100%
Word by Word memory copy with PLD 76%
Load-Multiple memory copy with PLD 98%
NEON memory copy with PLD 149%
Mixed ARM and NEON memory copy 112%


Except for the 149% (+49%), which is quite good, but I thought it would make an even bigger difference.

It's for the ARM Cortex-A8, so maybe it's not relevant for the RPi?

I found some code for NEON blitting: https://github.com/tranthamp/neon_test


Since I'm using a 16-bit framebuffer, I was thinking about casting the pointers to uint32 and then doing the copy, so that it copies two pixels at a time.
NEON is 16-way SIMD (on 8-bit data), so you get 16 operations for the price of one normal operation. So 16x faster. Approximately.

LizardLad_1
Posts: 126
Joined: Sat Jan 13, 2018 12:29 am

Re: Bare metal graphics : hardware acceleration ?

Wed Oct 10, 2018 8:58 pm

If you are interested in OpenGL ES 2, LdB is doing some really cool stuff. Check out this link: https://www.raspberrypi.org/forums/view ... 2&t=192440

If you aren't ready to do this, the NEON way should be faster.

okenido
Posts: 21
Joined: Thu Aug 02, 2018 11:47 am

Re: Bare metal graphics : hardware acceleration ?

Thu Oct 11, 2018 5:11 pm

Very nice, I'll take a look at it if I need even more performance.

I wrote this little piece of code and it works very well:

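Code:

asm volatile
(
    "vdup.16  q8, %1\n\t"             /* copy the 16-bit color into all 8 lanes of q8 */
    "vst1.16  {d16-d17}, [%0]!\n\t"   /* store 8 pixels, then advance pDest by 16 bytes */
    : "+r"(pDest)                     /* pDest is both read and updated */
    : "r"(color)
    : "q8", "memory"                  /* tell GCC that q8 and memory are modified */
);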

okenido
Posts: 21
Joined: Thu Aug 02, 2018 11:47 am

Re: Bare metal graphics : hardware acceleration ?

Fri Oct 12, 2018 6:56 pm

It doesn't work that well after all. I get random color corruption. If I move the vdup call before the x/y loop and use only vst1 to fill the screen, I get even more color corruption. What am I doing wrong?
Is NEON core/thread safe? I was thinking about the interrupts in my program: they could make use of NEON registers (in code automatically generated by GCC) and write unwanted things to the register I use for drawing.

Paeryn
Posts: 2226
Joined: Wed Nov 23, 2011 1:10 am
Location: Sheffield, England

Re: Bare metal graphics : hardware acceleration ?

Fri Oct 12, 2018 11:21 pm

okenido wrote:
Fri Oct 12, 2018 6:56 pm
Is NEON core/thread safe? I was thinking about the interrupts in my program: they could make use of NEON registers (in code automatically generated by GCC) and write unwanted things to the register I use for drawing.
That all depends on whether your task switching / interrupt handling is correctly saving and restoring the VFP/NEON state, just like it has to for the vanilla ARM registers. If it doesn't, then another thread using VFP/NEON will trash them when run on the same core. Each core has its own VFP/NEON unit, so using NEON on one core won't affect NEON registers on another.
She who travels light — forgot something.

DavidS
Posts: 3704
Joined: Thu Dec 15, 2011 6:39 am
Location: USA
Contact: Website

Re: Bare metal graphics : hardware acceleration ?

Fri Nov 23, 2018 4:17 pm

One thing to consider for speed is the configuration of your caches and page tables. If you have both caches enabled, and you have the MMU enabled with the page table entries corresponding to the framebuffer set to non-cacheable, then you should get better performance (with careful coding you can do 300 full-screen rectangles per second at 1280x1024 on a BCM2835-based RPi B+ at 700MHz).
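
To put that figure in perspective, here is a small sketch of the arithmetic behind it. The function name fill_bandwidth is hypothetical, and the pixel depth is an assumption, since the post does not say which depth the 300 fills per second were measured at.

```c
#include <stdint.h>

/* Bytes per second written when the whole screen is filled
 * `fills_per_sec` times. `bits_per_pixel` is an assumed parameter;
 * the post does not state the depth that was used. */
uint64_t fill_bandwidth(uint32_t width, uint32_t height,
                        uint32_t bits_per_pixel, uint32_t fills_per_sec)
{
    uint64_t bytes_per_frame = (uint64_t)width * height * (bits_per_pixel / 8);
    return bytes_per_frame * fills_per_sec;
}
```

At 1280x1024 and 300 fills per second this works out to 786,432,000 bytes/s for a 16 bpp framebuffer and 1,572,864,000 bytes/s for a 32 bpp one.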

Also, if you are running on a multi-core system, make sure that you are doing something with the other cores, so that they are not taking up a bunch of bus time polling their mailbox.
The Raspberry Pi is an ARM computer, that runs many Operating Systems, including Linux, RISC OS, BSD, Pi64, CP/M as well as many more.
Soon to add AROS to the list of operating systems.

LdB
Posts: 912
Joined: Wed Dec 07, 2016 2:29 pm

Re: Bare metal graphics : hardware acceleration ?

Fri Nov 23, 2018 5:28 pm

It is not the rectangles or triangles that are the problem; it is always the text and bitmaps that cause the problems with speed. I will have more to show and say on this in the weeks ahead, but I am already spread too thin across multiple things. I will leave you good people to try different approaches.

DavidS
Posts: 3704
Joined: Thu Dec 15, 2011 6:39 am
Location: USA
Contact: Website

Re: Bare metal graphics : hardware acceleration ?

Fri Nov 23, 2018 6:05 pm

LdB wrote:
Fri Nov 23, 2018 5:28 pm
It is not the rectangles or triangles that are the problem; it is always the text and bitmaps that cause the problems with speed. I will have more to show and say on this in the weeks ahead, but I am already spread too thin across multiple things. I will leave you good people to try different approaches.
On text output:
If just outputting white text on a black screen (or any single foreground color on any single background color), there are some simple tricks to implement this at fairly high speed.

There is the old trick of loading the bitmaps of the current line of the font for 4 characters into a register (assuming an 8-pixel-wide text-mode font), then using that with shift/tst/move ops to construct the 32 pixels in registers (works very fast for 8bpp modes, and still fairly well for 32bpp). This also takes advantage of the nature of the cache in forming the output data.

Then there are a few different tricks for using DMA with stride to perform bitmapped text output.
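
A rough C sketch of that shift-and-test idea, under some assumptions that are not in the post: a 16 bpp target, a font whose row bytes have the most significant bit as the leftmost pixel, and the hypothetical name draw_glyph_row4. A hand-written ARM version would keep everything in registers, which plain C cannot guarantee.

```c
#include <stddef.h>
#include <stdint.h>

/* Render one scanline of four 8-pixel-wide glyphs into a 16 bpp row.
 * rows[i] is the bitmap of this scanline for glyph i, with the MSB as
 * the leftmost pixel (an assumption about the font layout). */
void draw_glyph_row4(uint16_t *dst, const uint8_t rows[4],
                     uint16_t fg, uint16_t bg)
{
    /* Pack the four row bitmaps into one 32-bit word, leftmost glyph
     * in the top byte. */
    uint32_t bits = ((uint32_t)rows[0] << 24) | ((uint32_t)rows[1] << 16) |
                    ((uint32_t)rows[2] << 8)  |  (uint32_t)rows[3];

    /* Test the top bit, emit a pixel, shift left: 32 pixels in total. */
    for (int i = 0; i < 32; i++) {
        dst[i] = (bits & 0x80000000u) ? fg : bg;
        bits <<= 1;
    }
}
```

The caller would repeat this for each scanline of the font, advancing dst by one framebuffer row each time.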

And this is all without using the GPU at all.

On bitmaps:
It is not so difficult to do bitmaps with multi-word load/store at about half the speed of drawing rectangles to the framebuffer. Just be sure that the in-memory format is the same as the framebuffer's. This is basically a memcpy.
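
As an illustration of "basically a memcpy", here is a minimal sketch that copies a bitmap row by row. It assumes the bitmap is already stored in the framebuffer's 16 bpp format, does no clipping, and uses the hypothetical name blit16; strides are given in pixels.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a w x h bitmap into a 16 bpp framebuffer at (x, y).
 * Both strides are in pixels. No clipping is done, so the caller
 * must make sure the bitmap fits inside the framebuffer. */
void blit16(uint16_t *fb, size_t fb_stride, size_t x, size_t y,
            const uint16_t *src, size_t src_stride, size_t w, size_t h)
{
    uint16_t *dst = fb + y * fb_stride + x;

    for (size_t row = 0; row < h; row++) {
        memcpy(dst, src, w * sizeof *src);  /* one whole row at a time */
        dst += fb_stride;
        src += src_stride;
    }
}
```

Leaving the inner copy to memcpy lets the library or compiler choose the widest load/store sequence available on the target.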

On Bitmapped Graphics:
This one can be a bit slow if drawing things like lines, circles, etc., though there are still some tricks to speed it up. That is too long a thing to get into in a forum post.

That said, one thing is to special-case horizontal lines, as these can be drawn extremely fast. Then the choice of plotting algorithm comes in next, though if using something like Bresenham's run segment line drawing algorithm you can plot multiple pixels at a time for any line with a greater horizontal length than its vertical length (in pixels). Then there are other methods to speed up many things.
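
A sketch of both ideas in C, with hypothetical names hspan and draw_line_xmajor. Note that this is ordinary Bresenham stepping that batches same-row pixels into horizontal spans, not the full run-based variant mentioned above, and that it assumes a 16 bpp buffer, a line at least as wide as it is tall, and no clipping.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Fill n pixels on one row; stands in for a faster span routine. */
static void hspan(uint16_t *fb, size_t stride, int x, int y, int n, uint16_t c)
{
    uint16_t *p = fb + (size_t)y * stride + (size_t)x;
    while (n-- > 0)
        *p++ = c;
}

/* Draw a line whose horizontal extent is >= its vertical extent. */
void draw_line_xmajor(uint16_t *fb, size_t stride,
                      int x0, int y0, int x1, int y1, uint16_t c)
{
    if (x0 > x1) {                      /* always draw left to right */
        int t = x0; x0 = x1; x1 = t;
        t = y0; y0 = y1; y1 = t;
    }
    int dx = x1 - x0;
    int dy = abs(y1 - y0);
    int ystep = (y1 > y0) ? 1 : -1;

    if (dy == 0) {                      /* horizontal line: a single span */
        hspan(fb, stride, x0, y0, dx + 1, c);
        return;
    }

    int err = dx / 2;                   /* classic Bresenham error term */
    int run_start = x0;
    int y = y0;
    for (int x = x0; x <= x1; x++) {
        err -= dy;
        if (err < 0) {                  /* row changes after this pixel */
            hspan(fb, stride, run_start, y, x - run_start + 1, c);
            run_start = x + 1;
            y += ystep;
            err += dx;
        }
    }
    if (run_start <= x1)                /* flush the last run, if any */
        hspan(fb, stride, run_start, y, x1 - run_start + 1, c);
}
```

The shallower the line, the longer each span, so more of the work lands in the span routine where wide stores can be used.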

DavidS
Posts: 3704
Joined: Thu Dec 15, 2011 6:39 am
Location: USA
Contact: Website

Re: Bare metal graphics : hardware acceleration ?

Fri Nov 23, 2018 7:02 pm

Also remember that if we are using the ARM or VFP/NEON core to plot to the framebuffer, our biggest limit is the bus speed. So designing our algorithms to take advantage of this, by doing things between memory accesses, is a huge speed-up in many cases.

Always optimize your algorithm first.

So if I calculate the length of the next run of a line to plot using a run segment implementation, then fill enough registers with the color value, then plot the next pixels (with 128 bits written at once), I am saving time. If I have to read the existing values to preserve the existing background, then I split things up so that there are some register-only operations between every access (read or write).

Add to that making sure that the implementation of the loop fits into a single cache line, and is aligned to a 16-word (64-byte) boundary in memory, and we get the best performance we can expect.

Add the careful use of registers in the implementation, making sure that no instruction uses the destination register of either of the 2 instructions immediately before it, and we can get some real speed. This goes for both ARM and VFP/NEON.

Do all these things and you will still be ahead of the bus speed for just about any graphics drawing that is common for 2D graphics.

And any time that consecutive memory locations are accessed, access them in bus sized chunks, 128 bits in the case of the Raspberry Pi (bigger will just stall the pipeline while waiting for the bus, thus slowing it down).
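
A portable C sketch of writing in 128-bit chunks, assuming 16 bpp pixels and the hypothetical name fill_span16_blocks. It aligns the destination to a 16-byte boundary first, then writes 8 pixels per block. On an ARM build with NEON enabled the 16-byte memcpy may compile to a single vector store, but that depends on the compiler and its flags.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill `count` 16 bpp pixels, using 16-byte (128-bit) aligned block
 * writes for the bulk of the span. */
void fill_span16_blocks(uint16_t *dst, size_t count, uint16_t color)
{
    uint16_t block[8];                  /* 8 pixels x 16 bits = 128 bits */
    for (int i = 0; i < 8; i++)
        block[i] = color;

    /* Head: single pixels until dst reaches a 16-byte boundary. */
    while (count > 0 && ((uintptr_t)dst & 15u) != 0) {
        *dst++ = color;
        count--;
    }

    /* Body: one 128-bit block per iteration. */
    while (count >= 8) {
        memcpy(dst, block, sizeof block);
        dst += 8;
        count -= 8;
    }

    /* Tail: whatever is left after the last full block. */
    while (count > 0) {
        *dst++ = color;
        count--;
    }
}
```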

That said I am sure that someone else can come up with some more optimizations that I am missing out here, that may take advantage of some trick to lessen the effect of the bus on drawing speed.
