Simon's accelerated X development thread


by teh_orph » Mon Apr 09, 2012 10:07 am
Hello guys,

I've spent the bank hol looking into what's involved in writing an accelerated X server for the Rpi. I of course have no hardware to find out how hard it may actually be, but it certainly seems doable.

Anyway searching for similar posts on this forum leads to mixed information as to who else is working on this. There's mention of a 'bounty', the Fedora Remix people doing it, some people (dom?) have already had a look etc.

Before I start coding it would be great to find out if people have already started this, to see how far along they are and perhaps share some findings etc. Plus, no sense in writing another one if they've got hardware and are nearly ready to release!

Cheers
by liamfraser280 » Mon Apr 09, 2012 10:49 am
Although I don't really have the time or expertise to work on this, I'd love to see this go ahead as it would certainly make the Pi more usable for a lot of scenarios :)

If we could somehow run a vnc server with the encoding done on the GPU that would be pretty cool too.

Cheers,

Liam.
by spurious » Mon Apr 09, 2012 10:57 am
I think it may be an idea to have a set of SVN repositories for R-Pi specific extensions to the various Linux distros. That way we wouldn't end up with multiple people working towards the same goal on different sources.

Not sure how something like this can be organised, but hey it's just an idea.  :)
by teh_orph » Mon Apr 09, 2012 11:43 am
Yeah, that's a good point actually. I'd also be keen on building with exactly the same set of code used to make the (unaccelerated) release of X, e.g. the Fedora Remix. Quickly looking through their site, it's not jumping out at me how to get the source code and their patches. Will the source be installable via yum?

On their FAQ they also seem keen on getting help with this X stuff, and suggest not forking the code base too much :)

http://zenit.senecac.on.ca/wik.....ers_needed

I'm also keen to get some documentation on the Rpi CPU-GPU software link as provided in those 'vc' directories. There aren't a lot of comments in the code...

Finally, in some of the other posts there's talk of programmable DMA and a blitter. There's nothing about this in the vc directory... Anyone in the know?
by maribu » Mon Apr 09, 2012 11:51 am
Hi, everybody!

There's no need to write a new X server. The X.org X server has a graphics driver infrastructure, which means you can install, develop and build drivers separately from the X server itself.

Somewhere in the forum someone suggested building an OpenVG driver for X. This driver would provide 2D acceleration on all hardware that provides OpenVG, as the Raspberry Pi does. So you don't even have to put up a Pi-specific SVN server. Once someone has developed such a driver it will probably soon be added to all ARM Linux distributions, as OpenVG seems to be quite common on ARM hardware.

Hopefully someone is already on the task.
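
To give a flavour of what such a driver would boil down to, here is a minimal, untested sketch of a solid rectangle fill through OpenVG. It assumes an EGL context is already bound to the target surface, and error handling is omitted:

Code:

#include <VG/openvg.h>

/* Solid-fill a rectangle using OpenVG. Assumes an EGL context bound
 * to the target surface is already current on this thread. */
static void vg_solid_fill(int x, int y, int w, int h,
                          float r, float g, float b, float a)
{
    const VGfloat colour[4] = { r, g, b, a };

    /* vgClear fills the given rectangle with VG_CLEAR_COLOR, which is
     * exactly the Solid operation a 2D X driver needs. */
    vgSetfv(VG_CLEAR_COLOR, 4, colour);
    vgClear(x, y, w, h);
}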

 
by teh_orph » Mon Apr 09, 2012 12:12 pm
Hi Maribu,

You're right, a whole new X server is not necessary. Unfortunately the X server currently used on the Rpi (with the generic framebuffer driver) does not have the required hooks in it to allow EXA acceleration. This means the server will need to be *modified* to support this. The actual code to do the acceleration can live elsewhere, but the hooks must still be added.

It's likely that not all operations can be performed on the GPU, and the CPU will still be required to do some work. X's libfb (not to be confused with the fbdev driver) does its drawing on the CPU. There is an MMX fast path, but of course we don't have MMX! Busting out the assembler and adding an ARM 32-bit SIMD path will likely bring a win here...
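
For reference, the hooks I mean look something like this: a minimal sketch of registering an EXA solid-fill path from a driver's screen init. The rpi_* names are hypothetical, and I've left out the offscreen memory-layout fields a real driver must also fill in:

Code:

#include "exa.h"

/* Hypothetical Pi-side callbacks: prepare, issue and finish a solid
 * fill on the GPU/DMA side. */
static Bool rpi_prepare_solid(PixmapPtr pPixmap, int alu,
                              Pixel planemask, Pixel fg);
static void rpi_solid(PixmapPtr pPixmap, int x1, int y1, int x2, int y2);
static void rpi_done_solid(PixmapPtr pPixmap);

static Bool rpi_exa_init(ScreenPtr pScreen)
{
    ExaDriverPtr exa = exaDriverAlloc();
    if (!exa)
        return FALSE;

    exa->exa_major = EXA_VERSION_MAJOR;
    exa->exa_minor = EXA_VERSION_MINOR;

    /* Accelerated solid fill; Copy and Composite hooks follow the same
     * Prepare/Op/Done pattern. Anything left unhooked falls back to the
     * software (fb) path. */
    exa->PrepareSolid = rpi_prepare_solid;
    exa->Solid        = rpi_solid;
    exa->DoneSolid    = rpi_done_solid;

    return exaDriverInit(pScreen, exa);
}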

I'm a bit reluctant to start coding now - if I get my Pi months after everyone else bit rot will set in!
by jamesh » Mon Apr 09, 2012 1:46 pm
maribu said:

"There's no need to write a new X server. [...] Hopefully someone is already on the task."

As Maribu said, I think the best way of accelerating X is to use the OpenVG or OpenGL ES interfaces. These are completely standard, so they would also work on other devices; using anything else makes it Raspberry Pi specific. It also means development could be done on a different, faster device. (I'm using Mesa OpenGL ES on my Ubuntu desktop as prep for Raspi code.)

I'd be inclined to go for OpenGL ES rather than VG, as VG uses GL ES internally on the GPU, giving an extra level of indirection you don't need.
by teh_orph » Mon Apr 09, 2012 2:09 pm
I'm wondering what the CPU overhead of using ES/VG is - X seems quite antiquated in that many operations are tiny, so the set-up costs would be quite high. Any idea what the latency of a zero-work operation performed by the GPU is?

E.g. let's say I draw (with ES) just one pixel onto the framebuffer - what's the latency from the CPU's point of view?

The memory layout with respect to these libraries is also not totally clear from the implementation provided to us. (James, can you help out here?)
Let's imagine I'd like to blit a texture from main memory onto the framebuffer. Is this piece of memory copied into GPU-visible memory (via DMA?), the blit performed into a GPU-visible surface, and a copy then performed back into the CPU-visible framebuffer?

If so, this would be pretty wasteful from a copying and cache-flush perspective! Again if so, can I map a piece of the GPU memory into the CPU's address space to avoid all this?
by jamesh » Mon Apr 09, 2012 2:30 pm
I'm really not up on the actual mechanics of this. However, I think if you are using OGLES/VG you have a separate surface, and you probably won't be using the framebuffer at all. The GPU can composite multiple framebuffers in real time - so you have a number of surfaces defined (I think the FB is one of them) which are rotated etc. and composited to the output. As to the copying involved, I'm not sure. I think you have to map from the address space of the ARM to the flat space of the GPU, which takes some code, but I don't think whole buffers are copied. Dom will probably put me right on that one...

Note to self. Must learn this stuff.
by teh_orph » Mon Apr 09, 2012 3:03 pm
Cool James, cheers for the info.
This actually sounds much worse than I'd hoped... that'll make the alpha blending stuff fun if some layers are built by the CPU and others by the GPU! It appears that only a handful of things can be (trivially) redirected to the GPU under X.

Although thinking about it, do you know if the framebuffer memory (that can be poked from the CPU) is mapped into the GPU address space?

(This is why I was really hoping for just a blitter and a DMA engine ;)
by dom » Mon Apr 09, 2012 4:02 pm
The ARM and GPU share memory space.
The framebuffer is shared. The ARM can write a pixel and it will appear on the screen (through GPU hardware) without any flushing or copying being required.
The DMA hardware can also access the whole memory space and can perform 2D fills and blits (no blending). This is documented in the peripheral spec posted.
The DMA is just an ARM-accessible peripheral and can be set up with low latency (e.g. microseconds).
OpenGL ES/OpenVG has high latency. Writing to the framebuffer then reading it back is very inefficient (e.g. milliseconds). If you can drive it in a unidirectional way, just streaming commands at it, then that is efficient. I don't know if X works that way though.
(And OpenVG is not implemented on top of OpenGL ES - it uses the same hardware but as a first-class interface, so I wouldn't discourage its use.)
by jamesh » Mon Apr 09, 2012 4:10 pm
There, told you Dom would put me right!

Thanks Dom.
by teh_orph » Mon Apr 09, 2012 4:55 pm
Cheers Dom.

I'm having a re-read of the datasheet (this copy, if that's important http://www.raspberrypi.org/wp-.....herals.pdf) and have a few more questions about this DMA/memory mapping. Any chance of a bit more help? :)

On page 7 it says that when using the DMA engines you must use a 0xc0000000-based bus address to access SDRAM, yet non-DMA access should go via a 0x0-based bus address. Why is this? On page 4, it says that 0xc... is uncached and 0x0... is L1/L2 cached. Surely if I'm DMAing I want a CPU-coherent copy of the data? Or does this mean that the DMA hardware is not cache coherent? Although page 38 says that DMA can be used to fill the L2 - does that mean DMA can't see the L1? Can anything other than the CPU see the L1?

Whereabouts is the info on DMA being used to do 2D fills? I see SRC_IGNORE on page 51, but that sounds like it's used to zero memory instead.

Finally, can I drive DMA via user mode or do I need to be privileged? (I'm looking at arch/arm/mach-bcm2708/include/mach/dma.h)

Thanks again!
by dom » Mon Apr 09, 2012 5:27 pm
For 2D DMA, set TDMODE, and the spec says:

"interpret the TXFR_LEN register as YLENGTH number of transfers each of XLENGTH, and add the strides to the address after each transfer."

So set STRIDE to the pitch of the image; the width is XLENGTH and the height is YLENGTH.

You would fill by not setting SRC_INC and pointing the source at your fill data.

The DMA cannot see the ARM's L1 cache, so you would map the framebuffer with ioremap_nocache. Depending on where the source data comes from it may need an L1 cache flush.

The DMA can see the L2 cache.

Use 0xC0000000 bus addresses when L2 is disabled and 0x40000000 bus addresses when L2 is enabled.

(actually just call virt_to_bus and you'll get the right address out).
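
Putting that together with the datasheet, a 2D fill would be programmed with a control block something like this. This is an untested kernel-side sketch; the struct layout and TI bit positions are my reading of the peripheral spec, so double-check them:

Code:

#include <linux/types.h>

/* One BCM2835 DMA control block; must be 256-bit (32-byte) aligned. */
struct dma_cb {
    u32 ti;       /* transfer information (TDMODE, SRC_INC, ...) */
    u32 src;      /* source bus address */
    u32 dst;      /* destination bus address */
    u32 len;      /* 2D mode: YLENGTH in [29:16], XLENGTH in [15:0] */
    u32 stride;   /* D_STRIDE in [31:16], S_STRIDE in [15:0] */
    u32 next;     /* bus address of next control block, 0 to stop */
    u32 pad[2];
} __attribute__((aligned(32)));

#define DMA_TI_TDMODE   (1 << 1)   /* 2D transfer mode */
#define DMA_TI_DEST_INC (1 << 4)   /* increment destination address */
#define DMA_TI_SRC_INC  (1 << 8)   /* increment source (clear = fill) */

/* Build a w x h fill of a 32bpp framebuffer. Both addresses are bus
 * addresses (e.g. from virt_to_bus); the framebuffer itself would be
 * mapped with ioremap_nocache since the DMA can't see the ARM's L1. */
static void dma_build_2d_fill(struct dma_cb *cb, u32 src_bus, u32 dst_bus,
                              int w, int h, int pitch)
{
    u32 xlen = w * 4;                        /* bytes per row */

    cb->ti     = DMA_TI_TDMODE | DMA_TI_DEST_INC;  /* SRC_INC left clear */
    cb->src    = src_bus;                    /* one 32-bit fill value */
    cb->dst    = dst_bus;
    cb->len    = ((u32)h << 16) | xlen;      /* YLENGTH:XLENGTH */
    cb->stride = (u32)(pitch - xlen) << 16;  /* dest skips to next row */
    cb->next   = 0;
    /* To kick it off, write the control block's bus address to the
     * channel's CONBLK_AD register and set the ACTIVE bit in CS. */
}
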
by dom » Mon Apr 09, 2012 5:31 pm
Like any peripheral, it should be written to from a kernel driver. User mode doesn't have access.

However, there is a hack where you can (as the root user) mmap /dev/mem to give the peripheral addresses a virtually mapped address, allowing user-mode access. I'd class this as a hack that can make development easier rather than a recommended way of doing things. See the example code for driving the Gertboard peripherals for an example of this scheme.
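
The scheme looks roughly like this - a user-space sketch only, and definitely in the hack category. 0x20000000 is the ARM physical base of the BCM2835 peripherals, and 0x7000 is my reading of the DMA block's offset in the datasheet:

Code:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define PERI_BASE  0x20000000u   /* BCM2835 peripherals, ARM physical */
#define DMA_OFFSET 0x00007000u   /* DMA channel 0 register block */

int main(void)
{
    /* Root only: map one page of peripheral registers into user space. */
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) {
        perror("open /dev/mem");
        return 1;
    }

    volatile uint32_t *dma = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, PERI_BASE + DMA_OFFSET);
    if (dma == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* dma[0] is channel 0's CS (control/status) register and dma[1] its
     * CONBLK_AD (control block address), per the peripheral spec. */
    printf("DMA0 CS = 0x%08x\n", dma[0]);

    munmap((void *)dma, 4096);
    close(fd);
    return 0;
}
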
by louisb » Mon Apr 09, 2012 5:41 pm
Just wondering if XBMC / OpenELEC do hardware acceleration for the GUI, and if so, how they do it? They're cross-platform, so they probably draw directly using OpenGL. Is there anything we can learn from them?
by teh_orph » Mon Apr 09, 2012 7:07 pm
Cheers Dom, I was thinking that could be used for fill ;)
The mmap thing is interesting. Having user-exposed DMA is a big fat security hole, but I wonder how much faster it would be than having to do user<->kernel transitions and bounds checking. (If the DMA could be connected to the ARM/user TLB, that'd fix that, I guess.)

If the framebuffer is mapped CPU+GPU does that mean the memory is not L1 cached, or is the cache flushed on vsync or something?

Louis: I haven't used XBMC before, do you reckon they use a custom full-screen OpenGL UI or something? Although maybe they wrote an OpenGL ES version for the Pi... I'll have a look.
by dom » Mon Apr 09, 2012 7:32 pm
The framebuffer (like in most video drivers) is not L1 cached. No flushing is required.
XBMC uses a full-screen OpenGL ES GUI (so no X). They already had this from the ATV2 (and other) platforms.
by ArborealSeer » Mon Apr 09, 2012 7:39 pm
That's interesting. From what I've seen before, I think people were assuming the front end wasn't accelerated and that only the video decode (when possible, via H.264) was.

I think the reasoning was that the XBMC demos shown so far were using a very basic skin?
by dom » Mon Apr 09, 2012 9:52 pm
XBMC really needs some profiling done on the ARM side. The GPU is almost idle running the GUI, but the ARM is running flat out.
I'd guess it's mainly fonts (FreeType, rendered by the ARM) and the general work of interpreting skins to decide where the textures go that takes the time.
But I'm just guessing.
by Ren » Tue Apr 10, 2012 1:00 am
I thought that, due to XBMC's heritage (single-threaded, on a games console), it was written as a game loop (as opposed to being event-driven), so it is continually refreshing the screen and eating CPU time.
by Jim Manley » Tue Apr 10, 2012 2:37 am
There are a number of us with experience in X from when it was known as Project Athena at MIT, and with GL going back to Original Flavor (on the very first SGI IRIS workstations).

OpenGL is the correct route since it's already widely used in many other environments, and there are benefits for some applications to having 2-D X FBs that can be further manipulated in 3-D (think about being able to rotate the face of a cube that has X and other active or saved screen views on each face as the view is changed, e.g., various desktop environments).

All text (typeface rendering, of which font family is just one element) and graphics operations must be performed within the GPU portion of RAM, since video comes out of the GPU directly. The only code that should be in ARM CPU RAM is code that doesn't perform any graphics functionality and instead makes OpenGL calls into the GPU.

Has anyone looked at GLX, the OpenGL extensions for X:

http://www.opengl.org/sdk/docs.....XIntro.xml

I believe all of the GL-side work is done, and integration of the particular X server in the Linux distros probably needs to be performed, if we're inheriting X down trees in which GLX hasn't been integrated.

I haven"t dealt with this for quite some time and that wasn"t for ARM, but, if there are particular problems that can"t be figured out, I and others with experience can help. If we need more head count to do the grunt work, I can see if we can pull in folks who have done this for many other platforms - they would be much faster than people who haven"t done this. Failing that in our desired timeframe, perhaps we can crowd-source it among all those college/university students around the world who seem to have unlimited quantities of time, hypercaffeinated beverages, and a lack of need for sleep, the opposite sex, etc. ;) iOS developers who have done OpenGL ES work (e.g., games) would probably be most useful.
by jojopi » Tue Apr 10, 2012 4:15 am
Jim Manley said:

"OpenGL is the correct route since it's already widely used in many other environments, and there are benefits for some applications to having 2-D X FBs that can be further manipulated in 3-D"

Is it though?  Compositing is mostly a gimmick.  An OGLES-based implementation will require linking the X server against the proprietary library, which many will not like.  For the same reason, this will not be a permissible means of accelerating the kernel fb -- is console scrolling slow too?

Also, an implementation that allows basic 2D X acceleration with the least RAM allocated to the GPU may be more usable in practice, even if it is technically slower.
by shirro » Tue Apr 10, 2012 7:20 am
Yes, I agree basic 2D acceleration via EXA, with support for Solid, Copy, Composite, UTS (UploadToScreen) and DFS (DownloadFromScreen), is the priority.

GLX is totally off target, as the platform supports ES, not desktop OpenGL. More appropriate would be an EGL extension that allowed you to attach a surface to X windows, and I don't know how any of that works (yet). Anyway, there aren't a lot of programs for Linux that have GLES support, and there are a hell of a lot that need to move rectangles around.

Down the track it would be brilliant to support Wayland, which requires some EGL extensions to share surfaces between processes, I believe. Still, as cool as that is, Wayland is a work in progress, and supporting it has no immediate benefit, unlike EXA.

I haven't delved into the Raspberry Pi kernel stuff much. Is there any framebuffer accel stuff in there that could be built upon? The support for the X.org driver will have to be in the kernel. Anyone who thinks it can be built on the userspace OpenGL libs is way off target, I think.

It is a bit cheeky, but the Freescale i.MX X.org driver is LGPL (and probably the TI OMAP ones as well), so it wouldn't be hard to see how it is done.
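
For reference, the UTS and DFS hooks have these shapes in the EXA driver interface. The rpi_* bodies below are just placeholders showing the fallback contract; a real driver would kick off a DMA transfer instead:

Code:

#include "exa.h"

/* UploadToScreen: copy a w x h rectangle from system memory (src, with
 * pitch src_pitch) into the offscreen pixmap pDst. */
static Bool rpi_upload_to_screen(PixmapPtr pDst, int x, int y, int w, int h,
                                 char *src, int src_pitch)
{
    /* A real driver would program a DMA from src into pDst's offscreen
     * storage here. Returning FALSE tells EXA to fall back to software
     * migration instead. */
    return FALSE;
}

/* DownloadFromScreen: the reverse direction, offscreen -> system memory. */
static Bool rpi_download_from_screen(PixmapPtr pSrc, int x, int y, int w, int h,
                                     char *dst, int dst_pitch)
{
    return FALSE;
}

/* These get wired up alongside Solid/Copy/Composite in the ExaDriverRec:
 *   exa->UploadToScreen     = rpi_upload_to_screen;
 *   exa->DownloadFromScreen = rpi_download_from_screen;
 */
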
by Jim Manley » Tue Apr 10, 2012 7:21 am
jojopi said:

"Is it though? Compositing is mostly a gimmick. An OGLES-based implementation will require linking the X server against the proprietary library, which many will not like. For the same reason, this will not be a permissible means of accelerating the kernel fb -- is console scrolling slow too?"

How is _Open_ GL proprietary?

The goal of any kernel implementation is to make all of the hardware features available to the OS, services, and applications. If you had extra registers or I/O ports to work with, would you just ignore them? A boot-time switch could be implemented to choose between accelerated and unaccelerated modes, or even at startx time, if someone had some reason for not wanting acceleration (using the GPU frees up ARM CPU resources, which are in short supply, if no one noticed).

As for the kernel FB, if you're talking about during kernel boot, you won't be in X yet, so you'll be executing solely on the ARM. If you mean a console window in X, it's just another window, and where the content comes from is immaterial.

jojopi also said:

"Also, an implementation that allows basic 2D X acceleration with the least RAM allocated to the GPU may be more usable in practice, even if it is technically slower."

Unless you"ve got so many 3-D data structures crammed into GPU RAM that there"s no room left, X memory utilization is so insignificant compared with the amount of RAM available to the GPU that, even with 64 MBs allocated to the GPU, there should be no discernible performance difference between the 64 and 128 MB allocations. All of the magic occurs in the GPU, and the GPU RAM is for storing data structures to be processed by the GPU (e.g., window element, icon, pointer, pattern, etc., primitives, for X). Output to video goes directly to the HDMI buffer in the GPU and the DAC to the composite port, AIUI.