Accelerated X driver testing
Eight months of hard programming, 20,000+ lines of code (many thousands of them for a piece of hardware with no public documentation) and a dozen corrupted SD cards later...I've learnt what I had guessed in the first place: the traditional X desktop environment is heavily dependent on CPU power, and even the fastest and most open GPU in the world can only help so much! The performance of the existing application->X infrastructure is dominated by overhead, not the time spent drawing pixels. Incredibly, even PC X drivers lean heavily on CPU power.
Much time has been spent profiling the drawing habits of common applications. A large proportion of draw calls are for sub-15x15 pixel operations, many are for 1x1 operations, and a small proportion try to manipulate tens of thousands of pixels. Synchronisation points (where the application and X server must pause until the driver is done) appear after every handful of operations. As such, the system has been constructed around this behaviour.
With workloads this small being very common, operation start-up latency is of paramount importance: Dom uncovered an issue where an overhead of just 6 microseconds per draw call was causing a noticeable slowdown in a number of applications. This is the single biggest reason why an "all OpenGL/VG/ES" X driver is not appropriate, and why most OSes (think Android, iOS, Windows Vista etc) have chosen to write new display systems from scratch with GPUs in mind.
I've done a write-up for the design of the driver I've made if you're interested.
To summarise, as much pixel-pushing work as possible is removed from the CPU: DMA handles fills and blits, and the VPU handles ~50 different types of common blend function. The CPU handles the cases where it would be more efficient not to leave the CPU, plus another ~250 blend functions. The 3D hardware is not used, and therefore neither is OpenGL ES/VG/whatever.
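The dispatch decision described above can be sketched roughly as follows. This is purely illustrative, not the driver's real API: the threshold, the function names, and the assumption that the ~50 VPU-capable blends sit in a contiguous range are all made up for the example; only the CPU/DMA/VPU split itself comes from the post.

```c
#include <stdbool.h>

enum backend { BACKEND_CPU, BACKEND_DMA, BACKEND_VPU };

/* Illustrative stand-in: assume the ~50 VPU-accelerated blend
   functions occupy the first 50 blend-op codes. */
static bool vpu_supports_blend(int blend_op)
{
    return blend_op < 50;
}

static enum backend pick_backend(int width, int height,
                                 bool is_blend, int blend_op)
{
    /* Below this many pixels, start-up latency dominates, so it is
       cheaper to do the work immediately on the CPU. */
    const int small_op_pixels = 15 * 15;

    if (width * height <= small_op_pixels)
        return BACKEND_CPU;

    if (is_blend)
        return vpu_supports_blend(blend_op) ? BACKEND_VPU
                                            : BACKEND_CPU;

    return BACKEND_DMA;     /* plain fills and blits */
}
```

The point of the sketch is the shape of the decision, matching the profiling above: tiny operations never pay the off-CPU set-up cost, large fills/blits go to DMA, and supported blends go to the VPU.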
That said, the places the pixel-pushing work has been moved to are much faster. This enables 32-bit colour mode and high-resolution screens (I use 1920x1200x32 with my Pi). Composition operations are ~12x faster on the VPU than on the CPU.
FINALLY, you must consider that many, many X applications now do the majority of their rendering on the client side...a place the driver cannot reach. This will be as slow as before. Other people are tackling that problem.
Finally #2: Raspbian already includes some code I have written which improves the performance of copies and fills. So when you're testing this driver, consider its performance versus day one of the Raspberry Pi!
What I need
I need help testing what I have written so far, before making more changes. I need to discover:
- display corruption
- memory card corruption (oh yeah)
- unexpected slowdowns
- common usage patterns
- applications which are poorly coded
Bug fixes will then be made, and the driver will be tuned around individual application 'issues' and operations that I had not considered (important) before.
I am not interested in performance analysis of what I have written: the build supplied here is full of validation code to ensure any bugs do not take down your system or corrupt your SD card. Draw calls are also logged in order to generate internal statistics. Both of these things cost valuable CPU time.
People testing this should be happy twiddling config files, reimaging their SD card, using bleeding-edge firmware and kernels, and so on. I wouldn't recommend this for day-to-day use, nor anywhere you can't tolerate failure.
I would strongly recommend running this in a debugger, via SSH from another machine, so that you can see the debugger output at all times. With the 'VerboseReporting' option it will dump lots of info to the debugger TTY, allowing you to help me tune the driver based on its workload.
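For the config-file twiddling, something along these lines is what I mean. Note this fragment is illustrative only: the option names ('VerboseReporting', 'VpuOffload') are from this post, but the section layout, driver name and values are assumptions, so check the wiki instructions for the real syntax:

```
Section "Device"
    Identifier  "Raspberry Pi"
    Driver      "rpi"
    Option      "VerboseReporting" "true"   # dump workload info to the debugger TTY
    Option      "VpuOffload" "false"        # disable if RE-derived code is a concern
EndSection
```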
What to do
Read this from beginning to end, and slowly follow each of the steps.
http://elinux.org/RPi_Xorg_rpi_Driver#I ... the_driver
http://elinux.org/RPi_Xorg_rpi_Driver#C ... the_driver
When things go wrong, read this:
Yes, it's just currently for Raspbian hard float, and no I haven't yet written compilation instructions.
Whatever happens, please report back to this thread! I would like to ensure the instructions are as good as possible, so that people can do it all without my assistance.
If you're interested in directly programming the GPU (it's entertaining) then I'd recommend starting here where these guys have done a stellar job. Anyone wanting to write an OpenCL run-time should start here.
If you're interested in writing for the Linux kernel I'd recommend reading this. Some of it is out of date, but still very helpful. I read it cover-to-cover!
If reverse engineering or information derived through reverse engineering is not permitted in your country, please make sure to disable the VpuOffload option. The ARM side does not make use of RE.
I've had lots of help on this task so far, and my thanks go to everyone involved: the people who have tested stuff (Liam, Charlie, Josh, Alex, there are more...), others for technical assistance (Siarhei, Michel, the DMA controller guy at Broadcom), VPU help (Herman, Tiernan - your disassembler!) and misc (Eben). Finally, a big thanks to Dom for all the endless emails.