For anyone interested, I've written a parallel SHA-256 implementation for the QPU:
https://github.com/elorimer/rpi-playground
It does about 3.1 MH/s (single-block hash) at about 93% efficiency (IPC), which makes it 14.6x faster than the CPU reference implementation (which probably says at least as much about how slow the CPU is ...). Considering that only about half the QPU gets used (SHA-256 is heavy on add-pipe operations and really can't make use of the multiply pipe), that's pretty respectable.
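To illustrate why the multiply pipe sits idle, here is a rough sketch of a single-block SHA-256 in plain Python. It is not the QPU code from the repository, and the function names (`compress`, `sha256_single_block`, `icbrt`) are made up for this example. The point is that the compression loop consists only of 32-bit adds, rotates, shifts and bitwise logic. The only multiplications are in the one-time constant setup, outside the hot loop.

```python
import hashlib
import struct
from math import isqrt

MASK = 0xFFFFFFFF  # keep every value in 32 bits


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def icbrt(n):
    """Exact integer cube root by binary search (setup only)."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def first_primes(count):
    """First `count` primes by trial division."""
    primes, c = [], 2
    while len(primes) < count:
        if all(c % p for p in primes):
            primes.append(c)
        c += 1
    return primes


PRIMES = first_primes(64)
# Standard SHA-256 constants: fractional bits of square / cube roots of primes.
H0 = [isqrt(p << 64) & MASK for p in PRIMES[:8]]
K = [icbrt(p << 96) & MASK for p in PRIMES]


def compress(state, block):
    """One SHA-256 compression: adds, rotates, shifts and logic only."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 64):  # message schedule
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    for t in range(64):  # 64 rounds
        big_s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + big_s1 + ch + K[t] + w[t]) & MASK
        big_s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + maj) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [(s + v) & MASK for s, v in zip(state, (a, b, c, d, e, f, g, h))]


def sha256_single_block(msg):
    """Hash a message short enough (<= 55 bytes) to fit one padded block."""
    assert len(msg) <= 55
    block = msg + b"\x80" + b"\x00" * (55 - len(msg)) + struct.pack(">Q", 8 * len(msg))
    return struct.pack(">8I", *compress(H0, block))


if __name__ == "__main__":
    # Cross-check against the standard library implementation.
    print(sha256_single_block(b"abc") == hashlib.sha256(b"abc").digest())
```

"Single-block hash" here means one call to `compress` per message, which is what the 3.1 MH/s figure above counts.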
I have no interest in BTC mining, but if someone else wants to try to integrate it, feel free. (Obviously, the hash rate would be significantly lower: it would drop by at least half, since mining needs more than one SHA-256 block per candidate.)
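As a back-of-the-envelope check on that drop, the sketch below counts SHA-256 compression calls per mining candidate. It assumes the usual Bitcoin scheme of hashing an 80-byte block header twice. The helper names (`padded_blocks`, `double_sha256`) are invented for this example, and it uses `hashlib` rather than the QPU code.

```python
import hashlib


def padded_blocks(length_bytes):
    """Number of 64-byte blocks SHA-256 processes for a message of this length.

    Padding adds one 0x80 byte and an 8-byte length field, then rounds up
    to a multiple of 64 bytes.
    """
    return (length_bytes + 1 + 8 + 63) // 64


def double_sha256(data):
    """SHA-256 applied twice, as assumed here for Bitcoin block headers."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


HEADER_LEN = 80  # assumed Bitcoin header size in bytes
DIGEST_LEN = 32  # SHA-256 output size in bytes

# First pass covers the header, second pass covers the 32-byte digest.
blocks_per_candidate = padded_blocks(HEADER_LEN) + padded_blocks(DIGEST_LEN)

if __name__ == "__main__":
    print("compressions per candidate:", blocks_per_candidate)
    print("digest length:", len(double_sha256(bytes(HEADER_LEN))))
```

Under these assumptions each candidate costs three compressions, or two if the state after the first 64 header bytes is cached and reused across nonces, so the effective rate would be somewhere between a third and a half of the single-block figure.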
Since chunks of optimized assembly are often hard to work with and understand, I've also started a series of blog articles on the QPUs and optimization. I certainly made some false starts along the way, and I plan to highlight a few of those as well as walk through the process and the optimizations. The first three articles are up:
http://rpiplayground.wordpress.com/
We'll see if I can actually finish the series ...
In my opinion, the QPUs are the most interesting part of the Pi and have a lot of (advanced) educational potential for learning how a GPU architecture works at a low level. It would be great to get more people interested, and hopefully some more accessible documentation will help.
