I've been having some fun with the QPUs over the last few weeks, and I've just posted the results:
http://petewarden.com/2014/06/09/deep-l ... pberry-pi/
They've allowed me to cut my run time from around 20 seconds, using ATLAS for the numerics, to five seconds on a stock Pi and three seconds with GPU overclocking! I'm very grateful to eman for his SHA-256 example code (http://www.raspberrypi.org/forums/viewt ... 1&p=550759), and to Herman for his hard work pulling together documentation. I ended up having to extend eman's original assembler to fix a few bugs and handle some additional instructions (e.g. unpacking, horizontal shifts, multi-register immediates), so I've put the updated code up here, along with some helper m4 macros:
I learned a few things from the process. My use case depended heavily on the VPM, and unfortunately it appears that every DMA load and store operation has to be guarded with a mutex. After a lot of experimentation, I found the guard only needs to surround the actual kick-off instruction. Any thoughts on why this is needed, or ideas for a workaround, would be very welcome, since it's a big performance hit.
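
For anyone trying to picture the locking pattern, here is a minimal host-side sketch in Python. It is only an analogy, not QPU assembler: threads stand in for QPUs, a threading.Lock stands in for the hardware mutex, and kick_off_dma and wait_for_dma are hypothetical placeholders for the instruction that starts a transfer and the wait for it to finish. The point it illustrates is the narrow guard described above, where only the kick-off sits inside the mutex and the setup and wait stay outside it.

```python
import threading

# Stand-in for the single hardware mutex shared by all QPUs (assumption:
# one global lock is a fair model of it).
dma_mutex = threading.Lock()

# Records every kick-off so we can check that none were lost.
kickoff_log = []


def kick_off_dma(qpu_id, address):
    # Hypothetical placeholder for the instruction that starts a DMA
    # transfer. Here it just records the request.
    kickoff_log.append((qpu_id, address))


def wait_for_dma(qpu_id):
    # Hypothetical placeholder for waiting on DMA completion.
    # Deliberately called outside the mutex.
    pass


def qpu_worker(qpu_id, addresses):
    for address in addresses:
        # Any per-transfer setup would go here, unguarded.
        with dma_mutex:
            # Only the kick-off itself is inside the guard.
            kick_off_dma(qpu_id, address)
        # The wait is unguarded, so other workers can start transfers.
        wait_for_dma(qpu_id)


def run(num_qpus=4, transfers_per_qpu=8):
    kickoff_log.clear()
    threads = [
        threading.Thread(
            target=qpu_worker,
            args=(q, [q * 1000 + i for i in range(transfers_per_qpu)]),
        )
        for q in range(num_qpus)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(kickoff_log)


if __name__ == "__main__":
    log = run()
    print(len(log), "kick-offs recorded")
```

Keeping the critical section down to the single kick-off call is what limits the serialization between workers. Whether the real hardware behaves the same way is exactly the open question in the post.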