SHA-256 implementation on QPUs

by eman » Thu May 15, 2014 2:49 am
For anyone interested, I've written a parallel SHA-256 implementation for the QPU:

It does about 3.1 Mh/s (single-block hash) at about 93% efficiency (IPC), which makes it 14.6x faster than the CPU reference implementation (which probably says at least as much about how slow the CPU is). Considering that only about half of each QPU gets used (SHA-256 is heavy on add pipe operations and can't really make use of the multiply pipe), that's pretty respectable.
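
To illustrate why SHA-256 leans on the add pipe, here is a minimal Python sketch of the single-block compression. It is not the QPU code from this post, just the standard algorithm written out so the operation mix is visible: 32-bit adds, rotates, shifts and bitwise logic, with no multiplies anywhere. The helper names (`icbrt`, `first_primes`, `sha256_single_block`) are made up for this sketch, and the round constants are derived from prime roots rather than typed in.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF  # keep every value in 32 bits, like a QPU register lane


def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes, n = [], 2
    while len(primes) < count:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes


def icbrt(n):
    """Integer cube root (floor) by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


PRIMES = first_primes(64)
# Round constants: fractional bits of the cube roots of the first 64 primes.
K = [icbrt(p << 96) & MASK for p in PRIMES]
# Initial state: fractional bits of the square roots of the first 8 primes.
H0 = tuple(math.isqrt(p << 64) & MASK for p in PRIMES[:8])


def rotr(x, n):
    """32-bit rotate right."""
    return ((x >> n) | (x << (32 - n))) & MASK


def compress(state, block):
    """One SHA-256 compression: only adds, rotates, shifts and logic ops."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    for t in range(64):
        big_s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & MASK & g)
        t1 = (h + big_s1 + ch + K[t] + w[t]) & MASK
        big_s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + maj) & MASK
        h, g, f, e, d, c, b, a = g, f, e, (d + t1) & MASK, c, b, a, (t1 + t2) & MASK
    return tuple((s + v) & MASK for s, v in zip(state, (a, b, c, d, e, f, g, h)))


def sha256_single_block(msg):
    """Hash a message short enough (<= 55 bytes) to fit one padded block."""
    if len(msg) > 55:
        raise ValueError("message does not fit in a single block")
    block = msg + b"\x80" + b"\x00" * (55 - len(msg)) + struct.pack(">Q", 8 * len(msg))
    return struct.pack(">8I", *compress(H0, block))


if __name__ == "__main__":
    # Cross-check against the standard library.
    assert sha256_single_block(b"abc") == hashlib.sha256(b"abc").digest()
```

Every line in the two loops maps to an add, rotate, shift or logical operation, which is consistent with the multiply pipe sitting mostly idle.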

I have no interest in BTC mining, but if someone else wants to try to integrate it, feel free. (Obviously, the mining hash rate would be significantly lower: at most half of the figure above.)
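
A rough way to see where "at least half" comes from, assuming the usual Bitcoin scheme of hashing an 80-byte block header twice with SHA-256. The sketch below only counts compression calls; the function names are made up for illustration and nothing here comes from the QPU code.

```python
import hashlib

HEADER_LEN = 80  # size of a Bitcoin block header in bytes
DIGEST_LEN = 32  # size of a SHA-256 digest in bytes


def sha256_blocks(message_len):
    """Number of 64-byte compression calls SHA-256 needs for a message.

    Padding appends one 0x80 byte and an 8-byte length field, and the
    total is rounded up to a multiple of 64 bytes.
    """
    return (message_len + 1 + 8 + 63) // 64


def double_sha256(data):
    """SHA-256 applied twice, as used for Bitcoin proof of work."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def compressions_per_header():
    """Compression calls for one double hash of a block header."""
    return sha256_blocks(HEADER_LEN) + sha256_blocks(DIGEST_LEN)
```

An 80-byte header needs two blocks for the first hash and one for the second, so three compressions per nonce if done naively. Because the nonce sits in the last four bytes of the header, the first block's result can be reused across nonces, which leaves two compressions per attempt. If one single-block hash costs about one compression, that puts the ceiling near half of 3.1 Mh/s before any other overhead.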

Since chunks of optimized assembly are often hard to work with and understand, I've also started a series of blog articles on the QPUs and optimization. I certainly made some false starts along the way and I was planning to highlight a few of those as well as walk through the process and optimizations. The first three articles are up:

We'll see if I can actually finish the series ...

In my opinion, the QPUs are the most interesting part of the Pi. They have a lot of (advanced) educational potential for learning how a GPU architecture works at a low level. It would be great to get more people interested, and hopefully more accessible documentation will help.
Posts: 9
Joined: Wed Mar 19, 2014 10:23 pm
by teh_orph » Thu May 15, 2014 6:50 am
Impressive stuff. What technique did you use for debugging?
Posts: 346
Joined: Mon Jan 30, 2012 2:09 pm
Location: London
by eman » Thu May 15, 2014 6:37 pm
Good old trial and error ;-)

As the blog posts describe, I built a reference implementation first, so I could check the results at every stage as I built the QPU version out. I first tried using the VPM as a queue for the data vectors and found that performance dropped as soon as I unrolled the loops, which presumably meant I had blown out the instruction cache. That led to the code table solution: it's significantly faster to take the two branches with their 3 delay slots than to take all those icache misses. Other things, like prefetching the texture lookups to hide the latency, are fairly standard optimization techniques. After that, I looked for ways to overlap the movs with the add pipe operations.
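
The "check against a reference at every stage" approach can be sketched as a small harness. This is an assumption about how such a check might look, not the code used here: it takes whatever words were read back from the QPU output buffer (one row of eight 32-bit words per SIMD lane) and reports the first word that differs from a `hashlib` reference. The function names are hypothetical.

```python
import hashlib
import struct

LANES = 16  # one message per SIMD lane


def reference_digests(messages):
    """Reference SHA-256 digests as tuples of eight 32-bit words."""
    return [struct.unpack(">8I", hashlib.sha256(m).digest()) for m in messages]


def first_mismatch(expected, actual):
    """Return (lane, word, expected, actual) for the first differing word.

    Returns None when every lane matches the reference.
    """
    for lane, (exp_words, act_words) in enumerate(zip(expected, actual)):
        for word, (exp, act) in enumerate(zip(exp_words, act_words)):
            if exp != act:
                return lane, word, exp, act
    return None


if __name__ == "__main__":
    messages = [bytes([lane]) * 4 for lane in range(LANES)]
    expected = reference_digests(messages)
    # Stand-in for data read back from the QPU: flip one bit in lane 5.
    actual = [list(words) for words in expected]
    actual[5][2] ^= 1
    print(first_mismatch(expected, actual))
```

Knowing the lane and word of the first mismatch narrows down which stage went wrong, which helps when there is no debugger to step through.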

I did try to use the built-in performance counters (especially to verify the icache misses), but I couldn't get them to work. I might try again, because seeing where the stalls are can be very useful.
by petewarden » Sat May 17, 2014 12:17 am
Fantastic work, Eman. Thanks so much for putting this together; it's just what I was looking for!

I am hitting a snag when running through the tutorial, though. I've built the helloworld example following the instructions, but I only see a single value being altered:

pi@raspberrypi ~/projects/rpi-playground/QPU/helloworld $ sudo ./helloworld helloworld.bin 100
Loaded 80 bytes of code from helloworld.bin ...
QPU enabled.
Uniform value = 100
QPU 0, word 0: 0x00001298
QPU 0, word 1: 0x00000000
QPU 0, word 2: 0x00000000
QPU 0, word 3: 0x00000000
QPU 0, word 4: 0x00000000
QPU 0, word 5: 0x00000000
QPU 0, word 6: 0x00000000
QPU 0, word 7: 0x00000000
QPU 0, word 8: 0x00000000
QPU 0, word 9: 0x00000000
QPU 0, word 10: 0x00000000
QPU 0, word 11: 0x00000000
QPU 0, word 12: 0x00000000
QPU 0, word 13: 0x00000000
QPU 0, word 14: 0x00000000
QPU 0, word 15: 0x00000000
Cleaning up.

Any thoughts on what might be causing this? I'm on the latest rpi-update; is there anything else I should check? I'll debug this from my end too, but I'm still wrapping my head around the basics of the architecture, so any ideas you have will be very welcome!
Posts: 7
Joined: Wed May 14, 2014 3:50 pm
by petewarden » Sat May 17, 2014 12:27 am
Never mind, I tried rebooting again and now I see the expected results! Thanks so much for putting this together; it's a fantastic resource.
by eman » Sat May 17, 2014 4:04 pm
Thanks. Yeah, I have seen that a few times: the GPU gets into some state where it either gives garbage or hangs, but usually only with more complicated programs (for example, when playing with the synchronization operations, it's easy to leave it in a state that hangs the next program). Unfortunately, I couldn't find any way to "reset" the GPU without rebooting the whole thing.

I'll make a note of it in the tutorial.
by teh_orph » Sun May 18, 2014 8:08 pm
What's the caching like on the memory chosen to hold the program and working set?
By the way, I found a way of apparently resetting the GPU: power cycling it. It did the trick for me for the rendering front-end at least (and all semaphore state). Have a look at QpuEnable:
Passing true powers up the GPU and passing false powers it down, so just use this to cycle it.
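
The power-cycle idea is just "disable, then enable". A minimal sketch of that sequence follows, with the actual mailbox call left as a parameter: `qpu_enable` here is a hypothetical Python callable standing in for whatever binding wraps the real enable/disable request, and the optional settle delay is an assumption rather than something the post says is needed.

```python
import time


def power_cycle_qpu(qpu_enable, settle_seconds=0.0):
    """Power the QPUs down and back up to clear stuck state.

    `qpu_enable` is a caller-supplied function taking a bool:
    True powers the GPU up, False powers it down.
    """
    qpu_enable(False)
    if settle_seconds > 0:
        time.sleep(settle_seconds)  # optional pause; not known to be required
    qpu_enable(True)


if __name__ == "__main__":
    # Stand-in that records calls instead of touching hardware.
    calls = []
    power_cycle_qpu(calls.append)
    print(calls)
```

Running this before loading a program would give each run a clean start without rebooting the whole Pi, if the reset works as described above.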
by eman » Tue May 20, 2014 6:15 am
Thanks, I will take a look. The mailbox interface is pretty opaque, so I'm using code very similar to the GPU FFT sample in /opt/vc. (I actually link to that mailbox.c file in the Makefile. It's a bit of a hack, but I'm pretty sure it's installed on every system.) The memory is allocated cached (at least according to the comment in the gpu_fft.c file). At first glance, the QpuEnable in vc_support.c looks pretty similar to the qpu_enable from mailbox.c that is used in the tutorials and FFT code, but I might have to look into it again.
by teh_orph » Tue May 20, 2014 9:49 am
Yeah, it's the same thing. What I discovered is that when you disable it, the GPU completely disappears from the MMIO interface, and when you turn it back on all sins are forgiven :)
by bmarkus » Tue May 20, 2014 10:10 am
Nice job, thanks. Just one comment: the link to the Broadcom PDF in Part 1 is dead.
Posts: 32
Joined: Sat Sep 15, 2012 10:32 am
by RaTTuS » Tue May 20, 2014 10:23 am
very nice ;)
Posts: 9181
Joined: Tue Nov 29, 2011 11:12 am
Location: North West UK
by petewarden » Mon Jun 09, 2014 7:05 pm
With help from eman's examples, I was able to port my deep belief image recognition framework to the QPUs: ... pberry-pi/

It was a big performance boost, and I'm very grateful for the community's assistance in getting this running. I've also released a modified version of eman's assembler with some additional instructions, a few fixes, and some helper macros: ... elpers.asm

It's not well documented, but I wanted to get the changes back out to anyone who might find them useful.
by eman » Mon Jun 09, 2014 7:43 pm

I hope you don't mind if I integrate some of your changes back into my assembler? They look like useful improvements.
by petewarden » Mon Jun 09, 2014 8:19 pm
Thanks! I'd be happy to see those rolled in. Hopefully the commit messages give you an idea of what's in there.
by eupton » Sat Jun 21, 2014 4:46 pm
I mailed Pete some suggestions. Copied here in case it's of use to anyone else:

I had a quick scan through the QPU code, and a few things occurred to me:

- You seem to have an idiom that you always explicitly load from the VPM read FIFO rather than using it directly in the instruction that consumes the result. If you changed this then the block of code that starts here:

# Read 128 B values from VPM and multiply them with the corresponding A values

could be made rather tighter.

- *However*, using the VPM as your read path to memory is in general a losing proposition, as you hit contention with other QPUs. You've probably found that you're seeing way less than linear returns as you add more QPUs to the processing pool for this reason. I'd strongly encourage you to use the texture unit direct read mode instead - write a vector of addresses to t0s and signal the values into r4 using the ldtmu0 signal. Texture units are shared between pairs of QPUs, so you get less contention, and you can have up to four outstanding requests in flight, so you can achieve a lot of pipelining to hide latency.

Andrew Holme's newly-released FFT source is a great reference: we actually ran a Verilog sim of the chip running this code to squeeze the last ten percent of stalls out. ... o_fft/qasm
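
The "up to four outstanding requests" pattern above can be sketched as a software pipeline: issue reads ahead of where you consume them, and only collect a result once the queue is full. This is a plain Python model, not QPU assembly. `FakeTMU`, `pipelined_reads` and the scalar addresses are made up for illustration; on the real hardware each request would be a 16-wide vector of addresses written to t0s, and each receive would be an ldtmu0 signal landing in r4.

```python
from collections import deque

MAX_IN_FLIGHT = 4  # outstanding requests, per the suggestion above


def pipelined_reads(addresses, request, receive, depth=MAX_IN_FLIGHT):
    """Keep up to `depth` reads in flight and return results in order."""
    results = []
    in_flight = 0
    for addr in addresses:
        if in_flight == depth:
            results.append(receive())  # collect the oldest result first
            in_flight -= 1
        request(addr)  # issue the next read before it is needed
        in_flight += 1
    while in_flight:  # drain what is still outstanding
        results.append(receive())
        in_flight -= 1
    return results


class FakeTMU:
    """Toy in-order fetch queue that records its peak occupancy."""

    def __init__(self, memory):
        self.memory = memory
        self.fifo = deque()
        self.peak = 0

    def request(self, addr):
        self.fifo.append(self.memory[addr])
        self.peak = max(self.peak, len(self.fifo))

    def receive(self):
        return self.fifo.popleft()


if __name__ == "__main__":
    memory = list(range(100, 120))
    tmu = FakeTMU(memory)
    values = pipelined_reads(range(len(memory)), tmu.request, tmu.receive)
    print(values == memory, tmu.peak)
```

The point of the shape is that the work done between a request and its matching receive overlaps with the memory latency, instead of stalling on every load.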
Forum Moderator
Posts: 25
Joined: Sun Apr 15, 2012 7:28 pm
by marked » Sat Jun 21, 2014 7:20 pm
Is anyone working on an AES128/256 implementation?

I'm asking before I look into doing this myself next week.

Posts: 213
Joined: Fri Jul 29, 2011 4:25 pm
by tylerthetiger » Wed Sep 10, 2014 1:15 am
Is the final version complete? I am getting incorrect/random results when running the final version of the code.
Posts: 3
Joined: Wed Sep 10, 2014 1:12 am
by ab1jx » Fri May 19, 2017 4:58 pm
As far as I can tell, the hash rate for mining isn't that impressive in an age of terahash/sec ASICs. But this does make use of the GPU, which cpuminer doesn't, so theoretically it could run concurrently with cpuminer on a few cores, all on 10 watts or so. I think there's better money in other coins right now where the difficulty is lower, but you could have this mining Bitcoin to one pool and cpuminer mining Litecoin to a different pool at the same time. Scrypt on a QPU is worth thinking about. It might buy a loaf of bread a month. Run it on solar power and it's almost free.
Posts: 236
Joined: Thu Sep 26, 2013 1:54 pm
Location: Heath, MA USA