SHA-256 implementation on QPUs

Posted: Thu May 15, 2014 2:49 am
by eman
For anyone interested, I've written a parallel SHA-256 implementation for the QPU:

It does about 3.1 Mh/s (single-block hash) at about 93% efficiency (IPC), which makes it 14.6x faster than the CPU reference implementation (which probably says at least as much about how slow the CPU is ...). Considering that only about half the QPU gets used (the SHA-256 operations are heavy on add pipe operations and really can't make use of the multiply operations), that's pretty respectable.

I have no interest in BTC mining, but if someone else wants to try to integrate it, feel free. (Obviously, the mining hash rate would be significantly lower: at most half of the single-block figure.)
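
The halved hash rate follows from how Bitcoin hashes a block header: SHA-256 is applied twice to an 80-byte header, and every nonce costs more than one 64-byte block. A minimal sketch of that block counting in Python; `sha256_blocks` is a helper name made up for this illustration, and the 3.1 Mh/s figure is the single-block rate quoted above:

```python
import hashlib

def sha256_blocks(message_len: int) -> int:
    """Number of 64-byte blocks SHA-256 processes for a message of this length.

    Padding appends one 0x80 byte and an 8-byte length field, then rounds
    up to a multiple of 64 bytes.
    """
    return (message_len + 1 + 8 + 63) // 64

HEADER_LEN = 80   # Bitcoin block header size in bytes
DIGEST_LEN = 32   # SHA-256 output size in bytes

first_pass = sha256_blocks(HEADER_LEN)
second_pass = sha256_blocks(DIGEST_LEN)

# The nonce sits in the last 4 bytes of the header, so the first 64-byte
# block is the same for every nonce and its result can be reused.
per_nonce = (first_pass - 1) + second_pass

single_block_rate = 3.1e6   # hashes/s, figure quoted in this post
print(f"blocks per nonce: {per_nonce}")
print(f"upper bound: {single_block_rate / per_nonce / 1e6:.2f} Mh/s")

# Double SHA-256 of a dummy all-zero header, as a miner would compute it.
header = bytes(HEADER_LEN)
print(hashlib.sha256(hashlib.sha256(header).digest()).hexdigest())
```

This is only an upper bound; feeding headers in and checking results against the target would cost extra on top.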

Since chunks of optimized assembly are often hard to work with and understand, I've also started a series of blog articles on the QPUs and optimization. I certainly made some false starts along the way and I was planning to highlight a few of those as well as walk through the process and optimizations. The first three articles are up:

We'll see if I can actually finish the series ...

In my opinion, the QPUs are the most interesting part of the Pi and have a lot of (advanced) educational potential for learning how a GPU architecture works at a low level. It would be great to get more people interested, and hopefully some more accessible documentation will help.

Re: SHA-256 implementation on QPUs

Posted: Thu May 15, 2014 6:50 am
by teh_orph
Impressive stuff. What technique did you use for debugging?

Re: SHA-256 implementation on QPUs

Posted: Thu May 15, 2014 6:37 pm
by eman
Good old trial and error ;-)

As the blog posts describe, I built a reference implementation first so that I could check the results at every stage as I built out the QPU version. I first tried using the VPM as a queue for the data vectors and found that performance dropped as soon as I unrolled the loops, presumably because I blew out the instruction cache. That led to the code table solution: it's significantly faster to take the two branches with their 3 delay slots than to take all those icache misses. Other things, like prefetching the texture lookups to hide the latency, are fairly standard optimization techniques. After that it was a matter of looking for ways to overlap the movs with the add pipe operations.
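
For anyone wanting to try the same check-at-every-stage approach, here is a rough sketch of a traceable single-block SHA-256 in Python. It is not the reference implementation from the blog posts: the names (`compress`, `sha256_single_block`, the `trace` list) are made up for illustration, and the constants are derived from the primes as the SHA-256 spec defines them rather than typed in.

```python
import hashlib
import math
import struct

M = 0xFFFFFFFF

def _primes(count):
    found, n = [], 2
    while len(found) < count:
        if all(n % p for p in found):
            found.append(n)
        n += 1
    return found

def _icbrt(n):
    """Integer cube root (floor) by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

# First 32 fractional bits of the cube roots / square roots of the primes.
K = [_icbrt(p << 96) & M for p in _primes(64)]
H0 = [math.isqrt(p << 64) & M for p in _primes(8)]

def _rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & M

def compress(state, block, trace=None):
    """One SHA-256 compression; appends (a..h) after each round to trace."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 64):
        s0 = _rotr(w[t - 15], 7) ^ _rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = _rotr(w[t - 2], 17) ^ _rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & M)
    a, b, c, d, e, f, g, h = state
    for t in range(64):
        big_s1 = _rotr(e, 6) ^ _rotr(e, 11) ^ _rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + big_s1 + ch + K[t] + w[t]) & M
        big_s0 = _rotr(a, 2) ^ _rotr(a, 13) ^ _rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + maj) & M
        a, b, c, d, e, f, g, h = (t1 + t2) & M, a, b, c, (d + t1) & M, e, f, g
        if trace is not None:
            trace.append((a, b, c, d, e, f, g, h))
    return [(s + v) & M for s, v in zip(state, (a, b, c, d, e, f, g, h))]

def sha256_single_block(msg, trace=None):
    """Hash a message short enough (<= 55 bytes) to fit one padded block."""
    assert len(msg) <= 55
    block = msg + b"\x80" + bytes(55 - len(msg)) + struct.pack(">Q", 8 * len(msg))
    return struct.pack(">8I", *compress(H0, block, trace))

rounds = []
digest = sha256_single_block(b"abc", rounds)
print(digest.hex(), digest == hashlib.sha256(b"abc").digest(), len(rounds))
```

The per-round tuples in `rounds` are the kind of thing that can be diffed against values written back from the QPU while the kernel is still only partly built.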

I did try to use the built-in performance counters (especially to verify the icache misses) but I didn't get them to work. I might try that again because seeing where those stalls are can be very useful.

Re: SHA-256 implementation on QPUs

Posted: Sat May 17, 2014 12:17 am
by petewarden
Fantastic work Eman, thanks so much for putting this together, it's just what I was looking for!

I am hitting a snag when running through the tutorial though. I've built the helloworld example following the instructions, and I only see a single value being altered:

[email protected] ~/projects/rpi-playground/QPU/helloworld $ sudo ./helloworld helloworld.bin 100
Loaded 80 bytes of code from helloworld.bin ...
QPU enabled.
Uniform value = 100
QPU 0, word 0: 0x00001298
QPU 0, word 1: 0x00000000
QPU 0, word 2: 0x00000000
QPU 0, word 3: 0x00000000
QPU 0, word 4: 0x00000000
QPU 0, word 5: 0x00000000
QPU 0, word 6: 0x00000000
QPU 0, word 7: 0x00000000
QPU 0, word 8: 0x00000000
QPU 0, word 9: 0x00000000
QPU 0, word 10: 0x00000000
QPU 0, word 11: 0x00000000
QPU 0, word 12: 0x00000000
QPU 0, word 13: 0x00000000
QPU 0, word 14: 0x00000000
QPU 0, word 15: 0x00000000
Cleaning up.
Any thoughts on what might be causing this? I'm on the latest rpi-update, anything else I should check? I'll debug into this from my end too, but I'm still wrapping my head around the basics of the architecture so any ideas you have will be very welcome!

Re: SHA-256 implementation on QPUs

Posted: Sat May 17, 2014 12:27 am
by petewarden
Never mind, I tried rebooting again and now I see the expected results! Thanks so much for putting this together, it's a fantastic resource.

Re: SHA-256 implementation on QPUs

Posted: Sat May 17, 2014 4:04 pm
by eman
Thanks. Yeah, I have seen that a few times where the GPU will get in some state where it either gives garbage or hangs but usually only with more complicated programs (for example, when playing with the synchronization operations, it's easier to make it hang for the next program). I couldn't find any way to "reset" the GPU without rebooting the whole thing, unfortunately.

I'll make a note of it in the tutorial.

Re: SHA-256 implementation on QPUs

Posted: Sun May 18, 2014 8:08 pm
by teh_orph
What's the caching like on the memory chosen to hold the program and working set?
Btw I found a way of apparently resetting the GPU, by power cycling it. It did the trick for me for the rendering front-end at least (and all semaphore state). Have a look at QpuEnable: ... _support.c
Passing true powers up the GPU and false powers it down. Just use this to cycle it.

Re: SHA-256 implementation on QPUs

Posted: Tue May 20, 2014 6:15 am
by eman
Thanks. I will take a look. The mailbox interface is pretty opaque so I'm using very similar code to the GPU FFT sample in /opt/vc. (I actually link to that mailbox.c file in the Makefile. It's a bit of a hack but I'm pretty sure that's installed on every system.) The memory is allocated cached (at least according to the comment in the gpu_fft.c file). At first glance, it looks like the QpuEnable in vc_support.c is pretty similar to the qpu_enable from mailbox.c that is used in the tutorials and FFT code, but I might have to look into it again.

Re: SHA-256 implementation on QPUs

Posted: Tue May 20, 2014 9:49 am
by teh_orph
Yeah, it's the same thing. What I discovered is that when you disable it, the GPU completely disappears from the MMIO interface, and when you turn it back on all sins are forgiven :)

Re: SHA-256 implementation on QPUs

Posted: Tue May 20, 2014 10:10 am
by bmarkus
Nice job, thanks. Just one comment: the link to the Broadcom PDF in Part 1 is dead.

Re: SHA-256 implementation on QPUs

Posted: Tue May 20, 2014 10:23 am
by RaTTuS
very nice ;)

Re: SHA-256 implementation on QPUs

Posted: Mon Jun 09, 2014 7:05 pm
by petewarden
With help from eman's examples, I was able to port my deep belief image recognition framework to the QPUs: ... pberry-pi/

It was a big performance boost, and I'm very grateful for the community's assistance in getting this running. I've also released a modified version of eman's assembler with some additional instructions, a few fixes, and some helper macros: ... elpers.asm

It's not well documented, but I wanted to get the changes back out to anyone who might find them useful.

Re: SHA-256 implementation on QPUs

Posted: Mon Jun 09, 2014 7:43 pm
by eman

I hope you don't mind if I integrate some of your changes back into my assembler? They look like useful improvements.

Re: SHA-256 implementation on QPUs

Posted: Mon Jun 09, 2014 8:19 pm
by petewarden
Thanks! I'd be happy to see those rolled in, hopefully the commit messages give you an idea of what's in there.

Re: SHA-256 implementation on QPUs

Posted: Sat Jun 21, 2014 4:46 pm
by eupton
I mailed Pete some suggestions. Copied here in case it's of use to anyone else:

I had a quick scan through the QPU code, and a few things occurred to me:

- You seem to have an idiom that you always explicitly load from the VPM read FIFO rather than using it directly in the instruction that consumes the result. If you changed this then the block of code that starts here:

# Read 128 B values from VPM and multiply them with the corresponding A values

could be made rather tighter.

- *However*, using the VPM as your read path to memory is in general a losing proposition, as you hit contention with other QPUs. You've probably found that you're seeing way less than linear returns as you add more QPUs to the processing pool for this reason. I'd strongly encourage you to use the texture unit direct read mode instead - write a vector of addresses to t0s and signal the values into r4 using the ldtmu0 signal. Texture units are shared between pairs of QPUs, so you get less contention, and you can have up to four outstanding requests in flight, so you can achieve a lot of pipelining to hide latency.
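
To illustrate why keeping several requests in flight hides latency, here is a toy timing model in Python. It is only a sketch: the 20-cycle latency is a made-up number rather than a measured TMU figure, `cycles_for_reads` is a hypothetical helper, and contention between QPUs is ignored entirely.

```python
def cycles_for_reads(n_reads: int, latency: int, depth: int) -> int:
    """Cycle at which the last of n_reads results arrives.

    Toy model: one request can be issued per cycle, each result arrives
    `latency` cycles after its request, and at most `depth` requests may
    be outstanding at once.
    """
    issue = []  # cycle at which each request is issued
    for i in range(n_reads):
        t = issue[-1] + 1 if issue else 0
        if i >= depth:
            # Wait until request i - depth has returned and freed a slot.
            t = max(t, issue[i - depth] + latency)
        issue.append(t)
    return issue[-1] + latency if issue else 0

LATENCY = 20  # assumed value, for illustration only
for depth in (1, 2, 4):
    print(depth, cycles_for_reads(8, LATENCY, depth))
```

With one request at a time every read pays the full latency; with four in flight most of that wait overlaps with the other requests.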

Andrew Holme's newly-released FFT source is a great reference: we actually ran a Verilog sim of the chip running this code to squeeze the last ten percent of stalls out. ... o_fft/qasm

Re: SHA-256 implementation on QPUs

Posted: Sat Jun 21, 2014 7:20 pm
by marked
Is anyone working on an AES128/256 implementation?

This is before I look into doing this myself next week.


Re: SHA-256 implementation on QPUs

Posted: Wed Sep 10, 2014 1:15 am
by tylerthetiger
Is the final version complete? I am getting incorrect/random results when running the final version of the code.

Re: SHA-256 implementation on QPUs

Posted: Fri May 19, 2017 4:58 pm
by ab1jx
As far as I can tell, the hash rate for mining isn't that impressive in an age of terahash/sec ASICs. But this does make use of the GPU, which cpuminer doesn't. Theoretically this could run concurrently with cpuminer on a few cores, and all on 10 watts or so. I think there's better money in other coins right now where the difficulty is lower, but you could have this mining Bitcoin to one pool and cpuminer mining Litecoin to a different pool at the same time. Scrypt on a QPU is worth thinking about. Might buy a loaf of bread a month. Run it on solar power and it's almost free.

One thing worth keeping an eye on here is the price of Bitcoin. I'm pretty sure I remember it around $400 a couple of years ago when I started using it as a currency to send money overseas. Right now, in February 2018, 1 bitcoin is worth $10,000. That changes all the equations. Mining methods that aren't cost effective at a low bitcoin price suddenly become worth using: CPU/GPU/QPU maybe, old ASIC rigs certainly. Very few people actually make a whole bitcoin; you join a mining pool where you get paid (usually in bitcoin) for the work you do. You aren't going to make a lot of money in a hurry mining, but if you can put something together and run it for months it should at least pay for the electricity used. That will depend on the bitcoin price, which fluctuates.

But no, my Gecko Science 2Pac ASIC USB plug miner runs at around 16 Gh/s without pushing it hard, so this isn't going to get anywhere.
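
Putting rough numbers on that gap, using only figures from this thread (3.1 Mh/s single-block on the QPUs, halved as an optimistic mining estimate, against about 16 Gh/s for the USB miner):

```python
qpu_single_block = 3.1e6            # hashes/s, from the first post
qpu_mining = qpu_single_block / 2   # optimistic: first post says "at least half" lower
asic = 16e9                         # hashes/s, the USB miner mentioned above

ratio = asic / qpu_mining
print(f"USB ASIC is roughly {ratio:,.0f}x faster than the QPUs")
```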