petewarden
Posts: 9
Joined: Wed May 14, 2014 3:50 pm

GEMM example on the QPU

Fri Jun 13, 2014 12:50 am

I had some requests for the source code for the QPU side of the deep learning library, so I spent some time pulling out and tidying up the GPU-accelerated part as a standalone example. You'll find the full project here:
https://github.com/jetpacapp/pi-gemm

It implements the standard GEMM function for matrix multiplication on single-precision floats. On my overclocked Pi, the included example takes 500 ms, versus 8,000 ms with the official Atlas implementation. It's BSD-licensed, so you should be able to use it in your own projects.
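
For anyone who hasn't met GEMM before, the operation is C = alpha*A*B + beta*C. As a reference point, here's a naive CPU version of the calculation (illustration only - row-major, no transposes, and far fewer arguments than the real API):

Code:

// Reference GEMM: C = alpha * A * B + beta * C, where A is m x k,
// B is k x n, and C is m x n, all stored row-major.
void referenceSgemm(int m, int n, int k, float alpha, const float* a,
                    const float* b, float beta, float* c) {
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      float total = 0.0f;
      for (int l = 0; l < k; ++l) {
        total += a[i * k + l] * b[l * n + j];
      }
      c[i * n + j] = (alpha * total) + (beta * c[i * n + j]);
    }
  }
}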

I'll also be curious to see what optimizations folks who know the processor better can implement. It definitely sounds like going with the TMU rather than VPM might give better performance, and I'd love to get rid of those mutexes. I welcome patches!

mimi123
Posts: 583
Joined: Thu Aug 22, 2013 3:32 pm

Re: GEMM example on the QPU

Fri Jun 13, 2014 11:41 am

To use the TMU, you issue the ldtmu0 signal and then read the fetched data from r4.

You also need to set t0s to a valid address first.

teh_orph
Posts: 346
Joined: Mon Jan 30, 2012 2:09 pm
Location: London
Contact: Website

Re: GEMM example on the QPU

Sat Jun 14, 2014 1:22 pm

Yes, the mutexes are required because it appears you use the same VPM rows on all QPUs.

petewarden
Posts: 9
Joined: Wed May 14, 2014 3:50 pm

Re: GEMM example on the QPU

Fri Jun 20, 2014 3:24 pm

teh_orph wrote: Yes, the mutexes are required because it appears you use the same VPM rows on all QPUs.
No, the VPM rows are chosen by the QPU index (index * 8). I'd need mutexes around *all* the calculations if I really were overwriting the same rows, rather than just around the fetch kickoff. It's worth taking a look at the SHA and FFT examples, which use the same pattern.

eman
Posts: 9
Joined: Wed Mar 19, 2014 10:23 pm

Re: GEMM example on the QPU

Fri Jun 20, 2014 5:50 pm

I have to admit I haven't had a chance to look at this code yet (though I intend to), but to comment on the locking in the SHA-256 code: I believe the final version has all 12 QPU threads competing for the same 16 rows of VPM space (which is all it uses), so it needs the mutex. I was worried about contention and tried using 4 locks, with the 12 QPUs mapped onto one of 4 slots. It turned out that didn't make much difference. There is enough computation to cover the transfer cost (which is why SHA-256 is very nice to GPUs). The lock is only held long enough to read or write the 16 rows into registers, and then everything is done from registers (which is obviously key to performance). The lock also covers the DMA operation.

I'd expect a GEMM implementation to work the same way: lock, read a block (16x16 would probably make sense) into registers, unlock, unroll the loops, use texture fetches (with prefetching) to get the "broadcast" effect you need (or maybe use the undocumented r5(?) write that Simon discovered), and write it back (under lock, of course). If there's enough computation to dwarf the I/O, I'd expect the locking cost to be minor. If not, I'd try bigger blocks: GEMM does O(n^3) computation on O(n^2) data, so eventually the computation will dominate.
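
To make that block-size argument concrete, here's the shape of the thing as a plain C++ sketch (CPU-side illustration only, nothing QPU-specific, and all the names are mine): a b x b block pass moves O(b^2) floats but performs on the order of b^3 multiply-adds, so bigger blocks buy a better compute-to-transfer ratio.

Code:

#include <algorithm>
#include <cstddef>

// Blocked multiply: C (m x n) += A (m x k) * B (k x n), all row-major,
// processed in b x b blocks. On the QPU, the block loads and the final
// store are where you'd take the lock; the inner loops run from registers.
void blockedGemm(const float* A, const float* B, float* C,
                 size_t m, size_t n, size_t k, size_t b) {
  for (size_t i0 = 0; i0 < m; i0 += b) {
    for (size_t j0 = 0; j0 < n; j0 += b) {
      for (size_t l0 = 0; l0 < k; l0 += b) {
        // On the QPU: lock, DMA the A/B (and C) blocks in, unlock.
        const size_t iEnd = std::min(i0 + b, m);
        const size_t jEnd = std::min(j0 + b, n);
        const size_t lEnd = std::min(l0 + b, k);
        for (size_t i = i0; i < iEnd; ++i)
          for (size_t l = l0; l < lEnd; ++l)
            for (size_t j = j0; j < jEnd; ++j)
              C[i * n + j] += A[i * k + l] * B[l * n + j];
        // ...then lock again to DMA the updated C block back out.
      }
    }
  }
}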

I haven't looked at the code yet so if this is all obvious, I apologize.

teh_orph
Posts: 346
Joined: Mon Jan 30, 2012 2:09 pm
Location: London
Contact: Website

Re: GEMM example on the QPU

Fri Jun 20, 2014 9:44 pm

petewarden wrote:
teh_orph wrote: Yes, the mutexes are required because it appears you use the same VPM rows on all QPUs.
No, the VPM rows are chosen by the QPU index (index * 8). I'd need mutexes around *all* the calculations if I really were overwriting the same rows, rather than just around the fetch kickoff. It's worth taking a look at the SHA and FFT examples, which use the same pattern.
Yeah sorry, I was reading the code on my phone!

OK, looking further in - perhaps it's just a timing thing? For example, on line 181 of https://github.com/jetpacapp/pi-gemm/bl ... _float.asm you write to the read setup register to initiate the copy from VCD to the QPU. The VideoCore manual (page 56) says:

"After the read setup register is written, read data is available to read after a minimum latency of three QPU instructions. Reads made after this time will stall the QPU until data arrives, but reads made too early or extra reads made beyond the number setup will return immediately with undefined data."

...yet it appears the data is read immediately (in the next cycle), on line 184?

If the mutex really were protecting reads and writes to the same lines of VPM, wouldn't the acquire/release need to be around the whole fetch, not just the DMA in? So yeah, perhaps it is another issue.

We need a debugger...

rautamieli
Posts: 1
Joined: Mon Jul 14, 2014 6:28 am
Location: Finland

Re: GEMM example on the QPU

Tue Jul 15, 2014 10:03 am

Hello,
I was trying to implement multiplication of a 2D array by a single constant using the Pi's GPU. I read the documentation that Broadcom released, and I think I more or less understand the concept of using mailbox.c to access the GPU.

The thing is, I have very little experience with writing assembler code, so all my attempts to write simple multiplication code failed. My guide was http://rpiplayground.wordpress.com/2014 ... ofit-pt-1/

Later, I discovered this GEMM example and have got it running on my Pi. However, I would like to give it my own input: square matrices A and B, where A is a regular matrix I want to multiply by a constant c, and B is a diagonal matrix holding that constant, since A*c == A*B.
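
In plain C++ (just pseudo-code for the maths, not the QPU API), what I mean is:

Code:

#include <vector>

int main() {
  const int n = 4;
  const float c = 2.5f;
  // A is the matrix I want scaled; B is a diagonal matrix holding c,
  // so that (A * B)[i][j] == A[i][j] * c for every element.
  std::vector<float> A(n * n, 1.0f);
  std::vector<float> B(n * n, 0.0f);
  for (int i = 0; i < n; ++i) {
    B[i * n + i] = c;
  }
  return 0;
}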

However, upon looking into the code, I couldn't work out where to change what the GPU multiplies.

Code:

const int inputChannels = 363;
const int inputHeight = 3025;
const int outputChannels = 96;
Buffer* input = new Buffer(Dimensions(inputHeight, inputChannels));
What do these constants represent? I can't quite see how the code defines which matrices it is going to multiply... I think it should be in the arguments of

Code:

void qpu_cblas_sgemm(
  int order,
  int transposeA,
  int transposeB,
  int m,
  int n,
  int k,
  float alpha,
  uint32_t a,
  float aMin,
  float aMax,
  int aBitsPerElement,
  int lda,
  uint32_t b,
  int ldb,
  float beta,
  uint32_t c,
  int ldc)
but again I can't figure out how the matrices are represented by these arguments.
My last attempt to understand the code was compiling it with the -g flag so I could run sudo gdb --tui ./gemm. Even that didn't help me much, since the line cursor sometimes jumps from higher line numbers back to lower ones. I thought I might just be misreading gdb's output, so I tried ddd, but the result is the same.

Long story short:
Could you please help me supply my own inputs for matrix multiplication on the GPU, or at least help me understand the code?
Thank you in advance.

tylerthetiger
Posts: 3
Joined: Wed Sep 10, 2014 1:12 am

Re: GEMM example on the QPU

Thu Oct 09, 2014 1:42 am

I believe the inputs are random; they are set on line 264 of main.cpp (input->populateWithRandomValues(0, 1);).
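
As for the arguments: the signature looks to me like the standard cblas_sgemm convention, i.e. C = alpha * op(A) * op(B) + beta * C, with op(A) being m x k, op(B) being k x n, lda/ldb/ldc the leading dimensions, and the uint32_t a/b/c presumably GPU addresses of the matrix data (the extra aMin/aMax/aBitsPerElement arguments suggest A can also be stored in a reduced-precision format). I haven't verified this against the QPU code, but on that reading the constants in main.cpp would map like this:

Code:

#include <cstdio>

// My (unverified) reading of how main.cpp's constants map onto the
// sgemm arguments: m = inputHeight, k = inputChannels, n = outputChannels.
int main() {
  const int inputChannels = 363; // k: the shared inner dimension
  const int inputHeight = 3025;  // m: rows of A and of the result
  const int outputChannels = 96; // n: columns of B and of the result
  printf("A: %d x %d, B: %d x %d, C: %d x %d\n",
         inputHeight, inputChannels,    // A
         inputChannels, outputChannels, // B
         inputHeight, outputChannels);  // C
  return 0;
}

That would make the result 3025 x 96, which matches the (3025, 96) output buffers I see when I run it.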

I'm having trouble understanding how the values are copied back to the host and how they are displayed; I only see the debug register being displayed. I'm also wondering what's wrong with my setup -- on my Pi the QPU version takes 10,000 ms and the Atlas version 4,000 ms.

tylerthetiger
Posts: 3
Joined: Wed Sep 10, 2014 1:12 am

Re: GEMM example on the QPU

Sun Oct 12, 2014 3:39 pm

When running the gemm example, what is the expected output? For the last two lines, I'm getting

Code:

Buffers contained 98.628098% different values (286416), mean delta = 0.000055 - Buffer outputCPU - (3025, 96) vs Buffer outputGPU - (3025, 96)
Buffers contained 98.628098% different values (286416), mean delta = 0.000055 - Buffer outputAtlas - (3025, 96) vs Buffer outputGPU - (3025, 96)
Is that what it should be?
