doe300
Posts: 45
Joined: Thu Dec 29, 2016 1:41 pm

Questions about VideoCore IV GPU

Thu Dec 29, 2016 2:34 pm

I have a few questions about the GPU of the Raspberry Pi. I hope somebody here can clarify some things.

1) In the official Broadcom VideoCore IV Architecture Reference, the illustration on page 13 implies 16 QPUs (slices 0 to 3 with 4 QPUs each), but every resource on the Raspberry Pi only mentions 12 QPUs. So does the RPi have 12 or 16 QPUs? And is this number the same for every model (A, B, ..., 3)?

2) The documentation says: "The QPU is a 16-way SIMD processor". So by my understanding, each instruction on a single QPU is applied to 16 32-bit values (registers, immediates). This would imply that each register address also holds 16 32-bit values, correct?
2.1) This would also mean the phrases "all 16 elements of the SIMD array" (for branch conditions, page 34) and "across the entire SIMD array" (for load immediates, page 33) refer to all 16 values within one QPU, correct?
2.2) When I load a single 32-bit value via VPM, I implicitly load 16 32-bit values from RAM, the other ones being some random memory content, unless I set them explicitly from the host. Correct?
2.3) When I read a 32-bit uniform value, is it placed across all 16 SIMD elements? Or just the first? Or do I implicitly read 16 consecutive uniforms (as with VPM)?
2.4) Is there a way to extract/insert/replace a single value in the SIMD array? For example, I want to replace just element 0 with another value, but keep the others unchanged.

3) The QPUs are independent processors (sharing all external resources like VPM, SFU, instruction cache, ...). Does this mean that if I somehow managed all access to external resources, I could run different code on every QPU?
3.1) Just to clarify and since I searched hard and long to find an explanation: Does the 24GFLOPs computational power claimed by Broadcom follow from a clock rate of 250MHz * 4-way data-parallelism * 2 asymmetric ALUs * 12 QPUs = 24 GFLOPs?

Thanks for the help!

Official Documentation: https://web.archive.org/web/20160803202 ... G100-R.pdf

Gavinmc42
Posts: 3631
Joined: Wed Aug 28, 2013 3:31 am

Re: Questions about VideoCore IV GPU

Sat Dec 31, 2016 11:22 am

I noticed this too, 12 or 16, which is it?
I suspect the VC4 ThreadX RTOS runs on one of the QPUs, leaving the other 3 QPUs free for video stuff.
I don't think anyone has confirmed this?

So you cannot actually use all 16 without doing baremetal stuff.
Even then, the second/third stage bootloaders have VC4 code in them.

Done a bit of hexediting on the start/.elf files to see what each one does.
Third stage start_cd.elf has the least VC4 stuff in it, enough to do HDMI, jpeg....
start.elf adds h.264/mjpeg video codecs?
start_x.elf adds camera and lens correction magic stuff.
Again this is just my speculation from looking inside the files.
I'm dancing on Rainbows.
Raspberries are not Apples or Oranges

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 7147
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Questions about VideoCore IV GPU

Sat Dec 31, 2016 5:34 pm

ThreadX runs on the vpu (2 cores), not the qpu.
The qpus normally only run 3d stuff, and the camera awb algorithm as the vpu can't do simd floating point.

I can't remember how many qpu slices were instantiated on each platform. It was designed so that it could be varied between chips depending on the performance required.
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

Gavinmc42
Posts: 3631
Joined: Wed Aug 28, 2013 3:31 am

Re: Questions about VideoCore IV GPU

Sun Jan 01, 2017 12:08 am

"Official" confirmation ThreadX runs on VPU :lol:

Broadcom link is down so I will have to find the specs on an old hard drive.

It showed a diagram with QPU's labeled 0, 1, 2, 3 on page?
Which is 4 QPU with a quad in each =16.

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 23382
Joined: Sat Jul 30, 2011 7:41 pm

Re: Questions about VideoCore IV GPU

Sun Jan 01, 2017 2:57 pm

I thought ThreadX on the VPUs was common knowledge... As for QPUs, 12 IIRC. As 6by9 says, it's configurable depending on the requirements of the chip buyer.
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed. Here's an example...
"My grief counseller just died, luckily, he was so good, I didn't care."

Akane
Posts: 44
Joined: Tue May 27, 2014 1:20 pm
Location: Tsukuba, Japan

Re: Questions about VideoCore IV GPU

Sun Jan 08, 2017 8:19 am

1) All Raspberry Pis have 12 QPUs (3 slices x 4 QPUs) according to V3D_IDENT1.

2) Yes.
2.1) Yes.
2.2) You cannot read directly from RAM through VPM. To read the content of RAM, you must issue DMA load explicitly from RAM to VPM.
From my experiences, the default contents of VPM are randomized.
2.3) One 32-bit uniform value is replicated across all 16 SIMD elements.
2.4) You can do that like this:

Code:

load-imm-per-elmt-unsigned cond_add=always sf=1 waddr_add=nop immediate=0x7fff7fff
load-imm-32 cond_add=zs waddr_add=ra0 imm=1234

3) Yes. Since there are 12 QPUs, you can run 12 streams of instructions at the same time!
However, be careful that some resources, such as VPM, cannot be used fully because they are shared with all QPUs.
3.1) You are correct.
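
The arithmetic from 3.1 can be written out as a small check. This is only a sketch in Python using the figures quoted in this thread (250 MHz, 4-way parallelism, 2 ALUs, 12 QPUs); the function name and the 16-QPU comparison are illustrative additions, not anything from the Broadcom documentation.

```python
# Peak-throughput arithmetic from question 3.1.
# All inputs are the figures quoted in the thread.
CLOCK_HZ = 250_000_000   # 250 MHz clock
LANES_PER_CYCLE = 4      # 4-way data parallelism per QPU
ALUS_PER_QPU = 2         # the two asymmetric ALUs (add and mul)
QPU_COUNT = 12           # 3 slices x 4 QPUs


def peak_gflops(clock_hz=CLOCK_HZ, lanes=LANES_PER_CYCLE,
                alus=ALUS_PER_QPU, qpus=QPU_COUNT):
    """Theoretical peak, assuming one operation per lane per ALU per cycle."""
    return clock_hz * lanes * alus * qpus / 1e9


print(peak_gflops())          # the 12-QPU configuration
print(peak_gflops(qpus=16))   # hypothetical 16-QPU configuration, for comparison
```

With 12 QPUs this reproduces the 24 GFLOPs figure; a 16-QPU part would come out at 32 under the same assumptions.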

doe300
Posts: 45
Joined: Thu Dec 29, 2016 1:41 pm

Re: Questions about VideoCore IV GPU

Mon Jan 09, 2017 8:20 pm

Thanks for all the answers.
Akane wrote:2.2) You cannot read directly from RAM through VPM. To read the content of RAM, you must issue DMA load explicitly from RAM to VPM.
By "issue DMA load explicitly", do you mean setting up DMA and writing an address to the VPM_LD_ADDR register?
Akane wrote:From my experiences, the default contents of VPM are randomized.
Yeah, it's the typical "keep in memory whatever bits were put there before". So I guess that for my branch conditions to work (when working with fewer than 16 elements of "real" data), I'd have to set them to some fixed value.

For 2.4, I went with something like

Code:

xor.set_flags NOP, ELEMENT_NUMBER, selected_element
load.if_zero the-value-I-want-to-set

But this does the same in the same amount of instructions.
Akane wrote:However, be careful that some resources, such as VPM, cannot be used fully because they are shared with all QPUs.
To my luck, there is a global mutex which just begs to be used to synchronize access to the VPM. I just fear the performance of code with lots of memory-accesses will suffer heavily.
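
The two-instruction trick for 2.4 (set the zero flag on exactly one element, then do a conditional load) can be modelled on the host to check the logic. This is a rough Python simulation of the xor/conditional-load variant, not QPU code; `replace_element` and `NUM_ELEMENTS` are made-up names for illustration.

```python
NUM_ELEMENTS = 16  # width of the SIMD array on one QPU


def replace_element(vector, selected_element, new_value):
    """Model of: xor.set_flags NOP, ELEMENT_NUMBER, selected_element
                 load.if_zero new_value"""
    # Step 1: the xor result is zero only where the element number
    # equals the selected element, so only that lane gets the zero flag.
    zero_flags = [(i ^ selected_element) == 0 for i in range(NUM_ELEMENTS)]
    # Step 2: the conditional load writes only the lanes whose zero flag is set.
    return [new_value if flag else old for old, flag in zip(vector, zero_flags)]


values = list(range(100, 116))
print(replace_element(values, 0, 1234))
```

Only element 0 changes here; the other 15 lanes keep their old contents.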

Akane
Posts: 44
Joined: Tue May 27, 2014 1:20 pm
Location: Tsukuba, Japan

Re: Questions about VideoCore IV GPU

Tue Jan 10, 2017 4:21 am

doe300 wrote:
2.2) You cannot read directly from RAM through VPM. To read the content of RAM, you must issue DMA load explicitly from RAM to VPM.
By "issue DMA load explicitly", do you mean setting up DMA and writing an address to the VPM_LD_ADDR register?
Yes.
doe300 wrote:For 2.4, I went with something like

Code:

xor.set_flags NOP, ELEMENT_NUMBER, selected_element
load.if_zero the-value-I-want-to-set

But this does the same in the same amount of instructions.
LGTM.
doe300 wrote:
However, be careful that some resources, such as VPM, cannot be used fully because they are shared with all QPUs.
To my luck, there is a global mutex which just begs to be used to synchronize access to the VPM. I just fear the performance of code with lots of memory-accesses will suffer heavily.
The mutexes (muticies?) do not generate stalls (c.f. QI10), so it is very efficient.
As for nrows x 16 x 32bit VPM DMA load, stalls are generated in such a way that there will be 5*nrows + 6.5 non-DMA-load instructions between vpm_ld_addr and vpm_ld_wait (QV52).
As for units x 16 x 32bit VPM DMA store, stalls are generated in such a way that there will be 3.5*units + 10 non-DMA-store instructions between vpm_st_addr and vpm_st_wait (QV53).
As for VPM read, stalls are generated in such a way that there will be 5 instructions between vpmvcd_rd_setup and the first vpm_read (QV56).
As for VPM write, no stalls are generated (QV57).
However, doing VPM reads and writes at the same time generates extra stalls (QV59).
So in this case it is the VPM and DMA operations that cost performance, rather than the mutex access.

QI10: http://imrc.noip.me/blog/vc4/QI10/
QV52: http://imrc.noip.me/blog/vc4/QV52/
QV53: http://imrc.noip.me/blog/vc4/QV53/
QV56: http://imrc.noip.me/blog/vc4/QV56/
QV57: http://imrc.noip.me/blog/vc4/QV57/
QV59: http://imrc.noip.me/blog/vc4/QV59/
These are all in Japanese, my mother tongue. Sorry...
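
The stall formulas above can be turned into a quick calculator. A sketch in Python, assuming the formulas are read as "the number of other instructions that fit between issuing the DMA and waiting for it"; the function names are hypothetical and the constants are just the measured values quoted from QV52, QV53 and QV56.

```python
def dma_load_window(nrows):
    """Instructions between vpm_ld_addr and vpm_ld_wait
    for an nrows x 16 x 32-bit DMA load (QV52)."""
    return 5 * nrows + 6.5


def dma_store_window(units):
    """Instructions between vpm_st_addr and vpm_st_wait
    for a units x 16 x 32-bit DMA store (QV53)."""
    return 3.5 * units + 10


# Instructions between the VPM read setup and the first vpm_read (QV56).
VPM_READ_SETUP_WINDOW = 5

for rows in (1, 4, 16):
    print(rows, dma_load_window(rows), dma_store_window(rows))
```

The constant terms (6.5 and 10) show why larger transfers amortize better: a 1-row load costs 11.5 instruction slots, a 16-row load 86.5.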

doe300
Posts: 45
Joined: Thu Dec 29, 2016 1:41 pm

Re: Questions about VideoCore IV GPU

Tue Jan 10, 2017 4:11 pm

Akane wrote: So in this case it is the VPM and DMA operations that cost performance, rather than the mutex access.
Of course the DMA costs performance, but the mutex forces all QPUs to access the VPM serially, which makes things worse :(

Thanks for your sources. Google Translate is of mediocre help, as always. But from what I understand, these are interesting benchmarks.

doe300
Posts: 45
Joined: Thu Dec 29, 2016 1:41 pm

Re: Questions about VideoCore IV GPU

Mon Jan 16, 2017 5:51 pm

Sorry for the double post...

I have another question: The official documentation states that a program running on a QPU needs to be "threadable" for the scheduler to run up to two threads on the same QPU (pages 21/22). How exactly do I set a program to "threadable"? Do I have to do more than use only the first 16 registers (since the second 16 are used by the second thread) and send the thread-switch signal from time to time?

Akane
Posts: 44
Joined: Tue May 27, 2014 1:20 pm
Location: Tsukuba, Japan

Re: Questions about VideoCore IV GPU

Mon Jan 16, 2017 11:55 pm

Don't hesitate to ask me ;)

I haven't tested QPU threads before, but I think the "threadable" flag can only be set through GL/NV/VG shader records in a control list (Tables 45-47 and control list codes 64-67). FYI, a control list is a sequence of instructions for OpenGL ES and so on.

Accumulators and condition flags are not preserved across thread switches (page 21), though I haven't tested that either!

doe300
Posts: 45
Joined: Thu Dec 29, 2016 1:41 pm

Re: Questions about VideoCore IV GPU

Sun Jan 22, 2017 5:09 pm

To prevent you from getting bored, I encountered a few more questions ;)

1.) To my understanding, I can replicate a non-literal value among all elements (in a quad/the whole SIMD) by writing it into the r5 accumulator (from quad/SIMD-element 0) and reading it again. Is this correct?

2.) I can also somehow use the accumulators and/or small immediates to rotate the elements of the SIMD. The official documentation (page 20) states that the output of the MUL ALU can be rotated. But I cannot figure out how to specify the number of elements to rotate by. How do I e.g. rotate upwards by 4 elements? Do I have to "multiply" the value with the small immediate value for 4 elements upwards (which would be 52)? And what does
"The full horizontal vector rotate is only available when both of the mul ALU input arguments are taken from accumulators r0-r3"
mean? (Quoted from the official documentation, page 20)

Akane
Posts: 44
Joined: Tue May 27, 2014 1:20 pm
Location: Tsukuba, Japan

Re: Questions about VideoCore IV GPU

Mon Jan 23, 2017 6:18 am

Wow. Thanks.
doe300 wrote:1.) To my understanding, I can replicate a non-literal value among all elements (in a quad/the whole SIMD) by writing it into the r5 accumulator (from quad/SIMD-element 0) and reading it again. Is this correct?
Yes. If you write to r5 on ra, values are distributed in quads (from [4i] to [4i+1:4i+3] for i=0,1,2,3). If you write to r5 on rb, values are distributed on the whole SIMD (from [0] to [1:15]), as far as I know.
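
The two r5 replication modes described above can be pictured with a small host-side model. This is only a Python sketch of the index mapping as described in this post ([4i] to the rest of quad i, or [0] to everything); the function names are invented for illustration.

```python
def replicate_per_quad(vector):
    """Writing r5 via regfile A: element 4i is copied to 4i+1..4i+3."""
    return [vector[4 * (i // 4)] for i in range(16)]


def replicate_all(vector):
    """Writing r5 via regfile B: element 0 is copied to elements 1..15."""
    return [vector[0]] * 16


v = list(range(16))
print(replicate_per_quad(v))
print(replicate_all(v))
```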

doe300 wrote:2.) I can also somehow use the accumulators and/or small immediates to rotate the elements of the SIMD. The official documentation (page 20) states that the output of the MUL ALU can be rotated. But I cannot figure out how to specify the number of elements to rotate by. How do I e.g. rotate upwards by 4 elements? Do I have to "multiply" the value with the small immediate value for 4 elements upwards (which would be 52)? And what does
"The full horizontal vector rotate is only available when both of the mul ALU input arguments are taken from accumulators r0-r3"
mean? (Quoted from the official documentation, page 20)
Let the MUL ALU's "front inputs" be mul_a and mul_b, and let the "real inputs" be the actual input values of the MUL ALU.
If a front input is an accumulator (r0-r5), it is rotated across the full SIMD and the result is passed to the MUL ALU as a real input. Let's call this "full rotation".
If a front input is a register (ra0-31, rb0-31 and some physical ones), it is rotated within each quad and the result is passed to the MUL ALU as a real input. Let's call this "half rotation".
The MUL ALU takes the 2 real inputs, does the calculation on them and writes the result to waddr_mul.

So, if the front inputs are both accumulators, "full" vector rotation takes place, as the documentation says.

For an example, see http://vc4-notes.tumblr.com/post/153495 ... -registers.
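
The difference between "full rotation" and "half rotation" is easier to see on concrete data. Below is a host-side Python model, assuming that "rotate upwards by n" means the value in element i ends up in element i+n (wrapping around); that direction is an assumption here, and the function names are illustrative only.

```python
def rotate_full(vector, n):
    """Rotate across all 16 elements: element i moves to (i + n) % 16."""
    return [vector[(i - n) % 16] for i in range(16)]


def rotate_per_quad(vector, n):
    """Rotate inside each group of 4: element j of a quad moves to (j + n) % 4."""
    out = []
    for start in range(0, 16, 4):
        quad = vector[start:start + 4]
        out.extend(quad[(j - n) % 4] for j in range(4))
    return out


v = list(range(16))
print(rotate_full(v, 4))      # whole quads change place
print(rotate_per_quad(v, 1))  # values never leave their quad
```

In the per-quad case a rotation by 4 is the identity, which is why only the accumulator path can move data between quads.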

doe300
Posts: 45
Joined: Thu Dec 29, 2016 1:41 pm

Re: Questions about VideoCore IV GPU

Tue Feb 07, 2017 3:34 pm

Thanks for the extensive explanation. I've got another question for you:

The VideoCore IV GPU has only 1 hardware mutex but 16 hardware 4-bit counting semaphores, which block both on underflowing below 0 and on overflowing above 15. So by initially setting a semaphore to a value of 1 and decreasing it on access, I could use it as a mutex. So my question is: is there a good way to initialize a semaphore to a specific value? Or to query its value?

The only thing I could come up with is:
  1. hoping that all semaphores are initialized to 0 (which may not be true, if a previous shader/user program accessed them)
  2. increasing the semaphore from within a single program once to 1
  3. using the semaphore as a mutex by decreasing to lock it and increasing to free it
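
The three steps above can be checked against a toy model of one semaphore. This Python sketch is not QPU code: where the hardware would stall the QPU, the hypothetical `try_increment`/`try_decrement` methods simply return False, and the class name is made up for illustration.

```python
class Semaphore4Bit:
    """Toy model of one 4-bit counting semaphore (range 0..15)."""
    MAX_VALUE = 15

    def __init__(self, value=0):
        self.value = value  # step 1: hope it starts at 0

    def try_increment(self):
        if self.value == self.MAX_VALUE:
            return False    # real hardware would block on overflow
        self.value += 1
        return True

    def try_decrement(self):
        if self.value == 0:
            return False    # real hardware would block on underflow
        self.value -= 1
        return True


sem = Semaphore4Bit()
sem.try_increment()            # step 2: one program raises it to 1, once
print(sem.try_decrement())     # step 3: first QPU takes the lock
print(sem.try_decrement())     # a second QPU would block here
sem.try_increment()            # the first QPU releases the lock
```

The scheme only works if the counter really starts at 0; a leftover value of 2 or more would let several QPUs in at once.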

Akane
Posts: 44
Joined: Tue May 27, 2014 1:20 pm
Location: Tsukuba, Japan

Re: Questions about VideoCore IV GPU

Wed Feb 08, 2017 12:55 pm

According to the manual, there seems to be no way to set the semaphore counters.
However, there is another way to reset the counters: disabling and re-enabling the QPUs using a mailbox call.
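
For reference, a sketch of what such a mailbox message might look like when built on the host. This Python snippet only constructs the buffers and sends nothing. The tag value 0x00030012 and the 7-word layout are assumptions recalled from the mailbox helper shipped with the hello_fft example, so check them against the firmware's mailbox property documentation before relying on them; the function names are hypothetical.

```python
import struct

# Assumption: "enable QPU" property tag as used in the hello_fft mailbox helper.
TAG_SET_ENABLE_QPU = 0x00030012


def build_qpu_enable_message(enable):
    """Build a property-interface buffer that enables (True) or disables (False) the QPUs."""
    words = [
        0,                    # total buffer size in bytes, patched below
        0x00000000,           # request code
        TAG_SET_ENABLE_QPU,   # tag id (assumed, see above)
        4,                    # size of the value buffer in bytes
        4,                    # size of the request data in bytes
        1 if enable else 0,   # the value: 1 = enable, 0 = disable
        0x00000000,           # end tag
    ]
    words[0] = 4 * len(words)
    return struct.pack("<%dI" % len(words), *words)


def reset_sequence():
    """Disable, then re-enable: the two messages to send in order."""
    return [build_qpu_enable_message(False), build_qpu_enable_message(True)]


for message in reset_sequence():
    print(message.hex())
```

Actually delivering these buffers to the firmware (for example through the mailbox device node) is left out on purpose, since that part depends on the kernel and firmware in use.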

doe300
Posts: 45
Joined: Thu Dec 29, 2016 1:41 pm

Re: Questions about VideoCore IV GPU

Thu Apr 06, 2017 10:37 am

So, I stumbled upon another question regarding the peripherals:

What is the best way to read from memory in a non-linear (e.g. random-access) style?

My current steps (for every read from another address):
  1. configure VPR
  2. configure VPR DMA
  3. set address
  4. wait for DMA to finish
  5. read data
Do I have to regard anything I didn't mention in my list?

Akane
Posts: 44
Joined: Tue May 27, 2014 1:20 pm
Location: Tsukuba, Japan

Re: Questions about VideoCore IV GPU

Thu Apr 06, 2017 11:25 am

For random accesses, I recommend using the TMU because it's more flexible than VPM DMA.
To do a TMU read:
  1. Write the memory address (4-byte aligned) to TMU[01]_S. The address can be different across QPU threads, that is, you can read up to 16 x 4bytes of memory on a TMU read.
  2. Signal the TMU to which you wrote the address.
  3. Now you have the memory content in r4.
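
The effect of the TMU read steps above is a gather: each of the 16 SIMD elements supplies its own address and receives the 32-bit word stored there. A host-side Python model of that behaviour, assuming little-endian words; `tmu_gather` is a made-up name and nothing here talks to real hardware.

```python
import struct


def tmu_gather(memory, addresses):
    """Return one 32-bit word per SIMD element, read from that element's own address."""
    if len(addresses) != 16:
        raise ValueError("expected one address per SIMD element (16)")
    result = []
    for address in addresses:
        if address % 4 != 0:
            raise ValueError("addresses must be 4-byte aligned")
        (word,) = struct.unpack_from("<I", memory, address)
        result.append(word)
    return result


# 32 words of fake memory; word k holds the value k.
memory = struct.pack("<32I", *range(32))
# Each element picks a different, non-consecutive word.
addresses = [4 * ((7 * i) % 32) for i in range(16)]
print(tmu_gather(memory, addresses))
```
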
doe300 wrote:My current steps (for every read from another address):
  1. configure VPR
  2. configure VPR DMA
  3. set address
  4. wait for DMA to finish
  5. read data
"VPR" = "VPM read" I suspected.
Because VPR caches some vectors (maybe 2) as soon as it is configured, you must setup VPR after waiting for VPR DMA.
So you should do it like this:
  1. configure VPR DMA
  2. set address
  3. wait for DMA to finish
  4. configure VPR
  5. read data
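
Why the order matters can be shown with a deliberately simplified model. The Python sketch below assumes, as suspected above, that configuring the VPM read immediately latches the current VPM contents; the `ToyVpm` class and its methods are invented for this illustration and ignore all real setup registers.

```python
class ToyVpm:
    """Toy model: VPM rows plus a read port that prefetches at setup time."""

    def __init__(self):
        self.rows = [0] * 4      # stale contents before any DMA
        self.prefetched = None   # what the read setup has latched

    def dma_load(self, row, value):
        # Stands in for: configure VPR DMA, set address, wait for DMA to finish.
        self.rows[row] = value

    def configure_vpr(self, row):
        # Assumed behaviour: the read setup caches the row right away.
        self.prefetched = self.rows[row]

    def read(self):
        return self.prefetched


def setup_before_dma(value):
    vpm = ToyVpm()
    vpm.configure_vpr(0)   # too early: latches the stale row
    vpm.dma_load(0, value)
    return vpm.read()


def setup_after_dma(value):
    vpm = ToyVpm()
    vpm.dma_load(0, value)
    vpm.configure_vpr(0)   # latches the freshly loaded row
    return vpm.read()


print(setup_before_dma(42), setup_after_dma(42))
```

In this model the early setup returns the stale 0 while the late setup returns the loaded value, which matches the symptom of a correct configuration applied in the wrong order.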

doe300
Posts: 45
Joined: Thu Dec 29, 2016 1:41 pm

Re: Questions about VideoCore IV GPU

Thu Apr 06, 2017 2:19 pm

Thanks for the quick answer,
Akane wrote:The address can be different across QPU threads, that is, you can read up to 16 x 4bytes of memory on a TMU read.
Are the 16 * 4 Bytes the 16 SIMD-elements times 4 Bytes data-type? If so, what does this have to do with threads? Or has the TMU a cache of 16 requests?
Akane wrote:"VPR" = "VPM read" I suspected.
Yes.
Akane wrote:Because VPR caches some vectors (maybe 2) as soon as it is configured, you must setup VPR after waiting for VPR DMA.
So reading through the VPM via DMA has two steps, 1. read from memory via DMA into VPM, 2. read from VPM into QPU?

Which version (TMU or DMA over VPM) has the higher data throughput?

Akane
Posts: 44
Joined: Tue May 27, 2014 1:20 pm
Location: Tsukuba, Japan

Re: Questions about VideoCore IV GPU

Thu Apr 06, 2017 3:51 pm

doe300 wrote:Thanks for the quick answer,
Akane wrote:The address can be different across QPU threads, that is, you can read up to 16 x 4bytes of memory on a TMU read.
Are the 16 * 4 Bytes the 16 SIMD-elements times 4 Bytes data-type? If so, what does this have to do with threads? Or has the TMU a cache of 16 requests?
Yes - If thread x writes an address a_x to TMU[01]_S, the thread gets the contents of memory address a_x.
(I didn't figure out what are you asking... Sorry but could you tell me more?)
doe300 wrote:
Akane wrote:Because VPR caches some vectors (maybe 2) as soon as it is configured, you must setup VPR after waiting for VPR DMA.
So reading through the VPM via DMA has two steps, 1. read from memory via DMA into VPM, 2. read from VPM into QPU?
Yes. I spent many hours to understand this at first!
doe300 wrote:Which version (TMU or DMA over VPM) has the higher data throughput?
As for throughput, VPM DMA wins.
  • VPM DMA read can read at 690MB/s when 16x16x4bytes are loaded. [1]
  • VPM DMA write can write at 1120MB/s when 128x16x4bytes are stored. [2]
  • VPM DMA R/W seem to have a constant-time overhead. So you'll get better performance if you do larger-sized load/store. [1][2]
  • VPM read generates 5 clock stalls between "VPM generic block read setup" and the first VPM_READ read. [3]
  • TMU read generates:
    • 9 clock stalls when it reads from TMU cache.
    • 12 clock stalls when it reads from V3D L2 cache.
    • 20 clock stalls when it reads directly from memory.
  • So TMU can read at 28MB/s maximum, which is slower than VPM DMA.
  • However, the TMU is simpler to use than VPM DMA: you just write the address and signal.
  • In addition, VPM DMA doesn't have a smart way to access memory randomly. So I think it's better to use TMU here.
[1]: http://imrc.noip.me/blog/vc4/QV52/
[2]: http://imrc.noip.me/blog/vc4/QV53/
[3]: http://imrc.noip.me/blog/vc4/QV56/

doe300
Posts: 45
Joined: Thu Dec 29, 2016 1:41 pm

Re: Questions about VideoCore IV GPU

Sat Apr 08, 2017 9:00 am

Thank you so very much!

Finally got it working, after I reordered the steps in VPR DMA. My configuration was correct the whole time, just in the wrong order :D

I have to go with reading via VPM, since I may need the speed and I definitely need the flexibility of reading values with different byte- (char, short, int/float) and vector-sizes (1-16).
Akane wrote:Yes - If thread x writes an address a_x to TMU[01]_S, the thread gets the contents of memory address a_x.
(I didn't figure out what are you asking... Sorry but could you tell me more?)
I misunderstood you there. I thought you said the "16 x 4bytes of memory on a TMU read" are a buffer of 16 reads (from different threads), but they are the 16 elements of a single read. The fact that the TMU allows multiple threads to fetch data concurrently is sadly of no use to me, since I can't use threads in custom code.

Akane
Posts: 44
Joined: Tue May 27, 2014 1:20 pm
Location: Tsukuba, Japan

Re: Questions about VideoCore IV GPU

Sat Apr 08, 2017 9:18 am

It's great to hear that you are doing well. Please continue rocking!

(And sorry for my mistakes in English!)
