doe300
Posts: 18
Joined: Thu Dec 29, 2016 1:41 pm

OpenCL on the VideoCore IV!

Mon Oct 09, 2017 9:14 am

Not really graphics programming, more like GPGPU...

I spent the last six months on my master's thesis, developing an OpenCL implementation that runs on the VideoCore IV GPU!

I present to you VC4CL (VideoCore IV OpenCL):

Of course it is far from complete, but it runs about 50% of the OpenCL CTS test-cases for supported features, 60% of the test-programs of a slightly modified boost compute library, 71% of the test cases for EasyCL, as well as some other test-programs.

Performance-wise, it beats the pocl implementation in the floating-point benchmark (reaching up to 4 GFLOPS!) and, as expected, has inferior memory-access speed (up to 120 MB/s).

The VC4C compiler supports compilation of OpenCL C source code, LLVM IR and SPIR-V via the corresponding front-ends, and can use either standard LLVM or Khronos SPIRV-LLVM as the front-end compiler. The VC4CL library can also be used with the Khronos ICD loader.

Notable features not (yet) supported:
  • 64-bit data-types (long, double)
  • linking of multiple source code files
  • images (WIP)
  • a lot of mathematical correctness (WIP)
  • performance (mostly within the compiler)
The code can be taken from here (the runtime-library), here (the compiler) and here (the standard-library).

NOTE: Because there is no MMU between the VPM and the RAM, and because memory-mapping is required to access the V3D registers, applications using the VC4CL implementation must be run as root!

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 17889
Joined: Sat Jul 30, 2011 7:41 pm

Re: OpenCL on the VideoCore IV!

Mon Oct 09, 2017 1:10 pm

That sounds like very good work indeed, nice one.
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Please direct all questions to the forum, I do not do support via PM.

eupton
Forum Moderator
Posts: 29
Joined: Sun Apr 15, 2012 7:28 pm

Re: OpenCL on the VideoCore IV!

Mon Oct 09, 2017 9:04 pm

Is a copy of your master's thesis available online? I'd love to know a bit more about the challenges you encountered getting this to work. I'm particularly interested in your approach to writes to memory via the VDW (which, as you observe, is not the fastest thing in the world).

Gavinmc42
Posts: 1440
Joined: Wed Aug 28, 2013 3:31 am

Re: OpenCL on the VideoCore IV!

Tue Oct 10, 2017 3:03 am

Wow, someone give this poster a job quick.
I'm dancing on Rainbows.
Raspberries are not Apples or Oranges

doe300
Posts: 18
Joined: Thu Dec 29, 2016 1:41 pm

Re: OpenCL on the VideoCore IV!

Tue Oct 10, 2017 8:12 am

eupton wrote:
Mon Oct 09, 2017 9:04 pm
Is a copy of your master's thesis available online?
Not yet, since I haven't turned it in. I will upload it once I have; it is written in German, though.
eupton wrote:
Mon Oct 09, 2017 9:04 pm
I'd love to know a bit more about the challenges you encountered getting this to work. Particularly interested in your approach to writes to memory via the VDW (which as you observe is not the fastest thing in the world).
Getting memory access to work (especially making it freely configurable for 1, 2, 3, 4, 8 and 16 elements of 1-byte, 2-byte and 4-byte types) was not easy. Akane was a great help in getting it working on this thread.

The basic steps of what is done for memory writes:
  1. Lock the hardware mutex to prevent the configuration from being overwritten, since all QPUs share the same VPM
  2. Configure access from QPU to VPM (byte-size of type, number of elements and number of vectors to write)
  3. Write the correct number of vectors into the VPM
  4. Configure DMA (byte-size of type, number of elements and number of vectors to write)
  5. Write memory address to initiate DMA write
  6. Read the DMA wait register to wait for the DMA access to finish
  7. Unlock the hardware mutex
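
The ordering of the steps above can be sketched as a small host-side simulation. This is only a toy model in Python, not QPU code and not how VC4CL is implemented: the class `ToyVPM`, its methods and the dictionary standing in for RAM are hypothetical names made up for illustration, and a `threading.Lock` stands in for the hardware mutex.

```python
import threading


class ToyVPM:
    """Toy stand-in for the shared VPM and its DMA engine (hypothetical model)."""

    def __init__(self):
        self.mutex = threading.Lock()  # models the hardware mutex
        self.rows = []                 # vectors currently staged in the VPM
        self.qpu_setup = None          # QPU -> VPM access configuration
        self.dma_setup = None          # VPM -> RAM DMA configuration
        self.ram = {}                  # models main memory: address -> value

    def write_to_memory(self, address, vectors, elem_bytes, num_elems):
        with self.mutex:                                        # steps 1 and 7
            setup = (elem_bytes, num_elems, len(vectors))
            self.qpu_setup = setup                              # step 2
            self.rows = [list(v[:num_elems]) for v in vectors]  # step 3
            self.dma_setup = setup                              # step 4
            self._dma_store(address)                            # step 5
            # Step 6: real hardware would block on the DMA wait register here;
            # the toy store above is synchronous, so there is nothing to wait for.

    def _dma_store(self, address):
        elem_bytes, _, _ = self.dma_setup
        offset = address
        for row in self.rows:
            for value in row:
                self.ram[offset] = value
                offset += elem_bytes
        self.rows = []


def run_qpus(vpm, num_qpus=4):
    """Let several threads (standing in for QPUs) write one 16-element vector each."""
    threads = []
    for qpu in range(num_qpus):
        data = [[qpu * 100 + i for i in range(16)]]
        t = threading.Thread(target=vpm.write_to_memory, args=(qpu * 64, data, 4, 16))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
```

Because every writer holds the lock from configuration until the store has finished, no thread can overwrite another's setup or staged rows, which is the role the mutex plays in the list above. It also shows why this path is slow: all writers are serialized.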
Gavinmc42 wrote:Wow, someone give this poster a job quick.
I'd gladly accept ;)
Last edited by doe300 on Tue Oct 10, 2017 2:43 pm, edited 1 time in total.

Gavinmc42
Posts: 1440
Joined: Wed Aug 28, 2013 3:31 am

Re: OpenCL on the VideoCore IV!

Tue Oct 10, 2017 8:58 am

How's your Japanese :lol:

I suspect Akane might have something to do with this.
https://idein.jp/
https://github.com/nineties/py-videocore
Either that or there is more than one Japanese QPU guru ;)

I do find it interesting that anyone who does AI/ML stuff with VC4 GPUs/QPUs seems to end up at MS or Intel or Google or...
Start brushing up your resume, passport photo, etc ;)

doe300
Posts: 18
Joined: Thu Dec 29, 2016 1:41 pm

Re: OpenCL on the VideoCore IV!

Thu Oct 12, 2017 11:18 am

Gavinmc42 wrote:
Tue Oct 10, 2017 8:58 am
How's your Japanese :lol:
Google translate for the win! :?
Gavinmc42 wrote:
Tue Oct 10, 2017 8:58 am
I do find it interesting that anyone who does AI/ML stuff with VC4 GPUs/QPUs seems to end up at MS or Intel or Google or...
Start brushing up your resume, passport photo, etc ;)
Not really my first choice of employers, Intel would be okay though ;)

blackshard83
Posts: 67
Joined: Fri Jan 10, 2014 8:31 am

Re: OpenCL on the VideoCore IV!

Tue Oct 17, 2017 10:31 am

Great job indeed!
Congratulations!

jbeale
Posts: 3266
Joined: Tue Nov 22, 2011 11:51 pm
Contact: Website

Re: OpenCL on the VideoCore IV!

Wed Oct 18, 2017 4:26 pm

This sounds very impressive but I'm not sure I understand the implications. Are there any examples of this in action? Does this mean we might be able to get better performance on compute-intensive tasks, like image recognition? Right now there are neural-network "deep learning" based object-detection programs that run on the RPi3 and take just over 1 second to process one video frame and detect the location of objects (chair, person, etc.) in it, for example: https://www.pyimagesearch.com/2017/10/1 ... ent-437929

These neural-network programs spend most of the CPU time doing a huge number of simple multiply-and-add instructions to go from an input array of pixels to the output of predicted object locations. To compare some actual numbers, the Google MobileNets project https://research.googleblog.com/2017/06 ... s-for.html offers several versions of an object detector and classifier, requiring from 14 to 569 million MACs (multiply-accumulate operations) per frame depending on what accuracy you want.

The deep learning code I've seen on the RPi so far runs entirely on the CPU. Would this OpenCL work enable such an application to leverage the GPU to reach higher frame rates? When an object-recognition program runs at 0.9 fps it is not fast enough for some real-time applications, but if for example a 2 or 3x speedup was possible, that would start to become more useful, and of course the more the better.

doe300
Posts: 18
Joined: Thu Dec 29, 2016 1:41 pm

Re: OpenCL on the VideoCore IV!

Wed Oct 18, 2017 7:07 pm

jbeale wrote:
Wed Oct 18, 2017 4:26 pm
This sounds very impressive but I'm not sure I understand the implications. Are there any examples of this in action?
No, not yet. I can run several test cases, but I haven't tested it with any real-world application yet.
jbeale wrote:
Wed Oct 18, 2017 4:26 pm
Does this mean we might be able to get better performance on compute-intensive tasks, like image recognition? [...] These neural-network programs spend most of the CPU time doing a huge number of simple multiply-and-add instructions to go from an input array of pixels to the output of predicted object locations.
Yes, probably. Using the GPU for OpenCL calculations definitely has the advantage that the CPU stays free for other calculations. For small OpenCL kernels, performance will probably be worse than native execution on the CPU, since there is some overhead in starting kernels. For larger kernels, however, the GPU outperforms the CPU, especially for parallel tasks.

Performance-wise, I measured up to 4 GFLOPs for the clpeak floating-point benchmark (out of the theoretical maximum of 24 GFLOPs for the VC4 GPU), which is a lot more than the "original" Raspberry Pi A or B can achieve and about as high as the theoretical maximum computing power of the Raspberry Pi 3 without using NEON instructions.
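
For reference, the 24 GFLOPS figure can be reproduced with simple arithmetic. The hardware parameters below are the commonly quoted ones for the VideoCore IV (12 QPUs, 4 physical SIMD lanes per QPU, one add and one multiply ALU per lane, 250 MHz clock); they are not stated in this thread, so treat them as assumptions.

```python
def peak_gflops(num_qpus=12, lanes_per_qpu=4, alus_per_lane=2, clock_hz=250e6):
    """Theoretical peak: every lane issues one add and one multiply per cycle."""
    return num_qpus * lanes_per_qpu * alus_per_lane * clock_hz / 1e9


if __name__ == "__main__":
    peak = peak_gflops()
    measured = 4.0  # clpeak result reported above
    print(f"peak {peak:.0f} GFLOPS, measured {measured / peak:.0%} of peak")
```

Under these assumptions the measured 4 GFLOPS is roughly one sixth of the theoretical peak.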

So, in theory, OpenCL on the VideoCore IV GPU should increase the performance of such applications. Currently, the greatest obstacle is not performance but the fact that the implementation is not yet complete and will most likely produce some wrong results. You are definitely welcome to try it out and give feedback on the performance or correctness of such applications!
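
As a rough sanity check, the MobileNets figures quoted earlier in the thread (14 to 569 million MACs per frame) can be turned into a compute-only upper bound on frame rate. This is a back-of-the-envelope sketch: it counts each MAC as two floating-point operations, assumes the measured 4 GFLOPS could be sustained, and ignores memory traffic and kernel-launch overhead entirely, so real frame rates would be lower.

```python
def max_fps(macs_per_frame, sustained_gflops=4.0, flops_per_mac=2):
    """Compute-only upper bound on frames per second (memory access ignored)."""
    flops_per_frame = macs_per_frame * flops_per_mac
    return sustained_gflops * 1e9 / flops_per_frame


if __name__ == "__main__":
    for macs in (14e6, 569e6):
        print(f"{macs / 1e6:.0f} M MACs/frame -> at most {max_fps(macs):.1f} fps")
```

The function name and its parameters are made up for this illustration.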

jbeale
Posts: 3266
Joined: Tue Nov 22, 2011 11:51 pm
Contact: Website

Re: OpenCL on the VideoCore IV!

Wed Oct 18, 2017 8:59 pm

Thanks for that informative reply. I assume the difference between the 4 GFLOPS in practice and the 24 GFLOPS in theory has to do with memory bandwidth?

doe300
Posts: 18
Joined: Thu Dec 29, 2016 1:41 pm

Re: OpenCL on the VideoCore IV!

Wed Oct 18, 2017 9:16 pm

jbeale wrote:
Wed Oct 18, 2017 8:59 pm
Thanks for that informative reply. I assume the difference between the 4 GFlops in practice and 24 GFlops in theory, has to do with memory bandwidth?
Yes. Part of it is due to unoptimized instructions, but most of the performance loss is due to memory access speed (currently about 110 MB/s in the clpeak bandwidth benchmark), which is a big bottleneck for the VC4 GPU.
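
A simple roofline-style estimate illustrates how strongly this bandwidth limits throughput. The sketch below takes the 24 GFLOPS theoretical peak and the 110 MB/s figure from this thread, treats 1 MB as 10^6 bytes (an assumption), and uses hypothetical function names; it is a model, not a measurement.

```python
def attainable_gflops(flops_per_byte, peak_gflops=24.0, bandwidth_mb_s=110.0):
    """Roofline bound: limited either by compute or by memory traffic."""
    memory_bound = flops_per_byte * bandwidth_mb_s * 1e6 / 1e9
    return min(peak_gflops, memory_bound)


def break_even_intensity(peak_gflops=24.0, bandwidth_mb_s=110.0):
    """FLOPs a kernel must perform per byte moved to become compute-bound."""
    return peak_gflops * 1e9 / (bandwidth_mb_s * 1e6)


if __name__ == "__main__":
    # Element-wise float addition c = a + b: 1 FLOP per 12 bytes of traffic.
    print(attainable_gflops(1 / 12))
    print(break_even_intensity())
```

In this model a kernel would need on the order of 200 floating-point operations per byte transferred before the compute units, rather than memory, become the limit, whereas a streaming kernel such as a vector addition stays far below 1 GFLOPS.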

Gavinmc42
Posts: 1440
Joined: Wed Aug 28, 2013 3:31 am

Re: OpenCL on the VideoCore IV!

Thu Oct 19, 2017 1:41 am

OpenCL on the Pi QPU means things like the ARM Compute Library now have a chance of being ported.
The Compute Library runs on ARM, NEON and the Mali GPU; now there is a better chance of it running on VC4.

The Pi 3 has bigger caches, could the code be made small enough to mostly work in those to avoid accessing the DDR?
RPF says NEON would probably be faster than QPU, but why not use both at the same time :D

Then there is quite a bit of OpenCL code out there that might now run on Pis, even Zeros.
Poor man's NEON for the BCM2835 Pis.

It is not just OpenCL, either: the toolset needed to get OpenCL working (LLVM etc.) can be used for other things too.
This is just the beginning; who knows where this could lead?

doe300
Posts: 18
Joined: Thu Dec 29, 2016 1:41 pm

Re: OpenCL on the VideoCore IV!

Thu Oct 19, 2017 9:45 am

Gavinmc42 wrote:
Thu Oct 19, 2017 1:41 am
The Pi 3 has bigger caches, could the code be made small enough to mostly work in those to avoid accessing the DDR?
The size of the code depends on its purpose, so the compiler cannot force it to stay below some limit. Also, VC4CL does not really concern itself with the L2, instruction and uniform caches, since it cannot influence their behaviour other than by forcing the L2 cache to be cleaned.

If you are referring to the VPM cache size, then there is an optimization in development to implement a write-back cache using the VPM to limit the amount of memory access required. Currently, the VPM only caches data for successive reads or writes using consecutive memory addresses for a single QPU.
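
To illustrate the general idea of a write-back cache (not the VC4CL optimization itself, whose design is not described here), the following toy model keeps writes in a small cache and only touches the backing memory on eviction or flush. The class name, the FIFO eviction policy and the dictionary standing in for RAM are all assumptions made for the sketch.

```python
class WriteBackCache:
    """Toy write-back cache: writes stay cached until eviction or flush."""

    def __init__(self, backing, capacity):
        self.backing = backing    # dict modelling RAM: address -> value
        self.capacity = capacity  # maximum number of cached entries
        self.lines = {}           # address -> value, in insertion order
        self.memory_writes = 0    # number of writes that reached RAM

    def write(self, address, value):
        if address not in self.lines and len(self.lines) >= self.capacity:
            self._evict_oldest()
        self.lines[address] = value

    def read(self, address):
        if address in self.lines:
            return self.lines[address]
        return self.backing[address]

    def flush(self):
        while self.lines:
            self._evict_oldest()

    def _evict_oldest(self):
        address = next(iter(self.lines))  # first-inserted entry (FIFO)
        self.backing[address] = self.lines.pop(address)
        self.memory_writes += 1


if __name__ == "__main__":
    ram = {}
    cache = WriteBackCache(ram, capacity=16)
    for step in range(100):
        cache.write(step % 4, step)  # 100 writes to only 4 addresses
    cache.flush()
    print(cache.memory_writes, ram)
```

When a kernel repeatedly updates the same few addresses, only the final values reach memory, which is where the reduction in memory accesses would come from.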
