skyglow
Posts: 1
Joined: Mon Jan 15, 2018 2:01 pm

Re: OpenCL on the VideoCore IV!

Mon Jan 15, 2018 2:13 pm

Great work doe!

My comment may sound stupid, but it could help solve some compilation issues with the test suites.
Building LLVM/Clang in Release mode is preferred since it consumes a lot less memory. Otherwise, the build will very likely fail due to insufficient memory. It's also a lot quicker to build only the relevant back-ends (ARM and AArch64), since it's very unlikely that you'll use an ARM board to cross-compile to other architectures. If you're running Compiler-RT tests, also include the x86 back-end, or some tests will fail.

Code: Select all

cmake $LLVM_SRC_DIR -DCMAKE_BUILD_TYPE=Release \
                    -DLLVM_TARGETS_TO_BUILD="ARM;X86;AArch64"
from http://releases.llvm.org/5.0.1/docs/How ... OnARM.html

I'm trying to use your sources on 64-bit openSUSE. I'll post later about success or failure.

paulreimer
Posts: 3
Joined: Mon Jan 15, 2018 1:47 am

Re: OpenCL on the VideoCore IV!

Mon Jan 15, 2018 5:01 pm

While I'm not familiar with GPGPU on the Pi (i.e. using OpenGL, not OpenCL), my experience on x86's has been that GPGPU/OpenGL performance can be quite a bit better than OpenCL (for certain problems), which is perhaps because the OpenGL drivers are more optimized for problems which are easily modelled to OpenGL primitives.

What are your thoughts on this regarding the RPi? Is that something that could also be benchmarked?

paulreimer
Posts: 3
Joined: Mon Jan 15, 2018 1:47 am

Re: OpenCL on the VideoCore IV!

Mon Jan 15, 2018 5:12 pm

Also, would this work on the RPi Zero family? pocl vs OpenCL benchmarks would be quite different there, yes? (e.g. OpenCL should stay the same, while pocl would have fewer cores to work with)

mic_s
Posts: 61
Joined: Sun Oct 26, 2014 4:15 pm

Re: OpenCL on the VideoCore IV!

Mon Jan 15, 2018 10:42 pm

would this work on the RPi Zero family?
Pi0, Pi2 and Pi3 all have the same VideoCore IV, so any benchmark that runs purely on the VideoCore IV will give the same result.
(That said, there is a very small difference between the GPU in the Pi0 and the GPU in the Pi3:
Pi0: the GPU is the "owner" of the L2 cache. Pi3: the ARM is the "owner" of the L2 cache.)

doe300
Posts: 38
Joined: Thu Dec 29, 2016 1:41 pm

Re: OpenCL on the VideoCore IV!

Tue Jan 16, 2018 9:42 am

paulreimer wrote:
Mon Jan 15, 2018 5:01 pm
What are your thoughts on this regarding the RPi? Is that something that could also be benchmarked?
Since there exists an OpenGL implementation for the RPi running on the VC4 as well as the VC4CL implementation, their relative performance could be tested with a benchmark yielding comparable results.
paulreimer wrote:
Mon Jan 15, 2018 5:01 pm
[...] my experience on x86's has been that GPGPU/OpenGL performance can be quite a bit better than OpenCL (for certain problems), which is perhaps because the OpenGL drivers are more optimized for problems which are easily modelled to OpenGL primitives.
I think this is to be expected for various reasons:
  1. The code runs on graphics processors, for which OpenGL is explicitly written, whereas OpenCL is also intended to run on FPGAs and CPUs. So OpenGL is fitted more precisely to the hardware features and limitations of GPUs.
  2. OpenCL is expected to handle a lot more use cases than OpenGL, so OpenGL can be better optimized for its more limited uses.
  3. In OpenGL, time is of the essence (you want to guarantee a certain FPS), so optimizing it for performance is critical.
  4. Lastly, AFAIK OpenGL is far more widely used than OpenCL, so focusing on optimizing OpenGL is the smart choice from the vendor's point of view.

merlz42
Posts: 25
Joined: Sun May 13, 2012 1:19 pm

Re: OpenCL on the VideoCore IV!

Fri Jan 26, 2018 9:48 am

I'm having trouble getting this detected. Here's my whole build process:

Code: Select all

cd /opt && rm -rf * && git clone https://github.com/KhronosGroup/SPIRV-LLVM.git && cd SPIRV-LLVM && git checkout khronos/spirv-3.6.1 && mkdir build && cd build && cmake ../ -DCMAKE_BUILD_TYPE=Release -DLLVM_BUILD_RUNTIME=Off -DLLVM_INCLUDE_TESTS=Off -DLLVM_INCLUDE_EXAMPLES=Off -DLLVM_ENABLE_BACKTRACES=Off -DLLVM_TARGETS_TO_BUILD=ARM && make -j2
cd ~/tmp && rm -rf * && git clone https://github.com/doe300/VC4CLStdLib.git && git clone https://github.com/doe300/VC4C.git && git clone https://github.com/doe300/VC4CL.git
cd VC4CLStdLib && cmake . && make -j2 && make install && cd ..
cd VC4C && cmake . && make -j2 && make install && cd ..
cd VC4CL && cmake . && make -j2 && make install && cd ..
This all executes correctly and I've got all the supporting packages installed. I installed the newer clinfo from source. Running sudo clinfo returns 0 platforms :-(

doe300
Posts: 38
Joined: Thu Dec 29, 2016 1:41 pm

Re: OpenCL on the VideoCore IV!

Fri Jan 26, 2018 10:39 am

merlz42 wrote:
Fri Jan 26, 2018 9:48 am
This all executes correctly and I've got all the supporting packages installed. I installed the newer clinfo from source. Running sudo clinfo returns 0 platforms :-(
See here: for the ICD loader to detect the implementation, there needs to be a file in /etc/OpenCL/vendors/ with a single line containing the absolute path to the library to load (in this case libVC4CL.so). This file should have been generated in your build directory with the name VC4CL.icd and should also have been copied to the correct path by the make install command.

Can you check if /etc/OpenCL/vendors/VC4CL.icd exists and whether the library-path contained points to the correct library-file?
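
The check above can be scripted. This is only a rough sketch: check_icd_dir is a made-up helper name, and it assumes each .icd file holds an absolute path on its first line, as described above. An .icd file that contains only a bare library name would be reported as MISSING here even if the loader could still resolve it.

```shell
#!/bin/sh
# check_icd_dir DIR (hypothetical helper): for every *.icd file in DIR,
# report whether the library path on its first line exists.
# DIR defaults to the vendors directory mentioned above.
check_icd_dir() {
    dir="${1:-/etc/OpenCL/vendors}"
    status=0
    for icd in "$dir"/*.icd; do
        # An unmatched glob stays literal, so skip it.
        [ -e "$icd" ] || continue
        # read fails on a file without a trailing newline but still sets lib.
        IFS= read -r lib < "$icd" || true
        if [ -e "$lib" ]; then
            echo "OK      $icd -> $lib"
        else
            echo "MISSING $icd -> $lib"
            status=1
        fi
    done
    return "$status"
}
```

Calling check_icd_dir without an argument inspects /etc/OpenCL/vendors; a non-zero return status means at least one entry points to a file that does not exist.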

merlz42
Posts: 25
Joined: Sun May 13, 2012 1:19 pm

Re: OpenCL on the VideoCore IV!

Fri Jan 26, 2018 10:46 am

Code: Select all

$ cat /etc/OpenCL/vendors/VC4CL.icd 
/usr/local/lib/libVC4CL.so
$ ls /usr/local/lib/libVC4CL.so
/usr/local/lib/libVC4CL.so
$ ls -lah /usr/local/lib/libVC4CL.so
lrwxrwxrwx 1 pi staff 15 Jan 23 08:35 /usr/local/lib/libVC4CL.so -> libVC4CL.so.1.2
$ ls -lah /usr/local/lib/libVC4CL.so.1.2 
lrwxrwxrwx 1 pi staff 15 Jan 23 08:35 /usr/local/lib/libVC4CL.so.1.2 -> libVC4CL.so.0.4
$ ls -lah /usr/local/lib/libVC4CL.so.0.4 
-rw-r--r-- 1 pi staff 6.0M Jan 26 09:38 /usr/local/lib/libVC4CL.so.0.4

doe300
Posts: 38
Joined: Thu Dec 29, 2016 1:41 pm

Re: OpenCL on the VideoCore IV!

Fri Jan 26, 2018 11:59 am

The ICD loader should find libVC4CL.so; maybe it cannot load it.
Can you print the results of ldd /usr/local/lib/libVC4CL.so and see if all required libraries are found?

Can you also check if the programs v3d_info, v3d_dump_analyzer or VC4C work?
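
To make the ldd output easier to read, a small filter like the following could list only the unresolved dependencies. It is a sketch: missing_libs is a made-up name, and it assumes the usual glibc ldd format, where an unresolved entry is printed as "name => not found".

```shell
#!/bin/sh
# missing_libs (hypothetical helper): read `ldd` output on stdin and print
# the names of the dependencies the dynamic linker could not resolve.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Intended use on the Pi (not executed here):
#   ldd /usr/local/lib/libVC4CL.so | missing_libs
```

If this prints nothing, all dependencies were resolved and the problem is probably elsewhere.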

merlz42
Posts: 25
Joined: Sun May 13, 2012 1:19 pm

Re: OpenCL on the VideoCore IV!

Fri Jan 26, 2018 9:07 pm

This looks like it works. To detect libVC4CC, I needed:

Code: Select all

export LD_LIBRARY_PATH=/lib:/usr/lib:/usr/local/lib
This is weird since /usr/local/lib is set in ld.so.conf.

Similarly, I can run clinfo successfully now with:

Code: Select all

sudo LD_LIBRARY_PATH=/lib:/usr/lib:/usr/local/lib clinfo
This is with a default Raspbian distro on a Raspberry Pi 3.
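
One possible explanation, which is an assumption and not confirmed in this thread: the linker cache was not refreshed after make install. Directories listed in ld.so.conf are only searched through /etc/ld.so.cache, and that cache is rebuilt when ldconfig is run. The sketch below (in_ld_cache is a made-up helper name) checks whether a library name shows up in the output of ldconfig -p.

```shell
#!/bin/sh
# in_ld_cache NAME (hypothetical helper): succeed if NAME appears as a
# library entry in `ldconfig -p` output supplied on stdin.
in_ld_cache() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Intended use on the Pi (not executed here):
#   sudo ldconfig                                   # rebuild /etc/ld.so.cache
#   ldconfig -p | in_ld_cache libVC4CC.so && echo "libVC4CC.so is cached"
```

If the library is listed after running ldconfig, the LD_LIBRARY_PATH override should no longer be necessary.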

naibaf7
Posts: 1
Joined: Thu Feb 01, 2018 1:23 am

Re: OpenCL on the VideoCore IV!

Thu Feb 01, 2018 1:24 am

I'm trying to add quirks/workarounds to OpenCL Caffe in order to get some networks working with OpenCL on the Raspberry Pi:
https://github.com/naibaf7/caffe

mogu
Posts: 2
Joined: Fri Feb 16, 2018 9:09 pm

Re: OpenCL on the VideoCore IV!

Fri Feb 16, 2018 11:24 pm

Hello,

pocl developer here.. just stumbled upon this thread while randomly browsing the net :)
just a few observations...
  • @doe300 very impressive work! though i suspect you have a lot more to do if you actually plan on finishing it..
  • pocl supports other than CPU devices - in fact it supports three (NVidia cards via our CUDA backend, certain AMD hardware via HSA runtime, and a virtual device that's used to simulate hardware) so adding another shouldn't be a big issue, if anyone decides to do so..
  • I'd advise you to stay away from clPeak as a benchmark (and ignore all past results), at least until it's fixed. If you look at its OpenCL kernels here: https://github.com/krrishnarraj/clpeak/ ... rc/kernels you'll notice the author uses recursive macros to implement them, resulting in giant kernels with several thousand identical instructions. This causes several issues: 1) it's not even close to a realistic benchmark, 2) certain LLVM optimization passes completely explode on this code (taking forever to compile), 3) it very easily overflows the L1 icache of the CPU/GPU, meaning you'll not be measuring FLOPS but rather how fast your CPU can execute from L2 while L1 is being thrashed. If you look at the uploaded result files, you'll notice the results are all over the place and make no sense - that is a direct consequence.
Author claims he does it to fool the autovectorizer, but i found that isn't a problem, at least with LLVM+pocl. Replacing the recursive macros in clPeak with for loops (keeping the FLOPS per kernel identical), on Arch Linux + llvm 5 + pocl git master on RPi2 i got these results:

Code: Select all

[alarm@alarmpi b]$ /home/alarm/clpeak/b/clpeak --compute-sp

Platform: Portable Computing Language
  Device: pthread-cortex-a7
    Driver version  : 1.1-pre (Linux ARM)
    Compute units   : 4
    Clock frequency : 900 MHz

    Single-precision compute (GFLOPS)
      float   : 0.89
      float2  : 1.78
      float4  : 3.34
      float8  : 3.42
      float16 : 3.47
looks much more realistic than those old pocl results with 0.03 GFlops for all vector sizes, doesn't it ? ;) I haven't yet installed VC4CL, but in case anyone wants to try it, here's the clPeak patch i used (for SP benchmark only): https://pastebin.com/aHYrFage - i suspect VC4CL will also turn out to be much faster.

doe300
Posts: 38
Joined: Thu Dec 29, 2016 1:41 pm

Re: OpenCL on the VideoCore IV!

Sat Feb 17, 2018 10:20 am

mogu wrote:
Fri Feb 16, 2018 11:24 pm
@doe300 very impressive work! though i suspect you have a lot more to do if you actually plan on finishing it..
Thanks. Definitely, especially in the compiler and standard library.
mogu wrote:
Fri Feb 16, 2018 11:24 pm
pocl supports other than CPU devices - in fact it supports three (NVidia cards via our CUDA backend, certain AMD hardware via HSA runtime, and a virtual device that's used to simulate hardware) so adding another shouldn't be a big issue, if anyone decides to do so..
I didn't know that. I suspect that to port pocl to VC4, we would need to write an LLVM back-end for the VideoCore IV architecture, which I tried at first but didn't get anywhere with.
mogu wrote:
Fri Feb 16, 2018 11:24 pm
I'd advise you to stay away from clPeak as a benchmark (and ignore all past results), at least until it's fixed. If you look at its OpenCL kernels here: https://github.com/krrishnarraj/clpeak/ ... rc/kernels you'll notice the author uses recursive macros to implement them, resulting in giant kernels with several thousand identical instructions. This causes several issues: 1) it's not even close to a realistic benchmark, 2) certain LLVM optimization passes completely explode on this code (taking forever to compile), 3) it very easily overflows the L1 icache of the CPU/GPU, meaning you'll not be measuring FLOPS but rather how fast your CPU can execute from L2 while L1 is being thrashed. If you look at the uploaded result files, you'll notice the results are all over the place and make no sense - that is a direct consequence.
Thanks for the insight. Another thing I noticed is that the generated code almost never uses both asymmetric ALUs in a single instruction, since every instruction depends on the previous one (e.g. in the floating-point benchmark). But I don't know that much about benchmarks and, in contrast to some other benchmarking programs I've tested, the output of clpeak is actually useful.
mogu wrote:
Fri Feb 16, 2018 11:24 pm
Author claims he does it to fool the autovectorizer, but i found that isn't a problem, at least with LLVM+pocl.
Can't say much about that, since LLVM does not auto-vectorize the code for VC4CL. This is either because I compile to LLVM IR, so LLVM doesn't know the vectorization preferences of the destination architecture, or because the vectorization is done in the back-ends, which are not executed.
mogu wrote:
Fri Feb 16, 2018 11:24 pm
[...] but in case anyone wants to try it, here's the clPeak patch i used (for SP benchmark only): https://pastebin.com/aHYrFage - i suspect VC4CL will also turn out to be much faster.
Thanks, I will have to test that sometime.

mogu
Posts: 2
Joined: Fri Feb 16, 2018 9:09 pm

Re: OpenCL on the VideoCore IV!

Sat Feb 17, 2018 5:33 pm

doe300 wrote:
Sat Feb 17, 2018 10:20 am
I didn't know that. I suspect that to port pocl to VC4, we would need to write an LLVM back-end for the VideoCore IV architecture, which I tried at first but didn't get anywhere with.
An LLVM backend would make things easier, but i think it is not actually required. What pocl does is roughly: 1) compiles OpenCL C to IR, 2) links the kernel library to it, 3) runs a bunch of optimizations on the IR, 4) hands over the resulting LLVM IR to the device driver. What the driver does with it is entirely up to the driver. The CPU driver compiles it to native code, the CUDA driver converts it to PTX, etc. So as long as the driver can take plain LLVM IR as input, it can be done. What is actually needed, though, is an LLVM triple - i'm guessing the only usable one for VC4 would be SPIR?
or because the vectorization is done in the back-ends, which are not executed.
Well, the loop-vectorizer is an optimization pass, and so is the SLP vectorizer. Both are IR -> IR transforms, they are run before the backends are invoked.
