pocl developer here. I just stumbled upon this thread while randomly browsing the net.
Just a few observations:
- @doe300 very impressive work! Though I suspect you have a lot more to do if you actually plan on finishing it.
- pocl supports devices other than CPUs. In fact it supports three more (NVidia cards via our CUDA backend, certain AMD hardware via the HSA runtime, and a virtual device that's used to simulate hardware), so adding another shouldn't be a big issue if anyone decides to do so.
- I'd advise you to stay away from clPeak as a benchmark (and ignore all past results), at least until it's fixed. If you look at its OpenCL kernels here: https://github.com/krrishnarraj/clpeak/ ... rc/kernels you'll notice the author uses recursive macros to implement them, resulting in giant kernels with several thousand identical instructions. This causes several issues: 1) it's not even close to a realistic benchmark, 2) certain LLVM optimization passes completely explode on this code (taking forever to compile), 3) it very easily overflows the L1 icache of the CPU/GPU, meaning you're not measuring FLOPS but rather how fast your CPU can execute from L2 while L1 is being thrashed. If you look at the uploaded result files, you'll notice the results are all over the place and make no sense; that is a direct consequence.
The author claims he does it to fool the autovectorizer, but I found that isn't a problem, at least with LLVM + pocl. After replacing the recursive macros in clPeak with for loops (keeping the FLOPS per kernel identical), on Arch Linux + LLVM 5 + pocl git master on an RPi2 I got these results:
[alarm@alarmpi b]$ /home/alarm/clpeak/b/clpeak --compute-sp
Platform: Portable Computing Language
Driver version : 1.1-pre (Linux ARM)
Compute units : 4
Clock frequency : 900 MHz
Single-precision compute (GFLOPS)
float : 0.89
float2 : 1.78
float4 : 3.34
float8 : 3.42
float16 : 3.47
That looks much more realistic than those old pocl results with 0.03 GFLOPS for all vector sizes, doesn't it?
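The macro-to-loop replacement described above can be sketched roughly as follows. This is plain C for illustration only, not the actual clPeak kernel source or the patch: `mad1` is a hypothetical stand-in for OpenCL's `mad()`, the `mad_calls` counter exists only to show that both variants perform the same number of multiply-adds, and the macro and function names are assumptions.

```c
#include <assert.h>

/* Hypothetical stand-in for OpenCL's mad(a, b, c) = a * b + c.
   The counter is only here so the two variants can be compared. */
static unsigned long mad_calls = 0;

static float mad1(float a, float b, float c) {
    ++mad_calls;
    return a * b + c;
}

/* Nested macros: each level pastes four copies of the level below,
   so MAD_64 expands to 64 back-to-back mad statements. */
#define MAD_4(x, y)  x = mad1(y, x, y); y = mad1(x, y, x); \
                     x = mad1(y, x, y); y = mad1(x, y, x);
#define MAD_16(x, y) MAD_4(x, y) MAD_4(x, y) MAD_4(x, y) MAD_4(x, y)
#define MAD_64(x, y) MAD_16(x, y) MAD_16(x, y) MAD_16(x, y) MAD_16(x, y)

/* Variant A: fully expanded straight-line code (the macro style). */
static float compute_unrolled(float x, float y) {
    MAD_64(x, y)
    return y;
}

/* Variant B: the same number of mads, but the body is only four
   statements and the compiler decides how far to unroll. */
static float compute_looped(float x, float y, int n_mads) {
    assert(n_mads % 4 == 0);  /* each iteration performs 4 mads */
    for (int i = 0; i < n_mads / 4; ++i) {
        MAD_4(x, y)
    }
    return y;
}
```

Calling `compute_unrolled(x, y)` and `compute_looped(x, y, 64)` does the same arithmetic in the same order; the difference is only in how much code the compiler and the instruction cache have to deal with. A real kernel would nest the macros several levels deeper, which is where the code size grows to thousands of instructions.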
I haven't installed VC4CL yet, but in case anyone wants to try it, here's the clPeak patch I used (for the SP benchmark only): https://pastebin.com/aHYrFage
- I suspect VC4CL will also turn out to be much faster.