In my Master's thesis I included this comparison of the clpeak results for VC4CL and PoCL (I did not run the PoCL tests myself, the values are taken from here):

On the left is the floating-point test (in GFLOPS), in the middle the memory-bandwidth test and on the right the transfer-bandwidth test (both in MB/s). The filled bars are the median over all results of a test (e.g. for floating point the median of the 1-, 2-, 3-, 4-, 8-, and 16-element vector tests), while the shaded bars give the maximum value (e.g. for the floating-point test the result for the 16-element vector).
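To make the two bar types concrete, here is a minimal sketch of how the median and maximum of one test could be computed. The function name `summarize` and the numbers are made up for illustration only; they are placeholders, not values from the thesis or from clpeak.

```python
from statistics import median


def summarize(results):
    """Return (median, maximum) over the per-vector-width results of one test."""
    values = list(results.values())
    return median(values), max(values)


# Placeholder numbers keyed by vector width, NOT measured results.
example_gflops = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0, 8: 6.0, 16: 8.0}

filled_bar, shaded_bar = summarize(example_gflops)
print(f"median (filled bar): {filled_bar}, maximum (shaded bar): {shaded_bar}")
```

With an even number of vector widths, `statistics.median` averages the two middle values, so the filled bar does not have to match any single measured result.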
So as you can see, VC4CL outperforms PoCL by far on simple arithmetic operations (floating-point test), but suffers from poor memory throughput (memory-bandwidth test).
PoCL tries to run OpenCL on every platform by running it on the CPU, whereas VC4CL runs OpenCL on the GPU of one particular platform. So the two differ greatly.