Hello,
how much performance improvement did you get?
I implemented this, but in direct comparison to glReadPixels I only get a minuscule performance improvement (it is consistently measurable, though, so it does work).
To be fair I previously already optimized the shaders to only spit out a tiny amount of data (1 bit per pixel => 250KB for 1640x1232) so that might be it. It went from 88.18ms/f to 88.03ms/f for 1640x1232 which is a bit underwhelming... and that is the resolution with the most data to shift around.
Any similar experiences? Or is the RPi Zero somehow different?
Ok so yeah I seem to be getting weird results. Just using a VCSM framebuffer seems to have a huge overhead.
My algorithm with only the GPU part has two passes currently, with two framebuffers.
1. Mask at 1x1 full size - actual 1640x1232 (8MB) - VCSM buffer 2048x2048 (16MB)
2. Map at 4x8 binned - actual 205x308 (252KB) - VCSM buffer 256x512 (524KB)
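For anyone wanting to check the sizes above, here is a rough sketch of the arithmetic. It assumes 4 bytes per pixel (RGBA8) and that the VCSM buffer dimensions are rounded up to the next power of two, which is what the 2048x2048 and 256x512 figures suggest. It also assumes the binning divides the width by 8 and the height by 4, since that reproduces 205x308. The helper names are made up for illustration.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def buffer_bytes(width: int, height: int, bytes_per_pixel: int = 4,
                 pad_pow2: bool = False) -> int:
    """Size of a buffer in bytes, optionally padded to power-of-two sides.

    bytes_per_pixel=4 is an assumption (RGBA8), not confirmed by the post.
    """
    if pad_pow2:
        width, height = next_pow2(width), next_pow2(height)
    return width * height * bytes_per_pixel


mask_size = (1640, 1232)
# Assumed binning: width / 8, height / 4, which gives 205x308.
map_size = (mask_size[0] // 8, mask_size[1] // 4)

mask_actual = buffer_bytes(*mask_size)               # ~8 MB
mask_vcsm = buffer_bytes(*mask_size, pad_pow2=True)  # 16 MB
map_actual = buffer_bytes(*map_size)                 # ~252 KB
map_vcsm = buffer_bytes(*map_size, pad_pow2=True)    # ~524 KB

for name, size in [("mask actual", mask_actual), ("mask VCSM", mask_vcsm),
                   ("map actual", map_actual), ("map VCSM", map_vcsm)]:
    print(f"{name}: {size} bytes")
```

Under these assumptions the padded mask buffer is roughly twice the size of the data actually rendered, so any cost that scales with the full buffer would be paid on about double the useful memory.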
Here are the frametimes for several combinations of normal textures vs VCSM buffer - with any kind of CPU activity disabled (glReadPixels or VCSM locking):
Mask Texture + Map Texture: 85.61 ms
Mask Texture + Map VCSM: 85.61 ms
Mask VCSM + Map VCSM: 277.77 ms
Mask VCSM + Map Texture: 277.00 ms
These numbers indicate there is a large overhead to using VCSM buffers instead of normal framebuffers. It is as if the full 2048x2048 buffer were copied to the CPU on every render, maybe as a fallback because the RPi Zero is not supported?
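To put a number on that, here is a small back-of-the-envelope sketch using only the frametimes listed above. It subtracts the all-texture baseline from each combination, and then works out what copy rate the extra time would correspond to if the whole padded 2048x2048 buffer really were copied once per frame. That last part is purely an illustration of the guess above, not a measurement, and the 4 bytes per pixel is an assumption.

```python
# Frametimes in ms, keyed by (mask target, map target), taken from the post.
frame_ms = {
    ("texture", "texture"): 85.61,
    ("texture", "vcsm"): 85.61,
    ("vcsm", "vcsm"): 277.77,
    ("vcsm", "texture"): 277.00,
}

baseline = frame_ms[("texture", "texture")]

# Extra time per frame relative to rendering everything into textures.
overhead_ms = {cfg: t - baseline for cfg, t in frame_ms.items()}

for (mask, map_), extra in overhead_ms.items():
    print(f"mask={mask:8s} map={map_:8s} overhead={extra:7.2f} ms")

# Hypothetical: if the overhead were one full copy of the padded mask buffer
# (2048 * 2048 * 4 bytes, assuming RGBA8), this would be the implied rate.
padded_mask_bytes = 2048 * 2048 * 4
implied_mb_per_s = padded_mask_bytes / (overhead_ms[("vcsm", "texture")] / 1000) / 1e6
print(f"implied copy rate: {implied_mb_per_s:.1f} MB/s")
```

The overhead only shows up when the large mask buffer is VCSM-backed. Switching the small map buffer makes no measurable difference at this precision.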
Edit: Here's very simplified code for program logic and VCSM render target implementation.
Observant readers may notice that 88.03ms for the full algorithm compared to 85.61ms for just the GPU part means my algorithm is heavily GPU-bound. So if I get the VCSM working as expected I will probably merge passes and focus on GPU performance more than readback size.
Another edit with final tests, proving that there is an overhead to VCSM that grows with the buffer size, but more slowly than normal readback.
Data for a single (quite advanced 5x5 kernel) pass at 1640x1232 (again: texture size 8MB, but VCSM buffer size 16MB):
Render into Texture Framebuffer: 72.67 ms
Render into VCSM Framebuffer: 103.20 ms
Render into Texture Framebuffer and read back: 140.45 ms
Render into VCSM Framebuffer and lock: 103.20 ms
This shows:
VCSM only really improves performance over glReadPixels at high resolutions, but it never completely eliminates the readback cost! At 1640x1232 it cuts around 4/7 of the cost compared to glReadPixels (about 30.5 ms of overhead instead of 67.8 ms).
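The breakdown behind that estimate can be sketched as follows, using only the four single-pass timings listed above. The variable names are just for illustration.

```python
# Single-pass frametimes in ms at 1640x1232, taken from the post.
render_texture = 72.67        # render into texture framebuffer
render_vcsm = 103.20          # render into VCSM framebuffer
texture_readback = 140.45     # render into texture + glReadPixels
vcsm_lock = 103.20            # render into VCSM + lock

# Cost of getting the data to the CPU, relative to plain texture rendering.
readpixels_cost = texture_readback - render_texture
vcsm_cost = vcsm_lock - render_texture

# Locking itself adds nothing measurable on top of rendering into VCSM.
lock_cost = vcsm_lock - render_vcsm

saved_ms = readpixels_cost - vcsm_cost
saved_fraction = saved_ms / readpixels_cost

print(f"glReadPixels cost: {readpixels_cost:.2f} ms")
print(f"VCSM cost:         {vcsm_cost:.2f} ms")
print(f"lock cost:         {lock_cost:.2f} ms")
print(f"saved:             {saved_ms:.2f} ms ({saved_fraction:.1%})")
```

So the whole VCSM cost sits in the render itself, not in the lock, and the saving works out to a bit over half of the glReadPixels cost.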