Posts: 1
Joined: Tue Aug 27, 2019 9:31 pm

Fastest way to transfer FBO from GPU to CPU memory

Tue Aug 27, 2019 9:47 pm

I've begun playing around with image processing on my RPi 3 using OpenGL ES. At the end of the image processing, I am left with an off-screen frame buffer. The problem is, I'm not sure what the best method is to transfer the data back to the CPU so that I can access it in a reasonable amount of time. Eventually I want to play around with general-purpose computing, so getting these raw values back (instead of to the screen) would be incredibly useful.

I've done a little research and it seems people are either using glReadPixels() or the KHR_image EGL extension (I'm still a little lost on how the latter works since I haven't found much). Is glReadPixels() sufficient to transfer large amounts of data, or is it a too large of a bottleneck for real-time applications?
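For reference, the baseline glReadPixels path from an off-screen FBO looks roughly like this (a minimal sketch; the FBO creation, dimensions, and error handling are assumed). Note that glReadPixels forces the driver to finish rendering and then copy the pixels, which is exactly the bottleneck being asked about:

```c
#include <GLES2/gl2.h>
#include <stdlib.h>

// Read an off-screen FBO back into CPU memory with glReadPixels.
// Caller owns the returned buffer (width * height * 4 bytes).
unsigned char *read_fbo(GLuint fbo, int width, int height)
{
    unsigned char *pixels = malloc((size_t)width * height * 4);
    if (!pixels)
        return NULL;

    glBindFramebuffer(GL_FRAMEBUFFER, fbo);

    // GL_RGBA / GL_UNSIGNED_BYTE is the combination GLES 2.0 guarantees
    // to be supported for any framebuffer.
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, pixels);

    return pixels;
}
```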


Posts: 29
Joined: Sat Sep 23, 2017 10:43 am

Re: Fastest way to transfer FBO from GPU to CPU memory

Wed Aug 28, 2019 9:00 pm

You would need to generate a VCSM (VideoCore shared memory) buffer. I've done this on the Pi 3 B+ with the Raspbian Stretch OS, so I'm not sure whether it works on Buster. I'm working with a Pi 4 on the new OS and this method didn't seem to work there, so I had to come up with an updated way to do it.

Code: Select all

#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <EGL/eglext_brcm.h>
#include <interface/vcsm/user-vcsm.h>


// initialise the shared memory service before creating the image
vcsm_init();

struct egl_image_brcm_vcsm_info vcsm_info;
vcsm_info.width  = 2048; // must be a power of 2
vcsm_info.height = 1024; // must be a power of 2


// create the EGLImage. the VCSM buffer generated will be an RGBA buffer,
// so width * height * 4 (bytes) is the total memory allocated.
EGLImageKHR egl_image = eglCreateImageKHR(egl_display, EGL_NO_CONTEXT,
                                          EGL_IMAGE_BRCM_VCSM, &vcsm_info, NULL);


// create an opengl texture and bind to it
GLuint texid;
glGenTextures(1, &texid);
glBindTexture(GL_TEXTURE_2D, texid);

// set up mag and min filters for the texture
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);

glEGLImageTargetTexture2DOES(GL_TEXTURE_2D, egl_image); // the vcsm buffer is now an opengl texture


// this will let me read/write the buffer that opengl may have just rendered to
VCSM_CACHE_TYPE_T cache_type;
unsigned char* buffer = (unsigned char*)vcsm_lock_cache(vcsm_info.vcsm_handle,
                                                        VCSM_CACHE_TYPE_HOST,
                                                        &cache_type);

// ... read or change values in the buffer ...

// when done with it, unlock it again
vcsm_unlock_ptr(buffer);
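To make OpenGL actually render into that texture (so the pixels land in the shared-memory buffer), you would also attach it to a framebuffer object. A sketch, assuming the `texid` from above; the glFinish() before locking is my assumption to make sure the GPU has finished writing:

```c
#include <GLES2/gl2.h>

// Attach the VCSM-backed texture as the colour buffer of an FBO.
GLuint fbo;
glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, texid, 0);

if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE) {
    /* handle error */
}

// ... render the scene into the FBO ...

// Make sure rendering has finished before locking the VCSM buffer,
// otherwise the CPU may see a partially written frame.
glFinish();
```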

Posts: 13
Joined: Sun Feb 12, 2017 3:32 am

Re: Fastest way to transfer FBO from GPU to CPU memory

Thu Oct 03, 2019 1:02 am

There is an example of this technique inside "raspistill": there is an operating mode that pipes the camera output to an OpenGL texture and then renders a scene using the texture.

The VCSM_Square example shows how to access the pixels of a scene that has been rendered to an FBO. The VCSM technique creates a shared memory buffer to render the scene into, which allows direct access to the graphics buffer without additional copying. This is as opposed to the standard OpenGL glReadPixels, which is quite slow on a Pi.

https://github.com/raspberrypi/userland ... m_square.c

Posts: 6
Joined: Sun Nov 11, 2018 5:52 pm

Re: Fastest way to transfer FBO from GPU to CPU memory

Wed Jan 29, 2020 1:19 pm

How much performance improvement did you get?
I implemented this, but in direct comparison to glReadPixels I only get a minuscule performance improvement (though it is consistently measurable, so it does work).
To be fair, I had previously already optimised the shaders to only spit out a tiny amount of data (1 bit per pixel => 250KB for 1640x1232), so that might be it. It went from 88.18ms/f to 88.03ms/f for 1640x1232, which is a bit underwhelming... and that is the resolution with the most data to shift around.
Any similar experiences? Or is the RPi Zero somehow different?

Ok so yeah I seem to be getting weird results. Just using a VCSM framebuffer seems to have a huge overhead.
My algorithm with only the GPU part has two passes currently, with two framebuffers.
1. Mask at 1x1 full size - actual 1640x1232 (8MB) - VCSM buffer 2048x2048 (16MB)
2. Map at 4x8 binned - actual 205x308 (252KB) - VCSM buffer 256x512 (524KB)

Here are the frametimes for several combinations of normal textures vs VCSM buffer - with any kind of CPU activity disabled (glReadPixels or VCSM locking):
Mask Texture + Map Texture: 85.61 ms
Mask Texture + Map VCSM: 85.61 ms
Mask VCSM + Map VCSM: 277.77 ms
Mask VCSM + Map Texture: 277.00 ms

These numbers indicate there is a large overhead to using VCSM buffers over normal framebuffers, as if the full 2048x2048 buffer were being copied to the CPU on every render, maybe as a fallback because the RPi Zero is not supported?

Edit: Here's very simplified code for the program logic and the VCSM render-target implementation.
Any observant soul might notice that 88.03ms for the full algorithm, compared to 85.61ms for just the GPU part, means my algorithm is heavily GPU-bound. So if I get the VCSM working as expected, I will probably merge passes and focus on GPU performance more than readback size.

Another edit with final tests, proving that there is an overhead to VCSM that grows with the buffer size, but that grows SLOWER than normal readback.
Data for a single (quite advanced 5x5 kernel) pass at 1640x1232 (again: texture size 8MB, but VCSM buffer size 16MB):
Render into Texture Framebuffer: 72.67 ms
Render into VCSM Framebuffer: 103.20 ms
Render into Texture Framebuffer and read back: 140.45 ms
Render into VCSM Framebuffer and lock: 103.20 ms

This shows:
VCSM only really improves performance over glReadPixels at high resolutions, but it never completely eliminates the readback cost! It can cut around 4/7 of the cost compared to glReadPixels at 1640x1232.

Return to “OpenGLES”