tmoe
Posts: 11
Joined: Thu Sep 20, 2018 1:44 pm

hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Thu Sep 20, 2018 2:08 pm

Hello everyone,

One project I'm working on uses a pipeline like this:
receive udp stream -> decode h264 (GPU) -> render an overlay on top of the video -> encode h264 (GPU) -> send it to another udp socket

The first thing I tried was to render the overlay using ffmpeg. Encoding and decoding in ffmpeg with the hardware encoders/decoders was fast until I brought the overlay in, which slowed the whole thing down to an unusable framerate. The next thing I tried was a custom application using MMAL, which I gave up on after not being able to debug the VideoCore error messages.

Then I tried to do the task using the OpenMAX API. As I understand it, you can't render overlays using the OpenMAX API. What you can do is decode and render the video to the HDMI output and draw an overlay using OpenGL. But this leaves me with the remaining task of encoding the current HDMI output to h264 again.

I found numerous posts in this forum from people asking whether it is possible to feed the video_encoder OpenMAX component with frame data from the GPU (an OpenGL buffer, texture, or whatever). The answer was always "not right now", and the workaround involved using glReadPixels or mmap to get the framebuffer and feeding that buffer into the video_encoder afterwards. Those workarounds are all very inefficient. The most recent responses I could find were from 2013.

Has there been any progress on this issue since 2013? Is it possible to feed the video encoder the current frame data from the GPU without copying the data back to the CPU first (using glReadPixels)?

Or do you have an alternative idea on how to implement my initial pipeline in a performant way?

Thanks in advance. Regards

Timo

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 5944
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Thu Sep 20, 2018 3:58 pm

Do you really want GL, or just to overlay some random image?

For encoding GL see https://www.raspberrypi.org/forums/view ... 8&t=222930. It appears that 99% of the required work is there.

If a random overlay then hold fire a couple of weeks. I'm finishing off a MMAL(*) component that wraps the HVS offscreen composition.
Feed it from video_decode and with up to 4 overlays (position, alpha, etc, all configurable). It'll render them to an output buffer that can then be consumed by video_encode. No buffers (other than the overlay images, although those could come from say the JPEG decoder) ever need to touch the ARM memory.
Its primary role is for subtitle overlays in VLC, but it should work perfectly well in your case too.

(*) in theory IL as well, but I hate IL with a passion, so it's a lower priority.
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
Please don't send PMs asking for support - use the forum.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

tmoe
Posts: 11
Joined: Thu Sep 20, 2018 1:44 pm

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Fri Sep 21, 2018 9:42 am

I just want to display a random overlay. Your MMAL Component would probably be enough.
But I'm really having trouble with MMAL. I can't even get the basic examples to work.

When I try to run
https://github.com/raspberrypi/userland ... _basic_2.c (with the framerate changed to 25fps)

with a raw h264 file (downscaled version of https://github.com/raspberrypi/userland ... /test.h264 ):
Input #0, h264, from 'test.h264_2':
Duration: N/A, bitrate: N/A
Stream #0:0: Video: h264 (High), yuv420p(tv, bt709, progressive), 1280x720, 25 fps, 25 tbr, 1200k tbn, 50 tbc


I get:
mmal: mmal_port_format_commit: vc.ril.video_decode(2:0) port 0x1a7f040 format 3:H264
mmal: mmal_vc_port_info_set: set port info (2:0)
mmal: mmal_vc_port_info_set: failed to set port info (2:0): ENOSPC
mmal: mmal_vc_port_set_format: mmal_vc_port_info_set failed 0x1a7f040 (ENOSPC)
failed to commit format


and the vc log shows:

2663578.006: mmal: mmal_ril_port_parameter_get: Parameter id 0x00000007 not recognised
2663579.470: mmal: mmal_ril_port_set_format: ril.video_decode:in:0: failed to set codec config (6)
2663579.555: mmalsrv: mmal_server_do_port_info_set: ril.video_decode:in:0(H264): failed (handle 1, status 2)


I'm using the newest firmware
Sep 10 2018 17:26:38
Copyright (c) 2012 Broadcom
version 46671264d2e10a76c8071b7b084a5409b1afc2cc (clean) (release)



Any ideas?

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 5944
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Fri Sep 21, 2018 10:26 am

My blunder in https://github.com/Hexxeh/rpi-firmware/ ... c109a6a7ad with passing in codec config out of band (which isn't actually needed by the codec as you can pass it via buffers quite happily too).

I'll get a fix merged as soon as poss, but in the meantime you can revert to a5b781c or earlier and shouldn't have that problem.

tmoe
Posts: 11
Joined: Thu Sep 20, 2018 1:44 pm

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Fri Sep 21, 2018 11:57 am

Ok. Reverting to a5b781c fixes that error.

But running a modified version of the graph example
https://github.com/raspberrypi/userland ... le_graph.c

gives me another cryptic error:
2796494.900: mmalsrv: mmal_server_do_buffer_from_host: ril mem ril.video_decod-2:failed 3

Source + Makefile and full error output here:
https://gist.github.com/t-moe/7a9dc07ab ... 7d4d083604

Do you mind taking a quick look?
Thanks

tmoe
Posts: 11
Joined: Thu Sep 20, 2018 1:44 pm

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Fri Sep 21, 2018 1:33 pm

I think I found the answer to this one myself: Setting zero-copy on either the decoder input or output will cause the error.
This is somewhat strange, since zero copy worked on the example2 discussed above...

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 5944
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Fri Sep 21, 2018 1:56 pm

tmoe wrote:
Fri Sep 21, 2018 1:33 pm
I think I found the answer to this one myself: Setting zero-copy on either the decoder input or output will cause the error.
This is somewhat strange, since zero copy worked on the example2 discussed above...
Sorry, mmal_graph is a wrapper I've never used so don't know what is likely to be going on.
I suspect it is because mmal_graph is using mmal_pool_create, whilst to get zero_copy working correctly requires the use of mmal_port_pool_create.

tmoe
Posts: 11
Joined: Thu Sep 20, 2018 1:44 pm

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Fri Sep 21, 2018 3:38 pm

In contrast to the renderer, I can't get the encoder to work.
The format commit fails again. Any ideas?

mmal: mmal_graph_new_connection: graph: 0x1d11630, out: vc.ril.video_decode:out:0(I420)(0x1d124e0), in: vc.ril.video_encode:in:0(0x1d14a10), flags 0, connection: (nil)
mmal: mmal_connection_create: out 0x1d124e0, in 0x1d14a10, flags 0, vc.ril.video_decode:out:0/vc.ril.video_encode:in:0
mmal: mmal_port_format_commit: vc.ril.video_encode(2:0) port 0x1d14a10 format 3:I420
mmal: mmal_vc_port_info_set: set port info (2:0)
mmal: mmal_vc_port_info_set: failed to set port info (2:0): EINVAL
mmal: mmal_vc_port_set_format: mmal_vc_port_info_set failed 0x1d14a10 (EINVAL)
mmal: mmal_connection_create: format not set on input port

encoder input format: fourcc: I420, variant; width: 1280, height: 720, (0,0,1280,720)
failed to connect decoder to encoder

3269084.646: mmal: mmal_ril_set_port_settings: ril.video_encode:in:0: failed to set port definition (4) buffers 1/1/15360
3269084.730: mmal: mmal_ril_set_port_settings: ril.video_encode:in:0: fail: color format 0x14, encoding 0x0 stride 1280
3269084.820: mmalsrv: mmal_server_do_port_info_set: ril.video_encode:in:0(I420): failed (handle 1, status 3)


I simply connected the decoder output to the encoder input.
If I connect the same decoder output to the renderer, everything is fine.

Any ideas why the format commit fails?
Thanks for your support!

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 5944
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Fri Sep 21, 2018 4:38 pm

Again, a simple example app that is showing what you are doing is far easier than trying to guess what you're doing.

tmoe
Posts: 11
Joined: Thu Sep 20, 2018 1:44 pm

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Mon Sep 24, 2018 8:42 am

Sure. Here is the source: https://github.com/t-moe/rpi_mmal_examp ... e_encode.c

Basically it's the same as the previous example: Connect decoder to renderer. Only instead of using a renderer I use the encoder component.

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 5944
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Mon Sep 24, 2018 10:35 am

Thank you - that makes it a few minutes' work to tell you what is going on.

The encoder opens the codec itself when you enable the output port. You've done that before you enabled the input port, therefore it has already opened the codec and produced the header bytes with the input port format defined at that point.

There is a problem though as video_decode updates the port format once it has decoded the stream headers, and mmal_graph tries to copy that across to video_encode and again fails as the output is already enabled. You're only connecting two ports, so you may as well use a MMAL_CONNECTION_T instead.

The follow-up problem is that you don't allocate an output pool, so you never get any data back.

All three steps are done on https://github.com/6by9/rpi_mmal_examples and we get some encoded data back.

tmoe
Posts: 11
Joined: Thu Sep 20, 2018 1:44 pm

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Tue Sep 25, 2018 10:19 am

Thanks a lot for the modifications.
I cleaned up the example a bit, and exported the encoded frames to a file. This seems to work. https://github.com/t-moe/rpi_mmal_examp ... e_encode.c

Yet there are some errors in the log (your version of the code produced the same errors). Do I need to be worried or can I just ignore them?

2376932.317: mmal: mmal_vll_load: could not load VLL 'videnc.vll':
2377021.055: mmal: mmal_ril_set_port_settings: ril.video_encode:in:0(I420): failed to set port definition (4) buffers 1/1/1382400
2377021.144: mmal: mmal_ril_set_port_settings: ril.video_encode:in:0(I420): fail: color format 0x14, encoding 0x0 stride 1280
2377021.212: mmal: mmal_ril_port_send: format not set on port 3f4f77e0, ril.video_encode:in:0(I420)



The next thing I'm trying to do is to draw some overlay on the decoded frames (on the CPU) before re-encoding them, just to test the performance.
Idea: Take the previous example, manually connect the encoder and decoder and modify the buffer before passing it to the encoder.
https://github.com/t-moe/rpi_mmal_examp ... ode.c#L205

Problem: If I remove the `MMAL_CONNECTION_FLAG_TUNNELLING` flag from the connection (as a first step towards the end goal), I get:
encoded frame 0 (flags 1024, length 27)
mmal: mmal_port_event_get: vc.ril.video_decode(3:0) port 0x164c4c0, event EFCH
mmal: mmal_connection_bh_out_cb: (vc.ril.video_decode:out:0(I420))0x164c4c0,0x164cd78,0x164d500,96

and the application hangs.

The vc log says:
686371.567: mmal: mmal_vll_load: could not load VLL 'videnc.vll':
688385.024: mmalsrv: send_buffer_to_host: tx failed:size 292 st -1
688387.496: mmalsrv: send_buffer_to_host: tx failed:size 292 st -1


EFCH is the format change event, right?

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 5944
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Tue Sep 25, 2018 11:41 am

The mmal_ril_set_port_settings call is failing because the EFCH is being processed and tries to update the format on the encoder input port, but again the output port is already active.

EFCH is format changed, and the MMAL connection is handling all that for you at present.
mmal_connection code is all in the userland repo - https://github.com/raspberrypi/userland ... nnection.c
I thought that you could just drop the MMAL_CONNECTION_FLAG_TUNNELLING flag, but from reading the source that doesn't actually forward the buffers across.
...
Ah, I'd thought MMAL_CONNECTION_FLAG_TUNNELLING was only valid on the GPU, but it should work between any two components. You learn something new every day. Without that flag the client has to do the buffer management.
The code handling the format changed is at https://github.com/raspberrypi/userland ... rt.c#L1310. Duplicating that or something very similar in your client should solve your problem, although you will want to disable the encoder output port as well to reconfigure the encoder. (You could just defer enabling the encoder at all until you got the EFCH).

tmoe
Posts: 11
Joined: Thu Sep 20, 2018 1:44 pm

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Mon Oct 01, 2018 12:10 pm

Thanks for your response 6by9.

I followed your advice and implemented an example where I draw an overlay on the CPU. In that example I only enable the encoder and its ports after the format change event has occurred. This works perfectly. Thanks a lot.

https://github.com/t-moe/rpi_mmal_examp ... y_encode.c


The performance is however not so great and it would be nice to have an MMAL component for the overlay stuff.

Will your MMAL component allow a use case like mine, e.g. where I draw a few pixels in a certain area of the screen? I could imagine creating a transparent image and feeding it as input to the MMAL overlay component together with the coordinates. Can you share some details about the component you're working on?

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 5944
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Mon Oct 01, 2018 2:02 pm

Take it at face value as a beta - https://drive.google.com/file/d/1asP0As ... sp=sharing

New component is "vc.ril.hvs". 5 input ports, 1 output port, only supporting RGB (various formats).
input[0] is the master port. input[1] to input[4] are for the overlays. All ports support MMAL_PARAMETER_DISPLAYREGION to set layer, source rectangle, and destination rectangle (the fullscreen flag is not guaranteed to work at the moment, so please set the dest rect.)

If the overlay port buffers include timestamps, then it will wait for that timestamp to be reached on input[0] before updating to that buffer. NB If you submit a pts a long way in the future then you will stall that input port as it is NOT sorting them at present (I intend to add that).
If the overlay port buffers have timestamps of 0, then they will replace the current overlay on the next frame through input[0].
If the overlay port buffer has a length of 0, then the overlay is removed from the pipe.
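
The three overlay buffer rules above can be summarised as a small decision function. This is a toy model in plain C, not firmware or MMAL code: `overlay_action_t` and `overlay_decide` are hypothetical names, "reached" is assumed to mean that the input[0] timestamp is greater than or equal to the overlay timestamp, and an empty buffer is assumed to honour its timestamp as well (the test app mentioned in this post feeds timestamped empty buffers, but the post does not state this explicitly).

```c
#include <assert.h>
#include <stdint.h>

/* What the model does with an overlay buffer on a given master frame. */
typedef enum {
    OVERLAY_WAIT,    /* timestamp not yet reached on input[0] */
    OVERLAY_APPLY,   /* buffer becomes the current overlay */
    OVERLAY_REMOVE   /* zero-length buffer: overlay is removed */
} overlay_action_t;

/* Toy model of the rules described in the post (assumptions in the lead-in).
 * overlay_pts == 0 means "apply on the next frame through input[0]". */
static overlay_action_t overlay_decide(int64_t overlay_pts,
                                       uint32_t overlay_length,
                                       int64_t master_pts) {
    if (overlay_pts != 0 && master_pts < overlay_pts)
        return OVERLAY_WAIT;      /* assumption: empty buffers wait too */
    if (overlay_length == 0)
        return OVERLAY_REMOVE;
    return OVERLAY_APPLY;
}
```

For example, a non-empty buffer with a timestamp of 0 is applied straight away, while one stamped two seconds ahead of input[0] is held back, which is also why a timestamp far in the future stalls that port.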

Hacked version of RaspiStill in https://github.com/6by9/userland/tree/hvs_test which just inserts the component between camera and video_render and feeds in 10 buffers with PTS values incrementing by a second each time, and buffers 2 & 7 being empty. "raspistill -t 10000" will give the idea.

I have not profiled this at all. The fact it can only write out RGB is a little annoying, but that is a restriction of the hardware. video_encode will take that and again use hardware to convert appropriately, so I'd expect you to still be able to get 1080P30.

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 5944
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Mon Oct 01, 2018 2:43 pm

Most of the extra time is in your draw overlay function.

Nasty hacked together timing between frames:

@@ -252,8 +252,9 @@ int main(int argc, char* argv[]) {
     MMAL_COMPONENT_T *decoder = NULL, *encoder=NULL;
     MMAL_POOL_T *decoder_pool_in = NULL, *encoder_pool_out = NULL;
     MMAL_ES_FORMAT_T * format_in=NULL;
-    MMAL_BOOL_T eos_sent = MMAL_FALSE, eos_received;
+    MMAL_BOOL_T eos_sent = MMAL_FALSE, eos_received = 0;
     MMAL_BUFFER_HEADER_T *buffer;
+    int64_t ts = 0;
 
 
     bcm_host_init();
@@ -361,7 +362,7 @@ int main(int argc, char* argv[]) {
         vcos_semaphore_wait(&context.semaphore);
 
         /* Send data to decode to the input port of the video decoder */
-        if ((buffer = mmal_queue_get(decoder_pool_in->queue)) != NULL) //Get empty buffers from queue
+        while ((buffer = mmal_queue_get(decoder_pool_in->queue)) != NULL) //Get empty buffers from queue
         {
             SOURCE_READ_DATA_INTO_BUFFER(buffer);
             if(!buffer->length) eos_sent = MMAL_TRUE;
@@ -389,8 +390,14 @@ int main(int argc, char* argv[]) {
             }
             else
             {
+                int64_t new_ts;
+                int64_t delta;
+                mmal_port_parameter_get_int64(encoder->output[0], MMAL_PARAMETER_SYSTEM_TIME, &new_ts);
+                delta = new_ts - ts;
+
                 DEST_WRITE_DATA_INTO_FILE(buffer->data, buffer->length);
-                fprintf(stderr, "encoded frame %u (flags %x, length %u)\n",framenr++, buffer->flags, buffer->length);
+                fprintf(stderr, "encoded frame %u (flags %x, length %u) - delta %lld %lld %lld\n",framenr++, buffer->flags, buffer->length, delta, new_ts, ts);
+                ts = new_ts;
             }
             mmal_buffer_header_release(buffer);
         }
(Yes, I have initialised eos_received for you, otherwise the loop can abort immediately).
I'm getting 18-20ms without calling draw_overlay, and 33-38ms with.

TBH your loop is hideously inefficient with loads of extraneous comparisons. You're doing 1280*720 = 921600 comparisons when you only want 100*100 = 10000 positive outcomes, so around 1%.

static void draw_overlay(MMAL_BUFFER_HEADER_T *frame) {
    static const int width =1280;
    for (int y = framenr+200; y < framenr+300; y++) {
           for (int x= framenr+100; x< framenr+200; x++) {
               if((((x >> 1) << 1) % 4) == 0 && (((y >> 1) << 1) % 4) == 0) {
                   frame->data[y*width+x] = 0;
               } else {
                   frame->data[y*width+x] = 255;
               }
           }
    }
}
has the same effect, but without the extraneous conditions being checked. I'm getting a fair amount of variation there, but around 23ms.
Coding efficiency has a HUGE impact when you're looking at pixels in large images at many frames a second.
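
The comparison count quoted above can be reproduced with a small standalone sketch. It assumes the original loop visited every pixel of the 1280x720 luma plane and tested whether it lay inside the box (the original function is not shown in this thread), and that the stride equals the width, as in the "stride 1280" log line. `i420_size`, `fill_box_full_scan` and `fill_box_bounded` are hypothetical names, and the box position is fixed rather than moving with framenr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { WIDTH = 1280, HEIGHT = 720 };

/* Bytes in an I420 frame: full-size Y plane plus two quarter-size chroma planes. */
static size_t i420_size(int width, int height) {
    return (size_t)width * height * 3 / 2;
}

/* Visits every luma pixel and tests it against the box.
 * Returns the number of pixels examined. */
static long fill_box_full_scan(uint8_t *luma, int x0, int y0, int size, uint8_t value) {
    long tests = 0;
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH; x++) {
            tests++;
            if (x >= x0 && x < x0 + size && y >= y0 && y < y0 + size)
                luma[(size_t)y * WIDTH + x] = value;
        }
    }
    return tests;
}

/* Visits only the pixels inside the box.
 * Returns the number of pixels written. */
static long fill_box_bounded(uint8_t *luma, int x0, int y0, int size, uint8_t value) {
    assert(x0 >= 0 && y0 >= 0 && x0 + size <= WIDTH && y0 + size <= HEIGHT);
    long writes = 0;
    for (int y = y0; y < y0 + size; y++) {
        for (int x = x0; x < x0 + size; x++) {
            luma[(size_t)y * WIDTH + x] = value;
            writes++;
        }
    }
    return writes;
}
```

For a 100x100 box the full scan examines 921600 pixels and the bounded version touches 10000, with identical results in the buffer. i420_size(1280, 720) gives 1382400, which matches the buffer size in the encoder log earlier in the thread.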

dickon
Posts: 233
Joined: Sun Dec 09, 2012 3:54 pm
Location: Home, just outside Reading

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Mon Oct 01, 2018 3:42 pm

if((((x >> 1) << 1) % 4) == 0 && (((y >> 1) << 1) % 4) == 0)
is a *really* entertainingly nasty way to write

if (!((x & 2) || (y & 2)))
too... OK, a proper compiler should optimise both to

tst	rx, #2
tsteq	ry, #2
bne	...
but it's the principle of the thing...
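
A quick way to convince yourself that the two conditions above agree is to compare them over a grid of coordinates. This is a minimal sketch for non-negative coordinates only; `cond_original`, `cond_bitmask` and `count_mismatches` are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>

/* The condition as written in the original draw_overlay loop. */
static bool cond_original(int x, int y) {
    return (((x >> 1) << 1) % 4) == 0 && (((y >> 1) << 1) % 4) == 0;
}

/* The shorter form: true when bit 1 is clear in both coordinates. */
static bool cond_bitmask(int x, int y) {
    return !((x & 2) || (y & 2));
}

/* Counts the (x, y) pairs in [0, limit) x [0, limit) where the two forms disagree. */
static int count_mismatches(int limit) {
    assert(limit >= 0);
    int mismatches = 0;
    for (int y = 0; y < limit; y++)
        for (int x = 0; x < limit; x++)
            if (cond_original(x, y) != cond_bitmask(x, y))
                mismatches++;
    return mismatches;
}
```

Clearing bit 0 and then taking the value modulo 4 leaves only bit 1 to decide the outcome, so both forms test the same single bit per coordinate and count_mismatches reports no disagreements.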

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 5944
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Mon Oct 01, 2018 3:50 pm

I hadn't looked at that one. Low hanging fruit and all that.

A proper compiler would optimise it, except for the -O0 at https://github.com/t-moe/rpi_mmal_examp ... akefile#L9 :(

You are quite correct with your version, and indeed running it is taking about 18ms between frames (although a fair variation from 13ms to 26ms) so comparable to having no overlay. I guess I should really time the whole run rather than individual deltas, but this isn't my code.
It won't help having the fwrite call in the same context as all the buffer handling as that will get blocked (potentially for quite some time) at the point the OS finally decides to flush caches.

tmoe
Posts: 11
Joined: Thu Sep 20, 2018 1:44 pm

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Tue Oct 02, 2018 11:24 am

Thanks for the remarks regarding my draw_overlay :D
I wrote the loop start/stop conditions that way on purpose because it's not clear how many of the pixels we'll be accessing in the final application. My goal was to find out if I can get 25 fps, even when we have a rather intensive overlay calculation on the CPU.

Regarding the condition:
dickon wrote:

if((((x >> 1) << 1) % 4) == 0 && (((y >> 1) << 1) % 4) == 0)
Good catch. I really don't know why I wrote it in such a complicated way.

I'm using a QT application with udp sockets for input/output for my local testing and the example on github is just a "minimal complete verifiable example" to use as basis for the discussion.

To summarize my findings:
Yes, it's possible to achieve 25+ FPS with a cpu based draw_overlay.
But around 20 ms (or 18 if you want) are spent by the CPU per frame, and you won't be able to reduce that further if you have to copy whole frames back and forth between GPU and CPU for the overlay rendering.

I'll try your new MMAL component to check if I can get the CPU time down while maintaining the same overlay calculation. (At least this would give me back the time it takes to copy the decoded frame to the CPU, right?)

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 5944
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Tue Oct 02, 2018 12:32 pm

tmoe wrote:
Tue Oct 02, 2018 11:24 am
Thanks for the remarks regarding my draw_overlay :D
I wrote the loop start/stop conditions that way on purpose because it's not clear how many of the pixels we'll be accessing in the final application. My goal was to find out if I can get 25 fps, even when we have a rather intensive overlay calculation on the CPU.
I'd query whether you need to recompute and update every frame, or prepare the overlay in another thread and just blit it in the callback.
tmoe wrote:I'm using a QT application with udp sockets for input/output for my local testing and the example on github is just a "minimal complete verifiable example" to use as basis for the discussion.

To summarize my findings:
Yes, it's possible to achieve 25+ FPS with a cpu based draw_overlay.
But around 20 ms (or 18 if you want) are spent by the CPU per frame, and you won't be able to reduce that further if you have to copy whole frames back and forth between GPU and CPU for the overlay rendering.
Copying frame buffers? QT normally renders via the frame buffer so there is no copying required there. And zero copy can be used in MMAL.
Or are you copying the resulting buffers into the QT window on the ARM? Yes, that hurts performance.
tmoe wrote:I'll try your new MMAL component to check if I can get the CPU time down while maintaining the same overlay calculation. (At least this would give me back the time it takes to copy the decoded frame to the CPU, right?)
As above, I don't quite follow where you are copying frames around at the moment.
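
The suggestion above to prepare the overlay elsewhere and just blit it in the callback could look roughly like this. It is a sketch in plain C with no MMAL calls: `overlay_t` and `blit_luma` are hypothetical names, the overlay is assumed to be an opaque 8-bit luma patch, and the frame stride is assumed to equal the frame width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical overlay: a luma patch prepared outside the per-frame path. */
typedef struct {
    int width;
    int height;
    const uint8_t *pixels;  /* width * height bytes, row-major */
} overlay_t;

/* Copies the overlay into the Y plane at (dst_x, dst_y), one memcpy per row.
 * Returns 0 on success, -1 if the overlay would not fit inside the frame. */
static int blit_luma(uint8_t *luma, int frame_w, int frame_h,
                     const overlay_t *ov, int dst_x, int dst_y) {
    assert(luma != NULL && ov != NULL && ov->pixels != NULL);
    if (dst_x < 0 || dst_y < 0 ||
        dst_x + ov->width > frame_w || dst_y + ov->height > frame_h)
        return -1;
    for (int row = 0; row < ov->height; row++) {
        memcpy(luma + (size_t)(dst_y + row) * frame_w + dst_x,
               ov->pixels + (size_t)row * ov->width,
               (size_t)ov->width);
    }
    return 0;
}
```

The per-frame cost is then one memcpy per overlay row, regardless of how expensive the overlay was to compute. Handing a freshly computed overlay over from another thread would need some form of locking or double buffering, which is left out here.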

tmoe
Posts: 11
Joined: Thu Sep 20, 2018 1:44 pm

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Tue Oct 02, 2018 12:50 pm

6by9 wrote: I'd query whether you need to recompute and update every frame, or prepare the overlay in another thread and just blit it in the callback.
I don't know at the moment if we need to recompute and update every frame or just have an external source for the overlay which does not depend on the current frame-data. Although I guess the latter.
6by9 wrote: Copying frame buffers? QT normally renders via the frame buffer so there is no copying required there. And zero copy can be used in MMAL.
Or are you copying the resulting buffers into the QT window on the ARM? Yes, that hurts performance.
Forget about the qt thingy for a moment. I'm referring to the manual_decode_overlay_encode example.
By copying frames around I mean: The decoded frame from the decoder is copied into CPU userspace memory, so that I can draw my overlay onto it before sending it back to the encoder. As far as I understand it, I could get rid of that step by using your component where I only would have to send my overlay to your new mmal component.

Zero copy cannot be used here. If I use zero copy on the decoder output port, buffer->data lies in memory that is not accessible to my userspace app.

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 5944
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Tue Oct 02, 2018 1:22 pm

tmoe wrote:
Tue Oct 02, 2018 12:50 pm
6by9 wrote: Copying frame buffers? QT normally renders via the frame buffer so there is no copying required there. And zero copy can be used in MMAL.
Or are you copying the resulting buffers into the QT window on the ARM? Yes, that hurts performance.
Forget about the qt thingy for a moment. I'm referring to the manual_decode_overlay_encode example.
By copying frames around I mean: The decoded frame from the decoder is copied into CPU userspace memory, so that I can draw my overlay onto it before sending it back to the encoder. As far as I understand it, I could get rid of that step by using your component where I only would have to send my overlay to your new mmal component.

Zero copy cannot be used here. If I use zero copy on the decoder output port, buffer->data lies in memory that is not accessible to my userspace app.
Er, no. If you use MMAL_ENCODING_OPAQUE then that is the case as you only get a handle to the native image buffers, not a pointer to the buffer.
Zero copy uses vcsm to allocate the buffer and map it into the ARM MMU.

Uncomment the lines in your manual_decode_overlay_encode app where you say "If you dont want to modfiy the decoder data, you can also enable it on the decoder output and encoder input" and you'll find it still works quite happily. You end up with a cache flush or invalidate as the buffer traverses between GPU and CPU, but no copying. The delta is down to between 8 and 16ms with that (I hadn't noticed you hadn't done that before). Measured over the whole run I get 88fps (373 frames in 4.2secs), although I do have a modest overclock on my system.
Using the same benchmarking I get 65fps without zero copy (5.7secs for 373 frames).
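
The whole-run figures above are simple arithmetic, shown here as a small C sketch with hypothetical helper names (`fps_from_run`, `ms_per_frame`). It only restates the numbers quoted in this post.

```c
#include <assert.h>

/* Average frames per second over a whole run. */
static double fps_from_run(int frames, double seconds) {
    assert(seconds > 0.0);
    return frames / seconds;
}

/* Average time spent per frame, in milliseconds. */
static double ms_per_frame(int frames, double seconds) {
    assert(frames > 0);
    return 1000.0 * seconds / frames;
}
```

With zero copy, 373 frames in 4.2 seconds is about 88.8 fps, or roughly 11.3 ms per frame; without it, 373 frames in 5.7 seconds is about 65.4 fps, or roughly 15.3 ms per frame. Both are well inside the 40 ms per frame that a 25 fps stream allows.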

tmoe
Posts: 11
Joined: Thu Sep 20, 2018 1:44 pm

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Tue Oct 02, 2018 1:51 pm

6by9 wrote: Er, no. If you use MMAL_ENCODING_OPAQUE then that is the case as you only get a handle to the native image buffers, not a pointer to the buffer.
Zero copy uses vcsm to allocate the buffer and map it into the ARM MMU.
You're right. Actually I never really tried setting zero copy AND writing to the buffer.
I recall using gdb to debug my initial buffer problems (format change events and so on) and there I was unable to access the buffer if zero copy was set. See this gdb output (zero copy set )

(gdb) b draw_overlay(MMAL_BUFFER_HEADER_T*)
Breakpoint 1 at 0x12cf0: file performance.cpp, line 139.

Thread 6 "vc.ril.video_de" hit Breakpoint 1, draw_overlay (frame=0x3bda0) at performance.cpp:139
139 int fn = framenr % 500;
(gdb) p frame->data
$1 = (uint8_t *) 0x70dae000 <error: Cannot access memory at address 0x70dae000>
(gdb) x /100x frame->data
0x70dae000: Cannot access memory at address 0x70dae000


Based on this I kicked out the zero-copying, because I thought it would result in a segfault anyway when I later would try to write to the location... Turns out I was wrong and writing to the location works perfectly fine. Thanks for that.

With this change the example runs with a pretty acceptable framerate. I think I'm going to try your component anyway, as soon as the real project starts here. For now we just decided that the raspberry pi is a suitable platform for what we're trying to achieve.

Thanks for your time and help. If it weren't for you, we would probably have switched to another platform by now. The poor mmal/gpu documentation and the non-existent (?) commercial support is just a killer... I hope my example project on github can help others on that matter....

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 5944
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Tue Oct 02, 2018 3:24 pm

tmoe wrote:
Tue Oct 02, 2018 1:51 pm
You're right. Actually I never really tried setting zero copy AND writing to the buffer.
I recall using gdb to debug my initial buffer problems (format change events and so on) and there I was unable to access the buffer if zero copy was set. See this gdb output (zero copy set )

(gdb) b draw_overlay(MMAL_BUFFER_HEADER_T*)
Breakpoint 1 at 0x12cf0: file performance.cpp, line 139.

Thread 6 "vc.ril.video_de" hit Breakpoint 1, draw_overlay (frame=0x3bda0) at performance.cpp:139
139 int fn = framenr % 500;
(gdb) p frame->data
$1 = (uint8_t *) 0x70dae000 <error: Cannot access memory at address 0x70dae000>
(gdb) x /100x frame->data
0x70dae000: Cannot access memory at address 0x70dae000


Based on this I kicked out the zero-copying, because I thought it would result in a segfault anyway when I later would try to write to the location... Turns out I was wrong and writing to the location works perfectly fine. Thanks for that.
Because the buffer has been mapped by vcsm via a slightly odd mechanism, gdb can't see into it for some reason. It's a little annoying at times, but not sufficiently to dig into it. There is a rewrite of vc-sm in the pipeline which may overcome that as it'll be mapping kernel CMA memory through more normal routes than the current vcsm does.
tmoe wrote:With this change the example runs with a pretty acceptable framerate. I think I'm going to try your component anyway, as soon as the real project starts here. For now we just decided that the raspberry pi is a suitable platform for what we're trying to achieve.
I'm not sure how much using the HVS is going to save you, but it's worth a try.
tmoe wrote:Thanks for your time and help. If it weren't for you, we would probably have switched to another platform by now. The poor mmal/gpu documentation and the non-existent (?) commercial support is just a killer... I hope my example project on github can help others on that matter....
Working with MMAL should be a world simpler than working with IL, but you're right that we could do with writing an overview, and some more examples.
TBH I don't know whether you'll really get more support on other platforms - I've never looked. The likes of dom, PhilE, JamesH, jdb, and myself will lurk on the forums and Github and help out where time allows. It does make life significantly easier where the person at the other end understands basic Linux concepts and isn't expecting a complete solution dumped on their lap, so thank you for doing your bit.

It's projects that are pushing the hardware that have the interesting bits - just how far can this little mobile phone chip with a ~8 year old GPU actually get people?! Other than 4k and HEVC (and even that is being achieved in software at 1080p30) it's actually still holding its own pretty damn well.

dickon
Posts: 233
Joined: Sun Dec 09, 2012 3:54 pm
Location: Home, just outside Reading

Re: hw-encode opengl output without glReadPixels in 2018 ("fastpath")?

Tue Oct 02, 2018 3:44 pm

6by9 wrote:
Tue Oct 02, 2018 3:24 pm
It's projects that are pushing the hardware that have the interesting bits - just how far can this little mobile phone chip with a ~8 year old GPU actually get people?! Other than 4k and HEVC (and even that is being achieved in software at 1080p30) it's actually still holding its own pretty damn well.
It's quite astonishing. The work you lot at Pi Towers have done is incredible.

12b 8k p60 H.265 (encode and decode) is all I want from a new Pi. Please :-)
