sandboxvt
Posts: 3
Joined: Fri Oct 07, 2016 9:55 pm

Out of spec response time for EmptyThisBuffer?

Fri Oct 07, 2016 11:47 pm

In short, the EmptyThisBuffer call takes more than 10 msec, and often more than 15 msec. The OpenMAX IL spec indicates that it should return within 5 msec.

This out-of-spec delay prevents live 1080p@30fps or 720p@60fps transcoding on the RPi 3.

The detail:

The goal is transcoding ATSC 1080i and 720p MPEG2 TS to H264 1080p@30fps and 720p@60fps live! Therefore it is important to keep up with the frame rate.

Using avconv or ffmpeg with h264_omx, the output encoding frame rate falls about 20% short (about 24 fps for 1080p).

Tracing the code, I noticed that EmptyThisBuffer is very slow to return, often taking close to 20 msec. Given that at 30 fps each frame has a total of 33.3 msec to complete transcoding, this delay pushes the per-frame transcoding time to > 40 msec. There is no chance of doing 30 fps with this kind of delay.
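
The arithmetic behind these numbers, as a small sketch using only the figures quoted above (the variable names are just for illustration):

```python
# Figures quoted in this post; nothing here is an independent measurement.
TARGET_FPS = 30
OBSERVED_FPS = 24      # ffmpeg/avconv with h264_omx at 1080p
ETB_CALL_MS = 20.0     # EmptyThisBuffer return time, close to the worst case seen

frame_budget_ms = 1000.0 / TARGET_FPS         # time available per frame
observed_frame_ms = 1000.0 / OBSERVED_FPS     # time actually spent per frame
shortfall = 1.0 - OBSERVED_FPS / TARGET_FPS   # fraction of the target rate missing
etb_share = ETB_CALL_MS / frame_budget_ms     # share of the budget used by the call

print(f"budget per frame:    {frame_budget_ms:.1f} ms")
print(f"observed per frame:  {observed_frame_ms:.1f} ms")
print(f"frame rate shortfall: {shortfall:.0%}")
print(f"EmptyThisBuffer share of budget: {etb_share:.0%}")
```

So a single 20 msec call uses about 60% of the 33.3 msec budget, and 24 fps corresponds to about 41.7 msec per frame.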

I then looked at hello_pi/encode.c, and it has a similar problem when the video width is increased to 1920. The delay is ~10 msec, which is still way out of spec.

Is there a workaround for this problem? Perhaps some OpenMax component configuration parameters? Why is the EmptyThisBuffer call response time >> 5 msec? The above observations are with the original source. With encode.c, only the video width has been altered.

Or is the RPi 3 simply not capable of doing this type of video transcoding live?

Thanks

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 4440
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Out of spec response time for EmptyThisBuffer?

Sat Oct 08, 2016 9:38 pm

Because of the original market for the VideoCore GPU (a multimedia coprocessor, not apps processor), EmptyThisBuffer and FillBufferDone include transferring (copying) the buffer contents from host processor (ARM) to GPU memory or back again. On the co-pro this meant sending the data off chip, and is all handled by ILCS (IL Component Service). All video decode and display, or camera to encode type use cases always kept the image data on the GPU by setting up IL tunnels.
I wasn't expecting that to take as much time as you are observing, but it will be a modest chunk for the ~3MB of a 1080P YUV4:2:0 frame.
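
For reference, the ~3MB figure and the copy rate it implies can be worked out as follows. This is a rough sketch: the 10 and 20 msec timings are the ones reported earlier in this thread, the function names are made up for illustration, and it ignores any stride or height padding the encoder may require.

```python
def yuv420_frame_bytes(width, height):
    """Bytes in one 8-bit YUV 4:2:0 frame: full-res Y plus quarter-res U and V."""
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)
    return luma + chroma


def copy_rate_mb_per_s(nbytes, elapsed_ms):
    """Transfer rate in MB/s (10^6 bytes) if nbytes were moved in elapsed_ms."""
    return nbytes * 1000 / elapsed_ms / 1e6


frame = yuv420_frame_bytes(1920, 1080)
print(f"1080p YUV 4:2:0 frame: {frame} bytes")
for ms in (20, 10, 5):
    print(f"{ms:>2} ms per frame -> {copy_rate_mb_per_s(frame, ms):.1f} MB/s")
```

A frame is 3,110,400 bytes, so a call that takes 20 msec is moving data at roughly 155 MB/s if the whole time is spent copying.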

The main aim with IL is to form a pipeline. Using Gstreamer you're taking one IL component in isolation and wrapping GST over the top. Create a pure IL pipeline of video_decode (MPEG2 to YUV) -> video_encode (YUV to H264) with tunnels and you've a chance of it working.
My main hesitation would be that the codec block is only specified to achieve 1080P30 encode with about 10% headroom, and it is the same block shared with decode. Overclocking may get you enough bandwidth to handle both the encode and decode simultaneously, but I just don't know.
You don't say how you were doing the decode in GStreamer, nor where you are dealing with deinterlacing your 1080i images, so I'm having to make guesses here. How much are you stressing the ARM cores?


You also seem to be missing that IL is all about being a pipeline. It supports multiple buffers on each port. As long as the EmptyThisBuffer call takes less than a frame time, then you may be increasing latency, but you can still achieve full frame rate given appropriate pipelining.
To elaborate, if you're encoding a YUV420PackedPlanar frame to H264 on Pi, then the frame has to get to the GPU (ARM memcpy), be converted to an internal format (VPU), motion estimation (CME and FME), encode (ENC), entropy code, and transfer the resulting data back to the ARM (ARM memcpy). With the exception of the two copies, all of those are on different bits of silicon. So each could take 33ms, and whilst you'd end up with an encoding latency of 6*33 ≈ 200ms, it would still achieve 30fps.
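
The latency versus throughput point above can be sketched with a toy model. It assumes six stages of 33 ms each, as in the example, and treats every stage as an independent unit handling one frame at a time (a simplification, since the two copies actually share the ARM). The function name is hypothetical.

```python
def pipeline_finish_times(stage_ms, n_frames):
    """Return the time (ms) at which each frame leaves the last stage.

    A frame enters a stage as soon as it has left the previous stage
    and that stage has finished with the frame before it.
    """
    stage_free = [0.0] * len(stage_ms)   # when each stage next becomes idle
    done = []
    for _ in range(n_frames):
        t = 0.0                          # input frames are always available
        for s, cost in enumerate(stage_ms):
            start = max(t, stage_free[s])
            t = start + cost
            stage_free[s] = t
        done.append(t)
    return done


# copy in, format convert, motion estimation, encode, entropy code, copy out
stages = [33.0] * 6
done = pipeline_finish_times(stages, 10)
latency_ms = done[0]
interval_ms = done[-1] - done[-2]
print(f"latency of first frame: {latency_ms:.0f} ms")
print(f"steady-state interval:  {interval_ms:.0f} ms ({1000.0 / interval_ms:.1f} fps)")
```

In this model the first frame emerges after 198 ms, and after that one frame completes every 33 ms, which is about 30.3 fps.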

If you want to get this working on Pi with software decode of the video, then I'd recommend you look at MMAL instead of IL. MMAL was written because IL was such a pain to work with, and because things were shifting to the apps processor architecture and shared memory. There you can allocate a zero copy buffer (ie GPU memory) and fill it with your image data to avoid copying full frame buffers around.
Alternatively you can again set up mmal_connections to have a complete pipeline on the GPU. There may even be a couple more tricks that can be pulled to further minimise image format conversions and memory bandwidth if really needed to squeeze every last drop of performance from the system.
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
Please don't send PMs asking for support - use the forum.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

sandboxvt
Posts: 3
Joined: Fri Oct 07, 2016 9:55 pm

Re: Out of spec response time for EmptyThisBuffer?

Sun Oct 09, 2016 11:27 am

Thanks for the detailed reply.

Just some clarifications and follow-up questions.

I am using ffmpeg and avconv, not gstreamer, for compatibility and convenience reasons. ATSC source is MPEG2 TS. Demuxing and decoding video is not very stressful to RPi 3. The total load for the complete transcoding only uses 30-40% of the 4 CPUs.

A lot of time is spent waiting for EmptyThisBuffer and FillThisBuffer. This is just waiting for the calls to return, not for the tasks to complete. Being stuck waiting for completion would be more understandable. Not observing the 5 msec response requirement is very troublesome. I have not seen other implementations behave this way.

I also tested CPU memory transfer speed for decoded frames and speed is not an issue. If transfer between CPU and GPU is causing issues here, it would also be unusual. Do you have any benchmark on such transfer?

I can see that tunneling can help if decoding is also done by the GPU. But I cannot always assume that. It might be useful for frame resizing.

I have not got into audio transcoding yet. Currently audio stream is 'copied' - no decoding is required. Only muxing. Audio transcoding will take additional resources.

Would multiple buffers for encode help?

I am trying to overlap decoding and encoding. But not sure if that will increase the throughput sufficiently.

I only got about a 10% overclocking bump with the RPi 3, and since the transcoding is not very CPU bound, I did not pursue this further.

Do you think this EmptyThisBuffer call implementation will ever be compliant with the 5 msec requirement?

I will take a look at MMAL and see if there is any possibility. But I do prefer leveraging tools like ffmpeg.

Thanks

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 4440
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Out of spec response time for EmptyThisBuffer?

Sun Oct 09, 2016 9:05 pm

sandboxvt wrote:I also tested CPU memory transfer speed for decoded frames and speed is not an issue. If transfer between CPU and GPU is causing issues here, it would also be unusual. Do you have any benchmark on such transfer?
No, I don't have any benchmarks. I'm quite surprised that it is as slow as you are reporting, but haven't had time or energy to investigate.
sandboxvt wrote:Would multiple buffers for encode help?
Yes, use multiple buffers to prevent buffer transfer stalling your pipeline.
sandboxvt wrote:Do you think this EmptyThisBuffer call implementation will ever be compliant with the 5 msec requirement?
The simple issue is a lack of resource. As noted in my sig, I do not work for Pi Towers, so my playtime is limited. Pi Towers have limited resource too.
OpenMax IL and ILCS are a pain to work with. There was no development on using IL for probably 2 years before Broadcom pulled out of the mobile space as MMAL had been written to remove most of the awkwardness. IL compatibility was almost relegated to a marketing tickbox that the chip could do it.
The IL Working Group also became significantly less relevant - some members just wanted to work on a 1.2.0 as bug fixes, whilst others wanted to investigate a 2.0 which actually addressed some of the limitations. Seeing as 1.2.0 seemingly never actually happened (still on 1.1.2) you can make your own judgement on the investment any of the companies involved were prepared to make.

All the ARM side code for ILCS is in the userland repo. If you or others can narrow down where the time appears to be largely spent, then you've a better chance to get one of us to investigate that smaller area. Admittedly it may disappear fairly rapidly into VCHIQ (VideoCore Host Interface Queued - the actual transfer broker) which can get a little hairy, but at least following it that far would be useful.
The little bit of investigation I may do would be to check how long the transfers are taking with MMAL. Again that uses VCHIQ for the transfer (except in the zero copy optimised mode), so a fix might be forthcoming if VCHIQ is identified as the hold up due to it potentially affecting other services.

sandboxvt
Posts: 3
Joined: Fri Oct 07, 2016 9:55 pm

Re: Out of spec response time for EmptyThisBuffer?

Mon Oct 10, 2016 8:47 am

Thanks again for the quick and thoughtful reply.

I think I will take your suggestions on multiple buffers, userland code and MMAL.

Just a few brief questions:

1. Setting multiple buffers - Is this as simple as setting the number of buffers for the port? Both input and output ports?

2. userland project status - I considered working with the code. But I thought it might be actively maintained and therefore maybe I should be patient for further improvements or engaging its developer(s). Is current userland code relatively stable and up to date? Are developers still active on improving it?

3. userland code - I browsed the EmptyThisBuffer code path briefly. It looks like the delay might be in a mutex or semaphore block, in the transit to the GPU, or in the GPU response. Any insight into this is greatly appreciated. I just want to make sure (as far as possible) that the GPU hardware is not the limiting factor before I dive deep into this.

4. Updates from Broadcom - Is it fair to infer from your reply that one should not expect updates in this area from Broadcom any time soon?

Thanks

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 4440
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Out of spec response time for EmptyThisBuffer?

Mon Oct 10, 2016 9:12 am

sandboxvt wrote:1. Setting multiple buffers - Is this as simple as setting the number of buffers for the port? Both input and output ports?
Set nBufferCountActual to something bigger than nBufferCountMin in the PortDefinition. AllocateBuffer/UseBuffer then wants that number of buffers before the port will enable fully. You now have more buffers to fill and submit.
Input and output ports can have different counts. Use a buffer count that you think is reasonable when considering memory vs any gain in performance.
You should also be able to vary nBufferSize if there is any gain. Typically this would be for something like JPEG encoding where you could either have one huge buffer for the whole image, or multiple smaller ones. Multiple smaller ones mean you can start piping the data to the destination sooner. There's no point in changing the buffer size on non-encoded images - you'll just waste memory.
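
A toy model of why a higher buffer count can help. It is only a sketch: the 20 ms fill and drain times are illustrative assumptions rather than measurements, the function name is hypothetical, and it does not call any OpenMAX API. With one buffer the producer has to wait for the buffer to come back before refilling it; with two, filling overlaps with the transfer.

```python
def frames_per_second(fill_ms, drain_ms, n_buffers, n_frames=300):
    """Simulate a producer filling buffers and a consumer draining them.

    fill_ms:  time for the ARM side to put a frame into a free buffer
    drain_ms: time from submitting a buffer until it is handed back
    """
    buffer_free_at = [0.0] * n_buffers   # when each buffer is returned to us
    producer_t = 0.0                     # when the producer finished its last fill
    consumer_t = 0.0                     # when the consumer finished its last drain
    for i in range(n_frames):
        b = i % n_buffers                # buffers are reused round-robin
        start_fill = max(producer_t, buffer_free_at[b])
        producer_t = start_fill + fill_ms
        start_drain = max(producer_t, consumer_t)
        consumer_t = start_drain + drain_ms
        buffer_free_at[b] = consumer_t
    return n_frames * 1000.0 / consumer_t


for count in (1, 2, 3):
    print(f"{count} buffer(s): {frames_per_second(20.0, 20.0, count):.1f} fps")
```

With these made-up figures, one buffer gives 25 fps and two give just under 50 fps. A third buffer adds nothing here because the two sides are already fully overlapped, which is the memory versus performance trade-off mentioned above.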
sandboxvt wrote:2. userland project status - I considered working with the code. But I thought it might be actively maintained and therefore maybe I should be patient for further improvements or engaging its developer(s). Is current userland code relatively stable and up to date? Are developers still active on improving it?
It is maintained in the sense that there are some ongoing developments, but ILCS is likely to only get minor bug fixes where critical issues are identified. MMAL and the raspicam apps (host_applications/linux/apps/raspicam) are still being developed as we get the chance to expose new or otherwise improve functionality, as are the EGL stack and some of the other services. PRs always welcome :D
sandboxvt wrote:3. userland code - I browsed the EmptyThisBuffer code path briefly. It looks like the delay might be in a mutex or semaphore block, in the transit to the GPU, or in the GPU response. Any insight into this is greatly appreciated. I just want to make sure (as far as possible) that the GPU hardware is not the limiting factor before I dive deep into this.
There shouldn't be a huge issue with the GPU transfer. The memory bandwidth should be plenty high enough, even though it will be converting from an SG list to contiguous memory and therefore have to do multiple smaller copies.
I'm just wondering if it is doing a DMA copy rather than using the CPU. That would have a larger setup overhead and therefore increase the time taken.
sandboxvt wrote:4. Updates from Broadcom - Is it fair to infer from your reply that one should not expect updates in this area from Broadcom any time soon?
Check the news from back in July 2014. http://uk.reuters.com/article/us-broadc ... BS20140722 and the like. Broadcom pulled out of the baseband processor market and laid off almost the entire Cambridge team behind VideoCore (including myself). They pretty much have no further interest in developing software for the chip. They are turning the handle to produce the chips, and are very likely involved in any ongoing developments.
Nearly all software support is now down to Pi Towers (who employed several of the people Broadcom laid off) and the community, which is why there is limited resource. (The exception is Eric Anholt's upstream kernel drivers for VideoCore - Broadcom are still employing him.)
The relationship may have changed again with Broadcom being sold to Avago (http://www.bloomberg.com/news/articles/ ... r-broadcom), but I don't know as I'm no longer involved in either company.
