piense
Posts: 45
Joined: Thu Jun 09, 2016 4:11 am

gles & omx.resize deadlock

Thu Jan 31, 2019 2:18 pm

I've got an app running a render loop in one thread and loading textures in another. Part of that loading uses omx.resize to scale my images. I've been using the OMX resize code for a few months now, and the GL ES part is relatively new, so I'm reasonably sure the OMX portion of the code works well. Loading other types of textures works - I'm rendering text in that thread as well.

The problem I'm having is that the resize component seems to be hanging up the main render thread. The render thread gets stuck in eglSwapBuffers while the resize code sits in a loop waiting for a port settings changed event that never comes. I'm wondering if anyone has thoughts on debugging that? I'm also using the OMX image decode component, which doesn't seem to have the same problem, though I need to verify that I'm using the two components in a comparable way - my resize code is based on the decoder, which I wrote first, so they're very similar.

Anyway, I thought I'd ping the people who have played with these areas of the code before I dive deeper into debugging. I'll probably end up stripping my OMX classes and render loop out into a minimal program to isolate and demonstrate the problem.
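
For reference, the wait in question is the usual ilclient pattern from the hello_pi examples - roughly this (a simplified sketch, not my exact code; the output port number and the timeout are assumptions):

Code:

// Simplified sketch of where the resize thread blocks: waiting for the
// resizer's output port settings to change after the input has been fed in.
// Port 61 is assumed to be OMX.broadcom.resize's output port, and 2000 ms is
// an arbitrary timeout rather than what the app actually uses. The caller
// loops on this until it succeeds, which is where things hang.
#include "ilclient.h"

static int waitForResizerPortSettings(COMPONENT_T *resizer)
{
    int ret = ilclient_wait_for_event(resizer, OMX_EventPortSettingsChanged,
                                      61, 0, 0, 1,
                                      ILCLIENT_EVENT_ERROR | ILCLIENT_PARAMETER_CHANGED,
                                      2000 /* ms */);
    return ret == 0 ? 0 : -1;   // non-zero means it timed out or an error event arrived
}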

piense
Posts: 45
Joined: Thu Jun 09, 2016 4:11 am

Re: gles & omx.resize deadlock

Fri Feb 01, 2019 1:50 am

Really feels like something related to this:
https://github.com/raspberrypi/firmware/issues/449

It usually hangs when one thread is waiting for a component buffer and the main thread is stuck in either vc_dispmanx_update_submit_sync or eglSwapBuffers.

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 6422
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: gles & omx.resize deadlock

Fri Feb 01, 2019 11:35 am

Did you enable the ILCS_WARN=1 debug logging and confirm if you were running out of message slots?
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

piense
Posts: 45
Joined: Thu Jun 09, 2016 4:11 am

Re: gles & omx.resize deadlock

Fri Feb 01, 2019 3:44 pm

I did try that and don't see those warnings, so there's a good chance it's a different root cause. I've stripped my code down to a more minimal example and am packaging it up for building and testing on Raspbian - my app currently runs on a buildroot image. Once I put it up on GitHub, it'd be awesome if you have some time in the near future to take a look and point me in the right direction. The concept seems simple enough, so I'm really curious whether it's something I'm doing wrong or something in the GPU, though my OMX code had been working really well for a few months before I mixed it with GL ES and threads.

piense
Posts: 45
Joined: Thu Jun 09, 2016 4:11 am

Re: gles & omx.resize deadlock

Sat Feb 02, 2019 2:11 am

I tested my simplified code on Raspbian and am seeing the same error there. Here's my code, if any of the Pi guys have a chance to take a look at the backtrace when it's hung up. It should be as simple as cloning the repo and typing make to compile. You will have to increase the GPU RAM though; I used something around 200 MB.

(edit: Helps to include a link to the repo)
https://github.com/piense/pi-omx-gles-bug

Thanks!

tvjon
Posts: 680
Joined: Mon Jan 07, 2013 9:11 am

Re: gles & omx.resize deadlock

Sat Feb 02, 2019 10:44 am

It's interesting to see how you go about doing this.
I'm afraid I don't as yet have sufficient experience to fix the problem, but you may be interested in some feedback?

Using your test.jpg I get similar lockups, so reboots are necessary.

It appears the size & resolution of a test.jpg are significant.

I finally made a successful run using a smaller file, ~990 kB. With anything smaller than that, loadImage always fails.
Attached is the long command-line output for the successful run.

Note the:

JPEG Decoder: State never changed to loaded -1
Resizer: Component did not enter Idle state: -1

Can you elaborate on:

//Resizer does odd things with YUV so
//Convert to RGB888 first

in PiMediaImageGFX.cc please?
Attachments
cmd-line-output.txt.zip

piense
Posts: 45
Joined: Thu Jun 09, 2016 4:11 am

Re: gles & omx.resize deadlock

Sat Feb 02, 2019 3:07 pm

I'm not surprised the OMX state changes aren't quite perfect. I wrote most of that 6-8 months ago, so it takes some effort to remember why I did a few things. For the OMX component classes I basically took the hello_jpeg example and caught every return code I could, then pulled some of the library calls back into the class to avoid assertions in the library code. To some extent I don't care if it errors out and fails - the app can log it and just won't show an image there.

The odd things with YUV, I want to say, had to do with the dimensions the resizer would accept. Once I had it working on my sample set of images I decided to leave well enough alone; it was meeting my performance requirements, and the OMX stuff is a bit of a hassle to work with.

Really though, that code has worked quite well up until now. In my first single-threaded version of the app I left it running for a couple of days cycling through images, and it has kept working while I develop other parts of the app. It's only since I decided to use GL ES for compositing, instead of just dispmanx, that it's been causing issues. It might be worth spending some time in the next week checking whether something in the component setup isn't quite right and is getting things into an odd state. I did notice that smaller images would load - I think it's just a timing thing between calls in the threads, but I can't see why that would be the case.
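
For reference, the "catch every return code" style looks roughly like this (a simplified sketch, not the actual class code; the port number is the image_decode input port from the hello_jpeg example):

Code:

// Simplified sketch of the error-handling style: every OMX call's return code
// is checked and logged, and a failure just bubbles up so the app logs it and
// skips that image instead of asserting.
#include <cstdio>
#include <cstring>
#include "ilclient.h"

static int setDecoderInputFormat(COMPONENT_T *decoder)
{
    OMX_IMAGE_PARAM_PORTFORMATTYPE imagePortFormat;
    memset(&imagePortFormat, 0, sizeof(imagePortFormat));
    imagePortFormat.nSize = sizeof(imagePortFormat);
    imagePortFormat.nVersion.nVersion = OMX_VERSION;
    imagePortFormat.nPortIndex = 320;                      // image_decode input port
    imagePortFormat.eCompressionFormat = OMX_IMAGE_CodingJPEG;

    OMX_ERRORTYPE err = OMX_SetParameter(ILC_GET_HANDLE(decoder),
                                         OMX_IndexParamImagePortFormat,
                                         &imagePortFormat);
    if (err != OMX_ErrorNone) {
        printf("JPEG Decoder: Setting port image format failed: 0x%08x\n", (unsigned)err);
        return -1;
    }
    return 0;
}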

piense
Posts: 45
Joined: Thu Jun 09, 2016 4:11 am

Re: gles & omx.resize deadlock

Sat Feb 02, 2019 10:59 pm

Just cleaned up the resizer code a bit. It gets a bit farther before getting stuck now - not sure if that's entirely related to my clean-up, but I did gain some safety from timeouts in one or two spots that I didn't have before.

piense
Posts: 45
Joined: Thu Jun 09, 2016 4:11 am

Re: gles & omx.resize deadlock

Thu Feb 07, 2019 2:05 am

So I've been poking at this thing for the past week and finally made some progress. I isolated my OMX code into a unit test and it seems to be what's causing the crash. However, when I run it on an OS image based on the firmware & kernel from around November 2017, it runs fine - so it's either a regression in the Pi firmware or a bug in my code that only gets caught by the newer firmware. I'll post my OMX unit test here in a bit, but my main question is how to regression-test with Raspbian. Is there a good way to revert to certain points in time, or do I just need to find the archive of the base images and make sure they don't update?

Edit: Found the archives and am bisecting the bug against the Raspbian releases.

piense
Posts: 45
Joined: Thu Jun 09, 2016 4:11 am

Re: gles & omx.resize deadlock

Fri Feb 08, 2019 4:21 am

I've managed to narrow the bug down quite a bit. My demo app will run for hours on the raspbian-2018-03-14 release and earlier, but crashes within a minute or two on raspbian-2018-04-19 and later. My repo has the debug output from running with start_debug=1 in config.txt, along with the assert logs and a backtrace. The code is in the repo below; just type make to compile it. It also needs quite a bit of video RAM - I've been using 300 MB. I didn't realize quite how big the test images were, but they do the trick and it's the same bug I've been seeing.

https://github.com/piense/pi-omx-gles-bug

tvjon
Posts: 680
Joined: Mon Jan 07, 2013 9:11 am

Re: gles & omx.resize deadlock

Sun Feb 10, 2019 1:17 pm

What f/w & kernel are the failing OSes running, i.e.:

$ sudo vcgencmd version

$ uname -a

I've run the new version for ~3.5 hours with no failures on the 4.19.19-v7+ kernel (sudo BRANCH=next rpi-update).

However, using some of the jpgs here, e.g. the attached one, note the "Allocating buffer" size:

$ ./omxtest
Adding delrod.jpg
Open display[0]...
Display is 1920 x 1080
Loading: testImages/delrod.jpg
1920 1080
pis_MediaImageGFX::loadImage: testImages/delrod.jpg
PiOMXUser: ilclient loaded.
New Resizer
JPEG Decoder: decodeJpgImage()
JPEG Decoder: Allocated input image buffer.
JPEG Decoder: Image file read into memory.
JPEG Decoder: setupOpenMaxJpegDecoder()
JPEG Decoder: prepareImageDecoder()
JPEG Decoder: Created the component.
JPEG Decoder: prepareImageDecoder succeeded
JPEG Decoder: startupImageDecoder()
JPEG Decoder: Setting port image format
JPEG Decoder: Setting port parameters
JPEG Decoder: Enabling input port
JPEG Decoder: Initializing input buffers
JPEG Decoder: Allocating buffer of 81920
JPEG Decoder: Allocating buffer of 81920

JPEG Decoder: Allocating buffer of -75247

Allocate resize input buffer, err: 80001000

JPEG Decoder: Failed at startupImageDecoder -1
JPEG Decoder: Failed to setup OpenMaxJpegDecoder.
JPEG Decoder: Exiting with error.
JPEG Decoder: Cleaning up pDecoder
JPEG Decoder: cleanup()
JPEG Decoder: Continuing cleanup
JPEG Decoder: Decoder commands flushed
JPEG Decoder: Error transitioning to idle: OMX_ErrorInsufficientResources
JPEG Decoder: Component did not enter Idle state: -1
JPEG Decoder: Input port disabling
JPEG Decoder: Freeing buffer header 0
JPEG Decoder: Freed input buffer 0
JPEG Decoder: Freeing buffer header 1
JPEG Decoder: Freed input buffer 1
JPEG Decoder: Freeing buffer header 2
JPEG Decoder: Error freeing input buffer 2: OMX_ErrorBadParameter
JPEG Decoder: Freed input buffers and headers
JPEG Decoder: Input port confirmed disabled
JPEG Decoder: OMX error OMX_ErrorSameState
JPEG Decoder: Disabling output port
JPEG Decoder: Error freeing output port buffer: OMX_ErrorBadParameter
JPEG Decoder: Output port never disabled -2
JPEG Decoder: Freeing pDecoder
JPEG Decoder: Freeing sourceImage
loadImage: Error loading JPEG testImages/delrod.jpg
Resizer closed
ERROR loading image.PiOMXUser: DESTROYED
Attachments
delrod.jpg

piense
Posts: 45
Joined: Thu Jun 09, 2016 4:11 am

Re: gles & omx.resize deadlock

Sun Feb 10, 2019 6:22 pm

Oh, that one's a bug in setting up some buffers for smaller images. It's fixed in the repo now. I'm still narrowing the main issue down; all I know at the moment is that Raspbian releases after a certain date are crashing. I'm really not sure which component is causing the issue. I updated just rpi-firmware in my buildroot image and my full app runs OK. I'll play with userland and the kernel next. If I can narrow it down to a specific commit I might be able to figure out what's going on - I'm just really unfamiliar with the inner workings of all that.

tvjon
Posts: 680
Joined: Mon Jan 07, 2013 9:11 am

Re: gles & omx.resize deadlock

Sun Feb 10, 2019 7:44 pm

Well, it's definitely improved, but a quick test suggests any jpg < 81920 bytes will fail, e.g. the attached one.

I'll look at your changes later tonight, hopefully.
Attachments
sw.jpg

piense
Posts: 45
Joined: Thu Jun 09, 2016 4:11 am

Re: gles & omx.resize deadlock

Mon Feb 11, 2019 1:41 pm

I'll figure out the small image problem in a bit. They're good additions to my test image set, though.
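
Going by the "Allocating buffer of -75247" line in your earlier log, my guess (not verified against the repo yet) is that the last input slice isn't being clamped when the file is smaller than the fixed 81920-byte slices expect. Probably something along these lines:

Code:

// Guess at the small-image failure (illustrative only, not the actual commit):
// the input buffers are allocated to cover the source file in 81920-byte
// slices, and the size of the last slice can go negative for small files.
// Clamping the slice size and stopping at zero avoids asking for a negative
// buffer size.
#include <cstddef>
#include <cstdio>
#include "ilclient.h"

static const size_t kSliceSize = 81920;

static int allocateInputBuffers(COMPONENT_T *decoder, int inPort,
                                size_t fileSize, int bufferCount)
{
    size_t remaining = fileSize;
    for (int i = 0; i < bufferCount && remaining > 0; i++) {
        size_t thisSlice = remaining < kSliceSize ? remaining : kSliceSize;
        printf("JPEG Decoder: Allocating buffer of %zu\n", thisSlice);

        OMX_BUFFERHEADERTYPE *header = NULL;
        OMX_ERRORTYPE err = OMX_AllocateBuffer(ILC_GET_HANDLE(decoder), &header,
                                               inPort, NULL, thisSlice);
        if (err != OMX_ErrorNone)
            return -1;
        remaining -= thisSlice;
        // ...stash 'header' somewhere for the feed loop...
    }
    return 0;
}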

The crashing problem I've been having seems to be related to the kernel version. I'm running 4.9.52 with the current rpi-firmware & rpi-userland code and it seems to be running great. I might be able to bisect the commit log to figure out which version caused the issues, but I'll probably be lost after that point - and compiling kernels takes a while. At least it makes development easier knowing which package I need to roll back.

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 6422
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: gles & omx.resize deadlock

Mon Feb 11, 2019 2:34 pm

A quick run and I've seen "sudo vcdbg log msg" reporting

Code:

1267106.140: vcos_abort: Halting
That means a thread on the VPU has hit an unrecoverable error, and the firmware then suspends that thread.
It'd be useful to confirm if you are seeing the same error message.

Looking at further debug, it's an issue in trying to dma copy a buffer. Of course now that I'm trying to reproduce it, it's refusing to go wrong :(
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

piense
Posts: 45
Joined: Thu Jun 09, 2016 4:11 am

Re: gles & omx.resize deadlock

Mon Feb 11, 2019 2:48 pm

How are you trying to reproduce it? I was using a Pi 2 for most of this, if that helps. I flashed a card with Raspbian, cloned the repo, bumped video RAM up to 300 MB, ran 'make', then ran the code. My monitor was 1920x1080 if that has any effect. I have a few of the newer models and have crashed those too at some point; it just so happens that I started with the Pi 2 and have been using it consistently while debugging this. My full app, and the original test code I posted, can crash it pretty darn quickly. I just distilled it down to that simple OMX example as the simplest case that could crash it while I figured out what was going on. The big image did seem to crash it quickly enough, in the same manner I was seeing elsewhere.

This version is a bit more complicated, but it crashed things quicker in more cases when using GLES in another thread. I thought that was my issue originally, until I started stripping it down and realized I had updated my OS image.
https://github.com/piense/pi-omx-gles-b ... 2baf59b63f

I did notice that using start_debug=1 in the config file makes it a tad harder to reproduce, but usually it just took a few extra minutes to crash.

As for DMA transfers, I've programmed my share of those on MCUs, but I have no clue how they're used in the Pi.

piense
Posts: 45
Joined: Thu Jun 09, 2016 4:11 am

Re: gles & omx.resize deadlock

Mon Feb 11, 2019 2:52 pm

Totally forgot to answer your question. I've seen that message before, though I thought it was in dmesg. 'sudo vcdbg log msg' hung when I was gathering the debug info, although I know it's worked for me before with this issue. I had start_debug=1 set when gathering all the debug info, if that affects things.

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 6422
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: gles & omx.resize deadlock

Mon Feb 11, 2019 3:59 pm

"vcos_abort: Halting" will always be in vcdbg log msg, never in dmesg.

I'm on a 4.19.19-v7+ kernel on a 3B+. Debug firmware so I can see more information. I've got gpu_mem set at 256M.

I am suspecting a race condition.

The assert/abort I'd logged (and now lost the info on) was from a dma transfer being done of a buffer from/to GPU to/from ARM memory. It looked like one of the addresses being passed was invalid, so the dma copy was stopping before it did something silly.
From what I have logged I can't say whether it was IL or DispmanX that triggered the issue. I've now had it running for probably 10 minutes without issue, so can't easily take it further until it goes wrong again.

If this won't go wrong again, then I'm afraid I'm going to take the approach of the man who, when asked how to get somewhere, replied "I wouldn't start from here":
- You're decoding the raw JPEG into GPU memory, and then hauling it back to the ARM as YUV420. For otherlanguage.jpg, that's 51MB
- Converting that to RGBA8888 you're sending that 51MB back to the GPU, and then pulling 137MB back for the RGB image.
- That 137MB then gets sent back again to be resized down to your ~1080P image ("only" 8MB for RGBA8888)
- That 8MB buffer then gets copied back again to be displayed by DispmanX.
IL (and MMAL) support tunnelling, which keeps the buffers on the GPU without all these huge copies. They also support sending images direct to the screen using the video_render component, saving you from involving DispmanX directly.
I'd also ask what the problem is that makes you convert to RGB at full resolution first and then resize.

MMAL allows you to do zero copy of buffers (mapping GPU memory into the ARM), which saves all this copying.

I've never used it with IL (which I hate with a passion), but a better bet would be to use the isp component instead of resize. It uses the ISP hardware block to do resizing and format conversion. The one potential difference vs resize is that it does NOT support specifying the resize properties: you set the absolute input and output formats explicitly, and you shouldn't wait on a port settings changed event on the output port.
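
A very rough sketch of the shape of it via MMAL (untested, written from memory - the width/height variables are placeholders and the buffer/pool handling is omitted):

Code:

// Rough sketch: resize + YUV->RGBA conversion in one pass through vc.ril.isp,
// with formats set explicitly on both ports. Error checking trimmed for brevity.
#include "interface/vcos/vcos.h"
#include "interface/mmal/mmal.h"
#include "interface/mmal/util/mmal_util.h"
#include "interface/mmal/util/mmal_util_params.h"

void setupIspResize(int srcWidth, int srcHeight, int dstWidth, int dstHeight)
{
    MMAL_COMPONENT_T *isp = NULL;
    mmal_component_create("vc.ril.isp", &isp);

    MMAL_PORT_T *in = isp->input[0], *out = isp->output[0];

    in->format->encoding = MMAL_ENCODING_I420;            // what the JPEG decode produced
    in->format->es->video.width  = VCOS_ALIGN_UP(srcWidth, 32);
    in->format->es->video.height = VCOS_ALIGN_UP(srcHeight, 16);
    in->format->es->video.crop.width  = srcWidth;
    in->format->es->video.crop.height = srcHeight;
    mmal_port_format_commit(in);

    mmal_format_copy(out->format, in->format);
    out->format->encoding = MMAL_ENCODING_RGBA;           // convert while resizing
    out->format->es->video.width  = VCOS_ALIGN_UP(dstWidth, 32);
    out->format->es->video.height = VCOS_ALIGN_UP(dstHeight, 16);
    out->format->es->video.crop.width  = dstWidth;
    out->format->es->video.crop.height = dstHeight;
    mmal_port_format_commit(out);

    // Keep the buffers on the GPU side rather than copying through the ARM.
    mmal_port_parameter_set_boolean(in,  MMAL_PARAMETER_ZERO_COPY, MMAL_TRUE);
    mmal_port_parameter_set_boolean(out, MMAL_PARAMETER_ZERO_COPY, MMAL_TRUE);

    mmal_component_enable(isp);
    // ...then create pools, enable the ports with callbacks, and send/receive buffers.
}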
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

piense
Posts: 45
Joined: Thu Jun 09, 2016 4:11 am

Re: gles & omx.resize deadlock

Mon Feb 11, 2019 4:37 pm

You make good points on the amount of memory bandwidth being used, and I've tried to minimize that. The colour space conversion step is a bit obnoxious, but there were too many stupid cases when trying to resize the YUV images - my post history has some of that battle with odd/even dimensions and mysterious green lines at certain sizes.

I could explore tunnelling; mostly I got OMX working and didn't feel like poking it any more. What is the cost of moving between the GPU and CPU? I tried to reuse the buffers by just moving the header along the existing buffer, so I'm copying as little as possible. Piecemealing the data through tunnels doesn't seem like a benefit in and of itself, but maybe that context cost is significant. Then again, it was doing it pretty darn fast and worked really well. I intentionally used big images to put worst-case bounds on execution times for performance expectations, and I never felt the need to go shave off a few extra seconds with tunnelling. And up until I upgraded the OS image it was darn stable - I took a stack of personal photos and Wikimedia Commons photos and tried to break it, and other than some ridiculously huge photos it did really well.

I'm already running into problems with putting too much load on the HVS and blacking the screen out, so using the video_render component might not work out.

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 6422
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: gles & omx.resize deadlock

Mon Feb 11, 2019 4:40 pm

And now it's triggered again.

Your app appears to be stuck during the cleanup of the "resize" of testImages/otherlanguage.jpg from YUV to RGBA.
It claims

Code:

Resizer: Allocated output port buffer of: 144609280 with stride 29440.
Resizer: Output port buffers assigned
Resizer: Output port enabled.
Resizer: Filling the output buffer
Resizer: EOS on output port detected
Resizer: Image resize succeeded, cleaning up.
Resizer: cleanup()
Resizer: Continuing cleanup
You have wedged the main RPC thread on the GPU, therefore almost everything on the GPU will be dead.

I think what may be happening is you are getting the EOS signal before the buffer is back. Adding back in the FillBufferDone and EmptyBufferDone callbacks, on a good pass I get

Code:

Resizer: Allocated output port buffer of: 7102464 with stride 6528.
Resizer: Output port buffers assigned
Resizer: Output port enabled.
Resizer: Filling the output buffer
Resizer: Done with input buffer
Resizer: EOS on output port detected
Resizer: Image resize succeeded, cleaning up.
Resizer: cleanup()
Resizer: Continuing cleanup
Resizer: FillBufferDoneCB()
Resizer: Decoder commands flushed
Resizer: Component transitioning to idle
Resizer: Component transitioned to idle
Resizer: Input port disabling
Resizer: Freed input buffer.
Resizer: Input port confirmed disabled
Resizer: Disabling output port
Resizer: Returning successfully.
pis_MediaImageGFX: Resized image in 5.950000 seconds.
Note that "EOS on output port detected" is logged before "FillBufferDoneCB()".
When it fails I see

Code:

Resizer: portSettingsChanged()
Resizer: nBufferCountActual 1, size 48848896
Resizer: Resizing to: 4288x2848
Resizer: Enabling out port
Resizer: Output port enabled
Resizer: Allocated output port buffer of: 48848896 with stride 17152.
Resizer: Output port buffers assigned
Resizer: Output port enabled.
Resizer: Filling the output buffer
Resizer: Done with input buffer
Resizer: EOS on output port detected
Resizer: Image resize succeeded, cleaning up.
Resizer: cleanup()
Resizer: Continuing cleanup
No "FillBufferDoneCB()".

You really should be handling the buffers as they are returned rather than banking on them being what you think.

I think what has occurred is that the event path has overtaken the data path as the transfer will take a measurable period of time. You've then destroyed the buffer that IL was transferring into during the transfer. I will agree that that shouldn't be allowed by the framework, but there is no way I am going to try fixing up IL - I wish it would die.
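
In concrete terms, something like this (a rough sketch only - the callback registration is the standard ilclient one, the flag and timeout handling are just illustrative):

Code:

// Rough sketch: record when the output buffer actually comes back via
// FillBufferDone, and make cleanup wait for that as well as for EOS before
// disabling ports or freeing buffer headers.
#include <atomic>
#include <chrono>
#include <thread>
#include "ilclient.h"

static std::atomic<bool> g_outputBufferReturned{false};

static void fillBufferDoneCB(void *userdata, COMPONENT_T *comp)
{
    (void)userdata;
    (void)comp;
    g_outputBufferReturned = true;   // the transfer into our output buffer has finished
}

// Once, after ilclient_init():
//   ilclient_set_fill_buffer_done_callback(client, fillBufferDoneCB, NULL);

// In the resize path, after the EOS event on the output port has been seen and
// before any cleanup:
static bool waitForOutputBuffer(int timeoutMs)
{
    int waited = 0;
    while (!g_outputBufferReturned && waited < timeoutMs) {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        waited += 10;
    }
    return g_outputBufferReturned;   // false means the buffer never came back
}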
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

piense
Posts: 45
Joined: Thu Jun 09, 2016 4:11 am

Re: gles & omx.resize deadlock

Mon Feb 11, 2019 4:58 pm

Oh, interesting theory! So the IL client is returning before a DMA transfer is complete, and then bad things happen?

"You really should be handling the buffers as they are returned rather than banking on them being what you think."

I sort of get what you mean, but can you put that in implementation terms? Should I rely on the buffer-done callbacks instead of the IL client calls? I'm all for a workaround in my code rather than fixing the root cause in the IL code. Also, I'm at work and can't get too distracted figuring it out at the moment, but I definitely appreciate you taking the time to poke at it.

What would've caused this behaviour to change between versions? Just happenstance of how the timing worked out? I'm still trying to narrow down the culprit by swapping around versions of things, but it really does seem to be tied to the kernel version - which of course is all sorts of drivers screwing with hardware and registers and timing.

I know I'm pushing the Pi hardware a bit and I'd consider a better suited SoC for the project but you guys did a darn good job of getting Pis out there and that fits well with who I'm trying to target.

piense
Posts: 45
Joined: Thu Jun 09, 2016 4:11 am

Re: gles & omx.resize deadlock

Tue Feb 12, 2019 5:42 am

I see what you mean now about trying to catch it crashing. I'd swear that the other day, when I was testing the various Raspbian versions, it crashed in under a minute each time; now it's taking quite a while. I'm not sure what I could possibly be doing differently.

piense
Posts: 45
Joined: Thu Jun 09, 2016 4:11 am

Re: gles & omx.resize deadlock

Tue Feb 12, 2019 2:13 pm

I thought about it and figured out one thing I did in my earlier testing that seems to have an effect, for whatever reason: I ran the test immediately after the Pi booted up. I tried that this morning and it did the trick twice in a row - turn on the Pi, open a command prompt, run the code, and it crashes in under a minute. When I was testing earlier I kept forgetting to change the RAM setting, so I always had it compiled and ready to run from a fresh boot. I have no clue why that would affect things, and I'm curious to experiment more this evening to see how reproducible the trick really is - things like poking around online for a few minutes first versus running it immediately, tried a few times in a row.

piense
Posts: 45
Joined: Thu Jun 09, 2016 4:11 am

Re: gles & omx.resize deadlock

Thu Feb 14, 2019 5:50 am

I tried waiting for the FillBufferDone callback and now it just hangs waiting for the callback. Same errors as before. Here's the code in another branch:

https://github.com/piense/pi-omx-gles-bug/tree/atomic

I think this is what I was seeing before with vcdbg log assert:
# vcdbg log assert
120198.289: assert( IS_ALIAS_PERIPHERAL(c->dst_addr) || (c->xfer_info & DMA0_TI_DEST_IGNORE_SET) || IS_ALIAS_NOT_L1_ALLOCATING(c->dst_addr) ) failed; ../../../../../helpers/dmalib/dmalib.c::dma_chain_start line 410 rev 656741e
vcdbg_ctx_get_dump_stack: dump_stack failed

6by9
Raspberry Pi Engineer & Forum Moderator
Posts: 6422
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: gles & omx.resize deadlock

Thu Feb 14, 2019 9:57 am

Sorry, busy on other stuff at the moment.
That's the same assert as I was seeing. DMA is being given NULL as a destination pointer somehow, and then aborts (a bit severe, but it's also probably unrecoverable anyway).
I didn't get to the bottom of why it gets a NULL destination. I wonder if a malloc has failed somewhere and the return value not been checked - your code appears OK in that regard, but I haven't checked everything.
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

Return to “Graphics programming”