shusheer
Posts: 18
Joined: Wed Oct 28, 2015 3:36 pm

Raspistill causes system lock-up when used with two cameras

Wed Oct 28, 2015 4:13 pm

Hi All,

I seem to have found a bug in raspistill (or possibly deeper in the system). It seems only to happen on a compute module with 2 cameras, and results in raspistill locking up either it's own process, or occasionally the entire system (e.g. system no longer responds to ping when USB ethernet adapter is used). Obviously that's a bit of a problem for autonomous systems.

Please could someone else with a compute module with two camera boards copy and execute the scripts below to see if this is a problem just with my hardware (both my boards were purchased at the same time, so presumably same batch). Assuming someone else can replicate this, I'll file a bug report.

The problem seems to occur when you fire off two sets of raspistill in parallel, one for each camera, and each of those sets independently takes several images. For some reason you need a long delay between each call to raspistill (like >>0.1 sec), otherwise you get the observed lock-up.

Below are two short bash scripts that demonstrate the problem. Just pop them on a standard compute module with a pair of cameras configured, and type "master.sh 1" - it will tick along merrily with a second between raspistill captures and an additional 3 seconds before repeating the requests. It (should) finish 60 such captures and exit normally after about 5 minutes.

Now repeat this but type "master.sh 0.1" - and watch it fall over after somewhere between 5 and 20 iterations (tried on two different compute modules with the same result).

Increasing the number of raspistill captures in slave.sh, and decreasing or removing the sleep in slave.sh, increases the frequency of the problem.

Both boards have the latest firmware/software version - however it's not specific to this version, I saw the issue, updated my machines from whatever older versions they were on, and the issue remained. I then spent some time making the minimum possible scripts to replicate the issue (below) in the hopes that someone else with a compute module and two cameras could test and reassure me it's not just my boards.

$uname -a
Linux Pi-computeA 4.1.11+ #822 PREEMPT Fri Oct 23 16:14:56 BST 2015 armv6l GNU/Linux

$vcgencmd version
Oct 23 2015 16:28:38
Copyright (c) 2012 Broadcom
version c1a8b643677011791abeea1f5dc3a4738349ea55 (clean) (release)

I have 256MB allocated to the GPU - the problem appears to be more frequent if I allocate less GPU memory, but I haven't tested that exhaustively.

The scripts:

First, the master script that starts the slave scripts independently, and has a nice long wait before iterating:

master.sh:
#!/bin/bash
# This script takes 1 parameter:
# $1 is the sleep time after each raspistill, passed to the slave script
# This script pretents to be being triggered to capture camera images in bursts with burstdelay seconds between bursts, for numbursts sets in total
numbursts=60
burstdelay=3
burstnum=0
while [ "$burstnum" -lt $numbursts ]; do
sleep $burstdelay
burstnum=$((burstnum+1))
echo Issuing burst $burstnum with sleep of $1
/home/pi/slave.sh 0 $1 &
/home/pi/slave.sh 1 $1 &
wait
done

Second, the salve script that runs for each camera independently:

slave.sh:
#!/bin/bash
# This script takes 2 parameters:
# $1 = camera={0,1} sets which camera to use
# $2 = delay time after each raspistill for the sleep command
# The script then grabs two stills from the camera to the ram disk, and deletes them
raspistill -ss 100000 -vf -hf -ev -24 -ISO 100 -awb off -awbg 1.0,1.0 -drc high -n -q 99 -t 1000 -cs $1 -o /dev/shm/cam-$1-A.jpg
sleep $2
raspistill -ss 100000 -vf -hf -ev -24 -ISO 100 -awb off -awbg 1.0,1.0 -drc high -n -q 99 -t 1000 -cs $1 -o /dev/shm/cam-$1-B.jpg
sleep $2
rm /dev/shm/cam-$1-*.jpg
echo Done camera $1 with sleep of $2

6by9
Raspberry Pi Engineer & Forum Moderator
Raspberry Pi Engineer & Forum Moderator
Posts: 9357
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Raspistill causes system lock-up when used with two came

Thu Oct 29, 2015 11:31 pm

Just trying this on my CM. I've only got a USB ethernet adapter connected and SSHing into the box. Just run a "sudo rpi-update", so absolute latest firmware (Oct 28 2015 17:09:22) and kernel (4.1.12+ #824 PREEMPT Wed Oct 28 16:39:49).
gpu_mem is set at the default 128M.

It's run through your scripts with a 0.1s delay without issue.

Are you doing anything else on the system, particularly 3D or video decode related?
The only case that I can think of is if something nasty happened during a memory compaction, and that would be potentially more likely with 3D or other multimedia stuff going on.

Code: Select all

vcgencmd cache_flush
sudo vcdbg reloc
would give you some indication of what else is allocated in the heap (if nothing else is going on, it should drop back to about 5 items once raspistill finishes - "ARM FB", "audioplus_temp_buf", "camera fast alloc arena", "dt_blob", and "ILCS VC buffer pool")
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

shusheer
Posts: 18
Joined: Wed Oct 28, 2015 3:36 pm

Re: Raspistill causes system lock-up when used with two came

Fri Oct 30, 2015 12:00 am

No, no 3D processing, not even using gfx (no HDMI plugged in, just SSH over ethernet via USB).

I get exactly those 5 items as a result of
vcgencmd cache_flush
sudo vcdbg reloc

Try removing the sleep $2 lines in slave.sh altogether, and perhaps running e.g. 10 raspistill commands in a row on each camera to really tax it, as it may be something borderline.

If your system survives that, I'll need to start looking for weird hardware-related stuff - the power lines look fine, each CM running off a separate Pi branded power supply, so not sure what could be going on. Any suggestions for what to examine next gratefully received!

6by9
Raspberry Pi Engineer & Forum Moderator
Raspberry Pi Engineer & Forum Moderator
Posts: 9357
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Raspistill causes system lock-up when used with two came

Fri Oct 30, 2015 11:37 am

A quick test with no delay did appear to do something strange after 20 iterations, but no time to investigate right now.

One thing has just niggled in the back of my mind. The IPC from ARM to GPU (called VCHIQ) only has a certain number of service slots - memory says 4 is the critical number. Services get cleaned up when clients disconnect. I'm just wondering if there is a chance that we're hitting that limit if we catch a point where the old client hasn't been cleaned up before the new one connects. I'm not convinced on the theory, but more noting it for my own benefit.

I'd suggest adding "start_debug=1" to boot/config.txt which means GPU asserts are logged and viewable via "sudo vcdbg log assert". If you can connect a serial console too, then that may give you a way of breaking into the kernel when things go wrong. Otherwise I'll try running it with the GPU debugger attached.

BTW There's little point raising an issue on Github - I'm the person most likely to investigate this in the firmware, so it'd only be a (virtual) paper exercise. I'll flag things to people at Pi Towers if necessary.
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

shusheer
Posts: 18
Joined: Wed Oct 28, 2015 3:36 pm

Re: Raspistill causes system lock-up when used with two came

Fri Oct 30, 2015 7:58 pm

I think you might be onto something.

One strange error code I saw whilst trying to get to the minimum script to induce failure was the following:

tvservice-client: Failed to connect to TV service: -1

It only happened once, not consistently, so I thought it wasn't that bit of a hint, but a google on the error found this post, saying something about a limit of 4 connections:
https://github.com/raspberrypi/firmware/issues/441
This didn't seem to be relevant to raspistill to my mind, but your discussion of a limitation of 4 connections made me scratch my head and look back in my browser history.

Also, on searching again, someone else reporting a similar sounding issue to mine, with that error:
https://www.reddit.com/r/raspberry_pi/comments/24nrk2/

So possibly this might help you track down the issue.

shusheer
Posts: 18
Joined: Wed Oct 28, 2015 3:36 pm

Re: Raspistill causes system lock-up when used with two came

Sat Oct 31, 2015 9:41 pm

FYI, I have added "start_debug=1" to boot/config.txt as suggested, and then triggered my master.sh script with delay 0.1 and observed failure, but "sudo vcdbg log assert" shows nothing new.
It does however show one entry, which seems odd (this is always there after boot):
002831.474: assert( card not initialised ) failed; ../../../../../filesystem/media/sdcard/sdcard.c::sdhost_close line 1045
Given that the compute module has soldered-on "sd card" that seemed odd to me.

Unfortunately I don't have an easy way to get a serial connection to the compute modules, as they're actually at a remote site (hence my panic with a problem that can cause them to lock up and not even respond to shutdown - fortunately a phone-call can result in a power-cycle after a few hours, but it makes development slower... should have added a remote-controlled power switch)

6by9
Raspberry Pi Engineer & Forum Moderator
Raspberry Pi Engineer & Forum Moderator
Posts: 9357
Joined: Wed Dec 04, 2013 11:27 am
Location: ZZ9 Plural Z Alpha, aka just outside Cambridge.

Re: Raspistill causes system lock-up when used with two came

Sat Oct 31, 2015 11:26 pm

The sdhost_close assert is there on all Pi platforms - I've never investigated it as there are others who worry about the platform level. It'll be something linked to the GPU controlling the SD card before the ARM boots, and then shutting down as access from 2 processors is bad.
I'm currently rebuilding my dev machine, but will hook the GPU debugger on when I get a chance.
Software Engineer at Raspberry Pi Trading. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.

shusheer
Posts: 18
Joined: Wed Oct 28, 2015 3:36 pm

Re: Raspistill causes system lock-up when used with two came

Sun Nov 01, 2015 1:14 pm

Ok, spotted something in the logfile after another camera crash on each machine, but not sure how useful it will be.

On the first compute module (I have a pair of them):

$ sudo vcdbg log assert
002831.474: assert( card not initialised ) failed; ../../../../../filesystem/media/sdcard/sdcard.c::sdhost_close line 1045
3772250.767: assert( (interrupt_save & (1<<30)) || ((int)_tx_thread_system_state < 0) ) failed; ../../../../../vcfw/rtos/threadx/rtos_threadx_latch.c::rtos_latch_get_real line 99

Unfortunately I can't seem to work out what the time conversion is for that second event timestamp - the first event looks to be in milliseconds (?) since system restart, which would make the second event about an hour after system restart. If that's the case it is probably unrelated to the raspistill issue as that only failed about 12 hours after restart as far as I can tell from file dates.

On the second compute module, also in a crashed state, probably crashed around a similar time of day:

$ sudo vcdbg log assert
4060777.446: assert( 0 ) failed; ../../../../../middleware/ISP/tuner/isp_tuner_brcm_agc.c::handle_legacy_metering_tunings line 2041
4060810.744: assert( 0 ) failed; ../../../../../middleware/ISP/tuner/isp_tuner_brcm_agc.c::handle_legacy_metering_tunings line 2041
4060844.087: assert( 0 ) failed; ../../../../../middleware/ISP/tuner/isp_tuner_brcm_agc.c::handle_legacy_metering_tunings line 2041
4060877.395: assert( 0 ) failed; ../../../../../middleware/ISP/tuner/isp_tuner_brcm_agc.c::handle_legacy_metering_tunings line 2041
4060910.705: assert( 0 ) failed; ../../../../../middleware/ISP/tuner/isp_tuner_brcm_agc.c::handle_legacy_metering_tunings line 2041
4060944.039: assert( 0 ) failed; ../../../../../middleware/ISP/tuner/isp_tuner_brcm_agc.c::handle_legacy_metering_tunings line 2041
4060977.338: assert( 0 ) failed; ../../../../../middleware/ISP/tuner/isp_tuner_brcm_agc.c::handle_legacy_metering_tunings line 2041
4060986.564: assert( sizeof(WRITE_RAW_MD_STATE_T) < PERSISTENT_MEM_REQU ) failed; ../../../../../middleware/camplus/sw/write_raw_md.c::write_raw_md_open line 161
4061129.741: assert( 0 ) failed; ../../../../../middleware/ISP/tuner/isp_tuner_brcm_agc.c::handle_legacy_metering_tunings line 2041
4061218.325: assert( 0 ) failed; ../../../../../middleware/ISP/tuner/isp_tuner_brcm_agc.c::handle_legacy_metering_tunings line 2041
4061271.490: assert( 0 ) failed; ../../../../../middleware/ISP/tuner/isp_tuner_brcm_agc.c::handle_legacy_metering_tunings line 2041
4063747.302: assert( (interrupt_save & (1<<30)) || ((int)_tx_thread_system_state < 0) ) failed; ../../../../../vcfw/rtos/threadx/rtos_threadx_latch.c::rtos_latch_get_real line 99

Rather a lot more going on here, but again not sure if this is relevant given the timing.

I should also say that these crashes aren't from using a script as simple as the master/slave scripts I gave earlier, but scripts that do similar things (multiple camera pics taken in quick succession).

shusheer
Posts: 18
Joined: Wed Oct 28, 2015 3:36 pm

Re: Raspistill causes system lock-up when used with two came

Sun Nov 01, 2015 1:20 pm

Not sure if it's any use at all, but here's the logfile from a Pi-V2 running the same code (but obviously only for a single camera) - it seems to have some asserts as well at about the same time (all the machines were restarted within about a minute of each other, so I think all these logs represent more or less the same activity being performed). Notably, the Pi-v2 has not failed in any observable way, even though the log has entries.

sudo vcdbg log assert
001027.848: assert( source ) failed; ../../../../../middleware/confzilla/cp_front_fdt.c::cp_front_fdt_load_builtin line 115
001655.983: assert( card not initialised ) failed; ../../../../../filesystem/media/sdcard/sdcard.c::sdhost_close line 1045
4054529.827: assert( camera_subsystem_state.power_enable_ref_count[port] > 0 ) failed; ../../../../../vcfw/drivers/device/camera_subsystem/camera_subsystem.c::camera_disable_ctrl line 1116
4055725.475: assert( sizeof(WRITE_RAW_MD_STATE_T) < PERSISTENT_MEM_REQU ) failed; ../../../../../middleware/camplus/sw/write_raw_md.c::write_raw_md_open line 161
4057718.030: assert( sizeof(WRITE_RAW_MD_STATE_T) < PERSISTENT_MEM_REQU ) failed; ../../../../../middleware/camplus/sw/write_raw_md.c::write_raw_md_open line 161
4059956.404: assert( sizeof(WRITE_RAW_MD_STATE_T) < PERSISTENT_MEM_REQU ) failed; ../../../../../middleware/camplus/sw/write_raw_md.c::write_raw_md_open line 161

shusheer
Posts: 18
Joined: Wed Oct 28, 2015 3:36 pm

Re: Raspistill causes system lock-up when used with two came

Sun Nov 01, 2015 8:57 pm

Another possible hint - this output came when using the -set option of raspistill during a crash, but it only happened once so not sure that it is "real":

mmal: mmal_vc_component_create: failed to create component 'vc.ril.image_encode' (1:ENOMEM)
mmal: mmal_component_create_core: could not create component 'vc.ril.image_encode' (1)
mmal: Unable to create JPEG encoder component
mmal: main: Failed to create encode component
mmal: Failed to run camera app. Please check for firmware updates

shusheer
Posts: 18
Joined: Wed Oct 28, 2015 3:36 pm

Re: Raspistill causes system lock-up when used with two came

Sun Nov 01, 2015 9:46 pm

More progress, if you could call it that.

If you just copy-and-paste both the lines below into your ssh session, as a single paste job (i.e. so the line-break causes the two instructions to run immediately one after the other), then you also get the failure:

raspistill -set -n -cs 0 -o /dev/shm/testing-c0a.jpg &
raspistill -set -n -cs 1 -o /dev/shm/testing-c1a.jpg &

I'm pretty sure that's supposed to work, but doesn't on my hardware - hopefully you will find the same. Makes it a lot easier to see what is going on even than the two bash scripts...

shusheer
Posts: 18
Joined: Wed Oct 28, 2015 3:36 pm

Re: Raspistill causes system lock-up when used with two came

Mon Nov 02, 2015 10:22 pm

Another interesting finding - by looking through some old code from around the start of September, I found some that did a rather similar set of captures, and worked then, but fails now. That suggests some kind of update has happened to cause this issue.

I therefore did an sudo rpi-update ea072e1d615bbfeae5ad04f1c9367295dca84f06 to get back to that era for firmware, but the code still fails, suggesting that it might have been something outside firmware.

Unfortunately I don't have physical access to the compute modules to reflash a complete new raspian install from that time to test, but perhaps someone else with a compute module might be able to test this.

Have any other users noticed this problem at all?

shusheer
Posts: 18
Joined: Wed Oct 28, 2015 3:36 pm

Re: Raspistill causes system lock-up when used with two came

Sat Nov 07, 2015 12:00 am

Ok, finally got my hands back on the compute module hardware, so I could try installing clean images.

This problem has existed all the time I've had the compute modules, but seems to have become dramatically more sensitive due to whatever upgrades and s/w I have installed recently.

The test code previously posted, particularly the simple two-line raspistill, works fine on a virgin install of 2015-05-05-raspbian-wheezy.img with the only raspi-config change being the enabling of the cameras and a dt-blob.bin for my particular camera pin connections (which haven't changed).

However, a more aggressive master.sh / slave.sh pair of scripts below still fails after a few iterations, showing that the problem is definitely there, just not as obvious as it had been on my system. The more aggressive scripts are as follows (just run master.sh with no parameters):

master.sh:

Code: Select all

#!/bin/bash
# This script captures camera images in bursts with burstdelay seconds between bursts, for numbursts sets in total
numbursts=60
burstdelay=1
burstnum=0
while [ "$burstnum" -lt $numbursts ]; do
sleep $burstdelay
burstnum=$((burstnum+1))
echo Issuing burst $burstnum
/home/pi/slave.sh 0 &
/home/pi/slave.sh 1 &
wait
done
slave.sh:

Code: Select all

#!/bin/bash
# This script takes 1 parameter:
# $1 = camera={0,1} sets which camera to use
# The script then grabs five stills from the camera to the ram disk in quick succession, and deletes them
raspistill -ss 10000 -ISO 100 -awb off -awbg 1.0,1.0 -n -t 100 -cs $1 -o /dev/shm/cam-$1-A.jpg
raspistill -ss 10000 -ISO 100 -awb off -awbg 1.0,1.0 -n -t 100 -cs $1 -o /dev/shm/cam-$1-B.jpg
raspistill -ss 10000 -ISO 100 -awb off -awbg 1.0,1.0 -n -t 100 -cs $1 -o /dev/shm/cam-$1-C.jpg
raspistill -ss 10000 -ISO 100 -awb off -awbg 1.0,1.0 -n -t 100 -cs $1 -o /dev/shm/cam-$1-D.jpg
raspistill -ss 10000 -ISO 100 -awb off -awbg 1.0,1.0 -n -t 100 -cs $1 -o /dev/shm/cam-$1-E.jpg
rm /dev/shm/cam-$1-*.jpg
echo Done 5x camera $1
On running sudo apt-get update / sudo apt-get upgrade to get to the latest production firmware, the problem still persists, and does not appear to be markedly worse.

On updating to the recent test firmware that I had been reporting problems against, via sudo rpi-update 65aad25162cc39e61d104d87af8ff46cb81308da the problem still persists and again does not seem to be markedly more susceptible to the issue.

I'm not quite sure what changes I had made to make this issue more sensitive, but in any case it does appear to occur in the vanilla production software, if only under more aggressive testing than I had done previously.

I'm now all out of ideas on how to track this down further - anyone else able to replicate the issue? 6by9 said he had spotted something odd on his hardware, but nobody else seems to have tried (or if they have, nothing to report).

Anyone else with compute module and 2 cameras able to test the behaviour with the above scripts?

Return to “Compute Module”