Problem with /sys/class/thermal/thermal_zone0/temp


20 posts
by milhouse » Sat Jan 05, 2013 3:29 am
I've had this problem a few times now, over the last 2-3 months, on Raspbian.

I've found that if a script is left running that repeatedly calls "cat /sys/class/thermal/thermal_zone0/temp", eventually the script will hang while trying to read the temperature data. On a number of occasions I've had this happen after 2-3 days (reading the temp value every 2 seconds). The only solution is a reboot. Each time, the "hang" has been caused by the read of temp from thermal_zone0 blocking, and never completing.

Is there some sort of memory leak or other resource leak in this driver that causes a failure after many repeated reads?

A quick look at the blocking processes currently running on my system, in case the process states are of any help:
Code: Select all
pi@raspberrypi ~ $ ps aux|grep cat
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
pi       30071  0.0  0.0   3204   488 pts/1    DN   Jan04   0:00 cat /sys/class/thermal/thermal_zone0/temp
pi       30740  0.0  0.0   3204   488 pts/1    DN   03:16   0:00 cat /sys/class/thermal/thermal_zone0/temp
pi       30755  0.0  0.0   3204   488 pts/1    D+   03:17   0:00 cat /sys/class/thermal/thermal_zone0/temp


Running "cat /sys/class/thermal/thermal_zone0/temp" at the command line will block in such a way that even ctrl-c won't get back to the prompt. Not even kill -9 will kill these "cat" processes...

This is my current Raspbian kernel:
Code: Select all
Linux raspberrypi 3.2.27+ #250 PREEMPT Thu Oct 18 19:03:02 BST 2012 armv6l GNU/Linux


I'll put off rebooting this system for another day or so in case anyone has any ideas.
Posts: 555
Joined: Mon Jan 16, 2012 12:59 pm
by M33P » Sun Jan 06, 2013 4:09 pm
What version of kernel are you running (uname -a)?

What is in the kernel message log (dmesg)?
Posts: 199
Joined: Sun Sep 02, 2012 1:14 pm
by milhouse » Sun Jan 06, 2013 4:16 pm
M33P wrote:What version of kernel are you running (uname -a)?

It was in the first post. :)

Code: Select all
Linux raspberrypi 3.2.27+ #250 PREEMPT Thu Oct 18 19:03:02 BST 2012 armv6l GNU/Linux


M33P wrote:What is in the kernel message log (dmesg)?


There is nothing relevant in any of the logs, it just fails silently. I've rebooted now though - have to wait for it to happen again in a few days, though it's not always guaranteed.
Posts: 555
Joined: Mon Jan 16, 2012 12:59 pm
by M33P » Sun Jan 06, 2013 4:41 pm
milhouse wrote:It was in the first post. :)

Code: Select all
Linux raspberrypi 3.2.27+ #250 PREEMPT Thu Oct 18 19:03:02 BST 2012 armv6l GNU/Linux


There is nothing relevant in any of the logs, it just fails silently. I've rebooted now though - have to wait for it to happen again in a few days, though it's not always guaranteed.


Ha ha failure to read on my part.

Can you remove the delay between temp reads and see if you can hang the command quicker?

e.g. do something like
Code: Select all
while true; do cat /sys/class/thermal/thermal_zone0/temp; done

within the shell?

Several Videocore firmware and kernel updates have occurred since that kernel version was released. Maybe updating will help?
Posts: 199
Joined: Sun Sep 02, 2012 1:14 pm
by milhouse » Sun Jan 06, 2013 4:57 pm
M33P wrote:Several Videocore firmware and kernel updates have occurred since that kernel version was released. Maybe updating will help?


I tried an rpi-update earlier in the week, which left me with a non-booting system... bit wary of trying again...

Am running the temp loop now, will let you know how it gets on! I modified it as follows so if it hangs I'll know when (and a rough indication of how many times it took)...
Code: Select all
X=0; while true; do echo "$(date): $(cat /sys/class/thermal/thermal_zone0/temp) ($(echo $X))"; X=$(($X+1)); done
Posts: 555
Joined: Mon Jan 16, 2012 12:59 pm
by milhouse » Mon Jan 07, 2013 1:08 am
The loop process has been running almost 8 hours (1.25m iterations) and no hang so far, however as an experiment I kicked off a second loop in another window and noticed that the temperature returned is very often invalid (a large negative number). This negative value would appear occasionally when running only a single loop, but it is now quite common with two loops (eg. easily every 50-100 reads), would this suggest the thermal driver isn't comfortable with concurrent access?

With two loops, it's now common to see this:
Code: Select all
Mon Jan  7 00:57:42 GMT 2013: 57300 (465)
Mon Jan  7 00:57:42 GMT 2013: -854745208 (466)
Mon Jan  7 00:57:42 GMT 2013: 57300 (467)


The negative value can appear in either window, at different times. With only a single loop running, these negative values are extremely rare, though the fact they do appear from time to time, could this be due to a system process also checking the temp?

I've stopped the second loop now, and have allowed the first loop to continue in isolation.
Posts: 555
Joined: Mon Jan 16, 2012 12:59 pm
by milhouse » Mon Jan 07, 2013 6:20 pm
OK, success :)

The infinite loop hung overnight, and it failed as follows:

Code: Select all
Mon Jan  7 01:48:46 GMT 2013: 57300 (1386574)
Mon Jan  7 01:48:46 GMT 2013: 57838 (1386575)
Mon Jan  7 01:48:46 GMT 2013: 57838 (1386576)


The bottom line is the the last call that succeeded. The hang occurred after 1,386,576 iterations (give or take a couple of thousand when I concurrently ran a second loop which revealed a possibly (un)related concurrency issue). This main loop was started at about 17:06:17 on Jan 6 (just after I booted the Pi), so it took about 8 hours 42 minutes to hang.

Previously I said there was nothing interesting in the logs, but I now take that back. At the time of the hang I have the following in syslog (and also kern.log) - I hope its useful:

Code: Select all
Jan  7 01:48:46 raspberrypi kernel: [31346.634917] bcm2708_vcio: mbox chan 8 overflow - drop 5d892008
Jan  7 01:52:18 raspberrypi kernel: [31558.618283] INFO: task kworker/0:2:1656 blocked for more than 120 seconds.
Jan  7 01:52:18 raspberrypi kernel: [31558.618299] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  7 01:52:18 raspberrypi kernel: [31558.618310] kworker/0:2     D c037a204     0  1656      2 0x00000000
Jan  7 01:52:18 raspberrypi kernel: [31558.618354] [<c037a204>] (__schedule+0x2bc/0x568) from [<c037b720>] (__mutex_lock_slowpath+0xb8/0x154)
Jan  7 01:52:18 raspberrypi kernel: [31558.618389] [<c037b720>] (__mutex_lock_slowpath+0xb8/0x154) from [<c001a218>] (bcm_mailbox_write+0x34/0xb8)
Jan  7 01:52:18 raspberrypi kernel: [31558.618414] [<c001a218>] (bcm_mailbox_write+0x34/0xb8) from [<c001a390>] (bcm_mailbox_property+0x5c/0xe4)
Jan  7 01:52:18 raspberrypi kernel: [31558.618443] [<c001a390>] (bcm_mailbox_property+0x5c/0xe4) from [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64)
Jan  7 01:52:18 raspberrypi kernel: [31558.618466] [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64) from [<c029704c>] (thermal_zone_device_update+0x2c/0x240)
Jan  7 01:52:18 raspberrypi kernel: [31558.618497] [<c029704c>] (thermal_zone_device_update+0x2c/0x240) from [<c003da30>] (process_one_work+0x134/0x38c)
Jan  7 01:52:18 raspberrypi kernel: [31558.618521] [<c003da30>] (process_one_work+0x134/0x38c) from [<c003e4a4>] (worker_thread+0x1a0/0x354)
Jan  7 01:52:18 raspberrypi kernel: [31558.618547] [<c003e4a4>] (worker_thread+0x1a0/0x354) from [<c0042e3c>] (kthread+0x84/0x8c)
Jan  7 01:52:18 raspberrypi kernel: [31558.618573] [<c0042e3c>] (kthread+0x84/0x8c) from [<c000e930>] (kernel_thread_exit+0x0/0x8)
Jan  7 01:52:18 raspberrypi kernel: [31558.618588] INFO: task cat:18732 blocked for more than 120 seconds.
Jan  7 01:52:18 raspberrypi kernel: [31558.618596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  7 01:52:18 raspberrypi kernel: [31558.618606] cat             D c037a204     0 18732   2185 0x00000000
Jan  7 01:52:18 raspberrypi kernel: [31558.618631] [<c037a204>] (__schedule+0x2bc/0x568) from [<c037aea8>] (schedule_timeout+0x198/0x230)
Jan  7 01:52:18 raspberrypi kernel: [31558.618651] [<c037aea8>] (schedule_timeout+0x198/0x230) from [<c037bb9c>] (__down+0x88/0xc0)
Jan  7 01:52:18 raspberrypi kernel: [31558.618672] [<c037bb9c>] (__down+0x88/0xc0) from [<c004809c>] (down+0x60/0x64)
Jan  7 01:52:18 raspberrypi kernel: [31558.618693] [<c004809c>] (down+0x60/0x64) from [<c001a2fc>] (bcm_mailbox_read+0x60/0x98)
Jan  7 01:52:18 raspberrypi kernel: [31558.618716] [<c001a2fc>] (bcm_mailbox_read+0x60/0x98) from [<c001a3d4>] (bcm_mailbox_property+0xa0/0xe4)
Jan  7 01:52:18 raspberrypi kernel: [31558.618739] [<c001a3d4>] (bcm_mailbox_property+0xa0/0xe4) from [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64)
Jan  7 01:52:18 raspberrypi kernel: [31558.618760] [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64) from [<c0296934>] (temp_show+0x28/0x54)
Jan  7 01:52:18 raspberrypi kernel: [31558.618789] [<c0296934>] (temp_show+0x28/0x54) from [<c02232f8>] (dev_attr_show+0x1c/0x48)
Jan  7 01:52:18 raspberrypi kernel: [31558.618818] [<c02232f8>] (dev_attr_show+0x1c/0x48) from [<c0114ff8>] (sysfs_read_file+0x90/0x134)
Jan  7 01:52:18 raspberrypi kernel: [31558.618843] [<c0114ff8>] (sysfs_read_file+0x90/0x134) from [<c00b88c4>] (vfs_read+0x98/0x16c)
Jan  7 01:52:18 raspberrypi kernel: [31558.618862] [<c00b88c4>] (vfs_read+0x98/0x16c) from [<c00b89d0>] (sys_read+0x38/0x70)
Jan  7 01:52:18 raspberrypi kernel: [31558.618883] [<c00b89d0>] (sys_read+0x38/0x70) from [<c000d980>] (ret_fast_syscall+0x0/0x30)
Jan  7 01:54:18 raspberrypi kernel: [31678.608927] INFO: task kworker/0:2:1656 blocked for more than 120 seconds.
Jan  7 01:54:18 raspberrypi kernel: [31678.608942] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  7 01:54:18 raspberrypi kernel: [31678.608954] kworker/0:2     D c037a204     0  1656      2 0x00000000
Jan  7 01:54:18 raspberrypi kernel: [31678.608997] [<c037a204>] (__schedule+0x2bc/0x568) from [<c037b720>] (__mutex_lock_slowpath+0xb8/0x154)
Jan  7 01:54:18 raspberrypi kernel: [31678.609033] [<c037b720>] (__mutex_lock_slowpath+0xb8/0x154) from [<c001a218>] (bcm_mailbox_write+0x34/0xb8)
Jan  7 01:54:18 raspberrypi kernel: [31678.609058] [<c001a218>] (bcm_mailbox_write+0x34/0xb8) from [<c001a390>] (bcm_mailbox_property+0x5c/0xe4)
Jan  7 01:54:18 raspberrypi kernel: [31678.609087] [<c001a390>] (bcm_mailbox_property+0x5c/0xe4) from [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64)
Jan  7 01:54:18 raspberrypi kernel: [31678.609109] [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64) from [<c029704c>] (thermal_zone_device_update+0x2c/0x240)
Jan  7 01:54:18 raspberrypi kernel: [31678.609141] [<c029704c>] (thermal_zone_device_update+0x2c/0x240) from [<c003da30>] (process_one_work+0x134/0x38c)
Jan  7 01:54:18 raspberrypi kernel: [31678.609165] [<c003da30>] (process_one_work+0x134/0x38c) from [<c003e4a4>] (worker_thread+0x1a0/0x354)
Jan  7 01:54:18 raspberrypi kernel: [31678.609190] [<c003e4a4>] (worker_thread+0x1a0/0x354) from [<c0042e3c>] (kthread+0x84/0x8c)
Jan  7 01:54:18 raspberrypi kernel: [31678.609216] [<c0042e3c>] (kthread+0x84/0x8c) from [<c000e930>] (kernel_thread_exit+0x0/0x8)
Jan  7 01:54:18 raspberrypi kernel: [31678.609231] INFO: task cat:18732 blocked for more than 120 seconds.
Jan  7 01:54:18 raspberrypi kernel: [31678.609239] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  7 01:54:18 raspberrypi kernel: [31678.609249] cat             D c037a204     0 18732   2185 0x00000000
Jan  7 01:54:18 raspberrypi kernel: [31678.609274] [<c037a204>] (__schedule+0x2bc/0x568) from [<c037aea8>] (schedule_timeout+0x198/0x230)
Jan  7 01:54:18 raspberrypi kernel: [31678.609294] [<c037aea8>] (schedule_timeout+0x198/0x230) from [<c037bb9c>] (__down+0x88/0xc0)
Jan  7 01:54:18 raspberrypi kernel: [31678.609314] [<c037bb9c>] (__down+0x88/0xc0) from [<c004809c>] (down+0x60/0x64)
Jan  7 01:54:18 raspberrypi kernel: [31678.609337] [<c004809c>] (down+0x60/0x64) from [<c001a2fc>] (bcm_mailbox_read+0x60/0x98)
Jan  7 01:54:18 raspberrypi kernel: [31678.609360] [<c001a2fc>] (bcm_mailbox_read+0x60/0x98) from [<c001a3d4>] (bcm_mailbox_property+0xa0/0xe4)
Jan  7 01:54:18 raspberrypi kernel: [31678.609383] [<c001a3d4>] (bcm_mailbox_property+0xa0/0xe4) from [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64)
Jan  7 01:54:18 raspberrypi kernel: [31678.609404] [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64) from [<c0296934>] (temp_show+0x28/0x54)
Jan  7 01:54:18 raspberrypi kernel: [31678.609433] [<c0296934>] (temp_show+0x28/0x54) from [<c02232f8>] (dev_attr_show+0x1c/0x48)
Jan  7 01:54:18 raspberrypi kernel: [31678.609463] [<c02232f8>] (dev_attr_show+0x1c/0x48) from [<c0114ff8>] (sysfs_read_file+0x90/0x134)
Jan  7 01:54:18 raspberrypi kernel: [31678.609487] [<c0114ff8>] (sysfs_read_file+0x90/0x134) from [<c00b88c4>] (vfs_read+0x98/0x16c)
Jan  7 01:54:18 raspberrypi kernel: [31678.609508] [<c00b88c4>] (vfs_read+0x98/0x16c) from [<c00b89d0>] (sys_read+0x38/0x70)
Jan  7 01:54:18 raspberrypi kernel: [31678.609528] [<c00b89d0>] (sys_read+0x38/0x70) from [<c000d980>] (ret_fast_syscall+0x0/0x30)
Jan  7 01:56:18 raspberrypi kernel: [31798.599576] INFO: task kworker/0:2:1656 blocked for more than 120 seconds.
Jan  7 01:56:18 raspberrypi kernel: [31798.599591] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  7 01:56:18 raspberrypi kernel: [31798.599602] kworker/0:2     D c037a204     0  1656      2 0x00000000
Jan  7 01:56:18 raspberrypi kernel: [31798.599646] [<c037a204>] (__schedule+0x2bc/0x568) from [<c037b720>] (__mutex_lock_slowpath+0xb8/0x154)
Jan  7 01:56:18 raspberrypi kernel: [31798.599681] [<c037b720>] (__mutex_lock_slowpath+0xb8/0x154) from [<c001a218>] (bcm_mailbox_write+0x34/0xb8)
Jan  7 01:56:18 raspberrypi kernel: [31798.599707] [<c001a218>] (bcm_mailbox_write+0x34/0xb8) from [<c001a390>] (bcm_mailbox_property+0x5c/0xe4)
Jan  7 01:56:18 raspberrypi kernel: [31798.599735] [<c001a390>] (bcm_mailbox_property+0x5c/0xe4) from [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64)
Jan  7 01:56:18 raspberrypi kernel: [31798.599758] [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64) from [<c029704c>] (thermal_zone_device_update+0x2c/0x240)
Jan  7 01:56:18 raspberrypi kernel: [31798.599790] [<c029704c>] (thermal_zone_device_update+0x2c/0x240) from [<c003da30>] (process_one_work+0x134/0x38c)
Jan  7 01:56:18 raspberrypi kernel: [31798.599815] [<c003da30>] (process_one_work+0x134/0x38c) from [<c003e4a4>] (worker_thread+0x1a0/0x354)
Jan  7 01:56:18 raspberrypi kernel: [31798.599839] [<c003e4a4>] (worker_thread+0x1a0/0x354) from [<c0042e3c>] (kthread+0x84/0x8c)
Jan  7 01:56:18 raspberrypi kernel: [31798.599867] [<c0042e3c>] (kthread+0x84/0x8c) from [<c000e930>] (kernel_thread_exit+0x0/0x8)
Jan  7 01:56:18 raspberrypi kernel: [31798.599881] INFO: task cat:18732 blocked for more than 120 seconds.
Jan  7 01:56:18 raspberrypi kernel: [31798.599889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  7 01:56:18 raspberrypi kernel: [31798.599899] cat             D c037a204     0 18732   2185 0x00000000
Jan  7 01:56:18 raspberrypi kernel: [31798.599924] [<c037a204>] (__schedule+0x2bc/0x568) from [<c037aea8>] (schedule_timeout+0x198/0x230)
Jan  7 01:56:18 raspberrypi kernel: [31798.599944] [<c037aea8>] (schedule_timeout+0x198/0x230) from [<c037bb9c>] (__down+0x88/0xc0)
Jan  7 01:56:18 raspberrypi kernel: [31798.599964] [<c037bb9c>] (__down+0x88/0xc0) from [<c004809c>] (down+0x60/0x64)
Jan  7 01:56:18 raspberrypi kernel: [31798.599987] [<c004809c>] (down+0x60/0x64) from [<c001a2fc>] (bcm_mailbox_read+0x60/0x98)
Jan  7 01:56:18 raspberrypi kernel: [31798.600010] [<c001a2fc>] (bcm_mailbox_read+0x60/0x98) from [<c001a3d4>] (bcm_mailbox_property+0xa0/0xe4)
Jan  7 01:56:18 raspberrypi kernel: [31798.600034] [<c001a3d4>] (bcm_mailbox_property+0xa0/0xe4) from [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64)
Jan  7 01:56:18 raspberrypi kernel: [31798.600055] [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64) from [<c0296934>] (temp_show+0x28/0x54)
Jan  7 01:56:18 raspberrypi kernel: [31798.600083] [<c0296934>] (temp_show+0x28/0x54) from [<c02232f8>] (dev_attr_show+0x1c/0x48)
Jan  7 01:56:18 raspberrypi kernel: [31798.600112] [<c02232f8>] (dev_attr_show+0x1c/0x48) from [<c0114ff8>] (sysfs_read_file+0x90/0x134)
Jan  7 01:56:18 raspberrypi kernel: [31798.600137] [<c0114ff8>] (sysfs_read_file+0x90/0x134) from [<c00b88c4>] (vfs_read+0x98/0x16c)
Jan  7 01:56:18 raspberrypi kernel: [31798.600157] [<c00b88c4>] (vfs_read+0x98/0x16c) from [<c00b89d0>] (sys_read+0x38/0x70)
Jan  7 01:56:18 raspberrypi kernel: [31798.600178] [<c00b89d0>] (sys_read+0x38/0x70) from [<c000d980>] (ret_fast_syscall+0x0/0x30)
Jan  7 01:58:18 raspberrypi kernel: [31918.590229] INFO: task kworker/0:2:1656 blocked for more than 120 seconds.
Jan  7 01:58:18 raspberrypi kernel: [31918.590244] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  7 01:58:18 raspberrypi kernel: [31918.590255] kworker/0:2     D c037a204     0  1656      2 0x00000000
Jan  7 01:58:18 raspberrypi kernel: [31918.590299] [<c037a204>] (__schedule+0x2bc/0x568) from [<c037b720>] (__mutex_lock_slowpath+0xb8/0x154)
Jan  7 01:58:18 raspberrypi kernel: [31918.590334] [<c037b720>] (__mutex_lock_slowpath+0xb8/0x154) from [<c001a218>] (bcm_mailbox_write+0x34/0xb8)
Jan  7 01:58:18 raspberrypi kernel: [31918.590360] [<c001a218>] (bcm_mailbox_write+0x34/0xb8) from [<c001a390>] (bcm_mailbox_property+0x5c/0xe4)
Jan  7 01:58:18 raspberrypi kernel: [31918.590388] [<c001a390>] (bcm_mailbox_property+0x5c/0xe4) from [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64)
Jan  7 01:58:18 raspberrypi kernel: [31918.590411] [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64) from [<c029704c>] (thermal_zone_device_update+0x2c/0x240)
Jan  7 01:58:18 raspberrypi kernel: [31918.590443] [<c029704c>] (thermal_zone_device_update+0x2c/0x240) from [<c003da30>] (process_one_work+0x134/0x38c)
Jan  7 01:58:18 raspberrypi kernel: [31918.590468] [<c003da30>] (process_one_work+0x134/0x38c) from [<c003e4a4>] (worker_thread+0x1a0/0x354)
Jan  7 01:58:18 raspberrypi kernel: [31918.590493] [<c003e4a4>] (worker_thread+0x1a0/0x354) from [<c0042e3c>] (kthread+0x84/0x8c)
Jan  7 01:58:18 raspberrypi kernel: [31918.590519] [<c0042e3c>] (kthread+0x84/0x8c) from [<c000e930>] (kernel_thread_exit+0x0/0x8)
Jan  7 01:58:18 raspberrypi kernel: [31918.590533] INFO: task cat:18732 blocked for more than 120 seconds.
Jan  7 01:58:18 raspberrypi kernel: [31918.590542] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  7 01:58:18 raspberrypi kernel: [31918.590551] cat             D c037a204     0 18732   2185 0x00000000
Jan  7 01:58:18 raspberrypi kernel: [31918.590576] [<c037a204>] (__schedule+0x2bc/0x568) from [<c037aea8>] (schedule_timeout+0x198/0x230)
Jan  7 01:58:18 raspberrypi kernel: [31918.590596] [<c037aea8>] (schedule_timeout+0x198/0x230) from [<c037bb9c>] (__down+0x88/0xc0)
Jan  7 01:58:18 raspberrypi kernel: [31918.590617] [<c037bb9c>] (__down+0x88/0xc0) from [<c004809c>] (down+0x60/0x64)
Jan  7 01:58:18 raspberrypi kernel: [31918.590639] [<c004809c>] (down+0x60/0x64) from [<c001a2fc>] (bcm_mailbox_read+0x60/0x98)
Jan  7 01:58:18 raspberrypi kernel: [31918.590662] [<c001a2fc>] (bcm_mailbox_read+0x60/0x98) from [<c001a3d4>] (bcm_mailbox_property+0xa0/0xe4)
Jan  7 01:58:18 raspberrypi kernel: [31918.590686] [<c001a3d4>] (bcm_mailbox_property+0xa0/0xe4) from [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64)
Jan  7 01:58:18 raspberrypi kernel: [31918.590706] [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64) from [<c0296934>] (temp_show+0x28/0x54)
Jan  7 01:58:18 raspberrypi kernel: [31918.590735] [<c0296934>] (temp_show+0x28/0x54) from [<c02232f8>] (dev_attr_show+0x1c/0x48)
Jan  7 01:58:18 raspberrypi kernel: [31918.590764] [<c02232f8>] (dev_attr_show+0x1c/0x48) from [<c0114ff8>] (sysfs_read_file+0x90/0x134)
Jan  7 01:58:18 raspberrypi kernel: [31918.590789] [<c0114ff8>] (sysfs_read_file+0x90/0x134) from [<c00b88c4>] (vfs_read+0x98/0x16c)
Jan  7 01:58:18 raspberrypi kernel: [31918.590809] [<c00b88c4>] (vfs_read+0x98/0x16c) from [<c00b89d0>] (sys_read+0x38/0x70)
Jan  7 01:58:18 raspberrypi kernel: [31918.590829] [<c00b89d0>] (sys_read+0x38/0x70) from [<c000d980>] (ret_fast_syscall+0x0/0x30)
Jan  7 02:00:18 raspberrypi kernel: [32038.580902] INFO: task kworker/0:2:1656 blocked for more than 120 seconds.
Jan  7 02:00:18 raspberrypi kernel: [32038.580919] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  7 02:00:18 raspberrypi kernel: [32038.580930] kworker/0:2     D c037a204     0  1656      2 0x00000000
Jan  7 02:00:18 raspberrypi kernel: [32038.580974] [<c037a204>] (__schedule+0x2bc/0x568) from [<c037b720>] (__mutex_lock_slowpath+0xb8/0x154)
Jan  7 02:00:18 raspberrypi kernel: [32038.581010] [<c037b720>] (__mutex_lock_slowpath+0xb8/0x154) from [<c001a218>] (bcm_mailbox_write+0x34/0xb8)
Jan  7 02:00:18 raspberrypi kernel: [32038.581036] [<c001a218>] (bcm_mailbox_write+0x34/0xb8) from [<c001a390>] (bcm_mailbox_property+0x5c/0xe4)
Jan  7 02:00:18 raspberrypi kernel: [32038.581065] [<c001a390>] (bcm_mailbox_property+0x5c/0xe4) from [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64)
Jan  7 02:00:18 raspberrypi kernel: [32038.581088] [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64) from [<c029704c>] (thermal_zone_device_update+0x2c/0x240)
Jan  7 02:00:18 raspberrypi kernel: [32038.581121] [<c029704c>] (thermal_zone_device_update+0x2c/0x240) from [<c003da30>] (process_one_work+0x134/0x38c)
Jan  7 02:00:18 raspberrypi kernel: [32038.581145] [<c003da30>] (process_one_work+0x134/0x38c) from [<c003e4a4>] (worker_thread+0x1a0/0x354)
Jan  7 02:00:18 raspberrypi kernel: [32038.581170] [<c003e4a4>] (worker_thread+0x1a0/0x354) from [<c0042e3c>] (kthread+0x84/0x8c)
Jan  7 02:00:18 raspberrypi kernel: [32038.581197] [<c0042e3c>] (kthread+0x84/0x8c) from [<c000e930>] (kernel_thread_exit+0x0/0x8)
Jan  7 02:00:18 raspberrypi kernel: [32038.581212] INFO: task cat:18732 blocked for more than 120 seconds.
Jan  7 02:00:18 raspberrypi kernel: [32038.581220] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  7 02:00:18 raspberrypi kernel: [32038.581229] cat             D c037a204     0 18732   2185 0x00000000
Jan  7 02:00:18 raspberrypi kernel: [32038.581254] [<c037a204>] (__schedule+0x2bc/0x568) from [<c037aea8>] (schedule_timeout+0x198/0x230)
Jan  7 02:00:18 raspberrypi kernel: [32038.581274] [<c037aea8>] (schedule_timeout+0x198/0x230) from [<c037bb9c>] (__down+0x88/0xc0)
Jan  7 02:00:18 raspberrypi kernel: [32038.581296] [<c037bb9c>] (__down+0x88/0xc0) from [<c004809c>] (down+0x60/0x64)
Jan  7 02:00:18 raspberrypi kernel: [32038.581319] [<c004809c>] (down+0x60/0x64) from [<c001a2fc>] (bcm_mailbox_read+0x60/0x98)
Jan  7 02:00:18 raspberrypi kernel: [32038.581341] [<c001a2fc>] (bcm_mailbox_read+0x60/0x98) from [<c001a3d4>] (bcm_mailbox_property+0xa0/0xe4)
Jan  7 02:00:18 raspberrypi kernel: [32038.581364] [<c001a3d4>] (bcm_mailbox_property+0xa0/0xe4) from [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64)
Jan  7 02:00:18 raspberrypi kernel: [32038.581385] [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64) from [<c0296934>] (temp_show+0x28/0x54)
Jan  7 02:00:18 raspberrypi kernel: [32038.581414] [<c0296934>] (temp_show+0x28/0x54) from [<c02232f8>] (dev_attr_show+0x1c/0x48)
Jan  7 02:00:18 raspberrypi kernel: [32038.581444] [<c02232f8>] (dev_attr_show+0x1c/0x48) from [<c0114ff8>] (sysfs_read_file+0x90/0x134)
Jan  7 02:00:18 raspberrypi kernel: [32038.581468] [<c0114ff8>] (sysfs_read_file+0x90/0x134) from [<c00b88c4>] (vfs_read+0x98/0x16c)
Jan  7 02:00:18 raspberrypi kernel: [32038.581488] [<c00b88c4>] (vfs_read+0x98/0x16c) from [<c00b89d0>] (sys_read+0x38/0x70)
Jan  7 02:00:18 raspberrypi kernel: [32038.581508] [<c00b89d0>] (sys_read+0x38/0x70) from [<c000d980>] (ret_fast_syscall+0x0/0x30)


This is my version of the firmware:

Code: Select all
pi@raspberrypi /var/log $ vcgencmd version
Oct 29 2012 23:13:42
Copyright (c) 2012 Broadcom
version 346980 (release)
pi@raspberrypi /var/log $ uname -a
Linux raspberrypi 3.2.27+ #250 PREEMPT Thu Oct 18 19:03:02 BST 2012 armv6l GNU/Linux


Any attempt to access /sys/class/thermal/thermal_zone0/temp now results in another hung process (but nothing extra written to syslog/kern.log).
Posts: 555
Joined: Mon Jan 16, 2012 12:59 pm
by M33P » Mon Jan 07, 2013 7:20 pm
Aha!

That kernel log is a clue.

2 things are accessing the device (i.e. pushing mail into boxes) at the same time. We have

Jan 7 01:52:18 raspberrypi kernel: [31558.618497] [<c029704c>] (thermal_zone_device_update+0x2c/0x240) from [<c003da30>]

called by a sleeping kernel thread

at nearly the same time as

Jan 7 01:52:18 raspberrypi kernel: [31558.618760] [<c0297e2c>] (bcm2835_get_temp+0x3c/0x64) from [<c0296934>] (temp_show+0x28/0x54)

called by cat.

Note the time difference between the respective accesses - very low! 300us between the two. This is however after a 120 second timeout, but this smells of concurrency issues all over it.

The problem now is that we are in GPU land. The videocore interrupted the ARM to tell it there was a message waiting, but there was no message (==0). Cryptically in vcio.c the associated comment to go with the line in the kernel log is "overflow" - of what?

The associated thermal_zone_device_update is merely checking the temperature at periodic intervals to see if it exceeds a trip threshold - and then halts the processor if it thinks it's on fire.
Posts: 199
Joined: Sun Sep 02, 2012 1:14 pm
by milhouse » Mon Jan 07, 2013 7:29 pm
Yes, that and the invalid values when running concurrent tests would suggest a concurrency issue. I've opened issue #132.

Many thanks for your suggestion to run the little script. I'll also update my firmware to the latest (fingers crossed this time), and assuming I can boot will test again.
Posts: 555
Joined: Mon Jan 16, 2012 12:59 pm
by milhouse » Mon Jan 07, 2013 8:04 pm
I've successfully updated to:

Code: Select all
pi@raspberrypi ~ $ vcgencmd version
Jan  6 2013 20:27:42
Copyright (c) 2012 Broadcom
version 360762 (release)
pi@raspberrypi ~ $ uname -a
Linux raspberrypi 3.6.11+ #348 PREEMPT Tue Jan 1 16:33:22 GMT 2013 armv6l GNU/Linux


And have kicked off the test script...

/waits patiently. :)
Posts: 555
Joined: Mon Jan 16, 2012 12:59 pm
by M33P » Mon Jan 07, 2013 8:17 pm
milhouse wrote:I've successfully updated to:

Code: Select all
pi@raspberrypi ~ $ vcgencmd version
Jan  6 2013 20:27:42
Copyright (c) 2012 Broadcom
version 360762 (release)
pi@raspberrypi ~ $ uname -a
Linux raspberrypi 3.6.11+ #348 PREEMPT Tue Jan 1 16:33:22 GMT 2013 armv6l GNU/Linux


And have kicked off the test script...

/waits patiently. :)



Here's how to speed things up

Code: Select all
X=0; while true; do echo "$(date): $(cat /sys/class/thermal/thermal_zone0/temp) ($(echo $X))"; X=$(($X+1)); done &
X=0; while true; do echo "$(date): $(cat /sys/class/thermal/thermal_zone0/temp) ($(echo $X))"; X=$(($X+1)); done &
X=0; while true; do echo "$(date): $(cat /sys/class/thermal/thermal_zone0/temp) ($(echo $X))"; X=$(($X+1)); done &
X=0; while true; do echo "$(date): $(cat /sys/class/thermal/thermal_zone0/temp) ($(echo $X))"; X=$(($X+1)); done &


Mine died in 10 seconds flat.

Code: Select all
 ~ $ vcgencmd version
Dec 28 2012 11:22:54
Copyright (c) 2012 Broadcom
version 359904 (release)


Although my firmware version is a little out of date...
Posts: 199
Joined: Sun Sep 02, 2012 1:14 pm
by milhouse » Mon Jan 07, 2013 8:35 pm
M33P wrote:Mine died in 10 seconds flat.


Heh, yeah - mine too. I've updated the issue with this fast way to reproduce and the latest stacktrace.
Posts: 555
Joined: Mon Jan 16, 2012 12:59 pm
by stupid_boy » Mon Jan 07, 2013 10:04 pm
this seems very similar to my issue in another thread
viewtopic.php?f=31&t=27908

.... may be location of my thread is wrong, who knows
Posts: 64
Joined: Tue Jan 01, 2013 7:53 pm
by milhouse » Mon Jan 07, 2013 10:10 pm
stupid_boy wrote:this seems very similar to my issue in another thread
viewtopic.php?f=31&t=27908

.... may be location of my thread is wrong, who knows


Looks a little similar. Dom has committed this fix so you might be able to test it yourself soon and see if it helps your hang.
Posts: 555
Joined: Mon Jan 16, 2012 12:59 pm
by M33P » Tue Jan 08, 2013 2:48 am
Well the fix has been posted here and has worked its way through to a built image

https://github.com/raspberrypi/linux/co ... f825c53d30

When I run the torture test now, it -
1) No longer locks up (at least not for 5 minutes)
2) Proceeds a bit slower than previously (higher CPU usage)
3) still occasionally returns either a massive negative number or 85000 (85c)

Are the bogus numbers still due to concurrency? Or some bug in the GPU code that's tickled by rapid access
Posts: 199
Joined: Sun Sep 02, 2012 1:14 pm
by milhouse » Tue Jan 08, 2013 3:19 am
I've also been running the torture test on the latest kernel with mutex fix for the last 4.5 hours and so far so good - no hangs and no kernel stacktraces. I'll leave it going overnight just to be sure.

M33P wrote:Are the bogus numbers still due to concurrency? Or some bug in the GPU code that's tickled by rapid access


I brought the bogus values to the attention of Dom, and in issue #132 he commented:

The bogus values are a different issue to the hang.

The currently disabled message "Failed to get temperature" does get output when we get a spurious value,
so the temperature driver spots the problem.

I could add a retry, which would avoid the problem, but I'd prefer to understand it...


so looks like it's a separate issue (now open as issue #133), the cause of which isn't entirely clear.

At least the main and more serious culprit has been fixed - thanks for your input on this thread!
Posts: 555
Joined: Mon Jan 16, 2012 12:59 pm
by stupid_boy » Tue Jan 08, 2013 1:31 pm
at my side things are even worse now:
viewtopic.php?f=31&t=27908
Posts: 64
Joined: Tue Jan 01, 2013 7:53 pm
by JensKrueger » Mon Apr 14, 2014 6:09 pm
not sure if anyone is still looking at this issue but even about a year after this discussion a fully updated RPi still seems to have the same issue. As pointed out before it seems to be related to some race-condition between the user code and the graphics driver. I found out that if nothing is connected to the GPU (no video / HDMI connection) at boot time the issue disappears so the issue is triggered by something in the GPU driver (as detailed above) even thought the GPU driver may not be problem but only the trigger.

Cheers
Jens
Posts: 14
Joined: Sat Feb 09, 2013 11:53 pm
by dom » Tue Apr 15, 2014 12:41 pm
Are you sure it's still a problem? I had a look after your post and couldn't make it fail.
The original issue has just been closed:
https://github.com/raspberrypi/firmware/issues/133
Raspberry Pi Engineer & Forum Moderator
Raspberry Pi Engineer & Forum Moderator
Posts: 4043
Joined: Wed Aug 17, 2011 7:41 pm
Location: Cambridge
by JensKrueger » Mon Jul 14, 2014 12:20 pm
sorry for responding so late, my SPAM filter must have caught the reply notification.

Now for your question, YES I do still see the issue that the RPi freezes but only if I have a monitor connected. When I run the RPi headless it works perfectly. I should also notice that I've never seen any bogus values. All the values that I get make sense (around 45 °C), it's just that after a while the RPi stops and needs a restart.

Cheers
Jens
Posts: 14
Joined: Sat Feb 09, 2013 11:53 pm