bhjel
Posts: 21
Joined: Tue Jan 01, 2019 2:00 am

Rare comms failure with enc28j60 and compute module "netdev watchdog: eth0 ("enc28j60"): transmit queue 0 timed out"

Tue Feb 05, 2019 7:26 pm

Hello,

We are experiencing an issue that is difficult to reproduce.
"netdev watchdog: eth0 ("enc28j60"): transmit queue 0 timed out"

It appears to be slightly more likely to happen when our software is sending a significant amount of data, but we have also seen it with very little outgoing data. I am posting to find out whether this is a known issue and whether a fix has been introduced in 4.19. We believe this issue did not exist with the last stable Jessie release and surfaced only after our move to 4.14.

4.14.86-v7+ #1175

bhjel
Posts: 21
Joined: Tue Jan 01, 2019 2:00 am

Re: Rare comms failure with enc28j60 and compute module "netdev watchdog: eth0 ("enc28j60"): transmit queue 0 timed out"

Tue Feb 05, 2019 7:53 pm

I can see multiple fixes that were actually merged on Jan 9, so it is possible that this is fixed. At first glance I saw that the commits were made in November, which was before our last update around December 6th.

https://github.com/raspberrypi/linux/co ... -bcm2835.c

Any input is still appreciated.

bhjel
Posts: 21
Joined: Tue Jan 01, 2019 2:00 am

Re: Rare comms failure with enc28j60 and compute module "netdev watchdog: eth0 ("enc28j60"): transmit queue 0 timed out"

Wed Mar 20, 2019 8:43 pm

This was resolved with the listed merge. Upgrading to 4.14.98 resolved the issue.

pretzel11
Posts: 7
Joined: Thu Aug 16, 2018 9:17 pm

Re: Rare comms failure with enc28j60 and compute module "netdev watchdog: eth0 ("enc28j60"): transmit queue 0 timed out"

Tue Oct 08, 2019 7:12 pm

I am seeing the same issue with our CM3 product and the ENC28J60. Upgrading to 4.14.98 did not resolve the issue for me. Upgrading further to 4.19.75 still shows the same issue.

I am actually seeing two issues that I suspect are related. The first is the issue you described with "transmit queue 0 timed out"; the second is that a Tx error occurs (as counted under Tx errors in ifconfig), after which the interface restarts.

I've tried the ENC28J60 on both the default SPI0 pins as well as the alternate 35-39 pins with no change in behavior. The one thing that seems to have the biggest impact is reducing the speed to ~6MHz (instead of the default 12MHz) in config.txt. At that speed, the "transmit queue 0 timed out" error never appears, but the interface will still occasionally restart when a Tx Error occurs.
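For anyone who wants to try the same speed reduction, here is a minimal sketch that renders the config.txt line. The helper name is hypothetical, and the `speed` and `int_pin` parameter names and the GPIO 25 value are assumptions based on the enc28j60 overlay README; check /boot/overlays/README on your own image before relying on them.

```python
def enc28j60_overlay_line(speed_hz: int, int_pin: int = 25) -> str:
    """Render a config.txt line for the enc28j60 overlay at a given SPI clock.

    Assumption: the overlay accepts `int_pin` and `speed` parameters.
    """
    if speed_hz <= 0:
        raise ValueError("speed_hz must be positive")
    return f"dtoverlay=enc28j60,int_pin={int_pin},speed={speed_hz}"


# Roughly half of the 12 MHz default mentioned above.
print(enc28j60_overlay_line(6_000_000))
```

The printed line would go into config.txt in place of the existing enc28j60 overlay line, followed by a reboot.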

I can readily reproduce the "transmit queue 0 timed out" error by uploading a sufficiently large file to my webserver.
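While reproducing the fault, the Tx error counter that ifconfig reports can also be sampled directly from sysfs, which makes it easier to line up each error with the dmesg timestamps. The following is only a sketch: the function names are hypothetical, and it assumes the interface is eth0 and that the standard /sys/class/net/<iface>/statistics/tx_errors file is present.

```python
import time
from pathlib import Path


def read_tx_errors(iface: str = "eth0", sysfs_root: str = "/sys/class/net") -> int:
    """Return the kernel's cumulative tx_errors counter for one interface."""
    counter = Path(sysfs_root) / iface / "statistics" / "tx_errors"
    return int(counter.read_text().strip())


def new_errors(previous: int, current: int) -> int:
    """Errors since the last sample; a smaller value is treated as a counter reset."""
    return current - previous if current >= previous else current


def watch(iface: str = "eth0", interval_s: float = 1.0, samples: int = 60) -> None:
    """Print a line each time tx_errors grows during a bounded sampling window."""
    previous = read_tx_errors(iface)
    for _ in range(samples):
        time.sleep(interval_s)
        current = read_tx_errors(iface)
        delta = new_errors(previous, current)
        if delta:
            stamp = time.strftime("%H:%M:%S")
            print(f"{stamp} {iface}: +{delta} tx_errors (total {current})")
        previous = current
```

Calling `watch("eth0")` in one terminal while uploading the large file in another should show whether Tx errors precede each interface restart.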

Have you continued to avoid this error after updating to 4.14.98, or is it still occurring?

Does anyone else have any suggestions?

Thanks!

bhjel
Posts: 21
Joined: Tue Jan 01, 2019 2:00 am

Re: Rare comms failure with enc28j60 and compute module "netdev watchdog: eth0 ("enc28j60"): transmit queue 0 timed out"

Mon Jan 13, 2020 12:19 am

Seeing it again on 4.19.66 unfortunately. Our INT pin is physically connected, and it appears that downgrading to 4.14.98 resolves the issue. Here is the dmesg log from the crash on 4.19.66:


uname -a
Linux limelight 4.19.66-v7+ #1253 SMP Thu Aug 15 11:49:46 BST 2019 armv7l GNU/Linux


Network Freeze/Crash


[  +0.742006] net eth0: link up - Half duplex
[  +0.000139] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  +0.009115] net eth0: multicast mode
[  +0.000159] net eth0: multicast mode
[  +0.247258] net eth0: link down
[  +0.023042] net eth0: multicast mode
[  +0.000230] net eth0: multicast mode
[  +2.072262] net eth0: link up - Half duplex
[  +0.008249] net eth0: multicast mode
[  +0.000099] net eth0: multicast mode
[ +11.399197] net eth0: multicast mode
[  +1.757080] net eth0: multicast mode
[  +0.000059] net eth0: multicast mode
[  +1.823179] Indeed it is in host mode hprt0 = 00001101
[  +0.209979] usb 1-1: reset high-speed USB device number 2 using dwc_otg
[  +0.000081] Indeed it is in host mode hprt0 = 00001101
[  +0.821238] uart-pl011 3f201000.serial: no DMA platform data
[Jan12 12:15] ------------[ cut here ]------------
[  +0.000019] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x294/0x298
[  +0.000005] NETDEV WATCHDOG: eth0 (enc28j60): transmit queue 0 timed out
[  +0.000003] Modules linked in: sha256_generic cfg80211 rfkill 8021q garp stp llc evdev snd_usb_audio enc28j60 snd_hwdep snd_usbmidi_lib snd_rawmidi spidev uvcvideo snd_seq_device raspberrypi_hwmon hwmon snd_bcm2835(C) snd_pcm bcm2835_v4l2(C) bcm2835_codec(C) snd_timer v4l2_mem2mem snd bcm2835_mmal_vchiq(C) v4l2_common videobuf2_vmalloc videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 videobuf2_common videodev spi_bcm2835 vc_sm_cma(C) media uio_pdrv_genirq uio fixed ip_tables x_tables ipv6
[  +0.000122] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G         C        4.19.66-v7+ #1253
[  +0.000003] Hardware name: BCM2835
[  +0.000017] [<80111f38>] (unwind_backtrace) from [<8010d4b0>] (show_stack+0x20/0x24)
[  +0.000010] [<8010d4b0>] (show_stack) from [<808191e0>] (dump_stack+0xd4/0x118)
[  +0.000011] [<808191e0>] (dump_stack) from [<801209c8>] (__warn+0x104/0x11c)
[  +0.000008] [<801209c8>] (__warn) from [<80120a38>] (warn_slowpath_fmt+0x58/0x74)
[  +0.000008] [<80120a38>] (warn_slowpath_fmt) from [<8073f880>] (dev_watchdog+0x294/0x298)
[  +0.000011] [<8073f880>] (dev_watchdog) from [<80197cfc>] (call_timer_fn+0x3c/0x198)
[  +0.000008] [<80197cfc>] (call_timer_fn) from [<80197f44>] (expire_timers+0xec/0x14c)
[  +0.000008] [<80197f44>] (expire_timers) from [<8019805c>] (run_timer_softirq+0xb8/0x1ec)
[  +0.000008] [<8019805c>] (run_timer_softirq) from [<80102410>] (__do_softirq+0x190/0x3f0)
[  +0.000008] [<80102410>] (__do_softirq) from [<801269a0>] (irq_exit+0xfc/0x120)
[  +0.000011] [<801269a0>] (irq_exit) from [<8017f19c>] (__handle_domain_irq+0x70/0xc4)
[  +0.000009] [<8017f19c>] (__handle_domain_irq) from [<801021b4>] (bcm2836_arm_irqchip_handle_irq+0x60/0xa4)
[  +0.000008] [<801021b4>] (bcm2836_arm_irqchip_handle_irq) from [<801019bc>] (__irq_svc+0x5c/0x7c)
[  +0.000004] Exception stack(0x80d01ee8 to 0x80d01f30)
[  +0.000006] 1ee0:                   80109a84 00000000 40000093 40000093 80d04d70 80d00000
[  +0.000006] 1f00: 80d04db8 00000001 80d8ed3e b77ffa00 80c64a38 80d01f44 80d0517c 80d01f38
[  +0.000005] 1f20: 00000000 80109a88 40000013 ffffffff
[  +0.000009] [<801019bc>] (__irq_svc) from [<80109a88>] (arch_cpu_idle+0x34/0x4c)
[  +0.000009] [<80109a88>] (arch_cpu_idle) from [<80836234>] (default_idle_call+0x34/0x48)
[  +0.000010] [<80836234>] (default_idle_call) from [<80152680>] (do_idle+0xec/0x17c)
[  +0.000010] [<80152680>] (do_idle) from [<801529d0>] (cpu_startup_entry+0x28/0x2c)
[  +0.000009] [<801529d0>] (cpu_startup_entry) from [<8082f8e8>] (rest_init+0xbc/0xc0)
[  +0.000012] [<8082f8e8>] (rest_init) from [<80c00fb0>] (start_kernel+0x484/0x4b4)
[  +0.000005] ---[ end trace 3b7b355c19f9091b ]---
[  +0.003163] net eth0: link down
[  +0.023577] net eth0: multicast mode
[  +0.000369] net eth0: multicast mode
[  +0.000839] net eth0: multicast mode
[  +0.034879] net eth0: link up - Half duplex
[  +0.108218] net eth0: multicast mode
[  +0.000564] net eth0: multicast mode
[  +5.780478] net eth0: multicast mode
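To compare failure frequency across kernel versions, a log like the one above can be tallied automatically. This is a minimal sketch with a hypothetical function name; it assumes the message formats match the ones in this log (the "NETDEV WATCHDOG" line and the enc28j60 "link up" / "link down" lines) and does nothing more than count them.

```python
import re
from collections import Counter

# Matches e.g. "NETDEV WATCHDOG: eth0 (enc28j60): transmit queue 0 timed out"
WATCHDOG = re.compile(
    r"NETDEV WATCHDOG: (\S+) \(([^)]+)\): transmit queue (\d+) timed out"
)


def summarize_dmesg(log_text: str) -> Counter:
    """Count watchdog timeouts and link transitions in dmesg text."""
    counts = Counter()
    for line in log_text.splitlines():
        if WATCHDOG.search(line):
            counts["watchdog_timeout"] += 1
        elif "link down" in line:
            counts["link_down"] += 1
        elif "link up" in line:
            counts["link_up"] += 1
    return counts
```

Feeding it the saved output of `dmesg` from each kernel under test gives a per-run count that is easier to compare than reading the logs by eye.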


bhjel
Posts: 21
Joined: Tue Jan 01, 2019 2:00 am

Re: Rare comms failure with enc28j60 and compute module "netdev watchdog: eth0 ("enc28j60"): transmit queue 0 timed out"

Tue Jan 14, 2020 7:23 pm

This coincides with an error 11 followed by an error 9 from our select() call. It seems like one of the SPI driver commits after Jan 9 has introduced a new bug. Has anyone else experienced this?
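If those two numbers are errno values on Linux (an assumption, since the post does not say so), they correspond to EAGAIN (11) and EBADF (9). A small sketch for turning such codes into names when logging; the helper names are hypothetical:

```python
import errno
import os


def errno_name(code: int) -> str:
    """Symbolic name for an errno value, or "UNKNOWN" if the platform has none."""
    return errno.errorcode.get(code, "UNKNOWN")


def describe_errno(code: int) -> str:
    """Human-readable one-liner combining the number, name and system message."""
    return f"{code} {errno_name(code)}: {os.strerror(code)}"


# The two codes mentioned above; names and messages are platform-dependent.
for code in (11, 9):
    print(describe_errno(code))
```

EBADF from select() generally means that one of the descriptors in the set is no longer open, so it may be worth checking whether the application closes its socket when the interface restarts.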

I've multiplied our network transmit calls to try to reproduce the issue, but it almost seems to happen more often at 2 Mbit/s of transmit bandwidth than at 4 Mbit/s. Could this be because we are beyond the SPI DMA range when we increase our bandwidth, and the race occurs with DMA?
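As a rough way to put those bandwidth figures in context, the offered load can be compared with the raw SPI clock. This sketch assumes one payload bit per SPI clock cycle and ignores command bytes, gaps between transfers and any DMA threshold in the driver, so it gives only a lower bound on bus usage and says nothing about the DMA question itself.

```python
def spi_utilization(offered_bits_per_s: float, spi_clock_hz: float) -> float:
    """Fraction of raw SPI clock cycles needed to carry the offered payload.

    Assumption: one payload bit per clock cycle, no protocol overhead.
    """
    if spi_clock_hz <= 0:
        raise ValueError("spi_clock_hz must be positive")
    return offered_bits_per_s / spi_clock_hz


# SPI clocks and transmit rates mentioned in this thread.
for clock_hz in (12_000_000, 6_000_000):
    for load_bps in (2_000_000, 4_000_000):
        share = spi_utilization(load_bps, clock_hz)
        print(f"{clock_hz // 1_000_000} MHz SPI, "
              f"{load_bps // 1_000_000} Mbit/s payload: {share:.0%} of raw clock")
```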

Thanks

bhjel
Posts: 21
Joined: Tue Jan 01, 2019 2:00 am

Re: Rare comms failure with enc28j60 and compute module "netdev watchdog: eth0 ("enc28j60"): transmit queue 0 timed out"

Tue Jan 14, 2020 11:12 pm

Downgrading to 4.19.50 - same result
Downgrading to 4.19.40 - same result

PhilE
Raspberry Pi Engineer & Forum Moderator
Posts: 2523
Joined: Mon Sep 29, 2014 1:07 pm
Location: Cambridge

Re: Rare comms failure with enc28j60 and compute module "netdev watchdog: eth0 ("enc28j60"): transmit queue 0 timed out"

Wed Jan 15, 2020 10:29 am

4.19 and 4.14 are virtually identical with regard to the spi-bcm2835 driver. 4.19 has a bug fix for 3-wire mode and supports shared interrupts, but that's it. It's more likely that a change elsewhere in the kernel (interrupt handling, for example) could have caused a regression.

One option for you is to build your own kernel - it's not difficult if you follow our guide. This will allow you to use the 4.14 version of the driver and confirm that it is still broken (or not). It would then be a matter of looking for other differences from 4.14, but that could be a lengthy process. Alternatively you could send us some hardware and detailed instructions on how to reproduce the fault, and we could focus on the failure mechanism itself.

bhjel
Posts: 21
Joined: Tue Jan 01, 2019 2:00 am

Re: Rare comms failure with enc28j60 and compute module "netdev watchdog: eth0 ("enc28j60"): transmit queue 0 timed out"

Wed Jan 15, 2020 4:35 pm

We will send a unit over and try to work this out in parallel.

I performed a diff of the driver in 4.14.96 and 4.19.50 and confirmed that they are identical. I also confirmed that the issue does exist in 4.14.96, although it does seem to happen less frequently (sometimes 30 minutes between failures). I'm going to try the 4.19.50 kernel again and check its failure frequency. We checked our 3.3V line just to be sure, and we confirmed that it is perfectly smooth across hardware models regardless of the input voltage.

In the meantime, I'll go through the changes in headers in the SPI driver and see if I can find anything that looks suspect.
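The driver comparison described above can be repeated for any other file, including the headers, with a few lines of Python. This is a sketch only: the function name is hypothetical, and the checkout paths in the usage note are placeholders, not paths taken from this thread.

```python
import difflib
from pathlib import Path
from typing import List


def source_diff(old_path: str, new_path: str) -> List[str]:
    """Unified diff between two source files; an empty list means identical text."""
    old = Path(old_path).read_text().splitlines(keepends=True)
    new = Path(new_path).read_text().splitlines(keepends=True)
    return list(difflib.unified_diff(old, new, fromfile=old_path, tofile=new_path))
```

For example, `source_diff("linux-4.14/drivers/spi/spi-bcm2835.c", "linux-4.19/drivers/spi/spi-bcm2835.c")` would return the changed hunks, assuming both kernel trees are checked out under those placeholder directory names.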
