USB/serial converters (not very) broken on Pi


by M33P » Wed Oct 31, 2012 10:07 pm
Welp. Cannot reproduce the bug with the latest (28-10-2012) raspbian image.

Code:
Linux raspberrypi 3.2.27+ #250 PREEMPT Thu Oct 18 19:03:02 BST 2012 armv6l GNU/Linux
firmware:
Oct 25 2012 16:37:21
Copyright (c) 2012 Broadcom
version 346337 (release)


Note that the critical parameter that "fixed" it for me was the NAK holdoff fix implemented in this commit.

To check if it is enabled during boot, do a dmesg | grep dwc.

I get the feeling that the bug isn't squashed, it's merely hiding because of various other factors reducing the stress on the 30,000-line USB driver.
by tedh » Thu Nov 01, 2012 3:05 am
M33P, Thanks for digging into the issue. (very impressive)
I updated to the latest image and firmware
"Linux raspberrypi 3.2.27+ #250 PREEMPT Thu Oct 18 19:03:02 BST 2012 armv6l GNU/Linux"

My system still hangs after about 4 hours
Did you modify the cmdline.txt or any other files?
Just wondering if I needed to change any other files.

Thanks again
by winstonma » Thu Nov 01, 2012 6:51 am
M33P, I had the same issue as well.

After updating the image to 10-28, the same freeze occurs. It seems that the latest build didn't help stability.

Also, I tried the D2XX driver provided on FTDI's official web site, replacing the ftdi_sio driver with that library. However, the system still freezes with the DMA issue.
by M33P » Thu Nov 01, 2012 10:07 am
tedh wrote:M33P, Thanks for digging into the issue. (very impressive)
I updated to the latest image and firmware
"Linux raspberrypi 3.2.27+ #250 PREEMPT Thu Oct 18 19:03:02 BST 2012 armv6l GNU/Linux"

My system still hangs after about 4 hours
Did you modify the cmdline.txt or any other files?
Just wondering if I needed to change any other files.

Thanks again


Can you please post
- Manufacturer/type of serial converter chip
- Connection method - via hub or direct
- Usage case (what app is using the serial converter / what is the serial output being monitored)?
- Other concurrent things happening on the pi (processing, network activity etc)
by tedh » Thu Nov 01, 2012 11:42 pm
Can you please post
- Manufacturer/type of serial converter chip
** I have a UartSBee (usb xbee adapter) connected to the USB port. It uses the FT232RL
usb-serial chip.

- Connection method - via hub or direct
** It is directly connected

- Usage case (what app is using the serial converter / what is the serial output being monitored)?

I wrote a C program that sends a request across the XBee to the remote device,
and the remote device sends a reply of about 64 bytes (for a total of approx. 45 Kbytes)
back to the host (Raspberry Pi). The cycle repeats every minute.
[The program runs fine on my PC running Ubuntu.]
The USB/serial rate is set to 57600.

- Other concurrent things happening on the pi (processing, network activity etc)
The Pi has X started (it doesn't need to be, but X was running each time I tried it).
I have the Pi connected to the network (via a cable), but outside of the OS there
are no other programs using the network.
by winstonma » Fri Nov 02, 2012 3:06 am
Hi M33P

- Manufacturer/type of serial converter chip
FT232RQ

- Connection method - via hub or direct
Both hub and direct connection cause failure

- Usage case (what app is using the serial converter / what is the serial output being monitored)?
Hardware-wise, the RPi connects to a Microchip PIC via the FT232RQ.
Software-wise, there are two ways of access:
1. Using the ftdi_sio driver included in the Raspbian image, then using pyserial to do I/O
2. Using the D2XX library for RPi provided by FTDI, then running a C program (also provided by FTDI) to read the FTDI register.

Both methods lead to the DMA error, so I believe it is a kernel problem rather than something else.

- Other concurrent things happening on the pi (processing, network activity etc)
No unstable behavior is being observed
by tedh » Fri Nov 02, 2012 3:21 am
Ref post #11 (by M33P » Sun Sep 09, 2012 9:17 am )
ChHltd set, but reason for halting is unknown, hcint 0x00000402,
Looking at the hcint_data_t 0x00000402 shows that the channel was halted due to
data toggle error.
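
For reference, that decode can be sketched in a few lines of C. This is only an illustration: the bit order is assumed to follow the hcint_data_t bitfield in the driver's register header, and hcint_flag_set() and hcint_print() are made-up helper names, not driver functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Flag names for HCINT bits 0..10, lowest bit first.
 * ASSUMPTION: the order follows the hcint_data_t bitfield in the driver. */
static const char *const hcint_bit_names[] = {
    "xfercomp", "chhltd", "ahberr", "stall", "nak", "ack",
    "nyet", "xacterr", "bblerr", "frmovrun", "datatglerr",
};

#define HCINT_NUM_BITS (sizeof(hcint_bit_names) / sizeof(hcint_bit_names[0]))

/* Hypothetical helper: returns 1 if the named flag is set in hcint, else 0. */
int hcint_flag_set(uint32_t hcint, const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < HCINT_NUM_BITS; i++) {
        if (strcmp(hcint_bit_names[i], name) == 0)
            return (int)((hcint >> i) & 1u);
    }
    return 0; /* unknown flag name */
}

/* Hypothetical helper: prints the name of every flag set in hcint. */
void hcint_print(uint32_t hcint)
{
    for (size_t i = 0; i < HCINT_NUM_BITS; i++) {
        if ((hcint >> i) & 1u)
            printf("%s ", hcint_bit_names[i]);
    }
    printf("\n");
}
```

Under that assumed layout, 0x00000402 is bit 1 (0x002, chhltd) plus bit 10 (0x400, datatglerr), so hcint_print(0x00000402) names exactly those two flags.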

I'm looking to see how a "data toggle error" is created/defined.
by arsi » Fri Nov 02, 2012 5:45 pm
Hello!

I managed to partially solve the problem of the high number of interrupts with the PL2303 driver.

I canceled the initialization of the interrupt endpoint and it works! ;)
The interrupt handler in this driver only refreshes the line status of the COM port, so flow control stops working. But if you are not using flow control, it's a good deal.

Arsi

pl2303.c
Code:
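 static int pl2303_open(struct tty_struct *tty, struct usb_serial_port *port)
 {
   /* ... rest of pl2303_open() unchanged ... */

   /* Interrupt URB submission commented out: the interrupt endpoint is
    * no longer polled, but line status (and so flow control) is no
    * longer updated either. */
   dbg("%s - submitting interrupt urb", __func__);
//   result = usb_submit_urb(port->interrupt_in_urb, GFP_KERNEL);
//   if (result) {
//      dev_err(&port->dev, "%s - failed submitting interrupt urb,"
//         " error %d\n", __func__, result);
//      pl2303_close(port);
//      return -EPROTO;
//   }
   port->port.drain_delay = 256;
   return 0;
 }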
by tedh » Sat Nov 03, 2012 2:10 am
Arsi, Thanks for the info. I'll give it a try
by arsi » Sat Nov 03, 2012 10:27 am
Btw.
The data from the pl2303 data sheet:

Endpoint 1 Descriptor: Interrupt Input Endpoint
bInterval Byte 01h Polling at a 1 ms interval :( That means 1000 int/s, plus 1000 more from the driver's replies

The chip maker does not know about event-driven interrupts ;)
by gsh » Sat Nov 03, 2012 9:35 pm
The chip maker does know quite a bit about interrupts. Unfortunately, the module maker decided against implementing it that way for relatively logical reasons, and we're now paying for it.

I've found some interesting stuff; please continue the discussion over on the bug tracker:

https://github.com/raspberrypi/linux/issues/151
Moderator
by M33P » Sun Nov 04, 2012 11:25 pm
I have managed to reproduce the breakage (a DMA handler infinite loop) just once:
- Moxa Uport 1110 connected via null-modem to desktop PC with MOSCHIP based PCIe-serial adapter
- Commence a zmodem file transfer at 115200bps between the two. This almost immediately broke with the DMA handler infinite loop.

I've not been able to replicate it again - need to find out what changes.

It seems that the UART speed is a factor in this breakage. I have not managed to get it to happen even after 10 hours with the port open and transferring data at 4800 baud.

FYI I get an interrupt rate of >11,000 per second with the serial port adapter "open" (Moxa, PL2303, FT232 untested yet). This is because that interrupt URB submitted when the port is opened results in the host controller sending out a packet to "poll" the interrupt on the device every microframe - 125us, which results in the device sending packets back each time. The total rate is slightly higher than this for unknown reasons...
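
The arithmetic behind these polling rates can be sketched as follows. This is a rough illustration, not driver code: interrupt_period_us() and polls_per_second() are made-up names, and the bInterval rules are the USB 2.0 ones (1 ms frames at full speed, 2^(bInterval-1) microframes of 125 us at high speed).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: polling period of an interrupt endpoint in
 * microseconds, derived from its bInterval descriptor field.
 * Full speed: bInterval counts 1 ms frames.
 * High speed: period is 2^(bInterval-1) microframes of 125 us each. */
uint32_t interrupt_period_us(uint8_t b_interval, int is_high_speed)
{
    assert(b_interval >= 1);
    if (is_high_speed) {
        assert(b_interval <= 16);
        return (1u << (b_interval - 1)) * 125u;
    }
    return (uint32_t)b_interval * 1000u;
}

/* Hypothetical helper: number of polls per second for a given period. */
uint32_t polls_per_second(uint32_t period_us)
{
    assert(period_us > 0);
    return 1000000u / period_us;
}
```

With bInterval = 01h at full speed this gives a 1000 us period and 1000 polls per second, and polling every 125 us microframe gives 8000 per second. That 8000 is only the microframe rate; it does not by itself account for the >11,000 per second measured here.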
by tedh » Mon Nov 05, 2012 3:21 am
I've been trying to get the Linux source so I can compile the kernel and do some debugging.
I'm having a problem with "git clone https://github.com/raspberrypi/linux.git"

I get the following message
Cloning into 'linux'...

It just sits there for hours.
Does anyone know if GitHub is having problems?

Or is there another way to get the source and do a compile?
by jamesh » Mon Nov 05, 2012 8:36 am
It's a lot of data. Are you sure it's not doing anything? Can you try the verbose clone option (I can't remember the switch), or monitor the network activity to see if it's downloading?
Moderator
by M33P » Mon Nov 05, 2012 1:17 pm
tedh: substitute https:// with git:// in your URL.

I have found out why this DMA handler hangs forever - it never bothers to check for data toggle errors before falling through to the "catch-all" message!

Data toggle errors are quite benign as far as USB goes - the host or device will simply "catch up" if more data or handshake packets are sent. They would also be the most likely to occur in a situation where the host is being spammed with interrupts while the device is trying to do a transaction.

dwc_otg_hcd_intr.c:2140 -
In here is the handling for a channel halted via if/elseif (whatever happened to switch/case?) - datatglerror is notably absent, therefore never gets dealt with. The infinite loop occurs if we get a data toggle error on a BULK or CONTROL endpoint.

dwc_otg_hcd_intr.c:1918 -
The error handling for datatgl is rather amusing - it just disables the interrupt and continues.

I will post a patch later adding the necessary handling for data toggle errors and (possibly) simply clearing the data toggle bit in the register rather than disabling it.
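
To illustrate the fall-through, here is a much-simplified model of that if/else-if chain. It is a sketch under assumptions, not the driver's code: the HCINT bit positions are assumed to match hcint_data_t, most branches are left out, and classify_halt() with its check_datatglerr switch is a made-up function for comparing the chain with and without a data toggle branch.

```c
#include <assert.h>
#include <stdint.h>

/* HCINT bits used in this model.
 * ASSUMPTION: same layout as the driver's hcint_data_t bitfield. */
#define HCINT_CHHLTD     (1u << 1)
#define HCINT_STALL      (1u << 3)
#define HCINT_XACTERR    (1u << 7)
#define HCINT_BBLERR     (1u << 8)
#define HCINT_FRMOVRUN   (1u << 9)
#define HCINT_DATATGLERR (1u << 10)

enum halt_reason {
    HALT_STALL,
    HALT_XACTERR,
    HALT_BABBLE,
    HALT_FRMOVRUN,
    HALT_DATATGLERR,
    HALT_UNKNOWN       /* the "reason for halting is unknown" catch-all */
};

/* Hypothetical model of the channel-halted dispatch.
 * check_datatglerr = 0 models a chain with no data toggle branch;
 * check_datatglerr = 1 models a chain that has one. */
enum halt_reason classify_halt(uint32_t hcint, int check_datatglerr)
{
    assert(hcint & HCINT_CHHLTD); /* only called for a halted channel */

    if (hcint & HCINT_STALL)
        return HALT_STALL;
    else if (hcint & HCINT_XACTERR)
        return HALT_XACTERR;
    else if (hcint & HCINT_BBLERR)
        return HALT_BABBLE;
    else if (hcint & HCINT_FRMOVRUN)
        return HALT_FRMOVRUN;
    else if (check_datatglerr && (hcint & HCINT_DATATGLERR))
        return HALT_DATATGLERR;

    return HALT_UNKNOWN;
}
```

In this model, classify_halt(0x00000402, 0) lands in HALT_UNKNOWN, while classify_halt(0x00000402, 1) returns HALT_DATATGLERR.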
by jamesh » Mon Nov 05, 2012 1:22 pm
M33P wrote:tedh: substitute https:// with git:// in your URL.

I have found out why this DMA handler hangs forever - it never bothers to check for data toggle errors before falling through to the "catch-all" message!

Data toggle errors are quite benign as far as USB goes - the host or device will simply "catch up" if more data or handshake packets are sent. They would also be the most likely to occur in a situation where the host is being spammed with interrupts while the device is trying to do a transaction.

dwc_otg_hcd_intr.c:2140 -
In here is the handling for a channel halted via if/elseif (whatever happened to switch/case?) - datatglerror is notably absent, therefore never gets dealt with. The infinite loop occurs if we get a data toggle error on a BULK or CONTROL endpoint.

dwc_otg_hcd_intr.c:1918 -
The error handling for datatgl is rather amusing - it just disables the interrupt and continues.

I will post a patch later adding the necessary handling for data toggle errors and (possibly) simply clearing the data toggle bit in the register rather than disabling it.


Can you PM gsh with your results, just to make sure he gets them? Thanks for the efforts here, much appreciated.
Moderator
by gsh » Mon Nov 05, 2012 5:32 pm
Yeah,

I'd be interested in looking at your results, I also came to the same conclusion previously but haven't made any changes yet...

The documentation for the module never suggests that the channel halted interrupt can be triggered by a data toggle error, though (but this wouldn't be the first time the docs were wrong!)

Gordon
Moderator
by winstonma » Tue Nov 06, 2012 9:36 am
Hi all,

I posted a kernel issue on GitHub:
https://github.com/raspberrypi/linux/issues/40

The fix has been submitted and released. I updated the kernel (to version #257) and the DMA issue is gone.

Please use rpi-update to update your kernel and see if it works for you guys. Good luck.
by gsh » Tue Nov 06, 2012 8:22 pm
If you're happy with the fix can you please close the issue?

Thanks

Gordon
Moderator
by M33P » Tue Nov 06, 2012 8:39 pm
gsh wrote:Yeah,

I'd be interested in looking at your results, I also came to the same conclusion previously but haven't made any changes yet...

The documentation for the module never suggests that the channel halted interrupt can be triggered by a data toggle error, though (but this wouldn't be the first time the docs were wrong!)

Gordon


Data toggle errors can occur at any stage of a USB transaction past the initial packet, IN or OUT where multiple data packets are exchanged. The interrupt handler for this even spits out a debug message to that effect.
In USB terms an error during a BULK or INTERRUPT transfer matters little - resending the current packet in the case of OUT or re-requesting a packet via NAK for an IN will result in re-synchronization.

ISOC transfers don't use data toggles except for high-speed multi-packet ISOC transfers which use an analogue of it to detect but not correct errors.

CONTROL data toggle errors are more serious - if they occur during query/setup phase of a device, you essentially have to redo the transaction from start. In this case you would want to either retry the transaction and the device should restart automatically (as per spec) or the handler should signal to the USB device driver that their control URB failed - which it may decide to retry.

The fact is that the DWC_OTG driver caters for precisely 0 of these conditions - does the underlying silicon handle this? If so, how much, given that software intervention is implied to be required in certain cases?

Yay for the patch - but it might be a while before I consider the bug squashed in my case, as I have only seen it once since the NAK holdoff fix...

How does this patch fix the issue? I see only a mod to the top-level IRQ handler to do with a somewhat cryptic reset of the MPHI interrupt every 60 FIQs...
by tedh » Thu Nov 08, 2012 2:05 am
I updated my raspberry pi to version #257. My program ran for about 9 hours before it hung.
My thought at the time was that it was the same problem (it still may be). But I just realized that
I didn't expand the memory on the SD-Card and the hang may have been due to the system
running out of memory.

I'm in the process of expanding the memory and I'm going to restart my program again.
If it hangs again, I'll get my system set up to capture the kernel messages and go from there.

Thanks to everyone who has been looking into this issue.

NOTE: git https://github.com/...... and git git://github.com/.... didn't work for me;
git http://github.com/..... did work.
I'll post my results when I have them.
by ausserirdischegesund » Fri Nov 09, 2012 2:10 pm
I am currently stress testing my Raspi with two serial converters feeding data at different rates simultaneously: one of the internal USB ports is connected to an FTDI cable connected to a GPS giving NMEA sentences at 4800 baud,
while the other is connected to an Arduino running the "ASCII-Table" example sketch at 1200 baud in an infinite loop.

This "torture test" has been running for a few hours now without any problem. Data are read with the "cu" utility over SSH, so Ethernet is active as well.

Code:
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 002: ID 0424:9512 Standard Microsystems Corp.
Bus 001 Device 003: ID 0424:ec00 Standard Microsystems Corp.
Bus 001 Device 006: ID 2341:0043 Arduino SA Uno R3 (CDC ACM)
Bus 001 Device 007: ID 0403:6001 Future Technology Devices International, Ltd FT232 USB-Serial (UART) IC

ralph@pi:~$ uname -a
Linux pi 3.2.27+ #257 PREEMPT Mon Nov 5 00:01:55 GMT 2012 armv6l GNU/Linux
root@pi:/home/ralph# vcgencmd version
Oct 31 2012 17:56:35
Copyright (c) 2012 Broadcom
version 347413 (release)


So for me at least serial seems stable.
by M33P » Fri Nov 09, 2012 8:29 pm
For lack of time, I am just going to throw this out there:
I have patched the interrupt handling code with modifications posted below.

Code:
diff --git a/drivers/usb/host/dwc_otg/dwc_otg_hcd_intr.c b/drivers/usb/host/dwc_otg/dwc_otg_hcd_intr.c
index 3e762e2..df03c3f 100644
--- a/drivers/usb/host/dwc_otg/dwc_otg_hcd_intr.c
+++ b/drivers/usb/host/dwc_otg/dwc_otg_hcd_intr.c
@@ -1918,8 +1918,32 @@ static int32_t handle_hc_datatglerr_intr(dwc_otg_hcd_t * hcd,
                                         dwc_otg_hc_regs_t * hc_regs,
                                         dwc_otg_qtd_t * qtd)
 {
+       /* A data toggle error in a BULK or INTR transaction is benign and continuing
+        * the transaction will (as per USB spec) result in resynchronisation.
+        * In DMA mode the channel may also be halted automatically by the host -
+         * Therefore there is nothing to do here but cleanup host-side and try again
+        */
+       // FIXME: This code is for TEST PURPOSES to solve infinite looping in an interrupt
+       char * eptype;
        DWC_DEBUGPL(DBG_HCDI, "--Host Channel %d Interrupt: "
                    "Data Toggle Error--\n", hc->hc_num);
+       switch (hc->ep_type) {
+               case DWC_OTG_EP_TYPE_BULK:
+                       eptype = "BULK";
+                       break;
+               case DWC_OTG_EP_TYPE_INTR:
+                       eptype = "INTERRUPT";
+                       break;
+               case DWC_OTG_EP_TYPE_ISOC:
+                       eptype = "ISOCHRONOUS";
+                       break;
+               case DWC_OTG_EP_TYPE_CONTROL:
+                       eptype = "CONTROL";
+                       break;
+               default:
+                       eptype = "NO IDEA";
+       }
+       DWC_ERROR("Data Toggle Error - on endpoint type %s\n", eptype);

        if (hc->ep_is_in) {
                qtd->error_count = 0;
@@ -1927,9 +1951,10 @@ static int32_t handle_hc_datatglerr_intr(dwc_otg_hcd_t * hcd,
                DWC_ERROR("Data Toggle Error on OUT transfer,"
                          "channel %d\n", hc->hc_num);
        }
-
+
+       /* No choice but to disable and restart DMA channel as core has halted */
        disable_hc_int(hc_regs, datatglerr);
-
+       halt_channel(hcd, hc, qtd, DWC_OTG_HC_XFER_NO_HALT_STATUS);
        return 1;
 }

@@ -2078,6 +2103,8 @@ static void handle_hc_chhltd_intr_dma(dwc_otg_hcd_t * hcd,
                handle_hc_babble_intr(hcd, hc, hc_regs, qtd);
        } else if (hcint.b.frmovrun) {
                handle_hc_frmovrun_intr(hcd, hc, hc_regs, qtd);
+       } else if (hcint.b.datatglerr) {
+               handle_hc_datatglerr_intr(hcd, hc, hc_regs, qtd);
        } else if (!out_nak_enh) {
                if (hcint.b.nyet) {
                        /*
@@ -2120,6 +2147,7 @@ static void handle_hc_chhltd_intr_dma(dwc_otg_hcd_t * hcd,
                                halt_channel(hcd, hc, qtd,
                                             DWC_OTG_HC_XFER_PERIODIC_INCOMPLETE);
                        } else {
+                               /* BULK or CONTROL */
                                DWC_ERROR
                                    ("%s: Channel %d, DMA Mode -- ChHltd set, but reason "
                                     "for halting is unknown, hcint 0x%08x, intsts 0x%08x\n",
@@ -2127,6 +2155,8 @@ static void handle_hc_chhltd_intr_dma(dwc_otg_hcd_t * hcd,
                                     DWC_READ_REG32(&hcd->
                                                    core_if->core_global_regs->
                                                    gintsts));
+                               dump_stack();
+                               halt_channel(hcd, hc, qtd, DWC_OTG_HC_XFER_NO_HALT_STATUS);
                        }

                }


For interested parties, could you please apply this (e.g. with git apply) to a recent pull of 3.2.27+ and then build your kernel? If it fails to apply, it should be fairly obvious from the posted code where bits of it go.

This simply adds some sanity to the checks and recovery from the DMA channel halted condition noted previously.
by kumme74 » Mon Nov 19, 2012 9:12 am
Does anyone have experience with this patch?

I will build my kernel this week and hope that I will get rid of this behaviour.

Klaus
by markushx » Thu Nov 22, 2012 7:53 am
That patch (adapted and compiled for 3.6.1+) works for my device with an FT232BM. Thanks!