linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* serious 2.6 bug in USB subsystem?
@ 2003-10-27 22:35 David Mosberger
  2003-10-28  1:30 ` Greg KH
  2004-03-03 12:33 ` Wouter Lueks
  0 siblings, 2 replies; 45+ messages in thread
From: David Mosberger @ 2003-10-27 22:35 UTC (permalink / raw)
  To: vojtech; +Cc: linux-usb-devel, linux-kernel

One-line summary: plug-in your USB keyboard, see your machine die.

So, I have this non-name USB keyboard (with built-in 2-port USB hub)
which reliably crashes 2.6.0-test{8,9} on both x86 and ia64.  In
retrospect, it's clear to me that the same keyboard also occasionally
crashes 2.4 kernels, but there the problem appears more seldom.
Perhaps once in 10 reboots and once the machine is booted and the
keyboard is running, it keeps on working.  The keyboard in question is
a BTC 5141H.

I'm wondering if anybody out there has the same keyboard and could see
if the same problem exists with all 5141H keyboards.

In any case, even if the keyboard happened to be completely broken, I
don't think it should be possible to crash the kernel merely by
connecting a (potentially) bad USB device, so, regardless, there seems
to be something off in the 2.6 USB subsystem.

Here is a more detailed description of how to reproduce the problem
and the behavior that I'm seeing with 2.6 (I don't think the exact
kernel version matters much; it definitely happens both with
2.6.0-test8 and 2.6.0-test9):

 1) disconnect the 5141H keyboard
 2) power on the machine
 3) boot into Linux 2.6
 4) connect the keyboard
 5) see the normal "USB HID" keyboard connection message
 6) wait a few seconds
 7) machine is dead

On x86, I see this message prior to the death of the machine:

 drivers/usb/input/hid-core.c: ctrl urb status -104 received
 drivers/usb/input/hid-core.c: timeout initializing reports

On ia64, the keyboards triggers an MCA apparently because the USB
controller ends up trying to read from physical address 0xf0000000,
which is a write-only area for the box in question.

I don't have a lot of experience with debugging the USB stack, but if
there is something in particular you want me to try, let me know.

Thanks,

	--david

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: serious 2.6 bug in USB subsystem?
  2003-10-27 22:35 serious 2.6 bug in USB subsystem? David Mosberger
@ 2003-10-28  1:30 ` Greg KH
  2003-10-28  3:00   ` David Mosberger
  2004-03-03 12:33 ` Wouter Lueks
  1 sibling, 1 reply; 45+ messages in thread
From: Greg KH @ 2003-10-28  1:30 UTC (permalink / raw)
  To: davidm; +Cc: vojtech, linux-usb-devel, linux-kernel

On Mon, Oct 27, 2003 at 02:35:09PM -0800, David Mosberger wrote:
> One-line summary: plug-in your USB keyboard, see your machine die.

Any chance to know where the machine dies?  Any oops you can help us out
with?

> So, I have this non-name USB keyboard (with built-in 2-port USB hub)
> which reliably crashes 2.6.0-test{8,9} on both x86 and ia64.  In
> retrospect, it's clear to me that the same keyboard also occasionally
> crashes 2.4 kernels, but there the problem appears more seldom.
> Perhaps once in 10 reboots and once the machine is booted and the
> keyboard is running, it keeps on working.  The keyboard in question is
> a BTC 5141H.

If you do not load the HID driver, and disable automatic loading of the
hid driver (echo '/sbin/true' > /proc/sys/kernel/hotplug) and plug in
the device, does it still crash?

If not, can you get us the output of /proc/bus/usb/devices and lsusb
with the device plugged in?

If not, does then loading the hid driver cause the problem?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: serious 2.6 bug in USB subsystem?
  2003-10-28  1:30 ` Greg KH
@ 2003-10-28  3:00   ` David Mosberger
  2003-10-30 15:11     ` [linux-usb-devel] " David Brownell
  0 siblings, 1 reply; 45+ messages in thread
From: David Mosberger @ 2003-10-28  3:00 UTC (permalink / raw)
  To: Greg KH; +Cc: davidm, vojtech, linux-usb-devel, linux-kernel

>>>>> On Mon, 27 Oct 2003 17:30:13 -0800, Greg KH <greg@kroah.com> said:

  Greg> On Mon, Oct 27, 2003 at 02:35:09PM -0800, David Mosberger
  Greg> wrote:

  >> One-line summary: plug-in your USB keyboard, see your machine
  >> die.

  Greg> Any chance to know where the machine dies?  Any oops you can
  Greg> help us out with?

On x86, there is no OOps, it just freezes.  On ia64, I get a nice MCA
and from that we can infer that a USB host controller read from
address 0xf0000000 caused the problem but since this is asynchronous
to the kernel's code path, the instruction pointer etc. in the MCA
state dump isn't terribly helpful.  Actually, just now I got a case
where the ia64 machine hung (it does this occasionally, instead of
MCAing).  In this case, I got similar "timeout" messages as on the x86
box and the backtrace looked like this:

Call Trace:
 [<a00000020004ea70>] hidinput_connect+0x150/0x2ce0 [hid]
 [<a00000020004ad30>] hid_probe+0x1290/0x1ac0 [hid]
 [<a000000100557f40>] usb_probe_interface+0x100/0x240
 [<a000000100388e40>] bus_match+0xc0/0x140
 [<a000000100389490>] driver_attach+0x110/0x1a0
 [<a000000100389600>] bus_add_driver+0xe0/0x200
 [<a00000010038a600>] driver_register+0xc0/0xe0
 [<a000000100556310>] usb_register+0x110/0x1c0
 [<a00000020005c0e0>] hid_init+0x60/0x100 [hid]
 [<a0000001000ca2c0>] sys_init_module+0x3c0/0x2d00
 [<a000000100014de0>] ia64_ret_from_syscall+0x0/0x20

The machine still responded to ping's but was otherwise unreachable.

  >> So, I have this non-name USB keyboard (with built-in 2-port USB
  >> hub) which reliably crashes 2.6.0-test{8,9} on both x86 and ia64.
  >> In retrospect, it's clear to me that the same keyboard also
  >> occasionally crashes 2.4 kernels, but there the problem appears
  >> more seldom.  Perhaps once in 10 reboots and once the machine is
  >> booted and the keyboard is running, it keeps on working.  The
  >> keyboard in question is a BTC 5141H.

  Greg> If you do not load the HID driver, and disable automatic
  Greg> loading of the hid driver (echo '/sbin/true' >
  Greg> /proc/sys/kernel/hotplug) and plug in the device, does it
  Greg> still crash?

No, if I only load the OHCI HCD, the kernel doesn't crash.

  Greg> If not, can you get us the output of /proc/bus/usb/devices and
  Greg> lsusb with the device plugged in?

Sure, they're attached.  Note: there are devices other than the BTC
keyboard connected, but I verified that the same problem occurs even
when the BTC keyboard is the only USB device plugged into the machine.

  Greg> If not, does then loading the hid driver cause the problem?

Yes, it's the loading of the hid driver that's triggering the crash.

	--david

<--- lsusb output --><--- lsusb output --><--- lsusb output -->

Bus 002 Device 004: ID 046e:5303 Behavior Tech. Computer 
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               1.10
  bDeviceClass            0 Interface
  bDeviceSubClass         0 
  bDeviceProtocol         0 
  bMaxPacketSize0         8
  idVendor           0x046e Behavior Tech. Computer
  idProduct          0x5303 
  bcdDevice            0.01
  iManufacturer           1 BTC
  iProduct                2 USB Hub Keyboard   
  iSerial                 0 
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength           59
    bNumInterfaces          2
    bConfigurationValue     1
    iConfiguration          2
    bmAttributes         0xe0
      Self Powered
      Remote Wakeup
    MaxPower                0mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         3 Human Interface Devices
      bInterfaceSubClass      1 Boot Interface Subclass
      bInterfaceProtocol      1 Keyboard
      iInterface              2 USB Hub Keyboard   
        HID Device Descriptor:
          bLength                 9
          bDescriptorType        33
          bcdHID               1.00
          bCountryCode            0
          bNumDescriptors         1
          bDescriptorType        34 Report
          wDescriptorLength      65
	  Report Descriptor: (length is 65)
            Item(Global): Usage Page, data= [ 0x01 ] 1
                            Generic Desktop Controls
            Item(Local ): Usage, data= [ 0x06 ] 6
                            Keyboard
            Item(Main  ): Collection, data= [ 0x01 ] 1
                            Application
            Item(Global): Usage Page, data= [ 0x07 ] 7
                            Keyboard
            Item(Local ): Usage Minimum, data= [ 0xe0 ] 224
                            Control Left
            Item(Local ): Usage Maximum, data= [ 0xe7 ] 231
                            GUI Right
            Item(Global): Logical Minimum, data= [ 0x00 ] 0
            Item(Global): Logical Maximum, data= [ 0x01 ] 1
            Item(Global): Report Size, data= [ 0x01 ] 1
            Item(Global): Report Count, data= [ 0x08 ] 8
            Item(Main  ): Input, data= [ 0x02 ] 2
                            Data Variable Absolute No_Wrap Linear
                            Preferred_State No_Null_Position Non_Volatile Bitfield
            Item(Global): Report Count, data= [ 0x01 ] 1
            Item(Global): Report Size, data= [ 0x08 ] 8
            Item(Main  ): Input, data= [ 0x01 ] 1
                            Constant Array Absolute No_Wrap Linear
                            Preferred_State No_Null_Position Non_Volatile Bitfield
            Item(Global): Report Count, data= [ 0x05 ] 5
            Item(Global): Report Size, data= [ 0x01 ] 1
            Item(Global): Usage Page, data= [ 0x08 ] 8
                            LEDs
            Item(Local ): Usage Minimum, data= [ 0x01 ] 1
                            NumLock
            Item(Local ): Usage Maximum, data= [ 0x05 ] 5
                            Kana
            Item(Main  ): Output, data= [ 0x02 ] 2
                            Data Variable Absolute No_Wrap Linear
                            Preferred_State No_Null_Position Non_Volatile Bitfield
            Item(Global): Report Count, data= [ 0x01 ] 1
            Item(Global): Report Size, data= [ 0x03 ] 3
            Item(Main  ): Output, data= [ 0x01 ] 1
                            Constant Array Absolute No_Wrap Linear
                            Preferred_State No_Null_Position Non_Volatile Bitfield
            Item(Global): Report Count, data= [ 0x06 ] 6
            Item(Global): Report Size, data= [ 0x08 ] 8
            Item(Global): Logical Minimum, data= [ 0x00 ] 0
            Item(Global): Logical Maximum, data= [ 0xff 0x00 ] 255
            Item(Global): Usage Page, data= [ 0x07 ] 7
                            Keyboard
            Item(Local ): Usage Minimum, data= [ 0x00 ] 0
                            No Event
            Item(Local ): Usage Maximum, data= [ 0xff 0x00 ] 255
                            (null)
            Item(Main  ): Input, data= [ 0x00 ] 0
                            Data Array Absolute No_Wrap Linear
                            Preferred_State No_Null_Position Non_Volatile Bitfield
            Item(Main  ): End Collection, data=none
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               none
        wMaxPacketSize          8
        bInterval              10
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        1
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         3 Human Interface Devices
      bInterfaceSubClass      0 No Subclass
      bInterfaceProtocol      0 None
      iInterface              2 USB Hub Keyboard   
        HID Device Descriptor:
          bLength                 9
          bDescriptorType        33
          bcdHID               1.00
          bCountryCode            0
          bNumDescriptors         1
          bDescriptorType        34 Report
          wDescriptorLength      67
	  Report Descriptor: (length is 67)
            Item(Global): Usage Page, data= [ 0x01 ] 1
                            Generic Desktop Controls
            Item(Local ): Usage, data= [ 0x00 ] 0
                            Undefined
            Item(Main  ): Collection, data= [ 0x01 ] 1
                            Application
            Item(Global): Report ID, data= [ 0x01 ] 1
            Item(Global): Logical Minimum, data= [ 0x00 ] 0
            Item(Global): Logical Maximum, data= [ 0x01 ] 1
            Item(Global): Report Size, data= [ 0x01 ] 1
            Item(Global): Report Count, data= [ 0x08 ] 8
            Item(Main  ): Input, data= [ 0x01 ] 1
                            Constant Array Absolute No_Wrap Linear
                            Preferred_State No_Null_Position Non_Volatile Bitfield
            Item(Main  ): End Collection, data=none
            Item(Global): Usage Page, data= [ 0x01 ] 1
                            Generic Desktop Controls
            Item(Local ): Usage, data= [ 0x80 ] 128
                            System Control
            Item(Main  ): Collection, data= [ 0x01 ] 1
                            Application
            Item(Global): Report ID, data= [ 0x02 ] 2
            Item(Global): Logical Minimum, data= [ 0x00 ] 0
            Item(Global): Logical Maximum, data= [ 0x01 ] 1
            Item(Local ): Usage Minimum, data= [ 0x81 ] 129
                            System Power Down
            Item(Local ): Usage Maximum, data= [ 0x83 ] 131
                            System Wake Up
            Item(Global): Report Size, data= [ 0x01 ] 1
            Item(Global): Report Count, data= [ 0x03 ] 3
            Item(Main  ): Input, data= [ 0x06 ] 6
                            Data Variable Relative No_Wrap Linear
                            Preferred_State No_Null_Position Non_Volatile Bitfield
            Item(Global): Report Size, data= [ 0x05 ] 5
            Item(Global): Report Count, data= [ 0x01 ] 1
            Item(Main  ): Input, data= [ 0x01 ] 1
                            Constant Array Absolute No_Wrap Linear
                            Preferred_State No_Null_Position Non_Volatile Bitfield
            Item(Main  ): End Collection, data=none
            Item(Global): Usage Page, data= [ 0x01 ] 1
                            Generic Desktop Controls
            Item(Local ): Usage, data= [ 0x00 ] 0
                            Undefined
            Item(Main  ): Collection, data= [ 0x01 ] 1
                            Application
            Item(Global): Report ID, data= [ 0x03 ] 3
            Item(Global): Logical Minimum, data= [ 0x00 ] 0
            Item(Global): Logical Maximum, data= [ 0x01 ] 1
            Item(Global): Report Size, data= [ 0x01 ] 1
            Item(Global): Report Count, data= [ 0x18 ] 24
            Item(Main  ): Input, data= [ 0x01 ] 1
                            Constant Array Absolute No_Wrap Linear
                            Preferred_State No_Null_Position Non_Volatile Bitfield
            Item(Main  ): End Collection, data=none
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x82  EP 2 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               none
        wMaxPacketSize          4
        bInterval             255
  Language IDs: (length=4)
     0409 English(US)

Bus 002 Device 003: ID 050d:0119 Belkin Components 
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               1.10
  bDeviceClass            0 Interface
  bDeviceSubClass         0 
  bDeviceProtocol         0 
  bMaxPacketSize0         8
  idVendor           0x050d Belkin Components
  idProduct          0x0119 
  bcdDevice            1.20
  iManufacturer           1 Belkin Components
  iProduct                2 USB-PS2 Adapter 
  iSerial                 0 
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength           59
    bNumInterfaces          2
    bConfigurationValue     1
    iConfiguration          0
    bmAttributes         0xa0
      Remote Wakeup
    MaxPower              100mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         3 Human Interface Devices
      bInterfaceSubClass      1 Boot Interface Subclass
      bInterfaceProtocol      1 Keyboard
      iInterface              0 
        HID Device Descriptor:
          bLength                 9
          bDescriptorType        33
          bcdHID               1.00
          bCountryCode            0
          bNumDescriptors         1
          bDescriptorType        34 Report
          wDescriptorLength      63
	  Report Descriptor: (length is 63)
            Item(Global): Usage Page, data= [ 0x01 ] 1
                            Generic Desktop Controls
            Item(Local ): Usage, data= [ 0x06 ] 6
                            Keyboard
            Item(Main  ): Collection, data= [ 0x01 ] 1
                            Application
            Item(Global): Usage Page, data= [ 0x07 ] 7
                            Keyboard
            Item(Local ): Usage Minimum, data= [ 0xe0 ] 224
                            Control Left
            Item(Local ): Usage Maximum, data= [ 0xe7 ] 231
                            GUI Right
            Item(Global): Logical Minimum, data= [ 0x00 ] 0
            Item(Global): Logical Maximum, data= [ 0x01 ] 1
            Item(Global): Report Count, data= [ 0x08 ] 8
            Item(Global): Report Size, data= [ 0x01 ] 1
            Item(Main  ): Input, data= [ 0x02 ] 2
                            Data Variable Absolute No_Wrap Linear
                            Preferred_State No_Null_Position Non_Volatile Bitfield
            Item(Global): Report Count, data= [ 0x01 ] 1
            Item(Global): Report Size, data= [ 0x08 ] 8
            Item(Main  ): Input, data= [ 0x01 ] 1
                            Constant Array Absolute No_Wrap Linear
                            Preferred_State No_Null_Position Non_Volatile Bitfield
            Item(Global): Report Count, data= [ 0x05 ] 5
            Item(Global): Report Size, data= [ 0x01 ] 1
            Item(Global): Usage Page, data= [ 0x08 ] 8
                            LEDs
            Item(Local ): Usage Minimum, data= [ 0x01 ] 1
                            NumLock
            Item(Local ): Usage Maximum, data= [ 0x05 ] 5
                            Kana
            Item(Main  ): Output, data= [ 0x02 ] 2
                            Data Variable Absolute No_Wrap Linear
                            Preferred_State No_Null_Position Non_Volatile Bitfield
            Item(Global): Report Count, data= [ 0x01 ] 1
            Item(Global): Report Size, data= [ 0x03 ] 3
            Item(Main  ): Output, data= [ 0x01 ] 1
                            Constant Array Absolute No_Wrap Linear
                            Preferred_State No_Null_Position Non_Volatile Bitfield
            Item(Global): Report Count, data= [ 0x06 ] 6
            Item(Global): Report Size, data= [ 0x08 ] 8
            Item(Global): Logical Minimum, data= [ 0x00 ] 0
            Item(Global): Logical Maximum, data= [ 0x65 ] 101
            Item(Global): Usage Page, data= [ 0x07 ] 7
                            Keyboard
            Item(Local ): Usage Minimum, data= [ 0x00 ] 0
                            No Event
            Item(Local ): Usage Maximum, data= [ 0x65 ] 101
                            Keyboard Application (Windows Key for Win95 or Compose)
            Item(Main  ): Input, data= [ 0x00 ] 0
                            Data Array Absolute No_Wrap Linear
                            Preferred_State No_Null_Position Non_Volatile Bitfield
            Item(Main  ): End Collection, data=none
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               none
        wMaxPacketSize          8
        bInterval              10
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        1
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         3 Human Interface Devices
      bInterfaceSubClass      1 Boot Interface Subclass
      bInterfaceProtocol      2 Mouse
      iInterface              0 
        HID Device Descriptor:
          bLength                 9
          bDescriptorType        33
          bcdHID               1.00
          bCountryCode            0
          bNumDescriptors         1
          bDescriptorType        34 Report
          wDescriptorLength      52
	  Report Descriptor: (length is 52)
            Item(Global): Usage Page, data= [ 0x01 ] 1
                            Generic Desktop Controls
            Item(Local ): Usage, data= [ 0x02 ] 2
                            Mouse
            Item(Main  ): Collection, data= [ 0x01 ] 1
                            Application
            Item(Local ): Usage, data= [ 0x01 ] 1
                            Pointer
            Item(Main  ): Collection, data= [ 0x00 ] 0
                            Physical
            Item(Global): Usage Page, data= [ 0x09 ] 9
                            Buttons
            Item(Local ): Usage Minimum, data= [ 0x01 ] 1
                            Button 1 (Primary)
            Item(Local ): Usage Maximum, data= [ 0x05 ] 5
                            (null)
            Item(Global): Logical Minimum, data= [ 0x00 ] 0
            Item(Global): Logical Maximum, data= [ 0x01 ] 1
            Item(Global): Report Count, data= [ 0x05 ] 5
            Item(Global): Report Size, data= [ 0x01 ] 1
            Item(Main  ): Input, data= [ 0x02 ] 2
                            Data Variable Absolute No_Wrap Linear
                            Preferred_State No_Null_Position Non_Volatile Bitfield
            Item(Global): Report Count, data= [ 0x01 ] 1
            Item(Global): Report Size, data= [ 0x03 ] 3
            Item(Main  ): Input, data= [ 0x01 ] 1
                            Constant Array Absolute No_Wrap Linear
                            Preferred_State No_Null_Position Non_Volatile Bitfield
            Item(Global): Usage Page, data= [ 0x01 ] 1
                            Generic Desktop Controls
            Item(Local ): Usage, data= [ 0x30 ] 48
                            Direction-X
            Item(Local ): Usage, data= [ 0x31 ] 49
                            Direction-Y
            Item(Local ): Usage, data= [ 0x38 ] 56
                            Wheel
            Item(Global): Logical Minimum, data= [ 0x81 ] 129
            Item(Global): Logical Maximum, data= [ 0x7f ] 127
            Item(Global): Report Size, data= [ 0x08 ] 8
            Item(Global): Report Count, data= [ 0x03 ] 3
            Item(Main  ): Input, data= [ 0x06 ] 6
                            Data Variable Relative No_Wrap Linear
                            Preferred_State No_Null_Position Non_Volatile Bitfield
            Item(Main  ): End Collection, data=none
            Item(Main  ): End Collection, data=none
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x82  EP 2 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               none
        wMaxPacketSize          4
        bInterval              10
  Language IDs: (length=4)
     0409 English(US)

Bus 002 Device 002: ID 046e:5403 Behavior Tech. Computer 
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               1.10
  bDeviceClass            9 Hub
  bDeviceSubClass         0 
  bDeviceProtocol         0 
  bMaxPacketSize0         8
  idVendor           0x046e Behavior Tech. Computer
  idProduct          0x5403 
  bcdDevice            0.01
  iManufacturer           1 BTC
  iProduct                2 USB Hub Keyboard   
  iSerial                 0 
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength           25
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          2
    bmAttributes         0xe0
      Self Powered
      Remote Wakeup
    MaxPower              100mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         9 Hub
      bInterfaceSubClass      0 
      bInterfaceProtocol      0 
      iInterface              2 USB Hub Keyboard   
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               none
        wMaxPacketSize          1
        bInterval             255
  Language IDs: (length=4)
     0409 English(US)

Bus 002 Device 001: ID 0000:0000 Virtual Hub
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               1.10
  bDeviceClass            9 Hub
  bDeviceSubClass         0 
  bDeviceProtocol         0 
  bMaxPacketSize0         8
  idVendor           0x0000 Virtual
  idProduct          0x0000 Hub
  bcdDevice            2.06
  iManufacturer           3 Linux 2.6.0-test9 ohci_hcd
  iProduct                2 OHCI Host Controller
  iSerial                 1 0000:a0:01.1
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength           25
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          0
    bmAttributes         0x40
      Self Powered
    MaxPower                0mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         9 Hub
      bInterfaceSubClass      0 
      bInterfaceProtocol      0 
      iInterface              0 
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               none
        wMaxPacketSize          2
        bInterval             255
  Language IDs: (length=4)
     0409 English(US)

Bus 001 Device 001: ID 0000:0000 Virtual Hub
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               1.10
  bDeviceClass            9 Hub
  bDeviceSubClass         0 
  bDeviceProtocol         0 
  bMaxPacketSize0         8
  idVendor           0x0000 Virtual
  idProduct          0x0000 Hub
  bcdDevice            2.06
  iManufacturer           3 Linux 2.6.0-test9 ohci_hcd
  iProduct                2 OHCI Host Controller
  iSerial                 1 0000:a0:01.0
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength           25
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          0
    bmAttributes         0x40
      Self Powered
    MaxPower                0mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         9 Hub
      bInterfaceSubClass      0 
      bInterfaceProtocol      0 
      iInterface              0 
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               none
        wMaxPacketSize          2
        bInterval             255
  Language IDs: (length=4)
     0409 English(US)

<-- contents of /proc/bus/usb/devices -->

T:  Bus=02 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#=  1 Spd=12  MxCh= 2
B:  Alloc=  0/900 us ( 0%), #Int=  1, #Iso=  0
D:  Ver= 1.10 Cls=09(hub  ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=0000 ProdID=0000 Rev= 2.06
S:  Manufacturer=Linux 2.6.0-test9 ohci_hcd
S:  Product=OHCI Host Controller
S:  SerialNumber=0000:a0:01.1
C:* #Ifs= 1 Cfg#= 1 Atr=40 MxPwr=  0mA
I:  If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
E:  Ad=81(I) Atr=03(Int.) MxPS=   2 Ivl=255ms

T:  Bus=02 Lev=01 Prnt=01 Port=00 Cnt=01 Dev#=  2 Spd=12  MxCh= 3
D:  Ver= 1.10 Cls=09(hub  ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=046e ProdID=5403 Rev= 0.01
S:  Manufacturer=BTC
S:  Product=USB Hub Keyboard   
C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=100mA
I:  If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
E:  Ad=81(I) Atr=03(Int.) MxPS=   1 Ivl=255ms

T:  Bus=02 Lev=02 Prnt=02 Port=00 Cnt=01 Dev#=  4 Spd=12  MxCh= 0
D:  Ver= 1.10 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=046e ProdID=5303 Rev= 0.01
S:  Manufacturer=BTC
S:  Product=USB Hub Keyboard   
C:* #Ifs= 2 Cfg#= 1 Atr=e0 MxPwr=  0mA
I:  If#= 0 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=01 Prot=01 Driver=(none)
E:  Ad=81(I) Atr=03(Int.) MxPS=   8 Ivl=10ms
I:  If#= 1 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=00 Prot=00 Driver=(none)
E:  Ad=82(I) Atr=03(Int.) MxPS=   4 Ivl=255ms

T:  Bus=02 Lev=01 Prnt=01 Port=01 Cnt=02 Dev#=  3 Spd=1.5 MxCh= 0
D:  Ver= 1.10 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=050d ProdID=0119 Rev= 1.20
S:  Manufacturer=Belkin Components
S:  Product=USB-PS2 Adapter 
C:* #Ifs= 2 Cfg#= 1 Atr=a0 MxPwr=100mA
I:  If#= 0 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=01 Prot=01 Driver=(none)
E:  Ad=81(I) Atr=03(Int.) MxPS=   8 Ivl=10ms
I:  If#= 1 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=01 Prot=02 Driver=(none)
E:  Ad=82(I) Atr=03(Int.) MxPS=   4 Ivl=10ms

T:  Bus=01 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#=  1 Spd=12  MxCh= 3
B:  Alloc=  0/900 us ( 0%), #Int=  0, #Iso=  0
D:  Ver= 1.10 Cls=09(hub  ) Sub=00 Prot=00 MxPS= 8 #Cfgs=  1
P:  Vendor=0000 ProdID=0000 Rev= 2.06
S:  Manufacturer=Linux 2.6.0-test9 ohci_hcd
S:  Product=OHCI Host Controller
S:  SerialNumber=0000:a0:01.0
C:* #Ifs= 1 Cfg#= 1 Atr=40 MxPwr=  0mA
I:  If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
E:  Ad=81(I) Atr=03(Int.) MxPS=   2 Ivl=255ms

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2003-10-28  3:00   ` David Mosberger
@ 2003-10-30 15:11     ` David Brownell
  2003-10-30 20:15       ` David Mosberger
  0 siblings, 1 reply; 45+ messages in thread
From: David Brownell @ 2003-10-30 15:11 UTC (permalink / raw)
  To: davidm; +Cc: Greg KH, vojtech, linux-usb-devel, linux-kernel

David Mosberger wrote:
> 
> On x86, there is no OOps, it just freezes.  On ia64, I get a nice MCA
> and from that we can infer that a USB host controller read from
> address 0xf0000000 caused the problem but since this is asynchronous
> to the kernel's code path, the instruction pointer etc. in the MCA
> state dump isn't terribly helpful. 

Does that 0xf0000000 (on ia64) match any obvious address mapping
of the null pointer -- like a dma mapping?  I'm not sure that if
the HID driver were to pass a null buffer pointer, it would be
caught anywhere.

- Dave




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2003-10-30 15:11     ` [linux-usb-devel] " David Brownell
@ 2003-10-30 20:15       ` David Mosberger
       [not found]         ` <16289.55171.278494.17172@napali.hpl.hp.com>
  2003-11-03  3:46         ` David Brownell
  0 siblings, 2 replies; 45+ messages in thread
From: David Mosberger @ 2003-10-30 20:15 UTC (permalink / raw)
  To: David Brownell; +Cc: davidm, Greg KH, vojtech, linux-usb-devel, linux-kernel

>>>>> On Thu, 30 Oct 2003 07:11:42 -0800, David Brownell <david-b@pacbell.net> said:

  David> David Mosberger wrote:

  >> On x86, there is no OOps, it just freezes.  On ia64, I get a nice MCA
  >> and from that we can infer that a USB host controller read from
  >> address 0xf0000000 caused the problem but since this is asynchronous
  >> to the kernel's code path, the instruction pointer etc. in the MCA
  >> state dump isn't terribly helpful.

  David> Does that 0xf0000000 (on ia64) match any obvious address mapping
  David> of the null pointer -- like a dma mapping?

Not really.  AFAIK, 0xf0000000 is part of the PCI MMIO address space,
but on the machines that I have access to, this particular address
isn't assigned to any device:

	$ lspci -v|fgrep 'Memory at'
        Memory at 0000000080000000 (32-bit, prefetchable) [size=128M]
        Memory at 0000000088000000 (32-bit, non-prefetchable) [size=512K]
        Memory at 00000000d0023000 (32-bit, non-prefetchable) [size=4K]
        Memory at 00000000d0022000 (32-bit, non-prefetchable) [size=4K]
        Memory at 00000000d0021000 (32-bit, non-prefetchable) [size=256]
        Memory at 00000000d0020000 (32-bit, non-prefetchable) [size=4K]
        Memory at 00000000d0000000 (32-bit, non-prefetchable) [size=128K]
        Memory at 00000000e0200000 (32-bit, non-prefetchable) [size=4K]
        Memory at 00000000e0100000 (32-bit, non-prefetchable) [size=1M]

  David> I'm not sure that if the HID driver were to pass a null
  David> buffer pointer, it would be caught anywhere.

OK, I'll try to find some time to trace the I/O MMU calls to see if
something isn't kosher at that level.  Is there a good way of getting
a relatively high-level of tracing in the USB subsystem that would
some me what's going on between the HID and the core USB level?

	--david

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
       [not found]         ` <16289.55171.278494.17172@napali.hpl.hp.com>
@ 2003-10-31 16:23           ` David Brownell
  2003-10-31 18:34             ` David Mosberger
  2004-03-06  2:08             ` David Mosberger
  0 siblings, 2 replies; 45+ messages in thread
From: David Brownell @ 2003-10-31 16:23 UTC (permalink / raw)
  To: davidm; +Cc: Greg KH, vojtech, linux-usb-devel, linux-kernel

David Mosberger wrote:
> After spending a bit more time on this, it looks to me like the
> keyboard is crashing the system very early on.  

I think there are some devices that choke the HID
code; I recall someone reporting a mouse that did the
same kind of thing.  Do other kinds of keyboards do
the same thing, or is it just that one?

Vojtech may have other suggestions.

- Dave


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2003-10-31 16:23           ` David Brownell
@ 2003-10-31 18:34             ` David Mosberger
  2003-10-31 18:50               ` Valdis.Kletnieks
  2003-10-31 19:28               ` David Brownell
  2004-03-06  2:08             ` David Mosberger
  1 sibling, 2 replies; 45+ messages in thread
From: David Mosberger @ 2003-10-31 18:34 UTC (permalink / raw)
  To: David Brownell; +Cc: davidm, Greg KH, vojtech, linux-usb-devel, linux-kernel

>>>>> On Fri, 31 Oct 2003 08:23:54 -0800, David Brownell <david-b@pacbell.net> said:

  David.B> David Mosberger wrote:
  >> After spending a bit more time on this, it looks to me like the
  >> keyboard is crashing the system very early on.

  David.B> I think there are some devices that choke the HID
  David.B> code;

And nobody is alarmed by this?  Surely crashing the kernel by plugging
in a USB device must be considered a MUST-FIX item.  Perhaps I missed
something, but I never saw this mentioned before.

  David.B> I recall someone reporting a mouse that did the same kind of
  David.B> thing.  Do other kinds of keyboards do the same thing, or is
  David.B> it just that one?

Ugh, I only have about half a dozen or so different types of USB
devices (and even fewer of them are HID devices), so my experience
isn't exactly a statistically valid sample.  Having said that, out of
that 6 or so devices, that particular keyboard is the only one causing
crashes.  However, note that it works (mostly) fine under 2.4 and even
if they keyboard were total crap, it certainly shouldn't crash the
kernel.

	--david

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2003-10-31 18:34             ` David Mosberger
@ 2003-10-31 18:50               ` Valdis.Kletnieks
  2003-10-31 19:28               ` David Brownell
  1 sibling, 0 replies; 45+ messages in thread
From: Valdis.Kletnieks @ 2003-10-31 18:50 UTC (permalink / raw)
  To: davidm; +Cc: linux-usb-devel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 287 bytes --]

On Fri, 31 Oct 2003 10:34:22 PST, David Mosberger said:

> And nobody is alarmed by this?  Surely crashing the kernel by plugging
> in a USB device must be considered a MUST-FIX item.  Perhaps I missed
> something, but I never saw this mentioned before.

Bill Gates. Comdex. Printer. :)

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2003-10-31 18:34             ` David Mosberger
  2003-10-31 18:50               ` Valdis.Kletnieks
@ 2003-10-31 19:28               ` David Brownell
  2003-10-31 19:50                 ` David Mosberger
  1 sibling, 1 reply; 45+ messages in thread
From: David Brownell @ 2003-10-31 19:28 UTC (permalink / raw)
  To: davidm; +Cc: Greg KH, vojtech, linux-usb-devel, linux-kernel

David Mosberger wrote:
>>>>>>On Fri, 31 Oct 2003 08:23:54 -0800, David Brownell <david-b@pacbell.net> said:
> 
> 
>   David.B> David Mosberger wrote:
>   >> After spending a bit more time on this, it looks to me like the
>   >> keyboard is crashing the system very early on.
> 
>   David.B> I think there are some devices that choke the HID
>   David.B> code;
> 
> And nobody is alarmed by this?  Surely crashing the kernel by plugging
> in a USB device must be considered a MUST-FIX item.  Perhaps I missed
> something, but I never saw this mentioned before.

You sound alarmed!  If that's alarmed enough to find out what
the real problem is, maybe you'll end up fixing it ... :)

I could be wrong about the problem being in the HID code, but
that does look like a likely home for the bug.  We know there
are other issues with HID/input/hiddev/... that need attention.


>  Having said that, out of
> that 6 or so devices, that particular keyboard is the only one causing
> crashes.  However, note that it works (mostly) fine under 2.4 and even
> if they keyboard were total crap, it certainly shouldn't crash the
> kernel.

Agreed, oopsing == bad.  HID needs more attention.  I suspect whoever
dives into that will want to know what you mean by "(mostly) fine";
that might give a clue about what 2.6 changes worsened the failures.

- Dave


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2003-10-31 19:28               ` David Brownell
@ 2003-10-31 19:50                 ` David Mosberger
  2003-10-31 20:06                   ` David S. Miller
  0 siblings, 1 reply; 45+ messages in thread
From: David Mosberger @ 2003-10-31 19:50 UTC (permalink / raw)
  To: David Brownell; +Cc: davidm, Greg KH, vojtech, linux-usb-devel, linux-kernel

>>>>> On Fri, 31 Oct 2003 11:28:20 -0800, David Brownell <david-b@pacbell.net> said:

  David.B> You sound alarmed!  If that's alarmed enough to find out
  David.B> what the real problem is, maybe you'll end up fixing it
  David.B> ... :)

Except I know almost nothing about the USB stack.

  David.B> Agreed, oopsing == bad.  HID needs more attention.  I
  David.B> suspect whoever dives into that will want to know what you
  David.B> mean by "(mostly) fine"; that might give a clue about what
  David.B> 2.6 changes worsened the failures.

Are you saying nobody is maintaining HID?

	--david

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2003-10-31 19:50                 ` David Mosberger
@ 2003-10-31 20:06                   ` David S. Miller
  0 siblings, 0 replies; 45+ messages in thread
From: David S. Miller @ 2003-10-31 20:06 UTC (permalink / raw)
  To: davidm; +Cc: davidm, david-b, greg, vojtech, linux-usb-devel, linux-kernel

On Fri, 31 Oct 2003 11:50:01 -0800
David Mosberger <davidm@napali.hpl.hp.com> wrote:

> >>>>> On Fri, 31 Oct 2003 11:28:20 -0800, David Brownell <david-b@pacbell.net> said:
> 
>   David.B> You sound alarmed!  If that's alarmed enough to find out
>   David.B> what the real problem is, maybe you'll end up fixing it
>   David.B> ... :)
> 
> Except I know almost nothing about the USB stack.

David, get real, this is never an excuse for people of our
caliber.  :-)

You, myself, and many others are more than intelligent enough and more
than capable enough to debug subsystems we are not familiar with or
even have never looked at before.

As platforms maintainers, such a skill is nearly a necessity.

When I hit a problem in some subsystem and I can't provide enough
information to the subsystem maintainer for them to fix the bug, I
have to do the debugging work if I want the bug fixed.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2003-10-30 20:15       ` David Mosberger
       [not found]         ` <16289.55171.278494.17172@napali.hpl.hp.com>
@ 2003-11-03  3:46         ` David Brownell
  2003-11-03 21:25           ` David Mosberger
  1 sibling, 1 reply; 45+ messages in thread
From: David Brownell @ 2003-11-03  3:46 UTC (permalink / raw)
  To: davidm; +Cc: Greg KH, vojtech, linux-usb-devel, linux-kernel


>   David> I'm not sure that if the HID driver were to pass a null
>   David> buffer pointer, it would be caught anywhere.
> 
> OK, I'll try to find some time to trace the I/O MMU calls to see if
> something isn't kosher at that level.  Is there a good way of getting
> a relatively high-level of tracing in the USB subsystem that would
> some me what's going on between the HID and the core USB level?

Most of that story is just submitting and completing URBs.

I'd either try changing the spots in drivers/usb/core/hcd.c
marked as appropriate for generic MONITOR_URB hooks (printk
if it's your HID device, maybe), or manually turn on whatever
HCD-specific hooks exist (maybe use a VERBOSE message level).

Such a thing wasn't possible in 2.4 since there were too
many different bizarre (and sometimes buggy) ways for URBs
to return to the usb device drivers and get implicitly
resubmitted.

- Dave





^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2003-11-03  3:46         ` David Brownell
@ 2003-11-03 21:25           ` David Mosberger
  0 siblings, 0 replies; 45+ messages in thread
From: David Mosberger @ 2003-11-03 21:25 UTC (permalink / raw)
  To: David Brownell; +Cc: davidm, Greg KH, vojtech, linux-usb-devel, linux-kernel

>>>>> On Sun, 02 Nov 2003 19:46:38 -0800, David Brownell <david-b@pacbell.net> said:

  David> I'm not sure that if the HID driver were to pass a null
  David> buffer pointer, it would be caught anywhere.
  >>  OK, I'll try to find some time to trace the I/O MMU calls to see
  >> if something isn't kosher at that level.  Is there a good way of
  >> getting a relatively high-level of tracing in the USB subsystem
  >> that would some me what's going on between the HID and the core
  >> USB level?

  Dave.B> Most of that story is just submitting and completing URBs.

Yeah.  And it appears that it's the very first call to
hid_submit_ctrl() that's triggering the problem (not always, but about
9 out of 10 times).  I dumped some of the key fields for the URB being
submitted and they all looked saned to me.

  Dave.B> I'd either try changing the spots in drivers/usb/core/hcd.c
  Dave.B> marked as appropriate for generic MONITOR_URB hooks (printk
  Dave.B> if it's your HID device, maybe), or manually turn on
  Dave.B> whatever HCD-specific hooks exist (maybe use a VERBOSE
  Dave.B> message level).

OK, thanks for the suggestion.  I'll keep looking, but will be on
travel this week, so I may not be able to spend much time on this
problem.

	--david

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: serious 2.6 bug in USB subsystem?
  2003-10-27 22:35 serious 2.6 bug in USB subsystem? David Mosberger
  2003-10-28  1:30 ` Greg KH
@ 2004-03-03 12:33 ` Wouter Lueks
  2004-03-03 15:30   ` Wouter Lueks
  1 sibling, 1 reply; 45+ messages in thread
From: Wouter Lueks @ 2004-03-03 12:33 UTC (permalink / raw)
  To: davidm; +Cc: linux-kernel

On Mon, Oct 27, 2003 at 02:35:09PM -0800, David Mosberger wrote:
> One-line summary: plug-in your USB keyboard, see your machine die.
>
> So, I have this non-name USB keyboard (with built-in 2-port USB hub)
> which reliably crashes 2.6.0-test{8,9} on both x86 and ia64.  In
> retrospect, it's clear to me that the same keyboard also occasionally
> crashes 2.4 kernels, but there the problem appears more seldom.
> Perhaps once in 10 reboots and once the machine is booted and the
> keyboard is running, it keeps on working.  The keyboard in question is
> a BTC 5141H.

I am having the same problem as David but with a major difference, the
keyboard doesn't have any problems on either 2.4.21 or windows XP, it
works flawlessly there. I have tested up to 2.6.2 and haven't seen
anything in the changelogs that suggests its fixed in 2.6.3

> In any case, even if the keyboard happened to be completely broken, I
> don't think it should be possible to crash the kernel merely by
> connecting a (potentially) bad USB device, so, regardless, there seems
> to be something off in the 2.6 USB subsystem.
> 
> Here is a more detailed description of how to reproduce the problem
> and the behavior that I'm seeing with 2.6 (I don't think the exact
> kernel version matters much; it definitely happens both with
> 2.6.0-test8 and 2.6.0-test9):
> 
>  1) disconnect the 5141H keyboard
>  2) power on the machine
>  3) boot into Linux 2.6
>  4) connect the keyboard
>  5) see the normal "USB HID" keyboard connection message
>  6) wait a few seconds
>  7) machine is dead

I used the exact same steps and the result is the same, the machine is
dead, and I can do a hard reset.
>
> On x86, I see this message prior to the death of the machine:
>
>  drivers/usb/input/hid-core.c: ctrl urb status -104 received
>  drivers/usb/input/hid-core.c: timeout initializing reports

See the attached dmesg file I get one error more. On top of that I have
enabled usb debugging so perhaps the dmesg file yields more result when
someone who knows this subsystem can see what is happening here.
Personally I lack the experience to fix this problem. When you need any
more information I'll be happy to provide it.

One thing to note though is that I do have usb legacy mode turned on,
because I otherwise cannot boot into linux.

Wouter

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: serious 2.6 bug in USB subsystem?
  2004-03-03 12:33 ` Wouter Lueks
@ 2004-03-03 15:30   ` Wouter Lueks
  0 siblings, 0 replies; 45+ messages in thread
From: Wouter Lueks @ 2004-03-03 15:30 UTC (permalink / raw)
  To: davidm; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 436 bytes --]

On Wed, Mar 03, 2004 at 01:33:40PM +0100, Wouter Lueks wrote:
> See the attached dmesg file I get one error more. On top of that I have
> enabled usb debugging so perhaps the dmesg file yields more result when
> someone who knows this subsystem can see what is happening here.
> Personally I lack the experience to fix this problem. When you need any
> more information I'll be happy to provide it.

Which attached file ;). Here it is.

[-- Attachment #2: dmesg --]
[-- Type: text/plain, Size: 11395 bytes --]

Linux version 2.6.2 (root@newton) (gcc version 3.3.3 20040125 (prerelease) (Debian)) #1 Sun Feb 8 12:12:08 CET 2004
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009e800 (usable)
 BIOS-e820: 000000000009e800 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000ce000 - 00000000000d0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000000fef0000 (usable)
 BIOS-e820: 000000000fef0000 - 000000000feff000 (ACPI data)
 BIOS-e820: 000000000feff000 - 000000000ff00000 (ACPI NVS)
 BIOS-e820: 000000000ff00000 - 000000000ff80000 (usable)
 BIOS-e820: 000000000ff80000 - 0000000010000000 (reserved)
 BIOS-e820: 00000000ff800000 - 0000000100000000 (reserved)
255MB LOWMEM available.
found SMP MP-table at 000f6330
hm, page 000f6000 reserved twice.
hm, page 000f7000 reserved twice.
hm, page 0009e000 reserved twice.
hm, page 0009f000 reserved twice.
On node 0 totalpages: 65408
  DMA zone: 4096 pages, LIFO batch:1
  Normal zone: 61312 pages, LIFO batch:14
  HighMem zone: 0 pages, LIFO batch:1
DMI 2.3 present.
ACPI: RSDP (v000 PTLTD                                     ) @ 0x000f6300
ACPI: RSDT (v001 PTLTD    RSDT   0x06040000  LTP 0x00000000) @ 0x0fefbddc
ACPI: FADT (v001 INTEL  WTVV     0x06040000 PTL  0x00000008) @ 0x0fefef32
ACPI: MADT (v001 PTLTD  	 APIC   0x06040000  LTP 0x00000000) @ 0x0fefefa6
ACPI: DSDT (v001    NEC          0x06040000 MSFT 0x0100000d) @ 0x00000000
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 15:1 APIC version 20
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
Using ACPI for processor (LAPIC) configuration information
Intel MultiProcessor Specification v1.4
    Virtual Wire compatibility mode.
OEM ID: INTEL    Product ID: WTVV+ICH2    APIC at: 0xFEE00000
I/O APIC #1 Version 32 at 0xFEC00000.
Enabling APIC mode:  Flat.  Using 1 I/O APICs
Processors: 1
Building zonelist for node : 0
Kernel command line: 
Initializing CPU#0
PID hash table entries: 1024 (order 10: 8192 bytes)
Detected 1879.179 MHz processor.
Using tsc for high-res timesource
Console: colour VGA+ 80x25
Memory: 254680k/261632k available (2286k kernel code, 6160k reserved, 906k data, 156k init, 0k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay loop... 3702.78 BogoMIPS
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
CPU:     After generic identify, caps: 3febfbff 00000000 00000000 00000000
CPU:     After vendor identify, caps: 3febfbff 00000000 00000000 00000000
CPU: Trace cache: 12K uops, L1 D cache: 8K
CPU: L2 cache: 256K
CPU:     After all inits, caps: 3febfbff 00000000 00000000 00000080
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU#0: Intel P4/Xeon Extended MCE MSRs (12) available
CPU: Intel(R) Pentium(R) 4 CPU 1.90GHz stepping 02
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
enabled ExtINT on CPU#0
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 1878.0242 MHz.
..... host bus clock speed is 98.0854 MHz.
NET: Registered protocol family 16
PCI: PCI BIOS revision 2.10 entry at 0xfd9ab, last bus=2
PCI: Using configuration type 1
mtrr: v2.0 (20020519)
ACPI: Subsystem revision 20040116
ACPI: IRQ9 SCI: Level Trigger.
ACPI: Interpreter enabled
ACPI: Using PIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (00:00)
PCI: Probing PCI hardware (bus 00)
Transparent bridge - 0000:00:1e.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.AGP_._PRT]
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 *10 11 14 15)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 *5 6 7 9 10 11 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 *5 6 7 9 10 11 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 *9 10 11 14 15)
ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 9 10 11 14 15)
ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 6 7 9 10 11 14 15)
ACPI: PCI Interrupt Link [LNKG] (IRQs *3 4 5 6 7 9 10 11 14 15)
ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 6 7 9 10 *11 14 15)
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.HUB0._PRT]
SCSI subsystem initialized
drivers/usb/core/usb.c: registered new driver usbfs
drivers/usb/core/usb.c: registered new driver hub
ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 9
ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 5
ACPI: PCI Interrupt Link [LNKH] enabled at IRQ 11
ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 10
ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 5
ACPI: PCI Interrupt Link [LNKG] enabled at IRQ 3
PCI: Using ACPI for IRQ routing
PCI: if you experience problems, try using option 'pci=noacpi' or even 'acpi=off'
apm: BIOS version 1.2 Flags 0x03 (Driver version 1.16ac)
apm: overridden by ACPI.
VFS: Disk quotas dquot_6.5.1
Initializing Cryptographic API
pty: 256 Unix98 ptys configured
Real Time Clock Driver v1.12
parport0: PC-style at 0x378 (0x778) [PCSPP,TRISTATE,EPP]
parport0: irq 7 detected
parport0: cpp_daisy: aa5500ff(98)
parport0: assign_addrs: aa5500ff(98)
parport0: Printer, HEWLETT-PACKARD OFFICEJET PRO 1150C
Using anticipatory io scheduler
Floppy drive(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
loop: loaded (max 8 devices)
nbd: registered device at major 43
8139too Fast Ethernet driver 0.9.26
eth0: RealTek RTL8139 at 0xd0816000, 00:30:f1:1e:29:7c, IRQ 10
eth0:  Identified 8139 chip type 'RTL-8139C'
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ICH2: IDE controller at PCI slot 0000:00:1f.1
ICH2: chipset revision 4
ICH2: not 100% native mode: will probe irqs later
    ide0: BM-DMA at 0x2000-0x2007, BIOS settings: hda:DMA, hdb:pio
    ide1: BM-DMA at 0x2008-0x200f, BIOS settings: hdc:DMA, hdd:DMA
hda: ST360020A, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hdc: _NEC DVD-ROM DV-5800A, ATAPI CD/DVD-ROM drive
hdd: R/RW 12x8x32, ATAPI CD/DVD-ROM drive
ide1 at 0x170-0x177,0x376 on irq 15
hda: max request size: 128KiB
hda: 117231408 sectors (60022 MB) w/2048KiB Cache, CHS=65535/16/63
 hda: hda1 < hda5 hda6 > hda2
hdc: ATAPI 48X DVD-ROM drive, 512kB Cache
Uniform CD-ROM driver Revision: 3.20
hdd: ATAPI 32X CD-ROM CD-R/RW drive, 4096kB Cache
drivers/usb/core/usb.c: registered new driver hiddev
drivers/usb/core/usb.c: registered new driver hid
drivers/usb/input/hid-core.c: v2.0:USB HID core driver
mice: PS/2 mouse device common for all mice
input: PC Speaker
serio: i8042 AUX port at 0x60,0x64 irq 12
input: PS/2 Generic Mouse on isa0060/serio1
serio: i8042 KBD port at 0x60,0x64 irq 1
input: AT Translated Set 2 keyboard on isa0060/serio0
Advanced Linux Sound Architecture Driver Version 0.9.7 (Thu Sep 25 19:16:36 2003 UTC).
ALSA device list:
  #0: Sound Blaster Live! (rev.8) at 0x3400, irq 5
NET: Registered protocol family 2
IP: routing cache hash table of 2048 buckets, 16Kbytes
TCP: Hash tables configured (established 16384 bind 32768)
NET: Registered protocol family 1
NET: Registered protocol family 10
IPv6 over IPv4 tunneling driver
ACPI: (supports S0 S1 S3 S4 S5)
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 156k freed
Adding 706820k swap on /dev/hda6.  Priority:-1 extents:1
Disabled Privacy Extensions on device c03deda0(lo)
eth0: link up, 100Mbps, full-duplex, lpa 0x45E1
nfs warning: mount version older than kernel
nfs warning: mount version older than kernel
lp0: using parport0 (polling).
lp0: console ready
spurious 8259A interrupt: IRQ7.
atkbd.c: Unknown key released (translated set 2, code 0x7a on isa0060/serio0).
atkbd.c: This is an XFree86 bug. It shouldn't access hardware directly.
atkbd.c: Unknown key released (translated set 2, code 0x7a on isa0060/serio0).
atkbd.c: This is an XFree86 bug. It shouldn't access hardware directly.
drivers/usb/host/uhci-hcd.c: USB Universal Host Controller Interface driver v2.1
uhci_hcd 0000:00:1f.2: UHCI Host Controller
PCI: Setting latency timer of device 0000:00:1f.2 to 64
uhci_hcd 0000:00:1f.2: irq 9, io base 00001cc0
uhci_hcd 0000:00:1f.2: new USB bus registered, assigned bus number 1
drivers/usb/host/uhci-hcd.c: detected 2 ports
uhci_hcd 0000:00:1f.2: root hub device address 1
usb usb1: new device strings: Mfr=3, Product=2, SerialNumber=1
drivers/usb/core/message.c: USB device number 1 default language ID 0x409
usb usb1: Product: UHCI Host Controller
usb usb1: Manufacturer: Linux 2.6.2 uhci_hcd
usb usb1: SerialNumber: 0000:00:1f.2
drivers/usb/core/usb.c: usb_hotplug
usb usb1: registering 1-0:1.0 (config #1, interface 0)
drivers/usb/core/usb.c: usb_hotplug
hub 1-0:1.0: usb_probe_interface
hub 1-0:1.0: usb_probe_interface - got id
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 2 ports detected
hub 1-0:1.0: standalone hub
hub 1-0:1.0: ganged power switching
hub 1-0:1.0: global over-current protection
hub 1-0:1.0: Port indicators are not supported
hub 1-0:1.0: power on to power good time: 2ms
hub 1-0:1.0: hub controller current requirement: 0mA
hub 1-0:1.0: local power source is good
hub 1-0:1.0: no over-current condition exists
hub 1-0:1.0: enabling power on all ports
uhci_hcd 0000:00:1f.4: UHCI Host Controller
PCI: Setting latency timer of device 0000:00:1f.4 to 64
uhci_hcd 0000:00:1f.4: irq 11, io base 00001ce0
uhci_hcd 0000:00:1f.4: new USB bus registered, assigned bus number 2
drivers/usb/host/uhci-hcd.c: detected 2 ports
uhci_hcd 0000:00:1f.4: root hub device address 1
usb usb2: new device strings: Mfr=3, Product=2, SerialNumber=1
drivers/usb/core/message.c: USB device number 1 default language ID 0x409
usb usb2: Product: UHCI Host Controller
usb usb2: Manufacturer: Linux 2.6.2 uhci_hcd
usb usb2: SerialNumber: 0000:00:1f.4
drivers/usb/core/usb.c: usb_hotplug
usb usb2: registering 2-0:1.0 (config #1, interface 0)
drivers/usb/core/usb.c: usb_hotplug
hub 2-0:1.0: usb_probe_interface
hub 2-0:1.0: usb_probe_interface - got id
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 2 ports detected
hub 2-0:1.0: standalone hub
hub 2-0:1.0: ganged power switching
hub 2-0:1.0: global over-current protection
hub 2-0:1.0: Port indicators are not supported
hub 2-0:1.0: power on to power good time: 2ms
hub 2-0:1.0: hub controller current requirement: 0mA
hub 2-0:1.0: local power source is good
hub 2-0:1.0: no over-current condition exists
hub 2-0:1.0: enabling power on all ports
drivers/usb/host/uhci-hcd.c: 1cc0: suspend_hc
drivers/usb/host/uhci-hcd.c: 1ce0: suspend_hc

------------- COPIED FROM SCREEN -------------
hub 1-0:1.0: new USB device on port 2, assigned address 2
usb 1-2: Pruduct: USB Keyboard
usb 1-2: Manufacturer: BTC
hun 1-2:1.0: USB hub found
hub 1-2:1.0: 2 ports detected
hub 1-2:1.0: new USB device on port 1, assigned address 3
usb 1-2.1: Product USB Keyboard
usb 1-2.1: Manufacturer: BTC
drivers/usb/input/hid-core.c: ctrl urb status -104 received
drvers/usb/input/hid-core.c: timeout initializing reports

input: USB HID v1.00 Keyboard [BTC USB Keyboard] on usb-0000:00:1f.2-2.1
------------END COPIED FROM SCREEN ------------


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2003-10-31 16:23           ` David Brownell
  2003-10-31 18:34             ` David Mosberger
@ 2004-03-06  2:08             ` David Mosberger
  2004-03-06  2:13               ` David Mosberger
  2004-03-06  4:55               ` David Brownell
  1 sibling, 2 replies; 45+ messages in thread
From: David Mosberger @ 2004-03-06  2:08 UTC (permalink / raw)
  To: David Brownell
  Cc: davidm, Greg KH, vojtech, linux-usb-devel, linux-kernel,
	linux-ia64, pochini

OK, finally a bit of progress.  If you remember back in October 2003 I
reported:

 > One-line summary: plug-in your USB keyboard, see your machine die.

 > So, I have this non-name USB keyboard (with built-in 2-port USB
 > hub) which reliably crashes 2.6.0-test{8,9} on both x86 and ia64.
 > In retrospect, it's clear to me that the same keyboard also
 > occasionally crashes 2.4 kernels, but there the problem appears
 > more seldom.  Perhaps once in 10 reboots and once the machine is
 > booted and the keyboard is running, it keeps on working.  The
 > keyboard in question is a BTC 5141H.

After this, I spent a (small) amount of time looking over the HID code
etc to see what could be causing it.  I could find nothing wrong so I
gave up, connected another USB keyboard, and basically ignored the
problem.  In retrospect, that was Good Thinking, because I was
apparently looking at the wrong code: the problem _does_ appear to be
coming from the USB HCD, not from the HIDeous code.

Specifically, after upgrading to 2.6.4-rc2, _all_ of the ia64 machines
I tested would crash as soon as they had _any_ USB keyboard plugged
in.  That is, the problem no longer was limited to the BTC keyboard,
which is special because it has a built-in hub.  This was encouraging.

Turns out it's this patch that was causing the crashes:

 http://linux.bkbits.net:8080/linux-2.5/cset@1.1619.1.17

That was strange, because even to my USB-untrained eye the patch
looked obviously correct.  However, I think the root cause of the
problem really has to do with a race-condition between the controller
and the driver.  In particular, if I apply the patch below, my USB
keyboards (including the BTC keyboard) work just fine!

===== drivers/usb/host/ohci-q.c 1.48 vs edited =====
--- 1.48/drivers/usb/host/ohci-q.c	Tue Mar  2 05:52:46 2004
+++ edited/drivers/usb/host/ohci-q.c	Fri Mar  5 17:25:55 2004
@@ -438,7 +451,7 @@
 	 * behave.  frame_no wraps every 2^16 msec, and changes right before
 	 * SF is triggered.
 	 */
-	ed->tick = OHCI_FRAME_NO(ohci->hcca) + 1;
+	ed->tick = OHCI_FRAME_NO(ohci->hcca) + 2;
 
 	/* rm_list is just singly linked, for simplicity */
 	ed->ed_next = ohci->ed_rm_list;

However, I think the root-cause of the problem may be this optimization
in ohci_irq():

	/* we can eliminate a (slow) readl() if _only_ WDH caused this irq */

Indeed, if I apply this patch instead:

===== drivers/usb/host/ohci-hcd.c 1.56 vs edited =====
--- 1.56/drivers/usb/host/ohci-hcd.c	Tue Mar  2 05:52:40 2004
+++ edited/drivers/usb/host/ohci-hcd.c	Fri Mar  5 17:45:09 2004
@@ -584,7 +584,7 @@
  	int			ints; 
 
 	/* we can eliminate a (slow) readl() if _only_ WDH caused this irq */
-	if ((ohci->hcca->done_head != 0)
+	if (0 && (ohci->hcca->done_head != 0)
 			&& ! (le32_to_cpup (&ohci->hcca->done_head) & 0x01)) {
 		ints =  OHCI_INTR_WDH;
 

there are no crashes either.

So my theory is that I was seeing this sequence of events:

 - HCD signals WDH interrupt & sends DMA to update the frame number in
   the host-controller communication area (HCCA)

 - host gets interrupt, but skips readl() and hence reads a stale
   frame number N instead of the up-to-date value (N+1)

 - HCD cancels a transfer descriptor (TD), moves it to the "remove list"
   and calculates the frame number at which it can be remove from
   the host-controller's list as N+1

 - SOF interrupt arrives (probably was pending already?)

 - interrupt handler does a readl() and now sees the updated
   frame-number N+1

 - HCD sees that the cancelled TD's time stamp N+1 is <= the current
   current time stamp (N+1) and goes ahead and removes it from the
   host-list, while the controller is still looking at the entry being
   removed

 - HCD ends up dereferencing a bad pointer and ends up reading from
   address 0xf0000000, which on our ia64 machines is a read-only area,
   which then results in a machine-check abort

Does this sound plausible?

What beats me is why UHCI would have the same issue.  I know even less
about UHCI than I do about OHCI but perhaps there is a similar
problem.

	--david

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-06  2:08             ` David Mosberger
@ 2004-03-06  2:13               ` David Mosberger
  2004-03-06  4:55               ` David Brownell
  1 sibling, 0 replies; 45+ messages in thread
From: David Mosberger @ 2004-03-06  2:13 UTC (permalink / raw)
  To: davidm
  Cc: David Brownell, Greg KH, vojtech, linux-usb-devel, linux-kernel,
	linux-ia64, pochini

Typo-alert:

>>>>> On Fri, 5 Mar 2004 18:08:40 -0800, David Mosberger <davidm@linux.hpl.hp.com> said:

  David>  - HCD ends up dereferencing a bad pointer and ends up
  David> reading from address 0xf0000000, which on our ia64 machines
  David> is a read-only area, which then results in a machine-check
  David> abort
              ^^^^^^^^^
              make that "write-only"

  --david

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-06  2:08             ` David Mosberger
  2004-03-06  2:13               ` David Mosberger
@ 2004-03-06  4:55               ` David Brownell
  2004-03-06  5:49                 ` David Mosberger
  2004-03-06  9:17                 ` [linux-usb-devel] " David Mosberger
  1 sibling, 2 replies; 45+ messages in thread
From: David Brownell @ 2004-03-06  4:55 UTC (permalink / raw)
  To: davidm
  Cc: Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini

David Mosberger wrote:
> OK, finally a bit of progress.  If you remember back in October 2003 I
> reported:
> 
>  > One-line summary: plug-in your USB keyboard, see your machine die.
> 
>  > So, I have this non-name USB keyboard (with built-in 2-port USB
>  > hub) which reliably crashes 2.6.0-test{8,9} on both x86 and ia64.
>  > In retrospect, it's clear to me that the same keyboard also
>  > occasionally crashes 2.4 kernels, but there the problem appears
>  > more seldom.  
> 
> Specifically, after upgrading to 2.6.4-rc2, _all_ of the ia64 machines
> I tested would crash as soon as they had _any_ USB keyboard plugged
> in.  That is, the problem no longer was limited to the BTC keyboard,
> which is special because it has a built-in hub.  This was encouraging.
> 
> Turns out it's this patch that was causing the crashes:
> 
>  http://linux.bkbits.net:8080/linux-2.5/cset@1.1619.1.17

Maybe in 2.6.4-rc2... but not in 2.6.0-test{8,9}!!

There's something wierd going on recently for sure, and it's
not caused just by that patch.  I'm not sure reverting that
would make things better overall right now; hard to say.

I'm thinking the "disable periodic schedule" patch is also
worth looking at.  I can imagine some silicon having bugs
if that's turned off (just like some _systems_ have issues
when it's left on).


> That was strange, because even to my USB-untrained eye the patch
> looked obviously correct.  However, I think the root cause of the
> problem really has to do with a race-condition between the controller
> and the driver.  In particular, if I apply the patch below, my USB
> keyboards (including the BTC keyboard) work just fine!
> 
> ...
> -	ed->tick = OHCI_FRAME_NO(ohci->hcca) + 1;
> +	ed->tick = OHCI_FRAME_NO(ohci->hcca) + 2;

I've seen that _change_ behavior in some regression tests
(usbtest test11/test12), as if the extra msec let things
quiesce (so only one of two broken states showed, and
not the oopsable one) but not _fix_ it.


> However, I think the root-cause of the problem may be this optimization
> in ohci_irq():
> 
> 	/* we can eliminate a (slow) readl() if _only_ WDH caused this irq */
> 
> Indeed, if I apply this patch instead:
> 
> ...
>  	/* we can eliminate a (slow) readl() if _only_ WDH caused this irq */
> -	if ((ohci->hcca->done_head != 0)
> +	if (0 && (ohci->hcca->done_head != 0)
> ...
> 
> there are no crashes either.

See this post from Martin Diehl ... my response isn't out yet:

   http://marc.theaimsgroup.com/?l=linux-usb-devel&m=107850825815775&w=2

The reason I keep ending up thinking that readl-elimination
must be OK (me agreeing with Martin) is that the memory there
came from dma_alloc_coherent() ... so if anything's wrong,
it'd be at most lack of rmb(), not a stale-cache kind of thing.


> So my theory is that I was seeing this sequence of events:
> 
>  - HCD signals WDH interrupt & sends DMA to update the frame number in
>    the host-controller communication area (HCCA)
> 
>  - host gets interrupt, but skips readl() and hence reads a stale
>    frame number N instead of the up-to-date value (N+1)

It reads the frame number directly from the controller, so it's
not possible that it can be so stale that an rmb() wouldn't fix
sequencing issues.

What might be possible though is that the donelist gets modified
by the time the unlinks get processed, with some extra TDs changing
state (from HC perspective) ... haven't explored that possibility.


>  - HCD cancels a transfer descriptor (TD), moves it to the "remove list"
>    and calculates the frame number at which it can be remove from
>    the host-controller's list as N+1

That'd be an ED on the remove list, not a TD.  Also in dma-coherent
memory.  The cancelation would apply to one or more of the TDs
queued to that ED.


>  - SOF interrupt arrives (probably was pending already?)
> 
>  - interrupt handler does a readl() and now sees the updated
>    frame-number N+1
> 
>  - HCD sees that the cancelled TD's time stamp N+1 is <= the current
>    current time stamp (N+1) and goes ahead and removes it from the
>    host-list, while the controller is still looking at the entry being
>    removed

This ED-level disagreement between the HC and HCD might explain
some issues.  I think the current trouble cases are usually where
there's only one TD queued to the ED, so that TD and the dummy
keep ping-ponging back and forth.  The HC certainly seems to
overwrite the "dummy TD" --

>  - HCD ends up dereferencing a bad pointer and ends up reading from
>    address 0xf0000000, which on our ia64 machines is a read-only area,
>    which then results in a machine-check abort

I'm surprised that DMA from a read-only area would be a problem!  :)
If OHCI is getting a PCI error, I'd expect a "UE" IRQ.


> Does this sound plausible?

Parts of it.  There's definite recent nastiness.  Of the type
that other eyes sometimes see better.


> What beats me is why UHCI would have the same issue.  I know even less
> about UHCI than I do about OHCI but perhaps there is a similar
> problem.

I still suspect some problem in the HID code.  But right now
I'm quite certain of a recent-ish OHCI issue.

- Dave


> 
> 	--david
> 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-06  4:55               ` David Brownell
@ 2004-03-06  5:49                 ` David Mosberger
  2004-03-06  7:21                   ` David Mosberger
  2004-03-06 16:37                   ` David Brownell
  2004-03-06  9:17                 ` [linux-usb-devel] " David Mosberger
  1 sibling, 2 replies; 45+ messages in thread
From: David Mosberger @ 2004-03-06  5:49 UTC (permalink / raw)
  To: David Brownell
  Cc: davidm, Greg KH, vojtech, linux-usb-devel, linux-kernel,
	linux-ia64, pochini

>>>>> On Fri, 05 Mar 2004 20:55:01 -0800, David Brownell <david-b@pacbell.net> said:
  >> Turns out it's this patch that was causing the crashes:

  >> http://linux.bkbits.net:8080/linux-2.5/cset@1.1619.1.17

  David.B> Maybe in 2.6.4-rc2... but not in 2.6.0-test{8,9}!!

Of course.  What I'm saying is that in 2.6.0-test{8,9} it was rare to
trigger the problem (only with BTC keyboard) and the change above made
it trivial to trigger the keyboard.  Basically, your fix in
cset 1.1619.1.17 made it more common for stuff to be unlinked in
the "deferred" (proper) manner and that made it much more likely to
trigger the bug.

  David.B> The reason I keep ending up thinking that readl-elimination
  David.B> must be OK (me agreeing with Martin) is that the memory
  David.B> there came from dma_alloc_coherent() ... so if anything's
  David.B> wrong, it'd be at most lack of rmb(), not a stale-cache
  David.B> kind of thing.

It's not an issue of DMA coherency, it's an issue of DMA vs. interrupt
ordering.  I believe the WHD interrupt is arriving at the CPU before
the DMA update to the HCCA is done.  In my second patch, the readl()
at the beginning of the interrupt ensures that the DMA update to
the HCCA is completed before the readl() returns data.

  David.B> It reads the frame number directly from the controller, so
  David.B> it's not possible that it can be so stale that an rmb()
  David.B> wouldn't fix sequencing issues.

Eh, it's read like this:

   #define OHCI_FRAME_NO(hccap) ((u16)le32_to_cpup(&(hccap)->frame_no))

   finish_unlinks (ohci, OHCI_FRAME_NO(ohci->hcca), ptregs);

The HCCA is in host memory.

  >> - HCD ends up dereferencing a bad pointer and ends up reading
  >> from address 0xf0000000, which on our ia64 machines is a
  >> read-only area, which then results in a machine-check abort

  David.B> I'm surprised that DMA from a read-only area would be a
  David.B> problem!  :) If OHCI is getting a PCI error, I'd expect a
  David.B> "UE" IRQ.

You must have not received my follow-up message.  There was a typo in
my message: it was supposed to say "write-only" area.

  David.B> I still suspect some problem in the HID code.  But right
  David.B> now I'm quite certain of a recent-ish OHCI issue.

I'm 99% certain that the problem I saw back in October (BTC keyboard)
was identical to the one triggered by cset 1.1619.1.17.

	--david

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-06  5:49                 ` David Mosberger
@ 2004-03-06  7:21                   ` David Mosberger
  2004-03-06  8:39                     ` David Mosberger
  2004-03-06 16:37                   ` David Brownell
  1 sibling, 1 reply; 45+ messages in thread
From: David Mosberger @ 2004-03-06  7:21 UTC (permalink / raw)
  To: David Brownell, Greg KH, vojtech, linux-usb-devel, linux-kernel,
	linux-ia64, pochini, davidm

>>>>> On Fri, 5 Mar 2004 21:49:20 -0800, David Mosberger <davidm@linux.hpl.hp.com> said:

  David> It's not an issue of DMA coherency, it's an issue of DMA
  David> vs. interrupt ordering.  I believe the WHD interrupt is
  David> arriving at the CPU before the DMA update to the HCCA is
  David> done.

Actually, it looks like I misunderstood the OHCI spec on first reading.
It seems like the causal relationship goes like this:

 (1) Start of Frame -> (2) update HccaFrameNumber -> (3) trigger SF interrupt

Now, suppose you get a WDH interrupt between (1) and (2).  You'd read
the old frame-number yet by the time the interrupt from (3) arrives
the HC might already be accessing the ED that you're about to remove.

If this is correct, then the first patch is probably a better
approach:

===== drivers/usb/host/ohci-q.c 1.48 vs edited =====
--- 1.48/drivers/usb/host/ohci-q.c	Tue Mar  2 05:52:46 2004
+++ edited/drivers/usb/host/ohci-q.c	Fri Mar  5 17:25:55 2004
@@ -438,7 +451,7 @@
 	 * behave.  frame_no wraps every 2^16 msec, and changes right before
 	 * SF is triggered.
 	 */
-	ed->tick = OHCI_FRAME_NO(ohci->hcca) + 1;
+	ed->tick = OHCI_FRAME_NO(ohci->hcca) + 2;
 
 	/* rm_list is just singly linked, for simplicity */
 	ed->ed_next = ohci->ed_rm_list;

This actually makes tons of sense if you think of it like jiffies: you
need to make sure you delay at least one full frame-interval.  If you
set the tick to "+ 1" and the current tick is almost over, that
requirement is violated.  Setting it to "+ 2" should be safe.  The
only problem I can think of is if the delay between point (1) and (2)
were to exceed one frame-interval (1 msec).  While unlikely, the right
PCI topology and heavy bus traffic perhaps could cause such delays.
However, even then it's probably OK because the HC would presumably
stall when trying to update the HccaFrameNumber the second time and
the previous update hasn't completed yet.

Here is one little piece of evidence that's consistent with this
explanation: last week I tried to rip some audio tracks off a CD.
With PIO, this caused interrupts to get delayed 2-3msec and that
caused all kinds of weird effects on the USB bus.  Mostly, I'd
suddenly lose the keyboard or the mouse, though reconnecting them
would "fix" the problem for a short time.

	--david

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-06  7:21                   ` David Mosberger
@ 2004-03-06  8:39                     ` David Mosberger
  0 siblings, 0 replies; 45+ messages in thread
From: David Mosberger @ 2004-03-06  8:39 UTC (permalink / raw)
  To: davidm
  Cc: David Brownell, Greg KH, vojtech, linux-usb-devel, linux-kernel,
	linux-ia64, pochini

>>>>> On Fri, 5 Mar 2004 23:21:32 -0800, David Mosberger <davidm@linux.hpl.hp.com> said:

  David>  (1) Start of Frame -> (2) update HccaFrameNumber -> (3)
  David> trigger SF interrupt

  David> Now, suppose you get a WDH interrupt between (1) and (2).
  David> You'd read the old frame-number yet by the time the interrupt
  David> from (3) arrives the HC might already be accessing the ED
  David> that you're about to remove.

Sorry for the monologue---trying to learn how this is all supposed to
work...

The OHCI spec says that HccaFrameNumber is updated in this fashion:

 (a) send Start-of-Frame
 (b) HccaFrameNumber <- HcFmNumber.StartingFrame
 (c) start processing ED (& post SF intr if requested)

Since start_ed_unlink() uses the following sequence:

 (1) ed->hwINFO |= ED_DEQUEUE
 (2) ed->tick = OHCI_FRAME_NO(ohci->hcca) + 1

Then as long as (1) is observed by the HC before (2) (which it should
be), the race I described isn't possible: if (2) read the "old"
frame-number, then the HC wouldn't have started step (c) yet and hence
the HC would observe step (1) and notice that the ED is being
dequeued.  Converseley, if the HC started to process the ED before (1)
completed (i.e., it missed the ED_DEQUEUE flag), then step (2) must
have been reading the the new frame-number.

OK, I see now the conundrum...

BTW: does the value 0xf0000000 bear any special meaning in USB?  We
already considered whether this would be a NULL-pointer after I/O MMU
translation but it is not: the I/O MMU window is at
0x40000000-0x80000000 on the machines in question.

	--david

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-06  4:55               ` David Brownell
  2004-03-06  5:49                 ` David Mosberger
@ 2004-03-06  9:17                 ` David Mosberger
  2004-03-06 17:30                   ` David Brownell
  1 sibling, 1 reply; 45+ messages in thread
From: David Mosberger @ 2004-03-06  9:17 UTC (permalink / raw)
  To: David Brownell
  Cc: davidm, Greg KH, vojtech, linux-usb-devel, linux-kernel,
	linux-ia64, pochini

>>>>> On Fri, 05 Mar 2004 20:55:01 -0800, David Brownell <david-b@pacbell.net> said:

  >> Does this sound plausible?

  David.B> Parts of it.  There's definite recent nastiness.  Of the
  David.B> type that other eyes sometimes see better.

Here is patch #3.  It also Works For Me.  I was wondering whether it
it is really safe to mess with the OHCI control registers the way
ed_deschedule() does at a time the OHCI is running.  To test this
theory, I delayed the ed_deschedule() handling to finish_unlinks(), as
shown in the patch below.  I don't know whether this is really safe as
far as the host's lists are concerned, but it does avoid the crashes.

What's the argument as to why it's safe to update the OHCI control
registers in ed_deschedule() at the time start_ed_unlink() is running?

	--david

===== drivers/usb/host/ohci-q.c 1.48 vs edited =====
--- 1.48/drivers/usb/host/ohci-q.c	Tue Mar  2 05:52:46 2004
+++ edited/drivers/usb/host/ohci-q.c	Sat Mar  6 01:09:16 2004
@@ -274,7 +274,10 @@
  */
 static void ed_deschedule (struct ohci_hcd *ohci, struct ed *ed) 
 {
+#if 0
 	ed->hwINFO |= ED_SKIP;
+	wmb();
+#endif
 
 	switch (ed->type) {
 	case PIPE_CONTROL:
@@ -431,7 +434,12 @@
 {    
 	ed->hwINFO |= ED_DEQUEUE;
 	ed->state = ED_UNLINK;
+#if 0
 	ed_deschedule (ohci, ed);
+#else
+	ed->hwINFO |= ED_SKIP;
+	wmb();
+#endif
 
 	/* SF interrupt might get delayed; record the frame counter value that
 	 * indicates when the HC isn't looking at it, so concurrent unlinks
@@ -896,6 +904,11 @@
 				last = &ed->ed_next;
 				continue;
 			}
+
+#if 0
+#else
+			ed_deschedule (ohci, ed);
+#endif
 
 			if (!list_empty (&ed->td_list)) {
 				struct td	*td;


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-06  5:49                 ` David Mosberger
  2004-03-06  7:21                   ` David Mosberger
@ 2004-03-06 16:37                   ` David Brownell
  2004-03-08  6:18                     ` Grant Grundler
  1 sibling, 1 reply; 45+ messages in thread
From: David Brownell @ 2004-03-06 16:37 UTC (permalink / raw)
  To: davidm
  Cc: Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini

David Mosberger wrote:

>   David.B> The reason I keep ending up thinking that readl-elimination
>   David.B> must be OK (me agreeing with Martin) is that the memory
>   David.B> there came from dma_alloc_coherent() ... so if anything's
>   David.B> wrong, it'd be at most lack of rmb(), not a stale-cache
>   David.B> kind of thing.
> 
> It's not an issue of DMA coherency, it's an issue of DMA vs. interrupt
> ordering.  I believe the WHD interrupt is arriving at the CPU before

Which is what I sketched to Martin, as the reason to be interested
in a patch that was equivalent to your second patch.

Unfortunately that doesn't check out as being "the" fix here.
Martin still saw the problem.  (And I don't see how it'd be
what gave me several hard lockups ... but it didn't work either,
and the day before, without that change, no lockup.)


> the DMA update to the HCCA is done.  In my second patch, the readl()
> at the beginning of the interrupt ensures that the DMA update to
> the HCCA is completed before the readl() returns data.

DMA-coherent memory is defined as "memory for which a write by either
the device or the processor can immediately be read by the processor
or device without having to worry about caching effects."  Such a
write-buffering mechanism is clearly a type of (write-)caching effect,
and readl() would be a kind of dma_rmb(), if you will.  I suspect the
docs are wrong about what dma-coherent means.


> If this is correct, then the first patch is probably a better
> approach:
> 
> ...
> -	ed->tick = OHCI_FRAME_NO(ohci->hcca) + 1;
> +	ed->tick = OHCI_FRAME_NO(ohci->hcca) + 2;
> ..
> 
> This actually makes tons of sense if you think of it like jiffies: you
> need to make sure you delay at least one full frame-interval.  

Right, I had that theory too.  As I reported, it doesn't check out
as the only issue.  Plus, even assuming it's correct, adding another
millisecond slows down some common operations even more.  I'd rather
just slow down just the "right" paths, rather than all of them ... :)

- Dave






^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-06  9:17                 ` [linux-usb-devel] " David Mosberger
@ 2004-03-06 17:30                   ` David Brownell
  2004-03-07 13:48                     ` Matthias Andree
  2004-03-08 18:49                     ` David Mosberger
  0 siblings, 2 replies; 45+ messages in thread
From: David Brownell @ 2004-03-06 17:30 UTC (permalink / raw)
  To: davidm
  Cc: Greg KH, vojtech, linux-usb-devel, linux-kernel, linux-ia64, pochini

David Mosberger wrote:

> Here is patch #3.  It also Works For Me.  I was wondering whether it

I've had several "Works For Me" patches too, but then if the
silicion got kicked a bit differently it'd not behave... :(


> it is really safe to mess with the OHCI control registers the way
> ed_deschedule() does at a time the OHCI is running.  To test this

It must be, or we'd not have had a driver working for several
years now!

The quick stop/restart cycles haven't seemed to be a problem
with any OHCI silicion in the way they are with, for example,
VIA EHCI.


> theory, I delayed the ed_deschedule() handling to finish_unlinks(), as
> shown in the patch below.  I don't know whether this is really safe as
> far as the host's lists are concerned, but it does avoid the crashes.

My suspicions have been focussing on finish_unlinks().

That's really the only place the HCD does anything
that could corrupt the ED queues, which is what looks
to be happening.

Your change doesn't actually _unlink_ in the same way;
interesting change, I'll have to think about it.  It
certainly changes timings.


> What's the argument as to why it's safe to update the OHCI control
> registers in ed_deschedule() at the time start_ed_unlink() is running?

It's always safe to update those registers, except
that some silicon doesn't support that while the
controller is suspended.

- Dave



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-06 17:30                   ` David Brownell
@ 2004-03-07 13:48                     ` Matthias Andree
  2004-03-08 18:49                     ` David Mosberger
  1 sibling, 0 replies; 45+ messages in thread
From: Matthias Andree @ 2004-03-07 13:48 UTC (permalink / raw)
  To: linux-kernel

On Sat, 06 Mar 2004, David Brownell wrote:

> That's really the only place the HCD does anything
> that could corrupt the ED queues, which is what looks
> to be happening.

I have a situation that may or may not be related, but it makes 2.6 USB
unusable, so I'll just report it.

I have USB lockups with various 2.6.3 variants, including 2.6.4-rc2, on
a VIA KT600 (Via 8237 south bridge). Linux 2.4.25 and newer (BitKeeper version)
behave well, so does FreeBSD 5-CURRENT on the same computer (dual boot), and
apparently the 2.6.3 kernels are fine on a VIA KT133 (Via 82C686 south bridge -
older variant with UDMA66 only).

During a scan attempt with an Epson 1650 (USB 1.1, hence UHCI) I caught these
logs in dmesg, the USB subsystem is dead now (2.6.4-rc2). I don't have
USB 2.0 hardware so I cannot check if EHCI would be fine.

usb 3-2: usb_disable_device nuking non-ep0 URBs
usb 3-2: unregistering interface 3-2:1.0
drivers/usb/core/usb.c: usb_hotplug
usb 3-2: registering 3-2:1.0 (config #1, interface 0)
drivers/usb/core/usb.c: usb_hotplug
usb 3-2: usb_disable_device nuking non-ep0 URBs
usb 3-2: unregistering interface 3-2:1.0
drivers/usb/core/usb.c: usb_hotplug
usb 3-2: registering 3-2:1.0 (config #1, interface 0)
drivers/usb/core/usb.c: usb_hotplug
uhci_hcd 0000:00:10.1: uhci_result_control: failed with status 440000
[ccdc8270] link (0cdc81e2) element (0ced4140)
  0: [cced4140] link (0ced4180) e0 Stalled CRC/Timeo Length=7 MaxLen=7 DT0 EndPt=0 Dev=2, PID=2d(SETUP) (buf=0c446380)
  1: [cced4180] link (0ced41c0) e3 SPD Active Length=0 MaxLen=7 DT1 EndPt=0 Dev=2, PID=69(IN) (buf=06997000)
  2: [cced41c0] link (0ced4200) e3 SPD Active Length=0 MaxLen=7 DT0 EndPt=0 Dev=2, PID=69(IN) (buf=06997008)
  3: [cced4200] link (0ced4240) e3 SPD Active Length=0 MaxLen=1 DT1 EndPt=0 Dev=2, PID=69(IN) (buf=06997010)
  4: [cced4240] link (00000001) e3 IOC Active Length=0 MaxLen=7ff DT1 EndPt=0 Dev=2, PID=e1(OUT) (buf=00000000)

usbfs: USBDEVFS_CONTROL failed cmd usbmodules dev 2 rqt 128 rq 6 len 18 ret -110
uhci_hcd 0000:00:10.1: uhci_result_control: failed with status 500000
[ccdc82a0] link (0cdc81e2) element (0ced4300)
 Element != First TD
  0: [cced4280] link (0ced42c0) e3 Length=7 MaxLen=7 DT0 EndPt=0 Dev=2, PID=2d(SETUP) (buf=0c446b00)
  1: [cced42c0] link (0ced4300) e3 Length=7 MaxLen=7 DT1 EndPt=0 Dev=2, PID=69(IN) (buf=03235000)
  2: [cced4300] link (0ced4340) e3 Stalled Babble Length=0 MaxLen=0 DT0 EndPt=0 Dev=2, PID=69(IN) (buf=03235008)
  3: [cced4340] link (0cdc82d2) e3 IOC Active Length=0 MaxLen=7ff DT1 EndPt=0 Dev=2, PID=e1(OUT) (buf=00000000)
--Queued QH's:
[ccdc82d0] link (0cdc81e2) element (0ced4380)
  0: [cced4380] link (0ced43c0) e3 Active Length=0 MaxLen=7 DT0 EndPt=0 Dev=2, PID=2d(SETUP) (buf=0c446380)
  1: [cced43c0] link (0ced4400) e3 SPD Active Length=0 MaxLen=7 DT1 EndPt=0 Dev=2, PID=69(IN) (buf=06997000)
  2: [cced4400] link (0ced4440) e3 SPD Active Length=0 MaxLen=7 DT0 EndPt=0 Dev=2, PID=69(IN) (buf=06997008)
  3: [cced4440] link (0ced4480) e3 SPD Active Length=0 MaxLen=1 DT1 EndPt=0 Dev=2, PID=69(IN) (buf=06997010)
  4: [cced4480] link (00000001) e3 IOC Active Length=0 MaxLen=7ff DT1 EndPt=0 Dev=2, PID=e1(OUT) (buf=00000000)

usbfs: USBDEVFS_CONTROL failed cmd usbmodules dev 2 rqt 128 rq 6 len 9 ret -75
usb 3-2: control timeout on ep0in
usb 3-2: control timeout on ep0in
usb 3-2: control timeout on ep0in
usb 3-2: bulk timeout on ep1out


-- 
Matthias Andree

Encrypt your mail: my GnuPG key ID is 0x052E7D95

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-06 16:37                   ` David Brownell
@ 2004-03-08  6:18                     ` Grant Grundler
  2004-03-08 18:58                       ` David Mosberger
  0 siblings, 1 reply; 45+ messages in thread
From: Grant Grundler @ 2004-03-08  6:18 UTC (permalink / raw)
  To: David Brownell
  Cc: davidm, Greg KH, vojtech, linux-usb-devel, linux-kernel,
	linux-ia64, pochini

On Sat, Mar 06, 2004 at 08:37:43AM -0800, David Brownell wrote:
> DMA-coherent memory is defined as "memory for which a write by either
> the device or the processor can immediately be read by the processor
> or device without having to worry about caching effects."

The use of "immediate" here means no other sync function needs to
be called to access the data - ie don't need to call pci_sync_single().

In general, the accesses are ordered following PCI ordering rules.
But every architecture (including x86) has issues with "inflight" DMA.
Line based Interrupts are delivered on a different path than DMA 
and thus ordering can't be enforced.
For example, the code around the following comment in drivers/net/tg3.c:
	/*
	 * Flush PCI write.  This also guarantees that our
	 * status block has been flushed to host memory.
	 */


> `Such a
> write-buffering mechanism is clearly a type of (write-)caching effect,

No - the data is still in flight and in some deterministic time frame
will become visible to the CPU.
Calling it a "caching effect" confuses the issues even worse.

> and readl() would be a kind of dma_rmb(), if you will.

Yes, that's correct - but it's orthogonal to "cache coherent".

> I suspect the docs are wrong about what dma-coherent means.

Not "wrong", just misunderstood. ;^)

hth,
grant

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-06 17:30                   ` David Brownell
  2004-03-07 13:48                     ` Matthias Andree
@ 2004-03-08 18:49                     ` David Mosberger
  1 sibling, 0 replies; 45+ messages in thread
From: David Mosberger @ 2004-03-08 18:49 UTC (permalink / raw)
  To: David Brownell
  Cc: davidm, Greg KH, vojtech, linux-usb-devel, linux-kernel,
	linux-ia64, pochini

>>>>> On Sat, 06 Mar 2004 09:30:31 -0800, David Brownell <david-b@pacbell.net> said:

  David.B> David Mosberger wrote:

  >> Here is patch #3.  It also Works For Me.  I was wondering whether
  >> it

  David.B> I've had several "Works For Me" patches too, but then if
  David.B> the silicion got kicked a bit differently it'd not
  David.B> behave... :(

Just for completeness, I can confirm this.  Depending on the exact USB
configuration (varying number and types of devices connected via a
varying number of hubs), there are still problems.  So clearly my
patches didn't address the root cause (yet).

	--david

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-08  6:18                     ` Grant Grundler
@ 2004-03-08 18:58                       ` David Mosberger
  2004-03-08 21:48                         ` David Brownell
  0 siblings, 1 reply; 45+ messages in thread
From: David Mosberger @ 2004-03-08 18:58 UTC (permalink / raw)
  To: Grant Grundler
  Cc: David Brownell, davidm, Greg KH, vojtech, linux-usb-devel,
	linux-kernel, linux-ia64, pochini

>>>>> On Sun, 7 Mar 2004 22:18:02 -0800, Grant Grundler <iod00d@hp.com> said:

  >> `Such a write-buffering mechanism is clearly a type of
  >> (write-)caching effect,

  Grant> No - the data is still in flight and in some deterministic
  Grant> time frame will become visible to the CPU.  Calling it a
  Grant> "caching effect" confuses the issues even worse.

That's why I'm so unhappy that the PCI interface used the term
"consistent" memory, when it should have said "coherent".  We
should nail a plate on everbody's forehead saying:

 consistency = coherency + ordering

Perhaps then people would start to have a clear distincition between
the meaning of the two terms (or at least it would force them to think
about it! ;-).

But in any case, as later experimentation confirmed, the USB bug isn't
(just) an ordering issue.  The order of operation described in the
OHCI spec does not rely on any specific order of interrupt delivery at
all, so I was wrong about that.

	--david

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-08 18:58                       ` David Mosberger
@ 2004-03-08 21:48                         ` David Brownell
  2004-03-09  9:15                           ` David Mosberger
  0 siblings, 1 reply; 45+ messages in thread
From: David Brownell @ 2004-03-08 21:48 UTC (permalink / raw)
  To: davidm
  Cc: Grant Grundler, Greg KH, vojtech, linux-usb-devel, linux-kernel,
	linux-ia64, pochini

[-- Attachment #1: Type: text/plain, Size: 647 bytes --]

David Mosberger wrote:

>  consistency = coherency + ordering

That's one of the things I meant when I said the docs weren't
quite there yet ... the dma_*() API docs don't address how
to achieve the ordering, or even mention that it's an issue.

Plus, there are definitions of coherency that include event
ordering; it's not clear which definition is intended.


Anyway, see the attached patch.  It goes on top of 2.6.4-bk2
and passes my unlink tests (the ones that haven't been run
much on recent kernels!) on several different OHCI hosts.

Other issues may be lurking, but this seems to fix a handful
of nasties that I could reproduce.

- Dave


[-- Attachment #2: ohci-0308.patch --]
[-- Type: text/plain, Size: 3017 bytes --]

Fixes for some recent OHCI instability.

    - Never munge ed->hwTailP once it's set up, except when appending
      to the queue.

    - Don't clear control/bulk "current ED" registers when taking
      entries off the queue.  The chip may still be using the value.

    - Revert part of previous bk1 patch, which removed a fast path
      that makes a visible difference in some common operations.
      Removing it just slowed timings, and didn't fix anything.

    - Make the "unlink during submit" path behave better (SMP).


--- 1.50/drivers/usb/host/ohci-q.c	Tue Mar  2 22:48:07 2004
+++ edited/drivers/usb/host/ohci-q.c	Mon Mar  8 13:15:08 2004
@@ -282,7 +282,6 @@
 			if (!ed->hwNextED) {
 				ohci->hc_control &= ~OHCI_CTRL_CLE;
 				writel (ohci->hc_control, &ohci->regs->control);
-				writel (0, &ohci->regs->ed_controlcurrent);
 				// post those pci writes
 				(void) readl (&ohci->regs->control);
 			}
@@ -306,7 +305,6 @@
 			if (!ed->hwNextED) {
 				ohci->hc_control &= ~OHCI_CTRL_BLE;
 				writel (ohci->hc_control, &ohci->regs->control);
-				writel (0, &ohci->regs->ed_bulkcurrent);
 				// post those pci writes
 				(void) readl (&ohci->regs->control);
 			}
@@ -331,6 +329,19 @@
 		periodic_unlink (ohci, ed);
 		break;
 	}
+
+	/* NOTE: Except for a couple of exceptionally clean unlink cases
+	 * (like unlinking the only c/b ED, with no TDs) HCs may still be
+	 * caching this operational ED (or its address).  Safe unlinking
+	 * involves not marking it ED_IDLE till INTR_SF; we always do that
+	 * if td_list isn't empty.  Otherwise the race is small; but ...
+	 */
+	if (ed->state == ED_OPER) {
+		ed->state = ED_IDLE;
+		ed->hwINFO &= ~(ED_SKIP | ED_DEQUEUE);
+		ed->hwHeadP &= ~ED_H;
+		wmb ();
+	}
 }
 
 
@@ -794,8 +805,6 @@
 		next->next_dl_td = rev;	
 		rev = next;
 
-		if (ed->hwTailP == cpu_to_le32 (next->td_dma))
-			ed->hwTailP = next->hwNextTD;
 		ed->hwHeadP = next->hwNextTD | toggle;
 	}
 
@@ -941,12 +950,7 @@
 				continue;
 			}
 
-			/* patch pointers hc uses ... tail, if we're removing
-			 * an otherwise active td, and whatever td pointer
-			 * points to this td
-			 */
-			if (ed->hwTailP == cpu_to_le32 (td->td_dma))
-				ed->hwTailP = td->hwNextTD;
+			/* patch pointer hc uses */
 			savebits = *prev & ~cpu_to_le32 (TD_MASK);
 			*prev = td->hwNextTD | savebits;
 
@@ -1041,7 +1045,7 @@
 
 		/* clean schedule:  unlink EDs that are no longer busy */
 		if (list_empty (&ed->td_list))
-			start_ed_unlink (ohci, ed);
+			ed_deschedule (ohci, ed);
 		/* ... reenabling halted EDs only after fault cleanup */
 		else if ((ed->hwINFO & (ED_SKIP | ED_DEQUEUE)) == ED_SKIP) {
 			td = list_entry (ed->td_list.next, struct td, td_list);
--- 1.57/drivers/usb/host/ohci-hcd.c	Tue Mar  2 22:48:07 2004
+++ edited/drivers/usb/host/ohci-hcd.c	Mon Mar  8 12:39:24 2004
@@ -233,7 +233,7 @@
 	spin_lock (&urb->lock);
 	if (urb->status != -EINPROGRESS) {
 		spin_unlock (&urb->lock);
-
+		urb->hcpriv = urb_priv;
 		finish_urb (ohci, urb, 0);
 		retval = 0;
 		goto fail;

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-08 21:48                         ` David Brownell
@ 2004-03-09  9:15                           ` David Mosberger
  2004-03-09 17:36                             ` David Brownell
  0 siblings, 1 reply; 45+ messages in thread
From: David Mosberger @ 2004-03-09  9:15 UTC (permalink / raw)
  To: David Brownell
  Cc: davidm, Grant Grundler, Greg KH, vojtech, linux-usb-devel,
	linux-kernel, linux-ia64, pochini

How about something along the following lines?  The patch is relative
to 2.6.4-rc1.  What it does is add a new state ED_DESCHEDULED, which
is treated exactly like ED_IDLE, except that in this state, the HC may
still be referring to the ED in question.  Thus, if
ohci_endpoint_disable() finds an ED in this state, it will take it
down via start_ed_unlink() and once that is done, the ED will be in
ED_IDLE state and hence safe for removal (HC won't be referencing it
anymore).

I left some printk's for low-frequency events enabled, to give you a
better idea of how this is working.  Note that I removed the
posting of the PCI-writes in ed_deschedule().  I don't think those
are needed.  As far as I can see, a write-memory barrier at the
end of this routine should suffice.

With this patch applied:

 - My dual-CPU Itanium machine boots fine (doesn't hang anymore
   as with 2.6.4-rc1)

 - My home workstation (single CPU Itanium with 3 USB devices)
   now boots and works fine again.  Moreover, I don't see any
   USB keyboard/mouse disconnects anymore when doing heavy
   IDE traffic.

The BTC keyboard, however, still does NOT work.  I'm fairly certain
now that this is indeed a separate problem in the HID.  The reason
being that I'm now fairly comfortable with the OHCI HCD and if this
bug truly were in the OHCI HCD, then why does it trigger with UHCI as
well (I haven't tried EHCI yet; maybe I should try connecting the BTC
keyboard via a USB 2.0 hub).

Comments?

	--david

===== drivers/usb/host/ohci-hcd.c 1.56 vs edited =====
--- 1.56/drivers/usb/host/ohci-hcd.c	Tue Mar  2 05:52:40 2004
+++ edited/drivers/usb/host/ohci-hcd.c	Tue Mar  9 00:24:54 2004
@@ -239,8 +239,11 @@
 		goto fail;
 	}
 
-	/* schedule the ed if needed */
-	if (ed->state == ED_IDLE) {
+	if (ed->state == ED_UNLINK) {
+		printk("%s: ed %p being unlinked\n", __FUNCTION__, ed);
+		goto fail0;
+	} else if (ed->state == ED_IDLE || ed->state == ED_DESCHEDULED) {
+		/* schedule the ed if needed */
 		retval = ed_schedule (ohci, ed);
 		if (retval < 0)
 			goto fail0;
@@ -344,9 +347,15 @@
 	if (!ed)
 		goto done;
 
+	printk("%s(ed=%p, type=%d, state=%u)\n", __FUNCTION__, ed, ed->type, ed->state);
+
 	if (!HCD_IS_RUNNING (ohci->hcd.state))
 		ed->state = ED_IDLE;
 	switch (ed->state) {
+	case ED_DESCHEDULED:
+		start_ed_unlink (ohci, ed);
+		BUG_ON (ed->state != ED_UNLINK);
+		/* fall through */
 	case ED_UNLINK:		/* wait for hw to finish? */
 		/* major IRQ delivery trouble loses INTR_SF too... */
 		WARN_ON (limit-- == 0);
===== drivers/usb/host/ohci-q.c 1.48 vs edited =====
--- 1.48/drivers/usb/host/ohci-q.c	Tue Mar  2 05:52:46 2004
+++ edited/drivers/usb/host/ohci-q.c	Tue Mar  9 00:39:39 2004
@@ -168,10 +168,16 @@
 {	 
 	int	branch;
 
+#if 0
+printk("%s: ed=%p type=%d\n", __FUNCTION__, ed, ed->type);
+#endif
 	ed->state = ED_OPER;
 	ed->ed_prev = 0;
 	ed->ed_next = 0;
 	ed->hwNextED = 0;
+	ed->hwINFO &= ~ED_SKIP;
+	WARN_ON (ed->hwHeadP & ED_H);
+	ed->hwHeadP &= ~ED_H;	/* XXX is this really needed?? --davidm */
 	wmb ();
 
 	/* we care about rm_list when setting CLE/BLE in case the HC was at
@@ -267,24 +273,30 @@
 		ed, ed->branch, ed->load, ed->interval);
 }
 
-/* unlink an ed from one of the HC chains. 
+/* Deschedule an ed from one of the HC chains.
  * just the link to the ed is unlinked.
  * the link from the ed still points to another operational ed or 0
- * so the HC can eventually finish the processing of the unlinked ed
+ * so the HC can eventually finish the processing of the unlinked ed.
+ * The ED isn't idle after returning from this routine, because the HC
+ * may continue to refer to it, but as far as the HCD is concerned,
+ * the ED is reusable.  If it is necessary to get the ED into the
+ * ED_IDLE state, the full unlink-sequence needs to be performed
+ * (via start_ed_unlink()).
  */
-static void ed_deschedule (struct ohci_hcd *ohci, struct ed *ed) 
+static void ed_deschedule (struct ohci_hcd *ohci, struct ed *ed)
 {
+#if 0
+printk("%s: ed=%p type=%d\n", __FUNCTION__, ed, ed->type);
+#endif
 	ed->hwINFO |= ED_SKIP;
 
 	switch (ed->type) {
 	case PIPE_CONTROL:
+		/* remove ED from the HC's list: */
 		if (ed->ed_prev == NULL) {
 			if (!ed->hwNextED) {
 				ohci->hc_control &= ~OHCI_CTRL_CLE;
 				writel (ohci->hc_control, &ohci->regs->control);
-				writel (0, &ohci->regs->ed_controlcurrent);
-				// post those pci writes
-				(void) readl (&ohci->regs->control);
 			}
 			writel (le32_to_cpup (&ed->hwNextED),
 				&ohci->regs->ed_controlhead);
@@ -292,6 +304,7 @@
 			ed->ed_prev->ed_next = ed->ed_next;
 			ed->ed_prev->hwNextED = ed->hwNextED;
 		}
+		/* remove ED from the HCD's list: */
 		if (ohci->ed_controltail == ed) {
 			ohci->ed_controltail = ed->ed_prev;
 			if (ohci->ed_controltail)
@@ -302,13 +315,11 @@
 		break;
 
 	case PIPE_BULK:
+		/* remove ED from the singly-linked HC's list: */
 		if (ed->ed_prev == NULL) {
 			if (!ed->hwNextED) {
 				ohci->hc_control &= ~OHCI_CTRL_BLE;
 				writel (ohci->hc_control, &ohci->regs->control);
-				writel (0, &ohci->regs->ed_bulkcurrent);
-				// post those pci writes
-				(void) readl (&ohci->regs->control);
 			}
 			writel (le32_to_cpup (&ed->hwNextED),
 				&ohci->regs->ed_bulkhead);
@@ -316,6 +327,7 @@
 			ed->ed_prev->ed_next = ed->ed_next;
 			ed->ed_prev->hwNextED = ed->hwNextED;
 		}
+		/* remove ED from the HCD's list: */
 		if (ohci->ed_bulktail == ed) {
 			ohci->ed_bulktail = ed->ed_prev;
 			if (ohci->ed_bulktail)
@@ -331,6 +343,12 @@
 		periodic_unlink (ohci, ed);
 		break;
 	}
+	ed->state = ED_DESCHEDULED;
+	/*
+	 * Ensure HCD sees these updates (in particular update of
+	 * hwINFO) before any following updates.
+	 */
+	wmb ();
 }
 
 
@@ -387,7 +405,7 @@
 	/* NOTE: only ep0 currently needs this "re"init logic, during
 	 * enumeration (after set_address, or if ep0 maxpacket >8).
 	 */
-  	if (ed->state == ED_IDLE) {
+  	if (ed->state == ED_IDLE || ed->state == ED_DESCHEDULED) {
 		u32	info;
 
 		info = usb_pipedevice (pipe);
@@ -418,6 +436,9 @@
 
 done:
 	spin_unlock_irqrestore (&ohci->lock, flags);
+#if 0
+printk("new ed %p: type=%u state=%u hwTailP=%x hwHeadP=%x\n", ed, ed->type, ed->state, ed->hwTailP, ed->hwHeadP);
+#endif
 	return ed; 
 }
 
@@ -428,17 +449,38 @@
  * real work is done at the next start frame (SF) hardware interrupt
  */
 static void start_ed_unlink (struct ohci_hcd *ohci, struct ed *ed)
-{    
+{
+	u32 mask = 0;
+
+	/* deschedule ED as far as the HCD is concerned: */
+	ed_deschedule (ohci, ed);
+
+	/* now initiate the unlink sequence to ensure the HC isn't using the ED anymore: */
+	if (ed->type == PIPE_CONTROL)
+		mask = OHCI_CTRL_CLE;
+	else if (ed->type == PIPE_BULK)
+		mask = OHCI_CTRL_BLE;
+	if (mask) {
+		/*
+		 * Disable scanning of the control or bulk list; this
+		 * ensures that when we get an interrupt with a frame
+		 * # greater than the current frame #, the HC isn't
+		 * using the list anymore.
+		 */
+		ohci->hc_control &= ~mask;
+		writel (ohci->hc_control, &ohci->regs->control);
+	}
 	ed->hwINFO |= ED_DEQUEUE;
 	ed->state = ED_UNLINK;
-	ed_deschedule (ohci, ed);
 
 	/* SF interrupt might get delayed; record the frame counter value that
 	 * indicates when the HC isn't looking at it, so concurrent unlinks
 	 * behave.  frame_no wraps every 2^16 msec, and changes right before
 	 * SF is triggered.
 	 */
+	rmb();	/* ensure ed_deschedule() work is done before we read the frame # */
 	ed->tick = OHCI_FRAME_NO(ohci->hcca) + 1;
+printk("%s: ed=%p tick=%u\n", __FUNCTION__, ed, ed->tick);
 
 	/* rm_list is just singly linked, for simplicity */
 	ed->ed_next = ohci->ed_rm_list;
@@ -452,6 +494,7 @@
 		// flush those pci writes
 		(void) readl (&ohci->regs->control);
 	}
+	/* ??? What if HCD isn't running?  Shouldn't that be handled as well?  */
 }
 
 /*-------------------------------------------------------------------------*
@@ -614,6 +657,7 @@
 		info = TD_CC | TD_DP_SETUP | TD_T_DATA0;
 		td_fill (ohci, info, urb->setup_dma, 8, urb, cnt++);
 		if (data_len > 0) {
+			BUG_ON(data_len >= 4096);
 			info = TD_CC | TD_R | TD_T_DATA1;
 			info |= is_out ? TD_DP_OUT : TD_DP_IN;
 			/* NOTE:  mishandles transfers >8K, some >4K */
@@ -758,6 +802,7 @@
 	struct list_head	*tmp = td->td_list.next;
 	u32			toggle = ed->hwHeadP & ED_C;
 
+printk("%s: ed=%p\n", __FUNCTION__, ed);
 	/* clear ed halt; this is the td that caused it, but keep it inactive
 	 * until its urb->complete() has a chance to clean up.
 	 */
@@ -794,8 +839,6 @@
 		next->next_dl_td = rev;	
 		rev = next;
 
-		if (ed->hwTailP == cpu_to_le32 (next->td_dma))
-			ed->hwTailP = next->hwNextTD;
 		ed->hwHeadP = next->hwNextTD | toggle;
 	}
 
@@ -881,6 +924,7 @@
 {
 	struct ed	*ed, **last;
 
+printk("%s(tick=%u)\n", __FUNCTION__, tick);
 rescan_all:
 	for (last = &ohci->ed_rm_list, ed = *last; ed != NULL; ed = *last) {
 		struct list_head	*entry, *tmp;
@@ -941,12 +985,7 @@
 				continue;
 			}
 
-			/* patch pointers hc uses ... tail, if we're removing
-			 * an otherwise active td, and whatever td pointer
-			 * points to this td
-			 */
-			if (ed->hwTailP == cpu_to_le32 (td->td_dma))
-				ed->hwTailP = td->hwNextTD;
+			/* patch pointers hc uses  */
 			savebits = *prev & ~cpu_to_le32 (TD_MASK);
 			*prev = td->hwNextTD | savebits;
 
@@ -957,23 +996,22 @@
 			/* if URB is done, clean up */
 			if (urb_priv->td_cnt == urb_priv->length) {
 				modified = completed = 1;
-				finish_urb (ohci, urb, regs);
+				finish_urb (ohci, urb, regs);	/* this drops ohci->lock! */
 			}
 		}
 		if (completed && !list_empty (&ed->td_list))
 			goto rescan_this;
 
 		/* ED's now officially unlinked, hc doesn't see */
+#if 1
+printk("%s: done with ed %p\n", __FUNCTION__, ed);
+#endif
 		ed->state = ED_IDLE;
 		ed->hwINFO &= ~(ED_SKIP | ED_DEQUEUE);
 		ed->hwHeadP &= ~ED_H;
 		ed->hwNextED = 0;
 
-		/* but if there's work queued, reschedule */
-		if (!list_empty (&ed->td_list)) {
-			if (HCD_IS_RUNNING(ohci->hcd.state))
-				ed_schedule (ohci, ed);
-		}
+		BUG_ON (!list_empty (&ed->td_list));
 
 		if (modified)
 			goto rescan_all;
@@ -1039,9 +1077,9 @@
   		if (urb_priv->td_cnt == urb_priv->length)
   			finish_urb (ohci, urb, regs);
 
-		/* clean schedule:  unlink EDs that are no longer busy */
+		/* clean schedule:  deschedule EDs that are no longer busy */
 		if (list_empty (&ed->td_list))
-			start_ed_unlink (ohci, ed);
+			ed_deschedule (ohci, ed);
 		/* ... reenabling halted EDs only after fault cleanup */
 		else if ((ed->hwINFO & (ED_SKIP | ED_DEQUEUE)) == ED_SKIP) {
 			td = list_entry (ed->td_list.next, struct td, td_list);
===== drivers/usb/host/ohci.h 1.19 vs edited =====
--- 1.19/drivers/usb/host/ohci.h	Wed Feb 11 03:42:39 2004
+++ edited/drivers/usb/host/ohci.h	Mon Mar  8 23:03:31 2004
@@ -48,6 +48,7 @@
 #define ED_IDLE 	0x00		/* NOT linked to HC */
 #define ED_UNLINK 	0x01		/* being unlinked from hc */
 #define ED_OPER		0x02		/* IS linked to hc */
+#define ED_DESCHEDULED	0x03		/* idle for HCD, but HC may still hold reference to it */
 
 	u8			type; 		/* PIPE_{BULK,...} */
 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-09  9:15                           ` David Mosberger
@ 2004-03-09 17:36                             ` David Brownell
  2004-03-09 17:58                               ` David Mosberger
  0 siblings, 1 reply; 45+ messages in thread
From: David Brownell @ 2004-03-09 17:36 UTC (permalink / raw)
  To: davidm
  Cc: Grant Grundler, Greg KH, vojtech, linux-usb-devel, linux-kernel,
	linux-ia64, pochini

David Mosberger wrote:
> How about something along the following lines?  The patch is relative
> to 2.6.4-rc1.  What it does is add a new state ED_DESCHEDULED, which
> is treated exactly like ED_IDLE, except that in this state, the HC may
> still be referring to the ED in question.  Thus, if

Sounds exactly like ED_UNLINK -- except maybe that it's not
been put onto ed_rm_list (with ED_DEQUEUE set).

Why add another state?


The parts of this patch that came from the one I sent earlier
are obviously correct (what were your test results for that?),
and there's non-worrisome noise (printks etc).

But some parts worry me.  Like changing that code to BUG()
on a driver behavior that's perfectly reasonable; and removing
some of the PCI posting, which makes it easier for the HC
and its driver to disagree about schedule status.


> The BTC keyboard, however, still does NOT work.  I'm fairly certain
> now that this is indeed a separate problem in the HID.  The reason

That was my original suspicion, you may recall ... :)

- Dave


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-09 17:36                             ` David Brownell
@ 2004-03-09 17:58                               ` David Mosberger
  2004-03-09 20:39                                 ` David Brownell
  0 siblings, 1 reply; 45+ messages in thread
From: David Mosberger @ 2004-03-09 17:58 UTC (permalink / raw)
  To: David Brownell
  Cc: davidm, Grant Grundler, Greg KH, vojtech, linux-usb-devel,
	linux-kernel, linux-ia64, pochini

>>>>> On Tue, 09 Mar 2004 09:36:53 -0800, David Brownell <david-b@pacbell.net> said:

  David.B> David Mosberger wrote:
  >> How about something along the following lines?  The patch is
  >> relative to 2.6.4-rc1.  What it does is add a new state
  >> ED_DESCHEDULED, which is treated exactly like ED_IDLE, except
  >> that in this state, the HC may still be referring to the ED in
  >> question.  Thus, if

  David.B> Sounds exactly like ED_UNLINK -- except maybe that it's not
  David.B> been put onto ed_rm_list (with ED_DEQUEUE set).

  David.B> Why add another state?

So you can tell them apart.  See ohci_endpoint_disable().

You seem to think small races are OK.  I disagree.  The smaller the
race, the harder it is to debug (and yes, it _will_ bite you
eventually).  With my patch, you're _guaranteed_ that ED_IDLE means
the HC doesn't refer to the ED anymore so you can free the memory and
do whatever you want without worry about potentially causing the HC to
do something bad.

  David.B> But some parts worry me.  Like changing that code to BUG()
  David.B> on a driver behavior that's perfectly reasonable;

With my patch, the only reason you enter ED_UNLINK state is
ohci_endpoint_disable().  If you try to ohci_urb_enqueue() on an ED in
this state, it will fail, so I don't there is a problem (I see now
that I forgot to set the error-code for this case, that's obviously
something that needs to be fixed).  But perhaps you had different
semantics in mind for ED_UNLINK?

  David.B> removing some of the PCI posting, which makes it easier for
  David.B> the HC and its driver to disagree about schedule status.

Meaning what?  Please explain what sequence of events would cause
problems that could be solved by flushing the posted writes.  AFAICT,
the flushes are just expensive NO-OPs in this particular case.

	--david

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-09 17:58                               ` David Mosberger
@ 2004-03-09 20:39                                 ` David Brownell
  2004-03-09 23:32                                   ` David Mosberger
  2004-03-10  6:59                                   ` David Mosberger
  0 siblings, 2 replies; 45+ messages in thread
From: David Brownell @ 2004-03-09 20:39 UTC (permalink / raw)
  To: davidm
  Cc: Grant Grundler, Greg KH, vojtech, linux-usb-devel, linux-kernel,
	linux-ia64, pochini

David Mosberger wrote:
> 
>   > David Mosberger wrote:
>   >> ...  add a new state
>   >> ED_DESCHEDULED, which is treated exactly like ED_IDLE, except
>   >> that in this state, the HC may still be referring to the ED in
>   >> question.  Thus, if
> 
>   David.B> Sounds exactly like ED_UNLINK -- except maybe that it's not
>   David.B> been put onto ed_rm_list (with ED_DEQUEUE set).
>   David.B> Why add another state?
> 
> So you can tell them apart.  See ohci_endpoint_disable().

Not informative.  Why doesn't UNLINK work as-is?
If there's a bug in how that's handled, better to fix it.
I'd be willing to believe there is one, given proof.


> You seem to think small races are OK.  I disagree.  The smaller the

I certainly didn't say that, any more than you said
that there's no such thing as an engineering tradeoff!

(With which statement I'd likewise disagree.)

But I certainly did mean to say that I was skeptical
you were actually running into one particular race,
which I've had an eye on for some time.  (And which
shouldn't have the failure mode you reported, though
some other things might...)


>   David.B> But some parts worry me.  Like changing that code to BUG()
>   David.B> on a driver behavior that's perfectly reasonable;
> 
> With my patch, the only reason you enter ED_UNLINK state is
> ohci_endpoint_disable().  If you try to ohci_urb_enqueue() on an ED in
> this state, it will fail, so I don't there is a problem (I see now

It's entered in usb_unlink_urb() as always.  So as I
noted, your patch would make drivers BUG() in cases
that are perfectly reasonable according to the API.


> that I forgot to set the error-code for this case, that's obviously
> something that needs to be fixed).  But perhaps you had different
> semantics in mind for ED_UNLINK?

As I've said more than once on this thread.

- Dave


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-09 20:39                                 ` David Brownell
@ 2004-03-09 23:32                                   ` David Mosberger
  2004-03-10  2:53                                     ` David Brownell
  2004-03-10  6:59                                   ` David Mosberger
  1 sibling, 1 reply; 45+ messages in thread
From: David Mosberger @ 2004-03-09 23:32 UTC (permalink / raw)
  To: David Brownell
  Cc: davidm, Grant Grundler, Greg KH, vojtech, linux-usb-devel,
	linux-kernel, linux-ia64, pochini

David B.,

I'll respond to your concerns about my OHCI changes later on.  I
wanted to make progress on the BTC HID issue first so we have the full
picture.

Remember how I was asking what's so special about 0xf00000 because
that's the address the OHCI HC was reading from that caused the crash?
Well, looky here what you get when you interpret the memory at address
0 as an ED:

 hw=(info=f0000000 tailp=f0000000 headp=f0000000 nextED=f0000000)

D'oh!

(I suspect this strange memory pattern is a left-over from the
memory-testing done by the firmware; I should say thanks to whoever
invented this pattern; the problem would have been much harder to
debug if the address hadn't convenently fallen on a write-only region
of the address-space.)

Anyhow, I have been wondering since about Sunday whether it's really
safe to write HcControlHeadED (and HcBulkHeadED) with 0.  The register
description itself is ambiguous.  Now I'm finding that Figure 6-5
"List Service Flow" and Section 6.4.2.2 "Locating Endpoint
Descriptors" are outright contradictory!

The flow-chart suggests that after loading CurrentED with the contents
of HeadED, the HC checks whether CurrentED is 0 and, if so, does nothing.
However, the text in Section 6.4.2.2 says:

 ... At this point, the Host Controller checks the BulkListFilled bit
 or ControlListFilled bit of the HcCommandStatus register.  If the
 respective "Filled" bit is set to 1, there is at least one Endpoint
 Descriptor on the list which needs service.  In this case, the
 HostController will copy the value of HcControlHeadED or HcBulkHeadED
 into HcControlCurrentED or HcBulkCurrentED respectively, clear the
 "Filled" bit to 0, and attempt to process the Endpoint Descriptor now
 present in the CurrentED register. ...

So, if the HC behaves as described in the text, then there is an
obvious race:

  HC						HCD

  - start new frame
  - find list enabled (CLE/BLE set)
  - find "Filled" bit enabled (CLF/BLF set)
						- ed_deschedule() gets called
						- turn off "list-enabled" bit
						- store 0 into *HeadED
  - read *HeadED and store in *CurrentEd
  - clear "Filled" bit to 0
  - process ED at "Current"

In other words, whenever de-scheduling the first ED on the control or
bulk list, there is a risk that with the right timing, you'll end up
processing the "ED" at address zero!

So I changed ohci-q.c from donig this:

	if (ed->ed_prev == NULL) {
		if (!ed->hwNextED) {
			ohci->hc_control &= ~OHCI_CTRL_CLE;
			writel (ohci->hc_control, &ohci->regs->control);
		}
		writel (le32_to_cpup (&ed->hwNextED),
			&ohci->regs->ed_controlhead);
	} else ...

to this:

	if (ed->ed_prev == NULL) {
		if (!ed->hwNextED) {
			ohci->hc_control &= ~OHCI_CTRL_CLE;
			writel (ohci->hc_control, &ohci->regs->control);
		} else
			writel (le32_to_cpup (&ed->hwNextED),
				&ohci->regs->ed_controlhead);
	} else ...

and now there are no more crashes when plugging in the BTC keyboard.
(Note: the change is safe only if the ED being removed remains valid
until the clearing of the CLE is observed, i.e, until the next start
of frame, which is the case with my patch from yesterday.)

Now the next problem is to figure out why one of the URBs submitted by
the HID consistently times out the first time round.  (Oh, and there
is an infinite loop in hidinput_connect() which is trivial to fix.)

	--david

PS: It would have been nice if I had been smart enough to check the
    memory at address 0 upfront (as described above), but in truth, I
    found the problem in the reverse order by noticing that the
    machine died right after/during the ed_deschedule() and the
    clearing of the *HeadED register was just about the only thing
    left that could cause the crash.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-09 23:32                                   ` David Mosberger
@ 2004-03-10  2:53                                     ` David Brownell
  2004-03-10  6:11                                       ` David Mosberger
  0 siblings, 1 reply; 45+ messages in thread
From: David Brownell @ 2004-03-10  2:53 UTC (permalink / raw)
  To: davidm
  Cc: Grant Grundler, Greg KH, vojtech, linux-usb-devel, linux-kernel,
	linux-ia64, pochini

David Mosberger wrote:

> Anyhow, I have been wondering since about Sunday whether it's really
> safe to write HcControlHeadED (and HcBulkHeadED) with 0.  The register
> description itself is ambiguous.  Now I'm finding that Figure 6-5
> "List Service Flow" and Section 6.4.2.2 "Locating Endpoint
> Descriptors" are outright contradictory!

OHCI isn't as tightly specified as one might like.  There are also
differences in how vendors implemented it (notably initialization);
but at least there don't seem to be major incompatibilities that
crept in between implementations.


> So, if the HC behaves as described in the text, then there is an
> obvious race:
> ....
> So I changed ohci-q.c ... to this:
> 
> 	if (ed->ed_prev == NULL) {
> 		if (!ed->hwNextED) {
> 			ohci->hc_control &= ~OHCI_CTRL_CLE;
> 			writel (ohci->hc_control, &ohci->regs->control);
> 		} else

(added the ELSE)

> 			writel (le32_to_cpup (&ed->hwNextED),
> 				&ohci->regs->ed_controlhead);
> 	} else ...
> 
> and now there are no more crashes when plugging in the BTC keyboard.

Interesting.  I had those "else"s in the patch I was testing too,
and they were literally the last "is this really needed?" change I
removed before posting that patch yesterday.  I had tested it on
four different implementations of OHCI (two on SMP) and the "else"
didn't make a difference ... as it might not, given the race
you identified.   This is a good example of something better
found with a fresh set of eyes looking at spec and code!

Looks like that's really needed then!  Whose OHCI silicon are
you testing with, by the way?  Good catch, regardless.

- Dave




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-10  2:53                                     ` David Brownell
@ 2004-03-10  6:11                                       ` David Mosberger
  0 siblings, 0 replies; 45+ messages in thread
From: David Mosberger @ 2004-03-10  6:11 UTC (permalink / raw)
  To: David Brownell
  Cc: davidm, Grant Grundler, Greg KH, vojtech, linux-usb-devel,
	linux-kernel, linux-ia64, pochini

>>>>> On Tue, 09 Mar 2004 18:53:58 -0800, David Brownell <david-b@pacbell.net> said:

  David.B> Whose OHCI silicon David.B> are you testing with, by the way?

lspci claims it's a NEC chip:

a0:01.0 USB Controller: NEC Corporation USB (rev 41) (prog-if 10 [OHCI])
        Subsystem: NEC Corporation USB
        Flags: bus master, medium devsel, latency 128, IRQ 62
        Memory at 00000000d0062000 (32-bit, non-prefetchable) [size=4K]
        Capabilities: <available only to root>

  --david

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-09 20:39                                 ` David Brownell
  2004-03-09 23:32                                   ` David Mosberger
@ 2004-03-10  6:59                                   ` David Mosberger
  2004-03-10  7:52                                     ` Wouter Lueks
                                                       ` (2 more replies)
  1 sibling, 3 replies; 45+ messages in thread
From: David Mosberger @ 2004-03-10  6:59 UTC (permalink / raw)
  To: David Brownell
  Cc: davidm, Grant Grundler, Greg KH, vojtech, linux-usb-devel,
	linux-kernel, linux-ia64, pochini

>>>>> On Tue, 09 Mar 2004 12:39:52 -0800, David Brownell <david-b@pacbell.net> said:

  David.B> David Mosberger wrote:
  >>  > David Mosberger wrote: >> ...  add a new state >>

  >> ED_DESCHEDULED, which is treated exactly like ED_IDLE, except >>
  >> that in this state, the HC may still be referring to the ED in >>
  >> question.  Thus, if

  David.B> Sounds exactly like ED_UNLINK -- except maybe that it's not
  David.B> been put onto ed_rm_list (with ED_DEQUEUE set).  Why add
  David.B> another state?

  >>  So you can tell them apart.  See ohci_endpoint_disable().

  David.B> Not informative.  Why doesn't UNLINK work as-is?  If
  David.B> there's a bug in how that's handled, better to fix it.  I'd
  David.B> be willing to believe there is one, given proof.

The current OHCI relies on the internals of the dma_pool()
implementation.  If the implementation changed to, say, modify the
memory that is free or, heaven forbid, return the memory to the
kernel, you'd get _extremely_ difficult to track down race conditions.
Even so, the code isn't race-free, like I explained yesterday:

 - ed_alloc() clears the ED to 0 via memset()
 - the order in which memset() clears memory is undefined (various
   from platform to platform etc)
 - thus you might get a case where hwTailP is 0 but hwHeadP
   is non-zero, which would cause the HC to happily start
   dereferencing the descriptor

There are probably other potential issues. Frankly, I _don't_ want to
have to deal with this.  Especially considering that the alternative
costs virtually nothing, requires just a few source code changes, and
contains the EDs that are potentially still referenced by the HC to
the HC code itself.

  David.B> But some parts worry me.  Like changing that code to BUG()
  David.B> on a driver behavior that's perfectly reasonable;

I think the patch below incorporates your suggestions and fixes the
problems you pointed out.  In particular:

 - dropped debug printing
 - allow URB submission during ED_UNLINK again
 - add a big fat comment in ed_deschedule() why we must not update
   controlhead when the list goes empty (it's quite counter-intuitive
   to _not_ do it and I'm worried someone in the feature may want
   to "fix" this again...)

The rest should be the samae as yesterday.

The patch is still against 2.6.4-rc2 (and still contains your hwTailP
update fixes).

Please consider for inclusion.

	--david

===== drivers/usb/host/ohci-hcd.c 1.56 vs edited =====
--- 1.56/drivers/usb/host/ohci-hcd.c	Tue Mar  2 05:52:40 2004
+++ edited/drivers/usb/host/ohci-hcd.c	Tue Mar  9 22:29:08 2004
@@ -239,8 +239,8 @@
 		goto fail;
 	}
 
-	/* schedule the ed if needed */
-	if (ed->state == ED_IDLE) {
+	if (ed->state == ED_IDLE || ed->state == ED_DESCHEDULED) {
+		/* schedule the ed if needed */
 		retval = ed_schedule (ohci, ed);
 		if (retval < 0)
 			goto fail0;
@@ -347,6 +347,10 @@
 	if (!HCD_IS_RUNNING (ohci->hcd.state))
 		ed->state = ED_IDLE;
 	switch (ed->state) {
+	case ED_DESCHEDULED:
+		start_ed_unlink (ohci, ed);
+		BUG_ON (ed->state != ED_UNLINK);
+		/* fall through */
 	case ED_UNLINK:		/* wait for hw to finish? */
 		/* major IRQ delivery trouble loses INTR_SF too... */
 		WARN_ON (limit-- == 0);
===== drivers/usb/host/ohci-q.c 1.48 vs edited =====
--- 1.48/drivers/usb/host/ohci-q.c	Tue Mar  2 05:52:46 2004
+++ edited/drivers/usb/host/ohci-q.c	Tue Mar  9 22:27:04 2004
@@ -172,6 +172,7 @@
 	ed->ed_prev = 0;
 	ed->ed_next = 0;
 	ed->hwNextED = 0;
+	ed->hwINFO &= ~ED_SKIP;
 	wmb ();
 
 	/* we care about rm_list when setting CLE/BLE in case the HC was at
@@ -187,6 +188,8 @@
 	switch (ed->type) {
 	case PIPE_CONTROL:
 		if (ohci->ed_controltail == NULL) {
+			/* Writing of control-head is safe only if CLE is off. */
+			BUG_ON (ohci->hc_control & OHCI_CTRL_CLE);
 			writel (ed->dma, &ohci->regs->ed_controlhead);
 		} else {
 			ohci->ed_controltail->ed_next = ed;
@@ -203,6 +206,8 @@
 
 	case PIPE_BULK:
 		if (ohci->ed_bulktail == NULL) {
+			/* Writing of bulk-head is safe only if BLE is off. */
+			BUG_ON (ohci->hc_control & OHCI_CTRL_BLE);
 			writel (ed->dma, &ohci->regs->ed_bulkhead);
 		} else {
 			ohci->ed_bulktail->ed_next = ed;
@@ -267,10 +272,15 @@
 		ed, ed->branch, ed->load, ed->interval);
 }
 
-/* unlink an ed from one of the HC chains. 
+/* Deschedule an ed from one of the HC chains.
  * just the link to the ed is unlinked.
  * the link from the ed still points to another operational ed or 0
- * so the HC can eventually finish the processing of the unlinked ed
+ * so the HC can eventually finish the processing of the unlinked ed.
+ * The ED isn't idle after returning from this routine, because the HC
+ * may continue to refer to it, but as far as the HCD is concerned,
+ * the ED is reusable.  If it is necessary to get the ED into the
+ * ED_IDLE state, the full unlink-sequence needs to be performed
+ * (via start_ed_unlink()).
  */
 static void ed_deschedule (struct ohci_hcd *ohci, struct ed *ed) 
 {
@@ -278,20 +288,33 @@
 
 	switch (ed->type) {
 	case PIPE_CONTROL:
+		/* remove ED from the HC's list: */
 		if (ed->ed_prev == NULL) {
 			if (!ed->hwNextED) {
 				ohci->hc_control &= ~OHCI_CTRL_CLE;
 				writel (ohci->hc_control, &ohci->regs->control);
-				writel (0, &ohci->regs->ed_controlcurrent);
-				// post those pci writes
-				(void) readl (&ohci->regs->control);
-			}
-			writel (le32_to_cpup (&ed->hwNextED),
-				&ohci->regs->ed_controlhead);
+				/*
+				 * Caveat: even though the list is
+				 * empty now, we MUST NOT write 0 to
+				 * controlhead.  The OHCI spec is
+				 * ambiguous on this point (Figure 6-5
+				 * and Section 6.4.2.2 specify
+				 * conflicting behaviors), but there
+				 * are definitely HCs out there which
+				 * will happily try to process an ED
+				 * from address 0.  It's OK not to
+				 * update controlhead because we leave
+				 * the ED alive for long enough that
+				 * this is safe.
+				 */
+			} else
+				writel (le32_to_cpup (&ed->hwNextED),
+					&ohci->regs->ed_controlhead);
 		} else {
 			ed->ed_prev->ed_next = ed->ed_next;
 			ed->ed_prev->hwNextED = ed->hwNextED;
 		}
+		/* remove ED from the HCD's list: */
 		if (ohci->ed_controltail == ed) {
 			ohci->ed_controltail = ed->ed_prev;
 			if (ohci->ed_controltail)
@@ -302,20 +325,33 @@
 		break;
 
 	case PIPE_BULK:
+		/* remove ED from the singly-linked HC's list: */
 		if (ed->ed_prev == NULL) {
 			if (!ed->hwNextED) {
 				ohci->hc_control &= ~OHCI_CTRL_BLE;
 				writel (ohci->hc_control, &ohci->regs->control);
-				writel (0, &ohci->regs->ed_bulkcurrent);
-				// post those pci writes
-				(void) readl (&ohci->regs->control);
-			}
-			writel (le32_to_cpup (&ed->hwNextED),
-				&ohci->regs->ed_bulkhead);
+				/*
+				 * Caveat: even though the list is
+				 * empty now, we MUST NOT write 0 to
+				 * bulkhead.  The OHCI spec is
+				 * ambiguous on this point (Figure 6-5
+				 * and Section 6.4.2.2 specify
+				 * conflicting behaviors), but there
+				 * are definitely HCs out there which
+				 * will happily try to process an ED
+				 * from address 0.  It's OK not to
+				 * update controlhead because we leave
+				 * the ED alive for long enough that
+				 * this is safe.
+				 */
+			} else
+				writel (le32_to_cpup (&ed->hwNextED),
+					&ohci->regs->ed_bulkhead);
 		} else {
 			ed->ed_prev->ed_next = ed->ed_next;
 			ed->ed_prev->hwNextED = ed->hwNextED;
 		}
+		/* remove ED from the HCD's list: */
 		if (ohci->ed_bulktail == ed) {
 			ohci->ed_bulktail = ed->ed_prev;
 			if (ohci->ed_bulktail)
@@ -331,6 +367,12 @@
 		periodic_unlink (ohci, ed);
 		break;
 	}
+	ed->state = ED_DESCHEDULED;
+	/*
+	 * Ensure HCD sees these updates (in particular update of
+	 * hwINFO) before any following updates.
+	 */
+	wmb ();
 }
 
 
@@ -387,7 +429,7 @@
 	/* NOTE: only ep0 currently needs this "re"init logic, during
 	 * enumeration (after set_address, or if ep0 maxpacket >8).
 	 */
-  	if (ed->state == ED_IDLE) {
+  	if (ed->state == ED_IDLE || ed->state == ED_DESCHEDULED) {
 		u32	info;
 
 		info = usb_pipedevice (pipe);
@@ -429,15 +471,35 @@
  */
 static void start_ed_unlink (struct ohci_hcd *ohci, struct ed *ed)
 {    
+	u32 mask = 0;
+
+	/* deschedule ED as far as the HCD is concerned: */
+	ed_deschedule (ohci, ed);
+
+	/* now initiate the unlink sequence to ensure the HC isn't using the ED anymore: */
+	if (ed->type == PIPE_CONTROL)
+		mask = OHCI_CTRL_CLE;
+	else if (ed->type == PIPE_BULK)
+		mask = OHCI_CTRL_BLE;
+	if (mask) {
+		/*
+		 * Disable scanning of the control or bulk list; this
+		 * ensures that when we get an interrupt with a frame
+		 * # greater than the current frame #, the HC isn't
+		 * using the list anymore.
+		 */
+		ohci->hc_control &= ~mask;
+		writel (ohci->hc_control, &ohci->regs->control);
+	}
 	ed->hwINFO |= ED_DEQUEUE;
 	ed->state = ED_UNLINK;
-	ed_deschedule (ohci, ed);
 
 	/* SF interrupt might get delayed; record the frame counter value that
 	 * indicates when the HC isn't looking at it, so concurrent unlinks
 	 * behave.  frame_no wraps every 2^16 msec, and changes right before
 	 * SF is triggered.
 	 */
+	rmb();	/* ensure ed_deschedule() work is done before we read the frame # */
 	ed->tick = OHCI_FRAME_NO(ohci->hcca) + 1;
 
 	/* rm_list is just singly linked, for simplicity */
@@ -452,6 +514,7 @@
 		// flush those pci writes
 		(void) readl (&ohci->regs->control);
 	}
+	/* ??? What if HCD isn't running?  Shouldn't that be handled as well?  */
 }
 
 /*-------------------------------------------------------------------------*
@@ -614,6 +677,7 @@
 		info = TD_CC | TD_DP_SETUP | TD_T_DATA0;
 		td_fill (ohci, info, urb->setup_dma, 8, urb, cnt++);
 		if (data_len > 0) {
+			BUG_ON(data_len >= 4096);
 			info = TD_CC | TD_R | TD_T_DATA1;
 			info |= is_out ? TD_DP_OUT : TD_DP_IN;
 			/* NOTE:  mishandles transfers >8K, some >4K */
@@ -794,8 +858,6 @@
 		next->next_dl_td = rev;	
 		rev = next;
 
-		if (ed->hwTailP == cpu_to_le32 (next->td_dma))
-			ed->hwTailP = next->hwNextTD;
 		ed->hwHeadP = next->hwNextTD | toggle;
 	}
 
@@ -941,12 +1003,7 @@
 				continue;
 			}
 
-			/* patch pointers hc uses ... tail, if we're removing
-			 * an otherwise active td, and whatever td pointer
-			 * points to this td
-			 */
-			if (ed->hwTailP == cpu_to_le32 (td->td_dma))
-				ed->hwTailP = td->hwNextTD;
+			/* patch pointers hc uses  */
 			savebits = *prev & ~cpu_to_le32 (TD_MASK);
 			*prev = td->hwNextTD | savebits;
 
@@ -957,7 +1014,7 @@
 			/* if URB is done, clean up */
 			if (urb_priv->td_cnt == urb_priv->length) {
 				modified = completed = 1;
-				finish_urb (ohci, urb, regs);
+				finish_urb (ohci, urb, regs);	/* this drops ohci->lock! */
 			}
 		}
 		if (completed && !list_empty (&ed->td_list))
@@ -1039,9 +1096,9 @@
   		if (urb_priv->td_cnt == urb_priv->length)
   			finish_urb (ohci, urb, regs);
 
-		/* clean schedule:  unlink EDs that are no longer busy */
+		/* clean schedule:  deschedule EDs that are no longer busy */
 		if (list_empty (&ed->td_list))
-			start_ed_unlink (ohci, ed);
+			ed_deschedule (ohci, ed);
 		/* ... reenabling halted EDs only after fault cleanup */
 		else if ((ed->hwINFO & (ED_SKIP | ED_DEQUEUE)) == ED_SKIP) {
 			td = list_entry (ed->td_list.next, struct td, td_list);
===== drivers/usb/host/ohci.h 1.19 vs edited =====
--- 1.19/drivers/usb/host/ohci.h	Wed Feb 11 03:42:39 2004
+++ edited/drivers/usb/host/ohci.h	Mon Mar  8 23:03:31 2004
@@ -48,6 +48,7 @@
 #define ED_IDLE 	0x00		/* NOT linked to HC */
 #define ED_UNLINK 	0x01		/* being unlinked from hc */
 #define ED_OPER		0x02		/* IS linked to hc */
+#define ED_DESCHEDULED	0x03		/* idle for HCD, but HC may still hold reference to it */
 
 	u8			type; 		/* PIPE_{BULK,...} */
 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-10  6:59                                   ` David Mosberger
@ 2004-03-10  7:52                                     ` Wouter Lueks
  2004-03-10 16:49                                       ` David Mosberger
  2004-03-10 16:22                                     ` David Brownell
  2004-03-10 19:29                                     ` Colin Leroy
  2 siblings, 1 reply; 45+ messages in thread
From: Wouter Lueks @ 2004-03-10  7:52 UTC (permalink / raw)
  To: davidm; +Cc: linux-kernel

On Tue, Mar 09, 2004 at 10:59:33PM -0800, David Mosberger wrote:
> The current OHCI relies on the internals of the dma_pool()
> implementation.  If the implementation changed to, say, modify the
> memory that is free or, heaven forbid, return the memory to the
> kernel, you'd get _extremely_ difficult to track down race conditions.
> Even so, the code isn't race-free, like I explained yesterday:

I'm doing my best to track this thread altough I cannot understand the
technical stuff I belive you are havily tracking down a bug in the
ohci-hcd, but, according to the problems I expierienced, there is a 
similar bug in uhci-hcd.

00:1f.2 USB Controller: Intel Corp. 82801BA/BAM USB (Hub #1) (rev 04) (prog-if 00 [UHCI])
    Subsystem: NEC Corporation: Unknown device 815e
    Flags: bus master, medium devsel, latency 0, IRQ 9
    I/O ports at 1cc0 [size=32]

When there is any further information you need, or any extra testing
that needs to be done, please tell me and I'll be happy to help out. 

Keep up the good work!

Wouter Lueks

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-10  6:59                                   ` David Mosberger
  2004-03-10  7:52                                     ` Wouter Lueks
@ 2004-03-10 16:22                                     ` David Brownell
  2004-03-10 18:04                                       ` David Mosberger
  2004-03-10 19:29                                     ` Colin Leroy
  2 siblings, 1 reply; 45+ messages in thread
From: David Brownell @ 2004-03-10 16:22 UTC (permalink / raw)
  To: davidm
  Cc: Grant Grundler, Greg KH, vojtech, linux-usb-devel, linux-kernel,
	linux-ia64, pochini

I'll send out a revised patch later, thanks!  It's also good
this code got a more careful read.  It seems like some things
are not as obvious as I might like...

That patch will merge those list corruption fixes I sent, the
"else" you verified was needed (ugh!!!), and some of what you
include here.   It won't add new BUG_ON calls (WARN at worst)
or a duplicate ED state (see next).


>   >>   ...  add a new state
>   >> ED_DESCHEDULED, which is treated exactly like ED_IDLE, except
>   >> that in this state, the HC may still be referring to the ED in
>   >> question.  Thus, if
> 
>   David.B> Sounds exactly like ED_UNLINK -- except maybe that it's not
>   David.B> been put onto ed_rm_list (with ED_DEQUEUE set).  Why add
>   David.B> another state?
>   ...
> 
> The current OHCI relies on the internals of the dma_pool()
> implementation.  If the implementation changed to, say, modify the
> memory that is free or, heaven forbid, return the memory to the
> kernel, you'd get _extremely_ difficult to track down race conditions.

The current implementation _does_ poison memory on free, if
slab poisoning is enabled.  (That's why I asked if you were
using it.)

And that's been quite handy for reporting list corruption bugs,
from races or otherwise. But those list corruption bugs hit a
blind spot in that code:  it doesn't check for modify-after-free.
Which is why those bugs were able to hide for so long!


It'd be good if you said _how_ you think it relies on such
internals.  Some of your debug diagnostics wrongly claimed
allocation of "new" EDs when they were just being re-used.
That'd make intentional/safe re-use look like a bug or race.


> Even so, the code isn't race-free, like I explained yesterday:
> 
>  - ed_alloc() clears the ED to 0 via memset()
>  - the order in which memset() clears memory is undefined (various
>    from platform to platform etc)

There's a wmb() before any ED is handed off to the OHCI silicon;
that forces a defined order.  Top of ed_schedule().  First use,
or Nth re-use; no matter.

>  - thus you might get a case where hwTailP is 0 but hwHeadP
>    is non-zero, which would cause the HC to happily start
>    dereferencing the descriptor

If you assume a bug where the ED is freed but still in use,
that's hardly the only thing that'd go wrong!!  You can't
use such a potential bug to prove something else is broken.

- Dave




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-10  7:52                                     ` Wouter Lueks
@ 2004-03-10 16:49                                       ` David Mosberger
  2004-03-10 19:49                                         ` Wouter Lueks
  0 siblings, 1 reply; 45+ messages in thread
From: David Mosberger @ 2004-03-10 16:49 UTC (permalink / raw)
  To: Wouter Lueks; +Cc: davidm, linux-kernel

>>>>> On Wed, 10 Mar 2004 08:52:12 +0100, Wouter Lueks <wouter@telox.net> said:

  Wouter> I'm doing my best to track this thread altough I cannot
  Wouter> understand the technical stuff I belive you are havily
  Wouter> tracking down a bug in the ohci-hcd, but, according to the
  Wouter> problems I expierienced, there is a similar bug in uhci-hcd.

That's possible but it's not necessarily the case.  There is one real
bug in the hid-input.c which would cause a hang on single-CPU machines
when connecting the BTC keyboard.  See this mail (patch included):

  http://marc.theaimsgroup.com/?l=linux-usb-devel&m=107889747301413

There also seems to be a bug in the BTC keyboard which causes the
hid-core to time out when initializing the keyboard.  That problem
isn't fatal, but it's annoying to have to wait for 10+ seconds to wait
for the keyboard to become online.  A workaround for this bug can be
found in this mail:

  http://marc.theaimsgroup.com/?l=linux-usb-devel&m=107889855108618

I'd be interested in hearing if that fixes the problems for your machine.

	--david


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-10 16:22                                     ` David Brownell
@ 2004-03-10 18:04                                       ` David Mosberger
  2004-03-11  2:43                                         ` David Brownell
  0 siblings, 1 reply; 45+ messages in thread
From: David Mosberger @ 2004-03-10 18:04 UTC (permalink / raw)
  To: David Brownell
  Cc: davidm, Grant Grundler, Greg KH, vojtech, linux-usb-devel,
	linux-kernel, linux-ia64, pochini

>>>>> On Wed, 10 Mar 2004 08:22:26 -0800, David Brownell <david-b@pacbell.net> said:

  David.B> It won't add new BUG_ON calls (WARN at worst)

I put them there mostly as assertions.  What I'd really want there is
a DEBUG_BUG_ON, which is more like assert() in user-land (i.e.,
production code would drop the checks).  WARN_ON() would be fine, too.

  >> The current OHCI relies on the internals of the dma_pool()
  >> implementation.  If the implementation changed to, say, modify
  >> the memory that is free or, heaven forbid, return the memory to
  >> the kernel, you'd get _extremely_ difficult to track down race
  >> conditions.

  David.B> It'd be good if you said _how_ you think it relies on such
  David.B> internals.

I thought I did.  Suppose somebody changed the dma_pool code such that
it would overwrite freed memory with an 0xf00000000000000 pattern.  If
the HC can still hold a reference to a freed ED (it can without my
patch), the HC could see this kind of ED:

 hw=(info=00000000 tailp=f0000000 headp=00000000 nextED=f0000000)

If so, the HC would go ahead and try to interpret the memory at
address 0 as a transfer descriptor.  Depending on the memory contents,
this could cause silent data corruption at an arbitrary address.

  >> - thus you might get a case where hwTailP is 0 but hwHeadP is
  >> non-zero, which would cause the HC to happily start dereferencing
  >> the descriptor

  David.B> If you assume a bug where the ED is freed but still in use,
  David.B> that's hardly the only thing that'd go wrong!!  You can't
  David.B> use such a potential bug to prove something else is broken.

You lost me here.  All I'm saying is that the current code has a
dangerous race that can corrupt memory, crash machines, or have all
sorts of other wild side-effects.  I never claimed this bug had
anything todo with the BTC keyboard problem.

	--david

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: serious 2.6 bug in USB subsystem?
  2004-03-10  6:59                                   ` David Mosberger
  2004-03-10  7:52                                     ` Wouter Lueks
  2004-03-10 16:22                                     ` David Brownell
@ 2004-03-10 19:29                                     ` Colin Leroy
  2 siblings, 0 replies; 45+ messages in thread
From: Colin Leroy @ 2004-03-10 19:29 UTC (permalink / raw)
  To: linux-kernel; +Cc: David Mosberger, linux-usb-devel

Hi,

Your patch at 
http://marc.theaimsgroup.com/?l=linux-usb-devel&m=107890218325615&w=2

does fix the "OHCI unrecoverable error" and "ep0in control timeout" and the 
other errors I kept getting with my ACM modem (motorola GPRS phone).

Thanks !

(PS I also applied this one:
http://marc.theaimsgroup.com/?l=linux-usb-devel&m=107890407025557&w=2 )

-- 
Colin

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-10 16:49                                       ` David Mosberger
@ 2004-03-10 19:49                                         ` Wouter Lueks
  0 siblings, 0 replies; 45+ messages in thread
From: Wouter Lueks @ 2004-03-10 19:49 UTC (permalink / raw)
  To: davidm; +Cc: linux-kernel

On Wed, Mar 10, 2004 at 08:49:11AM -0800, David Mosberger wrote:
> >>>>> On Wed, 10 Mar 2004 08:52:12 +0100, Wouter Lueks <wouter@telox.net> said:
>   http://marc.theaimsgroup.com/?l=linux-usb-devel&m=107889747301413
> 
>   http://marc.theaimsgroup.com/?l=linux-usb-devel&m=107889855108618
> 
> I'd be interested in hearing if that fixes the problems for your machine.

I applied both patches and it workes fine now, the keyboard gets
recognised withouth any further problems.

Wouter Lueks

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-10 18:04                                       ` David Mosberger
@ 2004-03-11  2:43                                         ` David Brownell
  2004-03-11  5:35                                           ` David Mosberger
  0 siblings, 1 reply; 45+ messages in thread
From: David Brownell @ 2004-03-11  2:43 UTC (permalink / raw)
  To: davidm
  Cc: Grant Grundler, Greg KH, vojtech, linux-usb-devel, linux-kernel,
	linux-ia64, pochini

David Mosberger wrote:

>   >> The current OHCI relies on the internals of the dma_pool()
>   >> implementation.  ...
>   David.B> It'd be good if you said _how_ you think it relies on such
>   David.B> internals.
> 
> I thought I did.  Suppose somebody changed the dma_pool code such that
> it would overwrite freed memory with an 0xf00000000000000 pattern. 

Erm, _anything_ the dma_pool code does with freed memory is legal.
Even the old "monkeys flying out of the back of the server" trick!  :)


Anyway, please (a) see if 2.6.4 works for you, and (b) direct any
future followups on this thread _only_ to linux-usb-devel.

- Dave



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [linux-usb-devel] Re: serious 2.6 bug in USB subsystem?
  2004-03-11  2:43                                         ` David Brownell
@ 2004-03-11  5:35                                           ` David Mosberger
  0 siblings, 0 replies; 45+ messages in thread
From: David Mosberger @ 2004-03-11  5:35 UTC (permalink / raw)
  To: David Brownell
  Cc: davidm, Grant Grundler, Greg KH, vojtech, linux-usb-devel,
	linux-kernel, linux-ia64, pochini

>>>>> On Wed, 10 Mar 2004 18:43:25 -0800, David Brownell <david-b@pacbell.net> said:

  David.B> David Mosberger wrote:

  >> The current OHCI relies on the internals of the dma_pool()
  >> implementation.  ...

  David.B> It'd be good if you said _how_ you think it relies on such
  David.B> internals.

  >>  I thought I did.  Suppose somebody changed the dma_pool code
  >> such that it would overwrite freed memory with an
  >> 0xf00000000000000 pattern.

  David.B> Erm, _anything_ the dma_pool code does with freed memory is
  David.B> legal.

Glad to see we agree on that!

  David.B> Anyway, please (a) see if 2.6.4 works for you, and (b)
  David.B> direct any future followups on this thread _only_ to
  David.B> linux-usb-devel.

The patch that's in 2.6.4 definitely looks like a step in the right
direction.  I still have some concerns.  I'll follow up on
linux-usb-devel with more details.

Thanks,

	--david

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2004-03-11  5:35 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-10-27 22:35 serious 2.6 bug in USB subsystem? David Mosberger
2003-10-28  1:30 ` Greg KH
2003-10-28  3:00   ` David Mosberger
2003-10-30 15:11     ` [linux-usb-devel] " David Brownell
2003-10-30 20:15       ` David Mosberger
     [not found]         ` <16289.55171.278494.17172@napali.hpl.hp.com>
2003-10-31 16:23           ` David Brownell
2003-10-31 18:34             ` David Mosberger
2003-10-31 18:50               ` Valdis.Kletnieks
2003-10-31 19:28               ` David Brownell
2003-10-31 19:50                 ` David Mosberger
2003-10-31 20:06                   ` David S. Miller
2004-03-06  2:08             ` David Mosberger
2004-03-06  2:13               ` David Mosberger
2004-03-06  4:55               ` David Brownell
2004-03-06  5:49                 ` David Mosberger
2004-03-06  7:21                   ` David Mosberger
2004-03-06  8:39                     ` David Mosberger
2004-03-06 16:37                   ` David Brownell
2004-03-08  6:18                     ` Grant Grundler
2004-03-08 18:58                       ` David Mosberger
2004-03-08 21:48                         ` David Brownell
2004-03-09  9:15                           ` David Mosberger
2004-03-09 17:36                             ` David Brownell
2004-03-09 17:58                               ` David Mosberger
2004-03-09 20:39                                 ` David Brownell
2004-03-09 23:32                                   ` David Mosberger
2004-03-10  2:53                                     ` David Brownell
2004-03-10  6:11                                       ` David Mosberger
2004-03-10  6:59                                   ` David Mosberger
2004-03-10  7:52                                     ` Wouter Lueks
2004-03-10 16:49                                       ` David Mosberger
2004-03-10 19:49                                         ` Wouter Lueks
2004-03-10 16:22                                     ` David Brownell
2004-03-10 18:04                                       ` David Mosberger
2004-03-11  2:43                                         ` David Brownell
2004-03-11  5:35                                           ` David Mosberger
2004-03-10 19:29                                     ` Colin Leroy
2004-03-06  9:17                 ` [linux-usb-devel] " David Mosberger
2004-03-06 17:30                   ` David Brownell
2004-03-07 13:48                     ` Matthias Andree
2004-03-08 18:49                     ` David Mosberger
2003-11-03  3:46         ` David Brownell
2003-11-03 21:25           ` David Mosberger
2004-03-03 12:33 ` Wouter Lueks
2004-03-03 15:30   ` Wouter Lueks

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).