* Possible regression between 4.9 and 4.13
@ 2017-08-22 17:34 ` Mason
  0 siblings, 0 replies; 60+ messages in thread
From: Mason @ 2017-08-22 17:34 UTC
  To: linux-pci, linux-usb, Linux ARM
  Cc: Bjorn Helgaas, Alan Stern, Greg Kroah-Hartman, Mathias Nyman

Hello,

The driver for my system's PCIe host bridge landed recently
(in 4.13), but it was developed on 4.9.

I tested the PCIe host bridge by plugging a 4-port USB3 adapter
into the PCIe slot (system at rest) and plugging a USB3 Flash
drive into the USB3 adapter (at run-time).

On 4.9, the setup works (almost perfectly, see below).
On 4.13, once I unplug the Flash drive, the controller port
remains unresponsive.


On 4.9, I said *almost* perfectly, because the pcieport driver
does report a few non-fatal errors when I unplug:

[  193.838504] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
[  193.878081] usb-storage 2-2:1.0: USB Mass Storage device detected
[  193.884547] scsi host0: usb-storage 2-2:1.0
[  194.907936] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
[  194.920296] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
[  194.928666] sd 0:0:0:0: [sda] Write Protect is off
[  194.933755] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[  194.946074]  sda: sda1
[  194.953608] sd 0:0:0:0: [sda] Attached SCSI removable disk

[  208.930260] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[  208.938342] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[  208.950163] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[  208.958577] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[  208.965432] pcieport 0000:00:00.0: AER: Device recovery failed
[  209.663733] xhci_hcd 0000:01:00.0: Cannot set link state.
[  209.669194] usb usb2-port2: cannot disable (err = -32)
[  209.674376] usb 2-2: USB disconnect, device number 2
[  209.680481] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[  209.688689] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[  209.700555] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[  209.708978] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[  209.715845] pcieport 0000:00:00.0: AER: Device recovery failed
[  209.721722] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[  209.729785] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[  209.741602] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[  209.750027] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[  209.756866] pcieport 0000:00:00.0: AER: Device recovery failed
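
For readers decoding the log above: the status word 00004000 has only
bit 14 set, and bit 14 of the AER Uncorrectable Error Status register
is Completion Timeout, which matches the "[14] Completion Timeout" line.
A minimal userspace sketch of that decoding follows; the macro and helper
names are made up for illustration and are not the kernel's own.

```c
#include <assert.h>
#include <stdint.h>

/* Bit position of Completion Timeout in the AER Uncorrectable Error
 * Status register (illustrative name, not a kernel macro). */
#define AER_UNC_COMP_TIMEOUT_BIT 14

/* Hypothetical helper: return 1 if the given bit is set in status. */
static int bit_is_set(uint32_t status, unsigned int bit)
{
	assert(bit < 32);
	return (int)((status >> bit) & 1u);
}

/* Hypothetical helper: index of the lowest set bit, or -1 if status is 0. */
static int lowest_set_bit(uint32_t status)
{
	for (unsigned int bit = 0; bit < 32; bit++) {
		if (bit_is_set(status, bit))
			return (int)bit;
	}
	return -1;
}
```

Feeding 0x00004000 (the "error status" value from the log) to
lowest_set_bit() gives 14, and the mask value 00000000 means none of the
uncorrectable error bits are masked.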

After that, I can still plug the drive into the same port.

But on 4.13, I get

[   27.330378] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
[   27.369383] usb-storage 2-2:1.0: USB Mass Storage device detected
[   27.375840] scsi host0: usb-storage 2-2:1.0
[   28.403035] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
[   28.413326] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
[   28.423653] sd 0:0:0:0: [sda] Write Protect is off
[   28.429139] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   28.441529]  sda: sda1
[   28.449431] sd 0:0:0:0: [sda] Attached SCSI removable disk

[   90.592134] xhci_hcd 0000:01:00.0: xHCI host controller not responding, assume dead
[   90.599857] xhci_hcd 0000:01:00.0: HC died; cleaning up
[   90.605336] usb 2-2: USB disconnect, device number 2
[   90.630414] udevd[955]: inotify_add_watch(6, /dev/sda, 10) failed: No such file or directory

Trying to replug into the same port = nothing happens
(Linux did say "assume dead")

Any idea what could have changed between 4.9 and 4.13?

Regards.

* Re: Possible regression between 4.9 and 4.13
  2017-08-22 17:34 ` Mason
@ 2017-08-23  6:07   ` Felipe Balbi
  -1 siblings, 0 replies; 60+ messages in thread
From: Felipe Balbi @ 2017-08-23  6:07 UTC
  To: Mason, linux-pci, linux-usb, Linux ARM
  Cc: Bjorn Helgaas, Alan Stern, Greg Kroah-Hartman, Mathias Nyman


Hi,

Mason <slash.tmp@free.fr> writes:
> Hello,
>
> The driver for my system's PCIe host bridge landed recently
> (in 4.13) but it was developed on 4.9
>
> I tested the PCIe host bridge by plugging a 4-port USB3 adapter
> into the PCIe slot (system at rest) and plugging an USB3 Flash
> drive into the USB3 adapter (at run-time).
>
> On 4.9, the setup works (almost perfectly, see below).
> On 4.13, once I unplug the Flash drive, the controller port
> remains unresponsive.
>
>
> On 4.9, I said *almost* perfectly, because the pcieport driver
> does report a few non-fatal errors when I unplug:
>
> [  193.838504] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
> [  193.878081] usb-storage 2-2:1.0: USB Mass Storage device detected
> [  193.884547] scsi host0: usb-storage 2-2:1.0
> [  194.907936] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
> [  194.920296] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
> [  194.928666] sd 0:0:0:0: [sda] Write Protect is off
> [  194.933755] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
> [  194.946074]  sda: sda1
> [  194.953608] sd 0:0:0:0: [sda] Attached SCSI removable disk
>
> [  208.930260] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
> [  208.938342] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
> [  208.950163] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
> [  208.958577] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
> [  208.965432] pcieport 0000:00:00.0: AER: Device recovery failed
> [  209.663733] xhci_hcd 0000:01:00.0: Cannot set link state.
> [  209.669194] usb usb2-port2: cannot disable (err = -32)
> [  209.674376] usb 2-2: USB disconnect, device number 2
> [  209.680481] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
> [  209.688689] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
> [  209.700555] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
> [  209.708978] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
> [  209.715845] pcieport 0000:00:00.0: AER: Device recovery failed
> [  209.721722] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
> [  209.729785] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
> [  209.741602] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
> [  209.750027] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
> [  209.756866] pcieport 0000:00:00.0: AER: Device recovery failed
>
> After that, I can still plug the drive into the same port.
>
> But on 4.13, I get
>
> [   27.330378] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
> [   27.369383] usb-storage 2-2:1.0: USB Mass Storage device detected
> [   27.375840] scsi host0: usb-storage 2-2:1.0
> [   28.403035] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
> [   28.413326] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
> [   28.423653] sd 0:0:0:0: [sda] Write Protect is off
> [   28.429139] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
> [   28.441529]  sda: sda1
> [   28.449431] sd 0:0:0:0: [sda] Attached SCSI removable disk
>
> [   90.592134] xhci_hcd 0000:01:00.0: xHCI host controller not responding, assume dead
> [   90.599857] xhci_hcd 0000:01:00.0: HC died; cleaning up
> [   90.605336] usb 2-2: USB disconnect, device number 2
> [   90.630414] udevd[955]: inotify_add_watch(6, /dev/sda, 10) failed: No such file or directory
>
> Trying to replug into the same port = nothing happens
> (Linux did say "assume dead")
>
> Any idea what could have changed between 4.9 and 4.13 ?
>

Quite a bit:

$ git rev-list --no-merges  --count v4.13-rc6 ^v4.9 -- drivers/usb/host/xhci drivers/usb/core/
58

Any chance you can bisect to figure out the offending commit?

-- 
balbi

* Re: Possible regression between 4.9 and 4.13
  2017-08-23  6:07   ` Felipe Balbi
@ 2017-08-23  7:51     ` Mathias Nyman
  -1 siblings, 0 replies; 60+ messages in thread
From: Mathias Nyman @ 2017-08-23  7:51 UTC
  To: Felipe Balbi, Mason, linux-pci, linux-usb, Linux ARM
  Cc: Bjorn Helgaas, Alan Stern, Greg Kroah-Hartman

On 23.08.2017 09:07, Felipe Balbi wrote:
>
> Hi,
>
> Mason <slash.tmp@free.fr> writes:
>> Hello,
>>
>> The driver for my system's PCIe host bridge landed recently
>> (in 4.13) but it was developed on 4.9
>>
>> I tested the PCIe host bridge by plugging a 4-port USB3 adapter
>> into the PCIe slot (system at rest) and plugging an USB3 Flash
>> drive into the USB3 adapter (at run-time).
>>
>> On 4.9, the setup works (almost perfectly, see below).
>> On 4.13, once I unplug the Flash drive, the controller port
>> remains unresponsive.
>>
>>
>> On 4.9, I said *almost* perfectly, because the pcieport driver
>> does report a few non-fatal errors when I unplug:
>>
>> [  193.838504] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
>> [  193.878081] usb-storage 2-2:1.0: USB Mass Storage device detected
>> [  193.884547] scsi host0: usb-storage 2-2:1.0
>> [  194.907936] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
>> [  194.920296] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
>> [  194.928666] sd 0:0:0:0: [sda] Write Protect is off
>> [  194.933755] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
>> [  194.946074]  sda: sda1
>> [  194.953608] sd 0:0:0:0: [sda] Attached SCSI removable disk
>>
>> [  208.930260] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
>> [  208.938342] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
>> [  208.950163] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
>> [  208.958577] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
>> [  208.965432] pcieport 0000:00:00.0: AER: Device recovery failed
>> [  209.663733] xhci_hcd 0000:01:00.0: Cannot set link state.
>> [  209.669194] usb usb2-port2: cannot disable (err = -32)
>> [  209.674376] usb 2-2: USB disconnect, device number 2
>> [  209.680481] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
>> [  209.688689] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
>> [  209.700555] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
>> [  209.708978] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
>> [  209.715845] pcieport 0000:00:00.0: AER: Device recovery failed
>> [  209.721722] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
>> [  209.729785] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
>> [  209.741602] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
>> [  209.750027] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
>> [  209.756866] pcieport 0000:00:00.0: AER: Device recovery failed
>>
>> After that, I can still plug the drive into the same port.
>>
>> But on 4.13, I get
>>
>> [   27.330378] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
>> [   27.369383] usb-storage 2-2:1.0: USB Mass Storage device detected
>> [   27.375840] scsi host0: usb-storage 2-2:1.0
>> [   28.403035] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
>> [   28.413326] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
>> [   28.423653] sd 0:0:0:0: [sda] Write Protect is off
>> [   28.429139] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
>> [   28.441529]  sda: sda1
>> [   28.449431] sd 0:0:0:0: [sda] Attached SCSI removable disk
>>
>> [   90.592134] xhci_hcd 0000:01:00.0: xHCI host controller not responding, assume dead
>> [   90.599857] xhci_hcd 0000:01:00.0: HC died; cleaning up
>> [   90.605336] usb 2-2: USB disconnect, device number 2
>> [   90.630414] udevd[955]: inotify_add_watch(6, /dev/sda, 10) failed: No such file or directory
>>
>> Trying to replug into the same port = nothing happens
>> (Linux did say "assume dead")
>>
>> Any idea what could have changed between 4.9 and 4.13 ?
>>
>
> Quite a bit:
>
> $ git rev-list --no-merges  --count v4.13-rc6 ^v4.9 -- drivers/usb/host/xhci drivers/usb/core/
> 58
>

A very likely cause is the more aggressive detection of PCI-removed xHCI hosts.

See commit d9f11ba9f107aa335091ab8d7ba5eea714e46e8b
     xhci: Rework how we handle unresponsive or hoptlug removed hosts

It checks whether an xHCI register read returns 0xffffffff and, if so,
assumes the xHCI host died.
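
As a rough userspace illustration of that kind of check (a sketch only,
not the code from the commit; ALL_ONES and both helper names are made up
here), treating an all-ones register value as "host gone" looks like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Value that a read from an absent PCI device typically returns. */
#define ALL_ONES 0xffffffffu

/* Hypothetical helper (not the kernel's function): does this register
 * value look like it came from a host that is no longer there? */
static bool value_means_host_gone(uint32_t value)
{
	return value == ALL_ONES;
}

/* Hypothetical helper: scan a series of simulated register reads and
 * return the index of the first one that would be treated as "host
 * died", or -1 if none would. */
static int first_dead_read(const uint32_t *reads, size_t count)
{
	assert(reads != NULL);
	for (size_t i = 0; i < count; i++) {
		if (value_means_host_gone(reads[i]))
			return (int)i;
	}
	return -1;
}
```

Note that a check of this shape cannot tell a removed device from any
other reason a read comes back as all ones, which is why it helps to find
out which caller triggers it.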

Could you add something like the patch below to check what is killing the host?
Alternatively, add a BUG()/WARN() in xhci_hc_died() to get a backtrace of who called it.

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 51cd4b8..ade2ad6 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -922,7 +922,8 @@ void xhci_hc_died(struct xhci_hcd *xhci)
         if (xhci->xhc_state & XHCI_STATE_DYING)
                 return;
  
-       xhci_err(xhci, "xHCI host controller not responding, assume dead\n");
+       xhci_err(xhci, "xHC not responding in %pf, assume controller is dead\n",
+                __builtin_return_address(0));
         xhci->xhc_state |= XHCI_STATE_DYING;
  
         xhci_cleanup_command_queue(xhci);


Thanks
Mathias

* Re: Possible regression between 4.9 and 4.13
  2017-08-23  7:51     ` Mathias Nyman
@ 2017-08-23  9:18       ` Mason
  -1 siblings, 0 replies; 60+ messages in thread
From: Mason @ 2017-08-23  9:18 UTC
  To: Mathias Nyman, Felipe Balbi, linux-pci, linux-usb, Linux ARM
  Cc: Bjorn Helgaas, Alan Stern, Greg Kroah-Hartman

On 23/08/2017 09:51, Mathias Nyman wrote:

> On 23.08.2017 09:07, Felipe Balbi wrote:
>
>> Mason writes:
>>
>>> Any idea what could have changed between 4.9 and 4.13 ?
>>
>> Quite a bit:
>>
>> $ git rev-list --no-merges  --count v4.13-rc6 ^v4.9 -- drivers/usb/host/xhci drivers/usb/core/
>> 58
> 
> very likely cause is the more aggressive detection of pci removed xhci hosts
> 
> See commit d9f11ba9f107aa335091ab8d7ba5eea714e46e8b
>      xhci: Rework how we handle unresponsive or hoptlug removed hosts
> 
> It checks if a xhci register reads returns 0xffffffff and assumes xhci
> died in that case.
> 
> Could you add something like the below to check which what is killing the host?
> Or a BUG()/WARN() in xhci_hc_died() to get a backtrace of who called it.
> 
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> index 51cd4b8..ade2ad6 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -922,7 +922,8 @@ void xhci_hc_died(struct xhci_hcd *xhci)
>          if (xhci->xhc_state & XHCI_STATE_DYING)
>                  return;
>   
> -       xhci_err(xhci, "xHCI host controller not responding, assume dead\n");
> +       xhci_err(xhci, "xHC not responding in %pf, assume controller is dead\n",
> +                __builtin_return_address(0));
>          xhci->xhc_state |= XHCI_STATE_DYING;
>   
>          xhci_cleanup_command_queue(xhci);

I'll try some coarse bisection to narrow it down.

$ git describe --contains d9f11ba9f107aa335091ab8d7ba5eea714e46e8b
v4.12-rc1~97^2~39

I'll check 4.11 first.

I wanted to mention that the xHCI setup prints slightly different
things on 4.9 and 4.13 (at the beginning).

On 4.9
[    1.240322] xhci_hcd 0000:01:00.0: xHCI Host Controller
[    1.245617] xhci_hcd 0000:01:00.0: new USB bus registered, assigned bus number 1
[    1.258691] xhci_hcd 0000:01:00.0: hcc params 0x014051cf hci version 0x100 quirks 0x00000010
[    1.268090] hub 1-0:1.0: USB hub found
[    1.271905] hub 1-0:1.0: 4 ports detected
[    1.276372] xhci_hcd 0000:01:00.0: xHCI Host Controller
[    1.281645] xhci_hcd 0000:01:00.0: new USB bus registered, assigned bus number 2
[    1.289173] usb usb2: We don't know the algorithms for LPM for this host, disabling LPM.
[    1.297775] hub 2-0:1.0: USB hub found
[    1.301577] hub 2-0:1.0: 4 ports detected
[    1.306194] usbcore: registered new interface driver usb-storage

On 4.13
[    1.222471] pcieport 0000:00:00.0: of_irq_parse_pci: failed with rc=-22
[    1.229156] xhci_hcd 0000:01:00.0: Resetting
[    2.268836] xhci_hcd 0000:01:00.0: xHCI Host Controller
[    2.274126] xhci_hcd 0000:01:00.0: new USB bus registered, assigned bus number 1
[    2.287222] xhci_hcd 0000:01:00.0: hcc params 0x014051cf hci version 0x100 quirks 0x00000010
[    2.296653] hub 1-0:1.0: USB hub found
[    2.300478] hub 1-0:1.0: 4 ports detected
[    2.304962] xhci_hcd 0000:01:00.0: xHCI Host Controller
[    2.310246] xhci_hcd 0000:01:00.0: new USB bus registered, assigned bus number 2
[    2.317776] usb usb2: We don't know the algorithms for LPM for this host, disabling LPM.
[    2.326419] hub 2-0:1.0: USB hub found
[    2.330229] hub 2-0:1.0: 4 ports detected
[    2.334869] usbcore: registered new interface driver usb-storage

FWIW, "of_irq_parse_pci: failed with rc=-22"
seems to come from:

[    1.257411] [<c03d80c8>] (of_irq_parse_pci) from [<c03d8270>] (of_irq_parse_and_map_pci+0x10/0x2c)
[    1.266420] [<c03d8270>] (of_irq_parse_and_map_pci) from [<c03100a8>] (pci_assign_irq+0x78/0xb0)
[    1.275254] [<c03100a8>] (pci_assign_irq) from [<c030a1c8>] (pci_device_probe+0x18/0x128)
[    1.283476] [<c030a1c8>] (pci_device_probe) from [<c0357864>] (driver_probe_device+0x244/0x2c8)

At first I thought the error logging was added by f1aa54840657f,
but that commit just turned one specific error into a warning.
I need to dig a bit more.

Regards.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Possible regression between 4.9 and 4.13
  2017-08-23  7:51     ` Mathias Nyman
@ 2017-08-23  9:31       ` Mason
  -1 siblings, 0 replies; 60+ messages in thread
From: Mason @ 2017-08-23  9:31 UTC (permalink / raw)
  To: Mathias Nyman, Felipe Balbi, linux-pci, linux-usb, Linux ARM
  Cc: Bjorn Helgaas, Alan Stern, Greg Kroah-Hartman

On 23/08/2017 09:51, Mathias Nyman wrote:

> A very likely cause is the more aggressive detection of PCI-removed xhci hosts
> 
> See commit d9f11ba9f107aa335091ab8d7ba5eea714e46e8b
>      xhci: Rework how we handle unresponsive or hoptlug removed hosts
> 
> It checks whether an xhci register read returns 0xffffffff and assumes xhci
> died in that case.
> 
> Could you add something like the below to check what is killing the host?
> Or a BUG()/WARN() in xhci_hc_died() to get a backtrace of who called it.

[   46.525247] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
[   46.565496] usb-storage 2-2:1.0: USB Mass Storage device detected
[   46.571934] scsi host0: usb-storage 2-2:1.0
[   47.601227] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
[   47.611340] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
[   47.621624] sd 0:0:0:0: [sda] Write Protect is off
[   47.627131] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   47.639637]  sda: sda1
[   47.648091] sd 0:0:0:0: [sda] Attached SCSI removable disk
[   58.100306] xhci_hcd 0000:01:00.0: xHCI host controller not responding, assume dead
[   58.108021] CPU: 0 PID: 939 Comm: kworker/0:2 Tainted: G         C      4.13.0-rc6 #11
[   58.115976] Hardware name: Sigma Tango DT
[   58.120016] Workqueue: usb_hub_wq hub_event
[   58.124241] [<c010f288>] (unwind_backtrace) from [<c010af58>] (show_stack+0x10/0x14)
[   58.132033] [<c010af58>] (show_stack) from [<c049d714>] (dump_stack+0x84/0x98)
[   58.139302] [<c049d714>] (dump_stack) from [<c03b090c>] (xhci_hc_died.part.9+0x50/0x23c)
[   58.147438] [<c03b090c>] (xhci_hc_died.part.9) from [<c03b5d80>] (xhci_hub_control+0xf3c/0x175c)
[   58.156273] [<c03b5d80>] (xhci_hub_control) from [<c03934a4>] (usb_hcd_submit_urb+0x264/0x814)
[   58.164932] [<c03934a4>] (usb_hcd_submit_urb) from [<c0394fa4>] (usb_start_wait_urb+0x4c/0xbc)
[   58.173591] [<c0394fa4>] (usb_start_wait_urb) from [<c03950b4>] (usb_control_msg+0xa0/0xcc)
[   58.181985] [<c03950b4>] (usb_control_msg) from [<c038bf54>] (usb_clear_port_feature+0x44/0x4c)
[   58.190730] [<c038bf54>] (usb_clear_port_feature) from [<c038c320>] (hub_port_reset+0x228/0x51c)
[   58.199561] [<c038c320>] (hub_port_reset) from [<c038fd68>] (hub_event+0x87c/0x108c)
[   58.207349] [<c038fd68>] (hub_event) from [<c012ecc4>] (process_one_work+0x1d8/0x3f0)
[   58.215220] [<c012ecc4>] (process_one_work) from [<c012f8d8>] (worker_thread+0x38/0x554)
[   58.223354] [<c012f8d8>] (worker_thread) from [<c01347d0>] (kthread+0x108/0x138)
[   58.230789] [<c01347d0>] (kthread) from [<c01076d8>] (ret_from_fork+0x14/0x3c)
[   58.238056] xhci_hcd 0000:01:00.0: HC died; cleaning up
[   58.243391] usb 2-2: USB disconnect, device number 2

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Possible regression between 4.9 and 4.13
  2017-08-23  7:51     ` Mathias Nyman
@ 2017-08-23 10:19       ` Mason
  -1 siblings, 0 replies; 60+ messages in thread
From: Mason @ 2017-08-23 10:19 UTC (permalink / raw)
  To: Mathias Nyman, Felipe Balbi, linux-pci, linux-usb, Linux ARM
  Cc: Bjorn Helgaas, Alan Stern, Greg Kroah-Hartman, Jon Derrick, Keith Busch

On 23/08/2017 09:51, Mathias Nyman wrote:

> A very likely cause is the more aggressive detection of PCI-removed xhci hosts
> 
> See commit d9f11ba9f107aa335091ab8d7ba5eea714e46e8b
>      xhci: Rework how we handle unresponsive or hoptlug removed hosts
> 
> It checks whether an xhci register read returns 0xffffffff and assumes xhci
> died in that case.

I've just tested 4.11.12 + a few local patches to back-port
PCIe host bridge support.

It "works" as well as 4.9
(i.e. modulo the "AER: Uncorrected (Non-Fatal) error received")

[    0.508533] pcie_tango 50000000.pcie: simultaneous PCI config and MMIO accesses may cause data corruption
[    0.519622] OF: PCI: host bridge /soc/pcie@2e000 ranges:
[    0.519645] OF: PCI:   MEM 0x50400000..0x53ffffff -> 0x00400000
[    0.519725] pcie_tango 50000000.pcie: ECAM at [mem 0x50000000-0x503fffff] for [bus 00-03]
[    0.519872] pcie_tango 50000000.pcie: PCI host bridge to bus 0000:00
[    0.519886] pci_bus 0000:00: root bus resource [bus 00-03]
[    0.519898] pci_bus 0000:00: root bus resource [mem 0x50400000-0x53ffffff] (bus address [0x00400000-0x03ffffff])
[    0.520201] PCI: bus0: Fast back to back transfers disabled
[    0.520213] pci 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    0.520922] PCI: bus1: Fast back to back transfers disabled
[    0.520964] pci 0000:00:00.0: of_irq_parse_pci: failed with rc=-22
[    0.520993] pci 0000:00:00.0: BAR 8: assigned [mem 0x50400000-0x504fffff]
[    0.521004] pci 0000:01:00.0: BAR 0: assigned [mem 0x50400000-0x50401fff 64bit]
[    0.521025] pci 0000:00:00.0: PCI bridge to [bus 01]
[    0.521033] pci 0000:00:00.0:   bridge window [mem 0x50400000-0x504fffff]
[    0.521085] pcieport 0000:00:00.0: enabling device (0140 -> 0142)
[    0.521282] pcieport 0000:00:00.0: Signaling PME with IRQ 30
[    0.521402] pcieport 0000:00:00.0: AER enabled with IRQ 30
[    0.521526] pci 0000:01:00.0: enabling device (0140 -> 0142)
...
[    1.239706] xhci_hcd 0000:01:00.0: xHCI Host Controller
[    1.244998] xhci_hcd 0000:01:00.0: new USB bus registered, assigned bus number 1
[    1.258048] xhci_hcd 0000:01:00.0: hcc params 0x014051cf hci version 0x100 quirks 0x00000010
[    1.267467] hub 1-0:1.0: USB hub found
[    1.271287] hub 1-0:1.0: 4 ports detected
[    1.275761] xhci_hcd 0000:01:00.0: xHCI Host Controller
[    1.281048] xhci_hcd 0000:01:00.0: new USB bus registered, assigned bus number 2
[    1.288578] usb usb2: We don't know the algorithms for LPM for this host, disabling LPM.
[    1.297234] hub 2-0:1.0: USB hub found
[    1.301042] hub 2-0:1.0: 4 ports detected
[    1.305681] usbcore: registered new interface driver usb-storage


PLUG #1
[   26.104607] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
[   26.143799] usb-storage 2-2:1.0: USB Mass Storage device detected
[   26.150253] scsi host0: usb-storage 2-2:1.0
[   27.177298] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
[   27.187586] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
[   27.199000] sd 0:0:0:0: [sda] Write Protect is off
[   27.204186] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   27.216322]  sda: sda1
[   27.220584] sd 0:0:0:0: [sda] Attached SCSI removable disk
[   27.252046] random: fast init done

UNPLUG #1
[   37.334040] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   37.342135] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   37.353970] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   37.362589] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   37.369485] pcieport 0000:00:00.0: AER: Device recovery failed
[   38.066538] xhci_hcd 0000:01:00.0: Cannot set link state.
[   38.072039] usb usb2-port2: cannot disable (err = -32)
[   38.077348] usb 2-2: USB disconnect, device number 2
[   38.082711] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   38.094279] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   38.108006] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   38.116878] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   38.123954] pcieport 0000:00:00.0: AER: Device recovery failed

PLUG #2
[   55.097922] usb 2-2: new SuperSpeed USB device number 3 using xhci_hcd
[   55.137590] usb-storage 2-2:1.0: USB Mass Storage device detected
[   55.144016] scsi host0: usb-storage 2-2:1.0
[   56.163907] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
[   56.174851] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
[   56.184218] sd 0:0:0:0: [sda] Write Protect is off
[   56.190162] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   56.202117]  sda: sda1
[   56.207112] sd 0:0:0:0: [sda] Attached SCSI removable disk

UNPLUG #2
[   63.228310] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   63.236403] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   63.248220] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   63.256653] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   63.263523] pcieport 0000:00:00.0: AER: Device recovery failed
[   63.959768] xhci_hcd 0000:01:00.0: Cannot set link state.
[   63.965227] usb usb2-port2: cannot disable (err = -32)
[   63.970409] usb 2-2: USB disconnect, device number 3
[   63.975664] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   63.987356] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   64.000021] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   64.008655] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   64.015553] pcieport 0000:00:00.0: AER: Device recovery failed
[   64.021449] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   64.029580] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   64.041410] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   64.049818] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   64.056658] pcieport 0000:00:00.0: AER: Device recovery failed


Bjorn,

What do you make of the AER logs?
What can I do to debug this issue?

Regards.



FWIW, verbose lspci output below.

# lspci -vv
00:00.0 PCI bridge: Sigma Designs, Inc. Device 0024 (rev 01) (prog-if 00 [Normal decode])
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR+ <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin ? routed to IRQ 30
        Region 0: Memory at <ignored> (64-bit, non-prefetchable)
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 00000000-00000fff
        Memory behind bridge: 00400000-004fffff
        Prefetchable memory behind bridge: 00000000-000fffff
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity+ SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [50] MSI: Enable+ Count=1/4 Maskable- 64bit+
                Address: 00000000a002e07c  Data: 0000
        Capabilities: [78] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=3 PME-
        Capabilities: [80] Express (v2) Root Port (Slot-), MSI 03
                DevCap: MaxPayload 256 bytes, PhantFunc 0
                        ExtTag- RBE+
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq- AuxPwr- TransPend+
                LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <2us, L1 <4us
                        ClockPM- Surprise- LLActRep- BwNot+ ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x1, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
                RootCap: CRSVisible-
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Range B, TimeoutDis-, LTR-, OBFF Not Supported ARIFwd-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [800 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 0e, GenCap- CGenEn- ChkCap- ChkEn-
        Kernel driver in use: pcieport

01:00.0 USB controller: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller (rev 03) (prog-if 30 [XHCI])
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 0
        Region 0: Memory at 50400000 (64-bit, non-prefetchable) [size=8K]
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [70] MSI: Enable- Count=1/8 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [90] MSI-X: Enable+ Count=8 Masked-
                Vector table: BAR=0 offset=00001000
                PBA: BAR=0 offset=00001080
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <4us, L1 unlimited
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR+, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Capabilities: [150 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Kernel driver in use: xhci_hcd

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Possible regression between 4.9 and 4.13
@ 2017-08-23 10:19       ` Mason
  0 siblings, 0 replies; 60+ messages in thread
From: Mason @ 2017-08-23 10:19 UTC (permalink / raw)
  To: linux-arm-kernel

On 23/08/2017 09:51, Mathias Nyman wrote:

> very likely cause is the more aggressive detection of pci removed xhci hosts
> 
> See commit d9f11ba9f107aa335091ab8d7ba5eea714e46e8b
>      xhci: Rework how we handle unresponsive or hoptlug removed hosts
> 
> It checks if a xhci register reads returns 0xffffffff and assumes xhci
> died in that case.

I've just tested 4.11.12 + a few local patches to back-port
PCIe host bridge support.

It "works" as well as 4.9
(i.e. modulo the "AER: Uncorrected (Non-Fatal) error received")

[    0.508533] pcie_tango 50000000.pcie: simultaneous PCI config and MMIO accesses may cause data corruption
[    0.519622] OF: PCI: host bridge /soc/pcie at 2e000 ranges:
[    0.519645] OF: PCI:   MEM 0x50400000..0x53ffffff -> 0x00400000
[    0.519725] pcie_tango 50000000.pcie: ECAM at [mem 0x50000000-0x503fffff] for [bus 00-03]
[    0.519872] pcie_tango 50000000.pcie: PCI host bridge to bus 0000:00
[    0.519886] pci_bus 0000:00: root bus resource [bus 00-03]
[    0.519898] pci_bus 0000:00: root bus resource [mem 0x50400000-0x53ffffff] (bus address [0x00400000-0x03ffffff])
[    0.520201] PCI: bus0: Fast back to back transfers disabled
[    0.520213] pci 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    0.520922] PCI: bus1: Fast back to back transfers disabled
[    0.520964] pci 0000:00:00.0: of_irq_parse_pci: failed with rc=-22
[    0.520993] pci 0000:00:00.0: BAR 8: assigned [mem 0x50400000-0x504fffff]
[    0.521004] pci 0000:01:00.0: BAR 0: assigned [mem 0x50400000-0x50401fff 64bit]
[    0.521025] pci 0000:00:00.0: PCI bridge to [bus 01]
[    0.521033] pci 0000:00:00.0:   bridge window [mem 0x50400000-0x504fffff]
[    0.521085] pcieport 0000:00:00.0: enabling device (0140 -> 0142)
[    0.521282] pcieport 0000:00:00.0: Signaling PME with IRQ 30
[    0.521402] pcieport 0000:00:00.0: AER enabled with IRQ 30
[    0.521526] pci 0000:01:00.0: enabling device (0140 -> 0142)
...
[    1.239706] xhci_hcd 0000:01:00.0: xHCI Host Controller
[    1.244998] xhci_hcd 0000:01:00.0: new USB bus registered, assigned bus number 1
[    1.258048] xhci_hcd 0000:01:00.0: hcc params 0x014051cf hci version 0x100 quirks 0x00000010
[    1.267467] hub 1-0:1.0: USB hub found
[    1.271287] hub 1-0:1.0: 4 ports detected
[    1.275761] xhci_hcd 0000:01:00.0: xHCI Host Controller
[    1.281048] xhci_hcd 0000:01:00.0: new USB bus registered, assigned bus number 2
[    1.288578] usb usb2: We don't know the algorithms for LPM for this host, disabling LPM.
[    1.297234] hub 2-0:1.0: USB hub found
[    1.301042] hub 2-0:1.0: 4 ports detected
[    1.305681] usbcore: registered new interface driver usb-storage


PLUG #1
[   26.104607] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
[   26.143799] usb-storage 2-2:1.0: USB Mass Storage device detected
[   26.150253] scsi host0: usb-storage 2-2:1.0
[   27.177298] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
[   27.187586] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
[   27.199000] sd 0:0:0:0: [sda] Write Protect is off
[   27.204186] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   27.216322]  sda: sda1
[   27.220584] sd 0:0:0:0: [sda] Attached SCSI removable disk
[   27.252046] random: fast init done

UNPLUG #1
[   37.334040] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   37.342135] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   37.353970] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   37.362589] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   37.369485] pcieport 0000:00:00.0: AER: Device recovery failed
[   38.066538] xhci_hcd 0000:01:00.0: Cannot set link state.
[   38.072039] usb usb2-port2: cannot disable (err = -32)
[   38.077348] usb 2-2: USB disconnect, device number 2
[   38.082711] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   38.094279] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   38.108006] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   38.116878] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   38.123954] pcieport 0000:00:00.0: AER: Device recovery failed

PLUG #2
[   55.097922] usb 2-2: new SuperSpeed USB device number 3 using xhci_hcd
[   55.137590] usb-storage 2-2:1.0: USB Mass Storage device detected
[   55.144016] scsi host0: usb-storage 2-2:1.0
[   56.163907] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
[   56.174851] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
[   56.184218] sd 0:0:0:0: [sda] Write Protect is off
[   56.190162] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   56.202117]  sda: sda1
[   56.207112] sd 0:0:0:0: [sda] Attached SCSI removable disk

UNPLUG #2
[   63.228310] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   63.236403] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   63.248220] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   63.256653] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   63.263523] pcieport 0000:00:00.0: AER: Device recovery failed
[   63.959768] xhci_hcd 0000:01:00.0: Cannot set link state.
[   63.965227] usb usb2-port2: cannot disable (err = -32)
[   63.970409] usb 2-2: USB disconnect, device number 3
[   63.975664] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   63.987356] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   64.000021] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   64.008655] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   64.015553] pcieport 0000:00:00.0: AER: Device recovery failed
[   64.021449] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   64.029580] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   64.041410] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   64.049818] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   64.056658] pcieport 0000:00:00.0: AER: Device recovery failed
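
For readers decoding the "error status/mask=00004000/00000000" lines above: the status word is a bitmask over the AER Uncorrectable Error Status register, and the only bit set in 0x00004000 is bit 14, which the kernel prints as Completion Timeout. The following is a minimal user-space sketch of that decoding, not the kernel's own code. The helper names aer_uncor_name() and aer_lowest_unmasked_bit() are hypothetical; the bit positions are taken from the PCIe Base Specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Names for bits of the AER Uncorrectable Error Status register.
 * Bit positions follow the PCIe Base Specification; bits not listed
 * here are reserved or simply not covered by this sketch.
 */
static const char *const aer_uncor_names[32] = {
	[4]  = "Data Link Protocol Error",
	[5]  = "Surprise Down Error",
	[12] = "Poisoned TLP",
	[13] = "Flow Control Protocol Error",
	[14] = "Completion Timeout",
	[15] = "Completer Abort",
	[16] = "Unexpected Completion",
	[17] = "Receiver Overflow",
	[18] = "Malformed TLP",
	[19] = "ECRC Error",
	[20] = "Unsupported Request",
	[21] = "ACS Violation",
};

/* Hypothetical helper: name of one status bit, or "unknown". */
const char *aer_uncor_name(unsigned int bit)
{
	assert(bit < 32);
	return aer_uncor_names[bit] ? aer_uncor_names[bit] : "unknown";
}

/* Hypothetical helper: lowest set bit of (status & ~mask), or -1 if none. */
int aer_lowest_unmasked_bit(uint32_t status, uint32_t mask)
{
	uint32_t active = status & ~mask;
	int bit;

	for (bit = 0; bit < 32; bit++)
		if (active & (UINT32_C(1) << bit))
			return bit;
	return -1;
}
```

Calling aer_lowest_unmasked_bit(0x00004000, 0x00000000) yields 14, and aer_uncor_name(14) yields "Completion Timeout". This agrees with the "First Error Pointer: 0e" field in the lspci output below, since 0x0e is 14.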


Bjorn,

What do you make of the AER logs?
What can I do to debug this issue?

Regards.



FWIW, verbose lspci output below.

# lspci -vv
00:00.0 PCI bridge: Sigma Designs, Inc. Device 0024 (rev 01) (prog-if 00 [Normal decode])
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR+ <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin ? routed to IRQ 30
        Region 0: Memory@<ignored> (64-bit, non-prefetchable)
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 00000000-00000fff
        Memory behind bridge: 00400000-004fffff
        Prefetchable memory behind bridge: 00000000-000fffff
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity+ SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [50] MSI: Enable+ Count=1/4 Maskable- 64bit+
                Address: 00000000a002e07c  Data: 0000
        Capabilities: [78] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=3 PME-
        Capabilities: [80] Express (v2) Root Port (Slot-), MSI 03
                DevCap: MaxPayload 256 bytes, PhantFunc 0
                        ExtTag- RBE+
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq- AuxPwr- TransPend+
                LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <2us, L1 <4us
                        ClockPM- Surprise- LLActRep- BwNot+ ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x1, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
                RootCap: CRSVisible-
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Range B, TimeoutDis-, LTR-, OBFF Not Supported ARIFwd-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [800 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 0e, GenCap- CGenEn- ChkCap- ChkEn-
        Kernel driver in use: pcieport

01:00.0 USB controller: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller (rev 03) (prog-if 30 [XHCI])
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 0
        Region 0: Memory@50400000 (64-bit, non-prefetchable) [size=8K]
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [70] MSI: Enable- Count=1/8 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [90] MSI-X: Enable+ Count=8 Masked-
                Vector table: BAR=0 offset=00001000
                PBA: BAR=0 offset=00001080
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <4us, L1 unlimited
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR+, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Capabilities: [150 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Kernel driver in use: xhci_hcd

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Possible regression between 4.9 and 4.13
  2017-08-23  9:31       ` Mason
@ 2017-08-23 11:11         ` Mathias Nyman
  -1 siblings, 0 replies; 60+ messages in thread
From: Mathias Nyman @ 2017-08-23 11:11 UTC (permalink / raw)
  To: Mason, Mathias Nyman, Felipe Balbi, linux-pci, linux-usb, Linux ARM
  Cc: Bjorn Helgaas, Alan Stern, Greg Kroah-Hartman

On 23.08.2017 12:31, Mason wrote:
> On 23/08/2017 09:51, Mathias Nyman wrote:
>
>> very likely cause is the more aggressive detection of pci removed xhci hosts
>>
>> See commit d9f11ba9f107aa335091ab8d7ba5eea714e46e8b
>>       xhci: Rework how we handle unresponsive or hoptlug removed hosts
>>
>> It checks if an xhci register read returns 0xffffffff and assumes xhci
>> died in that case.
>>
>> Could you add something like the below to check what is killing the host?
>> Or a BUG()/WARN() in xhci_hc_died() to get a backtrace of who called it.
>
> [   46.525247] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
> [   46.565496] usb-storage 2-2:1.0: USB Mass Storage device detected
> [   46.571934] scsi host0: usb-storage 2-2:1.0
> [   47.601227] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
> [   47.611340] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
> [   47.621624] sd 0:0:0:0: [sda] Write Protect is off
> [   47.627131] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
> [   47.639637]  sda: sda1
> [   47.648091] sd 0:0:0:0: [sda] Attached SCSI removable disk
> [   58.100306] xhci_hcd 0000:01:00.0: xHCI host controller not responding, assume dead
> [   58.108021] CPU: 0 PID: 939 Comm: kworker/0:2 Tainted: G         C      4.13.0-rc6 #11
> [   58.115976] Hardware name: Sigma Tango DT
> [   58.120016] Workqueue: usb_hub_wq hub_event
> [   58.124241] [<c010f288>] (unwind_backtrace) from [<c010af58>] (show_stack+0x10/0x14)
> [   58.132033] [<c010af58>] (show_stack) from [<c049d714>] (dump_stack+0x84/0x98)
> [   58.139302] [<c049d714>] (dump_stack) from [<c03b090c>] (xhci_hc_died.part.9+0x50/0x23c)
> [   58.147438] [<c03b090c>] (xhci_hc_died.part.9) from [<c03b5d80>] (xhci_hub_control+0xf3c/0x175c)
> [   58.156273] [<c03b5d80>] (xhci_hub_control) from [<c03934a4>] (usb_hcd_submit_urb+0x264/0x814)
> [   58.164932] [<c03934a4>] (usb_hcd_submit_urb) from [<c0394fa4>] (usb_start_wait_urb+0x4c/0xbc)
> [   58.173591] [<c0394fa4>] (usb_start_wait_urb) from [<c03950b4>] (usb_control_msg+0xa0/0xcc)
> [   58.181985] [<c03950b4>] (usb_control_msg) from [<c038bf54>] (usb_clear_port_feature+0x44/0x4c)
> [   58.190730] [<c038bf54>] (usb_clear_port_feature) from [<c038c320>] (hub_port_reset+0x228/0x51c)
> [   58.199561] [<c038c320>] (hub_port_reset) from [<c038fd68>] (hub_event+0x87c/0x108c)
> [   58.207349] [<c038fd68>] (hub_event) from [<c012ecc4>] (process_one_work+0x1d8/0x3f0)
> [   58.215220] [<c012ecc4>] (process_one_work) from [<c012f8d8>] (worker_thread+0x38/0x554)
> [   58.223354] [<c012f8d8>] (worker_thread) from [<c01347d0>] (kthread+0x108/0x138)
> [   58.230789] [<c01347d0>] (kthread) from [<c01076d8>] (ret_from_fork+0x14/0x3c)
> [   58.238056] xhci_hcd 0000:01:00.0: HC died; cleaning up
> [   58.243391] usb 2-2: USB disconnect, device number 2
> --

The xhci driver reads 0xffffffff from an MMIO-mapped xhci portsc register and bails out in
xhci-hub.c:

		temp = readl(port_array[wIndex]);
		if (temp == ~(u32)0) {
			xhci_hc_died(xhci);
			retval = -ENODEV;
			break;
		}

In this case we read the register when the hub thread asks to clear a port feature.

Why portsc returns 0xffffffff is another question. Could the hub thread be running while the xhci controller is in D3?
Was xhci runtime suspended?
There were some pcieport errors in another log you showed. Maybe the PCI devices are not properly recovered
and the registers return 0xffffffff?

-Mathias
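
The check described in the message above can be illustrated outside the kernel. This is only a sketch under stated assumptions: read_reg_checked() is a hypothetical name, a plain pointer dereference stands in for readl(), and the caller is assumed to know that all ones is never a legitimate value for the register being read. A read from a PCIe device that does not answer typically comes back as all ones, which is why the driver treats 0xffffffff as a sign that the host controller is gone.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical user-space stand-in for the driver's check: read a 32-bit
 * register and treat an all-ones value as "device not responding".
 * On success the value is stored in *out and 0 is returned.
 */
int read_reg_checked(const volatile uint32_t *reg, uint32_t *out)
{
	uint32_t val;

	assert(reg != NULL && out != NULL);

	val = *reg;			/* in the kernel this would be readl() */
	if (val == ~(uint32_t)0)
		return -ENODEV;		/* same error code the quoted snippet uses */

	*out = val;
	return 0;
}
```

The sketch leaves *out untouched on failure, so a caller cannot mistake the all-ones pattern for real port status bits.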

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Possible regression between 4.9 and 4.13
  2017-08-23 11:11         ` Mathias Nyman
@ 2017-08-23 11:54           ` Mason
  -1 siblings, 0 replies; 60+ messages in thread
From: Mason @ 2017-08-23 11:54 UTC (permalink / raw)
  To: Mathias Nyman, Felipe Balbi, linux-pci, linux-usb, Linux ARM
  Cc: Bjorn Helgaas, Alan Stern, Greg Kroah-Hartman

On 23/08/2017 13:11, Mathias Nyman wrote:

> On 23.08.2017 12:31, Mason wrote:
> 
>> [   46.525247] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
>> [   46.565496] usb-storage 2-2:1.0: USB Mass Storage device detected
>> [   46.571934] scsi host0: usb-storage 2-2:1.0
>> [   47.601227] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
>> [   47.611340] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
>> [   47.621624] sd 0:0:0:0: [sda] Write Protect is off
>> [   47.627131] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
>> [   47.639637]  sda: sda1
>> [   47.648091] sd 0:0:0:0: [sda] Attached SCSI removable disk
>> [   58.100306] xhci_hcd 0000:01:00.0: xHCI host controller not responding, assume dead
>> [   58.108021] CPU: 0 PID: 939 Comm: kworker/0:2 Tainted: G         C      4.13.0-rc6 #11
>> [   58.115976] Hardware name: Sigma Tango DT
>> [   58.120016] Workqueue: usb_hub_wq hub_event
>> [   58.124241] [<c010f288>] (unwind_backtrace) from [<c010af58>] (show_stack+0x10/0x14)
>> [   58.132033] [<c010af58>] (show_stack) from [<c049d714>] (dump_stack+0x84/0x98)
>> [   58.139302] [<c049d714>] (dump_stack) from [<c03b090c>] (xhci_hc_died.part.9+0x50/0x23c)
>> [   58.147438] [<c03b090c>] (xhci_hc_died.part.9) from [<c03b5d80>] (xhci_hub_control+0xf3c/0x175c)
>> [   58.156273] [<c03b5d80>] (xhci_hub_control) from [<c03934a4>] (usb_hcd_submit_urb+0x264/0x814)
>> [   58.164932] [<c03934a4>] (usb_hcd_submit_urb) from [<c0394fa4>] (usb_start_wait_urb+0x4c/0xbc)
>> [   58.173591] [<c0394fa4>] (usb_start_wait_urb) from [<c03950b4>] (usb_control_msg+0xa0/0xcc)
>> [   58.181985] [<c03950b4>] (usb_control_msg) from [<c038bf54>] (usb_clear_port_feature+0x44/0x4c)
>> [   58.190730] [<c038bf54>] (usb_clear_port_feature) from [<c038c320>] (hub_port_reset+0x228/0x51c)
>> [   58.199561] [<c038c320>] (hub_port_reset) from [<c038fd68>] (hub_event+0x87c/0x108c)
>> [   58.207349] [<c038fd68>] (hub_event) from [<c012ecc4>] (process_one_work+0x1d8/0x3f0)
>> [   58.215220] [<c012ecc4>] (process_one_work) from [<c012f8d8>] (worker_thread+0x38/0x554)
>> [   58.223354] [<c012f8d8>] (worker_thread) from [<c01347d0>] (kthread+0x108/0x138)
>> [   58.230789] [<c01347d0>] (kthread) from [<c01076d8>] (ret_from_fork+0x14/0x3c)
>> [   58.238056] xhci_hcd 0000:01:00.0: HC died; cleaning up
>> [   58.243391] usb 2-2: USB disconnect, device number 2
> 
> The xhci driver reads 0xffffffff from an MMIO-mapped xhci portsc register and bails out in
> xhci-hub.c:
> 		temp = readl(port_array[wIndex]);
> 		if (temp == ~(u32)0) {
> 			xhci_hc_died(xhci);
> 			retval = -ENODEV;
> 			break;
> 		}
> 
> In this case we read the register when the hub thread asks to clear a port feature.
> 
> Why portsc returns 0xffffffff is another question. Could the hub thread be running while the xhci controller is in D3?
> Was xhci runtime suspended?

How do I tell?
Should I disable SUSPEND support and all kinds of power management?
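
One possible way to check, sketched here under assumptions rather than taken from the thread: on kernels built with power management support, each device exposes a power/runtime_status attribute in sysfs, which reads "active", "suspended", "suspending", "resuming", "error" or "unsupported". The helper read_sysfs_line() is a hypothetical name, and the path used in the note after the code assumes the controller sits at 0000:01:00.0 as in the logs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: read the first line of a sysfs attribute into buf
 * and strip the trailing newline. Returns 0 on success, -1 if the file
 * cannot be opened or read.
 */
int read_sysfs_line(const char *path, char *buf, size_t len)
{
	FILE *f;

	assert(path != NULL && buf != NULL && len > 0);

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	buf[strcspn(buf, "\n")] = '\0';	/* drop the newline sysfs appends */
	return 0;
}
```

Calling it with "/sys/bus/pci/devices/0000:01:00.0/power/runtime_status" just before unplugging would show whether the controller was suspended at that moment. If the attribute does not exist, the kernel was probably built without runtime PM, which would itself rule out runtime suspend.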

> There were some pcieport errors in another log you showed, maybe PCI devices are not properly recovered
> and the registers return 0xffffffff?

FWIW, I just compiled v4.12-rc1 and I do get the broken behavior.

v4.11.12 = OK
v4.12-rc1 = KO

PLUG
[   17.226953] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
[   17.267195] usb-storage 2-2:1.0: USB Mass Storage device detected
[   17.273612] scsi host0: usb-storage 2-2:1.0
[   18.296369] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
[   18.307772] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
[   18.316991] sd 0:0:0:0: [sda] Write Protect is off
[   18.322588] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   18.334828]  sda: sda1
[   18.339507] sd 0:0:0:0: [sda] Attached SCSI removable disk
[   18.366202] random: fast init done

UNPLUG
[   21.314111] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   21.322219] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   21.334039] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   21.342453] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   21.349306] pcieport 0000:00:00.0: AER: Device recovery failed
[   22.055471] xhci_hcd 0000:01:00.0: xHCI host controller not responding, assume dead
[   22.063187] xhci_hcd 0000:01:00.0: HC died; cleaning up
[   22.068523] usb 2-2: USB disconnect, device number 2
[   22.073774] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   22.085369] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   22.098823] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   22.107245] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   22.114130] pcieport 0000:00:00.0: AER: Device recovery failed
[   22.120026] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   22.128096] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   22.139916] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   22.148320] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   22.155162] pcieport 0000:00:00.0: AER: Device recovery failed


The defconfig I used for testing:

# CONFIG_SWAP is not set
CONFIG_SYSVIPC=y
CONFIG_NO_HZ_IDLE=y
CONFIG_HIGH_RES_TIMERS=y
# CONFIG_COMPAT_BRK is not set
CONFIG_SLAB=y
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODVERSIONS=y
CONFIG_ARCH_TANGO=y
# CONFIG_ARM_ERRATA_643719 is not set
CONFIG_PCI=y
CONFIG_PCIEPORTBUS=y
CONFIG_PCI_MSI=y
CONFIG_PCIE_TANGO_SMP8759=y
CONFIG_SMP=y
CONFIG_PREEMPT=y
CONFIG_HZ_300=y
CONFIG_AEABI=y
CONFIG_HIGHMEM=y
# CONFIG_ATAGS is not set
CONFIG_ARM_APPENDED_DTB=y
CONFIG_ARM_ATAG_DTB_COMPAT=y
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPUFREQ_DT=y
CONFIG_VFP=y
CONFIG_NEON=y
CONFIG_NET=y
CONFIG_PACKET=y
CONFIG_UNIX=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
# CONFIG_IPV6 is not set
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
CONFIG_BLK_DEV_LOOP=y
CONFIG_SCSI=y
CONFIG_BLK_DEV_SD=y
CONFIG_NETDEVICES=y
CONFIG_NET_VENDOR_AURORA=y
CONFIG_AURORA_NB8800=y
CONFIG_AT803X_PHY=y
# CONFIG_WLAN is not set
# CONFIG_INPUT_KEYBOARD is not set
# CONFIG_INPUT_MOUSE is not set
# CONFIG_SERIO is not set
CONFIG_SERIAL_8250=y
# CONFIG_SERIAL_8250_DEPRECATED_OPTIONS is not set
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_8250_RT288X=y
CONFIG_SERIAL_OF_PLATFORM=y
# CONFIG_HW_RANDOM is not set
CONFIG_I2C=y
CONFIG_I2C_XLR=y
CONFIG_GPIOLIB=y
CONFIG_THERMAL=y
CONFIG_CPU_THERMAL=y
CONFIG_TANGO_THERMAL=y
CONFIG_WATCHDOG=y
CONFIG_TANGOX_WATCHDOG=y
CONFIG_FB=y
# CONFIG_HID is not set
# CONFIG_USB_HID is not set
CONFIG_USB=y
CONFIG_USB_XHCI_HCD=y
CONFIG_USB_STORAGE=y
CONFIG_EXT4_FS=y
CONFIG_FUSE_FS=m
CONFIG_VFAT_FS=m
CONFIG_TMPFS=y
CONFIG_NFS_FS=y
# CONFIG_NFS_V2 is not set
CONFIG_ROOT_NFS=y
CONFIG_NLS_CODEPAGE_437=m
CONFIG_NLS_ISO8859_1=m
CONFIG_NLS_UTF8=m
CONFIG_PRINTK_TIME=y
# CONFIG_CRYPTO_ECHAINIV is not set

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Possible regression between 4.9 and 4.13
  2017-08-23 11:54           ` Mason
@ 2017-08-23 12:41             ` Mason
  -1 siblings, 0 replies; 60+ messages in thread
From: Mason @ 2017-08-23 12:41 UTC (permalink / raw)
  To: Mathias Nyman, Felipe Balbi, linux-pci, linux-usb, Linux ARM
  Cc: Bjorn Helgaas, Alan Stern, Greg Kroah-Hartman

On 23/08/2017 13:54, Mason wrote:

> On 23/08/2017 13:11, Mathias Nyman wrote:
> 
>> In this case we read the register when hub thread asks to clear port feature.
>>
>> Why portsc returns 0xffffffff is another question. Could the hub thread be running while the xhci controller is in D3?
>> Was xhci runtime suspended?
> 
> How do I tell?
> Should I disable SUSPEND support and all kinds of power management?

I compiled a minimal kernel, with lots of irrelevant drivers and
frameworks left out, including power management. I still get the
"xHCI host controller not responding, assume dead" issue.

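On the "How do I tell?" question: one userspace check, sketched here under
assumptions, is to read the device's power/runtime_status attribute in sysfs.
The helper names below are hypothetical, and the attribute is only expected to
exist when the kernel is built with power management support, so on the
minimal kernel described above the read may simply fail.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read a device's runtime PM state from sysfs into buf.
 * Returns 0 on success, -1 if the attribute cannot be read
 * (for example when runtime PM is not built into the kernel).
 */
static int read_runtime_status(const char *path, char *buf, size_t len)
{
	FILE *f;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		buf[0] = '\0';
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';	/* strip the trailing newline */
	return 0;
}

/* True only when the reported state is exactly "suspended". */
static int is_runtime_suspended(const char *status)
{
	return strcmp(status, "suspended") == 0;
}
```

For the controller in these logs the path would presumably be
/sys/bus/pci/devices/0000:01:00.0/power/runtime_status, and a result of
"active" would suggest the device was not runtime suspended at that moment.
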
PLUG
[   59.803499] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
[   59.836902] usb 2-2: New USB device found, idVendor=0951, idProduct=1666
[   59.843653] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[   59.850900] usb 2-2: Product: DataTraveler 3.0
[   59.855417] usb 2-2: Manufacturer: Kingston
[   59.859661] usb 2-2: SerialNumber: 002618887865F0C0F8646BFA
[   59.868249] usb-storage 2-2:1.0: USB Mass Storage device detected
[   59.874691] scsi host0: usb-storage 2-2:1.0
[   60.882801] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
[   60.891640] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
[   60.899662] sd 0:0:0:0: [sda] Write Protect is off
[   60.904763] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   60.916154]  sda: sda1
[   60.919798] sd 0:0:0:0: [sda] Attached SCSI removable disk

UNPLUG
[   70.545087] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   70.553169] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   70.565084] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   70.573528] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   70.580402] pcieport 0000:00:00.0: AER: Device recovery failed

[   71.275253] xhci_hcd 0000:01:00.0: xHCI host controller not responding, assume dead
[   71.282956] xhci_hcd 0000:01:00.0: HC died; cleaning up
[   71.288304] usb 2-2: USB disconnect, device number 2

[   71.293445] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   71.301851] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   71.313785] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   71.322240] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   71.329115] pcieport 0000:00:00.0: AER: Device recovery failed

[   71.335042] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   71.343137] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   71.354984] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   71.363443] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   71.370289] pcieport 0000:00:00.0: AER: Device recovery failed
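
For readers decoding the "error status/mask=00004000/00000000" lines by hand:
the first value is the AER Uncorrectable Error Status register and 0x00004000
is bit 14, which the PCIe specification names Completion Timeout. A minimal
sketch in C, with hypothetical helper names, of that mapping:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bit positions in the AER Uncorrectable Error Status register. */
static const char *aer_uncor_name(unsigned int bit)
{
	switch (bit) {
	case 4:  return "Data Link Protocol Error";
	case 5:  return "Surprise Down Error";
	case 12: return "Poisoned TLP";
	case 13: return "Flow Control Protocol Error";
	case 14: return "Completion Timeout";
	case 15: return "Completer Abort";
	case 16: return "Unexpected Completion";
	case 17: return "Receiver Overflow";
	case 18: return "Malformed TLP";
	case 19: return "ECRC Error";
	case 20: return "Unsupported Request";
	default: return NULL;	/* reserved or not covered by this sketch */
	}
}

/* Lowest status bit that is set and not masked, or -1 if there is none. */
static int aer_first_unmasked(uint32_t status, uint32_t mask)
{
	uint32_t active = status & ~mask;

	for (int bit = 0; bit < 32; bit++)
		if (active & (UINT32_C(1) << bit))
			return bit;
	return -1;
}
```

With status 0x00004000 and mask 0x00000000 this yields bit 14, which matches
the "[14] Completion Timeout" line printed by the pcieport driver.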


defconfig for reference

# CONFIG_SWAP is not set
CONFIG_SYSVIPC=y
CONFIG_NO_HZ_IDLE=y
CONFIG_HIGH_RES_TIMERS=y
# CONFIG_COMPAT_BRK is not set
CONFIG_SLAB=y
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODVERSIONS=y
CONFIG_ARCH_TANGO=y
# CONFIG_ARM_ERRATA_643719 is not set
CONFIG_PCI=y
CONFIG_PCIEPORTBUS=y
CONFIG_PCI_MSI=y
CONFIG_PCIE_TANGO_SMP8759=y
CONFIG_SMP=y
CONFIG_PREEMPT=y
CONFIG_HZ_300=y
CONFIG_AEABI=y
CONFIG_HIGHMEM=y
# CONFIG_ATAGS is not set
CONFIG_ARM_APPENDED_DTB=y
CONFIG_ARM_ATAG_DTB_COMPAT=y
CONFIG_VFP=y
CONFIG_NEON=y
# CONFIG_SUSPEND is not set
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
CONFIG_BLK_DEV_LOOP=y
CONFIG_SCSI=y
CONFIG_BLK_DEV_SD=y
# CONFIG_INPUT_KEYBOARD is not set
# CONFIG_INPUT_MOUSE is not set
# CONFIG_SERIO is not set
CONFIG_SERIAL_8250=y
# CONFIG_SERIAL_8250_DEPRECATED_OPTIONS is not set
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_8250_RT288X=y
CONFIG_SERIAL_OF_PLATFORM=y
# CONFIG_HW_RANDOM is not set
# CONFIG_HWMON is not set
# CONFIG_HID is not set
# CONFIG_USB_HID is not set
CONFIG_USB=y
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y
CONFIG_USB_XHCI_HCD=y
CONFIG_USB_STORAGE=y
CONFIG_VFAT_FS=m
CONFIG_TMPFS=y
CONFIG_NLS_CODEPAGE_437=m
CONFIG_NLS_ISO8859_1=m
CONFIG_NLS_UTF8=m
CONFIG_PRINTK_TIME=y

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Possible regression between 4.9 and 4.13
  2017-08-23 12:41             ` Mason
@ 2017-08-23 14:30               ` Mason
  -1 siblings, 0 replies; 60+ messages in thread
From: Mason @ 2017-08-23 14:30 UTC (permalink / raw)
  To: Mathias Nyman, Felipe Balbi, linux-pci, linux-usb, Linux ARM
  Cc: Bjorn Helgaas, Alan Stern, Greg Kroah-Hartman

On 23/08/2017 14:41, Mason wrote:

> I compiled a minimal kernel, with lots of irrelevant drivers and
> frameworks left out, including power management. I still get the
> "xHCI host controller not responding, assume dead" issue.

The problem seems to have a timing-related aspect.

I added a bunch of logs (to a slow serial console) and the HC was
not killed. I was able to plug the Flash drive a second time.
(I am logging config space reads and writes.)

[    1.098314]   READ: bus=1 devfn=0 where=84 size=2 val=0x8
[    1.103779]   READ: bus=1 devfn=0 where=4 size=2 val=0x142
[    1.109315]   READ: bus=1 devfn=0 where=61 size=1 val=0x1
[    1.114746]   READ: bus=1 devfn=0 where=4 size=2 val=0x142
[    1.120311]   READ: bus=1 devfn=0 where=4 size=2 val=0x142
[    1.125841]  WRITE: bus=1 devfn=0 where=4 size=2 val=0x146
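
The where=4 accesses above target the 16-bit PCI command register, assuming
the "where" field is a decimal byte offset into config space. A minimal sketch
in C (helper name hypothetical, bit values from the PCI specification) of
which bit the logged write from 0x142 to 0x146 turns on:

```c
#include <assert.h>
#include <stdint.h>

/* PCI command register bits, as defined by the PCI specification. */
#define CMD_IO            0x0001	/* I/O space decoding */
#define CMD_MEMORY        0x0002	/* memory space decoding */
#define CMD_MASTER        0x0004	/* bus mastering */
#define CMD_INVALIDATE    0x0010	/* Memory Write and Invalidate */
#define CMD_PARITY        0x0040	/* parity error response */
#define CMD_SERR          0x0100	/* SERR# enable */
#define CMD_INTX_DISABLE  0x0400	/* INTx emulation disable */

/* Bits that differ between two successive command register values. */
static uint16_t cmd_changed_bits(uint16_t before, uint16_t after)
{
	return (uint16_t)(before ^ after);
}
```

By this decoding, 0x142 has memory decoding, parity error response and SERR#
enabled, and the write of 0x146 additionally sets the bus master bit. The
later writes in this log can be read the same way: 0x156 adds Memory Write and
Invalidate, and 0x546 adds INTx Disable.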

NB: I added msleep(2500) in usb_add_hcd()

[    3.681867] xhci_hcd 0000:01:00.0: xHCI Host Controller
[    3.687154] xhci_hcd 0000:01:00.0: new USB bus registered, assigned bus number 1
[    3.694656]   READ: bus=1 devfn=0 where=96 size=1 val=0x30
[    3.705736] xhci_hcd 0000:01:00.0: hcc params 0x014051cf hci version 0x100 quirks 0x00000010
[    3.714233]   READ: bus=1 devfn=0 where=12 size=1 val=0x10
[    3.719752]   READ: bus=1 devfn=0 where=4 size=2 val=0x146
[    3.725269]  WRITE: bus=1 devfn=0 where=4 size=2 val=0x156
[    3.730794]   READ: bus=1 devfn=0 where=146 size=2 val=0x7
[    3.736314]   READ: bus=1 devfn=0 where=146 size=2 val=0x7
[    3.741835]  WRITE: bus=1 devfn=0 where=146 size=2 val=0x7
[    3.747354]   READ: bus=1 devfn=0 where=146 size=2 val=0x7
[    3.752871]   READ: bus=1 devfn=0 where=148 size=4 val=0x1000
[    3.758775]   READ: bus=1 devfn=0 where=146 size=2 val=0x7
[    3.764297]  WRITE: bus=1 devfn=0 where=146 size=2 val=0xc007
[    3.770108]   READ: bus=1 devfn=0 where=4 size=2 val=0x146
[    3.775626]  WRITE: bus=1 devfn=0 where=4 size=2 val=0x546
[    3.781146]   READ: bus=1 devfn=0 where=146 size=2 val=0xc007
[    3.786925]  WRITE: bus=1 devfn=0 where=146 size=2 val=0x8007
[    3.792919] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002
[    3.799756] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    3.807021] usb usb1: Product: xHCI Host Controller
[    3.811933] usb usb1: Manufacturer: Linux 4.12.0-rc1 xhci-hcd
[    3.817713] usb usb1: SerialNumber: 0000:01:00.0
[    3.822773] hub 1-0:1.0: USB hub found
[    3.826598] hub 1-0:1.0: 4 ports detected

NB: I added msleep(2500) in usb_add_hcd()

[    6.455246] xhci_hcd 0000:01:00.0: xHCI Host Controller
[    6.460520] xhci_hcd 0000:01:00.0: new USB bus registered, assigned bus number 2
[    6.468028] usb usb2: We don't know the algorithms for LPM for this host, disabling LPM.
[    6.476236] usb usb2: New USB device found, idVendor=1d6b, idProduct=0003
[    6.483068] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    6.490334] usb usb2: Product: xHCI Host Controller
[    6.495240] usb usb2: Manufacturer: Linux 4.12.0-rc1 xhci-hcd
[    6.501020] usb usb2: SerialNumber: 0000:01:00.0
[    6.505994] hub 2-0:1.0: USB hub found
[    6.509806] hub 2-0:1.0: 4 ports detected
[    6.514215] usbcore: registered new interface driver usb-storage
[    6.520313] Registering SWP/SWPB emulation handler
[    6.525541]   READ: bus=0 devfn=0 where=132 size=4 val=0x8001
[    6.531334]   READ: bus=0 devfn=0 where=6 size=2 val=0x4010
[    6.536955]   READ: bus=0 devfn=0 where=52 size=1 val=0x50
[    6.542484]   READ: bus=0 devfn=0 where=80 size=2 val=0x7805
[    6.548180]   READ: bus=0 devfn=0 where=120 size=2 val=0x8001
[    6.553969]   READ: bus=0 devfn=0 where=128 size=2 val=0x10
[    6.559584]   READ: bus=0 devfn=0 where=124 size=2 val=0x6008
[    6.565387]   READ: bus=1 devfn=0 where=164 size=4 val=0x8fc0
[    6.571167]   READ: bus=1 devfn=0 where=6 size=2 val=0x10
[    6.576609]   READ: bus=1 devfn=0 where=52 size=1 val=0x50
[    6.582129]   READ: bus=1 devfn=0 where=80 size=2 val=0x7001
[    6.587821]   READ: bus=1 devfn=0 where=112 size=2 val=0x9005
[    6.593601]   READ: bus=1 devfn=0 where=144 size=2 val=0xa011
[    6.599381]   READ: bus=1 devfn=0 where=160 size=2 val=0x10
[    6.604985]   READ: bus=1 devfn=0 where=84 size=2 val=0x8
[    6.623665] Freeing unused kernel memory: 9216K


PLUG #1
[   66.783559] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
[   66.816910] usb 2-2: New USB device found, idVendor=0951, idProduct=1666
[   66.823661] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[   66.830909] usb 2-2: Product: DataTraveler 3.0
[   66.835417] usb 2-2: Manufacturer: Kingston
[   66.839660] usb 2-2: SerialNumber: 002618887865F0C0F8646BFA
[   66.848131] usb-storage 2-2:1.0: USB Mass Storage device detected
[   66.854584] scsi host0: usb-storage 2-2:1.0
[   67.869446] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
[   67.878270] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
[   67.886248] sd 0:0:0:0: [sda] Write Protect is off
[   67.891347] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   67.902708]  sda: sda1
[   67.906372] sd 0:0:0:0: [sda] Attached SCSI removable disk


UNPLUG #1
[   71.697358]   READ: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[   71.703572]   READ: bus=0 devfn=0 where=2100 size=4 val=0x0
[   71.709170]  WRITE: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[   71.715569] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   71.723632]   READ: bus=0 devfn=0 where=136 size=2 val=0x281f
[   71.729470]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   71.735373]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   71.741013]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   71.746914]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   71.752552]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[   71.758194] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   71.770008] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   71.778494] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   71.785358]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   71.791259]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   71.796897]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[   71.802524] pcieport 0000:00:00.0: AER: Device recovery failed
[   72.451908]   READ: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[   72.458120]   READ: bus=0 devfn=0 where=2100 size=4 val=0x0
[   72.463717]  WRITE: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[   72.470012]   READ: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[   72.476221]   READ: bus=0 devfn=0 where=2100 size=4 val=0x0
[   72.481819]  WRITE: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[   72.488109]   READ: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[   72.494319]   READ: bus=0 devfn=0 where=2100 size=4 val=0x0
[   72.499916]  WRITE: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[   72.506205]   READ: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[   72.512415]   READ: bus=0 devfn=0 where=2100 size=4 val=0x0
[   72.518011]  WRITE: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[   72.524263] xhci_hcd 0000:01:00.0: Cannot set link state.
[   72.529711] usb usb2-port2: cannot disable (err = -32)
[   72.534883] usb 2-2: USB disconnect, device number 2
[   72.540042] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   72.548365]   READ: bus=0 devfn=0 where=136 size=2 val=0x281f
[   72.554264]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.560157]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.565778]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.571654]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.577273]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[   72.582891] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   72.594705] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   72.603122] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   72.609955]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.615833]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.621441]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[   72.627061] pcieport 0000:00:00.0: AER: Device recovery failed
[   72.632931] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   72.640984]   READ: bus=0 devfn=0 where=136 size=2 val=0x281f
[   72.646769]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.652636]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.658245]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.664114]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.669722]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[   72.675330] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   72.687142] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   72.695545] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   72.702376]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.708244]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.713856]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[   72.719473] pcieport 0000:00:00.0: AER: Device recovery failed
[   72.725342] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   72.733394]   READ: bus=0 devfn=0 where=136 size=2 val=0x281f
[   72.739178]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.745044]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.750653]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.756520]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.762128]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[   72.767734] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   72.779548] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   72.787950] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   72.794781]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.800649]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.806258]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[   72.811873] pcieport 0000:00:00.0: AER: Device recovery failed
[   72.817741] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   72.825793]   READ: bus=0 devfn=0 where=136 size=2 val=0x281f
[   72.831574]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.837442]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.843054]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.848922]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.854529]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[   72.860137] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   72.871951] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   72.880353] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   72.887184]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.893051]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.898660]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[   72.904273] pcieport 0000:00:00.0: AER: Device recovery failed
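
The bus=0 config reads in the UNPLUG log can be mapped to AER registers,
assuming the root port's AER extended capability sits at offset 0x800 (2048).
That base is an inference from the logged offsets 2052, 2056, 2072, 2096 and
2100, not something stated in the log. The sketch below, with hypothetical
helper names, uses the register layout from the PCIe specification:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* ASSUMPTION: AER capability base inferred from the logged offsets. */
#define AER_BASE 0x800u

/* Name of the AER register at config space offset "where". */
static const char *aer_reg_name(unsigned int where)
{
	switch (where - AER_BASE) {
	case 0x04: return "Uncorrectable Error Status";
	case 0x08: return "Uncorrectable Error Mask";
	case 0x0c: return "Uncorrectable Error Severity";
	case 0x10: return "Correctable Error Status";
	case 0x14: return "Correctable Error Mask";
	case 0x18: return "Advanced Error Capabilities and Control";
	case 0x2c: return "Root Error Command";
	case 0x30: return "Root Error Status";
	case 0x34: return "Error Source Identification";
	default:   return "unknown";
	}
}

/* First Error Pointer: low 5 bits of the capabilities/control register. */
static unsigned int aer_first_error_pointer(uint32_t cap)
{
	return cap & 0x1f;
}

/* Root Error Status flags. */
static int root_uncor_received(uint32_t s)        { return (s & 0x04) != 0; }
static int root_nonfatal_msg_received(uint32_t s) { return (s & 0x20) != 0; }
static int root_fatal_msg_received(uint32_t s)    { return (s & 0x40) != 0; }
```

Under that assumption, where=2096 val=0x10000024 is a Root Error Status with
an uncorrectable, non-fatal error recorded, and where=2072 val=0xe gives a
First Error Pointer of 14, consistent with the "(First)" tag on the
Completion Timeout line.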


PLUG #2
[  165.860193] usb 2-2: new SuperSpeed USB device number 3 using xhci_hcd
[  165.893583] usb 2-2: New USB device found, idVendor=0951, idProduct=1666
[  165.900333] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[  165.907515] usb 2-2: Product: DataTraveler 3.0
[  165.911989] usb 2-2: Manufacturer: Kingston
[  165.916198] usb 2-2: SerialNumber: 002618887865F0C0F8646BFA
[  165.924547] usb-storage 2-2:1.0: USB Mass Storage device detected
[  165.930970] scsi host0: usb-storage 2-2:1.0
[  166.962705] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
[  166.971494] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
[  166.979556] sd 0:0:0:0: [sda] Write Protect is off
[  166.984591] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[  166.995847] random: fast init done
[  166.999430]  sda: sda1
[  167.003039] sd 0:0:0:0: [sda] Attached SCSI removable disk


UNPLUG #2
[  171.918834]   READ: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[  171.925046]   READ: bus=0 devfn=0 where=2100 size=4 val=0x0
[  171.930645]  WRITE: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[  171.936941] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[  171.945000]   READ: bus=0 devfn=0 where=136 size=2 val=0x281f
[  171.950784]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  171.956656]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  171.962263]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  171.968134]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  171.973741]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[  171.979354] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[  171.991164] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[  171.999597] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[  172.006429]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  172.012300]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  172.017908]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[  172.023529] pcieport 0000:00:00.0: AER: Device recovery failed
[  172.675221]   READ: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[  172.681432]   READ: bus=0 devfn=0 where=2100 size=4 val=0x0
[  172.687030]  WRITE: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[  172.693325]   READ: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[  172.699536]   READ: bus=0 devfn=0 where=2100 size=4 val=0x0
[  172.705133]  WRITE: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[  172.711424]   READ: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[  172.717633]   READ: bus=0 devfn=0 where=2100 size=4 val=0x0
[  172.723230]  WRITE: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[  172.729517]   READ: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[  172.735726]   READ: bus=0 devfn=0 where=2100 size=4 val=0x0
[  172.741322]  WRITE: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[  172.747574] xhci_hcd 0000:01:00.0: Cannot set link state.
[  172.753021] usb usb2-port2: cannot disable (err = -32)
[  172.758193] usb 2-2: USB disconnect, device number 3
[  172.763340] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[  172.771627]   READ: bus=0 devfn=0 where=136 size=2 val=0x281f
[  172.777515]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  172.783408]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  172.789030]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  172.794907]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  172.800526]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[  172.806146] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[  172.817960] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[  172.826375] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[  172.833208]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  172.839078]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  172.844685]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[  172.850305] pcieport 0000:00:00.0: AER: Device recovery failed
[  172.856183] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[  172.864236]   READ: bus=0 devfn=0 where=136 size=2 val=0x281f
[  172.870020]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  172.875889]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  172.881497]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  172.887365]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  172.892974]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[  172.898582] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[  172.910393] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[  172.918796] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[  172.925627]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  172.931494]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  172.937107]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[  172.942724] pcieport 0000:00:00.0: AER: Device recovery failed
[  172.948593] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[  172.956644]   READ: bus=0 devfn=0 where=136 size=2 val=0x281f
[  172.962428]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  172.968295]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  172.973903]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  172.979771]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  172.985379]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[  172.990985] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[  173.002799] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[  173.011202] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[  173.018033]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  173.023901]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  173.029510]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[  173.035123] pcieport 0000:00:00.0: AER: Device recovery failed
[  173.040990] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[  173.049042]   READ: bus=0 devfn=0 where=136 size=2 val=0x281f
[  173.054825]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  173.060693]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  173.066305]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  173.072173]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  173.077780]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[  173.083388] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[  173.095202] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[  173.103605] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[  173.110435]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  173.116303]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  173.121911]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[  173.127524] pcieport 0000:00:00.0: AER: Device recovery failed




NOTA BENE: these issues do not occur at all with a USB2 Flash drive.

[ 2093.564771] usb 1-2: new high-speed USB device number 2 using xhci_hcd
[ 2093.790646] usb 1-2: New USB device found, idVendor=058f, idProduct=6387
[ 2093.797397] usb 1-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[ 2093.804583] usb 1-2: Product: Mass Storage
[ 2093.808707] usb 1-2: Manufacturer: Generic
[ 2093.812829] usb 1-2: SerialNumber: 31A69E70
[ 2093.819244] usb-storage 1-2:1.0: USB Mass Storage device detected
[ 2093.825624] scsi host0: usb-storage 1-2:1.0
[ 2094.856918] scsi 0:0:0:0: Direct-Access     Generic  Flash Disk       8.07 PQ: 0 ANSI: 2
[ 2094.866196] sd 0:0:0:0: [sda] 4106240 512-byte logical blocks: (2.10 GB/1.96 GiB)
[ 2094.874232] sd 0:0:0:0: [sda] Write Protect is off
[ 2094.879350] sd 0:0:0:0: [sda] No Caching mode page found
[ 2094.884816] sd 0:0:0:0: [sda] Assuming drive cache: write through
[ 2094.909111]  sda: sda1
[ 2094.912935] sd 0:0:0:0: [sda] Attached SCSI removable disk

[ 2100.516396] usb 1-2: USB disconnect, device number 2


Regards.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Possible regression between 4.9 and 4.13
@ 2017-08-23 14:30               ` Mason
  0 siblings, 0 replies; 60+ messages in thread
From: Mason @ 2017-08-23 14:30 UTC (permalink / raw)
  To: linux-arm-kernel

On 23/08/2017 14:41, Mason wrote:

> I compiled a minimal kernel, with lots of irrelevant drivers and
> frameworks left out, including power management. I still get the
> "xHCI host controller not responding, assume dead" issue.

The problem seems to have a timing-related aspect.

I added a bunch of logs (to a slow serial console) and the HC was
not killed. I was able to plug the Flash drive a second time.
(I am logging config space reads and writes.)

[    1.098314]   READ: bus=1 devfn=0 where=84 size=2 val=0x8
[    1.103779]   READ: bus=1 devfn=0 where=4 size=2 val=0x142
[    1.109315]   READ: bus=1 devfn=0 where=61 size=1 val=0x1
[    1.114746]   READ: bus=1 devfn=0 where=4 size=2 val=0x142
[    1.120311]   READ: bus=1 devfn=0 where=4 size=2 val=0x142
[    1.125841]  WRITE: bus=1 devfn=0 where=4 size=2 val=0x146

NB: I added msleep(2500) in usb_add_hcd()

[    3.681867] xhci_hcd 0000:01:00.0: xHCI Host Controller
[    3.687154] xhci_hcd 0000:01:00.0: new USB bus registered, assigned bus number 1
[    3.694656]   READ: bus=1 devfn=0 where=96 size=1 val=0x30
[    3.705736] xhci_hcd 0000:01:00.0: hcc params 0x014051cf hci version 0x100 quirks 0x00000010
[    3.714233]   READ: bus=1 devfn=0 where=12 size=1 val=0x10
[    3.719752]   READ: bus=1 devfn=0 where=4 size=2 val=0x146
[    3.725269]  WRITE: bus=1 devfn=0 where=4 size=2 val=0x156
[    3.730794]   READ: bus=1 devfn=0 where=146 size=2 val=0x7
[    3.736314]   READ: bus=1 devfn=0 where=146 size=2 val=0x7
[    3.741835]  WRITE: bus=1 devfn=0 where=146 size=2 val=0x7
[    3.747354]   READ: bus=1 devfn=0 where=146 size=2 val=0x7
[    3.752871]   READ: bus=1 devfn=0 where=148 size=4 val=0x1000
[    3.758775]   READ: bus=1 devfn=0 where=146 size=2 val=0x7
[    3.764297]  WRITE: bus=1 devfn=0 where=146 size=2 val=0xc007
[    3.770108]   READ: bus=1 devfn=0 where=4 size=2 val=0x146
[    3.775626]  WRITE: bus=1 devfn=0 where=4 size=2 val=0x546
[    3.781146]   READ: bus=1 devfn=0 where=146 size=2 val=0xc007
[    3.786925]  WRITE: bus=1 devfn=0 where=146 size=2 val=0x8007
[    3.792919] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002
[    3.799756] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    3.807021] usb usb1: Product: xHCI Host Controller
[    3.811933] usb usb1: Manufacturer: Linux 4.12.0-rc1 xhci-hcd
[    3.817713] usb usb1: SerialNumber: 0000:01:00.0
[    3.822773] hub 1-0:1.0: USB hub found
[    3.826598] hub 1-0:1.0: 4 ports detected
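The accesses at `where=146` in the trace above look like the MSI-X Message Control word: a later read in this same log returns capability ID 0x11 (MSI-X) at offset 144, and Message Control sits two bytes into that capability. Under that assumption, here is a small sketch that decodes the three values seen (0x7, 0xc007, 0x8007). The masks match PCI_MSIX_FLAGS_* in pci_regs.h; the struct and function names are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSIX_FLAGS_QSIZE   0x07FFu /* table size, encoded as N - 1 */
#define MSIX_FLAGS_MASKALL 0x4000u /* function mask: all vectors masked */
#define MSIX_FLAGS_ENABLE  0x8000u /* MSI-X enabled */

struct msix_ctrl {
	unsigned int table_size; /* number of vectors in the MSI-X table */
	bool enabled;
	bool masked;
};

/* Split a Message Control value into its three fields. */
struct msix_ctrl msix_decode(uint16_t flags)
{
	struct msix_ctrl c;

	c.table_size = (flags & MSIX_FLAGS_QSIZE) + 1u;
	c.enabled = (flags & MSIX_FLAGS_ENABLE) != 0;
	c.masked = (flags & MSIX_FLAGS_MASKALL) != 0;
	return c;
}
```

Read this way, 0x7 is an 8-entry table with MSI-X off, 0xc007 is MSI-X enabled with every vector masked, and 0x8007 is the final enabled and unmasked state.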

NB: I added msleep(2500) in usb_add_hcd()

[    6.455246] xhci_hcd 0000:01:00.0: xHCI Host Controller
[    6.460520] xhci_hcd 0000:01:00.0: new USB bus registered, assigned bus number 2
[    6.468028] usb usb2: We don't know the algorithms for LPM for this host, disabling LPM.
[    6.476236] usb usb2: New USB device found, idVendor=1d6b, idProduct=0003
[    6.483068] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    6.490334] usb usb2: Product: xHCI Host Controller
[    6.495240] usb usb2: Manufacturer: Linux 4.12.0-rc1 xhci-hcd
[    6.501020] usb usb2: SerialNumber: 0000:01:00.0
[    6.505994] hub 2-0:1.0: USB hub found
[    6.509806] hub 2-0:1.0: 4 ports detected
[    6.514215] usbcore: registered new interface driver usb-storage
[    6.520313] Registering SWP/SWPB emulation handler
[    6.525541]   READ: bus=0 devfn=0 where=132 size=4 val=0x8001
[    6.531334]   READ: bus=0 devfn=0 where=6 size=2 val=0x4010
[    6.536955]   READ: bus=0 devfn=0 where=52 size=1 val=0x50
[    6.542484]   READ: bus=0 devfn=0 where=80 size=2 val=0x7805
[    6.548180]   READ: bus=0 devfn=0 where=120 size=2 val=0x8001
[    6.553969]   READ: bus=0 devfn=0 where=128 size=2 val=0x10
[    6.559584]   READ: bus=0 devfn=0 where=124 size=2 val=0x6008
[    6.565387]   READ: bus=1 devfn=0 where=164 size=4 val=0x8fc0
[    6.571167]   READ: bus=1 devfn=0 where=6 size=2 val=0x10
[    6.576609]   READ: bus=1 devfn=0 where=52 size=1 val=0x50
[    6.582129]   READ: bus=1 devfn=0 where=80 size=2 val=0x7001
[    6.587821]   READ: bus=1 devfn=0 where=112 size=2 val=0x9005
[    6.593601]   READ: bus=1 devfn=0 where=144 size=2 val=0xa011
[    6.599381]   READ: bus=1 devfn=0 where=160 size=2 val=0x10
[    6.604985]   READ: bus=1 devfn=0 where=84 size=2 val=0x8
[    6.623665] Freeing unused kernel memory: 9216K
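The bus=1 reads at offsets 52, 80, 112, 144 and 160 follow the usual capability-list walk: offset 0x34 holds the pointer to the first capability, and each capability starts with an ID byte followed by a next-pointer byte. The sketch below replays that walk in user space over a config-space image filled only with the values visible in the trace. Both function names are hypothetical, and the loop is a simplified stand-in, not the kernel's own code.

```c
#include <stddef.h>
#include <stdint.h>

#define CFG_CAP_PTR 0x34 /* offset of the Capabilities Pointer */
#define CFG_SIZE    256  /* conventional (non-extended) config space */

/*
 * Walk the capability list in a config-space image and store up to 'max'
 * capability IDs in 'ids'. Returns the number of capabilities found.
 */
size_t walk_caps(const uint8_t *cfg, uint8_t *ids, size_t max)
{
	size_t n = 0;
	uint8_t pos = (uint8_t)(cfg[CFG_CAP_PTR] & 0xFCu); /* low 2 bits reserved */
	int ttl = 48;                                      /* stop on looped lists */

	while (pos >= 0x40 && n < max && ttl-- > 0) {
		ids[n++] = cfg[pos];                       /* byte 0: capability ID */
		pos = (uint8_t)(cfg[pos + 1] & 0xFCu);     /* byte 1: next pointer */
	}
	return n;
}

/* Build an image that holds only the bus=1 values shown in the trace. */
void fill_from_trace(uint8_t *cfg)
{
	for (size_t i = 0; i < CFG_SIZE; i++)
		cfg[i] = 0;
	cfg[0x34] = 0x50;                   /* where=52  val=0x50   */
	cfg[0x50] = 0x01; cfg[0x51] = 0x70; /* where=80  val=0x7001 */
	cfg[0x70] = 0x05; cfg[0x71] = 0x90; /* where=112 val=0x9005 */
	cfg[0x90] = 0x11; cfg[0x91] = 0xA0; /* where=144 val=0xa011 */
	cfg[0xA0] = 0x10; cfg[0xA1] = 0x00; /* where=160 val=0x10   */
}
```

The walk yields IDs 0x01, 0x05, 0x11 and 0x10, that is Power Management, MSI, MSI-X and PCI Express, in that order.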


PLUG #1
[   66.783559] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
[   66.816910] usb 2-2: New USB device found, idVendor=0951, idProduct=1666
[   66.823661] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[   66.830909] usb 2-2: Product: DataTraveler 3.0
[   66.835417] usb 2-2: Manufacturer: Kingston
[   66.839660] usb 2-2: SerialNumber: 002618887865F0C0F8646BFA
[   66.848131] usb-storage 2-2:1.0: USB Mass Storage device detected
[   66.854584] scsi host0: usb-storage 2-2:1.0
[   67.869446] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
[   67.878270] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
[   67.886248] sd 0:0:0:0: [sda] Write Protect is off
[   67.891347] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   67.902708]  sda: sda1
[   67.906372] sd 0:0:0:0: [sda] Attached SCSI removable disk


UNPLUG #1
[   71.697358]   READ: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[   71.703572]   READ: bus=0 devfn=0 where=2100 size=4 val=0x0
[   71.709170]  WRITE: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[   71.715569] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   71.723632]   READ: bus=0 devfn=0 where=136 size=2 val=0x281f
[   71.729470]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   71.735373]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   71.741013]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   71.746914]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   71.752552]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[   71.758194] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   71.770008] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   71.778494] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   71.785358]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   71.791259]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   71.796897]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[   71.802524] pcieport 0000:00:00.0: AER: Device recovery failed
[   72.451908]   READ: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[   72.458120]   READ: bus=0 devfn=0 where=2100 size=4 val=0x0
[   72.463717]  WRITE: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[   72.470012]   READ: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[   72.476221]   READ: bus=0 devfn=0 where=2100 size=4 val=0x0
[   72.481819]  WRITE: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[   72.488109]   READ: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[   72.494319]   READ: bus=0 devfn=0 where=2100 size=4 val=0x0
[   72.499916]  WRITE: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[   72.506205]   READ: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[   72.512415]   READ: bus=0 devfn=0 where=2100 size=4 val=0x0
[   72.518011]  WRITE: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[   72.524263] xhci_hcd 0000:01:00.0: Cannot set link state.
[   72.529711] usb usb2-port2: cannot disable (err = -32)
[   72.534883] usb 2-2: USB disconnect, device number 2
[   72.540042] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   72.548365]   READ: bus=0 devfn=0 where=136 size=2 val=0x281f
[   72.554264]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.560157]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.565778]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.571654]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.577273]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[   72.582891] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   72.594705] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   72.603122] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   72.609955]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.615833]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.621441]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[   72.627061] pcieport 0000:00:00.0: AER: Device recovery failed
[   72.632931] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   72.640984]   READ: bus=0 devfn=0 where=136 size=2 val=0x281f
[   72.646769]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.652636]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.658245]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.664114]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.669722]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[   72.675330] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   72.687142] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   72.695545] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   72.702376]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.708244]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.713856]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[   72.719473] pcieport 0000:00:00.0: AER: Device recovery failed
[   72.725342] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   72.733394]   READ: bus=0 devfn=0 where=136 size=2 val=0x281f
[   72.739178]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.745044]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.750653]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.756520]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.762128]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[   72.767734] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   72.779548] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   72.787950] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   72.794781]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.800649]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.806258]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[   72.811873] pcieport 0000:00:00.0: AER: Device recovery failed
[   72.817741] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   72.825793]   READ: bus=0 devfn=0 where=136 size=2 val=0x281f
[   72.831574]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.837442]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.843054]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.848922]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.854529]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[   72.860137] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   72.871951] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   72.880353] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   72.887184]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[   72.893051]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[   72.898660]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[   72.904273] pcieport 0000:00:00.0: AER: Device recovery failed
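The "error status/mask=00004000/00000000" and "[14] Completion Timeout (First)" lines can be tied back to the raw reads: 0x4000 is bit 14 of the AER Uncorrectable Error Status register, and the 0xe read at `where=2072` is plausibly the First Error Pointer. That reading assumes the root port's AER capability sits at 0x800, which is inferred from offset 2052 (0x804) returning the status value and is not stated anywhere in the thread. A sketch of that decoding follows, with bit names taken from the PCIe AER definition and all function names invented here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Uncorrectable error status bits defined by the PCIe AER capability. */
static const char *const uncor_names[32] = {
	[4]  = "Data Link Protocol Error",
	[5]  = "Surprise Down Error",
	[12] = "Poisoned TLP",
	[13] = "Flow Control Protocol Error",
	[14] = "Completion Timeout",
	[15] = "Completer Abort",
	[16] = "Unexpected Completion",
	[17] = "Receiver Overflow",
	[18] = "Malformed TLP",
	[19] = "ECRC Error",
	[20] = "Unsupported Request",
};

/* Name of one status bit, or "Unknown" for reserved positions. */
const char *uncor_bit_name(unsigned int bit)
{
	if (bit >= 32 || uncor_names[bit] == NULL)
		return "Unknown";
	return uncor_names[bit];
}

/* First Error Pointer: bits 4:0 of AER Capabilities and Control. */
unsigned int aer_first_error(uint32_t aer_cap_ctrl)
{
	return aer_cap_ctrl & 0x1Fu;
}

/* True if 'bit' is set in the status register and not masked. */
bool uncor_reported(uint32_t status, uint32_t mask, unsigned int bit)
{
	return bit < 32 && ((status & ~mask) >> bit) & 1u;
}
```

With status 0x00004000, mask 0 and a capabilities/control value of 0xe, this gives bit 14, "Completion Timeout", flagged as the first error, matching what the pcieport driver prints.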


PLUG #2
[  165.860193] usb 2-2: new SuperSpeed USB device number 3 using xhci_hcd
[  165.893583] usb 2-2: New USB device found, idVendor=0951, idProduct=1666
[  165.900333] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[  165.907515] usb 2-2: Product: DataTraveler 3.0
[  165.911989] usb 2-2: Manufacturer: Kingston
[  165.916198] usb 2-2: SerialNumber: 002618887865F0C0F8646BFA
[  165.924547] usb-storage 2-2:1.0: USB Mass Storage device detected
[  165.930970] scsi host0: usb-storage 2-2:1.0
[  166.962705] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
[  166.971494] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
[  166.979556] sd 0:0:0:0: [sda] Write Protect is off
[  166.984591] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[  166.995847] random: fast init done
[  166.999430]  sda: sda1
[  167.003039] sd 0:0:0:0: [sda] Attached SCSI removable disk


UNPLUG #2
[  171.918834]   READ: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[  171.925046]   READ: bus=0 devfn=0 where=2100 size=4 val=0x0
[  171.930645]  WRITE: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[  171.936941] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[  171.945000]   READ: bus=0 devfn=0 where=136 size=2 val=0x281f
[  171.950784]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  171.956656]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  171.962263]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  171.968134]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  171.973741]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[  171.979354] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[  171.991164] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[  171.999597] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[  172.006429]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  172.012300]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  172.017908]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[  172.023529] pcieport 0000:00:00.0: AER: Device recovery failed
[  172.675221]   READ: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[  172.681432]   READ: bus=0 devfn=0 where=2100 size=4 val=0x0
[  172.687030]  WRITE: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[  172.693325]   READ: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[  172.699536]   READ: bus=0 devfn=0 where=2100 size=4 val=0x0
[  172.705133]  WRITE: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[  172.711424]   READ: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[  172.717633]   READ: bus=0 devfn=0 where=2100 size=4 val=0x0
[  172.723230]  WRITE: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[  172.729517]   READ: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[  172.735726]   READ: bus=0 devfn=0 where=2100 size=4 val=0x0
[  172.741322]  WRITE: bus=0 devfn=0 where=2096 size=4 val=0x10000024
[  172.747574] xhci_hcd 0000:01:00.0: Cannot set link state.
[  172.753021] usb usb2-port2: cannot disable (err = -32)
[  172.758193] usb 2-2: USB disconnect, device number 3
[  172.763340] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[  172.771627]   READ: bus=0 devfn=0 where=136 size=2 val=0x281f
[  172.777515]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  172.783408]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  172.789030]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  172.794907]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  172.800526]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[  172.806146] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[  172.817960] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[  172.826375] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[  172.833208]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  172.839078]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  172.844685]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[  172.850305] pcieport 0000:00:00.0: AER: Device recovery failed
[  172.856183] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[  172.864236]   READ: bus=0 devfn=0 where=136 size=2 val=0x281f
[  172.870020]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  172.875889]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  172.881497]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  172.887365]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  172.892974]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[  172.898582] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[  172.910393] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[  172.918796] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[  172.925627]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  172.931494]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  172.937107]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[  172.942724] pcieport 0000:00:00.0: AER: Device recovery failed
[  172.948593] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[  172.956644]   READ: bus=0 devfn=0 where=136 size=2 val=0x281f
[  172.962428]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  172.968295]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  172.973903]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  172.979771]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  172.985379]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[  172.990985] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[  173.002799] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[  173.011202] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[  173.018033]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  173.023901]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  173.029510]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[  173.035123] pcieport 0000:00:00.0: AER: Device recovery failed
[  173.040990] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[  173.049042]   READ: bus=0 devfn=0 where=136 size=2 val=0x281f
[  173.054825]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  173.060693]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  173.066305]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  173.072173]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  173.077780]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[  173.083388] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[  173.095202] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[  173.103605] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[  173.110435]   READ: bus=0 devfn=0 where=2052 size=4 val=0x4000
[  173.116303]   READ: bus=0 devfn=0 where=2056 size=4 val=0x0
[  173.121911]   READ: bus=0 devfn=0 where=2072 size=4 val=0xe
[  173.127524] pcieport 0000:00:00.0: AER: Device recovery failed
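The repeated read of 0x10000024 at `where=2096`, followed by a write of the same value, is consistent with the AER Root Error Status register being read and then cleared by writing the set bits back. As with the other AER offsets, this assumes the capability starts at 0x800. Here is a sketch of how that value breaks down; the masks follow PCI_ERR_ROOT_* in pci_regs.h, and the struct and function are illustrative names.

```c
#include <stdbool.h>
#include <stdint.h>

#define ROOT_COR_RCV         0x00000001u /* correctable error received */
#define ROOT_MULTI_COR_RCV   0x00000002u /* more than one correctable */
#define ROOT_UNCOR_RCV       0x00000004u /* fatal or non-fatal received */
#define ROOT_MULTI_UNCOR_RCV 0x00000008u /* more than one uncorrectable */
#define ROOT_FIRST_FATAL     0x00000010u /* first uncorrectable was fatal */
#define ROOT_NONFATAL_RCV    0x00000020u /* non-fatal message received */
#define ROOT_FATAL_RCV       0x00000040u /* fatal message received */

struct root_status {
	bool correctable;
	bool uncorrectable;
	bool multiple_uncorrectable;
	bool nonfatal;
	bool fatal;
	unsigned int msg_num; /* Advanced Error Interrupt Message Number */
};

struct root_status root_status_decode(uint32_t v)
{
	struct root_status s;

	s.correctable = (v & ROOT_COR_RCV) != 0;
	s.uncorrectable = (v & ROOT_UNCOR_RCV) != 0;
	s.multiple_uncorrectable = (v & ROOT_MULTI_UNCOR_RCV) != 0;
	s.nonfatal = (v & ROOT_NONFATAL_RCV) != 0;
	s.fatal = (v & ROOT_FATAL_RCV) != 0;
	s.msg_num = v >> 27; /* bits 31:27 */
	return s;
}
```

For 0x10000024 only the "uncorrectable received" and "non-fatal received" bits are set, which agrees with the "Uncorrected (Non-Fatal)" wording in the log.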




NOTA BENE: these issues do not occur at all with a USB2 Flash drive.

[ 2093.564771] usb 1-2: new high-speed USB device number 2 using xhci_hcd
[ 2093.790646] usb 1-2: New USB device found, idVendor=058f, idProduct=6387
[ 2093.797397] usb 1-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[ 2093.804583] usb 1-2: Product: Mass Storage
[ 2093.808707] usb 1-2: Manufacturer: Generic
[ 2093.812829] usb 1-2: SerialNumber: 31A69E70
[ 2093.819244] usb-storage 1-2:1.0: USB Mass Storage device detected
[ 2093.825624] scsi host0: usb-storage 1-2:1.0
[ 2094.856918] scsi 0:0:0:0: Direct-Access     Generic  Flash Disk       8.07 PQ: 0 ANSI: 2
[ 2094.866196] sd 0:0:0:0: [sda] 4106240 512-byte logical blocks: (2.10 GB/1.96 GiB)
[ 2094.874232] sd 0:0:0:0: [sda] Write Protect is off
[ 2094.879350] sd 0:0:0:0: [sda] No Caching mode page found
[ 2094.884816] sd 0:0:0:0: [sda] Assuming drive cache: write through
[ 2094.909111]  sda: sda1
[ 2094.912935] sd 0:0:0:0: [sda] Attached SCSI removable disk

[ 2100.516396] usb 1-2: USB disconnect, device number 2


Regards.


* Re: Possible regression between 4.9 and 4.13
  2017-08-23 14:30               ` Mason
@ 2017-08-28  8:39                 ` Mathias Nyman
  -1 siblings, 0 replies; 60+ messages in thread
From: Mathias Nyman @ 2017-08-28  8:39 UTC (permalink / raw)
  To: Mason, Mathias Nyman, Felipe Balbi, linux-pci, linux-usb, Linux ARM
  Cc: Bjorn Helgaas, Alan Stern, Greg Kroah-Hartman

On 23.08.2017 17:30, Mason wrote:
> On 23/08/2017 14:41, Mason wrote:
>
>> I compiled a minimal kernel, with lots of irrelevant drivers and
>> frameworks left out, including power management. I still get the
>> "xHCI host controller not responding, assume dead" issue.
>
> The problem seems to have a timing-related aspect.
>
> I added a bunch of logs (to a slow serial console) and the HC was
> not killed. I was able to plug in the Flash drive a second time.
> (I am logging config space reads and writes.)


Could you take a log with the following debug patch applied, without
your extra delays? It should show a bit more about the state
of the controller when we read 0xffffffff.


diff --git a/drivers/usb/host/xhci-hub.c b/drivers/usb/host/xhci-hub.c
index 4bc6f42..a124c3d 100644
--- a/drivers/usb/host/xhci-hub.c
+++ b/drivers/usb/host/xhci-hub.c
@@ -23,6 +23,7 @@
  
  #include <linux/slab.h>
  #include <asm/unaligned.h>
+#include <linux/pci.h>
  
  #include "xhci.h"
  #include "xhci-trace.h"
@@ -1280,7 +1281,11 @@ int xhci_hub_control(struct usb_hcd *hcd, u16 typeReq, u16 wValue,
                 wIndex--;
                 temp = readl(port_array[wIndex]);
                 if (temp == ~(u32)0) {
-                       xhci_hc_died(xhci);
+                       struct pci_dev *pdev = to_pci_dev(hcd->self.controller);
+                       xhci_err(xhci, "ClearPortFeat port%d @%p=%x, hcd->state:0x%x hcd->flags:0x%x, pci_state 0x%x\n",
+                                wIndex, port_array[wIndex], temp, hcd->state, hcd->flags, pdev->current_state);
+
+                       WARN_ON(1);
                         retval = -ENODEV;
                         break;
                 }
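The `temp == ~(u32)0` test in the patch relies on a common convention: a register read from a PCI device that does not answer usually comes back as all ones, and a port status register would not legitimately hold that value. A minimal user-space sketch of the same check follows. The function names are hypothetical, and returning -ENODEV simply mirrors the `retval` used in the diff.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* An MMIO read of all ones usually means the device did not respond. */
bool reg_read_failed(uint32_t val)
{
	return val == ~(uint32_t)0;
}

/* Map a port status read to 0 (usable) or -ENODEV (device unreachable). */
int check_portsc(uint32_t portsc)
{
	return reg_read_failed(portsc) ? -ENODEV : 0;
}
```

The debug patch keeps this check but replaces the call that declares the controller dead with a log message and a stack trace, so the state at the moment of the failed read can be inspected.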

Thanks
-Mathias


* Re: Possible regression between 4.9 and 4.13
  2017-08-28  8:39                 ` Mathias Nyman
@ 2017-08-28 14:40                   ` Mason
  -1 siblings, 0 replies; 60+ messages in thread
From: Mason @ 2017-08-28 14:40 UTC (permalink / raw)
  To: Mathias Nyman, Felipe Balbi, linux-pci, linux-usb, Linux ARM
  Cc: Bjorn Helgaas, Alan Stern, Greg Kroah-Hartman

On 28/08/2017 10:39, Mathias Nyman wrote:

> Could you take a log with the following debug patch applied, without
> your extra delays? It should show a bit more about the state
> of the controller when we read 0xffffffff.

I applied the following patch on top of v4.12-rc1:

diff --git a/drivers/usb/host/xhci-hub.c b/drivers/usb/host/xhci-hub.c
index 5e3e9d4c6956..c7ea7d4c801f 100644
--- a/drivers/usb/host/xhci-hub.c
+++ b/drivers/usb/host/xhci-hub.c
@@ -23,6 +23,7 @@
 
 #include <linux/slab.h>
 #include <asm/unaligned.h>
+#include <linux/pci.h>
 
 #include "xhci.h"
 #include "xhci-trace.h"
@@ -1268,7 +1269,10 @@ int xhci_hub_control(struct usb_hcd *hcd, u16 typeReq, u16 wValue,
 		wIndex--;
 		temp = readl(port_array[wIndex]);
 		if (temp == ~(u32)0) {
-			xhci_hc_died(xhci);
+			struct pci_dev *pdev = to_pci_dev(hcd->self.controller);
+			xhci_err(xhci, "ClearPortFeat port%d @%p=%x, hcd->state:0x%x hcd->flags:0x%x, pci_state 0x%x\n",
+					wIndex, port_array[wIndex], temp, hcd->state, hcd->flags, pdev->current_state);
+			WARN_ON(1);
 			retval = -ENODEV;
 			break;
 		}


And here are the logs I get when I plug in and then unplug my USB3 device:

[   14.970148] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
[   15.003487] usb 2-2: New USB device found, idVendor=0951, idProduct=1666
[   15.010237] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[   15.017483] usb 2-2: Product: DataTraveler 3.0
[   15.021990] usb 2-2: Manufacturer: Kingston
[   15.026234] usb 2-2: SerialNumber: 002618887865F0C0F8646BFA
[   15.034830] usb-storage 2-2:1.0: USB Mass Storage device detected
[   15.041269] scsi host0: usb-storage 2-2:1.0
[   16.056140] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
[   16.064979] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
[   16.072978] sd 0:0:0:0: [sda] Write Protect is off
[   16.078076] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   16.089417]  sda: sda1
[   16.093050] sd 0:0:0:0: [sda] Attached SCSI removable disk


[   22.152078] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   22.160157] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   22.172051] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   22.180493] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   22.187368] pcieport 0000:00:00.0: AER: Device recovery failed
[   22.885269] xhci_hcd 0000:01:00.0: ClearPortFeat port1 @e0852430=ffffffff, hcd->state:0x1 hcd->flags:0x1a5, pci_state 0x0
[   22.896284] ------------[ cut here ]------------
[   22.900938] WARNING: CPU: 0 PID: 127 at drivers/usb/host/xhci-hub.c:1275 xhci_hub_control+0x10f4/0x1778
[   22.910377] Modules linked in:
[   22.913447] CPU: 0 PID: 127 Comm: kworker/0:1 Tainted: G         C      4.12.0-rc1 #4
[   22.921314] Hardware name: Sigma Tango DT
[   22.925342] Workqueue: usb_hub_wq hub_event
[   22.929564] [<c010e8b4>] (unwind_backtrace) from [<c010ac00>] (show_stack+0x10/0x14)
[   22.937353] [<c010ac00>] (show_stack) from [<c0257a30>] (dump_stack+0x84/0x98)
[   22.944617] [<c0257a30>] (dump_stack) from [<c01183d0>] (__warn+0xe8/0x100)
[   22.951616] [<c01183d0>] (__warn) from [<c0118498>] (warn_slowpath_null+0x20/0x28)
[   22.959227] [<c0118498>] (warn_slowpath_null) from [<c031ad90>] (xhci_hub_control+0x10f4/0x1778)
[   22.968062] [<c031ad90>] (xhci_hub_control) from [<c02fbb4c>] (usb_hcd_submit_urb+0x264/0x810)
[   22.976719] [<c02fbb4c>] (usb_hcd_submit_urb) from [<c02fccec>] (usb_submit_urb+0x2b0/0x4b4)
[   22.985201] [<c02fccec>] (usb_submit_urb) from [<c02fd3c4>] (usb_start_wait_urb+0x4c/0xbc)
[   22.993509] [<c02fd3c4>] (usb_start_wait_urb) from [<c02fd4d4>] (usb_control_msg+0xa0/0xcc)
[   23.001904] [<c02fd4d4>] (usb_control_msg) from [<c02f5718>] (usb_clear_port_feature+0x44/0x4c)
[   23.010648] [<c02f5718>] (usb_clear_port_feature) from [<c02f60fc>] (hub_port_reset+0x228/0x51c)
[   23.019479] [<c02f60fc>] (hub_port_reset) from [<c02f82f0>] (hub_event+0x1f4/0xe64)
[   23.027177] [<c02f82f0>] (hub_event) from [<c012d398>] (process_one_work+0x1d4/0x3ec)
[   23.035049] [<c012d398>] (process_one_work) from [<c012dfb4>] (worker_thread+0x38/0x554)
[   23.043185] [<c012dfb4>] (worker_thread) from [<c0132c84>] (kthread+0x108/0x138)
[   23.050620] [<c0132c84>] (kthread) from [<c01076f8>] (ret_from_fork+0x14/0x3c)
[   23.057877] ---[ end trace 5e4494cf1f6e3761 ]---
[   23.062691] xhci_hcd 0000:01:00.0: ClearPortFeat port1 @e0852430=ffffffff, hcd->state:0x1 hcd->flags:0x1a5, pci_state 0x0
[   23.073707] ------------[ cut here ]------------
[   23.078349] WARNING: CPU: 0 PID: 127 at drivers/usb/host/xhci-hub.c:1275 xhci_hub_control+0x10f4/0x1778
[   23.087787] Modules linked in:
[   23.090854] CPU: 0 PID: 127 Comm: kworker/0:1 Tainted: G        WC      4.12.0-rc1 #4
[   23.098720] Hardware name: Sigma Tango DT
[   23.102745] Workqueue: usb_hub_wq hub_event
[   23.106953] [<c010e8b4>] (unwind_backtrace) from [<c010ac00>] (show_stack+0x10/0x14)
[   23.114737] [<c010ac00>] (show_stack) from [<c0257a30>] (dump_stack+0x84/0x98)
[   23.121998] [<c0257a30>] (dump_stack) from [<c01183d0>] (__warn+0xe8/0x100)
[   23.128996] [<c01183d0>] (__warn) from [<c0118498>] (warn_slowpath_null+0x20/0x28)
[   23.136606] [<c0118498>] (warn_slowpath_null) from [<c031ad90>] (xhci_hub_control+0x10f4/0x1778)
[   23.145439] [<c031ad90>] (xhci_hub_control) from [<c02fbb4c>] (usb_hcd_submit_urb+0x264/0x810)
[   23.154095] [<c02fbb4c>] (usb_hcd_submit_urb) from [<c02fccec>] (usb_submit_urb+0x2b0/0x4b4)
[   23.162577] [<c02fccec>] (usb_submit_urb) from [<c02fd3c4>] (usb_start_wait_urb+0x4c/0xbc)
[   23.170884] [<c02fd3c4>] (usb_start_wait_urb) from [<c02fd4d4>] (usb_control_msg+0xa0/0xcc)
[   23.179278] [<c02fd4d4>] (usb_control_msg) from [<c02f5718>] (usb_clear_port_feature+0x44/0x4c)
[   23.188021] [<c02f5718>] (usb_clear_port_feature) from [<c02f611c>] (hub_port_reset+0x248/0x51c)
[   23.196851] [<c02f611c>] (hub_port_reset) from [<c02f82f0>] (hub_event+0x1f4/0xe64)
[   23.204547] [<c02f82f0>] (hub_event) from [<c012d398>] (process_one_work+0x1d4/0x3ec)
[   23.212418] [<c012d398>] (process_one_work) from [<c012dfb4>] (worker_thread+0x38/0x554)
[   23.220551] [<c012dfb4>] (worker_thread) from [<c0132c84>] (kthread+0x108/0x138)
[   23.227986] [<c0132c84>] (kthread) from [<c01076f8>] (ret_from_fork+0x14/0x3c)
[   23.235242] ---[ end trace 5e4494cf1f6e3762 ]---
[   23.239953] xhci_hcd 0000:01:00.0: Cannot set link state.
[   23.245403] usb usb2-port2: cannot disable (err = -32)
[   23.250575] usb 2-2: USB disconnect, device number 2
[   23.255724] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   23.264048] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   23.275985] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   23.284417] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   23.291256] pcieport 0000:00:00.0: AER: Device recovery failed
[   23.297144] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   23.305218] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   23.317047] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   23.325467] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   23.332309] pcieport 0000:00:00.0: AER: Device recovery failed
[   23.338188] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   23.346273] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   23.358093] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   23.366518] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   23.373357] pcieport 0000:00:00.0: AER: Device recovery failed
[   23.379229] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   23.387287] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   23.399101] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   23.407504] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   23.414344] pcieport 0000:00:00.0: AER: Device recovery failed
[   23.434143] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
[   23.442263] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
[   23.454100] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
[   23.462542] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
[   23.469427] pcieport 0000:00:00.0: AER: Device recovery failed


Regards.

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: Possible regression between 4.9 and 4.13
  2017-08-28 14:40                   ` Mason
@ 2017-08-29 13:28                     ` Mathias Nyman
  -1 siblings, 0 replies; 60+ messages in thread
From: Mathias Nyman @ 2017-08-29 13:28 UTC (permalink / raw)
  To: Mason, Felipe Balbi, linux-pci, linux-usb, Linux ARM
  Cc: Bjorn Helgaas, Alan Stern, Greg Kroah-Hartman

On 28.08.2017 17:40, Mason wrote:
> On 28/08/2017 10:39, Mathias Nyman wrote:
>
>> Could you take a log with the following added debug, without
>> your extra delays, It should show a bit more about the state
>> of the controller when we read 0xffffffff
>
> I applied the following patch on top of v4.12-rc1
>
> diff --git a/drivers/usb/host/xhci-hub.c b/drivers/usb/host/xhci-hub.c
> index 5e3e9d4c6956..c7ea7d4c801f 100644
> --- a/drivers/usb/host/xhci-hub.c
> +++ b/drivers/usb/host/xhci-hub.c
> @@ -23,6 +23,7 @@
>
>   #include <linux/slab.h>
>   #include <asm/unaligned.h>
> +#include <linux/pci.h>
>
>   #include "xhci.h"
>   #include "xhci-trace.h"
> @@ -1268,7 +1269,10 @@ int xhci_hub_control(struct usb_hcd *hcd, u16 typeReq, u16 wValue,
>   		wIndex--;
>   		temp = readl(port_array[wIndex]);
>   		if (temp == ~(u32)0) {
> -			xhci_hc_died(xhci);
> +			struct pci_dev *pdev = to_pci_dev(hcd->self.controller);
> +			xhci_err(xhci, "ClearPortFeat port%d @%p=%x, hcd->state:0x%x hcd->flags:0x%x, pci_state 0x%x\n",
> +					wIndex, port_array[wIndex], temp, hcd->state, hcd->flags, pdev->current_state);
> +			WARN_ON(1);
>   			retval = -ENODEV;
>   			break;
>   		}
>
>
> And here are logs I get when I plug/unplug my USB3 device.
>
> [   14.970148] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
> [   15.003487] usb 2-2: New USB device found, idVendor=0951, idProduct=1666
> [   15.010237] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
> [   15.017483] usb 2-2: Product: DataTraveler 3.0
> [   15.021990] usb 2-2: Manufacturer: Kingston
> [   15.026234] usb 2-2: SerialNumber: 002618887865F0C0F8646BFA
> [   15.034830] usb-storage 2-2:1.0: USB Mass Storage device detected
> [   15.041269] scsi host0: usb-storage 2-2:1.0
> [   16.056140] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
> [   16.064979] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
> [   16.072978] sd 0:0:0:0: [sda] Write Protect is off
> [   16.078076] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
> [   16.089417]  sda: sda1
> [   16.093050] sd 0:0:0:0: [sda] Attached SCSI removable disk
>
>
> [   22.152078] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000
> [   22.160157] pcieport 0000:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)
> [   22.172051] pcieport 0000:00:00.0:   device [1105:0024] error status/mask=00004000/00000000
> [   22.180493] pcieport 0000:00:00.0:    [14] Completion Timeout     (First)
> [   22.187368] pcieport 0000:00:00.0: AER: Device recovery failed
> [   22.885269] xhci_hcd 0000:01:00.0: ClearPortFeat port1 @e0852430=ffffffff, hcd->state:0x1 hcd->flags:0x1a5, pci_state 0x0

State 0x1 is HC_STATE_RUNNING.

Flags 0x1a5 has bits 0, 2, 5, 7 and 8 set:
#define HCD_FLAG_HW_ACCESSIBLE          0       /* at full power */
#define HCD_FLAG_POLL_RH                2       /* poll for rh status? */
#define HCD_FLAG_RH_RUNNING             5       /* root hub is running? */
#define HCD_FLAG_INTF_AUTHORIZED        7       /* authorize interfaces? */
#define HCD_FLAG_DEV_AUTHORIZED         8       /* authorize devices? */

And the PCI power state seems to be D0 (according to the driver, pdev->current_state is 0).
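
The flag decoding above can be reproduced with a small standalone C sketch. It is a userspace model, not kernel code: the table holds only the five HCD_FLAG_* bit positions quoted in this mail (the kernel header defines more), and the helper names print_hcd_flags() and unnamed_flag_bits() are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Bit positions copied from the HCD_FLAG_* defines quoted in this mail. */
struct flag_name {
	unsigned int bit;
	const char *name;
};

static const struct flag_name hcd_flag_names[] = {
	{ 0, "HCD_FLAG_HW_ACCESSIBLE" },
	{ 2, "HCD_FLAG_POLL_RH" },
	{ 5, "HCD_FLAG_RH_RUNNING" },
	{ 7, "HCD_FLAG_INTF_AUTHORIZED" },
	{ 8, "HCD_FLAG_DEV_AUTHORIZED" },
};

#define N_FLAG_NAMES (sizeof(hcd_flag_names) / sizeof(hcd_flag_names[0]))

/* Print every named bit that is set in 'flags'; return how many were printed. */
static int print_hcd_flags(unsigned long flags)
{
	int printed = 0;

	for (size_t i = 0; i < N_FLAG_NAMES; i++) {
		/* Sanity-check the table so the shift below stays defined. */
		assert(hcd_flag_names[i].bit < 32);
		if (flags & (1UL << hcd_flag_names[i].bit)) {
			printf("bit %u: %s\n", hcd_flag_names[i].bit,
			       hcd_flag_names[i].name);
			printed++;
		}
	}
	return printed;
}

/* Return the set bits of 'flags' that the table above has no name for. */
static unsigned long unnamed_flag_bits(unsigned long flags)
{
	for (size_t i = 0; i < N_FLAG_NAMES; i++)
		flags &= ~(1UL << hcd_flag_names[i].bit);
	return flags;
}
```

Calling print_hcd_flags(0x1a5) from a main() lists the five flags above, and unnamed_flag_bits(0x1a5) returning 0 confirms that no other bit is set in the logged value.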

I can't see anything wrong from an xhci/usb point of view.
I'd focus more on the PCI errors in the logs as the cause of reading 0xffffffff from xhci MMIO.

Then again it might be a bit too drastic to kill xhci just because we read 0xffffffff once
from an xhci MMIO register. Maybe we should return an error a couple of times before actually
tearing down xhci.

This tight check was originally added to detect hosts removed by PCI hotplug as soon as possible.
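
One way to picture the "return an error a couple times first" idea is a counter of consecutive all-ones reads. The sketch below is a standalone userspace model under stated assumptions, not a patch against xhci: struct hc_health, hc_note_read() and the threshold ALL_ONES_LIMIT are hypothetical names and values that do not appear in the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ALL_ONES_LIMIT 3	/* hypothetical threshold, not taken from the thread */

struct hc_health {
	unsigned int all_ones_streak;	/* consecutive 0xffffffff reads seen */
	bool dead;			/* set once we give up on the controller */
};

/*
 * Feed one register value into the tracker.
 * Returns 0 if the value looks usable and -1 if the caller should fail
 * the current request. h->dead is only set after ALL_ONES_LIMIT
 * consecutive all-ones reads, so a single bad read is not fatal.
 */
static int hc_note_read(struct hc_health *h, uint32_t val)
{
	assert(h != NULL);

	if (h->dead)
		return -1;

	if (val != UINT32_MAX) {
		h->all_ones_streak = 0;	/* a good read resets the streak */
		return 0;
	}

	if (++h->all_ones_streak >= ALL_ONES_LIMIT)
		h->dead = true;
	return -1;
}
```

The trade-off is detection latency: a host that really was hot-removed is only declared dead after ALL_ONES_LIMIT failed requests instead of one.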

-Mathias  

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Possible regression between 4.9 and 4.13
  2017-08-29 13:28                     ` Mathias Nyman
@ 2017-08-29 13:38                       ` Lukas Wunner
  -1 siblings, 0 replies; 60+ messages in thread
From: Lukas Wunner @ 2017-08-29 13:38 UTC (permalink / raw)
  To: Mathias Nyman
  Cc: Mason, Felipe Balbi, linux-pci, linux-usb, Linux ARM,
	Bjorn Helgaas, Alan Stern, Greg Kroah-Hartman

On Tue, Aug 29, 2017 at 04:28:53PM +0300, Mathias Nyman wrote:
> Then again it might be a bit too drastic to kill xhci just because
> we read 0xffffffff once from a mmio xhci register. Maybe we should
> return an error a couple times before actually tearing down xhci.
> 
> This tight check was originally done to detect pci hotplug removed
> hosts as soon as possible.

Just make pci_dev_is_disconnected() public to detect PCI hot removal.
We *know* when the device was hot removed, so I think there's no need
to guess that based on reading "all ones" from mmio (which may happen
for entirely legitimate reasons unrelated to hot removal).
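
To make the distinction concrete, here is a rough userspace model of checking an explicit "disconnected" flag before trusting or distrusting a register value. It is only a sketch: struct model_dev, classify_read() and the verdict names are hypothetical, and the real pci_dev_is_disconnected() operates on the kernel's struct pci_dev, which is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a device; not struct pci_dev. */
struct model_dev {
	bool disconnected;	/* set by the hot-removal notification path */
	uint32_t reg;		/* value the next MMIO read would return */
};

enum read_verdict {
	READ_OK,		/* value can be used */
	READ_DEV_GONE,		/* removal was signalled explicitly */
	READ_SUSPECT,		/* all ones, but no removal was signalled */
};

/*
 * Classify one register read. The explicit flag decides "gone";
 * an all-ones value on its own is only reported as suspect.
 */
static enum read_verdict classify_read(const struct model_dev *dev,
				       uint32_t *val)
{
	assert(dev != NULL && val != NULL);

	if (dev->disconnected)
		return READ_DEV_GONE;

	*val = dev->reg;
	if (*val == UINT32_MAX)
		return READ_SUSPECT;
	return READ_OK;
}
```

In this model the situation from the logs (all ones while the device is still plugged in) lands in READ_SUSPECT rather than tearing the host down.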

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Possible regression between 4.9 and 4.13
  2017-08-29 13:38                       ` Lukas Wunner
@ 2017-08-29 14:47                         ` Greg Kroah-Hartman
  -1 siblings, 0 replies; 60+ messages in thread
From: Greg Kroah-Hartman @ 2017-08-29 14:47 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Mathias Nyman, Mason, Felipe Balbi, linux-pci, linux-usb,
	Linux ARM, Bjorn Helgaas, Alan Stern

On Tue, Aug 29, 2017 at 03:38:52PM +0200, Lukas Wunner wrote:
> On Tue, Aug 29, 2017 at 04:28:53PM +0300, Mathias Nyman wrote:
> > Then again it might be a bit too drastic to kill xhci just because
> > we read 0xffffffff once from a mmio xhci register. Maybe we should
> > return an error a couple times before actually tearing down xhci.
> > 
> > This tight check was originally done to detect pci hotplug removed
> > hosts as soon as possible.
> 
> Just make pci_dev_is_disconnected() public to detect PCI hot removal.
> We *know* when the device was hot removed, so I think there's no need
> to guess that based on reading "all ones" from mmio (which may happen
> for entirely legitimate reasons unrelated to hot removal).

No, you don't always "know" when a device is removed, don't rely on it,
not all platforms support that.

One more reason why I hate that function and I'm glad it's not exported
for others to think it somehow actually works for their system...

Reading all ff shows the device is removed, that's all the PCI spec
guarantees.  What other legitimate reason could that happen for?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Possible regression between 4.9 and 4.13
  2017-08-29 14:47                         ` Greg Kroah-Hartman
@ 2017-08-29 15:34                           ` Lukas Wunner
  -1 siblings, 0 replies; 60+ messages in thread
From: Lukas Wunner @ 2017-08-29 15:34 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Mathias Nyman, Mason, Felipe Balbi, linux-pci, linux-usb,
	Linux ARM, Bjorn Helgaas, Alan Stern

On Tue, Aug 29, 2017 at 04:47:25PM +0200, Greg Kroah-Hartman wrote:
> On Tue, Aug 29, 2017 at 03:38:52PM +0200, Lukas Wunner wrote:
> > On Tue, Aug 29, 2017 at 04:28:53PM +0300, Mathias Nyman wrote:
> > > Then again it might be a bit too drastic to kill xhci just because
> > > we read 0xffffffff once from a mmio xhci register. Maybe we should
> > > return an error a couple times before actually tearing down xhci.
> > > 
> > > This tight check was originally done to detect pci hotplug removed
> > > hosts as soon as possible.
> > 
> > Just make pci_dev_is_disconnected() public to detect PCI hot removal.
> > We *know* when the device was hot removed, so I think there's no need
> > to guess that based on reading "all ones" from mmio (which may happen
> > for entirely legitimate reasons unrelated to hot removal).
> 
> No, you don't always "know" when a device is removed, don't rely on it,
> not all platforms support that.

Please explain, which platforms don't support that?  They wouldn't be
compliant with the spec it seems.

PCIe r3.1, section 6.7.3:

	"A Downstream Port with hot-plug capabilities supports the
	 following hot-plug events:

	 Presence Detect Changed

	 A Downstream Port with hot-plug capabilities monitors the slot
	 it controls for the slot events listed above. [...]
	 If enabled through the associated enable field, slot events
	 must generate a software notification."

And pciehp sets the flag on all downstream devices that they're removed
once the software notification has been received and processed.
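
As a rough illustration of "sets the flag on all downstream devices", the following userspace sketch walks a toy device tree below a port and marks everything as disconnected. The struct hp_dev layout and the helpers mark_bus() and port_presence_lost() are assumptions made for this example; they are not the pciehp implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy device node: 'child' is the first device on the bus below this one. */
struct hp_dev {
	bool disconnected;
	struct hp_dev *child;	/* first device on the secondary bus, if any */
	struct hp_dev *sibling;	/* next device on the same bus */
};

/* Mark every device on this bus, and on all buses below it, as gone. */
static int mark_bus(struct hp_dev *first)
{
	int marked = 0;

	for (struct hp_dev *d = first; d != NULL; d = d->sibling) {
		d->disconnected = true;
		marked++;
		marked += mark_bus(d->child);
	}
	return marked;
}

/*
 * Called when the port reports that presence was lost.
 * The port itself stays connected; only devices below it are marked.
 * Returns the number of devices marked.
 */
static int port_presence_lost(struct hp_dev *port)
{
	assert(port != NULL);
	return mark_bus(port->child);
}
```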


> Reading all ff shows the device is removed, that's all the PCI spec
> guarantees.  What other legitimate reason could that happen for?

Is 0xffffffff not a valid value to be stored in and read from mmio space?

Best regards,

Lukas

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Possible regression between 4.9 and 4.13
  2017-08-29 15:34                           ` Lukas Wunner
@ 2017-08-29 15:51                             ` Greg Kroah-Hartman
  -1 siblings, 0 replies; 60+ messages in thread
From: Greg Kroah-Hartman @ 2017-08-29 15:51 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Mathias Nyman, Mason, Felipe Balbi, linux-pci, linux-usb,
	Linux ARM, Bjorn Helgaas, Alan Stern

On Tue, Aug 29, 2017 at 05:34:56PM +0200, Lukas Wunner wrote:
> On Tue, Aug 29, 2017 at 04:47:25PM +0200, Greg Kroah-Hartman wrote:
> > On Tue, Aug 29, 2017 at 03:38:52PM +0200, Lukas Wunner wrote:
> > > On Tue, Aug 29, 2017 at 04:28:53PM +0300, Mathias Nyman wrote:
> > > > Then again it might be a bit too drastic to kill xhci just because
> > > > we read 0xffffffff once from a mmio xhci register. Maybe we should
> > > > return an error a couple times before actually tearing down xhci.
> > > > 
> > > > This tight check was originally done to detect pci hotplug removed
> > > > hosts as soon as possible.
> > > 
> > > Just make pci_dev_is_disconnected() public to detect PCI hot removal.
> > > We *know* when the device was hot removed, so I think there's no need
> > > to guess that based on reading "all ones" from mmio (which may happen
> > > for entirely legitimate reasons unrelated to hot removal).
> > 
> > No, you don't always "know" when a device is removed, don't rely on it,
> > not all platforms support that.
> 
> Please explain, which platforms don't support that?  They wouldn't be
> compliant with the spec it seems.
> 
> PCIe r3.1, section 6.7.3:
> 
> 	"A Downstream Port with hot-plug capabilities supports the
> 	 following hot-plug events:
> 
> 	 Presence Detect Changed
> 
> 	 A Downstream Port with hot-plug capabilities monitors the slot
> 	 it controls for the slot events listed above. [...]
> 	 If enabled through the associated enable field, slot events
> 	 must generate a software notification."
> 
> And pciehp sets the flag on all downstream devices that they're removed
> once the software notification has been received and processed.

What about all of the non-pciehp platforms?  :)

Also, there is always a race between when that notification has been
sent and processed on the PCIe channel and the reading of all 1s on the
PCI bus by the driver.

For fun, try disconnecting a USB device from one of the more modern
laptops with a USB 3.1 connection on it.  The BIOS treats those as a PCI
hotpluggable xhci controller on the PCI bus.  When the device is
disconnected, the BIOS rips out the PCI device as well, but all that
time, the xhci driver thinks the device is still present, as it
takes a while for the BIOS to do all of the needed housekeeping.  It
takes a really long time for everything to shut down, and to help prevent
the driver from going crazy, it has to detect 0xffffffff reads as
"disconnection happened".

All PCI drivers have had to do this for decades now, it's nothing new
here, PCIe just gave us a chance to be notified that the device really
is gone now, PCI hotplug has always been out-of-band like this.

> > Reading all ff shows the device is removed, that's all the PCI spec
> > guarantees.  What other legitimate reason could that happen for?
> 
> Is 0xffffffff not a valid value to be stored in and read from mmio space?

For a specific register, doubtful, which is why the code errors out,
right?  If it is a valid value, then it shouldn't be exiting, and move
on to the next read.
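
The "for a specific register" point can be sketched as a per-register plausibility check: if a register has bits that hardware never reports as 1, an all-ones read cannot be a real value. This is a standalone userspace sketch; struct reg_desc, both helpers, and any mask values used with them are hypothetical and not taken from the xHCI register layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Describes one register: which bits the hardware may ever report as 1. */
struct reg_desc {
	uint32_t valid_mask;
};

/* True if 0xffffffff could be a genuine value of this register. */
static bool all_ones_is_plausible(const struct reg_desc *r)
{
	assert(r != NULL);
	return r->valid_mask == UINT32_MAX;
}

/* True if 'val' sets no bit outside the register's valid mask. */
static bool read_is_plausible(const struct reg_desc *r, uint32_t val)
{
	assert(r != NULL);
	return (val & ~r->valid_mask) == 0;
}
```

For a register described with, say, a made-up valid_mask of 0xcffffffb, read_is_plausible() rejects 0xffffffff, whereas a free-running 32-bit counter (valid_mask of all ones) could legitimately read back that value.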

I don't understand what we are arguing about here anymore...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Possible regression between 4.9 and 4.13
  2017-08-29 13:28                     ` Mathias Nyman
@ 2017-08-29 23:53                       ` Lukas Wunner
  -1 siblings, 0 replies; 60+ messages in thread
From: Lukas Wunner @ 2017-08-29 23:53 UTC (permalink / raw)
  To: Mathias Nyman
  Cc: Mason, Felipe Balbi, linux-pci, linux-usb, Linux ARM,
	Bjorn Helgaas, Alan Stern, Greg Kroah-Hartman

On Tue, Aug 29, 2017 at 04:28:53PM +0300, Mathias Nyman wrote:
> This tight check was originally done to detect pci hotplug removed
> hosts as soon as possible.

In Mason's case, the parent of the XHCI controller isn't a hotplug port,
see this lspci output:

https://www.spinics.net/lists/linux-usb/msg160010.html

Please check is_hotplug_bridge in the parent's struct pci_dev before
assuming that the XHCI controller was unplugged.
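
To make the suggested check concrete, here is a minimal userspace sketch
in C.  It is not kernel code: struct model_pci_dev and both helpers are
hypothetical stand-ins, and only the is_hotplug_bridge field name is
taken from the kernel's struct pci_dev.  The assumption is that an
all-ones read is treated as removal only when the parent port could
actually have lost the device.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's struct pci_dev. */
struct model_pci_dev {
	struct model_pci_dev *parent;   /* upstream bridge, NULL at the root */
	bool is_hotplug_bridge;         /* same name as the kernel field */
};

/* True if the device sits directly below a hotplug-capable port. */
bool model_parent_is_hotplug(const struct model_pci_dev *dev)
{
	return dev->parent != NULL && dev->parent->is_hotplug_bridge;
}

/*
 * Treat an all-ones register value as "device unplugged" only when the
 * parent port supports hotplug.  Otherwise the caller should handle the
 * read as an ordinary error rather than a removal.
 */
bool model_looks_unplugged(const struct model_pci_dev *dev, uint32_t val)
{
	return val == UINT32_MAX && model_parent_is_hotplug(dev);
}
```

With a parent whose is_hotplug_bridge is false, as in the lspci output
linked above, model_looks_unplugged() returns false even for 0xffffffff.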

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Possible regression between 4.9 and 4.13
  2017-08-29 23:53                       ` Lukas Wunner
@ 2017-08-30  6:02                         ` Greg Kroah-Hartman
  -1 siblings, 0 replies; 60+ messages in thread
From: Greg Kroah-Hartman @ 2017-08-30  6:02 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Mathias Nyman, Mason, Felipe Balbi, linux-pci, linux-usb,
	Linux ARM, Bjorn Helgaas, Alan Stern

On Wed, Aug 30, 2017 at 01:53:10AM +0200, Lukas Wunner wrote:
> On Tue, Aug 29, 2017 at 04:28:53PM +0300, Mathias Nyman wrote:
> > This tight check was originally done to detect pci hotplug removed
> > hosts as soon as possible.
> 
> In Mason's case, the parent of the XHCI controller isn't a hotplug port,
> see this lspci output:
> 
> https://www.spinics.net/lists/linux-usb/msg160010.html
> 
> Please check is_hotplug_bridge in the parent's struct pci_dev before
> assuming that the XHCI controller was unplugged.

How can you guarantee that this is set on some systems?  Will it be set
on cardbus devices?  What about on a "normal" system where I can just go
and yank out a PCI card at will?

I don't think this is a valid thing to check, and again, why are we
arguing this point?  It's been this way since the 1990's, this isn't a
new thing...

To get back to the original issue here, the hardware seems to have died,
the driver stops talking to it, and all is good.  The "regression" here
is that we now properly can determine that the hardware is crap.

So, how do you think we should proceed, delay a bit longer before saying
the device is gone?  How long is "long enough"?  How many bus errors are
we allowed to tolerate (hint, the PCI spec says none...)
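
For reference, the "tolerate a few errors" option being questioned here
could look roughly like the userspace C sketch below.  The struct, the
function name and the threshold are all hypothetical, and the threshold
value is arbitrary, which is exactly the "how long is long enough"
problem.  It is an illustration of the idea, not a proposed patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical filter state; not taken from the xhci driver. */
struct model_gone_filter {
	unsigned int threshold;    /* consecutive all-ones reads required */
	unsigned int consecutive;  /* all-ones reads seen in a row so far */
};

/*
 * Feed one register value into the filter.  Returns true once
 * 'threshold' consecutive reads have returned all ones; any other
 * value resets the count.
 */
bool model_filter_read(struct model_gone_filter *f, uint32_t val)
{
	if (val != UINT32_MAX) {
		f->consecutive = 0;
		return false;
	}
	if (f->consecutive < f->threshold)
		f->consecutive++;
	return f->consecutive >= f->threshold;
}
```

A threshold of 1 reproduces the current "one bad read and the device is
gone" behaviour.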

Maybe someone wants to get to the root problem here, why is the hardware
suddenly reporting all 1s?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Possible regression between 4.9 and 4.13
  2017-08-29 15:51                             ` Greg Kroah-Hartman
@ 2017-08-30  6:36                               ` Lukas Wunner
  -1 siblings, 0 replies; 60+ messages in thread
From: Lukas Wunner @ 2017-08-30  6:36 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Mathias Nyman, Mason, Felipe Balbi, linux-pci, linux-usb,
	Linux ARM, Bjorn Helgaas, Alan Stern

On Tue, Aug 29, 2017 at 05:51:38PM +0200, Greg Kroah-Hartman wrote:
> On Tue, Aug 29, 2017 at 05:34:56PM +0200, Lukas Wunner wrote:
> > On Tue, Aug 29, 2017 at 04:47:25PM +0200, Greg Kroah-Hartman wrote:
> > > On Tue, Aug 29, 2017 at 03:38:52PM +0200, Lukas Wunner wrote:
> > Is 0xffffffff not a valid value to be stored in and read from mmio space?
> 
> For a specific register, doubtful

Well, "doubtful" means you don't know for sure.

It's fine to check for "all ones" as a heuristic if that's not a valid
value for the register read, however a hotplug notification is a
*definitive* indication the hardware is gone.

You seem to prefer forgoing a *definitive* indication for a mere
heuristic, that doesn't make sense from my point of view.
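
As a minimal sketch of that ordering (userspace C, with hypothetical
names; the 'disconnected' flag stands in for whatever the hotplug
notification path would set, for example the state tested by
pci_dev_is_disconnected()):

```c
#include <stdbool.h>
#include <stdint.h>

enum model_gone_reason {
	MODEL_DEV_PRESENT,         /* no sign of removal */
	MODEL_DEV_GONE_NOTIFIED,   /* definitive: hotplug notification seen */
	MODEL_DEV_GONE_SUSPECTED   /* heuristic: register read all ones */
};

/* Hypothetical device state; not the kernel's struct pci_dev. */
struct model_dev {
	bool disconnected;   /* set by the hotplug notification path */
};

/*
 * Consult the definitive indication first and fall back to the
 * all-ones heuristic only when no notification has arrived.
 */
enum model_gone_reason model_classify(const struct model_dev *dev,
				      uint32_t val)
{
	if (dev->disconnected)
		return MODEL_DEV_GONE_NOTIFIED;
	if (val == UINT32_MAX)
		return MODEL_DEV_GONE_SUSPECTED;
	return MODEL_DEV_PRESENT;
}
```

Keeping the two outcomes distinct lets a caller tear down immediately on
a notification while handling a merely suspected removal more
cautiously.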


> It's a really long time for everything to shut down and to help
> prevent the driver from going crazy, [...]
> Also, there is always a race between when that notification has been
> sent and processed on the PCIe channel and the reading of all 1s on the
> PCI bus by the driver.

Yes I know that.  In practice the interrupt signaling hot removal
comes fast enough to prevent drivers from "going crazy", as I've
mentioned in this patch:

https://patchwork.kernel.org/patch/9405255/


> For fun, try disconnecting a USB device from one of the more modern
> laptops with a USB 3.1 connection on it.  The bios treats those as a pci
> hotpluggable xhci controller on the PCI bus.  When the device is
> disconnected, the BIOS rips out the PCI device as well, but all that
> time, the xhci driver is thinking the device is still present as it
> takes a while for the BIOS to do all of the needed housekeeping.

Yes, that is the case with Thunderbolt 3 controllers on non-Macs:
The XHCI controller appears below downstream bridge 2 of the Thunderbolt
controller's PCIe switch.  Once the last device is removed, the PCIe
switch and all devices below it disappear and the controller is powered
down.  The controller is thus only visible if powered up.  On Mac this
is all native instead:  Native pciehp, native tunnel setup, native PM.


> > > > Just make pci_dev_is_disconnected() public to detect PCI hot removal.
> > > > We *know* when the device was hot removed, so I think there's no need
> > > > to guess that based on reading "all ones" from mmio (which may happen
> > > > for entirely legitimate reasons unrelated to hot removal).
> > > 
> > > No, you don't always "know" when a device is removed, don't rely on it,
> > > not all platforms support that.
> > 
> > Please explain, which platforms don't support that?  They wouldn't be
> > compliant with the spec it seems.
> 
> What about all of the non-pciehp platforms?  :)

Fair enough, those should be extended to set PCI_DEV_DISCONNECTED as well.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Possible regression between 4.9 and 4.13
  2017-08-30  6:36                               ` Lukas Wunner
@ 2017-08-30  6:45                                 ` Greg Kroah-Hartman
  -1 siblings, 0 replies; 60+ messages in thread
From: Greg Kroah-Hartman @ 2017-08-30  6:45 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Mathias Nyman, Mason, Felipe Balbi, linux-pci, linux-usb,
	Linux ARM, Bjorn Helgaas, Alan Stern

On Wed, Aug 30, 2017 at 08:36:23AM +0200, Lukas Wunner wrote:
> On Tue, Aug 29, 2017 at 05:51:38PM +0200, Greg Kroah-Hartman wrote:
> > On Tue, Aug 29, 2017 at 05:34:56PM +0200, Lukas Wunner wrote:
> > > On Tue, Aug 29, 2017 at 04:47:25PM +0200, Greg Kroah-Hartman wrote:
> > > > On Tue, Aug 29, 2017 at 03:38:52PM +0200, Lukas Wunner wrote:
> > > Is 0xffffffff not a valid value to be stored in and read from mmio space?
> > 
> > For a specific register, doubtful
> 
> Well, "doubtful" means you don't know for sure.
> 
> It's fine to check for "all ones" as a heuristic if that's not a valid
> value for the register read, however a hotplug notification is a
> *definitive* indication the hardware is gone.
> 
> You seem to prefer forgoing a *definitive* indication for a mere
> heuristic, that doesn't make sense from my point of view.

I still don't know what you are arguing about here.  The _driver_ knows
if a specific read allows all ones as a valid return value.  If it
isn't, then the driver knows the device is now gone.  It's that simple,
don't do that type of check if all ones is a valid read.
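
One way a driver can "know" this per register is sketched below in
userspace C.  The assumption is that a register with reserved bits that
must read as zero can never legitimately return all ones.  The struct,
the helper names and any mask values are hypothetical and not taken from
the xHCI register layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-register description. */
struct model_reg {
	uint32_t rsvdz_mask;   /* bits the hardware must always read as 0 */
};

/* All ones is a possible value only if no bit is reserved-zero. */
bool model_all_ones_is_valid(const struct model_reg *reg)
{
	return reg->rsvdz_mask == 0;
}

/*
 * An all-ones read means "device gone" only for registers where that
 * value cannot occur in normal operation.  For the others the check is
 * skipped, as described above.
 */
bool model_read_means_gone(const struct model_reg *reg, uint32_t val)
{
	return val == UINT32_MAX && !model_all_ones_is_valid(reg);
}
```

A pure data register (rsvdz_mask == 0) never triggers the check, while a
status register with reserved bits does.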

And that's not what is happening here anyway, so again, what is this
discussion about?

Unless there's something specific we can do here for the xhci driver, I
think this thread is dead until someone determines what is going wrong
with the hardware the original reporter posted about.

greg k-h

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Possible regression between 4.9 and 4.13
  2017-08-30  6:02                         ` Greg Kroah-Hartman
@ 2017-08-30  8:55                           ` Mason
  -1 siblings, 0 replies; 60+ messages in thread
From: Mason @ 2017-08-30  8:55 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Lukas Wunner
  Cc: Mathias Nyman, Felipe Balbi, linux-pci, linux-usb, Linux ARM,
	Bjorn Helgaas, Alan Stern

On 30/08/2017 08:02, Greg Kroah-Hartman wrote:

> To get back to the original issue here, the hardware seems to have died,
> the driver stops talking to it, and all is good.  The "regression" here
> is that we now properly can determine that the hardware is crap.

Before 4.12, when I unplugged my USB3 Flash drive, Linux would
detect a few "Uncorrected Non-Fatal errors" via AER, but it was
still possible to plug the drive back in.

Since 4.12, once I unplug the drive, the whole USB3 card is marked
as dead (all 4 ports), and I can no longer plug anything in (not even
the USB2 drive that didn't have any issues, IIRC).

It seems a bit premature to "mark as dead" something that remains
functional, doesn't it?

Disclaimer, there are many variables in this setup, and I've only
tested a small fraction of the problem space: only one system,
only one USB3 board, only one USB3 Flash drive.

> So, how do you think we should proceed, delay a bit longer before saying
> the device is gone?  How long is "long enough"?  How many bus errors are
> we allowed to tolerate (hint, the PCI spec says none...)
> 
> Maybe someone wants to get to the root problem here, why is the hardware
> suddenly reporting all 1s?

I'm afraid I won't be able to make any progress on this front,
unless I can get my hands on a PCIe packet analyzer.

Regards.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Possible regression between 4.9 and 4.13
  2017-08-30  8:55                           ` Mason
@ 2017-08-30  9:06                             ` Greg Kroah-Hartman
  -1 siblings, 0 replies; 60+ messages in thread
From: Greg Kroah-Hartman @ 2017-08-30  9:06 UTC (permalink / raw)
  To: Mason
  Cc: Lukas Wunner, Mathias Nyman, Felipe Balbi, linux-pci, linux-usb,
	Linux ARM, Bjorn Helgaas, Alan Stern

On Wed, Aug 30, 2017 at 10:55:37AM +0200, Mason wrote:
> On 30/08/2017 08:02, Greg Kroah-Hartman wrote:
> 
> > To get back to the original issue here, the hardware seems to have died,
> > the driver stops talking to it, and all is good.  The "regression" here
> > is that we now properly can determine that the hardware is crap.
> 
> Before 4.12, when I unplugged my USB3 Flash drive, Linux would
> detect a few "Uncorrected Non-Fatal errors" via AER, but it was
> still possible to plug the drive back in.
> 
> Since 4.12, once I unplug the drive, the whole USB3 card is marked
> as dead (all 4 ports), and I can no longer plug anything in (not even
> the USB2 drive that didn't have any issues, IIRC).
> 
> It seems a bit premature to "mark as dead" something that remains
> functional, doesn't it?

I agree, but if the device sends all ones, it's a good indication it is
really dead, right?  Or something is wrong with it.

> Disclaimer, there are many variables in this setup, and I've only
> tested a small fraction of the problem space: only one system,
> only one USB3 board, only one USB3 Flash drive.

Did you ever happen to narrow this down to a single git commit using
'git bisect'?  I can't remember what happened in the beginning of this
thread...

> > So, how do you think we should proceed, delay a bit longer before saying
> > the device is gone?  How long is "long enough"?  How many bus errors are
> > we allowed to tolerate (hint, the PCI spec says none...)
> > 
> > Maybe someone wants to get to the root problem here, why is the hardware
> > suddenly reporting all 1s?
> 
> I'm afraid I won't be able to make any progress on this front,
> unless I can get my hands on a PCIe packet analyzer.

Odds of that happening are pretty rare, right?  I've never even seen one
of those...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Possible regression between 4.9 and 4.13
  2017-08-30  8:55                           ` Mason
@ 2017-08-30  9:07                             ` Ard Biesheuvel
  -1 siblings, 0 replies; 60+ messages in thread
From: Ard Biesheuvel @ 2017-08-30  9:07 UTC (permalink / raw)
  To: Mason
  Cc: Greg Kroah-Hartman, Lukas Wunner, Mathias Nyman, Felipe Balbi,
	linux-pci, linux-usb, Bjorn Helgaas, Alan Stern, Linux ARM

On 30 August 2017 at 09:55, Mason <slash.tmp@free.fr> wrote:
> On 30/08/2017 08:02, Greg Kroah-Hartman wrote:
>
>> To get back to the original issue here, the hardware seems to have died,
>> the driver stops talking to it, and all is good.  The "regression" here
>> is that we now properly can determine that the hardware is crap.
>
> Before 4.12, when I unplugged my USB3 Flash drive, Linux would
> detect a few "Uncorrected Non-Fatal errors" via AER, but it was
> still possible to plug the drive back in.
>
> Since 4.12, once I unplug the drive, the whole USB3 card is marked
> as dead (all 4 ports), and I can no longer plug anything in (not even
> the USB2 drive that didn't have any issues, IIRC).
>
> It seems a bit premature to "mark as dead" something that remains
> functional, doesn't it?
>
> Disclaimer, there are many variables in this setup, and I've only
> tested a small fraction of the problem space: only one system,
> only one USB3 board, only one USB3 Flash drive.
>

Please don't forget to mention that this is quirky hardware that
depends on BROKEN because it multiplexes MMIO and config space
accesses in the same memory window without any locking whatsoever
(which would be difficult to do in the first place because we don't
use accessors for MMIO in the kernel).

So how likely is it that you are attempting to read from the xhci BAR
window while a config space access is in progress? Any way to
instrument this in your driver?
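
A rough idea of what such instrumentation could look like, as a
userspace C11 sketch: the host bridge driver marks the start and end of
each config space access, and a wrapper around the MMIO read counts how
often a read happens in between.  All names are hypothetical.  In the
kernel this would mean wrapping the xhci register reads by hand, since
there is no common MMIO accessor to hook.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Number of config space accesses currently in progress. */
static atomic_int model_cfg_in_flight;
/* Number of MMIO reads that overlapped a config space access. */
static atomic_uint model_overlaps;

/* Called by the (hypothetical) config accessor around each access. */
void model_cfg_access_begin(void)
{
	atomic_fetch_add(&model_cfg_in_flight, 1);
}

void model_cfg_access_end(void)
{
	atomic_fetch_sub(&model_cfg_in_flight, 1);
}

/* MMIO read wrapper: record an overlap, then perform the read. */
uint32_t model_mmio_read(const volatile uint32_t *addr)
{
	if (atomic_load(&model_cfg_in_flight) > 0)
		atomic_fetch_add(&model_overlaps, 1);
	return *addr;
}

unsigned int model_overlap_count(void)
{
	return atomic_load(&model_overlaps);
}
```

A non-zero overlap count at the time of the all-ones read would point at
the shared window rather than at the USB3 card.  The counter only
detects overlaps; it does not prevent them.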

>> So, how do you think we should proceed, delay a bit longer before saying
>> the device is gone?  How long is "long enough"?  How many bus errors are
>> we allowed to tolerate (hint, the PCI spec says none...)
>>
>> Maybe someone wants to get to the root problem here, why is the hardware
>> suddenly reporting all 1s?
>
> I'm afraid I won't be able to make any progress on this front,
> unless I can get my hands on a PCIe packet analyzer.
>
> Regards.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Possible regression between 4.9 and 4.13
  2017-08-30  9:07                             ` Ard Biesheuvel
@ 2017-08-30  9:22                               ` Greg Kroah-Hartman
  -1 siblings, 0 replies; 60+ messages in thread
From: Greg Kroah-Hartman @ 2017-08-30  9:22 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Mason, Lukas Wunner, Mathias Nyman, Felipe Balbi, linux-pci,
	linux-usb, Bjorn Helgaas, Alan Stern, Linux ARM

On Wed, Aug 30, 2017 at 10:07:59AM +0100, Ard Biesheuvel wrote:
> On 30 August 2017 at 09:55, Mason <slash.tmp@free.fr> wrote:
> > On 30/08/2017 08:02, Greg Kroah-Hartman wrote:
> >
> >> To get back to the original issue here, the hardware seems to have died,
> >> the driver stops talking to it, and all is good.  The "regression" here
> >> is that we now properly can determine that the hardware is crap.
> >
> > Before 4.12, when I unplugged my USB3 Flash drive, Linux would
> > detect a few "Uncorrected Non-Fatal errors" via AER, but it was
> > still possible to plug the drive back in.
> >
> > Since 4.12, once I unplug the drive, the whole USB3 card is marked
> > as dead (all 4 ports), and I can no longer plug anything in (not even
> > the USB2 drive that didn't have any issues, IIRC).
> >
> > It seems a bit premature to "mark as dead" something that remains
> > functional, doesn't it?
> >
> > Disclaimer, there are many variables in this setup, and I've only
> > tested a small fraction of the problem space: only one system,
> > only one USB3 board, only one USB3 Flash drive.
> >
> 
> Please don't forget to mention that this is quirky hardware that
> depends on BROKEN because it multiplexes MMIO and config space
> accesses in the same memory window without any locking whatsoever
> (which would be difficult to do in the first place because we don't
> use accessors for MMIO in the kernel).
> 
> So how likely is it that you are attempting to read from the xhci BAR
> window while a config space access is in progress? Any way to
> instrument this in your driver?

Seriously?  Ok, that's crap hardware, sorry, I don't feel bad at all
here.  You are going to have worse problems than just a single USB
controller issue if that's your hardware design, go kick some hardware
engineers for me please.

good luck, you are on your own :(

greg k-h

* Re: Possible regression between 4.9 and 4.13
  2017-08-30  9:07                             ` Ard Biesheuvel
@ 2017-08-30  9:37                               ` Mason
  -1 siblings, 0 replies; 60+ messages in thread
From: Mason @ 2017-08-30  9:37 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Greg Kroah-Hartman, Lukas Wunner, Mathias Nyman, Felipe Balbi,
	linux-pci, linux-usb, Bjorn Helgaas, Alan Stern, Linux ARM

On 30/08/2017 11:07, Ard Biesheuvel wrote:

> Please don't forget to mention that this is quirky hardware that
> depends on BROKEN because it multiplexes MMIO and config space
> accesses in the same memory window without any locking whatsoever
> (which would be difficult to do in the first place because we don't
> use accessors for MMIO in the kernel).

You're right, it was in the back of my mind, but I didn't state
it explicitly for the benefit of linux-usb readers.

> So how likely is it that you are attempting to read from the xhci BAR
> window while a config space access is in progress? Any way to
> instrument this in your driver?

I logged config space accesses here:

https://www.spinics.net/lists/arm-kernel/msg602832.html

IIRC, the config space accesses are generated by the AER ISR.
So disabling the AER driver should guarantee that no config space
accesses are occurring when the drive is unplugged.
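
To make the hazard Ard describes concrete, here is a minimal userspace
model, not the actual host bridge driver. It assumes a single window
whose decoding is selected by one mux setting; the struct, the function
names and the register values are all hypothetical. An MMIO read that
lands while the mux is switched to config space returns config data.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: one window, one mux setting choosing what it decodes. */
enum mux_mode { MUX_MEM, MUX_CFG };

struct muxed_window {
	enum mux_mode mode;	/* models the bridge's mux selection */
	uint32_t mem_reg;	/* value an MMIO read should return */
	uint32_t cfg_reg;	/* value a config read should return */
};

/* Every read through the window is decoded by the current mux state. */
static uint32_t window_read(const struct muxed_window *w)
{
	return w->mode == MUX_CFG ? w->cfg_reg : w->mem_reg;
}

/*
 * Simulate the bad interleaving: a config accessor has switched the mux,
 * and an MMIO read from another context lands before it switches back.
 * Returns the value the MMIO reader observed.
 */
static uint32_t mmio_read_during_config(struct muxed_window *w)
{
	uint32_t seen;

	w->mode = MUX_CFG;	/* config accessor starts */
	seen = window_read(w);	/* MMIO read slips in, decoded as config */
	w->mode = MUX_MEM;	/* config accessor finishes */
	return seen;
}

/* Arbitrary placeholder values, chosen only so the two spaces differ. */
static void muxed_window_selftest(void)
{
	struct muxed_window w = { MUX_MEM, 0x12345678u, 0x0badc0deu };

	assert(window_read(&w) == 0x12345678u);			/* quiet bus */
	assert(mmio_read_during_config(&w) == 0x0badc0deu);	/* overlap */
	assert(window_read(&w) == 0x12345678u);			/* quiet again */
}
```

If no config access is in flight, the interleaving modeled by
mmio_read_during_config() cannot happen, which is why the AER driver
matters for this test.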

Regards.

* Re: Possible regression between 4.9 and 4.13
  2017-08-30  9:37                               ` Mason
@ 2017-08-31  9:17                                 ` Mason
  -1 siblings, 0 replies; 60+ messages in thread
From: Mason @ 2017-08-31  9:17 UTC (permalink / raw)
  To: Ard Biesheuvel, Greg Kroah-Hartman
  Cc: Lukas Wunner, Mathias Nyman, Felipe Balbi, linux-pci, linux-usb,
	Bjorn Helgaas, Alan Stern, Linux ARM

On 30/08/2017 11:37, Mason wrote:

> On 30/08/2017 11:07, Ard Biesheuvel wrote:
> 
>> Please don't forget to mention that this is quirky hardware that
>> depends on BROKEN because it multiplexes MMIO and config space
>> accesses in the same memory window without any locking whatsoever
>> (which would be difficult to do in the first place because we don't
>> use accessors for MMIO in the kernel).
> 
> You're right, it was in the back of my mind, but I didn't state
> it explicitly for the benefit of linux-usb readers.
> 
>> So how likely is it that you are attempting to read from the xhci
>> BAR window while a config space access is in progress? Any way to
>> instrument this in your driver?
> 
> I logged config space accesses here:
> 
> https://www.spinics.net/lists/arm-kernel/msg602832.html
> 
> IIRC, the config space accesses are generated by the AER ISR.
> So disabling the AER driver should guarantee that no config space
> accesses are occurring when the drive is unplugged.

I checked, and I *did* remember correctly.

Disabling the AER driver results in 0 config space accesses occurring
when the USB3 drive is unplugged. This confirms that the controller's
broken design (muxing config and mem space) is not responsible for
the glitches occurring on unplug events.

Furthermore, I confirm that once the controller has been deemed "dead",
even USB2 drives are no longer detected, and all USB ports on the PCIe
board are disabled.

Regards.


For reads/writes in config space, I have:

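	/* In the config read accessor (smp8759_config_read in the trace below) */
	if (do_debug) {
		printk("\t READ: bus=%d devfn=%u where=%d size=%d val=0x%x\n",
			bus->number, devfn, where, size, *val);
		dump_stack();	/* show who issued the access */
	}

	/* In the matching config write accessor */
	if (do_debug) {
		printk("\tWRITE: bus=%d devfn=%u where=%d size=%d val=0x%x\n",
			bus->number, devfn, where, size, val);
		dump_stack();	/* show who issued the access */
	}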

During setup I do get, e.g.

[    7.621417]   READ: bus=1 devfn=0 where=84 size=2 val=0x8
[    7.626840] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G         C      4.12.0-rc1 #2
[    7.634358] Hardware name: Sigma Tango DT
[    7.638387] [<c010e8b4>] (unwind_backtrace) from [<c010ac00>] (show_stack+0x10/0x14)
[    7.646171] [<c010ac00>] (show_stack) from [<c0257a30>] (dump_stack+0x84/0x98)
[    7.653429] [<c0257a30>] (dump_stack) from [<c029cb34>] (smp8759_config_read+0xa0/0xa4)
[    7.661474] [<c029cb34>] (smp8759_config_read) from [<c0282908>] (pci_bus_read_config_word+0x6c/0x94)
[    7.670742] [<c0282908>] (pci_bus_read_config_word) from [<c0282cfc>] (pci_read_config_word+0x24/0x38)
[    7.680097] [<c0282cfc>] (pci_read_config_word) from [<c028c5c0>] (__pci_dev_reset+0x11c/0x2fc)
[    7.688841] [<c028c5c0>] (__pci_dev_reset) from [<c028c9c4>] (pci_probe_reset_function+0xc/0x10)
[    7.697673] [<c028c9c4>] (pci_probe_reset_function) from [<c028f720>] (pci_create_sysfs_dev_files+0x2a8/0x374)
[    7.707728] [<c028f720>] (pci_create_sysfs_dev_files) from [<c0515cf8>] (pci_sysfs_init+0x34/0x54)
[    7.716734] [<c0515cf8>] (pci_sysfs_init) from [<c010175c>] (do_one_initcall+0x44/0x168)
[    7.724867] [<c010175c>] (do_one_initcall) from [<c0500dd8>] (kernel_init_freeable+0x15c/0x1e8)
[    7.733611] [<c0500dd8>] (kernel_init_freeable) from [<c0332348>] (kernel_init+0x8/0x108)
[    7.741831] [<c0332348>] (kernel_init) from [<c01076f8>] (ret_from_fork+0x14/0x3c)
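
The bus and devfn values in the READ line map onto the 0000:01:00.0
name used in the xhci_hcd messages. A small sketch of that decoding,
using the standard split of devfn (upper 5 bits slot, lower 3 bits
function); the MODEL_ macros and function names are made up for
illustration, and the domain is assumed to be 0.

```c
#include <assert.h>
#include <stdio.h>

/* Same bit layout as the usual slot/function split of a PCI devfn. */
#define MODEL_PCI_SLOT(devfn)	(((devfn) >> 3) & 0x1f)
#define MODEL_PCI_FUNC(devfn)	((devfn) & 0x07)

/* Format domain:bus:slot.func, with the domain assumed to be 0. */
static int model_format_bdf(char *buf, size_t len,
			    unsigned int bus, unsigned int devfn)
{
	return snprintf(buf, len, "%04x:%02x:%02x.%x", 0u, bus,
			MODEL_PCI_SLOT(devfn), MODEL_PCI_FUNC(devfn));
}

/* bus=1 devfn=0 from the log should give bus 01, slot 00, function 0. */
static void model_bdf_selftest(void)
{
	char buf[16];
	int n = model_format_bdf(buf, sizeof(buf), 1, 0);

	assert(n == 12);
	assert(buf[5] == '0' && buf[6] == '1');
	assert(buf[8] == '0' && buf[9] == '0' && buf[11] == '0');
}
```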


On plug/unplug events, there are no config space accesses:

[   88.006750] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
[   88.040179] usb 2-2: New USB device found, idVendor=0951, idProduct=1666
[   88.046930] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[   88.054177] usb 2-2: Product: DataTraveler 3.0
[   88.058684] usb 2-2: Manufacturer: Kingston
[   88.062927] usb 2-2: SerialNumber: 002618887865F0C0F8646BFA
[   88.071523] usb-storage 2-2:1.0: USB Mass Storage device detected
[   88.081334] scsi host0: usb-storage 2-2:1.0
[   89.096074] scsi 0:0:0:0: Direct-Access     Kingston DataTraveler 3.0      PQ: 0 ANSI: 6
[   89.104828] sd 0:0:0:0: [sda] 15109516 512-byte logical blocks: (7.74 GB/7.20 GiB)
[   89.112996] sd 0:0:0:0: [sda] Write Protect is off
[   89.118060] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   89.129463]  sda: sda1
[   89.133104] sd 0:0:0:0: [sda] Attached SCSI removable disk

[  103.375210] xhci_hcd 0000:01:00.0: xHCI host controller not responding, assume dead
[  103.382917] xhci_hcd 0000:01:00.0: HC died; cleaning up
[  103.388281] usb 2-2: USB disconnect, device number 2

* Re: Possible regression between 4.9 and 4.13
  2017-08-30  9:06                             ` Greg Kroah-Hartman
@ 2017-08-31  9:39                               ` Mason
  -1 siblings, 0 replies; 60+ messages in thread
From: Mason @ 2017-08-31  9:39 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Lukas Wunner, Mathias Nyman, Felipe Balbi, linux-pci, linux-usb,
	Linux ARM, Bjorn Helgaas, Alan Stern

On 30/08/2017 11:06, Greg Kroah-Hartman wrote:

> On Wed, Aug 30, 2017 at 10:55:37AM +0200, Mason wrote:
>
>> On 30/08/2017 08:02, Greg Kroah-Hartman wrote:
>>
>>> To get back to the original issue here, the hardware seems to have died,
>>> the driver stops talking to it, and all is good.  The "regression" here
>>> is that we now properly can determine that the hardware is crap.
>>
>> Before 4.12, when I unplugged my USB3 Flash drive, Linux would
>> detect a few "Uncorrected Non-Fatal errors" via AER, but it was
>> still possible to plug the drive back in.
>>
>> Since 4.12, once I unplug the drive, the whole USB3 card is marked
>> as dead (all 4 ports), and I can no longer plug anything in (not even
>> the USB2 drive that didn't have any issues, IIRC).
>>
>> It seems a bit premature to "mark as dead" something that remains
>> functional, doesn't it?
> 
> I agree, but if the device sends all ones, it's a good indication it is
> really dead, right?  Or something is wrong with it.

I wouldn't call it dead if I can plug the drive back in, and have
it working... But I agree that something fishy is happening...

>> Disclaimer, there are many variables in this setup, and I've only
>> tested a small fraction of the problem space: only one system,
>> only one USB3 board, only one USB3 Flash drive.
> 
> Did you ever happen to narrow this down to a single git commit using
> 'git bisect'?  I can't remember what happened in the beginning of this
> thread...

Mathias pointed out d9f11ba9f107aa335091ab8d7ba5eea714e46e8b

>>> So, how do you think we should proceed, delay a bit longer before saying
>>> the device is gone?  How long is "long enough"?  How many bus errors are
>>> we allowed to tolerate (hint, the PCI spec says none...)
>>>
>>> Maybe someone wants to get to the root problem here, why is the hardware
>>> suddenly reporting all 1s?
>>
>> I'm afraid I won't be able to make any progress on this front,
>> unless I can get my hands on a PCIe packet analyzer.
> 
> Odds of that happening are pretty rare, right?  I've never even seen one
> of those...

I had a "Summit T24 Analyzer" on my desk a few months ago, but I was getting
strange results, and the knowledgeable people in my company were not available
at the time.

http://teledynelecroy.com/protocolanalyzer/protocoloverview.aspx?seriesid=445

Regards.

* Re: Possible regression between 4.9 and 4.13
  2017-08-31  9:17                                 ` Mason
@ 2017-08-31 11:38                                   ` Mathias Nyman
  -1 siblings, 0 replies; 60+ messages in thread
From: Mathias Nyman @ 2017-08-31 11:38 UTC (permalink / raw)
  To: Mason, Ard Biesheuvel, Greg Kroah-Hartman
  Cc: Lukas Wunner, Felipe Balbi, linux-pci, linux-usb, Bjorn Helgaas,
	Alan Stern, Linux ARM

On 31.08.2017 12:17, Mason wrote:
> On 30/08/2017 11:37, Mason wrote:
>
>> On 30/08/2017 11:07, Ard Biesheuvel wrote:
>>
>>> Please don't forget to mention that this is quirky hardware that
>>> depends on BROKEN because it multiplexes MMIO and config space
>>> accesses in the same memory window without any locking whatsoever
>>> (which would be difficult to do in the first place because we don't
>>> use accessors for MMIO in the kernel).
>>
>> You're right, it was in the back of my mind, but I didn't state
>> it explicitly for the benefit of linux-usb readers.
>>
>>> So how likely is it that you are attempting to read from the xhci
>>> BAR window while a config space access is in progress? Any way to
>>> instrument this in your driver?
>>
>> I logged config space accesses here:
>>
>> https://www.spinics.net/lists/arm-kernel/msg602832.html
>>
>> IIRC, the config space accesses are generated by the AER ISR.
>> So disabling the AER driver should guarantee that no config space
>> accesses are occurring when the drive is unplugged.
>
> I checked, and I *did* remember correctly.
>
> Disabling the AER driver results in 0 config space accesses occurring
> when the USB3 drive is unplugged. This confirms that the controller's
> broken design (muxing config and mem space) is not responsible for
> the glitches occurring on unplug events.
>
> Furthermore, I confirm that once the controller has been deemed "dead",
> even USB2 drives are no longer detected, and all USB ports on the PCIe
> board are disabled.

xhci handles both USB3 and USB2. If xhci is the only host controller
in use, then all USB ports will be disabled.
Many systems have both ehci and xhci, where ehci handles the USB2 side.
I'm guessing yours only has the xhci.

-Mathias

* Re: Possible regression between 4.9 and 4.13
  2017-08-31  9:39                               ` Mason
@ 2017-08-31 11:40                                 ` Mathias Nyman
  -1 siblings, 0 replies; 60+ messages in thread
From: Mathias Nyman @ 2017-08-31 11:40 UTC (permalink / raw)
  To: Mason, Greg Kroah-Hartman
  Cc: Lukas Wunner, Felipe Balbi, linux-pci, linux-usb, Linux ARM,
	Bjorn Helgaas, Alan Stern

On 31.08.2017 12:39, Mason wrote:
> On 30/08/2017 11:06, Greg Kroah-Hartman wrote:
>
>> On Wed, Aug 30, 2017 at 10:55:37AM +0200, Mason wrote:
>>
>>> On 30/08/2017 08:02, Greg Kroah-Hartman wrote:
>>>
>>>> To get back to the original issue here, the hardware seems to have died,
>>>> the driver stops talking to it, and all is good.  The "regression" here
>>>> is that we now properly can determine that the hardware is crap.
>>>
>>> Before 4.12, when I unplugged my USB3 Flash drive, Linux would
>>> detect a few "Uncorrected Non-Fatal errors" via AER, but it was
>>> still possible to plug the drive back in.
>>>
>>> Since 4.12, once I unplug the drive, the whole USB3 card is marked
>>> as dead (all 4 ports), and I can no longer plug anything in (not even
>>> the USB2 drive that didn't have any issues, IIRC).
>>>
>>> It seems a bit premature to "mark as dead" something that remains
>>> functional, doesn't it?
>>
>> I agree, but if the device sends all ones, it's a good indication it is
>> really dead, right?  Or something is wrong with it.
>
> I wouldn't call it dead if I can plug the drive back in, and have
> it working... But I agree that something fishy is happening...
>
>>> Disclaimer, there are many variables in this setup, and I've only
>>> tested a small fraction of the problem space: only one system,
>>> only one USB3 board, only one USB3 Flash drive.
>>
>> Did you ever happen to narrow this down to a single git commit using
>> 'git bisect'?  I can't remember what happened in the beginning of this
>> thread...
>
> Mathias pointed out d9f11ba9f107aa335091ab8d7ba5eea714e46e8b
>

That patch only changes how xhci reacts to reading 0xffffffff.
We used to just return -ENODEV, but after the patch we assume
the hardware is broken or removed.
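
A rough userspace model of that difference, not the actual xhci code:
the struct, enum and function names are hypothetical, and returning
-ENODEV under both policies is an assumption made for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's view of the host controller. */
struct model_hc {
	bool dead;	/* once set, the driver gives up on every port */
};

enum model_policy { BEFORE_PATCH, AFTER_PATCH };

/*
 * React to a 32-bit value read back from a controller register.
 * All ones is treated as "the device did not answer".
 */
static int model_check_reg(struct model_hc *hc, uint32_t val,
			   enum model_policy policy)
{
	if (val != UINT32_MAX)
		return 0;		/* plausible register contents */
	if (policy == AFTER_PATCH)
		hc->dead = true;	/* assume broken or removed */
	return -ENODEV;			/* assumed: error in both cases */
}

static void model_hc_selftest(void)
{
	struct model_hc hc = { false };

	assert(model_check_reg(&hc, 0x1u, AFTER_PATCH) == 0 && !hc.dead);
	assert(model_check_reg(&hc, UINT32_MAX, BEFORE_PATCH) == -ENODEV);
	assert(!hc.dead);	/* old behavior: error only, still usable */
	assert(model_check_reg(&hc, UINT32_MAX, AFTER_PATCH) == -ENODEV);
	assert(hc.dead);	/* new behavior: controller written off */
}
```

In this model a single all-ones read under AFTER_PATCH is enough to
write off the controller, which matches the "assume dead" message in
the log but says nothing about why the read returned all ones.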

-Mathias

Thread overview: 60+ messages
2017-08-22 17:34 Possible regression between 4.9 and 4.13 Mason
2017-08-22 17:34 ` Mason
2017-08-23  6:07 ` Felipe Balbi
2017-08-23  6:07   ` Felipe Balbi
2017-08-23  7:51   ` Mathias Nyman
2017-08-23  7:51     ` Mathias Nyman
2017-08-23  9:18     ` Mason
2017-08-23  9:18       ` Mason
2017-08-23  9:31     ` Mason
2017-08-23  9:31       ` Mason
2017-08-23 11:11       ` Mathias Nyman
2017-08-23 11:11         ` Mathias Nyman
2017-08-23 11:54         ` Mason
2017-08-23 11:54           ` Mason
2017-08-23 12:41           ` Mason
2017-08-23 12:41             ` Mason
2017-08-23 14:30             ` Mason
2017-08-23 14:30               ` Mason
2017-08-28  8:39               ` Mathias Nyman
2017-08-28  8:39                 ` Mathias Nyman
2017-08-28 14:40                 ` Mason
2017-08-28 14:40                   ` Mason
2017-08-29 13:28                   ` Mathias Nyman
2017-08-29 13:28                     ` Mathias Nyman
2017-08-29 13:38                     ` Lukas Wunner
2017-08-29 13:38                       ` Lukas Wunner
2017-08-29 14:47                       ` Greg Kroah-Hartman
2017-08-29 14:47                         ` Greg Kroah-Hartman
2017-08-29 15:34                         ` Lukas Wunner
2017-08-29 15:34                           ` Lukas Wunner
2017-08-29 15:51                           ` Greg Kroah-Hartman
2017-08-29 15:51                             ` Greg Kroah-Hartman
2017-08-30  6:36                             ` Lukas Wunner
2017-08-30  6:36                               ` Lukas Wunner
2017-08-30  6:45                               ` Greg Kroah-Hartman
2017-08-30  6:45                                 ` Greg Kroah-Hartman
2017-08-29 23:53                     ` Lukas Wunner
2017-08-29 23:53                       ` Lukas Wunner
2017-08-30  6:02                       ` Greg Kroah-Hartman
2017-08-30  6:02                         ` Greg Kroah-Hartman
2017-08-30  8:55                         ` Mason
2017-08-30  8:55                           ` Mason
2017-08-30  9:06                           ` Greg Kroah-Hartman
2017-08-30  9:06                             ` Greg Kroah-Hartman
2017-08-31  9:39                             ` Mason
2017-08-31  9:39                               ` Mason
2017-08-31 11:40                               ` Mathias Nyman
2017-08-31 11:40                                 ` Mathias Nyman
2017-08-30  9:07                           ` Ard Biesheuvel
2017-08-30  9:07                             ` Ard Biesheuvel
2017-08-30  9:22                             ` Greg Kroah-Hartman
2017-08-30  9:22                               ` Greg Kroah-Hartman
2017-08-30  9:37                             ` Mason
2017-08-30  9:37                               ` Mason
2017-08-31  9:17                               ` Mason
2017-08-31  9:17                                 ` Mason
2017-08-31 11:38                                 ` Mathias Nyman
2017-08-31 11:38                                   ` Mathias Nyman
2017-08-23 10:19     ` Mason
2017-08-23 10:19       ` Mason
