* AER: Malformed TLP recovery deadlock with NVMe drives
@ 2018-05-07 18:36 ` Alex G.
  0 siblings, 0 replies; 24+ messages in thread
From: Alex G. @ 2018-05-07 18:36 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, austin_bolen, shyam_iyer, linux-nvme, Alex_Gagniuc

Hi! Me again!

I'm seeing what appears to be a deadlock in the AER recovery path. It
seems that the device_lock() call in report_slot_reset() never returns.
How we get there is interesting:

The error is injected by changing the Max Payload Size on the switch
upstream port from 256 bytes to 128. When there's IO to the drive, the
switch sees a Malformed TLP, reports the error, and AER handles it.
More IO flows, another error is triggered, and this time the root port
reports it. AER recovery hits the NVMe drive behind the affected
upstream port and deadlocks.
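
(The injection itself is driven by an external script -- that's the
malform_tlp function in the timestamps below -- but conceptually it
amounts to something like this hypothetical kernel-side sketch, which
shrinks the upstream port's MPS while the endpoints keep sending
256-byte payloads:)

#include <linux/pci.h>

/* Hypothetical sketch of the injection, not the actual script: drop the
 * switch upstream port's Max_Payload_Size to 128 bytes (encoding 000b),
 * so 256-byte TLPs to/from the drives become Malformed TLPs.
 */
static void inject_mps_mismatch(struct pci_dev *upstream_port)
{
        u16 devctl;

        pcie_capability_read_word(upstream_port, PCI_EXP_DEVCTL, &devctl);
        devctl &= ~PCI_EXP_DEVCTL_PAYLOAD;      /* 000b == 128-byte MPS */
        pcie_capability_write_word(upstream_port, PCI_EXP_DEVCTL, devctl);
}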

I've added some printks in the AER handler to see which lock dies, and I
have a fairly succinct log[1], also pasted below. It appears somebody is
already holding the lock to the PCI device behind the sick upstream
port, and never releases it.
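
(For reference, the callback where we end up blocking looks roughly like
the sketch below -- a simplified version of the pci_walk_bus() callback
in drivers/pci/pcie/aer, with markers for where my HACK printks sit; not
the exact upstream code:)

/* Simplified sketch, not verbatim kernel code: each AER broadcast step
 * (error_detected, slot_reset, resume) walks the bus and takes the
 * per-device lock around the driver callback.
 */
static int report_slot_reset(struct pci_dev *dev, void *data)
{
        pci_ers_result_t vote, *result = data;

        /* HACK printk: "Preparing to lock" */
        device_lock(&dev->dev);         /* <-- never returns in the bad case */
        /* HACK printk: "locked and ready" */

        if (dev->driver && dev->driver->err_handler &&
            dev->driver->err_handler->slot_reset) {
                vote = dev->driver->err_handler->slot_reset(dev);
                *result = merge_result(*result, vote);  /* AER helper */
        }

        device_unlock(&dev->dev);
        /* HACK printk: "Unlocked and DONE" */
        return 0;
}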


I'm not sure how to track down the other users of the lock, and I'm
hoping somebody has a brighter idea.
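
(One hack I'm considering, assuming __mutex_owner() from <linux/mutex.h>
is usable for this in my tree, is to dump the current owner of the device
mutex right before blocking on it -- purely hypothetical, I haven't tried
it yet:)

/* Hypothetical debug hack: print which task currently holds the device
 * lock before we block on it. Assumes __mutex_owner() is available and
 * that its owner tracking is good enough for debug output.
 */
static void dump_device_lock_owner(struct pci_dev *dev)
{
        struct task_struct *owner = __mutex_owner(&dev->dev.mutex);

        if (owner)
                dev_info(&dev->dev, "HACK: device lock held by %s/%d\n",
                         owner->comm, task_pid_nr(owner));
}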

Alex


[1] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1308.log


Appendix A: log

:: Mon May  7 13:09:02 2018 :: <function malform_tlp at 0x7f27fec04400>
[  127.857606] megaraid_sas 0000:86:00.0: invalid short VPD tag 00 at
offset 1
[  138.385529] pcieport 0000:ae:00.0: AER: Multiple Uncorrected (Fatal)
error received: id=b020
[  138.571742] pcieport 0000:b0:04.0: PCIe Bus Error:
severity=Uncorrected (Fatal), type=Transaction Layer, id=b020(Receiver ID)
[  138.583050] pcieport 0000:b0:04.0:   device [10b5:9733] error
status/mask=00440000/01a10000
[  138.591398] pcieport 0000:b0:04.0:    [18] Malformed TLP          (First)
[  138.598200] pcieport 0000:b0:04.0:    [22] Uncorrectable Internal Error
[  138.604812] pcieport 0000:b0:04.0:   TLP Header: 60000040 b10000ff
00000003 df400000
[  138.612554] pcieport 0000:b0:04.0: broadcast error_detected message
[  138.612557] nvme 0000:b1:00.0: HACK: report_error_detected: Preparing
to lock
[  138.612557] nvme 0000:b1:00.0: HACK: report_error_detected: locked
and ready
[  138.612560] nvme nvme2: frozen state error detected, reset controller
[  151.133387] nvme 0000:b1:00.0: HACK: report_error_detected: Unlocked
and DONE
[  152.164786] pcieport 0000:b0:04.0: downstream link has been reset
[  152.164789] pcieport 0000:b0:04.0: broadcast slot_reset message
[  152.164793] nvme 0000:b1:00.0: HACK: report_slot_reset: Preparing to lock
[  152.164795] nvme 0000:b1:00.0: HACK: report_slot_reset: locked and ready
[  152.164799] nvme nvme2: restart after slot reset
[  152.164863] nvme 0000:b1:00.0: HACK: report_slot_reset: Unlocked and DONE
[  152.164865] pcieport 0000:b0:04.0: broadcast resume message
[  152.164867] nvme 0000:b1:00.0: HACK: report_resume: Preparing to lock
[  152.164869] nvme 0000:b1:00.0: HACK: report_resume: locked and ready
[  152.164901] nvme 0000:b1:00.0: HACK: report_resume: Unlocked and DONE
[  152.164904] pcieport 0000:b0:04.0: AER: Device recovery successful
[  152.164907] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164913] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164917] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164921] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164925] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164930] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164934] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164938] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164942] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164946] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164951] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164955] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164959] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164963] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164967] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164972] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164976] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164980] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164984] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164988] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164992] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.164997] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.165001] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.165005] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.165009] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.165013] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.165017] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.165022] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.165026] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.165030] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.165034] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.165038] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=b020
[  152.165043] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
received: id=ae00
[  152.165046] pcieport 0000:ae:00.0: PCIe Bus Error:
severity=Uncorrected (Fatal), type=Transaction Layer, id=ae00(Requester ID)
[  152.176442] pcieport 0000:ae:00.0:   device [8086:2030] error
status/mask=00004000/00310000
[  152.184795] pcieport 0000:ae:00.0:    [14] Completion Timeout     (First)
[  152.191600] pcieport 0000:ae:00.0: broadcast error_detected message
[  152.191601] pcieport 0000:af:00.0: HACK: report_error_detected:
Preparing to lock
[  152.191602] pcieport 0000:af:00.0: HACK: report_error_detected:
locked and ready
[  152.191613] pcieport 0000:af:00.0: HACK: report_error_detected:
Unlocked and DONE
[  152.191614] pcieport 0000:b0:04.0: HACK: report_error_detected:
Preparing to lock
[  152.191615] pcieport 0000:b0:04.0: HACK: report_error_detected:
locked and ready
[  152.191621] pcieport 0000:b0:04.0: HACK: report_error_detected:
Unlocked and DONE
[  152.191622] nvme 0000:b1:00.0: HACK: report_error_detected: Preparing
to lock
[  152.191623] nvme 0000:b1:00.0: HACK: report_error_detected: locked
and ready
[  152.191624] nvme nvme2: frozen state error detected, reset controller
[  152.204899] nvme 0000:b1:00.0: HACK: report_error_detected: Unlocked
and DONE
[  152.204903] pcieport 0000:b0:05.0: HACK: report_error_detected:
Preparing to lock
[  152.204906] pcieport 0000:b0:05.0: HACK: report_error_detected:
locked and ready
[  152.204921] pcieport 0000:b0:05.0: HACK: report_error_detected:
Unlocked and DONE
[  152.204924] nvme 0000:b2:00.0: HACK: report_error_detected: Preparing
to lock
[  152.204927] nvme 0000:b2:00.0: HACK: report_error_detected: locked
and ready
[  152.204931] nvme nvme3: frozen state error detected, reset controller
[  152.230068] nvme 0000:b2:00.0: HACK: report_error_detected: Unlocked
and DONE
[  152.230073] pcieport 0000:b0:06.0: HACK: report_error_detected:
Preparing to lock
[  152.230075] pcieport 0000:b0:06.0: HACK: report_error_detected:
locked and ready
[  152.230091] pcieport 0000:b0:06.0: HACK: report_error_detected:
Unlocked and DONE
[  152.230094] pcieport 0000:b0:07.0: HACK: report_error_detected:
Preparing to lock
[  152.230097] pcieport 0000:b0:07.0: HACK: report_error_detected:
locked and ready
[  152.230111] pcieport 0000:b0:07.0: HACK: report_error_detected:
Unlocked and DONE
[  152.300716] nvme nvme2: Removing after probe failure status: -19
[  152.306687] nvme1n1: detected capacity change from 3200631791616 to 0
[  152.306789] md: super_written gets error=10
[  152.311002] md/raid1:mdX: Disk failure on dm-4, disabling device.
               md/raid1:mdX: Operation continuing on 2 devices.
[  153.252694] pcieport 0000:ae:00.0: Root Port link has been reset
[  153.252700] pcieport 0000:ae:00.0: broadcast slot_reset message
[  153.252705] pcieport 0000:af:00.0: HACK: report_slot_reset: Preparing
to lock
[  153.252707] pcieport 0000:af:00.0: HACK: report_slot_reset: locked
and ready
[  153.252782] pcieport 0000:af:00.0: HACK: report_slot_reset: Unlocked
and DONE
[  153.252785] pcieport 0000:b0:04.0: HACK: report_slot_reset: Preparing
to lock
[  153.252787] pcieport 0000:b0:04.0: HACK: report_slot_reset: locked
and ready
[  153.252859] pcieport 0000:b0:04.0: HACK: report_slot_reset: Unlocked
and DONE
[  153.252862] nvme 0000:b1:00.0: HACK: report_slot_reset: Preparing to lock
[  153.252911] pciehp 0000:b0:04.0:pcie204: Slot(178): Link Up
:: Mon May  7 13:09:33 2018 :: <function malform_tlp at 0x7f27fec04400>
:: Mon May  7 13:09:33 2018 :: <function malform_tlp at 0x7f27fec04400>
:: Mon May  7 13:10:03 2018 :: <function malform_tlp at 0x7f27fec04400>

* Re: AER: Malformed TLP recovery deadlock with NVMe drives
  2018-05-07 18:36 ` Alex G.
@ 2018-05-07 18:46   ` okaya
  -1 siblings, 0 replies; 24+ messages in thread
From: okaya @ 2018-05-07 18:46 UTC (permalink / raw)
  To: Alex G.
  Cc: Alex_Gagniuc, linux-pci, shyam_iyer, linux-nvme, Keith Busch,
	austin_bolen, linux-pci-owner

On 2018-05-07 19:36, Alex G. wrote:
> Hi! Me again!
> 
> I'm seeing what appears to be a deadlock in the AER recovery path. It
> seems that the device_lock() call in report_slot_reset() never returns.
> How we get there is interesting:
> 


Can you give this patch a try?


https://patchwork.kernel.org/patch/10351515/




> Injection of the error happens by changing the maximum payload size to
> 128 bytes from 256. This is on the switch upstream port.
> When there's IO to the drive, switch sees a malformed TLP. Switch
> reports error, AER handles it.
> More IO goes, another error is triggered, and this time the root port
> reports it. AER recovery hits the NVMe drive behind the affected
> upstream port and deadlocks.
> 
> I've added some printks in the AER handler to see which lock dies, and 
> I
> have a fairly succinct log[1], also pasted below. It appears somebody 
> is
> already holding the lock to the PCI device behind the sick upstream
> port, and never releases it.
> 
> 
> I'm not sure how to track down other users of the lock, and I'm hoping
> somebody may have a brighter idea.
> 
> Alex
> 
> 
> [1] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1308.log
> 
> 
> Appendix A: log
> 
> :: Mon May  7 13:09:02 2018 :: <function malform_tlp at 0x7f27fec04400>
> [  127.857606] megaraid_sas 0000:86:00.0: invalid short VPD tag 00 at
> offset 1
> [  138.385529] pcieport 0000:ae:00.0: AER: Multiple Uncorrected (Fatal)
> error received: id=b020
> [  138.571742] pcieport 0000:b0:04.0: PCIe Bus Error:
> severity=Uncorrected (Fatal), type=Transaction Layer, id=b020(Receiver 
> ID)
> [  138.583050] pcieport 0000:b0:04.0:   device [10b5:9733] error
> status/mask=00440000/01a10000
> [  138.591398] pcieport 0000:b0:04.0:    [18] Malformed TLP          
> (First)
> [  138.598200] pcieport 0000:b0:04.0:    [22] Uncorrectable Internal 
> Error
> [  138.604812] pcieport 0000:b0:04.0:   TLP Header: 60000040 b10000ff
> 00000003 df400000
> [  138.612554] pcieport 0000:b0:04.0: broadcast error_detected message
> [  138.612557] nvme 0000:b1:00.0: HACK: report_error_detected: 
> Preparing
> to lock
> [  138.612557] nvme 0000:b1:00.0: HACK: report_error_detected: locked
> and ready
> [  138.612560] nvme nvme2: frozen state error detected, reset 
> controller
> [  151.133387] nvme 0000:b1:00.0: HACK: report_error_detected: Unlocked
> and DONE
> [  152.164786] pcieport 0000:b0:04.0: downstream link has been reset
> [  152.164789] pcieport 0000:b0:04.0: broadcast slot_reset message
> [  152.164793] nvme 0000:b1:00.0: HACK: report_slot_reset: Preparing to 
> lock
> [  152.164795] nvme 0000:b1:00.0: HACK: report_slot_reset: locked and 
> ready
> [  152.164799] nvme nvme2: restart after slot reset
> [  152.164863] nvme 0000:b1:00.0: HACK: report_slot_reset: Unlocked and 
> DONE
> [  152.164865] pcieport 0000:b0:04.0: broadcast resume message
> [  152.164867] nvme 0000:b1:00.0: HACK: report_resume: Preparing to 
> lock
> [  152.164869] nvme 0000:b1:00.0: HACK: report_resume: locked and ready
> [  152.164901] nvme 0000:b1:00.0: HACK: report_resume: Unlocked and 
> DONE
> [  152.164904] pcieport 0000:b0:04.0: AER: Device recovery successful
> [  152.164907] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164913] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164917] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164921] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164925] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164930] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164934] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164938] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164942] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164946] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164951] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164955] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164959] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164963] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164967] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164972] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164976] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164980] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164984] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164988] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164992] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.164997] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.165001] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.165005] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.165009] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.165013] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.165017] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.165022] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.165026] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.165030] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.165034] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.165038] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=b020
> [  152.165043] pcieport 0000:ae:00.0: AER: Uncorrected (Fatal) error
> received: id=ae00
> [  152.165046] pcieport 0000:ae:00.0: PCIe Bus Error:
> severity=Uncorrected (Fatal), type=Transaction Layer, id=ae00(Requester 
> ID)
> [  152.176442] pcieport 0000:ae:00.0:   device [8086:2030] error
> status/mask=00004000/00310000
> [  152.184795] pcieport 0000:ae:00.0:    [14] Completion Timeout     
> (First)
> [  152.191600] pcieport 0000:ae:00.0: broadcast error_detected message
> [  152.191601] pcieport 0000:af:00.0: HACK: report_error_detected:
> Preparing to lock
> [  152.191602] pcieport 0000:af:00.0: HACK: report_error_detected:
> locked and ready
> [  152.191613] pcieport 0000:af:00.0: HACK: report_error_detected:
> Unlocked and DONE
> [  152.191614] pcieport 0000:b0:04.0: HACK: report_error_detected:
> Preparing to lock
> [  152.191615] pcieport 0000:b0:04.0: HACK: report_error_detected:
> locked and ready
> [  152.191621] pcieport 0000:b0:04.0: HACK: report_error_detected:
> Unlocked and DONE
> [  152.191622] nvme 0000:b1:00.0: HACK: report_error_detected: 
> Preparing
> to lock
> [  152.191623] nvme 0000:b1:00.0: HACK: report_error_detected: locked
> and ready
> [  152.191624] nvme nvme2: frozen state error detected, reset 
> controller
> [  152.204899] nvme 0000:b1:00.0: HACK: report_error_detected: Unlocked
> and DONE
> [  152.204903] pcieport 0000:b0:05.0: HACK: report_error_detected:
> Preparing to lock
> [  152.204906] pcieport 0000:b0:05.0: HACK: report_error_detected:
> locked and ready
> [  152.204921] pcieport 0000:b0:05.0: HACK: report_error_detected:
> Unlocked and DONE
> [  152.204924] nvme 0000:b2:00.0: HACK: report_error_detected: 
> Preparing
> to lock
> [  152.204927] nvme 0000:b2:00.0: HACK: report_error_detected: locked
> and ready
> [  152.204931] nvme nvme3: frozen state error detected, reset 
> controller
> [  152.230068] nvme 0000:b2:00.0: HACK: report_error_detected: Unlocked
> and DONE
> [  152.230073] pcieport 0000:b0:06.0: HACK: report_error_detected:
> Preparing to lock
> [  152.230075] pcieport 0000:b0:06.0: HACK: report_error_detected:
> locked and ready
> [  152.230091] pcieport 0000:b0:06.0: HACK: report_error_detected:
> Unlocked and DONE
> [  152.230094] pcieport 0000:b0:07.0: HACK: report_error_detected:
> Preparing to lock
> [  152.230097] pcieport 0000:b0:07.0: HACK: report_error_detected:
> locked and ready
> [  152.230111] pcieport 0000:b0:07.0: HACK: report_error_detected:
> Unlocked and DONE
> [  152.300716] nvme nvme2: Removing after probe failure status: -19
> [  152.306687] nvme1n1: detected capacity change from 3200631791616 to 
> 0
> [  152.306789] md: super_written gets error=10
> [  152.311002] md/raid1:mdX: Disk failure on dm-4, disabling device.
>                md/raid1:mdX: Operation continuing on 2 devices.
> [  153.252694] pcieport 0000:ae:00.0: Root Port link has been reset
> [  153.252700] pcieport 0000:ae:00.0: broadcast slot_reset message
> [  153.252705] pcieport 0000:af:00.0: HACK: report_slot_reset: 
> Preparing
> to lock
> [  153.252707] pcieport 0000:af:00.0: HACK: report_slot_reset: locked
> and ready
> [  153.252782] pcieport 0000:af:00.0: HACK: report_slot_reset: Unlocked
> and DONE
> [  153.252785] pcieport 0000:b0:04.0: HACK: report_slot_reset: 
> Preparing
> to lock
> [  153.252787] pcieport 0000:b0:04.0: HACK: report_slot_reset: locked
> and ready
> [  153.252859] pcieport 0000:b0:04.0: HACK: report_slot_reset: Unlocked
> and DONE
> [  153.252862] nvme 0000:b1:00.0: HACK: report_slot_reset: Preparing to 
> lock
> [  153.252911] pciehp 0000:b0:04.0:pcie204: Slot(178): Link Up
> :: Mon May  7 13:09:33 2018 :: <function malform_tlp at 0x7f27fec04400>
> :: Mon May  7 13:09:33 2018 :: <function malform_tlp at 0x7f27fec04400>
> :: Mon May  7 13:10:03 2018 :: <function malform_tlp at 0x7f27fec04400>

* Re: AER: Malformed TLP recovery deadlock with NVMe drives
  2018-05-07 18:46   ` okaya
@ 2018-05-07 20:16     ` Alex G.
  -1 siblings, 0 replies; 24+ messages in thread
From: Alex G. @ 2018-05-07 20:16 UTC (permalink / raw)
  To: okaya
  Cc: Alex_Gagniuc, linux-pci, shyam_iyer, linux-nvme, Keith Busch,
	austin_bolen, linux-pci-owner

On 05/07/2018 01:46 PM, okaya@codeaurora.org wrote:
> On 2018-05-07 19:36, Alex G. wrote:
>> Hi! Me again!
>>
>> I'm seeing what appears to be a deadlock in the AER recovery path. It
>> seems that the device_lock() call in report_slot_reset() never returns.
>> How we get there is interesting:
> 
> Can you give this patch a try?
> 
Oh! Patches so soon? Yay!

> https://patchwork.kernel.org/patch/10351515/

Thank you! I tried a few runs. There was one run where we didn't lock
up, but the other runs all went like before.

For comparison, the run that didn't deadlock looked like [2].

Alex

[2] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1429.log

>> Injection of the error happens by changing the maximum payload size to
>> 128 bytes from 256. This is on the switch upstream port.
>> When there's IO to the drive, switch sees a malformed TLP. Switch
>> reports error, AER handles it.
>> More IO goes, another error is triggered, and this time the root port
>> reports it. AER recovery hits the NVMe drive behind the affected
>> upstream port and deadlocks.
>>
>> I've added some printks in the AER handler to see which lock dies, and I
>> have a fairly succinct log[1], also pasted below. It appears somebody is
>> already holding the lock to the PCI device behind the sick upstream
>> port, and never releases it.
>>
>>
>> I'm not sure how to track down other users of the lock, and I'm hoping
>> somebody may have a brighter idea.
>>
>> Alex
>>
>>
>> [1] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1308.log
>>
``

* Re: AER: Malformed TLP recovery deadlock with NVMe drives
  2018-05-07 20:16     ` Alex G.
@ 2018-05-07 20:30       ` okaya
  -1 siblings, 0 replies; 24+ messages in thread
From: okaya @ 2018-05-07 20:30 UTC (permalink / raw)
  To: Alex G.
  Cc: Alex_Gagniuc, linux-pci, shyam_iyer, linux-nvme, Keith Busch,
	austin_bolen, linux-pci-owner

On 2018-05-07 21:16, Alex G. wrote:
> On 05/07/2018 01:46 PM, okaya@codeaurora.org wrote:
>> On 2018-05-07 19:36, Alex G. wrote:
>>> Hi! Me again!
>>> 
>>> I'm seeing what appears to be a deadlock in the AER recovery path. It
>>> seems that the device_lock() call in report_slot_reset() never 
>>> returns.
>>> How we get there is interesting:
>> 
>> Can you give this patch a try?
>> 
> Oh! Patches so soon? Yay!
> 
>> https://patchwork.kernel.org/patch/10351515/
> 
> Thank you! I tried a few runs. there was one run where we didn't lock
> up, but the other runs all went like before.
> 
> For comparison, the run that didn't deadlock looked like [2].
> 


Sounds like there are multiple problems. With this patch, you shouldn't 
see link down and up interrupts during reset, but I do see them in the 
log.

Can you also share a fail-case log with this patch and a diff of your 
hacks, so that we know where the prints are coming from?


> Alex
> 
> [2] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1429.log
> 
>>> Injection of the error happens by changing the maximum payload size 
>>> to
>>> 128 bytes from 256. This is on the switch upstream port.
>>> When there's IO to the drive, switch sees a malformed TLP. Switch
>>> reports error, AER handles it.
>>> More IO goes, another error is triggered, and this time the root port
>>> reports it. AER recovery hits the NVMe drive behind the affected
>>> upstream port and deadlocks.
>>> 
>>> I've added some printks in the AER handler to see which lock dies, 
>>> and I
>>> have a fairly succinct log[1], also pasted below. It appears somebody 
>>> is
>>> already holding the lock to the PCI device behind the sick upstream
>>> port, and never releases it.
>>> 
>>> 
>>> I'm not sure how to track down other users of the lock, and I'm 
>>> hoping
>>> somebody may have a brighter idea.
>>> 
>>> Alex
>>> 
>>> 
>>> [1] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1308.log
>>> 
> ``

* Re: AER: Malformed TLP recovery deadlock with NVMe drives
  2018-05-07 20:30       ` okaya
@ 2018-05-07 20:58         ` Alex G.
  -1 siblings, 0 replies; 24+ messages in thread
From: Alex G. @ 2018-05-07 20:58 UTC (permalink / raw)
  To: okaya
  Cc: Alex_Gagniuc, linux-pci, shyam_iyer, linux-nvme, Keith Busch,
	austin_bolen, linux-pci-owner



On 05/07/2018 03:30 PM, okaya@codeaurora.org wrote:
> On 2018-05-07 21:16, Alex G. wrote:
>> On 05/07/2018 01:46 PM, okaya@codeaurora.org wrote:
>>> On 2018-05-07 19:36, Alex G. wrote:
>>>> Hi! Me again!
>>>>
>>>> I'm seeing what appears to be a deadlock in the AER recovery path. It
>>>> seems that the device_lock() call in report_slot_reset() never returns.
>>>> How we get there is interesting:
>>>
>>> Can you give this patch a try?
>>>
>> Oh! Patches so soon? Yay!
>>
>>> https://patchwork.kernel.org/patch/10351515/
>>
>> Thank you! I tried a few runs. there was one run where we didn't lock
>> up, but the other runs all went like before.
>>
>> For comparison, the run that didn't deadlock looked like [2].
>>
> 
> 
> Sounds like there are multiple problems.

If it were easy, somebody would have patched it by now ;)

> With this patch, you shouldn't
> see link down and up interrupts during reset but i do see them in the log.

You will see the messages from the link up/down events regardless of
whether any action is actually taken.

> Can you also share a fail case log with this patch and a diff of your
> hacks so that we know where prints are coming from.

Of course. An example of a failing case is at [3]; it's identical to the
failure log without any patches. The prints already include the function
name, but the diff is in [4].

Alex

[3] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1509.log
[4] http://gtech.myftp.org/~mrnuke/nvme_logs/print_hacks.patch


>> [2] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1429.log
>>>> [1] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1308.log

* Re: AER: Malformed TLP recovery deadlock with NVMe drives
  2018-05-07 20:58         ` Alex G.
@ 2018-05-07 21:48           ` Sinan Kaya
  -1 siblings, 0 replies; 24+ messages in thread
From: Sinan Kaya @ 2018-05-07 21:48 UTC (permalink / raw)
  To: Alex G.
  Cc: Alex_Gagniuc, linux-pci, shyam_iyer, linux-nvme, Keith Busch,
	austin_bolen, linux-pci-owner

On 5/7/2018 9:58 PM, Alex G. wrote:
>> Sounds like there are multiple problems.
> If it were easy, somebody would have patched it by now ;)
> 

I know...

Keith has been debugging similar problems recently. I'm also interested
in error cases for hotplug, so I have been keeping an eye on this for a
while to see what might be happening.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

* Re: AER: Malformed TLP recovery deadlock with NVMe drives
  2018-05-07 20:58         ` Alex G.
@ 2018-05-07 22:45           ` okaya
  -1 siblings, 0 replies; 24+ messages in thread
From: okaya @ 2018-05-07 22:45 UTC (permalink / raw)
  To: Alex G.
  Cc: Alex_Gagniuc, linux-pci, shyam_iyer, linux-nvme, Keith Busch,
	austin_bolen, linux-pci-owner

On 2018-05-07 21:58, Alex G. wrote:
> On 05/07/2018 03:30 PM, okaya@codeaurora.org wrote:
>> On 2018-05-07 21:16, Alex G. wrote:
>>> On 05/07/2018 01:46 PM, okaya@codeaurora.org wrote:
>>>> On 2018-05-07 19:36, Alex G. wrote:
>>>>> Hi! Me again!
>>>>> 
>>>>> I'm seeing what appears to be a deadlock in the AER recovery path. 
>>>>> It
>>>>> seems that the device_lock() call in report_slot_reset() never 
>>>>> returns.
>>>>> How we get there is interesting:
>>>> 
>>>> Can you give this patch a try?
>>>> 
>>> Oh! Patches so soon? Yay!
>>> 
>>>> https://patchwork.kernel.org/patch/10351515/
>>> 
>>> Thank you! I tried a few runs. there was one run where we didn't lock
>>> up, but the other runs all went like before.
>>> 
>>> For comparison, the run that didn't deadlock looked like [2].
>>> 
>> 
>> 
>> Sounds like there are multiple problems.
> 
> If it were easy, somebody would have patched it by now ;)

Can you file a bugzilla, CC me, Keith, and Bjorn, and attach all of your
logs?

Let's debug this there.


> 
>> With this patch, you shouldn't
>> see link down and up interrupts during reset but i do see them in the 
>> log.
> 
> You will see the messages from the link up/down events regardless if 
> any
> action is actually taken.
> 
>> Can you also share a fail case log with this patch and a diff of your
>> hacks so that we know where prints are coming from.
> 
> Of course. Example of failing case [3], and is identical to the fail 
> log
> without any patches. Although prints have the function name, the diff 
> is
> in [4].
> 
> Alex
> 
> [3] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1509.log
> [4] http://gtech.myftp.org/~mrnuke/nvme_logs/print_hacks.patch
> 
> 
>>> [2] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1429.log
>>>>> [1] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1308.log

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: AER: Malformed TLP recovery deadlock with NVMe drives
  2018-05-07 22:45           ` okaya
@ 2018-05-07 23:57             ` Alex_Gagniuc
  -1 siblings, 0 replies; 24+ messages in thread
From: Alex_Gagniuc @ 2018-05-07 23:57 UTC (permalink / raw)
  To: okaya, mr.nuke.me
  Cc: Shyam.Iyer, linux-pci, linux-nvme, keith.busch, Austin.Bolen,
	linux-pci-owner

On 5/7/2018 5:46 PM, okaya@codeaurora.org wrote:
[snip]
>> If it were easy, somebody would have patched it by now ;)
> 
> Can you file a bugzilla CC me, keith and bjorn and attach all of your
> logs?

Sure. Which bugzilla?


> Let's debug this there.

Debugging over email not fun enough?

Alex


>>> With this patch, you shouldn't
>>> see link down and up interrupts during reset but i do see them in the
>>> log.
>>
>> You will see the messages from the link up/down events regardless if
>> any
>> action is actually taken.
>>
>>> Can you also share a fail case log with this patch and a diff of your
>>> hacks so that we know where prints are coming from.
>>
>> Of course. Example of failing case [3], and is identical to the fail
>> log
>> without any patches. Although prints have the function name, the diff
>> is
>> in [4].
>>
>> Alex
>>
>> [3] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1509.log
>> [4] http://gtech.myftp.org/~mrnuke/nvme_logs/print_hacks.patch
>>
>>
>>>> [2] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1429.log
>>>>>> [1] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1308.log
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: AER: Malformed TLP recovery deadlock with NVMe drives
  2018-05-07 23:57             ` Alex_Gagniuc
@ 2018-05-08  0:21               ` okaya
  -1 siblings, 0 replies; 24+ messages in thread
From: okaya @ 2018-05-08  0:21 UTC (permalink / raw)
  To: Alex_Gagniuc
  Cc: Shyam.Iyer, linux-pci, linux-nvme, keith.busch, mr.nuke.me,
	Austin.Bolen, linux-pci-owner

On 2018-05-08 00:57, Alex_Gagniuc@Dellteam.com wrote:
> On 5/7/2018 5:46 PM, okaya@codeaurora.org wrote:
> [snip]
>>> If it were easy, somebody would have patched it by now ;)
>> 
>> Can you file a bugzilla CC me, keith and bjorn and attach all of your
>> logs?
> 
> Sure. Which bugzilla?
> 

https://bugzilla.kernel.org

Drivers -> pci


> 
>> Let's debug this there.
> 

Bugzilla is more organized for keeping track of which log is for what.

My experience is that bugzilla is preferred unless Keith or Bjorn has a 
different opinion.

> Debugging over email not fun enough?
> 
> Alex
> 
> 
>>>> With this patch, you shouldn't
>>>> see link down and up interrupts during reset but i do see them in 
>>>> the
>>>> log.
>>> 
>>> You will see the messages from the link up/down events regardless if
>>> any
>>> action is actually taken.
>>> 
>>>> Can you also share a fail case log with this patch and a diff of 
>>>> your
>>>> hacks so that we know where prints are coming from.
>>> 
>>> Of course. Example of failing case [3], and is identical to the fail
>>> log
>>> without any patches. Although prints have the function name, the diff
>>> is
>>> in [4].
>>> 
>>> Alex
>>> 
>>> [3] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1509.log
>>> [4] http://gtech.myftp.org/~mrnuke/nvme_logs/print_hacks.patch
>>> 
>>> 
>>>>> [2] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1429.log
>>>>>>> [1] 
>>>>>>> http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180507-1308.log
>> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: AER: Malformed TLP recovery deadlock with NVMe drives
  2018-05-08  0:21               ` okaya
@ 2018-05-08 16:58                 ` Bjorn Helgaas
  -1 siblings, 0 replies; 24+ messages in thread
From: Bjorn Helgaas @ 2018-05-08 16:58 UTC (permalink / raw)
  To: okaya
  Cc: Alex_Gagniuc, linux-pci, linux-nvme, Shyam.Iyer, mr.nuke.me,
	Austin.Bolen, keith.busch, linux-pci-owner

On Tue, May 08, 2018 at 01:21:38AM +0100, okaya@codeaurora.org wrote:
> On 2018-05-08 00:57, Alex_Gagniuc@Dellteam.com wrote:
> > On 5/7/2018 5:46 PM, okaya@codeaurora.org wrote:
> > [snip]
> > > > If it were easy, somebody would have patched it by now ;)
> > > 
> > > Can you file a bugzilla CC me, keith and bjorn and attach all of your
> > > logs?
> > 
> > Sure. Which bugzilla?
> > 
> 
> https://bugzilla.kernel.org
> 
> Drivers -> pci

https://bugzilla.kernel.org/show_bug.cgi?id=199657

Thanks for the bugzilla.  Can you add the "lspci -vv" output (the
"lspci -tvnn" output doesn't have the details about individual
devices) and also the specifics about how you inject the error (I
assume you're using setpci to update MPS, but it would help others
reproduce it to see the exact command and the lspci output so they can
adjust to their local topology).
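
If I had to guess at the injection itself, it's something along these
lines (hypothetical BDF; the Max Payload Size field is Device Control
bits 7:5 in the PCIe capability, so clearing it forces 128B):

  # hypothetical sketch; substitute the BDF of the switch upstream port
  setpci -s <upstream-port-BDF> CAP_EXP+8.w=0000:00e0

but the exact command is part of what would help others reproduce it.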

I don't know how you collected the dmesg logs, but they have a bunch
of gunk in them (looks like maybe colorizing escape sequences?).  I
always think it's nice to have the complete plain-ASCII log from the
beginning of boot, e.g., from the "dmesg" command or similar.  But I
probably won't have time to look at them myself, and maybe what's
there is enough.
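
(Probably something like "dmesg --color=never > dmesg.txt", or just
redirecting dmesg to a file from a script instead of capturing a color
terminal, would give the plain version, assuming the color is added by
the capture rather than baked into the messages themselves.)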

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: AER: Malformed TLP recovery deadlock with NVMe drives
  2018-05-08 16:58                 ` Bjorn Helgaas
@ 2018-05-08 17:32                   ` Alex G.
  -1 siblings, 0 replies; 24+ messages in thread
From: Alex G. @ 2018-05-08 17:32 UTC (permalink / raw)
  To: Bjorn Helgaas, okaya
  Cc: Alex_Gagniuc, Shyam.Iyer, linux-pci, linux-nvme, keith.busch,
	Austin.Bolen, linux-pci-owner

On 05/08/2018 11:58 AM, Bjorn Helgaas wrote:
(snip)
> https://bugzilla.kernel.org/show_bug.cgi?id=199657
> 
> Thanks for the bugzilla.  Can you add the "lspci -vv"
> and also the specifics about how you inject the error

Done


> I don't know how you collected the dmesg logs

I hired scribes.


> (looks like maybe colorizing escape sequences?).

You mean the ANSI escape sequences. wget and cat in a color terminal.
You will get so used to this, you will love it so much, you won't ever
again want to open a web page.
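
(For paging, something like "wget -qO- <log URL> | less -R" should keep
the colors intact, too.)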


> always think it's nice to have the plain ASCII

dmesg without color information is like driving a car without a steering
wheel. You'd better have a very straight road and go very slow, or you'll
crash into something.

Alex

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: AER: Malformed TLP recovery deadlock with NVMe drives
  2018-05-08 17:32                   ` Alex G.
@ 2018-05-08 18:01                     ` Bjorn Helgaas
  -1 siblings, 0 replies; 24+ messages in thread
From: Bjorn Helgaas @ 2018-05-08 18:01 UTC (permalink / raw)
  To: Alex G.
  Cc: Alex_Gagniuc, linux-pci, linux-nvme, okaya, Austin.Bolen,
	Shyam.Iyer, keith.busch, linux-pci-owner

On Tue, May 08, 2018 at 12:32:28PM -0500, Alex G. wrote:
> On 05/08/2018 11:58 AM, Bjorn Helgaas wrote:
> > (looks like maybe colorizing escape sequences?).
> 
> You mean the ANSI escape sequences. wget and cat in a color terminal.
> You will get so used to this, you will love it so much, you won't ever
> again want to open a a web page.

Hey, you're right!  If I download it and look at it with "less", it *is*
pretty!  It'd be cool if Chrome would do that.  Maybe.

I'm really glad you're looking at this stuff and raising issues.  I'm not
at all confident in the correctness or robustness of this code.

Bjorn

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2018-05-08 18:01 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-07 18:36 AER: Malformed TLP recovery deadlock with NVMe drives Alex G.
2018-05-07 18:46 ` okaya
2018-05-07 20:16   ` Alex G.
2018-05-07 20:30     ` okaya
2018-05-07 20:58       ` Alex G.
2018-05-07 21:48         ` Sinan Kaya
2018-05-07 22:45         ` okaya
2018-05-07 23:57           ` Alex_Gagniuc
2018-05-08  0:21             ` okaya
2018-05-08 16:58               ` Bjorn Helgaas
2018-05-08 17:32                 ` Alex G.
2018-05-08 18:01                   ` Bjorn Helgaas
