* Re: Another NVMe failure, this time with AER info
[not found] <CAObL_7G5k7XNqXqDZASChiQd1mt+AJPAdNKz-DE+xQawhUy6ZA@mail.gmail.com>
@ 2018-05-11 16:57 ` Bjorn Helgaas
2018-05-11 17:26 ` Keith Busch
2018-05-12 2:38 ` Ming Lei
0 siblings, 2 replies; 5+ messages in thread
From: Bjorn Helgaas @ 2018-05-11 16:57 UTC (permalink / raw)
To: Andrew Lutomirski
Cc: Jesse Vincent, Bjorn Helgaas, Christoph Hellwig, Sagi Grimberg,
Jens Axboe, linux-pci, linux-nvme, linux-kernel
Andrew wrote:
> A friend of mine has a brand new LG laptop that has intermittent NVMe
> failures. They mostly happen during a suspend/resume cycle
> (apparently during suspend, not resume). Unlike the earlier
> Dell/Samsung issue, the NVMe device isn't completely gone -- MMIO
> reads fail, but PCI configuration space is apparently still there:
> nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
> and it comes with a nice AER dump:
> [12720.894411] pcieport 0000:00:1c.0: AER: Multiple Corrected error received: id=00e0
> [12720.909747] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e0(Transmitter ID)
> [12720.909751] pcieport 0000:00:1c.0: device [8086:9d14] error status/mask=00001001/00002000
> [12720.909754] pcieport 0000:00:1c.0: [ 0] Receiver Error (First)
> [12720.909756] pcieport 0000:00:1c.0: [12] Replay Timer Timeout
I opened this bugzilla and attached the dmesg and lspci -vv output to
it: https://bugzilla.kernel.org/show_bug.cgi?id=199695
The root port at 00:1c.0 leads to the NVMe device at 01:00.0 (this is
nvme0):
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1) (prog-if 00 [Normal decode])
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 (prog-if 02 [NVM Express])
Subsystem: Samsung Electronics Co Ltd Device a801
We reported several corrected errors before the nvme timeout:
[12750.281158] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[12750.297594] nvme nvme0: I/O 455 QID 2 timeout, disable controller
[12750.305196] nvme 0000:01:00.0: enabling device (0000 -> 0002)
[12750.305465] nvme nvme0: Removing after probe failure status: -19
[12750.313188] nvme nvme0: I/O 456 QID 2 timeout, disable controller
[12750.329152] nvme nvme0: I/O 457 QID 2 timeout, disable controller
The corrected errors are supposedly recovered in hardware without
software intervention, and AER logs them for informational purposes.
But it seems very likely that these corrected errors are related to
the nvme timeout: the first corrected errors were logged at
12720.894411, nvme_io_timeout defaults to 30 seconds, and the nvme
timeout was at 12750.281158.
I don't have any good ideas. As a shot in the dark, you could try
running these commands before doing a suspend:
# setpci -s01:00.0 0x98.W
# setpci -s00:1c.0 0x68.W
# setpci -s01:00.0 0x198.L
# setpci -s00:1c.0 0x208.L
# setpci -s01:00.0 0x198.L=0x00000000
# setpci -s01:00.0 0x98.W=0x0000
# setpci -s00:1c.0 0x208.L=0x00000000
# setpci -s00:1c.0 0x68.W=0x0000
# lspci -vv -s00:1c.0
# lspci -vv -s01:00.0
The idea is to turn off ASPM L1.2 and LTR, just because that's new and
we've had issues with it before. If you try this, please collect the
output of the commands above in addition to the dmesg log, in case my
math is bad.
Bjorn
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Another NVMe failure, this time with AER info
2018-05-11 16:57 ` Another NVMe failure, this time with AER info Bjorn Helgaas
@ 2018-05-11 17:26 ` Keith Busch
2018-05-11 17:42 ` Keith Busch
2018-05-12 2:38 ` Ming Lei
1 sibling, 1 reply; 5+ messages in thread
From: Keith Busch @ 2018-05-11 17:26 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Andrew Lutomirski, Jesse Vincent, Sagi Grimberg, linux-pci,
linux-kernel, linux-nvme, Jens Axboe, Bjorn Helgaas,
Christoph Hellwig
On Fri, May 11, 2018 at 11:57:52AM -0500, Bjorn Helgaas wrote:
> We reported several corrected errors before the nvme timeout:
>
> [12750.281158] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
> [12750.297594] nvme nvme0: I/O 455 QID 2 timeout, disable controller
> [12750.305196] nvme 0000:01:00.0: enabling device (0000 -> 0002)
> [12750.305465] nvme nvme0: Removing after probe failure status: -19
> [12750.313188] nvme nvme0: I/O 456 QID 2 timeout, disable controller
> [12750.329152] nvme nvme0: I/O 457 QID 2 timeout, disable controller
>
> The corrected errors are supposedly recovered in hardware without
> software intervention, and AER logs them for informational purposes.
>
> But it seems very likely that these corrected errors are related to
> the nvme timeout: the first corrected errors were logged at
> 12720.894411, nvme_io_timeout defaults to 30 seconds, and the nvme
> timeout was at 12750.281158.
The nvme_timeout handling is broken at the moment, but I'm not sure any
of the fixes being considered will help here if we're really getting
MMIO errors (that's what it looks like).
> I don't have any good ideas. As a shot in the dark, you could try
> running these commands before doing a suspend:
>
> # setpci -s01:00.0 0x98.W
> # setpci -s00:1c.0 0x68.W
> # setpci -s01:00.0 0x198.L
> # setpci -s00:1c.0 0x208.L
>
> # setpci -s01:00.0 0x198.L=0x00000000
> # setpci -s01:00.0 0x98.W=0x0000
> # setpci -s00:1c.0 0x208.L=0x00000000
> # setpci -s00:1c.0 0x68.W=0x0000
>
> # lspci -vv -s00:1c.0
> # lspci -vv -s01:00.0
>
> The idea is to turn off ASPM L1.2 and LTR, just because that's new and
> we've had issues with it before. If you try this, please collect the
> output of the commands above in addition to the dmesg log, in case my
> math is bad.
I trust you know the offsets here, but it's hard to tell what this
is doing with hard-coded addresses. Just to be safe and for clarity,
I recommend the 'CAP_*+<offset>' with a mask.
For example, disabling ASPM L1.2 can look like:
# setpci -s <B:D.f> CAP_PM+8.l=0:4
And disabling LTR:
# setpci -s <B:D.f> CAP_EXP+28.w=0:400
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Another NVMe failure, this time with AER info
2018-05-11 17:26 ` Keith Busch
@ 2018-05-11 17:42 ` Keith Busch
2018-05-11 17:55 ` Bjorn Helgaas
0 siblings, 1 reply; 5+ messages in thread
From: Keith Busch @ 2018-05-11 17:42 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Jesse Vincent, Sagi Grimberg, linux-pci, Andrew Lutomirski,
linux-kernel, linux-nvme, Jens Axboe, Bjorn Helgaas,
Christoph Hellwig
On Fri, May 11, 2018 at 11:26:11AM -0600, Keith Busch wrote:
> I trust you know the offsets here, but it's hard to tell what this
> is doing with hard-coded addresses. Just to be safe and for clarity,
> I recommend the 'CAP_*+<offset>' with a mask.
>
> For example, disabling ASPM L1.2 can look like:
>
> # setpci -s <B:D.f> CAP_PM+8.l=0:4
My mistake above: CAP_PM is a different capability, not the intended
one. It looks like setpci doesn't even have a convenient symbol for the
L1 PM extended capability, so the hard-coded offsets are the only way
for this setting. Sorry about the mistake.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Another NVMe failure, this time with AER info
2018-05-11 17:42 ` Keith Busch
@ 2018-05-11 17:55 ` Bjorn Helgaas
0 siblings, 0 replies; 5+ messages in thread
From: Bjorn Helgaas @ 2018-05-11 17:55 UTC (permalink / raw)
To: Keith Busch
Cc: Jesse Vincent, Sagi Grimberg, linux-pci, Andrew Lutomirski,
linux-kernel, linux-nvme, Jens Axboe, Bjorn Helgaas,
Christoph Hellwig
On Fri, May 11, 2018 at 11:42:42AM -0600, Keith Busch wrote:
> On Fri, May 11, 2018 at 11:26:11AM -0600, Keith Busch wrote:
> > I trust you know the offsets here, but it's hard to tell what this
> > is doing with hard-coded addresses. Just to be safe and for clarity,
> > I recommend the 'CAP_*+<offset>' with a mask.
> >
> > For example, disabling ASPM L1.2 can look like:
> >
> > # setpci -s <B:D.f> CAP_PM+8.l=0:4
>
> My mistake above: CAP_PM is a different capability, not the intended
> one. It looks like setpci doesn't even have a convenient symbol for the
> L1 PM extended capability, so the hard-coded offsets are the only way
> for this setting. Sorry about the mistake.
No problem, would be a nice janitor job to add more of those symbolic
offsets to setpci. I recently added more cap ID decoding to lspci, but
didn't think about updating these things for setpci.
Using the symbols would definitely make this easier and better!
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Another NVMe failure, this time with AER info
2018-05-11 16:57 ` Another NVMe failure, this time with AER info Bjorn Helgaas
2018-05-11 17:26 ` Keith Busch
@ 2018-05-12 2:38 ` Ming Lei
1 sibling, 0 replies; 5+ messages in thread
From: Ming Lei @ 2018-05-12 2:38 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Andrew Lutomirski, Jesse Vincent, Sagi Grimberg, linux-pci,
Linux Kernel Mailing List, linux-nvme, Jens Axboe, Bjorn Helgaas,
Christoph Hellwig
On Sat, May 12, 2018 at 12:57 AM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> Andrew wrote:
>> A friend of mine has a brand new LG laptop that has intermittent NVMe
>> failures. They mostly happen during a suspend/resume cycle
>> (apparently during suspend, not resume). Unlike the earlier
>> Dell/Samsung issue, the NVMe device isn't completely gone -- MMIO
>> reads fail, but PCI configuration space is apparently still there:
>
>> nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
>
>> and it comes with a nice AER dump:
>
>> [12720.894411] pcieport 0000:00:1c.0: AER: Multiple Corrected error received: id=00e0
>> [12720.909747] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e0(Transmitter ID)
>> [12720.909751] pcieport 0000:00:1c.0: device [8086:9d14] error status/mask=00001001/00002000
>> [12720.909754] pcieport 0000:00:1c.0: [ 0] Receiver Error (First)
>> [12720.909756] pcieport 0000:00:1c.0: [12] Replay Timer Timeout
>
> I opened this bugzilla and attached the dmesg and lspci -vv output to
> it: https://bugzilla.kernel.org/show_bug.cgi?id=199695
>
> The root port at 00:1c.0 leads to the NVMe device at 01:00.0 (this is
> nvme0):
>
> 00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1) (prog-if 00 [Normal decode])
> Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
> 01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 (prog-if 02 [NVM Express])
> Subsystem: Samsung Electronics Co Ltd Device a801
>
> We reported several corrected errors before the nvme timeout:
>
> [12750.281158] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
> [12750.297594] nvme nvme0: I/O 455 QID 2 timeout, disable controller
> [12750.305196] nvme 0000:01:00.0: enabling device (0000 -> 0002)
> [12750.305465] nvme nvme0: Removing after probe failure status: -19
> [12750.313188] nvme nvme0: I/O 456 QID 2 timeout, disable controller
> [12750.329152] nvme nvme0: I/O 457 QID 2 timeout, disable controller
>
> The corrected errors are supposedly recovered in hardware without
> software intervention, and AER logs them for informational purposes.
>
> But it seems very likely that these corrected errors are related to
> the nvme timeout: the first corrected errors were logged at
> 12720.894411, nvme_io_timeout defaults to 30 seconds, and the nvme
> timeout was at 12750.281158.
The following patchset might help this issue:
https://marc.info/?l=linux-block&m=152604179903505&w=2
--
Ming Lei
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2018-05-12 2:38 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <CAObL_7G5k7XNqXqDZASChiQd1mt+AJPAdNKz-DE+xQawhUy6ZA@mail.gmail.com>
2018-05-11 16:57 ` Another NVMe failure, this time with AER info Bjorn Helgaas
2018-05-11 17:26 ` Keith Busch
2018-05-11 17:42 ` Keith Busch
2018-05-11 17:55 ` Bjorn Helgaas
2018-05-12 2:38 ` Ming Lei
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).