linux-kernel.vger.kernel.org archive mirror
* Re: ASPM powersupersave change NVMe SSD Samsung 960 PRO capacity to 0 and read-only
       [not found] <20171214184701.GA6322@libmpq.org>
@ 2017-12-15  0:21 ` Bjorn Helgaas
  2017-12-15 15:08   ` Keith Busch
  2017-12-15 17:32   ` Rajat Jain
  0 siblings, 2 replies; 7+ messages in thread
From: Bjorn Helgaas @ 2017-12-15  0:21 UTC (permalink / raw)
  To: Maik Broemme; +Cc: linux-pci, Rajat Jain, Keith Busch, linux-kernel

[+cc Rajat, Keith, linux-kernel]

On Thu, Dec 14, 2017 at 07:47:01PM +0100, Maik Broemme wrote:
> I have a Samsung 960 PRO NVMe SSD (Non-Volatile memory controller:
> Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961). It
> works fine until I enable powersupersave via
> /sys/module/pcie_aspm/parameters/policy
> 
> ASPM is enabled in BIOS and works fine for all devices and in
> powersave mode. I'm able to reproduce this always at any time while
> the system is up and running via:
> 
> $> echo powersupersave > /sys/module/pcie_aspm/parameters/policy
> 
> The Linux kernel is 4.14.4 and APST for my device is working with
> powersave. As soon as I enable powersupersave I get:
> 
> [11535.142755] dpc 0000:00:10.0:pcie010: DPC containment event, status:0x1f09 source:0x0000
> [11535.142760] dpc 0000:00:10.0:pcie010: DPC unmasked uncorrectable error detected, remove downstream devices
> [11535.159999] nvme0n1: detected capacity change from 1024209543168 to 0
> ...

Can you start by opening a bug report at https://bugzilla.kernel.org,
category Drivers/PCI, and attaching the complete "lspci -vv" output
(as root) and the complete dmesg log?  Make sure you have a new enough
lspci to decode the ASPM L1 Substates capability and the LTR bits.
Source is at git://git.kernel.org/pub/scm/utils/pciutils/pciutils.git
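
A rough way to check whether an installed lspci is new enough, sketched in Python. It only searches saved "lspci -vv" text for the labels that recent pciutils print when they decode the L1 PM Substates capability and the LTR bits. The label strings ("L1SubCap", "L1SubCtl1", "LTR+"/"LTR-") are an assumption about the pciutils output format, and the function name is made up for this example.

```python
# Sketch only: the label strings below are assumptions about how recent
# pciutils format "lspci -vv" output; adjust them if your version differs.

def lspci_decodes_l1ss(lspci_text):
    """Return a dict saying which decoded fields appear in the lspci text."""
    return {
        # Printed when lspci understands the L1 PM Substates capability.
        "l1ss_capability": "L1SubCap" in lspci_text,
        "l1ss_control": "L1SubCtl1" in lspci_text,
        # LTR shows up as a flag in the DevCap2/DevCtl2 lines.
        "ltr_bits": "LTR+" in lspci_text or "LTR-" in lspci_text,
    }


if __name__ == "__main__":
    import subprocess

    # Needs root for the full capability dump, as noted above.
    text = subprocess.run(
        ["lspci", "-vv"], capture_output=True, text=True, check=True
    ).stdout
    print(lspci_decodes_l1ss(text))
```

If all three values are False on a system known to have L1SS-capable devices, the lspci binary is probably too old to decode those registers.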

powersupersave enables ASPM L1 Substates.  Rajat, do you have any
ideas about this or how we might debug it?

Keith, is this really all the information about the event that we can
get out of DPC?  Is there some AER logging we might be able to get via
"lspci -vv"?  Sounds like this is the boot disk, so Maik may not be
able to run lspci after the DPC event.  If there *is* any AER info,
can we connect up the DPC event so we can print the AER info from the
kernel?

I wonder if there's some way improper L1 Substate configuration could
cause a DPC event.  There are lots of knobs there that seem to depend
on devices, and I'm not sure we have them all correct yet.

There are some recent changes in that area that are in linux-next:

  PCI/ASPM: Enable Latency Tolerance Reporting when supported
  PCI/ASPM: Calculate LTR_L1.2_THRESHOLD from device characteristics
  PCI/ASPM: Use correct capability pointer to program LTR_L1.2_THRESHOLD
  PCI/ASPM: Account for downstream device's Port Common_Mode_Restore_Time

It's conceivable that they could have some bearing on this problem.
If you could give this a whirl on linux-next, that would be
interesting.  If you do this, please also collect the "lspci -vv"
output there so we can compare it with the v4.14 configuration.

> It looks like APST feature cannot be set anymore after enabling
> powersupersave. Also the PCIe device disappears completely
> from lspci output.

My guess is this is to be expected after the DPC event.  That
basically disconnects the PCIe device from the system.

> Any idea why the device is failing with powersupersave and how to avoid
> it? Especially how to enable it but skip certain broken devices as this
> is my boot device.

We could conceivably add a quirk if we find that L1SS is broken on
this particular device.  But L1SS is so new that I'd be more
suspicious of the Linux code than the device.

Bjorn


* Re: ASPM powersupersave change NVMe SSD Samsung 960 PRO capacity to 0 and read-only
  2017-12-15  0:21 ` ASPM powersupersave change NVMe SSD Samsung 960 PRO capacity to 0 and read-only Bjorn Helgaas
@ 2017-12-15 15:08   ` Keith Busch
  2017-12-15 17:32   ` Rajat Jain
  1 sibling, 0 replies; 7+ messages in thread
From: Keith Busch @ 2017-12-15 15:08 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Maik Broemme, linux-pci, Rajat Jain, linux-kernel

On Thu, Dec 14, 2017 at 06:21:55PM -0600, Bjorn Helgaas wrote:
> [+cc Rajat, Keith, linux-kernel]
> 
> On Thu, Dec 14, 2017 at 07:47:01PM +0100, Maik Broemme wrote:
> > I have a Samsung 960 PRO NVMe SSD (Non-Volatile memory controller:
> > Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961). It
> > works fine until I enable powersupersave via
> > /sys/module/pcie_aspm/parameters/policy
> > 
> > ASPM is enabled in BIOS and works fine for all devices and in
> > powersave mode. I'm able to reproduce this always at any time while
> > the system is up and running via:
> > 
> > $> echo powersupersave > /sys/module/pcie_aspm/parameters/policy
> > 
> > The Linux kernel is 4.14.4 and APST for my device is working with
> > powersave. As soon as I enable powersupersave I get:
> > 
> > [11535.142755] dpc 0000:00:10.0:pcie010: DPC containment event, status:0x1f09 source:0x0000
> > [11535.142760] dpc 0000:00:10.0:pcie010: DPC unmasked uncorrectable error detected, remove downstream devices
> > [11535.159999] nvme0n1: detected capacity change from 1024209543168 to 0
> > ...
> 
> Can you start by opening a bug report at https://bugzilla.kernel.org,
> category Drivers/PCI, and attaching the complete "lspci -vv" output
> (as root) and the complete dmesg log?  Make sure you have a new enough
> lspci to decode the ASPM L1 Substates capability and the LTR bits.
> Source is at git://git.kernel.org/pub/scm/utils/pciutils/pciutils.git
> 
> powersupersave enables ASPM L1 Substates.  Rajat, do you have any
> ideas about this or how we might debug it?
>
> Keith, is this really all the information about the event that we can
> get out of DPC?  Is there some AER logging we might be able to get via
> "lspci -vv"?  Sounds like this is the boot disk, so Maik may not be
> able to run lspci after the DPC event.  If there *is* any AER info,
> can we connect up the DPC event so we can print the AER info from the
> kernel?

There should be information in the AER register. The base spec section
6.2.5 ("Sequence of Device Error Signaling and Logging") says the
corresponding bit in the AER Uncorrectable Error Status register should
be set before triggering DPC. The sequence ends with the DPC trigger,
so the Linux AER service was never notified to handle the event.
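
For reference, the "status:0x1f09" value in the DPC log line can be split into its fields by hand. The sketch below assumes the DPC Status register layout from the DPC Extended Capability in the PCIe Base Specification (trigger status in bit 0, trigger reason in bits 2:1, and so on). The function and dictionary names are made up for illustration and are not taken from the kernel's DPC driver.

```python
# Sketch: decode a DPC Status register value. Field positions follow the
# DPC Extended Capability layout; all names here are illustrative only.

TRIGGER_REASONS = {
    0: "unmasked uncorrectable error",
    1: "ERR_NONFATAL received",
    2: "ERR_FATAL received",
    3: "see trigger reason extension",
}


def decode_dpc_status(status):
    """Split a 16-bit DPC Status value into named fields."""
    return {
        "triggered": bool(status & 0x1),                      # bit 0
        "trigger_reason": TRIGGER_REASONS[(status >> 1) & 0x3],  # bits 2:1
        "interrupt_status": bool(status & 0x8),               # bit 3
        "rp_busy": bool(status & 0x10),                       # bit 4
        "trigger_reason_ext": (status >> 5) & 0x3,            # bits 6:5
        "rp_pio_first_error_pointer": (status >> 8) & 0x1F,   # bits 12:8
    }


if __name__ == "__main__":
    for name, value in decode_dpc_status(0x1F09).items():
        print(f"{name}: {value}")
```

For 0x1f09 the trigger reason field is 0, which matches the "unmasked uncorrectable error" text in the second log line. It says nothing about which uncorrectable error it was; that detail lives in the AER registers.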

As an enhancement to the DPC driver, we may be able to enqueue an AER
event to see if that may provide additional details about the error. I can
implement that enhancement, and should have something for consideration
sometime in the next week.

On a side note, now that root ports are implementing DPC, we should
probably first check whether the platform handles AER in firmware
(firmware-first). The PCIe specification strongly recommends linking
DPC control to that of AER, so I'll try to add that check as well.


* Re: ASPM powersupersave change NVMe SSD Samsung 960 PRO capacity to 0 and read-only
  2017-12-15  0:21 ` ASPM powersupersave change NVMe SSD Samsung 960 PRO capacity to 0 and read-only Bjorn Helgaas
  2017-12-15 15:08   ` Keith Busch
@ 2017-12-15 17:32   ` Rajat Jain
  2017-12-15 19:01     ` Maik Broemme
  1 sibling, 1 reply; 7+ messages in thread
From: Rajat Jain @ 2017-12-15 17:32 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Maik Broemme, linux-pci, Keith Busch, Linux Kernel Mailing List

On Thu, Dec 14, 2017 at 4:21 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> [+cc Rajat, Keith, linux-kernel]
>
> On Thu, Dec 14, 2017 at 07:47:01PM +0100, Maik Broemme wrote:
>> I have a Samsung 960 PRO NVMe SSD (Non-Volatile memory controller:
>> Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961). It
>> works fine until I enable powersupersave via
>> /sys/module/pcie_aspm/parameters/policy
>>
>> ASPM is enabled in BIOS and works fine for all devices and in
>> powersave mode. I'm able to reproduce this always at any time while
>> the system is up and running via:
>>
>> $> echo powersupersave > /sys/module/pcie_aspm/parameters/policy
>>
>> The Linux kernel is 4.14.4 and APST for my device is working with
>> powersave. As soon as I enable powersupersave I get:
>>
>> [11535.142755] dpc 0000:00:10.0:pcie010: DPC containment event, status:0x1f09 source:0x0000
>> [11535.142760] dpc 0000:00:10.0:pcie010: DPC unmasked uncorrectable error detected, remove downstream devices
>> [11535.159999] nvme0n1: detected capacity change from 1024209543168 to 0
>> ...
>
> Can you start by opening a bug report at https://bugzilla.kernel.org,
> category Drivers/PCI, and attaching the complete "lspci -vv" output
> (as root) and the complete dmesg log?  Make sure you have a new enough
> lspci to decode the ASPM L1 Substates capability and the LTR bits.
> Source is at git://git.kernel.org/pub/scm/utils/pciutils/pciutils.git
>
> powersupersave enables ASPM L1 Substates.  Rajat, do you have any
> ideas about this or how we might debug it?


I know Maik mentioned that this is the boot device. Maik, is it
possible to boot off something else so that we can do some more
experiments on this port? If so,
- can you try to see if the device comes back if you switch the ASPM
policy back from "powersupersave" -> powersave, and potentially do a
rescan (echo 1 > /sys/bus/pci/rescan)?
- It would be good to get the complete lspci -vv for the root port
(assuming device is connected to root port i.e. no switch).
Specifically what does the Link status show?
- Also, do you know if your root port provides any debug registers
that could tell the current L1 substate of the link? (My system's root
port had such a register.)
- I had usually resorted to a PCIe analyzer to peek at the packets
when I was debugging it. Not sure if that is an option here.
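
The first experiment above (switch the policy back, then rescan) can be scripted. This is a minimal Python sketch, assuming the two sysfs files already named in this thread and that the policy file lists every policy with the active one in square brackets when read. The helper names and the sysfs_root parameter are made up for this example; the parameter only exists so the code can be pointed at a scratch directory.

```python
# Sketch: switch the ASPM policy and trigger a PCI rescan via sysfs.
# Must run as root on a real system. Helper names are illustrative.
from pathlib import Path

POLICY_FILE = "module/pcie_aspm/parameters/policy"
RESCAN_FILE = "bus/pci/rescan"


def active_policy(policy_text):
    """Return the bracketed entry, e.g. 'powersave' from
    'default performance [powersave] powersupersave'."""
    for word in policy_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def set_policy_and_rescan(policy, sysfs_root="/sys"):
    """Write the policy name, then ask the PCI core to rescan all buses."""
    root = Path(sysfs_root)
    (root / POLICY_FILE).write_text(policy + "\n")  # same as: echo policy > ...
    (root / RESCAN_FILE).write_text("1\n")          # same as: echo 1 > ...


if __name__ == "__main__":
    set_policy_and_rescan("powersave")
    print(active_policy((Path("/sys") / POLICY_FILE).read_text()))
```

After running it, "lspci" and dmesg should show whether the NVMe device was re-enumerated.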

I don't see any debug prints in aspm.c that we could enable. Even with
a debug patch, I suspect we will find that the problem starts at the
last step of pcie_config_aspm_l1ss(), i.e. as soon as we really enable
L1SS in HW. Maik, would you be open to taking a patch that adds some
debug prints and trying it out (compiling your kernel with that patch)?



* Re: ASPM powersupersave change NVMe SSD Samsung 960 PRO capacity to 0 and read-only
  2017-12-15 17:32   ` Rajat Jain
@ 2017-12-15 19:01     ` Maik Broemme
  2018-01-11 17:50       ` Maik Broemme
  0 siblings, 1 reply; 7+ messages in thread
From: Maik Broemme @ 2017-12-15 19:01 UTC (permalink / raw)
  To: Rajat Jain
  Cc: Bjorn Helgaas, linux-pci, Keith Busch, Linux Kernel Mailing List

Hi Rajat,

On Dec 15, 2017, at 18:33, Rajat Jain <rajatja@google.com> wrote:
> On Thu, Dec 14, 2017 at 4:21 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > [+cc Rajat, Keith, linux-kernel]
> >
> > On Thu, Dec 14, 2017 at 07:47:01PM +0100, Maik Broemme wrote:
> >> I have a Samsung 960 PRO NVMe SSD (Non-Volatile memory controller:
> >> Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961). It
> >> works fine until I enable powersupersave via
> >> /sys/module/pcie_aspm/parameters/policy
> >>
> >> ASPM is enabled in BIOS and works fine for all devices and in
> >> powersave mode. I'm able to reproduce this always at any time while
> >> the system is up and running via:
> >>
> >> $> echo powersupersave > /sys/module/pcie_aspm/parameters/policy
> >>
> >> The Linux kernel is 4.14.4 and APST for my device is working with
> >> powersave. As soon as I enable powersupersave I get:
> >>
> >> [11535.142755] dpc 0000:00:10.0:pcie010: DPC containment event, status:0x1f09 source:0x0000
> >> [11535.142760] dpc 0000:00:10.0:pcie010: DPC unmasked uncorrectable error detected, remove downstream devices
> >> [11535.159999] nvme0n1: detected capacity change from 1024209543168 to 0
> >> ...
> >
> > Can you start by opening a bug report at https://bugzilla.kernel.org,
> > category Drivers/PCI, and attaching the complete "lspci -vv" output
> > (as root) and the complete dmesg log?  Make sure you have a new enough
> > lspci to decode the ASPM L1 Substates capability and the LTR bits.
> > Source is at git://git.kernel.org/pub/scm/utils/pciutils/pciutils.git
> >
> > powersupersave enables ASPM L1 Substates.  Rajat, do you have any
> > ideas about this or how we might debug it?
> 
> 
> I know Maik mentioned that this is the boot device. Maik, is it
> possible to boot off something else so that we can do some more
> experiments on this port? If so,
> - can you try to see if the device comes back if you switch the ASPM
> policy back from "powersupersave" -> powersave, and potentially do a
> rescan (echo 1 > /sys/bus/pci/rescan)?

Yes, it is possible. I will do that later today.

> - It would be good to get the complete lspci -vv for the root port
> (assuming device is connected to root port i.e. no switch).
> Specifically what does the Link status show?
> - Also, do you know if your root port provides any debug registers
> that could tell the current L1 substate of the link (My system's root
> port had such register).
> - I had usually resorted to a PCIe analyzer to peek at the packets
> when I was debugging it. Not sure if that is an option here.
> 
> I don't see any debug prints in aspm.c that we could enable. Even if I
> provide a patch, I suspect that the problem will start at the last
> step of the pcie_config_aspm_l1ss() i.e. as soon as we really enable
> it in HW. Maik, would you be open to take a debug patch that adds some
> debug prints and try it out (compile your kernel with that patch)?
> 

Sure, that is fine. I will also re-run later today with 4.15rc3.


--Maik


* Re: ASPM powersupersave change NVMe SSD Samsung 960 PRO capacity to 0 and read-only
  2017-12-15 19:01     ` Maik Broemme
@ 2018-01-11 17:50       ` Maik Broemme
  2018-01-11 17:59         ` Keith Busch
  0 siblings, 1 reply; 7+ messages in thread
From: Maik Broemme @ 2018-01-11 17:50 UTC (permalink / raw)
  To: Rajat Jain
  Cc: Bjorn Helgaas, linux-pci, Keith Busch, Linux Kernel Mailing List

Hi Rajat,

On Dec 15, 2017, at 20:01, Maik Broemme <mbroemme@libmpq.org> wrote:
> Hi Rajat,
> 
> On Dec 15, 2017, at 18:33, Rajat Jain <rajatja@google.com> wrote:
> > On Thu, Dec 14, 2017 at 4:21 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > [+cc Rajat, Keith, linux-kernel]
> > >
> > > On Thu, Dec 14, 2017 at 07:47:01PM +0100, Maik Broemme wrote:
> > >> I have a Samsung 960 PRO NVMe SSD (Non-Volatile memory controller:
> > >> Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961). It
> > >> works fine until I enable powersupersave via
> > >> /sys/module/pcie_aspm/parameters/policy
> > >>
> > >> ASPM is enabled in BIOS and works fine for all devices and in
> > >> powersave mode. I'm able to reproduce this always at any time while
> > >> the system is up and running via:
> > >>
> > >> $> echo powersupersave > /sys/module/pcie_aspm/parameters/policy
> > >>
> > >> The Linux kernel is 4.14.4 and APST for my device is working with
> > >> powersave. As soon as I enable powersupersave I get:
> > >>
> > >> [11535.142755] dpc 0000:00:10.0:pcie010: DPC containment event, status:0x1f09 source:0x0000
> > >> [11535.142760] dpc 0000:00:10.0:pcie010: DPC unmasked uncorrectable error detected, remove downstream devices
> > >> [11535.159999] nvme0n1: detected capacity change from 1024209543168 to 0
> > >> ...
> > >
> > > Can you start by opening a bug report at https://bugzilla.kernel.org,
> > > category Drivers/PCI, and attaching the complete "lspci -vv" output
> > > (as root) and the complete dmesg log?  Make sure you have a new enough
> > > lspci to decode the ASPM L1 Substates capability and the LTR bits.
> > > Source is at git://git.kernel.org/pub/scm/utils/pciutils/pciutils.git
> > >
> > > powersupersave enables ASPM L1 Substates.  Rajat, do you have any
> > > ideas about this or how we might debug it?
> > 
> > 
> > I know Maik mentioned that this is the boot device. Maik, is it
> > possible to boot off something else so that we can do some more
> > experiments on this port? If so,
> > - can you try to see if the device comes back if you switch the ASPM
> > policy back from "powersupersave" -> powersave, and potentially do a
> > rescan (echo 1 > /sys/bus/pci/rescan)?
> 
> Yes it is possible, will do later today.
> 

I've re-run the test with 4.15rc7.r111.g5f615b97cdea and the following
patches from Keith:

[PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported
[PATCH 2/4] PCI/AER: Provide API for getting AER information
[PATCH 3/4] PCI/DPC: Enable DPC in conjuction with AER
[PATCH 4/4] PCI/DPC: Print AER status in DPC event handling

The issue is still the same. In addition to the earlier output, I now see:

Jan 11 18:34:45 server.theraso.int kernel: dpc 0000:00:10.0:pcie010: DPC containment event, status:0x1f09 source:0x0000
Jan 11 18:34:45 server.theraso.int kernel: dpc 0000:00:10.0:pcie010: DPC unmasked uncorrectable error detected, remove downstream devices
Jan 11 18:34:45 server.theraso.int kernel: pcieport 0000:00:10.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0080(Receiver ID)
Jan 11 18:34:45 server.theraso.int kernel: pcieport 0000:00:10.0:   device [8086:19aa] error status/mask=00000020/00000000
Jan 11 18:34:45 server.theraso.int kernel: pcieport 0000:00:10.0:    [ 5] Surprise Down Error    (First)
Jan 11 18:34:46 server.theraso.int kernel: nvme0n1: detected capacity change from 1024209543168 to 0


--Maik


* Re: ASPM powersupersave change NVMe SSD Samsung 960 PRO capacity to 0 and read-only
  2018-01-11 17:50       ` Maik Broemme
@ 2018-01-11 17:59         ` Keith Busch
  2018-01-11 20:22           ` Rajat Jain
  0 siblings, 1 reply; 7+ messages in thread
From: Keith Busch @ 2018-01-11 17:59 UTC (permalink / raw)
  To: Maik Broemme
  Cc: Rajat Jain, Bjorn Helgaas, linux-pci, Linux Kernel Mailing List

On Thu, Jan 11, 2018 at 06:50:40PM +0100, Maik Broemme wrote:
> I've re-run the test with 4.15rc7.r111.g5f615b97cdea and the following
> patches from Keith:
> 
> [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported
> [PATCH 2/4] PCI/AER: Provide API for getting AER information
> [PATCH 3/4] PCI/DPC: Enable DPC in conjuction with AER
> [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling
> 
> The issue is still the same. Additionally to the output before I see now:
> 
> Jan 11 18:34:45 server.theraso.int kernel: dpc 0000:00:10.0:pcie010: DPC containment event, status:0x1f09 source:0x0000
> Jan 11 18:34:45 server.theraso.int kernel: dpc 0000:00:10.0:pcie010: DPC unmasked uncorrectable error detected, remove downstream devices
> Jan 11 18:34:45 server.theraso.int kernel: pcieport 0000:00:10.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0080(Receiver ID)
> Jan 11 18:34:45 server.theraso.int kernel: pcieport 0000:00:10.0:   device [8086:19aa] error status/mask=00000020/00000000
> Jan 11 18:34:45 server.theraso.int kernel: pcieport 0000:00:10.0:    [ 5] Surprise Down Error    (First)
> Jan 11 18:34:46 server.theraso.int kernel: nvme0n1: detected capacity change from 1024209543168 to 0

Okay, so that series wasn't going to fix anything, but at least it gets
some visibility into what's happened. The DPC was triggered due to a
Surprise Down uncorrectable error, so the power setting is causing the
link to fail.
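
To show how the "error status/mask=00000020/00000000" line maps to "Surprise Down Error", here is a small Python sketch that decodes an AER Uncorrectable Error Status value. The bit positions follow the AER Extended Capability in the PCIe Base Specification; the table and function names are made up for illustration and are not the kernel's own.

```python
# Sketch: list the unmasked errors set in an AER Uncorrectable Error
# Status value. Names are illustrative; bit numbers follow the PCIe spec.

AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_aer_uncorrectable(status, mask=0):
    """Return (bit, name) pairs for errors that are set and not masked."""
    unmasked = status & ~mask
    return [
        (bit, name)
        for bit, name in sorted(AER_UNCORRECTABLE_BITS.items())
        if unmasked & (1 << bit)
    ]


if __name__ == "__main__":
    for bit, name in decode_aer_uncorrectable(0x00000020, 0x00000000):
        print(f"[{bit:2d}] {name}")
```

With the values from the log, only bit 5 is set, which is the Surprise Down Error the kernel printed.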

The NVMe driver has quirks specifically for this vendor's devices to
fence off NVMe specific automated power settings. Your observations
appear to align with the same issues.


* Re: ASPM powersupersave change NVMe SSD Samsung 960 PRO capacity to 0 and read-only
  2018-01-11 17:59         ` Keith Busch
@ 2018-01-11 20:22           ` Rajat Jain
  0 siblings, 0 replies; 7+ messages in thread
From: Rajat Jain @ 2018-01-11 20:22 UTC (permalink / raw)
  To: Keith Busch
  Cc: Maik Broemme, Bjorn Helgaas, linux-pci, Linux Kernel Mailing List

On Thu, Jan 11, 2018 at 9:59 AM, Keith Busch <keith.busch@intel.com> wrote:
> On Thu, Jan 11, 2018 at 06:50:40PM +0100, Maik Broemme wrote:
>> I've re-run the test with 4.15rc7.r111.g5f615b97cdea and the following
>> patches from Keith:
>>
>> [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported
>> [PATCH 2/4] PCI/AER: Provide API for getting AER information
>> [PATCH 3/4] PCI/DPC: Enable DPC in conjuction with AER
>> [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling
>>
>> The issue is still the same. Additionally to the output before I see now:
>>
>> Jan 11 18:34:45 server.theraso.int kernel: dpc 0000:00:10.0:pcie010: DPC containment event, status:0x1f09 source:0x0000
>> Jan 11 18:34:45 server.theraso.int kernel: dpc 0000:00:10.0:pcie010: DPC unmasked uncorrectable error detected, remove downstream devices
>> Jan 11 18:34:45 server.theraso.int kernel: pcieport 0000:00:10.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0080(Receiver ID)
>> Jan 11 18:34:45 server.theraso.int kernel: pcieport 0000:00:10.0:   device [8086:19aa] error status/mask=00000020/00000000
>> Jan 11 18:34:45 server.theraso.int kernel: pcieport 0000:00:10.0:    [ 5] Surprise Down Error    (First)
>> Jan 11 18:34:46 server.theraso.int kernel: nvme0n1: detected capacity change from 1024209543168 to 0
>
> Okay, so that series wasn't going to fix anything, but at least it gets
> some visibility into what's happened. The DPC was triggered due to a
> Surprise Down uncorrectable error, so the power setting is causing the
> link to fail.
>
> The NVMe driver has quirks specifically for this vendor's devices to
> fence off NVMe specific automated power settings. Your observations
> appear to align with the same issues.

Agree.

                /*
                 * Samsung SSD 960 EVO drops off the PCIe bus after system
                 * suspend on a Ryzen board, ASUS PRIME B350M-A.
                 */
                if (dmi_match(DMI_BOARD_VENDOR, "ASUSTeK COMPUTER INC.") &&
                    dmi_match(DMI_BOARD_NAME, "PRIME B350M-A"))
                        return NVME_QUIRK_NO_APST;

It seems that the attempt to save extra power using ASPM L1 substates
is causing the device to fall off the bus. Sorry, but I suspect that
this may be difficult to debug without a PCIe analyzer. Some debugging
directions:

- Assuming this is a hotpluggable device, try with another NVMe to
verify if the issue is specific to this device.
- Can you please try switch the ASPM policy back from "powersupersave"
-> powersave, and potentially do a rescan (echo 1 >
/sys/bus/pci/rescan), and see if the device comes back (and goes away
again when you switch back to powersupersave)?
- Maybe put some debug prints in pcie_config_aspm_l1ss() to see which
register write causes the device to fall off (most likely
this would be the last statement, but just throwing ideas).
- Maybe dump the timing parameters link->l1ss.ctl1 and
link->l1ss.ctl2 from aspm_calc_l1ss_info(), and try to play with them
a little.
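
If the ctl1/ctl2 values do get dumped, something like the following Python sketch could turn them into readable timings. It assumes the dumped values use the same field layout as the L1 PM Substates Control 1 and Control 2 registers (enable bits, Common_Mode_Restore_Time, LTR_L1.2_THRESHOLD value and scale, T_POWER_ON value and scale). The function names are made up, and the sample numbers in the usage section are invented, not taken from Maik's system.

```python
# Sketch: decode L1 PM Substates Control 1/2 values into timings.
# Assumes register-style layout; names and sample values are illustrative.

LTR_SCALE_NS = {0: 1, 1: 32, 2: 1024, 3: 32768, 4: 1048576, 5: 33554432}
T_POWER_ON_SCALE_US = {0: 2, 1: 10, 2: 100}


def decode_l1ss_ctl1(ctl1):
    scale = (ctl1 >> 29) & 0x7       # LTR_L1.2_THRESHOLD_Scale, bits 31:29
    value = (ctl1 >> 16) & 0x3FF     # LTR_L1.2_THRESHOLD_Value, bits 25:16
    return {
        "pcipm_l1_2": bool(ctl1 & 0x1),
        "pcipm_l1_1": bool(ctl1 & 0x2),
        "aspm_l1_2": bool(ctl1 & 0x4),
        "aspm_l1_1": bool(ctl1 & 0x8),
        "common_mode_restore_time_us": (ctl1 >> 8) & 0xFF,
        # None marks a reserved scale encoding.
        "ltr_l1_2_threshold_ns": (
            value * LTR_SCALE_NS[scale] if scale in LTR_SCALE_NS else None
        ),
    }


def decode_l1ss_ctl2(ctl2):
    scale = ctl2 & 0x3               # T_POWER_ON Scale, bits 1:0
    value = (ctl2 >> 3) & 0x1F       # T_POWER_ON Value, bits 7:3
    return {
        "t_power_on_us": (
            value * T_POWER_ON_SCALE_US[scale]
            if scale in T_POWER_ON_SCALE_US
            else None
        ),
    }


if __name__ == "__main__":
    # Invented sample values, for illustration only.
    print(decode_l1ss_ctl1(0x4064280F))
    print(decode_l1ss_ctl2(0x29))
```

Comparing the decoded timings between v4.14 and linux-next might show whether the recent LTR_L1.2_THRESHOLD and Common_Mode_Restore_Time changes alter what gets programmed for this link.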

