Linux-NVME Archive on lore.kernel.org
* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
@ 2019-07-25  9:51 rjw
  2019-07-25 14:02 ` kai.heng.feng
  2019-07-25 14:52 ` kbusch
  0 siblings, 2 replies; 75+ messages in thread
From: rjw @ 2019-07-25  9:51 UTC (permalink / raw)


Hi Keith,

Unfortunately,

commit d916b1be94b6dc8d293abed2451f3062f6af7551
Author: Keith Busch <keith.busch at intel.com>
Date:   Thu May 23 09:27:35 2019 -0600

    nvme-pci: use host managed power state for suspend

doesn't universally improve things.  In fact, in some cases it makes things worse.

For example, on the Dell XPS13 9380 I have here it prevents the processor package
from reaching idle states deeper than PC2 in suspend-to-idle (which, of course, also
prevents the SoC from reaching any kind of S0ix).

That can be readily explained too.  Namely, with the commit above the NVMe device
stays in D0 over suspend/resume, so the root port it is connected to also has to stay in
D0 and that "blocks" package C-states deeper than PC2.

In order for the root port to be able to go to D3, the device connected to it also needs
to go into D3, so it looks like (at least on this particular machine, but maybe in
general), both D3 and the NVMe-specific PM are needed.

I'm not sure what to do here, because evidently there are systems where that commit
helps.  I was thinking about adding a module option allowing the user to override the
default behavior which in turn should be compatible with 5.2 and earlier kernels.

Cheers,
Rafael

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-25  9:51 [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems rjw
@ 2019-07-25 14:02 ` kai.heng.feng
  2019-07-25 16:23   ` Mario.Limonciello
  2019-07-25 16:59   ` [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems rafael
  2019-07-25 14:52 ` kbusch
  1 sibling, 2 replies; 75+ messages in thread
From: kai.heng.feng @ 2019-07-25 14:02 UTC (permalink / raw)


Hi Rafael,

at 17:51, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:

> Hi Keith,
>
> Unfortunately,
>
> commit d916b1be94b6dc8d293abed2451f3062f6af7551
> Author: Keith Busch <keith.busch at intel.com>
> Date:   Thu May 23 09:27:35 2019 -0600
>
>     nvme-pci: use host managed power state for suspend
>
> doesn't universally improve things.  In fact, in some cases it makes  
> things worse.
>
> For example, on the Dell XPS13 9380 I have here it prevents the processor  
> package
> from reaching idle states deeper than PC2 in suspend-to-idle (which, of  
> course, also
> prevents the SoC from reaching any kind of S0ix).
>
> That can be readily explained too.  Namely, with the commit above the  
> NVMe device
> stays in D0 over suspend/resume, so the root port it is connected to also  
> has to stay in
> D0 and that "blocks" package C-states deeper than PC2.
>
> In order for the root port to be able to go to D3, the device connected  
> to it also needs
> to go into D3, so it looks like (at least on this particular machine, but  
> maybe in
> general), both D3 and the NVMe-specific PM are needed.
>
> I'm not sure what to do here, because evidently there are systems where  
> that commit
> helps.  I was thinking about adding a module option allowing the user to  
> override the
> default behavior which in turn should be compatible with 5.2 and earlier  
> kernels.

I just briefly tested s2i on XPS 9370, and the power meter shows a 0.8~0.9W  
power consumption so at least I don't see the issue on XPS 9370.

Can you please provide the output of `nvme id-ctrl /dev/nvme*` and I'll
test the NVMe controller on XPS 9380.

Kai-Heng

>
> Cheers,
> Rafael


* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-25  9:51 [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems rjw
  2019-07-25 14:02 ` kai.heng.feng
@ 2019-07-25 14:52 ` kbusch
  2019-07-25 19:48   ` rjw
  1 sibling, 1 reply; 75+ messages in thread
From: kbusch @ 2019-07-25 14:52 UTC (permalink / raw)


On Thu, Jul 25, 2019 at 02:51:41AM -0700, Rafael J. Wysocki wrote:
> Hi Keith,
> 
> Unfortunately,
> 
> commit d916b1be94b6dc8d293abed2451f3062f6af7551
> Author: Keith Busch <keith.busch at intel.com>
> Date:   Thu May 23 09:27:35 2019 -0600
> 
>     nvme-pci: use host managed power state for suspend
> 
> doesn't universally improve things.  In fact, in some cases it makes things worse.
> 
> For example, on the Dell XPS13 9380 I have here it prevents the processor package
> from reaching idle states deeper than PC2 in suspend-to-idle (which, of course, also
> prevents the SoC from reaching any kind of S0ix).
> 
> That can be readily explained too.  Namely, with the commit above the NVMe device
> stays in D0 over suspend/resume, so the root port it is connected to also has to stay in
> D0 and that "blocks" package C-states deeper than PC2.
> 
> In order for the root port to be able to go to D3, the device connected to it also needs
> to go into D3, so it looks like (at least on this particular machine, but maybe in
> general), both D3 and the NVMe-specific PM are needed.
> 
> I'm not sure what to do here, because evidently there are systems where that commit
> helps.  I was thinking about adding a module option allowing the user to override the
> default behavior which in turn should be compatible with 5.2 and earlier kernels.

Darn, that's too bad. I don't think we can improve one thing at the
expense of another, so unless we find an acceptable criterion to select
which low-power mode to use, I would be inclined to support a revert or
a kernel option to default to the previous behavior.

One thing we might check before using NVMe power states is if the lowest
PS is non-operational with MP below some threshold. What does your device
report for:

  nvme id-ctrl /dev/nvme0

?
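Keith's heuristic above can be sketched as a small filter over the text output of `nvme id-ctrl`. The awk pattern and the 0.1 W cutoff are assumptions for illustration; the thread never settles on a threshold value.

```shell
# Sketch of the proposed check: pipe `nvme id-ctrl /dev/nvmeX` output into
# this filter and report whether the deepest (last listed) power state is
# non-operational with max power (mp) under a threshold. The 0.1 W cutoff
# is an invented example value, not one taken from the thread.
deepest_ps_check() {
  awk '
    /^ps +[0-9]+ :/ {
      # mp field looks like "mp:0.0070W"; strip the prefix and unit
      for (i = 1; i <= NF; i++)
        if ($i ~ /^mp:/) { gsub(/mp:|W/, "", $i); mp = $i }
      nonop = ($0 ~ /non-operational/) ? 1 : 0
    }
    END {
      if (nonop && mp + 0 < 0.1)
        print "deepest PS qualifies for host-managed suspend"
      else
        print "deepest PS does not qualify; fall back to D3"
    }'
}
```

Note that the `ps 4` entry Rafael posts later in the thread (mp:0.0070W, non-operational) would still pass such a check, which is consistent with the later observation that nothing in the output stands out as a discriminator.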


* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-25 14:02 ` kai.heng.feng
@ 2019-07-25 16:23   ` Mario.Limonciello
  2019-07-25 17:03     ` rafael
  2019-07-25 16:59   ` [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems rafael
  1 sibling, 1 reply; 75+ messages in thread
From: Mario.Limonciello @ 2019-07-25 16:23 UTC (permalink / raw)


+Rajat

> -----Original Message-----
> From: Kai-Heng Feng <kai.heng.feng at canonical.com>
> Sent: Thursday, July 25, 2019 9:03 AM
> To: Rafael J. Wysocki
> Cc: Keith Busch; Christoph Hellwig; Sagi Grimberg; linux-
> nvme at lists.infradead.org; Limonciello, Mario; Linux PM; LKML
> Subject: Re: [Regression] Commit "nvme/pci: Use host managed power state for
> suspend" has problems
> 
> 
> [EXTERNAL EMAIL]
> 
> Hi Rafael,
> 
> at 17:51, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> 
> > Hi Keith,
> >
> > Unfortunately,
> >
> > commit d916b1be94b6dc8d293abed2451f3062f6af7551
> > Author: Keith Busch <keith.busch at intel.com>
> > Date:   Thu May 23 09:27:35 2019 -0600
> >
> >     nvme-pci: use host managed power state for suspend
> >
> > doesn't universally improve things.  In fact, in some cases it makes
> > things worse.
> >
> > For example, on the Dell XPS13 9380 I have here it prevents the processor
> > package
> > from reaching idle states deeper than PC2 in suspend-to-idle (which, of
> > course, also
> > prevents the SoC from reaching any kind of S0ix).
> >
> > That can be readily explained too.  Namely, with the commit above the
> > NVMe device
> > stays in D0 over suspend/resume, so the root port it is connected to also
> > has to stay in
> > D0 and that "blocks" package C-states deeper than PC2.
> >
> > In order for the root port to be able to go to D3, the device connected
> > to it also needs
> > to go into D3, so it looks like (at least on this particular machine, but
> > maybe in
> > general), both D3 and the NVMe-specific PM are needed.

Well this is really unfortunate to hear.  I recall that with some disks we were
seeing problems where NVME specific PM wasn't working when the disk was in D3.

On your specific disk, it would be good to know if just removing the pci_save_state(pdev)
call helps.

If so:
* that might be a better option to add as a parameter.
* maybe we should double check all the disks one more time with that tweak.

> >
> > I'm not sure what to do here, because evidently there are systems where
> > that commit
> > helps.  I was thinking about adding a module option allowing the user to
> > override the
> > default behavior which in turn should be compatible with 5.2 and earlier
> > kernels.
> 
> I just briefly tested s2i on XPS 9370, and the power meter shows a 0.8~0.9W
> power consumption so at least I don't see the issue on XPS 9370.
> 

To me that confirms NVME is down, but it still seems higher than I would have
expected.  We should be more typically in the order of ~0.3W I think.

> Can you please provide the output of `nvme id-ctrl /dev/nvme*` and I'll
> test the NVMe controller on XPS 9380.
> 
> Kai-Heng
> 
> >


* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-25 14:02 ` kai.heng.feng
  2019-07-25 16:23   ` Mario.Limonciello
@ 2019-07-25 16:59   ` rafael
  1 sibling, 0 replies; 75+ messages in thread
From: rafael @ 2019-07-25 16:59 UTC (permalink / raw)


On Thu, Jul 25, 2019 at 4:02 PM Kai-Heng Feng
<kai.heng.feng@canonical.com> wrote:
>
> Hi Rafael,
>
> at 17:51, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> > Hi Keith,
> >
> > Unfortunately,
> >
> > commit d916b1be94b6dc8d293abed2451f3062f6af7551
> > Author: Keith Busch <keith.busch at intel.com>
> > Date:   Thu May 23 09:27:35 2019 -0600
> >
> >     nvme-pci: use host managed power state for suspend
> >
> > doesn't universally improve things.  In fact, in some cases it makes
> > things worse.
> >
> > For example, on the Dell XPS13 9380 I have here it prevents the processor
> > package
> > from reaching idle states deeper than PC2 in suspend-to-idle (which, of
> > course, also
> > prevents the SoC from reaching any kind of S0ix).
> >
> > That can be readily explained too.  Namely, with the commit above the
> > NVMe device
> > stays in D0 over suspend/resume, so the root port it is connected to also
> > has to stay in
> > D0 and that "blocks" package C-states deeper than PC2.
> >
> > In order for the root port to be able to go to D3, the device connected
> > to it also needs
> > to go into D3, so it looks like (at least on this particular machine, but
> > maybe in
> > general), both D3 and the NVMe-specific PM are needed.
> >
> > I'm not sure what to do here, because evidently there are systems where
> > that commit
> > helps.  I was thinking about adding a module option allowing the user to
> > override the
> > default behavior which in turn should be compatible with 5.2 and earlier
> > kernels.
>
> I just briefly tested s2i on XPS 9370, and the power meter shows a 0.8~0.9W
> power consumption so at least I don't see the issue on XPS 9370.

It works for me on an XPS13 9360 too, only the 9380 is problematic.

> Can you please provide the output of `nvme id-ctrl /dev/nvme*` and I'll
> test the NVMe controller on XPS 9380.

I'll reply to Keith with that later.


* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-25 16:23   ` Mario.Limonciello
@ 2019-07-25 17:03     ` rafael
  2019-07-25 17:23       ` Mario.Limonciello
                         ` (2 more replies)
  0 siblings, 3 replies; 75+ messages in thread
From: rafael @ 2019-07-25 17:03 UTC (permalink / raw)


On Thu, Jul 25, 2019 at 6:24 PM <Mario.Limonciello@dell.com> wrote:
>
> +Rajat
>
> > -----Original Message-----
> > From: Kai-Heng Feng <kai.heng.feng at canonical.com>
> > Sent: Thursday, July 25, 2019 9:03 AM
> > To: Rafael J. Wysocki
> > Cc: Keith Busch; Christoph Hellwig; Sagi Grimberg; linux-
> > nvme at lists.infradead.org; Limonciello, Mario; Linux PM; LKML
> > Subject: Re: [Regression] Commit "nvme/pci: Use host managed power state for
> > suspend" has problems
> >
> >
> > [EXTERNAL EMAIL]
> >
> > Hi Rafael,
> >
> > at 17:51, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >
> > > Hi Keith,
> > >
> > > Unfortunately,
> > >
> > > commit d916b1be94b6dc8d293abed2451f3062f6af7551
> > > Author: Keith Busch <keith.busch at intel.com>
> > > Date:   Thu May 23 09:27:35 2019 -0600
> > >
> > >     nvme-pci: use host managed power state for suspend
> > >
> > > doesn't universally improve things.  In fact, in some cases it makes
> > > things worse.
> > >
> > > For example, on the Dell XPS13 9380 I have here it prevents the processor
> > > package
> > > from reaching idle states deeper than PC2 in suspend-to-idle (which, of
> > > course, also
> > > prevents the SoC from reaching any kind of S0ix).
> > >
> > > That can be readily explained too.  Namely, with the commit above the
> > > NVMe device
> > > stays in D0 over suspend/resume, so the root port it is connected to also
> > > has to stay in
> > > D0 and that "blocks" package C-states deeper than PC2.
> > >
> > > In order for the root port to be able to go to D3, the device connected
> > > to it also needs
> > > to go into D3, so it looks like (at least on this particular machine, but
> > > maybe in
> > > general), both D3 and the NVMe-specific PM are needed.
>
> Well this is really unfortunate to hear.  I recall that with some disks we were
> seeing problems where NVME specific PM wasn't working when the disk was in D3.
>
> On your specific disk, it would be good to know if just removing the pci_save_state(pdev)
> call helps.

Yes, it does help.

> If so:
> * that might be a better option to add as a parameter.
> * maybe we should double check all the disks one more time with that tweak.

At this point it seems so.

> > > I'm not sure what to do here, because evidently there are systems where
> > > that commit
> > > helps.  I was thinking about adding a module option allowing the user to
> > > override the
> > > default behavior which in turn should be compatible with 5.2 and earlier
> > > kernels.
> >
> > I just briefly tested s2i on XPS 9370, and the power meter shows a 0.8~0.9W
> > power consumption so at least I don't see the issue on XPS 9370.
> >
>
> To me that confirms NVME is down, but it still seems higher than I would have
> expected.  We should be more typically in the order of ~0.3W I think.

It may go to PC10, but not reach S0ix.

Anyway, I run the s2idle tests under turbostat which then tells me
what has happened more precisely.


* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-25 17:03     ` rafael
@ 2019-07-25 17:23       ` Mario.Limonciello
  2019-07-25 18:20       ` kai.heng.feng
  2019-07-30 10:45       ` rjw
  2 siblings, 0 replies; 75+ messages in thread
From: Mario.Limonciello @ 2019-07-25 17:23 UTC (permalink / raw)


> -----Original Message-----
> From: Rafael J. Wysocki <rafael at kernel.org>
> Sent: Thursday, July 25, 2019 12:04 PM
> To: Limonciello, Mario
> Cc: Kai-Heng Feng; Rafael J. Wysocki; Keith Busch; Christoph Hellwig; Sagi
> Grimberg; linux-nvme; Linux PM; Linux Kernel Mailing List; Rajat Jain
> Subject: Re: [Regression] Commit "nvme/pci: Use host managed power state for
> suspend" has problems
> 
> 
> [EXTERNAL EMAIL]
> 
> On Thu, Jul 25, 2019 at 6:24 PM <Mario.Limonciello@dell.com> wrote:
> >
> > +Rajat
> >
> > > -----Original Message-----
> > > From: Kai-Heng Feng <kai.heng.feng at canonical.com>
> > > Sent: Thursday, July 25, 2019 9:03 AM
> > > To: Rafael J. Wysocki
> > > Cc: Keith Busch; Christoph Hellwig; Sagi Grimberg; linux-
> > > nvme at lists.infradead.org; Limonciello, Mario; Linux PM; LKML
> > > Subject: Re: [Regression] Commit "nvme/pci: Use host managed power state
> for
> > > suspend" has problems
> > >
> > >
> > > [EXTERNAL EMAIL]
> > >
> > > Hi Rafael,
> > >
> > > at 17:51, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > >
> > > > Hi Keith,
> > > >
> > > > Unfortunately,
> > > >
> > > > commit d916b1be94b6dc8d293abed2451f3062f6af7551
> > > > Author: Keith Busch <keith.busch at intel.com>
> > > > Date:   Thu May 23 09:27:35 2019 -0600
> > > >
> > > >     nvme-pci: use host managed power state for suspend
> > > >
> > > > doesn't universally improve things.  In fact, in some cases it makes
> > > > things worse.
> > > >
> > > > For example, on the Dell XPS13 9380 I have here it prevents the processor
> > > > package
> > > > from reaching idle states deeper than PC2 in suspend-to-idle (which, of
> > > > course, also
> > > > prevents the SoC from reaching any kind of S0ix).
> > > >
> > > > That can be readily explained too.  Namely, with the commit above the
> > > > NVMe device
> > > > stays in D0 over suspend/resume, so the root port it is connected to also
> > > > has to stay in
> > > > D0 and that "blocks" package C-states deeper than PC2.
> > > >
> > > > In order for the root port to be able to go to D3, the device connected
> > > > to it also needs
> > > > to go into D3, so it looks like (at least on this particular machine, but
> > > > maybe in
> > > > general), both D3 and the NVMe-specific PM are needed.
> >
> > Well this is really unfortunate to hear.  I recall that with some disks we were
> > seeing problems where NVME specific PM wasn't working when the disk was in
> D3.
> >
> > On your specific disk, it would be good to know if just removing the
> pci_save_state(pdev)
> > call helps.
> 
> Yes, it does help.
> 
> > If so:
> > * that might be a better option to add as a parameter.
> > * maybe we should double check all the disks one more time with that tweak.
> 
> At this point it seems so.

OK, I've asked someone in my lab to check across a variety of otherwise working SSDs
with that modification.

Hopefully KH can also check that in his lab, as he has more SSDs readily available.

> 
> > > > I'm not sure what to do here, because evidently there are systems where
> > > > that commit
> > > > helps.  I was thinking about adding a module option allowing the user to
> > > > override the
> > > > default behavior which in turn should be compatible with 5.2 and earlier
> > > > kernels.
> > >
> > > I just briefly tested s2i on XPS 9370, and the power meter shows a 0.8~0.9W
> > > power consumption so at least I don't see the issue on XPS 9370.
> > >
> >
> > To me that confirms NVME is down, but it still seems higher than I would have
> > expected.  We should be more typically in the order of ~0.3W I think.
> 
> It may go to PC10, but not reach S0ix.
> 
> Anyway, I run the s2idle tests under turbostat which then tells me
> what has happened more precisely.

To echo the request earlier, it would be good to know exactly which SSD you have
here in your 9380.  Specifically I'd like to know the vendor/model and if the SSD
you're using requests HMB.
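As an aside, whether a drive requests HMB can be read from the same `nvme id-ctrl` output: a nonzero `hmpre` field (Host Memory Buffer Preferred Size) means the controller asks the host for a buffer. A minimal filter over that text output, assuming the default formatting:

```shell
# Report whether an `nvme id-ctrl` dump (read on stdin) shows the
# controller requesting a Host Memory Buffer: hmpre is the HMB Preferred
# Size, so a nonzero value means the device wants an HMB.
hmb_requested() {
  awk -F: '/^hmpre/ {
    gsub(/ /, "", $2)
    print (($2 + 0 > 0) ? "requests HMB" : "no HMB requested")
  }'
}
```

The id-ctrl dump Rafael posts later in the thread shows `hmpre : 0`, i.e. that particular SK hynix PC401 does not request HMB.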


* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-25 17:03     ` rafael
  2019-07-25 17:23       ` Mario.Limonciello
@ 2019-07-25 18:20       ` kai.heng.feng
  2019-07-25 19:09         ` Mario.Limonciello
  2019-07-30 10:45       ` rjw
  2 siblings, 1 reply; 75+ messages in thread
From: kai.heng.feng @ 2019-07-25 18:20 UTC (permalink / raw)


at 01:03, Rafael J. Wysocki <rafael@kernel.org> wrote:

> On Thu, Jul 25, 2019 at 6:24 PM <Mario.Limonciello@dell.com> wrote:
>> +Rajat
>>
>>> -----Original Message-----
>>> From: Kai-Heng Feng <kai.heng.feng at canonical.com>
>>> Sent: Thursday, July 25, 2019 9:03 AM
>>> To: Rafael J. Wysocki
>>> Cc: Keith Busch; Christoph Hellwig; Sagi Grimberg; linux-
>>> nvme at lists.infradead.org; Limonciello, Mario; Linux PM; LKML
>>> Subject: Re: [Regression] Commit "nvme/pci: Use host managed power  
>>> state for
>>> suspend" has problems
>>>
>>>
>>> [EXTERNAL EMAIL]
>>>
>>> Hi Rafael,
>>>
>>> at 17:51, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>>>
>>>> Hi Keith,
>>>>
>>>> Unfortunately,
>>>>
>>>> commit d916b1be94b6dc8d293abed2451f3062f6af7551
>>>> Author: Keith Busch <keith.busch at intel.com>
>>>> Date:   Thu May 23 09:27:35 2019 -0600
>>>>
>>>>     nvme-pci: use host managed power state for suspend
>>>>
>>>> doesn't universally improve things.  In fact, in some cases it makes
>>>> things worse.
>>>>
>>>> For example, on the Dell XPS13 9380 I have here it prevents the  
>>>> processor
>>>> package
>>>> from reaching idle states deeper than PC2 in suspend-to-idle (which, of
>>>> course, also
>>>> prevents the SoC from reaching any kind of S0ix).
>>>>
>>>> That can be readily explained too.  Namely, with the commit above the
>>>> NVMe device
>>>> stays in D0 over suspend/resume, so the root port it is connected to  
>>>> also
>>>> has to stay in
>>>> D0 and that "blocks" package C-states deeper than PC2.
>>>>
>>>> In order for the root port to be able to go to D3, the device connected
>>>> to it also needs
>>>> to go into D3, so it looks like (at least on this particular machine,  
>>>> but
>>>> maybe in
>>>> general), both D3 and the NVMe-specific PM are needed.
>>
>> Well this is really unfortunate to hear.  I recall that with some disks  
>> we were
>> seeing problems where NVME specific PM wasn't working when the disk was  
>> in D3.
>>
>> On your specific disk, it would be good to know if just removing the  
>> pci_save_state(pdev)
>> call helps.
>
> Yes, it does help.
>
>> If so:
>> * that might be a better option to add as a parameter.
>> * maybe we should double check all the disks one more time with that  
>> tweak.
>
> At this point it seems so.
>
>>>> I'm not sure what to do here, because evidently there are systems where
>>>> that commit
>>>> helps.  I was thinking about adding a module option allowing the user to
>>>> override the
>>>> default behavior which in turn should be compatible with 5.2 and earlier
>>>> kernels.
>>>
>>> I just briefly tested s2i on XPS 9370, and the power meter shows a  
>>> 0.8~0.9W
>>> power consumption so at least I don't see the issue on XPS 9370.
>>
>> To me that confirms NVME is down, but it still seems higher than I would  
>> have
>> expected.  We should be more typically in the order of ~0.3W I think.

From what I've observed, ~0.8W s2idle is already better than Windows (~1W).
0.3W is what I see during S5.

>
> It may go to PC10, but not reach S0ix.
>
> Anyway, I run the s2idle tests under turbostat which then tells me
> what has happened more precisely.

The XPS 9370 at hand does reach S0ix during s2idle:
# cat /sys/kernel/debug/pmc_core/slp_s0_residency_usec
15998400

So I think keeping the root port in D0 is not the culprit here.
Maybe something is wrong with the ASPM settings?
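The check above can be turned into a small before/after comparison. The debugfs path is the one quoted above; the helper takes the two readings as arguments so the arithmetic is separate from the hardware access:

```shell
# Compare two readings of slp_s0_residency_usec taken before and after a
# suspend/resume cycle; the counter only advances while the SoC is in
# S0ix, so an unchanged value means S0ix was never entered.
s0ix_entered() {
  before=$1
  after=$2
  if [ "$after" -gt "$before" ]; then
    echo "S0ix reached: residency grew by $((after - before)) usec"
  else
    echo "no S0ix residency accumulated"
  fi
}

# On real hardware (requires root and the intel_pmc_core driver):
#   b=$(cat /sys/kernel/debug/pmc_core/slp_s0_residency_usec)
#   ...suspend and resume...
#   a=$(cat /sys/kernel/debug/pmc_core/slp_s0_residency_usec)
#   s0ix_entered "$b" "$a"
```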

Kai-Heng


* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-25 18:20       ` kai.heng.feng
@ 2019-07-25 19:09         ` Mario.Limonciello
  0 siblings, 0 replies; 75+ messages in thread
From: Mario.Limonciello @ 2019-07-25 19:09 UTC (permalink / raw)


> -----Original Message-----
> From: Kai-Heng Feng <kai.heng.feng at canonical.com>
> Sent: Thursday, July 25, 2019 1:20 PM
> To: Rafael J. Wysocki
> Cc: Limonciello, Mario; Rafael J. Wysocki; Keith Busch; Christoph Hellwig; Sagi
> Grimberg; linux-nvme; Linux PM; Linux Kernel Mailing List; Rajat Jain
> Subject: Re: [Regression] Commit "nvme/pci: Use host managed power state for
> suspend" has problems
> 
> 
> [EXTERNAL EMAIL]
> 
> at 01:03, Rafael J. Wysocki <rafael@kernel.org> wrote:
> 
> > On Thu, Jul 25, 2019 at 6:24 PM <Mario.Limonciello@dell.com> wrote:
> >> +Rajat
> >>
> >>> -----Original Message-----
> >>> From: Kai-Heng Feng <kai.heng.feng at canonical.com>
> >>> Sent: Thursday, July 25, 2019 9:03 AM
> >>> To: Rafael J. Wysocki
> >>> Cc: Keith Busch; Christoph Hellwig; Sagi Grimberg; linux-
> >>> nvme at lists.infradead.org; Limonciello, Mario; Linux PM; LKML
> >>> Subject: Re: [Regression] Commit "nvme/pci: Use host managed power
> >>> state for
> >>> suspend" has problems
> >>>
> >>>
> >>> [EXTERNAL EMAIL]
> >>>
> >>> Hi Rafael,
> >>>
> >>> at 17:51, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >>>
> >>>> Hi Keith,
> >>>>
> >>>> Unfortunately,
> >>>>
> >>>> commit d916b1be94b6dc8d293abed2451f3062f6af7551
> >>>> Author: Keith Busch <keith.busch at intel.com>
> >>>> Date:   Thu May 23 09:27:35 2019 -0600
> >>>>
> >>>>     nvme-pci: use host managed power state for suspend
> >>>>
> >>>> doesn't universally improve things.  In fact, in some cases it makes
> >>>> things worse.
> >>>>
> >>>> For example, on the Dell XPS13 9380 I have here it prevents the
> >>>> processor
> >>>> package
> >>>> from reaching idle states deeper than PC2 in suspend-to-idle (which, of
> >>>> course, also
> >>>> prevents the SoC from reaching any kind of S0ix).
> >>>>
> >>>> That can be readily explained too.  Namely, with the commit above the
> >>>> NVMe device
> >>>> stays in D0 over suspend/resume, so the root port it is connected to
> >>>> also
> >>>> has to stay in
> >>>> D0 and that "blocks" package C-states deeper than PC2.
> >>>>
> >>>> In order for the root port to be able to go to D3, the device connected
> >>>> to it also needs
> >>>> to go into D3, so it looks like (at least on this particular machine,
> >>>> but
> >>>> maybe in
> >>>> general), both D3 and the NVMe-specific PM are needed.
> >>
> >> Well this is really unfortunate to hear.  I recall that with some disks
> >> we were
> >> seeing problems where NVME specific PM wasn't working when the disk was
> >> in D3.
> >>
> >> On your specific disk, it would be good to know if just removing the
> >> pci_save_state(pdev)
> >> call helps.
> >
> > Yes, it does help.
> >
> >> If so, :
> >> * that might be a better option to add as a parameter.
> >> * maybe we should double check all the disks one more time with that
> >> tweak.
> >
> > At this point it seems so.
> >
> >>>> I'm not sure what to do here, because evidently there are systems where
> >>>> that commit
> >>>> helps.  I was thinking about adding a module option allowing the user to
> >>>> override the
> >>>> default behavior which in turn should be compatible with 5.2 and earlier
> >>>> kernels.
> >>>
> >>> I just briefly tested s2i on XPS 9370, and the power meter shows a
> >>> 0.8~0.9W
> >>> power consumption so at least I don't see the issue on XPS 9370.
> >>
> >> To me that confirms NVME is down, but it still seems higher than I would
> >> have
> >> expected.  We should be more typically in the order of ~0.3W I think.
> 
> From what I've observed, ~0.8W s2idle is already better than Windows (~1W).
> 0.3W is what I see during S5.

Oh thanks for confirming, I'm probably mixing up with another system.

> 
> >
> > It may go to PC10, but not reach S0ix.
> >
> > Anyway, I run the s2idle tests under turbostat which then tells me
> > what has happened more precisely.
> 
> The XPS 9370 at hand does reach S0ix during s2idle:
> # cat /sys/kernel/debug/pmc_core/slp_s0_residency_usec
> 15998400
> 
> So I think keeping the root port in D0 is not the culprit here.
> Maybe something is wrong with the ASPM settings?
> 
> Kai-Heng

I have a 9380 on hand and I'm also showing slps0 residency with the SSD in it
and this series (Hynix BC501).


* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-25 14:52 ` kbusch
@ 2019-07-25 19:48   ` rjw
  2019-07-25 19:52     ` kbusch
  0 siblings, 1 reply; 75+ messages in thread
From: rjw @ 2019-07-25 19:48 UTC (permalink / raw)


On Thursday, July 25, 2019 4:52:10 PM CEST Keith Busch wrote:
> On Thu, Jul 25, 2019 at 02:51:41AM -0700, Rafael J. Wysocki wrote:
> > Hi Keith,
> > 
> > Unfortunately,
> > 
> > commit d916b1be94b6dc8d293abed2451f3062f6af7551
> > Author: Keith Busch <keith.busch at intel.com>
> > Date:   Thu May 23 09:27:35 2019 -0600
> > 
> >     nvme-pci: use host managed power state for suspend
> > 
> > doesn't universally improve things.  In fact, in some cases it makes things worse.
> > 
> > For example, on the Dell XPS13 9380 I have here it prevents the processor package
> > from reaching idle states deeper than PC2 in suspend-to-idle (which, of course, also
> > prevents the SoC from reaching any kind of S0ix).
> > 
> > That can be readily explained too.  Namely, with the commit above the NVMe device
> > stays in D0 over suspend/resume, so the root port it is connected to also has to stay in
> > D0 and that "blocks" package C-states deeper than PC2.
> > 
> > In order for the root port to be able to go to D3, the device connected to it also needs
> > to go into D3, so it looks like (at least on this particular machine, but maybe in
> > general), both D3 and the NVMe-specific PM are needed.
> > 
> > I'm not sure what to do here, because evidently there are systems where that commit
> > helps.  I was thinking about adding a module option allowing the user to override the
> > default behavior which in turn should be compatible with 5.2 and earlier kernels.
> 
> Darn, that's too bad. I don't think we can improve one thing at the
> expense of another, so unless we find an acceptable criterion to select
> which low-power mode to use, I would be inclined to support a revert or
> a kernel option to default to the previous behavior.
> 
> One thing we might check before using NVMe power states is if the lowest
> PS is non-operational with MP below some threshold. What does your device
> report for:
> 
>   nvme id-ctrl /dev/nvme0

NVME Identify Controller:
vid     : 0x1c5c
ssvid   : 0x1c5c
sn      : MS92N171312902J0N   
mn      : PC401 NVMe SK hynix 256GB               
fr      : 80007E00
rab     : 2
ieee    : ace42e
cmic    : 0
mdts    : 5
cntlid  : 1
ver     : 10200
rtd3r   : 7a120
rtd3e   : 1e8480
oaes    : 0x200
ctratt  : 0
oacs    : 0x17
acl     : 7
aerl    : 3
frmw    : 0x14
lpa     : 0x2
elpe    : 255
npss    : 4
avscc   : 0x1
apsta   : 0x1
wctemp  : 352
cctemp  : 354
mtfa    : 0
hmpre   : 0
hmmin   : 0
tnvmcap : 0
unvmcap : 0
rpmbs   : 0
edstt   : 10
dsto    : 0
fwug    : 0
kas     : 0
hctma   : 0
mntmt   : 0
mxtmt   : 0
sanicap : 0
hmminds : 0
hmmaxd  : 0
nsetidmax : 0
anatt   : 0
anacap  : 0
anagrpmax : 0
nanagrpid : 0
sqes    : 0x66
cqes    : 0x44
maxcmd  : 0
nn      : 1
oncs    : 0x1f
fuses   : 0x1
fna     : 0
vwc     : 0x1
awun    : 7
awupf   : 7
nvscc   : 1
acwu    : 7
sgls    : 0
mnan    : 0
subnqn  : 
ioccsz  : 0
iorcsz  : 0
icdoff  : 0
ctrattr : 0
msdbd   : 0
ps    0 : mp:6.00W operational enlat:5 exlat:5 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:3.80W operational enlat:30 exlat:30 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:2.40W operational enlat:100 exlat:100 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0700W non-operational enlat:1000 exlat:1000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0070W non-operational enlat:1000 exlat:5000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-


* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-25 19:48   ` rjw
@ 2019-07-25 19:52     ` kbusch
  2019-07-25 20:02       ` rjw
  0 siblings, 1 reply; 75+ messages in thread
From: kbusch @ 2019-07-25 19:52 UTC (permalink / raw)


On Thu, Jul 25, 2019 at 09:48:57PM +0200, Rafael J. Wysocki wrote:
> NVME Identify Controller:
> [... full Identify Controller output trimmed; identical to the listing earlier in the thread ...]

Hm, nothing stands out as something we can use to determine if we should
skip the nvme specific settings or allow D3. I've no other ideas at the
moment for what we may check.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-25 19:52     ` kbusch
@ 2019-07-25 20:02       ` rjw
  2019-07-26 14:02         ` kai.heng.feng
  0 siblings, 1 reply; 75+ messages in thread
From: rjw @ 2019-07-25 20:02 UTC (permalink / raw)


On Thursday, July 25, 2019 9:52:59 PM CEST Keith Busch wrote:
> On Thu, Jul 25, 2019 at 09:48:57PM +0200, Rafael J. Wysocki wrote:
> > NVME Identify Controller:
> > [... full Identify Controller output trimmed; identical to the listing earlier in the thread ...]
> 
> Hm, nothing stands out as something we can use to determine if we should
> skip the nvme specific settings or allow D3. I've no other ideas at the
> moment for what we may check.

Well, do ASPM settings matter here?
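For reference, the ASPM settings asked about here can be inspected from userspace;
a rough sketch using the standard sysfs and pciutils interfaces (output is
machine-dependent, and full lspci detail needs root):

```shell
# Current kernel-wide ASPM policy ([default] / performance / powersave / powersupersave):
cat /sys/module/pcie_aspm/parameters/policy 2>/dev/null \
    || echo "pcie_aspm policy not exposed on this kernel"

# Per-link ASPM capability and enable state for each PCIe device:
lspci -vv 2>/dev/null | grep -E 'LnkC(ap|tl):.*ASPM' \
    || echo "no ASPM details (lspci unavailable or unprivileged)"
```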

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-25 20:02       ` rjw
@ 2019-07-26 14:02         ` kai.heng.feng
  2019-07-27 12:55           ` rafael
  0 siblings, 1 reply; 75+ messages in thread
From: kai.heng.feng @ 2019-07-26 14:02 UTC (permalink / raw)


at 04:02, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:

> On Thursday, July 25, 2019 9:52:59 PM CEST Keith Busch wrote:
>> On Thu, Jul 25, 2019 at 09:48:57PM +0200, Rafael J. Wysocki wrote:
>>> NVME Identify Controller:
>>> [... full Identify Controller output trimmed; identical to the listing earlier in the thread ...]
>>
>> Hm, nothing stands out as something we can use to determine if we should
>> skip the nvme specific settings or allow D3. I've no other ideas at the
>> moment for what we may check.
>
> Well, do ASPM settings matter here?

Seems like it's a regression in the firmware.

The issue happens in version 80007E00 but not version 80006E00.
I am not sure how to downgrade it under Linux though.
The firmware changelog [1] is very interesting:
- Improves the performance of the solid-state drive (SSD) by distributing  
power into the SSD efficiently according to the power state of the system.

[1]  
https://www.dell.com/support/home/us/en/04/drivers/driversdetails?driverid=mcxm8

Kai-Heng

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-26 14:02         ` kai.heng.feng
@ 2019-07-27 12:55           ` rafael
  2019-07-29 15:51             ` Mario.Limonciello
  0 siblings, 1 reply; 75+ messages in thread
From: rafael @ 2019-07-27 12:55 UTC (permalink / raw)


On Fri, Jul 26, 2019 at 4:03 PM Kai-Heng Feng
<kai.heng.feng@canonical.com> wrote:
>
> at 04:02, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> > On Thursday, July 25, 2019 9:52:59 PM CEST Keith Busch wrote:
> >> On Thu, Jul 25, 2019 at 09:48:57PM +0200, Rafael J. Wysocki wrote:
> >>> NVME Identify Controller:
> >>> [... full Identify Controller output trimmed; identical to the listing earlier in the thread ...]
> >>
> >> Hm, nothing stands out as something we can use to determine if we should
> >> skip the nvme specific settings or allow D3. I've no other ideas at the
> >> moment for what we may check.
> >
> > Well, do ASPM settings matter here?
>
> Seems like it's a regression in the firmware.
>
> The issue happens in version 80007E00 but not version 80006E00.

So you mean the NVMe firmware, to be entirely precise.

> I am not sure how to downgrade it under Linux though.

Me neither.

> The firmware changelog [1] is very interesting:
> - Improves the performance of the solid-state drive (SSD) by distributing
> power into the SSD efficiently according to the power state of the system.
>
> [1]
> https://www.dell.com/support/home/us/en/04/drivers/driversdetails?driverid=mcxm8

Huh.

It looks like something else prevents the PCH on my 9380 from reaching
the right state for S0ix, though.  I still need to find out what it
is.
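One way to chase that down on this class of hardware is the intel_pmc_core debugfs
interface, which reports SLP_S0 residency and per-IP power-gating status. A sketch,
assuming the intel_pmc_core driver is loaded and debugfs is mounted (file names as
exposed by recent kernels; reading debugfs typically requires root):

```shell
# SLP_S0 residency in microseconds; it should increase across a
# suspend/resume cycle if the SoC actually entered S0ix:
cat /sys/kernel/debug/pmc_core/slp_s0_residency_usec 2>/dev/null \
    || echo "pmc_core debugfs not available here"

# Per-IP power-gating status, useful for spotting which block keeps the PCH awake:
cat /sys/kernel/debug/pmc_core/pch_ip_power_gating_status 2>/dev/null \
    || echo "IP power-gating status not exposed"
```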

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-27 12:55           ` rafael
@ 2019-07-29 15:51             ` Mario.Limonciello
  2019-07-29 22:05               ` rafael
  0 siblings, 1 reply; 75+ messages in thread
From: Mario.Limonciello @ 2019-07-29 15:51 UTC (permalink / raw)


> -----Original Message-----
> From: Rafael J. Wysocki <rafael at kernel.org>
> Sent: Saturday, July 27, 2019 7:55 AM
> To: Kai-Heng Feng
> Cc: Rafael J. Wysocki; Keith Busch; Busch, Keith; Christoph Hellwig; Sagi Grimberg;
> linux-nvme at lists.infradead.org; Limonciello, Mario; Linux PM; LKML
> Subject: Re: [Regression] Commit "nvme/pci: Use host managed power state for
> suspend" has problems
> 
> 
> [EXTERNAL EMAIL]
> 
> On Fri, Jul 26, 2019 at 4:03 PM Kai-Heng Feng
> <kai.heng.feng@canonical.com> wrote:
> >
> > at 04:02, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >
> > > On Thursday, July 25, 2019 9:52:59 PM CEST Keith Busch wrote:
> > >> On Thu, Jul 25, 2019 at 09:48:57PM +0200, Rafael J. Wysocki wrote:
> > >>> NVME Identify Controller:
> > >>> [... full Identify Controller output trimmed; identical to the listing earlier in the thread ...]
> > >>
> > >> Hm, nothing stands out as something we can use to determine if we should
> > >> skip the nvme specific settings or allow D3. I've no other ideas at the
> > >> moment for what we may check.
> > >
> > > Well, do ASPM settings matter here?
> >
> > Seems like it's a regression in the firmware.
> >
> > The issue happens in version 80007E00 but not version 80006E00.
> 
> So you mean the NVMe firmware, to be entirely precise.

Yes.

> 
> > I am not sure how to downgrade it under Linux though.
> 
> Me neither.

I'll ask the storage team to ask Hynix to make both these FW available on LVFS.
Fwupd can upgrade and downgrade firmware when the binaries are made available.

They could potentially be pulled directly out of the Windows executables too, but I don't
know how to identify them myself.
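Once such firmware lands on LVFS, the downgrade is driven entirely from fwupd's
command line; a sketch of the flow (device selection is interactive or by GUID, and
downgrades only work when an older release is actually published):

```shell
# List devices fwupd can service, then refresh LVFS metadata:
fwupdmgr get-devices 2>/dev/null || echo "fwupd not installed"
fwupdmgr refresh 2>/dev/null || true

# Show the releases published for a device, then move to an older one:
fwupdmgr get-releases 2>/dev/null || true
fwupdmgr downgrade 2>/dev/null \
    || echo "downgrade needs a device with an older release on LVFS"
```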

> 
> > The firmware changelog [1] is very interesting:
> > - Improves the performance of the solid-state drive (SSD) by distributing
> > power into the SSD efficiently according to the power state of the system.
> >
> > [1]
> >
> https://www.dell.com/support/home/us/en/04/drivers/driversdetails?driverid=mcxm8
> 
> Huh.
> 
> It looks like something else prevents the PCH on my 9380 from reaching
> the right state for S0ix, though.  I still need to find out what it
> is.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-29 15:51             ` Mario.Limonciello
@ 2019-07-29 22:05               ` rafael
  0 siblings, 0 replies; 75+ messages in thread
From: rafael @ 2019-07-29 22:05 UTC (permalink / raw)


On Mon, Jul 29, 2019 at 5:53 PM <Mario.Limonciello@dell.com> wrote:
>
> > -----Original Message-----
> > From: Rafael J. Wysocki <rafael at kernel.org>
> > Sent: Saturday, July 27, 2019 7:55 AM
> > To: Kai-Heng Feng
> > Cc: Rafael J. Wysocki; Keith Busch; Busch, Keith; Christoph Hellwig; Sagi Grimberg;
> > linux-nvme at lists.infradead.org; Limonciello, Mario; Linux PM; LKML
> > Subject: Re: [Regression] Commit "nvme/pci: Use host managed power state for
> > suspend" has problems
> >
> >
> > [EXTERNAL EMAIL]
> >
> > On Fri, Jul 26, 2019 at 4:03 PM Kai-Heng Feng
> > <kai.heng.feng@canonical.com> wrote:
> > >
> > > at 04:02, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > >
> > > > On Thursday, July 25, 2019 9:52:59 PM CEST Keith Busch wrote:
> > > >> On Thu, Jul 25, 2019 at 09:48:57PM +0200, Rafael J. Wysocki wrote:
> > > >>> NVME Identify Controller:
> > > >>> [... full Identify Controller output trimmed; identical to the listing earlier in the thread ...]
> > > >>
> > > >> Hm, nothing stands out as something we can use to determine if we should
> > > >> skip the nvme specific settings or allow D3. I've no other ideas at the
> > > >> moment for what we may check.
> > > >
> > > > Well, do ASPM settings matter here?
> > >
> > > Seems like it's a regression in the firmware.
> > >
> > > The issue happens in version 80007E00 but not version 80006E00.
> >
> > So you mean the NVMe firmware, to be entirely precise.
>
> Yes.
>
> >
> > > I am not sure how to downgrade it under Linux though.
> >
> > Me neither.
>
> I'll ask the storage team to ask Hynix to make both these FW available on LVFS.
> Fwupd can upgrade and downgrade firmware when the binaries are made available.
>
> They could potentially be pulled directly out of the Windows executables too, but I don't
> know how to identify them myself.

Well, thanks, but I'm not quite convinced that the NVMe is the reason
why my 9380 cannot reach SLP_S0 and this is my production system, so
I'd rather not do NVMe firmware downgrade experiments on it. :-)



>
> >
> > > The firmware changelog [1] is very interesting:
> > > - Improves the performance of the solid-state drive (SSD) by distributing
> > > power into the SSD efficiently according to the power state of the system.
> > >
> > > [1]
> > >
> > https://www.dell.com/support/home/us/en/04/drivers/driversdetails?driverid=mcxm8
> >
> > Huh.
> >
> > It looks like something else prevents the PCH on my 9380 from reaching
> > the right state for S0ix, though.  I still need to find out what it
> > is.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-25 17:03     ` rafael
  2019-07-25 17:23       ` Mario.Limonciello
  2019-07-25 18:20       ` kai.heng.feng
@ 2019-07-30 10:45       ` rjw
  2019-07-30 14:41         ` kbusch
  2 siblings, 1 reply; 75+ messages in thread
From: rjw @ 2019-07-30 10:45 UTC (permalink / raw)


On Thursday, July 25, 2019 7:03:49 PM CEST Rafael J. Wysocki wrote:
> On Thu, Jul 25, 2019 at 6:24 PM <Mario.Limonciello@dell.com> wrote:
> >
> > +Rajat
> >
> > > -----Original Message-----
> > > From: Kai-Heng Feng <kai.heng.feng at canonical.com>
> > > Sent: Thursday, July 25, 2019 9:03 AM
> > > To: Rafael J. Wysocki
> > > Cc: Keith Busch; Christoph Hellwig; Sagi Grimberg; linux-
> > > nvme at lists.infradead.org; Limonciello, Mario; Linux PM; LKML
> > > Subject: Re: [Regression] Commit "nvme/pci: Use host managed power state for
> > > suspend" has problems
> > >
> > >
> > > [EXTERNAL EMAIL]
> > >
> > > Hi Rafael,
> > >
> > >@17:51, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > >
> > > > Hi Keith,
> > > >
> > > > Unfortunately,
> > > >
> > > > commit d916b1be94b6dc8d293abed2451f3062f6af7551
> > > > Author: Keith Busch <keith.busch at intel.com>
> > > > Date:   Thu May 23 09:27:35 2019 -0600
> > > >
> > > >     nvme-pci: use host managed power state for suspend
> > > >
> > > > doesn't universally improve things.  In fact, in some cases it makes
> > > > things worse.
> > > >
> > > > For example, on the Dell XPS13 9380 I have here it prevents the processor
> > > > package
> > > > from reaching idle states deeper than PC2 in suspend-to-idle (which, of
> > > > course, also
> > > > prevents the SoC from reaching any kind of S0ix).
> > > >
> > > > That can be readily explained too.  Namely, with the commit above the
> > > > NVMe device
> > > > stays in D0 over suspend/resume, so the root port it is connected to also
> > > > has to stay in
> > > > D0 and that "blocks" package C-states deeper than PC2.
> > > >
> > > > In order for the root port to be able to go to D3, the device connected
> > > > to it also needs
> > > > to go into D3, so it looks like (at least on this particular machine, but
> > > > maybe in
> > > > general), both D3 and the NVMe-specific PM are needed.
> >
> > Well this is really unfortunate to hear.  I recall that with some disks we were
> > seeing problems where NVME specific PM wasn't working when the disk was in D3.
> >
> > On your specific disk, it would be good to know if just removing the pci_save_state(pdev)
> > call helps.
> 
> Yes, it does help.
> 
> > If so, :
> > * that might be a better option to add as a parameter.
> > * maybe we should double check all the disks one more time with that tweak.
> 
> At this point it seems so.

So I can reproduce this problem with plain 5.3-rc1 and the patch below fixes it.

Also Mario reports that the same patch needs to be applied for his 9380 to reach
SLP_S0 after some additional changes under testing/review now, so here it goes.

[The changes mentioned above are in the pm-s2idle-testing branch in the
 linux-pm.git tree at kernel.org.]

---
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: [PATCH] nvme-pci: Do not prevent PCI bus-level PM from being used

One of the modifications made by commit d916b1be94b6 ("nvme-pci: use
host managed power state for suspend") was adding a pci_save_state()
call to nvme_suspend() in order to prevent the PCI bus-level PM from
being applied to the suspended NVMe devices, but that causes the NVMe
drive (PC401 NVMe SK hynix 256GB) in my Dell XPS13 9380 to prevent
the SoC from reaching package idle states deeper than PC3, which is
way insufficient for system suspend.

Fix this issue by removing the pci_save_state() call in question.

Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for suspend")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
---
 drivers/nvme/host/pci.c |    8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

Index: linux-pm/drivers/nvme/host/pci.c
===================================================================
--- linux-pm.orig/drivers/nvme/host/pci.c
+++ linux-pm/drivers/nvme/host/pci.c
@@ -2897,14 +2897,8 @@ static int nvme_suspend(struct device *d
 		nvme_dev_disable(ndev, true);
 		ctrl->npss = 0;
 		ret = 0;
-		goto unfreeze;
 	}
-	/*
-	 * A saved state prevents pci pm from generically controlling the
-	 * device's power. If we're using protocol specific settings, we don't
-	 * want pci interfering.
-	 */
-	pci_save_state(pdev);
+
 unfreeze:
 	nvme_unfreeze(ctrl);
 	return ret;
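The mechanism behind this one-liner: the PCI PM core treats a state already saved by
the driver as "the driver has taken over power control" and then leaves the device
alone. Roughly, in simplified kernel-style pseudocode (a sketch of the decision, not
the literal drivers/pci/pci-driver.c source):

```c
/* Sketch of the PCI core's suspend_noirq handling of a suspended device. */
static int pci_pm_suspend_noirq_sketch(struct pci_dev *pci_dev)
{
	if (!pci_dev->state_saved) {
		pci_save_state(pci_dev);	/* core saves config space... */
		pci_prepare_to_sleep(pci_dev);	/* ...and may move the device to D3 */
	}
	/* If the driver already called pci_save_state(), both steps are
	 * skipped and the device stays in D0 -- which is what kept the
	 * root port, and hence the package, out of deep idle states. */
	return 0;
}
```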

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-30 10:45       ` rjw
@ 2019-07-30 14:41         ` kbusch
  2019-07-30 17:14           ` Mario.Limonciello
  0 siblings, 1 reply; 75+ messages in thread
From: kbusch @ 2019-07-30 14:41 UTC (permalink / raw)


On Tue, Jul 30, 2019 at 03:45:31AM -0700, Rafael J. Wysocki wrote:
> So I can reproduce this problem with plain 5.3-rc1 and the patch below fixes it.
> 
> Also Mario reports that the same patch needs to be applied for his 9380 to reach
> SLP_S0 after some additional changes under testing/review now, so here it goes.
> 
> [The changes mentioned above are in the pm-s2idle-testing branch in the
>  linux-pm.git tree at kernel.org.]
> 
> ---
> From: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> Subject: [PATCH] nvme-pci: Do not prevent PCI bus-level PM from being used
> 
> One of the modifications made by commit d916b1be94b6 ("nvme-pci: use
> host managed power state for suspend") was adding a pci_save_state()
> call to nvme_suspend() in order to prevent the PCI bus-level PM from
> being applied to the suspended NVMe devices, but that causes the NVMe
> drive (PC401 NVMe SK hynix 256GB) in my Dell XPS13 9380 to prevent
> the SoC from reaching package idle states deeper than PC3, which is
> way insufficient for system suspend.
> 
> Fix this issue by removing the pci_save_state() call in question.

I'm okay with the patch if we can get confirmation this doesn't break
any previously tested devices. I recall we add the pci_save_state() in
the first place specifically to prevent PCI D3 since that was reported
to break some devices' low power settings. Kai-Heng or Mario, any input
here?


 
> Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for suspend")
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> ---
>  drivers/nvme/host/pci.c |    8 +-------
>  1 file changed, 1 insertion(+), 7 deletions(-)
> 
> Index: linux-pm/drivers/nvme/host/pci.c
> ===================================================================
> --- linux-pm.orig/drivers/nvme/host/pci.c
> +++ linux-pm/drivers/nvme/host/pci.c
> @@ -2897,14 +2897,8 @@ static int nvme_suspend(struct device *d
>  		nvme_dev_disable(ndev, true);
>  		ctrl->npss = 0;
>  		ret = 0;
> -		goto unfreeze;
>  	}
> -	/*
> -	 * A saved state prevents pci pm from generically controlling the
> -	 * device's power. If we're using protocol specific settings, we don't
> -	 * want pci interfering.
> -	 */
> -	pci_save_state(pdev);
> +
>  unfreeze:
>  	nvme_unfreeze(ctrl);
>  	return ret;

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-30 14:41         ` kbusch
@ 2019-07-30 17:14           ` Mario.Limonciello
  2019-07-30 18:50             ` kai.heng.feng
  0 siblings, 1 reply; 75+ messages in thread
From: Mario.Limonciello @ 2019-07-30 17:14 UTC (permalink / raw)


> -----Original Message-----
> From: Keith Busch <kbusch at kernel.org>
> Sent: Tuesday, July 30, 2019 9:42 AM
> To: Rafael J. Wysocki
> Cc: Busch, Keith; Limonciello, Mario; Kai-Heng Feng; Christoph Hellwig; Sagi
> Grimberg; linux-nvme; Linux PM; Linux Kernel Mailing List; Rajat Jain
> Subject: Re: [Regression] Commit "nvme/pci: Use host managed power state for
> suspend" has problems
> 
> 
> [EXTERNAL EMAIL]
> 
> On Tue, Jul 30, 2019 at 03:45:31AM -0700, Rafael J. Wysocki wrote:
> > So I can reproduce this problem with plain 5.3-rc1 and the patch below fixes it.
> >
> > Also Mario reports that the same patch needs to be applied for his 9380 to
> reach
> > SLP_S0 after some additional changes under testing/review now, so here it
> goes.
> >
> > [The changes mentioned above are in the pm-s2idle-testing branch in the
> >  linux-pm.git tree at kernel.org.]
> >
> > ---
> > From: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> > Subject: [PATCH] nvme-pci: Do not prevent PCI bus-level PM from being used
> >
> > One of the modifications made by commit d916b1be94b6 ("nvme-pci: use
> > host managed power state for suspend") was adding a pci_save_state()
> > call to nvme_suspend() in order to prevent the PCI bus-level PM from
> > being applied to the suspended NVMe devices, but that causes the NVMe
> > drive (PC401 NVMe SK hynix 256GB) in my Dell XPS13 9380 to prevent
> > the SoC from reaching package idle states deeper than PC3, which is
> > way insufficient for system suspend.
> >
> > Fix this issue by removing the pci_save_state() call in question.
> 
> I'm okay with the patch if we can get confirmation this doesn't break
> any previously tested devices. I recall we add the pci_save_state() in
> the first place specifically to prevent PCI D3 since that was reported
> to break some devices' low power settings. Kai-Heng or Mario, any input
> here?
> 

It's entirely possible that, with the shutdown/flush/send-NVMe-power-state-command
sequence fixed, D3 will be OK now, but it will take some time to double-check across the
variety of disks that we tested before.

What's kernel policy in terms of adding a module parameter and removing it later?  My gut
reaction is I'd like to see that behind a module parameter and if we see that all the disks
are actually OK we can potentially rip it out in a future release.  Also gives us a knob for easier
wider testing outside of the 4 of us.

> 
> 
> > Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for suspend")
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> > ---
> >  drivers/nvme/host/pci.c |    8 +-------
> >  1 file changed, 1 insertion(+), 7 deletions(-)
> >
> > Index: linux-pm/drivers/nvme/host/pci.c
> >
> ===================================================================
> > --- linux-pm.orig/drivers/nvme/host/pci.c
> > +++ linux-pm/drivers/nvme/host/pci.c
> > @@ -2897,14 +2897,8 @@ static int nvme_suspend(struct device *d
> >  		nvme_dev_disable(ndev, true);
> >  		ctrl->npss = 0;
> >  		ret = 0;
> > -		goto unfreeze;
> >  	}
> > -	/*
> > -	 * A saved state prevents pci pm from generically controlling the
> > -	 * device's power. If we're using protocol specific settings, we don't
> > -	 * want pci interfering.
> > -	 */
> > -	pci_save_state(pdev);
> > +
> >  unfreeze:
> >  	nvme_unfreeze(ctrl);
> >  	return ret;

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-30 17:14           ` Mario.Limonciello
@ 2019-07-30 18:50             ` kai.heng.feng
  2019-07-30 19:19               ` kbusch
  0 siblings, 1 reply; 75+ messages in thread
From: kai.heng.feng @ 2019-07-30 18:50 UTC (permalink / raw)


at 01:14, <Mario.Limonciello@dell.com> <Mario.Limonciello@dell.com> wrote:

>> -----Original Message-----
>> From: Keith Busch <kbusch at kernel.org>
>> Sent: Tuesday, July 30, 2019 9:42 AM
>> To: Rafael J. Wysocki
>> Cc: Busch, Keith; Limonciello, Mario; Kai-Heng Feng; Christoph Hellwig;  
>> Sagi
>> Grimberg; linux-nvme; Linux PM; Linux Kernel Mailing List; Rajat Jain
>> Subject: Re: [Regression] Commit "nvme/pci: Use host managed power state  
>> for
>> suspend" has problems
>>
>>
>> [EXTERNAL EMAIL]
>>
>> On Tue, Jul 30, 2019 at 03:45:31AM -0700, Rafael J. Wysocki wrote:
>>> So I can reproduce this problem with plain 5.3-rc1 and the patch below  
>>> fixes it.
>>>
>>> Also Mario reports that the same patch needs to be applied for his 9380  
>>> to
>> reach
>>> SLP_S0 after some additional changes under testing/review now, so here it
>> goes.
>>> [The changes mentioned above are in the pm-s2idle-testing branch in the
>>>  linux-pm.git tree at kernel.org.]
>>>
>>> ---
>>> From: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
>>> Subject: [PATCH] nvme-pci: Do not prevent PCI bus-level PM from being  
>>> used
>>>
>>> One of the modifications made by commit d916b1be94b6 ("nvme-pci: use
>>> host managed power state for suspend") was adding a pci_save_state()
>>> call to nvme_suspend() in order to prevent the PCI bus-level PM from
>>> being applied to the suspended NVMe devices, but that causes the NVMe
>>> drive (PC401 NVMe SK hynix 256GB) in my Dell XPS13 9380 to prevent
>>> the SoC from reaching package idle states deeper than PC3, which is
>>> way insufficient for system suspend.
>>>
>>> Fix this issue by removing the pci_save_state() call in question.
>>
>> I'm okay with the patch if we can get confirmation this doesn't break
>> any previously tested devices. I recall we add the pci_save_state() in
>> the first place specifically to prevent PCI D3 since that was reported
>> to break some devices' low power settings. Kai-Heng or Mario, any input
>> here?
>
> It's entirely possible that in fixing the shutdown/flush/send NVME power  
> state command
> that D3 will be OK now but it will take some time to double check across  
> the variety of disks that
> we tested before.

Just did a quick test: this patch regresses the SK Hynix BC501; the SoC stays
at PC3 once the patch is applied.

Kai-Heng

>
> What's kernel policy in terms of adding a module parameter and removing  
> it later?  My gut
> reaction is I'd like to see that behind a module parameter and if we see  
> that all the disks
> are actually OK we can potentially rip it out in a future release.  Also  
> gives us a knob for easier
> wider testing outside of the 4 of us.
>
>>> Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for  
>>> suspend")
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
>>> ---
>>>  drivers/nvme/host/pci.c |    8 +-------
>>>  1 file changed, 1 insertion(+), 7 deletions(-)
>>>
>>> Index: linux-pm/drivers/nvme/host/pci.c
>> ===================================================================
>>> --- linux-pm.orig/drivers/nvme/host/pci.c
>>> +++ linux-pm/drivers/nvme/host/pci.c
>>> @@ -2897,14 +2897,8 @@ static int nvme_suspend(struct device *d
>>>  		nvme_dev_disable(ndev, true);
>>>  		ctrl->npss = 0;
>>>  		ret = 0;
>>> -		goto unfreeze;
>>>  	}
>>> -	/*
>>> -	 * A saved state prevents pci pm from generically controlling the
>>> -	 * device's power. If we're using protocol specific settings, we don't
>>> -	 * want pci interfering.
>>> -	 */
>>> -	pci_save_state(pdev);
>>> +
>>>  unfreeze:
>>>  	nvme_unfreeze(ctrl);
>>>  	return ret;

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-30 18:50             ` kai.heng.feng
@ 2019-07-30 19:19               ` kbusch
  2019-07-30 21:05                 ` Mario.Limonciello
  0 siblings, 1 reply; 75+ messages in thread
From: kbusch @ 2019-07-30 19:19 UTC (permalink / raw)


On Wed, Jul 31, 2019 at 02:50:01AM +0800, Kai-Heng Feng wrote:
> 
> Just did a quick test, this patch regresses SK Hynix BC501, the SoC stays at
> PC3 once the patch is applied.

Okay, I'm afraid device/platform quirks may be required unless there are
any other ideas out there.

I'm not a big fan of adding more params to this driver. Those are
global to the module, so that couldn't really handle a platform with
two different devices that want different behavior.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-30 19:19               ` kbusch
@ 2019-07-30 21:05                 ` Mario.Limonciello
  2019-07-30 21:31                   ` kbusch
  0 siblings, 1 reply; 75+ messages in thread
From: Mario.Limonciello @ 2019-07-30 21:05 UTC (permalink / raw)


> -----Original Message-----
> From: Keith Busch <kbusch at kernel.org>
> Sent: Tuesday, July 30, 2019 2:20 PM
> To: Kai-Heng Feng
> Cc: Limonciello, Mario; rjw at rjwysocki.net; keith.busch at intel.com; hch at lst.de;
> sagi at grimberg.me; linux-nvme at lists.infradead.org; linux-pm at vger.kernel.org;
> linux-kernel at vger.kernel.org; rajatja at google.com
> Subject: Re: [Regression] Commit "nvme/pci: Use host managed power state for
> suspend" has problems
> 
> 
> [EXTERNAL EMAIL]
> 
> On Wed, Jul 31, 2019 at 02:50:01AM +0800, Kai-Heng Feng wrote:
> >
> > Just did a quick test, this patch regresses SK Hynix BC501, the SoC stays at
> > PC3 once the patch is applied.
> 
> Okay, I'm afraid device/platform quirks may be required unless there are
> any other ideas out there.

I think if a quirk goes in for Rafael's SSD it would have to be a quirk specific to this
device and FW version, per Kai-Heng's findings from checking the same device with the
older FW version.

> 
> I'm not a big fan of adding more params to this driver. Those are
> global to the module, so that couldn't really handle a platform with
> two different devices that want different behavior.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-30 21:05                 ` Mario.Limonciello
@ 2019-07-30 21:31                   ` kbusch
  2019-07-31 21:25                     ` rafael
  0 siblings, 1 reply; 75+ messages in thread
From: kbusch @ 2019-07-30 21:31 UTC (permalink / raw)


On Tue, Jul 30, 2019 at 09:05:22PM +0000, Mario.Limonciello@dell.com wrote:
> > -----Original Message-----
> > From: Keith Busch <kbusch at kernel.org>
> > Sent: Tuesday, July 30, 2019 2:20 PM
> > To: Kai-Heng Feng
> > Cc: Limonciello, Mario; rjw at rjwysocki.net; keith.busch at intel.com; hch at lst.de;
> > sagi at grimberg.me; linux-nvme at lists.infradead.org; linux-pm at vger.kernel.org;
> > linux-kernel at vger.kernel.org; rajatja at google.com
> > Subject: Re: [Regression] Commit "nvme/pci: Use host managed power state for
> > suspend" has problems
> > 
> > 
> > [EXTERNAL EMAIL]
> > 
> > On Wed, Jul 31, 2019 at 02:50:01AM +0800, Kai-Heng Feng wrote:
> > >
> > > Just did a quick test, this patch regresses SK Hynix BC501, the SoC stays at
> > > PC3 once the patch is applied.
> > 
> > Okay, I'm afraid device/platform quirks may be required unless there are
> > any other ideas out there.
> 
> I think if a quirk goes in for Rafael's SSD it would have to be a quirk specific to this
> device and FW version per the findings on KH checking the same device with the
> older FW version.

That's fine, we have the infrastructure in place for fw specific quirks.
See drivers/nvme/host/core.c:nvme_core_quirk_entry

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-30 21:31                   ` kbusch
@ 2019-07-31 21:25                     ` rafael
  2019-07-31 22:19                       ` kbusch
  0 siblings, 1 reply; 75+ messages in thread
From: rafael @ 2019-07-31 21:25 UTC (permalink / raw)


On Tue, Jul 30, 2019 at 11:33 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Tue, Jul 30, 2019 at 09:05:22PM +0000, Mario.Limonciello@dell.com wrote:
> > > -----Original Message-----
> > > From: Keith Busch <kbusch at kernel.org>
> > > Sent: Tuesday, July 30, 2019 2:20 PM
> > > To: Kai-Heng Feng
> > > Cc: Limonciello, Mario; rjw at rjwysocki.net; keith.busch at intel.com; hch at lst.de;
> > > sagi at grimberg.me; linux-nvme at lists.infradead.org; linux-pm at vger.kernel.org;
> > > linux-kernel at vger.kernel.org; rajatja at google.com
> > > Subject: Re: [Regression] Commit "nvme/pci: Use host managed power state for
> > > suspend" has problems
> > >
> > >
> > > [EXTERNAL EMAIL]
> > >
> > > On Wed, Jul 31, 2019 at 02:50:01AM +0800, Kai-Heng Feng wrote:
> > > >
> > > > Just did a quick test, this patch regresses SK Hynix BC501, the SoC stays at
> > > > PC3 once the patch is applied.
> > >
> > > Okay, I'm afraid device/platform quirks may be required unless there are
> > > any other ideas out there.
> >
> > I think if a quirk goes in for Rafael's SSD it would have to be a quirk specific to this
> > device and FW version per the findings on KH checking the same device with the
> > older FW version.
>
> That's fine, we have the infrastructure in place for fw specific quirks.
> See drivers/nvme/host/core.c:nvme_core_quirk_entry

A couple of remarks if you will.

First, we don't know which case is the majority at this point.  For
now, there is one example of each, but it may very well turn out that
the SK Hynix BC501 above needs to be quirked.

Second, the reference here really is 5.2, so if there are any systems
that are not better off with 5.3-rc than they were with 5.2, well, we
have not made progress.  However, if there are systems that are worse
off with 5.3, that's bad.  In the face of the latest findings the only
way to avoid that is to be backwards compatible with 5.2 and that's
where my patch is going.  That cannot be achieved by quirking all
cases that are reported as "bad", because there still may be
unreported ones.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-31 21:25                     ` rafael
@ 2019-07-31 22:19                       ` kbusch
  2019-07-31 22:33                         ` rafael
                                           ` (5 more replies)
  0 siblings, 6 replies; 75+ messages in thread
From: kbusch @ 2019-07-31 22:19 UTC (permalink / raw)


On Wed, Jul 31, 2019 at 11:25:51PM +0200, Rafael J. Wysocki wrote:
> 
> A couple of remarks if you will.
> 
> First, we don't know which case is the majority at this point.  For
> now, there is one example of each, but it may very well turn out that
> the SK Hynix BC501 above needs to be quirked.
> 
> Second, the reference here really is 5.2, so if there are any systems
> that are not better off with 5.3-rc than they were with 5.2, well, we
> have not made progress.  However, if there are systems that are worse
> off with 5.3, that's bad.  In the face of the latest findings the only
> way to avoid that is to be backwards compatible with 5.2 and that's
> where my patch is going.  That cannot be achieved by quirking all
> cases that are reported as "bad", because there still may be
> unreported ones.

I have to agree. I think your proposal may allow PCI D3cold, in which
case we do need to reintroduce the HMB handling.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-31 22:19                       ` kbusch
@ 2019-07-31 22:33                         ` rafael
  2019-08-01  9:05                           ` kai.heng.feng
  2019-08-07  9:48                         ` rjw
                                           ` (4 subsequent siblings)
  5 siblings, 1 reply; 75+ messages in thread
From: rafael @ 2019-07-31 22:33 UTC (permalink / raw)


On Thu, Aug 1, 2019 at 12:22 AM Keith Busch <kbusch@kernel.org> wrote:
>
> On Wed, Jul 31, 2019 at 11:25:51PM +0200, Rafael J. Wysocki wrote:
> >
> > A couple of remarks if you will.
> >
> > First, we don't know which case is the majority at this point.  For
> > now, there is one example of each, but it may very well turn out that
> > the SK Hynix BC501 above needs to be quirked.
> >
> > Second, the reference here really is 5.2, so if there are any systems
> > that are not better off with 5.3-rc than they were with 5.2, well, we
> > have not made progress.  However, if there are systems that are worse
> > off with 5.3, that's bad.  In the face of the latest findings the only
> > way to avoid that is to be backwards compatible with 5.2 and that's
> > where my patch is going.  That cannot be achieved by quirking all
> > cases that are reported as "bad", because there still may be
> > unreported ones.
>
> I have to agree. I think your proposal may allow PCI D3cold,

Yes, it may.

> In which case we do need to reintroduce the HMB handling.

Right.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-31 22:33                         ` rafael
@ 2019-08-01  9:05                           ` kai.heng.feng
  2019-08-01 17:29                             ` rafael
  2019-08-01 20:22                             ` kbusch
  0 siblings, 2 replies; 75+ messages in thread
From: kai.heng.feng @ 2019-08-01  9:05 UTC (permalink / raw)


at 06:33, Rafael J. Wysocki <rafael@kernel.org> wrote:

> On Thu, Aug 1, 2019 at 12:22 AM Keith Busch <kbusch@kernel.org> wrote:
>> On Wed, Jul 31, 2019 at 11:25:51PM +0200, Rafael J. Wysocki wrote:
>>> A couple of remarks if you will.
>>>
>>> First, we don't know which case is the majority at this point.  For
>>> now, there is one example of each, but it may very well turn out that
>>> the SK Hynix BC501 above needs to be quirked.
>>>
>>> Second, the reference here really is 5.2, so if there are any systems
>>> that are not better off with 5.3-rc than they were with 5.2, well, we
>>> have not made progress.  However, if there are systems that are worse
>>> off with 5.3, that's bad.  In the face of the latest findings the only
>>> way to avoid that is to be backwards compatible with 5.2 and that's
>>> where my patch is going.  That cannot be achieved by quirking all
>>> cases that are reported as "bad", because there still may be
>>> unreported ones.
>>
>> I have to agree. I think your proposal may allow PCI D3cold,
>
> Yes, it may.

Somehow the 9380 with Toshiba NVMe never hits SLP_S0 with or without
Rafael's patch.
But the "real" s2idle power consumption does improve with the patch.

Can we use a DMI based quirk for this platform? It seems like a
platform-specific issue.

>
>> In which case we do need to reintroduce the HMB handling.
>
> Right.

The patch alone doesn't break the HMB Toshiba NVMe I tested. But I think it's
still safer to do proper HMB handling.

Kai-Heng

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-08-01  9:05                           ` kai.heng.feng
@ 2019-08-01 17:29                             ` rafael
  2019-08-01 19:05                               ` Mario.Limonciello
  2019-08-01 20:22                             ` kbusch
  1 sibling, 1 reply; 75+ messages in thread
From: rafael @ 2019-08-01 17:29 UTC (permalink / raw)


On Thu, Aug 1, 2019 at 11:06 AM Kai-Heng Feng
<kai.heng.feng@canonical.com> wrote:
>
> at 06:33, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> > On Thu, Aug 1, 2019 at 12:22 AM Keith Busch <kbusch@kernel.org> wrote:
> >> On Wed, Jul 31, 2019 at 11:25:51PM +0200, Rafael J. Wysocki wrote:
> >>> A couple of remarks if you will.
> >>>
> >>> First, we don't know which case is the majority at this point.  For
> >>> now, there is one example of each, but it may very well turn out that
> >>> the SK Hynix BC501 above needs to be quirked.
> >>>
> >>> Second, the reference here really is 5.2, so if there are any systems
> >>> that are not better off with 5.3-rc than they were with 5.2, well, we
> >>> have not made progress.  However, if there are systems that are worse
> >>> off with 5.3, that's bad.  In the face of the latest findings the only
> >>> way to avoid that is to be backwards compatible with 5.2 and that's
> >>> where my patch is going.  That cannot be achieved by quirking all
> >>> cases that are reported as "bad", because there still may be
> >>> unreported ones.
> >>
> >> I have to agree. I think your proposal may allow PCI D3cold,
> >
> > Yes, it may.
>
> Somehow the 9380 with Toshiba NVMe never hits SLP_S0 with or without
> Rafael's patch.
> But the "real" s2idle power consumption does improve with the patch.

Do you mean this patch:

https://lore.kernel.org/linux-pm/70D536BE-8DC7-4CA2-84A9-AFB067BA520E at canonical.com/T/#m456aa5c69973a3b68f2cdd4713a1ce83be51458f

or the $subject one without the above?

> Can we use a DMI based quirk for this platform? It seems like a platform
> specific issue.

We seem to see too many "platform-specific issues" here. :-)

To me, the status quo (ie. what we have in 5.3-rc2) is not defensible.
Something needs to be done to improve the situation.

> >
> >> In which case we do need to reintroduce the HMB handling.
> >
> > Right.
>
> The patch alone doesn't break HMB Toshiba NVMe I tested. But I think it's
> still safer to do proper HMB handling.

Well, so can anyone please propose something specific?  Like an
alternative patch?

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-08-01 17:29                             ` rafael
@ 2019-08-01 19:05                               ` Mario.Limonciello
  2019-08-01 22:26                                 ` rafael
  0 siblings, 1 reply; 75+ messages in thread
From: Mario.Limonciello @ 2019-08-01 19:05 UTC (permalink / raw)


> -----Original Message-----
> From: Rafael J. Wysocki <rafael at kernel.org>
> Sent: Thursday, August 1, 2019 12:30 PM
> To: Kai-Heng Feng; Keith Busch; Limonciello, Mario
> Cc: Keith Busch; Christoph Hellwig; Sagi Grimberg; linux-nvme; Linux PM; Linux
> Kernel Mailing List; Rajat Jain
> Subject: Re: [Regression] Commit "nvme/pci: Use host managed power state for
> suspend" has problems
> 
> 
> [EXTERNAL EMAIL]
> 
> On Thu, Aug 1, 2019 at 11:06 AM Kai-Heng Feng
> <kai.heng.feng@canonical.com> wrote:
> >
> > at 06:33, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > > On Thu, Aug 1, 2019 at 12:22 AM Keith Busch <kbusch@kernel.org> wrote:
> > >> On Wed, Jul 31, 2019 at 11:25:51PM +0200, Rafael J. Wysocki wrote:
> > >>> A couple of remarks if you will.
> > >>>
> > >>> First, we don't know which case is the majority at this point.  For
> > >>> now, there is one example of each, but it may very well turn out that
> > >>> the SK Hynix BC501 above needs to be quirked.
> > >>>
> > >>> Second, the reference here really is 5.2, so if there are any systems
> > >>> that are not better off with 5.3-rc than they were with 5.2, well, we
> > >>> have not made progress.  However, if there are systems that are worse
> > >>> off with 5.3, that's bad.  In the face of the latest findings the only
> > >>> way to avoid that is to be backwards compatible with 5.2 and that's
> > >>> where my patch is going.  That cannot be achieved by quirking all
> > >>> cases that are reported as "bad", because there still may be
> > >>> unreported ones.
> > >>
> > >> I have to agree. I think your proposal may allow PCI D3cold,
> > >
> > > Yes, it may.
> >
> > Somehow the 9380 with Toshiba NVMe never hits SLP_S0 with or without
> > Rafael's patch.
> > But the "real" s2idle power consumption does improve with the patch.
> 
> Do you mean this patch:
> 
> https://lore.kernel.org/linux-pm/70D536BE-8DC7-4CA2-84A9-
> AFB067BA520E at canonical.com/T/#m456aa5c69973a3b68f2cdd4713a1ce83be5145
> 8f
> 
> or the $subject one without the above?
> 
> > Can we use a DMI based quirk for this platform? It seems like a platform
> > specific issue.
> 
> We seem to see too many "platform-specific issues" here. :-)
> 
> To me, the status quo (ie. what we have in 5.3-rc2) is not defensible.
> Something needs to be done to improve the situation.

Rafael, would it be possible to try popping the PC401 out of the 9380 and into a 9360 to
confirm whether there is actually a platform impact?

I was hoping to have something useful from Hynix by now before responding, but oh well.

In terms of what is the majority, I do know that between folks at Dell, Google, Compal,
Wistron, Canonical, Micron, Hynix, Toshiba, LiteOn, and Western Digital we tested a wide
variety of SSDs with this patch series.  I would like to think that they are representative of
what's being manufactured into machines now.

Notably, the LiteOn CL1 was tested with the HMB flushing support, and
the Hynix PC401 was tested with older firmware, though.

> 
> > >
> > >> In which case we do need to reintroduce the HMB handling.
> > >
> > > Right.
> >
> > The patch alone doesn't break HMB Toshiba NVMe I tested. But I think it's
> > still safer to do proper HMB handling.
> 
> Well, so can anyone please propose something specific?  Like an
> alternative patch?

This was proposed a few days ago:
http://lists.infradead.org/pipermail/linux-nvme/2019-July/026056.html

However we're still not sure why it is needed, and it will take some time to get
a proper failure analysis from LiteOn regarding the CL1.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-08-01  9:05                           ` kai.heng.feng
  2019-08-01 17:29                             ` rafael
@ 2019-08-01 20:22                             ` kbusch
  1 sibling, 0 replies; 75+ messages in thread
From: kbusch @ 2019-08-01 20:22 UTC (permalink / raw)


On Thu, Aug 01, 2019 at 02:05:54AM -0700, Kai-Heng Feng wrote:
> at 06:33, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > On Thu, Aug 1, 2019 at 12:22 AM Keith Busch <kbusch@kernel.org> wrote:
> >
> >> In which case we do need to reintroduce the HMB handling.
> >
> > Right.
> 
> The patch alone doesn't break HMB Toshiba NVMe I tested. But I think it's
> still safer to do proper HMB handling.

The spec requires the host to request that the controller release the HMB
before D3cold. I suspect you're only getting to D3hot.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-08-01 19:05                               ` Mario.Limonciello
@ 2019-08-01 22:26                                 ` rafael
  2019-08-02 10:55                                   ` kai.heng.feng
  0 siblings, 1 reply; 75+ messages in thread
From: rafael @ 2019-08-01 22:26 UTC (permalink / raw)


On Thu, Aug 1, 2019 at 9:05 PM <Mario.Limonciello@dell.com> wrote:
>
> > -----Original Message-----
> > From: Rafael J. Wysocki <rafael at kernel.org>
> > Sent: Thursday, August 1, 2019 12:30 PM
> > To: Kai-Heng Feng; Keith Busch; Limonciello, Mario
> > Cc: Keith Busch; Christoph Hellwig; Sagi Grimberg; linux-nvme; Linux PM; Linux
> > Kernel Mailing List; Rajat Jain
> > Subject: Re: [Regression] Commit "nvme/pci: Use host managed power state for
> > suspend" has problems
> >
> >
> > [EXTERNAL EMAIL]
> >
> > On Thu, Aug 1, 2019 at 11:06 AM Kai-Heng Feng
> > <kai.heng.feng@canonical.com> wrote:
> > >
> > > at 06:33, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >
> > > > On Thu, Aug 1, 2019 at 12:22 AM Keith Busch <kbusch@kernel.org> wrote:
> > > >> On Wed, Jul 31, 2019 at 11:25:51PM +0200, Rafael J. Wysocki wrote:
> > > >>> A couple of remarks if you will.
> > > >>>
> > > >>> First, we don't know which case is the majority at this point.  For
> > > >>> now, there is one example of each, but it may very well turn out that
> > > >>> the SK Hynix BC501 above needs to be quirked.
> > > >>>
> > > >>> Second, the reference here really is 5.2, so if there are any systems
> > > >>> that are not better off with 5.3-rc than they were with 5.2, well, we
> > > >>> have not made progress.  However, if there are systems that are worse
> > > >>> off with 5.3, that's bad.  In the face of the latest findings the only
> > > >>> way to avoid that is to be backwards compatible with 5.2 and that's
> > > >>> where my patch is going.  That cannot be achieved by quirking all
> > > >>> cases that are reported as "bad", because there still may be
> > > >>> unreported ones.
> > > >>
> > > >> I have to agree. I think your proposal may allow PCI D3cold,
> > > >
> > > > Yes, it may.
> > >
> > > Somehow the 9380 with Toshiba NVMe never hits SLP_S0 with or without
> > > Rafael's patch.
> > > But the "real" s2idle power consumption does improve with the patch.
> >
> > Do you mean this patch:
> >
> > https://lore.kernel.org/linux-pm/70D536BE-8DC7-4CA2-84A9-
> > AFB067BA520E at canonical.com/T/#m456aa5c69973a3b68f2cdd4713a1ce83be5145
> > 8f
> >
> > or the $subject one without the above?
> >
> > > Can we use a DMI based quirk for this platform? It seems like a platform
> > > specific issue.
> >
> > We seem to see too many "platform-specific issues" here. :-)
> >
> > To me, the status quo (ie. what we have in 5.3-rc2) is not defensible.
> > Something needs to be done to improve the situation.
>
> Rafael, would it be possible to try popping out PC401 from the 9380 and into a 9360 to
> confirm there actually being a platform impact or not?

Not really, sorry.

> I was hoping to have something useful from Hynix by now before responding, but oh well.
>
> In terms of what is the majority, I do know that between folks at Dell, Google, Compal,
> Wistron, Canonical, Micron, Hynix, Toshiba, LiteOn, and Western Digital we tested a wide
> variety of SSDs with this patch series.  I would like to think that they are representative of
> what's being manufactured into machines now.

Well, what about drives already in the field?  My concern is mostly
about those.

> Notably the LiteOn CL1 was tested with the HMB flushing support and
> the Hynix PC401 was tested with older firmware though.
>
> >
> > > >
> > > >> In which case we do need to reintroduce the HMB handling.
> > > >
> > > > Right.
> > >
> > > The patch alone doesn't break HMB Toshiba NVMe I tested. But I think it's
> > > still safer to do proper HMB handling.
> >
> > Well, so can anyone please propose something specific?  Like an
> > alternative patch?
>
> This was proposed a few days ago:
> http://lists.infradead.org/pipermail/linux-nvme/2019-July/026056.html
>
> However we're still not sure why it is needed, and it will take some time to get
> a proper failure analysis from LiteOn regarding the CL1.

Thanks for the update, but IMO we still need to do something before
final 5.3 while the investigation continues.

Honestly, at this point I would vote for going back to the 5.2
behavior at least by default and only running the new code on the
drives known to require it (because they will block PC10 otherwise).

Possibly (ideally) with an option for users who can't get beyond PC3
to test whether or not the new code helps them.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-08-01 22:26                                 ` rafael
@ 2019-08-02 10:55                                   ` kai.heng.feng
  2019-08-02 11:04                                     ` rafael
  0 siblings, 1 reply; 75+ messages in thread
From: kai.heng.feng @ 2019-08-02 10:55 UTC (permalink / raw)


at 06:26, Rafael J. Wysocki <rafael@kernel.org> wrote:

> On Thu, Aug 1, 2019 at 9:05 PM <Mario.Limonciello@dell.com> wrote:
>>> -----Original Message-----
>>> From: Rafael J. Wysocki <rafael at kernel.org>
>>> Sent: Thursday, August 1, 2019 12:30 PM
>>> To: Kai-Heng Feng; Keith Busch; Limonciello, Mario
>>> Cc: Keith Busch; Christoph Hellwig; Sagi Grimberg; linux-nvme; Linux  
>>> PM; Linux
>>> Kernel Mailing List; Rajat Jain
>>> Subject: Re: [Regression] Commit "nvme/pci: Use host managed power  
>>> state for
>>> suspend" has problems
>>>
>>>
>>> [EXTERNAL EMAIL]
>>>
>>> On Thu, Aug 1, 2019 at 11:06 AM Kai-Heng Feng
>>> <kai.heng.feng@canonical.com> wrote:
>>>> at 06:33, Rafael J. Wysocki <rafael@kernel.org> wrote:
>>>>
>>>>> On Thu, Aug 1, 2019 at 12:22 AM Keith Busch <kbusch@kernel.org> wrote:
>>>>>> On Wed, Jul 31, 2019 at 11:25:51PM +0200, Rafael J. Wysocki wrote:
>>>>>>> A couple of remarks if you will.
>>>>>>>
>>>>>>> First, we don't know which case is the majority at this point.  For
>>>>>>> now, there is one example of each, but it may very well turn out that
>>>>>>> the SK Hynix BC501 above needs to be quirked.
>>>>>>>
>>>>>>> Second, the reference here really is 5.2, so if there are any systems
>>>>>>> that are not better off with 5.3-rc than they were with 5.2, well, we
>>>>>>> have not made progress.  However, if there are systems that are worse
>>>>>>> off with 5.3, that's bad.  In the face of the latest findings the  
>>>>>>> only
>>>>>>> way to avoid that is to be backwards compatible with 5.2 and that's
>>>>>>> where my patch is going.  That cannot be achieved by quirking all
>>>>>>> cases that are reported as "bad", because there still may be
>>>>>>> unreported ones.
>>>>>>
>>>>>> I have to agree. I think your proposal may allow PCI D3cold,
>>>>>
>>>>> Yes, it may.
>>>>
>>>> Somehow the 9380 with Toshiba NVMe never hits SLP_S0 with or without
>>>> Rafael's patch.
>>>> But the "real" s2idle power consumption does improve with the patch.
>>>
>>> Do you mean this patch:
>>>
>>> https://lore.kernel.org/linux-pm/70D536BE-8DC7-4CA2-84A9-
>>> AFB067BA520E at canonical.com/T/#m456aa5c69973a3b68f2cdd4713a1ce83be5145
>>> 8f
>>>
>>> or the $subject one without the above?
>>>
>>>> Can we use a DMI based quirk for this platform? It seems like a platform
>>>> specific issue.
>>>
>>> We seem to see too many "platform-specific issues" here. :-)
>>>
>>> To me, the status quo (ie. what we have in 5.3-rc2) is not defensible.
>>> Something needs to be done to improve the situation.
>>
>> Rafael, would it be possible to try popping out PC401 from the 9380 and  
>> into a 9360 to
>> confirm there actually being a platform impact or not?
>
> Not really, sorry.
>
>> I was hoping to have something useful from Hynix by now before  
>> responding, but oh well.
>>
>> In terms of what is the majority, I do know that between folks at Dell,  
>> Google, Compal,
>> Wistron, Canonical, Micron, Hynix, Toshiba, LiteOn, and Western Digital  
>> we tested a wide
>> variety of SSDs with this patch series.  I would like to think that they  
>> are representative of
>> what's being manufactured into machines now.
>
> Well, what about drives already in the field?  My concern is mostly
> about those ones.
>
>> Notably the LiteOn CL1 was tested with the HMB flushing support and
>> the Hynix PC401 was tested with older firmware though.
>>
>>>>>> In which case we do need to reintroduce the HMB handling.
>>>>>
>>>>> Right.
>>>>
>>>> The patch alone doesn't break HMB Toshiba NVMe I tested. But I think
>>>> it's
>>>> still safer to do proper HMB handling.
>>>
>>> Well, so can anyone please propose something specific?  Like an
>>> alternative patch?
>>
>> This was proposed a few days ago:
>> http://lists.infradead.org/pipermail/linux-nvme/2019-July/026056.html
>>
>> However we're still not sure why it is needed, and it will take some  
>> time to get
>> a proper failure analysis from LiteOn regarding the CL1.
>
> Thanks for the update, but IMO we still need to do something before
> final 5.3 while the investigation continues.
>
> Honestly, at this point I would vote for going back to the 5.2
> behavior at least by default and only running the new code on the
> drives known to require it (because they will block PC10 otherwise).
>
> Possibly (ideally) with an option for users who can't get beyond PC3
> to test whether or not the new code helps them.

I just found out that the XPS 9380 I have at hand never reaches SLP_S0 but  
only PC10.
This happens with or without putting the device to D3.
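For reference, whether a machine actually reached SLP_S0 over an s2idle cycle can be read from the intel_pmc_core debugfs counter; a minimal sketch (assumes debugfs is mounted at /sys/kernel/debug and the platform is supported by the driver):

```shell
# Read the SLP_S0 residency counter exposed by the intel_pmc_core driver.
# Compare the value before and after an s2idle cycle: if it does not
# increase, the SoC never entered SLP_S0 even if it reached PC10.
f=/sys/kernel/debug/pmc_core/slp_s0_residency_usec
if [ -r "$f" ]; then
    out="SLP_S0 residency (usec): $(cat "$f")"
else
    out="intel_pmc_core debugfs entry not available on this system"
fi
echo "$out"
```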

Kai-Heng

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-08-02 10:55                                   ` kai.heng.feng
@ 2019-08-02 11:04                                     ` rafael
  2019-08-05 19:13                                       ` kai.heng.feng
  0 siblings, 1 reply; 75+ messages in thread
From: rafael @ 2019-08-02 11:04 UTC (permalink / raw)


On Fri, Aug 2, 2019 at 12:55 PM Kai-Heng Feng
<kai.heng.feng@canonical.com> wrote:
>
>@06:26, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> > On Thu, Aug 1, 2019@9:05 PM <Mario.Limonciello@dell.com> wrote:
> >>> -----Original Message-----
> >>> From: Rafael J. Wysocki <rafael at kernel.org>
> >>> Sent: Thursday, August 1, 2019 12:30 PM
> >>> To: Kai-Heng Feng; Keith Busch; Limonciello, Mario
> >>> Cc: Keith Busch; Christoph Hellwig; Sagi Grimberg; linux-nvme; Linux
> >>> PM; Linux
> >>> Kernel Mailing List; Rajat Jain
> >>> Subject: Re: [Regression] Commit "nvme/pci: Use host managed power
> >>> state for
> >>> suspend" has problems
> >>>
> >>>
> >>> [EXTERNAL EMAIL]
> >>>
> >>> On Thu, Aug 1, 2019 at 11:06 AM Kai-Heng Feng
> >>> <kai.heng.feng@canonical.com> wrote:
> >>>>@06:33, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >>>>
> >>>>> On Thu, Aug 1, 2019@12:22 AM Keith Busch <kbusch@kernel.org> wrote:
> >>>>>> On Wed, Jul 31, 2019@11:25:51PM +0200, Rafael J. Wysocki wrote:
> >>>>>>> A couple of remarks if you will.
> >>>>>>>
> >>>>>>> First, we don't know which case is the majority at this point.  For
> >>>>>>> now, there is one example of each, but it may very well turn out that
> >>>>>>> the SK Hynix BC501 above needs to be quirked.
> >>>>>>>
> >>>>>>> Second, the reference here really is 5.2, so if there are any systems
> >>>>>>> that are not better off with 5.3-rc than they were with 5.2, well, we
> >>>>>>> have not made progress.  However, if there are systems that are worse
> >>>>>>> off with 5.3, that's bad.  In the face of the latest findings the
> >>>>>>> only
> >>>>>>> way to avoid that is to be backwards compatible with 5.2 and that's
> >>>>>>> where my patch is going.  That cannot be achieved by quirking all
> >>>>>>> cases that are reported as "bad", because there still may be
> >>>>>>> unreported ones.
> >>>>>>
> >>>>>> I have to agree. I think your proposal may allow PCI D3cold,
> >>>>>
> >>>>> Yes, it may.
> >>>>
> >>>> Somehow the 9380 with Toshiba NVMe never hits SLP_S0 with or without
>>>> Rafael's patch.
>>>> But the "real" s2idle power consumption does improve with the patch.
> >>>
> >>> Do you mean this patch:
> >>>
> >>> https://lore.kernel.org/linux-pm/70D536BE-8DC7-4CA2-84A9-
> >>> AFB067BA520E at canonical.com/T/#m456aa5c69973a3b68f2cdd4713a1ce83be5145
> >>> 8f
> >>>
> >>> or the $subject one without the above?
> >>>
> >>>> Can we use a DMI based quirk for this platform? It seems like a platform
> >>>> specific issue.
> >>>
> >>> We seem to see too many "platform-specific issues" here. :-)
> >>>
> >>> To me, the status quo (ie. what we have in 5.3-rc2) is not defensible.
> >>> Something needs to be done to improve the situation.
> >>
> >> Rafael, would it be possible to try popping out PC401 from the 9380 and
> >> into a 9360 to
> >> confirm there actually being a platform impact or not?
> >
> > Not really, sorry.
> >
> >> I was hoping to have something useful from Hynix by now before
> >> responding, but oh well.
> >>
> >> In terms of what is the majority, I do know that between folks at Dell,
> >> Google, Compal,
> >> Wistron, Canonical, Micron, Hynix, Toshiba, LiteOn, and Western Digital
> >> we tested a wide
> >> variety of SSDs with this patch series.  I would like to think that they
> >> are representative of
> >> what's being manufactured into machines now.
> >
> > Well, what about drives already in the field?  My concern is mostly
> > about those ones.
> >
> >> Notably the LiteOn CL1 was tested with the HMB flushing support and
> >> Hynix PC401 was tested with older firmware though.
> >>
> >>>>>> In which case we do need to reintroduce the HMB handling.
> >>>>>
> >>>>> Right.
> >>>>
> >>>> The patch alone doesn't break HMB Toshiba NVMe I tested. But I think
> >>>> it's
> >>>> still safer to do proper HMB handling.
> >>>
> >>> Well, so can anyone please propose something specific?  Like an
> >>> alternative patch?
> >>
> >> This was proposed a few days ago:
> >> http://lists.infradead.org/pipermail/linux-nvme/2019-July/026056.html
> >>
> >> However we're still not sure why it is needed, and it will take some
> >> time to get
> >> a proper failure analysis from LiteOn  regarding the CL1.
> >
> > Thanks for the update, but IMO we still need to do something before
> > final 5.3 while the investigation continues.
> >
> > Honestly, at this point I would vote for going back to the 5.2
> > behavior at least by default and only running the new code on the
> > drives known to require it (because they will block PC10 otherwise).
> >
> > Possibly (ideally) with an option for users who can't get beyond PC3
> > to test whether or not the new code helps them.
>
> I just found out that the XPS 9380 at my hand never reaches SLP_S0 but only
> PC10.

That's the case for me too.

> This happens with or without putting the device to D3.

On my system, though, it only can get to PC3 without putting the NVMe
into D3 (as reported previously).
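As a quick way to see which D-state the NVMe device is actually left in, the PCI power-management status register can be dumped with lspci (pciutils); a sketch, where the bus address 3b:00.0 is only an example (find yours with `lspci | grep -i non-volatile`) and root is typically needed for the full capability dump:

```shell
# Show the PCI PM state (D0..D3) of the NVMe device; the "Status: Dn"
# line in the Power Management capability reflects the PMCSR register.
if command -v lspci >/dev/null 2>&1; then
    out=$(lspci -vv -s 3b:00.0 2>/dev/null | grep -E 'Status: D[0-3]')
    [ -n "$out" ] || out="device 3b:00.0 not found (adjust the bus address)"
else
    out="lspci (pciutils) not installed"
fi
echo "$out"
```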

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-08-02 11:04                                     ` rafael
@ 2019-08-05 19:13                                       ` kai.heng.feng
  2019-08-05 21:28                                         ` rafael
  0 siblings, 1 reply; 75+ messages in thread
From: kai.heng.feng @ 2019-08-05 19:13 UTC (permalink / raw)


at 19:04, Rafael J. Wysocki <rafael@kernel.org> wrote:

> On Fri, Aug 2, 2019 at 12:55 PM Kai-Heng Feng
> <kai.heng.feng@canonical.com> wrote:
>>@06:26, Rafael J. Wysocki <rafael@kernel.org> wrote:
>>
>>> On Thu, Aug 1, 2019@9:05 PM <Mario.Limonciello@dell.com> wrote:
>>>>> -----Original Message-----
>>>>> From: Rafael J. Wysocki <rafael at kernel.org>
>>>>> Sent: Thursday, August 1, 2019 12:30 PM
>>>>> To: Kai-Heng Feng; Keith Busch; Limonciello, Mario
>>>>> Cc: Keith Busch; Christoph Hellwig; Sagi Grimberg; linux-nvme; Linux
>>>>> PM; Linux
>>>>> Kernel Mailing List; Rajat Jain
>>>>> Subject: Re: [Regression] Commit "nvme/pci: Use host managed power
>>>>> state for
>>>>> suspend" has problems
>>>>>
>>>>>
>>>>> [EXTERNAL EMAIL]
>>>>>
>>>>> On Thu, Aug 1, 2019 at 11:06 AM Kai-Heng Feng
>>>>> <kai.heng.feng@canonical.com> wrote:
>>>>>>@06:33, Rafael J. Wysocki <rafael@kernel.org> wrote:
>>>>>>
>>>>>>> On Thu, Aug 1, 2019 at 12:22 AM Keith Busch <kbusch at kernel.org>  
>>>>>>> wrote:
>>>>>>>> On Wed, Jul 31, 2019@11:25:51PM +0200, Rafael J. Wysocki wrote:
>>>>>>>>> A couple of remarks if you will.
>>>>>>>>>
>>>>>>>>> First, we don't know which case is the majority at this point.  For
>>>>>>>>> now, there is one example of each, but it may very well turn out  
>>>>>>>>> that
>>>>>>>>> the SK Hynix BC501 above needs to be quirked.
>>>>>>>>>
>>>>>>>>> Second, the reference here really is 5.2, so if there are any  
>>>>>>>>> systems
>>>>>>>>> that are not better off with 5.3-rc than they were with 5.2,  
>>>>>>>>> well, we
>>>>>>>>> have not made progress.  However, if there are systems that are  
>>>>>>>>> worse
>>>>>>>>> off with 5.3, that's bad.  In the face of the latest findings the
>>>>>>>>> only
>>>>>>>>> way to avoid that is to be backwards compatible with 5.2 and that's
>>>>>>>>> where my patch is going.  That cannot be achieved by quirking all
>>>>>>>>> cases that are reported as "bad", because there still may be
>>>>>>>>> unreported ones.
>>>>>>>>
>>>>>>>> I have to agree. I think your proposal may allow PCI D3cold,
>>>>>>>
>>>>>>> Yes, it may.
>>>>>>
>>>>>> Somehow the 9380 with Toshiba NVMe never hits SLP_S0 with or without
>>>>>> Rafael's patch.
>>>>>> But the "real" s2idle power consumption does improve with the patch.
>>>>>
>>>>> Do you mean this patch:
>>>>>
>>>>> https://lore.kernel.org/linux-pm/70D536BE-8DC7-4CA2-84A9-
>>>>> AFB067BA520E at canonical.com/T/#m456aa5c69973a3b68f2cdd4713a1ce83be5145
>>>>> 8f
>>>>>
>>>>> or the $subject one without the above?
>>>>>
>>>>>> Can we use a DMI based quirk for this platform? It seems like a  
>>>>>> platform
>>>>>> specific issue.
>>>>>
>>>>> We seem to see too many "platform-specific issues" here. :-)
>>>>>
>>>>> To me, the status quo (ie. what we have in 5.3-rc2) is not defensible.
>>>>> Something needs to be done to improve the situation.
>>>>
>>>> Rafael, would it be possible to try popping out PC401 from the 9380 and
>>>> into a 9360 to
>>>> confirm there actually being a platform impact or not?
>>>
>>> Not really, sorry.
>>>
>>>> I was hoping to have something useful from Hynix by now before
>>>> responding, but oh well.
>>>>
>>>> In terms of what is the majority, I do know that between folks at Dell,
>>>> Google, Compal,
>>>> Wistron, Canonical, Micron, Hynix, Toshiba, LiteOn, and Western Digital
>>>> we tested a wide
>>>> variety of SSDs with this patch series.  I would like to think that they
>>>> are representative of
>>>> what's being manufactured into machines now.
>>>
>>> Well, what about drives already in the field?  My concern is mostly
>>> about those ones.
>>>
>>>> Notably the LiteOn CL1 was tested with the HMB flushing support and
>>>> Hynix PC401 was tested with older firmware though.
>>>>
>>>>>>>> In which case we do need to reintroduce the HMB handling.
>>>>>>>
>>>>>>> Right.
>>>>>>
>>>>>> The patch alone doesn't break HMB Toshiba NVMe I tested. But I think
>>>>>> it's
>>>>>> still safer to do proper HMB handling.
>>>>>
>>>>> Well, so can anyone please propose something specific?  Like an
>>>>> alternative patch?
>>>>
>>>> This was proposed a few days ago:
>>>> http://lists.infradead.org/pipermail/linux-nvme/2019-July/026056.html
>>>>
>>>> However we're still not sure why it is needed, and it will take some
>>>> time to get
>>>> a proper failure analysis from LiteOn  regarding the CL1.
>>>
>>> Thanks for the update, but IMO we still need to do something before
>>> final 5.3 while the investigation continues.
>>>
>>> Honestly, at this point I would vote for going back to the 5.2
>>> behavior at least by default and only running the new code on the
>>> drives known to require it (because they will block PC10 otherwise).
>>>
>>> Possibly (ideally) with an option for users who can't get beyond PC3
>>> to test whether or not the new code helps them.
>>
>> I just found out that the XPS 9380 at my hand never reaches SLP_S0 but  
>> only
>> PC10.
>
> That's the case for me too.
>
>> This happens with or without putting the device to D3.
>
> On my system, though, it only can get to PC3 without putting the NVMe
> into D3 (as reported previously).

I forgot to ask, what BIOS version does the system have?
I don't see this issue on BIOS v1.5.0.

Kai-Heng

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-08-05 19:13                                       ` kai.heng.feng
@ 2019-08-05 21:28                                         ` rafael
  2019-08-06 14:02                                           ` Mario.Limonciello
  0 siblings, 1 reply; 75+ messages in thread
From: rafael @ 2019-08-05 21:28 UTC (permalink / raw)


On Mon, Aug 5, 2019 at 9:14 PM Kai-Heng Feng
<kai.heng.feng@canonical.com> wrote:
>
>@19:04, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> > On Fri, Aug 2, 2019 at 12:55 PM Kai-Heng Feng
> > <kai.heng.feng@canonical.com> wrote:
> >>@06:26, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >>
> >>> On Thu, Aug 1, 2019@9:05 PM <Mario.Limonciello@dell.com> wrote:
> >>>>> -----Original Message-----
> >>>>> From: Rafael J. Wysocki <rafael at kernel.org>
> >>>>> Sent: Thursday, August 1, 2019 12:30 PM
> >>>>> To: Kai-Heng Feng; Keith Busch; Limonciello, Mario
> >>>>> Cc: Keith Busch; Christoph Hellwig; Sagi Grimberg; linux-nvme; Linux
> >>>>> PM; Linux
> >>>>> Kernel Mailing List; Rajat Jain
> >>>>> Subject: Re: [Regression] Commit "nvme/pci: Use host managed power
> >>>>> state for
> >>>>> suspend" has problems
> >>>>>
> >>>>>
> >>>>> [EXTERNAL EMAIL]
> >>>>>
> >>>>> On Thu, Aug 1, 2019 at 11:06 AM Kai-Heng Feng
> >>>>> <kai.heng.feng@canonical.com> wrote:
> >>>>>>@06:33, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >>>>>>
> >>>>>>> On Thu, Aug 1, 2019 at 12:22 AM Keith Busch <kbusch at kernel.org>
> >>>>>>> wrote:
> >>>>>>>> On Wed, Jul 31, 2019@11:25:51PM +0200, Rafael J. Wysocki wrote:
> >>>>>>>>> A couple of remarks if you will.
> >>>>>>>>>
> >>>>>>>>> First, we don't know which case is the majority at this point.  For
> >>>>>>>>> now, there is one example of each, but it may very well turn out
> >>>>>>>>> that
> >>>>>>>>> the SK Hynix BC501 above needs to be quirked.
> >>>>>>>>>
> >>>>>>>>> Second, the reference here really is 5.2, so if there are any
> >>>>>>>>> systems
> >>>>>>>>> that are not better off with 5.3-rc than they were with 5.2,
> >>>>>>>>> well, we
> >>>>>>>>> have not made progress.  However, if there are systems that are
> >>>>>>>>> worse
> >>>>>>>>> off with 5.3, that's bad.  In the face of the latest findings the
> >>>>>>>>> only
> >>>>>>>>> way to avoid that is to be backwards compatible with 5.2 and that's
> >>>>>>>>> where my patch is going.  That cannot be achieved by quirking all
> >>>>>>>>> cases that are reported as "bad", because there still may be
> >>>>>>>>> unreported ones.
> >>>>>>>>
> >>>>>>>> I have to agree. I think your proposal may allow PCI D3cold,
> >>>>>>>
> >>>>>>> Yes, it may.
> >>>>>>
> >>>>>> Somehow the 9380 with Toshiba NVMe never hits SLP_S0 with or without
> >>>>>> Rafael's patch.
> >>>>>> But the "real" s2idle power consumption does improve with the patch.
> >>>>>
> >>>>> Do you mean this patch:
> >>>>>
> >>>>> https://lore.kernel.org/linux-pm/70D536BE-8DC7-4CA2-84A9-
> >>>>> AFB067BA520E at canonical.com/T/#m456aa5c69973a3b68f2cdd4713a1ce83be5145
> >>>>> 8f
> >>>>>
> >>>>> or the $subject one without the above?
> >>>>>
> >>>>>> Can we use a DMI based quirk for this platform? It seems like a
> >>>>>> platform
> >>>>>> specific issue.
> >>>>>
> >>>>> We seem to see too many "platform-specific issues" here. :-)
> >>>>>
> >>>>> To me, the status quo (ie. what we have in 5.3-rc2) is not defensible.
> >>>>> Something needs to be done to improve the situation.
> >>>>
> >>>> Rafael, would it be possible to try popping out PC401 from the 9380 and
> >>>> into a 9360 to
> >>>> confirm there actually being a platform impact or not?
> >>>
> >>> Not really, sorry.
> >>>
> >>>> I was hoping to have something useful from Hynix by now before
> >>>> responding, but oh well.
> >>>>
> >>>> In terms of what is the majority, I do know that between folks at Dell,
> >>>> Google, Compal,
> >>>> Wistron, Canonical, Micron, Hynix, Toshiba, LiteOn, and Western Digital
> >>>> we tested a wide
> >>>> variety of SSDs with this patch series.  I would like to think that they
> >>>> are representative of
> >>>> what's being manufactured into machines now.
> >>>
> >>> Well, what about drives already in the field?  My concern is mostly
> >>> about those ones.
> >>>
> >>>> Notably the LiteOn CL1 was tested with the HMB flushing support and
> >>>> Hynix PC401 was tested with older firmware though.
> >>>>
> >>>>>>>> In which case we do need to reintroduce the HMB handling.
> >>>>>>>
> >>>>>>> Right.
> >>>>>>
> >>>>>> The patch alone doesn't break HMB Toshiba NVMe I tested. But I think
> >>>>>> it's
> >>>>>> still safer to do proper HMB handling.
> >>>>>
> >>>>> Well, so can anyone please propose something specific?  Like an
> >>>>> alternative patch?
> >>>>
> >>>> This was proposed a few days ago:
> >>>> http://lists.infradead.org/pipermail/linux-nvme/2019-July/026056.html
> >>>>
> >>>> However we're still not sure why it is needed, and it will take some
> >>>> time to get
> >>>> a proper failure analysis from LiteOn  regarding the CL1.
> >>>
> >>> Thanks for the update, but IMO we still need to do something before
> >>> final 5.3 while the investigation continues.
> >>>
> >>> Honestly, at this point I would vote for going back to the 5.2
> >>> behavior at least by default and only running the new code on the
> >>> drives known to require it (because they will block PC10 otherwise).
> >>>
> >>> Possibly (ideally) with an option for users who can't get beyond PC3
> >>> to test whether or not the new code helps them.
> >>
> >> I just found out that the XPS 9380 at my hand never reaches SLP_S0 but
> >> only
> >> PC10.
> >
> > That's the case for me too.
> >
> >> This happens with or without putting the device to D3.
> >
> > On my system, though, it only can get to PC3 without putting the NVMe
> > into D3 (as reported previously).
>
> I forgot to ask, what BIOS version does the system have?
> I don't see this issue on BIOS v1.5.0.

It is 1.5.0 here too.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-08-05 21:28                                         ` rafael
@ 2019-08-06 14:02                                           ` Mario.Limonciello
  2019-08-06 15:00                                             ` rafael
  0 siblings, 1 reply; 75+ messages in thread
From: Mario.Limonciello @ 2019-08-06 14:02 UTC (permalink / raw)




> -----Original Message-----
> From: Rafael J. Wysocki <rafael at kernel.org>
> Sent: Monday, August 5, 2019 4:29 PM
> To: Kai-Heng Feng
> Cc: Rafael J. Wysocki; Limonciello, Mario; Keith Busch; Keith Busch; Christoph
> Hellwig; Sagi Grimberg; linux-nvme; Linux PM; Linux Kernel Mailing List; Rajat Jain
> Subject: Re: [Regression] Commit "nvme/pci: Use host managed power state for
> suspend" has problems
> 
> 
> [EXTERNAL EMAIL]
> 
> On Mon, Aug 5, 2019 at 9:14 PM Kai-Heng Feng
> <kai.heng.feng@canonical.com> wrote:
> >
> >@19:04, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > > On Fri, Aug 2, 2019 at 12:55 PM Kai-Heng Feng
> > > <kai.heng.feng@canonical.com> wrote:
> > >>@06:26, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >>
> > >>> On Thu, Aug 1, 2019@9:05 PM <Mario.Limonciello@dell.com> wrote:
> > >>>>> -----Original Message-----
> > >>>>> From: Rafael J. Wysocki <rafael at kernel.org>
> > >>>>> Sent: Thursday, August 1, 2019 12:30 PM
> > >>>>> To: Kai-Heng Feng; Keith Busch; Limonciello, Mario
> > >>>>> Cc: Keith Busch; Christoph Hellwig; Sagi Grimberg; linux-nvme; Linux
> > >>>>> PM; Linux
> > >>>>> Kernel Mailing List; Rajat Jain
> > >>>>> Subject: Re: [Regression] Commit "nvme/pci: Use host managed power
> > >>>>> state for
> > >>>>> suspend" has problems
> > >>>>>
> > >>>>>
> > >>>>> [EXTERNAL EMAIL]
> > >>>>>
> > >>>>> On Thu, Aug 1, 2019 at 11:06 AM Kai-Heng Feng
> > >>>>> <kai.heng.feng@canonical.com> wrote:
> > >>>>>>@06:33, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >>>>>>
> > >>>>>>> On Thu, Aug 1, 2019 at 12:22 AM Keith Busch <kbusch at kernel.org>
> > >>>>>>> wrote:
> > >>>>>>>> On Wed, Jul 31, 2019@11:25:51PM +0200, Rafael J. Wysocki wrote:
> > >>>>>>>>> A couple of remarks if you will.
> > >>>>>>>>>
> > >>>>>>>>> First, we don't know which case is the majority at this point.  For
> > >>>>>>>>> now, there is one example of each, but it may very well turn out
> > >>>>>>>>> that
> > >>>>>>>>> the SK Hynix BC501 above needs to be quirked.
> > >>>>>>>>>
> > >>>>>>>>> Second, the reference here really is 5.2, so if there are any
> > >>>>>>>>> systems
> > >>>>>>>>> that are not better off with 5.3-rc than they were with 5.2,
> > >>>>>>>>> well, we
> > >>>>>>>>> have not made progress.  However, if there are systems that are
> > >>>>>>>>> worse
> > >>>>>>>>> off with 5.3, that's bad.  In the face of the latest findings the
> > >>>>>>>>> only
> > >>>>>>>>> way to avoid that is to be backwards compatible with 5.2 and that's
> > >>>>>>>>> where my patch is going.  That cannot be achieved by quirking all
> > >>>>>>>>> cases that are reported as "bad", because there still may be
> > >>>>>>>>> unreported ones.
> > >>>>>>>>
> > >>>>>>>> I have to agree. I think your proposal may allow PCI D3cold,
> > >>>>>>>
> > >>>>>>> Yes, it may.
> > >>>>>>
> > >>>>>> Somehow the 9380 with Toshiba NVMe never hits SLP_S0 with or
> without
> > >>>>>> Rafael's patch.
> > >>>>>> But the "real" s2idle power consumption does improve with the patch.
> > >>>>>
> > >>>>> Do you mean this patch:
> > >>>>>
> > >>>>> https://lore.kernel.org/linux-pm/70D536BE-8DC7-4CA2-84A9-
> > >>>>>
> AFB067BA520E at canonical.com/T/#m456aa5c69973a3b68f2cdd4713a1ce83be5145
> > >>>>> 8f
> > >>>>>
> > >>>>> or the $subject one without the above?
> > >>>>>
> > >>>>>> Can we use a DMI based quirk for this platform? It seems like a
> > >>>>>> platform
> > >>>>>> specific issue.
> > >>>>>
> > >>>>> We seem to see too many "platform-specific issues" here. :-)
> > >>>>>
> > >>>>> To me, the status quo (ie. what we have in 5.3-rc2) is not defensible.
> > >>>>> Something needs to be done to improve the situation.
> > >>>>
> > >>>> Rafael, would it be possible to try popping out PC401 from the 9380 and
> > >>>> into a 9360 to
> > >>>> confirm there actually being a platform impact or not?
> > >>>
> > >>> Not really, sorry.
> > >>>
> > >>>> I was hoping to have something useful from Hynix by now before
> > >>>> responding, but oh well.
> > >>>>
> > >>>> In terms of what is the majority, I do know that between folks at Dell,
> > >>>> Google, Compal,
> > >>>> Wistron, Canonical, Micron, Hynix, Toshiba, LiteOn, and Western Digital
> > >>>> we tested a wide
> > >>>> variety of SSDs with this patch series.  I would like to think that they
> > >>>> are representative of
> > >>>> what's being manufactured into machines now.
> > >>>
> > >>> Well, what about drives already in the field?  My concern is mostly
> > >>> about those ones.
> > >>>
> > >>>> Notably the LiteOn CL1 was tested with the HMB flushing support and
> > >>>> Hynix PC401 was tested with older firmware though.
> > >>>>
> > >>>>>>>> In which case we do need to reintroduce the HMB handling.
> > >>>>>>>
> > >>>>>>> Right.
> > >>>>>>
> > >>>>>> The patch alone doesn't break HMB Toshiba NVMe I tested. But I think
> > >>>>>> it's
> > >>>>>> still safer to do proper HMB handling.
> > >>>>>
> > >>>>> Well, so can anyone please propose something specific?  Like an
> > >>>>> alternative patch?
> > >>>>
> > >>>> This was proposed a few days ago:
> > >>>> http://lists.infradead.org/pipermail/linux-nvme/2019-July/026056.html
> > >>>>
> > >>>> However we're still not sure why it is needed, and it will take some
> > >>>> time to get
> > >>>> a proper failure analysis from LiteOn  regarding the CL1.
> > >>>
> > >>> Thanks for the update, but IMO we still need to do something before
> > >>> final 5.3 while the investigation continues.
> > >>>
> > >>> Honestly, at this point I would vote for going back to the 5.2
> > >>> behavior at least by default and only running the new code on the
> > >>> drives known to require it (because they will block PC10 otherwise).
> > >>>
> > >>> Possibly (ideally) with an option for users who can't get beyond PC3
> > >>> to test whether or not the new code helps them.
> > >>
> > >> I just found out that the XPS 9380 at my hand never reaches SLP_S0 but
> > >> only
> > >> PC10.
> > >
> > > That's the case for me too.
> > >
> > >> This happens with or without putting the device to D3.
> > >
> > > On my system, though, it only can get to PC3 without putting the NVMe
> > > into D3 (as reported previously).
> >
> > I forgot to ask, what BIOS version does the system have?
> > I don't see this issue on BIOS v1.5.0.
> 
> It is 1.5.0 here too.

All, regarding the need for the patch proposed by Rafael on PC401, I have some updates
to share from Hynix.
First off - the firmware changelog from 80006E00 to 80007E00 is misleading.

The change was made in the firmware specifically because of a change in behavior from
Intel KBL to CFL and WHL.  On CFL/WHL, the time for RefClk to turn back on after L1.2 exit
was longer than on KBL platforms.  This meant that Hynix couldn't lock from CLKREQ#
assertion to a stable RefClk as quickly on CFL/WHL, so a "larger" fixed delay was introduced in their FW.

To those that don't know - XPS 9380 is a WHL platform.

Second - a hypothesis of what is happening with the patch proposed by Rafael is that the link
is only transitioning to L1.0 rather than L1.2.  This may satisfy the PMC but it shouldn't lead to
the lowest actual device power state.
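One way to probe that hypothesis is to dump the ASPM and L1 PM Substates registers for the NVMe device (and its root port) with lspci. A sketch, assuming pciutils; 3b:00.0 is an example address (`lspci | grep -i non-volatile` finds the real one), and root is needed to read the extended capabilities:

```shell
# LnkCtl shows whether ASPM L1 is enabled on the link; L1SubCap/L1SubCtl1
# show whether the L1.1/L1.2 substates are supported and enabled. If
# L1.2 is not enabled, the link can only ever reach L1.0.
if command -v lspci >/dev/null 2>&1; then
    out=$(lspci -vvv -s 3b:00.0 2>/dev/null | grep -E 'LnkCtl:|L1SubCap|L1SubCtl')
    [ -n "$out" ] || out="device 3b:00.0 not found (adjust the bus address)"
else
    out="lspci (pciutils) not installed"
fi
echo "$out"
```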

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-08-06 14:02                                           ` Mario.Limonciello
@ 2019-08-06 15:00                                             ` rafael
  2019-08-07 10:29                                               ` rjw
  0 siblings, 1 reply; 75+ messages in thread
From: rafael @ 2019-08-06 15:00 UTC (permalink / raw)


On Tue, Aug 6, 2019@4:02 PM <Mario.Limonciello@dell.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Rafael J. Wysocki <rafael at kernel.org>
> > Sent: Monday, August 5, 2019 4:29 PM
> > To: Kai-Heng Feng
> > Cc: Rafael J. Wysocki; Limonciello, Mario; Keith Busch; Keith Busch; Christoph
> > Hellwig; Sagi Grimberg; linux-nvme; Linux PM; Linux Kernel Mailing List; Rajat Jain
> > Subject: Re: [Regression] Commit "nvme/pci: Use host managed power state for
> > suspend" has problems
> >
> >
> > [EXTERNAL EMAIL]
> >
> > On Mon, Aug 5, 2019 at 9:14 PM Kai-Heng Feng
> > <kai.heng.feng@canonical.com> wrote:
> > >
> > >@19:04, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >
> > > > On Fri, Aug 2, 2019 at 12:55 PM Kai-Heng Feng
> > > > <kai.heng.feng@canonical.com> wrote:
> > > >>@06:26, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > >>
> > > >>> On Thu, Aug 1, 2019@9:05 PM <Mario.Limonciello@dell.com> wrote:
> > > >>>>> -----Original Message-----
> > > >>>>> From: Rafael J. Wysocki <rafael at kernel.org>
> > > >>>>> Sent: Thursday, August 1, 2019 12:30 PM
> > > >>>>> To: Kai-Heng Feng; Keith Busch; Limonciello, Mario
> > > >>>>> Cc: Keith Busch; Christoph Hellwig; Sagi Grimberg; linux-nvme; Linux
> > > >>>>> PM; Linux
> > > >>>>> Kernel Mailing List; Rajat Jain
> > > >>>>> Subject: Re: [Regression] Commit "nvme/pci: Use host managed power
> > > >>>>> state for
> > > >>>>> suspend" has problems
> > > >>>>>
> > > >>>>>
> > > >>>>> [EXTERNAL EMAIL]
> > > >>>>>
> > > >>>>> On Thu, Aug 1, 2019 at 11:06 AM Kai-Heng Feng
> > > >>>>> <kai.heng.feng@canonical.com> wrote:
> > > >>>>>>@06:33, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > >>>>>>
> > > >>>>>>> On Thu, Aug 1, 2019 at 12:22 AM Keith Busch <kbusch at kernel.org>
> > > >>>>>>> wrote:
> > > >>>>>>>> On Wed, Jul 31, 2019@11:25:51PM +0200, Rafael J. Wysocki wrote:
> > > >>>>>>>>> A couple of remarks if you will.
> > > >>>>>>>>>
> > > >>>>>>>>> First, we don't know which case is the majority at this point.  For
> > > >>>>>>>>> now, there is one example of each, but it may very well turn out
> > > >>>>>>>>> that
> > > >>>>>>>>> the SK Hynix BC501 above needs to be quirked.
> > > >>>>>>>>>
> > > >>>>>>>>> Second, the reference here really is 5.2, so if there are any
> > > >>>>>>>>> systems
> > > >>>>>>>>> that are not better off with 5.3-rc than they were with 5.2,
> > > >>>>>>>>> well, we
> > > >>>>>>>>> have not made progress.  However, if there are systems that are
> > > >>>>>>>>> worse
> > > >>>>>>>>> off with 5.3, that's bad.  In the face of the latest findings the
> > > >>>>>>>>> only
> > > >>>>>>>>> way to avoid that is to be backwards compatible with 5.2 and that's
> > > >>>>>>>>> where my patch is going.  That cannot be achieved by quirking all
> > > >>>>>>>>> cases that are reported as "bad", because there still may be
> > > >>>>>>>>> unreported ones.
> > > >>>>>>>>
> > > >>>>>>>> I have to agree. I think your proposal may allow PCI D3cold,
> > > >>>>>>>
> > > >>>>>>> Yes, it may.
> > > >>>>>>
> > > >>>>>> Somehow the 9380 with Toshiba NVMe never hits SLP_S0 with or
> > without
> > > >>>>>> Rafael's patch.
> > > >>>>>> But the "real" s2idle power consumption does improve with the patch.
> > > >>>>>
> > > >>>>> Do you mean this patch:
> > > >>>>>
> > > >>>>> https://lore.kernel.org/linux-pm/70D536BE-8DC7-4CA2-84A9-
> > > >>>>>
> > AFB067BA520E at canonical.com/T/#m456aa5c69973a3b68f2cdd4713a1ce83be5145
> > > >>>>> 8f
> > > >>>>>
> > > >>>>> or the $subject one without the above?
> > > >>>>>
> > > >>>>>> Can we use a DMI based quirk for this platform? It seems like a
> > > >>>>>> platform
> > > >>>>>> specific issue.
> > > >>>>>
> > > >>>>> We seem to see too many "platform-specific issues" here. :-)
> > > >>>>>
> > > >>>>> To me, the status quo (ie. what we have in 5.3-rc2) is not defensible.
> > > >>>>> Something needs to be done to improve the situation.
> > > >>>>
> > > >>>> Rafael, would it be possible to try popping out PC401 from the 9380 and
> > > >>>> into a 9360 to
> > > >>>> confirm there actually being a platform impact or not?
> > > >>>
> > > >>> Not really, sorry.
> > > >>>
> > > >>>> I was hoping to have something useful from Hynix by now before
> > > >>>> responding, but oh well.
> > > >>>>
> > > >>>> In terms of what is the majority, I do know that between folks at Dell,
> > > >>>> Google, Compal,
> > > >>>> Wistron, Canonical, Micron, Hynix, Toshiba, LiteOn, and Western Digital
> > > >>>> we tested a wide
> > > >>>> variety of SSDs with this patch series.  I would like to think that they
> > > >>>> are representative of
> > > >>>> what's being manufactured into machines now.
> > > >>>
> > > >>> Well, what about drives already in the field?  My concern is mostly
> > > >>> about those ones.
> > > >>>
> > > >>>> Notably the LiteOn CL1 was tested with the HMB flushing support and
> > > >>>> Hynix PC401 was tested with older firmware though.
> > > >>>>
> > > >>>>>>>> In which case we do need to reintroduce the HMB handling.
> > > >>>>>>>
> > > >>>>>>> Right.
> > > >>>>>>
> > > >>>>>> The patch alone doesn't break HMB Toshiba NVMe I tested. But I think
> > > >>>>>> it's
> > > >>>>>> still safer to do proper HMB handling.
> > > >>>>>
> > > >>>>> Well, so can anyone please propose something specific?  Like an
> > > >>>>> alternative patch?
> > > >>>>
> > > >>>> This was proposed a few days ago:
> > > >>>> http://lists.infradead.org/pipermail/linux-nvme/2019-July/026056.html
> > > >>>>
> > > >>>> However we're still not sure why it is needed, and it will take some
> > > >>>> time to get
> > > >>>> a proper failure analysis from LiteOn  regarding the CL1.
> > > >>>
> > > >>> Thanks for the update, but IMO we still need to do something before
> > > >>> final 5.3 while the investigation continues.
> > > >>>
> > > >>> Honestly, at this point I would vote for going back to the 5.2
> > > >>> behavior at least by default and only running the new code on the
> > > >>> drives known to require it (because they will block PC10 otherwise).
> > > >>>
> > > >>> Possibly (ideally) with an option for users who can't get beyond PC3
> > > >>> to test whether or not the new code helps them.
> > > >>
> > > >> I just found out that the XPS 9380 at my hand never reaches SLP_S0 but
> > > >> only
> > > >> PC10.
> > > >
> > > > That's the case for me too.
> > > >
> > > >> This happens with or without putting the device to D3.
> > > >
> > > > On my system, though, it only can get to PC3 without putting the NVMe
> > > > into D3 (as reported previously).
> > >
> > > I forgot to ask, what BIOS version does the system have?
> > > I don't see this issue on BIOS v1.5.0.
> >
> > It is 1.5.0 here too.
>
> All, regarding the need for the patch proposed by Rafael on PC401, I have some updates
> to share from Hynix.
> First off - the firmware changelog is misleading from 80006E00 to 80007E00.
>
> The change was made in the firmware specifically because of a change in behavior from
> Intel KBL to CFL and WHL.  On CFL/WHL the period of time that RefClk was turned on after L1.2
> > was larger than on KBL platforms.  So this meant that Hynix couldn't lock from CLKREQ#
> to RefClk as quickly on CFL/WHL.  So there is a "larger" fixed delay introduced in their FW.
>
> To those that don't know - XPS 9380 is a WHL platform.
>
> Second - a hypothesis of what is happening with the patch proposed by Rafael is that the link
> is only transitioning to L1.0 rather than L1.2.  This may satisfy the PMC but it shouldn't lead to
> the lowest actual device power state.

The north complex doesn't get to PC10 without this patch, so this is
more about the PCIe root complex than the PMC.

PC3 vs PC10 is a big deal regardless of what the NVMe can achieve.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-07-31 22:19                       ` kbusch
  2019-07-31 22:33                         ` rafael
@ 2019-08-07  9:48                         ` rjw
  2019-08-07 10:45                           ` hch
  2019-08-07  9:53                         ` [PATCH] nvme-pci: Do not prevent PCI bus-level PM from being used rjw
                                           ` (3 subsequent siblings)
  5 siblings, 1 reply; 75+ messages in thread
From: rjw @ 2019-08-07  9:48 UTC (permalink / raw)


On Thursday, August 1, 2019 12:19:56 AM CEST Keith Busch wrote:
> On Wed, Jul 31, 2019@11:25:51PM +0200, Rafael J. Wysocki wrote:
> > 
> > A couple of remarks if you will.
> > 
> > First, we don't know which case is the majority at this point.  For
> > now, there is one example of each, but it may very well turn out that
> > the SK Hynix BC501 above needs to be quirked.
> > 
> > Second, the reference here really is 5.2, so if there are any systems
> > that are not better off with 5.3-rc than they were with 5.2, well, we
> > have not made progress.  However, if there are systems that are worse
> > off with 5.3, that's bad.  In the face of the latest findings the only
> > way to avoid that is to be backwards compatible with 5.2 and that's
> > where my patch is going.  That cannot be achieved by quirking all
> > cases that are reported as "bad", because there still may be
> > unreported ones.
> 
> I have to agree. I think your proposal may allow PCI D3cold, in which
> case we do need to reintroduce the HMB handling.

So I think I know what the problem is here.

If ASPM is disabled for the NVMe device (which is the case on my machine by default),
skipping the bus-level PM in nvme_suspend() causes its PCIe link to stay up and
that prevents the SoC from getting into deeper package C-states.

If I change the ASPM policy to "powersave" (through the module parameter in there),
ASPM gets enabled for the NVMe drive and I can get into PC10 via S2Idle with plain 5.3-rc3.

However, that's a bit less than straightforward, so I'm going to post a patch to make
nvme_suspend() fall back to the "old ways" if ASPM is not enabled for the target device.

Cheers!

* [PATCH] nvme-pci: Do not prevent PCI bus-level PM from being used
  2019-07-31 22:19                       ` kbusch
  2019-07-31 22:33                         ` rafael
  2019-08-07  9:48                         ` rjw
@ 2019-08-07  9:53                         ` rjw
  2019-08-07 10:14                           ` rjw
                                             ` (2 more replies)
  2019-08-08  8:36                         ` [PATCH] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled rjw
                                           ` (2 subsequent siblings)
  5 siblings, 3 replies; 75+ messages in thread
From: rjw @ 2019-08-07  9:53 UTC (permalink / raw)


From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

One of the modifications made by commit d916b1be94b6 ("nvme-pci: use
host managed power state for suspend") was adding a pci_save_state()
call to nvme_suspend() in order to prevent the PCI bus-level PM from
being applied to the suspended NVMe devices, but if ASPM is not
enabled for the target NVMe device, that causes its PCIe link to stay
up and the platform may not be able to get into its optimum low-power
state because of that.

For example, if ASPM is disabled for the NVMe drive (PC401 NVMe SK
hynix 256GB) in my Dell XPS13 9380, leaving it in D0 during
suspend-to-idle prevents the SoC from reaching package idle states
deeper than PC3, which is way insufficient for system suspend.

To address this shortcoming, make nvme_suspend() check if ASPM is
enabled for the target device and fall back to full device shutdown
and PCI bus-level PM if that is not the case.

Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for suspend")
Link: https://lore.kernel.org/linux-pm/2763495.NmdaWeg79L at kreacher/T/#t
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
---
 drivers/nvme/host/pci.c |   14 ++++++++++----
 drivers/pci/pcie/aspm.c |   17 +++++++++++++++++
 include/linux/pci.h     |    2 ++
 3 files changed, 29 insertions(+), 4 deletions(-)

Index: linux-pm/drivers/nvme/host/pci.c
===================================================================
--- linux-pm.orig/drivers/nvme/host/pci.c
+++ linux-pm/drivers/nvme/host/pci.c
@@ -2846,7 +2846,7 @@ static int nvme_resume(struct device *de
 	struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));
 	struct nvme_ctrl *ctrl = &ndev->ctrl;
 
-	if (pm_resume_via_firmware() || !ctrl->npss ||
+	if (ndev->last_ps == U32_MAX ||
 	    nvme_set_power_state(ctrl, ndev->last_ps) != 0)
 		nvme_reset_ctrl(ctrl);
 	return 0;
@@ -2859,6 +2859,8 @@ static int nvme_suspend(struct device *d
 	struct nvme_ctrl *ctrl = &ndev->ctrl;
 	int ret = -EBUSY;
 
+	ndev->last_ps = U32_MAX;
+
 	/*
 	 * The platform does not remove power for a kernel managed suspend so
 	 * use host managed nvme power settings for lowest idle power if
@@ -2866,8 +2868,13 @@ static int nvme_suspend(struct device *d
 	 * shutdown.  But if the firmware is involved after the suspend or the
 	 * device does not support any non-default power states, shut down the
 	 * device fully.
+	 *
+	 * If ASPM is not enabled for the device, shut down the device and allow
+	 * the PCI bus layer to put it into D3 in order to take the PCIe link
+	 * down, so as to allow the platform to achieve its minimum low-power
+	 * state (which may not be possible if the link is up).
 	 */
-	if (pm_suspend_via_firmware() || !ctrl->npss) {
+	if (pm_suspend_via_firmware() || !ctrl->npss || !pcie_aspm_enabled(pdev)) {
 		nvme_dev_disable(ndev, true);
 		return 0;
 	}
@@ -2880,9 +2887,8 @@ static int nvme_suspend(struct device *d
 	    ctrl->state != NVME_CTRL_ADMIN_ONLY)
 		goto unfreeze;
 
-	ndev->last_ps = 0;
 	ret = nvme_get_power_state(ctrl, &ndev->last_ps);
-	if (ret < 0)
+	if (ret < 0 || ndev->last_ps == U32_MAX)
 		goto unfreeze;
 
 	ret = nvme_set_power_state(ctrl, ctrl->npss);
Index: linux-pm/drivers/pci/pcie/aspm.c
===================================================================
--- linux-pm.orig/drivers/pci/pcie/aspm.c
+++ linux-pm/drivers/pci/pcie/aspm.c
@@ -1170,6 +1170,23 @@ static int pcie_aspm_get_policy(char *bu
 module_param_call(policy, pcie_aspm_set_policy, pcie_aspm_get_policy,
 	NULL, 0644);
 
+/*
+ * pcie_aspm_enabled - Return the mask of enabled ASPM link states.
+ * @pci_device: Target device.
+ */
+u32 pcie_aspm_enabled(struct pci_dev *pci_device)
+{
+	struct pci_dev *bridge = pci_device->bus->self;
+	u32 aspm_enabled;
+
+	mutex_lock(&aspm_lock);
+	aspm_enabled = bridge->link_state ? bridge->link_state->aspm_enabled : 0;
+	mutex_unlock(&aspm_lock);
+
+	return aspm_enabled;
+}
+
+
 #ifdef CONFIG_PCIEASPM_DEBUG
 static ssize_t link_state_show(struct device *dev,
 		struct device_attribute *attr,
Index: linux-pm/include/linux/pci.h
===================================================================
--- linux-pm.orig/include/linux/pci.h
+++ linux-pm/include/linux/pci.h
@@ -1567,8 +1567,10 @@ extern bool pcie_ports_native;
 
 #ifdef CONFIG_PCIEASPM
 bool pcie_aspm_support_enabled(void);
+u32 pcie_aspm_enabled(struct pci_dev *pci_device);
 #else
 static inline bool pcie_aspm_support_enabled(void) { return false; }
+static inline u32 pcie_aspm_enabled(struct pci_dev *pci_device) { return 0; }
 #endif
 
 #ifdef CONFIG_PCIEAER

* [PATCH] nvme-pci: Do not prevent PCI bus-level PM from being used
  2019-08-07  9:53                         ` [PATCH] nvme-pci: Do not prevent PCI bus-level PM from being used rjw
@ 2019-08-07 10:14                           ` rjw
  2019-08-07 10:43                           ` hch
  2019-08-07 14:37                           ` kbusch
  2 siblings, 0 replies; 75+ messages in thread
From: rjw @ 2019-08-07 10:14 UTC (permalink / raw)


On Wednesday, August 7, 2019 11:53:44 AM CEST Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> 
> One of the modifications made by commit d916b1be94b6 ("nvme-pci: use
> host managed power state for suspend") was adding a pci_save_state()
> call to nvme_suspend() in order to prevent the PCI bus-level PM from
> being applied to the suspended NVMe devices, but if ASPM is not
> enabled for the target NVMe device, that causes its PCIe link to stay
> up and the platform may not be able to get into its optimum low-power
> state because of that.
> 
> For example, if ASPM is disabled for the NVMe drive (PC401 NVMe SK
> hynix 256GB) in my Dell XPS13 9380, leaving it in D0 during
> suspend-to-idle prevents the SoC from reaching package idle states
> deeper than PC3, which is way insufficient for system suspend.
> 
> To address this shortcoming, make nvme_suspend() check if ASPM is
> enabled for the target device and fall back to full device shutdown
> and PCI bus-level PM if that is not the case.
> 
> Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for suspend")
> Link: https://lore.kernel.org/linux-pm/2763495.NmdaWeg79L at kreacher/T/#t
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> ---

I should have used a better subject for this patch.

I'll resend it with a changed subject later, but for now I would like to collect
opinions about it (if any).

Cheers!

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-08-06 15:00                                             ` rafael
@ 2019-08-07 10:29                                               ` rjw
  0 siblings, 0 replies; 75+ messages in thread
From: rjw @ 2019-08-07 10:29 UTC (permalink / raw)


On Tuesday, August 6, 2019 5:00:06 PM CEST Rafael J. Wysocki wrote:
> On Tue, Aug 6, 2019@4:02 PM <Mario.Limonciello@dell.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Rafael J. Wysocki <rafael at kernel.org>
> > > Sent: Monday, August 5, 2019 4:29 PM
> > > To: Kai-Heng Feng
> > > Cc: Rafael J. Wysocki; Limonciello, Mario; Keith Busch; Keith Busch; Christoph
> > > Hellwig; Sagi Grimberg; linux-nvme; Linux PM; Linux Kernel Mailing List; Rajat Jain
> > > Subject: Re: [Regression] Commit "nvme/pci: Use host managed power state for
> > > suspend" has problems
> > >
> > >
> > > [EXTERNAL EMAIL]
> > >
> > > On Mon, Aug 5, 2019 at 9:14 PM Kai-Heng Feng
> > > <kai.heng.feng@canonical.com> wrote:
> > > >
> > > >@19:04, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > >
> > > > > On Fri, Aug 2, 2019 at 12:55 PM Kai-Heng Feng
> > > > > <kai.heng.feng@canonical.com> wrote:
> > > > >>@06:26, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > >>
> > > > >>> On Thu, Aug 1, 2019@9:05 PM <Mario.Limonciello@dell.com> wrote:
> > > > >>>>> -----Original Message-----
> > > > >>>>> From: Rafael J. Wysocki <rafael at kernel.org>
> > > > >>>>> Sent: Thursday, August 1, 2019 12:30 PM
> > > > >>>>> To: Kai-Heng Feng; Keith Busch; Limonciello, Mario
> > > > >>>>> Cc: Keith Busch; Christoph Hellwig; Sagi Grimberg; linux-nvme; Linux
> > > > >>>>> PM; Linux
> > > > >>>>> Kernel Mailing List; Rajat Jain
> > > > >>>>> Subject: Re: [Regression] Commit "nvme/pci: Use host managed power
> > > > >>>>> state for
> > > > >>>>> suspend" has problems
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> [EXTERNAL EMAIL]
> > > > >>>>>
> > > > >>>>> On Thu, Aug 1, 2019 at 11:06 AM Kai-Heng Feng
> > > > >>>>> <kai.heng.feng@canonical.com> wrote:
> > > > >>>>>>@06:33, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > >>>>>>
> > > > >>>>>>> On Thu, Aug 1, 2019 at 12:22 AM Keith Busch <kbusch at kernel.org>
> > > > >>>>>>> wrote:
> > > > >>>>>>>> On Wed, Jul 31, 2019@11:25:51PM +0200, Rafael J. Wysocki wrote:
> > > > >>>>>>>>> A couple of remarks if you will.
> > > > >>>>>>>>>
> > > > >>>>>>>>> First, we don't know which case is the majority at this point.  For
> > > > >>>>>>>>> now, there is one example of each, but it may very well turn out
> > > > >>>>>>>>> that
> > > > >>>>>>>>> the SK Hynix BC501 above needs to be quirked.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Second, the reference here really is 5.2, so if there are any
> > > > >>>>>>>>> systems
> > > > >>>>>>>>> that are not better off with 5.3-rc than they were with 5.2,
> > > > >>>>>>>>> well, we
> > > > >>>>>>>>> have not made progress.  However, if there are systems that are
> > > > >>>>>>>>> worse
> > > > >>>>>>>>> off with 5.3, that's bad.  In the face of the latest findings the
> > > > >>>>>>>>> only
> > > > >>>>>>>>> way to avoid that is to be backwards compatible with 5.2 and that's
> > > > >>>>>>>>> where my patch is going.  That cannot be achieved by quirking all
> > > > >>>>>>>>> cases that are reported as "bad", because there still may be
> > > > >>>>>>>>> unreported ones.
> > > > >>>>>>>>
> > > > >>>>>>>> I have to agree. I think your proposal may allow PCI D3cold,
> > > > >>>>>>>
> > > > >>>>>>> Yes, it may.
> > > > >>>>>>
> > > > >>>>>> Somehow the 9380 with Toshiba NVMe never hits SLP_S0 with or
> > > without
> > > > >>>>>> Rafael's patch.
> > > > >>>>>> But the "real" s2idle power consumption does improve with the patch.
> > > > >>>>>
> > > > >>>>> Do you mean this patch:
> > > > >>>>>
> > > > >>>>> https://lore.kernel.org/linux-pm/70D536BE-8DC7-4CA2-84A9-
> > > > >>>>>
> > > AFB067BA520E at canonical.com/T/#m456aa5c69973a3b68f2cdd4713a1ce83be5145
> > > > >>>>> 8f
> > > > >>>>>
> > > > >>>>> or the $subject one without the above?
> > > > >>>>>
> > > > >>>>>> Can we use a DMI based quirk for this platform? It seems like a
> > > > >>>>>> platform
> > > > >>>>>> specific issue.
> > > > >>>>>
> > > > >>>>> We seem to see too many "platform-specific issues" here. :-)
> > > > >>>>>
> > > > >>>>> To me, the status quo (ie. what we have in 5.3-rc2) is not defensible.
> > > > >>>>> Something needs to be done to improve the situation.
> > > > >>>>
> > > > >>>> Rafael, would it be possible to try popping out PC401 from the 9380 and
> > > > >>>> into a 9360 to
> > > > >>>> confirm there actually being a platform impact or not?
> > > > >>>
> > > > >>> Not really, sorry.
> > > > >>>
> > > > >>>> I was hoping to have something useful from Hynix by now before
> > > > >>>> responding, but oh well.
> > > > >>>>
> > > > >>>> In terms of what is the majority, I do know that between folks at Dell,
> > > > >>>> Google, Compal,
> > > > >>>> Wistron, Canonical, Micron, Hynix, Toshiba, LiteOn, and Western Digital
> > > > >>>> we tested a wide
> > > > >>>> variety of SSDs with this patch series.  I would like to think that they
> > > > >>>> are representative of
> > > > >>>> what's being manufactured into machines now.
> > > > >>>
> > > > >>> Well, what about drives already in the field?  My concern is mostly
> > > > >>> about those ones.
> > > > >>>
> > > > >>>> Notably the LiteOn CL1 was tested with the HMB flushing support,
> > > > >>>> and Hynix PC401 was tested with older firmware though.
> > > > >>>>
> > > > >>>>>>>> In which case we do need to reintroduce the HMB handling.
> > > > >>>>>>>
> > > > >>>>>>> Right.
> > > > >>>>>>
> > > > >>>>>> The patch alone doesn't break HMB Toshiba NVMe I tested. But I think
> > > > >>>>>> it's
> > > > >>>>>> still safer to do proper HMB handling.
> > > > >>>>>
> > > > >>>>> Well, so can anyone please propose something specific?  Like an
> > > > >>>>> alternative patch?
> > > > >>>>
> > > > >>>> This was proposed a few days ago:
> > > > >>>> http://lists.infradead.org/pipermail/linux-nvme/2019-July/026056.html
> > > > >>>>
> > > > >>>> However, we're still not sure why it is needed, and it will take some
> > > > >>>> time to get
> > > > >>>> a proper failure analysis from LiteOn regarding the CL1.
> > > > >>>
> > > > >>> Thanks for the update, but IMO we still need to do something before
> > > > >>> final 5.3 while the investigation continues.
> > > > >>>
> > > > >>> Honestly, at this point I would vote for going back to the 5.2
> > > > >>> behavior at least by default and only running the new code on the
> > > > >>> drives known to require it (because they will block PC10 otherwise).
> > > > >>>
> > > > >>> Possibly (ideally) with an option for users who can't get beyond PC3
> > > > >>> to test whether or not the new code helps them.
> > > > >>
> > > > >> I just found out that the XPS 9380 at hand never reaches SLP_S0 but
> > > > >> only
> > > > >> PC10.
> > > > >
> > > > > That's the case for me too.
> > > > >
> > > > >> This happens with or without putting the device to D3.
> > > > >
> > > > > On my system, though, it can only get to PC3 without putting the NVMe
> > > > > into D3 (as reported previously).
> > > >
> > > > I forgot to ask, what BIOS version does the system have?
> > > > I don't see this issue on BIOS v1.5.0.
> > >
> > > It is 1.5.0 here too.
> >
> > All, regarding the need for the patch proposed by Rafael on PC401, I have some updates
> > to share from Hynix.
> > First off - the firmware changelog is misleading from 80006E00 to 80007E00.
> >
> > The change was made in the firmware specifically because of a change in behavior from
> > Intel KBL to CFL and WHL.  On CFL/WHL the period of time that RefClk was turned on after L1.2
> > was larger than on KBL platforms.  So this meant that Hynix couldn't lock from CLKREQ#
> > to RefClk as quickly on CFL/WHL.  So there is a "larger" fixed delay introduced in their FW.
> >
> > To those that don't know - XPS 9380 is a WHL platform.
> >
> > Second - a hypothesis of what is happening with the patch proposed by Rafael is that the link
> > is only transitioning to L1.0 rather than L1.2.  This may satisfy the PMC but it shouldn't lead to
> > the lowest actual device power state.
> 
> The north complex doesn't get to PC10 without this patch, so this is
> more about the PCIe root complex than the PMC.
> 
> PC3 vs PC10 is a big deal regardless of what the NVMe can achieve.

This has been resolved as I said here:

https://lore.kernel.org/linux-pm/2763495.NmdaWeg79L at kreacher/T/#m2a341a34faeab84ab92d106129d2b946d193e60b

* [PATCH] nvme-pci: Do not prevent PCI bus-level PM from being used
  2019-08-07  9:53                         ` [PATCH] nvme-pci: Do not prevent PCI bus-level PM from being used rjw
  2019-08-07 10:14                           ` rjw
@ 2019-08-07 10:43                           ` hch
  2019-08-07 14:37                           ` kbusch
  2 siblings, 0 replies; 75+ messages in thread
From: hch @ 2019-08-07 10:43 UTC (permalink / raw)


> +	if (pm_suspend_via_firmware() || !ctrl->npss || !pcie_aspm_enabled(pdev)) {



> +	mutex_lock(&aspm_lock);
> +	aspm_enabled = bridge->link_state ? bridge->link_state->aspm_enabled : 0;

Please fix the overly long lines.

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-08-07  9:48                         ` rjw
@ 2019-08-07 10:45                           ` hch
  2019-08-07 10:54                             ` rafael
  0 siblings, 1 reply; 75+ messages in thread
From: hch @ 2019-08-07 10:45 UTC (permalink / raw)


On Wed, Aug 07, 2019@11:48:33AM +0200, Rafael J. Wysocki wrote:
> So I think I know what the problem is here.
> 
> If ASPM is disabled for the NVMe device (which is the case on my machine by default),
> skipping the bus-level PM in nvme_suspend() causes its PCIe link to stay up and
> that prevents the SoC from getting into deeper package C-states.
> 
> If I change the ASPM policy to "powersave" (through the module parameter in there),
> ASPM gets enabled for the NVMe drive and I can get into PC10 via S2Idle with plain 5.3-rc3.
> 
> However, that's a bit less than straightforward, so I'm going to post a patch to make
> nvme_suspend() fall back to the "old ways" if ASPM is not enabled for the target device.

Sounds sensible.

FYI your mail is not properly formatted and has way too long lines.

* [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems
  2019-08-07 10:45                           ` hch
@ 2019-08-07 10:54                             ` rafael
  0 siblings, 0 replies; 75+ messages in thread
From: rafael @ 2019-08-07 10:54 UTC (permalink / raw)


On Wed, Aug 7, 2019@12:45 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, Aug 07, 2019@11:48:33AM +0200, Rafael J. Wysocki wrote:
> > So I think I know what the problem is here.
> >
> > If ASPM is disabled for the NVMe device (which is the case on my machine by default),
> > skipping the bus-level PM in nvme_suspend() causes its PCIe link to stay up and
> > that prevents the SoC from getting into deeper package C-states.
> >
> > If I change the ASPM policy to "powersave" (through the module parameter in there),
> > ASPM gets enabled for the NVMe drive and I can get into PC10 via S2Idle with plain 5.3-rc3.
> >
> > However, that's a bit less than straightforward, so I'm going to post a patch to make
> > nvme_suspend() fall back to the "old ways" if ASPM is not enabled for the target device.
>
> Sounds sensibel.
>
> FYI your mail is not properly formatted and has way too long lines.

Sorry about that.

* [PATCH] nvme-pci: Do not prevent PCI bus-level PM from being used
  2019-08-07  9:53                         ` [PATCH] nvme-pci: Do not prevent PCI bus-level PM from being used rjw
  2019-08-07 10:14                           ` rjw
  2019-08-07 10:43                           ` hch
@ 2019-08-07 14:37                           ` kbusch
  2 siblings, 0 replies; 75+ messages in thread
From: kbusch @ 2019-08-07 14:37 UTC (permalink / raw)


On Wed, Aug 07, 2019@02:53:44AM -0700, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> 
> One of the modifications made by commit d916b1be94b6 ("nvme-pci: use
> host managed power state for suspend") was adding a pci_save_state()
> call to nvme_suspend() in order to prevent the PCI bus-level PM from
> being applied to the suspended NVMe devices, but if ASPM is not
> enabled for the target NVMe device, that causes its PCIe link to stay
> up and the platform may not be able to get into its optimum low-power
> state because of that.
> 
> For example, if ASPM is disabled for the NVMe drive (PC401 NVMe SK
> hynix 256GB) in my Dell XPS13 9380, leaving it in D0 during
> suspend-to-idle prevents the SoC from reaching package idle states
> deeper than PC3, which is way insufficient for system suspend.
> 
> To address this shortcoming, make nvme_suspend() check if ASPM is
> enabled for the target device and fall back to full device shutdown
> and PCI bus-level PM if that is not the case.
> 
> Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for suspend")
> Link: https://lore.kernel.org/linux-pm/2763495.NmdaWeg79L at kreacher/T/#t
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>

Thanks for tracking down the cause. Sounds like your earlier assumption
on ASPM's involvement was spot on.

> +/*
> + * pcie_aspm_enabled - Return the mask of enabled ASPM link states.
> + * @pci_device: Target device.
> + */
> +u32 pcie_aspm_enabled(struct pci_dev *pci_device)
> +{
> +	struct pci_dev *bridge = pci_device->bus->self;

You may want use pci_upstream_bridge() instead, just in case someone
calls this on a virtual function's pci_dev.

> +	u32 aspm_enabled;
> +
> +	mutex_lock(&aspm_lock);
> +	aspm_enabled = bridge->link_state ? bridge->link_state->aspm_enabled : 0;
> +	mutex_unlock(&aspm_lock);
> +
> +	return aspm_enabled;
> +}

* [PATCH] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  2019-07-31 22:19                       ` kbusch
                                           ` (2 preceding siblings ...)
  2019-08-07  9:53                         ` [PATCH] nvme-pci: Do not prevent PCI bus-level PM from being used rjw
@ 2019-08-08  8:36                         ` rjw
  2019-08-08  8:48                           ` hch
  2019-08-08 10:03                         ` [PATCH v2 0/2] " rjw
  2019-08-08 21:51                         ` [PATCH v3 0/2] " rjw
  5 siblings, 1 reply; 75+ messages in thread
From: rjw @ 2019-08-08  8:36 UTC (permalink / raw)


From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

One of the modifications made by commit d916b1be94b6 ("nvme-pci: use
host managed power state for suspend") was adding a pci_save_state()
call to nvme_suspend() in order to prevent the PCI bus-level PM from
being applied to the suspended NVMe devices, but if ASPM is not
enabled for the target NVMe device, that causes its PCIe link to stay
up and the platform may not be able to get into its optimum low-power
state because of that.

For example, if ASPM is disabled for the NVMe drive (PC401 NVMe SK
hynix 256GB) in my Dell XPS13 9380, leaving it in D0 during
suspend-to-idle prevents the SoC from reaching package idle states
deeper than PC3, which is way insufficient for system suspend.

To address this shortcoming, make nvme_suspend() check if ASPM is
enabled for the target device and fall back to full device shutdown
and PCI bus-level PM if that is not the case.

Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for suspend")
Link: https://lore.kernel.org/linux-pm/2763495.NmdaWeg79L at kreacher/T/#t
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
---

This is an update of the following patch:

https://patchwork.kernel.org/patch/11081791/

going with the subject matching the changes in the patch.

This also addresses style-related comments from Christoph and follows
Keith's advice to use pci_upstream_bridge() to get to the upstream bridge
of the device.

Thanks!

---
 drivers/nvme/host/pci.c |   15 +++++++++++----
 drivers/pci/pcie/aspm.c |   20 ++++++++++++++++++++
 include/linux/pci.h     |    2 ++
 3 files changed, 33 insertions(+), 4 deletions(-)

Index: linux-pm/drivers/nvme/host/pci.c
===================================================================
--- linux-pm.orig/drivers/nvme/host/pci.c
+++ linux-pm/drivers/nvme/host/pci.c
@@ -2846,7 +2846,7 @@ static int nvme_resume(struct device *de
 	struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));
 	struct nvme_ctrl *ctrl = &ndev->ctrl;
 
-	if (pm_resume_via_firmware() || !ctrl->npss ||
+	if (ndev->last_ps == U32_MAX ||
 	    nvme_set_power_state(ctrl, ndev->last_ps) != 0)
 		nvme_reset_ctrl(ctrl);
 	return 0;
@@ -2859,6 +2859,8 @@ static int nvme_suspend(struct device *d
 	struct nvme_ctrl *ctrl = &ndev->ctrl;
 	int ret = -EBUSY;
 
+	ndev->last_ps = U32_MAX;
+
 	/*
 	 * The platform does not remove power for a kernel managed suspend so
 	 * use host managed nvme power settings for lowest idle power if
@@ -2866,8 +2868,14 @@ static int nvme_suspend(struct device *d
 	 * shutdown.  But if the firmware is involved after the suspend or the
 	 * device does not support any non-default power states, shut down the
 	 * device fully.
+	 *
+	 * If ASPM is not enabled for the device, shut down the device and allow
+	 * the PCI bus layer to put it into D3 in order to take the PCIe link
+	 * down, so as to allow the platform to achieve its minimum low-power
+	 * state (which may not be possible if the link is up).
 	 */
-	if (pm_suspend_via_firmware() || !ctrl->npss) {
+	if (pm_suspend_via_firmware() || !ctrl->npss ||
+	    !pcie_aspm_enabled(pdev)) {
 		nvme_dev_disable(ndev, true);
 		return 0;
 	}
@@ -2880,9 +2888,8 @@ static int nvme_suspend(struct device *d
 	    ctrl->state != NVME_CTRL_ADMIN_ONLY)
 		goto unfreeze;
 
-	ndev->last_ps = 0;
 	ret = nvme_get_power_state(ctrl, &ndev->last_ps);
-	if (ret < 0)
+	if (ret < 0 || ndev->last_ps == U32_MAX)
 		goto unfreeze;
 
 	ret = nvme_set_power_state(ctrl, ctrl->npss);
Index: linux-pm/drivers/pci/pcie/aspm.c
===================================================================
--- linux-pm.orig/drivers/pci/pcie/aspm.c
+++ linux-pm/drivers/pci/pcie/aspm.c
@@ -1170,6 +1170,26 @@ static int pcie_aspm_get_policy(char *bu
 module_param_call(policy, pcie_aspm_set_policy, pcie_aspm_get_policy,
 	NULL, 0644);
 
+/*
+ * pcie_aspm_enabled - Return the mask of enabled ASPM link states.
+ * @pci_device: Target device.
+ */
+u32 pcie_aspm_enabled(struct pci_dev *pci_device)
+{
+	struct pci_dev *bridge = pci_upstream_bridge(pci_device);
+	u32 ret;
+
+	if (!bridge)
+		return 0;
+
+	mutex_lock(&aspm_lock);
+	ret = bridge->link_state ? bridge->link_state->aspm_enabled : 0;
+	mutex_unlock(&aspm_lock);
+
+	return ret;
+}
+
+
 #ifdef CONFIG_PCIEASPM_DEBUG
 static ssize_t link_state_show(struct device *dev,
 		struct device_attribute *attr,
Index: linux-pm/include/linux/pci.h
===================================================================
--- linux-pm.orig/include/linux/pci.h
+++ linux-pm/include/linux/pci.h
@@ -1567,8 +1567,10 @@ extern bool pcie_ports_native;
 
 #ifdef CONFIG_PCIEASPM
 bool pcie_aspm_support_enabled(void);
+u32 pcie_aspm_enabled(struct pci_dev *pci_device);
 #else
 static inline bool pcie_aspm_support_enabled(void) { return false; }
+static inline u32 pcie_aspm_enabled(struct pci_dev *pci_device) { return 0; }
 #endif
 
 #ifdef CONFIG_PCIEAER

* [PATCH] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  2019-08-08  8:36                         ` [PATCH] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled rjw
@ 2019-08-08  8:48                           ` hch
  2019-08-08  9:06                             ` rafael
  0 siblings, 1 reply; 75+ messages in thread
From: hch @ 2019-08-08  8:48 UTC (permalink / raw)


> -     ndev->last_ps = 0;
>       ret = nvme_get_power_state(ctrl, &ndev->last_ps);
> -     if (ret < 0)
> +     if (ret < 0 || ndev->last_ps == U32_MAX)

Is the intent of the magic U32_MAX check to see if the
nvme_get_power_state failed at the nvme level?  In that case just
checking for any non-zero return value from nvme_get_power_state might
be the easier and more clear way to do it.

> Index: linux-pm/drivers/pci/pcie/aspm.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/pcie/aspm.c
> +++ linux-pm/drivers/pci/pcie/aspm.c

Shouldn't we split PCI vs nvme in two patches?

> @@ -1170,6 +1170,26 @@ static int pcie_aspm_get_policy(char *bu
>  module_param_call(policy, pcie_aspm_set_policy, pcie_aspm_get_policy,
>  	NULL, 0644);
>  
> +/*
> + * pcie_aspm_enabled - Return the mask of enabled ASPM link states.
> + * @pci_device: Target device.
> + */
> +u32 pcie_aspm_enabled(struct pci_dev *pci_device)

pcie_aspm_enabled sounds like it returns a boolean.  Shouldn't there be
a mask or so in the name better documenting what it returns?

> +{
> +	struct pci_dev *bridge = pci_upstream_bridge(pci_device);
> +	u32 ret;
> +
> +	if (!bridge)
> +		return 0;
> +
> +	mutex_lock(&aspm_lock);
> +	ret = bridge->link_state ? bridge->link_state->aspm_enabled : 0;
> +	mutex_unlock(&aspm_lock);
> +
> +	return ret;
> +}

I think this will need a EXPORT_SYMBOL_GPL thrown in so that modular
nvme continues working.

> +
> +
>  #ifdef CONFIG_PCIEASPM_DEBUG

Nit: double blank line here.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  2019-08-08  8:48                           ` hch
@ 2019-08-08  9:06                             ` rafael
  0 siblings, 0 replies; 75+ messages in thread
From: rafael @ 2019-08-08  9:06 UTC (permalink / raw)


On Thu, Aug 8, 2019@10:48 AM Christoph Hellwig <hch@lst.de> wrote:
>
> > -     ndev->last_ps = 0;
> >       ret = nvme_get_power_state(ctrl, &ndev->last_ps);
> > -     if (ret < 0)
> > +     if (ret < 0 || ndev->last_ps == U32_MAX)
>
> Is the intent of the magic U32_MAX check to see whether
> nvme_get_power_state failed at the NVMe level?  In that case, just
> checking for any non-zero return value from nvme_get_power_state might
> be the easier and clearer way to do it.

Now that I think about it, that check appears redundant.  I'll drop it.

>
> > Index: linux-pm/drivers/pci/pcie/aspm.c
> > ===================================================================
> > --- linux-pm.orig/drivers/pci/pcie/aspm.c
> > +++ linux-pm/drivers/pci/pcie/aspm.c
>
> Shouldn't we split PCI vs nvme in two patches?

That can be done.

> > @@ -1170,6 +1170,26 @@ static int pcie_aspm_get_policy(char *bu
> >  module_param_call(policy, pcie_aspm_set_policy, pcie_aspm_get_policy,
> >       NULL, 0644);
> >
> > +/*
> > + * pcie_aspm_enabled - Return the mask of enabled ASPM link states.
> > + * @pci_device: Target device.
> > + */
> > +u32 pcie_aspm_enabled(struct pci_dev *pci_device)
>
> pcie_aspm_enabled sounds like it returns a boolean.  Shouldn't there be
> a mask or so in the name better documenting what it returns?

OK

> > +{
> > +     struct pci_dev *bridge = pci_upstream_bridge(pci_device);
> > +     u32 ret;
> > +
> > +     if (!bridge)
> > +             return 0;
> > +
> > +     mutex_lock(&aspm_lock);
> > +     ret = bridge->link_state ? bridge->link_state->aspm_enabled : 0;
> > +     mutex_unlock(&aspm_lock);
> > +
> > +     return ret;
> > +}
>
> I think this will need a EXPORT_SYMBOL_GPL thrown in so that modular
> nvme continues working.

Right, sorry.

> > +
> > +
> >  #ifdef CONFIG_PCIEASPM_DEBUG
>
> Nit: double blank line here.

Overlooked, will fix.

Thanks!

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v2 0/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  2019-07-31 22:19                       ` kbusch
                                           ` (3 preceding siblings ...)
  2019-08-08  8:36                         ` [PATCH] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled rjw
@ 2019-08-08 10:03                         ` " rjw
  2019-08-08 10:06                           ` [PATCH v2 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled_mask() rjw
  2019-08-08 10:10                           ` [PATCH v2 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled rjw
  2019-08-08 21:51                         ` [PATCH v3 0/2] " rjw
  5 siblings, 2 replies; 75+ messages in thread
From: rjw @ 2019-08-08 10:03 UTC (permalink / raw)


Hi All,

This series is equivalent to the following patch:

https://patchwork.kernel.org/patch/11083551/

posted earlier today.

It addresses review comments from Christoph by splitting the PCI/PCIe ASPM part
off to a separate patch (patch [1/2]) and fixing a few defects.

Thanks!

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v2 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled_mask()
  2019-08-08 10:03                         ` [PATCH v2 0/2] " rjw
@ 2019-08-08 10:06                           ` rjw
  2019-08-08 13:15                             ` helgaas
  2019-08-08 10:10                           ` [PATCH v2 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled rjw
  1 sibling, 1 reply; 75+ messages in thread
From: rjw @ 2019-08-08 10:06 UTC (permalink / raw)


From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Add a function returning the mask of currently enabled ASPM link
states for a given device.

It will be used by the NVMe driver to decide how to handle the
device during system suspend.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
---

-> v2:
  * Move the PCI/PCIe ASPM changes to a separate patch.
  * Add the _mask suffix to the new function name.
  * Add EXPORT_SYMBOL_GPL() to the new function.
  * Avoid adding an unnecessary blank line.

---
 drivers/pci/pcie/aspm.c |   20 ++++++++++++++++++++
 include/linux/pci.h     |    3 +++
 2 files changed, 23 insertions(+)

Index: linux-pm/drivers/pci/pcie/aspm.c
===================================================================
--- linux-pm.orig/drivers/pci/pcie/aspm.c
+++ linux-pm/drivers/pci/pcie/aspm.c
@@ -1170,6 +1170,26 @@ static int pcie_aspm_get_policy(char *bu
 module_param_call(policy, pcie_aspm_set_policy, pcie_aspm_get_policy,
 	NULL, 0644);
 
+/*
+ * pcie_aspm_enabled_mask - Return the mask of enabled ASPM link states.
+ * @pci_device: Target device.
+ */
+u32 pcie_aspm_enabled_mask(struct pci_dev *pci_device)
+{
+	struct pci_dev *bridge = pci_upstream_bridge(pci_device);
+	u32 ret;
+
+	if (!bridge)
+		return 0;
+
+	mutex_lock(&aspm_lock);
+	ret = bridge->link_state ? bridge->link_state->aspm_enabled : 0;
+	mutex_unlock(&aspm_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pcie_aspm_enabled_mask);
+
 #ifdef CONFIG_PCIEASPM_DEBUG
 static ssize_t link_state_show(struct device *dev,
 		struct device_attribute *attr,
Index: linux-pm/include/linux/pci.h
===================================================================
--- linux-pm.orig/include/linux/pci.h
+++ linux-pm/include/linux/pci.h
@@ -1567,8 +1567,11 @@ extern bool pcie_ports_native;
 
 #ifdef CONFIG_PCIEASPM
 bool pcie_aspm_support_enabled(void);
+u32 pcie_aspm_enabled_mask(struct pci_dev *pci_device);
 #else
 static inline bool pcie_aspm_support_enabled(void) { return false; }
+static inline u32 pcie_aspm_enabled_mask(struct pci_dev *pci_device)
+{ return 0; }
 #endif
 
 #ifdef CONFIG_PCIEAER

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v2 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  2019-08-08 10:03                         ` [PATCH v2 0/2] " rjw
  2019-08-08 10:06                           ` [PATCH v2 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled_mask() rjw
@ 2019-08-08 10:10                           ` rjw
  2019-08-08 13:43                             ` helgaas
  1 sibling, 1 reply; 75+ messages in thread
From: rjw @ 2019-08-08 10:10 UTC (permalink / raw)


From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

One of the modifications made by commit d916b1be94b6 ("nvme-pci: use
host managed power state for suspend") was adding a pci_save_state()
call to nvme_suspend() in order to prevent the PCI bus-level PM from
being applied to the suspended NVMe devices, but if ASPM is not
enabled for the target NVMe device, that causes its PCIe link to stay
up and the platform may not be able to get into its optimum low-power
state because of that.

For example, if ASPM is disabled for the NVMe drive (PC401 NVMe SK
hynix 256GB) in my Dell XPS13 9380, leaving it in D0 during
suspend-to-idle prevents the SoC from reaching package idle states
deeper than PC3, which is way insufficient for system suspend.

To address this shortcoming, make nvme_suspend() check if ASPM is
enabled for the target device and fall back to full device shutdown
and PCI bus-level PM if that is not the case.

Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for suspend")
Link: https://lore.kernel.org/linux-pm/2763495.NmdaWeg79L at kreacher/T/#t
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
---

-> v2:
  * Move the PCI/PCIe ASPM changes to a separate patch.
  * Do not add a redundant ndev->last_ps == U32_MAX check in nvme_suspend().

---
 drivers/nvme/host/pci.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

Index: linux-pm/drivers/nvme/host/pci.c
===================================================================
--- linux-pm.orig/drivers/nvme/host/pci.c
+++ linux-pm/drivers/nvme/host/pci.c
@@ -2846,7 +2846,7 @@ static int nvme_resume(struct device *de
 	struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));
 	struct nvme_ctrl *ctrl = &ndev->ctrl;
 
-	if (pm_resume_via_firmware() || !ctrl->npss ||
+	if (ndev->last_ps == U32_MAX ||
 	    nvme_set_power_state(ctrl, ndev->last_ps) != 0)
 		nvme_reset_ctrl(ctrl);
 	return 0;
@@ -2859,6 +2859,8 @@ static int nvme_suspend(struct device *d
 	struct nvme_ctrl *ctrl = &ndev->ctrl;
 	int ret = -EBUSY;
 
+	ndev->last_ps = U32_MAX;
+
 	/*
 	 * The platform does not remove power for a kernel managed suspend so
 	 * use host managed nvme power settings for lowest idle power if
@@ -2866,8 +2868,14 @@ static int nvme_suspend(struct device *d
 	 * shutdown.  But if the firmware is involved after the suspend or the
 	 * device does not support any non-default power states, shut down the
 	 * device fully.
+	 *
+	 * If ASPM is not enabled for the device, shut down the device and allow
+	 * the PCI bus layer to put it into D3 in order to take the PCIe link
+	 * down, so as to allow the platform to achieve its minimum low-power
+	 * state (which may not be possible if the link is up).
 	 */
-	if (pm_suspend_via_firmware() || !ctrl->npss) {
+	if (pm_suspend_via_firmware() || !ctrl->npss ||
+	    !pcie_aspm_enabled_mask(pdev)) {
 		nvme_dev_disable(ndev, true);
 		return 0;
 	}
@@ -2880,7 +2888,6 @@ static int nvme_suspend(struct device *d
 	    ctrl->state != NVME_CTRL_ADMIN_ONLY)
 		goto unfreeze;
 
-	ndev->last_ps = 0;
 	ret = nvme_get_power_state(ctrl, &ndev->last_ps);
 	if (ret < 0)
 		goto unfreeze;

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v2 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled_mask()
  2019-08-08 10:06                           ` [PATCH v2 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled_mask() rjw
@ 2019-08-08 13:15                             ` helgaas
  2019-08-08 14:48                               ` rafael
  0 siblings, 1 reply; 75+ messages in thread
From: helgaas @ 2019-08-08 13:15 UTC (permalink / raw)


On Thu, Aug 08, 2019@12:06:52PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> 
> Add a function returning the mask of currently enabled ASPM link
> states for a given device.
> 
> It will be used by the NVMe driver to decide how to handle the
> device during system suspend.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> ---
> 
> -> v2:
>   * Move the PCI/PCIe ASPM changes to a separate patch.
>   * Add the _mask suffix to the new function name.
>   * Add EXPORT_SYMBOL_GPL() to the new function.
>   * Avoid adding an unnecessary blank line.
> 
> ---
>  drivers/pci/pcie/aspm.c |   20 ++++++++++++++++++++
>  include/linux/pci.h     |    3 +++
>  2 files changed, 23 insertions(+)
> 
> Index: linux-pm/drivers/pci/pcie/aspm.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/pcie/aspm.c
> +++ linux-pm/drivers/pci/pcie/aspm.c
> @@ -1170,6 +1170,26 @@ static int pcie_aspm_get_policy(char *bu
>  module_param_call(policy, pcie_aspm_set_policy, pcie_aspm_get_policy,
>  	NULL, 0644);
>  
> +/*
> + * pcie_aspm_enabled_mask - Return the mask of enabled ASPM link states.
> + * @pci_device: Target device.
> + */
> +u32 pcie_aspm_enabled_mask(struct pci_dev *pci_device)
> +{
> +	struct pci_dev *bridge = pci_upstream_bridge(pci_device);
> +	u32 ret;
> +
> +	if (!bridge)
> +		return 0;
> +
> +	mutex_lock(&aspm_lock);
> +	ret = bridge->link_state ? bridge->link_state->aspm_enabled : 0;

This returns the "aspm_enabled" mask, but the values of that mask are
combinations of:

  ASPM_STATE_L0S_UP
  ASPM_STATE_L0S_DW
  ASPM_STATE_L1
  ...

which are defined internally in drivers/pci/pcie/aspm.c and not
visible to the caller of pcie_aspm_enabled_mask().  If there's no need
for the actual mask (the current caller doesn't seem to use it), maybe
this could be a boolean?

> +	mutex_unlock(&aspm_lock);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(pcie_aspm_enabled_mask);
> +
>  #ifdef CONFIG_PCIEASPM_DEBUG
>  static ssize_t link_state_show(struct device *dev,
>  		struct device_attribute *attr,
> Index: linux-pm/include/linux/pci.h
> ===================================================================
> --- linux-pm.orig/include/linux/pci.h
> +++ linux-pm/include/linux/pci.h
> @@ -1567,8 +1567,11 @@ extern bool pcie_ports_native;
>  
>  #ifdef CONFIG_PCIEASPM
>  bool pcie_aspm_support_enabled(void);
> +u32 pcie_aspm_enabled_mask(struct pci_dev *pci_device);
>  #else
>  static inline bool pcie_aspm_support_enabled(void) { return false; }
> +static inline u32 pcie_aspm_enabled_mask(struct pci_dev *pci_device)
> +{ return 0; }
>  #endif
>  
>  #ifdef CONFIG_PCIEAER
> 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v2 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  2019-08-08 10:10                           ` [PATCH v2 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled rjw
@ 2019-08-08 13:43                             ` helgaas
  2019-08-08 14:47                               ` rafael
  0 siblings, 1 reply; 75+ messages in thread
From: helgaas @ 2019-08-08 13:43 UTC (permalink / raw)


On Thu, Aug 08, 2019@12:10:06PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> 
> One of the modifications made by commit d916b1be94b6 ("nvme-pci: use
> host managed power state for suspend") was adding a pci_save_state()
> call to nvme_suspend() in order to prevent the PCI bus-level PM from
> being applied to the suspended NVMe devices, but if ASPM is not
> enabled for the target NVMe device, that causes its PCIe link to stay
> up and the platform may not be able to get into its optimum low-power
> state because of that.
> 
> For example, if ASPM is disabled for the NVMe drive (PC401 NVMe SK
> hynix 256GB) in my Dell XPS13 9380, leaving it in D0 during
> suspend-to-idle prevents the SoC from reaching package idle states
> deeper than PC3, which is way insufficient for system suspend.

Just curious: I assume the SoC you reference is some part of the NVMe
drive?

> To address this shortcoming, make nvme_suspend() check if ASPM is
> enabled for the target device and fall back to full device shutdown
> and PCI bus-level PM if that is not the case.
> 
> Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for suspend")
> Link: https://lore.kernel.org/linux-pm/2763495.NmdaWeg79L at kreacher/T/#t
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> ---
> 
> -> v2:
>   * Move the PCI/PCIe ASPM changes to a separate patch.
>   * Do not add a redundant ndev->last_ps == U32_MAX check in nvme_suspend().
> 
> ---
>  drivers/nvme/host/pci.c |   13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> Index: linux-pm/drivers/nvme/host/pci.c
> ===================================================================
> --- linux-pm.orig/drivers/nvme/host/pci.c
> +++ linux-pm/drivers/nvme/host/pci.c
> @@ -2846,7 +2846,7 @@ static int nvme_resume(struct device *de
>  	struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));
>  	struct nvme_ctrl *ctrl = &ndev->ctrl;
>  
> -	if (pm_resume_via_firmware() || !ctrl->npss ||
> +	if (ndev->last_ps == U32_MAX ||
>  	    nvme_set_power_state(ctrl, ndev->last_ps) != 0)
>  		nvme_reset_ctrl(ctrl);
>  	return 0;
> @@ -2859,6 +2859,8 @@ static int nvme_suspend(struct device *d
>  	struct nvme_ctrl *ctrl = &ndev->ctrl;
>  	int ret = -EBUSY;
>  
> +	ndev->last_ps = U32_MAX;
> +
>  	/*
>  	 * The platform does not remove power for a kernel managed suspend so
>  	 * use host managed nvme power settings for lowest idle power if
> @@ -2866,8 +2868,14 @@ static int nvme_suspend(struct device *d
>  	 * shutdown.  But if the firmware is involved after the suspend or the
>  	 * device does not support any non-default power states, shut down the
>  	 * device fully.
> +	 *
> +	 * If ASPM is not enabled for the device, shut down the device and allow
> +	 * the PCI bus layer to put it into D3 in order to take the PCIe link
> +	 * down, so as to allow the platform to achieve its minimum low-power
> +	 * state (which may not be possible if the link is up).
>  	 */
> -	if (pm_suspend_via_firmware() || !ctrl->npss) {
> +	if (pm_suspend_via_firmware() || !ctrl->npss ||
> +	    !pcie_aspm_enabled_mask(pdev)) {

This seems like a layering violation, in the sense that ASPM is
supposed to be hardware-autonomous and invisible to software.

IIUC the NVMe device will go to the desired package idle state if the
link is in L0s or L1, but not if the link is in L0.  I don't
understand that connection; AFAIK that would be something outside the
scope of the PCIe spec.

The spec (PCIe r5.0, sec 5.4.1.1.1 for L0s, 5.4.1.2.1 for L1) is
careful to say that when the conditions are right, devices "should"
enter L0s but it is never mandatory, or "may" enter L1.

And this patch assumes that if ASPM is enabled, the link will
eventually go to L0s or L1.  Because the PCIe spec doesn't mandate
that transition, I think this patch makes the driver dependent on
device-specific behavior.

>  		nvme_dev_disable(ndev, true);
>  		return 0;
>  	}
> @@ -2880,7 +2888,6 @@ static int nvme_suspend(struct device *d
>  	    ctrl->state != NVME_CTRL_ADMIN_ONLY)
>  		goto unfreeze;
>  
> -	ndev->last_ps = 0;
>  	ret = nvme_get_power_state(ctrl, &ndev->last_ps);
>  	if (ret < 0)
>  		goto unfreeze;
> 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v2 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  2019-08-08 13:43                             ` helgaas
@ 2019-08-08 14:47                               ` rafael
  2019-08-08 17:06                                 ` rafael
  2019-08-08 18:39                                 ` helgaas
  0 siblings, 2 replies; 75+ messages in thread
From: rafael @ 2019-08-08 14:47 UTC (permalink / raw)


On Thu, Aug 8, 2019@3:43 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, Aug 08, 2019@12:10:06PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> >
> > One of the modifications made by commit d916b1be94b6 ("nvme-pci: use
> > host managed power state for suspend") was adding a pci_save_state()
> > call to nvme_suspend() in order to prevent the PCI bus-level PM from
> > being applied to the suspended NVMe devices, but if ASPM is not
> > enabled for the target NVMe device, that causes its PCIe link to stay
> > up and the platform may not be able to get into its optimum low-power
> > state because of that.
> >
> > For example, if ASPM is disabled for the NVMe drive (PC401 NVMe SK
> > hynix 256GB) in my Dell XPS13 9380, leaving it in D0 during
> > suspend-to-idle prevents the SoC from reaching package idle states
> > deeper than PC3, which is way insufficient for system suspend.
>
> Just curious: I assume the SoC you reference is some part of the NVMe
> drive?

No, the SoC is what contains the Intel processor and PCH (formerly "chipset").

> > To address this shortcoming, make nvme_suspend() check if ASPM is
> > enabled for the target device and fall back to full device shutdown
> > and PCI bus-level PM if that is not the case.
> >
> > Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for suspend")
> > Link: https://lore.kernel.org/linux-pm/2763495.NmdaWeg79L at kreacher/T/#t
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> > ---
> >
> > -> v2:
> >   * Move the PCI/PCIe ASPM changes to a separate patch.
> >   * Do not add a redundant ndev->last_ps == U32_MAX check in nvme_suspend().
> >
> > ---
> >  drivers/nvme/host/pci.c |   13 ++++++++++---
> >  1 file changed, 10 insertions(+), 3 deletions(-)
> >
> > Index: linux-pm/drivers/nvme/host/pci.c
> > ===================================================================
> > --- linux-pm.orig/drivers/nvme/host/pci.c
> > +++ linux-pm/drivers/nvme/host/pci.c
> > @@ -2846,7 +2846,7 @@ static int nvme_resume(struct device *de
> >       struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));
> >       struct nvme_ctrl *ctrl = &ndev->ctrl;
> >
> > -     if (pm_resume_via_firmware() || !ctrl->npss ||
> > +     if (ndev->last_ps == U32_MAX ||
> >           nvme_set_power_state(ctrl, ndev->last_ps) != 0)
> >               nvme_reset_ctrl(ctrl);
> >       return 0;
> > @@ -2859,6 +2859,8 @@ static int nvme_suspend(struct device *d
> >       struct nvme_ctrl *ctrl = &ndev->ctrl;
> >       int ret = -EBUSY;
> >
> > +     ndev->last_ps = U32_MAX;
> > +
> >       /*
> >        * The platform does not remove power for a kernel managed suspend so
> >        * use host managed nvme power settings for lowest idle power if
> > @@ -2866,8 +2868,14 @@ static int nvme_suspend(struct device *d
> >        * shutdown.  But if the firmware is involved after the suspend or the
> >        * device does not support any non-default power states, shut down the
> >        * device fully.
> > +      *
> > +      * If ASPM is not enabled for the device, shut down the device and allow
> > +      * the PCI bus layer to put it into D3 in order to take the PCIe link
> > +      * down, so as to allow the platform to achieve its minimum low-power
> > +      * state (which may not be possible if the link is up).
> >        */
> > -     if (pm_suspend_via_firmware() || !ctrl->npss) {
> > +     if (pm_suspend_via_firmware() || !ctrl->npss ||
> > +         !pcie_aspm_enabled_mask(pdev)) {
>
> This seems like a layering violation, in the sense that ASPM is
> supposed to be hardware-autonomous and invisible to software.

But software has to enable it.

If it is not enabled, it will not be used, and that's what the check is about.

> IIUC the NVMe device will go to the desired package idle state if the
> link is in L0s or L1, but not if the link is in L0.  I don't
> understand that connection; AFAIK that would be something outside the
> scope of the PCIe spec.

Yes, it is outside of the PCIe spec.

No, this is not about the NVMe device, it is about the Intel SoC
(System-on-a-Chip) the platform is based on.

The background really is commit d916b1be94b6 and its changelog is kind
of misleading, unfortunately.  What it did, among other things, was to
cause the NVMe driver to prevent the PCI bus type from applying the
standard PCI PM to the devices handled by it in the suspend-to-idle
flow.  The reason for doing that was a (reportedly) widespread failure
to take the PCIe link down during D0 -> D3hot transitions of NVMe
devices, which then prevented the platform from going into a deep
enough low-power state while suspended (because it was not sure
whether or not the NVMe device was really "sufficiently" inactive).
[I guess I should mention that in the changelog of the $subject
patch.]  So the idea was to put the (NVMe) device into a low-power
state internally and then let ASPM take care of the PCIe link.

Of course, that can only work if ASPM is enabled at all for the device
in question, even though it may not be sufficient as you say below.

> The spec (PCIe r5.0, sec 5.4.1.1.1 for L0s, 5.4.1.2.1 for L1) is
> careful to say that when the conditions are right, devices "should"
> enter L0s but it is never mandatory, or "may" enter L1.
>
> And this patch assumes that if ASPM is enabled, the link will
> eventually go to L0s or L1.

No, it doesn't.

It avoids failure in the case in which it is guaranteed to happen
(disabled ASPM) and that's it.

> Because the PCIe spec doesn't mandate that transition, I think this patch makes the
> driver dependent on device-specific behavior.

IMO not really.  It just adds a "don't do it if you are going to fail"
kind of check.

>
> >               nvme_dev_disable(ndev, true);
> >               return 0;
> >       }
> > @@ -2880,7 +2888,6 @@ static int nvme_suspend(struct device *d
> >           ctrl->state != NVME_CTRL_ADMIN_ONLY)
> >               goto unfreeze;
> >
> > -     ndev->last_ps = 0;
> >       ret = nvme_get_power_state(ctrl, &ndev->last_ps);
> >       if (ret < 0)
> >               goto unfreeze;
> >

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v2 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled_mask()
  2019-08-08 13:15                             ` helgaas
@ 2019-08-08 14:48                               ` rafael
  0 siblings, 0 replies; 75+ messages in thread
From: rafael @ 2019-08-08 14:48 UTC (permalink / raw)


On Thu, Aug 8, 2019@3:15 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, Aug 08, 2019@12:06:52PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> >
> > Add a function returning the mask of currently enabled ASPM link
> > states for a given device.
> >
> > It will be used by the NVMe driver to decide how to handle the
> > device during system suspend.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> > ---
> >
> > -> v2:
> >   * Move the PCI/PCIe ASPM changes to a separate patch.
> >   * Add the _mask suffix to the new function name.
> >   * Add EXPORT_SYMBOL_GPL() to the new function.
> >   * Avoid adding an unnecessary blank line.
> >
> > ---
> >  drivers/pci/pcie/aspm.c |   20 ++++++++++++++++++++
> >  include/linux/pci.h     |    3 +++
> >  2 files changed, 23 insertions(+)
> >
> > Index: linux-pm/drivers/pci/pcie/aspm.c
> > ===================================================================
> > --- linux-pm.orig/drivers/pci/pcie/aspm.c
> > +++ linux-pm/drivers/pci/pcie/aspm.c
> > @@ -1170,6 +1170,26 @@ static int pcie_aspm_get_policy(char *bu
> >  module_param_call(policy, pcie_aspm_set_policy, pcie_aspm_get_policy,
> >       NULL, 0644);
> >
> > +/*
> > + * pcie_aspm_enabled_mask - Return the mask of enabled ASPM link states.
> > + * @pci_device: Target device.
> > + */
> > +u32 pcie_aspm_enabled_mask(struct pci_dev *pci_device)
> > +{
> > +     struct pci_dev *bridge = pci_upstream_bridge(pci_device);
> > +     u32 ret;
> > +
> > +     if (!bridge)
> > +             return 0;
> > +
> > +     mutex_lock(&aspm_lock);
> > +     ret = bridge->link_state ? bridge->link_state->aspm_enabled : 0;
>
> This returns the "aspm_enabled" mask, but the values of that mask are
> combinations of:
>
>   ASPM_STATE_L0S_UP
>   ASPM_STATE_L0S_DW
>   ASPM_STATE_L1
>   ...
>
> which are defined internally in drivers/pci/pcie/aspm.c and not
> visible to the caller of pcie_aspm_enabled_mask().  If there's no need
> for the actual mask (the current caller doesn't seem to use it), maybe
> this could be a boolean?

Yes, it can be a boolean.

>
> > +     mutex_unlock(&aspm_lock);
> > +
> > +     return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(pcie_aspm_enabled_mask);
> > +
> >  #ifdef CONFIG_PCIEASPM_DEBUG
> >  static ssize_t link_state_show(struct device *dev,
> >               struct device_attribute *attr,
> > Index: linux-pm/include/linux/pci.h
> > ===================================================================
> > --- linux-pm.orig/include/linux/pci.h
> > +++ linux-pm/include/linux/pci.h
> > @@ -1567,8 +1567,11 @@ extern bool pcie_ports_native;
> >
> >  #ifdef CONFIG_PCIEASPM
> >  bool pcie_aspm_support_enabled(void);
> > +u32 pcie_aspm_enabled_mask(struct pci_dev *pci_device);
> >  #else
> >  static inline bool pcie_aspm_support_enabled(void) { return false; }
> > +static inline u32 pcie_aspm_enabled_mask(struct pci_dev *pci_device)
> > +{ return 0; }
> >  #endif
> >
> >  #ifdef CONFIG_PCIEAER
> >

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v2 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  2019-08-08 14:47                               ` rafael
@ 2019-08-08 17:06                                 ` rafael
  2019-08-08 18:39                                 ` helgaas
  1 sibling, 0 replies; 75+ messages in thread
From: rafael @ 2019-08-08 17:06 UTC (permalink / raw)


On Thu, Aug 8, 2019@4:47 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Thu, Aug 8, 2019@3:43 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > On Thu, Aug 08, 2019@12:10:06PM +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> > >
> > > One of the modifications made by commit d916b1be94b6 ("nvme-pci: use
> > > host managed power state for suspend") was adding a pci_save_state()
> > > call to nvme_suspend() in order to prevent the PCI bus-level PM from
> > > being applied to the suspended NVMe devices, but if ASPM is not
> > > enabled for the target NVMe device, that causes its PCIe link to stay
> > > up and the platform may not be able to get into its optimum low-power
> > > state because of that.
> > >
> > > For example, if ASPM is disabled for the NVMe drive (PC401 NVMe SK
> > > hynix 256GB) in my Dell XPS13 9380, leaving it in D0 during
> > > suspend-to-idle prevents the SoC from reaching package idle states
> > > deeper than PC3, which is way insufficient for system suspend.
> >
> > Just curious: I assume the SoC you reference is some part of the NVMe
> > drive?
>
> No, the SoC is what contains the Intel processor and PCH (formerly "chipset").
>
> > > To address this shortcoming, make nvme_suspend() check if ASPM is
> > > enabled for the target device and fall back to full device shutdown
> > > and PCI bus-level PM if that is not the case.
> > >
> > > Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for suspend")
> > > Link: https://lore.kernel.org/linux-pm/2763495.NmdaWeg79L at kreacher/T/#t
> > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> > > ---
> > >
> > > -> v2:
> > >   * Move the PCI/PCIe ASPM changes to a separate patch.
> > >   * Do not add a redundant ndev->last_ps == U32_MAX check in nvme_suspend().
> > >
> > > ---
> > >  drivers/nvme/host/pci.c |   13 ++++++++++---
> > >  1 file changed, 10 insertions(+), 3 deletions(-)
> > >
> > > Index: linux-pm/drivers/nvme/host/pci.c
> > > ===================================================================
> > > --- linux-pm.orig/drivers/nvme/host/pci.c
> > > +++ linux-pm/drivers/nvme/host/pci.c
> > > @@ -2846,7 +2846,7 @@ static int nvme_resume(struct device *de
> > >       struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));
> > >       struct nvme_ctrl *ctrl = &ndev->ctrl;
> > >
> > > -     if (pm_resume_via_firmware() || !ctrl->npss ||
> > > +     if (ndev->last_ps == U32_MAX ||
> > >           nvme_set_power_state(ctrl, ndev->last_ps) != 0)
> > >               nvme_reset_ctrl(ctrl);
> > >       return 0;
> > > @@ -2859,6 +2859,8 @@ static int nvme_suspend(struct device *d
> > >       struct nvme_ctrl *ctrl = &ndev->ctrl;
> > >       int ret = -EBUSY;
> > >
> > > +     ndev->last_ps = U32_MAX;
> > > +
> > >       /*
> > >        * The platform does not remove power for a kernel managed suspend so
> > >        * use host managed nvme power settings for lowest idle power if
> > > @@ -2866,8 +2868,14 @@ static int nvme_suspend(struct device *d
> > >        * shutdown.  But if the firmware is involved after the suspend or the
> > >        * device does not support any non-default power states, shut down the
> > >        * device fully.
> > > +      *
> > > +      * If ASPM is not enabled for the device, shut down the device and allow
> > > +      * the PCI bus layer to put it into D3 in order to take the PCIe link
> > > +      * down, so as to allow the platform to achieve its minimum low-power
> > > +      * state (which may not be possible if the link is up).
> > >        */
> > > -     if (pm_suspend_via_firmware() || !ctrl->npss) {
> > > +     if (pm_suspend_via_firmware() || !ctrl->npss ||
> > > +         !pcie_aspm_enabled_mask(pdev)) {
> >
> > This seems like a layering violation, in the sense that ASPM is
> > supposed to be hardware-autonomous and invisible to software.
>
> But software has to enable it.
>
> If it is not enabled, it will not be used, and that's what the check is about.
>
> > IIUC the NVMe device will go to the desired package idle state if the
> > link is in L0s or L1, but not if the link is in L0.  I don't
> > understand that connection; AFAIK that would be something outside the
> > scope of the PCIe spec.
>
> Yes, it is outside of the PCIe spec.
>
> No, this is not about the NVMe device, it is about the Intel SoC
> (System-on-a-Chip) the platform is based on.
>
> The background really is commit d916b1be94b6 and its changelog is kind
> of misleading, unfortunately.  What it did, among other things, was to
> cause the NVMe driver to prevent the PCI bus type from applying the
> standard PCI PM to the devices handled by it in the suspend-to-idle
> flow.  The reason for doing that was a (reportedly) widespread failure
> to take the PCIe link down during D0 -> D3hot transitions of NVMe
> devices, which then prevented the platform from going into a deep
> enough low-power state while suspended (because it was not sure
> whether or not the NVMe device was really "sufficiently" inactive).
> [I guess I should mention that in the changelog of the $subject
> patch.]  So the idea was to put the (NVMe) device into a low-power
> state internally and then let ASPM take care of the PCIe link.
>
> Of course, that can only work if ASPM is enabled at all for the device
> in question, even though it may not be sufficient as you say below.
>
> > The spec (PCIe r5.0, sec 5.4.1.1.1 for L0s, 5.4.1.2.1 for L1) is
> > careful to say that when the conditions are right, devices "should"
> > enter L0s but it is never mandatory, or "may" enter L1.
> >
> > And this patch assumes that if ASPM is enabled, the link will
> > eventually go to L0s or L1.
>
> No, it doesn't.
>
> It avoids failure in the case in which it is guaranteed to happen
> (disabled ASPM) and that's it.

IOW, after commit d916b1be94b6 and without this patch, nvme_suspend()
*always* assumes that ASPM will take the device's PCIe link down, which
obviously is not going to happen if ASPM is disabled for that device.

The rationale for this patch is to avoid the obvious failure.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v2 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  2019-08-08 14:47                               ` rafael
  2019-08-08 17:06                                 ` rafael
@ 2019-08-08 18:39                                 ` helgaas
  2019-08-08 20:01                                   ` kbusch
                                                     ` (2 more replies)
  1 sibling, 3 replies; 75+ messages in thread
From: helgaas @ 2019-08-08 18:39 UTC (permalink / raw)


On Thu, Aug 08, 2019@04:47:45PM +0200, Rafael J. Wysocki wrote:
> On Thu, Aug 8, 2019@3:43 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, Aug 08, 2019@12:10:06PM +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> > >
> > > One of the modifications made by commit d916b1be94b6 ("nvme-pci: use
> > > host managed power state for suspend") was adding a pci_save_state()
> > > call to nvme_suspend() in order to prevent the PCI bus-level PM from
> > > being applied to the suspended NVMe devices, but if ASPM is not
> > > enabled for the target NVMe device, that causes its PCIe link to stay
> > > up and the platform may not be able to get into its optimum low-power
> > > state because of that.
> > >
> > > For example, if ASPM is disabled for the NVMe drive (PC401 NVMe SK
> > > hynix 256GB) in my Dell XPS13 9380, leaving it in D0 during
> > > suspend-to-idle prevents the SoC from reaching package idle states
> > > deeper than PC3, which is way insufficient for system suspend.
> >
> > Just curious: I assume the SoC you reference is some part of the NVMe
> > drive?
> 
> No, the SoC is what contains the Intel processor and PCH (formerly "chipset").
> 
> > > To address this shortcoming, make nvme_suspend() check if ASPM is
> > > enabled for the target device and fall back to full device shutdown
> > > and PCI bus-level PM if that is not the case.
> > >
> > > Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for suspend")
> > > Link: https://lore.kernel.org/linux-pm/2763495.NmdaWeg79L at kreacher/T/#t
> > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> > > ---
> > >
> > > -> v2:
> > >   * Move the PCI/PCIe ASPM changes to a separate patch.
> > >   * Do not add a redundant ndev->last_ps == U32_MAX check in nvme_suspend().
> > >
> > > ---
> > >  drivers/nvme/host/pci.c |   13 ++++++++++---
> > >  1 file changed, 10 insertions(+), 3 deletions(-)
> > >
> > > Index: linux-pm/drivers/nvme/host/pci.c
> > > ===================================================================
> > > --- linux-pm.orig/drivers/nvme/host/pci.c
> > > +++ linux-pm/drivers/nvme/host/pci.c
> > > @@ -2846,7 +2846,7 @@ static int nvme_resume(struct device *de
> > >       struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));
> > >       struct nvme_ctrl *ctrl = &ndev->ctrl;
> > >
> > > -     if (pm_resume_via_firmware() || !ctrl->npss ||
> > > +     if (ndev->last_ps == U32_MAX ||
> > >           nvme_set_power_state(ctrl, ndev->last_ps) != 0)
> > >               nvme_reset_ctrl(ctrl);
> > >       return 0;
> > > @@ -2859,6 +2859,8 @@ static int nvme_suspend(struct device *d
> > >       struct nvme_ctrl *ctrl = &ndev->ctrl;
> > >       int ret = -EBUSY;
> > >
> > > +     ndev->last_ps = U32_MAX;
> > > +
> > >       /*
> > >        * The platform does not remove power for a kernel managed suspend so
> > >        * use host managed nvme power settings for lowest idle power if
> > > @@ -2866,8 +2868,14 @@ static int nvme_suspend(struct device *d
> > >        * shutdown.  But if the firmware is involved after the suspend or the
> > >        * device does not support any non-default power states, shut down the
> > >        * device fully.
> > > +      *
> > > +      * If ASPM is not enabled for the device, shut down the device and allow
> > > +      * the PCI bus layer to put it into D3 in order to take the PCIe link
> > > +      * down, so as to allow the platform to achieve its minimum low-power
> > > +      * state (which may not be possible if the link is up).
> > >        */
> > > -     if (pm_suspend_via_firmware() || !ctrl->npss) {
> > > +     if (pm_suspend_via_firmware() || !ctrl->npss ||
> > > +         !pcie_aspm_enabled_mask(pdev)) {
> >
> > This seems like a layering violation, in the sense that ASPM is
> > supposed to be hardware-autonomous and invisible to software.
> 
> But software has to enable it.
> 
> If it is not enabled, it will not be used, and that's what the check
> is about.
> 
> > IIUC the NVMe device will go to the desired package idle state if
> > the link is in L0s or L1, but not if the link is in L0.  I don't
> > understand that connection; AFAIK that would be something outside
> > the scope of the PCIe spec.
> 
> Yes, it is outside of the PCIe spec.
> 
> No, this is not about the NVMe device, it is about the Intel SoC
> (System-on-a-Chip) the platform is based on.

Ah.  So this problem could occur with any device, not just NVMe?  If
so, how do you address that?  Obviously you don't want to patch all
drivers this way.

> The background really is commit d916b1be94b6 and its changelog is
> kind of misleading, unfortunately.  What it did, among other things,
> was to cause the NVMe driver to prevent the PCI bus type from
> applying the standard PCI PM to the devices handled by it in the
> suspend-to-idle flow.  

This is more meaningful to you than to most people because "applying
the standard PCI PM" doesn't tell us what that means in terms of the
device.  Presumably it has something to do with a D-state transition?
I *assume* a suspend might involve the D0 -> D3hot transition you
mention below?

> The reason for doing that was a (reportedly) widespread failure to
> take the PCIe link down during D0 -> D3hot transitions of NVMe
> devices,

I don't know any of the details, but "failure to take the link down
during D0 -> D3hot transitions" is phrased as though it might be a
hardware erratum.  If this *is* related to an NVMe erratum, that would
explain why you only need to patch the nvme driver, and it would be
useful to mention that in the commit log, since otherwise it sounds
like something that might be needed in other drivers, too.

According to PCIe r5.0 sec 5.3.2, the only legal link states for D3hot
are L1, L2/L3 Ready.  So if you put a device in D3hot and its link
stays in L0, that sounds like a defect.  Is that what happens?

Obviously I'm still confused.  I think it would help if you could
describe the problem in terms of the specific PCIe states involved
(D0, D3hot, L0, L1, L2, L3, etc) because then the spec would help
explain what's happening.

> which then prevented the platform from going into a deep enough
> low-power state while suspended (because it was not sure whether or
> not the NVMe device was really "sufficiently" inactive).  [I guess I
> should mention that in the changelog of the $subject patch.]  So the
> idea was to put the (NVMe) device into a low-power state internally
> and then let ASPM take care of the PCIe link.
> 
> Of course, that can only work if ASPM is enabled at all for the
> device in question, even though it may not be sufficient as you say
> below.
> 
> > The spec (PCIe r5.0, sec 5.4.1.1.1 for L0s, 5.4.1.2.1 for L1) is
> > careful to say that when the conditions are right, devices
> > "should" enter L0s but it is never mandatory, or "may" enter L1.
> >
> > And this patch assumes that if ASPM is enabled, the link will
> > eventually go to L0s or L1.
> 
> No, it doesn't.
> 
> It avoids failure in the case in which it is guaranteed to happen
> (disabled ASPM) and that's it.
> 
> > Because the PCIe spec doesn't mandate that transition, I think
> > this patch makes the driver dependent on device-specific behavior.
> 
> IMO not really.  It just adds a "don't do it if you are going to
> fail" kind of check.
> 
> >
> > >               nvme_dev_disable(ndev, true);
> > >               return 0;
> > >       }
> > > @@ -2880,7 +2888,6 @@ static int nvme_suspend(struct device *d
> > >           ctrl->state != NVME_CTRL_ADMIN_ONLY)
> > >               goto unfreeze;
> > >
> > > -     ndev->last_ps = 0;
> > >       ret = nvme_get_power_state(ctrl, &ndev->last_ps);
> > >       if (ret < 0)
> > >               goto unfreeze;
> > >
> > >
> > >


* [PATCH v2 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  2019-08-08 18:39                                 ` helgaas
@ 2019-08-08 20:01                                   ` kbusch
  2019-08-08 20:05                                   ` Mario.Limonciello
  2019-08-08 20:41                                   ` rafael
  2 siblings, 0 replies; 75+ messages in thread
From: kbusch @ 2019-08-08 20:01 UTC (permalink / raw)


On Thu, Aug 08, 2019@01:39:54PM -0500, Bjorn Helgaas wrote:
> On Thu, Aug 08, 2019@04:47:45PM +0200, Rafael J. Wysocki wrote:
> > On Thu, Aug 8, 2019@3:43 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > 
> > > IIUC the NVMe device will go to the desired package idle state if
> > > the link is in L0s or L1, but not if the link is in L0.  I don't
> > > understand that connection; AFAIK that would be something outside
> > > the scope of the PCIe spec.
> > 
> > Yes, it is outside of the PCIe spec.
> > 
> > No, this is not about the NVMe device, it is about the Intel SoC
> > (System-on-a-Chip) the platform is based on.
> 
> Ah.  So this problem could occur with any device, not just NVMe?  If
> so, how do you address that?  Obviously you don't want to patch all
> drivers this way.

We discovered this when using an NVMe protocol specific power setting, so
that part is driver specific. We just have to ensure device generic
dependencies are met in order to achieve our power target. So in
that sense, I think you would need to patch all drivers if they're also
using protocol specific settings incorrectly.

Granted, the NVMe specification doesn't detail what PCIe settings may
prevent NVMe power management from hitting the objective, but I think
checking that ASPM is enabled makes sense.


* [PATCH v2 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  2019-08-08 18:39                                 ` helgaas
  2019-08-08 20:01                                   ` kbusch
@ 2019-08-08 20:05                                   ` Mario.Limonciello
  2019-08-08 20:41                                   ` rafael
  2 siblings, 0 replies; 75+ messages in thread
From: Mario.Limonciello @ 2019-08-08 20:05 UTC (permalink / raw)


> This is more meaningful to you than to most people because "applying
> the standard PCI PM" doesn't tell us what that means in terms of the
> device.  Presumably it has something to do with a D-state transition?
> I *assume* a suspend might involve the D0 -> D3hot transition you
> mention below?
> 
> > The reason for doing that was a (reportedly) widespread failure to
> > take the PCIe link down during D0 -> D3hot transitions of NVMe
> > devices,
> 
> I don't know any of the details, but "failure to take the link down
> during D0 -> D3hot transitions" is phrased as though it might be a
> hardware erratum.  If this *is* related to an NVMe erratum, that would
> explain why you only need to patch the nvme driver, and it would be
> useful to mention that in the commit log, since otherwise it sounds
> like something that might be needed in other drivers, too.

NVME is special in this case in that there is other logic being put in
place to set the drive's power state explicitly.

I would also mention that this alternate flow is quicker for s0ix
resume since NVME doesn't go through the shutdown routine.

The feedback from vendors was unanimously to avoid NVME shutdown and
to instead use SetFeatures to go into the deepest power state over
S0ix.

> 
> According to PCIe r5.0 sec 5.3.2, the only legal link states for D3hot
> are L1, L2/L3 Ready.  So if you put a device in D3hot and its link
> stays in L0, that sounds like a defect.  Is that what happens?
> 
> Obviously I'm still confused.  I think it would help if you could
> describe the problem in terms of the specific PCIe states involved
> (D0, D3hot, L0, L1, L2, L3, etc) because then the spec would help
> explain what's happening.

Before that commit, the flow for NVME s0ix was:

* Delete IO SQ/CQ
* Shutdown NVME controller
* Save PCI registers
* Go into D3hot
* Read PMCSR

A functioning drive had the link at L1.2 and NVME power state at PS4
at this point.
Resuming looked like this:

* Restore PCI registers
* Enable NVME controller
* Configure NVME controller (IO queues, features, etc).

After that commit the flow for NVME s0ix is:

* Use NVME SetFeatures to put drive into low power mode (PS3 or PS4)
* Save PCI config register
* ASPM is used to bring link into L1.2

The resume flow is:

* Restore PCI registers

"Non-functioning" drives consumed too much power with the old flow.

The root cause varied from manufacturer to manufacturer.
The two I know off hand:

One instance is that reading the PM status register while the device is
in D3 with its link in L1.2 causes the link to go to L0 and then stay
there.

Another instance I heard of is that the drive isn't able to service a
D3hot request when NVME was already shut down.


* [PATCH v2 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  2019-08-08 18:39                                 ` helgaas
  2019-08-08 20:01                                   ` kbusch
  2019-08-08 20:05                                   ` Mario.Limonciello
@ 2019-08-08 20:41                                   ` rafael
  2019-08-09  4:47                                     ` helgaas
  2 siblings, 1 reply; 75+ messages in thread
From: rafael @ 2019-08-08 20:41 UTC (permalink / raw)


On Thu, Aug 8, 2019, 20:39 Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, Aug 08, 2019@04:47:45PM +0200, Rafael J. Wysocki wrote:
> > On Thu, Aug 8, 2019@3:43 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Thu, Aug 08, 2019@12:10:06PM +0200, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> > > >
> > > > One of the modifications made by commit d916b1be94b6 ("nvme-pci: use
> > > > host managed power state for suspend") was adding a pci_save_state()
> > > > call to nvme_suspend() in order to prevent the PCI bus-level PM from
> > > > being applied to the suspended NVMe devices, but if ASPM is not
> > > > enabled for the target NVMe device, that causes its PCIe link to stay
> > > > up and the platform may not be able to get into its optimum low-power
> > > > state because of that.
> > > >
> > > > For example, if ASPM is disabled for the NVMe drive (PC401 NVMe SK
> > > > hynix 256GB) in my Dell XPS13 9380, leaving it in D0 during
> > > > suspend-to-idle prevents the SoC from reaching package idle states
> > > > deeper than PC3, which is way insufficient for system suspend.
> > >
> > > Just curious: I assume the SoC you reference is some part of the NVMe
> > > drive?
> >
> > No, the SoC is what contains the Intel processor and PCH (formerly "chipset").
> >
> > > > To address this shortcoming, make nvme_suspend() check if ASPM is
> > > > enabled for the target device and fall back to full device shutdown
> > > > and PCI bus-level PM if that is not the case.
> > > >
> > > > Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for suspend")
> > > > Link: https://lore.kernel.org/linux-pm/2763495.NmdaWeg79L at kreacher/T/#t
> > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> > > > ---
> > > >
> > > > -> v2:
> > > >   * Move the PCI/PCIe ASPM changes to a separate patch.
> > > >   * Do not add a redundant ndev->last_ps == U32_MAX check in nvme_suspend().
> > > >
> > > > ---
> > > >  drivers/nvme/host/pci.c |   13 ++++++++++---
> > > >  1 file changed, 10 insertions(+), 3 deletions(-)
> > > >
> > > > Index: linux-pm/drivers/nvme/host/pci.c
> > > > ===================================================================
> > > > --- linux-pm.orig/drivers/nvme/host/pci.c
> > > > +++ linux-pm/drivers/nvme/host/pci.c
> > > > @@ -2846,7 +2846,7 @@ static int nvme_resume(struct device *de
> > > >       struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));
> > > >       struct nvme_ctrl *ctrl = &ndev->ctrl;
> > > >
> > > > -     if (pm_resume_via_firmware() || !ctrl->npss ||
> > > > +     if (ndev->last_ps == U32_MAX ||
> > > >           nvme_set_power_state(ctrl, ndev->last_ps) != 0)
> > > >               nvme_reset_ctrl(ctrl);
> > > >       return 0;
> > > > @@ -2859,6 +2859,8 @@ static int nvme_suspend(struct device *d
> > > >       struct nvme_ctrl *ctrl = &ndev->ctrl;
> > > >       int ret = -EBUSY;
> > > >
> > > > +     ndev->last_ps = U32_MAX;
> > > > +
> > > >       /*
> > > >        * The platform does not remove power for a kernel managed suspend so
> > > >        * use host managed nvme power settings for lowest idle power if
> > > > @@ -2866,8 +2868,14 @@ static int nvme_suspend(struct device *d
> > > >        * shutdown.  But if the firmware is involved after the suspend or the
> > > >        * device does not support any non-default power states, shut down the
> > > >        * device fully.
> > > > +      *
> > > > +      * If ASPM is not enabled for the device, shut down the device and allow
> > > > +      * the PCI bus layer to put it into D3 in order to take the PCIe link
> > > > +      * down, so as to allow the platform to achieve its minimum low-power
> > > > +      * state (which may not be possible if the link is up).
> > > >        */
> > > > -     if (pm_suspend_via_firmware() || !ctrl->npss) {
> > > > +     if (pm_suspend_via_firmware() || !ctrl->npss ||
> > > > +         !pcie_aspm_enabled_mask(pdev)) {
> > >
> > > This seems like a layering violation, in the sense that ASPM is
> > > supposed to be hardware-autonomous and invisible to software.
> >
> > But software has to enable it.
> >
> > If it is not enabled, it will not be used, and that's what the check
> > is about.
> >
> > > IIUC the NVMe device will go to the desired package idle state if
> > > the link is in L0s or L1, but not if the link is in L0.  I don't
> > > understand that connection; AFAIK that would be something outside
> > > the scope of the PCIe spec.
> >
> > Yes, it is outside of the PCIe spec.
> >
> > No, this is not about the NVMe device, it is about the Intel SoC
> > (System-on-a-Chip) the platform is based on.
>
> Ah.  So this problem could occur with any device, not just NVMe?  If
> so, how do you address that?  Obviously you don't want to patch all
> drivers this way.

It could, if the device was left in D0 during suspend, but drivers
don't let devices stay in D0 during suspend as a rule, so this is all
academic, except for the NVMe driver that has just started to do it in
5.3-rc1.

It has started to do that because of what can be regarded as a
hardware issue, but this does not even matter here.

>
> > The background really is commit d916b1be94b6 and its changelog is
> > kind of misleading, unfortunately.  What it did, among other things,
> > was to cause the NVMe driver to prevent the PCI bus type from
> > applying the standard PCI PM to the devices handled by it in the
> > suspend-to-idle flow.
>
> This is more meaningful to you than to most people because "applying
> the standard PCI PM" doesn't tell us what that means in terms of the
> device.  Presumably it has something to do with a D-state transition?
> I *assume* a suspend might involve the D0 -> D3hot transition you
> mention below?

By "standard PCI PM" I mean what pci_prepare_to_sleep() does. And yes,
in the vast majority of cases the device goes from D0 to D3hot then.

>
> > The reason for doing that was a (reportedly) widespread failure to
> > take the PCIe link down during D0 -> D3hot transitions of NVMe
> > devices,
>
> I don't know any of the details, but "failure to take the link down
> during D0 -> D3hot transitions" is phrased as though it might be a
> hardware erratum.  If this *is* related to an NVMe erratum, that would
> explain why you only need to patch the nvme driver, and it would be
> useful to mention that in the commit log, since otherwise it sounds
> like something that might be needed in other drivers, too.

Yes, that can be considered as an NVMe erratum and the NVMe driver has
been *already* patched because of that in 5.3-rc1. [That's the commit
mentioned in the changelog of the $subject patch.]

It effectively asks the PCI bus type to leave *all* devices handled by
it in D0 during suspend-to-idle.  Already today.

I hope that this clarifies the current situation. :-)

>
> According to PCIe r5.0 sec 5.3.2, the only legal link states for D3hot
> are L1, L2/L3 Ready.  So if you put a device in D3hot and its link
> stays in L0, that sounds like a defect.  Is that what happens?

For some devices that's what happens. For some other devices the state
of the link in D3hot appears to be L1 or L2/L3 Ready (as per the spec)
and that's when the $subject patch makes a difference.

The underlying principle is that the energy used by the system while
suspended depends on the states of all of the PCIe links and the
deeper the link state, the less energy the system will use.

Now, say an NVMe device works in accordance with the spec, so when it
goes from D0 to D3hot, its PCIe link goes into L1 or L2/L3 Ready.  As
of 5.3-rc1 or later it will be left in D0 during suspend-to-idle
(because that's how the NVMe driver works), so its link state will
depend on whether or not ASPM is enabled for it.  If ASPM is enabled
for it, the final state of its link will depend on how deep ASPM is
allowed to go, but if ASPM is not enabled for it, its link will remain
in L0.

This means, however, that by allowing that device to go into D3hot
when ASPM is not enabled for it, the energy used by the system while
suspended can be reduced, because the PCIe link of the device will
then go to L1 or L2/L3 Ready.  That's exactly what the $subject patch
does.

Is this still not convincing enough?


* [PATCH v3 0/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  2019-07-31 22:19                       ` kbusch
                                           ` (4 preceding siblings ...)
  2019-08-08 10:03                         ` [PATCH v2 0/2] " rjw
@ 2019-08-08 21:51                         ` " rjw
  2019-08-08 21:55                           ` [PATCH v3 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled() rjw
                                             ` (2 more replies)
  5 siblings, 3 replies; 75+ messages in thread
From: rjw @ 2019-08-08 21:51 UTC (permalink / raw)


Hi All,

> This series is equivalent to the following patch:
> 
> https://patchwork.kernel.org/patch/11083551/
> 
> posted earlier today.
> 
> It addresses review comments from Christoph by splitting the PCI/PCIe ASPM
> part off to a separate patch (patch [1/2]) and fixing a few defects.

Sending v3 to address review comments from Bjorn.

Thanks!


* [PATCH v3 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled()
  2019-08-08 21:51                         ` [PATCH v3 0/2] " rjw
@ 2019-08-08 21:55                           ` rjw
  2019-08-09  4:50                             ` helgaas
  2019-10-07 22:34                             ` Bjorn Helgaas
  2019-08-08 21:58                           ` [PATCH v3 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled rjw
  2019-08-08 22:13                           ` [PATCH v3 0/2] " kbusch
  2 siblings, 2 replies; 75+ messages in thread
From: rjw @ 2019-08-08 21:55 UTC (permalink / raw)


From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Add a function checking whether or not PCIe ASPM has been enabled for
a given device.

It will be used by the NVMe driver to decide how to handle the
device during system suspend.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
---

v2 -> v3:
  * Make the new function return bool.
  * Change its name back to pcie_aspm_enabled().
  * Fix kerneldoc comment formatting.

-> v2:
  * Move the PCI/PCIe ASPM changes to a separate patch.
  * Add the _mask suffix to the new function name.
  * Add EXPORT_SYMBOL_GPL() to the new function.
  * Avoid adding an unnecessary blank line.

---
 drivers/pci/pcie/aspm.c |   20 ++++++++++++++++++++
 include/linux/pci.h     |    3 +++
 2 files changed, 23 insertions(+)

Index: linux-pm/drivers/pci/pcie/aspm.c
===================================================================
--- linux-pm.orig/drivers/pci/pcie/aspm.c
+++ linux-pm/drivers/pci/pcie/aspm.c
@@ -1170,6 +1170,26 @@ static int pcie_aspm_get_policy(char *bu
 module_param_call(policy, pcie_aspm_set_policy, pcie_aspm_get_policy,
 	NULL, 0644);
 
+/**
+ * pcie_aspm_enabled - Check if PCIe ASPM has been enabled for a device.
+ * @pci_device: Target device.
+ */
+bool pcie_aspm_enabled(struct pci_dev *pci_device)
+{
+	struct pci_dev *bridge = pci_upstream_bridge(pci_device);
+	bool ret;
+
+	if (!bridge)
+		return false;
+
+	mutex_lock(&aspm_lock);
+	ret = bridge->link_state ? !!bridge->link_state->aspm_enabled : false;
+	mutex_unlock(&aspm_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pcie_aspm_enabled);
+
 #ifdef CONFIG_PCIEASPM_DEBUG
 static ssize_t link_state_show(struct device *dev,
 		struct device_attribute *attr,
Index: linux-pm/include/linux/pci.h
===================================================================
--- linux-pm.orig/include/linux/pci.h
+++ linux-pm/include/linux/pci.h
@@ -1567,8 +1567,11 @@ extern bool pcie_ports_native;
 
 #ifdef CONFIG_PCIEASPM
 bool pcie_aspm_support_enabled(void);
+bool pcie_aspm_enabled(struct pci_dev *pci_device);
 #else
 static inline bool pcie_aspm_support_enabled(void) { return false; }
+static inline bool pcie_aspm_enabled(struct pci_dev *pci_device)
+{ return false; }
 #endif
 
 #ifdef CONFIG_PCIEAER


* [PATCH v3 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  2019-08-08 21:51                         ` [PATCH v3 0/2] " rjw
  2019-08-08 21:55                           ` [PATCH v3 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled() rjw
@ 2019-08-08 21:58                           ` rjw
  2019-08-08 22:13                           ` [PATCH v3 0/2] " kbusch
  2 siblings, 0 replies; 75+ messages in thread
From: rjw @ 2019-08-08 21:58 UTC (permalink / raw)


From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

One of the modifications made by commit d916b1be94b6 ("nvme-pci: use
host managed power state for suspend") was adding a pci_save_state()
call to nvme_suspend() so as to instruct the PCI bus type to leave
devices handled by the nvme driver in D0 during suspend-to-idle.
That was done with the assumption that ASPM would transition the
device's PCIe link into a low-power state when the device became
inactive.  However, if ASPM is disabled for the device, its PCIe
link will stay in L0 and in that case commit d916b1be94b6 is likely
to cause the energy used by the system while suspended to increase.

Namely, if the device in question works in accordance with the PCIe
specification, putting it into D3hot causes its PCIe link to go to
L1 or L2/L3 Ready, which is lower-power than L0.  Since the energy
used by the system while suspended depends on the state of its PCIe
link (as a general rule, the lower-power the state of the link, the
less energy the system will use), putting the device into D3hot
during suspend-to-idle should be more energy-efficient than leaving
it in D0 with disabled ASPM.

For this reason, avoid leaving NVMe devices with disabled ASPM in D0
during suspend-to-idle.  Instead, shut them down entirely and let
the PCI bus type put them into D3.

Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for suspend")
Link: https://lore.kernel.org/linux-pm/2763495.NmdaWeg79L at kreacher/T/#t
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
---

v2 -> v3:
  * Modify the changelog to describe the rationale for this patch in
    a less confusing and more convincing way.

-> v2:
  * Move the PCI/PCIe ASPM changes to a separate patch.
  * Do not add a redundant ndev->last_ps == U32_MAX check in nvme_suspend().

---
 drivers/nvme/host/pci.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

Index: linux-pm/drivers/nvme/host/pci.c
===================================================================
--- linux-pm.orig/drivers/nvme/host/pci.c
+++ linux-pm/drivers/nvme/host/pci.c
@@ -2846,7 +2846,7 @@ static int nvme_resume(struct device *de
 	struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));
 	struct nvme_ctrl *ctrl = &ndev->ctrl;
 
-	if (pm_resume_via_firmware() || !ctrl->npss ||
+	if (ndev->last_ps == U32_MAX ||
 	    nvme_set_power_state(ctrl, ndev->last_ps) != 0)
 		nvme_reset_ctrl(ctrl);
 	return 0;
@@ -2859,6 +2859,8 @@ static int nvme_suspend(struct device *d
 	struct nvme_ctrl *ctrl = &ndev->ctrl;
 	int ret = -EBUSY;
 
+	ndev->last_ps = U32_MAX;
+
 	/*
 	 * The platform does not remove power for a kernel managed suspend so
 	 * use host managed nvme power settings for lowest idle power if
@@ -2866,8 +2868,14 @@ static int nvme_suspend(struct device *d
 	 * shutdown.  But if the firmware is involved after the suspend or the
 	 * device does not support any non-default power states, shut down the
 	 * device fully.
+	 *
+	 * If ASPM is not enabled for the device, shut down the device and allow
+	 * the PCI bus layer to put it into D3 in order to take the PCIe link
+	 * down, so as to allow the platform to achieve its minimum low-power
+	 * state (which may not be possible if the link is up).
 	 */
-	if (pm_suspend_via_firmware() || !ctrl->npss) {
+	if (pm_suspend_via_firmware() || !ctrl->npss ||
+	    !pcie_aspm_enabled(pdev)) {
 		nvme_dev_disable(ndev, true);
 		return 0;
 	}
@@ -2880,7 +2888,6 @@ static int nvme_suspend(struct device *d
 	    ctrl->state != NVME_CTRL_ADMIN_ONLY)
 		goto unfreeze;
 
-	ndev->last_ps = 0;
 	ret = nvme_get_power_state(ctrl, &ndev->last_ps);
 	if (ret < 0)
 		goto unfreeze;

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v3 0/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  2019-08-08 21:51                         ` [PATCH v3 0/2] " rjw
  2019-08-08 21:55                           ` [PATCH v3 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled() rjw
  2019-08-08 21:58                           ` [PATCH v3 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled rjw
@ 2019-08-08 22:13                           ` " kbusch
  2019-08-09  8:05                             ` rafael
  2 siblings, 1 reply; 75+ messages in thread
From: kbusch @ 2019-08-08 22:13 UTC (permalink / raw)


The v3 series looks good to me.

Reviewed-by: Keith Busch <keith.busch at intel.com>

Bjorn,

If you're okay with the series, we can either take it through nvme,
or you can feel free to apply through pci, whichever you prefer.

Thanks,
Keith

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v2 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  2019-08-08 20:41                                   ` rafael
@ 2019-08-09  4:47                                     ` helgaas
  2019-08-09  8:04                                       ` rafael
  0 siblings, 1 reply; 75+ messages in thread
From: helgaas @ 2019-08-09  4:47 UTC (permalink / raw)


On Thu, Aug 08, 2019@10:41:56PM +0200, Rafael J. Wysocki wrote:
> On Thu, Aug 8, 2019, 20:39 Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, Aug 08, 2019@04:47:45PM +0200, Rafael J. Wysocki wrote:
> > > On Thu, Aug 8, 2019@3:43 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Thu, Aug 08, 2019@12:10:06PM +0200, Rafael J. Wysocki wrote:
> > > > > From: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> > > > >
> > > > > One of the modifications made by commit d916b1be94b6 ("nvme-pci: use
> > > > > host managed power state for suspend") was adding a pci_save_state()
> > > > > call to nvme_suspend() in order to prevent the PCI bus-level PM from
> > > > > being applied to the suspended NVMe devices, but if ASPM is not
> > > > > enabled for the target NVMe device, that causes its PCIe link to stay
> > > > > up and the platform may not be able to get into its optimum low-power
> > > > > state because of that.
> > > > >
> > > > > For example, if ASPM is disabled for the NVMe drive (PC401 NVMe SK
> > > > > hynix 256GB) in my Dell XPS13 9380, leaving it in D0 during
> > > > > suspend-to-idle prevents the SoC from reaching package idle states
> > > > > deeper than PC3, which is way insufficient for system suspend.
> > > >
> > > > Just curious: I assume the SoC you reference is some part of the NVMe
> > > > drive?
> > >
> > > No, the SoC is what contains the Intel processor and PCH (formerly "chipset").
> > >
> > > > > To address this shortcoming, make nvme_suspend() check if ASPM is
> > > > > enabled for the target device and fall back to full device shutdown
> > > > > and PCI bus-level PM if that is not the case.
> > > > >
> > > > > Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for suspend")
> > > > > Link: https://lore.kernel.org/linux-pm/2763495.NmdaWeg79L at kreacher/T/#t
> > > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> > > > > ---
> > > > >
> > > > > -> v2:
> > > > >   * Move the PCI/PCIe ASPM changes to a separate patch.
> > > > >   * Do not add a redundant ndev->last_ps == U32_MAX check in nvme_suspend().
> > > > >
> > > > > ---
> > > > >  drivers/nvme/host/pci.c |   13 ++++++++++---
> > > > >  1 file changed, 10 insertions(+), 3 deletions(-)
> > > > >
> > > > > Index: linux-pm/drivers/nvme/host/pci.c
> > > > > ===================================================================
> > > > > --- linux-pm.orig/drivers/nvme/host/pci.c
> > > > > +++ linux-pm/drivers/nvme/host/pci.c
> > > > > @@ -2846,7 +2846,7 @@ static int nvme_resume(struct device *de
> > > > >       struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));
> > > > >       struct nvme_ctrl *ctrl = &ndev->ctrl;
> > > > >
> > > > > -     if (pm_resume_via_firmware() || !ctrl->npss ||
> > > > > +     if (ndev->last_ps == U32_MAX ||
> > > > >           nvme_set_power_state(ctrl, ndev->last_ps) != 0)
> > > > >               nvme_reset_ctrl(ctrl);
> > > > >       return 0;
> > > > > @@ -2859,6 +2859,8 @@ static int nvme_suspend(struct device *d
> > > > >       struct nvme_ctrl *ctrl = &ndev->ctrl;
> > > > >       int ret = -EBUSY;
> > > > >
> > > > > +     ndev->last_ps = U32_MAX;
> > > > > +
> > > > >       /*
> > > > >        * The platform does not remove power for a kernel managed suspend so
> > > > >        * use host managed nvme power settings for lowest idle power if
> > > > > @@ -2866,8 +2868,14 @@ static int nvme_suspend(struct device *d
> > > > >        * shutdown.  But if the firmware is involved after the suspend or the
> > > > >        * device does not support any non-default power states, shut down the
> > > > >        * device fully.
> > > > > +      *
> > > > > +      * If ASPM is not enabled for the device, shut down the device and allow
> > > > > +      * the PCI bus layer to put it into D3 in order to take the PCIe link
> > > > > +      * down, so as to allow the platform to achieve its minimum low-power
> > > > > +      * state (which may not be possible if the link is up).
> > > > >        */
> > > > > -     if (pm_suspend_via_firmware() || !ctrl->npss) {
> > > > > +     if (pm_suspend_via_firmware() || !ctrl->npss ||
> > > > > +         !pcie_aspm_enabled_mask(pdev)) {
> > > >
> > > > This seems like a layering violation, in the sense that ASPM is
> > > > supposed to be hardware-autonomous and invisible to software.
> > >
> > > But software has to enable it.
> > >
> > > If it is not enabled, it will not be used, and that's what the check
> > > is about.
> > >
> > > > IIUC the NVMe device will go to the desired package idle state if
> > > > the link is in L0s or L1, but not if the link is in L0.  I don't
> > > > understand that connection; AFAIK that would be something outside
> > > > the scope of the PCIe spec.
> > >
> > > Yes, it is outside of the PCIe spec.
> > >
> > > No, this is not about the NVMe device, it is about the Intel SoC
> > > (System-on-a-Chip) the platform is based on.
> >
> > Ah.  So this problem could occur with any device, not just NVMe?  If
> > so, how do you address that?  Obviously you don't want to patch all
> > drivers this way.
> 
> It could, if the device was left in D0 during suspend, but drivers
> don't let devices stay in D0 during suspend as a rule, so this is all
> academic, except for the NVMe driver that has just started to do it in
> 5.3-rc1.
> 
> It has started to do that because of what can be regarded as a
> hardware issue, but this does not even matter here.
> 
> > > The background really is commit d916b1be94b6 and its changelog is
> > > kind of misleading, unfortunately.  What it did, among other things,
> > > was to cause the NVMe driver to prevent the PCI bus type from
> > > applying the standard PCI PM to the devices handled by it in the
> > > suspend-to-idle flow.
> >
> > This is more meaningful to you than to most people because "applying
> > the standard PCI PM" doesn't tell us what that means in terms of the
> > device.  Presumably it has something to do with a D-state transition?
> > I *assume* a suspend might involve the D0 -> D3hot transition you
> > mention below?
> 
> By "standard PCI PM" I mean what pci_prepare_to_sleep() does. And yes,
> in the vast majority of cases the device goes from D0 to D3hot then.
> 
> > > The reason for doing that was a (reportedly) widespread failure to
> > > take the PCIe link down during D0 -> D3hot transitions of NVMe
> > > devices,
> >
> > I don't know any of the details, but "failure to take the link down
> > during D0 -> D3hot transitions" is phrased as though it might be a
> > hardware erratum.  If this *is* related to an NVMe erratum, that would
> > explain why you only need to patch the nvme driver, and it would be
> > useful to mention that in the commit log, since otherwise it sounds
> > like something that might be needed in other drivers, too.
> 
> Yes, that can be considered as an NVMe erratum and the NVMe driver has
> been *already* patched because of that in 5.3-rc1. [That's the commit
> mentioned in the changelog of the $subject patch.]
> 
> It effectively asks the PCI bus type to leave *all* devices handled by
> it in D0 during suspend-to-idle.  Already today.
> 
> I hope that this clarifies the current situation. :-)
> 
> > According to PCIe r5.0 sec 5.3.2, the only legal link states for D3hot
> > are L1, L2/L3 Ready.  So if you put a device in D3hot and its link
> > stays in L0, that sounds like a defect.  Is that what happens?
> 
> For some devices that's what happens. For some other devices the state
> of the link in D3hot appears to be L1 or L2/L3 Ready (as per the spec)
> and that's when the $subject patch makes a difference.
> ...

> Now, say an NVMe device works in accordance with the spec, so when it
> goes from D0 to D3hot, its PCIe link goes into L1 or L2/L3 Ready.  As
> of 5.3-rc1 or later it will be left in D0 during suspend-to-idle
> (because that's how the NVMe driver works), so its link state will
> depend on whether or not ASPM is enabled for it.  If ASPM is enabled
> for it, the final state of its link will depend on how deep ASPM is
> allowed to go, but if ASPM is not enabled for it, its link will remain
> in L0.
> 
> This means, however, that by allowing that device to go into D3hot
> when ASPM is not enabled for it, the energy used by the system while
> suspended can be reduced, because the PCIe link of the device will
> then go to L1 or L2/L3 Ready.  That's exactly what the $subject patch
> does.
> 
> Is this still not convincing enough?

It's not a matter of being convincing, it's a matter of dissecting and
analyzing this far enough so it makes sense to someone who hasn't
debugged the problem.  Since we're talking about ASPM being enabled,
that really means making the connections to specific PCIe situations.

I'm not the nvme maintainer, so my only interest in this is that it
was really hard for me to figure out how pcie_aspm_enabled() is
related to pm_suspend_via_firmware() and ctrl->npss.

But I think it has finally percolated through.  Here's my
understanding; see if it has any connection with reality:

  Prior to d916b1be94b6 ("nvme-pci: use host managed power state for
  suspend"), suspend always put the NVMe device in D3hot.

  After d916b1be94b6, when it's possible, suspend keeps the NVMe
  device in D0 and uses NVMe-specific power settings because it's
  faster to change those than to do D0 -> D3hot -> D0 transitions.

  When it's not possible (either the device doesn't support
  NVMe-specific power settings or platform firmware has to be
  involved), we use D3hot as before.

  So now we have these three cases for suspending an NVMe device:

    1  D0 + no ASPM + NVMe power setting
    2  D0 +    ASPM + NVMe power setting
    3  D3hot

  Prior to d916b1be94b6, we always used case 3.  After d916b1be94b6,
  we used case 1 or 2 whenever possible (we didn't know which).  Case
  2 seemed acceptable, but the power consumption in case 1 was too
  high.

  This patch ("nvme-pci: Allow PCI bus-level PM to be used if ASPM is
  disabled") would replace case 1 with case 3 to reduce power
  consumption.

AFAICT we don't have a way to compute the relative power consumption
of these cases.  It's possible that even case 2 would use more power
than case 3.  You can empirically determine that this patch makes the
right trade-offs for the controllers you care about, but I don't think
it's clear that this will *always* be the case, so in that sense I
think pcie_aspm_enabled() is being used as part of a heuristic.

Bjorn

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v3 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled()
  2019-08-08 21:55                           ` [PATCH v3 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled() rjw
@ 2019-08-09  4:50                             ` helgaas
  2019-08-09  8:00                               ` rafael
  2019-10-07 22:34                             ` Bjorn Helgaas
  1 sibling, 1 reply; 75+ messages in thread
From: helgaas @ 2019-08-09  4:50 UTC (permalink / raw)


s|PCI: PCIe: ASPM: Introduce pcie_aspm_enabled()|PCI/ASPM: Add pcie_aspm_enabled()|

to match previous history.

On Thu, Aug 08, 2019@11:55:07PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> 
> Add a function checking whether or not PCIe ASPM has been enabled for
> a given device.
> 
> It will be used by the NVMe driver to decide how to handle the
> device during system suspend.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>

Acked-by: Bjorn Helgaas <bhelgaas at google.com>

> ---
> 
> v2 -> v3:
>   * Make the new function return bool.
>   * Change its name back to pcie_aspm_enabled().
>   * Fix kerneldoc comment formatting.
> 
> -> v2:
>   * Move the PCI/PCIe ASPM changes to a separate patch.
>   * Add the _mask suffix to the new function name.
>   * Add EXPORT_SYMBOL_GPL() to the new function.
>   * Avoid adding an unnecessary blank line.
> 
> ---
>  drivers/pci/pcie/aspm.c |   20 ++++++++++++++++++++
>  include/linux/pci.h     |    3 +++
>  2 files changed, 23 insertions(+)
> 
> Index: linux-pm/drivers/pci/pcie/aspm.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/pcie/aspm.c
> +++ linux-pm/drivers/pci/pcie/aspm.c
> @@ -1170,6 +1170,26 @@ static int pcie_aspm_get_policy(char *bu
>  module_param_call(policy, pcie_aspm_set_policy, pcie_aspm_get_policy,
>  	NULL, 0644);
>  
> +/**
> + * pcie_aspm_enabled - Check if PCIe ASPM has been enabled for a device.
> + * @pci_device: Target device.
> + */
> +bool pcie_aspm_enabled(struct pci_dev *pci_device)

The typical name in this file is "pdev".

> +{
> +	struct pci_dev *bridge = pci_upstream_bridge(pci_device);
> +	bool ret;
> +
> +	if (!bridge)
> +		return false;
> +
> +	mutex_lock(&aspm_lock);
> +	ret = bridge->link_state ? !!bridge->link_state->aspm_enabled : false;
> +	mutex_unlock(&aspm_lock);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(pcie_aspm_enabled);
> +
>  #ifdef CONFIG_PCIEASPM_DEBUG
>  static ssize_t link_state_show(struct device *dev,
>  		struct device_attribute *attr,
> Index: linux-pm/include/linux/pci.h
> ===================================================================
> --- linux-pm.orig/include/linux/pci.h
> +++ linux-pm/include/linux/pci.h
> @@ -1567,8 +1567,11 @@ extern bool pcie_ports_native;
>  
>  #ifdef CONFIG_PCIEASPM
>  bool pcie_aspm_support_enabled(void);
> +bool pcie_aspm_enabled(struct pci_dev *pci_device);
>  #else
>  static inline bool pcie_aspm_support_enabled(void) { return false; }
> +static inline bool pcie_aspm_enabled(struct pci_dev *pci_device)
> +{ return false; }
>  #endif
>  
>  #ifdef CONFIG_PCIEAER
> 
> 
> 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v3 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled()
  2019-08-09  4:50                             ` helgaas
@ 2019-08-09  8:00                               ` rafael
  0 siblings, 0 replies; 75+ messages in thread
From: rafael @ 2019-08-09  8:00 UTC (permalink / raw)


On Fri, Aug 9, 2019@6:51 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> s|PCI: PCIe: ASPM: Introduce pcie_aspm_enabled()|PCI/ASPM: Add pcie_aspm_enabled()|

Will change.

>
> to match previous history.
>
> On Thu, Aug 08, 2019@11:55:07PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> >
> > Add a function checking whether or not PCIe ASPM has been enabled for
> > a given device.
> >
> > It will be used by the NVMe driver to decide how to handle the
> > device during system suspend.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
>
> Acked-by: Bjorn Helgaas <bhelgaas at google.com>

Thanks!

> > ---
> >
> > v2 -> v3:
> >   * Make the new function return bool.
> >   * Change its name back to pcie_aspm_enabled().
> >   * Fix kerneldoc comment formatting.
> >
> > -> v2:
> >   * Move the PCI/PCIe ASPM changes to a separate patch.
> >   * Add the _mask suffix to the new function name.
> >   * Add EXPORT_SYMBOL_GPL() to the new function.
> >   * Avoid adding an unnecessary blank line.
> >
> > ---
> >  drivers/pci/pcie/aspm.c |   20 ++++++++++++++++++++
> >  include/linux/pci.h     |    3 +++
> >  2 files changed, 23 insertions(+)
> >
> > Index: linux-pm/drivers/pci/pcie/aspm.c
> > ===================================================================
> > --- linux-pm.orig/drivers/pci/pcie/aspm.c
> > +++ linux-pm/drivers/pci/pcie/aspm.c
> > @@ -1170,6 +1170,26 @@ static int pcie_aspm_get_policy(char *bu
> >  module_param_call(policy, pcie_aspm_set_policy, pcie_aspm_get_policy,
> >       NULL, 0644);
> >
> > +/**
> > + * pcie_aspm_enabled - Check if PCIe ASPM has been enabled for a device.
> > + * @pci_device: Target device.
> > + */
> > +bool pcie_aspm_enabled(struct pci_dev *pci_device)
>
> The typical name in this file is "pdev".

OK, will change.

> > +{
> > +     struct pci_dev *bridge = pci_upstream_bridge(pci_device);
> > +     bool ret;
> > +
> > +     if (!bridge)
> > +             return false;
> > +
> > +     mutex_lock(&aspm_lock);
> > +     ret = bridge->link_state ? !!bridge->link_state->aspm_enabled : false;
> > +     mutex_unlock(&aspm_lock);
> > +
> > +     return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(pcie_aspm_enabled);
> > +
> >  #ifdef CONFIG_PCIEASPM_DEBUG
> >  static ssize_t link_state_show(struct device *dev,
> >               struct device_attribute *attr,
> > Index: linux-pm/include/linux/pci.h
> > ===================================================================
> > --- linux-pm.orig/include/linux/pci.h
> > +++ linux-pm/include/linux/pci.h
> > @@ -1567,8 +1567,11 @@ extern bool pcie_ports_native;
> >
> >  #ifdef CONFIG_PCIEASPM
> >  bool pcie_aspm_support_enabled(void);
> > +bool pcie_aspm_enabled(struct pci_dev *pci_device);
> >  #else
> >  static inline bool pcie_aspm_support_enabled(void) { return false; }
> > +static inline bool pcie_aspm_enabled(struct pci_dev *pci_device)
> > +{ return false; }
> >  #endif
> >
> >  #ifdef CONFIG_PCIEAER
> >
> >
> >

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v2 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  2019-08-09  4:47                                     ` helgaas
@ 2019-08-09  8:04                                       ` rafael
  0 siblings, 0 replies; 75+ messages in thread
From: rafael @ 2019-08-09  8:04 UTC (permalink / raw)


On Fri, Aug 9, 2019@6:47 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, Aug 08, 2019@10:41:56PM +0200, Rafael J. Wysocki wrote:
> > On Thu, Aug 8, 2019, 20:39 Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Thu, Aug 08, 2019@04:47:45PM +0200, Rafael J. Wysocki wrote:
> > > > On Thu, Aug 8, 2019@3:43 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Thu, Aug 08, 2019@12:10:06PM +0200, Rafael J. Wysocki wrote:
> > > > > > From: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> > > > > >
> > > > > > One of the modifications made by commit d916b1be94b6 ("nvme-pci: use
> > > > > > host managed power state for suspend") was adding a pci_save_state()
> > > > > > call to nvme_suspend() in order to prevent the PCI bus-level PM from
> > > > > > being applied to the suspended NVMe devices, but if ASPM is not
> > > > > > enabled for the target NVMe device, that causes its PCIe link to stay
> > > > > > up and the platform may not be able to get into its optimum low-power
> > > > > > state because of that.
> > > > > >
> > > > > > For example, if ASPM is disabled for the NVMe drive (PC401 NVMe SK
> > > > > > hynix 256GB) in my Dell XPS13 9380, leaving it in D0 during
> > > > > > suspend-to-idle prevents the SoC from reaching package idle states
> > > > > > deeper than PC3, which is way insufficient for system suspend.
> > > > >
> > > > > Just curious: I assume the SoC you reference is some part of the NVMe
> > > > > drive?
> > > >
> > > > No, the SoC is what contains the Intel processor and PCH (formerly "chipset").
> > > >
> > > > > > To address this shortcoming, make nvme_suspend() check if ASPM is
> > > > > > enabled for the target device and fall back to full device shutdown
> > > > > > and PCI bus-level PM if that is not the case.
> > > > > >
> > > > > > Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for suspend")
> > > > > > Link: https://lore.kernel.org/linux-pm/2763495.NmdaWeg79L at kreacher/T/#t
> > > > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> > > > > > ---
> > > > > >
> > > > > > -> v2:
> > > > > >   * Move the PCI/PCIe ASPM changes to a separate patch.
> > > > > >   * Do not add a redundant ndev->last_ps == U32_MAX check in nvme_suspend().
> > > > > >
> > > > > > ---
> > > > > >  drivers/nvme/host/pci.c |   13 ++++++++++---
> > > > > >  1 file changed, 10 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > Index: linux-pm/drivers/nvme/host/pci.c
> > > > > > ===================================================================
> > > > > > --- linux-pm.orig/drivers/nvme/host/pci.c
> > > > > > +++ linux-pm/drivers/nvme/host/pci.c
> > > > > > @@ -2846,7 +2846,7 @@ static int nvme_resume(struct device *de
> > > > > >       struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));
> > > > > >       struct nvme_ctrl *ctrl = &ndev->ctrl;
> > > > > >
> > > > > > -     if (pm_resume_via_firmware() || !ctrl->npss ||
> > > > > > +     if (ndev->last_ps == U32_MAX ||
> > > > > >           nvme_set_power_state(ctrl, ndev->last_ps) != 0)
> > > > > >               nvme_reset_ctrl(ctrl);
> > > > > >       return 0;
> > > > > > @@ -2859,6 +2859,8 @@ static int nvme_suspend(struct device *d
> > > > > >       struct nvme_ctrl *ctrl = &ndev->ctrl;
> > > > > >       int ret = -EBUSY;
> > > > > >
> > > > > > +     ndev->last_ps = U32_MAX;
> > > > > > +
> > > > > >       /*
> > > > > >        * The platform does not remove power for a kernel managed suspend so
> > > > > >        * use host managed nvme power settings for lowest idle power if
> > > > > > @@ -2866,8 +2868,14 @@ static int nvme_suspend(struct device *d
> > > > > >        * shutdown.  But if the firmware is involved after the suspend or the
> > > > > >        * device does not support any non-default power states, shut down the
> > > > > >        * device fully.
> > > > > > +      *
> > > > > > +      * If ASPM is not enabled for the device, shut down the device and allow
> > > > > > +      * the PCI bus layer to put it into D3 in order to take the PCIe link
> > > > > > +      * down, so as to allow the platform to achieve its minimum low-power
> > > > > > +      * state (which may not be possible if the link is up).
> > > > > >        */
> > > > > > -     if (pm_suspend_via_firmware() || !ctrl->npss) {
> > > > > > +     if (pm_suspend_via_firmware() || !ctrl->npss ||
> > > > > > +         !pcie_aspm_enabled_mask(pdev)) {
> > > > >
> > > > > This seems like a layering violation, in the sense that ASPM is
> > > > > supposed to be hardware-autonomous and invisible to software.
> > > >
> > > > But software has to enable it.
> > > >
> > > > If it is not enabled, it will not be used, and that's what the check
> > > > is about.
> > > >
> > > > > IIUC the NVMe device will go to the desired package idle state if
> > > > > the link is in L0s or L1, but not if the link is in L0.  I don't
> > > > > understand that connection; AFAIK that would be something outside
> > > > > the scope of the PCIe spec.
> > > >
> > > > Yes, it is outside of the PCIe spec.
> > > >
> > > > No, this is not about the NVMe device, it is about the Intel SoC
> > > > (System-on-a-Chip) the platform is based on.
> > >
> > > Ah.  So this problem could occur with any device, not just NVMe?  If
> > > so, how do you address that?  Obviously you don't want to patch all
> > > drivers this way.
> >
> > It could, if the device was left in D0 during suspend, but drivers
> > don't let devices stay in D0 during suspend as a rule, so this is all
> > academic, except for the NVMe driver that has just started to do it in
> > 5.3-rc1.
> >
> > It has started to do that because of what can be regarded as a
> > hardware issue, but this does not even matter here.
> >
> > > > The background really is commit d916b1be94b6 and its changelog is
> > > > kind of misleading, unfortunately.  What it did, among other things,
> > > > was to cause the NVMe driver to prevent the PCI bus type from
> > > > applying the standard PCI PM to the devices handled by it in the
> > > > suspend-to-idle flow.
> > >
> > > This is more meaningful to you than to most people because "applying
> > > the standard PCI PM" doesn't tell us what that means in terms of the
> > > device.  Presumably it has something to do with a D-state transition?
> > > I *assume* a suspend might involve the D0 -> D3hot transition you
> > > mention below?
> >
> > By "standard PCI PM" I mean what pci_prepare_to_sleep() does. And yes,
> > in the vast majority of cases the device goes from D0 to D3hot then.
> >
> > > > The reason for doing that was a (reportedly) widespread failure to
> > > > take the PCIe link down during D0 -> D3hot transitions of NVMe
> > > > devices,
> > >
> > > I don't know any of the details, but "failure to take the link down
> > > during D0 -> D3hot transitions" is phrased as though it might be a
> > > hardware erratum.  If this *is* related to an NVMe erratum, that would
> > > explain why you only need to patch the nvme driver, and it would be
> > > useful to mention that in the commit log, since otherwise it sounds
> > > like something that might be needed in other drivers, too.
> >
> > Yes, that can be considered as an NVMe erratum and the NVMe driver has
> > been *already* patched because of that in 5.3-rc1. [That's the commit
> > mentioned in the changelog of the $subject patch.]
> >
> > It effectively asks the PCI bus type to leave *all* devices handled by
> > it in D0 during suspend-to-idle.  Already today.
> >
> > I hope that this clarifies the current situation. :-)
> >
> > > According to PCIe r5.0 sec 5.3.2, the only legal link states for D3hot
> > > are L1, L2/L3 Ready.  So if you put a device in D3hot and its link
> > > stays in L0, that sounds like a defect.  Is that what happens?
> >
> > For some devices that's what happens. For some other devices the state
> > of the link in D3hot appears to be L1 or L2/L3 Ready (as per the spec)
> > and that's when the $subject patch makes a difference.
> > ...
>
> > Now, say an NVMe device works in accordance with the spec, so when it
> > goes from D0 to D3hot, its PCIe link goes into L1 or L2/L3 Ready.  As
> > of 5.3-rc1 or later it will be left in D0 during suspend-to-idle
> > (because that's how the NVMe driver works), so its link state will
> > depend on whether or not ASPM is enabled for it.  If ASPM is enabled
> > for it, the final state of its link will depend on how deep ASPM is
> > allowed to go, but if ASPM is not enabled for it, its link will remain
> > in L0.
> >
> > This means, however, that by allowing that device to go into D3hot
> > when ASPM is not enabled for it, the energy used by the system while
> > suspended can be reduced, because the PCIe link of the device will
> > then go to L1 or L2/L3 Ready.  That's exactly what the $subject patch
> > does.
> >
> > Is this still not convincing enough?
>
> It's not a matter of being convincing, it's a matter of dissecting and
> analyzing this far enough so it makes sense to someone who hasn't
> debugged the problem.  Since we're talking about ASPM being enabled,
> that really means making the connections to specific PCIe situations.
>
> I'm not the nvme maintainer, so my only interest in this is that it
> was really hard for me to figure out how pcie_aspm_enabled() is
> related to pm_suspend_via_firmware() and ctrl->npss.

Fair enough.

> But I think it has finally percolated through.  Here's my
> understanding; see if it has any connection with reality:
>
>   Prior to d916b1be94b6 ("nvme-pci: use host managed power state for
>   suspend"), suspend always put the NVMe device in D3hot.

Right.

>   After d916b1be94b6, when it's possible, suspend keeps the NVMe
>   device in D0 and uses NVMe-specific power settings because it's
>   faster to change those than to do D0 -> D3hot -> D0 transitions.
>
>   When it's not possible (either the device doesn't support
>   NVMe-specific power settings or platform firmware has to be
>   involved), we use D3hot as before.

Right.

>   So now we have these three cases for suspending an NVMe device:
>
>     1  D0 + no ASPM + NVMe power setting
>     2  D0 +    ASPM + NVMe power setting
>     3  D3hot
>
>   Prior to d916b1be94b6, we always used case 3.  After d916b1be94b6,
>   we used case 1 or 2 whenever possible (we didn't know which).  Case
>   2 seemed acceptable, but the power consumption in case 1 was too
>   high.

That's correct.

>   This patch ("nvme-pci: Allow PCI bus-level PM to be used if ASPM is
>   disabled") would replace case 1 with case 3 to reduce power
>   consumption.

Right.

> AFAICT we don't have a way to compute the relative power consumption
> of these cases.  It's possible that even case 2 would use more power
> than case 3.  You can empirically determine that this patch makes the
> right trade-offs for the controllers you care about, but I don't think
> it's clear that this will *always* be the case, so in that sense I
> think pcie_aspm_enabled() is being used as part of a heuristic.

Fair enough.

Cheers,
Rafael

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v3 0/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  2019-08-08 22:13                           ` [PATCH v3 0/2] " kbusch
@ 2019-08-09  8:05                             ` rafael
  2019-08-09 14:52                               ` kbusch
  0 siblings, 1 reply; 75+ messages in thread
From: rafael @ 2019-08-09  8:05 UTC (permalink / raw)


On Fri, Aug 9, 2019@12:16 AM Keith Busch <kbusch@kernel.org> wrote:
>
> The v3 series looks good to me.
>
> Reviewed-by: Keith Busch <keith.busch at intel.com>
>
> Bjorn,
>
> If you're okay with the series, we can either take it through nvme,
> or you can feel free to apply through pci, whichever you prefer.

Actually, I can apply it too with your R-by along with the PCIe patch
ACKed by Bjorn.  Please let me know if that works for you.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v3 0/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  2019-08-09  8:05                             ` rafael
@ 2019-08-09 14:52                               ` kbusch
  0 siblings, 0 replies; 75+ messages in thread
From: kbusch @ 2019-08-09 14:52 UTC (permalink / raw)


On Fri, Aug 09, 2019@01:05:42AM -0700, Rafael J. Wysocki wrote:
> On Fri, Aug 9, 2019@12:16 AM Keith Busch <kbusch@kernel.org> wrote:
> >
> > The v3 series looks good to me.
> >
> > Reviewed-by: Keith Busch <keith.busch at intel.com>
> >
> > Bjorn,
> >
> > If you're okay with the series, we can either take it through nvme,
> > or you can feel free to apply through pci, whichever you prefer.
> 
> Actually, I can apply it too with your R-by along with the PCIe patch
> ACKed by Bjorn.  Please let me know if that works for you.

Thanks, that sounds good to me.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v3 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled()
  2019-08-08 21:55                           ` [PATCH v3 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled() rjw
  2019-08-09  4:50                             ` helgaas
@ 2019-10-07 22:34                             ` Bjorn Helgaas
  2019-10-08  9:27                               ` Rafael J. Wysocki
  1 sibling, 1 reply; 75+ messages in thread
From: Bjorn Helgaas @ 2019-10-07 22:34 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Sagi Grimberg, Mario Limonciello, Linux PCI, Linux PM,
	Linux Kernel Mailing List, linux-nvme, Keith Busch,
	Kai-Heng Feng, Keith Busch, Rajat Jain, Christoph Hellwig,
	Heiner Kallweit

[+cc Heiner]

On Thu, Aug 08, 2019 at 11:55:07PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Add a function checking whether or not PCIe ASPM has been enabled for
> a given device.
> 
> It will be used by the NVMe driver to decide how to handle the
> device during system suspend.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> v2 -> v3:
>   * Make the new function return bool.
>   * Change its name back to pcie_aspm_enabled().
>   * Fix kerneldoc comment formatting.
> 
> -> v2:
>   * Move the PCI/PCIe ASPM changes to a separate patch.
>   * Add the _mask suffix to the new function name.
>   * Add EXPORT_SYMBOL_GPL() to the new function.
>   * Avoid adding an unnecessary blank line.
> 
> ---
>  drivers/pci/pcie/aspm.c |   20 ++++++++++++++++++++
>  include/linux/pci.h     |    3 +++
>  2 files changed, 23 insertions(+)
> 
> Index: linux-pm/drivers/pci/pcie/aspm.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/pcie/aspm.c
> +++ linux-pm/drivers/pci/pcie/aspm.c
> @@ -1170,6 +1170,26 @@ static int pcie_aspm_get_policy(char *bu
>  module_param_call(policy, pcie_aspm_set_policy, pcie_aspm_get_policy,
>  	NULL, 0644);
>  
> +/**
> + * pcie_aspm_enabled - Check if PCIe ASPM has been enabled for a device.
> + * @pci_device: Target device.
> + */
> +bool pcie_aspm_enabled(struct pci_dev *pci_device)
> +{
> +	struct pci_dev *bridge = pci_upstream_bridge(pci_device);
> +	bool ret;
> +
> +	if (!bridge)
> +		return false;
> +
> +	mutex_lock(&aspm_lock);
> +	ret = bridge->link_state ? !!bridge->link_state->aspm_enabled : false;
> +	mutex_unlock(&aspm_lock);

Why do we need to acquire aspm_lock here?  We aren't modifying
anything, and I don't think we're preventing a race.  If this races
with another thread that changes aspm_enabled, we'll return either the
old state or the new one, and I think that's still the case even if we
don't acquire aspm_lock.

> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(pcie_aspm_enabled);
> +
>  #ifdef CONFIG_PCIEASPM_DEBUG
>  static ssize_t link_state_show(struct device *dev,
>  		struct device_attribute *attr,
> Index: linux-pm/include/linux/pci.h
> ===================================================================
> --- linux-pm.orig/include/linux/pci.h
> +++ linux-pm/include/linux/pci.h
> @@ -1567,8 +1567,11 @@ extern bool pcie_ports_native;
>  
>  #ifdef CONFIG_PCIEASPM
>  bool pcie_aspm_support_enabled(void);
> +bool pcie_aspm_enabled(struct pci_dev *pci_device);
>  #else
>  static inline bool pcie_aspm_support_enabled(void) { return false; }
> +static inline bool pcie_aspm_enabled(struct pci_dev *pci_device)
> +{ return false; }
>  #endif
>  
>  #ifdef CONFIG_PCIEAER
> 
> 
> 

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v3 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled()
  2019-10-07 22:34                             ` Bjorn Helgaas
@ 2019-10-08  9:27                               ` Rafael J. Wysocki
  2019-10-08 21:16                                 ` Bjorn Helgaas
  0 siblings, 1 reply; 75+ messages in thread
From: Rafael J. Wysocki @ 2019-10-08  9:27 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Sagi Grimberg, Mario Limonciello, Linux PCI, Linux PM,
	Rafael J. Wysocki, Linux Kernel Mailing List, linux-nvme,
	Keith Busch, Kai-Heng Feng, Keith Busch, Rajat Jain,
	Christoph Hellwig, Heiner Kallweit

On Tue, Oct 8, 2019 at 12:34 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc Heiner]
>
> On Thu, Aug 08, 2019 at 11:55:07PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Add a function checking whether or not PCIe ASPM has been enabled for
> > a given device.
> >
> > It will be used by the NVMe driver to decide how to handle the
> > device during system suspend.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >
> > v2 -> v3:
> >   * Make the new function return bool.
> >   * Change its name back to pcie_aspm_enabled().
> >   * Fix kerneldoc comment formatting.
> >
> > -> v2:
> >   * Move the PCI/PCIe ASPM changes to a separate patch.
> >   * Add the _mask suffix to the new function name.
> >   * Add EXPORT_SYMBOL_GPL() to the new function.
> >   * Avoid adding an unnecessary blank line.
> >
> > ---
> >  drivers/pci/pcie/aspm.c |   20 ++++++++++++++++++++
> >  include/linux/pci.h     |    3 +++
> >  2 files changed, 23 insertions(+)
> >
> > Index: linux-pm/drivers/pci/pcie/aspm.c
> > ===================================================================
> > --- linux-pm.orig/drivers/pci/pcie/aspm.c
> > +++ linux-pm/drivers/pci/pcie/aspm.c
> > @@ -1170,6 +1170,26 @@ static int pcie_aspm_get_policy(char *bu
> >  module_param_call(policy, pcie_aspm_set_policy, pcie_aspm_get_policy,
> >       NULL, 0644);
> >
> > +/**
> > + * pcie_aspm_enabled - Check if PCIe ASPM has been enabled for a device.
> > + * @pci_device: Target device.
> > + */
> > +bool pcie_aspm_enabled(struct pci_dev *pci_device)
> > +{
> > +     struct pci_dev *bridge = pci_upstream_bridge(pci_device);
> > +     bool ret;
> > +
> > +     if (!bridge)
> > +             return false;
> > +
> > +     mutex_lock(&aspm_lock);
> > +     ret = bridge->link_state ? !!bridge->link_state->aspm_enabled : false;
> > +     mutex_unlock(&aspm_lock);
>
> Why do we need to acquire aspm_lock here?  We aren't modifying
> anything, and I don't think we're preventing a race.  If this races
> with another thread that changes aspm_enabled, we'll return either the
> old state or the new one, and I think that's still the case even if we
> don't acquire aspm_lock.

Well, if we can guarantee that pci_remove_bus_device() will never be
called in parallel with this helper, then I agree, but can we
guarantee that?


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v3 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled()
  2019-10-08  9:27                               ` Rafael J. Wysocki
@ 2019-10-08 21:16                                 ` Bjorn Helgaas
  2019-10-08 22:54                                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 75+ messages in thread
From: Bjorn Helgaas @ 2019-10-08 21:16 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Sagi Grimberg, Mario Limonciello, Linux PCI, Linux PM,
	Rafael J. Wysocki, Linux Kernel Mailing List, linux-nvme,
	Keith Busch, Kai-Heng Feng, Keith Busch, Rajat Jain,
	Christoph Hellwig, Heiner Kallweit

On Tue, Oct 08, 2019 at 11:27:51AM +0200, Rafael J. Wysocki wrote:
> On Tue, Oct 8, 2019 at 12:34 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, Aug 08, 2019 at 11:55:07PM +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >
> > > Add a function checking whether or not PCIe ASPM has been enabled for
> > > a given device.
> > >
> > > It will be used by the NVMe driver to decide how to handle the
> > > device during system suspend.
> > >
> > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > ---
> > >
> > > v2 -> v3:
> > >   * Make the new function return bool.
> > >   * Change its name back to pcie_aspm_enabled().
> > >   * Fix kerneldoc comment formatting.
> > >
> > > -> v2:
> > >   * Move the PCI/PCIe ASPM changes to a separate patch.
> > >   * Add the _mask suffix to the new function name.
> > >   * Add EXPORT_SYMBOL_GPL() to the new function.
> > >   * Avoid adding an unnecessary blank line.
> > >
> > > ---
> > >  drivers/pci/pcie/aspm.c |   20 ++++++++++++++++++++
> > >  include/linux/pci.h     |    3 +++
> > >  2 files changed, 23 insertions(+)
> > >
> > > Index: linux-pm/drivers/pci/pcie/aspm.c
> > > ===================================================================
> > > --- linux-pm.orig/drivers/pci/pcie/aspm.c
> > > +++ linux-pm/drivers/pci/pcie/aspm.c
> > > @@ -1170,6 +1170,26 @@ static int pcie_aspm_get_policy(char *bu
> > >  module_param_call(policy, pcie_aspm_set_policy, pcie_aspm_get_policy,
> > >       NULL, 0644);
> > >
> > > +/**
> > > + * pcie_aspm_enabled - Check if PCIe ASPM has been enabled for a device.
> > > + * @pci_device: Target device.
> > > + */
> > > +bool pcie_aspm_enabled(struct pci_dev *pci_device)
> > > +{
> > > +     struct pci_dev *bridge = pci_upstream_bridge(pci_device);
> > > +     bool ret;
> > > +
> > > +     if (!bridge)
> > > +             return false;
> > > +
> > > +     mutex_lock(&aspm_lock);
> > > +     ret = bridge->link_state ? !!bridge->link_state->aspm_enabled : false;
> > > +     mutex_unlock(&aspm_lock);
> >
> > Why do we need to acquire aspm_lock here?  We aren't modifying
> > anything, and I don't think we're preventing a race.  If this races
> > with another thread that changes aspm_enabled, we'll return either the
> > old state or the new one, and I think that's still the case even if we
> > don't acquire aspm_lock.
> 
> Well, if we can guarantee that pci_remove_bus_device() will never be
> called in parallel with this helper, then I agree, but can we
> guarantee that?

Hmm, yeah, I guess that's the question.  It's not a race with another
thread changing aspm_enabled; the potential race is with another
thread removing the last child of "bridge", which will free the
link_state and set bridge->link_state = NULL.

I think it should be safe to call device-related PCI interfaces if
you're holding a reference to the device, e.g., from a driver bound to
the device or a sysfs accessor.  Since we call pcie_aspm_enabled(dev)
from a driver bound to "dev", another thread should not be able to
remove "dev" while we're using it.

I know that's a little hand-wavey, but if it weren't true, I think
we'd have a lot more locking sprinkled everywhere in the PCI core than
we do.

This has implications for Heiner's ASPM sysfs patches because we're
currently doing this in sysfs accessors:

  static ssize_t aspm_attr_show_common(struct device *dev, ...)
  {
    ...
    link = pcie_aspm_get_link(pdev);

    mutex_lock(&aspm_lock);
    enabled = link->aspm_enabled & state;
    mutex_unlock(&aspm_lock);
    ...
  }

I assume sysfs must be holding a reference that guarantees "dev" is
valid throughout this code, and therefore we should not need to hold
aspm_lock.

Bjorn


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v3 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled()
  2019-10-08 21:16                                 ` Bjorn Helgaas
@ 2019-10-08 22:54                                   ` Rafael J. Wysocki
  2019-10-09 12:49                                     ` Bjorn Helgaas
  0 siblings, 1 reply; 75+ messages in thread
From: Rafael J. Wysocki @ 2019-10-08 22:54 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Sagi Grimberg, Mario Limonciello, Linux PCI, Linux PM,
	Rafael J. Wysocki, Rafael J. Wysocki, Linux Kernel Mailing List,
	linux-nvme, Keith Busch, Kai-Heng Feng, Keith Busch, Rajat Jain,
	Christoph Hellwig, Heiner Kallweit

On Tue, Oct 8, 2019 at 11:16 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Tue, Oct 08, 2019 at 11:27:51AM +0200, Rafael J. Wysocki wrote:
> > On Tue, Oct 8, 2019 at 12:34 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Thu, Aug 08, 2019 at 11:55:07PM +0200, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > >
> > > > Add a function checking whether or not PCIe ASPM has been enabled for
> > > > a given device.
> > > >
> > > > It will be used by the NVMe driver to decide how to handle the
> > > > device during system suspend.
> > > >
> > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > ---
> > > >
> > > > v2 -> v3:
> > > >   * Make the new function return bool.
> > > >   * Change its name back to pcie_aspm_enabled().
> > > >   * Fix kerneldoc comment formatting.
> > > >
> > > > -> v2:
> > > >   * Move the PCI/PCIe ASPM changes to a separate patch.
> > > >   * Add the _mask suffix to the new function name.
> > > >   * Add EXPORT_SYMBOL_GPL() to the new function.
> > > >   * Avoid adding an unnecessary blank line.
> > > >
> > > > ---
> > > >  drivers/pci/pcie/aspm.c |   20 ++++++++++++++++++++
> > > >  include/linux/pci.h     |    3 +++
> > > >  2 files changed, 23 insertions(+)
> > > >
> > > > Index: linux-pm/drivers/pci/pcie/aspm.c
> > > > ===================================================================
> > > > --- linux-pm.orig/drivers/pci/pcie/aspm.c
> > > > +++ linux-pm/drivers/pci/pcie/aspm.c
> > > > @@ -1170,6 +1170,26 @@ static int pcie_aspm_get_policy(char *bu
> > > >  module_param_call(policy, pcie_aspm_set_policy, pcie_aspm_get_policy,
> > > >       NULL, 0644);
> > > >
> > > > +/**
> > > > + * pcie_aspm_enabled - Check if PCIe ASPM has been enabled for a device.
> > > > + * @pci_device: Target device.
> > > > + */
> > > > +bool pcie_aspm_enabled(struct pci_dev *pci_device)
> > > > +{
> > > > +     struct pci_dev *bridge = pci_upstream_bridge(pci_device);
> > > > +     bool ret;
> > > > +
> > > > +     if (!bridge)
> > > > +             return false;
> > > > +
> > > > +     mutex_lock(&aspm_lock);
> > > > +     ret = bridge->link_state ? !!bridge->link_state->aspm_enabled : false;
> > > > +     mutex_unlock(&aspm_lock);
> > >
> > > Why do we need to acquire aspm_lock here?  We aren't modifying
> > > anything, and I don't think we're preventing a race.  If this races
> > > with another thread that changes aspm_enabled, we'll return either the
> > > old state or the new one, and I think that's still the case even if we
> > > don't acquire aspm_lock.
> >
> > Well, if we can guarantee that pci_remove_bus_device() will never be
> > called in parallel with this helper, then I agree, but can we
> > guarantee that?
>
> Hmm, yeah, I guess that's the question.  It's not a race with another
> thread changing aspm_enabled; the potential race is with another
> thread removing the last child of "bridge", which will free the
> link_state and set bridge->link_state = NULL.
>
> I think it should be safe to call device-related PCI interfaces if
> you're holding a reference to the device, e.g., from a driver bound to
> the device or a sysfs accessor.  Since we call pcie_aspm_enabled(dev)
> from a driver bound to "dev", another thread should not be able to
> remove "dev" while we're using it.
>
> I know that's a little hand-wavey, but if it weren't true, I think
> we'd have a lot more locking sprinkled everywhere in the PCI core than
> we do.
>
> This has implications for Heiner's ASPM sysfs patches because we're
> currently doing this in sysfs accessors:
>
>   static ssize_t aspm_attr_show_common(struct device *dev, ...)
>   {
>     ...
>     link = pcie_aspm_get_link(pdev);
>
>     mutex_lock(&aspm_lock);
>     enabled = link->aspm_enabled & state;
>     mutex_unlock(&aspm_lock);
>     ...
>   }
>
> I assume sysfs must be holding a reference that guarantees "dev" is
> valid throughout this code, and therefore we should not need to hold
> aspm_lock.

In principle, pcie_aspm_enabled() need not be called via sysfs.

In the particular NVMe use case, it is called from the driver's own PM
callback, so it would be safe without the locking AFAICS.

I guess it is safe to drop the locking from there, but then it would
be good to mention in the kerneldoc that calling it is only safe under
the assumption that the link_state object cannot go away while it is
running.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v3 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled()
  2019-10-08 22:54                                   ` Rafael J. Wysocki
@ 2019-10-09 12:49                                     ` Bjorn Helgaas
  0 siblings, 0 replies; 75+ messages in thread
From: Bjorn Helgaas @ 2019-10-09 12:49 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Sagi Grimberg, Mario Limonciello, Linux PCI, Linux PM,
	Rafael J. Wysocki, Linux Kernel Mailing List, linux-nvme,
	Keith Busch, Kai-Heng Feng, Keith Busch, Rajat Jain,
	Christoph Hellwig, Heiner Kallweit

On Wed, Oct 09, 2019 at 12:54:37AM +0200, Rafael J. Wysocki wrote:
> On Tue, Oct 8, 2019 at 11:16 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > On Tue, Oct 08, 2019 at 11:27:51AM +0200, Rafael J. Wysocki wrote:
> > > On Tue, Oct 8, 2019 at 12:34 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Thu, Aug 08, 2019 at 11:55:07PM +0200, Rafael J. Wysocki wrote:
> > > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > >
> > > > > Add a function checking whether or not PCIe ASPM has been enabled for
> > > > > a given device.
> > > > >
> > > > > It will be used by the NVMe driver to decide how to handle the
> > > > > device during system suspend.
> > > > >
> > > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > > ---
> > > > >
> > > > > v2 -> v3:
> > > > >   * Make the new function return bool.
> > > > >   * Change its name back to pcie_aspm_enabled().
> > > > >   * Fix kerneldoc comment formatting.
> > > > >
> > > > > -> v2:
> > > > >   * Move the PCI/PCIe ASPM changes to a separate patch.
> > > > >   * Add the _mask suffix to the new function name.
> > > > >   * Add EXPORT_SYMBOL_GPL() to the new function.
> > > > >   * Avoid adding an unnecessary blank line.
> > > > >
> > > > > ---
> > > > >  drivers/pci/pcie/aspm.c |   20 ++++++++++++++++++++
> > > > >  include/linux/pci.h     |    3 +++
> > > > >  2 files changed, 23 insertions(+)
> > > > >
> > > > > Index: linux-pm/drivers/pci/pcie/aspm.c
> > > > > ===================================================================
> > > > > --- linux-pm.orig/drivers/pci/pcie/aspm.c
> > > > > +++ linux-pm/drivers/pci/pcie/aspm.c
> > > > > @@ -1170,6 +1170,26 @@ static int pcie_aspm_get_policy(char *bu
> > > > >  module_param_call(policy, pcie_aspm_set_policy, pcie_aspm_get_policy,
> > > > >       NULL, 0644);
> > > > >
> > > > > +/**
> > > > > + * pcie_aspm_enabled - Check if PCIe ASPM has been enabled for a device.
> > > > > + * @pci_device: Target device.
> > > > > + */
> > > > > +bool pcie_aspm_enabled(struct pci_dev *pci_device)
> > > > > +{
> > > > > +     struct pci_dev *bridge = pci_upstream_bridge(pci_device);
> > > > > +     bool ret;
> > > > > +
> > > > > +     if (!bridge)
> > > > > +             return false;
> > > > > +
> > > > > +     mutex_lock(&aspm_lock);
> > > > > +     ret = bridge->link_state ? !!bridge->link_state->aspm_enabled : false;
> > > > > +     mutex_unlock(&aspm_lock);
> > > >
> > > > Why do we need to acquire aspm_lock here?  We aren't modifying
> > > > anything, and I don't think we're preventing a race.  If this races
> > > > with another thread that changes aspm_enabled, we'll return either the
> > > > old state or the new one, and I think that's still the case even if we
> > > > don't acquire aspm_lock.
> > >
> > > Well, if we can guarantee that pci_remove_bus_device() will never be
> > > called in parallel with this helper, then I agree, but can we
> > > guarantee that?
> >
> > Hmm, yeah, I guess that's the question.  It's not a race with another
> > thread changing aspm_enabled; the potential race is with another
> > thread removing the last child of "bridge", which will free the
> > link_state and set bridge->link_state = NULL.
> >
> > I think it should be safe to call device-related PCI interfaces if
> > you're holding a reference to the device, e.g., from a driver bound to
> > the device or a sysfs accessor.  Since we call pcie_aspm_enabled(dev)
> > from a driver bound to "dev", another thread should not be able to
> > remove "dev" while we're using it.
> >
> > I know that's a little hand-wavey, but if it weren't true, I think
> > we'd have a lot more locking sprinkled everywhere in the PCI core than
> > we do.
> >
> > This has implications for Heiner's ASPM sysfs patches because we're
> > currently doing this in sysfs accessors:
> >
> >   static ssize_t aspm_attr_show_common(struct device *dev, ...)
> >   {
> >     ...
> >     link = pcie_aspm_get_link(pdev);
> >
> >     mutex_lock(&aspm_lock);
> >     enabled = link->aspm_enabled & state;
> >     mutex_unlock(&aspm_lock);
> >     ...
> >   }
> >
> > I assume sysfs must be holding a reference that guarantees "dev" is
> > valid throughout this code, and therefore we should not need to hold
> > aspm_lock.
> 
> In principle, pcie_aspm_enabled() need not be called via sysfs.
> 
> In the particular NVMe use case, it is called from the driver's own PM
> callback, so it would be safe without the locking AFAICS.

Right, pcie_aspm_enabled() is only used by drivers (actually only by
the nvme driver so far).  And aspm_attr_show_common() is only used via
new sysfs code being added by Heiner.

> I guess it is safe to drop the locking from there, but then it would
> be good to mention in the kerneldoc that calling it is only safe under
> the assumption that the link_state object cannot go away while it is
> running.

I'll post a patch to that effect.  Thanks!

Bjorn


^ permalink raw reply	[flat|nested] 75+ messages in thread

end of thread, back to index

Thread overview: 75+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-25  9:51 [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems rjw
2019-07-25 14:02 ` kai.heng.feng
2019-07-25 16:23   ` Mario.Limonciello
2019-07-25 17:03     ` rafael
2019-07-25 17:23       ` Mario.Limonciello
2019-07-25 18:20       ` kai.heng.feng
2019-07-25 19:09         ` Mario.Limonciello
2019-07-30 10:45       ` rjw
2019-07-30 14:41         ` kbusch
2019-07-30 17:14           ` Mario.Limonciello
2019-07-30 18:50             ` kai.heng.feng
2019-07-30 19:19               ` kbusch
2019-07-30 21:05                 ` Mario.Limonciello
2019-07-30 21:31                   ` kbusch
2019-07-31 21:25                     ` rafael
2019-07-31 22:19                       ` kbusch
2019-07-31 22:33                         ` rafael
2019-08-01  9:05                           ` kai.heng.feng
2019-08-01 17:29                             ` rafael
2019-08-01 19:05                               ` Mario.Limonciello
2019-08-01 22:26                                 ` rafael
2019-08-02 10:55                                   ` kai.heng.feng
2019-08-02 11:04                                     ` rafael
2019-08-05 19:13                                       ` kai.heng.feng
2019-08-05 21:28                                         ` rafael
2019-08-06 14:02                                           ` Mario.Limonciello
2019-08-06 15:00                                             ` rafael
2019-08-07 10:29                                               ` rjw
2019-08-01 20:22                             ` kbusch
2019-08-07  9:48                         ` rjw
2019-08-07 10:45                           ` hch
2019-08-07 10:54                             ` rafael
2019-08-07  9:53                         ` [PATCH] nvme-pci: Do not prevent PCI bus-level PM from being used rjw
2019-08-07 10:14                           ` rjw
2019-08-07 10:43                           ` hch
2019-08-07 14:37                           ` kbusch
2019-08-08  8:36                         ` [PATCH] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled rjw
2019-08-08  8:48                           ` hch
2019-08-08  9:06                             ` rafael
2019-08-08 10:03                         ` [PATCH v2 0/2] " rjw
2019-08-08 10:06                           ` [PATCH v2 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled_mask() rjw
2019-08-08 13:15                             ` helgaas
2019-08-08 14:48                               ` rafael
2019-08-08 10:10                           ` [PATCH v2 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled rjw
2019-08-08 13:43                             ` helgaas
2019-08-08 14:47                               ` rafael
2019-08-08 17:06                                 ` rafael
2019-08-08 18:39                                 ` helgaas
2019-08-08 20:01                                   ` kbusch
2019-08-08 20:05                                   ` Mario.Limonciello
2019-08-08 20:41                                   ` rafael
2019-08-09  4:47                                     ` helgaas
2019-08-09  8:04                                       ` rafael
2019-08-08 21:51                         ` [PATCH v3 0/2] " rjw
2019-08-08 21:55                           ` [PATCH v3 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled() rjw
2019-08-09  4:50                             ` helgaas
2019-08-09  8:00                               ` rafael
2019-10-07 22:34                             ` Bjorn Helgaas
2019-10-08  9:27                               ` Rafael J. Wysocki
2019-10-08 21:16                                 ` Bjorn Helgaas
2019-10-08 22:54                                   ` Rafael J. Wysocki
2019-10-09 12:49                                     ` Bjorn Helgaas
2019-08-08 21:58                           ` [PATCH v3 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled rjw
2019-08-08 22:13                           ` [PATCH v3 0/2] " kbusch
2019-08-09  8:05                             ` rafael
2019-08-09 14:52                               ` kbusch
2019-07-25 16:59   ` [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems rafael
2019-07-25 14:52 ` kbusch
2019-07-25 19:48   ` rjw
2019-07-25 19:52     ` kbusch
2019-07-25 20:02       ` rjw
2019-07-26 14:02         ` kai.heng.feng
2019-07-27 12:55           ` rafael
2019-07-29 15:51             ` Mario.Limonciello
2019-07-29 22:05               ` rafael
