linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Rafael J. Wysocki" <rafael@kernel.org>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>,
	"Rafael J. Wysocki" <rjw@rjwysocki.net>,
	linux-nvme <linux-nvme@lists.infradead.org>,
	Keith Busch <kbusch@kernel.org>,
	Mario Limonciello <Mario.Limonciello@dell.com>,
	Kai-Heng Feng <kai.heng.feng@canonical.com>,
	Keith Busch <keith.busch@intel.com>,
	Christoph Hellwig <hch@lst.de>, Sagi Grimberg <sagi@grimberg.me>,
	Linux PM <linux-pm@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Rajat Jain <rajatja@google.com>,
	Linux PCI <linux-pci@vger.kernel.org>
Subject: Re: [PATCH v2 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
Date: Fri, 9 Aug 2019 10:04:02 +0200	[thread overview]
Message-ID: <CAJZ5v0iWOQtp_m-KuToMuTFGtmmHrSptSfMpjuMpWYxd60DfNg@mail.gmail.com> (raw)
In-Reply-To: <20190809044713.GD7302@google.com>

On Fri, Aug 9, 2019 at 6:47 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, Aug 08, 2019 at 10:41:56PM +0200, Rafael J. Wysocki wrote:
> > On Thu, Aug 8, 2019, 20:39 Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Thu, Aug 08, 2019 at 04:47:45PM +0200, Rafael J. Wysocki wrote:
> > > > On Thu, Aug 8, 2019 at 3:43 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Thu, Aug 08, 2019 at 12:10:06PM +0200, Rafael J. Wysocki wrote:
> > > > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > > >
> > > > > > One of the modifications made by commit d916b1be94b6 ("nvme-pci: use
> > > > > > host managed power state for suspend") was adding a pci_save_state()
> > > > > > call to nvme_suspend() in order to prevent the PCI bus-level PM from
> > > > > > being applied to the suspended NVMe devices, but if ASPM is not
> > > > > > enabled for the target NVMe device, that causes its PCIe link to stay
> > > > > > up and the platform may not be able to get into its optimum low-power
> > > > > > state because of that.
> > > > > >
> > > > > > For example, if ASPM is disabled for the NVMe drive (PC401 NVMe SK
> > > > > > hynix 256GB) in my Dell XPS13 9380, leaving it in D0 during
> > > > > > suspend-to-idle prevents the SoC from reaching package idle states
> > > > > > deeper than PC3, which is way insufficient for system suspend.
> > > > >
> > > > > Just curious: I assume the SoC you reference is some part of the NVMe
> > > > > drive?
> > > >
> > > > No, the SoC is what contains the Intel processor and PCH (formerly "chipset").
> > > >
> > > > > > To address this shortcoming, make nvme_suspend() check if ASPM is
> > > > > > enabled for the target device and fall back to full device shutdown
> > > > > > and PCI bus-level PM if that is not the case.
> > > > > >
> > > > > > Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for suspend")
> > > > > > Link: https://lore.kernel.org/linux-pm/2763495.NmdaWeg79L@kreacher/T/#t
> > > > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > > > ---
> > > > > >
> > > > > > -> v2:
> > > > > >   * Move the PCI/PCIe ASPM changes to a separate patch.
> > > > > >   * Do not add a redundant ndev->last_ps == U32_MAX check in nvme_suspend().
> > > > > >
> > > > > > ---
> > > > > >  drivers/nvme/host/pci.c |   13 ++++++++++---
> > > > > >  1 file changed, 10 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > Index: linux-pm/drivers/nvme/host/pci.c
> > > > > > ===================================================================
> > > > > > --- linux-pm.orig/drivers/nvme/host/pci.c
> > > > > > +++ linux-pm/drivers/nvme/host/pci.c
> > > > > > @@ -2846,7 +2846,7 @@ static int nvme_resume(struct device *de
> > > > > >       struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));
> > > > > >       struct nvme_ctrl *ctrl = &ndev->ctrl;
> > > > > >
> > > > > > -     if (pm_resume_via_firmware() || !ctrl->npss ||
> > > > > > +     if (ndev->last_ps == U32_MAX ||
> > > > > >           nvme_set_power_state(ctrl, ndev->last_ps) != 0)
> > > > > >               nvme_reset_ctrl(ctrl);
> > > > > >       return 0;
> > > > > > @@ -2859,6 +2859,8 @@ static int nvme_suspend(struct device *d
> > > > > >       struct nvme_ctrl *ctrl = &ndev->ctrl;
> > > > > >       int ret = -EBUSY;
> > > > > >
> > > > > > +     ndev->last_ps = U32_MAX;
> > > > > > +
> > > > > >       /*
> > > > > >        * The platform does not remove power for a kernel managed suspend so
> > > > > >        * use host managed nvme power settings for lowest idle power if
> > > > > > @@ -2866,8 +2868,14 @@ static int nvme_suspend(struct device *d
> > > > > >        * shutdown.  But if the firmware is involved after the suspend or the
> > > > > >        * device does not support any non-default power states, shut down the
> > > > > >        * device fully.
> > > > > > +      *
> > > > > > +      * If ASPM is not enabled for the device, shut down the device and allow
> > > > > > +      * the PCI bus layer to put it into D3 in order to take the PCIe link
> > > > > > +      * down, so as to allow the platform to achieve its minimum low-power
> > > > > > +      * state (which may not be possible if the link is up).
> > > > > >        */
> > > > > > -     if (pm_suspend_via_firmware() || !ctrl->npss) {
> > > > > > +     if (pm_suspend_via_firmware() || !ctrl->npss ||
> > > > > > +         !pcie_aspm_enabled_mask(pdev)) {
> > > > >
> > > > > This seems like a layering violation, in the sense that ASPM is
> > > > > supposed to be hardware-autonomous and invisible to software.
> > > >
> > > > But software has to enable it.
> > > >
> > > > If it is not enabled, it will not be used, and that's what the check
> > > > is about.
> > > >
> > > > > IIUC the NVMe device will go to the desired package idle state if
> > > > > the link is in L0s or L1, but not if the link is in L0.  I don't
> > > > > understand that connection; AFAIK that would be something outside
> > > > > the scope of the PCIe spec.
> > > >
> > > > Yes, it is outside of the PCIe spec.
> > > >
> > > > No, this is not about the NVMe device, it is about the Intel SoC
> > > > (System-on-a-Chip) the platform is based on.
> > >
> > > Ah.  So this problem could occur with any device, not just NVMe?  If
> > > so, how do you address that?  Obviously you don't want to patch all
> > > drivers this way.
> >
> > It could, if the device was left in D0 during suspend, but drivers
> > don't let devices stay in D0 during suspend as a rule, so this is all
> > academic, except for the NVMe driver that has just started to do it in
> > 5.3-rc1.
> >
> > It has started to do that becasuse of what can be regarded as a
> > hardware issue, but this does not even matter here.
> >
> > > > The background really is commit d916b1be94b6 and its changelog is
> > > > kind of misleading, unfortunately.  What it did, among other things,
> > > > was to cause the NVMe driver to prevent the PCI bus type from
> > > > applying the standard PCI PM to the devices handled by it in the
> > > > suspend-to-idle flow.
> > >
> > > This is more meaningful to you than to most people because "applying
> > > the standard PCI PM" doesn't tell us what that means in terms of the
> > > device.  Presumably it has something to do with a D-state transition?
> > > I *assume* a suspend might involve the D0 -> D3hot transition you
> > > mention below?
> >
> > By "standard PCI PM" I mean what pci_prepare_to_sleep() does. And yes,
> > in the vast majority of cases the device goes from D0 to D3hot then.
> >
> > > > The reason for doing that was a (reportedly) widespread failure to
> > > > take the PCIe link down during D0 -> D3hot transitions of NVMe
> > > > devices,
> > >
> > > I don't know any of the details, but "failure to take the link down
> > > during D0 -> D3hot transitions" is phrased as though it might be a
> > > hardware erratum.  If this *is* related to an NVMe erratum, that would
> > > explain why you only need to patch the nvme driver, and it would be
> > > useful to mention that in the commit log, since otherwise it sounds
> > > like something that might be needed in other drivers, too.
> >
> > Yes, that can be considered as an NVMe erratum and the NVMe driver has
> > been *already* patched because of that in 5.3-rc1. [That's the commit
> > mentioned in the changelog of the $subject patch.]
> >
> > It effectively asks the PCI bus type to leave *all* devices handled by
> > it in D0 during suspend-to-idle.  Already today.
> >
> > I hope that this clarifies the current situation. :-)
> >
> > > According to PCIe r5.0 sec 5.3.2, the only legal link states for D3hot
> > > are L1, L2/L3 Ready.  So if you put a device in D3hot and its link
> > > stays in L0, that sounds like a defect.  Is that what happens?
> >
> > For some devices that's what happens. For some other devices the state
> > of the link in D3hot appears to be L1 or L2/L3 Ready (as per the spec)
> > and that's when the $subject patch makes a difference.
> > ...
>
> > Now, say an NVMe device works in accordance with the spec, so when it
> > goes from D0 to D3hot, its PCIe link goes into L1 or L2/L3 Ready.  As
> > of 5.3-rc1 or later it will be left in D0 during suspend-to-idle
> > (because that's how the NVMe driver works), so its link state will
> > depend on whether or not ASPM is enabled for it.  If ASPM is enabled
> > for it, the final state of its link will depend on how deep ASPM is
> > allowed to go, but if ASPM is not enabled for it, its link will remain
> > in L0.
> >
> > This means, however, that by allowing that device to go into D3hot
> > when ASPM is not enabled for it, the energy used by the system while
> > suspended can be reduced, because the PCIe link of the device will
> > then go to L1 or L2/L3 Ready.  That's exactly what the $subject patch
> > does.
> >
> > Is this still not convincing enough?
>
> It's not a matter of being convincing, it's a matter of dissecting and
> analyzing this far enough so it makes sense to someone who hasn't
> debugged the problem.  Since we're talking about ASPM being enabled,
> that really means making the connections to specific PCIe situations.
>
> I'm not the nvme maintainer, so my only interest in this is that it
> was really hard for me to figure out how pcie_aspm_enabled() is
> related to pm_suspend_via_firmware() and ctrl->npss.

Fair enough.

> But I think it has finally percolated through.  Here's my
> understanding; see it has any connection with reality:
>
>   Prior to d916b1be94b6 ("nvme-pci: use host managed power state for
>   suspend"), suspend always put the NVMe device in D3hot.

Right.

>   After d916b1be94b6, when it's possible, suspend keeps the NVMe
>   device in D0 and uses NVMe-specific power settings because it's
>   faster to change those than to do D0 -> D3hot -> D0 transitions.
>
>   When it's not possible (either the device doesn't support
>   NVMe-specific power settings or platform firmware has to be
>   involved), we use D3hot as before.

Right.

>   So now we have these three cases for suspending an NVMe device:
>
>     1  D0 + no ASPM + NVMe power setting
>     2  D0 +    ASPM + NVMe power setting
>     3  D3hot
>
>   Prior to d916b1be94b6, we always used case 3.  After d916b1be94b6,
>   we used case 1 or 2 whenever possible (we didn't know which).  Case
>   2 seemed acceptable, but the power consumption in case 1 was too
>   high.

That's correct.

>   This patch ("nvme-pci: Allow PCI bus-level PM to be used if ASPM is
>   disabled") would replace case 1 with case 3 to reduce power
>   consumption.

Right.

> AFAICT we don't have a way to compute the relative power consumption
> of these cases.  It's possible that even case 2 would use more power
> than case 3.  You can empirically determine that this patch makes the
> right trade-offs for the controllers you care about, but I don't think
> it's clear that this will *always* be the case, so in that sense I
> think pcie_aspm_enabled() is being used as part of a heuristic.

Fair enough.

Cheers,
Rafael

  reply	other threads:[~2019-08-09  8:04 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-07-25  9:51 [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems Rafael J. Wysocki
2019-07-25 14:02 ` Kai-Heng Feng
2019-07-25 16:23   ` Mario.Limonciello
2019-07-25 17:03     ` Rafael J. Wysocki
2019-07-25 17:23       ` Mario.Limonciello
2019-07-25 18:20       ` Kai-Heng Feng
2019-07-25 19:09         ` Mario.Limonciello
2019-07-30 10:45       ` Rafael J. Wysocki
2019-07-30 14:41         ` Keith Busch
2019-07-30 17:14           ` Mario.Limonciello
2019-07-30 18:50             ` Kai-Heng Feng
2019-07-30 19:19               ` Keith Busch
2019-07-30 21:05                 ` Mario.Limonciello
2019-07-30 21:31                   ` Keith Busch
2019-07-31 21:25                     ` Rafael J. Wysocki
2019-07-31 22:19                       ` Keith Busch
2019-07-31 22:33                         ` Rafael J. Wysocki
2019-08-01  9:05                           ` Kai-Heng Feng
2019-08-01 17:29                             ` Rafael J. Wysocki
2019-08-01 19:05                               ` Mario.Limonciello
2019-08-01 22:26                                 ` Rafael J. Wysocki
2019-08-02 10:55                                   ` Kai-Heng Feng
2019-08-02 11:04                                     ` Rafael J. Wysocki
2019-08-05 19:13                                       ` Kai-Heng Feng
2019-08-05 21:28                                         ` Rafael J. Wysocki
2019-08-06 14:02                                           ` Mario.Limonciello
2019-08-06 15:00                                             ` Rafael J. Wysocki
2019-08-07 10:29                                               ` Rafael J. Wysocki
2019-08-01 20:22                             ` Keith Busch
2019-08-07  9:48                         ` Rafael J. Wysocki
2019-08-07 10:45                           ` Christoph Hellwig
2019-08-07 10:54                             ` Rafael J. Wysocki
2019-08-07  9:53                         ` [PATCH] nvme-pci: Do not prevent PCI bus-level PM from being used Rafael J. Wysocki
2019-08-07 10:14                           ` Rafael J. Wysocki
2019-08-07 10:43                           ` Christoph Hellwig
2019-08-07 14:37                           ` Keith Busch
2019-08-08  8:36                         ` [PATCH] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled Rafael J. Wysocki
2019-08-08  8:48                           ` Christoph Hellwig
2019-08-08  9:06                             ` Rafael J. Wysocki
2019-08-08 10:03                         ` [PATCH v2 0/2] " Rafael J. Wysocki
2019-08-08 10:06                           ` [PATCH v2 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled_mask() Rafael J. Wysocki
2019-08-08 13:15                             ` Bjorn Helgaas
2019-08-08 14:48                               ` Rafael J. Wysocki
2019-08-08 10:10                           ` [PATCH v2 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled Rafael J. Wysocki
2019-08-08 13:43                             ` Bjorn Helgaas
2019-08-08 14:47                               ` Rafael J. Wysocki
2019-08-08 17:06                                 ` Rafael J. Wysocki
2019-08-08 18:39                                 ` Bjorn Helgaas
2019-08-08 20:01                                   ` Keith Busch
2019-08-08 20:05                                   ` Mario.Limonciello
2019-08-08 20:41                                   ` Rafael J. Wysocki
2019-08-09  4:47                                     ` Bjorn Helgaas
2019-08-09  8:04                                       ` Rafael J. Wysocki [this message]
2019-08-08 21:51                         ` [PATCH v3 0/2] " Rafael J. Wysocki
2019-08-08 21:55                           ` [PATCH v3 1/2] PCI: PCIe: ASPM: Introduce pcie_aspm_enabled() Rafael J. Wysocki
2019-08-09  4:50                             ` Bjorn Helgaas
2019-08-09  8:00                               ` Rafael J. Wysocki
2019-10-07 22:34                             ` Bjorn Helgaas
2019-10-08  9:27                               ` Rafael J. Wysocki
2019-10-08 21:16                                 ` Bjorn Helgaas
2019-10-08 22:54                                   ` Rafael J. Wysocki
2019-10-09 12:49                                     ` Bjorn Helgaas
2019-08-08 21:58                           ` [PATCH v3 2/2] nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled Rafael J. Wysocki
2019-08-08 22:13                           ` [PATCH v3 0/2] " Keith Busch
2019-08-09  8:05                             ` Rafael J. Wysocki
2019-08-09 14:52                               ` Keith Busch
2019-07-25 16:59   ` [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems Rafael J. Wysocki
2019-07-25 14:52 ` Keith Busch
2019-07-25 19:48   ` Rafael J. Wysocki
2019-07-25 19:52     ` Keith Busch
2019-07-25 20:02       ` Rafael J. Wysocki
2019-07-26 14:02         ` Kai-Heng Feng
2019-07-27 12:55           ` Rafael J. Wysocki
2019-07-29 15:51             ` Mario.Limonciello
2019-07-29 22:05               ` Rafael J. Wysocki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAJZ5v0iWOQtp_m-KuToMuTFGtmmHrSptSfMpjuMpWYxd60DfNg@mail.gmail.com \
    --to=rafael@kernel.org \
    --cc=Mario.Limonciello@dell.com \
    --cc=hch@lst.de \
    --cc=helgaas@kernel.org \
    --cc=kai.heng.feng@canonical.com \
    --cc=kbusch@kernel.org \
    --cc=keith.busch@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=rajatja@google.com \
    --cc=rjw@rjwysocki.net \
    --cc=sagi@grimberg.me \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).