From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-0.9 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 44A1DC04AB2 for ; Thu, 9 May 2019 09:42:41 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 140FF2084E for ; Thu, 9 May 2019 09:42:41 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726711AbfEIJmk (ORCPT ); Thu, 9 May 2019 05:42:40 -0400 Received: from youngberry.canonical.com ([91.189.89.112]:38413 "EHLO youngberry.canonical.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725826AbfEIJmj (ORCPT ); Thu, 9 May 2019 05:42:39 -0400 Received: from mail-pf1-f199.google.com ([209.85.210.199]) by youngberry.canonical.com with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.76) (envelope-from ) id 1hOfZU-0005ub-0q for linux-kernel@vger.kernel.org; Thu, 09 May 2019 09:42:36 +0000 Received: by mail-pf1-f199.google.com with SMTP id n3so1243120pff.4 for ; Thu, 09 May 2019 02:42:35 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=vCyQ60BE6nSCa7adNn8wGWtGspW5wwW93oaMgXkndqY=; b=CaIIUrcFU6dj8gCqHgiPBtB55grJNqxZ3hT6goAHINwA42Hz0xCNK8wvZhyxg51RmO bDOsunc4MhijtKcRTaqYWFHZjR22aH2DkaHQEIewmFqhQ07YeRh0QgSbZ3ZorDx5FYlg 4+7zR8iZn+4e8w7vNSZ6ehPXYI2xvtYsMMHYaXdUqcI0FiRVsoi2iL3AdwRi037VOC2u IFku3SAEg2obRv4NbidXmY3vOvFhQi9c4kKe9AfgWDKdSRC2owVNlZkG+18Vr581SJ1B /ERmUaiciUkNtAtzNhthLzqN2p/kUPaLvg+espDiOaA+bl7cCilX6dOZWyIJ+OmKSQIp hy9w== X-Gm-Message-State: APjAAAWoCNqAHb1NM7LBZGK/CoH4yLfmvlDSU8YXDtw39o8JoSNVnDTA V2yvd1p4CvFTIbRbVWe86HPWjoMVE+bKlqFecOKO1LgkUszcFfNsdLfBGi1JvCDfd3Md/FPTS7v cqF9OG7ZaGFy7MAc5wkuGsNdyv5LyP6j8shbytMZ9iA== X-Received: by 2002:a63:10c:: with SMTP id 12mr4207300pgb.276.1557394954727; Thu, 09 May 2019 02:42:34 -0700 (PDT) X-Google-Smtp-Source: APXvYqx4tmcD33XpLhEuyhpoUaGX5iEQ6p4zYrtvOSZ/IdC6tLif2k7cQ2EI0mQEiPkOT1lAI1J9Ew== X-Received: by 2002:a63:10c:: with SMTP id 12mr4207277pgb.276.1557394954398; Thu, 09 May 2019 02:42:34 -0700 (PDT) Received: from 2001-b011-380f-14b9-f0ba-4a15-3e79-97f9.dynamic-ip6.hinet.net (2001-b011-380f-14b9-f0ba-4a15-3e79-97f9.dynamic-ip6.hinet.net. [2001:b011:380f:14b9:f0ba:4a15:3e79:97f9]) by smtp.gmail.com with ESMTPSA id f87sm3062680pff.56.2019.05.09.02.42.31 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 09 May 2019 02:42:33 -0700 (PDT) Content-Type: text/plain; charset=utf-8; delsp=yes; format=flowed Mime-Version: 1.0 (Mac OS X Mail 12.4 \(3445.104.8\)) Subject: Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle From: Kai-Heng Feng In-Reply-To: Date: Thu, 9 May 2019 17:42:30 +0800 Cc: Christoph Hellwig , Rafael Wysocki , Mario Limonciello , Keith Busch , Keith Busch , Jens Axboe , Sagi Grimberg , linux-nvme , Linux PM , LKML Content-Transfer-Encoding: 8bit Message-Id: References: <20190508185955.11406-1-kai.heng.feng@canonical.com> <20190508191624.GA8365@localhost.localdomain> <3CDA9F13-B17C-456F-8CE1-3A63C6E0DC8F@canonical.com> <20190508195159.GA1530@lst.de> <20190509061237.GA15229@lst.de> <064701C3-2BD4-4D93-891D-B7FBB5040FC4@canonical.com> To: "Rafael J. Wysocki" X-Mailer: Apple Mail (2.3445.104.8) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org at 17:07, Rafael J. Wysocki wrote: > On Thu, May 9, 2019 at 8:49 AM Kai-Heng Feng > wrote: >> Cc Rafael and linux-pm > > I would have been much more useful to CC the patch to linux-pm at > least from the outset. > >> at 14:12, Christoph Hellwig wrote: >> >>> On Wed, May 08, 2019 at 08:28:30PM +0000, Mario.Limonciello@dell.com >>> wrote: >>>> You might think this would be adding runtime_suspend/runtime_resume >>>> callbacks, but those also get called actually at runtime which is not >>>> the goal here. At runtime, these types of disks should rely on APST >>>> which >>>> should calculate the appropriate latencies around the different power >>>> states. >>>> >>>> This code path is only applicable in the suspend to idle state, which >>>> /does/ >>>> call suspend/resume functions associated with dev_pm_ops. There isn't >>>> a dedicated function in there for use only in suspend to idle, which is >>>> why pm_suspend_via_s2idle() needs to get called. >>> >>> The problem is that it also gets called for others paths: >>> >>> #ifdef CONFIG_PM_SLEEP >>> #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \ >>> .suspend = suspend_fn, \ >>> .resume = resume_fn, \ >>> .freeze = suspend_fn, \ >>> .thaw = resume_fn, \ >>> .poweroff = suspend_fn, \ >>> .restore = resume_fn, >>> #else >>> else >>> #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) >>> #endif >>> >>> #define SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \ >>> const struct dev_pm_ops name = { \ >>> SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \ >>> } >>> >>> And at least for poweroff this new code seems completely wrong, even >>> for freeze it looks rather borderline. >> >> Not really, for hibernation pm_suspend_via_s2idle() evaluates to false so >> the old code path will be taken. >> >>> And more to the points - if these "modern MS standby" systems are >>> becoming common, which it looks they are, we need support in the PM core >>> for those instead of working around the decisions in low-level drivers. >> >> Rafael, what do you think about this? > > The difference between suspend-to-idle and suspend-to-RAM (S3) boils > down to the fact that at the end of S3 suspend all control of the > system is passed to the platform firmware. Then, the firmware can > take care of some things that may not be taken care of by drivers (it > sometimes assumes that drivers will not take care of those things even > which is generally problematic). > > For suspend-to-idle the final physical state of the system should (in > theory) be the same as the deepest possible physical idle state of it > achievable through working-state PM (combination of PM-runtime and > cpuidle, roughly speaking). However, in practice the working-state PM > may not even be able to get there, as it generally requires many > things to happen exactly at the right time in a specific order and the > probability of that in the working-state PM situation is practically > 0. Suspend-to-idle helps here by quiescing everything in an ordered > fashion which makes all of the requisite conditions more likely to be > met together. > > So yes, from an individual driver perspective, the device handling for > s2idle should be more like for PM-runtime than for S3 (s2R), but this > really shouldn't matter (and it doesn't matter for the vast majority > of drivers). > > Unfortunately, the "modern MS standby" concept makes it matter, > because "modern MS standby" causes system-wide transitions to be > "special" and it appears to expect drivers to take care of the "extra > bit" that would have been taken care of by the platform firmware in > the S3 case. [Note that in the Windows world the "modern MS standby" > systems don't support S3 ("modern MS standby" and S3 support are > mutually exclusive in Windows AFAICS) while Linux needs to support S3 > and is expected to achieve the minimum power state through s2idle > (generally, even on the same platform) at the same time.] > >> Including this patch, there are five drivers that use >> pm_suspend_via_{firmware,s2idle}() to differentiate between S2I and S3. > > Well, that is not a large number relative to the total number of > drivers in Linux. That’s right, but I think we are going to see more of similar cases. > >> So I think maybe it’s time to introduce a new suspend callback for S2I? > > That would be a set of 6 new suspend and resume callbacks, mind you, > and there's quite a few of them already. And the majority of drivers > would not need to use them anyway. I think suspend_to_idle() and resume_from_idle() should be enough? What are other 4 callbacks? > > Also, please note that, possibly apart from the device power state > setting, the S2I and S2R handling really aren't that different at all. > You basically need to carry out the same preparations during suspend > and reverse them during resume in both cases. But for this case, it’s quite different to the original suspend and resume callbacks. > > That said I admit that there are cases in which device drivers need to > know that the system-wide transition under way is into s2idle and so > they should do extra stuff. If pm_suspend_via_firmware() is not > sufficient for that, then I'm open to other suggestions, but > introducing a new set of callbacks for that alone would be rather > excessive IMO. From drivers’ perspective nothing changed, as PM core can prioritize suspend_to_idle() over suspend() when it’s actually S2I. > >>>> SIMPLE_DEV_PM_OPS normally sets the same function for suspend and >>>> freeze (hibernate), so to avoid any changes to the hibernate case it >>>> seems >>>> to me that there needs to be a new nvme_freeze() that calls into the >>>> existing >>>> nvme_dev_disable for the freeze pm op and nvme_thaw() that calls into >>>> the >>>> existing nvme_reset_ctrl for the thaw pm op. >>> >>> At least, yes. >> >> Hibernation should remain the same as stated above. > > Depending on what check is used in that code path. > pm_suspend_via_s2idle() will return "true" in the hibernation path > too, for one. You are right, I should use !pm_suspend_via_firmware() instead. > >>>>> enterprise class NVMe devices >>>>> that don't do APST and don't really do different power states at >>>>> all in many cases. >>>> >>>> Enterprise class NVMe devices that don't do APST - do they typically >>>> have a non-zero value for ndev->ctrl.npss? >>>> >>>> If not, they wouldn't enter this new codepath even if the server entered >>>> into S2I. >>> >>> No, devices that do set NPSS will have at least some power states >>> per definition, although they might not be too useful. I suspect >>> checking >>> APSTA might be safer, but if we don't want to rely on APST we should >>> check for a power state supporting the condition that the MS document >>> quoted in the original document supports. >> >> If Modern Standby or Connected Standby is not supported by servers, I >> don’t >> think the design documents mean much here. >> We probably should check if the platform firmware really supports S2I >> instead. > > S2I is expected to work regardless of the platform firmware and there > is nothing like "platform firmware support for S2I". IOW, that check > would always return "false". > > What you really need to know is if the given particular transition is S2I. Maybe a helper based on FADT flag and _DSM can do this thing? Kai-Heng