From: luto@kernel.org (Andy Lutomirski)
Subject: NVMe APST high latency power states being skipped
Date: Tue, 23 May 2017 13:01:46 -0700
Message-ID: <CALCETrWNU8uquoDiNDeYaoOUyMOAVyWQ73GfrEeWntTcm4ioAw@mail.gmail.com>
In-Reply-To: <84ea5d8fbf0446c58c7b6c85d3868d94@ausx13mpc120.AMER.DELL.COM>

On Tue, May 23, 2017 at 12:56 PM, <Mario.Limonciello@dell.com> wrote:
>> -----Original Message-----
>> From: Andy Lutomirski [mailto:luto at kernel.org]
>> Sent: Tuesday, May 23, 2017 2:35 PM
>> To: Kai-Heng Feng <kai.heng.feng at canonical.com>
>> Cc: Christoph Hellwig <hch at infradead.org>; Andrew Lutomirski
>> <luto at kernel.org>; linux-nvme <linux-nvme at lists.infradead.org>; Limonciello,
>> Mario <Mario_Limonciello at Dell.com>
>> Subject: Re: NVMe APST high latency power states being skipped
>>
>> On Tue, May 23, 2017 at 1:06 AM, Kai-Heng Feng
>> <kai.heng.feng@canonical.com> wrote:
>> > On Tue, May 23, 2017 at 3:17 PM, Christoph Hellwig <hch at infradead.org>
>> wrote:
>> >> On Mon, May 22, 2017 at 05:04:15PM +0800, Kai-Heng Feng wrote:
>> >>> Hi Andy,
>> >>>
>> >>> Currently, if a power state transition requires high latency, it may be
>> >>> skipped [1] based on the value of ps_max_latency_us in
>> >>> nvme_configure_apst():
>> >>>
>> >>> if (total_latency_us > ctrl->ps_max_latency_us)
>> >>>     continue;
>> >>>
>> >>> Right now ps_max_latency_us defaults to 25000, but some consumer level
>> >>> NVMe devices have much higher latency.
>> >>> I understand this value is configurable, but I am wondering if it's
>> >>> possible to ignore the latency on consumer devices, probably based on
>> >>> chassis type, so consumer devices can get most NVMe power saving out
>> >>> of the box?
>> >>
>> >> What is your proposed change?
>> >
>> > Ignore the latency limit if it's a mobile device, based on DMI chassis type.
>> > I can write a patch for that.
>> >
>> >> Do you have any numbers on how this
>> >> improves power consumption for given workloads and what the performance
>> >> impact is on common benchmarks?
>> >
>> > A SanDisk NVMe has entry latency 1,000,000 and exit latency 100,000.
>> > The default latency limit (25000) does not allow this device to enter
>> > a non-operational state. The system power consumption is around 13W.
>> > Allowing this SanDisk device to enter PS4 brings system power
>> > consumption down to roughly 8W.
>> > The 5W difference is quite good.
>>
>> Can you send the actual 'nvme id-ctrl' output?
>>
>
> I happen to have the output of this disk from another email I'm on so
> I'll share it while it's night for Kai-Heng.  Several disks mentioned there
> have this same concern; here are three of them at the end of this email.
>
>> I suspect that something is screwy here.  This is an entry latency of
>> 1 second and an exit latency of 100ms.  This is *atrocious*.  I don't
>> care what kind of mobile device this is -- making it unresponsive for
>> 1.1 seconds for the round trip will be quite noticeable.  And, with an
>> RSTe-like policy, that's 100 *seconds* of delay before going fully to
>> sleep.  Also, a 5W power difference between deep sleep and less deep
>> sleep is bizarrely large.  The NVMe device shouldn't take 5W of
>> power when idle even in the max-power operational state.
>>
>
> There are some configurations that have multiple NVMe disks.
> For example the Precision 7520 can have up to 3.
>
> NVME Identify Controller:
...
> mn      : A400 NVMe SanDisk 512GB
...
> ps    0 : mp:8.25W operational enlat:0 exlat:0 rrt:0 rrl:0
>           rwt:0 rwl:0 idle_power:- active_power:5.30W
> ps    1 : mp:8.25W operational enlat:0 exlat:0 rrt:1 rrl:1
>           rwt:1 rwl:1 idle_power:- active_power:3.30W
> ps    2 : mp:8.25W operational enlat:0 exlat:0 rrt:2 rrl:2
>           rwt:2 rwl:2 idle_power:- active_power:3.30W
> ps    3 : mp:0.0500W non-operational enlat:51000 exlat:10000 rrt:0 rrl:0
>           rwt:0 rwl:0 idle_power:- active_power:-
> ps    4 : mp:0.0055W non-operational enlat:1000000 exlat:100000 rrt:0 rrl:0
>           rwt:0 rwl:0 idle_power:- active_power:-
>

44.5mW saved and totally crazy latency.
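
As a rough sketch of why this drive never reaches a non-operational
state with the default limit, the check quoted earlier can be replayed
against the two non-operational entries above.  This assumes that
total_latency_us is enlat plus exlat, both in microseconds; the struct
and the helper names (ps_entry, ps_allowed, count_allowed) are made up
for illustration and are not the driver's own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One non-operational power state as printed by 'nvme id-ctrl'. */
struct ps_entry {
	int state;
	uint32_t enlat_us;	/* entry latency, microseconds */
	uint32_t exlat_us;	/* exit latency, microseconds */
};

/* Assumption: the limit is compared against enlat + exlat. */
static bool ps_allowed(const struct ps_entry *ps, uint64_t max_latency_us)
{
	uint64_t total_latency_us = (uint64_t)ps->enlat_us + ps->exlat_us;

	return total_latency_us <= max_latency_us;
}

/* Count how many of the given states survive the latency filter. */
static int count_allowed(const struct ps_entry *ps, int n,
			 uint64_t max_latency_us)
{
	int i, allowed = 0;

	assert(n >= 0);
	for (i = 0; i < n; i++)
		if (ps_allowed(&ps[i], max_latency_us))
			allowed++;
	return allowed;
}

/* Non-operational states from the SanDisk A400 listing above. */
static const struct ps_entry sandisk_a400[] = {
	{ 3,   51000,  10000 },
	{ 4, 1000000, 100000 },
};
```

Under that assumption, count_allowed(sandisk_a400, 2, 25000) comes out
as zero: PS3 totals 61000 us and PS4 totals 1100000 us, so both exceed
the 25000 default and are skipped.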

>
> NVME Identify Controller:
...
> mn      : THNSF5512GPUK NVMe SED TOSHIBA 512GB
...
> ps    0 : mp:6.00W operational enlat:0 exlat:0 rrt:0 rrl:0
>           rwt:0 rwl:0 idle_power:- active_power:-
> ps    1 : mp:2.40W operational enlat:0 exlat:0 rrt:1 rrl:1
>           rwt:1 rwl:1 idle_power:- active_power:-
> ps    2 : mp:1.90W operational enlat:0 exlat:0 rrt:2 rrl:2
>           rwt:2 rwl:2 idle_power:- active_power:-
> ps    3 : mp:0.0120W non-operational enlat:5000 exlat:25000 rrt:3 rrl:3
>           rwt:3 rwl:3 idle_power:- active_power:-
> ps    4 : mp:0.0060W non-operational enlat:100000 exlat:70000 rrt:4 rrl:4
>           rwt:4 rwl:4 idle_power:- active_power:-

6 mW saved and still fairly crazy latency.  70ms means you drop a couple frames.
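
To put the frame remark in numbers, here is a small sketch that
converts a stall into whole display refresh intervals.  The 60 Hz
refresh rate used in the note below is an assumption, not something
stated in the thread, and frames_spanned is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Whole refresh intervals that fit inside a stall of latency_us. */
static uint64_t frames_spanned(uint64_t latency_us, uint32_t refresh_hz)
{
	assert(refresh_hz > 0);
	/* latency_us / (1000000 / refresh_hz), kept in integer math */
	return latency_us * refresh_hz / 1000000;
}
```

With an assumed 60 Hz display, frames_spanned(70000, 60) gives 4 whole
refresh intervals for the 70ms PS4 exit latency, if I/O on the render
path has to wait for the drive to wake up.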

>
>
> NVME Identify Controller:
...
> mn      : CX2-GB1024-Q11 NVMe LITEON 1024GB
...
> ps    0 : mp:8.00W operational enlat:0 exlat:0 rrt:0 rrl:0
>           rwt:0 rwl:0 idle_power:- active_power:-
> ps    1 : mp:4.00W operational enlat:5 exlat:5 rrt:1 rrl:1
>           rwt:1 rwl:1 idle_power:- active_power:-
> ps    2 : mp:2.10W operational enlat:5 exlat:5 rrt:2 rrl:2
>           rwt:2 rwl:2 idle_power:- active_power:-
> ps    3 : mp:0.1000W non-operational enlat:5000 exlat:5000 rrt:3 rrl:3
>           rwt:3 rwl:3 idle_power:- active_power:-
> ps    4 : mp:0.0100W non-operational enlat:50000 exlat:100000 rrt:4 rrl:4
>           rwt:4 rwl:4 idle_power:- active_power:-

90mW saved and still 100ms latency.  Also, I didn't know that Lite-on
made disks.

I'm not convinced that there's any chassis type for which this type of
default makes sense.
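
For reference, the three savings figures above can be recomputed from
the listings.  This is only a sketch: it takes the mp values of PS3 and
PS4 at face value, stores them in microwatts to avoid floating point,
and uses a made-up struct and helper names.

```c
#include <assert.h>
#include <stdint.h>

/* PS3/PS4 figures copied from the three 'nvme id-ctrl' listings. */
struct deep_state {
	const char *model;
	uint32_t ps3_uw;	/* PS3 max power, microwatts */
	uint32_t ps4_uw;	/* PS4 max power, microwatts */
	uint32_t ps4_enlat_us;	/* PS4 entry latency, microseconds */
	uint32_t ps4_exlat_us;	/* PS4 exit latency, microseconds */
};

static const struct deep_state drives[] = {
	{ "A400 NVMe SanDisk 512GB",        50000,  5500, 1000000, 100000 },
	{ "THNSF5512GPUK NVMe SED TOSHIBA", 12000,  6000,  100000,  70000 },
	{ "CX2-GB1024-Q11 NVMe LITEON",    100000, 10000,   50000, 100000 },
};

/* Extra power saved by going from PS3 down to PS4. */
static uint32_t ps4_saving_uw(const struct deep_state *d)
{
	assert(d->ps3_uw >= d->ps4_uw);
	return d->ps3_uw - d->ps4_uw;
}

/* Entry plus exit latency paid for one trip into PS4 and back. */
static uint64_t ps4_round_trip_us(const struct deep_state *d)
{
	return (uint64_t)d->ps4_enlat_us + d->ps4_exlat_us;
}
```

That works out to 44500, 6000 and 90000 uW saved, against PS4 round
trips of 1100000, 170000 and 150000 us respectively.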

What would perhaps make sense is to have system-wide
performance-vs-power controls and to integrate NVMe power saving into
it, presumably through the pm_qos framework.  Or to export more
information to userspace and have a user tool that sets all this up
generically.
