* NVMe APST high latency power states being skipped
@ 2017-05-22  9:04 Kai-Heng Feng
  2017-05-23  7:17 ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Kai-Heng Feng @ 2017-05-22  9:04 UTC (permalink / raw)


Hi Andy,

Currently, if a power state transition requires high latency, it may be
skipped [1] based on the value of ps_max_latency_us in
nvme_configure_apst():

if (total_latency_us > ctrl->ps_max_latency_us)
    continue;

Right now ps_max_latency_us defaults to 25000, but some consumer-level
NVMe devices have much higher latencies.
I understand this value is configurable, but I am wondering if it's
possible to ignore the latency limit on consumer devices, probably based on
chassis type, so they can get the most NVMe power saving out
of the box?

Thanks.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/tree/drivers/nvme/host/core.c?h=nvme/power#n1396

^ permalink raw reply	[flat|nested] 21+ messages in thread

* NVMe APST high latency power states being skipped
  2017-05-22  9:04 NVMe APST high latency power states being skipped Kai-Heng Feng
@ 2017-05-23  7:17 ` Christoph Hellwig
  2017-05-23  8:06   ` Kai-Heng Feng
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2017-05-23  7:17 UTC (permalink / raw)


On Mon, May 22, 2017 at 05:04:15PM +0800, Kai-Heng Feng wrote:
> Hi Andy,
> 
> Currently, if a power state transition requires high latency, it may be
> skipped [1] based on the value of ps_max_latency_us in
> nvme_configure_apst():
> 
> if (total_latency_us > ctrl->ps_max_latency_us)
>     continue;
> 
> Right now ps_max_latency_us defaults to 25000, but some consumer-level
> NVMe devices have much higher latencies.
> I understand this value is configurable, but I am wondering if it's
> possible to ignore the latency limit on consumer devices, probably based on
> chassis type, so they can get the most NVMe power saving out
> of the box?

What is your proposed change?  Do you have any numbers on how this
improves power consumption for given workloads and what the performance
impact is on common benchmarks?


* NVMe APST high latency power states being skipped
  2017-05-23  7:17 ` Christoph Hellwig
@ 2017-05-23  8:06   ` Kai-Heng Feng
  2017-05-23  9:42     ` Christoph Hellwig
  2017-05-23 19:35     ` Andy Lutomirski
  0 siblings, 2 replies; 21+ messages in thread
From: Kai-Heng Feng @ 2017-05-23  8:06 UTC (permalink / raw)


On Tue, May 23, 2017 at 3:17 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Mon, May 22, 2017 at 05:04:15PM +0800, Kai-Heng Feng wrote:
>> Hi Andy,
>>
>> Currently, if a power state transition requires high latency, it may be
>> skipped [1] based on the value of ps_max_latency_us in
>> nvme_configure_apst():
>>
>> if (total_latency_us > ctrl->ps_max_latency_us)
>>     continue;
>>
>> Right now ps_max_latency_us defaults to 25000, but some consumer-level
>> NVMe devices have much higher latencies.
>> I understand this value is configurable, but I am wondering if it's
>> possible to ignore the latency limit on consumer devices, probably based on
>> chassis type, so they can get the most NVMe power saving out
>> of the box?
>
> What is your proposed change?

Ignore the latency limit if it's a mobile device, based on DMI chassis type.
I can write a patch for that.

> Do you have any numbers on how this
> improves power consumption for given workloads and what the performance
> impact is on common benchmarks?

A SanDisk NVMe device has an entry latency of 1,000,000 us and an exit
latency of 100,000 us. The default latency limit (25000) does not allow
this device to enter the non-operational state. The system power
consumption is around 13W.
Making this SanDisk device able to enter PS4 brings the system down to
roughly 8W power consumption.
The 5W difference is quite good.

I have no idea about the performance impact though. Is there any
benchmark that tests storage power-management latency?


* NVMe APST high latency power states being skipped
  2017-05-23  8:06   ` Kai-Heng Feng
@ 2017-05-23  9:42     ` Christoph Hellwig
  2017-05-23 19:35     ` Andy Lutomirski
  1 sibling, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2017-05-23  9:42 UTC (permalink / raw)


On Tue, May 23, 2017 at 04:06:27PM +0800, Kai-Heng Feng wrote:
> I have no idea about the performance impact though. Is there any
> benchmark that tests storage power-management latency?

Mostly I want to verify this using some normal workloads - e.g. kernel
compiles are always the kernel hackers' favourite, and some application
runs that are easily verifiable, e.g. video playback might be an
interesting one.


* NVMe APST high latency power states being skipped
  2017-05-23  8:06   ` Kai-Heng Feng
  2017-05-23  9:42     ` Christoph Hellwig
@ 2017-05-23 19:35     ` Andy Lutomirski
  2017-05-23 19:56       ` Mario.Limonciello
  1 sibling, 1 reply; 21+ messages in thread
From: Andy Lutomirski @ 2017-05-23 19:35 UTC (permalink / raw)


On Tue, May 23, 2017 at 1:06 AM, Kai-Heng Feng
<kai.heng.feng@canonical.com> wrote:
> On Tue, May 23, 2017 at 3:17 PM, Christoph Hellwig <hch@infradead.org> wrote:
>> On Mon, May 22, 2017 at 05:04:15PM +0800, Kai-Heng Feng wrote:
>>> Hi Andy,
>>>
>>> Currently, if a power state transition requires high latency, it may be
>>> skipped [1] based on the value of ps_max_latency_us in
>>> nvme_configure_apst():
>>>
>>> if (total_latency_us > ctrl->ps_max_latency_us)
>>>     continue;
>>>
>>> Right now ps_max_latency_us defaults to 25000, but some consumer-level
>>> NVMe devices have much higher latencies.
>>> I understand this value is configurable, but I am wondering if it's
>>> possible to ignore the latency limit on consumer devices, probably based on
>>> chassis type, so they can get the most NVMe power saving out
>>> of the box?
>>
>> What is your proposed change?
>
> Ignore the latency limit if it's a mobile device, based on DMI chassis type.
> I can write a patch for that.
>
>> Do you have any numbers on how this
>> improves power consumption for given workloads and what the performance
>> impact is on common benchmarks?
>
> A SanDisk NVMe device has an entry latency of 1,000,000 us and an exit
> latency of 100,000 us. The default latency limit (25000) does not allow
> this device to enter the non-operational state. The system power
> consumption is around 13W.
> Making this SanDisk device able to enter PS4 brings the system down to
> roughly 8W power consumption.
> The 5W difference is quite good.

Can you send the actual 'nvme id-ctrl' output?

I suspect that something is screwy here.  This is an entry latency of
1 second and an exit latency of 100ms.  This is *atrocious*.  I don't
care what kind of mobile device this is -- making it unresponsive for
1.1 seconds for the round trip will be quite noticeable.  And, with an
RSTe-like policy, that's 100 *seconds* of delay before going fully to
sleep.  Also, 5W power difference between deep sleep and less deep
sleep is also bizarrely large.  The NVMe device shouldn't take 5W of
power when idle even in the max-power operational state.

--Andy


* NVMe APST high latency power states being skipped
  2017-05-23 19:35     ` Andy Lutomirski
@ 2017-05-23 19:56       ` Mario.Limonciello
  2017-05-23 20:01         ` Andy Lutomirski
  0 siblings, 1 reply; 21+ messages in thread
From: Mario.Limonciello @ 2017-05-23 19:56 UTC (permalink / raw)


> -----Original Message-----
> From: Andy Lutomirski [mailto:luto at kernel.org]
> Sent: Tuesday, May 23, 2017 2:35 PM
> To: Kai-Heng Feng <kai.heng.feng at canonical.com>
> Cc: Christoph Hellwig <hch at infradead.org>; Andrew Lutomirski
> <luto at kernel.org>; linux-nvme <linux-nvme at lists.infradead.org>; Limonciello,
> Mario <Mario_Limonciello at Dell.com>
> Subject: Re: NVMe APST high latency power states being skipped
> 
> On Tue, May 23, 2017 at 1:06 AM, Kai-Heng Feng
> <kai.heng.feng@canonical.com> wrote:
> > On Tue, May 23, 2017 at 3:17 PM, Christoph Hellwig <hch at infradead.org>
> wrote:
> >> On Mon, May 22, 2017 at 05:04:15PM +0800, Kai-Heng Feng wrote:
> >>> Hi Andy,
> >>>
> >>> Currently, if a power state transition requires high latency, it may be
> >>> skipped [1] based on the value of ps_max_latency_us in
> >>> nvme_configure_apst():
> >>>
> >>> if (total_latency_us > ctrl->ps_max_latency_us)
> >>>     continue;
> >>>
> >>> Right now ps_max_latency_us defaults to 25000, but some consumer-level
> >>> NVMe devices have much higher latencies.
> >>> I understand this value is configurable, but I am wondering if it's
> >>> possible to ignore the latency limit on consumer devices, probably based on
> >>> chassis type, so they can get the most NVMe power saving out
> >>> of the box?
> >>
> >> What is your proposed change?
> >
> > Ignore the latency limit if it's a mobile device, based on DMI chassis type.
> > I can write a patch for that.
> >
> >> Do you have any numbers on how this
> >> improves power consumption for given workloads and what the performance
> >> impact is on common benchmarks?
> >
> > A SanDisk NVMe device has an entry latency of 1,000,000 us and an exit
> > latency of 100,000 us. The default latency limit (25000) does not allow
> > this device to enter the non-operational state. The system power
> > consumption is around 13W.
> > Making this SanDisk device able to enter PS4 brings the system down to
> > roughly 8W power consumption.
> > The 5W difference is quite good.
> 
> Can you send the actual 'nvme id-ctrl' output?
> 

I happen to have the output for this disk from another email thread I'm on,
so I'll share it while it's Kai-Heng's night.  There are several disks
mentioned that have this same concern; here are three of them at the end
of this email.

> I suspect that something is screwy here.  This is an entry latency of
> 1 second and an exit latency of 100ms.  This is *atrocious*.  I don't
> care what kind of mobile device this is -- making it unresponsive for
> 1.1 seconds for the round trip will be quite noticeable.  And, with an
> RSTe-like policy, that's 100 *seconds* of delay before going fully to
> sleep.  Also, 5W power difference between deep sleep and less deep
> sleep is also bizarrely large.  The NVMe device shouldn't take 5W of
> power when idle even in the max-power operational state.
> 

There are some configurations that have multiple NVMe disks.
For example the Precision 7520 can have up to 3.

NVME Identify Controller:
vid     : 0x15b7
ssvid   : 0x1b4b
sn      : 163503900124        
mn      : A400 NVMe SanDisk 512GB                 
fr      : A3550012
rab     : 2
ieee    : 001b44
cmic    : 0
mdts    : 5
cntlid  : 0
ver     : 10200
rtd3r   : 182b8
rtd3e   : f4240
oaes    : 0
oacs    : 0x17
acl     : 4
aerl    : 7
frmw    : 0x14
lpa     : 0x2
elpe    : 63
npss    : 4
avscc   : 0x1
apsta   : 0x1
wctemp  : 358
cctemp  : 361
mtfa    : 50
hmpre   : 0
hmmin   : 0
tnvmcap : 0
unvmcap : 0
rpmbs   : 0
sqes    : 0x66
cqes    : 0x44
nn      : 1
oncs    : 0x17
fuses   : 0
fna     : 0
vwc     : 0x1
awun    : 7
awupf   : 7
nvscc   : 1
acwu    : 0
sgls    : 0
ps    0 : mp:8.25W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:5.30W
ps    1 : mp:8.25W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:3.30W
ps    2 : mp:8.25W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:3.30W
ps    3 : mp:0.0500W non-operational enlat:51000 exlat:10000 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    4 : mp:0.0055W non-operational enlat:1000000 exlat:100000 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-


NVME Identify Controller:
vid     : 0x1179
ssvid   : 0x1179
sn      : 667S100ETXYV        
mn      : THNSF5512GPUK NVMe SED TOSHIBA 512GB    
fr      : 5KDA5103
rab     : 1
ieee    : 00080d
cmic    : 0
mdts    : 0
cntlid  : 0
ver     : 0
rtd3r   : 0
rtd3e   : 0
oaes    : 0
oacs    : 0x17
acl     : 3
aerl    : 3
frmw    : 0x2
lpa     : 0x2
elpe    : 127
npss    : 4
avscc   : 0
apsta   : 0x1
wctemp  : 351
cctemp  : 355
mtfa    : 0
hmpre   : 0
hmmin   : 0
tnvmcap : 0
unvmcap : 0
rpmbs   : 0
sqes    : 0x66
cqes    : 0x44
nn      : 1
oncs    : 0x1e
fuses   : 0
fna     : 0x4
vwc     : 0x1
awun    : 255
awupf   : 0
nvscc   : 0
acwu    : 0
sgls    : 0
ps    0 : mp:6.00W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:2.40W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:1.90W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0120W non-operational enlat:5000 exlat:25000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0060W non-operational enlat:100000 exlat:70000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-


NVME Identify Controller:
vid     : 0x14a4
ssvid   : 0x1b4b
sn      : TW0YR3K3LOH006A600CN
mn      : CX2-GB1024-Q11 NVMe LITEON 1024GB       
fr      : 4GA11QD 
rab     : 0
ieee    : 002303
cmic    : 0
mdts    : 5
cntlid  : 1
ver     : 10200
rtd3r   : f4240
rtd3e   : f4240
oaes    : 0
oacs    : 0x1f
acl     : 3
aerl    : 3
frmw    : 0x14
lpa     : 0x2
elpe    : 63
npss    : 4
avscc   : 0x1
apsta   : 0x1
wctemp  : 358
cctemp  : 368
mtfa    : 50
hmpre   : 0
hmmin   : 0
tnvmcap : 1024209543168
unvmcap : 0
rpmbs   : 0
sqes    : 0x66
cqes    : 0x44
nn      : 1
oncs    : 0x1f
fuses   : 0
fna     : 0
vwc     : 0x1
awun    : 255
awupf   : 7
nvscc   : 1
acwu    : 0
sgls    : 0
ps    0 : mp:8.00W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:4.00W operational enlat:5 exlat:5 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:2.10W operational enlat:5 exlat:5 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.1000W non-operational enlat:5000 exlat:5000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0100W non-operational enlat:50000 exlat:100000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-


* NVMe APST high latency power states being skipped
  2017-05-23 19:56       ` Mario.Limonciello
@ 2017-05-23 20:01         ` Andy Lutomirski
  2017-05-23 20:19           ` Mario.Limonciello
  0 siblings, 1 reply; 21+ messages in thread
From: Andy Lutomirski @ 2017-05-23 20:01 UTC (permalink / raw)


On Tue, May 23, 2017 at 12:56 PM,  <Mario.Limonciello@dell.com> wrote:
>> -----Original Message-----
>> From: Andy Lutomirski [mailto:luto at kernel.org]
>> Sent: Tuesday, May 23, 2017 2:35 PM
>> To: Kai-Heng Feng <kai.heng.feng at canonical.com>
>> Cc: Christoph Hellwig <hch at infradead.org>; Andrew Lutomirski
>> <luto at kernel.org>; linux-nvme <linux-nvme at lists.infradead.org>; Limonciello,
>> Mario <Mario_Limonciello at Dell.com>
>> Subject: Re: NVMe APST high latency power states being skipped
>>
>> On Tue, May 23, 2017 at 1:06 AM, Kai-Heng Feng
>> <kai.heng.feng@canonical.com> wrote:
>> > On Tue, May 23, 2017 at 3:17 PM, Christoph Hellwig <hch at infradead.org>
>> wrote:
>> >> On Mon, May 22, 2017 at 05:04:15PM +0800, Kai-Heng Feng wrote:
>> >>> Hi Andy,
>> >>>
>> >>> Currently, if a power state transition requires high latency, it may be
>> >>> skipped [1] based on the value of ps_max_latency_us in
>> >>> nvme_configure_apst():
>> >>>
>> >>> if (total_latency_us > ctrl->ps_max_latency_us)
>> >>>     continue;
>> >>>
>> >>> Right now ps_max_latency_us defaults to 25000, but some consumer-level
>> >>> NVMe devices have much higher latencies.
>> >>> I understand this value is configurable, but I am wondering if it's
>> >>> possible to ignore the latency limit on consumer devices, probably based on
>> >>> chassis type, so they can get the most NVMe power saving out
>> >>> of the box?
>> >>
>> >> What is your proposed change?
>> >
>> > Ignore the latency limit if it's a mobile device, based on DMI chassis type.
>> > I can write a patch for that.
>> >
>> >> Do you have any numbers on how this
>> >> improves power consumption for given workloads and what the performance
>> >> impact is on common benchmarks?
>> >
>> > A SanDisk NVMe device has an entry latency of 1,000,000 us and an exit
>> > latency of 100,000 us. The default latency limit (25000) does not allow
>> > this device to enter the non-operational state. The system power
>> > consumption is around 13W.
>> > Making this SanDisk device able to enter PS4 brings the system down to
>> > roughly 8W power consumption.
>> > The 5W difference is quite good.
>>
>> Can you send the actual 'nvme id-ctrl' output?
>>
>
> I happen to have the output for this disk from another email thread I'm on,
> so I'll share it while it's Kai-Heng's night.  There are several disks
> mentioned that have this same concern; here are three of them at the end
> of this email.
>
>> I suspect that something is screwy here.  This is an entry latency of
>> 1 second and an exit latency of 100ms.  This is *atrocious*.  I don't
>> care what kind of mobile device this is -- making it unresponsive for
>> 1.1 seconds for the round trip will be quite noticeable.  And, with an
>> RSTe-like policy, that's 100 *seconds* of delay before going fully to
>> sleep.  Also, 5W power difference between deep sleep and less deep
>> sleep is also bizarrely large.  The NVMe device shouldn't take 5W of
>> power when idle even in the max-power operational state.
>>
>
> There are some configurations that have multiple NVMe disks.
> For example the Precision 7520 can have up to 3.
>
> NVME Identify Controller:
...
> mn      : A400 NVMe SanDisk 512GB
...
> ps    0 : mp:8.25W operational enlat:0 exlat:0 rrt:0 rrl:0
>           rwt:0 rwl:0 idle_power:- active_power:5.30W
> ps    1 : mp:8.25W operational enlat:0 exlat:0 rrt:1 rrl:1
>           rwt:1 rwl:1 idle_power:- active_power:3.30W
> ps    2 : mp:8.25W operational enlat:0 exlat:0 rrt:2 rrl:2
>           rwt:2 rwl:2 idle_power:- active_power:3.30W
> ps    3 : mp:0.0500W non-operational enlat:51000 exlat:10000 rrt:0 rrl:0
>           rwt:0 rwl:0 idle_power:- active_power:-
> ps    4 : mp:0.0055W non-operational enlat:1000000 exlat:100000 rrt:0 rrl:0
>           rwt:0 rwl:0 idle_power:- active_power:-
>

44.5mW saved and totally crazy latency.

>
> NVME Identify Controller:
...
> mn      : THNSF5512GPUK NVMe SED TOSHIBA 512GB
...
> ps    0 : mp:6.00W operational enlat:0 exlat:0 rrt:0 rrl:0
>           rwt:0 rwl:0 idle_power:- active_power:-
> ps    1 : mp:2.40W operational enlat:0 exlat:0 rrt:1 rrl:1
>           rwt:1 rwl:1 idle_power:- active_power:-
> ps    2 : mp:1.90W operational enlat:0 exlat:0 rrt:2 rrl:2
>           rwt:2 rwl:2 idle_power:- active_power:-
> ps    3 : mp:0.0120W non-operational enlat:5000 exlat:25000 rrt:3 rrl:3
>           rwt:3 rwl:3 idle_power:- active_power:-
> ps    4 : mp:0.0060W non-operational enlat:100000 exlat:70000 rrt:4 rrl:4
>           rwt:4 rwl:4 idle_power:- active_power:-

6 mW saved and still fairly crazy latency.  70ms means you drop a couple frames.

>
>
> NVME Identify Controller:
...
> mn      : CX2-GB1024-Q11 NVMe LITEON 1024GB
...
> ps    0 : mp:8.00W operational enlat:0 exlat:0 rrt:0 rrl:0
>           rwt:0 rwl:0 idle_power:- active_power:-
> ps    1 : mp:4.00W operational enlat:5 exlat:5 rrt:1 rrl:1
>           rwt:1 rwl:1 idle_power:- active_power:-
> ps    2 : mp:2.10W operational enlat:5 exlat:5 rrt:2 rrl:2
>           rwt:2 rwl:2 idle_power:- active_power:-
> ps    3 : mp:0.1000W non-operational enlat:5000 exlat:5000 rrt:3 rrl:3
>           rwt:3 rwl:3 idle_power:- active_power:-
> ps    4 : mp:0.0100W non-operational enlat:50000 exlat:100000 rrt:4 rrl:4
>           rwt:4 rwl:4 idle_power:- active_power:-

90mW saved and still 100ms latency.  Also, I didn't know that Lite-on
made disks.

I'm not convinced that there's any chassis type for which this type of
default makes sense.

What would perhaps make sense is to have system-wide
performance-vs-power controls and to integrate NVMe power saving into
it, presumably through the pm_qos framework.  Or to export more
information to userspace and have a user tool that sets all this up
generically.


* NVMe APST high latency power states being skipped
  2017-05-23 20:01         ` Andy Lutomirski
@ 2017-05-23 20:19           ` Mario.Limonciello
  2017-05-23 21:11             ` Andy Lutomirski
  0 siblings, 1 reply; 21+ messages in thread
From: Mario.Limonciello @ 2017-05-23 20:19 UTC (permalink / raw)


> > There are some configurations that have multiple NVMe disks.
> > For example the Precision 7520 can have up to 3.
> >
> > NVME Identify Controller:
> ...
> > mn      : A400 NVMe SanDisk 512GB
> ...
> > ps    0 : mp:8.25W operational enlat:0 exlat:0 rrt:0 rrl:0
> >           rwt:0 rwl:0 idle_power:- active_power:5.30W
> > ps    1 : mp:8.25W operational enlat:0 exlat:0 rrt:1 rrl:1
> >           rwt:1 rwl:1 idle_power:- active_power:3.30W
> > ps    2 : mp:8.25W operational enlat:0 exlat:0 rrt:2 rrl:2
> >           rwt:2 rwl:2 idle_power:- active_power:3.30W
> > ps    3 : mp:0.0500W non-operational enlat:51000 exlat:10000 rrt:0 rrl:0
> >           rwt:0 rwl:0 idle_power:- active_power:-
> > ps    4 : mp:0.0055W non-operational enlat:1000000 exlat:100000 rrt:0 rrl:0
> >           rwt:0 rwl:0 idle_power:- active_power:-
> >
> 
> 44.5mW saved and totally crazy latency.
> 
> >
> > NVME Identify Controller:
> ...
> > mn      : THNSF5512GPUK NVMe SED TOSHIBA 512GB
> ...
> > ps    0 : mp:6.00W operational enlat:0 exlat:0 rrt:0 rrl:0
> >           rwt:0 rwl:0 idle_power:- active_power:-
> > ps    1 : mp:2.40W operational enlat:0 exlat:0 rrt:1 rrl:1
> >           rwt:1 rwl:1 idle_power:- active_power:-
> > ps    2 : mp:1.90W operational enlat:0 exlat:0 rrt:2 rrl:2
> >           rwt:2 rwl:2 idle_power:- active_power:-
> > ps    3 : mp:0.0120W non-operational enlat:5000 exlat:25000 rrt:3 rrl:3
> >           rwt:3 rwl:3 idle_power:- active_power:-
> > ps    4 : mp:0.0060W non-operational enlat:100000 exlat:70000 rrt:4 rrl:4
> >           rwt:4 rwl:4 idle_power:- active_power:-
> 
> 6 mW saved and still fairly crazy latency.  70ms means you drop a couple frames.
> 
> >
> >
> > NVME Identify Controller:
> ...
> > mn      : CX2-GB1024-Q11 NVMe LITEON 1024GB
> ...
> > ps    0 : mp:8.00W operational enlat:0 exlat:0 rrt:0 rrl:0
> >           rwt:0 rwl:0 idle_power:- active_power:-
> > ps    1 : mp:4.00W operational enlat:5 exlat:5 rrt:1 rrl:1
> >           rwt:1 rwl:1 idle_power:- active_power:-
> > ps    2 : mp:2.10W operational enlat:5 exlat:5 rrt:2 rrl:2
> >           rwt:2 rwl:2 idle_power:- active_power:-
> > ps    3 : mp:0.1000W non-operational enlat:5000 exlat:5000 rrt:3 rrl:3
> >           rwt:3 rwl:3 idle_power:- active_power:-
> > ps    4 : mp:0.0100W non-operational enlat:50000 exlat:100000 rrt:4 rrl:4
> >           rwt:4 rwl:4 idle_power:- active_power:-
> 
> 90mW saved and still 100ms latency.  Also, I didn't know that Lite-on
> made disks.

Well, the important one here, I think, is jumping down to PS3.  That's a much bigger
drop in power across all of these disks.  The Liteon one will obviously go into PS3
in the current patch, but the other two are just going to be vampires.

> 
> I'm not convinced that there's any chassis type for which this type of
> default makes sense.
> 
I guess I'm wondering where you came up with 25000 as the default:
+static unsigned long default_ps_max_latency_us = 25000;

Was it based on results from testing a bunch of disks, or on
experimentation with a few higher-end SSDs?

> What would perhaps make sense is to have system-wide
> performance-vs-power controls and to integrate NVMe power saving into
> it, presumably through the pm_qos framework.  Or to export more
> information to userspace and have a user tool that sets all this up
> generically.

So I think you're already doing this.  power/pm_qos_latency_tolerance_us
and the module parameter default_ps_max_latency_us can effectively
change it.
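
Concretely, those two knobs look like this (the controller name nvme0 and the 100000 us value are examples only, and the exact sysfs path may vary by kernel version):

```shell
# /etc/modprobe.d/nvme.conf -- raise the default APST latency budget
# at module load time (100000 us is an example, not a recommendation):
#   options nvme_core default_ps_max_latency_us=100000

# Or adjust a single controller at runtime through the PM QoS attribute
# mentioned above ("nvme0" is an example controller name):
echo 100000 | sudo tee /sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us
```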

Kai Heng can comment more on the testing they've done and the performance
impact, but I understand that by tweaking those knobs they've been able to
get all these disks into at least PS3 and saved a lot of power.

We could go work with the TLP project or the PowerTOP guys and have them
go and tweak the various sysfs knobs to make more of these disks work,
but I would rather the kernel had good defaults across this collection of disks.


* NVMe APST high latency power states being skipped
  2017-05-23 20:19           ` Mario.Limonciello
@ 2017-05-23 21:11             ` Andy Lutomirski
  2017-05-23 22:09               ` Mario.Limonciello
  0 siblings, 1 reply; 21+ messages in thread
From: Andy Lutomirski @ 2017-05-23 21:11 UTC (permalink / raw)


On Tue, May 23, 2017 at 1:19 PM,  <Mario.Limonciello@dell.com> wrote:
>> > There are some configurations that have multiple NVMe disks.
>> > For example the Precision 7520 can have up to 3.
>> >
>> > NVME Identify Controller:
>> ...
>> > mn      : A400 NVMe SanDisk 512GB
>> ...
>> > ps    0 : mp:8.25W operational enlat:0 exlat:0 rrt:0 rrl:0
>> >           rwt:0 rwl:0 idle_power:- active_power:5.30W
>> > ps    1 : mp:8.25W operational enlat:0 exlat:0 rrt:1 rrl:1
>> >           rwt:1 rwl:1 idle_power:- active_power:3.30W
>> > ps    2 : mp:8.25W operational enlat:0 exlat:0 rrt:2 rrl:2
>> >           rwt:2 rwl:2 idle_power:- active_power:3.30W
>> > ps    3 : mp:0.0500W non-operational enlat:51000 exlat:10000 rrt:0 rrl:0
>> >           rwt:0 rwl:0 idle_power:- active_power:-
>> > ps    4 : mp:0.0055W non-operational enlat:1000000 exlat:100000 rrt:0 rrl:0
>> >           rwt:0 rwl:0 idle_power:- active_power:-
>> >
>>
>> 44.5mW saved and totally crazy latency.
>>
>> >
>> > NVME Identify Controller:
>> ...
>> > mn      : THNSF5512GPUK NVMe SED TOSHIBA 512GB
>> ...
>> > ps    0 : mp:6.00W operational enlat:0 exlat:0 rrt:0 rrl:0
>> >           rwt:0 rwl:0 idle_power:- active_power:-
>> > ps    1 : mp:2.40W operational enlat:0 exlat:0 rrt:1 rrl:1
>> >           rwt:1 rwl:1 idle_power:- active_power:-
>> > ps    2 : mp:1.90W operational enlat:0 exlat:0 rrt:2 rrl:2
>> >           rwt:2 rwl:2 idle_power:- active_power:-
>> > ps    3 : mp:0.0120W non-operational enlat:5000 exlat:25000 rrt:3 rrl:3
>> >           rwt:3 rwl:3 idle_power:- active_power:-
>> > ps    4 : mp:0.0060W non-operational enlat:100000 exlat:70000 rrt:4 rrl:4
>> >           rwt:4 rwl:4 idle_power:- active_power:-
>>
>> 6 mW saved and still fairly crazy latency.  70ms means you drop a couple frames.
>>
>> >
>> >
>> > NVME Identify Controller:
>> ...
>> > mn      : CX2-GB1024-Q11 NVMe LITEON 1024GB
>> ...
>> > ps    0 : mp:8.00W operational enlat:0 exlat:0 rrt:0 rrl:0
>> >           rwt:0 rwl:0 idle_power:- active_power:-
>> > ps    1 : mp:4.00W operational enlat:5 exlat:5 rrt:1 rrl:1
>> >           rwt:1 rwl:1 idle_power:- active_power:-
>> > ps    2 : mp:2.10W operational enlat:5 exlat:5 rrt:2 rrl:2
>> >           rwt:2 rwl:2 idle_power:- active_power:-
>> > ps    3 : mp:0.1000W non-operational enlat:5000 exlat:5000 rrt:3 rrl:3
>> >           rwt:3 rwl:3 idle_power:- active_power:-
>> > ps    4 : mp:0.0100W non-operational enlat:50000 exlat:100000 rrt:4 rrl:4
>> >           rwt:4 rwl:4 idle_power:- active_power:-
>>
>> 90mW saved and still 100ms latency.  Also, I didn't know that Lite-on
>> made disks.
>
> Well, the important one here, I think, is jumping down to PS3.  That's a much bigger
> drop in power across all of these disks.  The Liteon one will obviously go into PS3
> in the current patch, but the other two are just going to be vampires.

Ah, I missed that when reading the numbers.

>
>>
>> I'm not convinced that there's any chassis type for which this type of
>> default makes sense.
>>
> I guess I'm wondering where you came up with 25000 as the default:
> +static unsigned long default_ps_max_latency_us = 25000;
>
> Was it based on results from testing a bunch of disks, or on
> experimentation with a few higher-end SSDs?

It was based on results across a bunch of disks, where "a bunch" == 2,
one that I own and one that Niranjan has. :)  Also, 25ms is a nice
round number.  I could be persuaded to increase it.  (Although the
SanDisk one should hit PS3 as well, no?)

I could also be persuaded to change the relevant parameter from (enlat
+ exlat) to something else.  The spec says, in language that's about
as clear as mud, that starting to go non-operational and then doing
any actual work can take (enlat + exlat) time.  But maybe real disks
aren't quite that bad.  In any event, the common case should be just
exlat.

Also, jeez, that Toshiba disk must *suck* under the RSTe policy.  25ms
exit latency incurred after 60ms of idle time?  No thanks!

>
>> What would perhaps make sense is to have system-wide
>> performance-vs-power controls and to integrate NVMe power saving into
>> it, presumably through the pm_qos framework.  Or to export more
>> information to userspace and have a user tool that sets all this up
>> generically.
>
> So I think you're already doing this.  power/pm_qos_latency_tolerance_us
> and the module parameter default_ps_max_latency_us can effectively
> change it.

What I mean is: <device>/power could also expose some hints about
exactly what the tradeoffs are (to the best of the kernel's knowledge)
so that user code could make a more informed and more automatic
decision.

>
> Kai Heng can comment more on the testing they've done and the performance
> impact, but I understand that by tweaking those knobs they've been able to
> get all these disks into at least PS3 and saved a lot of power.
>
> We could go work with the TLP project  or power top guys and have them
> go and tweak the various sysfs knobs to make more of these disks work,
> but I would rather the kernel had good defaults across this collection of disks.

Agreed.


* NVMe APST high latency power states being skipped
  2017-05-23 21:11             ` Andy Lutomirski
@ 2017-05-23 22:09               ` Mario.Limonciello
  2017-05-24  4:53                 ` Kai-Heng Feng
  0 siblings, 1 reply; 21+ messages in thread
From: Mario.Limonciello @ 2017-05-23 22:09 UTC (permalink / raw)




> -----Original Message-----
> From: Andy Lutomirski [mailto:luto at kernel.org]
> Sent: Tuesday, May 23, 2017 4:12 PM
> To: Limonciello, Mario <Mario_Limonciello at Dell.com>
> Cc: Andrew Lutomirski <luto at kernel.org>; Kai-Heng Feng
> <kai.heng.feng at canonical.com>; Christoph Hellwig <hch at infradead.org>; linux-
> nvme <linux-nvme at lists.infradead.org>
> Subject: Re: NVMe APST high latency power states being skipped
> 
> On Tue, May 23, 2017 at 1:19 PM,  <Mario.Limonciello@dell.com> wrote:
> >> > There are some configurations that have multiple NVMe disks.
> >> > For example the Precision 7520 can have up to 3.
> >> >
> >> > NVME Identify Controller:
> >> ...
> >> > mn      : A400 NVMe SanDisk 512GB
> >> ...
> >> > ps    0 : mp:8.25W operational enlat:0 exlat:0 rrt:0 rrl:0
> >> >           rwt:0 rwl:0 idle_power:- active_power:5.30W
> >> > ps    1 : mp:8.25W operational enlat:0 exlat:0 rrt:1 rrl:1
> >> >           rwt:1 rwl:1 idle_power:- active_power:3.30W
> >> > ps    2 : mp:8.25W operational enlat:0 exlat:0 rrt:2 rrl:2
> >> >           rwt:2 rwl:2 idle_power:- active_power:3.30W
> >> > ps    3 : mp:0.0500W non-operational enlat:51000 exlat:10000 rrt:0 rrl:0
> >> >           rwt:0 rwl:0 idle_power:- active_power:-
> >> > ps    4 : mp:0.0055W non-operational enlat:1000000 exlat:100000 rrt:0 rrl:0
> >> >           rwt:0 rwl:0 idle_power:- active_power:-
> >> >
> >>
> >> 44.5mW saved and totally crazy latency.
> >>
> >> >
> >> > NVME Identify Controller:
> >> ...
> >> > mn      : THNSF5512GPUK NVMe SED TOSHIBA 512GB
> >> ...
> >> > ps    0 : mp:6.00W operational enlat:0 exlat:0 rrt:0 rrl:0
> >> >           rwt:0 rwl:0 idle_power:- active_power:-
> >> > ps    1 : mp:2.40W operational enlat:0 exlat:0 rrt:1 rrl:1
> >> >           rwt:1 rwl:1 idle_power:- active_power:-
> >> > ps    2 : mp:1.90W operational enlat:0 exlat:0 rrt:2 rrl:2
> >> >           rwt:2 rwl:2 idle_power:- active_power:-
> >> > ps    3 : mp:0.0120W non-operational enlat:5000 exlat:25000 rrt:3 rrl:3
> >> >           rwt:3 rwl:3 idle_power:- active_power:-
> >> > ps    4 : mp:0.0060W non-operational enlat:100000 exlat:70000 rrt:4 rrl:4
> >> >           rwt:4 rwl:4 idle_power:- active_power:-
> >>
> >> 6 mW saved and still fairly crazy latency.  70ms means you drop a couple frames.
> >>
> >> >
> >> >
> >> > NVME Identify Controller:
> >> ...
> >> > mn      : CX2-GB1024-Q11 NVMe LITEON 1024GB
> >> ...
> >> > ps    0 : mp:8.00W operational enlat:0 exlat:0 rrt:0 rrl:0
> >> >           rwt:0 rwl:0 idle_power:- active_power:-
> >> > ps    1 : mp:4.00W operational enlat:5 exlat:5 rrt:1 rrl:1
> >> >           rwt:1 rwl:1 idle_power:- active_power:-
> >> > ps    2 : mp:2.10W operational enlat:5 exlat:5 rrt:2 rrl:2
> >> >           rwt:2 rwl:2 idle_power:- active_power:-
> >> > ps    3 : mp:0.1000W non-operational enlat:5000 exlat:5000 rrt:3 rrl:3
> >> >           rwt:3 rwl:3 idle_power:- active_power:-
> >> > ps    4 : mp:0.0100W non-operational enlat:50000 exlat:100000 rrt:4 rrl:4
> >> >           rwt:4 rwl:4 idle_power:- active_power:-
> >>
> >> 90mW saved and still 100ms latency.  Also, I didn't know that Lite-on
> >> made disks.
> >
> > Well so the important one here I think is jumping down to PS3.  That's a much
> bigger
> > drop in power across all of these disks.  The Liteon one will obviously go into PS3
> > in the current patch, but the other two are just going to be vampires.
> 
> Ah, I missed that when reading the numbers.
> 
> >
> >>
> >> I'm not convinced that there's any chassis type for which this type of
> >> default makes sense.
> >>
> > I guess I'm wondering where you came up with 25000 as the default:
> > +static unsigned long default_ps_max_latency_us = 25000;
> >
> > Was it based across results of testing a bunch of disks, or from
> > experimentation with a few higher end SSDs?
> 
> It was based on results across a bunch of disks, where "a bunch" == 2,
> one that I own and one that Niranjan has. :)  Also, 25ms is a nice
> round number.  I could be persuaded to increase it.  (Although the
> SanDisk one should hit PS3 as well, no?)
>
I think you missed a 0 when looking at the numbers.

51000 + 10000 > 25000
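To make the arithmetic concrete, here is a hypothetical Python sketch of the latency filter being discussed (the real check lives in nvme_configure_apst() in the kernel; the power state numbers are copied from the `nvme id-ctrl` output quoted above, and the dict labels are just shorthand for those three disks):

```python
# Sketch of the APST latency filter: a non-operational power state is used
# only if its total transition latency (enlat + exlat) fits within
# ps_max_latency_us, mirroring the `total_latency_us > ps_max_latency_us`
# skip in nvme_configure_apst().

# ps: (enlat, exlat) in microseconds, copied from the id-ctrl output above.
DISKS = {
    "SanDisk A400":   {3: (51000, 10000), 4: (1000000, 100000)},
    "Toshiba THNSF5": {3: (5000, 25000),  4: (100000, 70000)},
    "LiteOn CX2":     {3: (5000, 5000),   4: (50000, 100000)},
}

def allowed_states(table, ps_max_latency_us):
    """Return the non-operational states whose total latency fits the budget."""
    return [ps for ps, (enlat, exlat) in sorted(table.items())
            if enlat + exlat <= ps_max_latency_us]

for name, table in DISKS.items():
    print(name, allowed_states(table, 25000))
```

With the 25000us default, only the LiteOn reaches PS3; the SanDisk needs a budget of 61000 for PS3 and 1100000 for PS4, which matches the numbers discussed later in this thread.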
 
> I could also be persuaded to change the relevant parameter from (enlat
> + exlat) to something else.  The spec says, in language that's about
> as clear as mud, that starting to go non-operational and then doing
> any actual work can take (enlat + exlat) time.  But maybe real disks
> aren't quite that bad.  In any event, the common case should be just
> exlat.
> 

I know Kai Heng has looked at a /lot/ of disks. I've got stats from a few
of them, but there are many more that I haven't seen.

Perhaps Chris or Kai Heng might be able to provide a better parameter
to base it on, from other experience.

> Also, jeez, that Toshiba disk must *suck* under the RSTe policy.  25ms
> exit latency incurred after 60ms of idle time?  No thanks!
> 
> >
> >> What would perhaps make sense is to have system-wide
> >> performance-vs-power controls and to integrate NVMe power saving into
> >> it, presumably through the pm_qos framework.  Or to export more
> >> information to userspace and have a user tool that sets all this up
> >> generically.
> >
> > So I think you're already doing this.  power/pm_qos_latency_tolerance_us
> > and the module parameter default_ps_max_latency_us can effectively
> > change it.
> 
> What I mean is: <device>/power could also expose some hints about
> exactly what the tradeoffs are (to the best of the kernel's knowledge)
> so that user code could make a more informed and more automatic
> decision.

I think that, separate from the effort of getting the default right, this
makes sense.  To me the most important default should be getting the disk
into at least the first non-operational state, even if latency is bad.

Then provide user code the ability to block that non-operational state, or
to enter other non-operational states that would otherwise be blocked due
to latency.

> 
> >
> > Kai Heng can comment more on the testing they've done and the performance
> > impact, but I understand that by tweaking those knobs they've been able to
> > get all these disks into at least PS3 and saved a lot of power.
> >
> > We could go work with the TLP project  or power top guys and have them
> > go and tweak the various sysfs knobs to make more of these disks work,
> > but I would rather the kernel had good defaults across this collection of disks.
> 
> Agreed.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* NVMe APST high latency power states being skipped
  2017-05-23 22:09               ` Mario.Limonciello
@ 2017-05-24  4:53                 ` Kai-Heng Feng
  2017-05-24  5:31                   ` Andy Lutomirski
  0 siblings, 1 reply; 21+ messages in thread
From: Kai-Heng Feng @ 2017-05-24  4:53 UTC (permalink / raw)


On Wed, May 24, 2017@6:09 AM,  <Mario.Limonciello@dell.com> wrote:
[snipped]
>
> I know Kai Heng has looked at a /lot/ of disks. I've got stats from a few
> of them, but there are many more that I haven't seen.

Not really; the others I've seen have rather low latency. We have the same
high-latency ones.

> Perhaps Chris or Kai Heng might be able to provide a better parameter
> to base it on, from other experience.

A quick summary: we need at least 61000 to make all of them able to
enter PS3,
and 1100000 for PS4.

I'll do some performance testing on the 1100000-latency one.

Is there any way to observe power state transitions in NVMe?

[snipped]

> I think separate from the effort of getting the default right this makes sense.
> To me the most important default should be getting the disk into at least
> the first non-operational state even if latency is bad.
>
> Then provide the ability to block that non-operational state or go into
> other non-operational states that would be otherwise blocked due to latency
> by user code.

We can add this to TLP by grepping the PS3 latencies out of `nvme
id-ctrl` and doing some math, but it will be ugly.
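As an illustration of that math, a hedged sketch is below. The sample text is a trimmed, unquoted copy of the Toshiba table quoted earlier in this thread; the exact `nvme id-ctrl` output format may vary between nvme-cli versions, so the regex is an assumption:

```python
import re

# Parse the non-operational power state lines from `nvme id-ctrl`-style
# output and compute the latency tolerance (enlat + exlat) each state needs.

SAMPLE = """\
ps    3 : mp:0.0120W non-operational enlat:5000 exlat:25000 rrt:3 rrl:3
ps    4 : mp:0.0060W non-operational enlat:100000 exlat:70000 rrt:4 rrl:4
"""

PS_RE = re.compile(r"ps\s+(\d+)\s+:.*non-operational\s+enlat:(\d+)\s+exlat:(\d+)")

def required_latency_us(id_ctrl_output):
    """Map each non-operational power state to the latency budget it needs."""
    return {int(ps): int(en) + int(ex)
            for ps, en, ex in PS_RE.findall(id_ctrl_output)}

print(required_latency_us(SAMPLE))   # {3: 30000, 4: 170000}
```

A tool like TLP could feed the result into the per-device pm_qos latency tolerance knob mentioned earlier in the thread.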

>
>>
>> >
>> > Kai Heng can comment more on the testing they've done and the performance
>> > impact, but I understand that by tweaking those knobs they've been able to
>> > get all these disks into at least PS3 and saved a lot of power.
>> >
>> > We could go work with the TLP project  or power top guys and have them
>> > go and tweak the various sysfs knobs to make more of these disks work,
>> > but I would rather the kernel had good defaults across this collection of disks.
>>
>> Agreed.

Other than TLP/powertop, we should make this easy to integrate with
something like thermald.
NVMe runs quite hot. It could be quite useful to let thermald control the
maximum available power state directly via a sysfs knob. Fanless devices
would benefit a lot from this.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* NVMe APST high latency power states being skipped
  2017-05-24  4:53                 ` Kai-Heng Feng
@ 2017-05-24  5:31                   ` Andy Lutomirski
  2017-05-25  8:21                     ` Kai-Heng Feng
  0 siblings, 1 reply; 21+ messages in thread
From: Andy Lutomirski @ 2017-05-24  5:31 UTC (permalink / raw)


On Tue, May 23, 2017 at 9:53 PM, Kai-Heng Feng
<kai.heng.feng@canonical.com> wrote:
> On Wed, May 24, 2017@6:09 AM,  <Mario.Limonciello@dell.com> wrote:
> [snipped]
>>
>> I know Kai Heng has looked at a /lot/ of disks. I've got stats from a few
>> of them, but there are many more that I haven't seen.
>
> Not really, others I've seen have rather low latency. We have the same
> high latency ones.
>
>> Perhaps Chris or Kai Heng might be able to provide a better parameter
>> to base it on, from other experience.
>
> A quick summary: we need at least 61000 to make all of them able to
> enter PS3,
> and 1100000 for PS4.
> 
> I'll do some performance testing on the 1100000-latency one.
> 
> Is there any way to observe power state transitions in NVMe?

I don't think so, sadly.  It's probably possible to use non-autonomous
transitions to force low power and then do some IO.  I can try to
fiddle with this and see how hard it would be to whip up a simple
benchmark.

>
> [snipped]
>
>> I think separate from the effort of getting the default right this makes sense.
>> To me the most important default should be getting the disk into at least
>> the first non-operational state even if latency is bad.
>>
>> Then provide the ability to block that non-operational state or go into
>> other non-operational states that would be otherwise blocked due to latency
>> by user code.
>
> We can add this to TLP by grepping the PS3 latencies out of `nvme
> id-ctrl` and doing some math, but it will be ugly.
>
>>
>>>
>>> >
>>> > Kai Heng can comment more on the testing they've done and the performance
>>> > impact, but I understand that by tweaking those knobs they've been able to
>>> > get all these disks into at least PS3 and saved a lot of power.
>>> >
>>> > We could go work with the TLP project  or power top guys and have them
>>> > go and tweak the various sysfs knobs to make more of these disks work,
>>> > but I would rather the kernel had good defaults across this collection of disks.
>>>
>>> Agreed.
>
> Other than TLP/powertop, we should make this easy to integrate with
> something like thermald.
> NVMe runs quite hot. It could be quite useful to let thermald control the
> maximum available power state directly via a sysfs knob. Fanless devices
> would benefit a lot from this.

Hmm.  That's doable but isn't strictly part of APST.  We could add a
sysfs knob "operating_power_state" and a sysfs file that lists the
available operating states.  APST is about transitions to
*non-operating* states.

Unfortunately, the info in the provided tables is almost entirely
worthless when it comes to describing the performance impact of using
reduced-power operating states.  Also, I wouldn't personally be
shocked to see some interesting hardware bugs.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* NVMe APST high latency power states being skipped
  2017-05-24  5:31                   ` Andy Lutomirski
@ 2017-05-25  8:21                     ` Kai-Heng Feng
  2017-05-26  9:25                       ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Kai-Heng Feng @ 2017-05-25  8:21 UTC (permalink / raw)


On Wed, May 24, 2017@1:31 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Tue, May 23, 2017 at 9:53 PM, Kai-Heng Feng
> <kai.heng.feng@canonical.com> wrote:
>> On Wed, May 24, 2017@6:09 AM,  <Mario.Limonciello@dell.com> wrote:
>> [snipped]
>>>
>>> I know Kai Heng has looked at a /lot/ of disks. I've got stats from a few
>>> of them, but there are many more that I haven't seen.
>>
>> Not really; the others I've seen have rather low latency. We have the same
>> high-latency ones.
>>
>>> Perhaps Chris or Kai Heng might be able to provide a better parameter
>>> to base it on, from other experience.
>>
>> A quick summary: we need at least 61000 to make all of them able to
>> enter PS3,
>> and 1100000 for PS4.
>>
>> I'll do some performance testing on the 1100000-latency one.
>>
>> Is there any way to observe power state transitions in NVMe?
>
> I don't think so, sadly.  It's probably possible to use non-autonomous
> transitions to force low power and then do some IO.  I can try to
> fiddle with this and see how hard it would be to whip up a simple
> benchmark.

I did some benchmarking on the high-latency SanDisk A400:
Kernel compilation, no PS3/PS4:
real    23m36.466s
user    115m49.944s
sys     10m58.352s

Kernel compilation, allow PS3/PS4:
real    24m40.308s
user    116m12.600s
sys     11m47.484s

Also, I played a 4K video downloaded from YouTube with no visual stutters.
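For perspective, the wall-clock numbers above work out to roughly a 4.5% slowdown; a trivial sketch of the arithmetic:

```python
# Simple arithmetic on the "real" (wall-clock) times reported above,
# to put the cost of allowing PS3/PS4 in perspective.

def to_seconds(minutes, seconds):
    return 60 * minutes + seconds

no_deep_states = to_seconds(23, 36.466)   # real time, PS3/PS4 blocked
deep_states    = to_seconds(24, 40.308)   # real time, PS3/PS4 allowed
slowdown = (deep_states - no_deep_states) / no_deep_states
print(f"compile-time slowdown: {slowdown:.1%}")   # about 4.5%
```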

>
>>
>> [snipped]
>>
>>> I think separate from the effort of getting the default right this makes sense.
>>> To me the most important default should be getting the disk into at least
>>> the first non-operational state even if latency is bad.
>>>
>>> Then provide the ability to block that non-operational state or go into
>>> other non-operational states that would be otherwise blocked due to latency
>>> by user code.
>>
>> We can add this to TLP by grepping the PS3 latencies out of `nvme
>> id-ctrl` and doing some math, but it will be ugly.
>>
>>>
>>>>
>>>> >
>>>> > Kai Heng can comment more on the testing they've done and the performance
>>>> > impact, but I understand that by tweaking those knobs they've been able to
>>>> > get all these disks into at least PS3 and saved a lot of power.
>>>> >
>>>> > We could go work with the TLP project  or power top guys and have them
>>>> > go and tweak the various sysfs knobs to make more of these disks work,
>>>> > but I would rather the kernel had good defaults across this collection of disks.
>>>>
>>>> Agreed.
>>
>> Other than TLP/powertop, we should make this easy to integrate with
>> something like thermald.
>> NVMe runs quite hot. It could be quite useful to let thermald control the
>> maximum available power state directly via a sysfs knob. Fanless devices
>> would benefit a lot from this.
>
> Hmm.  That's doable but isn't strictly part of APST.  We could add a
> sysfs knob "operating_power_state" and a sysfs file that lists the
> available operating states.  APST is about transitions to
> *non-operating* states.
>
> Unfortunately, the info in the provided tables is almost entirely
> worthless when it comes to describing the performance impact of using
> reduced-power operating states.  Also, I wouldn't personally be
> shocked to see some interesting hardware bugs.

You are right, but we should allow the NVMe device to transition to
non-operating states when the whole system is too hot, even if the
latency is pretty bad.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* NVMe APST high latency power states being skipped
  2017-05-25  8:21                     ` Kai-Heng Feng
@ 2017-05-26  9:25                       ` Christoph Hellwig
  2017-06-01  8:19                         ` Kai-Heng Feng
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2017-05-26  9:25 UTC (permalink / raw)


On Thu, May 25, 2017@04:21:07PM +0800, Kai-Heng Feng wrote:
> I did some benchmark on the high latency SanDisk A400:
> Kernel compilation, no PS3/PS4:
> real    23m36.466s
> user    115m49.944s
> sys     10m58.352s
> 
> Kernel compilation, allow PS3/PS4:
> real    24m40.308s
> user    116m12.600s
> sys     11m47.484s

That's quite a bit of a slowdown.  Can we play a bit with the
entry latency (maybe just for PS4) so that we still get into the
modes, but not as quickly?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* NVMe APST high latency power states being skipped
  2017-05-26  9:25                       ` Christoph Hellwig
@ 2017-06-01  8:19                         ` Kai-Heng Feng
  2017-06-01 11:32                           ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Kai-Heng Feng @ 2017-06-01  8:19 UTC (permalink / raw)


On Fri, May 26, 2017@5:25 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Thu, May 25, 2017@04:21:07PM +0800, Kai-Heng Feng wrote:
>> I did some benchmark on the high latency SanDisk A400:
>> Kernel compilation, no PS3/PS4:
>> real    23m36.466s
>> user    115m49.944s
>> sys     10m58.352s
>>
>> Kernel compilation, allow PS3/PS4:
>> real    24m40.308s
>> user    116m12.600s
>> sys     11m47.484s
>
> That's quite a bit of a slow down.  Can we play a bit with the
> entry latency (maybe just for PS4) so that we still get into the
> modes, but not as quickly?

I changed the "Idle Time Prior to Transition" from 55000ms to 110000ms;
the time to compile the kernel source is roughly the same.

My guess is that since the filesystem is constantly doing I/O while
compiling the kernel, the NVMe device never (or rarely) hits PS4.

We probably need to think of a better scenario for measuring the power
saving latency...

^ permalink raw reply	[flat|nested] 21+ messages in thread

* NVMe APST high latency power states being skipped
  2017-06-01  8:19                         ` Kai-Heng Feng
@ 2017-06-01 11:32                           ` Christoph Hellwig
  2017-06-02  7:08                             ` Kai-Heng Feng
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2017-06-01 11:32 UTC (permalink / raw)


On Thu, Jun 01, 2017@04:19:26PM +0800, Kai-Heng Feng wrote:
> >> I did some benchmark on the high latency SanDisk A400:
> >> Kernel compilation, no PS3/PS4:
> >> real    23m36.466s
> >> user    115m49.944s
> >> sys     10m58.352s
> >>
> >> Kernel compilation, allow PS3/PS4:
> >> real    24m40.308s
> >> user    116m12.600s
> >> sys     11m47.484s
> >
> > That's quite a bit of a slow down.  Can we play a bit with the
> > entry latency (maybe just for PS4) so that we still get into the
> > modes, but not as quickly?
> 
> I changed the "Idle Time Prior to Transition" from 55000ms to 110000ms,
> The time to compile kernel source is roughly the same.
> 
> My guess is that since the filesystem is constantly doing I/O while
> compiling kernel, nvme never (or rarely) hits PS4.

That doesn't explain why we need more than a minute more for the
compile.  How reproducible are these numbers, btw?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* NVMe APST high latency power states being skipped
  2017-06-01 11:32                           ` Christoph Hellwig
@ 2017-06-02  7:08                             ` Kai-Heng Feng
  2017-06-02  7:13                               ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Kai-Heng Feng @ 2017-06-02  7:08 UTC (permalink / raw)


On Thu, Jun 1, 2017@7:32 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Thu, Jun 01, 2017@04:19:26PM +0800, Kai-Heng Feng wrote:
>> >> I did some benchmark on the high latency SanDisk A400:
>> >> Kernel compilation, no PS3/PS4:
>> >> real    23m36.466s
>> >> user    115m49.944s
>> >> sys     10m58.352s
>> >>
>> >> Kernel compilation, allow PS3/PS4:
>> >> real    24m40.308s
>> >> user    116m12.600s
>> >> sys     11m47.484s
>> >
>> > That's quite a bit of a slow down.  Can we play a bit with the
>> > entry latency (maybe just for PS4) so that we still get into the
>> > modes, but not as quickly?
>>
>> I changed the "Idle Time Prior to Transition" from 55000ms to 110000ms,
>> The time to compile kernel source is roughly the same.
>>
>> My guess is that since the filesystem is constantly doing I/O while
>> compiling kernel, nvme never (or rarely) hits PS4.
>
> That doesn't explain why we need more than a minute more for the
> compile.  How reproducible are these numbers, btw?

After some more testing, I am actually getting similar values whether
APST is on or off. This is a super hot machine, so a throttled CPU might be
a more impactful factor here.

Not being able to directly read the power state from the NVMe device makes
the latency of power state transitions hard to measure.

At least after a long idle, I typed `ls` and everything showed up
instantly; I can't feel the latency at all. Probably the "real"
latency is not as bad as the NVMe claims.

Maybe add an extra knob which can directly control the deepest allowed
power state? Userspace tools could control the deepest power state through
this knob.
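For reference, userspace can already raise the per-device tolerance through the pm_qos attribute mentioned earlier in the thread; a minimal sketch follows. The sysfs path in the docstring is illustrative, and the functions take the knob path as a parameter so they are not tied to real hardware:

```python
from pathlib import Path

def set_latency_tolerance(knob, tolerance_us):
    """Write a latency tolerance (in us) to a pm_qos sysfs attribute.

    On real hardware the knob would be something like
    .../power/pm_qos_latency_tolerance_us under the device's sysfs
    directory (path illustrative); APST states whose enlat + exlat
    exceed the written value remain blocked.
    """
    Path(knob).write_text(str(tolerance_us))

def get_latency_tolerance(knob):
    """Read back the currently configured tolerance in microseconds."""
    return int(Path(knob).read_text())
```

A thermald-style daemon could lower the written value when the system runs hot, which is close to what is being proposed above, although a knob for the deepest allowed state itself would be more direct.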

^ permalink raw reply	[flat|nested] 21+ messages in thread

* NVMe APST high latency power states being skipped
  2017-06-02  7:08                             ` Kai-Heng Feng
@ 2017-06-02  7:13                               ` Christoph Hellwig
  2017-06-06  9:54                                 ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2017-06-02  7:13 UTC (permalink / raw)


On Fri, Jun 02, 2017@03:08:49PM +0800, Kai-Heng Feng wrote:
> After some more testing, I am actually getting similar values whether
> APST is on or off. This is a super hot machine, so a throttled CPU might be
> a more impactful factor here.
> 
> Not being able to directly read the power state from the NVMe device makes
> the latency of power state transitions hard to measure.
> 
> At least after a long idle, I typed `ls` and everything showed up
> instantly; I can't feel the latency at all. Probably the "real"
> latency is not as bad as the NVMe claims.
> 
> Maybe add an extra knob which can directly control the deepest allowed
> power state? Userspace tools could control the deepest power state through
> this knob.

I'd prefer that things work out of the box.  I'd be tempted to just
bump up the latency requirement to cover the device, but if Andy
doesn't like that in general we could add a quirk for this device
to at least allow it to use deep power states.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* NVMe APST high latency power states being skipped
  2017-06-02  7:13                               ` Christoph Hellwig
@ 2017-06-06  9:54                                 ` Christoph Hellwig
  2017-06-06 15:57                                   ` Andy Lutomirski
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2017-06-06  9:54 UTC (permalink / raw)


On Fri, Jun 02, 2017@12:13:37AM -0700, Christoph Hellwig wrote:
> > Maybe add an extra knob which can directly control the deepest allowed
> > power state? Userspace tools could control the deepest power state
> > through this knob.
> 
> I'd prefer that things work out of the box.  I'd be tempted to just
> bump up the latency requirement to cover the device, but if Andy
> doesn't like that in general we could add a quirk for this device
> to at least allow it to use deep power states.

Kai, can you send a patch to bump default_ps_max_latency_us to a
reasonably high value, so that all the devices you are testing are
able to use PS3/4?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* NVMe APST high latency power states being skipped
  2017-06-06  9:54                                 ` Christoph Hellwig
@ 2017-06-06 15:57                                   ` Andy Lutomirski
  2017-06-07  6:19                                     ` Kai-Heng Feng
  0 siblings, 1 reply; 21+ messages in thread
From: Andy Lutomirski @ 2017-06-06 15:57 UTC (permalink / raw)


On Tue, Jun 6, 2017@2:54 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Jun 02, 2017@12:13:37AM -0700, Christoph Hellwig wrote:
>> > Maybe add an extra knob which can directly control the deepest allowed
>> > power state? Userspace tools could control the deepest power state
>> > through this knob.
>>
>> I'd prefer that things work out of the box.  I'd be tempted to just
>> bump up the latency requirement to cover the device, but if Andy
>> doesn't like that in general we could add a quirk for this device
>> to at least allow it to use deep power states.
>
> Kai, can you send a patch to bump default_ps_max_latency_us to a
> reasonably high value, so that all the devices you are testing are
> able to use PS3/4?

I'm starting to think we should ignore enlat and only consider exlat
when we interpret the requested max latency.  If I find some time,
I'll try to write a little benchmark to see how drives actually
behave.

Given that we've seen enlat as high as 1s, I don't think we want to
start setting the default latency over 1s.

--Andy
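To make the difference between the two criteria concrete, here is a hypothetical comparison against the three power state tables quoted earlier in this thread (the filter itself is a sketch, not the kernel code; numbers are copied from the `nvme id-ctrl` output above). Under the unchanged 25000us default, the exlat-only rule would admit PS3 on all three disks, while the enlat+exlat rule admits it only on the LiteOn:

```python
# (enlat, exlat) in microseconds for the non-operational states quoted above.
TABLES = {
    "SanDisk A400":   {3: (51000, 10000), 4: (1000000, 100000)},
    "Toshiba THNSF5": {3: (5000, 25000),  4: (100000, 70000)},
    "LiteOn CX2":     {3: (5000, 5000),   4: (50000, 100000)},
}

def passing(table, budget_us, exlat_only=False):
    """States admitted under either latency criterion."""
    return [ps for ps, (enlat, exlat) in sorted(table.items())
            if (exlat if exlat_only else enlat + exlat) <= budget_us]

for name, table in TABLES.items():
    print(name,
          "enlat+exlat:", passing(table, 25000),
          "exlat only:", passing(table, 25000, exlat_only=True))
```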

^ permalink raw reply	[flat|nested] 21+ messages in thread

* NVMe APST high latency power states being skipped
  2017-06-06 15:57                                   ` Andy Lutomirski
@ 2017-06-07  6:19                                     ` Kai-Heng Feng
  0 siblings, 0 replies; 21+ messages in thread
From: Kai-Heng Feng @ 2017-06-07  6:19 UTC (permalink / raw)


On Tue, Jun 6, 2017@11:57 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Tue, Jun 6, 2017@2:54 AM, Christoph Hellwig <hch@infradead.org> wrote:
>> On Fri, Jun 02, 2017@12:13:37AM -0700, Christoph Hellwig wrote:
>>> > Maybe add an extra knob which can directly control the deepest allowed
>>> > power state? Userspace tools could control the deepest power state
>>> > through this knob.
>>>
>>> I'd prefer that things work out of the box.  I'd be tempted to just
>>> bump up the latency requirement to cover the device, but if Andy
>>> doesn't like that in general we could add a quirk for this device
>>> to at least allow it to use deep power states.
>>
>> Kai, can you send a patch to bump default_ps_max_latency_us to a
>> reasonably high value, so that all the devices you are testing are
>> able to use PS3/4?
>
> I'm starting to think we should ignore enlat and only consider exlat
> when we interpret the requested max latency.  If I find some time,
> I'll try to write a little benchmark to see how drives actually
> behave.

Suppose an NVMe device with 5ms enlat and 5ms exlat, and an Idle Time
Prior to Transition of 500ms.
I'd say that most of the time, the no-I/O period will also span the
enlat's 5ms window, hence the only impactful latency here is limited
to exlat.
The chance of hitting the full 10ms latency (the device transitioning back
to the operational state right after it enters the non-operational state)
should be low.

So I think you are right, considering only exlat will be closer to the
real-world scenario.
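A toy model of that reasoning, under loudly stated simplifying assumptions (the device starts entering the non-operational state once the idle gap exceeds ITPT, entry takes exactly enlat, and an I/O arriving mid-entry waits for the remaining enlat plus exlat; real devices may behave differently):

```python
def added_latency_us(idle_gap_us, itpt_us, enlat_us, exlat_us):
    """Extra latency the first I/O after an idle gap would observe."""
    if idle_gap_us < itpt_us:
        return 0                                   # never left the op state
    if idle_gap_us < itpt_us + enlat_us:
        # I/O lands during entry: finish entering, then exit.
        return (itpt_us + enlat_us - idle_gap_us) + exlat_us
    return exlat_us                                # entry finished while idle

# Kai-Heng's example: enlat = exlat = 5ms, ITPT = 500ms.
print(added_latency_us(400_000, 500_000, 5_000, 5_000))    # gap shorter than ITPT
print(added_latency_us(500_000, 500_000, 5_000, 5_000))    # worst case, enlat + exlat
print(added_latency_us(2_000_000, 500_000, 5_000, 5_000))  # common case, exlat only
```

In this model, the full enlat + exlat penalty only occurs when an I/O lands inside the brief enlat window right after ITPT expires, which supports filtering on exlat alone.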

>
> Given that we've seen enlat as high as 1s, I don't think we want to
> start setting the default latency over 1s.

The highest PS4 exlat on the NVMe drives at hand is 100ms. Newer NVMe
drives should have better latency.

I'll send a patch for this, thanks for the info.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2017-06-07  6:19 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-22  9:04 NVMe APST high latency power states being skipped Kai-Heng Feng
2017-05-23  7:17 ` Christoph Hellwig
2017-05-23  8:06   ` Kai-Heng Feng
2017-05-23  9:42     ` Christoph Hellwig
2017-05-23 19:35     ` Andy Lutomirski
2017-05-23 19:56       ` Mario.Limonciello
2017-05-23 20:01         ` Andy Lutomirski
2017-05-23 20:19           ` Mario.Limonciello
2017-05-23 21:11             ` Andy Lutomirski
2017-05-23 22:09               ` Mario.Limonciello
2017-05-24  4:53                 ` Kai-Heng Feng
2017-05-24  5:31                   ` Andy Lutomirski
2017-05-25  8:21                     ` Kai-Heng Feng
2017-05-26  9:25                       ` Christoph Hellwig
2017-06-01  8:19                         ` Kai-Heng Feng
2017-06-01 11:32                           ` Christoph Hellwig
2017-06-02  7:08                             ` Kai-Heng Feng
2017-06-02  7:13                               ` Christoph Hellwig
2017-06-06  9:54                                 ` Christoph Hellwig
2017-06-06 15:57                                   ` Andy Lutomirski
2017-06-07  6:19                                     ` Kai-Heng Feng
