* Perf regression after enabling nvme autonomous power state transitions
From: Tvrtko Ursulin @ 2017-03-17 10:58 UTC
  To: Andy Lutomirski; +Cc: Jens Axboe, Christoph Hellwig, LKML, Chris Wilson


Hi Andy, all,

I have bisected and verified an interesting performance regression 
caused by commit c5552fde102fcc3f2cf9e502b8ac90e3500d8fdf "nvme: Enable 
autonomous power state transitions".

Having that patch or not accounts for approx. 3% perf difference in a 
test which is, and this is the best part, not even i/o bound by any 
stretch of the imagination.

The test is multi-process with overall medium CPU usage and high GPU 
(Intel) usage. Average runtime is around 13 seconds during which it 
writes out around 14MiB of data.

It does so in chunks during the whole runtime but doesn't do anything 
special, just normal O_RDWR | O_CREAT | O_TRUNC, so in practice this is 
all written to the device only at the end of the test run in one chunk, 
via background writeback I suspect.

The 3% mentioned earlier translates to approx. 430ms longer average 
runtime with the above patch.

NVMe storage in question:

NVME Identify Controller:
vid     : 0x8086
ssvid   : 0x8086
sn      : BTPY70130HEB256D
mn      : INTEL SSDPEKKW256G7
fr      :  PSF109C
rab     : 6
ieee    : 5cd2e4
cmic    : 0
mdts    : 5
cntlid  : 1
ver     : 10200
rtd3r   : 249f0
rtd3e   : 13880
oaes    : 0
oacs    : 0x6
acl     : 4
aerl    : 7
frmw    : 0x12
lpa     : 0x3
elpe    : 63
npss    : 4
avscc   : 0
apsta   : 0x1
wctemp  : 343
cctemp  : 353
mtfa    : 20
hmpre   : 0
hmmin   : 0
tnvmcap : 0
unvmcap : 0
rpmbs   : 0
sqes    : 0x66
cqes    : 0x44
nn      : 1
oncs    : 0x1e
fuses   : 0
fna     : 0x4
vwc     : 0x1
awun    : 0
awupf   : 0
nvscc   : 0
acwu    : 0
sgls    : 0
ps    0 : mp:9.00W operational enlat:5 exlat:5 rrt:0 rrl:0
           rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:4.60W operational enlat:30 exlat:30 rrt:1 rrl:1
           rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:3.80W operational enlat:30 exlat:30 rrt:2 rrl:2
           rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0700W non-operational enlat:10000 exlat:300 rrt:3 rrl:3
           rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0050W non-operational enlat:2000 exlat:10000 rrt:4 rrl:4
           rwt:4 rwl:4 idle_power:- active_power:-
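
For completeness, the above dump should be reproducible with something 
like the following (the device node nvme0 is an assumption):

  # requires nvme-cli; dumps Identify Controller data, including the ps 0-4 table above
  nvme id-ctrl /dev/nvme0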

I see there are some latencies (unit?) mentioned here, but as the test 
does not appear to be blocking on I/O I am confused as to why this patch 
would be causing this. Nevertheless the regression is 100% repeatable.

Any ideas on what could be causing this and if there is something else 
to check or look at?

Regards,

Tvrtko


* Re: Perf regression after enabling nvme autonomous power state transitions
From: Andy Lutomirski @ 2017-03-17 15:26 UTC
  To: Tvrtko Ursulin
  Cc: Andy Lutomirski, Jens Axboe, Christoph Hellwig, LKML, Chris Wilson

On Fri, Mar 17, 2017 at 3:58 AM, Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
> Hi Andy, all,
>
> I have bisected and verified an interesting performance regression caused by
> commit c5552fde102fcc3f2cf9e502b8ac90e3500d8fdf "nvme: Enable autonomous
> power state transitions".
>
> Having that patch or not accounts for approx. 3% perf difference in a test
> which is, and this is the best part, not even i/o bound by any stretch of
> the imagination.

Weird.  Is there something in the workload that is periodically
reading or otherwise doing a synchronous IO?  APST shouldn't affect
throughput in any meaningful way, but it certainly affects latency of
sporadic IO.

>
> The test is multi-process with overall medium CPU usage and high GPU (Intel)
> usage. Average runtime is around 13 seconds during which it writes out
> around 14MiB of data.
>
> It does so in chunks during the whole runtime but doesn't do anything
> special, just normal O_RDWR | O_CREAT | O_TRUNC, so in practice this is
> all written to the device only at the end of the test run in one chunk,
> via background writeback I suspect.
>
> The 3% mentioned earlier translates to approx. 430ms longer average runtime
> with the above patch.

That is surprisingly long.  If your device is telling the truth about
its latencies, this should be literally impossible.

> ps    0 : mp:9.00W operational enlat:5 exlat:5 rrt:0 rrl:0
>           rwt:0 rwl:0 idle_power:- active_power:-

This is the normal on state.

> ps    1 : mp:4.60W operational enlat:30 exlat:30 rrt:1 rrl:1
>           rwt:1 rwl:1 idle_power:- active_power:-
> ps    2 : mp:3.80W operational enlat:30 exlat:30 rrt:2 rrl:2
>           rwt:2 rwl:2 idle_power:- active_power:-

These won't be used at all.

> ps    3 : mp:0.0700W non-operational enlat:10000 exlat:300 rrt:3 rrl:3
>           rwt:3 rwl:3 idle_power:- active_power:-
> ps    4 : mp:0.0050W non-operational enlat:2000 exlat:10000 rrt:4 rrl:4
>           rwt:4 rwl:4 idle_power:- active_power:-

These two have their latencies specified in microseconds and will be
used autonomously.

Can you re-test with
/sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us set to the
following values and report back?

0 - will turn APST off entirely
1 - will leave APST on but will not program any transitions
11000 - should allow state 3 but not state 4
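
For example (as root; using the sysfs path above, with nvme0 assumed to 
be the right controller):

  # 0: APST off entirely
  echo 0 > /sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us

  # 1: APST stays on, but no transitions are programmed
  echo 1 > /sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us

  # 11000: ps 3 fits (enlat 10000 + exlat 300 = 10300us), ps 4 (12000us) does not
  echo 11000 > /sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us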

Thanks
Andy


* Re: Perf regression after enabling nvme autonomous power state transitions
From: Tvrtko Ursulin @ 2017-03-20 11:14 UTC
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Jens Axboe, Christoph Hellwig, LKML, Chris Wilson


On 17/03/2017 15:26, Andy Lutomirski wrote:
> On Fri, Mar 17, 2017 at 3:58 AM, Tvrtko Ursulin
> <tvrtko.ursulin@linux.intel.com> wrote:
>>
>> Hi Andy, all,
>>
>> I have bisected and verified an interesting performance regression caused by
>> commit c5552fde102fcc3f2cf9e502b8ac90e3500d8fdf "nvme: Enable autonomous
>> power state transitions".
>>
>> Having that patch or not accounts for approx. 3% perf difference in a test
>> which is, and this is the best part, not even i/o bound by any stretch of
>> the imagination.
>
> Weird.  Is there something in the workload that is periodically
> reading or otherwise doing a synchronous IO?  APST shouldn't affect
> throughput in any meaningful way, but it certainly affects latency of
> sporadic IO.

Not as far as I can see. All output files are opened in the most basic 
way and there are no sync syscalls of any sort in the trace. Reads are 
all coming from the page cache.

I did look more closely at the I/O stats (I use "vmstat 1" for this), 
and it looks like there is some I/O happening during the runtime. I was 
surprised by that since the total data written out is only ~14MiB as I 
said. (And those are the only writes the test does, apart from 
stdout/err at the beginning and end.)

I tried increasing dirty_*_bytes values to some huge ones to try to 
eliminate any potential background or direct writeback during the test 
run, but that did not seem to help. I could still see a tiny bit of 
I/O, a few KiB here and there occasionally, and then a large 14MiB dump 
at the end of a test run.

Any ideas on how to figure out who exactly is doing these writes and why?

>> The test is multi-process with overall medium CPU usage and high GPU (Intel)
>> usage. Average runtime is around 13 seconds during which it writes out
>> around 14MiB of data.
>>
>> It does so in chunks during the whole runtime but doesn't do anything
>> special, just normal O_RDWR | O_CREAT | O_TRUNC, so in practice this is
>> all written to the device only at the end of the test run in one chunk,
>> via background writeback I suspect.
>>
>> The 3% mentioned earlier translates to approx. 430ms longer average runtime
>> with the above patch.
>
> That is surprisingly long.  If your device is telling the truth about
> its latencies, this should be literally impossible.
>
>> ps    0 : mp:9.00W operational enlat:5 exlat:5 rrt:0 rrl:0
>>           rwt:0 rwl:0 idle_power:- active_power:-
>
> This is the normal on state.
>
>> ps    1 : mp:4.60W operational enlat:30 exlat:30 rrt:1 rrl:1
>>           rwt:1 rwl:1 idle_power:- active_power:-
>> ps    2 : mp:3.80W operational enlat:30 exlat:30 rrt:2 rrl:2
>>           rwt:2 rwl:2 idle_power:- active_power:-
>
> These won't be used at all.
>
>> ps    3 : mp:0.0700W non-operational enlat:10000 exlat:300 rrt:3 rrl:3
>>           rwt:3 rwl:3 idle_power:- active_power:-
>> ps    4 : mp:0.0050W non-operational enlat:2000 exlat:10000 rrt:4 rrl:4
>>           rwt:4 rwl:4 idle_power:- active_power:-
>
> These two have their latencies specified in microseconds and will be
> used autonomously.
>
> Can you re-test with
> /sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us set to the
> following values and report back?
>
> 0 - will turn APST off entirely
> 1 - will leave APST on but will not program any transitions
> 11000 - should allow state 3 but not state 4

11000 seems to be enough to bring the results back to the expected 
level. 1 and 0 do not seem to have any further effect.

I've also tried making the test write its output to /dev/null and that 
had the same beneficial result, even with the default "25000" setting 
for pm_qos_latency_tolerance_us.

So I suspected unexpected direct writeback was happening. But as I 
wrote above, increasing the dirty_bytes (or dirty_background_bytes, or 
the expire controls) did not help. I used 200MB and 20s for all the 
fields, which I assumed would have caused everything to go quiet during 
the test run, but it didn't.
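
Roughly, that corresponds to something like the following (assuming the 
usual vm.dirty_* sysctls were the knobs in question; 200MB and 20s 
converted into the units those knobs take):

  # size-triggered writeback thresholds raised to ~200MB
  sysctl -w vm.dirty_bytes=209715200
  sysctl -w vm.dirty_background_bytes=209715200

  # time-based expiry and writeback interval pushed out to 20s (values in centiseconds)
  sysctl -w vm.dirty_expire_centisecs=2000
  sysctl -w vm.dirty_writeback_centisecs=2000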

Regards,

Tvrtko


* Re: Perf regression after enabling nvme autonomous power state transitions
From: Andy Lutomirski @ 2017-03-20 21:23 UTC
  To: Tvrtko Ursulin
  Cc: Andy Lutomirski, Jens Axboe, Christoph Hellwig, LKML, Chris Wilson

On Mon, Mar 20, 2017 at 4:14 AM, Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
> On 17/03/2017 15:26, Andy Lutomirski wrote:
>>
>> On Fri, Mar 17, 2017 at 3:58 AM, Tvrtko Ursulin
>> <tvrtko.ursulin@linux.intel.com> wrote:
>>>
>>>
>>> Hi Andy, all,
>>>
>>> I have bisected and verified an interesting performance regression caused
>>> by
>>> commit c5552fde102fcc3f2cf9e502b8ac90e3500d8fdf "nvme: Enable autonomous
>>> power state transitions".
>>>
>>> Having that patch or not accounts for approx. 3% perf difference in a
>>> test
>>> which is, and this is the best part, not even i/o bound by any stretch of
>>> the imagination.
>>
>>
>> Weird.  Is there something in the workload that is periodically
>> reading or otherwise doing a synchronous IO?  APST shouldn't affect
>> throughput in any meaningful way, but it certainly affects latency of
>> sporadic IO.
>
>
> Not as far as I can see. All output files are opened in the most basic
> way and there are no sync syscalls of any sort in the trace. Reads are
> all coming from the page cache.
>
> I did look more closely at the I/O stats (I use "vmstat 1" for this),
> and it looks like there is some I/O happening during the runtime. I was
> surprised by that since the total data written out is only ~14MiB as I
> said. (And those are the only writes the test does, apart from
> stdout/err at the beginning and end.)
>
> I tried increasing dirty_*_bytes values to some huge ones to try to
> eliminate any potential background or direct writeback during the test
> run, but that did not seem to help. I could still see a tiny bit of
> I/O, a few KiB here and there occasionally, and then a large 14MiB dump
> at the end of a test run.
>
> Any ideas on how to figure out who exactly is doing these writes and why?
>

You could try playing with iosnoop or blktrace.
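
For example, something along these lines (device node nvme0n1 assumed; 
iosnoop meaning the script from the perf-tools collection):

  # blktrace/blkparse: decode block I/O live; the output includes the
  # name of the process that issued each request
  blktrace -d /dev/nvme0n1 -o - | blkparse -i -

  # or iosnoop, which prints one line per I/O with COMM and PID
  ./iosnoop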

>
>>> The test is multi-process with overall medium CPU usage and high GPU
>>> (Intel)
>>> usage. Average runtime is around 13 seconds during which it writes out
>>> around 14MiB of data.
>>>
>>> It does so in chunks during the whole runtime but doesn't do anything
>>> special, just normal O_RDWR | O_CREAT | O_TRUNC, so in practice this is
>>> all written to the device only at the end of the test run in one chunk,
>>> via background writeback I suspect.
>>>
>>> The 3% mentioned earlier translates to approx. 430ms longer average
>>> runtime
>>> with the above patch.
>>
>>
>> That is surprisingly long.  If your device is telling the truth about
>> its latencies, this should be literally impossible.
>>
>>> ps    0 : mp:9.00W operational enlat:5 exlat:5 rrt:0 rrl:0
>>>           rwt:0 rwl:0 idle_power:- active_power:-
>>
>>
>> This is the normal on state.
>>
>>> ps    1 : mp:4.60W operational enlat:30 exlat:30 rrt:1 rrl:1
>>>           rwt:1 rwl:1 idle_power:- active_power:-
>>> ps    2 : mp:3.80W operational enlat:30 exlat:30 rrt:2 rrl:2
>>>           rwt:2 rwl:2 idle_power:- active_power:-
>>
>>
>> These won't be used at all.
>>
>>> ps    3 : mp:0.0700W non-operational enlat:10000 exlat:300 rrt:3 rrl:3
>>>           rwt:3 rwl:3 idle_power:- active_power:-
>>> ps    4 : mp:0.0050W non-operational enlat:2000 exlat:10000 rrt:4 rrl:4
>>>           rwt:4 rwl:4 idle_power:- active_power:-
>>
>>
>> These two have their latencies specified in microseconds and will be
>> used autonomously.
>>
>> Can you re-test with
>> /sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us set to the
>> following values and report back?
>>
>> 0 - will turn APST off entirely
>> 1 - will leave APST on but will not program any transitions
>> 11000 - should allow state 3 but not state 4
>
>
> 11000 seems to be enough to bring the results back to the expected
> level. 1 and 0 do not seem to have any further effect.
>
> I've also tried making the test write its output to /dev/null and that
> had the same beneficial result, even with the default "25000" setting
> for pm_qos_latency_tolerance_us.

My best guess is that your disk lies and the latency in ps 4 is just
extraordinarily high.

--Andy

