linux-kernel.vger.kernel.org archive mirror
* NO_HZ_IDLE causes consistently low cpu "iowait" time (and higher cpu "idle" time)
@ 2019-07-01 15:33 Alan Jenkins
  2019-07-03 14:06 ` Doug Smythies
  0 siblings, 1 reply; 6+ messages in thread
From: Alan Jenkins @ 2019-07-01 15:33 UTC (permalink / raw)
  To: linux-pm, linux-kernel

Hi

I tried running a simple test:

     dd if=testfile iflag=direct bs=1M of=/dev/null

With my default settings, `vmstat 10` shows something like 85% idle time 
to 15% iowait time. I have 4 CPUs, so this is much less than one CPU 
worth of iowait time.

If I boot with "nohz=off", I see idle time fall to 75% or below, and 
iowait rise to about 25%, equivalent to one CPU.  That is what I had 
originally expected.

(I can also see my expected numbers, if I disable *all* C-states and 
force polling using `pm_qos_resume_latency_us` in sysfs).
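
A rough sketch of how I force polling that way (the exact value written is
from memory - I believe the special value "n/a" means no resume latency is
tolerated, which leaves only the POLL state available):

    # a sketch, as root: "n/a" should mean "no resume latency tolerated",
    # which restricts cpuidle to the POLL state; write 0 back to restore
    # the default ("no restriction")
    for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
        echo n/a > "$cpu/power/pm_qos_resume_latency_us"
    done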

The numbers above are from a kernel somewhere around v5.2-rc5.  I saw 
the "wrong" results on some previous kernels as well.  I just now 
realized the link to NO_HZ_IDLE.[1]

[1] 
https://unix.stackexchange.com/questions/517757/my-basic-assumption-about-system-iowait-does-not-hold/527836#527836

I did not find any information about this high level of inaccuracy. Can 
anyone explain, is this behaviour expected?

I found several patches that mentioned "iowait" and NO_HZ_IDLE. But if 
they described this problem, it was not clear to me.

I thought this might also be affecting the "IO pressure" values from the 
new "pressure stall information"... but I am too confused already, so I 
am only asking about iowait at the moment :-).[2]

[2] 
https://unix.stackexchange.com/questions/527342/why-does-the-new-linux-pressure-stall-information-for-io-not-show-as-100/527347#527347

I have seen the disclaimers for iowait in 
Documentation/filesystems/proc.txt, and the derived man page. 
Technically, the third disclaimer might cover anything.  But I was 
optimistic; I hoped it was talking about relatively small glitches :-).  
I didn't think it would mean a large systematic undercounting, which 
applied to the vast majority of current systems (which are not tuned for 
realtime use).

> - iowait: In a word, iowait stands for waiting for I/O to complete. But there
>  are several problems:
>  1. Cpu will not wait for I/O to complete, iowait is the time that a task is
>     waiting for I/O to complete. When cpu goes into idle state for
>     outstanding task io, another task will be scheduled on this CPU.
>  2. In a multi-core CPU, the task waiting for I/O to complete is not running
>     on any CPU, so the iowait of each CPU is difficult to calculate.
>  3. The value of iowait field in /proc/stat will decrease in certain
>     conditions


Thanks for all the power-saving code
Alan

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: NO_HZ_IDLE causes consistently low cpu "iowait" time (and higher cpu "idle" time)
  2019-07-01 15:33 NO_HZ_IDLE causes consistently low cpu "iowait" time (and higher cpu "idle" time) Alan Jenkins
@ 2019-07-03 14:06 ` Doug Smythies
  2019-07-03 16:09   ` Alan Jenkins
  2019-07-05 11:25   ` iowait v.s. idle accounting is "inconsistent" - iowait is too low Alan Jenkins
  0 siblings, 2 replies; 6+ messages in thread
From: Doug Smythies @ 2019-07-03 14:06 UTC (permalink / raw)
  To: 'Alan Jenkins'; +Cc: linux-pm, linux-kernel

On 2019.07.01 08:34 Alan Jenkins wrote:

> Hi

Hi,

> I tried running a simple test:
>
>     dd if=testfile iflag=direct bs=1M of=/dev/null
>
> With my default settings, `vmstat 10` shows something like 85% idle time 
> to 15% iowait time. I have 4 CPUs, so this is much less than one CPU 
> worth of iowait time.
>
> If I boot with "nohz=off", I see idle time fall to 75% or below, and 
> iowait rise to about 25%, equivalent to one CPU.  That is what I had 
> originally expected.
>
> (I can also see my expected numbers, if I disable *all* C-states and 
> force polling using `pm_qos_resume_latency_us` in sysfs).
>
> The numbers above are from a kernel somewhere around v5.2-rc5.  I saw 
> the "wrong" results on some previous kernels as well.  I just now 
> realized the link to NO_HZ_IDLE.[1]
>
> [1] 
> https://unix.stackexchange.com/questions/517757/my-basic-assumption-about-system-iowait-does-not-hold/527836#527836
>
> I did not find any information about this high level of inaccuracy. Can 
> anyone explain, is this behaviour expected?

I'm not commenting on whether or not the behaviour is expected, just that
it is inconsistent.

>
> I found several patches that mentioned "iowait" and NO_HZ_IDLE. But if 
> they described this problem, it was not clear to me.
>
> I thought this might also be affecting the "IO pressure" values from the 
> new "pressure stall information"... but I am too confused already, so I 
> am only asking about iowait at the moment :-).

Using your workload, I confirm inconsistent behaviour for /proc/stat
(which vmstat uses) between kernels 4.15, 4.16, and 4.17:
4.15 does what you expect, whether idle states are enabled or disabled.
4.16 doesn't do what you expect either way (although the results are a little erratic).
>= 4.17 does what you expect with only idle state 0 enabled, and doesn't otherwise.

Actual test data from vmstat (/proc/stat) (8 CPUs, 12.5% = 1 CPU):
Kernel	idle/iowait %	Idle states >= 1
4.15		88/12			enabled
4.15		88/12			disabled
4.16		99/1			enabled
4.16		99/1			disabled
4.17		98/2			enabled
4.17		88/12			disabled

Note 1: I never booted with "nohz=off" because the tick never turns off for
idle state 0, which is good enough for testing.

Note 2: Myself, I don't use /proc/stat for idle time statistics. I use:
/sys/devices/system/cpu/cpu*/cpuidle/state*/time
And they seem to always be consistent at the higher idle percentage number.
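
For example (only a rough sketch), those cumulative counters (microseconds
of residency) can be sampled twice and differenced, to compare against
/proc/stat over an interval:

    # total cpuidle residency across all CPUs and states, in microseconds
    sum_idle() {
        cat /sys/devices/system/cpu/cpu*/cpuidle/state*/time |
            awk '{ t += $1 } END { print t }'
    }
    t1=$(sum_idle); sleep 10; t2=$(sum_idle)
    echo "cpuidle residency over 10s: $(( t2 - t1 )) us"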

Unless someone has some insight, the next step is kernel bisection,
once between kernels 4.15 and 4.16, then again between 4.16 and 4.17.
The second bisection might go faster with knowledge gained from the first.
Alan: Can you do kernel bisection? I can only do it starting maybe Friday.
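
For reference, the procedure I have in mind is just a plain git bisect
between the release tags, re-running the dd test at each step - a sketch:

    git bisect start
    git bisect bad v4.16        # kernel that shows the "wrong" iowait
    git bisect good v4.15       # kernel that shows the expected iowait
    # build and boot each candidate, run the dd test, then mark it with
    # "git bisect good" or "git bisect bad" until git prints the first
    # bad commit; then repeat the whole procedure for v4.16..v4.17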

... Doug



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: NO_HZ_IDLE causes consistently low cpu "iowait" time (and higher cpu "idle" time)
  2019-07-03 14:06 ` Doug Smythies
@ 2019-07-03 16:09   ` Alan Jenkins
  2019-07-05 11:25   ` iowait v.s. idle accounting is "inconsistent" - iowait is too low Alan Jenkins
  1 sibling, 0 replies; 6+ messages in thread
From: Alan Jenkins @ 2019-07-03 16:09 UTC (permalink / raw)
  To: Doug Smythies; +Cc: linux-pm, linux-kernel

On 03/07/2019 15:06, Doug Smythies wrote:
> On 2019.07.01 08:34 Alan Jenkins wrote:
>
>> Hi
> Hi,
>
>> I tried running a simple test:
>>
>>      dd if=testfile iflag=direct bs=1M of=/dev/null
>>
>> With my default settings, `vmstat 10` shows something like 85% idle time
>> to 15% iowait time. I have 4 CPUs, so this is much less than one CPU
>> worth of iowait time.
>>
>> If I boot with "nohz=off", I see idle time fall to 75% or below, and
>> iowait rise to about 25%, equivalent to one CPU.  That is what I had
>> originally expected.
>>
>> (I can also see my expected numbers, if I disable *all* C-states and
>> force polling using `pm_qos_resume_latency_us` in sysfs).
>>
>> The numbers above are from a kernel somewhere around v5.2-rc5.  I saw
>> the "wrong" results on some previous kernels as well.  I just now
>> realized the link to NO_HZ_IDLE.[1]
>>
>> [1]
>> https://unix.stackexchange.com/questions/517757/my-basic-assumption-about-system-iowait-does-not-hold/527836#527836
>>
>> I did not find any information about this high level of inaccuracy. Can
>> anyone explain, is this behaviour expected?
> I'm not commenting on whether or not the behaviour is expected, just that
> it is inconsistent.
>
>> I found several patches that mentioned "iowait" and NO_HZ_IDLE. But if
>> they described this problem, it was not clear to me.
>>
>> I thought this might also be affecting the "IO pressure" values from the
>> new "pressure stall information"... but I am too confused already, so I
>> am only asking about iowait at the moment :-).
> Using your workload, I confirm inconsistent behaviour for /proc/stat
> (which vmstat uses) between kernels 4.15, 4.16, and 4.17:
> 4.15 does what you expect, whether idle states are enabled or disabled.
> 4.16 doesn't do what you expect either way (although the results are a little erratic).
> >= 4.17 does what you expect with only idle state 0 enabled, and doesn't otherwise.
> Actual test data from vmstat (/proc/stat) (8 CPUs, 12.5% = 1 CPU):
> Kernel	idle/iowait %	Idle states >= 1
> 4.15		88/12			enabled
> 4.15		88/12			disabled
> 4.16		99/1			enabled
> 4.16		99/1			disabled
> 4.17		98/2			enabled
> 4.17		88/12			disabled
>
> Note 1: I never booted with "nohz=off" because the tick never turns off for
> idle state 0, which is good enough for testing.
>
> Note 2: Myself, I don't use /proc/stat for idle time statistics. I use:
> /sys/devices/system/cpu/cpu*/cpuidle/state*/time
> And they seem to always be consistent at the higher idle percentage number.
>
> Unless someone has some insight, the next step is kernel bisection,
> once between kernels 4.15 and 4.16, then again between 4.16 and 4.17.
> The second bisection might go faster with knowledge gained from the first.
> Alan: Can you do kernel bisection? I can only do it starting maybe Friday.
>
> ... Doug

Thanks for your help Doug!

I wish I had a faster CPU :-), but I'm familiar with bisection. I have 
started and I'm down to about 8 minute builds, so I can probably be done 
before Friday.

Alan

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: iowait v.s. idle accounting is "inconsistent" - iowait is too low
  2019-07-03 14:06 ` Doug Smythies
  2019-07-03 16:09   ` Alan Jenkins
@ 2019-07-05 11:25   ` Alan Jenkins
  2019-07-05 11:38     ` Peter Zijlstra
  1 sibling, 1 reply; 6+ messages in thread
From: Alan Jenkins @ 2019-07-05 11:25 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, linux-kernel; +Cc: Doug Smythies, linux-pm

Hi, scheduler experts!

My cpu "iowait" time appears to be reported incorrectly.  Do you know 
why this could happen?

Doug helped me - it was he who noticed different behaviour on v4.15 vs 
v4.16 vs v4.17+.  So I have some confirmation of this.  I don't think 
Doug mentioned what hardware he ran the kernels on.  lscpu says my 
hardware is "Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz".

I tested using "dd" :-

dd if=bigfile bs=1M iflag=direct of=/dev/null

  (1) E.g. I get the expected result if I pin "dd" to the "right" cpu.  
Which cpu varies; it has often been cpu2.  At the moment I have booted 
5.2-rc5-ish.  Here I run "dd" with "taskset -c 0", and "iowait" worked 
as expected:

top - 11:01:47 up 15:10,  2 users,  load average: 1.07, 0.86, 0.86
Tasks: 288 total,   1 running, 287 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.3 us,  3.3 sy,  0.0 ni,  0.0 id, 94.3 wa,  0.7 hi,  1.3 si,  0.0 st
%Cpu1  :  0.3 us,  0.7 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.3 hi,  0.0 si,  0.0 st
%Cpu2  :  1.0 us,  0.7 sy,  0.0 ni, 98.0 id,  0.0 wa,  0.3 hi,  0.0 si,  0.0 st
%Cpu3  :  0.7 us,  0.3 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   7854.0 total,    432.2 free,   4616.4 used,   2805.4 buff/cache
MiB Swap:   2048.0 total,   1978.2 free,     69.8 used.   2498.0 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
31849 alan-sy+  20   0  216052   2836   1800 D   3.0   0.0   0:00.58 dd
24220 alan-sy+  20   0 3339828 232160 126720 S   0.7   2.9   1:53.14 gnome-shell
...

(I have also used "atop" and "vmstat 3". "atop" shows both total and 
per-cpu iowait, idle, etc. "vmstat 3" just shows a total, but all the 
old values stay on-screen).

  (2) But compare running "dd" with "taskset -c 1":

%Cpu0  :  0.3 us,  0.3 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.7 hi,  1.3 si,  0.0 st
%Cpu1  :  0.3 us,  3.0 sy,  0.0 ni, 83.7 id, 12.6 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu2  :  1.0 us,  0.3 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  1.3 us,  0.7 sy,  0.0 ni, 98.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

  (3) If I don't use "taskset", "dd" generally doesn't sit on the 
"right" cpu, and so I get don't see the right "iowait".  Here's "top -d 
30".  Over this longer interval, "dd" appears to spend a quarter of its 
time on the "right" cpu:

%Cpu0  :  0.5 us,  1.2 sy,  0.0 ni, 74.4 id, 22.2 wa,  0.5 hi,  1.3 si,  0.0 st
%Cpu1  :  0.7 us,  1.3 sy,  0.0 ni, 92.3 id,  5.7 wa,  0.1 hi,  0.0 si,  0.0 st
%Cpu2  :  0.5 us,  0.9 sy,  0.0 ni, 95.1 id,  3.3 wa,  0.1 hi,  0.0 si,  0.0 st
%Cpu3  :  0.7 us,  0.9 sy,  0.0 ni, 94.6 id,  3.6 wa,  0.1 hi,  0.0 si,  0.0 st

This point (3) does not apply to 4.15.  On 4.15, it seems "dd" naturally 
sits on the "right" cpu, so I get the "right" iowait". But if I pin "dd" 
to a different cpu, I get the "wrong" iowait again.

I bisected 4.15-4.16.  The first "bad" commit was 806486c377e3 
"sched/fair: Do not migrate if the prev_cpu is idle"

  (4) I can get the "right" iowait regardless of which cpu, if I boot 
with "nohz=off", or if I suppress nohz by dynamically disabling all 
cpuidle states except for state0 (POLL).
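
For reference, disabling the deeper states looks roughly like this (a
sketch, using the per-state "disable" attributes; state0 is POLL here, and
leaving it as the only enabled state keeps the tick running):

    # as root; write 0 to the same files to re-enable the states
    for f in /sys/devices/system/cpu/cpu*/cpuidle/state[1-9]*/disable; do
        echo 1 > "$f"
    done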

This point (4) does not apply to 4.16: on 4.16, suppressing nohz does not 
help.  (So far, I have not tested this on 4.15.)

I bisected 4.16-4.17.    The first "new" commit was 554c8aa8ecad "sched: 
idle: Select idle state before stopping the tick"

  (5) I seem to get the "right" iowait regardless of which cpu, if I run 
inside a virtual machine.  I tested stock Fedora v5.1.? inside a KVM 
(virt-manager) virtual machine, which also had 4 cpus.  Whereas 
un-virtualized Fedora v5.1.? on my laptop behaves as per points 1-4.

I read the documented limitations for "iowait" time. As far as I 
understand them, they don't explain such inconsistent values.

> - iowait: In a word, iowait stands for waiting for I/O to complete. But there
> are several problems:
>
> 1. Cpu will not wait for I/O to complete, iowait is the time that a task is
>    waiting for I/O to complete. When cpu goes into idle state for
>    outstanding task io, another task will be scheduled on this CPU.
>
> 2. In a multi-core CPU, the task waiting for I/O to complete is not running
>    on any CPU, so the iowait of each CPU is difficult to calculate.
>
> 3. The value of iowait field in /proc/stat will decrease in certain
>    conditions


Thanks for all the kernels
Alan

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: iowait v.s. idle accounting is "inconsistent" - iowait is too low
  2019-07-05 11:25   ` iowait v.s. idle accounting is "inconsistent" - iowait is too low Alan Jenkins
@ 2019-07-05 11:38     ` Peter Zijlstra
  2019-07-05 13:37       ` Alan Jenkins
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2019-07-05 11:38 UTC (permalink / raw)
  To: Alan Jenkins; +Cc: Ingo Molnar, linux-kernel, Doug Smythies, linux-pm

On Fri, Jul 05, 2019 at 12:25:46PM +0100, Alan Jenkins wrote:
> Hi, scheduler experts!
> 
> My cpu "iowait" time appears to be reported incorrectly.  Do you know why
> this could happen?

Because iowait is a magic random number that has no sane meaning.
Personally I'd prefer to just delete the whole thing, except ABI :/

Also see the comment near nr_iowait():

/*
 * IO-wait accounting, and how its mostly bollocks (on SMP).
 *
 * The idea behind IO-wait account is to account the idle time that we could
 * have spend running if it were not for IO. That is, if we were to improve the
 * storage performance, we'd have a proportional reduction in IO-wait time.
 *
 * This all works nicely on UP, where, when a task blocks on IO, we account
 * idle time as IO-wait, because if the storage were faster, it could've been
 * running and we'd not be idle.
 *
 * This has been extended to SMP, by doing the same for each CPU. This however
 * is broken.
 *
 * Imagine for instance the case where two tasks block on one CPU, only the one
 * CPU will have IO-wait accounted, while the other has regular idle. Even
 * though, if the storage were faster, both could've ran at the same time,
 * utilising both CPUs.
 *
 * This means, that when looking globally, the current IO-wait accounting on
 * SMP is a lower bound, by reason of under accounting.
 *
 * Worse, since the numbers are provided per CPU, they are sometimes
 * interpreted per CPU, and that is nonsensical. A blocked task isn't strictly
 * associated with any one particular CPU, it can wake to another CPU than it
 * blocked on. This means the per CPU IO-wait number is meaningless.
 *
 * Task CPU affinities can make all that even more 'interesting'.
 */
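
The two-tasks case is easy to see in practice; a sketch (the file names are
just placeholders, and pinning both readers forces them to block on the same
CPU, which is the affinity wrinkle mentioned in the last paragraph above):

    # two IO-bound readers share cpu0: cpu0's iowait can never account
    # more than one CPU's worth of time, even though two tasks are waiting
    taskset -c 0 dd if=testfile1 iflag=direct bs=1M of=/dev/null &
    taskset -c 0 dd if=testfile2 iflag=direct bs=1M of=/dev/null &
    vmstat 10    # "wa" stays bounded by one CPU's share of the total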



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: iowait v.s. idle accounting is "inconsistent" - iowait is too low
  2019-07-05 11:38     ` Peter Zijlstra
@ 2019-07-05 13:37       ` Alan Jenkins
  0 siblings, 0 replies; 6+ messages in thread
From: Alan Jenkins @ 2019-07-05 13:37 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Doug Smythies, linux-pm

On 05/07/2019 12:38, Peter Zijlstra wrote:
> On Fri, Jul 05, 2019 at 12:25:46PM +0100, Alan Jenkins wrote:
>> Hi, scheduler experts!
>>
>> My cpu "iowait" time appears to be reported incorrectly.  Do you know why
>> this could happen?
> Because iowait is a magic random number that has no sane meaning.
> Personally I'd prefer to just delete the whole thing, except ABI :/
>
> Also see the comment near nr_iowait():
>
> /*
>   * IO-wait accounting, and how its mostly bollocks (on SMP).
>   *
>   * The idea behind IO-wait account is to account the idle time that we could
>   * have spend running if it were not for IO. That is, if we were to improve the
>   * storage performance, we'd have a proportional reduction in IO-wait time.
>   *
>   * This all works nicely on UP, where, when a task blocks on IO, we account
>   * idle time as IO-wait, because if the storage were faster, it could've been
>   * running and we'd not be idle.
>   *
>   * This has been extended to SMP, by doing the same for each CPU. This however
>   * is broken.
>   *
>   * Imagine for instance the case where two tasks block on one CPU, only the one
>   * CPU will have IO-wait accounted, while the other has regular idle. Even
>   * though, if the storage were faster, both could've ran at the same time,
>   * utilising both CPUs.
>   *
>   * This means, that when looking globally, the current IO-wait accounting on
>   * SMP is a lower bound, by reason of under accounting.
>   *
>   * Worse, since the numbers are provided per CPU, they are sometimes
>   * interpreted per CPU, and that is nonsensical. A blocked task isn't strictly
>   * associated with any one particular CPU, it can wake to another CPU than it
>   * blocked on. This means the per CPU IO-wait number is meaningless.
>   *
>   * Task CPU affinities can make all that even more 'interesting'.
>   */

Thanks. I read those as somewhat different problems, but I understand you 
to mean there is not much demand (or point) in "fixing" my issue.

>  (2) Compare running "dd" with "taskset -c 1":
>
> %Cpu1  :  0.3 us,  3.0 sy,  0.0 ni, 83.7 id, 12.6 wa,  0.0 hi,  0.3 si,  0.0 st 

                                       ^ non-zero idle time for Cpu1, despite the pinned IO hog.


The block layer recently decided they could break "disk busy%" reporting 
for slow devices (mechanical HDD), in order to reduce overheads for fast 
devices.  This means the summary view in "atop" now lacks any reliable 
indicator.

I suppose I need to look in "iotop".

The new /proc/pressure/io seems to have caveats related to the iowait 
issues... it seems even more complex to interpret for this case, and it 
does not seem to work the way I think it does.[1]
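
For what it's worth, I have just been sampling the raw interface alongside
vmstat - a sketch, with my reading of the fields in the comments:

    watch -n 10 cat /proc/pressure/io
    # each sample has two lines of the form
    #   some avg10=... avg60=... avg300=... total=...
    #   full avg10=... avg60=... avg300=... total=...
    # as I understand it, the avg* fields are the percentage of recent
    # wall-clock time with tasks stalled on IO, and "total" is the
    # cumulative stall time in microseconds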

Regards
Alan

[1] 
https://unix.stackexchange.com/questions/527342/why-does-the-new-linux-pressure-stall-information-for-io-not-show-as-100/

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2019-07-05 13:37 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-01 15:33 NO_HZ_IDLE causes consistently low cpu "iowait" time (and higher cpu "idle" time) Alan Jenkins
2019-07-03 14:06 ` Doug Smythies
2019-07-03 16:09   ` Alan Jenkins
2019-07-05 11:25   ` iowait v.s. idle accounting is "inconsistent" - iowait is too low Alan Jenkins
2019-07-05 11:38     ` Peter Zijlstra
2019-07-05 13:37       ` Alan Jenkins
