LKML Archive on
 help / color / Atom feed
From: Hidetoshi Seto <>
Cc: Fernando Luis Vazquez Cao <>,
	Tetsuo Handa <>,
	Frederic Weisbecker <>,
	Thomas Gleixner <>,
	Ingo Molnar <>,
	Peter Zijlstra <>,
	Andrew Morton <>,
	Arjan van de Ven <>,
	Oleg Nesterov <>,
	Preeti U Murthy <>,
	Denys Vlasenko <>
Subject: [RFC PATCH 0/8] rework iowait accounting
Date: Thu, 26 Jun 2014 18:06:23 +0900
Message-ID: <> (raw)

This mail is 5th try to fix an issue that iowait of /proc/stat can
go backward. Originally reported by Tetsuo and Fernando at last year,
Mar 2013.

Previous v1-v4 were proposal to apply my patch set, but this v5 is
request for comment, with untested patches to draw up a blueprint.


  Tetsuo and Fernando reported that the idle/iowait time obtained
  through /proc/stat is not monotonic.

  There were 2 causes:

  [1]: sleep stats has no exclusive control
    (This is one of what my previous patch aimed to solve)

    Since NO_HZ kernel can stop tick interrupts while the cpu
    is in idle, sleep stats (=cpu's idle time and iowait time)
    are not updated constantly. It means if we refer stats of
    idle cpu then it will contain stale values updated before
    the cpu start idle. To avoid this, kernel records timestamp
    at idle entry and provide function to return calculated
    (rather) proper stats by adding duration of idle.

    However there was no locking between readers/writer of
    the sleep stats and timestamp. So in cases if sleep stats
    are updated while calculation for reader is in progress,
    calculated stats will go wrong and lose monotonicity with
    previous/next values.

    This cause backwarding trouble in both of idle and iowait.

  [2]: zeroing nr_iowait is not tracked in long idle

    Assume that a cpu was in idle for 30 ticks, and that nr_iowait
    was dropped to 0 between 17th tick and 18th tick. In constant
    updating manner by tick interrupts, iowait get 17 ticks to be
    accounted and after that idle get 13 ticks to be accounted.

    Then, how about in NO_HZ kernel?
    Let see the following short simulation:

    * given:
    *   idle time stats: idle=1000, iowait=100
    *   stamp at idle entry: entry=50
    *   nr tasks waiting io: nr_iowait=1
    * 1st reader temporary assigns delta as iowait
    * (but does not account delta to time stats)):
    *   1st reader's query @ now = 60:
    *     idle=1000
    *     iowait=110 (=100+(60-50))
    * then blocked task is queued to runqueue of other active cpu,
    * woken up from io_schedule{,_timeout}() and do atomic_dec()
    * from the remote:
    *   nr_iowait=0
    * and in 2nd turn reader assign delta as idle:
    *   2nd reader's query @ now = 70:
    *     idle=1020 (=1000+(70-50))
    *     iowait=100
    * at last cpu wakes up @ now = 80:
    *   idle time stats: idle=1030, iowait=100

    You can see that iowait is decreased from 110 to 100.
    And there were more than 10 ticks should have been accounted
    as iowait but not accounted at all at last.

    This cause backwarding trouble only in iowait. Because nr_iowait
    is never incremented from remote cpus.

  Solution for [1] is easy - just add proper exclusive control.
  But I stumbled over the [2].


  I'd like to point following facts:

  - s390 kernel is supposed to have problem [2]

    Not only NO_HZ kernel but also s390 kernel can stop tick
    interrupts while the cpu is in idle. Therefore s390 provides
    function s390_get_idle_time() which return idle duration of
    target sleeping cpu. /proc/stat use this and "nr_iowait at
    read is performed" to get stats of idle cpu. So consecutive
    readers may do different accounting for the idle duration
    if nr_iowait have changed around them.
    (Refer 4/8 of following pseudo code set)

  - common account_idle_time() takes cputime, not tick

    account_idle_time() uses nr_iowait in this function to
    determine whether cputime given as argument should be
    accounted as idle time or iowait time. If this function was
    only called from tick interrupts and cputime was always equal
    to one tick, then it will make sense on some level.
    However it is called from non-tick accounting (for example,
    state-transition based accounting, a.k.a. VIRT_CPU_ACCOUNTING)
    codes and take long and short cputime.

    While it is OK that accounting idle duration as iowait if
    nr_iowait is greater than 0 at the end of idle state, it is
    not always OK that accounting idle duration as idle if
    nr_iowait is 0 at the end of idle state.
    (Refer 3/8 of following pseudo code set)

  - cputime's resolution is better than tick in some architecture

    Some arch using VIRT_CPU_ACCOUNTING have cputime in better
    resolution like nano seconds. Even though iowait accounting
    is performed based on tick interrupts and accounted value is
    not precise compared to other values like sys/user.

[Request for Comments]

  So, as the first step to solve this bunch of problems, I wrote
  some pseudo codes for review to check my direction. These codes
  look like usual patch set but !! _NOT_TESTED_COMPLETELY_ !!
  Now I don't have s390/ia64/ppc test box so I didn't even check
  these pass the compiler :-p

  What I want to do here is:

  - record timestamp at the end of iowait (decrement nr_iowait
    from 1 to 0) and use it for iowait time accounting
    (This idea is what PeterZ suggested)

  - use same strategy for basic tick-accounting, NO_HZ kernel,
    s390 and other VIRT_CPU_ACCOUNTING architectures.

  Still I'm concerning performance impact by changing scheduler
  core codes and also possible troubles by comparing timestamps
  from different clocks.

  These codes are based on following patch on top of v3.16-rc2:
    [PATCH] nohz: make updating sleep stats local, add seqcount

  I'd like to ask you:

    - Do you think if I continue this direction these blueprints
      would be acceptable change?

    - Or shall we kill iowait completely?

    - Are there any good workaround or band-aid for stable kernels?

  Please give me your comment and let me know if you have better
  idea to solve this problem.

  Any comments are welcome.


  First report from Fernando:
    [RFC] iowait/idle time accounting hiccups in NOHZ kernels
  Steps to reproduce guided by Tetsuo:

  1st patchset from Frederic:
    [PATCH 0/4] nohz: Fix racy sleeptime stats
    [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats

  2nd patchset from Frederic:
    [RFC PATCH 0/5] nohz: Fix racy sleeptime stats v2

  My previous patch set:
    [PATCH 0/2] nohz: fix idle accounting in NO_HZ kernels
    [PATCH 0/2 v2] nohz: fix idle accounting in NO_HZ kernels
    [PATCH v3 0/2] nohz: fix idle accounting in NO_HZ kernels
    [PATCH v4 0/2] nohz: fix idle accounting in NO_HZ kernels

  Further reading:
    SMP iowait stats



Hidetoshi Seto (8):
      cputime, sched: record last_iowait
      cputime, nohz: handle last_iowait for nohz
      cputime: introduce account_idle_and_iowait
      cputime, s390: introduce s390_get_idle_and_iowait
      cputime, ia64: update iowait accounting
      cputime, ppc: update iowait accounting
      cputime: generic iowait accounting for VIRT_CPU_ACCOUNTING
      cputime: iowait aware idle tick accounting

 arch/ia64/kernel/time.c         |   43 +++++++++++++++++-
 arch/powerpc/kernel/time.c      |   21 +++++++--
 arch/s390/include/asm/cputime.h |    5 +-
 arch/s390/kernel/vtime.c        |   40 ++++++++++++++---
 fs/proc/stat.c                  |   20 +++++---
 include/linux/kernel_stat.h     |    1 +
 include/linux/sched.h           |    1 +
 include/linux/vtime.h           |    3 +
 kernel/sched/core.c             |   70 ++++++++++++++++++++++++-----
 kernel/sched/cputime.c          |   93 ++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h            |    5 ++-
 kernel/time/tick-sched.c        |   48 +++++++++++++++-----
 12 files changed, 290 insertions(+), 60 deletions(-)

             reply index

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-06-26  9:06 Hidetoshi Seto [this message]
2014-06-26  9:08 ` [PATCH 1/8] cputime, sched: record last_iowait Hidetoshi Seto
2014-06-26  9:09 ` [PATCH 2/8] cputime, nohz: handle last_iowait for nohz Hidetoshi Seto
2014-06-26  9:10 ` [PATCH 3/8] cputime: introduce account_idle_and_iowait Hidetoshi Seto
2014-06-26  9:12 ` [PATCH 4/8] cputime, s390: introduce s390_get_idle_and_iowait Hidetoshi Seto
2014-06-26  9:13 ` [PATCH 5/8] cputime, ia64: update iowait accounting Hidetoshi Seto
2014-06-26  9:14 ` [PATCH 6/8] cputime, ppc: " Hidetoshi Seto
2014-06-26  9:16 ` [PATCH 7/8] cputime: generic iowait accounting for VIRT_CPU_ACCOUNTING Hidetoshi Seto
2014-06-26  9:17 ` [PATCH 8/8] cputime: iowait aware idle tick accounting Hidetoshi Seto
2014-07-07  9:30 ` [RFC PATCH 0/8] rework iowait accounting Peter Zijlstra

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LKML Archive on

Archives are clonable:
	git clone --mirror lkml/git/0.git
	git clone --mirror lkml/git/1.git
	git clone --mirror lkml/git/2.git
	git clone --mirror lkml/git/3.git
	git clone --mirror lkml/git/4.git
	git clone --mirror lkml/git/5.git
	git clone --mirror lkml/git/6.git
	git clone --mirror lkml/git/7.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ \
	public-inbox-index lkml

Newsgroup available over NNTP:

AGPL code for this site: git clone public-inbox