linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/8] rework iowait accounting
@ 2014-06-26  9:06 Hidetoshi Seto
  2014-06-26  9:08 ` [PATCH 1/8] cputime, sched: record last_iowait Hidetoshi Seto
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: Hidetoshi Seto @ 2014-06-26  9:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fernando Luis Vazquez Cao, Tetsuo Handa, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy, Denys Vlasenko

This mail is 5th try to fix an issue that iowait of /proc/stat can
go backward. Originally reported by Tetsuo and Fernando at last year,
Mar 2013.

Previous v1-v4 were proposal to apply my patch set, but this v5 is
request for comment, with untested patches to draw up a blueprint.


[OBSERVED PROBLEM]

  Tetsuo and Fernando reported that the idle/iowait time obtained
  through /proc/stat is not monotonic.

  There were 2 causes:


  [1]: sleep stats has no exclusive control
    (This is one of what my previous patch aimed to solve)

    Since NO_HZ kernel can stop tick interrupts while the cpu
    is in idle, sleep stats (=cpu's idle time and iowait time)
    are not updated constantly. It means if we refer stats of
    idle cpu then it will contain stale values updated before
    the cpu start idle. To avoid this, kernel records timestamp
    at idle entry and provide function to return calculated
    (rather) proper stats by adding duration of idle.

    However there was no locking between readers/writer of
    the sleep stats and timestamp. So in cases if sleep stats
    are updated while calculation for reader is in progress,
    calculated stats will go wrong and lose monotonicity with
    previous/next values.

    This cause backwarding trouble in both of idle and iowait.


  [2]: zeroing nr_iowait is not tracked in long idle

    Assume that a cpu was in idle for 30 ticks, and that nr_iowait
    was dropped to 0 between 17th tick and 18th tick. In constant
    updating manner by tick interrupts, iowait get 17 ticks to be
    accounted and after that idle get 13 ticks to be accounted.

    Then, how about in NO_HZ kernel?
    Let see the following short simulation:

    * given:
    *   idle time stats: idle=1000, iowait=100
    *   stamp at idle entry: entry=50
    *   nr tasks waiting io: nr_iowait=1
    *
    * 1st reader temporary assigns delta as iowait
    * (but does not account delta to time stats)):
    *   1st reader's query @ now = 60:
    *     idle=1000
    *     iowait=110 (=100+(60-50))
    *
    * then blocked task is queued to runqueue of other active cpu,
    * woken up from io_schedule{,_timeout}() and do atomic_dec()
    * from the remote:
    *   nr_iowait=0
    *
    * and in 2nd turn reader assign delta as idle:
    *   2nd reader's query @ now = 70:
    *     idle=1020 (=1000+(70-50))
    *     iowait=100
    *
    * at last cpu wakes up @ now = 80:
    *   idle time stats: idle=1030, iowait=100

    You can see that iowait is decreased from 110 to 100.
    And there were more than 10 ticks should have been accounted
    as iowait but not accounted at all at last.

    This cause backwarding trouble only in iowait. Because nr_iowait
    is never incremented from remote cpus.


  Solution for [1] is easy - just add proper exclusive control.
  But I stumbled over the [2].



[FURTHER INVESTIGATION]

  I'd like to point following facts:

  - s390 kernel is supposed to have problem [2]

    Not only NO_HZ kernel but also s390 kernel can stop tick
    interrupts while the cpu is in idle. Therefore s390 provides
    function s390_get_idle_time() which return idle duration of
    target sleeping cpu. /proc/stat use this and "nr_iowait at
    read is performed" to get stats of idle cpu. So consecutive
    readers may do different accounting for the idle duration
    if nr_iowait have changed around them.
    (Refer 4/8 of following pseudo code set)

  - common account_idle_time() takes cputime, not tick

    account_idle_time() uses nr_iowait in this function to
    determine whether cputime given as argument should be
    accounted as idle time or iowait time. If this function was
    only called from tick interrupts and cputime was always equal
    to one tick, then it will make sense on some level.
    However it is called from non-tick accounting (for example,
    state-transition based accounting, a.k.a. VIRT_CPU_ACCOUNTING)
    codes and take long and short cputime.

    While it is OK that accounting idle duration as iowait if
    nr_iowait is greater than 0 at the end of idle state, it is
    not always OK that accounting idle duration as idle if
    nr_iowait is 0 at the end of idle state.
    (Refer 3/8 of following pseudo code set)

  - cputime's resolution is better than tick in some architecture

    Some arch using VIRT_CPU_ACCOUNTING have cputime in better
    resolution like nano seconds. Even though iowait accounting
    is performed based on tick interrupts and accounted value is
    not precise compared to other values like sys/user.


[Request for Comments]

  So, as the first step to solve this bunch of problems, I wrote
  some pseudo codes for review to check my direction. These codes
  look like usual patch set but !! _NOT_TESTED_COMPLETELY_ !!
  Now I don't have s390/ia64/ppc test box so I didn't even check
  these pass the compiler :-p

  What I want to do here is:

  - record timestamp at the end of iowait (decrement nr_iowait
    from 1 to 0) and use it for iowait time accounting
    (This idea is what PeterZ suggested)

  - use same strategy for basic tick-accounting, NO_HZ kernel,
    s390 and other VIRT_CPU_ACCOUNTING architectures.

  Still I'm concerning performance impact by changing scheduler
  core codes and also possible troubles by comparing timestamps
  from different clocks.

  These codes are based on following patch on top of v3.16-rc2:
    [PATCH] nohz: make updating sleep stats local, add seqcount

  I'd like to ask you:

    - Do you think if I continue this direction these blueprints
      would be acceptable change?

    - Or shall we kill iowait completely?

    - Are there any good workaround or band-aid for stable kernels?

  Please give me your comment and let me know if you have better
  idea to solve this problem.

  Any comments are welcome.

[References]:

  First report from Fernando:
    [RFC] iowait/idle time accounting hiccups in NOHZ kernels
    https://lkml.org/lkml/2013/3/18/962
  Steps to reproduce guided by Tetsuo:
    https://lkml.org/lkml/2013/4/1/201

  1st patchset from Frederic:
    [PATCH 0/4] nohz: Fix racy sleeptime stats
    https://lkml.org/lkml/2013/8/8/638
    [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats
    https://lkml.org/lkml/2013/8/16/274

  2nd patchset from Frederic:
    [RFC PATCH 0/5] nohz: Fix racy sleeptime stats v2
    https://lkml.org/lkml/2013/10/19/86

  My previous patch set:
    [PATCH 0/2] nohz: fix idle accounting in NO_HZ kernels
    https://lkml.org/lkml/2014/3/23/256
    [PATCH 0/2 v2] nohz: fix idle accounting in NO_HZ kernels
    https://lkml.org/lkml/2014/3/30/315
    [PATCH v3 0/2] nohz: fix idle accounting in NO_HZ kernels
    https://lkml.org/lkml/2014/4/10/133
    [PATCH v4 0/2] nohz: fix idle accounting in NO_HZ kernels
    https://lkml.org/lkml/2014/4/17/120

  Further reading:
    SMP iowait stats
    https://www.kernel.org/pub/linux/kernel/people/wli/vm/iowait/iowait-2.5.45-6

Thanks,
H.Seto

---

Hidetoshi Seto (8):
      cputime, sched: record last_iowait
      cputime, nohz: handle last_iowait for nohz
      cputime: introduce account_idle_and_iowait
      cputime, s390: introduce s390_get_idle_and_iowait
      cputime, ia64: update iowait accounting
      cputime, ppc: update iowait accounting
      cputime: generic iowait accounting for VIRT_CPU_ACCOUNTING
      cputime: iowait aware idle tick accounting

 arch/ia64/kernel/time.c         |   43 +++++++++++++++++-
 arch/powerpc/kernel/time.c      |   21 +++++++--
 arch/s390/include/asm/cputime.h |    5 +-
 arch/s390/kernel/vtime.c        |   40 ++++++++++++++---
 fs/proc/stat.c                  |   20 +++++---
 include/linux/kernel_stat.h     |    1 +
 include/linux/sched.h           |    1 +
 include/linux/vtime.h           |    3 +
 kernel/sched/core.c             |   70 ++++++++++++++++++++++++-----
 kernel/sched/cputime.c          |   93 ++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h            |    5 ++-
 kernel/time/tick-sched.c        |   48 +++++++++++++++-----
 12 files changed, 290 insertions(+), 60 deletions(-)



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/8] cputime, sched: record last_iowait
  2014-06-26  9:06 [RFC PATCH 0/8] rework iowait accounting Hidetoshi Seto
@ 2014-06-26  9:08 ` Hidetoshi Seto
  2014-06-26  9:09 ` [PATCH 2/8] cputime, nohz: handle last_iowait for nohz Hidetoshi Seto
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Hidetoshi Seto @ 2014-06-26  9:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fernando Luis Vazquez Cao, Tetsuo Handa, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy, Denys Vlasenko

Record the timestamp when nr_iowait of idle cpu is dropped to 0 by
running cpu who pick a task which have call io_schedule() before
entering idle.

It is the time point that cpu's state have changed from "iowait"
to "idle". Following patch use it for updated idle accounting.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Not-Tested-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
---
 kernel/sched/core.c    |   40 ++++++++++++++++++++++++++++------------
 kernel/sched/cputime.c |    2 +-
 kernel/sched/sched.h   |    4 +++-
 3 files changed, 32 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3bdf01b..e759238 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2374,15 +2374,14 @@ unsigned long nr_iowait(void)
 	unsigned long i, sum = 0;
 
 	for_each_possible_cpu(i)
-		sum += atomic_read(&cpu_rq(i)->nr_iowait);
+		sum += cpu_rq(i)->nr_iowait;
 
 	return sum;
 }
 
 unsigned long nr_iowait_cpu(int cpu)
 {
-	struct rq *this = cpu_rq(cpu);
-	return atomic_read(&this->nr_iowait);
+	return cpu_rq(cpu)->nr_iowait;
 }
 
 #ifdef CONFIG_SMP
@@ -4305,6 +4304,24 @@ out_irq:
 }
 EXPORT_SYMBOL_GPL(yield_to);
 
+static inline void iowait_start(struct rq *rq)
+{
+	raw_spin_lock(&rq->iowait_lock);
+	rq->nr_iowait++;
+	raw_spin_unlock(&rq->iowait_lock);
+	current->in_iowait = 1;
+}
+
+static inline void iowait_stop(struct rq *rq)
+{
+	current->in_iowait = 0;
+	raw_spin_lock(&rq->iowait_lock);
+	rq->nr_iowait--;
+	if (!rq->nr_iowait && rq != this_rq())
+		rq->last_iowait = ktime_get();
+	raw_spin_unlock(&rq->iowait_lock);
+}
+
 /*
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
@@ -4314,12 +4331,10 @@ void __sched io_schedule(void)
 	struct rq *rq = raw_rq();
 
 	delayacct_blkio_start();
-	atomic_inc(&rq->nr_iowait);
+	iowait_start(rq);
 	blk_flush_plug(current);
-	current->in_iowait = 1;
 	schedule();
-	current->in_iowait = 0;
-	atomic_dec(&rq->nr_iowait);
+	iowait_stop(rq);
 	delayacct_blkio_end();
 }
 EXPORT_SYMBOL(io_schedule);
@@ -4330,12 +4345,10 @@ long __sched io_schedule_timeout(long timeout)
 	long ret;
 
 	delayacct_blkio_start();
-	atomic_inc(&rq->nr_iowait);
+	iowait_start(rq);
 	blk_flush_plug(current);
-	current->in_iowait = 1;
 	ret = schedule_timeout(timeout);
-	current->in_iowait = 0;
-	atomic_dec(&rq->nr_iowait);
+	iowait_stop(rq);
 	delayacct_blkio_end();
 	return ret;
 }
@@ -6994,7 +7007,10 @@ void __init sched_init(void)
 #endif
 #endif
 		init_rq_hrtick(rq);
-		atomic_set(&rq->nr_iowait, 0);
+
+		raw_spin_lock_init(&rq->iowait_lock);
+		rq->nr_iowait = 0;
+		rq->last_iowait = ktime_get();
 	}
 
 	set_load_weight(&init_task);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 72fdf06..a028604 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -248,7 +248,7 @@ void account_idle_time(cputime_t cputime)
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	struct rq *rq = this_rq();
 
-	if (atomic_read(&rq->nr_iowait) > 0)
+	if (rq->nr_iowait > 0)
 		cpustat[CPUTIME_IOWAIT] += (__force u64) cputime;
 	else
 		cpustat[CPUTIME_IDLE] += (__force u64) cputime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 31cc02e..4ddfddc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -561,7 +561,9 @@ struct rq {
 	u64 clock;
 	u64 clock_task;
 
-	atomic_t nr_iowait;
+	raw_spinlock_t	iowait_lock ____cacheline_aligned;
+	unsigned int	nr_iowait;
+	ktime_t		last_iowait;
 
 #ifdef CONFIG_SMP
 	struct root_domain *rd;
-- 
1.7.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/8] cputime, nohz: handle last_iowait for nohz
  2014-06-26  9:06 [RFC PATCH 0/8] rework iowait accounting Hidetoshi Seto
  2014-06-26  9:08 ` [PATCH 1/8] cputime, sched: record last_iowait Hidetoshi Seto
@ 2014-06-26  9:09 ` Hidetoshi Seto
  2014-06-26  9:10 ` [PATCH 3/8] cputime: introduce account_idle_and_iowait Hidetoshi Seto
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Hidetoshi Seto @ 2014-06-26  9:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fernando Luis Vazquez Cao, Tetsuo Handa, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy, Denys Vlasenko

Now observer cpu can refer both of idle entry time and iowait exit
time of observed sleeping cpu, so observer can get idle/iowait time
of sleeping cpu by calculating cputimes not accounted yet.

Not-Tested-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
---
 include/linux/sched.h    |    1 +
 kernel/sched/core.c      |   27 +++++++++++++++++++++++++
 kernel/time/tick-sched.c |   48 +++++++++++++++++++++++++++++++++------------
 3 files changed, 63 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 306f4f0..29e1af0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -168,6 +168,7 @@ extern int nr_processes(void);
 extern unsigned long nr_running(void);
 extern unsigned long nr_iowait(void);
 extern unsigned long nr_iowait_cpu(int cpu);
+extern void nr_iowait_deltas(int cpu, ktime_t start, ktime_t now, ktime_t *iowait, ktime_t *idle);
 extern unsigned long this_cpu_load(void);
 
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e759238..814ee2e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2384,6 +2384,33 @@ unsigned long nr_iowait_cpu(int cpu)
 	return cpu_rq(cpu)->nr_iowait;
 }
 
+/*
+ * nr_iowait_deltas - divide idle time into idle delta and iowait delta
+ *
+ * @start: time stamp at start of idle span
+ * @now: time stamp at end of idle span
+ * @iowait_delta: address to store calculated iowait
+ * @idle_delta: address to store calculated idle
+ */
+void nr_iowait_deltas(int cpu, ktime_t start, ktime_t now,
+		      ktime_t *iowait_delta, ktime_t *idle_delta)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	raw_spin_lock(&rq->iowait_lock);
+	if (rq->nr_iowait || ktime_compare(rq->last_iowait, now) > 0) {
+		*iowait_delta = ktime_sub(now, start);
+		*idle_delta = ktime_set(0, 0);
+	} else if (ktime_compare(rq->last_iowait, start) > 0) {
+		*iowait_delta = ktime_sub(rq->last_iowait, start);
+		*idle_delta = ktime_sub(now, rq->last_iowait);
+	} else {
+		*iowait_delta = ktime_set(0, 0);
+		*idle_delta = ktime_sub(now, start);
+	}
+	raw_spin_unlock(&rq->iowait_lock);
+}
+
 #ifdef CONFIG_SMP
 
 /*
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 44eb187..8d23af5 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -408,16 +408,22 @@ static void tick_nohz_update_jiffies(ktime_t now)
 
 static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 {
-	ktime_t delta;
+	static const ktime_t ktime_zero = { .tv64 = 0 };
+	ktime_t iowait_delta = ktime_zero, idle_delta = ktime_zero;
 
 	write_seqcount_begin(&ts->idle_sleeptime_seq);
 
 	/* Updates the per cpu time idle statistics counters */
-	delta = ktime_sub(now, ts->idle_entrytime);
-	if (nr_iowait_cpu(smp_processor_id()) > 0)
-		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
-	else
-		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+	if (ts->idle_active == 2) {
+		nr_iowait_deltas(smp_processor_id(), ts->idle_entrytime, now,
+				 &iowait_delta, &idle_delta);
+		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime,
+						 iowait_delta);
+	} else {
+		idle_delta = ktime_sub(now, ts->idle_entrytime);
+	}
+	ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, idle_delta);
+
 	ts->idle_entrytime = now;
 	ts->idle_active = 0;
 
@@ -432,7 +438,13 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
 
 	write_seqcount_begin(&ts->idle_sleeptime_seq);
 	ts->idle_entrytime = now;
-	ts->idle_active = 1;
+	/*
+	 * idle_active:
+	 *  0: cpu is not idle
+	 *  1: cpu is performing idle
+	 *  2: cpu is performing iowait and idle
+	 */
+	ts->idle_active = 1 + !!nr_iowait_cpu(smp_processor_id());
 	write_seqcount_end(&ts->idle_sleeptime_seq);
 
 	sched_clock_idle_sleep_event();
@@ -467,10 +479,18 @@ u64 get_cpu_idle_time_us(int cpu, u64 *wall)
 
 	do {
 		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
- 
-		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
-			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
 
+		if (ts->idle_active) {
+			ktime_t delta;
+
+			if (ts->idle_active == 2) {
+				ktime_t unused;
+
+				nr_iowait_deltas(cpu, ts->idle_entrytime, now,
+						 &unused, &delta);
+			} else {
+				delta = ktime_sub(now, ts->idle_entrytime);
+			}
 			idle = ktime_add(ts->idle_sleeptime, delta);
 		} else {
 			idle = ts->idle_sleeptime;
@@ -510,10 +530,12 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *wall)
 
 	do {
 		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
- 
-		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
-			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
 
+		if (ts->idle_active == 2) {
+			ktime_t delta, unused;
+
+			nr_iowait_deltas(cpu, ts->idle_entrytime, now,
+					 &delta, &unused);
 			iowait = ktime_add(ts->iowait_sleeptime, delta);
 		} else {
 			iowait = ts->iowait_sleeptime;
-- 
1.7.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 3/8] cputime: introduce account_idle_and_iowait
  2014-06-26  9:06 [RFC PATCH 0/8] rework iowait accounting Hidetoshi Seto
  2014-06-26  9:08 ` [PATCH 1/8] cputime, sched: record last_iowait Hidetoshi Seto
  2014-06-26  9:09 ` [PATCH 2/8] cputime, nohz: handle last_iowait for nohz Hidetoshi Seto
@ 2014-06-26  9:10 ` Hidetoshi Seto
  2014-06-26  9:12 ` [PATCH 4/8] cputime, s390: introduce s390_get_idle_and_iowait Hidetoshi Seto
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Hidetoshi Seto @ 2014-06-26  9:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fernando Luis Vazquez Cao, Tetsuo Handa, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy, Denys Vlasenko

The current account_idle_time() cannot process mixed cputime which
contain both of idle cputime and iowait cputime.

So introduce new account_idle_and_iowait() to do paranoid work.
Following patches will add users of this new function.

Not-Tested-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
---
 kernel/sched/cputime.c |   24 +++++++++++++++++++++++-
 1 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a028604..283e011 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -240,14 +240,36 @@ void account_steal_time(cputime_t cputime)
 }
 
 /*
+ * Account for idle and iowait time in a dulation.
+ * @idle_enter: time stamp at idle entry
+ * @iowait_exit: time stamp when nr_iowait dropped to 0
+ * @idle_exit: time stamp at idle exit
+ */
+void account_idle_and_iowait(cputime_t idle_enter, cputime_t iowait_exit, cputime_t idle_exit)
+{
+	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	struct rq *rq = this_rq();
+
+	if (rq->nr_iowait > 0 || iowait_exit > idle_exit) {
+		cpustat[CPUTIME_IOWAIT] += (__force u64) idle_exit - idle_enter;
+	} else if (iowait_exit > idle_enter) {
+		cpustat[CPUTIME_IOWAIT] += (__force u64) iowait_exit - idle_enter;
+		cpustat[CPUTIME_IDLE] += (__force u64) idle_exit - iowait_exit;
+	} else {
+		cpustat[CPUTIME_IDLE] += (__force u64) idle_exit - idle_enter;
+	}
+}
+
+/*
  * Account for idle time.
- * @cputime: the cpu time spent in idle wait
+ * @cputime: the cpu time spent in idle wait (sometimes include iowait time)
  */
 void account_idle_time(cputime_t cputime)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	struct rq *rq = this_rq();
 
+	/* FIXME */
 	if (rq->nr_iowait > 0)
 		cpustat[CPUTIME_IOWAIT] += (__force u64) cputime;
 	else
-- 
1.7.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 4/8] cputime, s390: introduce s390_get_idle_and_iowait
  2014-06-26  9:06 [RFC PATCH 0/8] rework iowait accounting Hidetoshi Seto
                   ` (2 preceding siblings ...)
  2014-06-26  9:10 ` [PATCH 3/8] cputime: introduce account_idle_and_iowait Hidetoshi Seto
@ 2014-06-26  9:12 ` Hidetoshi Seto
  2014-06-26  9:13 ` [PATCH 5/8] cputime, ia64: update iowait accounting Hidetoshi Seto
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Hidetoshi Seto @ 2014-06-26  9:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fernando Luis Vazquez Cao, Tetsuo Handa, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy, Denys Vlasenko

s390_get_idle_time give us the duration from idle entry to now.
But it does not tell us how to divide it to idle and iowait.

Modify this function to return 2 values. To realize this, s390's
cputime accounting also requires timestamp at end of iowait.

Not-Tested-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
---
 arch/s390/include/asm/cputime.h |    7 ++++-
 arch/s390/kernel/vtime.c        |   40 ++++++++++++++++++++++++++++++++------
 fs/proc/stat.c                  |   20 +++++++++++-------
 kernel/sched/core.c             |    6 ++++-
 4 files changed, 55 insertions(+), 18 deletions(-)

diff --git a/arch/s390/include/asm/cputime.h b/arch/s390/include/asm/cputime.h
index f65bd36..f4f882d 100644
--- a/arch/s390/include/asm/cputime.h
+++ b/arch/s390/include/asm/cputime.h
@@ -174,13 +174,16 @@ struct s390_idle_data {
 	unsigned long long clock_idle_exit;
 	unsigned long long timer_idle_enter;
 	unsigned long long timer_idle_exit;
+	unsigned long long clock_iowait_exit;
 };
 
 DECLARE_PER_CPU(struct s390_idle_data, s390_idle);
 
-cputime64_t s390_get_idle_time(int cpu);
+void s390_get_idle_and_iowait(int cpu, cputime64_t reti, cputime64_t retw);
+void s390_record_iowait_exit(int cpu);
 
-#define arch_idle_time(cpu) s390_get_idle_time(cpu)
+#define arch_idle_and_iowait(cpu, i, w) s390_get_idle_and_iowait(cpu, i, w)
+#define arch_record_iowait_exit(cpu) s390_save_iowait_exit(cpu)
 
 static inline int s390_nohz_delay(int cpu)
 {
diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index 8c34363..f945cbd 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -155,7 +155,7 @@ EXPORT_SYMBOL_GPL(vtime_account_system);
 void __kprobes vtime_stop_cpu(void)
 {
 	struct s390_idle_data *idle = &__get_cpu_var(s390_idle);
-	unsigned long long idle_time;
+	unsigned long long idle_enter, idle_exit, iowait_exit;
 	unsigned long psw_mask;
 
 	trace_hardirqs_on();
@@ -171,19 +171,21 @@ void __kprobes vtime_stop_cpu(void)
 	/* Account time spent with enabled wait psw loaded as idle time. */
 	idle->sequence++;
 	smp_wmb();
-	idle_time = idle->clock_idle_exit - idle->clock_idle_enter;
+	idle_enter = idle->clock_idle_enter;
+	idle_exit = idle->clock_idle_exit;
+	iowait_exit = idle->clock_iowait_exit;
 	idle->clock_idle_enter = idle->clock_idle_exit = 0ULL;
-	idle->idle_time += idle_time;
+	idle->idle_time += idle_exit - idle_enter;
 	idle->idle_count++;
-	account_idle_time(idle_time);
+	account_idle_and_iowait(idle_enter, iowait_exit, idle_exit);
 	smp_wmb();
 	idle->sequence++;
 }
 
-cputime64_t s390_get_idle_time(int cpu)
+void s390_get_idle_and_iowait(int cpu, cputime64_t *reti, cputime64_t *retw)
 {
 	struct s390_idle_data *idle = &per_cpu(s390_idle, cpu);
-	unsigned long long now, idle_enter, idle_exit;
+	unsigned long long now, idle_enter, idle_exit, iowait_exit;
 	unsigned int sequence;
 
 	do {
@@ -191,8 +193,32 @@ cputime64_t s390_get_idle_time(int cpu)
 		sequence = ACCESS_ONCE(idle->sequence);
 		idle_enter = ACCESS_ONCE(idle->clock_idle_enter);
 		idle_exit = ACCESS_ONCE(idle->clock_idle_exit);
+		iowait_exit = ACCESS_ONCE(idle->clock_iowait_exit);
 	} while ((sequence & 1) || (ACCESS_ONCE(idle->sequence) != sequence));
-	return idle_enter ? ((idle_exit ?: now) - idle_enter) : 0;
+
+	if (!idle_enter) {
+		*reti = *retw = 0ULL;
+	} else {
+		if (!idle_exit)
+			idle_exit = now;
+		if (nr_iowait_cpu(cpu)) {
+			*retw = idle_exit - idle_enter;
+			*reti = 0ULL;
+		} else if (iowait_exit > idle_enter) {
+			*retw = iowait_exit - idle_enter;
+			*reti = idle_exit - iowait_exit;
+		} else {
+			*retw = 0ULL;
+			*reti = idle_exit - idle_enter;
+		}
+	}
+}
+
+void s390_record_iowait_exit(int cpu)
+{
+	struct s390_idle_data *idle = &per_cpu(s390_idle, cpu);
+
+	idle->clock_iowait_exit = get_tod_clock();
 }
 
 /*
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 9d231e9..6b23a89 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -19,25 +19,29 @@
 #define arch_irq_stat() 0
 #endif
 
-#ifdef arch_idle_time
+#ifdef arch_idle_and_iowait
 
 static cputime64_t get_idle_time(int cpu)
 {
-	cputime64_t idle;
+	cputime64_t idle, idle_not_accounted_yet, unused;
 
 	idle = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE];
-	if (cpu_online(cpu) && !nr_iowait_cpu(cpu))
-		idle += arch_idle_time(cpu);
+	if (cpu_online(cpu)) {
+		arch_idle_and_iowait(cpu, &idle_not_accounted_yet, &unused);
+		idle += idle_not_accounted_yet;
+	}
 	return idle;
 }
 
 static cputime64_t get_iowait_time(int cpu)
 {
-	cputime64_t iowait;
+	cputime64_t iowait, iowait_not_accounted_yet, unused;
 
 	iowait = kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
-	if (cpu_online(cpu) && nr_iowait_cpu(cpu))
-		iowait += arch_idle_time(cpu);
+	if (cpu_online(cpu)) {
+		arch_idle_and_iowait(cpu, &unused, &iowait_not_accounted_yet);
+		iowait += iowait_not_accounted_yet;
+	}
 	return iowait;
 }
 
@@ -75,7 +79,7 @@ static u64 get_iowait_time(int cpu)
 	return iowait;
 }
 
-#endif
+#endif /* arch_idle_and_iowait */
 
 static int show_stat(struct seq_file *p, void *v)
 {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 814ee2e..52abf79 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4344,8 +4344,12 @@ static inline void iowait_stop(struct rq *rq)
 	current->in_iowait = 0;
 	raw_spin_lock(&rq->iowait_lock);
 	rq->nr_iowait--;
-	if (!rq->nr_iowait && rq != this_rq())
+	if (!rq->nr_iowait && rq != this_rq()) {
+#ifdef arch_record_iowait_exit
+		arch_record_iowait_exit(rq->cpu);
+#endif
 		rq->last_iowait = ktime_get();
+	}
 	raw_spin_unlock(&rq->iowait_lock);
 }
 
-- 
1.7.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 5/8] cputime, ia64: update iowait accounting
  2014-06-26  9:06 [RFC PATCH 0/8] rework iowait accounting Hidetoshi Seto
                   ` (3 preceding siblings ...)
  2014-06-26  9:12 ` [PATCH 4/8] cputime, s390: introduce s390_get_idle_and_iowait Hidetoshi Seto
@ 2014-06-26  9:13 ` Hidetoshi Seto
  2014-06-26  9:14 ` [PATCH 6/8] cputime, ppc: " Hidetoshi Seto
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Hidetoshi Seto @ 2014-06-26  9:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fernando Luis Vazquez Cao, Tetsuo Handa, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy, Denys Vlasenko

Using VIRT_CPU_ACCOUNTING, ia64 utilize "timestamp at end of iowait"
like s390.

Not-Tested-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
---
 arch/ia64/include/asm/cputime.h |    2 +
 arch/ia64/kernel/time.c         |   43 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 44 insertions(+), 1 deletions(-)

diff --git a/arch/ia64/include/asm/cputime.h b/arch/ia64/include/asm/cputime.h
index e2d3f5b..6089cbd 100644
--- a/arch/ia64/include/asm/cputime.h
+++ b/arch/ia64/include/asm/cputime.h
@@ -24,6 +24,8 @@
 # include <asm/processor.h>
 # include <asm-generic/cputime_nsecs.h>
 extern void arch_vtime_task_switch(struct task_struct *tsk);
+extern void ia64_record_iowait_exit(int cpu);
+# define arch_record_iowait_exit(cpu) ia64_record_iowait_exit(cpu)
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #endif /* __IA64_CPUTIME_H */
diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 71c52bc..37c907d 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -138,9 +138,50 @@ void vtime_account_system(struct task_struct *tsk)
 }
 EXPORT_SYMBOL_GPL(vtime_account_system);
 
+/*
+ * Nasty trick to account iowait time as defined
+ */
+DEFINE_PER_CPU(u64, ia64_iowait_exit);
+
+void ia64_record_iowait_exit(int cpu)
+{
+	/* FIXME: ITC might not synchronized between cpus/sockets */
+        per_cpu(ia64_iowait_exit, cpu) = ia64_get_itc();
+}
+
 void vtime_account_idle(struct task_struct *tsk)
 {
-	account_idle_time(vtime_delta(tsk));
+	struct thread_info *ti = task_thread_info(tsk);
+	cputime_t delta_stime, iowait_time;
+	__u64 now;
+
+	WARN_ON_ONCE(!irqs_disabled());
+
+	now = ia64_get_itc();
+
+	/*
+	 * tsk must be idle process: its runtime is accumulated as system time
+	 * (ac_utime must be 0) so convert it to idle time for accounting.
+	 *
+	 * in cycles:
+	 *		              iowait_exit
+	 *  (idle_enter)              |   ac_stamp   now(=idle_exit)
+	 * -------+-------------------*------+------->
+	 *         <------- ac_stime ------->
+	 *
+	 * in cputime:
+	 *       (0)
+	 * -------+-------------------*------+------->
+	 *         <----------- delta_stime --------->
+	 *         <-- iowait_time -->
+	 */
+	delta_stime = cycle_to_cputime(ti->ac_stime + (now - ti->ac_stamp));
+	iowait_time = cycle_to_cputime(ti->ac_stime +
+			__ia64_per_cpu_var(ia64_iowait_exit) - ti->ac_stamp);
+	ti->ac_stime = 0;
+	ti->ac_stamp = now;
+
+	account_idle_and_iowait(0, iowait_time, delta_stime);
 }
 
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
-- 
1.7.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 6/8] cputime, ppc: update iowait accounting
  2014-06-26  9:06 [RFC PATCH 0/8] rework iowait accounting Hidetoshi Seto
                   ` (4 preceding siblings ...)
  2014-06-26  9:13 ` [PATCH 5/8] cputime, ia64: update iowait accounting Hidetoshi Seto
@ 2014-06-26  9:14 ` Hidetoshi Seto
  2014-06-26  9:16 ` [PATCH 7/8] cputime: generic iowait accounting for VIRT_CPU_ACCOUNTING Hidetoshi Seto
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Hidetoshi Seto @ 2014-06-26  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fernando Luis Vazquez Cao, Tetsuo Handa, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy, Denys Vlasenko

Like s390 and ia64, ppc also has VIRT_CPU_ACCOUNTING.
Check "timestamp at end of iowait" for idle/iowait accounting.

Not-Tested-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
---
 arch/powerpc/include/asm/cputime.h |    3 +++
 arch/powerpc/kernel/time.c         |   21 +++++++++++++++++----
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h
index 607559a..f33a801 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -230,6 +230,9 @@ static inline cputime_t clock_t_to_cputime(const unsigned long clk)
 
 static inline void arch_vtime_task_switch(struct task_struct *tsk) { }
 
+void ppc_record_iowait_exit(int cpu);
+#define arch_record_iowait_exit(cpu) ppc_record_iowait_exit(cpu)
+
 #endif /* __KERNEL__ */
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 #endif /* __POWERPC_CPUTIME_H */
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 9fff9cd..879d97a 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -293,7 +293,7 @@ static inline u64 calculate_stolen_time(u64 stop_tb)
  * or soft irq state.
  */
 static u64 vtime_delta(struct task_struct *tsk,
-			u64 *sys_scaled, u64 *stolen)
+			u64 *sys_scaled, u64 *stolen, u64 *stamp)
 {
 	u64 now, nowscaled, deltascaled;
 	u64 udelta, delta, user_scaled;
@@ -336,6 +336,9 @@ static u64 vtime_delta(struct task_struct *tsk,
 	}
 	get_paca()->user_time_scaled += user_scaled;
 
+	if (stamp)
+		*stamp = now;
+
 	return delta;
 }
 
@@ -343,19 +346,29 @@ void vtime_account_system(struct task_struct *tsk)
 {
 	u64 delta, sys_scaled, stolen;
 
-	delta = vtime_delta(tsk, &sys_scaled, &stolen);
+	delta = vtime_delta(tsk, &sys_scaled, &stolen, NULL);
 	account_system_time(tsk, 0, delta, sys_scaled);
 	if (stolen)
 		account_steal_time(stolen);
 }
 EXPORT_SYMBOL_GPL(vtime_account_system);
 
+DEFINE_PER_CPU(u64, vtime_iowait_exit);
+
+void ppc_record_iowait_exit(int cpu)
+{
+	per_cpu(vtime_iowait_exit, cpu) = mftb();
+}
+
 void vtime_account_idle(struct task_struct *tsk)
 {
 	u64 delta, sys_scaled, stolen;
+	u64 now;
 
-	delta = vtime_delta(tsk, &sys_scaled, &stolen);
-	account_idle_time(delta + stolen);
+	delta = vtime_delta(tsk, &sys_scaled, &stolen, &now);
+	delta += stolen;
+	account_idle_and_iowait(0, delta + per_cpu_var(vtime_iowait_exit) - now,
+				delta);
 }
 
 /*
-- 
1.7.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 7/8] cputime: generic iowait accounting for VIRT_CPU_ACCOUNTING
  2014-06-26  9:06 [RFC PATCH 0/8] rework iowait accounting Hidetoshi Seto
                   ` (5 preceding siblings ...)
  2014-06-26  9:14 ` [PATCH 6/8] cputime, ppc: " Hidetoshi Seto
@ 2014-06-26  9:16 ` Hidetoshi Seto
  2014-06-26  9:17 ` [PATCH 8/8] cputime: iowait aware idle tick accounting Hidetoshi Seto
  2014-07-07  9:30 ` [RFC PATCH 0/8] rework iowait accounting Peter Zijlstra
  8 siblings, 0 replies; 10+ messages in thread
From: Hidetoshi Seto @ 2014-06-26  9:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fernando Luis Vazquez Cao, Tetsuo Handa, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy, Denys Vlasenko

Get iowait's timestamp for accounting w/ VIRT_CPU_ACCOUNTING_GEN.
(currently arm is only user of this?)

At last of this series of changes, introduce common function
vtime_iowait_exit to replace all arch_record_iowait_exit.

Not-tested-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
---
 arch/ia64/include/asm/cputime.h    |    2 --
 arch/ia64/kernel/time.c            |    2 +-
 arch/powerpc/include/asm/cputime.h |    3 ---
 arch/powerpc/kernel/time.c         |    2 +-
 arch/s390/include/asm/cputime.h    |    2 --
 arch/s390/kernel/vtime.c           |    2 +-
 include/linux/vtime.h              |    3 +++
 kernel/sched/core.c                |    4 +---
 kernel/sched/cputime.c             |   25 +++++++++++++++++++++++--
 9 files changed, 30 insertions(+), 15 deletions(-)

diff --git a/arch/ia64/include/asm/cputime.h b/arch/ia64/include/asm/cputime.h
index 6089cbd..e2d3f5b 100644
--- a/arch/ia64/include/asm/cputime.h
+++ b/arch/ia64/include/asm/cputime.h
@@ -24,8 +24,6 @@
 # include <asm/processor.h>
 # include <asm-generic/cputime_nsecs.h>
 extern void arch_vtime_task_switch(struct task_struct *tsk);
-extern void ia64_record_iowait_exit(int cpu);
-# define arch_record_iowait_exit(cpu) ia64_record_iowait_exit(cpu)
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #endif /* __IA64_CPUTIME_H */
diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 37c907d..4b34346 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -143,7 +143,7 @@ EXPORT_SYMBOL_GPL(vtime_account_system);
  */
 DEFINE_PER_CPU(u64, ia64_iowait_exit);
 
-void ia64_record_iowait_exit(int cpu)
+void vtime_iowait_exit(int cpu)
 {
 	/* FIXME: ITC might not synchronized between cpus/sockets */
         per_cpu(ia64_iowait_exit, cpu) = ia64_get_itc();
diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h
index f33a801..607559a 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -230,9 +230,6 @@ static inline cputime_t clock_t_to_cputime(const unsigned long clk)
 
 static inline void arch_vtime_task_switch(struct task_struct *tsk) { }
 
-void ppc_record_iowait_exit(int cpu);
-#define arch_record_iowait_exit(cpu) ppc_record_iowait_exit(cpu)
-
 #endif /* __KERNEL__ */
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 #endif /* __POWERPC_CPUTIME_H */
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 879d97a..7afddc1 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -355,7 +355,7 @@ EXPORT_SYMBOL_GPL(vtime_account_system);
 
 DEFINE_PER_CPU(u64, vtime_iowait_exit);
 
-void ppc_record_iowait_exit(int cpu)
+void vtime_iowait_exit(int cpu)
 {
 	per_cpu(vtime_iowait_exit, cpu) = mftb();
 }
diff --git a/arch/s390/include/asm/cputime.h b/arch/s390/include/asm/cputime.h
index f4f882d..e136328 100644
--- a/arch/s390/include/asm/cputime.h
+++ b/arch/s390/include/asm/cputime.h
@@ -180,10 +180,8 @@ struct s390_idle_data {
 DECLARE_PER_CPU(struct s390_idle_data, s390_idle);
 
 void s390_get_idle_and_iowait(int cpu, cputime64_t reti, cputime64_t retw);
-void s390_record_iowait_exit(int cpu);
 
 #define arch_idle_and_iowait(cpu, i, w) s390_get_idle_and_iowait(cpu, i, w)
-#define arch_record_iowait_exit(cpu) s390_save_iowait_exit(cpu)
 
 static inline int s390_nohz_delay(int cpu)
 {
diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index f945cbd..123fa74 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -214,7 +214,7 @@ void s390_get_idle_and_iowait(int cpu, cputime64_t *reti, cputime64_t *retw)
 	}
 }
 
-void s390_record_iowait_exit(int cpu)
+void vtime_iowait_exit(int cpu)
 {
 	struct s390_idle_data *idle = &per_cpu(s390_idle, cpu);
 
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index c5165fd..7f4632a 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -49,6 +49,7 @@ static inline void vtime_task_switch(struct task_struct *prev)
 }
 #endif /* __ARCH_HAS_VTIME_TASK_SWITCH */
 
+extern void vtime_iowait_exit(int cpu);
 extern void vtime_account_system(struct task_struct *tsk);
 extern void vtime_account_idle(struct task_struct *tsk);
 extern void vtime_account_user(struct task_struct *tsk);
@@ -66,8 +67,10 @@ static inline void vtime_account_irq_enter(struct task_struct *tsk)
 
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
+static inline void vtime_iowait_exit(int cpu) { }
 static inline void vtime_task_switch(struct task_struct *prev) { }
 static inline void vtime_account_system(struct task_struct *tsk) { }
+static inline void vtime_account_idle(struct task_struct *tsk) { }
 static inline void vtime_account_user(struct task_struct *tsk) { }
 static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 52abf79..f8eec61 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4345,9 +4345,7 @@ static inline void iowait_stop(struct rq *rq)
 	raw_spin_lock(&rq->iowait_lock);
 	rq->nr_iowait--;
 	if (!rq->nr_iowait && rq != this_rq()) {
-#ifdef arch_record_iowait_exit
-		arch_record_iowait_exit(rq->cpu);
-#endif
+		vtime_iowait_exit(rq->cpu);
 		rq->last_iowait = ktime_get();
 	}
 	raw_spin_unlock(&rq->iowait_lock);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 283e011..866a3ff 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -741,11 +741,32 @@ void vtime_guest_exit(struct task_struct *tsk)
 }
 EXPORT_SYMBOL_GPL(vtime_guest_exit);
 
+DEFINE_PER_CPU(unsigned long long, vtime_exit_iowait);
+
+void vtime_iowait_exit(int cpu)
+{
+	/* FIXME: don't compare local clock on different cpus */
+	per_cpu(vtime_exit_iowait, cpu) = local_clock();
+}
+
 void vtime_account_idle(struct task_struct *tsk)
 {
-	cputime_t delta_cpu = get_vtime_delta(tsk);
+	unsigned long long now, delta;
+	cputime_t delta_cpu, idle_cpu;
+
+	now = local_clock();
+	if (now < tsk->vtime_snap)
+		return;
+
+	delta = now - tsk->vtime_snap;
+
+	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_SLEEPING);
+	tsk->vtime_snap += delta;
+
+	delta_cpu = nsecs_to_cputime(delta);
+	idle_cpu = nsecs_to_cputime(now - per_cpu_var(vtime_exit_iowait));
 
-	account_idle_time(delta_cpu);
+	account_idle_and_iowait(0, delta_cpu - idle_cpu, delta_cpu);
 }
 
 void arch_vtime_task_switch(struct task_struct *prev)
-- 
1.7.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 8/8] cputime: iowait aware idle tick accounting
  2014-06-26  9:06 [RFC PATCH 0/8] rework iowait accounting Hidetoshi Seto
                   ` (6 preceding siblings ...)
  2014-06-26  9:16 ` [PATCH 7/8] cputime: generic iowait accounting for VIRT_CPU_ACCOUNTING Hidetoshi Seto
@ 2014-06-26  9:17 ` Hidetoshi Seto
  2014-07-07  9:30 ` [RFC PATCH 0/8] rework iowait accounting Peter Zijlstra
  8 siblings, 0 replies; 10+ messages in thread
From: Hidetoshi Seto @ 2014-06-26  9:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fernando Luis Vazquez Cao, Tetsuo Handa, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy, Denys Vlasenko

By changes in vtime* codes by previous patches, now account_idle_time()
become a function to be called only from tick-accounting codes.

Introduce __account_idle_ticks() to do iowait accounting in ticks
properly. For this purpose record jiffies at end of iowait.

Not-Tested-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
---
 include/linux/kernel_stat.h |    1 +
 kernel/sched/core.c         |    1 +
 kernel/sched/cputime.c      |   68 +++++++++++++++++++++++++++++-------------
 kernel/sched/sched.h        |    1 +
 4 files changed, 50 insertions(+), 21 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index ecbc52f..bdea2f7 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -85,6 +85,7 @@ extern unsigned long long task_delta_exec(struct task_struct *);
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
 extern void account_steal_time(cputime_t);
+extern void account_iowait_time(cputime_t);
 extern void account_idle_time(cputime_t);
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f8eec61..5d3ebc3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4347,6 +4347,7 @@ static inline void iowait_stop(struct rq *rq)
 	if (!rq->nr_iowait && rq != this_rq()) {
 		vtime_iowait_exit(rq->cpu);
 		rq->last_iowait = ktime_get();
+		rq->last_iowait_jiffies = jiffies;
 	}
 	raw_spin_unlock(&rq->iowait_lock);
 }
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 866a3ff..42a0e99 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -240,40 +240,65 @@ void account_steal_time(cputime_t cputime)
 }
 
 /*
+ * Account for iowait time.
+ * @cputime: the cpu time spent in io wait
+ */
+void account_iowait_time(cputime_t cputime)
+{
+	u64 *cpustat = kcpustat_this_cpu->cpustat;
+
+	cpustat[CPUTIME_IOWAIT] += (__force u64) cputime;
+}
+
+/*
+ * Account for idle time.
+ * @cputime: the cpu time spent in idle wait
+ */
+void account_idle_time(cputime_t cputime)
+{
+	u64 *cpustat = kcpustat_this_cpu->cpustat;
+
+	cpustat[CPUTIME_IDLE] += (__force u64) cputime;
+}
+
+/*
  * Account for idle and iowait time in a dulation.
  * @idle_enter: time stamp at idle entry
  * @iowait_exit: time stamp when nr_iowait dropped to 0
  * @idle_exit: time stamp at idle exit
  */
-void account_idle_and_iowait(cputime_t idle_enter, cputime_t iowait_exit, cputime_t idle_exit)
+void account_idle_and_iowait(cputime_t idle_enter, cputime_t iowait_exit,
+			     cputime_t idle_exit)
 {
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	struct rq *rq = this_rq();
 
 	if (rq->nr_iowait > 0 || iowait_exit > idle_exit) {
-		cpustat[CPUTIME_IOWAIT] += (__force u64) idle_exit - idle_enter;
+		account_iowait_time(idle_exit - idle_enter);
 	} else if (iowait_exit > idle_enter) {
-		cpustat[CPUTIME_IOWAIT] += (__force u64) iowait_exit - idle_enter;
-		cpustat[CPUTIME_IDLE] += (__force u64) idle_exit - iowait_exit;
+		account_iowait_time(iowait_exit - idle_enter);
+		account_idle_time(idle_exit - iowait_exit);
 	} else {
-		cpustat[CPUTIME_IDLE] += (__force u64) idle_exit - idle_enter;
+		account_idle_time(idle_exit - idle_enter);
 	}
 }
 
 /*
- * Account for idle time.
- * @cputime: the cpu time spent in idle wait (sometimes include iowait time)
+ * Account for idle and iowait time.
+ * @ticks: ticks spent in idle/io wait
  */
-void account_idle_time(cputime_t cputime)
+static void __account_idle_ticks(int ticks)
 {
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	struct rq *rq = this_rq();
+	unsigned long no_io_ticks = jiffies - rq->last_iowait_jiffies;
 
-	/* FIXME */
-	if (rq->nr_iowait > 0)
-		cpustat[CPUTIME_IOWAIT] += (__force u64) cputime;
-	else
-		cpustat[CPUTIME_IDLE] += (__force u64) cputime;
+	if (rq->nr_iowait > 0) {
+		account_iowait_time(jiffies_to_cputime(ticks));
+	} else if (no_io_ticks < ticks) {
+		account_iowait_time(jiffies_to_cputime(ticks - no_io_ticks));
+		account_idle_time(jiffies_to_cputime(no_io_ticks));
+	} else {
+		account_idle_time(jiffies_to_cputime(ticks));
+	}
 }
 
 static __always_inline bool steal_account_process_tick(void)
@@ -380,11 +405,11 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 	} else if (user_tick) {
 		account_user_time(p, cputime, scaled);
 	} else if (p == rq->idle) {
-		account_idle_time(cputime);
+		__account_idle_ticks(ticks);
 	} else if (p->flags & PF_VCPU) { /* System time or guest time */
 		account_guest_time(p, cputime, scaled);
 	} else {
-		__account_system_time(p, cputime, scaled,	CPUTIME_SYSTEM);
+		__account_system_time(p, cputime, scaled, CPUTIME_SYSTEM);
 	}
 }
 
@@ -396,7 +421,8 @@ static void irqtime_account_idle_ticks(int ticks)
 }
 #else /* CONFIG_IRQ_TIME_ACCOUNTING */
 static inline void irqtime_account_idle_ticks(int ticks) {}
-static inline void irqtime_account_process_tick(struct task_struct *p, int user_tick,
+static inline void irqtime_account_process_tick(struct task_struct *p,
+						int user_tick,
 						struct rq *rq, int nr_ticks) {}
 #endif /* CONFIG_IRQ_TIME_ACCOUNTING */
 
@@ -499,7 +525,7 @@ void account_process_tick(struct task_struct *p, int user_tick)
 		account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,
 				    one_jiffy_scaled);
 	else
-		account_idle_time(cputime_one_jiffy);
+		__account_idle_ticks(1);
 }
 
 /*
@@ -514,7 +540,7 @@ void account_steal_ticks(unsigned long ticks)
 
 /*
  * Account multiple ticks of idle time.
- * @ticks: number of stolen ticks
+ * @ticks: number of idle ticks
  */
 void account_idle_ticks(unsigned long ticks)
 {
@@ -524,7 +550,7 @@ void account_idle_ticks(unsigned long ticks)
 		return;
 	}
 
-	account_idle_time(jiffies_to_cputime(ticks));
+	__account_idle_ticks(ticks);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4ddfddc..e5fb7b5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -564,6 +564,7 @@ struct rq {
 	raw_spinlock_t	iowait_lock ____cacheline_aligned;
 	unsigned int	nr_iowait;
 	ktime_t		last_iowait;
+	unsigned long	last_iowait_jiffies;
 
 #ifdef CONFIG_SMP
 	struct root_domain *rd;
-- 
1.7.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 0/8] rework iowait accounting
  2014-06-26  9:06 [RFC PATCH 0/8] rework iowait accounting Hidetoshi Seto
                   ` (7 preceding siblings ...)
  2014-06-26  9:17 ` [PATCH 8/8] cputime: iowait aware idle tick accounting Hidetoshi Seto
@ 2014-07-07  9:30 ` Peter Zijlstra
  8 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2014-07-07  9:30 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: linux-kernel, Fernando Luis Vazquez Cao, Tetsuo Handa,
	Frederic Weisbecker, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy, Denys Vlasenko

[-- Attachment #1: Type: text/plain, Size: 2546 bytes --]

On Thu, Jun 26, 2014 at 06:06:23PM +0900, Hidetoshi Seto wrote:
> [Request for Comments]
> 
>   So, as the first step to solve this bunch of problems, I wrote
>   some pseudo codes for review to check my direction. These codes
>   look like usual patch set but !! _NOT_TESTED_COMPLETELY_ !!
>   Now I don't have s390/ia64/ppc test box so I didn't even check
>   these pass the compiler :-p

qemu? It'd be horridly slow I suppose. But you should be able to get a
cross compiler at the least.

>   What I want to do here is:
> 
>   - record timestamp at the end of iowait (decrement nr_iowait
>     from 1 to 0) and use it for iowait time accounting
>     (This idea is what PeterZ suggested)
> 
>   - use same strategy for basic tick-accounting, NO_HZ kernel,
>     s390 and other VIRT_CPU_ACCOUNTING architectures.
> 
>   Still I'm concerning performance impact by changing scheduler
>   core codes and also possible troubles by comparing timestamps
>   from different clocks.

So I don't _think_ the performance impact is _that_ high for x86, can be
worse for others. On x86 we:

 - iowait_start(): was one atomic_inc(), is now still one atomic op --
   raw_spin_lock().

 - iowait_stop(): was one atomic_dec(), is now still one atomic op --
   raw_spin_lock().

The only case we add lots of pain is the remote iowait to 0 case, where
we call ktime_get(). Which gets us to different clocks, what different
clocks?

Also; there's an argument to be made about correctness over performance;
but I feel its a somewhat weak argument since the entire per-cpu iowait
number is still complete bullshit :-)

>   These codes are based on following patch on top of v3.16-rc2:
>     [PATCH] nohz: make updating sleep stats local, add seqcount
> 
>   I'd like to ask you:
> 
>     - Do you think if I continue this direction these blueprints
>       would be acceptable change?

Yeah, so this solves the 'simple' problem of making the iowait
accounting nohz invariant, which is 'good'. It doesn't address the
bigger issue of what it all means. But I suppose we can start here.

>     - Or shall we kill iowait completely?

I'm all in favour of less accounting :-) It might upset some people
though, but you have my blessing if you can get it done.

>     - Are there any good workaround or band-aid for stable kernels?

I suppose this is still backportable to any -stable that is still
maintained ;-) Then again, this has been broken for bloody ever, so I
don't suppose its really that urgent.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-07-07  9:30 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-26  9:06 [RFC PATCH 0/8] rework iowait accounting Hidetoshi Seto
2014-06-26  9:08 ` [PATCH 1/8] cputime, sched: record last_iowait Hidetoshi Seto
2014-06-26  9:09 ` [PATCH 2/8] cputime, nohz: handle last_iowait for nohz Hidetoshi Seto
2014-06-26  9:10 ` [PATCH 3/8] cputime: introduce account_idle_and_iowait Hidetoshi Seto
2014-06-26  9:12 ` [PATCH 4/8] cputime, s390: introduce s390_get_idle_and_iowait Hidetoshi Seto
2014-06-26  9:13 ` [PATCH 5/8] cputime, ia64: update iowait accounting Hidetoshi Seto
2014-06-26  9:14 ` [PATCH 6/8] cputime, ppc: " Hidetoshi Seto
2014-06-26  9:16 ` [PATCH 7/8] cputime: generic iowait accounting for VIRT_CPU_ACCOUNTING Hidetoshi Seto
2014-06-26  9:17 ` [PATCH 8/8] cputime: iowait aware idle tick accounting Hidetoshi Seto
2014-07-07  9:30 ` [RFC PATCH 0/8] rework iowait accounting Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).