* [PATCH 0/2 v2] nohz: fix idle accounting in NO_HZ kernels
@ 2014-03-31  2:01 Hidetoshi Seto
  2014-03-31  2:08 ` [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2 Hidetoshi Seto
  2014-03-31  2:09 ` [PATCH 2/2] nohz, procfs: introduce get_cpu_idle/iowait_time_coarse Hidetoshi Seto
  0 siblings, 2 replies; 15+ messages in thread
From: Hidetoshi Seto @ 2014-03-31  2:01 UTC
  To: linux-kernel
  Cc: Fernando Luis Vazquez Cao, Tetsuo Handa, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy

Hi all,

This patch set (rebased on v3.14-rc8) fixes an issue where the
idle/iowait fields of /proc/stat can go backward, originally
reported by Tetsuo and Fernando in March 2013.

v2 has Preeti's Reviewed-by (thanks!).
Of course further reviews are still welcome.

Thanks,
H.Seto


Hidetoshi Seto (2):
      nohz: use seqlock to avoid race on idle time stats v2
      nohz, procfs: introduce get_cpu_idle/iowait_time_coarse

 fs/proc/stat.c           |   16 ++---
 include/linux/tick.h     |    5 +
 kernel/time/tick-sched.c |  197 ++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 183 insertions(+), 35 deletions(-)



* [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2
  2014-03-31  2:01 [PATCH 0/2 v2] nohz: fix idle accounting in NO_HZ kernels Hidetoshi Seto
@ 2014-03-31  2:08 ` Hidetoshi Seto
  2014-04-02 19:35   ` Denys Vlasenko
  2014-03-31  2:09 ` [PATCH 2/2] nohz, procfs: introduce get_cpu_idle/iowait_time_coarse Hidetoshi Seto
  1 sibling, 1 reply; 15+ messages in thread
From: Hidetoshi Seto @ 2014-03-31  2:08 UTC
  To: linux-kernel
  Cc: Fernando Luis Vazquez Cao, Tetsuo Handa, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy

This patch is 1/2 of a patch set to fix an issue where the
idle/iowait fields of /proc/stat can go backward, originally
reported by Tetsuo and Fernando in March 2013.

[BACKGROUNDS]: idle accounting on NO_HZ

If NO_HZ is enabled, a cpu stops its tick interrupt before going
to sleep to be idle. This means the time stats of the sleeping
cpu will not be updated by the tick interrupt. Instead, when the
cpu wakes up, it updates the time stats by calculating the idle
duration from the timestamp taken at idle entry and the current
time taken at idle exit.

OTOH, there can be observers who want to know how long the
sleeping cpu has been idle; imagine userland running the top
command or an application reading /proc/stat. Therefore the
kernel provides the get_cpu_{idle,iowait}_time_us() functions to
obtain the current idle time stats of such a sleeping cpu. These
functions read the time stats and the idle-entry timestamp, and
return the current idle time by adding the duration calculated
from that timestamp and the current time.
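
In rough pseudo-C, the read path that this patch replaces (see the
diff below) boils down to the following sketch (the updating
last_update_time path and NOHZ checks are omitted):

  u64 read_idle_time_us(int cpu, struct tick_sched *ts)
  {
          ktime_t now = ktime_get();
          ktime_t idle = ts->idle_sleeptime;

          /* cpu is still sleeping: add the span not yet accounted */
          if (ts->idle_active && !nr_iowait_cpu(cpu))
                  idle = ktime_add(idle, ktime_sub(now, ts->idle_entrytime));

          return ktime_to_us(idle);
  }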

There are 2 problems:

[PROBLEM 1]: there is no exclusive control.

There are 2 different cpus involved - an observing cpu running a
program that reads the idle cpu's stats, and an observed cpu that
is idle. A race can happen if the observed cpu wakes up while it
is being observed: the observer can accidentally add the
calculated duration (say, delta) on top of time stats that the
woken cpu has just updated. The correct idle time is returned on
the next query, so the result is time going backward. Therefore
readers must be excluded by the writer.
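
A concrete interleaving (numbers only for illustration):

  given:
    idle time stats: idle_sleeptime=1000
    stamp at idle entry: idle_entrytime=50, idle_active=1

  reader checks idle_active (=1) and reads idle_entrytime (=50)

  observed cpu wakes up @ 58 and accounts for itself:
    idle_sleeptime=1008 (=1000+(58-50)), idle_active=0

  reader goes on, reads idle_sleeptime (=1008) and now (=60),
  and returns:
    1018 (=1008+(60-50))   <= the span 50..58 is counted twice

  next reader's query @ now = 70 (cpu busy since 58):
    1008

The reported idle time went backward from 1018 to 1008.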

Then, since the time stats are updated by the woken cpu, there is
only one writer, right? Unfortunately, no. I'm not sure why, but
for some reason the functions get_cpu_{idle,iowait}_time_us() are
able to update a sleeping cpu's time stats from the observing cpu.
From a grep of today's kernel tree, this feature is used by the
cpufreq module. Calling these functions periodically works like
emulating the tick for the sleeping cpu.
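
In rough terms, such a caller looks like the following sketch
(sample_idle() is a hypothetical name; the point is that passing a
non-NULL last_update_time turns the caller into a writer of the
sleeping cpu's time stats):

  /* hypothetical periodic sampler, e.g. from a governor timer */
  static void sample_idle(int cpu)
  {
          u64 wall;       /* receives the update timestamp, in us */
          u64 idle_us = get_cpu_idle_time_us(cpu, &wall);

          /*
           * Passing &wall made the callee fold the pending idle
           * span into ts->idle_sleeptime, i.e. we just wrote to
           * another cpu's time stats.
           */
          /* ... feed idle_us and wall into load estimation ... */
  }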

In summary, there are races between multiple writers and multiple
readers, but no exclusive control over this idle time stats dataset.

[PROBLEM 2]: broken iowait accounting.

Historically, a cpu's idle time has been accounted as either idle
or iowait depending on the presence of tasks blocked on I/O. No
one complained about it for a long time. However:

  > Still trying to wrap my head around it, but conceptually
  > get_cpu_iowait_time_us() doesn't make any kind of sense.
  > iowait isn't per cpu since effectively tasks that aren't
  > running aren't assigned a cpu (as Oleg already pointed out).
  -- Peter Zijlstra

Now some kernel folks have realized that accounting iowait per-cpu
does not make sense in the SMP world. In the traditional UP era, a
cpu was considered to be waiting on I/O if it was idle while
nr_iowait > 0. But these days, on SMP systems, tasks easily
migrate from the cpu where they issued an I/O to another cpu,
where they are queued after I/O completion.

Back to the NO_HZ mechanism. The really terrible thing here is
that the observer has to handle the duration "delta" while the
nr_iowait of the sleeping cpu can easily be changed by migration
even though the cpu is sleeping. So it can happen that:

  given:
    idle time stats: idle=1000, iowait=100
    stamp at idle entry: entry=50
    nr tasks waiting io: nr_iowait=1

  the observer temporarily assigns delta to iowait in the 1st
  place (but does not perform the update (=account delta to the
  time stats)):
    1st reader's query @ now = 60:
      idle=1000
      iowait=110 (=100+(60-50))

  then all blocked tasks are migrated away:
    nr_iowait=0

  and at last, in the 2nd turn, the observer assigns delta to idle:
    2nd reader's query @ now = 70:
      idle=1020 (=1000+(70-50))
      iowait=100

You can see that iowait decreased from 110 to 100.

In summary, iowait accounting has a fundamental problem and needs
to be reworked precisely. This implies that some user interfaces
might have to be replaced completely.

[WHAT THIS PATCH PROPOSES]: fix problem 1 first.

To fix problem 1, this patch adds a seqlock to the NO_HZ idle
accounting to avoid possible races between multiple readers and
writers.

And to cope with problem 2, comments are added to note that there
are known issues.

References:
  First report from Fernando:
    [RFC] iowait/idle time accounting hiccups in NOHZ kernels
    https://lkml.org/lkml/2013/3/18/962
  Steps to reproduce guided by Tetsuo:
    https://lkml.org/lkml/2013/4/1/201

  1st patchset from Frederic:
    [PATCH 0/4] nohz: Fix racy sleeptime stats
    https://lkml.org/lkml/2013/8/8/638
    [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats
    https://lkml.org/lkml/2013/8/16/274

  2nd patchset from Frederic:
    [RFC PATCH 0/5] nohz: Fix racy sleeptime stats v2
    https://lkml.org/lkml/2013/10/19/86

v2: updated comments and description about problem 2;
    included a fix for a minor typo

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Reported-by: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/tick.h     |    1 +
 kernel/time/tick-sched.c |  118 ++++++++++++++++++++++++++++++++++++----------
 2 files changed, 94 insertions(+), 25 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index b84773c..f6f4ac1 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -62,6 +62,7 @@ struct tick_sched {
 	unsigned long			idle_calls;
 	unsigned long			idle_sleeps;
 	int				idle_active;
+	seqlock_t			idle_sleeptime_seq;
 	ktime_t				idle_entrytime;
 	ktime_t				idle_waketime;
 	ktime_t				idle_exittime;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9f8af69..6641c56 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -407,11 +407,23 @@ static void tick_nohz_update_jiffies(ktime_t now)
  * Updates the per cpu time idle statistics counters
  */
 static void
-update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
+update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now,
+		     ktime_t *idle, ktime_t *iowait, int last)
 {
 	ktime_t delta;
 
-	if (ts->idle_active) {
+	write_seqlock(&ts->idle_sleeptime_seq);
+
+	/* now must be newer/greater than entrytime */
+	if (ts->idle_active && (ktime_compare(now, ts->idle_entrytime) > 0)) {
+		/*
+		 * FIXME: Since the nr_iowait of a sleeping cpu can change
+		 * from/to 0 during the span from idle_entrytime to now, we
+		 * have no idea how to split the delta between idle_sleeptime
+		 * and iowait_sleeptime.
+		 * So, as the traditional (improper) estimation workaround,
+		 * attribute the whole delta based on the current nr_iowait.
+		 */
 		delta = ktime_sub(now, ts->idle_entrytime);
 		if (nr_iowait_cpu(cpu) > 0)
 			ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
@@ -420,16 +432,19 @@ update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_upda
 		ts->idle_entrytime = now;
 	}
 
-	if (last_update_time)
-		*last_update_time = ktime_to_us(now);
+	if (idle)
+		*idle = ts->idle_sleeptime;
+	if (iowait)
+		*iowait = ts->iowait_sleeptime;
+	if (last)
+		ts->idle_active = 0;
 
+	write_sequnlock(&ts->idle_sleeptime_seq);
 }
 
 static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 {
-	update_ts_time_stats(smp_processor_id(), ts, now, NULL);
-	ts->idle_active = 0;
-
+	update_ts_time_stats(smp_processor_id(), ts, now, NULL, NULL, 1);
 	sched_clock_idle_wakeup_event(0);
 }
 
@@ -437,8 +452,11 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
 {
 	ktime_t now = ktime_get();
 
+	write_seqlock(&ts->idle_sleeptime_seq);
 	ts->idle_entrytime = now;
 	ts->idle_active = 1;
+	write_sequnlock(&ts->idle_sleeptime_seq);
+
 	sched_clock_idle_sleep_event();
 	return now;
 }
@@ -449,34 +467,59 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
  * @last_update_time: variable to store update time in. Do not update
  * counters if NULL.
  *
- * Return the cummulative idle time (since boot) for a given
+ * Return the cumulative idle time (since boot) for a given
  * CPU, in microseconds.
  *
  * This time is measured via accounting rather than sampling,
  * and is as accurate as ktime_get() is.
  *
+ * Known bug: The return value is not monotonic when @last_update_time
+ * is NULL and thus no update is performed, because it then includes
+ * cputime not yet determined to be idle or iowait, hence unaccounted.
+ *
  * This function returns -1 if NOHZ is not enabled.
  */
 u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 {
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
-	ktime_t now, idle;
+	ktime_t now, idle, delta;
 
 	if (!tick_nohz_active)
 		return -1;
 
 	now = ktime_get();
 	if (last_update_time) {
-		update_ts_time_stats(cpu, ts, now, last_update_time);
-		idle = ts->idle_sleeptime;
+		update_ts_time_stats(cpu, ts, now, &idle, NULL, 0);
+		*last_update_time = ktime_to_us(now);
 	} else {
-		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
-			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+		unsigned int seq;
 
-			idle = ktime_add(ts->idle_sleeptime, delta);
-		} else {
+		do {
+			seq = read_seqbegin(&ts->idle_sleeptime_seq);
 			idle = ts->idle_sleeptime;
-		}
+			/*
+			 * FIXME: At this point, the delta is determined to be
+			 * neither idle nor iowait. For an observer who wants
+			 * the time stats of an idle cpu, this function
+			 * temporarily classifies the delta based on the value
+			 * of nr_iowait_cpu at the moment it runs, but it does
+			 * not update the time stats to account for the delta.
+			 * This means every observer classifies the delta at
+			 * its own discretion. And because nr_iowait can change
+			 * while the idle cpu is sleeping, one concurrent
+			 * observer may assign the delta to idle while another
+			 * assigns it to iowait. The delta then appears to have
+			 * moved from one side to the other, and the side that
+			 * loses the delta returns a non-monotonic value. That
+			 * is why users of this function must not expect
+			 * monotonicity here.
+			 */
+			if (ts->idle_active && !nr_iowait_cpu(cpu)
+			    && (ktime_compare(now, ts->idle_entrytime) > 0)) {
+				delta = ktime_sub(now, ts->idle_entrytime);
+				idle = ktime_add(ts->idle_sleeptime, delta);
+			}
+		} while (read_seqretry(&ts->idle_sleeptime_seq, seq));
 	}
 
 	return ktime_to_us(idle);
@@ -490,34 +533,59 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
  * @last_update_time: variable to store update time in. Do not update
  * counters if NULL.
  *
- * Return the cummulative iowait time (since boot) for a given
+ * Return the cumulative iowait time (since boot) for a given
  * CPU, in microseconds.
  *
  * This time is measured via accounting rather than sampling,
  * and is as accurate as ktime_get() is.
  *
+ * Known bug: The return value is not monotonic when @last_update_time
+ * is NULL and thus no update is performed, because it then includes
+ * cputime not yet determined to be idle or iowait, hence unaccounted.
+ *
  * This function returns -1 if NOHZ is not enabled.
  */
 u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 {
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
-	ktime_t now, iowait;
+	ktime_t now, iowait, delta;
 
 	if (!tick_nohz_active)
 		return -1;
 
 	now = ktime_get();
 	if (last_update_time) {
-		update_ts_time_stats(cpu, ts, now, last_update_time);
-		iowait = ts->iowait_sleeptime;
+		update_ts_time_stats(cpu, ts, now, NULL, &iowait, 0);
+		*last_update_time = ktime_to_us(now);
 	} else {
-		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
-			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+		unsigned int seq;
 
-			iowait = ktime_add(ts->iowait_sleeptime, delta);
-		} else {
+		do {
+			seq = read_seqbegin(&ts->idle_sleeptime_seq);
 			iowait = ts->iowait_sleeptime;
-		}
+			/*
+			 * FIXME: At this point, the delta is determined to be
+			 * neither idle nor iowait. For an observer who wants
+			 * the time stats of an idle cpu, this function
+			 * temporarily classifies the delta based on the value
+			 * of nr_iowait_cpu at the moment it runs, but it does
+			 * not update the time stats to account for the delta.
+			 * This means every observer classifies the delta at
+			 * its own discretion. And because nr_iowait can change
+			 * while the idle cpu is sleeping, one concurrent
+			 * observer may assign the delta to idle while another
+			 * assigns it to iowait. The delta then appears to have
+			 * moved from one side to the other, and the side that
+			 * loses the delta returns a non-monotonic value. That
+			 * is why users of this function must not expect
+			 * monotonicity here.
+			 */
+			if (ts->idle_active && nr_iowait_cpu(cpu) > 0
+			    && (ktime_compare(now, ts->idle_entrytime) > 0)) {
+				delta = ktime_sub(now, ts->idle_entrytime);
+				iowait = ktime_add(ts->iowait_sleeptime, delta);
+			}
+		} while (read_seqretry(&ts->idle_sleeptime_seq, seq));
 	}
 
 	return ktime_to_us(iowait);
-- 
1.7.1




* [PATCH 2/2] nohz, procfs: introduce get_cpu_idle/iowait_time_coarse
  2014-03-31  2:01 [PATCH 0/2 v2] nohz: fix idle accounting in NO_HZ kernels Hidetoshi Seto
  2014-03-31  2:08 ` [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2 Hidetoshi Seto
@ 2014-03-31  2:09 ` Hidetoshi Seto
  1 sibling, 0 replies; 15+ messages in thread
From: Hidetoshi Seto @ 2014-03-31  2:09 UTC
  To: linux-kernel
  Cc: Fernando Luis Vazquez Cao, Tetsuo Handa, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy

This patch is 2/2 of a patch set to fix an issue where the
idle/iowait fields of /proc/stat can go backward, originally
reported by Tetsuo and Fernando in March 2013.

Now it is clear that get_cpu_{idle,iowait}_time_us() is not monotonic.
Using it for /proc/stat will cause trouble for innocent userland
programs that believe these counters are strictly monotonic.

Given that:

  - If an observer (=reader of /proc/stat) wants to avoid backward
    values, it must update the time stats for the next observer.
    This means the observer determines whether delta counts as
    idle or iowait and accounts it on behalf of the sleeping cpu.

  - Usually there will not be many observers (i.e. running top on
    a few consoles or employing a few monitoring programs is enough
    for the average system admin), but we must plan for the worst
    case. In short, make an effort to reduce lock contention.

  - The resolution required for /proc/stat is tick-level, not
    microseconds.

This patch introduces the new functions get_cpu_{idle,iowait}_time_coarse(),
which guarantee monotonic return values at the cost of lower resolution.

The tricks are:

  - At first, the function obtains the latest time stats and
    calculates a "delta" covering the span from the last time the
    stats were updated to the current time.

  - If the delta is less than the tick length, the obtained time
    stats are used as if they had been sampled by a tick interrupt
    that happened recently.

  - If the delta is greater than a tick, the time stats are
    updated, as if emulating the tick on behalf of the sleeping
    observed cpu.

  - As a result, the rate at which observers update the time stats
    is limited to once per tick. In other words, when there are
    observers monitoring a sleeping cpu, the tick emulation job
    during idle is left to those observers.

I confirmed that this patch fixes the monotonicity of /proc/stat by
running the reproducer and a stressor for a day. The reproduction
rate differs from system to system, but in my case, running "git gc"
on a kernel source repository alongside the checker works well.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Reported-by: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/stat.c           |   16 +++------
 include/linux/tick.h     |    4 ++
 kernel/time/tick-sched.c |   79 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 89 insertions(+), 10 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 6f599c6..3dbe282 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -45,32 +45,28 @@ static cputime64_t get_iowait_time(int cpu)
 
 static u64 get_idle_time(int cpu)
 {
-	u64 idle, idle_time = -1ULL;
+	u64 idle = -1ULL;
 
 	if (cpu_online(cpu))
-		idle_time = get_cpu_idle_time_us(cpu, NULL);
+		idle = get_cpu_idle_time_coarse(cpu);
 
-	if (idle_time == -1ULL)
+	if (idle == -1ULL)
 		/* !NO_HZ or cpu offline so we can rely on cpustat.idle */
 		idle = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE];
-	else
-		idle = usecs_to_cputime64(idle_time);
 
 	return idle;
 }
 
 static u64 get_iowait_time(int cpu)
 {
-	u64 iowait, iowait_time = -1ULL;
+	u64 iowait = -1ULL;
 
 	if (cpu_online(cpu))
-		iowait_time = get_cpu_iowait_time_us(cpu, NULL);
+		iowait = get_cpu_iowait_time_coarse(cpu);
 
-	if (iowait_time == -1ULL)
+	if (iowait == -1ULL)
 		/* !NO_HZ or cpu offline so we can rely on cpustat.iowait */
 		iowait = kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
-	else
-		iowait = usecs_to_cputime64(iowait_time);
 
 	return iowait;
 }
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f6f4ac1..3b4674d 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -140,6 +140,8 @@ extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
+extern u64 get_cpu_idle_time_coarse(int cpu);
+extern u64 get_cpu_iowait_time_coarse(int cpu);
 
 # else /* !CONFIG_NO_HZ_COMMON */
 static inline int tick_nohz_tick_stopped(void)
@@ -158,6 +160,8 @@ static inline ktime_t tick_nohz_get_sleep_length(void)
 }
 static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
 static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
+static inline u64 get_cpu_idle_time_coarse(int cpu) { return -1; }
+static inline u64 get_cpu_iowait_time_coarse(int cpu) { return -1; }
 # endif /* !CONFIG_NO_HZ_COMMON */
 
 #ifdef CONFIG_NO_HZ_FULL
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6641c56..4bf22d2 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -592,6 +592,85 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
+/**
+ * get_cpu_idle_time_coarse - get coarse idle time of a cpu
+ * @cpu: CPU number to query
+ *
+ * Return the cumulative idle time (since boot) for a given
+ * CPU, in tick resolution (for traditional UI like /proc/stat).
+ *
+ * This function returns -1 if NOHZ is not enabled.
+ */
+u64 get_cpu_idle_time_coarse(int cpu)
+{
+	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+	ktime_t now, idle, delta;
+	unsigned int seq;
+
+	if (!tick_nohz_active)
+		return -1;
+
+	if (!ts->idle_active)
+		return usecs_to_cputime64(ktime_to_us(ts->idle_sleeptime));
+
+	now = ktime_get();
+
+	do {
+		seq = read_seqbegin(&ts->idle_sleeptime_seq);
+		idle = ts->idle_sleeptime;
+		delta = ktime_sub(now, ts->idle_entrytime);
+	} while (read_seqretry(&ts->idle_sleeptime_seq, seq));
+
+	/*
+	 * If the delta is less than a tick, use the current value and
+	 * just ignore the delta. Otherwise, perform an update.
+	 */
+	if (ktime_compare(delta, ns_to_ktime(TICK_NSEC)) > 0)
+		update_ts_time_stats(cpu, ts, now, &idle, NULL, 0);
+
+	return usecs_to_cputime64(ktime_to_us(idle));
+
+}
+
+/**
+ * get_cpu_iowait_time_coarse - get coarse iowait time of a cpu
+ * @cpu: CPU number to query
+ *
+ * Return the cumulative iowait time (since boot) for a given
+ * CPU, in tick resolution (for traditional UI like /proc/stat).
+ *
+ * This function returns -1 if NOHZ is not enabled.
+ */
+u64 get_cpu_iowait_time_coarse(int cpu)
+{
+	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+	ktime_t now, iowait, delta;
+	unsigned int seq;
+
+	if (!tick_nohz_active)
+		return -1;
+
+	if (!ts->idle_active)
+		return usecs_to_cputime64(ktime_to_us(ts->iowait_sleeptime));
+
+	now = ktime_get();
+
+	do {
+		seq = read_seqbegin(&ts->idle_sleeptime_seq);
+		iowait = ts->iowait_sleeptime;
+		delta = ktime_sub(now, ts->idle_entrytime);
+	} while (read_seqretry(&ts->idle_sleeptime_seq, seq));
+
+	/*
+	 * If the delta is less than a tick, use the current value and
+	 * just ignore the delta. Otherwise, perform an update.
+	 */
+	if (ktime_compare(delta, ns_to_ktime(TICK_NSEC)) > 0)
+		update_ts_time_stats(cpu, ts, now, NULL, &iowait, 0);
+
+	return usecs_to_cputime64(ktime_to_us(iowait));
+}
+
 static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
 					 ktime_t now, int cpu)
 {
-- 
1.7.1




* Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2
  2014-03-31  2:08 ` [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2 Hidetoshi Seto
@ 2014-04-02 19:35   ` Denys Vlasenko
  2014-04-03  7:02     ` Hidetoshi Seto
  2014-04-04 16:03     ` Frederic Weisbecker
  0 siblings, 2 replies; 15+ messages in thread
From: Denys Vlasenko @ 2014-04-02 19:35 UTC
  To: Hidetoshi Seto
  Cc: Linux Kernel Mailing List, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Frederic Weisbecker, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Arjan van de Ven, Oleg Nesterov,
	Preeti U Murthy

On Mon, Mar 31, 2014 at 4:08 AM, Hidetoshi Seto
<seto.hidetoshi@jp.fujitsu.com> wrote:
> There are 2 problems:
>
> [PROBLEM 1]: there is no exclusive control.
>
> There are 2 different cpus involved - an observing cpu running a
> program that reads the idle cpu's stats, and an observed cpu that
> is idle. A race can happen if the observed cpu wakes up while it
> is being observed: the observer can accidentally add the
> calculated duration (say, delta) on top of time stats that the
> woken cpu has just updated. The correct idle time is returned on
> the next query, so the result is time going backward. Therefore
> readers must be excluded by the writer.
>
> Then, since the time stats are updated by the woken cpu, there is
> only one writer, right? Unfortunately, no. I'm not sure why, but
> for some reason the functions get_cpu_{idle,iowait}_time_us() are
> able to update a sleeping cpu's time stats from the observing cpu.
> From a grep of today's kernel tree, this feature is used by the
> cpufreq module. Calling these functions periodically works like
> emulating the tick for the sleeping cpu.

Frederic's patches started by moving all updates
to tick_nohz_stop_idle(), making the above problem easier -
get_cpu_{idle,iowait}_time_us() are pure readers.

The patches are here:

https://lkml.org/lkml/2013/10/19/86

> [PROBLEM 2]: broken iowait accounting.
>
> Historically, a cpu's idle time has been accounted as either idle
> or iowait depending on the presence of tasks blocked on I/O. No
> one complained about it for a long time. However:
>
>   > Still trying to wrap my head around it, but conceptually
>   > get_cpu_iowait_time_us() doesn't make any kind of sense.
>   > iowait isn't per cpu since effectively tasks that aren't
>   > running aren't assigned a cpu (as Oleg already pointed out).
>   -- Peter Zijlstra
>
> Now some kernel folks have realized that accounting iowait per-cpu
> does not make sense in the SMP world. In the traditional UP era, a
> cpu was considered to be waiting on I/O if it was idle while
> nr_iowait > 0. But these days, on SMP systems, tasks easily
> migrate from the cpu where they issued an I/O to another cpu,
> where they are queued after I/O completion.

However, if we put ourselves in the admin's seat, iowait
immediately starts to make sense: for an admin, the system state
where a lot of CPU time is genuinely idle is qualitatively different
from the state where a lot of CPU time is "idle" because
we are I/O bound.

Admins probably wouldn't insist that iowait accounting must be
very accurate. I would hazard a guess that admins would settle
for the following rules:

* (idle + iowait) should accurately represent the amount of time
CPUs were idle.
* both idle and iowait should never go backwards
* when the system is truly idle, only idle should increase
* when the system is truly I/O bound on all CPUs, only iowait should increase
* when the situation is in between the above two cases,
both the iowait and idle counters should grow. It's ok if they
represent the idle/IO-bound ratio only approximately

> Back to the NO_HZ mechanism. The really terrible thing here is
> that the observer has to handle the duration "delta" while the
> nr_iowait of the sleeping cpu can easily be changed by migration
> even though the cpu is sleeping.

How about the following: when the CPU enters idle, it remembers
in struct tick_sched->idle_active whether it was "truly" idle
or I/O bound: something like

ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;

Then, when we exit idle, we account the entire idle period as
"true" idle or as iowait depending on the ts->idle_active value,
regardless of what happened to the I/O bound task (whether
it migrated or not).
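
Concretely, the idle-entry side would be something like this
sketch (based on tick_nohz_start_idle() as shown in patch 1/2;
only the idle_active assignment changes, seqlock elided):

static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
{
        ktime_t now = ktime_get();

        ts->idle_entrytime = now;
        /* remember *why* we go idle: 2 = I/O bound, 1 = truly idle */
        ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;
        sched_clock_idle_sleep_event();
        return now;
}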


> [WHAT THIS PATCH PROPOSED]: fix problem 1 first.
>
> To fix problem 1, this patch adds seqlock for NO_HZ idle
> accounting to avoid possible races between multiple reader/writer.

That's what Frederic proposed too.
However, he thought it adds too much overhead: it adds
a new field, two memory barriers and an RMW cycle
in the updater (tick_nohz_stop_idle),
and similarly two memory barriers in the readers.

How about a slightly different approach.
We take his first two patches:

"nohz: Convert a few places to use local per cpu accesses"
"nohz: Only update sleeptime stats locally"

then on top of that we fix racy access by readers as follows:

The updater does not need the ts->idle_entrytime field
after it is done, so we can reuse it as an "update in progress"
flag. We set it to a sentinel value (say, zero),
then increase idle or iowait, then clear ts->idle_active,
with two memory barriers to ensure other CPUs see the
updates in exactly that order:

tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now):
...
        /* Updates the per cpu time idle statistics counters */
        delta = ktime_sub(now, ts->idle_entrytime);
-       if (nr_iowait_cpu(smp_processor_id()) > 0)
+       ts->idle_entrytime.tv64 = 0;
+       smp_wmb();
+       if (ts->idle_active == 2)
                ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
        else
                ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+       smp_wmb();
        ts->idle_active = 0;

Compared to the seqlock approach, we don't need a new field (a seqlock
counter) and don't need to increment it twice (two RMW cycles). We have
the same number of wmb's as the seqlock approach - two.

The iowait reader reads these three fields in reverse order,
with correct read barriers:

get_cpu_iowait_time_us(int cpu, u64 *last_update_time):
...
+ again:
+       active = ACCESS_ONCE(ts->idle_active);
+       smp_rmb();
+       count = ACCESS_ONCE(ts->iowait_sleeptime);
+       if (active == 2) {   // 2 means "idle was entered with pending I/O"
+               smp_rmb();
+               start = ACCESS_ONCE(ts->idle_entrytime);
....
+       }
+       return count;

Compared to the seqlock approach, we can avoid the second rmb (as shown
above) if we see that idle_active was not set: we know that in this case
we already have the result in "count", with no need to correct it.

Now, if idle_active was set (to 2, for the iowait case), we can check
for the sentinel in idle_entrytime. If it is there, we are racing with
the updater; in this case, we loop back:

+       if (active == 2) {
+               smp_rmb();
+               start = ACCESS_ONCE(ts->idle_entrytime);
+               if (start.tv64 == 0)
+                       /* Other CPU is updating the count.
+                        * We don't know whether fetched count is valid.
+                        */
+                       goto again;
+               delta = ktime_sub(now, start);
+               count = ktime_add(count, delta);
        }

If we don't see the sentinel, we adjust "count" and we are done.

I will send the full patches shortly.

Thoughts?


* Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2
  2014-04-02 19:35   ` Denys Vlasenko
@ 2014-04-03  7:02     ` Hidetoshi Seto
  2014-04-03  9:51       ` Denys Vlasenko
  2014-04-04 16:03     ` Frederic Weisbecker
  1 sibling, 1 reply; 15+ messages in thread
From: Hidetoshi Seto @ 2014-04-03  7:02 UTC
  To: Denys Vlasenko
  Cc: Linux Kernel Mailing List, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Frederic Weisbecker, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Arjan van de Ven, Oleg Nesterov,
	Preeti U Murthy

(2014/04/03 4:35), Denys Vlasenko wrote:
> On Mon, Mar 31, 2014 at 4:08 AM, Hidetoshi Seto
> <seto.hidetoshi@jp.fujitsu.com> wrote:
>> There are 2 problems:
>>
>> [PROBLEM 1]: there is no exclusive control.
>>
>> There are 2 different cpus involved - an observing cpu running a
>> program that reads the idle cpu's stats, and an observed cpu that
>> is idle. A race can happen if the observed cpu wakes up while it
>> is being observed: the observer can accidentally add the
>> calculated duration (say, delta) on top of time stats that the
>> woken cpu has just updated. The correct idle time is returned on
>> the next query, so the result is time going backward. Therefore
>> readers must be excluded by the writer.
>>
>> Then, since the time stats are updated by the woken cpu, there is
>> only one writer, right? Unfortunately, no. I'm not sure why, but
>> for some reason the functions get_cpu_{idle,iowait}_time_us() are
>> able to update a sleeping cpu's time stats from the observing cpu.
>> From a grep of today's kernel tree, this feature is used by the
>> cpufreq module. Calling these functions periodically works like
>> emulating the tick for the sleeping cpu.
> 
> Frederic's patches started by moving all updates
> to tick_nohz_stop_idle(), making the above problem easier -
> get_cpu_{idle,iowait}_time_us() are pure readers.
> 
> The patches are here:
> 
> https://lkml.org/lkml/2013/10/19/86

My concern here was that get_cpu_{idle,iowait}_time_us() are
exported functions, so removing the update functionality from them
affects kernel ABI compatibility. Even though I guess there are no
users other than the known cpufreq and its kin, I also thought it
looked like a shotgun approach and rough stuff.

However, I now realize that this is an old mindset from when the
kernel had Documentation/feature-removal.txt ... so I'm OK with
removing the updates from these functions.

Still, I'd prefer to see this change in a separate patch that
modifies get_cpu_{idle,iowait}_time_us() only. It should CC the
cpufreq people and carry their Acked-by.

>> [PROBLEM 2]: broken iowait accounting.
>>
>> Historically, a cpu's idle time has been accounted as either idle
>> or iowait depending on the presence of tasks blocked on I/O. No
>> one complained about it for a long time. However:
>>
>>   > Still trying to wrap my head around it, but conceptually
>>   > get_cpu_iowait_time_us() doesn't make any kind of sense.
>>   > iowait isn't per cpu since effectively tasks that aren't
>>   > running aren't assigned a cpu (as Oleg already pointed out).
>>   -- Peter Zijlstra
>>
>> Now some kernel folks have realized that accounting iowait per-cpu
>> does not make sense in the SMP world. In the traditional UP era, a
>> cpu was considered to be waiting on I/O if it was idle while
>> nr_iowait > 0. But these days, on SMP systems, tasks easily
>> migrate from the cpu where they issued an I/O to another cpu,
>> where they are queued after I/O completion.
> 
>> However, if we put ourselves in the admin's seat, iowait
>> immediately starts to make sense: for an admin, the system state
>> where a lot of CPU time is genuinely idle is qualitatively different
>> from the state where a lot of CPU time is "idle" because
>> we are I/O bound.
> 
>> Admins probably wouldn't insist that iowait accounting must be
>> very accurate. I would hazard a guess that admins would settle
>> for the following rules:
> 
>> * (idle + iowait) should accurately represent the amount of time
>> CPUs were idle.
>> * both idle and iowait should never go backwards
>> * when the system is truly idle, only idle should increase
>> * when the system is truly I/O bound on all CPUs, only iowait should increase
>> * when the situation is in between the above two cases,
>> both the iowait and idle counters should grow. It's ok if they
>> represent the idle/IO-bound ratio only approximately

Yep. Admins are at the mercy of the iowait value, though they know
it is not accurate.

Assume there are tasks X, Y, and Z (X issues I/O, Y sleeps
moderately, and Z has low priority):

Case 1:
  cpu A: <--run X--><--iowait--><--run X--><--iowait--><--run X ...
  cpu B: <---run Y--><--run Z--><--run Y--><--run Z--><--run Y ...
  io:               <-- io X -->           <-- io X -->

Case 2:
  cpu A: <--run X--><--run Z---><--run X--><--run Z---><--run X ...
  cpu B: <---run Y---><--idle--><---run Y---><--idle--><--run Y ...
  io:               <-- io X -->           <-- io X -->

So case 1 tends toward iowait while case 2 tends toward idle,
despite almost identical workloads. Then what should admins do...?
(How silly! :-p)

>> Back to the NO_HZ mechanism. The really terrible thing here is
>> that the observer has to handle the duration "delta" while the
>> nr_iowait of the sleeping cpu can easily be changed by migration
>> even though the cpu is sleeping.
> 
> How about the following: when the CPU enters idle, it remembers
> in struct tick_sched->idle_active whether it was "truly" idle
> or I/O bound: something like
> 
> ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;
> 
> Then, when we exit idle, we account the entire idle period as
> "true" idle or as iowait depending on the ts->idle_active value,
> regardless of what happened to the I/O bound task (whether
> it migrated or not).

That will not be acceptable. A CPU can sleep for a significantly
long time after all I/O bound tasks have migrated away, e.g.:

cpu A: <-run X-><-- iowait ---... (few days) ...--><-run Z ..
cpu B: <----run Y------><-run X->..
io:             <-io X->

Since NO_HZ fits well on systems that want to keep CPUs sleeping
to save battery/power, situations like the above are likely
common. You would see amazing iowait on an idle system, and also
see iowait increasing even though no tasks are waiting on I/O...

>> [WHAT THIS PATCH PROPOSED]: fix problem 1 first.
>>
>> To fix problem 1, this patch adds seqlock for NO_HZ idle
>> accounting to avoid possible races between multiple reader/writer.
> 
> That's what Frederic proposed too.
> However, he thought it adds too much overhead: it adds
> a new field, two memory barriers and an RMW cycle
> in the updater (tick_nohz_stop_idle),
> and similarly two memory barriers in the readers.
> 
> How about a slightly different approach.
> We take his first two patches:
> 
> "nohz: Convert a few places to use local per cpu accesses"
> "nohz: Only update sleeptime stats locally"
> 
> then on top of that we fix racy access by readers as follows:
> 
> The updater does not need the ts->idle_entrytime field
> after it is done, so we can reuse it as an "update in progress"
> flag. We set it to a sentinel value (say, zero),
> then increase idle or iowait, then clear ts->idle_active,
> with two memory barriers to ensure other CPUs see the
> updates in exactly that order:
> 
> tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now):
> ...
>         /* Updates the per cpu time idle statistics counters */
>         delta = ktime_sub(now, ts->idle_entrytime);
> -       if (nr_iowait_cpu(smp_processor_id()) > 0)
> +       ts->idle_entrytime.tv64 = 0;
> +       smp_wmb();
> +       if (ts->idle_active == 2)
>                 ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
>         else
>                 ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
> +       smp_wmb();
>         ts->idle_active = 0;
> 

I will just note that "delta" might be big, such as days.

> Compared to the seqlock approach, we don't need a new field (a seqlock
> counter) and don't need to increment it twice (two RMW cycles). We have
> the same number of wmb's as the seqlock approach - two.
> 
> The iowait reader reads these three fields in reverse order,
> with correct read barriers:
> 
> get_cpu_iowait_time_us(int cpu, u64 *last_update_time):
> ...
> + again:
> +       active = ACCESS_ONCE(ts->idle_active);
> +       smp_rmb();
> +       count = ACCESS_ONCE(ts->iowait_sleeptime);
> +       if (active == 2) {   // 2 means "idle was entered with pending I/O"
> +               smp_rmb();
> +               start = ACCESS_ONCE(ts->idle_entrytime);
> ....
> +       }
> +       return count;
> 
> Compared to the seqlock approach, we can avoid the second rmb (as shown
> above) if we see that idle_active was not set: we know that in this case
> we already have the result in "count", with no need to correct it.
> 
> Now, if idle_active was set (to 2, for the iowait case), we can check
> for the sentinel in idle_entrytime. If it is there, we are racing with
> the updater; in this case, we loop back:
> 
> +       if (active == 2) {
> +               smp_rmb();
> +               start = ACCESS_ONCE(ts->idle_entrytime);
> +               if (start.tv64 == 0)
> +                       /* Other CPU is updating the count.
> +                        * We don't know whether fetched count is valid.
> +                        */
> +                       goto again;
> +               delta = ktime_sub(now, start);
> +               count = ktime_add(count, delta);
>         }
> 
> If we don't see the sentinel, we adjust "count" and we are done.
> 
> I will send the full patches shortly.
> 
> Thoughts?

I agree on removing get_cpu_{idle,iowait}_time_us() (or marking
them as obsolete) under some conditions.

However, your approach of making them "pure readers" is, I think,
a failure. Thank you for providing a good counter-design!

Thanks,
H.Seto



* Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2
  2014-04-03  7:02     ` Hidetoshi Seto
@ 2014-04-03  9:51       ` Denys Vlasenko
  2014-04-04  2:47         ` Hidetoshi Seto
  0 siblings, 1 reply; 15+ messages in thread
From: Denys Vlasenko @ 2014-04-03  9:51 UTC
  To: Hidetoshi Seto
  Cc: Linux Kernel Mailing List, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Frederic Weisbecker, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Arjan van de Ven, Oleg Nesterov,
	Preeti U Murthy

On Thu, Apr 3, 2014 at 9:02 AM, Hidetoshi Seto
<seto.hidetoshi@jp.fujitsu.com> wrote:
>>> [PROBLEM 2]: broken iowait accounting.
>>>
>>> Historically, a cpu's idle time has been accounted as either idle
>>> or iowait depending on the presence of tasks blocked on I/O. No
>>> one complained about it for a long time. However:
>>>
>>>   > Still trying to wrap my head around it, but conceptually
>>>   > get_cpu_iowait_time_us() doesn't make any kind of sense.
>>>   > iowait isn't per cpu since effectively tasks that aren't
>>>   > running aren't assigned a cpu (as Oleg already pointed out).
>>>   -- Peter Zijlstra
>>>
>>> Now some kernel folks have realized that accounting iowait per-cpu
>>> does not make sense in the SMP world. In the traditional UP era, a
>>> cpu was considered to be waiting on I/O if it was idle while
>>> nr_iowait > 0. But these days, on SMP systems, tasks easily
>>> migrate from the cpu where they issued an I/O to another cpu,
>>> where they are queued after I/O completion.
>>
>> However, if we put ourselves in the admin's seat, iowait
>> immediately starts to make sense: for an admin, the system state
>> where a lot of CPU time is genuinely idle is qualitatively different
>> from the state where a lot of CPU time is "idle" because
>> we are I/O bound.
>>
>> Admins probably wouldn't insist that iowait accounting must be
>> very accurate. I would hazard a guess that admins would settle
>> for the following rules:
>>
>> * (idle + iowait) should accurately represent the amount of time
>> CPUs were idle.
>> * both idle and iowait should never go backwards
>> * when the system is truly idle, only idle should increase
>> * when the system is truly I/O bound on all CPUs, only iowait should increase
>> * when the situation is in between the above two cases,
>> both the iowait and idle counters should grow. It's ok if they
>> represent the idle/IO-bound ratio only approximately
>
> Yep. Admins are at the mercy of the iowait value, though they know
> it is not accurate.
>
> Assume there are tasks X, Y, and Z (X issues I/O, Y sleeps
> moderately, and Z has low priority):
>
> Case 1:
>   cpu A: <--run X--><--iowait--><--run X--><--iowait--><--run X ...
>   cpu B: <---run Y--><--run Z--><--run Y--><--run Z--><--run Y ...
>   io:               <-- io X -->           <-- io X -->
>
> Case 2:
>   cpu A: <--run X--><--run Z---><--run X--><--run Z---><--run X ...
>   cpu B: <---run Y---><--idle--><---run Y---><--idle--><--run Y ...
>   io:               <-- io X -->           <-- io X -->
>
> So case 1 tends toward iowait while case 2 tends toward idle,
> despite almost identical workloads. Then what should admins do...?

This happens with current code too, right?
No regression then.

>>> Back to the NO_HZ mechanism. The really terrible thing here is
>>> that the observer has to handle the duration "delta" while the
>>> nr_iowait of the sleeping cpu can easily be changed by migration
>>> even though the cpu is sleeping.
>>
>> How about the following: when the CPU enters idle, it remembers
>> in struct tick_sched->idle_active whether it was "truly" idle
>> or I/O bound: something like
>>
>> ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;
>>
>> Then, when we exit idle, we account the entire idle period as
>> "true" idle or as iowait depending on the ts->idle_active value,
>> regardless of what happened to the I/O bound task (whether
>> it migrated or not).
>
> That will not be acceptable. A CPU can sleep for a significantly
> long time after all I/O bound tasks have migrated away, e.g.:
>
> cpu A: <-run X-><-- iowait ---... (few days) ...--><-run Z ..
> cpu B: <----run Y------><-run X->..
> io:             <-io X->

Does a task migrate from an *idle* CPU? If yes, why?
Since its CPU is idle (i.e. immediately available
for it to be scheduled on),
I would imagine an IO-blocked task normally stays
on its CPU's rq if that CPU is idle.

> I agree on removing get_cpu_{idle,iowait}_time_us() (or marking
> them as obsolete) under some conditions.

Er?
My proposal does not eliminate or change the
get_cpu_{idle,iowait}_time_us() API.


* Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2
  2014-04-03  9:51       ` Denys Vlasenko
@ 2014-04-04  2:47         ` Hidetoshi Seto
  0 siblings, 0 replies; 15+ messages in thread
From: Hidetoshi Seto @ 2014-04-04  2:47 UTC
  To: Denys Vlasenko
  Cc: Linux Kernel Mailing List, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Frederic Weisbecker, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Arjan van de Ven, Oleg Nesterov,
	Preeti U Murthy

(2014/04/03 18:51), Denys Vlasenko wrote:
> On Thu, Apr 3, 2014 at 9:02 AM, Hidetoshi Seto
> <seto.hidetoshi@jp.fujitsu.com> wrote:
>>>> [PROBLEM 2]: broken iowait accounting.
>>>>
>>>> Historically, a cpu's idle time has been accounted as either idle
>>>> or iowait depending on the presence of tasks blocked on I/O. No
>>>> one complained about it for a long time. However:
>>>>
>>>>   > Still trying to wrap my head around it, but conceptually
>>>>   > get_cpu_iowait_time_us() doesn't make any kind of sense.
>>>>   > iowait isn't per cpu since effectively tasks that aren't
>>>>   > running aren't assigned a cpu (as Oleg already pointed out).
>>>>   -- Peter Zijlstra
>>>>
>>>> Now some kernel folks have realized that accounting iowait per-cpu
>>>> does not make sense in the SMP world. In the traditional UP era, a
>>>> cpu was considered to be waiting on I/O if it was idle while
>>>> nr_iowait > 0. But these days, on SMP systems, tasks easily
>>>> migrate from the cpu where they issued an I/O to another cpu,
>>>> where they are queued after I/O completion.
>>>
>>> However, if we put ourselves in the admin's seat, iowait
>>> immediately starts to make sense: for an admin, the system state
>>> where a lot of CPU time is genuinely idle is qualitatively different
>>> from the state where a lot of CPU time is "idle" because
>>> we are I/O bound.
>>>
>>> Admins probably wouldn't insist that iowait accounting must be
>>> very accurate. I would hazard a guess that admins would settle
>>> for the following rules:
>>>
>>> * (idle + iowait) should accurately represent the amount of time
>>> CPUs were idle.
>>> * both idle and iowait should never go backwards
>>> * when the system is truly idle, only idle should increase
>>> * when the system is truly I/O bound on all CPUs, only iowait should increase
>>> * when the situation is in between the above two cases,
>>> both the iowait and idle counters should grow. It's ok if they
>>> represent the idle/IO-bound ratio only approximately
>>
>> Yep. Admins are at the mercy of the iowait value, though they know
>> it is not accurate.
>>
>> Assume there are tasks X, Y, and Z (X issues I/O, Y sleeps
>> moderately, and Z has low priority):
>>
>> Case 1:
>>   cpu A: <--run X--><--iowait--><--run X--><--iowait--><--run X ...
>>   cpu B: <---run Y--><--run Z--><--run Y--><--run Z--><--run Y ...
>>   io:               <-- io X -->           <-- io X -->
>>
>> Case 2:
>>   cpu A: <--run X--><--run Z---><--run X--><--run Z---><--run X ...
>>   cpu B: <---run Y---><--idle--><---run Y---><--idle--><--run Y ...
>>   io:               <-- io X -->           <-- io X -->
>>
>> So case 1 tends toward iowait while case 2 tends toward idle,
>> despite almost identical workloads. Then what should admins do...?
> 
> This happens with current code too, right?
> No regression then.

Yes, problem 2 is not a regression. As I stated in the first
place, it is a fundamental problem of the current iowait stuff,
and my patch set does not aim at problem 2.

>>>> Back to the NO_HZ mechanism. The really terrible thing here is
>>>> that the observer has to handle the duration "delta" while the
>>>> nr_iowait of the sleeping cpu can easily be changed by migration
>>>> even though the cpu is sleeping.
>>>
>>> How about the following: when the CPU enters idle, it remembers
>>> in struct tick_sched->idle_active whether it was "truly" idle
>>> or I/O bound: something like
>>>
>>> ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;
>>>
>>> Then, when we exit idle, we account the entire idle period as
>>> "true" idle or as iowait depending on the ts->idle_active value,
>>> regardless of what happened to the I/O bound task (whether
>>> it migrated or not).
>>
>> That will not be acceptable. A CPU can sleep for a significantly
>> long time after all I/O bound tasks have migrated away, e.g.:
>>
>> cpu A: <-run X-><-- iowait ---... (few days) ...--><-run Z ..
>> cpu B: <----run Y------><-run X->..
>> io:             <-io X->
> 
> Does a task migrate from an *idle* CPU? If yes, why?
> Since its CPU is idle (i.e. immediately available
> for it to be scheduled on),
> I would imagine an IO-blocked task normally stays
> on its CPU's rq if that CPU is idle.

I found an answer from Peter Zijlstra in the following thread:
    [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats
    https://lkml.org/lkml/2013/8/16/274

(Sorry, I could not reach lkml.org today due to a network error,
 so I could not get a direct link to the following reply. I hope
 you can find it from the parent post linked above. I quote the
 important part instead.)

<quote> 
> Option B:
> 
>> Or we can live with that and still account the whole idle time slept until
>> tick_nohz_stop_idle() to iowait if we called tick_nohz_start_idle() with nr_iowait > 0.
>> All we need is just a new field in ts-> that records on which state we entered
>> idle.
>>
>> What do you think?
> 
> I think option B is unworkable. Afaict it could basically caused
> unlimited iowait time. Suppose we have a load-balancer that tries it
> bestestest to sort-left (ie. run a task on the lowest 'free' cpu
> possible) -- the power aware folks are pondering such schemes.
</quote>

Another answer: we cannot stop users from using cpusets (=forcing
migration by hand) on a task that is waiting on I/O.
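
For example, plain sched_setaffinity(2) from userland is enough
(a hypothetical helper; a cpuset would have the same effect on
the blocked task):

#define _GNU_SOURCE
#include <sched.h>

static int force_to_cpu3(pid_t pid)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(3, &set);
        /* works even while pid is blocked in io_schedule(): it
         * will wake up on cpu 3, while the cpu it slept on may
         * stay idle with the I/O still pending */
        return sched_setaffinity(pid, sizeof(set), &set);
}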

>> I agree on removing get_cpu_{idle,iowait}_time_us() (or marking
>> them as obsolete) under some conditions.
> 
> Er?
> My proposal does not eliminate or change the
> get_cpu_{idle,iowait}_time_us() API.

Sorry for causing confusion.
I should revise my previous comment in different, more precise words.

First, it was my fault to call your proposal an "API change".
It does not change the number/type of the functions' arguments, etc.
I should have said "semantics change" for "removing the update
functionality".

<source kernel/time/tick-sched.c>
> /**
>  * get_cpu_idle_time_us - get the total idle time of a cpu
>  * @cpu: CPU number to query
>  * @last_update_time: variable to store update time in. Do not update
>  * counters if NULL.
</source>

Second, removing these functions entirely was only my own opinion;
you did not mention it.

So the revised comment would be:
  - I agree on removing the update functionality from
    get_cpu_{idle,iowait}_time_us() if that is an acceptable
    semantics change for the cpufreq people.
  - By the way, IMHO we could remove these functions
    completely. (Or, if required, mark them as obsolete for
    a certain period.)
  - In any case, such a change could be a single patch separate
    from the current patch set.

Thanks,
H.Seto



* Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2
  2014-04-02 19:35   ` Denys Vlasenko
  2014-04-03  7:02     ` Hidetoshi Seto
@ 2014-04-04 16:03     ` Frederic Weisbecker
  2014-04-04 17:02       ` Denys Vlasenko
  1 sibling, 1 reply; 15+ messages in thread
From: Frederic Weisbecker @ 2014-04-04 16:03 UTC
  To: Denys Vlasenko
  Cc: Hidetoshi Seto, Linux Kernel Mailing List,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven,
	Oleg Nesterov, Preeti U Murthy

Hi Guys,

You and Hidetoshi have sent a few patches with very detailed changelogs
and it's going to be hard to synthesize. So my reviews are going to be a
bit chaotic, sorry for that in advance.

On Wed, Apr 02, 2014 at 09:35:47PM +0200, Denys Vlasenko wrote:
> On Mon, Mar 31, 2014 at 4:08 AM, Hidetoshi Seto
> <seto.hidetoshi@jp.fujitsu.com> wrote:
> > There are 2 problems:
> >
> > [PROBLEM 1]: there is no exclusive control.
> >
> > There are 2 different cpus involved - an observing cpu running a
> > program that reads the idle cpu's stats, and an observed cpu that
> > is idle. A race can happen if the observed cpu wakes up while it
> > is being observed: the observer can accidentally add the
> > calculated duration (say, delta) on top of time stats that the
> > woken cpu has just updated. The correct idle time is returned on
> > the next query, so the result is time going backward. Therefore
> > readers must be excluded by the writer.
> >
> > Then, since the time stats are updated by the woken cpu, there is
> > only one writer, right? Unfortunately, no. I'm not sure why, but
> > for some reason the functions get_cpu_{idle,iowait}_time_us() are
> > able to update a sleeping cpu's time stats from the observing cpu.
> > From a grep of today's kernel tree, this feature is used by the
> > cpufreq module. Calling these functions periodically works like
> > emulating the tick for the sleeping cpu.
> 
> Frederic's patches started by moving all updates
> to tick_nohz_stop_idle(), making the above problem easier -
> get_cpu_{idle,iowait}_time_us() are pure readers.
> 
> The patches are here:
> 
> https://lkml.org/lkml/2013/10/19/86

Yeah, definitely we should limit the updates to local updates
from idle when possible.

The get_cpu_..time_us() functions should read under the seqcount
to deduce the right delta since the last update.

> 
> > [PROBLEM 2]: broken iowait accounting.
> >
> > Historically, a cpu's idle time has been accounted as either idle
> > or iowait depending on the presence of tasks blocked on I/O. No
> > one complained about it for a long time. However:
> >
> >   > Still trying to wrap my head around it, but conceptually
> >   > get_cpu_iowait_time_us() doesn't make any kind of sense.
> >   > iowait isn't per cpu since effectively tasks that aren't
> >   > running aren't assigned a cpu (as Oleg already pointed out).
> >   -- Peter Zijlstra
> >
> > Now some kernel folks have realized that accounting iowait per-cpu
> > does not make sense in the SMP world. In the traditional UP era, a
> > cpu was considered to be waiting on I/O if it was idle while
> > nr_iowait > 0. But these days, on SMP systems, tasks easily
> > migrate from the cpu where they issued an I/O to another cpu,
> > where they are queued after I/O completion.
> 
> However, if we put ourselves in the admin's seat, iowait
> immediately starts to make sense: for an admin, the system state
> where a lot of CPU time is genuinely idle is qualitatively different
> from the state where a lot of CPU time is "idle" because
> we are I/O bound.
> 
> Admins probably wouldn't insist that iowait accounting must be
> very accurate. I would hazard a guess that admins would settle
> for the following rules:

Iowait makes sense but not per cpu. Eventually it's a global
stat. Or per task.

I've scratched my head a lot over this, and I think we can't continue
with the current semantics. If we had to keep the current semantics
and enforce correctness at the same time, we would run into
big scalability and performance issues. This can't be done without
locking updates to nr_iowait() with a seqlock:

* when a task blocks on IO and goes idle, lock some per cpu iowait_seq,
increase nr_iowait, save curr CPU number, save time.

* when a task's io completes and it gets enqueued on another CPU: retrieve
the old CPU, lock its iowait_seq, decrease nr_iowait, flush the iowait delta.

And all that just to maintain stats which semantics are wrong, this
would be pure madness.
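
For concreteness, that scheme would look something like this (a sketch
only; iowait_seq, nr_iowait and iowait_sleeptime are hypothetical per
cpu fields, shown just to make the remote locking visible):

/* task blocks on IO and the cpu goes idle */
write_seqlock(&per_cpu(iowait_seq, cpu));
per_cpu(nr_iowait, cpu)++;
tsk->iowait_cpu = cpu;		/* remember the blocking CPU */
tsk->iowait_start = ktime_get();
write_sequnlock(&per_cpu(iowait_seq, cpu));

/* IO completes and the task is enqueued, possibly elsewhere */
write_seqlock(&per_cpu(iowait_seq, tsk->iowait_cpu));	/* remote! */
per_cpu(nr_iowait, tsk->iowait_cpu)--;
per_cpu(iowait_sleeptime, tsk->iowait_cpu) +=
	ktime_sub(ktime_get(), tsk->iowait_start);
write_sequnlock(&per_cpu(iowait_seq, tsk->iowait_cpu));

Every IO wakeup would take a cross-CPU lock, which is exactly the
scalability problem.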

OTOH we must stay compatible with user ABI in /proc/stat (the one in /proc/timers_list
matters less).  But if we make it a per task stat, we are free to account it
on the CPU we want.

So what we can do for example is to account it per task and update stats
on the CPU where the blocking task wakes up. This way we guarantee
that we only account locally, which is good for scalability.

This is going to be an ABI change on a /proc/stat field's semantics.
We usually cannot do that as it can break userspace. But I think
we have a reasonable exception here:

1) From a performance POV we don't have a choice.

2) It has always been a buggy stat on SMP. Given the possible fast iowait update
rate, I doubt it has ever dumped correct stats. So I guess that few user apps
have ever correctly relied on it. 

Also it decouples iowait from idle time. Running time is also accounted
as iowait. But then again, since an iowait task is not attached to any CPU,
accounting iowait time only when a specific CPU is idle doesn't make sense.

Oh, and the kernel API is not a problem either. Only cpufreq uses it IIRC. But
I've been told it's only for a few old Intel CPUs. Now given how buggy the stat
is, I doubt it worked correctly.

Anyway, the first step will be to remove the cpufreq API use.

Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2
  2014-04-04 16:03     ` Frederic Weisbecker
@ 2014-04-04 17:02       ` Denys Vlasenko
  2014-04-05 10:08         ` Frederic Weisbecker
  0 siblings, 1 reply; 15+ messages in thread
From: Denys Vlasenko @ 2014-04-04 17:02 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Hidetoshi Seto, Linux Kernel Mailing List,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven,
	Oleg Nesterov, Preeti U Murthy

On Fri, Apr 4, 2014 at 6:03 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
>> However, if we put ourselves into the admin's seat, iowait
>> immediately starts to make sense: for the admin, the system state
>> where a lot of CPU time is genuinely idle is qualitatively different
>> from the state where a lot of CPU time is "idle" because
>> we are I/O bound.
>>
>> Admins probably wouldn't insist that iowait accounting must be
>> very accurate. I would hazard a guess that admins would settle
>> for the following rules:
>
> Iowait makes sense, but not per cpu. Ultimately it's a global
> stat. Or a per-task one.

There are a lot of situations where admins want to know
how much, on average, their CPUs are idle because
they wait for IO.

If you are running, say, a Web cache,
you need to know that stat in order to be able to
conjecture "looks like I'm IO bound, perhaps caching
some data in RAM will speed it up".

A global stat will give such data to the admin. A per-task one won't -
there can be an efficient Web cache design which uses
many parallel tasks to hide IO latency. Thus, such a
Web cache can be nearly optimal despite its tasks,
individually, having significant iowait counts.

> I've scratched my head a lot on this. And I think we can't continue
> with the current semantics. If we had to keep the current semantics
> and enforce correctness at the same time, we are going to run into
> big scalability and performance issues. This can't be done without
> locking updates to nr_iowait() with seqlock:
>
> * when a task blocks on IO and goes idle, lock some per cpu iowait_seq,
> increase nr_iowait, save curr CPU number, save time.
>
> * when a task io completes and it gets enqueued on another CPU: retrieve
> old CPU, lock its iowait_seq, decrease nr_iowait, flush the iowait delta.
>
> And all that just to maintain stats whose semantics are wrong, this
> would be pure madness.
>
> OTOH we must stay compatible with user ABI in /proc/stat (the one in /proc/timers_list
> matters less).  But if we make it a per task stat, we are free to account it
> on the CPU we want.
>
> So what we can do for example is to account it per task and update stats
> on the CPU where the blocking task wakes up. This way we guarantee
> that we only account locally, which is good for scalability.

When an IO-bound task wakes on some CPU,
how exactly do you propose to update the counters -
add the total waited time of this task to this CPU's counter?

Is such a counter meaningful for the admin?

> This is going to be an ABI change on a /proc/stat field's semantics.
> We usually cannot do that as it can break userspace. But I think
> we have a reasonable exception here:
>
> 1) From a performance POV we don't have a choice.
>
> 2) It has always been a buggy stat on SMP. Given the possible fast iowait update
> rate, I doubt it has ever dumped correct stats. So I guess that few user apps
> have ever correctly relied on it.

In the busybox project, the following tools use the iowait counter:

top,mpstat: in order to show "%iowait"

nmeter: to show "waiting for disk" part of CPU bar.
Example:
$ nmeter '%t %70c'
18:57:33 SUU...................................................................
18:57:34 SUUUUUUUUUUI..........................................................
18:57:35 SUUUII................................................................
18:57:36 SUUU..................................................................
18:57:37 SSUDDD................................................................
  (^^^^^^ IO-intensive task starts)
18:57:38 SSSSSSUDDDDDDDDDDDDIi.................................................
18:57:39 SSSSSSSSUDDDDDDDDDi...................................................
18:57:40 SSSSSUUDDDDDDDDDDDDi..................................................
18:57:41 SSSSSUUUUUDDDDDDDDDDDDi...............................................
18:57:42 SSSSSUDDDDDDDDDDDDDIi.................................................
18:57:43 SSSUUDDDDDDDi.........................................................
  (^^^^^^ IO-intensive task ends)
18:57:44 SUUUI.................................................................
18:57:45 SUUU..................................................................
18:57:46 UU....................................................................
18:57:47 U.....................................................................

This doesn't look bogus to me.
It does give me the information I need.

> Also it decouples iowait from idle time. Running time is also accounted
> as iowait.

The time when CPUs are busy while there is IO-wait
is usually not a sign of a badly tuned software/system.

Only the time when CPUs are idle while there is IO-wait is.

That's how it looks from userspace.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2
  2014-04-04 17:02       ` Denys Vlasenko
@ 2014-04-05 10:08         ` Frederic Weisbecker
  2014-04-05 14:56           ` Denys Vlasenko
  0 siblings, 1 reply; 15+ messages in thread
From: Frederic Weisbecker @ 2014-04-05 10:08 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Hidetoshi Seto, Linux Kernel Mailing List,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven,
	Oleg Nesterov, Preeti U Murthy

On Fri, Apr 04, 2014 at 07:02:43PM +0200, Denys Vlasenko wrote:
> On Fri, Apr 4, 2014 at 6:03 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> >> However, if we put ourselves into the admin's seat, iowait
> >> immediately starts to make sense: for the admin, the system state
> >> where a lot of CPU time is genuinely idle is qualitatively different
> >> from the state where a lot of CPU time is "idle" because
> >> we are I/O bound.
> >>
> >> Admins probably wouldn't insist that iowait accounting must be
> >> very accurate. I would hazard a guess that admins would settle
> >> for the following rules:
> >
> > Iowait makes sense, but not per cpu. Ultimately it's a global
> > stat. Or a per-task one.
> 
> There are a lot of situations where admins want to know
> how much, on average, their CPUs are idle because
> they wait for IO.
> 
> If you are running, say, a Web cache,
> you need to know that stat in order to be able to
> conjecture "looks like I'm IO bound, perhaps caching
> some data in RAM will speed it up".

But then accounting the iowait time waited until completion on the CPU
where the task wakes up should work for that.

Doesn't it?

> 
> A global stat will give such data to the admin. A per-task one won't -
> there can be an efficient Web cache design which uses
> many parallel tasks to hide IO latency. Thus, such a
> Web cache can be nearly optimal despite its tasks,
> individually, having significant iowait counts.

I don't see how it's an issue. Just add up the cumulative iowait across
all these tasks and you get the global iowait time.

Anyway that's not a problem; the goal is still to display that per CPU.

> 
> > I've scratched my head a lot on this. And I think we can't continue
> > with the current semantics. If we had to keep the current semantics
> > and enforce correctness at the same time, we are going to run into
> > big scalability and performance issues. This can't be done without
> > locking updates to nr_iowait() with seqlock:
> >
> > * when a task blocks on IO and goes idle, lock some per cpu iowait_seq,
> > increase nr_iowait, save curr CPU number, save time.
> >
> > * when a task io completes and it gets enqueued on another CPU: retrieve
> > old CPU, lock its iowait_seq, decrease nr_iowait, flush the iowait delta.
> >
> > And all that just to maintain stats whose semantics are wrong, this
> > would be pure madness.
> >
> > OTOH we must stay compatible with user ABI in /proc/stat (the one in /proc/timers_list
> > matters less).  But if we make it a per task stat, we are free to account it
> > on the CPU we want.
> >
> > So what we can do for example is to account it per task and update stats
> > on the CPU where the blocking task wakes up. This way we guarantee
> > that we only account locally, which is good for scalability.
> 
> When an IO-bound task wakes on some CPU,
> how exactly do you propose to update the counters -
> add the total waited time of this task to this CPU's counter?

So we save, per task, the time when the task went to sleep. And when it wakes up
we just flush the pending time since that sleep time to the waking CPU:
iowait[$CPU] += NOW() - tsk->sleeptime;
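
In kernel terms, something like this (a sketch; ->sleeptime is a
hypothetical field stamped when the task blocks on IO, and the units
are handwaved):

/* when the task blocks on IO */
tsk->in_iowait = 1;
tsk->sleeptime = ktime_get();

/* later, on the CPU where the task wakes up */
if (tsk->in_iowait) {
	u64 delta = ktime_to_ns(ktime_sub(ktime_get(), tsk->sleeptime));

	/* purely local update, no remote locking needed */
	kcpustat_this_cpu->cpustat[CPUTIME_IOWAIT] += delta;
	tsk->in_iowait = 0;
}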

> Is such a counter meaningful for the admin?

Well, you get the iowait time accounting.

> 
> > This is going to be an ABI change on a /proc/stat field's semantics.
> > We usually cannot do that as it can break userspace. But I think
> > we have a reasonable exception here:
> >
> > 1) From a performance POV we don't have a choice.
> >
> > 2) It has always been a buggy stat on SMP. Given the possible fast iowait update
> > rate, I doubt it has ever dumped correct stats. So I guess that few user apps
> > have ever correctly relied on it.
> 
> In the busybox project, the following tools use the iowait counter:
> 
> top,mpstat: in order to show "%iowait"
> 
> nmeter: to show "waiting for disk" part of CPU bar.
> Example:
> $ nmeter '%t %70c'
> 18:57:33 SUU...................................................................
> 18:57:34 SUUUUUUUUUUI..........................................................
> 18:57:35 SUUUII................................................................
> 18:57:36 SUUU..................................................................
> 18:57:37 SSUDDD................................................................
>   (^^^^^^ IO-intensive task starts)
> 18:57:38 SSSSSSUDDDDDDDDDDDDIi.................................................
> 18:57:39 SSSSSSSSUDDDDDDDDDi...................................................
> 18:57:40 SSSSSUUDDDDDDDDDDDDi..................................................
> 18:57:41 SSSSSUUUUUDDDDDDDDDDDDi...............................................
> 18:57:42 SSSSSUDDDDDDDDDDDDDIi.................................................
> 18:57:43 SSSUUDDDDDDDi.........................................................
>   (^^^^^^ IO-intensive task ends)
> 18:57:44 SUUUI.................................................................
> 18:57:45 SUUU..................................................................
> 18:57:46 UU....................................................................
> 18:57:47 U.....................................................................
> 
> This doesn't look bogus to me.
> It does give me the information I need.

Hmm, but the iowait and idle stats are so racy that I would personally not trust such a
report.

Races are just too likely, and the errors potentially accumulate at high frequency. Imagine this simple one:

Imagine IO task A sleeps on CPU 0. Then it wakes up elsewhere, runs for a while,
and sleeps, but not on IO, for, say, 1 minute.
During this time CPU 0 still sleeps.
Task A does a short IO again and wakes up on CPU 0.
CPU 0 just spuriously accounted more than 1 minute of iowait.

But I also realize that userspace code like this must rely on the fact that idle
time and iowait time are mutually exclusive. So perhaps we can't break the ABI
as easily as I thought.

Also the !NO_HZ behaviour still has the old semantics.

Haha, this interface is such a well-designed nightmare :-(

> 
> > Also it decouples iowait from idle time. Running time is also accounted
> > as iowait.
> 
> The time when CPUs are busy while there is IO-wait
> are usually not a sign of badly tuned software/system.
> 
> Only when CPUs are idle and there is IO-wait is.
>
> That's how it looks from userspace.

That's actually an artificial view that we made up for userspace
because its implementation was convenient in !CONFIG_NO_HZ times.

In fact when I look at the history:

* Introduction of the buggy accounting 0224cf4c5ee0d7faec83956b8e21f7d89e3df3bd
* Use it for /proc/stat 7386cdbf2f57ea8cff3c9fde93f206e58b9fe13f

We tried to mimic the way we account iowait time in !CONFIG_NO_HZ
configurations, where it's easy to distinguish idle/iowait time and to account
the iowait time per blocking CPU source using the periodic tick, because it's
always there anyway. So we can tick on the blocking CPU source and poll on the
number of tasks that last blocked there until they are
woken up somewhere else. Updates on iowait stats are then always local.
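
I.e. roughly what the periodic tick path does today, see
account_idle_time():

/* tick-based idle accounting, roughly */
void account_idle_time(cputime_t cputime)
{
	u64 *cpustat = kcpustat_this_cpu->cpustat;
	struct rq *rq = this_rq();

	if (atomic_read(&rq->nr_iowait) > 0)
		cpustat[CPUTIME_IOWAIT] += (__force u64) cputime;
	else
		cpustat[CPUTIME_IDLE] += (__force u64) cputime;
}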

So we did it that way because it was handy to do so. Somehow the ease of
implementation influenced the view.

Now with CONFIG_NO_HZ this artificial view doesn't scale anymore.
If we want to mimic the !CONFIG_NO_HZ behaviour and do it correctly, we must
use remote updates instead of local updates helped by the polling tick.

Thoughts?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2
  2014-04-05 10:08         ` Frederic Weisbecker
@ 2014-04-05 14:56           ` Denys Vlasenko
  2014-04-07 18:17             ` Frederic Weisbecker
  0 siblings, 1 reply; 15+ messages in thread
From: Denys Vlasenko @ 2014-04-05 14:56 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Hidetoshi Seto, Linux Kernel Mailing List,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven,
	Oleg Nesterov, Preeti U Murthy

On Sat, Apr 5, 2014 at 12:08 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
>> > Iowait makes sense, but not per cpu. Ultimately it's a global
>> > stat. Or a per-task one.
>>
>> There are a lot of situations where admins want to know
>> how much, on average, their CPUs are idle because
>> they wait for IO.
>>
>> If you are running, say, a Web cache,
>> you need to know that stat in order to be able to
>> conjecture "looks like I'm IO bound, perhaps caching
>> some data in RAM will speed it up".
>
> But then accounting the iowait time waited until completion on the CPU
> where the task wakes up should work for that.
>
> Doesn't it?

It can easily make the iowait count higher than the idle count,
or even higher than the idle+sys+user+nice count.

IOW, it can show that the system is way more
than 100% IO bound, which doesn't make sense.
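
To make the arithmetic concrete: on a 1-CPU box, if 10 tasks each block
on IO for the same wall-clock second and all flush their waits on
wakeup, that CPU gets charged 10 seconds of iowait, while user+sys+idle
for the same interval can only add up to 1 second.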


> So we save, per task, the time when the task went to sleep. And when it wakes up
> we just flush the pending time since that sleep time to the waking CPU:
> iowait[$CPU] += NOW() - tsk->sleeptime;
>
>> Is such a counter meaningful for the admin?
>
> Well, you get the iowait time accounting.

The admin wants to know "how often are my CPUs idle
because they have nothing to do until IO completes?"

Merely knowing how long tasks waited for IO
doesn't answer that question.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2
  2014-04-05 14:56           ` Denys Vlasenko
@ 2014-04-07 18:17             ` Frederic Weisbecker
  2014-04-09 12:49               ` Denys Vlasenko
  0 siblings, 1 reply; 15+ messages in thread
From: Frederic Weisbecker @ 2014-04-07 18:17 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Hidetoshi Seto, Linux Kernel Mailing List,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven,
	Oleg Nesterov, Preeti U Murthy

On Sat, Apr 05, 2014 at 04:56:54PM +0200, Denys Vlasenko wrote:
> On Sat, Apr 5, 2014 at 12:08 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> >> > Iowait makes sense, but not per cpu. Ultimately it's a global
> >> > stat. Or a per-task one.
> >>
> >> There are a lot of situations where admins want to know
> >> how much, on average, their CPUs are idle because
> >> they wait for IO.
> >>
> >> If you are running, say, a Web cache,
> >> you need to know that stat in order to be able to
> >> conjecture "looks like I'm IO bound, perhaps caching
> >> some data in RAM will speed it up".
> >
> > But then accounting the iowait time waited until completion on the CPU
> > where the task wakes up should work for that.
> >
> > Doesn't it?
> 
> It can easily make the iowait count higher than the idle count,
> or even higher than the idle+sys+user+nice count.
> 
> IOW, it can show that the system is way more
> than 100% IO bound, which doesn't make sense.

I'm talking about a semantic change, not just an implementation change.
Of course if we change iowait to "account _all_ iowait time" while it's still
interpreted with the old semantics that only "account iowait & idle time from the
CPU source", we have a problem.

Well, this breakage is inevitable actually. It's just mitigated
by the fact that CONFIG_NO_HZ never accounted this iowait correctly anyway.

So, since it's already broken, I think we meet the conditions for an ABI change.


> 
> 
> > So we save, per task, the time when the task went to sleep. And when it wakes up
> > we just flush the pending time since that sleep time to the waking CPU:
> > iowait[$CPU] += NOW() - tsk->sleeptime;
> >
> >> Is such a counter meaningful for the admin?
> >
> > Well, you get the iowait time accounting.
> 
> The admin wants to know "how often are my CPUs idle
> because they have nothing to do until IO completes?"
> 
> Merely knowing how long tasks waited for IO
> doesn't answer that question.

If the admin asks this question on a per-CPU basis, it's simply the
wrong question, because a task waiting on an IO to complete is a
sleeping task, and a sleeping task doesn't belong to any CPU.

It's a pure per-task property that can't be mapped onto the lower
CPU level.

OTOH it can make sense to ask how much time we waited on IO
and to account this on the CPU which either initiated the IO or
ran the task on IO completion. Why not: it can give you an overview
of where these IO things happen most.

But the accounting we are doing today does not do that.
Instead, it accounts the time spent on the CPU which initiated
the IO, and only when it is idle. It is a terrible misconception that
is completely subject to scheduler randomness:

The following example displays all the nonsense of that stat:

    CPU 0                     CPU 1

    task A block on IO        ...
    task B runs for 1 min     ...
    task A completes IO

So in the above we've been waiting on IO for 1 minute. But none of that
has been accounted. OTOH if task B had run on CPU 1 (it could have;
really this is about scheduler load balancing internals, hence pure
randomness for the user), the iowait time would have been accounted.

I doubt that users are interested in such random accounting. They want
to know either:

1) how much time was spent waiting on IO by the whole system
2) how much time was spent waiting on IO per task
3) how much time was spent waiting on IO per CPU that initiated
   IOs, or per CPU which ran the tasks completing IOs, in order to have
   an overview of where these mostly happened.

Given my practical example above, the current accounting is unable
to reliably report any of this information, because it is misdesigned
and doesn't account the right thing.

Now, I repeat: why are we accounting iowait that way, then?
Because it was very convenient from an implementation POV on periodic
tick systems. But it's certainly not convenient for the users, given
this accounting randomness.

Now we run dynticks kernels, mostly. And this accounting is just not
possible anymore. Not if we want to keep our kernel scalable.

Hence I think we must change the semantics of our iowait accounting.
This stat has been broken since the very beginning and is a strong candidate
for a semantic ABI change.

You guys can argue as much as you want. I'll maintain my position.

Unless you find a way to maintain the current accounting semantics while
keeping it scalable and correct, good luck!

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2
  2014-04-07 18:17             ` Frederic Weisbecker
@ 2014-04-09 12:49               ` Denys Vlasenko
  2014-04-09 13:42                 ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Denys Vlasenko @ 2014-04-09 12:49 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Hidetoshi Seto, Linux Kernel Mailing List,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven,
	Oleg Nesterov, Preeti U Murthy

On Mon, Apr 7, 2014 at 8:17 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> The following example displays all the nonsense of that stat:
>
>     CPU 0                     CPU 1
>
>     task A block on IO        ...
>     task B runs for 1 min     ...
>     task A completes IO
>
> So in the above we've been waiting on IO for 1 minute. But none of that
> has been accounted.

If there is a task B which can put the CPU to use while task A
waits for IO, then *system performance is not IO bound*.


> OTOH if task B had run on CPU 1 (it could have;
> really this is about scheduler load balancing internals, hence pure
> randomness for the user), the iowait time would have been accounted.

Case A: overall system stats are: 50% busy, 50% idle
Case B: overall system stats are: 50% busy, 50% iowait

You are right, this does not look correct.

Let's step back and look at the situation from a high-level POV.
I believe I have a solution for this problem.

Let's say we have a heavily loaded file server machine where CPUs
are busy only 5% of the time. It makes sense to say that the machine
as a whole is "95% waiting for IO".

Our existing accounting did exactly that for single-CPU machines.

But for, say, a 2-CPU machine it can show 5% busy, 45% iowait, 50% idle
if there is only one task reading files, or 5% busy, 95% iowait
if there is more than one task.

But it's wrong! NONE of the CPUs are "idle" as long as there is even
one task blocked on IO. The machine is still IO-bound, not idling.
In my example, it should not matter whether the machine has one
or 64 CPUs; it should show a 5% busy, 95% iowait overall state
in both cases.

Does the above make sense to you?

My proposal is to count each CPU's time towards iowait
if there are task(s) blocked on IO, *regardless of which
runqueue they are on*. Only if there are none is the time
counted towards idle.
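
In tick terms that would be roughly the following (a sketch;
nr_iowait_global stands for a system-wide counter that doesn't exist
today and would have to be maintained on every IO block/wakeup):

/* per-tick idle accounting under the proposed semantics */
void account_idle_time(cputime_t cputime)
{
	u64 *cpustat = kcpustat_this_cpu->cpustat;

	if (atomic_read(&nr_iowait_global) > 0)	/* any task, any rq */
		cpustat[CPUTIME_IOWAIT] += (__force u64) cputime;
	else
		cpustat[CPUTIME_IDLE] += (__force u64) cputime;
}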


> I doubt that users are interested in such random accounting. They want
> to know either:
>
> 1) how much time was spent waiting on IO by the whole system

Hmm, I think I just said the same thing :)

> 2) how much time was spent waiting on IO per task
> 3) how much time was spent waiting on IO per CPU that initiated
>    IOs, or per CPU which ran the tasks completing IOs, in order to have
>    an overview of where these mostly happened.

Some people may want to know these things, and I am not objecting
to adding whatever counters to help with that.

-- 
vda

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2
  2014-04-09 12:49               ` Denys Vlasenko
@ 2014-04-09 13:42                 ` Peter Zijlstra
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2014-04-09 13:42 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Frederic Weisbecker, Hidetoshi Seto, Linux Kernel Mailing List,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Andrew Morton, Arjan van de Ven, Oleg Nesterov,
	Preeti U Murthy

On Wed, Apr 09, 2014 at 02:49:55PM +0200, Denys Vlasenko wrote:
> > I doubt that users are interested in such random accounting. They want
> > to know either:
> >
> > 1) how much time was spent waiting on IO by the whole system
> 
> Hmm, I think I just said the same thing :)
> 
> > 2) how much time was spent waiting on IO per task

I think we already have enough cruft to measure that. We have delayacct
measuring this and schedstat also seems to measure this.

> > 3) how much time was spent waiting on IO per CPU that initiated
> >    IOs, or per CPU which ran the tasks completing IOs, in order to have
> >    an overview of where these mostly happened.

I'm not entirely sure this makes sense in general. It only makes sense
in the specific case where tasks are affine to a single CPU and that cpu
is not part of any balance domain.

I think this is a specific enough case (exclusive cpusets in general)
that we can ignore it. After all, we do not have per cpuset load numbers
at all, so why bother with this one.

At which point I think Denys makes a fair argument.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 2/2] nohz, procfs: introduce get_cpu_idle/iowait_time_coarse
  2014-03-24  3:05 [PATCH 0/2] nohz: fix idle accounting in NO_HZ kernels Hidetoshi Seto
@ 2014-03-24  3:10 ` Hidetoshi Seto
  0 siblings, 0 replies; 15+ messages in thread
From: Hidetoshi Seto @ 2014-03-24  3:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fernando Luis Vazquez Cao, Tetsuo Handa, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov

This patch is 2/2 of a patch set to fix an issue where idle/iowait
in /proc/stat can go backward. Originally reported by Tetsuo and
Fernando last year, in Mar 2013.

Now it is clear that get_cpu_{idle,iowait}_time_us() is not monotonic.
Using it for /proc/stat will cause trouble for innocent userland
programs that believe these counters are definitely monotonic.

Given that:

  - If an observer (= a reader of /proc/stat) wants to avoid going
    backward, it must update the time stats for the next observer. That
    means the observer determines the delta to idle/iowait and accounts
    it for the sleeping cpu.

  - Usually there will not be many observers (i.e. running top on a
    few consoles or employing a few monitoring programs will be enough
    for the average system admin), but we must plan for the worst case.
    In short, make an effort to reduce lock contention.

  - The resolution required in /proc/stat is tick level, not
    microseconds.

This patch introduces new functions, get_cpu_{idle,iowait}_time_coarse(),
that guarantee monotonic return values, but at a lower resolution.

The tricks are as follows:

  - First, this function obtains the latest time stats and calculates
    a "delta" which indicates the duration from the time the stats were
    last updated to the current time.

  - If the delta is less than the tick length, use the obtained time
    stats as if they were sampled at a tick interrupt that happened
    recently.

  - If the delta is greater than a tick, update the time stats, as if
    emulating a tick for the sleeping observed cpu.

  - As a result, the rate of time stats updates by observers is
    limited to once per tick. In other words, when there is an
    observer monitoring a sleeping cpu, we leave the tick
    emulation job during idle to the observer.

I confirmed that this patch fixes the monotonicity of /proc/stat by
running a reproducer and a stressor for a day. The reproduction rate
differs from system to system, but in my case, running
"git gc" on a kernel source repository alongside the checker works well.

Thanks,
H.Seto

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Reported-by: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/stat.c           |   16 +++------
 include/linux/tick.h     |    4 ++
 kernel/time/tick-sched.c |   79 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 89 insertions(+), 10 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 6f599c6..3dbe282 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -45,32 +45,28 @@ static cputime64_t get_iowait_time(int cpu)
 
 static u64 get_idle_time(int cpu)
 {
-	u64 idle, idle_time = -1ULL;
+	u64 idle = -1ULL;
 
 	if (cpu_online(cpu))
-		idle_time = get_cpu_idle_time_us(cpu, NULL);
+		idle = get_cpu_idle_time_coarse(cpu);
 
-	if (idle_time == -1ULL)
+	if (idle == -1ULL)
 		/* !NO_HZ or cpu offline so we can rely on cpustat.idle */
 		idle = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE];
-	else
-		idle = usecs_to_cputime64(idle_time);
 
 	return idle;
 }
 
 static u64 get_iowait_time(int cpu)
 {
-	u64 iowait, iowait_time = -1ULL;
+	u64 iowait = -1ULL;
 
 	if (cpu_online(cpu))
-		iowait_time = get_cpu_iowait_time_us(cpu, NULL);
+		iowait = get_cpu_iowait_time_coarse(cpu);
 
-	if (iowait_time == -1ULL)
+	if (iowait == -1ULL)
 		/* !NO_HZ or cpu offline so we can rely on cpustat.iowait */
 		iowait = kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
-	else
-		iowait = usecs_to_cputime64(iowait_time);
 
 	return iowait;
 }
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f6f4ac1..3b4674d 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -140,6 +140,8 @@ extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
+extern u64 get_cpu_idle_time_coarse(int cpu);
+extern u64 get_cpu_iowait_time_coarse(int cpu);
 
 # else /* !CONFIG_NO_HZ_COMMON */
 static inline int tick_nohz_tick_stopped(void)
@@ -158,6 +160,8 @@ static inline ktime_t tick_nohz_get_sleep_length(void)
 }
 static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
 static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
+static inline u64 get_cpu_idle_time_coarse(int cpu) { return -1; }
+static inline u64 get_cpu_iowait_time_coarse(int cpu) { return -1; }
 # endif /* !CONFIG_NO_HZ_COMMON */
 
 #ifdef CONFIG_NO_HZ_FULL
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index da37125..178ffdc 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -574,6 +574,85 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
+/**
+ * get_cpu_idle_time_coarse - get coarse idle time of a cpu
+ * @cpu: CPU number to query
+ *
+ * Return the cumulative idle time (since boot) for a given
+ * CPU, in tick resolution (for traditional UI like /proc/stat).
+ *
+ * This function returns -1 if NOHZ is not enabled.
+ */
+u64 get_cpu_idle_time_coarse(int cpu)
+{
+	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+	ktime_t now, idle, delta;
+	unsigned int seq;
+
+	if (!tick_nohz_active)
+		return -1;
+
+	if (!ts->idle_active)
+		return usecs_to_cputime64(ktime_to_us(ts->idle_sleeptime));
+
+	now = ktime_get();
+
+	do {
+		seq = read_seqbegin(&ts->idle_sleeptime_seq);
+		idle = ts->idle_sleeptime;
+		delta = ktime_sub(now, ts->idle_entrytime);
+	} while (read_seqretry(&ts->idle_sleeptime_seq, seq));
+
+	/*
+	 * If delta is less than tick, use current value and just
+	 * ignore the delta. Otherwise perform update.
+	 */
+	if (ktime_compare(delta, ns_to_ktime(TICK_NSEC)) > 0)
+		update_ts_time_stats(cpu, ts, now, &idle, NULL, 0);
+
+	return usecs_to_cputime64(ktime_to_us(idle));
+}
+
+/**
+ * get_cpu_iowait_time_coarse - get coarse iowait time of a cpu
+ * @cpu: CPU number to query
+ *
+ * Return the cumulative iowait time (since boot) for a given
+ * CPU, in tick resolution (for traditional UI like /proc/stat).
+ *
+ * This function returns -1 if NOHZ is not enabled.
+ */
+u64 get_cpu_iowait_time_coarse(int cpu)
+{
+	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+	ktime_t now, iowait, delta;
+	unsigned int seq;
+
+	if (!tick_nohz_active)
+		return -1;
+
+	if (!ts->idle_active)
+		return usecs_to_cputime64(ktime_to_us(ts->iowait_sleeptime));
+
+	now = ktime_get();
+
+	do {
+		seq = read_seqbegin(&ts->idle_sleeptime_seq);
+		iowait = ts->iowait_sleeptime;
+		delta = ktime_sub(now, ts->idle_entrytime);
+	} while (read_seqretry(&ts->idle_sleeptime_seq, seq));
+
+	/*
+	 * If delta is less than tick, use current value and just
+	 * ignore the delta. Otherwise perform update.
+	 */
+	if (ktime_compare(delta, ns_to_ktime(TICK_NSEC)) > 0)
+		update_ts_time_stats(cpu, ts, now, NULL, &iowait, 0);
+
+	return usecs_to_cputime64(ktime_to_us(iowait));
+}
+
 static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
 					 ktime_t now, int cpu)
 {
-- 
1.7.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2014-04-09 13:42 UTC | newest]

Thread overview: 15+ messages
2014-03-31  2:01 [PATCH 0/2 v2] nohz: fix idle accounting in NO_HZ kernels Hidetoshi Seto
2014-03-31  2:08 ` [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2 Hidetoshi Seto
2014-04-02 19:35   ` Denys Vlasenko
2014-04-03  7:02     ` Hidetoshi Seto
2014-04-03  9:51       ` Denys Vlasenko
2014-04-04  2:47         ` Hidetoshi Seto
2014-04-04 16:03     ` Frederic Weisbecker
2014-04-04 17:02       ` Denys Vlasenko
2014-04-05 10:08         ` Frederic Weisbecker
2014-04-05 14:56           ` Denys Vlasenko
2014-04-07 18:17             ` Frederic Weisbecker
2014-04-09 12:49               ` Denys Vlasenko
2014-04-09 13:42                 ` Peter Zijlstra
2014-03-31  2:09 ` [PATCH 2/2] nohz, procfs: introduce get_cpu_idle/iowait_time_coarse Hidetoshi Seto
  -- strict thread matches above, loose matches on Subject: below --
2014-03-24  3:05 [PATCH 0/2] nohz: fix idle accounting in NO_HZ kernels Hidetoshi Seto
2014-03-24  3:10 ` [PATCH 2/2] nohz, procfs: introduce get_cpu_idle/iowait_time_coarse Hidetoshi Seto
