All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] nohz: fix idle accounting in NO_HZ kernels
@ 2014-03-24  3:05 Hidetoshi Seto
  2014-03-24  3:10 ` [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats Hidetoshi Seto
  2014-03-24  3:10 ` [PATCH 2/2] nohz, procfs: introduce get_cpu_idle/iowait_time_coarse Hidetoshi Seto
  0 siblings, 2 replies; 6+ messages in thread
From: Hidetoshi Seto @ 2014-03-24  3:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fernando Luis Vazquez Cao, Tetsuo Handa, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov

Hi all,

This patch set (based on v3.14-rc7) will fix an issue that
idle/iowait of /proc/stat can go backward. Originally reported
by Tetsuo and Fernando at last year, Mar 2013.

Reviews are welcome.

Thanks,
H.Seto


Hidetoshi Seto (2):
      nohz: use seqlock to avoid race on idle time stats
      nohz, procfs: introduce get_cpu_idle/iowait_time_coarse

 fs/proc/stat.c           |   16 ++---
 include/linux/tick.h     |    5 ++
 kernel/time/tick-sched.c |  171 ++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 161 insertions(+), 31 deletions(-)


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats
  2014-03-24  3:05 [PATCH 0/2] nohz: fix idle accounting in NO_HZ kernels Hidetoshi Seto
@ 2014-03-24  3:10 ` Hidetoshi Seto
  2014-03-24  7:45   ` Preeti Murthy
  2014-03-24  3:10 ` [PATCH 2/2] nohz, procfs: introduce get_cpu_idle/iowait_time_coarse Hidetoshi Seto
  1 sibling, 1 reply; 6+ messages in thread
From: Hidetoshi Seto @ 2014-03-24  3:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fernando Luis Vazquez Cao, Tetsuo Handa, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov

This patch is 1/2 of patch set to fix an issue that idle/iowait
of /proc/stat can go backward. Originally reported by Tetsuo and
Fernando at last year, Mar 2013.

[BACKGROUNDS]: idle accounting on NO_HZ

If NO_HZ is enabled, cpu stops tick interrupt for itself before
go sleep to be idle. It means that time stats of the sleeping cpu
will not be updated in tick interrupt. Instead when cpu wakes up,
it updates time stats by calculating idle duration time from
timestamp at entering idle and current time as exiting idle.

OTOH, it can happen that there are some kind of observer who want
to know how long the sleeping cpu have been idle. Imagine that
userland runs top command or application who read /proc/stats.
Therefore kernel provides get_cpu_{idle,iowait}_time_us() function
for user to obtain current idle time stats of such sleeping cpu.
This function reads time stats and timestamp at entering idle,
and then return current idle time by adding duration calculated
from timestamp and current time.

There are 2 problems:

[PROBLEM 1]: there is no exclusive control.

It is easy to understand that there are 2 different cpu - an
observing cpu where running a program observing idle cpu's stat
and an observed cpu where performing idle. It means race can
happen if observed cpu wakes up while it is observed. Observer
can accidentally add calculated duration time (say delta) to
time stats which is just updated by woken cpu. Soon correct
idle time is returned in next turn, so it will result in
backward time. Therefore readers must be excluded by writer.

Then time stats are updated by woken cpu so there are only one
writer, right? No, unfortunately no. I'm not sure why are they,
but in some reason the function get_cpu_{idle,iowait}_time_us()
has ability to update sleeping cpu's time stats from observing
cpu. From grep of today's kernel tree, this feature is used by
cpufreq module. Calling this function with this feature in
periodically manner works like emulating tick for sleeping cpu.

In summary there are races between multiple writer and multiple
reader but no exclusive control on this idle time stats dataset.

[PROBLEM 2]: broken iowait accounting.

As historical nature, cpu's idle time was accounted as either
idle or iowait depending on the presence of tasks blocked by
I/O. No one complain about it for a long time. However:

  > Still trying to wrap my head around it, but conceptually
  > get_cpu_iowait_time_us() doesn't make any kind of sense.
  > iowait isn't per cpu since effectively tasks that aren't
  > running aren't assigned a cpu (as Oleg already pointed out).
  -- Peter Zijlstra

Now some kernel folks realized that accounting iowait as per-cpu
does not make sense in SMP world. When we were in traditional
UP era, cpu is considered to be waiting I/O if it is idle while
nr_iowait > 0. But in these days with SMP systems, tasks easily
migrate from a cpu where they issued an I/O to another cpu where
they are queued after I/O completion.

Back to NO_HZ mechanism. Totally terrible thing here is that
observer need to handle duration "delta" without knowing that
nr_iowait of sleeping cpu can be changed easily by migration
even if cpu is sleeping. So it can happen that observer temporary
assign delta as iowait at 1st place, then blocked tasks are
migrated all, and at last in 2nd turn observer assign delta as
idle. You will see that delta have moved from iowait to idle.

In summary iowait accounting has fundamental problem and needs
to be precisely reworked. It implies that some user interfaces
might be replaced completely.

[WHAT THIS PATCH PROPOSED]: fix problem 1 first.

To fix problem 1, this patch adds seqlock for NO_HZ idle
accounting to avoid possible races between multiple reader/writer.

And to cope with problem 2, I add comments to note that there
is known issues.

Thanks,
H.Seto

References:
  First report from Fernando:
    [RFC] iowait/idle time accounting hiccups in NOHZ kernels
    https://lkml.org/lkml/2013/3/18/962
  Steps to reproduce guided by Tetsuo:
    https://lkml.org/lkml/2013/4/1/201

  1st patchset from Frederic:
    [PATCH 0/4] nohz: Fix racy sleeptime stats
    https://lkml.org/lkml/2013/8/8/638
    [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats
    https://lkml.org/lkml/2013/8/16/274

  2nd patchset from Frederic:
    [RFC PATCH 0/5] nohz: Fix racy sleeptime stats v2
    https://lkml.org/lkml/2013/10/19/86

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Reported-by: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/tick.h     |    1 +
 kernel/time/tick-sched.c |   92 +++++++++++++++++++++++++++++++++++----------
 2 files changed, 72 insertions(+), 21 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index b84773c..f6f4ac1 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -62,6 +62,7 @@ struct tick_sched {
 	unsigned long			idle_calls;
 	unsigned long			idle_sleeps;
 	int				idle_active;
+	seqlock_t			idle_sleeptime_seq;
 	ktime_t				idle_entrytime;
 	ktime_t				idle_waketime;
 	ktime_t				idle_exittime;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9f8af69..da37125 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -407,11 +407,23 @@ static void tick_nohz_update_jiffies(ktime_t now)
  * Updates the per cpu time idle statistics counters
  */
 static void
-update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
+update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now,
+		     ktime_t *idle, ktime_t *iowait, int last)
 {
 	ktime_t delta;
 
-	if (ts->idle_active) {
+	write_seqlock(&ts->idle_sleeptime_seq);
+
+	/* now must be newer/greater than entrytime */
+	if (ts->idle_active && (ktime_compare(now, ts->idle_entrytime) > 0)) {
+		/*
+		 * FIXME: Since nr_iowait of sleeping cpu can change from/to
+		 * 0 while in this span from idle_entrytime to now, we have no
+		 * idea how to divide the delta to idle_sleeptime and/or
+		 * iowait_sleeptime.
+		 * So as traditional non-proper estimation for workaround,
+		 * take the delta depending on the current value of nr_iowait.
+		 */
 		delta = ktime_sub(now, ts->idle_entrytime);
 		if (nr_iowait_cpu(cpu) > 0)
 			ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
@@ -420,16 +432,19 @@ update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_upda
 		ts->idle_entrytime = now;
 	}
 
-	if (last_update_time)
-		*last_update_time = ktime_to_us(now);
+	if (idle)
+		*idle = ts->idle_sleeptime;
+	if (iowait)
+		*iowait = ts->iowait_sleeptime;
+	if (last)
+		ts->idle_active = 0;
 
+	write_sequnlock(&ts->idle_sleeptime_seq);
 }
 
 static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 {
-	update_ts_time_stats(smp_processor_id(), ts, now, NULL);
-	ts->idle_active = 0;
-
+	update_ts_time_stats(smp_processor_id(), ts, now, NULL, NULL, 1);
 	sched_clock_idle_wakeup_event(0);
 }
 
@@ -437,8 +452,11 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
 {
 	ktime_t now = ktime_get();
 
+	write_seqlock(&ts->idle_sleeptime_seq);
 	ts->idle_entrytime = now;
 	ts->idle_active = 1;
+	write_sequnlock(&ts->idle_sleeptime_seq);
+
 	sched_clock_idle_sleep_event();
 	return now;
 }
@@ -455,6 +473,10 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
  * This time is measured via accounting rather than sampling,
  * and is as accurate as ktime_get() is.
  *
+ * Known bug: Return value is not monotonic in case if @last_update_time
+ * is NULL and therefore update is not performed. Because it includes
+ * cputime which is not determined idle or iowait so not accounted yet.
+ *
  * This function returns -1 if NOHZ is not enabled.
  */
 u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
@@ -467,16 +489,28 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 
 	now = ktime_get();
 	if (last_update_time) {
-		update_ts_time_stats(cpu, ts, now, last_update_time);
-		idle = ts->idle_sleeptime;
+		update_ts_time_stats(cpu, ts, now, &idle, NULL, 0);
+		*last_update_time = ktime_to_us(now);
 	} else {
-		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
-			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+		unsigned int seq;
 
-			idle = ktime_add(ts->idle_sleeptime, delta);
-		} else {
+		do {
+			seq = read_seqbegin(&ts->idle_sleeptime_seq);
 			idle = ts->idle_sleeptime;
-		}
+			/*
+			 * FIXME: At this point, delta is determined neither
+			 * idle nor iowait. This function temporarily treat
+			 * this delta depending on the value of nr_iowait
+			 * when this function reaches here. It will result
+			 * in non-monotonic return value. So user of this
+			 * function must not expect monotonicity here.
+			 */
+			if (ts->idle_active && !nr_iowait_cpu(cpu)
+			    && (ktime_compare(now, ts->idle_entrytime) > 0)) {
+				ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+				idle = ktime_add(ts->idle_sleeptime, delta);
+			}
+		} while (read_seqretry(&ts->idle_sleeptime_seq, seq));
 	}
 
 	return ktime_to_us(idle);
@@ -496,6 +530,10 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
  * This time is measured via accounting rather than sampling,
  * and is as accurate as ktime_get() is.
  *
+ * Known bug: Return value is not monotonic in case if @last_update_time
+ * is NULL and therefore update is not performed. Because it includes
+ * cputime which is not determined idle or iowait so not accounted yet.
+ *
  * This function returns -1 if NOHZ is not enabled.
  */
 u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
@@ -508,16 +546,28 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 
 	now = ktime_get();
 	if (last_update_time) {
-		update_ts_time_stats(cpu, ts, now, last_update_time);
-		iowait = ts->iowait_sleeptime;
+		update_ts_time_stats(cpu, ts, now, NULL, &iowait, 0);
+		*last_update_time = ktime_to_us(now);
 	} else {
-		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
-			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+		unsigned int seq;
 
-			iowait = ktime_add(ts->iowait_sleeptime, delta);
-		} else {
+		do {
+			seq = read_seqbegin(&ts->idle_sleeptime_seq);
 			iowait = ts->iowait_sleeptime;
-		}
+			/*
+			 * FIXME: At this point, delta is determined neither
+			 * idle nor iowait. This function temporarily treat
+			 * this delta depending on the value of nr_iowait
+			 * when this function reaches here. It will result
+			 * in non-monotonic return value. So user of this
+			 * function must not expect monotonicity here.
+			 */
+			if (ts->idle_active && nr_iowait_cpu(cpu) > 0
+			    && (ktime_compare(now, ts->idle_entrytime) > 0)) {
+				ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+				iowait = ktime_add(ts->iowait_sleeptime, delta);
+			}
+		} while (read_seqretry(&ts->idle_sleeptime_seq, seq));
 	}
 
 	return ktime_to_us(iowait);
-- 
1.7.1



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/2] nohz, procfs: introduce get_cpu_idle/iowait_time_coarse
  2014-03-24  3:05 [PATCH 0/2] nohz: fix idle accounting in NO_HZ kernels Hidetoshi Seto
  2014-03-24  3:10 ` [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats Hidetoshi Seto
@ 2014-03-24  3:10 ` Hidetoshi Seto
  1 sibling, 0 replies; 6+ messages in thread
From: Hidetoshi Seto @ 2014-03-24  3:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fernando Luis Vazquez Cao, Tetsuo Handa, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov

This patch is 2/2 of patch set to fix an issue that idle/iowait
of /proc/stat can go backward. Originally reported by Tetsuo and
Fernando at last year, Mar 2013.

Now it is clear that get_cpu_{idle,iowait}_time_us() is not monotonic.
Using this for /proc/stats will cause troubles in innocent userland
who believe these counters are definitely monotonic.

Given that:

  - If observer (= reader of /proc/stats) want to avoid backwarding,
    it must update time stats for next observer. It means observer
    determine delta to idle/iowait and account it for sleeping cpu.

  - Usually the number of observers will not so many (i.e. running
    top on few console or hiring few monitor program will be enough
    for average system admin), but we must predict the worst case.
    In short, make effort to reduce lock contention.

  - The resolution required in /proc/stats is tick-level, not us.

This patch introduces new function get_cpu_idle/iowait_time_coarse()
that guarantees monotonic return value but resolution is low.

Tricks are here:

  - At first this function obtain latest time stats and calculate
    "delta" which indicates duration from time when the time stats
    is updated last time to current time.

  - If delta is less than tick length, use obtained time stats
    as if it was sampled in tick interrupt recently happened.

  - If delta is greater than tick, perform update of time stats
    as if it emulates tick for sleeping observed cpu.

  - As the result the rate of updating time stats by observer is
    limited to once per tick. In other words, in case if there is
    observer who are monitoring sleeping cpu, we leave tick
    emulation job during idle to the observer.

I confirmed this patch fix the monotonicity of /proc/stats, by
running reproducer and stressor for a day. The rate of reproduce
is different for different system, but in my case, running
"git gc" on kernel source repository aside of checker works fine.

Thanks,
H.Seto

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Reported-by: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/stat.c           |   16 +++------
 include/linux/tick.h     |    4 ++
 kernel/time/tick-sched.c |   79 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 89 insertions(+), 10 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 6f599c6..3dbe282 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -45,32 +45,28 @@ static cputime64_t get_iowait_time(int cpu)
 
 static u64 get_idle_time(int cpu)
 {
-	u64 idle, idle_time = -1ULL;
+	u64 idle = -1ULL;
 
 	if (cpu_online(cpu))
-		idle_time = get_cpu_idle_time_us(cpu, NULL);
+		idle = get_cpu_idle_time_coarse(cpu);
 
-	if (idle_time == -1ULL)
+	if (idle == -1ULL)
 		/* !NO_HZ or cpu offline so we can rely on cpustat.idle */
 		idle = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE];
-	else
-		idle = usecs_to_cputime64(idle_time);
 
 	return idle;
 }
 
 static u64 get_iowait_time(int cpu)
 {
-	u64 iowait, iowait_time = -1ULL;
+	u64 iowait = -1ULL;
 
 	if (cpu_online(cpu))
-		iowait_time = get_cpu_iowait_time_us(cpu, NULL);
+		iowait = get_cpu_iowait_time_coarse(cpu);
 
-	if (iowait_time == -1ULL)
+	if (iowait == -1ULL)
 		/* !NO_HZ or cpu offline so we can rely on cpustat.iowait */
 		iowait = kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
-	else
-		iowait = usecs_to_cputime64(iowait_time);
 
 	return iowait;
 }
diff --git a/include/linux/tick.h b/include/linux/tick.h
index f6f4ac1..3b4674d 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -140,6 +140,8 @@ extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
+extern u64 get_cpu_idle_time_coarse(int cpu);
+extern u64 get_cpu_iowait_time_coarse(int cpu);
 
 # else /* !CONFIG_NO_HZ_COMMON */
 static inline int tick_nohz_tick_stopped(void)
@@ -158,6 +160,8 @@ static inline ktime_t tick_nohz_get_sleep_length(void)
 }
 static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
 static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
+static inline u64 get_cpu_idle_time_coarse(int cpu) { return -1; }
+static inline u64 get_cpu_iowait_time_coarse(int cpu) { return -1; }
 # endif /* !CONFIG_NO_HZ_COMMON */
 
 #ifdef CONFIG_NO_HZ_FULL
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index da37125..178ffdc 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -574,6 +574,85 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
+/**
+ * get_cpu_idle_time_coarse - get coarse idle time of a cpu
+ * @cpu: CPU number to query
+ *
+ * Return the cummulative idle time (since boot) for a given
+ * CPU, in tick resolution (for traditional UI like /proc/stat).
+ *
+ * This function returns -1 if NOHZ is not enabled.
+ */
+u64 get_cpu_idle_time_coarse(int cpu)
+{
+	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+	ktime_t now, idle, delta;
+	unsigned int seq;
+
+	if (!tick_nohz_active)
+		return -1;
+
+	if (!ts->idle_active)
+		return usecs_to_cputime64(ktime_to_us(ts->idle_sleeptime));
+
+	now = ktime_get();
+
+	do {
+		seq = read_seqbegin(&ts->idle_sleeptime_seq);
+		idle = ts->idle_sleeptime;
+		delta = ktime_sub(now, ts->idle_entrytime);
+	} while (read_seqretry(&ts->idle_sleeptime_seq, seq));
+
+	/*
+	 * If delta is less than tick, use current value and just
+	 * ignore the delta. Otherwise perform update.
+	 */
+	if (ktime_compare(delta, ns_to_ktime(TICK_NSEC)) > 0)
+		update_ts_time_stats(cpu, ts, now, &idle, NULL, 0);
+
+	return usecs_to_cputime64(ktime_to_us(idle));
+
+}
+
+/**
+ * get_cpu_iowait_time_coarse - get coarse iowait time of a cpu
+ * @cpu: CPU number to query
+ *
+ * Return the cummulative iowait time (since boot) for a given
+ * CPU, in tick resolution (for traditional UI like /proc/stat).
+ *
+ * This function returns -1 if NOHZ is not enabled.
+ */
+u64 get_cpu_iowait_time_coarse(int cpu)
+{
+	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+	ktime_t now, iowait, delta;
+	unsigned int seq;
+
+	if (!tick_nohz_active)
+		return -1;
+
+	if (!ts->idle_active)
+		return usecs_to_cputime64(ktime_to_us(ts->iowait_sleeptime));
+
+	now = ktime_get();
+
+	do {
+		seq = read_seqbegin(&ts->idle_sleeptime_seq);
+		iowait = ts->iowait_sleeptime;
+		delta = ktime_sub(now, ts->idle_entrytime);
+	} while (read_seqretry(&ts->idle_sleeptime_seq, seq));
+
+	/*
+	 * If delta is less than tick, use current value and just
+	 * ignore the delta. Otherwise perform update.
+	 */
+	if (ktime_compare(delta, ns_to_ktime(TICK_NSEC)) > 0)
+		update_ts_time_stats(cpu, ts, now, NULL, &iowait, 0);
+
+	return usecs_to_cputime64(ktime_to_us(iowait));
+}
+
 static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
 					 ktime_t now, int cpu)
 {
-- 
1.7.1



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats
  2014-03-24  3:10 ` [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats Hidetoshi Seto
@ 2014-03-24  7:45   ` Preeti Murthy
  2014-03-24  8:21     ` Hidetoshi Seto
  0 siblings, 1 reply; 6+ messages in thread
From: Preeti Murthy @ 2014-03-24  7:45 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: linux-kernel, Fernando Luis Vazquez Cao, Tetsuo Handa,
	Frederic Weisbecker, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Arjan van de Ven, Oleg Nesterov,
	preeti

Hi Hidetoshi,

The patch looks good to me except the comments around the monotonicity
of the return value of the idle stats observer. I am unable to relate them
to the dependency on nr_iowait_cpu.

I see that when the reader queries for the idle stats and calls
get_cpu_idle_time_us(), the nr_iowait_cpu might be 0. When he later
queries get_cpu_iowait_time_us(), it may be >0 . Hence we will be
accounting for the idle time in both idle time and iowait time. This
is definitely a problem. But I do not understand what this has got to
do with the monotonicity of the time
returned. This is just for my understanding.

Thanks!

Regards
Preeti U Murthy
On 3/24/14, Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> wrote:

<snip>
> + * Known bug: Return value is not monotonic in case if @last_update_time
> + * is NULL and therefore update is not performed. Because it includes
> + * cputime which is not determined idle or iowait so not accounted yet.
> + *
>   * This function returns -1 if NOHZ is not enabled.
>   */
>  u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
> @@ -467,16 +489,28 @@ u64 get_cpu_idle_time_us(int cpu, u64
> *last_update_time)
>
>  	now = ktime_get();
>  	if (last_update_time) {
> -		update_ts_time_stats(cpu, ts, now, last_update_time);
> -		idle = ts->idle_sleeptime;
> +		update_ts_time_stats(cpu, ts, now, &idle, NULL, 0);
> +		*last_update_time = ktime_to_us(now);
>  	} else {
> -		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
> -			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> +		unsigned int seq;
>
> -			idle = ktime_add(ts->idle_sleeptime, delta);
> -		} else {
> +		do {
> +			seq = read_seqbegin(&ts->idle_sleeptime_seq);
>  			idle = ts->idle_sleeptime;
> -		}
> +			/*
> +			 * FIXME: At this point, delta is determined neither
> +			 * idle nor iowait. This function temporarily treat
> +			 * this delta depending on the value of nr_iowait
> +			 * when this function reaches here. It will result
> +			 * in non-monotonic return value. So user of this
> +			 * function must not expect monotonicity here.
> +			 */
> +			if (ts->idle_active && !nr_iowait_cpu(cpu)
> +			    && (ktime_compare(now, ts->idle_entrytime) > 0)) {
> +				ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> +				idle = ktime_add(ts->idle_sleeptime, delta);
> +			}
> +		} while (read_seqretry(&ts->idle_sleeptime_seq, seq));
>  	}
>
>  	return ktime_to_us(idle);
> @@ -496,6 +530,10 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
>   * This time is measured via accounting rather than sampling,
>   * and is as accurate as ktime_get() is.
>   *
> + * Known bug: Return value is not monotonic in case if @last_update_time
> + * is NULL and therefore update is not performed. Because it includes
> + * cputime which is not determined idle or iowait so not accounted yet.
> + *
>   * This function returns -1 if NOHZ is not enabled.
>   */
>  u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
> @@ -508,16 +546,28 @@ u64 get_cpu_iowait_time_us(int cpu, u64
> *last_update_time)
>
>  	now = ktime_get();
>  	if (last_update_time) {
> -		update_ts_time_stats(cpu, ts, now, last_update_time);
> -		iowait = ts->iowait_sleeptime;
> +		update_ts_time_stats(cpu, ts, now, NULL, &iowait, 0);
> +		*last_update_time = ktime_to_us(now);
>  	} else {
> -		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> -			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> +		unsigned int seq;
>
> -			iowait = ktime_add(ts->iowait_sleeptime, delta);
> -		} else {
> +		do {
> +			seq = read_seqbegin(&ts->idle_sleeptime_seq);
>  			iowait = ts->iowait_sleeptime;
> -		}
> +			/*
> +			 * FIXME: At this point, delta is determined neither
> +			 * idle nor iowait. This function temporarily treat
> +			 * this delta depending on the value of nr_iowait
> +			 * when this function reaches here. It will result
> +			 * in non-monotonic return value. So user of this
> +			 * function must not expect monotonicity here.
> +			 */
> +			if (ts->idle_active && nr_iowait_cpu(cpu) > 0
> +			    && (ktime_compare(now, ts->idle_entrytime) > 0)) {
> +				ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> +				iowait = ktime_add(ts->iowait_sleeptime, delta);
> +			}
> +		} while (read_seqretry(&ts->idle_sleeptime_seq, seq));
>  	}
>
>  	return ktime_to_us(iowait);
> --
> 1.7.1
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats
  2014-03-24  7:45   ` Preeti Murthy
@ 2014-03-24  8:21     ` Hidetoshi Seto
  2014-03-25  5:27       ` Preeti U Murthy
  0 siblings, 1 reply; 6+ messages in thread
From: Hidetoshi Seto @ 2014-03-24  8:21 UTC (permalink / raw)
  To: Preeti Murthy
  Cc: linux-kernel, Fernando Luis Vazquez Cao, Tetsuo Handa,
	Frederic Weisbecker, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Arjan van de Ven, Oleg Nesterov,
	preeti

(2014/03/24 16:45), Preeti Murthy wrote:
> Hi Hidetoshi,
> 
> The patch looks good to me except the comments around the monotonicity
> of the return value of the idle stats observer. I am unable to relate them
> to the dependency on nr_iowait_cpu.
> 
> I see that when the reader queries for the idle stats and calls
> get_cpu_idle_time_us(), the nr_iowait_cpu might be 0. When he later
> queries get_cpu_iowait_time_us(), it may be >0 . Hence we will be
> accounting for the idle time in both idle time and iowait time. This
> is definitely a problem. But I do not understand what this has got to
> do with the monotonicity of the time
> returned. This is just for my understanding.

Thank you for your comment!

Ah yes, I think I could write better comments around here to clarify
the monotonicity problem. (It will be happy if someone can give me
such better sentence for here :-D)

One important point is that readers do not update idle stats when they
use these function. i.e.

given:
   idle stats:  idle=1000, iowait=100
   stamp at idle entry: entry=50
   nr tasks waiting io: nr_iowait=1

1st reader:
   query @ now=60
     idle=1000
     iowait=110 (=100+(60-50))

(here nr_iowait changed to 0)

2nd reader:
   query @ now=70
     idle=1020 (=1000+(70-50))
     iowait=100

So you will see iowait is decreased from 110 to 100.

I hope this short story helps you.

Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats
  2014-03-24  8:21     ` Hidetoshi Seto
@ 2014-03-25  5:27       ` Preeti U Murthy
  0 siblings, 0 replies; 6+ messages in thread
From: Preeti U Murthy @ 2014-03-25  5:27 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: Preeti Murthy, linux-kernel, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Frederic Weisbecker, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Arjan van de Ven, Oleg Nesterov

On 03/24/2014 01:51 PM, Hidetoshi Seto wrote:
> (2014/03/24 16:45), Preeti Murthy wrote:
>> Hi Hidetoshi,
>>
>> The patch looks good to me except the comments around the monotonicity
>> of the return value of the idle stats observer. I am unable to relate them
>> to the dependency on nr_iowait_cpu.
>>
>> I see that when the reader queries for the idle stats and calls
>> get_cpu_idle_time_us(), the nr_iowait_cpu might be 0. When he later
>> queries get_cpu_iowait_time_us(), it may be >0 . Hence we will be
>> accounting for the idle time in both idle time and iowait time. This
>> is definitely a problem. But I do not understand what this has got to
>> do with the monotonicity of the time
>> returned. This is just for my understanding.
> 
> Thank you for your comment!
> 
> Ah yes, I think I could write better comments around here to clarify
> the monotonicity problem. (It will be happy if someone can give me
> such better sentence for here :-D)
> 
> One important point is that readers do not update idle stats when they
> use these function. i.e.
> 
> given:
>    idle stats:  idle=1000, iowait=100
>    stamp at idle entry: entry=50
>    nr tasks waiting io: nr_iowait=1
> 
> 1st reader:
>    query @ now=60
>      idle=1000
>      iowait=110 (=100+(60-50))
> 
> (here nr_iowait changed to 0)
> 
> 2nd reader:
>    query @ now=70
>      idle=1020 (=1000+(70-50))
>      iowait=100
> 
> So you will see iowait is decreased from 110 to 100.

Ah ok! I get it. Please do add my Reviewed-by to it.

> 
> I hope this short story helps you.

Thanks!

Regards
Preeti U Murthy
> 
> Thanks,
> H.Seto
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2014-03-25  5:31 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-24  3:05 [PATCH 0/2] nohz: fix idle accounting in NO_HZ kernels Hidetoshi Seto
2014-03-24  3:10 ` [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats Hidetoshi Seto
2014-03-24  7:45   ` Preeti Murthy
2014-03-24  8:21     ` Hidetoshi Seto
2014-03-25  5:27       ` Preeti U Murthy
2014-03-24  3:10 ` [PATCH 2/2] nohz, procfs: introduce get_cpu_idle/iowait_time_coarse Hidetoshi Seto

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.