* [PATCH 1/3] nohz: Only update sleeptime stats locally
@ 2014-04-23 19:00 Denys Vlasenko
  2014-04-23 19:00 ` [PATCH 2/3] nohz: Synchronize sleep time stats with memory barriers Denys Vlasenko
  2014-04-23 19:00 ` [PATCH 3/3] nohz: Fix iowait overcounting if iowait task migrates Denys Vlasenko
  0 siblings, 2 replies; 10+ messages in thread
From: Denys Vlasenko @ 2014-04-23 19:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Frederic Weisbecker, Denys Vlasenko, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven,
	Oleg Nesterov

From: Frederic Weisbecker <fweisbec@gmail.com>

The idle and io sleeptime stats can be updated concurrently from callers
of get_cpu_idle_time_us(), get_cpu_iowait_time_us() and
tick_nohz_stop_idle().

Updaters can easily race and corrupt the internal data coherency,
for example when a governor calls a get_cpu_*_time_us() API while the
target CPU exits idle at the same time, because there is no locking
whatsoever to protect against this concurrency.

To fix this, let's only update the sleeptime stats locally when the CPU
exits from idle. This is the only place where a real update is truly
needed. The callers of get_cpu_*_time_us() can simply add up the pending
sleep time delta to the last sleeptime snapshot in order to get a coherent
result. There is no need for them to also update the stats.

Rebased to Linus' tree by Denys Vlasenko.

Reported-by: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 kernel/time/tick-sched.c | 63 ++++++++++++++++--------------------------------
 1 file changed, 21 insertions(+), 42 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9f8af69..73ced0c4 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -403,31 +403,16 @@ static void tick_nohz_update_jiffies(ktime_t now)
 	touch_softlockup_watchdog();
 }
 
-/*
- * Updates the per cpu time idle statistics counters
- */
-static void
-update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
+static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 {
 	ktime_t delta;
 
-	if (ts->idle_active) {
-		delta = ktime_sub(now, ts->idle_entrytime);
-		if (nr_iowait_cpu(cpu) > 0)
-			ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
-		else
-			ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
-		ts->idle_entrytime = now;
-	}
-
-	if (last_update_time)
-		*last_update_time = ktime_to_us(now);
-
-}
-
-static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
-{
-	update_ts_time_stats(smp_processor_id(), ts, now, NULL);
+	/* Updates the per cpu time idle statistics counters */
+	delta = ktime_sub(now, ts->idle_entrytime);
+	if (nr_iowait_cpu(smp_processor_id()) > 0)
+		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
+	else
+		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
 	ts->idle_active = 0;
 
 	sched_clock_idle_wakeup_event(0);
@@ -466,17 +451,14 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 		return -1;
 
 	now = ktime_get();
-	if (last_update_time) {
-		update_ts_time_stats(cpu, ts, now, last_update_time);
-		idle = ts->idle_sleeptime;
-	} else {
-		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
-			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+	if (last_update_time)
+		*last_update_time = ktime_to_us(now);
 
-			idle = ktime_add(ts->idle_sleeptime, delta);
-		} else {
-			idle = ts->idle_sleeptime;
-		}
+	if (ts->idle_active && !nr_iowait_cpu(cpu)) {
+		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+		idle = ktime_add(ts->idle_sleeptime, delta);
+	} else {
+		idle = ts->idle_sleeptime;
 	}
 
 	return ktime_to_us(idle);
@@ -507,17 +489,14 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 		return -1;
 
 	now = ktime_get();
-	if (last_update_time) {
-		update_ts_time_stats(cpu, ts, now, last_update_time);
-		iowait = ts->iowait_sleeptime;
-	} else {
-		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
-			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+	if (last_update_time)
+		*last_update_time = ktime_to_us(now);
 
-			iowait = ktime_add(ts->iowait_sleeptime, delta);
-		} else {
-			iowait = ts->iowait_sleeptime;
-		}
+	if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
+		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+		iowait = ktime_add(ts->iowait_sleeptime, delta);
+	} else {
+		iowait = ts->iowait_sleeptime;
 	}
 
 	return ktime_to_us(iowait);
-- 
1.8.1.4



* [PATCH 2/3] nohz: Synchronize sleep time stats with memory barriers
  2014-04-23 19:00 [PATCH 1/3] nohz: Only update sleeptime stats locally Denys Vlasenko
@ 2014-04-23 19:00 ` Denys Vlasenko
  2014-04-24  0:16   ` Frederic Weisbecker
  2014-04-23 19:00 ` [PATCH 3/3] nohz: Fix iowait overcounting if iowait task migrates Denys Vlasenko
  1 sibling, 1 reply; 10+ messages in thread
From: Denys Vlasenko @ 2014-04-23 19:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Frederic Weisbecker, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven,
	Oleg Nesterov

When a call site uses get_cpu_*_time_us() to read a sleeptime
stat, it derives the total sleeptime by adding the pending time
to the last sleeptime snapshot if the target CPU is idle.

But this only works if idle_sleeptime, idle_entrytime and idle_active are
read and updated in a disciplined order.

This patch changes updaters to modify idle_entrytime,
{idle,iowait}_sleeptime, and idle_active exactly in this order,
with write barriers on SMP to ensure other CPUs see them in this order too.
Readers are changed to read them in the opposite order, with read barriers.
When readers detect a race by seeing a cleared idle_entrytime,
they retry the reads.

The "iowait-ness" of every idle period is decided at the moment it starts:
if this CPU's run-queue had tasks waiting on I/O, then this idle
period's duration will be added to iowait_sleeptime.
This, along with proper SMP synchronization, fixes the bug where iowait
counts could go backwards.
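
To illustrate, the ordering described above boils down to roughly the
following (a condensed sketch of the hunks below, with declarations
omitted; "count" and "start" are local names):

	/* writer, on idle exit */
	delta = ktime_sub(now, ts->idle_entrytime);
	ts->idle_entrytime.tv64 = 0;	/* marks "update in progress" */
	smp_wmb();
	ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
	smp_wmb();
	ts->idle_active = 0;

	/* reader, possibly on another CPU */
 again:
	active = ACCESS_ONCE(ts->idle_active);
	smp_rmb();
	count = ACCESS_ONCE(ts->idle_sleeptime);
	if (active) {
		smp_rmb();
		start = ACCESS_ONCE(ts->idle_entrytime);
		if (start.tv64 == 0)
			goto again;	/* writer is mid-update, retry */
		count = ktime_add(count, ktime_sub(now, start));
	}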

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 kernel/time/tick-sched.c | 83 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 66 insertions(+), 17 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 73ced0c4..a99770e 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -409,10 +409,15 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 
 	/* Updates the per cpu time idle statistics counters */
 	delta = ktime_sub(now, ts->idle_entrytime);
-	if (nr_iowait_cpu(smp_processor_id()) > 0)
+	ts->idle_entrytime.tv64 = 0;
+	smp_wmb();
+
+	if (ts->idle_active == 2)
 		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
 	else
 		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+
+	smp_wmb();
 	ts->idle_active = 0;
 
 	sched_clock_idle_wakeup_event(0);
@@ -423,7 +428,8 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
 	ktime_t now = ktime_get();
 
 	ts->idle_entrytime = now;
-	ts->idle_active = 1;
+	smp_wmb();
+	ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;
 	sched_clock_idle_sleep_event();
 	return now;
 }
@@ -444,25 +450,46 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
  */
 u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 {
-	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
-	ktime_t now, idle;
+	struct tick_sched *ts;
+	ktime_t now, count;
+	int active;
 
 	if (!tick_nohz_active)
 		return -1;
 
+	ts = &per_cpu(tick_cpu_sched, cpu);
+
 	now = ktime_get();
 	if (last_update_time)
 		*last_update_time = ktime_to_us(now);
 
-	if (ts->idle_active && !nr_iowait_cpu(cpu)) {
-		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
-		idle = ktime_add(ts->idle_sleeptime, delta);
+ again:
+	active = ACCESS_ONCE(ts->idle_active);
+	smp_rmb();
+	count = ACCESS_ONCE(ts->idle_sleeptime);
+	if (active == 1) {
+		ktime_t delta, start;
+
+		smp_rmb();
+		start = ACCESS_ONCE(ts->idle_entrytime);
+		if (start.tv64 == 0) {
+			/*
+			 * Other CPU is updating the count.
+			 * We don't know whether fetched count is valid.
+			 */
+			goto again;
+		}
+		delta = ktime_sub(now, start);
+		count = ktime_add(count, delta);
 	} else {
-		idle = ts->idle_sleeptime;
+		/*
+		 * Possible concurrent tick_nohz_stop_idle() already
+		 * cleared idle_active. We fetched count *after*
+		 * we fetched idle_active, so count must be valid.
+		 */
 	}
 
-	return ktime_to_us(idle);
-
+	return ktime_to_us(count);
 }
 EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
 
@@ -482,24 +509,46 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
  */
 u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 {
-	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
-	ktime_t now, iowait;
+	struct tick_sched *ts;
+	ktime_t now, count;
+	int active;
 
 	if (!tick_nohz_active)
 		return -1;
 
+	ts = &per_cpu(tick_cpu_sched, cpu);
+
 	now = ktime_get();
 	if (last_update_time)
 		*last_update_time = ktime_to_us(now);
 
-	if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
-		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
-		iowait = ktime_add(ts->iowait_sleeptime, delta);
+ again:
+	active = ACCESS_ONCE(ts->idle_active);
+	smp_rmb();
+	count = ACCESS_ONCE(ts->iowait_sleeptime);
+	if (active == 2) {
+		ktime_t delta, start;
+
+		smp_rmb();
+		start = ACCESS_ONCE(ts->idle_entrytime);
+		if (start.tv64 == 0) {
+			/*
+			 * Other CPU is updating the count.
+			 * We don't know whether fetched count is valid.
+			 */
+			goto again;
+		}
+		delta = ktime_sub(now, start);
+		count = ktime_add(count, delta);
 	} else {
-		iowait = ts->iowait_sleeptime;
+		/*
+		 * Possible concurrent tick_nohz_stop_idle() already
+		 * cleared idle_active. We fetched count *after*
+		 * we fetched idle_active, so count must be valid.
+		 */
 	}
 
-	return ktime_to_us(iowait);
+	return ktime_to_us(count);
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
-- 
1.8.1.4



* [PATCH 3/3] nohz: Fix iowait overcounting if iowait task migrates
  2014-04-23 19:00 [PATCH 1/3] nohz: Only update sleeptime stats locally Denys Vlasenko
  2014-04-23 19:00 ` [PATCH 2/3] nohz: Synchronize sleep time stats with memory barriers Denys Vlasenko
@ 2014-04-23 19:00 ` Denys Vlasenko
  2014-04-24  0:37   ` Frederic Weisbecker
                     ` (2 more replies)
  1 sibling, 3 replies; 10+ messages in thread
From: Denys Vlasenko @ 2014-04-23 19:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Frederic Weisbecker, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven,
	Oleg Nesterov

Before this change, if the last IO-blocked task wakes up
on a different CPU, the original CPU may stay idle for much longer,
and the entire time it stays idle is accounted as iowait time.

This change adds a struct tick_sched::iowait_exittime member.
On entry to idle, it is set to KTIME_MAX.
The last IO-blocked task, if it wakes up on a different CPU,
sets it to the current time. Note that this can happen at most once
per idle period: new iowaiting tasks can't magically appear on an idle CPU's rq.

If iowait_exittime is set, then (iowait_exittime - idle_entrytime)
gets accounted as iowait, and the remaining (now - iowait_exittime)
as "true" idle.
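
In code terms the split on idle exit amounts to roughly this (a sketch
using local names "entry" for idle_entrytime and "end" for
iowait_exittime, not the exact hunk below):

	if (end.tv64 != KTIME_MAX) {
		/* the last iowaiting task migrated away at 'end' */
		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, ktime_sub(end, entry));
		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, ktime_sub(now, end));
	} else {
		/* CPU stayed "iowait idle" for the whole period */
		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, ktime_sub(now, entry));
	}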

Run-tested: /proc/stat counters no longer go backwards.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/tick.h     |  3 ++
 kernel/sched/core.c      | 20 ++++++++++++++
 kernel/time/tick-sched.c | 72 ++++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 89 insertions(+), 6 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index b84773c..2a27883 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -67,6 +67,7 @@ struct tick_sched {
 	ktime_t				idle_exittime;
 	ktime_t				idle_sleeptime;
 	ktime_t				iowait_sleeptime;
+	ktime_t				iowait_exittime;
 	ktime_t				sleep_length;
 	unsigned long			last_jiffies;
 	unsigned long			next_jiffies;
@@ -139,6 +140,7 @@ extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
+extern void tick_nohz_iowait_to_idle(int cpu);
 
 # else /* !CONFIG_NO_HZ_COMMON */
 static inline int tick_nohz_tick_stopped(void)
@@ -157,6 +159,7 @@ static inline ktime_t tick_nohz_get_sleep_length(void)
 }
 static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
 static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
+static inline void tick_nohz_iowait_to_idle(int cpu) { }
 # endif /* !CONFIG_NO_HZ_COMMON */
 
 #ifdef CONFIG_NO_HZ_FULL
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9cae286..08dd220 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4255,6 +4255,9 @@ EXPORT_SYMBOL_GPL(yield_to);
  */
 void __sched io_schedule(void)
 {
+#ifdef CONFIG_NO_HZ_COMMON
+	int cpu_on_entry = smp_processor_id();
+#endif
 	struct rq *rq = raw_rq();
 
 	delayacct_blkio_start();
@@ -4263,13 +4266,23 @@ void __sched io_schedule(void)
 	current->in_iowait = 1;
 	schedule();
 	current->in_iowait = 0;
+#ifdef CONFIG_NO_HZ_COMMON
+	if (atomic_dec_and_test(&rq->nr_iowait)) {
+		if (smp_processor_id() != cpu_on_entry)
+			tick_nohz_iowait_to_idle(cpu_on_entry);
+	}
+#else
 	atomic_dec(&rq->nr_iowait);
+#endif
 	delayacct_blkio_end();
 }
 EXPORT_SYMBOL(io_schedule);
 
 long __sched io_schedule_timeout(long timeout)
 {
+#ifdef CONFIG_NO_HZ_COMMON
+	int cpu_on_entry = smp_processor_id();
+#endif
 	struct rq *rq = raw_rq();
 	long ret;
 
@@ -4279,7 +4292,14 @@ long __sched io_schedule_timeout(long timeout)
 	current->in_iowait = 1;
 	ret = schedule_timeout(timeout);
 	current->in_iowait = 0;
+#ifdef CONFIG_NO_HZ_COMMON
+	if (atomic_dec_and_test(&rq->nr_iowait)) {
+		if (smp_processor_id() != cpu_on_entry)
+			tick_nohz_iowait_to_idle(cpu_on_entry);
+	}
+#else
 	atomic_dec(&rq->nr_iowait);
+#endif
 	delayacct_blkio_end();
 	return ret;
 }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a99770e..679a577 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -403,17 +403,48 @@ static void tick_nohz_update_jiffies(ktime_t now)
 	touch_softlockup_watchdog();
 }
 
+/*
+ * Race prevention for idle/iowait counters:
+ *
+ * On entry and exit to idle/iowait states,
+ * ts->idle_entrytime, ts->{idle,iowait}_sleeptime and ts->idle_active
+ * fields are updated in this order, serialized with smp_wmb().
+ * On exit, ts->idle_entrytime is zeroed.
+ *
+ * The readers - get_cpu_{idle,iowait}_time_us() - fetch these fields
+ * in reversed order, serialized with smp_rmb().
+ * If ts->idle_active == 0, reading of ts->idle_entrytime (and one smp_rmb())
+ * are not necessary. If ts->idle_active != 0, ts->idle_entrytime
+ * (which is needed anyway for calculations)
+ * needs to be checked for zero, and reads need to be done anew if it is zeroed.
+ *
+ * This implements a seqlock-like race detection without needing additional
+ * data fields and read-modify-write ops on them.
+ */
 static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 {
-	ktime_t delta;
+	ktime_t delta, entry;
 
 	/* Updates the per cpu time idle statistics counters */
-	delta = ktime_sub(now, ts->idle_entrytime);
+	entry = ts->idle_entrytime;
+	delta = ktime_sub(now, entry);
 	ts->idle_entrytime.tv64 = 0;
 	smp_wmb();
 
-	if (ts->idle_active == 2)
+	if (ts->idle_active == 2) {
+		ktime_t end = ts->iowait_exittime;
+
+		if (end.tv64 != KTIME_MAX) {
+			/*
+			 * Last iowaiting task on our rq was woken up on other CPU
+			 * sometime in the past, it updated ts->iowait_exittime.
+			 */
+			delta = ktime_sub(now, end);
+			ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+			delta = ktime_sub(end, entry);
+		}
 		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
+	}
 	else
 		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
 
@@ -430,10 +461,18 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
 	ts->idle_entrytime = now;
 	smp_wmb();
 	ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;
+	ts->iowait_exittime.tv64 = KTIME_MAX;
 	sched_clock_idle_sleep_event();
 	return now;
 }
 
+void tick_nohz_iowait_to_idle(int cpu)
+{
+	struct tick_sched *ts = tick_get_tick_sched(cpu);
+
+	ts->iowait_exittime = ktime_get();
+}
+
 /**
  * get_cpu_idle_time_us - get the total idle time of a cpu
  * @cpu: CPU number to query
@@ -467,7 +506,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 	active = ACCESS_ONCE(ts->idle_active);
 	smp_rmb();
 	count = ACCESS_ONCE(ts->idle_sleeptime);
-	if (active == 1) {
+	if (active /* either 1 or 2 */) {
 		ktime_t delta, start;
 
 		smp_rmb();
@@ -479,6 +518,18 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 			 */
 			goto again;
 		}
+		if (active == 2) {
+			/* This idle period started as "iowait idle" */
+			ktime_t iowait_exit = ACCESS_ONCE(ts->iowait_exittime);
+
+			if (iowait_exit.tv64 == KTIME_MAX)
+				goto skip; /* and it still is */
+			/*
+			 * This CPU used to be "iowait idle", but iowait task
+			 * has migrated. The rest of idle time is "true idle":
+			 */
+			start = iowait_exit;
+		}
 		delta = ktime_sub(now, start);
 		count = ktime_add(count, delta);
 	} else {
@@ -488,6 +539,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 		 * we fetched idle_active, so count must be valid.
 		 */
 	}
+ skip:
 
 	return ktime_to_us(count);
 }
@@ -527,7 +579,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 	smp_rmb();
 	count = ACCESS_ONCE(ts->iowait_sleeptime);
 	if (active == 2) {
-		ktime_t delta, start;
+		ktime_t delta, start, end;
 
 		smp_rmb();
 		start = ACCESS_ONCE(ts->idle_entrytime);
@@ -538,7 +590,15 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 			 */
 			goto again;
 		}
-		delta = ktime_sub(now, start);
+		/*
+		 * If last iowaiting task on our rq was woken up on other CPU
+		 * sometime in the past, it updated ts->iowait_exittime.
+		 * Otherwise, ts->iowait_exittime == KTIME_MAX.
+		 */
+		end = ACCESS_ONCE(ts->iowait_exittime);
+		if (end.tv64 == KTIME_MAX)
+			end = now;
+		delta = ktime_sub(end, start);
 		count = ktime_add(count, delta);
 	} else {
 		/*
-- 
1.8.1.4



* Re: [PATCH 2/3] nohz: Synchronize sleep time stats with memory barriers
  2014-04-23 19:00 ` [PATCH 2/3] nohz: Synchronize sleep time stats with memory barriers Denys Vlasenko
@ 2014-04-24  0:16   ` Frederic Weisbecker
  2014-04-24 18:36     ` Denys Vlasenko
  0 siblings, 1 reply; 10+ messages in thread
From: Frederic Weisbecker @ 2014-04-24  0:16 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Hidetoshi Seto, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Arjan van de Ven, Oleg Nesterov

On Wed, Apr 23, 2014 at 09:00:35PM +0200, Denys Vlasenko wrote:
> When some call site uses get_cpu_*_time_us() to read a sleeptime
> stat, it deduces the total sleeptime by adding the pending time
> to the last sleeptime snapshot if the CPU target is idle.
> 
> But this only works if idle_sleeptime, idle_entrytime and idle_active are
> read and updated under some disciplined order.
> 
> This patch changes updaters to modify idle_entrytime,
> {idle,iowait}_sleeptime, and idle_active exactly in this order,
> with write barriers on SMP to ensure other CPUs see then in this order too.
> Readers are changed read them in the opposite order, with read barriers.
> When readers detect a race by seeing cleared idle_entrytime,
> they retry the reads.
> 
> The "iowait-ness" of every idle period is decided at the moment it starts:
> if this CPU's run-queue had tasks waiting on I/O, then this idle
> period's duration will be added to iowait_sleeptime.
> This, along with proper SMP syncronization, fixes the bug where iowait
> counts could go backwards.
> 
> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
> Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> ---
>  kernel/time/tick-sched.c | 83 ++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 66 insertions(+), 17 deletions(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 73ced0c4..a99770e 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -409,10 +409,15 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
>  
>  	/* Updates the per cpu time idle statistics counters */
>  	delta = ktime_sub(now, ts->idle_entrytime);
> -	if (nr_iowait_cpu(smp_processor_id()) > 0)
> +	ts->idle_entrytime.tv64 = 0;
> +	smp_wmb();
> +
> +	if (ts->idle_active == 2)
>  		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
>  	else
>  		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
> +
> +	smp_wmb();
>  	ts->idle_active = 0;
>  
>  	sched_clock_idle_wakeup_event(0);
> @@ -423,7 +428,8 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
>  	ktime_t now = ktime_get();
>  
>  	ts->idle_entrytime = now;
> -	ts->idle_active = 1;
> +	smp_wmb();
> +	ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;
>  	sched_clock_idle_sleep_event();
>  	return now;
>  }
> @@ -444,25 +450,46 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
>   */
>  u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
>  {
> -	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
> -	ktime_t now, idle;
> +	struct tick_sched *ts;
> +	ktime_t now, count;
> +	int active;
>  
>  	if (!tick_nohz_active)
>  		return -1;
>  
> +	ts = &per_cpu(tick_cpu_sched, cpu);
> +
>  	now = ktime_get();
>  	if (last_update_time)
>  		*last_update_time = ktime_to_us(now);
>  
> -	if (ts->idle_active && !nr_iowait_cpu(cpu)) {
> -		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> -		idle = ktime_add(ts->idle_sleeptime, delta);
> + again:
> +	active = ACCESS_ONCE(ts->idle_active);
> +	smp_rmb();
> +	count = ACCESS_ONCE(ts->idle_sleeptime);
> +	if (active == 1) {
> +		ktime_t delta, start;
> +
> +		smp_rmb();
> +		start = ACCESS_ONCE(ts->idle_entrytime);
> +		if (start.tv64 == 0) {
> +			/*
> +			 * Other CPU is updating the count.
> +			 * We don't know whether fetched count is valid.
> +			 */
> +			goto again;
> +		}
> +		delta = ktime_sub(now, start);
> +		count = ktime_add(count, delta);

There is still a possibility that you race with an updater here.
Let's take these initial values, all of which belong to ts(CPU 1):

      // idle_sleeptime == 0
      // idle_entrytime == 0
      // ktime_get() == 100
      // idle_active = 1

           CPU 0                                       CPU 1

      count = idle_sleeptime // = 0               
                                                     tick_nohz_stop_idle();
                                                     tick_nohz_start_idle();
                                                     /* now idle_sleeptime == 100
                                                        and idle_entrytime == 100
                                                        ktime_get() is still 100
                                                        and idle_active is still 1
                                                        as it has toggled two times */
      delta = now - idle_entrytime; // 100 - 100
      count += delta // == 0

Then you get the spurious 0 result.

So memory barriers probably aren't enough here. A seqcount would solve the issue,
as it maintains update sequence tokens.
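
For reference, a seqcount-based scheme would look roughly like this
(a sketch only, with declarations omitted, assuming a seqcount_t field,
say ts->sleeptime_seq, were added to struct tick_sched):

	/* writer side, e.g. in tick_nohz_stop_idle() */
	write_seqcount_begin(&ts->sleeptime_seq);
	delta = ktime_sub(now, ts->idle_entrytime);
	if (nr_iowait_cpu(smp_processor_id()) > 0)
		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
	else
		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
	ts->idle_active = 0;
	write_seqcount_end(&ts->sleeptime_seq);

	/* reader side, e.g. in get_cpu_idle_time_us() */
	do {
		seq = read_seqcount_begin(&ts->sleeptime_seq);
		idle = ts->idle_sleeptime;
		if (ts->idle_active && !nr_iowait_cpu(cpu))
			idle = ktime_add(idle, ktime_sub(now, ts->idle_entrytime));
	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));

Readers then retry until they see a consistent snapshot instead of
combining fields from two different idle periods.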


* Re: [PATCH 3/3] nohz: Fix iowait overcounting if iowait task migrates
  2014-04-23 19:00 ` [PATCH 3/3] nohz: Fix iowait overcounting if iowait task migrates Denys Vlasenko
@ 2014-04-24  0:37   ` Frederic Weisbecker
  2014-04-24  7:07   ` Peter Zijlstra
  2014-04-24  8:19   ` Hidetoshi Seto
  2 siblings, 0 replies; 10+ messages in thread
From: Frederic Weisbecker @ 2014-04-24  0:37 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Hidetoshi Seto, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Arjan van de Ven, Oleg Nesterov

On Wed, Apr 23, 2014 at 09:00:36PM +0200, Denys Vlasenko wrote:
> Before this change, if last IO-blocked task wakes up
> on a different CPU, the original CPU may stay idle for much longer,
> and the entire time it stays idle is accounted as iowait time.
> 
> This change adds struct tick_sched::iowait_exittime member.
> On entry to idle, it is set to KTIME_MAX.
> Last IO-blocked task, if migrated, sets it to current time.
> Note that this can happen only once per each idle period:
> new iowaiting tasks can't magically appear on idle CPU's rq.
> 
> If iowait_exittime is set, then (iowait_exittime - idle_entrytime)
> gets accounted as iowait, and the remaining (now - iowait_exittime)
> as "true" idle.
> 
> Run-tested: /proc/stats no longer go backwards.
> 
> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
> Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> ---
>  include/linux/tick.h     |  3 ++
>  kernel/sched/core.c      | 20 ++++++++++++++
>  kernel/time/tick-sched.c | 72 ++++++++++++++++++++++++++++++++++++++++++++----
>  3 files changed, 89 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index b84773c..2a27883 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -67,6 +67,7 @@ struct tick_sched {
>  	ktime_t				idle_exittime;
>  	ktime_t				idle_sleeptime;
>  	ktime_t				iowait_sleeptime;
> +	ktime_t				iowait_exittime;
>  	ktime_t				sleep_length;
>  	unsigned long			last_jiffies;
>  	unsigned long			next_jiffies;
> @@ -139,6 +140,7 @@ extern void tick_nohz_irq_exit(void);
>  extern ktime_t tick_nohz_get_sleep_length(void);
>  extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
>  extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
> +extern void tick_nohz_iowait_to_idle(int cpu);
>  
>  # else /* !CONFIG_NO_HZ_COMMON */
>  static inline int tick_nohz_tick_stopped(void)
> @@ -157,6 +159,7 @@ static inline ktime_t tick_nohz_get_sleep_length(void)
>  }
>  static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
>  static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
> +static inline void tick_nohz_iowait_to_idle(int cpu) { }
>  # endif /* !CONFIG_NO_HZ_COMMON */
>  
>  #ifdef CONFIG_NO_HZ_FULL
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9cae286..08dd220 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4255,6 +4255,9 @@ EXPORT_SYMBOL_GPL(yield_to);
>   */
>  void __sched io_schedule(void)
>  {
> +#ifdef CONFIG_NO_HZ_COMMON
> +	int cpu_on_entry = smp_processor_id();
> +#endif
>  	struct rq *rq = raw_rq();
>  
>  	delayacct_blkio_start();
> @@ -4263,13 +4266,23 @@ void __sched io_schedule(void)
>  	current->in_iowait = 1;
>  	schedule();
>  	current->in_iowait = 0;
> +#ifdef CONFIG_NO_HZ_COMMON
> +	if (atomic_dec_and_test(&rq->nr_iowait)) {
> +		if (smp_processor_id() != cpu_on_entry)
> +			tick_nohz_iowait_to_idle(cpu_on_entry);

The update in tick_nohz_iowait_to_idle() can happen at any time,
possibly too late.

There can be races like this:

               CPU 0                     CPU 1
               --------------------------------

     //task A goes to sleep (iowait)
                                         //task A wakes up
                                         atomic_dec_and_test(nr_iowait) {
                                            now  = ktime_get(); // now == X
     //task B runs
     //Pending iowait time is flushed
     //task B sleeps (iowait)
     //now = Y
                                         //task A is preempted
                                         //task B runs
                                         atomic_dec_and_test(nr_iowait) {
                                            now  = ktime_get(); // now == Y
                                         //task B is preempted
                                         //task A runs back
                                            iowait_exittime = Z
     // account pending iowait time
     // delta = X - Y

Here the delta accounted is X - Y, which is probably a negative delta. So the
whole time between Z and Y is forgotten.

Now the race is tight and maybe not important in practice, because preemption
slices should be short, so only tiny chunks may be overridden.

> +	}
> +#else
>  	atomic_dec(&rq->nr_iowait);
> +#endif
>  	delayacct_blkio_end();
>  }
>  EXPORT_SYMBOL(io_schedule);
>  
>  long __sched io_schedule_timeout(long timeout)
>  {
> +#ifdef CONFIG_NO_HZ_COMMON
> +	int cpu_on_entry = smp_processor_id();
> +#endif
>  	struct rq *rq = raw_rq();
>  	long ret;
>  
> @@ -4279,7 +4292,14 @@ long __sched io_schedule_timeout(long timeout)
>  	current->in_iowait = 1;
>  	ret = schedule_timeout(timeout);
>  	current->in_iowait = 0;
> +#ifdef CONFIG_NO_HZ_COMMON
> +	if (atomic_dec_and_test(&rq->nr_iowait)) {
> +		if (smp_processor_id() != cpu_on_entry)
> +			tick_nohz_iowait_to_idle(cpu_on_entry);
> +	}
> +#else
>  	atomic_dec(&rq->nr_iowait);
> +#endif
>  	delayacct_blkio_end();
>  	return ret;
>  }
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index a99770e..679a577 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -403,17 +403,48 @@ static void tick_nohz_update_jiffies(ktime_t now)
>  	touch_softlockup_watchdog();
>  }
>  
> +/*
> + * Race prevention for idle/iowait counters:
> + *
> + * On entry and exit to idle/iowait states,
> + * ts->idle_entrytime, ts->{idle,iowait}_sleeptime and ts->idle_active
> + * fields are updated in this order, serialized with smp_wmb().
> + * On exit, ts->idle_entrytime is zeroed.
> + *
> + * The readers - get_cpu_{idle,iowait}_time_us() - fetch these fields
> + * in reversed order, serialized with smp_rmb().
> + * If ts->idle_active == 0, reading of ts->idle_entrytime (and one smp_rmb())
> + * are not necessary. If ts->idle_active != 0, ts->idle_entrytime
> + * (which is needed anyway for calculations)
> + * needs to be checked for zero, and reads need to be done anew if it is zeroed.
> + *
> + * This implements a seqlock-like race detection without needing additional
> + * data fields and read-modify-write ops on them.
> + */
>  static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
>  {
> -	ktime_t delta;
> +	ktime_t delta, entry;
>  
>  	/* Updates the per cpu time idle statistics counters */
> -	delta = ktime_sub(now, ts->idle_entrytime);
> +	entry = ts->idle_entrytime;
> +	delta = ktime_sub(now, entry);
>  	ts->idle_entrytime.tv64 = 0;
>  	smp_wmb();
>  
> -	if (ts->idle_active == 2)
> +	if (ts->idle_active == 2) {
> +		ktime_t end = ts->iowait_exittime;
> +
> +		if (end.tv64 != KTIME_MAX) {
> +			/*
> +			 * Last iowaiting task on our rq was woken up on other CPU
> +			 * sometime in the past, it updated ts->iowait_exittime.
> +			 */
> +			delta = ktime_sub(now, end);
> +			ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
> +			delta = ktime_sub(end, entry);
> +		}
>  		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
> +	}
>  	else
>  		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
>  
> @@ -430,10 +461,18 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
>  	ts->idle_entrytime = now;
>  	smp_wmb();
>  	ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;
> +	ts->iowait_exittime.tv64 = KTIME_MAX;
>  	sched_clock_idle_sleep_event();
>  	return now;
>  }
>  
> +void tick_nohz_iowait_to_idle(int cpu)
> +{
> +	struct tick_sched *ts = tick_get_tick_sched(cpu);
> +
> +	ts->iowait_exittime = ktime_get();
> +}
> +
>  /**
>   * get_cpu_idle_time_us - get the total idle time of a cpu
>   * @cpu: CPU number to query
> @@ -467,7 +506,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
>  	active = ACCESS_ONCE(ts->idle_active);
>  	smp_rmb();
>  	count = ACCESS_ONCE(ts->idle_sleeptime);
> -	if (active == 1) {
> +	if (active /* either 1 or 2 */) {
>  		ktime_t delta, start;
>  
>  		smp_rmb();
> @@ -479,6 +518,18 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
>  			 */
>  			goto again;
>  		}
> +		if (active == 2) {
> +			/* This idle period started as "iowait idle" */
> +			ktime_t iowait_exit = ACCESS_ONCE(ts->iowait_exittime);
> +
> +			if (iowait_exit.tv64 == KTIME_MAX)
> +				goto skip; /* and it still is */
> +			/*
> +			 * This CPU used to be "iowait idle", but iowait task
> +			 * has migrated. The rest of idle time is "true idle":
> +			 */
> +			start = iowait_exit;
> +		}
>  		delta = ktime_sub(now, start);
>  		count = ktime_add(count, delta);
>  	} else {
> @@ -488,6 +539,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
>  		 * we fetched idle_active, so count must be valid.
>  		 */
>  	}
> + skip:
>  
>  	return ktime_to_us(count);
>  }
> @@ -527,7 +579,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
>  	smp_rmb();
>  	count = ACCESS_ONCE(ts->iowait_sleeptime);
>  	if (active == 2) {
> -		ktime_t delta, start;
> +		ktime_t delta, start, end;
>  
>  		smp_rmb();
>  		start = ACCESS_ONCE(ts->idle_entrytime);
> @@ -538,7 +590,15 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
>  			 */
>  			goto again;
>  		}
> -		delta = ktime_sub(now, start);
> +		/*
> +		 * If last iowaiting task on our rq was woken up on other CPU
> +		 * sometime in the past, it updated ts->iowait_exittime.
> +		 * Otherwise, ts->iowait_exittime == KTIME_MAX.
> +		 */
> +		end = ACCESS_ONCE(ts->iowait_exittime);
> +		if (end.tv64 == KTIME_MAX)
> +			end = now;
> +		delta = ktime_sub(end, start);
>  		count = ktime_add(count, delta);
>  	} else {
>  		/*
> -- 
> 1.8.1.4
> 


* Re: [PATCH 3/3] nohz: Fix iowait overcounting if iowait task migrates
  2014-04-23 19:00 ` [PATCH 3/3] nohz: Fix iowait overcounting if iowait task migrates Denys Vlasenko
  2014-04-24  0:37   ` Frederic Weisbecker
@ 2014-04-24  7:07   ` Peter Zijlstra
  2014-04-24 18:38     ` Denys Vlasenko
  2014-04-24  8:19   ` Hidetoshi Seto
  2 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2014-04-24  7:07 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Frederic Weisbecker, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Andrew Morton, Arjan van de Ven, Oleg Nesterov

On Wed, Apr 23, 2014 at 09:00:36PM +0200, Denys Vlasenko wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9cae286..08dd220 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4255,6 +4255,9 @@ EXPORT_SYMBOL_GPL(yield_to);
>   */
>  void __sched io_schedule(void)
>  {
> +#ifdef CONFIG_NO_HZ_COMMON
> +	int cpu_on_entry = smp_processor_id();
> +#endif
>  	struct rq *rq = raw_rq();
>  
>  	delayacct_blkio_start();
> @@ -4263,13 +4266,23 @@ void __sched io_schedule(void)
>  	current->in_iowait = 1;
>  	schedule();
>  	current->in_iowait = 0;
> +#ifdef CONFIG_NO_HZ_COMMON
> +	if (atomic_dec_and_test(&rq->nr_iowait)) {
> +		if (smp_processor_id() != cpu_on_entry)
> +			tick_nohz_iowait_to_idle(cpu_on_entry);
> +	}
> +#else
>  	atomic_dec(&rq->nr_iowait);
> +#endif
>  	delayacct_blkio_end();
>  }
>  EXPORT_SYMBOL(io_schedule);
>  
>  long __sched io_schedule_timeout(long timeout)
>  {
> +#ifdef CONFIG_NO_HZ_COMMON
> +	int cpu_on_entry = smp_processor_id();
> +#endif
>  	struct rq *rq = raw_rq();
>  	long ret;
>  
> @@ -4279,7 +4292,14 @@ long __sched io_schedule_timeout(long timeout)
>  	current->in_iowait = 1;
>  	ret = schedule_timeout(timeout);
>  	current->in_iowait = 0;
> +#ifdef CONFIG_NO_HZ_COMMON
> +	if (atomic_dec_and_test(&rq->nr_iowait)) {
> +		if (smp_processor_id() != cpu_on_entry)
> +			tick_nohz_iowait_to_idle(cpu_on_entry);
> +	}
> +#else
>  	atomic_dec(&rq->nr_iowait);
> +#endif
>  	delayacct_blkio_end();
>  	return ret;
>  }

Why do you insist on writing the same (buggy, see later) code twice?
Really, get lazy already and write it once but use it twice!

It's buggy because smp_processor_id() is used in preemptible context,
and it's further buggy because raw_rq() does it again and could get a rq
on a different CPU.

What you want is something like:

static inline void io_wait_start(struct rq *rq)
{
	atomic_inc(&rq->nr_iowait);
	current->in_iowait = 1;
}

static inline void io_wait_end(struct rq *rq)
{
	current->in_iowait = 0;
#ifdef CONFIG_NO_HZ_COMMON
	if (atomic_dec_and_test(&rq->nr_iowait) && 
		cpu_of(rq) != raw_smp_processor_id()) {
		tick_nohz_iowait_end(cpu_of(rq));
	}
#else
	atomic_dec(&rq->nr_iowait);
#endif
}

Anyway, I suspect that's still broken and you really need that lock
around the state like I did earlier, because the above isn't serialized
between remote wakeup and the cpu waking from nohz.
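
With helpers like those, both call sites would collapse to something
like this (sketch only; any other statements in the existing bodies
stay where they are):

void __sched io_schedule(void)
{
	struct rq *rq = raw_rq();

	delayacct_blkio_start();
	io_wait_start(rq);
	schedule();
	io_wait_end(rq);
	delayacct_blkio_end();
}

and io_schedule_timeout() would do the same around schedule_timeout(),
so the NO_HZ-specific logic lives in exactly one place.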


* Re: [PATCH 3/3] nohz: Fix iowait overcounting if iowait task migrates
  2014-04-23 19:00 ` [PATCH 3/3] nohz: Fix iowait overcounting if iowait task migrates Denys Vlasenko
  2014-04-24  0:37   ` Frederic Weisbecker
  2014-04-24  7:07   ` Peter Zijlstra
@ 2014-04-24  8:19   ` Hidetoshi Seto
  2014-04-24 18:43     ` Denys Vlasenko
  2 siblings, 1 reply; 10+ messages in thread
From: Hidetoshi Seto @ 2014-04-24  8:19 UTC (permalink / raw)
  To: Denys Vlasenko, linux-kernel
  Cc: Frederic Weisbecker, Fernando Luis Vazquez Cao, Tetsuo Handa,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov

(2014/04/24 4:00), Denys Vlasenko wrote:
> Before this change, if last IO-blocked task wakes up
> on a different CPU, the original CPU may stay idle for much longer,
> and the entire time it stays idle is accounted as iowait time.
> 
> This change adds struct tick_sched::iowait_exittime member.
> On entry to idle, it is set to KTIME_MAX.
> Last IO-blocked task, if migrated, sets it to current time.
> Note that this can happen only once per each idle period:
> new iowaiting tasks can't magically appear on idle CPU's rq.
> 
> If iowait_exittime is set, then (iowait_exittime - idle_entrytime)
> gets accounted as iowait, and the remaining (now - iowait_exittime)
> as "true" idle.
> 
> Run-tested: /proc/stats no longer go backwards.
> 
> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
> Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> Cc: Oleg Nesterov <oleg@redhat.com>

At first I'd like to say thank you very much for continuing
work on this problem.

My comments are inline below.

> ---
>  include/linux/tick.h     |  3 ++
>  kernel/sched/core.c      | 20 ++++++++++++++
>  kernel/time/tick-sched.c | 72 ++++++++++++++++++++++++++++++++++++++++++++----
>  3 files changed, 89 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index b84773c..2a27883 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -67,6 +67,7 @@ struct tick_sched {
>  	ktime_t				idle_exittime;
>  	ktime_t				idle_sleeptime;
>  	ktime_t				iowait_sleeptime;
> +	ktime_t				iowait_exittime;
>  	ktime_t				sleep_length;
>  	unsigned long			last_jiffies;
>  	unsigned long			next_jiffies;
> @@ -139,6 +140,7 @@ extern void tick_nohz_irq_exit(void);
>  extern ktime_t tick_nohz_get_sleep_length(void);
>  extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
>  extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
> +extern void tick_nohz_iowait_to_idle(int cpu);
>  
>  # else /* !CONFIG_NO_HZ_COMMON */
>  static inline int tick_nohz_tick_stopped(void)
> @@ -157,6 +159,7 @@ static inline ktime_t tick_nohz_get_sleep_length(void)
>  }
>  static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
>  static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
> +static inline void tick_nohz_iowait_to_idle(int cpu) { }
>  # endif /* !CONFIG_NO_HZ_COMMON */
>  
>  #ifdef CONFIG_NO_HZ_FULL
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9cae286..08dd220 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4255,6 +4255,9 @@ EXPORT_SYMBOL_GPL(yield_to);
>   */
>  void __sched io_schedule(void)
>  {
> +#ifdef CONFIG_NO_HZ_COMMON
> +	int cpu_on_entry = smp_processor_id();
> +#endif
>  	struct rq *rq = raw_rq();
>  
>  	delayacct_blkio_start();
> @@ -4263,13 +4266,23 @@ void __sched io_schedule(void)
>  	current->in_iowait = 1;
>  	schedule();
>  	current->in_iowait = 0;
> +#ifdef CONFIG_NO_HZ_COMMON
> +	if (atomic_dec_and_test(&rq->nr_iowait)) {
> +		if (smp_processor_id() != cpu_on_entry)
> +			tick_nohz_iowait_to_idle(cpu_on_entry);
> +	}
> +#else
>  	atomic_dec(&rq->nr_iowait);
> +#endif
>  	delayacct_blkio_end();
>  }
>  EXPORT_SYMBOL(io_schedule);
>  
>  long __sched io_schedule_timeout(long timeout)
>  {
> +#ifdef CONFIG_NO_HZ_COMMON
> +	int cpu_on_entry = smp_processor_id();
> +#endif
>  	struct rq *rq = raw_rq();
>  	long ret;
>  
> @@ -4279,7 +4292,14 @@ long __sched io_schedule_timeout(long timeout)
>  	current->in_iowait = 1;
>  	ret = schedule_timeout(timeout);
>  	current->in_iowait = 0;
> +#ifdef CONFIG_NO_HZ_COMMON
> +	if (atomic_dec_and_test(&rq->nr_iowait)) {
> +		if (smp_processor_id() != cpu_on_entry)
> +			tick_nohz_iowait_to_idle(cpu_on_entry);
> +	}
> +#else
>  	atomic_dec(&rq->nr_iowait);
> +#endif
>  	delayacct_blkio_end();
>  	return ret;
>  }

As I already mentioned in the previous discussion with Peter, I am
concerned that this change might have an impact on performance.
Especially if the system is an I/O-busy box, there may originally be
no iowait time (and possibly no idle time either).
In that case this change adds extra execution cost to maintain the
value of iowait_exittime, which might never be used.

Did you run any benchmark on this?

(And if you have extra time, it would be helpful if you could
 compare the performance impact of your patch set and mine:
  [PATCH v4 0/2] nohz: fix idle accounting in NO_HZ kernels
  https://lkml.org/lkml/2014/4/17/120
)

And if we do find a way to obtain the iowait_exittime at a reasonably
negligible cost, then why don't we use it for NOHZ=n kernels too?

> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index a99770e..679a577 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -403,17 +403,48 @@ static void tick_nohz_update_jiffies(ktime_t now)
>  	touch_softlockup_watchdog();
>  }
>  
> +/*
> + * Race prevention for idle/iowait counters:
> + *
> + * On entry and exit to idle/iowait states,
> + * ts->idle_entrytime, ts->{idle,iowait}_sleeptime and ts->idle_active
> + * fields are updated in this order, serialized with smp_wmb().
> + * On exit, ts->idle_entrytime is zeroed.
> + *
> + * The readers - get_cpu_{idle,iowait}_time_us() - fetch these fields
> + * in reversed order, serialized with smp_rmb().
> + * If ts->idle_active == 0, reading of ts->idle_entrytime (and one smp_rmb())
> + * are not necessary. If ts->idle_active != 0, ts->idle_entrytime
> + * (which is needed anyway for calculations)
> + * needs to be checked for zero, and reads need to be done anew if it is zeroed.
> + *
> + * This implements a seqlock-like race detection without needing additional
> + * data fields and read-modify-write ops on them.
> + */
>  static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
>  {
> -	ktime_t delta;
> +	ktime_t delta, entry;
>  
>  	/* Updates the per cpu time idle statistics counters */
> -	delta = ktime_sub(now, ts->idle_entrytime);
> +	entry = ts->idle_entrytime;
> +	delta = ktime_sub(now, entry);
>  	ts->idle_entrytime.tv64 = 0;
>  	smp_wmb();
>  
> -	if (ts->idle_active == 2)
> +	if (ts->idle_active == 2) {
> +		ktime_t end = ts->iowait_exittime;
> +
> +		if (end.tv64 != KTIME_MAX) {
> +			/*
> +			 * Last iowaiting task on our rq was woken up on other CPU
> +			 * sometime in the past, it updated ts->iowait_exittime.
> +			 */
> +			delta = ktime_sub(now, end);
> +			ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
> +			delta = ktime_sub(end, entry);
> +		}
>  		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
> +	}
>  	else
>  		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
>  
> @@ -430,10 +461,18 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
>  	ts->idle_entrytime = now;
>  	smp_wmb();
>  	ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;
> +	ts->iowait_exittime.tv64 = KTIME_MAX;
>  	sched_clock_idle_sleep_event();
>  	return now;
>  }
>  
> +void tick_nohz_iowait_to_idle(int cpu)
> +{
> +	struct tick_sched *ts = tick_get_tick_sched(cpu);
> +
> +	ts->iowait_exittime = ktime_get();
> +}
> +
>  /**
>   * get_cpu_idle_time_us - get the total idle time of a cpu
>   * @cpu: CPU number to query
> @@ -467,7 +506,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
>  	active = ACCESS_ONCE(ts->idle_active);
>  	smp_rmb();
>  	count = ACCESS_ONCE(ts->idle_sleeptime);
> -	if (active == 1) {
> +	if (active /* either 1 or 2 */) {
>  		ktime_t delta, start;
>  
>  		smp_rmb();
> @@ -479,6 +518,18 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
>  			 */
>  			goto again;
>  		}
> +		if (active == 2) {
> +			/* This idle period started as "iowait idle" */
> +			ktime_t iowait_exit = ACCESS_ONCE(ts->iowait_exittime);
> +
> +			if (iowait_exit.tv64 == KTIME_MAX)
> +				goto skip; /* and it still is */
> +			/*
> +			 * This CPU used to be "iowait idle", but iowait task
> +			 * has migrated. The rest of idle time is "true idle":
> +			 */
> +			start = iowait_exit;
> +		}
>  		delta = ktime_sub(now, start);
>  		count = ktime_add(count, delta);
>  	} else {
> @@ -488,6 +539,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
>  		 * we fetched idle_active, so count must be valid.
>  		 */
>  	}
> + skip:
>  
>  	return ktime_to_us(count);
>  }
> @@ -527,7 +579,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
>  	smp_rmb();
>  	count = ACCESS_ONCE(ts->iowait_sleeptime);
>  	if (active == 2) {
> -		ktime_t delta, start;
> +		ktime_t delta, start, end;
>  
>  		smp_rmb();
>  		start = ACCESS_ONCE(ts->idle_entrytime);
> @@ -538,7 +590,15 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
>  			 */
>  			goto again;
>  		}
> -		delta = ktime_sub(now, start);
> +		/*
> +		 * If last iowaiting task on our rq was woken up on other CPU
> +		 * sometime in the past, it updated ts->iowait_exittime.
> +		 * Otherwise, ts->iowait_exittime == KTIME_MAX.
> +		 */
> +		end = ACCESS_ONCE(ts->iowait_exittime);
> +		if (end.tv64 == KTIME_MAX)
> +			end = now;
> +		delta = ktime_sub(end, start);
>  		count = ktime_add(count, delta);
>  	} else {
>  		/*
> 

As Frederic already pointed out, a seqcount would be the better choice.


Thanks,
H.Seto



* Re: [PATCH 2/3] nohz: Synchronize sleep time stats with memory barriers
  2014-04-24  0:16   ` Frederic Weisbecker
@ 2014-04-24 18:36     ` Denys Vlasenko
  0 siblings, 0 replies; 10+ messages in thread
From: Denys Vlasenko @ 2014-04-24 18:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, Hidetoshi Seto, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Arjan van de Ven, Oleg Nesterov

On 04/24/2014 02:16 AM, Frederic Weisbecker wrote:
>> + again:
>> +	active = ACCESS_ONCE(ts->idle_active);
>> +	smp_rmb();
>> +	count = ACCESS_ONCE(ts->idle_sleeptime);
>> +	if (active == 1) {
>> +		ktime_t delta, start;
>> +
>> +		smp_rmb();
>> +		start = ACCESS_ONCE(ts->idle_entrytime);
>> +		if (start.tv64 == 0) {
>> +			/*
>> +			 * Other CPU is updating the count.
>> +			 * We don't know whether fetched count is valid.
>> +			 */
>> +			goto again;
>> +		}
>> +		delta = ktime_sub(now, start);
>> +		count = ktime_add(count, delta);
> 
> There is still a possibility that you race with an updater here.
> Lets take these initial values, all of which belong to ts(CPU 1):
> 
>       // idle_sleeptime == 0
>       // idle_entrytime == 0
>       // ktime_get() == 100
>       // idle_active = 1
> 
>            CPU 0                                       CPU 1
> 
>       count = idle_sleeptime // = 0               
>                                                      tick_nohz_stop_idle();
>                                                      tick_nohz_start_idle();
>                                                      /* now idle_sleeptime == 100
>                                                         and idle_entrytime == 100
>                                                         ktime_get() is still 100
>                                                         and idle_active is still 1
>                                                         as it has toggled two times */
>       delta = now - idle_entrytime; // 100 - 100
>       count += delta // == 0
> 
> Then you get the spurious 0 result.
> 
> So memory barriers probably aren't enough here. seqcount would solve the issue as it maintains
> update seq tokens.

I think you are right. I'll use your approach (seqcount) in the next version.


* Re: [PATCH 3/3] nohz: Fix iowait overcounting if iowait task migrates
  2014-04-24  7:07   ` Peter Zijlstra
@ 2014-04-24 18:38     ` Denys Vlasenko
  0 siblings, 0 replies; 10+ messages in thread
From: Denys Vlasenko @ 2014-04-24 18:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Frederic Weisbecker, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Andrew Morton, Arjan van de Ven, Oleg Nesterov

On 04/24/2014 09:07 AM, Peter Zijlstra wrote:
> On Wed, Apr 23, 2014 at 09:00:36PM +0200, Denys Vlasenko wrote:
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 9cae286..08dd220 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4255,6 +4255,9 @@ EXPORT_SYMBOL_GPL(yield_to);
>>   */
>>  void __sched io_schedule(void)
>>  {
>> +#ifdef CONFIG_NO_HZ_COMMON
>> +	int cpu_on_entry = smp_processor_id();
>> +#endif
>>  	struct rq *rq = raw_rq();
>>  
>>  	delayacct_blkio_start();
>> @@ -4263,13 +4266,23 @@ void __sched io_schedule(void)
>>  	current->in_iowait = 1;
>>  	schedule();
>>  	current->in_iowait = 0;
>> +#ifdef CONFIG_NO_HZ_COMMON
>> +	if (atomic_dec_and_test(&rq->nr_iowait)) {
>> +		if (smp_processor_id() != cpu_on_entry)
>> +			tick_nohz_iowait_to_idle(cpu_on_entry);
>> +	}
>> +#else
>>  	atomic_dec(&rq->nr_iowait);
>> +#endif
>>  	delayacct_blkio_end();
>>  }
>>  EXPORT_SYMBOL(io_schedule);
>>  
>>  long __sched io_schedule_timeout(long timeout)
>>  {
>> +#ifdef CONFIG_NO_HZ_COMMON
>> +	int cpu_on_entry = smp_processor_id();
>> +#endif
>>  	struct rq *rq = raw_rq();
>>  	long ret;
>>  
>> @@ -4279,7 +4292,14 @@ long __sched io_schedule_timeout(long timeout)
>>  	current->in_iowait = 1;
>>  	ret = schedule_timeout(timeout);
>>  	current->in_iowait = 0;
>> +#ifdef CONFIG_NO_HZ_COMMON
>> +	if (atomic_dec_and_test(&rq->nr_iowait)) {
>> +		if (smp_processor_id() != cpu_on_entry)
>> +			tick_nohz_iowait_to_idle(cpu_on_entry);
>> +	}
>> +#else
>>  	atomic_dec(&rq->nr_iowait);
>> +#endif
>>  	delayacct_blkio_end();
>>  	return ret;
>>  }
> 
> Why do you insist on writing the same (buggy, see later) code twice?

I don't insist on such a thing. I am modifying existing code
which already has these nearly-identical functions.

I agree with you that they beg for the common code to be moved into
a shared helper function.


> Its buggy because the smp_processor_id() is used in preemptible context,
> its further buggy because the raw_rq() does it again and could get a rq
> on a different cpu.
> 
> What you want is something like:
> 
> static inline void io_wait_start(struct rq *rq)
> {
> 	atomic_inc(&rq->nr_iowait);
> 	current->in_iowait = 1;
> }
> 
> static inline void io_wait_end(struct rq *rq)
> {
> 	current->in_iowait = 0;
> #ifdef CONFIG_NO_HZ_COMMON
> 	if (atomic_dec_and_test(&rq->nr_iowait) && 
> 		cpu_of(rq) != raw_smp_processor_id()) {
> 		tick_nohz_iowait_end(cpu_of(rq));
> 	}
> #else
> 	atomic_dec(&rq->nr_iowait);
> #endif
> }

Thanks, will use this in the next patchset.


* Re: [PATCH 3/3] nohz: Fix iowait overcounting if iowait task migrates
  2014-04-24  8:19   ` Hidetoshi Seto
@ 2014-04-24 18:43     ` Denys Vlasenko
  0 siblings, 0 replies; 10+ messages in thread
From: Denys Vlasenko @ 2014-04-24 18:43 UTC (permalink / raw)
  To: Hidetoshi Seto, linux-kernel
  Cc: Frederic Weisbecker, Fernando Luis Vazquez Cao, Tetsuo Handa,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov

On 04/24/2014 10:19 AM, Hidetoshi Seto wrote:
> As I already mentioned in previous discussion with Peter, I have
> concern here that this change might have impact on performance.
> Especially in case if system is a kind of io-busy box, originally
> there may be no iowait time (and possibly also no idle time).
> For such case this change adds extra execution cost to manage
> value of iowait_exittime which might not used.

Everything has some cost. Correctness usually trumps a few extra
locked bus cycles.

Wouldn't it be nice if we knew whether anyone even needs
the stats? If no one is going to read /proc/stat on the box,
there's no point in going to all the trouble of maintaining the counters...

> And if we successfully found a way to get the iowait_exittime
> within reasonable negligible cheap cost, then why we don't use
> it for NOHZ=n kernels too? 

Kernels without NOHZ maintain the counters based on timer interrupt
sampling. It should still work fine.

> As Frederic already pointed, seqcount must be better choice.

Yes, I'm switching to seqcounts.


