linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/4] nohz: Only update sleeptime stats locally
@ 2014-04-24 18:45 Denys Vlasenko
  2014-04-24 18:45 ` [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock Denys Vlasenko
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Denys Vlasenko @ 2014-04-24 18:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Frederic Weisbecker, Denys Vlasenko, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven,
	Oleg Nesterov

From: Frederic Weisbecker <fweisbec@gmail.com>

The idle and io sleeptime stats can be updated concurrently from callers
of get_cpu_idle_time_us(), get_cpu_iowait_time_us() and
tick_nohz_stop_idle().

Updaters can easily race and corrupt the internal data coherency,
for example when a governor calls a get_cpu_*_time_us() API while the
target CPU exits idle at the same time, because no locking whatsoever
protects against this concurrency.

To fix this, let's only update the sleeptime stats locally when the CPU
exits from idle. This is the only place where a real update is truly
needed. The callers of get_cpu_*_time_us() can simply add up the pending
sleep time delta to the last sleeptime snapshot in order to get a coherent
result. There is no need for them to also update the stats.
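
For illustration, the reader side then amounts to something like the
following sketch (simplified: the real code below also splits plain
idle vs. iowait using nr_iowait_cpu()):

	ktime_t sleeptime = ts->idle_sleeptime;

	if (ts->idle_active)
		sleeptime = ktime_add(sleeptime,
				      ktime_sub(ktime_get(), ts->idle_entrytime));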

Rebased to Linus' tree by Denys Vlasenko.

Reported-by: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 kernel/time/tick-sched.c | 63 ++++++++++++++++--------------------------------
 1 file changed, 21 insertions(+), 42 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6558b7a..7d86a54 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -406,31 +406,16 @@ static void tick_nohz_update_jiffies(ktime_t now)
 	touch_softlockup_watchdog();
 }
 
-/*
- * Updates the per cpu time idle statistics counters
- */
-static void
-update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
+static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 {
 	ktime_t delta;
 
-	if (ts->idle_active) {
-		delta = ktime_sub(now, ts->idle_entrytime);
-		if (nr_iowait_cpu(cpu) > 0)
-			ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
-		else
-			ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
-		ts->idle_entrytime = now;
-	}
-
-	if (last_update_time)
-		*last_update_time = ktime_to_us(now);
-
-}
-
-static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
-{
-	update_ts_time_stats(smp_processor_id(), ts, now, NULL);
+	/* Updates the per cpu time idle statistics counters */
+	delta = ktime_sub(now, ts->idle_entrytime);
+	if (nr_iowait_cpu(smp_processor_id()) > 0)
+		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
+	else
+		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
 	ts->idle_active = 0;
 
 	sched_clock_idle_wakeup_event(0);
@@ -469,17 +454,14 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 		return -1;
 
 	now = ktime_get();
-	if (last_update_time) {
-		update_ts_time_stats(cpu, ts, now, last_update_time);
-		idle = ts->idle_sleeptime;
-	} else {
-		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
-			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+	if (last_update_time)
+		*last_update_time = ktime_to_us(now);
 
-			idle = ktime_add(ts->idle_sleeptime, delta);
-		} else {
-			idle = ts->idle_sleeptime;
-		}
+	if (ts->idle_active && !nr_iowait_cpu(cpu)) {
+		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+		idle = ktime_add(ts->idle_sleeptime, delta);
+	} else {
+		idle = ts->idle_sleeptime;
 	}
 
 	return ktime_to_us(idle);
@@ -510,17 +492,14 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 		return -1;
 
 	now = ktime_get();
-	if (last_update_time) {
-		update_ts_time_stats(cpu, ts, now, last_update_time);
-		iowait = ts->iowait_sleeptime;
-	} else {
-		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
-			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+	if (last_update_time)
+		*last_update_time = ktime_to_us(now);
 
-			iowait = ktime_add(ts->iowait_sleeptime, delta);
-		} else {
-			iowait = ts->iowait_sleeptime;
-		}
+	if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
+		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+		iowait = ktime_add(ts->iowait_sleeptime, delta);
+	} else {
+		iowait = ts->iowait_sleeptime;
 	}
 
 	return ktime_to_us(iowait);
-- 
1.8.1.4



* [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2014-04-24 18:45 [PATCH 1/4] nohz: Only update sleeptime stats locally Denys Vlasenko
@ 2014-04-24 18:45 ` Denys Vlasenko
  2014-04-24 18:45 ` [PATCH 3/4] nohz: Fix idle/iowait counts going backwards Denys Vlasenko
  2014-04-24 18:45 ` [PATCH 4/4] nohz: Fix iowait overcounting if iowait task migrates Denys Vlasenko
  2 siblings, 0 replies; 11+ messages in thread
From: Denys Vlasenko @ 2014-04-24 18:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Frederic Weisbecker, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven,
	Oleg Nesterov

This patch is adapted from a very similar patch
by Frederic Weisbecker. The commit message text is also mostly
from Frederic's patch.

When a call site uses get_cpu_*_time_us() to read a sleeptime
stat, it derives the total sleeptime by adding the pending time
to the last sleeptime snapshot if the target CPU is idle.

Namely, this amounts to:

       sleeptime = ts($CPU)->idle_sleeptime;
       if (ts($CPU)->idle_active)
               sleeptime += NOW() - ts($CPU)->idle_entrytime;

But this only works if idle_sleeptime, idle_entrytime and idle_active are
read and updated in some disciplined order.

Let's consider the following scenario:

             CPU 0                                            CPU 1

  (seq 1)    ts(CPU 0)->idle_active = 1
             ts(CPU 0)->idle_entrytime = NOW()

  (seq 2)    sleeptime = NOW() - ts(CPU 0)->idle_entrytime
             ts(CPU 0)->idle_sleeptime += sleeptime           sleeptime = ts(CPU 0)->idle_sleeptime;
                                                              if (ts(CPU 0)->idle_active)
             ts(CPU 0)->idle_entrytime = NOW()                    sleeptime += NOW() - ts(CPU 0)->idle_entrytime

The resulting value of sleeptime on CPU 1 can vary depending on the
ordering:

* If it sees the value of idle_entrytime after seq 1 and the value of idle_sleeptime
after seq 2, the value of sleeptime will be buggy because the delta is accounted twice,
so it will be too high.

* If it sees the value of idle_entrytime after seq 2 and the value of idle_sleeptime
after seq 1, the value of sleeptime will be buggy because the delta is missed, so it
will be too low.

* If it sees both values as they were after seq 1, or both as they were after seq 2,
the value will be correct.

Trickier scenarios can also happen if the idle_active value is read from an earlier sequence.

Hence we must honour the following constraints:

- idle_sleeptime, idle_active and idle_entrytime must be updated and read
under some correctly enforced SMP ordering

- The three values as read by CPU 1 must belong to the same update
sequence from CPU 0. The update sequences must be delimited such that the
three values left after a completed sequence produce a coherent result
together when read from CPU 1.

- We need to prevent readers from fetching middle-of-sequence, half-updated values.

The ideal solution to implement this synchronization is to use a seqcount. Let's
use one here around these three values to enforce sequence synchronization between
updates and reads.
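
For reference, the resulting pattern is sketched below (API names as in
<linux/seqlock.h>; this is an outline, not the literal diff):

	/* Writer: only the local CPU updates the stats */
	write_seqcount_begin(&ts->idle_sleeptime_seq);
	ts->idle_entrytime = now;
	ts->idle_active = 1;
	write_seqcount_end(&ts->idle_sleeptime_seq);

	/* Reader: retries until it sees a consistent snapshot */
	do {
		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
		/* read idle_sleeptime, idle_active, idle_entrytime */
	} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));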

This fixes potential cpufreq governor bugs.
It also works towards fixing the reported bug where non-monotonic sleeptime stats
are returned by /proc/stat when it is frequently read.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/tick.h     |  1 +
 kernel/time/tick-sched.c | 37 ++++++++++++++++++++++++++-----------
 2 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index b84773c..4de1f9e 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -67,6 +67,7 @@ struct tick_sched {
 	ktime_t				idle_exittime;
 	ktime_t				idle_sleeptime;
 	ktime_t				iowait_sleeptime;
+	seqcount_t			idle_sleeptime_seq;
 	ktime_t				sleep_length;
 	unsigned long			last_jiffies;
 	unsigned long			next_jiffies;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 7d86a54..8f0f2ee 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -411,12 +411,14 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 	ktime_t delta;
 
 	/* Updates the per cpu time idle statistics counters */
+	write_seqcount_begin(&ts->idle_sleeptime_seq);
 	delta = ktime_sub(now, ts->idle_entrytime);
 	if (nr_iowait_cpu(smp_processor_id()) > 0)
 		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
 	else
 		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
 	ts->idle_active = 0;
+	write_seqcount_end(&ts->idle_sleeptime_seq);
 
 	sched_clock_idle_wakeup_event(0);
 }
@@ -425,9 +427,13 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
 {
 	ktime_t now = ktime_get();
 
+	write_seqcount_begin(&ts->idle_sleeptime_seq);
 	ts->idle_entrytime = now;
 	ts->idle_active = 1;
+	write_seqcount_end(&ts->idle_sleeptime_seq);
+
 	sched_clock_idle_sleep_event();
+
 	return now;
 }
 
@@ -449,6 +455,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 {
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 	ktime_t now, idle;
+	unsigned int seq;
 
 	if (!tick_nohz_active)
 		return -1;
@@ -457,15 +464,18 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 	if (last_update_time)
 		*last_update_time = ktime_to_us(now);
 
-	if (ts->idle_active && !nr_iowait_cpu(cpu)) {
-		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
-		idle = ktime_add(ts->idle_sleeptime, delta);
-	} else {
+	do {
+		ktime_t delta;
+
+		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
 		idle = ts->idle_sleeptime;
-	}
+		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
+			delta = ktime_sub(now, ts->idle_entrytime);
+			idle = ktime_add(idle, delta);
+		}
+	} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));
 
 	return ktime_to_us(idle);
-
 }
 EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
 
@@ -487,6 +497,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 {
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 	ktime_t now, iowait;
+	unsigned int seq;
 
 	if (!tick_nohz_active)
 		return -1;
@@ -495,12 +506,16 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 	if (last_update_time)
 		*last_update_time = ktime_to_us(now);
 
-	if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
-		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
-		iowait = ktime_add(ts->iowait_sleeptime, delta);
-	} else {
+	do {
+		ktime_t delta;
+
+		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
 		iowait = ts->iowait_sleeptime;
-	}
+		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
+			delta = ktime_sub(now, ts->idle_entrytime);
+			iowait = ktime_add(ts->iowait_sleeptime, delta);
+		}
+	} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));
 
 	return ktime_to_us(iowait);
 }
-- 
1.8.1.4



* [PATCH 3/4] nohz: Fix idle/iowait counts going backwards
  2014-04-24 18:45 [PATCH 1/4] nohz: Only update sleeptime stats locally Denys Vlasenko
  2014-04-24 18:45 ` [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock Denys Vlasenko
@ 2014-04-24 18:45 ` Denys Vlasenko
  2014-04-24 18:45 ` [PATCH 4/4] nohz: Fix iowait overcounting if iowait task migrates Denys Vlasenko
  2 siblings, 0 replies; 11+ messages in thread
From: Denys Vlasenko @ 2014-04-24 18:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Frederic Weisbecker, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven,
	Oleg Nesterov

With this change, the "iowait-ness" of every idle period is decided
at the moment it starts:
if this CPU's run-queue has tasks waiting on I/O at that moment, the idle
period's duration will be added to iowait_sleeptime.

This fixes the bug where iowait and/or idle counts could go backwards,
but iowait accounting is not precise (it can show more iowait
than there really is).
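
For clarity, the resulting encoding of ts->idle_active is (a comment
sketch; the values are used directly rather than via a named enum):

	/*
	 * ts->idle_active == 0: CPU is not idle
	 * ts->idle_active == 1: idle period began with no iowait tasks
	 * ts->idle_active == 2: idle period began with iowait tasks queued
	 */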

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 kernel/time/tick-sched.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 8f0f2ee..47ed7cf 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -413,7 +413,7 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 	/* Updates the per cpu time idle statistics counters */
 	write_seqcount_begin(&ts->idle_sleeptime_seq);
 	delta = ktime_sub(now, ts->idle_entrytime);
-	if (nr_iowait_cpu(smp_processor_id()) > 0)
+	if (ts->idle_active == 2)
 		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
 	else
 		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
@@ -429,7 +429,7 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
 
 	write_seqcount_begin(&ts->idle_sleeptime_seq);
 	ts->idle_entrytime = now;
-	ts->idle_active = 1;
+	ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;
 	write_seqcount_end(&ts->idle_sleeptime_seq);
 
 	sched_clock_idle_sleep_event();
@@ -469,7 +469,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 
 		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
 		idle = ts->idle_sleeptime;
-		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
+		if (ts->idle_active == 1) {
 			delta = ktime_sub(now, ts->idle_entrytime);
 			idle = ktime_add(idle, delta);
 		}
@@ -511,7 +511,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 
 		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
 		iowait = ts->iowait_sleeptime;
-		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
+		if (ts->idle_active == 2) {
 			delta = ktime_sub(now, ts->idle_entrytime);
 			iowait = ktime_add(ts->iowait_sleeptime, delta);
 		}
-- 
1.8.1.4



* [PATCH 4/4] nohz: Fix iowait overcounting if iowait task migrates
  2014-04-24 18:45 [PATCH 1/4] nohz: Only update sleeptime stats locally Denys Vlasenko
  2014-04-24 18:45 ` [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock Denys Vlasenko
  2014-04-24 18:45 ` [PATCH 3/4] nohz: Fix idle/iowait counts going backwards Denys Vlasenko
@ 2014-04-24 18:45 ` Denys Vlasenko
  2014-04-24 19:18   ` Peter Zijlstra
                     ` (2 more replies)
  2 siblings, 3 replies; 11+ messages in thread
From: Denys Vlasenko @ 2014-04-24 18:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Frederic Weisbecker, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven,
	Oleg Nesterov

Before this change, if the last IO-blocked task wakes up
on a different CPU, the original CPU may stay idle for much longer,
and the entire time it stays idle is accounted as iowait time.

This change adds a struct tick_sched::iowait_exittime member.
On entry to idle, it is set to KTIME_MAX.
The last IO-blocked task, if migrated, sets it to the current time.
Note that this can happen only once per idle period:
new iowaiting tasks can't magically appear on an idle CPU's rq.

If iowait_exittime is set, then (iowait_exittime - idle_entrytime)
gets accounted as iowait, and the remaining (now - iowait_exittime)
as "true" idle.
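
For an idle period that began as iowait (idle_active == 2), the
accounting in tick_nohz_stop_idle() then amounts to this sketch, in
plain arithmetic instead of the ktime_*() helpers used in the code:

	if (iowait_exittime != KTIME_MAX) {
		/* iowait task migrated away mid-period: split the time */
		iowait_sleeptime += iowait_exittime - idle_entrytime;
		idle_sleeptime   += now - iowait_exittime;
	} else {
		/* still iowait at idle exit: all of it counts as iowait */
		iowait_sleeptime += now - idle_entrytime;
	}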

Run-tested: /proc/stat counters no longer go backwards.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/tick.h     |  2 ++
 kernel/sched/core.c      | 14 +++++++++++
 kernel/time/tick-sched.c | 64 ++++++++++++++++++++++++++++++++++++++++--------
 3 files changed, 70 insertions(+), 10 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 4de1f9e..1bf653e 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -67,6 +67,7 @@ struct tick_sched {
 	ktime_t				idle_exittime;
 	ktime_t				idle_sleeptime;
 	ktime_t				iowait_sleeptime;
+	ktime_t				iowait_exittime;
 	seqcount_t			idle_sleeptime_seq;
 	ktime_t				sleep_length;
 	unsigned long			last_jiffies;
@@ -140,6 +141,7 @@ extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
+extern void tick_nohz_iowait_to_idle(int cpu);
 
 # else /* !CONFIG_NO_HZ_COMMON */
 static inline int tick_nohz_tick_stopped(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 268a45e..ffea757 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4218,7 +4218,14 @@ void __sched io_schedule(void)
 	current->in_iowait = 1;
 	schedule();
 	current->in_iowait = 0;
+#ifdef CONFIG_NO_HZ_COMMON
+	if (atomic_dec_and_test(&rq->nr_iowait)) {
+		if (raw_smp_processor_id() != cpu_of(rq))
+			tick_nohz_iowait_to_idle(cpu_of(rq));
+	}
+#else
 	atomic_dec(&rq->nr_iowait);
+#endif
 	delayacct_blkio_end();
 }
 EXPORT_SYMBOL(io_schedule);
@@ -4234,7 +4241,14 @@ long __sched io_schedule_timeout(long timeout)
 	current->in_iowait = 1;
 	ret = schedule_timeout(timeout);
 	current->in_iowait = 0;
+#ifdef CONFIG_NO_HZ_COMMON
+	if (atomic_dec_and_test(&rq->nr_iowait)) {
+		if (raw_smp_processor_id() != cpu_of(rq))
+			tick_nohz_iowait_to_idle(cpu_of(rq));
+	}
+#else
 	atomic_dec(&rq->nr_iowait);
+#endif
 	delayacct_blkio_end();
 	return ret;
 }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 47ed7cf..d78c942 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -408,15 +408,27 @@ static void tick_nohz_update_jiffies(ktime_t now)
 
 static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 {
-	ktime_t delta;
+	ktime_t delta, entry, end;
 
 	/* Updates the per cpu time idle statistics counters */
 	write_seqcount_begin(&ts->idle_sleeptime_seq);
-	delta = ktime_sub(now, ts->idle_entrytime);
-	if (ts->idle_active == 2)
+	entry = ts->idle_entrytime;
+	delta = ktime_sub(now, entry);
+	if (ts->idle_active == 2) {
+		end = ts->iowait_exittime;
+		if (end.tv64 != KTIME_MAX) {
+			/*
+			 * Last iowaiting task on our rq was woken up on other CPU
+			 * sometime in the past, it updated ts->iowait_exittime.
+			 */
+			delta = ktime_sub(now, end);
+			ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+			delta = ktime_sub(end, entry);
+		}
 		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
-	else
+	} else {
 		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+	}
 	ts->idle_active = 0;
 	write_seqcount_end(&ts->idle_sleeptime_seq);
 
@@ -430,6 +442,7 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
 	write_seqcount_begin(&ts->idle_sleeptime_seq);
 	ts->idle_entrytime = now;
 	ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;
+	ts->iowait_exittime.tv64 = KTIME_MAX;
 	write_seqcount_end(&ts->idle_sleeptime_seq);
 
 	sched_clock_idle_sleep_event();
@@ -437,6 +450,16 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
 	return now;
 }
 
+void tick_nohz_iowait_to_idle(int cpu)
+{
+	struct tick_sched *ts = tick_get_tick_sched(cpu);
+	ktime_t now = ktime_get();
+
+	write_seqcount_begin(&ts->idle_sleeptime_seq);
+	ts->iowait_exittime = now;
+	write_seqcount_end(&ts->idle_sleeptime_seq);
+}
+
 /**
  * get_cpu_idle_time_us - get the total idle time of a cpu
  * @cpu: CPU number to query
@@ -465,13 +488,26 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 		*last_update_time = ktime_to_us(now);
 
 	do {
-		ktime_t delta;
+		ktime_t start, delta, iowait_exit;
 
 		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
 		idle = ts->idle_sleeptime;
-		if (ts->idle_active == 1) {
-			delta = ktime_sub(now, ts->idle_entrytime);
+		if (ts->idle_active /* either 1 or 2 */) {
+			start = ts->idle_entrytime;
+			if (ts->idle_active == 2) {
+				/* This idle period started as "iowait idle" */
+				iowait_exit = ts->iowait_exittime;
+				if (iowait_exit.tv64 == KTIME_MAX)
+					goto skip; /* and it still is */
+				/*
+				 * This CPU used to be "iowait idle", but iowait task
+				 * has migrated. The rest of idle time is "true idle":
+				 */
+				start = iowait_exit;
+			}
+			delta = ktime_sub(now, start);
 			idle = ktime_add(idle, delta);
+ skip: ;
 		}
 	} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));
 
@@ -507,13 +543,21 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 		*last_update_time = ktime_to_us(now);
 
 	do {
-		ktime_t delta;
+		ktime_t delta, end;
 
 		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
 		iowait = ts->iowait_sleeptime;
 		if (ts->idle_active == 2) {
-			delta = ktime_sub(now, ts->idle_entrytime);
-			iowait = ktime_add(ts->iowait_sleeptime, delta);
+			/*
+			 * If last iowaiting task on our rq was woken up on other CPU
+			 * sometime in the past, it updated ts->iowait_exittime.
+			 * Otherwise, ts->iowait_exittime == KTIME_MAX.
+			 */
+			end = ts->iowait_exittime;
+			if (end.tv64 == KTIME_MAX)
+				end = now;
+			delta = ktime_sub(end, ts->idle_entrytime);
+			iowait = ktime_add(iowait, delta);
 		}
 	} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));
 
-- 
1.8.1.4



* Re: [PATCH 4/4] nohz: Fix iowait overcounting if iowait task migrates
  2014-04-24 18:45 ` [PATCH 4/4] nohz: Fix iowait overcounting if iowait task migrates Denys Vlasenko
@ 2014-04-24 19:18   ` Peter Zijlstra
  2014-04-25 18:56     ` Denys Vlasenko
  2014-04-29 14:47   ` Frederic Weisbecker
  2014-04-29 16:20   ` Frederic Weisbecker
  2 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2014-04-24 19:18 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Frederic Weisbecker, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Andrew Morton, Arjan van de Ven, Oleg Nesterov

On Thu, Apr 24, 2014 at 08:45:58PM +0200, Denys Vlasenko wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 268a45e..ffea757 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4218,7 +4218,14 @@ void __sched io_schedule(void)
>  	current->in_iowait = 1;
>  	schedule();
>  	current->in_iowait = 0;
> +#ifdef CONFIG_NO_HZ_COMMON
> +	if (atomic_dec_and_test(&rq->nr_iowait)) {
> +		if (raw_smp_processor_id() != cpu_of(rq))
> +			tick_nohz_iowait_to_idle(cpu_of(rq));
> +	}
> +#else
>  	atomic_dec(&rq->nr_iowait);
> +#endif
>  	delayacct_blkio_end();
>  }

You're really refusing to collapse that stuff eh? Maybe I should just
redirect this entire iowait fest to /dev/null :/

> +void tick_nohz_iowait_to_idle(int cpu)
> +{
> +	struct tick_sched *ts = tick_get_tick_sched(cpu);
> +	ktime_t now = ktime_get();
> +
> +	write_seqcount_begin(&ts->idle_sleeptime_seq);
> +	ts->iowait_exittime = now;
> +	write_seqcount_end(&ts->idle_sleeptime_seq);
> +}


So what again was wrong with this one?

  http://marc.info/?l=linux-kernel&m=139772917211023



* Re: [PATCH 4/4] nohz: Fix iowait overcounting if iowait task migrates
  2014-04-24 19:18   ` Peter Zijlstra
@ 2014-04-25 18:56     ` Denys Vlasenko
  0 siblings, 0 replies; 11+ messages in thread
From: Denys Vlasenko @ 2014-04-25 18:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Frederic Weisbecker, Hidetoshi Seto,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Andrew Morton, Arjan van de Ven, Oleg Nesterov

On 04/24/2014 09:18 PM, Peter Zijlstra wrote:
> On Thu, Apr 24, 2014 at 08:45:58PM +0200, Denys Vlasenko wrote:
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 268a45e..ffea757 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4218,7 +4218,14 @@ void __sched io_schedule(void)
>>  	current->in_iowait = 1;
>>  	schedule();
>>  	current->in_iowait = 0;
>> +#ifdef CONFIG_NO_HZ_COMMON
>> +	if (atomic_dec_and_test(&rq->nr_iowait)) {
>> +		if (raw_smp_processor_id() != cpu_of(rq))
>> +			tick_nohz_iowait_to_idle(cpu_of(rq));
>> +	}
>> +#else
>>  	atomic_dec(&rq->nr_iowait);
>> +#endif
>>  	delayacct_blkio_end();
>>  }
> 
> You're really refusing to collapse that stuff eh?

I'm sending two patches on top of my last patch set
which tidy up a few such aspects (another one
is where we fetch a per-cpu variable before knowing
whether we'll need it, potentially wasting a few cycles).


>> +void tick_nohz_iowait_to_idle(int cpu)
>> +{
>> +	struct tick_sched *ts = tick_get_tick_sched(cpu);
>> +	ktime_t now = ktime_get();
>> +
>> +	write_seqcount_begin(&ts->idle_sleeptime_seq);
>> +	ts->iowait_exittime = now;
>> +	write_seqcount_end(&ts->idle_sleeptime_seq);
>> +}
>
>
> So what again was wrong with this one?
>
>   http://marc.info/?l=linux-kernel&m=139772917211023

That code has no provision to record when the last iowait task
left the rq.

Therefore it can undercount iowait - it's very similar
to the problem I had before patch #4 in my patch series.

My patches 1-3 can overcount iowait because they consider
the entire idle period "iowait" if nr_iowait_cpu() != 0
at the *beginning*.

Hidetoshi's patches consider the entire idle period "iowait"
if nr_iowait_cpu() != 0 at the *end*.

He needs to code carefully so that this delayed decision
doesn't make reader functions return wrong results.

However, if nr_iowait_cpu() was 0 at the end, that does not mean
it was also 0 for most of this time period. It could have been
mostly !0 - and in that case iowait will be undercounted.
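
A concrete example with made-up numbers: the CPU idles from t=0 to
t=10s, and nr_iowait_cpu() is non-zero until t=9s. Deciding at the
*end* accounts all 10 seconds as plain idle, undercounting 9 seconds
of iowait; deciding at the *beginning*, as my patches 1-3 do, accounts
all 10 seconds as iowait, overcounting 1 second.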

I personally thought that either over- or undercounting iowait
might be acceptable.

If not, then *some* form of recording and accounting
for the exact moment when the last iowait task left the rq is necessary.
That's what I did in patch #4.

-- 
vda



* Re: [PATCH 4/4] nohz: Fix iowait overcounting if iowait task migrates
  2014-04-24 18:45 ` [PATCH 4/4] nohz: Fix iowait overcounting if iowait task migrates Denys Vlasenko
  2014-04-24 19:18   ` Peter Zijlstra
@ 2014-04-29 14:47   ` Frederic Weisbecker
  2014-05-05 18:06     ` Denys Vlasenko
  2014-04-29 16:20   ` Frederic Weisbecker
  2 siblings, 1 reply; 11+ messages in thread
From: Frederic Weisbecker @ 2014-04-29 14:47 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Hidetoshi Seto, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Arjan van de Ven, Oleg Nesterov

On Thu, Apr 24, 2014 at 08:45:58PM +0200, Denys Vlasenko wrote:
> Before this change, if the last IO-blocked task wakes up
> on a different CPU, the original CPU may stay idle for much longer,
> and the entire time it stays idle is accounted as iowait time.
> 
> This change adds a struct tick_sched::iowait_exittime member.
> On entry to idle, it is set to KTIME_MAX.
> The last IO-blocked task, if migrated, sets it to the current time.
> Note that this can happen only once per idle period:
> new iowaiting tasks can't magically appear on an idle CPU's rq.
> 
> If iowait_exittime is set, then (iowait_exittime - idle_entrytime)
> gets accounted as iowait, and the remaining (now - iowait_exittime)
> as "true" idle.
> 
> Run-tested: /proc/stat counters no longer go backwards.
> 
> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
> Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> ---
>  include/linux/tick.h     |  2 ++
>  kernel/sched/core.c      | 14 +++++++++++
>  kernel/time/tick-sched.c | 64 ++++++++++++++++++++++++++++++++++++++++--------
>  3 files changed, 70 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index 4de1f9e..1bf653e 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -67,6 +67,7 @@ struct tick_sched {
>  	ktime_t				idle_exittime;
>  	ktime_t				idle_sleeptime;
>  	ktime_t				iowait_sleeptime;
> +	ktime_t				iowait_exittime;
>  	seqcount_t			idle_sleeptime_seq;
>  	ktime_t				sleep_length;
>  	unsigned long			last_jiffies;
> @@ -140,6 +141,7 @@ extern void tick_nohz_irq_exit(void);
>  extern ktime_t tick_nohz_get_sleep_length(void);
>  extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
>  extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
> +extern void tick_nohz_iowait_to_idle(int cpu);
>  
>  # else /* !CONFIG_NO_HZ_COMMON */
>  static inline int tick_nohz_tick_stopped(void)
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 268a45e..ffea757 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4218,7 +4218,14 @@ void __sched io_schedule(void)
>  	current->in_iowait = 1;
>  	schedule();
>  	current->in_iowait = 0;
> +#ifdef CONFIG_NO_HZ_COMMON
> +	if (atomic_dec_and_test(&rq->nr_iowait)) {
> +		if (raw_smp_processor_id() != cpu_of(rq))
> +			tick_nohz_iowait_to_idle(cpu_of(rq));
> +	}
> +#else
>  	atomic_dec(&rq->nr_iowait);
> +#endif
>  	delayacct_blkio_end();
>  }
>  EXPORT_SYMBOL(io_schedule);
> @@ -4234,7 +4241,14 @@ long __sched io_schedule_timeout(long timeout)
>  	current->in_iowait = 1;
>  	ret = schedule_timeout(timeout);
>  	current->in_iowait = 0;
> +#ifdef CONFIG_NO_HZ_COMMON
> +	if (atomic_dec_and_test(&rq->nr_iowait)) {
> +		if (raw_smp_processor_id() != cpu_of(rq))
> +			tick_nohz_iowait_to_idle(cpu_of(rq));
> +	}
> +#else
>  	atomic_dec(&rq->nr_iowait);
> +#endif
>  	delayacct_blkio_end();
>  	return ret;
>  }
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 47ed7cf..d78c942 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -408,15 +408,27 @@ static void tick_nohz_update_jiffies(ktime_t now)
>  
>  static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
>  {
> -	ktime_t delta;
> +	ktime_t delta, entry, end;
>  
>  	/* Updates the per cpu time idle statistics counters */
>  	write_seqcount_begin(&ts->idle_sleeptime_seq);
> -	delta = ktime_sub(now, ts->idle_entrytime);
> -	if (ts->idle_active == 2)
> +	entry = ts->idle_entrytime;
> +	delta = ktime_sub(now, entry);
> +	if (ts->idle_active == 2) {
> +		end = ts->iowait_exittime;
> +		if (end.tv64 != KTIME_MAX) {
> +			/*
> +			 * Last iowaiting task on our rq was woken up on other CPU
> +			 * sometime in the past, it updated ts->iowait_exittime.
> +			 */
> +			delta = ktime_sub(now, end);
> +			ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
> +			delta = ktime_sub(end, entry);
> +		}
>  		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
> -	else
> +	} else {
>  		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
> +	}
>  	ts->idle_active = 0;
>  	write_seqcount_end(&ts->idle_sleeptime_seq);
>  
> @@ -430,6 +442,7 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
>  	write_seqcount_begin(&ts->idle_sleeptime_seq);
>  	ts->idle_entrytime = now;
>  	ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;
> +	ts->iowait_exittime.tv64 = KTIME_MAX;
>  	write_seqcount_end(&ts->idle_sleeptime_seq);
>  
>  	sched_clock_idle_sleep_event();
> @@ -437,6 +450,16 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
>  	return now;
>  }
>  
> +void tick_nohz_iowait_to_idle(int cpu)
> +{
> +	struct tick_sched *ts = tick_get_tick_sched(cpu);
> +	ktime_t now = ktime_get();
> +
> +	write_seqcount_begin(&ts->idle_sleeptime_seq);
> +	ts->iowait_exittime = now;
> +	write_seqcount_end(&ts->idle_sleeptime_seq);

So now you have two concurrent updaters using the seqcount, which is
very dangerous as the counters aren't updated atomically.

seqcount is only suitable when there is a single sequential updater.
Once you deal with concurrent updaters you need seqlock.

And once you add seqlock in the hot scheduler path, you're hitting
a big scalability issue.
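
To illustrate the failure mode: write_seqcount_begin() makes the
sequence odd and write_seqcount_end() makes it even again, so with two
concurrent writers a reader can observe an even sequence in the middle
of an update and accept torn values without retrying:

	CPU A: write_seqcount_begin()   -> sequence becomes odd
	CPU B: write_seqcount_begin()   -> sequence becomes even again
	reader: sees the same even sequence twice, does not retry,
	        but the data it read may be half-updated

A seqlock_t avoids this by serializing writers with a spinlock.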


* Re: [PATCH 4/4] nohz: Fix iowait overcounting if iowait task migrates
  2014-04-24 18:45 ` [PATCH 4/4] nohz: Fix iowait overcounting if iowait task migrates Denys Vlasenko
  2014-04-24 19:18   ` Peter Zijlstra
  2014-04-29 14:47   ` Frederic Weisbecker
@ 2014-04-29 16:20   ` Frederic Weisbecker
  2014-05-05 18:14     ` Denys Vlasenko
  2 siblings, 1 reply; 11+ messages in thread
From: Frederic Weisbecker @ 2014-04-29 16:20 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Hidetoshi Seto, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Arjan van de Ven, Oleg Nesterov

On Thu, Apr 24, 2014 at 08:45:58PM +0200, Denys Vlasenko wrote:
> Before this change, if the last IO-blocked task wakes up
> on a different CPU, the original CPU may stay idle for much longer,
> and the entire time it stays idle is accounted as iowait time.
> 
> This change adds a struct tick_sched::iowait_exittime member.
> On entry to idle, it is set to KTIME_MAX.
> The last IO-blocked task, if migrated, sets it to the current time.
> Note that this can happen only once per idle period:
> new iowaiting tasks can't magically appear on an idle CPU's rq.
> 
> If iowait_exittime is set, then (iowait_exittime - idle_entrytime)
> gets accounted as iowait, and the remaining (now - iowait_exittime)
> as "true" idle.
> 
> Run-tested: /proc/stat counters no longer go backwards.
> 
> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
> Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> ---
>  include/linux/tick.h     |  2 ++
>  kernel/sched/core.c      | 14 +++++++++++
>  kernel/time/tick-sched.c | 64 ++++++++++++++++++++++++++++++++++++++++--------
>  3 files changed, 70 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index 4de1f9e..1bf653e 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -67,6 +67,7 @@ struct tick_sched {
>  	ktime_t				idle_exittime;
>  	ktime_t				idle_sleeptime;
>  	ktime_t				iowait_sleeptime;
> +	ktime_t				iowait_exittime;
>  	seqcount_t			idle_sleeptime_seq;
>  	ktime_t				sleep_length;
>  	unsigned long			last_jiffies;
> @@ -140,6 +141,7 @@ extern void tick_nohz_irq_exit(void);
>  extern ktime_t tick_nohz_get_sleep_length(void);
>  extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
>  extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
> +extern void tick_nohz_iowait_to_idle(int cpu);
>  
>  # else /* !CONFIG_NO_HZ_COMMON */
>  static inline int tick_nohz_tick_stopped(void)
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 268a45e..ffea757 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4218,7 +4218,14 @@ void __sched io_schedule(void)
>  	current->in_iowait = 1;
>  	schedule();
>  	current->in_iowait = 0;
> +#ifdef CONFIG_NO_HZ_COMMON
> +	if (atomic_dec_and_test(&rq->nr_iowait)) {
> +		if (raw_smp_processor_id() != cpu_of(rq))
> +			tick_nohz_iowait_to_idle(cpu_of(rq));

Note that even using a seqlock alone doesn't help fix the preemption issue,
where the above may overwrite the exittime recorded for the next last
iowait task of the old rq.


* Re: [PATCH 4/4] nohz: Fix iowait overcounting if iowait task migrates
  2014-04-29 14:47   ` Frederic Weisbecker
@ 2014-05-05 18:06     ` Denys Vlasenko
  2014-05-07 13:38       ` Denys Vlasenko
  0 siblings, 1 reply; 11+ messages in thread
From: Denys Vlasenko @ 2014-05-05 18:06 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, Hidetoshi Seto, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Arjan van de Ven, Oleg Nesterov

On 04/29/2014 04:47 PM, Frederic Weisbecker wrote:
> On Thu, Apr 24, 2014 at 08:45:58PM +0200, Denys Vlasenko wrote:
>> +void tick_nohz_iowait_to_idle(int cpu)
>> +{
>> +	struct tick_sched *ts = tick_get_tick_sched(cpu);
>> +	ktime_t now = ktime_get();
>> +
>> +	write_seqcount_begin(&ts->idle_sleeptime_seq);
>> +	ts->iowait_exittime = now;
>> +	write_seqcount_end(&ts->idle_sleeptime_seq);
> 
> So now you have two concurrent updaters using the seqcount, which is
> very dangerous as the counters aren't updated atomically.
> 
> seqcount is only suitable when there is a single sequential updater.
> Once you deal with concurrent updaters you need seqlock.
> 
> And once you add seqlock in the hot scheduler path, you're hitting
> a big scalability issue.

What I need here is merely an atomic store.
The complication is, of course, that ktime_t is not atomic[64]_t.

How do you think I can do an atomic store?




* Re: [PATCH 4/4] nohz: Fix iowait overcounting if iowait task migrates
  2014-04-29 16:20   ` Frederic Weisbecker
@ 2014-05-05 18:14     ` Denys Vlasenko
  0 siblings, 0 replies; 11+ messages in thread
From: Denys Vlasenko @ 2014-05-05 18:14 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, Hidetoshi Seto, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Arjan van de Ven, Oleg Nesterov

On 04/29/2014 06:20 PM, Frederic Weisbecker wrote:
>> index 268a45e..ffea757 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4218,7 +4218,14 @@ void __sched io_schedule(void)
>>  	current->in_iowait = 1;
>>  	schedule();
>>  	current->in_iowait = 0;
>> +#ifdef CONFIG_NO_HZ_COMMON
>> +	if (atomic_dec_and_test(&rq->nr_iowait)) {
>> +		if (raw_smp_processor_id() != cpu_of(rq))
>> +			tick_nohz_iowait_to_idle(cpu_of(rq));
> 
> Note that even using seqlock doesn't alone help to fix the preemption issue
> when the above may overwrite the exittime of the next last iowait task from
> the old rq.

Meaning: between atomic_dec_and_test(&rq->nr_iowait) going to zero
and tick_nohz_iowait_to_idle(), the CPU left idle, acquired a new task,
this task submitted IO, the IO completed, this second task also reached
this place, its atomic_dec_and_test(&rq->nr_iowait) went to zero,
and now we race doing

	ts->iowait_exittime = now;

from two tasks?

This shouldn't be a problem anyway since the value of "now"
in this situation will be nearly the same - provided that
(a) the store is atomic (no corruption due to the race) and
(b) we are careful when using it (for example, we
clamp it to be no earlier than the idle start time).
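
The clamp in (b) could be as simple as this sketch (hypothetical, not
part of the posted patches):

	/* ignore an exittime that predates this idle period */
	if (end.tv64 < ts->idle_entrytime.tv64)
		end = ts->idle_entrytime;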

Do you see a problem here that I'm missing?



* Re: [PATCH 4/4] nohz: Fix iowait overcounting if iowait task migrates
  2014-05-05 18:06     ` Denys Vlasenko
@ 2014-05-07 13:38       ` Denys Vlasenko
  0 siblings, 0 replies; 11+ messages in thread
From: Denys Vlasenko @ 2014-05-07 13:38 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, Hidetoshi Seto, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Arjan van de Ven, Oleg Nesterov

On 05/05/2014 08:06 PM, Denys Vlasenko wrote:
> On 04/29/2014 04:47 PM, Frederic Weisbecker wrote:
>> On Thu, Apr 24, 2014 at 08:45:58PM +0200, Denys Vlasenko wrote:
>>> +void tick_nohz_iowait_to_idle(int cpu)
>>> +{
>>> +	struct tick_sched *ts = tick_get_tick_sched(cpu);
>>> +	ktime_t now = ktime_get();
>>> +
>>> +	write_seqcount_begin(&ts->idle_sleeptime_seq);
>>> +	ts->iowait_exittime = now;
>>> +	write_seqcount_end(&ts->idle_sleeptime_seq);
>>
>> So now you have two concurrent updaters using the seqcount, which is
>> very dangerous as the counters aren't updated atomically.
>>
>> seqcount is only suitable when there is a single sequential updater.
>> Once you deal with concurrent updaters you need seqlock.
>>
>> And once you add seqlock in the hot scheduler path, you're hitting
>> a big scalability issue.
> 
> What I need here is merely an atomic store.
> The complication is, of course, that ktime_t is not atomic[64]_t.
> 
> How do you think I can do an atomic store?

Ok, I think I have a version which uses atomic64_t
and doesn't look ugly. Sending new patchset now...
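
A minimal sketch of that approach, assuming iowait_exittime becomes an
atomic64_t holding nanoseconds (the actual patchset may differ):

	/* writer: a plain atomic store, no seqcount needed for this field */
	atomic64_set(&ts->iowait_exittime, ktime_to_ns(now));

	/* reader */
	ktime_t end = ns_to_ktime(atomic64_read(&ts->iowait_exittime));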


