* [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats
@ 2013-08-16 15:42 Frederic Weisbecker
  2013-08-16 15:42 ` [PATCH 1/4] nohz: Only update sleeptime stats locally Frederic Weisbecker
                   ` (4 more replies)
  0 siblings, 5 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-16 15:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Frederic Weisbecker, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Arjan van de Ven

Hi,

This is a resend to include Oleg in Cc; there is no modification since the last version:
https://lkml.org/lkml/2013/8/8/638.

Tree is still:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	timers/nohz-sleeptime-fix

Thanks.

Frederic Weisbecker (4):
  nohz: Only update sleeptime stats locally
  nohz: Synchronize sleep time stats with seqlock
  nohz: Consolidate sleep time stats read code
  nohz: Convert a few places to use local per cpu accesses

 include/linux/tick.h     |    2 +
 kernel/time/tick-sched.c |  124 +++++++++++++++++++--------------------------
 2 files changed, 54 insertions(+), 72 deletions(-)

-- 
1.7.5.4


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH 1/4] nohz: Only update sleeptime stats locally
  2013-08-16 15:42 [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats Frederic Weisbecker
@ 2013-08-16 15:42 ` Frederic Weisbecker
  2013-08-18 16:49   ` Oleg Nesterov
  2013-08-18 17:04   ` Oleg Nesterov
  2013-08-16 15:42 ` [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock Frederic Weisbecker
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-16 15:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Frederic Weisbecker, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Arjan van de Ven

The idle and io sleeptime stats can be updated concurrently from callers
of get_cpu_idle_time_us(), get_cpu_iowait_time_us() and tick_nohz_stop_idle().

Updaters can easily race and mess up the internal data coherency, for example
when a governor calls a get_cpu_*_time_us() API and the target CPU exits idle at
the same time, because no locking whatsoever protects against this
concurrency.

To fix this, let's only update the sleeptime stats locally when the CPU
exits from idle. This is the only place where a real update is truly
needed. The callers of get_cpu_*_time_us() can simply add up the pending
sleep time delta to the last sleeptime snapshot in order to get a coherent
result. There is no need for them to also update the stats.
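
The reader-side computation described above can be sketched in plain userspace
C (hypothetical names; microsecond integers stand in for ktime_t):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the tick_sched fields involved; all
 * times are in microseconds for simplicity (the kernel uses ktime_t). */
struct sleep_stats {
	uint64_t idle_sleeptime;	/* accumulated idle time */
	uint64_t idle_entrytime;	/* timestamp of the last idle entry */
	int idle_active;		/* nonzero while the CPU is idle */
};

/* Reader side: add the pending delta to the last snapshot without
 * writing anything back, as the changelog above describes. */
static uint64_t read_idle_time_us(const struct sleep_stats *ts, uint64_t now)
{
	uint64_t idle = ts->idle_sleeptime;

	if (ts->idle_active)
		idle += now - ts->idle_entrytime; /* pending, not yet flushed */
	return idle;
}
```

Only the CPU exiting idle folds the delta into idle_sleeptime; readers compute
it on the fly and leave the stats untouched.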

Reported-by: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 kernel/time/tick-sched.c |   65 +++++++++++++++------------------------------
 1 files changed, 22 insertions(+), 43 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index e77edc9..ede0405 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -408,33 +408,18 @@ static void tick_nohz_update_jiffies(ktime_t now)
 	touch_softlockup_watchdog();
 }
 
-/*
- * Updates the per cpu time idle statistics counters
- */
-static void
-update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
-{
-	ktime_t delta;
-
-	if (ts->idle_active) {
-		delta = ktime_sub(now, ts->idle_entrytime);
-		if (nr_iowait_cpu(cpu) > 0)
-			ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
-		else
-			ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
-		ts->idle_entrytime = now;
-	}
-
-	if (last_update_time)
-		*last_update_time = ktime_to_us(now);
-
-}
-
 static void tick_nohz_stop_idle(int cpu, ktime_t now)
 {
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+	ktime_t delta;
 
-	update_ts_time_stats(cpu, ts, now, NULL);
+	/* Updates the per cpu time idle statistics counters */
+	delta = ktime_sub(now, ts->idle_entrytime);
+	if (nr_iowait_cpu(cpu) > 0)
+		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
+	else
+		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+	ts->idle_entrytime = now;
 	ts->idle_active = 0;
 
 	sched_clock_idle_wakeup_event(0);
@@ -473,17 +458,14 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 		return -1;
 
 	now = ktime_get();
-	if (last_update_time) {
-		update_ts_time_stats(cpu, ts, now, last_update_time);
-		idle = ts->idle_sleeptime;
-	} else {
-		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
-			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+	if (last_update_time)
+		*last_update_time = ktime_to_us(now);
 
-			idle = ktime_add(ts->idle_sleeptime, delta);
-		} else {
-			idle = ts->idle_sleeptime;
-		}
+	if (ts->idle_active && !nr_iowait_cpu(cpu)) {
+		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+		idle = ktime_add(ts->idle_sleeptime, delta);
+	} else {
+		idle = ts->idle_sleeptime;
 	}
 
 	return ktime_to_us(idle);
@@ -514,17 +496,14 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 		return -1;
 
 	now = ktime_get();
-	if (last_update_time) {
-		update_ts_time_stats(cpu, ts, now, last_update_time);
-		iowait = ts->iowait_sleeptime;
-	} else {
-		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
-			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+	if (last_update_time)
+		*last_update_time = ktime_to_us(now);
 
-			iowait = ktime_add(ts->iowait_sleeptime, delta);
-		} else {
-			iowait = ts->iowait_sleeptime;
-		}
+	if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
+		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+		iowait = ktime_add(ts->iowait_sleeptime, delta);
+	} else {
+		iowait = ts->iowait_sleeptime;
 	}
 
 	return ktime_to_us(iowait);
-- 
1.7.5.4



* [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 15:42 [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats Frederic Weisbecker
  2013-08-16 15:42 ` [PATCH 1/4] nohz: Only update sleeptime stats locally Frederic Weisbecker
@ 2013-08-16 15:42 ` Frederic Weisbecker
  2013-08-16 16:02   ` Oleg Nesterov
  2013-08-18 16:54   ` Oleg Nesterov
  2013-08-16 15:42 ` [PATCH 3/4] nohz: Consolidate sleep time stats read code Frederic Weisbecker
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-16 15:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Frederic Weisbecker, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Arjan van de Ven

When some call site uses get_cpu_*_time_us() to read a sleeptime
stat, it deduces the total sleeptime by adding the pending time
to the last sleeptime snapshot if the CPU target is idle.

Namely this sums up to:

       sleeptime = ts($CPU)->idle_sleeptime;
       if (ts($CPU)->idle_active)
       	  sleeptime += NOW() - ts($CPU)->idle_entrytime

But this only works if idle_sleeptime, idle_entrytime and idle_active are
read and updated under some disciplined order.

Let's consider the following scenario:

             CPU 0                                            CPU 1

  (seq 1)    ts(CPU 0)->idle_active = 1
             ts(CPU 0)->idle_entrytime = NOW()

  (seq 2)    sleeptime = NOW() - ts(CPU 0)->idle_entrytime
             ts(CPU 0)->idle_sleeptime += sleeptime           sleeptime = ts(CPU 0)->idle_sleeptime;
                                                              if (ts(CPU 0)->idle_active)
             ts(CPU 0)->idle_entrytime = NOW()                    sleeptime += NOW() - ts(CPU 0)->idle_entrytime

The resulting value of sleeptime on CPU 1 can vary depending on the
ordering of these accesses:

* If it sees the value of idle_entrytime after seq 1 and the value of idle_sleeptime
after seq 2, the value of sleeptime will be buggy because it accounts for the delta twice,
so it will be too high.

* If it sees the value of idle_entrytime after seq 2 and the value of idle_sleeptime
after seq 1, the value of sleeptime will be buggy because it misses the delta, so it
will be too low.

* If it sees the values of idle_entrytime and idle_sleeptime both as seen after seq 1,
or both as seen after seq 2, the value will be correct.

Trickier scenarios can also happen if the idle_active value is read from a former sequence.

Hence we must honour the following constraints:

- idle_sleeptime, idle_active and idle_entrytime must be updated and read
under some correctly enforced SMP ordering

- The three values as read by CPU 1 must belong to the same update
sequence from CPU 0. The update sequences must be delimited such that the
three resulting values after a sequence completion produce a coherent result
together when read from CPU 1.

- We must prevent readers from fetching middle-of-sequence values.

The ideal solution to implement this synchronization is to use a seqcount. Let's
use one here around these three values to enforce sequence synchronization between
updates and reads.
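
The seqcount protocol can be modeled in userspace C as follows. This is an
illustrative single-threaded sketch: the kernel's memory barriers are omitted
and the toy_* names are stand-ins for the real primitives.

```c
#include <assert.h>

/* Toy model of a seqcount: even = stable, odd = write in progress.
 * The real kernel primitives add memory barriers omitted here. */
struct toy_seqcount {
	unsigned int seq;
};

static void toy_write_begin(struct toy_seqcount *s) { s->seq++; } /* becomes odd */
static void toy_write_end(struct toy_seqcount *s)   { s->seq++; } /* even again */

static unsigned int toy_read_begin(const struct toy_seqcount *s)
{
	unsigned int seq;

	do {
		seq = s->seq;
	} while (seq & 1);	/* wait out a writer in progress */
	return seq;
}

/* Nonzero means a writer completed meanwhile: redo the read section. */
static int toy_read_retry(const struct toy_seqcount *s, unsigned int seq)
{
	return s->seq != seq;
}
```

A reader that sees the same even sequence number before and after its reads is
guaranteed a coherent snapshot of all three fields; otherwise it loops.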

This fixes a reported bug where non-monotonic sleeptime stats are returned by /proc/stat
when it is read frequently, as well as potential cpufreq governor bugs.

Reported-by: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 include/linux/tick.h     |    2 ++
 kernel/time/tick-sched.c |   37 +++++++++++++++++++++++++------------
 2 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 62bd8b7..49f9720 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -10,6 +10,7 @@
 #include <linux/irqflags.h>
 #include <linux/percpu.h>
 #include <linux/hrtimer.h>
+#include <linux/seqlock.h>
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 
@@ -70,6 +71,7 @@ struct tick_sched {
 	unsigned long			next_jiffies;
 	ktime_t				idle_expires;
 	int				do_timer_last;
+	seqcount_t			sleeptime_seq;
 };
 
 extern void __init tick_init(void);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index ede0405..f7fc27b 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -414,6 +414,7 @@ static void tick_nohz_stop_idle(int cpu, ktime_t now)
 	ktime_t delta;
 
 	/* Updates the per cpu time idle statistics counters */
+	write_seqcount_begin(&ts->sleeptime_seq);
 	delta = ktime_sub(now, ts->idle_entrytime);
 	if (nr_iowait_cpu(cpu) > 0)
 		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
@@ -421,6 +422,7 @@ static void tick_nohz_stop_idle(int cpu, ktime_t now)
 		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
 	ts->idle_entrytime = now;
 	ts->idle_active = 0;
+	write_seqcount_end(&ts->sleeptime_seq);
 
 	sched_clock_idle_wakeup_event(0);
 }
@@ -429,8 +431,11 @@ static ktime_t tick_nohz_start_idle(int cpu, struct tick_sched *ts)
 {
 	ktime_t now = ktime_get();
 
+	write_seqcount_begin(&ts->sleeptime_seq);
 	ts->idle_entrytime = now;
 	ts->idle_active = 1;
+	write_seqcount_end(&ts->sleeptime_seq);
+
 	sched_clock_idle_sleep_event();
 	return now;
 }
@@ -453,6 +458,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 {
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 	ktime_t now, idle;
+	unsigned int seq;
 
 	if (!tick_nohz_enabled)
 		return -1;
@@ -461,12 +467,15 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 	if (last_update_time)
 		*last_update_time = ktime_to_us(now);
 
-	if (ts->idle_active && !nr_iowait_cpu(cpu)) {
-		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
-		idle = ktime_add(ts->idle_sleeptime, delta);
-	} else {
-		idle = ts->idle_sleeptime;
-	}
+	do {
+		seq = read_seqcount_begin(&ts->sleeptime_seq);
+		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
+			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+			idle = ktime_add(ts->idle_sleeptime, delta);
+		} else {
+			idle = ts->idle_sleeptime;
+		}
+	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));
 
 	return ktime_to_us(idle);
 
@@ -491,6 +500,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 {
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 	ktime_t now, iowait;
+	unsigned int seq;
 
 	if (!tick_nohz_enabled)
 		return -1;
@@ -499,12 +509,15 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 	if (last_update_time)
 		*last_update_time = ktime_to_us(now);
 
-	if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
-		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
-		iowait = ktime_add(ts->iowait_sleeptime, delta);
-	} else {
-		iowait = ts->iowait_sleeptime;
-	}
+	do {
+		seq = read_seqcount_begin(&ts->sleeptime_seq);
+		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
+			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+			iowait = ktime_add(ts->iowait_sleeptime, delta);
+		} else {
+			iowait = ts->iowait_sleeptime;
+		}
+	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));
 
 	return ktime_to_us(iowait);
 }
-- 
1.7.5.4



* [PATCH 3/4] nohz: Consolidate sleep time stats read code
  2013-08-16 15:42 [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats Frederic Weisbecker
  2013-08-16 15:42 ` [PATCH 1/4] nohz: Only update sleeptime stats locally Frederic Weisbecker
  2013-08-16 15:42 ` [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock Frederic Weisbecker
@ 2013-08-16 15:42 ` Frederic Weisbecker
  2013-08-18 17:00   ` Oleg Nesterov
  2013-08-16 15:42 ` [PATCH 4/4] nohz: Convert a few places to use local per cpu accesses Frederic Weisbecker
  2013-08-20 18:15 ` [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats Oleg Nesterov
  4 siblings, 1 reply; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-16 15:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Frederic Weisbecker, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Arjan van de Ven

get_cpu_idle_time_us() and get_cpu_iowait_time_us() mostly share
the same code. Let's consolidate both implementations.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 kernel/time/tick-sched.c |   76 ++++++++++++++++++++--------------------------
 1 files changed, 33 insertions(+), 43 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f7fc27b..017bec2 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -440,24 +440,10 @@ static ktime_t tick_nohz_start_idle(int cpu, struct tick_sched *ts)
 	return now;
 }
 
-/**
- * get_cpu_idle_time_us - get the total idle time of a cpu
- * @cpu: CPU number to query
- * @last_update_time: variable to store update time in. Do not update
- * counters if NULL.
- *
- * Return the cummulative idle time (since boot) for a given
- * CPU, in microseconds.
- *
- * This time is measured via accounting rather than sampling,
- * and is as accurate as ktime_get() is.
- *
- * This function returns -1 if NOHZ is not enabled.
- */
-u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
+static u64 get_cpu_sleep_time_us(int cpu, bool io, u64 *last_update_time)
 {
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
-	ktime_t now, idle;
+	ktime_t now, sleep;
 	unsigned int seq;
 
 	if (!tick_nohz_enabled)
@@ -469,16 +455,41 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 
 	do {
 		seq = read_seqcount_begin(&ts->sleeptime_seq);
-		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
+		if (io)
+			sleep = ts->iowait_sleeptime;
+		else
+			sleep = ts->idle_sleeptime;
+
+		if (ts->idle_active)
+			continue;
+
+		if ((io && nr_iowait_cpu(cpu)) || (!io && !nr_iowait_cpu(cpu))) {
 			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
-			idle = ktime_add(ts->idle_sleeptime, delta);
-		} else {
-			idle = ts->idle_sleeptime;
+			sleep = ktime_add(sleep, delta);
 		}
 	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));
 
-	return ktime_to_us(idle);
+	return ktime_to_us(sleep);
+
+}
 
+/**
+ * get_cpu_idle_time_us - get the total idle time of a cpu
+ * @cpu: CPU number to query
+ * @last_update_time: variable to store update time in. Do not update
+ * counters if NULL.
+ *
+ * Return the cummulative idle time (since boot) for a given
+ * CPU, in microseconds.
+ *
+ * This time is measured via accounting rather than sampling,
+ * and is as accurate as ktime_get() is.
+ *
+ * This function returns -1 if NOHZ is not enabled.
+ */
+u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
+{
+	return get_cpu_sleep_time_us(cpu, false, last_update_time);
 }
 EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
 
@@ -498,28 +509,7 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
  */
 u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 {
-	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
-	ktime_t now, iowait;
-	unsigned int seq;
-
-	if (!tick_nohz_enabled)
-		return -1;
-
-	now = ktime_get();
-	if (last_update_time)
-		*last_update_time = ktime_to_us(now);
-
-	do {
-		seq = read_seqcount_begin(&ts->sleeptime_seq);
-		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
-			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
-			iowait = ktime_add(ts->iowait_sleeptime, delta);
-		} else {
-			iowait = ts->iowait_sleeptime;
-		}
-	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));
-
-	return ktime_to_us(iowait);
+	return get_cpu_sleep_time_us(cpu, true, last_update_time);
 }
 EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
 
-- 
1.7.5.4



* [PATCH 4/4] nohz: Convert a few places to use local per cpu accesses
  2013-08-16 15:42 [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2013-08-16 15:42 ` [PATCH 3/4] nohz: Consolidate sleep time stats read code Frederic Weisbecker
@ 2013-08-16 15:42 ` Frederic Weisbecker
  2013-08-16 16:00   ` Peter Zijlstra
  2013-08-20 18:15 ` [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats Oleg Nesterov
  4 siblings, 1 reply; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-16 15:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Frederic Weisbecker, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Arjan van de Ven

A few functions use remote per CPU access APIs when they
deal with local values.

Just do the right conversion to improve performance, code
readability and debug checks.
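
The difference between the two access styles can be sketched with a plain
array standing in for the per-CPU area (all names hypothetical; the real
accessors additionally involve preemption and addressing details not modeled
here):

```c
#include <assert.h>

#define NR_CPUS 4

/* Toy per-CPU variable: one slot per CPU. */
static int tick_stat[NR_CPUS];

/* Stand-in for smp_processor_id(); a caller can set it directly. */
static int current_cpu;

/* Remote-style accessor, like per_cpu(var, cpu): reaches any CPU's
 * slot, so it must be handed an explicit cpu argument. */
static int *per_cpu_stat(int cpu)
{
	return &tick_stat[cpu];
}

/* Local-style accessor, like __get_cpu_var()/__this_cpu_*(): always
 * resolves to the running CPU's slot, with no index to get wrong. */
static int *this_cpu_stat(void)
{
	return &tick_stat[current_cpu];
}
```

When a function only ever touches its own CPU's data, the local form is both
cheaper and harder to misuse, which is the point of this conversion.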

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 kernel/time/tick-sched.c |   10 ++++------
 1 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 017bec2..11d64e2 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -395,11 +395,9 @@ __setup("nohz=", setup_tick_nohz);
  */
 static void tick_nohz_update_jiffies(ktime_t now)
 {
-	int cpu = smp_processor_id();
-	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 	unsigned long flags;
 
-	ts->idle_waketime = now;
+	__this_cpu_write(tick_cpu_sched.idle_waketime, now);
 
 	local_irq_save(flags);
 	tick_do_update_jiffies64(now);
@@ -410,7 +408,7 @@ static void tick_nohz_update_jiffies(ktime_t now)
 
 static void tick_nohz_stop_idle(int cpu, ktime_t now)
 {
-	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
 	ktime_t delta;
 
 	/* Updates the per cpu time idle statistics counters */
@@ -901,7 +899,7 @@ static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
 void tick_nohz_idle_exit(void)
 {
 	int cpu = smp_processor_id();
-	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
 	ktime_t now;
 
 	local_irq_disable();
@@ -1020,7 +1018,7 @@ static void tick_nohz_kick_tick(int cpu, ktime_t now)
 
 static inline void tick_check_nohz(int cpu)
 {
-	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
 	ktime_t now;
 
 	if (!ts->idle_active && !ts->tick_stopped)
-- 
1.7.5.4



* Re: [PATCH 4/4] nohz: Convert a few places to use local per cpu accesses
  2013-08-16 15:42 ` [PATCH 4/4] nohz: Convert a few places to use local per cpu accesses Frederic Weisbecker
@ 2013-08-16 16:00   ` Peter Zijlstra
  2013-08-16 16:12     ` Frederic Weisbecker
  2013-08-16 16:19     ` Oleg Nesterov
  0 siblings, 2 replies; 75+ messages in thread
From: Peter Zijlstra @ 2013-08-16 16:00 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, LKML, Fernando Luis Vazquez Cao, Tetsuo Handa,
	Thomas Gleixner, Ingo Molnar, Andrew Morton, Arjan van de Ven

OK so these patches look ok to me -- didn't read in detail though.

On Fri, Aug 16, 2013 at 05:42:33PM +0200, Frederic Weisbecker wrote:
> A few functions use remote per CPU access APIs when they
> deal with local values.
> 
> Just do the right conversion to improve performance, code
> readability and debug checks.

> @@ -410,7 +408,7 @@ static void tick_nohz_update_jiffies(ktime_t now)
>  
>  static void tick_nohz_stop_idle(int cpu, ktime_t now)
>  {
> -	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
> +	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);

What's to stop someone from calling this with cpu !=
smp_processor_id()?

> @@ -1020,7 +1018,7 @@ static void tick_nohz_kick_tick(int cpu, ktime_t now)
>  
>  static inline void tick_check_nohz(int cpu)
>  {
> -	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
> +	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
>  	ktime_t now;
>  
>  	if (!ts->idle_active && !ts->tick_stopped)

idem


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 15:42 ` [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock Frederic Weisbecker
@ 2013-08-16 16:02   ` Oleg Nesterov
  2013-08-16 16:20     ` Frederic Weisbecker
  2013-08-16 16:32     ` Frederic Weisbecker
  2013-08-18 16:54   ` Oleg Nesterov
  1 sibling, 2 replies; 75+ messages in thread
From: Oleg Nesterov @ 2013-08-16 16:02 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

Thanks Frederic!

I'll try to read this series carefully later. Not that I think
I can help, you certainly understand this much better.

Just one question below,

On 08/16, Frederic Weisbecker wrote:
>
> @@ -499,12 +509,15 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
>  	if (last_update_time)
>  		*last_update_time = ktime_to_us(now);
>  
> -	if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> -		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> -		iowait = ktime_add(ts->iowait_sleeptime, delta);
> -	} else {
> -		iowait = ts->iowait_sleeptime;
> -	}
> +	do {
> +		seq = read_seqcount_begin(&ts->sleeptime_seq);
> +		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> +			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> +			iowait = ktime_add(ts->iowait_sleeptime, delta);
> +		} else {
> +			iowait = ts->iowait_sleeptime;
> +		}
> +	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));

Unless I misread this patch, this is still a bit racy.

Suppose it is called on CPU_0 and cpu == 1. Suppose that
ts->idle_active == T and nr_iowait_cpu(cpu) == 1.

So we return iowait_sleeptime + delta.

Suppose that we call get_cpu_iowait_time_us() again. By this time
the task which incremented ->nr_iowait can be woken up on another
CPU, and it can do atomic_dec(rq->nr_iowait). So the next time
we return iowait_sleeptime, and this is not monotonic again.

No?

Oleg.



* Re: [PATCH 4/4] nohz: Convert a few places to use local per cpu accesses
  2013-08-16 16:00   ` Peter Zijlstra
@ 2013-08-16 16:12     ` Frederic Weisbecker
  2013-08-16 16:19     ` Oleg Nesterov
  1 sibling, 0 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-16 16:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, LKML, Fernando Luis Vazquez Cao, Tetsuo Handa,
	Thomas Gleixner, Ingo Molnar, Andrew Morton, Arjan van de Ven

On Fri, Aug 16, 2013 at 06:00:43PM +0200, Peter Zijlstra wrote:
> OK so these patches look ok to me -- didn't read in detail though.
> 
> On Fri, Aug 16, 2013 at 05:42:33PM +0200, Frederic Weisbecker wrote:
> > A few functions use remote per CPU access APIs when they
> > deal with local values.
> > 
> > Just do the right conversion to improve performance, code
> > readability and debug checks.
> 
> > @@ -410,7 +408,7 @@ static void tick_nohz_update_jiffies(ktime_t now)
> >  
> >  static void tick_nohz_stop_idle(int cpu, ktime_t now)
> >  {
> > -	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
> > +	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
> 
> What's to stop someone from calling this with cpu !=
> smp_processor_id()?

May be I should remove the cpu arg yeah.

> 
> > @@ -1020,7 +1018,7 @@ static void tick_nohz_kick_tick(int cpu, ktime_t now)
> >  
> >  static inline void tick_check_nohz(int cpu)
> >  {
> > -	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
> > +	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
> >  	ktime_t now;
> >  
> >  	if (!ts->idle_active && !ts->tick_stopped)
> 
> idem

Yep same here. Will fix.


* Re: [PATCH 4/4] nohz: Convert a few places to use local per cpu accesses
  2013-08-16 16:00   ` Peter Zijlstra
  2013-08-16 16:12     ` Frederic Weisbecker
@ 2013-08-16 16:19     ` Oleg Nesterov
  2013-08-16 16:34       ` Frederic Weisbecker
  1 sibling, 1 reply; 75+ messages in thread
From: Oleg Nesterov @ 2013-08-16 16:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	Arjan van de Ven

On 08/16, Peter Zijlstra wrote:
>
> OK so these patches look ok to me -- didn't read in detail though.
>
> On Fri, Aug 16, 2013 at 05:42:33PM +0200, Frederic Weisbecker wrote:
> > A few functions use remote per CPU access APIs when they
> > deal with local values.
> >
> > Just do the right conversion to improve performance, code
> > readability and debug checks.
>
> > @@ -410,7 +408,7 @@ static void tick_nohz_update_jiffies(ktime_t now)
> >
> >  static void tick_nohz_stop_idle(int cpu, ktime_t now)
> >  {
> > -	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
> > +	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
>
> What's to stop someone from calling this with cpu !=
> smp_processor_id()?

I _guess_ this should not happen, but in this case we should probably
remove the "cpu" argument. And smp_processor_id() in tick_nohz_idle_exit().

tick_check_idle/tick_check_nohz doesn't need "int cpu" too, it seems.

Oleg.



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:02   ` Oleg Nesterov
@ 2013-08-16 16:20     ` Frederic Weisbecker
  2013-08-16 16:26       ` Oleg Nesterov
  2013-08-16 16:32     ` Frederic Weisbecker
  1 sibling, 1 reply; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-16 16:20 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On Fri, Aug 16, 2013 at 06:02:01PM +0200, Oleg Nesterov wrote:
> Thanks Frederic!
> 
> I'll try to read this series carefully later. Not that I think
> I can help, you certainly understand this much better.
> 
> Just one question below,
> 
> On 08/16, Frederic Weisbecker wrote:
> >
> > @@ -499,12 +509,15 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
> >  	if (last_update_time)
> >  		*last_update_time = ktime_to_us(now);
> >  
> > -	if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> > -		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> > -		iowait = ktime_add(ts->iowait_sleeptime, delta);
> > -	} else {
> > -		iowait = ts->iowait_sleeptime;
> > -	}
> > +	do {
> > +		seq = read_seqcount_begin(&ts->sleeptime_seq);
> > +		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> > +			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> > +			iowait = ktime_add(ts->iowait_sleeptime, delta);
> > +		} else {
> > +			iowait = ts->iowait_sleeptime;
> > +		}
> > +	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));
> 
> Unless I misread this patch, this is still a bit racy.
> 
> Suppose it is called on CPU_0 and cpu == 1. Suppose that
> ts->idle_active == T and nr_iowait_cpu(cpu) == 1.
> 
> So we return iowait_sleeptime + delta.
> 
> Suppose that we call get_cpu_iowait_time_us() again. By this time
> the task which incremented ->nr_iowait can be woken up on another
> CPU, and it can do atomic_dec(rq->nr_iowait). So the next time
> we return iowait_sleeptime, and this is not monotonic again.

Hmm, by the time it decrements nr_iowait, it returned from schedule() and
so idle had flushed the pending iowait sleeptime.

May be you have some scenario in mind that I'm missing?

> 
> No?
> 
> Oleg.
> 


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:20     ` Frederic Weisbecker
@ 2013-08-16 16:26       ` Oleg Nesterov
  2013-08-16 16:46         ` Frederic Weisbecker
  0 siblings, 1 reply; 75+ messages in thread
From: Oleg Nesterov @ 2013-08-16 16:26 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On 08/16, Frederic Weisbecker wrote:
>
> On Fri, Aug 16, 2013 at 06:02:01PM +0200, Oleg Nesterov wrote:
> > > +	do {
> > > +		seq = read_seqcount_begin(&ts->sleeptime_seq);
> > > +		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> > > +			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> > > +			iowait = ktime_add(ts->iowait_sleeptime, delta);
> > > +		} else {
> > > +			iowait = ts->iowait_sleeptime;
> > > +		}
> > > +	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));
> >
> > Unless I misread this patch, this is still a bit racy.
> >
> > Suppose it is called on CPU_0 and cpu == 1. Suppose that
> > ts->idle_active == T and nr_iowait_cpu(cpu) == 1.
> >
> > So we return iowait_sleeptime + delta.
> >
> > Suppose that we call get_cpu_iowait_time_us() again. By this time
> > the task which incremented ->nr_iowait can be woken up on another
> > CPU, and it can do atomic_dec(rq->nr_iowait). So the next time
> > we return iowait_sleeptime, and this is not monotonic again.
>
> Hmm, by the time it decrements nr_iowait, it has returned from schedule() and
> so idle has already flushed the pending iowait sleeptime.

Suppose a task does io_schedule() on CPU_0, and increments the counter.
This CPU becomes idle and nr_iowait_cpu(0) == 1.

Then this task is woken up, but try_to_wake_up() selects another CPU != 0.

It returns from schedule() and decrements the same counter, it doesn't
do raw_rq/etc again. nr_iowait_cpu(0) becomes 0.

In fact the task can even migrate to another CPU right after raw_rq().

> Maybe you have some scenario in mind that I'm missing?

Or me...

Oleg.
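The scenario above fits in a few lines of plain C. Everything below is a
hypothetical userspace model of the per-cpu state involved, not kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for one CPU's tick_sched / runqueue state. */
static int nr_iowait = 1;               /* a task blocked in io_schedule() */
static int idle_active = 1;             /* the CPU is idle */
static uint64_t iowait_sleeptime = 100; /* accumulated iowait time (us) */
static uint64_t idle_entrytime = 1000;  /* when the CPU went idle (us) */

/* Mimics get_cpu_iowait_time_us(): the pending delta is added only while
 * nr_iowait is still non-zero, so it vanishes if the waiter is woken
 * remotely before the idle CPU flushes it. */
static uint64_t read_iowait(uint64_t now)
{
	if (idle_active && nr_iowait > 0)
		return iowait_sleeptime + (now - idle_entrytime);
	return iowait_sleeptime;
}
```

The second read returns a smaller value than the first even though time
advanced, which is exactly the monotonicity violation under discussion.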


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:02   ` Oleg Nesterov
  2013-08-16 16:20     ` Frederic Weisbecker
@ 2013-08-16 16:32     ` Frederic Weisbecker
  2013-08-16 16:33       ` Oleg Nesterov
  2013-08-16 16:37       ` Frederic Weisbecker
  1 sibling, 2 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-16 16:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On Fri, Aug 16, 2013 at 06:02:01PM +0200, Oleg Nesterov wrote:
> Thanks Frederic!
> 
> I'll try to read this series carefully later. Not that I think
> I can help, you certainly understand this much better.
> 
> Just one question below,
> 
> On 08/16, Frederic Weisbecker wrote:
> >
> > @@ -499,12 +509,15 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
> >  	if (last_update_time)
> >  		*last_update_time = ktime_to_us(now);
> >  
> > -	if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> > -		ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> > -		iowait = ktime_add(ts->iowait_sleeptime, delta);
> > -	} else {
> > -		iowait = ts->iowait_sleeptime;
> > -	}
> > +	do {
> > +		seq = read_seqcount_begin(&ts->sleeptime_seq);
> > +		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> > +			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> > +			iowait = ktime_add(ts->iowait_sleeptime, delta);
> > +		} else {
> > +			iowait = ts->iowait_sleeptime;
> > +		}
> > +	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));
> 
> Unless I misread this patch, this is still a bit racy.
> 
> Suppose it is called on CPU_0 and cpu == 1. Suppose that
> ts->idle_active == T and nr_iowait_cpu(cpu) == 1.
> 
> So we return iowait_sleeptime + delta.
> 
> Suppose that we call get_cpu_iowait_time_us() again. By this time
> the task which incremented ->nr_iowait can be woken up on another
> CPU, and it can do atomic_dec(rq->nr_iowait). So the next time
> we return iowait_sleeptime, and this is not monotonic again.
> 
> No?

OTOH, io_schedule() does:

      atomic_inc(&rq->nr_iowait);
      schedule();
      atomic_dec(&rq->nr_iowait);

How do we handle that when the task is migrated after it goes to sleep? I don't
see that nr_iowait is decreased on the src CPU and increased on the dest CPU.

Nor do I see that iowait tasks can't be migrated.
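The pattern in question can be sketched in userspace. This is a hypothetical
model; raw_rq(), the atomics, and schedule() are paraphrased as comments:

```c
#include <assert.h>

/* Per-CPU runqueue with its iowait counter (illustrative layout only). */
struct rq { int nr_iowait; };

static struct rq runqueues[2];  /* two CPUs */

/* Sketch of io_schedule(): the runqueue pointer is captured once up
 * front, so the decrement hits the source CPU's counter even if the
 * task wakes up and resumes on a different CPU. */
static void model_io_schedule(int src_cpu)
{
	struct rq *rq = &runqueues[src_cpu]; /* stands in for raw_rq() */
	rq->nr_iowait++;                     /* atomic_inc(&rq->nr_iowait) */
	/* schedule(); the task may be woken on another CPU here... */
	rq->nr_iowait--;                     /* ...but rq still points at src */
}
```

Because rq is captured before schedule(), migration of the sleeping task
changes nothing about which CPU's counter gets decremented.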

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:32     ` Frederic Weisbecker
@ 2013-08-16 16:33       ` Oleg Nesterov
  2013-08-16 16:49         ` Frederic Weisbecker
  2013-08-16 16:37       ` Frederic Weisbecker
  1 sibling, 1 reply; 75+ messages in thread
From: Oleg Nesterov @ 2013-08-16 16:33 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On 08/16, Frederic Weisbecker wrote:
>
> On Fri, Aug 16, 2013 at 06:02:01PM +0200, Oleg Nesterov wrote:
> >
> > Unless I misread this patch, this is still a bit racy.
> >
> > Suppose it is called on CPU_0 and cpu == 1. Suppose that
> > ts->idle_active == T and nr_iowait_cpu(cpu) == 1.
> >
> > So we return iowait_sleeptime + delta.
> >
> > Suppose that we call get_cpu_iowait_time_us() again. By this time
> > the task which incremented ->nr_iowait can be woken up on another
> > CPU, and it can do atomic_dec(rq->nr_iowait). So the next time
> > we return iowait_sleeptime, and this is not monotonic again.
> >
> > No?
>
> OTOH, io_schedule() does:
>
>       atomic_inc(&rq->nr_iowait);
>       schedule();
>       atomic_dec(&rq->nr_iowait);
>
> How do we handle that when the task is migrated after it goes to sleep?

or even before it goes to sleep. This is what I meant.

> Nor do I see that iowait tasks can't be migrated.

But probably this is fine? This is just non-precise accounting.

But otoh, I agree. The whole idea about per-cpu nr_iowait looks a
bit strange.

Oleg.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 4/4] nohz: Convert a few places to use local per cpu accesses
  2013-08-16 16:19     ` Oleg Nesterov
@ 2013-08-16 16:34       ` Frederic Weisbecker
  0 siblings, 0 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-16 16:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, LKML, Fernando Luis Vazquez Cao, Tetsuo Handa,
	Thomas Gleixner, Ingo Molnar, Andrew Morton, Arjan van de Ven

On Fri, Aug 16, 2013 at 06:19:54PM +0200, Oleg Nesterov wrote:
> On 08/16, Peter Zijlstra wrote:
> >
> > OK so these patches look ok to me -- didn't read in detail though.
> >
> > On Fri, Aug 16, 2013 at 05:42:33PM +0200, Frederic Weisbecker wrote:
> > > A few functions use remote per CPU access APIs when they
> > > deal with local values.
> > >
> > > Just do the right conversion to improve performance, code
> > > readability and debug checks.
> >
> > > @@ -410,7 +408,7 @@ static void tick_nohz_update_jiffies(ktime_t now)
> > >
> > >  static void tick_nohz_stop_idle(int cpu, ktime_t now)
> > >  {
> > > -	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
> > > +	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
> >
> > What's there to stop someone from calling this with cpu !=
> > smp_processor_id() ?
> 
> I _guess_ this should not happen, but in this case we should probably
> remove the "cpu" argument. And smp_processor_id() in tick_nohz_idle_exit().
> 
> tick_check_idle/tick_check_nohz don't need "int cpu" either, it seems.

Yep, indeed it seems right to remove the cpu argument to prevent further
calls to these APIs on remote CPUs, which would otherwise result in random disaster.

> 
> Oleg.
> 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:32     ` Frederic Weisbecker
  2013-08-16 16:33       ` Oleg Nesterov
@ 2013-08-16 16:37       ` Frederic Weisbecker
  1 sibling, 0 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-16 16:37 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

2013/8/16 Frederic Weisbecker <fweisbec@gmail.com>:
> On Fri, Aug 16, 2013 at 06:02:01PM +0200, Oleg Nesterov wrote:
>> Thanks Frederic!
>>
>> I'll try to read this series carefully later. Not that I think
>> I can help, you certainly understand this much better.
>>
>> Just one question below,
>>
>> On 08/16, Frederic Weisbecker wrote:
>> >
>> > @@ -499,12 +509,15 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
>> >     if (last_update_time)
>> >             *last_update_time = ktime_to_us(now);
>> >
>> > -   if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
>> > -           ktime_t delta = ktime_sub(now, ts->idle_entrytime);
>> > -           iowait = ktime_add(ts->iowait_sleeptime, delta);
>> > -   } else {
>> > -           iowait = ts->iowait_sleeptime;
>> > -   }
>> > +   do {
>> > +           seq = read_seqcount_begin(&ts->sleeptime_seq);
>> > +           if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
>> > +                   ktime_t delta = ktime_sub(now, ts->idle_entrytime);
>> > +                   iowait = ktime_add(ts->iowait_sleeptime, delta);
>> > +           } else {
>> > +                   iowait = ts->iowait_sleeptime;
>> > +           }
>> > +   } while (read_seqcount_retry(&ts->sleeptime_seq, seq));
>>
>> Unless I misread this patch, this is still a bit racy.
>>
>> Suppose it is called on CPU_0 and cpu == 1. Suppose that
>> ts->idle_active == T and nr_iowait_cpu(cpu) == 1.
>>
>> So we return iowait_sleeptime + delta.
>>
>> Suppose that we call get_cpu_iowait_time_us() again. By this time
>> the task which incremented ->nr_iowait can be woken up on another
>> CPU, and it can do atomic_dec(rq->nr_iowait). So the next time
>> we return iowait_sleeptime, and this is not monotonic again.
>>
>> No?
>
> OTOH, io_schedule() does:
>
>       atomic_inc(&rq->nr_iowait);
>       schedule();
>       atomic_dec(&rq->nr_iowait);
>
> How do we handle that when the task is migrated after it goes to sleep? I don't
> see that nr_iowait is decreased on the src CPU and increased on the dest CPU.
>
> Nor do I see that iowait tasks can't be migrated.

My bad, the decrement happens on the src CPU anyway.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:26       ` Oleg Nesterov
@ 2013-08-16 16:46         ` Frederic Weisbecker
  2013-08-16 16:49           ` Oleg Nesterov
                             ` (2 more replies)
  0 siblings, 3 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-16 16:46 UTC (permalink / raw)
  To: Oleg Nesterov, Ingo Molnar, Thomas Gleixner
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On Fri, Aug 16, 2013 at 06:26:54PM +0200, Oleg Nesterov wrote:
> On 08/16, Frederic Weisbecker wrote:
> >
> > On Fri, Aug 16, 2013 at 06:02:01PM +0200, Oleg Nesterov wrote:
> > > > +	do {
> > > > +		seq = read_seqcount_begin(&ts->sleeptime_seq);
> > > > +		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> > > > +			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> > > > +			iowait = ktime_add(ts->iowait_sleeptime, delta);
> > > > +		} else {
> > > > +			iowait = ts->iowait_sleeptime;
> > > > +		}
> > > > +	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));
> > >
> > > Unless I misread this patch, this is still a bit racy.
> > >
> > > Suppose it is called on CPU_0 and cpu == 1. Suppose that
> > > ts->idle_active == T and nr_iowait_cpu(cpu) == 1.
> > >
> > > So we return iowait_sleeptime + delta.
> > >
> > > Suppose that we call get_cpu_iowait_time_us() again. By this time
> > > the task which incremented ->nr_iowait can be woken up on another
> > > CPU, and it can do atomic_dec(rq->nr_iowait). So the next time
> > > we return iowait_sleeptime, and this is not monotonic again.
> >
> > Hmm, by the time it decrements nr_iowait, it has returned from schedule() and
> > so idle has already flushed the pending iowait sleeptime.
> 
> Suppose a task does io_schedule() on CPU_0, and increments the counter.
> This CPU becomes idle and nr_iowait_cpu(0) == 1.
> 
> Then this task is woken up, but try_to_wake_up() selects another CPU != 0.
> 
> It returns from schedule() and decrements the same counter, it doesn't
> do raw_rq/etc again. nr_iowait_cpu(0) becomes 0.
> 
> In fact the task can even migrate to another CPU right after raw_rq().

Ah I see now. So that's indeed yet another race.

Should we flush that iowait to the src CPU? But then it means we must handle
concurrent updates to iowait_sleeptime, idle_sleeptime from the migration
code and from idle enter / exit.

So I fear we need a seqlock.

Or we can live with that and still account the whole idle time slept until
tick_nohz_stop_idle() to iowait if we called tick_nohz_start_idle() with nr_iowait > 0.
All we need is a new field in ts-> that records in which state we entered
idle.

What do you think?

Ingo, Thomas?
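A rough sketch of that idea, with illustrative names only (the real field
layout and helpers in tick-sched.c would differ): the iowait state is sampled
once at idle entry and never re-read, so a waiter migrating away later cannot
change how the slept time is classified.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ts-> state: idle_active turned into an enum that also
 * records whether we entered idle with pending iowaiters. */
enum idle_state { NOT_IDLE, IDLE_PLAIN, IDLE_IOWAIT };

struct ts_model {
	enum idle_state state;
	uint64_t idle_entrytime;
	uint64_t idle_sleeptime;
	uint64_t iowait_sleeptime;
};

static void idle_enter(struct ts_model *ts, uint64_t now, int nr_iowait)
{
	ts->state = nr_iowait > 0 ? IDLE_IOWAIT : IDLE_PLAIN;
	ts->idle_entrytime = now;
}

static void idle_exit(struct ts_model *ts, uint64_t now)
{
	uint64_t delta = now - ts->idle_entrytime;

	if (ts->state == IDLE_IOWAIT)
		ts->iowait_sleeptime += delta;
	else
		ts->idle_sleeptime += delta;
	ts->state = NOT_IDLE;
}
```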

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:46         ` Frederic Weisbecker
@ 2013-08-16 16:49           ` Oleg Nesterov
  2013-08-16 17:12             ` Frederic Weisbecker
  2013-08-19 11:10           ` Peter Zijlstra
  2013-08-20  6:21           ` Fernando Luis Vázquez Cao
  2 siblings, 1 reply; 75+ messages in thread
From: Oleg Nesterov @ 2013-08-16 16:49 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Thomas Gleixner, LKML, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On 08/16, Frederic Weisbecker wrote:
>
> tick_nohz_stop_idle() to iowait if we called tick_nohz_start_idle() with nr_iowait > 0.
> All we need is a new field in ts-> that records in which state we entered
> idle.

Or we can turn ->idle_active into enum. And all other nr_iowait_cpu's
in this code should go away.

Personally I am fine either way.

Oleg.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:33       ` Oleg Nesterov
@ 2013-08-16 16:49         ` Frederic Weisbecker
  0 siblings, 0 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-16 16:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On Fri, Aug 16, 2013 at 06:33:55PM +0200, Oleg Nesterov wrote:
> On 08/16, Frederic Weisbecker wrote:
> >
> > On Fri, Aug 16, 2013 at 06:02:01PM +0200, Oleg Nesterov wrote:
> > >
> > > Unless I misread this patch, this is still a bit racy.
> > >
> > > Suppose it is called on CPU_0 and cpu == 1. Suppose that
> > > ts->idle_active == T and nr_iowait_cpu(cpu) == 1.
> > >
> > > So we return iowait_sleeptime + delta.
> > >
> > > Suppose that we call get_cpu_iowait_time_us() again. By this time
> > > the task which incremented ->nr_iowait can be woken up on another
> > > CPU, and it can do atomic_dec(rq->nr_iowait). So the next time
> > > we return iowait_sleeptime, and this is not monotonic again.
> > >
> > > No?
> >
> > OTOH, io_schedule() does:
> >
> >       atomic_inc(&rq->nr_iowait);
> >       schedule();
> >       atomic_dec(&rq->nr_iowait);
> >
> > How do we handle that when the task is migrated after it goes to sleep?
> 
> or even before it goes to sleep. This is what I meant.
> 
> > Nor do I see that iowait tasks can't be migrated.
> 
> But probably this is fine? This is just non-precise accounting.
> 
> But otoh, I agree. The whole idea about per-cpu nr_iowait looks a
> bit strange.


My bad, I thought it was doing:

   inc(this_rq->nr_iowait)
   schedule()
   dec(this_rq->nr_iowait)

But it actually uses the src CPU all along.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:49           ` Oleg Nesterov
@ 2013-08-16 17:12             ` Frederic Weisbecker
  2013-08-18 16:36               ` Oleg Nesterov
  2013-08-19 10:58               ` Peter Zijlstra
  0 siblings, 2 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-16 17:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Thomas Gleixner, LKML, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On Fri, Aug 16, 2013 at 06:49:22PM +0200, Oleg Nesterov wrote:
> On 08/16, Frederic Weisbecker wrote:
> >
> > tick_nohz_stop_idle() to iowait if we called tick_nohz_start_idle() with nr_iowait > 0.
> > All we need is a new field in ts-> that records in which state we entered
> > idle.
> 
> Or we can turn ->idle_active into enum. And all other nr_iowait_cpu's
> in this code should go away.
> 
> Personally I am fine either way.

Me too.

So my proposal is that we keep the existing patches, as they fix other distinct races
(and we add fixes for what peterz just reported), and send them to Ingo. Ah, and I'll
wait for your review first.

Then, if all goes well with the pull request, we describe the nr_iowait race to him
and let him choose what to do with that nr_iowait migration race: either we ignore
the migration and always account according to what we saw on idle start, or we flush
that time accounting on iowait migration, but that requires seqlocks on the idle path.

Or may be Peter could tell us as well. Peter, do you have a preference?

Thanks.

> 
> Oleg.
> 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 17:12             ` Frederic Weisbecker
@ 2013-08-18 16:36               ` Oleg Nesterov
  2013-08-18 21:25                 ` Frederic Weisbecker
  2013-08-19 10:58               ` Peter Zijlstra
  1 sibling, 1 reply; 75+ messages in thread
From: Oleg Nesterov @ 2013-08-18 16:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Thomas Gleixner, LKML, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On 08/16, Frederic Weisbecker wrote:
>
> On Fri, Aug 16, 2013 at 06:49:22PM +0200, Oleg Nesterov wrote:
> >
> > Personally I am fine either way.
>
> Me too.
>
> So my proposal is that we keep the existing patches, as they fix other distinct races

perhaps... but it would be nice to fix the "goes backward" problem.

This "race" is not theoretical, at least for get_cpu_iowait_time_us().
nr_iowait_cpu() can change from !0 to 0 very easily.

And just in case, note that get_cpu_idle_time_us() has the same problem
too, because it can also change from 0 to !0 although this case is much
less likely.

However, right now I do not see a simple solution. It seems that, if
get_cpu_idle_time_us() does ktime_add(ts->idle_sleeptime) it should
actually update ts->idle_sleeptime/entrytime to avoid these races
(the same for ->iowait_sleeptime), but then we need the locking.

> (and we add fixes on what peterz just reported)

Do you mean his comments about 4/4 or I missed something else?

> Ah and I'll wait for
> your review first.

If only I could understand this code ;) It looks really simple and
I hope I can understand what it does. But not why. I simply can't
understand why idle/iowait are "mutually exclusive".

And if we do this, then perhaps io_schedule() should do
"if (atomic_dec_and_test(&rq->nr_iowait)) update_iowait_sleeptime()" but
this means the locking again.

> Then if all goes well on the pull request we describe him the nr_iowait race and we let him
> choose what to do with that nr_iowait migration race:

OK, agreed.

> Or may be Peter could tell us as well. Peter, do you have a preference?

Yes, it would be nice to know what Peter thinks ;)

Oleg.
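As a side note, the read/retry protocol behind read_seqcount_begin() and
read_seqcount_retry() that the patches rely on can be modelled in a few lines.
This is an illustrative single-writer sketch, not the kernel's seqcount_t:

```c
#include <assert.h>

/* Minimal seqcount model: the writer bumps the counter to odd before
 * updating the protected data and back to even afterwards; a reader
 * retries whenever the counter changed (or was odd) across its read. */
struct seqcount { unsigned seq; };

static void write_begin(struct seqcount *s) { s->seq++; /* now odd */ }
static void write_end(struct seqcount *s)   { s->seq++; /* even again */ }

static unsigned read_begin(const struct seqcount *s)
{
	return s->seq & ~1u;  /* an odd count maps onto the preceding even */
}

static int read_retry(const struct seqcount *s, unsigned start)
{
	return s->seq != start;  /* writer ran (or is running): retry */
}
```

A reader loops: start = read_begin(); copy the data; if read_retry(start),
do it again. Readers never block, which is why it fits the idle path.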


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 1/4] nohz: Only update sleeptime stats locally
  2013-08-16 15:42 ` [PATCH 1/4] nohz: Only update sleeptime stats locally Frederic Weisbecker
@ 2013-08-18 16:49   ` Oleg Nesterov
  2013-08-18 21:38     ` Frederic Weisbecker
  2013-08-18 17:04   ` Oleg Nesterov
  1 sibling, 1 reply; 75+ messages in thread
From: Oleg Nesterov @ 2013-08-18 16:49 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On 08/16, Frederic Weisbecker wrote:
>
> To fix this, let's only update the sleeptime stats locally when the CPU
> exits from idle.

I am in no position to ack the changes in this area, but I like this
change very much. Because, as a code reader, I was totally confused by

	if (last_update_time)
		update_ts_time_stats()

code and it looks "obviously wrong".

I added more cc's. It seems to me that 9366d840 "cpufreq: governors:
Calculate iowait time only when necessary" doesn't realize what

	-       u64 idle_time = get_cpu_idle_time_us(cpu, NULL);
	+       u64 idle_time = get_cpu_idle_time_us(cpu, io_busy ? wall : NULL);

actually means. OTOH, get_cpu_iowait_time_us() was called with
last_update_time != NULL even before this patch...

In short: this looks like the clear fix to me, but I do not understand
this code enough, and I think that cpufreq should know about this change.

>  static void tick_nohz_stop_idle(int cpu, ktime_t now)
>  {
>  	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
> +	ktime_t delta;
>  
> -	update_ts_time_stats(cpu, ts, now, NULL);
> +	/* Updates the per cpu time idle statistics counters */
> +	delta = ktime_sub(now, ts->idle_entrytime);
> +	if (nr_iowait_cpu(cpu) > 0)
> +		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
> +	else
> +		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
> +	ts->idle_entrytime = now;
>  	ts->idle_active = 0;

With or without this change, why do we update ->idle_entrytime in this case?
Looks harmless, but a bit confusing.

While this doesn't really matter, we could probably even kill ->idle_active
and use !!ts->idle_entrytime instead.

> @@ -473,17 +458,14 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)

And I think that we should kill this last_update_time argument later.

Oleg.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 15:42 ` [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock Frederic Weisbecker
  2013-08-16 16:02   ` Oleg Nesterov
@ 2013-08-18 16:54   ` Oleg Nesterov
  2013-08-18 21:40     ` Frederic Weisbecker
  1 sibling, 1 reply; 75+ messages in thread
From: Oleg Nesterov @ 2013-08-18 16:54 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On 08/16, Frederic Weisbecker wrote:
>
> This fixes a reported bug where non-monotonic sleeptime stats are returned by /proc/stat
> when it is frequently read.

Plus it fixes the problems with 32bit machines reading u64 values.

But perhaps the changelog should mention other races we discussed.

Oleg.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 3/4] nohz: Consolidate sleep time stats read code
  2013-08-16 15:42 ` [PATCH 3/4] nohz: Consolidate sleep time stats read code Frederic Weisbecker
@ 2013-08-18 17:00   ` Oleg Nesterov
  2013-08-18 21:47     ` Frederic Weisbecker
  0 siblings, 1 reply; 75+ messages in thread
From: Oleg Nesterov @ 2013-08-18 17:00 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On 08/16, Frederic Weisbecker wrote:
>
> get_cpu_idle_time_us() and get_cpu_iowait_time_us() mostly share
> the same code. Let's consolidate both implementations.

Personally I like every patch which consolidates the code ;)

>  	do {
>  		seq = read_seqcount_begin(&ts->sleeptime_seq);
> -		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
> +		if (io)
> +			sleep = ts->iowait_sleeptime;
> +		else
> +			sleep = ts->idle_sleeptime;
> +
> +		if (ts->idle_active)
> +			continue;
> +
> +		if ((io && nr_iowait_cpu(cpu)) || (!io && !nr_iowait_cpu(cpu))) {
>  			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> -			idle = ktime_add(ts->idle_sleeptime, delta);
> -		} else {
> -			idle = ts->idle_sleeptime;
> +			sleep = ktime_add(sleep, delta);
>  		}
>  	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));

Cosmetic/subjective, feel free to ignore. but perhaps

		if (ts->idle_active && io == !!nr_iowait_cpu(cpu)) {
			...
		}

looks more understandable.

Oleg.
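The two spellings of the condition are indeed equivalent when io is a 0/1
flag, which is easy to check exhaustively (hypothetical helper names):

```c
#include <assert.h>

/* form_a is the patch's condition, form_b is the suggested rewrite.
 * io is assumed to be 0 or 1; nr may be any non-negative count. */
static int form_a(int io, int nr) { return (io && nr) || (!io && !nr); }
static int form_b(int io, int nr) { return io == !!nr; }
```

The !! normalizes the count to 0/1 so the equality compares like with like.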


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 1/4] nohz: Only update sleeptime stats locally
  2013-08-16 15:42 ` [PATCH 1/4] nohz: Only update sleeptime stats locally Frederic Weisbecker
  2013-08-18 16:49   ` Oleg Nesterov
@ 2013-08-18 17:04   ` Oleg Nesterov
  2013-08-19 18:05     ` Stratos Karafotis
  1 sibling, 1 reply; 75+ messages in thread
From: Oleg Nesterov @ 2013-08-18 17:04 UTC (permalink / raw)
  To: Frederic Weisbecker, Rafael J. Wysocki, Viresh Kumar, Stratos Karafotis
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

Sorry for the double post, I forgot to cc the cpufreq maintainers.

On 08/16, Frederic Weisbecker wrote:
>
> To fix this, let's only update the sleeptime stats locally when the CPU
> exits from idle.

I am in no position to ack the changes in this area, but I like this
change very much. Because, as a code reader, I was totally confused by

	if (last_update_time)
		update_ts_time_stats()

code and it looks "obviously wrong".

I added more cc's. It seems to me that 9366d840 "cpufreq: governors:
Calculate iowait time only when necessary" doesn't realize what

	-       u64 idle_time = get_cpu_idle_time_us(cpu, NULL);
	+       u64 idle_time = get_cpu_idle_time_us(cpu, io_busy ? wall : NULL);

actually means. OTOH, get_cpu_iowait_time_us() was called with
last_update_time != NULL even before this patch...

In short: this looks like the clear fix to me, but I do not understand
this code enough, and I think that cpufreq should know about this change.

>  static void tick_nohz_stop_idle(int cpu, ktime_t now)
>  {
>  	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
> +	ktime_t delta;
>  
> -	update_ts_time_stats(cpu, ts, now, NULL);
> +	/* Updates the per cpu time idle statistics counters */
> +	delta = ktime_sub(now, ts->idle_entrytime);
> +	if (nr_iowait_cpu(cpu) > 0)
> +		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
> +	else
> +		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
> +	ts->idle_entrytime = now;
>  	ts->idle_active = 0;

With or without this change, why do we update ->idle_entrytime in this case?
Looks harmless, but a bit confusing.

While this doesn't really matter, we could probably even kill ->idle_active
and use !!ts->idle_entrytime instead.

> @@ -473,17 +458,14 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)

And I think that we should kill this last_update_time argument later.

Oleg.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-18 16:36               ` Oleg Nesterov
@ 2013-08-18 21:25                 ` Frederic Weisbecker
  0 siblings, 0 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-18 21:25 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Thomas Gleixner, LKML, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On Sun, Aug 18, 2013 at 06:36:39PM +0200, Oleg Nesterov wrote:
> On 08/16, Frederic Weisbecker wrote:
> >
> > On Fri, Aug 16, 2013 at 06:49:22PM +0200, Oleg Nesterov wrote:
> > >
> > > Personally I am fine either way.
> >
> > Me too.
> >
> > So my proposition is that we can keep the existing patches as they fix other distinct races
> 
> perhaps... but it would be nice to fix the "goes backward" problem.
> 
> This "race" is not theoretical, at least for get_cpu_iowait_time_us().
> nr_iowait_cpu() can change from !0 to 0 very easily.

Actually yes. Now I think we should fix the iowait race in the same changeset, because
the bugs I'm fixing in this patchset were probably mitigating the iowait issue in some
cases.

> 
> And just in case, note that get_cpu_idle_time_us() has the same problem
> too, because it can also change from 0 to !0 although this case is much
> less likely.

Right.

> 
> However, right now I do not see a simple solution. It seems that, if
> get_cpu_idle_time_us() does ktime_add(ts->idle_sleeptime) it should
> actually update ts->idle_sleeptime/entrytime to avoid these races
> (the same for ->idle_sleeptime), but then we need the locking.

It seems that wouldn't solve the issue. Imagine that task A waits for
IO on CPU 0. It waits 10 seconds. Then it's finally woken on CPU 1.
A further call to get_cpu_idle_time_us() on CPU 0, say 5 seconds later,
will miss the io sleeptime part.

For now I can't figure out how to avoid flushing the io sleeptime when
the iowait task is moved to another CPU. This requires a lock (moving from
seqcount to seqlock) and may involve quite some cache issues in the idle
path, resulting in higher latencies on wakeup. We really need another solution.

> 
> > (and we add fixes on what peterz just reported)
> 
> Do you mean his comments about 4/4 or I missed something else?

Yep, just removing the cpu arguments from the functions that only
use ts locally.

> > Ah and I'll wait for
> > your review first.
> 
> If only I could understand this code ;) It looks really simple and
> I hope I can understand what it does. But not why. I simply can't
> understand why idle/iowait are "mutually exclusive".

I believe this can be changed such that iowait is included in the idle
sleeptime. We can do that if that's needed. 

> 
> And if we do this,  then perhaps io_schedule() should do
> "if (atomic_dec(&rq->nr_iowait)) update_iowait_sleeptime()" but
> this means the locking again.

Yep.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 1/4] nohz: Only update sleeptime stats locally
  2013-08-18 16:49   ` Oleg Nesterov
@ 2013-08-18 21:38     ` Frederic Weisbecker
  0 siblings, 0 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-18 21:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven,
	Stratos Karafotis, Viresh Kumar, Rafael J. Wysocki

On Sun, Aug 18, 2013 at 06:49:37PM +0200, Oleg Nesterov wrote:
> On 08/16, Frederic Weisbecker wrote:
> >
> > To fix this, let's only update the sleeptime stats locally when the CPU
> > exits from idle.
> 
> I am in no position to ack the changes in this area, but I like this
> change very much. Because, as a code reader, I was totally confused by
> 
> 	if (last_update_time)
> 		update_ts_time_stats()
> 
> code and it looks "obviously wrong".
> 
> I added more cc's. It seems to me that 9366d840 "cpufreq: governors:
> Calculate iowait time only when necessary" doesn't realize what
> 
> 	-       u64 idle_time = get_cpu_idle_time_us(cpu, NULL);
> 	+       u64 idle_time = get_cpu_idle_time_us(cpu, io_busy ? wall : NULL);
> 
> actually means. OTOH, get_cpu_iowait_time_us() was called with
> last_update_time != NULL even before this patch...
> 
> In short. This looks like the clear fix to me, but I do not understand
> this code enough, and I think that cpufreq should know about this change.

Good point, and this time I'm really adding the Cc :)

> 
> >  static void tick_nohz_stop_idle(int cpu, ktime_t now)
> >  {
> >  	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
> > +	ktime_t delta;
> >  
> > -	update_ts_time_stats(cpu, ts, now, NULL);
> > +	/* Updates the per cpu time idle statistics counters */
> > +	delta = ktime_sub(now, ts->idle_entrytime);
> > +	if (nr_iowait_cpu(cpu) > 0)
> > +		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
> > +	else
> > +		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
> > +	ts->idle_entrytime = now;
> >  	ts->idle_active = 0;
> 
> With or without this change, why do we update ->idle_entrytime in this case?
> Looks harmless, but a bit confusing.

Oh indeed I missed that. It's a leftover from the copy-paste of update_ts_time_stats()
content.

Well spotted, I'll fix.

> 
> While this doesn't really matter, we could probably even kill ->idle_active
> and use !!ts->idle_entrytime instead.

We could, but it would add slightly more overhead in the irq entry path (cf. tick_check_nohz())
and also make the code a little bit harder to review IMHO.

> 
> > @@ -473,17 +458,14 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
> 
> And I think that we should kill this last_update_time argument later.

Agreed.

Thanks.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-18 16:54   ` Oleg Nesterov
@ 2013-08-18 21:40     ` Frederic Weisbecker
  0 siblings, 0 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-18 21:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On Sun, Aug 18, 2013 at 06:54:00PM +0200, Oleg Nesterov wrote:
> On 08/16, Frederic Weisbecker wrote:
> >
> > This fixes a reported bug where non-monotonic sleeptime stats are returned by /proc/stat
> > when it is frequently read.
> 
> Plus it fixes the problems with 32bit machines reading u64 values.

Ah indeed.

> 
> But perhaps the changelog should mention other races we discussed.

Agreed, I'll detail that.

thanks.

> 
> Oleg.
> 


* Re: [PATCH 3/4] nohz: Consolidate sleep time stats read code
  2013-08-18 17:00   ` Oleg Nesterov
@ 2013-08-18 21:47     ` Frederic Weisbecker
  0 siblings, 0 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-18 21:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On Sun, Aug 18, 2013 at 07:00:01PM +0200, Oleg Nesterov wrote:
> On 08/16, Frederic Weisbecker wrote:
> >
> > get_cpu_idle_time_us() and get_cpu_iowait_time_us() mostly share
> > the same code. Lets consolidate both implementations.
> 
> Personally I like every patch which consolidates the code ;)
> 
> >  	do {
> >  		seq = read_seqcount_begin(&ts->sleeptime_seq);
> > -		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
> > +		if (io)
> > +			sleep = ts->iowait_sleeptime;
> > +		else
> > +			sleep = ts->idle_sleeptime;
> > +
> > +		if (ts->idle_active)
> > +			continue;
> > +
> > +		if ((io && nr_iowait_cpu(cpu)) || (!io && !nr_iowait_cpu(cpu))) {
> >  			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> > -			idle = ktime_add(ts->idle_sleeptime, delta);
> > -		} else {
> > -			idle = ts->idle_sleeptime;
> > +			sleep = ktime_add(sleep, delta);
> >  		}
> >  	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));
> 
> Cosmetic/subjective, feel free to ignore. but perhaps
> 
> 		if (ts->idle_active && io == !!nr_iowait_cpu(cpu)) {
> 			...
> 		}
> 
> looks more understandable.

Agreed, that's much better.

thanks!

> 
> Oleg.
> 
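
[Editorial sketch] Oleg's `io == !!nr_iowait_cpu(cpu)` test can be sanity-checked with a small userspace model of the consolidated read loop. This is a sketch only, not the kernel code: the seqcount is reduced to a bare generation counter, ktime_t to int64_t nanoseconds, and every `*_model` name is made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for nr_iowait_cpu(); in the kernel this reads rq->nr_iowait. */
static int model_nr_iowait;
static int nr_iowait_cpu_model(int cpu) { (void)cpu; return model_nr_iowait; }

struct ts_model {
	unsigned seq;           /* models the seqcount generation */
	bool idle_active;
	int64_t idle_entrytime; /* ktime_t modeled as int64_t ns */
	int64_t idle_sleeptime;
	int64_t iowait_sleeptime;
};

/* Consolidated reader: one function serves both the idle and the iowait
 * accessor, using Oleg's io == !!nr_iowait_cpu(cpu) condition. */
static int64_t get_sleep_time_model(struct ts_model *ts, int cpu,
				    int64_t now, bool io)
{
	int64_t sleep;
	unsigned seq;

	do {
		seq = ts->seq;
		sleep = io ? ts->iowait_sleeptime : ts->idle_sleeptime;
		/* Add the pending delta only when the CPU is idle and its
		 * current iowait state matches the counter being read. */
		if (ts->idle_active && io == !!nr_iowait_cpu_model(cpu))
			sleep += now - ts->idle_entrytime;
	} while (seq != ts->seq);

	return sleep;
}
```

With `idle_active` set and no iowaiters, only the idle counter gets the pending delta; with an iowaiter, only the iowait counter does, which is exactly what the two-branch original computed.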


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 17:12             ` Frederic Weisbecker
  2013-08-18 16:36               ` Oleg Nesterov
@ 2013-08-19 10:58               ` Peter Zijlstra
  2013-08-19 15:44                 ` Arjan van de Ven
  2013-08-19 15:47                 ` Arjan van de Ven
  1 sibling, 2 replies; 75+ messages in thread
From: Peter Zijlstra @ 2013-08-19 10:58 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, Ingo Molnar, Thomas Gleixner, LKML,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Andrew Morton,
	Arjan van de Ven

On Fri, Aug 16, 2013 at 07:12:09PM +0200, Frederic Weisbecker wrote:
> Or maybe Peter could tell us as well. Peter, do you have a preference?

Still trying to wrap my head around it, but conceptually
get_cpu_iowait_time_us() doesn't make any kind of sense. iowait isn't
per cpu since effectively tasks that aren't running aren't assigned a
cpu (as Oleg already pointed out).

The fact that cpufreq 'needs' this just means that cpufreq is broken --
but I think I've said as much previously; cpufreq needs to stop living
in the partitioned-mp era and get dragged (kicking and screaming) into
the smp era.

I'm also not entirely clear on the 'desired' semantics here. Do we count
iowait time as idle or not?


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:46         ` Frederic Weisbecker
  2013-08-16 16:49           ` Oleg Nesterov
@ 2013-08-19 11:10           ` Peter Zijlstra
  2013-08-19 11:15             ` Peter Zijlstra
  2013-08-20  6:59             ` Fernando Luis Vázquez Cao
  2013-08-20  6:21           ` Fernando Luis Vázquez Cao
  2 siblings, 2 replies; 75+ messages in thread
From: Peter Zijlstra @ 2013-08-19 11:10 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, Ingo Molnar, Thomas Gleixner, LKML,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Andrew Morton,
	Arjan van de Ven

On Fri, Aug 16, 2013 at 06:46:28PM +0200, Frederic Weisbecker wrote:

Option A:

> Should we flush that iowait to the src CPU? But then it means we must handle
> concurrent updates to iowait_sleeptime, idle_sleeptime from the migration
> code and from idle enter / exit.
> 
> So I fear we need a seqlock.

Option B:

> Or we can live with that and still account the whole idle time slept until
> tick_nohz_stop_idle() to iowait if we called tick_nohz_start_idle() with nr_iowait > 0.
> All we need is just a new field in ts-> that records on which state we entered
> idle.
> 
> What do you think?

I think option B is unworkable. Afaict it could basically cause
unlimited iowait time. Suppose we have a load-balancer that tries its
bestestest to sort-left (ie. run a task on the lowest 'free' cpu
possible) -- the power aware folks are pondering such schemes.

Now suppose we have a small burst of activity and the rightmost cpu gets
to run something that goes to sleep on iowait.

We'd accrue iowait on that cpu until it wakes up, which could be days
from now if the load stays low enough, even though the task got to run
almost instantly on another cpu.

So no, if we need per-cpu iowait time we have to do A.

Since we already have atomics in the io_schedule*() paths, please
replace those with (seq)locks. Also see if you can place the entire
iowait accounting thing in a separate cacheline.

That said, I'm still not sure if iowait time counts as both idle and
iowait or only iowait.


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-19 11:10           ` Peter Zijlstra
@ 2013-08-19 11:15             ` Peter Zijlstra
  2013-08-20  6:59             ` Fernando Luis Vázquez Cao
  1 sibling, 0 replies; 75+ messages in thread
From: Peter Zijlstra @ 2013-08-19 11:15 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, Ingo Molnar, Thomas Gleixner, LKML,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Andrew Morton,
	Arjan van de Ven

On Mon, Aug 19, 2013 at 01:10:26PM +0200, Peter Zijlstra wrote:
> So no, if we need per-cpu iowait time we have to do A.
> 
> Since we already have atomics in the io_schedule*() paths, please
> replace those with (seq)locks. Also see if you can place the entire
> iowait accounting thing in a separate cacheline.

Also, it completely blows to have these extra atomics in the IO paths,
I'm sure that the Mega-IOP/s people would be much pleased if we could
remove these.




* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-19 10:58               ` Peter Zijlstra
@ 2013-08-19 15:44                 ` Arjan van de Ven
  2013-08-19 15:47                 ` Arjan van de Ven
  1 sibling, 0 replies; 75+ messages in thread
From: Arjan van de Ven @ 2013-08-19 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Oleg Nesterov, Ingo Molnar, Thomas Gleixner,
	LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Andrew Morton

On 8/19/2013 3:58 AM, Peter Zijlstra wrote:
> I'm also not entirely clear on the 'desired' semantics here. Do we count
> iowait time as idle or not?

cpufreq only looks at this because it does not want to count this as idle...
other than that it could not care less.



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-19 10:58               ` Peter Zijlstra
  2013-08-19 15:44                 ` Arjan van de Ven
@ 2013-08-19 15:47                 ` Arjan van de Ven
  1 sibling, 0 replies; 75+ messages in thread
From: Arjan van de Ven @ 2013-08-19 15:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Oleg Nesterov, Ingo Molnar, Thomas Gleixner,
	LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Andrew Morton

On 8/19/2013 3:58 AM, Peter Zijlstra wrote:
> On Fri, Aug 16, 2013 at 07:12:09PM +0200, Frederic Weisbecker wrote:
>> Or may be Peter could tell us as well. Peter, do you have a preference?
>
> Still trying to wrap my head around it, but conceptually
> get_cpu_iowait_time_us() doesn't make any kind of sense. iowait isn't
> per cpu since effectively tasks that aren't running aren't assigned a
> cpu (as Oleg already pointed out).
>
> The fact that cpufreq 'needs' this just means that cpufreq is broken --
> but I think I've said as much previously; cpufreq needs to stop living
> in the partitioned-mp era and get dragged (kicking and screaming) into
> the smp era.
>
> I'm also not entirely clear on the 'desired' semantics here. Do we count
> iowait time as idle or not?

fwiw only some Intel CPUs use this, and those by and large no longer use cpufreq
so this can just go away as far as I am concerned.



* Re: [PATCH 1/4] nohz: Only update sleeptime stats locally
  2013-08-18 17:04   ` Oleg Nesterov
@ 2013-08-19 18:05     ` Stratos Karafotis
  0 siblings, 0 replies; 75+ messages in thread
From: Stratos Karafotis @ 2013-08-19 18:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Rafael J. Wysocki, Viresh Kumar, LKML,
	Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

On 08/18/2013 08:04 PM, Oleg Nesterov wrote:
> Sorry for double post. forgot to cc cpufreq maintainers.
> 
> On 08/16, Frederic Weisbecker wrote:
>>
>> To fix this, lets only update the sleeptime stats locally when the CPU
>> exits from idle.
> 
> I am in no position to ack the changes in this area, but I like this
> change very much. Because, as a code reader, I was totally confused by
> 
> 	if (last_update_time)
> 		update_ts_time_stats()
> 
> code and it looks "obviously wrong".
> 
> I added more cc's. It seems to me that 9366d840 "cpufreq: governors:
> Calculate iowait time only when necessary" doesn't realize what
> 
> 	-       u64 idle_time = get_cpu_idle_time_us(cpu, NULL);
> 	+       u64 idle_time = get_cpu_idle_time_us(cpu, io_busy ? wall : NULL);
> 
> actually means. OTOH, get_cpu_iowait_time_us() was called with
> last_update_time != NULL even before this patch...

To be honest, I am unfamiliar with tick-sched code.
With patch 9366d840, I was trying to avoid duplicate calls to
the get_cpu_iowait_time_us() function. I just saw that the original
code was calling update_ts_time_stats within get_cpu_idle_time_us
and get_cpu_iowait_time_us and I thought that I should keep calling
these functions with non NULL parameter to update the time stats.

In fact the original patch submission was without this:
-       u64 idle_time = get_cpu_idle_time_us(cpu, NULL);
+       u64 idle_time = get_cpu_idle_time_us(cpu, io_busy ? wall : NULL);
and the idle time calculation was wrong (ondemand couldn't increase to max freq)


For your convenience the call paths before and after this patch:

Before patch

get_cpu_idle_time(j, &cur_wall_time);
	u64 idle_time = get_cpu_idle_time_us(cpu, NULL);
	idle_time += get_cpu_iowait_time_us(cpu, wall);
		update_ts_time_stats(cpu, ts, now, last_update_time);
... 
get_cpu_iowait_time_us(j, &cur_wall_time);
	update_ts_time_stats(cpu, ts, now, last_update_time);


After patch (io_busy = 1)

cur_idle_time = get_cpu_idle_time(j, &cur_wall_time, io_busy);
	u64 idle_time = get_cpu_idle_time_us(cpu, io_busy ? wall : NULL);
		update_ts_time_stats(cpu, ts, now, last_update_time);


After patch (io_busy = 0)

cur_idle_time = get_cpu_idle_time(j, &cur_wall_time, io_busy);
	u64 idle_time = get_cpu_idle_time_us(cpu, io_busy ? wall : NULL);

	idle_time += get_cpu_iowait_time_us(cpu, wall);
		update_ts_time_stats(cpu, ts, now, last_update_time);


Regards,
Stratos


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-16 16:46         ` Frederic Weisbecker
  2013-08-16 16:49           ` Oleg Nesterov
  2013-08-19 11:10           ` Peter Zijlstra
@ 2013-08-20  6:21           ` Fernando Luis Vázquez Cao
  2013-08-20 21:55             ` Frederic Weisbecker
  2 siblings, 1 reply; 75+ messages in thread
From: Fernando Luis Vázquez Cao @ 2013-08-20  6:21 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa,
	Peter Zijlstra, Andrew Morton, Arjan van de Ven

(2013-08-17 01:46), Frederic Weisbecker wrote:
> On Fri, Aug 16, 2013 at 06:26:54PM +0200, Oleg Nesterov wrote:
>> On 08/16, Frederic Weisbecker wrote:
>>> On Fri, Aug 16, 2013 at 06:02:01PM +0200, Oleg Nesterov wrote:
>>>>> +	do {
>>>>> +		seq = read_seqcount_begin(&ts->sleeptime_seq);
>>>>> +		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
>>>>> +			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
>>>>> +			iowait = ktime_add(ts->iowait_sleeptime, delta);
>>>>> +		} else {
>>>>> +			iowait = ts->iowait_sleeptime;
>>>>> +		}
>>>>> +	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));
> >>>> Unless I misread this patch, this is still a bit racy.
>>>>
>>>> Suppose it is called on CPU_0 and cpu == 1. Suppose that
>>>> ts->idle_active == T and nr_iowait_cpu(cpu) == 1.
>>>>
>>>> So we return iowait_sleeptime + delta.
>>>>
>>>> Suppose that we call get_cpu_iowait_time_us() again. By this time
>>>> the task which incremented ->nr_iowait can be woken up on another
>>>> CPU, and it can do atomic_dec(rq->nr_iowait). So the next time
>>>> we return iowait_sleeptime, and this is not monotonic again.
>>> Hmm, by the time it decrements nr_iowait, it returned from schedule() and
>>> so idle had flushed the pending iowait sleeptime.
>> Suppose a task does io_schedule() on CPU_0, and increments the counter.
>> This CPU becomes idle and nr_iowait_cpu(0) == 1.
>>
>> Then this task is woken up, but try_to_wake_up() selects another CPU != 0.
>>
>> It returns from schedule() and decrements the same counter, it doesn't
>> do raw_rq/etc again. nr_iowait_cpu(0) becomes 0.
>>
>> In fact the task can even migrate to another CPU right after raw_rq().
> Ah I see now. So that indeed yet another race.

I am sorry for chiming in late.

That is precisely the race I described here:

https://lkml.org/lkml/2013/7/2/3

I should have been clearer in my explanation. I apologize.

> Should we flush that iowait to the src CPU? But then it means we must handle
> concurrent updates to iowait_sleeptime, idle_sleeptime from the migration
> code and from idle enter / exit.
>
> So I fear we need a seqlock.
>
> Or we can live with that and still account the whole idle time slept until
> tick_nohz_stop_idle() to iowait if we called tick_nohz_start_idle() with nr_iowait > 0.
> All we need is just a new field in ts-> that records on which state we entered
> idle.

Another approach could be to shadow ->iowait_sleeptime as
suggested here: https://lkml.org/lkml/2013/7/2/165
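
[Editorial sketch] The wakeup race recapped above can be made concrete with a tiny single-threaded userspace model (illustrative names, not kernel code): the reader extrapolates iowait time from nr_iowait, so a remote decrement while the CPU stays idle makes a second read return less than the first.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal model of CPU 0's idle bookkeeping. */
struct cpu0_model {
	bool idle_active;
	int nr_iowait;          /* rq->nr_iowait for CPU 0 */
	int64_t idle_entrytime;
	int64_t iowait_sleeptime;
};

/* Reader as in the patch: extrapolate iowait time while CPU 0 is idle
 * and a task that slept on it is still counted in nr_iowait. */
static int64_t read_iowait_model(const struct cpu0_model *s, int64_t now)
{
	if (s->idle_active && s->nr_iowait > 0)
		return s->iowait_sleeptime + (now - s->idle_entrytime);
	return s->iowait_sleeptime;
}
```

Dropping nr_iowait to 0 between two reads, with the CPU still idle and nothing flushed, is exactly the remote try_to_wake_up() + atomic_dec() sequence described above.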


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-19 11:10           ` Peter Zijlstra
  2013-08-19 11:15             ` Peter Zijlstra
@ 2013-08-20  6:59             ` Fernando Luis Vázquez Cao
  2013-08-20  8:44               ` Peter Zijlstra
  1 sibling, 1 reply; 75+ messages in thread
From: Fernando Luis Vázquez Cao @ 2013-08-20  6:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Oleg Nesterov, Ingo Molnar, Thomas Gleixner,
	LKML, Tetsuo Handa, Andrew Morton, Arjan van de Ven

(2013-08-19 20:10), Peter Zijlstra wrote:
> On Fri, Aug 16, 2013 at 06:46:28PM +0200, Frederic Weisbecker wrote:
>
> Option A:
>
>> Should we flush that iowait to the src CPU? But then it means we must handle
>> concurrent updates to iowait_sleeptime, idle_sleeptime from the migration
>> code and from idle enter / exit.
>>
>> So I fear we need a seqlock.
> Option B:
>
>> Or we can live with that and still account the whole idle time slept until
>> tick_nohz_stop_idle() to iowait if we called tick_nohz_start_idle() with nr_iowait > 0.
>> All we need is just a new field in ts-> that records on which state we entered
>> idle.
>>
>> What do you think?
> I think option B is unworkable. Afaict it could basically cause
> unlimited iowait time. Suppose we have a load-balancer that tries its
> bestestest to sort-left (ie. run a task on the lowest 'free' cpu
> possible) -- the power aware folks are pondering such schemes.
>
> Now suppose we have a small burst of activity and the rightmost cpu gets
> to run something that goes to sleep on iowait.
>
> We'd accrue iowait on that cpu until it wakes up, which could be days
> from now if the load stays low enough, even though the task got to run
> almost instantly on another cpu.
>
> So no, if we need per-cpu iowait time we have to do A.
>
> Since we already have atomics in the io_schedule*() paths, please
> replace those with (seq)locks. Also see if you can place the entire
> iowait accounting thing in a separate cacheline.

I considered option A for a while but, fearing it would be
considered overkill, took a different approach: create a
shadow copy of ->iowait_sleeptime that is always kept
monotonic (artificially in some cases) and use that to
compute the values exported through /proc.

That said, if deemed acceptable, option A is the one I would
choose.


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20  6:59             ` Fernando Luis Vázquez Cao
@ 2013-08-20  8:44               ` Peter Zijlstra
  2013-08-20 15:29                 ` Frederic Weisbecker
  2013-08-20 15:31                 ` Arjan van de Ven
  0 siblings, 2 replies; 75+ messages in thread
From: Peter Zijlstra @ 2013-08-20  8:44 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Frederic Weisbecker, Oleg Nesterov, Ingo Molnar, Thomas Gleixner,
	LKML, Tetsuo Handa, Andrew Morton, Arjan van de Ven

On Tue, Aug 20, 2013 at 03:59:36PM +0900, Fernando Luis Vázquez Cao wrote:
> That said, if deemed acceptable, option A is the one I would
> choose.

Right, so I think we can do A without much extra cost mostly because we
already have 2 atomics in the io_schedule() path. If we replace those
two atomic operations with locks and modify nr_iowait and the other
stats under the same lock, and ensure all those variables (including the
lock) live in the same cacheline we should have the same cost we have
now.

Of course, if we can get away with completely removing all of that
(which I think Arjan suggested was a real possibility) then that would
be ever so much better still :-)


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20  8:44               ` Peter Zijlstra
@ 2013-08-20 15:29                 ` Frederic Weisbecker
  2013-08-20 15:33                   ` Arjan van de Ven
  2013-08-20 15:31                 ` Arjan van de Ven
  1 sibling, 1 reply; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-20 15:29 UTC (permalink / raw)
  To: Peter Zijlstra, Arjan van de Ven, Rafael J. Wysocki, Viresh Kumar
  Cc: Fernando Luis Vázquez Cao, Oleg Nesterov, Ingo Molnar,
	Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Tue, Aug 20, 2013 at 10:44:05AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 20, 2013 at 03:59:36PM +0900, Fernando Luis Vázquez Cao wrote:
> > That said, if deemed acceptable, option A is the one I would
> > choose.
> 
> Right, so I think we can do A without much extra cost mostly because we
> already have 2 atomics in the io_schedule() path. If we replace those
> two atomic operations with locks and modify nr_iowait and the other
> stats under the same lock, and ensure all those variables (including the
> lock) live in the same cacheline we should have the same cost we have
> now.

I can try that :-)

> 
> Of course, if we can get away with completely removing all of that
> (which I think Arjan suggested was a real possibility) then that would
> be ever so much better still :-)

Would be lovely. But I don't know much about cpufreq, I hope somebody who's
familiar with that code can handle this. Then once there are no more users
of get_cpu_iowait_sleep_time() I can simply zap and clean the tick/time related
code.

Surely the overhead that this mess brings to io_schedule() (and it's going
to be worse with a seqlock, whether in the same cacheline as nr_iowait or
not) may be a good motivation to poke at that cpufreq code part.

Thanks.


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20  8:44               ` Peter Zijlstra
  2013-08-20 15:29                 ` Frederic Weisbecker
@ 2013-08-20 15:31                 ` Arjan van de Ven
  2013-08-20 16:01                   ` Peter Zijlstra
  1 sibling, 1 reply; 75+ messages in thread
From: Arjan van de Ven @ 2013-08-20 15:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Fernando Luis Vázquez Cao, Frederic Weisbecker,
	Oleg Nesterov, Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa,
	Andrew Morton

On 8/20/2013 1:44 AM, Peter Zijlstra wrote:
> On Tue, Aug 20, 2013 at 03:59:36PM +0900, Fernando Luis Vázquez Cao wrote:
>> That said, if deemed acceptable, option A is the one I would
>> choose.
>
> Right, so I think we can do A without much extra cost mostly because we
> already have 2 atomics in the io_schedule() path. If we replace those
> two atomic operations with locks and modify nr_iowait and the other
> stats under the same lock, and ensure all those variables (including the
> lock) live in the same cacheline we should have the same cost we have
> now.
>
> Of course, if we can get away with completely removing all of that
> (which I think Arjan suggested was a real possibility) then that would
> be ever so much better still :-)

I'm quite ok with removing that.

however note that "top" also reports per cpu iowait...
and that's a userspace expectation

>



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20 15:29                 ` Frederic Weisbecker
@ 2013-08-20 15:33                   ` Arjan van de Ven
  2013-08-20 15:35                     ` Frederic Weisbecker
  0 siblings, 1 reply; 75+ messages in thread
From: Arjan van de Ven @ 2013-08-20 15:33 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Fernando Luis Vázquez Cao, Oleg Nesterov, Ingo Molnar,
	Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On 8/20/2013 8:29 AM, Frederic Weisbecker wrote:

>>
>> Of course, if we can get away with completely removing all of that
>> (which I think Arjan suggested was a real possibility) then that would
>> be ever so much better still :-)
>
> Would be lovely. But I don't know much about cpufreq, I hope somebody who's
> familiar with that code can handle this. Then once there are no more users
> of get_cpu_iowait_sleep_time() I can simply zap and clean the tick/time related
> code.

it's just doing the "idle = 100 - busy% - iowait%" calculation.
(with the latter part only for Intel cpus iirc)

in a perfect world the scheduler would be doing that calculation in the first place ;-)

removing the latter part will impact performance some on specific workloads,
but most Intel cpus that this applies to should not be using cpufreq anymore
anyway.




* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20 15:33                   ` Arjan van de Ven
@ 2013-08-20 15:35                     ` Frederic Weisbecker
  2013-08-20 15:41                       ` Arjan van de Ven
  0 siblings, 1 reply; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-20 15:35 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Fernando Luis Vázquez Cao, Oleg Nesterov, Ingo Molnar,
	Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Tue, Aug 20, 2013 at 08:33:50AM -0700, Arjan van de Ven wrote:
> On 8/20/2013 8:29 AM, Frederic Weisbecker wrote:
> 
> >>
> >>Of course, if we can get away with completely removing all of that
> >>(which I think Arjan suggested was a real possibility) then that would
> >>be ever so much better still :-)
> >
> >Would be lovely. But I don't know much about cpufreq, I hope somebody who's
> >familiar with that code can handle this. Then once there are no more users
> >of get_cpu_iowait_sleep_time() I can simply zap and clean the tick/time related
> >code.
> 
> it's just doing the "idle = 100 - busy% - iowait%" calculation.
> (with the latter part only for Intel cpus iirc)
> 
> in a perfect world the scheduler would be doing that calculation in the first place ;-)
> 
> removing the latter part will impact performance some on specific workloads,
> but most Intel cpus that this applies to should not be using cpufreq anymore
> anyway.

Are there other users than Intel?


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20 15:35                     ` Frederic Weisbecker
@ 2013-08-20 15:41                       ` Arjan van de Ven
  0 siblings, 0 replies; 75+ messages in thread
From: Arjan van de Ven @ 2013-08-20 15:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar,
	Fernando Luis Vázquez Cao, Oleg Nesterov, Ingo Molnar,
	Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On 8/20/2013 8:35 AM, Frederic Weisbecker wrote:
> On Tue, Aug 20, 2013 at 08:33:50AM -0700, Arjan van de Ven wrote:
>> On 8/20/2013 8:29 AM, Frederic Weisbecker wrote:
>>
>>>>
>>>> Of course, if we can get away with completely removing all of that
>>>> (which I think Arjan suggested was a real possibility) then that would
>>>> be ever so much better still :-)
>>>
>>> Would be lovely. But I don't know much about cpufreq, I hope somebody who's
>>> familiar with that code can handle this. Then once there are no more users
>>> of get_cpu_iowait_sleep_time() I can simply zap and clean the tick/time related
>>> code.
>>
>> it's just doing the "idle = 100 - busy% - iowait%" calculation.
>> (with the latter part only for Intel cpus iirc)
>>
>> in a perfect world the scheduler would be doing that calculation in the first place ;-)
>>
>> removing the latter part will impact performance some on specific workloads,
>> but most Intel cpus that this applies to should not be using cpufreq anymore
>> anyway.
>
> Are there other users than intel?
>

http://lxr.linux.no/linux+v3.10.7/drivers/cpufreq/cpufreq_ondemand.c#L69

nope


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20 15:31                 ` Arjan van de Ven
@ 2013-08-20 16:01                   ` Peter Zijlstra
  2013-08-20 16:33                     ` Oleg Nesterov
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Zijlstra @ 2013-08-20 16:01 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Fernando Luis Vázquez Cao, Frederic Weisbecker,
	Oleg Nesterov, Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa,
	Andrew Morton

On Tue, Aug 20, 2013 at 08:31:53AM -0700, Arjan van de Ven wrote:
> On 8/20/2013 1:44 AM, Peter Zijlstra wrote:

> >Of course, if we can get away with completely removing all of that
> >(which I think Arjan suggested was a real possibility) then that would
> >be ever so much better still :-)
> 
> I'm quite ok with removing that.
> 
> however note that "top" also reports per cpu iowait...
> and that's a userspace expectation

Right, broken as that may be :/ OK that looks like CPUTIME_IOWAIT which
is tick based and not the ns based accounting.

Still it needs the per-cpu nr_iowait accounting which pretty much
requires the atomics so no big gains there.

Which means that if Frederic can make the ns thing as expensive as the
existing atomics we might as well keep the ns thing too.


Hmm, would something like the below make sense? I suppose this can be
done even for the ns case, you'd have to duplicate all stats though.

At the very least the below reduces the number of atomics, not entirely
sure it all matters much though, some benchmarking would be in order I
suppose.

---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2167,8 +2167,12 @@ unsigned long nr_iowait(void)
 {
 	unsigned long i, sum = 0;
 
-	for_each_possible_cpu(i)
-		sum += atomic_read(&cpu_rq(i)->nr_iowait);
+	for_each_possible_cpu(i) {
+		struct rq *rq = cpu_rq(i);
+
+		sum += rq->nr_iowait_local;
+		sum += atomic_read(&rq->nr_iowait_remote);
+	}
 
 	return sum;
 }
@@ -2176,7 +2180,7 @@ unsigned long nr_iowait(void)
 unsigned long nr_iowait_cpu(int cpu)
 {
 	struct rq *this = cpu_rq(cpu);
-	return atomic_read(&this->nr_iowait);
+	return atomic_read(&this->nr_iowait_remote) + this->nr_iowait_local;
 }
 
 #ifdef CONFIG_SMP
@@ -4086,31 +4090,49 @@ EXPORT_SYMBOL_GPL(yield_to);
  */
 void __sched io_schedule(void)
 {
-	struct rq *rq = raw_rq();
+	struct rq *rq;
 
 	delayacct_blkio_start();
-	atomic_inc(&rq->nr_iowait);
 	blk_flush_plug(current);
+
+	preempt_disable();
+	rq = this_rq();
+	rq->nr_iowait_local++;
 	current->in_iowait = 1;
-	schedule();
+	schedule_preempt_disabled();
 	current->in_iowait = 0;
-	atomic_dec(&rq->nr_iowait);
+	if (likely(task_cpu(current) == cpu_of(rq)))
+		rq->nr_iowait_local--;
+	else
+		atomic_dec(&rq->nr_iowait_remote);
+	preempt_enable();
+
 	delayacct_blkio_end();
 }
 EXPORT_SYMBOL(io_schedule);
 
 long __sched io_schedule_timeout(long timeout)
 {
-	struct rq *rq = raw_rq();
+	struct rq *rq;
 	long ret;
 
 	delayacct_blkio_start();
-	atomic_inc(&rq->nr_iowait);
 	blk_flush_plug(current);
+
+	preempt_disable();
+	rq = this_rq();
+	rq->nr_iowait_local++;
 	current->in_iowait = 1;
+	preempt_enable_no_resched();
 	ret = schedule_timeout(timeout);
+	preempt_disable();
 	current->in_iowait = 0;
-	atomic_dec(&rq->nr_iowait);
+	if (likely(task_cpu(current) == cpu_of(rq)))
+		rq->nr_iowait_local--;
+	else
+		atomic_dec(&rq->nr_iowait_remote);
+	preempt_enable();
+
 	delayacct_blkio_end();
 	return ret;
 }
@@ -6650,7 +6672,8 @@ void __init sched_init(void)
 #endif
 #endif
 		init_rq_hrtick(rq);
-		atomic_set(&rq->nr_iowait, 0);
+		rq->nr_iowait_local = 0;
+		atomic_set(&rq->nr_iowait_remote, 0);
 	}
 
 	set_load_weight(&init_task);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -453,7 +453,8 @@ struct rq {
 	u64 clock;
 	u64 clock_task;
 
-	atomic_t nr_iowait;
+	int nr_iowait_local;
+	atomic_t nr_iowait_remote;
 
 #ifdef CONFIG_SMP
 	struct root_domain *rd;
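
[Editorial sketch] A userspace model of the local/remote split above, with illustrative names and C11 atomics standing in for the kernel's. It shows the intended invariant (local + remote equals the number of waiters) and also a caveat: after a migrated wakeup the two fields are individually skewed (remote at -1), so a reader sampling them separately could transiently see odd values.

```c
#include <stdatomic.h>

struct rq_model {
	int nr_iowait_local;          /* only the owning CPU touches this */
	atomic_int nr_iowait_remote;  /* decremented by migrated wakers */
};

static int nr_iowait_cpu_model(struct rq_model *rq)
{
	return rq->nr_iowait_local + atomic_load(&rq->nr_iowait_remote);
}

/* Entering io_schedule() on this rq: cheap non-atomic increment. */
static void io_sleep_start(struct rq_model *rq)
{
	rq->nr_iowait_local++;
}

/* Leaving: decrement locally if we woke on the same CPU, otherwise
 * atomically on the rq we went to sleep on. */
static void io_sleep_end(struct rq_model *rq_entry, int migrated)
{
	if (!migrated)
		rq_entry->nr_iowait_local--;
	else
		atomic_fetch_sub(&rq_entry->nr_iowait_remote, 1);
}
```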



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20 16:01                   ` Peter Zijlstra
@ 2013-08-20 16:33                     ` Oleg Nesterov
  2013-08-20 17:54                       ` Peter Zijlstra
  2013-08-20 22:18                       ` Frederic Weisbecker
  0 siblings, 2 replies; 75+ messages in thread
From: Oleg Nesterov @ 2013-08-20 16:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On 08/20, Peter Zijlstra wrote:
>
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -453,7 +453,8 @@ struct rq {
>  	u64 clock;
>  	u64 clock_task;
>  
> -	atomic_t nr_iowait;
> +	int nr_iowait_local;
> +	atomic_t nr_iowait_remote;

I am wondering how bad the extra lock(rq)/unlock(rq) in schedule() is
compared to atomic_dec.

IOW, what if we simply make rq->nr_iowait "int" and change schedule()
to update it? Something like below. Just curious.

As for nr_iowait_local + nr_iowait_remote, this doesn't look safe...
in theory nr_iowait_cpu() or even nr_iowait() can return a negative
number.

Oleg.


--- x/kernel/sched/core.c
+++ x/kernel/sched/core.c
@@ -2435,6 +2435,9 @@ need_resched:
 		rq->curr = next;
 		++*switch_count;
 
+		if (unlikely(prev->in_iowait))
+			rq->nr_iowait++;
+
 		context_switch(rq, prev, next); /* unlocks the rq */
 		/*
 		 * The context switch have flipped the stack from under us
@@ -2442,6 +2445,12 @@ need_resched:
 		 * this task called schedule() in the past. prev == current
 		 * is still correct, but it can be moved to another cpu/rq.
 		 */
+		if (unlikely(prev->in_iowait)) {
+			raw_spin_lock_irq(&rq->lock);
+			rq->nr_iowait--;
+			raw_spin_unlock_irq(&rq->lock);
+		}
+
 		cpu = smp_processor_id();
 		rq = cpu_rq(cpu);
 	} else
@@ -3939,31 +3948,24 @@ EXPORT_SYMBOL_GPL(yield_to);
  */
 void __sched io_schedule(void)
 {
-	struct rq *rq = raw_rq();
-
 	delayacct_blkio_start();
-	atomic_inc(&rq->nr_iowait);
 	blk_flush_plug(current);
 	current->in_iowait = 1;
 	schedule();
 	current->in_iowait = 0;
-	atomic_dec(&rq->nr_iowait);
 	delayacct_blkio_end();
 }
 EXPORT_SYMBOL(io_schedule);
 
 long __sched io_schedule_timeout(long timeout)
 {
-	struct rq *rq = raw_rq();
 	long ret;
 
 	delayacct_blkio_start();
-	atomic_inc(&rq->nr_iowait);
 	blk_flush_plug(current);
 	current->in_iowait = 1;
 	ret = schedule_timeout(timeout);
 	current->in_iowait = 0;
-	atomic_dec(&rq->nr_iowait);
 	delayacct_blkio_end();
 	return ret;
 }



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20 16:33                     ` Oleg Nesterov
@ 2013-08-20 17:54                       ` Peter Zijlstra
  2013-08-20 18:25                         ` Oleg Nesterov
  2013-08-20 22:18                       ` Frederic Weisbecker
  1 sibling, 1 reply; 75+ messages in thread
From: Peter Zijlstra @ 2013-08-20 17:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On Tue, Aug 20, 2013 at 06:33:12PM +0200, Oleg Nesterov wrote:
> On 08/20, Peter Zijlstra wrote:

> I am wondering how bad the extra lock(rq)/unlock(rq) in schedule() is
> compared to atomic_dec.
> 
> IOW, what if we simply make rq->nr_iowait "int" and change schedule()
> to update it? Something like below. Just curious.
> 
> As for nr_iowait_local + nr_iowait_remote, this doesn't look safe...
> in theory nr_iowait_cpu() or even nr_iowait() can return a negative
> number.

I'm too tired to see how it's different, I'll think on it :-)


That said:

> --- x/kernel/sched/core.c
> +++ x/kernel/sched/core.c
> @@ -2435,6 +2435,9 @@ need_resched:
>  		rq->curr = next;
>  		++*switch_count;
>  
> +		if (unlikely(prev->in_iowait))
> +			rq->nr_iowait++;
> +
>  		context_switch(rq, prev, next); /* unlocks the rq */
>  		/*
>  		 * The context switch have flipped the stack from under us
> @@ -2442,6 +2445,12 @@ need_resched:
>  		 * this task called schedule() in the past. prev == current
>  		 * is still correct, but it can be moved to another cpu/rq.
>  		 */
> +		if (unlikely(prev->in_iowait)) {
> +			raw_spin_lock_irq(&rq->lock);
> +			rq->nr_iowait--;
> +			raw_spin_unlock_irq(&rq->lock);
> +		}

This seems like the wrong place, this is where you return from
schedule() running another task, not where the task you just sent to
sleep wakes up.

If you want to do this, the nr_iowait decrement needs to be in the
wakeup path someplace.

>  		cpu = smp_processor_id();
>  		rq = cpu_rq(cpu);
>  	} else


* Re: [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats
  2013-08-16 15:42 [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2013-08-16 15:42 ` [PATCH 4/4] nohz: Convert a few places to use local per cpu accesses Frederic Weisbecker
@ 2013-08-20 18:15 ` Oleg Nesterov
  2013-08-21  8:28   ` Peter Zijlstra
  4 siblings, 1 reply; 75+ messages in thread
From: Oleg Nesterov @ 2013-08-20 18:15 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Fernando Luis Vazquez Cao, Tetsuo Handa, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Arjan van de Ven

While at it.

I also do not understand the cpu_online() checks in fs/proc/stat.c.

OK, I agree, if a cpu is offline it should not participate in the cpu
summary. But if it goes offline, why should it switch from
->iowait_sleeptime to cpustat[CPUTIME_IDLE] as seen by /proc/stat?

This can be another source of "idle goes backward", no?

IOW. Ignoring the other problems we have, perhaps something like
below makes sense?

Oleg.

--- x/fs/proc/stat.c
+++ x/fs/proc/stat.c
@@ -45,10 +45,9 @@ static cputime64_t get_iowait_time(int c
 
 static u64 get_idle_time(int cpu)
 {
-	u64 idle, idle_time = -1ULL;
+	u64 idle, idle_time;
 
-	if (cpu_online(cpu))
-		idle_time = get_cpu_idle_time_us(cpu, NULL);
+	idle_time = get_cpu_idle_time_us(cpu, NULL);
 
 	if (idle_time == -1ULL)
 		/* !NO_HZ or cpu offline so we can rely on cpustat.idle */
@@ -61,10 +60,9 @@ static u64 get_idle_time(int cpu)
 
 static u64 get_iowait_time(int cpu)
 {
-	u64 iowait, iowait_time = -1ULL;
+	u64 iowait, iowait_time;
 
-	if (cpu_online(cpu))
-		iowait_time = get_cpu_iowait_time_us(cpu, NULL);
+	iowait_time = get_cpu_iowait_time_us(cpu, NULL);
 
 	if (iowait_time == -1ULL)
 		/* !NO_HZ or cpu offline so we can rely on cpustat.iowait */
--- x/kernel/time/tick-sched.c
+++ x/kernel/time/tick-sched.c
@@ -477,7 +477,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *l
 		update_ts_time_stats(cpu, ts, now, last_update_time);
 		idle = ts->idle_sleeptime;
 	} else {
-		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
+		if (ts->idle_active && cpu_online(cpu) && !nr_iowait_cpu(cpu)) {
 			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
 
 			idle = ktime_add(ts->idle_sleeptime, delta);
@@ -518,7 +518,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 
 		update_ts_time_stats(cpu, ts, now, last_update_time);
 		iowait = ts->iowait_sleeptime;
 	} else {
-		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
+		if (ts->idle_active && cpu_online(cpu) && nr_iowait_cpu(cpu)) {
 			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
 
 			iowait = ktime_add(ts->iowait_sleeptime, delta);



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20 17:54                       ` Peter Zijlstra
@ 2013-08-20 18:25                         ` Oleg Nesterov
  2013-08-21  8:31                           ` Peter Zijlstra
  0 siblings, 1 reply; 75+ messages in thread
From: Oleg Nesterov @ 2013-08-20 18:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On 08/20, Peter Zijlstra wrote:
>
> On Tue, Aug 20, 2013 at 06:33:12PM +0200, Oleg Nesterov wrote:
> > --- x/kernel/sched/core.c
> > +++ x/kernel/sched/core.c
> > @@ -2435,6 +2435,9 @@ need_resched:
> >  		rq->curr = next;
> >  		++*switch_count;
> >  
> > +		if (unlikely(prev->in_iowait))
> > +			rq->nr_iowait++;
> > +
> >  		context_switch(rq, prev, next); /* unlocks the rq */
> >  		/*
> >  		 * The context switch have flipped the stack from under us
> > @@ -2442,6 +2445,12 @@ need_resched:
> >  		 * this task called schedule() in the past. prev == current
> >  		 * is still correct, but it can be moved to another cpu/rq.
> >  		 */
> > +		if (unlikely(prev->in_iowait)) {
> > +			raw_spin_lock_irq(&rq->lock);
> > +			rq->nr_iowait--;
> > +			raw_spin_unlock_irq(&rq->lock);
> > +		}
>
> This seems like the wrong place, this is where you return from
> schedule() running another task,

Yes, but prev is current, and rq should be "correct" for
rq->nr_iowait-- ?

This local var should be equal to its value when this task called
context_switch() in the past.

Like any other variable, like "rq = raw_rq()" in io_schedule().

> not where the task you just sent to
> sleep wakes up.

sure, but currently io_schedule() does the same.



Btw. Whatever we do, can't we unify io_schedule/io_schedule_timeout?

Oleg.


--- x/kernel/sched/core.c
+++ x/kernel/sched/core.c
@@ -3939,16 +3939,7 @@ EXPORT_SYMBOL_GPL(yield_to);
  */
 void __sched io_schedule(void)
 {
-	struct rq *rq = raw_rq();
-
-	delayacct_blkio_start();
-	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
-	current->in_iowait = 1;
-	schedule();
-	current->in_iowait = 0;
-	atomic_dec(&rq->nr_iowait);
-	delayacct_blkio_end();
+	io_schedule_timeout(MAX_SCHEDULE_TIMEOUT);
 }
 EXPORT_SYMBOL(io_schedule);
 



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20  6:21           ` Fernando Luis Vázquez Cao
@ 2013-08-20 21:55             ` Frederic Weisbecker
  0 siblings, 0 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-20 21:55 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: Oleg Nesterov, Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa,
	Peter Zijlstra, Andrew Morton, Arjan van de Ven

On Tue, Aug 20, 2013 at 03:21:11PM +0900, Fernando Luis Vázquez Cao wrote:
> (2013年08月17日 01:46), Frederic Weisbecker wrote:
> >On Fri, Aug 16, 2013 at 06:26:54PM +0200, Oleg Nesterov wrote:
> >>On 08/16, Frederic Weisbecker wrote:
> >>>On Fri, Aug 16, 2013 at 06:02:01PM +0200, Oleg Nesterov wrote:
> >>>>>+	do {
> >>>>>+		seq = read_seqcount_begin(&ts->sleeptime_seq);
> >>>>>+		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> >>>>>+			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> >>>>>+			iowait = ktime_add(ts->iowait_sleeptime, delta);
> >>>>>+		} else {
> >>>>>+			iowait = ts->iowait_sleeptime;
> >>>>>+		}
> >>>>>+	} while (read_seqcount_retry(&ts->sleeptime_seq, seq));
> >>>>Unless I missread this patch, this is still racy a bit.
> >>>>
> >>>>Suppose it is called on CPU_0 and cpu == 1. Suppose that
> >>>>ts->idle_active == T and nr_iowait_cpu(cpu) == 1.
> >>>>
> >>>>So we return iowait_sleeptime + delta.
> >>>>
> >>>>Suppose that we call get_cpu_iowait_time_us() again. By this time
> >>>>the task which incremented ->nr_iowait can be woken up on another
> >>>>CPU, and it can do atomic_dec(rq->nr_iowait). So the next time
> >>>>we return iowait_sleeptime, and this is not monotonic again.
> >>>Hmm, by the time it decrements nr_iowait, it returned from schedule() and
> >>>so idle had flushed the pending iowait sleeptime.
> >>Suppose a task does io_schedule() on CPU_0, and increments the counter.
> >>This CPU becomes idle and nr_iowait_cpu(0) == 1.
> >>
> >>Then this task is woken up, but try_to_wake_up() selects another CPU != 0.
> >>
> >>It returns from schedule() and decrements the same counter, it doesn't
> >>do raw_rq/etc again. nr_iowait_cpu(0) becomes 0.
> >>
> >>In fact the task can even migrate to another CPU right after raw_rq().
> >Ah I see now. So that indeed yet another race.
> 
> I am sorry for chiming in late.
> 
> That is precisely the race I described here:
> 
> https://lkml.org/lkml/2013/7/2/3
> 
> I should have been clearer in my explanation. I apologize.

Reading that again, the description was good. It's rather me who didn't read it
carefully enough :)

> 
> >Should we flush that iowait to the src CPU? But then it means we must handle
> >concurrent updates to iowait_sleeptime, idle_sleeptime from the migration
> >code and from idle enter / exit.
> >
> >So I fear we need a seqlock.
> >
> >Or we can live with that and still account the whole idle time slept until
> >tick_nohz_stop_idle() to iowait if we called tick_nohz_start_idle() with nr_iowait > 0.
> >All we need is just a new field in ts-> that records on which state we entered
> >idle.
> 
> Another approach could be to shadow ->iowait_sleeptime as
> suggested here: https://lkml.org/lkml/2013/7/2/165

Hmm, yeah it seems that it would enforce the monotonicity, but the wrong forward jumps
due to bad ordering could still happen.

Oh and I realize you already suggested accounting the iowait time on task migration more than
a month ago. Grr I should really sit down before reading emails.


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20 16:33                     ` Oleg Nesterov
  2013-08-20 17:54                       ` Peter Zijlstra
@ 2013-08-20 22:18                       ` Frederic Weisbecker
  2013-08-21 11:49                         ` Oleg Nesterov
  1 sibling, 1 reply; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-20 22:18 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Arjan van de Ven, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Tue, Aug 20, 2013 at 06:33:12PM +0200, Oleg Nesterov wrote:
> On 08/20, Peter Zijlstra wrote:
> >
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -453,7 +453,8 @@ struct rq {
> >  	u64 clock;
> >  	u64 clock_task;
> >  
> > -	atomic_t nr_iowait;
> > +	int nr_iowait_local;
> > +	atomic_t nr_iowait_remote;
> 
> I am wondering how bad the extra lock(rq)/unlock(rq) in schedule() is
> compared to atomic_dec.
> 
> IOW, what if we simply make rq->nr_iowait "int" and change schedule()
> to update it? Something like below. Just curious.
> 
> As for nr_iowait_local + nr_iowait_remote, this doesn't look safe...
> in theory nr_iowait_cpu() or even nr_iowait() can return a negative
> number.
> 
> Oleg.
> 
> 
> --- x/kernel/sched/core.c
> +++ x/kernel/sched/core.c
> @@ -2435,6 +2435,9 @@ need_resched:
>  		rq->curr = next;
>  		++*switch_count;
>  
> +		if (unlikely(prev->in_iowait))
> +			rq->nr_iowait++;
> +
>  		context_switch(rq, prev, next); /* unlocks the rq */
>  		/*
>  		 * The context switch have flipped the stack from under us
> @@ -2442,6 +2445,12 @@ need_resched:
>  		 * this task called schedule() in the past. prev == current
>  		 * is still correct, but it can be moved to another cpu/rq.
>  		 */
> +		if (unlikely(prev->in_iowait)) {
> +			raw_spin_lock_irq(&rq->lock);
> +			rq->nr_iowait--;
> +			raw_spin_unlock_irq(&rq->lock);
> +		}
> +

It seems that with this solution rq->nr_iowait is only ever modified locally.
Can't we just disable irqs for rq->nr_iowait-- ?

Also if this is only updated locally, no need for a lock to update ts->iowait_sleep_time anymore,
we can flush the io sleeptime in place. Great thing.

Now just in case, it might be worth noting as well that the time elapsing from the task's enqueue
on the rq until it is finally running on the CPU is not accounted there anymore. This can make a little
difference if the task is woken up but a higher-priority task, possibly SCHED_FIFO, is already running
on the CPU. Now do we care...

Thanks.

 
>  		cpu = smp_processor_id();
>  		rq = cpu_rq(cpu);
>  	} else
> @@ -3939,31 +3948,24 @@ EXPORT_SYMBOL_GPL(yield_to);
>   */
>  void __sched io_schedule(void)
>  {
> -	struct rq *rq = raw_rq();
> -
>  	delayacct_blkio_start();
> -	atomic_inc(&rq->nr_iowait);
>  	blk_flush_plug(current);
>  	current->in_iowait = 1;
>  	schedule();
>  	current->in_iowait = 0;
> -	atomic_dec(&rq->nr_iowait);
>  	delayacct_blkio_end();
>  }
>  EXPORT_SYMBOL(io_schedule);
>  
>  long __sched io_schedule_timeout(long timeout)
>  {
> -	struct rq *rq = raw_rq();
>  	long ret;
>  
>  	delayacct_blkio_start();
> -	atomic_inc(&rq->nr_iowait);
>  	blk_flush_plug(current);
>  	current->in_iowait = 1;
>  	ret = schedule_timeout(timeout);
>  	current->in_iowait = 0;
> -	atomic_dec(&rq->nr_iowait);
>  	delayacct_blkio_end();
>  	return ret;
>  }
> 


* Re: [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats
  2013-08-20 18:15 ` [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats Oleg Nesterov
@ 2013-08-21  8:28   ` Peter Zijlstra
  2013-08-21 11:42     ` Oleg Nesterov
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Zijlstra @ 2013-08-21  8:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, LKML, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	Arjan van de Ven

On Tue, Aug 20, 2013 at 08:15:00PM +0200, Oleg Nesterov wrote:
> While at it.
> 
> I also do not understand the cpu_online() checks in fs/proc/stat.c.
> 
> OK, I agree, if a cpu is offline it should not participate in the cpu
> summary. But if it goes offline, why should it switch from
> ->iowait_sleeptime to cpustat[CPUTIME_IDLE] as seen by /proc/stat?
> 
> This can be another source of "idle goes backward", no?
> 
> IOW. Ignoring the other problems we have, perhaps something like
> below makes sense?


Agreed, however

> 
> +++ x/kernel/time/tick-sched.c
> @@ -477,7 +477,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *l
>  		update_ts_time_stats(cpu, ts, now, last_update_time);
>  		idle = ts->idle_sleeptime;
>  	} else {
> -		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
> +		if (ts->idle_active && cpu_online(cpu) && !nr_iowait_cpu(cpu)) {
>  			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
>  
>  			idle = ktime_add(ts->idle_sleeptime, delta);
> @@ -518,7 +518,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 
>  		update_ts_time_stats(cpu, ts, now, last_update_time);
>  		iowait = ts->iowait_sleeptime;
>  	} else {
> -		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> +		if (ts->idle_active && cpu_online(cpu) && nr_iowait_cpu(cpu)) {
>  			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
>  
>  			iowait = ktime_add(ts->iowait_sleeptime, delta);
> 

That's still mighty odd, but I guess that's in part due to the whacky
semantics. We could simply transfer any open nr_iowait to the cpu
doing the hotplug and then we have offline cpus that have nr_iowait == 0
and the above becomes simpler again.


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20 18:25                         ` Oleg Nesterov
@ 2013-08-21  8:31                           ` Peter Zijlstra
  2013-08-21 11:35                             ` Oleg Nesterov
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Zijlstra @ 2013-08-21  8:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On Tue, Aug 20, 2013 at 08:25:53PM +0200, Oleg Nesterov wrote:
> On 08/20, Peter Zijlstra wrote:
> >
> > On Tue, Aug 20, 2013 at 06:33:12PM +0200, Oleg Nesterov wrote:

> > > +		if (unlikely(prev->in_iowait)) {
> > > +			raw_spin_lock_irq(&rq->lock);
> > > +			rq->nr_iowait--;
> > > +			raw_spin_unlock_irq(&rq->lock);
> > > +		}
> >
> > This seems like the wrong place, this is where you return from
> > schedule() running another task,
> 
> Yes, but prev is current, and rq should be "correct" for
> rq->nr_iowait-- ?

Yes it's the right rq, but the wrong time.

> This local var should be equal to its value when this task called
> context_switch() in the past.
> 
> Like any other variable, like "rq = raw_rq()" in io_schedule().
> 
> > not where the task you just send to
> > sleep wakes up.
> 
> sure, but currently io_schedule() does the same.

No it doesn't. It only does the decrement when the task is woken back
up. Not right after it switches out.

> Btw. Whatever we do, can't we unify io_schedule/io_schedule_timeout?

I suppose we could, a timeout of MAX_SCHEDULE_TIMEOUT will act like a
regular schedule, but it gets all the overhead of doing
schedule_timeout(). So I don't think its a win.


* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-21  8:31                           ` Peter Zijlstra
@ 2013-08-21 11:35                             ` Oleg Nesterov
  2013-08-21 12:33                               ` Peter Zijlstra
  2013-08-21 12:48                               ` Peter Zijlstra
  0 siblings, 2 replies; 75+ messages in thread
From: Oleg Nesterov @ 2013-08-21 11:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On 08/21, Peter Zijlstra wrote:
>
> On Tue, Aug 20, 2013 at 08:25:53PM +0200, Oleg Nesterov wrote:
> > On 08/20, Peter Zijlstra wrote:
> > >
> > > On Tue, Aug 20, 2013 at 06:33:12PM +0200, Oleg Nesterov wrote:
>
> > > > +		if (unlikely(prev->in_iowait)) {
> > > > +			raw_spin_lock_irq(&rq->lock);
> > > > +			rq->nr_iowait--;
> > > > +			raw_spin_unlock_irq(&rq->lock);
> > > > +		}
> > >
> > > This seems like the wrong place, this is where you return from
> > > schedule() running another task,
> >
> > Yes, but prev is current, and rq should be "correct" for
> > rq->nr_iowait-- ?
>
> Yes it's the right rq, but the wrong time.

Hmm. Just in case, it is not that I think this patch really makes sense,
but I'd like to understand why you think it is wrong.

> > This local var should be equal to its value when this task called
> > context_switch() in the past.
> >
> > Like any other variable, like "rq = raw_rq()" in io_schedule().
> >
> > not where the task you just sent to
> > sleep wakes up.
> >
> > sure, but currently io_schedule() does the same.
>
> No it doesn't. It only does the decrement when the task is woken back
> up. Not right after it switches out.

But it is not "after it switches out", it is after it switched back.

Lets ignore the locking,

	if (prev->in_iowait)
		rq->nr_iowait++;

	context_switch(prev, next);

	if (prev->in_iowait)
		rq->nr_iowait--;

From the task_struct's (current's) POV, prev/rq are the same, before or
after context_switch().

But from the CPU's pov they differ. And ignoring more details on UP the
code above is equivalent to

	if (prev->in_iowait)
		rq->nr_iowait++;

	if (next->in_iowait)
		rq->nr_iowait--;

	context_switch(prev, next);

No?

Yes, need_resched()/preemption can trigger more inc/dec's than io_schedule()
does, but I don't think this was your concern.

> > Btw. Whatever we do, can't we unify io_schedule/io_schedule_timeout?
>
> I suppose we could, a timeout of MAX_SCHEDULE_TIMEOUT will act like a
> regular schedule, but it gets all the overhead of doing
> schedule_timeout(). So I don't think its a win.

Well, the only overhead is "if(to == MAX_SCHEDULE_TIMEOUT)" at the start.
I don't think it makes sense to copy-and-paste the identical code to
avoid it. But please ignore, this is really minor and off-topic.

Oleg.



* Re: [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats
  2013-08-21  8:28   ` Peter Zijlstra
@ 2013-08-21 11:42     ` Oleg Nesterov
  0 siblings, 0 replies; 75+ messages in thread
From: Oleg Nesterov @ 2013-08-21 11:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, Fernando Luis Vazquez Cao,
	Tetsuo Handa, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	Arjan van de Ven

On 08/21, Peter Zijlstra wrote:
>
> On Tue, Aug 20, 2013 at 08:15:00PM +0200, Oleg Nesterov wrote:
> > While at it.
> >
> > I also do not understand the cpu_online() checks in fs/proc/stat.c.
> >
> > OK, I agree, if a cpu is offline it should not participate in the cpu
> > summary. But if it goes offline, why should it switch from
> > ->iowait_sleeptime to cpustat[CPUTIME_IDLE] as seen by /proc/stat?
> >
> > This can be another source of "idle goes backward", no?
> >
> > IOW. Ignoring the other problems we have, perhaps something like
> > below makes sense?
>
>
> Agreed, however

OK, good,

> > +++ x/kernel/time/tick-sched.c
> > @@ -477,7 +477,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *l
> >  		update_ts_time_stats(cpu, ts, now, last_update_time);
> >  		idle = ts->idle_sleeptime;
> >  	} else {
> > -		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
> > +		if (ts->idle_active && cpu_online(cpu) && !nr_iowait_cpu(cpu)) {
> >  			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> >
> >  			idle = ktime_add(ts->idle_sleeptime, delta);
> > @@ -518,7 +518,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64
> >  		update_ts_time_stats(cpu, ts, now, last_update_time);
> >  		iowait = ts->iowait_sleeptime;
> >  	} else {
> > -		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
> > +		if (ts->idle_active && cpu_online(cpu) && nr_iowait_cpu(cpu)) {
> >  			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
> >
> >  			iowait = ktime_add(ts->iowait_sleeptime, delta);
> >
>
> That's still mighty odd, but I guess that's in part due to the whacky
> semantics. We could simply transfer any open nr_iowait to the cpu
> doing the hotplug and then we have offline cpus that have nr_iowait == 0
> and the above becomes simpler again.

This won't help get_cpu_idle_time_us().

But anyway we should fix other problems first, then think about this
change. I just wanted to verify that I didn't miss something and this
iowait_sleeptime -> CPUTIME_IDLE switch is indeed wrong.

Oleg.



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-20 22:18                       ` Frederic Weisbecker
@ 2013-08-21 11:49                         ` Oleg Nesterov
  0 siblings, 0 replies; 75+ messages in thread
From: Oleg Nesterov @ 2013-08-21 11:49 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Arjan van de Ven, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On 08/21, Frederic Weisbecker wrote:
>
> On Tue, Aug 20, 2013 at 06:33:12PM +0200, Oleg Nesterov wrote:
> >
> > --- x/kernel/sched/core.c
> > +++ x/kernel/sched/core.c
> > @@ -2435,6 +2435,9 @@ need_resched:
> >  		rq->curr = next;
> >  		++*switch_count;
> >
> > +		if (unlikely(prev->in_iowait))
> > +			rq->nr_iowait++;
> > +
> >  		context_switch(rq, prev, next); /* unlocks the rq */
> >  		/*
> >  		 * The context switch have flipped the stack from under us
> > @@ -2442,6 +2445,12 @@ need_resched:
> >  		 * this task called schedule() in the past. prev == current
> >  		 * is still correct, but it can be moved to another cpu/rq.
> >  		 */
> > +		if (unlikely(prev->in_iowait)) {
> > +			raw_spin_lock_irq(&rq->lock);
> > +			rq->nr_iowait--;
> > +			raw_spin_unlock_irq(&rq->lock);
> > +		}
> > +
>
> It seems that with this solution rq->nr_iowait is only ever modified locally.
> Can't we just disable irqs for rq->nr_iowait-- ?

Not really. rq holds the old value, which was used when this task
called context_switch() in the past.

IOW, if a task T does io_schedule() on CPU_0, then cpu_of(rq) is still 0
after context_switch(), even if T runs on another cpu.

> Also if this is only updated locally,

Unfortunately, I don't see how we can do this.

Oleg.



* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-21 11:35                             ` Oleg Nesterov
@ 2013-08-21 12:33                               ` Peter Zijlstra
  2013-08-21 14:23                                 ` Peter Zijlstra
  2013-08-21 12:48                               ` Peter Zijlstra
  1 sibling, 1 reply; 75+ messages in thread
From: Peter Zijlstra @ 2013-08-21 12:33 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On Wed, Aug 21, 2013 at 01:35:51PM +0200, Oleg Nesterov wrote:
> On 08/21, Peter Zijlstra wrote:

> > Yes it's the right rq, but the wrong time.
> 
> Hmm. Just in case, it is not that I think this patch really makes sense,
> but I'd like to understand why do you think it is wrong.
 
> But it is not "after it switches out", it is after it switched back.

D'uh I was being particularly dense it seems :/

Yes I think it is correct. You're now trading two atomic ops on a
different cacheline than rq->lock for one atomic op on the rq->lock.

Might be a win, esp. since hopefully there's a fair chance it's the same
runqueue.




* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-21 11:35                             ` Oleg Nesterov
  2013-08-21 12:33                               ` Peter Zijlstra
@ 2013-08-21 12:48                               ` Peter Zijlstra
  2013-08-21 17:09                                 ` Oleg Nesterov
  1 sibling, 1 reply; 75+ messages in thread
From: Peter Zijlstra @ 2013-08-21 12:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On Wed, Aug 21, 2013 at 01:35:51PM +0200, Oleg Nesterov wrote:
> > > Btw. Whatever we do, can't we unify io_schedule/io_schedule_timeout?
> >
> > I suppose we could, a timeout of MAX_SCHEDULE_TIMEOUT will act like a
> > regular schedule, but it gets all the overhead of doing
> > schedule_timeout(). So I don't think its a win.
> 
> Well, the only overhead is "if(to == MAX_SCHEDULE_TIMEOUT)" at the start.
> I don't think it makes sense to copy-and-paste the identical code to
> avoid it. But please ignore, this is really minor and off-topic.

Ah, so the 'problem' is that schedule_timeout() doesn't live in
kernel/sched/core.c and we thus get an extra function call (which is
somewhat expensive on some archs).

That said, I suppose we could do something like the below, that should
allow the compiler to DTRT.

---
 include/linux/sched.h |   35 ++++++++++++++++++++++---
 kernel/sched/core.c   |   22 +++++-----------
 kernel/timer.c        |   67 ++++++++++--------------------------------------
 3 files changed, 52 insertions(+), 72 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 50d04b9..112960c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -303,10 +303,37 @@ extern char __sched_text_start[], __sched_text_end[];
 extern int in_sched_functions(unsigned long addr);
 
 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX
-extern signed long schedule_timeout(signed long timeout);
-extern signed long schedule_timeout_interruptible(signed long timeout);
-extern signed long schedule_timeout_killable(signed long timeout);
-extern signed long schedule_timeout_uninterruptible(signed long timeout);
+extern long __schedule_timeout(long timeout);
+
+static inline long schedule_timeout(long timeout)
+{
+	if (timeout == MAX_SCHEDULE_TIMEOUT) {
+		schedule();
+		return timeout;
+	}
+	return __schedule_timeout(timeout);
+}
+
+/*
+ * We can use __set_current_state() here because schedule_timeout() calls
+ * schedule() unconditionally.
+ */
+static inline long schedule_timeout_interruptible(long timeout)
+{
+	__set_current_state(TASK_INTERRUPTIBLE);
+	return schedule_timeout(timeout);
+}
+static inline long schedule_timeout_killable(long timeout)
+{
+	__set_current_state(TASK_KILLABLE);
+	return schedule_timeout(timeout);
+}
+static inline long schedule_timeout_uninterruptible(long timeout)
+{
+	__set_current_state(TASK_UNINTERRUPTIBLE);
+	return schedule_timeout(timeout);
+}
+
 asmlinkage void schedule(void);
 extern void schedule_preempt_disabled(void);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f737871..ef2cddd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3937,21 +3937,6 @@ EXPORT_SYMBOL_GPL(yield_to);
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
  */
-void __sched io_schedule(void)
-{
-	struct rq *rq = raw_rq();
-
-	delayacct_blkio_start();
-	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
-	current->in_iowait = 1;
-	schedule();
-	current->in_iowait = 0;
-	atomic_dec(&rq->nr_iowait);
-	delayacct_blkio_end();
-}
-EXPORT_SYMBOL(io_schedule);
-
 long __sched io_schedule_timeout(long timeout)
 {
 	struct rq *rq = raw_rq();
@@ -3968,6 +3953,13 @@ long __sched io_schedule_timeout(long timeout)
 	return ret;
 }
 
+void __sched io_schedule(void)
+{
+	io_schedule_timeout(MAX_SCHEDULE_TIMEOUT);
+}
+EXPORT_SYMBOL(io_schedule);
+
+
 /**
  * sys_sched_get_priority_max - return maximum RT priority.
  * @policy: scheduling class.
diff --git a/kernel/timer.c b/kernel/timer.c
index 15bc1b4..8244ce0 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1429,38 +1429,24 @@ static void process_timeout(unsigned long __data)
  *
  * In all cases the return value is guaranteed to be non-negative.
  */
-signed long __sched schedule_timeout(signed long timeout)
+signed long __sched __schedule_timeout(signed long timeout)
 {
 	struct timer_list timer;
 	unsigned long expire;
 
-	switch (timeout)
-	{
-	case MAX_SCHEDULE_TIMEOUT:
-		/*
-		 * These two special cases are useful to be comfortable
-		 * in the caller. Nothing more. We could take
-		 * MAX_SCHEDULE_TIMEOUT from one of the negative value
-		 * but I' d like to return a valid offset (>=0) to allow
-		 * the caller to do everything it want with the retval.
-		 */
-		schedule();
-		goto out;
-	default:
-		/*
-		 * Another bit of PARANOID. Note that the retval will be
-		 * 0 since no piece of kernel is supposed to do a check
-		 * for a negative retval of schedule_timeout() (since it
-		 * should never happens anyway). You just have the printk()
-		 * that will tell you if something is gone wrong and where.
-		 */
-		if (timeout < 0) {
-			printk(KERN_ERR "schedule_timeout: wrong timeout "
+	/*
+	 * Another bit of PARANOID. Note that the retval will be 0 since no
+	 * piece of kernel is supposed to do a check for a negative retval of
+	 * schedule_timeout() (since it should never happens anyway). You just
+	 * have the printk() that will tell you if something is gone wrong and
+	 * where.
+	 */
+	if (timeout < 0) {
+		printk(KERN_ERR "schedule_timeout: wrong timeout "
 				"value %lx\n", timeout);
-			dump_stack();
-			current->state = TASK_RUNNING;
-			goto out;
-		}
+		dump_stack();
+		current->state = TASK_RUNNING;
+		goto out;
 	}
 
 	expire = timeout + jiffies;
@@ -1478,32 +1464,7 @@ signed long __sched schedule_timeout(signed long timeout)
  out:
 	return timeout < 0 ? 0 : timeout;
 }
-EXPORT_SYMBOL(schedule_timeout);
-
-/*
- * We can use __set_current_state() here because schedule_timeout() calls
- * schedule() unconditionally.
- */
-signed long __sched schedule_timeout_interruptible(signed long timeout)
-{
-	__set_current_state(TASK_INTERRUPTIBLE);
-	return schedule_timeout(timeout);
-}
-EXPORT_SYMBOL(schedule_timeout_interruptible);
-
-signed long __sched schedule_timeout_killable(signed long timeout)
-{
-	__set_current_state(TASK_KILLABLE);
-	return schedule_timeout(timeout);
-}
-EXPORT_SYMBOL(schedule_timeout_killable);
-
-signed long __sched schedule_timeout_uninterruptible(signed long timeout)
-{
-	__set_current_state(TASK_UNINTERRUPTIBLE);
-	return schedule_timeout(timeout);
-}
-EXPORT_SYMBOL(schedule_timeout_uninterruptible);
+EXPORT_SYMBOL(__schedule_timeout);
 
 static int __cpuinit init_timers_cpu(int cpu)
 {


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-21 12:33                               ` Peter Zijlstra
@ 2013-08-21 14:23                                 ` Peter Zijlstra
  2013-08-21 16:41                                   ` Oleg Nesterov
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Zijlstra @ 2013-08-21 14:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On Wed, Aug 21, 2013 at 02:33:11PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 21, 2013 at 01:35:51PM +0200, Oleg Nesterov wrote:
> > On 08/21, Peter Zijlstra wrote:
> 
> > > Yes its the right rq, but the wrong time.
> > 
> > Hmm. Just in case, it is not that I think this patch really makes sense,
> > but I'd like to understand why do you think it is wrong.
>  
> > But it is not "after it switches out", it is after it switched back.
> 
> D'uh I was being particularly dense it seems :/
> 
> Yes I think it is correct. You're now trading two atomic ops on a
> different cacheline than rq->lock for one atomic op on the rq->lock.
> 
> Might be a win, esp. since hopefully there's a fair chance its the same
> runqueue.

The other consideration is that this adds two branches to the normal
schedule path. I really don't know what the regular ratio between
schedule() and io_schedule() is -- and I suspect it can very much depend
on workload -- but it might be a net loss due to that, even if it makes
io_schedule() 'lots' cheaper.



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-21 14:23                                 ` Peter Zijlstra
@ 2013-08-21 16:41                                   ` Oleg Nesterov
  2013-10-01 14:05                                     ` Frederic Weisbecker
  0 siblings, 1 reply; 75+ messages in thread
From: Oleg Nesterov @ 2013-08-21 16:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On 08/21, Peter Zijlstra wrote:
>
> The other consideration is that this adds two branches to the normal
> schedule path. I really don't know what the regular ratio between
> schedule() and io_schedule() is -- and I suspect it can very much depend
> on workload -- but it might be a net loss due to that, even if it makes
> io_schedule() 'lots' cheaper.

Yes, agreed. Please ignore it for now, I didn't try to actually suggest
this change. And even if this is fine performance-wise, this needs some
benchmarking.

Well, actually I have a vague feeling that _perhaps_ this change can
help solve other problems we are discussing, but I am not sure and
right now I can't even explain the idea to myself.

In short: please ignore ;)

Oleg.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-21 12:48                               ` Peter Zijlstra
@ 2013-08-21 17:09                                 ` Oleg Nesterov
  2013-08-21 18:31                                   ` Peter Zijlstra
  0 siblings, 1 reply; 75+ messages in thread
From: Oleg Nesterov @ 2013-08-21 17:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On 08/21, Peter Zijlstra wrote:
>
> On Wed, Aug 21, 2013 at 01:35:51PM +0200, Oleg Nesterov wrote:
> >
> > Well, the only overhead is "if(to == MAX_SCHEDULE_TIMEOUT)" at the start.
> > I don't think it makes sense to copy-and-paste the identical code to
> > avoid it. But please ignore, this is really minor and off-topic.
>
> Ah, so the 'problem' is that schedule_timeout() doesn't live in
> kernel/sched/core.c and we thus get an extra function call (which are
> somewhat expensive on some archs).

So, unlike me, you like -O2 more than -Os ;)

> +static inline long schedule_timeout(long timeout)
> +{
> +	if (timeout == MAX_SCHEDULE_TIMEOUT) {
> +		schedule();
> +		return timeout;
> +	}
> +	return __schedule_timeout(timeout);
> +}

Well this means that every caller will do the MAX_SCHEDULE_TIMEOUT
check inline, and this case is unlikely. And you are also going
to make schedule_timeout_*() inline...

But,

> +static inline long schedule_timeout_interruptible(long timeout)
> +{
> +	__set_current_state(TASK_INTERRUPTIBLE);
> +	return schedule_timeout(timeout);
> +}
> +static inline long schedule_timeout_killable(long timeout)
> +{
> +	__set_current_state(TASK_KILLABLE);
> +	return schedule_timeout(timeout);
> +}
> +static inline long schedule_timeout_uninterruptible(long timeout)
> +{
> +	__set_current_state(TASK_UNINTERRUPTIBLE);
> +	return schedule_timeout(timeout);
> +}
> ...
> -signed long __sched schedule_timeout_interruptible(signed long timeout)
> -{
> -	__set_current_state(TASK_INTERRUPTIBLE);
> -	return schedule_timeout(timeout);
> -}
> -EXPORT_SYMBOL(schedule_timeout_interruptible);
> -
> -signed long __sched schedule_timeout_killable(signed long timeout)
> -{
> -	__set_current_state(TASK_KILLABLE);
> -	return schedule_timeout(timeout);
> -}
> -EXPORT_SYMBOL(schedule_timeout_killable);
> -
> -signed long __sched schedule_timeout_uninterruptible(signed long timeout)
> -{
> -	__set_current_state(TASK_UNINTERRUPTIBLE);
> -	return schedule_timeout(timeout);
> -}
> -EXPORT_SYMBOL(schedule_timeout_uninterruptible);
> +EXPORT_SYMBOL(__schedule_timeout);

personally I think this particular change is fine.

Or we can export a single schedule_timeout_state(timeout, state) to
factor out get_current().

But just in case, of course I won't argue in any case.

Oleg.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-21 17:09                                 ` Oleg Nesterov
@ 2013-08-21 18:31                                   ` Peter Zijlstra
  2013-08-21 18:32                                     ` Oleg Nesterov
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Zijlstra @ 2013-08-21 18:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On Wed, Aug 21, 2013 at 07:09:27PM +0200, Oleg Nesterov wrote:
> So, unlike me, you like -O2 more than -Os ;)

I haven't checked the actual flags they enable in a while, but I think I
prefer something in the middle.

Esp. -freorder-blocks and the various -falign flags are something you
really want with -Os.

> > +static inline long schedule_timeout(long timeout)
> > +{
> > +	if (timeout == MAX_SCHEDULE_TIMEOUT) {
> > +		schedule();
> > +		return timeout;
> > +	}
> > +	return __schedule_timeout(timeout);
> > +}
> 
> Well this means that every caller will do the MAX_SCHEDULE_TIMEOUT
> check inline, and this case is unlikely. 

OK, so do not remove the MAX_SCHEDULE_TIMEOUT check from
__schedule_timeout() and change the above to:

static __always_inline long schedule_timeout(long timeout)
{
	if (__builtin_constant_p(timeout) && timeout == MAX_SCHEDULE_TIMEOUT) {
		schedule();
		return timeout;
	}
	return __schedule_timeout(timeout);
}

That should avoid extra code generation for the runtime sites while
still allowing what we set out to do.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-21 18:31                                   ` Peter Zijlstra
@ 2013-08-21 18:32                                     ` Oleg Nesterov
  0 siblings, 0 replies; 75+ messages in thread
From: Oleg Nesterov @ 2013-08-21 18:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao,
	Frederic Weisbecker, Ingo Molnar, Thomas Gleixner, LKML,
	Tetsuo Handa, Andrew Morton

On 08/21, Peter Zijlstra wrote:
>
> OK, so do not remove the MAX_SCHEDULE_TIMEOUT check from
> __schedule_timeout() and change the above to:
>
> static __always_inline long schedule_timeout(long timeout)
> {
> 	if (__builtin_constant_p(timeout) && timeout == MAX_SCHEDULE_TIMEOUT) {
> 		schedule();
> 		return timeout;
> 	}
> 	return __schedule_timeout(timeout);
> }

Agreed, this looks nice.

Hmm. And this can simplify some changes in linux/wait.h I am going
to resend.

Oleg.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-08-21 16:41                                   ` Oleg Nesterov
@ 2013-10-01 14:05                                     ` Frederic Weisbecker
  2013-10-01 14:26                                       ` Frederic Weisbecker
                                                         ` (2 more replies)
  0 siblings, 3 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-10-01 14:05 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra
  Cc: Peter Zijlstra, Arjan van de Ven, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Wed, Aug 21, 2013 at 06:41:46PM +0200, Oleg Nesterov wrote:
> On 08/21, Peter Zijlstra wrote:
> >
> > The other consideration is that this adds two branches to the normal
> > schedule path. I really don't know what the regular ratio between
> > schedule() and io_schedule() is -- and I suspect it can very much depend
> > on workload -- but it might be a net loss due to that, even if it makes
> > io_schedule() 'lots' cheaper.
> 
> Yes, agreed. Please ignore it for now, I didn't try to actually suggest
> this change. And even if this is fine performance-wise, this needs some
> benchmarking.
> 
> Well, actually I have a vague feeling that _perhaps_ this change can
> help solve other problems we are discussing, but I am not sure and
> right now I can't even explain the idea to myself.
> 
> In short: please ignore ;)
> 
> Oleg.
> 

Ok, the discussion is hard to synthesize but I think I now have more
clues to send a better new iteration.

So we have the choice between keeping atomics or using rq->lock. It seems
that using rq->lock is going to be worrisome for several reasons. So let's
stick to atomics (but tell me if you prefer the other way).

So the idea for the next try is to do something along the lines of:

struct cpu_idletime {
       nr_iowait,
       seqlock,
       idle_start,
       idle_time,
       iowait_time,
} __cacheline_aligned_in_smp;

DEFINE_PER_CPU(struct cpu_idletime, cpu_idletime);
       
io_schedule()
{
        struct cpu_idletime *prev_cpu_idletime;
	
        preempt_disable();
        prev_cpu_idletime = __this_cpu_ptr(&cpu_idletime);
        atomic_inc(prev_cpu_idletime->nr_iowait);
        WARN_ON_ONCE(is_idle_task(current));
        preempt_enable_no_resched();

        schedule();

        write_seqlock(prev_cpu_idletime->seqlock)
	if (!atomic_dec_return(prev_cpu_idletime->nr_iowait))
           flush_cpu_idle_time(prev_cpu_idletime, 1)
        write_sequnlock(prev_cpu_idletime->seqlock)

}

flush_cpu_idle_time(cpu_idletime, iowait)
{
       if (!cpu_idletime->idle_start)
            return;

       if (iowait)
            cpu_idletime->iowait_time = NOW() - cpu_idletime->idle_start;
       else
            cpu_idletime->idle_time = NOW() - cpu_idletime->idle_start;
}

idle_entry()
{
        write_seqlock(this_cpu_idletime->seqlock)
        this_cpu_idletime->idle_start = NOW();
        write_sequnlock(this_cpu_idletime->seqlock)
}

idle_exit()
{
        write_seqlock(this_cpu_idletime->seqlock)
        flush_cpu_idle_time(this_cpu_idletime, atomic_read(&this_cpu_idletime->nr_iowait));
	this_cpu_idletime->idle_start = 0;
        write_sequnlock(this_cpu_idletime->seqlock)
}


Now this all relies on the fact that atomic_inc(cpu_idletime->nr_iowait) can't happen
on a CPU that is already idle. So it can't happen between idle_entry() and idle_exit().
Hence the WARN_ON_ONCE(is_idle_task(current)) right after the inc.

Hmm?

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-01 14:05                                     ` Frederic Weisbecker
@ 2013-10-01 14:26                                       ` Frederic Weisbecker
  2013-10-01 14:27                                         ` Frederic Weisbecker
  2013-10-01 14:49                                         ` Frederic Weisbecker
  2013-10-01 15:00                                       ` Peter Zijlstra
  2013-10-01 15:56                                       ` Peter Zijlstra
  2 siblings, 2 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-10-01 14:26 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao, Ingo Molnar,
	Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

2013/10/1 Frederic Weisbecker <fweisbec@gmail.com>:
> On Wed, Aug 21, 2013 at 06:41:46PM +0200, Oleg Nesterov wrote:
>> On 08/21, Peter Zijlstra wrote:
>> >
>> > The other consideration is that this adds two branches to the normal
>> > schedule path. I really don't know what the regular ratio between
>> > schedule() and io_schedule() is -- and I suspect it can very much depend
>> > on workload -- but it might be a net loss due to that, even if it makes
>> > io_schedule() 'lots' cheaper.
>>
>> Yes, agreed. Please ignore it for now, I didn't try to actually suggest
>> this change. And even if this is fine performance-wise, this needs some
>> benchmarking.
>>
>> Well, actually I have a vague feeling that _perhaps_ this change can
>> help solve other problems we are discussing, but I am not sure and
>> right now I can't even explain the idea to myself.
>>
>> In short: please ignore ;)
>>
>> Oleg.
>>
>
> Ok, the discussion is hard to synthesize but I think I now have more
> clues to send a better new iteration.
>
> So we have the choice between keeping atomics or using rq->lock. It seems
> that using rq->lock is going to be worrisome for several reasons. So let's
> stick to atomics (but tell me if you prefer the other way).
>
> So the idea for the next try is to do something along the lines of:
>
> struct cpu_idletime {
>        nr_iowait,
>        seqlock,
>        idle_start,
>        idle_time,
>        iowait_time,
> } __cacheline_aligned_in_smp;
>
> DEFINE_PER_CPU(struct cpu_idletime, cpu_idletime);
>
> io_schedule()
> {
>         int prev_cpu;
>
>         preempt_disable();
>         prev_cpu_idletime = __this_cpu_ptr(&cpu_idletime);
>         atomic_inc(prev_cpu_idletime->nr_iowait);
>         WARN_ON_ONCE(is_idle_task(current));
>         preempt_enable_no_resched();
>
>         schedule();
>
>         write_seqlock(prev_cpu_idletime->seqlock)
>         if (!atomic_dec_return(prev_cpu_idletime->nr_iowait))
>            flush_cpu_idle_time(prev_cpu_idletime, 1)

I forgot...
              cpu_idletime->idle_start;

after the update.

Also now I wonder if we actually should lock the inc part. Otherwise
it may be hard to get the readers right...

Thanks.

>         write_sequnlock(prev_cpu_idletime->seqlock)
>
> }
>
> flush_cpu_idle_time(cpu_idletime, iowait)
> {
>        if (!cpu_idletime->idle_start)
>             return;
>
>        if (nr_iowait)
>             cpu_idletime->iowait_time = NOW() - cpu_idletime->idle_start;
>        else
>             cpu_idletime->idle_time = NOW() - cpu_idletime->idle_start;
> }
>
> idle_entry()
> {
>         write_seqlock(this_cpu_idletime->seqlock)
>         this_cpu_idletime->idle_start = NOW();
>         write_sequnlock(iowait(cpu)->seqlock)
> }
>
> idle_exit()
> {
>         write_seqlock(this_cpu_idletime->seqlock)
>         flush_cpu_idle_time(this_cpu_idletime, atomic_read(&this_cpu_idletime->nr_iowait));
>         this_cpu_idletime->idle_start = 0;
>         write_sequnlock(this_cpu_idletime->seqlock)
> }
>
>
> Now this all relies on the fact that atomic_inc(cpu_idletime->nr_iowait) can't happen
> on a CPU that is already idle. So it can't happen between idle_entry() and idle_exit().
> Hence the WARN_ON_ONCE(is_idle_task(current)) right after the inc.
>
> Hmm?

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-01 14:26                                       ` Frederic Weisbecker
@ 2013-10-01 14:27                                         ` Frederic Weisbecker
  2013-10-01 14:49                                         ` Frederic Weisbecker
  1 sibling, 0 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-10-01 14:27 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao, Ingo Molnar,
	Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

2013/10/1 Frederic Weisbecker <fweisbec@gmail.com>:
> I forgot...
>               cpu_idletime->idle_start;
                 cpu_idletime->idle_start = NOW();

grrr.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-01 14:26                                       ` Frederic Weisbecker
  2013-10-01 14:27                                         ` Frederic Weisbecker
@ 2013-10-01 14:49                                         ` Frederic Weisbecker
  1 sibling, 0 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-10-01 14:49 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra
  Cc: Arjan van de Ven, Fernando Luis Vázquez Cao, Ingo Molnar,
	Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

2013/10/1 Frederic Weisbecker <fweisbec@gmail.com>:
> 2013/10/1 Frederic Weisbecker <fweisbec@gmail.com>:
>> On Wed, Aug 21, 2013 at 06:41:46PM +0200, Oleg Nesterov wrote:
>>> On 08/21, Peter Zijlstra wrote:
>>> >
>>> > The other consideration is that this adds two branches to the normal
>>> > schedule path. I really don't know what the regular ratio between
>>> > schedule() and io_schedule() is -- and I suspect it can very much depend
>>> > on workload -- but it might be a net loss due to that, even if it makes
>>> > io_schedule() 'lots' cheaper.
>>>
>>> Yes, agreed. Please ignore it for now, I didn't try to actually suggest
>>> this change. And even if this is fine performance-wise, this needs some
>>> benchmarking.
>>>
>>> Well, actually I have a vague feeling that _perhaps_ this change can
>>> help solve other problems we are discussing, but I am not sure and
>>> right now I can't even explain the idea to myself.
>>>
>>> In short: please ignore ;)
>>>
>>> Oleg.
>>>
>>
>> Ok, the discussion is hard to synthesize but I think I now have more
>> clues to send a better new iteration.
>>
>> So we have the choice between keeping atomics or using rq->lock. It seems
>> that using rq->lock is going to be worrisome for several reasons. So let's
>> stick to atomics (but tell me if you prefer the other way).
>>
>> So the idea for the next try is to do something along the lines of:
>>
>> struct cpu_idletime {
>>        nr_iowait,
>>        seqlock,
>>        idle_start,
>>        idle_time,
>>        iowait_time,
>> } __cacheline_aligned_in_smp;
>>
>> DEFINE_PER_CPU(struct cpu_idletime, cpu_idletime);
>>
>> io_schedule()
>> {
>>         int prev_cpu;
>>
>>         preempt_disable();
>>         prev_cpu_idletime = __this_cpu_ptr(&cpu_idletime);
>>         atomic_inc(prev_cpu_idletime->nr_iowait);
>>         WARN_ON_ONCE(is_idle_task(current));
>>         preempt_enable_no_resched();
>>
>>         schedule();
>>
>>         write_seqlock(prev_cpu_idletime->seqlock)
>>         if (!atomic_dec_return(prev_cpu_idletime->nr_iowait))
>>            flush_cpu_idle_time(prev_cpu_idletime, 1)
>
> I forgot...
>               cpu_idletime->idle_start;
>
> after the update.
>
> Also now I wonder if we actually should lock the inc part. Otherwise
> it may be hard to get the readers right...
>
> Thanks.
>
>>         write_sequnlock(prev_cpu_idletime->seqlock)
>>
>> }
>>
>> flush_cpu_idle_time(cpu_idletime, iowait)
>> {
>>        if (!cpu_idletime->idle_start)
>>             return;
>>
>>        if (nr_iowait)
>>             cpu_idletime->iowait_time = NOW() - cpu_idletime->idle_start;
>>        else
>>             cpu_idletime->idle_time = NOW() - cpu_idletime->idle_start;
>> }
>>
>> idle_entry()
>> {
>>         write_seqlock(this_cpu_idletime->seqlock)
>>         this_cpu_idletime->idle_start = NOW();
>>         write_sequnlock(iowait(cpu)->seqlock)
>> }
>>
>> idle_exit()
>> {
>>         write_seqlock(this_cpu_idletime->seqlock)
>>         flush_cpu_idle_time(this_cpu_idletime, atomic_read(&this_cpu_idletime->nr_iowait));
>>         this_cpu_idletime->idle_start = 0;
>>         write_sequnlock(this_cpu_idletime->seqlock)
>> }
>>
>>
>> Now this all relies on the fact that atomic_inc(cpu_idletime->nr_iowait) can't happen
>> on a CPU that is already idle. So it can't happen between idle_entry() and idle_exit().
>> Hence the WARN_ON_ONCE(is_idle_task(current)) right after the inc.
>>
>> Hmm?

Actually if we record the source of the idle entry time (whether we
enter idle as iowait or plain idle) and protect it inside the seqlock,
we can probably make the readers safe. So they would rely on that
instead of nr_iowait, which is only used by the updaters.

I'm probably forgetting something obvious. Anyway I'll try to make a
patchset out of that, unless somebody points out the mistakes I'm
making here.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-01 14:05                                     ` Frederic Weisbecker
  2013-10-01 14:26                                       ` Frederic Weisbecker
@ 2013-10-01 15:00                                       ` Peter Zijlstra
  2013-10-01 15:21                                         ` Frederic Weisbecker
  2013-10-01 15:56                                       ` Peter Zijlstra
  2 siblings, 1 reply; 75+ messages in thread
From: Peter Zijlstra @ 2013-10-01 15:00 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, Arjan van de Ven, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Tue, Oct 01, 2013 at 04:05:27PM +0200, Frederic Weisbecker wrote:
 
> struct cpu_idletime {
>        nr_iowait,
>        seqlock,
>        idle_start,
>        idle_time,
>        iowait_time,
> } __cacheline_aligned_in_smp;
> 
> DEFINE_PER_CPU(struct cpu_idletime, cpu_idletime);
>        
> io_schedule()
> {
>         int prev_cpu;
> 	
>         preempt_disable();
>         prev_cpu_idletime = __this_cpu_ptr(&cpu_idletime);
>         atomic_inc(prev_cpu_idletime->nr_iowait);
>         WARN_ON_ONCE(is_idle_task(current));
>         preempt_enable_no_resched();
> 
>         schedule();
> 
>         write_seqlock(prev_cpu_idletime->seqlock)
> 	if (!atomic_dec_return(prev_cpu_idletime->nr_iowait))
>            flush_cpu_idle_time(prev_cpu_idletime, 1)
>         write_sequnlock(prev_cpu_idletime->seqlock)
> 
> }

This is at least 3 atomic ops and a whole bunch of branches extra. It
used to be 2 atomics and no branches.

What again are we solving and why?

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-01 15:00                                       ` Peter Zijlstra
@ 2013-10-01 15:21                                         ` Frederic Weisbecker
  0 siblings, 0 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-10-01 15:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Arjan van de Ven, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Tue, Oct 01, 2013 at 05:00:37PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 01, 2013 at 04:05:27PM +0200, Frederic Weisbecker wrote:
>  
> > struct cpu_idletime {
> >        nr_iowait,
> >        seqlock,
> >        idle_start,
> >        idle_time,
> >        iowait_time,
> > } __cacheline_aligned_in_smp;
> > 
> > DEFINE_PER_CPU(struct cpu_idletime, cpu_idletime);
> >        
> > io_schedule()
> > {
> >         int prev_cpu;
> > 	
> >         preempt_disable();
> >         prev_cpu_idletime = __this_cpu_ptr(&cpu_idletime);
> >         atomic_inc(prev_cpu_idletime->nr_iowait);
> >         WARN_ON_ONCE(is_idle_task(current));
> >         preempt_enable_no_resched();
> > 
> >         schedule();
> > 
> >         write_seqlock(prev_cpu_idletime->seqlock)
> > 	if (!atomic_dec_return(prev_cpu_idletime->nr_iowait))
> >            flush_cpu_idle_time(prev_cpu_idletime, 1)
> >         write_sequnlock(prev_cpu_idletime->seqlock)
> > 
> > }
> 
> This is at least 3 atomic ops and a whole bunch of branches extra. It
> used to be 2 atomics and no branches.

Yeah. If somebody has a better proposition, I'm all for it.
Of course the best would be to remove these stats if we can. I think we
already concluded that the idea of per CPU iowait stats is broken since
sleeping tasks aren't assigned a particular CPU.

> 
> What again are we solving and why?

Fernando summarized it better than I could:
	 * https://lkml.org/lkml/2013/3/18/962
	 * and http://marc.info/?l=linux-kernel&m=137273800916899&w=2

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-01 14:05                                     ` Frederic Weisbecker
  2013-10-01 14:26                                       ` Frederic Weisbecker
  2013-10-01 15:00                                       ` Peter Zijlstra
@ 2013-10-01 15:56                                       ` Peter Zijlstra
  2013-10-01 16:47                                         ` Frederic Weisbecker
  2 siblings, 1 reply; 75+ messages in thread
From: Peter Zijlstra @ 2013-10-01 15:56 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, Arjan van de Ven, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton


So what's wrong with something like:

struct cpu_idletime {
       seqlock_t seqlock;
       unsigned long nr_iowait;
       u64 start;
       u64 idle_time;
       u64 iowait_time;
} __cacheline_aligned_in_smp;

DEFINE_PER_CPU(struct cpu_idletime, cpu_idletime);

void io_schedule(void)
{
	struct cpu_idletime *it = &__raw_get_cpu_var(cpu_idletime);

	write_seqlock(&it->seqlock);
	if (!it->nr_iowait++)
		it->start = local_clock();
	write_sequnlock(&it->seqlock);

	current->in_iowait = 1;
	schedule();
	current->in_iowait = 0;

	write_seqlock(&it->seqlock);
	if (!--it->nr_iowait)
		it->iowait_time += local_clock() - it->start;
	write_sequnlock(&it->seqlock);
}

Afaict you don't need the preempt disable and atomic muck at all.

It will all get a little more complicated to deal with overlapping idle
and iowait times, but the idea is the same.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-01 15:56                                       ` Peter Zijlstra
@ 2013-10-01 16:47                                         ` Frederic Weisbecker
  2013-10-01 16:59                                           ` Peter Zijlstra
  0 siblings, 1 reply; 75+ messages in thread
From: Frederic Weisbecker @ 2013-10-01 16:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Arjan van de Ven, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Tue, Oct 01, 2013 at 05:56:33PM +0200, Peter Zijlstra wrote:
> 
> So what's wrong with something like:
> 
> struct cpu_idletime {
>        seqlock_t seqlock;
>        unsigned long nr_iowait;
>        u64 start;
>        u64 idle_time,
>        u64 iowait_time,
> } __cacheline_aligned_in_smp;
> 
> DEFINE_PER_CPU(struct cpu_idletime, cpu_idletime);
> 
> void io_schedule(void)
> {
> 	struct cpu_idletime *it = __raw_get_cpu_var(cpu_idletime);
> 
> 	write_seqlock(&it->seqlock);
> 	if (!it->nr_iowait++)
> 		it->start = local_clock();
> 	write_sequnlock(&it->seqlock);
> 
> 	current->in_iowait = 1;
> 	schedule();
> 	current->in_iowait = 0;
> 
> 	write_seqlock(&it->seqlock);
> 	if (!--it->nr_iowait)
> 		it->iowait_time += local_clock() - it->start;
> 	write_sequnlock(&it->seqlock);
> }
> 
> Afaict you don't need the preempt disable and atomic muck at all.

Yeah thinking more about it, the preempt disable was probably not
necessary. Now that's trading 2 atomics + 1 Lock/Unlock with 2 Lock/Unlock.

OTOH it computes iowait time separately from idle time, so we probably don't
need to lock the idle time anymore.

Plus this solution is much, much simpler. So if nobody sees a flaw there,
I'll try this.

Thanks.

> 
> It will all get a little more complicated to deal with overlapping idle
> and iowait times, but the idea is the same.

Probably no big deal.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-01 16:47                                         ` Frederic Weisbecker
@ 2013-10-01 16:59                                           ` Peter Zijlstra
  2013-10-02 12:45                                             ` Frederic Weisbecker
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Zijlstra @ 2013-10-01 16:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, Arjan van de Ven, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Tue, Oct 01, 2013 at 06:47:10PM +0200, Frederic Weisbecker wrote:
> Yeah thinking more about it, the preempt disable was probably not
> necessary. Now that's trading 2 atomics + 1 Lock/Unlock with 2 Lock/Unlock.

It trades the current 2 atomics for 2 LOCK/UNLOCK. And on x86_64 that's
2 atomics.

So all we get is some extra branches; which we must hope for don't
actually mess things up too bad.

Ohh.. wait a sec.. we also call local_clock() which includes another
atomic :/ Bugger.. 



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-01 16:59                                           ` Peter Zijlstra
@ 2013-10-02 12:45                                             ` Frederic Weisbecker
  2013-10-02 12:50                                               ` Peter Zijlstra
  2013-10-02 14:35                                               ` Arjan van de Ven
  0 siblings, 2 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-10-02 12:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Arjan van de Ven, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Tue, Oct 01, 2013 at 06:59:57PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 01, 2013 at 06:47:10PM +0200, Frederic Weisbecker wrote:
> > Yeah thinking more about it, the preempt disable was probably not
> > necessary. Now that's trading 2 atomics + 1 Lock/Unlock with 2 Lock/Unlock.
> 
> It trades the current 2 atomics for 2 LOCK/UNLOCK. And on x86_64 that's
> 2 atomics.

Do you mean 2 atomics for LOCK/UNLOCK? Or is that pair optimized somehow
in x86?

> 
> So all we get is some extra branches; which we must hope for don't
> actually mess things up too bad.
> 
> Ohh.. wait a sec.. we also call local_clock() which includes another
> atomic :/ Bugger.. 

Yeah. Anyway, I'm going to try something on top of that. May be we'll get
a fresher mind and ideas on how to optimize that all after.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-02 12:45                                             ` Frederic Weisbecker
@ 2013-10-02 12:50                                               ` Peter Zijlstra
  2013-10-02 14:35                                               ` Arjan van de Ven
  1 sibling, 0 replies; 75+ messages in thread
From: Peter Zijlstra @ 2013-10-02 12:50 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, Arjan van de Ven, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Wed, Oct 02, 2013 at 02:45:41PM +0200, Frederic Weisbecker wrote:
> On Tue, Oct 01, 2013 at 06:59:57PM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 01, 2013 at 06:47:10PM +0200, Frederic Weisbecker wrote:
> > > Yeah thinking more about it, the preempt disable was probably not
> > > necessary. Now that's trading 2 atomics + 1 Lock/Unlock with 2 Lock/Unlock.
> > 
> > It trades the current 2 atomics for 2 LOCK/UNLOCK. And on x86_64 that's
> > 2 atomics.
> 
> Do you mean 2 atomics for LOCK/UNLOCK? Or is that pair optimized somehow
> in x86?

Unlock isn't an atomic on x86_64; it can be on certain i386 builds. See
UNLOCK_LOCK_PREFIX.

> > 
> > So all we get is some extra branches; which we must hope for don't
> > actually mess things up too bad.
> > 
> > Ohh.. wait a sec.. we also call local_clock() which includes another
> > atomic :/ Bugger.. 
> 
> Yeah. Anyway, I'm going to try something on top of that. May be we'll get
> a fresher mind and ideas on how to optimize that all after.

Fair enough.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-02 12:45                                             ` Frederic Weisbecker
  2013-10-02 12:50                                               ` Peter Zijlstra
@ 2013-10-02 14:35                                               ` Arjan van de Ven
  2013-10-02 16:01                                                 ` Frederic Weisbecker
  1 sibling, 1 reply; 75+ messages in thread
From: Arjan van de Ven @ 2013-10-02 14:35 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Oleg Nesterov, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On 10/2/2013 5:45 AM, Frederic Weisbecker wrote:
> On Tue, Oct 01, 2013 at 06:59:57PM +0200, Peter Zijlstra wrote:
>> On Tue, Oct 01, 2013 at 06:47:10PM +0200, Frederic Weisbecker wrote:
>>> Yeah thinking more about it, the preempt disable was probably not
>>> necessary. Now that's trading 2 atomics + 1 Lock/Unlock with 2 Lock/Unlock.
>>
>> It trades the current 2 atomics for 2 LOCK/UNLOCK. And on x86_64 that's
>> 2 atomics.
>
> Do you mean 2 atomics for LOCK/UNLOCK? Or is that pair optimized somehow
> in x86?

unlock is not actually an atomic.

and on some modern machines, neither is the lock, for the uncontended case ;-)
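[Editor's note: the asymmetry being discussed is visible in a plain userspace test-and-set lock sketch using C11 atomics (nothing kernel-specific assumed here). On x86-64 the acquire side compiles to a LOCK-prefixed exchange, while the release side is an ordinary MOV, since TSO already gives stores release semantics; that is what an empty UNLOCK_LOCK_PREFIX encodes on those builds.]

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct { atomic_int locked; } tas_lock_t;

static void tas_lock(tas_lock_t *l)
{
	/* On x86-64 this is the one real atomic: a LOCK XCHG. */
	while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
		;	/* spin; the uncontended case never loops */
}

static void tas_unlock(tas_lock_t *l)
{
	/* Compiles to a plain MOV on x86-64: the store itself is the
	 * release, no LOCK prefix needed under TSO. */
	atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```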


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock
  2013-10-02 14:35                                               ` Arjan van de Ven
@ 2013-10-02 16:01                                                 ` Frederic Weisbecker
  0 siblings, 0 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-10-02 16:01 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Peter Zijlstra, Oleg Nesterov, Fernando Luis Vázquez Cao,
	Ingo Molnar, Thomas Gleixner, LKML, Tetsuo Handa, Andrew Morton

On Wed, Oct 02, 2013 at 07:35:49AM -0700, Arjan van de Ven wrote:
> On 10/2/2013 5:45 AM, Frederic Weisbecker wrote:
> >On Tue, Oct 01, 2013 at 06:59:57PM +0200, Peter Zijlstra wrote:
> >>On Tue, Oct 01, 2013 at 06:47:10PM +0200, Frederic Weisbecker wrote:
> >>>Yeah thinking more about it, the preempt disable was probably not
> >>>necessary. Now that's trading 2 atomics + 1 Lock/Unlock with 2 Lock/Unlock.
> >>
> >>It trades the current 2 atomics for 2 LOCK/UNLOCK. And on x86_64 that's
> >>2 atomics.
> >
> >Do you mean 2 atomics for LOCK/UNLOCK? Or is that pair optimized somehow
> >in x86?
> 
> unlock is not actually an atomic.
> 
> and on some modern machines, neither is the lock, for the uncontended case ;-)
> 

Yeah, we are never getting short on black magic as the time goes :)

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH 4/4] nohz: Convert a few places to use local per cpu accesses
  2013-08-09  0:54 [PATCH " Frederic Weisbecker
@ 2013-08-09  0:54 ` Frederic Weisbecker
  0 siblings, 0 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2013-08-09  0:54 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Fernando Luis Vazquez Cao, Tetsuo Handa,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven

A few functions use remote per CPU access APIs when they
deal with local values.

Just do the right conversion to improve performance, code
readability and debug checks.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 kernel/time/tick-sched.c |   10 ++++------
 1 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 017bec2..11d64e2 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -395,11 +395,9 @@ __setup("nohz=", setup_tick_nohz);
  */
 static void tick_nohz_update_jiffies(ktime_t now)
 {
-	int cpu = smp_processor_id();
-	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 	unsigned long flags;
 
-	ts->idle_waketime = now;
+	__this_cpu_write(tick_cpu_sched.idle_waketime, now);
 
 	local_irq_save(flags);
 	tick_do_update_jiffies64(now);
@@ -410,7 +408,7 @@ static void tick_nohz_update_jiffies(ktime_t now)
 
 static void tick_nohz_stop_idle(int cpu, ktime_t now)
 {
-	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
 	ktime_t delta;
 
 	/* Updates the per cpu time idle statistics counters */
@@ -901,7 +899,7 @@ static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
 void tick_nohz_idle_exit(void)
 {
 	int cpu = smp_processor_id();
-	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
 	ktime_t now;
 
 	local_irq_disable();
@@ -1020,7 +1018,7 @@ static void tick_nohz_kick_tick(int cpu, ktime_t now)
 
 static inline void tick_check_nohz(int cpu)
 {
-	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+	struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
 	ktime_t now;
 
 	if (!ts->idle_active && !ts->tick_stopped)
-- 
1.7.5.4
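[Editor's note: the remote-vs-local distinction in the diff above can be modeled in userspace with an array indexed by a "current CPU" variable. All names here are hypothetical stand-ins; the real kernel accessors additionally involve per-CPU sections, segment-relative addressing and preemption debug checks.]

```c
#include <assert.h>

#define NR_CPUS 4

/* Userspace model of per-CPU data. */
static int tick_count[NR_CPUS];
static int current_cpu;	/* stands in for smp_processor_id() */

/* Remote form: takes an explicit CPU number, which must be computed and
 * used as an index even when cpu happens to be the local CPU. */
#define per_cpu(var, cpu)		((var)[(cpu)])

/* Local forms: no CPU argument. In the kernel these compile down to a
 * single segment-relative instruction and give the debug code a hook to
 * verify the caller cannot migrate mid-access. */
#define __this_cpu_read(var)		((var)[current_cpu])
#define __this_cpu_write(var, val)	((var)[current_cpu] = (val))
```

The local forms also let CONFIG_DEBUG_PREEMPT catch callers that could migrate between computing the address and dereferencing it, which is the "debug checks" part of the changelog.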


^ permalink raw reply related	[flat|nested] 75+ messages in thread

end of thread, other threads:[~2013-10-02 16:01 UTC | newest]

Thread overview: 75+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-16 15:42 [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats Frederic Weisbecker
2013-08-16 15:42 ` [PATCH 1/4] nohz: Only update sleeptime stats locally Frederic Weisbecker
2013-08-18 16:49   ` Oleg Nesterov
2013-08-18 21:38     ` Frederic Weisbecker
2013-08-18 17:04   ` Oleg Nesterov
2013-08-19 18:05     ` Stratos Karafotis
2013-08-16 15:42 ` [PATCH 2/4] nohz: Synchronize sleep time stats with seqlock Frederic Weisbecker
2013-08-16 16:02   ` Oleg Nesterov
2013-08-16 16:20     ` Frederic Weisbecker
2013-08-16 16:26       ` Oleg Nesterov
2013-08-16 16:46         ` Frederic Weisbecker
2013-08-16 16:49           ` Oleg Nesterov
2013-08-16 17:12             ` Frederic Weisbecker
2013-08-18 16:36               ` Oleg Nesterov
2013-08-18 21:25                 ` Frederic Weisbecker
2013-08-19 10:58               ` Peter Zijlstra
2013-08-19 15:44                 ` Arjan van de Ven
2013-08-19 15:47                 ` Arjan van de Ven
2013-08-19 11:10           ` Peter Zijlstra
2013-08-19 11:15             ` Peter Zijlstra
2013-08-20  6:59             ` Fernando Luis Vázquez Cao
2013-08-20  8:44               ` Peter Zijlstra
2013-08-20 15:29                 ` Frederic Weisbecker
2013-08-20 15:33                   ` Arjan van de Ven
2013-08-20 15:35                     ` Frederic Weisbecker
2013-08-20 15:41                       ` Arjan van de Ven
2013-08-20 15:31                 ` Arjan van de Ven
2013-08-20 16:01                   ` Peter Zijlstra
2013-08-20 16:33                     ` Oleg Nesterov
2013-08-20 17:54                       ` Peter Zijlstra
2013-08-20 18:25                         ` Oleg Nesterov
2013-08-21  8:31                           ` Peter Zijlstra
2013-08-21 11:35                             ` Oleg Nesterov
2013-08-21 12:33                               ` Peter Zijlstra
2013-08-21 14:23                                 ` Peter Zijlstra
2013-08-21 16:41                                   ` Oleg Nesterov
2013-10-01 14:05                                     ` Frederic Weisbecker
2013-10-01 14:26                                       ` Frederic Weisbecker
2013-10-01 14:27                                         ` Frederic Weisbecker
2013-10-01 14:49                                         ` Frederic Weisbecker
2013-10-01 15:00                                       ` Peter Zijlstra
2013-10-01 15:21                                         ` Frederic Weisbecker
2013-10-01 15:56                                       ` Peter Zijlstra
2013-10-01 16:47                                         ` Frederic Weisbecker
2013-10-01 16:59                                           ` Peter Zijlstra
2013-10-02 12:45                                             ` Frederic Weisbecker
2013-10-02 12:50                                               ` Peter Zijlstra
2013-10-02 14:35                                               ` Arjan van de Ven
2013-10-02 16:01                                                 ` Frederic Weisbecker
2013-08-21 12:48                               ` Peter Zijlstra
2013-08-21 17:09                                 ` Oleg Nesterov
2013-08-21 18:31                                   ` Peter Zijlstra
2013-08-21 18:32                                     ` Oleg Nesterov
2013-08-20 22:18                       ` Frederic Weisbecker
2013-08-21 11:49                         ` Oleg Nesterov
2013-08-20  6:21           ` Fernando Luis Vázquez Cao
2013-08-20 21:55             ` Frederic Weisbecker
2013-08-16 16:32     ` Frederic Weisbecker
2013-08-16 16:33       ` Oleg Nesterov
2013-08-16 16:49         ` Frederic Weisbecker
2013-08-16 16:37       ` Frederic Weisbecker
2013-08-18 16:54   ` Oleg Nesterov
2013-08-18 21:40     ` Frederic Weisbecker
2013-08-16 15:42 ` [PATCH 3/4] nohz: Consolidate sleep time stats read code Frederic Weisbecker
2013-08-18 17:00   ` Oleg Nesterov
2013-08-18 21:47     ` Frederic Weisbecker
2013-08-16 15:42 ` [PATCH 4/4] nohz: Convert a few places to use local per cpu accesses Frederic Weisbecker
2013-08-16 16:00   ` Peter Zijlstra
2013-08-16 16:12     ` Frederic Weisbecker
2013-08-16 16:19     ` Oleg Nesterov
2013-08-16 16:34       ` Frederic Weisbecker
2013-08-20 18:15 ` [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats Oleg Nesterov
2013-08-21  8:28   ` Peter Zijlstra
2013-08-21 11:42     ` Oleg Nesterov
  -- strict thread matches above, loose matches on Subject: below --
2013-08-09  0:54 [PATCH " Frederic Weisbecker
2013-08-09  0:54 ` [PATCH 4/4] nohz: Convert a few places to use local per cpu accesses Frederic Weisbecker
