* [PATCH v4 0/2] nohz: fix idle accounting in NO_HZ kernels
@ 2014-04-17  9:35 Hidetoshi Seto
  2014-04-17  9:38 ` [PATCH 1/2] nohz: make updating sleep stats local Hidetoshi Seto
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Hidetoshi Seto @ 2014-04-17  9:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fernando Luis Vazquez Cao, Tetsuo Handa, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy, Denys Vlasenko,
	stable

Hi all,

This patch set (rebased on v3.15-rc1) is my 4th attempt to fix an issue
where the idle/iowait values in /proc/stat can go backward. The issue
was originally reported by Tetsuo and Fernando in March 2013.

This v4 refines v3's approach and adds descriptions/comments to
make the patches' intent clearer (I hope).

Reviews are of course still welcome.

Thanks,
H.Seto

Hidetoshi Seto (2):
      nohz: make updating sleep stats local
      nohz: delayed iowait accounting for nohz idle time stats

 include/linux/tick.h     |    6 +-
 kernel/time/tick-sched.c |  179 ++++++++++++++++++++++++++++++++++------------
 2 files changed, 136 insertions(+), 49 deletions(-)




* [PATCH 1/2] nohz: make updating sleep stats local
  2014-04-17  9:35 [PATCH v4 0/2] nohz: fix idle accounting in NO_HZ kernels Hidetoshi Seto
@ 2014-04-17  9:38 ` Hidetoshi Seto
  2014-04-17  9:41 ` [PATCH 2/2] nohz: delayed iowait accounting for nohz idle time stats Hidetoshi Seto
  2014-04-22  6:34 ` [PATCH v4 0/2] nohz: fix idle accounting in NO_HZ kernels Hidetoshi Seto
  2 siblings, 0 replies; 8+ messages in thread
From: Hidetoshi Seto @ 2014-04-17  9:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fernando Luis Vazquez Cao, Tetsuo Handa, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy, Denys Vlasenko,
	stable

This patch is 1/2 of a v4 patch set to fix an issue where the
idle/iowait values in /proc/stat can go backward. The issue was
originally reported by Tetsuo and Fernando in March 2013.

[BACKGROUNDS]: idle accounting on NO_HZ

  If NO_HZ is enabled, a cpu stops its own tick interrupt before
  going to sleep to idle. This means that the time stats of the
  sleeping cpu will not be updated by the tick interrupt. Instead,
  when the cpu wakes up, it updates its time stats by calculating
  the idle duration from the timestamp taken at idle entry and the
  current time at idle exit.

  OTOH, there can be observers who want to know how long the
  sleeping cpu has been idle. Imagine that userland runs the top
  command or an application that reads /proc/stat. For this, the
  kernel provides the get_cpu_{idle,iowait}_time_us() functions
  so that users can obtain the current idle time stats of such a
  sleeping cpu. These functions read the time stats and the
  timestamp taken at idle entry, and then return the current idle
  time by adding the duration calculated from that timestamp and
  the current time.

  There are 2 problems:
  (this patch 1/2 targets one, and the following 2/2 targets the other)

[PROBLEM 1]: there is no exclusive control.

  There are 2 different cpus involved - an observing cpu running a
  program that observes an idle cpu's stats, and an observed cpu
  that is idling. A race can happen if the observed cpu wakes up
  while it is being observed. The observer can accidentally add the
  calculated duration (say delta) to time stats that were just
  updated by the woken cpu. The correct idle time is returned on
  the next query, so the result is time going backward. Therefore
  readers must be excluded by the writer.

  So the time stats are updated only by the woken cpu, and there
  is only one writer, right? Unfortunately, no. I'm not sure why,
  but for some reason get_cpu_{idle,iowait}_time_us() has the
  ability to update a sleeping cpu's time stats from the observing
  cpu. From a grep of today's kernel tree, this feature is used by
  the cpufreq code. Calling this function with this feature in a
  periodic manner works like emulating the tick for the sleeping
  cpu.

  In summary, there are races between multiple writers and multiple
  readers, but no exclusive control over this idle time stats dataset.

[WHAT THIS PATCH PROPOSED]:

  To fix problem 1, this patch 1/2 makes the following changes:

  - Stop updating from get_cpu_{idle,iowait}_time_us():
    It is hard to get reasonable performance with exclusive
    locking for multiple writers. To keep the updated areas
    local, remove the update functionality from these functions.
    This makes the time stats be updated only by the woken cpu,
    so there is now a single writer!

  - Add a seqcount as exclusive locking for NO_HZ idle:
    It should be the minimal implementation to avoid possible races
    between multiple readers and a single writer. This lock protects:
    idle_active, idle_entrytime, idle_sleeptime, iowait_sleeptime.

  Now that there is no other way to reach update_ts_time_stats(),
  fold this static routine into tick_nohz_stop_idle().

(Problem 2 still remains. Continued in the following patch 2/2.)

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Reported-by: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/tick.h     |    5 ++-
 kernel/time/tick-sched.c |   75 ++++++++++++++++++++++------------------------
 2 files changed, 39 insertions(+), 41 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index b84773c..00dd98d 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -62,6 +62,7 @@ struct tick_sched {
 	unsigned long			idle_calls;
 	unsigned long			idle_sleeps;
 	int				idle_active;
+	seqcount_t			idle_sleeptime_seq;
 	ktime_t				idle_entrytime;
 	ktime_t				idle_waketime;
 	ktime_t				idle_exittime;
@@ -137,8 +138,8 @@ extern void tick_nohz_idle_enter(void);
 extern void tick_nohz_idle_exit(void);
 extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
-extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
-extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
+extern u64 get_cpu_idle_time_us(int cpu, u64 *wall);
+extern u64 get_cpu_iowait_time_us(int cpu, u64 *wall);
 
 # else /* !CONFIG_NO_HZ_COMMON */
 static inline int tick_nohz_tick_stopped(void)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9f8af69..c8314a1 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -403,33 +403,23 @@ static void tick_nohz_update_jiffies(ktime_t now)
 	touch_softlockup_watchdog();
 }
 
-/*
- * Updates the per cpu time idle statistics counters
- */
-static void
-update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
+static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 {
 	ktime_t delta;
 
-	if (ts->idle_active) {
-		delta = ktime_sub(now, ts->idle_entrytime);
-		if (nr_iowait_cpu(cpu) > 0)
-			ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
-		else
-			ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
-		ts->idle_entrytime = now;
-	}
-
-	if (last_update_time)
-		*last_update_time = ktime_to_us(now);
+	write_seqcount_begin(&ts->idle_sleeptime_seq);
 
-}
-
-static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
-{
-	update_ts_time_stats(smp_processor_id(), ts, now, NULL);
+	/* Updates the per cpu time idle statistics counters */
+	delta = ktime_sub(now, ts->idle_entrytime);
+	if (nr_iowait_cpu(smp_processor_id()) > 0)
+		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
+	else
+		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+	ts->idle_entrytime = now;
 	ts->idle_active = 0;
 
+	write_seqcount_end(&ts->idle_sleeptime_seq);
+
 	sched_clock_idle_wakeup_event(0);
 }
 
@@ -437,8 +427,11 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
 {
 	ktime_t now = ktime_get();
 
+	write_seqcount_begin(&ts->idle_sleeptime_seq);
 	ts->idle_entrytime = now;
 	ts->idle_active = 1;
+	write_seqcount_end(&ts->idle_sleeptime_seq);
+
 	sched_clock_idle_sleep_event();
 	return now;
 }
@@ -446,10 +439,9 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
 /**
  * get_cpu_idle_time_us - get the total idle time of a cpu
  * @cpu: CPU number to query
- * @last_update_time: variable to store update time in. Do not update
- * counters if NULL.
+ * @wall: variable to store current wall time in.
  *
- * Return the cummulative idle time (since boot) for a given
+ * Return the cumulative idle time (since boot) for a given
  * CPU, in microseconds.
  *
  * This time is measured via accounting rather than sampling,
@@ -457,19 +449,22 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
  *
  * This function returns -1 if NOHZ is not enabled.
  */
-u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
+u64 get_cpu_idle_time_us(int cpu, u64 *wall)
 {
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 	ktime_t now, idle;
+	unsigned int seq;
 
 	if (!tick_nohz_active)
 		return -1;
 
 	now = ktime_get();
-	if (last_update_time) {
-		update_ts_time_stats(cpu, ts, now, last_update_time);
-		idle = ts->idle_sleeptime;
-	} else {
+	if (wall)
+		*wall = ktime_to_us(now);
+
+	do {
+		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
+ 
 		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
 			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
 
@@ -477,7 +472,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
 		} else {
 			idle = ts->idle_sleeptime;
 		}
-	}
+	} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));
 
 	return ktime_to_us(idle);
 
@@ -487,10 +482,9 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
 /**
  * get_cpu_iowait_time_us - get the total iowait time of a cpu
  * @cpu: CPU number to query
- * @last_update_time: variable to store update time in. Do not update
- * counters if NULL.
+ * @wall: variable to store current wall time in.
  *
- * Return the cummulative iowait time (since boot) for a given
+ * Return the cumulative iowait time (since boot) for a given
  * CPU, in microseconds.
  *
  * This time is measured via accounting rather than sampling,
@@ -498,19 +492,22 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
  *
  * This function returns -1 if NOHZ is not enabled.
  */
-u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
+u64 get_cpu_iowait_time_us(int cpu, u64 *wall)
 {
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 	ktime_t now, iowait;
+	unsigned int seq;
 
 	if (!tick_nohz_active)
 		return -1;
 
 	now = ktime_get();
-	if (last_update_time) {
-		update_ts_time_stats(cpu, ts, now, last_update_time);
-		iowait = ts->iowait_sleeptime;
-	} else {
+	if (wall)
+		*wall = ktime_to_us(now);
+
+	do {
+		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
+ 
 		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
 			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
 
@@ -518,7 +515,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
 		} else {
 			iowait = ts->iowait_sleeptime;
 		}
-	}
+	} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));
 
 	return ktime_to_us(iowait);
 }
-- 
1.7.1




* [PATCH 2/2] nohz: delayed iowait accounting for nohz idle time stats
  2014-04-17  9:35 [PATCH v4 0/2] nohz: fix idle accounting in NO_HZ kernels Hidetoshi Seto
  2014-04-17  9:38 ` [PATCH 1/2] nohz: make updating sleep stats local Hidetoshi Seto
@ 2014-04-17  9:41 ` Hidetoshi Seto
  2014-04-22 19:45   ` Peter Zijlstra
  2014-04-22  6:34 ` [PATCH v4 0/2] nohz: fix idle accounting in NO_HZ kernels Hidetoshi Seto
  2 siblings, 1 reply; 8+ messages in thread
From: Hidetoshi Seto @ 2014-04-17  9:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fernando Luis Vazquez Cao, Tetsuo Handa, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy, Denys Vlasenko,
	stable

This patch is 2/2 of a v4 patch set to fix an issue where the
idle/iowait values in /proc/stat can go backward. The issue was
originally reported by Tetsuo and Fernando in March 2013.

(Continued from previous patch 1/2.)

[PROBLEM 2]: broken iowait accounting.

  As historical nature, cpu's idle time was accounted as either
  idle or iowait depending on the presence of tasks blocked by
  I/O. No one complain about it for a long time. However:

  > Still trying to wrap my head around it, but conceptually
  > get_cpu_iowait_time_us() doesn't make any kind of sense.
  > iowait isn't per cpu since effectively tasks that aren't
  > running aren't assigned a cpu (as Oleg already pointed out).
  -- Peter Zijlstra

  Now some kernel folks have realized that accounting iowait
  per-cpu does not make sense in the SMP world. In the traditional
  UP era, a cpu was considered to be waiting on I/O if it was idle
  while nr_iowait > 0. But these days, with SMP systems, tasks
  easily migrate from the cpu where they issued an I/O to another
  cpu, where they are queued after I/O completion.

  Back to the NO_HZ mechanism. The really terrible thing here is
  that an observer has to handle the duration "delta" while
  nr_iowait of the sleeping cpu can easily be changed by woken
  tasks even though the cpu is sleeping. So the following can
  happen:

    given:
      idle time stats: idle=1000, iowait=100
      stamp at idle entry: entry=50
      nr tasks waiting io: nr_iowait=1

    the observer temporarily assigns delta as iowait in the 1st place
    (but does not do the update (=account delta to the time stats)):
      1st reader's query @ now = 60:
        idle=1000
        iowait=110 (=100+(60-50))

    then the blocked task is queued on the runqueue of another active
    cpu, woken up from io_schedule{,_timeout}(), and does an
    atomic_dec() from the remote cpu:
      nr_iowait=0

    and finally, on the 2nd turn, the observer assigns delta as idle:
      2nd reader's query @ now = 70:
        idle=1020 (=1000+(70-50))
        iowait=100

  You can see that iowait decreased from 110 to 100.

  In summary, the original problem that iowait in /proc/stat can
  go backward is a kind of hidden bug, while at the same time
  iowait accounting has a fundamental problem and needs to be
  reworked precisely.

[TARGET OF THIS PATCH]:

  A complete rework of iowait accounting implies that some user
  interfaces might be replaced entirely. It would introduce a new
  counter or the like, and kill the per-cpu iowait counter, which
  is known to be nonsense. It will take a long time to achieve,
  considering the time needed to make the problem public knowledge,
  and also the time for the interface transition. Anyway, as long
  as Linux tries to be a reliable OS, such a drastic change must
  land in the mainline development kernel sooner or later.

  OTOH, a drastic change like the above is not acceptable for
  conservative environments, such as stable kernels, some
  distributors' kernels and so on, out of respect for
  compatibility. These kernels still expect the per-cpu iowait
  counter to work as it always has. Therefore, for such
  environments, applying a simple patch to fix the counter's
  behavior is preferable to replacing a heavily used interface or
  changing existing usage/semantics.

  This patch mainly targets the latter kernels. It does not do too
  much, but intends to fix the idle stats counters so that they
  are monotonic. I believe the mainline kernel should pick up this
  patch too, because it certainly fixes a hidden bug and does not
  interfere with the upcoming drastic change.

  Again, in summary, this patch does not make a drastic change to
  cope with problem 2. It just fixes the behavior of the
  idle/iowait counters in /proc/stat.

[WHAT THIS PATCH PROPOSED]:

  The main cause of the bug is that observers have no way to
  determine whether the delta should be idle or iowait in the
  first place.

  So I introduce delayed iowait accounting to allow observers to
  assign delta based on the status of the observed cpu at idle
  entry.

  Refer to the comment in the patch for details.

[References]:

  First report from Fernando:
    [RFC] iowait/idle time accounting hiccups in NOHZ kernels
    https://lkml.org/lkml/2013/3/18/962
  Steps to reproduce guided by Tetsuo:
    https://lkml.org/lkml/2013/4/1/201

  1st patchset from Frederic:
    [PATCH 0/4] nohz: Fix racy sleeptime stats
    https://lkml.org/lkml/2013/8/8/638
    [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats
    https://lkml.org/lkml/2013/8/16/274

  2nd patchset from Frederic:
    [RFC PATCH 0/5] nohz: Fix racy sleeptime stats v2
    https://lkml.org/lkml/2013/10/19/86

  My previous patch set:
    [PATCH 0/2] nohz: fix idle accounting in NO_HZ kernels
    https://lkml.org/lkml/2014/3/23/256
    [PATCH 0/2 v2] nohz: fix idle accounting in NO_HZ kernels
    https://lkml.org/lkml/2014/3/30/315
    [PATCH v3 0/2] nohz: fix idle accounting in NO_HZ kernels
    https://lkml.org/lkml/2014/4/10/133

v4: rework delayed iowait accounting: check nr_iowait at entry
    (so as not to account iowait outside of the io boundary)
    update description/comments to explain more details

v3: use seqcount instead of seqlock
    (achieved by inserting cleanup as former patch)
    plus introduce delayed iowait accounting

v2: update comments and description about problem 2.
    include fix for minor typo

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Reported-by: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/tick.h     |    1 +
 kernel/time/tick-sched.c |  122 +++++++++++++++++++++++++++++++++++++++------
 2 files changed, 106 insertions(+), 17 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index 00dd98d..cec32e4 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -68,6 +68,7 @@ struct tick_sched {
 	ktime_t				idle_exittime;
 	ktime_t				idle_sleeptime;
 	ktime_t				iowait_sleeptime;
+	ktime_t				iowait_pending;
 	ktime_t				sleep_length;
 	unsigned long			last_jiffies;
 	unsigned long			next_jiffies;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c8314a1..95e18bd 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -405,16 +405,90 @@ static void tick_nohz_update_jiffies(ktime_t now)
 
 static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
 {
+	static const ktime_t ktime_zero = { .tv64 = 0 };
 	ktime_t delta;
 
 	write_seqcount_begin(&ts->idle_sleeptime_seq);
 
 	/* Updates the per cpu time idle statistics counters */
 	delta = ktime_sub(now, ts->idle_entrytime);
-	if (nr_iowait_cpu(smp_processor_id()) > 0)
-		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
-	else
+
+	if (ts->idle_active == 1) {
+		/* delta is idle */
 		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+	} else {
+		/*
+		 * delta is sum of idle/iowait:
+		 *
+		 * We account sleep time as iowait when nr_iowait of cpu
+		 * indicates there are tasks blocked by io, at the end of
+		 * idle (=here).
+		 * It means observers (=who query sleeping cpu's idle stats)
+		 * can not determine whether the unaccounted sleep time will
+		 * be idle or iowait on the fly.
+		 *
+		 * Therefore introduce following mechanism:
+		 *
+		 * [PHASE_1]: assign delta to idle
+		 *  - pending is 0
+		 *  - observers assign delta to idle
+		 *  - when cpu wakes up and come here:
+		 *    - If nr_iowait is 0:
+		 *        Account delta as idle, and continue PHASE_1.
+		 *    - If nr_iowait is >0:
+		 *        We'd like to account delta to iowait but it will
+		 *        be inconsistent with observers who already assigned
+		 *        a part of delta as idle. So here we account delta
+		 *        to idle and also pending as missed iowait. Then
+		 *        move to PHASE_2 from next turn.
+		 *
+		 * [PHASE_2]: we have missed iowait
+		 *  - pending is >0
+		 *  - observers assign delta to iowait until pending and over
+		 *    to idle.
+		 *  - when cpu wakes up and come here:
+		 *    - Account delta to iowait until pending and over to idle
+		 *      (=in same manner as observers).
+		 *    - Subtract accounted iowait from pending.
+		 *    - If nr_iowait is >0:
+		 *        The sleep time just finished was used to account
+		 *        pending iowait, so now we got new missed iowait.
+		 *        Accumulate delta as missed iowait, and wait next
+		 *        turn to let it be accounted. Continue PHASE_2.
+		 *    - If nr_iowait is 0:
+		 *        The sleep time just finished was considered as idle
+		 *        but utilized to account missed iowait. It is not
+		 *        problem because we just exchange excess idle for
+		 *        missed iowait. If we still have pending continue
+		 *        PHASE_2, otherwise move to PHASE_1 from next turn.
+		 */
+		ktime_t *idle = &ts->idle_sleeptime;
+		ktime_t *iowait = &ts->iowait_sleeptime;
+		ktime_t *pending = &ts->iowait_pending;
+
+		if (ktime_compare(*pending, ktime_zero) == 0) {
+			/* PHASE_1 */
+			*idle = ktime_add(*idle, delta);
+			if (nr_iowait_cpu(smp_processor_id()) > 0)
+				*pending = delta;
+		} else {
+			/* PHASE_2 */
+			if (ktime_compare(*pending, delta) > 0) {
+				*iowait = ktime_add(*iowait, delta);
+				if (!nr_iowait_cpu(smp_processor_id()))
+					*pending = ktime_sub(*pending, delta);
+			} else {
+				*idle = ktime_add(*idle, ktime_sub(delta,
+								   *pending));
+				*iowait = ktime_add(*iowait, *pending);
+				if (nr_iowait_cpu(smp_processor_id()) > 0)
+					*pending = delta;
+				else
+					*pending = ktime_zero;
+			}
+		}
+	}
+
 	ts->idle_entrytime = now;
 	ts->idle_active = 0;
 
@@ -429,7 +503,13 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
 
 	write_seqcount_begin(&ts->idle_sleeptime_seq);
 	ts->idle_entrytime = now;
-	ts->idle_active = 1;
+	/*
+	 * idle_active:
+	 *  0: cpu is not idle
+	 *  1: cpu is performing idle
+	 *  2: cpu is performing iowait and idle
+	 */
+	ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;
 	write_seqcount_end(&ts->idle_sleeptime_seq);
 
 	sched_clock_idle_sleep_event();
@@ -464,18 +544,23 @@ u64 get_cpu_idle_time_us(int cpu, u64 *wall)
 
 	do {
 		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
- 
-		if (ts->idle_active && !nr_iowait_cpu(cpu)) {
-			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+		idle = ts->idle_sleeptime;
 
-			idle = ktime_add(ts->idle_sleeptime, delta);
-		} else {
-			idle = ts->idle_sleeptime;
+		if (ts->idle_active == 2) {
+			/* delta is sum of unaccounted idle/iowait */
+			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+			if (ktime_compare(delta, ts->iowait_pending) > 0) {
+				delta = ktime_sub(delta, ts->iowait_pending);
+				idle = ktime_add(idle, delta);
+			}
+		} else if (ts->idle_active == 1) {
+			/* delta is unaccounted idle */
+			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+			idle = ktime_add(idle, delta);
 		}
 	} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));
 
 	return ktime_to_us(idle);
-
 }
 EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);
 
@@ -507,13 +592,16 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *wall)
 
 	do {
 		seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
- 
-		if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
-			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+		iowait = ts->iowait_sleeptime;
 
-			iowait = ktime_add(ts->iowait_sleeptime, delta);
-		} else {
-			iowait = ts->iowait_sleeptime;
+		if (ts->idle_active == 2) {
+			/* delta is sum of unaccounted idle/iowait */
+			ktime_t delta = ktime_sub(now, ts->idle_entrytime);
+			if (ktime_compare(delta, ts->iowait_pending) > 0) {
+				iowait = ktime_add(iowait, ts->iowait_pending);
+			} else {
+				iowait = ktime_add(iowait, delta);
+			}
 		}
 	} while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));
 
-- 
1.7.1




* Re: [PATCH v4 0/2] nohz: fix idle accounting in NO_HZ kernels
  2014-04-17  9:35 [PATCH v4 0/2] nohz: fix idle accounting in NO_HZ kernels Hidetoshi Seto
  2014-04-17  9:38 ` [PATCH 1/2] nohz: make updating sleep stats local Hidetoshi Seto
  2014-04-17  9:41 ` [PATCH 2/2] nohz: delayed iowait accounting for nohz idle time stats Hidetoshi Seto
@ 2014-04-22  6:34 ` Hidetoshi Seto
  2 siblings, 0 replies; 8+ messages in thread
From: Hidetoshi Seto @ 2014-04-22  6:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fernando Luis Vazquez Cao, Tetsuo Handa, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy, Denys Vlasenko,
	stable

Ping?

(I'll be on holiday for a week starting next week,
 so I'd appreciate your comments soon!)

Thanks,
H.Seto 

(2014/04/17 18:35), Hidetoshi Seto wrote:
> Hi all,
> 
> This patch set (rebased on v3.15-rc1) is my 4th attempt to fix an issue
> where the idle/iowait values in /proc/stat can go backward. The issue
> was originally reported by Tetsuo and Fernando in March 2013.
> 
> This v4 refines v3's approach and adds descriptions/comments to
> make the patches' intent clearer (I hope).
> 
> Reviews are of course still welcome.
> 
> Thanks,
> H.Seto
> 
> Hidetoshi Seto (2):
>       nohz: make updating sleep stats local
>       nohz: delayed iowait accounting for nohz idle time stats
> 
>  include/linux/tick.h     |    6 +-
>  kernel/time/tick-sched.c |  179 ++++++++++++++++++++++++++++++++++------------
>  2 files changed, 136 insertions(+), 49 deletions(-)
> 
> 
> 




* Re: [PATCH 2/2] nohz: delayed iowait accounting for nohz idle time stats
  2014-04-17  9:41 ` [PATCH 2/2] nohz: delayed iowait accounting for nohz idle time stats Hidetoshi Seto
@ 2014-04-22 19:45   ` Peter Zijlstra
  2014-04-23  7:40     ` Hidetoshi Seto
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2014-04-22 19:45 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: linux-kernel, Fernando Luis Vazquez Cao, Tetsuo Handa,
	Frederic Weisbecker, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy, Denys Vlasenko,
	stable

On Thu, Apr 17, 2014 at 06:41:41PM +0900, Hidetoshi Seto wrote:
> This patch is 2/2 of a v4 patch set to fix an issue where the
> idle/iowait values in /proc/stat can go backward. The issue was
> originally reported by Tetsuo and Fernando in March 2013.
> 
> (Continued from previous patch 1/2.)
> 
> [PROBLEM 2]: broken iowait accounting.
> 
>   Historically, a cpu's idle time has been accounted as either
>   idle or iowait depending on the presence of tasks blocked on
>   I/O. No one complained about it for a long time. However:
> 
>   > Still trying to wrap my head around it, but conceptually
>   > get_cpu_iowait_time_us() doesn't make any kind of sense.
>   > iowait isn't per cpu since effectively tasks that aren't
>   > running aren't assigned a cpu (as Oleg already pointed out).
>   -- Peter Zijlstra
> 
>   Now some kernel folks have realized that accounting iowait
>   per-cpu does not make sense in the SMP world. In the traditional
>   UP era, a cpu was considered to be waiting on I/O if it was idle
>   while nr_iowait > 0. But these days, with SMP systems, tasks
>   easily migrate from the cpu where they issued an I/O to another
>   cpu, where they are queued after I/O completion.
> 
>   Back to the NO_HZ mechanism. The really terrible thing here is
>   that an observer has to handle the duration "delta" while
>   nr_iowait of the sleeping cpu can easily be changed by woken
>   tasks even though the cpu is sleeping. So the following can
>   happen:
> 
>     given:
>       idle time stats: idle=1000, iowait=100
>       stamp at idle entry: entry=50
>       nr tasks waiting io: nr_iowait=1
> 
>     the observer temporarily assigns delta as iowait in the 1st place
>     (but does not do the update (=account delta to the time stats)):
>       1st reader's query @ now = 60:
>         idle=1000
>         iowait=110 (=100+(60-50))
> 
>     then the blocked task is queued on the runqueue of another active
>     cpu, woken up from io_schedule{,_timeout}(), and does an
>     atomic_dec() from the remote cpu:
>       nr_iowait=0
> 
>     and finally, on the 2nd turn, the observer assigns delta as idle:
>       2nd reader's query @ now = 70:
>         idle=1020 (=1000+(70-50))
>         iowait=100
> 
>   You can see that iowait decreased from 110 to 100.
> 
>   In summary, the original problem that iowait in /proc/stat can
>   go backward is a kind of hidden bug, while at the same time
>   iowait accounting has a fundamental problem and needs to be
>   reworked precisely.
> 
> [TARGET OF THIS PATCH]:
> 
>   A complete rework of iowait accounting implies that some user
>   interfaces might be replaced entirely. It would introduce a new
>   counter or the like, and kill the per-cpu iowait counter, which
>   is known to be nonsense. It will take a long time to achieve,
>   considering the time needed to make the problem public knowledge,
>   and also the time for the interface transition. Anyway, as long
>   as Linux tries to be a reliable OS, such a drastic change must
>   land in the mainline development kernel sooner or later.
> 
>   OTOH, a drastic change like the above is not acceptable for
>   conservative environments, such as stable kernels, some
>   distributors' kernels and so on, out of respect for
>   compatibility. These kernels still expect the per-cpu iowait
>   counter to work as it always has. Therefore, for such
>   environments, applying a simple patch to fix the counter's
>   behavior is preferable to replacing a heavily used interface or
>   changing existing usage/semantics.
> 
>   This patch mainly targets the latter kernels. It does not do too
>   much, but intends to fix the idle stats counters so that they
>   are monotonic. I believe the mainline kernel should pick up this
>   patch too, because it certainly fixes a hidden bug and does not
>   interfere with the upcoming drastic change.
> 
>   Again, in summary, this patch does not make a drastic change to
>   cope with problem 2. It just fixes the behavior of the
>   idle/iowait counters in /proc/stat.
> 
> [WHAT THIS PATCH PROPOSED]:
> 
>   The main cause of the bug is that observers have no way to
>   determine whether the delta should be idle or iowait in the
>   first place.
> 
>   So I introduce delayed iowait accounting to allow observers to
>   assign delta based on the status of the observed cpu at idle
>   entry.
> 

So the problem I have with this is that it makes CONFIG_NOHZ=[ny]
kernels behave quite differently.

Ideally NOHZ=y and NOHZ=n behave the same, my proposed solution the
other day does just that. This one OTOH seems to generate entirely
different results between those kernels.

It also doesn't really simplify the code; there's quite a bit of
complexity introduced to paper over this silly stuff.


* Re: [PATCH 2/2] nohz: delayed iowait accounting for nohz idle time stats
  2014-04-22 19:45   ` Peter Zijlstra
@ 2014-04-23  7:40     ` Hidetoshi Seto
  2014-04-23  9:41       ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Hidetoshi Seto @ 2014-04-23  7:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Fernando Luis Vazquez Cao, Tetsuo Handa,
	Frederic Weisbecker, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy, Denys Vlasenko,
	stable

(2014/04/23 4:45), Peter Zijlstra wrote:
> On Thu, Apr 17, 2014 at 06:41:41PM +0900, Hidetoshi Seto wrote:
>> [TARGET OF THIS PATCH]:
>>
>>   A complete rework of iowait accounting implies that some user
>>   interfaces might be replaced entirely. It would introduce new
>>   counters or the like, and kill the per-cpu iowait counter, which
>>   is known to be nonsense. Achieving this will take a long time,
>>   considering the time needed to make the problem public knowledge,
>>   and also the time for the interface transition. Anyway, as long
>>   as Linux tries to be a reliable OS, such a drastic change must
>>   land in the mainline development kernel sooner or later.
>>
>>   OTOH, a drastic change like the above is not acceptable for
>>   conservative environments, such as stable kernels, distributor
>>   kernels and so on, out of respect for compatibility. These
>>   kernels still expect the per-cpu iowait counter to work as it
>>   always has. Therefore, for such environments, a simple patch
>>   that fixes the counter's behavior will be appreciated more than
>>   replacing an over-used interface or changing existing
>>   usage/semantics.
>>
>>   This patch mainly targets the latter kernels. It does not do too
>>   much, but intends to make the idle stats counters monotonic. I
>>   believe the mainline kernel should pick up this patch too,
>>   because it surely fixes a hidden bug and does not block the
>>   upcoming drastic change.
>>
>>   Again, in summary, this patch does not make a drastic change to
>>   cope with problem 2. It just fixes the behavior of the
>>   idle/iowait counters in /proc/stat.
>>
>> [WHAT THIS PATCH PROPOSED]:
>>
>>   The root cause of the bug is that observers have no way to
>>   determine whether a delta should be counted as idle or iowait
>>   in the first place.
>>
>>   So I introduced delayed iowait accounting, which allows
>>   observers to assign the delta based on the status of the
>>   observed cpu at idle entry.
>>
> 
> So the problem I have with this is that it makes CONFIG_NOHZ=[ny]
> kernels behave quite differently.

It is not true.
There were already differences before my patches were applied.
The behavior of the NOHZ=y kernel has diverged from the original
since it was born.

Please note that no one complained about this difference.

I just want to fix a counter so that it does not go backward.
It's a simple bug, isn't it?

> Ideally NOHZ=y and NOHZ=n behave the same, my proposed solution the
> other day does just that. This one OTOH seems to generate entirely
> different results between those kernels.

As you know, the behaviors of NOHZ=[ny] are both crap because of the
per-cpu iowait accounting.

I guess we should say:
  Ideally, NOHZ=[ny] behave the same "in the proper way."

What you propose does too much, turning one nonsense into another
nonsense. It will be unhelpful for people...

> It also doesn't really simplify the code; there's quite a bit of
> complexity introduced to paper over this silly stuff.

Don't complicate things. I want to talk about a small, simple bug.

a) today's NOHZ=n
   - provides a per-cpu iowait counter
     (nonsense, but still loved by innocent userland)

b) today's NOHZ=y
   - provides a per-cpu iowait counter
   - uses its own idle time accounting, different from a)
   - has a *bug*: the counter might go backward

b') NOHZ=y + my patch
   - provides a per-cpu iowait counter
   - uses its own idle time accounting, same as b)
   - the *bug* in b) is gone
   - instead, accepts a gap in the iowait value relative to b)
     - "pending" will not bloat by more than one iowait span

c) unified something for NOHZ=[ny] (what you proposed)
   - provides a per-cpu iowait counter
   - uses new accounting, different from both a) and b)

d) the ideal goal (= not yet designed & realized)
   - no per-cpu iowait counter any more
   - uses new, proper accounting, different from all of the above

I just want to turn b) into b') with a patch as small as possible.

What you propose turns both a) and b) into c).
I think it does too much, and changing a) is not required here.
(From my conservative perspective, the patch must be non-intrusive.)

Our final goal must be d), replacing all the nasty things in this
area. But there is still no concrete design for that.
And such a big jump will not fit stable environments.

Again, I just want to fix a small bug here...

I believe my patch is simple enough to be a minimal fix.
Please tell me if you have a simpler or better way to fix this
long-standing minor bug, not only for the mainline kernel in
development but also for conservative stable kernels such as
distributor kernels.


Thanks,
H.Seto



* Re: [PATCH 2/2] nohz: delayed iowait accounting for nohz idle time stats
  2014-04-23  7:40     ` Hidetoshi Seto
@ 2014-04-23  9:41       ` Peter Zijlstra
  2014-04-24  5:50         ` Hidetoshi Seto
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2014-04-23  9:41 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: linux-kernel, Fernando Luis Vazquez Cao, Tetsuo Handa,
	Frederic Weisbecker, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy, Denys Vlasenko,
	stable

On Wed, Apr 23, 2014 at 04:40:18PM +0900, Hidetoshi Seto wrote:
> (2014/04/23 4:45), Peter Zijlstra wrote:
> > On Thu, Apr 17, 2014 at 06:41:41PM +0900, Hidetoshi Seto wrote:
> >> [TARGET OF THIS PATCH]:
> >>
> >>   A complete rework of iowait accounting implies that some user
> >>   interfaces might be replaced entirely. It would introduce new
> >>   counters or the like, and kill the per-cpu iowait counter, which
> >>   is known to be nonsense. Achieving this will take a long time,
> >>   considering the time needed to make the problem public knowledge,
> >>   and also the time for the interface transition. Anyway, as long
> >>   as Linux tries to be a reliable OS, such a drastic change must
> >>   land in the mainline development kernel sooner or later.
> >>
> >>   OTOH, a drastic change like the above is not acceptable for
> >>   conservative environments, such as stable kernels, distributor
> >>   kernels and so on, out of respect for compatibility. These
> >>   kernels still expect the per-cpu iowait counter to work as it
> >>   always has. Therefore, for such environments, a simple patch
> >>   that fixes the counter's behavior will be appreciated more than
> >>   replacing an over-used interface or changing existing
> >>   usage/semantics.
> >>
> >>   This patch mainly targets the latter kernels. It does not do too
> >>   much, but intends to make the idle stats counters monotonic. I
> >>   believe the mainline kernel should pick up this patch too,
> >>   because it surely fixes a hidden bug and does not block the
> >>   upcoming drastic change.
> >>
> >>   Again, in summary, this patch does not make a drastic change to
> >>   cope with problem 2. It just fixes the behavior of the
> >>   idle/iowait counters in /proc/stat.
> >>
> >> [WHAT THIS PATCH PROPOSED]:
> >>
> >>   The root cause of the bug is that observers have no way to
> >>   determine whether a delta should be counted as idle or iowait
> >>   in the first place.
> >>
> >>   So I introduced delayed iowait accounting, which allows
> >>   observers to assign the delta based on the status of the
> >>   observed cpu at idle entry.
> >>
> > 
> > So the problem I have with this is that it makes CONFIG_NOHZ=[ny]
> > kernels behave quite differently.
> 
> It is not true.
> There were already differences before my patches were applied.
> The behavior of the NOHZ=y kernel has diverged from the original
> since it was born.

But if you argue about not actually fixing iowait properly, then this
difference is the only actual regression.

> Please note that no one complained about this difference.

Then why are you working on this? You're the one that said there was a
regression between unnamed enterprise distro 5 and unnamed enterprise
distro 6.

> I just want to fix a counter so that it does not go backward.
> It's a simple bug, isn't it?

Seeing how we managed to send as many patches as we did, I'd say that's
a fairly big clue as to how it's not that simple.

Just make the value unconditionally 0 then. That's guaranteed not to go
backwards and just about as useful as the random forward walk you turn
it into. Plus, it's a lot easier.

> > Ideally NOHZ=y and NOHZ=n behave the same, my proposed solution the
> > other day does just that. This one OTOH seems to generate entirely
> > different results between those kernels.
> 
> As you know, the behaviors of NOHZ=[ny] are both crap because of the
> per-cpu iowait accounting.
> 
> I guess we should say:
>   Ideally, NOHZ=[ny] behave the same "in the proper way."

No, the premise of NOHZ is that behaviour should not change, we found a
change in behaviour, we should make it go away.

Secondly, with or without NOHZ iowait accounting is complete crap. We
should also fix that.

> What you propose does too much, turning one nonsense into another
> nonsense. It will be unhelpful for people...

I proposed that if you don't want to fix the actual "iowait is crap"
problem, you should at least fix the NOHZ regression properly.

> > It also doesn't really simplify the code; there's quite a bit of
> > complexity introduced to paper over this silly stuff.
> 
> Don't complicate things. I want to talk about a small, simple bug.
> 
> a) today's NOHZ=n
>    - provides a per-cpu iowait counter
>      (nonsense, but still loved by innocent userland)
> b) today's NOHZ=y
>    - provides a per-cpu iowait counter
>    - uses its own idle time accounting, different from a)
>    - has a *bug*: the counter might go backward

Like I said, the intent of NOHZ is to not change the accounting; it's
often more complicated than a) because, well, no ticks.

But it really should give the same numbers, nonsense or not. We do the
same for all the other numbers -- we spent a ton of effort fixing the
loadavg a year or two ago.

loadavg is also another dubious number, fwiw.

> b') NOHZ=y + my patch
>    - provides a per-cpu iowait counter
>    - uses its own idle time accounting, same as b)
>    - the *bug* in b) is gone
>    - instead, accepts a gap in the iowait value relative to b)
>      - "pending" will not bloat by more than one iowait span
> 
> c) unified something for NOHZ=[ny] (what you proposed)
>    - provides a per-cpu iowait counter
>    - uses new accounting, different from both a) and b)
> 
> d) the ideal goal (= not yet designed & realized)
>    - no per-cpu iowait counter any more
>    - uses new, proper accounting, different from all of the above
> 
> I just want to turn b) into b') with a patch as small as possible.

I really don't see the value of b'; the actual value of its result is
really no better than the constant 0, and the constant implementation
is _way_ simpler.

> What you propose turns both a) and b) into c).
> I think it does too much, and changing a) is not required here.
> (From my conservative perspective, the patch must be non-intrusive.)
> 
> Our final goal must be d), replacing all the nasty things in this
> area. But there is still no concrete design for that.
> And such a big jump will not fit stable environments.
> 
> Again, I just want to fix a small bug here...
> 
> I believe my patch is simple enough to be a minimal fix.
> Please tell me if you have a simpler or better way to fix this
> long-standing minor bug, not only for the mainline kernel in
> development but also for conservative stable kernels such as
> distributor kernels.

I really think only fixing the backward motion is retarded. It's
papering over so much crap it's not funny.

The fact that it does go backwards is because b) does not give results
identical to a); _that_ is a bug per the design principle of NOHZ.

Fixing it to just return some random number that doesn't go backwards
doesn't fix it. It just makes the immediately observed problem go away;
entirely the wrong mindset.


* Re: [PATCH 2/2] nohz: delayed iowait accounting for nohz idle time stats
  2014-04-23  9:41       ` Peter Zijlstra
@ 2014-04-24  5:50         ` Hidetoshi Seto
  0 siblings, 0 replies; 8+ messages in thread
From: Hidetoshi Seto @ 2014-04-24  5:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Fernando Luis Vazquez Cao, Tetsuo Handa,
	Frederic Weisbecker, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	Arjan van de Ven, Oleg Nesterov, Preeti U Murthy, Denys Vlasenko,
	stable

(2014/04/23 18:41), Peter Zijlstra wrote:
> On Wed, Apr 23, 2014 at 04:40:18PM +0900, Hidetoshi Seto wrote:
>> (2014/04/23 4:45), Peter Zijlstra wrote:
>>> On Thu, Apr 17, 2014 at 06:41:41PM +0900, Hidetoshi Seto wrote:
>>>> [TARGET OF THIS PATCH]:
>>>>
>>>>   A complete rework of iowait accounting implies that some user
>>>>   interfaces might be replaced entirely. It would introduce new
>>>>   counters or the like, and kill the per-cpu iowait counter, which
>>>>   is known to be nonsense. Achieving this will take a long time,
>>>>   considering the time needed to make the problem public knowledge,
>>>>   and also the time for the interface transition. Anyway, as long
>>>>   as Linux tries to be a reliable OS, such a drastic change must
>>>>   land in the mainline development kernel sooner or later.
>>>>
>>>>   OTOH, a drastic change like the above is not acceptable for
>>>>   conservative environments, such as stable kernels, distributor
>>>>   kernels and so on, out of respect for compatibility. These
>>>>   kernels still expect the per-cpu iowait counter to work as it
>>>>   always has. Therefore, for such environments, a simple patch
>>>>   that fixes the counter's behavior will be appreciated more than
>>>>   replacing an over-used interface or changing existing
>>>>   usage/semantics.
>>>>
>>>>   This patch mainly targets the latter kernels. It does not do too
>>>>   much, but intends to make the idle stats counters monotonic. I
>>>>   believe the mainline kernel should pick up this patch too,
>>>>   because it surely fixes a hidden bug and does not block the
>>>>   upcoming drastic change.
>>>>
>>>>   Again, in summary, this patch does not make a drastic change to
>>>>   cope with problem 2. It just fixes the behavior of the
>>>>   idle/iowait counters in /proc/stat.
>>>>
>>>> [WHAT THIS PATCH PROPOSED]:
>>>>
>>>>   The root cause of the bug is that observers have no way to
>>>>   determine whether a delta should be counted as idle or iowait
>>>>   in the first place.
>>>>
>>>>   So I introduced delayed iowait accounting, which allows
>>>>   observers to assign the delta based on the status of the
>>>>   observed cpu at idle entry.
>>>>
>>>
>>> So the problem I have with this is that it makes CONFIG_NOHZ=[ny]
>>> kernels behave quite differently.
>>
>> It is not true.
>> There were already differences before my patches were applied.
>> The behavior of the NOHZ=y kernel has diverged from the original
>> since it was born.
> 
> But if you argue about not actually fixing iowait properly, then this
> difference is the only actual regression.

A difference is not always a regression.

>> Please note that no one complained about this difference.
> 
> Then why are you working on this? You're the one that said there was a
> regression between unnamed enterprise distro 5 and unnamed enterprise
> distro 6.

e.g.
 regression: a counter loses monotonicity
 improvement: lower power consumption by skipping ticks while idle
 doesn't matter: a useless fuzzy crap value turned into another crap value

If you say every difference is a regression, then every kernel config
option must be a nop.

>> I just want to fix a counter so that it does not go backward.
>> It's a simple bug, isn't it?
> 
> Seeing how we managed to send as many patches as we did, I'd say that's
> a fairly big clue as to how it's not that simple.

I think the situation is that a small simple bug is glued to a big,
complicated background. I want to separate them and handle only the
small bug; that is why I wrote such a long patch description covering
the background and my target.

> Just make the value unconditionally 0 then. That's guaranteed not to go
> backwards and just about as useful as the random forward walk you turn
> it into. Plus, it's a lot easier.

I'd be glad to adopt the constant 0 if it is acceptable for the stable
kernels.

I suppose it could be considered an improvement for the mainline
kernel, since it would be the first step toward the upcoming drastic
changes. But I also suppose it could be classified as a regression for
stable kernels, because a counter in ordinary use would stop making
progress.

>>> Ideally NOHZ=y and NOHZ=n behave the same, my proposed solution the
>>> other day does just that. This one OTOH seems to generate entirely
>>> different results between those kernels.
>>
>> As you know, the behaviors of NOHZ=[ny] are both crap because of the
>> per-cpu iowait accounting.
>>
>> I guess we should say:
>>   Ideally, NOHZ=[ny] behave the same "in the proper way."
> 
> No, the premise of NOHZ is that behaviour should not change, we found a
> change in behaviour, we should make it go away.
> 
> Secondly, with or without NOHZ iowait accounting is complete crap. We
> should also fix that.

Hmm..? I don't see why you say no here.
I think spending time to unify 2 craps into 1 crap is unnecessary
before replacing the craps with a proper thing.

>> What you propose does too much, turning one nonsense into another
>> nonsense. It will be unhelpful for people...
> 
> I proposed that if you don't want to fix the actual "iowait is crap"
> problem, you should at least fix the NOHZ regression properly.

I'd like to target only the part of the differences that is considered
a regression.

I'm being overly cautious, since removing or stopping this crap counter
could be a new regression.

>>> It also doesn't really simplify the code; there's quite a bit of
>>> complexity introduced to paper over this silly stuff.
>>
>> Don't complicate things. I want to talk about a small, simple bug.
>>
>> a) today's NOHZ=n
>>    - provides a per-cpu iowait counter
>>      (nonsense, but still loved by innocent userland)
>> b) today's NOHZ=y
>>    - provides a per-cpu iowait counter
>>    - uses its own idle time accounting, different from a)
>>    - has a *bug*: the counter might go backward
> 
> Like I said, the intent of NOHZ is to not change the accounting; it's
> often more complicated than a) because, well, no ticks.
> 
> But it really should give the same numbers, nonsense or not. We do the
> same for all the other numbers -- we spent a ton of effort fixing the
> loadavg a year or two ago.
> 
> loadavg is also another dubious number, fwiw.

I can understand how you feel.

>> b') NOHZ=y + my patch
>>    - provides a per-cpu iowait counter
>>    - uses its own idle time accounting, same as b)
>>    - the *bug* in b) is gone
>>    - instead, accepts a gap in the iowait value relative to b)
>>      - "pending" will not bloat by more than one iowait span
>>
>> c) unified something for NOHZ=[ny] (what you proposed)
>>    - provides a per-cpu iowait counter
>>    - uses new accounting, different from both a) and b)
>>
>> d) the ideal goal (= not yet designed & realized)
>>    - no per-cpu iowait counter any more
>>    - uses new, proper accounting, different from all of the above
>>
>> I just want to turn b) into b') with a patch as small as possible.
> 
> I really don't see the value of b'; the actual value of its result is
> really no better than the constant 0, and the constant implementation
> is _way_ simpler.

 a) monotonic per-cpu counter providing useless value
 b) non-monotonic per-cpu counter providing useless value
 b') monotonic per-cpu counter providing useless value
 c) monotonic per-cpu counter providing useless value
 d) monotonic counter(s) providing proper value
 x) constant 0 counter

My evaluation is:
  for stables:
   (bad)  (b,d,x) <<<<< (a,b',c) (good)
  for mainline:
   (bad)  b <<<<< (a,b',c) <<<<< x <<<<< d  (good) 

>> What you propose turns both a) and b) into c).
>> I think it does too much, and changing a) is not required here.
>> (From my conservative perspective, the patch must be non-intrusive.)
>>
>> Our final goal must be d), replacing all the nasty things in this
>> area. But there is still no concrete design for that.
>> And such a big jump will not fit stable environments.
>>
>> Again, I just want to fix a small bug here...
>>
>> I believe my patch is simple enough to be a minimal fix.
>> Please tell me if you have a simpler or better way to fix this
>> long-standing minor bug, not only for the mainline kernel in
>> development but also for conservative stable kernels such as
>> distributor kernels.
> 
> I really think only fixing the backward motion is retarded. It's
> papering over so much crap it's not funny.
> 
> The fact that it does go backwards is because b) does not give results
> identical to a); _that_ is a bug per the design principle of NOHZ.
> 
> Fixing it to just return some random number that doesn't go backwards
> doesn't fix it. It just makes the immediately observed problem go away;
> entirely the wrong mindset.

I feel like you are going to use cardboard instead of paper...


BTW, I found that Denys posted his updated patch set: 
https://lkml.org/lkml/2014/4/23/579

It looks like it takes almost the same approach as what you proposed.
However, I guess:
  - It should not have #ifdef CONFIG_NOHZ_*, because that makes the
    behavior of NOHZ=[ny] differ.
  - last_iowait needs to be located in the rq rather than in struct
    tick_sched, considering cache hits at io_schedule*().
  - last_iowait needs to be referenced in the tick for NOHZ=n, so as
    not to differ from NOHZ=y.

Anyway, though I still wonder whether what you proposed is an
acceptable fix for a distro's quality assurance, I hope someone in
charge has a broad mind like yours and resolves the matter favorably.

Is it worth doing if I make a v5 based on your proposal and post it
for review, comparing it with v4?


Thanks,
H.Seto



end of thread, other threads:[~2014-04-24  5:50 UTC | newest]

Thread overview: 8+ messages
2014-04-17  9:35 [PATCH v4 0/2] nohz: fix idle accounting in NO_HZ kernels Hidetoshi Seto
2014-04-17  9:38 ` [PATCH 1/2] nohz: make updating sleep stats local Hidetoshi Seto
2014-04-17  9:41 ` [PATCH 2/2] nohz: delayed iowait accounting for nohz idle time stats Hidetoshi Seto
2014-04-22 19:45   ` Peter Zijlstra
2014-04-23  7:40     ` Hidetoshi Seto
2014-04-23  9:41       ` Peter Zijlstra
2014-04-24  5:50         ` Hidetoshi Seto
2014-04-22  6:34 ` [PATCH v4 0/2] nohz: fix idle accounting in NO_HZ kernels Hidetoshi Seto
