* [RFC PATCH 0/4] sched: Improve cpu load accounting with nohz
@ 2016-01-13 16:01 Frederic Weisbecker
  2016-01-13 16:01 ` [PATCH 1/4] sched: Don't account tickless CPU load on tick Frederic Weisbecker
                   ` (4 more replies)
  0 siblings, 5 replies; 58+ messages in thread
From: Frederic Weisbecker @ 2016-01-13 16:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Frederic Weisbecker, Byungchul Park, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter,
	Paul E . McKenney, Mike Galbraith, Rik van Riel

Hi,

Most of the remaining nohz full work is about making the scheduler
resilient to full dynticks (such that we can remove the residual 1Hz
tick one day). Byungchul Park started interesting work toward that with
cpu load updates in nohz full. So here is one more step forward.

Patches 1 and 2 concern both dyntick-idle and dyntick-full. The rest
is rather about full dynticks. Note that this isn't complete support for
cpu load in nohz full; we still need to think about a way to make
target_load() able to return up-to-date values on full dynticks targets.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	sched/core

HEAD: f1d7d0f5382be3819490a859313f692f142dfb74

Thanks,
	Frederic
---

Frederic Weisbecker (4):
      sched: Don't account tickless CPU load on tick
      sched: Consolidate nohz CPU load update code
      sched: Move cpu load stats functions above fair queue callbacks
      sched: Upload nohz full CPU load on task enqueue/dequeue


 kernel/sched/fair.c | 306 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 175 insertions(+), 131 deletions(-)


* [PATCH 1/4] sched: Don't account tickless CPU load on tick
  2016-01-13 16:01 [RFC PATCH 0/4] sched: Improve cpu load accounting with nohz Frederic Weisbecker
@ 2016-01-13 16:01 ` Frederic Weisbecker
  2016-01-19 13:08   ` Peter Zijlstra
  2016-01-13 16:01 ` [PATCH 2/4] sched: Consolidate nohz CPU load update code Frederic Weisbecker
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 58+ messages in thread
From: Frederic Weisbecker @ 2016-01-13 16:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Frederic Weisbecker, Byungchul Park, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter,
	Paul E . McKenney, Mike Galbraith, Rik van Riel

The cpu load update on tick doesn't care about dynticks and as such is
buggy when occurring on nohz ticks (including idle ticks), as it resets
the jiffies snapshot that was recorded on nohz entry. We end up ignoring
the potentially long tickless load that preceded the tick.

We can fix this in two ways:

1) Handle the tickless load, but then we must make sure that a freshly
   woken task's load doesn't get accounted as the whole previous tickless
   load.

2) Ignore nohz ticks and delay the accounting to the nohz exit point.

For simplicity, this patch proposes to fix the issue with the second
solution.

Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 kernel/sched/fair.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1093873..b849ea8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4518,10 +4518,20 @@ void update_cpu_load_nohz(int active)
  */
 void update_cpu_load_active(struct rq *this_rq)
 {
-	unsigned long load = weighted_cpuload(cpu_of(this_rq));
+	unsigned long load;
+
+	/*
+	 * If the tick is stopped, we can't reliably update the
+	 * load without risking to spuriously account the weight
+	 * of a freshly woken task as the whole weight of a long
+	 * tickless period.
+	 */
+	if (tick_nohz_tick_stopped())
+		return;
 	/*
 	 * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
 	 */
+	load = weighted_cpuload(cpu_of(this_rq));
 	this_rq->last_load_update_tick = jiffies;
 	__update_cpu_load(this_rq, load, 1, 1);
 }
-- 
2.6.4


* [PATCH 2/4] sched: Consolidate nohz CPU load update code
  2016-01-13 16:01 [RFC PATCH 0/4] sched: Improve cpu load accounting with nohz Frederic Weisbecker
  2016-01-13 16:01 ` [PATCH 1/4] sched: Don't account tickless CPU load on tick Frederic Weisbecker
@ 2016-01-13 16:01 ` Frederic Weisbecker
  2016-01-14  2:30   ` Byungchul Park
                     ` (2 more replies)
  2016-01-13 16:01 ` [RFC PATCH 3/4] sched: Move cpu load stats functions above fair queue callbacks Frederic Weisbecker
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 58+ messages in thread
From: Frederic Weisbecker @ 2016-01-13 16:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Frederic Weisbecker, Byungchul Park, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter,
	Paul E . McKenney, Mike Galbraith, Rik van Riel

Let's factorize a bit of code here. We'll even have a third user soon.
While at it, standardize the idle update function name with the others.

Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 kernel/sched/fair.c | 48 +++++++++++++++++++++++++-----------------------
 1 file changed, 25 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b849ea8..161cee2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4450,6 +4450,25 @@ static unsigned long weighted_cpuload(const int cpu)
 }
 
 #ifdef CONFIG_NO_HZ_COMMON
+static void __update_cpu_load_nohz(struct rq *this_rq,
+				   unsigned long curr_jiffies,
+				   unsigned long load,
+				   int active)
+{
+	unsigned long pending_updates;
+
+	pending_updates = curr_jiffies - this_rq->last_load_update_tick;
+	if (pending_updates) {
+		this_rq->last_load_update_tick = curr_jiffies;
+		/*
+		 * In the regular NOHZ case, we were idle, this means load 0.
+		 * In the NOHZ_FULL case, we were non-idle, we should consider
+		 * its weighted load.
+		 */
+		__update_cpu_load(this_rq, load, pending_updates, active);
+	}
+}
+
 /*
  * There is no sane way to deal with nohz on smp when using jiffies because the
  * cpu doing the jiffies update might drift wrt the cpu doing the jiffy reading
@@ -4467,22 +4486,15 @@ static unsigned long weighted_cpuload(const int cpu)
  * Called from nohz_idle_balance() to update the load ratings before doing the
  * idle balance.
  */
-static void update_idle_cpu_load(struct rq *this_rq)
+static void update_cpu_load_idle(struct rq *this_rq)
 {
-	unsigned long curr_jiffies = READ_ONCE(jiffies);
-	unsigned long load = weighted_cpuload(cpu_of(this_rq));
-	unsigned long pending_updates;
-
 	/*
 	 * bail if there's load or we're actually up-to-date.
 	 */
-	if (load || curr_jiffies == this_rq->last_load_update_tick)
+	if (weighted_cpuload(cpu_of(this_rq)))
 		return;
 
-	pending_updates = curr_jiffies - this_rq->last_load_update_tick;
-	this_rq->last_load_update_tick = curr_jiffies;
-
-	__update_cpu_load(this_rq, load, pending_updates, 0);
+	__update_cpu_load_nohz(this_rq, READ_ONCE(jiffies), 0, 0);
 }
 
 /*
@@ -4493,22 +4505,12 @@ void update_cpu_load_nohz(int active)
 	struct rq *this_rq = this_rq();
 	unsigned long curr_jiffies = READ_ONCE(jiffies);
 	unsigned long load = active ? weighted_cpuload(cpu_of(this_rq)) : 0;
-	unsigned long pending_updates;
 
 	if (curr_jiffies == this_rq->last_load_update_tick)
 		return;
 
 	raw_spin_lock(&this_rq->lock);
-	pending_updates = curr_jiffies - this_rq->last_load_update_tick;
-	if (pending_updates) {
-		this_rq->last_load_update_tick = curr_jiffies;
-		/*
-		 * In the regular NOHZ case, we were idle, this means load 0.
-		 * In the NOHZ_FULL case, we were non-idle, we should consider
-		 * its weighted load.
-		 */
-		__update_cpu_load(this_rq, load, pending_updates, active);
-	}
+	__update_cpu_load_nohz(this_rq, curr_jiffies, load, active);
 	raw_spin_unlock(&this_rq->lock);
 }
 #endif /* CONFIG_NO_HZ */
@@ -4529,7 +4531,7 @@ void update_cpu_load_active(struct rq *this_rq)
 	if (tick_nohz_tick_stopped())
 		return;
 	/*
-	 * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
+	 * See the mess around update_cpu_load_idle() / update_cpu_load_nohz().
 	 */
 	load = weighted_cpuload(cpu_of(this_rq));
 	this_rq->last_load_update_tick = jiffies;
@@ -7824,7 +7826,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 		if (time_after_eq(jiffies, rq->next_balance)) {
 			raw_spin_lock_irq(&rq->lock);
 			update_rq_clock(rq);
-			update_idle_cpu_load(rq);
+			update_cpu_load_idle(rq);
 			raw_spin_unlock_irq(&rq->lock);
 			rebalance_domains(rq, CPU_IDLE);
 		}
-- 
2.6.4


* [RFC PATCH 3/4] sched: Move cpu load stats functions above fair queue callbacks
  2016-01-13 16:01 [RFC PATCH 0/4] sched: Improve cpu load accounting with nohz Frederic Weisbecker
  2016-01-13 16:01 ` [PATCH 1/4] sched: Don't account tickless CPU load on tick Frederic Weisbecker
  2016-01-13 16:01 ` [PATCH 2/4] sched: Consolidate nohz CPU load update code Frederic Weisbecker
@ 2016-01-13 16:01 ` Frederic Weisbecker
  2016-01-13 16:01 ` [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue Frederic Weisbecker
  2016-01-14 21:19 ` [RFC PATCH 0/4] sched: Improve cpu load accounting with nohz Dietmar Eggemann
  4 siblings, 0 replies; 58+ messages in thread
From: Frederic Weisbecker @ 2016-01-13 16:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Frederic Weisbecker, Byungchul Park, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter,
	Paul E . McKenney, Mike Galbraith, Rik van Riel

We are going to update the nohz full CPU load from the fair queue update
functions. Let's make these cpu load functions visible to the queue
callbacks.

Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 kernel/sched/fair.c | 216 ++++++++++++++++++++++++++--------------------------
 1 file changed, 109 insertions(+), 107 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 161cee2..1e0cb6e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4199,113 +4199,6 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
-/*
- * The enqueue_task method is called before nr_running is
- * increased. Here we update the fair scheduling stats and
- * then put the task into the rbtree:
- */
-static void
-enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
-{
-	struct cfs_rq *cfs_rq;
-	struct sched_entity *se = &p->se;
-
-	for_each_sched_entity(se) {
-		if (se->on_rq)
-			break;
-		cfs_rq = cfs_rq_of(se);
-		enqueue_entity(cfs_rq, se, flags);
-
-		/*
-		 * end evaluation on encountering a throttled cfs_rq
-		 *
-		 * note: in the case of encountering a throttled cfs_rq we will
-		 * post the final h_nr_running increment below.
-		*/
-		if (cfs_rq_throttled(cfs_rq))
-			break;
-		cfs_rq->h_nr_running++;
-
-		flags = ENQUEUE_WAKEUP;
-	}
-
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-		cfs_rq->h_nr_running++;
-
-		if (cfs_rq_throttled(cfs_rq))
-			break;
-
-		update_load_avg(se, 1);
-		update_cfs_shares(cfs_rq);
-	}
-
-	if (!se)
-		add_nr_running(rq, 1);
-
-	hrtick_update(rq);
-}
-
-static void set_next_buddy(struct sched_entity *se);
-
-/*
- * The dequeue_task method is called before nr_running is
- * decreased. We remove the task from the rbtree and
- * update the fair scheduling stats:
- */
-static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
-{
-	struct cfs_rq *cfs_rq;
-	struct sched_entity *se = &p->se;
-	int task_sleep = flags & DEQUEUE_SLEEP;
-
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-		dequeue_entity(cfs_rq, se, flags);
-
-		/*
-		 * end evaluation on encountering a throttled cfs_rq
-		 *
-		 * note: in the case of encountering a throttled cfs_rq we will
-		 * post the final h_nr_running decrement below.
-		*/
-		if (cfs_rq_throttled(cfs_rq))
-			break;
-		cfs_rq->h_nr_running--;
-
-		/* Don't dequeue parent if it has other entities besides us */
-		if (cfs_rq->load.weight) {
-			/*
-			 * Bias pick_next to pick a task from this cfs_rq, as
-			 * p is sleeping when it is within its sched_slice.
-			 */
-			if (task_sleep && parent_entity(se))
-				set_next_buddy(parent_entity(se));
-
-			/* avoid re-evaluating load for this entity */
-			se = parent_entity(se);
-			break;
-		}
-		flags |= DEQUEUE_SLEEP;
-	}
-
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-		cfs_rq->h_nr_running--;
-
-		if (cfs_rq_throttled(cfs_rq))
-			break;
-
-		update_load_avg(se, 1);
-		update_cfs_shares(cfs_rq);
-	}
-
-	if (!se)
-		sub_nr_running(rq, 1);
-
-	hrtick_update(rq);
-}
-
 #ifdef CONFIG_SMP
 
 /*
@@ -4537,8 +4430,117 @@ void update_cpu_load_active(struct rq *this_rq)
 	this_rq->last_load_update_tick = jiffies;
 	__update_cpu_load(this_rq, load, 1, 1);
 }
+#endif /* CONFIG_SMP */
 
 /*
+ * The enqueue_task method is called before nr_running is
+ * increased. Here we update the fair scheduling stats and
+ * then put the task into the rbtree:
+ */
+static void
+enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
+{
+	struct cfs_rq *cfs_rq;
+	struct sched_entity *se = &p->se;
+
+	for_each_sched_entity(se) {
+		if (se->on_rq)
+			break;
+		cfs_rq = cfs_rq_of(se);
+		enqueue_entity(cfs_rq, se, flags);
+
+		/*
+		 * end evaluation on encountering a throttled cfs_rq
+		 *
+		 * note: in the case of encountering a throttled cfs_rq we will
+		 * post the final h_nr_running increment below.
+		*/
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+		cfs_rq->h_nr_running++;
+
+		flags = ENQUEUE_WAKEUP;
+	}
+
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_running++;
+
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
+		update_load_avg(se, 1);
+		update_cfs_shares(cfs_rq);
+	}
+
+	if (!se)
+		add_nr_running(rq, 1);
+
+	hrtick_update(rq);
+}
+
+static void set_next_buddy(struct sched_entity *se);
+
+/*
+ * The dequeue_task method is called before nr_running is
+ * decreased. We remove the task from the rbtree and
+ * update the fair scheduling stats:
+ */
+static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
+{
+	struct cfs_rq *cfs_rq;
+	struct sched_entity *se = &p->se;
+	int task_sleep = flags & DEQUEUE_SLEEP;
+
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+		dequeue_entity(cfs_rq, se, flags);
+
+		/*
+		 * end evaluation on encountering a throttled cfs_rq
+		 *
+		 * note: in the case of encountering a throttled cfs_rq we will
+		 * post the final h_nr_running decrement below.
+		*/
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+		cfs_rq->h_nr_running--;
+
+		/* Don't dequeue parent if it has other entities besides us */
+		if (cfs_rq->load.weight) {
+			/*
+			 * Bias pick_next to pick a task from this cfs_rq, as
+			 * p is sleeping when it is within its sched_slice.
+			 */
+			if (task_sleep && parent_entity(se))
+				set_next_buddy(parent_entity(se));
+
+			/* avoid re-evaluating load for this entity */
+			se = parent_entity(se);
+			break;
+		}
+		flags |= DEQUEUE_SLEEP;
+	}
+
+	for_each_sched_entity(se) {
+		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_running--;
+
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+
+		update_load_avg(se, 1);
+		update_cfs_shares(cfs_rq);
+	}
+
+	if (!se)
+		sub_nr_running(rq, 1);
+
+	hrtick_update(rq);
+}
+
+#ifdef CONFIG_SMP
+/*
  * Return a low guess at the load of a migration-source cpu weighted
  * according to the scheduling class and "nice" value.
  *
-- 
2.6.4


* [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue
  2016-01-13 16:01 [RFC PATCH 0/4] sched: Improve cpu load accounting with nohz Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2016-01-13 16:01 ` [RFC PATCH 3/4] sched: Move cpu load stats functions above fair queue callbacks Frederic Weisbecker
@ 2016-01-13 16:01 ` Frederic Weisbecker
  2016-01-19 13:17   ` Peter Zijlstra
  2016-01-20  9:03   ` Thomas Gleixner
  2016-01-14 21:19 ` [RFC PATCH 0/4] sched: Improve cpu load accounting with nohz Dietmar Eggemann
  4 siblings, 2 replies; 58+ messages in thread
From: Frederic Weisbecker @ 2016-01-13 16:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Frederic Weisbecker, Byungchul Park, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter,
	Paul E . McKenney, Mike Galbraith, Rik van Riel

The full nohz CPU load is currently accounted on tick restart only.
But there are a few issues with this model:

_ On tick restart, if cpu_load[0] doesn't contain the load of the
  tickless period that just ran, we are going to account a wrong value.
  And it is very likely to be so, given that cpu_load[0] has no
  opportunity to be updated between tick stop and tick restart.

_ If the runqueue had updates that didn't trigger a tick restart, we
  are going to miss those CPU load changes.

A solution is to update the CPU load every time we enqueue or dequeue a
task in the fair runqueue, provided more than a jiffy has elapsed since
the last update.

Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1e0cb6e..763dc3b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4433,6 +4433,34 @@ void update_cpu_load_active(struct rq *this_rq)
 #endif /* CONFIG_SMP */
 
 /*
+ * In NO_HZ full mode, we need to account the CPU load without relying
+ * on the tick. We do it instead on task enqueue/dequeue time as those
+ * are the main points where CPU load changes.
+ */
+static inline void update_cpu_load_nohz_full(struct rq *rq)
+{
+#ifdef CONFIG_NO_HZ_FULL
+	unsigned long curr_jiffies;
+	unsigned long load;
+
+	if (!tick_nohz_full_cpu(cpu_of(rq)))
+		return;
+
+	curr_jiffies = READ_ONCE(jiffies);
+	load = weighted_cpuload(cpu_of(rq));
+	if (curr_jiffies == rq->last_load_update_tick) {
+		/*
+		 * At least record the current load so that we flush
+		 * it correctly on the next update.
+		 */
+		rq->cpu_load[0] = load;
+	} else {
+		__update_cpu_load_nohz(rq, curr_jiffies, load, 1);
+	}
+#endif
+}
+
+/*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
  * then put the task into the rbtree:
@@ -4477,6 +4505,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		add_nr_running(rq, 1);
 
 	hrtick_update(rq);
+	update_cpu_load_nohz_full(rq);
 }
 
 static void set_next_buddy(struct sched_entity *se);
@@ -4537,6 +4566,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		sub_nr_running(rq, 1);
 
 	hrtick_update(rq);
+	update_cpu_load_nohz_full(rq);
 }
 
 #ifdef CONFIG_SMP
-- 
2.6.4


* Re: [PATCH 2/4] sched: Consolidate nohz CPU load update code
  2016-01-13 16:01 ` [PATCH 2/4] sched: Consolidate nohz CPU load update code Frederic Weisbecker
@ 2016-01-14  2:30   ` Byungchul Park
  2016-01-19 13:13     ` Peter Zijlstra
  2016-01-14  5:18   ` Byungchul Park
  2016-02-29 11:14   ` [tip:sched/core] sched/fair: " tip-bot for Frederic Weisbecker
  2 siblings, 1 reply; 58+ messages in thread
From: Byungchul Park @ 2016-01-14  2:30 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, LKML, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, Jan 13, 2016 at 05:01:29PM +0100, Frederic Weisbecker wrote:
> Lets factorize a bit of code there. We'll even have a third user soon.
> While at it, standardize the idle update function name against the
> others.

This factorization was one of the things I thought needed to be done! :-)

> 


* Re: [PATCH 2/4] sched: Consolidate nohz CPU load update code
  2016-01-13 16:01 ` [PATCH 2/4] sched: Consolidate nohz CPU load update code Frederic Weisbecker
  2016-01-14  2:30   ` Byungchul Park
@ 2016-01-14  5:18   ` Byungchul Park
  2016-01-19 13:13     ` Peter Zijlstra
  2016-01-19 16:49     ` Frederic Weisbecker
  2016-02-29 11:14   ` [tip:sched/core] sched/fair: " tip-bot for Frederic Weisbecker
  2 siblings, 2 replies; 58+ messages in thread
From: Byungchul Park @ 2016-01-14  5:18 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, LKML, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, Jan 13, 2016 at 05:01:29PM +0100, Frederic Weisbecker wrote:
>  #ifdef CONFIG_NO_HZ_COMMON
> +static void __update_cpu_load_nohz(struct rq *this_rq,
> +				   unsigned long curr_jiffies,

Do we need to pass current jiffies as a function parameter?

> +				   unsigned long load,
> +				   int active)
> +{
> +	unsigned long pending_updates;
> +
> +	pending_updates = curr_jiffies - this_rq->last_load_update_tick;
> +	if (pending_updates) {
> +		this_rq->last_load_update_tick = curr_jiffies;
> +		/*
> +		 * In the regular NOHZ case, we were idle, this means load 0.
> +		 * In the NOHZ_FULL case, we were non-idle, we should consider
> +		 * its weighted load.
> +		 */
> +		__update_cpu_load(this_rq, load, pending_updates, active);
> +	}
> +}
> +
>  /*
>   * There is no sane way to deal with nohz on smp when using jiffies because the
>   * cpu doing the jiffies update might drift wrt the cpu doing the jiffy reading
> @@ -4467,22 +4486,15 @@ static unsigned long weighted_cpuload(const int cpu)
>   * Called from nohz_idle_balance() to update the load ratings before doing the
>   * idle balance.
>   */
> -static void update_idle_cpu_load(struct rq *this_rq)
> +static void update_cpu_load_idle(struct rq *this_rq)
>  {
> -	unsigned long curr_jiffies = READ_ONCE(jiffies);
> -	unsigned long load = weighted_cpuload(cpu_of(this_rq));
> -	unsigned long pending_updates;
> -
>  	/*
>  	 * bail if there's load or we're actually up-to-date.
>  	 */
> -	if (load || curr_jiffies == this_rq->last_load_update_tick)
> +	if (weighted_cpuload(cpu_of(this_rq)))
>  		return;
>  
> -	pending_updates = curr_jiffies - this_rq->last_load_update_tick;
> -	this_rq->last_load_update_tick = curr_jiffies;
> -
> -	__update_cpu_load(this_rq, load, pending_updates, 0);
> +	__update_cpu_load_nohz(this_rq, READ_ONCE(jiffies), 0, 0);

This question is not directly related to this patch but I am just
curious... Should we use READ_ONCE() on jiffies, which is already a
volatile type?


* Re: [RFC PATCH 0/4] sched: Improve cpu load accounting with nohz
  2016-01-13 16:01 [RFC PATCH 0/4] sched: Improve cpu load accounting with nohz Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2016-01-13 16:01 ` [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue Frederic Weisbecker
@ 2016-01-14 21:19 ` Dietmar Eggemann
  2016-01-14 21:27   ` Peter Zijlstra
  4 siblings, 1 reply; 58+ messages in thread
From: Dietmar Eggemann @ 2016-01-14 21:19 UTC (permalink / raw)
  To: Frederic Weisbecker, Peter Zijlstra
  Cc: LKML, Byungchul Park, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

Hi Frederic,

On 13/01/16 16:01, Frederic Weisbecker wrote:
> Hi,
>
> Most of the remaining nohz full work is about making the scheduler
> resilient to full dynticks (such that we can remove the residual 1Hz
> tick one day). Byungchul Park started interesting work toward that with
> cpu load updates in nohz full. So here is one more step forward.
>
> Patches 1 and 2 concern both dyntick-idle and dyntick-full. The rest
> is rather about full dynticks. Note that this isn't complete support for
> cpu load in nohz full; we still need to think about a way to make
> target_load() able to return up-to-date values on full dynticks targets.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
>       sched/core
>
> HEAD: f1d7d0f5382be3819490a859313f692f142dfb74
>
> Thanks,
>       Frederic
> ---
>
> Frederic Weisbecker (4):
>       sched: Don't account tickless CPU load on tick
>       sched: Consolidate nohz CPU load update code
>       sched: Move cpu load stats functions above fair queue callbacks
>       sched: Upload nohz full CPU load on task enqueue/dequeue
>
>
>  kernel/sched/fair.c | 306 ++++++++++++++++++++++++++++++----------------------
>  1 file changed, 175 insertions(+), 131 deletions(-)
>

I noticed while testing these patches on a NO_HZ_FULL cpu that the
rq->cpu_load[] values can be abnormally high. This happens w/ and w/o your
patches, but it happens way more w/ the patches applied.

It might be related to commit 59543275488d ("sched/fair: Prepare
__update_cpu_load() to handle active tickless"); at least the following
patch cures it for me.

-- Dietmar

-- >8 --

Subject: [PATCH] sched/fair: Avoid unsigned underflow in __update_cpu_load()

tickless_load, which is rq->cpu_load[0] in the active case, can be
greater than rq->cpu_load[i] (0 < i < CPU_LOAD_IDX_MAX), which then
leads to an unsigned underflow when calculating old_load.

In the NO_HZ_FULL case (tick_nohz_full_update_tick() ->
tick_nohz_restart_sched_tick() -> update_cpu_load_nohz() ->
__update_cpu_load()) with pending_updates > 1, rq->cpu_load[i]
(0 < i < CPU_LOAD_IDX_MAX) can then end up with abnormally high values.

Fix this by assigning the difference between rq->cpu_load[i] and
tickless_load to old_load only if the former is greater; otherwise set
it to 0. Also bail out of decay_load_missed() early if it is called
with load == 0.

E.g. running a pinned 50% task (w/ heavy tracing) on a cpu in
NO_HZ_FULL mode showed these max values for rq->cpu_load w/o
this patch:

max(rq->cpu_load[0]): 738
max(rq->cpu_load[1]): 626
max(rq->cpu_load[2]): 10133099161584019
max(rq->cpu_load[3]): 42361983994954023
max(rq->cpu_load[4]): 80220368362537147

W/ this patch, the values are back to normal:

max(rq->cpu_load[0]): 769
max(rq->cpu_load[1]): 618
max(rq->cpu_load[2]): 607
max(rq->cpu_load[3]): 602
max(rq->cpu_load[4]): 599

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9deda2ac319f..4bf8f7c2c8b7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4276,7 +4276,7 @@ decay_load_missed(unsigned long load, unsigned long missed_updates, int idx)
 {
        int j = 0;

-       if (!missed_updates)
+       if (!load || !missed_updates)
                return load;

        if (missed_updates >= degrade_zero_ticks[idx])
@@ -4346,7 +4346,10 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,

                /* scale is effectively 1 << i now, and >> i divides by scale */

-               old_load = this_rq->cpu_load[i] - tickless_load;
+               if (this_rq->cpu_load[i] > tickless_load)
+                       old_load = this_rq->cpu_load[i] - tickless_load;
+               else
+                       old_load = 0;
                old_load = decay_load_missed(old_load, pending_updates - 1, i);
                old_load += tickless_load;
                new_load = this_load;
--
1.9.1
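
The wrap-around itself is easy to see in isolation. A minimal userspace
sketch (the values are merely picked to resemble the cpu_load traces later
in this thread; a 64-bit unsigned long is assumed):

#include <stdio.h>

int main(void)
{
	/* roughly rq->cpu_load[2] vs. tickless_load = cpu_load[0] */
	unsigned long cpu_load_i = 0x32;
	unsigned long tickless_load = 0xc2;

	/* unsigned subtraction wraps instead of going negative ... */
	unsigned long old_load = cpu_load_i - tickless_load;

	/*
	 * ... printing 0xffffffffffffff70: the kind of "abnormally high"
	 * value that then gets decayed and written back into
	 * rq->cpu_load[].
	 */
	printf("old_load = %#lx\n", old_load);
	return 0;
}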


* Re: [RFC PATCH 0/4] sched: Improve cpu load accounting with nohz
  2016-01-14 21:19 ` [RFC PATCH 0/4] sched: Improve cpu load accounting with nohz Dietmar Eggemann
@ 2016-01-14 21:27   ` Peter Zijlstra
  2016-01-14 22:23     ` Dietmar Eggemann
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-14 21:27 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Frederic Weisbecker, LKML, Byungchul Park, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter,
	Paul E . McKenney, Mike Galbraith, Rik van Riel

On Thu, Jan 14, 2016 at 09:19:00PM +0000, Dietmar Eggemann wrote:
> @@ -4346,7 +4346,10 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
> 
>                 /* scale is effectively 1 << i now, and >> i divides by scale */
> 
> -               old_load = this_rq->cpu_load[i] - tickless_load;
> +               if (this_rq->cpu_load[i] > tickless_load)
> +                       old_load = this_rq->cpu_load[i] - tickless_load;
> +               else
> +                       old_load = 0;

Yeah, yuck. That'd go bad quick.


* Re: [RFC PATCH 0/4] sched: Improve cpu load accounting with nohz
  2016-01-14 21:27   ` Peter Zijlstra
@ 2016-01-14 22:23     ` Dietmar Eggemann
  2016-01-15  7:07       ` Byungchul Park
  0 siblings, 1 reply; 58+ messages in thread
From: Dietmar Eggemann @ 2016-01-14 22:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, Byungchul Park, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter,
	Paul E . McKenney, Mike Galbraith, Rik van Riel

On 01/14/2016 09:27 PM, Peter Zijlstra wrote:
> On Thu, Jan 14, 2016 at 09:19:00PM +0000, Dietmar Eggemann wrote:
>> @@ -4346,7 +4346,10 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>>
>>                  /* scale is effectively 1 << i now, and >> i divides by scale */
>>
>> -               old_load = this_rq->cpu_load[i] - tickless_load;
>> +               if (this_rq->cpu_load[i] > tickless_load)
>> +                       old_load = this_rq->cpu_load[i] - tickless_load;
>> +               else
>> +                       old_load = 0;
>
> Yeah, yuck. That'd go bad quick.
>

... because I set it to 0? But after the decay function we add
tickless_load back to old_load. Maybe in the case tickless_load >
this_rq->cpu_load[i] we should decay this_rq->cpu_load[i] and not add
tickless_load afterwards.




* Re: [RFC PATCH 0/4] sched: Improve cpu load accounting with nohz
  2016-01-14 22:23     ` Dietmar Eggemann
@ 2016-01-15  7:07       ` Byungchul Park
  2016-01-15 16:56         ` Dietmar Eggemann
  2016-02-29 11:14         ` [tip:sched/core] sched/fair: Avoid using decay_load_missed() with a negative value tip-bot for Byungchul Park
  0 siblings, 2 replies; 58+ messages in thread
From: Byungchul Park @ 2016-01-15  7:07 UTC (permalink / raw)
  To: Dietmar Eggemann, perterz
  Cc: Peter Zijlstra, Frederic Weisbecker, LKML, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter,
	Paul E . McKenney, Mike Galbraith, Rik van Riel

On Thu, Jan 14, 2016 at 10:23:46PM +0000, Dietmar Eggemann wrote:
> On 01/14/2016 09:27 PM, Peter Zijlstra wrote:
> >On Thu, Jan 14, 2016 at 09:19:00PM +0000, Dietmar Eggemann wrote:
> >>@@ -4346,7 +4346,10 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
> >>
> >>                 /* scale is effectively 1 << i now, and >> i divides by scale */
> >>
> >>-               old_load = this_rq->cpu_load[i] - tickless_load;
> >>+               if (this_rq->cpu_load[i] > tickless_load)
> >>+                       old_load = this_rq->cpu_load[i] - tickless_load;
> >>+               else
> >>+                       old_load = 0;
> >
> >Yeah, yuck. That'd go bad quick.
> >
> 
> ... because I set it to 0? But after the decay function we add
> tickless_load to old_load. Maybe in case tickless_load >
> this_rq->cpu_load[i] we decay this_rq->cpu_load[i] and do not add
> tickless_load afterwards.
> 

I re-checked the equation I expanded and fortunately found it had no
problem. I think there are several ways to do it correctly. That is,

option 1. decay the absolute value with decay_load_missed() and adjust
the sign.

option 2. make decay_load_missed() able to handle negative values.

option 3. refer to the patch below. I think this option is the best.
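
A sketch of the algebra behind option 3, assuming decay_load_missed() is
(modulo rounding) a plain multiplication by a decay factor and therefore
distributes over subtraction:

	old_load = decay(cpu_load[i] - tickless_load) + tickless_load
	         = decay(cpu_load[i]) - decay(tickless_load) + tickless_load

Since decay(tickless_load) <= tickless_load, the right-hand side is
decay(cpu_load[i]) plus a non-negative term, so the final value cannot go
negative, and the transient unsigned wrap in the middle cancels out once
tickless_load is added back.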

-----8<-----

From ba3d3355fcce51c901376d268206f58a7d0e4214 Mon Sep 17 00:00:00 2001
From: Byungchul Park <byungchul.park@lge.com>
Date: Fri, 15 Jan 2016 15:58:09 +0900
Subject: [PATCH] sched/fair: prevent using decay_load_missed() with a negative
 value

decay_load_missed() cannot handle negative values, so we need to avoid
calling it with one.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
---
 kernel/sched/fair.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8dde8b6..3f08d75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4443,8 +4443,14 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
 
 		/* scale is effectively 1 << i now, and >> i divides by scale */
 
-		old_load = this_rq->cpu_load[i] - tickless_load;
+		old_load = this_rq->cpu_load[i];
 		old_load = decay_load_missed(old_load, pending_updates - 1, i);
+		old_load -= decay_load_missed(tickless_load, pending_updates - 1, i);
+		/*
+		 * old_load can never be a negative value because a decayed
+		 * tickless_load cannot be greater than the original
+		 * tickless_load.
+		 */
 		old_load += tickless_load;
 		new_load = this_load;
 		/*
-- 
1.9.1




* Re: [RFC PATCH 0/4] sched: Improve cpu load accounting with nohz
  2016-01-15  7:07       ` Byungchul Park
@ 2016-01-15 16:56         ` Dietmar Eggemann
  2016-01-18  0:23           ` Byungchul Park
  2016-01-19 13:04           ` Peter Zijlstra
  2016-02-29 11:14         ` [tip:sched/core] sched/fair: Avoid using decay_load_missed() with a negative value tip-bot for Byungchul Park
  1 sibling, 2 replies; 58+ messages in thread
From: Dietmar Eggemann @ 2016-01-15 16:56 UTC (permalink / raw)
  To: Byungchul Park, perterz
  Cc: Peter Zijlstra, Frederic Weisbecker, LKML, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter,
	Paul E . McKenney, Mike Galbraith, Rik van Riel

Hi Byungchul,

On 15/01/16 07:07, Byungchul Park wrote:
> On Thu, Jan 14, 2016 at 10:23:46PM +0000, Dietmar Eggemann wrote:
>> On 01/14/2016 09:27 PM, Peter Zijlstra wrote:
>>> On Thu, Jan 14, 2016 at 09:19:00PM +0000, Dietmar Eggemann wrote:
>>>> @@ -4346,7 +4346,10 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,

[...]

> 
> I re-checked the equation I expanded and fortunately found it had no
> problem. I think there are several ways to do it correctly. That is,
> 
> option 1. decay the absolute value with decay_load_missed() and adjust
> the sign.
> 
> option 2. make decay_load_missed() able to handle negative values.
> 
> option 3. refer to the patch below. I think this option is the best.
> 
> -----8<-----
> 
> From ba3d3355fcce51c901376d268206f58a7d0e4214 Mon Sep 17 00:00:00 2001
> From: Byungchul Park <byungchul.park@lge.com>
> Date: Fri, 15 Jan 2016 15:58:09 +0900
> Subject: [PATCH] sched/fair: prevent using decay_load_missed() with a negative
>  value
> 
> decay_load_missed() cannot handle negative values, so we need to avoid
> calling it with one.
> 
> Signed-off-by: Byungchul Park <byungchul.park@lge.com>
> ---
>  kernel/sched/fair.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8dde8b6..3f08d75 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4443,8 +4443,14 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>  
>  		/* scale is effectively 1 << i now, and >> i divides by scale */
>  
> -		old_load = this_rq->cpu_load[i] - tickless_load;
> +		old_load = this_rq->cpu_load[i];
>  		old_load = decay_load_missed(old_load, pending_updates - 1, i);
> +		old_load -= decay_load_missed(tickless_load, pending_updates - 1, i);
> +		/*
> +		 * old_load can never be a negative value because a decayed
> +		 * tickless_load cannot be greater than the original
> +		 * tickless_load.
> +		 */
>  		old_load += tickless_load;
>  		new_load = this_load;
>  		/*
>

So in this case you want to decay old_load and add (tickless_load -
decay(tickless_load)) on top. IMHO, this makes sense.


(w/ your patch and cpu w/ NO_HZ_FULL)

update_cpu_load_nohz: cpu=3 jiffies=4294935491 this_rq->last_load_update_tick=4294935489 pending_updates=2 active=1 load=0x114
__update_cpu_load: cpu=3 this_load=0x114 pending_updates=2 tickless_load=0xc2
__update_cpu_load: cpu=3 this_rq->cpu_load[0 .. 4] = [0xc2 0x62 0x32 0x1d 0x14]
__update_cpu_load: cpu=3 1. old_load=0x62
__update_cpu_load: cpu=3 2.1 old_load=0x31
__update_cpu_load: cpu=3 2.2 old_load=0xffffffffffffffd0   <-- after decaying tickless_load
__update_cpu_load: cpu=3 3. old_load=0x92                  <-- after adding tickless_load
__update_cpu_load: cpu=3 1. new_load=0x92
__update_cpu_load: cpu=3 2. new_load=0x92
__update_cpu_load: cpu=3 this_rq->cpu_load[1]=0xd3
...
update_cpu_load_active: cpu=3 this_rq->last_load_update_tick=4294935491 pending_updates=1 active=1 load=0x13e
__update_cpu_load: cpu=3 this_load=0x13e pending_updates=1 tickless_load=0x114
__update_cpu_load: cpu=3 this_rq->cpu_load[0 .. 4] = [0x114 0xd3 0x86 0x4f 0x2f]
...
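
For the cpu_load[1] update above, the numbers can be checked by hand,
assuming decay_load_missed(load, 1, 1) reduces to load >> 1 (the idx == 1
shortcut):

	1.  : old_load = cpu_load[1]            = 0x62
	2.1 : 0x62 >> 1                         = 0x31
	2.2 : 0x31 - (0xc2 >> 1) = 0x31 - 0x61    (wraps to 0xffffffffffffffd0)
	3.  : ... + 0xc2                        = 0x92

The wrapped intermediate cancels once tickless_load is added back, and
cpu_load[1] then becomes (0x92 + 0x114 + 1) >> 1 = 0xd3 through the
round-up-if-increasing averaging.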

Another point ... 'active=1' (the function header comment "@active: !0
for NOHZ_FULL" is a little bit misleading) is also true when
__update_cpu_load() is called from update_cpu_load_active(). In this case
tickless_load wouldn't have to be set at all, since pending_updates is 1
and decay_load_missed() can handle that by bailing when missed_updates == 0.

Couldn't we set tickless_load only in this case:

unsigned long tickless_load = (active && pending_updates > 1) ?
this_rq->cpu_load[0] : 0;

update_cpu_load_nohz() can still call with pending_updates=1 and
active=1, but then we don't have to decay.


* Re: [RFC PATCH 0/4] sched: Improve cpu load accounting with nohz
  2016-01-15 16:56         ` Dietmar Eggemann
@ 2016-01-18  0:23           ` Byungchul Park
  2016-01-19 13:04           ` Peter Zijlstra
  1 sibling, 0 replies; 58+ messages in thread
From: Byungchul Park @ 2016-01-18  0:23 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Frederic Weisbecker, LKML, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter,
	Paul E . McKenney, Mike Galbraith, Rik van Riel

On Fri, Jan 15, 2016 at 04:56:36PM +0000, Dietmar Eggemann wrote:
> Another point ... 'active=1' (the function header comment "@active: !0
> for NOHZ_FULL" is a little bit misleading) is also true when
> __update_cpu_load() is called from update_cpu_load_active(). In this case
> tickless_load wouldn't have to be set at all, since pending_updates is 1
> and decay_load_missed() can handle that by bailing when missed_updates == 0.

Hello Dietmar.

> 
> Couldn't we set tickless_load only in this case:
> 
> unsigned long tickless_load = (active && pending_updates > 1) ?
> this_rq->cpu_load[0] : 0;

IMHO, this looks better even though it does not change much.

Thank you,
Byungchul


* Re: [RFC PATCH 0/4] sched: Improve cpu load accounting with nohz
  2016-01-15 16:56         ` Dietmar Eggemann
  2016-01-18  0:23           ` Byungchul Park
@ 2016-01-19 13:04           ` Peter Zijlstra
  2016-01-20  0:48             ` Byungchul Park
  1 sibling, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-19 13:04 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Byungchul Park, perterz, Frederic Weisbecker, LKML,
	Chris Metcalf, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Paul E . McKenney, Mike Galbraith,
	Rik van Riel

On Fri, Jan 15, 2016 at 04:56:36PM +0000, Dietmar Eggemann wrote:
> > Couldn't we set tickless_load only in this case:
> 
> unsigned long tickless_load = (active && pending_updates > 1) ?
> this_rq->cpu_load[0] : 0;
> 
> > update_cpu_load_nohz() can still call with pending_updates=1 and
> > active=1, but then we don't have to decay.

decay_load_missed() has an early bail for !missed, which will be tickled
with pending_updates == 1.

What I was thinking of doing however is:

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4445,13 +4445,15 @@ static void __update_cpu_load(struct rq
 
 		old_load = this_rq->cpu_load[i];
 		old_load = decay_load_missed(old_load, pending_updates - 1, i);
-		old_load -= decay_load_missed(tickless_load, pending_updates - 1, i);
-		/*
-		 * old_load can never be a negative value because a decayed
-		 * tickless_load cannot be greater than the original
-		 * tickless_load.
-		 */
-		old_load += tickless_load;
+		if (tickless_load) {
+			old_load -= decay_load_missed(tickless_load, pending_updates - 1, i);
+			/*
+			 * old_load can never be a negative value because a
+			 * decayed tickless_load cannot be greater than the
+			 * original tickless_load.
+			 */
+			old_load += tickless_load;
+		}
 		new_load = this_load;
 		/*
 		 * Round up the averaging division if load is increasing. This


Since regardless of the pending_updates, none of that makes sense if
!tickless_load.


* Re: [PATCH 1/4] sched: Don't account tickless CPU load on tick
  2016-01-13 16:01 ` [PATCH 1/4] sched: Don't account tickless CPU load on tick Frederic Weisbecker
@ 2016-01-19 13:08   ` Peter Zijlstra
  2016-01-19 16:22     ` Frederic Weisbecker
  2016-01-20  8:42     ` Thomas Gleixner
  0 siblings, 2 replies; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-19 13:08 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Byungchul Park, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, Jan 13, 2016 at 05:01:28PM +0100, Frederic Weisbecker wrote:
> The cpu load update on tick doesn't care about dynticks and as such is
> buggy when occurring on nohz ticks (including idle ticks), as it resets
> the jiffies snapshot that was recorded on nohz entry. We end up ignoring
> the potentially long tickless load that preceded the tick.

I don't get it, how can we call scheduler_tick() while
tick_nohz_tick_stopped() ?


* Re: [PATCH 2/4] sched: Consolidate nohz CPU load update code
  2016-01-14  5:18   ` Byungchul Park
@ 2016-01-19 13:13     ` Peter Zijlstra
  2016-01-19 16:49     ` Frederic Weisbecker
  1 sibling, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-19 13:13 UTC (permalink / raw)
  To: Byungchul Park
  Cc: Frederic Weisbecker, LKML, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Thu, Jan 14, 2016 at 02:18:40PM +0900, Byungchul Park wrote:
> > +	__update_cpu_load_nohz(this_rq, READ_ONCE(jiffies), 0, 0);
> 
> This question is not directly related to this patch but I am just
> curious... Should we use READ_ONCE() on jiffies, which is already a
> volatile type?

I'd say so, if only because I forever forget that jiffies is declared
volatile and READ_ONCE() makes the intent explicit.
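
A minimal sketch of the distinction (nothing here is specific to this
patch set):

	/*
	 * Both reads compile to a single load of jiffies today. The
	 * second keeps doing so even if jiffies ever loses its volatile
	 * qualifier, and it flags the lockless read to reviewers and
	 * tooling.
	 */
	unsigned long a = jiffies;             /* relies on the volatile declaration */
	unsigned long b = READ_ONCE(jiffies);  /* same load, intent made explicit */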


* Re: [PATCH 2/4] sched: Consolidate nohz CPU load update code
  2016-01-14  2:30   ` Byungchul Park
@ 2016-01-19 13:13     ` Peter Zijlstra
  2016-01-20  0:51       ` Byungchul Park
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-19 13:13 UTC (permalink / raw)
  To: Byungchul Park
  Cc: Frederic Weisbecker, LKML, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Thu, Jan 14, 2016 at 11:30:50AM +0900, Byungchul Park wrote:
> On Wed, Jan 13, 2016 at 05:01:29PM +0100, Frederic Weisbecker wrote:
> > Let's factorize a bit of code here. We'll even have a third user soon.
> > While at it, standardize the idle update function name with the others.
> 
> This factorization was one of the things I thought needed to be done! :-)

I shall interpret this as an Acked-by from you.


* Re: [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue
  2016-01-13 16:01 ` [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue Frederic Weisbecker
@ 2016-01-19 13:17   ` Peter Zijlstra
  2016-01-19 17:03     ` Frederic Weisbecker
  2016-01-20  9:03   ` Thomas Gleixner
  1 sibling, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-19 13:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Byungchul Park, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, Jan 13, 2016 at 05:01:31PM +0100, Frederic Weisbecker wrote:
> The full nohz CPU load is currently accounted on tick restart only.
> But there are a few issues with this model:
> 
> _ On tick restart, if cpu_load[0] doesn't contain the load of the
>   tickless period that just ran, we are going to account a wrong value.
>   And it is very likely to be so, given that cpu_load[0] has no
>   opportunity to be updated between tick stop and tick restart.
> 
> _ If the runqueue had updates that didn't trigger a tick restart, we
>   are going to miss those CPU load changes.
> 
> A solution is to update the CPU load every time we enqueue or dequeue a
> task in the fair runqueue, provided more than a jiffy has elapsed since
> the last update.

Would not a much better solution be to do this remotely instead of from
one of the hottest functions in the scheduler?


* Re: [PATCH 1/4] sched: Don't account tickless CPU load on tick
  2016-01-19 13:08   ` Peter Zijlstra
@ 2016-01-19 16:22     ` Frederic Weisbecker
  2016-01-19 18:56       ` Peter Zijlstra
  2016-01-20  8:42     ` Thomas Gleixner
  1 sibling, 1 reply; 58+ messages in thread
From: Frederic Weisbecker @ 2016-01-19 16:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Byungchul Park, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Tue, Jan 19, 2016 at 02:08:57PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 13, 2016 at 05:01:28PM +0100, Frederic Weisbecker wrote:
> > The cpu load update on tick doesn't care about dynticks and as such is
> > buggy when occurring on nohz ticks (including idle ticks), as it resets
> > the jiffies snapshot that was recorded on nohz entry. We end up ignoring
> > the potentially long tickless load that preceded the tick.
> 
> I don't get it, how can we call scheduler_tick() while
> tick_nohz_tick_stopped() ?

tick_nohz_tick_stopped() (which is ts->tick_stopped == 1) doesn't actually
mean that the tick is really stopped. It just means that the tick fires only
when it's really needed (timer list expired, RCU stuff, irq_work, ...). So
if you're lucky, the tick is really stopped because there is nothing that needs
the tick in the future; otherwise (and it's probably most of the time) there is
some tick programmed some time ahead in the future.

So that's the kind of tick that can break the accounting of a whole long tickless
load that just ran.


* Re: [PATCH 2/4] sched: Consolidate nohz CPU load update code
  2016-01-14  5:18   ` Byungchul Park
  2016-01-19 13:13     ` Peter Zijlstra
@ 2016-01-19 16:49     ` Frederic Weisbecker
  2016-01-20  1:41       ` Byungchul Park
  1 sibling, 1 reply; 58+ messages in thread
From: Frederic Weisbecker @ 2016-01-19 16:49 UTC (permalink / raw)
  To: Byungchul Park
  Cc: Peter Zijlstra, LKML, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Thu, Jan 14, 2016 at 02:18:40PM +0900, Byungchul Park wrote:
> On Wed, Jan 13, 2016 at 05:01:29PM +0100, Frederic Weisbecker wrote:
> >  #ifdef CONFIG_NO_HZ_COMMON
> > +static void __update_cpu_load_nohz(struct rq *this_rq,
> > +				   unsigned long curr_jiffies,
> 
> Do we need to pass current jiffies as a function parameter?

I guess we don't; I just wasn't sure about the possible overhead of READ_ONCE().

> 
> > +				   unsigned long load,
> > +				   int active)
> > +{
> > +	unsigned long pending_updates;
> > +
> > +	pending_updates = curr_jiffies - this_rq->last_load_update_tick;
> > +	if (pending_updates) {
> > +		this_rq->last_load_update_tick = curr_jiffies;
> > +		/*
> > +		 * In the regular NOHZ case, we were idle, this means load 0.
> > +		 * In the NOHZ_FULL case, we were non-idle, we should consider
> > +		 * its weighted load.
> > +		 */
> > +		__update_cpu_load(this_rq, load, pending_updates, active);
> > +	}
> > +}


* Re: [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue
  2016-01-19 13:17   ` Peter Zijlstra
@ 2016-01-19 17:03     ` Frederic Weisbecker
  2016-01-20  9:09       ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Frederic Weisbecker @ 2016-01-19 17:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Byungchul Park, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Tue, Jan 19, 2016 at 02:17:08PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 13, 2016 at 05:01:31PM +0100, Frederic Weisbecker wrote:
> > The full nohz CPU load is currently accounted on tick restart only.
> > But there are a few issues with this model:
> > 
> > _ On tick restart, if cpu_load[0] doesn't contain the load of the
> >   tickless period that just ran, we are going to account a wrong value.
> >   And it is very likely to be so, given that cpu_load[0] has no
> >   opportunity to be updated between tick stop and tick restart.
> > 
> > _ If the runqueue had updates that didn't trigger a tick restart, we
> >   are going to miss those CPU load changes.
> > 
> > A solution is to update the CPU load every time we enqueue or dequeue a
> > task in the fair runqueue, provided more than a jiffy has elapsed since
> > the last update.
> 
> Would not a much better solution be to do this remotely instead of from
> one of the hottest functions in the scheduler?

The problem with doing this remotely is that we can miss past cpu loads if
there were several enqueue/dequeue operations happening while tickless.

For example: CPU 1 runs sched entity A with a load of 5 (purely theoretical)
for 5 seconds, then A sleeps, entity B runs with a load of 1, and then CPU 2
updates the load of CPU 1 remotely. cpu_load[0] will be accurate because it's
the current load of CPU 1 (which is the load of entity B), but the other indexes won't take
the decayed load of entity A into account.
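
Schematically (a toy timeline, not real units):

	CPU 1:  |---- A runs, load 5 ----|---- B runs, load 1 ----|
	                                                          ^
	                                   CPU 2 updates CPU 1 here:
	                                   cpu_load[0] = 1 (correct),
	                                   cpu_load[1..4] never saw A's load 5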

Now we can indeed remove the queue-time local update and rely only on remote
updates when needed, if we can live with slightly lower precision in
target_load() and source_load().


* Re: [PATCH 1/4] sched: Don't account tickless CPU load on tick
  2016-01-19 16:22     ` Frederic Weisbecker
@ 2016-01-19 18:56       ` Peter Zijlstra
  2016-01-19 22:33         ` Frederic Weisbecker
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-19 18:56 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Byungchul Park, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Tue, Jan 19, 2016 at 05:22:11PM +0100, Frederic Weisbecker wrote:
> On Tue, Jan 19, 2016 at 02:08:57PM +0100, Peter Zijlstra wrote:
> > On Wed, Jan 13, 2016 at 05:01:28PM +0100, Frederic Weisbecker wrote:
> > > The cpu load update on tick doesn't care about dynticks and as such is
> > > buggy when occurring on nohz ticks (including idle ticks), as it resets
> > > the jiffies snapshot that was recorded on nohz entry. We end up ignoring
> > > the potentially long tickless load that preceded the tick.
> > 
> > I don't get it, how can we call scheduler_tick() while
> > tick_nohz_tick_stopped() ?
> 
> tick_nohz_tick_stopped() (which is ts->tick_stopped == 1) doesn't actually
> mean that the tick is really stopped. It just means that the tick fires only
> when it's really needed (timer list expired, RCU stuff, irq_work, ...).

That's insane and broken. Fix _that_.

If RCU, irq_work etc. need the tick, do not stop the tick.


* Re: [PATCH 1/4] sched: Don't account tickless CPU load on tick
  2016-01-19 18:56       ` Peter Zijlstra
@ 2016-01-19 22:33         ` Frederic Weisbecker
  2016-01-20  5:43           ` Byungchul Park
  0 siblings, 1 reply; 58+ messages in thread
From: Frederic Weisbecker @ 2016-01-19 22:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Byungchul Park, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Tue, Jan 19, 2016 at 07:56:47PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 19, 2016 at 05:22:11PM +0100, Frederic Weisbecker wrote:
> > On Tue, Jan 19, 2016 at 02:08:57PM +0100, Peter Zijlstra wrote:
> > > On Wed, Jan 13, 2016 at 05:01:28PM +0100, Frederic Weisbecker wrote:
> > > > The cpu load update on tick doesn't care about dynticks and as such is
> > > > buggy when occurring on nohz ticks (including idle ticks), as it resets
> > > > the jiffies snapshot that was recorded on nohz entry. We end up ignoring
> > > > the potentially long tickless load that preceded the tick.
> > > 
> > > I don't get it, how can we call scheduler_tick() while
> > > tick_nohz_tick_stopped() ?
> > 
> > tick_nohz_tick_stopped() (which is ts->tick_stopped == 1) doesn't actually
> > mean that the tick is really stopped. It just means that the tick fires only
> > when it's really needed (timer list expired, RCU stuff, irq_work, ...).
> 
> That's insane and broken. Fix _that_.
> 
> If RCU, irq_work etc. need the tick, do not stop the tick.

This is not the first time we have this conversation :-)

RCU/irq_work/foo_needs_tick() are treated just like any timer that expires in one
tick, although RCU is somewhat more tunable there.

And timers that expire in 1 jiffy can be treated in two ways:

* If the tick is still periodic (ts->tick_stopped == 0), we don't stop the
  tick: we don't enter dynticks mode.

* If the tick is already stopped (or rather in dynticks mode to be more exact:
  ts->tick_stopped == 1) we just program the tick one jiffy ahead. This is
  an optimization and a simplification: if we were to restart the tick every time
  we see a tick one jiffy ahead in the middle of a dynticks frame, we would have
  to perform all the accounting in tick_nohz_idle_exit() as well, including
  update_cpu_load_nohz() that locks rq->lock. Having a bunch of jiffies subsequently
  ticking in the middle of a dynticks frame is a common and frequent scenario and
  removing that optimization would have a bad visible impact.
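
In code, the distinction boils down to something like this (a simplified
sketch with assumed variable names; the real logic in
tick_nohz_stop_sched_tick() is more involved):

	/* Sketch only: decide what to do with a timer one jiffy ahead */
	delta = next_tick_event - now;
	if (delta <= tick_period) {
		if (!ts->tick_stopped)
			return;		/* periodic mode: keep the tick as is */
		/* dynticks mode: just program the tick one jiffy ahead */
	}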

Certainly the issue here is that "tick_stopped" can be misunderstood. ts->dynticks_active
would be better but we already have ts->nohz_active which reflects something very
different. I guess we need a cascading rename.

Anyway, whether the next tick is one jiffy ahead or more doesn't really matter here.
The issue is that ticks can fire while dynticks-idle or dynticks-busy and
update_cpu_load_active() treats them in a broken way.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/4] sched: Improve cpu load accounting with nohz
  2016-01-19 13:04           ` Peter Zijlstra
@ 2016-01-20  0:48             ` Byungchul Park
  2016-01-20 13:04               ` Dietmar Eggemann
  0 siblings, 1 reply; 58+ messages in thread
From: Byungchul Park @ 2016-01-20  0:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, perterz, Frederic Weisbecker, LKML,
	Chris Metcalf, Thomas Gleixner, Luiz Capitulino,
	Christoph Lameter, Paul E . McKenney, Mike Galbraith,
	Rik van Riel

On Tue, Jan 19, 2016 at 02:04:57PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 15, 2016 at 04:56:36PM +0000, Dietmar Eggemann wrote:
> > Couldn't we set tickless_load only in case:
> > 
> > unsigned long tickless_load = (active && pending_updates > 1) ?
> > this_rq->cpu_load[0] : 0;
> > 
> > Even though update_cpu_load_nohz() can call with pending_updates=1 and
> > active=1 but then we don't have to decay.
> 
> decay_load_missed() has an early bail for !missed, which will be tickled
> with pending_updates == 1.

I think the way of giving decay_load_missed() an early bail for
*!load*, which Dietmar's proposal did, is also good. And peterz's
proposal avoiding an unnecessary "add" operation is also
good. Whatever..

> 
> What I was thinking of doing however is:
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4445,13 +4445,15 @@ static void __update_cpu_load(struct rq
>  
>  		old_load = this_rq->cpu_load[i];
>  		old_load = decay_load_missed(old_load, pending_updates - 1, i);
> -		old_load -= decay_load_missed(tickless_load, pending_updates - 1, i);
> -		/*
> -		 * old_load can never be a negative value because a decayed
> -		 * tickless_load cannot be greater than the original
> -		 * tickless_load.
> -		 */
> -		old_load += tickless_load;
> +		if (tickless_load) {

And additionally, in this approach, why not do:

                if (tickless_load && pending_updates > 1)

> +			old_load -= decay_load_missed(tickless_load, pending_updates - 1, i);
> +			/*
> +			 * old_load can never be a negative value because a
> +			 * decayed tickless_load cannot be greater than the
> +			 * original tickless_load.
> +			 */
> +			old_load += tickless_load;
> +		}
>  		new_load = this_load;
>  		/*
>  		 * Round up the averaging division if load is increasing. This
> 
> 
> Since regardless of the pending_updates, none of that makes sense if
> !tickless_load.

None of that makes sense if !(pending_updates - 1) either. In that case,
it becomes:

                        old_load -= tickless_load;
                        old_load += tickless_load;
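
IOW, the whole block only makes sense when both conditions hold; a sketch
combining both guards:

		if (tickless_load && pending_updates > 1) {
			old_load -= decay_load_missed(tickless_load,
						      pending_updates - 1, i);
			old_load += tickless_load;
		}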

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/4] sched: Consolidate nohz CPU load update code
  2016-01-19 13:13     ` Peter Zijlstra
@ 2016-01-20  0:51       ` Byungchul Park
  0 siblings, 0 replies; 58+ messages in thread
From: Byungchul Park @ 2016-01-20  0:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Tue, Jan 19, 2016 at 02:13:58PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 14, 2016 at 11:30:50AM +0900, Byungchul Park wrote:
> > On Wed, Jan 13, 2016 at 05:01:29PM +0100, Frederic Weisbecker wrote:
> > > Lets factorize a bit of code there. We'll even have a third user soon.
> > > While at it, standardize the idle update function name against the
> > > others.
> > 
This factorization was one of the things I thought needed to be done! :-)
> 
> I shall interpret this as an Acked-by from you.

Thumbs up!

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/4] sched: Consolidate nohz CPU load update code
  2016-01-19 16:49     ` Frederic Weisbecker
@ 2016-01-20  1:41       ` Byungchul Park
  0 siblings, 0 replies; 58+ messages in thread
From: Byungchul Park @ 2016-01-20  1:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, LKML, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Tue, Jan 19, 2016 at 05:49:34PM +0100, Frederic Weisbecker wrote:
> On Thu, Jan 14, 2016 at 02:18:40PM +0900, Byungchul Park wrote:
> > On Wed, Jan 13, 2016 at 05:01:29PM +0100, Frederic Weisbecker wrote:
> > >  #ifdef CONFIG_NO_HZ_COMMON
> > > +static void __update_cpu_load_nohz(struct rq *this_rq,
> > > +				   unsigned long curr_jiffies,
> > 
> > Do we need to pass current jiffies as a function parameter?
> 
> I guess we don't, I just wasn't much sure of the possible overhead of READ_ONCE()

Ah. But I think passing an additional argument can cause additional
overhead, too, e.g. an additional store/load on the stack. But I am not sure
which one is larger; it depends on the architecture and ABI.
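
The alternative would look like this (a sketch, assuming the helper reads
jiffies itself rather than taking it as a parameter):

	static void __update_cpu_load_nohz(struct rq *this_rq,
					   unsigned long load, int active)
	{
		/* Read the volatile jiffies once instead of passing it in */
		unsigned long curr_jiffies = READ_ONCE(jiffies);
		unsigned long pending_updates;

		pending_updates = curr_jiffies - this_rq->last_load_update_tick;
		if (pending_updates) {
			this_rq->last_load_update_tick = curr_jiffies;
			__update_cpu_load(this_rq, load, pending_updates, active);
		}
	}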

> 
> > 
> > > +				   unsigned long load,
> > > +				   int active)
> > > +{
> > > +	unsigned long pending_updates;
> > > +
> > > +	pending_updates = curr_jiffies - this_rq->last_load_update_tick;
> > > +	if (pending_updates) {
> > > +		this_rq->last_load_update_tick = curr_jiffies;
> > > +		/*
> > > +		 * In the regular NOHZ case, we were idle, this means load 0.
> > > +		 * In the NOHZ_FULL case, we were non-idle, we should consider
> > > +		 * its weighted load.
> > > +		 */
> > > +		__update_cpu_load(this_rq, load, pending_updates, active);
> > > +	}
> > > +}

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/4] sched: Don't account tickless CPU load on tick
  2016-01-19 22:33         ` Frederic Weisbecker
@ 2016-01-20  5:43           ` Byungchul Park
  2016-01-20 10:26             ` Byungchul Park
  0 siblings, 1 reply; 58+ messages in thread
From: Byungchul Park @ 2016-01-20  5:43 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, LKML, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Tue, Jan 19, 2016 at 11:33:22PM +0100, Frederic Weisbecker wrote:
> On Tue, Jan 19, 2016 at 07:56:47PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 19, 2016 at 05:22:11PM +0100, Frederic Weisbecker wrote:
> > > On Tue, Jan 19, 2016 at 02:08:57PM +0100, Peter Zijlstra wrote:
> > > > On Wed, Jan 13, 2016 at 05:01:28PM +0100, Frederic Weisbecker wrote:
> > > > > The cpu load update on tick doesn't care about dynticks and as such is
> > > > > buggy when occuring on nohz ticks (including idle ticks) as it resets
> > > > > the jiffies snapshot that was recorded on nohz entry. We eventually
> > > > > ignore the potentially long tickless load that happened before the
> > > > > tick.
> > > > 
> > > > I don't get it, how can we call scheduler_tick() while
> > > > tick_nohz_tick_stopped() ?
> > > 
> > > tick_nohz_tick_stopped() (which is ts->tick_stopped == 1) doesn't actually
> > > mean that the tick is really stopped. It just means that the tick fires only
> > > when it's really needed (timer list expired, RCU stuff, irq_work, ...).
> > 
> > That's insane and broken. Fix _that_.
> > 
> > If RCU, irq_work etc.. needs the tick, do not stop the tick.
> 
> This is not the first time we have this conversation :-)
> 
> RCU/irq_work/foo_needs_tick() are treated just like any timer that expire in one
> tick, although RCU is some more tunable there.
> 
> And timers that expire in 1 jiffy can be treated in two ways:
> 
> * If the tick is still periodic (ts->tick_stopped = 0), we don't stop the
>   tick: we don't enter dynticks mode.
> 
> * If the tick is already stopped (or rather in dynticks mode to be more exact:
>   ts->tick_stopped == 1) we just program the tick one jiffy ahead. This is
>   an optimization and a simplification, if we were to restart the tick everytime
>   we see a tick one jiffy ahead in the middle of a dynticks frame, we would have
>   to perform all the accounting in tick_nohz_idle_exit() as well, including
>   update_cpu_load_nohz() that locks rq->lock. Having a bunch of jiffies subsequently
>   ticking in the middle of a dynticks frame is a common and frequent scenario and
>   removing that optimization would have a bad visible impact.
> 
> Certainly the issue here is that "tick_stopped" can be misunderstood. ts->dynticks_active
> would be better but we already have ts->nohz_active which reflects something very
> different. I guess we need a cascading rename.
> 
> Anyway whether the next tick is one jiffy ahead or more doesn't really matter here.
> The issue is that ticks can fire while dynticks-idle or dyntick-buzy and
> update_cpu_load_active() treats them in a broken way.

It looks very tricky. I have a question: do we have to call
scheduler_tick() even while the tick is stopped? IMHO, it seems to be
OK even if we don't call it while the tick is stopped. Wrong? I mean,

---

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index bbc5d11..774adc2 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1422,7 +1422,8 @@ void update_process_times(int user_tick)
 	if (in_irq())
 		irq_work_tick();
 #endif
-	scheduler_tick();
+	if (!tick_nohz_tick_stopped())
+		scheduler_tick();
 	run_posix_cpu_timers(p);
 }
 
---

hm ???

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/4] sched: Don't account tickless CPU load on tick
  2016-01-19 13:08   ` Peter Zijlstra
  2016-01-19 16:22     ` Frederic Weisbecker
@ 2016-01-20  8:42     ` Thomas Gleixner
  2016-01-20 17:36       ` Frederic Weisbecker
  2016-01-22  8:40       ` Byungchul Park
  1 sibling, 2 replies; 58+ messages in thread
From: Thomas Gleixner @ 2016-01-20  8:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, Byungchul Park, Chris Metcalf,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Tue, 19 Jan 2016, Peter Zijlstra wrote:

> On Wed, Jan 13, 2016 at 05:01:28PM +0100, Frederic Weisbecker wrote:
> > The cpu load update on tick doesn't care about dynticks and as such is
> > buggy when occuring on nohz ticks (including idle ticks) as it resets
> > the jiffies snapshot that was recorded on nohz entry. We eventually
> > ignore the potentially long tickless load that happened before the
> > tick.
> 
> I don't get it, how can we call scheduler_tick() while
> tick_nohz_tick_stopped() ?

ts->tick_stopped merely indicates that we switched from periodic mode to
tickless mode. That's probably a misnomer, but it still has that meaning.

You really need to look at it from the history of that code, which was designed
for tickless idle. The nohz full stuff was bolted onto it.

So if we stop the tick in idle - or for that matter in full nohz - we look
ahead to when the next tick is required. That can be:

      - a timer wheel timer expiring

      - other stuff which prevents the cpu from going "tickless" like rcu,
        irqwork

So let's assume RCU and irq work are silent, but we have a timer expiring 100ms
from now; then we program the tick timer to 100ms from now. When it fires it
invokes the normal tick_sched() timer machinery:

     - timekeeping update
     - update_process_times
     - profile_tick

I have no idea why that is a problem. If update_process_times() is invoked
then it will account the elapsed time to the idle task in case of tickless
idle. In case of nohz full it should simply account the time to the task which
was busy on the cpu in user space.
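
(For reference, the shape of that path, as a heavily simplified sketch of
the tick-sched code:)

	static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
	{
		struct tick_sched *ts =
			container_of(timer, struct tick_sched, sched_timer);

		tick_sched_do_timer(ktime_get());	/* timekeeping update */
		tick_sched_handle(ts, get_irq_regs());	/* update_process_times(), profile_tick() */

		hrtimer_forward_now(timer, tick_period);
		return HRTIMER_RESTART;
	}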

The above changelog is just crap and doesn't make any sense at all. And the
patch is fixing symptoms, not the root cause.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue
  2016-01-13 16:01 ` [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue Frederic Weisbecker
  2016-01-19 13:17   ` Peter Zijlstra
@ 2016-01-20  9:03   ` Thomas Gleixner
  2016-01-20 14:31     ` Frederic Weisbecker
  1 sibling, 1 reply; 58+ messages in thread
From: Thomas Gleixner @ 2016-01-20  9:03 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, LKML, Byungchul Park, Chris Metcalf,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, 13 Jan 2016, Frederic Weisbecker wrote:
> A solution to fix this is to update the CPU load everytime we enqueue
> or dequeue a task in the fair runqueue and more than a jiffy occured
> since the last update.

That's not a solution. That's just crap.

I have been telling you for years that you need to fix that remote accounting
stuff, but no, you insist on adding more trainwrecks left and right.

> The problem with doing this remotely is that we can miss past cpu loads if
> there was several enqueue/dequeue operations happening while tickless.

That's complete bullshit.

1) How is remote accounting that happens every tick different from local
   accounting which happens every tick?

2) How do you have enqueue/dequeue operations when you are running in full
   nohz, i.e. one task is consuming 100% cpu time in user space?

I'm really tired of that tinkering. The proper solution is to make NOHZ_FULL
depend on BROKEN.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue
  2016-01-19 17:03     ` Frederic Weisbecker
@ 2016-01-20  9:09       ` Peter Zijlstra
  2016-01-20 14:54         ` Frederic Weisbecker
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-20  9:09 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Byungchul Park, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Tue, Jan 19, 2016 at 06:03:19PM +0100, Frederic Weisbecker wrote:
> On Tue, Jan 19, 2016 at 02:17:08PM +0100, Peter Zijlstra wrote:
> > On Wed, Jan 13, 2016 at 05:01:31PM +0100, Frederic Weisbecker wrote:
> > > The full nohz CPU load is currently accounted on tick restart only.
> > > But there are a few issues with this model:
> > > 
> > > _ On tick restart, if cpu_load[0] doesn't contain the load of the actual
> > >   tickless load that just ran, we are going to account a wrong value.
> > >   And it is very likely to be so given that cpu_load[0] doesn't have
> > >   an opportunity to be updated between tick stop and tick restart.
> > > 
> > > _ If the runqueue had updates that didn't trigger a tick restart, we
> > >   are going to miss those CPU load changes.
> > > 
> > > A solution to fix this is to update the CPU load everytime we enqueue
> > > or dequeue a task in the fair runqueue and more than a jiffy occured
> > > since the last update.
> > 
> > Would not a much better solution be to do this remotely instead of from
> > one of the hottest functions in the scheduler?
> 
> The problem with doing this remotely is that we can miss past cpu loads if
> there was several enqueue/dequeue operations happening while tickless.

It's a timer-based sample; it _always_ and by definition misses
intermediate state.

You can simply do:

	for_each_nohzfull_cpu(cpu) {
		struct rq *rq = rq_of(cpu);

		raw_spin_lock(&rq->lock);
		update_cpu_load_active(rq);
		raw_spin_unlock(&rq->lock);
	}

Also, since when can we have enqueues/dequeues while NOHZ_FULL? I
thought that was the 1 task 100% cpu case; there are no
enqueues/dequeues there.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/4] sched: Don't account tickless CPU load on tick
  2016-01-20  5:43           ` Byungchul Park
@ 2016-01-20 10:26             ` Byungchul Park
  2016-01-28 16:01               ` Frederic Weisbecker
  2016-02-01  6:33               ` Byungchul Park
  0 siblings, 2 replies; 58+ messages in thread
From: Byungchul Park @ 2016-01-20 10:26 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, LKML, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, Jan 20, 2016 at 02:43:35PM +0900, Byungchul Park wrote:
> 
> It looks very tricky. I have a question. Do we have to call the
> scheduler_tick() even while the tick is stopped? IMHO, it seems to be
> ok even if we won't call it while the tick is stopped. Wrong? I mean,
> 

The reason why I asked is that scheduler_tick() looks to be a
scheduler callback for the *periodic tick*. IMHO, we need to choose one of
these two:

1) Make scheduler_tick() able to handle it, not only the periodic tick
but also the tick-like events that fire while the tick is stopped. But I am
not sure if this is the right way.

2) Distinguish the periodic tick from the tick-like events by which we
handle RCU callbacks, irq work and so on, so that the periodic tick
handler only handles periodic stuff, either locally or remotely, while
the tick-like event handler only serves its own purpose. I think this is
better, though I am not sure.

> ---
> 
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index bbc5d11..774adc2 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -1422,7 +1422,8 @@ void update_process_times(int user_tick)
>  	if (in_irq())
>  		irq_work_tick();
>  #endif
> -	scheduler_tick();
> +	if (!tick_nohz_tick_stopped())
> +		scheduler_tick();
>  	run_posix_cpu_timers(p);
>  }
>  
> ---
> 
> hm ???

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/4] sched: Improve cpu load accounting with nohz
  2016-01-20  0:48             ` Byungchul Park
@ 2016-01-20 13:04               ` Dietmar Eggemann
  0 siblings, 0 replies; 58+ messages in thread
From: Dietmar Eggemann @ 2016-01-20 13:04 UTC (permalink / raw)
  To: Byungchul Park, Peter Zijlstra
  Cc: perterz, Frederic Weisbecker, LKML, Chris Metcalf,
	Thomas Gleixner, Luiz Capitulino, Christoph Lameter,
	Paul E . McKenney, Mike Galbraith, Rik van Riel

On 20/01/16 00:48, Byungchul Park wrote:
> On Tue, Jan 19, 2016 at 02:04:57PM +0100, Peter Zijlstra wrote:
>> On Fri, Jan 15, 2016 at 04:56:36PM +0000, Dietmar Eggemann wrote:
>>> Couldn't we set tickless_load only in case:
>>>
>>> unsigned long tickless_load = (active && pending_updates > 1) ?
>>> this_rq->cpu_load[0] : 0;
>>>
>>> Even though update_cpu_load_nohz() can call with pending_updates=1 and
>>> active=1 but then we don't have to decay.
>>
>> decay_load_missed() has an early bail for !missed, which will be tickled
>> with pending_updates == 1.
> 
> I think the way for decay_load_missed() to get an early bail for
> *!load*, which the Dietmar's proposal did, is also good. And the
> peterz's proposal avoiding an unnecessary "add" operation is also
> good. Whatever..
> 
>>
>> What I was thinking of doing however is:
>>
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4445,13 +4445,15 @@ static void __update_cpu_load(struct rq
>>  
>>  		old_load = this_rq->cpu_load[i];
>>  		old_load = decay_load_missed(old_load, pending_updates - 1, i);
>> -		old_load -= decay_load_missed(tickless_load, pending_updates - 1, i);
>> -		/*
>> -		 * old_load can never be a negative value because a decayed
>> -		 * tickless_load cannot be greater than the original
>> -		 * tickless_load.
>> -		 */
>> -		old_load += tickless_load;
>> +		if (tickless_load) {
> 
> And additionally, in this approach, why don't you do like,
> 
>                 if (tickless_load || pending_updates - 1)
> 
>> +			old_load -= decay_load_missed(tickless_load, pending_updates - 1, i);
>> +			/*
>> +			 * old_load can never be a negative value because a
>> +			 * decayed tickless_load cannot be greater than the
>> +			 * original tickless_load.
>> +			 */
>> +			old_load += tickless_load;
>> +		}
>>  		new_load = this_load;
>>  		/*
>>  		 * Round up the averaging division if load is increasing. This
>>
>>
>> Since regardless of the pending_updates, none of that makes sense if
>> !tickless_load.
> 
> None of that makes sense if !(pending_updates - 1), too. In that case,
> it becomes,
> 
>                         old_load -= tickless_load;
>                         old_load += tickless_load;
> 

tickless_load is potentially set to != 0 (rq->cpu_load[0]) in case
__update_cpu_load() is called by:

 tick_nohz_full_update_tick()->tick_nohz_restart_sched_tick()->
 update_cpu_load_nohz() [CONFIG_NO_HZ_FULL=y]

 scheduler_tick()->update_cpu_load_active()

The former can call with pending_updates=1 and the latter always calls
w/ pending_updates=1, which makes it impossible to set tickless_load
correctly in __update_cpu_load(). I guess an enum indicating the caller
(update_cpu_load_active(), update_cpu_load_nohz() or
update_idle_cpu_load()) would help if we want to do this super correctly.
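
Such an enum might look like this (a hypothetical sketch, not existing code):

	enum cpu_load_update_type {
		CPU_LOAD_UPDATE_ACTIVE,	/* update_cpu_load_active() */
		CPU_LOAD_UPDATE_NOHZ,	/* update_cpu_load_nohz() */
		CPU_LOAD_UPDATE_IDLE,	/* update_idle_cpu_load() */
	};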

I don't have a strong preference whether we do 'if (tickless_load)' or
let decay_load_missed() bail when load = 0 (the latter is not really
necessary because decay_load_missed() can handle load=0 already).
But if we do 'if (tickless_load)', why don't we also do 'if (old_load)'?
(old_load can be 0 as well.)

Anyway, the patch fixes the unsigned underflow issue I saw for the
cpu_load[] values in NO_HZ_FULL mode.

Maybe you could fix the __update_cpu_load header as well:

'@active: !0 for NOHZ_FULL' (not really true, also when called by
update_cpu_load_active())
s/decay_load_misses()/decay_load_missed()
s/@active paramter/@active parameter

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue
  2016-01-20  9:03   ` Thomas Gleixner
@ 2016-01-20 14:31     ` Frederic Weisbecker
  2016-01-20 14:43       ` Thomas Gleixner
  0 siblings, 1 reply; 58+ messages in thread
From: Frederic Weisbecker @ 2016-01-20 14:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, LKML, Byungchul Park, Chris Metcalf,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, Jan 20, 2016 at 10:03:32AM +0100, Thomas Gleixner wrote:
> On Wed, 13 Jan 2016, Frederic Weisbecker wrote:
> > A solution to fix this is to update the CPU load everytime we enqueue
> > or dequeue a task in the fair runqueue and more than a jiffy occured
> > since the last update.
> 
> That's not a solution. That's just crap.

Have you seen the "RFC"? That's what we use when we are not yet confident
with a solution but we want to start a debate in order to find a proper one.

> 
> I tell you since years, that you need to fix that remote accounting stuff,
> but no, you insist on adding more trainwrecks left and right.

The solution you proposed to me was to do remote scheduler_tick() from
CPU 0 and this was nacked by peterz (and he was right).

We all know that we need to fix this remote accounting stuff, but I'm the
only one who actually _tries_, at least through RFCs to start discussions, so that
I can find the right direction to move forward.

You're not helping me _at all_ with your shitty rants; all you're doing is discouraging me
and pushing me to quit kernel development. I seriously thought about it, but that's not
going to happen, unless there is a collective opinion that I'm a nuisance to the
community.

So go to hell Thomas!

> 
> > The problem with doing this remotely is that we can miss past cpu loads if
> > there was several enqueue/dequeue operations happening while tickless.
> 
> That's complete bullshit.
> 
> 1) How is remote accounting that happens every tick different from local
>    accounting which happens every tick?

Enqueue/dequeue don't happen on tick, unless there is a wakeup on that interrupt.

> 
> 2) How do you have enqueue/dequeue operations when you are running in full
>    nohz, i.e. one task is consuming 100% cpu time in user space?

Well, that task is going to sleep, wake up, sleep like any other task. We need to
account these slices properly. If a second task wakes up and restarts the tick, we must
make sure that the previous tickless frame got accounted properly.

Besides, if a SCHED_FIFO task runs (tickless) with SCHED_NORMAL tasks in the runqueue,
those are typically still accounted with the tick, so perhaps we need to keep that behaviour
without the tick as well and account those SCHED_NORMAL tasks' load.

> 
> I'm really tired of that tinkering. The proper solution is to make NOHZ_FULL
> depend on BROKEN.

Sure, knock yourself out.

> 
> Thanks,
> 
> 	tglx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue
  2016-01-20 14:31     ` Frederic Weisbecker
@ 2016-01-20 14:43       ` Thomas Gleixner
  2016-01-20 16:40         ` Frederic Weisbecker
  0 siblings, 1 reply; 58+ messages in thread
From: Thomas Gleixner @ 2016-01-20 14:43 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, LKML, Byungchul Park, Chris Metcalf,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, 20 Jan 2016, Frederic Weisbecker wrote:
> On Wed, Jan 20, 2016 at 10:03:32AM +0100, Thomas Gleixner wrote:
> > I tell you since years, that you need to fix that remote accounting stuff,
> > but no, you insist on adding more trainwrecks left and right.
> 
> The solution you proposed to me was to do remote scheduler_tick() from
> CPU 0 and this was nacked by peterz (and he was right).

He did not nack the general approach of remote accounting, right?
 
> We all know that we need to fix this remote accounting stuff, but I'm the
> only one who actually _tries_, at least through RFC's to start discussions,
> such that I find the right direction to move forward.

Well, I do not see any attempt to do remote accounting, not even in a
minimalistic form. The current RFC is about dealing with issues which are
caused by the lack of continuous (remote) accounting.
 
> > > The problem with doing this remotely is that we can miss past cpu loads if
> > > there was several enqueue/dequeue operations happening while tickless.
> > 
> > That's complete bullshit.
> > 
> > 1) How is remote accounting that happens every tick different from local
> >    accounting which happens every tick?
> 
> Enqueue/dequeue don't happen on tick, unless there is a wakeup on that interrupt.

And how does that matter? Tick-based accounting, whether remote or local, does
not account for intermediate states at all.
 
> > 2) How do you have enqueue/dequeue operations when you are running in full
> >    nohz, i.e. one task is consuming 100% cpu time in user space?
> 
> Well that task is going to sleep, wake up, sleep like any other task. We

If that task goes to sleep, then it leaves the full nohz state.

> need to account these slices properly. If a second task wakes up and restart
> the tick, we must make sure that the previous tickless frame got accounted
> properly.

The previous tickless frame ends when that task goes to sleep. And that's
where you update the accounting.
 
> Besides, if a SCHED_FIFO task runs (tickless) with SCHED_NORMAL tasks in the
> runqueue, those are typically still accounted with the tick, so perhaps we
> need to keep that behaviour without the tick as well and account those
> SCHED_NORMAL task's load.

So we agreed a long time ago that we first fix the issues with a single task
running undisturbed in user space, i.e. tickless. Those issues have never been
fully resolved, but now you try to add more complexity: extra runnable
tasks, nohz tasks sleeping and whatever.

Can we please go back to the point where this all started:

    ONE task running with 100% CPU in user space

And get all the issues around that resolved proper, which involves remote
accounting.

Once that works, you can add the new features, i.e. extra runnable tasks and
whatever.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue
  2016-01-20  9:09       ` Peter Zijlstra
@ 2016-01-20 14:54         ` Frederic Weisbecker
  2016-01-20 15:11           ` Thomas Gleixner
  2016-01-20 16:56           ` Peter Zijlstra
  0 siblings, 2 replies; 58+ messages in thread
From: Frederic Weisbecker @ 2016-01-20 14:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Byungchul Park, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, Jan 20, 2016 at 10:09:06AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 19, 2016 at 06:03:19PM +0100, Frederic Weisbecker wrote:
> > On Tue, Jan 19, 2016 at 02:17:08PM +0100, Peter Zijlstra wrote:
> > > On Wed, Jan 13, 2016 at 05:01:31PM +0100, Frederic Weisbecker wrote:
> > > > The full nohz CPU load is currently accounted on tick restart only.
> > > > But there are a few issues with this model:
> > > > 
> > > > _ On tick restart, if cpu_load[0] doesn't contain the load of the actual
> > > >   tickless load that just ran, we are going to account a wrong value.
> > > >   And it is very likely to be so given that cpu_load[0] doesn't have
> > > >   an opportunity to be updated between tick stop and tick restart.
> > > > 
> > > > _ If the runqueue had updates that didn't trigger a tick restart, we
> > > >   are going to miss those CPU load changes.
> > > > 
> > > > A solution to fix this is to update the CPU load everytime we enqueue
> > > > or dequeue a task in the fair runqueue and more than a jiffy occured
> > > > since the last update.
> > > 
> > > Would not a much better solution be to do this remotely instead of from
> > > one of the hottest functions in the scheduler?
> > 
> > The problem with doing this remotely is that we can miss past cpu loads if
> > there was several enqueue/dequeue operations happening while tickless.
> 
> Its a timer based sample, it _always_ and per definition misses
> intermediate state.

Sure, but the problem is when these intermediate states are long enough.

> 
> You can simply do:
> 
> 	for_each_nohzfull_cpu(cpu) {
> 		struct rq *rq = rq_of(cpu);
> 
> 		raw_spin_lock(&rq->lock);
> 		update_cpu_load_active(rq);
> 		raw_spin_unlock(&rq->lock);
> 	}

But from where should we do that? Maybe we can do it before we call source/target_load(),
on the selected targets needed by the caller? The problem is that if we do that right
after a task got enqueued on the nohz runqueue, we may accidentally account its load as the
whole dynticks frame (I mean, if we get rid of that enqueue/dequeue accounting).

> 
> Also, since when can we have enqueues/dequeues while NOHZ_FULL ? I
> thought that was the 1 task 100% cpu case, there are no
> enqueues/dequeues there.

That's the most optimized case but we can definitely have small moments with more
than one task running. For example if we have a workqueue, or other short and quick tasks.

If the user makes use of full dynticks for soft isolation (for performance, can live
with a few interrupts...), there can be short moments of multitasking.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue
  2016-01-20 14:54         ` Frederic Weisbecker
@ 2016-01-20 15:11           ` Thomas Gleixner
  2016-01-20 15:19             ` Christoph Lameter
  2016-01-20 16:45             ` Frederic Weisbecker
  2016-01-20 16:56           ` Peter Zijlstra
  1 sibling, 2 replies; 58+ messages in thread
From: Thomas Gleixner @ 2016-01-20 15:11 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, LKML, Byungchul Park, Chris Metcalf,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, 20 Jan 2016, Frederic Weisbecker wrote:
> On Wed, Jan 20, 2016 at 10:09:06AM +0100, Peter Zijlstra wrote:
> > Also, since when can we have enqueues/dequeues while NOHZ_FULL ? I
> > thought that was the 1 task 100% cpu case, there are no
> > enqueues/dequeues there.
> 
> That's the most optimized case but we can definetly have small moments with
> more than one task running. For example if we have a workqueue, or such
> short and quick tasks.
>
> If the user makes use of full dynticks for soft isolation (for performance,
> can live with a few interrupts...), there can be short moments of
> multitasking.

Again, you are trying to take the second step before the first one is
completed. We do not even have proper accounting for the ONE task
100% case, and still you try to solve problems beyond that.

If that ONE task gets interrupted, then accounting should take place.

When there is another runnable task then that nohz state needs to be left. You
can go back to it once the task is alone again.

You are trying to make the complete accounting 'almost' tick independent, but
approaching that from the tick nohz angle is wrong.

When you really want to go there, and I can see why you want that, then you
need to solve this from the ground up, and such a solution has nothing to do with
any flavour of NOHZ. That simply needs to rework the whole accounting
machinery and rip out the complete tick dependency. Once you have that, your
NOHZ business falls into place. Any other approach is just duct tape
engineering.
 
Thanks,

	tglx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue
  2016-01-20 15:11           ` Thomas Gleixner
@ 2016-01-20 15:19             ` Christoph Lameter
  2016-01-20 16:52               ` Frederic Weisbecker
  2016-01-20 16:45             ` Frederic Weisbecker
  1 sibling, 1 reply; 58+ messages in thread
From: Christoph Lameter @ 2016-01-20 15:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Frederic Weisbecker, Peter Zijlstra, LKML, Byungchul Park,
	Chris Metcalf, Luiz Capitulino, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, 20 Jan 2016, Thomas Gleixner wrote:

> > If the user makes use of full dynticks for soft isolation (for performance,
> > can live with a few interrupts...), there can be short moments of
> > multitasking.

"Soft" isolation? Like soft realtime ... Argh... Please stay away from
corrupting the intents of the nohz full mode.

> Again, you are trying to make the second step after the first one is
> completed. We do not even have proper accounting when we have the ONE task
> 100% case and still you try to solve problems beyond that.

Please let's only deal with the one-task issue. I can have "soft" isolation
today by moving daemons etc. off certain processors. We definitely want to
run nothing else on the processor and will want to go even further with
the cache allocation techniques in recent processors to limit the cache
disturbances from other processors etc etc.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue
  2016-01-20 14:43       ` Thomas Gleixner
@ 2016-01-20 16:40         ` Frederic Weisbecker
  2016-01-20 16:42           ` Christoph Lameter
  0 siblings, 1 reply; 58+ messages in thread
From: Frederic Weisbecker @ 2016-01-20 16:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, LKML, Byungchul Park, Chris Metcalf,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, Jan 20, 2016 at 03:43:35PM +0100, Thomas Gleixner wrote:
> On Wed, 20 Jan 2016, Frederic Weisbecker wrote:
> > On Wed, Jan 20, 2016 at 10:03:32AM +0100, Thomas Gleixner wrote:
> > > I tell you since years, that you need to fix that remote accounting stuff,
> > > but no, you insist on adding more trainwrecks left and right.
> > 
> > The solution you proposed to me was to do remote scheduler_tick() from
> > CPU 0 and this was nacked by peterz (and he was right).
> 
> He did not nack the general approach of remote accounting, right?

He nacked the remote scheduler_tick(): https://lkml.org/lkml/2015/8/13/144
but not the general approach of remote accounting, which is the only way to
do what we want anyway; we just need to do it in a finer-grained way.

>  
> > We all know that we need to fix this remote accounting stuff, but I'm the
> > only one who actually _tries_, at least through RFC's to start discussions,
> > such that I find the right direction to move forward.
> 
> Well, I do not see any attempt to do remote accounting, not even in a
> minimalistic form. The current RFC is about dealing with issues which are
> caused by the lack of continous (remote) accounting.

Well, remote accounting of cpu load is something we really need; that's what I
planned to do, but I couldn't find a way to do it correctly without listening to
enqueue/dequeue events first. I said in the cover letter that it was still incomplete
and we need to find a way for target_load() to return up-to-date values (suggesting
we are going to need remote accounting and we'll need to debate around that).

>  
> > > > The problem with doing this remotely is that we can miss past cpu loads if
> > > > there was several enqueue/dequeue operations happening while tickless.
> > > 
> > > That's complete bullshit.
> > > 
> > > 1) How is remote accounting that happens every tick different from local
> > >    accounting which happens every tick?
> > 
> > Enqueue/dequeue don't happen on tick, unless there is a wakeup on that interrupt.
> 
> And how does that matter? Tick based accounting whether remote or local does
> not account for intermediate states at all.
>  
> > > 2) How do you have enqueue/dequeue operations when you are running in full
> > >    nohz, i.e. one task is consuming 100% cpu time in user space?
> > 
> > Well that task is going to sleep, wake up, sleep like any other task. We
> 
> If that tasks goes to sleep, then it leaves the full nohz state.

No, it stays in dynticks mode if we go idle. This isn't "full nohz" anymore but we
don't draw much of a distinction here. We could update the cpu load on this transition
though.

> 
> > need to account these slices properly. If a second task wakes up and restart
> > the tick, we must make sure that the previous tickless frame got accounted
> > properly.
> 
> The previous tickless frame ends when that task goes to sleep. And that's
> where you update the accounting.

There is a continuity between full nohz and idle nohz, but surely we need to
update the cpu load on this transition. This was implied by the update on
enqueue/dequeue, but if we don't take that direction yet, we'll need to do it
explicitly there.

>  
> > Besides, if a SCHED_FIFO task runs (tickless) with SCHED_NORMAL tasks in the
> > runqueue, those are typically still accounted with the tick, so perhaps we
> > need to keep that behaviour without the tick as well and account those
> > SCHED_NORMAL task's load.
> 
> So we agreed long time ago, that we first fix the issues with s single task
> running undisturbed in user space, i.e. tickless. Those issues have never been
> resolved fully, but now you try to add more complexity of extra runnable
> tasks, nohz tasks sleeping and whatever.

Nohz tasks do sleep, really; at least we need to handle that case now.

> 
> Can we please go back to the point where this all started:
> 
>     ONE task running with 100% CPU in user space
> 
> And get all the issues around that resolved proper, which involves remote
> accounting.

That was the plan to discuss in this rfc series.

> 
> Once that works, you can add the new features, i.e. extra runnable tasks and
> whatever.

Sure I can ignore the more complicated scenario for now. I agree with that.

So what we can do is to record the load of the nohz task on full dynticks frame
entry. Then on that full dynticks frame exit, we account that recorded load.

Then I'll follow up with another series to do the remote accounting part.

How does that sound?

Thanks

> Thanks,
> 
> 	tglx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue
  2016-01-20 16:40         ` Frederic Weisbecker
@ 2016-01-20 16:42           ` Christoph Lameter
  2016-01-20 16:47             ` Frederic Weisbecker
  0 siblings, 1 reply; 58+ messages in thread
From: Christoph Lameter @ 2016-01-20 16:42 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thomas Gleixner, Peter Zijlstra, LKML, Byungchul Park,
	Chris Metcalf, Luiz Capitulino, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, 20 Jan 2016, Frederic Weisbecker wrote:

> > So we agreed long time ago, that we first fix the issues with s single task
> > running undisturbed in user space, i.e. tickless. Those issues have never been
> > resolved fully, but now you try to add more complexity of extra runnable
> > tasks, nohz tasks sleeping and whatever.
>
> Nohz tasks do sleep, really, at least we need to handle that case now.

Sleep, meaning a call into the OS to be put to sleep? Why would NOHZ be
active then?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue
  2016-01-20 15:11           ` Thomas Gleixner
  2016-01-20 15:19             ` Christoph Lameter
@ 2016-01-20 16:45             ` Frederic Weisbecker
  1 sibling, 0 replies; 58+ messages in thread
From: Frederic Weisbecker @ 2016-01-20 16:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, LKML, Byungchul Park, Chris Metcalf,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, Jan 20, 2016 at 04:11:14PM +0100, Thomas Gleixner wrote:
> On Wed, 20 Jan 2016, Frederic Weisbecker wrote:
> > On Wed, Jan 20, 2016 at 10:09:06AM +0100, Peter Zijlstra wrote:
> > > Also, since when can we have enqueues/dequeues while NOHZ_FULL ? I
> > > thought that was the 1 task 100% cpu case, there are no
> > > enqueues/dequeues there.
> > 
> > That's the most optimized case but we can definetly have small moments with
> > more than one task running. For example if we have a workqueue, or such
> > short and quick tasks.
> >
> > If the user makes use of full dynticks for soft isolation (for performance,
> > can live with a few interrupts...), there can be short moments of
> > multitasking.
> 
> Again, you are trying to make the second step after the first one is
> completed. We do not even have proper accounting when we have the ONE task
> 100% case and still you try to solve problems beyond that.
> 
> If that ONE task gets interrupted, then accounting should take place.
> 
> When there is another runnable task then that nohz state needs to be left. You
> can go back to it once the task is alone again.
> 
> You are trying to make the complete accounting 'almost' tick independent, but
> approaching that from the tick nohz angle is wrong.
> 
> When you really want to go there, and I can see why you want that, then you
> need to solve this from ground up and such a solution has nothing to do with
> any flavour of NOHZ. That simply needs to rework the whole accounting
> machinery and rip out the complete tick dependency. Once you have that your
> NOHZ business falls into place. Any other approach is just duct tape
> engineering.

Alright, let me respawn that series to handle the simplest and most common
scenario, which is 100% single task in full dynticks, with anything else keeping
the tick. That will indeed allow us a more incremental approach.

Thanks.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue
  2016-01-20 16:42           ` Christoph Lameter
@ 2016-01-20 16:47             ` Frederic Weisbecker
  0 siblings, 0 replies; 58+ messages in thread
From: Frederic Weisbecker @ 2016-01-20 16:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Thomas Gleixner, Peter Zijlstra, LKML, Byungchul Park,
	Chris Metcalf, Luiz Capitulino, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, Jan 20, 2016 at 10:42:56AM -0600, Christoph Lameter wrote:
> On Wed, 20 Jan 2016, Frederic Weisbecker wrote:
> 
> > > So we agreed long time ago, that we first fix the issues with s single task
> > > running undisturbed in user space, i.e. tickless. Those issues have never been
> > > resolved fully, but now you try to add more complexity of extra runnable
> > > tasks, nohz tasks sleeping and whatever.
> >
> > Nohz tasks do sleep, really, at least we need to handle that case now.
> 
> Sleep meaning call into the OS to be put to sleep?

Indeed, when the task has finished its work, it goes to sleep, right?

> Why would NOHZ be active then?

It's active when the task is active.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue
  2016-01-20 15:19             ` Christoph Lameter
@ 2016-01-20 16:52               ` Frederic Weisbecker
  0 siblings, 0 replies; 58+ messages in thread
From: Frederic Weisbecker @ 2016-01-20 16:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Thomas Gleixner, Peter Zijlstra, LKML, Byungchul Park,
	Chris Metcalf, Luiz Capitulino, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, Jan 20, 2016 at 09:19:36AM -0600, Christoph Lameter wrote:
> On Wed, 20 Jan 2016, Thomas Gleixner wrote:
> 
> > > If the user makes use of full dynticks for soft isolation (for performance,
> > > can live with a few interrupts...), there can be short moments of
> > > multitasking.
> 
> "Soft" isolation? Like soft realtime ... Argh... Please stay away from
> corrupting the intents of the nohz full mode.

Well, some people, especially real-time users, want no disturbance at all because
their workload really depends on that.

Some others (HPC) don't care about a little disturbance.

Although HPC users haven't declared themselves yet, the real-time case
expects some more sacrifices (see the nohz tasks patchset).

> 
> > Again, you are trying to make the second step after the first one is
> > completed. We do not even have proper accounting when we have the ONE task
> > 100% case and still you try to solve problems beyond that.
> 
> Please lets only deal with the one task issue. I can have "soft" isolation
> today by moving daemons etc off certain processors. We want definitely to
> run nothing else on the processor and will want to even go further with
> the cache allocation techniques in recent processor to limit the cache
> disturbances from other processors etc etc.

One task is common to all use cases anyway.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue
  2016-01-20 14:54         ` Frederic Weisbecker
  2016-01-20 15:11           ` Thomas Gleixner
@ 2016-01-20 16:56           ` Peter Zijlstra
  2016-01-20 17:21             ` Frederic Weisbecker
  1 sibling, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-20 16:56 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Byungchul Park, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, Jan 20, 2016 at 03:54:19PM +0100, Frederic Weisbecker wrote:

> > You can simply do:
> > 
> > 	for_each_nohzfull_cpu(cpu) {
> > 		struct rq *rq = rq_of(cpu);
> > 
> > 		raw_spin_lock(&rq->lock);
> > 		update_cpu_load_active(rq);
> > 		raw_spin_unlock(&rq->lock);
> > 	}
> 
> But from where should we do that? 

housekeeper thingy

> Maybe we can do it before we call source/target_load(), on the
> selected targets needed by the caller? The problem is that if we do
> that right after a task got enqueued on the nohz runqueue, we may
> accidentally account it as the whole dynticks frame (I mean, if we get
> rid of that enqueue/dequeue accounting).

Yes, so? What if the current tick happens right after a task gets
enqueued? Then we account the whole tick as !idle, even though we might
have been idle for 99% of the time.

Not a problem, this is sampling.

Doing it locally or remotely doesn't matter.

> > Also, since when can we have enqueues/dequeues while NOHZ_FULL ? I
> > thought that was the 1 task 100% cpu case, there are no
> > enqueues/dequeues there.
> 
> That's the most optimized case but we can definetly have small moments
> with more than one task running. For example if we have a workqueue,
> or such short and quick tasks.

The moment you have nr_running>1 the tick comes back on.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue
  2016-01-20 16:56           ` Peter Zijlstra
@ 2016-01-20 17:21             ` Frederic Weisbecker
  2016-01-20 18:25               ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Frederic Weisbecker @ 2016-01-20 17:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Byungchul Park, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, Jan 20, 2016 at 05:56:57PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 20, 2016 at 03:54:19PM +0100, Frederic Weisbecker wrote:
>
> > > You can simply do:
> > >
> > >	for_each_nohzfull_cpu(cpu) {
> > >		struct rq *rq = rq_of(cpu);
> > >
> > >		raw_spin_lock(&rq->lock);
> > >		update_cpu_load_active(rq);
> > >		raw_spin_unlock(&rq->lock);
> > >	}
> >
> > But from where should we do that?
>
> house keeper thingy

You mean a periodic call to the above from the housekeepers?

I didn't think about doing that because you nacked that approach with
scheduler_tick(). This isn't much different.

It means the housekeeper is entirely dedicated to full dynticks CPUs.

>
> > Maybe we can do it before we call source/target_load(), on the
> > selected targets needed by the caller? The problem is that if we do
> > that right after a task got enqueued on the nohz runqueue, we may
> > accidentally account it as the whole dynticks frame (I mean, if we get
> > rid of that enqueue/dequeue accounting).
>
> Yes so? What if the current tick happens right after a task get
> enqueued? Then we account the whole tick as !idle, even tough we might
> have been idle for 99% of the time.

Idle is correctly taken care of there because tick_nohz_idle_exit() makes
sure that the whole dynticks load recorded is 0.

>
> Not a problem, this is sampling.

It's OK to have sampling imprecision indeed, but accounting long samples
of single-task time as multitask is erratic rather than merely imprecise. Now
that's the issue with pure on-demand updates. If we do the remote update
periodically instead, that wouldn't be a problem anymore as it would just
be a matter of precision.

>
> Doing it locally or remotely doesn't matter.
>
> > > Also, since when can we have enqueues/dequeues while NOHZ_FULL ? I
> > > thought that was the 1 task 100% cpu case, there are no
> > > enqueues/dequeues there.
> >
> > That's the most optimized case but we can definetly have small moments
> > with more than one task running. For example if we have a workqueue,
> > or such short and quick tasks.
>
> The moment you have nr_running>1 the tick comes back on.

Sure, but the current nohz frame exit accounting is wrong as it accounts the
newly woken task as the whole tickless load. We need to record the single-task
load on nohz frame entry at least, so we can retrieve and account it on nohz exit.

Unless, again, we do that periodic housekeeping remote update. Then we don't
care locally at all anymore.
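
Such a periodic housekeeping update might look like this (a hypothetical
sketch; the work setup, the period and the exact locking are assumptions,
not settled API):

	/* Periodic remote cpu load update, run from a housekeeping CPU */
	static void sched_remote_load_update(struct work_struct *work)
	{
		int cpu;

		for_each_cpu(cpu, tick_nohz_full_mask) {
			struct rq *rq = cpu_rq(cpu);

			raw_spin_lock_irq(&rq->lock);
			update_cpu_load_active(rq);
			raw_spin_unlock_irq(&rq->lock);
		}
		/* re-arm, e.g. once per jiffy */
		schedule_delayed_work(to_delayed_work(work), 1);
	}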

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/4] sched: Don't account tickless CPU load on tick
  2016-01-20  8:42     ` Thomas Gleixner
@ 2016-01-20 17:36       ` Frederic Weisbecker
  2016-01-22  8:40       ` Byungchul Park
  1 sibling, 0 replies; 58+ messages in thread
From: Frederic Weisbecker @ 2016-01-20 17:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, LKML, Byungchul Park, Chris Metcalf,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, Jan 20, 2016 at 09:42:16AM +0100, Thomas Gleixner wrote:
> On Tue, 19 Jan 2016, Peter Zijlstra wrote:
> 
> > On Wed, Jan 13, 2016 at 05:01:28PM +0100, Frederic Weisbecker wrote:
> > > The cpu load update on tick doesn't care about dynticks and as such is
> > > buggy when occuring on nohz ticks (including idle ticks) as it resets
> > > the jiffies snapshot that was recorded on nohz entry. We eventually
> > > ignore the potentially long tickless load that happened before the
> > > tick.
> > 
> > I don't get it, how can we call scheduler_tick() while
> > tick_nohz_tick_stopped() ?
> 
> tick->nohz_stopped is merily indicating that we switched from periodic mode to
> tickless mode. That's probably a misnomer, but it still has that meaning.
> 
> You really need to look at it from the history of that code which was designed
> for tickless idle. The nohz full stuff was bolted on it.
> 
> So if we stop the tick in idle - or for that matter in full nohz - we look
> ahead when the next tick is required. That can be:
> 
>       - a timer wheel timer expiring
> 
>       - other stuff which prevents the cpu from going "tickless" like rcu,
>         irqwork
> 
> So lets assume rcu and irqwork are silent, but we have a timer expiring 100ms
> from now, then we program the tick timer to 100ms from now. When it fires it
> invokes the normal tick_sched() timer machinery:
> 
>      - timekeeping update
>      - update_process_times
>      - profile_tick
> 
> I have no idea why that is a problem. If update_process_times() is invoked
> then it will account the elapsed time to the idle task in case of tickless
> idle. In case of nohz full it should simply account the time to the task which
> was busy on the cpu in user space.
> 
> The above changelog is just crap and doesnt make any sense at all. And the
> patch is fixing symptoms not the root cause.

So the other way to fix this is to account the tickless load properly and avoid
accounting some newly awoken task's load. We could record the weighted_cpuload()
on nohz entry (or 0 in the case of idle) and then account that on idle exit.

I think it's a better solution.
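
Roughly (a sketch; the helper names and the tickless_load field are assumed,
not existing code):

	/* On nohz entry: remember the load we enter the tickless frame with */
	void cpu_load_update_nohz_start(void)
	{
		struct rq *this_rq = this_rq();
		int cpu = cpu_of(this_rq);

		/* 0 if we are going dyntick-idle, the current load otherwise */
		this_rq->tickless_load = idle_cpu(cpu) ? 0 : weighted_cpuload(cpu);
	}

	/* On nohz exit: account the remembered load for the whole frame */
	void cpu_load_update_nohz_stop(void)
	{
		struct rq *this_rq = this_rq();
		unsigned long pending = jiffies - this_rq->last_load_update_tick;

		if (pending) {
			__update_cpu_load(this_rq, this_rq->tickless_load,
					  pending, 1);
			this_rq->last_load_update_tick = jiffies;
		}
	}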

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue
  2016-01-20 17:21             ` Frederic Weisbecker
@ 2016-01-20 18:25               ` Peter Zijlstra
  2016-01-21 13:25                 ` Frederic Weisbecker
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-20 18:25 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Byungchul Park, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, Jan 20, 2016 at 06:21:07PM +0100, Frederic Weisbecker wrote:

> You mean a periodic call to the above from the housekeepers?
> 
> I didn't think about doing that because you nacked that approach with
> scheduler_tick(). This isn't much different.

This does _one_ thing, namely load accounting. scheduler_tick() does a
whole bunch of things, some of which really do not make much sense to do
remotely.

> It means the housekeeper is entirely dedicated to full dynticks CPUs.

Depends on how many there are, and how many housekeepers. I would
suggest one housekeeper cpu per node or so to make sure this keeps
working.

Then again, these load values aren't super important, esp. not for the
nohz_full case where you typically don't care about load balancing.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue
  2016-01-20 18:25               ` Peter Zijlstra
@ 2016-01-21 13:25                 ` Frederic Weisbecker
  0 siblings, 0 replies; 58+ messages in thread
From: Frederic Weisbecker @ 2016-01-21 13:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Byungchul Park, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, Jan 20, 2016 at 07:25:03PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 20, 2016 at 06:21:07PM +0100, Frederic Weisbecker wrote:
> 
> > You mean a periodic call to the above from the housekeepers?
> > 
> > I didn't think about doing that because you nacked that approach with
> > scheduler_tick(). This isn't much different.
> 
> This does _one_ thing, namely load accounting. scheduler_tick() does a
> whole bunch of things, some which really do not make much sense to do
> remotely.

Ok.

> 
> > It means the housekeeper is entirely dedicated to full dynticks CPUs.
> 
> Depends on how many there are, and how many housekeepers. I would
> suggest one housekeeper cpu per node or so to make sure this keeps
> working.

That's possible to do.

> 
> Then again, these load values aren't super important, esp. not for the
> nohz_full case where you typically don't care about load balancing.

In this case maybe we can assume that these stats don't need precision at
all on nohz full. When the full nohz frame starts, we set all
cpu_load[i] = cpu_load[0] = weighted_cpuload(cpu) and keep it that way until
the nohz full frame ends. This way we avoid the remote update.
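
Roughly this (cpu_load_nohz_full_start() is a made-up name;
CPU_LOAD_IDX_MAX is the existing size of rq->cpu_load[]):

	static void cpu_load_nohz_full_start(struct rq *rq)
	{
		unsigned long load = weighted_cpuload(cpu_of(rq));
		int i;

		/* Flatten the array: every index reads the entry load. */
		for (i = 0; i < CPU_LOAD_IDX_MAX; i++)
			rq->cpu_load[i] = load;
	}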

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/4] sched: Don't account tickless CPU load on tick
  2016-01-20  8:42     ` Thomas Gleixner
  2016-01-20 17:36       ` Frederic Weisbecker
@ 2016-01-22  8:40       ` Byungchul Park
  1 sibling, 0 replies; 58+ messages in thread
From: Byungchul Park @ 2016-01-22  8:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Frederic Weisbecker, LKML, Chris Metcalf,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, Jan 20, 2016 at 09:42:16AM +0100, Thomas Gleixner wrote:
> 
> The above changelog is just crap and doesn't make any sense at all. And the
> patch is fixing symptoms, not the root cause.

IMHO, the root cause is the definition of "tick". Am I the only one
confused? I am confused now. What is a tick? Is the timer interrupt that
handles rcu callbacks or irq work while the (periodic) tick is stopped
itself a tick?

If it is, then a lot of code that assumes the tick happens periodically,
including the scheduler and esp. scheduler_tick(), must be fixed to handle
a non-periodic tick. And we have to focus on that. Then we don't need
anything accounting cpu load remotely and periodically, because the local
tick handler can account it properly.

If it is not, that is, if the timer interrupt that handles rcu callbacks
or irq work while the tick is stopped is not a tick but just an interrupt
by which we want something done, then I think that handler should do only
its own job. In this case, tick related handling has to be deferred to the
point where the tick restarts. And we have to focus on making that work.

Regardless of the answer, if there is something the housekeeper must do
periodically, then it should be done, even remotely. But I think
accounting cpu load is not such a case. The way to implement it depends on
the answer above. IMHO, the latter direction is better.

Just my opinion, and I may lack kernel knowledge compared with you. Please
let me know if I am wrong, or let me know your opinions.

> 
> Thanks,
> 
> 	tglx
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/4] sched: Don't account tickless CPU load on tick
  2016-01-20 10:26             ` Byungchul Park
@ 2016-01-28 16:01               ` Frederic Weisbecker
  2016-01-29  9:50                 ` Peter Zijlstra
  2016-02-01  6:33               ` Byungchul Park
  1 sibling, 1 reply; 58+ messages in thread
From: Frederic Weisbecker @ 2016-01-28 16:01 UTC (permalink / raw)
  To: Byungchul Park
  Cc: Peter Zijlstra, LKML, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, Jan 20, 2016 at 07:26:14PM +0900, Byungchul Park wrote:
> On Wed, Jan 20, 2016 at 02:43:35PM +0900, Byungchul Park wrote:
> > 
> > It looks very tricky. I have a question. Do we have to call the
> > scheduler_tick() even while the tick is stopped? IMHO, it seems to be
> > ok even if we won't call it while the tick is stopped. Wrong? I mean,
> > 
> 
> The reason why I asked is that, scheduler_tick() looks to be a
> scheduler callback for *periodic tick*. IMHO, we need to choose one of
> these two.
> 
> 1) Make scheduler_tick() can handle it, not only for the periodic tick
> but also for the tick-like event during tick-stopped. But I am not sure
> if this is the right way.
> 
> 2) Distinguish the periodic tick from the tick-like event by which we
> can handle rcu callback, irq work and so on, so that the periodic tick
> handler only handles periodic stuff either locally or remotely, while
> the tick-like event handler only does its purpose. I think this is
> better, I am sure though.

So let's check all the things we call on scheduler_tick():

_ sched_clock_tick(): maybe it doesn't need to be called when idle. I'm not sure.
  Some code in idle (timers, irqs, ...) might need to call sched_clock().

_ update_rq_clock(), task_tick(): task_tick is empty for idle class, so we probably
  don't need an updated rq either.

_ update_cpu_load_active(): I was about to fix the issue properly and make it account
  correctly on idle ticks, but we might as well spare it.

_ calc_global_load_tick(): no idea

_ perf_event_task_tick(): needed, some freq CPU events can trigger in idle and need
  adjustments

_ trigger_load_balance(): maybe needed, I see it triggers the softirq after some
  rebalance delay, regardless of the current CPU idleness.

_ rq_last_tick_reset(): not needed in idle

Here we are.
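
Condensed, that suggests a tick firing while stopped would only need
something like this (purely illustrative split, not an existing function):

	static void scheduler_tick_nohz(void)
	{
		struct rq *rq = this_rq();

		sched_clock_tick();		/* only with an unstable TSC */
		perf_event_task_tick();		/* freq events fire in idle too */
		trigger_load_balance(rq);	/* kicks the rebalance softirq */
		/*
		 * Skipped: task_tick(), update_cpu_load_active(),
		 * calc_global_load_tick(), rq_last_tick_reset().
		 */
	}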


> 
> > ---
> > 
> > diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> > index bbc5d11..774adc2 100644
> > --- a/kernel/time/timer.c
> > +++ b/kernel/time/timer.c
> > @@ -1422,7 +1422,8 @@ void update_process_times(int user_tick)
> >  	if (in_irq())
> >  		irq_work_tick();
> >  #endif
> > -	scheduler_tick();
> > +	if (!tick_nohz_tick_stopped())
> > +		scheduler_tick();
> >  	run_posix_cpu_timers(p);
> >  }
> >  
> > ---
> > 
> > hm ???

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/4] sched: Don't account tickless CPU load on tick
  2016-01-28 16:01               ` Frederic Weisbecker
@ 2016-01-29  9:50                 ` Peter Zijlstra
  2016-02-01 10:05                   ` Byungchul Park
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-29  9:50 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Byungchul Park, LKML, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Thu, Jan 28, 2016 at 05:01:26PM +0100, Frederic Weisbecker wrote:
> So let's check all the things we call on scheduler_tick():
> 
> _ sched_clock_tick(): maybe it doesn't need to be called when idle. I'm not sure.
>   Some code in idle (timers, irqs, ...) might need to call sched_clock().

Only needed if you've got a shady TSC.

> _ update_rq_clock(), task_tick(): task_tick is empty for idle class, so we probably
>   don't need an updated rq either.

Right, for regular NOHZ we'll be running the idle task, and the idle
tick handler is empty. So for NOHZ you can ignore this.

For NOHZ_FULL you'll not be running the idle task and this gets
'interesting'.

The most important part would be tracking the task runtime, which is
used for a number of user visible things. This should be doable
remotely.

> _ update_cpu_load_active(): I was about to fix the issue properly and make it account
>   correctly on idle ticks but we might as well want to spare it.

Right, we've gone over this one in detail in other emails I think.

> _ calc_global_load_tick(): no idea

Can easily be done remote. However, this only records deltas of
nr_active (:= nr_running + nr_uninterruptible) and for NOHZ and
NOHZ_FULL this should not change, therefore the delta _should_ be 0 and
you can skip this.
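
For reference, roughly what that function does (simplified; LOAD_FREQ is
the ~5s global load sampling period):

	void calc_global_load_tick(struct rq *this_rq)
	{
		long delta;

		if (time_before(jiffies, this_rq->calc_load_update))
			return;

		/* delta of nr_running + nr_uninterruptible since last fold */
		delta = calc_load_fold_active(this_rq);
		if (delta)
			atomic_long_add(delta, &calc_load_tasks);

		this_rq->calc_load_update += LOAD_FREQ;
	}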

> _ perf_event_task_tick(): needed, some freq CPU events can trigger in idle and need
>   adjustments

Right, this is a tricky one. Maybe I should look into moving this into a
hrtimer, but that too has 'fun' problems IIRC. I'll put it on the TODO
list somewhere.

> _ trigger_load_balance(): maybe needed, I see it triggers the softirq after some
>   rebalance delay, regardless of the current CPU idleness.

We already have NOHZ remote balancing, we could (and should) probably do
the same for NOHZ_FULL. Then again, I would expect the NOHZ_FULL cpus to
not actually be part of a balance domain, so we could probably detect
and short-circuit this.

> _ rq_last_tick_reset(): not needed in idle

Right, part of the NOHZ_FULL 'hack'; once you fix all the remote
accounting stuff this could go away entirely, I think.
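
(For reference, the hack itself is tiny, roughly:

	static inline void rq_last_tick_reset(struct rq *rq)
	{
	#ifdef CONFIG_NO_HZ_FULL
		rq->last_sched_tick = jiffies;
	#endif
	}

scheduler_tick_max_deferment() then uses rq->last_sched_tick to force a
tick at least once per second.)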

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/4] sched: Don't account tickless CPU load on tick
  2016-01-20 10:26             ` Byungchul Park
  2016-01-28 16:01               ` Frederic Weisbecker
@ 2016-02-01  6:33               ` Byungchul Park
  1 sibling, 0 replies; 58+ messages in thread
From: Byungchul Park @ 2016-02-01  6:33 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, LKML, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Wed, Jan 20, 2016 at 07:26:14PM +0900, Byungchul Park wrote:
> On Wed, Jan 20, 2016 at 02:43:35PM +0900, Byungchul Park wrote:
> > 
> > It looks very tricky. I have a question. Do we have to call the
> > scheduler_tick() even while the tick is stopped? IMHO, it seems to be
> > ok even if we won't call it while the tick is stopped. Wrong? I mean,
> > 
> 
> The reason why I asked is that, scheduler_tick() looks to be a
> scheduler callback for *periodic tick*. IMHO, we need to choose one of
> these two.
> 
> 1) Make scheduler_tick() can handle it, not only for the periodic tick
> but also for the tick-like event during tick-stopped. But I am not sure
> if this is the right way.
> 
> 2) Distinguish the periodic tick from the tick-like event by which we
> can handle rcu callback, irq work and so on, so that the periodic tick
> handler only handles periodic stuff either locally or remotely, while
> the tick-like event handler only does its purpose. I think this is
> better, I am sure though.
              ^^^
              not
> 
> > ---
> > 
> > diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> > index bbc5d11..774adc2 100644
> > --- a/kernel/time/timer.c
> > +++ b/kernel/time/timer.c
> > @@ -1422,7 +1422,8 @@ void update_process_times(int user_tick)
> >  	if (in_irq())
> >  		irq_work_tick();
> >  #endif
> > -	scheduler_tick();
> > +	if (!tick_nohz_tick_stopped())
> > +		scheduler_tick();
> >  	run_posix_cpu_timers(p);
> >  }
> >  
> > ---
> > 
> > hm ???

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/4] sched: Don't account tickless CPU load on tick
  2016-01-29  9:50                 ` Peter Zijlstra
@ 2016-02-01 10:05                   ` Byungchul Park
  2016-02-01 10:09                     ` [PATCH] sched: calculate sched_clock_cpu without tick handling during nohz Byungchul Park
  2016-02-01 10:34                     ` [PATCH 1/4] sched: Don't account tickless CPU load on tick Peter Zijlstra
  0 siblings, 2 replies; 58+ messages in thread
From: Byungchul Park @ 2016-02-01 10:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Fri, Jan 29, 2016 at 10:50:16AM +0100, Peter Zijlstra wrote:
> On Thu, Jan 28, 2016 at 05:01:26PM +0100, Frederic Weisbecker wrote:
> > So let's check all the things we call on scheduler_tick():
> > 
> > _ sched_clock_tick(): maybe it doesn't need to be called when idle. I'm not sure.
> >   Some code in idle (timers, irqs, ...) might need to call sched_clock().
> 
> Only needed if you've got a shady TSC.

Yeh.. IMO, this can be done without the tick handling during nohz, with the
patch I am attaching. Could you check the patch? Or do we have to handle it
remotely, too? (for a crazy TSC)

> 
> > _ update_rq_clock(), task_tick(): task_tick is empty for idle class, so we probably
> >   don't need an updated rq either.
> 
> Right, for regular NOHZ we'll be running the idle task, and the idle
> tick handler is empty. So for NOHZ you can ignore this.
> 
> For NOHZ_FULL you'll not be running the idle task and this gets
> 'interesting'.
> 
> The most important part would be tracking the task runtime, which is
> used for a number of user visible things. This should be doable
> remotely.

Isn't there any way to show it to the user at the time it's requested?

> 
> > _ update_cpu_load_active(): I was about to fix the issue properly and make it account
> >   correctly on idle ticks but we might as well want to spare it.
> 
> Right, we've gone over this one in detail in other emails I think.

Doing it remotely... hm...

> 
> > _ calc_global_load_tick(): no idea
> 
> Can easily be done remote. However, this only records deltas of
> nr_active (:= nr_running + nr_uninterruptible) and for NOHZ and
> NOHZ_FULL this should not change, therefore the delta _should_ be 0 and
> you can skip this.

It sounds good.

> 
> > _ perf_event_task_tick(): needed, some freq CPU events can trigger in idle and need
> >   adjustments
> 
> Right, this is a tricky one. Maybe I should look into moving this into a
> hrtimer, but that too has 'fun' problems IIRC. I'll put it on the TODO
> list somewhere.

Good luck. I'm sure you'll do well.

> 
> > _ trigger_load_balance(): maybe needed, I see it triggers the softirq after some
> >   rebalance delay, regardless of the current CPU idleness.
> 
> We already have NOHZ remote balancing, we could (and should) probably do

I think so.

> the same for NOHZ_FULL. Then again, I would expect the NOHZ_FULL cpus to
> not actually be part of a balance domain, so we could probably detect

Couldn't the NOHZ_FULL cpus be left out of the balance domains? It sounds good.

> and short-circuit this.
> 
> > _ rq_last_tick_reset(): not needed in idle
> 
> Right, part of the NOHZ_FULL 'hack', once you fix all the remote
> accounting stuff this could go away entirely think.

I'm sure it can be removed eventually!

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH] sched: calculate sched_clock_cpu without tick handling during nohz
  2016-02-01 10:05                   ` Byungchul Park
@ 2016-02-01 10:09                     ` Byungchul Park
  2016-02-01 10:34                     ` [PATCH 1/4] sched: Don't account tickless CPU load on tick Peter Zijlstra
  1 sibling, 0 replies; 58+ messages in thread
From: Byungchul Park @ 2016-02-01 10:09 UTC (permalink / raw)
  To: peterz, mingo; +Cc: linux-kernel

Currently, unstable TSCs are adjusted by sched_clock_tick(),
sched_clock_local() and sched_clock_remote(), all of which rely on the
tick: sched_clock_local() clamps the clock to at most TICK_NSEC past the
GTOD snapshot taken at the last tick, which assumes that snapshot is at
most one jiffy old. Without the tick, that assumption breaks.

This patch makes the adjustment work approximately without tick handling,
by widening the clamp window by the number of jiffies that have elapsed
since the last sched_clock_tick(), so that sched_clock_cpu() stays sane
even if sched_clock_tick() is not called during nohz.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
---
 kernel/sched/clock.c | 33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index caf4041..064b40a 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -128,6 +128,9 @@ struct sched_clock_data {
 	u64			tick_raw;
 	u64			tick_gtod;
 	u64			clock;
+#ifdef CONFIG_NO_HZ_COMMON
+	unsigned long		update;
+#endif
 };
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_clock_data, sched_clock_data);
@@ -153,6 +156,9 @@ void sched_clock_init(void)
 		scd->tick_raw = 0;
 		scd->tick_gtod = ktime_now;
 		scd->clock = ktime_now;
+#ifdef CONFIG_NO_HZ_COMMON
+		scd->update = READ_ONCE(jiffies);
+#endif
 	}
 
 	sched_clock_running = 1;
@@ -186,6 +192,27 @@ static inline u64 wrap_max(u64 x, u64 y)
 	return (s64)(x - y) > 0 ? x : y;
 }
 
+#ifdef CONFIG_NO_HZ_COMMON
+static inline unsigned long tick_pending(struct sched_clock_data *scd)
+{
+	return READ_ONCE(jiffies) - scd->update;
+}
+
+static inline void sched_clock_tick_update(struct sched_clock_data *scd)
+{
+	scd->update = READ_ONCE(jiffies);
+}
+#else
+static inline unsigned long tick_pending(struct sched_clock_data *scd)
+{
+	return 0;
+}
+
+static inline void sched_clock_tick_update(struct sched_clock_data *scd)
+{
+}
+#endif
+
 /*
  * update the percpu scd from the raw @now value
  *
@@ -194,7 +221,7 @@ static inline u64 wrap_max(u64 x, u64 y)
  */
 static u64 sched_clock_local(struct sched_clock_data *scd)
 {
-	u64 now, clock, old_clock, min_clock, max_clock;
+	u64 now, clock, old_clock, min_clock, max_clock, next_gtod;
 	s64 delta;
 
 again:
@@ -211,9 +238,10 @@ again:
 	 *		      scd->tick_gtod + TICK_NSEC);
 	 */
 
+	next_gtod = scd->tick_gtod + (tick_pending(scd) + 1) * TICK_NSEC;
 	clock = scd->tick_gtod + delta;
 	min_clock = wrap_max(scd->tick_gtod, old_clock);
-	max_clock = wrap_max(old_clock, scd->tick_gtod + TICK_NSEC);
+	max_clock = wrap_max(old_clock, next_gtod);
 
 	clock = wrap_max(clock, min_clock);
 	clock = wrap_min(clock, max_clock);
@@ -333,6 +361,7 @@ void sched_clock_tick(void)
 
 	scd->tick_raw = now;
 	scd->tick_gtod = now_gtod;
+	sched_clock_tick_update(scd);
 	sched_clock_local(scd);
 }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/4] sched: Don't account tickless CPU load on tick
  2016-02-01 10:05                   ` Byungchul Park
  2016-02-01 10:09                     ` [PATCH] sched: calculate sched_clock_cpu without tick handling during nohz Byungchul Park
@ 2016-02-01 10:34                     ` Peter Zijlstra
  2016-02-01 23:51                       ` Byungchul Park
  2016-02-02  0:50                       ` Byungchul Park
  1 sibling, 2 replies; 58+ messages in thread
From: Peter Zijlstra @ 2016-02-01 10:34 UTC (permalink / raw)
  To: Byungchul Park
  Cc: Frederic Weisbecker, LKML, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Mon, Feb 01, 2016 at 07:05:13PM +0900, Byungchul Park wrote:
> On Fri, Jan 29, 2016 at 10:50:16AM +0100, Peter Zijlstra wrote:
> > On Thu, Jan 28, 2016 at 05:01:26PM +0100, Frederic Weisbecker wrote:
> > > So let's check all the things we call on scheduler_tick():
> > > 
> > > _ sched_clock_tick(): maybe it doesn't need to be called when idle. I'm not sure.
> > >   Some code in idle (timers, irqs, ...) might need to call sched_clock().
> > 
> > Only needed if you've got a shady TSC.
> 
> Yeh.. IMO, this can be done without the tick handling during nohz, with the
> patch I am attaching. Could you check the patch? Or do we have to handle it
> remotely, too? (for a crazy TSC)

I think NOHZ_FULL already requires the TSC not to be wrecked.
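
IIRC can_stop_full_tick() already refuses to stop the tick when the sched
clock is unstable, roughly:

	if (!sched_clock_stable()) {
		trace_tick_stop(0, "unstable sched clock\n");
		return false;
	}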

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/4] sched: Don't account tickless CPU load on tick
  2016-02-01 10:34                     ` [PATCH 1/4] sched: Don't account tickless CPU load on tick Peter Zijlstra
@ 2016-02-01 23:51                       ` Byungchul Park
  2016-02-02  0:50                       ` Byungchul Park
  1 sibling, 0 replies; 58+ messages in thread
From: Byungchul Park @ 2016-02-01 23:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Mon, Feb 01, 2016 at 11:34:33AM +0100, Peter Zijlstra wrote:
> On Mon, Feb 01, 2016 at 07:05:13PM +0900, Byungchul Park wrote:
> > On Fri, Jan 29, 2016 at 10:50:16AM +0100, Peter Zijlstra wrote:
> > > On Thu, Jan 28, 2016 at 05:01:26PM +0100, Frederic Weisbecker wrote:
> > > > So let's check all the things we call on scheduler_tick():
> > > > 
> > > > _ sched_clock_tick(): maybe it doesn't need to be called when idle. I'm not sure.
> > > >   Some code in idle (timers, irqs, ...) might need to call sched_clock().
> > > 
> > > Only needed if you've got a shady TSC.
> > 
> > Yeh.. IMO, this can be done without the tick handling during nohz, with the
> > patch I am attaching. Could you check the patch? Or do we have to handle it
> > remotely, too? (for a crazy TSC)
> 
> I think NOHZ_FULL already requires the TSC not to be wrecked.

It sounds good!

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/4] sched: Don't account tickless CPU load on tick
  2016-02-01 10:34                     ` [PATCH 1/4] sched: Don't account tickless CPU load on tick Peter Zijlstra
  2016-02-01 23:51                       ` Byungchul Park
@ 2016-02-02  0:50                       ` Byungchul Park
  1 sibling, 0 replies; 58+ messages in thread
From: Byungchul Park @ 2016-02-02  0:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, Chris Metcalf, Thomas Gleixner,
	Luiz Capitulino, Christoph Lameter, Paul E . McKenney,
	Mike Galbraith, Rik van Riel

On Mon, Feb 01, 2016 at 11:34:33AM +0100, Peter Zijlstra wrote:
> On Mon, Feb 01, 2016 at 07:05:13PM +0900, Byungchul Park wrote:
> > On Fri, Jan 29, 2016 at 10:50:16AM +0100, Peter Zijlstra wrote:
> > > On Thu, Jan 28, 2016 at 05:01:26PM +0100, Frederic Weisbecker wrote:
> > > > So let's check all the things we call on scheduler_tick():
> > > > 
> > > > _ sched_clock_tick(): maybe it doesn't need to be called when idle. I'm not sure.
> > > >   Some code in idle (timers, irqs, ...) might need to call sched_clock().
> > > 
> > > Only needed if you've got a shady TSC.
> > 
> > Yeh.. IMO, this can be done without the tick handling during nohz, with the
> > patch I am attaching. Could you check the patch? Or do we have to handle it
> > remotely, too? (for a crazy TSC)
> 
> I think NOHZ_FULL already requires the TSC not to be wrecked.

What about the regular NOHZ? Or does no code in idle call anything like
sched_clock_cpu() at all?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [tip:sched/core] sched/fair: Avoid using decay_load_missed() with a negative value
  2016-01-15  7:07       ` Byungchul Park
  2016-01-15 16:56         ` Dietmar Eggemann
@ 2016-02-29 11:14         ` tip-bot for Byungchul Park
  1 sibling, 0 replies; 58+ messages in thread
From: tip-bot for Byungchul Park @ 2016-02-29 11:14 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: cl, fweisbec, efault, paulmck, peterz, tglx, torvalds, riel,
	lcapitulino, byungchul.park, mingo, cmetcalf, linux-kernel,
	dietmar.eggemann, hpa

Commit-ID:  7400d3bbaa229eb8e7631d28fb34afd7cd2c96ff
Gitweb:     http://git.kernel.org/tip/7400d3bbaa229eb8e7631d28fb34afd7cd2c96ff
Author:     Byungchul Park <byungchul.park@lge.com>
AuthorDate: Fri, 15 Jan 2016 16:07:49 +0900
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 29 Feb 2016 09:53:03 +0100

sched/fair: Avoid using decay_load_missed() with a negative value

decay_load_missed() cannot handle negative values, so we need to prevent
calling the function with a negative value.
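
A toy example of how it went wrong: with cpu_load[i] = 10 and
tickless_load = 20, the old code computed old_load = 10 - 20 before
decaying, which wraps to a huge value in an unsigned long, so
decay_load_missed()'s table-driven shifts returned garbage instead of
something near the decayed true load.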

Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: perterz@infradead.org
Fixes: 59543275488d ("sched/fair: Prepare __update_cpu_load() to handle active tickless")
Link: http://lkml.kernel.org/r/20160115070749.GA1914@X58A-UD3R
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5641042..3b3dc39 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4509,9 +4509,17 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
 
 		/* scale is effectively 1 << i now, and >> i divides by scale */
 
-		old_load = this_rq->cpu_load[i] - tickless_load;
+		old_load = this_rq->cpu_load[i];
 		old_load = decay_load_missed(old_load, pending_updates - 1, i);
-		old_load += tickless_load;
+		if (tickless_load) {
+			old_load -= decay_load_missed(tickless_load, pending_updates - 1, i);
+			/*
+			 * old_load can never be a negative value because a
+			 * decayed tickless_load cannot be greater than the
+			 * original tickless_load.
+			 */
+			old_load += tickless_load;
+		}
 		new_load = this_load;
 		/*
 		 * Round up the averaging division if load is increasing. This

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [tip:sched/core] sched/fair: Consolidate nohz CPU load update code
  2016-01-13 16:01 ` [PATCH 2/4] sched: Consolidate nohz CPU load update code Frederic Weisbecker
  2016-01-14  2:30   ` Byungchul Park
  2016-01-14  5:18   ` Byungchul Park
@ 2016-02-29 11:14   ` tip-bot for Frederic Weisbecker
  2 siblings, 0 replies; 58+ messages in thread
From: tip-bot for Frederic Weisbecker @ 2016-02-29 11:14 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, byungchul.park, cl, cmetcalf, torvalds, peterz,
	hpa, efault, paulmck, mingo, lcapitulino, riel, tglx, fweisbec

Commit-ID:  be68a682c00dfa7733021e1da386ba7182fc7bb9
Gitweb:     http://git.kernel.org/tip/be68a682c00dfa7733021e1da386ba7182fc7bb9
Author:     Frederic Weisbecker <fweisbec@gmail.com>
AuthorDate: Wed, 13 Jan 2016 17:01:29 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 29 Feb 2016 09:53:04 +0100

sched/fair: Consolidate nohz CPU load update code

Let's factor out a bit of code there. We'll even have a third user soon.
While at it, standardize the idle update function name against the
others.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Byungchul Park <byungchul.park@lge.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1452700891-21807-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 48 +++++++++++++++++++++++++-----------------------
 1 file changed, 25 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3b3dc39..3313052 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4542,6 +4542,25 @@ static unsigned long weighted_cpuload(const int cpu)
 }
 
 #ifdef CONFIG_NO_HZ_COMMON
+static void __update_cpu_load_nohz(struct rq *this_rq,
+				   unsigned long curr_jiffies,
+				   unsigned long load,
+				   int active)
+{
+	unsigned long pending_updates;
+
+	pending_updates = curr_jiffies - this_rq->last_load_update_tick;
+	if (pending_updates) {
+		this_rq->last_load_update_tick = curr_jiffies;
+		/*
+		 * In the regular NOHZ case, we were idle, this means load 0.
+		 * In the NOHZ_FULL case, we were non-idle, we should consider
+		 * its weighted load.
+		 */
+		__update_cpu_load(this_rq, load, pending_updates, active);
+	}
+}
+
 /*
  * There is no sane way to deal with nohz on smp when using jiffies because the
  * cpu doing the jiffies update might drift wrt the cpu doing the jiffy reading
@@ -4559,22 +4578,15 @@ static unsigned long weighted_cpuload(const int cpu)
  * Called from nohz_idle_balance() to update the load ratings before doing the
  * idle balance.
  */
-static void update_idle_cpu_load(struct rq *this_rq)
+static void update_cpu_load_idle(struct rq *this_rq)
 {
-	unsigned long curr_jiffies = READ_ONCE(jiffies);
-	unsigned long load = weighted_cpuload(cpu_of(this_rq));
-	unsigned long pending_updates;
-
 	/*
 	 * bail if there's load or we're actually up-to-date.
 	 */
-	if (load || curr_jiffies == this_rq->last_load_update_tick)
+	if (weighted_cpuload(cpu_of(this_rq)))
 		return;
 
-	pending_updates = curr_jiffies - this_rq->last_load_update_tick;
-	this_rq->last_load_update_tick = curr_jiffies;
-
-	__update_cpu_load(this_rq, load, pending_updates, 0);
+	__update_cpu_load_nohz(this_rq, READ_ONCE(jiffies), 0, 0);
 }
 
 /*
@@ -4585,22 +4597,12 @@ void update_cpu_load_nohz(int active)
 	struct rq *this_rq = this_rq();
 	unsigned long curr_jiffies = READ_ONCE(jiffies);
 	unsigned long load = active ? weighted_cpuload(cpu_of(this_rq)) : 0;
-	unsigned long pending_updates;
 
 	if (curr_jiffies == this_rq->last_load_update_tick)
 		return;
 
 	raw_spin_lock(&this_rq->lock);
-	pending_updates = curr_jiffies - this_rq->last_load_update_tick;
-	if (pending_updates) {
-		this_rq->last_load_update_tick = curr_jiffies;
-		/*
-		 * In the regular NOHZ case, we were idle, this means load 0.
-		 * In the NOHZ_FULL case, we were non-idle, we should consider
-		 * its weighted load.
-		 */
-		__update_cpu_load(this_rq, load, pending_updates, active);
-	}
+	__update_cpu_load_nohz(this_rq, curr_jiffies, load, active);
 	raw_spin_unlock(&this_rq->lock);
 }
 #endif /* CONFIG_NO_HZ */
@@ -4612,7 +4614,7 @@ void update_cpu_load_active(struct rq *this_rq)
 {
 	unsigned long load = weighted_cpuload(cpu_of(this_rq));
 	/*
-	 * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
+	 * See the mess around update_cpu_load_idle() / update_cpu_load_nohz().
 	 */
 	this_rq->last_load_update_tick = jiffies;
 	__update_cpu_load(this_rq, load, 1, 1);
@@ -7906,7 +7908,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 		if (time_after_eq(jiffies, rq->next_balance)) {
 			raw_spin_lock_irq(&rq->lock);
 			update_rq_clock(rq);
-			update_idle_cpu_load(rq);
+			update_cpu_load_idle(rq);
 			raw_spin_unlock_irq(&rq->lock);
 			rebalance_domains(rq, CPU_IDLE);
 		}

^ permalink raw reply related	[flat|nested] 58+ messages in thread

end of thread

Thread overview: 58+ messages
2016-01-13 16:01 [RFC PATCH 0/4] sched: Improve cpu load accounting with nohz Frederic Weisbecker
2016-01-13 16:01 ` [PATCH 1/4] sched: Don't account tickless CPU load on tick Frederic Weisbecker
2016-01-19 13:08   ` Peter Zijlstra
2016-01-19 16:22     ` Frederic Weisbecker
2016-01-19 18:56       ` Peter Zijlstra
2016-01-19 22:33         ` Frederic Weisbecker
2016-01-20  5:43           ` Byungchul Park
2016-01-20 10:26             ` Byungchul Park
2016-01-28 16:01               ` Frederic Weisbecker
2016-01-29  9:50                 ` Peter Zijlstra
2016-02-01 10:05                   ` Byungchul Park
2016-02-01 10:09                     ` [PATCH] sched: calculate sched_clock_cpu without tick handling during nohz Byungchul Park
2016-02-01 10:34                     ` [PATCH 1/4] sched: Don't account tickless CPU load on tick Peter Zijlstra
2016-02-01 23:51                       ` Byungchul Park
2016-02-02  0:50                       ` Byungchul Park
2016-02-01  6:33               ` Byungchul Park
2016-01-20  8:42     ` Thomas Gleixner
2016-01-20 17:36       ` Frederic Weisbecker
2016-01-22  8:40       ` Byungchul Park
2016-01-13 16:01 ` [PATCH 2/4] sched: Consolidate nohz CPU load update code Frederic Weisbecker
2016-01-14  2:30   ` Byungchul Park
2016-01-19 13:13     ` Peter Zijlstra
2016-01-20  0:51       ` Byungchul Park
2016-01-14  5:18   ` Byungchul Park
2016-01-19 13:13     ` Peter Zijlstra
2016-01-19 16:49     ` Frederic Weisbecker
2016-01-20  1:41       ` Byungchul Park
2016-02-29 11:14   ` [tip:sched/core] sched/fair: " tip-bot for Frederic Weisbecker
2016-01-13 16:01 ` [RFC PATCH 3/4] sched: Move cpu load stats functions above fair queue callbacks Frederic Weisbecker
2016-01-13 16:01 ` [RFC PATCH 4/4] sched: Upload nohz full CPU load on task enqueue/dequeue Frederic Weisbecker
2016-01-19 13:17   ` Peter Zijlstra
2016-01-19 17:03     ` Frederic Weisbecker
2016-01-20  9:09       ` Peter Zijlstra
2016-01-20 14:54         ` Frederic Weisbecker
2016-01-20 15:11           ` Thomas Gleixner
2016-01-20 15:19             ` Christoph Lameter
2016-01-20 16:52               ` Frederic Weisbecker
2016-01-20 16:45             ` Frederic Weisbecker
2016-01-20 16:56           ` Peter Zijlstra
2016-01-20 17:21             ` Frederic Weisbecker
2016-01-20 18:25               ` Peter Zijlstra
2016-01-21 13:25                 ` Frederic Weisbecker
2016-01-20  9:03   ` Thomas Gleixner
2016-01-20 14:31     ` Frederic Weisbecker
2016-01-20 14:43       ` Thomas Gleixner
2016-01-20 16:40         ` Frederic Weisbecker
2016-01-20 16:42           ` Christoph Lameter
2016-01-20 16:47             ` Frederic Weisbecker
2016-01-14 21:19 ` [RFC PATCH 0/4] sched: Improve cpu load accounting with nohz Dietmar Eggemann
2016-01-14 21:27   ` Peter Zijlstra
2016-01-14 22:23     ` Dietmar Eggemann
2016-01-15  7:07       ` Byungchul Park
2016-01-15 16:56         ` Dietmar Eggemann
2016-01-18  0:23           ` Byungchul Park
2016-01-19 13:04           ` Peter Zijlstra
2016-01-20  0:48             ` Byungchul Park
2016-01-20 13:04               ` Dietmar Eggemann
2016-02-29 11:14         ` [tip:sched/core] sched/fair: Avoid using decay_load_missed() with a negative value tip-bot for Byungchul Park
