From: Peter Zijlstra <peterz@infradead.org>
To: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	linux-kernel@vger.kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Mike Galbraith <efault@gmx.de>, Paul Turner <pjt@google.com>,
	Chris Mason <clm@fb.com>,
	kernel-team@fb.com
Subject: Re: [PATCH v2 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
Date: Wed, 3 May 2017 20:00:28 +0200
Message-ID: <20170503180028.ejf73et3pc4meqji@hirez.programming.kicks-ass.net>
In-Reply-To: <20170424213324.GA23619@wtj.duckdns.org>



This is on my IVB-EP, 2 sockets, 10 cores / socket, 2 threads / core.

The workload is constrained to 1 socket.
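
For reference, my reading of the schbench arguments used throughout (from
memory, check schbench --help):

    -m 2      2 message threads
    -t 10     10 worker threads per message thread (20 workers total)
    -s 10000  sleep ~10000us between requests
    -c 15000  spin for ~15000us of 'work' per request
    -r 30     run for 30 seconds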

root@ivb-ep:~/bench/schbench# numactl -N 0 ./schbench -m 2 -t 10 -s 10000 -c 15000 -r 30
Latency percentiles (usec)
        50.0000th: 21
        75.0000th: 30
        90.0000th: 38
        95.0000th: 42
        *99.0000th: 49
        99.5000th: 53
        99.9000th: 15024
        min=0, max=15056


root@ivb-ep:~/bench/schbench# echo NO_FUDGE > /debug/sched_features; echo NO_FUDGE2 > /debug/sched_features ; mkdir /cgroup/ponies; echo $$ > /cgroup/ponies/tasks ; numactl -N 0 ./schbench -m 2 -t 10 -s 10000 -c 15000 -r 30 ; echo $$ > /cgroup/tasks ; rmdir /cgroup/ponies
Latency percentiles (usec)
        50.0000th: 14
        75.0000th: 19
        90.0000th: 24
        95.0000th: 28
        *99.0000th: 57
        99.5000th: 15024
        99.9000th: 15024
        min=0, max=15060


root@ivb-ep:~/bench/schbench# echo NO_FUDGE > /debug/sched_features; echo FUDGE2 > /debug/sched_features ; mkdir /cgroup/ponies; echo $$ > /cgroup/ponies/tasks ; numactl -N 0 ./schbench -m 2 -t 10 -s 10000 -c 15000 -r 30 ; echo $$ > /cgroup/tasks ; rmdir /cgroup/ponies
Latency percentiles (usec)
        50.0000th: 14
        75.0000th: 19
        90.0000th: 24
        95.0000th: 26
        *99.0000th: 38
        99.5000th: 49
        99.9000th: 9648
        min=0, max=15035


root@ivb-ep:~/bench/schbench# echo FUDGE > /debug/sched_features; echo FUDGE2 > /debug/sched_features ; mkdir /cgroup/ponies; echo $$ > /cgroup/ponies/tasks ; numactl -N 0 ./schbench -m 2 -t 10 -s 10000 -c 15000 -r 30 ; echo $$ > /cgroup/tasks ; rmdir /cgroup/ponies
Latency percentiles (usec)
        50.0000th: 14
        75.0000th: 19
        90.0000th: 24
        95.0000th: 27
        *99.0000th: 3060
        99.5000th: 7848
        99.9000th: 15024
        min=0, max=15041


root@ivb-ep:~/bench/schbench# echo 0 > /sys/module/fair/parameters/prop_type
root@ivb-ep:~/bench/schbench# echo FUDGE > /debug/sched_features; echo FUDGE2 > /debug/sched_features ; mkdir /cgroup/ponies; echo $$ > /cgroup/ponies/tasks ; numactl -N 0 ./schbench -m 2 -t 10 -s 10000 -c 15000 -r 30 ; echo $$ > /cgroup/tasks ; rmdir /cgroup/ponies
Latency percentiles (usec)
        50.0000th: 14
        75.0000th: 19
        90.0000th: 24
        95.0000th: 27
        *99.0000th: 52
        99.5000th: 4712
        99.9000th: 14640
        min=0, max=15033


Just FUDGE2 on its own seems to work best on my system, and it is a change
that makes sense (and something Paul recently pointed out as well).

The implementation isn't particularly pretty or fast, but should
illustrate the idea.

Poking at the whole update_tg_cfs_load() thing only makes it worse after
that. And while I agree that that code is mind-bending, it seems to work
OK-ish.
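
For those not staring at the code: FUDGE makes update_tg_cfs_load() take the
propagated value from calc_cfs_shares() on the group cfs_rq instead of doing
its own tg_load correction. Ignoring the MIN_SHARES/tg->shares clamp, that
boils down to something like this (sketch only, helper name made up):

	/* child's contribution to the parent, as its share of tg->shares */
	static long fudge_propagated_load(long tg_shares, long tg_load,
					  long gcfs_load)
	{
		if (tg_load < 1)	/* tg_load already includes gcfs_load */
			tg_load = 1;

		return tg_shares * gcfs_load / tg_load;
	}

where gcfs_load is the group cfs_rq's load or runnable load, depending on
prop_type.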

Tejun, Vincent, could you guys have a poke?

The thing is that if we assume se->avg.load_avg is correct, we should
already compute a correct cfs_rq->runnable_load_avg; we do all that
propagation right.

But because se->avg.load_avg is stuck in the 'past' (its sum is based on its
old weight), things don't quite work out. If, on the other hand, we treat it
as a runnable_sum and scale it with the weight, it seems to work out fine.
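
Spelled out, the FUDGE2 reweight boils down to the following (stripped-down
sketch, no locking and no PELT decay, name made up):

	/* rescale a PELT contribution when the entity's weight changes */
	static unsigned long rescale_load(unsigned long load,
					  unsigned long old_weight,
					  unsigned long new_weight)
	{
		if (!old_weight)
			old_weight = 1;

		/* load_{avg,sum} behave as weight * runnable fraction */
		return load * new_weight / old_weight;
	}

with the old contribution subtracted from the cfs_rq sums before the weight
change and the rescaled one added back after, which is what the FUDGE2 blocks
in reweight_entity() below do.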

Arguably sys_nice and all related crud should do the same, but nobody
really uses nice at any frequency, whereas we constantly change the
weight of our group entities.

---
 kernel/sched/fair.c     | 184 +++++++++++++++++++++++++++++++-----------------
 kernel/sched/features.h |   2 +
 2 files changed, 122 insertions(+), 64 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd6c3f9..d6a33e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -32,6 +32,7 @@
 #include <linux/mempolicy.h>
 #include <linux/migrate.h>
 #include <linux/task_work.h>
+#include <linux/moduleparam.h>
 
 #include <trace/events/sched.h>
 
@@ -2632,16 +2633,39 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 # ifdef CONFIG_SMP
-static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
+
+enum shares_type {
+	shares_runnable,
+	shares_avg,
+	shares_weight,
+};
+
+static long calc_cfs_shares(struct cfs_rq *cfs_rq, enum shares_type shares_type)
 {
-	long tg_weight, load, shares;
+	long tg_weight, tg_shares, load, shares;
+	struct task_group *tg = cfs_rq->tg;
 
-	/*
-	 * This really should be: cfs_rq->avg.load_avg, but instead we use
-	 * cfs_rq->load.weight, which is its upper bound. This helps ramp up
-	 * the shares for small weight interactive tasks.
-	 */
-	load = scale_load_down(cfs_rq->load.weight);
+	tg_shares = READ_ONCE(tg->shares);
+
+	switch (shares_type) {
+	case shares_runnable:
+		load = cfs_rq->runnable_load_avg;
+		break;
+
+	default:
+	case shares_avg:
+		load = cfs_rq->avg.load_avg;
+		break;
+
+	case shares_weight:
+		/*
+		 * This really should be: cfs_rq->avg.load_avg, but instead we
+		 * use cfs_rq->load.weight, which is its upper bound. This
+		 * helps ramp up the shares for small weight interactive tasks.
+		 */
+		load = scale_load_down(cfs_rq->load.weight);
+		break;
+	}
 
 	tg_weight = atomic_long_read(&tg->load_avg);
 
@@ -2665,23 +2689,33 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 	 * case no task is runnable on a CPU MIN_SHARES=2 should be returned
 	 * instead of 0.
 	 */
-	if (shares < MIN_SHARES)
-		shares = MIN_SHARES;
-	if (shares > tg->shares)
-		shares = tg->shares;
-
-	return shares;
-}
-# else /* CONFIG_SMP */
-static inline long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
-{
-	return tg->shares;
+	return clamp_t(long, shares, MIN_SHARES, tg_shares);
 }
 # endif /* CONFIG_SMP */
 
+/*
+ * Unsigned subtract and clamp on underflow.
+ *
+ * Explicitly do a load-store to ensure the intermediate value never hits
+ * memory. This allows lockless observations without ever seeing the negative
+ * values.
+ */
+#define sub_positive(_ptr, _val) do {				\
+	typeof(_ptr) ptr = (_ptr);				\
+	typeof(*ptr) val = (_val);				\
+	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
+	res = var - val;					\
+	if (res > var)						\
+		res = 0;					\
+	WRITE_ONCE(*ptr, res);					\
+} while (0)
+
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 			    unsigned long weight)
 {
+	if (se->load.weight == weight)
+		return;
+
 	if (se->on_rq) {
 		/* commit outstanding execution time */
 		if (cfs_rq->curr == se)
@@ -2689,10 +2723,40 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		account_entity_dequeue(cfs_rq, se);
 	}
 
+	if (sched_feat(FUDGE2)) {
+		unsigned long new_weight = max(scale_load_down(weight), 1UL);
+		unsigned long old_weight = max(scale_load_down(se->load.weight), 1UL);
+
+		sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
+		sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
+
+		if (se->on_rq) {
+			sub_positive(&cfs_rq->runnable_load_avg, se->avg.load_avg);
+			sub_positive(&cfs_rq->runnable_load_sum, se->avg.load_sum);
+		}
+
+		se->avg.load_avg *= new_weight;
+		se->avg.load_sum *= new_weight;
+
+		se->avg.load_avg /= old_weight;
+		se->avg.load_sum /= old_weight;
+	}
+
 	update_load_set(&se->load, weight);
 
-	if (se->on_rq)
+	if (se->on_rq) {
 		account_entity_enqueue(cfs_rq, se);
+	}
+
+	if (sched_feat(FUDGE2)) {
+		cfs_rq->avg.load_avg += se->avg.load_avg;
+		cfs_rq->avg.load_sum += se->avg.load_sum;
+
+		if (se->on_rq) {
+			cfs_rq->runnable_load_avg += se->avg.load_avg;
+			cfs_rq->runnable_load_sum += se->avg.load_sum;
+		}
+	}
 }
 
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
@@ -2700,7 +2764,6 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 static void update_cfs_shares(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = group_cfs_rq(se);
-	struct task_group *tg;
 	long shares;
 
 	if (!cfs_rq)
@@ -2709,13 +2772,14 @@ static void update_cfs_shares(struct sched_entity *se)
 	if (throttled_hierarchy(cfs_rq))
 		return;
 
-	tg = cfs_rq->tg;
-
 #ifndef CONFIG_SMP
-	if (likely(se->load.weight == tg->shares))
+	shares = READ_ONCE(cfs_rq->tg->shares);
+
+	if (likely(se->load.weight == shares))
 		return;
+#else
+	shares = calc_cfs_shares(cfs_rq, shares_weight);
 #endif
-	shares = calc_cfs_shares(cfs_rq, tg);
 
 	reweight_entity(cfs_rq_of(se), se, shares);
 }
@@ -3070,42 +3134,51 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
 }
 
+static int prop_type = shares_avg;
+
+module_param(prop_type, int, 0644);
+
 /* Take into account change of load of a child task group */
 static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long delta, load = gcfs_rq->avg.load_avg;
+	long delta, load;
 
 	/*
 	 * If the load of group cfs_rq is null, the load of the
 	 * sched_entity will also be null so we can skip the formula
 	 */
-	if (load) {
-		long tg_load;
+	if (!sched_feat(FUDGE)) {
+		load = gcfs_rq->avg.load_avg;
+		if (load) {
+			long tg_load;
 
-		/* Get tg's load and ensure tg_load > 0 */
-		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
+			/* Get tg's load and ensure tg_load > 0 */
+			tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
 
-		/* Ensure tg_load >= load and updated with current load*/
-		tg_load -= gcfs_rq->tg_load_avg_contrib;
-		tg_load += load;
+			/* Ensure tg_load >= load and updated with current load*/
+			tg_load -= gcfs_rq->tg_load_avg_contrib;
+			tg_load += load;
 
-		/*
-		 * We need to compute a correction term in the case that the
-		 * task group is consuming more CPU than a task of equal
-		 * weight. A task with a weight equals to tg->shares will have
-		 * a load less or equal to scale_load_down(tg->shares).
-		 * Similarly, the sched_entities that represent the task group
-		 * at parent level, can't have a load higher than
-		 * scale_load_down(tg->shares). And the Sum of sched_entities'
-		 * load must be <= scale_load_down(tg->shares).
-		 */
-		if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
-			/* scale gcfs_rq's load into tg's shares*/
-			load *= scale_load_down(gcfs_rq->tg->shares);
-			load /= tg_load;
+			/*
+			 * We need to compute a correction term in the case that the
+			 * task group is consuming more CPU than a task of equal
+			 * weight. A task with a weight equals to tg->shares will have
+			 * a load less or equal to scale_load_down(tg->shares).
+			 * Similarly, the sched_entities that represent the task group
+			 * at parent level, can't have a load higher than
+			 * scale_load_down(tg->shares). And the Sum of sched_entities'
+			 * load must be <= scale_load_down(tg->shares).
+			 */
+			if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
+				/* scale gcfs_rq's load into tg's shares*/
+				load *= scale_load_down(gcfs_rq->tg->shares);
+				load /= tg_load;
+			}
 		}
+	} else {
+		load = calc_cfs_shares(gcfs_rq, prop_type);
 	}
 
 	delta = load - se->avg.load_avg;
@@ -3236,23 +3309,6 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 	}
 }
 
-/*
- * Unsigned subtract and clamp on underflow.
- *
- * Explicitly do a load-store to ensure the intermediate value never hits
- * memory. This allows lockless observations without ever seeing the negative
- * values.
- */
-#define sub_positive(_ptr, _val) do {				\
-	typeof(_ptr) ptr = (_ptr);				\
-	typeof(*ptr) val = (_val);				\
-	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
-	res = var - val;					\
-	if (res > var)						\
-		res = 0;					\
-	WRITE_ONCE(*ptr, res);					\
-} while (0)
-
 /**
  * update_cfs_rq_load_avg - update the cfs_rq's load/util averages
  * @now: current time, as per cfs_rq_clock_task()
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index dc4d148..4c517b4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -80,3 +80,5 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
+SCHED_FEAT(FUDGE, true)
+SCHED_FEAT(FUDGE2, true)
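
For anyone wanting to reproduce the numbers above: FUDGE and FUDGE2 are
flipped through /debug/sched_features (debugfs) as in the command lines, and
the module parameter (only consulted on the FUDGE path) follows the
shares_type enum:

	echo 0 > /sys/module/fair/parameters/prop_type	# shares_runnable
	echo 1 > /sys/module/fair/parameters/prop_type	# shares_avg (default)
	echo 2 > /sys/module/fair/parameters/prop_type	# shares_weight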
