linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/2] SCHED_IDLE extensions
@ 2021-07-30  2:00 Josh Don
  2021-07-30  2:00 ` [PATCH v2 1/2] sched: cgroup SCHED_IDLE support Josh Don
  2021-07-30  2:00 ` [PATCH 2/2] sched: adjust SCHED_IDLE interactions Josh Don
  0 siblings, 2 replies; 20+ messages in thread
From: Josh Don @ 2021-07-30  2:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, Oleg Rombakh,
	Viresh Kumar, Steve Sistare, Tejun Heo, Rik van Riel,
	linux-kernel, Josh Don

This patch series contains improvements/extensions for SCHED_IDLE.

The first patch of the series is the updated v2 of the previously mailed
patch to add cgroup support for SCHED_IDLE.

The second patch is a new patch to adjust some SCHED_IDLE semantics.

Josh Don (2):
  sched: cgroup SCHED_IDLE support
  sched: adjust SCHED_IDLE interactions

 kernel/sched/core.c  |  25 +++++
 kernel/sched/debug.c |   3 +
 kernel/sched/fair.c  | 229 +++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |   8 ++
 4 files changed, 234 insertions(+), 31 deletions(-)

-- 
2.32.0.554.ge1b32706d8-goog



* [PATCH v2 1/2] sched: cgroup SCHED_IDLE support
  2021-07-30  2:00 [PATCH v2 0/2] SCHED_IDLE extensions Josh Don
@ 2021-07-30  2:00 ` Josh Don
  2021-08-03  2:14   ` jun qian
                     ` (3 more replies)
  2021-07-30  2:00 ` [PATCH 2/2] sched: adjust SCHED_IDLE interactions Josh Don
  1 sibling, 4 replies; 20+ messages in thread
From: Josh Don @ 2021-07-30  2:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, Oleg Rombakh,
	Viresh Kumar, Steve Sistare, Tejun Heo, Rik van Riel,
	linux-kernel, Josh Don

This extends SCHED_IDLE to cgroups.

Interface: cgroup/cpu.idle.
 0: default behavior
 1: SCHED_IDLE

Extending SCHED_IDLE to cgroups means that we incorporate the existing
aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its
descendant threads towards the idle_h_nr_running count of all of its
ancestor cgroups. Thus, sched_idle_rq() will work properly.
Additionally, SCHED_IDLE cgroups are configured with minimum weight.

There are two key differences between the per-task and per-cgroup
SCHED_IDLE interface:

- The cgroup interface allows tasks within a SCHED_IDLE hierarchy to
maintain their relative weights. The entity that is "idle" is the
cgroup, not the tasks themselves.

- Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption
decision is not made by comparing the current task with the woken task,
but rather by comparing their matching sched_entity.

A typical use-case for this is a user that creates an idle and a
non-idle subtree. The non-idle subtree will dominate competition vs
the idle subtree, but the idle subtree will still be high priority
vs other users on the system. The latter is accomplished via comparing
matching sched_entity in the wakeup preemption path (this could also be
improved by making the sched_idle_rq() decision dependent on the
perspective of a specific task).
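
As a rough usage sketch of the interface (the cgroup-v2 mount point and
group names here are only example assumptions; the parent cgroup must
already exist with the cpu controller enabled), marking one subtree as
idle from userspace looks like:

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        /* Two sibling subtrees under an existing parent cgroup. */
        mkdir("/sys/fs/cgroup/app/serving", 0755);
        mkdir("/sys/fs/cgroup/app/best_effort", 0755);

        /* Everything under best_effort now competes as SCHED_IDLE. */
        int fd = open("/sys/fs/cgroup/app/best_effort/cpu.idle", O_WRONLY);
        if (fd >= 0) {
                write(fd, "1", 1);
                close(fd);
        }
        return 0;
}

Tasks inside best_effort keep their relative weights with respect to
each other; it is the subtree as a whole that competes as idle against
its siblings.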

For now, we maintain the existing SCHED_IDLE semantics. Future patches
may make improvements that extend how we treat SCHED_IDLE entities.

The per-task_group idle field is an integer that currently only holds
either a 0 or a 1. This is explicitly typed as an integer to allow for
further extensions to this API. For example, a negative value may
indicate a highly latency-sensitive cgroup that should be preferred for
preemption/placement/etc.

Signed-off-by: Josh Don <joshdon@google.com>
---
v2:
- Use WEIGHT_IDLEPRIO for the idle cgroup weight
- Add cgroup-v1 support

 kernel/sched/core.c  |  25 ++++++
 kernel/sched/debug.c |   3 +
 kernel/sched/fair.c  | 197 +++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |   8 ++
 4 files changed, 208 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cd004d542419..0e1fb9206c02 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10120,6 +10120,20 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static s64 cpu_idle_read_s64(struct cgroup_subsys_state *css,
+			       struct cftype *cft)
+{
+	return css_tg(css)->idle;
+}
+
+static int cpu_idle_write_s64(struct cgroup_subsys_state *css,
+				struct cftype *cft, s64 idle)
+{
+	return sched_group_set_idle(css_tg(css), idle);
+}
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -10127,6 +10141,11 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_shares_read_u64,
 		.write_u64 = cpu_shares_write_u64,
 	},
+	{
+		.name = "idle",
+		.read_s64 = cpu_idle_read_s64,
+		.write_s64 = cpu_idle_write_s64,
+	},
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
@@ -10334,6 +10353,12 @@ static struct cftype cpu_files[] = {
 		.read_s64 = cpu_weight_nice_read_s64,
 		.write_s64 = cpu_weight_nice_write_s64,
 	},
+	{
+		.name = "idle",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_s64 = cpu_idle_read_s64,
+		.write_s64 = cpu_idle_write_s64,
+	},
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 7e08e3d947c2..49716228efb4 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -607,6 +607,9 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_spread_over",
 			cfs_rq->nr_spread_over);
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
+	SEQ_printf(m, "  .%-30s: %d\n", "h_nr_running", cfs_rq->h_nr_running);
+	SEQ_printf(m, "  .%-30s: %d\n", "idle_h_nr_running",
+			cfs_rq->idle_h_nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_SMP
 	SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6bf0889efced..a7feae1cb0f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -431,6 +431,23 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 	}
 }
 
+static int tg_is_idle(struct task_group *tg)
+{
+	return tg->idle > 0;
+}
+
+static int cfs_rq_is_idle(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->idle > 0;
+}
+
+static int se_is_idle(struct sched_entity *se)
+{
+	if (entity_is_task(se))
+		return task_has_idle_policy(task_of(se));
+	return cfs_rq_is_idle(group_cfs_rq(se));
+}
+
 #else	/* !CONFIG_FAIR_GROUP_SCHED */
 
 #define for_each_sched_entity(se) \
@@ -468,6 +485,21 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 {
 }
 
+static int tg_is_idle(struct task_group *tg)
+{
+	return 0;
+}
+
+static int cfs_rq_is_idle(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
+
+static int se_is_idle(struct sched_entity *se)
+{
+	return 0;
+}
+
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
 static __always_inline
@@ -4841,6 +4873,9 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 
 		dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
 
+		if (cfs_rq_is_idle(group_cfs_rq(se)))
+			idle_task_delta = cfs_rq->h_nr_running;
+
 		qcfs_rq->h_nr_running -= task_delta;
 		qcfs_rq->idle_h_nr_running -= idle_task_delta;
 
@@ -4860,6 +4895,9 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		update_load_avg(qcfs_rq, se, 0);
 		se_update_runnable(se);
 
+		if (cfs_rq_is_idle(group_cfs_rq(se)))
+			idle_task_delta = cfs_rq->h_nr_running;
+
 		qcfs_rq->h_nr_running -= task_delta;
 		qcfs_rq->idle_h_nr_running -= idle_task_delta;
 	}
@@ -4904,39 +4942,45 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	task_delta = cfs_rq->h_nr_running;
 	idle_task_delta = cfs_rq->idle_h_nr_running;
 	for_each_sched_entity(se) {
+		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
+
 		if (se->on_rq)
 			break;
-		cfs_rq = cfs_rq_of(se);
-		enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
+		enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
+
+		if (cfs_rq_is_idle(group_cfs_rq(se)))
+			idle_task_delta = cfs_rq->h_nr_running;
 
-		cfs_rq->h_nr_running += task_delta;
-		cfs_rq->idle_h_nr_running += idle_task_delta;
+		qcfs_rq->h_nr_running += task_delta;
+		qcfs_rq->idle_h_nr_running += idle_task_delta;
 
 		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
+		if (cfs_rq_throttled(qcfs_rq))
 			goto unthrottle_throttle;
 	}
 
 	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
 
-		update_load_avg(cfs_rq, se, UPDATE_TG);
+		update_load_avg(qcfs_rq, se, UPDATE_TG);
 		se_update_runnable(se);
 
-		cfs_rq->h_nr_running += task_delta;
-		cfs_rq->idle_h_nr_running += idle_task_delta;
+		if (cfs_rq_is_idle(group_cfs_rq(se)))
+			idle_task_delta = cfs_rq->h_nr_running;
 
+		qcfs_rq->h_nr_running += task_delta;
+		qcfs_rq->idle_h_nr_running += idle_task_delta;
 
 		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
+		if (cfs_rq_throttled(qcfs_rq))
 			goto unthrottle_throttle;
 
 		/*
 		 * One parent has been throttled and cfs_rq removed from the
 		 * list. Add it back to not break the leaf list.
 		 */
-		if (throttled_hierarchy(cfs_rq))
-			list_add_leaf_cfs_rq(cfs_rq);
+		if (throttled_hierarchy(qcfs_rq))
+			list_add_leaf_cfs_rq(qcfs_rq);
 	}
 
 	/* At this point se is NULL and we are at root level*/
@@ -4949,9 +4993,9 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	 * assertion below.
 	 */
 	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
 
-		if (list_add_leaf_cfs_rq(cfs_rq))
+		if (list_add_leaf_cfs_rq(qcfs_rq))
 			break;
 	}
 
@@ -5574,6 +5618,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_running++;
 		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 
+		if (cfs_rq_is_idle(cfs_rq))
+			idle_h_nr_running = 1;
+
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
 			goto enqueue_throttle;
@@ -5591,6 +5638,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_running++;
 		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 
+		if (cfs_rq_is_idle(cfs_rq))
+			idle_h_nr_running = 1;
+
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
 			goto enqueue_throttle;
@@ -5668,6 +5718,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_running--;
 		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 
+		if (cfs_rq_is_idle(cfs_rq))
+			idle_h_nr_running = 1;
+
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
 			goto dequeue_throttle;
@@ -5697,6 +5750,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_running--;
 		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 
+		if (cfs_rq_is_idle(cfs_rq))
+			idle_h_nr_running = 1;
+
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
 			goto dequeue_throttle;
@@ -7041,24 +7097,22 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
 
 static void set_last_buddy(struct sched_entity *se)
 {
-	if (entity_is_task(se) && unlikely(task_has_idle_policy(task_of(se))))
-		return;
-
 	for_each_sched_entity(se) {
 		if (SCHED_WARN_ON(!se->on_rq))
 			return;
+		if (se_is_idle(se))
+			return;
 		cfs_rq_of(se)->last = se;
 	}
 }
 
 static void set_next_buddy(struct sched_entity *se)
 {
-	if (entity_is_task(se) && unlikely(task_has_idle_policy(task_of(se))))
-		return;
-
 	for_each_sched_entity(se) {
 		if (SCHED_WARN_ON(!se->on_rq))
 			return;
+		if (se_is_idle(se))
+			return;
 		cfs_rq_of(se)->next = se;
 	}
 }
@@ -7079,6 +7133,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 	int scale = cfs_rq->nr_running >= sched_nr_latency;
 	int next_buddy_marked = 0;
+	int cse_is_idle, pse_is_idle;
 
 	if (unlikely(se == pse))
 		return;
@@ -7123,8 +7178,21 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		return;
 
 	find_matching_se(&se, &pse);
-	update_curr(cfs_rq_of(se));
 	BUG_ON(!pse);
+
+	cse_is_idle = se_is_idle(se);
+	pse_is_idle = se_is_idle(pse);
+
+	/*
+	 * Preempt an idle group in favor of a non-idle group (and don't preempt
+	 * in the inverse case).
+	 */
+	if (cse_is_idle && !pse_is_idle)
+		goto preempt;
+	if (cse_is_idle != pse_is_idle)
+		return;
+
+	update_curr(cfs_rq_of(se));
 	if (wakeup_preempt_entity(se, pse) == 1) {
 		/*
 		 * Bias pick_next to pick the sched entity that is
@@ -11418,10 +11486,12 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 
 static DEFINE_MUTEX(shares_mutex);
 
-int sched_group_set_shares(struct task_group *tg, unsigned long shares)
+static int __sched_group_set_shares(struct task_group *tg, unsigned long shares)
 {
 	int i;
 
+	lockdep_assert_held(&shares_mutex);
+
 	/*
 	 * We can't change the weight of the root cgroup.
 	 */
@@ -11430,9 +11500,8 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 
 	shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
 
-	mutex_lock(&shares_mutex);
 	if (tg->shares == shares)
-		goto done;
+		return 0;
 
 	tg->shares = shares;
 	for_each_possible_cpu(i) {
@@ -11450,10 +11519,88 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 		rq_unlock_irqrestore(rq, &rf);
 	}
 
-done:
+	return 0;
+}
+
+int sched_group_set_shares(struct task_group *tg, unsigned long shares)
+{
+	int ret;
+
+	mutex_lock(&shares_mutex);
+	if (tg_is_idle(tg))
+		ret = -EINVAL;
+	else
+		ret = __sched_group_set_shares(tg, shares);
+	mutex_unlock(&shares_mutex);
+
+	return ret;
+}
+
+int sched_group_set_idle(struct task_group *tg, long idle)
+{
+	int i;
+
+	if (tg == &root_task_group)
+		return -EINVAL;
+
+	if (idle < 0 || idle > 1)
+		return -EINVAL;
+
+	mutex_lock(&shares_mutex);
+
+	if (tg->idle == idle) {
+		mutex_unlock(&shares_mutex);
+		return 0;
+	}
+
+	tg->idle = idle;
+
+	for_each_possible_cpu(i) {
+		struct rq *rq = cpu_rq(i);
+		struct sched_entity *se = tg->se[i];
+		struct cfs_rq *grp_cfs_rq = tg->cfs_rq[i];
+		bool was_idle = cfs_rq_is_idle(grp_cfs_rq);
+		long idle_task_delta;
+		struct rq_flags rf;
+
+		rq_lock_irqsave(rq, &rf);
+
+		grp_cfs_rq->idle = idle;
+		if (WARN_ON_ONCE(was_idle == cfs_rq_is_idle(grp_cfs_rq)))
+			goto next_cpu;
+
+		idle_task_delta = grp_cfs_rq->h_nr_running -
+				  grp_cfs_rq->idle_h_nr_running;
+		if (!cfs_rq_is_idle(grp_cfs_rq))
+			idle_task_delta *= -1;
+
+		for_each_sched_entity(se) {
+			struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+			if (!se->on_rq)
+				break;
+
+			cfs_rq->idle_h_nr_running += idle_task_delta;
+
+			/* Already accounted at parent level and above. */
+			if (cfs_rq_is_idle(cfs_rq))
+				break;
+		}
+
+next_cpu:
+		rq_unlock_irqrestore(rq, &rf);
+	}
+
+	/* Idle groups have minimum weight. */
+	if (tg_is_idle(tg))
+		__sched_group_set_shares(tg, scale_load(WEIGHT_IDLEPRIO));
+	else
+		__sched_group_set_shares(tg, NICE_0_LOAD);
+
 	mutex_unlock(&shares_mutex);
 	return 0;
 }
+
 #else /* CONFIG_FAIR_GROUP_SCHED */
 
 void free_fair_sched_group(struct task_group *tg) { }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d9f8d73a1d84..8dfad8fb756c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -396,6 +396,9 @@ struct task_group {
 	struct cfs_rq		**cfs_rq;
 	unsigned long		shares;
 
+	/* A positive value indicates that this is a SCHED_IDLE group. */
+	int			idle;
+
 #ifdef	CONFIG_SMP
 	/*
 	 * load_avg can be heavily contended at clock tick time, so put
@@ -505,6 +508,8 @@ extern void sched_move_task(struct task_struct *tsk);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
 
+extern int sched_group_set_idle(struct task_group *tg, long idle);
+
 #ifdef CONFIG_SMP
 extern void set_task_rq_fair(struct sched_entity *se,
 			     struct cfs_rq *prev, struct cfs_rq *next);
@@ -601,6 +606,9 @@ struct cfs_rq {
 	struct list_head	leaf_cfs_rq_list;
 	struct task_group	*tg;	/* group that "owns" this runqueue */
 
+	/* Locally cached copy of our task_group's idle value */
+	int			idle;
+
 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
 	s64			runtime_remaining;
-- 
2.32.0.554.ge1b32706d8-goog



* [PATCH 2/2] sched: adjust SCHED_IDLE interactions
  2021-07-30  2:00 [PATCH v2 0/2] SCHED_IDLE extensions Josh Don
  2021-07-30  2:00 ` [PATCH v2 1/2] sched: cgroup SCHED_IDLE support Josh Don
@ 2021-07-30  2:00 ` Josh Don
  2021-08-11 13:31   ` Vincent Guittot
  2021-08-16 12:31   ` Peter Zijlstra
  1 sibling, 2 replies; 20+ messages in thread
From: Josh Don @ 2021-07-30  2:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, Oleg Rombakh,
	Viresh Kumar, Steve Sistare, Tejun Heo, Rik van Riel,
	linux-kernel, Josh Don

This patch makes some behavioral changes when SCHED_IDLE entities are
competing with non SCHED_IDLE entities.

1) Ignore min_granularity for determining the sched_slice of a
SCHED_IDLE entity when it is competing with a non SCHED_IDLE entity.
This reduces the latency of getting a non SCHED_IDLE entity back on cpu,
at the expense of increased context switch frequency of SCHED_IDLE
entities.

In steady state competition between SCHED_IDLE/non-SCHED_IDLE,
preemption is driven by the tick, so SCHED_IDLE min_granularity is
approximately bounded on the low end by the tick HZ.

Example: on a machine with HZ=1000, spawned two threads, one of which is
SCHED_IDLE, and affined to one cpu. Without this patch, the SCHED_IDLE
thread runs for 4ms then waits for 1.4s. With this patch, it runs for
1ms and waits 340ms (as it round-robins with the other thread).

The benefit of this change is to reduce the round-robin latency for non
SCHED_IDLE entities when competing with a SCHED_IDLE entity.

2) Don't give sleeper credit to SCHED_IDLE entities when they wake onto
a cfs_rq with non SCHED_IDLE entities. As a result, newly woken
SCHED_IDLE entities will take longer to preempt non SCHED_IDLE entities.

Example: spawned four threads affined to one cpu, one of which was set
to SCHED_IDLE. Without this patch, wakeup latency for the SCHED_IDLE
thread was ~1-2ms, with the patch the wakeup latency was ~10ms.

The benefit of this change is to make it less likely that a newly woken
SCHED_IDLE entity will preempt a short-running non SCHED_IDLE entity
before it blocks.

Signed-off-by: Josh Don <joshdon@google.com>
---
 kernel/sched/fair.c | 32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a7feae1cb0f0..24b2c6c057e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -674,6 +674,7 @@ static u64 __sched_period(unsigned long nr_running)
 static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	unsigned int nr_running = cfs_rq->nr_running;
+	struct sched_entity *init_se = se;
 	u64 slice;
 
 	if (sched_feat(ALT_PERIOD))
@@ -684,12 +685,13 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	for_each_sched_entity(se) {
 		struct load_weight *load;
 		struct load_weight lw;
+		struct cfs_rq *qcfs_rq;
 
-		cfs_rq = cfs_rq_of(se);
-		load = &cfs_rq->load;
+		qcfs_rq = cfs_rq_of(se);
+		load = &qcfs_rq->load;
 
 		if (unlikely(!se->on_rq)) {
-			lw = cfs_rq->load;
+			lw = qcfs_rq->load;
 
 			update_load_add(&lw, se->load.weight);
 			load = &lw;
@@ -697,8 +699,18 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		slice = __calc_delta(slice, se->load.weight, load);
 	}
 
-	if (sched_feat(BASE_SLICE))
-		slice = max(slice, (u64)sysctl_sched_min_granularity);
+	if (sched_feat(BASE_SLICE)) {
+		/*
+		 * SCHED_IDLE entities are not subject to min_granularity if
+		 * they are competing with non SCHED_IDLE entities. As a result,
+		 * non SCHED_IDLE entities will have reduced latency to get back
+		 * on cpu, at the cost of increased context switch frequency of
+		 * SCHED_IDLE entities.
+		 */
+		if (!se_is_idle(init_se) ||
+		    cfs_rq->h_nr_running == cfs_rq->idle_h_nr_running)
+			slice = max(slice, (u64)sysctl_sched_min_granularity);
+	}
 
 	return slice;
 }
@@ -4216,7 +4228,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 		if (sched_feat(GENTLE_FAIR_SLEEPERS))
 			thresh >>= 1;
 
-		vruntime -= thresh;
+		/*
+		 * Don't give sleep credit to a SCHED_IDLE entity if we're
+		 * placing it onto a cfs_rq with non SCHED_IDLE entities.
+		 */
+		if (!se_is_idle(se) ||
+		    cfs_rq->h_nr_running == cfs_rq->idle_h_nr_running)
+			vruntime -= thresh;
+		else
+			vruntime += 1;
 	}
 
 	/* ensure we never gain time by being placed backwards. */
-- 
2.32.0.554.ge1b32706d8-goog



* Re: [PATCH v2 1/2] sched: cgroup SCHED_IDLE support
  2021-07-30  2:00 ` [PATCH v2 1/2] sched: cgroup SCHED_IDLE support Josh Don
@ 2021-08-03  2:14   ` jun qian
  2021-08-03 20:37     ` Josh Don
  2021-08-05 10:18   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: jun qian @ 2021-08-03  2:14 UTC (permalink / raw)
  To: Josh Don
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, Oleg Rombakh,
	Viresh Kumar, Steve Sistare, Tejun Heo, Rik van Riel,
	linux-kernel

hi Josh

Josh Don <joshdon@google.com> wrote on Fri, Jul 30, 2021 at 10:03 AM:
>
> This extends SCHED_IDLE to cgroups.
>
> Interface: cgroup/cpu.idle.
>  0: default behavior
>  1: SCHED_IDLE
>
> Extending SCHED_IDLE to cgroups means that we incorporate the existing
> aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its
> descendant threads towards the idle_h_nr_running count of all of its
> ancestor cgroups. Thus, sched_idle_rq() will work properly.
> Additionally, SCHED_IDLE cgroups are configured with minimum weight.
>
> There are two key differences between the per-task and per-cgroup
> SCHED_IDLE interface:
>
> - The cgroup interface allows tasks within a SCHED_IDLE hierarchy to
> maintain their relative weights. The entity that is "idle" is the
> cgroup, not the tasks themselves.
>
> - Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption
> decision is not made by comparing the current task with the woken task,
> but rather by comparing their matching sched_entity.
>
> A typical use-case for this is a user that creates an idle and a
> non-idle subtree. The non-idle subtree will dominate competition vs
> the idle subtree, but the idle subtree will still be high priority
> vs other users on the system. The latter is accomplished via comparing
> matching sched_entity in the wakeup preemption path (this could also be
> improved by making the sched_idle_rq() decision dependent on the
> perspective of a specific task).
>
> For now, we maintain the existing SCHED_IDLE semantics. Future patches
> may make improvements that extend how we treat SCHED_IDLE entities.
>
> The per-task_group idle field is an integer that currently only holds
> either a 0 or a 1. This is explicitly typed as an integer to allow for
> further extensions to this API. For example, a negative value may
> indicate a highly latency-sensitive cgroup that should be preferred for
> preemption/placement/etc.
>
> Signed-off-by: Josh Don <joshdon@google.com>
> ---
> v2:
> - Use WEIGHT_IDLEPRIO for the idle cgroup weight
> - Add cgroup-v1 support
>
>  kernel/sched/core.c  |  25 ++++++
>  kernel/sched/debug.c |   3 +
>  kernel/sched/fair.c  | 197 +++++++++++++++++++++++++++++++++++++------
>  kernel/sched/sched.h |   8 ++
>  4 files changed, 208 insertions(+), 25 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index cd004d542419..0e1fb9206c02 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10120,6 +10120,20 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
>  }
>  #endif /* CONFIG_RT_GROUP_SCHED */
>
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +static s64 cpu_idle_read_s64(struct cgroup_subsys_state *css,
> +                              struct cftype *cft)
> +{
> +       return css_tg(css)->idle;
> +}
> +
> +static int cpu_idle_write_s64(struct cgroup_subsys_state *css,
> +                               struct cftype *cft, s64 idle)
> +{
> +       return sched_group_set_idle(css_tg(css), idle);
> +}
> +#endif
> +
>  static struct cftype cpu_legacy_files[] = {
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>         {
> @@ -10127,6 +10141,11 @@ static struct cftype cpu_legacy_files[] = {
>                 .read_u64 = cpu_shares_read_u64,
>                 .write_u64 = cpu_shares_write_u64,
>         },
> +       {
> +               .name = "idle",
> +               .read_s64 = cpu_idle_read_s64,
> +               .write_s64 = cpu_idle_write_s64,
> +       },
>  #endif
>  #ifdef CONFIG_CFS_BANDWIDTH
>         {
> @@ -10334,6 +10353,12 @@ static struct cftype cpu_files[] = {
>                 .read_s64 = cpu_weight_nice_read_s64,
>                 .write_s64 = cpu_weight_nice_write_s64,
>         },
> +       {
> +               .name = "idle",
> +               .flags = CFTYPE_NOT_ON_ROOT,
> +               .read_s64 = cpu_idle_read_s64,
> +               .write_s64 = cpu_idle_write_s64,
> +       },
>  #endif
>  #ifdef CONFIG_CFS_BANDWIDTH
>         {
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 7e08e3d947c2..49716228efb4 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -607,6 +607,9 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
>         SEQ_printf(m, "  .%-30s: %d\n", "nr_spread_over",
>                         cfs_rq->nr_spread_over);
>         SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
> +       SEQ_printf(m, "  .%-30s: %d\n", "h_nr_running", cfs_rq->h_nr_running);
> +       SEQ_printf(m, "  .%-30s: %d\n", "idle_h_nr_running",
> +                       cfs_rq->idle_h_nr_running);
>         SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
>  #ifdef CONFIG_SMP
>         SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6bf0889efced..a7feae1cb0f0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -431,6 +431,23 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
>         }
>  }
>
> +static int tg_is_idle(struct task_group *tg)
> +{
> +       return tg->idle > 0;
> +}
> +
> +static int cfs_rq_is_idle(struct cfs_rq *cfs_rq)
> +{
> +       return cfs_rq->idle > 0;
> +}
> +
> +static int se_is_idle(struct sched_entity *se)
> +{
> +       if (entity_is_task(se))
> +               return task_has_idle_policy(task_of(se));
> +       return cfs_rq_is_idle(group_cfs_rq(se));
> +}
> +
>  #else  /* !CONFIG_FAIR_GROUP_SCHED */
>
>  #define for_each_sched_entity(se) \
> @@ -468,6 +485,21 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
>  {
>  }
>
> +static int tg_is_idle(struct task_group *tg)
> +{
> +       return 0;
> +}
> +
> +static int cfs_rq_is_idle(struct cfs_rq *cfs_rq)
> +{
> +       return 0;
> +}
> +
> +static int se_is_idle(struct sched_entity *se)
> +{
> +       return 0;
> +}
> +
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>
>  static __always_inline
> @@ -4841,6 +4873,9 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
>
>                 dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
>
> +               if (cfs_rq_is_idle(group_cfs_rq(se)))
> +                       idle_task_delta = cfs_rq->h_nr_running;
> +
>                 qcfs_rq->h_nr_running -= task_delta;
>                 qcfs_rq->idle_h_nr_running -= idle_task_delta;
>
> @@ -4860,6 +4895,9 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
>                 update_load_avg(qcfs_rq, se, 0);
>                 se_update_runnable(se);
>
> +               if (cfs_rq_is_idle(group_cfs_rq(se)))
> +                       idle_task_delta = cfs_rq->h_nr_running;
> +
>                 qcfs_rq->h_nr_running -= task_delta;
>                 qcfs_rq->idle_h_nr_running -= idle_task_delta;
>         }
> @@ -4904,39 +4942,45 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>         task_delta = cfs_rq->h_nr_running;
>         idle_task_delta = cfs_rq->idle_h_nr_running;
>         for_each_sched_entity(se) {
> +               struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> +
>                 if (se->on_rq)
>                         break;
> -               cfs_rq = cfs_rq_of(se);
> -               enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
> +               enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
> +
> +               if (cfs_rq_is_idle(group_cfs_rq(se)))
> +                       idle_task_delta = cfs_rq->h_nr_running;
>
> -               cfs_rq->h_nr_running += task_delta;
> -               cfs_rq->idle_h_nr_running += idle_task_delta;
> +               qcfs_rq->h_nr_running += task_delta;
> +               qcfs_rq->idle_h_nr_running += idle_task_delta;
>
>                 /* end evaluation on encountering a throttled cfs_rq */
> -               if (cfs_rq_throttled(cfs_rq))
> +               if (cfs_rq_throttled(qcfs_rq))
>                         goto unthrottle_throttle;
>         }
>
>         for_each_sched_entity(se) {
> -               cfs_rq = cfs_rq_of(se);
> +               struct cfs_rq *qcfs_rq = cfs_rq_of(se);
>
> -               update_load_avg(cfs_rq, se, UPDATE_TG);
> +               update_load_avg(qcfs_rq, se, UPDATE_TG);
>                 se_update_runnable(se);
>
> -               cfs_rq->h_nr_running += task_delta;
> -               cfs_rq->idle_h_nr_running += idle_task_delta;
> +               if (cfs_rq_is_idle(group_cfs_rq(se)))
> +                       idle_task_delta = cfs_rq->h_nr_running;
>
> +               qcfs_rq->h_nr_running += task_delta;
> +               qcfs_rq->idle_h_nr_running += idle_task_delta;
>
>                 /* end evaluation on encountering a throttled cfs_rq */
> -               if (cfs_rq_throttled(cfs_rq))
> +               if (cfs_rq_throttled(qcfs_rq))
>                         goto unthrottle_throttle;
>
>                 /*
>                  * One parent has been throttled and cfs_rq removed from the
>                  * list. Add it back to not break the leaf list.
>                  */
> -               if (throttled_hierarchy(cfs_rq))
> -                       list_add_leaf_cfs_rq(cfs_rq);
> +               if (throttled_hierarchy(qcfs_rq))
> +                       list_add_leaf_cfs_rq(qcfs_rq);
>         }
>
>         /* At this point se is NULL and we are at root level*/
> @@ -4949,9 +4993,9 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>          * assertion below.
>          */
>         for_each_sched_entity(se) {
> -               cfs_rq = cfs_rq_of(se);
> +               struct cfs_rq *qcfs_rq = cfs_rq_of(se);
>
> -               if (list_add_leaf_cfs_rq(cfs_rq))
> +               if (list_add_leaf_cfs_rq(qcfs_rq))
>                         break;
>         }
>
> @@ -5574,6 +5618,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                 cfs_rq->h_nr_running++;
>                 cfs_rq->idle_h_nr_running += idle_h_nr_running;
>
> +               if (cfs_rq_is_idle(cfs_rq))
> +                       idle_h_nr_running = 1;
> +
>                 /* end evaluation on encountering a throttled cfs_rq */
>                 if (cfs_rq_throttled(cfs_rq))
>                         goto enqueue_throttle;
> @@ -5591,6 +5638,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                 cfs_rq->h_nr_running++;
>                 cfs_rq->idle_h_nr_running += idle_h_nr_running;
>
> +               if (cfs_rq_is_idle(cfs_rq))
> +                       idle_h_nr_running = 1;
> +
>                 /* end evaluation on encountering a throttled cfs_rq */
>                 if (cfs_rq_throttled(cfs_rq))
>                         goto enqueue_throttle;
> @@ -5668,6 +5718,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                 cfs_rq->h_nr_running--;
>                 cfs_rq->idle_h_nr_running -= idle_h_nr_running;
>
> +               if (cfs_rq_is_idle(cfs_rq))
> +                       idle_h_nr_running = 1;
> +
>                 /* end evaluation on encountering a throttled cfs_rq */
>                 if (cfs_rq_throttled(cfs_rq))
>                         goto dequeue_throttle;
> @@ -5697,6 +5750,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                 cfs_rq->h_nr_running--;
>                 cfs_rq->idle_h_nr_running -= idle_h_nr_running;
>
> +               if (cfs_rq_is_idle(cfs_rq))
> +                       idle_h_nr_running = 1;
> +
>                 /* end evaluation on encountering a throttled cfs_rq */
>                 if (cfs_rq_throttled(cfs_rq))
>                         goto dequeue_throttle;
> @@ -7041,24 +7097,22 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
>
>  static void set_last_buddy(struct sched_entity *se)
>  {
> -       if (entity_is_task(se) && unlikely(task_has_idle_policy(task_of(se))))
> -               return;
> -
>         for_each_sched_entity(se) {
>                 if (SCHED_WARN_ON(!se->on_rq))
>                         return;
> +               if (se_is_idle(se))
> +                       return;
>                 cfs_rq_of(se)->last = se;
>         }
>  }
>
>  static void set_next_buddy(struct sched_entity *se)
>  {
> -       if (entity_is_task(se) && unlikely(task_has_idle_policy(task_of(se))))
> -               return;
> -
>         for_each_sched_entity(se) {
>                 if (SCHED_WARN_ON(!se->on_rq))
>                         return;
> +               if (se_is_idle(se))
> +                       return;
>                 cfs_rq_of(se)->next = se;
>         }
>  }
> @@ -7079,6 +7133,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>         struct cfs_rq *cfs_rq = task_cfs_rq(curr);
>         int scale = cfs_rq->nr_running >= sched_nr_latency;
>         int next_buddy_marked = 0;
> +       int cse_is_idle, pse_is_idle;
>
>         if (unlikely(se == pse))
>                 return;
> @@ -7123,8 +7178,21 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>                 return;
>
>         find_matching_se(&se, &pse);
> -       update_curr(cfs_rq_of(se));
>         BUG_ON(!pse);
> +
> +       cse_is_idle = se_is_idle(se);
> +       pse_is_idle = se_is_idle(pse);
> +
> +       /*
> +        * Preempt an idle group in favor of a non-idle group (and don't preempt
> +        * in the inverse case).
> +        */
> +       if (cse_is_idle && !pse_is_idle)
> +               goto preempt;

At the next schedule it will pick the next entity. How do we ensure
that the entity selected next time will not be an idle entity? I think
we need to do something for idle groups when picking the next entity;
for example, when we pick the next entity, try to choose a normal
group as much as possible.

> +       if (cse_is_idle != pse_is_idle)
> +               return;
> +
> +       update_curr(cfs_rq_of(se));
>         if (wakeup_preempt_entity(se, pse) == 1) {
>                 /*
>                  * Bias pick_next to pick the sched entity that is
> @@ -11418,10 +11486,12 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
>
>  static DEFINE_MUTEX(shares_mutex);
>
> -int sched_group_set_shares(struct task_group *tg, unsigned long shares)
> +static int __sched_group_set_shares(struct task_group *tg, unsigned long shares)
>  {
>         int i;
>
> +       lockdep_assert_held(&shares_mutex);
> +
>         /*
>          * We can't change the weight of the root cgroup.
>          */
> @@ -11430,9 +11500,8 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
>
>         shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
>
> -       mutex_lock(&shares_mutex);
>         if (tg->shares == shares)
> -               goto done;
> +               return 0;
>
>         tg->shares = shares;
>         for_each_possible_cpu(i) {
> @@ -11450,10 +11519,88 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
>                 rq_unlock_irqrestore(rq, &rf);
>         }
>
> -done:
> +       return 0;
> +}
> +
> +int sched_group_set_shares(struct task_group *tg, unsigned long shares)
> +{
> +       int ret;
> +
> +       mutex_lock(&shares_mutex);
> +       if (tg_is_idle(tg))
> +               ret = -EINVAL;
> +       else
> +               ret = __sched_group_set_shares(tg, shares);
> +       mutex_unlock(&shares_mutex);
> +
> +       return ret;
> +}
> +
> +int sched_group_set_idle(struct task_group *tg, long idle)
> +{
> +       int i;
> +
> +       if (tg == &root_task_group)
> +               return -EINVAL;
> +
> +       if (idle < 0 || idle > 1)
> +               return -EINVAL;
> +
> +       mutex_lock(&shares_mutex);
> +
> +       if (tg->idle == idle) {
> +               mutex_unlock(&shares_mutex);
> +               return 0;
> +       }
> +
> +       tg->idle = idle;
> +
> +       for_each_possible_cpu(i) {
> +               struct rq *rq = cpu_rq(i);
> +               struct sched_entity *se = tg->se[i];
> +               struct cfs_rq *grp_cfs_rq = tg->cfs_rq[i];
> +               bool was_idle = cfs_rq_is_idle(grp_cfs_rq);
> +               long idle_task_delta;
> +               struct rq_flags rf;
> +
> +               rq_lock_irqsave(rq, &rf);
> +
> +               grp_cfs_rq->idle = idle;
> +               if (WARN_ON_ONCE(was_idle == cfs_rq_is_idle(grp_cfs_rq)))
> +                       goto next_cpu;
> +
> +               idle_task_delta = grp_cfs_rq->h_nr_running -
> +                                 grp_cfs_rq->idle_h_nr_running;
> +               if (!cfs_rq_is_idle(grp_cfs_rq))
> +                       idle_task_delta *= -1;
> +
> +               for_each_sched_entity(se) {
> +                       struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +                       if (!se->on_rq)
> +                               break;
> +
> +                       cfs_rq->idle_h_nr_running += idle_task_delta;
> +
> +                       /* Already accounted at parent level and above. */
> +                       if (cfs_rq_is_idle(cfs_rq))
> +                               break;
> +               }
> +
> +next_cpu:
> +               rq_unlock_irqrestore(rq, &rf);
> +       }
> +
> +       /* Idle groups have minimum weight. */
> +       if (tg_is_idle(tg))
> +               __sched_group_set_shares(tg, scale_load(WEIGHT_IDLEPRIO));
> +       else
> +               __sched_group_set_shares(tg, NICE_0_LOAD);
> +
>         mutex_unlock(&shares_mutex);
>         return 0;
>  }
> +
>  #else /* CONFIG_FAIR_GROUP_SCHED */
>
>  void free_fair_sched_group(struct task_group *tg) { }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index d9f8d73a1d84..8dfad8fb756c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -396,6 +396,9 @@ struct task_group {
>         struct cfs_rq           **cfs_rq;
>         unsigned long           shares;
>
> +       /* A positive value indicates that this is a SCHED_IDLE group. */
> +       int                     idle;
> +
>  #ifdef CONFIG_SMP
>         /*
>          * load_avg can be heavily contended at clock tick time, so put
> @@ -505,6 +508,8 @@ extern void sched_move_task(struct task_struct *tsk);
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
>
> +extern int sched_group_set_idle(struct task_group *tg, long idle);
> +
>  #ifdef CONFIG_SMP
>  extern void set_task_rq_fair(struct sched_entity *se,
>                              struct cfs_rq *prev, struct cfs_rq *next);
> @@ -601,6 +606,9 @@ struct cfs_rq {
>         struct list_head        leaf_cfs_rq_list;
>         struct task_group       *tg;    /* group that "owns" this runqueue */
>
> +       /* Locally cached copy of our task_group's idle value */
> +       int                     idle;
> +
>  #ifdef CONFIG_CFS_BANDWIDTH
>         int                     runtime_enabled;
>         s64                     runtime_remaining;
> --
> 2.32.0.554.ge1b32706d8-goog
>

Do we need to do something in the sched tick for the situation with
idle entities?

Thanks


* Re: [PATCH v2 1/2] sched: cgroup SCHED_IDLE support
  2021-08-03  2:14   ` jun qian
@ 2021-08-03 20:37     ` Josh Don
  0 siblings, 0 replies; 20+ messages in thread
From: Josh Don @ 2021-08-03 20:37 UTC (permalink / raw)
  To: jun qian
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, Oleg Rombakh,
	Viresh Kumar, Steve Sistare, Tejun Heo, Rik van Riel,
	linux-kernel

Hi Jun,

> > @@ -7123,8 +7178,21 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> >                 return;
> >
> >         find_matching_se(&se, &pse);
> > -       update_curr(cfs_rq_of(se));
> >         BUG_ON(!pse);
> > +
> > +       cse_is_idle = se_is_idle(se);
> > +       pse_is_idle = se_is_idle(pse);
> > +
> > +       /*
> > +        * Preempt an idle group in favor of a non-idle group (and don't preempt
> > +        * in the inverse case).
> > +        */
> > +       if (cse_is_idle && !pse_is_idle)
> > +               goto preempt;
>
> In the next schedule, it will pick next entity, How do we ensure that
> the entity selected next time will not be an idle
> entity. So I think we need to do something to idle group when picking
> next entity, For example, when we pick next
> entity, we try to choose normal group as much as possible.

Yep, exactly, this is achieved by adjusting the weight of idle
entities to the minimum value (see sched_group_set_idle()).

> Do we need to do something in the sched tick for the situation with
> idle entities?

check_preempt_tick() should already be sufficient, given that vruntime
for idle entities will advance quickly, and in patch 2/2 I adjusted the
sched_slice for idle entities to allow slices smaller than
min_granularity. Did you have a more specific concern?
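
For intuition, a rough userspace sketch (not kernel code) of why that
is: CFS scales executed time by NICE_0_LOAD/weight when accruing
vruntime, and an idle entity runs at WEIGHT_IDLEPRIO, so it burns
virtual time roughly 340x faster than a nice-0 entity:

#include <stdio.h>

int main(void)
{
        const unsigned long nice_0_load = 1024;
        const unsigned long idle_weight = 3;     /* WEIGHT_IDLEPRIO */
        unsigned long delta_exec_ns = 1000000;   /* 1ms of real runtime */

        /* vruntime delta ~= delta_exec * NICE_0_LOAD / weight */
        printf("nice-0 entity: +%lu ns vruntime\n",
               delta_exec_ns * nice_0_load / nice_0_load);
        printf("idle entity:   +%lu ns vruntime\n",
               delta_exec_ns * nice_0_load / idle_weight);
        return 0;
}

So even when an idle entity does get on cpu, its vruntime races ahead
and check_preempt_tick() evicts it quickly.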

Best,
Josh


* Re: [PATCH v2 1/2] sched: cgroup SCHED_IDLE support
  2021-07-30  2:00 ` [PATCH v2 1/2] sched: cgroup SCHED_IDLE support Josh Don
  2021-08-03  2:14   ` jun qian
@ 2021-08-05 10:18   ` Peter Zijlstra
  2021-08-05 17:13     ` Tejun Heo
  2021-08-11 13:48   ` Vincent Guittot
  2021-08-23  9:26   ` [tip: sched/core] sched: Cgroup " tip-bot2 for Josh Don
  3 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2021-08-05 10:18 UTC (permalink / raw)
  To: Josh Don
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, Oleg Rombakh,
	Viresh Kumar, Steve Sistare, Tejun Heo, Rik van Riel,
	linux-kernel

On Thu, Jul 29, 2021 at 07:00:18PM -0700, Josh Don wrote:
> This extends SCHED_IDLE to cgroups.
> 
> Interface: cgroup/cpu.idle.
>  0: default behavior
>  1: SCHED_IDLE
> 
> Extending SCHED_IDLE to cgroups means that we incorporate the existing
> aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its
> descendant threads towards the idle_h_nr_running count of all of its
> ancestor cgroups. Thus, sched_idle_rq() will work properly.
> Additionally, SCHED_IDLE cgroups are configured with minimum weight.
> 
> There are two key differences between the per-task and per-cgroup
> SCHED_IDLE interface:
> 
> - The cgroup interface allows tasks within a SCHED_IDLE hierarchy to
> maintain their relative weights. The entity that is "idle" is the
> cgroup, not the tasks themselves.
> 
> - Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption
> decision is not made by comparing the current task with the woken task,
> but rather by comparing their matching sched_entity.
> 
> A typical use-case for this is a user that creates an idle and a
> non-idle subtree. The non-idle subtree will dominate competition vs
> the idle subtree, but the idle subtree will still be high priority
> vs other users on the system. The latter is accomplished via comparing
> matching sched_entity in the wakeup preemption path (this could also be
> improved by making the sched_idle_rq() decision dependent on the
> perspective of a specific task).
> 
> For now, we maintain the existing SCHED_IDLE semantics. Future patches
> may make improvements that extend how we treat SCHED_IDLE entities.
> 
> The per-task_group idle field is an integer that currently only holds
> either a 0 or a 1. This is explicitly typed as an integer to allow for
> further extensions to this API. For example, a negative value may
> indicate a highly latency-sensitive cgroup that should be preferred for
> preemption/placement/etc.
> 
> Signed-off-by: Josh Don <joshdon@google.com>

So I'm tempted to apply this, but last time TJ wasn't liking it much for
the interface or somesuch. His argument was that this encodes the
hierarchical scheduling behaviour, but I'm not really buying that
argument, as it doesn't really add more constraints than we already have
with the hierarchical relative weight.


* Re: [PATCH v2 1/2] sched: cgroup SCHED_IDLE support
  2021-08-05 10:18   ` Peter Zijlstra
@ 2021-08-05 17:13     ` Tejun Heo
  2021-08-05 23:54       ` Josh Don
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2021-08-05 17:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Don, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, Oleg Rombakh,
	Viresh Kumar, Steve Sistare, Rik van Riel, linux-kernel

Hello,

On Thu, Aug 05, 2021 at 12:18:49PM +0200, Peter Zijlstra wrote:
> So I'm tempted to apply this, but last time TJ wasn't liking it much for
> the interface or somesuch. His argument was that this encodes the
> hierarchical scheduling behaviour, but I'm not really buying that
> argument, as it doesn't really add more constraints than we already have
> with the hierarchical relative weight.

Interface-wise, writing 1 to the idle file is fine. This does move the
interface away from being a purely semantic weight tree, which is a
concern, and I have a difficult time seeing that the stated benefits
are significant enough to justify the changes. That said, this is
primarily a scheduler decision, so if you think the addition is
justified, please go ahead.

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun


* Re: [PATCH v2 1/2] sched: cgroup SCHED_IDLE support
  2021-08-05 17:13     ` Tejun Heo
@ 2021-08-05 23:54       ` Josh Don
  0 siblings, 0 replies; 20+ messages in thread
From: Josh Don @ 2021-08-05 23:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, Oleg Rombakh,
	Viresh Kumar, Steve Sistare, Rik van Riel, linux-kernel

On Thu, Aug 5, 2021 at 10:13 AM Tejun Heo <tj@kernel.org> wrote:
>
> Interface-wise, writing 1 to the idle file is fine. This does move the
> interface away from being a purely semantic weight tree, which is a
> concern, and I have a difficult time seeing that the stated benefits
> are significant enough to justify the changes. That said, this is
> primarily a scheduler decision, so if you think the addition is
> justified, please go ahead.

Thanks for taking a look Tejun. I won't restate the use-cases from
past threads here, but ideally the capabilities introduced in this
series will make SCHED_IDLE more useful (internally, we've certainly
found that to be the case :)).

Best,
Josh


* Re: [PATCH 2/2] sched: adjust SCHED_IDLE interactions
  2021-07-30  2:00 ` [PATCH 2/2] sched: adjust SCHED_IDLE interactions Josh Don
@ 2021-08-11 13:31   ` Vincent Guittot
  2021-08-12 21:09     ` Josh Don
  2021-08-16 12:52     ` Peter Zijlstra
  2021-08-16 12:31   ` Peter Zijlstra
  1 sibling, 2 replies; 20+ messages in thread
From: Vincent Guittot @ 2021-08-11 13:31 UTC (permalink / raw)
  To: Josh Don
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, Oleg Rombakh,
	Viresh Kumar, Steve Sistare, Tejun Heo, Rik van Riel,
	linux-kernel

On Fri, 30 Jul 2021 at 04:00, Josh Don <joshdon@google.com> wrote:
>
> This patch makes some behavioral changes when SCHED_IDLE entities are
> competing with non SCHED_IDLE entities.
>
> 1) Ignore min_granularity for determining the sched_slice of a
> SCHED_IDLE entity when it is competing with a non SCHED_IDLE entity.
> This reduces the latency of getting a non SCHED_IDLE entity back on cpu,
> at the expense of increased context switch frequency of SCHED_IDLE
> entities.
>
> In steady state competition between SCHED_IDLE/non-SCHED_IDLE,
> preemption is driven by the tick, so SCHED_IDLE min_granularity is
> approximately bounded on the low end by the tick HZ.
>
> Example: on a machine with HZ=1000, spawned two threads, one of which is
> SCHED_IDLE, and affined to one cpu. Without this patch, the SCHED_IDLE
> thread runs for 4ms then waits for 1.4s. With this patch, it runs for
> 1ms and waits 340ms (as it round-robins with the other thread).
>
> The benefit of this change is to reduce the round-robin latency for non
> SCHED_IDLE entities when competing with a SCHED_IDLE entity.
>
> 2) Don't give sleeper credit to SCHED_IDLE entities when they wake onto
> a cfs_rq with non SCHED_IDLE entities. As a result, newly woken
> SCHED_IDLE entities will take longer to preempt non SCHED_IDLE entities.
>
> Example: spawned four threads affined to one cpu, one of which was set
> to SCHED_IDLE. Without this patch, wakeup latency for the SCHED_IDLE
> thread was ~1-2ms, with the patch the wakeup latency was ~10ms.
>
> The benefit of this change is to make it less likely that a newly woken
> SCHED_IDLE entity will preempt a short-running non SCHED_IDLE entity
> before it blocks.
>
> Signed-off-by: Josh Don <joshdon@google.com>
> ---
>  kernel/sched/fair.c | 32 ++++++++++++++++++++++++++------
>  1 file changed, 26 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a7feae1cb0f0..24b2c6c057e6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -674,6 +674,7 @@ static u64 __sched_period(unsigned long nr_running)
>  static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>         unsigned int nr_running = cfs_rq->nr_running;
> +       struct sched_entity *init_se = se;
>         u64 slice;
>
>         if (sched_feat(ALT_PERIOD))
> @@ -684,12 +685,13 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>         for_each_sched_entity(se) {
>                 struct load_weight *load;
>                 struct load_weight lw;
> +               struct cfs_rq *qcfs_rq;
>
> -               cfs_rq = cfs_rq_of(se);
> -               load = &cfs_rq->load;
> +               qcfs_rq = cfs_rq_of(se);
> +               load = &qcfs_rq->load;
>
>                 if (unlikely(!se->on_rq)) {
> -                       lw = cfs_rq->load;
> +                       lw = qcfs_rq->load;
>
>                         update_load_add(&lw, se->load.weight);
>                         load = &lw;
> @@ -697,8 +699,18 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>                 slice = __calc_delta(slice, se->load.weight, load);
>         }
>
> -       if (sched_feat(BASE_SLICE))
> -               slice = max(slice, (u64)sysctl_sched_min_granularity);
> +       if (sched_feat(BASE_SLICE)) {
> +               /*
> +                * SCHED_IDLE entities are not subject to min_granularity if
> +                * they are competing with non SCHED_IDLE entities. As a result,
> +                * non SCHED_IDLE entities will have reduced latency to get back
> +                * on cpu, at the cost of increased context switch frequency of
> +                * SCHED_IDLE entities.
> +                */

Ensuring that an entity gets a minimum runtime was added to make sure
it has enough time to make forward progress.
If you exclude sched_idle entities from this min runtime, the
sched_slice of an idle entity will be really small.
I don't have the details of your example above, but I imagine it's a
16-CPU system, which means sysctl_sched_min_granularity=3.75ms and
explains the 4ms running time of an idle entity.
For a 16-CPU system, the sched_slice of an idle entity in your example
in the cover letter is: 6*(1+log2(16))*3/1027 = 87us. Of course this
becomes even worse with more threads and cgroups, or with threads at
nice prio -19.

This value is then used to set the next hrtimer event in SCHED_HRTICK,
and 87us is too small to make any progress.

The 1ms in your test comes from the tick, which could be a good
candidate for a min value, or from
normalized_sysctl_sched_min_granularity, which has the advantage of not
increasing with the number of CPUs.
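
Spelling that 87us figure out, as a rough userspace sketch of the
arithmetic only (assuming 16 CPUs, default tunables and the log2
tunable scaling, as above):

#include <stdio.h>

int main(void)
{
        /* sched period: 6ms latency scaled by 1 + ilog2(16) = 5 */
        unsigned long period_us = 6000 * 5;
        unsigned long w_idle = 3;       /* WEIGHT_IDLEPRIO */
        unsigned long w_nice0 = 1024;   /* the competing normal thread */

        /* slice = period * w_idle / (w_idle + w_nice0) */
        printf("idle slice: %lu us\n",
               period_us * w_idle / (w_idle + w_nice0));   /* ~87us */
        return 0;
}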

> +               if (!se_is_idle(init_se) ||
> +                   cfs_rq->h_nr_running == cfs_rq->idle_h_nr_running)
> +                       slice = max(slice, (u64)sysctl_sched_min_granularity);
> +       }
>
>         return slice;
>  }
> @@ -4216,7 +4228,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>                 if (sched_feat(GENTLE_FAIR_SLEEPERS))
>                         thresh >>= 1;
>
> -               vruntime -= thresh;
> +               /*
> +                * Don't give sleep credit to a SCHED_IDLE entity if we're
> +                * placing it onto a cfs_rq with non SCHED_IDLE entities.
> +                */
> +               if (!se_is_idle(se) ||
> +                   cfs_rq->h_nr_running == cfs_rq->idle_h_nr_running)

Can't this condition above create unfairness between idle entities?
idle thread 1 wakes up while a normal thread is running;
the normal thread sleeps immediately after;
idle thread 2 wakes up just after and gets sleeper credit compared to
the 1st one.

> +                       vruntime -= thresh;
> +               else
> +                       vruntime += 1;
>         }
>
>         /* ensure we never gain time by being placed backwards. */
> --
> 2.32.0.554.ge1b32706d8-goog
>


* Re: [PATCH v2 1/2] sched: cgroup SCHED_IDLE support
  2021-07-30  2:00 ` [PATCH v2 1/2] sched: cgroup SCHED_IDLE support Josh Don
  2021-08-03  2:14   ` jun qian
  2021-08-05 10:18   ` Peter Zijlstra
@ 2021-08-11 13:48   ` Vincent Guittot
  2021-08-23  9:26   ` [tip: sched/core] sched: Cgroup " tip-bot2 for Josh Don
  3 siblings, 0 replies; 20+ messages in thread
From: Vincent Guittot @ 2021-08-11 13:48 UTC (permalink / raw)
  To: Josh Don
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, Oleg Rombakh,
	Viresh Kumar, Steve Sistare, Tejun Heo, Rik van Riel,
	linux-kernel

On Fri, 30 Jul 2021 at 04:00, Josh Don <joshdon@google.com> wrote:
>
> This extends SCHED_IDLE to cgroups.
>
> Interface: cgroup/cpu.idle.
>  0: default behavior
>  1: SCHED_IDLE
>
> Extending SCHED_IDLE to cgroups means that we incorporate the existing
> aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its
> descendant threads towards the idle_h_nr_running count of all of its
> ancestor cgroups. Thus, sched_idle_rq() will work properly.
> Additionally, SCHED_IDLE cgroups are configured with minimum weight.
>
> There are two key differences between the per-task and per-cgroup
> SCHED_IDLE interface:
>
> - The cgroup interface allows tasks within a SCHED_IDLE hierarchy to
> maintain their relative weights. The entity that is "idle" is the
> cgroup, not the tasks themselves.
>
> - Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption
> decision is not made by comparing the current task with the woken task,
> but rather by comparing their matching sched_entity.
>
> A typical use-case for this is a user that creates an idle and a
> non-idle subtree. The non-idle subtree will dominate competition vs
> the idle subtree, but the idle subtree will still be high priority
> vs other users on the system. The latter is accomplished via comparing
> matching sched_entity in the waken preemption path (this could also be
> improved by making the sched_idle_rq() decision dependent on the
> perspective of a specific task).
>
> For now, we maintain the existing SCHED_IDLE semantics. Future patches
> may make improvements that extend how we treat SCHED_IDLE entities.
>
> The per-task_group idle field is an integer that currently only holds
> either a 0 or a 1. This is explicitly typed as an integer to allow for
> further extensions to this API. For example, a negative value may
> indicate a highly latency-sensitive cgroup that should be preferred for
> preemption/placement/etc.

I'm a bit skeptical of this possible future add-on, but it's not part of
this patchset, so I will not go further on this.

So it's quite nice to see sched_idle support being extended to cgroups
and becoming more useful and easy to use.
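
As a quick usage illustration of the interface described above (the
mount point and the "background" cgroup name are assumptions, not part
of the patch, and the cpu controller must be enabled for that cgroup),
marking a cgroup SCHED_IDLE is just a write of "1" to its cpu.idle file:

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* Hypothetical cgroup-v2 path; adjust to the actual cgroup. */
		const char *path = "/sys/fs/cgroup/background/cpu.idle";
		int fd = open(path, O_WRONLY);

		if (fd < 0) {
			perror("open cpu.idle");
			return 1;
		}

		/* Per the interface above: 1 = SCHED_IDLE, 0 = default behavior. */
		if (write(fd, "1", 1) != 1)
			perror("write cpu.idle");

		close(fd);
		return 0;
	}

Reading the file back returns the current value, and per the patch the
group's weight is forced to WEIGHT_IDLEPRIO while idle is set.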

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

>
> Signed-off-by: Josh Don <joshdon@google.com>
> ---
> v2:
> - Use WEIGHT_IDLEPRIO for the idle cgroup weight
> - Add cgroup-v1 support
>
>  kernel/sched/core.c  |  25 ++++++
>  kernel/sched/debug.c |   3 +
>  kernel/sched/fair.c  | 197 +++++++++++++++++++++++++++++++++++++------
>  kernel/sched/sched.h |   8 ++
>  4 files changed, 208 insertions(+), 25 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index cd004d542419..0e1fb9206c02 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10120,6 +10120,20 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
>  }
>  #endif /* CONFIG_RT_GROUP_SCHED */
>
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +static s64 cpu_idle_read_s64(struct cgroup_subsys_state *css,
> +                              struct cftype *cft)
> +{
> +       return css_tg(css)->idle;
> +}
> +
> +static int cpu_idle_write_s64(struct cgroup_subsys_state *css,
> +                               struct cftype *cft, s64 idle)
> +{
> +       return sched_group_set_idle(css_tg(css), idle);
> +}
> +#endif
> +
>  static struct cftype cpu_legacy_files[] = {
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>         {
> @@ -10127,6 +10141,11 @@ static struct cftype cpu_legacy_files[] = {
>                 .read_u64 = cpu_shares_read_u64,
>                 .write_u64 = cpu_shares_write_u64,
>         },
> +       {
> +               .name = "idle",
> +               .read_s64 = cpu_idle_read_s64,
> +               .write_s64 = cpu_idle_write_s64,
> +       },
>  #endif
>  #ifdef CONFIG_CFS_BANDWIDTH
>         {
> @@ -10334,6 +10353,12 @@ static struct cftype cpu_files[] = {
>                 .read_s64 = cpu_weight_nice_read_s64,
>                 .write_s64 = cpu_weight_nice_write_s64,
>         },
> +       {
> +               .name = "idle",
> +               .flags = CFTYPE_NOT_ON_ROOT,
> +               .read_s64 = cpu_idle_read_s64,
> +               .write_s64 = cpu_idle_write_s64,
> +       },
>  #endif
>  #ifdef CONFIG_CFS_BANDWIDTH
>         {
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 7e08e3d947c2..49716228efb4 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -607,6 +607,9 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
>         SEQ_printf(m, "  .%-30s: %d\n", "nr_spread_over",
>                         cfs_rq->nr_spread_over);
>         SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
> +       SEQ_printf(m, "  .%-30s: %d\n", "h_nr_running", cfs_rq->h_nr_running);
> +       SEQ_printf(m, "  .%-30s: %d\n", "idle_h_nr_running",
> +                       cfs_rq->idle_h_nr_running);
>         SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
>  #ifdef CONFIG_SMP
>         SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6bf0889efced..a7feae1cb0f0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -431,6 +431,23 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
>         }
>  }
>
> +static int tg_is_idle(struct task_group *tg)
> +{
> +       return tg->idle > 0;
> +}
> +
> +static int cfs_rq_is_idle(struct cfs_rq *cfs_rq)
> +{
> +       return cfs_rq->idle > 0;
> +}
> +
> +static int se_is_idle(struct sched_entity *se)
> +{
> +       if (entity_is_task(se))
> +               return task_has_idle_policy(task_of(se));
> +       return cfs_rq_is_idle(group_cfs_rq(se));
> +}
> +
>  #else  /* !CONFIG_FAIR_GROUP_SCHED */
>
>  #define for_each_sched_entity(se) \
> @@ -468,6 +485,21 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
>  {
>  }
>
> +static int tg_is_idle(struct task_group *tg)
> +{
> +       return 0;
> +}
> +
> +static int cfs_rq_is_idle(struct cfs_rq *cfs_rq)
> +{
> +       return 0;
> +}
> +
> +static int se_is_idle(struct sched_entity *se)
> +{
> +       return 0;
> +}
> +
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>
>  static __always_inline
> @@ -4841,6 +4873,9 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
>
>                 dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
>
> +               if (cfs_rq_is_idle(group_cfs_rq(se)))
> +                       idle_task_delta = cfs_rq->h_nr_running;
> +
>                 qcfs_rq->h_nr_running -= task_delta;
>                 qcfs_rq->idle_h_nr_running -= idle_task_delta;
>
> @@ -4860,6 +4895,9 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
>                 update_load_avg(qcfs_rq, se, 0);
>                 se_update_runnable(se);
>
> +               if (cfs_rq_is_idle(group_cfs_rq(se)))
> +                       idle_task_delta = cfs_rq->h_nr_running;
> +
>                 qcfs_rq->h_nr_running -= task_delta;
>                 qcfs_rq->idle_h_nr_running -= idle_task_delta;
>         }
> @@ -4904,39 +4942,45 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>         task_delta = cfs_rq->h_nr_running;
>         idle_task_delta = cfs_rq->idle_h_nr_running;
>         for_each_sched_entity(se) {
> +               struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> +
>                 if (se->on_rq)
>                         break;
> -               cfs_rq = cfs_rq_of(se);
> -               enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
> +               enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
> +
> +               if (cfs_rq_is_idle(group_cfs_rq(se)))
> +                       idle_task_delta = cfs_rq->h_nr_running;
>
> -               cfs_rq->h_nr_running += task_delta;
> -               cfs_rq->idle_h_nr_running += idle_task_delta;
> +               qcfs_rq->h_nr_running += task_delta;
> +               qcfs_rq->idle_h_nr_running += idle_task_delta;
>
>                 /* end evaluation on encountering a throttled cfs_rq */
> -               if (cfs_rq_throttled(cfs_rq))
> +               if (cfs_rq_throttled(qcfs_rq))
>                         goto unthrottle_throttle;
>         }
>
>         for_each_sched_entity(se) {
> -               cfs_rq = cfs_rq_of(se);
> +               struct cfs_rq *qcfs_rq = cfs_rq_of(se);
>
> -               update_load_avg(cfs_rq, se, UPDATE_TG);
> +               update_load_avg(qcfs_rq, se, UPDATE_TG);
>                 se_update_runnable(se);
>
> -               cfs_rq->h_nr_running += task_delta;
> -               cfs_rq->idle_h_nr_running += idle_task_delta;
> +               if (cfs_rq_is_idle(group_cfs_rq(se)))
> +                       idle_task_delta = cfs_rq->h_nr_running;
>
> +               qcfs_rq->h_nr_running += task_delta;
> +               qcfs_rq->idle_h_nr_running += idle_task_delta;
>
>                 /* end evaluation on encountering a throttled cfs_rq */
> -               if (cfs_rq_throttled(cfs_rq))
> +               if (cfs_rq_throttled(qcfs_rq))
>                         goto unthrottle_throttle;
>
>                 /*
>                  * One parent has been throttled and cfs_rq removed from the
>                  * list. Add it back to not break the leaf list.
>                  */
> -               if (throttled_hierarchy(cfs_rq))
> -                       list_add_leaf_cfs_rq(cfs_rq);
> +               if (throttled_hierarchy(qcfs_rq))
> +                       list_add_leaf_cfs_rq(qcfs_rq);
>         }
>
>         /* At this point se is NULL and we are at root level*/
> @@ -4949,9 +4993,9 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>          * assertion below.
>          */
>         for_each_sched_entity(se) {
> -               cfs_rq = cfs_rq_of(se);
> +               struct cfs_rq *qcfs_rq = cfs_rq_of(se);
>
> -               if (list_add_leaf_cfs_rq(cfs_rq))
> +               if (list_add_leaf_cfs_rq(qcfs_rq))
>                         break;
>         }
>
> @@ -5574,6 +5618,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                 cfs_rq->h_nr_running++;
>                 cfs_rq->idle_h_nr_running += idle_h_nr_running;
>
> +               if (cfs_rq_is_idle(cfs_rq))
> +                       idle_h_nr_running = 1;
> +
>                 /* end evaluation on encountering a throttled cfs_rq */
>                 if (cfs_rq_throttled(cfs_rq))
>                         goto enqueue_throttle;
> @@ -5591,6 +5638,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                 cfs_rq->h_nr_running++;
>                 cfs_rq->idle_h_nr_running += idle_h_nr_running;
>
> +               if (cfs_rq_is_idle(cfs_rq))
> +                       idle_h_nr_running = 1;
> +
>                 /* end evaluation on encountering a throttled cfs_rq */
>                 if (cfs_rq_throttled(cfs_rq))
>                         goto enqueue_throttle;
> @@ -5668,6 +5718,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                 cfs_rq->h_nr_running--;
>                 cfs_rq->idle_h_nr_running -= idle_h_nr_running;
>
> +               if (cfs_rq_is_idle(cfs_rq))
> +                       idle_h_nr_running = 1;
> +
>                 /* end evaluation on encountering a throttled cfs_rq */
>                 if (cfs_rq_throttled(cfs_rq))
>                         goto dequeue_throttle;
> @@ -5697,6 +5750,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                 cfs_rq->h_nr_running--;
>                 cfs_rq->idle_h_nr_running -= idle_h_nr_running;
>
> +               if (cfs_rq_is_idle(cfs_rq))
> +                       idle_h_nr_running = 1;
> +
>                 /* end evaluation on encountering a throttled cfs_rq */
>                 if (cfs_rq_throttled(cfs_rq))
>                         goto dequeue_throttle;
> @@ -7041,24 +7097,22 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
>
>  static void set_last_buddy(struct sched_entity *se)
>  {
> -       if (entity_is_task(se) && unlikely(task_has_idle_policy(task_of(se))))
> -               return;
> -
>         for_each_sched_entity(se) {
>                 if (SCHED_WARN_ON(!se->on_rq))
>                         return;
> +               if (se_is_idle(se))
> +                       return;
>                 cfs_rq_of(se)->last = se;
>         }
>  }
>
>  static void set_next_buddy(struct sched_entity *se)
>  {
> -       if (entity_is_task(se) && unlikely(task_has_idle_policy(task_of(se))))
> -               return;
> -
>         for_each_sched_entity(se) {
>                 if (SCHED_WARN_ON(!se->on_rq))
>                         return;
> +               if (se_is_idle(se))
> +                       return;
>                 cfs_rq_of(se)->next = se;
>         }
>  }
> @@ -7079,6 +7133,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>         struct cfs_rq *cfs_rq = task_cfs_rq(curr);
>         int scale = cfs_rq->nr_running >= sched_nr_latency;
>         int next_buddy_marked = 0;
> +       int cse_is_idle, pse_is_idle;
>
>         if (unlikely(se == pse))
>                 return;
> @@ -7123,8 +7178,21 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>                 return;
>
>         find_matching_se(&se, &pse);
> -       update_curr(cfs_rq_of(se));
>         BUG_ON(!pse);
> +
> +       cse_is_idle = se_is_idle(se);
> +       pse_is_idle = se_is_idle(pse);
> +
> +       /*
> +        * Preempt an idle group in favor of a non-idle group (and don't preempt
> +        * in the inverse case).
> +        */
> +       if (cse_is_idle && !pse_is_idle)
> +               goto preempt;
> +       if (cse_is_idle != pse_is_idle)
> +               return;
> +
> +       update_curr(cfs_rq_of(se));
>         if (wakeup_preempt_entity(se, pse) == 1) {
>                 /*
>                  * Bias pick_next to pick the sched entity that is
> @@ -11418,10 +11486,12 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
>
>  static DEFINE_MUTEX(shares_mutex);
>
> -int sched_group_set_shares(struct task_group *tg, unsigned long shares)
> +static int __sched_group_set_shares(struct task_group *tg, unsigned long shares)
>  {
>         int i;
>
> +       lockdep_assert_held(&shares_mutex);
> +
>         /*
>          * We can't change the weight of the root cgroup.
>          */
> @@ -11430,9 +11500,8 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
>
>         shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
>
> -       mutex_lock(&shares_mutex);
>         if (tg->shares == shares)
> -               goto done;
> +               return 0;
>
>         tg->shares = shares;
>         for_each_possible_cpu(i) {
> @@ -11450,10 +11519,88 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
>                 rq_unlock_irqrestore(rq, &rf);
>         }
>
> -done:
> +       return 0;
> +}
> +
> +int sched_group_set_shares(struct task_group *tg, unsigned long shares)
> +{
> +       int ret;
> +
> +       mutex_lock(&shares_mutex);
> +       if (tg_is_idle(tg))
> +               ret = -EINVAL;
> +       else
> +               ret = __sched_group_set_shares(tg, shares);
> +       mutex_unlock(&shares_mutex);
> +
> +       return ret;
> +}
> +
> +int sched_group_set_idle(struct task_group *tg, long idle)
> +{
> +       int i;
> +
> +       if (tg == &root_task_group)
> +               return -EINVAL;
> +
> +       if (idle < 0 || idle > 1)
> +               return -EINVAL;
> +
> +       mutex_lock(&shares_mutex);
> +
> +       if (tg->idle == idle) {
> +               mutex_unlock(&shares_mutex);
> +               return 0;
> +       }
> +
> +       tg->idle = idle;
> +
> +       for_each_possible_cpu(i) {
> +               struct rq *rq = cpu_rq(i);
> +               struct sched_entity *se = tg->se[i];
> +               struct cfs_rq *grp_cfs_rq = tg->cfs_rq[i];
> +               bool was_idle = cfs_rq_is_idle(grp_cfs_rq);
> +               long idle_task_delta;
> +               struct rq_flags rf;
> +
> +               rq_lock_irqsave(rq, &rf);
> +
> +               grp_cfs_rq->idle = idle;
> +               if (WARN_ON_ONCE(was_idle == cfs_rq_is_idle(grp_cfs_rq)))
> +                       goto next_cpu;
> +
> +               idle_task_delta = grp_cfs_rq->h_nr_running -
> +                                 grp_cfs_rq->idle_h_nr_running;
> +               if (!cfs_rq_is_idle(grp_cfs_rq))
> +                       idle_task_delta *= -1;
> +
> +               for_each_sched_entity(se) {
> +                       struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +                       if (!se->on_rq)
> +                               break;
> +
> +                       cfs_rq->idle_h_nr_running += idle_task_delta;
> +
> +                       /* Already accounted at parent level and above. */
> +                       if (cfs_rq_is_idle(cfs_rq))
> +                               break;
> +               }
> +
> +next_cpu:
> +               rq_unlock_irqrestore(rq, &rf);
> +       }
> +
> +       /* Idle groups have minimum weight. */
> +       if (tg_is_idle(tg))
> +               __sched_group_set_shares(tg, scale_load(WEIGHT_IDLEPRIO));
> +       else
> +               __sched_group_set_shares(tg, NICE_0_LOAD);
> +
>         mutex_unlock(&shares_mutex);
>         return 0;
>  }
> +
>  #else /* CONFIG_FAIR_GROUP_SCHED */
>
>  void free_fair_sched_group(struct task_group *tg) { }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index d9f8d73a1d84..8dfad8fb756c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -396,6 +396,9 @@ struct task_group {
>         struct cfs_rq           **cfs_rq;
>         unsigned long           shares;
>
> +       /* A positive value indicates that this is a SCHED_IDLE group. */
> +       int                     idle;
> +
>  #ifdef CONFIG_SMP
>         /*
>          * load_avg can be heavily contended at clock tick time, so put
> @@ -505,6 +508,8 @@ extern void sched_move_task(struct task_struct *tsk);
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
>
> +extern int sched_group_set_idle(struct task_group *tg, long idle);
> +
>  #ifdef CONFIG_SMP
>  extern void set_task_rq_fair(struct sched_entity *se,
>                              struct cfs_rq *prev, struct cfs_rq *next);
> @@ -601,6 +606,9 @@ struct cfs_rq {
>         struct list_head        leaf_cfs_rq_list;
>         struct task_group       *tg;    /* group that "owns" this runqueue */
>
> +       /* Locally cached copy of our task_group's idle value */
> +       int                     idle;
> +
>  #ifdef CONFIG_CFS_BANDWIDTH
>         int                     runtime_enabled;
>         s64                     runtime_remaining;
> --
> 2.32.0.554.ge1b32706d8-goog
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] sched: adjust SCHED_IDLE interactions
  2021-08-11 13:31   ` Vincent Guittot
@ 2021-08-12 21:09     ` Josh Don
  2021-08-13 12:43       ` Vincent Guittot
  2021-08-16 12:31       ` Peter Zijlstra
  2021-08-16 12:52     ` Peter Zijlstra
  1 sibling, 2 replies; 20+ messages in thread
From: Josh Don @ 2021-08-12 21:09 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, Oleg Rombakh,
	Viresh Kumar, Steve Sistare, Tejun Heo, Rik van Riel,
	linux-kernel

> > @@ -697,8 +699,18 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >                 slice = __calc_delta(slice, se->load.weight, load);
> >         }
> >
> > -       if (sched_feat(BASE_SLICE))
> > -               slice = max(slice, (u64)w);
> > +       if (sched_feat(BASE_SLICE)) {
> > +               /*
> > +                * SCHED_IDLE entities are not subject to min_granularity if
> > +                * they are competing with non SCHED_IDLE entities. As a result,
> > +                * non SCHED_IDLE entities will have reduced latency to get back
> > +                * on cpu, at the cost of increased context switch frequency of
> > +                * SCHED_IDLE entities.
> > +                */
>
> Ensuring that the entity will have a minimum runtime has been added to
> ensure that we let enough time to move forward.
> If you exclude sched_idle entities from this min runtime, the
> sched_slice of an idle_entity will be really small.
> I don't have details of your example above but I can imagine that it's
> a 16 cpus system which means a sysctl_sched_min_granularity=3.75ms
> which explains the 4ms running time of an idle entity
> For a 16 cpus system, the sched_slice of an idle_entity in your
> example in the cover letter is: 6*(1+log2(16))*3/1027=87us. Of course
> this become even worse with more threads and cgroups or thread with
> ncie prio -19
>
> This value is then used to set the next hrtimer event in SCHED_HRTICK
> and 87us is too small to make any progress
>
> The 1ms of your test comes from the tick which could be a good
> candidate for a min value or the
> normalized_sysctl_sched_min_granularity which has the advantage of not
> increasing with number of CPU

Fair point, this shouldn't completely ignore min granularity. Something like

unsigned int sysctl_sched_idle_min_granularity = NSEC_PER_MSEC;

(and still only using this value instead of the default
min_granularity when the SCHED_IDLE entity is competing with normal
entities)
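
For concreteness, a rough sketch of how that could slot into the
BASE_SLICE branch of sched_slice() (sysctl_sched_idle_min_granularity is
the knob proposed above, se_is_idle()/init_se and the idle_h_nr_running
check are taken from the quoted hunk, and the exact form may change in
the next revision):

	if (sched_feat(BASE_SLICE)) {
		unsigned int min_gran;

		/*
		 * Sketch: a SCHED_IDLE entity competing with non SCHED_IDLE
		 * entities gets the smaller floor; otherwise keep the usual
		 * min_granularity floor.
		 */
		if (se_is_idle(init_se) &&
		    cfs_rq->h_nr_running != cfs_rq->idle_h_nr_running)
			min_gran = sysctl_sched_idle_min_granularity;
		else
			min_gran = sysctl_sched_min_granularity;

		slice = max_t(u64, slice, min_gran);
	}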

> > @@ -4216,7 +4228,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> >                 if (sched_feat(GENTLE_FAIR_SLEEPERS))
> >                         thresh >>= 1;
> >
> > -               vruntime -= thresh;
> > +               /*
> > +                * Don't give sleep credit to a SCHED_IDLE entity if we're
> > +                * placing it onto a cfs_rq with non SCHED_IDLE entities.
> > +                */
> > +               if (!se_is_idle(se) ||
> > +                   cfs_rq->h_nr_running == cfs_rq->idle_h_nr_running)
>
> Can't this condition above create unfairness between idle entities ?
> idle thread 1 wake up while normal thread is running
> normal thread thread sleeps immediately after
> idle thread 2 wakes up just after and gets some credits compared to the 1st one.

Yes, this sacrifices some idle<->idle fairness when there is a normal
thread that comes and goes. One alternative is to simply further
reduce thresh for idle entities. That will interfere with idle<->idle
fairness when there are no normal threads, which is why I opted for
the former. On second thought though, the former fairness issue seems
more problematic. Thoughts on applying a smaller sleep credit
threshold universally to idle entities?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] sched: adjust SCHED_IDLE interactions
  2021-08-12 21:09     ` Josh Don
@ 2021-08-13 12:43       ` Vincent Guittot
  2021-08-13 23:55         ` Josh Don
  2021-08-16 12:31       ` Peter Zijlstra
  1 sibling, 1 reply; 20+ messages in thread
From: Vincent Guittot @ 2021-08-13 12:43 UTC (permalink / raw)
  To: Josh Don
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, Oleg Rombakh,
	Viresh Kumar, Steve Sistare, Tejun Heo, Rik van Riel,
	linux-kernel

On Thu, 12 Aug 2021 at 23:09, Josh Don <joshdon@google.com> wrote:
>
> > > @@ -697,8 +699,18 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > >                 slice = __calc_delta(slice, se->load.weight, load);
> > >         }
> > >
> > > -       if (sched_feat(BASE_SLICE))
> > > -               slice = max(slice, (u64)w);
> > > +       if (sched_feat(BASE_SLICE)) {
> > > +               /*
> > > +                * SCHED_IDLE entities are not subject to min_granularity if
> > > +                * they are competing with non SCHED_IDLE entities. As a result,
> > > +                * non SCHED_IDLE entities will have reduced latency to get back
> > > +                * on cpu, at the cost of increased context switch frequency of
> > > +                * SCHED_IDLE entities.
> > > +                */
> >
> > Ensuring that the entity will have a minimum runtime has been added to
> > ensure that we let enough time to move forward.
> > If you exclude sched_idle entities from this min runtime, the
> > sched_slice of an idle_entity will be really small.
> > I don't have details of your example above but I can imagine that it's
> > a 16 cpus system which means a sysctl_sched_min_granularity=3.75ms
> > which explains the 4ms running time of an idle entity
> > For a 16 cpus system, the sched_slice of an idle_entity in your
> > example in the cover letter is: 6*(1+log2(16))*3/1027=87us. Of course
> > this become even worse with more threads and cgroups or thread with
> > ncie prio -19
> >
> > This value is then used to set the next hrtimer event in SCHED_HRTICK
> > and 87us is too small to make any progress
> >
> > The 1ms of your test comes from the tick which could be a good
> > candidate for a min value or the
> > normalized_sysctl_sched_min_granularity which has the advantage of not
> > increasing with number of CPU
>
> Fair point, this shouldn't completely ignore min granularity. Something like
>
> unsigned int sysctl_sched_idle_min_granularity = NSEC_PER_MSEC;
>
> (and still only using this value instead of the default
> min_granularity when the SCHED_IDLE entity is competing with normal
> entities)

Yes that looks like a good option

Also note that with a NSEC_PER_MSEC default value, the sched_idle
entity will most probably run 2 ticks instead of the 1 tick (HZ=1000)
that you have with your proposal, because a bit less than a full tick
is accounted to the running thread (the time spent in interrupts, for
example, is not accounted), so a sysctl_sched_idle_min_granularity of
1ms with HZ=1000 will most probably run 2 ticks. Instead you could
reuse the default 750000ULL value for sched_idle_min_granularity.
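
To put rough numbers on that (assuming HZ=1000, so a 1ms tick, and that
only around 0.9ms of each tick ends up accounted to the task once
interrupt time is excluded; the exact amount will vary):

  floor = 1ms:   after the 1st tick, ~0.9ms < 1ms   -> keeps running into a 2nd tick
  floor = 750us: after the 1st tick, ~0.9ms > 750us -> can be preempted at the 1st tick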

That being said, sysctl_sched_idle_min_granularity =
normalized_sysctl_sched_min_granularity * scale_factor, which means
that normalized_sysctl_sched_min_granularity stays the same
(750000ULL) whatever the number of cpus.

>
> > > @@ -4216,7 +4228,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> > >                 if (sched_feat(GENTLE_FAIR_SLEEPERS))
> > >                         thresh >>= 1;
> > >
> > > -               vruntime -= thresh;
> > > +               /*
> > > +                * Don't give sleep credit to a SCHED_IDLE entity if we're
> > > +                * placing it onto a cfs_rq with non SCHED_IDLE entities.
> > > +                */
> > > +               if (!se_is_idle(se) ||
> > > +                   cfs_rq->h_nr_running == cfs_rq->idle_h_nr_running)
> >
> > Can't this condition above create unfairness between idle entities ?
> > idle thread 1 wake up while normal thread is running
> > normal thread thread sleeps immediately after
> > idle thread 2 wakes up just after and gets some credits compared to the 1st one.
>
> Yes, this sacrifices some idle<->idle fairness when there is a normal
> thread that comes and goes. One alternative is to simply further
> reduce thresh for idle entities. That will interfere with idle<->idle
> fairness when there are no normal threads, which is why I opted for
> the former. On second thought though, the former fairness issue seems
> more problematic. Thoughts on applying a smaller sleep credit
> threshold universally to idle entities?

This one is a bit more complex to set.
By adding 1, you favor the already runnable tasks by ensuring that
they have run, or will run, a slice during this period before the
sched_idle task does.
But as soon as you subtract something from min_vruntime, the task will
most probably be scheduled at the next tick if other tasks have already
run for a while (at least a sched period). If we use
sysctl_sched_min_granularity for sched_idle tasks that wake up, instead
of sysctl_sched_latency, we will ensure that a sched_idle task will
not preempt a normal task which woke up a few ms before, and we will
keep some fairness for a sched_idle task that sleeps compared to others.

So a thresh of sysctl_sched_min_granularity (3.75ms with 16 cpus)
should not disturb your use case while keeping some benefit for a newly
woken sched_idle task.
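
A minimal sketch of that suggestion against the place_entity() hunk
quoted above (names come from the quoted code; whether to also keep the
h_nr_running check is discussed separately, and the version that finally
lands may differ):

	unsigned long thresh;

	/*
	 * Sketch: bound sleeper credit for SCHED_IDLE entities by
	 * min_granularity rather than sched_latency.
	 */
	if (se_is_idle(se))
		thresh = sysctl_sched_min_granularity;
	else
		thresh = sysctl_sched_latency;

	if (sched_feat(GENTLE_FAIR_SLEEPERS))
		thresh >>= 1;

	vruntime -= thresh;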

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] sched: adjust SCHED_IDLE interactions
  2021-08-13 12:43       ` Vincent Guittot
@ 2021-08-13 23:55         ` Josh Don
  0 siblings, 0 replies; 20+ messages in thread
From: Josh Don @ 2021-08-13 23:55 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, Oleg Rombakh,
	Viresh Kumar, Steve Sistare, Tejun Heo, Rik van Riel,
	linux-kernel

On Fri, Aug 13, 2021 at 5:43 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
[snip]
> > >
> > > The 1ms of your test comes from the tick which could be a good
> > > candidate for a min value or the
> > > normalized_sysctl_sched_min_granularity which has the advantage of not
> > > increasing with number of CPU
> >
> > Fair point, this shouldn't completely ignore min granularity. Something like
> >
> > unsigned int sysctl_sched_idle_min_granularity = NSEC_PER_MSEC;
> >
> > (and still only using this value instead of the default
> > min_granularity when the SCHED_IDLE entity is competing with normal
> > entities)
>
> Yes that looks like a good option
>
> Also note that with a NSEC_PER_MSEC default value, the sched_idle
> entity will most probably run 2 ticks instead of the 1 tick (HZ=1000)
> that you have with your proposal because a bit less than a full tick
> is accounted to the running thread (the time spent in interrupt is not
> accounted as an example) so sysctl_sched_idle_min_granularity of 1ms
> with HZ=1000 will most propably run 2 ticks. Instead you could reuse
> the default 750000ULL value of sched_idle_min_granularity

Yes, great point. That's a better value here, with sufficient margin.

> That being said sysctl_sched_idle_min_granularity =
> normalized_sysctl_sched_min_granularity * scale_factor which means
> that normalized_sysctl_sched_min_granularity stays the same
> (750000ULL) whatever the number of cpus
>
> >
> > > > @@ -4216,7 +4228,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> > > >                 if (sched_feat(GENTLE_FAIR_SLEEPERS))
> > > >                         thresh >>= 1;
> > > >
> > > > -               vruntime -= thresh;
> > > > +               /*
> > > > +                * Don't give sleep credit to a SCHED_IDLE entity if we're
> > > > +                * placing it onto a cfs_rq with non SCHED_IDLE entities.
> > > > +                */
> > > > +               if (!se_is_idle(se) ||
> > > > +                   cfs_rq->h_nr_running == cfs_rq->idle_h_nr_running)
> > >
> > > Can't this condition above create unfairness between idle entities ?
> > > idle thread 1 wake up while normal thread is running
> > > normal thread thread sleeps immediately after
> > > idle thread 2 wakes up just after and gets some credits compared to the 1st one.
> >
> > Yes, this sacrifices some idle<->idle fairness when there is a normal
> > thread that comes and goes. One alternative is to simply further
> > reduce thresh for idle entities. That will interfere with idle<->idle
> > fairness when there are no normal threads, which is why I opted for
> > the former. On second thought though, the former fairness issue seems
> > more problematic. Thoughts on applying a smaller sleep credit
> > threshold universally to idle entities?
>
> This one is a bit more complex to set.
> With adding 1, you favor the already runnable tasks by ensuring that
> they have or will run a slice during this period before sched_idle
> task
> But as soon as you subtract something to min_vruntime, the task will
> most probably be scheduled at the next tick if other tasks already run
> for a while (at least a sched period). If we use
> sysctl_sched_min_granularity for sched_idle tasks that wake up instead
> of sysctl_sched_latency, we will ensure that a sched_idle task will
> not preempt a normal task, which woke up few ms before, and we will
> keep some fairness for sched_idle task that sleeps compare to other.
>
> so a thresh of sysctl_sched_min_granularity (3.75ms with 16 cpus )
> should not disturb your UC and keep some benefit for newly wake up
> sched_ide task

If the normal task has already been running for at least a period, it
should be ok to preempt.
A thresh around the min_granularity seems like a good order of
magnitude; I'll experiment a bit.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] sched: adjust SCHED_IDLE interactions
  2021-07-30  2:00 ` [PATCH 2/2] sched: adjust SCHED_IDLE interactions Josh Don
  2021-08-11 13:31   ` Vincent Guittot
@ 2021-08-16 12:31   ` Peter Zijlstra
  1 sibling, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2021-08-16 12:31 UTC (permalink / raw)
  To: Josh Don
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, Oleg Rombakh,
	Viresh Kumar, Steve Sistare, Tejun Heo, Rik van Riel,
	linux-kernel

On Thu, Jul 29, 2021 at 07:00:19PM -0700, Josh Don wrote:
> 1) Ignore min_granularity for determining the sched_slide of a
> SCHED_IDLE entity when it is competing with a non SCHED_IDLE entity.
> This reduces the latency of getting a non SCHED_IDLE entity back on cpu,
> at the expense of increased context switch frequency of SCHED_IDLE
> entities.

> 2) Don't give sleeper credit to SCHED_IDLE entities when they wake onto
> a cfs_rq with non SCHED_IDLE entities. As a result, newly woken
> SCHED_IDLE entities will take longer to preempt non SCHED_IDLE entities.

ISTR we had a rule about one change per patch :-)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] sched: adjust SCHED_IDLE interactions
  2021-08-12 21:09     ` Josh Don
  2021-08-13 12:43       ` Vincent Guittot
@ 2021-08-16 12:31       ` Peter Zijlstra
  2021-08-17 23:48         ` Josh Don
  1 sibling, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2021-08-16 12:31 UTC (permalink / raw)
  To: Josh Don
  Cc: Vincent Guittot, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, Oleg Rombakh,
	Viresh Kumar, Steve Sistare, Tejun Heo, Rik van Riel,
	linux-kernel

On Thu, Aug 12, 2021 at 02:09:15PM -0700, Josh Don wrote:
> > > @@ -697,8 +699,18 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > >                 slice = __calc_delta(slice, se->load.weight, load);
> > >         }
> > >
> > > -       if (sched_feat(BASE_SLICE))
> > > -               slice = max(slice, (u64)w);
> > > +       if (sched_feat(BASE_SLICE)) {
> > > +               /*
> > > +                * SCHED_IDLE entities are not subject to min_granularity if
> > > +                * they are competing with non SCHED_IDLE entities. As a result,
> > > +                * non SCHED_IDLE entities will have reduced latency to get back
> > > +                * on cpu, at the cost of increased context switch frequency of
> > > +                * SCHED_IDLE entities.
> > > +                */
> >
> > Ensuring that the entity will have a minimum runtime has been added to
> > ensure that we let enough time to move forward.
> > If you exclude sched_idle entities from this min runtime, the
> > sched_slice of an idle_entity will be really small.
> > I don't have details of your example above but I can imagine that it's
> > a 16 cpus system which means a sysctl_sched_min_granularity=3.75ms
> > which explains the 4ms running time of an idle entity
> > For a 16 cpus system, the sched_slice of an idle_entity in your
> > example in the cover letter is: 6*(1+log2(16))*3/1027=87us. Of course
> > this become even worse with more threads and cgroups or thread with
> > ncie prio -19
> >
> > This value is then used to set the next hrtimer event in SCHED_HRTICK
> > and 87us is too small to make any progress
> >
> > The 1ms of your test comes from the tick which could be a good
> > candidate for a min value or the
> > normalized_sysctl_sched_min_granularity which has the advantage of not
> > increasing with number of CPU
> 
> Fair point, this shouldn't completely ignore min granularity. Something like
> 
> unsigned int sysctl_sched_idle_min_granularity = NSEC_PER_MSEC;
> 
> (and still only using this value instead of the default
> min_granularity when the SCHED_IDLE entity is competing with normal
> entities)

TICK_NSEC ?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] sched: adjust SCHED_IDLE interactions
  2021-08-11 13:31   ` Vincent Guittot
  2021-08-12 21:09     ` Josh Don
@ 2021-08-16 12:52     ` Peter Zijlstra
  2021-08-16 12:56       ` Vincent Guittot
  2021-08-17 23:40       ` Josh Don
  1 sibling, 2 replies; 20+ messages in thread
From: Peter Zijlstra @ 2021-08-16 12:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Josh Don, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, Oleg Rombakh,
	Viresh Kumar, Steve Sistare, Tejun Heo, Rik van Riel,
	linux-kernel

On Wed, Aug 11, 2021 at 03:31:49PM +0200, Vincent Guittot wrote:
> On Fri, 30 Jul 2021 at 04:00, Josh Don <joshdon@google.com> wrote:


> > @@ -4216,7 +4228,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> >                 if (sched_feat(GENTLE_FAIR_SLEEPERS))
> >                         thresh >>= 1;
> >
> > -               vruntime -= thresh;
> > +               /*
> > +                * Don't give sleep credit to a SCHED_IDLE entity if we're
> > +                * placing it onto a cfs_rq with non SCHED_IDLE entities.
> > +                */
> > +               if (!se_is_idle(se) ||
> > +                   cfs_rq->h_nr_running == cfs_rq->idle_h_nr_running)

I really dislike that second clause, either never do this for idle or
always, but not sometimes when the planets are aligned just right.

> Can't this condition above create unfairness between idle entities ?
> idle thread 1 wake up while normal thread is running
> normal thread thread sleeps immediately after
> idle thread 2 wakes up just after and gets some credits compared to the 1st one.

No. Strictly speaking cfs is unfair here. But it's a really tricky case.

Consider a task that is running 50% competing against a task that's
running 100%. What's fair in that situation, a 50/50 split, or a 25/75
split? What if that 50% is 50% of a minute?

What we do here is fudge the vruntime such that we end up with a 50/50
split provided the period over which it blocks is less than a slice.
After that it gradually converges to the 'expected' 25/75 split that
results from strict runnable competition.

By not letting idle tasks participate in this, we avoid idle tasks
'stealing' the !runnable time and they revert back to strict runnable
competition only.
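
(One way to read the 25/75 figure: if the lighter task is runnable for
half of each period and, while runnable, splits the CPU evenly with the
always-runnable task, it ends up with 0.5 * 0.5 = 25% of the CPU over
the period; the always-runnable task gets the other 25% plus the 50% it
has to itself, i.e. 75%.)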

> > +                       vruntime -= thresh;
> > +               else
> > +                       vruntime += 1;
> >         }
> >
> >         /* ensure we never gain time by being placed backwards. */
> > --
> > 2.32.0.554.ge1b32706d8-goog
> >

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] sched: adjust SCHED_IDLE interactions
  2021-08-16 12:52     ` Peter Zijlstra
@ 2021-08-16 12:56       ` Vincent Guittot
  2021-08-17 23:40       ` Josh Don
  1 sibling, 0 replies; 20+ messages in thread
From: Vincent Guittot @ 2021-08-16 12:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Don, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, Oleg Rombakh,
	Viresh Kumar, Steve Sistare, Tejun Heo, Rik van Riel,
	linux-kernel

On Mon, 16 Aug 2021 at 14:52, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Aug 11, 2021 at 03:31:49PM +0200, Vincent Guittot wrote:
> > On Fri, 30 Jul 2021 at 04:00, Josh Don <joshdon@google.com> wrote:
>
>
> > > @@ -4216,7 +4228,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> > >                 if (sched_feat(GENTLE_FAIR_SLEEPERS))
> > >                         thresh >>= 1;
> > >
> > > -               vruntime -= thresh;
> > > +               /*
> > > +                * Don't give sleep credit to a SCHED_IDLE entity if we're
> > > +                * placing it onto a cfs_rq with non SCHED_IDLE entities.
> > > +                */
> > > +               if (!se_is_idle(se) ||
> > > +                   cfs_rq->h_nr_running == cfs_rq->idle_h_nr_running)
>
> I really dislike that second clause, either never do this for idle or
> always, but not sometimes when the planets are aligned just right.

That was my point too

>
> > Can't this condition above create unfairness between idle entities ?
> > idle thread 1 wake up while normal thread is running
> > normal thread thread sleeps immediately after
> > idle thread 2 wakes up just after and gets some credits compared to the 1st one.
>
> No. Strictly speaking cfs is unfair here. But it's a really tricky case.
>
> Consider a task that is running 50% competing against a task that's
> running 100%. What's fair in that situation, a 50/50 split, or a 25/75
> split? What if that 50% is 50% of a minute?
>
> What we do here is fudge the vruntime such that we end up with a 50/50
> split provided the period over which it blocks is less than a slice.
> After that it gradually converges to the 'expected' 25/75 split that
> results from strict runnable competition.
>
> By not letting idle tasks participate in this, we avoid idle tasks
> 'stealing' the !runnable time and they revert back to strict runnable
> competition only.
>
> > > +                       vruntime -= thresh;
> > > +               else
> > > +                       vruntime += 1;
> > >         }
> > >
> > >         /* ensure we never gain time by being placed backwards. */
> > > --
> > > 2.32.0.554.ge1b32706d8-goog
> > >

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] sched: adjust SCHED_IDLE interactions
  2021-08-16 12:52     ` Peter Zijlstra
  2021-08-16 12:56       ` Vincent Guittot
@ 2021-08-17 23:40       ` Josh Don
  1 sibling, 0 replies; 20+ messages in thread
From: Josh Don @ 2021-08-17 23:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, Oleg Rombakh,
	Viresh Kumar, Steve Sistare, Tejun Heo, Rik van Riel,
	linux-kernel

On Mon, Aug 16, 2021 at 5:52 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Aug 11, 2021 at 03:31:49PM +0200, Vincent Guittot wrote:
> > On Fri, 30 Jul 2021 at 04:00, Josh Don <joshdon@google.com> wrote:
>
>
> > > @@ -4216,7 +4228,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> > >                 if (sched_feat(GENTLE_FAIR_SLEEPERS))
> > >                         thresh >>= 1;
> > >
> > > -               vruntime -= thresh;
> > > +               /*
> > > +                * Don't give sleep credit to a SCHED_IDLE entity if we're
> > > +                * placing it onto a cfs_rq with non SCHED_IDLE entities.
> > > +                */
> > > +               if (!se_is_idle(se) ||
> > > +                   cfs_rq->h_nr_running == cfs_rq->idle_h_nr_running)
>
> I really dislike that second clause, either never do this for idle or
> always, but not sometimes when the planets are aligned just right.

Yep, switched this to always for idle entities.

>
> > Can't this condition above create unfairness between idle entities ?
> > idle thread 1 wake up while normal thread is running
> > normal thread thread sleeps immediately after
> > idle thread 2 wakes up just after and gets some credits compared to the 1st one.
>
> No. Strictly speaking cfs is unfair here. But it's a really tricky case.
>
> Consider a task that is running 50% competing against a task that's
> running 100%. What's fair in that situation, a 50/50 split, or a 25/75
> split? What if that 50% is 50% of a minute?
>
> What we do here is fudge the vruntime such that we end up with a 50/50
> split provided the period over which it blocks is less than a slice.
> After that it gradually converges to the 'expected' 25/75 split that
> results from strict runnable competition.
>
> By not letting idle tasks participate in this, we avoid idle tasks
> 'stealing' the !runnable time and they revert back to strict runnable
> competition only.

I like Vincent's suggestion to use a reduced threshold for idle
entities, that's been working pretty well. And it retains some
idle<->idle fairness when we have idle waking onto other idle threads.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] sched: adjust SCHED_IDLE interactions
  2021-08-16 12:31       ` Peter Zijlstra
@ 2021-08-17 23:48         ` Josh Don
  0 siblings, 0 replies; 20+ messages in thread
From: Josh Don @ 2021-08-17 23:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, Oleg Rombakh,
	Viresh Kumar, Steve Sistare, Tejun Heo, Rik van Riel,
	linux-kernel

Hi Peter,

On Mon, Aug 16, 2021 at 5:32 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Aug 12, 2021 at 02:09:15PM -0700, Josh Don wrote:
> > > > @@ -697,8 +699,18 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > > >                 slice = __calc_delta(slice, se->load.weight, load);
> > > >         }
> > > >
> > > > -       if (sched_feat(BASE_SLICE))
> > > > -               slice = max(slice, (u64)w);
> > > > +       if (sched_feat(BASE_SLICE)) {
> > > > +               /*
> > > > +                * SCHED_IDLE entities are not subject to min_granularity if
> > > > +                * they are competing with non SCHED_IDLE entities. As a result,
> > > > +                * non SCHED_IDLE entities will have reduced latency to get back
> > > > +                * on cpu, at the cost of increased context switch frequency of
> > > > +                * SCHED_IDLE entities.
> > > > +                */
> > >
> > > Ensuring that the entity will have a minimum runtime has been added to
> > > ensure that we let enough time to move forward.
> > > If you exclude sched_idle entities from this min runtime, the
> > > sched_slice of an idle_entity will be really small.
> > > I don't have details of your example above but I can imagine that it's
> > > a 16 cpus system which means a sysctl_sched_min_granularity=3.75ms
> > > which explains the 4ms running time of an idle entity
> > > For a 16 cpus system, the sched_slice of an idle_entity in your
> > > example in the cover letter is: 6*(1+log2(16))*3/1027=87us. Of course
> > > this become even worse with more threads and cgroups or thread with
> > > ncie prio -19
> > >
> > > This value is then used to set the next hrtimer event in SCHED_HRTICK
> > > and 87us is too small to make any progress
> > >
> > > The 1ms of your test comes from the tick which could be a good
> > > candidate for a min value or the
> > > normalized_sysctl_sched_min_granularity which has the advantage of not
> > > increasing with number of CPU
> >
> > Fair point, this shouldn't completely ignore min granularity. Something like
> >
> > unsigned int sysctl_sched_idle_min_granularity = NSEC_PER_MSEC;
> >
> > (and still only using this value instead of the default
> > min_granularity when the SCHED_IDLE entity is competing with normal
> > entities)
>
> TICK_NSEC ?

Based on discussion with Vincent, added a separate
sched_idle_min_granularity. Didn't make sense to me to define in terms
of TICK_NSEC, since e.g. we wouldn't necessarily want a smaller slice
on a higher HZ machine.
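
(For reference, TICK_NSEC is roughly NSEC_PER_SEC / HZ, so such a floor
would shrink as HZ grows: ~10ms at HZ=100, ~4ms at HZ=250, ~1ms at
HZ=1000; hence the preference for the fixed default discussed above,
750us.)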

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [tip: sched/core] sched: Cgroup SCHED_IDLE support
  2021-07-30  2:00 ` [PATCH v2 1/2] sched: cgroup SCHED_IDLE support Josh Don
                     ` (2 preceding siblings ...)
  2021-08-11 13:48   ` Vincent Guittot
@ 2021-08-23  9:26   ` tip-bot2 for Josh Don
  3 siblings, 0 replies; 20+ messages in thread
From: tip-bot2 for Josh Don @ 2021-08-23  9:26 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Josh Don, Peter Zijlstra (Intel), Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     304000390f88d049c85e9a0958ac5567f38816ee
Gitweb:        https://git.kernel.org/tip/304000390f88d049c85e9a0958ac5567f38816ee
Author:        Josh Don <joshdon@google.com>
AuthorDate:    Thu, 29 Jul 2021 19:00:18 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 20 Aug 2021 12:32:58 +02:00

sched: Cgroup SCHED_IDLE support

This extends SCHED_IDLE to cgroups.

Interface: cgroup/cpu.idle.
 0: default behavior
 1: SCHED_IDLE

Extending SCHED_IDLE to cgroups means that we incorporate the existing
aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its
descendant threads towards the idle_h_nr_running count of all of its
ancestor cgroups. Thus, sched_idle_rq() will work properly.
Additionally, SCHED_IDLE cgroups are configured with minimum weight.

There are two key differences between the per-task and per-cgroup
SCHED_IDLE interface:

  - The cgroup interface allows tasks within a SCHED_IDLE hierarchy to
    maintain their relative weights. The entity that is "idle" is the
    cgroup, not the tasks themselves.

  - Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption
    decision is not made by comparing the current task with the woken
    task, but rather by comparing their matching sched_entity.

A typical use-case for this is a user that creates an idle and a
non-idle subtree. The non-idle subtree will dominate competition vs
the idle subtree, but the idle subtree will still be high priority vs
other users on the system. The latter is accomplished via comparing
matching sched_entity in the waken preemption path (this could also be
improved by making the sched_idle_rq() decision dependent on the
perspective of a specific task).

For now, we maintain the existing SCHED_IDLE semantics. Future patches
may make improvements that extend how we treat SCHED_IDLE entities.

The per-task_group idle field is an integer that currently only holds
either a 0 or a 1. This is explicitly typed as an integer to allow for
further extensions to this API. For example, a negative value may
indicate a highly latency-sensitive cgroup that should be preferred
for preemption/placement/etc.

Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20210730020019.1487127-2-joshdon@google.com
---
 kernel/sched/core.c  |  25 +++++-
 kernel/sched/debug.c |   3 +-
 kernel/sched/fair.c  | 197 ++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |   8 ++-
 4 files changed, 208 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d67c5dd..7fa6ce7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10135,6 +10135,20 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static s64 cpu_idle_read_s64(struct cgroup_subsys_state *css,
+			       struct cftype *cft)
+{
+	return css_tg(css)->idle;
+}
+
+static int cpu_idle_write_s64(struct cgroup_subsys_state *css,
+				struct cftype *cft, s64 idle)
+{
+	return sched_group_set_idle(css_tg(css), idle);
+}
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -10142,6 +10156,11 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_shares_read_u64,
 		.write_u64 = cpu_shares_write_u64,
 	},
+	{
+		.name = "idle",
+		.read_s64 = cpu_idle_read_s64,
+		.write_s64 = cpu_idle_write_s64,
+	},
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
@@ -10349,6 +10368,12 @@ static struct cftype cpu_files[] = {
 		.read_s64 = cpu_weight_nice_read_s64,
 		.write_s64 = cpu_weight_nice_write_s64,
 	},
+	{
+		.name = "idle",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_s64 = cpu_idle_read_s64,
+		.write_s64 = cpu_idle_write_s64,
+	},
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 7e08e3d..4971622 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -607,6 +607,9 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_spread_over",
 			cfs_rq->nr_spread_over);
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
+	SEQ_printf(m, "  .%-30s: %d\n", "h_nr_running", cfs_rq->h_nr_running);
+	SEQ_printf(m, "  .%-30s: %d\n", "idle_h_nr_running",
+			cfs_rq->idle_h_nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_SMP
 	SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 47a0fbf..6cd05f1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -431,6 +431,23 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 	}
 }
 
+static int tg_is_idle(struct task_group *tg)
+{
+	return tg->idle > 0;
+}
+
+static int cfs_rq_is_idle(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->idle > 0;
+}
+
+static int se_is_idle(struct sched_entity *se)
+{
+	if (entity_is_task(se))
+		return task_has_idle_policy(task_of(se));
+	return cfs_rq_is_idle(group_cfs_rq(se));
+}
+
 #else	/* !CONFIG_FAIR_GROUP_SCHED */
 
 #define for_each_sched_entity(se) \
@@ -468,6 +485,21 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 {
 }
 
+static int tg_is_idle(struct task_group *tg)
+{
+	return 0;
+}
+
+static int cfs_rq_is_idle(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
+
+static int se_is_idle(struct sched_entity *se)
+{
+	return 0;
+}
+
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
 static __always_inline
@@ -4812,6 +4844,9 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 
 		dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
 
+		if (cfs_rq_is_idle(group_cfs_rq(se)))
+			idle_task_delta = cfs_rq->h_nr_running;
+
 		qcfs_rq->h_nr_running -= task_delta;
 		qcfs_rq->idle_h_nr_running -= idle_task_delta;
 
@@ -4831,6 +4866,9 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		update_load_avg(qcfs_rq, se, 0);
 		se_update_runnable(se);
 
+		if (cfs_rq_is_idle(group_cfs_rq(se)))
+			idle_task_delta = cfs_rq->h_nr_running;
+
 		qcfs_rq->h_nr_running -= task_delta;
 		qcfs_rq->idle_h_nr_running -= idle_task_delta;
 	}
@@ -4875,39 +4913,45 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	task_delta = cfs_rq->h_nr_running;
 	idle_task_delta = cfs_rq->idle_h_nr_running;
 	for_each_sched_entity(se) {
+		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
+
 		if (se->on_rq)
 			break;
-		cfs_rq = cfs_rq_of(se);
-		enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
+		enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
+
+		if (cfs_rq_is_idle(group_cfs_rq(se)))
+			idle_task_delta = cfs_rq->h_nr_running;
 
-		cfs_rq->h_nr_running += task_delta;
-		cfs_rq->idle_h_nr_running += idle_task_delta;
+		qcfs_rq->h_nr_running += task_delta;
+		qcfs_rq->idle_h_nr_running += idle_task_delta;
 
 		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
+		if (cfs_rq_throttled(qcfs_rq))
 			goto unthrottle_throttle;
 	}
 
 	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
 
-		update_load_avg(cfs_rq, se, UPDATE_TG);
+		update_load_avg(qcfs_rq, se, UPDATE_TG);
 		se_update_runnable(se);
 
-		cfs_rq->h_nr_running += task_delta;
-		cfs_rq->idle_h_nr_running += idle_task_delta;
+		if (cfs_rq_is_idle(group_cfs_rq(se)))
+			idle_task_delta = cfs_rq->h_nr_running;
 
+		qcfs_rq->h_nr_running += task_delta;
+		qcfs_rq->idle_h_nr_running += idle_task_delta;
 
 		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
+		if (cfs_rq_throttled(qcfs_rq))
 			goto unthrottle_throttle;
 
 		/*
 		 * One parent has been throttled and cfs_rq removed from the
 		 * list. Add it back to not break the leaf list.
 		 */
-		if (throttled_hierarchy(cfs_rq))
-			list_add_leaf_cfs_rq(cfs_rq);
+		if (throttled_hierarchy(qcfs_rq))
+			list_add_leaf_cfs_rq(qcfs_rq);
 	}
 
 	/* At this point se is NULL and we are at root level*/
@@ -4920,9 +4964,9 @@ unthrottle_throttle:
 	 * assertion below.
 	 */
 	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
 
-		if (list_add_leaf_cfs_rq(cfs_rq))
+		if (list_add_leaf_cfs_rq(qcfs_rq))
 			break;
 	}
 
@@ -5545,6 +5589,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_running++;
 		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 
+		if (cfs_rq_is_idle(cfs_rq))
+			idle_h_nr_running = 1;
+
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
 			goto enqueue_throttle;
@@ -5562,6 +5609,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_running++;
 		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 
+		if (cfs_rq_is_idle(cfs_rq))
+			idle_h_nr_running = 1;
+
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
 			goto enqueue_throttle;
@@ -5639,6 +5689,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_running--;
 		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 
+		if (cfs_rq_is_idle(cfs_rq))
+			idle_h_nr_running = 1;
+
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
 			goto dequeue_throttle;
@@ -5668,6 +5721,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_running--;
 		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 
+		if (cfs_rq_is_idle(cfs_rq))
+			idle_h_nr_running = 1;
+
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
 			goto dequeue_throttle;
@@ -7010,24 +7066,22 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
 
 static void set_last_buddy(struct sched_entity *se)
 {
-	if (entity_is_task(se) && unlikely(task_has_idle_policy(task_of(se))))
-		return;
-
 	for_each_sched_entity(se) {
 		if (SCHED_WARN_ON(!se->on_rq))
 			return;
+		if (se_is_idle(se))
+			return;
 		cfs_rq_of(se)->last = se;
 	}
 }
 
 static void set_next_buddy(struct sched_entity *se)
 {
-	if (entity_is_task(se) && unlikely(task_has_idle_policy(task_of(se))))
-		return;
-
 	for_each_sched_entity(se) {
 		if (SCHED_WARN_ON(!se->on_rq))
 			return;
+		if (se_is_idle(se))
+			return;
 		cfs_rq_of(se)->next = se;
 	}
 }
@@ -7048,6 +7102,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 	int scale = cfs_rq->nr_running >= sched_nr_latency;
 	int next_buddy_marked = 0;
+	int cse_is_idle, pse_is_idle;
 
 	if (unlikely(se == pse))
 		return;
@@ -7092,8 +7147,21 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		return;
 
 	find_matching_se(&se, &pse);
-	update_curr(cfs_rq_of(se));
 	BUG_ON(!pse);
+
+	cse_is_idle = se_is_idle(se);
+	pse_is_idle = se_is_idle(pse);
+
+	/*
+	 * Preempt an idle group in favor of a non-idle group (and don't preempt
+	 * in the inverse case).
+	 */
+	if (cse_is_idle && !pse_is_idle)
+		goto preempt;
+	if (cse_is_idle != pse_is_idle)
+		return;
+
+	update_curr(cfs_rq_of(se));
 	if (wakeup_preempt_entity(se, pse) == 1) {
 		/*
 		 * Bias pick_next to pick the sched entity that is
@@ -11387,10 +11455,12 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 
 static DEFINE_MUTEX(shares_mutex);
 
-int sched_group_set_shares(struct task_group *tg, unsigned long shares)
+static int __sched_group_set_shares(struct task_group *tg, unsigned long shares)
 {
 	int i;
 
+	lockdep_assert_held(&shares_mutex);
+
 	/*
 	 * We can't change the weight of the root cgroup.
 	 */
@@ -11399,9 +11469,8 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 
 	shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
 
-	mutex_lock(&shares_mutex);
 	if (tg->shares == shares)
-		goto done;
+		return 0;
 
 	tg->shares = shares;
 	for_each_possible_cpu(i) {
@@ -11419,10 +11488,88 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 		rq_unlock_irqrestore(rq, &rf);
 	}
 
-done:
+	return 0;
+}
+
+int sched_group_set_shares(struct task_group *tg, unsigned long shares)
+{
+	int ret;
+
+	mutex_lock(&shares_mutex);
+	if (tg_is_idle(tg))
+		ret = -EINVAL;
+	else
+		ret = __sched_group_set_shares(tg, shares);
+	mutex_unlock(&shares_mutex);
+
+	return ret;
+}
+
+int sched_group_set_idle(struct task_group *tg, long idle)
+{
+	int i;
+
+	if (tg == &root_task_group)
+		return -EINVAL;
+
+	if (idle < 0 || idle > 1)
+		return -EINVAL;
+
+	mutex_lock(&shares_mutex);
+
+	if (tg->idle == idle) {
+		mutex_unlock(&shares_mutex);
+		return 0;
+	}
+
+	tg->idle = idle;
+
+	for_each_possible_cpu(i) {
+		struct rq *rq = cpu_rq(i);
+		struct sched_entity *se = tg->se[i];
+		struct cfs_rq *grp_cfs_rq = tg->cfs_rq[i];
+		bool was_idle = cfs_rq_is_idle(grp_cfs_rq);
+		long idle_task_delta;
+		struct rq_flags rf;
+
+		rq_lock_irqsave(rq, &rf);
+
+		grp_cfs_rq->idle = idle;
+		if (WARN_ON_ONCE(was_idle == cfs_rq_is_idle(grp_cfs_rq)))
+			goto next_cpu;
+
+		idle_task_delta = grp_cfs_rq->h_nr_running -
+				  grp_cfs_rq->idle_h_nr_running;
+		if (!cfs_rq_is_idle(grp_cfs_rq))
+			idle_task_delta *= -1;
+
+		for_each_sched_entity(se) {
+			struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+			if (!se->on_rq)
+				break;
+
+			cfs_rq->idle_h_nr_running += idle_task_delta;
+
+			/* Already accounted at parent level and above. */
+			if (cfs_rq_is_idle(cfs_rq))
+				break;
+		}
+
+next_cpu:
+		rq_unlock_irqrestore(rq, &rf);
+	}
+
+	/* Idle groups have minimum weight. */
+	if (tg_is_idle(tg))
+		__sched_group_set_shares(tg, scale_load(WEIGHT_IDLEPRIO));
+	else
+		__sched_group_set_shares(tg, NICE_0_LOAD);
+
 	mutex_unlock(&shares_mutex);
 	return 0;
 }
+
 #else /* CONFIG_FAIR_GROUP_SCHED */
 
 void free_fair_sched_group(struct task_group *tg) { }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 75a5c12..5fa0290 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -396,6 +396,9 @@ struct task_group {
 	struct cfs_rq		**cfs_rq;
 	unsigned long		shares;
 
+	/* A positive value indicates that this is a SCHED_IDLE group. */
+	int			idle;
+
 #ifdef	CONFIG_SMP
 	/*
 	 * load_avg can be heavily contended at clock tick time, so put
@@ -505,6 +508,8 @@ extern void sched_move_task(struct task_struct *tsk);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
 
+extern int sched_group_set_idle(struct task_group *tg, long idle);
+
 #ifdef CONFIG_SMP
 extern void set_task_rq_fair(struct sched_entity *se,
 			     struct cfs_rq *prev, struct cfs_rq *next);
@@ -601,6 +606,9 @@ struct cfs_rq {
 	struct list_head	leaf_cfs_rq_list;
 	struct task_group	*tg;	/* group that "owns" this runqueue */
 
+	/* Locally cached copy of our task_group's idle value */
+	int			idle;
+
 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
 	s64			runtime_remaining;

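Not part of the patch itself, but as an illustrative sketch of the resulting interface: with this series applied, a cgroup v2 child group can be marked SCHED_IDLE by writing 1 to its cpu.idle file, and tasks moved into that group are then treated as an idle entity relative to sibling groups. The mount point /sys/fs/cgroup, the pre-created group name "background" (with the cpu controller enabled), and the write_str() helper are assumptions made only for this example.

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	/* Write a small string to a cgroup control file; returns 0 on success. */
	static int write_str(const char *path, const char *val)
	{
		int fd = open(path, O_WRONLY);

		if (fd < 0) {
			perror(path);
			return -1;
		}
		if (write(fd, val, strlen(val)) != (ssize_t)strlen(val)) {
			perror(path);
			close(fd);
			return -1;
		}
		close(fd);
		return 0;
	}

	int main(void)
	{
		char pid[32];

		/* 1 => SCHED_IDLE group, 0 => default behavior. */
		if (write_str("/sys/fs/cgroup/background/cpu.idle", "1"))
			return 1;

		/* Move the calling task into the (now idle) group. */
		snprintf(pid, sizeof(pid), "%d", (int)getpid());
		return write_str("/sys/fs/cgroup/background/cgroup.procs", pid) ? 1 : 0;
	}

As the diff above shows, values other than 0 and 1 are rejected by sched_group_set_idle() with -EINVAL, and subsequent attempts to change the weight of an idle group fail the same way, since idle groups are pinned to minimum weight.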

end of thread, other threads:[~2021-08-23  9:26 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-30  2:00 [PATCH v2 0/2] SCHED_IDLE extensions Josh Don
2021-07-30  2:00 ` [PATCH v2 1/2] sched: cgroup SCHED_IDLE support Josh Don
2021-08-03  2:14   ` jun qian
2021-08-03 20:37     ` Josh Don
2021-08-05 10:18   ` Peter Zijlstra
2021-08-05 17:13     ` Tejun Heo
2021-08-05 23:54       ` Josh Don
2021-08-11 13:48   ` Vincent Guittot
2021-08-23  9:26   ` [tip: sched/core] sched: Cgroup " tip-bot2 for Josh Don
2021-07-30  2:00 ` [PATCH 2/2] sched: adjust SCHED_IDLE interactions Josh Don
2021-08-11 13:31   ` Vincent Guittot
2021-08-12 21:09     ` Josh Don
2021-08-13 12:43       ` Vincent Guittot
2021-08-13 23:55         ` Josh Don
2021-08-16 12:31       ` Peter Zijlstra
2021-08-17 23:48         ` Josh Don
2021-08-16 12:52     ` Peter Zijlstra
2021-08-16 12:56       ` Vincent Guittot
2021-08-17 23:40       ` Josh Don
2021-08-16 12:31   ` Peter Zijlstra
