* [PATCH] sched: cgroup SCHED_IDLE support
@ 2021-06-08 23:11 Josh Don
  2021-06-10 12:53 ` Dietmar Eggemann
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Josh Don @ 2021-06-08 23:11 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, David Rientjes,
	Oleg Rombakh, Viresh Kumar, Steve Sistare, Tejun Heo,
	linux-kernel, Josh Don

This extends SCHED_IDLE to cgroups.

Interface: cgroup/cpu.idle.
 0: default behavior
 1: SCHED_IDLE

Extending SCHED_IDLE to cgroups means that we incorporate the existing
aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its
descendant threads towards the idle_h_nr_running count of all of its
ancestor cgroups. Thus, sched_idle_rq() will work properly.
Additionally, SCHED_IDLE cgroups are configured with minimum weight.

There are two key differences between the per-task and per-cgroup
SCHED_IDLE interface:

- The cgroup interface allows tasks within a SCHED_IDLE hierarchy to
maintain their relative weights. The entity that is "idle" is the
cgroup, not the tasks themselves.

- Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption
decision is not made by comparing the current task with the woken task,
but rather by comparing their matching sched_entity.

A typical use-case for this is a user that creates an idle and a
non-idle subtree. The non-idle subtree will dominate competition vs
the idle subtree, but the idle subtree will still be high priority
vs other users on the system. The latter is accomplished via comparing
matching sched_entity in the wakeup preemption path (this could also be
improved by making the sched_idle_rq() decision dependent on the
perspective of a specific task).
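
As a rough illustration of such a setup (a userspace sketch only; it
assumes cgroup v2 mounted at /sys/fs/cgroup and an already-delegated
per-user group, and all names are illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return;
	}
	if (write(fd, val, strlen(val)) < 0)
		perror(path);
	close(fd);
}

int main(void)
{
	/* Non-idle subtree for the latency-sensitive work. */
	mkdir("/sys/fs/cgroup/user1/serving", 0755);
	/* Idle subtree for the batch work. */
	mkdir("/sys/fs/cgroup/user1/batch", 0755);

	/*
	 * Everything under "batch" now competes as SCHED_IDLE against
	 * "serving", while tasks inside "batch" keep their relative
	 * weights with respect to each other.
	 */
	write_str("/sys/fs/cgroup/user1/batch/cpu.idle", "1");
	return 0;
}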

For now, we maintain the existing SCHED_IDLE semantics. Future patches
may make improvements that extend how we treat SCHED_IDLE entities.

The per-task_group idle field is an integer that currently only holds
either a 0 or a 1. This is explicitly typed as an integer to allow for
further extensions to this API. For example, a negative value may
indicate a highly latency-sensitive cgroup that should be preferred for
preemption/placement/etc.

Signed-off-by: Josh Don <joshdon@google.com>
---
 kernel/sched/core.c  |  18 ++++
 kernel/sched/debug.c |   3 +
 kernel/sched/fair.c  | 201 +++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |   8 ++
 4 files changed, 205 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8ab69d01dab4..a82f5d3b85d3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10234,6 +10234,18 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
 
 	return sched_group_set_shares(css_tg(css), scale_load(weight));
 }
+
+static s64 cpu_idle_read_s64(struct cgroup_subsys_state *css,
+			       struct cftype *cft)
+{
+	return css_tg(css)->idle;
+}
+
+static int cpu_idle_write_s64(struct cgroup_subsys_state *css,
+				struct cftype *cft, s64 idle)
+{
+	return sched_group_set_idle(css_tg(css), idle);
+}
 #endif
 
 static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
@@ -10306,6 +10318,12 @@ static struct cftype cpu_files[] = {
 		.read_s64 = cpu_weight_nice_read_s64,
 		.write_s64 = cpu_weight_nice_write_s64,
 	},
+	{
+		.name = "idle",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_s64 = cpu_idle_read_s64,
+		.write_s64 = cpu_idle_write_s64,
+	},
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 3bdee5fd7d29..8c7780a96d66 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -600,6 +600,9 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_spread_over",
 			cfs_rq->nr_spread_over);
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
+	SEQ_printf(m, "  .%-30s: %d\n", "h_nr_running", cfs_rq->h_nr_running);
+	SEQ_printf(m, "  .%-30s: %d\n", "idle_h_nr_running",
+			cfs_rq->idle_h_nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_SMP
 	SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 720adea946d1..fd24f1bfcac5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -431,6 +431,23 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 	}
 }
 
+static int tg_is_idle(struct task_group *tg)
+{
+	return tg->idle > 0;
+}
+
+static int cfs_rq_is_idle(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->idle > 0;
+}
+
+static int se_is_idle(struct sched_entity *se)
+{
+	if (entity_is_task(se))
+		return task_has_idle_policy(task_of(se));
+	return cfs_rq_is_idle(group_cfs_rq(se));
+}
+
 #else	/* !CONFIG_FAIR_GROUP_SCHED */
 
 #define for_each_sched_entity(se) \
@@ -468,6 +485,21 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 {
 }
 
+static int tg_is_idle(struct task_group *tg)
+{
+	return 0;
+}
+
+static int cfs_rq_is_idle(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
+
+static int se_is_idle(struct sched_entity *se)
+{
+	return 0;
+}
+
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
 static __always_inline
@@ -4773,6 +4805,9 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 
 		dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
 
+		if (cfs_rq_is_idle(group_cfs_rq(se)))
+			idle_task_delta = cfs_rq->h_nr_running;
+
 		qcfs_rq->h_nr_running -= task_delta;
 		qcfs_rq->idle_h_nr_running -= idle_task_delta;
 
@@ -4792,6 +4827,9 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		update_load_avg(qcfs_rq, se, 0);
 		se_update_runnable(se);
 
+		if (cfs_rq_is_idle(group_cfs_rq(se)))
+			idle_task_delta = cfs_rq->h_nr_running;
+
 		qcfs_rq->h_nr_running -= task_delta;
 		qcfs_rq->idle_h_nr_running -= idle_task_delta;
 	}
@@ -4836,39 +4874,45 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	task_delta = cfs_rq->h_nr_running;
 	idle_task_delta = cfs_rq->idle_h_nr_running;
 	for_each_sched_entity(se) {
+		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
+
 		if (se->on_rq)
 			break;
-		cfs_rq = cfs_rq_of(se);
-		enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
+		enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
+
+		if (cfs_rq_is_idle(group_cfs_rq(se)))
+			idle_task_delta = cfs_rq->h_nr_running;
 
-		cfs_rq->h_nr_running += task_delta;
-		cfs_rq->idle_h_nr_running += idle_task_delta;
+		qcfs_rq->h_nr_running += task_delta;
+		qcfs_rq->idle_h_nr_running += idle_task_delta;
 
 		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
+		if (cfs_rq_throttled(qcfs_rq))
 			goto unthrottle_throttle;
 	}
 
 	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
 
-		update_load_avg(cfs_rq, se, UPDATE_TG);
+		update_load_avg(qcfs_rq, se, UPDATE_TG);
 		se_update_runnable(se);
 
-		cfs_rq->h_nr_running += task_delta;
-		cfs_rq->idle_h_nr_running += idle_task_delta;
+		if (cfs_rq_is_idle(group_cfs_rq(se)))
+			idle_task_delta = cfs_rq->h_nr_running;
 
+		qcfs_rq->h_nr_running += task_delta;
+		qcfs_rq->idle_h_nr_running += idle_task_delta;
 
 		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
+		if (cfs_rq_throttled(qcfs_rq))
 			goto unthrottle_throttle;
 
 		/*
 		 * One parent has been throttled and cfs_rq removed from the
 		 * list. Add it back to not break the leaf list.
 		 */
-		if (throttled_hierarchy(cfs_rq))
-			list_add_leaf_cfs_rq(cfs_rq);
+		if (throttled_hierarchy(qcfs_rq))
+			list_add_leaf_cfs_rq(qcfs_rq);
 	}
 
 	/* At this point se is NULL and we are at root level*/
@@ -4881,9 +4925,9 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	 * assertion below.
 	 */
 	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
 
-		if (list_add_leaf_cfs_rq(cfs_rq))
+		if (list_add_leaf_cfs_rq(qcfs_rq))
 			break;
 	}
 
@@ -5503,6 +5547,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_running++;
 		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 
+		if (cfs_rq_is_idle(cfs_rq))
+			idle_h_nr_running = 1;
+
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
 			goto enqueue_throttle;
@@ -5520,6 +5567,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_running++;
 		cfs_rq->idle_h_nr_running += idle_h_nr_running;
 
+		if (cfs_rq_is_idle(cfs_rq))
+			idle_h_nr_running = 1;
+
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
 			goto enqueue_throttle;
@@ -5597,6 +5647,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_running--;
 		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 
+		if (cfs_rq_is_idle(cfs_rq))
+			idle_h_nr_running = 1;
+
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
 			goto dequeue_throttle;
@@ -5626,6 +5679,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_running--;
 		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
 
+		if (cfs_rq_is_idle(cfs_rq))
+			idle_h_nr_running = 1;
+
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
 			goto dequeue_throttle;
@@ -6948,24 +7004,22 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
 
 static void set_last_buddy(struct sched_entity *se)
 {
-	if (entity_is_task(se) && unlikely(task_has_idle_policy(task_of(se))))
-		return;
-
 	for_each_sched_entity(se) {
 		if (SCHED_WARN_ON(!se->on_rq))
 			return;
+		if (se_is_idle(se))
+			return;
 		cfs_rq_of(se)->last = se;
 	}
 }
 
 static void set_next_buddy(struct sched_entity *se)
 {
-	if (entity_is_task(se) && unlikely(task_has_idle_policy(task_of(se))))
-		return;
-
 	for_each_sched_entity(se) {
 		if (SCHED_WARN_ON(!se->on_rq))
 			return;
+		if (se_is_idle(se))
+			return;
 		cfs_rq_of(se)->next = se;
 	}
 }
@@ -6986,6 +7040,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 	int scale = cfs_rq->nr_running >= sched_nr_latency;
 	int next_buddy_marked = 0;
+	int cse_is_idle, pse_is_idle;
 
 	if (unlikely(se == pse))
 		return;
@@ -7030,8 +7085,21 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		return;
 
 	find_matching_se(&se, &pse);
-	update_curr(cfs_rq_of(se));
 	BUG_ON(!pse);
+
+	cse_is_idle = se_is_idle(se);
+	pse_is_idle = se_is_idle(pse);
+
+	/*
+	 * Preempt an idle group in favor of a non-idle group (and don't preempt
+	 * in the inverse case).
+	 */
+	if (cse_is_idle && !pse_is_idle)
+		goto preempt;
+	if (cse_is_idle != pse_is_idle)
+		return;
+
+	update_curr(cfs_rq_of(se));
 	if (wakeup_preempt_entity(se, pse) == 1) {
 		/*
 		 * Bias pick_next to pick the sched entity that is
@@ -11340,10 +11408,14 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 
 static DEFINE_MUTEX(shares_mutex);
 
-int sched_group_set_shares(struct task_group *tg, unsigned long shares)
+#define IDLE_WEIGHT sched_prio_to_weight[ARRAY_SIZE(sched_prio_to_weight) - 1]
+
+static int __sched_group_set_shares(struct task_group *tg, unsigned long shares)
 {
 	int i;
 
+	lockdep_assert_held(&shares_mutex);
+
 	/*
 	 * We can't change the weight of the root cgroup.
 	 */
@@ -11352,9 +11424,8 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 
 	shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
 
-	mutex_lock(&shares_mutex);
 	if (tg->shares == shares)
-		goto done;
+		return 0;
 
 	tg->shares = shares;
 	for_each_possible_cpu(i) {
@@ -11372,10 +11443,90 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 		rq_unlock_irqrestore(rq, &rf);
 	}
 
-done:
+	return 0;
+}
+
+int sched_group_set_shares(struct task_group *tg, unsigned long shares)
+{
+	int ret;
+
+	mutex_lock(&shares_mutex);
+	if (tg_is_idle(tg))
+		ret = -EINVAL;
+	else
+		ret = __sched_group_set_shares(tg, shares);
+	mutex_unlock(&shares_mutex);
+
+	return ret;
+}
+
+int sched_group_set_idle(struct task_group *tg, long idle)
+{
+	int i;
+
+	if (tg == &root_task_group)
+		return -EINVAL;
+
+	if (idle < 0 || idle > 1)
+		return -EINVAL;
+
+	mutex_lock(&shares_mutex);
+
+	if (tg->idle == idle) {
+		mutex_unlock(&shares_mutex);
+		return 0;
+	}
+
+	tg->idle = idle;
+
+	for_each_possible_cpu(i) {
+		struct rq *rq = cpu_rq(i);
+		struct sched_entity *se = tg->se[i];
+		struct cfs_rq *grp_cfs_rq = tg->cfs_rq[i];
+		bool was_idle = cfs_rq_is_idle(grp_cfs_rq);
+		long idle_task_delta;
+		struct rq_flags rf;
+
+		rq_lock_irqsave(rq, &rf);
+
+		grp_cfs_rq->idle = idle;
+		if (WARN_ON_ONCE(was_idle == cfs_rq_is_idle(grp_cfs_rq)))
+			goto next_cpu;
+
+		idle_task_delta = grp_cfs_rq->h_nr_running -
+				  grp_cfs_rq->idle_h_nr_running;
+		if (!cfs_rq_is_idle(grp_cfs_rq))
+			idle_task_delta *= -1;
+
+		for_each_sched_entity(se) {
+			struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+			if (!se->on_rq)
+				break;
+
+			cfs_rq->idle_h_nr_running += idle_task_delta;
+
+			/* Already accounted at parent level and above. */
+			if (cfs_rq_is_idle(cfs_rq))
+				break;
+		}
+
+next_cpu:
+		rq_unlock_irqrestore(rq, &rf);
+	}
+
+	/* Idle groups have minimum weight. */
+	if (tg_is_idle(tg))
+		__sched_group_set_shares(tg, scale_load(IDLE_WEIGHT));
+	else
+		__sched_group_set_shares(tg, NICE_0_LOAD);
+
 	mutex_unlock(&shares_mutex);
 	return 0;
 }
+
+#undef IDLE_WEIGHT
+
 #else /* CONFIG_FAIR_GROUP_SCHED */
 
 void free_fair_sched_group(struct task_group *tg) { }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8f0194cee0ba..aa8ffcd52ee4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -393,6 +393,9 @@ struct task_group {
 	struct cfs_rq		**cfs_rq;
 	unsigned long		shares;
 
+	/* A positive value indicates that this is a SCHED_IDLE group. */
+	int			idle;
+
 #ifdef	CONFIG_SMP
 	/*
 	 * load_avg can be heavily contended at clock tick time, so put
@@ -502,6 +505,8 @@ extern void sched_move_task(struct task_struct *tsk);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
 
+extern int sched_group_set_idle(struct task_group *tg, long idle);
+
 #ifdef CONFIG_SMP
 extern void set_task_rq_fair(struct sched_entity *se,
 			     struct cfs_rq *prev, struct cfs_rq *next);
@@ -598,6 +603,9 @@ struct cfs_rq {
 	struct list_head	leaf_cfs_rq_list;
 	struct task_group	*tg;	/* group that "owns" this runqueue */
 
+	/* Locally cached copy of our task_group's idle value */
+	int			idle;
+
 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
 	s64			runtime_remaining;
-- 
2.32.0.rc1.229.g3e70b5a671-goog



* Re: [PATCH] sched: cgroup SCHED_IDLE support
  2021-06-08 23:11 [PATCH] sched: cgroup SCHED_IDLE support Josh Don
@ 2021-06-10 12:53 ` Dietmar Eggemann
  2021-06-10 19:14   ` Josh Don
  2021-06-16 15:42 ` Tejun Heo
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Dietmar Eggemann @ 2021-06-10 12:53 UTC (permalink / raw)
  To: Josh Don, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, David Rientjes,
	Oleg Rombakh, Viresh Kumar, Steve Sistare, Tejun Heo,
	linux-kernel

Hi Josh,

On 09/06/2021 01:11, Josh Don wrote:

[...]

>  static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
> @@ -10306,6 +10318,12 @@ static struct cftype cpu_files[] = {
>  		.read_s64 = cpu_weight_nice_read_s64,
>  		.write_s64 = cpu_weight_nice_write_s64,
>  	},
> +	{
> +		.name = "idle",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.read_s64 = cpu_idle_read_s64,
> +		.write_s64 = cpu_idle_write_s64,
> +	},

Any reason why this should only work on cgroup-v2?

struct cftype cpu_legacy_files[] vs. cpu_files[]

[...]

> @@ -11340,10 +11408,14 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
>  
>  static DEFINE_MUTEX(shares_mutex);
>  
> -int sched_group_set_shares(struct task_group *tg, unsigned long shares)
> +#define IDLE_WEIGHT sched_prio_to_weight[ARRAY_SIZE(sched_prio_to_weight) - 1]

Why not 3 ? Like for tasks (WEIGHT_IDLEPRIO)?

[...]


* Re: [PATCH] sched: cgroup SCHED_IDLE support
  2021-06-10 12:53 ` Dietmar Eggemann
@ 2021-06-10 19:14   ` Josh Don
  2021-06-11 16:43     ` Dietmar Eggemann
  0 siblings, 1 reply; 19+ messages in thread
From: Josh Don @ 2021-06-10 19:14 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, David Rientjes,
	Oleg Rombakh, Viresh Kumar, Steve Sistare, Tejun Heo,
	linux-kernel

Hey Dietmar,

On Thu, Jun 10, 2021 at 5:53 AM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> Any reason why this should only work on cgroup-v2?

My (perhaps incorrect) assumption was that new development should not
extend v1. I'd actually prefer making this work on v1 as well; I'll
add that support.

> struct cftype cpu_legacy_files[] vs. cpu_files[]
>
> [...]
>
> > @@ -11340,10 +11408,14 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
> >
> >  static DEFINE_MUTEX(shares_mutex);
> >
> > -int sched_group_set_shares(struct task_group *tg, unsigned long shares)
> > +#define IDLE_WEIGHT sched_prio_to_weight[ARRAY_SIZE(sched_prio_to_weight) - 1]
>
> Why not 3 ? Like for tasks (WEIGHT_IDLEPRIO)?
>
> [...]

Went back and forth on this; on second look, I do think it makes sense
to use the IDLEPRIO weight of 3 here. This gets converted to a 0,
rather than a 1 for display of cpu.weight, which is also actually a
nice property.


* Re: [PATCH] sched: cgroup SCHED_IDLE support
  2021-06-10 19:14   ` Josh Don
@ 2021-06-11 16:43     ` Dietmar Eggemann
  2021-06-11 23:34       ` Josh Don
  0 siblings, 1 reply; 19+ messages in thread
From: Dietmar Eggemann @ 2021-06-11 16:43 UTC (permalink / raw)
  To: Josh Don
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, David Rientjes,
	Oleg Rombakh, Viresh Kumar, Steve Sistare, Tejun Heo,
	linux-kernel

On 10/06/2021 21:14, Josh Don wrote:
> Hey Dietmar,
> 
> On Thu, Jun 10, 2021 at 5:53 AM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> Any reason why this should only work on cgroup-v2?
> 
> My (perhaps incorrect) assumption that new development should not
> extend v1. I'd actually prefer making this work on v1 as well; I'll
> add that support.
> 
>> struct cftype cpu_legacy_files[] vs. cpu_files[]
>>
>> [...]
>>
>>> @@ -11340,10 +11408,14 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
>>>
>>>  static DEFINE_MUTEX(shares_mutex);
>>>
>>> -int sched_group_set_shares(struct task_group *tg, unsigned long shares)
>>> +#define IDLE_WEIGHT sched_prio_to_weight[ARRAY_SIZE(sched_prio_to_weight) - 1]
>>
>> Why not 3 ? Like for tasks (WEIGHT_IDLEPRIO)?
>>
>> [...]
> 
> Went back and forth on this; on second look, I do think it makes sense
> to use the IDLEPRIO weight of 3 here. This gets converted to a 0,
> rather than a 1 for display of cpu.weight, which is also actually a
> nice property.

I'm struggling to see the benefit here.

For a taskgroup A: why set A/cpu.idle=1 to force a minimum A->shares
when you can set it directly via A/cpu.weight (to 1, the minimum)?

WEIGHT     cpu.weight   tg->shares
3          0            3072
15         1            15360
           1            10240

`A/cpu.weight` follows cgroup-v2's `weights` `resource distribution
model`*, but I can only see `A/cpu.idle` as a layer on top of it that
forces `A/cpu.weight` to its minimum value?

*Documentation/admin-guide/cgroup-v2.rst
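
The numbers in the table fall out of the cgroup-v2 weight conversion; a
rough userspace sketch of that arithmetic (illustrative only, assuming
the kernel's current round-to-closest conversion) reproduces them:

#include <stdio.h>

static unsigned long weight_to_shares(unsigned long cpu_weight)
{
	/* cpu.weight write path: round(weight * 1024 / 100), scaled by 1024 */
	return ((cpu_weight * 1024 + 50) / 100) * 1024;
}

static unsigned long shares_to_weight(unsigned long shares)
{
	/* cpu.weight read path: round((shares / 1024) * 100 / 1024) */
	return (shares / 1024 * 100 + 512) / 1024;
}

int main(void)
{
	printf("cpu.weight=1     -> tg->shares %lu\n", weight_to_shares(1));
	printf("tg->shares 3072  -> cpu.weight %lu\n", shares_to_weight(3072));
	printf("tg->shares 15360 -> cpu.weight %lu\n", shares_to_weight(15360));
	return 0;
}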


* Re: [PATCH] sched: cgroup SCHED_IDLE support
  2021-06-11 16:43     ` Dietmar Eggemann
@ 2021-06-11 23:34       ` Josh Don
  2021-06-15 10:06         ` Dietmar Eggemann
  0 siblings, 1 reply; 19+ messages in thread
From: Josh Don @ 2021-06-11 23:34 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, David Rientjes,
	Oleg Rombakh, Viresh Kumar, Steve Sistare, Tejun Heo,
	linux-kernel

On Fri, Jun 11, 2021 at 9:43 AM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 10/06/2021 21:14, Josh Don wrote:
> > Hey Dietmar,
> >
> > On Thu, Jun 10, 2021 at 5:53 AM Dietmar Eggemann
> > <dietmar.eggemann@arm.com> wrote:
> >>
> >> Any reason why this should only work on cgroup-v2?
> >
> > My (perhaps incorrect) assumption that new development should not
> > extend v1. I'd actually prefer making this work on v1 as well; I'll
> > add that support.
> >
> >> struct cftype cpu_legacy_files[] vs. cpu_files[]
> >>
> >> [...]
> >>
> >>> @@ -11340,10 +11408,14 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
> >>>
> >>>  static DEFINE_MUTEX(shares_mutex);
> >>>
> >>> -int sched_group_set_shares(struct task_group *tg, unsigned long shares)
> >>> +#define IDLE_WEIGHT sched_prio_to_weight[ARRAY_SIZE(sched_prio_to_weight) - 1]
> >>
> >> Why not 3 ? Like for tasks (WEIGHT_IDLEPRIO)?
> >>
> >> [...]
> >
> > Went back and forth on this; on second look, I do think it makes sense
> > to use the IDLEPRIO weight of 3 here. This gets converted to a 0,
> > rather than a 1 for display of cpu.weight, which is also actually a
> > nice property.
>
> I'm struggling to see the benefit here.
>
> For a taskgroup A: Why setting A/cpu.idle=1 to force a minimum A->shares
> when you can set it directly via A/cpu.weight (to 1 (minimum))?
>
> WEIGHT     cpu.weight   tg->shares
>
> 3          0            3072
>
> 15         1            15360
>
>            1            10240
>
> `A/cpu.weight` follows cgroup-v2's `weights` `resource distribution
> model`* but I can only see `A/cpu.idle` as a layer on top of it forcing
> `A/cpu.weight` to get its minimum value?
>
> *Documentation/admin-guide/cgroup-v2.rst

Setting cpu.idle carries additional properties beyond just the
weight. Currently, it primarily includes (a) special wakeup preemption
handling, and (b) contribution to idle_h_nr_running for the purpose of
marking a cpu as a sched_idle_cpu(). Essentially, the current
SCHED_IDLE mechanics. I've also discussed with Peter a potential
extension to SCHED_IDLE to manipulate vruntime.

We set the cgroup weight here, since by definition SCHED_IDLE entities
have the least scheduling weight. From the perspective of your
question, the analogous statement for tasks would be that we set task
weight to the min when doing setsched(SCHED_IDLE), even though we
already have a renice mechanism.
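
For comparison, the existing per-task knob mentioned above is just a
setsched call; a minimal userspace sketch (with glibc, SCHED_IDLE needs
_GNU_SOURCE, and sched_priority must be 0):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	struct sched_param sp = { .sched_priority = 0 };

	/* Put the calling thread into SCHED_IDLE. */
	if (sched_setscheduler(0, SCHED_IDLE, &sp))
		perror("sched_setscheduler(SCHED_IDLE)");
	return 0;
}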


* Re: [PATCH] sched: cgroup SCHED_IDLE support
  2021-06-11 23:34       ` Josh Don
@ 2021-06-15 10:06         ` Dietmar Eggemann
  2021-06-15 23:30           ` Josh Don
  2021-06-25  9:24           ` Peter Zijlstra
  0 siblings, 2 replies; 19+ messages in thread
From: Dietmar Eggemann @ 2021-06-15 10:06 UTC (permalink / raw)
  To: Josh Don
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, David Rientjes,
	Oleg Rombakh, Viresh Kumar, Steve Sistare, Tejun Heo,
	linux-kernel

On 12/06/2021 01:34, Josh Don wrote:
> On Fri, Jun 11, 2021 at 9:43 AM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> On 10/06/2021 21:14, Josh Don wrote:
>>> Hey Dietmar,
>>>
>>> On Thu, Jun 10, 2021 at 5:53 AM Dietmar Eggemann
>>> <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> Any reason why this should only work on cgroup-v2?
>>>
>>> My (perhaps incorrect) assumption that new development should not
>>> extend v1. I'd actually prefer making this work on v1 as well; I'll
>>> add that support.
>>>
>>>> struct cftype cpu_legacy_files[] vs. cpu_files[]
>>>>
>>>> [...]
>>>>
>>>>> @@ -11340,10 +11408,14 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
>>>>>
>>>>>  static DEFINE_MUTEX(shares_mutex);
>>>>>
>>>>> -int sched_group_set_shares(struct task_group *tg, unsigned long shares)
>>>>> +#define IDLE_WEIGHT sched_prio_to_weight[ARRAY_SIZE(sched_prio_to_weight) - 1]
>>>>
>>>> Why not 3 ? Like for tasks (WEIGHT_IDLEPRIO)?
>>>>
>>>> [...]
>>>
>>> Went back and forth on this; on second look, I do think it makes sense
>>> to use the IDLEPRIO weight of 3 here. This gets converted to a 0,
>>> rather than a 1 for display of cpu.weight, which is also actually a
>>> nice property.
>>
>> I'm struggling to see the benefit here.
>>
>> For a taskgroup A: Why setting A/cpu.idle=1 to force a minimum A->shares
>> when you can set it directly via A/cpu.weight (to 1 (minimum))?
>>
>> WEIGHT     cpu.weight   tg->shares
>>
>> 3          0            3072
>>
>> 15         1            15360
>>
>>            1            10240
>>
>> `A/cpu.weight` follows cgroup-v2's `weights` `resource distribution
>> model`* but I can only see `A/cpu.idle` as a layer on top of it forcing
>> `A/cpu.weight` to get its minimum value?
>>
>> *Documentation/admin-guide/cgroup-v2.rst
> 
> Setting cpu.idle carries additional properties in addition to just the
> weight. Currently, it primarily includes (a) special wakeup preemption
> handling, and (b) contribution to idle_h_nr_running for the purpose of
> marking a cpu as a sched_idle_cpu(). Essentially, the current
> SCHED_IDLE mechanics. I've also discussed with Peter a potential
> extension to SCHED_IDLE to manipulate vruntime.

Right, I forgot about (b).

But IMHO, (a) could be handled with this special tg->shares value for
SCHED_IDLE.

If there were a way to open up `cpu.weight`, `cpu.weight.nice` (and
`cpu.shares` for v1) to take a special value for SCHED_IDLE, then you
wouldn't need cpu.idle.
And you could handle the functionality from sched_group_set_idle()
directly in sched_group_set_shares().
In this case sched_group_set_shares() wouldn't have to be rejected on an
idle tg.
A tg would just become !idle by writing a different cpu.weight value.
Currently, if you !idle a tg it gets the default NICE_0_LOAD.


I guess cpu.weight [1, 10000] would be easy: 0 could be taken for that
and mapped into weight = WEIGHT_IDLEPRIO (3, 3072) to call
sched_group_set_shares(..., scale_load(weight)).
cpu.weight = 1 maps to (10, 10240).

cpu.weight.nice [-20, 19] would already be more complicated; maybe 20?

And for cpu.shares [2, 2 << 18], 0 could be used. The issue here is that
WEIGHT_IDLEPRIO (3, 3072) is already a valid value for shares.

> We set the cgroup weight here, since by definition SCHED_IDLE entities
> have the least scheduling weight. From the perspective of your
> question, the analogous statement for tasks would be that we set task
> weight to the min when doing setsched(SCHED_IDLE), even though we
> already have a renice mechanism.

I agree. `cpu.idle = 1` is like setting the task policy to SCHED_IDLE.
And there is even the `cpu.weight.nice` to support the `task - tg`
analogy on nice values.

I'm just wondering if integrating this into `cpu.weight` and friends
would be better to make the code behind this easier to grasp.
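
For illustration, an overloaded v2 write handler along those lines could
look roughly like the sketch below (hypothetical, not part of the posted
patch; it reuses the patch's sched_group_set_idle() helper and treats
cpu.weight == 0 as the special SCHED_IDLE value):

static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
				struct cftype *cft, u64 cgrp_weight)
{
	struct task_group *tg = css_tg(css);
	unsigned long weight;
	int ret;

	if (cgrp_weight == 0)	/* special value: mark the group idle */
		return sched_group_set_idle(tg, 1);

	if (cgrp_weight < CGROUP_WEIGHT_MIN || cgrp_weight > CGROUP_WEIGHT_MAX)
		return -ERANGE;

	/* Any real weight un-idles the group again. */
	ret = sched_group_set_idle(tg, 0);
	if (ret)
		return ret;

	weight = DIV_ROUND_CLOSEST_ULL(cgrp_weight * 1024, CGROUP_WEIGHT_DFLT);
	return sched_group_set_shares(tg, scale_load(weight));
}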


* Re: [PATCH] sched: cgroup SCHED_IDLE support
  2021-06-15 10:06         ` Dietmar Eggemann
@ 2021-06-15 23:30           ` Josh Don
  2021-06-25  9:24           ` Peter Zijlstra
  1 sibling, 0 replies; 19+ messages in thread
From: Josh Don @ 2021-06-15 23:30 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, David Rientjes,
	Oleg Rombakh, Viresh Kumar, Steve Sistare, Tejun Heo,
	linux-kernel

On Tue, Jun 15, 2021 at 3:07 AM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 12/06/2021 01:34, Josh Don wrote:
> > On Fri, Jun 11, 2021 at 9:43 AM Dietmar Eggemann
> > <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 10/06/2021 21:14, Josh Don wrote:

[snip]

> > Setting cpu.idle carries additional properties in addition to just the
> > weight. Currently, it primarily includes (a) special wakeup preemption
> > handling, and (b) contribution to idle_h_nr_running for the purpose of
> > marking a cpu as a sched_idle_cpu(). Essentially, the current
> > SCHED_IDLE mechanics. I've also discussed with Peter a potential
> > extension to SCHED_IDLE to manipulate vruntime.
>
> Right, I forgot about (b).
>
> But IMHO, (a) could be handled with this special tg->shares value for
> SCHED_IDLE.
>
> If there would be a way to open up `cpu.weight`, `cpu.weight.nice` (and
> `cpu,shares` for v1) to take a special value for SCHED_IDLE, then you
> won't need cpu.idle.
> And you could handle the functionality from sched_group_set_idle()
> directly in sched_group_set_shares().
> In this case sched_group_set_shares() wouldn't have to be rejected on an
> idle tg.
> A tg would just become !idle by writing a different cpu.weight value.
> Currently, if you !idle a tg it gets the default NICE_0_LOAD.
>
> I guess cpu.weight [1, 10000] would be easy, 0 could be taken for that
> and mapped into weight = WEIGHT_IDLEPRIO (3, 3072) to call
> sched_group_set_shares(..., scale_load(weight).
> cpu.weight = 1 maps to (10, 10240)
>
> cpu.weight.nice [-20, 19] would be already more complicated, 20?
>
> And for cpu.shares [2, 2 << 18] 0 could be used. The issue here is that
> WEIGHT_IDLEPRIO (3, 3072) is a valid value already for shares.
>
> > We set the cgroup weight here, since by definition SCHED_IDLE entities
> > have the least scheduling weight. From the perspective of your
> > question, the analogous statement for tasks would be that we set task
> > weight to the min when doing setsched(SCHED_IDLE), even though we
> > already have a renice mechanism.
>
> I agree. `cpu.idle = 1` is like setting the task policy to SCHED_IDLE.
> And there is even the `cpu.weight.nice` to support the `task - tg`
> analogy on nice values.
>
> I'm just wondering if integrating this into `cpu.weight` and friends
> would be better to make the code behind this easier to grasp.

Might be confusing that we manipulate cgroup SCHED_IDLE via special
weight/weight.nice values, whereas task SCHED_IDLE is manipulated via
setsched (and we would allow nice 20 for cgroups but not for tasks,
for example).

More importantly, I think it would be better to avoid unintentional
side effects caused by overloading the weight interfaces. Thoughts on
that?

Another (less compelling) reason for the separate interface is that it
allows further extension. For example, negative values in cpu.idle
could indicate increasingly more latency-sensitive cgroups, which
could benefit from better wakeup preemption, entity placement, etc.


* Re: [PATCH] sched: cgroup SCHED_IDLE support
  2021-06-08 23:11 [PATCH] sched: cgroup SCHED_IDLE support Josh Don
  2021-06-10 12:53 ` Dietmar Eggemann
@ 2021-06-16 15:42 ` Tejun Heo
  2021-06-17  1:01   ` Josh Don
  2021-06-25  8:08   ` Peter Zijlstra
  2021-06-25  8:14 ` Peter Zijlstra
  2021-06-25  8:20 ` Peter Zijlstra
  3 siblings, 2 replies; 19+ messages in thread
From: Tejun Heo @ 2021-06-16 15:42 UTC (permalink / raw)
  To: Josh Don
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, David Rientjes,
	Oleg Rombakh, Viresh Kumar, Steve Sistare, linux-kernel,
	Rik van Riel

Hello,

On Tue, Jun 08, 2021 at 04:11:32PM -0700, Josh Don wrote:
> This extends SCHED_IDLE to cgroups.
> 
> Interface: cgroup/cpu.idle.
>  0: default behavior
>  1: SCHED_IDLE
> 
> Extending SCHED_IDLE to cgroups means that we incorporate the existing
> aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its
> descendant threads towards the idle_h_nr_running count of all of its
> ancestor cgroups. Thus, sched_idle_rq() will work properly.
> Additionally, SCHED_IDLE cgroups are configured with minimum weight.
> 
> There are two key differences between the per-task and per-cgroup
> SCHED_IDLE interface:
> 
> - The cgroup interface allows tasks within a SCHED_IDLE hierarchy to
> maintain their relative weights. The entity that is "idle" is the
> cgroup, not the tasks themselves.
> 
> - Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption
> decision is not made by comparing the current task with the woken task,
> but rather by comparing their matching sched_entity.
> 
> A typical use-case for this is a user that creates an idle and a
> non-idle subtree. The non-idle subtree will dominate competition vs
> the idle subtree, but the idle subtree will still be high priority
> vs other users on the system. The latter is accomplished via comparing
> matching sched_entity in the wakeup preemption path (this could also be
> improved by making the sched_idle_rq() decision dependent on the
> perspective of a specific task).

A high-level problem that I see with the proposal is that this would bake
the current recursive implementation into the interface. The semantics of
the currently exposed interface, at least the weight based part, is abstract
and doesn't necessarily dictate how the scheduling is actually performed.
Adding this would mean that we're now codifying the current behavior of
fully nested scheduling into the interface.

There are several practical challenges with the current implementation
caused by the full nesting - e.g. nesting levels are expensive for context
switch heavy applications, often going over 1% per level, and heuristics
which assume a global queue may behave unexpectedly - ie. we can create
conditions where things like idle-wakeup boost behave very differently
depending on whether tasks are inside a cgroup or not, even when the eventual
relative weights and past usages are similar.

Can you please give more details on why this is beneficial? Is the benefit
mostly around making configuration easy or are there actual scheduling
behaviors that you can't achieve otherwise?

Thanks.

-- 
tejun


* Re: [PATCH] sched: cgroup SCHED_IDLE support
  2021-06-16 15:42 ` Tejun Heo
@ 2021-06-17  1:01   ` Josh Don
  2021-06-26  9:57     ` Tejun Heo
  2021-06-25  8:08   ` Peter Zijlstra
  1 sibling, 1 reply; 19+ messages in thread
From: Josh Don @ 2021-06-17  1:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, David Rientjes,
	Oleg Rombakh, Viresh Kumar, Steve Sistare, linux-kernel,
	Rik van Riel

Hey Tejun,

Thanks for taking a look.

On Wed, Jun 16, 2021 at 8:42 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Jun 08, 2021 at 04:11:32PM -0700, Josh Don wrote:
> > This extends SCHED_IDLE to cgroups.
> >
> > Interface: cgroup/cpu.idle.
> >  0: default behavior
> >  1: SCHED_IDLE
> >
> > Extending SCHED_IDLE to cgroups means that we incorporate the existing
> > aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its
> > descendant threads towards the idle_h_nr_running count of all of its
> > ancestor cgroups. Thus, sched_idle_rq() will work properly.
> > Additionally, SCHED_IDLE cgroups are configured with minimum weight.
> >
> > There are two key differences between the per-task and per-cgroup
> > SCHED_IDLE interface:
> >
> > - The cgroup interface allows tasks within a SCHED_IDLE hierarchy to
> > maintain their relative weights. The entity that is "idle" is the
> > cgroup, not the tasks themselves.
> >
> > - Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption
> > decision is not made by comparing the current task with the woken task,
> > but rather by comparing their matching sched_entity.
> >
> > A typical use-case for this is a user that creates an idle and a
> > non-idle subtree. The non-idle subtree will dominate competition vs
> > the idle subtree, but the idle subtree will still be high priority
> > vs other users on the system. The latter is accomplished via comparing
> > matching sched_entity in the wakeup preemption path (this could also be
> > improved by making the sched_idle_rq() decision dependent on the
> > perspective of a specific task).
>
> A high-level problem that I see with the proposal is that this would bake
> the current recursive implementation into the interface. The semantics of
> the currently exposed interface, at least the weight based part, is abstract
> and doesn't necessarily dictate how the scheduling is actually performed.
> Adding this would mean that we're now codifying the current behavior of
> fully nested scheduling into the interface.
>
> There are several practical challenges with the current implementation
> caused by the full nesting - e.g. nesting levels are expensive for context
> switch heavy applications often going over >1% per level, and heuristics
> which assume global queue may behave unexpectedly - ie. we can create
> conditions where things like idle-wakeup boost behave very differently
> depending on whether tasks are inside a cgroup or not even when the eventual
> relative weights and past usages are similar.

To be clear, are you talking mainly about the idle_h_nr_running
accounting? The fact that setting cpu.idle propagates changes upwards
to ancestor cgroups?

The goal is to match the SCHED_IDLE semantics here, which requires
this behavior. The end result is no different than a nested cgroup
with SCHED_IDLE tasks. One possible alternative would be to only count
root-level cpu.idle cgroups towards the overall idle_h_nr_running.
I've not opted for that, in order to make this work more similarly to
the task interface. Also note that there isn't additional overhead
from the accounting, since enqueue/dequeue already traverses these
nesting levels.

Could you elaborate a bit more on your statement here?

> Can you please give more details on why this is beneficial? Is the benefit
> mostly around making configuration easy or are there actual scheduling
> behaviors that you can't achieve otherwise?

There are actual scheduling behaviors that cannot be achieved without
the cgroup interface. The task SCHED_IDLE mechanism is a combination
of special SCHED_IDLE handling (idle_h_nr_running,
check_preempt_wakeup), and minimum scheduling weight.

Consider a tree like

            root
           /    \
          A      C
         / \      \
        B   idle   t4
        |   /  \
       t1  t2   t3

Here, 'idle' is our cpu.idle cgroup. The following properties would
not be possible if we moved t2/t3 into SCHED_IDLE without the cgroup
interface:
- t1 always preempts t2/t3 on wakeup, but t4 does not
- t2 and t3 have different, non-minimum weights. Technically we could
also achieve this by adding another layer of nested cgroups, but that
starts to make the hierarchy much more complex.
- I've also discussed with Peter a possible extension (vruntime
adjustments) to the current SCHED_IDLE semantics. Similarly to the
first bullet here, we'd need a cgroup idle toggle to achieve certain
scheduling behaviors with this.

Thanks,
Josh


* Re: [PATCH] sched: cgroup SCHED_IDLE support
  2021-06-16 15:42 ` Tejun Heo
  2021-06-17  1:01   ` Josh Don
@ 2021-06-25  8:08   ` Peter Zijlstra
  2021-06-26 10:06     ` Tejun Heo
  2021-06-26 11:42     ` Rik van Riel
  1 sibling, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2021-06-25  8:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Josh Don, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, David Rientjes,
	Oleg Rombakh, Viresh Kumar, Steve Sistare, linux-kernel,
	Rik van Riel

On Wed, Jun 16, 2021 at 11:42:05AM -0400, Tejun Heo wrote:
> A high-level problem that I see with the proposal is that this would bake
> the current recursive implementation into the interface. The semantics of
> the currently exposed interface, at least the weight based part, is abstract
> and doesn't necessarily dictate how the scheduling is actually performed.
> Adding this would mean that we're now codifying the current behavior of
> fully nested scheduling into the interface.

It's a direct consequence of the hierarchical requirement. The approach
is the only valid one. The other relative controllers that don't do
this, are simply broken.

Absolute controllers have it easier, they can be trivially flattened.

> There are several practical challenges with the current implementation
> caused by the full nesting - e.g. nesting levels are expensive for context
> switch heavy applications often going over >1% per level,

Yeah, and there's numerical problems you run into as well due to
limited precision.

Just don't do deep hierarchies.

AFAICT it's a simple matter of conflicting requirements: on the one hand
the hierarchical thing is required, on the other hand people seem to
think all this crap is 'free' and create super deep hierarchies and then
complain shit don't work right.






* Re: [PATCH] sched: cgroup SCHED_IDLE support
  2021-06-08 23:11 [PATCH] sched: cgroup SCHED_IDLE support Josh Don
  2021-06-10 12:53 ` Dietmar Eggemann
  2021-06-16 15:42 ` Tejun Heo
@ 2021-06-25  8:14 ` Peter Zijlstra
  2021-06-26  0:18   ` Josh Don
  2021-06-25  8:20 ` Peter Zijlstra
  3 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2021-06-25  8:14 UTC (permalink / raw)
  To: Josh Don
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, David Rientjes,
	Oleg Rombakh, Viresh Kumar, Steve Sistare, Tejun Heo,
	linux-kernel

On Tue, Jun 08, 2021 at 04:11:32PM -0700, Josh Don wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 8ab69d01dab4..a82f5d3b85d3 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10234,6 +10234,18 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
>  
>  	return sched_group_set_shares(css_tg(css), scale_load(weight));
>  }
> +
> +static s64 cpu_idle_read_s64(struct cgroup_subsys_state *css,
> +			       struct cftype *cft)
> +{
> +	return css_tg(css)->idle;
> +}
> +
> +static int cpu_idle_write_s64(struct cgroup_subsys_state *css,
> +				struct cftype *cft, s64 idle)
> +{
> +	return sched_group_set_idle(css_tg(css), idle);
> +}
>  #endif
>  
>  static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
> @@ -10306,6 +10318,12 @@ static struct cftype cpu_files[] = {
>  		.read_s64 = cpu_weight_nice_read_s64,
>  		.write_s64 = cpu_weight_nice_write_s64,
>  	},
> +	{
> +		.name = "idle",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.read_s64 = cpu_idle_read_s64,
> +		.write_s64 = cpu_idle_write_s64,
> +	},
>  #endif
>  #ifdef CONFIG_CFS_BANDWIDTH
>  	{

You should probably also make cpu_weight_*write_?64() return -EINVAL when
IDLE. Similar to how SCHED_IDLE doesn't do nice.



* Re: [PATCH] sched: cgroup SCHED_IDLE support
  2021-06-08 23:11 [PATCH] sched: cgroup SCHED_IDLE support Josh Don
                   ` (2 preceding siblings ...)
  2021-06-25  8:14 ` Peter Zijlstra
@ 2021-06-25  8:20 ` Peter Zijlstra
  2021-06-26  0:35   ` Josh Don
  3 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2021-06-25  8:20 UTC (permalink / raw)
  To: Josh Don
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, David Rientjes,
	Oleg Rombakh, Viresh Kumar, Steve Sistare, Tejun Heo,
	linux-kernel

On Tue, Jun 08, 2021 at 04:11:32PM -0700, Josh Don wrote:
> +static int se_is_idle(struct sched_entity *se)
> +{
> +	if (entity_is_task(se))
> +		return task_has_idle_policy(task_of(se));
> +	return cfs_rq_is_idle(group_cfs_rq(se));
> +}

I'm conflicted on this, on the one hand, since we want 'idle' to be a
sched_entity property, I'd say, make it a sched_entity field, OTOH,
that's probably going to be a mess too :/

Let me read more..


* Re: [PATCH] sched: cgroup SCHED_IDLE support
  2021-06-15 10:06         ` Dietmar Eggemann
  2021-06-15 23:30           ` Josh Don
@ 2021-06-25  9:24           ` Peter Zijlstra
  1 sibling, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2021-06-25  9:24 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Josh Don, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, David Rientjes,
	Oleg Rombakh, Viresh Kumar, Steve Sistare, Tejun Heo,
	linux-kernel

On Tue, Jun 15, 2021 at 12:06:57PM +0200, Dietmar Eggemann wrote:

> I agree. `cpu.idle = 1` is like setting the task policy to SCHED_IDLE.
> And there is even the `cpu.weight.nice` to support the `task - tg`
> analogy on nice values.
> 
> I'm just wondering if integrating this into `cpu.weight` and friends
> would be better to make the code behind this easier to grasp.

Magic weight values are dodgy imo. Easiest to have an explicit idle knob
which then disables the weight knobs.


* Re: [PATCH] sched: cgroup SCHED_IDLE support
  2021-06-25  8:14 ` Peter Zijlstra
@ 2021-06-26  0:18   ` Josh Don
  0 siblings, 0 replies; 19+ messages in thread
From: Josh Don @ 2021-06-26  0:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, David Rientjes,
	Oleg Rombakh, Viresh Kumar, Steve Sistare, Tejun Heo,
	linux-kernel

On Fri, Jun 25, 2021 at 1:15 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> You should probably also make cpu_weight_*write_?64() return -EINVAL when
> IDLE. Similar to how SCHED_IDLE doesn't do nice.

sched_group_set_shares() returns -EINVAL when called on an idle group.


* Re: [PATCH] sched: cgroup SCHED_IDLE support
  2021-06-25  8:20 ` Peter Zijlstra
@ 2021-06-26  0:35   ` Josh Don
  0 siblings, 0 replies; 19+ messages in thread
From: Josh Don @ 2021-06-26  0:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, David Rientjes,
	Oleg Rombakh, Viresh Kumar, Steve Sistare, Tejun Heo,
	linux-kernel

On Fri, Jun 25, 2021 at 1:21 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jun 08, 2021 at 04:11:32PM -0700, Josh Don wrote:
> > +static int se_is_idle(struct sched_entity *se)
> > +{
> > +     if (entity_is_task(se))
> > +             return task_has_idle_policy(task_of(se));
> > +     return cfs_rq_is_idle(group_cfs_rq(se));
> > +}
>
> I'm conflicted on this, on the one hand, since we want 'idle' to be a
> sched_entity property, I'd say, make it a sched_entity field, OTOH,
> that's probably going to be a mess too :/
>
> Let me read more..

Yea, I see both arguments; a field probably wouldn't be that bad. In
any case, this seemed reasonable and avoids more touchpoints in
core.c, so *shrug*.


* Re: [PATCH] sched: cgroup SCHED_IDLE support
  2021-06-17  1:01   ` Josh Don
@ 2021-06-26  9:57     ` Tejun Heo
  2021-06-29  4:57       ` Josh Don
  0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2021-06-26  9:57 UTC (permalink / raw)
  To: Josh Don
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, David Rientjes,
	Oleg Rombakh, Viresh Kumar, Steve Sistare, linux-kernel,
	Rik van Riel

Hello,

On Wed, Jun 16, 2021 at 06:01:59PM -0700, Josh Don wrote:
> Consider a tree like
> 
>             root
>            /    \
>           A      C
>          / \      \
>         B   idle   t4
>         |   /  \
>        t1  t2   t3
> 
> Here, 'idle' is our cpu.idle cgroup. The following properties would
> not be possible if we moved t2/t3 into SCHED_IDLE without the cgroup
> interface:
> - t1 always preempts t2/t3 on wakeup, but t4 does not
> - t2 and t3 have different, non-minimum weights. Technically we could
> also achieve this by adding another layer of nested cgroups, but that
> starts to make the hierarchy much more complex.
> - I've also discussed with Peter a possible extension (vruntime
> adjustments) to the current SCHED_IDLE semantics. Similarly to the
> first bullet here, we'd need a cgroup idle toggle to achieve certain
> scheduling behaviors with this.

Would you care to share some concrete use cases?

Thank you.

-- 
tejun


* Re: [PATCH] sched: cgroup SCHED_IDLE support
  2021-06-25  8:08   ` Peter Zijlstra
@ 2021-06-26 10:06     ` Tejun Heo
  2021-06-26 11:42     ` Rik van Riel
  1 sibling, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2021-06-26 10:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Don, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, David Rientjes,
	Oleg Rombakh, Viresh Kumar, Steve Sistare, linux-kernel,
	Rik van Riel

Hello, Peter.

On Fri, Jun 25, 2021 at 10:08:36AM +0200, Peter Zijlstra wrote:
> It's a direct consequence of the hierarchical requirement. The approach
> is the only valid one. The other relative controllers that don't do
> this, are simply broken.
> 
> Absolute controllers have it easier, they can be trivially flattened.

That's too strong a claim. The iocost controller, for example, while in a
different domain, is a weight controller which takes different trade-offs to
achieve hierarchical weight-based distribution at negligible nesting overhead.
There is usually more than one way to skin a cat.

> > There are several practical challenges with the current implementation
> > caused by the full nesting - e.g. nesting levels are expensive for context
> > switch heavy applications often going over >1% per level,
> 
> Yeah, and there's numerical problems you run into as well due to
> limited precision.

Another issue is that per-queue-level heuristics, like boosting after
idle, don't nest in quite optimal ways.

> Just don't do deep hierarchies.
>
> AFAICT it's a simple matter of conflicting requirements, on the one hand
> the hierarchical thing is required, on the other hand people seem to
> think all this crap is 'free' and create super deep hierarchies and then
> complain shit don't work right.

The problem is that the overhead is significant even at pretty
shallow levels. Even at just two or three levels, the cost is already
a problem for some large-scale applications.

Thanks.

-- 
tejun


* Re: [PATCH] sched: cgroup SCHED_IDLE support
  2021-06-25  8:08   ` Peter Zijlstra
  2021-06-26 10:06     ` Tejun Heo
@ 2021-06-26 11:42     ` Rik van Riel
  1 sibling, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2021-06-26 11:42 UTC (permalink / raw)
  To: Peter Zijlstra, Tejun Heo
  Cc: Josh Don, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, David Rientjes,
	Oleg Rombakh, Viresh Kumar, Steve Sistare, linux-kernel


On Fri, 2021-06-25 at 10:08 +0200, Peter Zijlstra wrote:
> On Wed, Jun 16, 2021 at 11:42:05AM -0400, Tejun Heo wrote:
> > A high-level problem that I see with the proposal is that this
> > would bake
> > the current recursive implementation into the interface. The
> > semantics of
> > the currently exposed interface, at least the weight based part, is
> > abstract
> > and doesn't necessarily dictate how the scheduling is actually
> > performed.
> > Adding this would mean that we're now codifying the current
> > behavior of
> > fully nested scheduling into the interface.
> 
> It's a direct consequence of the hierarchical requirement. The
> approach
> is the only valid one. The other relative controllers that don't do
> this, are simply broken.
> 
> Absolute controllers have it easier, they can be trivially flattened.

I'm pretty sure CFS can be mostly flattened too, just not
trivially :)

Doing the delta_exectime to vruntime accounting in a somewhat
delayed fashion, with never more than one or two timeslices
worth of vruntime at a time being accounted, should make the
flat CFS runqueue model work.

-- 
All Rights Reversed.



* Re: [PATCH] sched: cgroup SCHED_IDLE support
  2021-06-26  9:57     ` Tejun Heo
@ 2021-06-29  4:57       ` Josh Don
  0 siblings, 0 replies; 19+ messages in thread
From: Josh Don @ 2021-06-29  4:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paul Turner, David Rientjes,
	Oleg Rombakh, Viresh Kumar, Steve Sistare, linux-kernel,
	Rik van Riel

On Sat, Jun 26, 2021 at 2:57 AM Tejun Heo <tj@kernel.org> wrote:
[snip]
>
> Would you care to share some concrete use cases?
>
> Thank you.
>
> --
> tejun

Sure thing, there are two use cases we've found compelling:

1. On a machine, different users are given their own top-level cgroup
(configured with an appropriate number of shares). Each user is free
to spawn any threads and create any additional cgroups within their
top-level group.
Some users would like to run high priority, latency-sensitive work
(for example, responding to an RPC) as well as some batch tasks (ie.
background work such as data manipulation, transcoding, etc.) within
their cgroup. The batch tasks should interfere minimally with the high
priority work. However, this batch work should still be treated the
same as the high priority work when competing against the jobs of
some other user on the machine.

To achieve this, the user sets up two sub-cgroups, one of which is
marked as idle. The idle cgroup will always be preempted on wakeup of
a task in the other sub-cgroup (but not a wakeup of another user's
task). This is not possible with the per-task SCHED_IDLE setting.
Cgroup shares/weight alone is also not as strong as SCHED_IDLE.

2. We can create a top-level idle cgroup in which we place users who
want to run some best-effort work (ie. some long running
computations). Since it is the top-level cgroup that is marked idle,
any other task on the machine will always preempt something running
within the top-level idle cgroup. We can also easily maintain the
relative weights between different users within the idle group.

This top-level idle group allows for soaking otherwise unused cycles,
and offers cheap machine quota for users who have latency-tolerant
jobs.
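
A rough sketch of (2), purely for illustration (it assumes cgroup v2
mounted at /sys/fs/cgroup; the "besteffort" name and the pid are made
up):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	int fd;

	mkdir("/sys/fs/cgroup/besteffort", 0755);

	/* Mark the whole subtree SCHED_IDLE. */
	fd = open("/sys/fs/cgroup/besteffort/cpu.idle", O_WRONLY);
	if (fd >= 0) {
		dprintf(fd, "1");
		close(fd);
	}

	/* Move a latency-tolerant job (pid 1234, illustrative) into it. */
	fd = open("/sys/fs/cgroup/besteffort/cgroup.procs", O_WRONLY);
	if (fd >= 0) {
		dprintf(fd, "1234");
		close(fd);
	}
	return 0;
}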
