linux-kernel.vger.kernel.org archive mirror
* [PATCH v6 0/2] Fix nohz_full vs cfs bandwidth
@ 2023-07-12 13:33 Phil Auld
  2023-07-12 13:33 ` [PATCH v2 1/2] sched, cgroup: Restore meaning to hierarchical_quota Phil Auld
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Phil Auld @ 2023-07-12 13:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Juri Lelli, Ingo Molnar, Daniel Bristot de Oliveira,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman,
	Frederic Weisbecker, Tejun Heo, Phil Auld

This is v6 of patch 2/2, which adds code to prevent
the tick from being stopped when the single running task
has bandwidth limits. Discussions had led to the idea of
adding a bit to task_struct to help make this decision.

There was some complexity with doing it in the task which
is avoided by using something in the cfs_rq. Looking
into that led me to the hierarchical_quota field in the
cfs_bandwidth struct. We spend a good deal of effort
updating (or trying to, see patch 1/2) that value for
the whole task_group tree when a quota is set/changed.

This new version first fixes that value to be meaningful
for cgroupv2 and then leverages it to make the decision
about blocking the tick_stop.
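
As an illustration, here is a toy userspace model (not kernel code;
the tree layout and the 50ms figure are made up) of the cgroupv2
inheritance rule that patch 1/2 gives hierarchical_quota:

#include <stdio.h>

#define RUNTIME_INF (~0ULL)	/* stand-in for the kernel's RUNTIME_INF */

/* Mirrors the cgroup2 branch added to tg_cfs_schedulable_down(). */
static unsigned long long effective_quota(unsigned long long quota,
					  unsigned long long parent_quota)
{
	if (quota == RUNTIME_INF)
		return parent_quota;
	if (parent_quota != RUNTIME_INF && parent_quota < quota)
		return parent_quota;
	return quota;
}

int main(void)
{
	/* root: no limit; A: 50ms quota; B: child of A with no limit */
	unsigned long long root = RUNTIME_INF;
	unsigned long long a = effective_quota(50000000ULL, root);
	unsigned long long b = effective_quota(RUNTIME_INF, a);

	printf("A: %llu ns, B inherits: %llu ns\n", a, b);
	return 0;
}

B has no quota of its own but still reports the ancestor's 50ms, which
is the property patch 2/2 relies on when deciding whether a task is
bandwidth constrained.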

Phil Auld (2):
  sched, cgroup: Restore meaning to hierarchical_quota
  Sched/fair: Block nohz tick_stop when cfs bandwidth in use

 kernel/sched/core.c     | 23 ++++++++++++++---
 kernel/sched/fair.c     | 56 ++++++++++++++++++++++++++++++++++++++---
 kernel/sched/features.h |  2 ++
 kernel/sched/sched.h    |  3 ++-
 4 files changed, 76 insertions(+), 8 deletions(-)

-- 
2.31.1



* [PATCH v2 1/2] sched, cgroup: Restore meaning to hierarchical_quota
  2023-07-12 13:33 [PATCH v6 0/2] Fix nohz_full vs cfs bandwidth Phil Auld
@ 2023-07-12 13:33 ` Phil Auld
  2023-07-12 22:09   ` Benjamin Segall
  2023-07-12 13:33 ` [PATCH v6 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use Phil Auld
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Phil Auld @ 2023-07-12 13:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Juri Lelli, Ingo Molnar, Daniel Bristot de Oliveira,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman,
	Frederic Weisbecker, Tejun Heo, Phil Auld

In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
groups due to the previous fix simply taking the min.  It should
reflect a limit imposed at that level or by an ancestor. Even
though cgroupv2 does not require child quota to be less than or
equal to that of its ancestors, the task group will still be
constrained by such a quota, so this should be shown here. Cgroupv1
continues to set this correctly.

In both cases, add initialization when a new task group is created
based on the current parent's value (or RUNTIME_INF in the case of
root_task_group). Otherwise, the field is wrong until a quota is
changed after creation and __cfs_schedulable() is called.

Fixes: c53593e5cb69 ("sched, cgroup: Don't reject lower cpu.max on ancestors")
Signed-off-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
---

v2: Improve comment about how setting hierarchical_quota correctly
helps the scheduler. Remove extra parens.

 kernel/sched/core.c  | 13 +++++++++----
 kernel/sched/fair.c  |  7 ++++---
 kernel/sched/sched.h |  2 +-
 3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a68d1276bab0..f80697a79baf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9904,7 +9904,7 @@ void __init sched_init(void)
 		ptr += nr_cpu_ids * sizeof(void **);
 
 		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
-		init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
+		init_cfs_bandwidth(&root_task_group.cfs_bandwidth, NULL);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 #ifdef CONFIG_RT_GROUP_SCHED
 		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
@@ -11038,11 +11038,16 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
 
 		/*
 		 * Ensure max(child_quota) <= parent_quota.  On cgroup2,
-		 * always take the min.  On cgroup1, only inherit when no
-		 * limit is set:
+		 * always take the non-RUNTIME_INF min.  On cgroup1, only
+		 * inherit when no limit is set. In cgroup2 this is used
+		 * by the scheduler to determine if a given CFS task has a
+		 * bandwidth constraint at some higher level.
 		 */
 		if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
-			quota = min(quota, parent_quota);
+			if (quota == RUNTIME_INF)
+				quota = parent_quota;
+			else if (parent_quota != RUNTIME_INF)
+				quota = min(quota, parent_quota);
 		} else {
 			if (quota == RUNTIME_INF)
 				quota = parent_quota;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f55884..d9b3d4617e16 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6005,13 +6005,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
 }
 
-void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent)
 {
 	raw_spin_lock_init(&cfs_b->lock);
 	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 	cfs_b->burst = 0;
+	cfs_b->hierarchical_quota = parent ? parent->hierarchical_quota : RUNTIME_INF;
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
@@ -6168,7 +6169,7 @@ static inline int throttled_lb_pair(struct task_group *tg,
 	return 0;
 }
 
-void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent) {}
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
@@ -12373,7 +12374,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 
 	tg->shares = NICE_0_LOAD;
 
-	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
+	init_cfs_bandwidth(tg_cfs_bandwidth(tg), tg_cfs_bandwidth(parent));
 
 	for_each_possible_cpu(i) {
 		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec7b3e0a2b20..63822c9238cc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -460,7 +460,7 @@ extern void unregister_fair_sched_group(struct task_group *tg);
 extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 			struct sched_entity *se, int cpu,
 			struct sched_entity *parent);
-extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
+extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent);
 
 extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
 extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
-- 
2.31.1



* [PATCH v6 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use
  2023-07-12 13:33 [PATCH v6 0/2] Fix nohz_full vs cfs bandwidth Phil Auld
  2023-07-12 13:33 ` [PATCH v2 1/2] sched, cgroup: Restore meaning to hierarchical_quota Phil Auld
@ 2023-07-12 13:33 ` Phil Auld
  2023-07-12 22:11   ` Benjamin Segall
                     ` (2 more replies)
  2023-07-31 15:17 ` [PATCH v6 0/2] Fix nohz_full vs cfs bandwidth Phil Auld
  2023-07-31 16:38 ` Phil Auld
  3 siblings, 3 replies; 22+ messages in thread
From: Phil Auld @ 2023-07-12 13:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Juri Lelli, Ingo Molnar, Daniel Bristot de Oliveira,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman,
	Frederic Weisbecker, Tejun Heo, Phil Auld

CFS bandwidth limits and NOHZ full don't play well together.  Tasks
can easily run well past their quotas before a remote tick does
accounting.  This leads to long, multi-period stalls before such
tasks can run again. Currently, when presented with these conflicting
requirements, the scheduler favors nohz_full and lets the tick be
stopped. However, nohz tick stopping is already best-effort; there
are a number of conditions that can prevent it, whereas cfs runtime
bandwidth is expected to be enforced.
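
The kind of setup that hits this looks roughly like the sketch below
(assumptions: the CPU number is made up, the kernel was booted with
nohz_full= covering that CPU, and the task is started in a cgroup with
a CFS quota, e.g. cpu.max on cgroupv2 or cpu.cfs_quota_us on cgroupv1):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

#define TARGET_CPU 3	/* hypothetical nohz_full CPU */

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(TARGET_CPU, &set);
	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}

	/*
	 * Lone runnable task on a nohz_full CPU: once the tick stops,
	 * quota accounting is left to the remote tick, so the task can
	 * overrun its quota and then stall for multiple periods.
	 */
	for (;;)
		;
}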

Make the scheduler favor bandwidth over stopping the tick by setting
TICK_DEP_BIT_SCHED when the only running task is a cfs task with
runtime limit enabled. We use cfs_b->hierarchical_quota to
determine if the task requires the tick.

Add check in pick_next_task_fair() as well since that is where
we have a handle on the task that is actually going to be running.

Add check in sched_can_stop_tick() to cover some edge cases, such
as nr_running going from 2->1 while the remaining task is the one
already running.

Add sched_feat HZ_BW (off by default) to control the tick_stop
behavior.
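(As with other scheduler features, it can be flipped at run time on
kernels with the sched features debugfs interface, e.g. by writing
HZ_BW or NO_HZ_BW to /sys/kernel/debug/sched/features.)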

Signed-off-by: Phil Auld <pauld@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
---
v6: restore check for fair_sched_class

v5: Reworked checks to use newly-fixed cfs_b->hierarchical_quota to
check for bw constraints.

v4: Made checks for runtime_enabled hierarchical.

v3: Moved sched_cfs_bandwidth_active() prototype to sched.h outside of
guards to silence -Wmissing-prototypes.

v2:  Ben pointed out that the bit could get cleared in the dequeue path
if we migrate a newly enqueued task without preempting curr. Added a
check for that edge case to sched_can_stop_tick. Removed the call to
sched_can_stop_tick from sched_fair_update_stop_tick since it was
redundant.

 kernel/sched/core.c     | 26 ++++++++++++++++++++++
 kernel/sched/fair.c     | 49 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  2 ++
 kernel/sched/sched.h    |  1 +
 4 files changed, 78 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f80697a79baf..8a2ed4c0b709 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1194,6 +1194,20 @@ static void nohz_csd_func(void *info)
 #endif /* CONFIG_NO_HZ_COMMON */
 
 #ifdef CONFIG_NO_HZ_FULL
+static inline bool __need_bw_check(struct rq *rq, struct task_struct *p)
+{
+	if (rq->nr_running != 1)
+		return false;
+
+	if (p->sched_class != &fair_sched_class)
+		return false;
+
+	if (!task_on_rq_queued(p))
+		return false;
+
+	return true;
+}
+
 bool sched_can_stop_tick(struct rq *rq)
 {
 	int fifo_nr_running;
@@ -1229,6 +1243,18 @@ bool sched_can_stop_tick(struct rq *rq)
 	if (rq->nr_running > 1)
 		return false;
 
+	/*
+	 * If there is one task and it has CFS runtime bandwidth constraints
+	 * and it's on the cpu now we don't want to stop the tick.
+	 * This check prevents clearing the bit if a newly enqueued task here is
+	 * dequeued by migrating while the constrained task continues to run.
+	 * E.g. going from 2->1 without going through pick_next_task().
+	 */
+	if (sched_feat(HZ_BW) && __need_bw_check(rq, rq->curr)) {
+		if (cfs_task_bw_constrained(rq->curr))
+			return false;
+	}
+
 	return true;
 }
 #endif /* CONFIG_NO_HZ_FULL */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d9b3d4617e16..acd9f317aad1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6140,6 +6140,46 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
 	rcu_read_unlock();
 }
 
+bool cfs_task_bw_constrained(struct task_struct *p)
+{
+	struct cfs_rq *cfs_rq = task_cfs_rq(p);
+
+	if (!cfs_bandwidth_used())
+		return false;
+
+	if (cfs_rq->runtime_enabled ||
+	    tg_cfs_bandwidth(cfs_rq->tg)->hierarchical_quota != RUNTIME_INF)
+		return true;
+
+	return false;
+}
+
+#ifdef CONFIG_NO_HZ_FULL
+/* called from pick_next_task_fair() */
+static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
+{
+	int cpu = cpu_of(rq);
+
+	if (!sched_feat(HZ_BW) || !cfs_bandwidth_used())
+		return;
+
+	if (!tick_nohz_full_cpu(cpu))
+		return;
+
+	if (rq->nr_running != 1)
+		return;
+
+	/*
+	 *  We know there is only one task runnable and we've just picked it. The
+	 *  normal enqueue path will have cleared TICK_DEP_BIT_SCHED if we will
+	 *  be otherwise able to stop the tick. Just need to check if we are using
+	 *  bandwidth control.
+	 */
+	if (cfs_task_bw_constrained(p))
+		tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
+}
+#endif
+
 #else /* CONFIG_CFS_BANDWIDTH */
 
 static inline bool cfs_bandwidth_used(void)
@@ -6182,9 +6222,17 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
 static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
 static inline void update_runtime_enabled(struct rq *rq) {}
 static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {}
+bool cfs_task_bw_constrained(struct task_struct *p)
+{
+	return false;
+}
 
 #endif /* CONFIG_CFS_BANDWIDTH */
 
+#if !defined(CONFIG_CFS_BANDWIDTH) || !defined(CONFIG_NO_HZ_FULL)
+static inline void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p) {}
+#endif
+
 /**************************************************
  * CFS operations on tasks:
  */
@@ -8098,6 +8146,7 @@ done: __maybe_unused;
 		hrtick_start_fair(rq, p);
 
 	update_misfit_status(p, rq);
+	sched_fair_update_stop_tick(rq, p);
 
 	return p;
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..6fdf1fdf6b17 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -101,3 +101,5 @@ SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+
+SCHED_FEAT(HZ_BW, false)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 63822c9238cc..d6d346bc78aa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -465,6 +465,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth
 extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
 extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
 extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
+extern bool cfs_task_bw_constrained(struct task_struct *p);
 
 extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
 		struct sched_rt_entity *rt_se, int cpu,
-- 
2.31.1



* Re: [PATCH v2 1/2] sched, cgroup: Restore meaning to hierarchical_quota
  2023-07-12 13:33 ` [PATCH v2 1/2] sched, cgroup: Restore meaning to hierarchical_quota Phil Auld
@ 2023-07-12 22:09   ` Benjamin Segall
  2023-07-13 13:23     ` Phil Auld
  2023-07-14 12:57     ` [PATCH v3 " Phil Auld
  0 siblings, 2 replies; 22+ messages in thread
From: Benjamin Segall @ 2023-07-12 22:09 UTC (permalink / raw)
  To: Phil Auld
  Cc: linux-kernel, Juri Lelli, Ingo Molnar,
	Daniel Bristot de Oliveira, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Mel Gorman,
	Frederic Weisbecker, Tejun Heo

Phil Auld <pauld@redhat.com> writes:

> In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
> groups due to the previous fix simply taking the min.  It should
> reflect a limit imposed at that level or by an ancestor. Even
> though cgroupv2 does not require child quota to be less than or
> equal to that of its ancestors the task group will still be
> constrained by such a quota so this should be shown here. Cgroupv1
> continues to set this correctly.
>
> In both cases, add initialization when a new task group is created
> based on the current parent's value (or RUNTIME_INF in the case of
> root_task_group). Otherwise, the field is wrong until a quota is
> changed after creation and __cfs_schedulable() is called.
>
> Fixes: c53593e5cb69 ("sched, cgroup: Don't reject lower cpu.max on ancestors")
> Signed-off-by: Phil Auld <pauld@redhat.com>
> Reviewed-by: Ben Segall <bsegall@google.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
> ---
>
> v2: Improve comment about how setting hierarchical_quota correctly
>
> helps the scheduler. Remove extra parens.
>  kernel/sched/core.c  | 13 +++++++++----
>  kernel/sched/fair.c  |  7 ++++---
>  kernel/sched/sched.h |  2 +-
>  3 files changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a68d1276bab0..f80697a79baf 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9904,7 +9904,7 @@ void __init sched_init(void)
>  		ptr += nr_cpu_ids * sizeof(void **);
>  
>  		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
> -		init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
> +		init_cfs_bandwidth(&root_task_group.cfs_bandwidth, NULL);
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>  #ifdef CONFIG_RT_GROUP_SCHED
>  		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
> @@ -11038,11 +11038,16 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
>  
>  		/*
>  		 * Ensure max(child_quota) <= parent_quota.  On cgroup2,
> -		 * always take the min.  On cgroup1, only inherit when no
> -		 * limit is set:
> +		 * always take the non-RUNTIME_INF min.  On cgroup1, only
> +		 * inherit when no limit is set. In cgroup2 this is used
> +		 * by the scheduler to determine if a given CFS task has a
> +		 * bandwidth constraint at some higher level.
>  		 */

It's still used for determining this on cgroup1 (and the cgroup1 code
still works for that), right?

>  		if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
> -			quota = min(quota, parent_quota);
> +			if (quota == RUNTIME_INF)
> +				quota = parent_quota;
> +			else if (parent_quota != RUNTIME_INF)
> +				quota = min(quota, parent_quota);
>  		} else {
>  			if (quota == RUNTIME_INF)
>  				quota = parent_quota;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 373ff5f55884..d9b3d4617e16 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6005,13 +6005,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>  	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
>  }
>  
> -void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent)
>  {
>  	raw_spin_lock_init(&cfs_b->lock);
>  	cfs_b->runtime = 0;
>  	cfs_b->quota = RUNTIME_INF;
>  	cfs_b->period = ns_to_ktime(default_cfs_period());
>  	cfs_b->burst = 0;
> +	cfs_b->hierarchical_quota = parent ? parent->hierarchical_quota : RUNTIME_INF;
>  
>  	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
>  	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
> @@ -6168,7 +6169,7 @@ static inline int throttled_lb_pair(struct task_group *tg,
>  	return 0;
>  }
>  
> -void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
> +void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent) {}
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> @@ -12373,7 +12374,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
>  
>  	tg->shares = NICE_0_LOAD;
>  
> -	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
> +	init_cfs_bandwidth(tg_cfs_bandwidth(tg), tg_cfs_bandwidth(parent));
>  
>  	for_each_possible_cpu(i) {
>  		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ec7b3e0a2b20..63822c9238cc 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -460,7 +460,7 @@ extern void unregister_fair_sched_group(struct task_group *tg);
>  extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
>  			struct sched_entity *se, int cpu,
>  			struct sched_entity *parent);
> -extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
> +extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent);
>  
>  extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
>  extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);


* Re: [PATCH v6 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use
  2023-07-12 13:33 ` [PATCH v6 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use Phil Auld
@ 2023-07-12 22:11   ` Benjamin Segall
  2023-07-13 13:25     ` Phil Auld
  2023-07-31 22:49   ` Peter Zijlstra
  2023-08-09 19:34   ` [tip: sched/core] sched/fair: " tip-bot2 for Phil Auld
  2 siblings, 1 reply; 22+ messages in thread
From: Benjamin Segall @ 2023-07-12 22:11 UTC (permalink / raw)
  To: Phil Auld
  Cc: linux-kernel, Juri Lelli, Ingo Molnar,
	Daniel Bristot de Oliveira, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Mel Gorman,
	Frederic Weisbecker, Tejun Heo

Phil Auld <pauld@redhat.com> writes:

> CFS bandwidth limits and NOHZ full don't play well together.  Tasks
> can easily run well past their quotas before a remote tick does
> accounting.  This leads to long, multi-period stalls before such
> tasks can run again. Currently, when presented with these conflicting
> requirements the scheduler is favoring nohz_full and letting the tick
> be stopped. However, nohz tick stopping is already best-effort, there
> are a number of conditions that can prevent it, whereas cfs runtime
> bandwidth is expected to be enforced.
>
> Make the scheduler favor bandwidth over stopping the tick by setting
> TICK_DEP_BIT_SCHED when the only running task is a cfs task with
> runtime limit enabled. We use cfs_b->hierarchical_quota to
> determine if the task requires the tick.
>
> Add check in pick_next_task_fair() as well since that is where
> we have a handle on the task that is actually going to be running.
>
> Add check in sched_can_stop_tick() to cover some edge cases such
> as nr_running going from 2->1 and the 1 remains the running task.
>
> Add sched_feat HZ_BW (off by default) to control the tick_stop
> behavior.

I think this looks good now.
Reviewed-By: Ben Segall <bsegall@google.com>

>
> Signed-off-by: Phil Auld <pauld@redhat.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> ---
> v6: restore check for fair_sched_class
>
> v5: Reworked checks to use newly-fixed cfs_b->hierarchical_quota to
> check for bw
> constraints. 
>
> v4: Made checks for runtime_enabled hierarchical. 
>
> v3: Moved sched_cfs_bandwidth_active() prototype to sched.h outside of
> guards to
> silence -Wmissing-prototypes.
>
> v2:  Ben pointed out that the bit could get cleared in the dequeue path
> if we migrate a newly enqueued task without preempting curr. Added a
> check for that edge case to sched_can_stop_tick. Removed the call to
> sched_can_stop_tick from sched_fair_update_stop_tick since it was
> redundant.
>
>  kernel/sched/core.c     | 26 ++++++++++++++++++++++
>  kernel/sched/fair.c     | 49 +++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/features.h |  2 ++
>  kernel/sched/sched.h    |  1 +
>  4 files changed, 78 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f80697a79baf..8a2ed4c0b709 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1194,6 +1194,20 @@ static void nohz_csd_func(void *info)
>  #endif /* CONFIG_NO_HZ_COMMON */
>  
>  #ifdef CONFIG_NO_HZ_FULL
> +static inline bool __need_bw_check(struct rq *rq, struct task_struct *p)
> +{
> +	if (rq->nr_running != 1)
> +		return false;
> +
> +	if (p->sched_class != &fair_sched_class)
> +		return false;
> +
> +	if (!task_on_rq_queued(p))
> +		return false;
> +
> +	return true;
> +}
> +
>  bool sched_can_stop_tick(struct rq *rq)
>  {
>  	int fifo_nr_running;
> @@ -1229,6 +1243,18 @@ bool sched_can_stop_tick(struct rq *rq)
>  	if (rq->nr_running > 1)
>  		return false;
>  
> +	/*
> +	 * If there is one task and it has CFS runtime bandwidth constraints
> +	 * and it's on the cpu now we don't want to stop the tick.
> +	 * This check prevents clearing the bit if a newly enqueued task here is
> +	 * dequeued by migrating while the constrained task continues to run.
> +	 * E.g. going from 2->1 without going through pick_next_task().
> +	 */
> +	if (sched_feat(HZ_BW) && __need_bw_check(rq, rq->curr)) {
> +		if (cfs_task_bw_constrained(rq->curr))
> +			return false;
> +	}
> +
>  	return true;
>  }
>  #endif /* CONFIG_NO_HZ_FULL */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d9b3d4617e16..acd9f317aad1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6140,6 +6140,46 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
>  	rcu_read_unlock();
>  }
>  
> +bool cfs_task_bw_constrained(struct task_struct *p)
> +{
> +	struct cfs_rq *cfs_rq = task_cfs_rq(p);
> +
> +	if (!cfs_bandwidth_used())
> +		return false;
> +
> +	if (cfs_rq->runtime_enabled ||
> +	    tg_cfs_bandwidth(cfs_rq->tg)->hierarchical_quota != RUNTIME_INF)
> +		return true;
> +
> +	return false;
> +}
> +
> +#ifdef CONFIG_NO_HZ_FULL
> +/* called from pick_next_task_fair() */
> +static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
> +{
> +	int cpu = cpu_of(rq);
> +
> +	if (!sched_feat(HZ_BW) || !cfs_bandwidth_used())
> +		return;
> +
> +	if (!tick_nohz_full_cpu(cpu))
> +		return;
> +
> +	if (rq->nr_running != 1)
> +		return;
> +
> +	/*
> +	 *  We know there is only one task runnable and we've just picked it. The
> +	 *  normal enqueue path will have cleared TICK_DEP_BIT_SCHED if we will
> +	 *  be otherwise able to stop the tick. Just need to check if we are using
> +	 *  bandwidth control.
> +	 */
> +	if (cfs_task_bw_constrained(p))
> +		tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
> +}
> +#endif
> +
>  #else /* CONFIG_CFS_BANDWIDTH */
>  
>  static inline bool cfs_bandwidth_used(void)
> @@ -6182,9 +6222,17 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
>  static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
>  static inline void update_runtime_enabled(struct rq *rq) {}
>  static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {}
> +bool cfs_task_bw_constrained(struct task_struct *p)
> +{
> +	return false;
> +}
>  
>  #endif /* CONFIG_CFS_BANDWIDTH */
>  
> +#if !defined(CONFIG_CFS_BANDWIDTH) || !defined(CONFIG_NO_HZ_FULL)
> +static inline void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p) {}
> +#endif
> +
>  /**************************************************
>   * CFS operations on tasks:
>   */
> @@ -8098,6 +8146,7 @@ done: __maybe_unused;
>  		hrtick_start_fair(rq, p);
>  
>  	update_misfit_status(p, rq);
> +	sched_fair_update_stop_tick(rq, p);
>  
>  	return p;
>  
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c76bd3..6fdf1fdf6b17 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -101,3 +101,5 @@ SCHED_FEAT(LATENCY_WARN, false)
>  
>  SCHED_FEAT(ALT_PERIOD, true)
>  SCHED_FEAT(BASE_SLICE, true)
> +
> +SCHED_FEAT(HZ_BW, false)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 63822c9238cc..d6d346bc78aa 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -465,6 +465,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth
>  extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
>  extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
>  extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
> +extern bool cfs_task_bw_constrained(struct task_struct *p);
>  
>  extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
>  		struct sched_rt_entity *rt_se, int cpu,


* Re: [PATCH v2 1/2] sched, cgroup: Restore meaning to hierarchical_quota
  2023-07-12 22:09   ` Benjamin Segall
@ 2023-07-13 13:23     ` Phil Auld
  2023-07-13 20:12       ` Benjamin Segall
  2023-07-14 12:57     ` [PATCH v3 " Phil Auld
  1 sibling, 1 reply; 22+ messages in thread
From: Phil Auld @ 2023-07-13 13:23 UTC (permalink / raw)
  To: Benjamin Segall
  Cc: linux-kernel, Juri Lelli, Ingo Molnar,
	Daniel Bristot de Oliveira, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Mel Gorman,
	Frederic Weisbecker, Tejun Heo

On Wed, Jul 12, 2023 at 03:09:31PM -0700 Benjamin Segall wrote:
> Phil Auld <pauld@redhat.com> writes:
> 
> > In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
> > groups due to the previous fix simply taking the min.  It should
> > reflect a limit imposed at that level or by an ancestor. Even
> > though cgroupv2 does not require child quota to be less than or
> > equal to that of its ancestors the task group will still be
> > constrained by such a quota so this should be shown here. Cgroupv1
> > continues to set this correctly.
> >
> > In both cases, add initialization when a new task group is created
> > based on the current parent's value (or RUNTIME_INF in the case of
> > root_task_group). Otherwise, the field is wrong until a quota is
> > changed after creation and __cfs_schedulable() is called.
> >
> > Fixes: c53593e5cb69 ("sched, cgroup: Don't reject lower cpu.max on ancestors")
> > Signed-off-by: Phil Auld <pauld@redhat.com>
> > Reviewed-by: Ben Segall <bsegall@google.com>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Juri Lelli <juri.lelli@redhat.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Valentin Schneider <vschneid@redhat.com>
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Frederic Weisbecker <frederic@kernel.org>
> > Cc: Tejun Heo <tj@kernel.org>
> > ---
> >
> > v2: Improve comment about how setting hierarchical_quota correctly
> >
> > helps the scheduler. Remove extra parens.
> >  kernel/sched/core.c  | 13 +++++++++----
> >  kernel/sched/fair.c  |  7 ++++---
> >  kernel/sched/sched.h |  2 +-
> >  3 files changed, 14 insertions(+), 8 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index a68d1276bab0..f80697a79baf 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -9904,7 +9904,7 @@ void __init sched_init(void)
> >  		ptr += nr_cpu_ids * sizeof(void **);
> >  
> >  		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
> > -		init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
> > +		init_cfs_bandwidth(&root_task_group.cfs_bandwidth, NULL);
> >  #endif /* CONFIG_FAIR_GROUP_SCHED */
> >  #ifdef CONFIG_RT_GROUP_SCHED
> >  		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
> > @@ -11038,11 +11038,16 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
> >  
> >  		/*
> >  		 * Ensure max(child_quota) <= parent_quota.  On cgroup2,
> > -		 * always take the min.  On cgroup1, only inherit when no
> > -		 * limit is set:
> > +		 * always take the non-RUNTIME_INF min.  On cgroup1, only
> > +		 * inherit when no limit is set. In cgroup2 this is used
> > +		 * by the scheduler to determine if a given CFS task has a
> > +		 * bandwidth constraint at some higher level.
> >  		 */
> 
> It's still used for determining this on cgroup1 (and the cgroup1 code
> still works for that), right?
>

It would, except that the enforcement of child quota <= parent quota
means that cfs_rq->runtime_enabled will be set and we'll hit that first
on cgroup1.  So we don't really use it for this determination in cgroup1.

But I could generalize that comment if you want.


Thanks,
Phil


> >  		if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
> > -			quota = min(quota, parent_quota);
> > +			if (quota == RUNTIME_INF)
> > +				quota = parent_quota;
> > +			else if (parent_quota != RUNTIME_INF)
> > +				quota = min(quota, parent_quota);
> >  		} else {
> >  			if (quota == RUNTIME_INF)
> >  				quota = parent_quota;
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 373ff5f55884..d9b3d4617e16 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6005,13 +6005,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >  	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
> >  }
> >  
> > -void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> > +void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent)
> >  {
> >  	raw_spin_lock_init(&cfs_b->lock);
> >  	cfs_b->runtime = 0;
> >  	cfs_b->quota = RUNTIME_INF;
> >  	cfs_b->period = ns_to_ktime(default_cfs_period());
> >  	cfs_b->burst = 0;
> > +	cfs_b->hierarchical_quota = parent ? parent->hierarchical_quota : RUNTIME_INF;
> >  
> >  	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
> >  	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
> > @@ -6168,7 +6169,7 @@ static inline int throttled_lb_pair(struct task_group *tg,
> >  	return 0;
> >  }
> >  
> > -void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
> > +void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent) {}
> >  
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> >  static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> > @@ -12373,7 +12374,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> >  
> >  	tg->shares = NICE_0_LOAD;
> >  
> > -	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
> > +	init_cfs_bandwidth(tg_cfs_bandwidth(tg), tg_cfs_bandwidth(parent));
> >  
> >  	for_each_possible_cpu(i) {
> >  		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index ec7b3e0a2b20..63822c9238cc 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -460,7 +460,7 @@ extern void unregister_fair_sched_group(struct task_group *tg);
> >  extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
> >  			struct sched_entity *se, int cpu,
> >  			struct sched_entity *parent);
> > -extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
> > +extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent);
> >  
> >  extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
> >  extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
> 

-- 



* Re: [PATCH v6 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use
  2023-07-12 22:11   ` Benjamin Segall
@ 2023-07-13 13:25     ` Phil Auld
  0 siblings, 0 replies; 22+ messages in thread
From: Phil Auld @ 2023-07-13 13:25 UTC (permalink / raw)
  To: Benjamin Segall
  Cc: linux-kernel, Juri Lelli, Ingo Molnar,
	Daniel Bristot de Oliveira, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Mel Gorman,
	Frederic Weisbecker, Tejun Heo

On Wed, Jul 12, 2023 at 03:11:32PM -0700 Benjamin Segall wrote:
> Phil Auld <pauld@redhat.com> writes:
> 
> > CFS bandwidth limits and NOHZ full don't play well together.  Tasks
> > can easily run well past their quotas before a remote tick does
> > accounting.  This leads to long, multi-period stalls before such
> > tasks can run again. Currently, when presented with these conflicting
> > requirements the scheduler is favoring nohz_full and letting the tick
> > be stopped. However, nohz tick stopping is already best-effort, there
> > are a number of conditions that can prevent it, whereas cfs runtime
> > bandwidth is expected to be enforced.
> >
> > Make the scheduler favor bandwidth over stopping the tick by setting
> > TICK_DEP_BIT_SCHED when the only running task is a cfs task with
> > runtime limit enabled. We use cfs_b->hierarchical_quota to
> > determine if the task requires the tick.
> >
> > Add check in pick_next_task_fair() as well since that is where
> > we have a handle on the task that is actually going to be running.
> >
> > Add check in sched_can_stop_tick() to cover some edge cases such
> > as nr_running going from 2->1 and the 1 remains the running task.
> >
> > Add sched_feat HZ_BW (off by default) to control the tick_stop
> > behavior.
> 
> I think this looks good now.
> Reviewed-By: Ben Segall <bsegall@google.com>
>

Thanks for your help and patience!


Cheers,
Phil


> >
> > Signed-off-by: Phil Auld <pauld@redhat.com>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Juri Lelli <juri.lelli@redhat.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Valentin Schneider <vschneid@redhat.com>
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Frederic Weisbecker <frederic@kernel.org>
> > ---
> > v6: restore check for fair_sched_class
> >
> > v5: Reworked checks to use newly-fixed cfs_b->hierarchical_quota to
> > check for bw
> > constraints. 
> >
> > v4: Made checks for runtime_enabled hierarchical. 
> >
> > v3: Moved sched_cfs_bandwidth_active() prototype to sched.h outside of
> > guards to
> > silence -Wmissing-prototypes.
> >
> > v2:  Ben pointed out that the bit could get cleared in the dequeue path
> > if we migrate a newly enqueued task without preempting curr. Added a
> > check for that edge case to sched_can_stop_tick. Removed the call to
> > sched_can_stop_tick from sched_fair_update_stop_tick since it was
> > redundant.
> >
> >  kernel/sched/core.c     | 26 ++++++++++++++++++++++
> >  kernel/sched/fair.c     | 49 +++++++++++++++++++++++++++++++++++++++++
> >  kernel/sched/features.h |  2 ++
> >  kernel/sched/sched.h    |  1 +
> >  4 files changed, 78 insertions(+)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f80697a79baf..8a2ed4c0b709 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1194,6 +1194,20 @@ static void nohz_csd_func(void *info)
> >  #endif /* CONFIG_NO_HZ_COMMON */
> >  
> >  #ifdef CONFIG_NO_HZ_FULL
> > +static inline bool __need_bw_check(struct rq *rq, struct task_struct *p)
> > +{
> > +	if (rq->nr_running != 1)
> > +		return false;
> > +
> > +	if (p->sched_class != &fair_sched_class)
> > +		return false;
> > +
> > +	if (!task_on_rq_queued(p))
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> >  bool sched_can_stop_tick(struct rq *rq)
> >  {
> >  	int fifo_nr_running;
> > @@ -1229,6 +1243,18 @@ bool sched_can_stop_tick(struct rq *rq)
> >  	if (rq->nr_running > 1)
> >  		return false;
> >  
> > +	/*
> > +	 * If there is one task and it has CFS runtime bandwidth constraints
> > +	 * and it's on the cpu now we don't want to stop the tick.
> > +	 * This check prevents clearing the bit if a newly enqueued task here is
> > +	 * dequeued by migrating while the constrained task continues to run.
> > +	 * E.g. going from 2->1 without going through pick_next_task().
> > +	 */
> > +	if (sched_feat(HZ_BW) && __need_bw_check(rq, rq->curr)) {
> > +		if (cfs_task_bw_constrained(rq->curr))
> > +			return false;
> > +	}
> > +
> >  	return true;
> >  }
> >  #endif /* CONFIG_NO_HZ_FULL */
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d9b3d4617e16..acd9f317aad1 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6140,6 +6140,46 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
> >  	rcu_read_unlock();
> >  }
> >  
> > +bool cfs_task_bw_constrained(struct task_struct *p)
> > +{
> > +	struct cfs_rq *cfs_rq = task_cfs_rq(p);
> > +
> > +	if (!cfs_bandwidth_used())
> > +		return false;
> > +
> > +	if (cfs_rq->runtime_enabled ||
> > +	    tg_cfs_bandwidth(cfs_rq->tg)->hierarchical_quota != RUNTIME_INF)
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> > +#ifdef CONFIG_NO_HZ_FULL
> > +/* called from pick_next_task_fair() */
> > +static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
> > +{
> > +	int cpu = cpu_of(rq);
> > +
> > +	if (!sched_feat(HZ_BW) || !cfs_bandwidth_used())
> > +		return;
> > +
> > +	if (!tick_nohz_full_cpu(cpu))
> > +		return;
> > +
> > +	if (rq->nr_running != 1)
> > +		return;
> > +
> > +	/*
> > +	 *  We know there is only one task runnable and we've just picked it. The
> > +	 *  normal enqueue path will have cleared TICK_DEP_BIT_SCHED if we will
> > +	 *  be otherwise able to stop the tick. Just need to check if we are using
> > +	 *  bandwidth control.
> > +	 */
> > +	if (cfs_task_bw_constrained(p))
> > +		tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
> > +}
> > +#endif
> > +
> >  #else /* CONFIG_CFS_BANDWIDTH */
> >  
> >  static inline bool cfs_bandwidth_used(void)
> > @@ -6182,9 +6222,17 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> >  static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
> >  static inline void update_runtime_enabled(struct rq *rq) {}
> >  static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {}
> > +bool cfs_task_bw_constrained(struct task_struct *p)
> > +{
> > +	return false;
> > +}
> >  
> >  #endif /* CONFIG_CFS_BANDWIDTH */
> >  
> > +#if !defined(CONFIG_CFS_BANDWIDTH) || !defined(CONFIG_NO_HZ_FULL)
> > +static inline void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p) {}
> > +#endif
> > +
> >  /**************************************************
> >   * CFS operations on tasks:
> >   */
> > @@ -8098,6 +8146,7 @@ done: __maybe_unused;
> >  		hrtick_start_fair(rq, p);
> >  
> >  	update_misfit_status(p, rq);
> > +	sched_fair_update_stop_tick(rq, p);
> >  
> >  	return p;
> >  
> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > index ee7f23c76bd3..6fdf1fdf6b17 100644
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -101,3 +101,5 @@ SCHED_FEAT(LATENCY_WARN, false)
> >  
> >  SCHED_FEAT(ALT_PERIOD, true)
> >  SCHED_FEAT(BASE_SLICE, true)
> > +
> > +SCHED_FEAT(HZ_BW, false)
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 63822c9238cc..d6d346bc78aa 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -465,6 +465,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth
> >  extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
> >  extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
> >  extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
> > +extern bool cfs_task_bw_constrained(struct task_struct *p);
> >  
> >  extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
> >  		struct sched_rt_entity *rt_se, int cpu,
> 

-- 



* Re: [PATCH v2 1/2] sched, cgroup: Restore meaning to hierarchical_quota
  2023-07-13 13:23     ` Phil Auld
@ 2023-07-13 20:12       ` Benjamin Segall
  2023-07-13 23:27         ` Phil Auld
  0 siblings, 1 reply; 22+ messages in thread
From: Benjamin Segall @ 2023-07-13 20:12 UTC (permalink / raw)
  To: Phil Auld
  Cc: linux-kernel, Juri Lelli, Ingo Molnar,
	Daniel Bristot de Oliveira, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Mel Gorman,
	Frederic Weisbecker, Tejun Heo

Phil Auld <pauld@redhat.com> writes:

> On Wed, Jul 12, 2023 at 03:09:31PM -0700 Benjamin Segall wrote:
>> Phil Auld <pauld@redhat.com> writes:
>> 
>> > In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
>> > groups due to the previous fix simply taking the min.  It should
>> > reflect a limit imposed at that level or by an ancestor. Even
>> > though cgroupv2 does not require child quota to be less than or
>> > equal to that of its ancestors the task group will still be
>> > constrained by such a quota so this should be shown here. Cgroupv1
>> > continues to set this correctly.
>> >
>> > In both cases, add initialization when a new task group is created
>> > based on the current parent's value (or RUNTIME_INF in the case of
>> > root_task_group). Otherwise, the field is wrong until a quota is
>> > changed after creation and __cfs_schedulable() is called.
>> >
>> > Fixes: c53593e5cb69 ("sched, cgroup: Don't reject lower cpu.max on ancestors")
>> > Signed-off-by: Phil Auld <pauld@redhat.com>
>> > Reviewed-by: Ben Segall <bsegall@google.com>
>> > Cc: Ingo Molnar <mingo@redhat.com>
>> > Cc: Peter Zijlstra <peterz@infradead.org>
>> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> > Cc: Juri Lelli <juri.lelli@redhat.com>
>> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> > Cc: Valentin Schneider <vschneid@redhat.com>
>> > Cc: Ben Segall <bsegall@google.com>
>> > Cc: Frederic Weisbecker <frederic@kernel.org>
>> > Cc: Tejun Heo <tj@kernel.org>
>> > ---
>> >
>> > v2: Improve comment about how setting hierarchical_quota correctly
>> >
>> > helps the scheduler. Remove extra parens.
>> >  kernel/sched/core.c  | 13 +++++++++----
>> >  kernel/sched/fair.c  |  7 ++++---
>> >  kernel/sched/sched.h |  2 +-
>> >  3 files changed, 14 insertions(+), 8 deletions(-)
>> >
>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index a68d1276bab0..f80697a79baf 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -9904,7 +9904,7 @@ void __init sched_init(void)
>> >  		ptr += nr_cpu_ids * sizeof(void **);
>> >  
>> >  		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
>> > -		init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
>> > +		init_cfs_bandwidth(&root_task_group.cfs_bandwidth, NULL);
>> >  #endif /* CONFIG_FAIR_GROUP_SCHED */
>> >  #ifdef CONFIG_RT_GROUP_SCHED
>> >  		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
>> > @@ -11038,11 +11038,16 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
>> >  
>> >  		/*
>> >  		 * Ensure max(child_quota) <= parent_quota.  On cgroup2,
>> > -		 * always take the min.  On cgroup1, only inherit when no
>> > -		 * limit is set:
>> > +		 * always take the non-RUNTIME_INF min.  On cgroup1, only
>> > +		 * inherit when no limit is set. In cgroup2 this is used
>> > +		 * by the scheduler to determine if a given CFS task has a
>> > +		 * bandwidth constraint at some higher level.
>> >  		 */
>> 
>> It's still used for determining this on cgroup1 (and the cgroup1 code
>> still works for that), right?
>>
>
> It would, except that the enforcement of child quota <= parent quota
> means that cfs_rq->runtime_enabled will be set and we'll hit that first
> on cgroup1.  So we don't really use it for this determination in
> cgroup1.

cgroup1 tg_cfs_schedulable_down only constrains child quota when it's
set. You can set quota=RUNTIME_INF on any cgroup, no matter what its
parent is. (The schedulable constraint is a little silly)


* Re: [PATCH v2 1/2] sched, cgroup: Restore meaning to hierarchical_quota
  2023-07-13 20:12       ` Benjamin Segall
@ 2023-07-13 23:27         ` Phil Auld
  0 siblings, 0 replies; 22+ messages in thread
From: Phil Auld @ 2023-07-13 23:27 UTC (permalink / raw)
  To: Benjamin Segall
  Cc: linux-kernel, Juri Lelli, Ingo Molnar,
	Daniel Bristot de Oliveira, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Mel Gorman,
	Frederic Weisbecker, Tejun Heo

On Thu, Jul 13, 2023 at 01:12:04PM -0700 Benjamin Segall wrote:
> Phil Auld <pauld@redhat.com> writes:
> 
> > On Wed, Jul 12, 2023 at 03:09:31PM -0700 Benjamin Segall wrote:
> >> Phil Auld <pauld@redhat.com> writes:
> >> 
> >> > In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
> >> > groups due to the previous fix simply taking the min.  It should
> >> > reflect a limit imposed at that level or by an ancestor. Even
> >> > though cgroupv2 does not require child quota to be less than or
> >> > equal to that of its ancestors the task group will still be
> >> > constrained by such a quota so this should be shown here. Cgroupv1
> >> > continues to set this correctly.
> >> >
> >> > In both cases, add initialization when a new task group is created
> >> > based on the current parent's value (or RUNTIME_INF in the case of
> >> > root_task_group). Otherwise, the field is wrong until a quota is
> >> > changed after creation and __cfs_schedulable() is called.
> >> >
> >> > Fixes: c53593e5cb69 ("sched, cgroup: Don't reject lower cpu.max on ancestors")
> >> > Signed-off-by: Phil Auld <pauld@redhat.com>
> >> > Reviewed-by: Ben Segall <bsegall@google.com>
> >> > Cc: Ingo Molnar <mingo@redhat.com>
> >> > Cc: Peter Zijlstra <peterz@infradead.org>
> >> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> >> > Cc: Juri Lelli <juri.lelli@redhat.com>
> >> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> >> > Cc: Valentin Schneider <vschneid@redhat.com>
> >> > Cc: Ben Segall <bsegall@google.com>
> >> > Cc: Frederic Weisbecker <frederic@kernel.org>
> >> > Cc: Tejun Heo <tj@kernel.org>
> >> > ---
> >> >
> >> > v2: Improve comment about how setting hierarchical_quota correctly
> >> >
> >> > helps the scheduler. Remove extra parens.
> >> >  kernel/sched/core.c  | 13 +++++++++----
> >> >  kernel/sched/fair.c  |  7 ++++---
> >> >  kernel/sched/sched.h |  2 +-
> >> >  3 files changed, 14 insertions(+), 8 deletions(-)
> >> >
> >> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> > index a68d1276bab0..f80697a79baf 100644
> >> > --- a/kernel/sched/core.c
> >> > +++ b/kernel/sched/core.c
> >> > @@ -9904,7 +9904,7 @@ void __init sched_init(void)
> >> >  		ptr += nr_cpu_ids * sizeof(void **);
> >> >  
> >> >  		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
> >> > -		init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
> >> > +		init_cfs_bandwidth(&root_task_group.cfs_bandwidth, NULL);
> >> >  #endif /* CONFIG_FAIR_GROUP_SCHED */
> >> >  #ifdef CONFIG_RT_GROUP_SCHED
> >> >  		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
> >> > @@ -11038,11 +11038,16 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
> >> >  
> >> >  		/*
> >> >  		 * Ensure max(child_quota) <= parent_quota.  On cgroup2,
> >> > -		 * always take the min.  On cgroup1, only inherit when no
> >> > -		 * limit is set:
> >> > +		 * always take the non-RUNTIME_INF min.  On cgroup1, only
> >> > +		 * inherit when no limit is set. In cgroup2 this is used
> >> > +		 * by the scheduler to determine if a given CFS task has a
> >> > +		 * bandwidth constraint at some higher level.
> >> >  		 */
> >> 
> >> It's still used for determining this on cgroup1 (and the cgroup1 code
> >> still works for that), right?
> >>
> >
> > It would, except that the enforcement of child quota <= parent quota
> > means that cfs_rq->runtime_enabled will be set and we'll hit that first
> > on cgroup1.  So we don't really use it for this determination in
> > cgroup1.
> 
> cgroup1 tg_cfs_schedulable_down only constricts child quota when it's
> set. You can set quota=RUNTIME_INF on any cgroup, no matter what its
> parent is. (The schedulable constraint is a little silly)
> 

Aargh :)   I had "In both cases this ..." there before I talked myself out
of it.

I'll update this one.



Cheers,
Phil



-- 



* [PATCH v3 1/2] sched, cgroup: Restore meaning to hierarchical_quota
  2023-07-12 22:09   ` Benjamin Segall
  2023-07-13 13:23     ` Phil Auld
@ 2023-07-14 12:57     ` Phil Auld
  2023-07-17 18:27       ` Tejun Heo
  2023-08-09 19:34       ` [tip: sched/core] " tip-bot2 for Phil Auld
  1 sibling, 2 replies; 22+ messages in thread
From: Phil Auld @ 2023-07-14 12:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Juri Lelli, Ingo Molnar, Daniel Bristot de Oliveira,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman,
	Frederic Weisbecker, Tejun Heo, Phil Auld

In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
groups due to the previous fix simply taking the min.  It should
reflect a limit imposed at that level or by an ancestor. Even
though cgroupv2 does not require child quota to be less than or
equal to that of its ancestors, the task group will still be
constrained by such a quota, so this should be shown here. Cgroupv1
continues to set this correctly.

In both cases, add initialization when a new task group is created
based on the current parent's value (or RUNTIME_INF in the case of
root_task_group). Otherwise, the field is wrong until a quota is
changed after creation and __cfs_schedulable() is called.

Fixes: c53593e5cb69 ("sched, cgroup: Don't reject lower cpu.max on ancestors")
Signed-off-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
---
v3: Make comment apply to both cgroup1 and 2.

v2: Improve comment about how setting hierarchical_quota correctly
helps the scheduler. Remove extra parens.

 kernel/sched/core.c  | 13 +++++++++----
 kernel/sched/fair.c  |  7 ++++---
 kernel/sched/sched.h |  2 +-
 3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a68d1276bab0..f7c1510fdac1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9904,7 +9904,7 @@ void __init sched_init(void)
 		ptr += nr_cpu_ids * sizeof(void **);
 
 		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
-		init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
+		init_cfs_bandwidth(&root_task_group.cfs_bandwidth, NULL);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 #ifdef CONFIG_RT_GROUP_SCHED
 		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
@@ -11038,11 +11038,16 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
 
 		/*
 		 * Ensure max(child_quota) <= parent_quota.  On cgroup2,
-		 * always take the min.  On cgroup1, only inherit when no
-		 * limit is set:
+		 * always take the non-RUNTIME_INF min.  On cgroup1, only
+		 * inherit when no limit is set. In both cases this is used
+		 * by the scheduler to determine if a given CFS task has a
+		 * bandwidth constraint at some higher level.
 		 */
 		if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
-			quota = min(quota, parent_quota);
+			if (quota == RUNTIME_INF)
+				quota = parent_quota;
+			else if (parent_quota != RUNTIME_INF)
+				quota = min(quota, parent_quota);
 		} else {
 			if (quota == RUNTIME_INF)
 				quota = parent_quota;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f55884..d9b3d4617e16 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6005,13 +6005,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
 }
 
-void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent)
 {
 	raw_spin_lock_init(&cfs_b->lock);
 	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 	cfs_b->burst = 0;
+	cfs_b->hierarchical_quota = parent ? parent->hierarchical_quota : RUNTIME_INF;
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
@@ -6168,7 +6169,7 @@ static inline int throttled_lb_pair(struct task_group *tg,
 	return 0;
 }
 
-void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent) {}
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
@@ -12373,7 +12374,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 
 	tg->shares = NICE_0_LOAD;
 
-	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
+	init_cfs_bandwidth(tg_cfs_bandwidth(tg), tg_cfs_bandwidth(parent));
 
 	for_each_possible_cpu(i) {
 		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec7b3e0a2b20..63822c9238cc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -460,7 +460,7 @@ extern void unregister_fair_sched_group(struct task_group *tg);
 extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 			struct sched_entity *se, int cpu,
 			struct sched_entity *parent);
-extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
+extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent);
 
 extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
 extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/2] sched, cgroup: Restore meaning to hierarchical_quota
  2023-07-14 12:57     ` [PATCH v3 " Phil Auld
@ 2023-07-17 18:27       ` Tejun Heo
  2023-07-18 12:57         ` Phil Auld
  2023-08-09 19:34       ` [tip: sched/core] " tip-bot2 for Phil Auld
  1 sibling, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2023-07-17 18:27 UTC (permalink / raw)
  To: Phil Auld
  Cc: linux-kernel, Juri Lelli, Ingo Molnar,
	Daniel Bristot de Oliveira, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Ben Segall, Steven Rostedt,
	Mel Gorman, Frederic Weisbecker

On Fri, Jul 14, 2023 at 08:57:46AM -0400, Phil Auld wrote:
> In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
> groups due to the previous fix simply taking the min.  It should
> reflect a limit imposed at that level or by an ancestor. Even
> though cgroupv2 does not require child quota to be less than or
> equal to that of its ancestors the task group will still be
> constrained by such a quota so this should be shown here. Cgroupv1
> continues to set this correctly.
> 
> In both cases, add initialization when a new task group is created
> based on the current parent's value (or RUNTIME_INF in the case of
> root_task_group). Otherwise, the field is wrong until a quota is
> changed after creation and __cfs_schedulable() is called.
> 
> Fixes: c53593e5cb69 ("sched, cgroup: Don't reject lower cpu.max on ancestors")

Does this really fix anything observable? I wonder whether this is more
misleading than helpful. In cgroup2, the value simply wasn't being used,
right?

> Signed-off-by: Phil Auld <pauld@redhat.com>
> Reviewed-by: Ben Segall <bsegall@google.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>

Acked-by: Tejun Heo <tj@kernel.org>

> +		 * always take the non-RUNTIME_INF min.  On cgroup1, only
> +		 * inherit when no limit is set. In both cases this is used
> +		 * by the scheduler to determine if a given CFS task has a
> +		 * bandwidth constraint at some higher level.

The discussion on this comment is stretching too long and this is fine too,
but what's worth commenting on for cgroup2 is that the limit value itself
doesn't mean anything; we're just latching onto the value used by cgroup1
to track whether there's any limit active or not.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/2] sched, cgroup: Restore meaning to hierarchical_quota
  2023-07-17 18:27       ` Tejun Heo
@ 2023-07-18 12:57         ` Phil Auld
  2023-07-18 13:25           ` Phil Auld
  0 siblings, 1 reply; 22+ messages in thread
From: Phil Auld @ 2023-07-18 12:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Juri Lelli, Ingo Molnar,
	Daniel Bristot de Oliveira, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Ben Segall, Steven Rostedt,
	Mel Gorman, Frederic Weisbecker

On Mon, Jul 17, 2023 at 08:27:24AM -1000 Tejun Heo wrote:
> On Fri, Jul 14, 2023 at 08:57:46AM -0400, Phil Auld wrote:
> > In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
> > groups due to the previous fix simply taking the min.  It should
> > reflect a limit imposed at that level or by an ancestor. Even
> > though cgroupv2 does not require child quota to be less than or
> > equal to that of its ancestors the task group will still be
> > constrained by such a quota so this should be shown here. Cgroupv1
> > continues to set this correctly.
> > 
> > In both cases, add initialization when a new task group is created
> > based on the current parent's value (or RUNTIME_INF in the case of
> > root_task_group). Otherwise, the field is wrong until a quota is
> > changed after creation and __cfs_schedulable() is called.
> > 
> > Fixes: c53593e5cb69 ("sched, cgroup: Don't reject lower cpu.max on ancestors")
> 
> Does this really fix anything observable? I wonder whether this is more
> misleading than helpful. In cgroup2, the value simply wasn't being used,
> right?
>

It wasn't being used, but it was actively being set wrong. I mean, if
we are going to bother doing the __cfs_schedulable() tg tree walk, we
might as well not be setting a bogus value. But that said, no, it was
not observable until I tried to use it.

I'm fine if that's dropped. I just wanted it set right going forward :)


> > Signed-off-by: Phil Auld <pauld@redhat.com>
> > Reviewed-by: Ben Segall <bsegall@google.com>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Juri Lelli <juri.lelli@redhat.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Valentin Schneider <vschneid@redhat.com>
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Frederic Weisbecker <frederic@kernel.org>
> > Cc: Tejun Heo <tj@kernel.org>
> 
> Acked-by: Tejun Heo <tj@kernel.org>
>

Thanks!


> > +		 * always take the non-RUNTIME_INF min.  On cgroup1, only
> > +		 * inherit when no limit is set. In both cases this is used
> > +		 * by the scheduler to determine if a given CFS task has a
> > +		 * bandwidth constraint at some higher level.
> 
> The discussion on this comment is stretching too long and this is fine too
> but what's worth commenting for cgroup2 is that the limit value itself
> doesn't mean anything and we're just latching onto the value used by cgroup1
> to track whether there's any limit active or not.

I thought that was implied by the wording. "If a given task has a bandwidth
constraint" not "what a given task's bandwidth constraint is".  In both cases
that's how the other parts of the scheduler are using it. The actual
non-RUNTIME_INF value only matters in this function (and indeed only for
cgroup1).

But... the value is just as accurate for cgroup2 as it is for cgroup1.  The
task is still going to be limited by that bandwidth constraint even if its
own bandwidth limit is nominally higher, right?
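
As a concrete sketch (hypothetical paths and group names, assuming a
standard cgroup2 mount at /sys/fs/cgroup):

  mkdir -p /sys/fs/cgroup/parent/child
  echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control
  echo "+cpu" > /sys/fs/cgroup/parent/cgroup.subtree_control

  # parent: 50ms of runtime per 100ms period
  echo "50000 100000" > /sys/fs/cgroup/parent/cpu.max
  # child: no limit of its own ("max"), yet its tasks are still capped
  # by the parent's 50ms, so the child's hierarchical_quota should
  # reflect that ancestor limit rather than RUNTIME_INF
  echo "max 100000" > /sys/fs/cgroup/parent/child/cpu.max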


Cheers,
Phil

> 
> Thanks.
> 
> -- 
> tejun
> 

-- 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/2] sched, cgroup: Restore meaning to hierarchical_quota
  2023-07-18 12:57         ` Phil Auld
@ 2023-07-18 13:25           ` Phil Auld
  0 siblings, 0 replies; 22+ messages in thread
From: Phil Auld @ 2023-07-18 13:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Juri Lelli, Ingo Molnar,
	Daniel Bristot de Oliveira, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Ben Segall, Steven Rostedt,
	Mel Gorman, Frederic Weisbecker

Hi Tejun,

On Tue, Jul 18, 2023 at 08:57:59AM -0400 Phil Auld wrote:
> On Mon, Jul 17, 2023 at 08:27:24AM -1000 Tejun Heo wrote:
> > On Fri, Jul 14, 2023 at 08:57:46AM -0400, Phil Auld wrote:
> > > In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
> > > groups due to the previous fix simply taking the min.  It should
> > > reflect a limit imposed at that level or by an ancestor. Even
> > > though cgroupv2 does not require child quota to be less than or
> > > equal to that of its ancestors the task group will still be
> > > constrained by such a quota so this should be shown here. Cgroupv1
> > > continues to set this correctly.
> > > 
> > > In both cases, add initialization when a new task group is created
> > > based on the current parent's value (or RUNTIME_INF in the case of
> > > root_task_group). Otherwise, the field is wrong until a quota is
> > > changed after creation and __cfs_schedulable() is called.
> > > 
> > > Fixes: c53593e5cb69 ("sched, cgroup: Don't reject lower cpu.max on ancestors")
> > 
> > Does this really fix anything observable? I wonder whether this is more
> > misleading than helpful. In cgroup2, the value simply wasn't being used,
> > right?
> >

(Sorry, my editor bit me...I had added:)

I don't feel strongly about the Fixes tag. What was there seemed broken to
me, so ... "Fixes". But it doesn't matter.


> 
> It wasn't being used but was actively being set wrong. I mean if we are
> going to bother doing the __cfs_schedulable() tg tree walk we might as
> well have not been setting a bogus value. But that said, no it was not
> observable until I tried to use it.
>

We have a field called hierarchical_quota that was being unconditionally
set to -1 for cgroup2. I figured it would be more correct to have it
reflect the hierarchical quota. :)


> I'm fine if that's dropped. I just wanted it set right going forward :)
> 
> 
> > > Signed-off-by: Phil Auld <pauld@redhat.com>
> > > Reviewed-by: Ben Segall <bsegall@google.com>
> > > Cc: Ingo Molnar <mingo@redhat.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > Cc: Juri Lelli <juri.lelli@redhat.com>
> > > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > > Cc: Valentin Schneider <vschneid@redhat.com>
> > > Cc: Ben Segall <bsegall@google.com>
> > > Cc: Frederic Weisbecker <frederic@kernel.org>
> > > Cc: Tejun Heo <tj@kernel.org>
> > 
> > Acked-by: Tejun Heo <tj@kernel.org>
> >
> 
> Thanks!
> 
> 
> > > +		 * always take the non-RUNTIME_INF min.  On cgroup1, only
> > > +		 * inherit when no limit is set. In both cases this is used
> > > +		 * by the scheduler to determine if a given CFS task has a
> > > +		 * bandwidth constraint at some higher level.
> > 
> > The discussion on this comment is stretching too long and this is fine too
> > but what's worth commenting for cgroup2 is that the limit value itself
> > doesn't mean anything and we're just latching onto the value used by cgroup1
> > to track whether there's any limit active or not.
> 
> I thought that was implied by the wording. "If a given task has a bandwidth
> contraint" not "what a given task's bandwidth constraint is".  In both cases
> that's how the other parts of the scheduler are using it. The actual
> non-RUNTIME_INF value only matters in this function (and only for cgroup1
> indeed).
> 
> But... the value is just as accurate for cgroup2 and cgroup1.  The task is
> still going to be limited by that bandwidth constraint even if its own
> bandwidth limit is nominally higher, right? 
> 
> 
> Cheers,
> Phil
> 
> > 
> > Thanks.
> > 
> > -- 
> > tejun
> > 
> 
> -- 
> 

-- 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v6 0/2] Fix nohz_full vs cfs bandwidth
  2023-07-12 13:33 [PATCH v6 0/2] Fix nohz_full vs cfs bandwidth Phil Auld
  2023-07-12 13:33 ` [PATCH v2 1/2] sched, cgroup: Restore meaning to hierarchical_quota Phil Auld
  2023-07-12 13:33 ` [PATCH v6 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use Phil Auld
@ 2023-07-31 15:17 ` Phil Auld
  2023-07-31 16:38 ` Phil Auld
  3 siblings, 0 replies; 22+ messages in thread
From: Phil Auld @ 2023-07-31 15:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Juri Lelli, Ingo Molnar, Daniel Bristot de Oliveira,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman,
	Frederic Weisbecker, Tejun Heo


Hi Peter,

On Wed, Jul 12, 2023 at 09:33:55AM -0400 Phil Auld wrote:
> This is v6 of patch 2/2 which is adding code to prevent
> the tick from being stopped when the single running task
> has bandwidth limits. Discussions had led to the idea of
> adding a bit to task_struct to help make this decision.
> 
> There was some complexity with doing it in the task which
> is  avoided by using something in the cfs_rq. Looking 
> into that lead me to the hierarchical_quota field in the 
> cfs_bandwith struct. We spend a good deal of effort
> updating (or trying to, see patch 1/2) that value for
> the whole task_group tree when a quota is set/changed.
> 
> This new version first fixes that value to be meaningful
> for cgroupv2 and then leverages it to make the decisions
> about blocking the tick_stop. 
> 
> Phil Auld (2):
>   sched, cgroup: Restore meaning to hierarchical_quota
>   Sched/fair: Block nohz tick_stop when cfs bandwidth in use
> 
>  kernel/sched/core.c     | 23 ++++++++++++++---
>  kernel/sched/fair.c     | 56 ++++++++++++++++++++++++++++++++++++++---
>  kernel/sched/features.h |  2 ++
>  kernel/sched/sched.h    |  3 ++-
>  4 files changed, 76 insertions(+), 8 deletions(-)
> 
> -- 
> 2.31.1
> 

Ping :)

Any thoughts on these now?


Cheers,
Phil
-- 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v6 0/2] Fix nohz_full vs cfs bandwidth
  2023-07-12 13:33 [PATCH v6 0/2] Fix nohz_full vs cfs bandwidth Phil Auld
                   ` (2 preceding siblings ...)
  2023-07-31 15:17 ` [PATCH v6 0/2] Fix nohz_full vs cfs bandwidth Phil Auld
@ 2023-07-31 16:38 ` Phil Auld
  2023-07-31 17:23   ` Phil Auld
  3 siblings, 1 reply; 22+ messages in thread
From: Phil Auld @ 2023-07-31 16:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Juri Lelli, Ingo Molnar, Daniel Bristot de Oliveira,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman,
	Frederic Weisbecker, Tejun Heo

Hi Peter,

On Wed, Jul 12, 2023 at 09:33:55AM -0400 Phil Auld wrote:
> This is v6 of patch 2/2 which is adding code to prevent
> the tick from being stopped when the single running task
> has bandwidth limits. Discussions had led to the idea of
> adding a bit to task_struct to help make this decision.
> 
> There was some complexity with doing it in the task which
> is  avoided by using something in the cfs_rq. Looking 
> into that lead me to the hierarchical_quota field in the 
> cfs_bandwith struct. We spend a good deal of effort
> updating (or trying to, see patch 1/2) that value for
> the whole task_group tree when a quota is set/changed.
> 
> This new version first fixes that value to be meaningful
> for cgroupv2 and then leverages it to make the decisions
> about blocking the tick_stop. 
> 
> Phil Auld (2):
>   sched, cgroup: Restore meaning to hierarchical_quota
>   Sched/fair: Block nohz tick_stop when cfs bandwidth in use
> 
>  kernel/sched/core.c     | 23 ++++++++++++++---
>  kernel/sched/fair.c     | 56 ++++++++++++++++++++++++++++++++++++++---
>  kernel/sched/features.h |  2 ++
>  kernel/sched/sched.h    |  3 ++-
>  4 files changed, 76 insertions(+), 8 deletions(-)
> 
> -- 
> 2.31.1
> 

Ping :)

Any thoughts on these now?


Cheers,
Phil


-- 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v6 0/2] Fix nohz_full vs cfs bandwidth
  2023-07-31 16:38 ` Phil Auld
@ 2023-07-31 17:23   ` Phil Auld
  0 siblings, 0 replies; 22+ messages in thread
From: Phil Auld @ 2023-07-31 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Juri Lelli, Ingo Molnar, Daniel Bristot de Oliveira,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman,
	Frederic Weisbecker, Tejun Heo


Sorry, ignore duplicate, please.  Apparently I forgot email while
on PTO last week :)

On Mon, Jul 31, 2023 at 12:38:37PM -0400 Phil Auld wrote:
> Hi Peter,
> 
> On Wed, Jul 12, 2023 at 09:33:55AM -0400 Phil Auld wrote:
> > This is v6 of patch 2/2 which is adding code to prevent
> > the tick from being stopped when the single running task
> > has bandwidth limits. Discussions had led to the idea of
> > adding a bit to task_struct to help make this decision.
> > 
> > There was some complexity with doing it in the task which
> > is  avoided by using something in the cfs_rq. Looking 
> > into that lead me to the hierarchical_quota field in the 
> > cfs_bandwith struct. We spend a good deal of effort
> > updating (or trying to, see patch 1/2) that value for
> > the whole task_group tree when a quota is set/changed.
> > 
> > This new version first fixes that value to be meaningful
> > for cgroupv2 and then leverages it to make the decisions
> > about blocking the tick_stop. 
> > 
> > Phil Auld (2):
> >   sched, cgroup: Restore meaning to hierarchical_quota
> >   Sched/fair: Block nohz tick_stop when cfs bandwidth in use
> > 
> >  kernel/sched/core.c     | 23 ++++++++++++++---
> >  kernel/sched/fair.c     | 56 ++++++++++++++++++++++++++++++++++++++---
> >  kernel/sched/features.h |  2 ++
> >  kernel/sched/sched.h    |  3 ++-
> >  4 files changed, 76 insertions(+), 8 deletions(-)
> > 
> > -- 
> > 2.31.1
> > 
> 
> Ping :)
> 
> Any thoughts on these now?
> 
> 
> Cheers,
> Phil
> 
> 
> -- 
> 

-- 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v6 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use
  2023-07-12 13:33 ` [PATCH v6 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use Phil Auld
  2023-07-12 22:11   ` Benjamin Segall
@ 2023-07-31 22:49   ` Peter Zijlstra
  2023-08-01 11:13     ` Phil Auld
  2023-08-09 19:34   ` [tip: sched/core] sched/fair: " tip-bot2 for Phil Auld
  2 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2023-07-31 22:49 UTC (permalink / raw)
  To: Phil Auld
  Cc: linux-kernel, Juri Lelli, Ingo Molnar,
	Daniel Bristot de Oliveira, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman,
	Frederic Weisbecker, Tejun Heo

On Wed, Jul 12, 2023 at 09:33:57AM -0400, Phil Auld wrote:
> CFS bandwidth limits and NOHZ full don't play well together.  Tasks
> can easily run well past their quotas before a remote tick does
> accounting.  This leads to long, multi-period stalls before such
> tasks can run again. Currently, when presented with these conflicting
> requirements the scheduler is favoring nohz_full and letting the tick
> be stopped. However, nohz tick stopping is already best-effort, there
> are a number of conditions that can prevent it, whereas cfs runtime
> bandwidth is expected to be enforced.
> 
> Make the scheduler favor bandwidth over stopping the tick by setting
> TICK_DEP_BIT_SCHED when the only running task is a cfs task with
> runtime limit enabled. We use cfs_b->hierarchical_quota to
> determine if the task requires the tick.
> 
> Add check in pick_next_task_fair() as well since that is where
> we have a handle on the task that is actually going to be running.
> 
> Add check in sched_can_stop_tick() to cover some edge cases such
> as nr_running going from 2->1 and the 1 remains the running task.

These appear fine to me, except:

> Add sched_feat HZ_BW (off by default) to control the tick_stop
> behavior.

What was the thinking here? This means nobody will be using this -- why
would you want this default disabled?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v6 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use
  2023-07-31 22:49   ` Peter Zijlstra
@ 2023-08-01 11:13     ` Phil Auld
  2023-08-01 15:37       ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Phil Auld @ 2023-08-01 11:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Juri Lelli, Ingo Molnar,
	Daniel Bristot de Oliveira, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman,
	Frederic Weisbecker, Tejun Heo

On Tue, Aug 01, 2023 at 12:49:34AM +0200 Peter Zijlstra wrote:
> On Wed, Jul 12, 2023 at 09:33:57AM -0400, Phil Auld wrote:
> > CFS bandwidth limits and NOHZ full don't play well together.  Tasks
> > can easily run well past their quotas before a remote tick does
> > accounting.  This leads to long, multi-period stalls before such
> > tasks can run again. Currently, when presented with these conflicting
> > requirements the scheduler is favoring nohz_full and letting the tick
> > be stopped. However, nohz tick stopping is already best-effort, there
> > are a number of conditions that can prevent it, whereas cfs runtime
> > bandwidth is expected to be enforced.
> > 
> > Make the scheduler favor bandwidth over stopping the tick by setting
> > TICK_DEP_BIT_SCHED when the only running task is a cfs task with
> > runtime limit enabled. We use cfs_b->hierarchical_quota to
> > determine if the task requires the tick.
> > 
> > Add check in pick_next_task_fair() as well since that is where
> > we have a handle on the task that is actually going to be running.
> > 
> > Add check in sched_can_stop_tick() to cover some edge cases such
> > as nr_running going from 2->1 and the 1 remains the running task.
> 
> These appear fine to me, except:
>
> > Add sched_feat HZ_BW (off by default) to control the tick_stop
> > behavior.
> 
> What was the thinking here? This means nobody will be using this -- why
> would you want this default disabled?
> 

That was just a hedge in case it caused issues. I'd probably have had to
enable it in RHEL anyway. Using a feature was to make it innocuous when
disabled.  Would you prefer me to enable it or remove the sched_feat
entirely? (or do you want to just switch that to true when you apply it?)
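
FWIW, with the sched_feat in place the behavior can still be flipped at
run time on a CONFIG_SCHED_DEBUG kernel (a sketch, assuming debugfs is
mounted at the usual /sys/kernel/debug):

  echo HZ_BW    > /sys/kernel/debug/sched/features    # favor bandwidth
  echo NO_HZ_BW > /sys/kernel/debug/sched/features    # let the tick stop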

Thanks,
Phil

-- 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v6 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use
  2023-08-01 11:13     ` Phil Auld
@ 2023-08-01 15:37       ` Peter Zijlstra
  2023-08-02 14:20         ` Phil Auld
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2023-08-01 15:37 UTC (permalink / raw)
  To: Phil Auld
  Cc: linux-kernel, Juri Lelli, Ingo Molnar,
	Daniel Bristot de Oliveira, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman,
	Frederic Weisbecker, Tejun Heo

On Tue, Aug 01, 2023 at 07:13:42AM -0400, Phil Auld wrote:
> On Tue, Aug 01, 2023 at 12:49:34AM +0200 Peter Zijlstra wrote:
> > On Wed, Jul 12, 2023 at 09:33:57AM -0400, Phil Auld wrote:
> > > CFS bandwidth limits and NOHZ full don't play well together.  Tasks
> > > can easily run well past their quotas before a remote tick does
> > > accounting.  This leads to long, multi-period stalls before such
> > > tasks can run again. Currently, when presented with these conflicting
> > > requirements the scheduler is favoring nohz_full and letting the tick
> > > be stopped. However, nohz tick stopping is already best-effort, there
> > > are a number of conditions that can prevent it, whereas cfs runtime
> > > bandwidth is expected to be enforced.
> > > 
> > > Make the scheduler favor bandwidth over stopping the tick by setting
> > > TICK_DEP_BIT_SCHED when the only running task is a cfs task with
> > > runtime limit enabled. We use cfs_b->hierarchical_quota to
> > > determine if the task requires the tick.
> > > 
> > > Add check in pick_next_task_fair() as well since that is where
> > > we have a handle on the task that is actually going to be running.
> > > 
> > > Add check in sched_can_stop_tick() to cover some edge cases such
> > > as nr_running going from 2->1 and the 1 remains the running task.
> > 
> > These appear fine to me, except:
> >
> > > Add sched_feat HZ_BW (off by default) to control the tick_stop
> > > behavior.
> > 
> > What was the thinking here? This means nobody will be using this -- why
> > would you want this default disabled?
> > 
> 
> That was just a hedge in case it caused issues. I'd probably have had to
> enable it in RHEL anyway. Using a feature was to make it innocuous when
> disabled.  Would you prefer me to enable it or remove the sched_feat
> entirely? (or do you want to just switch that to true when you apply it?)

I've edited it to default enabled -- we can pull the feature flag
eventually I suppose.

Things didn't readily apply, so I've kicked at it a little. Should be in
queue/sched/core for the robots to chew on.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v6 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use
  2023-08-01 15:37       ` Peter Zijlstra
@ 2023-08-02 14:20         ` Phil Auld
  0 siblings, 0 replies; 22+ messages in thread
From: Phil Auld @ 2023-08-02 14:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Juri Lelli, Ingo Molnar,
	Daniel Bristot de Oliveira, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Steven Rostedt, Mel Gorman,
	Frederic Weisbecker, Tejun Heo

On Tue, Aug 01, 2023 at 05:37:31PM +0200 Peter Zijlstra wrote:
> On Tue, Aug 01, 2023 at 07:13:42AM -0400, Phil Auld wrote:
> > On Tue, Aug 01, 2023 at 12:49:34AM +0200 Peter Zijlstra wrote:
> > > On Wed, Jul 12, 2023 at 09:33:57AM -0400, Phil Auld wrote:
> > > > CFS bandwidth limits and NOHZ full don't play well together.  Tasks
> > > > can easily run well past their quotas before a remote tick does
> > > > accounting.  This leads to long, multi-period stalls before such
> > > > tasks can run again. Currently, when presented with these conflicting
> > > > requirements the scheduler is favoring nohz_full and letting the tick
> > > > be stopped. However, nohz tick stopping is already best-effort, there
> > > > are a number of conditions that can prevent it, whereas cfs runtime
> > > > bandwidth is expected to be enforced.
> > > > 
> > > > Make the scheduler favor bandwidth over stopping the tick by setting
> > > > TICK_DEP_BIT_SCHED when the only running task is a cfs task with
> > > > runtime limit enabled. We use cfs_b->hierarchical_quota to
> > > > determine if the task requires the tick.
> > > > 
> > > > Add check in pick_next_task_fair() as well since that is where
> > > > we have a handle on the task that is actually going to be running.
> > > > 
> > > > Add check in sched_can_stop_tick() to cover some edge cases such
> > > > as nr_running going from 2->1 and the 1 remains the running task.
> > > 
> > > These appear fine to me, except:
> > >
> > > > Add sched_feat HZ_BW (off by default) to control the tick_stop
> > > > behavior.
> > > 
> > > What was the thinking here? This means nobody will be using this -- why
> > > would you want this default disabled?
> > > 
> > 
> > That was just a hedge in case it caused issues. I'd probably have had to
> > enable it in RHEL anyway. Using a feature was to make it innocuous when
> > disabled.  Would you prefer me to enable it or remove the sched_feat
> > entirely? (or do you want to just switch that to true when you apply it?)
> 
> I've edited it to default enabled -- we can pull the feature flag
> eventually I suppose.
> 
> Things didn't readily apply, so I've kicked at it a little. Should be in
> queue/sched/core for the robots to chew on.
>

Or "choke on" as the case may be :)
I sent you something that will hopefully clean that up.


Thanks!


Cheers,
Phil
-- 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tip: sched/core] sched/fair: Block nohz tick_stop when cfs bandwidth in use
  2023-07-12 13:33 ` [PATCH v6 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use Phil Auld
  2023-07-12 22:11   ` Benjamin Segall
  2023-07-31 22:49   ` Peter Zijlstra
@ 2023-08-09 19:34   ` tip-bot2 for Phil Auld
  2 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Phil Auld @ 2023-08-09 19:34 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Phil Auld, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     88c56cfeaec4642aee8aac58b38d5708c6aae0d3
Gitweb:        https://git.kernel.org/tip/88c56cfeaec4642aee8aac58b38d5708c6aae0d3
Author:        Phil Auld <pauld@redhat.com>
AuthorDate:    Wed, 12 Jul 2023 09:33:57 -04:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 02 Aug 2023 16:19:26 +02:00

sched/fair: Block nohz tick_stop when cfs bandwidth in use

CFS bandwidth limits and NOHZ full don't play well together.  Tasks
can easily run well past their quotas before a remote tick does
accounting.  This leads to long, multi-period stalls before such
tasks can run again. Currently, when presented with these conflicting
requirements the scheduler is favoring nohz_full and letting the tick
be stopped. However, nohz tick stopping is already best-effort, there
are a number of conditions that can prevent it, whereas cfs runtime
bandwidth is expected to be enforced.

Make the scheduler favor bandwidth over stopping the tick by setting
TICK_DEP_BIT_SCHED when the only running task is a cfs task with
runtime limit enabled. We use cfs_b->hierarchical_quota to
determine if the task requires the tick.

Add check in pick_next_task_fair() as well since that is where
we have a handle on the task that is actually going to be running.

Add check in sched_can_stop_tick() to cover some edge cases such
as nr_running going from 2->1 and the 1 remains the running task.

Reviewed-By: Ben Segall <bsegall@google.com>
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230712133357.381137-3-pauld@redhat.com
---
 kernel/sched/core.c     | 26 ++++++++++++++++++++-
 kernel/sched/fair.c     | 52 +++++++++++++++++++++++++++++++++++++++-
 kernel/sched/features.h |  2 ++-
 kernel/sched/sched.h    |  2 ++-
 4 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3af25ca..614271a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1194,6 +1194,20 @@ static void nohz_csd_func(void *info)
 #endif /* CONFIG_NO_HZ_COMMON */
 
 #ifdef CONFIG_NO_HZ_FULL
+static inline bool __need_bw_check(struct rq *rq, struct task_struct *p)
+{
+	if (rq->nr_running != 1)
+		return false;
+
+	if (p->sched_class != &fair_sched_class)
+		return false;
+
+	if (!task_on_rq_queued(p))
+		return false;
+
+	return true;
+}
+
 bool sched_can_stop_tick(struct rq *rq)
 {
 	int fifo_nr_running;
@@ -1229,6 +1243,18 @@ bool sched_can_stop_tick(struct rq *rq)
 	if (rq->nr_running > 1)
 		return false;
 
+	/*
+	 * If there is one task and it has CFS runtime bandwidth constraints
+	 * and it's on the cpu now we don't want to stop the tick.
+	 * This check prevents clearing the bit if a newly enqueued task here is
+	 * dequeued by migrating while the constrained task continues to run.
+	 * E.g. going from 2->1 without going through pick_next_task().
+	 */
+	if (sched_feat(HZ_BW) && __need_bw_check(rq, rq->curr)) {
+		if (cfs_task_bw_constrained(rq->curr))
+			return false;
+	}
+
 	return true;
 }
 #endif /* CONFIG_NO_HZ_FULL */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 26bfbb6..c282064 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6189,6 +6189,46 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
 	rq_clock_stop_loop_update(rq);
 }
 
+bool cfs_task_bw_constrained(struct task_struct *p)
+{
+	struct cfs_rq *cfs_rq = task_cfs_rq(p);
+
+	if (!cfs_bandwidth_used())
+		return false;
+
+	if (cfs_rq->runtime_enabled ||
+	    tg_cfs_bandwidth(cfs_rq->tg)->hierarchical_quota != RUNTIME_INF)
+		return true;
+
+	return false;
+}
+
+#ifdef CONFIG_NO_HZ_FULL
+/* called from pick_next_task_fair() */
+static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
+{
+	int cpu = cpu_of(rq);
+
+	if (!sched_feat(HZ_BW) || !cfs_bandwidth_used())
+		return;
+
+	if (!tick_nohz_full_cpu(cpu))
+		return;
+
+	if (rq->nr_running != 1)
+		return;
+
+	/*
+	 *  We know there is only one task runnable and we've just picked it. The
+	 *  normal enqueue path will have cleared TICK_DEP_BIT_SCHED if we will
+	 *  be otherwise able to stop the tick. Just need to check if we are using
+	 *  bandwidth control.
+	 */
+	if (cfs_task_bw_constrained(p))
+		tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
+}
+#endif
+
 #else /* CONFIG_CFS_BANDWIDTH */
 
 static inline bool cfs_bandwidth_used(void)
@@ -6231,9 +6271,18 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
 static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
 static inline void update_runtime_enabled(struct rq *rq) {}
 static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {}
-
+#ifdef CONFIG_CGROUP_SCHED
+bool cfs_task_bw_constrained(struct task_struct *p)
+{
+	return false;
+}
+#endif
 #endif /* CONFIG_CFS_BANDWIDTH */
 
+#if !defined(CONFIG_CFS_BANDWIDTH) || !defined(CONFIG_NO_HZ_FULL)
+static inline void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p) {}
+#endif
+
 /**************************************************
  * CFS operations on tasks:
  */
@@ -8201,6 +8250,7 @@ done: __maybe_unused;
 		hrtick_start_fair(rq, p);
 
 	update_misfit_status(p, rq);
+	sched_fair_update_stop_tick(rq, p);
 
 	return p;
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c..e10074c 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -101,3 +101,5 @@ SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+
+SCHED_FEAT(HZ_BW, true)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 602de71..19af176 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -459,6 +459,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth
 extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
 extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
 extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
+extern bool cfs_task_bw_constrained(struct task_struct *p);
 
 extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
 		struct sched_rt_entity *rt_se, int cpu,
@@ -494,6 +495,7 @@ static inline void set_task_rq_fair(struct sched_entity *se,
 #else /* CONFIG_CGROUP_SCHED */
 
 struct cfs_bandwidth { };
+static inline bool cfs_task_bw_constrained(struct task_struct *p) { return false; }
 
 #endif	/* CONFIG_CGROUP_SCHED */
 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip: sched/core] sched, cgroup: Restore meaning to hierarchical_quota
  2023-07-14 12:57     ` [PATCH v3 " Phil Auld
  2023-07-17 18:27       ` Tejun Heo
@ 2023-08-09 19:34       ` tip-bot2 for Phil Auld
  1 sibling, 0 replies; 22+ messages in thread
From: tip-bot2 for Phil Auld @ 2023-08-09 19:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Phil Auld, Peter Zijlstra (Intel),
	Ben Segall, Tejun Heo, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     c98c18270be115678f4295b10a5af5dcc9c4efa0
Gitweb:        https://git.kernel.org/tip/c98c18270be115678f4295b10a5af5dcc9c4efa0
Author:        Phil Auld <pauld@redhat.com>
AuthorDate:    Fri, 14 Jul 2023 08:57:46 -04:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 02 Aug 2023 16:19:26 +02:00

sched, cgroup: Restore meaning to hierarchical_quota

In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
groups due to the previous fix simply taking the min.  It should
reflect a limit imposed at that level or by an ancestor. Even
though cgroupv2 does not require child quota to be less than or
equal to that of its ancestors the task group will still be
constrained by such a quota so this should be shown here. Cgroupv1
continues to set this correctly.

In both cases, add initialization when a new task group is created
based on the current parent's value (or RUNTIME_INF in the case of
root_task_group). Otherwise, the field is wrong until a quota is
changed after creation and __cfs_schedulable() is called.

Fixes: c53593e5cb69 ("sched, cgroup: Don't reject lower cpu.max on ancestors")
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230714125746.812891-1-pauld@redhat.com
---
 kernel/sched/core.c  | 13 +++++++++----
 kernel/sched/fair.c  |  7 ++++---
 kernel/sched/sched.h |  2 +-
 3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 83e3654..3af25ca 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9953,7 +9953,7 @@ void __init sched_init(void)
 		ptr += nr_cpu_ids * sizeof(void **);
 
 		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
-		init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
+		init_cfs_bandwidth(&root_task_group.cfs_bandwidth, NULL);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 #ifdef CONFIG_RT_GROUP_SCHED
 		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
@@ -11087,11 +11087,16 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
 
 		/*
 		 * Ensure max(child_quota) <= parent_quota.  On cgroup2,
-		 * always take the min.  On cgroup1, only inherit when no
-		 * limit is set:
+		 * always take the non-RUNTIME_INF min.  On cgroup1, only
+		 * inherit when no limit is set. In both cases this is used
+		 * by the scheduler to determine if a given CFS task has a
+		 * bandwidth constraint at some higher level.
 		 */
 		if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
-			quota = min(quota, parent_quota);
+			if (quota == RUNTIME_INF)
+				quota = parent_quota;
+			else if (parent_quota != RUNTIME_INF)
+				quota = min(quota, parent_quota);
 		} else {
 			if (quota == RUNTIME_INF)
 				quota = parent_quota;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f55b0a7..26bfbb6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6045,13 +6045,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
 }
 
-void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent)
 {
 	raw_spin_lock_init(&cfs_b->lock);
 	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 	cfs_b->burst = 0;
+	cfs_b->hierarchical_quota = parent ? parent->hierarchical_quota : RUNTIME_INF;
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
@@ -6217,7 +6218,7 @@ static inline int throttled_lb_pair(struct task_group *tg,
 	return 0;
 }
 
-void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent) {}
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
@@ -12599,7 +12600,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 
 	tg->shares = NICE_0_LOAD;
 
-	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
+	init_cfs_bandwidth(tg_cfs_bandwidth(tg), tg_cfs_bandwidth(parent));
 
 	for_each_possible_cpu(i) {
 		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9baeb1a..602de71 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -454,7 +454,7 @@ extern void unregister_fair_sched_group(struct task_group *tg);
 extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 			struct sched_entity *se, int cpu,
 			struct sched_entity *parent);
-extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
+extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent);
 
 extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
 extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-08-09 19:35 UTC | newest]

Thread overview: 22+ messages
2023-07-12 13:33 [PATCH v6 0/2] Fix nohz_full vs cfs bandwidth Phil Auld
2023-07-12 13:33 ` [PATCH v2 1/2] sched, cgroup: Restore meaning to hierarchical_quota Phil Auld
2023-07-12 22:09   ` Benjamin Segall
2023-07-13 13:23     ` Phil Auld
2023-07-13 20:12       ` Benjamin Segall
2023-07-13 23:27         ` Phil Auld
2023-07-14 12:57     ` [PATCH v3 " Phil Auld
2023-07-17 18:27       ` Tejun Heo
2023-07-18 12:57         ` Phil Auld
2023-07-18 13:25           ` Phil Auld
2023-08-09 19:34       ` [tip: sched/core] " tip-bot2 for Phil Auld
2023-07-12 13:33 ` [PATCH v6 2/2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use Phil Auld
2023-07-12 22:11   ` Benjamin Segall
2023-07-13 13:25     ` Phil Auld
2023-07-31 22:49   ` Peter Zijlstra
2023-08-01 11:13     ` Phil Auld
2023-08-01 15:37       ` Peter Zijlstra
2023-08-02 14:20         ` Phil Auld
2023-08-09 19:34   ` [tip: sched/core] sched/fair: " tip-bot2 for Phil Auld
2023-07-31 15:17 ` [PATCH v6 0/2] Fix nohz_full vs cfs bandwidth Phil Auld
2023-07-31 16:38 ` Phil Auld
2023-07-31 17:23   ` Phil Auld
