linux-kernel.vger.kernel.org archive mirror
* [PATCH v5 0/2] sched/deadline: Fix and optimize sched_dl_global_validate()
@ 2020-10-07 15:11 Peng Liu
  2020-10-07 15:12 ` [PATCH v5 1/2] sched/deadline: " Peng Liu
  2020-10-07 15:13 ` [PATCH v5 2/2] sched/deadline: Fix sched_dl_global_validate() Peng Liu
  0 siblings, 2 replies; 5+ messages in thread
From: Peng Liu @ 2020-10-07 15:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, valentin.schneider, raistlin,
	iwtbavbm

When changing the global rt bandwidth, we check to make sure that the
new settings can accommodate the already allocated dl bandwidth.

Under SMP, dl_bw is maintained on a per root domain basis, but currently
we check and update the new settings CPU by CPU rather than per root
domain, which is either superfluous (for the update) or wrong (for the
check).

Patch 1 removes the superfluous checking and updating.
Patch 2 fixes the incorrect validation.

For details, please see the corresponding patches.
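
In short, after this series the validation runs once per root domain,
roughly as sketched below (a simplified sketch with locking and RCU
omitted; see the individual patches for the actual code):

	u32 gen = ++dl_generation;

	for_each_possible_cpu(cpu) {
		if (dl_bw_visited(cpu, gen))	/* root domain already seen */
			continue;

		dl_b = dl_bw_of(cpu);
		cpus = dl_bw_cpus(cpu);		/* CPUs in this root domain */

		if (new_bw * cpus < dl_b->total_bw)
			ret = -EBUSY;
	}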

----------------
v5 <-- v4:
 - no functional changes; split the single v4 patch into two to obey
   the "one patch does one thing" rule;
 - turn root_domain::visit_gen from u64 to u32;
   both suggested by Juri.
 - refine changelog;

v4 <-- v3:
 - refine changelog;
 - eliminate the ugly #ifdefs with Peter's method;

v3 <-- v2:
 - fix build error for !CONFIG_SMP, reported by kernel test robot;

v2 <-- v1:
 - replace cpumask_weight(cpu_rq(cpu)->rd->span) with dl_bw_cpus(cpu),
   suggested by Juri;

Peng Liu (2):
  sched/deadline: optimize sched_dl_global_validate()
  sched/deadline: Fix sched_dl_global_validate()

 kernel/sched/deadline.c | 46 +++++++++++++++++++++++++++++---------
 kernel/sched/sched.h    | 49 +++++++++++++++++++++--------------------
 kernel/sched/topology.c |  1 +
 3 files changed, 62 insertions(+), 34 deletions(-)

-- 
2.20.1



* [PATCH v5 1/2] sched/deadline: optimize sched_dl_global_validate()
  2020-10-07 15:11 [PATCH v5 0/2] sched/deadline: Fix and optimize sched_dl_global_validate() Peng Liu
@ 2020-10-07 15:12 ` Peng Liu
  2020-10-07 16:55   ` Peter Zijlstra
  2020-10-07 15:13 ` [PATCH v5 2/2] sched/deadline: Fix sched_dl_global_validate() Peng Liu
  1 sibling, 1 reply; 5+ messages in thread
From: Peng Liu @ 2020-10-07 15:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, valentin.schneider, raistlin,
	iwtbavbm

Under CONFIG_SMP, dl_bw is per root domain, not per CPU. When checking
or updating dl_bw, iterating over every CPU is overkill; it is enough to
visit each root domain once.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peng Liu <iwtbavbm@gmail.com>
---
 kernel/sched/deadline.c | 43 ++++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h    |  7 +++++++
 kernel/sched/topology.c |  1 +
 3 files changed, 42 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c19c1883d695..5200e185923f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -97,6 +97,17 @@ static inline unsigned long dl_bw_capacity(int i)
 		return __dl_bw_capacity(i);
 	}
 }
+
+static inline bool dl_bw_visited(int cpu, u32 gen)
+{
+	struct root_domain *rd = cpu_rq(cpu)->rd;
+
+	if (rd->visit_gen == gen)
+		return true;
+
+	rd->visit_gen = gen;
+	return false;
+}
 #else
 static inline struct dl_bw *dl_bw_of(int i)
 {
@@ -112,6 +123,11 @@ static inline unsigned long dl_bw_capacity(int i)
 {
 	return SCHED_CAPACITY_SCALE;
 }
+
+static inline bool dl_bw_visited(int cpu, u32 gen)
+{
+	return false;
+}
 #endif
 
 static inline
@@ -2514,26 +2530,30 @@ const struct sched_class dl_sched_class
 	.update_curr		= update_curr_dl,
 };
 
+/* Used for dl_bw check and update. */
+static u32 dl_generation;
+
 int sched_dl_global_validate(void)
 {
 	u64 runtime = global_rt_runtime();
 	u64 period = global_rt_period();
 	u64 new_bw = to_ratio(period, runtime);
 	struct dl_bw *dl_b;
-	int cpu, ret = 0;
 	unsigned long flags;
+	int cpu, ret = 0;
+	u32 gen = ++dl_generation;
 
 	/*
 	 * Here we want to check the bandwidth not being set to some
 	 * value smaller than the currently allocated bandwidth in
 	 * any of the root_domains.
-	 *
-	 * FIXME: Cycling on all the CPUs is overdoing, but simpler than
-	 * cycling on root_domains... Discussion on different/better
-	 * solutions is welcome!
 	 */
 	for_each_possible_cpu(cpu) {
 		rcu_read_lock_sched();
+
+		if (dl_bw_visited(cpu, gen))
+			goto next;
+
 		dl_b = dl_bw_of(cpu);
 
 		raw_spin_lock_irqsave(&dl_b->lock, flags);
@@ -2541,6 +2561,7 @@ int sched_dl_global_validate(void)
 			ret = -EBUSY;
 		raw_spin_unlock_irqrestore(&dl_b->lock, flags);
 
+next:
 		rcu_read_unlock_sched();
 
 		if (ret)
@@ -2567,8 +2588,9 @@ void sched_dl_do_global(void)
 {
 	u64 new_bw = -1;
 	struct dl_bw *dl_b;
-	int cpu;
 	unsigned long flags;
+	int cpu;
+	u32 gen = ++dl_generation;
 
 	def_dl_bandwidth.dl_period = global_rt_period();
 	def_dl_bandwidth.dl_runtime = global_rt_runtime();
@@ -2576,11 +2598,14 @@ void sched_dl_do_global(void)
 	if (global_rt_runtime() != RUNTIME_INF)
 		new_bw = to_ratio(global_rt_period(), global_rt_runtime());
 
-	/*
-	 * FIXME: As above...
-	 */
 	for_each_possible_cpu(cpu) {
 		rcu_read_lock_sched();
+
+		if (dl_bw_visited(cpu, gen)) {
+			rcu_read_unlock_sched();
+			continue;
+		}
+
 		dl_b = dl_bw_of(cpu);
 
 		raw_spin_lock_irqsave(&dl_b->lock, flags);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 28709f6b0975..53477e8b26b0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -798,6 +798,13 @@ struct root_domain {
 	 */
 	cpumask_var_t		dlo_mask;
 	atomic_t		dlo_count;
+
+	/*
+	 * Indicate whether a root_domain's dl_bw has been checked or
+	 * updated. It's monotonously increasing, then wrap around.
+	 */
+	u32 visit_gen;
+
 	struct dl_bw		dl_bw;
 	struct cpudl		cpudl;
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index dd7770226086..90f3e5558fa2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -516,6 +516,7 @@ static int init_rootdomain(struct root_domain *rd)
 	init_irq_work(&rd->rto_push_work, rto_push_irq_work_func);
 #endif
 
+	rd->visit_gen = 0;
 	init_dl_bw(&rd->dl_bw);
 	if (cpudl_init(&rd->cpudl) != 0)
 		goto free_rto_mask;
-- 
2.20.1



* [PATCH v5 2/2] sched/deadline: Fix sched_dl_global_validate()
  2020-10-07 15:11 [PATCH v5 0/2] sched/deadline: Fix and optimize sched_dl_global_validate() Peng Liu
  2020-10-07 15:12 ` [PATCH v5 1/2] sched/deadline: " Peng Liu
@ 2020-10-07 15:13 ` Peng Liu
  1 sibling, 0 replies; 5+ messages in thread
From: Peng Liu @ 2020-10-07 15:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, valentin.schneider, raistlin,
	iwtbavbm

When changing sched_rt_{runtime,period}_us, we validate that the new
settings can at least accommodate the currently allocated -dl
bandwidth:

  sched_rt_handler()
    -->	sched_dl_bandwidth_validate()
	{
		new_bw = global_rt_runtime()/global_rt_period();

		for_each_possible_cpu(cpu) {
			dl_b = dl_bw_of(cpu);
			if (new_bw < dl_b->total_bw)    <-------
				ret = -EBUSY;
		}
	}

But under CONFIG_SMP, dl_bw is per root domain, not per CPU, so
dl_b->total_bw is the allocated bandwidth of the whole root domain.
Instead, we should compare dl_b->total_bw against "cpus * new_bw",
where 'cpus' is the number of CPUs in the root domain.
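
For example (illustrative numbers): with period = 1s and runtime =
950ms, new_bw corresponds to 95% of one CPU. In a 4-CPU root domain
where -dl tasks have reserved a total of 150% (1.5 CPUs) of bandwidth,
the old check "new_bw < dl_b->total_bw" compares 95% against 150% and
wrongly returns -EBUSY, although the root domain can accommodate up to
4 * 95% = 380%. The fixed check compares 380% against 150% and
correctly succeeds.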

Also, the annotation below (in kernel/sched/sched.h) describes an
implementation that only existed in SCHED_DEADLINE v2[1]; the deadline
scheduler kept evolving until it was merged (v9), but the annotation
was never updated and is now meaningless and misleading. Update it.

* With respect to SMP, the bandwidth is given on a per-CPU basis,
* meaning that:
*  - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
*  - dl_total_bw array contains, in the i-eth element, the currently
*    allocated bandwidth on the i-eth CPU.

[1]: https://lore.kernel.org/lkml/1267385230.13676.101.camel@Palantir/

Fixes: 332ac17ef5bf ("sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks")
Signed-off-by: Peng Liu <iwtbavbm@gmail.com>
---
 kernel/sched/deadline.c |  5 +++--
 kernel/sched/sched.h    | 42 ++++++++++++++++++-----------------------
 2 files changed, 21 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 5200e185923f..e32715db7119 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2540,7 +2540,7 @@ int sched_dl_global_validate(void)
 	u64 new_bw = to_ratio(period, runtime);
 	struct dl_bw *dl_b;
 	unsigned long flags;
-	int cpu, ret = 0;
+	int cpu, cpus, ret = 0;
 	u32 gen = ++dl_generation;
 
 	/*
@@ -2555,9 +2555,10 @@ int sched_dl_global_validate(void)
 			goto next;
 
 		dl_b = dl_bw_of(cpu);
+		cpus = dl_bw_cpus(cpu);
 
 		raw_spin_lock_irqsave(&dl_b->lock, flags);
-		if (new_bw < dl_b->total_bw)
+		if (new_bw * cpus < dl_b->total_bw)
 			ret = -EBUSY;
 		raw_spin_unlock_irqrestore(&dl_b->lock, flags);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 53477e8b26b0..b407e7af8e09 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -257,30 +257,6 @@ struct rt_bandwidth {
 
 void __dl_clear_params(struct task_struct *p);
 
-/*
- * To keep the bandwidth of -deadline tasks and groups under control
- * we need some place where:
- *  - store the maximum -deadline bandwidth of the system (the group);
- *  - cache the fraction of that bandwidth that is currently allocated.
- *
- * This is all done in the data structure below. It is similar to the
- * one used for RT-throttling (rt_bandwidth), with the main difference
- * that, since here we are only interested in admission control, we
- * do not decrease any runtime while the group "executes", neither we
- * need a timer to replenish it.
- *
- * With respect to SMP, the bandwidth is given on a per-CPU basis,
- * meaning that:
- *  - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
- *  - dl_total_bw array contains, in the i-eth element, the currently
- *    allocated bandwidth on the i-eth CPU.
- * Moreover, groups consume bandwidth on each CPU, while tasks only
- * consume bandwidth on the CPU they're running on.
- * Finally, dl_total_bw_cpu is used to cache the index of dl_total_bw
- * that will be shown the next time the proc or cgroup controls will
- * be red. It on its turn can be changed by writing on its own
- * control.
- */
 struct dl_bandwidth {
 	raw_spinlock_t		dl_runtime_lock;
 	u64			dl_runtime;
@@ -292,6 +268,24 @@ static inline int dl_bandwidth_enabled(void)
 	return sysctl_sched_rt_runtime >= 0;
 }
 
+/*
+ * To keep the bandwidth of -deadline tasks under control
+ * we need some place where:
+ *  - store the maximum -deadline bandwidth of each cpu;
+ *  - cache the fraction of bandwidth that is currently allocated in
+ *    each root domain;
+ *
+ * This is all done in the data structure below. It is similar to the
+ * one used for RT-throttling (rt_bandwidth), with the main difference
+ * that, since here we are only interested in admission control, we
+ * do not decrease any runtime while the group "executes", neither we
+ * need a timer to replenish it.
+ *
+ * With respect to SMP, bandwidth is given on a per root domain basis,
+ * meaning that:
+ *  - bw (< 100%) is the deadline bandwidth of each CPU;
+ *  - total_bw is the currently allocated bandwidth in each root domain;
+ */
 struct dl_bw {
 	raw_spinlock_t		lock;
 	u64			bw;
-- 
2.20.1



* Re: [PATCH v5 1/2] sched/deadline: optimize sched_dl_global_validate()
  2020-10-07 15:12 ` [PATCH v5 1/2] sched/deadline: " Peng Liu
@ 2020-10-07 16:55   ` Peter Zijlstra
  2020-10-08 13:58     ` Peng Liu
  0 siblings, 1 reply; 5+ messages in thread
From: Peter Zijlstra @ 2020-10-07 16:55 UTC (permalink / raw)
  To: Peng Liu
  Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot,
	valentin.schneider, raistlin

On Wed, Oct 07, 2020 at 11:12:29PM +0800, Peng Liu wrote:
> +/* Used for dl_bw check and update. */
> +static u32 dl_generation;

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 28709f6b0975..53477e8b26b0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -798,6 +798,13 @@ struct root_domain {
>  	 */
>  	cpumask_var_t		dlo_mask;
>  	atomic_t		dlo_count;
> +
> +	/*
> +	 * Indicate whether a root_domain's dl_bw has been checked or
> +	 * updated. It's monotonously increasing, then wrap around.
> +	 */
> +	u32 visit_gen;
> +
>  	struct dl_bw		dl_bw;
>  	struct cpudl		cpudl;
>  
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index dd7770226086..90f3e5558fa2 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -516,6 +516,7 @@ static int init_rootdomain(struct root_domain *rd)
>  	init_irq_work(&rd->rto_push_work, rto_push_irq_work_func);
>  #endif
>  
> +	rd->visit_gen = 0;
>  	init_dl_bw(&rd->dl_bw);
>  	if (cpudl_init(&rd->cpudl) != 0)
>  		goto free_rto_mask;

I'm fairly sure I made the generation a u64, the above is susceptible to
a false positive due to wrap-around.

Increase the generation to -1, create a new root domain, then the next
generation is 0 and we'll skip the new domain, even though it should be
updated.
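
Concretely (an illustrative sequence, assuming the u32 counter above):

	/* dl_generation has reached 0xffffffff after many updates */
	rd->visit_gen = 0;		/* new root domain, set in init_rootdomain() */
	gen = ++dl_generation;		/* wraps around to 0 */
	dl_bw_visited(cpu, gen);	/* rd->visit_gen == gen -> returns true, */
					/* so the new root domain is wrongly skipped */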


* Re: [PATCH v5 1/2] sched/deadline: optimize sched_dl_global_validate()
  2020-10-07 16:55   ` Peter Zijlstra
@ 2020-10-08 13:58     ` Peng Liu
  0 siblings, 0 replies; 5+ messages in thread
From: Peng Liu @ 2020-10-08 13:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, valentin.schneider, raistlin,
	linux-kernel

On Wed, Oct 07, 2020 at 06:55:33PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 07, 2020 at 11:12:29PM +0800, Peng Liu wrote:
> > +/* Used for dl_bw check and update. */
> > +static u32 dl_generation;
> 
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 28709f6b0975..53477e8b26b0 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -798,6 +798,13 @@ struct root_domain {
> >  	 */
> >  	cpumask_var_t		dlo_mask;
> >  	atomic_t		dlo_count;
> > +
> > +	/*
> > +	 * Indicate whether a root_domain's dl_bw has been checked or
> > +	 * updated. It's monotonously increasing, then wrap around.
> > +	 */
> > +	u32 visit_gen;
> > +
> >  	struct dl_bw		dl_bw;
> >  	struct cpudl		cpudl;
> >  
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index dd7770226086..90f3e5558fa2 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -516,6 +516,7 @@ static int init_rootdomain(struct root_domain *rd)
> >  	init_irq_work(&rd->rto_push_work, rto_push_irq_work_func);
> >  #endif
> >  
> > +	rd->visit_gen = 0;
> >  	init_dl_bw(&rd->dl_bw);
> >  	if (cpudl_init(&rd->cpudl) != 0)
> >  		goto free_rto_mask;
> 
> I'm fairly sure I made the generation a u64, the above is susceptible to
> a false positive due to wrap-around.
> 
> Increase the generation to -1, create a new root domain, then the next
> generation is 0 and we'll skip the new domain, even though it should be
> updated.

Ah... at first I also thought that u32 was "big enough": given that
no one would frequently change the settings, 'wrap-around' shouldn't
be a concern.

So... OK, I will revert it back to u64. What a big circle! :)

Thanks for your time!


