linux-kernel.vger.kernel.org archive mirror
* [PATCH v2] sched/cpuset: distribute tasks within affinity masks
@ 2020-03-11  1:01 Josh Don
  2020-03-11 14:05 ` Qais Yousef
  2020-03-20 12:58 ` [tip: sched/core] sched/core: Distribute " tip-bot2 for Paul Turner
  0 siblings, 2 replies; 8+ messages in thread
From: Josh Don @ 2020-03-11  1:01 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Li Zefan, Tejun Heo, Johannes Weiner
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	linux-kernel, cgroups, Paul Turner, Josh Don

From: Paul Turner <pjt@google.com>

Currently, when updating the affinity of tasks via either cpuset.cpus
or sched_setaffinity(), tasks not currently running within the newly
specified mask will be arbitrarily assigned to the first CPU within the
mask.

This (particularly in the case that we are restricting masks) can
result in many tasks being assigned to the first CPUs of their new
masks.

This:
 1) Can induce scheduling delays until the load-balancer has a chance to
    spread them across their new CPUs.
 2) Can aggravate poor load-balancer behavior, where the balancer has a
    difficult time recognizing that a cross-socket imbalance has been
    forced by an affinity mask.

This change adds a new cpumask interface to allow iterated calls to
distribute within the intersection of the provided masks.

The cases that this mainly affects are:
- modifying cpuset.cpus
- when tasks join a cpuset
- when modifying a task's affinity via sched_setaffinity(2)
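
As an illustration of the last case: with repeated sched_setaffinity(2)
calls like the sketch below (one per task of a group; the helper name and
CPU numbers are arbitrary examples, not part of this patch), tasks that
are not already running inside the new mask now land on rotated CPUs of
that mask instead of all being placed on its first CPU:

	#define _GNU_SOURCE
	#include <sched.h>
	#include <sys/types.h>

	/* restrict one task to CPUs 2-3; values are illustrative only */
	static int restrict_to_cpus_2_3(pid_t pid)
	{
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(2, &set);
		CPU_SET(3, &set);

		/*
		 * If 'pid' is not currently running on CPU 2 or 3, the kernel
		 * has to pick a destination CPU for it; with this change that
		 * pick rotates across {2, 3} instead of always being CPU 2.
		 */
		return sched_setaffinity(pid, sizeof(set), &set);
	}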

Co-developed-by: Josh Don <joshdon@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Paul Turner <pjt@google.com>
---
v2:
- Moved the "distribute" implementation to a new
  cpumask_any_and_distribute() function
- No longer move a task if it is already running on an allowed cpu

 include/linux/cpumask.h |  7 +++++++
 kernel/sched/core.c     |  7 ++++++-
 lib/cpumask.c           | 29 +++++++++++++++++++++++++++++
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index d5cc88514aee..f0d895d6ac39 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -194,6 +194,11 @@ static inline unsigned int cpumask_local_spread(unsigned int i, int node)
 	return 0;
 }
 
+static inline int cpumask_any_and_distribute(const struct cpumask *src1p,
+					     const struct cpumask *src2p) {
+	return cpumask_next_and(-1, src1p, src2p);
+}
+
 #define for_each_cpu(cpu, mask)			\
 	for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask)
 #define for_each_cpu_not(cpu, mask)		\
@@ -245,6 +250,8 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp)
 int cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
 int cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
 unsigned int cpumask_local_spread(unsigned int i, int node);
+int cpumask_any_and_distribute(const struct cpumask *src1p,
+			       const struct cpumask *src2p);
 
 /**
  * for_each_cpu - iterate over every cpu in a mask
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1a9983da4408..fc6f2bec7d44 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1652,7 +1652,12 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 	if (cpumask_equal(p->cpus_ptr, new_mask))
 		goto out;
 
-	dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
+	/*
+	 * Picking a ~random cpu helps in cases where we are changing affinity
+	 * for groups of tasks (ie. cpuset), so that load balancing is not
+	 * immediately required to distribute the tasks within their new mask.
+	 */
+	dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, new_mask);
 	if (dest_cpu >= nr_cpu_ids) {
 		ret = -EINVAL;
 		goto out;
diff --git a/lib/cpumask.c b/lib/cpumask.c
index 0cb672eb107c..fb22fb266f93 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -232,3 +232,32 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
 	BUG();
 }
 EXPORT_SYMBOL(cpumask_local_spread);
+
+static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);
+
+/**
+ * Returns an arbitrary cpu within srcp1 & srcp2.
+ *
+ * Iterated calls using the same srcp1 and srcp2 will be distributed within
+ * their intersection.
+ *
+ * Returns >= nr_cpu_ids if the intersection is empty.
+ */
+int cpumask_any_and_distribute(const struct cpumask *src1p,
+			       const struct cpumask *src2p)
+{
+	int next, prev;
+
+	/* NOTE: our first selection will skip 0. */
+	prev = __this_cpu_read(distribute_cpu_mask_prev);
+
+	next = cpumask_next_and(prev, src1p, src2p);
+	if (next >= nr_cpu_ids)
+		next = cpumask_first_and(src1p, src2p);
+
+	if (next < nr_cpu_ids)
+		__this_cpu_write(distribute_cpu_mask_prev, next);
+
+	return next;
+}
+EXPORT_SYMBOL(cpumask_any_and_distribute);
-- 
2.25.1.481.gfbce0eb801-goog



* Re: [PATCH v2] sched/cpuset: distribute tasks within affinity masks
  2020-03-11  1:01 [PATCH v2] sched/cpuset: distribute tasks within affinity masks Josh Don
@ 2020-03-11 14:05 ` Qais Yousef
  2020-03-17 19:24   ` Peter Zijlstra
  2020-03-20 12:58 ` [tip: sched/core] sched/core: Distribute " tip-bot2 for Paul Turner
  1 sibling, 1 reply; 8+ messages in thread
From: Qais Yousef @ 2020-03-11 14:05 UTC (permalink / raw)
  To: Josh Don
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Li Zefan, Tejun Heo, Johannes Weiner, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel, cgroups,
	Paul Turner

On 03/10/20 18:01, Josh Don wrote:
> From: Paul Turner <pjt@google.com>
> 
> Currently, when updating the affinity of tasks via either cpuset.cpus
> or sched_setaffinity(), tasks not currently running within the newly
> specified mask will be arbitrarily assigned to the first CPU within the
> mask.
> 
> This (particularly in the case that we are restricting masks) can
> result in many tasks being assigned to the first CPUs of their new
> masks.
> 
> This:
>  1) Can induce scheduling delays until the load-balancer has a chance to
>     spread them across their new CPUs.
>  2) Can aggravate poor load-balancer behavior, where the balancer has a
>     difficult time recognizing that a cross-socket imbalance has been
>     forced by an affinity mask.
> 
> This change adds a new cpumask interface to allow iterated calls to
> distribute within the intersection of the provided masks.
> 
> The cases that this mainly affects are:
> - modifying cpuset.cpus
> - when tasks join a cpuset
> - when modifying a task's affinity via sched_setaffinity(2)
> 
> Co-developed-by: Josh Don <joshdon@google.com>
> Signed-off-by: Josh Don <joshdon@google.com>
> Signed-off-by: Paul Turner <pjt@google.com>

This actually helps me fix a similar problem I faced in RT [1]. If multiple RT
tasks wake up at the same time we get a 'thundering herd' issue where they all
end up going to the same CPU, just to be pushed out again.

Besides, this will help fix another problem with RT task fitness, which is
a manifestation of the problem above. If two tasks wake up at the same time and
they happen to run on a little cpu (but request to run on a big one), one of
them will end up being migrated because find_lowest_rq() will return the first
cpu in the mask for both tasks.

I tested the API (not the change in sched/core.c) and it looks good to me.
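
The rotation I'm relying on can be illustrated with a small user-space
sketch (an analogue only, not the kernel code: the per-cpu rotor is
collapsed into a single static here, and NR_CPUS/mask values are made up
for illustration):

	#include <stdio.h>

	#define NR_CPUS 8

	static int prev;	/* stand-in for the per-cpu distribute_cpu_mask_prev */

	/* next set bit in (mask1 & mask2) strictly after 'prev', wrapping around */
	static int any_and_distribute(unsigned long mask1, unsigned long mask2)
	{
		unsigned long isect = mask1 & mask2;
		int cpu;

		if (!isect)
			return NR_CPUS;		/* empty intersection */

		for (cpu = prev + 1; cpu < NR_CPUS; cpu++)
			if (isect & (1UL << cpu))
				goto found;
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			if (isect & (1UL << cpu))
				break;
	found:
		prev = cpu;
		return cpu;
	}

	int main(void)
	{
		int i;

		/* affinity mask = CPUs 2-3, valid mask = all eight CPUs */
		for (i = 0; i < 4; i++)
			printf("pick %d -> cpu %d\n", i, any_and_distribute(0xff, 0x0c));
		return 0;
	}

Successive picks against the same pair of masks come back as 2, 3, 2, 3,
so simultaneous wakers stop piling onto the first allowed CPU.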

> ---
> v2:
> - Moved the "distribute" implementation to a new
> cpumask_any_and_distribute() function
> - No longer move a task if it is already running on an allowed cpu
> 
>  include/linux/cpumask.h |  7 +++++++
>  kernel/sched/core.c     |  7 ++++++-
>  lib/cpumask.c           | 29 +++++++++++++++++++++++++++++
>  3 files changed, 42 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index d5cc88514aee..f0d895d6ac39 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -194,6 +194,11 @@ static inline unsigned int cpumask_local_spread(unsigned int i, int node)
>  	return 0;
>  }
>  
> +static inline int cpumask_any_and_distribute(const struct cpumask *src1p,
> +					     const struct cpumask *src2p) {
> +	return cpumask_next_and(-1, src1p, src2p);
> +}

nit: cpumask_first_and() is better here?

It might be a good idea to split the API from the user too.

Anyway, for the API.

Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Tested-by: Qais Yousef <qais.yousef@arm.com>

Thanks

--
Qais Yousef

[1] https://lore.kernel.org/lkml/20200219140243.wfljmupcrwm2jelo@e107158-lin/

> +
>  #define for_each_cpu(cpu, mask)			\
>  	for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask)
>  #define for_each_cpu_not(cpu, mask)		\
> @@ -245,6 +250,8 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp)
>  int cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
>  int cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
>  unsigned int cpumask_local_spread(unsigned int i, int node);
> +int cpumask_any_and_distribute(const struct cpumask *src1p,
> +			       const struct cpumask *src2p);
>  
>  /**
>   * for_each_cpu - iterate over every cpu in a mask
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1a9983da4408..fc6f2bec7d44 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1652,7 +1652,12 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
>  	if (cpumask_equal(p->cpus_ptr, new_mask))
>  		goto out;
>  
> -	dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
> +	/*
> +	 * Picking a ~random cpu helps in cases where we are changing affinity
> +	 * for groups of tasks (ie. cpuset), so that load balancing is not
> +	 * immediately required to distribute the tasks within their new mask.
> +	 */
> +	dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, new_mask);
>  	if (dest_cpu >= nr_cpu_ids) {
>  		ret = -EINVAL;
>  		goto out;
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index 0cb672eb107c..fb22fb266f93 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -232,3 +232,32 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>  	BUG();
>  }
>  EXPORT_SYMBOL(cpumask_local_spread);
> +
> +static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);
> +
> +/**
> + * Returns an arbitrary cpu within srcp1 & srcp2.
> + *
> + * Iterated calls using the same srcp1 and srcp2 will be distributed within
> + * their intersection.
> + *
> + * Returns >= nr_cpu_ids if the intersection is empty.
> + */
> +int cpumask_any_and_distribute(const struct cpumask *src1p,
> +			       const struct cpumask *src2p)
> +{
> +	int next, prev;
> +
> +	/* NOTE: our first selection will skip 0. */
> +	prev = __this_cpu_read(distribute_cpu_mask_prev);
> +
> +	next = cpumask_next_and(prev, src1p, src2p);
> +	if (next >= nr_cpu_ids)
> +		next = cpumask_first_and(src1p, src2p);
> +
> +	if (next < nr_cpu_ids)
> +		__this_cpu_write(distribute_cpu_mask_prev, next);
> +
> +	return next;
> +}
> +EXPORT_SYMBOL(cpumask_any_and_distribute);
> -- 
> 2.25.1.481.gfbce0eb801-goog
> 


* Re: [PATCH v2] sched/cpuset: distribute tasks within affinity masks
  2020-03-11 14:05 ` Qais Yousef
@ 2020-03-17 19:24   ` Peter Zijlstra
  2020-03-17 21:35     ` Josh Don
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2020-03-17 19:24 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Josh Don, Ingo Molnar, Juri Lelli, Vincent Guittot, Li Zefan,
	Tejun Heo, Johannes Weiner, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, linux-kernel, cgroups, Paul Turner

On Wed, Mar 11, 2020 at 02:05:33PM +0000, Qais Yousef wrote:
> On 03/10/20 18:01, Josh Don wrote:
> > From: Paul Turner <pjt@google.com>
> > 
> > Currently, when updating the affinity of tasks via either cpuset.cpus
> > or sched_setaffinity(), tasks not currently running within the newly
> > specified mask will be arbitrarily assigned to the first CPU within the
> > mask.
> > 
> > This (particularly in the case that we are restricting masks) can
> > result in many tasks being assigned to the first CPUs of their new
> > masks.
> > 
> > This:
> >  1) Can induce scheduling delays until the load-balancer has a chance to
> >     spread them across their new CPUs.
> >  2) Can aggravate poor load-balancer behavior, where the balancer has a
> >     difficult time recognizing that a cross-socket imbalance has been
> >     forced by an affinity mask.
> > 
> > This change adds a new cpumask interface to allow iterated calls to
> > distribute within the intersection of the provided masks.
> > 
> > The cases that this mainly affects are:
> > - modifying cpuset.cpus
> > - when tasks join a cpuset
> > - when modifying a task's affinity via sched_setaffinity(2)
> > 
> > Co-developed-by: Josh Don <joshdon@google.com>
> > Signed-off-by: Josh Don <joshdon@google.com>
> > Signed-off-by: Paul Turner <pjt@google.com>

> Anyway, for the API.
> 
> Reviewed-by: Qais Yousef <qais.yousef@arm.com>
> Tested-by: Qais Yousef <qais.yousef@arm.com>

Thanks guys!


* Re: [PATCH v2] sched/cpuset: distribute tasks within affinity masks
  2020-03-17 19:24   ` Peter Zijlstra
@ 2020-03-17 21:35     ` Josh Don
  2020-03-18 11:34       ` Qais Yousef
  0 siblings, 1 reply; 8+ messages in thread
From: Josh Don @ 2020-03-17 21:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot, Li Zefan,
	Tejun Heo, Johannes Weiner, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, linux-kernel, cgroups, Paul Turner

On Wed, Mar 11, 2020 at 7:05 AM Qais Yousef <qais.yousef@arm.com> wrote:
>
> This actually helps me fix a similar problem I faced in RT [1]. If multiple RT
> tasks wake up at the same time we get a 'thundering herd' issue where they all
> end up going to the same CPU, just to be pushed out again.
>
> Besides, this will help fix another problem with RT task fitness, which is
> a manifestation of the problem above. If two tasks wake up at the same time and
> they happen to run on a little cpu (but request to run on a big one), one of
> them will end up being migrated because find_lowest_rq() will return the first
> cpu in the mask for both tasks.
>
> I tested the API (not the change in sched/core.c) and it looks good to me.

Nice, glad that the API already has another use case. Thanks for taking a look.

> nit: cpumask_first_and() is better here?

Yea, I would also prefer to use it, but the definition of
cpumask_first_and() follows this section, as it itself uses
cpumask_next_and().
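
Roughly, the ordering constraint in the header looks like this (a
simplified sketch, not the exact header text):

	/* UP (NR_CPUS == 1) stub section, where the new helper has to live */
	static inline int cpumask_any_and_distribute(const struct cpumask *src1p,
						     const struct cpumask *src2p) {
		/* cpumask_first_and() has not been defined yet at this point... */
		return cpumask_next_and(-1, src1p, src2p);
	}

	/*
	 * ...it only shows up further down in the header, implemented in
	 * terms of cpumask_next_and(), so the open-coded form above avoids
	 * having to reorder the header just for this stub.
	 */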

> It might be a good idea to split the API from the user too.

Not sure what you mean by this, could you clarify?

On Tue, Mar 17, 2020 at 12:24 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > Anyway, for the API.
> >
> > Reviewed-by: Qais Yousef <qais.yousef@arm.com>
> > Tested-by: Qais Yousef <qais.yousef@arm.com>
>
> Thanks guys!

Thanks Peter, any other comments or are you happy with merging this patch as-is?


* Re: [PATCH v2] sched/cpuset: distribute tasks within affinity masks
  2020-03-17 21:35     ` Josh Don
@ 2020-03-18 11:34       ` Qais Yousef
  2020-03-19 22:45         ` Josh Don
  0 siblings, 1 reply; 8+ messages in thread
From: Qais Yousef @ 2020-03-18 11:34 UTC (permalink / raw)
  To: Josh Don
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Li Zefan, Tejun Heo, Johannes Weiner, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel, cgroups,
	Paul Turner

On 03/17/20 14:35, Josh Don wrote:
> On Wed, Mar 11, 2020 at 7:05 AM Qais Yousef <qais.yousef@arm.com> wrote:
> >
> > This actually helps me fix a similar problem I faced in RT [1]. If multiple RT
> > tasks wake up at the same time we get a 'thundering herd' issue where they all
> > end up going to the same CPU, just to be pushed out again.
> >
> > Besides, this will help fix another problem with RT task fitness, which is
> > a manifestation of the problem above. If two tasks wake up at the same time and
> > they happen to run on a little cpu (but request to run on a big one), one of
> > them will end up being migrated because find_lowest_rq() will return the first
> > cpu in the mask for both tasks.
> >
> > I tested the API (not the change in sched/core.c) and it looks good to me.
> 
> Nice, glad that the API already has another use case. Thanks for taking a look.
> 
> > nit: cpumask_first_and() is better here?
> 
> Yea, I would also prefer to use it, but the definition of
> cpumask_first_and() follows this section, as it itself uses
> cpumask_next_and().
> 
> > It might be a good idea to split the API from the user too.
> 
> Not sure what you mean by this, could you clarify?

I meant it'd be a good idea to split the cpumask API into its own patch and
have a separate patch for the user in sched/core.c. But that was a small nit.
If the user (in sched/core.c) somehow introduces a regression, reverting it
separately should be trivial.

Thanks

--
Qais Yousef

> 
> On Tue, Mar 17, 2020 at 12:24 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > Anyway, for the API.
> > >
> > > Reviewed-by: Qais Yousef <qais.yousef@arm.com>
> > > Tested-by: Qais Yousef <qais.yousef@arm.com>
> >
> > Thanks guys!
> 
> Thanks Peter, any other comments or are you happy with merging this patch as-is?


* Re: [PATCH v2] sched/cpuset: distribute tasks within affinity masks
  2020-03-18 11:34       ` Qais Yousef
@ 2020-03-19 22:45         ` Josh Don
  2020-03-20 11:28           ` Qais Yousef
  0 siblings, 1 reply; 8+ messages in thread
From: Josh Don @ 2020-03-19 22:45 UTC (permalink / raw)
  To: Qais Yousef, Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Li Zefan, Tejun Heo,
	Johannes Weiner, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, linux-kernel, cgroups, Paul Turner

On Wed, Mar 18, 2020 at 4:35 AM Qais Yousef <qais.yousef@arm.com> wrote:
>
> On 03/17/20 14:35, Josh Don wrote:
> > On Wed, Mar 11, 2020 at 7:05 AM Qais Yousef <qais.yousef@arm.com> wrote:
> > >
> > > It might be a good idea to split the API from the user too.
> >
> > Not sure what you mean by this, could you clarify?
>
> I meant it'd be a good idea to split the cpumask API into its own patch and
> have a separate patch for the user in sched/core.c. But that was a small nit.
> If the user (in sched/core.c) somehow introduces a regression, reverting it
> separately should be trivial.
>
> Thanks

Ah, yes I agree that sounds like a good idea, I can do that.

Peter, any other nit before I re-send?


* Re: [PATCH v2] sched/cpuset: distribute tasks within affinity masks
  2020-03-19 22:45         ` Josh Don
@ 2020-03-20 11:28           ` Qais Yousef
  0 siblings, 0 replies; 8+ messages in thread
From: Qais Yousef @ 2020-03-20 11:28 UTC (permalink / raw)
  To: Josh Don
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Li Zefan, Tejun Heo, Johannes Weiner, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, linux-kernel, cgroups,
	Paul Turner

On 03/19/20 15:45, Josh Don wrote:
> On Wed, Mar 18, 2020 at 4:35 AM Qais Yousef <qais.yousef@arm.com> wrote:
> >
> > On 03/17/20 14:35, Josh Don wrote:
> > > On Wed, Mar 11, 2020 at 7:05 AM Qais Yousef <qais.yousef@arm.com> wrote:
> > > >
> > > > It might be a good idea to split the API from the user too.
> > >
> > > Not sure what you mean by this, could you clarify?
> >
> > I meant it'd be a good idea to split the cpumask API into its own patch and
> > have a separate patch for the user in sched/core.c. But that was a small nit.
> > If the user (in sched/core.c) somehow introduces a regression, reverting it
> > separately should be trivial.
> >
> > Thanks
> 
> Ah, yes I agree that sounds like a good idea, I can do that.
> 
> Peter, any other nit before I re-send?

AFAICT it was already picked up, so no need to resend to fix the nits from my
side.

--
Qais Yousef


* [tip: sched/core] sched/core: Distribute tasks within affinity masks
  2020-03-11  1:01 [PATCH v2] sched/cpuset: distribute tasks within affinity masks Josh Don
  2020-03-11 14:05 ` Qais Yousef
@ 2020-03-20 12:58 ` tip-bot2 for Paul Turner
  1 sibling, 0 replies; 8+ messages in thread
From: tip-bot2 for Paul Turner @ 2020-03-20 12:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Paul Turner, Josh Don, Peter Zijlstra (Intel), Qais Yousef, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76
Gitweb:        https://git.kernel.org/tip/46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76
Author:        Paul Turner <pjt@google.com>
AuthorDate:    Tue, 10 Mar 2020 18:01:13 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 20 Mar 2020 13:06:18 +01:00

sched/core: Distribute tasks within affinity masks

Currently, when updating the affinity of tasks via either cpuset.cpus
or sched_setaffinity(), tasks not currently running within the newly
specified mask will be arbitrarily assigned to the first CPU within the
mask.

This (particularly in the case that we are restricting masks) can
result in many tasks being assigned to the first CPUs of their new
masks.

This:
 1) Can induce scheduling delays until the load-balancer has a chance to
    spread them across their new CPUs.
 2) Can aggravate poor load-balancer behavior, where the balancer has a
    difficult time recognizing that a cross-socket imbalance has been
    forced by an affinity mask.

This change adds a new cpumask interface to allow iterated calls to
distribute within the intersection of the provided masks.

The cases that this mainly affects are:
 - modifying cpuset.cpus
 - when tasks join a cpuset
 - when modifying a task's affinity via sched_setaffinity(2)

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Tested-by: Qais Yousef <qais.yousef@arm.com>
Link: https://lkml.kernel.org/r/20200311010113.136465-1-joshdon@google.com
---
 include/linux/cpumask.h |  7 +++++++
 kernel/sched/core.c     |  7 ++++++-
 lib/cpumask.c           | 29 +++++++++++++++++++++++++++++
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index d5cc885..f0d895d 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -194,6 +194,11 @@ static inline unsigned int cpumask_local_spread(unsigned int i, int node)
 	return 0;
 }
 
+static inline int cpumask_any_and_distribute(const struct cpumask *src1p,
+					     const struct cpumask *src2p) {
+	return cpumask_next_and(-1, src1p, src2p);
+}
+
 #define for_each_cpu(cpu, mask)			\
 	for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask)
 #define for_each_cpu_not(cpu, mask)		\
@@ -245,6 +250,8 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp)
 int cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
 int cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
 unsigned int cpumask_local_spread(unsigned int i, int node);
+int cpumask_any_and_distribute(const struct cpumask *src1p,
+			       const struct cpumask *src2p);
 
 /**
  * for_each_cpu - iterate over every cpu in a mask
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 978bf6f..014d4f7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1650,7 +1650,12 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 	if (cpumask_equal(p->cpus_ptr, new_mask))
 		goto out;
 
-	dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
+	/*
+	 * Picking a ~random cpu helps in cases where we are changing affinity
+	 * for groups of tasks (ie. cpuset), so that load balancing is not
+	 * immediately required to distribute the tasks within their new mask.
+	 */
+	dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, new_mask);
 	if (dest_cpu >= nr_cpu_ids) {
 		ret = -EINVAL;
 		goto out;
diff --git a/lib/cpumask.c b/lib/cpumask.c
index 0cb672e..fb22fb2 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -232,3 +232,32 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
 	BUG();
 }
 EXPORT_SYMBOL(cpumask_local_spread);
+
+static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);
+
+/**
+ * Returns an arbitrary cpu within srcp1 & srcp2.
+ *
+ * Iterated calls using the same srcp1 and srcp2 will be distributed within
+ * their intersection.
+ *
+ * Returns >= nr_cpu_ids if the intersection is empty.
+ */
+int cpumask_any_and_distribute(const struct cpumask *src1p,
+			       const struct cpumask *src2p)
+{
+	int next, prev;
+
+	/* NOTE: our first selection will skip 0. */
+	prev = __this_cpu_read(distribute_cpu_mask_prev);
+
+	next = cpumask_next_and(prev, src1p, src2p);
+	if (next >= nr_cpu_ids)
+		next = cpumask_first_and(src1p, src2p);
+
+	if (next < nr_cpu_ids)
+		__this_cpu_write(distribute_cpu_mask_prev, next);
+
+	return next;
+}
+EXPORT_SYMBOL(cpumask_any_and_distribute);


Thread overview: 8+ messages
2020-03-11  1:01 [PATCH v2] sched/cpuset: distribute tasks within affinity masks Josh Don
2020-03-11 14:05 ` Qais Yousef
2020-03-17 19:24   ` Peter Zijlstra
2020-03-17 21:35     ` Josh Don
2020-03-18 11:34       ` Qais Yousef
2020-03-19 22:45         ` Josh Don
2020-03-20 11:28           ` Qais Yousef
2020-03-20 12:58 ` [tip: sched/core] sched/core: Distribute " tip-bot2 for Paul Turner
