linux-kernel.vger.kernel.org archive mirror
* [PATCH] sched/numa: use runnable_avg to classify node
@ 2020-08-25 12:18 Vincent Guittot
  2020-08-25 13:58 ` Mel Gorman
  2020-08-27 15:35 ` Mel Gorman
  0 siblings, 2 replies; 7+ messages in thread
From: Vincent Guittot @ 2020-08-25 12:18 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, linux-kernel
  Cc: Vincent Guittot

Use runnable_avg to classify numa node state similarly to what is done for
the normal load balancer. This helps ensure that the numa and normal
balancers use the same view of the state of the system.

- large arm64 system: 2 nodes / 224 CPUs
hackbench -l (256000/#grp) -g #grp

grp    tip/sched/core         +patchset              improvement
1      14.008(+/- 4.99 %)     13.800(+/- 3.88 %)     1.48 %
4       4.340(+/- 5.35 %)      4.283(+/- 4.85 %)     1.33 %
16      3.357(+/- 0.55 %)      3.359(+/- 0.54 %)    -0.06 %
32      3.050(+/- 0.94 %)      3.039(+/- 1.06 %)     0.38 %
64      2.968(+/- 1.85 %)      3.006(+/- 2.92 %)    -1.27 %
128     3.290(+/-12.61 %)      3.108(+/- 5.97 %)     5.51 %
256     3.235(+/- 3.95 %)      3.188(+/- 2.83 %)     1.45 %

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2ba8f230feb9..1b927b599919 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1504,6 +1504,7 @@ enum numa_type {
 /* Cached statistics for all CPUs within a node */
 struct numa_stats {
 	unsigned long load;
+	unsigned long runnable;
 	unsigned long util;
 	/* Total compute capacity of CPUs on a node */
 	unsigned long compute_capacity;
@@ -1547,6 +1548,7 @@ struct task_numa_env {
 };
 
 static unsigned long cpu_load(struct rq *rq);
+static unsigned long cpu_runnable(struct rq *rq);
 static unsigned long cpu_util(int cpu);
 static inline long adjust_numa_imbalance(int imbalance, int src_nr_running);
 
@@ -1555,11 +1557,13 @@ numa_type numa_classify(unsigned int imbalance_pct,
 			 struct numa_stats *ns)
 {
 	if ((ns->nr_running > ns->weight) &&
-	    ((ns->compute_capacity * 100) < (ns->util * imbalance_pct)))
+	    (((ns->compute_capacity * 100) < (ns->util * imbalance_pct)) ||
+	     ((ns->compute_capacity * imbalance_pct) < (ns->runnable * 100))))
 		return node_overloaded;
 
 	if ((ns->nr_running < ns->weight) ||
-	    ((ns->compute_capacity * 100) > (ns->util * imbalance_pct)))
+	    (((ns->compute_capacity * 100) > (ns->util * imbalance_pct)) &&
+	     ((ns->compute_capacity * imbalance_pct) > (ns->runnable * 100))))
 		return node_has_spare;
 
 	return node_fully_busy;
@@ -1610,6 +1614,7 @@ static void update_numa_stats(struct task_numa_env *env,
 		struct rq *rq = cpu_rq(cpu);
 
 		ns->load += cpu_load(rq);
+		ns->runnable += cpu_runnable(rq);
 		ns->util += cpu_util(cpu);
 		ns->nr_running += rq->cfs.h_nr_running;
 		ns->compute_capacity += capacity_of(cpu);
-- 
2.17.1



* Re: [PATCH] sched/numa: use runnable_avg to classify node
  2020-08-25 12:18 [PATCH] sched/numa: use runnable_avg to classify node Vincent Guittot
@ 2020-08-25 13:58 ` Mel Gorman
  2020-08-25 15:52   ` Vincent Guittot
  2020-08-27 15:35 ` Mel Gorman
  1 sibling, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2020-08-25 13:58 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	linux-kernel

On Tue, Aug 25, 2020 at 02:18:18PM +0200, Vincent Guittot wrote:
> Use runnable_avg to classify numa node state similarly to what is done for
> normal load balancer. This helps to ensure that numa and normal balancers
> use the same view of the state of the system.
> 
> - large arm64 system: 2 nodes / 224 CPUs
> hackbench -l (256000/#grp) -g #grp
> 
> grp    tip/sched/core         +patchset              improvement
> 1      14.008(+/- 4.99 %)     13.800(+/- 3.88 %)     1.48 %
> 4       4.340(+/- 5.35 %)      4.283(+/- 4.85 %)     1.33 %
> 16      3.357(+/- 0.55 %)      3.359(+/- 0.54 %)    -0.06 %
> 32      3.050(+/- 0.94 %)      3.039(+/- 1.06 %)     0.38 %
> 64      2.968(+/- 1.85 %)      3.006(+/- 2.92 %)    -1.27 %
> 128     3.290(+/-12.61 %)      3.108(+/- 5.97 %)     5.51 %
> 256     3.235(+/- 3.95 %)      3.188(+/- 2.83 %)     1.45 %
> 

Intuitively the patch makes sense but I'm not a fan of using hackbench
for evaluating NUMA balancing. The tasks are too short-lived and it's
not sensitive enough to data placement because of the small footprint
and because hackbench tends to saturate a machine.

As predicting NUMA balancing behaviour in your head can be difficult, I've
queued up a battery of tests on a few different NUMA machines and will see
what falls out. It'll take a few days as some of the tests are long-lived.

Baseline will be 5.9-rc2 as I haven't looked at the topology rework in
tip/sched/core and this patch should not be related to it.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched/numa: use runnable_avg to classify node
  2020-08-25 13:58 ` Mel Gorman
@ 2020-08-25 15:52   ` Vincent Guittot
  0 siblings, 0 replies; 7+ messages in thread
From: Vincent Guittot @ 2020-08-25 15:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, linux-kernel

On Tue, 25 Aug 2020 at 15:58, Mel Gorman <mgorman@suse.de> wrote:
>
> On Tue, Aug 25, 2020 at 02:18:18PM +0200, Vincent Guittot wrote:
> > Use runnable_avg to classify numa node state similarly to what is done for
> > normal load balancer. This helps to ensure that numa and normal balancers
> > use the same view of the state of the system.
> >
> > - large arm64 system: 2 nodes / 224 CPUs
> > hackbench -l (256000/#grp) -g #grp
> >
> > grp    tip/sched/core         +patchset              improvement
> > 1      14.008(+/- 4.99 %)     13.800(+/- 3.88 %)     1.48 %
> > 4       4.340(+/- 5.35 %)      4.283(+/- 4.85 %)     1.33 %
> > 16      3.357(+/- 0.55 %)      3.359(+/- 0.54 %)    -0.06 %
> > 32      3.050(+/- 0.94 %)      3.039(+/- 1.06 %)     0.38 %
> > 64      2.968(+/- 1.85 %)      3.006(+/- 2.92 %)    -1.27 %
> > 128     3.290(+/-12.61 %)      3.108(+/- 5.97 %)     5.51 %
> > 256     3.235(+/- 3.95 %)      3.188(+/- 2.83 %)     1.45 %
> >
>
> Intuitively the patch makes sense but I'm not a fan of using hackbench
> for evaluating NUMA balancing. The tasks are too short-lived and it's
> not sensitive enough to data placement because of the small footprint
> and because hackbench tends to saturate a machine.
>
> As predicting NUMA balancing behaviour in your head can be difficult, I've
> queued up a battery of tests on a few different NUMA machines and will see
> what falls out. It'll take a few days as some of the tests are long-lived.

Thanks for testing, Mel.

>
> Baseline will be 5.9-rc2 as I haven't looked at the topology rework in
> tip/sched/core and this patch should not be related to it.

Looks fine to me.

>
> --
> Mel Gorman
> SUSE Labs


* Re: [PATCH] sched/numa: use runnable_avg to classify node
  2020-08-25 12:18 [PATCH] sched/numa: use runnable_avg to classify node Vincent Guittot
  2020-08-25 13:58 ` Mel Gorman
@ 2020-08-27 15:35 ` Mel Gorman
  2020-08-27 15:43   ` Vincent Guittot
  1 sibling, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2020-08-27 15:35 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	linux-kernel

On Tue, Aug 25, 2020 at 02:18:18PM +0200, Vincent Guittot wrote:
> Use runnable_avg to classify numa node state similarly to what is done for
> normal load balancer. This helps to ensure that numa and normal balancers
> use the same view of the state of the system.
> 
> - large arm64 system: 2 nodes / 224 CPUs
> hackbench -l (256000/#grp) -g #grp
> 
> grp    tip/sched/core         +patchset              improvement
> 1      14.008(+/- 4.99 %)     13.800(+/- 3.88 %)     1.48 %
> 4       4.340(+/- 5.35 %)      4.283(+/- 4.85 %)     1.33 %
> 16      3.357(+/- 0.55 %)      3.359(+/- 0.54 %)    -0.06 %
> 32      3.050(+/- 0.94 %)      3.039(+/- 1.06 %)     0.38 %
> 64      2.968(+/- 1.85 %)      3.006(+/- 2.92 %)    -1.27 %
> 128     3.290(+/-12.61 %)      3.108(+/- 5.97 %)     5.51 %
> 256     3.235(+/- 3.95 %)      3.188(+/- 2.83 %)     1.45 %
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

The testing was a mixed bag of wins and losses but wins more than it
loses. Biggest loss was a 9.04% regression on nas-SP using openmp for
parallelisation on Zen1. Biggest win was around 8% gain running
specjbb2005 on Zen2 (with some major gains of up to 55% for some thread
counts). Most workloads were stable across multiple Intel and AMD
machines.

There were some oddities in changes in NUMA scanning rate but that is
likely a side-effect because the locality over time for the same loads
did not look obviously worse. There was no negative result I could point
at that was not offset by a positive result elsewhere. Given it's not
a universal win or loss, matching numa and lb balancing as closely as
possible is best, so

Reviewed-by: Mel Gorman <mgorman@suse.de>

Thanks.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched/numa: use runnable_avg to classify node
  2020-08-27 15:35 ` Mel Gorman
@ 2020-08-27 15:43   ` Vincent Guittot
  2020-08-27 18:22     ` Mel Gorman
  0 siblings, 1 reply; 7+ messages in thread
From: Vincent Guittot @ 2020-08-27 15:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, linux-kernel

On Thu, 27 Aug 2020 at 17:35, Mel Gorman <mgorman@suse.de> wrote:
>
> On Tue, Aug 25, 2020 at 02:18:18PM +0200, Vincent Guittot wrote:
> > Use runnable_avg to classify numa node state similarly to what is done for
> > normal load balancer. This helps to ensure that numa and normal balancers
> > use the same view of the state of the system.
> >
> > - large arm64 system: 2 nodes / 224 CPUs
> > hackbench -l (256000/#grp) -g #grp
> >
> > grp    tip/sched/core         +patchset              improvement
> > 1      14.008(+/- 4.99 %)     13.800(+/- 3.88 %)     1.48 %
> > 4       4.340(+/- 5.35 %)      4.283(+/- 4.85 %)     1.33 %
> > 16      3.357(+/- 0.55 %)      3.359(+/- 0.54 %)    -0.06 %
> > 32      3.050(+/- 0.94 %)      3.039(+/- 1.06 %)     0.38 %
> > 64      2.968(+/- 1.85 %)      3.006(+/- 2.92 %)    -1.27 %
> > 128     3.290(+/-12.61 %)      3.108(+/- 5.97 %)     5.51 %
> > 256     3.235(+/- 3.95 %)      3.188(+/- 2.83 %)     1.45 %
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> The testing was a mixed bag of wins and losses but wins more than it
> loses. Biggest loss was a 9.04% regression on nas-SP using openmp for
> parallelisation on Zen1. Biggest win was around 8% gain running
> specjbb2005 on Zen2 (with some major gains of up to 55% for some thread
> counts). Most workloads were stable across multiple Intel and AMD
> machines.
>
> There were some oddities in changes in NUMA scanning rate but that is
> likely a side-effect because the locality over time for the same loads
> did not look obviously worse. There was no negative result I could point
> at that was not offset by a positive result elsewhere. Given it's not
> a universal win or loss, matching numa and lb balancing as closely as
> possible is best, so
>
> Reviewed-by: Mel Gorman <mgorman@suse.de>

Thanks.

I will try to reproduce the nas-SP test on my setup to see what is going on.

Vincent

>
> Thanks.
>
> --
> Mel Gorman
> SUSE Labs


* Re: [PATCH] sched/numa: use runnable_avg to classify node
  2020-08-27 15:43   ` Vincent Guittot
@ 2020-08-27 18:22     ` Mel Gorman
  2020-08-28  6:47       ` Vincent Guittot
  0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2020-08-27 18:22 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, linux-kernel

On Thu, Aug 27, 2020 at 05:43:11PM +0200, Vincent Guittot wrote:
> > The testing was a mixed bag of wins and losses but wins more than it
> > loses. Biggest loss was a 9.04% regression on nas-SP using openmp for
> > parallelisation on Zen1. Biggest win was around 8% gain running
> > specjbb2005 on Zen2 (with some major gains of up to 55% for some thread
> > counts). Most workloads were stable across multiple Intel and AMD
> > machines.
> >
> > There were some oddities in changes in NUMA scanning rate but that is
> > likely a side-effect because the locality over time for the same loads
> > did not look obviously worse. There was no negative result I could point
> > at that was not offset by a positive result elsewhere. Given it's not
> > a universal win or loss, matching numa and lb balancing as closely as
> > possible is best, so
> >
> > Reviewed-by: Mel Gorman <mgorman@suse.de>
> 
> Thanks.
> 
> > I will try to reproduce the nas-SP test on my setup to see what is going on.
> 

You can try but you might be chasing ghosts. Please note that this nas-SP
observation was only on zen1 and only for C-class and OMP. The other
machines tested for the same class and OMP were fine (including zen2). Even
D-class on the same machine with OMP was fine as was MPI in both cases. The
bad result indicated that NUMA scanning and faulting was higher but that
is more likely to be a problem with NUMA balancing than your patch.

In the five iterations, two iterations showed a large spike in scan rate
towards the end of an iteration but not the other three. The scan rate
was also not consistently high so there is a degree of luck involved with
SP specifically and there is not a consistent penalty as a result of
your patch.

The only thing to be aware of is that this patch might show up in
bisections once it's merged for both performance gains and losses.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched/numa: use runnable_avg to classify node
  2020-08-27 18:22     ` Mel Gorman
@ 2020-08-28  6:47       ` Vincent Guittot
  0 siblings, 0 replies; 7+ messages in thread
From: Vincent Guittot @ 2020-08-28  6:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, linux-kernel

On Thu, 27 Aug 2020 at 20:22, Mel Gorman <mgorman@suse.de> wrote:
>
> On Thu, Aug 27, 2020 at 05:43:11PM +0200, Vincent Guittot wrote:
> > > The testing was a mixed bag of wins and losses but wins more than it
> > > loses. Biggest loss was a 9.04% regression on nas-SP using openmp for
> > > parallelisation on Zen1. Biggest win was around 8% gain running
> > > specjbb2005 on Zen2 (with some major gains of up to 55% for some thread
> > > counts). Most workloads were stable across multiple Intel and AMD
> > > machines.
> > >
> > > There were some oddities in changes in NUMA scanning rate but that is
> > > likely a side-effect because the locality over time for the same loads
> > > did not look obviously worse. There was no negative result I could point
> > > at that was not offset by a positive result elsewhere. Given it's not
> > > a universal win or loss, matching numa and lb balancing as closely as
> > > possible is best, so
> > >
> > > Reviewed-by: Mel Gorman <mgorman@suse.de>
> >
> > Thanks.
> >
> > I will try to reproduce the nas-SP test on my setup to see what is going on.
> >
>
> You can try but you might be chasing ghosts. Please note that this nas-SP
> observation was only on zen1 and only for C-class and OMP. The other
> machines tested for the same class and OMP were fine (including zen2). Even
> D-class on the same machine with OMP was fine as was MPI in both cases. The
> bad result indicated that NUMA scanning and faulting was higher but that
> is more likely to be a problem with NUMA balancing than your patch.
>
> In the five iterations, two iterations showed a large spike in scan rate
> towards the end of an iteration but not the other three. The scan rate
> was also not consistently high so there is a degree of luck involved with
> SP specifically and there is not a consistent penalty as a result of
> your patch.
>
> The only thing to be aware of is that this patch might show up in
> bisections once it's merged for both performance gains and losses.

Thanks for the detailed explanation. I will save my time and continue
on the fairness problem in this case.

Vincent

>
> --
> Mel Gorman
> SUSE Labs

