* [PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer
@ 2020-09-07  7:27 Barry Song
  2020-09-07  9:27 ` Mel Gorman
  0 siblings, 1 reply; 8+ messages in thread
From: Barry Song @ 2020-09-07  7:27 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	bsegall, mgorman, linux-kernel
  Cc: linuxarm, Barry Song, Mel Gorman, Peter Zijlstra,
	Valentin Schneider, Phil Auld, Hillf Danton, Ingo Molnar

Something is wrong. In find_busiest_group(), we check whether src has
the higher load; however, in task_numa_find_cpu(), we check whether dst
will have the higher load after balancing. It does not seem sensible to
check src here.
This can produce a wrong imbalance value. For example, if
dst_running = env->dst_stats.nr_running + 1 results in 3 or above, and
src_running = env->src_stats.nr_running - 1 results in 1,
the current code still treats the imbalance as 0 because src_running is
smaller than 2.
This is inconsistent with the load balancer.
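
As a rough illustration (the nr_running values below are hypothetical,
only meant to show the two calls side by side):

	/* hypothetical: src_stats.nr_running == 2, dst_stats.nr_running == 2 */
	src_running = env->src_stats.nr_running - 1;	/* 2 - 1 = 1 */
	dst_running = env->dst_stats.nr_running + 1;	/* 2 + 1 = 3 */
	imbalance = max(0, dst_running - src_running);	/* 2 */

	/* current code passes src_running (1): adjust_numa_imbalance(2, 1) -> 0 */
	/* this patch passes dst_running (3):   adjust_numa_imbalance(2, 3) -> 2 */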

Fixes: fb86f5b211 ("sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity")
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 kernel/sched/fair.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1a68a05..90cfee7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1548,7 +1548,7 @@ struct task_numa_env {
 
 static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_util(int cpu);
-static inline long adjust_numa_imbalance(int imbalance, int src_nr_running);
+static inline long adjust_numa_imbalance(int imbalance, int nr_running);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -1925,7 +1925,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		src_running = env->src_stats.nr_running - 1;
 		dst_running = env->dst_stats.nr_running + 1;
 		imbalance = max(0, dst_running - src_running);
-		imbalance = adjust_numa_imbalance(imbalance, src_running);
+		imbalance = adjust_numa_imbalance(imbalance, dst_running);
 
 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {
@@ -8957,7 +8957,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}
 }
 
-static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
+static inline long adjust_numa_imbalance(int imbalance, int nr_running)
 {
 	unsigned int imbalance_min;
 
@@ -8966,7 +8966,7 @@ static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
 	 * tasks that remain local when the source domain is almost idle.
 	 */
 	imbalance_min = 2;
-	if (src_nr_running <= imbalance_min)
+	if (nr_running <= imbalance_min)
 		return 0;
 
 	return imbalance;
-- 
2.7.4




* Re: [PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer
  2020-09-07  7:27 [PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer Barry Song
@ 2020-09-07  9:27 ` Mel Gorman
  2020-09-07  9:44   ` Song Bao Hua (Barry Song)
  0 siblings, 1 reply; 8+ messages in thread
From: Mel Gorman @ 2020-09-07  9:27 UTC (permalink / raw)
  To: Barry Song
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	bsegall, linux-kernel, linuxarm, Mel Gorman, Peter Zijlstra,
	Valentin Schneider, Phil Auld, Hillf Danton, Ingo Molnar

On Mon, Sep 07, 2020 at 07:27:08PM +1200, Barry Song wrote:
> Something is wrong. In find_busiest_group(), we are checking if src has
> higher load, however, in task_numa_find_cpu(), we are checking if dst
> will have higher load after balancing. It seems it is not sensible to
> check src.
> It maybe cause wrong imbalance value, for example, if
> dst_running = env->dst_stats.nr_running + 1 results in 3 or above, and
> src_running = env->src_stats.nr_running - 1 results in 1;
> The current code is thinking imbalance as 0 since src_running is smaller
> than 2.
> This is inconsistent with load balancer.
> 

It checks the conditions as they would be if the move were to happen. Have you evaluated
this for a NUMA balancing load and confirmed it a) balances properly and
b) does not increase the scan rate trying to "fix" the problem?

-- 
Mel Gorman
SUSE Labs


* RE: [PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer
  2020-09-07  9:27 ` Mel Gorman
@ 2020-09-07  9:44   ` Song Bao Hua (Barry Song)
  2020-09-07 10:44     ` Mel Gorman
  2020-09-07 12:31     ` Srikar Dronamraju
  0 siblings, 2 replies; 8+ messages in thread
From: Song Bao Hua (Barry Song) @ 2020-09-07  9:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	bsegall, linux-kernel, Linuxarm, Mel Gorman, Peter Zijlstra,
	Valentin Schneider, Phil Auld, Hillf Danton, Ingo Molnar



> -----Original Message-----
> From: Mel Gorman [mailto:mgorman@suse.de]
> Sent: Monday, September 7, 2020 9:27 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Cc: mingo@redhat.com; peterz@infradead.org; juri.lelli@redhat.com;
> vincent.guittot@linaro.org; dietmar.eggemann@arm.com;
> bsegall@google.com; linux-kernel@vger.kernel.org; Linuxarm
> <linuxarm@huawei.com>; Mel Gorman <mgorman@techsingularity.net>;
> Peter Zijlstra <a.p.zijlstra@chello.nl>; Valentin Schneider
> <valentin.schneider@arm.com>; Phil Auld <pauld@redhat.com>; Hillf Danton
> <hdanton@sina.com>; Ingo Molnar <mingo@kernel.org>
> Subject: Re: [PATCH] sched/fair: use dst group while checking imbalance for
> NUMA balancer
> 
> On Mon, Sep 07, 2020 at 07:27:08PM +1200, Barry Song wrote:
> > Something is wrong. In find_busiest_group(), we are checking if src has
> > higher load, however, in task_numa_find_cpu(), we are checking if dst
> > will have higher load after balancing. It seems it is not sensible to
> > check src.
> > It maybe cause wrong imbalance value, for example, if
> > dst_running = env->dst_stats.nr_running + 1 results in 3 or above, and
> > src_running = env->src_stats.nr_running - 1 results in 1;
> > The current code is thinking imbalance as 0 since src_running is smaller
> > than 2.
> > This is inconsistent with load balancer.
> >
> 
> It checks the conditions if the move was to happen. Have you evaluated
> this for a NUMA balancing load and confirmed it a) balances properly and
> b) does not increase the scan rate trying to "fix" the problem?

I think the original code was trying to check whether the NUMA migration
would lead to a new imbalance in the load balancer. Suppose src is A, dst
is B, and both of them have nr_running as 2. If A moves one task to B,
then A will have 1 and B will have 3. The load balancer will then try to
pull a task from B back to A, since B's nr_running is larger than
imbalance_min. But the code reports imbalance=0 because it finds A's
nr_running is smaller than imbalance_min.
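
To sketch that with concrete (hypothetical) numbers:

	NUMA balancer, moving one task from A to B (both start at
	nr_running == 2):
	    src_running = 2 - 1 = 1, dst_running = 2 + 1 = 3
	    imbalance = max(0, 3 - 1) = 2
	    adjust_numa_imbalance(2, 1) returns 0, so the task is placed
	    on an idle CPU of B

	Load balancer afterwards:
	    A has 1 running task, B has 3, so B looks busiest and a task
	    tends to be pulled back from B to A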

Will share more test data if you need.

> 
> --
> Mel Gorman
> SUSE Labs

Thanks
Barry


* Re: [PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer
  2020-09-07  9:44   ` Song Bao Hua (Barry Song)
@ 2020-09-07 10:44     ` Mel Gorman
  2020-09-07 12:31     ` Srikar Dronamraju
  1 sibling, 0 replies; 8+ messages in thread
From: Mel Gorman @ 2020-09-07 10:44 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song)
  Cc: Mel Gorman, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, linux-kernel, Linuxarm,
	Peter Zijlstra, Valentin Schneider, Phil Auld, Hillf Danton,
	Ingo Molnar

On Mon, Sep 07, 2020 at 09:44:03AM +0000, Song Bao Hua (Barry Song) wrote:
> 
> 
> > -----Original Message-----
> > From: Mel Gorman [mailto:mgorman@suse.de]
> > Sent: Monday, September 7, 2020 9:27 PM
> > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> > Cc: mingo@redhat.com; peterz@infradead.org; juri.lelli@redhat.com;
> > vincent.guittot@linaro.org; dietmar.eggemann@arm.com;
> > bsegall@google.com; linux-kernel@vger.kernel.org; Linuxarm
> > <linuxarm@huawei.com>; Mel Gorman <mgorman@techsingularity.net>;
> > Peter Zijlstra <a.p.zijlstra@chello.nl>; Valentin Schneider
> > <valentin.schneider@arm.com>; Phil Auld <pauld@redhat.com>; Hillf Danton
> > <hdanton@sina.com>; Ingo Molnar <mingo@kernel.org>
> > Subject: Re: [PATCH] sched/fair: use dst group while checking imbalance for
> > NUMA balancer
> > 
> > On Mon, Sep 07, 2020 at 07:27:08PM +1200, Barry Song wrote:
> > > Something is wrong. In find_busiest_group(), we are checking if src has
> > > higher load, however, in task_numa_find_cpu(), we are checking if dst
> > > will have higher load after balancing. It seems it is not sensible to
> > > check src.
> > > It maybe cause wrong imbalance value, for example, if
> > > dst_running = env->dst_stats.nr_running + 1 results in 3 or above, and
> > > src_running = env->src_stats.nr_running - 1 results in 1;
> > > The current code is thinking imbalance as 0 since src_running is smaller
> > > than 2.
> > > This is inconsistent with load balancer.
> > >
> > 
> > It checks the conditions if the move was to happen. Have you evaluated
> > this for a NUMA balancing load and confirmed it a) balances properly and
> > b) does not increase the scan rate trying to "fix" the problem?
> 
> I think the original code was trying to check if the numa migration
> would lead to new imbalance in load balancer. In case src is A, dst is B, and
> both of them have nr_running as 2. A moves one task to B, then A
> will have 1, B will have 3. In load balancer, A will try to pull task
> from B since B's nr_running is larger than min_imbalance. But the code
> is saying imbalance=0 by finding A's nr_running is smaller than
> min_imbalance.
> 
> Will share more test data if you need.
> 

Include the test description, data and details of the system you used to
evaluate the patch. I ask because the load/numa reconciliation took a long
time to cover all the corner cases and it's very easy to reintroduce major
regressions. At least one of those corner cases was trying to balance
in the wrong direction because in some cases NUMA balancing will try to
allow a small imbalance if it makes sense from a locality point of view.
Another corner case was that if the small imbalance is too large or applied
at the wrong time, it regresses overall even though the locality is good
because of memory bandwidth limitations. This is obviously far from ideal
but it does mean that it's an area that needs data backing up the changes.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer
  2020-09-07  9:44   ` Song Bao Hua (Barry Song)
  2020-09-07 10:44     ` Mel Gorman
@ 2020-09-07 12:31     ` Srikar Dronamraju
       [not found]       ` <20200908010717.12436-1-hdanton@sina.com>
  1 sibling, 1 reply; 8+ messages in thread
From: Srikar Dronamraju @ 2020-09-07 12:31 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song)
  Cc: Mel Gorman, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, linux-kernel, Linuxarm, Mel Gorman,
	Peter Zijlstra, Valentin Schneider, Phil Auld, Hillf Danton,
	Ingo Molnar

> > 
> > On Mon, Sep 07, 2020 at 07:27:08PM +1200, Barry Song wrote:
> > > Something is wrong. In find_busiest_group(), we are checking if src has
> > > higher load, however, in task_numa_find_cpu(), we are checking if dst
> > > will have higher load after balancing. It seems it is not sensible to
> > > check src.
> > > It maybe cause wrong imbalance value, for example, if
> > > dst_running = env->dst_stats.nr_running + 1 results in 3 or above, and
> > > src_running = env->src_stats.nr_running - 1 results in 1;
> > > The current code is thinking imbalance as 0 since src_running is smaller
> > > than 2.
> > > This is inconsistent with load balancer.
> > >

I have observed behaviour similar to what Barry Song has documented, with a
simple ebizzy run using fewer threads on a 2-node system

ebizzy -t 6 -S 100

We see a couple of ebizzy threads moving back and forth between the 2 nodes
because the NUMA balancer and the load balancer are trying to do the exact
opposite.

However, with Barry's patch, a couple of tests regress heavily (any NUMA
workload that has shared NUMA faults).
For example:
perf bench numa mem --no-data_rand_walk -p 1 -t 6 -G 0 -P 3072 -T 0 -l 50 -c

I also don't understand the rationale behind checking dst_running in the
NUMA balancer path. This almost means no NUMA balancing in lightly loaded
scenarios.
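
For example, with hypothetical numbers where the destination node runs a
couple of tasks but still has idle CPUs (src_stats.nr_running == 1,
dst_stats.nr_running == 2), the patched code computes:

	src_running = 1 - 1 = 0
	dst_running = 2 + 1 = 3
	imbalance   = max(0, 3 - 0) = 3
	adjust_numa_imbalance(3, 3) returns 3

so the "use idle CPU if there is no imbalance" shortcut is not taken,
even though the destination node is mostly idle.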

So I agree with Mel that we should probably test more scenarios before
we accept this patch.

> > 
> > It checks the conditions if the move was to happen. Have you evaluated
> > this for a NUMA balancing load and confirmed it a) balances properly and
> > b) does not increase the scan rate trying to "fix" the problem?
> 
> I think the original code was trying to check if the numa migration
> would lead to new imbalance in load balancer. In case src is A, dst is B, and
> both of them have nr_running as 2. A moves one task to B, then A
> will have 1, B will have 3. In load balancer, A will try to pull task
> from B since B's nr_running is larger than min_imbalance. But the code
> is saying imbalance=0 by finding A's nr_running is smaller than
> min_imbalance.
> 
> Will share more test data if you need.
> 
> > 
> > --
> > Mel Gorman
> > SUSE Labs
> 
> Thanks
> Barry

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer
       [not found]       ` <20200908010717.12436-1-hdanton@sina.com>
@ 2020-09-10 21:50         ` Jirka Hladky
  2020-09-21 11:02           ` Mel Gorman
  0 siblings, 1 reply; 8+ messages in thread
From: Jirka Hladky @ 2020-09-10 21:50 UTC (permalink / raw)
  To: Hillf Danton, Mel Gorman
  Cc: Srikar Dronamraju, Song Bao Hua (Barry Song),
	mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	bsegall, linux-kernel, Linuxarm, Mel Gorman, Peter Zijlstra,
	Valentin Schneider, Phil Auld, Ingo Molnar, kkolakow, Jiri Vozar

Hi Hillf and Mel,

thanks a lot for bringing this to my attention. We have tested the
proposed patch and we are getting excellent results so far!

1) Thread stability has improved a lot. We see far fewer thread
migrations. Check for example this heatmap based on the mpstat results
collected while running the sp.C test from the NAS benchmark. sp.C was run
with 16 threads on a dual-socket AMD 7351 server with 8 NUMA nodes:
5.9 vanilla https://drive.google.com/file/d/10rojhTSQUSu-1aGQi-srr99SmVQ3Llgo/view?usp=sharing
5.9 with the patch
https://drive.google.com/file/d/1eZdTBWvWMNvdvXqitwAlcKZ7gb-ySQUl/view?usp=sharing

The heatmaps are generated from the mpstat output (there are 5
different runs in one picture). We collect CPU utilization every 5
seconds. A lighter color corresponds to lower CPU utilization. A light
color means that the thread may have run on different CPUs during those
5 seconds. Solid straight lines, on the other hand, mean that the thread
was running on the same CPU all the time. The difference is striking.

We see significantly fewer thread migrations across many different
tests - NAS, SPECjbb2005, SPECjvm2008

2) We also see a performance improvement in terms of runtime, especially
in low-load scenarios (the number of threads being roughly equal to 2x
the number of NUMA nodes). It fixes the performance drop we have seen
since the 5.7 kernel, discussed for example here:
https://lore.kernel.org/lkml/CAE4VaGB7+sR1nf3Ux8W=hgN46gNXRYr0uBWJU0oYnk7h00Y_dw@mail.gmail.com/

The biggest improvements are for the NAS and the SPECjvm2008
benchmarks (typically between 20-50%). SPECjbb2005 also shows
improvements, around 10%. This is again on NUMA servers at low
utilization. You can find snapshots of some results here:
https://drive.google.com/drive/folders/1k3Gb-vlI7UjPZZcLkoL2W2VB_zqxIJ3_?usp=sharing

@Mel - could you please test the proposed patch? I know you have good
coverage for the inter-process communication benchmarks, which may show
different behavior than the benchmarks we use. I hope that fewer
thread migrations might show positive effects for these tests as well.
Please give it a try.

Thanks a lot!
Jirka


On Tue, Sep 8, 2020 at 3:07 AM Hillf Danton <hdanton@sina.com> wrote:
>
>
> On Mon, 7 Sep 2020 18:01:06 +0530 Srikar Dronamraju wrote:
> > > > On Mon, Sep 07, 2020 at 07:27:08PM +1200, Barry Song wrote:
> > > > > Something is wrong. In find_busiest_group(), we are checking if src has
> > > > > higher load, however, in task_numa_find_cpu(), we are checking if dst
> > > > > will have higher load after balancing. It seems it is not sensible to
> > > > > check src.
> > > > > It maybe cause wrong imbalance value, for example, if
> > > > > dst_running = env->dst_stats.nr_running + 1 results in 3 or above, and
> > > > > src_running = env->src_stats.nr_running - 1 results in 1;
> > > > > The current code is thinking imbalance as 0 since src_running is smaller
> > > > > than 2.
> > > > > This is inconsistent with load balancer.
> > > > >
>
> Hi Srikar and Barry
> >
> > I have observed the similar behaviour what Barry Song has documented with a
> > simple ebizzy with less threads on a 2 node system
>
> Thanks for your testing.
> >
> > ebizzy -t 6 -S 100
> >
> > We see couple of ebizzy threads moving back and forth between the 2 nodes
> > because of numa balancer and load balancer trying to do the exact opposite.
> >
> > However with Barry's patch, couple of tests regress heavily. (Any numa
> > workload that has shared numa faults).
> > For example:
> > perf bench numa mem --no-data_rand_walk -p 1 -t 6 -G 0 -P 3072 -T 0 -l 50 -c
> >
> > I also don't understand the rational behind checking for dst_running in numa
> > balancer path. This almost means no numa balancing in lightly loaded scenario.
> >
> > So agree with Mel that we should probably test more scenarios before
> > we accept this patch.
>
> Take a look at Jirka's work [1] please if you have any plan to do some
> more tests.
>
> [1] https://lore.kernel.org/lkml/CAE4VaGB7+sR1nf3Ux8W=hgN46gNXRYr0uBWJU0oYnk7h00Y_dw@mail.gmail.com/
> >
> > > >
> > > > It checks the conditions if the move was to happen. Have you evaluated
> > > > this for a NUMA balancing load and confirmed it a) balances properly and
> > > > b) does not increase the scan rate trying to "fix" the problem?
> > >
> > > I think the original code was trying to check if the numa migration
> > > would lead to new imbalance in load balancer. In case src is A, dst is B, and
> > > both of them have nr_running as 2. A moves one task to B, then A
> > > will have 1, B will have 3. In load balancer, A will try to pull task
> > > from B since B's nr_running is larger than min_imbalance. But the code
> > > is saying imbalance=0 by finding A's nr_running is smaller than
> > > min_imbalance.
> > >
> > > Will share more test data if you need.
> > >
> > > >
> > > > --
> > > > Mel Gorman
> > > > SUSE Labs
> > >
> > > Thanks
> > > Barry
> >
> > --
> > Thanks and Regards
> > Srikar Dronamraju
>
>


-- 
-Jirka



* Re: [PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer
  2020-09-10 21:50         ` Jirka Hladky
@ 2020-09-21 11:02           ` Mel Gorman
  2020-09-21 16:02             ` Jirka Hladky
  0 siblings, 1 reply; 8+ messages in thread
From: Mel Gorman @ 2020-09-21 11:02 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Hillf Danton, Mel Gorman, Srikar Dronamraju,
	Song Bao Hua (Barry Song),
	mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	bsegall, linux-kernel, Linuxarm, Peter Zijlstra,
	Valentin Schneider, Phil Auld, Ingo Molnar, kkolakow, Jiri Vozar

On Thu, Sep 10, 2020 at 11:50:04PM +0200, Jirka Hladky wrote:
> 1) Threads stability has improved a lot. We see much fewer threads
> migrations. Check for example this heatmap based on the mpstat results
> collected while running sp.C test from the NAS benchmark. sp.C was run
> with 16 threads on a dual-socket AMD 7351 server with 8 NUMA nodes:
> 5.9 vanilla https://drive.google.com/file/d/10rojhTSQUSu-1aGQi-srr99SmVQ3Llgo/view?usp=sharing
> 5.9 with the patch
> https://drive.google.com/file/d/1eZdTBWvWMNvdvXqitwAlcKZ7gb-ySQUl/view?usp=sharing
> 
> The heatmaps are generated from the mpstat output (there are 5
> different runs at one picture). We collect CPU utilization every 5
> seconds. Lighter color corresponds to lower CPU utilization. Light
> color means that the thread may have run on different CPUs during that
> 5 seconds. Solid straight lines, on the other hand, mean that thread
> was running on the same CPU all the time. The difference is striking.
> 
> We see significantly fewer threads migrations across many different
> tests - NAS, SPECjbb2005, SPECjvm2008
> 

For all three, I see that the point where it's better or worse depends on
overall activity. I looked at heatmaps for a variety of workloads and
visually the bigger differences tend to be when utilisation is relatively
low (e.g. one third of CPUs active).

> 2) We see also performance improvement in terms of runtime, especially
> at low load scenarios (number of threads being roughly equal to the 2*
> number of NUMA nodes). It fixes the performance drop which we see
> since 5.7 kernel, discussed for example here:
> https://lore.kernel.org/lkml/CAE4VaGB7+sR1nf3Ux8W=hgN46gNXRYr0uBWJU0oYnk7h00Y_dw@mail.gmail.com/
> 

This would make some sense given the original intent about allowing
imbalances, which was later revised significantly. It depends on when
memory throughput is more important, so the impact varies with the level
of utilisation. For example, on a 2-socket machine running SPECjbb 2005
I see

specjbb
                               5.9.0-rc4              5.9.0-rc4
                                 vanilla        dstbalance-v1r1
Hmean     tput-1     46425.00 (   0.00%)    43394.00 *  -6.53%*
Hmean     tput-2     98416.00 (   0.00%)    96031.00 *  -2.42%*
Hmean     tput-3    150184.00 (   0.00%)   148783.00 *  -0.93%*
Hmean     tput-4    200683.00 (   0.00%)   197906.00 *  -1.38%*
Hmean     tput-5    236305.00 (   0.00%)   245549.00 *   3.91%*
Hmean     tput-6    281559.00 (   0.00%)   285692.00 *   1.47%*
Hmean     tput-7    338558.00 (   0.00%)   334467.00 *  -1.21%*
Hmean     tput-8    340745.00 (   0.00%)   372501.00 *   9.32%*
Hmean     tput-9    424343.00 (   0.00%)   413006.00 *  -2.67%*
Hmean     tput-10   421854.00 (   0.00%)   434261.00 *   2.94%*
Hmean     tput-11   493256.00 (   0.00%)   485330.00 *  -1.61%*
Hmean     tput-12   549573.00 (   0.00%)   529959.00 *  -3.57%*
Hmean     tput-13   593183.00 (   0.00%)   555010.00 *  -6.44%*
Hmean     tput-14   588252.00 (   0.00%)   599166.00 *   1.86%*
Hmean     tput-15   623065.00 (   0.00%)   642713.00 *   3.15%*
Hmean     tput-16   703924.00 (   0.00%)   660758.00 *  -6.13%*
Hmean     tput-17   666023.00 (   0.00%)   697675.00 *   4.75%*
Hmean     tput-18   761502.00 (   0.00%)   758360.00 *  -0.41%*
Hmean     tput-19   796088.00 (   0.00%)   798368.00 *   0.29%*
Hmean     tput-20   733564.00 (   0.00%)   823086.00 *  12.20%*
Hmean     tput-21   840980.00 (   0.00%)   856711.00 *   1.87%*
Hmean     tput-22   804285.00 (   0.00%)   872238.00 *   8.45%*
Hmean     tput-23   795208.00 (   0.00%)   889374.00 *  11.84%*
Hmean     tput-24   848619.00 (   0.00%)   966783.00 *  13.92%*
Hmean     tput-25   750848.00 (   0.00%)   903790.00 *  20.37%*
Hmean     tput-26   780523.00 (   0.00%)   962254.00 *  23.28%*
Hmean     tput-27  1042245.00 (   0.00%)   991544.00 *  -4.86%*
Hmean     tput-28  1090580.00 (   0.00%)  1035926.00 *  -5.01%*
Hmean     tput-29   999483.00 (   0.00%)  1082948.00 *   8.35%*
Hmean     tput-30  1098663.00 (   0.00%)  1113427.00 *   1.34%*
Hmean     tput-31  1125671.00 (   0.00%)  1134175.00 *   0.76%*
Hmean     tput-32   968167.00 (   0.00%)  1250286.00 *  29.14%*
Hmean     tput-33  1077676.00 (   0.00%)  1060893.00 *  -1.56%*
Hmean     tput-34  1090538.00 (   0.00%)  1090933.00 *   0.04%*
Hmean     tput-35   967058.00 (   0.00%)  1107421.00 *  14.51%*
Hmean     tput-36  1051745.00 (   0.00%)  1210663.00 *  15.11%*
Hmean     tput-37  1019465.00 (   0.00%)  1351446.00 *  32.56%*
Hmean     tput-38  1083102.00 (   0.00%)  1064541.00 *  -1.71%*
Hmean     tput-39  1232990.00 (   0.00%)  1303623.00 *   5.73%*
Hmean     tput-40  1175542.00 (   0.00%)  1340943.00 *  14.07%*
Hmean     tput-41  1127826.00 (   0.00%)  1339492.00 *  18.77%*
Hmean     tput-42  1198313.00 (   0.00%)  1411023.00 *  17.75%*
Hmean     tput-43  1163733.00 (   0.00%)  1228253.00 *   5.54%*
Hmean     tput-44  1305562.00 (   0.00%)  1357886.00 *   4.01%*
Hmean     tput-45  1326752.00 (   0.00%)  1406061.00 *   5.98%*
Hmean     tput-46  1339424.00 (   0.00%)  1418451.00 *   5.90%*
Hmean     tput-47  1415057.00 (   0.00%)  1381570.00 *  -2.37%*
Hmean     tput-48  1392003.00 (   0.00%)  1421167.00 *   2.10%*
Hmean     tput-49  1408374.00 (   0.00%)  1418659.00 *   0.73%*
Hmean     tput-50  1359822.00 (   0.00%)  1391070.00 *   2.30%*
Hmean     tput-51  1414246.00 (   0.00%)  1392679.00 *  -1.52%*
Hmean     tput-52  1432352.00 (   0.00%)  1354020.00 *  -5.47%*
Hmean     tput-53  1387563.00 (   0.00%)  1409563.00 *   1.59%*
Hmean     tput-54  1406420.00 (   0.00%)  1388711.00 *  -1.26%*
Hmean     tput-55  1438804.00 (   0.00%)  1387472.00 *  -3.57%*
Hmean     tput-56  1399465.00 (   0.00%)  1400296.00 *   0.06%*
Hmean     tput-57  1428132.00 (   0.00%)  1396399.00 *  -2.22%*
Hmean     tput-58  1432385.00 (   0.00%)  1386253.00 *  -3.22%*
Hmean     tput-59  1421612.00 (   0.00%)  1371416.00 *  -3.53%*
Hmean     tput-60  1429423.00 (   0.00%)  1389412.00 *  -2.80%*
Hmean     tput-61  1396230.00 (   0.00%)  1351122.00 *  -3.23%*
Hmean     tput-62  1418396.00 (   0.00%)  1383098.00 *  -2.49%*
Hmean     tput-63  1409918.00 (   0.00%)  1374662.00 *  -2.50%*
Hmean     tput-64  1410236.00 (   0.00%)  1376216.00 *  -2.41%*
Hmean     tput-65  1396405.00 (   0.00%)  1364418.00 *  -2.29%*
Hmean     tput-66  1395975.00 (   0.00%)  1357326.00 *  -2.77%*
Hmean     tput-67  1392986.00 (   0.00%)  1349642.00 *  -3.11%*
Hmean     tput-68  1386541.00 (   0.00%)  1343261.00 *  -3.12%*
Hmean     tput-69  1374407.00 (   0.00%)  1342588.00 *  -2.32%*
Hmean     tput-70  1377513.00 (   0.00%)  1334654.00 *  -3.11%*
Hmean     tput-71  1369319.00 (   0.00%)  1334952.00 *  -2.51%*
Hmean     tput-72  1354635.00 (   0.00%)  1329005.00 *  -1.89%*
Hmean     tput-73  1350933.00 (   0.00%)  1318942.00 *  -2.37%*
Hmean     tput-74  1351714.00 (   0.00%)  1316347.00 *  -2.62%*
Hmean     tput-75  1352198.00 (   0.00%)  1309974.00 *  -3.12%*
Hmean     tput-76  1349490.00 (   0.00%)  1286064.00 *  -4.70%*
Hmean     tput-77  1336131.00 (   0.00%)  1303684.00 *  -2.43%*
Hmean     tput-78  1308896.00 (   0.00%)  1271024.00 *  -2.89%*
Hmean     tput-79  1326703.00 (   0.00%)  1290862.00 *  -2.70%*
Hmean     tput-80  1336199.00 (   0.00%)  1291629.00 *  -3.34%*

Note that as utilisation reaches roughly half the CPUs, performance
is much better but then degrades for higher thread counts. Part of this
is that NUMA balancing activity is higher with the patch once roughly a
quarter of the CPUs are in use.

NAS is all over the map depending on the compute kernel and degree of
parallelisation. Sometimes it is much faster, other times slower,
depending on the machine and the degree of parallelisation. Generally the
level of NUMA balancing activity is variable -- sometimes better,
sometimes worse -- and locality is generally lower (which is not a solid
conclusion as it's usually correlated with differences in hinting
faults).

There are counter-examples. On EPYC 2 running hackbench, locality and
scan activity are much worse with the patch. On hackbench using processes
and sockets, it is about 8% slower for small numbers of groups and 28%
slower for larger numbers of groups. On the same workload, NUMA activity
is much higher, locality goes from 50% to 1.5%. Bizarrely, with pipes
the patch is much better except in the opposite direction.

For something like dbench, it depends on the machine. All but one
machine showed an improvement. On EPYC 2, it's much worse.

> The biggest improvements are for the NAS and the SPECjvm2008
> benchmarks (typically between 20-50%). SPECjbb2005 shows also
> improvements, around 10%. This is again on NUMA servers at the low
> utilization. You can find snapshots of some results here:
> https://drive.google.com/drive/folders/1k3Gb-vlI7UjPZZcLkoL2W2VB_zqxIJ3_?usp=sharing
> 
> @Mel - could you please test the proposed patch? I know you have good
> coverage for the inter-process communication benchmarks which may show
> different behavior than benchmarks which we use. I hope that fewer
> threads migrations might show positive effects also for these tests.
> Please give it a try.
> 

Basically I see a mixed bag depending on the workload, machine and overall
level of utilisation. Sometimes it's better (sometimes much better),
other times it is worse (sometimes much worse). Given that there isn't a
universally good decision in this area, and more people seem to prefer
the patch, it may be best to keep the LB decisions consistent and
revisit imbalance handling when the load balancer code changes settle
down, so

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer
  2020-09-21 11:02           ` Mel Gorman
@ 2020-09-21 16:02             ` Jirka Hladky
  0 siblings, 0 replies; 8+ messages in thread
From: Jirka Hladky @ 2020-09-21 16:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Hillf Danton, Mel Gorman, Srikar Dronamraju,
	Song Bao Hua (Barry Song),
	mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	bsegall, linux-kernel, Linuxarm, Peter Zijlstra,
	Valentin Schneider, Phil Auld, Ingo Molnar, kkolakow, Jiri Vozar

Resending - previously I forgot to send the message in plain text
mode. I'm sorry.

Hi Mel,

Thanks a lot for looking into this!

Our results are mostly in line with what you see. We observe big gains
(20-50%) when the system is loaded to 1/3 of its maximum capacity, and
mixed results at full load - some workloads benefit from the patch at
full load, others do not, but the performance changes at full load are
mostly within the noise of the results (+/-5%). Overall, we think this
patch is helpful.

Jirka


On Mon, Sep 21, 2020 at 1:02 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Thu, Sep 10, 2020 at 11:50:04PM +0200, Jirka Hladky wrote:
> > 1) Threads stability has improved a lot. We see much fewer threads
> > migrations. Check for example this heatmap based on the mpstat results
> > collected while running sp.C test from the NAS benchmark. sp.C was run
> > with 16 threads on a dual-socket AMD 7351 server with 8 NUMA nodes:
> > 5.9 vanilla https://drive.google.com/file/d/10rojhTSQUSu-1aGQi-srr99SmVQ3Llgo/view?usp=sharing
> > 5.9 with the patch
> > https://drive.google.com/file/d/1eZdTBWvWMNvdvXqitwAlcKZ7gb-ySQUl/view?usp=sharing
> >
> > The heatmaps are generated from the mpstat output (there are 5
> > different runs at one picture). We collect CPU utilization every 5
> > seconds. Lighter color corresponds to lower CPU utilization. Light
> > color means that the thread may have run on different CPUs during that
> > 5 seconds. Solid straight lines, on the other hand, mean that thread
> > was running on the same CPU all the time. The difference is striking.
> >
> > We see significantly fewer threads migrations across many different
> > tests - NAS, SPECjbb2005, SPECjvm2008
> >
>
> For all three, I see the point where it's better or worse depends on
> overall activity. I looked at heatmaps for a variety of workloads and
> visually the bigger differences tend to be with utilisation is relatively
> low (e.g. one third of CPUs active).
>
> > 2) We see also performance improvement in terms of runtime, especially
> > at low load scenarios (number of threads being roughly equal to the 2*
> > number of NUMA nodes). It fixes the performance drop which we see
> > since 5.7 kernel, discussed for example here:
> > https://lore.kernel.org/lkml/CAE4VaGB7+sR1nf3Ux8W=hgN46gNXRYr0uBWJU0oYnk7h00Y_dw@mail.gmail.com/
> >
>
> This would make some sense given the original intent about allowing
> imbalances that was later revised significantly. It depends on when
> memory throughput is more important so the impact varies with the level
> of untilisation.  For example, on a 2-socket machine running specjvm 2005
> I see
>
> specjbb
>                                5.9.0-rc4              5.9.0-rc4
>                                  vanilla        dstbalance-v1r1
> Hmean     tput-1     46425.00 (   0.00%)    43394.00 *  -6.53%*
> Hmean     tput-2     98416.00 (   0.00%)    96031.00 *  -2.42%*
> Hmean     tput-3    150184.00 (   0.00%)   148783.00 *  -0.93%*
> Hmean     tput-4    200683.00 (   0.00%)   197906.00 *  -1.38%*
> Hmean     tput-5    236305.00 (   0.00%)   245549.00 *   3.91%*
> Hmean     tput-6    281559.00 (   0.00%)   285692.00 *   1.47%*
> Hmean     tput-7    338558.00 (   0.00%)   334467.00 *  -1.21%*
> Hmean     tput-8    340745.00 (   0.00%)   372501.00 *   9.32%*
> Hmean     tput-9    424343.00 (   0.00%)   413006.00 *  -2.67%*
> Hmean     tput-10   421854.00 (   0.00%)   434261.00 *   2.94%*
> Hmean     tput-11   493256.00 (   0.00%)   485330.00 *  -1.61%*
> Hmean     tput-12   549573.00 (   0.00%)   529959.00 *  -3.57%*
> Hmean     tput-13   593183.00 (   0.00%)   555010.00 *  -6.44%*
> Hmean     tput-14   588252.00 (   0.00%)   599166.00 *   1.86%*
> Hmean     tput-15   623065.00 (   0.00%)   642713.00 *   3.15%*
> Hmean     tput-16   703924.00 (   0.00%)   660758.00 *  -6.13%*
> Hmean     tput-17   666023.00 (   0.00%)   697675.00 *   4.75%*
> Hmean     tput-18   761502.00 (   0.00%)   758360.00 *  -0.41%*
> Hmean     tput-19   796088.00 (   0.00%)   798368.00 *   0.29%*
> Hmean     tput-20   733564.00 (   0.00%)   823086.00 *  12.20%*
> Hmean     tput-21   840980.00 (   0.00%)   856711.00 *   1.87%*
> Hmean     tput-22   804285.00 (   0.00%)   872238.00 *   8.45%*
> Hmean     tput-23   795208.00 (   0.00%)   889374.00 *  11.84%*
> Hmean     tput-24   848619.00 (   0.00%)   966783.00 *  13.92%*
> Hmean     tput-25   750848.00 (   0.00%)   903790.00 *  20.37%*
> Hmean     tput-26   780523.00 (   0.00%)   962254.00 *  23.28%*
> Hmean     tput-27  1042245.00 (   0.00%)   991544.00 *  -4.86%*
> Hmean     tput-28  1090580.00 (   0.00%)  1035926.00 *  -5.01%*
> Hmean     tput-29   999483.00 (   0.00%)  1082948.00 *   8.35%*
> Hmean     tput-30  1098663.00 (   0.00%)  1113427.00 *   1.34%*
> Hmean     tput-31  1125671.00 (   0.00%)  1134175.00 *   0.76%*
> Hmean     tput-32   968167.00 (   0.00%)  1250286.00 *  29.14%*
> Hmean     tput-33  1077676.00 (   0.00%)  1060893.00 *  -1.56%*
> Hmean     tput-34  1090538.00 (   0.00%)  1090933.00 *   0.04%*
> Hmean     tput-35   967058.00 (   0.00%)  1107421.00 *  14.51%*
> Hmean     tput-36  1051745.00 (   0.00%)  1210663.00 *  15.11%*
> Hmean     tput-37  1019465.00 (   0.00%)  1351446.00 *  32.56%*
> Hmean     tput-38  1083102.00 (   0.00%)  1064541.00 *  -1.71%*
> Hmean     tput-39  1232990.00 (   0.00%)  1303623.00 *   5.73%*
> Hmean     tput-40  1175542.00 (   0.00%)  1340943.00 *  14.07%*
> Hmean     tput-41  1127826.00 (   0.00%)  1339492.00 *  18.77%*
> Hmean     tput-42  1198313.00 (   0.00%)  1411023.00 *  17.75%*
> Hmean     tput-43  1163733.00 (   0.00%)  1228253.00 *   5.54%*
> Hmean     tput-44  1305562.00 (   0.00%)  1357886.00 *   4.01%*
> Hmean     tput-45  1326752.00 (   0.00%)  1406061.00 *   5.98%*
> Hmean     tput-46  1339424.00 (   0.00%)  1418451.00 *   5.90%*
> Hmean     tput-47  1415057.00 (   0.00%)  1381570.00 *  -2.37%*
> Hmean     tput-48  1392003.00 (   0.00%)  1421167.00 *   2.10%*
> Hmean     tput-49  1408374.00 (   0.00%)  1418659.00 *   0.73%*
> Hmean     tput-50  1359822.00 (   0.00%)  1391070.00 *   2.30%*
> Hmean     tput-51  1414246.00 (   0.00%)  1392679.00 *  -1.52%*
> Hmean     tput-52  1432352.00 (   0.00%)  1354020.00 *  -5.47%*
> Hmean     tput-53  1387563.00 (   0.00%)  1409563.00 *   1.59%*
> Hmean     tput-54  1406420.00 (   0.00%)  1388711.00 *  -1.26%*
> Hmean     tput-55  1438804.00 (   0.00%)  1387472.00 *  -3.57%*
> Hmean     tput-56  1399465.00 (   0.00%)  1400296.00 *   0.06%*
> Hmean     tput-57  1428132.00 (   0.00%)  1396399.00 *  -2.22%*
> Hmean     tput-58  1432385.00 (   0.00%)  1386253.00 *  -3.22%*
> Hmean     tput-59  1421612.00 (   0.00%)  1371416.00 *  -3.53%*
> Hmean     tput-60  1429423.00 (   0.00%)  1389412.00 *  -2.80%*
> Hmean     tput-61  1396230.00 (   0.00%)  1351122.00 *  -3.23%*
> Hmean     tput-62  1418396.00 (   0.00%)  1383098.00 *  -2.49%*
> Hmean     tput-63  1409918.00 (   0.00%)  1374662.00 *  -2.50%*
> Hmean     tput-64  1410236.00 (   0.00%)  1376216.00 *  -2.41%*
> Hmean     tput-65  1396405.00 (   0.00%)  1364418.00 *  -2.29%*
> Hmean     tput-66  1395975.00 (   0.00%)  1357326.00 *  -2.77%*
> Hmean     tput-67  1392986.00 (   0.00%)  1349642.00 *  -3.11%*
> Hmean     tput-68  1386541.00 (   0.00%)  1343261.00 *  -3.12%*
> Hmean     tput-69  1374407.00 (   0.00%)  1342588.00 *  -2.32%*
> Hmean     tput-70  1377513.00 (   0.00%)  1334654.00 *  -3.11%*
> Hmean     tput-71  1369319.00 (   0.00%)  1334952.00 *  -2.51%*
> Hmean     tput-72  1354635.00 (   0.00%)  1329005.00 *  -1.89%*
> Hmean     tput-73  1350933.00 (   0.00%)  1318942.00 *  -2.37%*
> Hmean     tput-74  1351714.00 (   0.00%)  1316347.00 *  -2.62%*
> Hmean     tput-75  1352198.00 (   0.00%)  1309974.00 *  -3.12%*
> Hmean     tput-76  1349490.00 (   0.00%)  1286064.00 *  -4.70%*
> Hmean     tput-77  1336131.00 (   0.00%)  1303684.00 *  -2.43%*
> Hmean     tput-78  1308896.00 (   0.00%)  1271024.00 *  -2.89%*
> Hmean     tput-79  1326703.00 (   0.00%)  1290862.00 *  -2.70%*
> Hmean     tput-80  1336199.00 (   0.00%)  1291629.00 *  -3.34%*
>
> Note as the utilisation reaches roughly half the CPUs that performance
> is much better but then degrades for higher thread counts. Part of this
> is that NUMA balancing activity is higher with the patch once roughly a
> quarter of the CPUs are in use.
>
> NAS is all over the map depending on the compute kernel and degree of
> parallelisation. Sometimes it is much faster depending on the machine
> and the degree of parallelisation, other times slower. Generally the
> level of NUMA balancing activity is variable -- sometimes better,
> sometimes worse and locality is generally lower (which is not a solid
> conclusion as it's usually correlated with differences in hinting
> faults).
>
> There are counter-examples. On EPYC 2 running hackbench, locality and
> scan activity is much worse with the patch. On hackbench using processes
> and sockets, it is about 8% slower for small numbers of groups and 28%
> slower for larger numbers of groups. On the same workload, NUMA activity
> is much higher, locality goes from 50% to 1.5%. Bizarrely, with pipes
> the patch is much better except in the opposite direction.
>
> For something like dbench, it depends on the machine. All but one
> machine showed an improvement. On EPYC 2, it's much worse.
>
> > The biggest improvements are for the NAS and the SPECjvm2008
> > benchmarks (typically between 20-50%). SPECjbb2005 shows also
> > improvements, around 10%. This is again on NUMA servers at the low
> > utilization. You can find snapshots of some results here:
> > https://drive.google.com/drive/folders/1k3Gb-vlI7UjPZZcLkoL2W2VB_zqxIJ3_?usp=sharing
> >
> > @Mel - could you please test the proposed patch? I know you have good
> > coverage for the inter-process communication benchmarks which may show
> > different behavior than benchmarks which we use. I hope that fewer
> > threads migrations might show positive effects also for these tests.
> > Please give it a try.
> >
>
> Basically I see mixed bag depending on the workload, machine and overall
> levels of utilisation. Sometimes it's better (sometimes much better),
> other times it is worse (sometimes much worse). Given that there isn't a
> universally good decision in this section and more people seem to prefer
> the patch then it may be best to keep the LB decisions consistent and
> revisit imbalance handling when the load balancer code changes settle
> down so
>
> Acked-by: Mel Gorman <mgorman@techsingularity.net>
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka


