From: Mel Gorman <mgorman@techsingularity.net>
To: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Aubrey Li <aubrey.li@linux.intel.com>,
	Barry Song <song.bao.hua@hisilicon.com>,
	Mike Galbraith <efault@gmx.de>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
Date: Mon, 13 Dec 2021 13:01:31 +0000
Message-ID: <20211213130131.GQ3366@techsingularity.net>
In-Reply-To: <YbcEE/mgIAhWuS+A@BLR-5CG11610CF.amd.com>

On Mon, Dec 13, 2021 at 01:58:03PM +0530, Gautham R. Shenoy wrote:
> > On a Zen3 machine running STREAM parallelised with OMP to have one instance
> > per LLC and without binding, the results are
> > 
> >                             5.16.0-rc1             5.16.0-rc1
> >                                vanilla       sched-numaimb-v4
> > MB/sec copy-16    166712.18 (   0.00%)   651540.22 ( 290.82%)
> > MB/sec scale-16   140109.66 (   0.00%)   382254.74 ( 172.83%)
> > MB/sec add-16     160791.18 (   0.00%)   623073.98 ( 287.51%)
> > MB/sec triad-16   160043.84 (   0.00%)   633964.52 ( 296.12%)
> 
> 
> Could you please share the size of the stream array? These numbers
> are higher than what I am observing.
> 

512MB

> > @@ -9280,19 +9286,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >  	}
> >  }
> >  
> > -#define NUMA_IMBALANCE_MIN 2
> > -
> >  static inline long adjust_numa_imbalance(int imbalance,
> > -				int dst_running, int dst_weight)
> > +				int dst_running, int dst_weight,
> > +				int imb_numa_nr)
> >  {
> >  	if (!allow_numa_imbalance(dst_running, dst_weight))
> >  		return imbalance;
> >
> 
> if (4 * dst_running >= dst_weight) we return imbalance here. The
> dst_weight here corresponds to the span of the domain, while
> dst_running is the nr_running in busiest.
> 

Yes, once dst_running is high enough, no imbalance is allowed. In
previous versions I changed this, but that was a mistake; in this
version, the threshold at which imbalance is no longer allowed remains
the same.
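
For reference, this is the gate being discussed as I read it in
5.16-rc1 (a sketch of the existing code, for illustration):

	/*
	 * Allow an imbalance only while the number of running tasks
	 * is below a quarter of the domain's CPUs.
	 */
	static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
	{
		return (dst_running < (dst_weight >> 2));
	}

With dst_weight = 256, no imbalance is allowed once 64 tasks are
running in the busiest group, which is consistent with your analysis
below.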

> On Zen3, at the topmost NUMA domain, the dst_weight = 256 in all the
> configurations of Nodes Per Socket (NPS) = 1/2/4. There are two
> groups, where each group is a socket. So, unless there are at least
> 64 tasks running in one of the sockets, we would not return imbalance
> here but would go on to the next step.
> 

Yes

> 
> > -	/*
> > -	 * Allow a small imbalance based on a simple pair of communicating
> > -	 * tasks that remain local when the destination is lightly loaded.
> > -	 */
> > -	if (imbalance <= NUMA_IMBALANCE_MIN)
> > +	if (imbalance <= imb_numa_nr)
> 
> In NPS=1 mode, imb_numa_nr would be 4. Since NUMA domains don't have
> PREFER_SIBLING, we would be balancing the number of idle CPUs. We
> will not end up balancing unless the difference between the idle CPUs
> is at least 8.
> 
> In NPS=2, imb_numa_nr = 8 for this topmost NUMA domain. So here, we
> will not rebalance unless the difference between the idle CPUs is at
> least 16.
> 
> In NPS=4, imb_numa_nr = 16 for this topmost NUMA domain. So, the
> threshold is now bumped up to 32.
> 
> >  		return 0;
> 
> 
> 
> >  
> >  	return imbalance;
> > @@ -9397,7 +9398,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >  		/* Consider allowing a small imbalance between NUMA groups */
> >  		if (env->sd->flags & SD_NUMA) {
> >  			env->imbalance = adjust_numa_imbalance(env->imbalance,
> > -				busiest->sum_nr_running, env->sd->span_weight);
> > +				busiest->sum_nr_running, env->sd->span_weight,
> > +				env->sd->imb_numa_nr);
> >  		}
> >  
> >  		return;
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index d201a7052a29..bacec575ade2 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2242,6 +2242,43 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >  		}
> >  	}
> >  
> > +	/*
> > +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> > +	 * imbalanced.
> > +	 */
> > +	for_each_cpu(i, cpu_map) {
> > +		unsigned int imb = 0;
> > +		unsigned int imb_span = 1;
> > +
> > +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > +			struct sched_domain *child = sd->child;
> > +
> > +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > +				struct sched_domain *top = sd;
> 
> 
> We don't seem to be using top anywhere that sd couldn't be used
> instead, since we already have the variables imb and imb_span to
> record top->imb_numa_nr and top->span_weight.
> 

top could have been removed, but we might still need it.

> 
> > +				unsigned int llc_sq;
> > +
> > +				/*
> > +				 * nr_llcs = (top->span_weight / llc_weight);
> > +				 * imb = (child_weight / nr_llcs) >> 2
> 
> child here is the llc. So can we use imb = (llc_weight / nr_llcs) >> 2?
> 

That would be clearer.

> > +				 *
> > +				 * is equivalent to
> > +				 *
> > +				 * imb = (llc_weight^2 / top->span_weight) >> 2
> > +				 *
> > +				 */
> > +				llc_sq = child->span_weight * child->span_weight;
> > +
> > +				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
> > +				imb_span = sd->span_weight;
> 
> On Zen3, child_weight (or llc_weight) = 16. llc_sq = 256.
>    with NPS=1
>       top = DIE.
>       top->span_weight = 128. imb = max(2, (256/128) >> 2) = 2. imb_span = 128.
> 
>    with NPS=2
>       top = NODE.
>       top->span_weight = 64. imb = max(2, (256/64) >> 2) = 2. imb_span = 64.
> 
>    with NPS=4      
>       top = NODE.
>       top->span_weight = 32. imb = max(2, (256/32) >> 2) = 2. imb_span = 32.
> 
> On Zen2, child_weight (or llc_weight) = 8. llc_sq = 64.
>    with NPS=1
>       top = DIE.
>       top->span_weight = 128. imb = max(2, (64/128) >> 2) = 2. imb_span = 128.
> 
>    with NPS=2
>       top = NODE.
>       top->span_weight = 64. imb = max(2, (64/64) >> 2) = 2. imb_span = 64.
> 
>    with NPS=4      
>       top = NODE.
>       top->span_weight = 32. imb = max(2, (64/32) >> 2) = 2. imb_span = 32.
> 
> 
> > +
> > +				sd->imb_numa_nr = imb;
> > +			} else {
> > +				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
> > +			}
> 
> On Zen3,
>    with NPS=1
>         sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/128) = 4.
> 
>    with NPS=2
>         sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/64) = 4
> 	sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/64) = 8
> 
>    with NPS=4
>         sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/32) = 8
> 	sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/32) = 16
> 
> 
> For Zen2, since the imb and imb_span values are the same as the
> corresponding NPS=x values on Zen3, the imb_numa_nr values are the
> same as well, because the corresponding sd->span_weight is the same.
> 
> If we look at the highest NUMA domain, there are two groups in all the
> NPS configurations. There are the same number of LLCs in each of these
> groups across the different NPS configurations (nr_llcs=8 on Zen3, 16
> on Zen2). However, the imb_numa_nr at this domain varies with the NPS
> value, since we compute the imb_numa_nr value relative to the number
> of "top" domains that can be fit within this NUMA domain. This is
> because the size of the "top" domain varies with the NPS value. This
> shows up in the benchmark results.
> 

The scaling was intentional, but based on your results it might be
applied at the wrong level.
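
To make the v4 scaling explicit, using the numbers from your breakdown
above, the topmost NUMA domain (span_weight = 256) ends up with:

	NPS=1: imb_numa_nr = 2 * (256 / 128) =  4
	NPS=2: imb_numa_nr = 2 * (256 /  64) =  8
	NPS=4: imb_numa_nr = 2 * (256 /  32) = 16

The factor is the number of "top" domains that fit in the machine, so
it tracks NPS rather than the constant number of LLCs per socket, and
the tolerated idle-CPU gap between the sockets grows with it.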

> 
> 
> The numbers with STREAM, tbench and YCSB + MongoDB are as follows:
> 
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Stream with 16 threads.
> built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10
> Zen3, 64C128T per socket, 2 sockets,
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> NPS=1
> Test:     tip/sched/core                 mel-v3                    mel-v4
>  Copy:    113716.62 (0.00 pct)     218961.59 (92.55 pct)     217130.07 (90.93 pct)
> Scale:    110996.89 (0.00 pct)     216674.73 (95.20 pct)     220765.94 (98.89 pct)
>   Add:    124504.19 (0.00 pct)     253461.32 (103.57 pct)    260273.88 (109.04 pct)
> Triad:    122890.43 (0.00 pct)     247552.00 (101.44 pct)    252615.62 (105.56 pct)
> 
> 
> NPS=2
> Test:     tip/sched/core                 mel-v3                     mel-v4
>  Copy:    58217.00 (0.00 pct)      204630.34 (251.49 pct)     191312.73 (228.62 pct)
> Scale:    55004.76 (0.00 pct)      212142.88 (285.68 pct)     175499.15 (219.06 pct)
>   Add:    63269.04 (0.00 pct)      254752.56 (302.64 pct)     203571.50 (221.75 pct)
> Triad:    62178.25 (0.00 pct)      247290.80 (297.71 pct)     198988.70 (220.02 pct)
> 
> NPS=4
> Test:     tip/sched/core                 mel-v3                     mel-v4
>  Copy:    37986.66 (0.00 pct)      254183.87 (569.13 pct)     48748.87 (28.33 pct)
> Scale:    35471.22 (0.00 pct)      237804.76 (570.41 pct)     48317.82 (36.21 pct)
>   Add:    39303.25 (0.00 pct)      292285.20 (643.66 pct)     54259.59 (38.05 pct)
> Triad:    39319.85 (0.00 pct)      285284.30 (625.54 pct)     54503.98 (38.61 pct)
> 

At minimum, v3 is a failure because a single pair of communicating
tasks was getting split across NUMA domains, and the allowed NUMA
imbalance was cut off too early because of the change to
allow_numa_imbalance. So while it's a valid comparison, it's
definitely not the fix.

Given how you describe NPS, maybe the scaling should only start at the
point where tasks are no longer balanced between sibling domains. Can
you try this? I've only boot-tested it at this point. It should work
for STREAM at least but will probably not be great for tbench.


diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index bacec575ade2..1fa3e977521d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2255,26 +2255,38 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 
 			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
 			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
-				struct sched_domain *top = sd;
+				struct sched_domain *top, *top_p;
 				unsigned int llc_sq;
 
 				/*
-				 * nr_llcs = (top->span_weight / llc_weight);
-				 * imb = (child_weight / nr_llcs) >> 2
+				 * nr_llcs = (sd->span_weight / llc_weight);
+				 * imb = (llc_weight / nr_llcs) >> 2
 				 *
 				 * is equivalent to
 				 *
-				 * imb = (llc_weight^2 / top->span_weight) >> 2
+				 * imb = (llc_weight^2 / sd->span_weight) >> 2
 				 *
 				 */
 				llc_sq = child->span_weight * child->span_weight;
 
-				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
-				imb_span = sd->span_weight;
-
+				imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
 				sd->imb_numa_nr = imb;
+
+				/*
+				 * Set span based on top domain that places
+				 * tasks in sibling domains.
+				 */
+				top = sd;
+				top_p = top->parent;
+				while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
+					top = top->parent;
+					top_p = top->parent;
+				}
+				imb_span = top_p ? top_p->span_weight : sd->span_weight;
 			} else {
-				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
+				int factor = max(1U, (sd->span_weight / imb_span));
+
+				sd->imb_numa_nr = imb * factor;
 			}
 		}
 	}
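
If I have traced it right, on your Zen3 NPS=2 topology (MC(16) ->
NODE(64) -> NUMA(128) -> NUMA(256), assuming only the non-NUMA levels
set SD_PREFER_SIBLING as you noted), the new calculation works out as:

	NODE(64):  imb = max(2, ((16 * 16) / 64) >> 2) = 2
	           top_p = NUMA(128) lacks SD_PREFER_SIBLING,
	           so imb_span = 128
	NUMA(128): imb_numa_nr = 2 * max(1, 128 / 128) = 2
	NUMA(256): imb_numa_nr = 2 * max(1, 256 / 128) = 4

The same values should fall out for NPS=4, so the topmost imb_numa_nr
would no longer grow with NPS the way your results showed.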
