From: Ingo Molnar <mingo@kernel.org>
To: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: linux-kernel@vger.kernel.org,
Peter Zijlstra <peterz@infradead.org>,
Vincent Guittot <vincent.guittot@linaro.org>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Valentin Schneider <vschneid@redhat.com>
Subject: Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag
Date: Tue, 12 Mar 2024 11:57:49 +0100
Message-ID: <ZfA1LRq1d2ueoSRm@gmail.com>
In-Reply-To: <41e11090-a100-48a7-a0dd-c989772822d7@linux.ibm.com>

* Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
> > I think we should probably do something about this contention on this
> > large system: especially if #2 'no work to be done' bailout is the
> > common case.
>
>
> I have been wondering whether it would be right to move this balancing
> trylock/atomic after should_we_balance() (swb). This significantly reduces
> the number of times the flag is checked/updated. Contention is still
> present: it is possible at higher utilization when there are multiple NUMA
> domains. One CPU in each NUMA domain can contend if their invocations are
> aligned.

Note that it's not true contention: it simply means there are overlapping
requests for the highest domains to be balanced, for which we only have a
single thread of execution at a time, system-wide.

> That makes sense since, right now, a CPU takes the lock, checks if it can
> balance, balances if so, and then releases the lock. If the lock is
> taken after swb, the CPU still checks if it can balance, then
> tries to take the lock, and releases it if it got it. If the lock is
> contended, it bails out of load_balance(). That is the current behaviour as
> well, unless I am completely wrong.
>
> Not sure in which scenarios that would hurt. We could do this after this
> series. This may need wider functional testing to make sure we don't
> regress badly in some cases. This is only an *idea* as of now.
>
> Perf probes at spin_trylock and spin_unlock codepoints on the same 224CPU, 6 NUMA node system.
> 6.8-rc6
> -----------------------------------------
> idle system:
> 449 probe:rebalance_domains_L37
> 377 probe:rebalance_domains_L55
> stress-ng --cpu=$(nproc) -l 51 << 51% load
> 88K probe:rebalance_domains_L37
> 77K probe:rebalance_domains_L55
> stress-ng --cpu=$(nproc) -l 100 << 100% load
> 41K probe:rebalance_domains_L37
> 10K probe:rebalance_domains_L55
>
> +below patch
> ----------------------------------------
> idle system:
> 462 probe:load_balance_L35
> 394 probe:load_balance_L274
> stress-ng --cpu=$(nproc) -l 51 << 51% load
> 5K probe:load_balance_L35 <<-- almost 15x less
> 4K probe:load_balance_L274
> stress-ng --cpu=$(nproc) -l 100 << 100% load
> 8K probe:load_balance_L35
> 3K probe:load_balance_L274 <<-- almost 4x less
That's nice.
> +static DEFINE_SPINLOCK(balancing);
> /*
> * Check this_cpu to ensure it is balanced within domain. Attempt to move
> * tasks if there is an imbalance.
> @@ -11286,6 +11287,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> struct rq *busiest;
> struct rq_flags rf;
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
> + int need_serialize;
> struct lb_env env = {
> .sd = sd,
> .dst_cpu = this_cpu,
> @@ -11308,6 +11310,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> goto out_balanced;
> }
>
> + need_serialize = sd->flags & SD_SERIALIZE;
> + if (need_serialize) {
> + if (!spin_trylock(&balancing))
> + goto lockout;
> + }
> +
> group = find_busiest_group(&env);
So if I'm reading your patch right, the main difference appears to be that
it allows the should_we_balance() check to be executed in parallel, and
will only try to take the NUMA-balancing flag if that function indicates an
imbalance.
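
To make the difference in ordering concrete, here is a minimal userspace
sketch using C11 atomics. It is not the kernel code: balance_running,
flag_attempts and the two balance_*() helpers are hypothetical names made
up for this illustration, and the 'should_balance' argument stands in for
the result of should_we_balance().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the system-wide balancing flag. */
static atomic_int balance_running;

/* Counts attempts to take the flag, so the two orderings can be compared. */
static atomic_ulong flag_attempts;

/* Try to take the flag; returns true if this caller now owns it. */
static bool try_take_flag(void)
{
	int expected = 0;

	atomic_fetch_add_explicit(&flag_attempts, 1, memory_order_relaxed);
	return atomic_compare_exchange_strong_explicit(&balance_running,
			&expected, 1,
			memory_order_acquire, memory_order_relaxed);
}

/* Release the flag; only the current owner may call this. */
static void release_flag(void)
{
	assert(atomic_load_explicit(&balance_running, memory_order_relaxed) == 1);
	atomic_store_explicit(&balance_running, 0, memory_order_release);
}

/* Flag first: every caller touches the flag, even with no work to do. */
static bool balance_flag_first(bool should_balance)
{
	bool balanced = false;

	if (!try_take_flag())
		return false;
	if (should_balance)
		balanced = true;	/* the balancing pass would run here */
	release_flag();
	return balanced;
}

/* Check first: the flag is only touched once the check says 'go ahead'. */
static bool balance_check_first(bool should_balance)
{
	if (!should_balance)
		return false;		/* bail out without touching the flag */
	if (!try_take_flag())
		return false;		/* someone else is balancing: bail out */
	/* the balancing pass would run here */
	release_flag();
	return true;
}
```

With this sketch, a caller for which the check fails bumps flag_attempts in
balance_flag_first() but not in balance_check_first(), which is the kind of
reduction the probe counts above would be expected to show.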
Since should_we_balance() isn't taking any locks AFAICS, this might be a
valid approach. What might make sense is to instrument the percentage of
NUMA-balancing flag-taking 'failures' vs. successful attempts - not
necessarily the 'contention percentage'.
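
A rough sketch of what such instrumentation could compute, in plain C.
The struct and helper names are hypothetical, not existing schedstat
fields, and a real version would need per-CPU or atomic counters:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counters for attempts to take the balancing flag. */
struct flag_stats {
	unsigned long taken;	/* attempts that got the flag */
	unsigned long failed;	/* attempts that bailed out */
};

/* Record the outcome of one attempt to take the flag. */
static void record_attempt(struct flag_stats *s, bool acquired)
{
	assert(s != 0);
	if (acquired)
		s->taken++;
	else
		s->failed++;
}

/* Percentage of attempts that failed; 0 if there were no attempts. */
static double failure_pct(const struct flag_stats *s)
{
	unsigned long total = s->taken + s->failed;

	if (!total)
		return 0.0;
	return 100.0 * (double)s->failed / (double)total;
}
```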

But another question is, why do we get here so frequently, so that the
cumulative execution time of these SD_SERIALIZE rebalance passes exceeds
100% of a single CPU's time? I.e. a single CPU is basically continuously
scanning the scheduler data structures for imbalances, right? That doesn't
seem natural even with just ~224 CPUs.

Alternatively, is perhaps the execution time of the SD_SERIALIZE pass so
large that we exceed 100% CPU time?
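
The two possibilities can be told apart with back-of-the-envelope
arithmetic: the share of one CPU demanded by the serialized passes is the
request rate times the average pass duration. A small sketch, where the
function name is hypothetical and both inputs would have to be measured
(no such figures are given in this thread):

```c
#include <assert.h>

/*
 * Fraction of a single CPU's time demanded by serialized passes:
 * passes requested per second times average pass duration in ns.
 * A result above 1.0 means more work is requested than one thread of
 * execution can supply, so some requests must fail to get the flag.
 */
static double serialized_cpu_share(double passes_per_sec, double avg_pass_ns)
{
	assert(passes_per_sec >= 0.0 && avg_pass_ns >= 0.0);
	return passes_per_sec * avg_pass_ns / 1e9;
}
```

For example, with made-up inputs of 10,000 requests per second and 100 usecs
per pass, the demand is exactly one full CPU; the same share is reached by
fewer, longer passes or by more, shorter ones.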
Thanks,
Ingo