linux-kernel.vger.kernel.org archive mirror
From: Ingo Molnar <mingo@kernel.org>
To: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: linux-kernel@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Valentin Schneider <vschneid@redhat.com>
Subject: Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag
Date: Tue, 12 Mar 2024 11:57:49 +0100
Message-ID: <ZfA1LRq1d2ueoSRm@gmail.com>
In-Reply-To: <41e11090-a100-48a7-a0dd-c989772822d7@linux.ibm.com>


* Shrikanth Hegde <sshegde@linux.ibm.com> wrote:

> > I think we should probably do something about this contention on this 
> > large system: especially if #2 'no work to be done' bailout is the 
> > common case.
> 
> 
> I have been wondering whether it would be right to move this balancing 
> trylock/atomic after should_we_balance() (swb). This significantly reduces 
> the number of times the flag is checked/updated. Contention is still 
> present, though: it is possible at higher utilization when there are 
> multiple NUMA domains. One CPU in each NUMA domain can contend if their 
> invocations are aligned.

Note that it's not true contention: it simply means there are overlapping 
requests for the highest domains to be balanced, for which we only have a 
single thread of execution at a time, system-wide.

> That makes sense, since right now a CPU takes the lock, checks if it can 
> balance, balances if so, and then releases the lock. If the lock is taken 
> after swb, the CPU still checks if it can balance, then tries to take the 
> lock, and releases it if the trylock succeeded. If the lock is contended, 
> it bails out of load_balance(). That is the current behaviour as well, 
> unless I am completely wrong.
> 
> Not sure in which scenarios that would hurt. We could do this after this 
> series. This may need wider functional testing to make sure we don't 
> regress badly in some cases. This is only an *idea* as of now.
> 
> Perf probes at spin_trylock and spin_unlock codepoints on the same 224-CPU, 6 NUMA node system.
> 6.8-rc6
> -----------------------------------------
> idle system:
> 449 probe:rebalance_domains_L37
> 377 probe:rebalance_domains_L55
> stress-ng --cpu=$(nproc) -l 51     << 51% load
> 88K probe:rebalance_domains_L37
> 77K probe:rebalance_domains_L55
> stress-ng --cpu=$(nproc) -l 100    << 100% load
> 41K probe:rebalance_domains_L37
> 10K probe:rebalance_domains_L55
>
> +below patch
> ----------------------------------------
> idle system:
> 462 probe:load_balance_L35
> 394 probe:load_balance_L274
> stress-ng --cpu=$(nproc) -l 51     << 51% load
> 5K probe:load_balance_L35          <<-- almost 15x less
> 4K probe:load_balance_L274
> stress-ng --cpu=$(nproc) -l 100    << 100% load
> 8K probe:load_balance_L35
> 3K probe:load_balance_L274         <<-- almost 4x less

That's nice.

> +static DEFINE_SPINLOCK(balancing);
>  /*
>   * Check this_cpu to ensure it is balanced within domain. Attempt to move
>   * tasks if there is an imbalance.
> @@ -11286,6 +11287,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  	struct rq *busiest;
>  	struct rq_flags rf;
>  	struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
> +	int need_serialize;
>  	struct lb_env env = {
>  		.sd		= sd,
>  		.dst_cpu	= this_cpu,
> @@ -11308,6 +11310,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  		goto out_balanced;
>  	}
> 
> +	need_serialize = sd->flags & SD_SERIALIZE;
> +	if (need_serialize) {
> +		if (!spin_trylock(&balancing))
> +			goto lockout;
> +	}
> +
>  	group = find_busiest_group(&env);

So if I'm reading your patch right, the main difference appears to be that 
it allows the should_we_balance() check to be executed in parallel, and 
will only try to take the NUMA-balancing flag if that function indicates 
that this CPU should do the balancing.
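
To make the ordering concrete, here is a minimal user-space sketch of that 
flow, using C11 atomics as a stand-in for the kernel's atomic_t API. The 
helper names, the enum and the two boolean parameters are hypothetical and 
only model the control flow; this is not the actual fair.c code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* User-space stand-in for the 'sched_balance_running' flag: 0 = free, 1 = held. */
static atomic_int sched_balance_running;

/* Try to become the single system-wide serialized balancer. Never spins. */
static bool balance_tryacquire(void)
{
	int expected = 0;

	return atomic_compare_exchange_strong_explicit(&sched_balance_running,
						       &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}

/* Drop the flag; the caller must currently hold it. */
static void balance_release(void)
{
	int prev = atomic_exchange_explicit(&sched_balance_running, 0,
					    memory_order_release);

	assert(prev == 1);
	(void)prev;	/* Silence the unused warning when NDEBUG is set. */
}

/* Hypothetical outcomes of one modelled load_balance() pass. */
enum lb_result { LB_NOT_OUR_TURN, LB_LOCKED_OUT, LB_BALANCED };

/*
 * Model of the proposed ordering: the should_we_balance() result is
 * evaluated first, without the flag, and the flag is only taken when
 * this CPU is the one that should balance a serialized domain.
 */
static enum lb_result load_balance_model(bool should_balance, bool need_serialize)
{
	if (!should_balance)
		return LB_NOT_OUR_TURN;		/* Flag is never touched. */

	if (need_serialize && !balance_tryacquire())
		return LB_LOCKED_OUT;		/* Another CPU is balancing. */

	/* find_busiest_group() and the rest of the pass would run here. */

	if (need_serialize)
		balance_release();

	return LB_BALANCED;
}
```

In this model a CPU that fails the first check returns without ever 
reading or writing the shared flag, which is where the reduction in flag 
traffic would come from.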

Since should_we_balance() isn't taking any locks AFAICS, this might be a 
valid approach. What might make sense is to instrument the percentage of 
NUMA-balancing flag-taking 'failures' vs. successful attempts - not 
necessarily the 'contention percentage'.
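
A rough sketch of what such instrumentation could compute, as plain 
user-space C. The struct and function names are hypothetical, and a real 
version would presumably use per-CPU counters or schedstats rather than a 
single shared struct:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counters for attempts to take the balancing flag. */
struct flag_stats {
	uint64_t acquired;	/* Attempts that took the flag. */
	uint64_t failed;	/* Attempts that found the flag already held. */
};

/* Record the outcome of one attempt to take the flag. */
static void flag_stats_record(struct flag_stats *s, bool got_flag)
{
	assert(s);

	if (got_flag)
		s->acquired++;
	else
		s->failed++;
}

/* Percentage of attempts that failed; 0.0 if nothing was recorded yet. */
static double flag_failure_pct(const struct flag_stats *s)
{
	uint64_t attempts;

	assert(s);

	attempts = s->acquired + s->failed;
	if (!attempts)
		return 0.0;

	return 100.0 * (double)s->failed / (double)attempts;
}
```

If one assumes that the first probe in each pair above counts trylock 
attempts and the second counts unlocks, i.e. successful acquisitions, then 
the 51% load run on 6.8-rc6 would correspond to roughly 11K failures out of 
88K attempts, or about 12.5%.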

But another question is: why do we get here so frequently that the 
cumulative execution time of these SD_SERIALIZE rebalance passes exceeds 
100% of a single CPU's time? I.e. a single CPU is basically continuously 
scanning the scheduler data structures for imbalances, right? That doesn't 
seem natural even with just ~224 CPUs.

Alternatively, is perhaps the execution time of a single SD_SERIALIZE pass 
so large that we exceed 100% CPU time?

Thanks,

	Ingo
