Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag

From: Ingo Molnar <mingo@kernel.org>
To: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: linux-kernel@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Valentin Schneider <vschneid@redhat.com>
Date: Tue, 12 Mar 2024 11:57:49 +0100
Message-ID: <ZfA1LRq1d2ueoSRm@gmail.com>
In-Reply-To: <41e11090-a100-48a7-a0dd-c989772822d7@linux.ibm.com>

* Shrikanth Hegde <sshegde@linux.ibm.com> wrote:

> > I think we should probably do something about this contention on this 
> > large system: especially if #2 'no work to be done' bailout is the 
> > common case.
> 
> 
> I have been wondering whether it would be right to move this balancing
> trylock/atomic after should_we_balance() (swb). This significantly
> reduces the number of times it is checked/updated. Contention is still
> present; that's possible at higher utilization when there are multiple
> NUMA domains: one CPU in each NUMA domain can contend if their
> invocations are aligned.

Note that it's not true contention: it simply means there are overlapping
requests for the highest domains to be balanced, for which we only have a
single thread of execution at a time, system-wide.
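
To illustrate those semantics, here's a minimal userspace sketch. It
uses C11 atomics rather than the kernel's atomic_t API, and the names
balance_running, balance_try_begin() and balance_end() are made up for
this example, they are not from the series:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a system-wide 'running' flag. */
static atomic_int balance_running;

/*
 * Try to become the single system-wide balancer. This never spins:
 * if another thread already holds the flag we simply return false
 * and the caller is expected to skip this pass.
 */
static bool balance_try_begin(void)
{
	int expected = 0;

	/* Acquire on success pairs with the release in balance_end(). */
	return atomic_compare_exchange_strong_explicit(&balance_running,
						       &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}

/* Drop the flag so the next requester can start its pass. */
static void balance_end(void)
{
	atomic_store_explicit(&balance_running, 0, memory_order_release);
}
```

A failed balance_try_begin() here means 'somebody else is already doing
this pass', so the loser bails out instead of waiting, which is why I
wouldn't call it contention in the usual sense.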

> That makes sense, since right now a CPU takes the lock, checks whether
> it can balance, balances if so, and then releases the lock. If the lock
> is taken after swb, the CPU likewise checks whether it can balance,
> tries to take the lock, and releases it if it got it. If the lock is
> contended, it bails out of load_balance(). That is the current behaviour
> as well, unless I am completely wrong.
> 
> Not sure in which scenarios that would hurt. We could do this after this 
> series. This may need wider functional testing to make sure we don't 
> regress badly in some cases. This is only an *idea* as of now.
> 
> Perf probes at the spin_trylock and spin_unlock code points on the same 224-CPU, 6 NUMA node system.
> 6.8-rc6
> -----------------------------------------
> idle system:
> 449 probe:rebalance_domains_L37
> 377 probe:rebalance_domains_L55
> stress-ng --cpu=$(nproc) -l 51     << 51% load
> 88K probe:rebalance_domains_L37
> 77K probe:rebalance_domains_L55
> stress-ng --cpu=$(nproc) -l 100    << 100% load
> 41K probe:rebalance_domains_L37
> 10K probe:rebalance_domains_L55
>
> +below patch
> ----------------------------------------
> idle system:
> 462 probe:load_balance_L35
> 394 probe:load_balance_L274
> stress-ng --cpu=$(nproc) -l 51     << 51% load
> 5K probe:load_balance_L35          <<-- almost 15x less
> 4K probe:load_balance_L274
> stress-ng --cpu=$(nproc) -l 100    << 100% load
> 8K probe:load_balance_L35
> 3K probe:load_balance_L274         <<-- almost 4x less

That's nice.

> +static DEFINE_SPINLOCK(balancing);
>  /*
>   * Check this_cpu to ensure it is balanced within domain. Attempt to move
>   * tasks if there is an imbalance.
> @@ -11286,6 +11287,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  	struct rq *busiest;
>  	struct rq_flags rf;
>  	struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
> +	int need_serialize;
>  	struct lb_env env = {
>  		.sd		= sd,
>  		.dst_cpu	= this_cpu,
> @@ -11308,6 +11310,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  		goto out_balanced;
>  	}
> 
> +	need_serialize = sd->flags & SD_SERIALIZE;
> +	if (need_serialize) {
> +		if (!spin_trylock(&balancing))
> +			goto lockout;
> +	}
> +
>  	group = find_busiest_group(&env);

So if I'm reading your patch right, the main difference appears to be that 
it allows the should_we_balance() check to be executed in parallel, and 
will only try to take the NUMA-balancing flag if that function indicates 
that this CPU should do the balancing.

Since should_we_balance() isn't taking any locks AFAICS, this might be a 
valid approach. What might make sense is to instrument the percentage of 
NUMA-balancing flag-taking 'failures' vs. successful attempts - not 
necessarily the 'contention percentage'.
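
Something along these lines, as a rough userspace sketch of the
bookkeeping only (struct balance_stats and both helpers are
hypothetical names, and an in-kernel version would more likely use
schedstats or per-CPU counters):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counters for flag-taking attempts and failures. */
struct balance_stats {
	atomic_ulong attempts;
	atomic_ulong failures;
};

/* Call once per attempt; 'acquired' is the result of the trylock. */
static void balance_stats_record(struct balance_stats *s, bool acquired)
{
	/* Relaxed ordering is enough: these are statistics only. */
	atomic_fetch_add_explicit(&s->attempts, 1, memory_order_relaxed);
	if (!acquired)
		atomic_fetch_add_explicit(&s->failures, 1,
					  memory_order_relaxed);
}

/* Percentage of attempts that did not get the flag. */
static double balance_failure_pct(unsigned long attempts,
				  unsigned long failures)
{
	return attempts ? 100.0 * failures / attempts : 0.0;
}
```

As a rough reading of your numbers, and assuming the first probe of
each pair is the trylock and the second the unlock: at 100% load 41K
attempts vs. 10K unlocks would be about 76% failures on 6.8-rc6, and 8K
vs. 3K about 63% with your patch. The counts are rounded to the nearest
K, so these are only approximate.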

But another question is: why do we get here so frequently that the 
cumulative execution time of these SD_SERIALIZE rebalance passes exceeds 
100% of a single CPU's time? I.e. a single CPU is basically continuously 
scanning the scheduler data structures for imbalances, right? That doesn't 
seem natural even with just ~224 CPUs.

Alternatively, is perhaps the execution time of a single SD_SERIALIZE pass 
so large that we exceed 100% CPU time?
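
The two cases can be told apart with simple arithmetic: the demand on
the serialized section is the number of requested passes times the
average pass duration, divided by the measurement window. A sketch of
that calculation (serialized_demand() is a made-up helper, and none of
its inputs have been measured in this thread so far):

```c
#include <stdint.h>

/*
 * Fraction of one CPU's time that the requested serialized passes
 * would need if none of them were skipped. A result above 1.0 means
 * the requests cannot all be served by one thread of execution, so
 * some of them must fail to take the flag.
 */
static double serialized_demand(uint64_t requested_passes,
				uint64_t avg_pass_ns,
				uint64_t window_ns)
{
	if (!window_ns)
		return 0.0;

	return (double)requested_passes * (double)avg_pass_ns /
	       (double)window_ns;
}
```

With purely hypothetical inputs, 1000 requested passes of 2 msecs each
within a 1 second window give a demand of 2.0. The same result can come
from a high request rate or from a long per-pass time, which is why
both would need to be measured.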

Thanks,

	Ingo