linux-kernel.vger.kernel.org archive mirror
From: Shrikanth Hegde <sshegde@linux.ibm.com>
To: Ingo Molnar <mingo@kernel.org>, linux-kernel@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Valentin Schneider <vschneid@redhat.com>
Subject: Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag
Date: Tue, 5 Mar 2024 16:41:58 +0530
Message-ID: <bf612672-f7c3-4585-ac31-e02a1ebf614c@linux.ibm.com>
In-Reply-To: <20240304094831.3639338-2-mingo@kernel.org>



On 3/4/24 3:18 PM, Ingo Molnar wrote:
> The 'balancing' spinlock added in:
> 
>   08c183f31bdb ("[PATCH] sched: add option to serialize load balancing")
> 

[...]

> 
>  
> -static DEFINE_SPINLOCK(balancing);
> +/*
> + * This flag serializes load-balancing passes over large domains
> + * (such as SD_NUMA) - only once load-balancing instance may run
> + * at a time, to reduce overhead on very large systems with lots
> + * of CPUs and large NUMA distances.
> + *
> + * - Note that load-balancing passes triggered while another one
> + *   is executing are skipped and not re-tried.
> + *
> + * - Also note that this does not serialize sched_balance_domains()
> + *   execution, as non-SD_SERIALIZE domains will still be
> + *   load-balanced in parallel.
> + */
> +static atomic_t sched_balance_running = ATOMIC_INIT(0);
>  
>  /*

Continuing the discussion on whether this balancing lock is
contended or not.


On a large system (1920 CPUs, 16 NUMA nodes), the cacheline containing the
balancing trylock was observed to be contended, and rebalance_domains showed up in the traces.

So I ran some experiments on a smaller system. This system has 224 CPUs and 6 NUMA nodes.
I added probe points in rebalance_domains, one at the trylock and one at the unlock. If the
lock is not contended, every trylock should succeed and the hit counts of both probe points
should match. If they do not match, there is contention.
Below are the system details and the output of perf probe -L rebalance_domains.

NUMA:                    
  NUMA node(s):          6
  NUMA node0 CPU(s):     0-31
  NUMA node1 CPU(s):     32-71
  NUMA node4 CPU(s):     72-111
  NUMA node5 CPU(s):     112-151
  NUMA node6 CPU(s):     152-183
  NUMA node7 CPU(s):     184-223


------------------------------------------------------------------------------------------------------------------
#perf probe -L rebalance_domains
<rebalance_domains@/shrikanth/sched_tip/kernel/sched/fair.c:0>
      0  static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
         {
      2         int continue_balancing = 1;
      3         int cpu = rq->cpu;
[...]


     33                 interval = get_sd_balance_interval(sd, busy);

                        need_serialize = sd->flags & SD_SERIALIZE;
     36                 if (need_serialize) {
     37                         if (!spin_trylock(&balancing))
                                        goto out;
                        }

     41                 if (time_after_eq(jiffies, sd->last_balance + interval)) {
     42                         if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {
                                        /*
                                         * The LBF_DST_PINNED logic could have changed
                                         * env->dst_cpu, so we can't know our idle
                                         * state even if we migrated tasks. Update it.
                                         */
     48                                 idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
     49                                 busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
                                }
     51                         sd->last_balance = jiffies;
     52                         interval = get_sd_balance_interval(sd, busy);
                        }
     54                 if (need_serialize)
     55                         spin_unlock(&balancing);
         out:
     57                 if (time_after(next_balance, sd->last_balance + interval)) {
                                next_balance = sd->last_balance + interval;
                                update_next_balance = 1;
                        }
                }

perf probe --list
  probe:rebalance_domains_L37 (on rebalance_domains+856)
  probe:rebalance_domains_L55 (on rebalance_domains+904)
------------------------------------------------------------------------------------------------------------------

Perf records were collected for 10 seconds at different system loads. The load was created using stress-ng.
Contention is calculated as (1 - L55/L37) * 100, where L37 and L55 are the hit counts of the two probes.
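
As a rough illustration of the formula above, here is a small C helper. The
function name and the guard for the "no attempts" case are my own additions
for this sketch, they are not part of perf or of the kernel.

```c
/*
 * Sketch only: percentage of trylock attempts that never reached the
 * unlock probe.
 *
 *   attempts  = hit count of probe:rebalance_domains_L37
 *   completed = hit count of probe:rebalance_domains_L55
 *
 * Returns 0 when there were no attempts, or when the counts are
 * inconsistent (completed >= attempts, e.g. after lost perf chunks).
 */
double contention_pct(unsigned long attempts, unsigned long completed)
{
	if (attempts == 0 || completed >= attempts)
		return 0.0;

	return (1.0 - (double)completed / (double)attempts) * 100.0;
}
```

Feeding it the rounded counts that perf prints (for example 168K and 147K)
only gives an approximate percentage, since the K-suffixed values are
already truncated.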

system is idle		<--	No contention
1K probe:rebalance_domains_L37
1K probe:rebalance_domains_L55


system is at 25% load		<--	4.4% contention
223K probe:rebalance_domains_L37: 1 chunks LOST!
213K probe:rebalance_domains_L55: 1 chunks LOST!


system is at 50% load		<--	12.5% contention
168K probe:rebalance_domains_L37
147K probe:rebalance_domains_L55


system is at 75% load		<--	25.6% contention
113K probe:rebalance_domains_L37
84K probe:rebalance_domains_L55


system is at 100% load		<--	87.5% contention
64K probe:rebalance_domains_L37
8K probe:rebalance_domains_L55


A few possible reasons for the contention:
1. An idle load balance is running while some other CPU is becoming idle and tries newidle_balance.
2. When the system is busy, every CPU does busy balancing and contends for the lock. A CPU that gets the
   lock may still not balance, because should_we_balance says this CPU need not balance. It bails out
   and releases the lock.
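
To make the measurement idea concrete, here is a userspace C11 sketch of a
try-acquire flag with two counters that play the role of the L37 and L55
probe points. This is only an analogy and not the kernel code: all names
(balance_running, balance_try_begin and so on) are hypothetical, and it
uses <stdatomic.h> rather than the kernel's atomic_t API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a global "balancing in progress" flag. */
static atomic_int balance_running;

/* Counters standing in for the two probe points (L37 and L55). */
static atomic_ulong attempts;
static atomic_ulong completed;

/*
 * Try to become the single balancer. Like a trylock, this never spins:
 * a caller that loses simply skips its balancing pass.
 */
bool balance_try_begin(void)
{
	int expected = 0;

	atomic_fetch_add_explicit(&attempts, 1, memory_order_relaxed);
	return atomic_compare_exchange_strong_explicit(&balance_running,
						       &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}

/* Count the completed pass and clear the flag for the next caller. */
void balance_end(void)
{
	atomic_fetch_add_explicit(&completed, 1, memory_order_relaxed);
	atomic_store_explicit(&balance_running, 0, memory_order_release);
}

unsigned long balance_attempts(void)
{
	return atomic_load(&attempts);
}

unsigned long balance_completed(void)
{
	return atomic_load(&completed);
}
```

In this model, a caller that wins the flag only to find it has nothing to
balance still makes every concurrent caller fail its attempt, which is the
pattern described in reason 2. The gap between the two counters then
corresponds to the gap between the L37 and L55 hit counts.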

