* [PATCH 0/1] sched/fair: allow disabling newidle_balance with sched_relax_domain_level
@ 2024-03-28  0:47 ` Vitalii Bursov
  0 siblings, 0 replies; 12+ messages in thread
From: Vitalii Bursov @ 2024-03-28  0:30 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Vitalii Bursov

Hi,

During an upgrade from Linux 5.4 we found a small (around 3%) 
performance regression, which we tracked down to commit 
c5b0a7eefc70150caf23e37bc9d639c68c87a097

    sched/fair: Remove sysctl_sched_migration_cost condition

    With a default value of 500us, sysctl_sched_migration_cost is
    significanlty higher than the cost of load_balance. Remove the
    condition and rely on the sd->max_newidle_lb_cost to abort
    newidle_balance.

https://lore.kernel.org/lkml/20211019123537.17146-6-vincent.guittot@linaro.org/T/#m248877858373980d8fffa7156931211a1f0d7c4c
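
As a rough standalone illustration of what changed (plain userspace C,
not the kernel code; the check shapes are paraphrased and the constants
are only example numbers, with the 5ms migration cost taken from a tuned
"performance" profile):

/*
 * Old behaviour: newidle balancing is skipped whenever the CPU's average
 * idle period is shorter than sysctl_sched_migration_cost (500us by
 * default, often raised much higher by tuning profiles).
 * New behaviour: the skip depends only on the measured cost of previous
 * newidle load_balance passes, so the balance runs far more often on
 * short idle periods.
 */
#include <stdbool.h>
#include <stdio.h>

static bool skip_newidle_old(long avg_idle_ns, long migration_cost_ns)
{
        return avg_idle_ns < migration_cost_ns;
}

static bool skip_newidle_new(long avg_idle_ns, long max_newidle_lb_cost_ns)
{
        return avg_idle_ns < max_newidle_lb_cost_ns;
}

int main(void)
{
        long avg_idle = 100000;              /* 100us of expected idle time */
        long migration_cost = 5000000;       /* 5ms, e.g. as set by tuned   */
        long max_newidle_lb_cost = 4590;     /* measured, see debugfs below */

        printf("old check skips newidle balance: %d\n",
               skip_newidle_old(avg_idle, migration_cost));       /* 1 */
        printf("new check skips newidle balance: %d\n",
               skip_newidle_new(avg_idle, max_newidle_lb_cost));  /* 0 */
        return 0;
}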


Looks like "newidle" balancing is beneficial for a lot of workloads, 
just not for this specific one. The workload is video encoding, there 
are 100s-1000s of threads, some are synchonized with mutexes and 
conditional variables. The process aims to have a portion of CPU idle, 
so no CPU cores are 100% busy. Perhaps, the performance impact we see 
comes from additional processing in the scheduler and additional cost 
like more cache misses, and not from an incorrect balancing. See
perf output below.

My understanding is that the "sched_relax_domain_level" cgroup parameter 
should control whether newidle_balance() is called and what the scope
of the balancing is, but it doesn't fully work for this case.

cpusets.rst documentation:
> The 'cpuset.sched_relax_domain_level' file allows you to request changing
> this searching range as you like.  This file takes int value which
> indicates size of searching range in levels ideally as follows,
> otherwise initial value -1 that indicates the cpuset has no request.
>  
> ====== ===========================================================
>   -1   no request. use system default or follow request of others.
>    0   no search.
>    1   search siblings (hyperthreads in a core).
>    2   search cores in a package.
>    3   search cpus in a node [= system wide on non-NUMA system]
>    4   search nodes in a chunk of node [on NUMA system]
>    5   search system wide [on NUMA system]
> ====== ===========================================================

Setting cpuset.sched_relax_domain_level to 0 behaves the same as 
setting it to 1.

On a dual-CPU server, domains and levels are as follows:
  domain 0: level 0, SMT
  domain 1: level 2, MC
  domain 2: level 5, NUMA

So, to support "0 no search", cpuset.sched_relax_domain_level should 
disable SD_BALANCE_NEWIDLE for the specified level and above, keeping it 
enabled only for lower levels. For example, the SMT level is 0, so 
sched_relax_domain_level=0 should exclude levels >= 0.

Instead, cpuset.sched_relax_domain_level keeps the specified level 
itself enabled, which effectively removes the "no search" option. See 
below for the domain flags at each cpuset.sched_relax_domain_level value.

The proposed patch allows clearing the SD_BALANCE_NEWIDLE flag on all 
levels when cpuset.sched_relax_domain_level is set to 0 and extends the 
maximum accepted value beyond sched_domain_level_max. The latter allows 
setting SD_BALANCE_NEWIDLE on all levels and overriding the platform 
default if it does not include all levels.
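
To make the behavior difference concrete, below is a small standalone
simulation (plain userspace C, not kernel code) of which domains keep
SD_BALANCE_NEWIDLE on the topology above (levels 0, 2 and 5) under the
current "sd->level > request" check versus the proposed
"sd->level >= request" check:

#include <stdio.h>

static const int levels[] = { 0, 2, 5 };
static const char *names[] = { "SMT", "MC", "NUMA" };

static void show(int request, int patched)
{
        printf("request=%d %s:", request,
               patched ? "(patched, level >= request)"
                       : "(current, level >  request)");
        for (int i = 0; i < 3; i++) {
                int cleared = patched ? levels[i] >= request
                                      : levels[i] >  request;
                printf("  %s=%s", names[i], cleared ? "off" : "NEWIDLE");
        }
        printf("\n");
}

int main(void)
{
        /*
         * request=0 with the current check still leaves NEWIDLE on the
         * SMT domain ("0 works as 1"); with the patched check it is
         * cleared on every domain.
         */
        for (int request = 0; request <= 6; request++) {
                show(request, 0);
                show(request, 1);
        }
        return 0;
}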

Thanks

=========================
Perf output for a similar workload/test case shows that newidle_balance 
is called when handling futex and nanosleep syscalls:
8.74%     0.40%  a.out    [kernel.vmlinux]    [k] entry_SYSCALL_64
8.34% entry_SYSCALL_64
 - do_syscall_64
    - 5.50% __x64_sys_futex
       - 5.42% do_futex
          - 3.79% futex_wait
             - 3.74% __futex_wait
                - 3.53% futex_wait_queue
                   - 3.45% schedule
                      - 3.43% __schedule
                         - 2.06% pick_next_task
                            - 1.93% pick_next_task_fair
                               - 1.87% newidle_balance
                                  - 1.52% load_balance
                                     - 1.16% find_busiest_group
                                        - 1.13% update_sd_lb_stats.constprop.0
                                             1.01% update_sg_lb_stats
                         - 0.83% dequeue_task_fair
                              0.66% dequeue_entity
          - 1.57% futex_wake
             - 1.22% wake_up_q
                - 1.20% try_to_wake_up
                     0.58% select_task_rq_fair
    - 2.44% __x64_sys_nanosleep
       - 2.36% hrtimer_nanosleep
          - 2.33% do_nanosleep
             - 2.05% schedule
                - 2.03% __schedule
                   - 1.23% pick_next_task
                      - 1.15% pick_next_task_fair
                         - 1.12% newidle_balance
                            - 0.90% load_balance
                               - 0.68% find_busiest_group
                                  - 0.66% update_sd_lb_stats.constprop.0
                                       0.59% update_sg_lb_stats
                     0.52% dequeue_task_fair

When newidle_balance is disabled (or when using older kernels), perf
output is:
6.37%     0.41%  a.out    [kernel.vmlinux]    [k] entry_SYSCALL_64
5.96% entry_SYSCALL_64
 - do_syscall_64
    - 3.97% __x64_sys_futex
       - 3.89% do_futex
          - 2.32% futex_wait
             - 2.27% __futex_wait
                - 2.05% futex_wait_queue
                   - 1.98% schedule
                      - 1.96% __schedule
                         - 0.81% dequeue_task_fair
                              0.66% dequeue_entity
                         - 0.64% pick_next_task
                              0.51% pick_next_task_fair
          - 1.52% futex_wake
             - 1.15% wake_up_q
                - try_to_wake_up
                     0.59% select_task_rq_fair
    - 1.58% __x64_sys_nanosleep
       - 1.52% hrtimer_nanosleep
          - 1.48% do_nanosleep
             - 1.20% schedule
                - 1.19% __schedule
                     0.53% dequeue_task_fair


Without a patch:
=========================
CPUs: 2 Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

# uname -r
6.8.1

# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 63962 MB
node 0 free: 59961 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 64446 MB
node 1 free: 63338 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

# head /proc/schedstat 
version 15
timestamp 4295347219
cpu0 0 0 0 0 0 0 3035466036 858375615 67578
domain0 0000,01000001 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
domain1 000f,ff000fff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
domain2 ffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...

# cd /sys/kernel/debug/sched/domains
# echo -1 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{name,flags,groups_flags,max_newidle_lb_cost}
cpu0/domain0/name:SMT
cpu0/domain1/name:MC
cpu0/domain2/name:NUMA

cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK 
                            SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY 
                            SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK 
                            SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES 
                            SD_PREFER_SIBLING 
cpu0/domain2/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK 
                            SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP 
                            SD_NUMA
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC 
                            SD_BALANCE_FORK SD_WAKE_AFFINE 
                            SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES 
                            SD_PREFER_SIBLING 
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC 
                            SD_BALANCE_FORK SD_WAKE_AFFINE 
                            SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING 
cpu0/domain0/max_newidle_lb_cost:2236
cpu0/domain1/max_newidle_lb_cost:3444
cpu0/domain2/max_newidle_lb_cost:4590

# echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags,max_newidle_lb_cost}
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK 
                            SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY 
                            SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING 
cpu0/domain1/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE 
                            SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING 
cpu0/domain2/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE 
                            SD_SERIALIZE SD_OVERLAP SD_NUMA 
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC 
                            SD_BALANCE_FORK SD_WAKE_AFFINE 
                            SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES 
                            SD_PREFER_SIBLING 
cpu0/domain2/groups_flags:SD_BALANCE_EXEC SD_BALANCE_FORK 
                            SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES 
                            SD_PREFER_SIBLING 
cpu0/domain0/max_newidle_lb_cost:0
cpu0/domain1/max_newidle_lb_cost:0
cpu0/domain2/max_newidle_lb_cost:0

# echo 1 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags,max_newidle_lb_cost}

cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK 
                            SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY 
                            SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING 
cpu0/domain1/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE 
                            SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING 
cpu0/domain2/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE 
                            SD_SERIALIZE SD_OVERLAP SD_NUMA 
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC 
                            SD_BALANCE_FORK SD_WAKE_AFFINE 
                            SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES 
                            SD_PREFER_SIBLING 
cpu0/domain2/groups_flags:SD_BALANCE_EXEC SD_BALANCE_FORK 
                            SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES 
                            SD_PREFER_SIBLING 
cpu0/domain0/max_newidle_lb_cost:309
cpu0/domain1/max_newidle_lb_cost:0
cpu0/domain2/max_newidle_lb_cost:0

# echo 2 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags,max_newidle_lb_cost}

cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK 
                            SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY 
                            SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING 
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK 
                            SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES 
                            SD_PREFER_SIBLING 
cpu0/domain2/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE 
                            SD_SERIALIZE SD_OVERLAP SD_NUMA 
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC 
                            SD_BALANCE_FORK SD_WAKE_AFFINE 
                            SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES 
                            SD_PREFER_SIBLING 
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC 
                            SD_BALANCE_FORK SD_WAKE_AFFINE 
                            SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING 
cpu0/domain0/max_newidle_lb_cost:276
cpu0/domain1/max_newidle_lb_cost:2776
cpu0/domain2/max_newidle_lb_cost:0

# echo 3 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags,max_newidle_lb_cost}
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK 
                            SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY 
                            SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING 
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK 
                            SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES 
                            SD_PREFER_SIBLING 
cpu0/domain2/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE 
                            SD_SERIALIZE SD_OVERLAP SD_NUMA 
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC 
                            SD_BALANCE_FORK SD_WAKE_AFFINE 
                            SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES 
                            SD_PREFER_SIBLING 
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC 
                            SD_BALANCE_FORK SD_WAKE_AFFINE 
                            SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING 
cpu0/domain0/max_newidle_lb_cost:289
cpu0/domain1/max_newidle_lb_cost:3192
cpu0/domain2/max_newidle_lb_cost:0

# echo 4 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags,max_newidle_lb_cost}
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK 
                            SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY 
                            SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING 
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK 
                            SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES 
                            SD_PREFER_SIBLING 
cpu0/domain2/flags:SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE 
                            SD_SERIALIZE SD_OVERLAP SD_NUMA 
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC 
                            SD_BALANCE_FORK SD_WAKE_AFFINE 
                            SD_SHARE_CPUCAPACITY SD_SHARE_PKG_RESOURCES 
                            SD_PREFER_SIBLING 
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC 
                            SD_BALANCE_FORK SD_WAKE_AFFINE 
                            SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING 
cpu0/domain0/max_newidle_lb_cost:1306
cpu0/domain1/max_newidle_lb_cost:1999
cpu0/domain2/max_newidle_lb_cost:0

# echo 5 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
bash: echo: write error: Invalid argument
=========================


The same system with the patch applied:
=========================
# cd /sys/kernel/debug/sched/domains
# echo -1 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{name,level,flags,groups_flags}
cpu0/domain0/name:SMT
cpu0/domain1/name:MC
cpu0/domain2/name:NUMA
cpu0/domain0/level:0
cpu0/domain1/level:2
cpu0/domain2/level:5
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...

# echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags}
cpu0/domain0/flags:SD_BALANCE_EXEC ...
cpu0/domain1/flags:SD_BALANCE_EXEC ...
cpu0/domain2/flags:SD_BALANCE_EXEC ...
cpu0/domain1/groups_flags:SD_BALANCE_EXEC ...
cpu0/domain2/groups_flags:SD_BALANCE_EXEC ...

# echo 1 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags}
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain1/flags:SD_BALANCE_EXEC ...
cpu0/domain2/flags:SD_BALANCE_EXEC ...
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/groups_flags:SD_BALANCE_EXEC ...

[skip 2, same as 1]

# echo 3 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags}
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/flags:SD_BALANCE_EXEC ...
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...

[skip 4 and 5, same as 3]

# echo 6 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
# grep . cpu0/*/{flags,groups_flags}
cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain1/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...
cpu0/domain2/groups_flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC ...

# echo 7 > /sys/fs/cgroup/cpuset/cpuset.sched_relax_domain_level
bash: echo: write error: Invalid argument
=========================

Vitalii Bursov (1):
  sched/fair: allow disabling newidle_balance with
    sched_relax_domain_level

 kernel/cgroup/cpuset.c  | 2 +-
 kernel/sched/debug.c    | 1 +
 kernel/sched/topology.c | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/1] sched/fair: allow disabling newidle_balance with sched_relax_domain_level
  2024-03-28  0:47 ` Vitalii Bursov
  (?)
@ 2024-03-28  0:30 ` Vitalii Bursov
  2024-03-28  5:51   ` Shrikanth Hegde
  2024-03-28 14:43   ` Vincent Guittot
  -1 siblings, 2 replies; 12+ messages in thread
From: Vitalii Bursov @ 2024-03-28  0:30 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Vitalii Bursov

Change relax_domain_level checks so that it would be possible
to exclude all domains from newidle balancing.

This matches the behavior described in the documentation:
  -1   no request. use system default or follow request of others.
   0   no search.
   1   search siblings (hyperthreads in a core).

"2" enables levels 0 and 1, level_max excludes the last (level_max)
level, and level_max+1 includes all levels.

Signed-off-by: Vitalii Bursov <vitaly@bursov.com>
---
 kernel/cgroup/cpuset.c  | 2 +-
 kernel/sched/debug.c    | 1 +
 kernel/sched/topology.c | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 4237c874871..da24187c4e0 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2948,7 +2948,7 @@ bool current_cpuset_is_being_rebound(void)
 static int update_relax_domain_level(struct cpuset *cs, s64 val)
 {
 #ifdef CONFIG_SMP
-	if (val < -1 || val >= sched_domain_level_max)
+	if (val < -1 || val > sched_domain_level_max + 1)
 		return -EINVAL;
 #endif
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8d5d98a5834..8454cd4e5e1 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -423,6 +423,7 @@ static void register_sd(struct sched_domain *sd, struct dentry *parent)
 
 #undef SDM
 
+	debugfs_create_u32("level", 0444, parent, (u32 *)&sd->level);
 	debugfs_create_file("flags", 0444, parent, &sd->flags, &sd_flags_fops);
 	debugfs_create_file("groups_flags", 0444, parent, &sd->groups->flags, &sd_flags_fops);
 }
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 99ea5986038..3127c9b30af 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1468,7 +1468,7 @@ static void set_domain_attribute(struct sched_domain *sd,
 	} else
 		request = attr->relax_domain_level;
 
-	if (sd->level > request) {
+	if (sd->level >= request) {
 		/* Turn off idle balance on this domain: */
 		sd->flags &= ~(SD_BALANCE_WAKE|SD_BALANCE_NEWIDLE);
 	}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/1] sched/fair: allow disabling newidle_balance with sched_relax_domain_level
  2024-03-28  0:47 ` Vitalii Bursov
  (?)
  (?)
@ 2024-03-28  5:48 ` Shrikanth Hegde
  2024-03-28 16:17   ` Vitalii Bursov
  -1 siblings, 1 reply; 12+ messages in thread
From: Shrikanth Hegde @ 2024-03-28  5:48 UTC (permalink / raw)
  To: Vitalii Bursov; +Cc: linux-kernel



On 3/28/24 6:17 AM, Vitalii Bursov wrote:
> Hi,
> 
> During the upgrade from Linux 5.4 we found a small (around 3%) 
> performance regression which was tracked to commit 

Do you see the regression because it is doing more newidle balancing?

> c5b0a7eefc70150caf23e37bc9d639c68c87a097
> 
>     sched/fair: Remove sysctl_sched_migration_cost condition
> 
>     With a default value of 500us, sysctl_sched_migration_cost is
>     significanlty higher than the cost of load_balance. Remove the
>     condition and rely on the sd->max_newidle_lb_cost to abort
>     newidle_balance.
> 
> 
> Looks like "newidle" balancing is beneficial for a lot of workloads, 
> just not for this specific one. The workload is video encoding, there 
> are 100s-1000s of threads, some are synchonized with mutexes and

s/synchonized/synchronized/
 
> conditional variables. The process aims to have a portion of CPU idle, 
> so no CPU cores are 100% busy. Perhaps, the performance impact we see 
> comes from additional processing in the scheduler and additional cost 
> like more cache misses, and not from an incorrect balancing. See
> perf output below.
> 
> My understanding is that "sched_relax_domain_level" cgroup parameter 
> should control if newidle_balance() is called and what's the scope

s/newidle_balance()/sched_balance_newidle()/ in all the places, since the 
name has been changed recently. 

> of the balancing is, but it doesn't fully work for this case.
> 
> cpusets.rst documentation:
>> The 'cpuset.sched_relax_domain_level' file allows you to request changing
>> this searching range as you like.  This file takes int value which
>> indicates size of searching range in levels ideally as follows,
>> otherwise initial value -1 that indicates the cpuset has no request.
>>  
>> ====== ===========================================================
>>   -1   no request. use system default or follow request of others.
>>    0   no search.
>>    1   search siblings (hyperthreads in a core).
>>    2   search cores in a package.
>>    3   search cpus in a node [= system wide on non-NUMA system]
>>    4   search nodes in a chunk of node [on NUMA system]
>>    5   search system wide [on NUMA system]
>> ====== ===========================================================
> 

I think this document needs to be updated. Levels need not be in serial 
order due to sched domain degeneration. It should have a paragraph which 
tells the user to take a look at /sys/kernel/debug/sched/domains/cpu*/domain*/ 
for system-specific details. 

> Setting cpuset.sched_relax_domain_level to 0 works as 1.
> 
> On a dual-CPU server, domains and levels are as follows:
>   domain 0: level 0, SMT
>   domain 1: level 2, MC
>   domain 2: level 5, NUMA
> 
> So, to support "0 no search", the value in 
> cpuset.sched_relax_domain_level should disable SD_BALANCE_NEWIDLE for a 
> specified level and keep it enabled for prior levels. For example, SMT 
> level is 0, so sched_relax_domain_level=0 should exclude levels >=0.
> 
> Instead, cpuset.sched_relax_domain_level enables the specified level,
> which effectively removes "no search" option. See below for domain
> flags for all cpuset.sched_relax_domain_level values.
> 
> Proposed patch allows clearing SD_BALANCE_NEWIDLE flags when 
> cpuset.sched_relax_domain_level is set to 0 and extends max
> value validation range beyond sched_domain_level_max. This allows
> setting SD_BALANCE_NEWIDLE on all levels and override platform
> default if it does not include all levels.
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] sched/fair: allow disabling newidle_balance with sched_relax_domain_level
  2024-03-28  0:30 ` [PATCH 1/1] " Vitalii Bursov
@ 2024-03-28  5:51   ` Shrikanth Hegde
  2024-03-28 16:19     ` Vitalii Bursov
  2024-03-28 14:43   ` Vincent Guittot
  1 sibling, 1 reply; 12+ messages in thread
From: Shrikanth Hegde @ 2024-03-28  5:51 UTC (permalink / raw)
  To: Vitalii Bursov
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel



On 3/28/24 6:00 AM, Vitalii Bursov wrote:
> Change relax_domain_level checks so that it would be possible
> to exclude all domains from newidle balancing.
> 
> This matches the behavior described in the documentation:
>   -1   no request. use system default or follow request of others.
>    0   no search.
>    1   search siblings (hyperthreads in a core).
> 
> "2" enables levels 0 and 1, level_max excludes the last (level_max)
> level, and level_max+1 includes all levels.
> 
> Signed-off-by: Vitalii Bursov <vitaly@bursov.com>
> ---
>  kernel/cgroup/cpuset.c  | 2 +-
>  kernel/sched/debug.c    | 1 +
>  kernel/sched/topology.c | 2 +-
>  3 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 4237c874871..da24187c4e0 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2948,7 +2948,7 @@ bool current_cpuset_is_being_rebound(void)
>  static int update_relax_domain_level(struct cpuset *cs, s64 val)
>  {
>  #ifdef CONFIG_SMP
> -	if (val < -1 || val >= sched_domain_level_max)
> +	if (val < -1 || val > sched_domain_level_max + 1)
>  		return -EINVAL;
>  #endif
>  
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 8d5d98a5834..8454cd4e5e1 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -423,6 +423,7 @@ static void register_sd(struct sched_domain *sd, struct dentry *parent)
>  
>  #undef SDM
>  
> +	debugfs_create_u32("level", 0444, parent, (u32 *)&sd->level);

It would be better if the level can be after group_flags since its a new addition?

>  	debugfs_create_file("flags", 0444, parent, &sd->flags, &sd_flags_fops);
>  	debugfs_create_file("groups_flags", 0444, parent, &sd->groups->flags, &sd_flags_fops);
>  }
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 99ea5986038..3127c9b30af 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1468,7 +1468,7 @@ static void set_domain_attribute(struct sched_domain *sd,
>  	} else
>  		request = attr->relax_domain_level;
>  
> -	if (sd->level > request) {
> +	if (sd->level >= request) {
>  		/* Turn off idle balance on this domain: */
>  		sd->flags &= ~(SD_BALANCE_WAKE|SD_BALANCE_NEWIDLE);
>  	}

Other than the above change looks good. 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] sched/fair: allow disabling newidle_balance with sched_relax_domain_level
  2024-03-28  0:30 ` [PATCH 1/1] " Vitalii Bursov
  2024-03-28  5:51   ` Shrikanth Hegde
@ 2024-03-28 14:43   ` Vincent Guittot
  2024-03-28 16:27     ` Vitalii Bursov
  1 sibling, 1 reply; 12+ messages in thread
From: Vincent Guittot @ 2024-03-28 14:43 UTC (permalink / raw)
  To: Vitalii Bursov
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel

On Thu, 28 Mar 2024 at 01:31, Vitalii Bursov <vitaly@bursov.com> wrote:
>
> Change relax_domain_level checks so that it would be possible
> to exclude all domains from newidle balancing.
>
> This matches the behavior described in the documentation:
>   -1   no request. use system default or follow request of others.
>    0   no search.
>    1   search siblings (hyperthreads in a core).
>
> "2" enables levels 0 and 1, level_max excludes the last (level_max)
> level, and level_max+1 includes all levels.

I was about to say that max+1 is useless because it's the same as -1
but it's not exactly the same because it can supersede the system wide
default_relax_domain_level. I wonder if one should be able to enable
more levels than what the system has set by default.
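
As a standalone sketch of that distinction (not kernel code, and assuming
the system-wide default comes from the usual "relax_domain_level=" boot
parameter): -1 defers to that default, while an explicit level_max+1
overrides it for the cpuset and keeps every level searchable.

#include <stdio.h>

static int default_relax_domain_level = 2;   /* assumed boot-time default */

static int resolve_request(int cpuset_level)
{
        if (cpuset_level < 0)                /* -1: no request from cpuset */
                return default_relax_domain_level;
        return cpuset_level;                 /* explicit request wins */
}

int main(void)
{
        printf("cpuset=-1 -> request=%d\n", resolve_request(-1)); /* 2: default  */
        printf("cpuset= 6 -> request=%d\n", resolve_request(6));  /* 6: all kept */
        return 0;
}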

>
> Signed-off-by: Vitalii Bursov <vitaly@bursov.com>
> ---
>  kernel/cgroup/cpuset.c  | 2 +-
>  kernel/sched/debug.c    | 1 +
>  kernel/sched/topology.c | 2 +-
>  3 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 4237c874871..da24187c4e0 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2948,7 +2948,7 @@ bool current_cpuset_is_being_rebound(void)
>  static int update_relax_domain_level(struct cpuset *cs, s64 val)
>  {
>  #ifdef CONFIG_SMP
> -       if (val < -1 || val >= sched_domain_level_max)
> +       if (val < -1 || val > sched_domain_level_max + 1)
>                 return -EINVAL;
>  #endif
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 8d5d98a5834..8454cd4e5e1 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -423,6 +423,7 @@ static void register_sd(struct sched_domain *sd, struct dentry *parent)
>
>  #undef SDM
>
> +       debugfs_create_u32("level", 0444, parent, (u32 *)&sd->level);

IMO, this should be a separate patch as it's not part of the fix

>         debugfs_create_file("flags", 0444, parent, &sd->flags, &sd_flags_fops);
>         debugfs_create_file("groups_flags", 0444, parent, &sd->groups->flags, &sd_flags_fops);
>  }
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 99ea5986038..3127c9b30af 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1468,7 +1468,7 @@ static void set_domain_attribute(struct sched_domain *sd,
>         } else
>                 request = attr->relax_domain_level;
>
> -       if (sd->level > request) {
> +       if (sd->level >= request) {

good catch and worth :
Fixes: 9ae7ab20b483 ("sched/topology: Don't set SD_BALANCE_WAKE on
cpuset domain relax")


>                 /* Turn off idle balance on this domain: */
>                 sd->flags &= ~(SD_BALANCE_WAKE|SD_BALANCE_NEWIDLE);
>         }
> --
> 2.20.1
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/1] sched/fair: allow disabling newidle_balance with sched_relax_domain_level
  2024-03-28  5:48 ` [PATCH 0/1] " Shrikanth Hegde
@ 2024-03-28 16:17   ` Vitalii Bursov
  0 siblings, 0 replies; 12+ messages in thread
From: Vitalii Bursov @ 2024-03-28 16:17 UTC (permalink / raw)
  To: Shrikanth Hegde; +Cc: linux-kernel


On 28.03.24 07:48, Shrikanth Hegde wrote:
> 
> 
> On 3/28/24 6:17 AM, Vitalii Bursov wrote:
>> Hi,
>>
>> During the upgrade from Linux 5.4 we found a small (around 3%) 
>> performance regression which was tracked to commit 
> 
> You see the regression since it is doing more newidle balance? 

In some sense, yes.

Before this commit, or with this commit reverted on newer kernels,
newidle balance is almost never called in this code path because
sysctl_sched_migration_cost is too high (it's set to 5ms by tuned
in "performance" profiles).

So it's not exactly more; my understanding is that it has simply started
working on this and similar servers.

When it's working, perf for the process and kernel shows:
* slightly more (+10%) or much more (10x or more) cpu-migrations.
It looks like the new migrations come from futex syscalls in our case.
* more "cycles" and "instructions", which is understandable, as
newidle requires some work to be done. However, the IPC ratio falls.
* more L1 and L2 load misses.
(I haven't checked other performance counters on this CPU)

The above certainly depends on the workload, and it concerns the specific
case where we see the overall performance regression. Perhaps all of the
mentioned factors and consequences contribute. For example, the additional
cache pressure may have a higher negative impact than the benefit of
newidle, but I don't know if that is exactly the case.

We haven't tried running a production workload with newidle restricted
to the SMT level only yet, but a synthetic test (which may be totally
different from our real workload) shows that it is possible for SMT-only
newidle to be a little better than having it fully disabled: about 0.5%
better in the test case, instead of 1.5% worse when newidle is fully
enabled.

>> c5b0a7eefc70150caf23e37bc9d639c68c87a097
>>
>>     sched/fair: Remove sysctl_sched_migration_cost condition
>>
>>     With a default value of 500us, sysctl_sched_migration_cost is
>>     significanlty higher than the cost of load_balance. Remove the
>>     condition and rely on the sd->max_newidle_lb_cost to abort
>>     newidle_balance.
>>
>>
>> Looks like "newidle" balancing is beneficial for a lot of workloads, 
>> just not for this specific one. The workload is video encoding, there 
>> are 100s-1000s of threads, some are synchonized with mutexes and
> 
> s/synchonized/synchronized/
Thanks

>> conditional variables. The process aims to have a portion of CPU idle, 
>> so no CPU cores are 100% busy. Perhaps, the performance impact we see 
>> comes from additional processing in the scheduler and additional cost 
>> like more cache misses, and not from an incorrect balancing. See
>> perf output below.
>>
>> My understanding is that "sched_relax_domain_level" cgroup parameter 
>> should control if newidle_balance() is called and what's the scope
> 
> s/newidle_balance()/sched_balance_newidle()   at all the places since the 
> name has been changed recently.

Thanks

>> of the balancing is, but it doesn't fully work for this case.
>>
>> cpusets.rst documentation:
>>> The 'cpuset.sched_relax_domain_level' file allows you to request changing
>>> this searching range as you like.  This file takes int value which
>>> indicates size of searching range in levels ideally as follows,
>>> otherwise initial value -1 that indicates the cpuset has no request.
>>>  
>>> ====== ===========================================================
>>>   -1   no request. use system default or follow request of others.
>>>    0   no search.
>>>    1   search siblings (hyperthreads in a core).
>>>    2   search cores in a package.
>>>    3   search cpus in a node [= system wide on non-NUMA system]
>>>    4   search nodes in a chunk of node [on NUMA system]
>>>    5   search system wide [on NUMA system]
>>> ====== ===========================================================
>>
> 
> I think this document needs to be updated. levels need not be serial order 
> due to sched domains degenation. It should have a paragraph which tells the user
> to take a look at /sys/kernel/debug/sched/domains/cpu*/domain*/ for system 
> specific details. 
> 
Agree, it's a little confusing. I can add something like this.
Thanks

>> Setting cpuset.sched_relax_domain_level to 0 works as 1.
>>
>> On a dual-CPU server, domains and levels are as follows:
>>   domain 0: level 0, SMT
>>   domain 1: level 2, MC
>>   domain 2: level 5, NUMA
>>
>> So, to support "0 no search", the value in 
>> cpuset.sched_relax_domain_level should disable SD_BALANCE_NEWIDLE for a 
>> specified level and keep it enabled for prior levels. For example, SMT 
>> level is 0, so sched_relax_domain_level=0 should exclude levels >=0.
>>
>> Instead, cpuset.sched_relax_domain_level enables the specified level,
>> which effectively removes "no search" option. See below for domain
>> flags for all cpuset.sched_relax_domain_level values.
>>
>> Proposed patch allows clearing SD_BALANCE_NEWIDLE flags when 
>> cpuset.sched_relax_domain_level is set to 0 and extends max
>> value validation range beyond sched_domain_level_max. This allows
>> setting SD_BALANCE_NEWIDLE on all levels and override platform
>> default if it does not include all levels.
>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] sched/fair: allow disabling newidle_balance with sched_relax_domain_level
  2024-03-28  5:51   ` Shrikanth Hegde
@ 2024-03-28 16:19     ` Vitalii Bursov
  0 siblings, 0 replies; 12+ messages in thread
From: Vitalii Bursov @ 2024-03-28 16:19 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel


On 28.03.24 07:51, Shrikanth Hegde wrote:
> 
> 
> On 3/28/24 6:00 AM, Vitalii Bursov wrote:
>> Change relax_domain_level checks so that it would be possible
>> to exclude all domains from newidle balancing.
>>
>> This matches the behavior described in the documentation:
>>   -1   no request. use system default or follow request of others.
>>    0   no search.
>>    1   search siblings (hyperthreads in a core).
>>
>> "2" enables levels 0 and 1, level_max excludes the last (level_max)
>> level, and level_max+1 includes all levels.
>>
>> Signed-off-by: Vitalii Bursov <vitaly@bursov.com>
>> ---
>>  kernel/cgroup/cpuset.c  | 2 +-
>>  kernel/sched/debug.c    | 1 +
>>  kernel/sched/topology.c | 2 +-
>>  3 files changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 4237c874871..da24187c4e0 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -2948,7 +2948,7 @@ bool current_cpuset_is_being_rebound(void)
>>  static int update_relax_domain_level(struct cpuset *cs, s64 val)
>>  {
>>  #ifdef CONFIG_SMP
>> -	if (val < -1 || val >= sched_domain_level_max)
>> +	if (val < -1 || val > sched_domain_level_max + 1)
>>  		return -EINVAL;
>>  #endif
>>  
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index 8d5d98a5834..8454cd4e5e1 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -423,6 +423,7 @@ static void register_sd(struct sched_domain *sd, struct dentry *parent)
>>  
>>  #undef SDM
>>  
>> +	debugfs_create_u32("level", 0444, parent, (u32 *)&sd->level);
> 
> It would be better if the level can be after group_flags since its a new addition?

I'll change the order.
Thanks

>>  	debugfs_create_file("flags", 0444, parent, &sd->flags, &sd_flags_fops);
>>  	debugfs_create_file("groups_flags", 0444, parent, &sd->groups->flags, &sd_flags_fops);
>>  }
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 99ea5986038..3127c9b30af 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1468,7 +1468,7 @@ static void set_domain_attribute(struct sched_domain *sd,
>>  	} else
>>  		request = attr->relax_domain_level;
>>  
>> -	if (sd->level > request) {
>> +	if (sd->level >= request) {
>>  		/* Turn off idle balance on this domain: */
>>  		sd->flags &= ~(SD_BALANCE_WAKE|SD_BALANCE_NEWIDLE);
>>  	}
> 
> Other than the above change looks good. 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] sched/fair: allow disabling newidle_balance with sched_relax_domain_level
  2024-03-28 14:43   ` Vincent Guittot
@ 2024-03-28 16:27     ` Vitalii Bursov
  2024-03-28 16:48       ` Vincent Guittot
  0 siblings, 1 reply; 12+ messages in thread
From: Vitalii Bursov @ 2024-03-28 16:27 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel


On 28.03.24 16:43, Vincent Guittot wrote:
> On Thu, 28 Mar 2024 at 01:31, Vitalii Bursov <vitaly@bursov.com> wrote:
>>
>> Change relax_domain_level checks so that it would be possible
>> to exclude all domains from newidle balancing.
>>
>> This matches the behavior described in the documentation:
>>   -1   no request. use system default or follow request of others.
>>    0   no search.
>>    1   search siblings (hyperthreads in a core).
>>
>> "2" enables levels 0 and 1, level_max excludes the last (level_max)
>> level, and level_max+1 includes all levels.
> 
> I was about to say that max+1 is useless because it's the same as -1
> but it's not exactly the same because it can supersede the system wide
> default_relax_domain_level. I wonder if one should be able to enable
> more levels than what the system has set by default.

I don't know if such systems exist, but cpusets.rst suggests that
increasing it beyond the default value is possible:
> If your situation is:
> 
>  - The migration costs between each cpu can be assumed considerably
>    small(for you) due to your special application's behavior or
>    special hardware support for CPU cache etc.
>  - The searching cost doesn't have impact(for you) or you can make
>    the searching cost enough small by managing cpuset to compact etc.
>  - The latency is required even it sacrifices cache hit rate etc.
>    then increasing 'sched_relax_domain_level' would benefit you.


>>
>> Signed-off-by: Vitalii Bursov <vitaly@bursov.com>
>> ---
>>  kernel/cgroup/cpuset.c  | 2 +-
>>  kernel/sched/debug.c    | 1 +
>>  kernel/sched/topology.c | 2 +-
>>  3 files changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 4237c874871..da24187c4e0 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -2948,7 +2948,7 @@ bool current_cpuset_is_being_rebound(void)
>>  static int update_relax_domain_level(struct cpuset *cs, s64 val)
>>  {
>>  #ifdef CONFIG_SMP
>> -       if (val < -1 || val >= sched_domain_level_max)
>> +       if (val < -1 || val > sched_domain_level_max + 1)
>>                 return -EINVAL;
>>  #endif
>>
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index 8d5d98a5834..8454cd4e5e1 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -423,6 +423,7 @@ static void register_sd(struct sched_domain *sd, struct dentry *parent)
>>
>>  #undef SDM
>>
>> +       debugfs_create_u32("level", 0444, parent, (u32 *)&sd->level);
> 
> IMO, this should be a separate patch as it's not part of the fix

Thanks, I'll split it.

>>         debugfs_create_file("flags", 0444, parent, &sd->flags, &sd_flags_fops);
>>         debugfs_create_file("groups_flags", 0444, parent, &sd->groups->flags, &sd_flags_fops);
>>  }
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 99ea5986038..3127c9b30af 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1468,7 +1468,7 @@ static void set_domain_attribute(struct sched_domain *sd,
>>         } else
>>                 request = attr->relax_domain_level;
>>
>> -       if (sd->level > request) {
>> +       if (sd->level >= request) {
> 
> good catch and worth :
> Fixes: 9ae7ab20b483 ("sched/topology: Don't set SD_BALANCE_WAKE on
> cpuset domain relax")
> 
Will add this.
Thanks.

> 
>>                 /* Turn off idle balance on this domain: */
>>                 sd->flags &= ~(SD_BALANCE_WAKE|SD_BALANCE_NEWIDLE);
>>         }
>> --
>> 2.20.1
>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] sched/fair: allow disabling newidle_balance with sched_relax_domain_level
  2024-03-28 16:27     ` Vitalii Bursov
@ 2024-03-28 16:48       ` Vincent Guittot
  2024-03-28 17:10         ` Vitalii Bursov
  0 siblings, 1 reply; 12+ messages in thread
From: Vincent Guittot @ 2024-03-28 16:48 UTC (permalink / raw)
  To: Vitalii Bursov
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel

On Thu, 28 Mar 2024 at 17:27, Vitalii Bursov <vitaly@bursov.com> wrote:
>
>
> On 28.03.24 16:43, Vincent Guittot wrote:
> > On Thu, 28 Mar 2024 at 01:31, Vitalii Bursov <vitaly@bursov.com> wrote:
> >>
> >> Change relax_domain_level checks so that it would be possible
> >> to exclude all domains from newidle balancing.
> >>
> >> This matches the behavior described in the documentation:
> >>   -1   no request. use system default or follow request of others.
> >>    0   no search.
> >>    1   search siblings (hyperthreads in a core).
> >>
> >> "2" enables levels 0 and 1, level_max excludes the last (level_max)
> >> level, and level_max+1 includes all levels.
> >
> > I was about to say that max+1 is useless because it's the same as -1
> > but it's not exactly the same because it can supersede the system wide
> > default_relax_domain_level. I wonder if one should be able to enable
> > more levels than what the system has set by default.
>
> I don't know if such systems exist, but cpusets.rst suggests that
> increasing it beyond the default value is possible:
> > If your situation is:
> >
> >  - The migration costs between each cpu can be assumed considerably
> >    small(for you) due to your special application's behavior or
> >    special hardware support for CPU cache etc.
> >  - The searching cost doesn't have impact(for you) or you can make
> >    the searching cost enough small by managing cpuset to compact etc.
> >  - The latency is required even it sacrifices cache hit rate etc.
> >    then increasing 'sched_relax_domain_level' would benefit you.

Fair enough. The doc should be updated, as we can now only clear the
flags, not set them.

>
>
> >>
> >> Signed-off-by: Vitalii Bursov <vitaly@bursov.com>
> >> ---
> >>  kernel/cgroup/cpuset.c  | 2 +-
> >>  kernel/sched/debug.c    | 1 +
> >>  kernel/sched/topology.c | 2 +-
> >>  3 files changed, 3 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> >> index 4237c874871..da24187c4e0 100644
> >> --- a/kernel/cgroup/cpuset.c
> >> +++ b/kernel/cgroup/cpuset.c
> >> @@ -2948,7 +2948,7 @@ bool current_cpuset_is_being_rebound(void)
> >>  static int update_relax_domain_level(struct cpuset *cs, s64 val)
> >>  {
> >>  #ifdef CONFIG_SMP
> >> -       if (val < -1 || val >= sched_domain_level_max)
> >> +       if (val < -1 || val > sched_domain_level_max + 1)
> >>                 return -EINVAL;
> >>  #endif
> >>
> >> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> >> index 8d5d98a5834..8454cd4e5e1 100644
> >> --- a/kernel/sched/debug.c
> >> +++ b/kernel/sched/debug.c
> >> @@ -423,6 +423,7 @@ static void register_sd(struct sched_domain *sd, struct dentry *parent)
> >>
> >>  #undef SDM
> >>
> >> +       debugfs_create_u32("level", 0444, parent, (u32 *)&sd->level);
> >
> > IMO, this should be a separate patch as it's not part of the fix
>
> Thanks, I'll split it.
>
> >>         debugfs_create_file("flags", 0444, parent, &sd->flags, &sd_flags_fops);
> >>         debugfs_create_file("groups_flags", 0444, parent, &sd->groups->flags, &sd_flags_fops);
> >>  }
> >> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> >> index 99ea5986038..3127c9b30af 100644
> >> --- a/kernel/sched/topology.c
> >> +++ b/kernel/sched/topology.c
> >> @@ -1468,7 +1468,7 @@ static void set_domain_attribute(struct sched_domain *sd,
> >>         } else
> >>                 request = attr->relax_domain_level;
> >>
> >> -       if (sd->level > request) {
> >> +       if (sd->level >= request) {
> >
> > good catch and worth :
> > Fixes: 9ae7ab20b483 ("sched/topology: Don't set SD_BALANCE_WAKE on
> > cpuset domain relax")
> >
> Will add this.
> Thanks.
>
> >
> >>                 /* Turn off idle balance on this domain: */
> >>                 sd->flags &= ~(SD_BALANCE_WAKE|SD_BALANCE_NEWIDLE);
> >>         }
> >> --
> >> 2.20.1
> >>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] sched/fair: allow disabling newidle_balance with sched_relax_domain_level
  2024-03-28 16:48       ` Vincent Guittot
@ 2024-03-28 17:10         ` Vitalii Bursov
  2024-03-28 17:38           ` Vincent Guittot
  0 siblings, 1 reply; 12+ messages in thread
From: Vitalii Bursov @ 2024-03-28 17:10 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel



On 28.03.24 18:48, Vincent Guittot wrote:
> On Thu, 28 Mar 2024 at 17:27, Vitalii Bursov <vitaly@bursov.com> wrote:
>>
>>
>> On 28.03.24 16:43, Vincent Guittot wrote:
>>> On Thu, 28 Mar 2024 at 01:31, Vitalii Bursov <vitaly@bursov.com> wrote:
>>>>
>>>> Change relax_domain_level checks so that it would be possible
>>>> to exclude all domains from newidle balancing.
>>>>
>>>> This matches the behavior described in the documentation:
>>>>   -1   no request. use system default or follow request of others.
>>>>    0   no search.
>>>>    1   search siblings (hyperthreads in a core).
>>>>
>>>> "2" enables levels 0 and 1, level_max excludes the last (level_max)
>>>> level, and level_max+1 includes all levels.
>>>
>>> I was about to say that max+1 is useless because it's the same as -1
>>> but it's not exactly the same because it can supersede the system wide
>>> default_relax_domain_level. I wonder if one should be able to enable
>>> more levels than what the system has set by default.
>>
>> I don't know if such systems exist, but cpusets.rst suggests that
>> increasing it beyond the default value is possible:
>>> If your situation is:
>>>
>>>  - The migration costs between each cpu can be assumed considerably
>>>    small(for you) due to your special application's behavior or
>>>    special hardware support for CPU cache etc.
>>>  - The searching cost doesn't have impact(for you) or you can make
>>>    the searching cost enough small by managing cpuset to compact etc.
>>>  - The latency is required even it sacrifices cache hit rate etc.
>>>    then increasing 'sched_relax_domain_level' would benefit you.
> 
> Fair enough. The doc should be updated, as we can now only clear the
> flags, not set them.
> 

SD_BALANCE_NEWIDLE is always set by default in sd_init() and cleared
in set_domain_attribute() depending on default_relax_domain_level
("relax_domain_level" kernel parameter) and cgroup configuration
if it's present.

So, it should work both ways - clearing the flags when the relax level
is decreasing, and not clearing them when it's increasing, shouldn't it?

Also, after a closer look at set_domain_attribute(), it looks like
default_relax_domain_level is -1 on all systems, so if the cgroup does
not set a relax level, no flags are cleared, which probably means
that level_max+1 is redundant today.
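
For reference, the effective logic after this patch would look roughly
like the simplified sketch below (based on the quoted hunk, not a
verbatim copy of the upstream function):

/*
 * Pick the cgroup's request if one is set, otherwise fall back to
 * default_relax_domain_level (the "relax_domain_level=" boot
 * parameter, -1 unless overridden), and clear newidle/wake balancing
 * on every domain at or above the requested level.
 */
static void set_domain_attribute(struct sched_domain *sd,
				 struct sched_domain_attr *attr)
{
	int request;

	if (!attr || attr->relax_domain_level < 0) {
		if (default_relax_domain_level < 0)
			return;		/* no request anywhere: keep flags */
		request = default_relax_domain_level;
	} else {
		request = attr->relax_domain_level;
	}

	if (sd->level >= request) {
		/* Turn off idle balance on this domain: */
		sd->flags &= ~(SD_BALANCE_WAKE | SD_BALANCE_NEWIDLE);
	}
}

So with -1 both in the cgroup and in default_relax_domain_level the
function returns early and nothing is cleared; level_max+1 only makes
a difference when a boot-time default would otherwise clear the flags.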

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] sched/fair: allow disabling newidle_balance with sched_relax_domain_level
  2024-03-28 17:10         ` Vitalii Bursov
@ 2024-03-28 17:38           ` Vincent Guittot
  0 siblings, 0 replies; 12+ messages in thread
From: Vincent Guittot @ 2024-03-28 17:38 UTC (permalink / raw)
  To: Vitalii Bursov
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel

On Thu, 28 Mar 2024 at 18:10, Vitalii Bursov <vitaly@bursov.com> wrote:
>
>
>
> On 28.03.24 18:48, Vincent Guittot wrote:
> > On Thu, 28 Mar 2024 at 17:27, Vitalii Bursov <vitaly@bursov.com> wrote:
> >>
> >>
> >> On 28.03.24 16:43, Vincent Guittot wrote:
> >>> On Thu, 28 Mar 2024 at 01:31, Vitalii Bursov <vitaly@bursov.com> wrote:
> >>>>
> >>>> Change relax_domain_level checks so that it would be possible
> >>>> to exclude all domains from newidle balancing.
> >>>>
> >>>> This matches the behavior described in the documentation:
> >>>>   -1   no request. use system default or follow request of others.
> >>>>    0   no search.
> >>>>    1   search siblings (hyperthreads in a core).
> >>>>
> >>>> "2" enables levels 0 and 1, level_max excludes the last (level_max)
> >>>> level, and level_max+1 includes all levels.
> >>>
> >>> I was about to say that max+1 is useless because it's the same as -1
> >>> but it's not exactly the same because it can supersede the system wide
> >>> default_relax_domain_level. I wonder if one should be able to enable
> >>> more levels than what the system has set by default.
> >>
> >> I don't know if such systems exist, but cpusets.rst suggests that
> >> increasing it beyond the default value is possible:
> >>> If your situation is:
> >>>
> >>>  - The migration costs between each cpu can be assumed considerably
> >>>    small(for you) due to your special application's behavior or
> >>>    special hardware support for CPU cache etc.
> >>>  - The searching cost doesn't have impact(for you) or you can make
> >>>    the searching cost enough small by managing cpuset to compact etc.
> >>>  - The latency is required even it sacrifices cache hit rate etc.
> >>>    then increasing 'sched_relax_domain_level' would benefit you.
> >
> > Fair enough. The doc should be updated, as we can now only clear the
> > flags, not set them.
> >
>
> SD_BALANCE_NEWIDLE is always set by default in sd_init() and cleared
> in set_domain_attribute() depending on default_relax_domain_level
> ("relax_domain_level" kernel parameter) and cgroup configuration
> if it's present.

Yes, I meant that before
9ae7ab20b483 ("sched/topology: Don't set SD_BALANCE_WAKE on cpuset
domain relax"), the flags SD_BALANCE_NEWIDLE and SD_BALANCE_WAKE could
also be set even though sd_init() would not set them.

>
> So, it should work both ways - clearing the flags when the relax level
> is decreasing, and not clearing them when it's increasing, shouldn't it?
>
> Also, after a closer look at set_domain_attribute(), it looks like
> default_relax_domain_level is -1 on all systems, so if the cgroup does
> not set a relax level, no flags are cleared, which probably means
> that level_max+1 is redundant today.

Except if the boot parameter has set it to another level, which was my
point. Does it make sense to be able to set a relax_level to level_max
in one cgroup if we have "relax_domain_level=1" in boot params, as an
example? But this is out of the scope of this patch, because it
already works for level_max-1, so why not for level_max.

So keep your change in update_relax_domain_level()

Thanks

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-03-28 17:38 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-28  0:30 [PATCH 0/1] sched/fair: allow disabling newidle_balance with sched_relax_domain_level Vitalii Bursov
2024-03-28  0:47 ` Vitalii Bursov
2024-03-28  0:30 ` [PATCH 1/1] " Vitalii Bursov
2024-03-28  5:51   ` Shrikanth Hegde
2024-03-28 16:19     ` Vitalii Bursov
2024-03-28 14:43   ` Vincent Guittot
2024-03-28 16:27     ` Vitalii Bursov
2024-03-28 16:48       ` Vincent Guittot
2024-03-28 17:10         ` Vitalii Bursov
2024-03-28 17:38           ` Vincent Guittot
2024-03-28  5:48 ` [PATCH 0/1] " Shrikanth Hegde
2024-03-28 16:17   ` Vitalii Bursov
