All of lore.kernel.org
* [PATCH v3 0/2] sched: Minor changes for rd->overload access
@ 2024-03-25  5:45 Shrikanth Hegde
  2024-03-25  5:45 ` [PATCH v3 1/2] sched/fair: Check rd->overload value before update Shrikanth Hegde
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Shrikanth Hegde @ 2024-03-25  5:45 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot
  Cc: sshegde, dietmar.eggemann, qyousef, linux-kernel, vschneid

v2 -> v3:
- Wrapped the check for value change inside set_rd_overload_status()
  as suggested by Qais.
- Added reviewed-by tags.

v1 -> v2:
- dropped Fixes tag.
- Added one of the perf probes in the changelog for reference.
- Added reviewed-by tags.

tl;dr
When running workloads on large systems, it was observed that access to
rd->overload was taking time. It is better to check the value before
updating it, since the value changes less often. Patch 1/2 does that.
With the patch, updates happen only when necessary. CPU bus traffic
reduced a bit, with no significant gains in workload performance.

Qais suggested that it would be better to use helper functions to
access rd->overload instead. Patch 2/2 does that.

*These patches depend on the series below being applied first*
https://lore.kernel.org/all/20240307085725.444486-1-sshegde@linux.ibm.com/


-----------------------------------------------------------------------
Detailed Perf annotation and probes stat
-----------------------------------------------------------------------
=======
6.8-rc5
=======
320 CPU system, SMT8
  NUMA node(s):          4
  NUMA node0 CPU(s):     0-79
  NUMA node1 CPU(s):     80-159
  NUMA node6 CPU(s):     160-239
  NUMA node7 CPU(s):     240-319

Perf annotate while running "schbench -t 320 -i 30 -r 30":
       │     if (!READ_ONCE(this_rq->rd->overload) ||
 18.05 │       ld       r9,2752(r31)
       │     sd = rcu_dereference_check_sched_domain(this_rq->sd);
  6.97 │       ld       r30,2760(r31)


Added some dummy code so that probes can be placed at the required places.
perf probe -L update_sd_lb_stats
     46         if (env->sd->flags & SD_NUMA)
     47                 env->fbq_type = fbq_classify_group(&sds->busiest_stat);

     49         if (!env->sd->parent) {
                        /* update overload indicator if we are at root domain */
     51                 WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);

perf probe -L sched_balance_newidle
                rcu_read_lock();
     38         sd = rcu_dereference_check_sched_domain(this_rq->sd);

                if (!READ_ONCE(this_rq->rd->overload) ||
                    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

perf probe -L add_nr_running
         #ifdef CONFIG_SMP
     11         if (prev_nr < 2 && rq->nr_running >= 2) {
     12                 if (!READ_ONCE(rq->rd->overload)) {
     13                         a = a +10;
                                WRITE_ONCE(rq->rd->overload, 1);
                        }

Probe hits when running different workloads:
idle
320 probe:add_nr_running_L12
260 probe:add_nr_running_L13
1K probe:sched_balance_newidle_L38
596 probe:update_sd_lb_stats_L51

./hackbench 10 process 100000 loops
130K probe:add_nr_running_L12
93 probe:add_nr_running_L13
1M probe:sched_balance_newidle_L38
109K probe:update_sd_lb_stats_L51

./schbench -t 320 -i 30 -r 30
3K probe:add_nr_running_L12
436 probe:add_nr_running_L13
125K probe:sched_balance_newidle_L38
33K probe:update_sd_lb_stats_L51

Modified stress-ng --wait
3K probe:add_nr_running_L12
1K probe:add_nr_running_L13
6M probe:sched_balance_newidle_L38
11K probe:update_sd_lb_stats_L51

stress-ng --cpu=400 -l 20
833 probe:add_nr_running_L12
280 probe:add_nr_running_L13
2K probe:sched_balance_newidle_L38
1K probe:update_sd_lb_stats_L51

stress-ng --cpu=400 -l 100
730 probe:add_nr_running_L12
0 probe:add_nr_running_L13
0 probe:sched_balance_newidle_L38
0 probe:update_sd_lb_stats_L51

stress-ng --cpu=800 -l 50
2K probe:add_nr_running_L12
0 probe:add_nr_running_L13
2K probe:sched_balance_newidle_L38
946 probe:update_sd_lb_stats_L51

stress-ng --cpu=800 -l 100
361 probe:add_nr_running_L12
0 probe:add_nr_running_L13
0 probe:sched_balance_newidle_L38
0 probe:update_sd_lb_stats_L51

The L13 counts are much lower than the L12 counts. This indicates that
the value changes only rarely.

------------------------------------------------------------------------------
==========
With Patch:
==========
Perf annotate while running "schbench -t 320 -i 30 -r 30":
       │     if (!READ_ONCE(this_rq->rd->overload) ||
       │       ld       r9,2752(r31)
       │     sd = rcu_dereference_check_sched_domain(this_rq->sd);
       │       ld       r30,2760(r31)
       │     if (!READ_ONCE(this_rq->rd->overload) ||
       │       lwz      r9,536(r9)
       │       cmpwi    r9,0
       │     ↓ beq      2b4
       │100:   mflr     r0
       │       cmpdi    r30,0
  0.38 │       std      r0,240(r1)
  1.56 │     ↓ beq      120


perf probe -L update_sd_lb_stats
     49         if (!env->sd->parent) {
     50                 int a;
                        /* update overload indicator if we are at root domain */
                        if (READ_ONCE(env->dst_rq->rd->overload) != (sg_status & SG_OVERLOAD)) {
     53                         a = a + 10;
                                WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);
                        }

perf probe -L sched_balance_newidle
     38         sd = rcu_dereference_check_sched_domain(this_rq->sd);

                if (!READ_ONCE(this_rq->rd->overload) ||
                    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

perf probe -L add_nr_running
		#ifdef CONFIG_SMP
     11         if (prev_nr < 2 && rq->nr_running >= 2) {
     12                 if (!READ_ONCE(rq->rd->overload)) {
     13                         a = a +10;
                                WRITE_ONCE(rq->rd->overload, 1);
                        }

Perf probe hits when running different workloads. The percentage of
calls in which the value actually changes in update_sd_lb_stats() is
indicated as L53/L50*100.
idle
818 probe:sched_balance_newidle_L38
262 probe:update_sd_lb_stats_L53	<-- 86%
321 probe:add_nr_running_L12
261 probe:add_nr_running_L13
304 probe:update_sd_lb_stats_L50

./hackbench 10 process 100000 loops
1M probe:sched_balance_newidle_L38
139 probe:update_sd_lb_stats_L53	<-- 0.25%
129K probe:add_nr_running_L12
74 probe:add_nr_running_L13
54K probe:update_sd_lb_stats_L50

./schbench -t 320 -i 30 -r 30
101K probe:sched_balance_newidle_L38
2K probe:update_sd_lb_stats_L53		<-- 9.09%
5K probe:add_nr_running_L12
1K probe:add_nr_running_L13
22K probe:update_sd_lb_stats_L50

Modified stress-ng --wait
6M probe:sched_balance_newidle_L38
2K probe:update_sd_lb_stats_L53		<-- 25%
4K probe:add_nr_running_L12
2K probe:add_nr_running_L13
8K probe:update_sd_lb_stats_L50

stress-ng --cpu=400 -l 20
2K probe:sched_balance_newidle_L38
286 probe:update_sd_lb_stats_L53	<-- 36.11%
746 probe:add_nr_running_L12
256 probe:add_nr_running_L13
792 probe:update_sd_lb_stats_L50

stress-ng --cpu=400 -l 100
2 probe:sched_balance_newidle_L38
0 probe:update_sd_lb_stats_L53		<-- NA
923 probe:add_nr_running_L12
0 probe:add_nr_running_L13
0 probe:update_sd_lb_stats_L50

stress-ng --cpu=800 -l 50
2K probe:sched_balance_newidle_L38
0 probe:update_sd_lb_stats_L53		<-- 0%
2K probe:add_nr_running_L12
0 probe:add_nr_running_L13
429 probe:update_sd_lb_stats_L50

stress-ng --cpu=800 -l 100
0 probe:sched_balance_newidle_L38
0 probe:update_sd_lb_stats_L53		<-- NA
424 probe:add_nr_running_L12
0 probe:add_nr_running_L13
1 probe:update_sd_lb_stats_L50

This indicates that the value likely changes only rarely. So adding a
read before the update would help in generic workloads.
-------------------------------------------------------------------------------

Shrikanth Hegde (2):
  sched/fair: Check rd->overload value before update
  sched/fair: Use helper functions to access rd->overload

 kernel/sched/fair.c  |  4 ++--
 kernel/sched/sched.h | 14 ++++++++++++--
 2 files changed, 14 insertions(+), 4 deletions(-)

--
2.39.3


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3 1/2] sched/fair: Check rd->overload value before update
  2024-03-25  5:45 [PATCH v3 0/2] sched: Minor changes for rd->overload access Shrikanth Hegde
@ 2024-03-25  5:45 ` Shrikanth Hegde
  2024-03-28 10:47   ` [tip: sched/core] sched/fair: Check root_domain::overload " tip-bot2 for Shrikanth Hegde
  2024-03-25  5:45 ` [PATCH v3 2/2] sched/fair: Use helper functions to access rd->overload Shrikanth Hegde
  2024-03-25 10:36 ` [PATCH v3 0/2] sched: Minor changes for rd->overload access Ingo Molnar
  2 siblings, 1 reply; 18+ messages in thread
From: Shrikanth Hegde @ 2024-03-25  5:45 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot
  Cc: sshegde, dietmar.eggemann, qyousef, linux-kernel, vschneid

Overload is an indicator used to inform whether any rq in the root
domain has 2 or more running tasks. The root domain is a global
structure per cpuset island.

The overload status is set when a runqueue gains a second running task.
It is cleared in update_sd_lb_stats() during load balance, if none of
the rqs has 2 or more running tasks. It is used during newidle balance
to see if it is worth doing a load balance. If it is set, then newidle
balance will try aggressively to pull a task.

Since commit 630246a06ae2 ("sched/fair: Clean-up update_sg_lb_stats
parameters"), overload has been updated unconditionally. How often the
value changes depends on the workload; on typical workloads it likely
does not change often. Each write invalidates the cacheline, which
causes true sharing when the same variable is read in
sched_balance_newidle(). On large systems this results in more CPU bus
traffic.

Perf probe stats with the patch applied, while running hackbench, for
reference. Detailed info on the perf probes can be found in the cover
letter.
1M probe:sched_balance_newidle_L38
139 probe:update_sd_lb_stats_L53
129K probe:add_nr_running_L12
74 probe:add_nr_running_L13
54K probe:update_sd_lb_stats_L50

The perf probes show that the value actually changes far less often
than it is recomputed (compare L50 vs L53). So it is better to check
the value before updating it. CPU bus traffic reduces with the patch.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02d4d224b436..eeebadd7d9ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10621,7 +10621,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

 	if (!env->sd->parent) {
 		/* update overload indicator if we are at root domain */
-		WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);
+		if (READ_ONCE(env->dst_rq->rd->overload) != (sg_status & SG_OVERLOAD))
+			WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);

 		/* Update over-utilization (tipping point, U >= 0) indicator */
 		set_rd_overutilized_status(env->dst_rq->rd,
--
2.39.3



* [PATCH v3 2/2] sched/fair: Use helper functions to access rd->overload
  2024-03-25  5:45 [PATCH v3 0/2] sched: Minor changes for rd->overload access Shrikanth Hegde
  2024-03-25  5:45 ` [PATCH v3 1/2] sched/fair: Check rd->overload value before update Shrikanth Hegde
@ 2024-03-25  5:45 ` Shrikanth Hegde
  2024-03-28 10:47   ` [tip: sched/core] sched/fair: Use helper functions to access root_domain::overload tip-bot2 for Shrikanth Hegde
  2024-03-25 10:36 ` [PATCH v3 0/2] sched: Minor changes for rd->overload access Ingo Molnar
  2 siblings, 1 reply; 18+ messages in thread
From: Shrikanth Hegde @ 2024-03-25  5:45 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot
  Cc: sshegde, dietmar.eggemann, qyousef, linux-kernel, vschneid

rd->overload is accessed in multiple places. Use helper get/set
functions instead. This makes the code more readable and easier to
maintain.

No change in functionality intended.

Suggested-by: Qais Yousef <qyousef@layalina.io>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c  |  5 ++---
 kernel/sched/sched.h | 14 ++++++++++++--
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eeebadd7d9ae..f00cb66cc479 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10621,8 +10621,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

 	if (!env->sd->parent) {
 		/* update overload indicator if we are at root domain */
-		if (READ_ONCE(env->dst_rq->rd->overload) != (sg_status & SG_OVERLOAD))
-			WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);
+		set_rd_overload_status(env->dst_rq->rd, sg_status & SG_OVERLOAD);

 		/* Update over-utilization (tipping point, U >= 0) indicator */
 		set_rd_overutilized_status(env->dst_rq->rd,
@@ -12344,7 +12343,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 	rcu_read_lock();
 	sd = rcu_dereference_check_sched_domain(this_rq->sd);

-	if (!READ_ONCE(this_rq->rd->overload) ||
+	if (!is_rd_overloaded(this_rq->rd) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

 		if (sd)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 41024c1c49b4..bed5ad8a1827 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -918,6 +918,17 @@ extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
 extern void sched_get_rd(struct root_domain *rd);
 extern void sched_put_rd(struct root_domain *rd);

+static inline int is_rd_overloaded(struct root_domain *rd)
+{
+	return READ_ONCE(rd->overload);
+}
+
+static inline void set_rd_overload_status(struct root_domain *rd, int status)
+{
+	if (is_rd_overloaded(rd) != status)
+		WRITE_ONCE(rd->overload, status);
+}
+
 #ifdef HAVE_RT_PUSH_IPI
 extern void rto_push_irq_work_func(struct irq_work *work);
 #endif
@@ -2518,8 +2529,7 @@ static inline void add_nr_running(struct rq *rq, unsigned count)

 #ifdef CONFIG_SMP
 	if (prev_nr < 2 && rq->nr_running >= 2) {
-		if (!READ_ONCE(rq->rd->overload))
-			WRITE_ONCE(rq->rd->overload, 1);
+		set_rd_overload_status(rq->rd, SG_OVERLOAD);
 	}
 #endif

--
2.39.3



* Re: [PATCH v3 0/2] sched: Minor changes for rd->overload access
  2024-03-25  5:45 [PATCH v3 0/2] sched: Minor changes for rd->overload access Shrikanth Hegde
  2024-03-25  5:45 ` [PATCH v3 1/2] sched/fair: Check rd->overload value before update Shrikanth Hegde
  2024-03-25  5:45 ` [PATCH v3 2/2] sched/fair: Use helper functions to access rd->overload Shrikanth Hegde
@ 2024-03-25 10:36 ` Ingo Molnar
  2024-03-25 11:33   ` Shrikanth Hegde
  2 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2024-03-25 10:36 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: peterz, vincent.guittot, dietmar.eggemann, qyousef, linux-kernel,
	vschneid


* Shrikanth Hegde <sshegde@linux.ibm.com> wrote:

> v2 -> v3:
> - Wrapped check for value change inside ser_rd_overload_status
>   as suggested by Qais.
> - Added reviewed-by tags.
> 
> v1 -> v2:
> - dropped Fixes tag.
> - Added one of the perf probes in the changelog for reference.
> - Added reviewed-by tags.
> 
> tl;dr
> When running workloads in large systems, it was observed that access to
> rd->overload was taking time. It would be better to check the value
> before updating since value changes less often. Patch 1/2 does that.
> With patch updates happen only if necessary. CPU Bus traffic reduced a
> bit. No significant gains in workload performance.

Could you please post this against the latest scheduler tree, ie. tip:sched/core?

Thanks,

	Ingo


* Re: [PATCH v3 0/2] sched: Minor changes for rd->overload access
  2024-03-25 10:36 ` [PATCH v3 0/2] sched: Minor changes for rd->overload access Ingo Molnar
@ 2024-03-25 11:33   ` Shrikanth Hegde
  2024-03-26  8:00     ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Shrikanth Hegde @ 2024-03-25 11:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: peterz, vincent.guittot, dietmar.eggemann, qyousef, linux-kernel,
	vschneid



On 3/25/24 4:06 PM, Ingo Molnar wrote:
> 
> * Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
> 
>> v2 -> v3:
>> - Wrapped check for value change inside ser_rd_overload_status
>>   as suggested by Qais.
>> - Added reviewed-by tags.
>>
>> v1 -> v2:
>> - dropped Fixes tag.
>> - Added one of the perf probes in the changelog for reference.
>> - Added reviewed-by tags.
>>
>> tl;dr
>> When running workloads in large systems, it was observed that access to
>> rd->overload was taking time. It would be better to check the value
>> before updating since value changes less often. Patch 1/2 does that.
>> With patch updates happen only if necessary. CPU Bus traffic reduced a
>> bit. No significant gains in workload performance.
> 
> Could you please post this against the latest scheduler tree, ie. tip:sched/core?
> 
> Thanks,
> 
> 	Ingo

Hi Ingo. I had mentioned this in the cover letter. Sorry if I didn't
mention it in the correct place.

*These patches depend on below to be applied first*
https://lore.kernel.org/all/20240307085725.444486-1-sshegde@linux.ibm.com/

After that, the above patches would apply. I kept the dependency since
these are in a similar code area. Can we please consider the above
patch as well? Is there any other change that needs to be done?

On tip
1043c003415b (HEAD -> master) sched/fair: Use helper functions to access rd->overload
3049bc16643d sched/fair: Check rd->overload value before update
436d634f2cad sched/fair: Combine EAS check with overutilized access
379aa2cd62e0 sched/fair: Use helper function to access rd->overutilized
19bfeb2d565e sched/fair: Add EAS checks before updating overutilized
71706005072c (origin/master, origin/HEAD) Merge branch into tip/master: 'x86/shstk'


fa63c2c111ea (HEAD -> sched/core) sched/fair: Use helper functions to access rd->overload
9bef486d044b sched/fair: Check rd->overload value before update
21f90cae75c8 sched/fair: Combine EAS check with overutilized access
e835f1fa3654 sched/fair: Use helper function to access rd->overutilized
ddf390f9449d sched/fair: Add EAS checks before updating overutilized
58eeb2d79b54 (origin/sched/core) sched/fair: Don't double balance_interval for migrate_misfit



* Re: [PATCH v3 0/2] sched: Minor changes for rd->overload access
  2024-03-25 11:33   ` Shrikanth Hegde
@ 2024-03-26  8:00     ` Ingo Molnar
  2024-03-27  6:04       ` Shrikanth Hegde
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2024-03-26  8:00 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: peterz, vincent.guittot, dietmar.eggemann, qyousef, linux-kernel,
	vschneid


* Shrikanth Hegde <sshegde@linux.ibm.com> wrote:

> 
> 
> On 3/25/24 4:06 PM, Ingo Molnar wrote:
> > 
> > * Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
> > 
> >> v2 -> v3:
> >> - Wrapped check for value change inside ser_rd_overload_status
> >>   as suggested by Qais.
> >> - Added reviewed-by tags.
> >>
> >> v1 -> v2:
> >> - dropped Fixes tag.
> >> - Added one of the perf probes in the changelog for reference.
> >> - Added reviewed-by tags.
> >>
> >> tl;dr
> >> When running workloads in large systems, it was observed that access to
> >> rd->overload was taking time. It would be better to check the value
> >> before updating since value changes less often. Patch 1/2 does that.
> >> With patch updates happen only if necessary. CPU Bus traffic reduced a
> >> bit. No significant gains in workload performance.
> > 
> > Could you please post this against the latest scheduler tree, ie. tip:sched/core?
> > 
> > Thanks,
> > 
> > 	Ingo
> 
> Hi Ingo. I had mentioned the same in cover letter. Sorry if I didn't 
> mention it in the correct place. 
> 
> *These patches depend on below to be applied first*
> https://lore.kernel.org/all/20240307085725.444486-1-sshegde@linux.ibm.com/
> 
> After that the above patch would apply.

OK, I missed that, but I don't really like patch #3 of that other series, 
so we'll need to address that first.

Thanks,

	Ingo


* Re: [PATCH v3 0/2] sched: Minor changes for rd->overload access
  2024-03-26  8:00     ` Ingo Molnar
@ 2024-03-27  6:04       ` Shrikanth Hegde
  2024-03-28 10:34         ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Shrikanth Hegde @ 2024-03-27  6:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: peterz, vincent.guittot, dietmar.eggemann, qyousef, linux-kernel,
	vschneid



On 3/26/24 1:30 PM, Ingo Molnar wrote:
> 
> * Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
> 
>>
>>
>> On 3/25/24 4:06 PM, Ingo Molnar wrote:
>>>
>>> * Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
>>>
>>>> v2 -> v3:
>>>> - Wrapped check for value change inside ser_rd_overload_status
>>>>   as suggested by Qais.
>>>> - Added reviewed-by tags.
>>>>
>>>> v1 -> v2:
>>>> - dropped Fixes tag.
>>>> - Added one of the perf probes in the changelog for reference.
>>>> - Added reviewed-by tags.
>>>>
>>>> tl;dr
>>>> When running workloads in large systems, it was observed that access to
>>>> rd->overload was taking time. It would be better to check the value
>>>> before updating since value changes less often. Patch 1/2 does that.
>>>> With patch updates happen only if necessary. CPU Bus traffic reduced a
>>>> bit. No significant gains in workload performance.
>>>
>>> Could you please post this against the latest scheduler tree, ie. tip:sched/core?
>>>
>>> Thanks,
>>>
>>> 	Ingo
>>
>> Hi Ingo. I had mentioned the same in cover letter. Sorry if I didn't 
>> mention it in the correct place. 
>>
>> *These patches depend on below to be applied first*
>> https://lore.kernel.org/all/20240307085725.444486-1-sshegde@linux.ibm.com/
>>
>> After that the above patch would apply.

Hi Ingo. 

These two patches apply cleanly now to sched/core. 

7a9dd7ef946c (HEAD -> sched/core) sched/fair: Use helper functions to access rd->overload
4f24aa918558 sched/fair: Check rd->overload value before update
c829d6818b60 (origin/sched/core) sched/fair: Simplify the continue_balancing logic in sched_balance_newidle()
d0f5d3cefc25 sched/fair: Introduce is_rd_overutilized() helper function to access root_domain::overutilized


> 
> OK, I missed that, but I don't really like patch #3 of that other series, 
> so we'll need to address that first.
> 

That will be a standalone patch now and can go later based on the discussions. 

> Thanks,
> 
> 	Ingo


* Re: [PATCH v3 0/2] sched: Minor changes for rd->overload access
  2024-03-27  6:04       ` Shrikanth Hegde
@ 2024-03-28 10:34         ` Ingo Molnar
  2024-03-28 10:56           ` [tip: sched/core] sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED tip-bot2 for Ingo Molnar
                             ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Ingo Molnar @ 2024-03-28 10:34 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: peterz, vincent.guittot, dietmar.eggemann, qyousef, linux-kernel,
	vschneid


* Shrikanth Hegde <sshegde@linux.ibm.com> wrote:

> 
> 
> On 3/26/24 1:30 PM, Ingo Molnar wrote:
> > 
> > * Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
> > 
> >>
> >>
> >> On 3/25/24 4:06 PM, Ingo Molnar wrote:
> >>>
> >>> * Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
> >>>
> >>>> v2 -> v3:
> >>>> - Wrapped check for value change inside ser_rd_overload_status
> >>>>   as suggested by Qais.
> >>>> - Added reviewed-by tags.
> >>>>
> >>>> v1 -> v2:
> >>>> - dropped Fixes tag.
> >>>> - Added one of the perf probes in the changelog for reference.
> >>>> - Added reviewed-by tags.
> >>>>
> >>>> tl;dr
> >>>> When running workloads in large systems, it was observed that access to
> >>>> rd->overload was taking time. It would be better to check the value
> >>>> before updating since value changes less often. Patch 1/2 does that.
> >>>> With patch updates happen only if necessary. CPU Bus traffic reduced a
> >>>> bit. No significant gains in workload performance.
> >>>
> >>> Could you please post this against the latest scheduler tree, ie. tip:sched/core?
> >>>
> >>> Thanks,
> >>>
> >>> 	Ingo
> >>
> >> Hi Ingo. I had mentioned the same in cover letter. Sorry if I didn't 
> >> mention it in the correct place. 
> >>
> >> *These patches depend on below to be applied first*
> >> https://lore.kernel.org/all/20240307085725.444486-1-sshegde@linux.ibm.com/
> >>
> >> After that the above patch would apply.
> 
> Hi Ingo. 
> 
> These two patches apply cleanly now to sched/core. 
> 
> 7a9dd7ef946c (HEAD -> sched/core) sched/fair: Use helper functions to access rd->overload
> 4f24aa918558 sched/fair: Check rd->overload value before update
> c829d6818b60 (origin/sched/core) sched/fair: Simplify the continue_balancing logic in sched_balance_newidle()
> d0f5d3cefc25 sched/fair: Introduce is_rd_overutilized() helper function to access root_domain::overutilized

I've applied them, but note that there were quite a few avoidable typos 
and grammar mistakes in the changelogs. Please always review what 
you've submitted versus what I have applied and try to learn from that: 
I almost invariably have to edit some detail to make the changelog more 
presentable... Could you please take more care with future patches?

I also renamed the accessor functions in the second patch to:

      get_rd_overload()
      set_rd_overload()

Plus I've applied a patch to rename ::overload to ::overloaded. It is 
silly to use an ambiguous noun instead of a clear adjective when naming 
such a flag ...

Thanks,

	Ingo


* [tip: sched/core] sched/fair: Use helper functions to access root_domain::overload
  2024-03-25  5:45 ` [PATCH v3 2/2] sched/fair: Use helper functions to access rd->overload Shrikanth Hegde
@ 2024-03-28 10:47   ` tip-bot2 for Shrikanth Hegde
  0 siblings, 0 replies; 18+ messages in thread
From: tip-bot2 for Shrikanth Hegde @ 2024-03-28 10:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Qais Yousef, Shrikanth Hegde, Ingo Molnar, Vincent Guittot, x86,
	linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     caac6291728ed5493d8a53f4b086c270849ce0c4
Gitweb:        https://git.kernel.org/tip/caac6291728ed5493d8a53f4b086c270849ce0c4
Author:        Shrikanth Hegde <sshegde@linux.ibm.com>
AuthorDate:    Mon, 25 Mar 2024 11:15:05 +05:30
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 28 Mar 2024 11:30:13 +01:00

sched/fair: Use helper functions to access root_domain::overload

Introduce two helper functions to access & set the root_domain::overload flag:

  get_rd_overload()
  set_rd_overload()

To make sure the code always follows the READ_ONCE()/WRITE_ONCE() access methods.

No change in functionality intended.

[ mingo: Renamed the accessors to get_/set_rd_overload(), tidied up the changelog. ]

Suggested-by: Qais Yousef <qyousef@layalina.io>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240325054505.201995-3-sshegde@linux.ibm.com
---
 kernel/sched/fair.c  |  5 ++---
 kernel/sched/sched.h | 14 ++++++++++++--
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 600fdde..0cc0582 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10657,8 +10657,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 	if (!env->sd->parent) {
 		/* update overload indicator if we are at root domain */
-		if (READ_ONCE(env->dst_rq->rd->overload) != (sg_status & SG_OVERLOAD))
-			WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);
+		set_rd_overload(env->dst_rq->rd, sg_status & SG_OVERLOAD);
 
 		/* Update over-utilization (tipping point, U >= 0) indicator */
 		set_rd_overutilized_status(env->dst_rq->rd,
@@ -12391,7 +12390,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 	rcu_read_lock();
 	sd = rcu_dereference_check_sched_domain(this_rq->sd);
 
-	if (!READ_ONCE(this_rq->rd->overload) ||
+	if (!get_rd_overload(this_rq->rd) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
 		if (sd)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4f9e952..e86ee26 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -930,6 +930,17 @@ extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
 extern void sched_get_rd(struct root_domain *rd);
 extern void sched_put_rd(struct root_domain *rd);
 
+static inline int get_rd_overload(struct root_domain *rd)
+{
+	return READ_ONCE(rd->overload);
+}
+
+static inline void set_rd_overload(struct root_domain *rd, int status)
+{
+	if (get_rd_overload(rd) != status)
+		WRITE_ONCE(rd->overload, status);
+}
+
 #ifdef HAVE_RT_PUSH_IPI
 extern void rto_push_irq_work_func(struct irq_work *work);
 #endif
@@ -2530,8 +2541,7 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 
 #ifdef CONFIG_SMP
 	if (prev_nr < 2 && rq->nr_running >= 2) {
-		if (!READ_ONCE(rq->rd->overload))
-			WRITE_ONCE(rq->rd->overload, 1);
+		set_rd_overload(rq->rd, SG_OVERLOAD);
 	}
 #endif
 


* [tip: sched/core] sched/fair: Check root_domain::overload value before update
  2024-03-25  5:45 ` [PATCH v3 1/2] sched/fair: Check rd->overload value before update Shrikanth Hegde
@ 2024-03-28 10:47   ` tip-bot2 for Shrikanth Hegde
  0 siblings, 0 replies; 18+ messages in thread
From: tip-bot2 for Shrikanth Hegde @ 2024-03-28 10:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Shrikanth Hegde, Ingo Molnar, Qais Yousef, Vincent Guittot,
	Mel Gorman, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     c628db0a6831f80e89873ee44f1b40e3ab3216c6
Gitweb:        https://git.kernel.org/tip/c628db0a6831f80e89873ee44f1b40e3ab3216c6
Author:        Shrikanth Hegde <sshegde@linux.ibm.com>
AuthorDate:    Mon, 25 Mar 2024 11:15:04 +05:30
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 28 Mar 2024 11:30:13 +01:00

sched/fair: Check root_domain::overload value before update

The root_domain::overload flag is 1 when there's any rq
in the root domain that has 2 or more running tasks. (Ie. it's overloaded.)

The root_domain structure itself is a global structure per cpuset island.

The ::overload flag is maintained the following way:

  - Set when adding a second task to the runqueue.

  - It is cleared in update_sd_lb_stats() during load balance,
    if none of the rqs have 2 or more running tasks.

This flag is used during newidle balance to see if its worth doing a full
load balance pass, which can be an expensive operation. If it is set,
then newidle balance will try to aggressively pull a task.

Since commit:

  630246a06ae2 ("sched/fair: Clean-up update_sg_lb_stats parameters")

::overload is being written unconditionally, even if it has the same
value. The change in value of this depends on the workload, but on
typical workloads, it doesn't change all that often: a system is
either dominantly overloaded for substantial amounts of time, or not.

Extra writes to this semi-global structure cause unnecessary overhead, extra
bus traffic, etc. - so avoid it as much as possible.

Perf probe stats show that it's worth making this change (numbers are
with patch applied):

	1M    probe:sched_balance_newidle_L38
	139   probe:update_sd_lb_stats_L53     <====== 1->0 writes
	129K  probe:add_nr_running_L12
	74    probe:add_nr_running_L13         <====== 0->1 writes
	54K   probe:update_sd_lb_stats_L50     <====== reads

These numbers show that actual changes to the ::overload value are (much) less
frequent: reads (L50) number ~54,000, while the writes (L53+L13) total only 139+74.

[ mingo: Rewrote the changelog. ]

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20240325054505.201995-2-sshegde@linux.ibm.com
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3846230..600fdde 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10657,7 +10657,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 	if (!env->sd->parent) {
 		/* update overload indicator if we are at root domain */
-		WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);
+		if (READ_ONCE(env->dst_rq->rd->overload) != (sg_status & SG_OVERLOAD))
+			WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);
 
 		/* Update over-utilization (tipping point, U >= 0) indicator */
 		set_rd_overutilized_status(env->dst_rq->rd,

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [tip: sched/core] sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED
  2024-03-28 10:34         ` Ingo Molnar
@ 2024-03-28 10:56           ` tip-bot2 for Ingo Molnar
  2024-03-28 10:56           ` [tip: sched/core] sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded() tip-bot2 for Ingo Molnar
                             ` (4 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: tip-bot2 for Ingo Molnar @ 2024-03-28 10:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Ingo Molnar, Qais Yousef, Shrikanth Hegde, Vincent Guittot,
	Peter Zijlstra, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     7bda10ba7f453729f210264dd07d38989fb858d9
Gitweb:        https://git.kernel.org/tip/7bda10ba7f453729f210264dd07d38989fb858d9
Author:        Ingo Molnar <mingo@kernel.org>
AuthorDate:    Thu, 28 Mar 2024 11:44:16 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 28 Mar 2024 11:44:44 +01:00

sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED

Follow the rename of the root_domain::overloaded flag.

Note that this also matches the SG_OVERUTILIZED flag better.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
---
 kernel/sched/fair.c  | 6 +++---
 kernel/sched/sched.h | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf10665..839a97a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9961,7 +9961,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->sum_nr_running += nr_running;
 
 		if (nr_running > 1)
-			*sg_status |= SG_OVERLOAD;
+			*sg_status |= SG_OVERLOADED;
 
 		if (cpu_overutilized(i))
 			*sg_status |= SG_OVERUTILIZED;
@@ -9986,7 +9986,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			/* Check for a misfit task on the cpu */
 			if (sgs->group_misfit_task_load < rq->misfit_task_load) {
 				sgs->group_misfit_task_load = rq->misfit_task_load;
-				*sg_status |= SG_OVERLOAD;
+				*sg_status |= SG_OVERLOADED;
 			}
 		} else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
 			/* Check for a task running on a CPU with reduced capacity */
@@ -10657,7 +10657,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 	if (!env->sd->parent) {
 		/* update overload indicator if we are at root domain */
-		set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOAD);
+		set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
 
 		/* Update over-utilization (tipping point, U >= 0) indicator */
 		set_rd_overutilized_status(env->dst_rq->rd,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7e7ae1..07c6669 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -851,7 +851,7 @@ struct perf_domain {
 };
 
 /* Scheduling group status flags */
-#define SG_OVERLOAD		0x1 /* More than one runnable task on a CPU. */
+#define SG_OVERLOADED		0x1 /* More than one runnable task on a CPU. */
 #define SG_OVERUTILIZED		0x2 /* One or more CPUs are over-utilized. */
 
 /*
@@ -2541,7 +2541,7 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 
 #ifdef CONFIG_SMP
 	if (prev_nr < 2 && rq->nr_running >= 2) {
-		set_rd_overloaded(rq->rd, SG_OVERLOAD);
+		set_rd_overloaded(rq->rd, SG_OVERLOADED);
 	}
 #endif
 

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [tip: sched/core] sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded()
  2024-03-28 10:34         ` Ingo Molnar
  2024-03-28 10:56           ` [tip: sched/core] sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED tip-bot2 for Ingo Molnar
@ 2024-03-28 10:56           ` tip-bot2 for Ingo Molnar
  2024-03-28 10:56           ` [tip: sched/core] sched/fair: Rename root_domain::overload to ::overloaded tip-bot2 for Ingo Molnar
                             ` (3 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: tip-bot2 for Ingo Molnar @ 2024-03-28 10:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Ingo Molnar, Qais Yousef, Shrikanth Hegde, Vincent Guittot,
	Peter Zijlstra, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     76cc4f91488af0a808bec97794bfe434dece7d67
Gitweb:        https://git.kernel.org/tip/76cc4f91488af0a808bec97794bfe434dece7d67
Author:        Ingo Molnar <mingo@kernel.org>
AuthorDate:    Thu, 28 Mar 2024 11:41:31 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 28 Mar 2024 11:41:31 +01:00

sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded()

Follow the rename of the root_domain::overloaded flag.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
---
 kernel/sched/fair.c  | 4 ++--
 kernel/sched/sched.h | 8 ++++----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0cc0582..bf10665 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10657,7 +10657,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 	if (!env->sd->parent) {
 		/* update overload indicator if we are at root domain */
-		set_rd_overload(env->dst_rq->rd, sg_status & SG_OVERLOAD);
+		set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOAD);
 
 		/* Update over-utilization (tipping point, U >= 0) indicator */
 		set_rd_overutilized_status(env->dst_rq->rd,
@@ -12390,7 +12390,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 	rcu_read_lock();
 	sd = rcu_dereference_check_sched_domain(this_rq->sd);
 
-	if (!get_rd_overload(this_rq->rd) ||
+	if (!get_rd_overloaded(this_rq->rd) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
 		if (sd)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cddc479..c7e7ae1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -930,14 +930,14 @@ extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
 extern void sched_get_rd(struct root_domain *rd);
 extern void sched_put_rd(struct root_domain *rd);
 
-static inline int get_rd_overload(struct root_domain *rd)
+static inline int get_rd_overloaded(struct root_domain *rd)
 {
 	return READ_ONCE(rd->overloaded);
 }
 
-static inline void set_rd_overload(struct root_domain *rd, int status)
+static inline void set_rd_overloaded(struct root_domain *rd, int status)
 {
-	if (get_rd_overload(rd) != status)
+	if (get_rd_overloaded(rd) != status)
 		WRITE_ONCE(rd->overloaded, status);
 }
 
@@ -2541,7 +2541,7 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 
 #ifdef CONFIG_SMP
 	if (prev_nr < 2 && rq->nr_running >= 2) {
-		set_rd_overload(rq->rd, SG_OVERLOAD);
+		set_rd_overloaded(rq->rd, SG_OVERLOAD);
 	}
 #endif
 

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [tip: sched/core] sched/fair: Rename root_domain::overload to ::overloaded
  2024-03-28 10:34         ` Ingo Molnar
  2024-03-28 10:56           ` [tip: sched/core] sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED tip-bot2 for Ingo Molnar
  2024-03-28 10:56           ` [tip: sched/core] sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded() tip-bot2 for Ingo Molnar
@ 2024-03-28 10:56           ` tip-bot2 for Ingo Molnar
  2024-03-28 11:07           ` [PATCH v3 0/2] sched: Minor changes for rd->overload access Ingo Molnar
                             ` (2 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: tip-bot2 for Ingo Molnar @ 2024-03-28 10:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Ingo Molnar, Qais Yousef, Shrikanth Hegde, Vincent Guittot,
	Peter Zijlstra, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     dfb83ef7b8b064c15be19cf7fcbde0996712de8f
Gitweb:        https://git.kernel.org/tip/dfb83ef7b8b064c15be19cf7fcbde0996712de8f
Author:        Ingo Molnar <mingo@kernel.org>
AuthorDate:    Thu, 28 Mar 2024 11:33:20 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 28 Mar 2024 11:38:58 +01:00

sched/fair: Rename root_domain::overload to ::overloaded

It is silly to use an ambiguous noun instead of a clear adjective when naming
such a flag ...

Note how root_domain::overutilized already used a proper adjective.

rd->overloaded is now set to 1 when the root domain is overloaded.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
---
 kernel/sched/sched.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e86ee26..cddc479 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -874,7 +874,7 @@ struct root_domain {
 	 * - More than one runnable task
 	 * - Running task is misfit
 	 */
-	int			overload;
+	int			overloaded;
 
 	/* Indicate one or more cpus over-utilized (tipping point) */
 	int			overutilized;
@@ -932,13 +932,13 @@ extern void sched_put_rd(struct root_domain *rd);
 
 static inline int get_rd_overload(struct root_domain *rd)
 {
-	return READ_ONCE(rd->overload);
+	return READ_ONCE(rd->overloaded);
 }
 
 static inline void set_rd_overload(struct root_domain *rd, int status)
 {
 	if (get_rd_overload(rd) != status)
-		WRITE_ONCE(rd->overload, status);
+		WRITE_ONCE(rd->overloaded, status);
 }
 
 #ifdef HAVE_RT_PUSH_IPI

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 0/2] sched: Minor changes for rd->overload access
  2024-03-28 10:34         ` Ingo Molnar
                             ` (2 preceding siblings ...)
  2024-03-28 10:56           ` [tip: sched/core] sched/fair: Rename root_domain::overload to ::overloaded tip-bot2 for Ingo Molnar
@ 2024-03-28 11:07           ` Ingo Molnar
  2024-03-28 17:19             ` Shrikanth Hegde
  2024-03-28 12:01           ` [tip: sched/core] sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized() tip-bot2 for Ingo Molnar
  2024-03-28 12:58           ` [PATCH v3 0/2] sched: Minor changes for rd->overload access Shrikanth Hegde
  5 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2024-03-28 11:07 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: peterz, vincent.guittot, dietmar.eggemann, qyousef, linux-kernel,
	vschneid


* Ingo Molnar <mingo@kernel.org> wrote:

> Plus I've applied a patch to rename ::overload to ::overloaded. It is 
> silly to use an ambiguous noun instead of a clear adjective when naming 
> such a flag ...

Plus SG_OVERLOAD should be SG_OVERLOADED as well - it now looks in line 
with SG_OVERUTILIZED:

 /* Scheduling group status flags */
 #define SG_OVERLOADED           0x1 /* More than one runnable task on a CPU. */
 #define SG_OVERUTILIZED         0x2 /* One or more CPUs are over-utilized. */

My followup question is: why are these a bitmask, why not separate 
flags?

AFAICS we only ever set them separately:

 thule:~/tip> git grep SG_OVER kernel/sched/
 kernel/sched/fair.c:            set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
 kernel/sched/fair.c:                    *sg_status |= SG_OVERLOADED;
 kernel/sched/fair.c:                    *sg_status |= SG_OVERUTILIZED;
 kernel/sched/fair.c:                            *sg_status |= SG_OVERLOADED;
 kernel/sched/fair.c:            set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
 kernel/sched/fair.c:                                       sg_status & SG_OVERUTILIZED);
 kernel/sched/fair.c:    } else if (sg_status & SG_OVERUTILIZED) {
 kernel/sched/fair.c:            set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
 kernel/sched/sched.h:#define SG_OVERLOADED              0x1 /* More than one runnable task on a CPU. */
 kernel/sched/sched.h:#define SG_OVERUTILIZED            0x2 /* One or more CPUs are over-utilized. */
 kernel/sched/sched.h:           set_rd_overloaded(rq->rd, SG_OVERLOADED);

In fact this results in suboptimal code:

                /* update overload indicator if we are at root domain */
                set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
                        
                /* Update over-utilization (tipping point, U >= 0) indicator */
                set_rd_overutilized_status(env->dst_rq->rd,
                                           sg_status & SG_OVERUTILIZED);

Note how the bits that got mixed together in sg_status now have to be 
masked out individually.

The sg_status bitmask appears to make no sense at all to me.

By turning these into individual bool flags we could also do away with 
all the extra SG_OVERLOADED/SG_OVERUTILIZED abstraction.

Ie. something like the patch below? Untested.

Thanks,

	Ingo

=================>
From: Ingo Molnar <mingo@kernel.org>
Date: Thu, 28 Mar 2024 12:00:14 +0100
Subject: [PATCH] sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags

SG_OVERLOADED and SG_OVERUTILIZED flags plus the sg_status bitmask are an
unnecessary complication that only make the code harder to read and slower.

We only ever set them separately:

 thule:~/tip> git grep SG_OVER kernel/sched/
 kernel/sched/fair.c:            set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
 kernel/sched/fair.c:                    *sg_status |= SG_OVERLOADED;
 kernel/sched/fair.c:                    *sg_status |= SG_OVERUTILIZED;
 kernel/sched/fair.c:                            *sg_status |= SG_OVERLOADED;
 kernel/sched/fair.c:            set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
 kernel/sched/fair.c:                                       sg_status & SG_OVERUTILIZED);
 kernel/sched/fair.c:    } else if (sg_status & SG_OVERUTILIZED) {
 kernel/sched/fair.c:            set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
 kernel/sched/sched.h:#define SG_OVERLOADED              0x1 /* More than one runnable task on a CPU. */
 kernel/sched/sched.h:#define SG_OVERUTILIZED            0x2 /* One or more CPUs are over-utilized. */
 kernel/sched/sched.h:           set_rd_overloaded(rq->rd, SG_OVERLOADED);

And use them separately, which results in suboptimal code:

                /* update overload indicator if we are at root domain */
                set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);

                /* Update over-utilization (tipping point, U >= 0) indicator */
                set_rd_overutilized_status(env->dst_rq->rd,

Introduce separate sg_overloaded and sg_overutilized flags in update_sd_lb_stats()
and its lower level functions, and change all of them to 'bool'.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c  | 33 ++++++++++++++++-----------------
 kernel/sched/sched.h | 17 ++++++-----------
 2 files changed, 22 insertions(+), 28 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f29efd5f19f6..ebc8d5f855de 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6688,19 +6688,18 @@ static inline bool cpu_overutilized(int cpu)
 /*
  * overutilized value make sense only if EAS is enabled
  */
-static inline int is_rd_overutilized(struct root_domain *rd)
+static inline bool is_rd_overutilized(struct root_domain *rd)
 {
 	return !sched_energy_enabled() || READ_ONCE(rd->overutilized);
 }
 
-static inline void set_rd_overutilized(struct root_domain *rd,
-					      unsigned int status)
+static inline void set_rd_overutilized(struct root_domain *rd, bool flag)
 {
 	if (!sched_energy_enabled())
 		return;
 
-	WRITE_ONCE(rd->overutilized, status);
-	trace_sched_overutilized_tp(rd, !!status);
+	WRITE_ONCE(rd->overutilized, flag);
+	trace_sched_overutilized_tp(rd, flag);
 }
 
 static inline void check_update_overutilized_status(struct rq *rq)
@@ -6711,7 +6710,7 @@ static inline void check_update_overutilized_status(struct rq *rq)
 	 */
 
 	if (!is_rd_overutilized(rq->rd) && cpu_overutilized(rq->cpu))
-		set_rd_overutilized(rq->rd, SG_OVERUTILIZED);
+		set_rd_overutilized(rq->rd, 1);
 }
 #else
 static inline void check_update_overutilized_status(struct rq *rq) { }
@@ -9940,7 +9939,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 				      struct sd_lb_stats *sds,
 				      struct sched_group *group,
 				      struct sg_lb_stats *sgs,
-				      int *sg_status)
+				      bool *sg_overloaded,
+				      bool *sg_overutilized)
 {
 	int i, nr_running, local_group;
 
@@ -9961,10 +9961,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->sum_nr_running += nr_running;
 
 		if (nr_running > 1)
-			*sg_status |= SG_OVERLOADED;
+			*sg_overloaded = 1;
 
 		if (cpu_overutilized(i))
-			*sg_status |= SG_OVERUTILIZED;
+			*sg_overutilized = 1;
 
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
@@ -9986,7 +9986,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			/* Check for a misfit task on the cpu */
 			if (sgs->group_misfit_task_load < rq->misfit_task_load) {
 				sgs->group_misfit_task_load = rq->misfit_task_load;
-				*sg_status |= SG_OVERLOADED;
+				*sg_overloaded = 1;
 			}
 		} else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
 			/* Check for a task running on a CPU with reduced capacity */
@@ -10612,7 +10612,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
 	unsigned long sum_util = 0;
-	int sg_status = 0;
+	bool sg_overloaded = 0, sg_overutilized = 0;
 
 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
@@ -10628,7 +10628,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 				update_group_capacity(env->sd, env->dst_cpu);
 		}
 
-		update_sg_lb_stats(env, sds, sg, sgs, &sg_status);
+		update_sg_lb_stats(env, sds, sg, sgs, &sg_overloaded, &sg_overutilized);
 
 		if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
@@ -10657,13 +10657,12 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 	if (!env->sd->parent) {
 		/* update overload indicator if we are at root domain */
-		set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
+		set_rd_overloaded(env->dst_rq->rd, sg_overloaded);
 
 		/* Update over-utilization (tipping point, U >= 0) indicator */
-		set_rd_overutilized(env->dst_rq->rd,
-					   sg_status & SG_OVERUTILIZED);
-	} else if (sg_status & SG_OVERUTILIZED) {
-		set_rd_overutilized(env->dst_rq->rd, SG_OVERUTILIZED);
 +		set_rd_overutilized(env->dst_rq->rd, sg_overutilized);
+	} else if (sg_overutilized) {
+		set_rd_overutilized(env->dst_rq->rd, sg_overutilized);
 	}
 
 	update_idle_cpu_scan(env, sum_util);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 07c6669b8250..7c39dbf31f75 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -713,7 +713,7 @@ struct rt_rq {
 	} highest_prio;
 #endif
 #ifdef CONFIG_SMP
-	int			overloaded;
+	bool			overloaded;
 	struct plist_head	pushable_tasks;
 
 #endif /* CONFIG_SMP */
@@ -757,7 +757,7 @@ struct dl_rq {
 		u64		next;
 	} earliest_dl;
 
-	int			overloaded;
+	bool			overloaded;
 
 	/*
 	 * Tasks on this rq that can be pushed away. They are kept in
@@ -850,10 +850,6 @@ struct perf_domain {
 	struct rcu_head rcu;
 };
 
-/* Scheduling group status flags */
-#define SG_OVERLOADED		0x1 /* More than one runnable task on a CPU. */
-#define SG_OVERUTILIZED		0x2 /* One or more CPUs are over-utilized. */
-
 /*
  * We add the notion of a root-domain which will be used to define per-domain
  * variables. Each exclusive cpuset essentially defines an island domain by
@@ -874,10 +870,10 @@ struct root_domain {
 	 * - More than one runnable task
 	 * - Running task is misfit
 	 */
-	int			overloaded;
+	bool			overloaded;
 
 	/* Indicate one or more cpus over-utilized (tipping point) */
-	int			overutilized;
+	bool			overutilized;
 
 	/*
 	 * The bit corresponding to a CPU gets set here if such CPU has more
@@ -2540,9 +2536,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 	}
 
 #ifdef CONFIG_SMP
-	if (prev_nr < 2 && rq->nr_running >= 2) {
-		set_rd_overloaded(rq->rd, SG_OVERLOADED);
-	}
+	if (prev_nr < 2 && rq->nr_running >= 2)
+		set_rd_overloaded(rq->rd, 1);
 #endif
 
 	sched_update_tick_dependency(rq);

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [tip: sched/core] sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized()
  2024-03-28 10:34         ` Ingo Molnar
                             ` (3 preceding siblings ...)
  2024-03-28 11:07           ` [PATCH v3 0/2] sched: Minor changes for rd->overload access Ingo Molnar
@ 2024-03-28 12:01           ` tip-bot2 for Ingo Molnar
  2024-03-28 12:58           ` [PATCH v3 0/2] sched: Minor changes for rd->overload access Shrikanth Hegde
  5 siblings, 0 replies; 18+ messages in thread
From: tip-bot2 for Ingo Molnar @ 2024-03-28 12:01 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Ingo Molnar, Qais Yousef, Shrikanth Hegde, Vincent Guittot,
	Peter Zijlstra, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     4d0a63e5b841c759c9a306aff158420421ef016f
Gitweb:        https://git.kernel.org/tip/4d0a63e5b841c759c9a306aff158420421ef016f
Author:        Ingo Molnar <mingo@kernel.org>
AuthorDate:    Thu, 28 Mar 2024 11:54:42 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 28 Mar 2024 11:54:42 +01:00

sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized()

The _status() postfix has no real meaning, simplify the naming
and harmonize it with set_rd_overloaded().

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/ZgVHq65XKsOZpfgK@gmail.com
---
 kernel/sched/fair.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 839a97a..f29efd5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6693,7 +6693,7 @@ static inline int is_rd_overutilized(struct root_domain *rd)
 	return !sched_energy_enabled() || READ_ONCE(rd->overutilized);
 }
 
-static inline void set_rd_overutilized_status(struct root_domain *rd,
+static inline void set_rd_overutilized(struct root_domain *rd,
 					      unsigned int status)
 {
 	if (!sched_energy_enabled())
@@ -6711,7 +6711,7 @@ static inline void check_update_overutilized_status(struct rq *rq)
 	 */
 
 	if (!is_rd_overutilized(rq->rd) && cpu_overutilized(rq->cpu))
-		set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
+		set_rd_overutilized(rq->rd, SG_OVERUTILIZED);
 }
 #else
 static inline void check_update_overutilized_status(struct rq *rq) { }
@@ -10660,10 +10660,10 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
 
 		/* Update over-utilization (tipping point, U >= 0) indicator */
-		set_rd_overutilized_status(env->dst_rq->rd,
+		set_rd_overutilized(env->dst_rq->rd,
 					   sg_status & SG_OVERUTILIZED);
 	} else if (sg_status & SG_OVERUTILIZED) {
-		set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
+		set_rd_overutilized(env->dst_rq->rd, SG_OVERUTILIZED);
 	}
 
 	update_idle_cpu_scan(env, sum_util);

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 0/2] sched: Minor changes for rd->overload access
  2024-03-28 10:34         ` Ingo Molnar
                             ` (4 preceding siblings ...)
  2024-03-28 12:01           ` [tip: sched/core] sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized() tip-bot2 for Ingo Molnar
@ 2024-03-28 12:58           ` Shrikanth Hegde
  5 siblings, 0 replies; 18+ messages in thread
From: Shrikanth Hegde @ 2024-03-28 12:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: peterz, vincent.guittot, dietmar.eggemann, qyousef, linux-kernel,
	vschneid



On 3/28/24 4:04 PM, Ingo Molnar wrote:
> 
> * Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
> 

>>
>> Hi Ingo. 
>>
>> These two patches apply cleanly now to sched/core. 
>>
>> 7a9dd7ef946c (HEAD -> sched/core) sched/fair: Use helper functions to access rd->overload
>> 4f24aa918558 sched/fair: Check rd->overload value before update
>> c829d6818b60 (origin/sched/core) sched/fair: Simplify the continue_balancing logic in sched_balance_newidle()
>> d0f5d3cefc25 sched/fair: Introduce is_rd_overutilized() helper function to access root_domain::overutilized
> 
> I've applied them, but note that there were quite a few avoidable typos 
> and grammar mistakes in the changelogs. Please always review what 
> you've submitted versus what I have applied and try to learn from that: 
> I almost invariably have to edit some detail to make the changelog more 
> presentable... Could you please take more care with future patches?
> 

Noted. 
I will learn from it, thank you. 

> I also renamed the accessor functions in the second patch to:
> 
>       get_rd_overload()
>       set_rd_overload()
> 
> Plus I've applied a patch to rename ::overload to ::overloaded. It is 
> silly to use an ambiguous noun instead of a clear adjective when naming 
> such a flag ...
> 

sure. looks fine.

> Thanks,
> 
> 	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 0/2] sched: Minor changes for rd->overload access
  2024-03-28 11:07           ` [PATCH v3 0/2] sched: Minor changes for rd->overload access Ingo Molnar
@ 2024-03-28 17:19             ` Shrikanth Hegde
  2024-03-29  6:55               ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Shrikanth Hegde @ 2024-03-28 17:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: peterz, vincent.guittot, dietmar.eggemann, qyousef, linux-kernel,
	vschneid



On 3/28/24 4:37 PM, Ingo Molnar wrote:
> 
> * Ingo Molnar <mingo@kernel.org> wrote:
> 
>> Plus I've applied a patch to rename ::overload to ::overloaded. It is 
>> silly to use an ambiguous noun instead of a clear adjective when naming 
>> such a flag ...
> 
> Plus SG_OVERLOAD should be SG_OVERLOADED as well - it now looks in line 
> with SG_OVERUTILIZED:
> 
>  /* Scheduling group status flags */
>  #define SG_OVERLOADED           0x1 /* More than one runnable task on a CPU. */
>  #define SG_OVERUTILIZED         0x2 /* One or more CPUs are over-utilized. */
> 
> My followup question is: why are these a bitmask, why not separate 
> flags?
> 
> AFAICS we only ever set them separately:
> 
>  thule:~/tip> git grep SG_OVER kernel/sched/
>  kernel/sched/fair.c:            set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
>  kernel/sched/fair.c:                    *sg_status |= SG_OVERLOADED;
>  kernel/sched/fair.c:                    *sg_status |= SG_OVERUTILIZED;
>  kernel/sched/fair.c:                            *sg_status |= SG_OVERLOADED;
>  kernel/sched/fair.c:            set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
>  kernel/sched/fair.c:                                       sg_status & SG_OVERUTILIZED);
>  kernel/sched/fair.c:    } else if (sg_status & SG_OVERUTILIZED) {
>  kernel/sched/fair.c:            set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
>  kernel/sched/sched.h:#define SG_OVERLOADED              0x1 /* More than one runnable task on a CPU. */
>  kernel/sched/sched.h:#define SG_OVERUTILIZED            0x2 /* One or more CPUs are over-utilized. */
>  kernel/sched/sched.h:           set_rd_overloaded(rq->rd, SG_OVERLOADED);
> 
> In fact this results in suboptimal code:
> 
>                 /* update overload indicator if we are at root domain */
>                 set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
>                         
>                 /* Update over-utilization (tipping point, U >= 0) indicator */
>                 set_rd_overutilized_status(env->dst_rq->rd,
>                                            sg_status & SG_OVERUTILIZED);
> 
> Note how the bits that got mixed together in sg_status now have to be 
> masked out individually.
> 
> The sg_status bitmask appears to make no sense at all to me.
> 
> By turning these into individual bool flags we could also do away with 
> all the extra SG_OVERLOADED/SG_OVERUTILIZED abstraction.
> 
> Ie. something like the patch below? Untested.

Looks good. I see it is merged to sched/core. 
I booted with that patch and hackbench shows the same results on a 320 CPU system.


> 
> Thanks,
> 
> 	Ingo
> +		set_rd_overutilized(rq->rd, 1);
>  }
>  #else
>  static inline void check_update_overutilized_status(struct rq *rq) { }
> @@ -9940,7 +9939,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  				      struct sd_lb_stats *sds,
>  				      struct sched_group *group,
>  				      struct sg_lb_stats *sgs,
> -				      int *sg_status)
> +				      bool *sg_overloaded,
> +				      bool *sg_overutilized)
>  {
>  	int i, nr_running, local_group;
>  
> @@ -9961,10 +9961,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  		sgs->sum_nr_running += nr_running;
>  
>  		if (nr_running > 1)
> -			*sg_status |= SG_OVERLOADED;
> +			*sg_overloaded = 1;
>  
>  		if (cpu_overutilized(i))
> -			*sg_status |= SG_OVERUTILIZED;
> +			*sg_overutilized = 1;
>  
>  #ifdef CONFIG_NUMA_BALANCING
>  		sgs->nr_numa_running += rq->nr_numa_running;
> @@ -9986,7 +9986,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  			/* Check for a misfit task on the cpu */
>  			if (sgs->group_misfit_task_load < rq->misfit_task_load) {
>  				sgs->group_misfit_task_load = rq->misfit_task_load;
> -				*sg_status |= SG_OVERLOADED;
> +				*sg_overloaded = 1;
>  			}
>  		} else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
>  			/* Check for a task running on a CPU with reduced capacity */
> @@ -10612,7 +10612,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	struct sg_lb_stats *local = &sds->local_stat;
>  	struct sg_lb_stats tmp_sgs;
>  	unsigned long sum_util = 0;
> -	int sg_status = 0;
> +	bool sg_overloaded = 0, sg_overutilized = 0;
>  
>  	do {
>  		struct sg_lb_stats *sgs = &tmp_sgs;
> @@ -10628,7 +10628,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  				update_group_capacity(env->sd, env->dst_cpu);
>  		}
>  
> -		update_sg_lb_stats(env, sds, sg, sgs, &sg_status);
> +		update_sg_lb_stats(env, sds, sg, sgs, &sg_overloaded, &sg_overutilized);
>  
>  		if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
>  			sds->busiest = sg;
> @@ -10657,13 +10657,12 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  
>  	if (!env->sd->parent) {
>  		/* update overload indicator if we are at root domain */
> -		set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
> +		set_rd_overloaded(env->dst_rq->rd, sg_overloaded);
>  
>  		/* Update over-utilization (tipping point, U >= 0) indicator */
> -		set_rd_overutilized(env->dst_rq->rd,
> -					   sg_status & SG_OVERUTILIZED);
> -	} else if (sg_status & SG_OVERUTILIZED) {
> -		set_rd_overutilized(env->dst_rq->rd, SG_OVERUTILIZED);
> +		set_rd_overutilized(env->dst_rq->rd, sg_overutilized);
> +	} else if (sg_overutilized) {
> +		set_rd_overutilized(env->dst_rq->rd, sg_overutilized);
>  	}
>  
>  	update_idle_cpu_scan(env, sum_util);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 07c6669b8250..7c39dbf31f75 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -713,7 +713,7 @@ struct rt_rq {
>  	} highest_prio;
>  #endif
>  #ifdef CONFIG_SMP
> -	int			overloaded;
> +	bool			overloaded;
>  	struct plist_head	pushable_tasks;
>  
>  #endif /* CONFIG_SMP */
> @@ -757,7 +757,7 @@ struct dl_rq {
>  		u64		next;
>  	} earliest_dl;
>  
> -	int			overloaded;
> +	bool			overloaded;
>  
>  	/*
>  	 * Tasks on this rq that can be pushed away. They are kept in
> @@ -850,10 +850,6 @@ struct perf_domain {
>  	struct rcu_head rcu;
>  };
>  
> -/* Scheduling group status flags */
> -#define SG_OVERLOADED		0x1 /* More than one runnable task on a CPU. */
> -#define SG_OVERUTILIZED		0x2 /* One or more CPUs are over-utilized. */
> -
>  /*
>   * We add the notion of a root-domain which will be used to define per-domain
>   * variables. Each exclusive cpuset essentially defines an island domain by
> @@ -874,10 +870,10 @@ struct root_domain {
>  	 * - More than one runnable task
>  	 * - Running task is misfit
>  	 */
> -	int			overloaded;
> +	bool			overloaded;
>  
>  	/* Indicate one or more cpus over-utilized (tipping point) */
> -	int			overutilized;
> +	bool			overutilized;
>  
>  	/*
>  	 * The bit corresponding to a CPU gets set here if such CPU has more
> @@ -2540,9 +2536,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
>  	}
>  
>  #ifdef CONFIG_SMP
> -	if (prev_nr < 2 && rq->nr_running >= 2) {
> -		set_rd_overloaded(rq->rd, SG_OVERLOADED);
> -	}
> +	if (prev_nr < 2 && rq->nr_running >= 2)
> +		set_rd_overloaded(rq->rd, 1);
>  #endif
>  
>  	sched_update_tick_dependency(rq);

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 0/2] sched: Minor changes for rd->overload access
  2024-03-28 17:19             ` Shrikanth Hegde
@ 2024-03-29  6:55               ` Ingo Molnar
  0 siblings, 0 replies; 18+ messages in thread
From: Ingo Molnar @ 2024-03-29  6:55 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: peterz, vincent.guittot, dietmar.eggemann, qyousef, linux-kernel,
	vschneid


* Shrikanth Hegde <sshegde@linux.ibm.com> wrote:

> 
> 
> On 3/28/24 4:37 PM, Ingo Molnar wrote:
> > 
> > * Ingo Molnar <mingo@kernel.org> wrote:
> > 
> >> Plus I've applied a patch to rename ::overload to ::overloaded. It is 
> >> silly to use an ambiguous noun instead of a clear adjective when naming 
> >> such a flag ...
> > 
> > Plus SG_OVERLOAD should be SG_OVERLOADED as well - it now looks in line 
> > with SG_OVERUTILIZED:
> > 
> >  /* Scheduling group status flags */
> >  #define SG_OVERLOADED           0x1 /* More than one runnable task on a CPU. */
> >  #define SG_OVERUTILIZED         0x2 /* One or more CPUs are over-utilized. */
> > 
> > My followup question is: why are these a bitmask, why not separate 
> > flags?
> > 
> > AFAICS we only ever set them separately:
> > 
> >  thule:~/tip> git grep SG_OVER kernel/sched/
> >  kernel/sched/fair.c:            set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
> >  kernel/sched/fair.c:                    *sg_status |= SG_OVERLOADED;
> >  kernel/sched/fair.c:                    *sg_status |= SG_OVERUTILIZED;
> >  kernel/sched/fair.c:                            *sg_status |= SG_OVERLOADED;
> >  kernel/sched/fair.c:            set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
> >  kernel/sched/fair.c:                                       sg_status & SG_OVERUTILIZED);
> >  kernel/sched/fair.c:    } else if (sg_status & SG_OVERUTILIZED) {
> >  kernel/sched/fair.c:            set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
> >  kernel/sched/sched.h:#define SG_OVERLOADED              0x1 /* More than one runnable task on a CPU. */
> >  kernel/sched/sched.h:#define SG_OVERUTILIZED            0x2 /* One or more CPUs are over-utilized. */
> >  kernel/sched/sched.h:           set_rd_overloaded(rq->rd, SG_OVERLOADED);
> > 
> > In fact this results in suboptimal code:
> > 
> >                 /* update overload indicator if we are at root domain */
> >                 set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
> >                         
> >                 /* Update over-utilization (tipping point, U >= 0) indicator */
> >                 set_rd_overutilized_status(env->dst_rq->rd,
> >                                            sg_status & SG_OVERUTILIZED);
> > 
> > Note how the bits that got mixed together in sg_status now have to be 
> > masked out individually.
> > 
> > The sg_status bitmask appears to make no sense at all to me.
> > 
> > By turning these into individual bool flags we could also do away with 
> > all the extra SG_OVERLOADED/SG_OVERUTILIZED abstraction.
> > 
> > Ie. something like the patch below? Untested.
> 
> Looks good. I see it is merged to sched/core.
> Did a boot with that patch, and hackbench shows the same results on the 320 CPU system.

Thanks, I've added:

    Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>

And applied the additional docbook fix below on top as well.

Thanks,

	Ingo

=================>
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ebc8d5f855de..1dd37168da50 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9933,7 +9933,8 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
  * @sds: Load-balancing data with statistics of the local group.
  * @group: sched_group whose statistics are to be updated.
  * @sgs: variable to hold the statistics for this group.
- * @sg_status: Holds flag indicating the status of the sched_group
+ * @sg_overloaded: sched_group is overloaded
+ * @sg_overutilized: sched_group is overutilized
  */
 static inline void update_sg_lb_stats(struct lb_env *env,
 				      struct sd_lb_stats *sds,

^ permalink raw reply related	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-03-29  6:55 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-25  5:45 [PATCH v3 0/2] sched: Minor changes for rd->overload access Shrikanth Hegde
2024-03-25  5:45 ` [PATCH v3 1/2] sched/fair: Check rd->overload value before update Shrikanth Hegde
2024-03-28 10:47   ` [tip: sched/core] sched/fair: Check root_domain::overload " tip-bot2 for Shrikanth Hegde
2024-03-25  5:45 ` [PATCH v3 2/2] sched/fair: Use helper functions to access rd->overload Shrikanth Hegde
2024-03-28 10:47   ` [tip: sched/core] sched/fair: Use helper functions to access root_domain::overload tip-bot2 for Shrikanth Hegde
2024-03-25 10:36 ` [PATCH v3 0/2] sched: Minor changes for rd->overload access Ingo Molnar
2024-03-25 11:33   ` Shrikanth Hegde
2024-03-26  8:00     ` Ingo Molnar
2024-03-27  6:04       ` Shrikanth Hegde
2024-03-28 10:34         ` Ingo Molnar
2024-03-28 10:56           ` [tip: sched/core] sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED tip-bot2 for Ingo Molnar
2024-03-28 10:56           ` [tip: sched/core] sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded() tip-bot2 for Ingo Molnar
2024-03-28 10:56           ` [tip: sched/core] sched/fair: Rename root_domain::overload to ::overloaded tip-bot2 for Ingo Molnar
2024-03-28 11:07           ` [PATCH v3 0/2] sched: Minor changes for rd->overload access Ingo Molnar
2024-03-28 17:19             ` Shrikanth Hegde
2024-03-29  6:55               ` Ingo Molnar
2024-03-28 12:01           ` [tip: sched/core] sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized() tip-bot2 for Ingo Molnar
2024-03-28 12:58           ` [PATCH v3 0/2] sched: Minor changes for rd->overload access Shrikanth Hegde
