linux-kernel.vger.kernel.org archive mirror
* [PATCH] sched: Optimize housekeeping_cpumask in for_each_cpu_and
@ 2021-06-06 13:11 Yuan ZhaoXiong
  2021-06-28 13:58 ` [tip: sched/core] sched: Optimize housekeeping_cpumask() in for_each_cpu_and() tip-bot2 for Yuan ZhaoXiong
  0 siblings, 1 reply; 11+ messages in thread
From: Yuan ZhaoXiong @ 2021-06-06 13:11 UTC
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot
  Cc: linux-kernel, lirongqing, yuanzhaoxiong

On a 128-core AMD machine with 8 cores in nohz_full mode and the rest
used for housekeeping, when many housekeeping CPUs are idle, ftrace
shows a large amount of time spent in the loop that searches for the
nearest busy housekeeping CPU.

   9)               |              get_nohz_timer_target() {
   9)               |                housekeeping_test_cpu() {
   9)   0.390 us    |                  housekeeping_get_mask.part.1();
   9)   0.561 us    |                }
   9)   0.090 us    |                __rcu_read_lock();
   9)   0.090 us    |                housekeeping_cpumask();
   9)   0.521 us    |                housekeeping_cpumask();
   9)   0.140 us    |                housekeeping_cpumask();

   ...

   9)   0.500 us    |                housekeeping_cpumask();
   9)               |                housekeeping_any_cpu() {
   9)   0.090 us    |                  housekeeping_get_mask.part.1();
   9)   0.100 us    |                  sched_numa_find_closest();
   9)   0.491 us    |                }
   9)   0.100 us    |                __rcu_read_unlock();
   9) + 76.163 us   |              }

for_each_cpu_and() is a macro, so in get_nohz_timer_target() the loop
        for_each_cpu_and(i, sched_domain_span(sd),
                housekeeping_cpumask(HK_FLAG_TIMER))
expands to:
        for (i = -1; i = cpumask_next_and(i, sched_domain_span(sd),
                housekeeping_cpumask(HK_FLAG_TIMER)), i < nr_cpu_ids;)
As a result, housekeeping_cpumask() is invoked on every iteration.
housekeeping_cpumask() returns the same mask on each call, so calling
it repeatedly is unnecessary. In my testing, this patch reduces the
worst-case search time from ~76us to ~16us.

find_new_ilb() has the same problem.

Co-developed-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Yuan ZhaoXiong <yuanzhaoxiong@baidu.com>
---
 kernel/sched/core.c | 6 ++++--
 kernel/sched/fair.c | 6 ++++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 98191218d891..14ad3bb36321 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -645,6 +645,7 @@ int get_nohz_timer_target(void)
 {
 	int i, cpu = smp_processor_id(), default_cpu = -1;
 	struct sched_domain *sd;
+	const struct cpumask *hk_mask;
 
 	if (housekeeping_cpu(cpu, HK_FLAG_TIMER)) {
 		if (!idle_cpu(cpu))
@@ -652,10 +653,11 @@ int get_nohz_timer_target(void)
 		default_cpu = cpu;
 	}
 
+	hk_mask = housekeeping_cpumask(HK_FLAG_TIMER);
+
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
-		for_each_cpu_and(i, sched_domain_span(sd),
-			housekeeping_cpumask(HK_FLAG_TIMER)) {
+		for_each_cpu_and(i, sched_domain_span(sd), hk_mask) {
 			if (cpu == i)
 				continue;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..d3ecfbf160bf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10097,9 +10097,11 @@ static inline int on_null_domain(struct rq *rq)
 static inline int find_new_ilb(void)
 {
 	int ilb;
+	const struct cpumask *hk_mask;
 
-	for_each_cpu_and(ilb, nohz.idle_cpus_mask,
-			      housekeeping_cpumask(HK_FLAG_MISC)) {
+	hk_mask = housekeeping_cpumask(HK_FLAG_MISC);
+
+	for_each_cpu_and(ilb, nohz.idle_cpus_mask, hk_mask) {
 
 		if (ilb == smp_processor_id())
 			continue;
-- 
2.27.0



* [tip: sched/core] sched: Optimize housekeeping_cpumask() in for_each_cpu_and()
  2021-06-06 13:11 [PATCH] sched: Optimize housekeeping_cpumask in for_each_cpu_and Yuan ZhaoXiong
@ 2021-06-28 13:58 ` tip-bot2 for Yuan ZhaoXiong
  0 siblings, 0 replies; 11+ messages in thread
From: tip-bot2 for Yuan ZhaoXiong @ 2021-06-28 13:58 UTC
  To: linux-tip-commits
  Cc: Li RongQing, Yuan ZhaoXiong, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     031e3bd8986fffe31e1ddbf5264cccfe30c9abd7
Gitweb:        https://git.kernel.org/tip/031e3bd8986fffe31e1ddbf5264cccfe30c9abd7
Author:        Yuan ZhaoXiong <yuanzhaoxiong@baidu.com>
AuthorDate:    Sun, 06 Jun 2021 21:11:55 +08:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 28 Jun 2021 15:42:26 +02:00

sched: Optimize housekeeping_cpumask() in for_each_cpu_and()

On a 128-core AMD machine with 8 cores in nohz_full mode and the rest
used for housekeeping, when many housekeeping CPUs are idle, ftrace
shows a large amount of time spent in the loop that searches for the
nearest busy housekeeping CPU.

   9)               |              get_nohz_timer_target() {
   9)               |                housekeeping_test_cpu() {
   9)   0.390 us    |                  housekeeping_get_mask.part.1();
   9)   0.561 us    |                }
   9)   0.090 us    |                __rcu_read_lock();
   9)   0.090 us    |                housekeeping_cpumask();
   9)   0.521 us    |                housekeeping_cpumask();
   9)   0.140 us    |                housekeeping_cpumask();

   ...

   9)   0.500 us    |                housekeeping_cpumask();
   9)               |                housekeeping_any_cpu() {
   9)   0.090 us    |                  housekeeping_get_mask.part.1();
   9)   0.100 us    |                  sched_numa_find_closest();
   9)   0.491 us    |                }
   9)   0.100 us    |                __rcu_read_unlock();
   9) + 76.163 us   |              }

for_each_cpu_and() is a macro, so in get_nohz_timer_target() the loop
        for_each_cpu_and(i, sched_domain_span(sd),
                housekeeping_cpumask(HK_FLAG_TIMER))
expands to:
        for (i = -1; i = cpumask_next_and(i, sched_domain_span(sd),
                housekeeping_cpumask(HK_FLAG_TIMER)), i < nr_cpu_ids;)
As a result, housekeeping_cpumask() is invoked on every iteration.
housekeeping_cpumask() returns the same mask on each call, so calling
it repeatedly is unnecessary. In my testing, this patch reduces the
worst-case search time from ~76us to ~16us.

find_new_ilb() has the same problem.

Co-developed-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Yuan ZhaoXiong <yuanzhaoxiong@baidu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1622985115-51007-1-git-send-email-yuanzhaoxiong@baidu.com
---
 kernel/sched/core.c | 6 ++++--
 kernel/sched/fair.c | 6 ++++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2883c22..0c22cd0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -993,6 +993,7 @@ int get_nohz_timer_target(void)
 {
 	int i, cpu = smp_processor_id(), default_cpu = -1;
 	struct sched_domain *sd;
+	const struct cpumask *hk_mask;
 
 	if (housekeeping_cpu(cpu, HK_FLAG_TIMER)) {
 		if (!idle_cpu(cpu))
@@ -1000,10 +1001,11 @@ int get_nohz_timer_target(void)
 		default_cpu = cpu;
 	}
 
+	hk_mask = housekeeping_cpumask(HK_FLAG_TIMER);
+
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
-		for_each_cpu_and(i, sched_domain_span(sd),
-			housekeeping_cpumask(HK_FLAG_TIMER)) {
+		for_each_cpu_and(i, sched_domain_span(sd), hk_mask) {
 			if (cpu == i)
 				continue;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 45edf61..11d2294 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10188,9 +10188,11 @@ static inline int on_null_domain(struct rq *rq)
 static inline int find_new_ilb(void)
 {
 	int ilb;
+	const struct cpumask *hk_mask;
 
-	for_each_cpu_and(ilb, nohz.idle_cpus_mask,
-			      housekeeping_cpumask(HK_FLAG_MISC)) {
+	hk_mask = housekeeping_cpumask(HK_FLAG_MISC);
+
+	for_each_cpu_and(ilb, nohz.idle_cpus_mask, hk_mask) {
 
 		if (ilb == smp_processor_id())
 			continue;


* Re: [PATCH] sched: Optimize housekeeping_cpumask in for_each_cpu_and
  2021-06-02  2:03 [PATCH] sched: Optimize housekeeping_cpumask in for_each_cpu_and Yuan ZhaoXiong
@ 2021-06-02  7:57 ` Peter Zijlstra
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2021-06-02  7:57 UTC
  To: Yuan ZhaoXiong
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, linux-kernel

On Wed, Jun 02, 2021 at 10:03:52AM +0800, Yuan ZhaoXiong wrote:
> On a 128 cores AMD machine, there are 8 cores in nohz_full mode, and
> the others are used for housekeeping. When many housekeeping cpus are
> in idle state, we can observe huge time burn in the loop for searching
> nearest busy housekeeper cpu by ftrace.
> 
>    9)               |              get_nohz_timer_target() {
>    9)               |                housekeeping_test_cpu() {
>    9)   0.390 us    |                  housekeeping_get_mask.part.1();
>    9)   0.561 us    |                }
>    9)   0.090 us    |                __rcu_read_lock();
>    9)   0.090 us    |                housekeeping_cpumask();
>    9)   0.521 us    |                housekeeping_cpumask();
>    9)   0.140 us    |                housekeeping_cpumask();
> 
>    ...
> 
>    9)   0.500 us    |                housekeeping_cpumask();
>    9)               |                housekeeping_any_cpu() {
>    9)   0.090 us    |                  housekeeping_get_mask.part.1();
>    9)   0.100 us    |                  sched_numa_find_closest();
>    9)   0.491 us    |                }
>    9)   0.100 us    |                __rcu_read_unlock();
>    9) + 76.163 us   |              }
> 
> for_each_cpu_and() is a micro function, so in get_nohz_timer_target()
> function the
>         for_each_cpu_and(i, sched_domain_span(sd),
>                 housekeeping_cpumask(HK_FLAG_TIMER))
> equals to below:
>         for (i = -1; i = cpumask_next_and(i, sched_domain_span(sd),
>                 housekeeping_cpumask(HK_FLAG_TIMER)), i < nr_cpu_ids;)
> That will cause that housekeeping_cpumask() will be invoked many times.
> The housekeeping_cpumask() function returns a const value, so it is
> unnecessary to invoke it every time. This patch can minimize the worst
> searching time from ~76us to ~16us in my testing.
> 
> Similarly, the find_new_ilb() function has the same problem.
> 
> Signed-off-by: Yuan ZhaoXiong <yuanzhaoxiong@baidu.com>
> Signed-off-by: Li RongQing <lirongqing@baidu.com>

This is still not a valid SoB chain. Please refer to
Documentation/process/submitting-patches.rst.

The first SoB should match the Author, which if missing is From, the
last SoB should match the Sender which is From. Since there is only one
From but two SoBs this cannot be right.


* [PATCH] sched: Optimize housekeeping_cpumask in for_each_cpu_and
@ 2021-06-02  2:03 Yuan ZhaoXiong
  2021-06-02  7:57 ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Yuan ZhaoXiong @ 2021-06-02  2:03 UTC
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot
  Cc: linux-kernel

On a 128-core AMD machine with 8 cores in nohz_full mode and the rest
used for housekeeping, when many housekeeping CPUs are idle, ftrace
shows a large amount of time spent in the loop that searches for the
nearest busy housekeeping CPU.

   9)               |              get_nohz_timer_target() {
   9)               |                housekeeping_test_cpu() {
   9)   0.390 us    |                  housekeeping_get_mask.part.1();
   9)   0.561 us    |                }
   9)   0.090 us    |                __rcu_read_lock();
   9)   0.090 us    |                housekeeping_cpumask();
   9)   0.521 us    |                housekeeping_cpumask();
   9)   0.140 us    |                housekeeping_cpumask();

   ...

   9)   0.500 us    |                housekeeping_cpumask();
   9)               |                housekeeping_any_cpu() {
   9)   0.090 us    |                  housekeeping_get_mask.part.1();
   9)   0.100 us    |                  sched_numa_find_closest();
   9)   0.491 us    |                }
   9)   0.100 us    |                __rcu_read_unlock();
   9) + 76.163 us   |              }

for_each_cpu_and() is a macro, so in get_nohz_timer_target() the loop
        for_each_cpu_and(i, sched_domain_span(sd),
                housekeeping_cpumask(HK_FLAG_TIMER))
expands to:
        for (i = -1; i = cpumask_next_and(i, sched_domain_span(sd),
                housekeeping_cpumask(HK_FLAG_TIMER)), i < nr_cpu_ids;)
As a result, housekeeping_cpumask() is invoked on every iteration.
housekeeping_cpumask() returns the same mask on each call, so calling
it repeatedly is unnecessary. In my testing, this patch reduces the
worst-case search time from ~76us to ~16us.

find_new_ilb() has the same problem.

Signed-off-by: Yuan ZhaoXiong <yuanzhaoxiong@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
---
 kernel/sched/core.c | 6 ++++--
 kernel/sched/fair.c | 6 ++++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 98191218d891..14ad3bb36321 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -645,6 +645,7 @@ int get_nohz_timer_target(void)
 {
 	int i, cpu = smp_processor_id(), default_cpu = -1;
 	struct sched_domain *sd;
+	const struct cpumask *hk_mask;
 
 	if (housekeeping_cpu(cpu, HK_FLAG_TIMER)) {
 		if (!idle_cpu(cpu))
@@ -652,10 +653,11 @@ int get_nohz_timer_target(void)
 		default_cpu = cpu;
 	}
 
+	hk_mask = housekeeping_cpumask(HK_FLAG_TIMER);
+
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
-		for_each_cpu_and(i, sched_domain_span(sd),
-			housekeeping_cpumask(HK_FLAG_TIMER)) {
+		for_each_cpu_and(i, sched_domain_span(sd), hk_mask) {
 			if (cpu == i)
 				continue;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..d3ecfbf160bf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10097,9 +10097,11 @@ static inline int on_null_domain(struct rq *rq)
 static inline int find_new_ilb(void)
 {
 	int ilb;
+	const struct cpumask *hk_mask;
 
-	for_each_cpu_and(ilb, nohz.idle_cpus_mask,
-			      housekeeping_cpumask(HK_FLAG_MISC)) {
+	hk_mask = housekeeping_cpumask(HK_FLAG_MISC);
+
+	for_each_cpu_and(ilb, nohz.idle_cpus_mask, hk_mask) {
 
 		if (ilb == smp_processor_id())
 			continue;
-- 
2.27.0



* Re: [PATCH] sched: Optimize housekeeping_cpumask in for_each_cpu_and
  2021-05-27  9:40 ` Peter Zijlstra
@ 2021-05-31 10:37   ` Peter Zijlstra
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2021-05-31 10:37 UTC
  To: Yuan ZhaoXiong
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, linux-kernel

On Thu, May 27, 2021 at 11:40:42AM +0200, Peter Zijlstra wrote:
> On Sat, Apr 17, 2021 at 11:01:37PM +0800, Yuan ZhaoXiong wrote:
> > On a 128 cores AMD machine, there are 8 cores in nohz_full mode, and
> > the others are used for housekeeping. When many housekeeping cpus are
> > in idle state, we can observe huge time burn in the loop for searching
> > nearest busy housekeeper cpu by ftrace.
> > 
> >    9)               |              get_nohz_timer_target() {
> >    9)               |                housekeeping_test_cpu() {
> >    9)   0.390 us    |                  housekeeping_get_mask.part.1();
> >    9)   0.561 us    |                }
> >    9)   0.090 us    |                __rcu_read_lock();
> >    9)   0.090 us    |                housekeeping_cpumask();
> >    9)   0.521 us    |                housekeeping_cpumask();
> >    9)   0.140 us    |                housekeeping_cpumask();
> > 
> >    ...
> > 
> >    9)   0.500 us    |                housekeeping_cpumask();
> >    9)               |                housekeeping_any_cpu() {
> >    9)   0.090 us    |                  housekeeping_get_mask.part.1();
> >    9)   0.100 us    |                  sched_numa_find_closest();
> >    9)   0.491 us    |                }
> >    9)   0.100 us    |                __rcu_read_unlock();
> >    9) + 76.163 us   |              }
> > 
> > for_each_cpu_and() is a micro function, so in get_nohz_timer_target()
> > function the
> >         for_each_cpu_and(i, sched_domain_span(sd),
> >                 housekeeping_cpumask(HK_FLAG_TIMER))
> > equals to below:
> >         for (i = -1; i = cpumask_next_and(i, sched_domain_span(sd),
> >                 housekeeping_cpumask(HK_FLAG_TIMER)), i < nr_cpu_ids;)
> > That will cause that housekeeping_cpumask() will be invoked many times.
> > The housekeeping_cpumask() function returns a const value, so it is
> > unnecessary to invoke it every time. This patch can minimize the worst
> > searching time from ~76us to ~16us in my testing.
> > 
> > Similarly, the find_new_ilb() function has the same problem.
> > 
> > Signed-off-by: Yuan ZhaoXiong <yuanzhaoxiong@baidu.com>
> > Signed-off-by: Li RongQing <lirongqing@baidu.com>
> 
> Just noticed, this SoB chain isn't valid. What do I do with Li's entry?

I'm dropping this patch, please resend with a valid SoB chain.


* Re: [PATCH] sched: Optimize housekeeping_cpumask in for_each_cpu_and
  2021-04-17 15:01 Yuan ZhaoXiong
  2021-04-19  9:56 ` Peter Zijlstra
  2021-05-20  8:36 ` Peter Zijlstra
@ 2021-05-27  9:40 ` Peter Zijlstra
  2021-05-31 10:37   ` Peter Zijlstra
  2 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2021-05-27  9:40 UTC
  To: Yuan ZhaoXiong
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, linux-kernel

On Sat, Apr 17, 2021 at 11:01:37PM +0800, Yuan ZhaoXiong wrote:
> On a 128 cores AMD machine, there are 8 cores in nohz_full mode, and
> the others are used for housekeeping. When many housekeeping cpus are
> in idle state, we can observe huge time burn in the loop for searching
> nearest busy housekeeper cpu by ftrace.
> 
>    9)               |              get_nohz_timer_target() {
>    9)               |                housekeeping_test_cpu() {
>    9)   0.390 us    |                  housekeeping_get_mask.part.1();
>    9)   0.561 us    |                }
>    9)   0.090 us    |                __rcu_read_lock();
>    9)   0.090 us    |                housekeeping_cpumask();
>    9)   0.521 us    |                housekeeping_cpumask();
>    9)   0.140 us    |                housekeeping_cpumask();
> 
>    ...
> 
>    9)   0.500 us    |                housekeeping_cpumask();
>    9)               |                housekeeping_any_cpu() {
>    9)   0.090 us    |                  housekeeping_get_mask.part.1();
>    9)   0.100 us    |                  sched_numa_find_closest();
>    9)   0.491 us    |                }
>    9)   0.100 us    |                __rcu_read_unlock();
>    9) + 76.163 us   |              }
> 
> for_each_cpu_and() is a micro function, so in get_nohz_timer_target()
> function the
>         for_each_cpu_and(i, sched_domain_span(sd),
>                 housekeeping_cpumask(HK_FLAG_TIMER))
> equals to below:
>         for (i = -1; i = cpumask_next_and(i, sched_domain_span(sd),
>                 housekeeping_cpumask(HK_FLAG_TIMER)), i < nr_cpu_ids;)
> That will cause that housekeeping_cpumask() will be invoked many times.
> The housekeeping_cpumask() function returns a const value, so it is
> unnecessary to invoke it every time. This patch can minimize the worst
> searching time from ~76us to ~16us in my testing.
> 
> Similarly, the find_new_ilb() function has the same problem.
> 
> Signed-off-by: Yuan ZhaoXiong <yuanzhaoxiong@baidu.com>
> Signed-off-by: Li RongQing <lirongqing@baidu.com>

Just noticed, this SoB chain isn't valid. What do I do with Li's entry?


* Re: [PATCH] sched: Optimize housekeeping_cpumask in for_each_cpu_and
  2021-04-17 15:01 Yuan ZhaoXiong
  2021-04-19  9:56 ` Peter Zijlstra
@ 2021-05-20  8:36 ` Peter Zijlstra
  2021-05-27  9:40 ` Peter Zijlstra
  2 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2021-05-20  8:36 UTC
  To: Yuan ZhaoXiong
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, linux-kernel

On Sat, Apr 17, 2021 at 11:01:37PM +0800, Yuan ZhaoXiong wrote:
> On a 128 cores AMD machine, there are 8 cores in nohz_full mode, and
> the others are used for housekeeping. When many housekeeping cpus are
> in idle state, we can observe huge time burn in the loop for searching
> nearest busy housekeeper cpu by ftrace.
> 
>    9)               |              get_nohz_timer_target() {
>    9)               |                housekeeping_test_cpu() {
>    9)   0.390 us    |                  housekeeping_get_mask.part.1();
>    9)   0.561 us    |                }
>    9)   0.090 us    |                __rcu_read_lock();
>    9)   0.090 us    |                housekeeping_cpumask();
>    9)   0.521 us    |                housekeeping_cpumask();
>    9)   0.140 us    |                housekeeping_cpumask();
> 
>    ...
> 
>    9)   0.500 us    |                housekeeping_cpumask();
>    9)               |                housekeeping_any_cpu() {
>    9)   0.090 us    |                  housekeeping_get_mask.part.1();
>    9)   0.100 us    |                  sched_numa_find_closest();
>    9)   0.491 us    |                }
>    9)   0.100 us    |                __rcu_read_unlock();
>    9) + 76.163 us   |              }
> 
> for_each_cpu_and() is a micro function, so in get_nohz_timer_target()
> function the
>         for_each_cpu_and(i, sched_domain_span(sd),
>                 housekeeping_cpumask(HK_FLAG_TIMER))
> equals to below:
>         for (i = -1; i = cpumask_next_and(i, sched_domain_span(sd),
>                 housekeeping_cpumask(HK_FLAG_TIMER)), i < nr_cpu_ids;)
> That will cause that housekeeping_cpumask() will be invoked many times.
> The housekeeping_cpumask() function returns a const value, so it is
> unnecessary to invoke it every time. This patch can minimize the worst
> searching time from ~76us to ~16us in my testing.
> 
> Similarly, the find_new_ilb() function has the same problem.
> 
> Signed-off-by: Yuan ZhaoXiong <yuanzhaoxiong@baidu.com>
> Signed-off-by: Li RongQing <lirongqing@baidu.com>

Thanks!


* Re: [PATCH] sched: Optimize housekeeping_cpumask in for_each_cpu_and
  2021-04-20  6:44   ` Yuan,Zhaoxiong
@ 2021-04-30  6:38     ` Yuan,Zhaoxiong
  0 siblings, 0 replies; 11+ messages in thread
From: Yuan,Zhaoxiong @ 2021-04-30  6:38 UTC
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, linux-kernel

> On 2021/4/19 at 5:57 PM, "Peter Zijlstra" <peterz@infradead.org> wrote:

> On Sat, Apr 17, 2021 at 11:01:37PM +0800, Yuan ZhaoXiong wrote:
>> On a 128 cores AMD machine, there are 8 cores in nohz_full mode, and
>> the others are used for housekeeping. When many housekeeping cpus are
>> in idle state, we can observe huge time burn in the loop for searching
>> nearest busy housekeeper cpu by ftrace.
>> 
>>    9)               |              get_nohz_timer_target() {
>>    9)               |                housekeeping_test_cpu() {
>>    9)   0.390 us    |                  housekeeping_get_mask.part.1();
>>    9)   0.561 us    |                }
>>    9)   0.090 us    |                __rcu_read_lock();
>>    9)   0.090 us    |                housekeeping_cpumask();
>>    9)   0.521 us    |                housekeeping_cpumask();
>>    9)   0.140 us    |                housekeeping_cpumask();
>> 
>>    ...
>> 
>>    9)   0.500 us    |                housekeeping_cpumask();
>>    9)               |                housekeeping_any_cpu() {
>>    9)   0.090 us    |                  housekeeping_get_mask.part.1();
>>    9)   0.100 us    |                  sched_numa_find_closest();
>>    9)   0.491 us    |                }
>>    9)   0.100 us    |                __rcu_read_unlock();
>>    9) + 76.163 us   |              }
>> 
>> for_each_cpu_and() is a micro function, so in get_nohz_timer_target()
>> function the
>>         for_each_cpu_and(i, sched_domain_span(sd),
>>                 housekeeping_cpumask(HK_FLAG_TIMER))
>> equals to below:
>>         for (i = -1; i = cpumask_next_and(i, sched_domain_span(sd),
>>                 housekeeping_cpumask(HK_FLAG_TIMER)), i < nr_cpu_ids;)
>> That will cause that housekeeping_cpumask() will be invoked many times.
>> The housekeeping_cpumask() function returns a const value, so it is
>> unnecessary to invoke it every time. This patch can minimize the worst
>> searching time from ~76us to ~16us in my testing.
>> 
>> Similarly, the find_new_ilb() function has the same problem.
    
>  Would it not make sense to mark housekeeping_cpumask() __pure instead?
    
> After marking housekeeping_cpumask() __pure and testing again, the results
> show that the large time cost in the loop that searches for the nearest busy
> housekeeper still exists.
>
> Disassembling get_nohz_timer_target() with objdump -D vmlinux gives the
> following:
> ffffffff810b96c0 <get_nohz_timer_target>:
> ffffffff810b96c0:       e8 db 7f 94 00          callq  ffffffff81a016a0 <__fentry__>
> ffffffff810b96c5:       41 57                   push   %r15
> ffffffff810b96c7:       41 56                   push   %r14
> ffffffff810b96c9:       41 55                   push   %r13
> ffffffff810b96cb:       41 54                   push   %r12
> ffffffff810b96cd:       55                      push   %rbp
> ffffffff810b96ce:       53                      push   %rbx
> ffffffff810b96cf:       48 83 ec 08             sub    $0x8,%rsp
> ffffffff810b96d3:       65 8b 1d 56 5a f5 7e    mov    %gs:0x7ef55a56(%rip),%ebx        # f130 <cpu_number>
> ffffffff810b96da:       41 89 dc                mov    %ebx,%r12d
> ffffffff810b96dd:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> ffffffff810b96e2:       4c 63 f3                movslq %ebx,%r14
> ffffffff810b96e5:       48 c7 c5 40 0b 02 00    mov    $0x20b40,%rbp
> ffffffff810b96ec:       4a 8b 04 f5 20 77 13    mov    -0x7dec88e0(,%r14,8),%rax
> ffffffff810b96f3:       82
> ffffffff810b96f4:       49 89 ed                mov    %rbp,%r13
> ffffffff810b96f7:       4c 01 e8                add    %r13,%rax
> ffffffff810b96fa:       48 8b 88 90 09 00 00    mov    0x990(%rax),%rcx
> ffffffff810b9701:       48 39 88 88 09 00 00    cmp    %rcx,0x988(%rax)
> ffffffff810b9708:       0f 84 ce 00 00 00       je     ffffffff810b97dc <get_nohz_timer_target+0x11c>
> ffffffff810b970e:       48 83 c4 08             add    $0x8,%rsp
> ffffffff810b9712:       44 89 e0                mov    %r12d,%eax
> ffffffff810b9715:       5b                      pop    %rbx
> ffffffff810b9716:       5d                      pop    %rbp
> ffffffff810b9717:       41 5c                   pop    %r12
> ffffffff810b9719:       41 5d                   pop    %r13
> ffffffff810b971b:       41 5e                   pop    %r14
> ffffffff810b971d:       41 5f                   pop    %r15
> ffffffff810b971f:       c3                      retq
> ffffffff810b9720:       be 01 00 00 00          mov    $0x1,%esi
> ffffffff810b9725:       89 df                   mov    %ebx,%edi
> ffffffff810b9727:       e8 74 87 02 00          callq  ffffffff810e1ea0 <housekeeping_test_cpu>
> ffffffff810b972c:       84 c0                   test   %al,%al
> ffffffff810b972e:       75 b2                   jne    ffffffff810b96e2 <get_nohz_timer_target+0x22>
> ffffffff810b9730:       e8 0b ea 03 00          callq  ffffffff810f8140 <__rcu_read_lock>
> ffffffff810b9735:       48 c7 c5 40 0b 02 00    mov    $0x20b40,%rbp
> ffffffff810b973c:       48 63 d3                movslq %ebx,%rdx
> ffffffff810b973f:       c7 44 24 04 ff ff ff    movl   $0xffffffff,0x4(%rsp)
> ffffffff810b9746:       ff
> ffffffff810b9747:       48 89 e8                mov    %rbp,%rax
> ffffffff810b974a:       48 03 04 d5 20 77 13    add    -0x7dec88e0(,%rdx,8),%rax
> ffffffff810b9751:       82
> ffffffff810b9752:       4c 8b a8 d8 09 00 00    mov    0x9d8(%rax),%r13
> ffffffff810b9759:       4d 85 ed                test   %r13,%r13
> ffffffff810b975c:       0f 84 d3 00 00 00       je     ffffffff810b9835 <get_nohz_timer_target+0x175>
> ffffffff810b9762:       41 be ff ff ff ff       mov    $0xffffffff,%r14d
> ffffffff810b9768:       4d 8d a5 38 01 00 00    lea    0x138(%r13),%r12
> ffffffff810b976f:       45 89 f7                mov    %r14d,%r15d
> ffffffff810b9772:       bf 01 00 00 00          mov    $0x1,%edi
> ffffffff810b9777:       e8 f4 86 02 00          callq  ffffffff810e1e70 <housekeeping_cpumask>
> ffffffff810b977c:       44 89 ff                mov    %r15d,%edi
> ffffffff810b977f:       48 89 c2                mov    %rax,%rdx
> ffffffff810b9782:       4c 89 e6                mov    %r12,%rsi
> ffffffff810b9785:       e8 b6 ea 79 00          callq  ffffffff81858240 <cpumask_next_and>
> ffffffff810b978a:       3b 05 b4 4e 3e 01       cmp    0x13e4eb4(%rip),%eax        # ffffffff8249e644 <nr_cpu_ids>
> ffffffff810b9790:       41 89 c7                mov    %eax,%r15d
> ffffffff810b9793:       0f 83 84 00 00 00       jae    ffffffff810b981d <get_nohz_timer_target+0x15d>
> ffffffff810b9799:       44 39 fb                cmp    %r15d,%ebx
> ffffffff810b979c:       74 d4                   je     ffffffff810b9772 <get_nohz_timer_target+0xb2>
> ffffffff810b979e:       49 63 c7                movslq %r15d,%rax
> ffffffff810b97a1:       48 89 ea                mov    %rbp,%rdx
> ffffffff810b97a4:       48 03 14 c5 20 77 13    add    -0x7dec88e0(,%rax,8),%rdx
> ffffffff810b97ab:       82
> ffffffff810b97ac:       48 8b 82 90 09 00 00    mov    0x990(%rdx),%rax
> ffffffff810b97b3:       48 39 82 88 09 00 00    cmp    %rax,0x988(%rdx)
> ffffffff810b97ba:       75 13                   jne    ffffffff810b97cf <get_nohz_timer_target+0x10f>
> ffffffff810b97bc:       8b 42 04                mov    0x4(%rdx),%eax
> ffffffff810b97bf:       85 c0                   test   %eax,%eax
> ffffffff810b97c1:       75 0c                   jne    ffffffff810b97cf <get_nohz_timer_target+0x10f>
> ffffffff810b97c3:       48 8b 82 20 0c 00 00    mov    0xc20(%rdx),%rax
> ffffffff810b97ca:       48 85 c0                test   %rax,%rax
> ffffffff810b97cd:       74 a3                   je     ffffffff810b9772 <get_nohz_timer_target+0xb2>
> ffffffff810b97cf:       e8 1c 33 04 00          callq  ffffffff810fcaf0 <__rcu_read_unlock>
> ffffffff810b97d4:       45 89 fc                mov    %r15d,%r12d
> ffffffff810b97d7:       e9 32 ff ff ff          jmpq   ffffffff810b970e <get_nohz_timer_target+0x4e>
> ffffffff810b97dc:       8b 50 04                mov    0x4(%rax),%edx
> ffffffff810b97df:       85 d2                   test   %edx,%edx
> ffffffff810b97e1:       0f 85 27 ff ff ff       jne    ffffffff810b970e <get_nohz_timer_target+0x4e>
> ffffffff810b97e7:       48 8b 80 20 0c 00 00    mov    0xc20(%rax),%rax
> ffffffff810b97ee:       48 85 c0                test   %rax,%rax
> ffffffff810b97f1:       0f 85 17 ff ff ff       jne    ffffffff810b970e <get_nohz_timer_target+0x4e>
> ffffffff810b97f7:       e8 44 e9 03 00          callq  ffffffff810f8140 <__rcu_read_lock>
> ffffffff810b97fc:       4e 03 2c f5 20 77 13    add    -0x7dec88e0(,%r14,8),%r13
> ffffffff810b9803:       82
> ffffffff810b9804:       89 5c 24 04             mov    %ebx,0x4(%rsp)
> ffffffff810b9808:       41 89 df                mov    %ebx,%r15d
> ffffffff810b980b:       4d 8b ad d8 09 00 00    mov    0x9d8(%r13),%r13
> ffffffff810b9812:       4d 85 ed                test   %r13,%r13
> ffffffff810b9815:       0f 85 47 ff ff ff       jne    ffffffff810b9762 <get_nohz_timer_target+0xa2>
> ffffffff810b981b:       eb 12                   jmp    ffffffff810b982f <get_nohz_timer_target+0x16f>
> ffffffff810b981d:       4d 8b 6d 00             mov    0x0(%r13),%r13
> ffffffff810b9821:       4d 85 ed                test   %r13,%r13
> ffffffff810b9824:       0f 85 3e ff ff ff       jne    ffffffff810b9768 <get_nohz_timer_target+0xa8>
> ffffffff810b982a:       44 8b 7c 24 04          mov    0x4(%rsp),%r15d
> ffffffff810b982f:       41 83 ff ff             cmp    $0xffffffff,%r15d
> ffffffff810b9833:       75 9a                   jne    ffffffff810b97cf <get_nohz_timer_target+0x10f>
> ffffffff810b9835:       bf 01 00 00 00          mov    $0x1,%edi
> ffffffff810b983a:       e8 91 86 02 00          callq  ffffffff810e1ed0 <housekeeping_any_cpu>
> ffffffff810b983f:       41 89 c7                mov    %eax,%r15d
> ffffffff810b9842:       eb 8b                   jmp    ffffffff810b97cf <get_nohz_timer_target+0x10f>
> ffffffff810b9844:       66 90                   xchg   %ax,%ax
> ffffffff810b9846:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
> ffffffff810b984d:       00 00 00
>
> The disassembly shows that the __pure annotation has no effect.

So far, marking housekeeping_cpumask() __pure has had no effect in our tests. Should the patch be merged into mainline?

Thanks,
Yuan ZhaoXiong



* Re: [PATCH] sched: Optimize housekeeping_cpumask in for_each_cpu_and
  2021-04-19  9:56 ` Peter Zijlstra
@ 2021-04-20  6:44   ` Yuan,Zhaoxiong
  2021-04-30  6:38     ` Yuan,Zhaoxiong
  0 siblings, 1 reply; 11+ messages in thread
From: Yuan,Zhaoxiong @ 2021-04-20  6:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, linux-kernel



On 2021/4/19 at 5:57 PM, "Peter Zijlstra" <peterz@infradead.org> wrote:

    On Sat, Apr 17, 2021 at 11:01:37PM +0800, Yuan ZhaoXiong wrote:
    > On a 128-core AMD machine, 8 cores are in nohz_full mode and the
    > others are used for housekeeping. When many housekeeping CPUs are
    > idle, ftrace shows a large amount of time spent in the loop that
    > searches for the nearest busy housekeeping CPU.
    > 
    >    9)               |              get_nohz_timer_target() {
    >    9)               |                housekeeping_test_cpu() {
    >    9)   0.390 us    |                  housekeeping_get_mask.part.1();
    >    9)   0.561 us    |                }
    >    9)   0.090 us    |                __rcu_read_lock();
    >    9)   0.090 us    |                housekeeping_cpumask();
    >    9)   0.521 us    |                housekeeping_cpumask();
    >    9)   0.140 us    |                housekeeping_cpumask();
    > 
    >    ...
    > 
    >    9)   0.500 us    |                housekeeping_cpumask();
    >    9)               |                housekeeping_any_cpu() {
    >    9)   0.090 us    |                  housekeeping_get_mask.part.1();
    >    9)   0.100 us    |                  sched_numa_find_closest();
    >    9)   0.491 us    |                }
    >    9)   0.100 us    |                __rcu_read_unlock();
    >    9) + 76.163 us   |              }
    > 
    > for_each_cpu_and() is a macro, so in get_nohz_timer_target() the loop
    >         for_each_cpu_and(i, sched_domain_span(sd),
    >                 housekeeping_cpumask(HK_FLAG_TIMER))
    > expands to:
    >         for (i = -1; i = cpumask_next_and(i, sched_domain_span(sd),
    >                 housekeeping_cpumask(HK_FLAG_TIMER)), i < nr_cpu_ids;)
    > As a result, housekeeping_cpumask() is invoked on every iteration.
    > housekeeping_cpumask() returns the same value on every call within
    > the loop, so it is unnecessary to call it repeatedly. This patch
    > reduces the worst-case search time from ~76us to ~16us in my testing.
    >
    > find_new_ilb() has the same problem.
    
    Would it not make sense to mark housekeeping_cpumask() __pure instead?
    
After marking housekeeping_cpumask() __pure and testing again, the results
show that the loop searching for the nearest busy housekeeping CPU still
burns the same large amount of time.

Using objdump -D vmlinux, the disassembly of get_nohz_timer_target() is
as follows:
ffffffff810b96c0 <get_nohz_timer_target>:
ffffffff810b96c0:       e8 db 7f 94 00          callq  ffffffff81a016a0 <__fentry__>
ffffffff810b96c5:       41 57                   push   %r15
ffffffff810b96c7:       41 56                   push   %r14
ffffffff810b96c9:       41 55                   push   %r13
ffffffff810b96cb:       41 54                   push   %r12
ffffffff810b96cd:       55                      push   %rbp
ffffffff810b96ce:       53                      push   %rbx
ffffffff810b96cf:       48 83 ec 08             sub    $0x8,%rsp
ffffffff810b96d3:       65 8b 1d 56 5a f5 7e    mov    %gs:0x7ef55a56(%rip),%ebx        # f130 <cpu_number>
ffffffff810b96da:       41 89 dc                mov    %ebx,%r12d
ffffffff810b96dd:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffff810b96e2:       4c 63 f3                movslq %ebx,%r14
ffffffff810b96e5:       48 c7 c5 40 0b 02 00    mov    $0x20b40,%rbp
ffffffff810b96ec:       4a 8b 04 f5 20 77 13    mov    -0x7dec88e0(,%r14,8),%rax
ffffffff810b96f3:       82
ffffffff810b96f4:       49 89 ed                mov    %rbp,%r13
ffffffff810b96f7:       4c 01 e8                add    %r13,%rax
ffffffff810b96fa:       48 8b 88 90 09 00 00    mov    0x990(%rax),%rcx
ffffffff810b9701:       48 39 88 88 09 00 00    cmp    %rcx,0x988(%rax)
ffffffff810b9708:       0f 84 ce 00 00 00       je     ffffffff810b97dc <get_nohz_timer_target+0x11c>
ffffffff810b970e:       48 83 c4 08             add    $0x8,%rsp
ffffffff810b9712:       44 89 e0                mov    %r12d,%eax
ffffffff810b9715:       5b                      pop    %rbx
ffffffff810b9716:       5d                      pop    %rbp
ffffffff810b9717:       41 5c                   pop    %r12
ffffffff810b9719:       41 5d                   pop    %r13
ffffffff810b971b:       41 5e                   pop    %r14
ffffffff810b971d:       41 5f                   pop    %r15
ffffffff810b971f:       c3                      retq
ffffffff810b9720:       be 01 00 00 00          mov    $0x1,%esi
ffffffff810b9725:       89 df                   mov    %ebx,%edi
ffffffff810b9727:       e8 74 87 02 00          callq  ffffffff810e1ea0 <housekeeping_test_cpu>
ffffffff810b972c:       84 c0                   test   %al,%al
ffffffff810b972e:       75 b2                   jne    ffffffff810b96e2 <get_nohz_timer_target+0x22>
ffffffff810b9730:       e8 0b ea 03 00          callq  ffffffff810f8140 <__rcu_read_lock>
ffffffff810b9735:       48 c7 c5 40 0b 02 00    mov    $0x20b40,%rbp
ffffffff810b973c:       48 63 d3                movslq %ebx,%rdx
ffffffff810b973f:       c7 44 24 04 ff ff ff    movl   $0xffffffff,0x4(%rsp)
ffffffff810b9746:       ff
ffffffff810b9747:       48 89 e8                mov    %rbp,%rax
ffffffff810b974a:       48 03 04 d5 20 77 13    add    -0x7dec88e0(,%rdx,8),%rax
ffffffff810b9751:       82
ffffffff810b9752:       4c 8b a8 d8 09 00 00    mov    0x9d8(%rax),%r13
ffffffff810b9759:       4d 85 ed                test   %r13,%r13
ffffffff810b975c:       0f 84 d3 00 00 00       je     ffffffff810b9835 <get_nohz_timer_target+0x175>
ffffffff810b9762:       41 be ff ff ff ff       mov    $0xffffffff,%r14d
ffffffff810b9768:       4d 8d a5 38 01 00 00    lea    0x138(%r13),%r12
ffffffff810b976f:       45 89 f7                mov    %r14d,%r15d
ffffffff810b9772:       bf 01 00 00 00          mov    $0x1,%edi
ffffffff810b9777:       e8 f4 86 02 00          callq  ffffffff810e1e70 <housekeeping_cpumask>
ffffffff810b977c:       44 89 ff                mov    %r15d,%edi
ffffffff810b977f:       48 89 c2                mov    %rax,%rdx
ffffffff810b9782:       4c 89 e6                mov    %r12,%rsi
ffffffff810b9785:       e8 b6 ea 79 00          callq  ffffffff81858240 <cpumask_next_and>
ffffffff810b978a:       3b 05 b4 4e 3e 01       cmp    0x13e4eb4(%rip),%eax        # ffffffff8249e644 <nr_cpu_ids>
ffffffff810b9790:       41 89 c7                mov    %eax,%r15d
ffffffff810b9793:       0f 83 84 00 00 00       jae    ffffffff810b981d <get_nohz_timer_target+0x15d>
ffffffff810b9799:       44 39 fb                cmp    %r15d,%ebx
ffffffff810b979c:       74 d4                   je     ffffffff810b9772 <get_nohz_timer_target+0xb2>
ffffffff810b979e:       49 63 c7                movslq %r15d,%rax
ffffffff810b97a1:       48 89 ea                mov    %rbp,%rdx
ffffffff810b97a4:       48 03 14 c5 20 77 13    add    -0x7dec88e0(,%rax,8),%rdx
ffffffff810b97ab:       82
ffffffff810b97ac:       48 8b 82 90 09 00 00    mov    0x990(%rdx),%rax
ffffffff810b97b3:       48 39 82 88 09 00 00    cmp    %rax,0x988(%rdx)
ffffffff810b97ba:       75 13                   jne    ffffffff810b97cf <get_nohz_timer_target+0x10f>
ffffffff810b97bc:       8b 42 04                mov    0x4(%rdx),%eax
ffffffff810b97bf:       85 c0                   test   %eax,%eax
ffffffff810b97c1:       75 0c                   jne    ffffffff810b97cf <get_nohz_timer_target+0x10f>
ffffffff810b97c3:       48 8b 82 20 0c 00 00    mov    0xc20(%rdx),%rax
ffffffff810b97ca:       48 85 c0                test   %rax,%rax
ffffffff810b97cd:       74 a3                   je     ffffffff810b9772 <get_nohz_timer_target+0xb2>
ffffffff810b97cf:       e8 1c 33 04 00          callq  ffffffff810fcaf0 <__rcu_read_unlock>
ffffffff810b97d4:       45 89 fc                mov    %r15d,%r12d
ffffffff810b97d7:       e9 32 ff ff ff          jmpq   ffffffff810b970e <get_nohz_timer_target+0x4e>
ffffffff810b97dc:       8b 50 04                mov    0x4(%rax),%edx
ffffffff810b97df:       85 d2                   test   %edx,%edx
ffffffff810b97e1:       0f 85 27 ff ff ff       jne    ffffffff810b970e <get_nohz_timer_target+0x4e>
ffffffff810b97e7:       48 8b 80 20 0c 00 00    mov    0xc20(%rax),%rax
ffffffff810b97ee:       48 85 c0                test   %rax,%rax
ffffffff810b97f1:       0f 85 17 ff ff ff       jne    ffffffff810b970e <get_nohz_timer_target+0x4e>
ffffffff810b97f7:       e8 44 e9 03 00          callq  ffffffff810f8140 <__rcu_read_lock>
ffffffff810b97fc:       4e 03 2c f5 20 77 13    add    -0x7dec88e0(,%r14,8),%r13
ffffffff810b9803:       82
ffffffff810b9804:       89 5c 24 04             mov    %ebx,0x4(%rsp)
ffffffff810b9808:       41 89 df                mov    %ebx,%r15d
ffffffff810b980b:       4d 8b ad d8 09 00 00    mov    0x9d8(%r13),%r13
ffffffff810b9812:       4d 85 ed                test   %r13,%r13
ffffffff810b9815:       0f 85 47 ff ff ff       jne    ffffffff810b9762 <get_nohz_timer_target+0xa2>
ffffffff810b981b:       eb 12                   jmp    ffffffff810b982f <get_nohz_timer_target+0x16f>
ffffffff810b981d:       4d 8b 6d 00             mov    0x0(%r13),%r13
ffffffff810b9821:       4d 85 ed                test   %r13,%r13
ffffffff810b9824:       0f 85 3e ff ff ff       jne    ffffffff810b9768 <get_nohz_timer_target+0xa8>
ffffffff810b982a:       44 8b 7c 24 04          mov    0x4(%rsp),%r15d
ffffffff810b982f:       41 83 ff ff             cmp    $0xffffffff,%r15d
ffffffff810b9833:       75 9a                   jne    ffffffff810b97cf <get_nohz_timer_target+0x10f>
ffffffff810b9835:       bf 01 00 00 00          mov    $0x1,%edi
ffffffff810b983a:       e8 91 86 02 00          callq  ffffffff810e1ed0 <housekeeping_any_cpu>
ffffffff810b983f:       41 89 c7                mov    %eax,%r15d
ffffffff810b9842:       eb 8b                   jmp    ffffffff810b97cf <get_nohz_timer_target+0x10f>
ffffffff810b9844:       66 90                   xchg   %ax,%ax
ffffffff810b9846:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
ffffffff810b984d:       00 00 00

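The disassembly shows that the __pure annotation has no effect: the call
to housekeeping_cpumask() at ffffffff810b9777 is still inside the loop
(the branches at ffffffff810b979c and ffffffff810b97cd jump back to
ffffffff810b9772), so it is executed on every iteration.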
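
For context, here is a minimal user-space sketch of what the __pure
suggestion amounts to. In the kernel, __pure expands to
__attribute__((__pure__)). The names demo_pure, demo_hk_mask(),
demo_next_and() and demo_count() are made up for illustration and are not
kernel code. GCC documents that a pure function may still read global
memory, so the compiler can only merge repeated calls when it can show that
nothing the function might read has changed in between. An opaque call
inside the loop (cpumask_next_and() in the real code) may be enough to
prevent that. This is one possible explanation for the disassembly above
and has not been verified in this thread.

```c
#include <assert.h>

/* Illustrative alias; the kernel spells this __pure. */
#define demo_pure __attribute__((__pure__))

#define DEMO_NR_CPUS 8

static unsigned int demo_hk_flags = 1;		/* assumed: flag 1 is isolated */
static unsigned long demo_hk_bits = 0xF6UL;	/* bits 1,2,4,5,6,7 set */
static unsigned long demo_all_bits = 0xFFUL;

/*
 * Hypothetical stand-in for housekeeping_cpumask(): no side effects,
 * but it reads a global, which the "pure" attribute permits.
 */
static demo_pure const unsigned long *demo_hk_mask(unsigned int flags)
{
	if (demo_hk_flags & flags)
		return &demo_hk_bits;
	return &demo_all_bits;
}

/*
 * Simplified model of cpumask_next_and(): next bit after n that is set
 * in both masks. noinline keeps it opaque, like an out-of-line call.
 */
static __attribute__((noinline)) int
demo_next_and(int n, const unsigned long *a, const unsigned long *b)
{
	for (n++; n < DEMO_NR_CPUS; n++)
		if (((*a & *b) >> n) & 1UL)
			return n;
	return DEMO_NR_CPUS;
}

/* Same loop shape as the expanded for_each_cpu_and() in the patch text. */
static int demo_count(const unsigned long *span)
{
	int i, visited = 0;

	assert(span);
	for (i = -1; i = demo_next_and(i, span, demo_hk_mask(1)),
	     i < DEMO_NR_CPUS;)
		visited++;
	return visited;
}
```

The attribute does not change the result. Whether the call is actually
hoisted out of the loop is up to the compiler, which is why checking the
generated code, as done above, is the reliable test.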



* Re: [PATCH] sched: Optimize housekeeping_cpumask in for_each_cpu_and
  2021-04-17 15:01 Yuan ZhaoXiong
@ 2021-04-19  9:56 ` Peter Zijlstra
  2021-04-20  6:44   ` Yuan,Zhaoxiong
  2021-05-20  8:36 ` Peter Zijlstra
  2021-05-27  9:40 ` Peter Zijlstra
  2 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2021-04-19  9:56 UTC (permalink / raw)
  To: Yuan ZhaoXiong
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, linux-kernel

On Sat, Apr 17, 2021 at 11:01:37PM +0800, Yuan ZhaoXiong wrote:
> On a 128-core AMD machine, 8 cores are in nohz_full mode and the
> others are used for housekeeping. When many housekeeping CPUs are
> idle, ftrace shows a large amount of time spent in the loop that
> searches for the nearest busy housekeeping CPU.
> 
>    9)               |              get_nohz_timer_target() {
>    9)               |                housekeeping_test_cpu() {
>    9)   0.390 us    |                  housekeeping_get_mask.part.1();
>    9)   0.561 us    |                }
>    9)   0.090 us    |                __rcu_read_lock();
>    9)   0.090 us    |                housekeeping_cpumask();
>    9)   0.521 us    |                housekeeping_cpumask();
>    9)   0.140 us    |                housekeeping_cpumask();
> 
>    ...
> 
>    9)   0.500 us    |                housekeeping_cpumask();
>    9)               |                housekeeping_any_cpu() {
>    9)   0.090 us    |                  housekeeping_get_mask.part.1();
>    9)   0.100 us    |                  sched_numa_find_closest();
>    9)   0.491 us    |                }
>    9)   0.100 us    |                __rcu_read_unlock();
>    9) + 76.163 us   |              }
> 
> for_each_cpu_and() is a macro, so in get_nohz_timer_target() the loop
>         for_each_cpu_and(i, sched_domain_span(sd),
>                 housekeeping_cpumask(HK_FLAG_TIMER))
> expands to:
>         for (i = -1; i = cpumask_next_and(i, sched_domain_span(sd),
>                 housekeeping_cpumask(HK_FLAG_TIMER)), i < nr_cpu_ids;)
> As a result, housekeeping_cpumask() is invoked on every iteration.
> housekeeping_cpumask() returns the same value on every call within
> the loop, so it is unnecessary to call it repeatedly. This patch
> reduces the worst-case search time from ~76us to ~16us in my testing.
>
> find_new_ilb() has the same problem.

Would it not make sense to mark housekeeping_cpumask() __pure instead?


* [PATCH] sched: Optimize housekeeping_cpumask in for_each_cpu_and
@ 2021-04-17 15:01 Yuan ZhaoXiong
  2021-04-19  9:56 ` Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Yuan ZhaoXiong @ 2021-04-17 15:01 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot
  Cc: linux-kernel

On a 128-core AMD machine, 8 cores are in nohz_full mode and the
others are used for housekeeping. When many housekeeping CPUs are
idle, ftrace shows a large amount of time spent in the loop that
searches for the nearest busy housekeeping CPU.

   9)               |              get_nohz_timer_target() {
   9)               |                housekeeping_test_cpu() {
   9)   0.390 us    |                  housekeeping_get_mask.part.1();
   9)   0.561 us    |                }
   9)   0.090 us    |                __rcu_read_lock();
   9)   0.090 us    |                housekeeping_cpumask();
   9)   0.521 us    |                housekeeping_cpumask();
   9)   0.140 us    |                housekeeping_cpumask();

   ...

   9)   0.500 us    |                housekeeping_cpumask();
   9)               |                housekeeping_any_cpu() {
   9)   0.090 us    |                  housekeeping_get_mask.part.1();
   9)   0.100 us    |                  sched_numa_find_closest();
   9)   0.491 us    |                }
   9)   0.100 us    |                __rcu_read_unlock();
   9) + 76.163 us   |              }

for_each_cpu_and() is a macro, so in get_nohz_timer_target() the loop
        for_each_cpu_and(i, sched_domain_span(sd),
                housekeeping_cpumask(HK_FLAG_TIMER))
expands to:
        for (i = -1; i = cpumask_next_and(i, sched_domain_span(sd),
                housekeeping_cpumask(HK_FLAG_TIMER)), i < nr_cpu_ids;)
As a result, housekeeping_cpumask() is invoked on every iteration.
housekeeping_cpumask() returns the same value on every call within
the loop, so it is unnecessary to call it repeatedly. This patch
reduces the worst-case search time from ~76us to ~16us in my testing.

find_new_ilb() has the same problem.

Signed-off-by: Yuan ZhaoXiong <yuanzhaoxiong@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
---
 kernel/sched/core.c | 6 ++++--
 kernel/sched/fair.c | 6 ++++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 98191218d891..14ad3bb36321 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -645,6 +645,7 @@ int get_nohz_timer_target(void)
 {
 	int i, cpu = smp_processor_id(), default_cpu = -1;
 	struct sched_domain *sd;
+	const struct cpumask *hk_mask;
 
 	if (housekeeping_cpu(cpu, HK_FLAG_TIMER)) {
 		if (!idle_cpu(cpu))
@@ -652,10 +653,11 @@ int get_nohz_timer_target(void)
 		default_cpu = cpu;
 	}
 
+	hk_mask = housekeeping_cpumask(HK_FLAG_TIMER);
+
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
-		for_each_cpu_and(i, sched_domain_span(sd),
-			housekeeping_cpumask(HK_FLAG_TIMER)) {
+		for_each_cpu_and(i, sched_domain_span(sd), hk_mask) {
 			if (cpu == i)
 				continue;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..d3ecfbf160bf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10097,9 +10097,11 @@ static inline int on_null_domain(struct rq *rq)
 static inline int find_new_ilb(void)
 {
 	int ilb;
+	const struct cpumask *hk_mask;
 
-	for_each_cpu_and(ilb, nohz.idle_cpus_mask,
-			      housekeeping_cpumask(HK_FLAG_MISC)) {
+	hk_mask = housekeeping_cpumask(HK_FLAG_MISC);
+
+	for_each_cpu_and(ilb, nohz.idle_cpus_mask, hk_mask) {
 
 		if (ilb == smp_processor_id())
 			continue;
-- 
2.27.0
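
The effect described in the changelog can be reproduced in user space. The
sketch below is not kernel code: get_hk_mask(), next_and(),
for_each_bit_and() and the walk_*() helpers are hypothetical stand-ins,
single-word masks replace struct cpumask, and a counter replaces ftrace.
It only assumes that the iterator macro has the same shape as the
expansion quoted above, so an argument that is a function call is
re-evaluated each time the loop condition runs.

```c
#include <assert.h>
#include <stdio.h>

#define NR_BITS 8

static int getter_calls;			/* how often the getter ran */
static const unsigned long hk_bits = 0xF6UL;	/* bits 1,2,4,5,6,7 set */

/* Stand-in for housekeeping_cpumask(): always returns the same mask. */
static const unsigned long *get_hk_mask(void)
{
	getter_calls++;
	return &hk_bits;
}

/* Simplified cpumask_next_and(): next bit after n set in both masks. */
static int next_and(int n, const unsigned long *a, const unsigned long *b)
{
	for (n++; n < NR_BITS; n++)
		if (((*a & *b) >> n) & 1UL)
			return n;
	return NR_BITS;
}

/* Same shape as for_each_cpu_and(): (a) and (b) sit in the condition. */
#define for_each_bit_and(i, a, b) \
	for ((i) = -1; (i) = next_and((i), (a), (b)), (i) < NR_BITS;)

/* Before the patch: the getter call is a macro argument. */
static int walk_inline(const unsigned long *span)
{
	int i, visited = 0;

	for_each_bit_and(i, span, get_hk_mask())
		visited++;
	return visited;
}

/* After the patch: the mask is fetched once, outside the loop. */
static int walk_hoisted(const unsigned long *span)
{
	const unsigned long *hk = get_hk_mask();
	int i, visited = 0;

	for_each_bit_and(i, span, hk)
		visited++;
	return visited;
}

/* Run one variant; return the number of getter calls it caused. */
static int calls_for(int (*walk)(const unsigned long *),
		     const unsigned long *span, int *visited)
{
	assert(walk && span && visited);
	getter_calls = 0;
	*visited = walk(span);
	printf("visited=%d getter_calls=%d\n", *visited, getter_calls);
	return getter_calls;
}
```

Calling calls_for(walk_inline, &span, &visited) and then
calls_for(walk_hoisted, &span, &visited) with the same span visits the same
bits in both cases. The inline variant calls the getter once per evaluation
of the loop condition, while the hoisted variant calls it once.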



end of thread, other threads:[~2021-06-28 13:58 UTC | newest]

Thread overview: 11+ messages
2021-06-06 13:11 [PATCH] sched: Optimize housekeeping_cpumask in for_each_cpu_and Yuan ZhaoXiong
2021-06-28 13:58 ` [tip: sched/core] sched: Optimize housekeeping_cpumask() in for_each_cpu_and() tip-bot2 for Yuan ZhaoXiong
  -- strict thread matches above, loose matches on Subject: below --
2021-06-02  2:03 [PATCH] sched: Optimize housekeeping_cpumask in for_each_cpu_and Yuan ZhaoXiong
2021-06-02  7:57 ` Peter Zijlstra
2021-04-17 15:01 Yuan ZhaoXiong
2021-04-19  9:56 ` Peter Zijlstra
2021-04-20  6:44   ` Yuan,Zhaoxiong
2021-04-30  6:38     ` Yuan,Zhaoxiong
2021-05-20  8:36 ` Peter Zijlstra
2021-05-27  9:40 ` Peter Zijlstra
2021-05-31 10:37   ` Peter Zijlstra
