From: "Yuan,Zhaoxiong" <yuanzhaoxiong@baidu.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: "mingo@redhat.com" <mingo@redhat.com>,
"juri.lelli@redhat.com" <juri.lelli@redhat.com>,
"vincent.guittot@linaro.org" <vincent.guittot@linaro.org>,
"dietmar.eggemann@arm.com" <dietmar.eggemann@arm.com>,
"rostedt@goodmis.org" <rostedt@goodmis.org>,
"bsegall@google.com" <bsegall@google.com>,
"mgorman@suse.de" <mgorman@suse.de>,
"bristot@redhat.com" <bristot@redhat.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] sched: Optimize housekeeping_cpumask in for_each_cpu_and
Date: Tue, 20 Apr 2021 06:44:54 +0000
Message-ID: <830177B0-45E0-4768-80AB-A99B85D3A52F@baidu.com>
In-Reply-To: <YH1T2f96IWlR7aOi@hirez.programming.kicks-ass.net>
On 2021/4/19 at 5:57 PM, "Peter Zijlstra" <peterz@infradead.org> wrote:
On Sat, Apr 17, 2021 at 11:01:37PM +0800, Yuan ZhaoXiong wrote:
> On a 128-core AMD machine, 8 cores are in nohz_full mode and
> the others are used for housekeeping. When many housekeeping CPUs are
> idle, ftrace shows a large amount of time spent in the loop that
> searches for the nearest busy housekeeping CPU.
>
> 9) | get_nohz_timer_target() {
> 9) | housekeeping_test_cpu() {
> 9) 0.390 us | housekeeping_get_mask.part.1();
> 9) 0.561 us | }
> 9) 0.090 us | __rcu_read_lock();
> 9) 0.090 us | housekeeping_cpumask();
> 9) 0.521 us | housekeeping_cpumask();
> 9) 0.140 us | housekeeping_cpumask();
>
> ...
>
> 9) 0.500 us | housekeeping_cpumask();
> 9) | housekeeping_any_cpu() {
> 9) 0.090 us | housekeeping_get_mask.part.1();
> 9) 0.100 us | sched_numa_find_closest();
> 9) 0.491 us | }
> 9) 0.100 us | __rcu_read_unlock();
> 9) + 76.163 us | }
>
> for_each_cpu_and() is a macro, so in get_nohz_timer_target()
> the loop
> for_each_cpu_and(i, sched_domain_span(sd),
> housekeeping_cpumask(HK_FLAG_TIMER))
> expands to:
> for (i = -1; i = cpumask_next_and(i, sched_domain_span(sd),
> housekeeping_cpumask(HK_FLAG_TIMER)), i < nr_cpu_ids;)
> As a result, housekeeping_cpumask() is invoked on every iteration.
> housekeeping_cpumask() returns a constant value, so it is
> unnecessary to invoke it every time. This patch reduces the worst-case
> search time from ~76us to ~16us in my testing.
>
> Similarly, the find_new_ilb() function has the same problem.
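The re-evaluation described in the quoted text can be reproduced in user space. The following is only a minimal sketch under stated assumptions: get_mask(), next_and(), for_each_and(), NR_CPUS and the mask value are hypothetical stand-ins invented for illustration, not the kernel's definitions. It has the same loop shape as the quoted expansion and counts how often the mask getter runs when it is passed as a macro argument versus when its result is cached in a local first.

```c
/* Hypothetical user-space stand-ins; none of these are kernel definitions. */
#define NR_CPUS 16

static unsigned long mask_calls;             /* number of get_mask() calls */
static const unsigned long hk_mask = 0x0ff0; /* made-up mask: bits 4..11 set */

/* Stand-in for housekeeping_cpumask(): returns the same pointer each time. */
static const unsigned long *get_mask(void)
{
    mask_calls++;
    return &hk_mask;
}

/* Stand-in for cpumask_next_and(): next bit after n that is set in both masks. */
static int next_and(int n, const unsigned long *a, const unsigned long *b)
{
    for (n++; n < NR_CPUS; n++)
        if ((*a >> n) & (*b >> n) & 1UL)
            return n;
    return NR_CPUS;
}

/* Same shape as the quoted expansion: arguments are pasted into the loop header. */
#define for_each_and(i, a, b) \
    for ((i) = -1; (i) = next_and((i), (a), (b)), (i) < NR_CPUS;)

/* get_mask() is evaluated every time the loop condition is evaluated. */
static int visit_inline(const unsigned long *span)
{
    int i, visited = 0;

    for_each_and(i, span, get_mask())
        visited++;
    return visited;
}

/* get_mask() is evaluated exactly once, before the loop. */
static int visit_hoisted(const unsigned long *span)
{
    const unsigned long *hk = get_mask();
    int i, visited = 0;

    for_each_and(i, span, hk)
        visited++;
    return visited;
}
```

With a span of 0xffff, both variants visit the same 8 bits, but visit_inline() calls get_mask() 9 times (8 matches plus the terminating check) while visit_hoisted() calls it once.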
Would it not make sense to mark housekeeping_cpumask() __pure instead?
After marking housekeeping_cpumask() __pure and testing again, the results
show that the large time cost of the loop that searches for the nearest busy
housekeeping CPU is still there.
Disassembling vmlinux with objdump -D gives the following code for
get_nohz_timer_target():
ffffffff810b96c0 <get_nohz_timer_target>:
ffffffff810b96c0: e8 db 7f 94 00 callq ffffffff81a016a0 <__fentry__>
ffffffff810b96c5: 41 57 push %r15
ffffffff810b96c7: 41 56 push %r14
ffffffff810b96c9: 41 55 push %r13
ffffffff810b96cb: 41 54 push %r12
ffffffff810b96cd: 55 push %rbp
ffffffff810b96ce: 53 push %rbx
ffffffff810b96cf: 48 83 ec 08 sub $0x8,%rsp
ffffffff810b96d3: 65 8b 1d 56 5a f5 7e mov %gs:0x7ef55a56(%rip),%ebx # f130 <cpu_number>
ffffffff810b96da: 41 89 dc mov %ebx,%r12d
ffffffff810b96dd: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
ffffffff810b96e2: 4c 63 f3 movslq %ebx,%r14
ffffffff810b96e5: 48 c7 c5 40 0b 02 00 mov $0x20b40,%rbp
ffffffff810b96ec: 4a 8b 04 f5 20 77 13 mov -0x7dec88e0(,%r14,8),%rax
ffffffff810b96f3: 82
ffffffff810b96f4: 49 89 ed mov %rbp,%r13
ffffffff810b96f7: 4c 01 e8 add %r13,%rax
ffffffff810b96fa: 48 8b 88 90 09 00 00 mov 0x990(%rax),%rcx
ffffffff810b9701: 48 39 88 88 09 00 00 cmp %rcx,0x988(%rax)
ffffffff810b9708: 0f 84 ce 00 00 00 je ffffffff810b97dc <get_nohz_timer_target+0x11c>
ffffffff810b970e: 48 83 c4 08 add $0x8,%rsp
ffffffff810b9712: 44 89 e0 mov %r12d,%eax
ffffffff810b9715: 5b pop %rbx
ffffffff810b9716: 5d pop %rbp
ffffffff810b9717: 41 5c pop %r12
ffffffff810b9719: 41 5d pop %r13
ffffffff810b971b: 41 5e pop %r14
ffffffff810b971d: 41 5f pop %r15
ffffffff810b971f: c3 retq
ffffffff810b9720: be 01 00 00 00 mov $0x1,%esi
ffffffff810b9725: 89 df mov %ebx,%edi
ffffffff810b9727: e8 74 87 02 00 callq ffffffff810e1ea0 <housekeeping_test_cpu>
ffffffff810b972c: 84 c0 test %al,%al
ffffffff810b972e: 75 b2 jne ffffffff810b96e2 <get_nohz_timer_target+0x22>
ffffffff810b9730: e8 0b ea 03 00 callq ffffffff810f8140 <__rcu_read_lock>
ffffffff810b9735: 48 c7 c5 40 0b 02 00 mov $0x20b40,%rbp
ffffffff810b973c: 48 63 d3 movslq %ebx,%rdx
ffffffff810b973f: c7 44 24 04 ff ff ff movl $0xffffffff,0x4(%rsp)
ffffffff810b9746: ff
ffffffff810b9747: 48 89 e8 mov %rbp,%rax
ffffffff810b974a: 48 03 04 d5 20 77 13 add -0x7dec88e0(,%rdx,8),%rax
ffffffff810b9751: 82
ffffffff810b9752: 4c 8b a8 d8 09 00 00 mov 0x9d8(%rax),%r13
ffffffff810b9759: 4d 85 ed test %r13,%r13
ffffffff810b975c: 0f 84 d3 00 00 00 je ffffffff810b9835 <get_nohz_timer_target+0x175>
ffffffff810b9762: 41 be ff ff ff ff mov $0xffffffff,%r14d
ffffffff810b9768: 4d 8d a5 38 01 00 00 lea 0x138(%r13),%r12
ffffffff810b976f: 45 89 f7 mov %r14d,%r15d
ffffffff810b9772: bf 01 00 00 00 mov $0x1,%edi
ffffffff810b9777: e8 f4 86 02 00 callq ffffffff810e1e70 <housekeeping_cpumask>
ffffffff810b977c: 44 89 ff mov %r15d,%edi
ffffffff810b977f: 48 89 c2 mov %rax,%rdx
ffffffff810b9782: 4c 89 e6 mov %r12,%rsi
ffffffff810b9785: e8 b6 ea 79 00 callq ffffffff81858240 <cpumask_next_and>
ffffffff810b978a: 3b 05 b4 4e 3e 01 cmp 0x13e4eb4(%rip),%eax # ffffffff8249e644 <nr_cpu_ids>
ffffffff810b9790: 41 89 c7 mov %eax,%r15d
ffffffff810b9793: 0f 83 84 00 00 00 jae ffffffff810b981d <get_nohz_timer_target+0x15d>
ffffffff810b9799: 44 39 fb cmp %r15d,%ebx
ffffffff810b979c: 74 d4 je ffffffff810b9772 <get_nohz_timer_target+0xb2>
ffffffff810b979e: 49 63 c7 movslq %r15d,%rax
ffffffff810b97a1: 48 89 ea mov %rbp,%rdx
ffffffff810b97a4: 48 03 14 c5 20 77 13 add -0x7dec88e0(,%rax,8),%rdx
ffffffff810b97ab: 82
ffffffff810b97ac: 48 8b 82 90 09 00 00 mov 0x990(%rdx),%rax
ffffffff810b97b3: 48 39 82 88 09 00 00 cmp %rax,0x988(%rdx)
ffffffff810b97ba: 75 13 jne ffffffff810b97cf <get_nohz_timer_target+0x10f>
ffffffff810b97bc: 8b 42 04 mov 0x4(%rdx),%eax
ffffffff810b97bf: 85 c0 test %eax,%eax
ffffffff810b97c1: 75 0c jne ffffffff810b97cf <get_nohz_timer_target+0x10f>
ffffffff810b97c3: 48 8b 82 20 0c 00 00 mov 0xc20(%rdx),%rax
ffffffff810b97ca: 48 85 c0 test %rax,%rax
ffffffff810b97cd: 74 a3 je ffffffff810b9772 <get_nohz_timer_target+0xb2>
ffffffff810b97cf: e8 1c 33 04 00 callq ffffffff810fcaf0 <__rcu_read_unlock>
ffffffff810b97d4: 45 89 fc mov %r15d,%r12d
ffffffff810b97d7: e9 32 ff ff ff jmpq ffffffff810b970e <get_nohz_timer_target+0x4e>
ffffffff810b97dc: 8b 50 04 mov 0x4(%rax),%edx
ffffffff810b97df: 85 d2 test %edx,%edx
ffffffff810b97e1: 0f 85 27 ff ff ff jne ffffffff810b970e <get_nohz_timer_target+0x4e>
ffffffff810b97e7: 48 8b 80 20 0c 00 00 mov 0xc20(%rax),%rax
ffffffff810b97ee: 48 85 c0 test %rax,%rax
ffffffff810b97f1: 0f 85 17 ff ff ff jne ffffffff810b970e <get_nohz_timer_target+0x4e>
ffffffff810b97f7: e8 44 e9 03 00 callq ffffffff810f8140 <__rcu_read_lock>
ffffffff810b97fc: 4e 03 2c f5 20 77 13 add -0x7dec88e0(,%r14,8),%r13
ffffffff810b9803: 82
ffffffff810b9804: 89 5c 24 04 mov %ebx,0x4(%rsp)
ffffffff810b9808: 41 89 df mov %ebx,%r15d
ffffffff810b980b: 4d 8b ad d8 09 00 00 mov 0x9d8(%r13),%r13
ffffffff810b9812: 4d 85 ed test %r13,%r13
ffffffff810b9815: 0f 85 47 ff ff ff jne ffffffff810b9762 <get_nohz_timer_target+0xa2>
ffffffff810b981b: eb 12 jmp ffffffff810b982f <get_nohz_timer_target+0x16f>
ffffffff810b981d: 4d 8b 6d 00 mov 0x0(%r13),%r13
ffffffff810b9821: 4d 85 ed test %r13,%r13
ffffffff810b9824: 0f 85 3e ff ff ff jne ffffffff810b9768 <get_nohz_timer_target+0xa8>
ffffffff810b982a: 44 8b 7c 24 04 mov 0x4(%rsp),%r15d
ffffffff810b982f: 41 83 ff ff cmp $0xffffffff,%r15d
ffffffff810b9833: 75 9a jne ffffffff810b97cf <get_nohz_timer_target+0x10f>
ffffffff810b9835: bf 01 00 00 00 mov $0x1,%edi
ffffffff810b983a: e8 91 86 02 00 callq ffffffff810e1ed0 <housekeeping_any_cpu>
ffffffff810b983f: 41 89 c7 mov %eax,%r15d
ffffffff810b9842: eb 8b jmp ffffffff810b97cf <get_nohz_timer_target+0x10f>
ffffffff810b9844: 66 90 xchg %ax,%ax
ffffffff810b9846: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
ffffffff810b984d: 00 00 00
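The disassembly shows that marking the function __pure has no effect here: the
callq to housekeeping_cpumask() at ffffffff810b9777 is still inside the loop,
whose back edges jump to ffffffff810b9772.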
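One plausible reason, offered as an assumption and not something established in this thread: a function with the GCC/Clang pure attribute may still read global memory, so its result is only guaranteed to be stable while that memory is unchanged. The loop also calls cpumask_next_and() out of line, and if the compiler cannot see that this call leaves memory untouched, it has to call the pure function again. The sketch below uses hypothetical names (hk_bits, read_mask(), opaque_step()) to show that two calls to a pure function can legitimately return different values when a call in between writes the global it reads.

```c
/* Hypothetical stand-ins for illustration; not kernel code. */
static unsigned long hk_bits = 0x0ff0;

/* pure: no side effects, but the result may depend on global memory. */
__attribute__((pure)) static unsigned long read_mask(void)
{
    return hk_bits;
}

/* Stand-in for a call that may write global memory. */
static void opaque_step(void)
{
    hk_bits >>= 1;
}

/* Returns 1 if the pure function's result changed across the call. */
static int results_differ(void)
{
    unsigned long before = read_mask();

    opaque_step();
    return before != read_mask();
}
```

Here results_differ() returns 1, so the second read_mask() call cannot be replaced by the first result. Caching the value in a local before the loop states the intent explicitly and does not rely on the optimiser.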