sched: Optimize housekeeping_cpumask in for_each_cpu_and

Message ID 1618671697-26098-1-git-send-email-yuanzhaoxiong@baidu.com
State New, archived

Commit Message

Yuan ZhaoXiong April 17, 2021, 3:01 p.m. UTC
On a 128-core AMD machine, 8 cores are in nohz_full mode and the others
are used for housekeeping. When many housekeeping CPUs are idle, ftrace
shows a large amount of time spent in the loop that searches for the
nearest busy housekeeping CPU.

   9)               |              get_nohz_timer_target() {
   9)               |                housekeeping_test_cpu() {
   9)   0.390 us    |                  housekeeping_get_mask.part.1();
   9)   0.561 us    |                }
   9)   0.090 us    |                __rcu_read_lock();
   9)   0.090 us    |                housekeeping_cpumask();
   9)   0.521 us    |                housekeeping_cpumask();
   9)   0.140 us    |                housekeeping_cpumask();

   ...

   9)   0.500 us    |                housekeeping_cpumask();
   9)               |                housekeeping_any_cpu() {
   9)   0.090 us    |                  housekeeping_get_mask.part.1();
   9)   0.100 us    |                  sched_numa_find_closest();
   9)   0.491 us    |                }
   9)   0.100 us    |                __rcu_read_unlock();
   9) + 76.163 us   |              }

for_each_cpu_and() is a macro, so in get_nohz_timer_target() the loop
        for_each_cpu_and(i, sched_domain_span(sd),
                housekeeping_cpumask(HK_FLAG_TIMER))
expands to:
        for (i = -1; i = cpumask_next_and(i, sched_domain_span(sd),
                housekeeping_cpumask(HK_FLAG_TIMER)), i < nr_cpu_ids;)
As a result, housekeeping_cpumask() is invoked on every iteration.
housekeeping_cpumask() returns a constant value, so there is no need to
call it each time. In my testing, this patch reduces the worst-case
search time from ~76us to ~16us.
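
The re-evaluation described above can be reproduced outside the kernel. The following is a minimal userspace sketch, not kernel code: get_mask(), next_and(), for_each_bit_and(), hk_bits and mask_calls are hypothetical stand-ins for housekeeping_cpumask(), cpumask_next_and() and for_each_cpu_and(), and the 8-bit mask is chosen only for illustration.

```c
#include <assert.h>
#include <stdio.h>

#define NR_BITS 8

static int mask_calls;                      /* counts calls to get_mask() */
static const unsigned int hk_bits = 0xF5u;  /* hypothetical mask: bits 0,2,4,5,6,7 */

/* Stand-in for housekeeping_cpumask(): always returns the same mask. */
static unsigned int get_mask(void)
{
	mask_calls++;
	return hk_bits;
}

/* Stand-in for cpumask_next_and(): next bit after n that is set in a and b. */
static int next_and(int n, unsigned int a, unsigned int b)
{
	assert(n >= -1);
	for (n++; n < NR_BITS; n++)
		if (a & b & (1u << n))
			return n;
	return NR_BITS;
}

/* Same shape as the expansion quoted above: (b) sits in the loop condition. */
#define for_each_bit_and(i, a, b) \
	for ((i) = -1; (i) = next_and((i), (a), (b)), (i) < NR_BITS;)

/* Passes the call expression to the macro, so it runs on every check. */
static int count_naive(unsigned int span)
{
	int i, visited = 0;

	mask_calls = 0;
	for_each_bit_and(i, span, get_mask())
		visited++;
	return visited;
}

/* Caches the mask in a local first, as the patch does with hk_mask. */
static int count_hoisted(unsigned int span)
{
	int i, visited = 0;
	unsigned int hk;

	mask_calls = 0;
	hk = get_mask();
	for_each_bit_and(i, span, hk)
		visited++;
	return visited;
}

/* Prints the call counts for both variants. */
static void report(unsigned int span)
{
	int n = count_naive(span);

	printf("naive:   visited=%d calls=%d\n", n, mask_calls);
	n = count_hoisted(span);
	printf("hoisted: visited=%d calls=%d\n", n, mask_calls);
}
```

With span = 0xFF, both variants visit 6 bits. count_naive() calls get_mask() 7 times (once per evaluation of the loop condition, including the final failing one), while count_hoisted() calls it once.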

Similarly, the find_new_ilb() function has the same problem.

Signed-off-by: Yuan ZhaoXiong <yuanzhaoxiong@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
---
 kernel/sched/core.c | 6 ++++--
 kernel/sched/fair.c | 6 ++++--
 2 files changed, 8 insertions(+), 4 deletions(-)

Comments

Peter Zijlstra April 19, 2021, 9:56 a.m. UTC | #1
On Sat, Apr 17, 2021 at 11:01:37PM +0800, Yuan ZhaoXiong wrote:
> [...]

Would it not make sense to mark housekeeping_cpumask() __pure instead?
Yuan ZhaoXiong April 20, 2021, 6:44 a.m. UTC | #2
On 2021/4/19, 5:57 PM, "Peter Zijlstra" <peterz@infradead.org> wrote:

    On Sat, Apr 17, 2021 at 11:01:37PM +0800, Yuan ZhaoXiong wrote:
    > [...]
    
    Would it not make sense to mark housekeeping_cpumask() __pure instead?
    
After marking housekeeping_cpumask() __pure and testing again, the results
show that the large amount of time spent in the loop searching for the
nearest busy housekeeper is still there.

Using objdump -D vmlinux, the disassembly of get_nohz_timer_target() is
as follows:
ffffffff810b96c0 <get_nohz_timer_target>:
ffffffff810b96c0:       e8 db 7f 94 00          callq  ffffffff81a016a0 <__fentry__>
ffffffff810b96c5:       41 57                   push   %r15
ffffffff810b96c7:       41 56                   push   %r14
ffffffff810b96c9:       41 55                   push   %r13
ffffffff810b96cb:       41 54                   push   %r12
ffffffff810b96cd:       55                      push   %rbp
ffffffff810b96ce:       53                      push   %rbx
ffffffff810b96cf:       48 83 ec 08             sub    $0x8,%rsp
ffffffff810b96d3:       65 8b 1d 56 5a f5 7e    mov    %gs:0x7ef55a56(%rip),%ebx        # f130 <cpu_number>
ffffffff810b96da:       41 89 dc                mov    %ebx,%r12d
ffffffff810b96dd:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffff810b96e2:       4c 63 f3                movslq %ebx,%r14
ffffffff810b96e5:       48 c7 c5 40 0b 02 00    mov    $0x20b40,%rbp
ffffffff810b96ec:       4a 8b 04 f5 20 77 13    mov    -0x7dec88e0(,%r14,8),%rax
ffffffff810b96f3:       82
ffffffff810b96f4:       49 89 ed                mov    %rbp,%r13
ffffffff810b96f7:       4c 01 e8                add    %r13,%rax
ffffffff810b96fa:       48 8b 88 90 09 00 00    mov    0x990(%rax),%rcx
ffffffff810b9701:       48 39 88 88 09 00 00    cmp    %rcx,0x988(%rax)
ffffffff810b9708:       0f 84 ce 00 00 00       je     ffffffff810b97dc <get_nohz_timer_target+0x11c>
ffffffff810b970e:       48 83 c4 08             add    $0x8,%rsp
ffffffff810b9712:       44 89 e0                mov    %r12d,%eax
ffffffff810b9715:       5b                      pop    %rbx
ffffffff810b9716:       5d                      pop    %rbp
ffffffff810b9717:       41 5c                   pop    %r12
ffffffff810b9719:       41 5d                   pop    %r13
ffffffff810b971b:       41 5e                   pop    %r14
ffffffff810b971d:       41 5f                   pop    %r15
ffffffff810b971f:       c3                      retq
ffffffff810b9720:       be 01 00 00 00          mov    $0x1,%esi
ffffffff810b9725:       89 df                   mov    %ebx,%edi
ffffffff810b9727:       e8 74 87 02 00          callq  ffffffff810e1ea0 <housekeeping_test_cpu>
ffffffff810b972c:       84 c0                   test   %al,%al
ffffffff810b972e:       75 b2                   jne    ffffffff810b96e2 <get_nohz_timer_target+0x22>
ffffffff810b9730:       e8 0b ea 03 00          callq  ffffffff810f8140 <__rcu_read_lock>
ffffffff810b9735:       48 c7 c5 40 0b 02 00    mov    $0x20b40,%rbp
ffffffff810b973c:       48 63 d3                movslq %ebx,%rdx
ffffffff810b973f:       c7 44 24 04 ff ff ff    movl   $0xffffffff,0x4(%rsp)
ffffffff810b9746:       ff
ffffffff810b9747:       48 89 e8                mov    %rbp,%rax
ffffffff810b974a:       48 03 04 d5 20 77 13    add    -0x7dec88e0(,%rdx,8),%rax
ffffffff810b9751:       82
ffffffff810b9752:       4c 8b a8 d8 09 00 00    mov    0x9d8(%rax),%r13
ffffffff810b9759:       4d 85 ed                test   %r13,%r13
ffffffff810b975c:       0f 84 d3 00 00 00       je     ffffffff810b9835 <get_nohz_timer_target+0x175>
ffffffff810b9762:       41 be ff ff ff ff       mov    $0xffffffff,%r14d
ffffffff810b9768:       4d 8d a5 38 01 00 00    lea    0x138(%r13),%r12
ffffffff810b976f:       45 89 f7                mov    %r14d,%r15d
ffffffff810b9772:       bf 01 00 00 00          mov    $0x1,%edi
ffffffff810b9777:       e8 f4 86 02 00          callq  ffffffff810e1e70 <housekeeping_cpumask>
ffffffff810b977c:       44 89 ff                mov    %r15d,%edi
ffffffff810b977f:       48 89 c2                mov    %rax,%rdx
ffffffff810b9782:       4c 89 e6                mov    %r12,%rsi
ffffffff810b9785:       e8 b6 ea 79 00          callq  ffffffff81858240 <cpumask_next_and>
ffffffff810b978a:       3b 05 b4 4e 3e 01       cmp    0x13e4eb4(%rip),%eax        # ffffffff8249e644 <nr_cpu_ids>
ffffffff810b9790:       41 89 c7                mov    %eax,%r15d
ffffffff810b9793:       0f 83 84 00 00 00       jae    ffffffff810b981d <get_nohz_timer_target+0x15d>
ffffffff810b9799:       44 39 fb                cmp    %r15d,%ebx
ffffffff810b979c:       74 d4                   je     ffffffff810b9772 <get_nohz_timer_target+0xb2>
ffffffff810b979e:       49 63 c7                movslq %r15d,%rax
ffffffff810b97a1:       48 89 ea                mov    %rbp,%rdx
ffffffff810b97a4:       48 03 14 c5 20 77 13    add    -0x7dec88e0(,%rax,8),%rdx
ffffffff810b97ab:       82
ffffffff810b97ac:       48 8b 82 90 09 00 00    mov    0x990(%rdx),%rax
ffffffff810b97b3:       48 39 82 88 09 00 00    cmp    %rax,0x988(%rdx)
ffffffff810b97ba:       75 13                   jne    ffffffff810b97cf <get_nohz_timer_target+0x10f>
ffffffff810b97bc:       8b 42 04                mov    0x4(%rdx),%eax
ffffffff810b97bf:       85 c0                   test   %eax,%eax
ffffffff810b97c1:       75 0c                   jne    ffffffff810b97cf <get_nohz_timer_target+0x10f>
ffffffff810b97c3:       48 8b 82 20 0c 00 00    mov    0xc20(%rdx),%rax
ffffffff810b97ca:       48 85 c0                test   %rax,%rax
ffffffff810b97cd:       74 a3                   je     ffffffff810b9772 <get_nohz_timer_target+0xb2>
ffffffff810b97cf:       e8 1c 33 04 00          callq  ffffffff810fcaf0 <__rcu_read_unlock>
ffffffff810b97d4:       45 89 fc                mov    %r15d,%r12d
ffffffff810b97d7:       e9 32 ff ff ff          jmpq   ffffffff810b970e <get_nohz_timer_target+0x4e>
ffffffff810b97dc:       8b 50 04                mov    0x4(%rax),%edx
ffffffff810b97df:       85 d2                   test   %edx,%edx
ffffffff810b97e1:       0f 85 27 ff ff ff       jne    ffffffff810b970e <get_nohz_timer_target+0x4e>
ffffffff810b97e7:       48 8b 80 20 0c 00 00    mov    0xc20(%rax),%rax
ffffffff810b97ee:       48 85 c0                test   %rax,%rax
ffffffff810b97f1:       0f 85 17 ff ff ff       jne    ffffffff810b970e <get_nohz_timer_target+0x4e>
ffffffff810b97f7:       e8 44 e9 03 00          callq  ffffffff810f8140 <__rcu_read_lock>
ffffffff810b97fc:       4e 03 2c f5 20 77 13    add    -0x7dec88e0(,%r14,8),%r13
ffffffff810b9803:       82
ffffffff810b9804:       89 5c 24 04             mov    %ebx,0x4(%rsp)
ffffffff810b9808:       41 89 df                mov    %ebx,%r15d
ffffffff810b980b:       4d 8b ad d8 09 00 00    mov    0x9d8(%r13),%r13
ffffffff810b9812:       4d 85 ed                test   %r13,%r13
ffffffff810b9815:       0f 85 47 ff ff ff       jne    ffffffff810b9762 <get_nohz_timer_target+0xa2>
ffffffff810b981b:       eb 12                   jmp    ffffffff810b982f <get_nohz_timer_target+0x16f>
ffffffff810b981d:       4d 8b 6d 00             mov    0x0(%r13),%r13
ffffffff810b9821:       4d 85 ed                test   %r13,%r13
ffffffff810b9824:       0f 85 3e ff ff ff       jne    ffffffff810b9768 <get_nohz_timer_target+0xa8>
ffffffff810b982a:       44 8b 7c 24 04          mov    0x4(%rsp),%r15d
ffffffff810b982f:       41 83 ff ff             cmp    $0xffffffff,%r15d
ffffffff810b9833:       75 9a                   jne    ffffffff810b97cf <get_nohz_timer_target+0x10f>
ffffffff810b9835:       bf 01 00 00 00          mov    $0x1,%edi
ffffffff810b983a:       e8 91 86 02 00          callq  ffffffff810e1ed0 <housekeeping_any_cpu>
ffffffff810b983f:       41 89 c7                mov    %eax,%r15d
ffffffff810b9842:       eb 8b                   jmp    ffffffff810b97cf <get_nohz_timer_target+0x10f>
ffffffff810b9844:       66 90                   xchg   %ax,%ax
ffffffff810b9846:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
ffffffff810b984d:       00 00 00

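The disassembly shows that housekeeping_cpumask() is still called inside
the loop (the callq at ffffffff810b9777, which both back-edges at
ffffffff810b979c and ffffffff810b97cd jump to), so the __pure mark does
not work.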
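
One possible reading of this result, which is not confirmed anywhere in this thread: GCC documents that a pure function may read global memory, so repeated calls can only be merged when the compiler can show that nothing the function reads has changed in between. In the loop above, every housekeeping_cpumask() call is followed by an out-of-line call to cpumask_next_and(), which the compiler presumably cannot see into while compiling core.c. The sketch below is a minimal userspace illustration of that loop shape only; hk_bits, get_mask(), next_and() and count_common_bits() are hypothetical names, not kernel code.

```c
#include <assert.h>

#define NR_BITS 8

/* Hypothetical global that the pure function reads. */
static unsigned int hk_bits = 0xF5u;

/*
 * pure: no side effects, but the result may depend on global memory,
 * so the compiler has to assume it can change whenever memory may change.
 */
static __attribute__((pure)) unsigned int get_mask(int enabled)
{
	return enabled ? hk_bits : 0u;
}

/*
 * Stand-in for cpumask_next_and(). In the kernel case the callee is a
 * separate out-of-line function; noinline only approximates that here.
 */
static __attribute__((noinline)) int next_and(int n, unsigned int a, unsigned int b)
{
	assert(n >= -1);
	for (n++; n < NR_BITS; n++)
		if (a & b & (1u << n))
			return n;
	return NR_BITS;
}

/* get_mask() sits in the loop condition, next to the other call. */
int count_common_bits(unsigned int span)
{
	int i, visited = 0;

	for (i = -1; i = next_and(i, span, get_mask(1)), i < NR_BITS;)
		visited++;
	return visited;
}
```

Whether a given compiler hoists get_mask() out of such a loop depends on the optimization level and on what it can prove about the other call. Storing the result in a local variable before the loop, as the patch does, removes that dependence.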
Yuan ZhaoXiong April 30, 2021, 6:38 a.m. UTC | #3
> The disassembled code proves that the __pure mark does not work.

So far, the __pure mark has had no effect in our tests. Should the patch be merged into mainline?

Thanks,
Yuan ZhaoXiong
Peter Zijlstra May 20, 2021, 8:36 a.m. UTC | #4
On Sat, Apr 17, 2021 at 11:01:37PM +0800, Yuan ZhaoXiong wrote:
> [...]

Thanks!
Peter Zijlstra May 27, 2021, 9:40 a.m. UTC | #5
On Sat, Apr 17, 2021 at 11:01:37PM +0800, Yuan ZhaoXiong wrote:
> [...]
> Signed-off-by: Yuan ZhaoXiong <yuanzhaoxiong@baidu.com>
> Signed-off-by: Li RongQing <lirongqing@baidu.com>

Just noticed, this SoB chain isn't valid. What do I do with Li's entry?
Peter Zijlstra May 31, 2021, 10:37 a.m. UTC | #6
On Thu, May 27, 2021 at 11:40:42AM +0200, Peter Zijlstra wrote:
> [...]
> Just noticed, this SoB chain isn't valid. What do I do with Li's entry?

I'm dropping this patch, please resend with a valid SoB chain.

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 98191218d891..14ad3bb36321 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -645,6 +645,7 @@  int get_nohz_timer_target(void)
 {
 	int i, cpu = smp_processor_id(), default_cpu = -1;
 	struct sched_domain *sd;
+	const struct cpumask *hk_mask;
 
 	if (housekeeping_cpu(cpu, HK_FLAG_TIMER)) {
 		if (!idle_cpu(cpu))
@@ -652,10 +653,11 @@  int get_nohz_timer_target(void)
 		default_cpu = cpu;
 	}
 
+	hk_mask = housekeeping_cpumask(HK_FLAG_TIMER);
+
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
-		for_each_cpu_and(i, sched_domain_span(sd),
-			housekeeping_cpumask(HK_FLAG_TIMER)) {
+		for_each_cpu_and(i, sched_domain_span(sd), hk_mask) {
 			if (cpu == i)
 				continue;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..d3ecfbf160bf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10097,9 +10097,11 @@  static inline int on_null_domain(struct rq *rq)
 static inline int find_new_ilb(void)
 {
 	int ilb;
+	const struct cpumask *hk_mask;
 
-	for_each_cpu_and(ilb, nohz.idle_cpus_mask,
-			      housekeeping_cpumask(HK_FLAG_MISC)) {
+	hk_mask = housekeeping_cpumask(HK_FLAG_MISC);
+
+	for_each_cpu_and(ilb, nohz.idle_cpus_mask, hk_mask) {
 
 		if (ilb == smp_processor_id())
 			continue;