linux-kernel.vger.kernel.org archive mirror
From: Aaron Lu <aaron.lu@intel.com>
To: David Vernet <void@manifault.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	<linux-kernel@vger.kernel.org>, <mingo@redhat.com>,
	<juri.lelli@redhat.com>, <vincent.guittot@linaro.org>,
	<rostedt@goodmis.org>, <dietmar.eggemann@arm.com>,
	<bsegall@google.com>, <mgorman@suse.de>, <bristot@redhat.com>,
	<vschneid@redhat.com>, <joshdon@google.com>,
	<roman.gushchin@linux.dev>, <tj@kernel.org>,
	<kernel-team@meta.com>
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS
Date: Thu, 15 Jun 2023 12:49:17 +0800
Message-ID: <20230615044917.GA109334@ziqianlu-dell>
In-Reply-To: <20230615000103.GC2883716@maniforge>

Hi David,

On Wed, Jun 14, 2023 at 07:01:03PM -0500, David Vernet wrote:
> Hi Aaron,
> 
> Thanks for taking a look and running some benchmarks. I tried to
> reproduce your results on a 26 core / 52 thread Cooperlake host, and a
> 20 core / 40 thread x 2 Skylake host, but I wasn't able to observe the
> contention on the per-swqueue spinlock you saw on your Ice Lake.
> 
> I ran the following netperf benchmarks:
> 
> - netperf -l 60 -n $(nproc) -6
> - netperf -l 60 -n $(nproc) -6 -t UDP_RR

Just to confirm: did you run multiple instances of the above, or just one?

> 
> Here are the results I'm seeing from taking the average of running those
> benchmarks three times on each host (all results are from the "Throughput"
> column of the below table).
> 
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput <<<
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 
> 26 / 52 x 1 Cooperlake
> ----------------------
> 
> Default workload:
> 
> NO_SWQUEUE: 34103.76
> SWQUEUE:    34707.46
> Delta: +1.77%
> 
> UDP_RR:
> 
> NO_SWQUEUE: 57695.29
> SWQUEUE: 54827.36
> Delta: -4.97%
> 
> There's clearly a statistically significant decline in performance for
> UDP_RR here, but surprisingly, I don't see any contention on swqueue
> locks when I profile with perf. Rather, we seem to be contending on the
> rq lock (presumably in swqueue_pick_next_task()), which is unexpected (to
> me, at least):

I don't see rq lock contention, though. Since you did see some
contention here, I suppose you must have started multiple instances of
the netperf client?

> 
>      7.81%  netperf  [kernel.vmlinux]                                           [k] queued_spin_lock_slowpath
>             |          
>              --7.81%--queued_spin_lock_slowpath
>                        |          
>                         --7.55%--task_rq_lock
>                                   pick_next_task_fair
>                                   schedule
>                                   schedule_timeout
>                                   __skb_wait_for_more_packets
>                                   __skb_recv_udp
>                                   udpv6_recvmsg
>                                   inet6_recvmsg
>                                   __x64_sys_recvfrom
>                                   do_syscall_64
>                                   entry_SYSCALL_64
>                                   __libc_recvfrom
>                                   0x7f923fdfacf0
>                                   0
>      4.97%  netperf  [kernel.vmlinux]                                           [k] _raw_spin_lock_irqsave
>             |          
>              --4.96%--_raw_spin_lock_irqsave
>                        |          
>                        |--1.13%--prepare_to_wait_exclusive
>                        |          __skb_wait_for_more_packets
>                        |          __skb_recv_udp
>                        |          udpv6_recvmsg
>                        |          inet6_recvmsg
>                        |          __x64_sys_recvfrom
>                        |          do_syscall_64
>                        |          entry_SYSCALL_64
>                        |          __libc_recvfrom
>                        |          0x7f923fdfacf0
>                        |          0
>                        |          
>                        |--0.92%--default_wake_function
>                        |          autoremove_wake_function
>                        |          __wake_up_sync_key
>                        |          sock_def_readable
>                        |          __udp_enqueue_schedule_skb
>                        |          udpv6_queue_rcv_one_skb
>                        |          __udp6_lib_rcv
>                        |          ip6_input
>                        |          ipv6_rcv
>                        |          process_backlog
>                        |          net_rx_action
>                        |          __softirqentry_text_start
>                        |          __local_bh_enable_ip
>                        |          ip6_output
>                        |          ip6_local_out
>                        |          ip6_send_skb
>                        |          udp_v6_send_skb
>                        |          udpv6_sendmsg
>                        |          __sys_sendto
>                        |          __x64_sys_sendto
>                        |          do_syscall_64
>                        |          entry_SYSCALL_64
>                        |          __libc_sendto
>                        |          
>                        |--0.81%--__wake_up_sync_key
>                        |          sock_def_readable
>                        |          __udp_enqueue_schedule_skb
>                        |          udpv6_queue_rcv_one_skb
>                        |          __udp6_lib_rcv
>                        |          ip6_input
>                        |          ipv6_rcv
>                        |          process_backlog
>                        |          net_rx_action
>                        |          __softirqentry_text_start
>                        |          __local_bh_enable_ip
>                        |          ip6_output
>                        |          ip6_local_out
>                        |          ip6_send_skb
>                        |          udp_v6_send_skb
>                        |          udpv6_sendmsg
>                        |          __sys_sendto
>                        |          __x64_sys_sendto
>                        |          do_syscall_64
>                        |          entry_SYSCALL_64
>                        |          __libc_sendto
>                        |          
>                        |--0.73%--enqueue_task_fair
>                        |          |          
>                        |           --0.72%--default_wake_function
>                        |                     autoremove_wake_function
>                        |                     __wake_up_sync_key
>                        |                     sock_def_readable
>                        |                     __udp_enqueue_schedule_skb
>                        |                     udpv6_queue_rcv_one_skb
>                        |                     __udp6_lib_rcv
>                        |                     ip6_input
>                        |                     ipv6_rcv
>                        |                     process_backlog
>                        |                     net_rx_action
>                        |                     __softirqentry_text_start
>                        |                     __local_bh_enable_ip
>                        |                     ip6_output
>                        |                     ip6_local_out
>                        |                     ip6_send_skb
>                        |                     udp_v6_send_skb
>                        |                     udpv6_sendmsg
>                        |                     __sys_sendto
>                        |                     __x64_sys_sendto
>                        |                     do_syscall_64
>                        |                     entry_SYSCALL_64
>                        |                     __libc_sendto
>                        |          
>                         --0.68%--task_rq_lock
>                                   pick_next_task_fair
>                                   schedule
>                                   schedule_timeout
>                                   __skb_wait_for_more_packets
>                                   __skb_recv_udp
>                                   udpv6_recvmsg
>                                   inet6_recvmsg
>                                   __x64_sys_recvfrom
>                                   do_syscall_64
>                                   entry_SYSCALL_64
>                                   __libc_recvfrom
>                                   0x7f923fdfacf0
>                                   0
> 
> The profile without swqueue doesn't show any contention on the rq lock,
> but does show us spending a good amount of time in the scheduler:
> 
>      4.03%  netperf  [kernel.vmlinux]                                           [k] default_wake_function
>             |          
>              --3.98%--default_wake_function
>                        autoremove_wake_function
>                        __wake_up_sync_key
>                        sock_def_readable
>                        __udp_enqueue_schedule_skb
>                        udpv6_queue_rcv_one_skb
>                        __udp6_lib_rcv
>                        ip6_input
>                        ipv6_rcv
>                        process_backlog
>                        net_rx_action
>                        __softirqentry_text_start
>                        __local_bh_enable_ip
>                        ip6_output
>                        ip6_local_out
>                        ip6_send_skb
>                        udp_v6_send_skb
>                        udpv6_sendmsg
>                        __sys_sendto
>                        __x64_sys_sendto
>                        do_syscall_64
>                        entry_SYSCALL_64
>                        __libc_sendto
>      3.70%  netperf  [kernel.vmlinux]                                           [k] enqueue_entity
>             |          
>              --3.66%--enqueue_entity
>                        enqueue_task_fair
>                        |          
>                         --3.66%--default_wake_function
>                                   autoremove_wake_function
>                                   __wake_up_sync_key
>                                   sock_def_readable
>                                   __udp_enqueue_schedule_skb
>                                   udpv6_queue_rcv_one_skb
>                                   __udp6_lib_rcv
>                                   ip6_input
>                                   ipv6_rcv
>                                   process_backlog
>                                   net_rx_action
>                                   __softirqentry_text_start
>                                   __local_bh_enable_ip
>                                   ip6_output
>                                   ip6_local_out
>                                   ip6_send_skb
>                                   udp_v6_send_skb
>                                   udpv6_sendmsg
>                                   __sys_sendto
>                                   __x64_sys_sendto
>                                   do_syscall_64
>                                   entry_SYSCALL_64
>                                   __libc_sendto
> 
> There's clearly a measurable impact on performance here due to swqueue
> (negative for UDP_RR, positive for the default workload), but this looks
> quite different than what you were observing.

Yes.

> 
> 20 / 40 x 2 Skylake
> -------------------
> 
> Default workload:
> 
> NO_SWQUEUE: 57437.45
> SWQUEUE: 58801.11
> Delta: +2.37%
> 
> UDP_RR:
> 
> NO_SWQUEUE: 75932.28
> SWQUEUE: 75232.77
> Delta: -0.92%
> 
> Same observation here. I didn't collect a profile, but the trend seems
> consistent, and there's clearly a tradeoff. Despite the small drop in
> perf for UDP_RR, it seems quite a bit less drastic than what would be
> expected with the contention you showed in your profile.

Indeed.
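
As a rough cross-check of the deltas quoted above, the relative change can
be recomputed from the NO_SWQUEUE and SWQUEUE averages. This is only a
sketch: the helper name pct_delta and the RESULTS table layout are
illustrative and not part of the benchmark tooling; the numbers are copied
from the quoted results.

```python
def pct_delta(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline throughput must be non-zero")
    return (candidate - baseline) / baseline * 100.0


# (NO_SWQUEUE, SWQUEUE) throughput averages copied from the quoted results.
RESULTS = {
    ("Cooperlake", "default"): (34103.76, 34707.46),
    ("Cooperlake", "UDP_RR"): (57695.29, 54827.36),
    ("Skylake", "default"): (57437.45, 58801.11),
    ("Skylake", "UDP_RR"): (75932.28, 75232.77),
}

for (host, workload), (no_swq, swq) in RESULTS.items():
    # A positive delta means SWQUEUE had higher throughput than NO_SWQUEUE.
    print(f"{host:<10} {workload:<8} {pct_delta(no_swq, swq):+.2f}%")
```

The recomputed values agree with the quoted deltas to two decimal places.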

> 
> 7950X (8 cores / 16 threads per CCX, 2 CCXs)
> --------------------------------------------
> 
> Default workload:
> 
> NO_SWQUEUE: 77615.08
> SWQUEUE: 77465.73
> Delta: -0.19%
> 
> UDP_RR:
> 
> NO_SWQUEUE: 230258.75
> SWQUEUE: 230280.34
> Delta: ~0%
> 
> I'd call this essentially a no-op.
> 
 
> With all that said, I have a few thoughts.
> 
> Firstly, would you mind please sharing your .config? It's possible that
> the hosts I'm testing on just don't have big enough LLCs to observe the
> contention you're seeing on the swqueue spinlock, as my 26 / 52 CPL host
> is smaller than a single socket of your 32/64 ICL host. On the other
> hand, your ICL isn't _that_ much bigger per-LLC, so I'd be curious to

Agreed. Your 26 core / 52 thread host isn't that much smaller than mine,
and I had expected to see something similar.

> see if there's a .config difference here that's causing the contention.

Attached my .config.
I also pushed the branch I used for testing to GitHub just in case you
want to take a look: https://github.com/aaronlu/linux.git swqueue

> Also, the fact that Milan (which has only 6 cores / 12 threads per LLC)
> also saw a performance hit with swqueue for UDP_RR suggests to me that
> the issue with UDP_RR is not the scalability of the per-LLC swqueue
> spinlock.

I've tested again today and I still saw serious contention on
swqueue->lock with your cmdline. I did it this way:
"
$ netserver                                                                                      
Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC

$ for i in `seq 128`; do netperf -l 60 -n 128 -6 -t UDP_RR & done
"
And the profile is about the same as my last posting:

    83.23%    83.18%  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath            -      -            
            |          
            |--40.23%--sendto
            |          entry_SYSCALL_64
            |          do_syscall_64
            |          |          
            |           --40.22%--__x64_sys_sendto
            |                     __sys_sendto
            |                     sock_sendmsg
            |                     inet6_sendmsg
            |                     udpv6_sendmsg
            |                     udp_v6_send_skb
            |                     ip6_send_skb
            |                     ip6_local_out
            |                     ip6_output
            |                     ip6_finish_output
            |                     ip6_finish_output2
            |                     __dev_queue_xmit
            |                     __local_bh_enable_ip
            |                     do_softirq.part.0
            |                     __do_softirq
            |                     net_rx_action
            |                     __napi_poll
            |                     process_backlog
            |                     __netif_receive_skb
            |                     __netif_receive_skb_one_core
            |                     ipv6_rcv
            |                     ip6_input
            |                     ip6_input_finish
            |                     ip6_protocol_deliver_rcu
            |                     udpv6_rcv
            |                     __udp6_lib_rcv
            |                     udp6_unicast_rcv_skb
            |                     udpv6_queue_rcv_skb
            |                     udpv6_queue_rcv_one_skb
            |                     __udp_enqueue_schedule_skb
            |                     sock_def_readable
            |                     __wake_up_sync_key
            |                     __wake_up_common_lock
            |                     |          
            |                      --40.22%--__wake_up_common
            |                                receiver_wake_function
            |                                autoremove_wake_function
            |                                default_wake_function
            |                                try_to_wake_up
            |                                ttwu_do_activate
            |                                enqueue_task
            |                                enqueue_task_fair
            |                                |          
            |                                 --40.22%--_raw_spin_lock_irqsave
            |                                           |          
            |                                            --40.21%--native_queued_spin_lock_slowpath
            |          
            |--38.59%--recvfrom
            |          |          
            |           --38.59%--entry_SYSCALL_64
            |                     do_syscall_64
            |                     __x64_sys_recvfrom
            |                     __sys_recvfrom
            |                     sock_recvmsg
            |                     inet6_recvmsg
            |                     udpv6_recvmsg
            |                     __skb_recv_udp
            |                     __skb_wait_for_more_packets
            |                     schedule_timeout
            |                     schedule
            |                     __schedule
            |                     |          
            |                      --38.59%--pick_next_task_fair
            |                                |          
            |                                 --38.59%--swqueue_remove_task
            |                                           |          
            |                                            --38.59%--_raw_spin_lock_irqsave
            |                                                      |          
            |                                                       --38.58%--native_queued_spin_lock_slowpath
            |          
            |--2.25%--sendto
            |          entry_SYSCALL_64
            |          do_syscall_64
            |          |          
            |           --2.25%--__x64_sys_sendto
            |                     __sys_sendto
            |                     sock_sendmsg
            |                     inet6_sendmsg
            |                     udpv6_sendmsg
            |                     udp_v6_send_skb
            |                     ip6_send_skb
            |                     ip6_local_out
            |                     ip6_output
            |                     ip6_finish_output
            |                     ip6_finish_output2
            |                     __dev_queue_xmit
            |                     __local_bh_enable_ip
            |                     do_softirq.part.0
            |                     __do_softirq
            |                     net_rx_action
            |                     __napi_poll
            |                     process_backlog
            |                     __netif_receive_skb
            |                     __netif_receive_skb_one_core
            |                     ipv6_rcv
            |                     ip6_input
            |                     ip6_input_finish
            |                     ip6_protocol_deliver_rcu
            |                     udpv6_rcv
            |                     __udp6_lib_rcv
            |                     udp6_unicast_rcv_skb
            |                     udpv6_queue_rcv_skb
            |                     udpv6_queue_rcv_one_skb
            |                     __udp_enqueue_schedule_skb
            |                     sock_def_readable
            |                     __wake_up_sync_key
            |                     __wake_up_common_lock
            |                     |          
            |                      --2.25%--__wake_up_common
            |                                receiver_wake_function
            |                                autoremove_wake_function
            |                                default_wake_function
            |                                try_to_wake_up
            |                                ttwu_do_activate
            |                                enqueue_task
            |                                enqueue_task_fair
            |                                _raw_spin_lock_irqsave
            |                                |          
            |                                 --2.25%--native_queued_spin_lock_slowpath
            |          
             --2.10%--recvfrom
                       entry_SYSCALL_64
                       do_syscall_64
                       __x64_sys_recvfrom
                       __sys_recvfrom
                       sock_recvmsg
                       inet6_recvmsg
                       udpv6_recvmsg
                       __skb_recv_udp
                       __skb_wait_for_more_packets
                       schedule_timeout
                       schedule
                       __schedule
                       |          
                        --2.09%--pick_next_task_fair
                                  swqueue_remove_task
                                  _raw_spin_lock_irqsave
                                  |          
                                   --2.09%--native_queued_spin_lock_slowpath
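
To make the split in the profile above easier to see, the four top-level
branches can be grouped by the frame sitting directly above
_raw_spin_lock_irqsave. This is a sketch only: the percentages are copied
by hand from the call graph, and the BRANCHES list and total_by_caller
helper are illustrative names, not output of any perf tooling.

```python
from collections import defaultdict

# (syscall entry, frame directly above _raw_spin_lock_irqsave, percent),
# copied by hand from the four top-level branches of the call graph.
BRANCHES = [
    ("sendto", "enqueue_task_fair", 40.23),
    ("recvfrom", "swqueue_remove_task", 38.59),
    ("sendto", "enqueue_task_fair", 2.25),
    ("recvfrom", "swqueue_remove_task", 2.10),
]


def total_by_caller(branches):
    """Sum the branch percentages per lock-taking caller."""
    totals = defaultdict(float)
    for _syscall, caller, percent in branches:
        totals[caller] += percent
    return dict(totals)


totals = total_by_caller(BRANCHES)
for caller, percent in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{caller:<22} {percent:6.2f}%")
print(f"{'all four branches':<22} {sum(totals.values()):6.2f}%")
```

Under this grouping the wakeup side (enqueue_task_fair) and the
pick side (swqueue_remove_task) each take roughly half of the time spent
in native_queued_spin_lock_slowpath, and together they cover nearly all of
the 83.23% reported for that symbol.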

During the test, vmstat showed lines like this:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free     buff  cache     si   so    bi    bo   in     cs   us sy id wa st
185 0      0 128555784  47348 1791244    0    0     0     0  32387 4380484  1 98  0  0  0

With swqueue disabled, vmstat showed lines like this, and there was no
lock contention:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free     buff  cache     si   so    bi    bo   in     cs   us sy id wa st
116 0      0 128977136  58324 1338760    0    0     0    16 599677 9734395  5 72 23  0  0

There are a lot more runnable tasks when swqueue is in use, and sys% is
also higher, perhaps due to the lock contention.
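
For a side-by-side view of the two vmstat samples above, a small sketch
that maps each value to its column name may help. The column list is taken
from the vmstat header shown above; the helper parse_vmstat_line and the
variable names are illustrative assumptions, and a single sample per
configuration is of course not a measurement.

```python
# Column names in the order of the vmstat header shown above.
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()


def parse_vmstat_line(line: str) -> dict:
    """Map one vmstat data line to {column name: integer value}."""
    values = [int(v) for v in line.split()]
    if len(values) != len(VMSTAT_COLUMNS):
        raise ValueError(f"expected {len(VMSTAT_COLUMNS)} fields, got {len(values)}")
    return dict(zip(VMSTAT_COLUMNS, values))


# The two sample lines quoted above.
with_swqueue = parse_vmstat_line(
    "185 0 0 128555784 47348 1791244 0 0 0 0 32387 4380484 1 98 0 0 0"
)
without_swqueue = parse_vmstat_line(
    "116 0 0 128977136 58324 1338760 0 0 0 16 599677 9734395 5 72 23 0 0"
)

# r: runnable tasks, in: interrupts/s, cs: context switches/s,
# us/sy/id: user, system and idle CPU time in percent.
print(f"{'col':<4} {'SWQUEUE':>10} {'NO_SWQUEUE':>12}")
for col in ("r", "in", "cs", "us", "sy", "id"):
    print(f"{col:<4} {with_swqueue[col]:>10} {without_swqueue[col]:>12}")

cs_ratio = without_swqueue["cs"] / with_swqueue["cs"]
print(f"context switches, NO_SWQUEUE / SWQUEUE: {cs_ratio:.2f}x")
```

In these two samples the context switch rate is roughly 2.2x higher with
swqueue disabled, while the run with swqueue enabled shows no idle time at
all.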

I'll see if I can find a smaller machine and give it a run there too.
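
For rerunning the test on another machine, the shell loop used above
(128 netperf clients started in the background) could be wrapped so that
it also waits for every client to finish. The following is a sketch under
stated assumptions: the helper names netperf_argv and run_concurrently are
illustrative, netserver is assumed to be running already, and the dry run
uses a harmless command instead of netperf.

```python
import subprocess
import sys


def netperf_argv(duration_s=60, ncpus=128, test="UDP_RR"):
    """Build an argv equivalent to the netperf command line used above."""
    return ["netperf", "-l", str(duration_s), "-n", str(ncpus), "-6", "-t", test]


def run_concurrently(argv, instances):
    """Start `instances` copies of argv at once, wait for all of them,
    and return their exit codes in start order."""
    procs = [
        subprocess.Popen(argv, stdout=subprocess.DEVNULL)
        for _ in range(instances)
    ]
    return [p.wait() for p in procs]


# Dry run with a harmless command. On a test machine with netserver
# running, pass netperf_argv() and instances=128 instead.
exit_codes = run_concurrently([sys.executable, "-c", "pass"], instances=4)
print(exit_codes)
```

Starting all clients before waiting on any of them keeps them running
concurrently, which is what differs from running a single netperf client.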

Thanks,
Aaron

[-- Attachment #2: config-6.4.0-rc5-00237-g06b8769b15ae.gz --]
[-- Type: application/gzip, Size: 67409 bytes --]
