* UDP regression with packets rates < 10k per sec
@ 2009-09-08 22:38 Christoph Lameter
  2009-09-08 22:52 ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Lameter @ 2009-09-08 22:38 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev

Looks like we have a regression since 2.6.22 due to latency increases in
the network stack? The following is the result of measuring latencies for
UDP multicast traffic at packet rates of 10pps, 100pps, 1kpps, 10kpps and
100kpps, with two systems running "mcast -n1 -r<rate>" (mcast tool from
http://gentwo.org/ll).

Measurements in microseconds for one hop using bnx2 on a Dell R610 (64 bit
2.6.31-rc9) and a Dell 1950 (32 bit 2.6.22.19, 3.3GHz). Dell R610 RX usecs
tuned to 0; the 32 bit machine tuned to 1 (the NIC is flaky at 0).

Kernel               10pps  100pps  1kpps  10kpps  100kpps
---------------------------------------------------------------
2.6.22 (32bit)        30     29.5    29     30      41
2.6.31-rc9 (64bit)    64     63      46     30      40

The only minor improvement is at a rate of 100kpps. All rates lower than
10k regress significantly.

Could there be something wrong with the bnx2 interrupt routing? They all
end up on cpu0 here. There are 8 of them in a system with 16 "processors".
How do those need to be configured? There are some sparse comments in
Documentation/networking/multiqueue.txt, but the text does not say
anything about the irq routing.

^ permalink raw reply	[flat|nested] 22+ messages in thread
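A sketch of the irq pinning in question, for reference. The irq numbers
below are made up -- look up the real bnx2 vectors in /proc/interrupts --
and each smp_affinity value is a hex cpu bitmask:

# grep eth0 /proc/interrupts           # say the 8 vectors are irqs 58-65
# echo 1 > /proc/irq/58/smp_affinity   # vector 0 -> cpu0
# echo 2 > /proc/irq/59/smp_affinity   # vector 1 -> cpu1
# echo 4 > /proc/irq/60/smp_affinity   # vector 2 -> cpu2

...and so on for the remaining vectors. A running irqbalance daemon will
rewrite these masks, so stop it first.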
* Re: UDP regression with packets rates < 10k per sec
  2009-09-08 22:38 UDP regression with packets rates < 10k per sec Christoph Lameter
@ 2009-09-08 22:52 ` Eric Dumazet
  2009-09-09 14:01   ` Christoph Lameter
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2009-09-08 22:52 UTC (permalink / raw)
To: Christoph Lameter; +Cc: netdev

Christoph Lameter a écrit :
> Looks like we have a regression since 2.6.22 due to latency increases in
> the network stack? The following is the result of measuring latencies for
> UDP multicast traffic at packet rates of 10pps, 100pps, 1kpps, 10kpps and
> 100kpps, with two systems running "mcast -n1 -r<rate>" (mcast tool from
> http://gentwo.org/ll).
>
> Measurements in microseconds for one hop using bnx2 on a Dell R610 (64 bit
> 2.6.31-rc9) and a Dell 1950 (32 bit 2.6.22.19, 3.3GHz). Dell R610 RX usecs
> tuned to 0; the 32 bit machine tuned to 1 (the NIC is flaky at 0).
>
> Kernel               10pps  100pps  1kpps  10kpps  100kpps
> ---------------------------------------------------------------
> 2.6.22 (32bit)        30     29.5    29     30      41
> 2.6.31-rc9 (64bit)    64     63      46     30      40
>
> The only minor improvement is at a rate of 100kpps. All rates lower than
> 10k regress significantly.
>
> Could there be something wrong with the bnx2 interrupt routing? They all
> end up on cpu0 here. There are 8 of them in a system with 16 "processors".
> How do those need to be configured? There are some sparse comments in
> Documentation/networking/multiqueue.txt, but the text does not say
> anything about the irq routing.

Hi Christoph

In order to reproduce this here, could you tell me if you use

Producer linux-2.6.22 -> Receiver 2.6.22
Producer linux-2.6.31 -> Receiver 2.6.31

Or a mix of

Producer linux-2.6.31 -> Receiver 2.6.22
Producer linux-2.6.22 -> Receiver 2.6.31-rc9

It is not clear what your exact setup is.

Thanks

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: UDP regression with packets rates < 10k per sec 2009-09-08 22:52 ` Eric Dumazet @ 2009-09-09 14:01 ` Christoph Lameter 2009-09-09 15:09 ` Eric Dumazet 0 siblings, 1 reply; 22+ messages in thread From: Christoph Lameter @ 2009-09-09 14:01 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev On Wed, 9 Sep 2009, Eric Dumazet wrote: > In order to reproduce this here, could you tell me if you use > > Producer linux-2.6.22 -> Receiver 2.6.22 > Producer linux-2.6.31 -> Receiver 2.6.31 I use the above setup. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: UDP regression with packets rates < 10k per sec
  2009-09-09 14:01 ` Christoph Lameter
@ 2009-09-09 15:09   ` Eric Dumazet
  2009-09-09 16:47     ` Christoph Lameter
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2009-09-09 15:09 UTC (permalink / raw)
To: Christoph Lameter; +Cc: netdev

Christoph Lameter a écrit :
> On Wed, 9 Sep 2009, Eric Dumazet wrote:
>
>> In order to reproduce this here, could you tell me if you use
>>
>> Producer linux-2.6.22 -> Receiver 2.6.22
>> Producer linux-2.6.31 -> Receiver 2.6.31
>
> I use the above setup.

Then frames are sent on wire but not received
(they are received via mc loop, internal stack magic)

# ./mcast -L -n1 -r 10000
WARNING: Multiple active ethernet devices. Using local address 192.168.0.1
Receiver: Listening to control channel 239.0.192.1
Receiver: Subscribing to 1 MC addresses 239.0.192-254.2-254 offset 0 origin 192.168.0.1
Sender: Sending 10000 msgs/ch/sec on 1 channels. Probe interval=0.001-1 sec.

TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us  StdDv
  100000      0      0      0   10000  3000.0    7.84    8.89   10.51   0.66

# uname -a
Linux erd 2.6.30.5 #2 SMP Mon Sep 7 17:15:43 CEST 2009 i686 i686 i386 GNU/Linux

I tried an old kernel on same hardware:

# ./mcast -L -n1 -r 10000
WARNING: Multiple active ethernet devices. Using local address 55.225.18.6
Receiver: Listening to control channel 239.0.192.1
Receiver: Subscribing to 1 MC addresses 239.0.192-254.2-254 offset 0 origin 55.225.18.6
Sender: Sending 10000 msgs/ch/sec on 1 channels. Probe interval=0.001-1 sec.

TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us  StdDv
   99999      0      0      0    9998     0.0    9.00    9.95   14.50   1.56

Linux erd 2.6.9-55.ELsmp #1 SMP Fri Apr 20 17:03:35 EDT 2007 i686 i686 i386 GNU/Linux

So my numbers seem much better than yours...

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: UDP regression with packets rates < 10k per sec
  2009-09-09 15:09 ` Eric Dumazet
@ 2009-09-09 16:47   ` Christoph Lameter
  2009-09-09 17:06     ` Eric Dumazet
  2009-09-09 17:08     ` Eric Dumazet
  0 siblings, 2 replies; 22+ messages in thread
From: Christoph Lameter @ 2009-09-09 16:47 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev

On Wed, 9 Sep 2009, Eric Dumazet wrote:

> Christoph Lameter a écrit :
> > On Wed, 9 Sep 2009, Eric Dumazet wrote:
> >
> >> In order to reproduce this here, could you tell me if you use
> >>
> >> Producer linux-2.6.22 -> Receiver 2.6.22
> >> Producer linux-2.6.31 -> Receiver 2.6.31
> >
> > I use the above setup.
>
> Then frames are sent on wire but not received
>
> (they are received via mc loop, internal stack magic)

We are talking about two machines running 2.6.22 or 2.6.31. There is no
magic mc loop between the two machines. -L was not used.

> # ./mcast -L -n1 -r 10000
> WARNING: Multiple active ethernet devices. Using local address 192.168.0.1
> Receiver: Listening to control channel 239.0.192.1
> Receiver: Subscribing to 1 MC addresses 239.0.192-254.2-254 offset 0 origin 192.168.0.1
> Sender: Sending 10000 msgs/ch/sec on 1 channels. Probe interval=0.001-1 sec.
>
> TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us  StdDv
>   100000      0      0      0   10000  3000.0    7.84    8.89   10.51   0.66

These are loopback latencies... Don't use -L.

> # uname -a
> Linux erd 2.6.30.5 #2 SMP Mon Sep 7 17:15:43 CEST 2009 i686 i686 i386 GNU/Linux
>
> I tried an old kernel on same hardware:
>
> # ./mcast -L -n1 -r 10000
> WARNING: Multiple active ethernet devices. Using local address 55.225.18.6
> Receiver: Listening to control channel 239.0.192.1
> Receiver: Subscribing to 1 MC addresses 239.0.192-254.2-254 offset 0 origin 55.225.18.6
> Sender: Sending 10000 msgs/ch/sec on 1 channels. Probe interval=0.001-1 sec.
>
> TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us  StdDv
>    99999      0      0      0    9998     0.0    9.00    9.95   14.50   1.56
>
> Linux erd 2.6.9-55.ELsmp #1 SMP Fri Apr 20 17:03:35 EDT 2007 i686 i686 i386 GNU/Linux
>
> So my numbers seem much better than yours...

These are loopback numbers. Why is 2.6.9 so high? The regression shows at
less than 10k.

Could you use real NICs? With multiple TX queues and all the other cool
stuff? And run at lower packet rates?

My loopback numbers also show the same trends.

2.6.22:

mcast -Ln1
TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us  StdDv
     101      0      0      0      10     3.0    5.47    5.74    7.00   0.43

mcast -Ln1 -r10000
TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us  StdDv
  100000      0      0      0   10000  3000.0    5.97    6.11    6.40   0.13

2.6.31-rc9:

mcast -Ln1
TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us  StdDv
     100      0      0      0      10     3.0   13.26   13.45   13.56   0.09

mcast -Ln1 -r10000
TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us  StdDv
  100000      0      0      0   10000  3000.0    5.70    5.82    5.91   0.07

So 2.6.22 is better at 10 msgs per second. 2.6.31 is slightly better at
10k.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: UDP regression with packets rates < 10k per sec
  2009-09-09 16:47 ` Christoph Lameter
@ 2009-09-09 17:06   ` Eric Dumazet
  2009-09-09 17:55     ` Christoph Lameter
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2009-09-09 17:06 UTC (permalink / raw)
To: Christoph Lameter; +Cc: netdev

Christoph Lameter a écrit :
> On Wed, 9 Sep 2009, Eric Dumazet wrote:
>
>> Christoph Lameter a écrit :
>>> On Wed, 9 Sep 2009, Eric Dumazet wrote:
>>>
>>>> In order to reproduce this here, could you tell me if you use
>>>>
>>>> Producer linux-2.6.22 -> Receiver 2.6.22
>>>> Producer linux-2.6.31 -> Receiver 2.6.31
>>> I use the above setup.
>> Then frames are sent on wire but not received
>>
>> (they are received via mc loop, internal stack magic)
>
> We are talking about two machines running 2.6.22 or 2.6.31. There is no
> magic mc loop between the two machines. -L was not used.
>
>> # ./mcast -L -n1 -r 10000
>> WARNING: Multiple active ethernet devices. Using local address 192.168.0.1
>> Receiver: Listening to control channel 239.0.192.1
>> Receiver: Subscribing to 1 MC addresses 239.0.192-254.2-254 offset 0 origin 192.168.0.1
>> Sender: Sending 10000 msgs/ch/sec on 1 channels. Probe interval=0.001-1 sec.
>>
>> TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us  StdDv
>>   100000      0      0      0   10000  3000.0    7.84    8.89   10.51   0.66
>
> These are loopback latencies... Don't use -L.
>
>> # uname -a
>> Linux erd 2.6.30.5 #2 SMP Mon Sep 7 17:15:43 CEST 2009 i686 i686 i386 GNU/Linux
>>
>> I tried an old kernel on same hardware:
>>
>> # ./mcast -L -n1 -r 10000
>> WARNING: Multiple active ethernet devices. Using local address 55.225.18.6
>> Receiver: Listening to control channel 239.0.192.1
>> Receiver: Subscribing to 1 MC addresses 239.0.192-254.2-254 offset 0 origin 55.225.18.6
>> Sender: Sending 10000 msgs/ch/sec on 1 channels. Probe interval=0.001-1 sec.
>>
>> TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us  StdDv
>>    99999      0      0      0    9998     0.0    9.00    9.95   14.50   1.56
>>
>> Linux erd 2.6.9-55.ELsmp #1 SMP Fri Apr 20 17:03:35 EDT 2007 i686 i686 i386 GNU/Linux
>>
>> So my numbers seem much better than yours...
>
> These are loopback numbers. Why is 2.6.9 so high? The regression shows at
> less than 10k.
>
> Could you use real NICs? With multiple TX queues and all the other cool
> stuff? And run at lower packet rates?

Well, on your mono-flow test, multiqueue won't help anyway.

> My loopback numbers also show the same trends.
>
> 2.6.22:
>
> mcast -Ln1
> TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us  StdDv
>      101      0      0      0      10     3.0    5.47    5.74    7.00   0.43
>
> mcast -Ln1 -r10000
> TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us  StdDv
>   100000      0      0      0   10000  3000.0    5.97    6.11    6.40   0.13
>
> 2.6.31-rc9:
>
> mcast -Ln1
> TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us  StdDv
>      100      0      0      0      10     3.0   13.26   13.45   13.56   0.09
>
> mcast -Ln1 -r10000
> TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us  StdDv
>   100000      0      0      0   10000  3000.0    5.70    5.82    5.91   0.07
>
> So 2.6.22 is better at 10 msgs per second. 2.6.31 is slightly better at
> 10k.

Unfortunately there is too much noise on this test:

# uname -a
Linux nag001 2.6.26-2-amd64 #1 SMP Fri Mar 27 04:02:59 UTC 2009 x86_64 GNU/Linux

# ./mcast -Ln1 -C
WARNING: Multiple active ethernet devices. Using local address 55.225.18.110
Receiver: Listening to control channel 239.0.192.1
Receiver: Subscribing to 1 MC addresses 239.0.192-254.2-254 offset 0 origin 55.225.18.110
Sender: Sending 10 msgs/ch/sec on 1 channels. Probe interval=0.001-1 sec.

TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us  StdDv
     100      0      0      0      10     0.0    8.78   10.56   18.20   2.62
     100      0      0      0      10     3.0    8.60    9.33    9.88   0.45
     101      0      0      0      10     3.0    8.42    9.47   11.53   0.85
     100      0      0      0      10     3.0    8.37   10.19   14.85   1.74
     101      0      0      0      10     3.0    8.86   11.23   18.00   3.11
     100      0      0      0      10     3.0    8.43    9.72   11.48   0.82
     101      0      0      0      10     3.0    9.02   29.78  210.25  60.16
     100      0      0      0      10     3.0    8.47   10.27   11.69   0.99
     100      0      0      0      10     3.0    9.21   10.16   12.15   0.84
     101      0      0      0      10     3.0    8.46   10.65   17.88   2.64
     101      0      0      0      10     3.0    8.65   10.00   11.32   0.86
     100      0      0      0      10     3.0    8.98    9.71   10.77   0.58
     100      0      0      0      10     3.0    8.38   11.73   19.06   3.74

Also I believe you hit a scheduler artefact at very low rate, since there
are two tasks that are considered as coupled. Check commit
6f3d09291b4982991680b61763b2541e53e2a95f:

    sched, net: socket wakeups are sync

    'sync' wakeups are a hint towards the scheduler that (certain)
    networking related wakeups likely create coupling between tasks.

    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/net/core/sock.c b/net/core/sock.c
index 09cb3a7..2654c14 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1621,7 +1621,7 @@ static void sock_def_readable(struct sock *sk, int len)
 {
 	read_lock(&sk->sk_callback_lock);
 	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
-		wake_up_interruptible(sk->sk_sleep);
+		wake_up_interruptible_sync(sk->sk_sleep);
 	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
 	read_unlock(&sk->sk_callback_lock);
 }
@@ -1635,7 +1635,7 @@ static void sock_def_write_space(struct sock *sk)
 	 */
 	if ((atomic_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf) {
 		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
-			wake_up_interruptible(sk->sk_sleep);
+			wake_up_interruptible_sync(sk->sk_sleep);

 		/* Should agree with poll, otherwise some programs break */
 		if (sock_writeable(sk))

^ permalink raw reply related	[flat|nested] 22+ messages in thread
* Re: UDP regression with packets rates < 10k per sec
  2009-09-09 17:06 ` Eric Dumazet
@ 2009-09-09 17:55   ` Christoph Lameter
  2009-09-10 20:37     ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Lameter @ 2009-09-09 17:55 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev

On Wed, 9 Sep 2009, Eric Dumazet wrote:

> > My loopback numbers also show the same trends.
> >
> > 2.6.22:
> >
> > mcast -Ln1
> > TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us  StdDv
> >      101      0      0      0      10     3.0    5.47    5.74    7.00   0.43
> >
> > 2.6.31-rc9:
> >
> > mcast -Ln1
> > TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us  StdDv
> >      100      0      0      0      10     3.0   13.26   13.45   13.56   0.09
> >
> > So 2.6.22 is better at 10 msgs per second. 2.6.31 is slightly better at
> > 10k.
>
> Linux nag001 2.6.26-2-amd64 #1 SMP Fri Mar 27 04:02:59 UTC 2009 x86_64 GNU/Linux
>
> TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us  StdDv
>      101      0      0      0      10     3.0    8.42    9.47   11.53   0.85
>
> Also I believe you hit a scheduler artefact at very low rate,
> since there are two tasks that are considered as coupled.
>
> Check commit 6f3d09291b4982991680b61763b2541e53e2a95f

Reverting that patch yields:

clameter@rd-strategy3-deb64:~$ ./mcast -Ln1
TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us  StdDv
     101      0      0      0      10     3.0   27.21   31.28   52.68   8.16

Not good.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: UDP regression with packets rates < 10k per sec
  2009-09-09 17:55 ` Christoph Lameter
@ 2009-09-10 20:37   ` Eric Dumazet
  2009-09-10 21:36     ` Christoph Lameter
  2009-09-10 21:37     ` Eric Dumazet
  2009-09-10 21:39     ` Christoph Lameter
  0 siblings, 3 replies; 22+ messages in thread
From: Eric Dumazet @ 2009-09-10 20:37 UTC (permalink / raw)
To: Christoph Lameter; +Cc: netdev

Something must be wrong with the program or whatever...

On the receiver I did this to trace the latency messages only:

# tcpdump -i eth0 not host 239.0.192.2 and port 9002 -n

22:28:07.223300 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
22:28:07.223403 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32

22:28:08.223301 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
22:28:08.223416 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32

22:28:09.223380 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
22:28:09.223494 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32

22:28:10.223481 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
22:28:10.223597 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32

22:28:11.223581 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
22:28:11.223678 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32

See how the answer is *very* slow? Something like > 100 us?

^ permalink raw reply	[flat|nested] 22+ messages in thread
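(Working one pair out from the trace above: 22:28:07.223403 -
22:28:07.223300 = 103 us between request and reply.)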
* Re: UDP regression with packets rates < 10k per sec
  2009-09-10 20:37 ` Eric Dumazet
@ 2009-09-10 21:36   ` Christoph Lameter
  2009-09-10 21:37   ` Eric Dumazet
  2009-09-10 21:39   ` Christoph Lameter
  2 siblings, 0 replies; 22+ messages in thread
From: Christoph Lameter @ 2009-09-10 21:36 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev

On Thu, 10 Sep 2009, Eric Dumazet wrote:

> On the receiver I did this to trace the latency messages only
>
> # tcpdump -i eth0 not host 239.0.192.2 and port 9002 -n

You do not show the multicast traffic. The latency timestamp requests are
sent unicast and at larger intervals, about one per second per mc channel.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: UDP regression with packets rates < 10k per sec
  2009-09-10 20:37 ` Eric Dumazet
  2009-09-10 21:36   ` Christoph Lameter
@ 2009-09-10 21:37   ` Eric Dumazet
  2009-09-10 21:42     ` Christoph Lameter
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2009-09-10 21:37 UTC (permalink / raw)
To: Christoph Lameter; +Cc: netdev

Eric Dumazet a écrit :
> Something must be wrong with the program or whatever...
>
> On the receiver I did this to trace the latency messages only:
>
> # tcpdump -i eth0 not host 239.0.192.2 and port 9002 -n
>
> 22:28:07.223300 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
> 22:28:07.223403 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32
>
> 22:28:08.223301 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
> 22:28:08.223416 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32
>
> 22:28:09.223380 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
> 22:28:09.223494 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32
>
> 22:28:10.223481 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
> 22:28:10.223597 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32
>
> 22:28:11.223581 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
> 22:28:11.223678 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32
>
> See how the answer is *very* slow? Something like > 100 us?

Never mind, this is OK after proper irq affinity and tweaks on the
receiver (I should do the same on the other machine as well...):

(echo 1 >/proc/irq/default_smp_affinity) before up-ing the bnx2 NIC

echo performance >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

and idle=mwait (or else hpet and C2/C3 states are used on this machine
and wakeups are slow)

# tcpdump -i eth0 not host 239.0.192.2 and port 9002 -n

23:25:02.137143 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
23:25:02.137212 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32

23:25:03.137204 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
23:25:03.137288 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32

23:25:04.137276 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
23:25:04.137355 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32

I presume the power-saving improvements in recent kernels work against
latency goals...

^ permalink raw reply	[flat|nested] 22+ messages in thread
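(Same arithmetic on the tuned trace: 23:25:02.137212 - 23:25:02.137143 =
69 us from request to reply, versus >100 us before the tuning.)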
* Re: UDP regression with packets rates < 10k per sec 2009-09-10 21:37 ` Eric Dumazet @ 2009-09-10 21:42 ` Christoph Lameter 2009-09-14 21:10 ` Christoph Lameter 0 siblings, 1 reply; 22+ messages in thread From: Christoph Lameter @ 2009-09-10 21:42 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev On Thu, 10 Sep 2009, Eric Dumazet wrote: > and idle=mwait (or hpet and C2/C3 states are used on this machine and > wakeups are slow) We disabled C states here. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: UDP regression with packets rates < 10k per sec 2009-09-10 21:42 ` Christoph Lameter @ 2009-09-14 21:10 ` Christoph Lameter 2009-09-15 5:29 ` Eric Dumazet 0 siblings, 1 reply; 22+ messages in thread From: Christoph Lameter @ 2009-09-14 21:10 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev Where are we on this? Definitely a regression? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: UDP regression with packets rates < 10k per sec
  2009-09-14 21:10 ` Christoph Lameter
@ 2009-09-15  5:29   ` Eric Dumazet
  2009-09-15 14:07     ` Christoph Lameter
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2009-09-15  5:29 UTC (permalink / raw)
To: Christoph Lameter; +Cc: netdev

Christoph Lameter a écrit :
> Where are we on this? Definitely a regression?

?? I tried to reproduce your numbers and failed on my machines.

2.6.31 is actually faster than 2.6.22 on the bench you provided.
Must be specific to the hardware, I guess?

As the text size presumably is bigger in 2.6.31, fetching code into cpu
caches to handle 10 packets per second is what we call a cold path
anyway.

If you want to make it a fast path, you want to make sure the code is
always hot in cpu caches, and find a way to inject packets into the
kernel to keep the path hot.

^ permalink raw reply	[flat|nested] 22+ messages in thread
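One way to read that suggestion is a side process that trickles tiny
datagrams so the socket path stays in cache between real messages. A
minimal sketch; the address, port and 1000 pps rate are illustrative, not
from the thread:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	struct sockaddr_in dst;
	char pad[32] = { 0 };

	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(9999);		/* placeholder port */
	inet_pton(AF_INET, "192.168.0.2", &dst.sin_addr); /* placeholder peer */

	for (;;) {
		/* one tiny datagram per millisecond keeps the udp tx
		 * path and socket structures warm in the cpu caches */
		sendto(fd, pad, sizeof(pad), 0,
		       (struct sockaddr *)&dst, sizeof(dst));
		usleep(1000);
	}
	return 0;
}

Run in both directions between the two hosts, it would also keep the
receive path warm, at the price of some background traffic.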
* Re: UDP regression with packets rates < 10k per sec
  2009-09-15  5:29 ` Eric Dumazet
@ 2009-09-15 14:07   ` Christoph Lameter
  2009-09-15 17:26     ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Lameter @ 2009-09-15 14:07 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev

On Tue, 15 Sep 2009, Eric Dumazet wrote:

> 2.6.31 is actually faster than 2.6.22 on the bench you provided.

Well, at high packet rates, which were not the topic.

> Must be specific to the hardware, I guess?

Huh? Even your loopback numbers did show the regression up to 10k.

> As the text size presumably is bigger in 2.6.31, fetching code into cpu
> caches to handle 10 packets per second is what we call a cold path
> anyway.

Ok, so it's an accepted regression? This is a significant reason not to
use newer kernels for latency critical applications that may have to send
a packet once in a while for notification. The latency is doubled (1G) /
tripled / quadrupled (IB) vs 2.6.22.

> If you want to make it a fast path, you want to make sure the code is
> always hot in cpu caches, and find a way to inject packets into the
> kernel to keep the path hot.

Oh, gosh.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: UDP regression with packets rates < 10k per sec
  2009-09-15 14:07 ` Christoph Lameter
@ 2009-09-15 17:26   ` Eric Dumazet
  2009-09-15 20:25     ` Christoph Lameter
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2009-09-15 17:26 UTC (permalink / raw)
To: Christoph Lameter; +Cc: netdev

Christoph Lameter a écrit :
> On Tue, 15 Sep 2009, Eric Dumazet wrote:
>
>> 2.6.31 is actually faster than 2.6.22 on the bench you provided.
>
> Well, at high packet rates, which were not the topic.
>
>> Must be specific to the hardware, I guess?
>
> Huh? Even your loopback numbers did show the regression up to 10k.
>
>> As the text size presumably is bigger in 2.6.31, fetching code into cpu
>> caches to handle 10 packets per second is what we call a cold path
>> anyway.
>
> Ok, so it's an accepted regression? This is a significant reason not to
> use newer kernels for latency critical applications that may have to send
> a packet once in a while for notification. The latency is doubled (1G) /
> tripled / quadrupled (IB) vs 2.6.22.
>
>> If you want to make it a fast path, you want to make sure the code is
>> always hot in cpu caches, and find a way to inject packets into the
>> kernel to keep the path hot.
>
> Oh, gosh.

It seems there is a lot of confusion on this topic, so I will make a full
recap:

Once I understood that my 2.6.31 kernel had many more features than
2.6.22, I tuned it to:

- Let cpus run at full speed (3GHz instead of 2GHz): before tuning, 2.6.31
  was using the "ondemand" governor and my cpus were running at 2GHz,
  while they were running at 3GHz in my 2.6.22 config

- Don't let cpus enter C2/C3 wait states (idle=mwait)

- Correctly affine cpus to ethX irqs (2.6.22 was running the ethX irq on
  one cpu, while on 2.6.31 irqs were distributed to all online cpus)

Then your mcast test gives the same results at 10pps, 100pps, 1000pps and
10000pps.

When sniffing the receiving side, I can notice:

- Answer to an icmp ping (served by softirq only): 6 us between request
  and reply

- Answer to one 'give timestamp' request from the mcast client: 11 us
  between request and reply, regardless of kernel version (2.6.22 or
  2.6.31)

So there is a 5 us cost to actually wake up a process and let it do the
recvfrom() and sendto() pair, which is quite OK, and this time was not
significantly changed between 2.6.22 and 2.6.31.

Hope this helps

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: UDP regression with packets rates < 10k per sec
  2009-09-15 17:26 ` Eric Dumazet
@ 2009-09-15 20:25   ` Christoph Lameter
  2009-09-15 19:02     ` Eric Dumazet
  2009-09-25 14:19     ` Eric Dumazet
  0 siblings, 2 replies; 22+ messages in thread
From: Christoph Lameter @ 2009-09-15 20:25 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev

On Tue, 15 Sep 2009, Eric Dumazet wrote:

> Once I understood that my 2.6.31 kernel had many more features than
> 2.6.22, I tuned it to:
>
> - Let cpus run at full speed (3GHz instead of 2GHz): before tuning, 2.6.31
>   was using the "ondemand" governor and my cpus were running at 2GHz,
>   while they were running at 3GHz in my 2.6.22 config

My kernel did not have support for any governors compiled in.

> - Don't let cpus enter C2/C3 wait states (idle=mwait)

Ok. Trying idle=mwait.

> - Correctly affine cpus to ethX irqs (2.6.22 was running the ethX irq on
>   one cpu, while on 2.6.31 irqs were distributed to all online cpus)

Interrupts of both 2.6.22 and 2.6.31 go to cpu 0. Does it matter for
loopback?

> Then your mcast test gives the same results at 10pps, 100pps, 1000pps and
> 10000pps.

loopback via mcast -Ln1 -r <rate>

                 10pps  100pps  1000pps  10000pps
2.6.22 (32bit)    7.36    7.28     7.15      7.16
2.6.31 (64bit)    9.28   10.27     9.70      9.79

What a difference. Now the initial latency rampup for 2.6.31 is gone. So
even without governors the kernel does something to increase the
latencies.

We sacrificed 2-3 microseconds per message to kernel features, bloat and
64 bitness?

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: UDP regression with packets rates < 10k per sec
  2009-09-15 20:25 ` Christoph Lameter
@ 2009-09-15 19:02   ` Eric Dumazet
  0 siblings, 0 replies; 22+ messages in thread
From: Eric Dumazet @ 2009-09-15 19:02 UTC (permalink / raw)
To: Christoph Lameter; +Cc: netdev

Christoph Lameter a écrit :
> On Tue, 15 Sep 2009, Eric Dumazet wrote:
>
>> Once I understood that my 2.6.31 kernel had many more features than
>> 2.6.22, I tuned it to:
>>
>> - Let cpus run at full speed (3GHz instead of 2GHz): before tuning, 2.6.31
>>   was using the "ondemand" governor and my cpus were running at 2GHz,
>>   while they were running at 3GHz in my 2.6.22 config
>
> My kernel did not have support for any governors compiled in.
>
>> - Don't let cpus enter C2/C3 wait states (idle=mwait)
>
> Ok. Trying idle=mwait.
>
>> - Correctly affine cpus to ethX irqs (2.6.22 was running the ethX irq on
>>   one cpu, while on 2.6.31 irqs were distributed to all online cpus)
>
> Interrupts of both 2.6.22 and 2.6.31 go to cpu 0. Does it matter for
> loopback?

No of course, loopback triggers the softirq on the local cpu, there is no
special setup to respect.

>> Then your mcast test gives the same results at 10pps, 100pps, 1000pps and
>> 10000pps.
>
> loopback via mcast -Ln1 -r <rate>
>
>                  10pps  100pps  1000pps  10000pps
> 2.6.22 (32bit)    7.36    7.28     7.15      7.16
> 2.6.31 (64bit)    9.28   10.27     9.70      9.79
>
> What a difference. Now the initial latency rampup for 2.6.31 is gone. So
> even without governors the kernel does something to increase the
> latencies.
>
> We sacrificed 2-3 microseconds per message to kernel features, bloat and
> 64 bitness?

Well, I don't know. I mainly use 32bit kernels, but yes, using 64bit has
a cost, since skbs for example are bigger, and sockets are bigger, so we
touch more cache lines per transaction...

You could precisely compute the number of cycles per transaction with the
"perf" tools (only on 2.6.31), between 64bit and 32bit kernels, benching
100000 pps for example and counting the number of perf counter irqs per
second.

^ permalink raw reply	[flat|nested] 22+ messages in thread
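A hedged sketch of such a measurement, assuming perf stat's system-wide
mode (exact event names vary across perf versions):

# perf stat -a -e cycles,instructions -- sleep 10

run on the receiver while mcast pushes 100000 pps at it; dividing the
cycle count by the roughly 1000000 packets handled in those 10 seconds
gives cycles per packet, comparable between the 32bit and 64bit kernels.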
* Re: UDP regression with packets rates < 10k per sec 2009-09-15 20:25 ` Christoph Lameter 2009-09-15 19:02 ` Eric Dumazet @ 2009-09-25 14:19 ` Eric Dumazet 2009-09-26 16:13 ` Christoph Lameter 1 sibling, 1 reply; 22+ messages in thread From: Eric Dumazet @ 2009-09-25 14:19 UTC (permalink / raw) To: Christoph Lameter; +Cc: netdev Christoph Lameter a écrit : > We sacrificed 2 - 3 microseconds per message to kernel features, bloat and > 64 bitness? Update : With my current kernel on receiver (linux-2.6 32bit + some networking patches + SLUB_STATS) mcast -n1 -b eth3 -r 2000 on the sender (2.6.29 unfortunatly, I cannot change it at this moment) I took a look at ftrace(function) to check paths : (tg3 instead of bnx2 on this snapshot, it should not be a big concern) - tg3 interrupts on CPU-0, copybreak disabled. Hardware interrupt ------------------ <idle>-0 [000] 13580.504033: do_IRQ <-common_interrupt <idle>-0 [000] 13580.504034: irq_enter <-do_IRQ <idle>-0 [000] 13580.504034: rcu_irq_enter <-irq_enter <idle>-0 [000] 13580.504034: idle_cpu <-irq_enter <idle>-0 [000] 13580.504034: tick_check_idle <-irq_enter <idle>-0 [000] 13580.504034: tick_check_oneshot_broadcast <-tick_check_idle <idle>-0 [000] 13580.504034: tick_nohz_stop_idle <-tick_check_idle <idle>-0 [000] 13580.504034: ktime_get <-tick_nohz_stop_idle <idle>-0 [000] 13580.504035: ktime_get <-sched_clock_tick <idle>-0 [000] 13580.504035: touch_softlockup_watchdog <-sched_clock_idle_wakeup_event <idle>-0 [000] 13580.504035: ktime_get <-tick_check_idle <idle>-0 [000] 13580.504035: tick_do_update_jiffies64 <-tick_check_idle <idle>-0 [000] 13580.504035: _spin_lock <-tick_do_update_jiffies64 <idle>-0 [000] 13580.504036: do_timer <-tick_do_update_jiffies64 <idle>-0 [000] 13580.504036: update_wall_time <-do_timer <idle>-0 [000] 13580.504036: calc_global_load <-do_timer <idle>-0 [000] 13580.504036: touch_softlockup_watchdog <-tick_check_idle <idle>-0 [000] 13580.504036: handle_irq <-do_IRQ <idle>-0 [000] 13580.504036: irq_to_desc <-handle_irq <idle>-0 [000] 13580.504036: handle_edge_irq <-handle_irq <idle>-0 [000] 13580.504037: _spin_lock <-handle_edge_irq <idle>-0 [000] 13580.504037: ack_apic_edge <-handle_edge_irq <idle>-0 [000] 13580.504037: irq_to_desc <-ack_apic_edge <idle>-0 [000] 13580.504037: irq_complete_move <-ack_apic_edge <idle>-0 [000] 13580.504037: move_native_irq <-ack_apic_edge <idle>-0 [000] 13580.504037: irq_to_desc <-move_native_irq <idle>-0 [000] 13580.504038: native_apic_mem_write <-ack_apic_edge <idle>-0 [000] 13580.504038: handle_IRQ_event <-handle_edge_irq 2) Handled by tg3 <idle>-0 [000] 13580.504038: tg3_msi <-handle_IRQ_event <idle>-0 [000] 13580.504038: tg3_write32 <-tg3_msi <idle>-0 [000] 13580.504038: __napi_schedule <-tg3_msi <idle>-0 [000] 13580.504038: note_interrupt <-handle_edge_irq <idle>-0 [000] 13580.504039: _spin_lock <-handle_edge_irq <idle>-0 [000] 13580.504039: irq_exit <-do_IRQ 3) Softirq <idle>-0 [000] 13580.504039: do_softirq <-irq_exit <idle>-0 [000] 13580.504039: __do_softirq <-do_softirq <idle>-0 [000] 13580.504039: net_rx_action <-__do_softirq <idle>-0 [000] 13580.504039: tg3_poll <-net_rx_action <idle>-0 [000] 13580.504040: tg3_alloc_rx_skb <-tg3_poll <idle>-0 [000] 13580.504040: __netdev_alloc_skb <-tg3_alloc_rx_skb <idle>-0 [000] 13580.504040: __alloc_skb <-__netdev_alloc_skb <idle>-0 [000] 13580.504040: kmem_cache_alloc <-__alloc_skb <idle>-0 [000] 13580.504040: __kmalloc_track_caller <-__alloc_skb <idle>-0 [000] 13580.504040: get_slab <-__kmalloc_track_caller <idle>-0 [000] 13580.504040: 
__slab_alloc <-__kmalloc_track_caller hmm... is it normal we call deactivate_slab() ? (I included at the end of this mail : grep . /sys/kernel/slab/:t-0004096/* ) <idle>-0 [000] 13580.504041: deactivate_slab <-__slab_alloc <idle>-0 [000] 13580.504041: _spin_lock <-__slab_alloc <idle>-0 [000] 13580.504041: nommu_map_page <-tg3_alloc_rx_skb <idle>-0 [000] 13580.504041: skb_put <-tg3_poll <idle>-0 [000] 13580.504041: eth_type_trans <-tg3_poll <idle>-0 [000] 13580.504041: skb_pull <-eth_type_trans <idle>-0 [000] 13580.504042: napi_gro_receive <-tg3_poll <idle>-0 [000] 13580.504042: skb_gro_reset_offset <-napi_gro_receive <idle>-0 [000] 13580.504042: __napi_gro_receive <-napi_gro_receive <idle>-0 [000] 13580.504042: dev_gro_receive <-__napi_gro_receive <idle>-0 [000] 13580.504042: napi_skb_finish <-napi_gro_receive <idle>-0 [000] 13580.504042: netif_receive_skb <-napi_skb_finish <idle>-0 [000] 13580.504043: ip_rcv <-netif_receive_skb firewall (iptables) phase 1 (mangle prerouting) <idle>-0 [000] 13580.504043: nf_hook_slow <-ip_rcv <idle>-0 [000] 13580.504043: nf_iterate <-nf_hook_slow <idle>-0 [000] 13580.504043: ipt_pre_routing_hook <-nf_iterate <idle>-0 [000] 13580.504043: ipt_do_table <-ipt_pre_routing_hook <idle>-0 [000] 13580.504043: local_bh_disable <-ipt_do_table <idle>-0 [000] 13580.504044: _spin_lock <-ipt_do_table <idle>-0 [000] 13580.504044: local_bh_enable <-ipt_do_table <idle>-0 [000] 13580.504044: ip_rcv_finish <-ip_rcv <idle>-0 [000] 13580.504044: ip_route_input <-ip_rcv_finish <idle>-0 [000] 13580.504044: ip_local_deliver <-ip_rcv_finish firewall (iptables) phase 2 (mangle INPUT) <idle>-0 [000] 13580.504045: nf_hook_slow <-ip_local_deliver <idle>-0 [000] 13580.504045: nf_iterate <-nf_hook_slow <idle>-0 [000] 13580.504045: ipt_local_in_hook <-nf_iterate <idle>-0 [000] 13580.504045: ipt_do_table <-ipt_local_in_hook <idle>-0 [000] 13580.504045: local_bh_disable <-ipt_do_table <idle>-0 [000] 13580.504045: _spin_lock <-ipt_do_table <idle>-0 [000] 13580.504045: local_bh_enable <-ipt_do_table firewall (iptables) phase 3 (filter INPUT) <idle>-0 [000] 13580.504046: ipt_local_in_hook <-nf_iterate <idle>-0 [000] 13580.504046: ipt_do_table <-ipt_local_in_hook <idle>-0 [000] 13580.504046: local_bh_disable <-ipt_do_table <idle>-0 [000] 13580.504046: _spin_lock <-ipt_do_table <idle>-0 [000] 13580.504046: local_bh_enable <-ipt_do_table <idle>-0 [000] 13580.504046: ip_local_deliver_finish <-ip_local_deliver <idle>-0 [000] 13580.504046: raw_local_deliver <-ip_local_deliver_finish OK , UDP handler... 
<idle>-0 [000] 13580.504047: udp_rcv <-ip_local_deliver_finish <idle>-0 [000] 13580.504047: __udp4_lib_rcv <-udp_rcv <idle>-0 [000] 13580.504047: _spin_lock <-__udp4_lib_rcv <idle>-0 [000] 13580.504047: ip_mc_sf_allow <-__udp4_lib_rcv <idle>-0 [000] 13580.504047: udp_queue_rcv_skb <-__udp4_lib_rcv <idle>-0 [000] 13580.504047: _spin_lock <-udp_queue_rcv_skb <idle>-0 [000] 13580.504048: __udp_queue_rcv_skb <-udp_queue_rcv_skb <idle>-0 [000] 13580.504048: sock_queue_rcv_skb <-__udp_queue_rcv_skb <idle>-0 [000] 13580.504048: sk_filter <-sock_queue_rcv_skb <idle>-0 [000] 13580.504048: local_bh_disable <-sk_filter <idle>-0 [000] 13580.504048: local_bh_enable <-sk_filter <idle>-0 [000] 13580.504048: skb_queue_tail <-sock_queue_rcv_skb <idle>-0 [000] 13580.504049: _spin_lock_irqsave <-skb_queue_tail <idle>-0 [000] 13580.504049: _spin_unlock_irqrestore <-skb_queue_tail <idle>-0 [000] 13580.504049: sock_def_readable <-sock_queue_rcv_skb <idle>-0 [000] 13580.504049: _read_lock <-sock_def_readable <idle>-0 [000] 13580.504049: __wake_up_sync_key <-sock_def_readable <idle>-0 [000] 13580.504049: _spin_lock_irqsave <-__wake_up_sync_key <idle>-0 [000] 13580.504049: __wake_up_common <-__wake_up_sync_key <idle>-0 [000] 13580.504050: receiver_wake_function <-__wake_up_common <idle>-0 [000] 13580.504050: autoremove_wake_function <-receiver_wake_function <idle>-0 [000] 13580.504050: default_wake_function <-autoremove_wake_function wakeup mcast task <idle>-0 [000] 13580.504050: try_to_wake_up <-default_wake_function <idle>-0 [000] 13580.504050: task_rq_lock <-try_to_wake_up <idle>-0 [000] 13580.504050: _spin_lock <-task_rq_lock <idle>-0 [000] 13580.504051: _spin_unlock_irqrestore <-try_to_wake_up <idle>-0 [000] 13580.504051: select_task_rq_fair <-try_to_wake_up <idle>-0 [000] 13580.504051: task_rq_lock <-try_to_wake_up <idle>-0 [000] 13580.504051: _spin_lock <-task_rq_lock <idle>-0 [000] 13580.504051: activate_task <-try_to_wake_up <idle>-0 [000] 13580.504051: enqueue_task <-activate_task <idle>-0 [000] 13580.504052: enqueue_task_fair <-enqueue_task <idle>-0 [000] 13580.504052: update_curr <-enqueue_task_fair <idle>-0 [000] 13580.504052: place_entity <-enqueue_task_fair <idle>-0 [000] 13580.504052: __enqueue_entity <-enqueue_task_fair <idle>-0 [000] 13580.504052: check_preempt_curr_idle <-try_to_wake_up <idle>-0 [000] 13580.504053: resched_task <-check_preempt_curr_idle <idle>-0 [000] 13580.504053: _spin_unlock_irqrestore <-try_to_wake_up <idle>-0 [000] 13580.504053: _spin_unlock_irqrestore <-__wake_up_sync_key tg3 finished its rx handling <idle>-0 [000] 13580.504053: tg3_write32 <-tg3_poll <idle>-0 [000] 13580.504053: tg3_write32 <-tg3_poll <idle>-0 [000] 13580.504054: napi_complete <-tg3_poll <idle>-0 [000] 13580.504054: napi_gro_flush <-napi_complete <idle>-0 [000] 13580.504054: __napi_complete <-napi_complete <idle>-0 [000] 13580.504054: tg3_write32 <-tg3_poll <idle>-0 [000] 13580.504054: rcu_bh_qs <-__do_softirq <idle>-0 [000] 13580.504054: _local_bh_enable <-__do_softirq <idle>-0 [000] 13580.504055: rcu_irq_exit <-irq_exit <idle>-0 [000] 13580.504055: idle_cpu <-irq_exit <idle>-0 [000] 13580.504055: tick_nohz_restart_sched_tick <-cpu_idle <idle>-0 [000] 13580.504055: tick_nohz_stop_idle <-tick_nohz_restart_sched_tick <idle>-0 [000] 13580.504055: rcu_exit_nohz <-tick_nohz_restart_sched_tick <idle>-0 [000] 13580.504055: select_nohz_load_balancer <-tick_nohz_restart_sched_tick <idle>-0 [000] 13580.504056: ktime_get <-tick_nohz_restart_sched_tick <idle>-0 [000] 13580.504056: tick_do_update_jiffies64 
<-tick_nohz_restart_sched_tick <idle>-0 [000] 13580.504056: account_idle_ticks <-tick_nohz_restart_sched_tick <idle>-0 [000] 13580.504056: account_idle_time <-account_idle_ticks <idle>-0 [000] 13580.504056: touch_softlockup_watchdog <-tick_nohz_restart_sched_tick <idle>-0 [000] 13580.504056: hrtimer_cancel <-tick_nohz_restart_sched_tick <idle>-0 [000] 13580.504056: hrtimer_try_to_cancel <-hrtimer_cancel <idle>-0 [000] 13580.504057: lock_hrtimer_base <-hrtimer_try_to_cancel <idle>-0 [000] 13580.504057: _spin_lock_irqsave <-lock_hrtimer_base <idle>-0 [000] 13580.504057: __remove_hrtimer <-hrtimer_try_to_cancel <idle>-0 [000] 13580.504057: hrtimer_force_reprogram <-__remove_hrtimer <idle>-0 [000] 13580.504057: tick_program_event <-hrtimer_force_reprogram <idle>-0 [000] 13580.504058: tick_dev_program_event <-tick_program_event <idle>-0 [000] 13580.504058: ktime_get <-tick_dev_program_event <idle>-0 [000] 13580.504058: clockevents_program_event <-tick_dev_program_event <idle>-0 [000] 13580.504058: lapic_next_event <-clockevents_program_event <idle>-0 [000] 13580.504058: native_apic_mem_write <-lapic_next_event <idle>-0 [000] 13580.504058: _spin_unlock_irqrestore <-hrtimer_try_to_cancel <idle>-0 [000] 13580.504058: hrtimer_forward <-tick_nohz_restart_sched_tick <idle>-0 [000] 13580.504059: ktime_add_safe <-hrtimer_forward <idle>-0 [000] 13580.504059: ktime_add_safe <-hrtimer_forward <idle>-0 [000] 13580.504059: hrtimer_start_range_ns <-tick_nohz_restart_sched_tick <idle>-0 [000] 13580.504059: __hrtimer_start_range_ns <-hrtimer_start_range_ns <idle>-0 [000] 13580.504059: lock_hrtimer_base <-__hrtimer_start_range_ns <idle>-0 [000] 13580.504059: _spin_lock_irqsave <-lock_hrtimer_base <idle>-0 [000] 13580.504060: ktime_add_safe <-__hrtimer_start_range_ns <idle>-0 [000] 13580.504060: enqueue_hrtimer <-__hrtimer_start_range_ns <idle>-0 [000] 13580.504060: tick_program_event <-__hrtimer_start_range_ns <idle>-0 [000] 13580.504060: tick_dev_program_event <-tick_program_event <idle>-0 [000] 13580.504060: ktime_get <-tick_dev_program_event <idle>-0 [000] 13580.504061: clockevents_program_event <-tick_dev_program_event <idle>-0 [000] 13580.504061: lapic_next_event <-clockevents_program_event <idle>-0 [000] 13580.504061: native_apic_mem_write <-lapic_next_event <idle>-0 [000] 13580.504061: _spin_unlock_irqrestore <-__hrtimer_start_range_ns <idle>-0 [000] 13580.504061: schedule <-cpu_idle <idle>-0 [000] 13580.504061: rcu_sched_qs <-schedule <idle>-0 [000] 13580.504061: _spin_lock_irq <-schedule <idle>-0 [000] 13580.504062: put_prev_task_idle <-schedule <idle>-0 [000] 13580.504062: pick_next_task_fair <-schedule <idle>-0 [000] 13580.504062: set_next_entity <-pick_next_task_fair <idle>-0 [000] 13580.504062: update_stats_wait_end <-set_next_entity <idle>-0 [000] 13580.504062: __dequeue_entity <-set_next_entity <idle>-0 [000] 13580.504062: hrtick_start_fair <-pick_next_task_fair <idle>-0 [000] 13580.504063: perf_event_task_sched_out <-schedule <idle>-0 [000] 13580.504063: __switch_to <-schedule mcast is scheduled, gets the frame mcast-21429 [000] 13580.504063: finish_task_switch <-schedule mcast-21429 [000] 13580.504063: perf_event_task_sched_in <-finish_task_switch mcast-21429 [000] 13580.504063: finish_wait <-__skb_recv_datagram mcast-21429 [000] 13580.504064: _spin_lock_irqsave <-__skb_recv_datagram mcast-21429 [000] 13580.504064: _spin_unlock_irqrestore <-__skb_recv_datagram mcast-21429 [000] 13580.504064: skb_copy_datagram_iovec <-udp_recvmsg mcast-21429 [000] 13580.504064: memcpy_toiovec 
<-skb_copy_datagram_iovec mcast-21429 [000] 13580.504064: copy_to_user <-memcpy_toiovec mcast-21429 [000] 13580.504065: lock_sock_nested <-udp_recvmsg mcast-21429 [000] 13580.504065: _spin_lock_bh <-lock_sock_nested mcast-21429 [000] 13580.504065: local_bh_disable <-_spin_lock_bh mcast-21429 [000] 13580.504065: local_bh_enable <-lock_sock_nested mcast-21429 [000] 13580.504065: skb_free_datagram <-udp_recvmsg mcast-21429 [000] 13580.504065: consume_skb <-skb_free_datagram mcast-21429 [000] 13580.504065: __kfree_skb <-consume_skb mcast-21429 [000] 13580.504066: skb_release_head_state <-__kfree_skb mcast-21429 [000] 13580.504066: dst_release <-skb_release_head_state mcast-21429 [000] 13580.504066: sock_rfree <-skb_release_head_state mcast-21429 [000] 13580.504066: skb_release_data <-__kfree_skb mcast-21429 [000] 13580.504066: kfree <-skb_release_data mcast-21429 [000] 13580.504066: __slab_free <-kfree is it normal we call add_partial() ? mcast-21429 [000] 13580.504067: add_partial <-__slab_free mcast-21429 [000] 13580.504067: _spin_lock <-add_partial mcast-21429 [000] 13580.504067: kmem_cache_free <-__kfree_skb mcast-21429 [000] 13580.504067: __slab_free <-kmem_cache_free mcast-21429 [000] 13580.504067: release_sock <-udp_recvmsg mcast-21429 [000] 13580.504067: _spin_lock_bh <-release_sock mcast-21429 [000] 13580.504067: local_bh_disable <-_spin_lock_bh mcast-21429 [000] 13580.504068: _spin_unlock_bh <-release_sock mcast-21429 [000] 13580.504068: local_bh_enable_ip <-_spin_unlock_bh mcast-21429 [000] 13580.504068: move_addr_to_user <-sys_recvfrom mcast-21429 [000] 13580.504068: copy_to_user <-move_addr_to_user mcast-21429 [000] 13580.504068: fput <-sys_recvfrom mcast-21429 [000] 13580.504069: sys_gettimeofday <-sysenter_do_call mcast-21429 [000] 13580.504069: do_gettimeofday <-sys_gettimeofday mcast-21429 [000] 13580.504069: getnstimeofday <-do_gettimeofday mcast-21429 [000] 13580.504069: copy_to_user <-sys_gettimeofday mcast-21429 [000] 13580.504069: sys_socketcall <-sysenter_do_call mcast-21429 [000] 13580.504070: copy_from_user <-sys_socketcall mcast wants next frame and is going to block mcast-21429 [000] 13580.504070: sys_recvfrom <-sys_socketcall mcast-21429 [000] 13580.504070: sockfd_lookup_light <-sys_recvfrom mcast-21429 [000] 13580.504070: fget_light <-sockfd_lookup_light mcast-21429 [000] 13580.504070: sock_recvmsg <-sys_recvfrom mcast-21429 [000] 13580.504070: sock_common_recvmsg <-sock_recvmsg mcast-21429 [000] 13580.504071: udp_recvmsg <-sock_common_recvmsg mcast-21429 [000] 13580.504071: __skb_recv_datagram <-udp_recvmsg mcast-21429 [000] 13580.504071: _spin_lock_irqsave <-__skb_recv_datagram mcast-21429 [000] 13580.504071: _spin_unlock_irqrestore <-__skb_recv_datagram mcast-21429 [000] 13580.504071: prepare_to_wait_exclusive <-__skb_recv_datagram mcast-21429 [000] 13580.504071: _spin_lock_irqsave <-prepare_to_wait_exclusive mcast-21429 [000] 13580.504071: _spin_unlock_irqrestore <-prepare_to_wait_exclusive mcast-21429 [000] 13580.504072: schedule_timeout <-__skb_recv_datagram mcast-21429 [000] 13580.504072: schedule <-schedule_timeout mcast-21429 [000] 13580.504072: rcu_sched_qs <-schedule mcast-21429 [000] 13580.504072: _spin_lock_irq <-schedule mcast-21429 [000] 13580.504072: deactivate_task <-schedule mcast-21429 [000] 13580.504072: dequeue_task <-deactivate_task mcast-21429 [000] 13580.504073: dequeue_task_fair <-dequeue_task mcast-21429 [000] 13580.504073: update_curr <-dequeue_task_fair mcast-21429 [000] 13580.504073: hrtick_start_fair <-dequeue_task_fair mcast-21429 
[000] 13580.504073: find_busiest_group <-schedule mcast-21429 [000] 13580.504073: scale_rt_power <-find_busiest_group mcast-21429 [000] 13580.504073: sched_avg_update <-scale_rt_power mcast-21429 [000] 13580.504074: msecs_to_jiffies <-schedule mcast-21429 [000] 13580.504074: find_busiest_group <-schedule mcast-21429 [000] 13580.504074: msecs_to_jiffies <-schedule mcast-21429 [000] 13580.504075: put_prev_task_fair <-schedule mcast-21429 [000] 13580.504075: pick_next_task_fair <-schedule mcast-21429 [000] 13580.504075: pick_next_task_rt <-schedule mcast-21429 [000] 13580.504075: pick_next_task_fair <-schedule mcast-21429 [000] 13580.504075: pick_next_task_idle <-schedule mcast-21429 [000] 13580.504075: perf_event_task_sched_out <-schedule mcast-21429 [000] 13580.504076: __switch_to <-schedule <idle>-0 [000] 13580.504076: finish_task_switch <-schedule <idle>-0 [000] 13580.504076: perf_event_task_sched_in <-finish_task_switch <idle>-0 [000] 13580.504076: tick_nohz_stop_sched_tick <-cpu_idle <idle>-0 [000] 13580.504076: ktime_get <-tick_nohz_stop_sched_tick <idle>-0 [000] 13580.504076: get_next_timer_interrupt <-tick_nohz_stop_sched_tick <idle>-0 [000] 13580.504077: _spin_lock <-get_next_timer_interrupt <idle>-0 [000] 13580.504077: hrtimer_get_next_event <-get_next_timer_interrupt <idle>-0 [000] 13580.504077: _spin_lock_irqsave <-hrtimer_get_next_event <idle>-0 [000] 13580.504077: _spin_unlock_irqrestore <-hrtimer_get_next_event <idle>-0 [000] 13580.504077: rcu_needs_cpu <-tick_nohz_stop_sched_tick <idle>-0 [000] 13580.504077: printk_needs_cpu <-tick_nohz_stop_sched_tick <idle>-0 [000] 13580.504077: select_nohz_load_balancer <-tick_nohz_stop_sched_tick <idle>-0 [000] 13580.504078: rcu_enter_nohz <-tick_nohz_stop_sched_tick <idle>-0 [000] 13580.504078: hrtimer_start <-tick_nohz_stop_sched_tick <idle>-0 [000] 13580.504078: __hrtimer_start_range_ns <-hrtimer_start <idle>-0 [000] 13580.504078: lock_hrtimer_base <-__hrtimer_start_range_ns <idle>-0 [000] 13580.504078: _spin_lock_irqsave <-lock_hrtimer_base <idle>-0 [000] 13580.504078: __remove_hrtimer <-__hrtimer_start_range_ns <idle>-0 [000] 13580.504079: hrtimer_force_reprogram <-__remove_hrtimer <idle>-0 [000] 13580.504079: tick_program_event <-hrtimer_force_reprogram <idle>-0 [000] 13580.504079: tick_dev_program_event <-tick_program_event <idle>-0 [000] 13580.504079: ktime_get <-tick_dev_program_event <idle>-0 [000] 13580.504079: clockevents_program_event <-tick_dev_program_event <idle>-0 [000] 13580.504079: lapic_next_event <-clockevents_program_event <idle>-0 [000] 13580.504079: native_apic_mem_write <-lapic_next_event <idle>-0 [000] 13580.504080: ktime_add_safe <-__hrtimer_start_range_ns <idle>-0 [000] 13580.504080: enqueue_hrtimer <-__hrtimer_start_range_ns <idle>-0 [000] 13580.504080: tick_program_event <-__hrtimer_start_range_ns <idle>-0 [000] 13580.504080: tick_dev_program_event <-tick_program_event <idle>-0 [000] 13580.504080: ktime_get <-tick_dev_program_event <idle>-0 [000] 13580.504080: clockevents_program_event <-tick_dev_program_event <idle>-0 [000] 13580.504081: lapic_next_event <-clockevents_program_event <idle>-0 [000] 13580.504081: native_apic_mem_write <-lapic_next_event <idle>-0 [000] 13580.504081: _spin_unlock_irqrestore <-__hrtimer_start_range_ns <idle>-0 [000] 13580.504081: poll_idle <-cpu_idle Next interrupt... 
<idle>-0 [000] 13580.505032: do_IRQ <-common_interrupt <idle>-0 [000] 13580.505032: irq_enter <-do_IRQ <idle>-0 [000] 13580.505032: rcu_irq_enter <-irq_enter <idle>-0 [000] 13580.505032: idle_cpu <-irq_enter Too many slowpaths for 4096 slabs ? $ cd /sys/kernel/slab/:t-0004096 $ grep . * aliases:1 align:8 grep: alloc_calls: Function not implemented alloc_fastpath:416584 C0=234119 C1=52938 C2=18413 C3=4739 C4=49140 C5=14679 C6=39266 C7=3290 alloc_from_partial:459402 C0=459391 C1=8 C2=1 C5=2 alloc_refill:459619 C0=459460 C1=54 C2=1 C3=4 C4=52 C5=31 C6=2 C7=15 alloc_slab:103 C0=45 C1=28 C3=1 C4=26 C5=1 C6=1 C7=1 alloc_slowpath:459628 C0=459462 C1=55 C2=2 C3=5 C4=53 C5=32 C6=3 C7=16 cache_dma:0 cpuslab_flush:0 cpu_slabs:8 deactivate_empty:0 deactivate_full:459498 C0=459436 C1=35 C4=25 C5=2 deactivate_remote_frees:0 deactivate_to_head:0 deactivate_to_tail:0 destroy_by_rcu:0 free_add_partial:459448 C0=459324 C1=9 C2=39 C4=74 C5=2 grep: free_calls: Function not implemented free_fastpath:218431 C0=36679 C1=52679 C2=18408 C3=4708 C4=48837 C5=14602 C6=39252 C7=3266 free_frozen:491 C0=161 C1=95 C2=14 C3=36 C4=77 C5=64 C6=13 C7=31 free_remove_partial:40 C0=40 free_slab:40 C0=40 free_slowpath:657340 C0=656835 C1=119 C2=76 C3=36 C4=159 C5=69 C6=15 C7=31 hwcache_align:0 min_partial:10 objects:499 object_size:4096 objects_partial:35 objs_per_slab:8 order:3 order_fallback:0 partial:5 poison:0 reclaim_account:0 red_zone:0 sanity_checks:0 slabs:63 slab_size:4096 store_user:0 total_objects:504 trace:0 comments : - lots of disable_bh()/enable_bh(), (enable_bh is slow), that could be avoided... - many ktime_get() calls - my HZ=1000 setup might be stupid on a CONFIG_NO_HZ=y kernel :( ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: UDP regression with packets rates < 10k per sec
  2009-09-25 14:19 ` Eric Dumazet
@ 2009-09-26 16:13   ` Christoph Lameter
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Lameter @ 2009-09-26 16:13 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev

On Fri, 25 Sep 2009, Eric Dumazet wrote:

> With my current kernel on receiver (linux-2.6 32bit + some networking
> patches + SLUB_STATS) mcast -n1 -b eth3 -r 2000 on the sender (2.6.29
> unfortunately, I cannot change it at this moment)

My tests are all done using SLAB since I wanted to exclude differences as
much as possible.

> <idle>-0 [000] 13580.504040: __kmalloc_track_caller <-__alloc_skb
> <idle>-0 [000] 13580.504040: get_slab <-__kmalloc_track_caller
> <idle>-0 [000] 13580.504040: __slab_alloc <-__kmalloc_track_caller
>
> hmm... is it normal we call deactivate_slab() ?

deactivate_slab() is called when the slab page we are allocating from
runs out of objects or is not fit for allocation (we want objects from a
different node etc).

> mcast-21429 [000] 13580.504066: sock_rfree <-skb_release_head_state
> mcast-21429 [000] 13580.504066: skb_release_data <-__kfree_skb
> mcast-21429 [000] 13580.504066: kfree <-skb_release_data
> mcast-21429 [000] 13580.504066: __slab_free <-kfree
>
> is it normal we call add_partial() ?

add_partial() is called when we free objects in a slab page that had all
objects allocated before. Then it can be used for allocations again and
must be tracked. Fully allocated slab pages are not tracked.

> Too many slowpaths for 4096 slabs ?
>
> $ cd /sys/kernel/slab/:t-0004096
> $ grep . *
> aliases:1
> align:8
> grep: alloc_calls: Function not implemented
> alloc_fastpath:416584 C0=234119 C1=52938 C2=18413 C3=4739 C4=49140 C5=14679 C6=39266 C7=3290
> alloc_from_partial:459402 C0=459391 C1=8 C2=1 C5=2
> alloc_refill:459619 C0=459460 C1=54 C2=1 C3=4 C4=52 C5=31 C6=2 C7=15
> alloc_slab:103 C0=45 C1=28 C3=1 C4=26 C5=1 C6=1 C7=1
> alloc_slowpath:459628 C0=459462 C1=55 C2=2 C3=5 C4=53 C5=32 C6=3 C7=16

Hmmm. That is a high percentage. All are refills. So there are remote
frees from the other processor to the slab page the first processor
allocates from. One processor allocates the object and then pushes it to
the other for freeing? Bad for caching.

> free_slowpath:657340 C0=656835 C1=119 C2=76 C3=36 C4=159 C5=69 C6=15 C7=31

Also quite high. Consistent with remote freeing of objects allocated on
the first processor. Objects are very short lived.

> comments :
> - lots of disable_bh()/enable_bh(), (enable_bh is slow), that could be avoided...
> - many ktime_get() calls
> - my HZ=1000 setup might be stupid on a CONFIG_NO_HZ=y kernel :(

There are 8 objects per slab (order 3 allocation). You could maybe tune
things a bit by increasing the objects per slab, which may cut down on
the # of deactivate_slab() calls and will also reduce the need for
add_partial(). But I don't see either call causing significant latencies.
Both calls should happen on every 8th or so call of kfree/kmalloc.

To increase the objects per slab to 32: boot with slub_max_order=5 and
then

echo 5 >/sys/slab/kmalloc-4096/order

Would require order 5 allocations. Don't expect it to make too much of a
difference.

^ permalink raw reply	[flat|nested] 22+ messages in thread
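(To check the change took effect, the sysfs files Eric already dumped can
be re-read -- assuming the same :t-0004096 alias; 32 objects per order-5
slab is the outcome Christoph describes:

# cat /sys/kernel/slab/:t-0004096/order
5
# cat /sys/kernel/slab/:t-0004096/objs_per_slab
32
)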
* Re: UDP regression with packets rates < 10k per sec 2009-09-10 20:37 ` Eric Dumazet 2009-09-10 21:36 ` Christoph Lameter 2009-09-10 21:37 ` Eric Dumazet @ 2009-09-10 21:39 ` Christoph Lameter 2 siblings, 0 replies; 22+ messages in thread From: Christoph Lameter @ 2009-09-10 21:39 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev On Thu, 10 Sep 2009, Eric Dumazet wrote: > 22:28:11.223581 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32 > 22:28:11.223678 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32 > > See how the answer is *very* slow ? Something like > 100 us ? If you run tcpdump then you create some overhead I guess. 100 usec should show up as latencies > 50usec in the report. The latency request includes the timestamp from the source. So the destination can take another timestamp at receive time. The sending back of the reply to the timestamp request is not critical at all. The receiver will just register the time differential established by the sender. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: UDP regression with packets rates < 10k per sec
  2009-09-09 16:47 ` Christoph Lameter
  2009-09-09 17:06   ` Eric Dumazet
@ 2009-09-09 17:08   ` Eric Dumazet
  2009-09-09 17:26     ` Christoph Lameter
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2009-09-09 17:08 UTC (permalink / raw)
To: Christoph Lameter; +Cc: netdev

Christoph Lameter a écrit :
> On Wed, 9 Sep 2009, Eric Dumazet wrote:
>
>> Christoph Lameter a écrit :
>>> On Wed, 9 Sep 2009, Eric Dumazet wrote:
>>>
>>>> In order to reproduce this here, could you tell me if you use
>>>>
>>>> Producer linux-2.6.22 -> Receiver 2.6.22
>>>> Producer linux-2.6.31 -> Receiver 2.6.31
>>> I use the above setup.
>> Then frames are sent on wire but not received
>>
>> (they are received via mc loop, internal stack magic)
>
> We are talking about two machines running 2.6.22 or 2.6.31. There is no
> magic mc loop between the two machines. -L was not used.

I have no idea how you run the test then, and how the time deltas are
computed.

Are your clocks synchronized to less than one us? How is it done?

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: UDP regression with packets rates < 10k per sec
  2009-09-09 17:08 ` Eric Dumazet
@ 2009-09-09 17:26   ` Christoph Lameter
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Lameter @ 2009-09-09 17:26 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev

On Wed, 9 Sep 2009, Eric Dumazet wrote:

> I have no idea how you run the test then, and how the time deltas are
> computed.
>
> Are your clocks synchronized to less than one us? How is it done?

Packets are sent one way and include a timestamp. Once in a while (probe
interval) a packet is sent back to the sender (with the timestamp the
sender took earlier). The sender takes a second timestamp and calculates
the one-hop latency by dividing the round trip by two. I have seen this
approach in multiple middleware test programs.

Clocks are typically not synchronized enough to actually compare
timestamps from two systems.

^ permalink raw reply	[flat|nested] 22+ messages in thread
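A sketch of that probe arithmetic in C, with made-up names -- since the
sender compares two of its own timestamps, the receiver's clock never
enters the calculation:

#include <stdint.h>
#include <sys/time.h>

static uint64_t now_us(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return (uint64_t)tv.tv_sec * 1000000 + tv.tv_usec;
}

/*
 * On the sender, when a probe reply arrives carrying back the
 * timestamp that was stamped into the outgoing packet earlier:
 * one hop is approximated as half the measured round trip.
 */
double one_hop_us(uint64_t stamp_echoed_by_receiver)
{
	return (now_us() - stamp_echoed_by_receiver) / 2.0;
}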