* UDP regression with packets rates < 10k per sec
@ 2009-09-08 22:38 Christoph Lameter
  2009-09-08 22:52 ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Lameter @ 2009-09-08 22:38 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev

Looks like we have a regression since 2.6.22 due to latency increases in
the network stack? The following is the result of measuring latencies for
UDP multicast traffic at packet rates of 10pps, 100pps, 1kpps, 10kpps and
100kpps. Two systems running "mcast -n1 -r<rate>" (mcast tool from
http://gentwo.org/ll).

Measurements in microseconds for one hop using bnx2 on Dell R610 (64 bit
2.6.31-rc9) and Dell 1950 (32 bit 2.6.22.19, 3.3GHz). Dell R610 RX usecs
tuned to 0. 32 bit tuned to 1 (NIC is flaky at 0).

Kernel			10pps	100pps	1kpps	10kpps	100kpps
---------------------------------------------------------------
2.6.22 (32bit)		30	29.5	29	30	41
2.6.31-rc9(64 bit)	64	63	46	30	40

The only minor improvement is at a rate of 100kpps. All rates
lower than 10kpps regress significantly.

Could there be something wrong with the bnx2 interrupt routing? The interrupts
all end up on cpu0 here. There are 8 of them in a system with 16 "processors".
How do those need to be configured? There are some sparse comments in
Documentation/networking/multiqueue.txt but the text does not say anything
about the irq routing.




* Re: UDP regression with packets rates < 10k per sec
  2009-09-08 22:38 UDP regression with packets rates < 10k per sec Christoph Lameter
@ 2009-09-08 22:52 ` Eric Dumazet
  2009-09-09 14:01   ` Christoph Lameter
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2009-09-08 22:52 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev

Christoph Lameter wrote:
> Looks like we have a regression since 2.6.22 due to latency increases in
> the network stack? The following is the result of measuring latencies for
> UDP multicast traffic at packet rates of 10pps 100pps 1kpps 10kpps and
> 100k pps. Two system running "mcast -n1 -r<rate>" (mcast tool from
> http://gentwo.org/ll).
> 
> Measurements in microseconds for one hop using bnx2 on Dell R610 (64 bit
> 2.6.31-rc9) and Dell 1950 (32 bit 2.6.22.19 3.3Ghz). Dell R610 RX usecs
> tuned to 0. 32 bit tuned to 1 (NIC is flaky at 0).
> 
> Kernel			10pps	100pps	1kpps	10kpps	100kpps
> ---------------------------------------------------------------
> 2.6.22 (32bit)		30	29.5	29	30	41
> 2.6.31-rc9(64 bit)	64	63	46	30	40
> 
> The only minor improvement is at a rate of 100kpps. All rates
> lower than 10k regress significantly.
> 
> Could there be something wrong with the bnx2 interrupt routing? They all
> end up on cpu0 here. There are 8 of them in a system with 16 "processors".
> How do those need to be configured? There are some sparse comments in
> Documentation/networking/multiqueue.txt but the text does not say anything
> about the irq routing.
> 
> 

Hi Christoph

In order to reproduce this here, could you tell me if you use 

Producer linux-2.6.22 -> Receiver 2.6.22
Producer linux-2.6.31 -> Receiver 2.6.31

Or a mix of
Producer linux-2.6.31 -> Receiver 2.6.22
Producer linux-2.6.22 -> Receiver 2.6.31-rc9

It is not clear what your exact setup is.

Thanks


* Re: UDP regression with packets rates < 10k per sec
  2009-09-08 22:52 ` Eric Dumazet
@ 2009-09-09 14:01   ` Christoph Lameter
  2009-09-09 15:09     ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Lameter @ 2009-09-09 14:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Wed, 9 Sep 2009, Eric Dumazet wrote:

> In order to reproduce this here, could you tell me if you use
>
> Producer linux-2.6.22 -> Receiver 2.6.22
> Producer linux-2.6.31 -> Receiver 2.6.31

I use the above setup.



* Re: UDP regression with packets rates < 10k per sec
  2009-09-09 14:01   ` Christoph Lameter
@ 2009-09-09 15:09     ` Eric Dumazet
  2009-09-09 16:47       ` Christoph Lameter
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2009-09-09 15:09 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev

Christoph Lameter wrote:
> On Wed, 9 Sep 2009, Eric Dumazet wrote:
> 
>> In order to reproduce this here, could you tell me if you use
>>
>> Producer linux-2.6.22 -> Receiver 2.6.22
>> Producer linux-2.6.31 -> Receiver 2.6.31
> 
> I use the above setup.

Then frames are sent on the wire but not received

(they are received via the mc loop, internal stack magic)


# ./mcast -L -n1 -r 10000
WARNING: Multiple active ethernet devices. Using local address 192.168.0.1
Receiver: Listening to control channel 239.0.192.1
Receiver: Subscribing to 1 MC addresses 239.0.192-254.2-254 offset 0 origin 192.168.0.1
Sender: Sending 10000 msgs/ch/sec on 1 channels. Probe interval=0.001-1 sec.

TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us StdDv
100000      0      0      0   10000  3000.0    7.84    8.89   10.51  0.66

# uname -a
Linux erd 2.6.30.5 #2 SMP Mon Sep 7 17:15:43 CEST 2009 i686 i686 i386 GNU/Linux



I tried an old kernel on the same hardware:

# ./mcast -L -n1 -r 10000
WARNING: Multiple active ethernet devices. Using local address 55.225.18.6
Receiver: Listening to control channel 239.0.192.1
Receiver: Subscribing to 1 MC addresses 239.0.192-254.2-254 offset 0 origin 55.225.18.6
Sender: Sending 10000 msgs/ch/sec on 1 channels. Probe interval=0.001-1 sec.

TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us StdDv
   99999      0      0      0    9998     0.0    9.00    9.95   14.50  1.56

Linux erd 2.6.9-55.ELsmp #1 SMP Fri Apr 20 17:03:35 EDT 2007 i686 i686 i386 GNU/Linux

So my numbers seem much better than yours...


* Re: UDP regression with packets rates < 10k per sec
  2009-09-09 15:09     ` Eric Dumazet
@ 2009-09-09 16:47       ` Christoph Lameter
  2009-09-09 17:06         ` Eric Dumazet
  2009-09-09 17:08         ` Eric Dumazet
  0 siblings, 2 replies; 22+ messages in thread
From: Christoph Lameter @ 2009-09-09 16:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Wed, 9 Sep 2009, Eric Dumazet wrote:

> Christoph Lameter wrote:
> > On Wed, 9 Sep 2009, Eric Dumazet wrote:
> >
> >> In order to reproduce this here, could you tell me if you use
> >>
> >> Producer linux-2.6.22 -> Receiver 2.6.22
> >> Producer linux-2.6.31 -> Receiver 2.6.31
> >
> > I use the above setup.
>
> Then frames are sent on wire but not received
>
> (they are received via  mc loop, internal stack magic)

We are talking about two machines running 2.6.22 or 2.6.31. There is no
magic mc loop between the two machines. -L was not used.

> # ./mcast -L -n1 -r 10000
> WARNING: Multiple active ethernet devices. Using local address 192.168.0.1
> Receiver: Listening to control channel 239.0.192.1
> Receiver: Subscribing to 1 MC addresses 239.0.192-254.2-254 offset 0 origin 192.168.0.1
> Sender: Sending 10000 msgs/ch/sec on 1 channels. Probe interval=0.001-1 sec.
>
> TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us StdDv
> 100000      0      0      0   10000  3000.0    7.84    8.89   10.51  0.66

These are loopback latencies... Don't use -L.

> # uname -a
> Linux erd 2.6.30.5 #2 SMP Mon Sep 7 17:15:43 CEST 2009 i686 i686 i386 GNU/Linux
>
>
>
> I tried an old kernel on same hardware :
>
> # ./mcast -L -n1 -r 10000
> WARNING: Multiple active ethernet devices. Using local address 55.225.18.6
> Receiver: Listening to control channel 239.0.192.1
> Receiver: Subscribing to 1 MC addresses 239.0.192-254.2-254 offset 0 origin 55.225.18.6
> Sender: Sending 10000 msgs/ch/sec on 1 channels. Probe interval=0.001-1 sec.
>
> TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us StdDv
>    99999      0      0      0    9998     0.0    9.00    9.95   14.50  1.56
>
> Linux erd 2.6.9-55.ELsmp #1 SMP Fri Apr 20 17:03:35 EDT 2007 i686 i686 i386 GNU/Linux
>
> So my numbers seem much better than yours...

These are loopback numbers. Why is 2.6.9 so high? The regression shows at
less than 10k.

Could you use real NICs? With multiple TX queues and all the other cool
stuff? And run at lower packet rates?

My loopback numbers also show the same trends.

2.6.22:

mcast -Ln1
TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us StdDv
     101      0      0      0      10     3.0    5.47    5.74    7.00  0.43

mcast -Ln1 -r10000
TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us StdDv
  100000      0      0      0   10000  3000.0    5.97    6.11    6.40 0.13


2.6.31-rc9

mcast -Ln1
TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us StdDv
     100      0      0      0      10     3.0   13.26   13.45   13.56  0.09

mcast -Ln1 -r10000
TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us StdDv
  100000      0      0      0   10000  3000.0    5.70    5.82    5.91  0.07


So 2.6.22 is better at 10 msgs per second. 2.6.31 is slightly better at
10k.


* Re: UDP regression with packets rates < 10k per sec
  2009-09-09 16:47       ` Christoph Lameter
@ 2009-09-09 17:06         ` Eric Dumazet
  2009-09-09 17:55           ` Christoph Lameter
  2009-09-09 17:08         ` Eric Dumazet
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2009-09-09 17:06 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev

Christoph Lameter wrote:
> On Wed, 9 Sep 2009, Eric Dumazet wrote:
> 
>> Christoph Lameter wrote:
>>> On Wed, 9 Sep 2009, Eric Dumazet wrote:
>>>
>>>> In order to reproduce this here, could you tell me if you use
>>>>
>>>> Producer linux-2.6.22 -> Receiver 2.6.22
>>>> Producer linux-2.6.31 -> Receiver 2.6.31
>>> I use the above setup.
>> Then frames are sent on wire but not received
>>
>> (they are received via  mc loop, internal stack magic)
> 
> We are talking about two machines running 2.6.22 or 2.6.31. There is no
> magic mc loop between the two machines. -L was not used.
> 
>> # ./mcast -L -n1 -r 10000
>> WARNING: Multiple active ethernet devices. Using local address 192.168.0.1
>> Receiver: Listening to control channel 239.0.192.1
>> Receiver: Subscribing to 1 MC addresses 239.0.192-254.2-254 offset 0 origin 192.168.0.1
>> Sender: Sending 10000 msgs/ch/sec on 1 channels. Probe interval=0.001-1 sec.
>>
>> TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us StdDv
>> 100000      0      0      0   10000  3000.0    7.84    8.89   10.51  0.66
> 
> These are loopback latencies... Dont use -L
> 
>> # uname -a
>> Linux erd 2.6.30.5 #2 SMP Mon Sep 7 17:15:43 CEST 2009 i686 i686 i386 GNU/Linux
>>
>>
>>
>> I tried an old kernel on same hardware :
>>
>> # ./mcast -L -n1 -r 10000
>> WARNING: Multiple active ethernet devices. Using local address 55.225.18.6
>> Receiver: Listening to control channel 239.0.192.1
>> Receiver: Subscribing to 1 MC addresses 239.0.192-254.2-254 offset 0 origin 55.225.18.6
>> Sender: Sending 10000 msgs/ch/sec on 1 channels. Probe interval=0.001-1 sec.
>>
>> TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us StdDv
>>    99999      0      0      0    9998     0.0    9.00    9.95   14.50  1.56
>>
>> Linux erd 2.6.9-55.ELsmp #1 SMP Fri Apr 20 17:03:35 EDT 2007 i686 i686 i386 GNU/Linux
>>
>> So my numbers seem much better than yours...
> 
> These are loopback numbers. Why is 2.6.9 so high? The regression show at
> less than 10k.
> 
> Could you use real NICs? With multiple TX queues and all the other cool
> stuff? And run at lower packet rates?

Well, on your mono-flow test, multiqueue won't help anyway.

> 
> My loopback numbers also show the same trends.
> 
> 2.6.22:
> 
> mcast -Ln1
> TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us StdDv
>      101      0      0      0      10     3.0    5.47    5.74    7.00  0.43
> 
> mcast -Ln1 -r10000
> TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us StdDv
>   100000      0      0      0   10000  3000.0    5.97    6.11    6.40 0.13
> 
> 
> 2.6.31-rc9
> 
> mcast -Ln1
> TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us StdDv
>      100      0      0      0      10     3.0   13.26   13.45   13.56  0.09
> 
> mcast -Ln1 -r10000
> TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us StdDv
>   100000      0      0      0   10000  3000.0    5.70    5.82    5.91  0.07
> 
> 
> So 2.6.22 is better at 10 msgs per second. 2.6.31 is slightly better at
> 10k.

Unfortunately there is too much noise in this test.

# uname -a
Linux nag001 2.6.26-2-amd64 #1 SMP Fri Mar 27 04:02:59 UTC 2009 x86_64 GNU/Linux
# ./mcast -Ln1 -C
WARNING: Multiple active ethernet devices. Using local address 55.225.18.110
Receiver: Listening to control channel 239.0.192.1
Receiver: Subscribing to 1 MC addresses 239.0.192-254.2-254 offset 0 origin 55.225.18.110
Sender: Sending 10 msgs/ch/sec on 1 channels. Probe interval=0.001-1 sec.

TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us StdDv
     100      0      0      0      10     0.0    8.78   10.56   18.20  2.62
     100      0      0      0      10     3.0    8.60    9.33    9.88  0.45
     101      0      0      0      10     3.0    8.42    9.47   11.53  0.85
     100      0      0      0      10     3.0    8.37   10.19   14.85  1.74
     101      0      0      0      10     3.0    8.86   11.23   18.00  3.11
     100      0      0      0      10     3.0    8.43    9.72   11.48  0.82
     101      0      0      0      10     3.0    9.02   29.78  210.25 60.16
     100      0      0      0      10     3.0    8.47   10.27   11.69  0.99
     100      0      0      0      10     3.0    9.21   10.16   12.15  0.84
     101      0      0      0      10     3.0    8.46   10.65   17.88  2.64
     101      0      0      0      10     3.0    8.65   10.00   11.32  0.86
     100      0      0      0      10     3.0    8.98    9.71   10.77  0.58
     100      0      0      0      10     3.0    8.38   11.73   19.06  3.74


Also I believe you hit a scheduler artefact at very low rates,
since there are two tasks that are considered coupled.

Check commit 6f3d09291b4982991680b61763b2541e53e2a95f

sched, net: socket wakeups are sync

'sync' wakeups are a hint towards the scheduler that (certain)
networking related wakeups likely create coupling between tasks.

Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/net/core/sock.c b/net/core/sock.c
index 09cb3a7..2654c14 100644 (file)
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1621,7 +1621,7 @@ static void sock_def_readable(struct sock *sk, int len)
 {
        read_lock(&sk->sk_callback_lock);
        if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
-               wake_up_interruptible(sk->sk_sleep);
+               wake_up_interruptible_sync(sk->sk_sleep);
        sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
        read_unlock(&sk->sk_callback_lock);
 }
@@ -1635,7 +1635,7 @@ static void sock_def_write_space(struct sock *sk)
         */
        if ((atomic_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf) {
                if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
-                       wake_up_interruptible(sk->sk_sleep);
+                       wake_up_interruptible_sync(sk->sk_sleep);
 
                /* Should agree with poll, otherwise some programs break */
                if (sock_writeable(sk))



* Re: UDP regression with packets rates < 10k per sec
  2009-09-09 16:47       ` Christoph Lameter
  2009-09-09 17:06         ` Eric Dumazet
@ 2009-09-09 17:08         ` Eric Dumazet
  2009-09-09 17:26           ` Christoph Lameter
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2009-09-09 17:08 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev

Christoph Lameter wrote:
> On Wed, 9 Sep 2009, Eric Dumazet wrote:
> 
>> Christoph Lameter wrote:
>>> On Wed, 9 Sep 2009, Eric Dumazet wrote:
>>>
>>>> In order to reproduce this here, could you tell me if you use
>>>>
>>>> Producer linux-2.6.22 -> Receiver 2.6.22
>>>> Producer linux-2.6.31 -> Receiver 2.6.31
>>> I use the above setup.
>> Then frames are sent on wire but not received
>>
>> (they are received via  mc loop, internal stack magic)
> 
> We are talking about two machines running 2.6.22 or 2.6.31. There is no
> magic mc loop between the two machines. -L was not used.
> 


I have no idea how you run the test then, or how the time deltas are computed.

Are your clocks synchronized to less than one us? How is it done?


* Re: UDP regression with packets rates < 10k per sec
  2009-09-09 17:08         ` Eric Dumazet
@ 2009-09-09 17:26           ` Christoph Lameter
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Lameter @ 2009-09-09 17:26 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Wed, 9 Sep 2009, Eric Dumazet wrote:

> I have no idea how you run the test then, and how are computed time deltas.
>
> Are you clocks synchronized to less than one us ? How is it done ?

Packets are sent one way and include a timestamp. Once in a while (probe
interval) a packet is sent back to the sender (with the timestamp the
sender took earlier). The sender takes a second timestamp and estimates
the one-way latency by halving the round-trip time.

I have seen this approach in multiple middleware test programs. Clocks are
typically not synchronized well enough to directly compare timestamps from
two systems.
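
For illustration, a minimal sketch of the sender side of such a probe
(hypothetical code, not the actual mcast source; only the port number 9002
is taken from the tcpdump output later in this thread):

/* probe_rtt.c - send a timestamped UDP probe, wait for the echo and
 * estimate the one-way latency as half the round trip (no error handling) */
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(int argc, char **argv)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	struct sockaddr_in peer;
	struct timeval t0, t1, echo;
	double rtt_us;

	if (argc < 2)
		return 1;
	memset(&peer, 0, sizeof(peer));
	peer.sin_family = AF_INET;
	peer.sin_port = htons(9002);		/* port carrying the probes */
	inet_aton(argv[1], &peer.sin_addr);	/* peer that echoes the probe */

	gettimeofday(&t0, NULL);
	sendto(fd, &t0, sizeof(t0), 0, (struct sockaddr *)&peer, sizeof(peer));
	recv(fd, &echo, sizeof(echo), 0);	/* our timestamp, echoed back */
	gettimeofday(&t1, NULL);

	rtt_us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
	printf("estimated one-way latency: %.2f us\n", rtt_us / 2.0);
	return 0;
}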



* Re: UDP regression with packets rates < 10k per sec
  2009-09-09 17:06         ` Eric Dumazet
@ 2009-09-09 17:55           ` Christoph Lameter
  2009-09-10 20:37             ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Lameter @ 2009-09-09 17:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Wed, 9 Sep 2009, Eric Dumazet wrote:

> > My loopback numbers also show the same trends.
> >
> > 2.6.22:
> >
> > mcast -Ln1
> > TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us StdDv
> >      101      0      0      0      10     3.0    5.47    5.74    7.00  0.43
> >
> > 2.6.31-rc9
> >
> > mcast -Ln1
> > TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us StdDv
> >      100      0      0      0      10     3.0   13.26   13.45   13.56  0.09
> >
> >
> > So 2.6.22 is better at 10 msgs per second. 2.6.31 is slightly better at
> > 10k.
>
> Linux nag001 2.6.26-2-amd64 #1 SMP Fri Mar 27 04:02:59 UTC 2009 x86_64 GNU/Linux
>
> TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us StdDv
>      101      0      0      0      10     3.0    8.42    9.47   11.53  0.85
>
> Also I believe you hit scheduler artefact at very low rate,
> since there are two tasks that are considered as coupled.
>
> Check commit 6f3d09291b4982991680b61763b2541e53e2a95f

Reverting that patch yields:

clameter@rd-strategy3-deb64:~$ ./mcast -Ln1

TotalMsg   Lost SeqErr TXDrop Msg/Sec  KB/Sec  Min/us  Avg/us  Max/us StdDv
     101      0      0      0      10     3.0   27.21   31.28   52.68  8.16

Not good.



* Re: UDP regression with packets rates < 10k per sec
  2009-09-09 17:55           ` Christoph Lameter
@ 2009-09-10 20:37             ` Eric Dumazet
  2009-09-10 21:36               ` Christoph Lameter
                                 ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Eric Dumazet @ 2009-09-10 20:37 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev

Something must be wrong with the program or whatever...

On the receiver I did this to trace the latency messages only:

# tcpdump -i eth0 not host 239.0.192.2 and port 9002 -n

22:28:07.223300 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
22:28:07.223403 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32

22:28:08.223301 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
22:28:08.223416 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32

22:28:09.223380 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
22:28:09.223494 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32

22:28:10.223481 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
22:28:10.223597 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32

22:28:11.223581 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
22:28:11.223678 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32


See how the answer is *very* slow? Something like > 100 us?



* Re: UDP regression with packets rates < 10k per sec
  2009-09-10 20:37             ` Eric Dumazet
@ 2009-09-10 21:36               ` Christoph Lameter
  2009-09-10 21:37               ` Eric Dumazet
  2009-09-10 21:39               ` Christoph Lameter
  2 siblings, 0 replies; 22+ messages in thread
From: Christoph Lameter @ 2009-09-10 21:36 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Thu, 10 Sep 2009, Eric Dumazet wrote:

> On the receiver I did this to trace the latency messages only
>
> # tcpdump -i eth0 not host 239.0.192.2 and port 9002 -n

You do not show the multicast traffic. The latency timestamp requests are
sent unicast and at larger intervals, about one per second per mc channel.



* Re: UDP regression with packets rates < 10k per sec
  2009-09-10 20:37             ` Eric Dumazet
  2009-09-10 21:36               ` Christoph Lameter
@ 2009-09-10 21:37               ` Eric Dumazet
  2009-09-10 21:42                 ` Christoph Lameter
  2009-09-10 21:39               ` Christoph Lameter
  2 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2009-09-10 21:37 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev

Eric Dumazet wrote:
> Something must be wrong with program or whatever...
> 
> On the receiver I did this to trace the latency messages only
> 
> # tcpdump -i eth0 not host 239.0.192.2 and port 9002 -n
> 
> 22:28:07.223300 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
> 22:28:07.223403 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32
> 
> 22:28:08.223301 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
> 22:28:08.223416 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32
> 
> 22:28:09.223380 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
> 22:28:09.223494 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32
> 
> 22:28:10.223481 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
> 22:28:10.223597 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32
> 
> 22:28:11.223581 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
> 22:28:11.223678 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32
> 
> 
> See how the answer is *very* slow ? Something like > 100 us ?
> 

Never mind, this is OK after proper irq affinity and tweaks on the receiver
(I should do the same on the other machine as well...):

(echo 1 >/proc/irq/default_smp_affinity) before bringing up the bnx2 NIC

echo performance >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

and idle=mwait (otherwise hpet and C2/C3 states are used on this machine and
wakeups are slow)

# tcpdump -i eth0 not host 239.0.192.2 and port 9002 -n
23:25:02.137143 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
23:25:02.137212 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32

23:25:03.137204 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
23:25:03.137288 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32

23:25:04.137276 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
23:25:04.137355 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32


I presume power-saving improvements in recent kernels work against
latency goals...



* Re: UDP regression with packets rates < 10k per sec
  2009-09-10 20:37             ` Eric Dumazet
  2009-09-10 21:36               ` Christoph Lameter
  2009-09-10 21:37               ` Eric Dumazet
@ 2009-09-10 21:39               ` Christoph Lameter
  2 siblings, 0 replies; 22+ messages in thread
From: Christoph Lameter @ 2009-09-10 21:39 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Thu, 10 Sep 2009, Eric Dumazet wrote:

> 22:28:11.223581 IP 55.225.18.7.9002 > 55.225.18.5.9002: UDP, length 32
> 22:28:11.223678 IP 55.225.18.5.9002 > 55.225.18.7.9002: UDP, length 32
>
> See how the answer is *very* slow ? Something like > 100 us ?

If you run tcpdump then you create some overhead I guess. 100 usec should
show up as latencies > 50usec in the report.

The latency request includes the timestamp from the source. So the
destination can take another timestamp at receive time. The sending back
of the reply to the timestamp request is not critical at all. The receiver
will just register the time differential established by the sender.



* Re: UDP regression with packets rates < 10k per sec
  2009-09-10 21:37               ` Eric Dumazet
@ 2009-09-10 21:42                 ` Christoph Lameter
  2009-09-14 21:10                   ` Christoph Lameter
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Lameter @ 2009-09-10 21:42 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Thu, 10 Sep 2009, Eric Dumazet wrote:

> and idle=mwait  (or hpet and C2/C3 states are used on this machine and
> wakeups are slow)

We disabled C states here.


* Re: UDP regression with packets rates < 10k per sec
  2009-09-10 21:42                 ` Christoph Lameter
@ 2009-09-14 21:10                   ` Christoph Lameter
  2009-09-15  5:29                     ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Lameter @ 2009-09-14 21:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev


Where are we on this? Definitely a regression?


* Re: UDP regression with packets rates < 10k per sec
  2009-09-14 21:10                   ` Christoph Lameter
@ 2009-09-15  5:29                     ` Eric Dumazet
  2009-09-15 14:07                       ` Christoph Lameter
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2009-09-15  5:29 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev

Christoph Lameter wrote:
> Where are we on this? Definitely a regression?

??

I tried to reproduce your numbers and failed on my machines.

2.6.31 is actually faster than 2.6.22 on the bench you provided.

Must be specific to the hardware, I guess?


As the text size presumably is bigger in 2.6.31, fetching code
into cpu caches to handle 10 packets per second is what we call
a cold path anyway.

If you want to make it a fast path, you want to make sure the code is
always hot in cpu caches, and find a way to inject packets into
the kernel so the cpu keeps the path hot.
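
One way to picture that last idea (purely a hypothetical sketch, not
something the mcast tool does): a small side process that keeps bouncing a
tiny UDP datagram off loopback and reading it back, so the udp receive and
socket wakeup paths stay in the cpu caches between the rare real packets.
The port and interval below are made up:

/* keepwarm.c - periodically push a tiny datagram through loopback so the
 * udp rx path stays cache-hot (sketch only, no error handling) */
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	struct sockaddr_in self;
	char dummy = 0;

	memset(&self, 0, sizeof(self));
	self.sin_family = AF_INET;
	self.sin_port = htons(9999);		/* made-up scratch port */
	inet_aton("127.0.0.1", &self.sin_addr);
	bind(fd, (struct sockaddr *)&self, sizeof(self));

	for (;;) {
		/* send one tiny packet to ourselves and read it back, then
		 * sleep ~1ms; this keeps tx, loopback rx and the socket
		 * wakeup path warm without adding meaningful load */
		sendto(fd, &dummy, sizeof(dummy), 0,
		       (struct sockaddr *)&self, sizeof(self));
		recv(fd, &dummy, sizeof(dummy), 0);
		usleep(1000);
	}
	return 0;
}

Keeping the wire receive path warm would of course need the dummy traffic
to arrive on the NIC itself, so in practice the peer would have to send it.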



* Re: UDP regression with packets rates < 10k per sec
  2009-09-15  5:29                     ` Eric Dumazet
@ 2009-09-15 14:07                       ` Christoph Lameter
  2009-09-15 17:26                         ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Lameter @ 2009-09-15 14:07 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Tue, 15 Sep 2009, Eric Dumazet wrote:

> 2.6.31 is actually faster than 2.6.22 on the bench you provided.

Well, at high packet rates, which were not the topic.

> Must be specific to the hardware I guess ?

Huh? Even your loopback numbers did show the regression up to 10k.

> As text size presumably is bigger in 2.6.31, fetching code
> in cpu caches to handle 10 packets per second is what we call
> a cold path anyway.

Ok so it's an accepted regression? This is a significant reason not to use
newer kernel versions for latency-critical applications that may have
to send a packet once in a while for notification. The latency is doubled
(1G) / tripled / quadrupled (IB) vs 2.6.22.

> If you want to make it a fast path, you want to make sure code its
> always hot in cpu caches, and find a way to inject packets into
> the kernel to make sure cpu keep the path hot.

Oh, gosh.




* Re: UDP regression with packets rates < 10k per sec
  2009-09-15 14:07                       ` Christoph Lameter
@ 2009-09-15 17:26                         ` Eric Dumazet
  2009-09-15 20:25                           ` Christoph Lameter
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2009-09-15 17:26 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev

Christoph Lameter wrote:
> On Tue, 15 Sep 2009, Eric Dumazet wrote:
> 
>> 2.6.31 is actually faster than 2.6.22 on the bench you provided.
> 
> Well at high packet rates which were not the topic.
> 
>> Must be specific to the hardware I guess ?
> 
> Huh? Even your loopback numbers did show the regression up to 10k.
> 
>> As text size presumably is bigger in 2.6.31, fetching code
>> in cpu caches to handle 10 packets per second is what we call
>> a cold path anyway.
> 
> Ok so its an accepted regression? This is a significant reason not to use
> newer versions of kernels for latency critical applications that may have
> to send a packet once in a while for notification. The latency is doubled
> (1G) / tripled / quadrupled (IB) vs 2.6.22.
> 
>> If you want to make it a fast path, you want to make sure code its
>> always hot in cpu caches, and find a way to inject packets into
>> the kernel to make sure cpu keep the path hot.
> 
> Oh, gosh.

It seems there is a lot of confusion on this topic, so I will make a full recap:

Once I understood that my 2.6.31 kernel had many more features than 2.6.22, I tuned
it to:

- Let the cpu run at full speed (3GHz instead of 2GHz): before tuning, 2.6.31 was
using the "ondemand" governor and my cpus were running at 2GHz, while they were
running at 3GHz on my 2.6.22 config

- Don't let cpus enter C2/C3 wait states (idle=mwait)

- Correctly affine the ethX irq to a cpu (2.6.22 was running the ethX irq on one cpu,
 while on 2.6.31, irqs were distributed to all online cpus)


Then, your mcast test gives the same results at 10pps, 100pps, 1000pps, 10000pps.

When sniffing the receiving side, I can see:

- Answer to an icmp ping (served by softirq only): 6 us between request and reply

- Answer to one 'give timestamp' request from the mcast client: 11 us between request and reply,
  regardless of kernel version (2.6.22 or 2.6.31)

So there is a 5us cost to actually wake up a process and let it do the recvfrom() and sendto() pair,
which is quite OK, and this time did not change significantly between 2.6.22 and 2.6.31.
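
For reference, the measured pair on the receiving side amounts to something
like the following loop (a hypothetical sketch, not the actual mcast code;
port 9002 is the one seen in the tcpdump traces above):

/* echo_ts.c - block in recvfrom() waiting for a probe and echo the
 * sender's timestamp straight back; this wakeup plus the recvfrom()/
 * sendto() pair is the ~5us measured above (no error handling) */
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	struct sockaddr_in me, from;
	socklen_t fromlen;
	char buf[64];
	ssize_t len;

	memset(&me, 0, sizeof(me));
	me.sin_family = AF_INET;
	me.sin_addr.s_addr = htonl(INADDR_ANY);
	me.sin_port = htons(9002);
	bind(fd, (struct sockaddr *)&me, sizeof(me));

	for (;;) {
		fromlen = sizeof(from);
		/* sleep until a probe arrives: this wakeup is what gets timed */
		len = recvfrom(fd, buf, sizeof(buf), 0,
			       (struct sockaddr *)&from, &fromlen);
		if (len <= 0)
			continue;
		/* return the sender's timestamp unchanged */
		sendto(fd, buf, len, 0, (struct sockaddr *)&from, fromlen);
	}
	return 0;
}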

Hope this helps


* Re: UDP regression with packets rates < 10k per sec
  2009-09-15 20:25                           ` Christoph Lameter
@ 2009-09-15 19:02                             ` Eric Dumazet
  2009-09-25 14:19                             ` Eric Dumazet
  1 sibling, 0 replies; 22+ messages in thread
From: Eric Dumazet @ 2009-09-15 19:02 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev

Christoph Lameter wrote:
> On Tue, 15 Sep 2009, Eric Dumazet wrote:
> 
>> Once I understood my 2.6.31 kernel had much more features than 2.6.22 and that I tuned
>> it to :
>>
>> - Let cpu run at full speed (3GHz instead of 2GHz) : before tuning, 2.6.31 was
>> using "ondemand" governor and my cpus were running at 2GHz, while they where
>> running at 3GHz on my 2.6.22 config
> 
> My kernel did not have support for any governors compiled in.
> 
>> - Dont let cpus enter C2/C3 wait states (idle=mwait)
> 
> Ok. Trying idle=mwait.
> 
>> - Correctly affine cpu to ethX irq (2.6.22 was running ethX irq on one cpu, while
>>  on 2.6.31, irqs were distributed to all online cpus)
> 
> Interrupts of both 2.6.22 and 2.6.31 go to cpu 0. Does it matter for
> loopback?

No, of course not; loopback triggers the softirq on the local cpu, no special setup
to respect.

> 
>> Then, your mcast test gives same results, at 10pps, 100pps, 1000pps, 10000pps
> 
> loopback via mcast -Ln1 -r <rate>
> 
> 		10pps	100pps	1000pps	10000pps
> 2.6.22(32bit)	7.36	7.28	7.15	7.16
> 2.6.31(64bit)	9.28	10.27	9.70	9.79
> 
> What a difference. Now the initial latency rampup for 2.6.31 is gone. So
> even w/o governors the kernel does something to increase the latencies.
> 
> We sacrificed 2 - 3 microseconds per message to kernel features, bloat and
> 64 bitness?

Well, I don't know, I mainly use 32bit kernels, but yes, using 64bit has a cost,
since skbs for example are bigger, sockets are bigger, so we touch more cache lines
per transaction...


You could precisely compute the number of cycles per transaction with the "perf" tools
(only on 2.6.31) on both 64bit and 32bit kernels, benching 100000 pps for example
and counting the number of perf counter irqs per second.


* Re: UDP regression with packets rates < 10k per sec
  2009-09-15 17:26                         ` Eric Dumazet
@ 2009-09-15 20:25                           ` Christoph Lameter
  2009-09-15 19:02                             ` Eric Dumazet
  2009-09-25 14:19                             ` Eric Dumazet
  0 siblings, 2 replies; 22+ messages in thread
From: Christoph Lameter @ 2009-09-15 20:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Tue, 15 Sep 2009, Eric Dumazet wrote:

> Once I understood my 2.6.31 kernel had much more features than 2.6.22 and that I tuned
> it to :
>
> - Let cpu run at full speed (3GHz instead of 2GHz) : before tuning, 2.6.31 was
> using "ondemand" governor and my cpus were running at 2GHz, while they where
> running at 3GHz on my 2.6.22 config

My kernel did not have support for any governors compiled in.

> - Dont let cpus enter C2/C3 wait states (idle=mwait)

Ok. Trying idle=mwait.

> - Correctly affine cpu to ethX irq (2.6.22 was running ethX irq on one cpu, while
>  on 2.6.31, irqs were distributed to all online cpus)

Interrupts of both 2.6.22 and 2.6.31 go to cpu 0. Does it matter for
loopback?

> Then, your mcast test gives same results, at 10pps, 100pps, 1000pps, 10000pps

loopback via mcast -Ln1 -r <rate>

		10pps	100pps	1000pps	10000pps
2.6.22(32bit)	7.36	7.28	7.15	7.16
2.6.31(64bit)	9.28	10.27	9.70	9.79

What a difference. Now the initial latency rampup for 2.6.31 is gone. So
even w/o governors the kernel does something to increase the latencies.

We sacrificed 2 - 3 microseconds per message to kernel features, bloat and
64 bitness?


* Re: UDP regression with packets rates < 10k per sec
  2009-09-15 20:25                           ` Christoph Lameter
  2009-09-15 19:02                             ` Eric Dumazet
@ 2009-09-25 14:19                             ` Eric Dumazet
  2009-09-26 16:13                               ` Christoph Lameter
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2009-09-25 14:19 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev

Christoph Lameter wrote:

> We sacrificed 2 - 3 microseconds per message to kernel features, bloat and
> 64 bitness?

Update:

With my current kernel on the receiver (linux-2.6 32bit + some networking patches + SLUB_STATS)
and mcast -n1 -b eth3 -r 2000 on the sender (2.6.29 unfortunately, I cannot change it at this moment)


I took a look at ftrace (function tracer) to check the paths:
(tg3 instead of bnx2 on this snapshot, it should not be a big concern)

- tg3 interrupts on CPU-0, copybreak disabled.

Hardware interrupt
------------------
          <idle>-0     [000] 13580.504033: do_IRQ <-common_interrupt
          <idle>-0     [000] 13580.504034: irq_enter <-do_IRQ
          <idle>-0     [000] 13580.504034: rcu_irq_enter <-irq_enter
          <idle>-0     [000] 13580.504034: idle_cpu <-irq_enter
          <idle>-0     [000] 13580.504034: tick_check_idle <-irq_enter
          <idle>-0     [000] 13580.504034: tick_check_oneshot_broadcast <-tick_check_idle
          <idle>-0     [000] 13580.504034: tick_nohz_stop_idle <-tick_check_idle
          <idle>-0     [000] 13580.504034: ktime_get <-tick_nohz_stop_idle
          <idle>-0     [000] 13580.504035: ktime_get <-sched_clock_tick
          <idle>-0     [000] 13580.504035: touch_softlockup_watchdog <-sched_clock_idle_wakeup_event
          <idle>-0     [000] 13580.504035: ktime_get <-tick_check_idle
          <idle>-0     [000] 13580.504035: tick_do_update_jiffies64 <-tick_check_idle
          <idle>-0     [000] 13580.504035: _spin_lock <-tick_do_update_jiffies64
          <idle>-0     [000] 13580.504036: do_timer <-tick_do_update_jiffies64
          <idle>-0     [000] 13580.504036: update_wall_time <-do_timer
          <idle>-0     [000] 13580.504036: calc_global_load <-do_timer
          <idle>-0     [000] 13580.504036: touch_softlockup_watchdog <-tick_check_idle
          <idle>-0     [000] 13580.504036: handle_irq <-do_IRQ
          <idle>-0     [000] 13580.504036: irq_to_desc <-handle_irq
          <idle>-0     [000] 13580.504036: handle_edge_irq <-handle_irq
          <idle>-0     [000] 13580.504037: _spin_lock <-handle_edge_irq
          <idle>-0     [000] 13580.504037: ack_apic_edge <-handle_edge_irq
          <idle>-0     [000] 13580.504037: irq_to_desc <-ack_apic_edge
          <idle>-0     [000] 13580.504037: irq_complete_move <-ack_apic_edge
          <idle>-0     [000] 13580.504037: move_native_irq <-ack_apic_edge
          <idle>-0     [000] 13580.504037: irq_to_desc <-move_native_irq
          <idle>-0     [000] 13580.504038: native_apic_mem_write <-ack_apic_edge
          <idle>-0     [000] 13580.504038: handle_IRQ_event <-handle_edge_irq

2) Handled by tg3
          <idle>-0     [000] 13580.504038: tg3_msi <-handle_IRQ_event
          <idle>-0     [000] 13580.504038: tg3_write32 <-tg3_msi
          <idle>-0     [000] 13580.504038: __napi_schedule <-tg3_msi
          <idle>-0     [000] 13580.504038: note_interrupt <-handle_edge_irq
          <idle>-0     [000] 13580.504039: _spin_lock <-handle_edge_irq
          <idle>-0     [000] 13580.504039: irq_exit <-do_IRQ

3) Softirq
          <idle>-0     [000] 13580.504039: do_softirq <-irq_exit
          <idle>-0     [000] 13580.504039: __do_softirq <-do_softirq
          <idle>-0     [000] 13580.504039: net_rx_action <-__do_softirq
          <idle>-0     [000] 13580.504039: tg3_poll <-net_rx_action
          <idle>-0     [000] 13580.504040: tg3_alloc_rx_skb <-tg3_poll
          <idle>-0     [000] 13580.504040: __netdev_alloc_skb <-tg3_alloc_rx_skb
          <idle>-0     [000] 13580.504040: __alloc_skb <-__netdev_alloc_skb
          <idle>-0     [000] 13580.504040: kmem_cache_alloc <-__alloc_skb
          <idle>-0     [000] 13580.504040: __kmalloc_track_caller <-__alloc_skb
          <idle>-0     [000] 13580.504040: get_slab <-__kmalloc_track_caller
          <idle>-0     [000] 13580.504040: __slab_alloc <-__kmalloc_track_caller

hmm... is it normal that we call deactivate_slab()?
(I included at the end of this mail: grep . /sys/kernel/slab/:t-0004096/* )

          <idle>-0     [000] 13580.504041: deactivate_slab <-__slab_alloc
          <idle>-0     [000] 13580.504041: _spin_lock <-__slab_alloc
          <idle>-0     [000] 13580.504041: nommu_map_page <-tg3_alloc_rx_skb
          <idle>-0     [000] 13580.504041: skb_put <-tg3_poll
          <idle>-0     [000] 13580.504041: eth_type_trans <-tg3_poll
          <idle>-0     [000] 13580.504041: skb_pull <-eth_type_trans
          <idle>-0     [000] 13580.504042: napi_gro_receive <-tg3_poll
          <idle>-0     [000] 13580.504042: skb_gro_reset_offset <-napi_gro_receive
          <idle>-0     [000] 13580.504042: __napi_gro_receive <-napi_gro_receive
          <idle>-0     [000] 13580.504042: dev_gro_receive <-__napi_gro_receive
          <idle>-0     [000] 13580.504042: napi_skb_finish <-napi_gro_receive
          <idle>-0     [000] 13580.504042: netif_receive_skb <-napi_skb_finish
          <idle>-0     [000] 13580.504043: ip_rcv <-netif_receive_skb

firewall (iptables) phase 1 (mangle prerouting)

          <idle>-0     [000] 13580.504043: nf_hook_slow <-ip_rcv
          <idle>-0     [000] 13580.504043: nf_iterate <-nf_hook_slow
          <idle>-0     [000] 13580.504043: ipt_pre_routing_hook <-nf_iterate
          <idle>-0     [000] 13580.504043: ipt_do_table <-ipt_pre_routing_hook
          <idle>-0     [000] 13580.504043: local_bh_disable <-ipt_do_table
          <idle>-0     [000] 13580.504044: _spin_lock <-ipt_do_table
          <idle>-0     [000] 13580.504044: local_bh_enable <-ipt_do_table
          <idle>-0     [000] 13580.504044: ip_rcv_finish <-ip_rcv
          <idle>-0     [000] 13580.504044: ip_route_input <-ip_rcv_finish
          <idle>-0     [000] 13580.504044: ip_local_deliver <-ip_rcv_finish
firewall (iptables) phase 2  (mangle INPUT)
          <idle>-0     [000] 13580.504045: nf_hook_slow <-ip_local_deliver
          <idle>-0     [000] 13580.504045: nf_iterate <-nf_hook_slow
          <idle>-0     [000] 13580.504045: ipt_local_in_hook <-nf_iterate
          <idle>-0     [000] 13580.504045: ipt_do_table <-ipt_local_in_hook
          <idle>-0     [000] 13580.504045: local_bh_disable <-ipt_do_table
          <idle>-0     [000] 13580.504045: _spin_lock <-ipt_do_table
          <idle>-0     [000] 13580.504045: local_bh_enable <-ipt_do_table
firewall (iptables) phase 3  (filter INPUT)
          <idle>-0     [000] 13580.504046: ipt_local_in_hook <-nf_iterate
          <idle>-0     [000] 13580.504046: ipt_do_table <-ipt_local_in_hook
          <idle>-0     [000] 13580.504046: local_bh_disable <-ipt_do_table
          <idle>-0     [000] 13580.504046: _spin_lock <-ipt_do_table
          <idle>-0     [000] 13580.504046: local_bh_enable <-ipt_do_table
          <idle>-0     [000] 13580.504046: ip_local_deliver_finish <-ip_local_deliver
          <idle>-0     [000] 13580.504046: raw_local_deliver <-ip_local_deliver_finish

OK, UDP handler...

          <idle>-0     [000] 13580.504047: udp_rcv <-ip_local_deliver_finish
          <idle>-0     [000] 13580.504047: __udp4_lib_rcv <-udp_rcv
          <idle>-0     [000] 13580.504047: _spin_lock <-__udp4_lib_rcv
          <idle>-0     [000] 13580.504047: ip_mc_sf_allow <-__udp4_lib_rcv
          <idle>-0     [000] 13580.504047: udp_queue_rcv_skb <-__udp4_lib_rcv
          <idle>-0     [000] 13580.504047: _spin_lock <-udp_queue_rcv_skb
          <idle>-0     [000] 13580.504048: __udp_queue_rcv_skb <-udp_queue_rcv_skb
          <idle>-0     [000] 13580.504048: sock_queue_rcv_skb <-__udp_queue_rcv_skb

          <idle>-0     [000] 13580.504048: sk_filter <-sock_queue_rcv_skb
          <idle>-0     [000] 13580.504048: local_bh_disable <-sk_filter
          <idle>-0     [000] 13580.504048: local_bh_enable <-sk_filter

          <idle>-0     [000] 13580.504048: skb_queue_tail <-sock_queue_rcv_skb
          <idle>-0     [000] 13580.504049: _spin_lock_irqsave <-skb_queue_tail
          <idle>-0     [000] 13580.504049: _spin_unlock_irqrestore <-skb_queue_tail
          <idle>-0     [000] 13580.504049: sock_def_readable <-sock_queue_rcv_skb
          <idle>-0     [000] 13580.504049: _read_lock <-sock_def_readable
          <idle>-0     [000] 13580.504049: __wake_up_sync_key <-sock_def_readable
          <idle>-0     [000] 13580.504049: _spin_lock_irqsave <-__wake_up_sync_key
          <idle>-0     [000] 13580.504049: __wake_up_common <-__wake_up_sync_key
          <idle>-0     [000] 13580.504050: receiver_wake_function <-__wake_up_common
          <idle>-0     [000] 13580.504050: autoremove_wake_function <-receiver_wake_function
          <idle>-0     [000] 13580.504050: default_wake_function <-autoremove_wake_function

wakeup mcast task

          <idle>-0     [000] 13580.504050: try_to_wake_up <-default_wake_function
          <idle>-0     [000] 13580.504050: task_rq_lock <-try_to_wake_up
          <idle>-0     [000] 13580.504050: _spin_lock <-task_rq_lock
          <idle>-0     [000] 13580.504051: _spin_unlock_irqrestore <-try_to_wake_up
          <idle>-0     [000] 13580.504051: select_task_rq_fair <-try_to_wake_up
          <idle>-0     [000] 13580.504051: task_rq_lock <-try_to_wake_up
          <idle>-0     [000] 13580.504051: _spin_lock <-task_rq_lock
          <idle>-0     [000] 13580.504051: activate_task <-try_to_wake_up
          <idle>-0     [000] 13580.504051: enqueue_task <-activate_task
          <idle>-0     [000] 13580.504052: enqueue_task_fair <-enqueue_task
          <idle>-0     [000] 13580.504052: update_curr <-enqueue_task_fair
          <idle>-0     [000] 13580.504052: place_entity <-enqueue_task_fair
          <idle>-0     [000] 13580.504052: __enqueue_entity <-enqueue_task_fair
          <idle>-0     [000] 13580.504052: check_preempt_curr_idle <-try_to_wake_up
          <idle>-0     [000] 13580.504053: resched_task <-check_preempt_curr_idle
          <idle>-0     [000] 13580.504053: _spin_unlock_irqrestore <-try_to_wake_up
          <idle>-0     [000] 13580.504053: _spin_unlock_irqrestore <-__wake_up_sync_key

tg3 finished its rx handling

          <idle>-0     [000] 13580.504053: tg3_write32 <-tg3_poll
          <idle>-0     [000] 13580.504053: tg3_write32 <-tg3_poll
          <idle>-0     [000] 13580.504054: napi_complete <-tg3_poll
          <idle>-0     [000] 13580.504054: napi_gro_flush <-napi_complete
          <idle>-0     [000] 13580.504054: __napi_complete <-napi_complete
          <idle>-0     [000] 13580.504054: tg3_write32 <-tg3_poll
          <idle>-0     [000] 13580.504054: rcu_bh_qs <-__do_softirq
          <idle>-0     [000] 13580.504054: _local_bh_enable <-__do_softirq
          <idle>-0     [000] 13580.504055: rcu_irq_exit <-irq_exit
          <idle>-0     [000] 13580.504055: idle_cpu <-irq_exit
          <idle>-0     [000] 13580.504055: tick_nohz_restart_sched_tick <-cpu_idle
          <idle>-0     [000] 13580.504055: tick_nohz_stop_idle <-tick_nohz_restart_sched_tick
          <idle>-0     [000] 13580.504055: rcu_exit_nohz <-tick_nohz_restart_sched_tick
          <idle>-0     [000] 13580.504055: select_nohz_load_balancer <-tick_nohz_restart_sched_tick
          <idle>-0     [000] 13580.504056: ktime_get <-tick_nohz_restart_sched_tick
          <idle>-0     [000] 13580.504056: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
          <idle>-0     [000] 13580.504056: account_idle_ticks <-tick_nohz_restart_sched_tick
          <idle>-0     [000] 13580.504056: account_idle_time <-account_idle_ticks
          <idle>-0     [000] 13580.504056: touch_softlockup_watchdog <-tick_nohz_restart_sched_tick
          <idle>-0     [000] 13580.504056: hrtimer_cancel <-tick_nohz_restart_sched_tick
          <idle>-0     [000] 13580.504056: hrtimer_try_to_cancel <-hrtimer_cancel
          <idle>-0     [000] 13580.504057: lock_hrtimer_base <-hrtimer_try_to_cancel
          <idle>-0     [000] 13580.504057: _spin_lock_irqsave <-lock_hrtimer_base
          <idle>-0     [000] 13580.504057: __remove_hrtimer <-hrtimer_try_to_cancel
          <idle>-0     [000] 13580.504057: hrtimer_force_reprogram <-__remove_hrtimer
          <idle>-0     [000] 13580.504057: tick_program_event <-hrtimer_force_reprogram
          <idle>-0     [000] 13580.504058: tick_dev_program_event <-tick_program_event
          <idle>-0     [000] 13580.504058: ktime_get <-tick_dev_program_event
          <idle>-0     [000] 13580.504058: clockevents_program_event <-tick_dev_program_event
          <idle>-0     [000] 13580.504058: lapic_next_event <-clockevents_program_event
          <idle>-0     [000] 13580.504058: native_apic_mem_write <-lapic_next_event
          <idle>-0     [000] 13580.504058: _spin_unlock_irqrestore <-hrtimer_try_to_cancel
          <idle>-0     [000] 13580.504058: hrtimer_forward <-tick_nohz_restart_sched_tick
          <idle>-0     [000] 13580.504059: ktime_add_safe <-hrtimer_forward
          <idle>-0     [000] 13580.504059: ktime_add_safe <-hrtimer_forward
          <idle>-0     [000] 13580.504059: hrtimer_start_range_ns <-tick_nohz_restart_sched_tick
          <idle>-0     [000] 13580.504059: __hrtimer_start_range_ns <-hrtimer_start_range_ns
          <idle>-0     [000] 13580.504059: lock_hrtimer_base <-__hrtimer_start_range_ns
          <idle>-0     [000] 13580.504059: _spin_lock_irqsave <-lock_hrtimer_base
          <idle>-0     [000] 13580.504060: ktime_add_safe <-__hrtimer_start_range_ns
          <idle>-0     [000] 13580.504060: enqueue_hrtimer <-__hrtimer_start_range_ns
          <idle>-0     [000] 13580.504060: tick_program_event <-__hrtimer_start_range_ns
          <idle>-0     [000] 13580.504060: tick_dev_program_event <-tick_program_event
          <idle>-0     [000] 13580.504060: ktime_get <-tick_dev_program_event
          <idle>-0     [000] 13580.504061: clockevents_program_event <-tick_dev_program_event
          <idle>-0     [000] 13580.504061: lapic_next_event <-clockevents_program_event
          <idle>-0     [000] 13580.504061: native_apic_mem_write <-lapic_next_event
          <idle>-0     [000] 13580.504061: _spin_unlock_irqrestore <-__hrtimer_start_range_ns
          <idle>-0     [000] 13580.504061: schedule <-cpu_idle
          <idle>-0     [000] 13580.504061: rcu_sched_qs <-schedule
          <idle>-0     [000] 13580.504061: _spin_lock_irq <-schedule
          <idle>-0     [000] 13580.504062: put_prev_task_idle <-schedule
          <idle>-0     [000] 13580.504062: pick_next_task_fair <-schedule
          <idle>-0     [000] 13580.504062: set_next_entity <-pick_next_task_fair
          <idle>-0     [000] 13580.504062: update_stats_wait_end <-set_next_entity
          <idle>-0     [000] 13580.504062: __dequeue_entity <-set_next_entity
          <idle>-0     [000] 13580.504062: hrtick_start_fair <-pick_next_task_fair
          <idle>-0     [000] 13580.504063: perf_event_task_sched_out <-schedule
          <idle>-0     [000] 13580.504063: __switch_to <-schedule


mcast is scheduled, gets the frame

           mcast-21429 [000] 13580.504063: finish_task_switch <-schedule
           mcast-21429 [000] 13580.504063: perf_event_task_sched_in <-finish_task_switch
           mcast-21429 [000] 13580.504063: finish_wait <-__skb_recv_datagram
           mcast-21429 [000] 13580.504064: _spin_lock_irqsave <-__skb_recv_datagram
           mcast-21429 [000] 13580.504064: _spin_unlock_irqrestore <-__skb_recv_datagram
           mcast-21429 [000] 13580.504064: skb_copy_datagram_iovec <-udp_recvmsg
           mcast-21429 [000] 13580.504064: memcpy_toiovec <-skb_copy_datagram_iovec
           mcast-21429 [000] 13580.504064: copy_to_user <-memcpy_toiovec
           mcast-21429 [000] 13580.504065: lock_sock_nested <-udp_recvmsg
           mcast-21429 [000] 13580.504065: _spin_lock_bh <-lock_sock_nested
           mcast-21429 [000] 13580.504065: local_bh_disable <-_spin_lock_bh
           mcast-21429 [000] 13580.504065: local_bh_enable <-lock_sock_nested
           mcast-21429 [000] 13580.504065: skb_free_datagram <-udp_recvmsg
           mcast-21429 [000] 13580.504065: consume_skb <-skb_free_datagram
           mcast-21429 [000] 13580.504065: __kfree_skb <-consume_skb
           mcast-21429 [000] 13580.504066: skb_release_head_state <-__kfree_skb
           mcast-21429 [000] 13580.504066: dst_release <-skb_release_head_state
           mcast-21429 [000] 13580.504066: sock_rfree <-skb_release_head_state
           mcast-21429 [000] 13580.504066: skb_release_data <-__kfree_skb
           mcast-21429 [000] 13580.504066: kfree <-skb_release_data
           mcast-21429 [000] 13580.504066: __slab_free <-kfree

  is it normal that we call add_partial()?

           mcast-21429 [000] 13580.504067: add_partial <-__slab_free
           mcast-21429 [000] 13580.504067: _spin_lock <-add_partial
           mcast-21429 [000] 13580.504067: kmem_cache_free <-__kfree_skb
           mcast-21429 [000] 13580.504067: __slab_free <-kmem_cache_free
           mcast-21429 [000] 13580.504067: release_sock <-udp_recvmsg
           mcast-21429 [000] 13580.504067: _spin_lock_bh <-release_sock
           mcast-21429 [000] 13580.504067: local_bh_disable <-_spin_lock_bh
           mcast-21429 [000] 13580.504068: _spin_unlock_bh <-release_sock
           mcast-21429 [000] 13580.504068: local_bh_enable_ip <-_spin_unlock_bh
           mcast-21429 [000] 13580.504068: move_addr_to_user <-sys_recvfrom
           mcast-21429 [000] 13580.504068: copy_to_user <-move_addr_to_user
           mcast-21429 [000] 13580.504068: fput <-sys_recvfrom

           mcast-21429 [000] 13580.504069: sys_gettimeofday <-sysenter_do_call
           mcast-21429 [000] 13580.504069: do_gettimeofday <-sys_gettimeofday
           mcast-21429 [000] 13580.504069: getnstimeofday <-do_gettimeofday
           mcast-21429 [000] 13580.504069: copy_to_user <-sys_gettimeofday

           mcast-21429 [000] 13580.504069: sys_socketcall <-sysenter_do_call
           mcast-21429 [000] 13580.504070: copy_from_user <-sys_socketcall

mcast wants next frame and is going to block

           mcast-21429 [000] 13580.504070: sys_recvfrom <-sys_socketcall
           mcast-21429 [000] 13580.504070: sockfd_lookup_light <-sys_recvfrom
           mcast-21429 [000] 13580.504070: fget_light <-sockfd_lookup_light
           mcast-21429 [000] 13580.504070: sock_recvmsg <-sys_recvfrom
           mcast-21429 [000] 13580.504070: sock_common_recvmsg <-sock_recvmsg
           mcast-21429 [000] 13580.504071: udp_recvmsg <-sock_common_recvmsg
           mcast-21429 [000] 13580.504071: __skb_recv_datagram <-udp_recvmsg
           mcast-21429 [000] 13580.504071: _spin_lock_irqsave <-__skb_recv_datagram
           mcast-21429 [000] 13580.504071: _spin_unlock_irqrestore <-__skb_recv_datagram
           mcast-21429 [000] 13580.504071: prepare_to_wait_exclusive <-__skb_recv_datagram
           mcast-21429 [000] 13580.504071: _spin_lock_irqsave <-prepare_to_wait_exclusive
           mcast-21429 [000] 13580.504071: _spin_unlock_irqrestore <-prepare_to_wait_exclusive
           mcast-21429 [000] 13580.504072: schedule_timeout <-__skb_recv_datagram
           mcast-21429 [000] 13580.504072: schedule <-schedule_timeout
           mcast-21429 [000] 13580.504072: rcu_sched_qs <-schedule
           mcast-21429 [000] 13580.504072: _spin_lock_irq <-schedule
           mcast-21429 [000] 13580.504072: deactivate_task <-schedule
           mcast-21429 [000] 13580.504072: dequeue_task <-deactivate_task
           mcast-21429 [000] 13580.504073: dequeue_task_fair <-dequeue_task
           mcast-21429 [000] 13580.504073: update_curr <-dequeue_task_fair
           mcast-21429 [000] 13580.504073: hrtick_start_fair <-dequeue_task_fair
           mcast-21429 [000] 13580.504073: find_busiest_group <-schedule
           mcast-21429 [000] 13580.504073: scale_rt_power <-find_busiest_group
           mcast-21429 [000] 13580.504073: sched_avg_update <-scale_rt_power
           mcast-21429 [000] 13580.504074: msecs_to_jiffies <-schedule
           mcast-21429 [000] 13580.504074: find_busiest_group <-schedule
           mcast-21429 [000] 13580.504074: msecs_to_jiffies <-schedule
           mcast-21429 [000] 13580.504075: put_prev_task_fair <-schedule
           mcast-21429 [000] 13580.504075: pick_next_task_fair <-schedule
           mcast-21429 [000] 13580.504075: pick_next_task_rt <-schedule
           mcast-21429 [000] 13580.504075: pick_next_task_fair <-schedule
           mcast-21429 [000] 13580.504075: pick_next_task_idle <-schedule
           mcast-21429 [000] 13580.504075: perf_event_task_sched_out <-schedule
           mcast-21429 [000] 13580.504076: __switch_to <-schedule
          <idle>-0     [000] 13580.504076: finish_task_switch <-schedule
          <idle>-0     [000] 13580.504076: perf_event_task_sched_in <-finish_task_switch
          <idle>-0     [000] 13580.504076: tick_nohz_stop_sched_tick <-cpu_idle
          <idle>-0     [000] 13580.504076: ktime_get <-tick_nohz_stop_sched_tick
          <idle>-0     [000] 13580.504076: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
          <idle>-0     [000] 13580.504077: _spin_lock <-get_next_timer_interrupt
          <idle>-0     [000] 13580.504077: hrtimer_get_next_event <-get_next_timer_interrupt
          <idle>-0     [000] 13580.504077: _spin_lock_irqsave <-hrtimer_get_next_event
          <idle>-0     [000] 13580.504077: _spin_unlock_irqrestore <-hrtimer_get_next_event
          <idle>-0     [000] 13580.504077: rcu_needs_cpu <-tick_nohz_stop_sched_tick
          <idle>-0     [000] 13580.504077: printk_needs_cpu <-tick_nohz_stop_sched_tick
          <idle>-0     [000] 13580.504077: select_nohz_load_balancer <-tick_nohz_stop_sched_tick
          <idle>-0     [000] 13580.504078: rcu_enter_nohz <-tick_nohz_stop_sched_tick
          <idle>-0     [000] 13580.504078: hrtimer_start <-tick_nohz_stop_sched_tick
          <idle>-0     [000] 13580.504078: __hrtimer_start_range_ns <-hrtimer_start
          <idle>-0     [000] 13580.504078: lock_hrtimer_base <-__hrtimer_start_range_ns
          <idle>-0     [000] 13580.504078: _spin_lock_irqsave <-lock_hrtimer_base
          <idle>-0     [000] 13580.504078: __remove_hrtimer <-__hrtimer_start_range_ns
          <idle>-0     [000] 13580.504079: hrtimer_force_reprogram <-__remove_hrtimer
          <idle>-0     [000] 13580.504079: tick_program_event <-hrtimer_force_reprogram
          <idle>-0     [000] 13580.504079: tick_dev_program_event <-tick_program_event
          <idle>-0     [000] 13580.504079: ktime_get <-tick_dev_program_event
          <idle>-0     [000] 13580.504079: clockevents_program_event <-tick_dev_program_event
          <idle>-0     [000] 13580.504079: lapic_next_event <-clockevents_program_event
          <idle>-0     [000] 13580.504079: native_apic_mem_write <-lapic_next_event
          <idle>-0     [000] 13580.504080: ktime_add_safe <-__hrtimer_start_range_ns
          <idle>-0     [000] 13580.504080: enqueue_hrtimer <-__hrtimer_start_range_ns
          <idle>-0     [000] 13580.504080: tick_program_event <-__hrtimer_start_range_ns
          <idle>-0     [000] 13580.504080: tick_dev_program_event <-tick_program_event
          <idle>-0     [000] 13580.504080: ktime_get <-tick_dev_program_event
          <idle>-0     [000] 13580.504080: clockevents_program_event <-tick_dev_program_event
          <idle>-0     [000] 13580.504081: lapic_next_event <-clockevents_program_event
          <idle>-0     [000] 13580.504081: native_apic_mem_write <-lapic_next_event
          <idle>-0     [000] 13580.504081: _spin_unlock_irqrestore <-__hrtimer_start_range_ns
          <idle>-0     [000] 13580.504081: poll_idle <-cpu_idle


Next interrupt...


          <idle>-0     [000] 13580.505032: do_IRQ <-common_interrupt
          <idle>-0     [000] 13580.505032: irq_enter <-do_IRQ
          <idle>-0     [000] 13580.505032: rcu_irq_enter <-irq_enter
          <idle>-0     [000] 13580.505032: idle_cpu <-irq_enter


Too many slowpaths for 4096 slabs ?

$ cd /sys/kernel/slab/:t-0004096
$ grep . *
aliases:1
align:8
grep: alloc_calls: Function not implemented
alloc_fastpath:416584 C0=234119 C1=52938 C2=18413 C3=4739 C4=49140 C5=14679 C6=39266 C7=3290
alloc_from_partial:459402 C0=459391 C1=8 C2=1 C5=2
alloc_refill:459619 C0=459460 C1=54 C2=1 C3=4 C4=52 C5=31 C6=2 C7=15
alloc_slab:103 C0=45 C1=28 C3=1 C4=26 C5=1 C6=1 C7=1
alloc_slowpath:459628 C0=459462 C1=55 C2=2 C3=5 C4=53 C5=32 C6=3 C7=16
cache_dma:0
cpuslab_flush:0
cpu_slabs:8
deactivate_empty:0
deactivate_full:459498 C0=459436 C1=35 C4=25 C5=2
deactivate_remote_frees:0
deactivate_to_head:0
deactivate_to_tail:0
destroy_by_rcu:0
free_add_partial:459448 C0=459324 C1=9 C2=39 C4=74 C5=2
grep: free_calls: Function not implemented
free_fastpath:218431 C0=36679 C1=52679 C2=18408 C3=4708 C4=48837 C5=14602 C6=39252 C7=3266
free_frozen:491 C0=161 C1=95 C2=14 C3=36 C4=77 C5=64 C6=13 C7=31
free_remove_partial:40 C0=40
free_slab:40 C0=40
free_slowpath:657340 C0=656835 C1=119 C2=76 C3=36 C4=159 C5=69 C6=15 C7=31
hwcache_align:0
min_partial:10
objects:499
object_size:4096
objects_partial:35
objs_per_slab:8
order:3
order_fallback:0
partial:5
poison:0
reclaim_account:0
red_zone:0
sanity_checks:0
slabs:63
slab_size:4096
store_user:0
total_objects:504
trace:0



Comments:
- lots of disable_bh()/enable_bh() pairs (enable_bh is slow) that could be avoided...
- many ktime_get() calls
- my HZ=1000 setup might be stupid on a CONFIG_NO_HZ=y kernel :(

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: UDP regression with packets rates < 10k per sec
  2009-09-25 14:19                             ` Eric Dumazet
@ 2009-09-26 16:13                               ` Christoph Lameter
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Lameter @ 2009-09-26 16:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Fri, 25 Sep 2009, Eric Dumazet wrote:

> With my current kernel on receiver (linux-2.6 32bit + some networking
> patches + SLUB_STATS) mcast -n1 -b eth3 -r 2000 on the sender (2.6.29
> unfortunately, I cannot change it at this moment)

My tests are all done using SLAB since I wanted to exclude differences as
much as possible.

>           <idle>-0     [000] 13580.504040: __kmalloc_track_caller <-__alloc_skb
>           <idle>-0     [000] 13580.504040: get_slab <-__kmalloc_track_caller
>           <idle>-0     [000] 13580.504040: __slab_alloc <-__kmalloc_track_caller
>
> hmm... is it normal we call deactivate_slab() ?

deactivate_slab() is called when the slab page we are allocating from runs
out of objects or is no longer suitable for the allocation (for example, we
want objects from a different NUMA node).
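
Roughly, in mm/slub.c terms the allocation slow path looks like this (a
condensed sketch from memory rather than the literal 2.6.31 source; locking,
irq handling and the statistics counters are left out):

static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags,
			  int node, struct kmem_cache_cpu *c)
{
	void **object;

	if (!c->page)
		goto new_slab;		/* no per-cpu slab page yet */

	if (!node_match(c, node))
		goto another_slab;	/* page is on the wrong NUMA node */

	object = c->page->freelist;
	if (!object)
		goto another_slab;	/* per-cpu page has no free objects left */

	/* refill c->freelist from the page and hand out the first object */
	...

another_slab:
	deactivate_slab(s, c);		/* unfreeze the page, put it back on the node lists */

new_slab:
	/* take a page from the node's partial list, or allocate a fresh slab */
	...
}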

>            mcast-21429 [000] 13580.504066: sock_rfree <-skb_release_head_state
>            mcast-21429 [000] 13580.504066: skb_release_data <-__kfree_skb
>            mcast-21429 [000] 13580.504066: kfree <-skb_release_data
>            mcast-21429 [000] 13580.504066: __slab_free <-kfree
>
>   is it normal we call add_partial() ?

add_partial() is called when we free an object in a slab page that previously
had all of its objects allocated. Such a page can be used for allocations again
and must therefore be tracked on the partial list. Fully allocated slab pages
are not tracked at all.
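
The corresponding logic on the free side is roughly (again a condensed sketch
of __slab_free(), not the literal source; the empty-slab and debug cases are
omitted):

static void __slab_free(struct kmem_cache *s, struct page *page, void *x)
{
	void *prior = page->freelist;

	/* link the object back onto the page's freelist */
	set_freepointer(s, x, prior);
	page->freelist = x;

	if (!prior)
		/*
		 * The page was fully allocated before this free and therefore
		 * sat on no list at all; add_partial() makes it available for
		 * allocation again.
		 */
		add_partial(get_node(s, page_to_nid(page)), page, 1);
}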

> Too many slowpaths for 4096 slabs ?
>
> $ cd /sys/kernel/slab/:t-0004096
> $ grep . *
> aliases:1
> align:8
> grep: alloc_calls: Function not implemented
> alloc_fastpath:416584 C0=234119 C1=52938 C2=18413 C3=4739 C4=49140 C5=14679 C6=39266 C7=3290
> alloc_from_partial:459402 C0=459391 C1=8 C2=1 C5=2
> alloc_refill:459619 C0=459460 C1=54 C2=1 C3=4 C4=52 C5=31 C6=2 C7=15
> alloc_slab:103 C0=45 C1=28 C3=1 C4=26 C5=1 C6=1 C7=1
> alloc_slowpath:459628 C0=459462 C1=55 C2=2 C3=5 C4=53 C5=32 C6=3 C7=16

Hmmm. That is a high percentage. All are refills. So there are remote
frees from the other processor to the slab page the first processor
allocates from. One processor allocates the object and then pushes it to
the other for freeing? Bad for caching.
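
(Working that out from the stats above: 459628 / (459628 + 416584) means
roughly 52% of all allocations in this cache take the slow path, and on CPU0
alone it is about 459462 / (459462 + 234119), roughly 66%.)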

> free_slowpath:657340 C0=656835 C1=119 C2=76 C3=36 C4=159 C5=69 C6=15 C7=31

Also quite high, and consistent with remote freeing of objects allocated on
the first processor. The objects are very short-lived.
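
(Again working from the full stats dump above: 657340 / (657340 + 218431)
means roughly 75% of the frees take the slow path, almost all of them on CPU0.)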

> comments :
> - lots of disable_bh()/enable_bh(), (enable_bh is slow), that could be avoided...
> - many ktime_get() calls
> - my HZ=1000 setup might be stupid on a CONFIG_NO_HZ=y kernel :(

There are 8 objects per slab (order-3 allocation). You could maybe tune
things a bit by increasing the number of objects per slab, which may cut down
on the number of deactivate_slab() calls and will also reduce the need for
add_partial(). But I don't see either call causing significant latencies.

Both calls should only happen on every 8th or so call of kmalloc()/kfree().

To increase the objects per slab to 32:

boot with slub_max_order=5

and then

echo 5 > /sys/kernel/slab/kmalloc-4096/order

This would require order-5 allocations. Don't expect it to make too much of a
difference.
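
A quick way to check that the new geometry actually took effect (the sysfs
paths here are an assumption on my side; the cache may also show up under its
merged alias, e.g. :t-0004096 as in the stats dump above):

$ cat /sys/kernel/slab/kmalloc-4096/order         # should read 5 after the echo
$ cat /sys/kernel/slab/kmalloc-4096/objs_per_slab # 128KB / 4096 = 32 objects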


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2009-09-26 16:17 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-08 22:38 UDP regression with packets rates < 10k per sec Christoph Lameter
2009-09-08 22:52 ` Eric Dumazet
2009-09-09 14:01   ` Christoph Lameter
2009-09-09 15:09     ` Eric Dumazet
2009-09-09 16:47       ` Christoph Lameter
2009-09-09 17:06         ` Eric Dumazet
2009-09-09 17:55           ` Christoph Lameter
2009-09-10 20:37             ` Eric Dumazet
2009-09-10 21:36               ` Christoph Lameter
2009-09-10 21:37               ` Eric Dumazet
2009-09-10 21:42                 ` Christoph Lameter
2009-09-14 21:10                   ` Christoph Lameter
2009-09-15  5:29                     ` Eric Dumazet
2009-09-15 14:07                       ` Christoph Lameter
2009-09-15 17:26                         ` Eric Dumazet
2009-09-15 20:25                           ` Christoph Lameter
2009-09-15 19:02                             ` Eric Dumazet
2009-09-25 14:19                             ` Eric Dumazet
2009-09-26 16:13                               ` Christoph Lameter
2009-09-10 21:39               ` Christoph Lameter
2009-09-09 17:08         ` Eric Dumazet
2009-09-09 17:26           ` Christoph Lameter
