* High contention on the sk_buff_head.lock
@ 2009-03-18 17:24 Vernon Mauery
  2009-03-18 19:07 ` Eric Dumazet
  2009-03-18 20:54 ` Andi Kleen
  0 siblings, 2 replies; 43+ messages in thread
From: Vernon Mauery @ 2009-03-18 17:24 UTC (permalink / raw)
  To: netdev; +Cc: LKML, rt-users

I have been beating on network throughput in the -rt kernel for some time
now.  After digging down through the send path of UDP packets, I found
that the sk_buff_head.lock is under some very high contention.  This lock
is acquired each time a packet is enqueued on a qdisc and then acquired
again to dequeue the packet.  Under high networking loads, the enqueueing
processes are not only contending with each other for the lock, but also
with the net-tx soft irq.  This makes for some very high contention on this
one lock.  My testcase is running varying numbers of concurrent netperf
instances pushing UDP traffic to another machine.  As the count goes from
1 to 2, the network performance increases.  But from 2 to 4 and from 4 to 8,
we see a big decline, with 8 instances pushing about half of what a single
thread can do.
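
For illustration only, here is a minimal userspace model (not kernel code;
the thread counts and item counts are made up) of that pattern: several
producer threads and a single consumer thread all serializing on one queue
lock, standing in for the senders and the net-tx softirq on
sk_buff_head.lock.

/* Userspace model of N producers and one consumer serializing on a
 * single queue lock.  Build: gcc -O2 -pthread queue-model.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NPRODUCERS 8        /* stand-in for concurrent netperf senders */
#define NITEMS     100000   /* "packets" per producer */

struct item { struct item *next; };

static struct item *head;                 /* the single shared queue */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

static void *producer(void *arg)
{
	(void)arg;
	for (long i = 0; i < NITEMS; i++) {
		struct item *it = malloc(sizeof(*it));

		pthread_mutex_lock(&qlock);       /* "enqueue on the qdisc" */
		it->next = head;
		head = it;
		pthread_mutex_unlock(&qlock);
	}
	return NULL;
}

static void *consumer(void *arg)
{
	long done = 0;

	(void)arg;
	while (done < (long)NPRODUCERS * NITEMS) {
		struct item *it;

		pthread_mutex_lock(&qlock);       /* "net-tx softirq dequeue" */
		it = head;
		if (it)
			head = it->next;
		pthread_mutex_unlock(&qlock);
		if (it) {
			free(it);
			done++;
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t prod[NPRODUCERS], cons;

	pthread_create(&cons, NULL, consumer, NULL);
	for (int i = 0; i < NPRODUCERS; i++)
		pthread_create(&prod[i], NULL, producer, NULL);
	for (int i = 0; i < NPRODUCERS; i++)
		pthread_join(prod[i], NULL);
	pthread_join(cons, NULL);
	printf("all %ld items moved through one lock\n",
	       (long)NPRODUCERS * NITEMS);
	return 0;
}

Every enqueue and every dequeue takes the same mutex, so adding producers
past a point only adds contention rather than throughput.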

Running 2.6.29-rc6-rt3 on an 8-way machine with a 10GbE card (I have tried
both NetXen and Broadcom, with very similar results), I can only push about
1200 Mb/s, whereas with the mainline 2.6.29-rc8 kernel I can push nearly
6000 Mb/s, though still not as much as I think is possible.  I was curious and
decided to see if the mainline kernel was hitting the same lock, and using
/proc/lock_stat, it is hitting the sk_buff_head.lock as well (it was the
number one contended lock).

So while this issue really hits -rt kernels hard, it has a real effect on
mainline kernels as well.  The contention of the spinlocks is amplified
when they get turned into rt-mutexes, which causes a double context switch.

Below is the top of the lock_stat for 2.6.29-rc8.  This was captured from
a 1-minute network stress test.  The next highest contender had 2 orders of
magnitude fewer contentions.  Think of the throughput increase if we could
ease this contention a bit.  We might even be able to saturate a 10GbE
link.

lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
       class name    con-bounces    contentions   waittime-min   waittime-max   waittime-total    acq-bounces   acquisitions   holdtime-min  holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

    &list->lock#3:      24517307       24643791           0.71        1286.62      56516392.42       34834296       44904018           0.60        164.79    31314786.02
     -------------
    &list->lock#3       15596927    [<ffffffff812474da>] dev_queue_xmit+0x2ea/0x468
    &list->lock#3        9046864    [<ffffffff812546e9>] __qdisc_run+0x11b/0x1ef
     -------------
    &list->lock#3        6525300    [<ffffffff812546e9>] __qdisc_run+0x11b/0x1ef
    &list->lock#3       18118491    [<ffffffff812474da>] dev_queue_xmit+0x2ea/0x468


The story is the same for -rt kernels, only the waittime and holdtime are both
orders of magnitude greater.

I am not exactly clear on the solution, but if I understand correctly, in the
past there has been some discussion of batched enqueueing and dequeueing.  Is
anyone else working on this problem right now who has just not yet posted
anything for review?  Questions, comments, flames?

--Vernon


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-18 17:24 High contention on the sk_buff_head.lock Vernon Mauery
@ 2009-03-18 19:07 ` Eric Dumazet
  2009-03-18 20:17   ` Vernon Mauery
  2009-03-18 20:54 ` Andi Kleen
  1 sibling, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2009-03-18 19:07 UTC (permalink / raw)
  To: Vernon Mauery; +Cc: netdev, LKML, rt-users

Vernon Mauery wrote:
> I have been beating on network throughput in the -rt kernel for some time
> now.  After digging down through the send path of UDP packets, I found
> that the sk_buff_head.lock is under some very high contention.  This lock
> is acquired each time a packet is enqueued on a qdisc and then acquired
> again to dequeue the packet.  Under high networking loads, the enqueueing
> processes are not only contending among each other for the lock, but also
> with the net-tx soft irq.  This makes for some very high contention on this
> one lock.  My testcase is running varying numbers of concurrent netperf
> instances pushing UDP traffic to another machine.  As the count goes from
> 1 to 2, the network performance increases.  But from 2 to 4 and from 4
> to 8,
> we see a big decline, with 8 instances pushing about half of what a single
> thread can do.
> 
> Running 2.6.29-rc6-rt3 on an 8-way machine with a 10GbE card (I have tried
> both NetXen and Broadcom, with very similar results), I can only push about
> 1200 Mb/s.  Whereas with the mainline 2.6.29-rc8 kernel, I can push nearly
> 6000 Mb/s. But still not as much as I think is possible.  I was curious and
> decided to see if the mainline kernel was hitting the same lock, and using
> /proc/lock_stat, it is hitting the sk_buff_head.lock as well (it was the
> number one contended lock).
> 
> So while this issue really hits -rt kernels hard, it has a real effect on
> mainline kernels as well.  The contention of the spinlocks is amplified
> when they get turned into rt-mutexes, which causes a double context switch.
> 
> Below is the top of the lock_stat for 2.6.29-rc8.  This was captured from
> a 1 minute network stress test.  The next high contender had 2 orders of
> magnitude fewer contentions.  Think of the throughput increase if we could
> ease this contention a bit.  We might even be able to saturate a 10GbE
> link.
> 
> lock_stat version 0.3
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
>       class name    con-bounces    contentions   waittime-min  
> waittime-max   waittime-total    acq-bounces   acquisitions  
> holdtime-min  holdtime-max holdtime-total
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
> 
>    &list->lock#3:      24517307       24643791           0.71       
> 1286.62      56516392.42       34834296       44904018          
> 0.60        164.79    31314786.02
>     -------------
>    &list->lock#3       15596927    [<ffffffff812474da>]
> dev_queue_xmit+0x2ea/0x468
>    &list->lock#3        9046864    [<ffffffff812546e9>]
> __qdisc_run+0x11b/0x1ef
>     -------------
>    &list->lock#3        6525300    [<ffffffff812546e9>]
> __qdisc_run+0x11b/0x1ef
>    &list->lock#3       18118491    [<ffffffff812474da>]
> dev_queue_xmit+0x2ea/0x468
> 
> 
> The story is the same for -rt kernels, only the waittime and holdtime
> are both
> orders of magnitude greater.
> 
> I am not exactly clear on the solution, but if I understand correctly,
> in the
> past there has been some discussion of batched enqueueing and
> dequeueing.  Is
> anyone else working on this problem right now who has just not yet posted
> anything for review?  Questions, comments, flames?
> 

Yes, we have a known contention point here, but before adding more complex code,
could you try the following patch, please?

[PATCH] net: Reorder fields of struct Qdisc

dev_queue_xmit() needs to dirty the fields "state" and "q".

On the x86_64 arch, they currently span two cache lines, involving more
cache line ping-pongs than necessary.

Before patch:

offsetof(struct Qdisc, state)=0x38
offsetof(struct Qdisc, q)=0x48
offsetof(struct Qdisc, dev_queue)=0x60

After patch:

offsetof(struct Qdisc, dev_queue)=0x38
offsetof(struct Qdisc, state)=0x48
offsetof(struct Qdisc, q)=0x50


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f8c4742..e24feeb 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -51,10 +51,11 @@ struct Qdisc
 	u32			handle;
 	u32			parent;
 	atomic_t		refcnt;
-	unsigned long		state;
+	struct netdev_queue	*dev_queue;
+
 	struct sk_buff		*gso_skb;
+	unsigned long		state;
 	struct sk_buff_head	q;
-	struct netdev_queue	*dev_queue;
 	struct Qdisc		*next_sched;
 	struct list_head	list;
 



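As a side note, a quick way to eyeball offsets like the ones above without
rebuilding the kernel is a throwaway userspace program using offsetof() on a
stand-in structure (a tool like pahole against the real vmlinux is the proper
way).  The struct below is only a rough stand-in for struct Qdisc, so its
numbers need not match the kernel's exactly; it just shows the check.

/* Rough stand-in (NOT the real struct Qdisc) to demonstrate an
 * offsetof()/cache-line check.  Build: gcc -O2 offsets.c */
#include <stdio.h>
#include <stddef.h>

struct fake_qdisc {
	int		(*enqueue)(void *skb, struct fake_qdisc *q);
	void		*(*dequeue)(struct fake_qdisc *q);
	unsigned int	flags;
	int		padded;
	void		*ops;
	void		*stab;
	unsigned int	handle;
	unsigned int	parent;
	int		refcnt;
	unsigned long	state;
	void		*gso_skb;
	char		q[24];		/* stand-in for struct sk_buff_head */
	void		*dev_queue;
};

#define SHOW(f) printf("offsetof(struct fake_qdisc, " #f ")=0x%zx (64-byte line %zu)\n", \
		       offsetof(struct fake_qdisc, f),			\
		       offsetof(struct fake_qdisc, f) / 64)

int main(void)
{
	SHOW(state);
	SHOW(q);
	SHOW(dev_queue);
	return 0;
}
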
^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-18 19:07 ` Eric Dumazet
@ 2009-03-18 20:17   ` Vernon Mauery
  2009-03-20 23:29     ` Jarek Poplawski
  0 siblings, 1 reply; 43+ messages in thread
From: Vernon Mauery @ 2009-03-18 20:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, LKML, rt-users

Eric Dumazet wrote:
> Vernon Mauery a écrit :
>> I have been beating on network throughput in the -rt kernel for some time
>> now.  After digging down through the send path of UDP packets, I found
>> that the sk_buff_head.lock is under some very high contention.  This lock
>> is acquired each time a packet is enqueued on a qdisc and then acquired
>> again to dequeue the packet.  Under high networking loads, the enqueueing
>> processes are not only contending among each other for the lock, but also
>> with the net-tx soft irq.  This makes for some very high contention on this
>> one lock.  My testcase is running varying numbers of concurrent netperf
>> instances pushing UDP traffic to another machine.  As the count goes from
>> 1 to 2, the network performance increases.  But from 2 to 4 and from 4
>> to 8,
>> we see a big decline, with 8 instances pushing about half of what a single
>> thread can do.
>>
>> Running 2.6.29-rc6-rt3 on an 8-way machine with a 10GbE card (I have tried
>> both NetXen and Broadcom, with very similar results), I can only push about
>> 1200 Mb/s.  Whereas with the mainline 2.6.29-rc8 kernel, I can push nearly
>> 6000 Mb/s. But still not as much as I think is possible.  I was curious and
>> decided to see if the mainline kernel was hitting the same lock, and using
>> /proc/lock_stat, it is hitting the sk_buff_head.lock as well (it was the
>> number one contended lock).
>>
>> So while this issue really hits -rt kernels hard, it has a real effect on
>> mainline kernels as well.  The contention of the spinlocks is amplified
>> when they get turned into rt-mutexes, which causes a double context switch.
>>
>> Below is the top of the lock_stat for 2.6.29-rc8.  This was captured from
>> a 1 minute network stress test.  The next high contender had 2 orders of
>> magnitude fewer contentions.  Think of the throughput increase if we could
>> ease this contention a bit.  We might even be able to saturate a 10GbE
>> link.
>>
>> lock_stat version 0.3
>> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>>       class name    con-bounces    contentions   waittime-min  
>> waittime-max   waittime-total    acq-bounces   acquisitions  
>> holdtime-min  holdtime-max holdtime-total
>> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>>
>>    &list->lock#3:      24517307       24643791           0.71       
>> 1286.62      56516392.42       34834296       44904018          
>> 0.60        164.79    31314786.02
>>     -------------
>>    &list->lock#3       15596927    [<ffffffff812474da>]
>> dev_queue_xmit+0x2ea/0x468
>>    &list->lock#3        9046864    [<ffffffff812546e9>]
>> __qdisc_run+0x11b/0x1ef
>>     -------------
>>    &list->lock#3        6525300    [<ffffffff812546e9>]
>> __qdisc_run+0x11b/0x1ef
>>    &list->lock#3       18118491    [<ffffffff812474da>]
>> dev_queue_xmit+0x2ea/0x468
>>
>>
>> The story is the same for -rt kernels, only the waittime and holdtime
>> are both
>> orders of magnitude greater.
>>
>> I am not exactly clear on the solution, but if I understand correctly,
>> in the
>> past there has been some discussion of batched enqueueing and
>> dequeueing.  Is
>> anyone else working on this problem right now who has just not yet posted
>> anything for review?  Questions, comments, flames?
>>
> 
> Yes we have a known contention point here, but before adding more complex code,
> could you try following patch please ?

This patch does seem to reduce the number of contentions by about 10%.  That is
a good start (and a good catch on the cacheline bounces).  But, like I mentioned
above, this lock still has 2 orders of magnitude greater contention than the
next lock, so even a large decrease like 10% makes little difference in the
overall contention characteristics.

So we will have to do something more.  Whether it needs to be more complex or
not is still up in the air.  Batched enqueueing/dequeueing are just two options
and the former would be a *lot* less complex than the latter.

If anyone else has any ideas they have been holding back, now would be a great
time to get them out in the open.

--Vernon

> [PATCH] net: Reorder fields of struct Qdisc
> 
> dev_queue_xmit() needs to dirty fields "state" and "q"
> 
> On x86_64 arch, they currently span two cache lines, involving more
> cache line ping pongs than necessary.
> 
> Before patch :
> 
> offsetof(struct Qdisc, state)=0x38
> offsetof(struct Qdisc, q)=0x48
> offsetof(struct Qdisc, dev_queue)=0x60
> 
> After patch :
> 
> offsetof(struct Qdisc, dev_queue)=0x38
> offsetof(struct Qdisc, state)=0x48
> offsetof(struct Qdisc, q)=0x50
> 
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> 
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index f8c4742..e24feeb 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -51,10 +51,11 @@ struct Qdisc
>  	u32			handle;
>  	u32			parent;
>  	atomic_t		refcnt;
> -	unsigned long		state;
> +	struct netdev_queue	*dev_queue;
> +
>  	struct sk_buff		*gso_skb;
> +	unsigned long		state;
>  	struct sk_buff_head	q;
> -	struct netdev_queue	*dev_queue;
>  	struct Qdisc		*next_sched;
>  	struct list_head	list;
> 
> 
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-18 17:24 High contention on the sk_buff_head.lock Vernon Mauery
  2009-03-18 19:07 ` Eric Dumazet
@ 2009-03-18 20:54 ` Andi Kleen
  2009-03-18 21:03   ` David Miller
                     ` (2 more replies)
  1 sibling, 3 replies; 43+ messages in thread
From: Andi Kleen @ 2009-03-18 20:54 UTC (permalink / raw)
  To: Vernon Mauery; +Cc: netdev, LKML, rt-users

Vernon Mauery <vernux@us.ibm.com> writes:
>
> So while this issue really hits -rt kernels hard, it has a real effect on
> mainline kernels as well.  The contention of the spinlocks is amplified
> when they get turned into rt-mutexes, which causes a double context switch.

The new adaptive spin heuristics that were discussed some time
ago didn't help?  Unsurprisingly, making locks a lot more expensive
has drawbacks, as you discovered.

>    &list->lock#3:      24517307       24643791           0.71        1286.62      56516392.42       34834296       44904018           0.60        164.79    31314786.02
>     -------------
>    &list->lock#3       15596927    [<ffffffff812474da>] dev_queue_xmit+0x2ea/0x468

The real "fix" would probably be to use a multi-queue-capable NIC
and a NIC driver that sets up multiple queues for TX (normally they
only do so for RX).  Then each core, or each set of cores (often the
number of cores is larger than the number of NIC queues), could avoid
this problem.  Disadvantage: more memory use.
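
A rough userspace sketch of that idea follows (not driver code; the hash,
queue count, and flow count below are arbitrary): each TX queue gets its
own lock and senders pick a queue by flow hash, so unrelated flows stop
fighting over a single lock.

/* Userspace sketch of per-queue locking with hash-based queue
 * selection.  Build: gcc -O2 -pthread txq-model.c */
#include <pthread.h>
#include <stdio.h>
#include <stdint.h>

#define NQUEUES 4
#define NFLOWS  8

struct txq {
	pthread_mutex_t lock;		/* one lock per TX queue */
	unsigned long   enqueued;	/* stand-in for the real queue */
};

static struct txq txq[NQUEUES];

/* toy, skb_tx_hash()-style queue selection */
static unsigned int pick_queue(uint32_t flow_id)
{
	return (flow_id * 2654435761u) % NQUEUES;
}

static void *sender(void *arg)
{
	uint32_t flow_id = (uint32_t)(uintptr_t)arg;
	struct txq *q = &txq[pick_queue(flow_id)];

	for (int i = 0; i < 100000; i++) {
		pthread_mutex_lock(&q->lock);	/* only flows sharing a queue contend */
		q->enqueued++;			/* stand-in for the real enqueue */
		pthread_mutex_unlock(&q->lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t t[NFLOWS];

	for (int i = 0; i < NQUEUES; i++)
		pthread_mutex_init(&txq[i].lock, NULL);
	for (int i = 0; i < NFLOWS; i++)
		pthread_create(&t[i], NULL, sender, (void *)(uintptr_t)i);
	for (int i = 0; i < NFLOWS; i++)
		pthread_join(t[i], NULL);
	for (int i = 0; i < NQUEUES; i++)
		printf("queue %d: %lu packets\n", i, txq[i].enqueued);
	return 0;
}

With more queues than busy flows, the per-queue locks end up mostly
uncontended, which is the effect multiqueue TX is after.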

But then again, I'm not sure it's worth it if the problem only
happens in the out-of-tree RT kernel.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-18 20:54 ` Andi Kleen
@ 2009-03-18 21:03   ` David Miller
  2009-03-18 21:10     ` Vernon Mauery
  2009-03-18 21:07   ` Vernon Mauery
  2009-03-19 12:59   ` Peter W. Morreale
  2 siblings, 1 reply; 43+ messages in thread
From: David Miller @ 2009-03-18 21:03 UTC (permalink / raw)
  To: andi; +Cc: vernux, netdev, linux-kernel, linux-rt-users

From: Andi Kleen <andi@firstfloor.org>
Date: Wed, 18 Mar 2009 21:54:37 +0100

> But then again I'm not sure it's  worth it if the problem only
> happens in out of tree RT.

The list of problems that only show up with the RT kernel seems to be
constantly increasing, which honestly is very surprising to me.

I don't understand why we even need to be concerned about this stuff
upstream, to be honest.

Please reproduce this in the vanilla kernel, then get back to us.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-18 20:54 ` Andi Kleen
  2009-03-18 21:03   ` David Miller
@ 2009-03-18 21:07   ` Vernon Mauery
  2009-03-18 21:45     ` Eilon Greenstein
  2009-03-19 12:59   ` Peter W. Morreale
  2 siblings, 1 reply; 43+ messages in thread
From: Vernon Mauery @ 2009-03-18 21:07 UTC (permalink / raw)
  To: Andi Kleen; +Cc: netdev, LKML, rt-users

Andi Kleen wrote:
> Vernon Mauery <vernux@us.ibm.com> writes:
>> So while this issue really hits -rt kernels hard, it has a real effect on
>> mainline kernels as well.  The contention of the spinlocks is amplified
>> when they get turned into rt-mutexes, which causes a double context switch.
> 
> The new adaptive spin heuristics that have been discussed some time
> ago didn't help? Unsurprisingly making locks a lot more expensive
> has drawbacks as you discovered.

Yes.  Well, while the adaptive spinlocks did great things for the
network throughput the last time I tested them, they also didn't quite
give the needed determinism in other areas.  It would be nice to be able to
target a handful of trouble locks with adaptive spinlocks.

Even so, though I saw dramatic throughput increases with adaptive
spinlocks, they would still be bound by this same lock contention
that I am seeing when the locks are true spinlocks.

>>    &list->lock#3:      24517307       24643791           0.71        1286.62      56516392.42       34834296       44904018           0.60        164.79    31314786.02
>>     -------------
>>    &list->lock#3       15596927    [<ffffffff812474da>] dev_queue_xmit+0x2ea/0x468
> 
> The real "fix" would be probably to use a multi queue capable NIC
> and a NIC driver that sets up multiple queues for TX (normally they
> only do for RX). Then cores or a set of cores (often the number
> of cores is larger than the number of NIC queues) could avoid this
> problem. Disadvantage: more memory use.

Hmmm.  So does either the netxen_nic or bnx2x driver support multiple
queues?  (That is the HW that I have access to right now.)  And do I
need to do anything to set them up?

> But then again I'm not sure it's  worth it if the problem only
> happens in out of tree RT.

The effects of the high contention are not quite so pronounced in the
vanilla kernel, but I think we are still limited by this lock.  In the
-rt kernel, it is obvious that the lock contention is causing lots of
trouble.

--Vernon

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-18 21:03   ` David Miller
@ 2009-03-18 21:10     ` Vernon Mauery
  2009-03-18 21:38       ` David Miller
  0 siblings, 1 reply; 43+ messages in thread
From: Vernon Mauery @ 2009-03-18 21:10 UTC (permalink / raw)
  To: David Miller; +Cc: andi, netdev, linux-kernel, linux-rt-users

David Miller wrote:
> From: Andi Kleen <andi@firstfloor.org>
> Date: Wed, 18 Mar 2009 21:54:37 +0100
> 
>> But then again I'm not sure it's  worth it if the problem only
>> happens in out of tree RT.
> 
> The list of problems that only show up with the RT kernel seems to be
> constantly increasing, but honestly is very surprising to me.
> 
> I don't understand why we even need to be concerned about this stuff
> upstream, to be honest.
> 
> Please reproduce this in the vanilla kernel, then get back to us.

Huh? The numbers that I posted *were* from the vanilla kernel. I ran
the 2.6.29-rc8 kernel with lock_stat enabled.  The lock contention
happens on the same lock in both vanilla and -rt; it just happens
to be more pronounced in the -rt kernel because of the double context
switches that the sleeping spinlock/rt-mutexes introduce.

--Vernon

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-18 21:10     ` Vernon Mauery
@ 2009-03-18 21:38       ` David Miller
  2009-03-18 21:49         ` Vernon Mauery
  2009-03-18 21:54         ` Gregory Haskins
  0 siblings, 2 replies; 43+ messages in thread
From: David Miller @ 2009-03-18 21:38 UTC (permalink / raw)
  To: vernux; +Cc: andi, netdev, linux-kernel, linux-rt-users

From: Vernon Mauery <vernux@us.ibm.com>
Date: Wed, 18 Mar 2009 14:10:33 -0700

> David Miller wrote:
> > From: Andi Kleen <andi@firstfloor.org>
> > Date: Wed, 18 Mar 2009 21:54:37 +0100
> > 
> >> But then again I'm not sure it's  worth it if the problem only
> >> happens in out of tree RT.
> > The list of problems that only show up with the RT kernel seems to be
> > constantly increasing, but honestly is very surprising to me.
> > I don't understand why we even need to be concerned about this stuff
> > upstream, to be honest.
> > Please reproduce this in the vanilla kernel, then get back to us.
> 
> Huh? The numbers that I posted *were* from the vanilla kernel. I ran
> the 2.6.29-rc8 kernel with lock_stat enabled.  The lock contention
> happens on the same lock in both vanilla and -rt, it just happens
> to be more pronounced in the -rt kernel because of the double context
> switches that the sleeping spinlock/rt-mutexes introduce.

And the double context switches are probably also why less
natural batching and locality are achieved in the RT kernel.

Isn't that true?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-18 21:07   ` Vernon Mauery
@ 2009-03-18 21:45     ` Eilon Greenstein
  2009-03-18 21:51       ` Vernon Mauery
  0 siblings, 1 reply; 43+ messages in thread
From: Eilon Greenstein @ 2009-03-18 21:45 UTC (permalink / raw)
  To: Vernon Mauery; +Cc: Andi Kleen, netdev, LKML, rt-users

On Wed, 2009-03-18 at 14:07 -0700, Vernon Mauery wrote:
> > 
> > The real "fix" would be probably to use a multi queue capable NIC
> > and a NIC driver that sets up multiple queues for TX (normally they
> > only do for RX). Then cores or a set of cores (often the number
> > of cores is larger than the number of NIC queues) could avoid this
> > problem. Disadvantage: more memory use.
> 
> Hmmm.  So do either the netxen_nic or bnx2x drivers support multiple
> queues?  (that is the HW that I have access to right now).  And do I
> need to do anything to set them up?
> 
The version of bnx2x in net-next supports multiple Tx queues (and Rx).  It
will open an equal number of Tx and Rx queues, up to 16 or the number of
cores in the system, whichever is smaller.  You can validate that all queues
are transmitting with "ethtool -S", which has per-queue statistics in that
version.

I hope it will help,
Eilon



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-18 21:38       ` David Miller
@ 2009-03-18 21:49         ` Vernon Mauery
  2009-03-19  1:02           ` David Miller
  2009-03-18 21:54         ` Gregory Haskins
  1 sibling, 1 reply; 43+ messages in thread
From: Vernon Mauery @ 2009-03-18 21:49 UTC (permalink / raw)
  To: David Miller; +Cc: andi, netdev, linux-kernel, linux-rt-users

David Miller wrote:
> From: Vernon Mauery <vernux@us.ibm.com>
> Date: Wed, 18 Mar 2009 14:10:33 -0700
> 
>> David Miller wrote:
>>> From: Andi Kleen <andi@firstfloor.org>
>>> Date: Wed, 18 Mar 2009 21:54:37 +0100
>>>
>>>> But then again I'm not sure it's  worth it if the problem only
>>>> happens in out of tree RT.
>>> The list of problems that only show up with the RT kernel seems to be
>>> constantly increasing, but honestly is very surprising to me.
>>> I don't understand why we even need to be concerned about this stuff
>>> upstream, to be honest.
>>> Please reproduce this in the vanilla kernel, then get back to us.
>> Huh? The numbers that I posted *were* from the vanilla kernel. I ran
>> the 2.6.29-rc8 kernel with lock_stat enabled.  The lock contention
>> happens on the same lock in both vanilla and -rt, it just happens
>> to be more pronounced in the -rt kernel because of the double context
>> switches that the sleeping spinlock/rt-mutexes introduce.
> 
> And the double context switches are probably also why less
> natural batching and locality are achieved in the RT kernel.
> 
> Isn't that true?

Yes, the double context switches surely hurt the temporal and
spatial locality of the vanilla codepath, but they also induce a
longer penalty for blocking on a lock -- instead of nanoseconds
or a few microseconds, the task gets delayed for tens of
microseconds.  So really, the -rt kernel has more to fix than
the vanilla kernel in this case, but any improvement in the lock
contention in the vanilla case would be magnified and would cause
dramatic improvements in the -rt kernel.

--Vernon

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-18 21:45     ` Eilon Greenstein
@ 2009-03-18 21:51       ` Vernon Mauery
  2009-03-18 21:59         ` Andi Kleen
  0 siblings, 1 reply; 43+ messages in thread
From: Vernon Mauery @ 2009-03-18 21:51 UTC (permalink / raw)
  To: Eilon Greenstein; +Cc: Andi Kleen, netdev, LKML, rt-users

Eilon Greenstein wrote:
> On Wed, 2009-03-18 at 14:07 -0700, Vernon Mauery wrote:
>>> The real "fix" would be probably to use a multi queue capable NIC
>>> and a NIC driver that sets up multiple queues for TX (normally they
>>> only do for RX). Then cores or a set of cores (often the number
>>> of cores is larger than the number of NIC queues) could avoid this
>>> problem. Disadvantage: more memory use.
>> Hmmm.  So do either the netxen_nic or bnx2x drivers support multiple
>> queues?  (that is the HW that I have access to right now).  And do I
>> need to do anything to set them up?
>>
> The version of bnx2x in net-next support multi Tx queues (and Rx). It
> will open an equal number of Tx and Rx queues up to 16 or the number of
> cores in the system. You can validate that all queues are transmitting
> with "ethtool -S" which has per queue statistics in that version.

Thanks.  I will test to see how this affects this lock contention the
next time the Broadcom hardware is available.

--Vernon

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-18 21:38       ` David Miller
  2009-03-18 21:49         ` Vernon Mauery
@ 2009-03-18 21:54         ` Gregory Haskins
  2009-03-19  1:03           ` David Miller
  2009-03-19  7:15           ` Evgeniy Polyakov
  1 sibling, 2 replies; 43+ messages in thread
From: Gregory Haskins @ 2009-03-18 21:54 UTC (permalink / raw)
  To: David Miller
  Cc: vernux, andi, netdev, linux-kernel, linux-rt-users, Patrick Mullaney


David Miller wrote:
> From: Vernon Mauery <vernux@us.ibm.com>
> Date: Wed, 18 Mar 2009 14:10:33 -0700
>
>   
>> David Miller wrote:
>>     
>>> From: Andi Kleen <andi@firstfloor.org>
>>> Date: Wed, 18 Mar 2009 21:54:37 +0100
>>>
>>>       
>>>> But then again I'm not sure it's  worth it if the problem only
>>>> happens in out of tree RT.
>>>>         
>>> The list of problems that only show up with the RT kernel seems to be
>>> constantly increasing, but honestly is very surprising to me.
>>> I don't understand why we even need to be concerned about this stuff
>>> upstream, to be honest.
>>> Please reproduce this in the vanilla kernel, then get back to us.
>>>       
>> Huh? The numbers that I posted *were* from the vanilla kernel. I ran
>> the 2.6.29-rc8 kernel with lock_stat enabled.  The lock contention
>> happens on the same lock in both vanilla and -rt, it just happens
>> to be more pronounced in the -rt kernel because of the double context
>> switches that the sleeping spinlock/rt-mutexes introduce.
>>     
>
> And the double context switches are probably also why less
> natural batching and locality are achieved in the RT kernel.
>
> Isn't that true?
>   

Note that -rt doesn't typically context-switch under contention anymore
since we introduced adaptive-locks.  Also note that the contention
against the lock is still contention, regardless of whether you have -rt
or not.  It's just that the slow path to handle the contended case for
-rt is more expensive than mainline.  However, once you have the
contention as stated, you have already lost.

We have observed the poster's findings ourselves in both mainline and
-rt.  I.e., that lock doesn't scale very well once you have more than a
handful of cores.  It's certainly a great area to look at for improving
the overall stack, IMO, as I believe there is quite a bit of headroom
left to be recovered that is buried there.

Regards,
-Greg



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-18 21:51       ` Vernon Mauery
@ 2009-03-18 21:59         ` Andi Kleen
  2009-03-18 22:19           ` Rick Jones
  0 siblings, 1 reply; 43+ messages in thread
From: Andi Kleen @ 2009-03-18 21:59 UTC (permalink / raw)
  To: Vernon Mauery; +Cc: Eilon Greenstein, Andi Kleen, netdev, LKML, rt-users

> Thanks.  I will test to see how this affects this lock contention the
> next time the broadcom hardware is available.

The other strategy to reduce lock contention here is to use TSO/GSO/USO.
With that, the lock has to be taken less often because fewer packets travel
down the stack: a large TSO/GSO send passes through the qdisc as a single
skb instead of one per MTU-sized frame.  I'm not sure how well that works
with netperf-style workloads though.  Using multiple TX queues is probably
better if you have capable hardware.  Or ideally both.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-18 21:59         ` Andi Kleen
@ 2009-03-18 22:19           ` Rick Jones
  0 siblings, 0 replies; 43+ messages in thread
From: Rick Jones @ 2009-03-18 22:19 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Vernon Mauery, Eilon Greenstein, netdev, LKML, rt-users

Andi Kleen wrote:
>>Thanks.  I will test to see how this affects this lock contention the
>>next time the broadcom hardware is available.
> 
> 
> The other strategy to reduce lock contention here is to use TSO/GSO/USO.
> With that the lock has to be taken less often because there are less packets
> travelling down the stack. I'm not sure how well that works with netperf style 
> workloads though. 

It all depends on what the user provides with the test-specific -m option for how
much data they shove into the socket each time "send" is called, and I suppose on
whether they use the test-specific -D option to set TCP_NODELAY in the case of a TCP
test when they have small values of -m.  E.g.:

netperf -t TCP_STREAM ... -- -m 64K
vs
netperf -t TCP_STREAM ... -- -m 1024
vs
netperf -t TCP_STREAM ... -- -m 1024 -D
vs
netperf -t UDP_STREAM ... -- -m 1024

etc etc.

If the netperf test is:

netperf -t TCP_RR ... -- -r 1   (single-byte request/response)

then TSO/GSO/USO won't matter at all, and probably still won't matter even if the
user has ./configure'd netperf with --enable-burst and does:

netperf -t TCP_RR ... -- -r 1 -b 64
or
netperf -t TCP_RR ... -- -r 1 -b 64 -D

which was basically what I was doing for the 32-core scaling stuff I posted about 
a few weeks ago.  That was running on multi-queue NICs, so looking at some of the 
profiles of the "no iptables" data may help show how big/small the problem is, 
keeping in mind that my runs (either the XFrame II runs, or the Chelsio T3C runs 
before them) had one queue per core in the system...and as such may be a best 
case scenario as far as lock contention on a per-queue basis goes.

ftp://ftp.netperf.org/

happy benchmarking,

rick jones

BTW, that setup went "poof" and had to go to other nefarious porpoises.  I'm not 
sure when I can recreate it, but I still have both the XFrame and T3C NICs for when
the HW comes free again.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-18 21:49         ` Vernon Mauery
@ 2009-03-19  1:02           ` David Miller
  0 siblings, 0 replies; 43+ messages in thread
From: David Miller @ 2009-03-19  1:02 UTC (permalink / raw)
  To: vernux; +Cc: andi, netdev, linux-kernel, linux-rt-users

From: Vernon Mauery <vernux@us.ibm.com>
Date: Wed, 18 Mar 2009 14:49:17 -0700

> Yes, the double context switches surely hurt the temporal and
> spatial locality of the vanilla codepath, but it also induces a
> longer penalty for blocking on a lock -- instead of a nanoseconds
> or a few microseconds, the task gets delayed for tens of
> microseconds.  So really, the -rt kernel has more to fix than
> the vanilla kernel in this case, but any improvement in the lock
> contention in the vanilla case would be magnified and would cause
> dramatic improvements in the -rt kernel.

Contention on a shared resource is not implicitly bad.

And with upstream spinlocks the cost is relatively low for a case like
this, where a thread of control goes in, holds the lock only long
enough to unlink a packet from the list, and immediately releases it.

The cost should be: a cache line move from cpu to cpu, an atomic lock,
a linked list unlink, a store, and a memory barrier.  And that's
all it is upstream.
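
For a rough sense of what that uncontended sequence costs, here is a
userspace sketch (made-up names; a userspace mutex is not a kernel
spinlock, so the absolute number will differ) that times a lock, a list
unlink, a store, and an unlock on a single core:

/* Build: gcc -O2 -pthread lockcost.c */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

struct node { struct node *next; };

int main(void)
{
	enum { N = 1000000 };
	static struct node nodes[2];
	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	struct node *head = &nodes[0];
	struct timespec a, b;

	nodes[0].next = &nodes[1];
	clock_gettime(CLOCK_MONOTONIC, &a);
	for (int i = 0; i < N; i++) {
		pthread_mutex_lock(&lock);
		struct node *it = head;		/* the unlink */
		head = it->next;
		it->next = NULL;		/* the store */
		pthread_mutex_unlock(&lock);
		it->next = head;		/* put it back for the next round */
		head = it;
	}
	clock_gettime(CLOCK_MONOTONIC, &b);

	double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
	printf("%.1f ns per lock/unlink/unlock (uncontended)\n", ns / N);
	return 0;
}
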

If the -rt kernel makes this 10 times more expensive, I really don't
see why that is an upstream concern at the current point in time.

That's the tradeoff: common situations where locks are held by
multiple threads with contention, but only for mere cycles, are
seemingly a lot more expensive in the -rt kernel.

I mean, for example, why doesn't the -rt kernel just spin for a little
while like a normal spinlock would instead of always entering that
expensive contended code path?  It would be a huge win here, but I
have yet to see discussion of changes in that area of the -rt kernel
to combat these effects.  Is it the networking that always has to
change for the sake of -rt? :-)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-18 21:54         ` Gregory Haskins
@ 2009-03-19  1:03           ` David Miller
  2009-03-19  1:13             ` Sven-Thorsten Dietrich
                               ` (2 more replies)
  2009-03-19  7:15           ` Evgeniy Polyakov
  1 sibling, 3 replies; 43+ messages in thread
From: David Miller @ 2009-03-19  1:03 UTC (permalink / raw)
  To: ghaskins; +Cc: vernux, andi, netdev, linux-kernel, linux-rt-users, pmullaney

From: Gregory Haskins <ghaskins@novell.com>
Date: Wed, 18 Mar 2009 17:54:04 -0400

> Note that -rt doesnt typically context-switch under contention anymore
> since we introduced adaptive-locks.  Also note that the contention
> against the lock is still contention, regardless of whether you have -rt
> or not.  Its just that the slow-path to handle the contended case for
> -rt is more expensive than mainline.  However, once you have the
> contention as stated, you have already lost.

First, contention is not implicitly a bad thing.

Second, if the -rt kernel is doing adaptive spinning I see no
reason why that adaptive spinning is not kicking in here to
make this problem just go away.

This lock is held for mere cycles, just to unlink an SKB from
the networking qdisc, and then it is immediately released.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-19  1:03           ` David Miller
@ 2009-03-19  1:13             ` Sven-Thorsten Dietrich
  2009-03-19  1:17               ` David Miller
  2009-03-19  3:48             ` Gregory Haskins
  2009-03-19 12:50             ` Peter W. Morreale
  2 siblings, 1 reply; 43+ messages in thread
From: Sven-Thorsten Dietrich @ 2009-03-19  1:13 UTC (permalink / raw)
  To: David Miller
  Cc: ghaskins, vernux, andi, netdev, linux-kernel, linux-rt-users, pmullaney

On Wed, 2009-03-18 at 18:03 -0700, David Miller wrote:
> From: Gregory Haskins <ghaskins@novell.com>
> Date: Wed, 18 Mar 2009 17:54:04 -0400
> 
> > Note that -rt doesnt typically context-switch under contention anymore
> > since we introduced adaptive-locks.  Also note that the contention
> > against the lock is still contention, regardless of whether you have -rt
> > or not.  Its just that the slow-path to handle the contended case for
> > -rt is more expensive than mainline.  However, once you have the
> > contention as stated, you have already lost.
> 
> First, contention is not implicitly a bad thing.
> 

It's a bad thing when it does not scale.

> Second, if the -rt kernel is doing adaptive spinning I see no
> reason why that adaptive spinning is not kicking in here to
> make this problem just go away.
> 

If only the first of N contending threads gets to spin, threads 2..N would
context-switch.


> This lock is held for mere cycles, just to unlink an SKB from
> the networking qdisc, and then it is immediately released.

For very short hold times and heavy contention, as well as for
scalability, the solution may lie in a tunable spinner count and an
adaptive spinner time-out.
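
A minimal userspace sketch of that idea, with the spin budget as the
tunable (an illustration, not the -rt implementation; the real adaptive
locks also check whether the lock owner is still running on a CPU, which
is omitted here):

/* Spin for a bounded budget, then fall back to a blocking acquire.
 * Build: gcc -O2 -pthread adaptive.c */
#include <pthread.h>
#include <stdio.h>

#define SPIN_BUDGET 200		/* the tunable "spinner time-out" */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void adaptive_lock(pthread_mutex_t *l)
{
	for (int spins = 0; spins < SPIN_BUDGET; spins++) {
		if (pthread_mutex_trylock(l) == 0)
			return;			/* got it while spinning */
		/* the real thing would also stop spinning if the owner
		 * is no longer running on a CPU */
	}
	pthread_mutex_lock(l);			/* slow path: block/sleep */
}

int main(void)
{
	adaptive_lock(&lock);
	puts("acquired");
	pthread_mutex_unlock(&lock);
	return 0;
}
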

Sven



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-19  1:13             ` Sven-Thorsten Dietrich
@ 2009-03-19  1:17               ` David Miller
  2009-03-19  1:43                 ` Sven-Thorsten Dietrich
  0 siblings, 1 reply; 43+ messages in thread
From: David Miller @ 2009-03-19  1:17 UTC (permalink / raw)
  To: sven
  Cc: ghaskins, vernux, andi, netdev, linux-kernel, linux-rt-users, pmullaney

From: Sven-Thorsten Dietrich <sven@thebigcorporation.com>
Date: Wed, 18 Mar 2009 18:13:11 -0700

> On Wed, 2009-03-18 at 18:03 -0700, David Miller wrote:
> > From: Gregory Haskins <ghaskins@novell.com>
> > Date: Wed, 18 Mar 2009 17:54:04 -0400
> > 
> > > Note that -rt doesnt typically context-switch under contention anymore
> > > since we introduced adaptive-locks.  Also note that the contention
> > > against the lock is still contention, regardless of whether you have -rt
> > > or not.  Its just that the slow-path to handle the contended case for
> > > -rt is more expensive than mainline.  However, once you have the
> > > contention as stated, you have already lost.
> > 
> > First, contention is not implicitly a bad thing.
>
> Its a bad thing when it does not scale.

You have only one pipe to shove packets into in this case, and for
your workload multiple cpus are going to be trying to stuff a single
packet at a time from a single UDP send request.

There is no added parallelism you can create for that kind of workload
on that kind of hardware.

It is one of the issues addressed by multiqueue.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-19  1:17               ` David Miller
@ 2009-03-19  1:43                 ` Sven-Thorsten Dietrich
  2009-03-19  1:54                   ` David Miller
  2009-03-19 13:45                   ` High contention on the sk_buff_head.lock Andi Kleen
  0 siblings, 2 replies; 43+ messages in thread
From: Sven-Thorsten Dietrich @ 2009-03-19  1:43 UTC (permalink / raw)
  To: David Miller
  Cc: ghaskins, vernux, andi, netdev, linux-kernel, linux-rt-users, pmullaney

On Wed, 2009-03-18 at 18:17 -0700, David Miller wrote:
> From: Sven-Thorsten Dietrich <sven@thebigcorporation.com>
> Date: Wed, 18 Mar 2009 18:13:11 -0700
> 
> > On Wed, 2009-03-18 at 18:03 -0700, David Miller wrote:
> > > From: Gregory Haskins <ghaskins@novell.com>
> > > Date: Wed, 18 Mar 2009 17:54:04 -0400
> > > 
> > > > Note that -rt doesnt typically context-switch under contention anymore
> > > > since we introduced adaptive-locks.  Also note that the contention
> > > > against the lock is still contention, regardless of whether you have -rt
> > > > or not.  Its just that the slow-path to handle the contended case for
> > > > -rt is more expensive than mainline.  However, once you have the
> > > > contention as stated, you have already lost.
> > > 
> > > First, contention is not implicitly a bad thing.
> >
> > Its a bad thing when it does not scale.
> 
> You have only one pipe to shove packets into in this case, and for
> your workload multiple cpus are going to be trying to stuff a single
> packet at a time from a single UDP send request.
> 
> There is no added parallelism you can create for that kind of workload
> on that kind of hardware.
> 

Do we have to rule out per-CPU queues that aggregate into a master
queue in a batch-wise manner?

I speculate that might reduce the lock contention by a factor of NCPUs.
I cannot say whether this would be enough to offset the consequent
overhead penalty.

> It is one of the issues addressed by multiqueue.

Properly tuned adaptive locks would theoretically perform
near-optimally compared to the mainline case, and would provide better
CPU utilization on very large parallel architectures. 

Sven


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-19  1:43                 ` Sven-Thorsten Dietrich
@ 2009-03-19  1:54                   ` David Miller
  2009-03-19  5:49                       ` Eric Dumazet
  2009-03-19 13:45                   ` High contention on the sk_buff_head.lock Andi Kleen
  1 sibling, 1 reply; 43+ messages in thread
From: David Miller @ 2009-03-19  1:54 UTC (permalink / raw)
  To: sven
  Cc: ghaskins, vernux, andi, netdev, linux-kernel, linux-rt-users, pmullaney

From: Sven-Thorsten Dietrich <sven@thebigcorporation.com>
Date: Wed, 18 Mar 2009 18:43:27 -0700

> Do we have to rule-out per-CPU queues, that aggregate into a master
> queue in a batch-wise manner? 

That would violate the properties and characteristics expected of
the packet scheduler, w.r.t. fairness, rate limiting, etc.

The only legal situation where we can parallelize to a single device
is where only the most trivial packet scheduler is attached to
the device and the device is multiqueue, and that is exactly what
we do right now.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-19  1:03           ` David Miller
  2009-03-19  1:13             ` Sven-Thorsten Dietrich
@ 2009-03-19  3:48             ` Gregory Haskins
  2009-03-19  5:38               ` David Miller
  2009-03-19 12:50             ` Peter W. Morreale
  2 siblings, 1 reply; 43+ messages in thread
From: Gregory Haskins @ 2009-03-19  3:48 UTC (permalink / raw)
  To: David Miller
  Cc: vernux, andi, netdev, linux-kernel, linux-rt-users, pmullaney


David Miller wrote:
> From: Gregory Haskins <ghaskins@novell.com>
> Date: Wed, 18 Mar 2009 17:54:04 -0400
>
>   
>> Note that -rt doesnt typically context-switch under contention anymore
>> since we introduced adaptive-locks.  Also note that the contention
>> against the lock is still contention, regardless of whether you have -rt
>> or not.  Its just that the slow-path to handle the contended case for
>> -rt is more expensive than mainline.  However, once you have the
>> contention as stated, you have already lost.
>>     
>
> First, contention is not implicitly a bad thing.
>   
However, when the contention in question is your top bottleneck, even
small improvements have the potential to yield large performance gains. :)

> Second, if the -rt kernel is doing adaptive spinning I see no
> reason why that adaptive spinning is not kicking in here
It does.  Things would be *much* worse if it weren't.

>  to
> make this problem just go away.
>   
Well, "go away" is a relative term.  Before adaptive-locks, the box was
heavily context-switching under a network workload like we are
discussing, and that is where much of the performance was lost. 
Adaptive-locks mitigate that particular problem so that spinlock_t
clients now more closely model mainline behavior (i.e. spin under
contention, at least when they can) and this brought -rt much closer to
what you expect for performance from mainline.

However, in -rt the contention path is, by necessity, still moderately
more heavyweight than a simple mainline spin (PI, etc.), so any
contention will still hurt relatively more.   And obviously
adaptive-locks do nothing to address the contention itself.  Therefore,
as long as the contention remains it will have a higher impact in -rt. 
But make no mistake: the contention can (and I believe does) affect both
trees.

To see this in action, try taking a moderately large smp system (8-way+)
and scaling the number of flows.  At 1 flow the stack is generally quite
capable of maintaining line-rate (at least up to GigE).  With two flows
this should result in each achieving roughly 50% of the line-rate, for a
sum total of line-rate.  But as you add flows to the equation, the
aggregate bandwidth typically starts to drop off to be something less
than line-rate (many times its actually much less).  And when this
happens, the contention in question is usually at the top of the charts
(thus all the interest in this thread).

> This lock is held for mere cycles, just to unlink an SKB from
> the networking qdisc, and then it is immediately released.
>   
To clarify: I haven't been looking at the stack code since last summer,
so some things may have changed since then.  However, the issue back
then from my perspective was the general backpressure in the qdisc
subsystem (e.g. dev->queue_lock), not just the skb head.lock per se. 
This dev->queue_lock can be held for quite a long time, and the
mechanism in general scales poorly as the fabric speeds and core-counts
increase.  (It is arguable whether you would even want another buffering
layer on any reasonably fast interconnect, anyway.  It ultimately just
destabilizes the flows and is probably best left to the underlying
hardware which will typically have the proper facilities for
shaping/limiting, etc. However, that is a topic for another discussion ;).)

One approach that we found yielded some significant improvements here
was to bypass the qdisc outright and do per-cpu lockless queuing.  One
thread/core could then simply aggregate the per-cpu buffers with only
minor contention between the egress (consuming) core and the producer
thread.  I realize that this solution as-is may not be feasible for
production use.  However, it did serve to prove the theory that backing
off cache-line pressures that scale with the core count *did* help the
stack make more forward progress per unit time (e.g. higher throughput)
as the number of flows increased.  The idea was more or less to refactor
the egress/TX path to be more per-cpu, similar to the methodology of the
ingress/RX side.

You could probably adapt this same concept to work harmoniously with the
qdisc infrastructure (though as previously indicated, I am not sure if
we would *want* to ;)).  Something to consider, anyway.

Regards,
-Greg





^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-19  3:48             ` Gregory Haskins
@ 2009-03-19  5:38               ` David Miller
  2009-03-19 12:42                 ` Gregory Haskins
  0 siblings, 1 reply; 43+ messages in thread
From: David Miller @ 2009-03-19  5:38 UTC (permalink / raw)
  To: ghaskins; +Cc: vernux, andi, netdev, linux-kernel, linux-rt-users, pmullaney

From: Gregory Haskins <ghaskins@novell.com>
Date: Wed, 18 Mar 2009 23:48:46 -0400

> To see this in action, try taking a moderately large smp system
> (8-way+) and scaling the number of flows.

I can maintain line-rate over 10GB with a 64-cpu box.  It's not
a problem.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-19  1:54                   ` David Miller
@ 2009-03-19  5:49                       ` Eric Dumazet
  0 siblings, 0 replies; 43+ messages in thread
From: Eric Dumazet @ 2009-03-19  5:49 UTC (permalink / raw)
  To: David Miller
  Cc: sven, ghaskins, vernux, andi, netdev, linux-kernel,
	linux-rt-users, pmullaney

David Miller wrote:
> From: Sven-Thorsten Dietrich <sven@thebigcorporation.com>
> Date: Wed, 18 Mar 2009 18:43:27 -0700
> 
>> Do we have to rule-out per-CPU queues, that aggregate into a master
>> queue in a batch-wise manner? 
> 
> That would violate the properties and characteristics expected by
> the packet scheduler, wrt. to fair based fairness, rate limiting,
> etc.
> 
> The only legal situation where we can parallelize to single device
> is where only the most trivial packet scheduler is attached to
> the device and the device is multiqueue, and that is exactly what
> we do right now.

I agree with you David.

Still, there is room for improvement, since:

1) The default qdisc is pfifo_fast.  This beast uses three sk_buff_head
   (96 bytes) where it could use three smaller list_head
   (3 * 16 = 48 bytes on x86_64), assuming sizeof(spinlock_t) is only
   4 bytes -- it is more than that in various configurations
   (LOCKDEP, ...).  (A size sketch follows after this list.)

2) The struct Qdisc layout could be better, putting read-mostly fields
   at the beginning of the structure (i.e. move 'dev_queue',
   'next_sched', reshape_fail, u32_node, __parent, ...).

   'struct gnet_stats_basic' has a 32-bit hole.

   'gnet_stats_queue' could be split, at least in Qdisc, so that three
   seldom-used fields (drops, requeues, overlimits) go in a different
   cache line.

   gnet_stats_rate_est might also be moved to a 'not very used' cache
   line, if I am not mistaken?

3) In a stress situation, CPU A queues an skb to a sk_buff_head, but
   CPU B dequeues it to feed the device, involving an expensive cache
   line miss on skb.{next|prev} (to set them to NULL).

   We could:
      use a special dequeue op that doesn't touch skb.{next|prev}, and
      eventually set next/prev to NULL after q.lock is released.

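The size arithmetic in point 1, shown with stand-in definitions rather than
the kernel headers (the exact totals depend on spinlock_t size and debug
options, as noted above; the structs below only mimic the shapes involved):

/* Build: gcc -O2 sizes.c */
#include <stdio.h>

struct fake_list_head {			/* like struct list_head */
	struct fake_list_head *next, *prev;
};

struct fake_sk_buff_head {		/* like struct sk_buff_head */
	void		*next, *prev;
	unsigned int	qlen;
	int		lock;		/* spinlock_t: 4 bytes without LOCKDEP */
};

int main(void)
{
	printf("3 x sk_buff_head-like: %zu bytes\n",
	       3 * sizeof(struct fake_sk_buff_head));
	printf("3 x list_head-like:    %zu bytes\n",
	       3 * sizeof(struct fake_list_head));
	return 0;
}
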



^ permalink raw reply	[flat|nested] 43+ messages in thread


* Re: High contention on the sk_buff_head.lock
  2009-03-19  5:49                       ` Eric Dumazet
@ 2009-03-19  5:58                       ` David Miller
  2009-03-19 14:04                           ` Eric Dumazet
  -1 siblings, 1 reply; 43+ messages in thread
From: David Miller @ 2009-03-19  5:58 UTC (permalink / raw)
  To: dada1
  Cc: sven, ghaskins, vernux, andi, netdev, linux-kernel,
	linux-rt-users, pmullaney

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Thu, 19 Mar 2009 06:49:19 +0100

> Still, there is room for improvements, since :
> 
> 1) default qdisc is pfifo_fast. This beast uses three sk_buff_head (96 bytes)
>   where it could use 3 smaller list_head (3 * 16 = 48 bytes on x86_64)
> 
>  (assuming sizeof(spinlock_t) is only 4 bytes, but it's more than that
>  on various situations (LOCKDEP, ...)

I already plan on doing this: skb->{next,prev} will be replaced with a
list_head and nearly all of the sk_buff_head usage will simply
disappear.  It's a lot of work because every piece of SKB queue
handling code has to be sanitized to only use the interfaces in
linux/skbuff.h, and lots of extremely ugly code like the PPP
defragmenter makes many non-trivial direct skb->{next,prev}
manipulations.

> 2) struct Qdisc layout could be better, letting read mostly fields
>    at beginning of structure. (ie move 'dev_queue', 'next_sched', reshape_fail,
>    u32_node, __parent, ...)

I have no problem with your struct layout changes; submit them formally.

> 3) In stress situation a CPU A queues a skb to a sk_buff_head, but a CPU B
>    dequeues it to feed device, involving an expensive cache line miss
>    on the skb.{next|prev} (to set them to NULL)
> 
>    We could:
>       Use a special dequeue op that doesnt touch skb.{next|prev}
>    Eventually set next/prev to NULL after q.lock is released

You absolutely can't do this, as it would break GSO/GRO.  The whole
transmit path is littered with checks of skb->next being NULL for the
purposes of segmentation handling.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-18 21:54         ` Gregory Haskins
  2009-03-19  1:03           ` David Miller
@ 2009-03-19  7:15           ` Evgeniy Polyakov
  1 sibling, 0 replies; 43+ messages in thread
From: Evgeniy Polyakov @ 2009-03-19  7:15 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: David Miller, vernux, andi, netdev, linux-kernel, linux-rt-users,
	Patrick Mullaney

On Wed, Mar 18, 2009 at 05:54:04PM -0400, Gregory Haskins (ghaskins@novell.com) wrote:
> Note that -rt doesnt typically context-switch under contention anymore
> since we introduced adaptive-locks.  Also note that the contention
> against the lock is still contention, regardless of whether you have -rt
> or not.  Its just that the slow-path to handle the contended case for
> -rt is more expensive than mainline.  However, once you have the
> contention as stated, you have already lost.
> 
> We have observed the posters findings ourselves in both mainline and
> -rt.  I.e. That lock doesnt scale very well once you have more than a
> handful of cores.  It's certainly a great area to look at for improving
> the overall stack, IMO, as I believe there is quite a bit of headroom
> left to be recovered that is buried there.

Something tells me that you observe skb head lock contention because
you stressed the network while everything else on that machine was
left idle.  If you start an IO stress test instead, will __queue_lock
contention be of the same order of magnitude?  Or if you run as many
processes as there are skbs, each racing for the scheduler, will the
run-queue lock show up in the stats?

I believe the answer is yes to all of these questions.
You stressed one subsystem and it showed up in the statistics.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-19  5:38               ` David Miller
@ 2009-03-19 12:42                 ` Gregory Haskins
  2009-03-19 20:52                   ` David Miller
  0 siblings, 1 reply; 43+ messages in thread
From: Gregory Haskins @ 2009-03-19 12:42 UTC (permalink / raw)
  To: David Miller
  Cc: vernux, andi, netdev, linux-kernel, linux-rt-users, pmullaney


David Miller wrote:
> From: Gregory Haskins <ghaskins@novell.com>
> Date: Wed, 18 Mar 2009 23:48:46 -0400
>
>   
>> To see this in action, try taking a moderately large smp system
>> (8-way+) and scaling the number of flows.
>>     
>
> I can maintain line-rate over 10GB with a 64-cpu box.
Oh man, I am jealous of that 64-way :)

How many simultaneous flows?  What hardware?  What qdisc and other
config do you use?  MTU?  I cannot replicate such results on 10GB even
with much smaller cpu counts.

On my test rig here, I have a 10GB link connected by crossover between
two 8-core boxes.  Running one unidirectional TCP flow typically tops
out at ~5.5Gb/s on 2.6.29-rc8.  Granted, we are using MTU=1500, which
in and of itself is part of the upper limit.  However, that result in
and of itself isn't a problem, per se.  What is a problem is that the
aggregate bandwidth drops as we scale the number of flows.  I would like
to understand how to make this better, if possible, and perhaps I can
learn something from your setup.
>   It's not
> a problem.
>   

To clarify terms, we are not saying "the stack performs inadequately".
What we are saying here is that analysis of our workloads and of the
current stack indicates that we are io-bound, and that this particular
locking architecture in the qdisc subsystem is the apparent top gating
factor keeping us from going faster.  Therefore we are really asking
"how can we make it even better"?  This is not a bad question to ask in
general, would you agree?

To vet our findings, we built the prototype I mentioned in the last
mail, where we substituted the single queue and queue_lock with a
per-cpu, lockless queue.  This meant each cpu could submit work
independently of the others with substantially reduced contention.
More importantly, it eliminated the scaling of RFO pressure on a single
cache line for the queue-lock.  When we did this, we got significant
increases in aggregate throughput (somewhere on the order of 6%-25%
depending on workload, but this was last summer so I am a little
hazy on the exact numbers).
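
The prototype itself is not posted here, but the shape of the idea is
roughly the following (a minimal userspace-style sketch with made-up
names, not our actual code): each cpu owns a single-producer ring, so
enqueue touches no shared lock, and only the drain side walks all rings.

/* Illustrative sketch only: one single-producer/single-consumer ring
 * per cpu.  The producer cpu writes only its own head and slots; the
 * single drain context (think net-tx softirq) writes only the tails.
 */
#include <stdatomic.h>
#include <stddef.h>

#define NR_CPUS  8
#define RING_SZ  256                      /* power of two */

struct pkt;                               /* stand-in for struct sk_buff */

struct cpu_ring {
        _Atomic unsigned int head;        /* advanced by the owning cpu  */
        _Atomic unsigned int tail;        /* advanced by the drainer     */
        struct pkt *slot[RING_SZ];
};

static struct cpu_ring rings[NR_CPUS];

/* Runs on the submitting cpu: no shared lock, no cross-cpu stores. */
static int ring_enqueue(int cpu, struct pkt *p)
{
        struct cpu_ring *r = &rings[cpu];
        unsigned int head = atomic_load_explicit(&r->head, memory_order_relaxed);
        unsigned int tail = atomic_load_explicit(&r->tail, memory_order_acquire);

        if (head - tail == RING_SZ)
                return -1;                /* full: caller drops or backs off */
        r->slot[head & (RING_SZ - 1)] = p;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return 0;
}

/* Runs only in the single drain context. */
static struct pkt *ring_dequeue(struct cpu_ring *r)
{
        unsigned int tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        unsigned int head = atomic_load_explicit(&r->head, memory_order_acquire);
        struct pkt *p;

        if (tail == head)
                return NULL;              /* empty */
        p = r->slot[tail & (RING_SZ - 1)];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return p;
}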

So you had said something to the effect of "contention isn't implicitly
a bad thing".  I agree, to a point; at least insofar as contention
cannot always be avoided.  Ultimately we only have one resource in this
equation: the phy-link in question.  So naturally multiple flows
targeted for that link will contend for it.

But the important thing to note here is that there are different kinds
of contention.  And contention against spinlocks *is* generally bad, for
multiple reasons.  It not only affects the cores under contention, but
it affects all cores that exist in the same coherency domain.  IMO it
should be avoided whenever possible.

So I am not saying our per-cpu solution is the answer.  But what I am
saying is that we found that an architecture that doesn't piggyback all
flows onto a single spinlock does have the potential to unleash even
more Linux-networking fury :)

I haven't really been following the latest developments in netdev, but
if I understand correctly, part of what we are talking about here would
be addressed by the new MQ stuff?  And if I also understand that
correctly, support for MQ is dependent on the hardware beneath it?  If
so, I wonder if we could apply some of the ideas I presented earlier for
making "soft MQ" with a lockless queue per flow, or something like that?
Thoughts?

-Greg




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-19  1:03           ` David Miller
  2009-03-19  1:13             ` Sven-Thorsten Dietrich
  2009-03-19  3:48             ` Gregory Haskins
@ 2009-03-19 12:50             ` Peter W. Morreale
  2 siblings, 0 replies; 43+ messages in thread
From: Peter W. Morreale @ 2009-03-19 12:50 UTC (permalink / raw)
  To: David Miller
  Cc: ghaskins, vernux, andi, netdev, linux-kernel, linux-rt-users, pmullaney

On Wed, 2009-03-18 at 18:03 -0700, David Miller wrote:
> From: Gregory Haskins <ghaskins@novell.com>
> Date: Wed, 18 Mar 2009 17:54:04 -0400
> 
> > Note that -rt doesnt typically context-switch under contention anymore
> > since we introduced adaptive-locks.  Also note that the contention
> > against the lock is still contention, regardless of whether you have -rt
> > or not.  Its just that the slow-path to handle the contended case for
> > -rt is more expensive than mainline.  However, once you have the
> > contention as stated, you have already lost.
> 
> First, contention is not implicitly a bad thing.
> 
> Second, if the -rt kernel is doing adaptive spinning I see no
> reason why that adaptive spinning is not kicking in here to
> make this problem just go away.

The basic 'problem' with comparing RT adaptive spinning to non-rt
spinlocks is that if the lock owner is !oncpu, all spinners must break
and go to sleep, otherwise we (potentially) deadlock.  This constraint
does not exist for non-rt spinners.
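
In other words, the -rt contended slow path has to look something like
this (simplified pseudocode with made-up helper names, not the actual
rtmutex code):

/* Simplified pseudocode; helper names are invented for illustration.
 * The constraint: only spin while the owner is actually on a cpu,
 * otherwise block, since spinning on a sleeping owner can deadlock.
 */
while (!try_acquire(lock)) {
        struct task_struct *owner = lock_owner(lock);

        if (!owner || !task_is_oncpu(owner)) {
                block_on(lock);           /* owner not running: sleep */
                continue;
        }
        cpu_relax();                      /* owner running: adaptive spin */
}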

Best,
-PWM


> 
> This lock is held for mere cycles, just to unlink an SKB from
> the networking qdisc, and then it is immediately released.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-18 20:54 ` Andi Kleen
  2009-03-18 21:03   ` David Miller
  2009-03-18 21:07   ` Vernon Mauery
@ 2009-03-19 12:59   ` Peter W. Morreale
  2009-03-19 13:36     ` Peter W. Morreale
  2009-03-19 13:46     ` Andi Kleen
  2 siblings, 2 replies; 43+ messages in thread
From: Peter W. Morreale @ 2009-03-19 12:59 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Vernon Mauery, netdev, LKML, rt-users

On Wed, 2009-03-18 at 21:54 +0100, Andi Kleen wrote:
> Vernon Mauery <vernux@us.ibm.com> writes:
> >

> 
> But then again I'm not sure it's  worth it if the problem only
> happens in out of tree RT.
> 

Heh.  That darn bastard child again, eh?  :-)


Recall that adaptive spin, developed in RT, is now in mainline.

Best,
-PWM



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-19 12:59   ` Peter W. Morreale
@ 2009-03-19 13:36     ` Peter W. Morreale
  2009-03-19 13:46     ` Andi Kleen
  1 sibling, 0 replies; 43+ messages in thread
From: Peter W. Morreale @ 2009-03-19 13:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Vernon Mauery, netdev, LKML, rt-users

On Thu, 2009-03-19 at 06:59 -0600, Peter W. Morreale wrote:
> On Wed, 2009-03-18 at 21:54 +0100, Andi Kleen wrote:
> > Vernon Mauery <vernux@us.ibm.com> writes:
> > >
> 
> > 
> > But then again I'm not sure it's  worth it if the problem only
> > happens in out of tree RT.
> > 
> 
> Heh.  That darn bastard child again, eh?  :-)
> 
> 
> Recall that adaptive spin,developed in RT, is now in mainline. 
> 

Err..  correction... *about* to be in mainline.  The only point I'm
making is that RT does, implicitly if not explicitly, contribute to
mainline.  It's not a one-way street.

Best,
-PWM

> Best,
> -PWM
> 
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-19  1:43                 ` Sven-Thorsten Dietrich
  2009-03-19  1:54                   ` David Miller
@ 2009-03-19 13:45                   ` Andi Kleen
  1 sibling, 0 replies; 43+ messages in thread
From: Andi Kleen @ 2009-03-19 13:45 UTC (permalink / raw)
  To: Sven-Thorsten Dietrich
  Cc: David Miller, ghaskins, vernux, andi, netdev, linux-kernel,
	linux-rt-users, pmullaney

> Do we have to rule-out per-CPU queues, that aggregate into a master
> queue in a batch-wise manner? 

TSO/GSO/USO does that already, in a very specific way.  It batches
packets into larger packets, so fewer per-packet locks need to be taken.
It was unclear whether that was used in this workload or not.
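
As a back-of-the-envelope illustration (assumed, typical numbers) of
how much that batching amortizes the queue lock:

#include <stdio.h>

int main(void)
{
        int mtu = 1500, hdrs = 52;           /* IPv4 + TCP + timestamps */
        int payload_per_seg = mtu - hdrs;    /* ~1448 bytes per segment */
        int gso_size = 64 * 1024;            /* one large TSO frame     */

        /* One enqueue/dequeue (one lock round trip) covers this many
         * on-wire segments instead of just one. */
        printf("segments per lock round trip: ~%d\n",
               gso_size / payload_per_seg);
        return 0;
}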

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-19 12:59   ` Peter W. Morreale
  2009-03-19 13:36     ` Peter W. Morreale
@ 2009-03-19 13:46     ` Andi Kleen
  1 sibling, 0 replies; 43+ messages in thread
From: Andi Kleen @ 2009-03-19 13:46 UTC (permalink / raw)
  To: Peter W. Morreale; +Cc: Andi Kleen, Vernon Mauery, netdev, LKML, rt-users

> Recall that adaptive spin,developed in RT, is now in mainline. 

For mutexes, but not for spinlocks. And qdisc uses spinlocks.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH] net: reorder struct Qdisc for better SMP performance
  2009-03-19  5:58                       ` David Miller
@ 2009-03-19 14:04                           ` Eric Dumazet
  0 siblings, 0 replies; 43+ messages in thread
From: Eric Dumazet @ 2009-03-19 14:04 UTC (permalink / raw)
  To: David Miller
  Cc: sven, ghaskins, vernux, andi, netdev, linux-kernel,
	linux-rt-users, pmullaney

David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Thu, 19 Mar 2009 06:49:19 +0100
> 
>> Still, there is room for improvements, since :
>>
>> 1) default qdisc is pfifo_fast. This beast uses three sk_buff_head (96 bytes)
>>   where it could use 3 smaller list_head (3 * 16 = 48 bytes on x86_64)
>>
>>  (assuming sizeof(spinlock_t) is only 4 bytes, but it's more than that
>>  on various situations (LOCKDEP, ...)
> 
> I already plan on doing this, skb->{next,prev} will be replaced with a
> list_head and nearly all of the sk_buff_head usage will simply
> disappear.  It's a lot of work because every piece of SKB queue
> handling code has to be sanitized to only use the interfaces in
> linux/skbuff.h and lots of extremely ugly code like the PPP
> defragmenter make many non-trivial direct skb->{next,prev}
> manipulations.
> 
>> 2) struct Qdisc layout could be better, letting read mostly fields
>>    at beginning of structure. (ie move 'dev_queue', 'next_sched', reshape_fail,
>>    u32_node, __parent, ...)
> 
> I have no problem with your struct layout changes, submit it formally.

OK here is the layout change then ;)

Thank you

[PATCH] net: reorder struct Qdisc for better SMP performance

dev_queue_xmit() needs to dirty the fields "state", "q", "bstats" and "qstats".

On x86_64, they currently span three cache lines, causing more cache line
ping-pongs than necessary and making the queue spinlock be held longer.

We can reduce this to one cache line by grouping all read-mostly fields
at the beginning of the structure.  (Or should I say, all highly modified
fields at the end :) )

Before patch :

offsetof(struct Qdisc, state)=0x38
offsetof(struct Qdisc, q)=0x48
offsetof(struct Qdisc, bstats)=0x80
offsetof(struct Qdisc, qstats)=0x90
sizeof(struct Qdisc)=0xc8

After patch :

offsetof(struct Qdisc, state)=0x80
offsetof(struct Qdisc, q)=0x88
offsetof(struct Qdisc, bstats)=0xa0
offsetof(struct Qdisc, qstats)=0xac
sizeof(struct Qdisc)=0xc0

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/gen_stats.h |    2 +-
 include/net/sch_generic.h |   21 ++++++++++++---------
 2 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/include/linux/gen_stats.h b/include/linux/gen_stats.h
index 13f4e74..0ffa41d 100644
--- a/include/linux/gen_stats.h
+++ b/include/linux/gen_stats.h
@@ -22,7 +22,7 @@ struct gnet_stats_basic
 {
 	__u64	bytes;
 	__u32	packets;
-};
+} __attribute__ ((packed));
 
 /**
  * struct gnet_stats_rate_est - rate estimator
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f8c4742..b0ae100 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -48,18 +48,10 @@ struct Qdisc
 	int			padded;
 	struct Qdisc_ops	*ops;
 	struct qdisc_size_table	*stab;
+	struct list_head	list;
 	u32			handle;
 	u32			parent;
 	atomic_t		refcnt;
-	unsigned long		state;
-	struct sk_buff		*gso_skb;
-	struct sk_buff_head	q;
-	struct netdev_queue	*dev_queue;
-	struct Qdisc		*next_sched;
-	struct list_head	list;
-
-	struct gnet_stats_basic	bstats;
-	struct gnet_stats_queue	qstats;
 	struct gnet_stats_rate_est	rate_est;
 	int			(*reshape_fail)(struct sk_buff *skb,
 					struct Qdisc *q);
@@ -70,6 +62,17 @@ struct Qdisc
 	 * and it will live until better solution will be invented.
 	 */
 	struct Qdisc		*__parent;
+	struct netdev_queue	*dev_queue;
+	struct Qdisc		*next_sched;
+
+	struct sk_buff		*gso_skb;
+	/*
+	 * For performance sake on SMP, we put highly modified fields at the end
+	 */
+	unsigned long		state;
+	struct sk_buff_head	q;
+	struct gnet_stats_basic bstats;
+	struct gnet_stats_queue	qstats;
 };
 
 struct Qdisc_class_ops
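
The before/after numbers quoted in the changelog can be reproduced with
a trivial offsetof printout; the snippet below only illustrates the
technique on a toy struct (hypothetical fields), not against the real
kernel headers:

#include <stdio.h>
#include <stddef.h>

struct toy_qdisc {
        int             flags;
        long            read_mostly[14];  /* cold fields up front        */
        unsigned long   state;            /* hot, dirtied on every xmit  */
        void           *q;
        unsigned long   bstats[2];
        unsigned long   qstats[2];
};

int main(void)
{
        printf("offsetof(state)  = %#zx\n", offsetof(struct toy_qdisc, state));
        printf("offsetof(q)      = %#zx\n", offsetof(struct toy_qdisc, q));
        printf("offsetof(bstats) = %#zx\n", offsetof(struct toy_qdisc, bstats));
        printf("sizeof           = %#zx\n", sizeof(struct toy_qdisc));
        return 0;
}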


^ permalink raw reply related	[flat|nested] 43+ messages in thread


* Re: High contention on the sk_buff_head.lock
  2009-03-19 12:42                 ` Gregory Haskins
@ 2009-03-19 20:52                   ` David Miller
  0 siblings, 0 replies; 43+ messages in thread
From: David Miller @ 2009-03-19 20:52 UTC (permalink / raw)
  To: ghaskins; +Cc: vernux, andi, netdev, linux-kernel, linux-rt-users, pmullaney

From: Gregory Haskins <ghaskins@novell.com>
Date: Thu, 19 Mar 2009 08:42:33 -0400

> David Miller wrote:
> > From: Gregory Haskins <ghaskins@novell.com>
> > Date: Wed, 18 Mar 2009 23:48:46 -0400
> >
> >   
> >> To see this in action, try taking a moderately large smp system
> >> (8-way+) and scaling the number of flows.
> >>     
> >
> > I can maintain line-rate over 10GB with a 64-cpu box.
> Oh man, I am jealous of that 64-way :)
> 
> How many simultaneous flows?  What hardware?  What qdisc and other
> config do you use?  MTU?  I cannot replicate such results on 10GB even
> with much smaller cpu counts.

Sun Neptune NIU 10G with 24 TX queues.

And it's all because of the number of TX queues, nothing more.

It is essential for good performance with any level of cpu
arity.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] net: reorder struct Qdisc for better SMP performance
  2009-03-19 14:04                           ` Eric Dumazet
@ 2009-03-20  8:33                           ` David Miller
  -1 siblings, 0 replies; 43+ messages in thread
From: David Miller @ 2009-03-20  8:33 UTC (permalink / raw)
  To: dada1
  Cc: sven, ghaskins, vernux, andi, netdev, linux-kernel,
	linux-rt-users, pmullaney

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Thu, 19 Mar 2009 15:04:11 +0100

> [PATCH] net: reorder struct Qdisc for better SMP performance

Applied, thanks.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-18 20:17   ` Vernon Mauery
@ 2009-03-20 23:29     ` Jarek Poplawski
  2009-03-23  8:32         ` Eric Dumazet
  0 siblings, 1 reply; 43+ messages in thread
From: Jarek Poplawski @ 2009-03-20 23:29 UTC (permalink / raw)
  To: Vernon Mauery; +Cc: Eric Dumazet, netdev, LKML, rt-users

Vernon Mauery wrote, On 03/18/2009 09:17 PM:
...
> This patch does seem to reduce the number of contentions by about 10%.  That is
> a good start (and a good catch on the cacheline bounces).  But, like I mentioned
> above, this lock still has 2 orders of magnitude greater contention than the
> next lock, so even a large decrease like 10% makes little difference in the
> overall contention characteristics.
> 
> So we will have to do something more.  Whether it needs to be more complex or
> not is still up in the air.  Batched enqueueing/dequeueing are just two options
> and the former would be a *lot* less complex than the latter.
> 
> If anyone else has any ideas they have been holding back, now would be a great
> time to get them out in the open.

I think it would be interesting to check another idea around this
contention: not all contenders are equal here.  One thread is doing
qdisc_run() and owns the transmit queue (even after releasing the TX
lock).  So while it waits for the qdisc lock, the NIC, if not multiqueue,
sits idle.  Perhaps some handicap like in the patch below could make
some difference in throughput; alas, I didn't test it.

Jarek P.
---

 net/core/dev.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index f112970..d5ad808 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1852,7 +1852,11 @@ gso:
 	if (q->enqueue) {
 		spinlock_t *root_lock = qdisc_lock(q);
 
-		spin_lock(root_lock);
+		while (!spin_trylock(root_lock)) {
+			do {
+				cpu_relax();
+			} while (spin_is_locked(root_lock));
+		}
 
 		if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
 			kfree_skb(skb);

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-20 23:29     ` Jarek Poplawski
@ 2009-03-23  8:32         ` Eric Dumazet
  0 siblings, 0 replies; 43+ messages in thread
From: Eric Dumazet @ 2009-03-23  8:32 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Vernon Mauery, netdev, LKML, rt-users

Jarek Poplawski wrote:
> Vernon Mauery wrote, On 03/18/2009 09:17 PM:
> ...
>> This patch does seem to reduce the number of contentions by about 10%.  That is
>> a good start (and a good catch on the cacheline bounces).  But, like I mentioned
>> above, this lock still has 2 orders of magnitude greater contention than the
>> next lock, so even a large decrease like 10% makes little difference in the
>> overall contention characteristics.
>>
>> So we will have to do something more.  Whether it needs to be more complex or
>> not is still up in the air.  Batched enqueueing/dequeueing are just two options
>> and the former would be a *lot* less complex than the latter.
>>
>> If anyone else has any ideas they have been holding back, now would be a great
>> time to get them out in the open.
> 
> I think there would be interesting to check another idea around this
> contention: not all contenders are equal here. One thread is doing
> qdisc_run() and owning the transmit queue (even after releasing the TX
> lock). So if it waits for the qdisc lock the NIC, if not multiqueue,
> is idle. Probably some handicap like in the patch below could make
> some difference in throughput; alas I didn't test it.
> 
> Jarek P.
> ---
> 
>  net/core/dev.c |    6 +++++-
>  1 files changed, 5 insertions(+), 1 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index f112970..d5ad808 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1852,7 +1852,11 @@ gso:
>  	if (q->enqueue) {
>  		spinlock_t *root_lock = qdisc_lock(q);
>  
> -		spin_lock(root_lock);
> +		while (!spin_trylock(root_lock)) {
> +			do {
> +				cpu_relax();
> +			} while (spin_is_locked(root_lock));
> +		}
>  
>  		if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
>  			kfree_skb(skb);
> 
> 

I don't understand, doesn't it defeat the ticket spinlock thing and fairness?

The thread doing __qdisc_run() already owns the __QDISC_STATE_RUNNING bit.

Trying or taking the spinlock has the same effect, since it forces a cache
line ping-pong, and this is the real problem.


^ permalink raw reply	[flat|nested] 43+ messages in thread


* Re: High contention on the sk_buff_head.lock
  2009-03-23  8:32         ` Eric Dumazet
@ 2009-03-23  8:37         ` David Miller
  2009-03-23  8:50           ` Jarek Poplawski
  2009-04-02 14:13           ` Herbert Xu
  -1 siblings, 2 replies; 43+ messages in thread
From: David Miller @ 2009-03-23  8:37 UTC (permalink / raw)
  To: dada1; +Cc: jarkao2, vernux, netdev, linux-kernel, linux-rt-users

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Mon, 23 Mar 2009 09:32:39 +0100

> I dont understand, doesnt it defeat the ticket spinlock thing and fairness ?
> 
> Thread doing __qdisc_run() already owns the __QDISC_STATE_RUNNING bit.

Right.

Remember, the way this is designed is that if there is a busy
cpu taking packets out of the queue and putting them into the
device, then other cpus will simply add to the queue and immediately
return.  This effectively keeps that cpu running the queue, processing
all the new work that the other cpus are adding to the qdisc.

Those other cpus make these decisions by looking at that
__QDISC_STATE_RUNNING bit, which the queue runner grabs before
it does any work.
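
Roughly, the code shape being described is (paraphrased from the 2.6.29
transmit path, not verbatim):

/* Paraphrased sketch of the enqueue/run split described above.  Whoever
 * wins the RUNNING bit stays and drains the qdisc; every other cpu just
 * enqueues its packet and returns.
 */
spin_lock(root_lock);
rc = qdisc_enqueue_root(skb, q);          /* add our packet to the qdisc */
if (!test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
        __qdisc_run(q);                   /* we are now the queue runner */
spin_unlock(root_lock);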

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-23  8:37         ` David Miller
@ 2009-03-23  8:50           ` Jarek Poplawski
  2009-04-02 14:13           ` Herbert Xu
  1 sibling, 0 replies; 43+ messages in thread
From: Jarek Poplawski @ 2009-03-23  8:50 UTC (permalink / raw)
  To: David Miller; +Cc: dada1, vernux, netdev, linux-kernel, linux-rt-users

On Mon, Mar 23, 2009 at 01:37:49AM -0700, David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Mon, 23 Mar 2009 09:32:39 +0100
> 
> > I dont understand, doesnt it defeat the ticket spinlock thing and fairness ?
> > 
> > Thread doing __qdisc_run() already owns the __QDISC_STATE_RUNNING bit.
> 
> Right.
> 
> Remember, the way this is designed is that if there is a busy
> cpu taking packets out of the queue and putting them into the
> device then other cpus will simply add to the queue and immediately
> return.

But this "busy cpu" can't take packets out of the queue when it's
waiting on the contended spinlock. Anyway, it's only for testing,
and I didn't say it has to be right.

Jarek P.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-03-23  8:37         ` David Miller
  2009-03-23  8:50           ` Jarek Poplawski
@ 2009-04-02 14:13           ` Herbert Xu
  2009-04-02 14:15             ` Herbert Xu
  1 sibling, 1 reply; 43+ messages in thread
From: Herbert Xu @ 2009-04-02 14:13 UTC (permalink / raw)
  To: David Miller; +Cc: dada1, jarkao2, vernux, netdev, linux-kernel, linux-rt-users

David Miller <davem@davemloft.net> wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Mon, 23 Mar 2009 09:32:39 +0100
> 
>> I dont understand, doesnt it defeat the ticket spinlock thing and fairness ?
>> 
>> Thread doing __qdisc_run() already owns the __QDISC_STATE_RUNNING bit.
> 
> Right.
> 
> Remember, the way this is designed is that if there is a busy
> cpu taking packets out of the queue and putting them into the
> device then other cpus will simply add to the queue and immediately
> return.  This effectively keeps the queue running there processing
> all the new work that other cpus are adding to the qdisc.
> 
> Those other cpus make these decisions by looking at that
> __QDISC_STATE_RUNNING bit, which the queue runner grabs before
> it does any work.

Come on guys, if this lock is a problem, go out and buy a proper
NIC that supports multiqueue TX!

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: High contention on the sk_buff_head.lock
  2009-04-02 14:13           ` Herbert Xu
@ 2009-04-02 14:15             ` Herbert Xu
  0 siblings, 0 replies; 43+ messages in thread
From: Herbert Xu @ 2009-04-02 14:15 UTC (permalink / raw)
  To: David Miller; +Cc: dada1, jarkao2, vernux, netdev, linux-kernel, linux-rt-users

On Thu, Apr 02, 2009 at 10:13:42PM +0800, Herbert Xu wrote:
>
> Come on guys, if this lock is a problem. go out and buy a proper
> NIC that supports multiequeue TX!

Nevermind, I was reading old emails again :)
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2009-04-02 14:16 UTC | newest]

Thread overview: 43+ messages
2009-03-18 17:24 High contention on the sk_buff_head.lock Vernon Mauery
2009-03-18 19:07 ` Eric Dumazet
2009-03-18 20:17   ` Vernon Mauery
2009-03-20 23:29     ` Jarek Poplawski
2009-03-23  8:32       ` Eric Dumazet
2009-03-23  8:32         ` Eric Dumazet
2009-03-23  8:37         ` David Miller
2009-03-23  8:50           ` Jarek Poplawski
2009-04-02 14:13           ` Herbert Xu
2009-04-02 14:15             ` Herbert Xu
2009-03-18 20:54 ` Andi Kleen
2009-03-18 21:03   ` David Miller
2009-03-18 21:10     ` Vernon Mauery
2009-03-18 21:38       ` David Miller
2009-03-18 21:49         ` Vernon Mauery
2009-03-19  1:02           ` David Miller
2009-03-18 21:54         ` Gregory Haskins
2009-03-19  1:03           ` David Miller
2009-03-19  1:13             ` Sven-Thorsten Dietrich
2009-03-19  1:17               ` David Miller
2009-03-19  1:43                 ` Sven-Thorsten Dietrich
2009-03-19  1:54                   ` David Miller
2009-03-19  5:49                     ` Eric Dumazet
2009-03-19  5:49                       ` Eric Dumazet
2009-03-19  5:58                       ` David Miller
2009-03-19 14:04                         ` [PATCH] net: reorder struct Qdisc for better SMP performance Eric Dumazet
2009-03-19 14:04                           ` Eric Dumazet
2009-03-20  8:33                           ` David Miller
2009-03-19 13:45                   ` High contention on the sk_buff_head.lock Andi Kleen
2009-03-19  3:48             ` Gregory Haskins
2009-03-19  5:38               ` David Miller
2009-03-19 12:42                 ` Gregory Haskins
2009-03-19 20:52                   ` David Miller
2009-03-19 12:50             ` Peter W. Morreale
2009-03-19  7:15           ` Evgeniy Polyakov
2009-03-18 21:07   ` Vernon Mauery
2009-03-18 21:45     ` Eilon Greenstein
2009-03-18 21:51       ` Vernon Mauery
2009-03-18 21:59         ` Andi Kleen
2009-03-18 22:19           ` Rick Jones
2009-03-19 12:59   ` Peter W. Morreale
2009-03-19 13:36     ` Peter W. Morreale
2009-03-19 13:46     ` Andi Kleen
