* SO_REUSEPORT - can it be done in kernel?
@ 2011-01-27 10:07 Daniel Baluta
  2011-01-27 15:55 ` Bill Sommerfeld
  2011-01-27 21:32 ` Tom Herbert
  0 siblings, 2 replies; 91+ messages in thread
From: Daniel Baluta @ 2011-01-27 10:07 UTC (permalink / raw)
  To: therbert; +Cc: netdev

Hi Tom,

How did you solve the issue regarding scaling TCP listeners?
I think SO_REUSEPORT as proposed by patch [1] can be a good
start. Were there any follow-ups?

Also, solving the problem in user space can be an option. I want
to run multiple instances of a DNS server on a multi-core system.
Any suggestions would be welcome.

The SO_REUSEPORT option seems to be already there [2]. Were
there any plans to have a kernel implementation?

thanks,
Daniel.

[1] http://amailbox.org/mailarchive/linux-netdev/2010/4/19/6274993
[2] http://lxr.linux.no/linux+v2.6.37/include/asm-generic/socket.h#L25
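
For illustration, a minimal sketch of the user-space side of this, assuming a
kernel carrying the SO_REUSEPORT patch from [1].  The helper below is made up
and error handling is omitted; each DNS worker process or thread would open
its own socket this way:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SO_REUSEPORT
#define SO_REUSEPORT 15	/* value reserved in asm-generic/socket.h [2] */
#endif

/* One socket per worker, all bound to the same UDP port. */
static int open_dns_socket(void)
{
	struct sockaddr_in addr;
	int one = 1;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(53);

	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
	return bind(fd, (struct sockaddr *)&addr, sizeof(addr)) ? -1 : fd;
}

Each worker then services queries on its own descriptor, so the kernel can
spread incoming datagrams across the sockets instead of funnelling everything
through one.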

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-01-27 10:07 SO_REUSEPORT - can it be done in kernel? Daniel Baluta
@ 2011-01-27 15:55 ` Bill Sommerfeld
  2011-01-27 21:32 ` Tom Herbert
  1 sibling, 0 replies; 91+ messages in thread
From: Bill Sommerfeld @ 2011-01-27 15:55 UTC (permalink / raw)
  To: Daniel Baluta; +Cc: therbert, netdev

On Thu, Jan 27, 2011 at 02:07, Daniel Baluta <daniel.baluta@gmail.com> wrote:
> How did you solve the issue regarding scaling TCP listeners?
> I think SO_REUSEPORT as proposed by patch [1] can be a good
> start. Were there any follow-ups?

Google is using the patch internally.  I've recently joined google and
have picked up this work from Tom; I'm starting to rework how it
interacts with TCP (in particular, changing how it interacts with
request sockets and listen sockets so that incoming connections are
not prematurely bound to a specific listener sharing the port).  I
have nothing worth sharing yet.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-01-27 10:07 SO_REUSEPORT - can it be done in kernel? Daniel Baluta
  2011-01-27 15:55 ` Bill Sommerfeld
@ 2011-01-27 21:32 ` Tom Herbert
  2011-02-25 12:56   ` Thomas Graf
  1 sibling, 1 reply; 91+ messages in thread
From: Tom Herbert @ 2011-01-27 21:32 UTC (permalink / raw)
  To: Daniel Baluta; +Cc: netdev

On Thu, Jan 27, 2011 at 2:07 AM, Daniel Baluta <daniel.baluta@gmail.com> wrote:
> Hi Tom,
>
> How did you solve the issue regarding scaling TCP listeners?
> I think SO_REUSEPORT as proposed by patch [1] can be a good
> start. Were there any follow-ups?
>

As Bill mentioned, we are continuing to work on the TCP issues.  It looks
like modifying the TCP listener structures will probably be required.

> Also, solving the problem in user space can be an option. I want
> to run multiple instances of a DNS server on a multi-core system.
> Any suggestions would be welcome.
>
> The SO_REUSEPORT option seems to be already there [2]. Were
> there any plans to have a kernel implementation?
>

Yes, we are still planning this.  The UDP implementation for my
earlier patch should be usable to try for DNS/UDP-- this is in fact
where we saw a major performance gain.  Eric Dumazet had some nice
improvements that should probably be looked at also.

Tom

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-01-27 21:32 ` Tom Herbert
@ 2011-02-25 12:56   ` Thomas Graf
  2011-02-25 19:18     ` Rick Jones
  2011-02-25 19:51     ` Tom Herbert
  0 siblings, 2 replies; 91+ messages in thread
From: Thomas Graf @ 2011-02-25 12:56 UTC (permalink / raw)
  To: Tom Herbert, Bill Sommerfeld; +Cc: Daniel Baluta, netdev

On Thu, Jan 27, 2011 at 01:32:25PM -0800, Tom Herbert wrote:
> Yes, we are still planning this.  The UDP implementation for my
> earlier patch should be usable to try for DNS/UDP-- this is in fact
> where we saw a major performance gain.  Eric Dumazet had some nice
> improvements that should probably be looked at also.

I can confirm this.

Serious scalability issues have been reported on a 12-core system
running bind 9.7-2. The system was only able to deliver ~110K queries
per second.

Using your SO_REUSEPORT patch and a modified bind that makes use of it,
the same system is able to deliver ~650K queries per second while
maxing out all cores completely.

Tom, Bill: do you have a timeline for merging this? Especially the
UDP bits?

-Thomas

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-25 12:56   ` Thomas Graf
@ 2011-02-25 19:18     ` Rick Jones
  2011-02-25 19:20       ` David Miller
                         ` (2 more replies)
  2011-02-25 19:51     ` Tom Herbert
  1 sibling, 3 replies; 91+ messages in thread
From: Rick Jones @ 2011-02-25 19:18 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Tom Herbert, Bill Sommerfeld, Daniel Baluta, netdev

On Fri, 2011-02-25 at 07:56 -0500, Thomas Graf wrote:
> On Thu, Jan 27, 2011 at 01:32:25PM -0800, Tom Herbert wrote:
> > Yes, we are still planning this.  The UDP implementation for my
> > earlier patch should be usable to try for DNS/UDP-- this is in fact
> > where we saw a major performance gain.  Eric Dumazet had some nice
> > improvements that should probably be looked at also.
> 
> I can confirm this.
> 
> Serious scalability issues have been reported on a 12-core system
> running bind 9.7-2. The system was only able to deliver ~110K queries
> per second.
> 
> Using your SO_REUSEPORT patch and a modified bind that makes use of it,
> the same system is able to deliver ~650K queries per second while
> maxing out all cores completely.

I think the idea is goodness, but will ask, was the (first) bottleneck
actually in the kernel, or was it in bind itself?  I've seen
single-instance, single-byte burst-mode netperf TCP_RR do in excess of
300K transactions per second (with TCP_NODELAY set) on an X5560 core.

ftp://ftp.netperf.org/netperf/misc/dl380g6_X5560_rhel54_ad386_cxgb3_1.4.1.2_b2b_to_same_agg_1500mtu_20100513-2.csv

and that was with now ancient RHEL5.4 bits...  yes, there is a bit of
apples, oranges and kumquats but still, I am wondering if this didn't
also "work around" some internal BIND scaling issues as well.

rick jones

> 
> Tom, Bill: do you have a timeline for merging this? Especially the
> UDP bits?
> 
> -Thomas



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-25 19:18     ` Rick Jones
@ 2011-02-25 19:20       ` David Miller
  2011-02-26  0:57         ` Herbert Xu
  2011-02-25 19:21       ` SO_REUSEPORT - can it be done in kernel? Eric Dumazet
  2011-02-25 22:48       ` Thomas Graf
  2 siblings, 1 reply; 91+ messages in thread
From: David Miller @ 2011-02-25 19:20 UTC (permalink / raw)
  To: rick.jones2; +Cc: tgraf, therbert, wsommerfeld, daniel.baluta, netdev

From: Rick Jones <rick.jones2@hp.com>
Date: Fri, 25 Feb 2011 11:18:15 -0800

> and that was with now ancient RHEL5.4 bits...  yes, there is a bit of
> apples, oranges and kumquats but still, I am wondering if this didn't
> also "work around" some internal BIND scaling issues as well.

I think this is fundamentally a bind problem as well.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-25 19:18     ` Rick Jones
  2011-02-25 19:20       ` David Miller
@ 2011-02-25 19:21       ` Eric Dumazet
  2011-02-25 22:48       ` Thomas Graf
  2 siblings, 0 replies; 91+ messages in thread
From: Eric Dumazet @ 2011-02-25 19:21 UTC (permalink / raw)
  To: rick.jones2
  Cc: Thomas Graf, Tom Herbert, Bill Sommerfeld, Daniel Baluta, netdev

On Friday, 25 February 2011 at 11:18 -0800, Rick Jones wrote:

> I think the idea is goodness, but will ask, was the (first) bottleneck
> actually in the kernel, or was it in bind itself?  I've seen
> single-instance, single-byte burst-mode netperf TCP_RR do in excess of
> 300K transactions per second (with TCP_NODELAY set) on an X5560 core.
> 
> ftp://ftp.netperf.org/netperf/misc/dl380g6_X5560_rhel54_ad386_cxgb3_1.4.1.2_b2b_to_same_agg_1500mtu_20100513-2.csv
> 
> and that was with now ancient RHEL5.4 bits...  yes, there is a bit of
> apples, oranges and kumquats but still, I am wondering if this didn't
> also "work around" some internal BIND scaling issues as well.
> 

A single core can probably give 300K transactions per second.

But if you use several cores, all accessing a single socket (the one
bound to port 53), then performance drops because of false sharing,
locking, and so on.




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-25 12:56   ` Thomas Graf
  2011-02-25 19:18     ` Rick Jones
@ 2011-02-25 19:51     ` Tom Herbert
  2011-02-25 22:58       ` Thomas Graf
  2011-02-25 23:33       ` Bill Sommerfeld
  1 sibling, 2 replies; 91+ messages in thread
From: Tom Herbert @ 2011-02-25 19:51 UTC (permalink / raw)
  To: Tom Herbert, Bill Sommerfeld, Daniel Baluta, netdev; +Cc: Thomas Graf

> Using your SO_REUSEPORT patch and a modified bind that makes use of it,
> the same system is able to deliver ~650K queries per second while
> maxing out all cores completely.
>
Nice data point.

> Tom, Bill: do you have a timeline for merging this? Especially the
> UDP bits?
>
Bill has been working on the TCP implementation, which is requiring
some fairly major surgery on the listener connections in syn-rcvd
state; this is ongoing.

On the UDP side, I believe the patch is functional, but as Eric
pointed out it probably could be further optimized.  I'll split out
the UDP bits into a separate patch and post that...

Tom

> -Thomas
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-25 19:18     ` Rick Jones
  2011-02-25 19:20       ` David Miller
  2011-02-25 19:21       ` SO_REUSEPORT - can it be done in kernel? Eric Dumazet
@ 2011-02-25 22:48       ` Thomas Graf
  2011-02-25 23:15         ` Rick Jones
  2 siblings, 1 reply; 91+ messages in thread
From: Thomas Graf @ 2011-02-25 22:48 UTC (permalink / raw)
  To: Rick Jones; +Cc: Tom Herbert, Bill Sommerfeld, Daniel Baluta, netdev

On Fri, Feb 25, 2011 at 11:18:15AM -0800, Rick Jones wrote:
> I think the idea is goodness, but will ask, was the (first) bottleneck
> actually in the kernel, or was it in bind itself?  I've seen
> single-instance, single-byte burst-mode netperf TCP_RR do in excess of
> 300K transactions per second (with TCP_NODELAY set) on an X5560 core.
> 
> ftp://ftp.netperf.org/netperf/misc/dl380g6_X5560_rhel54_ad386_cxgb3_1.4.1.2_b2b_to_same_agg_1500mtu_20100513-2.csv
> 
> and that was with now ancient RHEL5.4 bits...  yes, there is a bit of
> apples, oranges and kumquats but still, I am wondering if this didn't
> also "work around" some internal BIND scaling issues as well.

Yes it is. We have observed two separate bottlenecks.

The first one we discovered is within BIND. As soon as more than one
worker thread was used, strace showed a ton of futex() system calls
to the kernel once the number of queries crossed a magic barrier.
This suggested heavy lock contention within BIND.

This BIND lock contention was not visible on all systems with
scalability issues, though. Some machines were not able to deliver
enough queries to BIND for the lock contention to appear.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-25 19:51     ` Tom Herbert
@ 2011-02-25 22:58       ` Thomas Graf
  2011-02-25 23:33       ` Bill Sommerfeld
  1 sibling, 0 replies; 91+ messages in thread
From: Thomas Graf @ 2011-02-25 22:58 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Bill Sommerfeld, Daniel Baluta, netdev

On Fri, Feb 25, 2011 at 11:51:13AM -0800, Tom Herbert wrote:
> On the UDP side, I believe the patch is functional, but as Eric
> pointed out it probably could be further optimized.  I'll split out
> the UDP bits into a separate patch and post that...

Cool! I will be happy to assist in improving it further.

We already see 97-99% CPU utilization spread evenly over all cores
(with about 20-30% spent in softirq), so at least it already scales
perfectly well.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-25 22:48       ` Thomas Graf
@ 2011-02-25 23:15         ` Rick Jones
  0 siblings, 0 replies; 91+ messages in thread
From: Rick Jones @ 2011-02-25 23:15 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Tom Herbert, Bill Sommerfeld, Daniel Baluta, netdev

On Fri, 2011-02-25 at 17:48 -0500, Thomas Graf wrote:
> On Fri, Feb 25, 2011 at 11:18:15AM -0800, Rick Jones wrote:
> > I think the idea is goodness, but will ask, was the (first) bottleneck
> > actually in the kernel, or was it in bind itself?  I've seen
> > single-instance, single-byte burst-mode netperf TCP_RR do in excess of
> > 300K transactions per second (with TCP_NODELAY set) on an X5560 core.
> > 
> > ftp://ftp.netperf.org/netperf/misc/dl380g6_X5560_rhel54_ad386_cxgb3_1.4.1.2_b2b_to_same_agg_1500mtu_20100513-2.csv
> > 
> > and that was with now ancient RHEL5.4 bits...  yes, there is a bit of
> > apples, oranges and kumquats but still, I am wondering if this didn't
> > also "work around" some internal BIND scaling issues as well.
> 
> Yes it is. We have observed two separate bottlenecks.
> 
> The first one we discovered is within BIND. As soon as more than one
> worker thread was used, strace showed a ton of futex() system calls
> to the kernel once the number of queries crossed a magic barrier.
> This suggested heavy lock contention within BIND.

The more things change, the more they remain the same, or perhaps "Code
may come and go, but lock contention is forever":

ftp://ftp.cup.hp.com/dist/networking/briefs/bind9_perf.txt

rick jones

The system ftp.cup.hp.com is probably going away before long; I will
probably put its collection of ancient writeups somewhere on netperf.org.

> 
> This BIND lock contention was not visible on all systems with
> scalability issues, though. Some machines were not able to deliver
> enough queries to BIND for the lock contention to appear.



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-25 19:51     ` Tom Herbert
  2011-02-25 22:58       ` Thomas Graf
@ 2011-02-25 23:33       ` Bill Sommerfeld
  1 sibling, 0 replies; 91+ messages in thread
From: Bill Sommerfeld @ 2011-02-25 23:33 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Daniel Baluta, netdev, Thomas Graf

On Fri, Feb 25, 2011 at 11:51, Tom Herbert <therbert@google.com> wrote:
>> Tom, Bill: do you have a timeline for merging this? Especially the
>> UDP bits?
> Bill has been working on the TCP implementation, which is requiring
> some fairly major surgery on the listener connections in syn-rcvd
> state; this is ongoing.

Yup.  The broad approach I settled on is to delay binding of new
connections to listener sockets by moving receive_sock's from a
per-listen_sock hash table to new hash chains in the global hash
table.

This is very much a work-in-progress.  I'm part way through the
conversion and have running code with most of the new structures in
place in parallel with the old; I'm about to start relying exclusively
on the new, and then will tear down the old; once that's done I'll be
in a position to hook that up to SO_REUSEPORT and start actually
measuring the difference.  In short: it will be a while.

So splitting SO_REUSEPORT for UDP from SO_REUSEPORT for TCP makes a
lot of sense to me.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-25 19:20       ` David Miller
@ 2011-02-26  0:57         ` Herbert Xu
  2011-02-26  2:12           ` David Miller
  2011-02-27 11:02           ` Thomas Graf
  0 siblings, 2 replies; 91+ messages in thread
From: Herbert Xu @ 2011-02-26  0:57 UTC (permalink / raw)
  To: David Miller
  Cc: rick.jones2, tgraf, therbert, wsommerfeld, daniel.baluta, netdev

David Miller <davem@davemloft.net> wrote:
>
> I think this is fundamentally a bind problem as well.

I'm fairly certain the bottleneck is indeed in the kernel, and
in the UDP stack in particular.

This is borne out by a test where I used two named worker threads,
both working on the same socket.  Stracing shows that they're
working flat out only doing sendmsg/recvmsg.

The result was that they obtained (in aggregate) half the throughput
of a single worker thread.

I then retested by having them use two sockets and the performance
greatly improved.

Now this is actually expected since our UDP stack is essentially
single-threaded on the send side when only one socket is being
used, mostly due to the corking functionality.
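
A rough user-space analogue of that experiment, for anyone who wants to
reproduce the effect without named (this is a sketch, not the actual test
that was run; thread and packet counts are arbitrary and nothing needs to
be listening on the destination port).  Build with -pthread and time the
run with SHARED set to 1 and then to 0:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>

#define NTHREADS 2
#define NPKTS    1000000
#define SHARED   1	/* 1: all threads share one fd, 0: one fd per thread */

static struct sockaddr_in dst;

static void *sender(void *arg)
{
	int fd = *(int *)arg;
	char payload[64] = "x";
	int i;

	for (i = 0; i < NPKTS; i++)
		sendto(fd, payload, sizeof(payload), 0,
		       (struct sockaddr *)&dst, sizeof(dst));
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	int fds[NTHREADS];
	int i;

	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(9);	/* discard port, nobody listens */
	inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

	fds[0] = socket(AF_INET, SOCK_DGRAM, 0);
	for (i = 1; i < NTHREADS; i++)
		fds[i] = SHARED ? fds[0] : socket(AF_INET, SOCK_DGRAM, 0);

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, sender, &fds[i]);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

On a kernel without the changes discussed here, the shared-socket run is the
one that should show the send-side serialization described above.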

I'm unsure how big a role the receive side scalability actually
plays in this case, but I suspect it isn't great.

Which is why I'm quite skeptical about this REUSEPORT patch: IMHO
the only reason it produces a great result is that it allows
parallel sends to go out.

Rather than modifying all UDP applications out there to fix what
is fundamentally a kernel problem, I think what we should do is
fix the UDP stack so that it actually scales.

It isn't all that hard since the easy way would be to only take
the lock if we're already corked or about to cork.

For the receive side we also don't need REUSEPORT as we can simply
make our UDP stack multiqueue.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-26  0:57         ` Herbert Xu
@ 2011-02-26  2:12           ` David Miller
  2011-02-26  2:48             ` Herbert Xu
  2011-02-27 11:02           ` Thomas Graf
  1 sibling, 1 reply; 91+ messages in thread
From: David Miller @ 2011-02-26  2:12 UTC (permalink / raw)
  To: herbert; +Cc: rick.jones2, tgraf, therbert, wsommerfeld, daniel.baluta, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 26 Feb 2011 08:57:18 +0800

> It isn't all that hard since the easy way would be to only take
> the lock if we're already corked or about to cork.

We take the lock unconditionally because we essentially have to after
UDP takes on the socket buffer accounting facilities similar to TCP.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-26  2:12           ` David Miller
@ 2011-02-26  2:48             ` Herbert Xu
  2011-02-26  3:07               ` David Miller
  0 siblings, 1 reply; 91+ messages in thread
From: Herbert Xu @ 2011-02-26  2:48 UTC (permalink / raw)
  To: David Miller
  Cc: rick.jones2, tgraf, therbert, wsommerfeld, daniel.baluta, netdev

On Fri, Feb 25, 2011 at 06:12:44PM -0800, David Miller wrote:
> 
> We take the lock unconditionally because we essentially have to after
> UDP takes on the socket buffer accounting facilities similar to TCP.

Well I just checked out the history tree (2.6.12) and it too had
the unconditional lock on the send path.  So this predates the
system-wide buffer limit change.

I'm looking at redoing this and the bulk of the work is going to
be restructuring ip_append_data/ip_push_pending_frames so that it
doesn't store the states in sk/inet_sk.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-26  2:48             ` Herbert Xu
@ 2011-02-26  3:07               ` David Miller
  2011-02-26  3:11                 ` Herbert Xu
  0 siblings, 1 reply; 91+ messages in thread
From: David Miller @ 2011-02-26  3:07 UTC (permalink / raw)
  To: herbert; +Cc: rick.jones2, tgraf, therbert, wsommerfeld, daniel.baluta, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 26 Feb 2011 10:48:48 +0800

> I'm looking at redoing this and the bulk of the work is going to
> be restructuring ip_append_data/ip_push_pending_frames so that it
> doesn't store the states in sk/inet_sk.

I suppose you're going to replace that stuff with an on-stack
control structure that gets passed around by reference or
similar?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-26  3:07               ` David Miller
@ 2011-02-26  3:11                 ` Herbert Xu
  2011-02-26  7:31                   ` Eric Dumazet
  0 siblings, 1 reply; 91+ messages in thread
From: Herbert Xu @ 2011-02-26  3:11 UTC (permalink / raw)
  To: David Miller
  Cc: rick.jones2, tgraf, therbert, wsommerfeld, daniel.baluta, netdev

On Fri, Feb 25, 2011 at 07:07:23PM -0800, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Sat, 26 Feb 2011 10:48:48 +0800
> 
> > I'm looking at redoing this and the bulk of the work is going to
> > be restructuring ip_append_data/ip_push_pending_frames so that it
> > doesn't store the states in sk/inet_sk.
> 
> I suppose you're going to replace that stuff with an on-stack
> control structure that gets passed around by reference or
> similar?

Either that or have ip_append_data do ip_push_pending_frames
directly.

That function's signature is a mess already and I need to think
about this a bit more :)

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-26  3:11                 ` Herbert Xu
@ 2011-02-26  7:31                   ` Eric Dumazet
  2011-02-26  7:46                     ` David Miller
  0 siblings, 1 reply; 91+ messages in thread
From: Eric Dumazet @ 2011-02-26  7:31 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, rick.jones2, tgraf, therbert, wsommerfeld,
	daniel.baluta, netdev

On Saturday, 26 February 2011 at 11:11 +0800, Herbert Xu wrote:
> On Fri, Feb 25, 2011 at 07:07:23PM -0800, David Miller wrote:
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Date: Sat, 26 Feb 2011 10:48:48 +0800
> > 
> > > I'm looking at redoing this and the bulk of the work is going to
> > > be restructuring ip_append_data/ip_push_pending_frames so that it
> > > doesn't store the states in sk/inet_sk.
> > 
> > I suppose you're going to replace that stuff with an on-stack
> > control structure that gets passed around by reference or
> > similar?
> 
> Either that or have ip_append_data do ip_push_pending_frames
> directly.
> 
> That function's signature is a mess already and I need to think
> about this a bit more :)
> 
> Cheers,


UDP CORK is a problem indeed. I wonder who really uses it?




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-26  7:31                   ` Eric Dumazet
@ 2011-02-26  7:46                     ` David Miller
  0 siblings, 0 replies; 91+ messages in thread
From: David Miller @ 2011-02-26  7:46 UTC (permalink / raw)
  To: eric.dumazet
  Cc: herbert, rick.jones2, tgraf, therbert, wsommerfeld,
	daniel.baluta, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 26 Feb 2011 08:31:24 +0100

> UDP CORK is a problem indeed. I wonder who really uses it?

git grep MSG_MORE -- net/sunrpc
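
For readers who haven't met corking: a minimal user-space sketch of the two
equivalent forms, assuming an already-connected UDP socket (the helper and
its arguments are illustrative only; UDP_CORK and MSG_MORE are the real
interfaces):

#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

#ifndef UDP_CORK
#define UDP_CORK 1	/* from linux/udp.h */
#endif

/* Build one UDP datagram from two writes on a connected socket. */
static void send_in_two_pieces(int fd, const void *hdr, size_t hlen,
			       const void *body, size_t blen)
{
	int on = 1, off = 0;

	/* Form 1: cork, write the pieces, uncork to emit the datagram. */
	setsockopt(fd, IPPROTO_UDP, UDP_CORK, &on, sizeof(on));
	send(fd, hdr, hlen, 0);
	send(fd, body, blen, 0);
	setsockopt(fd, IPPROTO_UDP, UDP_CORK, &off, sizeof(off));

	/* Form 2: MSG_MORE on every write except the last does the same. */
	send(fd, hdr, hlen, MSG_MORE);
	send(fd, body, blen, 0);
}

It is this "several appends, one datagram" behaviour that makes each sendmsg
serialize on the socket lock, which is the contention being discussed here.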


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-26  0:57         ` Herbert Xu
  2011-02-26  2:12           ` David Miller
@ 2011-02-27 11:02           ` Thomas Graf
  2011-02-27 11:06             ` Herbert Xu
  1 sibling, 1 reply; 91+ messages in thread
From: Thomas Graf @ 2011-02-27 11:02 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

On Sat, Feb 26, 2011 at 08:57:18AM +0800, Herbert Xu wrote:
> I'm fairly certain the bottleneck is indeed in the kernel, and
> in the UDP stack in particular.
> 
> This is borne out by a test where I used two named worker threads,
> both working on the same socket.  Stracing shows that they're
> working flat out only doing sendmsg/recvmsg.
> 
> The result was that they obtained (in aggregate) half the throughput
> of a single worker thread.

I agree. This is the bottleneck that I described where the kernel is
not able to deliver enough queries for BIND to show the lock
contention issues.

But there is also the situation where netperf RR performance numbers
indicate a much higher kernel capability, but BIND is not able to
deliver more even though the CPU utilization is very low. This is
the situation where we see the large number of futex calls indicating
the lock contention due to too many queries on a single socket.

> Which is why I'm quite skeptical about this REUSEPORT patch: IMHO
> the only reason it produces a great result is that it allows
> parallel sends to go out.
> 
> Rather than modifying all UDP applications out there to fix what
> is fundamentally a kernel problem, I think what we should do is
> fix the UDP stack so that it actually scales.

I am not suggesting that this is the ultimate and final fix for this
problem. It fixes a symptom rather than the cause, but sometimes
being able to fix the symptom is really handy :-)

Adding SO_REUSEPORT does not prevent us from fixing the UDP stack
in the long run.

> It isn't all that hard since the easy way would be to only take
> the lock if we're already corked or about to cork.
> 
> For the receive side we also don't need REUSEPORT as we can simply
> make our UDP stack multiqueue.

OK, it is not required and there is definitely a better way to fix
the kernel bottleneck in the long term. Even better.

I still suggest merging this patch as an immediate workaround
until we scale properly on a single socket, and also as a workaround
for applications which can't get rid of their per-socket mutex quickly.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-27 11:02           ` Thomas Graf
@ 2011-02-27 11:06             ` Herbert Xu
  2011-02-28  3:45               ` Tom Herbert
                                 ` (5 more replies)
  0 siblings, 6 replies; 91+ messages in thread
From: Herbert Xu @ 2011-02-27 11:06 UTC (permalink / raw)
  To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

On Sun, Feb 27, 2011 at 06:02:05AM -0500, Thomas Graf wrote:
>
> I still suggest merging this patch as an immediate workaround
> until we scale properly on a single socket, and also as a workaround
> for applications which can't get rid of their per-socket mutex quickly.

I disagree completely.

This patch adds a user-space API that we will have to carry
with us in perpetuity.  I would only support this if we had
no other way around the problem.

If this does turn out to be mostly due to sendmsg contention
then fixing it is going to be much simpler than making the UDP
stack multiqueue capable.

I'm working on this right now.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-27 11:06             ` Herbert Xu
@ 2011-02-28  3:45               ` Tom Herbert
  2011-02-28  4:26                 ` Herbert Xu
  2011-02-28 11:36               ` Herbert Xu
                                 ` (4 subsequent siblings)
  5 siblings, 1 reply; 91+ messages in thread
From: Tom Herbert @ 2011-02-28  3:45 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, rick.jones2, wsommerfeld, daniel.baluta, netdev

> I disagree completely.
>
> This patch adds a user-space API that we will have to carry
> with us in perpetuity.  I would only support this if we had
> no other way around the problem.
>
> If this does turn out to be mostly due to sendmsg contention
> then fixing it is going to be much simpler than making the UDP
> stack multiqueue capable.
>

That sounds promising, but the receive side will still have problems.
There is lock contention on the queue as well as cache line bouncing
on the sock structures.  Also, multiple threads sleeping on the same
socket typically leads to asymmetric load across the threads (and
degenerate cases where the receiving thread is woken up and other
threads have already processed all the packets).  TCP listener threads
suffer from these same problems.
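
As a concrete illustration of that pattern, here is a rough sketch of the
shared-socket worker loop being described (handle_query() is a hypothetical
stand-in for the application's work):

#include <sys/socket.h>
#include <sys/types.h>

extern void handle_query(char *buf, ssize_t len);	/* application-provided */

/* Every worker thread blocks in recv() on the SAME descriptor.  The
 * receive-queue lock and the sock fields bounce between CPUs, and the
 * wakeups are uneven: whichever thread wakes first may drain the queue,
 * leaving the others to wake up and find nothing to do. */
static void *worker(void *arg)
{
	int fd = *(int *)arg;
	char buf[512];

	for (;;) {
		ssize_t n = recv(fd, buf, sizeof(buf), 0);
		if (n > 0)
			handle_query(buf, n);
	}
	return NULL;
}

With one SO_REUSEPORT socket per thread each worker has its own receive queue
and its own wait queue, which is what sidesteps the problems listed above.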

Tom


> I'm working on this right now.
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-28  3:45               ` Tom Herbert
@ 2011-02-28  4:26                 ` Herbert Xu
  0 siblings, 0 replies; 91+ messages in thread
From: Herbert Xu @ 2011-02-28  4:26 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, rick.jones2, wsommerfeld, daniel.baluta, netdev

On Sun, Feb 27, 2011 at 07:45:55PM -0800, Tom Herbert wrote:
> That sounds promising, but the receive side will still have problems.
> There is lock contention on the queue as well as cache line bouncing
> on the sock structures.  Also, multiple threads sleeping on the same
> socket typically leads to asymmetric load across the threads (and
> degenerate cases where the receiving thread is woken up and other
> threads have already processed all the packets).  TCP listener threads
> suffer from these same problems.

IOW this is something that we have to solve anyway.  I'm just
being overly cautious here because user-space API changes are
something that we should not enter into lightly.

If this patch was completely internal to the kernel, then I would
have much less of an objection as we can always revert/change it
later on.  With a user-space API we don't have that flexibility.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-27 11:06             ` Herbert Xu
  2011-02-28  3:45               ` Tom Herbert
@ 2011-02-28 11:36               ` Herbert Xu
  2011-02-28 13:32                 ` Eric Dumazet
                                   ` (3 more replies)
  2011-02-28 11:41               ` [PATCH 2/5] net: Remove explicit write references to sk/inet in ip_append_data Herbert Xu
                                 ` (3 subsequent siblings)
  5 siblings, 4 replies; 91+ messages in thread
From: Herbert Xu @ 2011-02-28 11:36 UTC (permalink / raw)
  To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

On Sun, Feb 27, 2011 at 07:06:14PM +0800, Herbert Xu wrote:
> I'm working on this right now.

OK, I think I was definitely on the right track.  With the send
path made lockless I now get numbers which are even better than
those obtained by running named with multiple sockets.  That's
right, a single socket is now faster than what multiple sockets
were without the patch (of course, multiple sockets may still be
faster with the patch vs. a single socket for obvious reasons,
but I couldn't measure any significant difference).

Also worthy of note is that prior to the patch all CPUs showed
idleness (lazy bastards!); with the patch they're all maxed out.

In retrospect, the idleness was simply the result of the socket
lock scheduling away and was an indication of lock contention.

Here are the patches I used.  Please don't apply them yet as I
intend to clean them up quite a bit.

But please do test them heavily, especially if you have an AMD
NUMA machine as that's where scalability problems really show
up.  Intel tends to be a lot more forgiving.  My last AMD machine
blew up years ago :)

Thanks!
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 2/5] net: Remove explicit write references to sk/inet in ip_append_data
  2011-02-27 11:06             ` Herbert Xu
  2011-02-28  3:45               ` Tom Herbert
  2011-02-28 11:36               ` Herbert Xu
@ 2011-02-28 11:41               ` Herbert Xu
  2011-03-01  5:31                 ` Eric Dumazet
  2011-02-28 11:41               ` [PATCH 1/5] net: Remove unused sk_sndmsg_* from UFO Herbert Xu
                                 ` (2 subsequent siblings)
  5 siblings, 1 reply; 91+ messages in thread
From: Herbert Xu @ 2011-02-28 11:41 UTC (permalink / raw)
  To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev, Thomas Graf

net: Remove explicit write references to sk/inet in ip_append_data

In order to allow simultaneous calls to ip_append_data on the same
socket, it must not modify any shared state in sk or inet (other
than fields that are designed to allow that, such as atomic counters).

This patch abstracts out write references to sk and inet_sk in
ip_append_data and its friends so that we may use the underlying
code in parallel.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/net/inet_sock.h |   23 ++--
 net/ipv4/ip_output.c    |  238 ++++++++++++++++++++++++++++--------------------
 2 files changed, 154 insertions(+), 107 deletions(-)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 8181498..b3de102 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -86,6 +86,19 @@ static inline struct inet_request_sock *inet_rsk(const struct request_sock *sk)
 	return (struct inet_request_sock *)sk;
 }
 
+struct inet_cork {
+	unsigned int		flags;
+	unsigned int		fragsize;
+	struct ip_options	*opt;
+	struct dst_entry	*dst;
+	int			length; /* Total length of all frames */
+	__be32			addr;
+	struct flowi		fl;
+	struct page		*page;
+	u32			off;
+	u8			tx_flags;
+};
+
 struct ip_mc_socklist;
 struct ipv6_pinfo;
 struct rtable;
@@ -143,15 +156,7 @@ struct inet_sock {
 	int			mc_index;
 	__be32			mc_addr;
 	struct ip_mc_socklist __rcu	*mc_list;
-	struct {
-		unsigned int		flags;
-		unsigned int		fragsize;
-		struct ip_options	*opt;
-		struct dst_entry	*dst;
-		int			length; /* Total length of all frames */
-		__be32			addr;
-		struct flowi		fl;
-	} cork;
+	struct inet_cork	cork;
 };
 
 #define IPCORK_OPT	1	/* ip-options has been held in ipcork.opt */
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index d3a4540..1dd5ecc 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -733,6 +733,7 @@ csum_page(struct page *page, int offset, int copy)
 }
 
 static inline int ip_ufo_append_data(struct sock *sk,
+			struct sk_buff_head *queue,
 			int getfrag(void *from, char *to, int offset, int len,
 			       int odd, struct sk_buff *skb),
 			void *from, int length, int hh_len, int fragheaderlen,
@@ -745,7 +746,7 @@ static inline int ip_ufo_append_data(struct sock *sk,
 	 * device, so create one single skb packet containing complete
 	 * udp datagram
 	 */
-	if ((skb = skb_peek_tail(&sk->sk_write_queue)) == NULL) {
+	if ((skb = skb_peek_tail(queue)) == NULL) {
 		skb = sock_alloc_send_skb(sk,
 			hh_len + fragheaderlen + transhdrlen + 20,
 			(flags & MSG_DONTWAIT), &err);
@@ -771,35 +772,24 @@ static inline int ip_ufo_append_data(struct sock *sk,
 		/* specify the length of each IP datagram fragment */
 		skb_shinfo(skb)->gso_size = mtu - fragheaderlen;
 		skb_shinfo(skb)->gso_type = SKB_GSO_UDP;
-		__skb_queue_tail(&sk->sk_write_queue, skb);
+		__skb_queue_tail(queue, skb);
 	}
 
 	return skb_append_datato_frags(sk, skb, getfrag, from,
 				       (length - transhdrlen));
 }
 
-/*
- *	ip_append_data() and ip_append_page() can make one large IP datagram
- *	from many pieces of data. Each pieces will be holded on the socket
- *	until ip_push_pending_frames() is called. Each piece can be a page
- *	or non-page data.
- *
- *	Not only UDP, other transport protocols - e.g. raw sockets - can use
- *	this interface potentially.
- *
- *	LATER: length must be adjusted by pad at tail, when it is required.
- */
-int ip_append_data(struct sock *sk,
-		   int getfrag(void *from, char *to, int offset, int len,
-			       int odd, struct sk_buff *skb),
-		   void *from, int length, int transhdrlen,
-		   struct ipcm_cookie *ipc, struct rtable **rtp,
-		   unsigned int flags)
+static int __ip_append_data(struct sock *sk, struct sk_buff_head *queue,
+			    struct inet_cork *cork,
+			    int getfrag(void *from, char *to, int offset,
+					int len, int odd, struct sk_buff *skb),
+			    void *from, int length, int transhdrlen,
+			    unsigned int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct sk_buff *skb;
 
-	struct ip_options *opt = NULL;
+	struct ip_options *opt = inet->cork.opt;
 	int hh_len;
 	int exthdrlen;
 	int mtu;
@@ -808,58 +798,19 @@ int ip_append_data(struct sock *sk,
 	int offset = 0;
 	unsigned int maxfraglen, fragheaderlen;
 	int csummode = CHECKSUM_NONE;
-	struct rtable *rt;
-
-	if (flags&MSG_PROBE)
-		return 0;
+	struct rtable *rt = (struct rtable *)cork->dst;
 
-	if (skb_queue_empty(&sk->sk_write_queue)) {
-		/*
-		 * setup for corking.
-		 */
-		opt = ipc->opt;
-		if (opt) {
-			if (inet->cork.opt == NULL) {
-				inet->cork.opt = kmalloc(sizeof(struct ip_options) + 40, sk->sk_allocation);
-				if (unlikely(inet->cork.opt == NULL))
-					return -ENOBUFS;
-			}
-			memcpy(inet->cork.opt, opt, sizeof(struct ip_options)+opt->optlen);
-			inet->cork.flags |= IPCORK_OPT;
-			inet->cork.addr = ipc->addr;
-		}
-		rt = *rtp;
-		if (unlikely(!rt))
-			return -EFAULT;
-		/*
-		 * We steal reference to this route, caller should not release it
-		 */
-		*rtp = NULL;
-		inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE ?
-					    rt->dst.dev->mtu :
-					    dst_mtu(rt->dst.path);
-		inet->cork.dst = &rt->dst;
-		inet->cork.length = 0;
-		sk->sk_sndmsg_page = NULL;
-		sk->sk_sndmsg_off = 0;
-		exthdrlen = rt->dst.header_len;
-		length += exthdrlen;
-		transhdrlen += exthdrlen;
-	} else {
-		rt = (struct rtable *)inet->cork.dst;
-		if (inet->cork.flags & IPCORK_OPT)
-			opt = inet->cork.opt;
+	exthdrlen = transhdrlen ? rt->dst.header_len : 0;
+	length += exthdrlen;
+	transhdrlen += exthdrlen;
+	mtu = inet->cork.fragsize;
 
-		transhdrlen = 0;
-		exthdrlen = 0;
-		mtu = inet->cork.fragsize;
-	}
 	hh_len = LL_RESERVED_SPACE(rt->dst.dev);
 
 	fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
 	maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
 
-	if (inet->cork.length + length > 0xFFFF - fragheaderlen) {
+	if (cork->length + length > 0xFFFF - fragheaderlen) {
 		ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->inet_dport,
 			       mtu-exthdrlen);
 		return -EMSGSIZE;
@@ -875,15 +826,15 @@ int ip_append_data(struct sock *sk,
 	    !exthdrlen)
 		csummode = CHECKSUM_PARTIAL;
 
-	skb = skb_peek_tail(&sk->sk_write_queue);
+	skb = skb_peek_tail(queue);
 
-	inet->cork.length += length;
+	cork->length += length;
 	if (((length > mtu) || (skb && skb_is_gso(skb))) &&
 	    (sk->sk_protocol == IPPROTO_UDP) &&
 	    (rt->dst.dev->features & NETIF_F_UFO)) {
-		err = ip_ufo_append_data(sk, getfrag, from, length, hh_len,
-					 fragheaderlen, transhdrlen, mtu,
-					 flags);
+		err = ip_ufo_append_data(sk, queue, getfrag, from, length,
+					 hh_len, fragheaderlen, transhdrlen,
+					 mtu, flags);
 		if (err)
 			goto error;
 		return 0;
@@ -960,7 +911,7 @@ alloc_new_skb:
 				else
 					/* only the initial fragment is
 					   time stamped */
-					ipc->tx_flags = 0;
+					cork->tx_flags = 0;
 			}
 			if (skb == NULL)
 				goto error;
@@ -971,7 +922,7 @@ alloc_new_skb:
 			skb->ip_summed = csummode;
 			skb->csum = 0;
 			skb_reserve(skb, hh_len);
-			skb_shinfo(skb)->tx_flags = ipc->tx_flags;
+			skb_shinfo(skb)->tx_flags = cork->tx_flags;
 
 			/*
 			 *	Find where to start putting bytes.
@@ -1008,7 +959,7 @@ alloc_new_skb:
 			/*
 			 * Put the packet on the pending queue.
 			 */
-			__skb_queue_tail(&sk->sk_write_queue, skb);
+			__skb_queue_tail(queue, skb);
 			continue;
 		}
 
@@ -1028,8 +979,8 @@ alloc_new_skb:
 		} else {
 			int i = skb_shinfo(skb)->nr_frags;
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i-1];
-			struct page *page = sk->sk_sndmsg_page;
-			int off = sk->sk_sndmsg_off;
+			struct page *page = cork->page;
+			int off = cork->off;
 			unsigned int left;
 
 			if (page && (left = PAGE_SIZE - off) > 0) {
@@ -1041,7 +992,7 @@ alloc_new_skb:
 						goto error;
 					}
 					get_page(page);
-					skb_fill_page_desc(skb, i, page, sk->sk_sndmsg_off, 0);
+					skb_fill_page_desc(skb, i, page, off, 0);
 					frag = &skb_shinfo(skb)->frags[i];
 				}
 			} else if (i < MAX_SKB_FRAGS) {
@@ -1052,8 +1003,8 @@ alloc_new_skb:
 					err = -ENOMEM;
 					goto error;
 				}
-				sk->sk_sndmsg_page = page;
-				sk->sk_sndmsg_off = 0;
+				cork->page = page;
+				cork->off = 0;
 
 				skb_fill_page_desc(skb, i, page, 0, 0);
 				frag = &skb_shinfo(skb)->frags[i];
@@ -1065,7 +1016,7 @@ alloc_new_skb:
 				err = -EFAULT;
 				goto error;
 			}
-			sk->sk_sndmsg_off += copy;
+			cork->off += copy;
 			frag->size += copy;
 			skb->len += copy;
 			skb->data_len += copy;
@@ -1079,11 +1030,87 @@ alloc_new_skb:
 	return 0;
 
 error:
-	inet->cork.length -= length;
+	cork->length -= length;
 	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
 	return err;
 }
 
+static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
+			 struct ipcm_cookie *ipc, struct rtable **rtp)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	struct ip_options *opt;
+	struct rtable *rt;
+
+	/*
+	 * setup for corking.
+	 */
+	opt = ipc->opt;
+	if (opt) {
+		if (cork->opt == NULL) {
+			cork->opt = kmalloc(sizeof(struct ip_options) + 40,
+					    sk->sk_allocation);
+			if (unlikely(cork->opt == NULL))
+				return -ENOBUFS;
+		}
+		memcpy(cork->opt, opt, sizeof(struct ip_options) + opt->optlen);
+		cork->flags |= IPCORK_OPT;
+		cork->addr = ipc->addr;
+	}
+	rt = *rtp;
+	if (unlikely(!rt))
+		return -EFAULT;
+	/*
+	 * We steal reference to this route, caller should not release it
+	 */
+	*rtp = NULL;
+	cork->fragsize = inet->pmtudisc == IP_PMTUDISC_PROBE ?
+			 rt->dst.dev->mtu : dst_mtu(rt->dst.path);
+	cork->dst = &rt->dst;
+	cork->length = 0;
+	cork->tx_flags = ipc->tx_flags;
+	cork->page = NULL;
+	cork->off = 0;
+
+	return 0;
+}
+
+/*
+ *	ip_append_data() and ip_append_page() can make one large IP datagram
+ *	from many pieces of data. Each pieces will be holded on the socket
+ *	until ip_push_pending_frames() is called. Each piece can be a page
+ *	or non-page data.
+ *
+ *	Not only UDP, other transport protocols - e.g. raw sockets - can use
+ *	this interface potentially.
+ *
+ *	LATER: length must be adjusted by pad at tail, when it is required.
+ */
+int ip_append_data(struct sock *sk,
+		   int getfrag(void *from, char *to, int offset, int len,
+			       int odd, struct sk_buff *skb),
+		   void *from, int length, int transhdrlen,
+		   struct ipcm_cookie *ipc, struct rtable **rtp,
+		   unsigned int flags)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	int err;
+
+	if (flags&MSG_PROBE)
+		return 0;
+
+	if (skb_queue_empty(&sk->sk_write_queue)) {
+		err = ip_setup_cork(sk, &inet->cork, ipc, rtp);
+		if (err)
+			return err;
+	} else {
+		transhdrlen = 0;
+	}
+
+	return __ip_append_data(sk, &sk->sk_write_queue, &inet->cork, getfrag,
+				from, length, transhdrlen, flags);
+}
+
 ssize_t	ip_append_page(struct sock *sk, struct page *page,
 		       int offset, size_t size, int flags)
 {
@@ -1227,40 +1254,42 @@ error:
 	return err;
 }
 
-static void ip_cork_release(struct inet_sock *inet)
+static void ip_cork_release(struct inet_cork *cork)
 {
-	inet->cork.flags &= ~IPCORK_OPT;
-	kfree(inet->cork.opt);
-	inet->cork.opt = NULL;
-	dst_release(inet->cork.dst);
-	inet->cork.dst = NULL;
+	cork->flags &= ~IPCORK_OPT;
+	kfree(cork->opt);
+	cork->opt = NULL;
+	dst_release(cork->dst);
+	cork->dst = NULL;
 }
 
 /*
  *	Combined all pending IP fragments on the socket as one IP datagram
  *	and push them out.
  */
-int ip_push_pending_frames(struct sock *sk)
+static int __ip_push_pending_frames(struct sock *sk,
+				    struct sk_buff_head *queue,
+				    struct inet_cork *cork)
 {
 	struct sk_buff *skb, *tmp_skb;
 	struct sk_buff **tail_skb;
 	struct inet_sock *inet = inet_sk(sk);
 	struct net *net = sock_net(sk);
 	struct ip_options *opt = NULL;
-	struct rtable *rt = (struct rtable *)inet->cork.dst;
+	struct rtable *rt = (struct rtable *)cork->dst;
 	struct iphdr *iph;
 	__be16 df = 0;
 	__u8 ttl;
 	int err = 0;
 
-	if ((skb = __skb_dequeue(&sk->sk_write_queue)) == NULL)
+	if ((skb = __skb_dequeue(queue)) == NULL)
 		goto out;
 	tail_skb = &(skb_shinfo(skb)->frag_list);
 
 	/* move skb->data to ip header from ext header */
 	if (skb->data < skb_network_header(skb))
 		__skb_pull(skb, skb_network_offset(skb));
-	while ((tmp_skb = __skb_dequeue(&sk->sk_write_queue)) != NULL) {
+	while ((tmp_skb = __skb_dequeue(queue)) != NULL) {
 		__skb_pull(tmp_skb, skb_network_header_len(skb));
 		*tail_skb = tmp_skb;
 		tail_skb = &(tmp_skb->next);
@@ -1286,8 +1315,8 @@ int ip_push_pending_frames(struct sock *sk)
 	     ip_dont_fragment(sk, &rt->dst)))
 		df = htons(IP_DF);
 
-	if (inet->cork.flags & IPCORK_OPT)
-		opt = inet->cork.opt;
+	if (cork->flags & IPCORK_OPT)
+		opt = cork->opt;
 
 	if (rt->rt_type == RTN_MULTICAST)
 		ttl = inet->mc_ttl;
@@ -1299,7 +1328,7 @@ int ip_push_pending_frames(struct sock *sk)
 	iph->ihl = 5;
 	if (opt) {
 		iph->ihl += opt->optlen>>2;
-		ip_options_build(skb, opt, inet->cork.addr, rt, 0);
+		ip_options_build(skb, opt, cork->addr, rt, 0);
 	}
 	iph->tos = inet->tos;
 	iph->frag_off = df;
@@ -1315,7 +1344,7 @@ int ip_push_pending_frames(struct sock *sk)
 	 * Steal rt from cork.dst to avoid a pair of atomic_inc/atomic_dec
 	 * on dst refcount
 	 */
-	inet->cork.dst = NULL;
+	cork->dst = NULL;
 	skb_dst_set(skb, &rt->dst);
 
 	if (iph->protocol == IPPROTO_ICMP)
@@ -1332,7 +1361,7 @@ int ip_push_pending_frames(struct sock *sk)
 	}
 
 out:
-	ip_cork_release(inet);
+	ip_cork_release(cork);
 	return err;
 
 error:
@@ -1340,17 +1369,30 @@ error:
 	goto out;
 }
 
+int ip_push_pending_frames(struct sock *sk)
+{
+	return __ip_push_pending_frames(sk, &sk->sk_write_queue,
+					&inet_sk(sk)->cork);
+}
+
 /*
  *	Throw away all pending data on the socket.
  */
-void ip_flush_pending_frames(struct sock *sk)
+static void __ip_flush_pending_frames(struct sock *sk,
+				      struct sk_buff_head *queue,
+				      struct inet_cork *cork)
 {
 	struct sk_buff *skb;
 
-	while ((skb = __skb_dequeue_tail(&sk->sk_write_queue)) != NULL)
+	while ((skb = __skb_dequeue_tail(queue)) != NULL)
 		kfree_skb(skb);
 
-	ip_cork_release(inet_sk(sk));
+	ip_cork_release(cork);
+}
+
+void ip_flush_pending_frames(struct sock *sk)
+{
+	__ip_flush_pending_frames(sk, &sk->sk_write_queue, &inet_sk(sk)->cork);
 }
 
 

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 1/5] net: Remove unused sk_sndmsg_* from UFO
  2011-02-27 11:06             ` Herbert Xu
                                 ` (2 preceding siblings ...)
  2011-02-28 11:41               ` [PATCH 2/5] net: Remove explicit write references to sk/inet in ip_append_data Herbert Xu
@ 2011-02-28 11:41               ` Herbert Xu
  2011-03-01  5:31                 ` Eric Dumazet
  2011-02-28 11:41               ` [PATCH 3/5] inet: Add ip_make_skb and ip_send_skb Herbert Xu
  2011-02-28 11:41               ` [PATCH 4/5] udp: Add lockless transmit path Herbert Xu
  5 siblings, 1 reply; 91+ messages in thread
From: Herbert Xu @ 2011-02-28 11:41 UTC (permalink / raw)
  To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev, Thomas Graf

net: Remove unused sk_sndmsg_* from UFO

UFO doesn't really use the sk_sndmsg_* parameters so touching
them is pointless.  It can't use them anyway since the whole
point of UFO is to use the original pages without copying.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/core/skbuff.c     |    3 ---
 net/ipv4/ip_output.c  |    1 -
 net/ipv6/ip6_output.c |    1 -
 3 files changed, 5 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d883dcc..97011a7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2434,8 +2434,6 @@ int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb,
 			return -ENOMEM;
 
 		/* initialize the next frag */
-		sk->sk_sndmsg_page = page;
-		sk->sk_sndmsg_off = 0;
 		skb_fill_page_desc(skb, frg_cnt, page, 0, 0);
 		skb->truesize += PAGE_SIZE;
 		atomic_add(PAGE_SIZE, &sk->sk_wmem_alloc);
@@ -2455,7 +2453,6 @@ int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb,
 			return -EFAULT;
 
 		/* copy was successful so update the size parameters */
-		sk->sk_sndmsg_off += copy;
 		frag->size += copy;
 		skb->len += copy;
 		skb->data_len += copy;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 04c7b3b..d3a4540 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -767,7 +767,6 @@ static inline int ip_ufo_append_data(struct sock *sk,
 
 		skb->ip_summed = CHECKSUM_PARTIAL;
 		skb->csum = 0;
-		sk->sk_sndmsg_off = 0;
 
 		/* specify the length of each IP datagram fragment */
 		skb_shinfo(skb)->gso_size = mtu - fragheaderlen;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 5f8d242..9965182 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1061,7 +1061,6 @@ static inline int ip6_ufo_append_data(struct sock *sk,
 
 		skb->ip_summed = CHECKSUM_PARTIAL;
 		skb->csum = 0;
-		sk->sk_sndmsg_off = 0;
 	}
 
 	err = skb_append_datato_frags(sk,skb, getfrag, from,

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 4/5] udp: Add lockless transmit path
  2011-02-27 11:06             ` Herbert Xu
                                 ` (4 preceding siblings ...)
  2011-02-28 11:41               ` [PATCH 3/5] inet: Add ip_make_skb and ip_send_skb Herbert Xu
@ 2011-02-28 11:41               ` Herbert Xu
  2011-02-28 11:41                 ` Herbert Xu
  2011-03-01  5:30                 ` Eric Dumazet
  5 siblings, 2 replies; 91+ messages in thread
From: Herbert Xu @ 2011-02-28 11:41 UTC (permalink / raw)
  To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev, Thomas Graf

udp: Add lockless transmit path

The UDP transmit path has been running under the socket lock
for a long time because of the corking feature.  This means that
transmitting to the same socket in multiple threads does not
scale at all.

However, as most users don't actually use corking, the locking
can be removed in the common case.

This patch creates a lockless fast path where corking is not used.

Please note that this does create a slight inaccuracy in the
enforcement of socket send buffer limits.  In particular, we
may exceed the socket limit by up to (number of CPUs) * (packet
size) because of the way the limit is computed.

As the primary purpose of socket buffers is to indicate congestion,
this should not be a great problem for now.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/net/udp.h     |   11 +++++
 include/net/udplite.h |   12 +++++
 net/ipv4/udp.c        |  104 +++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 126 insertions(+), 1 deletion(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index bb967dd..b8563ba 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -144,6 +144,17 @@ static inline __wsum udp_csum_outgoing(struct sock *sk, struct sk_buff *skb)
 	return csum;
 }
 
+static inline __wsum udp_csum(struct sk_buff *skb)
+{
+	__wsum csum = csum_partial(skb_transport_header(skb),
+				   sizeof(struct udphdr), skb->csum);
+
+	for (skb = skb_shinfo(skb)->frag_list; skb; skb = skb->next) {
+		csum = csum_add(csum, skb->csum);
+	}
+	return csum;
+}
+
 /* hash routines shared between UDPv4/6 and UDP-Litev4/6 */
 static inline void udp_lib_hash(struct sock *sk)
 {
diff --git a/include/net/udplite.h b/include/net/udplite.h
index afdffe6..673a024 100644
--- a/include/net/udplite.h
+++ b/include/net/udplite.h
@@ -115,6 +115,18 @@ static inline __wsum udplite_csum_outgoing(struct sock *sk, struct sk_buff *skb)
 	return csum;
 }
 
+static inline __wsum udplite_csum(struct sk_buff *skb)
+{
+	struct sock *sk = skb->sk;
+	int cscov = udplite_sender_cscov(udp_sk(sk), udp_hdr(skb));
+	const int off = skb_transport_offset(skb);
+	const int len = skb->len - off;
+
+	skb->ip_summed = CHECKSUM_NONE;     /* no HW support for checksumming */
+
+	return skb_checksum(skb, off, min(cscov, len), 0);
+}
+
 extern void	udplite4_register(void);
 extern int 	udplite_get_port(struct sock *sk, unsigned short snum,
 			int (*scmp)(const struct sock *, const struct sock *));
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8157b17..7fd3664 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -769,6 +769,95 @@ out:
 	return err;
 }
 
+static void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst)
+{
+	struct udphdr *uh = udp_hdr(skb);
+	struct sk_buff *frags = skb_shinfo(skb)->frag_list;
+	int offset = skb_transport_offset(skb);
+	int len = skb->len - offset;
+	int hlen = len;
+	__wsum csum = 0;
+
+	if (!frags) {
+		/*
+		 * Only one fragment on the socket.
+		 */
+		skb->csum_start = skb_transport_header(skb) - skb->head;
+		skb->csum_offset = offsetof(struct udphdr, check);
+		uh->check = ~csum_tcpudp_magic(src, dst, len,
+					       IPPROTO_UDP, 0);
+	} else {
+		/*
+		 * HW-checksum won't work as there are two or more
+		 * fragments on the socket so that all csums of sk_buffs
+		 * should be together
+		 */
+		do {
+			csum = csum_add(csum, frags->csum);
+			hlen -= frags->len;
+		} while ((frags = frags->next));
+
+		csum = skb_checksum(skb, offset, hlen, csum);
+		skb->ip_summed = CHECKSUM_NONE;
+
+		uh->check = csum_tcpudp_magic(src, dst, len, IPPROTO_UDP, csum);
+		if (uh->check == 0)
+			uh->check = CSUM_MANGLED_0;
+	}
+}
+
+static int udp_send_skb(struct sk_buff *skb, __be32 daddr, __be32 dport)
+{
+	struct sock *sk = skb->sk;
+	struct inet_sock *inet = inet_sk(sk);
+	struct udphdr *uh;
+	struct rtable *rt = (struct rtable *)skb_dst(skb);
+	int err = 0;
+	int is_udplite = IS_UDPLITE(sk);
+	int offset = skb_transport_offset(skb);
+	int len = skb->len - offset;
+	__wsum csum = 0;
+
+	/*
+	 * Create a UDP header
+	 */
+	uh = udp_hdr(skb);
+	uh->source = inet->inet_sport;
+	uh->dest = dport;
+	uh->len = htons(len);
+	uh->check = 0;
+
+	if (is_udplite)
+		csum = udplite_csum(skb);
+	else if (sk->sk_no_check == UDP_CSUM_NOXMIT) {
+		skb->ip_summed = CHECKSUM_NONE;
+		goto send;
+	} else if (skb->ip_summed == CHECKSUM_PARTIAL) {
+		udp4_hwcsum(skb, rt->rt_src, daddr);
+		goto send;
+	} else
+		csum = udp_csum(skb);
+
+	/* add protocol-dependent pseudo-header */
+	uh->check = csum_tcpudp_magic(rt->rt_src, daddr, len,
+				      sk->sk_protocol, csum);
+	if (uh->check == 0)
+		uh->check = CSUM_MANGLED_0;
+
+send:
+	err = ip_send_skb(skb);
+	if (err) {
+		if (err == -ENOBUFS && !inet->recverr) {
+			UDP_INC_STATS_USER(sock_net(sk),
+					   UDP_MIB_SNDBUFERRORS, is_udplite);
+			err = 0;
+		}
+	} else
+		UDP_INC_STATS_USER(sock_net(sk),
+				   UDP_MIB_OUTDATAGRAMS, is_udplite);
+	return err;
+}
+
 int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		size_t len)
 {
@@ -785,6 +874,7 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	int err, is_udplite = IS_UDPLITE(sk);
 	int corkreq = up->corkflag || msg->msg_flags&MSG_MORE;
 	int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
+	struct sk_buff *skb;
 
 	if (len > 0xFFFF)
 		return -EMSGSIZE;
@@ -799,6 +889,8 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	ipc.opt = NULL;
 	ipc.tx_flags = 0;
 
+	getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag;
+
 	if (up->pending) {
 		/*
 		 * There are pending frames.
@@ -923,6 +1015,17 @@ back_from_confirm:
 	if (!ipc.addr)
 		daddr = ipc.addr = rt->rt_dst;
 
+	/* Lockless fast path for the non-corking case. */
+	if (!corkreq) {
+		skb = ip_make_skb(sk, getfrag, msg->msg_iov, ulen,
+				  sizeof(struct udphdr), &ipc, &rt,
+				  msg->msg_flags);
+		err = PTR_ERR(skb);
+		if (skb && !IS_ERR(skb))
+			err = udp_send_skb(skb, daddr, dport);
+		goto out;
+	}
+
 	lock_sock(sk);
 	if (unlikely(up->pending)) {
 		/* The socket is already corked while preparing it. */
@@ -944,7 +1047,6 @@ back_from_confirm:
 
 do_append_data:
 	up->len += ulen;
-	getfrag  =  is_udplite ?  udplite_getfrag : ip_generic_getfrag;
 	err = ip_append_data(sk, getfrag, msg->msg_iov, ulen,
 			sizeof(struct udphdr), &ipc, &rt,
 			corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
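
For reference, the contention this fast path removes can be reproduced from
user space by having many threads call sendto() on one shared, unconnected
UDP socket.  The program below is only an illustrative sketch; it is not
the udpflood tool mentioned elsewhere in this thread, and the destination
address, port and thread count are arbitrary placeholders:

/*
 * Sketch: N threads hammering a single shared UDP socket.  Every
 * sendto() ends up in udp_sendmsg() on the same struct sock, which is
 * the case the lockless transmit path is meant to speed up.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

#define MAX_THREADS	64

static int fd;				/* one socket shared by all threads */
static struct sockaddr_in dst;
static long frames_per_thread = 100000;

static void *sender(void *arg)
{
	char payload[64] = { 0 };
	long i;

	for (i = 0; i < frames_per_thread; i++) {
		if (sendto(fd, payload, sizeof(payload), 0,
			   (struct sockaddr *)&dst, sizeof(dst)) < 0) {
			perror("sendto");
			break;
		}
	}
	return NULL;
}

int main(int argc, char **argv)
{
	const char *ip = argc > 1 ? argv[1] : "10.2.2.21";	/* placeholder */
	int nthreads = argc > 2 ? atoi(argv[2]) : 16;
	pthread_t tid[MAX_THREADS];
	int i;

	if (nthreads > MAX_THREADS)
		nthreads = MAX_THREADS;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(9);	/* "discard" port; any sink will do */
	if (inet_pton(AF_INET, ip, &dst.sin_addr) != 1) {
		fprintf(stderr, "bad address %s\n", ip);
		return 1;
	}

	for (i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, sender, NULL);
	for (i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

Before this patch every one of those sendto() calls serializes on the socket
lock in udp_sendmsg(); with the lockless fast path they can proceed in
parallel as long as corking is not used.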

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 3/5] inet: Add ip_make_skb and ip_send_skb
  2011-02-27 11:06             ` Herbert Xu
                                 ` (3 preceding siblings ...)
  2011-02-28 11:41               ` [PATCH 1/5] net: Remove unused sk_sndmsg_* from UFO Herbert Xu
@ 2011-02-28 11:41               ` Herbert Xu
  2011-03-01  5:31                 ` Eric Dumazet
  2011-02-28 11:41               ` [PATCH 4/5] udp: Add lockless transmit path Herbert Xu
  5 siblings, 1 reply; 91+ messages in thread
From: Herbert Xu @ 2011-02-28 11:41 UTC (permalink / raw)
  To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev, Thomas Graf

inet: Add ip_make_skb and ip_send_skb

This patch adds the helper ip_make_skb which is like ip_append_data
and ip_push_pending_frames all rolled into one, except that it does
not send the skb produced.  The sending part is carried out by
ip_send_skb, which the transport protocol can call after it has
tweaked the skb.

It is meant to be called in cases where corking is not used and should
have a one-to-one correspondence to sendmsg.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/net/ip.h     |    8 ++++++
 net/ipv4/ip_output.c |   65 ++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 59 insertions(+), 14 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 67fac78..a96e525 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -116,8 +116,16 @@ extern int		ip_append_data(struct sock *sk,
 extern int		ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb);
 extern ssize_t		ip_append_page(struct sock *sk, struct page *page,
 				int offset, size_t size, int flags);
+extern int		ip_send_skb(struct sk_buff *skb);
 extern int		ip_push_pending_frames(struct sock *sk);
 extern void		ip_flush_pending_frames(struct sock *sk);
+extern struct sk_buff  *ip_make_skb(struct sock *sk,
+				    int getfrag(void *from, char *to, int offset, int len,
+						int odd, struct sk_buff *skb),
+				    void *from, int length, int transhdrlen,
+				    struct ipcm_cookie *ipc,
+				    struct rtable **rtp,
+				    unsigned int flags);
 
 /* datagram.c */
 extern int		ip4_datagram_connect(struct sock *sk, 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 1dd5ecc..dba14c6 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1267,9 +1267,9 @@ static void ip_cork_release(struct inet_cork *cork)
  *	Combined all pending IP fragments on the socket as one IP datagram
  *	and push them out.
  */
-static int __ip_push_pending_frames(struct sock *sk,
-				    struct sk_buff_head *queue,
-				    struct inet_cork *cork)
+static struct sk_buff *__ip_make_skb(struct sock *sk,
+				     struct sk_buff_head *queue,
+				     struct inet_cork *cork)
 {
 	struct sk_buff *skb, *tmp_skb;
 	struct sk_buff **tail_skb;
@@ -1280,7 +1280,6 @@ static int __ip_push_pending_frames(struct sock *sk,
 	struct iphdr *iph;
 	__be16 df = 0;
 	__u8 ttl;
-	int err = 0;
 
 	if ((skb = __skb_dequeue(queue)) == NULL)
 		goto out;
@@ -1351,28 +1350,37 @@ static int __ip_push_pending_frames(struct sock *sk,
 		icmp_out_count(net, ((struct icmphdr *)
 			skb_transport_header(skb))->type);
 
-	/* Netfilter gets whole the not fragmented skb. */
+	ip_cork_release(cork);
+out:
+	return skb;
+}
+
+int ip_send_skb(struct sk_buff *skb)
+{
+	struct net *net = sock_net(skb->sk);
+	int err;
+
 	err = ip_local_out(skb);
 	if (err) {
 		if (err > 0)
 			err = net_xmit_errno(err);
 		if (err)
-			goto error;
+			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
 	}
 
-out:
-	ip_cork_release(cork);
 	return err;
-
-error:
-	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
-	goto out;
 }
 
 int ip_push_pending_frames(struct sock *sk)
 {
-	return __ip_push_pending_frames(sk, &sk->sk_write_queue,
-					&inet_sk(sk)->cork);
+	struct sk_buff *skb;
+
+	skb = __ip_make_skb(sk, &sk->sk_write_queue, &inet_sk(sk)->cork);
+	if (!skb)
+		return 0;
+
+	/* Netfilter gets whole the not fragmented skb. */
+	return ip_send_skb(skb);
 }
 
 /*
@@ -1395,6 +1403,35 @@ void ip_flush_pending_frames(struct sock *sk)
 	__ip_flush_pending_frames(sk, &sk->sk_write_queue, &inet_sk(sk)->cork);
 }
 
+struct sk_buff *ip_make_skb(struct sock *sk,
+			    int getfrag(void *from, char *to, int offset,
+					int len, int odd, struct sk_buff *skb),
+			    void *from, int length, int transhdrlen,
+			    struct ipcm_cookie *ipc, struct rtable **rtp,
+			    unsigned int flags)
+{
+	struct inet_cork cork = {};
+	struct sk_buff_head queue;
+	int err;
+
+	if (flags & MSG_PROBE)
+		return NULL;
+
+	__skb_queue_head_init(&queue);
+
+	err = ip_setup_cork(sk, &cork, ipc, rtp);
+	if (err)
+		return ERR_PTR(err);
+
+	err = __ip_append_data(sk, &queue, &cork, getfrag,
+			       from, length, transhdrlen, flags);
+	if (err) {
+		__ip_flush_pending_frames(sk, &queue, &cork);
+		return ERR_PTR(err);
+	}
+
+	return __ip_make_skb(sk, &queue, &cork);
+}
 
 /*
  *	Fetch data from kernel space and fill in checksum if needed.
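
For readers following the series, the call pattern a transport protocol is
expected to use on the non-corking path is roughly the sketch below.  This
is an illustration only, not code from the series; udp_send_skb() in the
following patch is the real user, and the UDP-sized transport header is
just an example:

/*
 * Sketch of how a transport protocol uses the two new helpers declared
 * in include/net/ip.h above.  Illustration only.
 */
static int example_send(struct sock *sk, struct msghdr *msg, size_t len,
			struct ipcm_cookie *ipc, struct rtable **rtp)
{
	struct sk_buff *skb;

	/* Build a single skb without touching the socket's cork state. */
	skb = ip_make_skb(sk, ip_generic_getfrag, msg->msg_iov, len,
			  sizeof(struct udphdr), ipc, rtp, msg->msg_flags);
	if (skb == NULL)
		return 0;		/* MSG_PROBE: nothing to send */
	if (IS_ERR(skb))
		return PTR_ERR(skb);

	/* ... fill in the transport header of skb here ... */

	return ip_send_skb(skb);	/* hands the skb to ip_local_out() */
}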

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH 4/5] udp: Add lockless transmit path
  2011-02-28 11:41               ` [PATCH 4/5] udp: Add lockless transmit path Herbert Xu
@ 2011-02-28 11:41                 ` Herbert Xu
  2011-03-01  5:30                 ` Eric Dumazet
  1 sibling, 0 replies; 91+ messages in thread
From: Herbert Xu @ 2011-02-28 11:41 UTC (permalink / raw)
  To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev, Thomas Graf

On Mon, Feb 28, 2011 at 07:41:01PM +0800, Herbert Xu wrote:
> udp: Add lockless transmit path

Doh! There are only 4 patches in the series.  So you didn't
miss anything, yet :)
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-28 11:36               ` Herbert Xu
@ 2011-02-28 13:32                 ` Eric Dumazet
  2011-02-28 14:13                   ` Herbert Xu
  2011-02-28 14:53                   ` Eric Dumazet
  2011-02-28 14:13                 ` Thomas Graf
                                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 91+ messages in thread
From: Eric Dumazet @ 2011-02-28 13:32 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

Le lundi 28 février 2011 à 19:36 +0800, Herbert Xu a écrit :
> On Sun, Feb 27, 2011 at 07:06:14PM +0800, Herbert Xu wrote:
> > I'm working on this right now.
> 
> OK I think I was definitely on the right track.  With the send
> patch made lockless I now get numbers which are even better than
> those obtained with running named with multiple sockets.  That's
> right, a single socket is now faster than what multiple sockets
> were without the patch (of course, multiple sockets may still
> be faster with the patch vs. a single socket for obvious reasons,
> but I couldn't measure any significant difference).
> 
> Also worthy of note is that prior to the patch all CPUs showed
> idleness (lazy bastards!), with the patch they're all maxed out.
> 
> In retrospect, the idleness was simply the result of the socket
> lock scheduling away and was an indication of lock contention.
> 

Now, input path can run without finding socket locked by xmit path, so
skb are queued into receive queue, not backlog one.

> Here are the patches I used.  Please don't them yet as I intend
> to clean them up quite a bit.
> 
> But please do test them heavily, especially if you have an AMD
> NUMA machine as that's where scalability problems really show
> up.  Intel tends to be a lot more forgiving.  My last AMD machine
> blew up years ago :)

I am going to test them, thanks !



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-28 11:36               ` Herbert Xu
  2011-02-28 13:32                 ` Eric Dumazet
@ 2011-02-28 14:13                 ` Thomas Graf
  2011-02-28 16:22                   ` Eric Dumazet
  2011-03-01  5:33                 ` Eric Dumazet
  2011-03-01 12:35                 ` Herbert Xu
  3 siblings, 1 reply; 91+ messages in thread
From: Thomas Graf @ 2011-02-28 14:13 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

On Mon, Feb 28, 2011 at 07:36:59PM +0800, Herbert Xu wrote:
> But please do test them heavily, especially if you have an AMD
> NUMA machine as that's where scalability problems really show
> up.  Intel tends to be a lot more forgiving.  My last AMD machine
> blew up years ago :)

This is just a preliminary test result and not 100% reliable
because halfway through the testing the machine reported memory
issues and disabled a DIMM before booting the tested kernels.

Nevertheless, bind 9.7.3:

2.6.38-rc5+: 62kqps
2.6.38-rc5+ w/ Herbert's patch: 442kqps

This is on a 2 NUMA Intel Xeon X5560 @ 2.80GHz with 16 cores

Again, this number is not 100% reliable but clearly shows that
the concept of the patch is working very well.

Will test Herbert's patch on the machine that did 650kqps with
SO_REUSEPORT and also on some AMD machines.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-28 13:32                 ` Eric Dumazet
@ 2011-02-28 14:13                   ` Herbert Xu
  2011-02-28 14:22                     ` Eric Dumazet
  2011-02-28 14:53                   ` Eric Dumazet
  1 sibling, 1 reply; 91+ messages in thread
From: Herbert Xu @ 2011-02-28 14:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

On Mon, Feb 28, 2011 at 02:32:51PM +0100, Eric Dumazet wrote:
>
> Now, input path can run without finding socket locked by xmit path, so
> skb are queued into receive queue, not backlog one.

Indeed, I think this is what Dave alluded to earlier.  This will
eventually have to be dealt with but for now the data rate is low
enough that it isn't killing us.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-28 14:13                   ` Herbert Xu
@ 2011-02-28 14:22                     ` Eric Dumazet
  2011-02-28 14:25                       ` Herbert Xu
  0 siblings, 1 reply; 91+ messages in thread
From: Eric Dumazet @ 2011-02-28 14:22 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

Le lundi 28 février 2011 à 22:13 +0800, Herbert Xu a écrit :
> On Mon, Feb 28, 2011 at 02:32:51PM +0100, Eric Dumazet wrote:
> >
> > Now, input path can run without finding socket locked by xmit path, so
> > skb are queued into receive queue, not backlog one.
> 
> Indeed, I think this is what Dave alluded to earlier.  This will
> eventually have to be dealt with but for now the data rate is low
> enough that it isn't killing us.

Not sure how you read this ;)

I said that before your patches, a sender was consuming a lot of time
transferring frames from the backlog to the receive queue right before
releasing the socket lock.

Now, the receive path doesn't slow down the senders, and vice versa.

:)



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-28 14:22                     ` Eric Dumazet
@ 2011-02-28 14:25                       ` Herbert Xu
  0 siblings, 0 replies; 91+ messages in thread
From: Herbert Xu @ 2011-02-28 14:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

On Mon, Feb 28, 2011 at 03:22:06PM +0100, Eric Dumazet wrote:
>
> Not sure how you read this ;)
> 
> I said that before your patches, a sender was consuming a lot of time
> transferring frames from the backlog to the receive queue right before
> releasing the socket lock.
> 
> Now, the receive path doesn't slow down the senders, and vice versa.
> 
> :)

I understood what you wrote :)

I was just referring to an earlier message where Dave talked about
the UDP accounting patch making us have to take the lock on every
packet.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-28 13:32                 ` Eric Dumazet
  2011-02-28 14:13                   ` Herbert Xu
@ 2011-02-28 14:53                   ` Eric Dumazet
  2011-02-28 15:01                     ` Thomas Graf
  1 sibling, 1 reply; 91+ messages in thread
From: Eric Dumazet @ 2011-02-28 14:53 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

Le lundi 28 février 2011 à 14:32 +0100, Eric Dumazet a écrit :
> Le lundi 28 février 2011 à 19:36 +0800, Herbert Xu a écrit :
> > On Sun, Feb 27, 2011 at 07:06:14PM +0800, Herbert Xu wrote:
> > > I'm working on this right now.
> > 
> > OK I think I was definitely on the right track.  With the send
> > patch made lockless I now get numbers which are even better than
> > those obtained with running named with multiple sockets.  That's
> > right, a single socket is now faster than what multiple sockets
> > were without the patch (of course, multiple sockets may still
> > be faster with the patch vs. a single socket for obvious reasons,
> > but I couldn't measure any significant difference).
> > 
> > Also worthy of note is that prior to the patch all CPUs showed
> > idleness (lazy bastards!), with the patch they're all maxed out.
> > 
> > In retrospect, the idleness was simply the result of the socket
> > lock scheduling away and was an indication of lock contention.
> > 
> 
> Now, input path can run without finding socket locked by xmit path, so
> skb are queued into receive queue, not backlog one.
> 
> > Here are the patches I used.  Please don't them yet as I intend
> > to clean them up quite a bit.
> > 
> > But please do test them heavily, especially if you have an AMD
> > NUMA machine as that's where scalability problems really show
> > up.  Intel tends to be a lot more forgiving.  My last AMD machine
> > blew up years ago :)
> 
> I am going to test them, thanks !
> 

First "sending only" tests on my 2x4x2 machine (two E5540@2.53GHz, quad
core, hyper threaded, NUMA kernel)

16 threads, each one sending 100.000 UDP frames using a _shared_ socket

I use the same destination IP, so suffer a bit of dst refcount
contention.

(to dummy0 device to avoid contention on qdisc and device)
# ip ro get 10.2.2.21
10.2.2.21 dev dummy0  src 10.2.2.2 
    cache 

LOCKDEP enabled kernel

Before :

time ./udpflood -f -t 16 -l 100000 10.2.2.21

real	0m42.749s
user	0m1.010s
sys	1m38.039s

After :

time ./udpflood -f -t 16 -l 100000 10.2.2.21

real	0m1.167s
user	0m0.488s
sys	0m17.373s


With one thread only and 16*100000 frames :
# time ./udpflood -f -l 1600000 10.2.2.21

real	0m9.318s
user	0m0.238s
sys	0m9.052s

(We have some false sharing on atomic fields in struct file and socket,
but nothing to worry about.)

With LOCKDEP OFF :

16 threads :

# time ./udpflood -f -t 16 -l 100000 10.2.2.21

real	0m0.718s
user	0m0.376s
sys	0m10.963s

1 thread :

# time ./udpflood -f -l 1600000 10.2.2.21

real	0m1.514s
user	0m0.153s
sys	0m1.357s


"perf record/report" results for the 16 threads case (no lockdep)

# Events: 389K cpu-clock-msecs
#
# Overhead      Command        Shared Object                               Symbol
# ........  ...........  ...................  ...................................
#
     9.03%     udpflood  [kernel.kallsyms]    [k] sock_wfree
     8.58%     udpflood  [kernel.kallsyms]    [k] __ip_route_output_key
     8.52%     udpflood  [kernel.kallsyms]    [k] sock_alloc_send_pskb
     7.46%     udpflood  [kernel.kallsyms]    [k] sock_def_write_space
     6.76%     udpflood  [kernel.kallsyms]    [k] __xfrm_lookup
     6.18%      swapper  [kernel.kallsyms]    [k] acpi_idle_enter_bm
     5.66%     udpflood  [kernel.kallsyms]    [k] dst_release
     4.96%     udpflood  [kernel.kallsyms]    [k] udp_sendmsg
     3.48%     udpflood  [kernel.kallsyms]    [k] fget_light
     2.75%     udpflood  [kernel.kallsyms]    [k] sock_tx_timestamp
     2.40%     udpflood  [kernel.kallsyms]    [k] __ip_make_skb
     2.36%     udpflood  [kernel.kallsyms]    [k] fput
     1.87%      swapper  [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
     1.81%     udpflood  [kernel.kallsyms]    [k] inet_sendmsg
     1.53%     udpflood  [kernel.kallsyms]    [k] sys_sendto
     1.50%     udpflood  [kernel.kallsyms]    [k] ip_finish_output
     1.31%     udpflood  [kernel.kallsyms]    [k] csum_partial_copy_generic
     1.30%     udpflood  udpflood             [.] do_thread
     1.28%     udpflood  [kernel.kallsyms]    [k] __ip_append_data
     1.08%     udpflood  [kernel.kallsyms]    [k] __memset
     1.05%     udpflood  [kernel.kallsyms]    [k] ip_route_output_flow
     0.91%     udpflood  [kernel.kallsyms]    [k] kfree
     0.88%     udpflood  [vdso]               [.] 0xffffe430
     0.83%     udpflood  [kernel.kallsyms]    [k] copy_user_generic_string
     0.78%     udpflood  libc-2.3.4.so        [.] __GI_memcpy
     0.77%     udpflood  [kernel.kallsyms]    [k] ia32_sysenter_target


What do you suggest to perform a bind based test ?




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-28 14:53                   ` Eric Dumazet
@ 2011-02-28 15:01                     ` Thomas Graf
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Graf @ 2011-02-28 15:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Mon, Feb 28, 2011 at 03:53:03PM +0100, Eric Dumazet wrote:
> What do you suggest to perform a bind based test ?

We use queryperf from BIND sources. I typically run 1 queryperf
instance per core on multiple machines.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-28 14:13                 ` Thomas Graf
@ 2011-02-28 16:22                   ` Eric Dumazet
  2011-02-28 16:37                     ` Thomas Graf
  0 siblings, 1 reply; 91+ messages in thread
From: Eric Dumazet @ 2011-02-28 16:22 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

Le lundi 28 février 2011 à 09:13 -0500, Thomas Graf a écrit :
> On Mon, Feb 28, 2011 at 07:36:59PM +0800, Herbert Xu wrote:
> > But please do test them heavily, especially if you have an AMD
> > NUMA machine as that's where scalability problems really show
> > up.  Intel tends to be a lot more forgiving.  My last AMD machine
> > blew up years ago :)
> 
> This is just a preliminary test result and not 100% reliable
> because halfway through the testing the machine reported memory
> issues and disabled a DIMM before booting the tested kernels.
> 
> Nevertheless, bind 9.7.3:
> 
> 2.6.38-rc5+: 62kqps
> 2.6.38-rc5+ w/ Herbert's patch: 442kqps
> 
> This is on a 2 NUMA Intel Xeon X5560 @ 2.80GHz with 16 cores
> 
> Again, this number is not 100% reliable but clearly shows that
> the concept of the patch is working very well.
> 
> Will test Herbert's patch on the machine that did 650kqps with
> SO_REUSEPORT and also on some AMD machines.
> --

I suspect your queryperf input file hits many zones ?

With a single zone, my machine is able to give 250kps : most of the time
is consumed in bind code, dealing with rwlocks and false sharing
things...

(bind-9.7.2-P3)
Using two remote machines to perform queries, on bnx2x adapter, RSS
enabled : two cpus receive UDP frames for the same socket, so we also
hit false sharing in kernel receive path.


---------------------------------------------------------------------------------------------------------------------------------
   PerfTop:  558863 irqs/sec  kernel:40.8%  exact:  0.0% [1000Hz cpu-clock-msecs],  (all, 16 CPUs)
---------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ ______________________________________

           137175.00 12.4% acpi_idle_enter_bm            [kernel.kallsyms]                     
            63784.00  5.8% _raw_spin_unlock_irqrestore   [kernel.kallsyms]                     
            54140.00  4.9% isc_rwlock_lock               /opt/src/bind-9.7.2-P3/bin/named/named
            32682.00  2.9% isc_rwlock_unlock             /opt/src/bind-9.7.2-P3/bin/named/named
            21823.00  2.0% dns_rbt_findnode              /opt/src/bind-9.7.2-P3/bin/named/named
            20306.00  1.8% __ticket_spin_lock            [kernel.kallsyms]                     
            16881.00  1.5% finish_task_switch            [kernel.kallsyms]                     
            15335.00  1.4% zone_find                     /opt/src/bind-9.7.2-P3/bin/named/named
            14082.00  1.3% decrement_reference           /opt/src/bind-9.7.2-P3/bin/named/named
            14064.00  1.3% __pthread_mutex_lock_internal /lib/tls/libpthread-2.3.4.so          
            13519.00  1.2% isc_stats_increment           /opt/src/bind-9.7.2-P3/bin/named/named
            13027.00  1.2% __GI_memcpy                   /lib/tls/libc-2.3.4.so                
            12516.00  1.1% dns_name_concatenate          /opt/src/bind-9.7.2-P3/bin/named/named
            12499.00  1.1% currentversion                /opt/src/bind-9.7.2-P3/bin/named/named
            11412.00  1.0% dns_name_fullcompare          /opt/src/bind-9.7.2-P3/bin/named/named
            10814.00  1.0% new_reference.clone.6         /opt/src/bind-9.7.2-P3/bin/named/named
            10580.00  1.0% attach                        /opt/src/bind-9.7.2-P3/bin/named/named
             9805.00  0.9% zone_zonecut_callback         /opt/src/bind-9.7.2-P3/bin/named/named



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-28 16:22                   ` Eric Dumazet
@ 2011-02-28 16:37                     ` Thomas Graf
  2011-02-28 17:07                       ` Eric Dumazet
  0 siblings, 1 reply; 91+ messages in thread
From: Thomas Graf @ 2011-02-28 16:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Mon, Feb 28, 2011 at 05:22:54PM +0100, Eric Dumazet wrote:
> Le lundi 28 février 2011 à 09:13 -0500, Thomas Graf a écrit :
> > On Mon, Feb 28, 2011 at 07:36:59PM +0800, Herbert Xu wrote:
> > > But please do test them heavily, especially if you have an AMD
> > > NUMA machine as that's where scalability problems really show
> > > up.  Intel tends to be a lot more forgiving.  My last AMD machine
> > > blew up years ago :)
> > 
> > This is just a preliminary test result and not 100% reliable
> > because halfway through the testing the machine reported memory
> > issues and disabled a DIMM before booting the tested kernels.
> > 
> > Nevertheless, bind 9.7.3:
> > 
> > 2.6.38-rc5+: 62kqps
> > 2.6.38-rc5+ w/ Herbert's patch: 442kqps
> > 
> > This is on a 2 NUMA Intel Xeon X5560 @ 2.80GHz with 16 cores
> > 
> > Again, this number is not 100% reliable but clearly shows that
> > the concept of the patch is working very well.
> > 
> > Will test Herbert's patch on the machine that did 650kqps with
> > SO_REUSEPORT and also on some AMD machines.
> > --
> 
> I suspect your queryperf input file hits many zones ?

No, we use a simple example.com zone with host[1-4] A records
resolving to 10.[1-4].0.1

> With a single zone, my machine is able to give 250kps : most of the time
> is consumed in bind code, dealing with rwlocks and false sharing
> things...
> 
> (bind-9.7.2-P3)
> Using two remote machines to perform queries, on bnx2x adapter, RSS
> enabled : two cpus receive UDP frames for the same socket, so we also
> hit false sharing in kernel receive path.

How do you measure the qps? The output of queryperf? That is not always
accurate. I run rndc stats twice and then calculate the qps based on the
counter "queries resulted in successful answer" diff and timestamp diff.

The numbers differ a lot depending on the architecture we test on.

F.e. on a 12 core AMD with 2 NUMA nodes:

2.6.32   named -n 1: 37.0kqps
         named:       3.8kqps (yes, no joke, the socket receive buffer is
                               always full and the kernel drops pkts)

2.6.38-rc5+ with Herbert's patches:
        named -n 1:  36.9kqps
        named:      222.0kqps

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-28 16:37                     ` Thomas Graf
@ 2011-02-28 17:07                       ` Eric Dumazet
  2011-03-01 10:19                         ` Thomas Graf
  0 siblings, 1 reply; 91+ messages in thread
From: Eric Dumazet @ 2011-02-28 17:07 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

Le lundi 28 février 2011 à 11:37 -0500, Thomas Graf a écrit :

> How do you measure the qps? The output of queryperf? That is not always
> accurate. I run rndc stats twice and then calculate the qps based on the
> counter "queries resulted in successful answer" diff and timestamp diff.
> 

I have some custom ethernet/system monitoring package installed, so I
get packet rates from it.

It appears my two source machines were not fast enough. (One had a LOCKDEP
kernel).

I now reach 320 kqps, even if I force NIC interrupts through one cpu
only.

> The numbers differ a lot depending on the architecture we test on.
> 
> F.e. on a 12 core AMD with 2 NUMA nodes:
> 
> 2.6.32   named -n 1: 37.0kqps
>          named:       3.8kqps (yes, no joke, the socket receive buffer is
>                                always full and the kernel drops pkts)

Yes, this old kernel misses commit c377411f2494a93, added in 2.6.35
(net: sk_add_backlog() take rmem_alloc into account)

Quoting the change log :

 Under huge stress from a multiqueue/RPS enabled NIC, a single flow udp
 receiver can now process ~200.000 pps (instead of ~100 pps before the
 patch) on a 8 core machine.

> 
> 2.6.38-rc5+ with Herbert's patches:
>         named -n 1:  36.9kqps
>         named:      222.0kqps



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 4/5] udp: Add lockless transmit path
  2011-02-28 11:41               ` [PATCH 4/5] udp: Add lockless transmit path Herbert Xu
  2011-02-28 11:41                 ` Herbert Xu
@ 2011-03-01  5:30                 ` Eric Dumazet
  1 sibling, 0 replies; 91+ messages in thread
From: Eric Dumazet @ 2011-03-01  5:30 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev, Thomas Graf

Le lundi 28 février 2011 à 19:41 +0800, Herbert Xu a écrit :
> udp: Add lockless transmit path
> 
> The UDP transmit path has been running under the socket lock
> for a long time because of the corking feature.  This means that
> transmitting to the same socket in multiple threads does not
> scale at all.
> 
> However, as most users don't actually use corking, the locking
> can be removed in the common case.
> 
> This patch creates a lockless fast path where corking is not used.
> 
> Please note that this does create a slight inaccuracy in the
> enforcement of socket send buffer limits.  In particular, we
> may exceed the socket limit by up to (number of CPUs) * (packet
> size) because of the way the limit is computed.
> 
> As the primary purpose of socket buffers is to indicate congestion,
> this should not be a great problem for now.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> ---

So far I found no obvious problem, and got pretty impressive results.

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 1/5] net: Remove unused sk_sndmsg_* from UFO
  2011-02-28 11:41               ` [PATCH 1/5] net: Remove unused sk_sndmsg_* from UFO Herbert Xu
@ 2011-03-01  5:31                 ` Eric Dumazet
  0 siblings, 0 replies; 91+ messages in thread
From: Eric Dumazet @ 2011-03-01  5:31 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev, Thomas Graf

Le lundi 28 février 2011 à 19:41 +0800, Herbert Xu a écrit :
> net: Remove unused sk_sndmsg_* from UFO
> 
> UFO doesn't really use the sk_sndmsg_* parameters so touching
> them is pointless.  It can't use them anyway since the whole
> point of UFO is to use the original pages without copying.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> ---

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 2/5] net: Remove explicit write references to sk/inet in ip_append_data
  2011-02-28 11:41               ` [PATCH 2/5] net: Remove explicit write references to sk/inet in ip_append_data Herbert Xu
@ 2011-03-01  5:31                 ` Eric Dumazet
  0 siblings, 0 replies; 91+ messages in thread
From: Eric Dumazet @ 2011-03-01  5:31 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev, Thomas Graf

Le lundi 28 février 2011 à 19:41 +0800, Herbert Xu a écrit :
> net: Remove explicit write references to sk/inet in ip_append_data
> 
> In order to allow simultaneous calls to ip_append_data on the same
> socket, it must not modify any shared state in sk or inet (other
> than those that are designed to allow that such as atomic counters).
> 
> This patch abstracts out write references to sk and inet_sk in
> ip_append_data and its friends so that we may use the underlying
> code in parallel.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> ---
> 
>  include/net/inet_sock.h |   23 ++--
>  net/ipv4/ip_output.c    |  238 ++++++++++++++++++++++++++++--------------------
>  2 files changed, 154 insertions(+), 107 deletions(-)

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 3/5] inet: Add ip_make_skb and ip_send_skb
  2011-02-28 11:41               ` [PATCH 3/5] inet: Add ip_make_skb and ip_send_skb Herbert Xu
@ 2011-03-01  5:31                 ` Eric Dumazet
  0 siblings, 0 replies; 91+ messages in thread
From: Eric Dumazet @ 2011-03-01  5:31 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev, Thomas Graf

Le lundi 28 février 2011 à 19:41 +0800, Herbert Xu a écrit :
> inet: Add ip_make_skb and ip_send_skb
> 
> This patch adds the helper ip_make_skb which is like ip_append_data
> and ip_push_pending_frames all rolled into one, except that it does
> not send the skb produced.  The sending part is carried out by
> ip_send_skb, which the transport protocol can call after it has
> tweaked the skb.
> 
> It is meant to be called in cases where corking is not used and should
> have a one-to-one correspondence to sendmsg.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-28 11:36               ` Herbert Xu
  2011-02-28 13:32                 ` Eric Dumazet
  2011-02-28 14:13                 ` Thomas Graf
@ 2011-03-01  5:33                 ` Eric Dumazet
  2011-03-01 12:35                 ` Herbert Xu
  3 siblings, 0 replies; 91+ messages in thread
From: Eric Dumazet @ 2011-03-01  5:33 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

Le lundi 28 février 2011 à 19:36 +0800, Herbert Xu a écrit :

> Here are the patches I used.  Please don't them yet as I intend
> to clean them up quite a bit.
> 

I assume you mean "please dont commit them" ;)

> But please do test them heavily, especially if you have an AMD
> NUMA machine as that's where scalability problems really show
> up.  Intel tends to be a lot more forgiving.  My last AMD machine
> blew up years ago :)

Same here, My only AMD machine is a desktop class machine, not a server.




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-28 17:07                       ` Eric Dumazet
@ 2011-03-01 10:19                         ` Thomas Graf
  2011-03-01 10:33                           ` Eric Dumazet
  0 siblings, 1 reply; 91+ messages in thread
From: Thomas Graf @ 2011-03-01 10:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Mon, Feb 28, 2011 at 06:07:49PM +0100, Eric Dumazet wrote:
> > The numbers differ a lot depending on the architecture we test on.
> > 
> > F.e. on a 12 core AMD with 2 NUMA nodes:
> > 
> > 2.6.32   named -n 1: 37.0kqps
> >          named:       3.8kqps (yes, no joke, the socket receive buffer is
> >                                always full and the kernel drops pkts)
> 
> Yes, this old kernel misses commit c377411f2494a93, added in 2.6.35
> (net: sk_add_backlog() take rmem_alloc into account)

I retested with net-2.6 w/o Herbert's patch:

named -n 1: 36.9kqps
named:      16.2kqps

> > 2.6.38-rc5+ with Herbert's patches:
> >         named -n 1:  36.9kqps
> >         named:      222.0kqps

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 10:19                         ` Thomas Graf
@ 2011-03-01 10:33                           ` Eric Dumazet
  2011-03-01 11:07                             ` Thomas Graf
  0 siblings, 1 reply; 91+ messages in thread
From: Eric Dumazet @ 2011-03-01 10:33 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

Le mardi 01 mars 2011 à 05:19 -0500, Thomas Graf a écrit :
> On Mon, Feb 28, 2011 at 06:07:49PM +0100, Eric Dumazet wrote:
> > > The numbers differ a lot depending on the architecture we test on.
> > > 
> > > F.e. on a 12 core AMD with 2 NUMA nodes:
> > > 
> > > 2.6.32   named -n 1: 37.0kqps
> > >          named:       3.8kqps (yes, no joke, the socket receive buffer is
> > >                                always full and the kernel drops pkts)
> > 
> > Yes, this old kernel misses commit c377411f2494a93, added in 2.6.35
> > (net: sk_add_backlog() take rmem_alloc into account)
> 
> I retested with net-2.6 w/o Herbert's patch:
> 
> named -n 1: 36.9kqps
> named:      16.2kqps

Thats better ;)

You could do "cat /proc/net/udp" to check if drops occur on port 53
socket (last column)

But maybe your queryperf is limited to few queries in flight (default is
20 per queryperf instance) 



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 10:33                           ` Eric Dumazet
@ 2011-03-01 11:07                             ` Thomas Graf
  2011-03-01 11:13                               ` Eric Dumazet
  0 siblings, 1 reply; 91+ messages in thread
From: Thomas Graf @ 2011-03-01 11:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tue, Mar 01, 2011 at 11:33:22AM +0100, Eric Dumazet wrote:
> > I retested with net-2.6 w/o Herbert's patch:
> > 
> > named -n 1: 36.9kqps
> > named:      16.2kqps
> 
> Thats better ;)
> 
> You could do "cat /proc/net/udp" to check if drops occur on port 53
> socket (last column)
> 
> But maybe your queryperf is limited to few queries in flight (default is
> 20 per queryperf instance) 

I tried -q 10, 20, 30, 50, 100. Starting with 20 I see drops, at q=50
queryperf reports 99% drops.

I also tested again on the Intel machine that did ~650kqps using SO_REUSEPORT.

net-2.6: 106.3kqps, 101.2kqps
net-2.6 lockless udp: 251.7kqps, 250.4kqps

I see drops occur in both test cases, so I believe the rate supplied by the
clients is sufficient.

The difference is obvious when looking at top and mpstat:

UDP lockless (250kqps):

Cpu0  : 46.4%us, 28.8%sy,  0.0%ni, 24.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  2.0%us,  1.3%sy,  0.0%ni,  3.0%id,  0.0%wa,  0.0%hi, 93.6%si,  0.0%st
Cpu2  : 45.9%us, 28.2%sy,  0.0%ni, 25.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 50.0%us, 21.6%sy,  0.0%ni, 28.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  : 45.4%us, 27.8%sy,  0.0%ni, 26.5%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu5  : 50.7%us, 23.2%sy,  0.0%ni, 26.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 45.2%us, 28.9%sy,  0.0%ni, 25.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 50.5%us, 22.0%sy,  0.0%ni, 27.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  : 45.3%us, 29.3%sy,  0.0%ni, 25.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  : 50.8%us, 20.8%sy,  0.0%ni, 28.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 : 46.1%us, 27.8%sy,  0.0%ni, 26.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 : 27.2%us, 11.3%sy,  0.0%ni,  3.3%id,  0.0%wa,  0.0%hi, 58.1%si,  0.0%st

05:50:44 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
05:50:44 AM  all   23.86    0.00   13.02    0.22    0.00    6.98    0.00    0.00   55.92
05:50:44 AM    0   26.16    0.00   17.20    0.73    0.00    0.30    0.00    0.00   55.61
05:50:44 AM    1    2.36    0.00    2.11    0.70    0.00   51.97    0.00    0.00   42.87
05:50:44 AM    2   25.90    0.00   16.38    0.32    0.00    0.03    0.00    0.00   57.36
05:50:44 AM    3   28.26    0.00   12.73    0.27    0.00    0.02    0.00    0.00   58.73
05:50:44 AM    4   25.63    0.00   16.04    0.13    0.00    0.03    0.00    0.00   58.17
05:50:44 AM    5   28.19    0.00   12.54    0.17    0.00    0.01    0.00    0.00   59.09
05:50:44 AM    6   25.28    0.00   15.21    0.02    0.00    1.95    0.00    0.00   57.54
05:50:44 AM    7   28.34    0.00   12.40    0.10    0.00    0.01    0.00    0.00   59.14
05:50:44 AM    8   25.70    0.00   15.91    0.01    0.00    0.02    0.00    0.00   58.37
05:50:44 AM    9   28.31    0.00   12.56    0.11    0.00    0.01    0.00    0.00   59.01
05:50:44 AM   10   25.85    0.00   15.65    0.01    0.00    0.02    0.00    0.00   58.47
05:50:44 AM   11   16.11    0.00    7.44    0.10    0.00   29.87    0.00    0.00   46.49

SO_REUSEPORT test (doing 640kqps):

Cpu0  : 57.3%us, 26.5%sy,  0.0%ni,  3.3%id,  0.0%wa,  0.0%hi, 12.9%si,  0.0%st
Cpu1  : 25.7%us, 10.0%sy,  0.0%ni,  0.3%id,  0.0%wa,  0.0%hi, 64.0%si,  0.0%st
Cpu2  : 56.3%us, 28.8%sy,  0.0%ni,  3.0%id,  0.0%wa,  0.0%hi, 11.9%si,  0.0%st
Cpu3  : 29.1%us, 10.9%sy,  0.0%ni,  1.3%id,  0.0%wa,  0.0%hi, 58.6%si,  0.0%st
Cpu4  : 57.3%us, 28.5%sy,  0.0%ni,  2.3%id,  0.0%wa,  0.0%hi, 11.9%si,  0.0%st
Cpu5  : 64.8%us, 22.6%sy,  0.0%ni,  3.0%id,  0.0%wa,  0.0%hi,  9.6%si,  0.0%st
Cpu6  : 59.0%us, 26.7%sy,  0.0%ni,  2.7%id,  0.0%wa,  0.0%hi, 11.7%si,  0.0%st
Cpu7  : 64.1%us, 22.3%sy,  0.0%ni,  3.7%id,  0.0%wa,  0.0%hi, 10.0%si,  0.0%st
Cpu8  : 57.6%us, 27.5%sy,  0.0%ni,  3.0%id,  0.0%wa,  0.0%hi, 11.9%si,  0.0%st
Cpu9  : 65.2%us, 22.2%sy,  0.0%ni,  2.3%id,  0.0%wa,  0.0%hi, 10.3%si,  0.0%st
Cpu10 : 56.9%us, 28.3%sy,  0.0%ni,  3.0%id,  0.0%wa,  0.0%hi, 11.8%si,  0.0%st
Cpu11 : 40.2%us, 14.6%sy,  0.0%ni,  2.3%id,  0.0%wa,  0.0%hi, 42.9%si,  0.0%st




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 11:07                             ` Thomas Graf
@ 2011-03-01 11:13                               ` Eric Dumazet
  2011-03-01 11:27                                 ` Thomas Graf
  0 siblings, 1 reply; 91+ messages in thread
From: Eric Dumazet @ 2011-03-01 11:13 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

Le mardi 01 mars 2011 à 06:07 -0500, Thomas Graf a écrit :
> On Tue, Mar 01, 2011 at 11:33:22AM +0100, Eric Dumazet wrote:
> > > I retested with net-2.6 w/o Herbert's patch:
> > > 
> > > named -n 1: 36.9kqps
> > > named:      16.2kqps
> > 
> > Thats better ;)
> > 
> > You could do "cat /proc/net/udp" to check if drops occur on port 53
> > socket (last column)
> > 
> > But maybe your queryperf is limited to few queries in flight (default is
> > 20 per queryperf instance) 
> 
> I tried -q 10, 20, 30, 50, 100. Starting with 20 I see drops, at q=50
> queryperf reports 99% drops.
> 
> I also tested again on the Intel machine that did ~650kqps using SO_REUSEPORT.
> 
> net-2.6: 106.3kqps, 101.2kqps
> net-2.6 lockless udp: 251.7kqps, 250.4kqps
> 
> I see drops in both test cases occur so I believe the rate supplied by the
> clients is sufficient.
> 
> The difference is obvious when looking at top and mpstat:
> 
> UDP lockless (250kqps):
> 
> Cpu0  : 46.4%us, 28.8%sy,  0.0%ni, 24.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu1  :  2.0%us,  1.3%sy,  0.0%ni,  3.0%id,  0.0%wa,  0.0%hi, 93.6%si,  0.0%st
> Cpu2  : 45.9%us, 28.2%sy,  0.0%ni, 25.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu3  : 50.0%us, 21.6%sy,  0.0%ni, 28.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu4  : 45.4%us, 27.8%sy,  0.0%ni, 26.5%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> Cpu5  : 50.7%us, 23.2%sy,  0.0%ni, 26.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu6  : 45.2%us, 28.9%sy,  0.0%ni, 25.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu7  : 50.5%us, 22.0%sy,  0.0%ni, 27.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu8  : 45.3%us, 29.3%sy,  0.0%ni, 25.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu9  : 50.8%us, 20.8%sy,  0.0%ni, 28.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu10 : 46.1%us, 27.8%sy,  0.0%ni, 26.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu11 : 27.2%us, 11.3%sy,  0.0%ni,  3.3%id,  0.0%wa,  0.0%hi, 58.1%si,  0.0%st

Its a bit strange two cpus spend time in softirq, unless you have two
queryperf sources, and a multiqueue NIC, or maybe you use two NICS ?

Mind use "perf top -C 1" and "perf top -C 11" to check what these cpus
do ?




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 11:13                               ` Eric Dumazet
@ 2011-03-01 11:27                                 ` Thomas Graf
  2011-03-01 11:45                                   ` Eric Dumazet
  0 siblings, 1 reply; 91+ messages in thread
From: Thomas Graf @ 2011-03-01 11:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tue, Mar 01, 2011 at 12:13:04PM +0100, Eric Dumazet wrote:
> Its a bit strange two cpus spend time in softirq, unless you have two
> queryperf sources, and a multiqueue NIC, or maybe you use two NICS ?

one NIC, 2 clients (12 instances per client)

[root@hp-bl460cg7-01 ~]# cat /sys/class/net/eth0/queues/rx-0/rps_cpus 
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000

[root@hp-bl460cg7-01 ~]# netstat -s | grep err
    1781377 packet receive errors

> Mind use "perf top -C 1" and "perf top -C 11" to check what these cpus
> do ?

--------------------------------------------------------------------------------------------------------------------
   PerfTop:   16198 irqs/sec  kernel:99.1%  exact:  0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 1)
--------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ___________________________________________________________

            51675.00 33.2% _raw_spin_unlock_irqrestore [kernel.kallsyms]                                          
            12426.00  8.0% clflush_cache_range         [kernel.kallsyms]                                          
             5511.00  3.5% be_poll_rx                  /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
             4567.00  2.9% __udp4_lib_lookup           [kernel.kallsyms]                                          
             3981.00  2.6% __kmalloc_node_track_caller [kernel.kallsyms]                                          
             3975.00  2.6% get_rx_page_info            /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
             3725.00  2.4% sk_run_filter               [kernel.kallsyms]                                          
             3606.00  2.3% get_page_from_freelist      [kernel.kallsyms]                                          
             3178.00  2.0% __domain_mapping            [kernel.kallsyms]                                          
             3122.00  2.0% kmem_cache_alloc_node       [kernel.kallsyms]                                          
             2839.00  1.8% sock_queue_rcv_skb          [kernel.kallsyms]                                          
             2246.00  1.4% __netif_receive_skb         [kernel.kallsyms]                                          
             2245.00  1.4% nf_iterate                  [kernel.kallsyms]                                          
             2081.00  1.3% __udp4_lib_rcv              [kernel.kallsyms]                                          
             2042.00  1.3% ipt_do_table                [kernel.kallsyms]                                          
             1901.00  1.2% _raw_spin_lock              [kernel.kallsyms]                                          
             1856.00  1.2% __alloc_skb                 [kernel.kallsyms]                                          
             1645.00  1.1% read_tsc                    [kernel.kallsyms]                                          
             1562.00  1.0% nf_ct_tuple_equal           [kernel.kallsyms]                                          
             1562.00  1.0% ip_rcv                      [kernel.kallsyms]                                          
             1495.00  1.0% __nf_conntrack_find_get     [kernel.kallsyms]                                          
             1477.00  0.9% sock_def_readable           [kernel.kallsyms]                                          
             1363.00  0.9% find_first_bit              [kernel.kallsyms]                                          
             1360.00  0.9% domain_get_iommu            [kernel.kallsyms]                                          
             1255.00  0.8% udp_queue_rcv_skb           [kernel.kallsyms]                                          
             1174.00  0.8% xfrm4_policy_check.clone.0  [kernel.kallsyms]                                          
             1138.00  0.7% hash_conntrack_raw          [kernel.kallsyms]                                          
             1000.00  0.6% intel_unmap_page            [kernel.kallsyms]                                          
              959.00  0.6% load_pointer                [kernel.kallsyms]                                          
              957.00  0.6% sock_flag                   [kernel.kallsyms]                                          
              938.00  0.6% nf_conntrack_in             [kernel.kallsyms]                                          
              891.00  0.6% _local_bh_enable_ip         [kernel.kallsyms]                                          
              884.00  0.6% eth_type_trans              [kernel.kallsyms]                                          
              832.00  0.5% be_post_rx_frags            /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
              829.00  0.5% __alloc_pages_nodemask      [kernel.kallsyms]                                          
              813.00  0.5% kmem_cache_alloc            [kernel.kallsyms]                                          
              802.00  0.5% netif_receive_skb           [kernel.kallsyms]                                          
              802.00  0.5% ip_route_input_common       [kernel.kallsyms]                                          
              723.00  0.5% nf_ct_get_tuple             [kernel.kallsyms]                                          
              720.00  0.5% __intel_map_single          [kernel.kallsyms]                                          
              720.00  0.5% udp_error                   [kernel.kallsyms]                                          

--------------------------------------------------------------------------------------------------------------------
   PerfTop:   16360 irqs/sec  kernel:72.6%  exact:  0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 11)
--------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ ___________________________________________________________

            16993.00 32.4% _raw_spin_unlock_irqrestore   [kernel.kallsyms]                                          
             5833.00 11.1% clflush_cache_range           [kernel.kallsyms]                                          
             3315.00  6.3% be_tx_compl_process           /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
             1818.00  3.5% kmem_cache_free               [kernel.kallsyms]                                          
             1415.00  2.7% isc_rwlock_lock               /usr/lib64/libisc.so.62.0.1                                
             1090.00  2.1% be_poll_tx_mcc                /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
              811.00  1.5% skb_release_head_state        [kernel.kallsyms]                                          
              772.00  1.5% skb_release_data              [kernel.kallsyms]                                          
              712.00  1.4% dns_rbt_findnode              /usr/lib64/libdns.so.69.0.1                                
              703.00  1.3% isc_rwlock_unlock             /usr/lib64/libisc.so.62.0.1                                
              695.00  1.3% dma_pte_clear_range           [kernel.kallsyms]                                          
              618.00  1.2% kfree_skb                     [kernel.kallsyms]                                          
              597.00  1.1% kfree                         [kernel.kallsyms]                                          
              553.00  1.1% intel_unmap_page              [kernel.kallsyms]                                          
              531.00  1.0% __do_softirq                  [kernel.kallsyms]                                          
              504.00  1.0% isc_stats_increment           /usr/lib64/libisc.so.62.0.1                                
              397.00  0.8% virt_to_head_page             [kernel.kallsyms]                                          
              306.00  0.6% _raw_spin_lock                [kernel.kallsyms]                                          
              270.00  0.5% domain_get_iommu              [kernel.kallsyms]                                          
              256.00  0.5% dns_name_fullcompare          /usr/lib64/libdns.so.69.0.1                                
              233.00  0.4% find_first_bit                [kernel.kallsyms]                                          
              222.00  0.4% dns_name_equal                /usr/lib64/libdns.so.69.0.1                                
              218.00  0.4% __pthread_mutex_lock_internal /lib64/libpthread-2.12.so                                  
              207.00  0.4% dns_rbtnodechain_init         /usr/lib64/libdns.so.69.0.1                                
              196.00  0.4% dns_acl_match                 /usr/lib64/libdns.so.69.0.1                                
              194.00  0.4% dma_pte_free_pagetable        [kernel.kallsyms]                                          
              192.00  0.4% dns_name_getlabelsequence     /usr/lib64/libdns.so.69.0.1                                


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 11:27                                 ` Thomas Graf
@ 2011-03-01 11:45                                   ` Eric Dumazet
  2011-03-01 11:53                                     ` Herbert Xu
                                                       ` (2 more replies)
  0 siblings, 3 replies; 91+ messages in thread
From: Eric Dumazet @ 2011-03-01 11:45 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

Le mardi 01 mars 2011 à 06:27 -0500, Thomas Graf a écrit :
> On Tue, Mar 01, 2011 at 12:13:04PM +0100, Eric Dumazet wrote:
> > Its a bit strange two cpus spend time in softirq, unless you have two
> > queryperf sources, and a multiqueue NIC, or maybe you use two NICS ?
> 
> one NIC, 2 clients (12 instances per client)
> 
> [root@hp-bl460cg7-01 ~]# cat /sys/class/net/eth0/queues/rx-0/rps_cpus 
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
> 
> [root@hp-bl460cg7-01 ~]# netstat -s | grep err
>     1781377 packet receive errors
> 
> > Mind use "perf top -C 1" and "perf top -C 11" to check what these cpus
> > do ?
> 

Thanks that's really interesting

> --------------------------------------------------------------------------------------------------------------------
>    PerfTop:   16198 irqs/sec  kernel:99.1%  exact:  0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 1)
> --------------------------------------------------------------------------------------------------------------------
> 
>              samples  pcnt function                    DSO
>              _______ _____ ___________________________ ___________________________________________________________
> 

CPU 1 handles receives from your BENET NIC

(Its a bit strange, given this NIC should provide 4 rx queues). Load
could be split to two cpus in your case (two sources)

Try :

ethtool -S eth0 | grep rx_pk
     rxq0: rx_pkts: ??
     rxq1: rx_pkts: ??
     rxq2: rx_pkts: ??
     rxq3: rx_pkts: ??
     rxq4: rx_pkts: ??


With BE_HDR_LEN being 64, small UDP frames are too big to fit in the skb
head.




>             51675.00 33.2% _raw_spin_unlock_irqrestore [kernel.kallsyms]                                          
>             12426.00  8.0% clflush_cache_range         [kernel.kallsyms]                                          
>              5511.00  3.5% be_poll_rx                  /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
>              4567.00  2.9% __udp4_lib_lookup           [kernel.kallsyms]                                          
>              3981.00  2.6% __kmalloc_node_track_caller [kernel.kallsyms]                                          
>              3975.00  2.6% get_rx_page_info            /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
>              3725.00  2.4% sk_run_filter               [kernel.kallsyms]                                          
>              3606.00  2.3% get_page_from_freelist      [kernel.kallsyms]                                          
>              3178.00  2.0% __domain_mapping            [kernel.kallsyms]                                          
>              3122.00  2.0% kmem_cache_alloc_node       [kernel.kallsyms]                                          
>              2839.00  1.8% sock_queue_rcv_skb          [kernel.kallsyms]                                          
>              2246.00  1.4% __netif_receive_skb         [kernel.kallsyms]                                          
>              2245.00  1.4% nf_iterate                  [kernel.kallsyms]                                          
>              2081.00  1.3% __udp4_lib_rcv              [kernel.kallsyms]                                          
>              2042.00  1.3% ipt_do_table                [kernel.kallsyms]                                          
>              1901.00  1.2% _raw_spin_lock              [kernel.kallsyms]                                          
>              1856.00  1.2% __alloc_skb                 [kernel.kallsyms]                                          
>              1645.00  1.1% read_tsc                    [kernel.kallsyms]                                          
>              1562.00  1.0% nf_ct_tuple_equal           [kernel.kallsyms]                                          
>              1562.00  1.0% ip_rcv                      [kernel.kallsyms]                                          
>              1495.00  1.0% __nf_conntrack_find_get     [kernel.kallsyms]                                          
>              1477.00  0.9% sock_def_readable           [kernel.kallsyms]                                          
>              1363.00  0.9% find_first_bit              [kernel.kallsyms]                                          
>              1360.00  0.9% domain_get_iommu            [kernel.kallsyms]                                          
>              1255.00  0.8% udp_queue_rcv_skb           [kernel.kallsyms]                                          
>              1174.00  0.8% xfrm4_policy_check.clone.0  [kernel.kallsyms]                                          
>              1138.00  0.7% hash_conntrack_raw          [kernel.kallsyms]                                          
>              1000.00  0.6% intel_unmap_page            [kernel.kallsyms]                                          
>               959.00  0.6% load_pointer                [kernel.kallsyms]                                          
>               957.00  0.6% sock_flag                   [kernel.kallsyms]                                          
>               938.00  0.6% nf_conntrack_in             [kernel.kallsyms]                                          
>               891.00  0.6% _local_bh_enable_ip         [kernel.kallsyms]                                          
>               884.00  0.6% eth_type_trans              [kernel.kallsyms]                                          
>               832.00  0.5% be_post_rx_frags            /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
>               829.00  0.5% __alloc_pages_nodemask      [kernel.kallsyms]                                          
>               813.00  0.5% kmem_cache_alloc            [kernel.kallsyms]                                          
>               802.00  0.5% netif_receive_skb           [kernel.kallsyms]                                          
>               802.00  0.5% ip_route_input_common       [kernel.kallsyms]                                          
>               723.00  0.5% nf_ct_get_tuple             [kernel.kallsyms]                                          
>               720.00  0.5% __intel_map_single          [kernel.kallsyms]                                          
>               720.00  0.5% udp_error                   [kernel.kallsyms]                                          
> 
> --------------------------------------------------------------------------------------------------------------------
>    PerfTop:   16360 irqs/sec  kernel:72.6%  exact:  0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 11)
> --------------------------------------------------------------------------------------------------------------------
> 

CPU 11 handles all TX completions: it's a potential bottleneck.

I might resurrect the XPS patch ;)

>              samples  pcnt function                      DSO
>              _______ _____ _____________________________ ___________________________________________________________
> 
>             16993.00 32.4% _raw_spin_unlock_irqrestore   [kernel.kallsyms]                                          
>              5833.00 11.1% clflush_cache_range           [kernel.kallsyms]                                          
>              3315.00  6.3% be_tx_compl_process           /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
>              1818.00  3.5% kmem_cache_free               [kernel.kallsyms]                                          
>              1415.00  2.7% isc_rwlock_lock               /usr/lib64/libisc.so.62.0.1                                
>              1090.00  2.1% be_poll_tx_mcc                /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
>               811.00  1.5% skb_release_head_state        [kernel.kallsyms]                                          
>               772.00  1.5% skb_release_data              [kernel.kallsyms]                                          
>               712.00  1.4% dns_rbt_findnode              /usr/lib64/libdns.so.69.0.1                                
>               703.00  1.3% isc_rwlock_unlock             /usr/lib64/libisc.so.62.0.1                                
>               695.00  1.3% dma_pte_clear_range           [kernel.kallsyms]                                          
>               618.00  1.2% kfree_skb                     [kernel.kallsyms]                                          
>               597.00  1.1% kfree                         [kernel.kallsyms]                                          
>               553.00  1.1% intel_unmap_page              [kernel.kallsyms]                                          
>               531.00  1.0% __do_softirq                  [kernel.kallsyms]                                          
>               504.00  1.0% isc_stats_increment           /usr/lib64/libisc.so.62.0.1                                
>               397.00  0.8% virt_to_head_page             [kernel.kallsyms]                                          
>               306.00  0.6% _raw_spin_lock                [kernel.kallsyms]                                          
>               270.00  0.5% domain_get_iommu              [kernel.kallsyms]                                          
>               256.00  0.5% dns_name_fullcompare          /usr/lib64/libdns.so.69.0.1                                
>               233.00  0.4% find_first_bit                [kernel.kallsyms]                                          
>               222.00  0.4% dns_name_equal                /usr/lib64/libdns.so.69.0.1                                
>               218.00  0.4% __pthread_mutex_lock_internal /lib64/libpthread-2.12.so                                  
>               207.00  0.4% dns_rbtnodechain_init         /usr/lib64/libdns.so.69.0.1                                
>               196.00  0.4% dns_acl_match                 /usr/lib64/libdns.so.69.0.1                                
>               194.00  0.4% dma_pte_free_pagetable        [kernel.kallsyms]                                          
>               192.00  0.4% dns_name_getlabelsequence     /usr/lib64/libdns.so.69.0.1                                
> 



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 11:45                                   ` Eric Dumazet
@ 2011-03-01 11:53                                     ` Herbert Xu
  2011-03-01 12:32                                       ` Herbert Xu
  2011-03-01 13:03                                       ` Eric Dumazet
  2011-03-01 12:01                                     ` Thomas Graf
  2011-03-01 12:18                                     ` Thomas Graf
  2 siblings, 2 replies; 91+ messages in thread
From: Herbert Xu @ 2011-03-01 11:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tue, Mar 01, 2011 at 12:45:09PM +0100, Eric Dumazet wrote:
>
> > CPU 11 handles all TX completions: it's a potential bottleneck.
> > 
> > I might resurrect the XPS patch ;)

Actually this has been my gripe all along with our TX multiqueue
support.  We should not decide the queue based on the socket, but
on the current CPU.

We already do the right thing for forwarded packets because there
is no socket to latch onto, we just need to fix it for locally
generated traffic.

The odd packet reordering each time your scheduler decides to
migrate the process isn't a big deal IMHO.  If your scheduler
is constantly moving things you've got bigger problems to worry
about.
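
Concretely, queue selection would boil down to using the submitting
CPU, something like this in __skb_tx_hash() (untested sketch):

	u32 hash = raw_smp_processor_id();

	while (unlikely(hash >= num_tx_queues))
		hash -= num_tx_queues;

	return (u16)hash;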

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 11:45                                   ` Eric Dumazet
  2011-03-01 11:53                                     ` Herbert Xu
@ 2011-03-01 12:01                                     ` Thomas Graf
  2011-03-01 12:15                                       ` Herbert Xu
  2011-03-01 13:27                                       ` Herbert Xu
  2011-03-01 12:18                                     ` Thomas Graf
  2 siblings, 2 replies; 91+ messages in thread
From: Thomas Graf @ 2011-03-01 12:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tue, Mar 01, 2011 at 12:45:09PM +0100, Eric Dumazet wrote:

This is what perf top looks like with SO_REUSEPORT:

----------------------------------------------------------------------------------------------------------------------------------
   PerfTop:   27498 irqs/sec  kernel:50.5%  exact:  0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 1)
----------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ __________________

            16464.00  6.0% isc_rwlock_lock               libisc.so.62.0.1
            15462.00  5.7% intel_idle                    [kernel.kallsyms]
            13140.00  4.8% _spin_unlock_irqrestore       [kernel.kallsyms]
             9283.00  3.4% __do_softirq                  [kernel.kallsyms]
             8469.00  3.1% finish_task_switch            [kernel.kallsyms]
             8189.00  3.0% __udp4_lib_lookup             [kernel.kallsyms]
             8096.00  3.0% dns_rbt_findnode              libdns.so.69.0.1
             7619.00  2.8% isc_rwlock_unlock             libisc.so.62.0.1
             5090.00  1.9% isc_stats_increment           libisc.so.62.0.1
             4325.00  1.6% tick_nohz_stop_sched_tick     [kernel.kallsyms]
             3656.00  1.3% _spin_lock                    [kernel.kallsyms]
             3540.00  1.3% __pthread_mutex_lock_internal libpthread-2.12.so
             3168.00  1.2% _spin_lock_bh                 [kernel.kallsyms]
             2576.00  0.9% dns_name_fullcompare          libdns.so.69.0.1
             2492.00  0.9% __pthread_mutex_unlock        libpthread-2.12.so
             2486.00  0.9% isc___mempool_get             libisc.so.62.0.1
             2475.00  0.9% dns_rbtnodechain_init         libdns.so.69.0.1
             2454.00  0.9% be_poll_rx                    [be2net]
             2417.00  0.9% sk_run_filter                 [kernel.kallsyms]
             2411.00  0.9% tick_nohz_restart_sched_tick  [kernel.kallsyms]
             2331.00  0.9% dns_name_equal                libdns.so.69.0.1
             2198.00  0.8% net_rx_action                 [kernel.kallsyms]
             2135.00  0.8% fget_light                    [kernel.kallsyms]
             2130.00  0.8% dns_zone_attach               libdns.so.69.0.1
             2073.00  0.8% dns_name_getlabelsequence     libdns.so.69.0.1
             2024.00  0.7% copy_user_generic_string      [kernel.kallsyms]
             2003.00  0.7% dns_acl_match                 libdns.so.69.0.1
             1868.00  0.7% be_xmit                       [be2net]

----------------------------------------------------------------------------------------------------------------------------------
   PerfTop:   16206 irqs/sec  kernel:88.6%  exact:  0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 3)
----------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ __________________

            15662.00 11.3% __udp4_lib_lookup             [kernel.kallsyms]
            10404.00  7.5% intel_idle                    [kernel.kallsyms]
            10248.00  7.4% _spin_unlock_irqrestore       [kernel.kallsyms]
             4386.00  3.2% __do_softirq                  [kernel.kallsyms]
             4324.00  3.1% be_poll_rx                    [be2net]
             4165.00  3.0% get_rx_page_info              [be2net]
             4050.00  2.9% get_page_from_freelist        [kernel.kallsyms]
             4045.00  2.9% finish_task_switch            [kernel.kallsyms]
             3861.00  2.8% sk_run_filter                 [kernel.kallsyms]
             3544.00  2.5% ip_route_input                [kernel.kallsyms]
             3385.00  2.4% _spin_lock                    [kernel.kallsyms]
             2583.00  1.9% get_rps_cpu                   [kernel.kallsyms]
             2042.00  1.5% tick_nohz_stop_sched_tick     [kernel.kallsyms]
             1971.00  1.4% kmem_cache_alloc_node_notrace [kernel.kallsyms]
             1788.00  1.3% _read_lock                    [kernel.kallsyms]
             1777.00  1.3% __netif_receive_skb           [kernel.kallsyms]
             1777.00  1.3% isc_rwlock_lock               libisc.so.62.0.1
             1769.00  1.3% memcpy_c                      [kernel.kallsyms]
             1618.00  1.2% __alloc_skb                   [kernel.kallsyms]
             1591.00  1.1% __pthread_mutex_lock_internal libpthread-2.12.so
             1576.00  1.1% kmem_cache_alloc_node         [kernel.kallsyms]
             1450.00  1.0% sock_queue_rcv_skb            [kernel.kallsyms]
             1427.00  1.0% tick_nohz_restart_sched_tick  [kernel.kallsyms]
             1214.00  0.9% __udp4_lib_rcv                [kernel.kallsyms]
             1124.00  0.8% net_rx_action                 [kernel.kallsyms]
             1113.00  0.8% getnstimeofday                [kernel.kallsyms]
             1072.00  0.8% selinux_socket_sock_rcv_skb   [kernel.kallsyms]
             1016.00  0.7% ip_rcv                        [kernel.kallsyms]
              992.00  0.7% sock_def_readable             [kernel.kallsyms]
              961.00  0.7% dns_rbt_findnode              libdns.so.69.0.1
              899.00  0.6% fget                          [kernel.kallsyms]
              898.00  0.6% datagram_poll                 [kernel.kallsyms]
              809.00  0.6% isc_rwlock_unlock             libisc.so.62.0.1
              803.00  0.6% __alloc_pages_nodemask        [kernel.kallsyms]
              799.00  0.6% udp_queue_rcv_skb             [kernel.kallsyms]
              694.00  0.5% packet_rcv                    [kernel.kallsyms]
              662.00  0.5% mutex_lock                    [kernel.kallsyms]

------------------------------------------------------------------------------------------------------------------------------------------
   PerfTop:   31619 irqs/sec  kernel:37.7%  exact:  0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 10)
-------------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ __________________

             6726.00  7.7% isc_rwlock_lock               libisc.so.62.0.1
             4597.00  5.3% _spin_unlock_irqrestore       [kernel.kallsyms]
             4230.00  4.9% intel_idle                    [kernel.kallsyms]
             3319.00  3.8% dns_rbt_findnode              libdns.so.69.0.1
             3178.00  3.7% isc_rwlock_unlock             libisc.so.62.0.1
             2682.00  3.1% finish_task_switch            [kernel.kallsyms]
             2164.00  2.5% isc_stats_increment           libisc.so.62.0.1
             1435.00  1.7% tick_nohz_stop_sched_tick     [kernel.kallsyms]
             1407.00  1.6% _spin_lock_bh                 [kernel.kallsyms]
             1288.00  1.5% __pthread_mutex_lock_internal libpthread-2.12.so
             1264.00  1.5% copy_user_generic_string      [kernel.kallsyms]
             1082.00  1.2% _spin_lock                    [kernel.kallsyms]
             1061.00  1.2% be_xmit                       [be2net]
             1024.00  1.2% __pthread_mutex_unlock        libpthread-2.12.so
             1014.00  1.2% dns_rbtnodechain_init         libdns.so.69.0.1
              989.00  1.1% isc___mempool_get             libisc.so.62.0.1
              964.00  1.1% dns_name_equal                libdns.so.69.0.1
              957.00  1.1% dns_name_getlabelsequence     libdns.so.69.0.1
              944.00  1.1% dns_name_fullcompare          libdns.so.69.0.1
              858.00  1.0% dns_zone_attach               libdns.so.69.0.1
              793.00  0.9% udp_sendmsg                   [kernel.kallsyms]
              785.00  0.9% tick_nohz_restart_sched_tick  [kernel.kallsyms]
              784.00  0.9% dns_acl_match                 libdns.so.69.0.1
              776.00  0.9% fget_light                    [kernel.kallsyms]
              723.00  0.8% dns_name_hash                 libdns.so.69.0.1
              691.00  0.8% dns_message_rendersection     libdns.so.69.0.1
              675.00  0.8% dns_name_fromwire             libdns.so.69.0.1
              658.00  0.8% udp_recvmsg                   [kernel.kallsyms]
              646.00  0.7% kmem_cache_free               [kernel.kallsyms]
              641.00  0.7% kfree                         [kernel.kallsyms]
              535.00  0.6% isc_radix_search              libisc.so.62.0.1
              531.00  0.6% dev_queue_xmit                [kernel.kallsyms]


-------------------------------------------------------------------------------------------------------------------------------------------
   PerfTop:   31136 irqs/sec  kernel:48.3%  exact:  0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 11)
-------------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ __________________

            13043.00  6.0% isc_rwlock_lock               libisc.so.62.0.1
            10852.00  5.0% _spin_unlock_irqrestore       [kernel.kallsyms]
            10538.00  4.9% be_tx_compl_process           [be2net]
             8275.00  3.8% kfree                         [kernel.kallsyms]
             6467.00  3.0% kmem_cache_free               [kernel.kallsyms]
             6453.00  3.0% dns_rbt_findnode              libdns.so.69.0.1
             6423.00  3.0% intel_idle                    [kernel.kallsyms]
             6199.00  2.9% isc_rwlock_unlock             libisc.so.62.0.1
             5492.00  2.5% sock_wfree                    [kernel.kallsyms]
             5372.00  2.5% finish_task_switch            [kernel.kallsyms]
             5321.00  2.4% kfree_skb                     [kernel.kallsyms]
             4030.00  1.9% isc_stats_increment           libisc.so.62.0.1
             3820.00  1.8% skb_release_data              [kernel.kallsyms]
             3518.00  1.6% be_poll_tx_mcc                [be2net]
             3034.00  1.4% sock_def_write_space          [kernel.kallsyms]
             2599.00  1.2% __do_softirq                  [kernel.kallsyms]
             2572.00  1.2% tick_nohz_stop_sched_tick     [kernel.kallsyms]
             2519.00  1.2% __pthread_mutex_lock_internal libpthread-2.12.so
             2497.00  1.1% _spin_lock_bh                 [kernel.kallsyms]
             2045.00  0.9% dns_name_fullcompare          libdns.so.69.0.1
             1960.00  0.9% isc___mempool_get             libisc.so.62.0.1
             1873.00  0.9% dns_rbtnodechain_init         libdns.so.69.0.1
             1861.00  0.9% _spin_lock                    [kernel.kallsyms]
             1806.00  0.8% __pthread_mutex_unlock        libpthread-2.12.so
             1791.00  0.8% dns_name_equal                libdns.so.69.0.1
             1757.00  0.8% dns_zone_attach               libdns.so.69.0.1
             1621.00  0.7% dns_name_getlabelsequence     libdns.so.69.0.1
             1576.00  0.7% fget_light                    [kernel.kallsyms]
             1532.00  0.7% dns_acl_match                 libdns.so.69.0.1
             1515.00  0.7% tick_nohz_restart_sched_tick  [kernel.kallsyms]
             1510.00  0.7% be_xmit                       [be2net]


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 12:01                                     ` Thomas Graf
@ 2011-03-01 12:15                                       ` Herbert Xu
  2011-03-01 13:27                                       ` Herbert Xu
  1 sibling, 0 replies; 91+ messages in thread
From: Herbert Xu @ 2011-03-01 12:15 UTC (permalink / raw)
  To: Eric Dumazet, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tue, Mar 01, 2011 at 07:01:12AM -0500, Thomas Graf wrote:
> On Tue, Mar 01, 2011 at 12:45:09PM +0100, Eric Dumazet wrote:
> 
> This is what perf top looks like with SO_REUSEPORT:

Yeah, I think Eric is spot on.  The remaining bottleneck is because
we hash all outbound packets from a single socket to a single TX
queue, despite the fact that they were produced on different CPUs.
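
For a single long-lived UDP socket the selection in __skb_tx_hash()
currently boils down to roughly:

	hash = skb->sk->sk_hash;
	hash = jhash_1word(hash, hashrnd);
	queue = ((u64)hash * num_tx_queues) >> 32;

so every packet from a given socket is steered to one TX queue
regardless of which CPU generated it.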

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 11:45                                   ` Eric Dumazet
  2011-03-01 11:53                                     ` Herbert Xu
  2011-03-01 12:01                                     ` Thomas Graf
@ 2011-03-01 12:18                                     ` Thomas Graf
  2011-03-01 12:19                                       ` Herbert Xu
  2 siblings, 1 reply; 91+ messages in thread
From: Thomas Graf @ 2011-03-01 12:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tue, Mar 01, 2011 at 12:45:09PM +0100, Eric Dumazet wrote:
> ethtool -S eth0 | grep rx_pk
>      rxq0: rx_pkts: ??
>      rxq1: rx_pkts: ??
>      rxq2: rx_pkts: ??
>      rxq3: rx_pkts: ??
>      rxq4: rx_pkts: ??

It could do multiqueue, but it doesn't:

[root@hp-bl460cg7-01 ~]# ethtool -S eth0 | grep rx_pk
     rxq0: rx_pkts: 1512
     rxq1: rx_pkts: 462
     rxq2: rx_pkts: 122
     rxq3: rx_pkts: 24751393
     rxq4: rx_pkts: 35

So, adding a third client, making sure it would hit another queue:
     rxq0: rx_pkts: 3041
     rxq1: rx_pkts: 867
     rxq2: rx_pkts: 4610476
     rxq3: rx_pkts: 57418776
     rxq4: rx_pkts: 40

... makes it use CPU 5 for rxq2 and the qps goes up from 250kqps to 270kqps

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
   PerfTop:   18417 irqs/sec  kernel:50.2%  exact:  0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 5)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ ___________________________________________________________

            12712.00 18.5% _raw_spin_unlock_irqrestore   [kernel.kallsyms]                                          
             3697.00  5.4% isc_rwlock_lock               /usr/lib64/libisc.so.62.0.1                                
             1948.00  2.8% dns_rbt_findnode              /usr/lib64/libdns.so.69.0.1                                
             1809.00  2.6% isc_rwlock_unlock             /usr/lib64/libisc.so.62.0.1                                
             1631.00  2.4% __do_softirq                  [kernel.kallsyms]                                          
             1237.00  1.8% isc_stats_increment           /usr/lib64/libisc.so.62.0.1                                
             1106.00  1.6% clflush_cache_range           [kernel.kallsyms]                                          
              964.00  1.4% _raw_spin_lock                [kernel.kallsyms]                                          
              714.00  1.0% be_poll_rx                    /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
              630.00  0.9% __pthread_mutex_lock_internal /lib64/libpthread-2.12.so                                  
              627.00  0.9% dns_name_fullcompare          /usr/lib64/libdns.so.69.0.1                                
              582.00  0.8% dns_rbtnodechain_init         /usr/lib64/libdns.so.69.0.1                                
              552.00  0.8% sk_run_filter                 [kernel.kallsyms]                                          
              527.00  0.8% dns_name_getlabelsequence     /usr/lib64/libdns.so.69.0.1                                
              525.00  0.8% __pthread_mutex_unlock        /lib64/libpthread-2.12.so                                  
              492.00  0.7% dns_name_equal                /usr/lib64/libdns.so.69.0.1                                
              468.00  0.7% isc___mempool_get             /usr/lib64/libisc.so.62.0.1                                
              462.00  0.7% __udp4_lib_lookup             [kernel.kallsyms]                                          
              457.00  0.7% dns_acl_match                 /usr/lib64/libdns.so.69.0.1                                
              453.00  0.7% dns_zone_attach               /usr/lib64/libdns.so.69.0.1                                
              451.00  0.7% fget_light                    [kernel.kallsyms]                                          
              443.00  0.6% dns_message_rendersection     /usr/lib64/libdns.so.69.0.1                                
              431.00  0.6% ipt_do_table                  [kernel.kallsyms]                                          
              429.00  0.6% nf_iterate                    [kernel.kallsyms]                                          
              422.00  0.6% __kmalloc_node_track_caller   [kernel.kallsyms]                                          
              408.00  0.6% __domain_mapping              [kernel.kallsyms]                                          
              387.00  0.6% dns_name_hash                 /usr/lib64/libdns.so.69.0.1                                
              353.00  0.5% copy_user_generic_string      [kernel.kallsyms]                                          
              349.00  0.5% dns_name_fromwire             /usr/lib64/libdns.so.69.0.1                          

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 12:18                                     ` Thomas Graf
@ 2011-03-01 12:19                                       ` Herbert Xu
  2011-03-01 13:50                                         ` Thomas Graf
  0 siblings, 1 reply; 91+ messages in thread
From: Herbert Xu @ 2011-03-01 12:19 UTC (permalink / raw)
  To: Eric Dumazet, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tue, Mar 01, 2011 at 07:18:29AM -0500, Thomas Graf wrote:
>
> 
> ... makes it use CPU 5 for rxq2 and the qps goes up from 250kqps to 270kqps

I think the increase here comes from the larger number of packets
in flight more than anything.

The bottleneck is still the TX queue (both software and hardware).

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 11:53                                     ` Herbert Xu
@ 2011-03-01 12:32                                       ` Herbert Xu
  2011-03-01 13:04                                         ` Eric Dumazet
  2011-03-01 13:03                                       ` Eric Dumazet
  1 sibling, 1 reply; 91+ messages in thread
From: Herbert Xu @ 2011-03-01 12:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tue, Mar 01, 2011 at 07:53:05PM +0800, Herbert Xu wrote:
> On Tue, Mar 01, 2011 at 12:45:09PM +0100, Eric Dumazet wrote:
> >
> > CPU 11 handles all TX completions: it's a potential bottleneck.
> > 
> > I might resurrect the XPS patch ;)
> 
> Actually this has been my gripe all along with our TX multiqueue
> support.  We should not decide the queue based on the socket, but
> on the current CPU.
> 
> We already do the right thing for forwarded packets because there
> is no socket to latch onto, we just need to fix it for locally
> generated traffic.
> 
> The odd packet reordering each time your scheduler decides to
> migrate the process isn't a big deal IMHO.  If your scheduler
> is constantly moving things you've got bigger problems to worry
> about.

If anybody wants to play, here is a patch to do exactly that:

net: Determine TX queue purely by current CPU

Distributing packets generated on one CPU to multiple queues
makes no sense.  Nor does putting packets from multiple CPUs
into a single queue.

While this may introduce packet reordering should the scheduler
decide to migrate a thread, it isn't a big deal because migration
is meant to be a rare event, and nothing will die as long as the
reordering doesn't occur all the time.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/core/dev.c b/net/core/dev.c
index 8ae6631..87bd20a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2164,22 +2164,12 @@ static u32 hashrnd __read_mostly;
 u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
 		  unsigned int num_tx_queues)
 {
-	u32 hash;
+	u32 hash = raw_smp_processor_id();
 
-	if (skb_rx_queue_recorded(skb)) {
-		hash = skb_get_rx_queue(skb);
-		while (unlikely(hash >= num_tx_queues))
-			hash -= num_tx_queues;
-		return hash;
-	}
+	while (unlikely(hash >= num_tx_queues))
+		hash -= num_tx_queues;
 
-	if (skb->sk && skb->sk->sk_hash)
-		hash = skb->sk->sk_hash;
-	else
-		hash = (__force u16) skb->protocol ^ skb->rxhash;
-	hash = jhash_1word(hash, hashrnd);
-
-	return (u16) (((u64) hash * num_tx_queues) >> 32);
+	return hash;
 }
 EXPORT_SYMBOL(__skb_tx_hash);
 
Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-02-28 11:36               ` Herbert Xu
                                   ` (2 preceding siblings ...)
  2011-03-01  5:33                 ` Eric Dumazet
@ 2011-03-01 12:35                 ` Herbert Xu
  2011-03-01 12:36                   ` [PATCH 1/5] inet: Remove unused sk_sndmsg_* from UFO Herbert Xu
                                     ` (5 more replies)
  3 siblings, 6 replies; 91+ messages in thread
From: Herbert Xu @ 2011-03-01 12:35 UTC (permalink / raw)
  To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

On Mon, Feb 28, 2011 at 07:36:59PM +0800, Herbert Xu wrote:
> Here are the patches I used.  Please don't apply them yet as I intend
> to clean them up quite a bit.

OK here is the version ready for merging (please retest them
though as I have changed things substantially).

The main change is that the legacy UDP code path is now gone
so we use the same UDP header generation whether corking is on
or off.

I will add IPv6 support in a later patch set.

Thanks!
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 3/5] inet: Add ip_make_skb and ip_finish_skb
  2011-03-01 12:35                 ` Herbert Xu
  2011-03-01 12:36                   ` [PATCH 1/5] inet: Remove unused sk_sndmsg_* from UFO Herbert Xu
@ 2011-03-01 12:36                   ` Herbert Xu
  2011-03-01 12:36                   ` [PATCH 2/5] inet: Remove explicit write references to sk/inet in ip_append_data Herbert Xu
                                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 91+ messages in thread
From: Herbert Xu @ 2011-03-01 12:36 UTC (permalink / raw)
  To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev, Thomas Graf

inet: Add ip_make_skb and ip_finish_skb

This patch adds the helper ip_make_skb which is like ip_append_data
and ip_push_pending_frames all rolled into one, except that it does
not send the skb produced.  The sending part is carried out by
ip_send_skb, which the transport protocol can call after it has
tweaked the skb.

It is meant to be called in cases where corking is not used, and
should have a one-to-one correspondence to sendmsg.

This patch also adds the helper ip_finish_skb which is meant to
replace ip_push_pending_frames when corking is required.
Previously the protocol stack would peek at the socket write
queue and add its header to the first packet.  With ip_finish_skb,
the protocol stack can directly operate on the final skb instead,
just like the non-corking case with ip_make_skb.
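
For the non-corking case the intended calling pattern in a transport
protocol is roughly the following (sketch only; the UDP conversion
later in this series does essentially this):

	skb = ip_make_skb(sk, getfrag, from, length, transhdrlen,
			  &ipc, &rt, flags);
	if (skb && !IS_ERR(skb)) {
		/* transport adds its own header/checksum to skb here */
		err = ip_send_skb(skb);
	}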

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/net/ip.h     |   16 ++++++++++++
 net/ipv4/ip_output.c |   65 ++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 67 insertions(+), 14 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 67fac78..a4f6311 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -116,8 +116,24 @@ extern int		ip_append_data(struct sock *sk,
 extern int		ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb);
 extern ssize_t		ip_append_page(struct sock *sk, struct page *page,
 				int offset, size_t size, int flags);
+extern struct sk_buff  *__ip_make_skb(struct sock *sk,
+				      struct sk_buff_head *queue,
+				      struct inet_cork *cork);
+extern int		ip_send_skb(struct sk_buff *skb);
 extern int		ip_push_pending_frames(struct sock *sk);
 extern void		ip_flush_pending_frames(struct sock *sk);
+extern struct sk_buff  *ip_make_skb(struct sock *sk,
+				    int getfrag(void *from, char *to, int offset, int len,
+						int odd, struct sk_buff *skb),
+				    void *from, int length, int transhdrlen,
+				    struct ipcm_cookie *ipc,
+				    struct rtable **rtp,
+				    unsigned int flags);
+
+static inline struct sk_buff *ip_finish_skb(struct sock *sk)
+{
+	return __ip_make_skb(sk, &sk->sk_write_queue, &inet_sk(sk)->cork);
+}
 
 /* datagram.c */
 extern int		ip4_datagram_connect(struct sock *sk, 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 1dd5ecc..460308c 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1267,9 +1267,9 @@ static void ip_cork_release(struct inet_cork *cork)
  *	Combined all pending IP fragments on the socket as one IP datagram
  *	and push them out.
  */
-static int __ip_push_pending_frames(struct sock *sk,
-				    struct sk_buff_head *queue,
-				    struct inet_cork *cork)
+struct sk_buff *__ip_make_skb(struct sock *sk,
+			      struct sk_buff_head *queue,
+			      struct inet_cork *cork)
 {
 	struct sk_buff *skb, *tmp_skb;
 	struct sk_buff **tail_skb;
@@ -1280,7 +1280,6 @@ static int __ip_push_pending_frames(struct sock *sk,
 	struct iphdr *iph;
 	__be16 df = 0;
 	__u8 ttl;
-	int err = 0;
 
 	if ((skb = __skb_dequeue(queue)) == NULL)
 		goto out;
@@ -1351,28 +1350,37 @@ static int __ip_push_pending_frames(struct sock *sk,
 		icmp_out_count(net, ((struct icmphdr *)
 			skb_transport_header(skb))->type);
 
-	/* Netfilter gets whole the not fragmented skb. */
+	ip_cork_release(cork);
+out:
+	return skb;
+}
+
+int ip_send_skb(struct sk_buff *skb)
+{
+	struct net *net = sock_net(skb->sk);
+	int err;
+
 	err = ip_local_out(skb);
 	if (err) {
 		if (err > 0)
 			err = net_xmit_errno(err);
 		if (err)
-			goto error;
+			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
 	}
 
-out:
-	ip_cork_release(cork);
 	return err;
-
-error:
-	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
-	goto out;
 }
 
 int ip_push_pending_frames(struct sock *sk)
 {
-	return __ip_push_pending_frames(sk, &sk->sk_write_queue,
-					&inet_sk(sk)->cork);
+	struct sk_buff *skb;
+
+	skb = ip_finish_skb(sk);
+	if (!skb)
+		return 0;
+
+	/* Netfilter gets whole the not fragmented skb. */
+	return ip_send_skb(skb);
 }
 
 /*
@@ -1395,6 +1403,35 @@ void ip_flush_pending_frames(struct sock *sk)
 	__ip_flush_pending_frames(sk, &sk->sk_write_queue, &inet_sk(sk)->cork);
 }
 
+struct sk_buff *ip_make_skb(struct sock *sk,
+			    int getfrag(void *from, char *to, int offset,
+					int len, int odd, struct sk_buff *skb),
+			    void *from, int length, int transhdrlen,
+			    struct ipcm_cookie *ipc, struct rtable **rtp,
+			    unsigned int flags)
+{
+	struct inet_cork cork = {};
+	struct sk_buff_head queue;
+	int err;
+
+	if (flags & MSG_PROBE)
+		return NULL;
+
+	__skb_queue_head_init(&queue);
+
+	err = ip_setup_cork(sk, &cork, ipc, rtp);
+	if (err)
+		return ERR_PTR(err);
+
+	err = __ip_append_data(sk, &queue, &cork, getfrag,
+			       from, length, transhdrlen, flags);
+	if (err) {
+		__ip_flush_pending_frames(sk, &queue, &cork);
+		return ERR_PTR(err);
+	}
+
+	return __ip_make_skb(sk, &queue, &cork);
+}
 
 /*
  *	Fetch data from kernel space and fill in checksum if needed.

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 2/5] inet: Remove explicit write references to sk/inet in ip_append_data
  2011-03-01 12:35                 ` Herbert Xu
  2011-03-01 12:36                   ` [PATCH 1/5] inet: Remove unused sk_sndmsg_* from UFO Herbert Xu
  2011-03-01 12:36                   ` [PATCH 3/5] inet: Add ip_make_skb and ip_finish_skb Herbert Xu
@ 2011-03-01 12:36                   ` Herbert Xu
  2011-03-02  6:15                     ` inet: Replace left-over references to inet->cork Herbert Xu
  2011-03-01 12:36                   ` [PATCH 4/5] udp: Switch to ip_finish_skb Herbert Xu
                                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 91+ messages in thread
From: Herbert Xu @ 2011-03-01 12:36 UTC (permalink / raw)
  To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev, Thomas Graf

inet: Remove explicit write references to sk/inet in ip_append_data

In order to allow simultaneous calls to ip_append_data on the same
socket, it must not modify any shared state in sk or inet (other
than state that is designed to allow it, such as atomic counters).

This patch abstracts out write references to sk and inet_sk in
ip_append_data and its friends so that we may use the underlying
code in parallel.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/net/inet_sock.h |   23 ++--
 net/ipv4/ip_output.c    |  238 ++++++++++++++++++++++++++++--------------------
 2 files changed, 154 insertions(+), 107 deletions(-)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 8181498..b3de102 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -86,6 +86,19 @@ static inline struct inet_request_sock *inet_rsk(const struct request_sock *sk)
 	return (struct inet_request_sock *)sk;
 }
 
+struct inet_cork {
+	unsigned int		flags;
+	unsigned int		fragsize;
+	struct ip_options	*opt;
+	struct dst_entry	*dst;
+	int			length; /* Total length of all frames */
+	__be32			addr;
+	struct flowi		fl;
+	struct page		*page;
+	u32			off;
+	u8			tx_flags;
+};
+
 struct ip_mc_socklist;
 struct ipv6_pinfo;
 struct rtable;
@@ -143,15 +156,7 @@ struct inet_sock {
 	int			mc_index;
 	__be32			mc_addr;
 	struct ip_mc_socklist __rcu	*mc_list;
-	struct {
-		unsigned int		flags;
-		unsigned int		fragsize;
-		struct ip_options	*opt;
-		struct dst_entry	*dst;
-		int			length; /* Total length of all frames */
-		__be32			addr;
-		struct flowi		fl;
-	} cork;
+	struct inet_cork	cork;
 };
 
 #define IPCORK_OPT	1	/* ip-options has been held in ipcork.opt */
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index d3a4540..1dd5ecc 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -733,6 +733,7 @@ csum_page(struct page *page, int offset, int copy)
 }
 
 static inline int ip_ufo_append_data(struct sock *sk,
+			struct sk_buff_head *queue,
 			int getfrag(void *from, char *to, int offset, int len,
 			       int odd, struct sk_buff *skb),
 			void *from, int length, int hh_len, int fragheaderlen,
@@ -745,7 +746,7 @@ static inline int ip_ufo_append_data(struct sock *sk,
 	 * device, so create one single skb packet containing complete
 	 * udp datagram
 	 */
-	if ((skb = skb_peek_tail(&sk->sk_write_queue)) == NULL) {
+	if ((skb = skb_peek_tail(queue)) == NULL) {
 		skb = sock_alloc_send_skb(sk,
 			hh_len + fragheaderlen + transhdrlen + 20,
 			(flags & MSG_DONTWAIT), &err);
@@ -771,35 +772,24 @@ static inline int ip_ufo_append_data(struct sock *sk,
 		/* specify the length of each IP datagram fragment */
 		skb_shinfo(skb)->gso_size = mtu - fragheaderlen;
 		skb_shinfo(skb)->gso_type = SKB_GSO_UDP;
-		__skb_queue_tail(&sk->sk_write_queue, skb);
+		__skb_queue_tail(queue, skb);
 	}
 
 	return skb_append_datato_frags(sk, skb, getfrag, from,
 				       (length - transhdrlen));
 }
 
-/*
- *	ip_append_data() and ip_append_page() can make one large IP datagram
- *	from many pieces of data. Each pieces will be holded on the socket
- *	until ip_push_pending_frames() is called. Each piece can be a page
- *	or non-page data.
- *
- *	Not only UDP, other transport protocols - e.g. raw sockets - can use
- *	this interface potentially.
- *
- *	LATER: length must be adjusted by pad at tail, when it is required.
- */
-int ip_append_data(struct sock *sk,
-		   int getfrag(void *from, char *to, int offset, int len,
-			       int odd, struct sk_buff *skb),
-		   void *from, int length, int transhdrlen,
-		   struct ipcm_cookie *ipc, struct rtable **rtp,
-		   unsigned int flags)
+static int __ip_append_data(struct sock *sk, struct sk_buff_head *queue,
+			    struct inet_cork *cork,
+			    int getfrag(void *from, char *to, int offset,
+					int len, int odd, struct sk_buff *skb),
+			    void *from, int length, int transhdrlen,
+			    unsigned int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct sk_buff *skb;
 
-	struct ip_options *opt = NULL;
+	struct ip_options *opt = inet->cork.opt;
 	int hh_len;
 	int exthdrlen;
 	int mtu;
@@ -808,58 +798,19 @@ int ip_append_data(struct sock *sk,
 	int offset = 0;
 	unsigned int maxfraglen, fragheaderlen;
 	int csummode = CHECKSUM_NONE;
-	struct rtable *rt;
-
-	if (flags&MSG_PROBE)
-		return 0;
+	struct rtable *rt = (struct rtable *)cork->dst;
 
-	if (skb_queue_empty(&sk->sk_write_queue)) {
-		/*
-		 * setup for corking.
-		 */
-		opt = ipc->opt;
-		if (opt) {
-			if (inet->cork.opt == NULL) {
-				inet->cork.opt = kmalloc(sizeof(struct ip_options) + 40, sk->sk_allocation);
-				if (unlikely(inet->cork.opt == NULL))
-					return -ENOBUFS;
-			}
-			memcpy(inet->cork.opt, opt, sizeof(struct ip_options)+opt->optlen);
-			inet->cork.flags |= IPCORK_OPT;
-			inet->cork.addr = ipc->addr;
-		}
-		rt = *rtp;
-		if (unlikely(!rt))
-			return -EFAULT;
-		/*
-		 * We steal reference to this route, caller should not release it
-		 */
-		*rtp = NULL;
-		inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE ?
-					    rt->dst.dev->mtu :
-					    dst_mtu(rt->dst.path);
-		inet->cork.dst = &rt->dst;
-		inet->cork.length = 0;
-		sk->sk_sndmsg_page = NULL;
-		sk->sk_sndmsg_off = 0;
-		exthdrlen = rt->dst.header_len;
-		length += exthdrlen;
-		transhdrlen += exthdrlen;
-	} else {
-		rt = (struct rtable *)inet->cork.dst;
-		if (inet->cork.flags & IPCORK_OPT)
-			opt = inet->cork.opt;
+	exthdrlen = transhdrlen ? rt->dst.header_len : 0;
+	length += exthdrlen;
+	transhdrlen += exthdrlen;
+	mtu = inet->cork.fragsize;
 
-		transhdrlen = 0;
-		exthdrlen = 0;
-		mtu = inet->cork.fragsize;
-	}
 	hh_len = LL_RESERVED_SPACE(rt->dst.dev);
 
 	fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
 	maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
 
-	if (inet->cork.length + length > 0xFFFF - fragheaderlen) {
+	if (cork->length + length > 0xFFFF - fragheaderlen) {
 		ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->inet_dport,
 			       mtu-exthdrlen);
 		return -EMSGSIZE;
@@ -875,15 +826,15 @@ int ip_append_data(struct sock *sk,
 	    !exthdrlen)
 		csummode = CHECKSUM_PARTIAL;
 
-	skb = skb_peek_tail(&sk->sk_write_queue);
+	skb = skb_peek_tail(queue);
 
-	inet->cork.length += length;
+	cork->length += length;
 	if (((length > mtu) || (skb && skb_is_gso(skb))) &&
 	    (sk->sk_protocol == IPPROTO_UDP) &&
 	    (rt->dst.dev->features & NETIF_F_UFO)) {
-		err = ip_ufo_append_data(sk, getfrag, from, length, hh_len,
-					 fragheaderlen, transhdrlen, mtu,
-					 flags);
+		err = ip_ufo_append_data(sk, queue, getfrag, from, length,
+					 hh_len, fragheaderlen, transhdrlen,
+					 mtu, flags);
 		if (err)
 			goto error;
 		return 0;
@@ -960,7 +911,7 @@ alloc_new_skb:
 				else
 					/* only the initial fragment is
 					   time stamped */
-					ipc->tx_flags = 0;
+					cork->tx_flags = 0;
 			}
 			if (skb == NULL)
 				goto error;
@@ -971,7 +922,7 @@ alloc_new_skb:
 			skb->ip_summed = csummode;
 			skb->csum = 0;
 			skb_reserve(skb, hh_len);
-			skb_shinfo(skb)->tx_flags = ipc->tx_flags;
+			skb_shinfo(skb)->tx_flags = cork->tx_flags;
 
 			/*
 			 *	Find where to start putting bytes.
@@ -1008,7 +959,7 @@ alloc_new_skb:
 			/*
 			 * Put the packet on the pending queue.
 			 */
-			__skb_queue_tail(&sk->sk_write_queue, skb);
+			__skb_queue_tail(queue, skb);
 			continue;
 		}
 
@@ -1028,8 +979,8 @@ alloc_new_skb:
 		} else {
 			int i = skb_shinfo(skb)->nr_frags;
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i-1];
-			struct page *page = sk->sk_sndmsg_page;
-			int off = sk->sk_sndmsg_off;
+			struct page *page = cork->page;
+			int off = cork->off;
 			unsigned int left;
 
 			if (page && (left = PAGE_SIZE - off) > 0) {
@@ -1041,7 +992,7 @@ alloc_new_skb:
 						goto error;
 					}
 					get_page(page);
-					skb_fill_page_desc(skb, i, page, sk->sk_sndmsg_off, 0);
+					skb_fill_page_desc(skb, i, page, off, 0);
 					frag = &skb_shinfo(skb)->frags[i];
 				}
 			} else if (i < MAX_SKB_FRAGS) {
@@ -1052,8 +1003,8 @@ alloc_new_skb:
 					err = -ENOMEM;
 					goto error;
 				}
-				sk->sk_sndmsg_page = page;
-				sk->sk_sndmsg_off = 0;
+				cork->page = page;
+				cork->off = 0;
 
 				skb_fill_page_desc(skb, i, page, 0, 0);
 				frag = &skb_shinfo(skb)->frags[i];
@@ -1065,7 +1016,7 @@ alloc_new_skb:
 				err = -EFAULT;
 				goto error;
 			}
-			sk->sk_sndmsg_off += copy;
+			cork->off += copy;
 			frag->size += copy;
 			skb->len += copy;
 			skb->data_len += copy;
@@ -1079,11 +1030,87 @@ alloc_new_skb:
 	return 0;
 
 error:
-	inet->cork.length -= length;
+	cork->length -= length;
 	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
 	return err;
 }
 
+static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
+			 struct ipcm_cookie *ipc, struct rtable **rtp)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	struct ip_options *opt;
+	struct rtable *rt;
+
+	/*
+	 * setup for corking.
+	 */
+	opt = ipc->opt;
+	if (opt) {
+		if (cork->opt == NULL) {
+			cork->opt = kmalloc(sizeof(struct ip_options) + 40,
+					    sk->sk_allocation);
+			if (unlikely(cork->opt == NULL))
+				return -ENOBUFS;
+		}
+		memcpy(cork->opt, opt, sizeof(struct ip_options) + opt->optlen);
+		cork->flags |= IPCORK_OPT;
+		cork->addr = ipc->addr;
+	}
+	rt = *rtp;
+	if (unlikely(!rt))
+		return -EFAULT;
+	/*
+	 * We steal reference to this route, caller should not release it
+	 */
+	*rtp = NULL;
+	cork->fragsize = inet->pmtudisc == IP_PMTUDISC_PROBE ?
+			 rt->dst.dev->mtu : dst_mtu(rt->dst.path);
+	cork->dst = &rt->dst;
+	cork->length = 0;
+	cork->tx_flags = ipc->tx_flags;
+	cork->page = NULL;
+	cork->off = 0;
+
+	return 0;
+}
+
+/*
+ *	ip_append_data() and ip_append_page() can make one large IP datagram
+ *	from many pieces of data. Each pieces will be holded on the socket
+ *	until ip_push_pending_frames() is called. Each piece can be a page
+ *	or non-page data.
+ *
+ *	Not only UDP, other transport protocols - e.g. raw sockets - can use
+ *	this interface potentially.
+ *
+ *	LATER: length must be adjusted by pad at tail, when it is required.
+ */
+int ip_append_data(struct sock *sk,
+		   int getfrag(void *from, char *to, int offset, int len,
+			       int odd, struct sk_buff *skb),
+		   void *from, int length, int transhdrlen,
+		   struct ipcm_cookie *ipc, struct rtable **rtp,
+		   unsigned int flags)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	int err;
+
+	if (flags&MSG_PROBE)
+		return 0;
+
+	if (skb_queue_empty(&sk->sk_write_queue)) {
+		err = ip_setup_cork(sk, &inet->cork, ipc, rtp);
+		if (err)
+			return err;
+	} else {
+		transhdrlen = 0;
+	}
+
+	return __ip_append_data(sk, &sk->sk_write_queue, &inet->cork, getfrag,
+				from, length, transhdrlen, flags);
+}
+
 ssize_t	ip_append_page(struct sock *sk, struct page *page,
 		       int offset, size_t size, int flags)
 {
@@ -1227,40 +1254,42 @@ error:
 	return err;
 }
 
-static void ip_cork_release(struct inet_sock *inet)
+static void ip_cork_release(struct inet_cork *cork)
 {
-	inet->cork.flags &= ~IPCORK_OPT;
-	kfree(inet->cork.opt);
-	inet->cork.opt = NULL;
-	dst_release(inet->cork.dst);
-	inet->cork.dst = NULL;
+	cork->flags &= ~IPCORK_OPT;
+	kfree(cork->opt);
+	cork->opt = NULL;
+	dst_release(cork->dst);
+	cork->dst = NULL;
 }
 
 /*
  *	Combined all pending IP fragments on the socket as one IP datagram
  *	and push them out.
  */
-int ip_push_pending_frames(struct sock *sk)
+static int __ip_push_pending_frames(struct sock *sk,
+				    struct sk_buff_head *queue,
+				    struct inet_cork *cork)
 {
 	struct sk_buff *skb, *tmp_skb;
 	struct sk_buff **tail_skb;
 	struct inet_sock *inet = inet_sk(sk);
 	struct net *net = sock_net(sk);
 	struct ip_options *opt = NULL;
-	struct rtable *rt = (struct rtable *)inet->cork.dst;
+	struct rtable *rt = (struct rtable *)cork->dst;
 	struct iphdr *iph;
 	__be16 df = 0;
 	__u8 ttl;
 	int err = 0;
 
-	if ((skb = __skb_dequeue(&sk->sk_write_queue)) == NULL)
+	if ((skb = __skb_dequeue(queue)) == NULL)
 		goto out;
 	tail_skb = &(skb_shinfo(skb)->frag_list);
 
 	/* move skb->data to ip header from ext header */
 	if (skb->data < skb_network_header(skb))
 		__skb_pull(skb, skb_network_offset(skb));
-	while ((tmp_skb = __skb_dequeue(&sk->sk_write_queue)) != NULL) {
+	while ((tmp_skb = __skb_dequeue(queue)) != NULL) {
 		__skb_pull(tmp_skb, skb_network_header_len(skb));
 		*tail_skb = tmp_skb;
 		tail_skb = &(tmp_skb->next);
@@ -1286,8 +1315,8 @@ int ip_push_pending_frames(struct sock *sk)
 	     ip_dont_fragment(sk, &rt->dst)))
 		df = htons(IP_DF);
 
-	if (inet->cork.flags & IPCORK_OPT)
-		opt = inet->cork.opt;
+	if (cork->flags & IPCORK_OPT)
+		opt = cork->opt;
 
 	if (rt->rt_type == RTN_MULTICAST)
 		ttl = inet->mc_ttl;
@@ -1299,7 +1328,7 @@ int ip_push_pending_frames(struct sock *sk)
 	iph->ihl = 5;
 	if (opt) {
 		iph->ihl += opt->optlen>>2;
-		ip_options_build(skb, opt, inet->cork.addr, rt, 0);
+		ip_options_build(skb, opt, cork->addr, rt, 0);
 	}
 	iph->tos = inet->tos;
 	iph->frag_off = df;
@@ -1315,7 +1344,7 @@ int ip_push_pending_frames(struct sock *sk)
 	 * Steal rt from cork.dst to avoid a pair of atomic_inc/atomic_dec
 	 * on dst refcount
 	 */
-	inet->cork.dst = NULL;
+	cork->dst = NULL;
 	skb_dst_set(skb, &rt->dst);
 
 	if (iph->protocol == IPPROTO_ICMP)
@@ -1332,7 +1361,7 @@ int ip_push_pending_frames(struct sock *sk)
 	}
 
 out:
-	ip_cork_release(inet);
+	ip_cork_release(cork);
 	return err;
 
 error:
@@ -1340,17 +1369,30 @@ error:
 	goto out;
 }
 
+int ip_push_pending_frames(struct sock *sk)
+{
+	return __ip_push_pending_frames(sk, &sk->sk_write_queue,
+					&inet_sk(sk)->cork);
+}
+
 /*
  *	Throw away all pending data on the socket.
  */
-void ip_flush_pending_frames(struct sock *sk)
+static void __ip_flush_pending_frames(struct sock *sk,
+				      struct sk_buff_head *queue,
+				      struct inet_cork *cork)
 {
 	struct sk_buff *skb;
 
-	while ((skb = __skb_dequeue_tail(&sk->sk_write_queue)) != NULL)
+	while ((skb = __skb_dequeue_tail(queue)) != NULL)
 		kfree_skb(skb);
 
-	ip_cork_release(inet_sk(sk));
+	ip_cork_release(cork);
+}
+
+void ip_flush_pending_frames(struct sock *sk)
+{
+	__ip_flush_pending_frames(sk, &sk->sk_write_queue, &inet_sk(sk)->cork);
 }
 
 

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 1/5] inet: Remove unused sk_sndmsg_* from UFO
  2011-03-01 12:35                 ` Herbert Xu
@ 2011-03-01 12:36                   ` Herbert Xu
  2011-03-01 12:36                   ` [PATCH 3/5] inet: Add ip_make_skb and ip_finish_skb Herbert Xu
                                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 91+ messages in thread
From: Herbert Xu @ 2011-03-01 12:36 UTC (permalink / raw)
  To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev, Thomas Graf

inet: Remove unused sk_sndmsg_* from UFO

UFO doesn't really use the sk_sndmsg_* parameters so touching
them is pointless.  It can't use them anyway since the whole
point of UFO is to use the original pages without copying.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/core/skbuff.c     |    3 ---
 net/ipv4/ip_output.c  |    1 -
 net/ipv6/ip6_output.c |    1 -
 3 files changed, 5 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d883dcc..97011a7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2434,8 +2434,6 @@ int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb,
 			return -ENOMEM;
 
 		/* initialize the next frag */
-		sk->sk_sndmsg_page = page;
-		sk->sk_sndmsg_off = 0;
 		skb_fill_page_desc(skb, frg_cnt, page, 0, 0);
 		skb->truesize += PAGE_SIZE;
 		atomic_add(PAGE_SIZE, &sk->sk_wmem_alloc);
@@ -2455,7 +2453,6 @@ int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb,
 			return -EFAULT;
 
 		/* copy was successful so update the size parameters */
-		sk->sk_sndmsg_off += copy;
 		frag->size += copy;
 		skb->len += copy;
 		skb->data_len += copy;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 04c7b3b..d3a4540 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -767,7 +767,6 @@ static inline int ip_ufo_append_data(struct sock *sk,
 
 		skb->ip_summed = CHECKSUM_PARTIAL;
 		skb->csum = 0;
-		sk->sk_sndmsg_off = 0;
 
 		/* specify the length of each IP datagram fragment */
 		skb_shinfo(skb)->gso_size = mtu - fragheaderlen;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 5f8d242..9965182 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1061,7 +1061,6 @@ static inline int ip6_ufo_append_data(struct sock *sk,
 
 		skb->ip_summed = CHECKSUM_PARTIAL;
 		skb->csum = 0;
-		sk->sk_sndmsg_off = 0;
 	}
 
 	err = skb_append_datato_frags(sk,skb, getfrag, from,

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 5/5] udp: Add lockless transmit path
  2011-03-01 12:35                 ` Herbert Xu
                                     ` (3 preceding siblings ...)
  2011-03-01 12:36                   ` [PATCH 4/5] udp: Switch to ip_finish_skb Herbert Xu
@ 2011-03-01 12:36                   ` Herbert Xu
  2011-03-01 16:43                   ` SO_REUSEPORT - can it be done in kernel? Eric Dumazet
  5 siblings, 0 replies; 91+ messages in thread
From: Herbert Xu @ 2011-03-01 12:36 UTC (permalink / raw)
  To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev, Thomas Graf

udp: Add lockless transmit path

The UDP transmit path has been running under the socket lock
for a long time because of the corking feature.  This means that
transmitting to the same socket in multiple threads does not
scale at all.

However, as most users don't actually use corking, the locking
can be removed in the common case.

This patch creates a lockless fast path where corking is not used.

Please note that this does create a slight inaccuracy in the
enforcement of socket send buffer limits.  In particular, we
may exceed the socket limit by up to (number of CPUs) * (packet
size) because of the way the limit is computed.

As the primary purpose of socket buffers is to indicate congestion,
this should not be a great problem for now.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/ipv4/udp.c |   15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 9a6d326..bb9f707 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -802,6 +802,7 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	int err, is_udplite = IS_UDPLITE(sk);
 	int corkreq = up->corkflag || msg->msg_flags&MSG_MORE;
 	int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
+	struct sk_buff *skb;
 
 	if (len > 0xFFFF)
 		return -EMSGSIZE;
@@ -816,6 +817,8 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	ipc.opt = NULL;
 	ipc.tx_flags = 0;
 
+	getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag;
+
 	if (up->pending) {
 		/*
 		 * There are pending frames.
@@ -940,6 +943,17 @@ back_from_confirm:
 	if (!ipc.addr)
 		daddr = ipc.addr = rt->rt_dst;
 
+	/* Lockless fast path for the non-corking case. */
+	if (!corkreq) {
+		skb = ip_make_skb(sk, getfrag, msg->msg_iov, ulen,
+				  sizeof(struct udphdr), &ipc, &rt,
+				  msg->msg_flags);
+		err = PTR_ERR(skb);
+		if (skb && !IS_ERR(skb))
+			err = udp_send_skb(skb, daddr, dport);
+		goto out;
+	}
+
 	lock_sock(sk);
 	if (unlikely(up->pending)) {
 		/* The socket is already corked while preparing it. */
@@ -961,7 +975,6 @@ back_from_confirm:
 
 do_append_data:
 	up->len += ulen;
-	getfrag  =  is_udplite ?  udplite_getfrag : ip_generic_getfrag;
 	err = ip_append_data(sk, getfrag, msg->msg_iov, ulen,
 			sizeof(struct udphdr), &ipc, &rt,
 			corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);

^ permalink raw reply related	[flat|nested] 91+ messages in thread
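
As a rough userspace illustration of which sends hit the new fast path (this
is not part of the patch; the address, port and payload below are made up): a
plain sendto()/sendmsg() without corking goes through the lockless
ip_make_skb()/udp_send_skb() path added above, while MSG_MORE (or the
UDP_CORK socket option) still takes the original, locked ip_append_data()
path.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(53) };
	char q[] = "query";
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0 || inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr) != 1)
		return 1;

	/* Uncorked: one datagram per call.  With the patch applied this
	 * never takes the socket lock, so threads can send in parallel. */
	sendto(fd, q, sizeof(q), 0, (struct sockaddr *)&dst, sizeof(dst));

	/* Corked: the datagram is built across several calls, so the
	 * kernel still has to serialise them under the socket lock. */
	sendto(fd, q, sizeof(q), MSG_MORE, (struct sockaddr *)&dst, sizeof(dst));
	sendto(fd, q, sizeof(q), 0, (struct sockaddr *)&dst, sizeof(dst));

	close(fd);
	return 0;
}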

* [PATCH 4/5] udp: Switch to ip_finish_skb
  2011-03-01 12:35                 ` Herbert Xu
                                     ` (2 preceding siblings ...)
  2011-03-01 12:36                   ` [PATCH 2/5] inet: Remove explicit write references to sk/inet in ip_append_data Herbert Xu
@ 2011-03-01 12:36                   ` Herbert Xu
  2011-03-01 12:36                   ` [PATCH 5/5] udp: Add lockless transmit path Herbert Xu
  2011-03-01 16:43                   ` SO_REUSEPORT - can it be done in kernel? Eric Dumazet
  5 siblings, 0 replies; 91+ messages in thread
From: Herbert Xu @ 2011-03-01 12:36 UTC (permalink / raw)
  To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev, Thomas Graf

udp: Switch to ip_finish_skb

This patch converts UDP to use the new ip_finish_skb API.  This
then allows us to more easily use ip_make_skb, which lets UDP run
without the socket lock.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/net/udp.h     |   11 ++++++
 include/net/udplite.h |   12 +++++++
 net/ipv4/udp.c        |   83 ++++++++++++++++++++++++++++++--------------------
 3 files changed, 73 insertions(+), 33 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index bb967dd..b8563ba 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -144,6 +144,17 @@ static inline __wsum udp_csum_outgoing(struct sock *sk, struct sk_buff *skb)
 	return csum;
 }
 
+static inline __wsum udp_csum(struct sk_buff *skb)
+{
+	__wsum csum = csum_partial(skb_transport_header(skb),
+				   sizeof(struct udphdr), skb->csum);
+
+	for (skb = skb_shinfo(skb)->frag_list; skb; skb = skb->next) {
+		csum = csum_add(csum, skb->csum);
+	}
+	return csum;
+}
+
 /* hash routines shared between UDPv4/6 and UDP-Litev4/6 */
 static inline void udp_lib_hash(struct sock *sk)
 {
diff --git a/include/net/udplite.h b/include/net/udplite.h
index afdffe6..673a024 100644
--- a/include/net/udplite.h
+++ b/include/net/udplite.h
@@ -115,6 +115,18 @@ static inline __wsum udplite_csum_outgoing(struct sock *sk, struct sk_buff *skb)
 	return csum;
 }
 
+static inline __wsum udplite_csum(struct sk_buff *skb)
+{
+	struct sock *sk = skb->sk;
+	int cscov = udplite_sender_cscov(udp_sk(sk), udp_hdr(skb));
+	const int off = skb_transport_offset(skb);
+	const int len = skb->len - off;
+
+	skb->ip_summed = CHECKSUM_NONE;     /* no HW support for checksumming */
+
+	return skb_checksum(skb, off, min(cscov, len), 0);
+}
+
 extern void	udplite4_register(void);
 extern int 	udplite_get_port(struct sock *sk, unsigned short snum,
 			int (*scmp)(const struct sock *, const struct sock *));
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8157b17..9a6d326 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -663,75 +663,72 @@ void udp_flush_pending_frames(struct sock *sk)
 EXPORT_SYMBOL(udp_flush_pending_frames);
 
 /**
- * 	udp4_hwcsum_outgoing  -  handle outgoing HW checksumming
- * 	@sk: 	socket we are sending on
+ * 	udp4_hwcsum  -  handle outgoing HW checksumming
  * 	@skb: 	sk_buff containing the filled-in UDP header
  * 	        (checksum field must be zeroed out)
+ *	@src:	source IP address
+ *	@dst:	destination IP address
  */
-static void udp4_hwcsum_outgoing(struct sock *sk, struct sk_buff *skb,
-				 __be32 src, __be32 dst, int len)
+static void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst)
 {
-	unsigned int offset;
 	struct udphdr *uh = udp_hdr(skb);
+	struct sk_buff *frags = skb_shinfo(skb)->frag_list;
+	int offset = skb_transport_offset(skb);
+	int len = skb->len - offset;
+	int hlen = len;
 	__wsum csum = 0;
 
-	if (skb_queue_len(&sk->sk_write_queue) == 1) {
+	if (!frags) {
 		/*
 		 * Only one fragment on the socket.
 		 */
 		skb->csum_start = skb_transport_header(skb) - skb->head;
 		skb->csum_offset = offsetof(struct udphdr, check);
-		uh->check = ~csum_tcpudp_magic(src, dst, len, IPPROTO_UDP, 0);
+		uh->check = ~csum_tcpudp_magic(src, dst, len,
+					       IPPROTO_UDP, 0);
 	} else {
 		/*
 		 * HW-checksum won't work as there are two or more
 		 * fragments on the socket so that all csums of sk_buffs
 		 * should be together
 		 */
-		offset = skb_transport_offset(skb);
-		skb->csum = skb_checksum(skb, offset, skb->len - offset, 0);
+		do {
+			csum = csum_add(csum, frags->csum);
+			hlen -= frags->len;
+		} while ((frags = frags->next));
 
+		csum = skb_checksum(skb, offset, hlen, csum);
 		skb->ip_summed = CHECKSUM_NONE;
 
-		skb_queue_walk(&sk->sk_write_queue, skb) {
-			csum = csum_add(csum, skb->csum);
-		}
-
 		uh->check = csum_tcpudp_magic(src, dst, len, IPPROTO_UDP, csum);
 		if (uh->check == 0)
 			uh->check = CSUM_MANGLED_0;
 	}
 }
 
-/*
- * Push out all pending data as one UDP datagram. Socket is locked.
- */
-static int udp_push_pending_frames(struct sock *sk)
+static int udp_send_skb(struct sk_buff *skb, __be32 daddr, __be32 dport)
 {
-	struct udp_sock  *up = udp_sk(sk);
+	struct sock *sk = skb->sk;
 	struct inet_sock *inet = inet_sk(sk);
-	struct flowi *fl = &inet->cork.fl;
-	struct sk_buff *skb;
 	struct udphdr *uh;
+	struct rtable *rt = (struct rtable *)skb_dst(skb);
 	int err = 0;
 	int is_udplite = IS_UDPLITE(sk);
+	int offset = skb_transport_offset(skb);
+	int len = skb->len - offset;
 	__wsum csum = 0;
 
-	/* Grab the skbuff where UDP header space exists. */
-	if ((skb = skb_peek(&sk->sk_write_queue)) == NULL)
-		goto out;
-
 	/*
 	 * Create a UDP header
 	 */
 	uh = udp_hdr(skb);
-	uh->source = fl->fl_ip_sport;
-	uh->dest = fl->fl_ip_dport;
-	uh->len = htons(up->len);
+	uh->source = inet->inet_sport;
+	uh->dest = dport;
+	uh->len = htons(len);
 	uh->check = 0;
 
 	if (is_udplite)  				 /*     UDP-Lite      */
-		csum  = udplite_csum_outgoing(sk, skb);
+		csum = udplite_csum(skb);
 
 	else if (sk->sk_no_check == UDP_CSUM_NOXMIT) {   /* UDP csum disabled */
 
@@ -740,20 +737,20 @@ static int udp_push_pending_frames(struct sock *sk)
 
 	} else if (skb->ip_summed == CHECKSUM_PARTIAL) { /* UDP hardware csum */
 
-		udp4_hwcsum_outgoing(sk, skb, fl->fl4_src, fl->fl4_dst, up->len);
+		udp4_hwcsum(skb, rt->rt_src, daddr);
 		goto send;
 
-	} else						 /*   `normal' UDP    */
-		csum = udp_csum_outgoing(sk, skb);
+	} else
+		csum = udp_csum(skb);
 
 	/* add protocol-dependent pseudo-header */
-	uh->check = csum_tcpudp_magic(fl->fl4_src, fl->fl4_dst, up->len,
+	uh->check = csum_tcpudp_magic(rt->rt_src, daddr, len,
 				      sk->sk_protocol, csum);
 	if (uh->check == 0)
 		uh->check = CSUM_MANGLED_0;
 
 send:
-	err = ip_push_pending_frames(sk);
+	err = ip_send_skb(skb);
 	if (err) {
 		if (err == -ENOBUFS && !inet->recverr) {
 			UDP_INC_STATS_USER(sock_net(sk),
@@ -763,6 +760,26 @@ send:
 	} else
 		UDP_INC_STATS_USER(sock_net(sk),
 				   UDP_MIB_OUTDATAGRAMS, is_udplite);
+	return err;
+}
+
+/*
+ * Push out all pending data as one UDP datagram. Socket is locked.
+ */
+static int udp_push_pending_frames(struct sock *sk)
+{
+	struct udp_sock  *up = udp_sk(sk);
+	struct inet_sock *inet = inet_sk(sk);
+	struct flowi *fl = &inet->cork.fl;
+	struct sk_buff *skb;
+	int err = 0;
+
+	skb = ip_finish_skb(sk);
+	if (!skb)
+		goto out;
+
+	err = udp_send_skb(skb, fl->fl4_dst, fl->fl_ip_dport);
+
 out:
 	up->len = 0;
 	up->pending = 0;

^ permalink raw reply related	[flat|nested] 91+ messages in thread
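
The reason udp_csum() and udp4_hwcsum() above can simply accumulate the
per-fragment checksums with csum_add() before the final csum_tcpudp_magic()
is that the Internet checksum is a ones'-complement sum, so partial sums over
disjoint byte ranges can be combined.  A rough userspace illustration (RFC
1071 style, not the kernel's csum_partial()/csum_add(); the split is kept at
an even offset to sidestep the odd-offset handling the kernel does
separately):

#include <stdio.h>

static unsigned int sum16(unsigned int sum, const unsigned char *p, size_t len)
{
	while (len > 1) {
		sum += (unsigned int)((p[0] << 8) | p[1]);
		p += 2;
		len -= 2;
	}
	if (len)
		sum += (unsigned int)(p[0] << 8);
	return sum;
}

static unsigned short fold(unsigned int sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (unsigned short)~sum;
}

int main(void)
{
	unsigned char pkt[64];
	size_t i;

	for (i = 0; i < sizeof(pkt); i++)
		pkt[i] = (unsigned char)(i * 37 + 11);

	/* One pass over the whole buffer vs. two partial sums combined:
	 * the results are identical because the sum is associative. */
	printf("whole=0x%04x split=0x%04x\n",
	       fold(sum16(0, pkt, sizeof(pkt))),
	       fold(sum16(sum16(0, pkt, 32), pkt + 32, 32)));
	return 0;
}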

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 11:53                                     ` Herbert Xu
  2011-03-01 12:32                                       ` Herbert Xu
@ 2011-03-01 13:03                                       ` Eric Dumazet
  2011-03-01 13:18                                         ` Herbert Xu
  1 sibling, 1 reply; 91+ messages in thread
From: Eric Dumazet @ 2011-03-01 13:03 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tuesday, 01 March 2011 at 19:53 +0800, Herbert Xu wrote:
> On Tue, Mar 01, 2011 at 12:45:09PM +0100, Eric Dumazet wrote:
> >
> > CPU 11 handles all TX completions: It's a potential bottleneck.
> > 
> > I might resurrect XPS patch ;)
> 
> Actually this has been my gripe all along with our TX multiqueue
> support.  We should not decide the queue based on the socket, but
> on the current CPU.
> 
> We already do the right thing for forwarded packets because there
> is no socket to latch onto, we just need to fix it for locally
> generated traffic.
> 

I believe it's now done properly (in net-next-2.6) with commit
4f57c087de9b46182 (net: implement mechanism for HW based QOS)

> The odd packet reordering each time your scheduler decides to
> migrate the process isn't a big deal IMHO.  If your scheduler
> is constantly moving things you've got bigger problems to worry
> about.

Well, BENET has one TX queue anyway...



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 12:32                                       ` Herbert Xu
@ 2011-03-01 13:04                                         ` Eric Dumazet
  2011-03-01 13:11                                           ` Herbert Xu
  0 siblings, 1 reply; 91+ messages in thread
From: Eric Dumazet @ 2011-03-01 13:04 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tuesday, 01 March 2011 at 20:32 +0800, Herbert Xu wrote:
> On Tue, Mar 01, 2011 at 07:53:05PM +0800, Herbert Xu wrote:
> > On Tue, Mar 01, 2011 at 12:45:09PM +0100, Eric Dumazet wrote:
> > >
> > > CPU 11 handles all TX completions: It's a potential bottleneck.
> > > 
> > > I might resurrect XPS patch ;)
> > 
> > Actually this has been my gripe all along with our TX multiqueue
> > support.  We should not decide the queue based on the socket, but
> > on the current CPU.
> > 
> > We already do the right thing for forwarded packets because there
> > is no socket to latch onto, we just need to fix it for locally
> > generated traffic.
> > 
> > The odd packet reordering each time your scheduler decides to
> > migrate the process isn't a big deal IMHO.  If your scheduler
> > is constantly moving things you've got bigger problems to worry
> > about.
> 
> If anybody wants to play here is a patch to do exactly that:
> 
> net: Determine TX queue purely by current CPU
> 
> Distributing packets generated on one CPU to multiple queues
> makes no sense.  Nor does putting packets from multiple CPUs
> into a single queue.
> 
> While this may introduce packet reordering should the scheduler
> decide to migrate a thread, it isn't a big deal because migration
> is meant to be a rare event, and nothing will die as long as the
> ordering doesn't occur all the time.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 8ae6631..87bd20a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2164,22 +2164,12 @@ static u32 hashrnd __read_mostly;
>  u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
>  		  unsigned int num_tx_queues)
>  {
> -	u32 hash;
> +	u32 hash = raw_smp_processor_id();
>  
> -	if (skb_rx_queue_recorded(skb)) {
> -		hash = skb_get_rx_queue(skb);
> -		while (unlikely(hash >= num_tx_queues))
> -			hash -= num_tx_queues;
> -		return hash;
> -	}
> +	while (unlikely(hash >= num_tx_queues))
> +		hash -= num_tx_queues;
>  
> -	if (skb->sk && skb->sk->sk_hash)
> -		hash = skb->sk->sk_hash;
> -	else
> -		hash = (__force u16) skb->protocol ^ skb->rxhash;
> -	hash = jhash_1word(hash, hashrnd);
> -
> -	return (u16) (((u64) hash * num_tx_queues) >> 32);
> +	return hash;
>  }
>  EXPORT_SYMBOL(__skb_tx_hash);
>  
> Cheers,

Well, some machines have 4096 cpus ;)




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 13:04                                         ` Eric Dumazet
@ 2011-03-01 13:11                                           ` Herbert Xu
  0 siblings, 0 replies; 91+ messages in thread
From: Herbert Xu @ 2011-03-01 13:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tue, Mar 01, 2011 at 02:04:29PM +0100, Eric Dumazet wrote:
> Well, some machines have 4096 cpus ;)

Well just change it to use the multiplication then :)
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread
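
For scale: with 4096 CPUs and, say, 16 TX queues, the subtract loop in the
patch quoted above can run up to ~255 iterations per packet, whereas the
multiply-shift reduction __skb_tx_hash() already applies to the jhash output
is constant time but assumes its input is spread over the full 32-bit range.
A rough userspace sketch of the two reductions (the function names and the
explicit CPU-id scaling are only for the illustration, not proposed kernel
code):

#include <stdint.h>
#include <stdio.h>

/* Reduction used in the patch: repeated subtraction (up to x / n steps). */
static uint32_t reduce_loop(uint32_t x, uint32_t n)
{
	while (x >= n)
		x -= n;
	return x;
}

/* Multiply-shift reduction __skb_tx_hash() uses on the 32-bit hash:
 * maps a value spread over [0, 2^32) onto [0, n) in one step. */
static uint32_t reduce_mul(uint32_t x, uint32_t n)
{
	return (uint32_t)(((uint64_t)x * n) >> 32);
}

int main(void)
{
	uint32_t nr_cpus = 4096, queues = 16, cpu;

	for (cpu = 0; cpu < nr_cpus; cpu += 777) {
		/* A raw CPU id is tiny compared to 2^32, so it has to be
		 * scaled (or hashed) before the multiply-shift is meaningful.
		 * The two mappings differ (modulo vs. proportional), but both
		 * spread the CPUs evenly over the queues. */
		uint32_t scaled = (uint32_t)(((uint64_t)cpu << 32) / nr_cpus);

		printf("cpu %4u: loop -> %2u, mul -> %2u\n",
		       cpu, reduce_loop(cpu, queues), reduce_mul(scaled, queues));
	}
	return 0;
}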

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 13:03                                       ` Eric Dumazet
@ 2011-03-01 13:18                                         ` Herbert Xu
  2011-03-01 13:52                                           ` Eric Dumazet
  2011-03-01 16:31                                           ` Eric Dumazet
  0 siblings, 2 replies; 91+ messages in thread
From: Herbert Xu @ 2011-03-01 13:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tue, Mar 01, 2011 at 02:03:29PM +0100, Eric Dumazet wrote:
>
> I believe it's now done properly (in net-next-2.6) with commit
> 4f57c087de9b46182 (net: implement mechanism for HW based QOS)

Nope, that has nothing to do with this.

> > The odd packet reordering each time your scheduler decides to
> > migrate the process isn't a big deal IMHO.  If your scheduler
> > is constantly moving things you've got bigger problems to worry
> > about.
> 
> Well, BENET has one TX queue anyway...

Interesting.  So I wonder which lock is showing up at the top
of the profile with a single socket then.  As it's definitely
going away with multiple sockets, that means it's not the TX
queue lock.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 12:01                                     ` Thomas Graf
  2011-03-01 12:15                                       ` Herbert Xu
@ 2011-03-01 13:27                                       ` Herbert Xu
  1 sibling, 0 replies; 91+ messages in thread
From: Herbert Xu @ 2011-03-01 13:27 UTC (permalink / raw)
  To: Eric Dumazet, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tue, Mar 01, 2011 at 07:01:12AM -0500, Thomas Graf wrote:
> On Tue, Mar 01, 2011 at 12:45:09PM +0100, Eric Dumazet wrote:
> 
> This is how perf top looks like with SO_REUSEPORT
> 
> ----------------------------------------------------------------------------------------------------------------------------------
>    PerfTop:   27498 irqs/sec  kernel:50.5%  exact:  0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 1)
> ----------------------------------------------------------------------------------------------------------------------------------
> 
>              samples  pcnt function                      DSO
>              _______ _____ _____________________________ __________________
> 
>             16464.00  6.0% isc_rwlock_lock               libisc.so.62.0.1
>             15462.00  5.7% intel_idle                    [kernel.kallsyms]

So was this a RHEL6 kernel? I wonder if that is what's making it
perform better.

I guess we'll find out tomorrow.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 12:19                                       ` Herbert Xu
@ 2011-03-01 13:50                                         ` Thomas Graf
  2011-03-01 14:06                                           ` Eric Dumazet
  0 siblings, 1 reply; 91+ messages in thread
From: Thomas Graf @ 2011-03-01 13:50 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Eric Dumazet, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tue, Mar 01, 2011 at 08:19:51PM +0800, Herbert Xu wrote:
> On Tue, Mar 01, 2011 at 07:18:29AM -0500, Thomas Graf wrote:
> >
> > 
> > ... makes it use CPU 5 for rxq2 and the qps goes up from 250kqps to 270kqps
> 
> I think the increase here comes from the larger number of packets
> in flight more than anything.
> 
> The bottleneck is still the TX queue (both software and hardware).

Disabled netfilter and reran test

Now does ~316kqps (rx was split over 2 queues)

----------------------------------------------------------------------------------------------------------------------
   PerfTop:   30608 irqs/sec  kernel:66.1%  exact:  0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 1)
----------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ ___________________________________________________________

            19237.00  5.6% _raw_spin_unlock_irqrestore   /lib/modules/2.6.38-rc5+/build/vmlinux                     
            17170.00  5.0% get_rx_page_info              /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
            11411.00  3.3% be_poll_rx                    /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
            11320.00  3.3% isc_rwlock_lock               /usr/lib64/libisc.so.62.0.1                                
            10669.00  3.1% __do_softirq                  /lib/modules/2.6.38-rc5+/build/vmlinux                     
            10655.00  3.1% get_page_from_freelist        /lib/modules/2.6.38-rc5+/build/vmlinux                     
             9523.00  2.8% intel_idle                    /lib/modules/2.6.38-rc5+/build/vmlinux                     
             8677.00  2.5% __udp4_lib_lookup             /lib/modules/2.6.38-rc5+/build/vmlinux                     
             8379.00  2.4% sock_queue_rcv_skb            /lib/modules/2.6.38-rc5+/build/vmlinux                     
             8226.00  2.4% sk_run_filter                 /lib/modules/2.6.38-rc5+/build/vmlinux                     
             6724.00  1.9% __netif_receive_skb           /lib/modules/2.6.38-rc5+/build/vmlinux                     
             6553.00  1.9% __alloc_skb                   /lib/modules/2.6.38-rc5+/build/vmlinux                     
             6205.00  1.8% udp_queue_rcv_skb             /lib/modules/2.6.38-rc5+/build/vmlinux                     
             6038.00  1.7% _raw_spin_lock                /lib/modules/2.6.38-rc5+/build/vmlinux                     
             5868.00  1.7% isc_rwlock_unlock             /usr/lib64/libisc.so.62.0.1                                
             5696.00  1.6% dns_rbt_findnode              /usr/lib64/libdns.so.69.0.1                                
             5647.00  1.6% read_tsc                      /lib/modules/2.6.38-rc5+/build/vmlinux                     
             5633.00  1.6% getnstimeofday                /lib/modules/2.6.38-rc5+/build/vmlinux                     
             5448.00  1.6% kmem_cache_alloc_node_trace   /lib/modules/2.6.38-rc5+/build/vmlinux                     
             5272.00  1.5% finish_task_switch            /lib/modules/2.6.38-rc5+/build/vmlinux                     
             4719.00  1.4% sock_def_readable             /lib/modules/2.6.38-rc5+/build/vmlinux                     
             4002.00  1.2% is_swiotlb_buffer             /lib/modules/2.6.38-rc5+/build/vmlinux                     
             3914.00  1.1% memcpy                        /lib/modules/2.6.38-rc5+/build/vmlinux                     
             3717.00  1.1% isc_stats_increment           /usr/lib64/libisc.so.62.0.1                                
             3706.00  1.1% __udp4_lib_rcv                /lib/modules/2.6.38-rc5+/build/vmlinux                     
             3653.00  1.1% ip_rcv                        /lib/modules/2.6.38-rc5+/build/vmlinux                     
             3598.00  1.0% kmem_cache_alloc_node         /lib/modules/2.6.38-rc5+/build/vmlinux                     
             3407.00  1.0% ip_route_input_common         /lib/modules/2.6.38-rc5+/build/vmlinux                     
             2683.00  0.8% be_post_rx_frags              /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
             2666.00  0.8% __pthread_mutex_lock_internal /lib64/libpthread-2.12.so                                  
             2331.00  0.7% __phys_addr                   /lib/modules/2.6.38-rc5+/build/vmlinux                     
             2230.00  0.6% __alloc_pages_nodemask        /lib/modules/2.6.38-rc5+/build/vmlinux                     
             2023.00  0.6% dns_name_fullcompare          /usr/lib64/libdns.so.69.0.1                                
             1972.00  0.6% packet_rcv                    /lib/modules/2.6.38-rc5+/build/vmlinux                     
             1902.00  0.6% eth_type_trans                /lib/modules/2.6.38-rc5+/build/vmlinux                     
             1860.00  0.5% __pthread_mutex_unlock        /lib64/libpthread-2.12.so                                  
             1804.00  0.5% fget_light                    /lib/modules/2.6.38-rc5+/build/vmlinux                     
             1739.00  0.5% alloc_pages_current           /lib/modules/2.6.38-rc5+/build/vmlinux                     
             1736.00  0.5% dns_rbtnodechain_init         /usr/lib64/libdns.so.69.0.1                            

----------------------------------------------------------------------------------------------------------------------
   PerfTop:   29038 irqs/sec  kernel:48.0%  exact:  0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 11)
----------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ ___________________________________________________________

            12833.00  7.5% intel_idle                    /lib/modules/2.6.38-rc5+/build/vmlinux                     
            10771.00  6.3% isc_rwlock_lock               /usr/lib64/libisc.so.62.0.1                                
             8713.00  5.1% be_tx_compl_process           /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
             6452.00  3.8% kfree                         /lib/modules/2.6.38-rc5+/build/vmlinux                     
             5935.00  3.5% skb_release_data              /lib/modules/2.6.38-rc5+/build/vmlinux                     
             5552.00  3.2% kmem_cache_free               /lib/modules/2.6.38-rc5+/build/vmlinux                     
             5292.00  3.1% isc_rwlock_unlock             /usr/lib64/libisc.so.62.0.1                                
             4893.00  2.9% dns_rbt_findnode              /usr/lib64/libdns.so.69.0.1                                
             4413.00  2.6% kfree_skb                     /lib/modules/2.6.38-rc5+/build/vmlinux                     
             3802.00  2.2% be_poll_tx_mcc                /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
             3515.00  2.1% isc_stats_increment           /usr/lib64/libisc.so.62.0.1                                
             3016.00  1.8% _raw_spin_unlock_irqrestore   /lib/modules/2.6.38-rc5+/build/vmlinux                     
             2202.00  1.3% __do_softirq                  /lib/modules/2.6.38-rc5+/build/vmlinux                     
             2027.00  1.2% _raw_spin_lock                /lib/modules/2.6.38-rc5+/build/vmlinux                     
             1935.00  1.1% finish_task_switch            /lib/modules/2.6.38-rc5+/build/vmlinux                     
             1906.00  1.1% __pthread_mutex_lock_internal /lib64/libpthread-2.12.so                                  
             1837.00  1.1% dns_name_fullcompare          /usr/lib64/libdns.so.69.0.1                                
             1702.00  1.0% dns_rbtnodechain_init         /usr/lib64/libdns.so.69.0.1                                
             1561.00  0.9% fget_light                    /lib/modules/2.6.38-rc5+/build/vmlinux                     
             1559.00  0.9% dns_name_getlabelsequence     /usr/lib64/libdns.so.69.0.1                                
             1491.00  0.9% dns_name_equal                /usr/lib64/libdns.so.69.0.1                                
             1464.00  0.9% __pthread_mutex_unlock        /lib64/libpthread-2.12.so                                  
             1454.00  0.9% dns_acl_match                 /usr/lib64/libdns.so.69.0.1                                
             1293.00  0.8% dns_zone_attach               /usr/lib64/libdns.so.69.0.1                                
             1245.00  0.7% be_xmit                       /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
             1159.00  0.7% dns_message_rendersection     /usr/lib64/libdns.so.69.0.1                                
             1115.00  0.7% isc___mempool_get             /usr/lib64/libisc.so.62.0.1                                
             1100.00  0.6% copy_user_generic_string      /lib/modules/2.6.38-rc5+/build/vmlinux                     
             1030.00  0.6% dns_name_fromwire             /usr/lib64/libdns.so.69.0.1                                
             1015.00  0.6% dns_name_hash                 /usr/lib64/libdns.so.69.0.1                                
             1013.00  0.6% isc_radix_search              /usr/lib64/libisc.so.62.0.1                                
              970.00  0.6% __ip_route_output_key         /lib/modules/2.6.38-rc5+/build/vmlinux                     
              917.00  0.5% fput                          /lib/modules/2.6.38-rc5+/build/vmlinux                     
              817.00  0.5% dev_queue_xmit                /lib/modules/2.6.38-rc5+/build/vmlinux                     
              812.00  0.5% sk_run_filter                 /lib/modules/2.6.38-rc5+/build/vmlinux                     
              806.00  0.5% avc_has_perm_noaudit          /lib/modules/2.6.38-rc5+/build/vmlinux                     
              802.00  0.5% sock_wfree                    /lib/modules/2.6.38-rc5+/build/vmlinux                     
              793.00  0.5% dns_name_towire               /usr/lib64/libdns.so.69.0.1                                
              754.00  0.4% sock_alloc_send_pskb          /lib/modules/2.6.38-rc5+/build/vmlinux                     
              752.00  0.4% dns_message_parse             /usr/lib64/libdns.so.69.0.1                                
              749.00  0.4% dns_rdata_towire              /usr/lib64/libdns.so.69.0.1                                
              728.00  0.4% dns_rdataset_init             /usr/lib64/libdns.so.69.0.1                                
              709.00  0.4% isc___mempool_put             /usr/lib64/libisc.so.62.0.1                                
              699.00  0.4% skb_release_head_state        /lib/modules/2.6.38-rc5+/build/vmlinux                     
              685.00  0.4% _raw_spin_lock_bh             /lib/modules/2.6.38-rc5+/build/vmlinux                     
              683.00  0.4% dns_name_concatenate          /usr/lib64/libdns.so.69.0.1                                
              678.00  0.4% __ip_append_data              /lib/modules/2.6.38-rc5+/build/vmlinux                     
              673.00  0.4% tick_nohz_stop_sched_tick     /lib/modules/2.6.38-rc5+/build/vmlinux                     
              662.00  0.4% sys_sendmsg                   /lib/modules/2.6.38-rc5+/build/vmlinux                     
              654.00  0.4% dns_compress_findglobal       /usr/lib64/libdns.so.69.0.1                                
              654.00  0.4% memcpy                        /lib64/libc-2.12.so                                        
              637.00  0.4% dns_compress_invalidate       /usr/lib64/libdns.so.69.0.1                                
              597.00  0.3% isc__buffer_init              /usr/lib64/libisc.so.62.0.1                                
              595.00  0.3% dns_zone_detach               /usr/lib64/libdns.so.69.0.1                                
WARNING: failed to keep up with mmap data.
WARNING: failed to keep up with mmap data.


----------------------------------------------------------------------------------------------------------------------
   PerfTop:   29539 irqs/sec  kernel:47.0%  exact:  0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 11)
----------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ ___________________________________________________________

            14478.00  7.5% intel_idle                    /lib/modules/2.6.38-rc5+/build/vmlinux                     
            12279.00  6.3% isc_rwlock_lock               /usr/lib64/libisc.so.62.0.1                                
             9844.00  5.1% be_tx_compl_process           /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
             7368.00  3.8% kfree                         /lib/modules/2.6.38-rc5+/build/vmlinux                     
             6696.00  3.5% skb_release_data              /lib/modules/2.6.38-rc5+/build/vmlinux                     
             6240.00  3.2% kmem_cache_free               /lib/modules/2.6.38-rc5+/build/vmlinux                     
             6034.00  3.1% isc_rwlock_unlock             /usr/lib64/libisc.so.62.0.1                                
             5547.00  2.9% dns_rbt_findnode              /usr/lib64/libdns.so.69.0.1                                
             5012.00  2.6% kfree_skb                     /lib/modules/2.6.38-rc5+/build/vmlinux                     
             4290.00  2.2% be_poll_tx_mcc                /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
             4024.00  2.1% isc_stats_increment           /usr/lib64/libisc.so.62.0.1                                
             3417.00  1.8% _raw_spin_unlock_irqrestore   /lib/modules/2.6.38-rc5+/build/vmlinux                     
             2470.00  1.3% __do_softirq                  /lib/modules/2.6.38-rc5+/build/vmlinux                     
             2312.00  1.2% _raw_spin_lock                /lib/modules/2.6.38-rc5+/build/vmlinux                     
             2138.00  1.1% finish_task_switch            /lib/modules/2.6.38-rc5+/build/vmlinux                     
             2136.00  1.1% __pthread_mutex_lock_internal /lib64/libpthread-2.12.so                                  
             2061.00  1.1% dns_name_fullcompare          /usr/lib64/libdns.so.69.0.1                                
             1961.00  1.0% dns_rbtnodechain_init         /usr/lib64/libdns.so.69.0.1                                
             1797.00  0.9% dns_name_getlabelsequence     /usr/lib64/libdns.so.69.0.1                                
             1743.00  0.9% fget_light                    /lib/modules/2.6.38-rc5+/build/vmlinux                     
             1723.00  0.9% dns_name_equal                /usr/lib64/libdns.so.69.0.1                                
             1673.00  0.9% __pthread_mutex_unlock        /lib64/libpthread-2.12.so                                  
             1671.00  0.9% dns_acl_match                 /usr/lib64/libdns.so.69.0.1                                
             1488.00  0.8% dns_zone_attach               /usr/lib64/libdns.so.69.0.1                                
             1428.00  0.7% be_xmit                       /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
             1369.00  0.7% dns_message_rendersection     /usr/lib64/libdns.so.69.0.1                                
             1278.00  0.7% isc___mempool_get             /usr/lib64/libisc.so.62.0.1                                
             1251.00  0.6% copy_user_generic_string      /lib/modules/2.6.38-rc5+/build/vmlinux                     
             1193.00  0.6% dns_name_fromwire             /usr/lib64/libdns.so.69.0.1                                
             1182.00  0.6% isc_radix_search              /usr/lib64/libisc.so.62.0.1                        

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 13:18                                         ` Herbert Xu
@ 2011-03-01 13:52                                           ` Eric Dumazet
  2011-03-01 13:58                                             ` Herbert Xu
  2011-03-01 16:31                                           ` Eric Dumazet
  1 sibling, 1 reply; 91+ messages in thread
From: Eric Dumazet @ 2011-03-01 13:52 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tuesday, 01 March 2011 at 21:18 +0800, Herbert Xu wrote:

> Interesting.  So I wonder which lock is showing up at the top
> of the profile with a single socket then.  As it's definitely
> going away with multiple sockets, that means it's not the TX
> queue lock.
> 

This CPU also runs the named process, so this is the socket lock and the
receive queue lock.

The named threads all do recvmsg()/sendmsg() in a loop, so all of them
are waiting for a frame before doing any work.

Because of the single receive queue, extra context switches occur (all
threads but one have to go back to sleep for each query).

For about 80 kqps (standard linux-2.6 kernel, no patches), I get the
following vmstat output:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 4  1      0 2184048  63496 1595056    0    0     0  2060 64592 294528 19 11 67  4
 6  1      0 2184040  63496 1595056    0    0     0  1960 64686 293928 19 11 66  4
 3  1      0 2184040  63496 1595056    0    0     0  2344 64556 294268 20 11 66  4
 4  1      0 2184040  63496 1595056    0    0     0  2400 64626 293859 19 11 67  4




^ permalink raw reply	[flat|nested] 91+ messages in thread
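
The receive-side pattern being described is roughly the following (a sketch,
not named's actual code; the port and thread count are arbitrary): with
ONE_SOCKET set, all worker threads block in recvfrom() on a single shared UDP
socket and contend on its lock and receive queue.  With ONE_SOCKET unset,
each worker binds its own SO_REUSEPORT socket on the same port, which only
spreads the load across the sockets on a kernel carrying the SO_REUSEPORT
patch discussed in this thread.

#include <netinet/in.h>
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_REUSEPORT
#define SO_REUSEPORT 15		/* value from asm-generic/socket.h */
#endif

#define ONE_SOCKET 1		/* 0: one SO_REUSEPORT socket per worker */
#define WORKERS    8

static int make_socket(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(5353),
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};
	int one = 1;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
	bind(fd, (struct sockaddr *)&addr, sizeof(addr));
	return fd;
}

static void *worker(void *arg)
{
	int fd = ONE_SOCKET ? *(int *)arg : make_socket();
	struct sockaddr_in peer;
	socklen_t plen;
	char buf[512];
	ssize_t n;

	for (;;) {
		plen = sizeof(peer);
		n = recvfrom(fd, buf, sizeof(buf), 0,
			     (struct sockaddr *)&peer, &plen);
		if (n < 0)
			break;
		/* ... parse the query and build a reply here ... */
		sendto(fd, buf, n, 0, (struct sockaddr *)&peer, plen);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[WORKERS];
	int shared = ONE_SOCKET ? make_socket() : -1;
	int i;

	for (i = 0; i < WORKERS; i++)
		pthread_create(&tid[i], NULL, worker, &shared);
	for (i = 0; i < WORKERS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}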

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 13:52                                           ` Eric Dumazet
@ 2011-03-01 13:58                                             ` Herbert Xu
  0 siblings, 0 replies; 91+ messages in thread
From: Herbert Xu @ 2011-03-01 13:58 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tue, Mar 01, 2011 at 02:52:46PM +0100, Eric Dumazet wrote:
> On Tuesday, 01 March 2011 at 21:18 +0800, Herbert Xu wrote:
> 
> > Interesting.  So I wonder which lock is showing up at the top
> > of the profile with a single socket then.  As it's definitely
> > going away with multiple sockets, that means it's not the TX
> > queue lock.
> > 
> 
> This CPU also runs the named process, so this is the socket lock and the
> receive queue lock.

It can't be the socket lock, because the lock showing up in the profile is
an IRQ-disabling variant.  The receive queue lock, I'll buy that.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 13:50                                         ` Thomas Graf
@ 2011-03-01 14:06                                           ` Eric Dumazet
  2011-03-01 14:22                                             ` Thomas Graf
  0 siblings, 1 reply; 91+ messages in thread
From: Eric Dumazet @ 2011-03-01 14:06 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tuesday, 01 March 2011 at 08:50 -0500, Thomas Graf wrote:
> On Tue, Mar 01, 2011 at 08:19:51PM +0800, Herbert Xu wrote:
> > On Tue, Mar 01, 2011 at 07:18:29AM -0500, Thomas Graf wrote:
> > >
> > > 
> > > ... makes it use CPU 5 for rxq2 and the qps goes up from 250kqps to 270kqps
> > 
> > I think the increase here comes from the larger number of packets
> > in flight more than anything.
> > 
> > The bottleneck is still the TX queue (both software and hardware).
> 
> Disabled netfilter and reran test
> 
> Now does ~316kqps (rx was split over 2 queues)

It would be nice to CPU-affine named so that it does _not_ run on CPU 11,
just to specialize that CPU for TX completions, and to get the softirq time
percentage and "perf top -C 11" results.

Thanks



^ permalink raw reply	[flat|nested] 91+ messages in thread
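
For reference, a sketch of one way to pin named off CPU 11 from userspace
(roughly what `taskset -pc 0-10 <pid>` does from the shell; the 12-CPU box
and the 0-10 mask are assumptions based on the CPU numbers in this thread):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	cpu_set_t mask;
	pid_t pid = argc > 1 ? (pid_t)atoi(argv[1]) : 0;	/* 0 = calling process */
	int cpu;

	CPU_ZERO(&mask);
	for (cpu = 0; cpu <= 10; cpu++)		/* everything except CPU 11 */
		CPU_SET(cpu, &mask);

	if (sched_setaffinity(pid, sizeof(mask), &mask) != 0) {
		perror("sched_setaffinity");
		return 1;
	}
	return 0;
}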

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 14:06                                           ` Eric Dumazet
@ 2011-03-01 14:22                                             ` Thomas Graf
  2011-03-01 14:30                                               ` Thomas Graf
  0 siblings, 1 reply; 91+ messages in thread
From: Thomas Graf @ 2011-03-01 14:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tue, Mar 01, 2011 at 03:06:59PM +0100, Eric Dumazet wrote:
> It would be nice to CPU-affine named so that it does _not_ run on CPU 11,
> just to specialize that CPU for TX completions, and to get the softirq time
> percentage and "perf top -C 11" results.

----------------------------------------------------------------------------------------------------------------------
   PerfTop:     995 irqs/sec  kernel:97.7%  exact:  0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 11)
----------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ___________________________________________________________

              335.00 23.3% intel_idle                  /lib/modules/2.6.38-rc5+/build/vmlinux                     
              253.00 17.6% be_tx_compl_process         /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
              177.00 12.3% skb_release_data            /lib/modules/2.6.38-rc5+/build/vmlinux                     
              132.00  9.2% kfree                       /lib/modules/2.6.38-rc5+/build/vmlinux                     
              127.00  8.8% kfree_skb                   /lib/modules/2.6.38-rc5+/build/vmlinux                     
              105.00  7.3% be_poll_tx_mcc              /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
               99.00  6.9% kmem_cache_free             /lib/modules/2.6.38-rc5+/build/vmlinux                     
               36.00  2.5% __do_softirq                /lib/modules/2.6.38-rc5+/build/vmlinux                     
               20.00  1.4% _raw_spin_unlock_irqrestore /lib/modules/2.6.38-rc5+/build/vmlinux                     
               19.00  1.3% skb_release_head_state      /lib/modules/2.6.38-rc5+/build/vmlinux                     
               13.00  0.9% unmap_tx_frag               /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
               11.00  0.8% rb_next                     /usr/bin/perf                                              
               10.00  0.7% dso__find_symbol            /usr/bin/perf                                              
                9.00  0.6% is_swiotlb_buffer           /lib/modules/2.6.38-rc5+/build/vmlinux                     
                9.00  0.6% __strcmp_sse42              /lib64/libc-2.12.so                                        
                8.00  0.6% __kfree_skb                 /lib/modules/2.6.38-rc5+/build/vmlinux                     
                8.00  0.6% __strstr_sse42              /lib64/libc-2.12.so                                        
                6.00  0.4% _int_malloc                 /lib64/libc-2.12.so      

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 14:22                                             ` Thomas Graf
@ 2011-03-01 14:30                                               ` Thomas Graf
  2011-03-01 14:52                                                 ` Eric Dumazet
  0 siblings, 1 reply; 91+ messages in thread
From: Thomas Graf @ 2011-03-01 14:30 UTC (permalink / raw)
  To: Eric Dumazet, Herbert Xu, David Miller, rick.jones2, therbert,
	wsommerfeld

On Tue, Mar 01, 2011 at 09:22:35AM -0500, Thomas Graf wrote:
> On Tue, Mar 01, 2011 at 03:06:59PM +0100, Eric Dumazet wrote:
> > It would be nice to CPU-affine named so that it does _not_ run on CPU 11,
> > just to specialize that CPU for TX completions, and to get the softirq time
> > percentage and "perf top -C 11" results.

CPU 1 isolated as well (named running with mask 0,2-10)

----------------------------------------------------------------------------------------------------------------------
   PerfTop:     580 irqs/sec  kernel:100.0%  exact:  0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 1)
----------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ___________________________________________________________

              283.00  9.2% get_rx_page_info            /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
              256.00  8.4% _raw_spin_unlock_irqrestore /lib/modules/2.6.38-rc5+/build/vmlinux                     
              190.00  6.2% be_poll_rx                  /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
              182.00  5.9% get_page_from_freelist      /lib/modules/2.6.38-rc5+/build/vmlinux                     
              157.00  5.1% intel_idle                  /lib/modules/2.6.38-rc5+/build/vmlinux                     
              143.00  4.7% __do_softirq                /lib/modules/2.6.38-rc5+/build/vmlinux                     
              133.00  4.3% sock_queue_rcv_skb          /lib/modules/2.6.38-rc5+/build/vmlinux                     
              133.00  4.3% __udp4_lib_lookup           /lib/modules/2.6.38-rc5+/build/vmlinux                     
              131.00  4.3% sk_run_filter               /lib/modules/2.6.38-rc5+/build/vmlinux                     
              114.00  3.7% getnstimeofday              /lib/modules/2.6.38-rc5+/build/vmlinux                     
              112.00  3.7% __alloc_skb                 /lib/modules/2.6.38-rc5+/build/vmlinux                     
              103.00  3.4% read_tsc                    /lib/modules/2.6.38-rc5+/build/vmlinux                     
              100.00  3.3% __netif_receive_skb         /lib/modules/2.6.38-rc5+/build/vmlinux                     
               95.00  3.1% udp_queue_rcv_skb           /lib/modules/2.6.38-rc5+/build/vmlinux                     
               82.00  2.7% sock_def_readable           /lib/modules/2.6.38-rc5+/build/vmlinux                     
               79.00  2.6% kmem_cache_alloc_node_trace /lib/modules/2.6.38-rc5+/build/vmlinux                     
               72.00  2.3% _raw_spin_lock              /lib/modules/2.6.38-rc5+/build/vmlinux                     
               67.00  2.2% __phys_addr                 /lib/modules/2.6.38-rc5+/build/vmlinux                     
               63.00  2.1% is_swiotlb_buffer           /lib/modules/2.6.38-rc5+/build/vmlinux                     
               51.00  1.7% __udp4_lib_rcv              /lib/modules/2.6.38-rc5+/build/vmlinux                     
               48.00  1.6% memcpy                      /lib/modules/2.6.38-rc5+/build/vmlinux                     
               47.00  1.5% ip_rcv                      /lib/modules/2.6.38-rc5+/build/vmlinux                     
               46.00  1.5% kmem_cache_alloc_node       /lib/modules/2.6.38-rc5+/build/vmlinux                     
               44.00  1.4% dma_issue_pending_all       /lib/modules/2.6.38-rc5+/build/vmlinux                     
               40.00  1.3% ip_route_input_common       /lib/modules/2.6.38-rc5+/build/vmlinux                     
               36.00  1.2% __alloc_pages_nodemask      /lib/modules/2.6.38-rc5+/build/vmlinux                     
               33.00  1.1% be_post_rx_frags            /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
               24.00  0.8% alloc_pages_current         /lib/modules/2.6.38-rc5+/build/vmlinux                     
               21.00  0.7% packet_rcv                  /lib/modules/2.6.38-rc5+/build/vmlinux                     
               20.00  0.7% local_bh_enable             /lib/modules/2.6.38-rc5+/build/vmlinux                     
               17.00  0.6% consume_skb                 /lib/modules/2.6.38-rc5+/build/vmlinux                     
               16.00  0.5% next_zones_zonelist         /lib/modules/2.6.38-rc5+/build/vmlinux                     
               14.00  0.5% selinux_socket_sock_rcv_skb /lib/modules/2.6.38-rc5+/build/vmlinux                     
               13.00  0.4% ip_local_deliver            /lib/modules/2.6.38-rc5+/build/vmlinux                     
               11.00  0.4% sk_filter                   /lib/modules/2.6.38-rc5+/build/vmlinux                     
               10.00  0.3% get_rps_cpu                 /lib/modules/2.6.38-rc5+/build/vmlinux                     
                9.00  0.3% native_read_tsc             /lib/modules/2.6.38-rc5+/build/vmlinux                     
                8.00  0.3% local_bh_disable            /lib/modules/2.6.38-rc5+/build/vmlinux                     
                8.00  0.3% eth_type_trans              /lib/modules/2.6.38-rc5+/build/vmlinux                     
                8.00  0.3% napi_complete               /lib/modules/2.6.38-rc5+/build/vmlinux                     
                7.00  0.2% netif_receive_skb           /lib/modules/2.6.38-rc5+/build/vmlinux                     
                7.00  0.2% dso__find_symbol            /usr/bin/perf                                              
                7.00  0.2% __kmalloc_node_track_caller /lib/modules/2.6.38-rc5+/build/vmlinux             

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 14:30                                               ` Thomas Graf
@ 2011-03-01 14:52                                                 ` Eric Dumazet
  2011-03-01 15:07                                                   ` Thomas Graf
  0 siblings, 1 reply; 91+ messages in thread
From: Eric Dumazet @ 2011-03-01 14:52 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tuesday, 01 March 2011 at 09:30 -0500, Thomas Graf wrote:
> On Tue, Mar 01, 2011 at 09:22:35AM -0500, Thomas Graf wrote:
> > On Tue, Mar 01, 2011 at 03:06:59PM +0100, Eric Dumazet wrote:
> > > It would be nice to CPU-affine named so that it does _not_ run on CPU 11,
> > > just to specialize that CPU for TX completions, and to get the softirq time
> > > percentage and "perf top -C 11" results.
> 
> CPU 1 isolated as well (named running with mask 0,2-10)
> 
> ----------------------------------------------------------------------------------------------------------------------
>    PerfTop:     580 irqs/sec  kernel:100.0%  exact:  0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 1)
> ----------------------------------------------------------------------------------------------------------------------
> 
>              samples  pcnt function                    DSO
>              _______ _____ ___________________________ ___________________________________________________________
> 
>               283.00  9.2% get_rx_page_info            /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
>               256.00  8.4% _raw_spin_unlock_irqrestore /lib/modules/2.6.38-rc5+/build/vmlinux                     
>               190.00  6.2% be_poll_rx                  /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
>               182.00  5.9% get_page_from_freelist      /lib/modules/2.6.38-rc5+/build/vmlinux                     
>               157.00  5.1% intel_idle                  /lib/modules/2.6.38-rc5+/build/vmlinux                     
>               143.00  4.7% __do_softirq                /lib/modules/2.6.38-rc5+/build/vmlinux                     
>               133.00  4.3% sock_queue_rcv_skb          /lib/modules/2.6.38-rc5+/build/vmlinux                     
>               133.00  4.3% __udp4_lib_lookup           /lib/modules/2.6.38-rc5+/build/vmlinux                     
>               131.00  4.3% sk_run_filter               /lib/modules/2.6.38-rc5+/build/vmlinux   

sk_run_filter? Do you have a packet filter running?



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 14:52                                                 ` Eric Dumazet
@ 2011-03-01 15:07                                                   ` Thomas Graf
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Graf @ 2011-03-01 15:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tue, Mar 01, 2011 at 03:52:40PM +0100, Eric Dumazet wrote:
> On Tuesday, 01 March 2011 at 09:30 -0500, Thomas Graf wrote:
> > On Tue, Mar 01, 2011 at 09:22:35AM -0500, Thomas Graf wrote:
> > > On Tue, Mar 01, 2011 at 03:06:59PM +0100, Eric Dumazet wrote:
> > > > It would be nice to CPU-affine named so that it does _not_ run on CPU 11,
> > > > just to specialize that CPU for TX completions, and to get the softirq time
> > > > percentage and "perf top -C 11" results.
> > 
> > CPU 1 isolated as well (named running with mask 0,2-10)
> > 
> > ----------------------------------------------------------------------------------------------------------------------
> >    PerfTop:     580 irqs/sec  kernel:100.0%  exact:  0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 1)
> > ----------------------------------------------------------------------------------------------------------------------
> > 
> >              samples  pcnt function                    DSO
> >              _______ _____ ___________________________ ___________________________________________________________
> > 
> >               283.00  9.2% get_rx_page_info            /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
> >               256.00  8.4% _raw_spin_unlock_irqrestore /lib/modules/2.6.38-rc5+/build/vmlinux                     
> >               190.00  6.2% be_poll_rx                  /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
> >               182.00  5.9% get_page_from_freelist      /lib/modules/2.6.38-rc5+/build/vmlinux                     
> >               157.00  5.1% intel_idle                  /lib/modules/2.6.38-rc5+/build/vmlinux                     
> >               143.00  4.7% __do_softirq                /lib/modules/2.6.38-rc5+/build/vmlinux                     
> >               133.00  4.3% sock_queue_rcv_skb          /lib/modules/2.6.38-rc5+/build/vmlinux                     
> >               133.00  4.3% __udp4_lib_lookup           /lib/modules/2.6.38-rc5+/build/vmlinux                     
> >               131.00  4.3% sk_run_filter               /lib/modules/2.6.38-rc5+/build/vmlinux   
> 
> sk_run_filter? Do you have a packet filter running?

dhclient was running. With dhclient killed:

----------------------------------------------------------------------------------------------------------------------
   PerfTop:     726 irqs/sec  kernel:99.9%  exact:  0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 1)
----------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ___________________________________________________________

              472.00 10.6% get_rx_page_info            /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
              419.00  9.4% _raw_spin_unlock_irqrestore /lib/modules/2.6.38-rc5+/build/vmlinux                     
              280.00  6.3% be_poll_rx                  /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
              259.00  5.8% get_page_from_freelist      /lib/modules/2.6.38-rc5+/build/vmlinux                     
              248.00  5.6% __do_softirq                /lib/modules/2.6.38-rc5+/build/vmlinux                     
              238.00  5.4% intel_idle                  /lib/modules/2.6.38-rc5+/build/vmlinux                     
              204.00  4.6% sock_queue_rcv_skb          /lib/modules/2.6.38-rc5+/build/vmlinux                     
              189.00  4.3% __udp4_lib_lookup           /lib/modules/2.6.38-rc5+/build/vmlinux                     
              178.00  4.0% getnstimeofday              /lib/modules/2.6.38-rc5+/build/vmlinux                     
              169.00  3.8% __alloc_skb                 /lib/modules/2.6.38-rc5+/build/vmlinux                     
              144.00  3.2% read_tsc                    /lib/modules/2.6.38-rc5+/build/vmlinux                     
              143.00  3.2% sock_def_readable           /lib/modules/2.6.38-rc5+/build/vmlinux                     
              138.00  3.1% udp_queue_rcv_skb           /lib/modules/2.6.38-rc5+/build/vmlinux                     
              115.00  2.6% kmem_cache_alloc_node_trace /lib/modules/2.6.38-rc5+/build/vmlinux                     
              114.00  2.6% __netif_receive_skb         /lib/modules/2.6.38-rc5+/build/vmlinux                     
              109.00  2.5% _raw_spin_lock              /lib/modules/2.6.38-rc5+/build/vmlinux                     
              100.00  2.3% is_swiotlb_buffer           /lib/modules/2.6.38-rc5+/build/vmlinux                     
               90.00  2.0% __phys_addr                 /lib/modules/2.6.38-rc5+/build/vmlinux                     
               82.00  1.8% __udp4_lib_rcv              /lib/modules/2.6.38-rc5+/build/vmlinux                     
               80.00  1.8% kmem_cache_alloc_node       /lib/modules/2.6.38-rc5+/build/vmlinux                     
               73.00  1.6% ip_route_input_common       /lib/modules/2.6.38-rc5+/build/vmlinux                     
               60.00  1.4% memcpy                      /lib/modules/2.6.38-rc5+/build/vmlinux                     
               59.00  1.3% dma_issue_pending_all       /lib/modules/2.6.38-rc5+/build/vmlinux                     
               58.00  1.3% ip_rcv                      /lib/modules/2.6.38-rc5+/build/vmlinux                     
               57.00  1.3% be_post_rx_frags            /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
               49.00  1.1% __alloc_pages_nodemask      /lib/modules/2.6.38-rc5+/build/vmlinux                     
               45.00  1.0% alloc_pages_current         /lib/modules/2.6.38-rc5+/build/vmlinux                     
               27.00  0.6% get_rps_cpu                 /lib/modules/2.6.38-rc5+/build/vmlinux                     
               23.00  0.5% napi_complete               /lib/modules/2.6.38-rc5+/build/vmlinux                     
               22.00  0.5% ip_local_deliver            /lib/modules/2.6.38-rc5+/build/vmlinux                     
               18.00  0.4% selinux_socket_sock_rcv_skb /lib/modules/2.6.38-rc5+/build/vmlinux                     
               17.00  0.4% native_read_tsc             /lib/modules/2.6.38-rc5+/build/vmlinux                     
               16.00  0.4% local_bh_enable             /lib/modules/2.6.38-rc5+/build/vmlinux                     
               16.00  0.4% next_zones_zonelist         /lib/modules/2.6.38-rc5+/build/vmlinux                     
               14.00  0.3% sk_filter                   /lib/modules/2.6.38-rc5+/build/vmlinux                     
               13.00  0.3% eth_type_trans              /lib/modules/2.6.38-rc5+/build/vmlinux                     
               10.00  0.2% __kmalloc_node_track_caller /lib/modules/2.6.38-rc5+/build/vmlinux                     
               10.00  0.2% _raw_spin_lock_irqsave      /lib/modules/2.6.38-rc5+/build/vmlinux                     
                9.00  0.2% raw_local_deliver           /lib/modules/2.6.38-rc5+/build/vmlinux                     
                8.00  0.2% __udp_queue_rcv_skb         /lib/modules/2.6.38-rc5+/build/vmlinux                     
                8.00  0.2% netif_receive_skb           /lib/modules/2.6.38-rc5+/build/vmlinux                     
                8.00  0.2% ip_queue_rcv_skb            /lib/modules/2.6.38-rc5+/build/vmlinux                     
                7.00  0.2% net_rx_action               /lib/modules/2.6.38-rc5+/build/vmlinux                     
                6.00  0.1% swiotlb_map_page            /lib/modules/2.6.38-rc5+/build/vmlinux                     
                6.00  0.1% __sk_mem_schedule           /lib/modules/2.6.38-rc5+/build/vmlinux                     
                6.00  0.1% dso__find_symbol            /usr/bin/perf                                              
                6.00  0.1% __netdev_alloc_skb          /lib/modules/2.6.38-rc5+/build/vmlinux              

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 13:18                                         ` Herbert Xu
  2011-03-01 13:52                                           ` Eric Dumazet
@ 2011-03-01 16:31                                           ` Eric Dumazet
  2011-03-02  0:23                                             ` Herbert Xu
  1 sibling, 1 reply; 91+ messages in thread
From: Eric Dumazet @ 2011-03-01 16:31 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

Le mardi 01 mars 2011 à 21:18 +0800, Herbert Xu a écrit :
> On Tue, Mar 01, 2011 at 02:03:29PM +0100, Eric Dumazet wrote:
> >
> > I believe its now done properly (in net-next-2.6) with commit
> > 4f57c087de9b46182 (net: implement mechanism for HW based QOS)
> 
> Nope, that has nothing to do with this.

Right, I was thinking of commit 1d24eb4815d1e0e8 (xps: Transmit Packet
Steering)

Now you say all this stuff should be replaced by "use this cpu number
only", just because you have a multi-threaded process sending UDP frames
through one socket...

This won't work for TCP streams; you could imagine a multi-threaded
application using a shared TCP socket as well. Too many OOO packets.




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 12:35                 ` Herbert Xu
                                     ` (4 preceding siblings ...)
  2011-03-01 12:36                   ` [PATCH 5/5] udp: Add lockless transmit path Herbert Xu
@ 2011-03-01 16:43                   ` Eric Dumazet
  2011-03-01 20:36                     ` David Miller
  5 siblings, 1 reply; 91+ messages in thread
From: Eric Dumazet @ 2011-03-01 16:43 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

Le mardi 01 mars 2011 à 20:35 +0800, Herbert Xu a écrit :
> On Mon, Feb 28, 2011 at 07:36:59PM +0800, Herbert Xu wrote:
> > Here are the patches I used.  Please don't use them yet as I intend
> > to clean them up quite a bit.
> 
> OK here is the version ready for merging (please retest them
> though as I have changed things substantially).
> 
> The main change is that the legacy UDP code path is now gone
> so we use the same UDP header generation whether corking is on
> or off.
> 
> I will add IPv6 support in a later patch set.
> 
> Thanks!

For the whole patchset :

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Tests were fine on my dev machine.

Thanks



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 16:43                   ` SO_REUSEPORT - can it be done in kernel? Eric Dumazet
@ 2011-03-01 20:36                     ` David Miller
  0 siblings, 0 replies; 91+ messages in thread
From: David Miller @ 2011-03-01 20:36 UTC (permalink / raw)
  To: eric.dumazet
  Cc: herbert, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 01 Mar 2011 17:43:07 +0100

> Le mardi 01 mars 2011 à 20:35 +0800, Herbert Xu a écrit :
>> On Mon, Feb 28, 2011 at 07:36:59PM +0800, Herbert Xu wrote:
>> > Here are the patches I used.  Please don't use them yet as I intend
>> > to clean them up quite a bit.
>> 
>> OK here is the version ready for merging (please retest them
>> though as I have changed things substantially).
>> 
>> The main change is that the legacy UDP code path is now gone
>> so we use the same UDP header generation whether corking is on
>> or off.
>> 
>> I will add IPv6 support in a later patch set.
>> 
>> Thanks!
> 
> For the whole patchset :
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, great work everyone!

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-01 16:31                                           ` Eric Dumazet
@ 2011-03-02  0:23                                             ` Herbert Xu
  2011-03-02  2:00                                               ` Eric Dumazet
  0 siblings, 1 reply; 91+ messages in thread
From: Herbert Xu @ 2011-03-02  0:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Tue, Mar 01, 2011 at 05:31:24PM +0100, Eric Dumazet wrote:
>
> > This won't work for TCP streams; you could imagine a multi-threaded
> > application using a shared TCP socket as well. Too many OOO packets.

Think about it, a TCP socket cannot be used by a multi-threaded app
in a scalable way.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-02  0:23                                             ` Herbert Xu
@ 2011-03-02  2:00                                               ` Eric Dumazet
  2011-03-02  2:39                                                 ` Herbert Xu
  0 siblings, 1 reply; 91+ messages in thread
From: Eric Dumazet @ 2011-03-02  2:00 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

Le mercredi 02 mars 2011 à 08:23 +0800, Herbert Xu a écrit :
> On Tue, Mar 01, 2011 at 05:31:24PM +0100, Eric Dumazet wrote:
> >
> > This won't work for TCP streams; you could imagine a multi-threaded
> > application using a shared TCP socket as well. Too many OOO packets.
> 
> Think about it, a TCP socket cannot be used by a multi-threaded app
> in a scalable way.

Well...

If you think about it, SO_REUSEPORT patch has exactly the same goal : 

Let each thread use a different socket, to scale without kernel limits.

We can't modify TX selection each time we want to "fix" a problem without
changing the user side (not adding an API), and as a side effect make
non-optimal applications miserable.

We added RPS and XPS, which work correctly if each socket is used by one
thread. Maybe we need to add a user API, or automatically detect that a
particular DGRAM socket is used by many different threads, to:

0) Decide OOO is OK for this workload (many threads issuing send() at
the same time)

1) Set up several receive queues (up to num_possible_cpus())

2) Use an appropriate TX queue selection 




^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-02  2:00                                               ` Eric Dumazet
@ 2011-03-02  2:39                                                 ` Herbert Xu
  2011-03-02  2:56                                                   ` Eric Dumazet
  2011-03-02  7:12                                                   ` Tom Herbert
  0 siblings, 2 replies; 91+ messages in thread
From: Herbert Xu @ 2011-03-02  2:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Wed, Mar 02, 2011 at 03:00:03AM +0100, Eric Dumazet wrote:
> > 
> > Think about it, a TCP socket cannot be used by a multi-threaded app
> > in a scalable way.
> 
> Well...
> 
> If you think about it, SO_REUSEPORT patch has exactly the same goal : 

UDP is a datagram protocol, TCP is not.

Anyway, here is an alternate proposal.  When a TCP socket transmits
for the first time (SYN or SYN-ACK), we pick a queue based on CPU and
store it in the socket.  From then on we stick to that selection.

We would only allow changes if we can ensure that all transmitted
packets have left the queue.  Or we just never change it like we
do now.

For datagram protocols we simply use the current CPU.
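
Something along these lines, perhaps -- a completely untested sketch
that hand-waves the locking and the device-change cases; pick_tx_queue
is a made-up name, sk_tx_queue_get/set are the existing cached-queue
helpers, and the caller is assumed to run with BH disabled (as in
dev_queue_xmit) so smp_processor_id() is stable:

/*
 * Untested sketch of the proposal above: cache the queue picked on the
 * first transmit in the socket and keep reusing it; datagram sockets
 * just follow whatever CPU they happen to be sending from.
 */
static u16 pick_tx_queue(struct net_device *dev, struct sock *sk)
{
	int queue;

	if (sk && sk->sk_type != SOCK_DGRAM) {
		queue = sk_tx_queue_get(sk);	/* cached pick, -1 if none yet */
		if (queue >= 0 && queue < dev->real_num_tx_queues)
			return queue;

		/* first transmit (SYN or SYN-ACK): derive from this CPU */
		queue = smp_processor_id() % dev->real_num_tx_queues;
		sk_tx_queue_set(sk, queue);	/* stick to that selection */
		return queue;
	}

	/* datagram protocols: simply use the current CPU */
	return smp_processor_id() % dev->real_num_tx_queues;
}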

> We added RPS and XPS, which work correctly if each socket is used by one
> thread. Maybe we need to add a user API, or automatically detect that a
> particular DGRAM socket is used by many different threads, to:

No we don't need that for datagram protocols at all.  By definition
there is no ordering guarantee across CPUs for datagram sockets.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-02  2:39                                                 ` Herbert Xu
@ 2011-03-02  2:56                                                   ` Eric Dumazet
  2011-03-02  3:09                                                     ` Herbert Xu
  2011-03-02  7:12                                                   ` Tom Herbert
  1 sibling, 1 reply; 91+ messages in thread
From: Eric Dumazet @ 2011-03-02  2:56 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

Le mercredi 02 mars 2011 à 10:39 +0800, Herbert Xu a écrit :

> UDP is a datagram protocol, TCP is not.
> 
> Anyway, here is an alternate proposal.  When a TCP socket transmits
> for the first time (SYN or SYN-ACK), we pick a queue based on CPU and
> store it in the socket.  From then on we stick to that selection.
> 

Many TCP apps I know use one thread to perform listen/accept and a pool
of threads to handle each new conn.

Anyway, the SYN-ACK is generated by softirq, not really a user choice.
The CPU depends on whether the NIC is RX multiqueue or RPS is set up.

All this discussion is about letting the process scheduler decide the TX
queue (because the user/admin used CPU affinity) or letting the network
stack drive the scheduler: please migrate this thread to this CPU.

Both schemes should be allowed/configurable so that the best results are
available.

> We would only allow changes if we can ensure that all transmitted
> packets have left the queue.  Or we just never change it like we
> do now.
> 

We do change it in case of a dst/route change. Each device can have a
different number of TX queues.



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-02  2:56                                                   ` Eric Dumazet
@ 2011-03-02  3:09                                                     ` Herbert Xu
  2011-03-02  3:44                                                       ` Eric Dumazet
  0 siblings, 1 reply; 91+ messages in thread
From: Herbert Xu @ 2011-03-02  3:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

On Wed, Mar 02, 2011 at 03:56:38AM +0100, Eric Dumazet wrote:
>
> Anyway, the SYN-ACK is generated by softirq, not really a user choice.
> The CPU depends on whether the NIC is RX multiqueue or RPS is set up.

Which is exactly what we want.  The RX queue selection should
determine the TX cpu.

> All this discussion is about letting the process scheduler decide the TX
> queue (because the user/admin used CPU affinity) or letting the network
> stack drive the scheduler: please migrate this thread to this CPU.
> 
> Both schemes should be allowed/configurable so that the best results are
> available.

Whatever scheme we end up with, hashing different sockets running
in the same thread to different queues is just broken.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-02  3:09                                                     ` Herbert Xu
@ 2011-03-02  3:44                                                       ` Eric Dumazet
  0 siblings, 0 replies; 91+ messages in thread
From: Eric Dumazet @ 2011-03-02  3:44 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld,
	daniel.baluta, netdev

Le mercredi 02 mars 2011 à 11:09 +0800, Herbert Xu a écrit :
> On Wed, Mar 02, 2011 at 03:56:38AM +0100, Eric Dumazet wrote:
> >
> > Anyway, the SYN-ACK is generated by softirq, not really a user choice.
> > The CPU depends on whether the NIC is RX multiqueue or RPS is set up.
> 
> Which is exactly what we want.  The RX queue selection should
> determine the TX cpu.
> 

This is working today with RFS/XPS. Or it should, indirectly.

The OOO problem is handled as well.




^ permalink raw reply	[flat|nested] 91+ messages in thread

* inet: Replace left-over references to inet->cork
  2011-03-01 12:36                   ` [PATCH 2/5] inet: Remove explicit write references to sk/inet in ip_append_data Herbert Xu
@ 2011-03-02  6:15                     ` Herbert Xu
  2011-03-02  7:01                       ` David Miller
  0 siblings, 1 reply; 91+ messages in thread
From: Herbert Xu @ 2011-03-02  6:15 UTC (permalink / raw)
  To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev, Thomas Graf

On Tue, Mar 01, 2011 at 08:36:47PM +0800, Herbert Xu wrote:
> inet: Remove explicit write references to sk/inet in ip_append_data

Just found a couple of spots where inet->cork was still used
instead of just cork.

inet: Replace left-over references to inet->cork

The patch to replace inet->cork with cork left out two spots in
__ip_append_data that can result in bogus packet construction.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 460308c..3e8637c 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -789,7 +789,7 @@ static int __ip_append_data(struct sock *sk, struct sk_buff_head *queue,
 	struct inet_sock *inet = inet_sk(sk);
 	struct sk_buff *skb;
 
-	struct ip_options *opt = inet->cork.opt;
+	struct ip_options *opt = cork->opt;
 	int hh_len;
 	int exthdrlen;
 	int mtu;
@@ -803,7 +803,7 @@ static int __ip_append_data(struct sock *sk, struct sk_buff_head *queue,
 	exthdrlen = transhdrlen ? rt->dst.header_len : 0;
 	length += exthdrlen;
 	transhdrlen += exthdrlen;
-	mtu = inet->cork.fragsize;
+	mtu = cork->fragsize;
 
 	hh_len = LL_RESERVED_SPACE(rt->dst.dev);

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: inet: Replace left-over references to inet->cork
  2011-03-02  6:15                     ` inet: Replace left-over references to inet->cork Herbert Xu
@ 2011-03-02  7:01                       ` David Miller
  0 siblings, 0 replies; 91+ messages in thread
From: David Miller @ 2011-03-02  7:01 UTC (permalink / raw)
  To: herbert; +Cc: rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev, tgraf

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 2 Mar 2011 14:15:17 +0800

> On Tue, Mar 01, 2011 at 08:36:47PM +0800, Herbert Xu wrote:
>> inet: Remove explicit write references to sk/inet in ip_append_data
> 
> Just found a couple of spots where inet->cork was still used
> instead of just cork.
> 
> inet: Replace left-over references to inet->cork

Applied, thanks Herbert.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-02  2:39                                                 ` Herbert Xu
  2011-03-02  2:56                                                   ` Eric Dumazet
@ 2011-03-02  7:12                                                   ` Tom Herbert
  2011-03-02  7:31                                                     ` Herbert Xu
  1 sibling, 1 reply; 91+ messages in thread
From: Tom Herbert @ 2011-03-02  7:12 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Eric Dumazet, Thomas Graf, David Miller, rick.jones2,
	wsommerfeld, daniel.baluta, netdev

On Tue, Mar 1, 2011 at 6:39 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Wed, Mar 02, 2011 at 03:00:03AM +0100, Eric Dumazet wrote:
>> >
>> > Think about it, a TCP socket cannot be used by a multi-threaded app
>> > in a scalable way.
>>
>> Well...
>>
>> If you think about it, SO_REUSEPORT patch has exactly the same goal :
>
In a sense.  SO_REUSEPORT for TCP is intended to provide a scalable
listener solution.  Sharing an established socket is not very
efficient; something like a multiplexing socket layer on top of TCP
might be good.
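
(To make the user-side pattern concrete: each worker opens its own
socket, sets SO_REUSEPORT and binds the same address/port.  A minimal
sketch, assuming a kernel with the SO_REUSEPORT patch applied; the
value 15 is the one reserved for it in asm-generic/socket.h:)

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_REUSEPORT
#define SO_REUSEPORT 15		/* value reserved in asm-generic/socket.h */
#endif

/* One of these per worker thread/process, all bound to the same port. */
static int open_worker_socket(uint16_t port)
{
	struct sockaddr_in addr;
	int one = 1;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);  /* SOCK_STREAM for a TCP listener */

	if (fd < 0)
		return -1;
	/* let every worker bind the same port */
	if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0)
		goto err;
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(port);
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		goto err;
	return fd;
err:
	close(fd);
	return -1;
}

For the DNS case that started this thread, each worker would call this
once and then recvfrom()/sendto() on its own descriptor, with the
kernel spreading incoming queries across the sockets.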

> UDP is a datagram protocol, TCP is not.
>
> Anyway, here is an alternate proposal.  When a TCP socket transmits
> for the first time (SYN or SYN-ACK), we pick a queue based on CPU and
> store it in the socket.  From then on we stick to that selection.
>
> We would only allow changes if we can ensure that all transmitted
> packets have left the queue.  Or we just never change it like we
> do now.
>
XPS does all this already.

> For datagram protocols we simply use the current CPU.
>
Probably need to set skb->ooo_okay (for UDP etc.) also so that XPS
will change queues.

Tom

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-02  7:12                                                   ` Tom Herbert
@ 2011-03-02  7:31                                                     ` Herbert Xu
  2011-03-02  8:04                                                       ` Eric Dumazet
  0 siblings, 1 reply; 91+ messages in thread
From: Herbert Xu @ 2011-03-02  7:31 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, Thomas Graf, David Miller, rick.jones2,
	wsommerfeld, daniel.baluta, netdev

On Tue, Mar 01, 2011 at 11:12:29PM -0800, Tom Herbert wrote:
>
> Probably need to set skb->ooo_okay (for UDP etc.) also so that XPS
> will change queues.

Hmm, not quite.  We still want to maintain packet ordering from
the same CPU.  That is, if I do two sendmsg calls from the same
CPU, they should go into the same queue in that order.

So this shouldn't just be a knob that says whether we can pick
queues at random.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-02  7:31                                                     ` Herbert Xu
@ 2011-03-02  8:04                                                       ` Eric Dumazet
  2011-03-02  8:07                                                         ` Herbert Xu
  0 siblings, 1 reply; 91+ messages in thread
From: Eric Dumazet @ 2011-03-02  8:04 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Tom Herbert, Thomas Graf, David Miller, rick.jones2, wsommerfeld,
	daniel.baluta, netdev

Le mercredi 02 mars 2011 à 15:31 +0800, Herbert Xu a écrit :
> On Tue, Mar 01, 2011 at 11:12:29PM -0800, Tom Herbert wrote:
> >
> > Probably need to set skb->ooo_okay (for UDP etc.) also so that XPS
> > will change queues.
> 
> Hmm, not quite.  We still want to maintain packet ordering from
> the same CPU.  That is, if I do two sendmsg calls from the same
> CPU, they should go into the same queue in that order.
> 
> So this shouldn't just be a knob that says whether we can pick
> queues at random.
> 

Not sure why two UDP packets from the same CPU should be sent on the
same queue.

- Some qdiscs do reorder packets anyway.
- Some bonding setups use two links in round-robin mode (link
aggregation)



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-02  8:04                                                       ` Eric Dumazet
@ 2011-03-02  8:07                                                         ` Herbert Xu
  2011-03-02  8:24                                                           ` Eric Dumazet
  0 siblings, 1 reply; 91+ messages in thread
From: Herbert Xu @ 2011-03-02  8:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, Thomas Graf, David Miller, rick.jones2, wsommerfeld,
	daniel.baluta, netdev

On Wed, Mar 02, 2011 at 09:04:08AM +0100, Eric Dumazet wrote:
> 
> Not sure why two UDP packets from the same CPU should be sent on the
> same queue.
> 
> - Some qdiscs do reorder packets anyway.

Which qdisc reorders packets belonging to the same flow?

> - Some bonding setups use two links in round-robin mode (link
> aggregation)

Just because the Internet may reorder things doesn't mean that
we should.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: SO_REUSEPORT - can it be done in kernel?
  2011-03-02  8:07                                                         ` Herbert Xu
@ 2011-03-02  8:24                                                           ` Eric Dumazet
  0 siblings, 0 replies; 91+ messages in thread
From: Eric Dumazet @ 2011-03-02  8:24 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Tom Herbert, Thomas Graf, David Miller, rick.jones2, wsommerfeld,
	daniel.baluta, netdev

Le mercredi 02 mars 2011 à 16:07 +0800, Herbert Xu a écrit :
> On Wed, Mar 02, 2011 at 09:04:08AM +0100, Eric Dumazet wrote:
> > 
> > Not sure why two UDP packets from the same CPU should be sent on the
> > same queue.
> > 
> > - Some qdiscs do reorder packets anyway.
> 
> Which qdisc reorders packets belonging to the same flow?
> 

Hmm, to be fair you did not specify "same flow", and /sbin/named
answers are usually one packet long...

How are we going to detect flows in sendto() calls ?

Just kidding.

If you want to push your patch, I suspect a dynamic per_cpu variable is
needed per TX-multiqueue device, so that "current cpu -> txq number" is
one instruction.
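
Something like this, perhaps -- untested sketch; the cpu_txq field is
hypothetical (u16 __percpu * in struct net_device) and would have to be
refreshed whenever the queue count or the CPU<->queue assignment
changes:

/* setup time, e.g. next to netif_set_real_num_tx_queues() */
static int dev_alloc_cpu_txq(struct net_device *dev)
{
	dev->cpu_txq = alloc_percpu(u16);	/* one slot per possible CPU */
	return dev->cpu_txq ? 0 : -ENOMEM;
}

/* hot path: "current cpu -> txq number" is a single per-cpu load */
static inline u16 dev_cpu_txq(const struct net_device *dev)
{
	return *this_cpu_ptr(dev->cpu_txq);
}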




^ permalink raw reply	[flat|nested] 91+ messages in thread

end of thread, other threads:[~2011-03-02  8:24 UTC | newest]

Thread overview: 91+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-27 10:07 SO_REUSEPORT - can it be done in kernel? Daniel Baluta
2011-01-27 15:55 ` Bill Sommerfeld
2011-01-27 21:32 ` Tom Herbert
2011-02-25 12:56   ` Thomas Graf
2011-02-25 19:18     ` Rick Jones
2011-02-25 19:20       ` David Miller
2011-02-26  0:57         ` Herbert Xu
2011-02-26  2:12           ` David Miller
2011-02-26  2:48             ` Herbert Xu
2011-02-26  3:07               ` David Miller
2011-02-26  3:11                 ` Herbert Xu
2011-02-26  7:31                   ` Eric Dumazet
2011-02-26  7:46                     ` David Miller
2011-02-27 11:02           ` Thomas Graf
2011-02-27 11:06             ` Herbert Xu
2011-02-28  3:45               ` Tom Herbert
2011-02-28  4:26                 ` Herbert Xu
2011-02-28 11:36               ` Herbert Xu
2011-02-28 13:32                 ` Eric Dumazet
2011-02-28 14:13                   ` Herbert Xu
2011-02-28 14:22                     ` Eric Dumazet
2011-02-28 14:25                       ` Herbert Xu
2011-02-28 14:53                   ` Eric Dumazet
2011-02-28 15:01                     ` Thomas Graf
2011-02-28 14:13                 ` Thomas Graf
2011-02-28 16:22                   ` Eric Dumazet
2011-02-28 16:37                     ` Thomas Graf
2011-02-28 17:07                       ` Eric Dumazet
2011-03-01 10:19                         ` Thomas Graf
2011-03-01 10:33                           ` Eric Dumazet
2011-03-01 11:07                             ` Thomas Graf
2011-03-01 11:13                               ` Eric Dumazet
2011-03-01 11:27                                 ` Thomas Graf
2011-03-01 11:45                                   ` Eric Dumazet
2011-03-01 11:53                                     ` Herbert Xu
2011-03-01 12:32                                       ` Herbert Xu
2011-03-01 13:04                                         ` Eric Dumazet
2011-03-01 13:11                                           ` Herbert Xu
2011-03-01 13:03                                       ` Eric Dumazet
2011-03-01 13:18                                         ` Herbert Xu
2011-03-01 13:52                                           ` Eric Dumazet
2011-03-01 13:58                                             ` Herbert Xu
2011-03-01 16:31                                           ` Eric Dumazet
2011-03-02  0:23                                             ` Herbert Xu
2011-03-02  2:00                                               ` Eric Dumazet
2011-03-02  2:39                                                 ` Herbert Xu
2011-03-02  2:56                                                   ` Eric Dumazet
2011-03-02  3:09                                                     ` Herbert Xu
2011-03-02  3:44                                                       ` Eric Dumazet
2011-03-02  7:12                                                   ` Tom Herbert
2011-03-02  7:31                                                     ` Herbert Xu
2011-03-02  8:04                                                       ` Eric Dumazet
2011-03-02  8:07                                                         ` Herbert Xu
2011-03-02  8:24                                                           ` Eric Dumazet
2011-03-01 12:01                                     ` Thomas Graf
2011-03-01 12:15                                       ` Herbert Xu
2011-03-01 13:27                                       ` Herbert Xu
2011-03-01 12:18                                     ` Thomas Graf
2011-03-01 12:19                                       ` Herbert Xu
2011-03-01 13:50                                         ` Thomas Graf
2011-03-01 14:06                                           ` Eric Dumazet
2011-03-01 14:22                                             ` Thomas Graf
2011-03-01 14:30                                               ` Thomas Graf
2011-03-01 14:52                                                 ` Eric Dumazet
2011-03-01 15:07                                                   ` Thomas Graf
2011-03-01  5:33                 ` Eric Dumazet
2011-03-01 12:35                 ` Herbert Xu
2011-03-01 12:36                   ` [PATCH 1/5] inet: Remove unused sk_sndmsg_* from UFO Herbert Xu
2011-03-01 12:36                   ` [PATCH 3/5] inet: Add ip_make_skb and ip_finish_skb Herbert Xu
2011-03-01 12:36                   ` [PATCH 2/5] inet: Remove explicit write references to sk/inet in ip_append_data Herbert Xu
2011-03-02  6:15                     ` inet: Replace left-over references to inet->cork Herbert Xu
2011-03-02  7:01                       ` David Miller
2011-03-01 12:36                   ` [PATCH 4/5] udp: Switch to ip_finish_skb Herbert Xu
2011-03-01 12:36                   ` [PATCH 5/5] udp: Add lockless transmit path Herbert Xu
2011-03-01 16:43                   ` SO_REUSEPORT - can it be done in kernel? Eric Dumazet
2011-03-01 20:36                     ` David Miller
2011-02-28 11:41               ` [PATCH 2/5] net: Remove explicit write references to sk/inet in ip_append_data Herbert Xu
2011-03-01  5:31                 ` Eric Dumazet
2011-02-28 11:41               ` [PATCH 1/5] net: Remove unused sk_sndmsg_* from UFO Herbert Xu
2011-03-01  5:31                 ` Eric Dumazet
2011-02-28 11:41               ` [PATCH 3/5] inet: Add ip_make_skb and ip_send_skb Herbert Xu
2011-03-01  5:31                 ` Eric Dumazet
2011-02-28 11:41               ` [PATCH 4/5] udp: Add lockless transmit path Herbert Xu
2011-02-28 11:41                 ` Herbert Xu
2011-03-01  5:30                 ` Eric Dumazet
2011-02-25 19:21       ` SO_REUSEPORT - can it be done in kernel? Eric Dumazet
2011-02-25 22:48       ` Thomas Graf
2011-02-25 23:15         ` Rick Jones
2011-02-25 19:51     ` Tom Herbert
2011-02-25 22:58       ` Thomas Graf
2011-02-25 23:33       ` Bill Sommerfeld
