* SO_REUSEPORT - can it be done in kernel? @ 2011-01-27 10:07 Daniel Baluta 2011-01-27 15:55 ` Bill Sommerfeld 2011-01-27 21:32 ` Tom Herbert 0 siblings, 2 replies; 91+ messages in thread From: Daniel Baluta @ 2011-01-27 10:07 UTC (permalink / raw) To: therbert; +Cc: netdev Hi Tom, How did you solve the issue regarding scaling TCP listeners? I think SO_REUSEPORT proposed by patch [1] can be a good start. Were there any follow-ups? Also, solving the problem in user space can be an option. I want to run multiple instances of a DNS server on a multi-core system. Any suggestions would be welcome. The SO_REUSEPORT option seems to be already there [2]. Were there any plans to have a kernel implementation? thanks, Daniel. [1] http://amailbox.org/mailarchive/linux-netdev/2010/4/19/6274993 [2] http://lxr.linux.no/linux+v2.6.37/include/asm-generic/socket.h#L25 ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-01-27 10:07 SO_REUSEPORT - can it be done in kernel? Daniel Baluta @ 2011-01-27 15:55 ` Bill Sommerfeld 2011-01-27 21:32 ` Tom Herbert 1 sibling, 0 replies; 91+ messages in thread From: Bill Sommerfeld @ 2011-01-27 15:55 UTC (permalink / raw) To: Daniel Baluta; +Cc: therbert, netdev On Thu, Jan 27, 2011 at 02:07, Daniel Baluta <daniel.baluta@gmail.com> wrote: > How did you solve the issue regarding scaling TCP listeners? > I think SO_REUSEPORT proposed by patch [1] can be a good > start. Were there any follow-ups? Google is using the patch internally. I've recently joined Google and have picked up this work from Tom; I'm starting to rework how it interacts with TCP (in particular, changing how it interacts with request sockets and listen sockets so that incoming connections are not prematurely bound to a specific listener sharing the port). I have nothing worth sharing yet. ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-01-27 10:07 SO_REUSEPORT - can it be done in kernel? Daniel Baluta 2011-01-27 15:55 ` Bill Sommerfeld @ 2011-01-27 21:32 ` Tom Herbert 2011-02-25 12:56 ` Thomas Graf 1 sibling, 1 reply; 91+ messages in thread From: Tom Herbert @ 2011-01-27 21:32 UTC (permalink / raw) To: Daniel Baluta; +Cc: netdev On Thu, Jan 27, 2011 at 2:07 AM, Daniel Baluta <daniel.baluta@gmail.com> wrote: > Hi Tom, > > How did you solve the issue regarding scaling TCP listeners? > I think SO_REUSEPORT proposed by patch [1] can be a good > start. Were there any follow-ups? > As Bill mentioned, we are continuing to work on the TCP issues. It looks like modifying the TCP listener structures will probably be required. > Also, solving the problem in user space can be an option. I want > to run multiple instances of a DNS server on a multi-core system. > Any suggestions would be welcome. > > The SO_REUSEPORT option seems to be already there [2]. Were > there any plans to have a kernel implementation? > Yes, we are still planning this. The UDP implementation from my earlier patch should be usable to try for DNS/UDP -- this is in fact where we saw a major performance gain. Eric Dumazet had some nice improvements that should probably be looked at also. Tom ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-01-27 21:32 ` Tom Herbert @ 2011-02-25 12:56 ` Thomas Graf 2011-02-25 19:18 ` Rick Jones 2011-02-25 19:51 ` Tom Herbert 0 siblings, 2 replies; 91+ messages in thread From: Thomas Graf @ 2011-02-25 12:56 UTC (permalink / raw) To: Tom Herbert, Bill Sommerfeld; +Cc: Daniel Baluta, netdev On Thu, Jan 27, 2011 at 01:32:25PM -0800, Tom Herbert wrote: > Yes, we are still planning this. The UDP implementation from my > earlier patch should be usable to try for DNS/UDP -- this is in fact > where we saw a major performance gain. Eric Dumazet had some nice > improvements that should probably be looked at also. I can confirm this. Serious scalability issues have been reported on a 12-core system running bind 9.7-2. The system was only able to deliver ~110K queries per second. Using your SO_REUSEPORT patch and a modified bind that takes advantage of it, the same system is able to deliver ~650K queries per second while maxing out all cores completely. Tom, Bill: do you have a timeline for merging this? Especially the UDP bits? -Thomas ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-25 12:56 ` Thomas Graf @ 2011-02-25 19:18 ` Rick Jones 2011-02-25 19:20 ` David Miller ` (2 more replies) 2011-02-25 19:51 ` Tom Herbert 1 sibling, 3 replies; 91+ messages in thread From: Rick Jones @ 2011-02-25 19:18 UTC (permalink / raw) To: Thomas Graf; +Cc: Tom Herbert, Bill Sommerfeld, Daniel Baluta, netdev On Fri, 2011-02-25 at 07:56 -0500, Thomas Graf wrote: > On Thu, Jan 27, 2011 at 01:32:25PM -0800, Tom Herbert wrote: > > Yes, we are still planning this. The UDP implementation from my > > earlier patch should be usable to try for DNS/UDP -- this is in fact > > where we saw a major performance gain. Eric Dumazet had some nice > > improvements that should probably be looked at also. > > I can confirm this. > > Serious scalability issues have been reported on a 12-core system > running bind 9.7-2. The system was only able to deliver ~110K queries > per second. > > Using your SO_REUSEPORT patch and a modified bind that takes advantage of it, the same > system is able to deliver ~650K queries per second while maxing out > all cores completely. I think the idea is goodness, but I will ask: was the (first) bottleneck actually in the kernel, or was it in bind itself? I've seen single-instance, single-byte burst-mode netperf TCP_RR do in excess of 300K transactions per second (with TCP_NODELAY set) on an X5560 core. ftp://ftp.netperf.org/netperf/misc/dl380g6_X5560_rhel54_ad386_cxgb3_1.4.1.2_b2b_to_same_agg_1500mtu_20100513-2.csv and that was with now-ancient RHEL5.4 bits... yes, there is a bit of apples, oranges and kumquats, but still, I am wondering if this didn't also "work around" some internal BIND scaling issues. rick jones > > Tom, Bill: do you have a timeline for merging this? Especially the > UDP bits? 
> > -Thomas ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-25 19:18 ` Rick Jones @ 2011-02-25 19:20 ` David Miller 2011-02-26 0:57 ` Herbert Xu 2011-02-25 19:21 ` SO_REUSEPORT - can it be done in kernel? Eric Dumazet 2011-02-25 22:48 ` Thomas Graf 2 siblings, 1 reply; 91+ messages in thread From: David Miller @ 2011-02-25 19:20 UTC (permalink / raw) To: rick.jones2; +Cc: tgraf, therbert, wsommerfeld, daniel.baluta, netdev From: Rick Jones <rick.jones2@hp.com> Date: Fri, 25 Feb 2011 11:18:15 -0800 > and that was with now ancient RHEL5.4 bits... yes, there is a bit of > apples, oranges and kumquats but still, I am wondering if this didn't > also "work around" some internal BIND scaling issues as well. I think this is fundamentally a bind problem as well. ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-25 19:20 ` David Miller @ 2011-02-26 0:57 ` Herbert Xu 2011-02-26 2:12 ` David Miller 2011-02-27 11:02 ` Thomas Graf 0 siblings, 2 replies; 91+ messages in thread From: Herbert Xu @ 2011-02-26 0:57 UTC (permalink / raw) To: David Miller Cc: rick.jones2, tgraf, therbert, wsommerfeld, daniel.baluta, netdev David Miller <davem@davemloft.net> wrote: > > I think this is fundamentally a bind problem as well. I'm fairly certain the bottleneck is indeed in the kernel, and in the UDP stack in particular. This is borne out by a test where I used two named worker threads, both working on the same socket. Stracing shows that they're working flat out only doing sendmsg/recvmsg. The result was that they obtained (in aggregate) half the throughput of a single worker thread. I then retested by having them use two sockets and the performance greatly improved. Now this is actually expected, since our UDP stack is essentially single-threaded on the send side when only one socket is being used, mostly due to the corking functionality. I'm unsure how big a role the receive-side scalability actually plays in this case, but I suspect it isn't great. Which is why I'm quite skeptical about this REUSEPORT patch, as IMHO the only reason it produces a great result is that it allows parallel sends to go out. Rather than modifying all UDP applications out there to fix what is fundamentally a kernel problem, I think what we should do is fix the UDP stack so that it actually scales. It isn't all that hard since the easy way would be to only take the lock if we're already corked or about to cork. For the receive side we also don't need REUSEPORT as we can simply make our UDP stack multiqueue. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-26 0:57 ` Herbert Xu @ 2011-02-26 2:12 ` David Miller 2011-02-26 2:48 ` Herbert Xu 2011-02-27 11:02 ` Thomas Graf 1 sibling, 1 reply; 91+ messages in thread From: David Miller @ 2011-02-26 2:12 UTC (permalink / raw) To: herbert; +Cc: rick.jones2, tgraf, therbert, wsommerfeld, daniel.baluta, netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Sat, 26 Feb 2011 08:57:18 +0800 > It isn't all that hard since the easy way would be to only take > the lock if we're already corked or about to cork. We take the lock unconditionally because we essentially have to after UDP takes on the socket buffer accounting facilities similar to TCP. ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-26 2:12 ` David Miller @ 2011-02-26 2:48 ` Herbert Xu 2011-02-26 3:07 ` David Miller 0 siblings, 1 reply; 91+ messages in thread From: Herbert Xu @ 2011-02-26 2:48 UTC (permalink / raw) To: David Miller Cc: rick.jones2, tgraf, therbert, wsommerfeld, daniel.baluta, netdev On Fri, Feb 25, 2011 at 06:12:44PM -0800, David Miller wrote: > > We take the lock unconditionally because we essentially have to after > UDP takes on the socket buffer accounting facilities similar to TCP. Well I just checked out the history tree (2.6.12) and it too had the unconditional lock on the send path. So this predates the system-wide buffer limit change. I'm looking at redoing this and the bulk of the work is going to be restructuring ip_append_data/ip_push_pending_frames so that it doesn't store the states in sk/inet_sk. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-26 2:48 ` Herbert Xu @ 2011-02-26 3:07 ` David Miller 2011-02-26 3:11 ` Herbert Xu 0 siblings, 1 reply; 91+ messages in thread From: David Miller @ 2011-02-26 3:07 UTC (permalink / raw) To: herbert; +Cc: rick.jones2, tgraf, therbert, wsommerfeld, daniel.baluta, netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Sat, 26 Feb 2011 10:48:48 +0800 > I'm looking at redoing this and the bulk of the work is going to > be restructuring ip_append_data/ip_push_pending_frames so that it > doesn't store the states in sk/inet_sk. I suppose you're going to replace that stuff with an on-stack control structure that gets passed around by reference or similar? ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-26 3:07 ` David Miller @ 2011-02-26 3:11 ` Herbert Xu 2011-02-26 7:31 ` Eric Dumazet 0 siblings, 1 reply; 91+ messages in thread From: Herbert Xu @ 2011-02-26 3:11 UTC (permalink / raw) To: David Miller Cc: rick.jones2, tgraf, therbert, wsommerfeld, daniel.baluta, netdev On Fri, Feb 25, 2011 at 07:07:23PM -0800, David Miller wrote: > From: Herbert Xu <herbert@gondor.apana.org.au> > Date: Sat, 26 Feb 2011 10:48:48 +0800 > > > I'm looking at redoing this and the bulk of the work is going to > > be restructuring ip_append_data/ip_push_pending_frames so that it > > doesn't store the states in sk/inet_sk. > > I suppose you're going to replace that stuff with an on-stack > control structure that gets passed around by reference or > similar? Either that or have ip_append_data do ip_push_pending_frames directly. That function's signature is a mess already and I need to think about this a bit more :) Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-26 3:11 ` Herbert Xu @ 2011-02-26 7:31 ` Eric Dumazet 2011-02-26 7:46 ` David Miller 0 siblings, 1 reply; 91+ messages in thread From: Eric Dumazet @ 2011-02-26 7:31 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, rick.jones2, tgraf, therbert, wsommerfeld, daniel.baluta, netdev On Saturday 26 February 2011 at 11:11 +0800, Herbert Xu wrote: > On Fri, Feb 25, 2011 at 07:07:23PM -0800, David Miller wrote: > > From: Herbert Xu <herbert@gondor.apana.org.au> > > Date: Sat, 26 Feb 2011 10:48:48 +0800 > > > > > I'm looking at redoing this and the bulk of the work is going to > > > be restructuring ip_append_data/ip_push_pending_frames so that it > > > doesn't store the states in sk/inet_sk. > > > > I suppose you're going to replace that stuff with an on-stack > > control structure that gets passed around by reference or > > similar? > > Either that or have ip_append_data do ip_push_pending_frames > directly. > > That function's signature is a mess already and I need to think > about this a bit more :) > > Cheers, UDP CORK is a problem indeed. I wonder who really uses it? ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-26 7:31 ` Eric Dumazet @ 2011-02-26 7:46 ` David Miller 0 siblings, 0 replies; 91+ messages in thread From: David Miller @ 2011-02-26 7:46 UTC (permalink / raw) To: eric.dumazet Cc: herbert, rick.jones2, tgraf, therbert, wsommerfeld, daniel.baluta, netdev From: Eric Dumazet <eric.dumazet@gmail.com> Date: Sat, 26 Feb 2011 08:31:24 +0100 > UDP CORK is a problem indeed. I wonder who really uses it ? git grep MSG_MORE -- net/sunrpc ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-26 0:57 ` Herbert Xu 2011-02-26 2:12 ` David Miller @ 2011-02-27 11:02 ` Thomas Graf 2011-02-27 11:06 ` Herbert Xu 1 sibling, 1 reply; 91+ messages in thread From: Thomas Graf @ 2011-02-27 11:02 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Sat, Feb 26, 2011 at 08:57:18AM +0800, Herbert Xu wrote: > I'm fairly certain the bottleneck is indeed in the kernel, and > in the UDP stack in particular. > > This is borne out by a test where I used two named worker threads, > both working on the same socket. Stracing shows that they're > working flat out only doing sendmsg/recvmsg. > > The result was that they obtained (in aggregate) half the throughput > of a single worker thread. I agree. This is the bottleneck that I described, where the kernel is not able to deliver enough queries for BIND to show the lock contention issues. But there is also the situation where netperf RR performance numbers indicate a much higher kernel capability but BIND is not able to deliver more even though the CPU utilization is very low. This is the situation where we see the large number of futex calls indicating lock contention due to too many queries on a single socket. > Which is why I'm quite skeptical about this REUSEPORT patch, as > IMHO the only reason it produces a great result is that it allows > parallel sends to go out. > > Rather than modifying all UDP applications out there to fix what > is fundamentally a kernel problem, I think what we should do is > fix the UDP stack so that it actually scales. I am not suggesting that this is the ultimate and final fix for this problem. It fixes a symptom rather than the cause, but sometimes being able to fix the symptom becomes really handy :-) Adding SO_REUSEPORT does not prevent us from fixing the UDP stack in the long run. 
> It isn't all that hard since the easy way would be to only take > the lock if we're already corked or about to cork. > > For the receive side we also don't need REUSEPORT as we can simply > make our UDP stack multiqueue. OK, it is not required and there is definitely a better way to fix the kernel bottleneck in the long term. Even better. I still suggest merging this patch as an immediate workaround until we scale properly on a single socket, and also as a workaround for applications which can't get rid of their per-socket mutex quickly. ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-27 11:02 ` Thomas Graf @ 2011-02-27 11:06 ` Herbert Xu 2011-02-28 3:45 ` Tom Herbert ` (5 more replies) 0 siblings, 6 replies; 91+ messages in thread From: Herbert Xu @ 2011-02-27 11:06 UTC (permalink / raw) To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Sun, Feb 27, 2011 at 06:02:05AM -0500, Thomas Graf wrote: > > I still suggest merging this patch as an immediate workaround > until we scale properly on a single socket, and also as a workaround > for applications which can't get rid of their per-socket mutex quickly. I disagree completely. This patch adds a user-space API that we will have to carry with us in perpetuity. I would only support this if we had no other way around the problem. If this does turn out to be mostly due to sendmsg contention, then fixing it is going to be much simpler than making the UDP stack multiqueue capable. I'm working on this right now. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-27 11:06 ` Herbert Xu @ 2011-02-28 3:45 ` Tom Herbert 2011-02-28 4:26 ` Herbert Xu 2011-02-28 11:36 ` Herbert Xu ` (4 subsequent siblings) 5 siblings, 1 reply; 91+ messages in thread From: Tom Herbert @ 2011-02-28 3:45 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, rick.jones2, wsommerfeld, daniel.baluta, netdev > I disagree completely. > > This patch adds a user-space API that we will have to carry > with us in perpetuity. I would only support this if we had > no other way around the problem. > > If this does turn out to be mostly due to sendmsg contention, > then fixing it is going to be much simpler than making the UDP > stack multiqueue capable. > That sounds promising, but the receive side will still have problems. There is lock contention on the queue as well as cache-line bouncing on the sock structures. Also, multiple threads sleeping on the same socket typically lead to asymmetric load across the threads (and degenerate cases where the receiving thread is woken up but other threads have already processed all the packets). TCP listener threads suffer from these same problems. Tom > I'm working on this right now. > > Cheers, > -- > Email: Herbert Xu <herbert@gondor.apana.org.au> > Home Page: http://gondor.apana.org.au/~herbert/ > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt > ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-28 3:45 ` Tom Herbert @ 2011-02-28 4:26 ` Herbert Xu 0 siblings, 0 replies; 91+ messages in thread From: Herbert Xu @ 2011-02-28 4:26 UTC (permalink / raw) To: Tom Herbert; +Cc: David Miller, rick.jones2, wsommerfeld, daniel.baluta, netdev On Sun, Feb 27, 2011 at 07:45:55PM -0800, Tom Herbert wrote: > That sounds promising, but receive side will still have problems. > There is lock contention on the queue as well as cache line bouncing > on the sock structures. Also multiple threads sleeping on same socket > typically leads to asymmetric load across the threads (and > degenerative cases where receiving thread is woken up and other > threads have already processed all the packets). TCP listener threads > suffer from these same problems. IOW this is something that we have to solve anyway. I'm just being overly cautious here because user-space API changes are something that we should not enter into lightly. If this patch was completely internal to the kernel, then I would have much less of an objection as we can always revert/change it later on. With a user-space API we don't have that flexibility. Thanks, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-27 11:06 ` Herbert Xu 2011-02-28 3:45 ` Tom Herbert @ 2011-02-28 11:36 ` Herbert Xu 2011-02-28 13:32 ` Eric Dumazet ` (3 more replies) 2011-02-28 11:41 ` [PATCH 2/5] net: Remove explicit write references to sk/inet in ip_append_data Herbert Xu ` (3 subsequent siblings) 5 siblings, 4 replies; 91+ messages in thread From: Herbert Xu @ 2011-02-28 11:36 UTC (permalink / raw) To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Sun, Feb 27, 2011 at 07:06:14PM +0800, Herbert Xu wrote: > I'm working on this right now. OK, I think I was definitely on the right track. With the send path made lockless I now get numbers which are even better than those obtained by running named with multiple sockets. That's right, a single socket is now faster than what multiple sockets were without the patch (of course, multiple sockets may still be faster with the patch vs. a single socket for obvious reasons, but I couldn't measure any significant difference). Also worthy of note is that prior to the patch all CPUs showed idleness (lazy bastards!); with the patch they're all maxed out. In retrospect, the idleness was simply the result of the socket lock scheduling away and was an indication of lock contention. Here are the patches I used. Please don't apply them yet as I intend to clean them up quite a bit. But please do test them heavily, especially if you have an AMD NUMA machine, as that's where scalability problems really show up. Intel tends to be a lot more forgiving. My last AMD machine blew up years ago :) Thanks! -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-28 11:36 ` Herbert Xu @ 2011-02-28 13:32 ` Eric Dumazet 2011-02-28 14:13 ` Herbert Xu 2011-02-28 14:53 ` Eric Dumazet 2011-02-28 14:13 ` Thomas Graf ` (2 subsequent siblings) 3 siblings, 2 replies; 91+ messages in thread From: Eric Dumazet @ 2011-02-28 13:32 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Monday 28 February 2011 at 19:36 +0800, Herbert Xu wrote: > On Sun, Feb 27, 2011 at 07:06:14PM +0800, Herbert Xu wrote: > > I'm working on this right now. > > OK, I think I was definitely on the right track. With the send > path made lockless I now get numbers which are even better than > those obtained by running named with multiple sockets. That's > right, a single socket is now faster than what multiple sockets > were without the patch (of course, multiple sockets may still > be faster with the patch vs. a single socket for obvious reasons, > but I couldn't measure any significant difference). > > Also worthy of note is that prior to the patch all CPUs showed > idleness (lazy bastards!); with the patch they're all maxed out. > > In retrospect, the idleness was simply the result of the socket > lock scheduling away and was an indication of lock contention. > Now, the input path can run without finding the socket locked by the xmit path, so skbs are queued into the receive queue, not the backlog one. > Here are the patches I used. Please don't apply them yet as I intend > to clean them up quite a bit. > > But please do test them heavily, especially if you have an AMD > NUMA machine, as that's where scalability problems really show > up. Intel tends to be a lot more forgiving. My last AMD machine > blew up years ago :) I am going to test them, thanks! ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-28 13:32 ` Eric Dumazet @ 2011-02-28 14:13 ` Herbert Xu 2011-02-28 14:22 ` Eric Dumazet 2011-02-28 14:53 ` Eric Dumazet 1 sibling, 1 reply; 91+ messages in thread From: Herbert Xu @ 2011-02-28 14:13 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Mon, Feb 28, 2011 at 02:32:51PM +0100, Eric Dumazet wrote: > > Now, input path can run without finding socket locked by xmit path, so > skb are queued into receive queue, not backlog one. Indeed, I think this is what Dave alluded to earlier. This will eventually have to be dealt with but for now the data rate is low enough that it isn't killing us. Thanks, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-28 14:13 ` Herbert Xu @ 2011-02-28 14:22 ` Eric Dumazet 2011-02-28 14:25 ` Herbert Xu 0 siblings, 1 reply; 91+ messages in thread From: Eric Dumazet @ 2011-02-28 14:22 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Monday 28 February 2011 at 22:13 +0800, Herbert Xu wrote: > On Mon, Feb 28, 2011 at 02:32:51PM +0100, Eric Dumazet wrote: > > > > Now, the input path can run without finding the socket locked by the xmit path, so > > skbs are queued into the receive queue, not the backlog one. > > Indeed, I think this is what Dave alluded to earlier. This will > eventually have to be dealt with but for now the data rate is low > enough that it isn't killing us. Not sure how you read this ;) I said that before your patches, a sender was consuming a lot of time transferring frames from the backlog to the receive queue right before releasing the socket lock. Now, the receive path doesn't slow down the senders, and vice versa. :) ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-28 14:22 ` Eric Dumazet @ 2011-02-28 14:25 ` Herbert Xu 0 siblings, 0 replies; 91+ messages in thread From: Herbert Xu @ 2011-02-28 14:25 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Mon, Feb 28, 2011 at 03:22:06PM +0100, Eric Dumazet wrote: > > Not sure how you read this ;) > > I said that before your patches, a sender was consuming lot of time to > transfert frames from backlog to receive queue right before releasing > socket lock. > > Now, the receive path doesnt slow down the senders, and vice versa. > > :) I understood what you wrote :) I was just referring to an earlier message where Dave talked about the UDP accounting patch making us having to take the lock on every packet. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-28 13:32 ` Eric Dumazet 2011-02-28 14:13 ` Herbert Xu @ 2011-02-28 14:53 ` Eric Dumazet 2011-02-28 15:01 ` Thomas Graf 1 sibling, 1 reply; 91+ messages in thread From: Eric Dumazet @ 2011-02-28 14:53 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Monday 28 February 2011 at 14:32 +0100, Eric Dumazet wrote: > On Monday 28 February 2011 at 19:36 +0800, Herbert Xu wrote: > > On Sun, Feb 27, 2011 at 07:06:14PM +0800, Herbert Xu wrote: > > > I'm working on this right now. > > > > OK, I think I was definitely on the right track. With the send > > path made lockless I now get numbers which are even better than > > those obtained by running named with multiple sockets. That's > > right, a single socket is now faster than what multiple sockets > > were without the patch (of course, multiple sockets may still > > be faster with the patch vs. a single socket for obvious reasons, > > but I couldn't measure any significant difference). > > > > Also worthy of note is that prior to the patch all CPUs showed > > idleness (lazy bastards!); with the patch they're all maxed out. > > > > In retrospect, the idleness was simply the result of the socket > > lock scheduling away and was an indication of lock contention. > > > > Now, the input path can run without finding the socket locked by the xmit path, so > skbs are queued into the receive queue, not the backlog one. > > > Here are the patches I used. Please don't apply them yet as I intend > > to clean them up quite a bit. > > > > But please do test them heavily, especially if you have an AMD > > NUMA machine, as that's where scalability problems really show > > up. Intel tends to be a lot more forgiving. My last AMD machine > > blew up years ago :) > > I am going to test them, thanks! 
First "sending only" tests on my 2x4x2 machine (two E5540 @ 2.53GHz, quad core, hyperthreaded, NUMA kernel): 16 threads, each one sending 100,000 UDP frames using a _shared_ socket. I use the same destination IP, so we suffer a bit of dst refcount contention. (Sending to the dummy0 device to avoid contention on the qdisc and device.)

# ip ro get 10.2.2.21
10.2.2.21 dev dummy0 src 10.2.2.2 cache

LOCKDEP enabled kernel

Before:
# time ./udpflood -f -t 16 -l 100000 10.2.2.21
real 0m42.749s
user 0m1.010s
sys 1m38.039s

After:
# time ./udpflood -f -t 16 -l 100000 10.2.2.21
real 0m1.167s
user 0m0.488s
sys 0m17.373s

With one thread only and 16*100000 frames:
# time ./udpflood -f -l 1600000 10.2.2.21
real 0m9.318s
user 0m0.238s
sys 0m9.052s

(We have some false sharing on atomic fields in struct file and socket, but nothing to worry about.)

With LOCKDEP OFF:

16 threads:
# time ./udpflood -f -t 16 -l 100000 10.2.2.21
real 0m0.718s
user 0m0.376s
sys 0m10.963s

1 thread:
# time ./udpflood -f -l 1600000 10.2.2.21
real 0m1.514s
user 0m0.153s
sys 0m1.357s

"perf record/report" results for the 16 threads case (no lockdep):

# Events: 389K cpu-clock-msecs
#
# Overhead   Command    Shared Object       Symbol
# ........  ..........  ..................  ...................................
#
     9.03%  udpflood    [kernel.kallsyms]   [k] sock_wfree
     8.58%  udpflood    [kernel.kallsyms]   [k] __ip_route_output_key
     8.52%  udpflood    [kernel.kallsyms]   [k] sock_alloc_send_pskb
     7.46%  udpflood    [kernel.kallsyms]   [k] sock_def_write_space
     6.76%  udpflood    [kernel.kallsyms]   [k] __xfrm_lookup
     6.18%  swapper     [kernel.kallsyms]   [k] acpi_idle_enter_bm
     5.66%  udpflood    [kernel.kallsyms]   [k] dst_release
     4.96%  udpflood    [kernel.kallsyms]   [k] udp_sendmsg
     3.48%  udpflood    [kernel.kallsyms]   [k] fget_light
     2.75%  udpflood    [kernel.kallsyms]   [k] sock_tx_timestamp
     2.40%  udpflood    [kernel.kallsyms]   [k] __ip_make_skb
     2.36%  udpflood    [kernel.kallsyms]   [k] fput
     1.87%  swapper     [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore
     1.81%  udpflood    [kernel.kallsyms]   [k] inet_sendmsg
     1.53%  udpflood    [kernel.kallsyms]   [k] sys_sendto
     1.50%  udpflood    [kernel.kallsyms]   [k] ip_finish_output
     1.31%  udpflood    [kernel.kallsyms]   [k] csum_partial_copy_generic
     1.30%  udpflood    udpflood            [.] do_thread
     1.28%  udpflood    [kernel.kallsyms]   [k] __ip_append_data
     1.08%  udpflood    [kernel.kallsyms]   [k] __memset
     1.05%  udpflood    [kernel.kallsyms]   [k] ip_route_output_flow
     0.91%  udpflood    [kernel.kallsyms]   [k] kfree
     0.88%  udpflood    [vdso]              [.] 0xffffe430
     0.83%  udpflood    [kernel.kallsyms]   [k] copy_user_generic_string
     0.78%  udpflood    libc-2.3.4.so       [.] __GI_memcpy
     0.77%  udpflood    [kernel.kallsyms]   [k] ia32_sysenter_target

What do you suggest for a bind-based test? ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-28 14:53 ` Eric Dumazet @ 2011-02-28 15:01 ` Thomas Graf 0 siblings, 0 replies; 91+ messages in thread From: Thomas Graf @ 2011-02-28 15:01 UTC (permalink / raw) To: Eric Dumazet Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Mon, Feb 28, 2011 at 03:53:03PM +0100, Eric Dumazet wrote: > What do you suggest to perform a bind based test ? We use queryperf from BIND sources. I typically run 1 queryperf instance per core on multiple machines. ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-28 11:36 ` Herbert Xu 2011-02-28 13:32 ` Eric Dumazet @ 2011-02-28 14:13 ` Thomas Graf 2011-02-28 16:22 ` Eric Dumazet 2011-03-01 5:33 ` Eric Dumazet 2011-03-01 12:35 ` Herbert Xu 3 siblings, 1 reply; 91+ messages in thread From: Thomas Graf @ 2011-02-28 14:13 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

On Mon, Feb 28, 2011 at 07:36:59PM +0800, Herbert Xu wrote:
> But please do test them heavily, especially if you have an AMD
> NUMA machine as that's where scalability problems really show
> up. Intel tends to be a lot more forgiving. My last AMD machine
> blew up years ago :)

This is just a preliminary test result and not 100% reliable, because
halfway through the testing the machine reported memory issues and
disabled a DIMM before booting the tested kernels.

Nevertheless, bind 9.7.3:

2.6.38-rc5+:                     62kqps
2.6.38-rc5+ w/ Herbert's patch: 442kqps

This is on a 2-node NUMA Intel Xeon X5560 @ 2.80GHz with 16 cores.

Again, this number is not 100% reliable, but it clearly shows that
the concept of the patch is working very well.

Will test Herbert's patch on the machine that did 650kqps with
SO_REUSEPORT and also on some AMD machines.

^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-28 14:13 ` Thomas Graf @ 2011-02-28 16:22 ` Eric Dumazet 2011-02-28 16:37 ` Thomas Graf 0 siblings, 1 reply; 91+ messages in thread From: Eric Dumazet @ 2011-02-28 16:22 UTC (permalink / raw) To: Thomas Graf Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

Le lundi 28 février 2011 à 09:13 -0500, Thomas Graf a écrit :
> On Mon, Feb 28, 2011 at 07:36:59PM +0800, Herbert Xu wrote:
> > But please do test them heavily, especially if you have an AMD
> > NUMA machine as that's where scalability problems really show
> > up. Intel tends to be a lot more forgiving. My last AMD machine
> > blew up years ago :)
>
> This is just a preliminary test result and not 100% reliable
> because half through the testing the machine reported memory
> issues and disabled a DIMM before booting the tested kernels.
>
> Nevertheless, bind 9.7.3:
>
> 2.6.38-rc5+: 62kqps
> 2.6.38-rc5+ w/ Herbert's patch: 442kqps
>
> This is on a 2 NUMA Intel Xeon X5560 @ 2.80GHz with 16 cores
>
> Again, this number is not 100% reliably but clearly shows that
> the concept of the patch is working very well.
>
> Will test Herbert's patch on the machine that did 650kqps with
> SO_REUSEPORT and also on some AMD machines.
> --

I suspect your queryperf input file hits many zones?

With a single zone, my machine is able to give 250 kqps : most of the
time is consumed in bind code, dealing with rwlocks and false sharing
things... (bind-9.7.2-P3)

Using two remote machines to perform queries, on a bnx2x adapter, RSS
enabled : two cpus receive UDP frames for the same socket, so we also
hit false sharing in the kernel receive path.
---------------------------------------------------------------------------------------------------------------------------------
   PerfTop:  558863 irqs/sec  kernel:40.8%  exact:  0.0% [1000Hz cpu-clock-msecs], (all, 16 CPUs)
---------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt  function                       DSO
             _______  _____  _____________________________  ______________________________________

           137175.00  12.4%  acpi_idle_enter_bm             [kernel.kallsyms]
            63784.00   5.8%  _raw_spin_unlock_irqrestore    [kernel.kallsyms]
            54140.00   4.9%  isc_rwlock_lock                /opt/src/bind-9.7.2-P3/bin/named/named
            32682.00   2.9%  isc_rwlock_unlock              /opt/src/bind-9.7.2-P3/bin/named/named
            21823.00   2.0%  dns_rbt_findnode               /opt/src/bind-9.7.2-P3/bin/named/named
            20306.00   1.8%  __ticket_spin_lock             [kernel.kallsyms]
            16881.00   1.5%  finish_task_switch             [kernel.kallsyms]
            15335.00   1.4%  zone_find                      /opt/src/bind-9.7.2-P3/bin/named/named
            14082.00   1.3%  decrement_reference            /opt/src/bind-9.7.2-P3/bin/named/named
            14064.00   1.3%  __pthread_mutex_lock_internal  /lib/tls/libpthread-2.3.4.so
            13519.00   1.2%  isc_stats_increment            /opt/src/bind-9.7.2-P3/bin/named/named
            13027.00   1.2%  __GI_memcpy                    /lib/tls/libc-2.3.4.so
            12516.00   1.1%  dns_name_concatenate           /opt/src/bind-9.7.2-P3/bin/named/named
            12499.00   1.1%  currentversion                 /opt/src/bind-9.7.2-P3/bin/named/named
            11412.00   1.0%  dns_name_fullcompare           /opt/src/bind-9.7.2-P3/bin/named/named
            10814.00   1.0%  new_reference.clone.6          /opt/src/bind-9.7.2-P3/bin/named/named
            10580.00   1.0%  attach                         /opt/src/bind-9.7.2-P3/bin/named/named
             9805.00   0.9%  zone_zonecut_callback          /opt/src/bind-9.7.2-P3/bin/named/named

^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-28 16:22 ` Eric Dumazet @ 2011-02-28 16:37 ` Thomas Graf 2011-02-28 17:07 ` Eric Dumazet 0 siblings, 1 reply; 91+ messages in thread From: Thomas Graf @ 2011-02-28 16:37 UTC (permalink / raw) To: Eric Dumazet Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Mon, Feb 28, 2011 at 05:22:54PM +0100, Eric Dumazet wrote: > Le lundi 28 février 2011 à 09:13 -0500, Thomas Graf a écrit : > > On Mon, Feb 28, 2011 at 07:36:59PM +0800, Herbert Xu wrote: > > > But please do test them heavily, especially if you have an AMD > > > NUMA machine as that's where scalability problems really show > > > up. Intel tends to be a lot more forgiving. My last AMD machine > > > blew up years ago :) > > > > This is just a preliminary test result and not 100% reliable > > because half through the testing the machine reported memory > > issues and disabled a DIMM before booting the tested kernels. > > > > Nevertheless, bind 9.7.3: > > > > 2.6.38-rc5+: 62kqps > > 2.6.38-rc5+ w/ Herbert's patch: 442kqps > > > > This is on a 2 NUMA Intel Xeon X5560 @ 2.80GHz with 16 cores > > > > Again, this number is not 100% reliably but clearly shows that > > the concept of the patch is working very well. > > > > Will test Herbert's patch on the machine that did 650kqps with > > SO_REUSEPORT and also on some AMD machines. > > -- > > I suspect your queryperf input file hits many zones ? No, we use a simple example.com zone with host[1-4] A records resolving to 10.[1-4].0.1 > With a single zone, my machine is able to give 250kps : most of the time > is consumed in bind code, dealing with rwlocks and false sharing > things... > > (bind-9.7.2-P3) > Using two remote machines to perform queries, on bnx2x adapter, RSS > enabled : two cpus receive UDP frames for the same socket, so we also > hit false sharing in kernel receive path. How do you measure the qps? The output of queryperf? That is not always accurate. 
I run rndc stats twice and then calculate the qps based on the diff of
the "queries resulted in successful answer" counter and the timestamp
diff.

The numbers differ a lot depending on the architecture we test on.
F.e. on a 12 core AMD with 2 NUMA nodes:

2.6.32:
  named -n 1:  37.0kqps
  named:        3.8kqps (yes, no joke, the socket receive buffer is
                         always full and the kernel drops pkts)

2.6.38-rc5+ with Herbert's patches:
  named -n 1:  36.9kqps
  named:      222.0kqps

^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-28 16:37 ` Thomas Graf @ 2011-02-28 17:07 ` Eric Dumazet 2011-03-01 10:19 ` Thomas Graf 0 siblings, 1 reply; 91+ messages in thread From: Eric Dumazet @ 2011-02-28 17:07 UTC (permalink / raw) To: Thomas Graf Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

Le lundi 28 février 2011 à 11:37 -0500, Thomas Graf a écrit :
> How do you measure the qps? The output of queryperf? That is not always
> accurate. I run rdnc stats twice and then calculate the qps based on the
> counter "queries resulted in successful answer" diff and timestamp diff.

I have some custom ethernet/system monitoring package installed, so I
get packet rates from it.

It appears my two source machines were not fast enough (one had a
LOCKDEP kernel). I now reach 320 kqps, even if I force NIC interrupts
through one cpu only.

> The numbers differ a lot depending on the architecture we test on.
>
> F.e. on a 12 core AMD with 2 NUMA nodes:
>
> 2.6.32 named -n 1: 37.0kqps
>        named: 3.8kqps (yes, no joke, the socket receive buffer is
>        always full and the kernel drops pkts)

Yes, this old kernel misses commit c377411f2494a93 added in 2.6.35
(net: sk_add_backlog() take rmem_alloc into account).

Quoting the change log :

    Under huge stress from a multiqueue/RPS enabled NIC, a single flow
    udp receiver can now process ~200.000 pps (instead of ~100 pps
    before the patch) on a 8 core machine.

> 2.6.38-rc5+ with Herbert's patches:
> named -n 1: 36.9kqps
> named: 222.0kqps

^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-28 17:07 ` Eric Dumazet @ 2011-03-01 10:19 ` Thomas Graf 2011-03-01 10:33 ` Eric Dumazet 0 siblings, 1 reply; 91+ messages in thread From: Thomas Graf @ 2011-03-01 10:19 UTC (permalink / raw) To: Eric Dumazet Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Mon, Feb 28, 2011 at 06:07:49PM +0100, Eric Dumazet wrote: > > The numbers differ a lot depending on the architecture we test on. > > > > F.e. on a 12 core AMD with 2 NUMA nodes: > > > > 2.6.32 named -n 1: 37.0kqps > > named: 3.8kqps (yes, no joke, the socket receive buffer is > > always full and the kernel drops pkts) > > Yes, this old kernel miss commit c377411f2494a93 added in 2.6.35 > (net: sk_add_backlog() take rmem_alloc into account) I retested with net-2.6 w/o Herbert's patch: named -n 1: 36.9kqps named: 16.2kqps > > 2.6.38-rc5+ with Herbert's patches: > > named -n 1: 36.9kqps > > named: 222.0kqps ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 10:19 ` Thomas Graf @ 2011-03-01 10:33 ` Eric Dumazet 2011-03-01 11:07 ` Thomas Graf 0 siblings, 1 reply; 91+ messages in thread From: Eric Dumazet @ 2011-03-01 10:33 UTC (permalink / raw) To: Thomas Graf Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

Le mardi 01 mars 2011 à 05:19 -0500, Thomas Graf a écrit :
> On Mon, Feb 28, 2011 at 06:07:49PM +0100, Eric Dumazet wrote:
> > > The numbers differ a lot depending on the architecture we test on.
> > >
> > > F.e. on a 12 core AMD with 2 NUMA nodes:
> > >
> > > 2.6.32 named -n 1: 37.0kqps
> > >        named: 3.8kqps (yes, no joke, the socket receive buffer is
> > >        always full and the kernel drops pkts)
> >
> > Yes, this old kernel miss commit c377411f2494a93 added in 2.6.35
> > (net: sk_add_backlog() take rmem_alloc into account)
>
> I retested with net-2.6 w/o Herbert's patch:
>
> named -n 1: 36.9kqps
> named: 16.2kqps

That's better ;)

You could do "cat /proc/net/udp" to check if drops occur on the port 53
socket (last column).

But maybe your queryperf is limited to a few queries in flight (the
default is 20 per queryperf instance).

^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 10:33 ` Eric Dumazet @ 2011-03-01 11:07 ` Thomas Graf 2011-03-01 11:13 ` Eric Dumazet 0 siblings, 1 reply; 91+ messages in thread From: Thomas Graf @ 2011-03-01 11:07 UTC (permalink / raw) To: Eric Dumazet Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

On Tue, Mar 01, 2011 at 11:33:22AM +0100, Eric Dumazet wrote:
> > I retested with net-2.6 w/o Herbert's patch:
> >
> > named -n 1: 36.9kqps
> > named: 16.2kqps
>
> Thats better ;)
>
> You could do "cat /proc/net/udp" to check if drops occur on port 53
> socket (last column)
>
> But maybe your queryperf is limited to few queries in flight (default is
> 20 per queryperf instance)

I tried -q 10, 20, 30, 50, 100. Starting with 20 I see drops; at q=50
queryperf reports 99% drops.

I also tested again on the Intel machine that did ~650kqps using
SO_REUSEPORT:

net-2.6:              106.3kqps, 101.2kqps
net-2.6 lockless udp: 251.7kqps, 250.4kqps

I see drops occur in both test cases, so I believe the rate supplied by
the clients is sufficient.
The difference is obvious when looking at top and mpstat:

UDP lockless (250kqps):

Cpu0  : 46.4%us, 28.8%sy, 0.0%ni, 24.8%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
Cpu1  :  2.0%us,  1.3%sy, 0.0%ni,  3.0%id, 0.0%wa, 0.0%hi, 93.6%si, 0.0%st
Cpu2  : 45.9%us, 28.2%sy, 0.0%ni, 25.9%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
Cpu3  : 50.0%us, 21.6%sy, 0.0%ni, 28.4%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
Cpu4  : 45.4%us, 27.8%sy, 0.0%ni, 26.5%id, 0.0%wa, 0.0%hi,  0.3%si, 0.0%st
Cpu5  : 50.7%us, 23.2%sy, 0.0%ni, 26.1%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
Cpu6  : 45.2%us, 28.9%sy, 0.0%ni, 25.9%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
Cpu7  : 50.5%us, 22.0%sy, 0.0%ni, 27.5%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
Cpu8  : 45.3%us, 29.3%sy, 0.0%ni, 25.4%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
Cpu9  : 50.8%us, 20.8%sy, 0.0%ni, 28.3%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
Cpu10 : 46.1%us, 27.8%sy, 0.0%ni, 26.1%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
Cpu11 : 27.2%us, 11.3%sy, 0.0%ni,  3.3%id, 0.0%wa, 0.0%hi, 58.1%si, 0.0%st

05:50:44 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
05:50:44 AM  all   23.86    0.00   13.02    0.22    0.00    6.98    0.00    0.00   55.92
05:50:44 AM    0   26.16    0.00   17.20    0.73    0.00    0.30    0.00    0.00   55.61
05:50:44 AM    1    2.36    0.00    2.11    0.70    0.00   51.97    0.00    0.00   42.87
05:50:44 AM    2   25.90    0.00   16.38    0.32    0.00    0.03    0.00    0.00   57.36
05:50:44 AM    3   28.26    0.00   12.73    0.27    0.00    0.02    0.00    0.00   58.73
05:50:44 AM    4   25.63    0.00   16.04    0.13    0.00    0.03    0.00    0.00   58.17
05:50:44 AM    5   28.19    0.00   12.54    0.17    0.00    0.01    0.00    0.00   59.09
05:50:44 AM    6   25.28    0.00   15.21    0.02    0.00    1.95    0.00    0.00   57.54
05:50:44 AM    7   28.34    0.00   12.40    0.10    0.00    0.01    0.00    0.00   59.14
05:50:44 AM    8   25.70    0.00   15.91    0.01    0.00    0.02    0.00    0.00   58.37
05:50:44 AM    9   28.31    0.00   12.56    0.11    0.00    0.01    0.00    0.00   59.01
05:50:44 AM   10   25.85    0.00   15.65    0.01    0.00    0.02    0.00    0.00   58.47
05:50:44 AM   11   16.11    0.00    7.44    0.10    0.00   29.87    0.00    0.00   46.49

SO_REUSEPORT test (doing 640kqps):

Cpu0  : 57.3%us, 26.5%sy, 0.0%ni, 3.3%id, 0.0%wa, 0.0%hi, 12.9%si, 0.0%st
Cpu1  : 25.7%us, 10.0%sy, 0.0%ni, 0.3%id, 0.0%wa, 0.0%hi, 64.0%si, 0.0%st
Cpu2  : 56.3%us, 28.8%sy, 0.0%ni, 3.0%id, 0.0%wa, 0.0%hi, 11.9%si, 0.0%st
Cpu3  : 29.1%us, 10.9%sy, 0.0%ni, 1.3%id, 0.0%wa, 0.0%hi, 58.6%si, 0.0%st
Cpu4  : 57.3%us, 28.5%sy, 0.0%ni, 2.3%id, 0.0%wa, 0.0%hi, 11.9%si, 0.0%st
Cpu5  : 64.8%us, 22.6%sy, 0.0%ni, 3.0%id, 0.0%wa, 0.0%hi,  9.6%si, 0.0%st
Cpu6  : 59.0%us, 26.7%sy, 0.0%ni, 2.7%id, 0.0%wa, 0.0%hi, 11.7%si, 0.0%st
Cpu7  : 64.1%us, 22.3%sy, 0.0%ni, 3.7%id, 0.0%wa, 0.0%hi, 10.0%si, 0.0%st
Cpu8  : 57.6%us, 27.5%sy, 0.0%ni, 3.0%id, 0.0%wa, 0.0%hi, 11.9%si, 0.0%st
Cpu9  : 65.2%us, 22.2%sy, 0.0%ni, 2.3%id, 0.0%wa, 0.0%hi, 10.3%si, 0.0%st
Cpu10 : 56.9%us, 28.3%sy, 0.0%ni, 3.0%id, 0.0%wa, 0.0%hi, 11.8%si, 0.0%st
Cpu11 : 40.2%us, 14.6%sy, 0.0%ni, 2.3%id, 0.0%wa, 0.0%hi, 42.9%si, 0.0%st

^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 11:07 ` Thomas Graf @ 2011-03-01 11:13 ` Eric Dumazet 2011-03-01 11:27 ` Thomas Graf 0 siblings, 1 reply; 91+ messages in thread From: Eric Dumazet @ 2011-03-01 11:13 UTC (permalink / raw) To: Thomas Graf Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev Le mardi 01 mars 2011 à 06:07 -0500, Thomas Graf a écrit : > On Tue, Mar 01, 2011 at 11:33:22AM +0100, Eric Dumazet wrote: > > > I retested with net-2.6 w/o Herbert's patch: > > > > > > named -n 1: 36.9kqps > > > named: 16.2kqps > > > > Thats better ;) > > > > You could do "cat /proc/net/udp" to check if drops occur on port 53 > > socket (last column) > > > > But maybe your queryperf is limited to few queries in flight (default is > > 20 per queryperf instance) > > I tried -q 10, 20, 30, 50, 100. Starting with 20 I see drops, at q=50 > queryperf reports 99% drops. > > I also tested again on the Intel machine that did ~650kqps using SO_REUSEPORT. > > net-2.6: 106.3kqps, 101.2kqps > net-2.6 lockless udp: 251.7kqps, 250.4kqps > > I see drops in both test cases occur so I believe the rate supplied by the > clients is sufficient. 
> The difference is obvious when looking at top and mpstat:
>
> UDP lockless (250kqps):
>
> Cpu0  : 46.4%us, 28.8%sy, 0.0%ni, 24.8%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
> Cpu1  :  2.0%us,  1.3%sy, 0.0%ni,  3.0%id, 0.0%wa, 0.0%hi, 93.6%si, 0.0%st
> Cpu2  : 45.9%us, 28.2%sy, 0.0%ni, 25.9%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
> Cpu3  : 50.0%us, 21.6%sy, 0.0%ni, 28.4%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
> Cpu4  : 45.4%us, 27.8%sy, 0.0%ni, 26.5%id, 0.0%wa, 0.0%hi,  0.3%si, 0.0%st
> Cpu5  : 50.7%us, 23.2%sy, 0.0%ni, 26.1%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
> Cpu6  : 45.2%us, 28.9%sy, 0.0%ni, 25.9%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
> Cpu7  : 50.5%us, 22.0%sy, 0.0%ni, 27.5%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
> Cpu8  : 45.3%us, 29.3%sy, 0.0%ni, 25.4%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
> Cpu9  : 50.8%us, 20.8%sy, 0.0%ni, 28.3%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
> Cpu10 : 46.1%us, 27.8%sy, 0.0%ni, 26.1%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
> Cpu11 : 27.2%us, 11.3%sy, 0.0%ni,  3.3%id, 0.0%wa, 0.0%hi, 58.1%si, 0.0%st

It's a bit strange that two CPUs spend time in softirq, unless you have
two queryperf sources and a multiqueue NIC, or maybe you use two NICs?

Mind using "perf top -C 1" and "perf top -C 11" to check what these CPUs
do?

^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 11:13 ` Eric Dumazet @ 2011-03-01 11:27 ` Thomas Graf 2011-03-01 11:45 ` Eric Dumazet 0 siblings, 1 reply; 91+ messages in thread From: Thomas Graf @ 2011-03-01 11:27 UTC (permalink / raw) To: Eric Dumazet Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

On Tue, Mar 01, 2011 at 12:13:04PM +0100, Eric Dumazet wrote:
> Its a bit strange two cpus spend time in softirq, unless you have two
> queryperf sources, and a multiqueue NIC, or maybe you use two NICS ?

one NIC, 2 clients (12 instances per client)

[root@hp-bl460cg7-01 ~]# cat /sys/class/net/eth0/queues/rx-0/rps_cpus
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000

[root@hp-bl460cg7-01 ~]# netstat -s | grep err
1781377 packet receive errors

> Mind use "perf top -C 1" and "perf top -C 11" to check what these cpus
> do ?

--------------------------------------------------------------------------------------------------------------------
   PerfTop:   16198 irqs/sec  kernel:99.1%  exact:  0.0% [1000Hz cpu-clock-msecs], (all, CPU: 1)
--------------------------------------------------------------------------------------------------------------------

  samples  pcnt  function                     DSO
  _______  _____  ___________________________  ___________________________________________________________

 51675.00  33.2%  _raw_spin_unlock_irqrestore  [kernel.kallsyms]
 12426.00   8.0%  clflush_cache_range          [kernel.kallsyms]
  5511.00   3.5%  be_poll_rx                   /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
  4567.00   2.9%  __udp4_lib_lookup            [kernel.kallsyms]
  3981.00   2.6%  __kmalloc_node_track_caller  [kernel.kallsyms]
  3975.00   2.6%  get_rx_page_info             /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
  3725.00   2.4%  sk_run_filter                [kernel.kallsyms]
  3606.00   2.3%  get_page_from_freelist       [kernel.kallsyms]
  3178.00   2.0%  __domain_mapping             [kernel.kallsyms]
  3122.00   2.0%  kmem_cache_alloc_node        [kernel.kallsyms]
  2839.00   1.8%  sock_queue_rcv_skb           [kernel.kallsyms]
  2246.00   1.4%  __netif_receive_skb          [kernel.kallsyms]
  2245.00   1.4%  nf_iterate                   [kernel.kallsyms]
  2081.00   1.3%  __udp4_lib_rcv               [kernel.kallsyms]
  2042.00   1.3%  ipt_do_table                 [kernel.kallsyms]
  1901.00   1.2%  _raw_spin_lock               [kernel.kallsyms]
  1856.00   1.2%  __alloc_skb                  [kernel.kallsyms]
  1645.00   1.1%  read_tsc                     [kernel.kallsyms]
  1562.00   1.0%  nf_ct_tuple_equal            [kernel.kallsyms]
  1562.00   1.0%  ip_rcv                       [kernel.kallsyms]
  1495.00   1.0%  __nf_conntrack_find_get      [kernel.kallsyms]
  1477.00   0.9%  sock_def_readable            [kernel.kallsyms]
  1363.00   0.9%  find_first_bit               [kernel.kallsyms]
  1360.00   0.9%  domain_get_iommu             [kernel.kallsyms]
  1255.00   0.8%  udp_queue_rcv_skb            [kernel.kallsyms]
  1174.00   0.8%  xfrm4_policy_check.clone.0   [kernel.kallsyms]
  1138.00   0.7%  hash_conntrack_raw           [kernel.kallsyms]
  1000.00   0.6%  intel_unmap_page             [kernel.kallsyms]
   959.00   0.6%  load_pointer                 [kernel.kallsyms]
   957.00   0.6%  sock_flag                    [kernel.kallsyms]
   938.00   0.6%  nf_conntrack_in              [kernel.kallsyms]
   891.00   0.6%  _local_bh_enable_ip          [kernel.kallsyms]
   884.00   0.6%  eth_type_trans               [kernel.kallsyms]
   832.00   0.5%  be_post_rx_frags             /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
   829.00   0.5%  __alloc_pages_nodemask       [kernel.kallsyms]
   813.00   0.5%  kmem_cache_alloc             [kernel.kallsyms]
   802.00   0.5%  netif_receive_skb            [kernel.kallsyms]
   802.00   0.5%  ip_route_input_common        [kernel.kallsyms]
   723.00   0.5%  nf_ct_get_tuple              [kernel.kallsyms]
   720.00   0.5%  __intel_map_single           [kernel.kallsyms]
   720.00   0.5%  udp_error                    [kernel.kallsyms]

--------------------------------------------------------------------------------------------------------------------
   PerfTop:   16360 irqs/sec  kernel:72.6%  exact:  0.0% [1000Hz cpu-clock-msecs], (all, CPU: 11)
--------------------------------------------------------------------------------------------------------------------

  samples  pcnt  function                       DSO
  _______  _____  _____________________________  ___________________________________________________________

 16993.00  32.4%  _raw_spin_unlock_irqrestore    [kernel.kallsyms]
  5833.00  11.1%  clflush_cache_range            [kernel.kallsyms]
  3315.00   6.3%  be_tx_compl_process            /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
  1818.00   3.5%  kmem_cache_free                [kernel.kallsyms]
  1415.00   2.7%  isc_rwlock_lock                /usr/lib64/libisc.so.62.0.1
  1090.00   2.1%  be_poll_tx_mcc                 /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
   811.00   1.5%  skb_release_head_state         [kernel.kallsyms]
   772.00   1.5%  skb_release_data               [kernel.kallsyms]
   712.00   1.4%  dns_rbt_findnode               /usr/lib64/libdns.so.69.0.1
   703.00   1.3%  isc_rwlock_unlock              /usr/lib64/libisc.so.62.0.1
   695.00   1.3%  dma_pte_clear_range            [kernel.kallsyms]
   618.00   1.2%  kfree_skb                      [kernel.kallsyms]
   597.00   1.1%  kfree                          [kernel.kallsyms]
   553.00   1.1%  intel_unmap_page               [kernel.kallsyms]
   531.00   1.0%  __do_softirq                   [kernel.kallsyms]
   504.00   1.0%  isc_stats_increment            /usr/lib64/libisc.so.62.0.1
   397.00   0.8%  virt_to_head_page              [kernel.kallsyms]
   306.00   0.6%  _raw_spin_lock                 [kernel.kallsyms]
   270.00   0.5%  domain_get_iommu               [kernel.kallsyms]
   256.00   0.5%  dns_name_fullcompare           /usr/lib64/libdns.so.69.0.1
   233.00   0.4%  find_first_bit                 [kernel.kallsyms]
   222.00   0.4%  dns_name_equal                 /usr/lib64/libdns.so.69.0.1
   218.00   0.4%  __pthread_mutex_lock_internal  /lib64/libpthread-2.12.so
   207.00   0.4%  dns_rbtnodechain_init          /usr/lib64/libdns.so.69.0.1
   196.00   0.4%  dns_acl_match                  /usr/lib64/libdns.so.69.0.1
   194.00   0.4%  dma_pte_free_pagetable         [kernel.kallsyms]
   192.00   0.4%  dns_name_getlabelsequence      /usr/lib64/libdns.so.69.0.1

^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 11:27 ` Thomas Graf @ 2011-03-01 11:45 ` Eric Dumazet 2011-03-01 11:53 ` Herbert Xu ` (2 more replies) 0 siblings, 3 replies; 91+ messages in thread From: Eric Dumazet @ 2011-03-01 11:45 UTC (permalink / raw) To: Thomas Graf Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

Le mardi 01 mars 2011 à 06:27 -0500, Thomas Graf a écrit :
> On Tue, Mar 01, 2011 at 12:13:04PM +0100, Eric Dumazet wrote:
> > Its a bit strange two cpus spend time in softirq, unless you have two
> > queryperf sources, and a multiqueue NIC, or maybe you use two NICS ?
>
> one NIC, 2 clients (12 instances per client)
>
> [root@hp-bl460cg7-01 ~]# cat /sys/class/net/eth0/queues/rx-0/rps_cpus
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
>
> [root@hp-bl460cg7-01 ~]# netstat -s | grep err
> 1781377 packet receive errors
>
> > Mind use "perf top -C 1" and "perf top -C 11" to check what these cpus
> > do ?

Thanks, that's really interesting.

> --------------------------------------------------------------------------------------------------------------------
>    PerfTop:   16198 irqs/sec  kernel:99.1%  exact:  0.0% [1000Hz cpu-clock-msecs], (all, CPU: 1)
> --------------------------------------------------------------------------------------------------------------------
>
>  samples  pcnt  function  DSO
>  _______  _____  ___________________________  ___________________________________________________________
>

CPU 1 handles receives from your BENET NIC. (It's a bit strange, given
this NIC should provide 4 rx queues.) Load could be split to two cpus in
your case (two sources).

Try :

ethtool -S eth0 | grep rx_pk
	rxq0: rx_pkts: ??
	rxq1: rx_pkts: ??
	rxq2: rx_pkts: ??
	rxq3: rx_pkts: ??
	rxq4: rx_pkts: ??

With BE_HDR_LEN being 64, small UDP frames are too big to fit in the skb
head.
> 51675.00  33.2%  _raw_spin_unlock_irqrestore  [kernel.kallsyms]
> 12426.00   8.0%  clflush_cache_range          [kernel.kallsyms]
>  5511.00   3.5%  be_poll_rx                   /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
>  4567.00   2.9%  __udp4_lib_lookup            [kernel.kallsyms]
>  3981.00   2.6%  __kmalloc_node_track_caller  [kernel.kallsyms]
>  3975.00   2.6%  get_rx_page_info             /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
>  3725.00   2.4%  sk_run_filter                [kernel.kallsyms]
>  3606.00   2.3%  get_page_from_freelist       [kernel.kallsyms]
>  3178.00   2.0%  __domain_mapping             [kernel.kallsyms]
>  3122.00   2.0%  kmem_cache_alloc_node        [kernel.kallsyms]
>  2839.00   1.8%  sock_queue_rcv_skb           [kernel.kallsyms]
>  2246.00   1.4%  __netif_receive_skb          [kernel.kallsyms]
>  2245.00   1.4%  nf_iterate                   [kernel.kallsyms]
>  2081.00   1.3%  __udp4_lib_rcv               [kernel.kallsyms]
>  2042.00   1.3%  ipt_do_table                 [kernel.kallsyms]
>  1901.00   1.2%  _raw_spin_lock               [kernel.kallsyms]
>  1856.00   1.2%  __alloc_skb                  [kernel.kallsyms]
>  1645.00   1.1%  read_tsc                     [kernel.kallsyms]
>  1562.00   1.0%  nf_ct_tuple_equal            [kernel.kallsyms]
>  1562.00   1.0%  ip_rcv                       [kernel.kallsyms]
>  1495.00   1.0%  __nf_conntrack_find_get      [kernel.kallsyms]
>  1477.00   0.9%  sock_def_readable            [kernel.kallsyms]
>  1363.00   0.9%  find_first_bit               [kernel.kallsyms]
>  1360.00   0.9%  domain_get_iommu             [kernel.kallsyms]
>  1255.00   0.8%  udp_queue_rcv_skb            [kernel.kallsyms]
>  1174.00   0.8%  xfrm4_policy_check.clone.0   [kernel.kallsyms]
>  1138.00   0.7%  hash_conntrack_raw           [kernel.kallsyms]
>  1000.00   0.6%  intel_unmap_page             [kernel.kallsyms]
>   959.00   0.6%  load_pointer                 [kernel.kallsyms]
>   957.00   0.6%  sock_flag                    [kernel.kallsyms]
>   938.00   0.6%  nf_conntrack_in              [kernel.kallsyms]
>   891.00   0.6%  _local_bh_enable_ip          [kernel.kallsyms]
>   884.00   0.6%  eth_type_trans               [kernel.kallsyms]
>   832.00   0.5%  be_post_rx_frags             /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
>   829.00   0.5%  __alloc_pages_nodemask       [kernel.kallsyms]
>   813.00   0.5%  kmem_cache_alloc             [kernel.kallsyms]
>   802.00   0.5%  netif_receive_skb            [kernel.kallsyms]
>   802.00   0.5%  ip_route_input_common        [kernel.kallsyms]
>   723.00   0.5%  nf_ct_get_tuple              [kernel.kallsyms]
>   720.00   0.5%  __intel_map_single           [kernel.kallsyms]
>   720.00   0.5%  udp_error                    [kernel.kallsyms]
>
> --------------------------------------------------------------------------------------------------------------------
>    PerfTop:   16360 irqs/sec  kernel:72.6%  exact:  0.0% [1000Hz cpu-clock-msecs], (all, CPU: 11)
> --------------------------------------------------------------------------------------------------------------------
>

CPU 11 handles all TX completions : it's a potential bottleneck.

I might resurrect the XPS patch ;)

>  samples  pcnt  function  DSO
>  _______  _____  _____________________________  ___________________________________________________________
>
> 16993.00  32.4%  _raw_spin_unlock_irqrestore    [kernel.kallsyms]
>  5833.00  11.1%  clflush_cache_range            [kernel.kallsyms]
>  3315.00   6.3%  be_tx_compl_process            /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
>  1818.00   3.5%  kmem_cache_free                [kernel.kallsyms]
>  1415.00   2.7%  isc_rwlock_lock                /usr/lib64/libisc.so.62.0.1
>  1090.00   2.1%  be_poll_tx_mcc                 /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
>   811.00   1.5%  skb_release_head_state         [kernel.kallsyms]
>   772.00   1.5%  skb_release_data               [kernel.kallsyms]
>   712.00   1.4%  dns_rbt_findnode               /usr/lib64/libdns.so.69.0.1
>   703.00   1.3%  isc_rwlock_unlock              /usr/lib64/libisc.so.62.0.1
>   695.00   1.3%  dma_pte_clear_range            [kernel.kallsyms]
>   618.00   1.2%  kfree_skb                      [kernel.kallsyms]
>   597.00   1.1%  kfree                          [kernel.kallsyms]
>   553.00   1.1%  intel_unmap_page               [kernel.kallsyms]
>   531.00   1.0%  __do_softirq                   [kernel.kallsyms]
>   504.00   1.0%  isc_stats_increment            /usr/lib64/libisc.so.62.0.1
>   397.00   0.8%  virt_to_head_page              [kernel.kallsyms]
>   306.00   0.6%  _raw_spin_lock                 [kernel.kallsyms]
>   270.00   0.5%  domain_get_iommu               [kernel.kallsyms]
>   256.00   0.5%  dns_name_fullcompare           /usr/lib64/libdns.so.69.0.1
>   233.00   0.4%  find_first_bit                 [kernel.kallsyms]
>   222.00   0.4%  dns_name_equal                 /usr/lib64/libdns.so.69.0.1
>   218.00   0.4%  __pthread_mutex_lock_internal  /lib64/libpthread-2.12.so
>   207.00   0.4%  dns_rbtnodechain_init          /usr/lib64/libdns.so.69.0.1
>   196.00   0.4%  dns_acl_match                  /usr/lib64/libdns.so.69.0.1
>   194.00   0.4%  dma_pte_free_pagetable         [kernel.kallsyms]
>   192.00   0.4%  dns_name_getlabelsequence      /usr/lib64/libdns.so.69.0.1

^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 11:45 ` Eric Dumazet @ 2011-03-01 11:53 ` Herbert Xu 2011-03-01 12:32 ` Herbert Xu 2011-03-01 13:03 ` Eric Dumazet 2011-03-01 12:01 ` Thomas Graf 2011-03-01 12:18 ` Thomas Graf 2 siblings, 2 replies; 91+ messages in thread From: Herbert Xu @ 2011-03-01 11:53 UTC (permalink / raw) To: Eric Dumazet Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Tue, Mar 01, 2011 at 12:45:09PM +0100, Eric Dumazet wrote: > > CPU 11 handles all TX completions : Its a potential bottleneck. > > I might ressurect XPS patch ;) Actually this has been my gripe all along with our TX multiqueue support. We should not decide the queue based on the socket, but on the current CPU. We already do the right thing for forwarded packets because there is no socket to latch onto, we just need to fix it for locally generated traffic. The odd packet reordering each time your scheduler decides to migrate the process isn't a big deal IMHO. If your scheduler is constantly moving things you've got bigger problems to worry about. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 11:53 ` Herbert Xu @ 2011-03-01 12:32 ` Herbert Xu 2011-03-01 13:04 ` Eric Dumazet 2011-03-01 13:03 ` Eric Dumazet 1 sibling, 1 reply; 91+ messages in thread From: Herbert Xu @ 2011-03-01 12:32 UTC (permalink / raw) To: Eric Dumazet Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Tue, Mar 01, 2011 at 07:53:05PM +0800, Herbert Xu wrote: > On Tue, Mar 01, 2011 at 12:45:09PM +0100, Eric Dumazet wrote: > > > > CPU 11 handles all TX completions : Its a potential bottleneck. > > > > I might ressurect XPS patch ;) > > Actually this has been my gripe all along with our TX multiqueue > support. We should not decide the queue based on the socket, but > on the current CPU. > > We already do the right thing for forwarded packets because there > is no socket to latch onto, we just need to fix it for locally > generated traffic. > > The odd packet reordering each time your scheduler decides to > migrate the process isn't a big deal IMHO. If your scheduler > is constantly moving things you've got bigger problems to worry > about. If anybody wants to play here is a patch to do exactly that: net: Determine TX queue purely by current CPU Distributing packets generated on one CPU to multiple queues makes no sense. Nor does putting packets from multiple CPUs into a single queue. While this may introduce packet reordering should the scheduler decide to migrate a thread, it isn't a big deal because migration is meant to be a rare event, and nothing will die as long as the ordering doesn't occur all the time. 
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/core/dev.c b/net/core/dev.c
index 8ae6631..87bd20a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2164,22 +2164,12 @@ static u32 hashrnd __read_mostly;
 u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
 		  unsigned int num_tx_queues)
 {
-	u32 hash;
+	u32 hash = raw_smp_processor_id();
 
-	if (skb_rx_queue_recorded(skb)) {
-		hash = skb_get_rx_queue(skb);
-		while (unlikely(hash >= num_tx_queues))
-			hash -= num_tx_queues;
-		return hash;
-	}
+	while (unlikely(hash >= num_tx_queues))
+		hash -= num_tx_queues;
 
-	if (skb->sk && skb->sk->sk_hash)
-		hash = skb->sk->sk_hash;
-	else
-		hash = (__force u16) skb->protocol ^ skb->rxhash;
-	hash = jhash_1word(hash, hashrnd);
-
-	return (u16) (((u64) hash * num_tx_queues) >> 32);
+	return hash;
 }
 EXPORT_SYMBOL(__skb_tx_hash);

Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply related [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 12:32 ` Herbert Xu @ 2011-03-01 13:04 ` Eric Dumazet 2011-03-01 13:11 ` Herbert Xu 0 siblings, 1 reply; 91+ messages in thread From: Eric Dumazet @ 2011-03-01 13:04 UTC (permalink / raw) To: Herbert Xu Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev Le mardi 01 mars 2011 à 20:32 +0800, Herbert Xu a écrit : > On Tue, Mar 01, 2011 at 07:53:05PM +0800, Herbert Xu wrote: > > On Tue, Mar 01, 2011 at 12:45:09PM +0100, Eric Dumazet wrote: > > > > > > CPU 11 handles all TX completions: it's a potential bottleneck. > > > > > > I might resurrect the XPS patch ;) > > > > Actually this has been my gripe all along with our TX multiqueue > > support. We should not decide the queue based on the socket, but > > on the current CPU. > > > > We already do the right thing for forwarded packets because there > > is no socket to latch onto; we just need to fix it for locally > > generated traffic. > > > > The odd packet reordering each time your scheduler decides to > > migrate the process isn't a big deal IMHO. If your scheduler > > is constantly moving things you've got bigger problems to worry > > about. > > If anybody wants to play, here is a patch to do exactly that: > > net: Determine TX queue purely by current CPU > > Distributing packets generated on one CPU to multiple queues > makes no sense. Nor does putting packets from multiple CPUs > into a single queue. > > While this may introduce packet reordering should the scheduler > decide to migrate a thread, it isn't a big deal because migration > is meant to be a rare event, and nothing will die as long as the > reordering doesn't occur all the time. 
> > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 8ae6631..87bd20a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2164,22 +2164,12 @@ static u32 hashrnd __read_mostly;
>  u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
>  		  unsigned int num_tx_queues)
>  {
> -	u32 hash;
> +	u32 hash = raw_smp_processor_id();
> 
> -	if (skb_rx_queue_recorded(skb)) {
> -		hash = skb_get_rx_queue(skb);
> -		while (unlikely(hash >= num_tx_queues))
> -			hash -= num_tx_queues;
> -		return hash;
> -	}
> +	while (unlikely(hash >= num_tx_queues))
> +		hash -= num_tx_queues;
> 
> -	if (skb->sk && skb->sk->sk_hash)
> -		hash = skb->sk->sk_hash;
> -	else
> -		hash = (__force u16) skb->protocol ^ skb->rxhash;
> -	hash = jhash_1word(hash, hashrnd);
> -
> -	return (u16) (((u64) hash * num_tx_queues) >> 32);
> +	return hash;
>  }
>  EXPORT_SYMBOL(__skb_tx_hash);
> 
> Cheers,

Well, some machines have 4096 cpus ;)

^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 13:04 ` Eric Dumazet @ 2011-03-01 13:11 ` Herbert Xu 0 siblings, 0 replies; 91+ messages in thread From: Herbert Xu @ 2011-03-01 13:11 UTC (permalink / raw) To: Eric Dumazet Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Tue, Mar 01, 2011 at 02:04:29PM +0100, Eric Dumazet wrote: > Well, some machines have 4096 cpus ;) Well just change it to use the multiplication then :) -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 11:53 ` Herbert Xu 2011-03-01 12:32 ` Herbert Xu @ 2011-03-01 13:03 ` Eric Dumazet 2011-03-01 13:18 ` Herbert Xu 1 sibling, 1 reply; 91+ messages in thread From: Eric Dumazet @ 2011-03-01 13:03 UTC (permalink / raw) To: Herbert Xu Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev Le mardi 01 mars 2011 à 19:53 +0800, Herbert Xu a écrit : > On Tue, Mar 01, 2011 at 12:45:09PM +0100, Eric Dumazet wrote: > > > > CPU 11 handles all TX completions: it's a potential bottleneck. > > > > I might resurrect the XPS patch ;) > > Actually this has been my gripe all along with our TX multiqueue > support. We should not decide the queue based on the socket, but > on the current CPU. > > We already do the right thing for forwarded packets because there > is no socket to latch onto; we just need to fix it for locally > generated traffic. > I believe it's now done properly (in net-next-2.6) with commit 4f57c087de9b46182 (net: implement mechanism for HW based QOS) > The odd packet reordering each time your scheduler decides to > migrate the process isn't a big deal IMHO. If your scheduler > is constantly moving things you've got bigger problems to worry > about. Well, BENET has one TX queue anyway... ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 13:03 ` Eric Dumazet @ 2011-03-01 13:18 ` Herbert Xu 2011-03-01 13:52 ` Eric Dumazet 2011-03-01 16:31 ` Eric Dumazet 0 siblings, 2 replies; 91+ messages in thread From: Herbert Xu @ 2011-03-01 13:18 UTC (permalink / raw) To: Eric Dumazet Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Tue, Mar 01, 2011 at 02:03:29PM +0100, Eric Dumazet wrote: > > I believe its now done properly (in net-next-2.6) with commit > 4f57c087de9b46182 (net: implement mechanism for HW based QOS) Nope, that has nothing to do with this. > > The odd packet reordering each time your scheduler decides to > > migrate the process isn't a big deal IMHO. If your scheduler > > is constantly moving things you've got bigger problems to worry > > about. > > Well, BENET has one TX queue anyway... Interesting. So I wonder which lock is showing up at the top of the profile with a single socket then. As it's definitely going away with multiple sockets, that means it's not the TX queue lock. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 13:18 ` Herbert Xu @ 2011-03-01 13:52 ` Eric Dumazet 2011-03-01 13:58 ` Herbert Xu 2011-03-01 16:31 ` Eric Dumazet 1 sibling, 1 reply; 91+ messages in thread From: Eric Dumazet @ 2011-03-01 13:52 UTC (permalink / raw) To: Herbert Xu Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev Le mardi 01 mars 2011 à 21:18 +0800, Herbert Xu a écrit : > Interesting. So I wonder which lock is showing up at the top > of the profile with a single socket then. As it's definitely > going away with multiple sockets, that means it's not the TX > queue lock. > This CPU also runs the named process, so this is the socket lock and the receive queue lock. The named threads all do recvmsg()/sendmsg() in a loop, so all are waiting for a frame before doing some work. Because of the single receive queue, extra context switches occur (all threads but one have to sleep again per query). For about 80 kqps (standard linux-2.6 kernel, no patches), I have the following vmstat output:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd    free   buff   cache   si   so    bi    bo    in     cs us sy id wa
 4  1      0 2184048  63496 1595056    0    0     0  2060 64592 294528 19 11 67  4
 6  1      0 2184040  63496 1595056    0    0     0  1960 64686 293928 19 11 66  4
 3  1      0 2184040  63496 1595056    0    0     0  2344 64556 294268 20 11 66  4
 4  1      0 2184040  63496 1595056    0    0     0  2400 64626 293859 19 11 67  4

^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 13:52 ` Eric Dumazet @ 2011-03-01 13:58 ` Herbert Xu 0 siblings, 0 replies; 91+ messages in thread From: Herbert Xu @ 2011-03-01 13:58 UTC (permalink / raw) To: Eric Dumazet Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Tue, Mar 01, 2011 at 02:52:46PM +0100, Eric Dumazet wrote: > Le mardi 01 mars 2011 à 21:18 +0800, Herbert Xu a écrit : > > > Interesting. So I wonder which lock is showing up at the top > > of the profile with a single socket then. As it's definitely > > going away with multiple sockets, that means it's not the TX > > queue lock. > > > > This CPU also runs named process, so this is socket lock and receive > queue lock. It can't be the socket lock because it's an IRQ-disabling variant. The receive queue lock, I'll buy that. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 13:18 ` Herbert Xu 2011-03-01 13:52 ` Eric Dumazet @ 2011-03-01 16:31 ` Eric Dumazet 2011-03-02 0:23 ` Herbert Xu 1 sibling, 1 reply; 91+ messages in thread From: Eric Dumazet @ 2011-03-01 16:31 UTC (permalink / raw) To: Herbert Xu Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev Le mardi 01 mars 2011 à 21:18 +0800, Herbert Xu a écrit : > On Tue, Mar 01, 2011 at 02:03:29PM +0100, Eric Dumazet wrote: > > > > I believe it's now done properly (in net-next-2.6) with commit > > 4f57c087de9b46182 (net: implement mechanism for HW based QOS) > > Nope, that has nothing to do with this. Right, I was thinking of commit 1d24eb4815d1e0e8 (xps: Transmit Packet Steering). Now you say all this stuff should be replaced by "use this cpu number only", just because you have a multi-threaded process sending UDP frames through one socket... This won't work for tcp streams; you could imagine a multi-threaded application using a shared tcp socket as well. Too many OOO packets. ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 16:31 ` Eric Dumazet @ 2011-03-02 0:23 ` Herbert Xu 2011-03-02 2:00 ` Eric Dumazet 0 siblings, 1 reply; 91+ messages in thread From: Herbert Xu @ 2011-03-02 0:23 UTC (permalink / raw) To: Eric Dumazet Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Tue, Mar 01, 2011 at 05:31:24PM +0100, Eric Dumazet wrote: > > This wont work for tcp streams, you could imagine a multi-threaded > application using a shared tcp socket as well. Too many OOO packets. Think about it, a TCP socket cannot be used by a multi-threaded app in a scalable way. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-02 0:23 ` Herbert Xu @ 2011-03-02 2:00 ` Eric Dumazet 2011-03-02 2:39 ` Herbert Xu 0 siblings, 1 reply; 91+ messages in thread From: Eric Dumazet @ 2011-03-02 2:00 UTC (permalink / raw) To: Herbert Xu Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev Le mercredi 02 mars 2011 à 08:23 +0800, Herbert Xu a écrit : > On Tue, Mar 01, 2011 at 05:31:24PM +0100, Eric Dumazet wrote: > > > > This won't work for tcp streams; you could imagine a multi-threaded > > application using a shared tcp socket as well. Too many OOO packets. > > Think about it, a TCP socket cannot be used by a multi-threaded app > in a scalable way. Well... If you think about it, the SO_REUSEPORT patch has exactly the same goal: let each thread use a different socket, to scale without kernel limits. We can't modify TX selection each time we want to "fix" a problem without changing the user side (not adding an API), and as a side effect make non-optimal applications miserable. We added RPS and XPS, which work correctly if each socket is used by one thread. Maybe we need to add a user API, or automatically detect that a particular DGRAM socket is used by many different threads, to:

0) Decide OOO (out-of-order delivery) is ok for this workload (many threads issuing send() at the same time)
1) Set up several receive queues (up to num_possible_cpus())
2) Use an appropriate TX queue selection

^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-02 2:00 ` Eric Dumazet @ 2011-03-02 2:39 ` Herbert Xu 2011-03-02 2:56 ` Eric Dumazet 2011-03-02 7:12 ` Tom Herbert 0 siblings, 2 replies; 91+ messages in thread From: Herbert Xu @ 2011-03-02 2:39 UTC (permalink / raw) To: Eric Dumazet Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Wed, Mar 02, 2011 at 03:00:03AM +0100, Eric Dumazet wrote: > > > > Think about it, a TCP socket cannot be used by a multi-threaded app > > in a scalable way. > > Well... > > If you think about it, SO_REUSEPORT patch has exactly the same goal : UDP is a datagram protocol, TCP is not. Anyway, here is an alternate proposal. When a TCP socket transmits for the first time (SYN or SYN-ACK), we pick a queue based on CPU and store it in the socket. From then on we stick to that selection. We would only allow changes if we can ensure that all transmitted packets have left the queue. Or we just never change it like we do now. For datagram protocols we simply use the current CPU. > We added RPS and XPS that works correctly if each socket is used by one > thread. Maybe we need to add an user API or automatically detect a > particular DGRAM socket is used by many different threads to : No we don't need that for datagram protocols at all. By definition there is no ordering guarantee across CPUs for datagram sockets. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-02 2:39 ` Herbert Xu @ 2011-03-02 2:56 ` Eric Dumazet 2011-03-02 3:09 ` Herbert Xu 2011-03-02 7:12 ` Tom Herbert 1 sibling, 1 reply; 91+ messages in thread From: Eric Dumazet @ 2011-03-02 2:56 UTC (permalink / raw) To: Herbert Xu Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev Le mercredi 02 mars 2011 à 10:39 +0800, Herbert Xu a écrit : > UDP is a datagram protocol, TCP is not. > > Anyway, here is an alternate proposal. When a TCP socket transmits > for the first time (SYN or SYN-ACK), we pick a queue based on CPU and > store it in the socket. From then on we stick to that selection. > Many TCP apps I know use one thread to perform listen/accept and a pool of threads to handle each new conn. Anyway, the SYN-ACK is generated by softirq, not really a user choice. The CPU depends on whether the NIC is RX multiqueue or RPS is set up. All this discussion is about letting the process scheduler decide the TX queue (because the user/admin used cpu affinity), or letting the network stack drive the scheduler: please migrate this thread to this cpu. Both schemes should be allowed/configurable so that the best results are available. > We would only allow changes if we can ensure that all transmitted > packets have left the queue. Or we just never change it like we > do now. > We do change in case of dst/route change. Each device can have a different number of TX queues. ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-02 2:56 ` Eric Dumazet @ 2011-03-02 3:09 ` Herbert Xu 2011-03-02 3:44 ` Eric Dumazet 0 siblings, 1 reply; 91+ messages in thread From: Herbert Xu @ 2011-03-02 3:09 UTC (permalink / raw) To: Eric Dumazet Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Wed, Mar 02, 2011 at 03:56:38AM +0100, Eric Dumazet wrote: > > Anyway, the SYN-ACK is generated by softirq, not really user choice. > CPU depends if NIC is RX multiqueue or RPS is setup. Which is exactly what we want. The RX queue selection should determine the TX cpu. > All this discussion is about letting process scheduler decide TX queue, > (because user/admin used cpu affinity) or let network stack drive > scheduler : Please migrate this thread on this cpu. > > Both schems should be allowed/configurable so that best results are > available. Whatever scheme we end up with, hashing different sockets running in the same thread to different queues is just broken. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-02 3:09 ` Herbert Xu @ 2011-03-02 3:44 ` Eric Dumazet 0 siblings, 0 replies; 91+ messages in thread From: Eric Dumazet @ 2011-03-02 3:44 UTC (permalink / raw) To: Herbert Xu Cc: Thomas Graf, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev Le mercredi 02 mars 2011 à 11:09 +0800, Herbert Xu a écrit : > On Wed, Mar 02, 2011 at 03:56:38AM +0100, Eric Dumazet wrote: > > > > Anyway, the SYN-ACK is generated by softirq, not really user choice. > > CPU depends if NIC is RX multiqueue or RPS is setup. > > Which is exactly what we want. The RX queue selection should > determine the TX cpu. > This is working today with RFS/XPS. Or it should, indirectly. OOO problem is handled as well. ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-02 2:39 ` Herbert Xu 2011-03-02 2:56 ` Eric Dumazet @ 2011-03-02 7:12 ` Tom Herbert 2011-03-02 7:31 ` Herbert Xu 1 sibling, 1 reply; 91+ messages in thread From: Tom Herbert @ 2011-03-02 7:12 UTC (permalink / raw) To: Herbert Xu Cc: Eric Dumazet, Thomas Graf, David Miller, rick.jones2, wsommerfeld, daniel.baluta, netdev On Tue, Mar 1, 2011 at 6:39 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote: > On Wed, Mar 02, 2011 at 03:00:03AM +0100, Eric Dumazet wrote: >> > >> > Think about it, a TCP socket cannot be used by a multi-threaded app >> > in a scalable way. >> >> Well... >> >> If you think about it, SO_REUSEPORT patch has exactly the same goal : > In a sense. SO_REUSEPORT for TCP is intended to provide a scalable listener solution. Sharing an established socket is not very efficient; something like a multiplexing socket layer on top of TCP might be good. > UDP is a datagram protocol, TCP is not. > > Anyway, here is an alternate proposal. When a TCP socket transmits > for the first time (SYN or SYN-ACK), we pick a queue based on CPU and > store it in the socket. From then on we stick to that selection. > > We would only allow changes if we can ensure that all transmitted > packets have left the queue. Or we just never change it like we > do now. > XPS does all this already. > For datagram protocols we simply use the current CPU. > Probably need to set skb->ooo_okay (for UDP etc.) also so that XPS will change queues. Tom ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-02 7:12 ` Tom Herbert @ 2011-03-02 7:31 ` Herbert Xu 2011-03-02 8:04 ` Eric Dumazet 0 siblings, 1 reply; 91+ messages in thread From: Herbert Xu @ 2011-03-02 7:31 UTC (permalink / raw) To: Tom Herbert Cc: Eric Dumazet, Thomas Graf, David Miller, rick.jones2, wsommerfeld, daniel.baluta, netdev On Tue, Mar 01, 2011 at 11:12:29PM -0800, Tom Herbert wrote: > > Probably need to set skb->ooo_okay (for UDP etc.) also so that XPS > will change queues. Hmm, not quite. We still want to maintain packet ordering from the same CPU. That is, if I do two sendmsg calls from the same CPU, they should go into the same queue in that order. So this shouldn't just be a knob that says whether we can pick queues at random. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-02 7:31 ` Herbert Xu @ 2011-03-02 8:04 ` Eric Dumazet 2011-03-02 8:07 ` Herbert Xu 0 siblings, 1 reply; 91+ messages in thread From: Eric Dumazet @ 2011-03-02 8:04 UTC (permalink / raw) To: Herbert Xu Cc: Tom Herbert, Thomas Graf, David Miller, rick.jones2, wsommerfeld, daniel.baluta, netdev Le mercredi 02 mars 2011 à 15:31 +0800, Herbert Xu a écrit : > On Tue, Mar 01, 2011 at 11:12:29PM -0800, Tom Herbert wrote: > > > > Probably need to set skb->ooo_okay (for UDP etc.) also so that XPS > > will change queues. > > Hmm, not quite. We still want to maintain packet ordering from > the same CPU. That is, if I do two sendmsg calls from the same > CPU, they should go into the same queue in that order. > > So this shouldn't just be a knob that says whether we can pick > queues at random. > Not sure why two UDP packets from the same cpu should be sent on the same queue. - Some qdiscs do reorder packets anyway. - Some bonding setups use two links in round-robin mode (link aggregation) ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-02 8:04 ` Eric Dumazet @ 2011-03-02 8:07 ` Herbert Xu 2011-03-02 8:24 ` Eric Dumazet 0 siblings, 1 reply; 91+ messages in thread From: Herbert Xu @ 2011-03-02 8:07 UTC (permalink / raw) To: Eric Dumazet Cc: Tom Herbert, Thomas Graf, David Miller, rick.jones2, wsommerfeld, daniel.baluta, netdev On Wed, Mar 02, 2011 at 09:04:08AM +0100, Eric Dumazet wrote: > > Not sure why two UDP packets from the same cpu should be sent on same > queue. > > - Some qdisc do reorder packets anyway. Which qdisc reorders packets belonging to the same flow? > - Some bonding setups use two links in round-robin mode (link > aggregation) Just because the Internet may reorder things doesn't mean that we should. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-02 8:07 ` Herbert Xu @ 2011-03-02 8:24 ` Eric Dumazet 0 siblings, 0 replies; 91+ messages in thread From: Eric Dumazet @ 2011-03-02 8:24 UTC (permalink / raw) To: Herbert Xu Cc: Tom Herbert, Thomas Graf, David Miller, rick.jones2, wsommerfeld, daniel.baluta, netdev Le mercredi 02 mars 2011 à 16:07 +0800, Herbert Xu a écrit : > On Wed, Mar 02, 2011 at 09:04:08AM +0100, Eric Dumazet wrote: > > > > Not sure why two UDP packets from the same cpu should be sent on the same > > queue. > > > > - Some qdiscs do reorder packets anyway. > > Which qdisc reorders packets belonging to the same flow? > Hmm, to be fair you did not specify "same flow", and /sbin/named answers are usually one packet long... How are we going to detect flows in sendto() calls? Just kidding. If you want to push your patch, I suspect a dynamic per_cpu variable is needed per TX-multiqueue device, so that "current cpu -> txq number" is one instruction. ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 11:45 ` Eric Dumazet 2011-03-01 11:53 ` Herbert Xu @ 2011-03-01 12:01 ` Thomas Graf 2011-03-01 12:15 ` Herbert Xu 2011-03-01 13:27 ` Herbert Xu 2011-03-01 12:18 ` Thomas Graf 2 siblings, 2 replies; 91+ messages in thread From: Thomas Graf @ 2011-03-01 12:01 UTC (permalink / raw) To: Eric Dumazet Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Tue, Mar 01, 2011 at 12:45:09PM +0100, Eric Dumazet wrote:

This is what perf top looks like with SO_REUSEPORT:

----------------------------------------------------------------------------------------------------------------------------------
   PerfTop: 27498 irqs/sec  kernel:50.5%  exact: 0.0% [1000Hz cpu-clock-msecs], (all, CPU: 1)
----------------------------------------------------------------------------------------------------------------------------------

  samples   pcnt  function                       DSO
  _______   ____  _____________________________  __________________

 16464.00   6.0%  isc_rwlock_lock                libisc.so.62.0.1
 15462.00   5.7%  intel_idle                     [kernel.kallsyms]
 13140.00   4.8%  _spin_unlock_irqrestore        [kernel.kallsyms]
  9283.00   3.4%  __do_softirq                   [kernel.kallsyms]
  8469.00   3.1%  finish_task_switch             [kernel.kallsyms]
  8189.00   3.0%  __udp4_lib_lookup              [kernel.kallsyms]
  8096.00   3.0%  dns_rbt_findnode               libdns.so.69.0.1
  7619.00   2.8%  isc_rwlock_unlock              libisc.so.62.0.1
  5090.00   1.9%  isc_stats_increment            libisc.so.62.0.1
  4325.00   1.6%  tick_nohz_stop_sched_tick      [kernel.kallsyms]
  3656.00   1.3%  _spin_lock                     [kernel.kallsyms]
  3540.00   1.3%  __pthread_mutex_lock_internal  libpthread-2.12.so
  3168.00   1.2%  _spin_lock_bh                  [kernel.kallsyms]
  2576.00   0.9%  dns_name_fullcompare           libdns.so.69.0.1
  2492.00   0.9%  __pthread_mutex_unlock         libpthread-2.12.so
  2486.00   0.9%  isc___mempool_get              libisc.so.62.0.1
  2475.00   0.9%  dns_rbtnodechain_init          libdns.so.69.0.1
  2454.00   0.9%  be_poll_rx                     [be2net]
  2417.00   0.9%  sk_run_filter                  [kernel.kallsyms]
  2411.00   0.9%  tick_nohz_restart_sched_tick   [kernel.kallsyms]
  2331.00   0.9%  dns_name_equal                 libdns.so.69.0.1
  2198.00   0.8%  net_rx_action                  [kernel.kallsyms]
  2135.00   0.8%  fget_light                     [kernel.kallsyms]
  2130.00   0.8%  dns_zone_attach                libdns.so.69.0.1
  2073.00   0.8%  dns_name_getlabelsequence      libdns.so.69.0.1
  2024.00   0.7%  copy_user_generic_string       [kernel.kallsyms]
  2003.00   0.7%  dns_acl_match                  libdns.so.69.0.1
  1868.00   0.7%  be_xmit                        [be2net]

----------------------------------------------------------------------------------------------------------------------------------
   PerfTop: 16206 irqs/sec  kernel:88.6%  exact: 0.0% [1000Hz cpu-clock-msecs], (all, CPU: 3)
----------------------------------------------------------------------------------------------------------------------------------

  samples   pcnt  function                       DSO
  _______   ____  _____________________________  __________________

 15662.00  11.3%  __udp4_lib_lookup              [kernel.kallsyms]
 10404.00   7.5%  intel_idle                     [kernel.kallsyms]
 10248.00   7.4%  _spin_unlock_irqrestore        [kernel.kallsyms]
  4386.00   3.2%  __do_softirq                   [kernel.kallsyms]
  4324.00   3.1%  be_poll_rx                     [be2net]
  4165.00   3.0%  get_rx_page_info               [be2net]
  4050.00   2.9%  get_page_from_freelist         [kernel.kallsyms]
  4045.00   2.9%  finish_task_switch             [kernel.kallsyms]
  3861.00   2.8%  sk_run_filter                  [kernel.kallsyms]
  3544.00   2.5%  ip_route_input                 [kernel.kallsyms]
  3385.00   2.4%  _spin_lock                     [kernel.kallsyms]
  2583.00   1.9%  get_rps_cpu                    [kernel.kallsyms]
  2042.00   1.5%  tick_nohz_stop_sched_tick      [kernel.kallsyms]
  1971.00   1.4%  kmem_cache_alloc_node_notrace  [kernel.kallsyms]
  1788.00   1.3%  _read_lock                     [kernel.kallsyms]
  1777.00   1.3%  __netif_receive_skb            [kernel.kallsyms]
  1777.00   1.3%  isc_rwlock_lock                libisc.so.62.0.1
  1769.00   1.3%  memcpy_c                       [kernel.kallsyms]
  1618.00   1.2%  __alloc_skb                    [kernel.kallsyms]
  1591.00   1.1%  __pthread_mutex_lock_internal  libpthread-2.12.so
  1576.00   1.1%  kmem_cache_alloc_node          [kernel.kallsyms]
  1450.00   1.0%  sock_queue_rcv_skb             [kernel.kallsyms]
  1427.00   1.0%  tick_nohz_restart_sched_tick   [kernel.kallsyms]
  1214.00   0.9%  __udp4_lib_rcv                 [kernel.kallsyms]
  1124.00   0.8%  net_rx_action                  [kernel.kallsyms]
  1113.00   0.8%  getnstimeofday                 [kernel.kallsyms]
  1072.00   0.8%  selinux_socket_sock_rcv_skb    [kernel.kallsyms]
  1016.00   0.7%  ip_rcv                         [kernel.kallsyms]
   992.00   0.7%  sock_def_readable              [kernel.kallsyms]
   961.00   0.7%  dns_rbt_findnode               libdns.so.69.0.1
   899.00   0.6%  fget                           [kernel.kallsyms]
   898.00   0.6%  datagram_poll                  [kernel.kallsyms]
   809.00   0.6%  isc_rwlock_unlock              libisc.so.62.0.1
   803.00   0.6%  __alloc_pages_nodemask         [kernel.kallsyms]
   799.00   0.6%  udp_queue_rcv_skb              [kernel.kallsyms]
   694.00   0.5%  packet_rcv                     [kernel.kallsyms]
   662.00   0.5%  mutex_lock                     [kernel.kallsyms]

------------------------------------------------------------------------------------------------------------------------------------------
   PerfTop: 31619 irqs/sec  kernel:37.7%  exact: 0.0% [1000Hz cpu-clock-msecs], (all, CPU: 10)
------------------------------------------------------------------------------------------------------------------------------------------

  samples   pcnt  function                       DSO
  _______   ____  _____________________________  __________________

  6726.00   7.7%  isc_rwlock_lock                libisc.so.62.0.1
  4597.00   5.3%  _spin_unlock_irqrestore        [kernel.kallsyms]
  4230.00   4.9%  intel_idle                     [kernel.kallsyms]
  3319.00   3.8%  dns_rbt_findnode               libdns.so.69.0.1
  3178.00   3.7%  isc_rwlock_unlock              libisc.so.62.0.1
  2682.00   3.1%  finish_task_switch             [kernel.kallsyms]
  2164.00   2.5%  isc_stats_increment            libisc.so.62.0.1
  1435.00   1.7%  tick_nohz_stop_sched_tick      [kernel.kallsyms]
  1407.00   1.6%  _spin_lock_bh                  [kernel.kallsyms]
  1288.00   1.5%  __pthread_mutex_lock_internal  libpthread-2.12.so
  1264.00   1.5%  copy_user_generic_string       [kernel.kallsyms]
  1082.00   1.2%  _spin_lock                     [kernel.kallsyms]
  1061.00   1.2%  be_xmit                        [be2net]
  1024.00   1.2%  __pthread_mutex_unlock         libpthread-2.12.so
  1014.00   1.2%  dns_rbtnodechain_init          libdns.so.69.0.1
   989.00   1.1%  isc___mempool_get              libisc.so.62.0.1
   964.00   1.1%  dns_name_equal                 libdns.so.69.0.1
   957.00   1.1%  dns_name_getlabelsequence      libdns.so.69.0.1
   944.00   1.1%  dns_name_fullcompare           libdns.so.69.0.1
   858.00   1.0%  dns_zone_attach                libdns.so.69.0.1
   793.00   0.9%  udp_sendmsg                    [kernel.kallsyms]
   785.00   0.9%  tick_nohz_restart_sched_tick   [kernel.kallsyms]
   784.00   0.9%  dns_acl_match                  libdns.so.69.0.1
   776.00   0.9%  fget_light                     [kernel.kallsyms]
   723.00   0.8%  dns_name_hash                  libdns.so.69.0.1
   691.00   0.8%  dns_message_rendersection      libdns.so.69.0.1
   675.00   0.8%  dns_name_fromwire              libdns.so.69.0.1
   658.00   0.8%  udp_recvmsg                    [kernel.kallsyms]
   646.00   0.7%  kmem_cache_free                [kernel.kallsyms]
   641.00   0.7%  kfree                          [kernel.kallsyms]
   535.00   0.6%  isc_radix_search               libisc.so.62.0.1
   531.00   0.6%  dev_queue_xmit                 [kernel.kallsyms]

-------------------------------------------------------------------------------------------------------------------------------------------
   PerfTop: 31136 irqs/sec  kernel:48.3%  exact: 0.0% [1000Hz cpu-clock-msecs], (all, CPU: 11)
-------------------------------------------------------------------------------------------------------------------------------------------

  samples   pcnt  function                       DSO
  _______   ____  _____________________________  __________________

 13043.00   6.0%  isc_rwlock_lock                libisc.so.62.0.1
 10852.00   5.0%  _spin_unlock_irqrestore        [kernel.kallsyms]
 10538.00   4.9%  be_tx_compl_process            [be2net]
  8275.00   3.8%  kfree                          [kernel.kallsyms]
  6467.00   3.0%  kmem_cache_free                [kernel.kallsyms]
  6453.00   3.0%  dns_rbt_findnode               libdns.so.69.0.1
  6423.00   3.0%  intel_idle                     [kernel.kallsyms]
  6199.00   2.9%  isc_rwlock_unlock              libisc.so.62.0.1
  5492.00   2.5%  sock_wfree                     [kernel.kallsyms]
  5372.00   2.5%  finish_task_switch             [kernel.kallsyms]
  5321.00   2.4%  kfree_skb                      [kernel.kallsyms]
  4030.00   1.9%  isc_stats_increment            libisc.so.62.0.1
  3820.00   1.8%  skb_release_data               [kernel.kallsyms]
  3518.00   1.6%  be_poll_tx_mcc                 [be2net]
  3034.00   1.4%  sock_def_write_space           [kernel.kallsyms]
  2599.00   1.2%  __do_softirq                   [kernel.kallsyms]
  2572.00   1.2%  tick_nohz_stop_sched_tick      [kernel.kallsyms]
  2519.00   1.2%  __pthread_mutex_lock_internal  libpthread-2.12.so
  2497.00   1.1%  _spin_lock_bh                  [kernel.kallsyms]
  2045.00   0.9%  dns_name_fullcompare           libdns.so.69.0.1
  1960.00   0.9%  isc___mempool_get              libisc.so.62.0.1
  1873.00   0.9%  dns_rbtnodechain_init          libdns.so.69.0.1
  1861.00   0.9%  _spin_lock                     [kernel.kallsyms]
  1806.00   0.8%  __pthread_mutex_unlock         libpthread-2.12.so
  1791.00   0.8%  dns_name_equal                 libdns.so.69.0.1
  1757.00   0.8%  dns_zone_attach                libdns.so.69.0.1
  1621.00   0.7%  dns_name_getlabelsequence      libdns.so.69.0.1
  1576.00   0.7%  fget_light                     [kernel.kallsyms]
  1532.00   0.7%  dns_acl_match                  libdns.so.69.0.1
  1515.00   0.7%  tick_nohz_restart_sched_tick   [kernel.kallsyms]
  1510.00   0.7%  be_xmit                        [be2net]

^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 12:01 ` Thomas Graf @ 2011-03-01 12:15 ` Herbert Xu 2011-03-01 13:27 ` Herbert Xu 1 sibling, 0 replies; 91+ messages in thread From: Herbert Xu @ 2011-03-01 12:15 UTC (permalink / raw) To: Eric Dumazet, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

On Tue, Mar 01, 2011 at 07:01:12AM -0500, Thomas Graf wrote:
> On Tue, Mar 01, 2011 at 12:45:09PM +0100, Eric Dumazet wrote:
> > This is how perf top looks like with SO_REUSEPORT

Yeah I think Eric is spot on.  The remaining bottleneck is because we
hash all outbound packets from a single socket to a single TX queue,
despite the fact that they were produced on different CPUs.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread
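For readers who want to reproduce the setup being benchmarked here: with the SO_REUSEPORT behaviour Tom's patch adds (and which was eventually merged in mainline Linux 3.9), each DNS worker simply opens its own UDP socket and binds the same port, and the kernel spreads incoming datagrams across the sockets. A minimal sketch, assuming a kernel that supports the option (the numeric fallback `15` is the Linux value from `include/asm-generic/socket.h`):

```python
import socket

def make_worker_socket(port):
    """Open a UDP socket that shares `port` with other worker sockets.

    Requires kernel support for SO_REUSEPORT (the patch discussed in
    this thread; merged in mainline Linux 3.9).
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Older Python builds may not expose the constant even when the
    # running kernel supports it; 15 is the Linux value.
    so_reuseport = getattr(socket, "SO_REUSEPORT", 15)
    s.setsockopt(socket.SOL_SOCKET, so_reuseport, 1)
    s.bind(("127.0.0.1", port))
    return s

a = make_worker_socket(0)        # kernel picks a free port
port = a.getsockname()[1]
b = make_worker_socket(port)     # second listener on the SAME port
print("two sockets sharing port", port)
```

Without SO_REUSEPORT the second `bind()` would fail with EADDRINUSE; with it, both binds succeed and the kernel load-balances datagrams between the two sockets.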
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 12:01 ` Thomas Graf 2011-03-01 12:15 ` Herbert Xu @ 2011-03-01 13:27 ` Herbert Xu 1 sibling, 0 replies; 91+ messages in thread From: Herbert Xu @ 2011-03-01 13:27 UTC (permalink / raw) To: Eric Dumazet, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

On Tue, Mar 01, 2011 at 07:01:12AM -0500, Thomas Graf wrote:
> On Tue, Mar 01, 2011 at 12:45:09PM +0100, Eric Dumazet wrote:
> > This is how perf top looks like with SO_REUSEPORT
>
> ----------------------------------------------------------------------------------------------------------------------------------
>    PerfTop:   27498 irqs/sec  kernel:50.5%  exact: 0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 1)
> ----------------------------------------------------------------------------------------------------------------------------------
>
>     samples  pcnt  function                       DSO
>     _______  _____  _____________________________  __________________
>
>    16464.00  6.0%  isc_rwlock_lock                libisc.so.62.0.1
>    15462.00  5.7%  intel_idle                     [kernel.kallsyms]

So was this a RHEL6 kernel?  I wonder if that is what's making it
perform better.  I guess we'll find out tomorrow.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 11:45 ` Eric Dumazet 2011-03-01 11:53 ` Herbert Xu 2011-03-01 12:01 ` Thomas Graf @ 2011-03-01 12:18 ` Thomas Graf 2011-03-01 12:19 ` Herbert Xu 2 siblings, 1 reply; 91+ messages in thread From: Thomas Graf @ 2011-03-01 12:18 UTC (permalink / raw) To: Eric Dumazet Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

On Tue, Mar 01, 2011 at 12:45:09PM +0100, Eric Dumazet wrote:
> ethtool -S eth0 | grep rx_pk
> rxq0: rx_pkts: ??
> rxq1: rx_pkts: ??
> rxq2: rx_pkts: ??
> rxq3: rx_pkts: ??
> rxq4: rx_pkts: ??

It could do multiqueue but it doesn't:

[root@hp-bl460cg7-01 ~]# ethtool -S eth0 | grep rx_pk
     rxq0: rx_pkts: 1512
     rxq1: rx_pkts: 462
     rxq2: rx_pkts: 122
     rxq3: rx_pkts: 24751393
     rxq4: rx_pkts: 35

So, adding a third client making sure it would hit another queue:

     rxq0: rx_pkts: 3041
     rxq1: rx_pkts: 867
     rxq2: rx_pkts: 4610476
     rxq3: rx_pkts: 57418776
     rxq4: rx_pkts: 40

... makes it use CPU 5 for rxq2 and the qps goes up from 250kqps to 270kqps

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
   PerfTop:   18417 irqs/sec  kernel:50.2%  exact: 0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 5)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------

    samples  pcnt  function                       DSO
    _______  _____  _____________________________  ___________________________________________________________

   12712.00  18.5%  _raw_spin_unlock_irqrestore    [kernel.kallsyms]
    3697.00   5.4%  isc_rwlock_lock                /usr/lib64/libisc.so.62.0.1
    1948.00   2.8%  dns_rbt_findnode               /usr/lib64/libdns.so.69.0.1
    1809.00   2.6%  isc_rwlock_unlock              /usr/lib64/libisc.so.62.0.1
    1631.00   2.4%  __do_softirq                   [kernel.kallsyms]
    1237.00   1.8%  isc_stats_increment            /usr/lib64/libisc.so.62.0.1
    1106.00   1.6%  clflush_cache_range            [kernel.kallsyms]
     964.00   1.4%  _raw_spin_lock                 [kernel.kallsyms]
     714.00   1.0%  be_poll_rx                     /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
     630.00   0.9%  __pthread_mutex_lock_internal  /lib64/libpthread-2.12.so
     627.00   0.9%  dns_name_fullcompare           /usr/lib64/libdns.so.69.0.1
     582.00   0.8%  dns_rbtnodechain_init          /usr/lib64/libdns.so.69.0.1
     552.00   0.8%  sk_run_filter                  [kernel.kallsyms]
     527.00   0.8%  dns_name_getlabelsequence      /usr/lib64/libdns.so.69.0.1
     525.00   0.8%  __pthread_mutex_unlock         /lib64/libpthread-2.12.so
     492.00   0.7%  dns_name_equal                 /usr/lib64/libdns.so.69.0.1
     468.00   0.7%  isc___mempool_get              /usr/lib64/libisc.so.62.0.1
     462.00   0.7%  __udp4_lib_lookup              [kernel.kallsyms]
     457.00   0.7%  dns_acl_match                  /usr/lib64/libdns.so.69.0.1
     453.00   0.7%  dns_zone_attach                /usr/lib64/libdns.so.69.0.1
     451.00   0.7%  fget_light                     [kernel.kallsyms]
     443.00   0.6%  dns_message_rendersection      /usr/lib64/libdns.so.69.0.1
     431.00   0.6%  ipt_do_table                   [kernel.kallsyms]
     429.00   0.6%  nf_iterate                     [kernel.kallsyms]
     422.00   0.6%  __kmalloc_node_track_caller    [kernel.kallsyms]
     408.00   0.6%  __domain_mapping               [kernel.kallsyms]
     387.00   0.6%  dns_name_hash                  /usr/lib64/libdns.so.69.0.1
     353.00   0.5%  copy_user_generic_string       [kernel.kallsyms]
     349.00   0.5%  dns_name_fromwire              /usr/lib64/libdns.so.69.0.1

^ permalink raw reply	[flat|nested] 91+ messages in thread
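Checking whether receive load is actually spread across queues, as Thomas does above, is easy to script. A small sketch that parses `ethtool -S` output (the counter names follow be2net's `rxqN: rx_pkts` convention shown above; other drivers name their per-queue statistics differently):

```python
import re

def rxq_distribution(ethtool_output):
    """Parse `ethtool -S` output into {queue index: rx packet count}."""
    counts = {}
    for line in ethtool_output.splitlines():
        m = re.match(r"\s*rxq(\d+):\s*rx_pkts:\s*(\d+)\s*$", line)
        if m:
            counts[int(m.group(1))] = int(m.group(2))
    return counts

# Counters from the first run above: almost everything lands on rxq3,
# so only one CPU is doing receive work.
sample = """\
rxq0: rx_pkts: 1512
rxq1: rx_pkts: 462
rxq2: rx_pkts: 122
rxq3: rx_pkts: 24751393
rxq4: rx_pkts: 35
"""
dist = rxq_distribution(sample)
busiest = max(dist, key=dist.get)
share = dist[busiest] / sum(dist.values())
print("rxq%d carries %.1f%% of all packets" % (busiest, 100 * share))
```

A skew like this usually means all test clients hash to the same RX queue (few source address/port tuples), which is exactly why adding a third client above moved traffic onto rxq2.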
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 12:18 ` Thomas Graf @ 2011-03-01 12:19 ` Herbert Xu 2011-03-01 13:50 ` Thomas Graf 0 siblings, 1 reply; 91+ messages in thread From: Herbert Xu @ 2011-03-01 12:19 UTC (permalink / raw) To: Eric Dumazet, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

On Tue, Mar 01, 2011 at 07:18:29AM -0500, Thomas Graf wrote:
>
> ... makes it use CPU 5 for rxq2 and the qps goes up from 250kqps to 270kqps

I think the increase here comes from the larger number of packets
in flight more than anything.

The bottleneck is still the TX queue (both software and hardware).

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 12:19 ` Herbert Xu @ 2011-03-01 13:50 ` Thomas Graf 2011-03-01 14:06 ` Eric Dumazet 0 siblings, 1 reply; 91+ messages in thread From: Thomas Graf @ 2011-03-01 13:50 UTC (permalink / raw) To: Herbert Xu Cc: Eric Dumazet, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

On Tue, Mar 01, 2011 at 08:19:51PM +0800, Herbert Xu wrote:
> On Tue, Mar 01, 2011 at 07:18:29AM -0500, Thomas Graf wrote:
> >
> > ... makes it use CPU 5 for rxq2 and the qps goes up from 250kqps to 270kqps
>
> I think the increase here comes from the larger number of packets
> in flight more than anything.
>
> The bottleneck is still the TX queue (both software and hardware).

Disabled netfilter and reran test

Now does ~316kqps (rx was split over 2 queues)

----------------------------------------------------------------------------------------------------------------------
   PerfTop:   30608 irqs/sec  kernel:66.1%  exact: 0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 1)
----------------------------------------------------------------------------------------------------------------------

    samples  pcnt  function                       DSO
    _______  _____  _____________________________  ___________________________________________________________

   19237.00  5.6%  _raw_spin_unlock_irqrestore    /lib/modules/2.6.38-rc5+/build/vmlinux
   17170.00  5.0%  get_rx_page_info               /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
   11411.00  3.3%  be_poll_rx                     /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
   11320.00  3.3%  isc_rwlock_lock                /usr/lib64/libisc.so.62.0.1
   10669.00  3.1%  __do_softirq                   /lib/modules/2.6.38-rc5+/build/vmlinux
   10655.00  3.1%  get_page_from_freelist         /lib/modules/2.6.38-rc5+/build/vmlinux
    9523.00  2.8%  intel_idle                     /lib/modules/2.6.38-rc5+/build/vmlinux
    8677.00  2.5%  __udp4_lib_lookup              /lib/modules/2.6.38-rc5+/build/vmlinux
    8379.00  2.4%  sock_queue_rcv_skb             /lib/modules/2.6.38-rc5+/build/vmlinux
    8226.00  2.4%  sk_run_filter                  /lib/modules/2.6.38-rc5+/build/vmlinux
    6724.00  1.9%  __netif_receive_skb            /lib/modules/2.6.38-rc5+/build/vmlinux
    6553.00  1.9%  __alloc_skb                    /lib/modules/2.6.38-rc5+/build/vmlinux
    6205.00  1.8%  udp_queue_rcv_skb              /lib/modules/2.6.38-rc5+/build/vmlinux
    6038.00  1.7%  _raw_spin_lock                 /lib/modules/2.6.38-rc5+/build/vmlinux
    5868.00  1.7%  isc_rwlock_unlock              /usr/lib64/libisc.so.62.0.1
    5696.00  1.6%  dns_rbt_findnode               /usr/lib64/libdns.so.69.0.1
    5647.00  1.6%  read_tsc                       /lib/modules/2.6.38-rc5+/build/vmlinux
    5633.00  1.6%  getnstimeofday                 /lib/modules/2.6.38-rc5+/build/vmlinux
    5448.00  1.6%  kmem_cache_alloc_node_trace    /lib/modules/2.6.38-rc5+/build/vmlinux
    5272.00  1.5%  finish_task_switch             /lib/modules/2.6.38-rc5+/build/vmlinux
    4719.00  1.4%  sock_def_readable              /lib/modules/2.6.38-rc5+/build/vmlinux
    4002.00  1.2%  is_swiotlb_buffer              /lib/modules/2.6.38-rc5+/build/vmlinux
    3914.00  1.1%  memcpy                         /lib/modules/2.6.38-rc5+/build/vmlinux
    3717.00  1.1%  isc_stats_increment            /usr/lib64/libisc.so.62.0.1
    3706.00  1.1%  __udp4_lib_rcv                 /lib/modules/2.6.38-rc5+/build/vmlinux
    3653.00  1.1%  ip_rcv                         /lib/modules/2.6.38-rc5+/build/vmlinux
    3598.00  1.0%  kmem_cache_alloc_node          /lib/modules/2.6.38-rc5+/build/vmlinux
    3407.00  1.0%  ip_route_input_common          /lib/modules/2.6.38-rc5+/build/vmlinux
    2683.00  0.8%  be_post_rx_frags               /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
    2666.00  0.8%  __pthread_mutex_lock_internal  /lib64/libpthread-2.12.so
    2331.00  0.7%  __phys_addr                    /lib/modules/2.6.38-rc5+/build/vmlinux
    2230.00  0.6%  __alloc_pages_nodemask         /lib/modules/2.6.38-rc5+/build/vmlinux
    2023.00  0.6%  dns_name_fullcompare           /usr/lib64/libdns.so.69.0.1
    1972.00  0.6%  packet_rcv                     /lib/modules/2.6.38-rc5+/build/vmlinux
    1902.00  0.6%  eth_type_trans                 /lib/modules/2.6.38-rc5+/build/vmlinux
    1860.00  0.5%  __pthread_mutex_unlock         /lib64/libpthread-2.12.so
    1804.00  0.5%  fget_light                     /lib/modules/2.6.38-rc5+/build/vmlinux
    1739.00  0.5%  alloc_pages_current            /lib/modules/2.6.38-rc5+/build/vmlinux
    1736.00  0.5%  dns_rbtnodechain_init          /usr/lib64/libdns.so.69.0.1

----------------------------------------------------------------------------------------------------------------------
   PerfTop:   29038 irqs/sec  kernel:48.0%  exact: 0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 11)
----------------------------------------------------------------------------------------------------------------------

    samples  pcnt  function                       DSO
    _______  _____  _____________________________  ___________________________________________________________

   12833.00  7.5%  intel_idle                     /lib/modules/2.6.38-rc5+/build/vmlinux
   10771.00  6.3%  isc_rwlock_lock                /usr/lib64/libisc.so.62.0.1
    8713.00  5.1%  be_tx_compl_process            /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
    6452.00  3.8%  kfree                          /lib/modules/2.6.38-rc5+/build/vmlinux
    5935.00  3.5%  skb_release_data               /lib/modules/2.6.38-rc5+/build/vmlinux
    5552.00  3.2%  kmem_cache_free                /lib/modules/2.6.38-rc5+/build/vmlinux
    5292.00  3.1%  isc_rwlock_unlock              /usr/lib64/libisc.so.62.0.1
    4893.00  2.9%  dns_rbt_findnode               /usr/lib64/libdns.so.69.0.1
    4413.00  2.6%  kfree_skb                      /lib/modules/2.6.38-rc5+/build/vmlinux
    3802.00  2.2%  be_poll_tx_mcc                 /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
    3515.00  2.1%  isc_stats_increment            /usr/lib64/libisc.so.62.0.1
    3016.00  1.8%  _raw_spin_unlock_irqrestore    /lib/modules/2.6.38-rc5+/build/vmlinux
    2202.00  1.3%  __do_softirq                   /lib/modules/2.6.38-rc5+/build/vmlinux
    2027.00  1.2%  _raw_spin_lock                 /lib/modules/2.6.38-rc5+/build/vmlinux
    1935.00  1.1%  finish_task_switch             /lib/modules/2.6.38-rc5+/build/vmlinux
    1906.00  1.1%  __pthread_mutex_lock_internal  /lib64/libpthread-2.12.so
    1837.00  1.1%  dns_name_fullcompare           /usr/lib64/libdns.so.69.0.1
    1702.00  1.0%  dns_rbtnodechain_init          /usr/lib64/libdns.so.69.0.1
    1561.00  0.9%  fget_light                     /lib/modules/2.6.38-rc5+/build/vmlinux
    1559.00  0.9%  dns_name_getlabelsequence      /usr/lib64/libdns.so.69.0.1
    1491.00  0.9%  dns_name_equal                 /usr/lib64/libdns.so.69.0.1
    1464.00  0.9%  __pthread_mutex_unlock         /lib64/libpthread-2.12.so
    1454.00  0.9%  dns_acl_match                  /usr/lib64/libdns.so.69.0.1
    1293.00  0.8%  dns_zone_attach                /usr/lib64/libdns.so.69.0.1
    1245.00  0.7%  be_xmit                        /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
    1159.00  0.7%  dns_message_rendersection      /usr/lib64/libdns.so.69.0.1
    1115.00  0.7%  isc___mempool_get              /usr/lib64/libisc.so.62.0.1
    1100.00  0.6%  copy_user_generic_string       /lib/modules/2.6.38-rc5+/build/vmlinux
    1030.00  0.6%  dns_name_fromwire              /usr/lib64/libdns.so.69.0.1
    1015.00  0.6%  dns_name_hash                  /usr/lib64/libdns.so.69.0.1
    1013.00  0.6%  isc_radix_search               /usr/lib64/libisc.so.62.0.1
     970.00  0.6%  __ip_route_output_key          /lib/modules/2.6.38-rc5+/build/vmlinux
     917.00  0.5%  fput                           /lib/modules/2.6.38-rc5+/build/vmlinux
     817.00  0.5%  dev_queue_xmit                 /lib/modules/2.6.38-rc5+/build/vmlinux
     812.00  0.5%  sk_run_filter                  /lib/modules/2.6.38-rc5+/build/vmlinux
     806.00  0.5%  avc_has_perm_noaudit           /lib/modules/2.6.38-rc5+/build/vmlinux
     802.00  0.5%  sock_wfree                     /lib/modules/2.6.38-rc5+/build/vmlinux
     793.00  0.5%  dns_name_towire                /usr/lib64/libdns.so.69.0.1
     754.00  0.4%  sock_alloc_send_pskb           /lib/modules/2.6.38-rc5+/build/vmlinux
     752.00  0.4%  dns_message_parse              /usr/lib64/libdns.so.69.0.1
     749.00  0.4%  dns_rdata_towire               /usr/lib64/libdns.so.69.0.1
     728.00  0.4%  dns_rdataset_init              /usr/lib64/libdns.so.69.0.1
     709.00  0.4%  isc___mempool_put              /usr/lib64/libisc.so.62.0.1
     699.00  0.4%  skb_release_head_state         /lib/modules/2.6.38-rc5+/build/vmlinux
     685.00  0.4%  _raw_spin_lock_bh              /lib/modules/2.6.38-rc5+/build/vmlinux
     683.00  0.4%  dns_name_concatenate           /usr/lib64/libdns.so.69.0.1
     678.00  0.4%  __ip_append_data               /lib/modules/2.6.38-rc5+/build/vmlinux
     673.00  0.4%  tick_nohz_stop_sched_tick      /lib/modules/2.6.38-rc5+/build/vmlinux
     662.00  0.4%  sys_sendmsg                    /lib/modules/2.6.38-rc5+/build/vmlinux
     654.00  0.4%  dns_compress_findglobal        /usr/lib64/libdns.so.69.0.1
     654.00  0.4%  memcpy                         /lib64/libc-2.12.so
     637.00  0.4%  dns_compress_invalidate        /usr/lib64/libdns.so.69.0.1
     597.00  0.3%  isc__buffer_init               /usr/lib64/libisc.so.62.0.1
     595.00  0.3%  dns_zone_detach                /usr/lib64/libdns.so.69.0.1

WARNING: failed to keep up with mmap data.
WARNING: failed to keep up with mmap data.

----------------------------------------------------------------------------------------------------------------------
   PerfTop:   29539 irqs/sec  kernel:47.0%  exact: 0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 11)
----------------------------------------------------------------------------------------------------------------------

    samples  pcnt  function                       DSO
    _______  _____  _____________________________  ___________________________________________________________

   14478.00  7.5%  intel_idle                     /lib/modules/2.6.38-rc5+/build/vmlinux
   12279.00  6.3%  isc_rwlock_lock                /usr/lib64/libisc.so.62.0.1
    9844.00  5.1%  be_tx_compl_process            /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
    7368.00  3.8%  kfree                          /lib/modules/2.6.38-rc5+/build/vmlinux
    6696.00  3.5%  skb_release_data               /lib/modules/2.6.38-rc5+/build/vmlinux
    6240.00  3.2%  kmem_cache_free                /lib/modules/2.6.38-rc5+/build/vmlinux
    6034.00  3.1%  isc_rwlock_unlock              /usr/lib64/libisc.so.62.0.1
    5547.00  2.9%  dns_rbt_findnode               /usr/lib64/libdns.so.69.0.1
    5012.00  2.6%  kfree_skb                      /lib/modules/2.6.38-rc5+/build/vmlinux
    4290.00  2.2%  be_poll_tx_mcc                 /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
    4024.00  2.1%  isc_stats_increment            /usr/lib64/libisc.so.62.0.1
    3417.00  1.8%  _raw_spin_unlock_irqrestore    /lib/modules/2.6.38-rc5+/build/vmlinux
    2470.00  1.3%  __do_softirq                   /lib/modules/2.6.38-rc5+/build/vmlinux
    2312.00  1.2%  _raw_spin_lock                 /lib/modules/2.6.38-rc5+/build/vmlinux
    2138.00  1.1%  finish_task_switch             /lib/modules/2.6.38-rc5+/build/vmlinux
    2136.00  1.1%  __pthread_mutex_lock_internal  /lib64/libpthread-2.12.so
    2061.00  1.1%  dns_name_fullcompare           /usr/lib64/libdns.so.69.0.1
    1961.00  1.0%  dns_rbtnodechain_init          /usr/lib64/libdns.so.69.0.1
    1797.00  0.9%  dns_name_getlabelsequence      /usr/lib64/libdns.so.69.0.1
    1743.00  0.9%  fget_light                     /lib/modules/2.6.38-rc5+/build/vmlinux
    1723.00  0.9%  dns_name_equal                 /usr/lib64/libdns.so.69.0.1
    1673.00  0.9%  __pthread_mutex_unlock         /lib64/libpthread-2.12.so
    1671.00  0.9%  dns_acl_match                  /usr/lib64/libdns.so.69.0.1
    1488.00  0.8%  dns_zone_attach                /usr/lib64/libdns.so.69.0.1
    1428.00  0.7%  be_xmit                        /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
    1369.00  0.7%  dns_message_rendersection      /usr/lib64/libdns.so.69.0.1
    1278.00  0.7%  isc___mempool_get              /usr/lib64/libisc.so.62.0.1
    1251.00  0.6%  copy_user_generic_string       /lib/modules/2.6.38-rc5+/build/vmlinux
    1193.00  0.6%  dns_name_fromwire              /usr/lib64/libdns.so.69.0.1
    1182.00  0.6%  isc_radix_search               /usr/lib64/libisc.so.62.0.1

^ permalink raw reply	[flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 13:50 ` Thomas Graf @ 2011-03-01 14:06 ` Eric Dumazet 2011-03-01 14:22 ` Thomas Graf 0 siblings, 1 reply; 91+ messages in thread From: Eric Dumazet @ 2011-03-01 14:06 UTC (permalink / raw) To: Thomas Graf Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

Le mardi 01 mars 2011 à 08:50 -0500, Thomas Graf a écrit :
> On Tue, Mar 01, 2011 at 08:19:51PM +0800, Herbert Xu wrote:
> > On Tue, Mar 01, 2011 at 07:18:29AM -0500, Thomas Graf wrote:
> > >
> > > ... makes it use CPU 5 for rxq2 and the qps goes up from 250kqps to 270kqps
> >
> > I think the increase here comes from the larger number of packets
> > in flight more than anything.
> >
> > The bottleneck is still the TX queue (both software and hardware).
>
> Disabled netfilter and reran test
>
> Now does ~316kqps (rx was split over 2 queues)

Would be nice to cpu affine named to _not_ run on CPU11, just to
specialize it for TX completions and have softirq time percentage and
"perf top -C 11 " results

Thanks

^ permalink raw reply	[flat|nested] 91+ messages in thread
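Eric's isolation experiment — pinning named away from CPU 11 so that CPU does nothing but TX-completion softirq work — can be reproduced from a small wrapper before starting the daemon. A sketch using the Linux-only scheduler affinity calls (the CPU number 11 is just this thread's example topology):

```python
import os

def pin_away_from(reserved_cpus, pid=0):
    """Restrict `pid` (0 = the calling process) to every CPU it may
    currently run on, except the ones reserved for interrupt work."""
    allowed = os.sched_getaffinity(pid) - set(reserved_cpus)
    if not allowed:
        raise ValueError("cannot reserve every CPU")
    os.sched_setaffinity(pid, allowed)
    return allowed

# e.g. keep CPU 11 free for be2net TX-completion softirqs, as suggested:
#     pin_away_from({11})
# then exec the daemon, which inherits the affinity mask.
```

This is equivalent to launching the daemon under `taskset`, as Thomas does below with mask `0,2-10`; the inherited mask is what keeps the isolated CPU's perf profile free of userspace samples.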
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 14:06 ` Eric Dumazet @ 2011-03-01 14:22 ` Thomas Graf 2011-03-01 14:30 ` Thomas Graf 0 siblings, 1 reply; 91+ messages in thread From: Thomas Graf @ 2011-03-01 14:22 UTC (permalink / raw) To: Eric Dumazet Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

On Tue, Mar 01, 2011 at 03:06:59PM +0100, Eric Dumazet wrote:
> Would be nice to cpu affine named to _not_ run on CPU11, just to
> specialize it for TX completions and have softirq time percentage and
> "perf top -C 11 " results

----------------------------------------------------------------------------------------------------------------------
   PerfTop:     995 irqs/sec  kernel:97.7%  exact: 0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 11)
----------------------------------------------------------------------------------------------------------------------

    samples  pcnt  function                     DSO
    _______  _____  ___________________________  ___________________________________________________________

     335.00  23.3%  intel_idle                   /lib/modules/2.6.38-rc5+/build/vmlinux
     253.00  17.6%  be_tx_compl_process          /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
     177.00  12.3%  skb_release_data             /lib/modules/2.6.38-rc5+/build/vmlinux
     132.00   9.2%  kfree                        /lib/modules/2.6.38-rc5+/build/vmlinux
     127.00   8.8%  kfree_skb                    /lib/modules/2.6.38-rc5+/build/vmlinux
     105.00   7.3%  be_poll_tx_mcc               /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
      99.00   6.9%  kmem_cache_free              /lib/modules/2.6.38-rc5+/build/vmlinux
      36.00   2.5%  __do_softirq                 /lib/modules/2.6.38-rc5+/build/vmlinux
      20.00   1.4%  _raw_spin_unlock_irqrestore  /lib/modules/2.6.38-rc5+/build/vmlinux
      19.00   1.3%  skb_release_head_state       /lib/modules/2.6.38-rc5+/build/vmlinux
      13.00   0.9%  unmap_tx_frag                /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
      11.00   0.8%  rb_next                      /usr/bin/perf
      10.00   0.7%  dso__find_symbol             /usr/bin/perf
       9.00   0.6%  is_swiotlb_buffer            /lib/modules/2.6.38-rc5+/build/vmlinux
       9.00   0.6%  __strcmp_sse42               /lib64/libc-2.12.so
       8.00   0.6%  __kfree_skb                  /lib/modules/2.6.38-rc5+/build/vmlinux
       8.00   0.6%  __strstr_sse42               /lib64/libc-2.12.so
       6.00   0.4%  _int_malloc                  /lib64/libc-2.12.so

^ permalink raw reply	[flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 14:22 ` Thomas Graf @ 2011-03-01 14:30 ` Thomas Graf 2011-03-01 14:52 ` Eric Dumazet 0 siblings, 1 reply; 91+ messages in thread From: Thomas Graf @ 2011-03-01 14:30 UTC (permalink / raw) To: Eric Dumazet, Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld

On Tue, Mar 01, 2011 at 09:22:35AM -0500, Thomas Graf wrote:
> On Tue, Mar 01, 2011 at 03:06:59PM +0100, Eric Dumazet wrote:
> > Would be nice to cpu affine named to _not_ run on CPU11, just to
> > specialize it for TX completions and have softirq time percentage and
> > "perf top -C 11 " results

CPU 1 isolated as well (named running with mask 0,2-10)

----------------------------------------------------------------------------------------------------------------------
   PerfTop:     580 irqs/sec  kernel:100.0%  exact: 0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 1)
----------------------------------------------------------------------------------------------------------------------

    samples  pcnt  function                     DSO
    _______  _____  ___________________________  ___________________________________________________________

     283.00  9.2%  get_rx_page_info             /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
     256.00  8.4%  _raw_spin_unlock_irqrestore  /lib/modules/2.6.38-rc5+/build/vmlinux
     190.00  6.2%  be_poll_rx                   /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
     182.00  5.9%  get_page_from_freelist       /lib/modules/2.6.38-rc5+/build/vmlinux
     157.00  5.1%  intel_idle                   /lib/modules/2.6.38-rc5+/build/vmlinux
     143.00  4.7%  __do_softirq                 /lib/modules/2.6.38-rc5+/build/vmlinux
     133.00  4.3%  sock_queue_rcv_skb           /lib/modules/2.6.38-rc5+/build/vmlinux
     133.00  4.3%  __udp4_lib_lookup            /lib/modules/2.6.38-rc5+/build/vmlinux
     131.00  4.3%  sk_run_filter                /lib/modules/2.6.38-rc5+/build/vmlinux
     114.00  3.7%  getnstimeofday               /lib/modules/2.6.38-rc5+/build/vmlinux
     112.00  3.7%  __alloc_skb                  /lib/modules/2.6.38-rc5+/build/vmlinux
     103.00  3.4%  read_tsc                     /lib/modules/2.6.38-rc5+/build/vmlinux
     100.00  3.3%  __netif_receive_skb          /lib/modules/2.6.38-rc5+/build/vmlinux
      95.00  3.1%  udp_queue_rcv_skb            /lib/modules/2.6.38-rc5+/build/vmlinux
      82.00  2.7%  sock_def_readable            /lib/modules/2.6.38-rc5+/build/vmlinux
      79.00  2.6%  kmem_cache_alloc_node_trace  /lib/modules/2.6.38-rc5+/build/vmlinux
      72.00  2.3%  _raw_spin_lock               /lib/modules/2.6.38-rc5+/build/vmlinux
      67.00  2.2%  __phys_addr                  /lib/modules/2.6.38-rc5+/build/vmlinux
      63.00  2.1%  is_swiotlb_buffer            /lib/modules/2.6.38-rc5+/build/vmlinux
      51.00  1.7%  __udp4_lib_rcv               /lib/modules/2.6.38-rc5+/build/vmlinux
      48.00  1.6%  memcpy                       /lib/modules/2.6.38-rc5+/build/vmlinux
      47.00  1.5%  ip_rcv                       /lib/modules/2.6.38-rc5+/build/vmlinux
      46.00  1.5%  kmem_cache_alloc_node        /lib/modules/2.6.38-rc5+/build/vmlinux
      44.00  1.4%  dma_issue_pending_all        /lib/modules/2.6.38-rc5+/build/vmlinux
      40.00  1.3%  ip_route_input_common        /lib/modules/2.6.38-rc5+/build/vmlinux
      36.00  1.2%  __alloc_pages_nodemask       /lib/modules/2.6.38-rc5+/build/vmlinux
      33.00  1.1%  be_post_rx_frags             /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
      24.00  0.8%  alloc_pages_current          /lib/modules/2.6.38-rc5+/build/vmlinux
      21.00  0.7%  packet_rcv                   /lib/modules/2.6.38-rc5+/build/vmlinux
      20.00  0.7%  local_bh_enable              /lib/modules/2.6.38-rc5+/build/vmlinux
      17.00  0.6%  consume_skb                  /lib/modules/2.6.38-rc5+/build/vmlinux
      16.00  0.5%  next_zones_zonelist          /lib/modules/2.6.38-rc5+/build/vmlinux
      14.00  0.5%  selinux_socket_sock_rcv_skb  /lib/modules/2.6.38-rc5+/build/vmlinux
      13.00  0.4%  ip_local_deliver             /lib/modules/2.6.38-rc5+/build/vmlinux
      11.00  0.4%  sk_filter                    /lib/modules/2.6.38-rc5+/build/vmlinux
      10.00  0.3%  get_rps_cpu                  /lib/modules/2.6.38-rc5+/build/vmlinux
       9.00  0.3%  native_read_tsc              /lib/modules/2.6.38-rc5+/build/vmlinux
       8.00  0.3%  local_bh_disable             /lib/modules/2.6.38-rc5+/build/vmlinux
       8.00  0.3%  eth_type_trans               /lib/modules/2.6.38-rc5+/build/vmlinux
       8.00  0.3%  napi_complete                /lib/modules/2.6.38-rc5+/build/vmlinux
       7.00  0.2%  netif_receive_skb            /lib/modules/2.6.38-rc5+/build/vmlinux
       7.00  0.2%  dso__find_symbol             /usr/bin/perf
       7.00  0.2%  __kmalloc_node_track_caller  /lib/modules/2.6.38-rc5+/build/vmlinux

^ permalink raw reply	[flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 14:30 ` Thomas Graf @ 2011-03-01 14:52 ` Eric Dumazet 2011-03-01 15:07 ` Thomas Graf 0 siblings, 1 reply; 91+ messages in thread From: Eric Dumazet @ 2011-03-01 14:52 UTC (permalink / raw) To: Thomas Graf Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

Le mardi 01 mars 2011 à 09:30 -0500, Thomas Graf a écrit :
> On Tue, Mar 01, 2011 at 09:22:35AM -0500, Thomas Graf wrote:
> > On Tue, Mar 01, 2011 at 03:06:59PM +0100, Eric Dumazet wrote:
> > > Would be nice to cpu affine named to _not_ run on CPU11, just to
> > > specialize it for TX completions and have softirq time percentage and
> > > "perf top -C 11 " results
>
> CPU 1 isolated as well (named running with mask 0,2-10)
>
> ----------------------------------------------------------------------------------------------------------------------
>    PerfTop:     580 irqs/sec  kernel:100.0%  exact: 0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 1)
> ----------------------------------------------------------------------------------------------------------------------
>
>     samples  pcnt  function                     DSO
>     _______  _____  ___________________________  ___________________________________________________________
>
>      283.00  9.2%  get_rx_page_info             /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
>      256.00  8.4%  _raw_spin_unlock_irqrestore  /lib/modules/2.6.38-rc5+/build/vmlinux
>      190.00  6.2%  be_poll_rx                   /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
>      182.00  5.9%  get_page_from_freelist       /lib/modules/2.6.38-rc5+/build/vmlinux
>      157.00  5.1%  intel_idle                   /lib/modules/2.6.38-rc5+/build/vmlinux
>      143.00  4.7%  __do_softirq                 /lib/modules/2.6.38-rc5+/build/vmlinux
>      133.00  4.3%  sock_queue_rcv_skb           /lib/modules/2.6.38-rc5+/build/vmlinux
>      133.00  4.3%  __udp4_lib_lookup            /lib/modules/2.6.38-rc5+/build/vmlinux
>      131.00  4.3%  sk_run_filter                /lib/modules/2.6.38-rc5+/build/vmlinux

sk_run_filter ? Do you have a packet filter running ?
^ permalink raw reply [flat|nested] 91+ messages in thread
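Eric's question makes sense because sk_run_filter only shows up when some socket on the box has a classic BPF filter attached (as the next message reveals, dhclient's packet socket was the culprit). For illustration only, a hedged sketch of how such a filter gets attached from userspace via SO_ATTACH_FILTER — the constants are the Linux values from asm-generic/socket.h and linux/filter.h, and the struct layouts assume native alignment on Linux:

```python
import ctypes
import socket
import struct

SO_ATTACH_FILTER = 26  # SOL_SOCKET option, from asm-generic/socket.h

def attach_drop_all(sock):
    """Attach a one-instruction cBPF program (BPF_RET|BPF_K, k=0) that
    makes sk_run_filter reject every packet queued to this socket."""
    # struct sock_filter { __u16 code; __u8 jt; __u8 jf; __u32 k; }
    insn = struct.pack("HBBI", 0x06, 0, 0, 0)  # BPF_RET | BPF_K, return 0
    buf = ctypes.create_string_buffer(insn, len(insn))
    # struct sock_fprog { unsigned short len; struct sock_filter *filter; }
    fprog = struct.pack("HL", 1, ctypes.addressof(buf))
    sock.setsockopt(socket.SOL_SOCKET, SO_ATTACH_FILTER, fprog)
    return buf  # caller must keep the instruction buffer alive

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
keep = attach_drop_all(rx)

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"ping", rx.getsockname())
rx.settimeout(0.2)
try:
    rx.recvfrom(64)
    dropped = False
except socket.timeout:
    dropped = True  # the filter rejected the datagram before queueing
print("dropped:", dropped)
```

Once any socket carries such a filter, every packet delivered to it pays the sk_run_filter cost — which is why killing dhclient below makes the symbol disappear from the profile.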
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 14:52 ` Eric Dumazet @ 2011-03-01 15:07 ` Thomas Graf 0 siblings, 0 replies; 91+ messages in thread From: Thomas Graf @ 2011-03-01 15:07 UTC (permalink / raw) To: Eric Dumazet Cc: Herbert Xu, David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev

On Tue, Mar 01, 2011 at 03:52:40PM +0100, Eric Dumazet wrote:
> Le mardi 01 mars 2011 à 09:30 -0500, Thomas Graf a écrit :
> > On Tue, Mar 01, 2011 at 09:22:35AM -0500, Thomas Graf wrote:
> > > On Tue, Mar 01, 2011 at 03:06:59PM +0100, Eric Dumazet wrote:
> > > > Would be nice to cpu affine named to _not_ run on CPU11, just to
> > > > specialize it for TX completions and have softirq time percentage and
> > > > "perf top -C 11 " results
> >
> > CPU 1 isolated as well (named running with mask 0,2-10)
> >
> >      283.00  9.2%  get_rx_page_info             /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
> >      256.00  8.4%  _raw_spin_unlock_irqrestore  /lib/modules/2.6.38-rc5+/build/vmlinux
> >      190.00  6.2%  be_poll_rx                   /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
> >      182.00  5.9%  get_page_from_freelist       /lib/modules/2.6.38-rc5+/build/vmlinux
> >      157.00  5.1%  intel_idle                   /lib/modules/2.6.38-rc5+/build/vmlinux
> >      143.00  4.7%  __do_softirq                 /lib/modules/2.6.38-rc5+/build/vmlinux
> >      133.00  4.3%  sock_queue_rcv_skb           /lib/modules/2.6.38-rc5+/build/vmlinux
> >      133.00  4.3%  __udp4_lib_lookup            /lib/modules/2.6.38-rc5+/build/vmlinux
> >      131.00  4.3%  sk_run_filter                /lib/modules/2.6.38-rc5+/build/vmlinux
>
> sk_run_filter ? Do you have a packet filter running ?

dhclient was running. With dhclient killed:

----------------------------------------------------------------------------------------------------------------------
   PerfTop:     726 irqs/sec  kernel:99.9%  exact: 0.0% [1000Hz cpu-clock-msecs],  (all, CPU: 1)
----------------------------------------------------------------------------------------------------------------------

    samples  pcnt  function                     DSO
    _______  _____  ___________________________  ___________________________________________________________

     472.00  10.6%  get_rx_page_info             /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
     419.00   9.4%  _raw_spin_unlock_irqrestore  /lib/modules/2.6.38-rc5+/build/vmlinux
     280.00   6.3%  be_poll_rx                   /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
     259.00   5.8%  get_page_from_freelist       /lib/modules/2.6.38-rc5+/build/vmlinux
     248.00   5.6%  __do_softirq                 /lib/modules/2.6.38-rc5+/build/vmlinux
     238.00   5.4%  intel_idle                   /lib/modules/2.6.38-rc5+/build/vmlinux
     204.00   4.6%  sock_queue_rcv_skb           /lib/modules/2.6.38-rc5+/build/vmlinux
     189.00   4.3%  __udp4_lib_lookup            /lib/modules/2.6.38-rc5+/build/vmlinux
     178.00   4.0%  getnstimeofday               /lib/modules/2.6.38-rc5+/build/vmlinux
     169.00   3.8%  __alloc_skb                  /lib/modules/2.6.38-rc5+/build/vmlinux
     144.00   3.2%  read_tsc                     /lib/modules/2.6.38-rc5+/build/vmlinux
     143.00   3.2%  sock_def_readable            /lib/modules/2.6.38-rc5+/build/vmlinux
     138.00   3.1%  udp_queue_rcv_skb            /lib/modules/2.6.38-rc5+/build/vmlinux
     115.00   2.6%  kmem_cache_alloc_node_trace  /lib/modules/2.6.38-rc5+/build/vmlinux
     114.00   2.6%  __netif_receive_skb          /lib/modules/2.6.38-rc5+/build/vmlinux
     109.00   2.5%  _raw_spin_lock               /lib/modules/2.6.38-rc5+/build/vmlinux
     100.00   2.3%  is_swiotlb_buffer            /lib/modules/2.6.38-rc5+/build/vmlinux
      90.00   2.0%  __phys_addr                  /lib/modules/2.6.38-rc5+/build/vmlinux
      82.00   1.8%  __udp4_lib_rcv               /lib/modules/2.6.38-rc5+/build/vmlinux
      80.00   1.8%  kmem_cache_alloc_node        /lib/modules/2.6.38-rc5+/build/vmlinux
      73.00   1.6%  ip_route_input_common        /lib/modules/2.6.38-rc5+/build/vmlinux
      60.00   1.4%  memcpy                       /lib/modules/2.6.38-rc5+/build/vmlinux
      59.00   1.3%  dma_issue_pending_all        /lib/modules/2.6.38-rc5+/build/vmlinux
      58.00   1.3%  ip_rcv                       /lib/modules/2.6.38-rc5+/build/vmlinux
      57.00   1.3%  be_post_rx_frags             /lib/modules/2.6.38-rc5+/kernel/drivers/net/benet/be2net.ko
      49.00   1.1%  __alloc_pages_nodemask       /lib/modules/2.6.38-rc5+/build/vmlinux
      45.00   1.0%  alloc_pages_current          /lib/modules/2.6.38-rc5+/build/vmlinux
      27.00   0.6%  get_rps_cpu                  /lib/modules/2.6.38-rc5+/build/vmlinux
      23.00   0.5%  napi_complete                /lib/modules/2.6.38-rc5+/build/vmlinux
      22.00   0.5%  ip_local_deliver             /lib/modules/2.6.38-rc5+/build/vmlinux
      18.00   0.4%  selinux_socket_sock_rcv_skb  /lib/modules/2.6.38-rc5+/build/vmlinux
      17.00   0.4%  native_read_tsc              /lib/modules/2.6.38-rc5+/build/vmlinux
      16.00   0.4%  local_bh_enable              /lib/modules/2.6.38-rc5+/build/vmlinux
      16.00   0.4%  next_zones_zonelist          /lib/modules/2.6.38-rc5+/build/vmlinux
      14.00   0.3%  sk_filter                    /lib/modules/2.6.38-rc5+/build/vmlinux
      13.00   0.3%  eth_type_trans               /lib/modules/2.6.38-rc5+/build/vmlinux
      10.00   0.2%  __kmalloc_node_track_caller  /lib/modules/2.6.38-rc5+/build/vmlinux
      10.00   0.2%  _raw_spin_lock_irqsave       /lib/modules/2.6.38-rc5+/build/vmlinux
       9.00   0.2%  raw_local_deliver            /lib/modules/2.6.38-rc5+/build/vmlinux
       8.00   0.2%  __udp_queue_rcv_skb          /lib/modules/2.6.38-rc5+/build/vmlinux
       8.00   0.2%  netif_receive_skb            /lib/modules/2.6.38-rc5+/build/vmlinux
       8.00   0.2%  ip_queue_rcv_skb             /lib/modules/2.6.38-rc5+/build/vmlinux
       7.00   0.2%  net_rx_action                /lib/modules/2.6.38-rc5+/build/vmlinux
       6.00   0.1%  swiotlb_map_page             /lib/modules/2.6.38-rc5+/build/vmlinux
       6.00   0.1%  __sk_mem_schedule            /lib/modules/2.6.38-rc5+/build/vmlinux
       6.00   0.1%  dso__find_symbol             /usr/bin/perf
       6.00   0.1%  __netdev_alloc_skb           /lib/modules/2.6.38-rc5+/build/vmlinux

^ permalink raw reply	[flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-28 11:36 ` Herbert Xu 2011-02-28 13:32 ` Eric Dumazet 2011-02-28 14:13 ` Thomas Graf @ 2011-03-01 5:33 ` Eric Dumazet 2011-03-01 12:35 ` Herbert Xu 3 siblings, 0 replies; 91+ messages in thread From: Eric Dumazet @ 2011-03-01 5:33 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Monday, 28 February 2011 at 19:36 +0800, Herbert Xu wrote: > Here are the patches I used. Please don't them yet as I intend > to clean them up quite a bit. > I assume you mean "please dont commit them" ;) > But please do test them heavily, especially if you have an AMD > NUMA machine as that's where scalability problems really show > up. Intel tends to be a lot more forgiving. My last AMD machine > blew up years ago :) Same here; my only AMD machine is a desktop-class machine, not a server. ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-28 11:36 ` Herbert Xu ` (2 preceding siblings ...) 2011-03-01 5:33 ` Eric Dumazet @ 2011-03-01 12:35 ` Herbert Xu 2011-03-01 12:36 ` [PATCH 1/5] inet: Remove unused sk_sndmsg_* from UFO Herbert Xu ` (5 more replies) 3 siblings, 6 replies; 91+ messages in thread From: Herbert Xu @ 2011-03-01 12:35 UTC (permalink / raw) To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Mon, Feb 28, 2011 at 07:36:59PM +0800, Herbert Xu wrote: > Here are the patches I used. Please don't them yet as I intend > to clean them up quite a bit. OK here is the version ready for merging (please retest them though as I have changed things substantially). The main change is that the legacy UDP code path is now gone so we use the same UDP header generation whether corking is on or off. I will add IPv6 support in a later patch set. Thanks! -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 91+ messages in thread
* [PATCH 1/5] inet: Remove unused sk_sndmsg_* from UFO 2011-03-01 12:35 ` Herbert Xu @ 2011-03-01 12:36 ` Herbert Xu 2011-03-01 12:36 ` [PATCH 3/5] inet: Add ip_make_skb and ip_finish_skb Herbert Xu ` (4 subsequent siblings) 5 siblings, 0 replies; 91+ messages in thread From: Herbert Xu @ 2011-03-01 12:36 UTC (permalink / raw) To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev, Thomas Graf inet: Remove unused sk_sndmsg_* from UFO UFO doesn't really use the sk_sndmsg_* parameters so touching them is pointless. It can't use them anyway since the whole point of UFO is to use the original pages without copying. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> --- net/core/skbuff.c | 3 --- net/ipv4/ip_output.c | 1 - net/ipv6/ip6_output.c | 1 - 3 files changed, 5 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index d883dcc..97011a7 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -2434,8 +2434,6 @@ int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb, return -ENOMEM; /* initialize the next frag */ - sk->sk_sndmsg_page = page; - sk->sk_sndmsg_off = 0; skb_fill_page_desc(skb, frg_cnt, page, 0, 0); skb->truesize += PAGE_SIZE; atomic_add(PAGE_SIZE, &sk->sk_wmem_alloc); @@ -2455,7 +2453,6 @@ int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb, return -EFAULT; /* copy was successful so update the size parameters */ - sk->sk_sndmsg_off += copy; frag->size += copy; skb->len += copy; skb->data_len += copy; diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 04c7b3b..d3a4540 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -767,7 +767,6 @@ static inline int ip_ufo_append_data(struct sock *sk, skb->ip_summed = CHECKSUM_PARTIAL; skb->csum = 0; - sk->sk_sndmsg_off = 0; /* specify the length of each IP datagram fragment */ skb_shinfo(skb)->gso_size = mtu - fragheaderlen; diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 5f8d242..9965182 100644 --- 
a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -1061,7 +1061,6 @@ static inline int ip6_ufo_append_data(struct sock *sk, skb->ip_summed = CHECKSUM_PARTIAL; skb->csum = 0; - sk->sk_sndmsg_off = 0; } err = skb_append_datato_frags(sk,skb, getfrag, from, ^ permalink raw reply related [flat|nested] 91+ messages in thread
* [PATCH 3/5] inet: Add ip_make_skb and ip_finish_skb 2011-03-01 12:35 ` Herbert Xu 2011-03-01 12:36 ` [PATCH 1/5] inet: Remove unused sk_sndmsg_* from UFO Herbert Xu @ 2011-03-01 12:36 ` Herbert Xu 2011-03-01 12:36 ` [PATCH 2/5] inet: Remove explicit write references to sk/inet in ip_append_data Herbert Xu ` (3 subsequent siblings) 5 siblings, 0 replies; 91+ messages in thread From: Herbert Xu @ 2011-03-01 12:36 UTC (permalink / raw) To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev, Thomas Graf inet: Add ip_make_skb and ip_finish_skb This patch adds the helper ip_make_skb, which is like ip_append_data and ip_push_pending_frames all rolled into one, except that it does not send the skb produced. The sending part is carried out by ip_send_skb, which the transport protocol can call after it has tweaked the skb. It is meant to be called in cases where corking is not used and should have a one-to-one correspondence to sendmsg. This patch also adds the helper ip_finish_skb, which is meant to replace ip_push_pending_frames when corking is required. Previously the protocol stack would peek at the socket write queue and add its header to the first packet. With ip_finish_skb, the protocol stack can directly operate on the final skb instead, just like the non-corking case with ip_make_skb. 
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> --- include/net/ip.h | 16 ++++++++++++ net/ipv4/ip_output.c | 65 ++++++++++++++++++++++++++++++++++++++++----------- 2 files changed, 67 insertions(+), 14 deletions(-) diff --git a/include/net/ip.h b/include/net/ip.h index 67fac78..a4f6311 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -116,8 +116,24 @@ extern int ip_append_data(struct sock *sk, extern int ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb); extern ssize_t ip_append_page(struct sock *sk, struct page *page, int offset, size_t size, int flags); +extern struct sk_buff *__ip_make_skb(struct sock *sk, + struct sk_buff_head *queue, + struct inet_cork *cork); +extern int ip_send_skb(struct sk_buff *skb); extern int ip_push_pending_frames(struct sock *sk); extern void ip_flush_pending_frames(struct sock *sk); +extern struct sk_buff *ip_make_skb(struct sock *sk, + int getfrag(void *from, char *to, int offset, int len, + int odd, struct sk_buff *skb), + void *from, int length, int transhdrlen, + struct ipcm_cookie *ipc, + struct rtable **rtp, + unsigned int flags); + +static inline struct sk_buff *ip_finish_skb(struct sock *sk) +{ + return __ip_make_skb(sk, &sk->sk_write_queue, &inet_sk(sk)->cork); +} /* datagram.c */ extern int ip4_datagram_connect(struct sock *sk, diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 1dd5ecc..460308c 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1267,9 +1267,9 @@ static void ip_cork_release(struct inet_cork *cork) * Combined all pending IP fragments on the socket as one IP datagram * and push them out. 
*/ -static int __ip_push_pending_frames(struct sock *sk, - struct sk_buff_head *queue, - struct inet_cork *cork) +struct sk_buff *__ip_make_skb(struct sock *sk, + struct sk_buff_head *queue, + struct inet_cork *cork) { struct sk_buff *skb, *tmp_skb; struct sk_buff **tail_skb; @@ -1280,7 +1280,6 @@ static int __ip_push_pending_frames(struct sock *sk, struct iphdr *iph; __be16 df = 0; __u8 ttl; - int err = 0; if ((skb = __skb_dequeue(queue)) == NULL) goto out; @@ -1351,28 +1350,37 @@ static int __ip_push_pending_frames(struct sock *sk, icmp_out_count(net, ((struct icmphdr *) skb_transport_header(skb))->type); - /* Netfilter gets whole the not fragmented skb. */ + ip_cork_release(cork); +out: + return skb; +} + +int ip_send_skb(struct sk_buff *skb) +{ + struct net *net = sock_net(skb->sk); + int err; + err = ip_local_out(skb); if (err) { if (err > 0) err = net_xmit_errno(err); if (err) - goto error; + IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS); } -out: - ip_cork_release(cork); return err; - -error: - IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS); - goto out; } int ip_push_pending_frames(struct sock *sk) { - return __ip_push_pending_frames(sk, &sk->sk_write_queue, - &inet_sk(sk)->cork); + struct sk_buff *skb; + + skb = ip_finish_skb(sk); + if (!skb) + return 0; + + /* Netfilter gets whole the not fragmented skb. 
*/ + return ip_send_skb(skb); } /* @@ -1395,6 +1403,35 @@ void ip_flush_pending_frames(struct sock *sk) __ip_flush_pending_frames(sk, &sk->sk_write_queue, &inet_sk(sk)->cork); } +struct sk_buff *ip_make_skb(struct sock *sk, + int getfrag(void *from, char *to, int offset, + int len, int odd, struct sk_buff *skb), + void *from, int length, int transhdrlen, + struct ipcm_cookie *ipc, struct rtable **rtp, + unsigned int flags) +{ + struct inet_cork cork = {}; + struct sk_buff_head queue; + int err; + + if (flags & MSG_PROBE) + return NULL; + + __skb_queue_head_init(&queue); + + err = ip_setup_cork(sk, &cork, ipc, rtp); + if (err) + return ERR_PTR(err); + + err = __ip_append_data(sk, &queue, &cork, getfrag, + from, length, transhdrlen, flags); + if (err) { + __ip_flush_pending_frames(sk, &queue, &cork); + return ERR_PTR(err); + } + + return __ip_make_skb(sk, &queue, &cork); +} /* * Fetch data from kernel space and fill in checksum if needed. ^ permalink raw reply related [flat|nested] 91+ messages in thread
* [PATCH 2/5] inet: Remove explicit write references to sk/inet in ip_append_data 2011-03-01 12:35 ` Herbert Xu 2011-03-01 12:36 ` [PATCH 1/5] inet: Remove unused sk_sndmsg_* from UFO Herbert Xu 2011-03-01 12:36 ` [PATCH 3/5] inet: Add ip_make_skb and ip_finish_skb Herbert Xu @ 2011-03-01 12:36 ` Herbert Xu 2011-03-02 6:15 ` inet: Replace left-over references to inet->cork Herbert Xu 2011-03-01 12:36 ` [PATCH 4/5] udp: Switch to ip_finish_skb Herbert Xu ` (2 subsequent siblings) 5 siblings, 1 reply; 91+ messages in thread From: Herbert Xu @ 2011-03-01 12:36 UTC (permalink / raw) To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev, Thomas Graf inet: Remove explicit write references to sk/inet in ip_append_data In order to allow simultaneous calls to ip_append_data on the same socket, it must not modify any shared state in sk or inet (other than those that are designed to allow that such as atomic counters). This patch abstracts out write references to sk and inet_sk in ip_append_data and its friends so that we may use the underlying code in parallel. 
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> --- include/net/inet_sock.h | 23 ++-- net/ipv4/ip_output.c | 238 ++++++++++++++++++++++++++++-------------------- 2 files changed, 154 insertions(+), 107 deletions(-) diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h index 8181498..b3de102 100644 --- a/include/net/inet_sock.h +++ b/include/net/inet_sock.h @@ -86,6 +86,19 @@ static inline struct inet_request_sock *inet_rsk(const struct request_sock *sk) return (struct inet_request_sock *)sk; } +struct inet_cork { + unsigned int flags; + unsigned int fragsize; + struct ip_options *opt; + struct dst_entry *dst; + int length; /* Total length of all frames */ + __be32 addr; + struct flowi fl; + struct page *page; + u32 off; + u8 tx_flags; +}; + struct ip_mc_socklist; struct ipv6_pinfo; struct rtable; @@ -143,15 +156,7 @@ struct inet_sock { int mc_index; __be32 mc_addr; struct ip_mc_socklist __rcu *mc_list; - struct { - unsigned int flags; - unsigned int fragsize; - struct ip_options *opt; - struct dst_entry *dst; - int length; /* Total length of all frames */ - __be32 addr; - struct flowi fl; - } cork; + struct inet_cork cork; }; #define IPCORK_OPT 1 /* ip-options has been held in ipcork.opt */ diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index d3a4540..1dd5ecc 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -733,6 +733,7 @@ csum_page(struct page *page, int offset, int copy) } static inline int ip_ufo_append_data(struct sock *sk, + struct sk_buff_head *queue, int getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb), void *from, int length, int hh_len, int fragheaderlen, @@ -745,7 +746,7 @@ static inline int ip_ufo_append_data(struct sock *sk, * device, so create one single skb packet containing complete * udp datagram */ - if ((skb = skb_peek_tail(&sk->sk_write_queue)) == NULL) { + if ((skb = skb_peek_tail(queue)) == NULL) { skb = sock_alloc_send_skb(sk, hh_len + fragheaderlen + transhdrlen + 20, 
(flags & MSG_DONTWAIT), &err); @@ -771,35 +772,24 @@ static inline int ip_ufo_append_data(struct sock *sk, /* specify the length of each IP datagram fragment */ skb_shinfo(skb)->gso_size = mtu - fragheaderlen; skb_shinfo(skb)->gso_type = SKB_GSO_UDP; - __skb_queue_tail(&sk->sk_write_queue, skb); + __skb_queue_tail(queue, skb); } return skb_append_datato_frags(sk, skb, getfrag, from, (length - transhdrlen)); } -/* - * ip_append_data() and ip_append_page() can make one large IP datagram - * from many pieces of data. Each pieces will be holded on the socket - * until ip_push_pending_frames() is called. Each piece can be a page - * or non-page data. - * - * Not only UDP, other transport protocols - e.g. raw sockets - can use - * this interface potentially. - * - * LATER: length must be adjusted by pad at tail, when it is required. - */ -int ip_append_data(struct sock *sk, - int getfrag(void *from, char *to, int offset, int len, - int odd, struct sk_buff *skb), - void *from, int length, int transhdrlen, - struct ipcm_cookie *ipc, struct rtable **rtp, - unsigned int flags) +static int __ip_append_data(struct sock *sk, struct sk_buff_head *queue, + struct inet_cork *cork, + int getfrag(void *from, char *to, int offset, + int len, int odd, struct sk_buff *skb), + void *from, int length, int transhdrlen, + unsigned int flags) { struct inet_sock *inet = inet_sk(sk); struct sk_buff *skb; - struct ip_options *opt = NULL; + struct ip_options *opt = inet->cork.opt; int hh_len; int exthdrlen; int mtu; @@ -808,58 +798,19 @@ int ip_append_data(struct sock *sk, int offset = 0; unsigned int maxfraglen, fragheaderlen; int csummode = CHECKSUM_NONE; - struct rtable *rt; - - if (flags&MSG_PROBE) - return 0; + struct rtable *rt = (struct rtable *)cork->dst; - if (skb_queue_empty(&sk->sk_write_queue)) { - /* - * setup for corking. 
- */ - opt = ipc->opt; - if (opt) { - if (inet->cork.opt == NULL) { - inet->cork.opt = kmalloc(sizeof(struct ip_options) + 40, sk->sk_allocation); - if (unlikely(inet->cork.opt == NULL)) - return -ENOBUFS; - } - memcpy(inet->cork.opt, opt, sizeof(struct ip_options)+opt->optlen); - inet->cork.flags |= IPCORK_OPT; - inet->cork.addr = ipc->addr; - } - rt = *rtp; - if (unlikely(!rt)) - return -EFAULT; - /* - * We steal reference to this route, caller should not release it - */ - *rtp = NULL; - inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE ? - rt->dst.dev->mtu : - dst_mtu(rt->dst.path); - inet->cork.dst = &rt->dst; - inet->cork.length = 0; - sk->sk_sndmsg_page = NULL; - sk->sk_sndmsg_off = 0; - exthdrlen = rt->dst.header_len; - length += exthdrlen; - transhdrlen += exthdrlen; - } else { - rt = (struct rtable *)inet->cork.dst; - if (inet->cork.flags & IPCORK_OPT) - opt = inet->cork.opt; + exthdrlen = transhdrlen ? rt->dst.header_len : 0; + length += exthdrlen; + transhdrlen += exthdrlen; + mtu = inet->cork.fragsize; - transhdrlen = 0; - exthdrlen = 0; - mtu = inet->cork.fragsize; - } hh_len = LL_RESERVED_SPACE(rt->dst.dev); fragheaderlen = sizeof(struct iphdr) + (opt ? 
opt->optlen : 0); maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen; - if (inet->cork.length + length > 0xFFFF - fragheaderlen) { + if (cork->length + length > 0xFFFF - fragheaderlen) { ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->inet_dport, mtu-exthdrlen); return -EMSGSIZE; @@ -875,15 +826,15 @@ int ip_append_data(struct sock *sk, !exthdrlen) csummode = CHECKSUM_PARTIAL; - skb = skb_peek_tail(&sk->sk_write_queue); + skb = skb_peek_tail(queue); - inet->cork.length += length; + cork->length += length; if (((length > mtu) || (skb && skb_is_gso(skb))) && (sk->sk_protocol == IPPROTO_UDP) && (rt->dst.dev->features & NETIF_F_UFO)) { - err = ip_ufo_append_data(sk, getfrag, from, length, hh_len, - fragheaderlen, transhdrlen, mtu, - flags); + err = ip_ufo_append_data(sk, queue, getfrag, from, length, + hh_len, fragheaderlen, transhdrlen, + mtu, flags); if (err) goto error; return 0; @@ -960,7 +911,7 @@ alloc_new_skb: else /* only the initial fragment is time stamped */ - ipc->tx_flags = 0; + cork->tx_flags = 0; } if (skb == NULL) goto error; @@ -971,7 +922,7 @@ alloc_new_skb: skb->ip_summed = csummode; skb->csum = 0; skb_reserve(skb, hh_len); - skb_shinfo(skb)->tx_flags = ipc->tx_flags; + skb_shinfo(skb)->tx_flags = cork->tx_flags; /* * Find where to start putting bytes. @@ -1008,7 +959,7 @@ alloc_new_skb: /* * Put the packet on the pending queue. 
*/ - __skb_queue_tail(&sk->sk_write_queue, skb); + __skb_queue_tail(queue, skb); continue; } @@ -1028,8 +979,8 @@ alloc_new_skb: } else { int i = skb_shinfo(skb)->nr_frags; skb_frag_t *frag = &skb_shinfo(skb)->frags[i-1]; - struct page *page = sk->sk_sndmsg_page; - int off = sk->sk_sndmsg_off; + struct page *page = cork->page; + int off = cork->off; unsigned int left; if (page && (left = PAGE_SIZE - off) > 0) { @@ -1041,7 +992,7 @@ alloc_new_skb: goto error; } get_page(page); - skb_fill_page_desc(skb, i, page, sk->sk_sndmsg_off, 0); + skb_fill_page_desc(skb, i, page, off, 0); frag = &skb_shinfo(skb)->frags[i]; } } else if (i < MAX_SKB_FRAGS) { @@ -1052,8 +1003,8 @@ alloc_new_skb: err = -ENOMEM; goto error; } - sk->sk_sndmsg_page = page; - sk->sk_sndmsg_off = 0; + cork->page = page; + cork->off = 0; skb_fill_page_desc(skb, i, page, 0, 0); frag = &skb_shinfo(skb)->frags[i]; @@ -1065,7 +1016,7 @@ alloc_new_skb: err = -EFAULT; goto error; } - sk->sk_sndmsg_off += copy; + cork->off += copy; frag->size += copy; skb->len += copy; skb->data_len += copy; @@ -1079,11 +1030,87 @@ alloc_new_skb: return 0; error: - inet->cork.length -= length; + cork->length -= length; IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS); return err; } +static int ip_setup_cork(struct sock *sk, struct inet_cork *cork, + struct ipcm_cookie *ipc, struct rtable **rtp) +{ + struct inet_sock *inet = inet_sk(sk); + struct ip_options *opt; + struct rtable *rt; + + /* + * setup for corking. 
+ */ + opt = ipc->opt; + if (opt) { + if (cork->opt == NULL) { + cork->opt = kmalloc(sizeof(struct ip_options) + 40, + sk->sk_allocation); + if (unlikely(cork->opt == NULL)) + return -ENOBUFS; + } + memcpy(cork->opt, opt, sizeof(struct ip_options) + opt->optlen); + cork->flags |= IPCORK_OPT; + cork->addr = ipc->addr; + } + rt = *rtp; + if (unlikely(!rt)) + return -EFAULT; + /* + * We steal reference to this route, caller should not release it + */ + *rtp = NULL; + cork->fragsize = inet->pmtudisc == IP_PMTUDISC_PROBE ? + rt->dst.dev->mtu : dst_mtu(rt->dst.path); + cork->dst = &rt->dst; + cork->length = 0; + cork->tx_flags = ipc->tx_flags; + cork->page = NULL; + cork->off = 0; + + return 0; +} + +/* + * ip_append_data() and ip_append_page() can make one large IP datagram + * from many pieces of data. Each pieces will be holded on the socket + * until ip_push_pending_frames() is called. Each piece can be a page + * or non-page data. + * + * Not only UDP, other transport protocols - e.g. raw sockets - can use + * this interface potentially. + * + * LATER: length must be adjusted by pad at tail, when it is required. 
+ */ +int ip_append_data(struct sock *sk, + int getfrag(void *from, char *to, int offset, int len, + int odd, struct sk_buff *skb), + void *from, int length, int transhdrlen, + struct ipcm_cookie *ipc, struct rtable **rtp, + unsigned int flags) +{ + struct inet_sock *inet = inet_sk(sk); + int err; + + if (flags&MSG_PROBE) + return 0; + + if (skb_queue_empty(&sk->sk_write_queue)) { + err = ip_setup_cork(sk, &inet->cork, ipc, rtp); + if (err) + return err; + } else { + transhdrlen = 0; + } + + return __ip_append_data(sk, &sk->sk_write_queue, &inet->cork, getfrag, + from, length, transhdrlen, flags); +} + ssize_t ip_append_page(struct sock *sk, struct page *page, int offset, size_t size, int flags) { @@ -1227,40 +1254,42 @@ error: return err; } -static void ip_cork_release(struct inet_sock *inet) +static void ip_cork_release(struct inet_cork *cork) { - inet->cork.flags &= ~IPCORK_OPT; - kfree(inet->cork.opt); - inet->cork.opt = NULL; - dst_release(inet->cork.dst); - inet->cork.dst = NULL; + cork->flags &= ~IPCORK_OPT; + kfree(cork->opt); + cork->opt = NULL; + dst_release(cork->dst); + cork->dst = NULL; } /* * Combined all pending IP fragments on the socket as one IP datagram * and push them out. 
*/ -int ip_push_pending_frames(struct sock *sk) +static int __ip_push_pending_frames(struct sock *sk, + struct sk_buff_head *queue, + struct inet_cork *cork) { struct sk_buff *skb, *tmp_skb; struct sk_buff **tail_skb; struct inet_sock *inet = inet_sk(sk); struct net *net = sock_net(sk); struct ip_options *opt = NULL; - struct rtable *rt = (struct rtable *)inet->cork.dst; + struct rtable *rt = (struct rtable *)cork->dst; struct iphdr *iph; __be16 df = 0; __u8 ttl; int err = 0; - if ((skb = __skb_dequeue(&sk->sk_write_queue)) == NULL) + if ((skb = __skb_dequeue(queue)) == NULL) goto out; tail_skb = &(skb_shinfo(skb)->frag_list); /* move skb->data to ip header from ext header */ if (skb->data < skb_network_header(skb)) __skb_pull(skb, skb_network_offset(skb)); - while ((tmp_skb = __skb_dequeue(&sk->sk_write_queue)) != NULL) { + while ((tmp_skb = __skb_dequeue(queue)) != NULL) { __skb_pull(tmp_skb, skb_network_header_len(skb)); *tail_skb = tmp_skb; tail_skb = &(tmp_skb->next); @@ -1286,8 +1315,8 @@ int ip_push_pending_frames(struct sock *sk) ip_dont_fragment(sk, &rt->dst))) df = htons(IP_DF); - if (inet->cork.flags & IPCORK_OPT) - opt = inet->cork.opt; + if (cork->flags & IPCORK_OPT) + opt = cork->opt; if (rt->rt_type == RTN_MULTICAST) ttl = inet->mc_ttl; @@ -1299,7 +1328,7 @@ int ip_push_pending_frames(struct sock *sk) iph->ihl = 5; if (opt) { iph->ihl += opt->optlen>>2; - ip_options_build(skb, opt, inet->cork.addr, rt, 0); + ip_options_build(skb, opt, cork->addr, rt, 0); } iph->tos = inet->tos; iph->frag_off = df; @@ -1315,7 +1344,7 @@ int ip_push_pending_frames(struct sock *sk) * Steal rt from cork.dst to avoid a pair of atomic_inc/atomic_dec * on dst refcount */ - inet->cork.dst = NULL; + cork->dst = NULL; skb_dst_set(skb, &rt->dst); if (iph->protocol == IPPROTO_ICMP) @@ -1332,7 +1361,7 @@ int ip_push_pending_frames(struct sock *sk) } out: - ip_cork_release(inet); + ip_cork_release(cork); return err; error: @@ -1340,17 +1369,30 @@ error: goto out; } +int 
ip_push_pending_frames(struct sock *sk) +{ + return __ip_push_pending_frames(sk, &sk->sk_write_queue, + &inet_sk(sk)->cork); +} + /* * Throw away all pending data on the socket. */ -void ip_flush_pending_frames(struct sock *sk) +static void __ip_flush_pending_frames(struct sock *sk, + struct sk_buff_head *queue, + struct inet_cork *cork) { struct sk_buff *skb; - while ((skb = __skb_dequeue_tail(&sk->sk_write_queue)) != NULL) + while ((skb = __skb_dequeue_tail(queue)) != NULL) kfree_skb(skb); - ip_cork_release(inet_sk(sk)); + ip_cork_release(cork); +} + +void ip_flush_pending_frames(struct sock *sk) +{ + __ip_flush_pending_frames(sk, &sk->sk_write_queue, &inet_sk(sk)->cork); } ^ permalink raw reply related [flat|nested] 91+ messages in thread
* inet: Replace left-over references to inet->cork 2011-03-01 12:36 ` [PATCH 2/5] inet: Remove explicit write references to sk/inet in ip_append_data Herbert Xu @ 2011-03-02 6:15 ` Herbert Xu 2011-03-02 7:01 ` David Miller 0 siblings, 1 reply; 91+ messages in thread From: Herbert Xu @ 2011-03-02 6:15 UTC (permalink / raw) To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev, Thomas Graf On Tue, Mar 01, 2011 at 08:36:47PM +0800, Herbert Xu wrote: > inet: Remove explicit write references to sk/inet in ip_append_data Just found a couple of spots where inet->cork was still used instead of just cork. inet: Replace left-over references to inet->cork The patch to replace inet->cork with cork left out two spots in __ip_append_data that can result in bogus packet construction. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 460308c..3e8637c 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -789,7 +789,7 @@ static int __ip_append_data(struct sock *sk, struct sk_buff_head *queue, struct inet_sock *inet = inet_sk(sk); struct sk_buff *skb; - struct ip_options *opt = inet->cork.opt; + struct ip_options *opt = cork->opt; int hh_len; int exthdrlen; int mtu; @@ -803,7 +803,7 @@ static int __ip_append_data(struct sock *sk, struct sk_buff_head *queue, exthdrlen = transhdrlen ? rt->dst.header_len : 0; length += exthdrlen; transhdrlen += exthdrlen; - mtu = inet->cork.fragsize; + mtu = cork->fragsize; hh_len = LL_RESERVED_SPACE(rt->dst.dev); Thanks, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply related [flat|nested] 91+ messages in thread
* Re: inet: Replace left-over references to inet->cork 2011-03-02 6:15 ` inet: Replace left-over references to inet->cork Herbert Xu @ 2011-03-02 7:01 ` David Miller 0 siblings, 0 replies; 91+ messages in thread From: David Miller @ 2011-03-02 7:01 UTC (permalink / raw) To: herbert; +Cc: rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev, tgraf From: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed, 2 Mar 2011 14:15:17 +0800 > On Tue, Mar 01, 2011 at 08:36:47PM +0800, Herbert Xu wrote: >> inet: Remove explicit write references to sk/inet in ip_append_data > > Just found a couple of spots where inet->cork was still used > instead of just cork. > > inet: Replace left-over references to inet->cork Applied, thanks Herbert. ^ permalink raw reply [flat|nested] 91+ messages in thread
* [PATCH 4/5] udp: Switch to ip_finish_skb 2011-03-01 12:35 ` Herbert Xu ` (2 preceding siblings ...) 2011-03-01 12:36 ` [PATCH 2/5] inet: Remove explicit write references to sk/inet in ip_append_data Herbert Xu @ 2011-03-01 12:36 ` Herbert Xu 2011-03-01 12:36 ` [PATCH 5/5] udp: Add lockless transmit path Herbert Xu 2011-03-01 16:43 ` SO_REUSEPORT - can it be done in kernel? Eric Dumazet 5 siblings, 0 replies; 91+ messages in thread From: Herbert Xu @ 2011-03-01 12:36 UTC (permalink / raw) To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev, Thomas Graf udp: Switch to ip_finish_skb This patch converts UDP to use the new ip_finish_skb API. This then allows us to more easily use ip_make_skb, which allows UDP to run without a socket lock. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> --- include/net/udp.h | 11 ++++++ include/net/udplite.h | 12 +++++++ net/ipv4/udp.c | 83 ++++++++++++++++++++++++++++++-------------------- 3 files changed, 73 insertions(+), 33 deletions(-) diff --git a/include/net/udp.h b/include/net/udp.h index bb967dd..b8563ba 100644 --- a/include/net/udp.h +++ b/include/net/udp.h @@ -144,6 +144,17 @@ static inline __wsum udp_csum_outgoing(struct sock *sk, struct sk_buff *skb) return csum; } +static inline __wsum udp_csum(struct sk_buff *skb) +{ + __wsum csum = csum_partial(skb_transport_header(skb), + sizeof(struct udphdr), skb->csum); + + for (skb = skb_shinfo(skb)->frag_list; skb; skb = skb->next) { + csum = csum_add(csum, skb->csum); + } + return csum; +} + /* hash routines shared between UDPv4/6 and UDP-Litev4/6 */ static inline void udp_lib_hash(struct sock *sk) { diff --git a/include/net/udplite.h b/include/net/udplite.h index afdffe6..673a024 100644 --- a/include/net/udplite.h +++ b/include/net/udplite.h @@ -115,6 +115,18 @@ static inline __wsum udplite_csum_outgoing(struct sock *sk, struct sk_buff *skb) return csum; } +static inline __wsum udplite_csum(struct sk_buff *skb) +{ + struct sock *sk = skb->sk; 
+ int cscov = udplite_sender_cscov(udp_sk(sk), udp_hdr(skb)); + const int off = skb_transport_offset(skb); + const int len = skb->len - off; + + skb->ip_summed = CHECKSUM_NONE; /* no HW support for checksumming */ + + return skb_checksum(skb, off, min(cscov, len), 0); +} + extern void udplite4_register(void); extern int udplite_get_port(struct sock *sk, unsigned short snum, int (*scmp)(const struct sock *, const struct sock *)); diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 8157b17..9a6d326 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -663,75 +663,72 @@ void udp_flush_pending_frames(struct sock *sk) EXPORT_SYMBOL(udp_flush_pending_frames); /** - * udp4_hwcsum_outgoing - handle outgoing HW checksumming - * @sk: socket we are sending on + * udp4_hwcsum - handle outgoing HW checksumming * @skb: sk_buff containing the filled-in UDP header * (checksum field must be zeroed out) + * @src: source IP address + * @dst: destination IP address */ -static void udp4_hwcsum_outgoing(struct sock *sk, struct sk_buff *skb, - __be32 src, __be32 dst, int len) +static void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst) { - unsigned int offset; struct udphdr *uh = udp_hdr(skb); + struct sk_buff *frags = skb_shinfo(skb)->frag_list; + int offset = skb_transport_offset(skb); + int len = skb->len - offset; + int hlen = len; __wsum csum = 0; - if (skb_queue_len(&sk->sk_write_queue) == 1) { + if (!frags) { /* * Only one fragment on the socket. 
*/ skb->csum_start = skb_transport_header(skb) - skb->head; skb->csum_offset = offsetof(struct udphdr, check); - uh->check = ~csum_tcpudp_magic(src, dst, len, IPPROTO_UDP, 0); + uh->check = ~csum_tcpudp_magic(src, dst, len, + IPPROTO_UDP, 0); } else { /* * HW-checksum won't work as there are two or more * fragments on the socket so that all csums of sk_buffs * should be together */ - offset = skb_transport_offset(skb); - skb->csum = skb_checksum(skb, offset, skb->len - offset, 0); + do { + csum = csum_add(csum, frags->csum); + hlen -= frags->len; + } while ((frags = frags->next)); + csum = skb_checksum(skb, offset, hlen, csum); skb->ip_summed = CHECKSUM_NONE; - skb_queue_walk(&sk->sk_write_queue, skb) { - csum = csum_add(csum, skb->csum); - } - uh->check = csum_tcpudp_magic(src, dst, len, IPPROTO_UDP, csum); if (uh->check == 0) uh->check = CSUM_MANGLED_0; } } -/* - * Push out all pending data as one UDP datagram. Socket is locked. - */ -static int udp_push_pending_frames(struct sock *sk) +static int udp_send_skb(struct sk_buff *skb, __be32 daddr, __be32 dport) { - struct udp_sock *up = udp_sk(sk); + struct sock *sk = skb->sk; struct inet_sock *inet = inet_sk(sk); - struct flowi *fl = &inet->cork.fl; - struct sk_buff *skb; struct udphdr *uh; + struct rtable *rt = (struct rtable *)skb_dst(skb); int err = 0; int is_udplite = IS_UDPLITE(sk); + int offset = skb_transport_offset(skb); + int len = skb->len - offset; __wsum csum = 0; - /* Grab the skbuff where UDP header space exists. 
*/ - if ((skb = skb_peek(&sk->sk_write_queue)) == NULL) - goto out; - /* * Create a UDP header */ uh = udp_hdr(skb); - uh->source = fl->fl_ip_sport; - uh->dest = fl->fl_ip_dport; - uh->len = htons(up->len); + uh->source = inet->inet_sport; + uh->dest = dport; + uh->len = htons(len); uh->check = 0; if (is_udplite) /* UDP-Lite */ - csum = udplite_csum_outgoing(sk, skb); + csum = udplite_csum(skb); else if (sk->sk_no_check == UDP_CSUM_NOXMIT) { /* UDP csum disabled */ @@ -740,20 +737,20 @@ static int udp_push_pending_frames(struct sock *sk) } else if (skb->ip_summed == CHECKSUM_PARTIAL) { /* UDP hardware csum */ - udp4_hwcsum_outgoing(sk, skb, fl->fl4_src, fl->fl4_dst, up->len); + udp4_hwcsum(skb, rt->rt_src, daddr); goto send; - } else /* `normal' UDP */ - csum = udp_csum_outgoing(sk, skb); + } else + csum = udp_csum(skb); /* add protocol-dependent pseudo-header */ - uh->check = csum_tcpudp_magic(fl->fl4_src, fl->fl4_dst, up->len, + uh->check = csum_tcpudp_magic(rt->rt_src, daddr, len, sk->sk_protocol, csum); if (uh->check == 0) uh->check = CSUM_MANGLED_0; send: - err = ip_push_pending_frames(sk); + err = ip_send_skb(skb); if (err) { if (err == -ENOBUFS && !inet->recverr) { UDP_INC_STATS_USER(sock_net(sk), @@ -763,6 +760,26 @@ send: } else UDP_INC_STATS_USER(sock_net(sk), UDP_MIB_OUTDATAGRAMS, is_udplite); + return err; +} + +/* + * Push out all pending data as one UDP datagram. Socket is locked. + */ +static int udp_push_pending_frames(struct sock *sk) +{ + struct udp_sock *up = udp_sk(sk); + struct inet_sock *inet = inet_sk(sk); + struct flowi *fl = &inet->cork.fl; + struct sk_buff *skb; + int err = 0; + + skb = ip_finish_skb(sk); + if (!skb) + goto out; + + err = udp_send_skb(skb, fl->fl4_dst, fl->fl_ip_dport); + out: up->len = 0; up->pending = 0; ^ permalink raw reply related [flat|nested] 91+ messages in thread
* [PATCH 5/5] udp: Add lockless transmit path 2011-03-01 12:35 ` Herbert Xu ` (3 preceding siblings ...) 2011-03-01 12:36 ` [PATCH 4/5] udp: Switch to ip_finish_skb Herbert Xu @ 2011-03-01 12:36 ` Herbert Xu 2011-03-01 16:43 ` SO_REUSEPORT - can it be done in kernel? Eric Dumazet 5 siblings, 0 replies; 91+ messages in thread From: Herbert Xu @ 2011-03-01 12:36 UTC (permalink / raw) To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev, Thomas Graf udp: Add lockless transmit path The UDP transmit path has been running under the socket lock for a long time because of the corking feature. This means that transmitting to the same socket in multiple threads does not scale at all. However, as most users don't actually use corking, the locking can be removed in the common case. This patch creates a lockless fast path where corking is not used. Please note that this does create a slight inaccuracy in the enforcement of socket send buffer limits. In particular, we may exceed the socket limit by up to (number of CPUs) * (packet size) because of the way the limit is computed. As the primary purpose of socket buffers is to indicate congestion, this should not be a great problem for now. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> --- net/ipv4/udp.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 9a6d326..bb9f707 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -802,6 +802,7 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, int err, is_udplite = IS_UDPLITE(sk); int corkreq = up->corkflag || msg->msg_flags&MSG_MORE; int (*getfrag)(void *, char *, int, int, int, struct sk_buff *); + struct sk_buff *skb; if (len > 0xFFFF) return -EMSGSIZE; @@ -816,6 +817,8 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, ipc.opt = NULL; ipc.tx_flags = 0; + getfrag = is_udplite ? 
udplite_getfrag : ip_generic_getfrag; + if (up->pending) { /* * There are pending frames. @@ -940,6 +943,17 @@ back_from_confirm: if (!ipc.addr) daddr = ipc.addr = rt->rt_dst; + /* Lockless fast path for the non-corking case. */ + if (!corkreq) { + skb = ip_make_skb(sk, getfrag, msg->msg_iov, ulen, + sizeof(struct udphdr), &ipc, &rt, + msg->msg_flags); + err = PTR_ERR(skb); + if (skb && !IS_ERR(skb)) + err = udp_send_skb(skb, daddr, dport); + goto out; + } + lock_sock(sk); if (unlikely(up->pending)) { /* The socket is already corked while preparing it. */ @@ -961,7 +975,6 @@ back_from_confirm: do_append_data: up->len += ulen; - getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag; err = ip_append_data(sk, getfrag, msg->msg_iov, ulen, sizeof(struct udphdr), &ipc, &rt, corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags); ^ permalink raw reply related [flat|nested] 91+ messages in thread
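The scaling problem this commit message describes is easy to picture from userspace: many threads calling sendto() on one shared, uncorked UDP socket, all of which serialized on the socket lock before this patch. A minimal sketch of that workload, for illustration only (the function and struct names here are mine, not from the patches; the receiver never reads, it only provides a valid loopback destination):

```c
/* Workload sketch: N threads blasting datagrams through ONE shared UDP
 * socket -- the case the lockless fast path is meant to speed up.
 * Illustrative userspace code, not kernel code. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

struct worker_arg {
	int fd;                 /* the shared transmit socket */
	struct sockaddr_in dst; /* loopback receiver address */
	int npackets;
	int sent;               /* filled in by the worker */
};

static void *worker(void *p)
{
	struct worker_arg *a = p;
	char payload[32] = "ping";

	for (int i = 0; i < a->npackets; i++)
		if (sendto(a->fd, payload, sizeof(payload), 0,
			   (struct sockaddr *)&a->dst,
			   sizeof(a->dst)) == sizeof(payload))
			a->sent++;
	return NULL;
}

/* Spawn nthreads senders on one socket; return total datagrams sent. */
int send_burst(int nthreads, int npackets)
{
	int rx = socket(AF_INET, SOCK_DGRAM, 0);
	int tx = socket(AF_INET, SOCK_DGRAM, 0);
	struct sockaddr_in addr = { .sin_family = AF_INET };
	socklen_t alen = sizeof(addr);
	pthread_t tids[nthreads];
	struct worker_arg args[nthreads];
	int total = 0;

	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	bind(rx, (struct sockaddr *)&addr, sizeof(addr)); /* ephemeral port */
	getsockname(rx, (struct sockaddr *)&addr, &alen);

	for (int i = 0; i < nthreads; i++) {
		args[i] = (struct worker_arg){ .fd = tx, .dst = addr,
					       .npackets = npackets };
		pthread_create(&tids[i], NULL, worker, &args[i]);
	}
	for (int i = 0; i < nthreads; i++) {
		pthread_join(tids[i], NULL);
		total += args[i].sent;
	}
	close(tx);
	close(rx);
	return total;
}
```

On a pre-patch kernel every one of these sendto() calls takes lock_sock() on the same socket; with the patch, the non-corking calls avoid it entirely. This is also where the commit message's caveat bites: with per-CPU concurrent sends, the sndbuf limit can be overshot by roughly (number of CPUs) * (packet size).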
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 12:35 ` Herbert Xu ` (4 preceding siblings ...) 2011-03-01 12:36 ` [PATCH 5/5] udp: Add lockless transmit path Herbert Xu @ 2011-03-01 16:43 ` Eric Dumazet 2011-03-01 20:36 ` David Miller 5 siblings, 1 reply; 91+ messages in thread From: Eric Dumazet @ 2011-03-01 16:43 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev On Tuesday, 1 March 2011 at 20:35 +0800, Herbert Xu wrote: > On Mon, Feb 28, 2011 at 07:36:59PM +0800, Herbert Xu wrote: > > Here are the patches I used. Please don't [apply] them yet as I intend > > to clean them up quite a bit. > > OK here is the version ready for merging (please retest them > though as I have changed things substantially). > > The main change is that the legacy UDP code path is now gone > so we use the same UDP header generation whether corking is on > or off. > > I will add IPv6 support in a later patch set. > > Thanks! For the whole patchset: Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Tests were fine on my dev machine. Thanks ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-03-01 16:43 ` SO_REUSEPORT - can it be done in kernel? Eric Dumazet @ 2011-03-01 20:36 ` David Miller 0 siblings, 0 replies; 91+ messages in thread From: David Miller @ 2011-03-01 20:36 UTC (permalink / raw) To: eric.dumazet Cc: herbert, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev From: Eric Dumazet <eric.dumazet@gmail.com> Date: Tue, 01 Mar 2011 17:43:07 +0100 > On Tuesday, 1 March 2011 at 20:35 +0800, Herbert Xu wrote: >> On Mon, Feb 28, 2011 at 07:36:59PM +0800, Herbert Xu wrote: >> > Here are the patches I used. Please don't [apply] them yet as I intend >> > to clean them up quite a bit. >> >> OK here is the version ready for merging (please retest them >> though as I have changed things substantially). >> >> The main change is that the legacy UDP code path is now gone >> so we use the same UDP header generation whether corking is on >> or off. >> >> I will add IPv6 support in a later patch set. >> >> Thanks! > > For the whole patchset: > > Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Applied, great work everyone! ^ permalink raw reply [flat|nested] 91+ messages in thread
* [PATCH 2/5] net: Remove explicit write references to sk/inet in ip_append_data 2011-02-27 11:06 ` Herbert Xu 2011-02-28 3:45 ` Tom Herbert 2011-02-28 11:36 ` Herbert Xu @ 2011-02-28 11:41 ` Herbert Xu 2011-03-01 5:31 ` Eric Dumazet 2011-02-28 11:41 ` [PATCH 1/5] net: Remove unused sk_sndmsg_* from UFO Herbert Xu ` (2 subsequent siblings) 5 siblings, 1 reply; 91+ messages in thread From: Herbert Xu @ 2011-02-28 11:41 UTC (permalink / raw) To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev, Thomas Graf net: Remove explicit write references to sk/inet in ip_append_data In order to allow simultaneous calls to ip_append_data on the same socket, it must not modify any shared state in sk or inet (other than those that are designed to allow that such as atomic counters). This patch abstracts out write references to sk and inet_sk in ip_append_data and its friends so that we may use the underlying code in parallel. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> --- include/net/inet_sock.h | 23 ++-- net/ipv4/ip_output.c | 238 ++++++++++++++++++++++++++++-------------------- 2 files changed, 154 insertions(+), 107 deletions(-) diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h index 8181498..b3de102 100644 --- a/include/net/inet_sock.h +++ b/include/net/inet_sock.h @@ -86,6 +86,19 @@ static inline struct inet_request_sock *inet_rsk(const struct request_sock *sk) return (struct inet_request_sock *)sk; } +struct inet_cork { + unsigned int flags; + unsigned int fragsize; + struct ip_options *opt; + struct dst_entry *dst; + int length; /* Total length of all frames */ + __be32 addr; + struct flowi fl; + struct page *page; + u32 off; + u8 tx_flags; +}; + struct ip_mc_socklist; struct ipv6_pinfo; struct rtable; @@ -143,15 +156,7 @@ struct inet_sock { int mc_index; __be32 mc_addr; struct ip_mc_socklist __rcu *mc_list; - struct { - unsigned int flags; - unsigned int fragsize; - struct ip_options *opt; - struct dst_entry *dst; - int 
length; /* Total length of all frames */ - __be32 addr; - struct flowi fl; - } cork; + struct inet_cork cork; }; #define IPCORK_OPT 1 /* ip-options has been held in ipcork.opt */ diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index d3a4540..1dd5ecc 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -733,6 +733,7 @@ csum_page(struct page *page, int offset, int copy) } static inline int ip_ufo_append_data(struct sock *sk, + struct sk_buff_head *queue, int getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb), void *from, int length, int hh_len, int fragheaderlen, @@ -745,7 +746,7 @@ static inline int ip_ufo_append_data(struct sock *sk, * device, so create one single skb packet containing complete * udp datagram */ - if ((skb = skb_peek_tail(&sk->sk_write_queue)) == NULL) { + if ((skb = skb_peek_tail(queue)) == NULL) { skb = sock_alloc_send_skb(sk, hh_len + fragheaderlen + transhdrlen + 20, (flags & MSG_DONTWAIT), &err); @@ -771,35 +772,24 @@ static inline int ip_ufo_append_data(struct sock *sk, /* specify the length of each IP datagram fragment */ skb_shinfo(skb)->gso_size = mtu - fragheaderlen; skb_shinfo(skb)->gso_type = SKB_GSO_UDP; - __skb_queue_tail(&sk->sk_write_queue, skb); + __skb_queue_tail(queue, skb); } return skb_append_datato_frags(sk, skb, getfrag, from, (length - transhdrlen)); } -/* - * ip_append_data() and ip_append_page() can make one large IP datagram - * from many pieces of data. Each pieces will be holded on the socket - * until ip_push_pending_frames() is called. Each piece can be a page - * or non-page data. - * - * Not only UDP, other transport protocols - e.g. raw sockets - can use - * this interface potentially. - * - * LATER: length must be adjusted by pad at tail, when it is required. 
- */ -int ip_append_data(struct sock *sk, - int getfrag(void *from, char *to, int offset, int len, - int odd, struct sk_buff *skb), - void *from, int length, int transhdrlen, - struct ipcm_cookie *ipc, struct rtable **rtp, - unsigned int flags) +static int __ip_append_data(struct sock *sk, struct sk_buff_head *queue, + struct inet_cork *cork, + int getfrag(void *from, char *to, int offset, + int len, int odd, struct sk_buff *skb), + void *from, int length, int transhdrlen, + unsigned int flags) { struct inet_sock *inet = inet_sk(sk); struct sk_buff *skb; - struct ip_options *opt = NULL; + struct ip_options *opt = inet->cork.opt; int hh_len; int exthdrlen; int mtu; @@ -808,58 +798,19 @@ int ip_append_data(struct sock *sk, int offset = 0; unsigned int maxfraglen, fragheaderlen; int csummode = CHECKSUM_NONE; - struct rtable *rt; - - if (flags&MSG_PROBE) - return 0; + struct rtable *rt = (struct rtable *)cork->dst; - if (skb_queue_empty(&sk->sk_write_queue)) { - /* - * setup for corking. - */ - opt = ipc->opt; - if (opt) { - if (inet->cork.opt == NULL) { - inet->cork.opt = kmalloc(sizeof(struct ip_options) + 40, sk->sk_allocation); - if (unlikely(inet->cork.opt == NULL)) - return -ENOBUFS; - } - memcpy(inet->cork.opt, opt, sizeof(struct ip_options)+opt->optlen); - inet->cork.flags |= IPCORK_OPT; - inet->cork.addr = ipc->addr; - } - rt = *rtp; - if (unlikely(!rt)) - return -EFAULT; - /* - * We steal reference to this route, caller should not release it - */ - *rtp = NULL; - inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE ? - rt->dst.dev->mtu : - dst_mtu(rt->dst.path); - inet->cork.dst = &rt->dst; - inet->cork.length = 0; - sk->sk_sndmsg_page = NULL; - sk->sk_sndmsg_off = 0; - exthdrlen = rt->dst.header_len; - length += exthdrlen; - transhdrlen += exthdrlen; - } else { - rt = (struct rtable *)inet->cork.dst; - if (inet->cork.flags & IPCORK_OPT) - opt = inet->cork.opt; + exthdrlen = transhdrlen ? 
rt->dst.header_len : 0; + length += exthdrlen; + transhdrlen += exthdrlen; + mtu = inet->cork.fragsize; - transhdrlen = 0; - exthdrlen = 0; - mtu = inet->cork.fragsize; - } hh_len = LL_RESERVED_SPACE(rt->dst.dev); fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0); maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen; - if (inet->cork.length + length > 0xFFFF - fragheaderlen) { + if (cork->length + length > 0xFFFF - fragheaderlen) { ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->inet_dport, mtu-exthdrlen); return -EMSGSIZE; @@ -875,15 +826,15 @@ int ip_append_data(struct sock *sk, !exthdrlen) csummode = CHECKSUM_PARTIAL; - skb = skb_peek_tail(&sk->sk_write_queue); + skb = skb_peek_tail(queue); - inet->cork.length += length; + cork->length += length; if (((length > mtu) || (skb && skb_is_gso(skb))) && (sk->sk_protocol == IPPROTO_UDP) && (rt->dst.dev->features & NETIF_F_UFO)) { - err = ip_ufo_append_data(sk, getfrag, from, length, hh_len, - fragheaderlen, transhdrlen, mtu, - flags); + err = ip_ufo_append_data(sk, queue, getfrag, from, length, + hh_len, fragheaderlen, transhdrlen, + mtu, flags); if (err) goto error; return 0; @@ -960,7 +911,7 @@ alloc_new_skb: else /* only the initial fragment is time stamped */ - ipc->tx_flags = 0; + cork->tx_flags = 0; } if (skb == NULL) goto error; @@ -971,7 +922,7 @@ alloc_new_skb: skb->ip_summed = csummode; skb->csum = 0; skb_reserve(skb, hh_len); - skb_shinfo(skb)->tx_flags = ipc->tx_flags; + skb_shinfo(skb)->tx_flags = cork->tx_flags; /* * Find where to start putting bytes. @@ -1008,7 +959,7 @@ alloc_new_skb: /* * Put the packet on the pending queue. 
*/ - __skb_queue_tail(&sk->sk_write_queue, skb); + __skb_queue_tail(queue, skb); continue; } @@ -1028,8 +979,8 @@ alloc_new_skb: } else { int i = skb_shinfo(skb)->nr_frags; skb_frag_t *frag = &skb_shinfo(skb)->frags[i-1]; - struct page *page = sk->sk_sndmsg_page; - int off = sk->sk_sndmsg_off; + struct page *page = cork->page; + int off = cork->off; unsigned int left; if (page && (left = PAGE_SIZE - off) > 0) { @@ -1041,7 +992,7 @@ alloc_new_skb: goto error; } get_page(page); - skb_fill_page_desc(skb, i, page, sk->sk_sndmsg_off, 0); + skb_fill_page_desc(skb, i, page, off, 0); frag = &skb_shinfo(skb)->frags[i]; } } else if (i < MAX_SKB_FRAGS) { @@ -1052,8 +1003,8 @@ alloc_new_skb: err = -ENOMEM; goto error; } - sk->sk_sndmsg_page = page; - sk->sk_sndmsg_off = 0; + cork->page = page; + cork->off = 0; skb_fill_page_desc(skb, i, page, 0, 0); frag = &skb_shinfo(skb)->frags[i]; @@ -1065,7 +1016,7 @@ alloc_new_skb: err = -EFAULT; goto error; } - sk->sk_sndmsg_off += copy; + cork->off += copy; frag->size += copy; skb->len += copy; skb->data_len += copy; @@ -1079,11 +1030,87 @@ alloc_new_skb: return 0; error: - inet->cork.length -= length; + cork->length -= length; IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS); return err; } +static int ip_setup_cork(struct sock *sk, struct inet_cork *cork, + struct ipcm_cookie *ipc, struct rtable **rtp) +{ + struct inet_sock *inet = inet_sk(sk); + struct ip_options *opt; + struct rtable *rt; + + /* + * setup for corking. 
+ */ + opt = ipc->opt; + if (opt) { + if (cork->opt == NULL) { + cork->opt = kmalloc(sizeof(struct ip_options) + 40, + sk->sk_allocation); + if (unlikely(cork->opt == NULL)) + return -ENOBUFS; + } + memcpy(cork->opt, opt, sizeof(struct ip_options) + opt->optlen); + cork->flags |= IPCORK_OPT; + cork->addr = ipc->addr; + } + rt = *rtp; + if (unlikely(!rt)) + return -EFAULT; + /* + * We steal reference to this route, caller should not release it + */ + *rtp = NULL; + cork->fragsize = inet->pmtudisc == IP_PMTUDISC_PROBE ? + rt->dst.dev->mtu : dst_mtu(rt->dst.path); + cork->dst = &rt->dst; + cork->length = 0; + cork->tx_flags = ipc->tx_flags; + cork->page = NULL; + cork->off = 0; + + return 0; +} + +/* + * ip_append_data() and ip_append_page() can make one large IP datagram + * from many pieces of data. Each pieces will be holded on the socket + * until ip_push_pending_frames() is called. Each piece can be a page + * or non-page data. + * + * Not only UDP, other transport protocols - e.g. raw sockets - can use + * this interface potentially. + * + * LATER: length must be adjusted by pad at tail, when it is required. 
+ */ +int ip_append_data(struct sock *sk, + int getfrag(void *from, char *to, int offset, int len, + int odd, struct sk_buff *skb), + void *from, int length, int transhdrlen, + struct ipcm_cookie *ipc, struct rtable **rtp, + unsigned int flags) +{ + struct inet_sock *inet = inet_sk(sk); + int err; + + if (flags&MSG_PROBE) + return 0; + + if (skb_queue_empty(&sk->sk_write_queue)) { + err = ip_setup_cork(sk, &inet->cork, ipc, rtp); + if (err) + return err; + } else { + transhdrlen = 0; + } + + return __ip_append_data(sk, &sk->sk_write_queue, &inet->cork, getfrag, + from, length, transhdrlen, flags); +} + ssize_t ip_append_page(struct sock *sk, struct page *page, int offset, size_t size, int flags) { @@ -1227,40 +1254,42 @@ error: return err; } -static void ip_cork_release(struct inet_sock *inet) +static void ip_cork_release(struct inet_cork *cork) { - inet->cork.flags &= ~IPCORK_OPT; - kfree(inet->cork.opt); - inet->cork.opt = NULL; - dst_release(inet->cork.dst); - inet->cork.dst = NULL; + cork->flags &= ~IPCORK_OPT; + kfree(cork->opt); + cork->opt = NULL; + dst_release(cork->dst); + cork->dst = NULL; } /* * Combined all pending IP fragments on the socket as one IP datagram * and push them out. 
*/ -int ip_push_pending_frames(struct sock *sk) +static int __ip_push_pending_frames(struct sock *sk, + struct sk_buff_head *queue, + struct inet_cork *cork) { struct sk_buff *skb, *tmp_skb; struct sk_buff **tail_skb; struct inet_sock *inet = inet_sk(sk); struct net *net = sock_net(sk); struct ip_options *opt = NULL; - struct rtable *rt = (struct rtable *)inet->cork.dst; + struct rtable *rt = (struct rtable *)cork->dst; struct iphdr *iph; __be16 df = 0; __u8 ttl; int err = 0; - if ((skb = __skb_dequeue(&sk->sk_write_queue)) == NULL) + if ((skb = __skb_dequeue(queue)) == NULL) goto out; tail_skb = &(skb_shinfo(skb)->frag_list); /* move skb->data to ip header from ext header */ if (skb->data < skb_network_header(skb)) __skb_pull(skb, skb_network_offset(skb)); - while ((tmp_skb = __skb_dequeue(&sk->sk_write_queue)) != NULL) { + while ((tmp_skb = __skb_dequeue(queue)) != NULL) { __skb_pull(tmp_skb, skb_network_header_len(skb)); *tail_skb = tmp_skb; tail_skb = &(tmp_skb->next); @@ -1286,8 +1315,8 @@ int ip_push_pending_frames(struct sock *sk) ip_dont_fragment(sk, &rt->dst))) df = htons(IP_DF); - if (inet->cork.flags & IPCORK_OPT) - opt = inet->cork.opt; + if (cork->flags & IPCORK_OPT) + opt = cork->opt; if (rt->rt_type == RTN_MULTICAST) ttl = inet->mc_ttl; @@ -1299,7 +1328,7 @@ int ip_push_pending_frames(struct sock *sk) iph->ihl = 5; if (opt) { iph->ihl += opt->optlen>>2; - ip_options_build(skb, opt, inet->cork.addr, rt, 0); + ip_options_build(skb, opt, cork->addr, rt, 0); } iph->tos = inet->tos; iph->frag_off = df; @@ -1315,7 +1344,7 @@ int ip_push_pending_frames(struct sock *sk) * Steal rt from cork.dst to avoid a pair of atomic_inc/atomic_dec * on dst refcount */ - inet->cork.dst = NULL; + cork->dst = NULL; skb_dst_set(skb, &rt->dst); if (iph->protocol == IPPROTO_ICMP) @@ -1332,7 +1361,7 @@ int ip_push_pending_frames(struct sock *sk) } out: - ip_cork_release(inet); + ip_cork_release(cork); return err; error: @@ -1340,17 +1369,30 @@ error: goto out; } +int 
ip_push_pending_frames(struct sock *sk) +{ + return __ip_push_pending_frames(sk, &sk->sk_write_queue, + &inet_sk(sk)->cork); +} + /* * Throw away all pending data on the socket. */ -void ip_flush_pending_frames(struct sock *sk) +static void __ip_flush_pending_frames(struct sock *sk, + struct sk_buff_head *queue, + struct inet_cork *cork) { struct sk_buff *skb; - while ((skb = __skb_dequeue_tail(&sk->sk_write_queue)) != NULL) + while ((skb = __skb_dequeue_tail(queue)) != NULL) kfree_skb(skb); - ip_cork_release(inet_sk(sk)); + ip_cork_release(cork); +} + +void ip_flush_pending_frames(struct sock *sk) +{ + __ip_flush_pending_frames(sk, &sk->sk_write_queue, &inet_sk(sk)->cork); } ^ permalink raw reply related [flat|nested] 91+ messages in thread
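The shape of this refactor, stripped of all the skb machinery: logic that used to write through inet->cork (shared per-socket state, hence the socket lock) now takes a cork argument, so a lockless caller can use a cork living on its own stack. A toy sketch of the pattern — the toy_* names are mine, not the kernel's types:

```c
/* Toy version of the [PATCH 2/5] pattern: move mutable state out of the
 * shared object and into a caller-supplied struct, making the core logic
 * reentrant. Illustrative only. */
#include <stddef.h>

struct toy_cork {
	size_t length;	/* total bytes queued, like inet_cork.length */
};

struct toy_sock {
	struct toy_cork cork;	/* shared: only the locked path touches it */
};

/* Core logic takes the cork explicitly, like __ip_append_data(). */
int toy_append(struct toy_cork *cork, size_t len)
{
	if (cork->length + len > 0xFFFF)
		return -1;	/* would exceed the IP datagram limit */
	cork->length += len;
	return 0;
}

/* Corked path: uses the per-socket cork, so callers must hold the lock. */
int toy_append_locked(struct toy_sock *sk, size_t len)
{
	return toy_append(&sk->cork, len);
}

/* Lockless path: a fresh cork on the caller's stack, no shared writes,
 * so any number of threads may run this concurrently on one socket. */
int toy_send_lockless(size_t len)
{
	struct toy_cork cork = { 0 };

	return toy_append(&cork, len);
}
```

The same split shows up later in the series: ip_make_skb() builds its own `struct inet_cork cork = {}` and private queue instead of touching the socket's.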
* Re: [PATCH 2/5] net: Remove explicit write references to sk/inet in ip_append_data 2011-02-28 11:41 ` [PATCH 2/5] net: Remove explicit write references to sk/inet in ip_append_data Herbert Xu @ 2011-03-01 5:31 ` Eric Dumazet 0 siblings, 0 replies; 91+ messages in thread From: Eric Dumazet @ 2011-03-01 5:31 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev, Thomas Graf On Monday, 28 February 2011 at 19:41 +0800, Herbert Xu wrote: > net: Remove explicit write references to sk/inet in ip_append_data > > In order to allow simultaneous calls to ip_append_data on the same > socket, it must not modify any shared state in sk or inet (other > than those that are designed to allow that such as atomic counters). > > This patch abstracts out write references to sk and inet_sk in > ip_append_data and its friends so that we may use the underlying > code in parallel. > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> > --- > > include/net/inet_sock.h | 23 ++-- > net/ipv4/ip_output.c | 238 ++++++++++++++++++++++++++++-------------------- > 2 files changed, 154 insertions(+), 107 deletions(-) Acked-by: Eric Dumazet <eric.dumazet@gmail.com> ^ permalink raw reply [flat|nested] 91+ messages in thread
* [PATCH 1/5] net: Remove unused sk_sndmsg_* from UFO 2011-02-27 11:06 ` Herbert Xu ` (2 preceding siblings ...) 2011-02-28 11:41 ` [PATCH 2/5] net: Remove explicit write references to sk/inet in ip_append_data Herbert Xu @ 2011-02-28 11:41 ` Herbert Xu 2011-03-01 5:31 ` Eric Dumazet 2011-02-28 11:41 ` [PATCH 3/5] inet: Add ip_make_skb and ip_send_skb Herbert Xu 2011-02-28 11:41 ` [PATCH 4/5] udp: Add lockless transmit path Herbert Xu 5 siblings, 1 reply; 91+ messages in thread From: Herbert Xu @ 2011-02-28 11:41 UTC (permalink / raw) To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev, Thomas Graf net: Remove unused sk_sndmsg_* from UFO UFO doesn't really use the sk_sndmsg_* parameters so touching them is pointless. It can't use them anyway since the whole point of UFO is to use the original pages without copying. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> --- net/core/skbuff.c | 3 --- net/ipv4/ip_output.c | 1 - net/ipv6/ip6_output.c | 1 - 3 files changed, 5 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index d883dcc..97011a7 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -2434,8 +2434,6 @@ int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb, return -ENOMEM; /* initialize the next frag */ - sk->sk_sndmsg_page = page; - sk->sk_sndmsg_off = 0; skb_fill_page_desc(skb, frg_cnt, page, 0, 0); skb->truesize += PAGE_SIZE; atomic_add(PAGE_SIZE, &sk->sk_wmem_alloc); @@ -2455,7 +2453,6 @@ int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb, return -EFAULT; /* copy was successful so update the size parameters */ - sk->sk_sndmsg_off += copy; frag->size += copy; skb->len += copy; skb->data_len += copy; diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 04c7b3b..d3a4540 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -767,7 +767,6 @@ static inline int ip_ufo_append_data(struct sock *sk, skb->ip_summed = CHECKSUM_PARTIAL; skb->csum = 0; - sk->sk_sndmsg_off = 0; 
/* specify the length of each IP datagram fragment */ skb_shinfo(skb)->gso_size = mtu - fragheaderlen; diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 5f8d242..9965182 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -1061,7 +1061,6 @@ static inline int ip6_ufo_append_data(struct sock *sk, skb->ip_summed = CHECKSUM_PARTIAL; skb->csum = 0; - sk->sk_sndmsg_off = 0; } err = skb_append_datato_frags(sk,skb, getfrag, from, ^ permalink raw reply related [flat|nested] 91+ messages in thread
* Re: [PATCH 1/5] net: Remove unused sk_sndmsg_* from UFO 2011-02-28 11:41 ` [PATCH 1/5] net: Remove unused sk_sndmsg_* from UFO Herbert Xu @ 2011-03-01 5:31 ` Eric Dumazet 0 siblings, 0 replies; 91+ messages in thread From: Eric Dumazet @ 2011-03-01 5:31 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev, Thomas Graf On Monday, 28 February 2011 at 19:41 +0800, Herbert Xu wrote: > net: Remove unused sk_sndmsg_* from UFO > > UFO doesn't really use the sk_sndmsg_* parameters so touching > them is pointless. It can't use them anyway since the whole > point of UFO is to use the original pages without copying. > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> > --- Acked-by: Eric Dumazet <eric.dumazet@gmail.com> ^ permalink raw reply [flat|nested] 91+ messages in thread
* [PATCH 3/5] inet: Add ip_make_skb and ip_send_skb 2011-02-27 11:06 ` Herbert Xu ` (3 preceding siblings ...) 2011-02-28 11:41 ` [PATCH 1/5] net: Remove unused sk_sndmsg_* from UFO Herbert Xu @ 2011-02-28 11:41 ` Herbert Xu 2011-03-01 5:31 ` Eric Dumazet 2011-02-28 11:41 ` [PATCH 4/5] udp: Add lockless transmit path Herbert Xu 5 siblings, 1 reply; 91+ messages in thread From: Herbert Xu @ 2011-02-28 11:41 UTC (permalink / raw) To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev, Thomas Graf inet: Add ip_make_skb and ip_send_skb This patch adds the helper ip_make_skb which is like ip_append_data and ip_push_pending_frames all rolled into one, except that it does not send the skb produced. The sending part is carried out by ip_send_skb, which the transport protocol can call after it has tweaked the skb. It is meant to be called in cases where corking is not used should have a one-to-one correspondence to sendmsg. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> --- include/net/ip.h | 8 ++++++ net/ipv4/ip_output.c | 65 ++++++++++++++++++++++++++++++++++++++++----------- 2 files changed, 59 insertions(+), 14 deletions(-) diff --git a/include/net/ip.h b/include/net/ip.h index 67fac78..a96e525 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -116,8 +116,16 @@ extern int ip_append_data(struct sock *sk, extern int ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb); extern ssize_t ip_append_page(struct sock *sk, struct page *page, int offset, size_t size, int flags); +extern int ip_send_skb(struct sk_buff *skb); extern int ip_push_pending_frames(struct sock *sk); extern void ip_flush_pending_frames(struct sock *sk); +extern struct sk_buff *ip_make_skb(struct sock *sk, + int getfrag(void *from, char *to, int offset, int len, + int odd, struct sk_buff *skb), + void *from, int length, int transhdrlen, + struct ipcm_cookie *ipc, + struct rtable **rtp, + unsigned int flags); /* datagram.c */ extern 
int ip4_datagram_connect(struct sock *sk, diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 1dd5ecc..dba14c6 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1267,9 +1267,9 @@ static void ip_cork_release(struct inet_cork *cork) * Combined all pending IP fragments on the socket as one IP datagram * and push them out. */ -static int __ip_push_pending_frames(struct sock *sk, - struct sk_buff_head *queue, - struct inet_cork *cork) +static struct sk_buff *__ip_make_skb(struct sock *sk, + struct sk_buff_head *queue, + struct inet_cork *cork) { struct sk_buff *skb, *tmp_skb; struct sk_buff **tail_skb; @@ -1280,7 +1280,6 @@ static int __ip_push_pending_frames(struct sock *sk, struct iphdr *iph; __be16 df = 0; __u8 ttl; - int err = 0; if ((skb = __skb_dequeue(queue)) == NULL) goto out; @@ -1351,28 +1350,37 @@ static int __ip_push_pending_frames(struct sock *sk, icmp_out_count(net, ((struct icmphdr *) skb_transport_header(skb))->type); - /* Netfilter gets whole the not fragmented skb. */ + ip_cork_release(cork); +out: + return skb; +} + +int ip_send_skb(struct sk_buff *skb) +{ + struct net *net = sock_net(skb->sk); + int err; + err = ip_local_out(skb); if (err) { if (err > 0) err = net_xmit_errno(err); if (err) - goto error; + IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS); } -out: - ip_cork_release(cork); return err; - -error: - IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS); - goto out; } int ip_push_pending_frames(struct sock *sk) { - return __ip_push_pending_frames(sk, &sk->sk_write_queue, - &inet_sk(sk)->cork); + struct sk_buff *skb; + + skb = __ip_make_skb(sk, &sk->sk_write_queue, &inet_sk(sk)->cork); + if (!skb) + return 0; + + /* Netfilter gets whole the not fragmented skb. 
*/ + return ip_send_skb(skb); } /* @@ -1395,6 +1403,35 @@ void ip_flush_pending_frames(struct sock *sk) __ip_flush_pending_frames(sk, &sk->sk_write_queue, &inet_sk(sk)->cork); } +struct sk_buff *ip_make_skb(struct sock *sk, + int getfrag(void *from, char *to, int offset, + int len, int odd, struct sk_buff *skb), + void *from, int length, int transhdrlen, + struct ipcm_cookie *ipc, struct rtable **rtp, + unsigned int flags) +{ + struct inet_cork cork = {}; + struct sk_buff_head queue; + int err; + + if (flags & MSG_PROBE) + return NULL; + + __skb_queue_head_init(&queue); + + err = ip_setup_cork(sk, &cork, ipc, rtp); + if (err) + return ERR_PTR(err); + + err = __ip_append_data(sk, &queue, &cork, getfrag, + from, length, transhdrlen, flags); + if (err) { + __ip_flush_pending_frames(sk, &queue, &cork); + return ERR_PTR(err); + } + + return __ip_make_skb(sk, &queue, &cork); +} /* * Fetch data from kernel space and fill in checksum if needed. ^ permalink raw reply related [flat|nested] 91+ messages in thread
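ip_make_skb() follows the kernel's ERR_PTR convention: it returns NULL when there is nothing to build (the MSG_PROBE early return), an errno encoded into the pointer on failure, or a real skb. A userspace re-creation of that convention may make the `err = PTR_ERR(skb); if (skb && !IS_ERR(skb))` dance in the UDP caller easier to follow — this is simplified from the kernel's include/linux/err.h, and make_buf() is a made-up stand-in for ip_make_skb():

```c
/* Userspace sketch of the kernel's ERR_PTR idiom: errno values -1..-4095
 * map into the top (never-mapped) page of the address space, so a single
 * pointer can carry "valid buffer", "nothing to do" (NULL), or an error. */
#include <stdint.h>

#define MAX_ERRNO 4095

void *ERR_PTR(long error)
{
	return (void *)error;
}

long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Toy maker mirroring ip_make_skb()'s three outcomes: NULL means nothing
 * to do, ERR_PTR() means failure, anything else is a usable buffer. */
void *make_buf(int want, int fail)
{
	static char buf[64];

	if (!want)
		return 0;		/* like the MSG_PROBE early return */
	if (fail)
		return ERR_PTR(-12);	/* -ENOMEM */
	return buf;
}
```

The caller-side pattern is then exactly the one in the udp_sendmsg() hunk: take PTR_ERR() first, and only proceed when the pointer is both non-NULL and not an error.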
* Re: [PATCH 3/5] inet: Add ip_make_skb and ip_send_skb 2011-02-28 11:41 ` [PATCH 3/5] inet: Add ip_make_skb and ip_send_skb Herbert Xu @ 2011-03-01 5:31 ` Eric Dumazet 0 siblings, 0 replies; 91+ messages in thread From: Eric Dumazet @ 2011-03-01 5:31 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev, Thomas Graf On Monday, 28 February 2011 at 19:41 +0800, Herbert Xu wrote: > inet: Add ip_make_skb and ip_send_skb > > This patch adds the helper ip_make_skb which is like ip_append_data > and ip_push_pending_frames all rolled into one, except that it does > not send the skb produced. The sending part is carried out by > ip_send_skb, which the transport protocol can call after it has > tweaked the skb. > > It is meant to be called in cases where corking is not used [and] should > have a one-to-one correspondence to sendmsg. > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> ^ permalink raw reply [flat|nested] 91+ messages in thread
* [PATCH 4/5] udp: Add lockless transmit path 2011-02-27 11:06 ` Herbert Xu ` (4 preceding siblings ...) 2011-02-28 11:41 ` [PATCH 3/5] inet: Add ip_make_skb and ip_send_skb Herbert Xu @ 2011-02-28 11:41 ` Herbert Xu 2011-02-28 11:41 ` Herbert Xu 2011-03-01 5:30 ` Eric Dumazet 5 siblings, 2 replies; 91+ messages in thread From: Herbert Xu @ 2011-02-28 11:41 UTC (permalink / raw) To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev, Thomas Graf udp: Add lockless transmit path The UDP transmit path has been running under the socket lock for a long time because of the corking feature. This means that transmitting to the same socket in multiple threads does not scale at all. However, as most users don't actually use corking, the locking can be removed in the common case. This patch creates a lockless fast path where corking is not used. Please note that this does create a slight inaccuracy in the enforcement of socket send buffer limits. In particular, we may exceed the socket limit by up to (number of CPUs) * (packet size) because of the way the limit is computed. As the primary purpose of socket buffers is to indicate congestion, this should not be a great problem for now. 
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/net/udp.h     |   11 +++++
 include/net/udplite.h |   12 +++++
 net/ipv4/udp.c        |  104 +++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 126 insertions(+), 1 deletion(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index bb967dd..b8563ba 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -144,6 +144,17 @@ static inline __wsum udp_csum_outgoing(struct sock *sk, struct sk_buff *skb)
 	return csum;
 }
 
+static inline __wsum udp_csum(struct sk_buff *skb)
+{
+	__wsum csum = csum_partial(skb_transport_header(skb),
+				   sizeof(struct udphdr), skb->csum);
+
+	for (skb = skb_shinfo(skb)->frag_list; skb; skb = skb->next) {
+		csum = csum_add(csum, skb->csum);
+	}
+	return csum;
+}
+
 /* hash routines shared between UDPv4/6 and UDP-Litev4/6 */
 static inline void udp_lib_hash(struct sock *sk)
 {
diff --git a/include/net/udplite.h b/include/net/udplite.h
index afdffe6..673a024 100644
--- a/include/net/udplite.h
+++ b/include/net/udplite.h
@@ -115,6 +115,18 @@ static inline __wsum udplite_csum_outgoing(struct sock *sk, struct sk_buff *skb)
 	return csum;
 }
 
+static inline __wsum udplite_csum(struct sk_buff *skb)
+{
+	struct sock *sk = skb->sk;
+	int cscov = udplite_sender_cscov(udp_sk(sk), udp_hdr(skb));
+	const int off = skb_transport_offset(skb);
+	const int len = skb->len - off;
+
+	skb->ip_summed = CHECKSUM_NONE;     /* no HW support for checksumming */
+
+	return skb_checksum(skb, off, min(cscov, len), 0);
+}
+
 extern void udplite4_register(void);
 extern int udplite_get_port(struct sock *sk, unsigned short snum,
 			    int (*scmp)(const struct sock *, const struct sock *));
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8157b17..7fd3664 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -769,6 +769,95 @@ out:
 	return err;
 }
 
+static void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst)
+{
+	struct udphdr *uh = udp_hdr(skb);
+	struct sk_buff *frags = skb_shinfo(skb)->frag_list;
+	int offset = skb_transport_offset(skb);
+	int len = skb->len - offset;
+	int hlen = len;
+	__wsum csum = 0;
+
+	if (!frags) {
+		/*
+		 * Only one fragment on the socket.
+		 */
+		skb->csum_start = skb_transport_header(skb) - skb->head;
+		skb->csum_offset = offsetof(struct udphdr, check);
+		uh->check = ~csum_tcpudp_magic(src, dst, len,
+					       IPPROTO_UDP, 0);
+	} else {
+		/*
+		 * HW-checksum won't work as there are two or more
+		 * fragments on the socket so that all csums of sk_buffs
+		 * should be together
+		 */
+		do {
+			csum = csum_add(csum, frags->csum);
+			hlen -= frags->len;
+		} while ((frags = frags->next));
+
+		csum = skb_checksum(skb, offset, hlen, csum);
+		skb->ip_summed = CHECKSUM_NONE;
+
+		uh->check = csum_tcpudp_magic(src, dst, len, IPPROTO_UDP, csum);
+		if (uh->check == 0)
+			uh->check = CSUM_MANGLED_0;
+	}
+}
+
+static int udp_send_skb(struct sk_buff *skb, __be32 daddr, __be32 dport)
+{
+	struct sock *sk = skb->sk;
+	struct inet_sock *inet = inet_sk(sk);
+	struct udphdr *uh;
+	struct rtable *rt = (struct rtable *)skb_dst(skb);
+	int err = 0;
+	int is_udplite = IS_UDPLITE(sk);
+	int offset = skb_transport_offset(skb);
+	int len = skb->len - offset;
+	__wsum csum = 0;
+
+	/*
+	 * Create a UDP header
+	 */
+	uh = udp_hdr(skb);
+	uh->source = inet->inet_sport;
+	uh->dest = dport;
+	uh->len = htons(len);
+	uh->check = 0;
+
+	if (is_udplite)
+		csum = udplite_csum(skb);
+	else if (sk->sk_no_check == UDP_CSUM_NOXMIT) {
+		skb->ip_summed = CHECKSUM_NONE;
+		goto send;
+	} else if (skb->ip_summed == CHECKSUM_PARTIAL) {
+		udp4_hwcsum(skb, rt->rt_src, daddr);
+		goto send;
+	} else
+		csum = udp_csum(skb);
+
+	/* add protocol-dependent pseudo-header */
+	uh->check = csum_tcpudp_magic(rt->rt_src, daddr, len,
+				      sk->sk_protocol, csum);
+	if (uh->check == 0)
+		uh->check = CSUM_MANGLED_0;
+
+send:
+	err = ip_send_skb(skb);
+	if (err) {
+		if (err == -ENOBUFS && !inet->recverr) {
+			UDP_INC_STATS_USER(sock_net(sk),
+					   UDP_MIB_SNDBUFERRORS, is_udplite);
+			err = 0;
+		}
+	} else
+		UDP_INC_STATS_USER(sock_net(sk),
+				   UDP_MIB_OUTDATAGRAMS, is_udplite);
+	return err;
+}
+
 int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		size_t len)
 {
@@ -785,6 +874,7 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	int err, is_udplite = IS_UDPLITE(sk);
 	int corkreq = up->corkflag || msg->msg_flags&MSG_MORE;
 	int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
+	struct sk_buff *skb;
 
 	if (len > 0xFFFF)
 		return -EMSGSIZE;
@@ -799,6 +889,8 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	ipc.opt = NULL;
 	ipc.tx_flags = 0;
 
+	getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag;
+
 	if (up->pending) {
 		/*
 		 * There are pending frames.
@@ -923,6 +1015,17 @@ back_from_confirm:
 	if (!ipc.addr)
 		daddr = ipc.addr = rt->rt_dst;
 
+	/* Lockless fast path for the non-corking case. */
+	if (!corkreq) {
+		skb = ip_make_skb(sk, getfrag, msg->msg_iov, ulen,
+				  sizeof(struct udphdr), &ipc, &rt,
+				  msg->msg_flags);
+		err = PTR_ERR(skb);
+		if (skb && !IS_ERR(skb))
+			err = udp_send_skb(skb, daddr, dport);
+		goto out;
+	}
+
 	lock_sock(sk);
 	if (unlikely(up->pending)) {
 		/* The socket is already corked while preparing it. */
@@ -944,7 +1047,6 @@ back_from_confirm:
 
 do_append_data:
 	up->len += ulen;
-	getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag;
 	err = ip_append_data(sk, getfrag, msg->msg_iov, ulen,
 			     sizeof(struct udphdr), &ipc, &rt,
 			     corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);

^ permalink raw reply related	[flat|nested] 91+ messages in thread
* Re: [PATCH 4/5] udp: Add lockless transmit path 2011-02-28 11:41 ` [PATCH 4/5] udp: Add lockless transmit path Herbert Xu @ 2011-02-28 11:41 ` Herbert Xu 2011-03-01 5:30 ` Eric Dumazet 1 sibling, 0 replies; 91+ messages in thread From: Herbert Xu @ 2011-02-28 11:41 UTC (permalink / raw) To: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev, Thomas Graf On Mon, Feb 28, 2011 at 07:41:01PM +0800, Herbert Xu wrote: > udp: Add lockless transmit path Doh! There are only 4 patches in the series. So you didn't miss anything, yet :) -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: [PATCH 4/5] udp: Add lockless transmit path 2011-02-28 11:41 ` [PATCH 4/5] udp: Add lockless transmit path Herbert Xu 2011-02-28 11:41 ` Herbert Xu @ 2011-03-01 5:30 ` Eric Dumazet 1 sibling, 0 replies; 91+ messages in thread From: Eric Dumazet @ 2011-03-01 5:30 UTC (permalink / raw) To: Herbert Xu Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta, netdev, Thomas Graf On Monday, 28 February 2011 at 19:41 +0800, Herbert Xu wrote: > udp: Add lockless transmit path > > The UDP transmit path has been running under the socket lock > for a long time because of the corking feature. This means that > transmitting to the same socket in multiple threads does not > scale at all. > > However, as most users don't actually use corking, the locking > can be removed in the common case. > > This patch creates a lockless fast path where corking is not used. > > Please note that this does create a slight inaccuracy in the > enforcement of socket send buffer limits. In particular, we > may exceed the socket limit by up to (number of CPUs) * (packet > size) because of the way the limit is computed. > > As the primary purpose of socket buffers is to indicate congestion, > this should not be a great problem for now. > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> > --- So far I have found no obvious problems, and got pretty impressive results. Acked-by: Eric Dumazet <eric.dumazet@gmail.com> ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-25 19:18 ` Rick Jones 2011-02-25 19:20 ` David Miller @ 2011-02-25 19:21 ` Eric Dumazet 2011-02-25 22:48 ` Thomas Graf 2 siblings, 0 replies; 91+ messages in thread From: Eric Dumazet @ 2011-02-25 19:21 UTC (permalink / raw) To: rick.jones2 Cc: Thomas Graf, Tom Herbert, Bill Sommerfeld, Daniel Baluta, netdev On Friday, 25 February 2011 at 11:18 -0800, Rick Jones wrote: > I think the idea is goodness, but will ask, was the (first) bottleneck > actually in the kernel, or was it in bind itself? I've seen > single-instance, single-byte burst-mode netperf TCP_RR do in excess of > 300K transactions per second (with TCP_NODELAY set) on an X5560 core. > > ftp://ftp.netperf.org/netperf/misc/dl380g6_X5560_rhel54_ad386_cxgb3_1.4.1.2_b2b_to_same_agg_1500mtu_20100513-2.csv > > and that was with now ancient RHEL5.4 bits... yes, there is a bit of > apples, oranges and kumquats but still, I am wondering if this didn't > also "work around" some internal BIND scaling issues as well. > A single core can probably deliver 300K transactions. But when several cores access a single socket (the one bound to port 53), performance drops because of false sharing and locking. ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-25 19:18 ` Rick Jones 2011-02-25 19:20 ` David Miller 2011-02-25 19:21 ` SO_REUSEPORT - can it be done in kernel? Eric Dumazet @ 2011-02-25 22:48 ` Thomas Graf 2011-02-25 23:15 ` Rick Jones 2 siblings, 1 reply; 91+ messages in thread From: Thomas Graf @ 2011-02-25 22:48 UTC (permalink / raw) To: Rick Jones; +Cc: Tom Herbert, Bill Sommerfeld, Daniel Baluta, netdev On Fri, Feb 25, 2011 at 11:18:15AM -0800, Rick Jones wrote: > I think the idea is goodness, but will ask, was the (first) bottleneck > actually in the kernel, or was it in bind itself? I've seen > single-instance, single-byte burst-mode netperf TCP_RR do in excess of > 300K transactions per second (with TCP_NODELAY set) on an X5560 core. > > ftp://ftp.netperf.org/netperf/misc/dl380g6_X5560_rhel54_ad386_cxgb3_1.4.1.2_b2b_to_same_agg_1500mtu_20100513-2.csv > > and that was with now ancient RHEL5.4 bits... yes, there is a bit of > apples, oranges and kumquats but still, I am wondering if this didn't > also "work around" some internal BIND scaling issues as well. Yes, it is. We have observed two separate bottlenecks. The first we discovered is within BIND. With more than one worker thread in use, strace showed a ton of futex() system calls to the kernel as soon as the number of queries crossed a magic barrier. This suggested heavy lock contention within BIND. This BIND lock contention was not visible on all systems having scalability issues, though. Some machines were not able to deliver enough queries to BIND for the lock contention to appear. ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-25 22:48 ` Thomas Graf @ 2011-02-25 23:15 ` Rick Jones 0 siblings, 0 replies; 91+ messages in thread From: Rick Jones @ 2011-02-25 23:15 UTC (permalink / raw) To: Thomas Graf; +Cc: Tom Herbert, Bill Sommerfeld, Daniel Baluta, netdev On Fri, 2011-02-25 at 17:48 -0500, Thomas Graf wrote: > On Fri, Feb 25, 2011 at 11:18:15AM -0800, Rick Jones wrote: > > I think the idea is goodness, but will ask, was the (first) bottleneck > > actually in the kernel, or was it in bind itself? I've seen > > single-instance, single-byte burst-mode netperf TCP_RR do in excess of > > 300K transactions per second (with TCP_NODELAY set) on an X5560 core. > > > > ftp://ftp.netperf.org/netperf/misc/dl380g6_X5560_rhel54_ad386_cxgb3_1.4.1.2_b2b_to_same_agg_1500mtu_20100513-2.csv > > > > and that was with now ancient RHEL5.4 bits... yes, there is a bit of > > apples, oranges and kumquats but still, I am wondering if this didn't > > also "work around" some internal BIND scaling issues as well. > > Yes it is. We have observed two separate bottlenecks. > > The first we have discovered is within BIND. As soon as more than 1 > worker thread is being used strace showed a ton of futex() system > calls to the kernel as soon as the number of queries crossed a magic > barrier. This suggested heavy lock contention within BIND. The more things change, the more they remain the same; or perhaps "code may come and go, but lock contention is forever": ftp://ftp.cup.hp.com/dist/networking/briefs/bind9_perf.txt rick jones The system ftp.cup.hp.com is probably going away before long; I will probably put its collection of ancient writeups somewhere on netperf.org > > This BIND lock contention was not visible on all systems having scalability > issues though. Some machines were not able to deliver enough queries to > BIND in order for the lock contention to appear. ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-25 12:56 ` Thomas Graf 2011-02-25 19:18 ` Rick Jones @ 2011-02-25 19:51 ` Tom Herbert 2011-02-25 22:58 ` Thomas Graf 2011-02-25 23:33 ` Bill Sommerfeld 1 sibling, 2 replies; 91+ messages in thread From: Tom Herbert @ 2011-02-25 19:51 UTC (permalink / raw) To: Tom Herbert, Bill Sommerfeld, Daniel Baluta, netdev; +Cc: Thomas Graf > Using your SO_REUSEPORT patch and a modified bind using it. The same > system is able to deliver ~650K queries per seconds while maxing out > all cores completely. > Nice data point. > Tom, Bill: do you have a timeline for merging this? Especially the > UDP bits? > Bill has been working on the TCP implementation which is requiring some fairly major surgery on the listener connections in syn-rcvd state, this is ongoing. On the UDP side, I believe the patch is functional, but as Eric pointed out it probably could be further optimized. I'll split out the UDP bits into a separate patch and post that... Tom > -Thomas > ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-25 19:51 ` Tom Herbert @ 2011-02-25 22:58 ` Thomas Graf 0 siblings, 0 replies; 91+ messages in thread From: Thomas Graf @ 2011-02-25 22:58 UTC (permalink / raw) To: Tom Herbert; +Cc: Bill Sommerfeld, Daniel Baluta, netdev On Fri, Feb 25, 2011 at 11:51:13AM -0800, Tom Herbert wrote: > On the UDP side, I believe the patch is functional, but as Eric > pointed out it probably could be further optimized. I'll split out > the UDP bits into a separate patch and post that... Cool! I will be happy to assist in improving it further. We already see 97-99% CPU utilization spread evenly over all cores (with about 20-30% spent in softirq), so at least it already scales perfectly well. ^ permalink raw reply [flat|nested] 91+ messages in thread
* Re: SO_REUSEPORT - can it be done in kernel? 2011-02-25 19:51 ` Tom Herbert 2011-02-25 22:58 ` Thomas Graf @ 2011-02-25 23:33 ` Bill Sommerfeld 1 sibling, 0 replies; 91+ messages in thread From: Bill Sommerfeld @ 2011-02-25 23:33 UTC (permalink / raw) To: Tom Herbert; +Cc: Daniel Baluta, netdev, Thomas Graf On Fri, Feb 25, 2011 at 11:51, Tom Herbert <therbert@google.com> wrote: >> Tom, Bill: do you have a timeline for merging this? Especially the >> UDP bits? > Bill has been working on the TCP implementation which is requiring > some fairly major surgery on the listener connections in syn-rcvd > state, this is ongoing. Yup. The broad approach I settled on is to delay binding of new connections to listener sockets by moving receive_sock's from a per-listen_sock hash table to new hash chains in the global hash table. This is very much a work-in-progress. I'm part way through the conversion and have running code with most of the new structures in place in parallel with the old; I'm about to start relying exclusively on the new, and then will tear down the old; once that's done I'll be in a position to hook that up to SO_REUSEPORT and start actually measuring the difference. In short: it will be a while. So splitting SO_REUSEPORT for UDP from SO_REUSEPORT for TCP makes a lot of sense to me. ^ permalink raw reply [flat|nested] 91+ messages in thread
end of thread, other threads:[~2011-03-02 8:24 UTC | newest] Thread overview: 91+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2011-01-27 10:07 SO_REUSEPORT - can it be done in kernel? Daniel Baluta 2011-01-27 15:55 ` Bill Sommerfeld 2011-01-27 21:32 ` Tom Herbert 2011-02-25 12:56 ` Thomas Graf 2011-02-25 19:18 ` Rick Jones 2011-02-25 19:20 ` David Miller 2011-02-26 0:57 ` Herbert Xu 2011-02-26 2:12 ` David Miller 2011-02-26 2:48 ` Herbert Xu 2011-02-26 3:07 ` David Miller 2011-02-26 3:11 ` Herbert Xu 2011-02-26 7:31 ` Eric Dumazet 2011-02-26 7:46 ` David Miller 2011-02-27 11:02 ` Thomas Graf 2011-02-27 11:06 ` Herbert Xu 2011-02-28 3:45 ` Tom Herbert 2011-02-28 4:26 ` Herbert Xu 2011-02-28 11:36 ` Herbert Xu 2011-02-28 13:32 ` Eric Dumazet 2011-02-28 14:13 ` Herbert Xu 2011-02-28 14:22 ` Eric Dumazet 2011-02-28 14:25 ` Herbert Xu 2011-02-28 14:53 ` Eric Dumazet 2011-02-28 15:01 ` Thomas Graf 2011-02-28 14:13 ` Thomas Graf 2011-02-28 16:22 ` Eric Dumazet 2011-02-28 16:37 ` Thomas Graf 2011-02-28 17:07 ` Eric Dumazet 2011-03-01 10:19 ` Thomas Graf 2011-03-01 10:33 ` Eric Dumazet 2011-03-01 11:07 ` Thomas Graf 2011-03-01 11:13 ` Eric Dumazet 2011-03-01 11:27 ` Thomas Graf 2011-03-01 11:45 ` Eric Dumazet 2011-03-01 11:53 ` Herbert Xu 2011-03-01 12:32 ` Herbert Xu 2011-03-01 13:04 ` Eric Dumazet 2011-03-01 13:11 ` Herbert Xu 2011-03-01 13:03 ` Eric Dumazet 2011-03-01 13:18 ` Herbert Xu 2011-03-01 13:52 ` Eric Dumazet 2011-03-01 13:58 ` Herbert Xu 2011-03-01 16:31 ` Eric Dumazet 2011-03-02 0:23 ` Herbert Xu 2011-03-02 2:00 ` Eric Dumazet 2011-03-02 2:39 ` Herbert Xu 2011-03-02 2:56 ` Eric Dumazet 2011-03-02 3:09 ` Herbert Xu 2011-03-02 3:44 ` Eric Dumazet 2011-03-02 7:12 ` Tom Herbert 2011-03-02 7:31 ` Herbert Xu 2011-03-02 8:04 ` Eric Dumazet 2011-03-02 8:07 ` Herbert Xu 2011-03-02 8:24 ` Eric Dumazet 2011-03-01 12:01 ` Thomas Graf 2011-03-01 12:15 ` Herbert Xu 2011-03-01 13:27 ` Herbert Xu 2011-03-01 12:18 ` 
Thomas Graf 2011-03-01 12:19 ` Herbert Xu 2011-03-01 13:50 ` Thomas Graf 2011-03-01 14:06 ` Eric Dumazet 2011-03-01 14:22 ` Thomas Graf 2011-03-01 14:30 ` Thomas Graf 2011-03-01 14:52 ` Eric Dumazet 2011-03-01 15:07 ` Thomas Graf 2011-03-01 5:33 ` Eric Dumazet 2011-03-01 12:35 ` Herbert Xu 2011-03-01 12:36 ` [PATCH 1/5] inet: Remove unused sk_sndmsg_* from UFO Herbert Xu 2011-03-01 12:36 ` [PATCH 3/5] inet: Add ip_make_skb and ip_finish_skb Herbert Xu 2011-03-01 12:36 ` [PATCH 2/5] inet: Remove explicit write references to sk/inet in ip_append_data Herbert Xu 2011-03-02 6:15 ` inet: Replace left-over references to inet->cork Herbert Xu 2011-03-02 7:01 ` David Miller 2011-03-01 12:36 ` [PATCH 4/5] udp: Switch to ip_finish_skb Herbert Xu 2011-03-01 12:36 ` [PATCH 5/5] udp: Add lockless transmit path Herbert Xu 2011-03-01 16:43 ` SO_REUSEPORT - can it be done in kernel? Eric Dumazet 2011-03-01 20:36 ` David Miller 2011-02-28 11:41 ` [PATCH 2/5] net: Remove explicit write references to sk/inet in ip_append_data Herbert Xu 2011-03-01 5:31 ` Eric Dumazet 2011-02-28 11:41 ` [PATCH 1/5] net: Remove unused sk_sndmsg_* from UFO Herbert Xu 2011-03-01 5:31 ` Eric Dumazet 2011-02-28 11:41 ` [PATCH 3/5] inet: Add ip_make_skb and ip_send_skb Herbert Xu 2011-03-01 5:31 ` Eric Dumazet 2011-02-28 11:41 ` [PATCH 4/5] udp: Add lockless transmit path Herbert Xu 2011-02-28 11:41 ` Herbert Xu 2011-03-01 5:30 ` Eric Dumazet 2011-02-25 19:21 ` SO_REUSEPORT - can it be done in kernel? Eric Dumazet 2011-02-25 22:48 ` Thomas Graf 2011-02-25 23:15 ` Rick Jones 2011-02-25 19:51 ` Tom Herbert 2011-02-25 22:58 ` Thomas Graf 2011-02-25 23:33 ` Bill Sommerfeld