From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754516AbZDKEQU (ORCPT ); Sat, 11 Apr 2009 00:16:20 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753225AbZDKEPu (ORCPT ); Sat, 11 Apr 2009 00:15:50 -0400 Received: from e7.ny.us.ibm.com ([32.97.182.137]:38806 "EHLO e7.ny.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753151AbZDKEPr (ORCPT ); Sat, 11 Apr 2009 00:15:47 -0400 Date: Fri, 10 Apr 2009 21:15:33 -0700 From: "Paul E. McKenney" To: Linus Torvalds Cc: David Miller , Ingo Molnar , Lai Jiangshan , shemminger@vyatta.com, jeff.chua.linux@gmail.com, dada1@cosmosbay.com, jengelh@medozas.de, kaber@trash.net, r000n@r000n.net, Linux Kernel Mailing List , netfilter-devel@vger.kernel.org, netdev@vger.kernel.org Subject: Re: iptables very slow after commit 784544739a25c30637397ace5489eeb6e15d7d49 Message-ID: <20090411041533.GB6822@linux.vnet.ibm.com> Reply-To: paulmck@linux.vnet.ibm.com References: <20090410095246.4fdccb56@s6510> <20090410.182507.140306636.davem@davemloft.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.15+20070412 (2007-04-11) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Apr 10, 2009 at 06:39:18PM -0700, Linus Torvalds wrote: > > > On Fri, 10 Apr 2009, David Miller wrote: > > > > [ CC:'ing netfilter-devel and netdev... ] > > I wonder if we should bring in the RCU people too, for them to tell you > that the networking people are beign silly, and should not synchronize > with the very heavy-handed > > synchronize_net() > > but instead of doing synchronization (which is probably why adding a few > hundred rules then takes several seconds - each synchronizes and that > takes a timer tick or so), add the rules to be free'd on some rcu-freeing > list for later freeing. > > Or whatever. Paul? synchronize_net() just calls synchronize_rcu(), and > with that knowledge and a simple > > git show 784544739a25c30637397ace5489eeb6e15d7d49 > > I bet you can already tell people how to fix their performance issue. Well, I am certainly happy to demonstrate my ignorance of the networking code by throwing out a few suggestions. So, Dave and Steve, you might want to get out your barf bag before reading further. You have been warned! ;-) 1. Assuming that the synchronize_net() is intended to guarantee that the new rules will be in effect before returning to user space: a. Split this functionality, so that there is a new user-space primitive that installs a new rule, but without waiting. They provide an additional user-space primitive that waits for the rules to take effect. Then, when loading a long list of rules, load them using the non-waiting primitive, and wait at the end. b. As above, but provide a flag that says whether or not to wait. Same general effect. But I am not seeing the direct connection between this patch and netfilter, so... 2. For the xt_replace_table() case, it would be necessary to add an rcu_head to the xt_table_info, and replace each caller's direct calls to xt_free_table_info() with call_rcu(). Now this has an issue in that the caller wants to return the final counter values. My assumption is that these values do not in fact need to be exact. If I am wrong about that, then my suggestion would lose the counts from late readers. I must defer to the networking guys as to whether this is acceptable or not. If not, more head-scratching would be required. (But it looks to me that the rule is being trashed, so who cares about the extra counts?) In addition, a malicious user might be able to force this to happen extremely frequently, running the system out of memory. One way to fix this is to invoke synchronize_net() one out of 20 times or some such. 3. For the alloc_counters() case, the comments indicate that we really truly do want an atomic sampling of the counters. The counters are 64-bit entities, which is a bit inconvenient. Though people using this functionality are no doubt quite happy to never have to worry about overflow, I hasten to add! I will nevertheless suggest the following egregious hack to get a consistent sample of one counter for some other CPU: a. Disable interrupts b. Atomically exchange the bottom 32 bits of the counter with the value zero. c. Atomically exchange the top 32 bits of the counter with the value zero. d. Concatenate the values obtained in (b) and (c), which is the snapshot value. e. Re-enable interrupts. Yes, for each counter. Do it for the honor of the -rt patchset. ;-) Disabling interrupts should make it impossible for the low-order 32 bits of the counter to overflow before we get around to zeroing the upper 32 bits. Yes, this is horribly paranoid, but please keep in mind that even my level of paranoia is not always sufficient to keep RCU working correctly. :-/ Architectures with 64-bit atomics can simply do a 64-bit exchange (or cmpxchg(), for that matter). Now we still have the possibility that the other CPU is still hammering away on the counter that we just zeroed from a long-running RCU read-side critical section. So, we also need to add an rcu_head somewhere, perhaps reuse the one in xt_table_info, create a second one, or squirrel one away somewhere else. As long as there is a way to get to the old counter values. And a flag to indicate that the rcu_head is in use. It is socially irresponsible to pass a given rcu_head to call_rcu() before it has been invoked after the previous time it was passed to call_rcu(). But you guys all knew that already. We replace the synchronize_net() with call_rcu(), more or less. The call_rcu() probably needs to be under the lock -- or at the very least, setting the flag saying that it is in use needs to be under the lock. The RCU callback function traverses the old counters one last time, adding their values to the new set of counters. No atomic exchange tricks are required this time, since all the RCU readers that could possibly have held a reference to the old set of counters must now be done. We now clear the flag, allowing the next counter snapshot to proceed. OK, OK, Dave and Steve, I should have suggested that you get two barf bags. Maybe three. ;-) Additional caveat: coward that I am, I looked only at the IPv4 code. There might well be additional complications in the arp and IPv6 code. However, I do believe that something like this might actually work. Thoughts? Thanx, Paul > Linus > > --- > > > On Fri, 10 Apr 2009 17:15:52 +0800 (SGT) > > > Jeff Chua wrote: > > >> > > >> Adding 200 records in iptables took 6.0sec in 2.6.30-rc1 compared to > > >> 0.2sec in 2.6.29. I've bisected down this commit. > > >> > > >> There are a few patches on top of the original patch. When I reverted the > > >> original commit + changing rcu_read() to rcu_read_bh(), it speeds up the > > >> inserts back to .2sec again. > > >> > > >> I'm loading all the firewall rules during boot-up and this 6 secs slowness > > >> is really not very nice to wait for. > > > > > > The performance benefit during operation is more important. The load > > > time is fixable. The problem is probably generic to any set of rules, > > > but could you post some info about your configuration (like the rule > > > set), and the system configuration (# of cpu's, config etc). > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > Please read the FAQ at http://www.tux.org/lkml/ > > > -- > To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html