* [PATCH nf-next] netfilter: conntrack: remove the central spinlock
@ 2013-05-09  3:04 Eric Dumazet
  2013-05-09  5:43 ` Cong Wang
  2013-05-22 17:47 ` [PATCH v2 " Eric Dumazet
  0 siblings, 2 replies; 21+ messages in thread
From: Eric Dumazet @ 2013-05-09  3:04 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, netdev, Tom Herbert, Patrick McHardy

From: Eric Dumazet <edumazet@google.com>

nf_conntrack_lock is a monolithic lock and suffers from huge contention
on current generation servers (8 or more cores/threads).

Split this lock into three distinct components:

1) One array of hashed spinlocks to protect insertions/deletions of
   conntracks into the hash table. 1024 spinlocks seem to give good
   results, at minimal cost (4KB memory).

2) One spinlock per cpu to protect dying/unconfirmed special lists.
   (These lists are now per cpu, a bit like the untracked ct.)
   Add a @cpu field to nf_conn, to make sure we hold the appropriate
   spinlock at removal time.

3) One spinlock to protect the expectations: nf_conntrack_expect_lock

The hash resize is a bit tricky, because we need to take all locks in
the array. LOCKDEP being currently unable to handle this, we use
arch_spinlock_t and make sure we don't deadlock. (A conntrack has two
directions and usually needs to lock two spinlocks.)

A seqcount_t is used to synchronize the hash table users with the
resizing process.

Reorder struct netns_ct so that @count changes don't slow down users
of read-only fields.

Results on a 32-thread machine, 200 concurrent instances of
"netperf -t TCP_CRR": ~390000 tps instead of ~300000 tps.

perf top before patch:

    51.16%  [kernel]        [k] _raw_spin_lock_bh
     1.10%  [ip_tables]     [k] ipt_do_table
     1.00%  [nf_conntrack]  [k] tcp_packet
     0.76%  [kernel]        [k] nf_iterate
     0.71%  [kernel]        [k] _raw_spin_lock
     0.70%  [kernel]        [k] kmem_cache_free
     0.67%  [kernel]        [k] tcp_ack
     0.62%  [kernel]        [k] tcp_transmit_skb
     0.61%  [kernel]        [k] local_bh_disable
     0.60%  [nf_conntrack]  [k] ____nf_conntrack_find
     0.59%  [kernel]        [k] __schedule
     0.57%  [kernel]        [k] __netif_receive_skb_core
     0.55%  [kernel]        [k] tcp_v4_rcv
     0.50%  [kernel]        [k] local_bh_enable

After patch:

     2.62%  [nf_conntrack]  [k] tcp_packet
     2.10%  [ip_tables]     [k] ipt_do_table
     1.73%  [kernel]        [k] _raw_spin_lock
     1.71%  [kernel]        [k] nf_iterate
     1.62%  [kernel]        [k] tcp_ack
     1.36%  [kernel]        [k] tcp_transmit_skb
     1.28%  [kernel]        [k] __netif_receive_skb_core
     1.28%  [nf_conntrack]  [k] __nf_conntrack_find_get
     1.27%  [kernel]        [k] local_bh_enable
     1.27%  [kernel]        [k] __schedule
     1.23%  [kernel]        [k] kmem_cache_free
     1.22%  [kernel]        [k] local_bh_disable
     1.15%  [kernel]        [k] tcp_v4_rcv
     1.06%  [nf_conntrack]  [k] nf_conntrack_in
     1.02%  [kernel]        [k] _raw_spin_lock_bh
     0.96%  [kernel]        [k] __inet_lookup_established
     0.90%  [kernel]        [k] _raw_spin_lock_irqsave

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Tom Herbert <therbert@google.com>
---
 include/net/netfilter/nf_conntrack.h      |    7
 include/net/netfilter/nf_conntrack_core.h |    5
 include/net/netns/conntrack.h             |   30 -
 net/netfilter/nf_conntrack_core.c         |  423 +++++++++++++-------
 net/netfilter/nf_conntrack_expect.c       |   18
 net/netfilter/nf_conntrack_h323_main.c    |    4
 net/netfilter/nf_conntrack_helper.c       |   37 +
 net/netfilter/nf_conntrack_netlink.c      |  128 +++---
 net/netfilter/nf_conntrack_sip.c          |    8
 9 files changed, 427 insertions(+), 233 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 644d9c2..2311914 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -76,7 +76,8 @@ struct
nf_conn { plus 1 for any connection(s) we are `master' for */ struct nf_conntrack ct_general; - spinlock_t lock; + spinlock_t lock; + u16 cpu; /* XXX should I move this to the tail ? - Y.K */ /* These are my tuples; original and reply */ @@ -176,10 +177,6 @@ extern void *nf_ct_alloc_hashtable(unsigned int *sizep, int nulls); extern void nf_ct_free_hashtable(void *hash, unsigned int size); -extern struct nf_conntrack_tuple_hash * -__nf_conntrack_find(struct net *net, u16 zone, - const struct nf_conntrack_tuple *tuple); - extern int nf_conntrack_hash_check_insert(struct nf_conn *ct); extern void nf_ct_delete_from_lists(struct nf_conn *ct); extern void nf_ct_dying_timeout(struct nf_conn *ct); diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h index fb2b623..51ba13e 100644 --- a/include/net/netfilter/nf_conntrack_core.h +++ b/include/net/netfilter/nf_conntrack_core.h @@ -84,6 +84,9 @@ print_tuple(struct seq_file *s, const struct nf_conntrack_tuple *tuple, const struct nf_conntrack_l3proto *l3proto, const struct nf_conntrack_l4proto *proto); -extern spinlock_t nf_conntrack_lock ; +#define CONNTRACK_LOCKS 1024 +extern arch_spinlock_t nf_conntrack_locks[]; + +extern spinlock_t nf_conntrack_expect_lock ; #endif /* _NF_CONNTRACK_CORE_H */ diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h index c9c0c53..f7db370 100644 --- a/include/net/netns/conntrack.h +++ b/include/net/netns/conntrack.h @@ -5,6 +5,7 @@ #include <linux/list_nulls.h> #include <linux/atomic.h> #include <linux/netfilter/nf_conntrack_tcp.h> +#include <linux/seqlock.h> struct ctl_table_header; struct nf_conntrack_ecache; @@ -62,16 +63,31 @@ struct nf_ip_net { #endif }; +struct ct_pcpu { + spinlock_t lock; + struct hlist_nulls_head unconfirmed; + struct hlist_nulls_head dying; +}; + struct netns_ct { atomic_t count; + struct hlist_nulls_head tmpl; +#ifdef CONFIG_SYSCTL + struct ctl_table_header *sysctl_header; + struct ctl_table_header *acct_sysctl_header; + struct ctl_table_header *tstamp_sysctl_header; + struct ctl_table_header *event_sysctl_header; + struct ctl_table_header *helper_sysctl_header; +#endif + char *slabname; + unsigned int expect_count; unsigned int htable_size; + seqcount_t generation; struct kmem_cache *nf_conntrack_cachep; struct hlist_nulls_head *hash; struct hlist_head *expect_hash; - struct hlist_nulls_head unconfirmed; - struct hlist_nulls_head dying; - struct hlist_nulls_head tmpl; + struct ct_pcpu __percpu *pcpu_lists; struct ip_conntrack_stat __percpu *stat; struct nf_ct_event_notifier __rcu *nf_conntrack_event_cb; struct nf_exp_event_notifier __rcu *nf_expect_event_cb; @@ -92,13 +108,5 @@ struct netns_ct { struct hlist_head *nat_bysource; unsigned int nat_htable_size; #endif -#ifdef CONFIG_SYSCTL - struct ctl_table_header *sysctl_header; - struct ctl_table_header *acct_sysctl_header; - struct ctl_table_header *tstamp_sysctl_header; - struct ctl_table_header *event_sysctl_header; - struct ctl_table_header *helper_sysctl_header; -#endif - char *slabname; }; #endif diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 0283bae..a48ec55 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -64,8 +64,57 @@ int (*nf_nat_seq_adjust_hook)(struct sk_buff *skb, unsigned int protoff); EXPORT_SYMBOL_GPL(nf_nat_seq_adjust_hook); -DEFINE_SPINLOCK(nf_conntrack_lock); -EXPORT_SYMBOL_GPL(nf_conntrack_lock); +__cacheline_aligned_in_smp arch_spinlock_t nf_conntrack_locks[CONNTRACK_LOCKS]; 
+EXPORT_SYMBOL_GPL(nf_conntrack_locks); + +__cacheline_aligned_in_smp DEFINE_SPINLOCK(nf_conntrack_expect_lock); +EXPORT_SYMBOL_GPL(nf_conntrack_expect_lock); + +static void nf_conntrack_double_unlock(unsigned int h1, unsigned int h2) +{ + h1 %= CONNTRACK_LOCKS; + h2 %= CONNTRACK_LOCKS; + arch_spin_unlock(&nf_conntrack_locks[h1]); + if (h1 != h2) + arch_spin_unlock(&nf_conntrack_locks[h2]); +} + +/* return true if we need to recompute hashes (in case hash table was resized) */ +static bool nf_conntrack_double_lock(struct net *net, unsigned int h1, + unsigned int h2, unsigned int sequence) +{ + h1 %= CONNTRACK_LOCKS; + h2 %= CONNTRACK_LOCKS; + if (h1 <= h2) { + arch_spin_lock(&nf_conntrack_locks[h1]); + if (h1 != h2) + arch_spin_lock(&nf_conntrack_locks[h2]); + } else { + arch_spin_lock(&nf_conntrack_locks[h2]); + arch_spin_lock(&nf_conntrack_locks[h1]); + } + if (read_seqcount_retry(&net->ct.generation, sequence)) { + nf_conntrack_double_unlock(h1, h2); + return true; + } + return false; +} + +static void nf_conntrack_all_lock(void) +{ + int i; + + for (i = 0; i < CONNTRACK_LOCKS; i++) + arch_spin_lock(&nf_conntrack_locks[i]); +} + +static void nf_conntrack_all_unlock(void) +{ + int i; + + for (i = 0; i < CONNTRACK_LOCKS; i++) + arch_spin_unlock(&nf_conntrack_locks[i]); +} unsigned int nf_conntrack_htable_size __read_mostly; EXPORT_SYMBOL_GPL(nf_conntrack_htable_size); @@ -196,6 +245,47 @@ clean_from_lists(struct nf_conn *ct) nf_ct_remove_expectations(ct); } +static void nf_ct_add_to_dying_list(struct nf_conn *ct) +{ + struct ct_pcpu *pcpu; + + /* add this conntrack to the (per cpu) dying list */ + ct->cpu = raw_smp_processor_id(); + pcpu = per_cpu_ptr(nf_ct_net(ct)->ct.pcpu_lists, ct->cpu); + + spin_lock(&pcpu->lock); + hlist_nulls_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, + &pcpu->dying); + spin_unlock(&pcpu->lock); +} + +static void nf_ct_add_to_unconfirmed_list(struct nf_conn *ct) +{ + struct ct_pcpu *pcpu; + + /* add this conntrack to the (per cpu) unconfirmed list */ + ct->cpu = raw_smp_processor_id(); + pcpu = per_cpu_ptr(nf_ct_net(ct)->ct.pcpu_lists, ct->cpu); + + spin_lock(&pcpu->lock); + hlist_nulls_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, + &pcpu->unconfirmed); + spin_unlock(&pcpu->lock); +} + +static void nf_ct_del_from_dying_or_unconfirmed_list(struct nf_conn *ct) +{ + struct ct_pcpu *pcpu; + + /* We overload first tuple to link into unconfirmed or dying list.*/ + pcpu = per_cpu_ptr(nf_ct_net(ct)->ct.pcpu_lists, ct->cpu); + + spin_lock(&pcpu->lock); + BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode)); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); + spin_unlock(&pcpu->lock); +} + static void destroy_conntrack(struct nf_conntrack *nfct) { @@ -207,9 +297,6 @@ destroy_conntrack(struct nf_conntrack *nfct) NF_CT_ASSERT(atomic_read(&nfct->use) == 0); NF_CT_ASSERT(!timer_pending(&ct->timeout)); - /* To make sure we don't get any weird locking issues here: - * destroy_conntrack() MUST NOT be called with a write lock - * to nf_conntrack_lock!!! -HW */ rcu_read_lock(); l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct)); if (l4proto && l4proto->destroy) @@ -217,19 +304,21 @@ destroy_conntrack(struct nf_conntrack *nfct) rcu_read_unlock(); - spin_lock_bh(&nf_conntrack_lock); - /* Expectations will have been removed in clean_from_lists, - * except TFTP can create an expectation on the first packet, - * before connection is in the list, so we need to clean here, - * too. 
*/ - nf_ct_remove_expectations(ct); - - /* We overload first tuple to link into unconfirmed or dying list.*/ - BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode)); - hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); + local_bh_disable(); + if (nfct_help(ct)) { + spin_lock(&nf_conntrack_expect_lock); + /* Expectations will have been removed in clean_from_lists, + * except TFTP can create an expectation on the first packet, + * before connection is in the list, so we need to clean here, + * too. + */ + nf_ct_remove_expectations(ct); + spin_unlock(&nf_conntrack_expect_lock); + } + nf_ct_del_from_dying_or_unconfirmed_list(ct); NF_CT_STAT_INC(net, delete); - spin_unlock_bh(&nf_conntrack_lock); + local_bh_enable(); if (ct->master) nf_ct_put(ct->master); @@ -241,17 +330,28 @@ destroy_conntrack(struct nf_conntrack *nfct) void nf_ct_delete_from_lists(struct nf_conn *ct) { struct net *net = nf_ct_net(ct); + unsigned int hash, repl_hash; + u16 zone = nf_ct_zone(ct); + unsigned int sequence; nf_ct_helper_destroy(ct); - spin_lock_bh(&nf_conntrack_lock); - /* Inside lock so preempt is disabled on module removal path. - * Otherwise we can get spurious warnings. */ - NF_CT_STAT_INC(net, delete_list); + + local_bh_disable(); + do { + sequence = read_seqcount_begin(&net->ct.generation); + hash = hash_conntrack(net, zone, + &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple); + repl_hash = hash_conntrack(net, zone, + &ct->tuplehash[IP_CT_DIR_REPLY].tuple); + } while (nf_conntrack_double_lock(net, hash, repl_hash, sequence)); + clean_from_lists(ct); - /* add this conntrack to the dying list */ - hlist_nulls_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, - &net->ct.dying); - spin_unlock_bh(&nf_conntrack_lock); + nf_conntrack_double_unlock(hash, repl_hash); + + nf_ct_add_to_dying_list(ct); + + NF_CT_STAT_INC(net, delete_list); + local_bh_enable(); } EXPORT_SYMBOL_GPL(nf_ct_delete_from_lists); @@ -315,8 +415,6 @@ static void death_by_timeout(unsigned long ul_conntrack) * Warning : * - Caller must take a reference on returned object * and recheck nf_ct_tuple_equal(tuple, &h->tuple) - * OR - * - Caller must lock nf_conntrack_lock before calling this function */ static struct nf_conntrack_tuple_hash * ____nf_conntrack_find(struct net *net, u16 zone, @@ -354,15 +452,6 @@ begin: return NULL; } -struct nf_conntrack_tuple_hash * -__nf_conntrack_find(struct net *net, u16 zone, - const struct nf_conntrack_tuple *tuple) -{ - return ____nf_conntrack_find(net, zone, tuple, - hash_conntrack_raw(tuple, zone)); -} -EXPORT_SYMBOL_GPL(__nf_conntrack_find); - /* Find a connection corresponding to a tuple. 
*/ static struct nf_conntrack_tuple_hash * __nf_conntrack_find_get(struct net *net, u16 zone, @@ -421,14 +510,18 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct) struct nf_conntrack_tuple_hash *h; struct hlist_nulls_node *n; u16 zone; + unsigned int sequence; zone = nf_ct_zone(ct); - hash = hash_conntrack(net, zone, - &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple); - repl_hash = hash_conntrack(net, zone, - &ct->tuplehash[IP_CT_DIR_REPLY].tuple); - spin_lock_bh(&nf_conntrack_lock); + local_bh_disable(); + do { + sequence = read_seqcount_begin(&net->ct.generation); + hash = hash_conntrack(net, zone, + &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple); + repl_hash = hash_conntrack(net, zone, + &ct->tuplehash[IP_CT_DIR_REPLY].tuple); + } while (nf_conntrack_double_lock(net, hash, repl_hash, sequence)); /* See if there's one in the list already, including reverse */ hlist_nulls_for_each_entry(h, n, &net->ct.hash[hash], hnnode) @@ -445,14 +538,15 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct) add_timer(&ct->timeout); nf_conntrack_get(&ct->ct_general); __nf_conntrack_hash_insert(ct, hash, repl_hash); + nf_conntrack_double_unlock(hash, repl_hash); NF_CT_STAT_INC(net, insert); - spin_unlock_bh(&nf_conntrack_lock); - + local_bh_enable(); return 0; out: + nf_conntrack_double_unlock(hash, repl_hash); NF_CT_STAT_INC(net, insert_failed); - spin_unlock_bh(&nf_conntrack_lock); + local_bh_enable(); return -EEXIST; } EXPORT_SYMBOL_GPL(nf_conntrack_hash_check_insert); @@ -470,6 +564,7 @@ __nf_conntrack_confirm(struct sk_buff *skb) enum ip_conntrack_info ctinfo; struct net *net; u16 zone; + unsigned int sequence; ct = nf_ct_get(skb, &ctinfo); net = nf_ct_net(ct); @@ -482,31 +577,37 @@ __nf_conntrack_confirm(struct sk_buff *skb) return NF_ACCEPT; zone = nf_ct_zone(ct); - /* reuse the hash saved before */ - hash = *(unsigned long *)&ct->tuplehash[IP_CT_DIR_REPLY].hnnode.pprev; - hash = hash_bucket(hash, net); - repl_hash = hash_conntrack(net, zone, - &ct->tuplehash[IP_CT_DIR_REPLY].tuple); + local_bh_disable(); + + do { + sequence = read_seqcount_begin(&net->ct.generation); + /* reuse the hash saved before */ + hash = *(unsigned long *)&ct->tuplehash[IP_CT_DIR_REPLY].hnnode.pprev; + hash = hash_bucket(hash, net); + repl_hash = hash_conntrack(net, zone, + &ct->tuplehash[IP_CT_DIR_REPLY].tuple); + + } while (nf_conntrack_double_lock(net, hash, repl_hash, sequence)); /* We're not in hash table, and we refuse to set up related - connections for unconfirmed conns. But packet copies and - REJECT will give spurious warnings here. */ + * connections for unconfirmed conns. But packet copies and + * REJECT will give spurious warnings here. + */ /* NF_CT_ASSERT(atomic_read(&ct->ct_general.use) == 1); */ /* No external references means no one else could have - confirmed us. */ + * confirmed us. 
+ */ NF_CT_ASSERT(!nf_ct_is_confirmed(ct)); pr_debug("Confirming conntrack %p\n", ct); - - spin_lock_bh(&nf_conntrack_lock); - /* We have to check the DYING flag inside the lock to prevent a race against nf_ct_get_next_corpse() possibly called from user context, else we insert an already 'dead' hash, blocking further use of that particular connection -JM */ if (unlikely(nf_ct_is_dying(ct))) { - spin_unlock_bh(&nf_conntrack_lock); + nf_conntrack_double_unlock(hash, repl_hash); + local_bh_enable(); return NF_ACCEPT; } @@ -524,8 +625,7 @@ __nf_conntrack_confirm(struct sk_buff *skb) zone == nf_ct_zone(nf_ct_tuplehash_to_ctrack(h))) goto out; - /* Remove from unconfirmed list */ - hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); + nf_ct_del_from_dying_or_unconfirmed_list(ct); /* Timer relative to confirmation time, not original setting time, otherwise we'd get timer wrap in @@ -549,8 +649,9 @@ __nf_conntrack_confirm(struct sk_buff *skb) * stores are visible. */ __nf_conntrack_hash_insert(ct, hash, repl_hash); + nf_conntrack_double_unlock(hash, repl_hash); NF_CT_STAT_INC(net, insert); - spin_unlock_bh(&nf_conntrack_lock); + local_bh_enable(); help = nfct_help(ct); if (help && help->helper) @@ -561,8 +662,9 @@ __nf_conntrack_confirm(struct sk_buff *skb) return NF_ACCEPT; out: + nf_conntrack_double_unlock(hash, repl_hash); NF_CT_STAT_INC(net, insert_failed); - spin_unlock_bh(&nf_conntrack_lock); + local_bh_enable(); return NF_DROP; } EXPORT_SYMBOL_GPL(__nf_conntrack_confirm); @@ -580,9 +682,6 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple, u16 zone = nf_ct_zone(ignored_conntrack); unsigned int hash = hash_conntrack(net, zone, tuple); - /* Disable BHs the entire time since we need to disable them at - * least once for the stats anyway. - */ rcu_read_lock_bh(); hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { ct = nf_ct_tuplehash_to_ctrack(h); @@ -595,8 +694,8 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple, } NF_CT_STAT_INC(net, searched); } - rcu_read_unlock_bh(); + rcu_read_unlock_bh(); return 0; } EXPORT_SYMBOL_GPL(nf_conntrack_tuple_taken); @@ -605,39 +704,48 @@ EXPORT_SYMBOL_GPL(nf_conntrack_tuple_taken); /* There's a small race here where we may free a just-assured connection. Too bad: we're in trouble anyway. 
*/ -static noinline int early_drop(struct net *net, unsigned int hash) +static noinline int early_drop(struct net *net, unsigned int _hash) { /* Use oldest entry, which is roughly LRU */ struct nf_conntrack_tuple_hash *h; struct nf_conn *ct = NULL, *tmp; struct hlist_nulls_node *n; - unsigned int i, cnt = 0; + unsigned int i = 0, cnt = 0; int dropped = 0; + unsigned int hash, sequence; + arch_spinlock_t *lockp; - rcu_read_lock(); - for (i = 0; i < net->ct.htable_size; i++) { + local_bh_disable(); +restart: + sequence = read_seqcount_begin(&net->ct.generation); + hash = hash_bucket(_hash, net); + for (; i < net->ct.htable_size; i++) { + lockp = &nf_conntrack_locks[hash % CONNTRACK_LOCKS]; + arch_spin_lock(lockp); + if (read_seqcount_retry(&net->ct.generation, sequence)) { + arch_spin_unlock(lockp); + goto restart; + } hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { tmp = nf_ct_tuplehash_to_ctrack(h); - if (!test_bit(IPS_ASSURED_BIT, &tmp->status)) + if (!test_bit(IPS_ASSURED_BIT, &tmp->status) && + !nf_ct_is_dying(tmp) && + atomic_inc_not_zero(&tmp->ct_general.use)) { ct = tmp; + break; + } cnt++; } - if (ct != NULL) { - if (likely(!nf_ct_is_dying(ct) && - atomic_inc_not_zero(&ct->ct_general.use))) - break; - else - ct = NULL; - } + hash = (hash + 1) % net->ct.htable_size; + arch_spin_unlock(lockp); - if (cnt >= NF_CT_EVICTION_RANGE) + if (ct || cnt >= NF_CT_EVICTION_RANGE) break; - hash = (hash + 1) % net->ct.htable_size; } - rcu_read_unlock(); + local_bh_enable(); if (!ct) return dropped; @@ -689,7 +797,7 @@ __nf_conntrack_alloc(struct net *net, u16 zone, if (nf_conntrack_max && unlikely(atomic_read(&net->ct.count) > nf_conntrack_max)) { - if (!early_drop(net, hash_bucket(hash, net))) { + if (!early_drop(net, hash)) { atomic_dec(&net->ct.count); net_warn_ratelimited("nf_conntrack: table full, dropping packet\n"); return ERR_PTR(-ENOMEM); @@ -781,7 +889,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl, struct nf_conn_help *help; struct nf_conntrack_tuple repl_tuple; struct nf_conntrack_ecache *ecache; - struct nf_conntrack_expect *exp; + struct nf_conntrack_expect *exp = NULL; u16 zone = tmpl ? nf_ct_zone(tmpl) : NF_CT_DEFAULT_ZONE; struct nf_conn_timeout *timeout_ext; unsigned int *timeouts; @@ -820,39 +928,41 @@ init_conntrack(struct net *net, struct nf_conn *tmpl, ecache ? ecache->expmask : 0, GFP_ATOMIC); - spin_lock_bh(&nf_conntrack_lock); - exp = nf_ct_find_expectation(net, zone, tuple); - if (exp) { - pr_debug("conntrack: expectation arrives ct=%p exp=%p\n", - ct, exp); - /* Welcome, Mr. Bond. We've been expecting you... */ - __set_bit(IPS_EXPECTED_BIT, &ct->status); - ct->master = exp->master; - if (exp->helper) { - help = nf_ct_helper_ext_add(ct, exp->helper, - GFP_ATOMIC); - if (help) - rcu_assign_pointer(help->helper, exp->helper); - } + local_bh_disable(); + if (net->ct.expect_count) { + spin_lock(&nf_conntrack_expect_lock); + exp = nf_ct_find_expectation(net, zone, tuple); + if (exp) { + pr_debug("conntrack: expectation arrives ct=%p exp=%p\n", + ct, exp); + /* Welcome, Mr. Bond. We've been expecting you... 
*/ + __set_bit(IPS_EXPECTED_BIT, &ct->status); + ct->master = exp->master; + if (exp->helper) { + help = nf_ct_helper_ext_add(ct, exp->helper, + GFP_ATOMIC); + if (help) + rcu_assign_pointer(help->helper, exp->helper); + } #ifdef CONFIG_NF_CONNTRACK_MARK - ct->mark = exp->master->mark; + ct->mark = exp->master->mark; #endif #ifdef CONFIG_NF_CONNTRACK_SECMARK - ct->secmark = exp->master->secmark; + ct->secmark = exp->master->secmark; #endif - nf_conntrack_get(&ct->master->ct_general); - NF_CT_STAT_INC(net, expect_new); - } else { + nf_conntrack_get(&ct->master->ct_general); + NF_CT_STAT_INC(net, expect_new); + } + spin_unlock(&nf_conntrack_expect_lock); + } + if (!exp) { __nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC); NF_CT_STAT_INC(net, new); } + nf_ct_add_to_unconfirmed_list(ct); - /* Overload tuple linked list to put us in unconfirmed list. */ - hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, - &net->ct.unconfirmed); - - spin_unlock_bh(&nf_conntrack_lock); + local_bh_enable(); if (exp) { if (exp->expectfn) @@ -1218,27 +1328,42 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data), struct nf_conntrack_tuple_hash *h; struct nf_conn *ct; struct hlist_nulls_node *n; + int cpu; + arch_spinlock_t *lockp; - spin_lock_bh(&nf_conntrack_lock); for (; *bucket < net->ct.htable_size; (*bucket)++) { - hlist_nulls_for_each_entry(h, n, &net->ct.hash[*bucket], hnnode) { - if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL) - continue; + lockp = &nf_conntrack_locks[*bucket % CONNTRACK_LOCKS]; + local_bh_disable(); + arch_spin_lock(lockp); + if (*bucket < net->ct.htable_size) { + hlist_nulls_for_each_entry(h, n, &net->ct.hash[*bucket], hnnode) { + if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL) + continue; + ct = nf_ct_tuplehash_to_ctrack(h); + if (iter(ct, data)) + goto found; + } + } + arch_spin_unlock(lockp); + local_bh_enable(); + } + + for_each_possible_cpu(cpu) { + struct ct_pcpu *pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu); + + spin_lock_bh(&pcpu->lock); + hlist_nulls_for_each_entry(h, n, &pcpu->unconfirmed, hnnode) { ct = nf_ct_tuplehash_to_ctrack(h); if (iter(ct, data)) - goto found; + set_bit(IPS_DYING_BIT, &ct->status); } + spin_unlock_bh(&pcpu->lock); } - hlist_nulls_for_each_entry(h, n, &net->ct.unconfirmed, hnnode) { - ct = nf_ct_tuplehash_to_ctrack(h); - if (iter(ct, data)) - set_bit(IPS_DYING_BIT, &ct->status); - } - spin_unlock_bh(&nf_conntrack_lock); return NULL; found: atomic_inc(&ct->ct_general.use); - spin_unlock_bh(&nf_conntrack_lock); + arch_spin_unlock(lockp); + local_bh_enable(); return ct; } @@ -1314,14 +1439,19 @@ static void nf_ct_release_dying_list(struct net *net) struct nf_conntrack_tuple_hash *h; struct nf_conn *ct; struct hlist_nulls_node *n; + int cpu; - spin_lock_bh(&nf_conntrack_lock); - hlist_nulls_for_each_entry(h, n, &net->ct.dying, hnnode) { - ct = nf_ct_tuplehash_to_ctrack(h); - /* never fails to remove them, no listeners at this point */ - nf_ct_kill(ct); + for_each_possible_cpu(cpu) { + struct ct_pcpu *pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu); + + spin_lock_bh(&pcpu->lock); + hlist_nulls_for_each_entry(h, n, &pcpu->dying, hnnode) { + ct = nf_ct_tuplehash_to_ctrack(h); + /* never fails to remove them, no listeners at this point */ + nf_ct_kill(ct); + } + spin_unlock_bh(&pcpu->lock); } - spin_unlock_bh(&nf_conntrack_lock); } static int untrack_refs(void) @@ -1407,6 +1537,7 @@ i_see_dead_people: kmem_cache_destroy(net->ct.nf_conntrack_cachep); kfree(net->ct.slabname); free_percpu(net->ct.stat); + 
free_percpu(net->ct.pcpu_lists); } } @@ -1459,12 +1590,16 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) if (!hash) return -ENOMEM; + local_bh_disable(); + nf_conntrack_all_lock(); + write_seqcount_begin(&init_net.ct.generation); + /* Lookups in the old hash might happen in parallel, which means we * might get false negatives during connection lookup. New connections * created because of a false negative won't make it into the hash - * though since that required taking the lock. + * though since that required taking the locks. */ - spin_lock_bh(&nf_conntrack_lock); + for (i = 0; i < init_net.ct.htable_size; i++) { while (!hlist_nulls_empty(&init_net.ct.hash[i])) { h = hlist_nulls_entry(init_net.ct.hash[i].first, @@ -1481,7 +1616,10 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) init_net.ct.htable_size = nf_conntrack_htable_size = hashsize; init_net.ct.hash = hash; - spin_unlock_bh(&nf_conntrack_lock); + + write_seqcount_end(&init_net.ct.generation); + nf_conntrack_all_unlock(); + local_bh_enable(); nf_ct_free_hashtable(old_hash, old_size); return 0; @@ -1505,6 +1643,11 @@ int nf_conntrack_init_start(void) int max_factor = 8; int ret, cpu; +#if 0 + for (i = 0; i < ARRAY_SIZE(nf_conntrack_locks); i++) + arch_spin_lock_init(&nf_conntrack_locks[i]); +#endif + /* Idea from tcp.c: use 1/16384 of memory. On i386: 32MB * machine has 512 buckets. >= 1GB machines have 16384 buckets. */ if (!nf_conntrack_htable_size) { @@ -1616,37 +1759,43 @@ void nf_conntrack_init_end(void) int nf_conntrack_init_net(struct net *net) { - int ret; + int ret = -ENOMEM; + int cpu; atomic_set(&net->ct.count, 0); - INIT_HLIST_NULLS_HEAD(&net->ct.unconfirmed, UNCONFIRMED_NULLS_VAL); - INIT_HLIST_NULLS_HEAD(&net->ct.dying, DYING_NULLS_VAL); INIT_HLIST_NULLS_HEAD(&net->ct.tmpl, TEMPLATE_NULLS_VAL); - net->ct.stat = alloc_percpu(struct ip_conntrack_stat); - if (!net->ct.stat) { - ret = -ENOMEM; + + net->ct.pcpu_lists = alloc_percpu(struct ct_pcpu); + if (!net->ct.pcpu_lists) goto err_stat; + + for_each_possible_cpu(cpu) { + struct ct_pcpu *pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu); + + spin_lock_init(&pcpu->lock); + INIT_HLIST_NULLS_HEAD(&pcpu->unconfirmed, UNCONFIRMED_NULLS_VAL); + INIT_HLIST_NULLS_HEAD(&pcpu->dying, DYING_NULLS_VAL); } + net->ct.stat = alloc_percpu(struct ip_conntrack_stat); + if (!net->ct.stat) + goto err_pcpu_lists; + net->ct.slabname = kasprintf(GFP_KERNEL, "nf_conntrack_%p", net); - if (!net->ct.slabname) { - ret = -ENOMEM; + if (!net->ct.slabname) goto err_slabname; - } net->ct.nf_conntrack_cachep = kmem_cache_create(net->ct.slabname, sizeof(struct nf_conn), 0, SLAB_DESTROY_BY_RCU, NULL); if (!net->ct.nf_conntrack_cachep) { printk(KERN_ERR "Unable to create nf_conn slab cache\n"); - ret = -ENOMEM; goto err_cache; } net->ct.htable_size = nf_conntrack_htable_size; net->ct.hash = nf_ct_alloc_hashtable(&net->ct.htable_size, 1); if (!net->ct.hash) { - ret = -ENOMEM; printk(KERN_ERR "Unable to create nf_conntrack_hash\n"); goto err_hash; } @@ -1688,6 +1837,8 @@ err_cache: kfree(net->ct.slabname); err_slabname: free_percpu(net->ct.stat); +err_pcpu_lists: + free_percpu(net->ct.pcpu_lists); err_stat: return ret; } diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c index c63b618..eacc1f7b 100644 --- a/net/netfilter/nf_conntrack_expect.c +++ b/net/netfilter/nf_conntrack_expect.c @@ -66,9 +66,9 @@ static void nf_ct_expectation_timed_out(unsigned long ul_expect) { struct nf_conntrack_expect *exp = (void *)ul_expect; - 
spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); nf_ct_unlink_expect(exp); - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); nf_ct_expect_put(exp); } @@ -217,12 +217,12 @@ static inline int expect_matches(const struct nf_conntrack_expect *a, /* Generally a bad idea to call this: could have matched already. */ void nf_ct_unexpect_related(struct nf_conntrack_expect *exp) { - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); if (del_timer(&exp->timeout)) { nf_ct_unlink_expect(exp); nf_ct_expect_put(exp); } - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); } EXPORT_SYMBOL_GPL(nf_ct_unexpect_related); @@ -330,7 +330,7 @@ static int nf_ct_expect_insert(struct nf_conntrack_expect *exp) setup_timer(&exp->timeout, nf_ct_expectation_timed_out, (unsigned long)exp); helper = rcu_dereference_protected(master_help->helper, - lockdep_is_held(&nf_conntrack_lock)); + lockdep_is_held(&nf_conntrack_expect_lock)); if (helper) { exp->timeout.expires = jiffies + helper->expect_policy[exp->class].timeout * HZ; @@ -390,7 +390,7 @@ static inline int __nf_ct_expect_check(struct nf_conntrack_expect *expect) } /* Will be over limit? */ helper = rcu_dereference_protected(master_help->helper, - lockdep_is_held(&nf_conntrack_lock)); + lockdep_is_held(&nf_conntrack_expect_lock)); if (helper) { p = &helper->expect_policy[expect->class]; if (p->max_expected && @@ -417,7 +417,7 @@ int nf_ct_expect_related_report(struct nf_conntrack_expect *expect, { int ret; - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); ret = __nf_ct_expect_check(expect); if (ret <= 0) goto out; @@ -425,11 +425,11 @@ int nf_ct_expect_related_report(struct nf_conntrack_expect *expect, ret = nf_ct_expect_insert(expect); if (ret < 0) goto out; - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); nf_ct_expect_event_report(IPEXP_NEW, expect, portid, report); return ret; out: - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); return ret; } EXPORT_SYMBOL_GPL(nf_ct_expect_related_report); diff --git a/net/netfilter/nf_conntrack_h323_main.c b/net/netfilter/nf_conntrack_h323_main.c index bdebd03..5b1bcdf 100644 --- a/net/netfilter/nf_conntrack_h323_main.c +++ b/net/netfilter/nf_conntrack_h323_main.c @@ -1476,7 +1476,7 @@ static int process_rcf(struct sk_buff *skb, struct nf_conn *ct, nf_ct_refresh(ct, skb, info->timeout * HZ); /* Set expect timeout */ - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); exp = find_expect(ct, &ct->tuplehash[dir].tuple.dst.u3, info->sig_port[!dir]); if (exp) { @@ -1486,7 +1486,7 @@ static int process_rcf(struct sk_buff *skb, struct nf_conn *ct, nf_ct_dump_tuple(&exp->tuple); set_expect_timeout(exp, info->timeout); } - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); } return 0; diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c index 974a2a4..a7ad8d4 100644 --- a/net/netfilter/nf_conntrack_helper.c +++ b/net/netfilter/nf_conntrack_helper.c @@ -258,7 +258,7 @@ static inline int unhelp(struct nf_conntrack_tuple_hash *i, if (help && rcu_dereference_protected( help->helper, - lockdep_is_held(&nf_conntrack_lock) + lockdep_is_held(&nf_conntrack_expect_lock) ) == me) { nf_conntrack_event(IPCT_HELPER, ct); RCU_INIT_POINTER(help->helper, NULL); @@ -284,17 +284,17 @@ static LIST_HEAD(nf_ct_helper_expectfn_list); void 
nf_ct_helper_expectfn_register(struct nf_ct_helper_expectfn *n) { - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); list_add_rcu(&n->head, &nf_ct_helper_expectfn_list); - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); } EXPORT_SYMBOL_GPL(nf_ct_helper_expectfn_register); void nf_ct_helper_expectfn_unregister(struct nf_ct_helper_expectfn *n) { - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); list_del_rcu(&n->head); - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); } EXPORT_SYMBOL_GPL(nf_ct_helper_expectfn_unregister); @@ -396,15 +396,17 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, const struct hlist_node *next; const struct hlist_nulls_node *nn; unsigned int i; + int cpu; /* Get rid of expectations */ + spin_lock_bh(&nf_conntrack_expect_lock); for (i = 0; i < nf_ct_expect_hsize; i++) { hlist_for_each_entry_safe(exp, next, &net->ct.expect_hash[i], hnode) { struct nf_conn_help *help = nfct_help(exp->master); if ((rcu_dereference_protected( help->helper, - lockdep_is_held(&nf_conntrack_lock) + lockdep_is_held(&nf_conntrack_expect_lock) ) == me || exp->helper == me) && del_timer(&exp->timeout)) { nf_ct_unlink_expect(exp); @@ -412,14 +414,27 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, } } } + spin_unlock_bh(&nf_conntrack_expect_lock); /* Get rid of expecteds, set helpers to NULL. */ - hlist_nulls_for_each_entry(h, nn, &net->ct.unconfirmed, hnnode) - unhelp(h, me); - for (i = 0; i < net->ct.htable_size; i++) { - hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode) + for_each_possible_cpu(cpu) { + struct ct_pcpu *pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu); + + spin_lock_bh(&pcpu->lock); + hlist_nulls_for_each_entry(h, nn, &pcpu->unconfirmed, hnnode) unhelp(h, me); + spin_unlock_bh(&pcpu->lock); + } + local_bh_disable(); + for (i = 0; i < net->ct.htable_size; i++) { + arch_spin_lock(&nf_conntrack_locks[i % CONNTRACK_LOCKS]); + if (i < net->ct.htable_size) { + hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode) + unhelp(h, me); + } + arch_spin_unlock(&nf_conntrack_locks[i % CONNTRACK_LOCKS]); } + local_bh_enable(); } void nf_conntrack_helper_unregister(struct nf_conntrack_helper *me) @@ -437,10 +452,8 @@ void nf_conntrack_helper_unregister(struct nf_conntrack_helper *me) synchronize_rcu(); rtnl_lock(); - spin_lock_bh(&nf_conntrack_lock); for_each_net(net) __nf_conntrack_helper_unregister(me, net); - spin_unlock_bh(&nf_conntrack_lock); rtnl_unlock(); } EXPORT_SYMBOL_GPL(nf_conntrack_helper_unregister); diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 6d0f8a1..23aa8a3 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -766,14 +766,23 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb) struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh); u_int8_t l3proto = nfmsg->nfgen_family; int res; + arch_spinlock_t *lockp; + #ifdef CONFIG_NF_CONNTRACK_MARK const struct ctnetlink_dump_filter *filter = cb->data; #endif - spin_lock_bh(&nf_conntrack_lock); last = (struct nf_conn *)cb->args[1]; + + local_bh_disable(); for (; cb->args[0] < net->ct.htable_size; cb->args[0]++) { restart: + lockp = &nf_conntrack_locks[cb->args[0] % CONNTRACK_LOCKS]; + arch_spin_lock(lockp); + if (cb->args[0] >= net->ct.htable_size) { + arch_spin_unlock(lockp); + goto out; + } hlist_nulls_for_each_entry(h, n, 
&net->ct.hash[cb->args[0]], hnnode) { if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL) @@ -805,16 +814,18 @@ restart: if (res < 0) { nf_conntrack_get(&ct->ct_general); cb->args[1] = (unsigned long)ct; + arch_spin_unlock(lockp); goto out; } } + arch_spin_unlock(lockp); if (cb->args[1]) { cb->args[1] = 0; goto restart; } } out: - spin_unlock_bh(&nf_conntrack_lock); + local_bh_enable(); if (last) nf_ct_put(last); @@ -1145,50 +1156,65 @@ static int ctnetlink_done_list(struct netlink_callback *cb) } static int -ctnetlink_dump_list(struct sk_buff *skb, struct netlink_callback *cb, - struct hlist_nulls_head *list) +ctnetlink_dump_list(struct sk_buff *skb, struct netlink_callback *cb, bool dying) { - struct nf_conn *ct, *last; + struct nf_conn *ct, *last = NULL; struct nf_conntrack_tuple_hash *h; struct hlist_nulls_node *n; struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh); u_int8_t l3proto = nfmsg->nfgen_family; int res; + int cpu; + struct hlist_nulls_head *list; + struct net *net = sock_net(skb->sk); if (cb->args[2]) return 0; - spin_lock_bh(&nf_conntrack_lock); - last = (struct nf_conn *)cb->args[1]; -restart: - hlist_nulls_for_each_entry(h, n, list, hnnode) { - ct = nf_ct_tuplehash_to_ctrack(h); - if (l3proto && nf_ct_l3num(ct) != l3proto) + if (cb->args[0] == nr_cpu_ids) + return 0; + + for (cpu = cb->args[0]; cpu < nr_cpu_ids; cpu++) { + struct ct_pcpu *pcpu; + + if (!cpu_possible(cpu)) continue; - if (cb->args[1]) { - if (ct != last) + + pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu); + spin_lock_bh(&pcpu->lock); + last = (struct nf_conn *)cb->args[1]; + list = dying ? &pcpu->dying : &pcpu->unconfirmed; +restart: + hlist_nulls_for_each_entry(h, n, list, hnnode) { + ct = nf_ct_tuplehash_to_ctrack(h); + if (l3proto && nf_ct_l3num(ct) != l3proto) continue; - cb->args[1] = 0; - } - rcu_read_lock(); - res = ctnetlink_fill_info(skb, NETLINK_CB(cb->skb).portid, - cb->nlh->nlmsg_seq, - NFNL_MSG_TYPE(cb->nlh->nlmsg_type), - ct); - rcu_read_unlock(); - if (res < 0) { - nf_conntrack_get(&ct->ct_general); - cb->args[1] = (unsigned long)ct; - goto out; + if (cb->args[1]) { + if (ct != last) + continue; + cb->args[1] = 0; + } + rcu_read_lock(); + res = ctnetlink_fill_info(skb, NETLINK_CB(cb->skb).portid, + cb->nlh->nlmsg_seq, + NFNL_MSG_TYPE(cb->nlh->nlmsg_type), + ct); + rcu_read_unlock(); + if (res < 0) { + nf_conntrack_get(&ct->ct_general); + cb->args[1] = (unsigned long)ct; + spin_unlock_bh(&pcpu->lock); + goto out; + } } + if (cb->args[1]) { + cb->args[1] = 0; + goto restart; + } else + cb->args[2] = 1; + spin_unlock_bh(&pcpu->lock); } - if (cb->args[1]) { - cb->args[1] = 0; - goto restart; - } else - cb->args[2] = 1; out: - spin_unlock_bh(&nf_conntrack_lock); if (last) nf_ct_put(last); @@ -1198,9 +1224,7 @@ out: static int ctnetlink_dump_dying(struct sk_buff *skb, struct netlink_callback *cb) { - struct net *net = sock_net(skb->sk); - - return ctnetlink_dump_list(skb, cb, &net->ct.dying); + return ctnetlink_dump_list(skb, cb, true); } static int @@ -1222,9 +1246,7 @@ ctnetlink_get_ct_dying(struct sock *ctnl, struct sk_buff *skb, static int ctnetlink_dump_unconfirmed(struct sk_buff *skb, struct netlink_callback *cb) { - struct net *net = sock_net(skb->sk); - - return ctnetlink_dump_list(skb, cb, &net->ct.unconfirmed); + return ctnetlink_dump_list(skb, cb, false); } static int @@ -1373,14 +1395,14 @@ ctnetlink_change_helper(struct nf_conn *ct, const struct nlattr * const cda[]) nf_ct_protonum(ct)); if (helper == NULL) { #ifdef CONFIG_MODULES - spin_unlock_bh(&nf_conntrack_lock); + 
spin_unlock_bh(&nf_conntrack_expect_lock); if (request_module("nfct-helper-%s", helpname) < 0) { - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); return -EOPNOTSUPP; } - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); helper = __nf_conntrack_helper_find(helpname, nf_ct_l3num(ct), nf_ct_protonum(ct)); if (helper) @@ -1818,9 +1840,9 @@ ctnetlink_new_conntrack(struct sock *ctnl, struct sk_buff *skb, err = -EEXIST; ct = nf_ct_tuplehash_to_ctrack(h); if (!(nlh->nlmsg_flags & NLM_F_EXCL)) { - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); err = ctnetlink_change_conntrack(ct, cda); - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); if (err == 0) { nf_conntrack_eventmask_report((1 << IPCT_REPLY) | (1 << IPCT_ASSURED) | @@ -2117,9 +2139,9 @@ ctnetlink_nfqueue_parse(const struct nlattr *attr, struct nf_conn *ct) nla_parse_nested(cda, CTA_MAX, attr, ct_nla_policy); - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); ret = ctnetlink_nfqueue_parse_ct((const struct nlattr **)cda, ct); - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); return ret; } @@ -2629,13 +2651,13 @@ ctnetlink_del_expect(struct sock *ctnl, struct sk_buff *skb, } /* after list removal, usage count == 1 */ - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); if (del_timer(&exp->timeout)) { nf_ct_unlink_expect_report(exp, NETLINK_CB(skb).portid, nlmsg_report(nlh)); nf_ct_expect_put(exp); } - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); /* have to put what we 'get' above. * after this line usage count == 0 */ nf_ct_expect_put(exp); @@ -2644,7 +2666,7 @@ ctnetlink_del_expect(struct sock *ctnl, struct sk_buff *skb, struct nf_conn_help *m_help; /* delete all expectations for this helper */ - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); for (i = 0; i < nf_ct_expect_hsize; i++) { hlist_for_each_entry_safe(exp, next, &net->ct.expect_hash[i], @@ -2659,10 +2681,10 @@ ctnetlink_del_expect(struct sock *ctnl, struct sk_buff *skb, } } } - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); } else { /* This basically means we have to flush everything*/ - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); for (i = 0; i < nf_ct_expect_hsize; i++) { hlist_for_each_entry_safe(exp, next, &net->ct.expect_hash[i], @@ -2675,7 +2697,7 @@ ctnetlink_del_expect(struct sock *ctnl, struct sk_buff *skb, } } } - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); } return 0; @@ -2881,11 +2903,11 @@ ctnetlink_new_expect(struct sock *ctnl, struct sk_buff *skb, if (err < 0) return err; - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); exp = __nf_ct_expect_find(net, zone, &tuple); if (!exp) { - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); err = -ENOENT; if (nlh->nlmsg_flags & NLM_F_CREATE) { err = ctnetlink_create_expect(net, zone, cda, @@ -2899,7 +2921,7 @@ ctnetlink_new_expect(struct sock *ctnl, struct sk_buff *skb, err = -EEXIST; if (!(nlh->nlmsg_flags & NLM_F_EXCL)) err = ctnetlink_change_expect(exp, cda); - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); return err; } diff --git a/net/netfilter/nf_conntrack_sip.c b/net/netfilter/nf_conntrack_sip.c index e0c4373..5564de1 100644 --- a/net/netfilter/nf_conntrack_sip.c 
+++ b/net/netfilter/nf_conntrack_sip.c @@ -858,7 +858,7 @@ static int refresh_signalling_expectation(struct nf_conn *ct, struct hlist_node *next; int found = 0; - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); hlist_for_each_entry_safe(exp, next, &help->expectations, lnode) { if (exp->class != SIP_EXPECT_SIGNALLING || !nf_inet_addr_cmp(&exp->tuple.dst.u3, addr) || @@ -873,7 +873,7 @@ static int refresh_signalling_expectation(struct nf_conn *ct, found = 1; break; } - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); return found; } @@ -883,7 +883,7 @@ static void flush_expectations(struct nf_conn *ct, bool media) struct nf_conntrack_expect *exp; struct hlist_node *next; - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); hlist_for_each_entry_safe(exp, next, &help->expectations, lnode) { if ((exp->class != SIP_EXPECT_SIGNALLING) ^ media) continue; @@ -894,7 +894,7 @@ static void flush_expectations(struct nf_conn *ct, bool media) if (!media) break; } - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); } static int set_expected_rtp_rtcp(struct sk_buff *skb, unsigned int protoff, ^ permalink raw reply related [flat|nested] 21+ messages in thread
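The deadlock avoidance in nf_conntrack_double_lock() above is plain
ordered acquisition: each direction of a conntrack hashes to a bucket
lock, and any two CPUs that contend on the same pair of buckets must
take them in the same order. A minimal userspace sketch of that
discipline, with pthread spinlocks standing in for arch_spinlock_t
(the lock count and helper names are illustrative, not from the patch):

#include <pthread.h>

#define CONNTRACK_LOCKS 1024

static pthread_spinlock_t locks[CONNTRACK_LOCKS];

static void locks_init(void)
{
	int i;

	for (i = 0; i < CONNTRACK_LOCKS; i++)
		pthread_spin_init(&locks[i], PTHREAD_PROCESS_PRIVATE);
}

/* Always acquire the lower-indexed lock first, so two threads that
 * contend on the same pair of buckets agree on the order. */
static void double_lock(unsigned int h1, unsigned int h2)
{
	unsigned int tmp;

	h1 %= CONNTRACK_LOCKS;
	h2 %= CONNTRACK_LOCKS;
	if (h1 > h2) {
		tmp = h1;
		h1 = h2;
		h2 = tmp;
	}
	pthread_spin_lock(&locks[h1]);
	if (h1 != h2)	/* both directions may hash to the same bucket */
		pthread_spin_lock(&locks[h2]);
}

static void double_unlock(unsigned int h1, unsigned int h2)
{
	h1 %= CONNTRACK_LOCKS;
	h2 %= CONNTRACK_LOCKS;
	pthread_spin_unlock(&locks[h1]);
	if (h1 != h2)
		pthread_spin_unlock(&locks[h2]);
}

The seqcount in the patch handles the remaining hazard: if a resize
replaced the table between hashing and locking, read_seqcount_retry()
makes the caller drop both locks and recompute the bucket indices.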
* Re: [PATCH nf-next] netfilter: conntrack: remove the central spinlock
  2013-05-09  3:04 [PATCH nf-next] netfilter: conntrack: remove the central spinlock Eric Dumazet
@ 2013-05-09  5:43 ` Cong Wang
  2013-05-09  6:01   ` Eric Dumazet
  2013-05-22 17:47 ` [PATCH v2 " Eric Dumazet
  1 sibling, 1 reply; 21+ messages in thread
From: Cong Wang @ 2013-05-09  5:43 UTC (permalink / raw)
  To: netdev; +Cc: netfilter-devel

On Thu, 09 May 2013 at 03:04 GMT, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> +#if 0
> +	for (i = 0; i < ARRAY_SIZE(nf_conntrack_locks); i++)
> +		arch_spin_lock_init(&nf_conntrack_locks[i]);
> +#endif
> +

Is this intentional?

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH nf-next] netfilter: conntrack: remove the central spinlock
  2013-05-09  5:43 ` Cong Wang
@ 2013-05-09  6:01   ` Eric Dumazet
  2013-05-09  7:46     ` Cong Wang
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2013-05-09  6:01 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, netfilter-devel

On Thu, 2013-05-09 at 05:43 +0000, Cong Wang wrote:
> On Thu, 09 May 2013 at 03:04 GMT, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > +#if 0
> > +	for (i = 0; i < ARRAY_SIZE(nf_conntrack_locks); i++)
> > +		arch_spin_lock_init(&nf_conntrack_locks[i]);
> > +#endif
> > +
>
> Is this intentional?

What do you think?

Apparently arch_spin_lock_init() does not exist (except for ia64);
kernel/lglock.c simply relies on the storage being cleared.

nf_conntrack_locks[] being cleared, we do not have to init it.

But I do not really like this.

^ permalink raw reply	[flat|nested] 21+ messages in thread
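The zero-initialization argument holds because, on x86 and most other
architectures, the unlocked state of arch_spinlock_t is all-zero
bytes: a ticket lock is free whenever its "next" and "owner" counters
are equal. A self-contained C11 sketch of that property (the struct
and names are invented for illustration, not the kernel's actual
definition):

#include <stdatomic.h>

struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently served */
};

/* Static storage is zeroed, and next == owner == 0 means unlocked,
 * so no explicit initializer is needed -- the same property the
 * patch relies on for nf_conntrack_locks[]. */
static struct ticket_lock lock;

static void ticket_lock(struct ticket_lock *l)
{
	unsigned int me = atomic_fetch_add(&l->next, 1);

	while (atomic_load(&l->owner) != me)
		;	/* spin until our ticket comes up */
}

static void ticket_unlock(struct ticket_lock *l)
{
	atomic_fetch_add(&l->owner, 1);
}

Since static storage is born all-zero, the array is born unlocked,
which is exactly why the #if 0 init loop quoted above is dead code.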
* Re: [PATCH nf-next] netfilter: conntrack: remove the central spinlock
  2013-05-09  6:01 ` Eric Dumazet
@ 2013-05-09  7:46   ` Cong Wang
  2013-05-09 13:46     ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Cong Wang @ 2013-05-09  7:46 UTC (permalink / raw)
  To: netdev; +Cc: netfilter-devel

On Thu, 09 May 2013 at 06:01 GMT, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2013-05-09 at 05:43 +0000, Cong Wang wrote:
>> On Thu, 09 May 2013 at 03:04 GMT, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >
>> > +#if 0
>> > +	for (i = 0; i < ARRAY_SIZE(nf_conntrack_locks); i++)
>> > +		arch_spin_lock_init(&nf_conntrack_locks[i]);
>> > +#endif
>> > +
>>
>> Is this intentional?
>
> What do you think?
>
> Apparently arch_spin_lock_init() does not exist (except for ia64)
>

Either it is not expected to use arch_spin_lock() directly, or it has
to provide a nop for non-ia64.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH nf-next] netfilter: conntrack: remove the central spinlock
  2013-05-09  7:46 ` Cong Wang
@ 2013-05-09 13:46   ` Eric Dumazet
  0 siblings, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2013-05-09 13:46 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, netfilter-devel

On Thu, 2013-05-09 at 07:46 +0000, Cong Wang wrote:
>
> Either it is not expected to use arch_spin_lock() directly, or it has
> to provide a nop for non-ia64.

I suggest you check yourself ;)

# git grep -n arch_spin_lock_init
arch/ia64/include/asm/spinlock.h:19:#define arch_spin_lock_init(x)	((x)->lock = 0)

This definition is not used at all. It should be, for consistency, but
that's not a netfilter concern.

^ permalink raw reply	[flat|nested] 21+ messages in thread
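The v2 that follows resolves this differently: it drops arch_spinlock_t
and uses regular spinlock_t with lockdep annotations. The all-locks
(resize) path gives each slot its array index as the lockdep subclass
via spin_lock_nested(), and because lockdep only tracks a small fixed
number of subclasses, the array shrinks from 1024 to 8 locks when
CONFIG_LOCKDEP=y. Condensed from the v2 hunks below (kernel context
assumed, not a standalone build):

#ifdef CONFIG_LOCKDEP
# define CONNTRACK_LOCKS 8	/* one lockdep subclass per slot */
#else
# define CONNTRACK_LOCKS 1024
#endif

spinlock_t nf_conntrack_locks[CONNTRACK_LOCKS];

/* resize path: take every bucket lock, each with its own subclass,
 * so lockdep does not flag the nested acquisitions */
static void nf_conntrack_all_lock(void)
{
	int i;

	for (i = 0; i < CONNTRACK_LOCKS; i++)
		spin_lock_nested(&nf_conntrack_locks[i], i);
}

The two-bucket packet path keeps its index ordering and marks the
second acquisition SINGLE_DEPTH_NESTING, so lockdep can verify both
disciplines.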
* [PATCH v2 nf-next] netfilter: conntrack: remove the central spinlock
  2013-05-09  3:04 [PATCH nf-next] netfilter: conntrack: remove the central spinlock Eric Dumazet
  2013-05-09  5:43 ` Cong Wang
@ 2013-05-22 17:47 ` Eric Dumazet
  2013-05-22 18:20   ` Joe Perches
                     ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: Eric Dumazet @ 2013-05-22 17:47 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, netdev, Tom Herbert, Patrick McHardy

From: Eric Dumazet <edumazet@google.com>

nf_conntrack_lock is a monolithic lock and suffers from huge contention
on current generation servers (8 or more cores/threads).

Split this lock into three distinct components:

1) One array of hashed spinlocks to protect insertions/deletions of
   conntracks into the hash table. 1024 spinlocks seem to give good
   results, at minimal cost (4KB memory). Due to lockdep max depth,
   1024 becomes 8 if CONFIG_LOCKDEP=y.

2) One spinlock per cpu to protect dying/unconfirmed special lists.
   (These lists are now per cpu, a bit like the untracked ct.)
   Add a @cpu field to nf_conn, to make sure we hold the appropriate
   spinlock at removal time.

3) One spinlock to protect the expectations: nf_conntrack_expect_lock

The hash resize is a bit tricky, because we need to take all locks in
the array. A seqcount_t is used to synchronize the hash table users
with the resizing process.

Reorder struct netns_ct so that @count changes don't slow down users
of read-only fields.

Results on a 32-thread machine, 200 concurrent instances of
"netperf -t TCP_CRR": ~390000 tps instead of ~300000 tps.

perf top before patch:

    51.16%  [kernel]        [k] _raw_spin_lock_bh
     1.10%  [ip_tables]     [k] ipt_do_table
     1.00%  [nf_conntrack]  [k] tcp_packet
     0.76%  [kernel]        [k] nf_iterate
     0.71%  [kernel]        [k] _raw_spin_lock
     0.70%  [kernel]        [k] kmem_cache_free
     0.67%  [kernel]        [k] tcp_ack
     0.62%  [kernel]        [k] tcp_transmit_skb
     0.61%  [kernel]        [k] local_bh_disable
     0.60%  [nf_conntrack]  [k] ____nf_conntrack_find
     0.59%  [kernel]        [k] __schedule
     0.57%  [kernel]        [k] __netif_receive_skb_core
     0.55%  [kernel]        [k] tcp_v4_rcv
     0.50%  [kernel]        [k] local_bh_enable

After patch:

     2.62%  [nf_conntrack]  [k] tcp_packet
     2.10%  [ip_tables]     [k] ipt_do_table
     1.73%  [kernel]        [k] _raw_spin_lock
     1.71%  [kernel]        [k] nf_iterate
     1.62%  [kernel]        [k] tcp_ack
     1.36%  [kernel]        [k] tcp_transmit_skb
     1.28%  [kernel]        [k] __netif_receive_skb_core
     1.28%  [nf_conntrack]  [k] __nf_conntrack_find_get
     1.27%  [kernel]        [k] local_bh_enable
     1.27%  [kernel]        [k] __schedule
     1.23%  [kernel]        [k] kmem_cache_free
     1.22%  [kernel]        [k] local_bh_disable
     1.15%  [kernel]        [k] tcp_v4_rcv
     1.06%  [nf_conntrack]  [k] nf_conntrack_in
     1.02%  [kernel]        [k] _raw_spin_lock_bh
     0.96%  [kernel]        [k] __inet_lookup_established
     0.90%  [kernel]        [k] _raw_spin_lock_irqsave

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Tom Herbert <therbert@google.com>
---
v2: use regular spinlock_t instead of arch_spinlock_t to get full
    lockdep support.
include/net/netfilter/nf_conntrack.h | 7 include/net/netfilter/nf_conntrack_core.h | 9 include/net/netns/conntrack.h | 30 - net/netfilter/nf_conntrack_core.c | 425 +++++++++++++------- net/netfilter/nf_conntrack_expect.c | 18 net/netfilter/nf_conntrack_h323_main.c | 4 net/netfilter/nf_conntrack_helper.c | 37 + net/netfilter/nf_conntrack_netlink.c | 128 +++--- net/netfilter/nf_conntrack_sip.c | 8 9 files changed, 432 insertions(+), 234 deletions(-) diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index 644d9c2..2311914 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -76,7 +76,8 @@ struct nf_conn { plus 1 for any connection(s) we are `master' for */ struct nf_conntrack ct_general; - spinlock_t lock; + spinlock_t lock; + u16 cpu; /* XXX should I move this to the tail ? - Y.K */ /* These are my tuples; original and reply */ @@ -176,10 +177,6 @@ extern void *nf_ct_alloc_hashtable(unsigned int *sizep, int nulls); extern void nf_ct_free_hashtable(void *hash, unsigned int size); -extern struct nf_conntrack_tuple_hash * -__nf_conntrack_find(struct net *net, u16 zone, - const struct nf_conntrack_tuple *tuple); - extern int nf_conntrack_hash_check_insert(struct nf_conn *ct); extern void nf_ct_delete_from_lists(struct nf_conn *ct); extern void nf_ct_dying_timeout(struct nf_conn *ct); diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h index fb2b623..4bcbbfd 100644 --- a/include/net/netfilter/nf_conntrack_core.h +++ b/include/net/netfilter/nf_conntrack_core.h @@ -84,6 +84,13 @@ print_tuple(struct seq_file *s, const struct nf_conntrack_tuple *tuple, const struct nf_conntrack_l3proto *l3proto, const struct nf_conntrack_l4proto *proto); -extern spinlock_t nf_conntrack_lock ; +#ifdef CONFIG_LOCKDEP +# define CONNTRACK_LOCKS 8 +#else +# define CONNTRACK_LOCKS 1024 +#endif +extern spinlock_t nf_conntrack_locks[CONNTRACK_LOCKS]; + +extern spinlock_t nf_conntrack_expect_lock ; #endif /* _NF_CONNTRACK_CORE_H */ diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h index c9c0c53..f7db370 100644 --- a/include/net/netns/conntrack.h +++ b/include/net/netns/conntrack.h @@ -5,6 +5,7 @@ #include <linux/list_nulls.h> #include <linux/atomic.h> #include <linux/netfilter/nf_conntrack_tcp.h> +#include <linux/seqlock.h> struct ctl_table_header; struct nf_conntrack_ecache; @@ -62,16 +63,31 @@ struct nf_ip_net { #endif }; +struct ct_pcpu { + spinlock_t lock; + struct hlist_nulls_head unconfirmed; + struct hlist_nulls_head dying; +}; + struct netns_ct { atomic_t count; + struct hlist_nulls_head tmpl; +#ifdef CONFIG_SYSCTL + struct ctl_table_header *sysctl_header; + struct ctl_table_header *acct_sysctl_header; + struct ctl_table_header *tstamp_sysctl_header; + struct ctl_table_header *event_sysctl_header; + struct ctl_table_header *helper_sysctl_header; +#endif + char *slabname; + unsigned int expect_count; unsigned int htable_size; + seqcount_t generation; struct kmem_cache *nf_conntrack_cachep; struct hlist_nulls_head *hash; struct hlist_head *expect_hash; - struct hlist_nulls_head unconfirmed; - struct hlist_nulls_head dying; - struct hlist_nulls_head tmpl; + struct ct_pcpu __percpu *pcpu_lists; struct ip_conntrack_stat __percpu *stat; struct nf_ct_event_notifier __rcu *nf_conntrack_event_cb; struct nf_exp_event_notifier __rcu *nf_expect_event_cb; @@ -92,13 +108,5 @@ struct netns_ct { struct hlist_head *nat_bysource; unsigned int nat_htable_size; #endif -#ifdef 
CONFIG_SYSCTL - struct ctl_table_header *sysctl_header; - struct ctl_table_header *acct_sysctl_header; - struct ctl_table_header *tstamp_sysctl_header; - struct ctl_table_header *event_sysctl_header; - struct ctl_table_header *helper_sysctl_header; -#endif - char *slabname; }; #endif diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 0283bae..c0823b9 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -64,8 +64,59 @@ int (*nf_nat_seq_adjust_hook)(struct sk_buff *skb, unsigned int protoff); EXPORT_SYMBOL_GPL(nf_nat_seq_adjust_hook); -DEFINE_SPINLOCK(nf_conntrack_lock); -EXPORT_SYMBOL_GPL(nf_conntrack_lock); +__cacheline_aligned_in_smp spinlock_t nf_conntrack_locks[CONNTRACK_LOCKS]; +EXPORT_SYMBOL_GPL(nf_conntrack_locks); + +__cacheline_aligned_in_smp DEFINE_SPINLOCK(nf_conntrack_expect_lock); +EXPORT_SYMBOL_GPL(nf_conntrack_expect_lock); + +static void nf_conntrack_double_unlock(unsigned int h1, unsigned int h2) +{ + h1 %= CONNTRACK_LOCKS; + h2 %= CONNTRACK_LOCKS; + spin_unlock(&nf_conntrack_locks[h1]); + if (h1 != h2) + spin_unlock(&nf_conntrack_locks[h2]); +} + +/* return true if we need to recompute hashes (in case hash table was resized) */ +static bool nf_conntrack_double_lock(struct net *net, unsigned int h1, + unsigned int h2, unsigned int sequence) +{ + h1 %= CONNTRACK_LOCKS; + h2 %= CONNTRACK_LOCKS; + if (h1 <= h2) { + spin_lock(&nf_conntrack_locks[h1]); + if (h1 != h2) + spin_lock_nested(&nf_conntrack_locks[h2], + SINGLE_DEPTH_NESTING); + } else { + spin_lock(&nf_conntrack_locks[h2]); + spin_lock_nested(&nf_conntrack_locks[h1], + SINGLE_DEPTH_NESTING); + } + if (read_seqcount_retry(&net->ct.generation, sequence)) { + nf_conntrack_double_unlock(h1, h2); + return true; + } + return false; +} + +static void nf_conntrack_all_lock(void) +{ + int i; + + for (i = 0; i < CONNTRACK_LOCKS; i++) + spin_lock_nested(&nf_conntrack_locks[i], i); +} + +static void nf_conntrack_all_unlock(void) +{ + int i; + + for (i = 0; i < CONNTRACK_LOCKS; i++) + spin_unlock(&nf_conntrack_locks[i]); +} unsigned int nf_conntrack_htable_size __read_mostly; EXPORT_SYMBOL_GPL(nf_conntrack_htable_size); @@ -196,6 +247,47 @@ clean_from_lists(struct nf_conn *ct) nf_ct_remove_expectations(ct); } +static void nf_ct_add_to_dying_list(struct nf_conn *ct) +{ + struct ct_pcpu *pcpu; + + /* add this conntrack to the (per cpu) dying list */ + ct->cpu = raw_smp_processor_id(); + pcpu = per_cpu_ptr(nf_ct_net(ct)->ct.pcpu_lists, ct->cpu); + + spin_lock(&pcpu->lock); + hlist_nulls_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, + &pcpu->dying); + spin_unlock(&pcpu->lock); +} + +static void nf_ct_add_to_unconfirmed_list(struct nf_conn *ct) +{ + struct ct_pcpu *pcpu; + + /* add this conntrack to the (per cpu) unconfirmed list */ + ct->cpu = raw_smp_processor_id(); + pcpu = per_cpu_ptr(nf_ct_net(ct)->ct.pcpu_lists, ct->cpu); + + spin_lock(&pcpu->lock); + hlist_nulls_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, + &pcpu->unconfirmed); + spin_unlock(&pcpu->lock); +} + +static void nf_ct_del_from_dying_or_unconfirmed_list(struct nf_conn *ct) +{ + struct ct_pcpu *pcpu; + + /* We overload first tuple to link into unconfirmed or dying list.*/ + pcpu = per_cpu_ptr(nf_ct_net(ct)->ct.pcpu_lists, ct->cpu); + + spin_lock(&pcpu->lock); + BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode)); + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); + spin_unlock(&pcpu->lock); +} + static void destroy_conntrack(struct nf_conntrack 
*nfct) { @@ -207,9 +299,6 @@ destroy_conntrack(struct nf_conntrack *nfct) NF_CT_ASSERT(atomic_read(&nfct->use) == 0); NF_CT_ASSERT(!timer_pending(&ct->timeout)); - /* To make sure we don't get any weird locking issues here: - * destroy_conntrack() MUST NOT be called with a write lock - * to nf_conntrack_lock!!! -HW */ rcu_read_lock(); l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct)); if (l4proto && l4proto->destroy) @@ -217,19 +306,21 @@ destroy_conntrack(struct nf_conntrack *nfct) rcu_read_unlock(); - spin_lock_bh(&nf_conntrack_lock); - /* Expectations will have been removed in clean_from_lists, - * except TFTP can create an expectation on the first packet, - * before connection is in the list, so we need to clean here, - * too. */ - nf_ct_remove_expectations(ct); - - /* We overload first tuple to link into unconfirmed or dying list.*/ - BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode)); - hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); + local_bh_disable(); + if (nfct_help(ct)) { + spin_lock(&nf_conntrack_expect_lock); + /* Expectations will have been removed in clean_from_lists, + * except TFTP can create an expectation on the first packet, + * before connection is in the list, so we need to clean here, + * too. + */ + nf_ct_remove_expectations(ct); + spin_unlock(&nf_conntrack_expect_lock); + } + nf_ct_del_from_dying_or_unconfirmed_list(ct); NF_CT_STAT_INC(net, delete); - spin_unlock_bh(&nf_conntrack_lock); + local_bh_enable(); if (ct->master) nf_ct_put(ct->master); @@ -241,17 +332,28 @@ destroy_conntrack(struct nf_conntrack *nfct) void nf_ct_delete_from_lists(struct nf_conn *ct) { struct net *net = nf_ct_net(ct); + unsigned int hash, repl_hash; + u16 zone = nf_ct_zone(ct); + unsigned int sequence; nf_ct_helper_destroy(ct); - spin_lock_bh(&nf_conntrack_lock); - /* Inside lock so preempt is disabled on module removal path. - * Otherwise we can get spurious warnings. */ - NF_CT_STAT_INC(net, delete_list); + + local_bh_disable(); + do { + sequence = read_seqcount_begin(&net->ct.generation); + hash = hash_conntrack(net, zone, + &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple); + repl_hash = hash_conntrack(net, zone, + &ct->tuplehash[IP_CT_DIR_REPLY].tuple); + } while (nf_conntrack_double_lock(net, hash, repl_hash, sequence)); + clean_from_lists(ct); - /* add this conntrack to the dying list */ - hlist_nulls_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, - &net->ct.dying); - spin_unlock_bh(&nf_conntrack_lock); + nf_conntrack_double_unlock(hash, repl_hash); + + nf_ct_add_to_dying_list(ct); + + NF_CT_STAT_INC(net, delete_list); + local_bh_enable(); } EXPORT_SYMBOL_GPL(nf_ct_delete_from_lists); @@ -315,8 +417,6 @@ static void death_by_timeout(unsigned long ul_conntrack) * Warning : * - Caller must take a reference on returned object * and recheck nf_ct_tuple_equal(tuple, &h->tuple) - * OR - * - Caller must lock nf_conntrack_lock before calling this function */ static struct nf_conntrack_tuple_hash * ____nf_conntrack_find(struct net *net, u16 zone, @@ -354,15 +454,6 @@ begin: return NULL; } -struct nf_conntrack_tuple_hash * -__nf_conntrack_find(struct net *net, u16 zone, - const struct nf_conntrack_tuple *tuple) -{ - return ____nf_conntrack_find(net, zone, tuple, - hash_conntrack_raw(tuple, zone)); -} -EXPORT_SYMBOL_GPL(__nf_conntrack_find); - /* Find a connection corresponding to a tuple. 
*/ static struct nf_conntrack_tuple_hash * __nf_conntrack_find_get(struct net *net, u16 zone, @@ -421,14 +512,18 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct) struct nf_conntrack_tuple_hash *h; struct hlist_nulls_node *n; u16 zone; + unsigned int sequence; zone = nf_ct_zone(ct); - hash = hash_conntrack(net, zone, - &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple); - repl_hash = hash_conntrack(net, zone, - &ct->tuplehash[IP_CT_DIR_REPLY].tuple); - spin_lock_bh(&nf_conntrack_lock); + local_bh_disable(); + do { + sequence = read_seqcount_begin(&net->ct.generation); + hash = hash_conntrack(net, zone, + &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple); + repl_hash = hash_conntrack(net, zone, + &ct->tuplehash[IP_CT_DIR_REPLY].tuple); + } while (nf_conntrack_double_lock(net, hash, repl_hash, sequence)); /* See if there's one in the list already, including reverse */ hlist_nulls_for_each_entry(h, n, &net->ct.hash[hash], hnnode) @@ -445,14 +540,15 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct) add_timer(&ct->timeout); nf_conntrack_get(&ct->ct_general); __nf_conntrack_hash_insert(ct, hash, repl_hash); + nf_conntrack_double_unlock(hash, repl_hash); NF_CT_STAT_INC(net, insert); - spin_unlock_bh(&nf_conntrack_lock); - + local_bh_enable(); return 0; out: + nf_conntrack_double_unlock(hash, repl_hash); NF_CT_STAT_INC(net, insert_failed); - spin_unlock_bh(&nf_conntrack_lock); + local_bh_enable(); return -EEXIST; } EXPORT_SYMBOL_GPL(nf_conntrack_hash_check_insert); @@ -470,6 +566,7 @@ __nf_conntrack_confirm(struct sk_buff *skb) enum ip_conntrack_info ctinfo; struct net *net; u16 zone; + unsigned int sequence; ct = nf_ct_get(skb, &ctinfo); net = nf_ct_net(ct); @@ -482,31 +579,37 @@ __nf_conntrack_confirm(struct sk_buff *skb) return NF_ACCEPT; zone = nf_ct_zone(ct); - /* reuse the hash saved before */ - hash = *(unsigned long *)&ct->tuplehash[IP_CT_DIR_REPLY].hnnode.pprev; - hash = hash_bucket(hash, net); - repl_hash = hash_conntrack(net, zone, - &ct->tuplehash[IP_CT_DIR_REPLY].tuple); + local_bh_disable(); + + do { + sequence = read_seqcount_begin(&net->ct.generation); + /* reuse the hash saved before */ + hash = *(unsigned long *)&ct->tuplehash[IP_CT_DIR_REPLY].hnnode.pprev; + hash = hash_bucket(hash, net); + repl_hash = hash_conntrack(net, zone, + &ct->tuplehash[IP_CT_DIR_REPLY].tuple); + + } while (nf_conntrack_double_lock(net, hash, repl_hash, sequence)); /* We're not in hash table, and we refuse to set up related - connections for unconfirmed conns. But packet copies and - REJECT will give spurious warnings here. */ + * connections for unconfirmed conns. But packet copies and + * REJECT will give spurious warnings here. + */ /* NF_CT_ASSERT(atomic_read(&ct->ct_general.use) == 1); */ /* No external references means no one else could have - confirmed us. */ + * confirmed us. 
+ */ NF_CT_ASSERT(!nf_ct_is_confirmed(ct)); pr_debug("Confirming conntrack %p\n", ct); - - spin_lock_bh(&nf_conntrack_lock); - /* We have to check the DYING flag inside the lock to prevent a race against nf_ct_get_next_corpse() possibly called from user context, else we insert an already 'dead' hash, blocking further use of that particular connection -JM */ if (unlikely(nf_ct_is_dying(ct))) { - spin_unlock_bh(&nf_conntrack_lock); + nf_conntrack_double_unlock(hash, repl_hash); + local_bh_enable(); return NF_ACCEPT; } @@ -524,8 +627,7 @@ __nf_conntrack_confirm(struct sk_buff *skb) zone == nf_ct_zone(nf_ct_tuplehash_to_ctrack(h))) goto out; - /* Remove from unconfirmed list */ - hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); + nf_ct_del_from_dying_or_unconfirmed_list(ct); /* Timer relative to confirmation time, not original setting time, otherwise we'd get timer wrap in @@ -549,8 +651,9 @@ __nf_conntrack_confirm(struct sk_buff *skb) * stores are visible. */ __nf_conntrack_hash_insert(ct, hash, repl_hash); + nf_conntrack_double_unlock(hash, repl_hash); NF_CT_STAT_INC(net, insert); - spin_unlock_bh(&nf_conntrack_lock); + local_bh_enable(); help = nfct_help(ct); if (help && help->helper) @@ -561,8 +664,9 @@ __nf_conntrack_confirm(struct sk_buff *skb) return NF_ACCEPT; out: + nf_conntrack_double_unlock(hash, repl_hash); NF_CT_STAT_INC(net, insert_failed); - spin_unlock_bh(&nf_conntrack_lock); + local_bh_enable(); return NF_DROP; } EXPORT_SYMBOL_GPL(__nf_conntrack_confirm); @@ -580,9 +684,6 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple, u16 zone = nf_ct_zone(ignored_conntrack); unsigned int hash = hash_conntrack(net, zone, tuple); - /* Disable BHs the entire time since we need to disable them at - * least once for the stats anyway. - */ rcu_read_lock_bh(); hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { ct = nf_ct_tuplehash_to_ctrack(h); @@ -595,8 +696,8 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple, } NF_CT_STAT_INC(net, searched); } - rcu_read_unlock_bh(); + rcu_read_unlock_bh(); return 0; } EXPORT_SYMBOL_GPL(nf_conntrack_tuple_taken); @@ -605,39 +706,48 @@ EXPORT_SYMBOL_GPL(nf_conntrack_tuple_taken); /* There's a small race here where we may free a just-assured connection. Too bad: we're in trouble anyway. 
*/ -static noinline int early_drop(struct net *net, unsigned int hash) +static noinline int early_drop(struct net *net, unsigned int _hash) { /* Use oldest entry, which is roughly LRU */ struct nf_conntrack_tuple_hash *h; struct nf_conn *ct = NULL, *tmp; struct hlist_nulls_node *n; - unsigned int i, cnt = 0; + unsigned int i = 0, cnt = 0; int dropped = 0; + unsigned int hash, sequence; + spinlock_t *lockp; - rcu_read_lock(); - for (i = 0; i < net->ct.htable_size; i++) { + local_bh_disable(); +restart: + sequence = read_seqcount_begin(&net->ct.generation); + hash = hash_bucket(_hash, net); + for (; i < net->ct.htable_size; i++) { + lockp = &nf_conntrack_locks[hash % CONNTRACK_LOCKS]; + spin_lock(lockp); + if (read_seqcount_retry(&net->ct.generation, sequence)) { + spin_unlock(lockp); + goto restart; + } hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) { tmp = nf_ct_tuplehash_to_ctrack(h); - if (!test_bit(IPS_ASSURED_BIT, &tmp->status)) + if (!test_bit(IPS_ASSURED_BIT, &tmp->status) && + !nf_ct_is_dying(tmp) && + atomic_inc_not_zero(&tmp->ct_general.use)) { ct = tmp; + break; + } cnt++; } - if (ct != NULL) { - if (likely(!nf_ct_is_dying(ct) && - atomic_inc_not_zero(&ct->ct_general.use))) - break; - else - ct = NULL; - } + hash = (hash + 1) % net->ct.htable_size; + spin_unlock(lockp); - if (cnt >= NF_CT_EVICTION_RANGE) + if (ct || cnt >= NF_CT_EVICTION_RANGE) break; - hash = (hash + 1) % net->ct.htable_size; } - rcu_read_unlock(); + local_bh_enable(); if (!ct) return dropped; @@ -689,7 +799,7 @@ __nf_conntrack_alloc(struct net *net, u16 zone, if (nf_conntrack_max && unlikely(atomic_read(&net->ct.count) > nf_conntrack_max)) { - if (!early_drop(net, hash_bucket(hash, net))) { + if (!early_drop(net, hash)) { atomic_dec(&net->ct.count); net_warn_ratelimited("nf_conntrack: table full, dropping packet\n"); return ERR_PTR(-ENOMEM); @@ -781,7 +891,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl, struct nf_conn_help *help; struct nf_conntrack_tuple repl_tuple; struct nf_conntrack_ecache *ecache; - struct nf_conntrack_expect *exp; + struct nf_conntrack_expect *exp = NULL; u16 zone = tmpl ? nf_ct_zone(tmpl) : NF_CT_DEFAULT_ZONE; struct nf_conn_timeout *timeout_ext; unsigned int *timeouts; @@ -820,39 +930,41 @@ init_conntrack(struct net *net, struct nf_conn *tmpl, ecache ? ecache->expmask : 0, GFP_ATOMIC); - spin_lock_bh(&nf_conntrack_lock); - exp = nf_ct_find_expectation(net, zone, tuple); - if (exp) { - pr_debug("conntrack: expectation arrives ct=%p exp=%p\n", - ct, exp); - /* Welcome, Mr. Bond. We've been expecting you... */ - __set_bit(IPS_EXPECTED_BIT, &ct->status); - ct->master = exp->master; - if (exp->helper) { - help = nf_ct_helper_ext_add(ct, exp->helper, - GFP_ATOMIC); - if (help) - rcu_assign_pointer(help->helper, exp->helper); - } + local_bh_disable(); + if (net->ct.expect_count) { + spin_lock(&nf_conntrack_expect_lock); + exp = nf_ct_find_expectation(net, zone, tuple); + if (exp) { + pr_debug("conntrack: expectation arrives ct=%p exp=%p\n", + ct, exp); + /* Welcome, Mr. Bond. We've been expecting you... 
*/ + __set_bit(IPS_EXPECTED_BIT, &ct->status); + ct->master = exp->master; + if (exp->helper) { + help = nf_ct_helper_ext_add(ct, exp->helper, + GFP_ATOMIC); + if (help) + rcu_assign_pointer(help->helper, exp->helper); + } #ifdef CONFIG_NF_CONNTRACK_MARK - ct->mark = exp->master->mark; + ct->mark = exp->master->mark; #endif #ifdef CONFIG_NF_CONNTRACK_SECMARK - ct->secmark = exp->master->secmark; + ct->secmark = exp->master->secmark; #endif - nf_conntrack_get(&ct->master->ct_general); - NF_CT_STAT_INC(net, expect_new); - } else { + nf_conntrack_get(&ct->master->ct_general); + NF_CT_STAT_INC(net, expect_new); + } + spin_unlock(&nf_conntrack_expect_lock); + } + if (!exp) { __nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC); NF_CT_STAT_INC(net, new); } + nf_ct_add_to_unconfirmed_list(ct); - /* Overload tuple linked list to put us in unconfirmed list. */ - hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, - &net->ct.unconfirmed); - - spin_unlock_bh(&nf_conntrack_lock); + local_bh_enable(); if (exp) { if (exp->expectfn) @@ -1218,27 +1330,42 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data), struct nf_conntrack_tuple_hash *h; struct nf_conn *ct; struct hlist_nulls_node *n; + int cpu; + spinlock_t *lockp; - spin_lock_bh(&nf_conntrack_lock); for (; *bucket < net->ct.htable_size; (*bucket)++) { - hlist_nulls_for_each_entry(h, n, &net->ct.hash[*bucket], hnnode) { - if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL) - continue; + lockp = &nf_conntrack_locks[*bucket % CONNTRACK_LOCKS]; + local_bh_disable(); + spin_lock(lockp); + if (*bucket < net->ct.htable_size) { + hlist_nulls_for_each_entry(h, n, &net->ct.hash[*bucket], hnnode) { + if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL) + continue; + ct = nf_ct_tuplehash_to_ctrack(h); + if (iter(ct, data)) + goto found; + } + } + spin_unlock(lockp); + local_bh_enable(); + } + + for_each_possible_cpu(cpu) { + struct ct_pcpu *pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu); + + spin_lock_bh(&pcpu->lock); + hlist_nulls_for_each_entry(h, n, &pcpu->unconfirmed, hnnode) { ct = nf_ct_tuplehash_to_ctrack(h); if (iter(ct, data)) - goto found; + set_bit(IPS_DYING_BIT, &ct->status); } + spin_unlock_bh(&pcpu->lock); } - hlist_nulls_for_each_entry(h, n, &net->ct.unconfirmed, hnnode) { - ct = nf_ct_tuplehash_to_ctrack(h); - if (iter(ct, data)) - set_bit(IPS_DYING_BIT, &ct->status); - } - spin_unlock_bh(&nf_conntrack_lock); return NULL; found: atomic_inc(&ct->ct_general.use); - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock(lockp); + local_bh_enable(); return ct; } @@ -1314,14 +1441,19 @@ static void nf_ct_release_dying_list(struct net *net) struct nf_conntrack_tuple_hash *h; struct nf_conn *ct; struct hlist_nulls_node *n; + int cpu; - spin_lock_bh(&nf_conntrack_lock); - hlist_nulls_for_each_entry(h, n, &net->ct.dying, hnnode) { - ct = nf_ct_tuplehash_to_ctrack(h); - /* never fails to remove them, no listeners at this point */ - nf_ct_kill(ct); + for_each_possible_cpu(cpu) { + struct ct_pcpu *pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu); + + spin_lock_bh(&pcpu->lock); + hlist_nulls_for_each_entry(h, n, &pcpu->dying, hnnode) { + ct = nf_ct_tuplehash_to_ctrack(h); + /* never fails to remove them, no listeners at this point */ + nf_ct_kill(ct); + } + spin_unlock_bh(&pcpu->lock); } - spin_unlock_bh(&nf_conntrack_lock); } static int untrack_refs(void) @@ -1407,6 +1539,7 @@ i_see_dead_people: kmem_cache_destroy(net->ct.nf_conntrack_cachep); kfree(net->ct.slabname); free_percpu(net->ct.stat); + free_percpu(net->ct.pcpu_lists); } } @@ -1459,12 
+1592,16 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) if (!hash) return -ENOMEM; + local_bh_disable(); + nf_conntrack_all_lock(); + write_seqcount_begin(&init_net.ct.generation); + /* Lookups in the old hash might happen in parallel, which means we * might get false negatives during connection lookup. New connections * created because of a false negative won't make it into the hash - * though since that required taking the lock. + * though since that required taking the locks. */ - spin_lock_bh(&nf_conntrack_lock); + for (i = 0; i < init_net.ct.htable_size; i++) { while (!hlist_nulls_empty(&init_net.ct.hash[i])) { h = hlist_nulls_entry(init_net.ct.hash[i].first, @@ -1481,7 +1618,10 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp) init_net.ct.htable_size = nf_conntrack_htable_size = hashsize; init_net.ct.hash = hash; - spin_unlock_bh(&nf_conntrack_lock); + + write_seqcount_end(&init_net.ct.generation); + nf_conntrack_all_unlock(); + local_bh_enable(); nf_ct_free_hashtable(old_hash, old_size); return 0; @@ -1503,7 +1643,10 @@ EXPORT_SYMBOL_GPL(nf_ct_untracked_status_or); int nf_conntrack_init_start(void) { int max_factor = 8; - int ret, cpu; + int i, ret, cpu; + + for (i = 0; i < ARRAY_SIZE(nf_conntrack_locks); i++) + spin_lock_init(&nf_conntrack_locks[i]); /* Idea from tcp.c: use 1/16384 of memory. On i386: 32MB * machine has 512 buckets. >= 1GB machines have 16384 buckets. */ @@ -1616,37 +1759,43 @@ void nf_conntrack_init_end(void) int nf_conntrack_init_net(struct net *net) { - int ret; + int ret = -ENOMEM; + int cpu; atomic_set(&net->ct.count, 0); - INIT_HLIST_NULLS_HEAD(&net->ct.unconfirmed, UNCONFIRMED_NULLS_VAL); - INIT_HLIST_NULLS_HEAD(&net->ct.dying, DYING_NULLS_VAL); INIT_HLIST_NULLS_HEAD(&net->ct.tmpl, TEMPLATE_NULLS_VAL); - net->ct.stat = alloc_percpu(struct ip_conntrack_stat); - if (!net->ct.stat) { - ret = -ENOMEM; + + net->ct.pcpu_lists = alloc_percpu(struct ct_pcpu); + if (!net->ct.pcpu_lists) goto err_stat; + + for_each_possible_cpu(cpu) { + struct ct_pcpu *pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu); + + spin_lock_init(&pcpu->lock); + INIT_HLIST_NULLS_HEAD(&pcpu->unconfirmed, UNCONFIRMED_NULLS_VAL); + INIT_HLIST_NULLS_HEAD(&pcpu->dying, DYING_NULLS_VAL); } + net->ct.stat = alloc_percpu(struct ip_conntrack_stat); + if (!net->ct.stat) + goto err_pcpu_lists; + net->ct.slabname = kasprintf(GFP_KERNEL, "nf_conntrack_%p", net); - if (!net->ct.slabname) { - ret = -ENOMEM; + if (!net->ct.slabname) goto err_slabname; - } net->ct.nf_conntrack_cachep = kmem_cache_create(net->ct.slabname, sizeof(struct nf_conn), 0, SLAB_DESTROY_BY_RCU, NULL); if (!net->ct.nf_conntrack_cachep) { printk(KERN_ERR "Unable to create nf_conn slab cache\n"); - ret = -ENOMEM; goto err_cache; } net->ct.htable_size = nf_conntrack_htable_size; net->ct.hash = nf_ct_alloc_hashtable(&net->ct.htable_size, 1); if (!net->ct.hash) { - ret = -ENOMEM; printk(KERN_ERR "Unable to create nf_conntrack_hash\n"); goto err_hash; } @@ -1688,6 +1837,8 @@ err_cache: kfree(net->ct.slabname); err_slabname: free_percpu(net->ct.stat); +err_pcpu_lists: + free_percpu(net->ct.pcpu_lists); err_stat: return ret; } diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c index c63b618..eacc1f7b 100644 --- a/net/netfilter/nf_conntrack_expect.c +++ b/net/netfilter/nf_conntrack_expect.c @@ -66,9 +66,9 @@ static void nf_ct_expectation_timed_out(unsigned long ul_expect) { struct nf_conntrack_expect *exp = (void *)ul_expect; - spin_lock_bh(&nf_conntrack_lock); + 
spin_lock_bh(&nf_conntrack_expect_lock); nf_ct_unlink_expect(exp); - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); nf_ct_expect_put(exp); } @@ -217,12 +217,12 @@ static inline int expect_matches(const struct nf_conntrack_expect *a, /* Generally a bad idea to call this: could have matched already. */ void nf_ct_unexpect_related(struct nf_conntrack_expect *exp) { - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); if (del_timer(&exp->timeout)) { nf_ct_unlink_expect(exp); nf_ct_expect_put(exp); } - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); } EXPORT_SYMBOL_GPL(nf_ct_unexpect_related); @@ -330,7 +330,7 @@ static int nf_ct_expect_insert(struct nf_conntrack_expect *exp) setup_timer(&exp->timeout, nf_ct_expectation_timed_out, (unsigned long)exp); helper = rcu_dereference_protected(master_help->helper, - lockdep_is_held(&nf_conntrack_lock)); + lockdep_is_held(&nf_conntrack_expect_lock)); if (helper) { exp->timeout.expires = jiffies + helper->expect_policy[exp->class].timeout * HZ; @@ -390,7 +390,7 @@ static inline int __nf_ct_expect_check(struct nf_conntrack_expect *expect) } /* Will be over limit? */ helper = rcu_dereference_protected(master_help->helper, - lockdep_is_held(&nf_conntrack_lock)); + lockdep_is_held(&nf_conntrack_expect_lock)); if (helper) { p = &helper->expect_policy[expect->class]; if (p->max_expected && @@ -417,7 +417,7 @@ int nf_ct_expect_related_report(struct nf_conntrack_expect *expect, { int ret; - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); ret = __nf_ct_expect_check(expect); if (ret <= 0) goto out; @@ -425,11 +425,11 @@ int nf_ct_expect_related_report(struct nf_conntrack_expect *expect, ret = nf_ct_expect_insert(expect); if (ret < 0) goto out; - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); nf_ct_expect_event_report(IPEXP_NEW, expect, portid, report); return ret; out: - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); return ret; } EXPORT_SYMBOL_GPL(nf_ct_expect_related_report); diff --git a/net/netfilter/nf_conntrack_h323_main.c b/net/netfilter/nf_conntrack_h323_main.c index bdebd03..5b1bcdf 100644 --- a/net/netfilter/nf_conntrack_h323_main.c +++ b/net/netfilter/nf_conntrack_h323_main.c @@ -1476,7 +1476,7 @@ static int process_rcf(struct sk_buff *skb, struct nf_conn *ct, nf_ct_refresh(ct, skb, info->timeout * HZ); /* Set expect timeout */ - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); exp = find_expect(ct, &ct->tuplehash[dir].tuple.dst.u3, info->sig_port[!dir]); if (exp) { @@ -1486,7 +1486,7 @@ static int process_rcf(struct sk_buff *skb, struct nf_conn *ct, nf_ct_dump_tuple(&exp->tuple); set_expect_timeout(exp, info->timeout); } - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); } return 0; diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c index 974a2a4..38e491c 100644 --- a/net/netfilter/nf_conntrack_helper.c +++ b/net/netfilter/nf_conntrack_helper.c @@ -258,7 +258,7 @@ static inline int unhelp(struct nf_conntrack_tuple_hash *i, if (help && rcu_dereference_protected( help->helper, - lockdep_is_held(&nf_conntrack_lock) + lockdep_is_held(&nf_conntrack_expect_lock) ) == me) { nf_conntrack_event(IPCT_HELPER, ct); RCU_INIT_POINTER(help->helper, NULL); @@ -284,17 +284,17 @@ static LIST_HEAD(nf_ct_helper_expectfn_list); void nf_ct_helper_expectfn_register(struct nf_ct_helper_expectfn 
*n) { - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); list_add_rcu(&n->head, &nf_ct_helper_expectfn_list); - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); } EXPORT_SYMBOL_GPL(nf_ct_helper_expectfn_register); void nf_ct_helper_expectfn_unregister(struct nf_ct_helper_expectfn *n) { - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); list_del_rcu(&n->head); - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); } EXPORT_SYMBOL_GPL(nf_ct_helper_expectfn_unregister); @@ -396,15 +396,17 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, const struct hlist_node *next; const struct hlist_nulls_node *nn; unsigned int i; + int cpu; /* Get rid of expectations */ + spin_lock_bh(&nf_conntrack_expect_lock); for (i = 0; i < nf_ct_expect_hsize; i++) { hlist_for_each_entry_safe(exp, next, &net->ct.expect_hash[i], hnode) { struct nf_conn_help *help = nfct_help(exp->master); if ((rcu_dereference_protected( help->helper, - lockdep_is_held(&nf_conntrack_lock) + lockdep_is_held(&nf_conntrack_expect_lock) ) == me || exp->helper == me) && del_timer(&exp->timeout)) { nf_ct_unlink_expect(exp); @@ -412,14 +414,27 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me, } } } + spin_unlock_bh(&nf_conntrack_expect_lock); /* Get rid of expecteds, set helpers to NULL. */ - hlist_nulls_for_each_entry(h, nn, &net->ct.unconfirmed, hnnode) - unhelp(h, me); - for (i = 0; i < net->ct.htable_size; i++) { - hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode) + for_each_possible_cpu(cpu) { + struct ct_pcpu *pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu); + + spin_lock_bh(&pcpu->lock); + hlist_nulls_for_each_entry(h, nn, &pcpu->unconfirmed, hnnode) unhelp(h, me); + spin_unlock_bh(&pcpu->lock); + } + local_bh_disable(); + for (i = 0; i < net->ct.htable_size; i++) { + spin_lock(&nf_conntrack_locks[i % CONNTRACK_LOCKS]); + if (i < net->ct.htable_size) { + hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode) + unhelp(h, me); + } + spin_unlock(&nf_conntrack_locks[i % CONNTRACK_LOCKS]); } + local_bh_enable(); } void nf_conntrack_helper_unregister(struct nf_conntrack_helper *me) @@ -437,10 +452,8 @@ void nf_conntrack_helper_unregister(struct nf_conntrack_helper *me) synchronize_rcu(); rtnl_lock(); - spin_lock_bh(&nf_conntrack_lock); for_each_net(net) __nf_conntrack_helper_unregister(me, net); - spin_unlock_bh(&nf_conntrack_lock); rtnl_unlock(); } EXPORT_SYMBOL_GPL(nf_conntrack_helper_unregister); diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 6d0f8a1..fa11166 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -766,14 +766,23 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb) struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh); u_int8_t l3proto = nfmsg->nfgen_family; int res; + spinlock_t *lockp; + #ifdef CONFIG_NF_CONNTRACK_MARK const struct ctnetlink_dump_filter *filter = cb->data; #endif - spin_lock_bh(&nf_conntrack_lock); last = (struct nf_conn *)cb->args[1]; + + local_bh_disable(); for (; cb->args[0] < net->ct.htable_size; cb->args[0]++) { restart: + lockp = &nf_conntrack_locks[cb->args[0] % CONNTRACK_LOCKS]; + spin_lock(lockp); + if (cb->args[0] >= net->ct.htable_size) { + spin_unlock(lockp); + goto out; + } hlist_nulls_for_each_entry(h, n, &net->ct.hash[cb->args[0]], hnnode) { if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL) @@ -805,16 +814,18 
@@ restart: if (res < 0) { nf_conntrack_get(&ct->ct_general); cb->args[1] = (unsigned long)ct; + spin_unlock(lockp); goto out; } } + spin_unlock(lockp); if (cb->args[1]) { cb->args[1] = 0; goto restart; } } out: - spin_unlock_bh(&nf_conntrack_lock); + local_bh_enable(); if (last) nf_ct_put(last); @@ -1145,50 +1156,65 @@ static int ctnetlink_done_list(struct netlink_callback *cb) } static int -ctnetlink_dump_list(struct sk_buff *skb, struct netlink_callback *cb, - struct hlist_nulls_head *list) +ctnetlink_dump_list(struct sk_buff *skb, struct netlink_callback *cb, bool dying) { - struct nf_conn *ct, *last; + struct nf_conn *ct, *last = NULL; struct nf_conntrack_tuple_hash *h; struct hlist_nulls_node *n; struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh); u_int8_t l3proto = nfmsg->nfgen_family; int res; + int cpu; + struct hlist_nulls_head *list; + struct net *net = sock_net(skb->sk); if (cb->args[2]) return 0; - spin_lock_bh(&nf_conntrack_lock); - last = (struct nf_conn *)cb->args[1]; -restart: - hlist_nulls_for_each_entry(h, n, list, hnnode) { - ct = nf_ct_tuplehash_to_ctrack(h); - if (l3proto && nf_ct_l3num(ct) != l3proto) + if (cb->args[0] == nr_cpu_ids) + return 0; + + for (cpu = cb->args[0]; cpu < nr_cpu_ids; cpu++) { + struct ct_pcpu *pcpu; + + if (!cpu_possible(cpu)) continue; - if (cb->args[1]) { - if (ct != last) + + pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu); + spin_lock_bh(&pcpu->lock); + last = (struct nf_conn *)cb->args[1]; + list = dying ? &pcpu->dying : &pcpu->unconfirmed; +restart: + hlist_nulls_for_each_entry(h, n, list, hnnode) { + ct = nf_ct_tuplehash_to_ctrack(h); + if (l3proto && nf_ct_l3num(ct) != l3proto) continue; - cb->args[1] = 0; - } - rcu_read_lock(); - res = ctnetlink_fill_info(skb, NETLINK_CB(cb->skb).portid, - cb->nlh->nlmsg_seq, - NFNL_MSG_TYPE(cb->nlh->nlmsg_type), - ct); - rcu_read_unlock(); - if (res < 0) { - nf_conntrack_get(&ct->ct_general); - cb->args[1] = (unsigned long)ct; - goto out; + if (cb->args[1]) { + if (ct != last) + continue; + cb->args[1] = 0; + } + rcu_read_lock(); + res = ctnetlink_fill_info(skb, NETLINK_CB(cb->skb).portid, + cb->nlh->nlmsg_seq, + NFNL_MSG_TYPE(cb->nlh->nlmsg_type), + ct); + rcu_read_unlock(); + if (res < 0) { + nf_conntrack_get(&ct->ct_general); + cb->args[1] = (unsigned long)ct; + spin_unlock_bh(&pcpu->lock); + goto out; + } } + if (cb->args[1]) { + cb->args[1] = 0; + goto restart; + } else + cb->args[2] = 1; + spin_unlock_bh(&pcpu->lock); } - if (cb->args[1]) { - cb->args[1] = 0; - goto restart; - } else - cb->args[2] = 1; out: - spin_unlock_bh(&nf_conntrack_lock); if (last) nf_ct_put(last); @@ -1198,9 +1224,7 @@ out: static int ctnetlink_dump_dying(struct sk_buff *skb, struct netlink_callback *cb) { - struct net *net = sock_net(skb->sk); - - return ctnetlink_dump_list(skb, cb, &net->ct.dying); + return ctnetlink_dump_list(skb, cb, true); } static int @@ -1222,9 +1246,7 @@ ctnetlink_get_ct_dying(struct sock *ctnl, struct sk_buff *skb, static int ctnetlink_dump_unconfirmed(struct sk_buff *skb, struct netlink_callback *cb) { - struct net *net = sock_net(skb->sk); - - return ctnetlink_dump_list(skb, cb, &net->ct.unconfirmed); + return ctnetlink_dump_list(skb, cb, false); } static int @@ -1373,14 +1395,14 @@ ctnetlink_change_helper(struct nf_conn *ct, const struct nlattr * const cda[]) nf_ct_protonum(ct)); if (helper == NULL) { #ifdef CONFIG_MODULES - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); if (request_module("nfct-helper-%s", helpname) < 0) { - spin_lock_bh(&nf_conntrack_lock); + 
spin_lock_bh(&nf_conntrack_expect_lock); return -EOPNOTSUPP; } - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); helper = __nf_conntrack_helper_find(helpname, nf_ct_l3num(ct), nf_ct_protonum(ct)); if (helper) @@ -1818,9 +1840,9 @@ ctnetlink_new_conntrack(struct sock *ctnl, struct sk_buff *skb, err = -EEXIST; ct = nf_ct_tuplehash_to_ctrack(h); if (!(nlh->nlmsg_flags & NLM_F_EXCL)) { - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); err = ctnetlink_change_conntrack(ct, cda); - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); if (err == 0) { nf_conntrack_eventmask_report((1 << IPCT_REPLY) | (1 << IPCT_ASSURED) | @@ -2117,9 +2139,9 @@ ctnetlink_nfqueue_parse(const struct nlattr *attr, struct nf_conn *ct) nla_parse_nested(cda, CTA_MAX, attr, ct_nla_policy); - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); ret = ctnetlink_nfqueue_parse_ct((const struct nlattr **)cda, ct); - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); return ret; } @@ -2629,13 +2651,13 @@ ctnetlink_del_expect(struct sock *ctnl, struct sk_buff *skb, } /* after list removal, usage count == 1 */ - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); if (del_timer(&exp->timeout)) { nf_ct_unlink_expect_report(exp, NETLINK_CB(skb).portid, nlmsg_report(nlh)); nf_ct_expect_put(exp); } - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); /* have to put what we 'get' above. * after this line usage count == 0 */ nf_ct_expect_put(exp); @@ -2644,7 +2666,7 @@ ctnetlink_del_expect(struct sock *ctnl, struct sk_buff *skb, struct nf_conn_help *m_help; /* delete all expectations for this helper */ - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); for (i = 0; i < nf_ct_expect_hsize; i++) { hlist_for_each_entry_safe(exp, next, &net->ct.expect_hash[i], @@ -2659,10 +2681,10 @@ ctnetlink_del_expect(struct sock *ctnl, struct sk_buff *skb, } } } - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); } else { /* This basically means we have to flush everything*/ - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); for (i = 0; i < nf_ct_expect_hsize; i++) { hlist_for_each_entry_safe(exp, next, &net->ct.expect_hash[i], @@ -2675,7 +2697,7 @@ ctnetlink_del_expect(struct sock *ctnl, struct sk_buff *skb, } } } - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); } return 0; @@ -2881,11 +2903,11 @@ ctnetlink_new_expect(struct sock *ctnl, struct sk_buff *skb, if (err < 0) return err; - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); exp = __nf_ct_expect_find(net, zone, &tuple); if (!exp) { - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); err = -ENOENT; if (nlh->nlmsg_flags & NLM_F_CREATE) { err = ctnetlink_create_expect(net, zone, cda, @@ -2899,7 +2921,7 @@ ctnetlink_new_expect(struct sock *ctnl, struct sk_buff *skb, err = -EEXIST; if (!(nlh->nlmsg_flags & NLM_F_EXCL)) err = ctnetlink_change_expect(exp, cda); - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); return err; } diff --git a/net/netfilter/nf_conntrack_sip.c b/net/netfilter/nf_conntrack_sip.c index e0c4373..5564de1 100644 --- a/net/netfilter/nf_conntrack_sip.c +++ b/net/netfilter/nf_conntrack_sip.c @@ -858,7 +858,7 @@ static int refresh_signalling_expectation(struct nf_conn *ct, struct 
hlist_node *next; int found = 0; - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); hlist_for_each_entry_safe(exp, next, &help->expectations, lnode) { if (exp->class != SIP_EXPECT_SIGNALLING || !nf_inet_addr_cmp(&exp->tuple.dst.u3, addr) || @@ -873,7 +873,7 @@ static int refresh_signalling_expectation(struct nf_conn *ct, found = 1; break; } - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); return found; } @@ -883,7 +883,7 @@ static void flush_expectations(struct nf_conn *ct, bool media) struct nf_conntrack_expect *exp; struct hlist_node *next; - spin_lock_bh(&nf_conntrack_lock); + spin_lock_bh(&nf_conntrack_expect_lock); hlist_for_each_entry_safe(exp, next, &help->expectations, lnode) { if ((exp->class != SIP_EXPECT_SIGNALLING) ^ media) continue; @@ -894,7 +894,7 @@ static void flush_expectations(struct nf_conn *ct, bool media) if (!media) break; } - spin_unlock_bh(&nf_conntrack_lock); + spin_unlock_bh(&nf_conntrack_expect_lock); } static int set_expected_rtp_rtcp(struct sk_buff *skb, unsigned int protoff, ^ permalink raw reply related [flat|nested] 21+ messages in thread
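The core deadlock-avoidance rule in the patch above is simple: when two bucket locks are needed (one per direction of a conntrack), they are always taken in index order, so any two CPUs grabbing the same pair agree on who goes first. A standalone user-space sketch of just that ordering rule, with pthread spinlocks standing in for the kernel's spinlock_t (NR_LOCKS and the function names here are illustrative only, not the kernel code):

#include <pthread.h>

#define NR_LOCKS 1024

static pthread_spinlock_t locks[NR_LOCKS];

static void locks_init(void)
{
        int i;

        for (i = 0; i < NR_LOCKS; i++)
                pthread_spin_init(&locks[i], PTHREAD_PROCESS_PRIVATE);
}

/* Take two bucket locks in index order; threads locking the same pair
 * from opposite directions then acquire them in the same order, so
 * they cannot deadlock. */
static void double_lock(unsigned int h1, unsigned int h2)
{
        unsigned int tmp;

        h1 %= NR_LOCKS;
        h2 %= NR_LOCKS;
        if (h1 > h2) {
                tmp = h1;
                h1 = h2;
                h2 = tmp;
        }
        pthread_spin_lock(&locks[h1]);
        if (h1 != h2)   /* both tuples may hash to the same bucket */
                pthread_spin_lock(&locks[h2]);
}

static void double_unlock(unsigned int h1, unsigned int h2)
{
        h1 %= NR_LOCKS;
        h2 %= NR_LOCKS;
        pthread_spin_unlock(&locks[h1]);
        if (h1 != h2)
                pthread_spin_unlock(&locks[h2]);
}

The kernel version layers one more step on top: after taking the pair it rechecks the resize seqcount and retries with freshly computed hashes if the hash table was replaced in the meantime.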
* Re: [PATCH v2 nf-next] netfilter: conntrack: remove the central spinlock 2013-05-22 17:47 ` [PATCH v2 " Eric Dumazet @ 2013-05-22 18:20 ` Joe Perches 2013-05-22 19:26 ` Eric Dumazet 2013-05-24 13:16 ` Jesper Dangaard Brouer 2013-08-26 22:28 ` Pablo Neira Ayuso 2 siblings, 1 reply; 21+ messages in thread From: Joe Perches @ 2013-05-22 18:20 UTC (permalink / raw) To: Eric Dumazet Cc: Pablo Neira Ayuso, netfilter-devel, netdev, Tom Herbert, Patrick McHardy On Wed, 2013-05-22 at 10:47 -0700, Eric Dumazet wrote: > From: Eric Dumazet <edumazet@google.com> > > nf_conntrack_lock is a monolithic lock and suffers from huge contention > on current generation servers (8 or more core/threads). [] > diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h [] > @@ -76,7 +76,8 @@ struct nf_conn { > plus 1 for any connection(s) we are `master' for */ > struct nf_conntrack ct_general; > > - spinlock_t lock; > + spinlock_t lock; > + u16 cpu; trivia: What's the real value in not using int here? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 nf-next] netfilter: conntrack: remove the central spinlock 2013-05-22 18:20 ` Joe Perches @ 2013-05-22 19:26 ` Eric Dumazet 2013-05-22 19:57 ` Joe Perches 0 siblings, 1 reply; 21+ messages in thread From: Eric Dumazet @ 2013-05-22 19:26 UTC (permalink / raw) To: Joe Perches Cc: Pablo Neira Ayuso, netfilter-devel, netdev, Tom Herbert, Patrick McHardy On Wed, 2013-05-22 at 11:20 -0700, Joe Perches wrote: > On Wed, 2013-05-22 at 10:47 -0700, Eric Dumazet wrote: > > From: Eric Dumazet <edumazet@google.com> > > > > nf_conntrack_lock is a monolithic lock and suffers from huge contention > > on current generation servers (8 or more core/threads). > [] > > diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h > [] > > @@ -76,7 +76,8 @@ struct nf_conn { > > plus 1 for any connection(s) we are `master' for */ > > struct nf_conntrack ct_general; > > > > - spinlock_t lock; > > + spinlock_t lock; > > + u16 cpu; > > trivia: > > What's the real value in not using int here? > On some machines, sizeof(spinlock_t) is 2. So this addition doesn't increase the size of the structure, as I fill a hole. That's the case on x86 when NR_CPUS < 256. ^ permalink raw reply [flat|nested] 21+ messages in thread
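The hole-filling argument is easy to reproduce with a toy layout check, assuming a 2-byte lock as in the x86 NR_CPUS < 256 case described above (the struct names and the stand-in lock type are invented for the demonstration):

#include <stdint.h>
#include <stdio.h>

typedef uint16_t toy_spinlock_t;        /* stand-in for a 2-byte spinlock_t */

struct ct_without_cpu {
        toy_spinlock_t lock;
        /* 6 bytes of padding on LP64 before the pointer */
        void *tuplehash;
};

struct ct_with_cpu {
        toy_spinlock_t lock;
        uint16_t cpu;                   /* sits in what was padding anyway */
        void *tuplehash;
};

int main(void)
{
        /* prints "16 16" on LP64: the u16 @cpu field costs no space */
        printf("%zu %zu\n", sizeof(struct ct_without_cpu),
               sizeof(struct ct_with_cpu));
        return 0;
}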
* Re: [PATCH v2 nf-next] netfilter: conntrack: remove the central spinlock 2013-05-22 19:26 ` Eric Dumazet @ 2013-05-22 19:57 ` Joe Perches 2013-05-22 20:16 ` Eric Dumazet 0 siblings, 1 reply; 21+ messages in thread From: Joe Perches @ 2013-05-22 19:57 UTC (permalink / raw) To: Eric Dumazet Cc: Pablo Neira Ayuso, netfilter-devel, netdev, Tom Herbert, Patrick McHardy On Wed, 2013-05-22 at 12:26 -0700, Eric Dumazet wrote: > On Wed, 2013-05-22 at 11:20 -0700, Joe Perches wrote: > > On Wed, 2013-05-22 at 10:47 -0700, Eric Dumazet wrote: > > > nf_conntrack_lock is a monolithic lock and suffers from huge contention > > > on current generation servers (8 or more core/threads). > > [] > > > diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h > > [] > > > @@ -76,7 +76,8 @@ struct nf_conn { > > > plus 1 for any connection(s) we are `master' for */ > > > struct nf_conntrack ct_general; > > > > > > - spinlock_t lock; > > > + spinlock_t lock; > > > + u16 cpu; > > trivia: > > What's the real value in not using int here? > On some machines, sizeof(spinlock_t) is 2. > So this addition doesn't increase the size of the structure, as I fill a > hole. > That's the case on x86 when NR_CPUS < 256. Ahh, nice. It might also be nice to mark it if ever more than a u16 brace/flock/coven/cluster worth of cpus become feasible as it seems int is used almost everywhere else. ^ permalink raw reply [flat|nested] 21+ messages in thread
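If such a marker were wanted, it could be a one-line compile-time assertion near the field's users rather than a new typedef; a sketch of the idea, placed for instance in nf_conntrack_init_start() (hypothetical placement, not part of the posted patch):

        /* fail the build if cpu ids ever stop fitting the u16 @cpu
         * field of struct nf_conn */
        BUILD_BUG_ON(NR_CPUS > (1 << 16));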
* Re: [PATCH v2 nf-next] netfilter: conntrack: remove the central spinlock 2013-05-22 19:57 ` Joe Perches @ 2013-05-22 20:16 ` Eric Dumazet 2013-05-22 20:38 ` Joe Perches 0 siblings, 1 reply; 21+ messages in thread From: Eric Dumazet @ 2013-05-22 20:16 UTC (permalink / raw) To: Joe Perches Cc: Pablo Neira Ayuso, netfilter-devel, netdev, Tom Herbert, Patrick McHardy On Wed, 2013-05-22 at 12:57 -0700, Joe Perches wrote: > Ahh, nice. > > It might also be nice to mark it if ever more than a u16 > brace/flock/coven/cluster worth of cpus become feasible > as it seems int is used almost everywhere else. It seems Linus hates cpu_t or whatever_t. That's why we have u16 everywhere to encode cpu numbers, and why we use "unsigned long" for jiffies. Check include/linux/netdevice.h for numerous examples. So far, I believe Linux supports at most 4096 cpus. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 nf-next] netfilter: conntrack: remove the central spinlock 2013-05-22 20:16 ` Eric Dumazet @ 2013-05-22 20:38 ` Joe Perches 2013-05-22 20:48 ` Eric Dumazet 0 siblings, 1 reply; 21+ messages in thread From: Joe Perches @ 2013-05-22 20:38 UTC (permalink / raw) To: Eric Dumazet Cc: Pablo Neira Ayuso, netfilter-devel, netdev, Tom Herbert, Patrick McHardy On Wed, 2013-05-22 at 13:16 -0700, Eric Dumazet wrote: > On Wed, 2013-05-22 at 12:57 -0700, Joe Perches wrote: > > It might also be nice to mark it if ever more than a u16 > > brace/flock/coven/cluster worth of cpus become feasible > > as it seems int is used almost everywhere else. > > It seems Linus hates cpu_t or whatever_t. Go figure. > That's why we have u16 everywhere to encode cpu numbers, and why we use > "unsigned long" for jiffies. u16 for cpu is hardly used at all. $ git grep -E "^\s*\w+\s+cpu\s*;" | cut -f2- -d":" | \ sed -r -e 's/^\s+//g' -e 's/\s+/ /g' -e 's/;.*//' | \ sort | uniq -c | sort -rn 569 int cpu 29 return cpu 28 __u32 cpu 14 u32 cpu 7 unsigned cpu 5 u16 cpu 4 cpuid_t cpu 3 uint32_t cpu 2 __u8 cpu 2 u64 cpu 2 s32 cpu 2 long cpu 1 u8 cpu 1 __u16 cpu 1 struct cpu 1 short cpu 1 __le32 cpu 1 geo_cpu_t cpu > So far, I believe Linux supports at most 4096 cpus. I believe that as well. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 nf-next] netfilter: conntrack: remove the central spinlock 2013-05-22 20:38 ` Joe Perches @ 2013-05-22 20:48 ` Eric Dumazet 2013-05-22 21:12 ` Joe Perches 0 siblings, 1 reply; 21+ messages in thread From: Eric Dumazet @ 2013-05-22 20:48 UTC (permalink / raw) To: Joe Perches Cc: Pablo Neira Ayuso, netfilter-devel, netdev, Tom Herbert, Patrick McHardy On Wed, 2013-05-22 at 13:38 -0700, Joe Perches wrote: > u16 for cpu is hardly used at all. It's used in the network tree. The day it has to change, don't worry, I think we'll be able to handle it. For the meantime, I do not want to add 4 more bytes to the conntrack structure; it's already bloated enough. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 nf-next] netfilter: conntrack: remove the central spinlock 2013-05-22 20:48 ` Eric Dumazet @ 2013-05-22 21:12 ` Joe Perches 2013-05-22 21:29 ` David Miller 2013-05-22 21:34 ` Eric Dumazet 0 siblings, 2 replies; 21+ messages in thread From: Joe Perches @ 2013-05-22 21:12 UTC (permalink / raw) To: Eric Dumazet Cc: Pablo Neira Ayuso, netfilter-devel, netdev, Tom Herbert, Patrick McHardy On Wed, 2013-05-22 at 13:48 -0700, Eric Dumazet wrote: > On Wed, 2013-05-22 at 13:38 -0700, Joe Perches wrote: > > > u16 for cpu is hardly used at all. > > It's used in the network tree. > > The day it has to change, don't worry, I think we'll be able to handle > it. > > For the meantime, I do not want to add 4 more bytes to the conntrack > structure; it's already bloated enough. I wrote "nice" already, no worries. btw: netdevice.h could possibly convert xmit_lock_owner to u16 as well for that same 4-byte savings if the -1 test were converted too. --- include/linux/netdevice.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 60584b1..c38cd29 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -528,7 +528,7 @@ struct netdev_queue { * write mostly part */ spinlock_t _xmit_lock ____cacheline_aligned_in_smp; - int xmit_lock_owner; + u16 xmit_lock_owner; /* * please use this field instead of dev->trans_start */ @@ -2453,7 +2453,7 @@ static inline void __netif_tx_unlock_bh(struct netdev_queue *txq) static inline void txq_trans_update(struct netdev_queue *txq) { - if (txq->xmit_lock_owner != -1) + if (txq->xmit_lock_owner != (u16)-1) txq->trans_start = jiffies; } ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH v2 nf-next] netfilter: conntrack: remove the central spinlock 2013-05-22 21:12 ` Joe Perches @ 2013-05-22 21:29 ` David Miller 2013-05-22 21:34 ` Eric Dumazet 1 sibling, 0 replies; 21+ messages in thread From: David Miller @ 2013-05-22 21:29 UTC (permalink / raw) To: joe; +Cc: eric.dumazet, pablo, netfilter-devel, netdev, therbert, kaber From: Joe Perches <joe@perches.com> Date: Wed, 22 May 2013 14:12:51 -0700 > @@ -528,7 +528,7 @@ struct netdev_queue { > * write mostly part > */ > spinlock_t _xmit_lock ____cacheline_aligned_in_smp; > - int xmit_lock_owner; > + u16 xmit_lock_owner; > /* > * please use this field instead of dev->trans_start > */ > @@ -2453,7 +2453,7 @@ static inline void __netif_tx_unlock_bh(struct netdev_queue *txq) > > static inline void txq_trans_update(struct netdev_queue *txq) > { > - if (txq->xmit_lock_owner != -1) > + if (txq->xmit_lock_owner != (u16)-1) > txq->trans_start = jiffies; > } Then technically this is an s16, don't use casts to pretend that it isn't. ^ permalink raw reply [flat|nested] 21+ messages in thread
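The cast could be avoided with a named unsigned sentinel, which also documents the reserved value; a hypothetical variant (the constant name is invented here, and this is not the code that was merged):

#include <linux/kernel.h>       /* USHRT_MAX */

#define XMIT_LOCK_OWNER_NONE    USHRT_MAX       /* hypothetical sentinel */

static inline void txq_trans_update(struct netdev_queue *txq)
{
        /* a u16 compared against a u16 constant: no sign-conversion games */
        if (txq->xmit_lock_owner != XMIT_LOCK_OWNER_NONE)
                txq->trans_start = jiffies;
}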
* Re: [PATCH v2 nf-next] netfilter: conntrack: remove the central spinlock 2013-05-22 21:12 ` Joe Perches 2013-05-22 21:29 ` David Miller @ 2013-05-22 21:34 ` Eric Dumazet 1 sibling, 0 replies; 21+ messages in thread From: Eric Dumazet @ 2013-05-22 21:34 UTC (permalink / raw) To: Joe Perches Cc: Pablo Neira Ayuso, netfilter-devel, netdev, Tom Herbert, Patrick McHardy On Wed, 2013-05-22 at 14:12 -0700, Joe Perches wrote: > On Wed, 2013-05-22 at 13:48 -0700, Eric Dumazet wrote: > > On Wed, 2013-05-22 at 13:38 -0700, Joe Perches wrote: > > > > > u16 for cpu is hardly used at all. > > > > It's used in the network tree. > > > > The day it has to change, don't worry, I think we'll be able to handle > > it. > > > > For the meantime, I do not want to add 4 more bytes to the conntrack > > structure; it's already bloated enough. > > I wrote "nice" already, no worries. > > btw: netdevice.h could possibly convert xmit_lock_owner > to u16 as well for that same 4-byte savings if the -1 > test were converted too. > --- > include/linux/netdevice.h | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index 60584b1..c38cd29 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -528,7 +528,7 @@ struct netdev_queue { > * write mostly part > */ > spinlock_t _xmit_lock ____cacheline_aligned_in_smp; > - int xmit_lock_owner; > + u16 xmit_lock_owner; > /* Typical machines have fewer than 10 structures like this one; there is no gain in trying to save a few bytes there, and more gain in getting correct alignments ;) Conntracking, on the other hand, can easily consume more than 10,000,000 conntracks on a machine. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 nf-next] netfilter: conntrack: remove the central spinlock 2013-05-22 17:47 ` [PATCH v2 " Eric Dumazet 2013-05-22 18:20 ` Joe Perches @ 2013-05-24 13:16 ` Jesper Dangaard Brouer 2013-05-24 13:51 ` Eric Dumazet 2013-08-26 22:28 ` Pablo Neira Ayuso 2 siblings, 1 reply; 21+ messages in thread From: Jesper Dangaard Brouer @ 2013-05-24 13:16 UTC (permalink / raw) To: Eric Dumazet Cc: Pablo Neira Ayuso, netfilter-devel, netdev, Tom Herbert, Patrick McHardy On Wed, 22 May 2013 10:47:48 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote: > nf_conntrack_lock is a monolithic lock and suffers from huge > contention on current generation servers (8 or more core/threads). > [...] > Results on a 32 threads machine, 200 concurrent instances of "netperf > -t TCP_CRR" : > > ~390000 tps instead of ~300000 tps. Tested-by: Jesper Dangaard Brouer <brouer@redhat.com> I gave the patch a quick run in my testlab, and the results are amazing; you are amazing, Eric! :-) Basic testlab setup: I'm generating a 2700 Kpps SYN-flood against port 80 (with trafgen) Baseline result from a 3.9.0-rc5 kernel: - With nf_conntrack my performance is 749 Kpps. If removing all iptables and nf_conntrack modules: - the performance hits 1095 Kpps. But it looks like we are hitting a new spin_lock in ip_send_reply() If I start a LISTEN process on the port, then we hit the "old" SYN scalability issues again, and performance drops to 227 Kpps. On a patched net-next (close to 3.10.0-rc1) kernel, with Eric's new locking scheme patch: - I measured an amazing 2431 Kpps. 13.45% [kernel] [k] fib_table_lookup 9.07% [nf_conntrack] [k] __nf_conntrack_alloc 6.50% [nf_conntrack] [k] nf_conntrack_free 5.24% [ip_tables] [k] ipt_do_table 3.66% [nf_conntrack] [k] nf_conntrack_in 3.54% [kernel] [k] inet_getpeer 3.52% [nf_conntrack] [k] tcp_packet 2.44% [ixgbe] [k] ixgbe_poll 2.30% [kernel] [k] __ip_route_output_key 2.04% [nf_conntrack] [k] nf_conntrack_tuple_taken 1.98% [kernel] [k] icmp_send Then, I realized that I didn't have any iptables rules that accepted port 80 on my testlab system, thus this was basically a packet-drop test with an nf_conntrack lookup. If I add a rule that accepts new connections to that port, e.g.: iptables -I INPUT -p tcp -m state --state NEW -m tcp --dport 80 -j ACCEPT New ruleset: -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT -A INPUT -p icmp -j ACCEPT -A INPUT -i lo -j ACCEPT -A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT -A INPUT -p tcp -m state --state NEW -m tcp --dport 80 -j ACCEPT -A INPUT -j REJECT --reject-with icmp-host-prohibited Then, performance drops again: - to approx 883 Kpps.
Discovered that the NAT stuff is to blame: - 17.71% swapper [kernel.kallsyms] [k] _raw_spin_lock_bh - _raw_spin_lock_bh + 47.17% nf_nat_cleanup_conntrack + 45.81% nf_nat_setup_info + 6.43% nf_nat_get_offset Removing the nat modules improves the performance: - to 1182 Kpps (not listening on port 80) sudo iptables -t nat -F sudo rmmod iptable_nat nf_nat_ipv4 And the perf output looks more like what I would expect: - 14.85% swapper [kernel.kallsyms] [k] _raw_spin_lock - _raw_spin_lock + 82.86% mod_timer + 11.14% nf_conntrack_double_lock + 2.50% nf_ct_del_from_dying_or_unconfirmed_list + 1.48% nf_conntrack_in + 1.30% nf_ct_delete_from_lists - 12.78% swapper [kernel.kallsyms] [k] _raw_spin_lock_irqsave - _raw_spin_lock_irqsave - 99.44% lock_timer_base + 99.07% del_timer + 0.93% mod_timer + 2.69% swapper [ip_tables] [k] ipt_do_table + 2.28% ksoftirqd/0 [kernel.kallsyms] [k] _raw_spin_lock_irqsave + 2.18% swapper [nf_conntrack] [k] tcp_packet + 2.16% swapper [kernel.kallsyms] [k] fib_table_lookup Again if I start a LISTEN process on the port, performance drops to 169 Kpps, due to the LISTEN and SYN-cookie scalability issues. I'm amazed; this patch will actually make it a viable choice to load the conntrack modules on a DDoS based filtering box, and use the conntracks to protect against ACK and SYN+ACK attacks. Simply by not accepting the ACK or SYN+ACK to create a conntrack entry. Via the command: sysctl -w net/netfilter/nf_conntrack_tcp_loose=0 A quick test shows: now I can run a LISTEN process on the port, and handle an SYN+ACK attack of approx 2580 Kpps (and the same for ACK attacks), while running a LISTEN process on the port. Thanks for the great work, Eric! ps. also tested resizing the hash tables, both: /proc/sys/net/netfilter/nf_conntrack_max and resizing the buckets via: /sys/module/nf_conntrack/parameters/hashsize -- Best regards, Jesper Dangaard Brouer MSc.CS, Sr. Network Kernel Developer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 nf-next] netfilter: conntrack: remove the central spinlock 2013-05-24 13:16 ` Jesper Dangaard Brouer @ 2013-05-24 13:51 ` Eric Dumazet 2013-05-27 12:33 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 21+ messages in thread From: Eric Dumazet @ 2013-05-24 13:51 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Pablo Neira Ayuso, netfilter-devel, netdev, Tom Herbert, Patrick McHardy On Fri, 2013-05-24 at 15:16 +0200, Jesper Dangaard Brouer wrote: > On Wed, 22 May 2013 10:47:48 -0700 > Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > nf_conntrack_lock is a monolithic lock and suffers from huge > > contention on current generation servers (8 or more core/threads). > > > [...] > > Results on a 32 threads machine, 200 concurrent instances of "netperf > > -t TCP_CRR" : > > > > ~390000 tps instead of ~300000 tps. > > Tested-by: Jesper Dangaard Brouer <brouer@redhat.com> > > I gave the patch a quick run in my testlab, and the results are > amazing; you are amazing, Eric! :-) > > Basic testlab setup: > I'm generating a 2700 Kpps SYN-flood against port 80 (with trafgen) > > Baseline result from a 3.9.0-rc5 kernel: > - With nf_conntrack my performance is 749 Kpps. > > If removing all iptables and nf_conntrack modules: > - the performance hits 1095 Kpps. > But it looks like we are hitting a new spin_lock in ip_send_reply() > > If I start a LISTEN process on the port, then we hit the "old" SYN > scalability issues again, and performance drops to 227 Kpps. > > On a patched net-next (close to 3.10.0-rc1) kernel, with Eric's new > locking scheme patch: > - I measured an amazing 2431 Kpps. > > 13.45% [kernel] [k] fib_table_lookup > 9.07% [nf_conntrack] [k] __nf_conntrack_alloc > 6.50% [nf_conntrack] [k] nf_conntrack_free > 5.24% [ip_tables] [k] ipt_do_table > 3.66% [nf_conntrack] [k] nf_conntrack_in > 3.54% [kernel] [k] inet_getpeer > 3.52% [nf_conntrack] [k] tcp_packet > 2.44% [ixgbe] [k] ixgbe_poll > 2.30% [kernel] [k] __ip_route_output_key > 2.04% [nf_conntrack] [k] nf_conntrack_tuple_taken > 1.98% [kernel] [k] icmp_send > > Then, I realized that I didn't have any iptables rules that accepted > port 80 on my testlab system, thus this was basically a packet-drop > test with an nf_conntrack lookup. > > If I add a rule that accepts new connections to that port, e.g.: > iptables -I INPUT -p tcp -m state --state NEW -m tcp --dport 80 -j > ACCEPT > > New ruleset: > -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT > -A INPUT -p icmp -j ACCEPT > -A INPUT -i lo -j ACCEPT > -A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT > -A INPUT -p tcp -m state --state NEW -m tcp --dport 80 -j ACCEPT > -A INPUT -j REJECT --reject-with icmp-host-prohibited > > Then, performance drops again: > - to approx 883 Kpps.
> > Discovered that the NAT stuff is to blame: > > - 17.71% swapper [kernel.kallsyms] [k] _raw_spin_lock_bh > - _raw_spin_lock_bh > + 47.17% nf_nat_cleanup_conntrack > + 45.81% nf_nat_setup_info > + 6.43% nf_nat_get_offset > > Removing the nat modules improves the performance: > - to 1182 Kpps (not listening on port 80) > > sudo iptables -t nat -F > sudo rmmod iptable_nat nf_nat_ipv4 > > And the perf output looks more like what I would expect: > > - 14.85% swapper [kernel.kallsyms] [k] _raw_spin_lock > - _raw_spin_lock > + 82.86% mod_timer > + 11.14% nf_conntrack_double_lock > + 2.50% nf_ct_del_from_dying_or_unconfirmed_list > + 1.48% nf_conntrack_in > + 1.30% nf_ct_delete_from_lists > - 12.78% swapper [kernel.kallsyms] [k] > _raw_spin_lock_irqsave > - _raw_spin_lock_irqsave > - 99.44% lock_timer_base > + 99.07% del_timer > + 0.93% mod_timer > + 2.69% swapper [ip_tables] [k] ipt_do_table > + 2.28% ksoftirqd/0 [kernel.kallsyms] [k] > _raw_spin_lock_irqsave > + 2.18% swapper [nf_conntrack] [k] tcp_packet > + 2.16% swapper [kernel.kallsyms] [k] fib_table_lookup > > > Again if I start a LISTEN process on the port, performance drops to > 169 Kpps, due to the LISTEN and SYN-cookie scalability issues. > > I'm amazed; this patch will actually make it a viable choice to load > the conntrack modules on a DDoS based filtering box, and use the > conntracks to protect against ACK and SYN+ACK attacks. > > Simply by not accepting the ACK or SYN+ACK to create a conntrack entry. > Via the command: > sysctl -w net/netfilter/nf_conntrack_tcp_loose=0 > > A quick test shows: now I can run a LISTEN process on the port, and > handle an SYN+ACK attack of approx 2580 Kpps (and the same for ACK > attacks), while running a LISTEN process on the port. > > Thanks for the great work, Eric! > > ps. also tested resizing the hash tables, both: > /proc/sys/net/netfilter/nf_conntrack_max > and resizing the buckets via: > /sys/module/nf_conntrack/parameters/hashsize > Wow, this is very interesting! Did you test the thing when expectations are possible? (say ftp module loaded) I think we should add RCU in the fast path, instead of having to lock the expectation lock. It's totally doable. ^ permalink raw reply [flat|nested] 21+ messages in thread
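The RCU fast path Eric alludes to would roughly take the shape below: readers walk the expectation hash under rcu_read_lock() and pin a match before leaving the read-side critical section, while writers keep serializing insert/remove under nf_conntrack_expect_lock and defer frees via call_rcu(). This is an untested sketch only; the helper names are borrowed from nf_conntrack_expect.c and the matching logic is simplified:

static struct nf_conntrack_expect *
expect_find_rcu(struct net *net, u16 zone,
                const struct nf_conntrack_tuple *tuple)
{
        struct nf_conntrack_expect *exp;
        unsigned int h = nf_ct_expect_dst_hash(tuple);

        rcu_read_lock();
        hlist_for_each_entry_rcu(exp, &net->ct.expect_hash[h], hnode) {
                if (!nf_ct_tuple_mask_cmp(tuple, &exp->tuple, &exp->mask) ||
                    nf_ct_zone(exp->master) != zone)
                        continue;
                /* pin it before dropping the RCU read lock; a writer
                 * may be unlinking it concurrently */
                if (!atomic_inc_not_zero(&exp->use))
                        continue;
                rcu_read_unlock();
                return exp;
        }
        rcu_read_unlock();
        return NULL;
}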
* Re: [PATCH v2 nf-next] netfilter: conntrack: remove the central spinlock 2013-05-24 13:51 ` Eric Dumazet @ 2013-05-27 12:33 ` Jesper Dangaard Brouer 2013-05-27 12:36 ` Pablo Neira Ayuso 0 siblings, 1 reply; 21+ messages in thread From: Jesper Dangaard Brouer @ 2013-05-27 12:33 UTC (permalink / raw) To: Eric Dumazet Cc: Pablo Neira Ayuso, netfilter-devel, netdev, Tom Herbert, Patrick McHardy On Fri, 24 May 2013 06:51:36 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Fri, 2013-05-24 at 15:16 +0200, Jesper Dangaard Brouer wrote: [...cut...] > > I'm amazed; this patch will actually make it a viable choice to load > > the conntrack modules on a DDoS based filtering box, and use the > > conntracks to protect against ACK and SYN+ACK attacks. > > > > Simply by not accepting the ACK or SYN+ACK to create a conntrack > > entry. Via the command: > > sysctl -w net/netfilter/nf_conntrack_tcp_loose=0 > > > > A quick test shows: now I can run a LISTEN process on the port, and > > handle an SYN+ACK attack of approx 2580 Kpps (and the same for ACK > > attacks), while running a LISTEN process on the port. > > [...] > > > > Wow, this is very interesting! > > Did you test the thing when expectations are possible? (say ftp > module loaded) Nope. I'm not sure how to create a test case that causes an expectation to be created. > I think we should add RCU in the fast path, instead of having to lock > the expectation lock. It's totally doable. Interesting! :-) -- Best regards, Jesper Dangaard Brouer MSc.CS, Sr. Network Kernel Developer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 nf-next] netfilter: conntrack: remove the central spinlock 2013-05-27 12:33 ` Jesper Dangaard Brouer @ 2013-05-27 12:36 ` Pablo Neira Ayuso 2013-08-23 14:42 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 21+ messages in thread From: Pablo Neira Ayuso @ 2013-05-27 12:36 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Eric Dumazet, netfilter-devel, netdev, Tom Herbert, Patrick McHardy On Mon, May 27, 2013 at 02:33:46PM +0200, Jesper Dangaard Brouer wrote: > On Fri, 24 May 2013 06:51:36 -0700 > Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > On Fri, 2013-05-24 at 15:16 +0200, Jesper Dangaard Brouer wrote: > [...cut...] > > > I'm amazed; this patch will actually make it a viable choice to load > > > the conntrack modules on a DDoS based filtering box, and use the > > > conntracks to protect against ACK and SYN+ACK attacks. > > > > > > Simply by not accepting the ACK or SYN+ACK to create a conntrack > > > entry. Via the command: > > > sysctl -w net/netfilter/nf_conntrack_tcp_loose=0 > > > > > > A quick test shows: now I can run a LISTEN process on the port, and > > > handle an SYN+ACK attack of approx 2580 Kpps (and the same for ACK > > > attacks), while running a LISTEN process on the port. > > > > [...] > > > > > > > > Wow, this is very interesting! > > > > Did you test the thing when expectations are possible? (say ftp > > module loaded) > > Nope. I'm not sure how to create a test case that causes an > expectation to be created. This is still in my queue; I didn't forget about this. I need to find some spare time to give this a test with expectations enabled and also with conntrackd/state-sync. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 nf-next] netfilter: conntrack: remove the central spinlock
  2013-05-27 12:36       ` Pablo Neira Ayuso
@ 2013-08-23 14:42         ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2013-08-23 14:42 UTC
To: Pablo Neira Ayuso
Cc: Eric Dumazet, netfilter-devel, netdev, Tom Herbert,
    Patrick McHardy

Hi Pablo,

On Mon, 27 May 2013 14:36:56 +0200
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> On Mon, May 27, 2013 at 02:33:46PM +0200, Jesper Dangaard Brouer wrote:
> > On Fri, 24 May 2013 06:51:36 -0700
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > > On Fri, 2013-05-24 at 15:16 +0200, Jesper Dangaard Brouer wrote:
> > [...cut...]
> > > > I'm amazed; this patch will actually make it a viable choice to
> > > > load the conntrack modules on a DDoS-based filtering box, and
> > > > use the conntracks to protect against ACK and SYN+ACK attacks,
> > > > simply by not accepting an ACK or SYN+ACK to create a conntrack
> > > > entry, via the command:
> > > >  sysctl -w net/netfilter/nf_conntrack_tcp_loose=0
> > > >
> > > > A quick test shows that, with a LISTEN process running on the
> > > > port, I can now handle a SYN+ACK attack of approx 2580 Kpps (and
> > > > the same for ACK attacks).
> > >
> > > [...]
> > >
> > > Wow, this is very interesting!
> > >
> > > Did you test the thing when expectations are possible? (say, with
> > > the ftp helper module loaded)
> >
> > Nope. I'm not sure how to create a test case that causes an
> > expectation to be created.
>
> This is still in my queue, I didn't forget about this. I need to find
> some spare time to give this a test with expectations enabled and also
> with conntrackd/state-sync.

What is the status of this patch? Is it still in your queue, or did it
get applied without me noticing?

Link for people wanting to read up on the thread:
http://thread.gmane.org/gmane.linux.network/268758/

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH v2 nf-next] netfilter: conntrack: remove the central spinlock
  2013-05-22 17:47 ` [PATCH v2 " Eric Dumazet
  2013-05-22 18:20   ` Joe Perches
  2013-05-24 13:16   ` Jesper Dangaard Brouer
@ 2013-08-26 22:28   ` Pablo Neira Ayuso
  2 siblings, 0 replies; 21+ messages in thread
From: Pablo Neira Ayuso @ 2013-08-26 22:28 UTC
To: Eric Dumazet
Cc: netfilter-devel, netdev, Tom Herbert, Patrick McHardy,
    Jesper Dangaard Brouer

Hi Eric,

Several comments on the expectation side.

On Wed, May 22, 2013 at 10:47:48AM -0700, Eric Dumazet wrote:
[...]
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 0283bae..c0823b9 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
[...]
> @@ -196,6 +247,47 @@ clean_from_lists(struct nf_conn *ct)
>  	nf_ct_remove_expectations(ct);

With this patch, we hold the conntrack lock when calling
clean_from_lists, but I think that's not enough:
nf_conntrack_expect_lock spinlock protection is missing here.

> }
[...]
> @@ -820,39 +930,41 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
> 			    ecache ? ecache->expmask : 0,
> 			    GFP_ATOMIC);
>
> -	spin_lock_bh(&nf_conntrack_lock);
> -	exp = nf_ct_find_expectation(net, zone, tuple);
> -	if (exp) {
> -		pr_debug("conntrack: expectation arrives ct=%p exp=%p\n",
> -			 ct, exp);
> -		/* Welcome, Mr. Bond. We've been expecting you... */
> -		__set_bit(IPS_EXPECTED_BIT, &ct->status);
> -		ct->master = exp->master;
> -		if (exp->helper) {
> -			help = nf_ct_helper_ext_add(ct, exp->helper,
> -						    GFP_ATOMIC);
> -			if (help)
> -				rcu_assign_pointer(help->helper, exp->helper);
> -		}
> +	local_bh_disable();
> +	if (net->ct.expect_count) {
> +		spin_lock(&nf_conntrack_expect_lock);

I also think we have a possible race for expected conntracks. The
problem is that expectations don't bump the refcount of the master
conntrack.

> +		exp = nf_ct_find_expectation(net, zone, tuple);

...

Now that we have independent locks for conntrack and expectation
lists, a packet may match an expectation while another CPU is walking
through the nf_ct_delete_from_lists bits using exp->master as input.

> +		if (exp) {
> +			pr_debug("conntrack: expectation arrives ct=%p exp=%p\n",
> +				 ct, exp);
> +			/* Welcome, Mr. Bond. We've been expecting you... */
> +			__set_bit(IPS_EXPECTED_BIT, &ct->status);
> +			ct->master = exp->master;

Then, ct->master points to the (vanished) conntrack.

> +			if (exp->helper) {
> +				help = nf_ct_helper_ext_add(ct, exp->helper,
> +							    GFP_ATOMIC);
> +				if (help)
> +					rcu_assign_pointer(help->helper, exp->helper);
> +			}
>
>  #ifdef CONFIG_NF_CONNTRACK_MARK
> -	ct->mark = exp->master->mark;
> +			ct->mark = exp->master->mark;
>  #endif
>  #ifdef CONFIG_NF_CONNTRACK_SECMARK
> -	ct->secmark = exp->master->secmark;
> +			ct->secmark = exp->master->secmark;
>  #endif
> -	nf_conntrack_get(&ct->master->ct_general);
> -	NF_CT_STAT_INC(net, expect_new);
> -	} else {
> +			nf_conntrack_get(&ct->master->ct_general);

So this needs to adjust the refcounting logic for expectations, to bump
it earlier, during the expectation insertion.

Regards.
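Pablo's suggested direction, sketched as hypothetical code (helper
names follow the 3.x conntrack sources; this illustrates the idea, not
the eventual fix): let the expectation itself hold a reference on its
master, taken when the expectation enters net->ct.expect_hash and
dropped when it is removed, so exp->master stays valid for as long as
the expectation can match a packet.

 #include <net/netfilter/nf_conntrack.h>
 #include <net/netfilter/nf_conntrack_expect.h>

 /* Called from nf_ct_expect_insert(), when the expectation becomes
  * visible in net->ct.expect_hash. */
 static void nf_ct_expect_pin_master(struct nf_conntrack_expect *exp)
 {
 	nf_conntrack_get(&exp->master->ct_general);
 }

 /* Called when the expectation is unlinked (timeout, flush or explicit
  * removal); after this the master may be reclaimed. */
 static void nf_ct_expect_unpin_master(struct nf_conntrack_expect *exp)
 {
 	nf_ct_put(exp->master);
 }

With that reference in place, init_conntrack() dereferences exp->master
while the expectation still pins it, and its own nf_conntrack_get()
then keeps the master alive for ct->master.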
end of thread, other threads: [~2013-08-26 22:28 UTC | newest]

Thread overview: 21+ messages:

2013-05-09  3:04 [PATCH nf-next] netfilter: conntrack: remove the central spinlock Eric Dumazet
2013-05-09  5:43 ` Cong Wang
2013-05-09  6:01   ` Eric Dumazet
2013-05-09  7:46     ` Cong Wang
2013-05-09 13:46       ` Eric Dumazet
2013-05-22 17:47 ` [PATCH v2 " Eric Dumazet
2013-05-22 18:20   ` Joe Perches
2013-05-22 19:26     ` Eric Dumazet
2013-05-22 19:57       ` Joe Perches
2013-05-22 20:16         ` Eric Dumazet
2013-05-22 20:38           ` Joe Perches
2013-05-22 20:48             ` Eric Dumazet
2013-05-22 21:12               ` Joe Perches
2013-05-22 21:29                 ` David Miller
2013-05-22 21:34             ` Eric Dumazet
2013-05-24 13:16   ` Jesper Dangaard Brouer
2013-05-24 13:51     ` Eric Dumazet
2013-05-27 12:33       ` Jesper Dangaard Brouer
2013-05-27 12:36         ` Pablo Neira Ayuso
2013-08-23 14:42           ` Jesper Dangaard Brouer
2013-08-26 22:28 ` Pablo Neira Ayuso