* [PATCH 00/38] Netfilter/IPVS updates for net-next
@ 2014-03-17 12:42 Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 01/38] netfilter: remove double colon Pablo Neira Ayuso
                   ` (38 more replies)
  0 siblings, 39 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

Hi David,

The following patchset contains Netfilter/IPVS updates for net-next.
Most relevantly, they are:

* Cleanup to remove a double semicolon, from Stephen Hemminger.

* Calm down a sparse warning in xt_ipcomp, from Fan Du.

* nf_ct_labels support for nf_tables, from Florian Westphal.

* New macros to simplify RCU dereferences in the scope of nfnetlink
  and nf_tables, from Patrick McHardy.

* Accept queue and drop verdict parameters (including the reason for a
  drop) in nf_tables verdict parsing, also from Patrick.

* Remove unused random seed initialization in nfnetlink_log, from
  Florian Westphal.

* Allow attaching user-specific information to nf_tables rules, useful
  to attach user comments to a rule, from me.

* Return errors in ipset according to the manpage documentation, from
  Jozsef Kadlecsik.

* Fix coccinelle warnings related to incorrect bool type usage for ipset,
  from Fengguang Wu.

* Add hash:ip,mark set type to ipset, from Vytas Dauksa.

* Fix ipset printing its registration message once for each netns that
  is created, from Ilia Mirkin.

* Add forceadd option to ipset, which evicts a random entry from the set
  if it becomes full, from Josh Hunt.

* Minor IPVS cleanups and fixes from Andi Kleen and Tingwei Liu.

* Improve conntrack scalability by removing a central spinlock, original
  work from Eric Dumazet. Jesper Dangaard Brouer took it over to address
  the remaining issues. Several patches to prepare this change come
  first.

* Rework nft_hash to resolve bugs (leaking chain, missing RCU
  synchronization on element removal, etc.), from Patrick McHardy.

* Restore context in the rule deletion path, as we now release rule
  objects synchronously, from Patrick McHardy. This brings back event
  notifications for anonymous sets.

* Fix NAT family validation in nft_nat, also from Patrick.

* Improve scalability of xt_connlimit by using an array of spinlocks
  and by introducing an rb-tree of hashtables for faster lookup of
  accounted objects per network. This patch was preceded by several
  preparation patches and refactoring to accommodate this change,
  including the use of kmem_cache, from Florian Westphal.
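
To illustrate the keyed-lock idea from the last point, here is a rough
sketch only (the names and sizes below are made up, it is not the code
from the series): the connection hash picks one lock out of a small
array instead of taking a single global spinlock.

	#include <linux/spinlock.h>
	#include <linux/types.h>

	/* Illustrative sketch; each lock is spin_lock_init()ed at
	 * module init time (omitted here). */
	#define CONNLIMIT_LOCK_SLOTS	256
	static spinlock_t connlimit_locks[CONNLIMIT_LOCK_SLOTS];

	static spinlock_t *connlimit_lock(u32 hash)
	{
		return &connlimit_locks[hash % CONNLIMIT_LOCK_SLOTS];
	}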

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git master

These changes should merge cleanly without conflicts into your net-next tree.

Thanks a lot!

----------------------------------------------------------------

The following changes since commit 1e8d6421cff2c24fe0b345711e7a21af02e8bcf5:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2014-02-19 01:24:22 -0500)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git master

for you to fetch changes up to 7d08487777c8b30dea34790734d708470faaf1e5:

  netfilter: connlimit: use rbtree for per-host conntrack obj storage (2014-03-17 11:11:57 +0100)

----------------------------------------------------------------
Andi Kleen (1):
      sections, ipvs: Remove useless __read_mostly for ipvs genl_ops

Fengguang Wu (1):
      netfilter: ipset: Add hash: fix coccinelle warnings

Florian Westphal (10):
      netfilter: nft_ct: labels get support
      netfilter: nfnetlink_log: remove unused code
      netfilter: ipset: kernel: uapi: fix MARKMASK attr ABI breakage
      netfilter: connlimit: factor hlist search into new function
      netfilter: connlimit: improve packet-to-closed-connection logic
      netfilter: connlimit: move insertion of new element out of count function
      netfilter: connlimit: use kmem_cache for conn objects
      netfilter: connlimit: use keyed locks
      netfilter: connlimit: make same_source_net signed
      netfilter: connlimit: use rbtree for per-host conntrack obj storage

Ilia Mirkin (1):
      netfilter: ipset: move registration message to init from net_init

Jesper Dangaard Brouer (5):
      netfilter: trivial code cleanup and doc changes
      netfilter: conntrack: spinlock per cpu to protect special lists.
      netfilter: avoid race with exp->master ct
      netfilter: conntrack: seperate expect locking from nf_conntrack_lock
      netfilter: conntrack: remove central spinlock nf_conntrack_lock

Joe Perches (1):
      netfilter: Convert uses of __constant_<foo> to <foo>

Josh Hunt (1):
      netfilter: ipset: add forceadd kernel support for hash set types

Jozsef Kadlecsik (1):
      netfilter: ipset: Prepare the kernel for create option flags when no extension is needed

Pablo Neira Ayuso (3):
      netfilter: xt_ipcomp: Use ntohs to ease sparse warning
      netfilter: nf_tables: add optional user data area to rules
      Merge git://git.kernel.org/.../horms/ipvs-next

Patrick McHardy (10):
      netfilter: ip_set: rename nfnl_dereference()/nfnl_set()
      netfilter: nfnetlink: add rcu_dereference_protected() helpers
      netfilter: nf_tables: add nft_dereference() macro
      netfilter: nf_tables: accept QUEUE/DROP verdict parameters
      netfilter: nft_hash: bug fixes and resizing
      netfilter: nf_tables: clean up nf_tables_trans_add() argument order
      netfilter: nf_tables: restore context for expression destructors
      netfilter: nf_tables: restore notifications for anonymous set destruction
      netfilter: nft_ct: remove family from struct nft_ct
      netfilter: nft_nat: fix family validation

Sergey Popovich (1):
      netfilter: ipset: Follow manual page behavior for SET target on list:set

Tingwei Liu (1):
      ipvs: Reduce checkpatch noise in ip_vs_lblc.c

Vytas Dauksa (2):
      netfilter: ipset: add hash:ip,mark data type to ipset
      netfilter: ipset: add markmask for hash:ip,mark data type

stephen hemminger (1):
      netfilter: remove double colon

 include/linux/netfilter/ipset/ip_set.h       |   15 +-
 include/linux/netfilter/nfnetlink.h          |   21 ++
 include/net/netfilter/nf_conntrack.h         |   11 +-
 include/net/netfilter/nf_conntrack_core.h    |    9 +-
 include/net/netfilter/nf_conntrack_labels.h  |    4 +-
 include/net/netfilter/nf_tables.h            |   28 +-
 include/net/netns/conntrack.h                |   13 +-
 include/uapi/linux/netfilter/ipset/ip_set.h  |   12 +
 include/uapi/linux/netfilter/nf_tables.h     |    6 +-
 net/ipv4/netfilter.c                         |    2 +-
 net/netfilter/ipset/Kconfig                  |    9 +
 net/netfilter/ipset/Makefile                 |    1 +
 net/netfilter/ipset/ip_set_core.c            |   54 ++--
 net/netfilter/ipset/ip_set_hash_gen.h        |   43 +++
 net/netfilter/ipset/ip_set_hash_ip.c         |    3 +-
 net/netfilter/ipset/ip_set_hash_ipmark.c     |  321 +++++++++++++++++++
 net/netfilter/ipset/ip_set_hash_ipport.c     |    3 +-
 net/netfilter/ipset/ip_set_hash_ipportip.c   |    3 +-
 net/netfilter/ipset/ip_set_hash_ipportnet.c  |    3 +-
 net/netfilter/ipset/ip_set_hash_net.c        |    3 +-
 net/netfilter/ipset/ip_set_hash_netiface.c   |    3 +-
 net/netfilter/ipset/ip_set_hash_netnet.c     |   10 +-
 net/netfilter/ipset/ip_set_hash_netport.c    |    3 +-
 net/netfilter/ipset/ip_set_hash_netportnet.c |    3 +-
 net/netfilter/ipset/pfxlen.c                 |    4 +-
 net/netfilter/ipvs/ip_vs_ctl.c               |    2 +-
 net/netfilter/ipvs/ip_vs_lblc.c              |   13 +-
 net/netfilter/nf_conntrack_core.c            |  432 ++++++++++++++++++--------
 net/netfilter/nf_conntrack_expect.c          |   36 ++-
 net/netfilter/nf_conntrack_h323_main.c       |    4 +-
 net/netfilter/nf_conntrack_helper.c          |   41 ++-
 net/netfilter/nf_conntrack_netlink.c         |  133 ++++----
 net/netfilter/nf_conntrack_sip.c             |    8 +-
 net/netfilter/nf_tables_api.c                |   80 +++--
 net/netfilter/nfnetlink.c                    |    8 +
 net/netfilter/nfnetlink_log.c                |    8 -
 net/netfilter/nft_compat.c                   |    4 +-
 net/netfilter/nft_ct.c                       |   36 ++-
 net/netfilter/nft_hash.c                     |  260 +++++++++++++---
 net/netfilter/nft_immediate.c                |    3 +-
 net/netfilter/nft_log.c                      |    3 +-
 net/netfilter/nft_lookup.c                   |    5 +-
 net/netfilter/nft_nat.c                      |   22 +-
 net/netfilter/xt_AUDIT.c                     |    4 +-
 net/netfilter/xt_connlimit.c                 |  311 ++++++++++++++----
 net/netfilter/xt_ipcomp.c                    |    2 +-
 46 files changed, 1527 insertions(+), 475 deletions(-)
 create mode 100644 net/netfilter/ipset/ip_set_hash_ipmark.c


* [PATCH 01/38] netfilter: remove double colon
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 02/38] netfilter: xt_ipcomp: Use ntohs to ease sparse warning Pablo Neira Ayuso
                   ` (37 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: stephen hemminger <stephen@networkplumber.org>

This is C, not shell script.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv4/netfilter.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/netfilter.c b/net/ipv4/netfilter.c
index c3e0ade..7ebd6e3 100644
--- a/net/ipv4/netfilter.c
+++ b/net/ipv4/netfilter.c
@@ -61,7 +61,7 @@ int ip_route_me_harder(struct sk_buff *skb, unsigned int addr_type)
 		skb_dst_set(skb, NULL);
 		dst = xfrm_lookup(net, dst, flowi4_to_flowi(&fl4), skb->sk, 0);
 		if (IS_ERR(dst))
-			return PTR_ERR(dst);;
+			return PTR_ERR(dst);
 		skb_dst_set(skb, dst);
 	}
 #endif
-- 
1.7.10.4


* [PATCH 02/38] netfilter: xt_ipcomp: Use ntohs to ease sparse warning
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 01/38] netfilter: remove double colon Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 03/38] netfilter: nft_ct: labels get support Pablo Neira Ayuso
                   ` (36 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

0-DAY kernel build testing backend reported:

sparse warnings: (new ones prefixed by >>)

 >> >> net/netfilter/xt_ipcomp.c:63:26: sparse: restricted __be16 degrades to integer
 >> >> net/netfilter/xt_ipcomp.c:63:26: sparse: cast to restricted __be32

Fix this by using ntohs without shifting.

Tested with: make C=1 CF=-D__CHECK_ENDIAN__

Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/xt_ipcomp.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/xt_ipcomp.c b/net/netfilter/xt_ipcomp.c
index a4c7561..89d5310 100644
--- a/net/netfilter/xt_ipcomp.c
+++ b/net/netfilter/xt_ipcomp.c
@@ -60,7 +60,7 @@ static bool comp_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	}
 
 	return spi_match(compinfo->spis[0], compinfo->spis[1],
-			 ntohl(chdr->cpi << 16),
+			 ntohs(chdr->cpi),
 			 !!(compinfo->invflags & XT_IPCOMP_INV_SPI));
 }
 
-- 
1.7.10.4


* [PATCH 03/38] netfilter: nft_ct: labels get support
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 01/38] netfilter: remove double colon Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 02/38] netfilter: xt_ipcomp: Use ntohs to ease sparse warning Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 04/38] netfilter: ip_set: rename nfnl_dereference()/nfnl_set() Pablo Neira Ayuso
                   ` (35 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Florian Westphal <fw@strlen.de>

This also adds NF_CT_LABELS_MAX_SIZE so it can be re-used
in a BUILD_BUG_ON in nft_ct.

At this time, nft doesn't yet support writing to the label area;
when this changes, the label->words handling needs to be moved
out of xt_connlabel.c into nf_conntrack_labels.c.

Also removes a useless run-time check: words cannot grow beyond
4 (32-bit) or 2 (64-bit) since xt_connlabel enforces a maximum of
128 labels.
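
A quick sketch of the arithmetic behind the removed check (illustration
only; it assumes the current uapi values XT_CONNLABEL_MAXBIT == 127 and
BITS_PER_BYTE == 8):

	#define BITS_PER_BYTE		8
	#define XT_CONNLABEL_MAXBIT	127	/* 128 labels max */

	/* 128 label bits fit in 16 bytes of bitmap storage ... */
	#define NF_CT_LABELS_MAX_SIZE	((XT_CONNLABEL_MAXBIT + 1) / BITS_PER_BYTE)

	/* ... which is at most 4 unsigned longs on 32-bit or 2 on 64-bit,
	 * so 'words' can never get anywhere near the old
	 * WARN_ON_ONCE(words > 8) limit. */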

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_conntrack_labels.h |    4 +++-
 include/uapi/linux/netfilter/nf_tables.h    |    1 +
 net/netfilter/nf_conntrack_netlink.c        |    5 ++---
 net/netfilter/nft_ct.c                      |   24 ++++++++++++++++++++++++
 4 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_labels.h b/include/net/netfilter/nf_conntrack_labels.h
index c985695..dec6336 100644
--- a/include/net/netfilter/nf_conntrack_labels.h
+++ b/include/net/netfilter/nf_conntrack_labels.h
@@ -7,6 +7,8 @@
 
 #include <uapi/linux/netfilter/xt_connlabel.h>
 
+#define NF_CT_LABELS_MAX_SIZE ((XT_CONNLABEL_MAXBIT + 1) / BITS_PER_BYTE)
+
 struct nf_conn_labels {
 	u8 words;
 	unsigned long bits[];
@@ -29,7 +31,7 @@ static inline struct nf_conn_labels *nf_ct_labels_ext_add(struct nf_conn *ct)
 	u8 words;
 
 	words = ACCESS_ONCE(net->ct.label_words);
-	if (words == 0 || WARN_ON_ONCE(words > 8))
+	if (words == 0)
 		return NULL;
 
 	cl_ext = nf_ct_ext_add_length(ct, NF_CT_EXT_LABELS,
diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index 83c985a..c84c452 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -601,6 +601,7 @@ enum nft_ct_keys {
 	NFT_CT_PROTOCOL,
 	NFT_CT_PROTO_SRC,
 	NFT_CT_PROTO_DST,
+	NFT_CT_LABELS,
 };
 
 /**
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index bb322d0..47e9369 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -966,7 +966,6 @@ ctnetlink_parse_help(const struct nlattr *attr, char **helper_name,
 	return 0;
 }
 
-#define __CTA_LABELS_MAX_LENGTH ((XT_CONNLABEL_MAXBIT + 1) / BITS_PER_BYTE)
 static const struct nla_policy ct_nla_policy[CTA_MAX+1] = {
 	[CTA_TUPLE_ORIG]	= { .type = NLA_NESTED },
 	[CTA_TUPLE_REPLY]	= { .type = NLA_NESTED },
@@ -984,9 +983,9 @@ static const struct nla_policy ct_nla_policy[CTA_MAX+1] = {
 	[CTA_ZONE]		= { .type = NLA_U16 },
 	[CTA_MARK_MASK]		= { .type = NLA_U32 },
 	[CTA_LABELS]		= { .type = NLA_BINARY,
-				    .len = __CTA_LABELS_MAX_LENGTH },
+				    .len = NF_CT_LABELS_MAX_SIZE },
 	[CTA_LABELS_MASK]	= { .type = NLA_BINARY,
-				    .len = __CTA_LABELS_MAX_LENGTH },
+				    .len = NF_CT_LABELS_MAX_SIZE },
 };
 
 static int
diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index 46e2754..e59b08f 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -19,6 +19,7 @@
 #include <net/netfilter/nf_conntrack_tuple.h>
 #include <net/netfilter/nf_conntrack_helper.h>
 #include <net/netfilter/nf_conntrack_ecache.h>
+#include <net/netfilter/nf_conntrack_labels.h>
 
 struct nft_ct {
 	enum nft_ct_keys	key:8;
@@ -97,6 +98,26 @@ static void nft_ct_get_eval(const struct nft_expr *expr,
 			goto err;
 		strncpy((char *)dest->data, helper->name, sizeof(dest->data));
 		return;
+#ifdef CONFIG_NF_CONNTRACK_LABELS
+	case NFT_CT_LABELS: {
+		struct nf_conn_labels *labels = nf_ct_labels_find(ct);
+		unsigned int size;
+
+		if (!labels) {
+			memset(dest->data, 0, sizeof(dest->data));
+			return;
+		}
+
+		BUILD_BUG_ON(NF_CT_LABELS_MAX_SIZE > sizeof(dest->data));
+		size = labels->words * sizeof(long);
+
+		memcpy(dest->data, labels->bits, size);
+		if (size < sizeof(dest->data))
+			memset(((char *) dest->data) + size, 0,
+			       sizeof(dest->data) - size);
+		return;
+	}
+#endif
 	}
 
 	tuple = &ct->tuplehash[priv->dir].tuple;
@@ -221,6 +242,9 @@ static int nft_ct_init_validate_get(const struct nft_expr *expr,
 #ifdef CONFIG_NF_CONNTRACK_SECMARK
 	case NFT_CT_SECMARK:
 #endif
+#ifdef CONFIG_NF_CONNTRACK_LABELS
+	case NFT_CT_LABELS:
+#endif
 	case NFT_CT_EXPIRATION:
 	case NFT_CT_HELPER:
 		if (tb[NFTA_CT_DIRECTION] != NULL)
-- 
1.7.10.4



* [PATCH 04/38] netfilter: ip_set: rename nfnl_dereference()/nfnl_set()
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (2 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 03/38] netfilter: nft_ct: labels get support Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 05/38] netfilter: nfnetlink: add rcu_dereference_protected() helpers Pablo Neira Ayuso
                   ` (34 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Patrick McHardy <kaber@trash.net>

The next patch will introduce a nfnl_dereference() macro that actually
checks that the appropriate mutex is held and therefore needs a
subsystem argument.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/ipset/ip_set_core.c |   46 ++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index de770ec..728a2cf 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -54,10 +54,10 @@ MODULE_DESCRIPTION("core IP set support");
 MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_IPSET);
 
 /* When the nfnl mutex is held: */
-#define nfnl_dereference(p)		\
+#define ip_set_dereference(p)		\
 	rcu_dereference_protected(p, 1)
-#define nfnl_set(inst, id)			\
-	nfnl_dereference((inst)->ip_set_list)[id]
+#define ip_set(inst, id)		\
+	ip_set_dereference((inst)->ip_set_list)[id]
 
 /*
  * The set types are implemented in modules and registered set types
@@ -640,7 +640,7 @@ ip_set_nfnl_get_byindex(struct net *net, ip_set_id_t index)
 		return IPSET_INVALID_ID;
 
 	nfnl_lock(NFNL_SUBSYS_IPSET);
-	set = nfnl_set(inst, index);
+	set = ip_set(inst, index);
 	if (set)
 		__ip_set_get(set);
 	else
@@ -666,7 +666,7 @@ ip_set_nfnl_put(struct net *net, ip_set_id_t index)
 
 	nfnl_lock(NFNL_SUBSYS_IPSET);
 	if (!inst->is_deleted) { /* already deleted from ip_set_net_exit() */
-		set = nfnl_set(inst, index);
+		set = ip_set(inst, index);
 		if (set != NULL)
 			__ip_set_put(set);
 	}
@@ -734,7 +734,7 @@ find_set_and_id(struct ip_set_net *inst, const char *name, ip_set_id_t *id)
 
 	*id = IPSET_INVALID_ID;
 	for (i = 0; i < inst->ip_set_max; i++) {
-		set = nfnl_set(inst, i);
+		set = ip_set(inst, i);
 		if (set != NULL && STREQ(set->name, name)) {
 			*id = i;
 			break;
@@ -760,7 +760,7 @@ find_free_id(struct ip_set_net *inst, const char *name, ip_set_id_t *index,
 
 	*index = IPSET_INVALID_ID;
 	for (i = 0;  i < inst->ip_set_max; i++) {
-		s = nfnl_set(inst, i);
+		s = ip_set(inst, i);
 		if (s == NULL) {
 			if (*index == IPSET_INVALID_ID)
 				*index = i;
@@ -883,7 +883,7 @@ ip_set_create(struct sock *ctnl, struct sk_buff *skb,
 		if (!list)
 			goto cleanup;
 		/* nfnl mutex is held, both lists are valid */
-		tmp = nfnl_dereference(inst->ip_set_list);
+		tmp = ip_set_dereference(inst->ip_set_list);
 		memcpy(list, tmp, sizeof(struct ip_set *) * inst->ip_set_max);
 		rcu_assign_pointer(inst->ip_set_list, list);
 		/* Make sure all current packets have passed through */
@@ -900,7 +900,7 @@ ip_set_create(struct sock *ctnl, struct sk_buff *skb,
 	 * Finally! Add our shiny new set to the list, and be done.
 	 */
 	pr_debug("create: '%s' created with index %u!\n", set->name, index);
-	nfnl_set(inst, index) = set;
+	ip_set(inst, index) = set;
 
 	return ret;
 
@@ -925,10 +925,10 @@ ip_set_setname_policy[IPSET_ATTR_CMD_MAX + 1] = {
 static void
 ip_set_destroy_set(struct ip_set_net *inst, ip_set_id_t index)
 {
-	struct ip_set *set = nfnl_set(inst, index);
+	struct ip_set *set = ip_set(inst, index);
 
 	pr_debug("set: %s\n",  set->name);
-	nfnl_set(inst, index) = NULL;
+	ip_set(inst, index) = NULL;
 
 	/* Must call it without holding any lock */
 	set->variant->destroy(set);
@@ -962,7 +962,7 @@ ip_set_destroy(struct sock *ctnl, struct sk_buff *skb,
 	read_lock_bh(&ip_set_ref_lock);
 	if (!attr[IPSET_ATTR_SETNAME]) {
 		for (i = 0; i < inst->ip_set_max; i++) {
-			s = nfnl_set(inst, i);
+			s = ip_set(inst, i);
 			if (s != NULL && s->ref) {
 				ret = -IPSET_ERR_BUSY;
 				goto out;
@@ -970,7 +970,7 @@ ip_set_destroy(struct sock *ctnl, struct sk_buff *skb,
 		}
 		read_unlock_bh(&ip_set_ref_lock);
 		for (i = 0; i < inst->ip_set_max; i++) {
-			s = nfnl_set(inst, i);
+			s = ip_set(inst, i);
 			if (s != NULL)
 				ip_set_destroy_set(inst, i);
 		}
@@ -1020,7 +1020,7 @@ ip_set_flush(struct sock *ctnl, struct sk_buff *skb,
 
 	if (!attr[IPSET_ATTR_SETNAME]) {
 		for (i = 0; i < inst->ip_set_max; i++) {
-			s = nfnl_set(inst, i);
+			s = ip_set(inst, i);
 			if (s != NULL)
 				ip_set_flush_set(s);
 		}
@@ -1074,7 +1074,7 @@ ip_set_rename(struct sock *ctnl, struct sk_buff *skb,
 
 	name2 = nla_data(attr[IPSET_ATTR_SETNAME2]);
 	for (i = 0; i < inst->ip_set_max; i++) {
-		s = nfnl_set(inst, i);
+		s = ip_set(inst, i);
 		if (s != NULL && STREQ(s->name, name2)) {
 			ret = -IPSET_ERR_EXIST_SETNAME2;
 			goto out;
@@ -1134,8 +1134,8 @@ ip_set_swap(struct sock *ctnl, struct sk_buff *skb,
 
 	write_lock_bh(&ip_set_ref_lock);
 	swap(from->ref, to->ref);
-	nfnl_set(inst, from_id) = to;
-	nfnl_set(inst, to_id) = from;
+	ip_set(inst, from_id) = to;
+	ip_set(inst, to_id) = from;
 	write_unlock_bh(&ip_set_ref_lock);
 
 	return 0;
@@ -1157,7 +1157,7 @@ ip_set_dump_done(struct netlink_callback *cb)
 	struct ip_set_net *inst = (struct ip_set_net *)cb->args[IPSET_CB_NET];
 	if (cb->args[IPSET_CB_ARG0]) {
 		pr_debug("release set %s\n",
-			 nfnl_set(inst, cb->args[IPSET_CB_INDEX])->name);
+			 ip_set(inst, cb->args[IPSET_CB_INDEX])->name);
 		__ip_set_put_byindex(inst,
 			(ip_set_id_t) cb->args[IPSET_CB_INDEX]);
 	}
@@ -1254,7 +1254,7 @@ dump_last:
 		 dump_type, dump_flags, cb->args[IPSET_CB_INDEX]);
 	for (; cb->args[IPSET_CB_INDEX] < max; cb->args[IPSET_CB_INDEX]++) {
 		index = (ip_set_id_t) cb->args[IPSET_CB_INDEX];
-		set = nfnl_set(inst, index);
+		set = ip_set(inst, index);
 		if (set == NULL) {
 			if (dump_type == DUMP_ONE) {
 				ret = -ENOENT;
@@ -1332,7 +1332,7 @@ next_set:
 release_refcount:
 	/* If there was an error or set is done, release set */
 	if (ret || !cb->args[IPSET_CB_ARG0]) {
-		pr_debug("release set %s\n", nfnl_set(inst, index)->name);
+		pr_debug("release set %s\n", ip_set(inst, index)->name);
 		__ip_set_put_byindex(inst, index);
 		cb->args[IPSET_CB_ARG0] = 0;
 	}
@@ -1887,7 +1887,7 @@ ip_set_sockfn_get(struct sock *sk, int optval, void __user *user, int *len)
 		find_set_and_id(inst, req_get->set.name, &id);
 		req_get->set.index = id;
 		if (id != IPSET_INVALID_ID)
-			req_get->family = nfnl_set(inst, id)->family;
+			req_get->family = ip_set(inst, id)->family;
 		nfnl_unlock(NFNL_SUBSYS_IPSET);
 		goto copy;
 	}
@@ -1901,7 +1901,7 @@ ip_set_sockfn_get(struct sock *sk, int optval, void __user *user, int *len)
 			goto done;
 		}
 		nfnl_lock(NFNL_SUBSYS_IPSET);
-		set = nfnl_set(inst, req_get->set.index);
+		set = ip_set(inst, req_get->set.index);
 		strncpy(req_get->set.name, set ? set->name : "",
 			IPSET_MAXNAMELEN);
 		nfnl_unlock(NFNL_SUBSYS_IPSET);
@@ -1960,7 +1960,7 @@ ip_set_net_exit(struct net *net)
 	inst->is_deleted = 1; /* flag for ip_set_nfnl_put */
 
 	for (i = 0; i < inst->ip_set_max; i++) {
-		set = nfnl_set(inst, i);
+		set = ip_set(inst, i);
 		if (set != NULL)
 			ip_set_destroy_set(inst, i);
 	}
-- 
1.7.10.4



* [PATCH 05/38] netfilter: nfnetlink: add rcu_dereference_protected() helpers
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (3 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 04/38] netfilter: ip_set: rename nfnl_dereference()/nfnl_set() Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 06/38] netfilter: nf_tables: add nft_dereference() macro Pablo Neira Ayuso
                   ` (33 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Patrick McHardy <kaber@trash.net>

Add a lockdep_nfnl_is_held() function and a nfnl_dereference() macro for
RCU dereferences protected by a NFNL subsystem mutex.
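
A minimal usage sketch (illustration only: struct my_priv, struct my_obj
and the 'obj' pointer are made up, but the locking pattern is what the
new helper is meant for):

	#include <linux/netfilter/nfnetlink.h>
	#include <linux/rcupdate.h>
	#include <linux/slab.h>

	struct my_obj { int dummy; };			/* made-up object */
	struct my_priv { struct my_obj __rcu *obj; };	/* made-up holder */

	static void update_obj(struct my_priv *priv, struct my_obj *new)
	{
		struct my_obj *old;

		nfnl_lock(NFNL_SUBSYS_NFTABLES);
		/* with CONFIG_PROVE_LOCKING, lockdep checks the right mutex is held */
		old = nfnl_dereference(priv->obj, NFNL_SUBSYS_NFTABLES);
		rcu_assign_pointer(priv->obj, new);
		nfnl_unlock(NFNL_SUBSYS_NFTABLES);

		synchronize_rcu();
		kfree(old);
	}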

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/netfilter/nfnetlink.h |   21 +++++++++++++++++++++
 net/netfilter/nfnetlink.c           |    8 ++++++++
 2 files changed, 29 insertions(+)

diff --git a/include/linux/netfilter/nfnetlink.h b/include/linux/netfilter/nfnetlink.h
index 28c7436..e955d47 100644
--- a/include/linux/netfilter/nfnetlink.h
+++ b/include/linux/netfilter/nfnetlink.h
@@ -44,6 +44,27 @@ int nfnetlink_unicast(struct sk_buff *skb, struct net *net, u32 portid,
 
 void nfnl_lock(__u8 subsys_id);
 void nfnl_unlock(__u8 subsys_id);
+#ifdef CONFIG_PROVE_LOCKING
+int lockdep_nfnl_is_held(__u8 subsys_id);
+#else
+static inline int lockdep_nfnl_is_held(__u8 subsys_id)
+{
+	return 1;
+}
+#endif /* CONFIG_PROVE_LOCKING */
+
+/*
+ * nfnl_dereference - fetch RCU pointer when updates are prevented by subsys mutex
+ *
+ * @p: The pointer to read, prior to dereferencing
+ * @ss: The nfnetlink subsystem ID
+ *
+ * Return the value of the specified RCU-protected pointer, but omit
+ * both the smp_read_barrier_depends() and the ACCESS_ONCE(), because
+ * caller holds the NFNL subsystem mutex.
+ */
+#define nfnl_dereference(p, ss)					\
+	rcu_dereference_protected(p, lockdep_nfnl_is_held(ss))
 
 #define MODULE_ALIAS_NFNL_SUBSYS(subsys) \
 	MODULE_ALIAS("nfnetlink-subsys-" __stringify(subsys))
diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index 046aa13..e8138da 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -61,6 +61,14 @@ void nfnl_unlock(__u8 subsys_id)
 }
 EXPORT_SYMBOL_GPL(nfnl_unlock);
 
+#ifdef CONFIG_PROVE_LOCKING
+int lockdep_nfnl_is_held(u8 subsys_id)
+{
+	return lockdep_is_held(&table[subsys_id].mutex);
+}
+EXPORT_SYMBOL_GPL(lockdep_nfnl_is_held);
+#endif
+
 int nfnetlink_subsys_register(const struct nfnetlink_subsystem *n)
 {
 	nfnl_lock(n->subsys_id);
-- 
1.7.10.4


* [PATCH 06/38] netfilter: nf_tables: add nft_dereference() macro
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (4 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 05/38] netfilter: nfnetlink: add rcu_dereference_protected() helpers Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 07/38] netfilter: nf_tables: accept QUEUE/DROP verdict parameters Pablo Neira Ayuso
                   ` (32 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Patrick McHardy <kaber@trash.net>

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_tables.h |    4 ++++
 net/netfilter/nf_tables_api.c     |    3 +--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index e7e14ff..81abd61 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -3,6 +3,7 @@
 
 #include <linux/list.h>
 #include <linux/netfilter.h>
+#include <linux/netfilter/nfnetlink.h>
 #include <linux/netfilter/x_tables.h>
 #include <linux/netfilter/nf_tables.h>
 #include <net/netlink.h>
@@ -521,6 +522,9 @@ void nft_unregister_chain_type(const struct nf_chain_type *);
 int nft_register_expr(struct nft_expr_type *);
 void nft_unregister_expr(struct nft_expr_type *);
 
+#define nft_dereference(p)					\
+	nfnl_dereference(p, NFNL_SUBSYS_NFTABLES)
+
 #define MODULE_ALIAS_NFT_FAMILY(family)	\
 	MODULE_ALIAS("nft-afinfo-" __stringify(family))
 
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index adce01e..4b7e14d 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -794,9 +794,8 @@ nf_tables_counters(struct nft_base_chain *chain, const struct nlattr *attr)
 	stats->pkts = be64_to_cpu(nla_get_be64(tb[NFTA_COUNTER_PACKETS]));
 
 	if (chain->stats) {
-		/* nfnl_lock is held, add some nfnl function for this, later */
 		struct nft_stats __percpu *oldstats =
-			rcu_dereference_protected(chain->stats, 1);
+				nft_dereference(chain->stats);
 
 		rcu_assign_pointer(chain->stats, newstats);
 		synchronize_rcu();
-- 
1.7.10.4


* [PATCH 07/38] netfilter: nf_tables: accept QUEUE/DROP verdict parameters
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (5 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 06/38] netfilter: nf_tables: add nft_dereference() macro Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 08/38] netfilter: nfnetlink_log: remove unused code Pablo Neira Ayuso
                   ` (31 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Patrick McHardy <kaber@trash.net>

Allow userspace to specify the queue number or the errno code for QUEUE
and DROP verdicts.
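
For illustration, this is roughly what userspace can now put into
NFTA_VERDICT_CODE (a sketch under the assumption that the standard
helpers from <linux/netfilter.h> are used; the queue number and errno
below are just examples):

	#include <errno.h>
	#include <stdint.h>
	#include <linux/netfilter.h>

	/* base verdict in NF_VERDICT_MASK, queue number / errno in the upper bits */
	static const uint32_t queue_verdict = NF_QUEUE_NR(3);            /* NF_QUEUE, queue 3 */
	static const uint32_t drop_verdict  = NF_DROP_ERR(-EHOSTUNREACH); /* NF_DROP + errno  */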

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_tables_api.c |   15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 4b7e14d..0b56340 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -3174,9 +3174,16 @@ static int nft_verdict_init(const struct nft_ctx *ctx, struct nft_data *data,
 	data->verdict = ntohl(nla_get_be32(tb[NFTA_VERDICT_CODE]));
 
 	switch (data->verdict) {
-	case NF_ACCEPT:
-	case NF_DROP:
-	case NF_QUEUE:
+	default:
+		switch (data->verdict & NF_VERDICT_MASK) {
+		case NF_ACCEPT:
+		case NF_DROP:
+		case NF_QUEUE:
+			break;
+		default:
+			return -EINVAL;
+		}
+		/* fall through */
 	case NFT_CONTINUE:
 	case NFT_BREAK:
 	case NFT_RETURN:
@@ -3197,8 +3204,6 @@ static int nft_verdict_init(const struct nft_ctx *ctx, struct nft_data *data,
 		data->chain = chain;
 		desc->len = sizeof(data);
 		break;
-	default:
-		return -EINVAL;
 	}
 
 	desc->type = NFT_DATA_VERDICT;
-- 
1.7.10.4



* [PATCH 08/38] netfilter: nfnetlink_log: remove unused code
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (6 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 07/38] netfilter: nf_tables: accept QUEUE/DROP verdict parameters Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 09/38] netfilter: nf_tables: add optional user data area to rules Pablo Neira Ayuso
                   ` (30 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Florian Westphal <fw@strlen.de>

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nfnetlink_log.c |    8 --------
 1 file changed, 8 deletions(-)

diff --git a/net/netfilter/nfnetlink_log.c b/net/netfilter/nfnetlink_log.c
index a155d19..d292c8d 100644
--- a/net/netfilter/nfnetlink_log.c
+++ b/net/netfilter/nfnetlink_log.c
@@ -28,8 +28,6 @@
 #include <linux/proc_fs.h>
 #include <linux/security.h>
 #include <linux/list.h>
-#include <linux/jhash.h>
-#include <linux/random.h>
 #include <linux/slab.h>
 #include <net/sock.h>
 #include <net/netfilter/nf_log.h>
@@ -75,7 +73,6 @@ struct nfulnl_instance {
 };
 
 #define INSTANCE_BUCKETS	16
-static unsigned int hash_init;
 
 static int nfnl_log_net_id __read_mostly;
 
@@ -1067,11 +1064,6 @@ static int __init nfnetlink_log_init(void)
 {
 	int status = -ENOMEM;
 
-	/* it's not really all that important to have a random value, so
-	 * we can do this from the init function, even if there hasn't
-	 * been that much entropy yet */
-	get_random_bytes(&hash_init, sizeof(hash_init));
-
 	netlink_register_notifier(&nfulnl_rtnl_notifier);
 	status = nfnetlink_subsys_register(&nfulnl_subsys);
 	if (status < 0) {
-- 
1.7.10.4



* [PATCH 09/38] netfilter: nf_tables: add optional user data area to rules
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (7 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 08/38] netfilter: nfnetlink_log: remove unused code Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 10/38] netfilter: ipset: Follow manual page behavior for SET target on list:set Pablo Neira Ayuso
                   ` (29 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

This allows us to store user comment strings, but it could also be
used to store any kind of information that the user application needs
to link to the rule.

Scratch 8 bits for the new ulen field that indicates the length of the
user data area: 4 bits from the handle (so it's 42 bits long; according
to Patrick, that would last 139 years with 1000 new rules per second)
and 4 bits from dlen (so the expression data area is 4K, which still
seems sufficient even considering the compatibility layer).
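
The bit budget still adds up to 64 (42 + 2 + 12 + 8). As an
illustration of the userspace side, a minimal sketch assuming libmnl,
where 'nlh' is an NFT_MSG_NEWRULE message under construction:

	#include <string.h>
	#include <libmnl/libmnl.h>
	#include <linux/netfilter/nf_tables.h>

	/* attach a comment via the new attribute; it must fit NFT_USERDATA_MAXLEN */
	static void rule_add_comment(struct nlmsghdr *nlh, const char *comment)
	{
		mnl_attr_put(nlh, NFTA_RULE_USERDATA, strlen(comment) + 1, comment);
	}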

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Patrick McHardy <kaber@trash.net>
---
 include/net/netfilter/nf_tables.h        |   11 +++++++++--
 include/uapi/linux/netfilter/nf_tables.h |    5 ++++-
 net/netfilter/nf_tables_api.c            |   17 +++++++++++++++--
 3 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 81abd61..5af56da 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -326,13 +326,15 @@ static inline void *nft_expr_priv(const struct nft_expr *expr)
  *	@handle: rule handle
  *	@genmask: generation mask
  *	@dlen: length of expression data
+ *	@ulen: length of user data (used for comments)
  *	@data: expression data
  */
 struct nft_rule {
 	struct list_head		list;
-	u64				handle:46,
+	u64				handle:42,
 					genmask:2,
-					dlen:16;
+					dlen:12,
+					ulen:8;
 	unsigned char			data[]
 		__attribute__((aligned(__alignof__(struct nft_expr))));
 };
@@ -371,6 +373,11 @@ static inline struct nft_expr *nft_expr_last(const struct nft_rule *rule)
 	return (struct nft_expr *)&rule->data[rule->dlen];
 }
 
+static inline void *nft_userdata(const struct nft_rule *rule)
+{
+	return (void *)&rule->data[rule->dlen];
+}
+
 /*
  * The last pointer isn't really necessary, but the compiler isn't able to
  * determine that the result of nft_expr_last() is always the same since it
diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index c84c452..c88ccbf 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -1,7 +1,8 @@
 #ifndef _LINUX_NF_TABLES_H
 #define _LINUX_NF_TABLES_H
 
-#define NFT_CHAIN_MAXNAMELEN 32
+#define NFT_CHAIN_MAXNAMELEN	32
+#define NFT_USERDATA_MAXLEN	256
 
 enum nft_registers {
 	NFT_REG_VERDICT,
@@ -156,6 +157,7 @@ enum nft_chain_attributes {
  * @NFTA_RULE_EXPRESSIONS: list of expressions (NLA_NESTED: nft_expr_attributes)
  * @NFTA_RULE_COMPAT: compatibility specifications of the rule (NLA_NESTED: nft_rule_compat_attributes)
  * @NFTA_RULE_POSITION: numeric handle of the previous rule (NLA_U64)
+ * @NFTA_RULE_USERDATA: user data (NLA_BINARY, NFT_USERDATA_MAXLEN)
  */
 enum nft_rule_attributes {
 	NFTA_RULE_UNSPEC,
@@ -165,6 +167,7 @@ enum nft_rule_attributes {
 	NFTA_RULE_EXPRESSIONS,
 	NFTA_RULE_COMPAT,
 	NFTA_RULE_POSITION,
+	NFTA_RULE_USERDATA,
 	__NFTA_RULE_MAX
 };
 #define NFTA_RULE_MAX		(__NFTA_RULE_MAX - 1)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 0b56340..f25d011 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -1295,6 +1295,8 @@ static const struct nla_policy nft_rule_policy[NFTA_RULE_MAX + 1] = {
 	[NFTA_RULE_EXPRESSIONS]	= { .type = NLA_NESTED },
 	[NFTA_RULE_COMPAT]	= { .type = NLA_NESTED },
 	[NFTA_RULE_POSITION]	= { .type = NLA_U64 },
+	[NFTA_RULE_USERDATA]	= { .type = NLA_BINARY,
+				    .len = NFT_USERDATA_MAXLEN },
 };
 
 static int nf_tables_fill_rule_info(struct sk_buff *skb, u32 portid, u32 seq,
@@ -1347,6 +1349,10 @@ static int nf_tables_fill_rule_info(struct sk_buff *skb, u32 portid, u32 seq,
 	}
 	nla_nest_end(skb, list);
 
+	if (rule->ulen &&
+	    nla_put(skb, NFTA_RULE_USERDATA, rule->ulen, nft_userdata(rule)))
+		goto nla_put_failure;
+
 	return nlmsg_end(skb, nlh);
 
 nla_put_failure:
@@ -1583,7 +1589,7 @@ static int nf_tables_newrule(struct sock *nlsk, struct sk_buff *skb,
 	struct nft_expr *expr;
 	struct nft_ctx ctx;
 	struct nlattr *tmp;
-	unsigned int size, i, n;
+	unsigned int size, i, n, ulen = 0;
 	int err, rem;
 	bool create;
 	u64 handle, pos_handle;
@@ -1649,8 +1655,11 @@ static int nf_tables_newrule(struct sock *nlsk, struct sk_buff *skb,
 		}
 	}
 
+	if (nla[NFTA_RULE_USERDATA])
+		ulen = nla_len(nla[NFTA_RULE_USERDATA]);
+
 	err = -ENOMEM;
-	rule = kzalloc(sizeof(*rule) + size, GFP_KERNEL);
+	rule = kzalloc(sizeof(*rule) + size + ulen, GFP_KERNEL);
 	if (rule == NULL)
 		goto err1;
 
@@ -1658,6 +1667,10 @@ static int nf_tables_newrule(struct sock *nlsk, struct sk_buff *skb,
 
 	rule->handle = handle;
 	rule->dlen   = size;
+	rule->ulen   = ulen;
+
+	if (ulen)
+		nla_memcpy(nft_userdata(rule), nla[NFTA_RULE_USERDATA], ulen);
 
 	expr = nft_expr_first(rule);
 	for (i = 0; i < n; i++) {
-- 
1.7.10.4



* [PATCH 10/38] netfilter: ipset: Follow manual page behavior for SET target on list:set
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (8 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 09/38] netfilter: nf_tables: add optional user data area to rules Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 11/38] netfilter: ipset: Add hash: fix coccinelle warnings Pablo Neira Ayuso
                   ` (28 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Sergey Popovich <popovich_sergei@mail.ru>

ipset(8) for list:set says:
  The match will try to find a matching entry in the sets and the
  target will try to add an entry to the first set to which it can
  be added.

However, the real behavior differs a bit from what is described.
Consider this example:

 # ipset create test-1-v4 hash:ip family inet
 # ipset create test-1-v6 hash:ip family inet6
 # ipset create test-1 list:set
 # ipset add test-1 test-1-v4
 # ipset add test-1 test-1-v6

 # iptables  -A INPUT -p tcp --destination-port 25 -j SET --add-set test-1 src
 # ip6tables -A INPUT -p tcp --destination-port 25 -j SET --add-set test-1 src

Then, when an iptables/ip6tables rule matches a packet, the SET target
tries to add the src from the packet to the list:set test-1, where the
first entry is test-1-v4 and the second one is test-1-v6.

For IPv4, since test-1-v4 is the first entry in test-1, src is added
to test-1-v4 correctly, but for IPv6 src is not added!

Placing test-1-v6 as the first element of the list:set makes the
behavior correct for IPv6, but breaks it for IPv4.

This is due to the result returned from ip_set_add() and ip_set_del()
in net/netfilter/ipset/ip_set_core.c when a set in the list:set
requires more parameters than given or the address families do not
match (which is the case here).

It seems wrong to return 0 from ip_set_add() and ip_set_del() in this
case, as 0 should be returned only when an element is successfully
added to/deleted from the set, in contrast to ip_set_test(), which
returns 0 when no entry exists and >0 when an entry is found in the
set.
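
To illustrate why the return value matters (a sketch only, not the
actual list:set code; 'e', 'map', 'ret', 'skb', 'par' and 'opt' stand
in for the real iteration state in
net/netfilter/ipset/ip_set_list_set.c):

	list_for_each_entry(e, &map->members, list) {
		ret = ip_set_add(e->id, skb, par, opt);
		if (ret == 0)
			return ret;	/* added to this member set, done */
		/* on error, e.g. -IPSET_ERR_TYPE_MISMATCH, try the next member */
	}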

Signed-off-by: Sergey Popovich <popovich_sergei@mail.ru>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 net/netfilter/ipset/ip_set_core.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index 728a2cf..c6c97b8 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -510,7 +510,7 @@ ip_set_add(ip_set_id_t index, const struct sk_buff *skb,
 
 	if (opt->dim < set->type->dimension ||
 	    !(opt->family == set->family || set->family == NFPROTO_UNSPEC))
-		return 0;
+		return -IPSET_ERR_TYPE_MISMATCH;
 
 	write_lock_bh(&set->lock);
 	ret = set->variant->kadt(set, skb, par, IPSET_ADD, opt);
@@ -533,7 +533,7 @@ ip_set_del(ip_set_id_t index, const struct sk_buff *skb,
 
 	if (opt->dim < set->type->dimension ||
 	    !(opt->family == set->family || set->family == NFPROTO_UNSPEC))
-		return 0;
+		return -IPSET_ERR_TYPE_MISMATCH;
 
 	write_lock_bh(&set->lock);
 	ret = set->variant->kadt(set, skb, par, IPSET_DEL, opt);
-- 
1.7.10.4



* [PATCH 11/38] netfilter: ipset: Add hash: fix coccinelle warnings
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (9 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 10/38] netfilter: ipset: Follow manual page behavior for SET target on list:set Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 12/38] netfilter: ipset: add hash:ip,mark data type to ipset Pablo Neira Ayuso
                   ` (27 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Fengguang Wu <fengguang.wu@intel.com>

net/netfilter/ipset/ip_set_hash_netnet.c:115:8-9: WARNING: return of 0/1 in function 'hash_netnet4_data_list' with return type bool
/c/kernel-tests/src/cocci/net/netfilter/ipset/ip_set_hash_netnet.c:338:8-9: WARNING: return of 0/1 in function 'hash_netnet6_data_list' with return type bool

Return statements in functions returning bool should use
true/false instead of 1/0.
Generated by: coccinelle/misc/boolreturn.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 net/netfilter/ipset/ip_set_hash_netnet.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_hash_netnet.c b/net/netfilter/ipset/ip_set_hash_netnet.c
index 6226803..4e7261d 100644
--- a/net/netfilter/ipset/ip_set_hash_netnet.c
+++ b/net/netfilter/ipset/ip_set_hash_netnet.c
@@ -112,10 +112,10 @@ hash_netnet4_data_list(struct sk_buff *skb,
 	    (flags &&
 	     nla_put_net32(skb, IPSET_ATTR_CADT_FLAGS, htonl(flags))))
 		goto nla_put_failure;
-	return 0;
+	return false;
 
 nla_put_failure:
-	return 1;
+	return true;
 }
 
 static inline void
@@ -334,10 +334,10 @@ hash_netnet6_data_list(struct sk_buff *skb,
 	    (flags &&
 	     nla_put_net32(skb, IPSET_ATTR_CADT_FLAGS, htonl(flags))))
 		goto nla_put_failure;
-	return 0;
+	return false;
 
 nla_put_failure:
-	return 1;
+	return true;
 }
 
 static inline void
-- 
1.7.10.4



* [PATCH 12/38] netfilter: ipset: add hash:ip,mark data type to ipset
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (10 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 11/38] netfilter: ipset: Add hash: fix coccinelle warnings Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 13/38] netfilter: ipset: add markmask for hash:ip,mark data type Pablo Neira Ayuso
                   ` (26 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Vytas Dauksa <vytas.dauksa@smoothwall.net>

Introduce packet mark support with new ip,mark hash set. This includes
userspace and kernelspace code, hash:ip,mark set tests and man page
updates.

The intended use of the ip,mark set is similar to the ip,port type,
but for protocols which don't use a predictable port number. Instead
of a port number, it matches a firewall mark determined by a layer 7
filtering program like opendpi.

As well as allowing or blocking traffic, it will also be used for
accounting the packets and bytes sent for each protocol.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 include/linux/netfilter/ipset/ip_set.h      |   10 +-
 include/uapi/linux/netfilter/ipset/ip_set.h |    1 +
 net/netfilter/ipset/Kconfig                 |    9 +
 net/netfilter/ipset/Makefile                |    1 +
 net/netfilter/ipset/ip_set_hash_ipmark.c    |  312 +++++++++++++++++++++++++++
 5 files changed, 329 insertions(+), 4 deletions(-)
 create mode 100644 net/netfilter/ipset/ip_set_hash_ipmark.c

diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
index 0c7d01e..4ac00d4 100644
--- a/include/linux/netfilter/ipset/ip_set.h
+++ b/include/linux/netfilter/ipset/ip_set.h
@@ -39,11 +39,13 @@ enum ip_set_feature {
 	IPSET_TYPE_NAME = (1 << IPSET_TYPE_NAME_FLAG),
 	IPSET_TYPE_IFACE_FLAG = 5,
 	IPSET_TYPE_IFACE = (1 << IPSET_TYPE_IFACE_FLAG),
-	IPSET_TYPE_NOMATCH_FLAG = 6,
+	IPSET_TYPE_MARK_FLAG = 6,
+	IPSET_TYPE_MARK = (1 << IPSET_TYPE_MARK_FLAG),
+	IPSET_TYPE_NOMATCH_FLAG = 7,
 	IPSET_TYPE_NOMATCH = (1 << IPSET_TYPE_NOMATCH_FLAG),
 	/* Strictly speaking not a feature, but a flag for dumping:
 	 * this settype must be dumped last */
-	IPSET_DUMP_LAST_FLAG = 7,
+	IPSET_DUMP_LAST_FLAG = 8,
 	IPSET_DUMP_LAST = (1 << IPSET_DUMP_LAST_FLAG),
 };
 
@@ -171,8 +173,6 @@ struct ip_set_type {
 	char name[IPSET_MAXNAMELEN];
 	/* Protocol version */
 	u8 protocol;
-	/* Set features to control swapping */
-	u8 features;
 	/* Set type dimension */
 	u8 dimension;
 	/*
@@ -182,6 +182,8 @@ struct ip_set_type {
 	u8 family;
 	/* Type revisions */
 	u8 revision_min, revision_max;
+	/* Set features to control swapping */
+	u16 features;
 
 	/* Create set */
 	int (*create)(struct net *net, struct ip_set *set,
diff --git a/include/uapi/linux/netfilter/ipset/ip_set.h b/include/uapi/linux/netfilter/ipset/ip_set.h
index 25d3b2f..5368f82 100644
--- a/include/uapi/linux/netfilter/ipset/ip_set.h
+++ b/include/uapi/linux/netfilter/ipset/ip_set.h
@@ -82,6 +82,7 @@ enum {
 	IPSET_ATTR_PROTO,	/* 7 */
 	IPSET_ATTR_CADT_FLAGS,	/* 8 */
 	IPSET_ATTR_CADT_LINENO = IPSET_ATTR_LINENO,	/* 9 */
+	IPSET_ATTR_MARK,	/* 10 */
 	/* Reserve empty slots */
 	IPSET_ATTR_CADT_MAX = 16,
 	/* Create-only specific attributes */
diff --git a/net/netfilter/ipset/Kconfig b/net/netfilter/ipset/Kconfig
index 44cd4f5..2f7f5c3 100644
--- a/net/netfilter/ipset/Kconfig
+++ b/net/netfilter/ipset/Kconfig
@@ -61,6 +61,15 @@ config IP_SET_HASH_IP
 
 	  To compile it as a module, choose M here.  If unsure, say N.
 
+config IP_SET_HASH_IPMARK
+	tristate "hash:ip,mark set support"
+	depends on IP_SET
+	help
+	  This option adds the hash:ip,mark set type support, by which one
+	  can store IPv4/IPv6 address and mark pairs.
+
+	  To compile it as a module, choose M here.  If unsure, say N.
+
 config IP_SET_HASH_IPPORT
 	tristate "hash:ip,port set support"
 	depends on IP_SET
diff --git a/net/netfilter/ipset/Makefile b/net/netfilter/ipset/Makefile
index 44b2d38..231f101 100644
--- a/net/netfilter/ipset/Makefile
+++ b/net/netfilter/ipset/Makefile
@@ -14,6 +14,7 @@ obj-$(CONFIG_IP_SET_BITMAP_PORT) += ip_set_bitmap_port.o
 
 # hash types
 obj-$(CONFIG_IP_SET_HASH_IP) += ip_set_hash_ip.o
+obj-$(CONFIG_IP_SET_HASH_IPMARK) += ip_set_hash_ipmark.o
 obj-$(CONFIG_IP_SET_HASH_IPPORT) += ip_set_hash_ipport.o
 obj-$(CONFIG_IP_SET_HASH_IPPORTIP) += ip_set_hash_ipportip.o
 obj-$(CONFIG_IP_SET_HASH_IPPORTNET) += ip_set_hash_ipportnet.o
diff --git a/net/netfilter/ipset/ip_set_hash_ipmark.c b/net/netfilter/ipset/ip_set_hash_ipmark.c
new file mode 100644
index 0000000..e56c0d9
--- /dev/null
+++ b/net/netfilter/ipset/ip_set_hash_ipmark.c
@@ -0,0 +1,312 @@
+/* Copyright (C) 2003-2013 Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
+ * Copyright (C) 2013 Smoothwall Ltd. <vytas.dauksa@smoothwall.net>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+/* Kernel module implementing an IP set type: the hash:ip,mark type */
+
+#include <linux/jhash.h>
+#include <linux/module.h>
+#include <linux/ip.h>
+#include <linux/skbuff.h>
+#include <linux/errno.h>
+#include <linux/random.h>
+#include <net/ip.h>
+#include <net/ipv6.h>
+#include <net/netlink.h>
+#include <net/tcp.h>
+
+#include <linux/netfilter.h>
+#include <linux/netfilter/ipset/pfxlen.h>
+#include <linux/netfilter/ipset/ip_set.h>
+#include <linux/netfilter/ipset/ip_set_hash.h>
+
+#define IPSET_TYPE_REV_MIN	0
+#define IPSET_TYPE_REV_MAX	0
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Vytas Dauksa <vytas.dauksa@smoothwall.net>");
+IP_SET_MODULE_DESC("hash:ip,mark", IPSET_TYPE_REV_MIN, IPSET_TYPE_REV_MAX);
+MODULE_ALIAS("ip_set_hash:ip,mark");
+
+/* Type specific function prefix */
+#define HTYPE		hash_ipmark
+
+/* IPv4 variant */
+
+/* Member elements */
+struct hash_ipmark4_elem {
+	__be32 ip;
+	__u32 mark;
+};
+
+/* Common functions */
+
+static inline bool
+hash_ipmark4_data_equal(const struct hash_ipmark4_elem *ip1,
+			const struct hash_ipmark4_elem *ip2,
+			u32 *multi)
+{
+	return ip1->ip == ip2->ip &&
+	       ip1->mark == ip2->mark;
+}
+
+static bool
+hash_ipmark4_data_list(struct sk_buff *skb,
+		       const struct hash_ipmark4_elem *data)
+{
+	if (nla_put_ipaddr4(skb, IPSET_ATTR_IP, data->ip) ||
+	    nla_put_net32(skb, IPSET_ATTR_MARK, htonl(data->mark)))
+		goto nla_put_failure;
+	return 0;
+
+nla_put_failure:
+	return 1;
+}
+
+static inline void
+hash_ipmark4_data_next(struct hash_ipmark4_elem *next,
+		       const struct hash_ipmark4_elem *d)
+{
+	next->ip = d->ip;
+}
+
+#define MTYPE           hash_ipmark4
+#define PF              4
+#define HOST_MASK       32
+#define HKEY_DATALEN	sizeof(struct hash_ipmark4_elem)
+#include "ip_set_hash_gen.h"
+
+static int
+hash_ipmark4_kadt(struct ip_set *set, const struct sk_buff *skb,
+		  const struct xt_action_param *par,
+		  enum ipset_adt adt, struct ip_set_adt_opt *opt)
+{
+	ipset_adtfn adtfn = set->variant->adt[adt];
+	struct hash_ipmark4_elem e = { };
+	struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
+
+	e.mark = skb->mark;
+
+	ip4addrptr(skb, opt->flags & IPSET_DIM_ONE_SRC, &e.ip);
+	return adtfn(set, &e, &ext, &opt->ext, opt->cmdflags);
+}
+
+static int
+hash_ipmark4_uadt(struct ip_set *set, struct nlattr *tb[],
+		  enum ipset_adt adt, u32 *lineno, u32 flags, bool retried)
+{
+	const struct hash_ipmark *h = set->data;
+	ipset_adtfn adtfn = set->variant->adt[adt];
+	struct hash_ipmark4_elem e = { };
+	struct ip_set_ext ext = IP_SET_INIT_UEXT(set);
+	u32 ip, ip_to = 0;
+	int ret;
+
+	if (unlikely(!tb[IPSET_ATTR_IP] ||
+		     !ip_set_attr_netorder(tb, IPSET_ATTR_MARK) ||
+		     !ip_set_optattr_netorder(tb, IPSET_ATTR_TIMEOUT) ||
+		     !ip_set_optattr_netorder(tb, IPSET_ATTR_PACKETS) ||
+		     !ip_set_optattr_netorder(tb, IPSET_ATTR_BYTES)))
+		return -IPSET_ERR_PROTOCOL;
+
+	if (tb[IPSET_ATTR_LINENO])
+		*lineno = nla_get_u32(tb[IPSET_ATTR_LINENO]);
+
+	ret = ip_set_get_ipaddr4(tb[IPSET_ATTR_IP], &e.ip) ||
+	      ip_set_get_extensions(set, tb, &ext);
+	if (ret)
+		return ret;
+
+	e.mark = ntohl(nla_get_u32(tb[IPSET_ATTR_MARK]));
+
+	if (adt == IPSET_TEST ||
+	    !(tb[IPSET_ATTR_IP_TO] || tb[IPSET_ATTR_CIDR])) {
+		ret = adtfn(set, &e, &ext, &ext, flags);
+		return ip_set_eexist(ret, flags) ? 0 : ret;
+	}
+
+	ip_to = ip = ntohl(e.ip);
+	if (tb[IPSET_ATTR_IP_TO]) {
+		ret = ip_set_get_hostipaddr4(tb[IPSET_ATTR_IP_TO], &ip_to);
+		if (ret)
+			return ret;
+		if (ip > ip_to)
+			swap(ip, ip_to);
+	} else if (tb[IPSET_ATTR_CIDR]) {
+		u8 cidr = nla_get_u8(tb[IPSET_ATTR_CIDR]);
+
+		if (!cidr || cidr > 32)
+			return -IPSET_ERR_INVALID_CIDR;
+		ip_set_mask_from_to(ip, ip_to, cidr);
+	}
+
+	if (retried)
+		ip = ntohl(h->next.ip);
+	for (; !before(ip_to, ip); ip++) {
+		e.ip = htonl(ip);
+		ret = adtfn(set, &e, &ext, &ext, flags);
+
+		if (ret && !ip_set_eexist(ret, flags))
+			return ret;
+		else
+			ret = 0;
+	}
+	return ret;
+}
+
+/* IPv6 variant */
+
+struct hash_ipmark6_elem {
+	union nf_inet_addr ip;
+	__u32 mark;
+};
+
+/* Common functions */
+
+static inline bool
+hash_ipmark6_data_equal(const struct hash_ipmark6_elem *ip1,
+			const struct hash_ipmark6_elem *ip2,
+			u32 *multi)
+{
+	return ipv6_addr_equal(&ip1->ip.in6, &ip2->ip.in6) &&
+	       ip1->mark == ip2->mark;
+}
+
+static bool
+hash_ipmark6_data_list(struct sk_buff *skb,
+		       const struct hash_ipmark6_elem *data)
+{
+	if (nla_put_ipaddr6(skb, IPSET_ATTR_IP, &data->ip.in6) ||
+	    nla_put_net32(skb, IPSET_ATTR_MARK, htonl(data->mark)))
+		goto nla_put_failure;
+	return 0;
+
+nla_put_failure:
+	return 1;
+}
+
+static inline void
+hash_ipmark6_data_next(struct hash_ipmark4_elem *next,
+		       const struct hash_ipmark6_elem *d)
+{
+}
+
+#undef MTYPE
+#undef PF
+#undef HOST_MASK
+#undef HKEY_DATALEN
+
+#define MTYPE		hash_ipmark6
+#define PF		6
+#define HOST_MASK	128
+#define HKEY_DATALEN	sizeof(struct hash_ipmark6_elem)
+#define	IP_SET_EMIT_CREATE
+#include "ip_set_hash_gen.h"
+
+
+static int
+hash_ipmark6_kadt(struct ip_set *set, const struct sk_buff *skb,
+		  const struct xt_action_param *par,
+		  enum ipset_adt adt, struct ip_set_adt_opt *opt)
+{
+	ipset_adtfn adtfn = set->variant->adt[adt];
+	struct hash_ipmark6_elem e = { };
+	struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
+
+	e.mark = skb->mark;
+
+	ip6addrptr(skb, opt->flags & IPSET_DIM_ONE_SRC, &e.ip.in6);
+	return adtfn(set, &e, &ext, &opt->ext, opt->cmdflags);
+}
+
+static int
+hash_ipmark6_uadt(struct ip_set *set, struct nlattr *tb[],
+		  enum ipset_adt adt, u32 *lineno, u32 flags, bool retried)
+{
+	ipset_adtfn adtfn = set->variant->adt[adt];
+	struct hash_ipmark6_elem e = { };
+	struct ip_set_ext ext = IP_SET_INIT_UEXT(set);
+	int ret;
+
+	if (unlikely(!tb[IPSET_ATTR_IP] ||
+		     !ip_set_attr_netorder(tb, IPSET_ATTR_MARK) ||
+		     !ip_set_optattr_netorder(tb, IPSET_ATTR_TIMEOUT) ||
+		     !ip_set_optattr_netorder(tb, IPSET_ATTR_PACKETS) ||
+		     !ip_set_optattr_netorder(tb, IPSET_ATTR_BYTES) ||
+		     tb[IPSET_ATTR_IP_TO] ||
+		     tb[IPSET_ATTR_CIDR]))
+		return -IPSET_ERR_PROTOCOL;
+
+	if (tb[IPSET_ATTR_LINENO])
+		*lineno = nla_get_u32(tb[IPSET_ATTR_LINENO]);
+
+	ret = ip_set_get_ipaddr6(tb[IPSET_ATTR_IP], &e.ip) ||
+	      ip_set_get_extensions(set, tb, &ext);
+	if (ret)
+		return ret;
+
+	e.mark = ntohl(nla_get_u32(tb[IPSET_ATTR_MARK]));
+
+	if (adt == IPSET_TEST) {
+		ret = adtfn(set, &e, &ext, &ext, flags);
+		return ip_set_eexist(ret, flags) ? 0 : ret;
+	}
+
+	ret = adtfn(set, &e, &ext, &ext, flags);
+	if (ret && !ip_set_eexist(ret, flags))
+		return ret;
+	else
+		ret = 0;
+
+	return ret;
+}
+
+static struct ip_set_type hash_ipmark_type __read_mostly = {
+	.name		= "hash:ip,mark",
+	.protocol	= IPSET_PROTOCOL,
+	.features	= IPSET_TYPE_IP | IPSET_TYPE_MARK,
+	.dimension	= IPSET_DIM_TWO,
+	.family		= NFPROTO_UNSPEC,
+	.revision_min	= IPSET_TYPE_REV_MIN,
+	.revision_max	= IPSET_TYPE_REV_MAX,
+	.create		= hash_ipmark_create,
+	.create_policy	= {
+		[IPSET_ATTR_HASHSIZE]	= { .type = NLA_U32 },
+		[IPSET_ATTR_MAXELEM]	= { .type = NLA_U32 },
+		[IPSET_ATTR_PROBES]	= { .type = NLA_U8 },
+		[IPSET_ATTR_RESIZE]	= { .type = NLA_U8  },
+		[IPSET_ATTR_TIMEOUT]	= { .type = NLA_U32 },
+		[IPSET_ATTR_CADT_FLAGS]	= { .type = NLA_U32 },
+	},
+	.adt_policy	= {
+		[IPSET_ATTR_IP]		= { .type = NLA_NESTED },
+		[IPSET_ATTR_IP_TO]	= { .type = NLA_NESTED },
+		[IPSET_ATTR_MARK]	= { .type = NLA_U32 },
+		[IPSET_ATTR_CIDR]	= { .type = NLA_U8 },
+		[IPSET_ATTR_TIMEOUT]	= { .type = NLA_U32 },
+		[IPSET_ATTR_LINENO]	= { .type = NLA_U32 },
+		[IPSET_ATTR_BYTES]	= { .type = NLA_U64 },
+		[IPSET_ATTR_PACKETS]	= { .type = NLA_U64 },
+		[IPSET_ATTR_COMMENT]	= { .type = NLA_NUL_STRING },
+	},
+	.me		= THIS_MODULE,
+};
+
+static int __init
+hash_ipmark_init(void)
+{
+	return ip_set_type_register(&hash_ipmark_type);
+}
+
+static void __exit
+hash_ipmark_fini(void)
+{
+	ip_set_type_unregister(&hash_ipmark_type);
+}
+
+module_init(hash_ipmark_init);
+module_exit(hash_ipmark_fini);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 13/38] netfilter: ipset: add markmask for hash:ip,mark data type
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (11 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 12/38] netfilter: ipset: add hash:ip,mark data type to ipset Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 14/38] netfilter: ipset: Prepare the kernel for create option flags when no extension is needed Pablo Neira Ayuso
                   ` (25 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Vytas Dauksa <vytas.dauksa@smoothwall.net>

Introduce a packet mark mask for the hash:ip,mark data type. This allows
setting a mark bit filter for the ip set.
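
For illustration, a minimal sketch (not part of this patch) of the effect:
only the bits selected by the mask are stored and compared, so marks that
differ only outside the mask map to the same set element.  The helper name
is made up.

        /* Sketch only: how a configured markmask narrows the packet mark. */
        static u32 ipmark_effective_mark(u32 skb_mark, u32 markmask)
        {
                /* e.g. 0x0104 & 0xff00 == 0x0100 and 0x01ff & 0xff00 == 0x0100 */
                return skb_mark & markmask;
        }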

Change-Id: Id8dd9ca7e64477c4f7b022a1d9c1a5b187f1c96e

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 include/uapi/linux/netfilter/ipset/ip_set.h |    2 ++
 net/netfilter/ipset/ip_set_hash_gen.h       |   31 +++++++++++++++++++++++++++
 net/netfilter/ipset/ip_set_hash_ipmark.c    |    9 ++++++++
 3 files changed, 42 insertions(+)

diff --git a/include/uapi/linux/netfilter/ipset/ip_set.h b/include/uapi/linux/netfilter/ipset/ip_set.h
index 5368f82..f636f28 100644
--- a/include/uapi/linux/netfilter/ipset/ip_set.h
+++ b/include/uapi/linux/netfilter/ipset/ip_set.h
@@ -89,6 +89,7 @@ enum {
 	IPSET_ATTR_GC,
 	IPSET_ATTR_HASHSIZE,
 	IPSET_ATTR_MAXELEM,
+	IPSET_ATTR_MARKMASK,
 	IPSET_ATTR_NETMASK,
 	IPSET_ATTR_PROBES,
 	IPSET_ATTR_RESIZE,
@@ -138,6 +139,7 @@ enum ipset_errno {
 	IPSET_ERR_EXIST,
 	IPSET_ERR_INVALID_CIDR,
 	IPSET_ERR_INVALID_NETMASK,
+	IPSET_ERR_INVALID_MARKMASK,
 	IPSET_ERR_INVALID_FAMILY,
 	IPSET_ERR_TIMEOUT,
 	IPSET_ERR_REFERENCED,
diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
index be6932a..b1eed81 100644
--- a/net/netfilter/ipset/ip_set_hash_gen.h
+++ b/net/netfilter/ipset/ip_set_hash_gen.h
@@ -263,6 +263,9 @@ struct htype {
 	u32 maxelem;		/* max elements in the hash */
 	u32 elements;		/* current element (vs timeout) */
 	u32 initval;		/* random jhash init value */
+#ifdef IP_SET_HASH_WITH_MARKMASK
+	u32 markmask;		/* markmask value for mark mask to store */
+#endif
 	struct timer_list gc;	/* garbage collection when timeout enabled */
 	struct mtype_elem next; /* temporary storage for uadd */
 #ifdef IP_SET_HASH_WITH_MULTI
@@ -454,6 +457,9 @@ mtype_same_set(const struct ip_set *a, const struct ip_set *b)
 #ifdef IP_SET_HASH_WITH_NETMASK
 	       x->netmask == y->netmask &&
 #endif
+#ifdef IP_SET_HASH_WITH_MARKMASK
+	       x->markmask == y->markmask &&
+#endif
 	       a->extensions == b->extensions;
 }
 
@@ -908,6 +914,10 @@ mtype_head(struct ip_set *set, struct sk_buff *skb)
 	    nla_put_u8(skb, IPSET_ATTR_NETMASK, h->netmask))
 		goto nla_put_failure;
 #endif
+#ifdef IP_SET_HASH_WITH_MARKMASK
+	if (nla_put_u32(skb, IPSET_ATTR_MARKMASK, h->markmask))
+		goto nla_put_failure;
+#endif
 	if (nla_put_net32(skb, IPSET_ATTR_REFERENCES, htonl(set->ref - 1)) ||
 	    nla_put_net32(skb, IPSET_ATTR_MEMSIZE, htonl(memsize)))
 		goto nla_put_failure;
@@ -1016,6 +1026,9 @@ IPSET_TOKEN(HTYPE, _create)(struct net *net, struct ip_set *set,
 			    struct nlattr *tb[], u32 flags)
 {
 	u32 hashsize = IPSET_DEFAULT_HASHSIZE, maxelem = IPSET_DEFAULT_MAXELEM;
+#ifdef IP_SET_HASH_WITH_MARKMASK
+	u32 markmask;
+#endif
 	u8 hbits;
 #ifdef IP_SET_HASH_WITH_NETMASK
 	u8 netmask;
@@ -1026,6 +1039,10 @@ IPSET_TOKEN(HTYPE, _create)(struct net *net, struct ip_set *set,
 
 	if (!(set->family == NFPROTO_IPV4 || set->family == NFPROTO_IPV6))
 		return -IPSET_ERR_INVALID_FAMILY;
+
+#ifdef IP_SET_HASH_WITH_MARKMASK
+	markmask = 0xffffffff;
+#endif
 #ifdef IP_SET_HASH_WITH_NETMASK
 	netmask = set->family == NFPROTO_IPV4 ? 32 : 128;
 	pr_debug("Create set %s with family %s\n",
@@ -1034,6 +1051,9 @@ IPSET_TOKEN(HTYPE, _create)(struct net *net, struct ip_set *set,
 
 	if (unlikely(!ip_set_optattr_netorder(tb, IPSET_ATTR_HASHSIZE) ||
 		     !ip_set_optattr_netorder(tb, IPSET_ATTR_MAXELEM) ||
+#ifdef IP_SET_HASH_WITH_MARKMASK
+		     !ip_set_optattr_netorder(tb, IPSET_ATTR_MARKMASK) ||
+#endif
 		     !ip_set_optattr_netorder(tb, IPSET_ATTR_TIMEOUT) ||
 		     !ip_set_optattr_netorder(tb, IPSET_ATTR_CADT_FLAGS)))
 		return -IPSET_ERR_PROTOCOL;
@@ -1057,6 +1077,14 @@ IPSET_TOKEN(HTYPE, _create)(struct net *net, struct ip_set *set,
 			return -IPSET_ERR_INVALID_NETMASK;
 	}
 #endif
+#ifdef IP_SET_HASH_WITH_MARKMASK
+	if (tb[IPSET_ATTR_MARKMASK]) {
+		markmask = ntohl(nla_get_u32(tb[IPSET_ATTR_MARKMASK]));
+
+		if (markmask == 0)
+			return -IPSET_ERR_INVALID_MARKMASK;
+	}
+#endif
 
 	hsize = sizeof(*h);
 #ifdef IP_SET_HASH_WITH_NETS
@@ -1071,6 +1099,9 @@ IPSET_TOKEN(HTYPE, _create)(struct net *net, struct ip_set *set,
 #ifdef IP_SET_HASH_WITH_NETMASK
 	h->netmask = netmask;
 #endif
+#ifdef IP_SET_HASH_WITH_MARKMASK
+	h->markmask = markmask;
+#endif
 	get_random_bytes(&h->initval, sizeof(h->initval));
 	set->timeout = IPSET_NO_TIMEOUT;
 
diff --git a/net/netfilter/ipset/ip_set_hash_ipmark.c b/net/netfilter/ipset/ip_set_hash_ipmark.c
index e56c0d9..1bf8e85 100644
--- a/net/netfilter/ipset/ip_set_hash_ipmark.c
+++ b/net/netfilter/ipset/ip_set_hash_ipmark.c
@@ -34,6 +34,7 @@ MODULE_ALIAS("ip_set_hash:ip,mark");
 
 /* Type specific function prefix */
 #define HTYPE		hash_ipmark
+#define IP_SET_HASH_WITH_MARKMASK
 
 /* IPv4 variant */
 
@@ -85,11 +86,13 @@ hash_ipmark4_kadt(struct ip_set *set, const struct sk_buff *skb,
 		  const struct xt_action_param *par,
 		  enum ipset_adt adt, struct ip_set_adt_opt *opt)
 {
+	const struct hash_ipmark *h = set->data;
 	ipset_adtfn adtfn = set->variant->adt[adt];
 	struct hash_ipmark4_elem e = { };
 	struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
 
 	e.mark = skb->mark;
+	e.mark &= h->markmask;
 
 	ip4addrptr(skb, opt->flags & IPSET_DIM_ONE_SRC, &e.ip);
 	return adtfn(set, &e, &ext, &opt->ext, opt->cmdflags);
@@ -122,6 +125,7 @@ hash_ipmark4_uadt(struct ip_set *set, struct nlattr *tb[],
 		return ret;
 
 	e.mark = ntohl(nla_get_u32(tb[IPSET_ATTR_MARK]));
+	e.mark &= h->markmask;
 
 	if (adt == IPSET_TEST ||
 	    !(tb[IPSET_ATTR_IP_TO] || tb[IPSET_ATTR_CIDR])) {
@@ -213,11 +217,13 @@ hash_ipmark6_kadt(struct ip_set *set, const struct sk_buff *skb,
 		  const struct xt_action_param *par,
 		  enum ipset_adt adt, struct ip_set_adt_opt *opt)
 {
+	const struct hash_ipmark *h = set->data;
 	ipset_adtfn adtfn = set->variant->adt[adt];
 	struct hash_ipmark6_elem e = { };
 	struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
 
 	e.mark = skb->mark;
+	e.mark &= h->markmask;
 
 	ip6addrptr(skb, opt->flags & IPSET_DIM_ONE_SRC, &e.ip.in6);
 	return adtfn(set, &e, &ext, &opt->ext, opt->cmdflags);
@@ -227,6 +233,7 @@ static int
 hash_ipmark6_uadt(struct ip_set *set, struct nlattr *tb[],
 		  enum ipset_adt adt, u32 *lineno, u32 flags, bool retried)
 {
+	const struct hash_ipmark *h = set->data;
 	ipset_adtfn adtfn = set->variant->adt[adt];
 	struct hash_ipmark6_elem e = { };
 	struct ip_set_ext ext = IP_SET_INIT_UEXT(set);
@@ -250,6 +257,7 @@ hash_ipmark6_uadt(struct ip_set *set, struct nlattr *tb[],
 		return ret;
 
 	e.mark = ntohl(nla_get_u32(tb[IPSET_ATTR_MARK]));
+	e.mark &= h->markmask;
 
 	if (adt == IPSET_TEST) {
 		ret = adtfn(set, &e, &ext, &ext, flags);
@@ -275,6 +283,7 @@ static struct ip_set_type hash_ipmark_type __read_mostly = {
 	.revision_max	= IPSET_TYPE_REV_MAX,
 	.create		= hash_ipmark_create,
 	.create_policy	= {
+		[IPSET_ATTR_MARKMASK]	= { .type = NLA_U32 },
 		[IPSET_ATTR_HASHSIZE]	= { .type = NLA_U32 },
 		[IPSET_ATTR_MAXELEM]	= { .type = NLA_U32 },
 		[IPSET_ATTR_PROBES]	= { .type = NLA_U8 },
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 14/38] netfilter: ipset: Prepare the kernel for create option flags when no extension is needed
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (12 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 13/38] netfilter: ipset: add markmask for hash:ip,mark data type Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 15/38] netfilter: ipset: kernel: uapi: fix MARKMASK attr ABI breakage Pablo Neira Ayuso
                   ` (24 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 include/linux/netfilter/ipset/ip_set.h      |    2 ++
 include/uapi/linux/netfilter/ipset/ip_set.h |    6 ++++++
 2 files changed, 8 insertions(+)

diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
index 4ac00d4..f476bce 100644
--- a/include/linux/netfilter/ipset/ip_set.h
+++ b/include/linux/netfilter/ipset/ip_set.h
@@ -219,6 +219,8 @@ struct ip_set {
 	u8 revision;
 	/* Extensions */
 	u8 extensions;
+	/* Create flags */
+	u8 flags;
 	/* Default timeout value, if enabled */
 	u32 timeout;
 	/* Element data size */
diff --git a/include/uapi/linux/netfilter/ipset/ip_set.h b/include/uapi/linux/netfilter/ipset/ip_set.h
index f636f28..a29a378 100644
--- a/include/uapi/linux/netfilter/ipset/ip_set.h
+++ b/include/uapi/linux/netfilter/ipset/ip_set.h
@@ -188,6 +188,12 @@ enum ipset_cadt_flags {
 	IPSET_FLAG_CADT_MAX	= 15,
 };
 
+/* The flag bits which correspond to the non-extension create flags */
+enum ipset_create_flags {
+	IPSET_CREATE_FLAG_NONE = 0,
+	IPSET_CREATE_FLAG_MAX = 7,
+};
+
 /* Commands with settype-specific attributes */
 enum ipset_adt {
 	IPSET_ADD,
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 15/38] netfilter: ipset: kernel: uapi: fix MARKMASK attr ABI breakage
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (13 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 14/38] netfilter: ipset: Prepare the kernel for create option flags when no extension is needed Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 16/38] netfilter: ipset: move registration message to init from net_init Pablo Neira Ayuso
                   ` (23 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Florian Westphal <fw@strlen.de>

commit 2dfb973c0dcc6d2211 (add markmask for hash:ip,mark data type)
inserted IPSET_ATTR_MARKMASK in-between other enum values, i.e. it
changed the values of all further attributes.  This causes 'ipset list'
to segfault on existing kernels, since ipset no longer finds
IPSET_ATTR_MEMSIZE (it has a different value on the kernel side).

Jozsef points out that it should be moved below IPSET_ATTR_MARK, which
works since there is some extra reserved space after that value.
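
For clarity, a minimal illustration (not from the patch) of why a mid-enum
insertion breaks the netlink ABI: every later member is silently renumbered,
so an old ipset binary and a new kernel disagree on the attribute ids they
exchange.

        enum example_attr {
                EXAMPLE_ATTR_FOO = 1,   /* 1 in both old and new ABI */
                EXAMPLE_ATTR_NEW,       /* 2: the newly inserted value */
                EXAMPLE_ATTR_BAR,       /* was 2, is now 3: old binaries still send 2 */
        };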

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 include/uapi/linux/netfilter/ipset/ip_set.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/netfilter/ipset/ip_set.h b/include/uapi/linux/netfilter/ipset/ip_set.h
index a29a378..a1ca244 100644
--- a/include/uapi/linux/netfilter/ipset/ip_set.h
+++ b/include/uapi/linux/netfilter/ipset/ip_set.h
@@ -83,13 +83,13 @@ enum {
 	IPSET_ATTR_CADT_FLAGS,	/* 8 */
 	IPSET_ATTR_CADT_LINENO = IPSET_ATTR_LINENO,	/* 9 */
 	IPSET_ATTR_MARK,	/* 10 */
+	IPSET_ATTR_MARKMASK,	/* 11 */
 	/* Reserve empty slots */
 	IPSET_ATTR_CADT_MAX = 16,
 	/* Create-only specific attributes */
 	IPSET_ATTR_GC,
 	IPSET_ATTR_HASHSIZE,
 	IPSET_ATTR_MAXELEM,
-	IPSET_ATTR_MARKMASK,
 	IPSET_ATTR_NETMASK,
 	IPSET_ATTR_PROBES,
 	IPSET_ATTR_RESIZE,
@@ -139,7 +139,6 @@ enum ipset_errno {
 	IPSET_ERR_EXIST,
 	IPSET_ERR_INVALID_CIDR,
 	IPSET_ERR_INVALID_NETMASK,
-	IPSET_ERR_INVALID_MARKMASK,
 	IPSET_ERR_INVALID_FAMILY,
 	IPSET_ERR_TIMEOUT,
 	IPSET_ERR_REFERENCED,
@@ -147,6 +146,7 @@ enum ipset_errno {
 	IPSET_ERR_IPADDR_IPV6,
 	IPSET_ERR_COUNTER,
 	IPSET_ERR_COMMENT,
+	IPSET_ERR_INVALID_MARKMASK,
 
 	/* Type specific error codes */
 	IPSET_ERR_TYPE_SPECIFIC = 4352,
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 16/38] netfilter: ipset: move registration message to init from net_init
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (14 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 15/38] netfilter: ipset: kernel: uapi: fix MARKMASK attr ABI breakage Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 17/38] netfilter: ipset: add forceadd kernel support for hash set types Pablo Neira Ayuso
                   ` (22 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Ilia Mirkin <imirkin@alum.mit.edu>

Commit 1785e8f473 ("netfiler: ipset: Add net namespace for ipset") moved
the initialization print into net_init, which can get called a lot due
to namespaces. Move it back into init and reduce it to pr_info.

Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 net/netfilter/ipset/ip_set_core.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index c6c97b8..636cb8d 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -1945,7 +1945,6 @@ ip_set_net_init(struct net *net)
 		return -ENOMEM;
 	inst->is_deleted = 0;
 	rcu_assign_pointer(inst->ip_set_list, list);
-	pr_notice("ip_set: protocol %u\n", IPSET_PROTOCOL);
 	return 0;
 }
 
@@ -1996,6 +1995,7 @@ ip_set_init(void)
 		nfnetlink_subsys_unregister(&ip_set_netlink_subsys);
 		return ret;
 	}
+	pr_info("ip_set: protocol %u\n", IPSET_PROTOCOL);
 	return 0;
 }
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 17/38] netfilter: ipset: add forceadd kernel support for hash set types
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (15 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 16/38] netfilter: ipset: move registration message to init from net_init Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 18/38] sections, ipvs: Remove useless __read_mostly for ipvs genl_ops Pablo Neira Ayuso
                   ` (21 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Josh Hunt <johunt@akamai.com>

Adds a new property for hash set types: if a set is created with the
'forceadd' option and the set becomes full, the next addition to the
set may succeed and evict a random entry from the set.

To keep overhead low, eviction is done very simply. It checks which
bucket the new entry would be added to. If the bucket's pos value is
non-zero (meaning there's at least one entry in the bucket), it
replaces the first entry in the bucket. If pos is zero, it continues
down the normal add process.

This property is useful if you have a set for 'ban' lists where it may
not matter if you release some entries from the set early.
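
A condensed sketch of that decision (the authoritative version is the
ip_set_hash_gen.h hunk below; set_is_full() and reuse_first_slot() are
made-up helpers standing in for the open-coded logic):

        if (set_is_full(h) && SET_WITH_FORCEADD(set)) {
                n = hbucket(t, HKEY(value, h->initval, t->htable_bits));
                if (n->pos)                     /* bucket is not empty */
                        return reuse_first_slot(n, value);  /* evict entry 0 */
                /* otherwise fall through to the normal add path */
        }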

Signed-off-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 include/linux/netfilter/ipset/ip_set.h       |    3 +++
 include/uapi/linux/netfilter/ipset/ip_set.h  |    7 +++++--
 net/netfilter/ipset/ip_set_core.c            |    2 ++
 net/netfilter/ipset/ip_set_hash_gen.h        |   12 ++++++++++++
 net/netfilter/ipset/ip_set_hash_ip.c         |    3 ++-
 net/netfilter/ipset/ip_set_hash_ipmark.c     |    2 +-
 net/netfilter/ipset/ip_set_hash_ipport.c     |    3 ++-
 net/netfilter/ipset/ip_set_hash_ipportip.c   |    3 ++-
 net/netfilter/ipset/ip_set_hash_ipportnet.c  |    3 ++-
 net/netfilter/ipset/ip_set_hash_net.c        |    3 ++-
 net/netfilter/ipset/ip_set_hash_netiface.c   |    3 ++-
 net/netfilter/ipset/ip_set_hash_netnet.c     |    2 +-
 net/netfilter/ipset/ip_set_hash_netport.c    |    3 ++-
 net/netfilter/ipset/ip_set_hash_netportnet.c |    3 ++-
 14 files changed, 40 insertions(+), 12 deletions(-)

diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
index f476bce..96afc29 100644
--- a/include/linux/netfilter/ipset/ip_set.h
+++ b/include/linux/netfilter/ipset/ip_set.h
@@ -65,6 +65,7 @@ enum ip_set_extension {
 #define SET_WITH_TIMEOUT(s)	((s)->extensions & IPSET_EXT_TIMEOUT)
 #define SET_WITH_COUNTER(s)	((s)->extensions & IPSET_EXT_COUNTER)
 #define SET_WITH_COMMENT(s)	((s)->extensions & IPSET_EXT_COMMENT)
+#define SET_WITH_FORCEADD(s)	((s)->flags & IPSET_CREATE_FLAG_FORCEADD)
 
 /* Extension id, in size order */
 enum ip_set_ext_id {
@@ -255,6 +256,8 @@ ip_set_put_flags(struct sk_buff *skb, struct ip_set *set)
 		cadt_flags |= IPSET_FLAG_WITH_COUNTERS;
 	if (SET_WITH_COMMENT(set))
 		cadt_flags |= IPSET_FLAG_WITH_COMMENT;
+	if (SET_WITH_FORCEADD(set))
+		cadt_flags |= IPSET_FLAG_WITH_FORCEADD;
 
 	if (!cadt_flags)
 		return 0;
diff --git a/include/uapi/linux/netfilter/ipset/ip_set.h b/include/uapi/linux/netfilter/ipset/ip_set.h
index a1ca244..78c2f2e 100644
--- a/include/uapi/linux/netfilter/ipset/ip_set.h
+++ b/include/uapi/linux/netfilter/ipset/ip_set.h
@@ -185,13 +185,16 @@ enum ipset_cadt_flags {
 	IPSET_FLAG_WITH_COUNTERS = (1 << IPSET_FLAG_BIT_WITH_COUNTERS),
 	IPSET_FLAG_BIT_WITH_COMMENT = 4,
 	IPSET_FLAG_WITH_COMMENT = (1 << IPSET_FLAG_BIT_WITH_COMMENT),
+	IPSET_FLAG_BIT_WITH_FORCEADD = 5,
+	IPSET_FLAG_WITH_FORCEADD = (1 << IPSET_FLAG_BIT_WITH_FORCEADD),
 	IPSET_FLAG_CADT_MAX	= 15,
 };
 
 /* The flag bits which correspond to the non-extension create flags */
 enum ipset_create_flags {
-	IPSET_CREATE_FLAG_NONE = 0,
-	IPSET_CREATE_FLAG_MAX = 7,
+	IPSET_CREATE_FLAG_BIT_FORCEADD = 0,
+	IPSET_CREATE_FLAG_FORCEADD = (1 << IPSET_CREATE_FLAG_BIT_FORCEADD),
+	IPSET_CREATE_FLAG_BIT_MAX = 7,
 };
 
 /* Commands with settype-specific attributes */
diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index 636cb8d..1172083 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -368,6 +368,8 @@ ip_set_elem_len(struct ip_set *set, struct nlattr *tb[], size_t len)
 
 	if (tb[IPSET_ATTR_CADT_FLAGS])
 		cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+	if (cadt_flags & IPSET_FLAG_WITH_FORCEADD)
+		set->flags |= IPSET_CREATE_FLAG_FORCEADD;
 	for (id = 0; id < IPSET_EXT_ID_MAX; id++) {
 		if (!add_extension(id, cadt_flags, tb))
 			continue;
diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
index b1eed81..61c7fb0 100644
--- a/net/netfilter/ipset/ip_set_hash_gen.h
+++ b/net/netfilter/ipset/ip_set_hash_gen.h
@@ -633,6 +633,18 @@ mtype_add(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 	bool flag_exist = flags & IPSET_FLAG_EXIST;
 	u32 key, multi = 0;
 
+	if (h->elements >= h->maxelem && SET_WITH_FORCEADD(set)) {
+		rcu_read_lock_bh();
+		t = rcu_dereference_bh(h->table);
+		key = HKEY(value, h->initval, t->htable_bits);
+		n = hbucket(t,key);
+		if (n->pos) {
+			/* Choosing the first entry in the array to replace */
+			j = 0;
+			goto reuse_slot;
+		}
+		rcu_read_unlock_bh();
+	}
 	if (SET_WITH_TIMEOUT(set) && h->elements >= h->maxelem)
 		/* FIXME: when set is full, we slow down here */
 		mtype_expire(set, h, NLEN(set->family), set->dsize);
diff --git a/net/netfilter/ipset/ip_set_hash_ip.c b/net/netfilter/ipset/ip_set_hash_ip.c
index e65fc24..dd40607 100644
--- a/net/netfilter/ipset/ip_set_hash_ip.c
+++ b/net/netfilter/ipset/ip_set_hash_ip.c
@@ -25,7 +25,8 @@
 
 #define IPSET_TYPE_REV_MIN	0
 /*				1	   Counters support */
-#define IPSET_TYPE_REV_MAX	2	/* Comments support */
+/*				2	   Comments support */
+#define IPSET_TYPE_REV_MAX	3	/* Forceadd support */
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>");
diff --git a/net/netfilter/ipset/ip_set_hash_ipmark.c b/net/netfilter/ipset/ip_set_hash_ipmark.c
index 1bf8e85..4eff0a2 100644
--- a/net/netfilter/ipset/ip_set_hash_ipmark.c
+++ b/net/netfilter/ipset/ip_set_hash_ipmark.c
@@ -25,7 +25,7 @@
 #include <linux/netfilter/ipset/ip_set_hash.h>
 
 #define IPSET_TYPE_REV_MIN	0
-#define IPSET_TYPE_REV_MAX	0
+#define IPSET_TYPE_REV_MAX	1	/* Forceadd support */
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Vytas Dauksa <vytas.dauksa@smoothwall.net>");
diff --git a/net/netfilter/ipset/ip_set_hash_ipport.c b/net/netfilter/ipset/ip_set_hash_ipport.c
index 525a595..7597b82 100644
--- a/net/netfilter/ipset/ip_set_hash_ipport.c
+++ b/net/netfilter/ipset/ip_set_hash_ipport.c
@@ -27,7 +27,8 @@
 #define IPSET_TYPE_REV_MIN	0
 /*				1    SCTP and UDPLITE support added */
 /*				2    Counters support added */
-#define IPSET_TYPE_REV_MAX	3 /* Comments support added */
+/*				3    Comments support added */
+#define IPSET_TYPE_REV_MAX	4 /* Forceadd support added */
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>");
diff --git a/net/netfilter/ipset/ip_set_hash_ipportip.c b/net/netfilter/ipset/ip_set_hash_ipportip.c
index f563663..672655f 100644
--- a/net/netfilter/ipset/ip_set_hash_ipportip.c
+++ b/net/netfilter/ipset/ip_set_hash_ipportip.c
@@ -27,7 +27,8 @@
 #define IPSET_TYPE_REV_MIN	0
 /*				1    SCTP and UDPLITE support added */
 /*				2    Counters support added */
-#define IPSET_TYPE_REV_MAX	3 /* Comments support added */
+/*				3    Comments support added */
+#define IPSET_TYPE_REV_MAX	4 /* Forceadd support added */
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>");
diff --git a/net/netfilter/ipset/ip_set_hash_ipportnet.c b/net/netfilter/ipset/ip_set_hash_ipportnet.c
index 5d87fe8..7308d84 100644
--- a/net/netfilter/ipset/ip_set_hash_ipportnet.c
+++ b/net/netfilter/ipset/ip_set_hash_ipportnet.c
@@ -29,7 +29,8 @@
 /*				2    Range as input support for IPv4 added */
 /*				3    nomatch flag support added */
 /*				4    Counters support added */
-#define IPSET_TYPE_REV_MAX	5 /* Comments support added */
+/*				5    Comments support added */
+#define IPSET_TYPE_REV_MAX	6 /* Forceadd support added */
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>");
diff --git a/net/netfilter/ipset/ip_set_hash_net.c b/net/netfilter/ipset/ip_set_hash_net.c
index 8295cf4..4c7d495 100644
--- a/net/netfilter/ipset/ip_set_hash_net.c
+++ b/net/netfilter/ipset/ip_set_hash_net.c
@@ -26,7 +26,8 @@
 /*				1    Range as input support for IPv4 added */
 /*				2    nomatch flag support added */
 /*				3    Counters support added */
-#define IPSET_TYPE_REV_MAX	4 /* Comments support added */
+/*				4    Comments support added */
+#define IPSET_TYPE_REV_MAX	5 /* Forceadd support added */
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>");
diff --git a/net/netfilter/ipset/ip_set_hash_netiface.c b/net/netfilter/ipset/ip_set_hash_netiface.c
index b827a0f..db26068 100644
--- a/net/netfilter/ipset/ip_set_hash_netiface.c
+++ b/net/netfilter/ipset/ip_set_hash_netiface.c
@@ -27,7 +27,8 @@
 /*				1    nomatch flag support added */
 /*				2    /0 support added */
 /*				3    Counters support added */
-#define IPSET_TYPE_REV_MAX	4 /* Comments support added */
+/*				4    Comments support added */
+#define IPSET_TYPE_REV_MAX	5 /* Forceadd support added */
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>");
diff --git a/net/netfilter/ipset/ip_set_hash_netnet.c b/net/netfilter/ipset/ip_set_hash_netnet.c
index 4e7261d..3e99987 100644
--- a/net/netfilter/ipset/ip_set_hash_netnet.c
+++ b/net/netfilter/ipset/ip_set_hash_netnet.c
@@ -24,7 +24,7 @@
 #include <linux/netfilter/ipset/ip_set_hash.h>
 
 #define IPSET_TYPE_REV_MIN	0
-#define IPSET_TYPE_REV_MAX	0
+#define IPSET_TYPE_REV_MAX	1	/* Forceadd support added */
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Oliver Smith <oliver@8.c.9.b.0.7.4.0.1.0.0.2.ip6.arpa>");
diff --git a/net/netfilter/ipset/ip_set_hash_netport.c b/net/netfilter/ipset/ip_set_hash_netport.c
index 7097fb0..1c645fb 100644
--- a/net/netfilter/ipset/ip_set_hash_netport.c
+++ b/net/netfilter/ipset/ip_set_hash_netport.c
@@ -28,7 +28,8 @@
 /*				2    Range as input support for IPv4 added */
 /*				3    nomatch flag support added */
 /*				4    Counters support added */
-#define IPSET_TYPE_REV_MAX	5 /* Comments support added */
+/*				5    Comments support added */
+#define IPSET_TYPE_REV_MAX	6 /* Forceadd support added */
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>");
diff --git a/net/netfilter/ipset/ip_set_hash_netportnet.c b/net/netfilter/ipset/ip_set_hash_netportnet.c
index 703d119..c0d2ba7 100644
--- a/net/netfilter/ipset/ip_set_hash_netportnet.c
+++ b/net/netfilter/ipset/ip_set_hash_netportnet.c
@@ -25,7 +25,8 @@
 #include <linux/netfilter/ipset/ip_set_hash.h>
 
 #define IPSET_TYPE_REV_MIN	0
-#define IPSET_TYPE_REV_MAX	0 /* Comments support added */
+/*				0    Comments support added */
+#define IPSET_TYPE_REV_MAX	1 /* Forceadd support added */
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Oliver Smith <oliver@8.c.9.b.0.7.4.0.1.0.0.2.ip6.arpa>");
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 18/38] sections, ipvs: Remove useless __read_mostly for ipvs genl_ops
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (16 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 17/38] netfilter: ipset: add forceadd kernel support for hash set types Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 19/38] ipvs: Reduce checkpatch noise in ip_vs_lblc.c Pablo Neira Ayuso
                   ` (20 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Andi Kleen <ak@linux.intel.com>

const __read_mostly does not make any sense, because const
data is already read-only. Remove the __read_mostly
for the ipvs genl_ops. This avoids an LTO
section conflict compile problem.
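
Roughly what the conflict looks like (sketch, not from the patch; the exact
section names depend on the architecture and config):

        /* __read_mostly expands to a data-section placement, roughly: */
        #define __read_mostly __attribute__((__section__(".data..read_mostly")))

        /* const objects already go to a read-only section (.rodata), so the
         * combination hands one symbol two contradictory placements, which
         * LTO reports as a section conflict: */
        static const int bad_table[4] __read_mostly = { 1, 2, 3, 4 };  /* conflict */
        static const int good_table[4]              = { 1, 2, 3, 4 };  /* fine */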

Cc: Wensong Zhang <wensong@linux-vs.org>
Cc: Simon Horman <horms@verge.net.au>
Cc: Patrick McHardy <kaber@trash.net>
Cc: lvs-devel@vger.kernel.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 net/netfilter/ipvs/ip_vs_ctl.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 35be035..2a68a38 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -3580,7 +3580,7 @@ out:
 }
 
 
-static const struct genl_ops ip_vs_genl_ops[] __read_mostly = {
+static const struct genl_ops ip_vs_genl_ops[] = {
 	{
 		.cmd	= IPVS_CMD_NEW_SERVICE,
 		.flags	= GENL_ADMIN_PERM,
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 19/38] ipvs: Reduce checkpatch noise in ip_vs_lblc.c
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (17 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 18/38] sections, ipvs: Remove useless __read_mostly for ipvs genl_ops Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 20/38] netfilter: trivial code cleanup and doc changes Pablo Neira Ayuso
                   ` (19 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Tingwei Liu <tingw.liu@gmail.com>

Add whitespace after operators and put the open brace { on the previous line.

Cc: Tingwei Liu <liutingwei@hisense.com>
Cc: lvs-devel@vger.kernel.org
Signed-off-by: Tingwei Liu <tingw.liu@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 net/netfilter/ipvs/ip_vs_lblc.c |   13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_lblc.c b/net/netfilter/ipvs/ip_vs_lblc.c
index ca056a3..547ff33 100644
--- a/net/netfilter/ipvs/ip_vs_lblc.c
+++ b/net/netfilter/ipvs/ip_vs_lblc.c
@@ -238,7 +238,7 @@ static void ip_vs_lblc_flush(struct ip_vs_service *svc)
 
 	spin_lock_bh(&svc->sched_lock);
 	tbl->dead = 1;
-	for (i=0; i<IP_VS_LBLC_TAB_SIZE; i++) {
+	for (i = 0; i < IP_VS_LBLC_TAB_SIZE; i++) {
 		hlist_for_each_entry_safe(en, next, &tbl->bucket[i], list) {
 			ip_vs_lblc_del(en);
 			atomic_dec(&tbl->entries);
@@ -265,7 +265,7 @@ static inline void ip_vs_lblc_full_check(struct ip_vs_service *svc)
 	unsigned long now = jiffies;
 	int i, j;
 
-	for (i=0, j=tbl->rover; i<IP_VS_LBLC_TAB_SIZE; i++) {
+	for (i = 0, j = tbl->rover; i < IP_VS_LBLC_TAB_SIZE; i++) {
 		j = (j + 1) & IP_VS_LBLC_TAB_MASK;
 
 		spin_lock(&svc->sched_lock);
@@ -321,7 +321,7 @@ static void ip_vs_lblc_check_expire(unsigned long data)
 	if (goal > tbl->max_size/2)
 		goal = tbl->max_size/2;
 
-	for (i=0, j=tbl->rover; i<IP_VS_LBLC_TAB_SIZE; i++) {
+	for (i = 0, j = tbl->rover; i < IP_VS_LBLC_TAB_SIZE; i++) {
 		j = (j + 1) & IP_VS_LBLC_TAB_MASK;
 
 		spin_lock(&svc->sched_lock);
@@ -340,7 +340,7 @@ static void ip_vs_lblc_check_expire(unsigned long data)
 	tbl->rover = j;
 
   out:
-	mod_timer(&tbl->periodic_timer, jiffies+CHECK_EXPIRE_INTERVAL);
+	mod_timer(&tbl->periodic_timer, jiffies + CHECK_EXPIRE_INTERVAL);
 }
 
 
@@ -363,7 +363,7 @@ static int ip_vs_lblc_init_svc(struct ip_vs_service *svc)
 	/*
 	 *    Initialize the hash buckets
 	 */
-	for (i=0; i<IP_VS_LBLC_TAB_SIZE; i++) {
+	for (i = 0; i < IP_VS_LBLC_TAB_SIZE; i++) {
 		INIT_HLIST_HEAD(&tbl->bucket[i]);
 	}
 	tbl->max_size = IP_VS_LBLC_TAB_SIZE*16;
@@ -536,8 +536,7 @@ out:
 /*
  *      IPVS LBLC Scheduler structure
  */
-static struct ip_vs_scheduler ip_vs_lblc_scheduler =
-{
+static struct ip_vs_scheduler ip_vs_lblc_scheduler = {
 	.name =			"lblc",
 	.refcnt =		ATOMIC_INIT(0),
 	.module =		THIS_MODULE,
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 20/38] netfilter: trivial code cleanup and doc changes
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (18 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 19/38] ipvs: Reduce checkpatch noise in ip_vs_lblc.c Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 21/38] netfilter: conntrack: spinlock per cpu to protect special lists Pablo Neira Ayuso
                   ` (18 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Jesper Dangaard Brouer <brouer@redhat.com>

Changes made while reading through the netfilter code:

Added a hint about how the conntrack nf_conn refcnt is accessed,
and renamed repl_hash to reply_hash for readability.
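
For reference, the helpers the new comment distinguishes, in a small sketch
(assumes an skb that already has a conntrack entry attached and the usual
<net/netfilter/nf_conntrack.h> include):

        static void refcnt_example(struct sk_buff *skb)
        {
                enum ip_conntrack_info ctinfo;
                struct nf_conn *ct = nf_ct_get(skb, &ctinfo);  /* borrows, no refcnt taken */

                if (!ct)
                        return;
                nf_conntrack_get(skb->nfct);    /* +1 on ct->ct_general.use */
                /* ... ct cannot go away here ... */
                nf_ct_put(ct);                  /* -1 on the same counter */
        }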

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_conntrack.h |    8 +++++++-
 net/netfilter/nf_conntrack_core.c    |   20 ++++++++++----------
 net/netfilter/nf_conntrack_expect.c  |    2 +-
 3 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index b2ac624..e10d1fa 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -73,7 +73,13 @@ struct nf_conn_help {
 
 struct nf_conn {
 	/* Usage count in here is 1 for hash table/destruct timer, 1 per skb,
-           plus 1 for any connection(s) we are `master' for */
+	 * plus 1 for any connection(s) we are `master' for
+	 *
+	 * Hint, SKB address this struct and refcnt via skb->nfct and
+	 * helpers nf_conntrack_get() and nf_conntrack_put().
+	 * Helper nf_ct_put() equals nf_conntrack_put() by dec refcnt,
+	 * beware nf_ct_get() is different and don't inc refcnt.
+	 */
 	struct nf_conntrack ct_general;
 
 	spinlock_t lock;
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 356bef5..965693e 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -408,21 +408,21 @@ EXPORT_SYMBOL_GPL(nf_conntrack_find_get);
 
 static void __nf_conntrack_hash_insert(struct nf_conn *ct,
 				       unsigned int hash,
-				       unsigned int repl_hash)
+				       unsigned int reply_hash)
 {
 	struct net *net = nf_ct_net(ct);
 
 	hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
 			   &net->ct.hash[hash]);
 	hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode,
-			   &net->ct.hash[repl_hash]);
+			   &net->ct.hash[reply_hash]);
 }
 
 int
 nf_conntrack_hash_check_insert(struct nf_conn *ct)
 {
 	struct net *net = nf_ct_net(ct);
-	unsigned int hash, repl_hash;
+	unsigned int hash, reply_hash;
 	struct nf_conntrack_tuple_hash *h;
 	struct hlist_nulls_node *n;
 	u16 zone;
@@ -430,7 +430,7 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct)
 	zone = nf_ct_zone(ct);
 	hash = hash_conntrack(net, zone,
 			      &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
-	repl_hash = hash_conntrack(net, zone,
+	reply_hash = hash_conntrack(net, zone,
 				   &ct->tuplehash[IP_CT_DIR_REPLY].tuple);
 
 	spin_lock_bh(&nf_conntrack_lock);
@@ -441,7 +441,7 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct)
 				      &h->tuple) &&
 		    zone == nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)))
 			goto out;
-	hlist_nulls_for_each_entry(h, n, &net->ct.hash[repl_hash], hnnode)
+	hlist_nulls_for_each_entry(h, n, &net->ct.hash[reply_hash], hnnode)
 		if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_REPLY].tuple,
 				      &h->tuple) &&
 		    zone == nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)))
@@ -451,7 +451,7 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct)
 	smp_wmb();
 	/* The caller holds a reference to this object */
 	atomic_set(&ct->ct_general.use, 2);
-	__nf_conntrack_hash_insert(ct, hash, repl_hash);
+	__nf_conntrack_hash_insert(ct, hash, reply_hash);
 	NF_CT_STAT_INC(net, insert);
 	spin_unlock_bh(&nf_conntrack_lock);
 
@@ -483,7 +483,7 @@ EXPORT_SYMBOL_GPL(nf_conntrack_tmpl_insert);
 int
 __nf_conntrack_confirm(struct sk_buff *skb)
 {
-	unsigned int hash, repl_hash;
+	unsigned int hash, reply_hash;
 	struct nf_conntrack_tuple_hash *h;
 	struct nf_conn *ct;
 	struct nf_conn_help *help;
@@ -507,7 +507,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	/* reuse the hash saved before */
 	hash = *(unsigned long *)&ct->tuplehash[IP_CT_DIR_REPLY].hnnode.pprev;
 	hash = hash_bucket(hash, net);
-	repl_hash = hash_conntrack(net, zone,
+	reply_hash = hash_conntrack(net, zone,
 				   &ct->tuplehash[IP_CT_DIR_REPLY].tuple);
 
 	/* We're not in hash table, and we refuse to set up related
@@ -540,7 +540,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 				      &h->tuple) &&
 		    zone == nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)))
 			goto out;
-	hlist_nulls_for_each_entry(h, n, &net->ct.hash[repl_hash], hnnode)
+	hlist_nulls_for_each_entry(h, n, &net->ct.hash[reply_hash], hnnode)
 		if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_REPLY].tuple,
 				      &h->tuple) &&
 		    zone == nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)))
@@ -570,7 +570,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	 * guarantee that no other CPU can find the conntrack before the above
 	 * stores are visible.
 	 */
-	__nf_conntrack_hash_insert(ct, hash, repl_hash);
+	__nf_conntrack_hash_insert(ct, hash, reply_hash);
 	NF_CT_STAT_INC(net, insert);
 	spin_unlock_bh(&nf_conntrack_lock);
 
diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c
index 4fd1ca9..da2f84f 100644
--- a/net/netfilter/nf_conntrack_expect.c
+++ b/net/netfilter/nf_conntrack_expect.c
@@ -417,7 +417,7 @@ out:
 	return ret;
 }
 
-int nf_ct_expect_related_report(struct nf_conntrack_expect *expect, 
+int nf_ct_expect_related_report(struct nf_conntrack_expect *expect,
 				u32 portid, int report)
 {
 	int ret;
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 21/38] netfilter: conntrack: spinlock per cpu to protect special lists.
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (19 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 20/38] netfilter: trivial code cleanup and doc changes Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 22/38] netfilter: avoid race with exp->master ct Pablo Neira Ayuso
                   ` (17 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Jesper Dangaard Brouer <brouer@redhat.com>

Use one spinlock per cpu to protect the dying/unconfirmed/template
special lists (these lists are now per cpu, a bit like the untracked ct).
Add a @cpu field to nf_conn, to make sure we hold the appropriate
spinlock at removal time.
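
The resulting insert pattern, condensed into a sketch (the real helpers are
nf_ct_add_to_dying_list() and friends in the nf_conntrack_core.c hunk below;
the function name here is illustrative):

        static void add_to_dying(struct net *net, struct nf_conn *ct)
        {
                struct ct_pcpu *pcpu;

                local_bh_disable();
                ct->cpu = smp_processor_id();   /* remember which lock to take at removal */
                pcpu = per_cpu_ptr(net->ct.pcpu_lists, ct->cpu);

                spin_lock(&pcpu->lock);
                hlist_nulls_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
                                     &pcpu->dying);
                spin_unlock(&pcpu->lock);
                local_bh_enable();
        }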

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_conntrack.h |    3 +-
 include/net/netns/conntrack.h        |   11 ++-
 net/netfilter/nf_conntrack_core.c    |  141 +++++++++++++++++++++++++---------
 net/netfilter/nf_conntrack_helper.c  |   11 ++-
 net/netfilter/nf_conntrack_netlink.c |   81 ++++++++++---------
 5 files changed, 168 insertions(+), 79 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index e10d1fa..37252f7 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -82,7 +82,8 @@ struct nf_conn {
 	 */
 	struct nf_conntrack ct_general;
 
-	spinlock_t lock;
+	spinlock_t	lock;
+	u16		cpu;
 
 	/* XXX should I move this to the tail ? - Y.K */
 	/* These are my tuples; original and reply */
diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
index fbcc7fa..c6a8994 100644
--- a/include/net/netns/conntrack.h
+++ b/include/net/netns/conntrack.h
@@ -62,6 +62,13 @@ struct nf_ip_net {
 #endif
 };
 
+struct ct_pcpu {
+	spinlock_t		lock;
+	struct hlist_nulls_head unconfirmed;
+	struct hlist_nulls_head dying;
+	struct hlist_nulls_head tmpl;
+};
+
 struct netns_ct {
 	atomic_t		count;
 	unsigned int		expect_count;
@@ -86,9 +93,7 @@ struct netns_ct {
 	struct kmem_cache	*nf_conntrack_cachep;
 	struct hlist_nulls_head	*hash;
 	struct hlist_head	*expect_hash;
-	struct hlist_nulls_head	unconfirmed;
-	struct hlist_nulls_head	dying;
-	struct hlist_nulls_head tmpl;
+	struct ct_pcpu __percpu *pcpu_lists;
 	struct ip_conntrack_stat __percpu *stat;
 	struct nf_ct_event_notifier __rcu *nf_conntrack_event_cb;
 	struct nf_exp_event_notifier __rcu *nf_expect_event_cb;
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 965693e..289b279 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -192,6 +192,50 @@ clean_from_lists(struct nf_conn *ct)
 	nf_ct_remove_expectations(ct);
 }
 
+/* must be called with local_bh_disable */
+static void nf_ct_add_to_dying_list(struct nf_conn *ct)
+{
+	struct ct_pcpu *pcpu;
+
+	/* add this conntrack to the (per cpu) dying list */
+	ct->cpu = smp_processor_id();
+	pcpu = per_cpu_ptr(nf_ct_net(ct)->ct.pcpu_lists, ct->cpu);
+
+	spin_lock(&pcpu->lock);
+	hlist_nulls_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
+			     &pcpu->dying);
+	spin_unlock(&pcpu->lock);
+}
+
+/* must be called with local_bh_disable */
+static void nf_ct_add_to_unconfirmed_list(struct nf_conn *ct)
+{
+	struct ct_pcpu *pcpu;
+
+	/* add this conntrack to the (per cpu) unconfirmed list */
+	ct->cpu = smp_processor_id();
+	pcpu = per_cpu_ptr(nf_ct_net(ct)->ct.pcpu_lists, ct->cpu);
+
+	spin_lock(&pcpu->lock);
+	hlist_nulls_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
+			     &pcpu->unconfirmed);
+	spin_unlock(&pcpu->lock);
+}
+
+/* must be called with local_bh_disable */
+static void nf_ct_del_from_dying_or_unconfirmed_list(struct nf_conn *ct)
+{
+	struct ct_pcpu *pcpu;
+
+	/* We overload first tuple to link into unconfirmed or dying list.*/
+	pcpu = per_cpu_ptr(nf_ct_net(ct)->ct.pcpu_lists, ct->cpu);
+
+	spin_lock(&pcpu->lock);
+	BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
+	hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
+	spin_unlock(&pcpu->lock);
+}
+
 static void
 destroy_conntrack(struct nf_conntrack *nfct)
 {
@@ -220,9 +264,7 @@ destroy_conntrack(struct nf_conntrack *nfct)
 	 * too. */
 	nf_ct_remove_expectations(ct);
 
-	/* We overload first tuple to link into unconfirmed or dying list.*/
-	BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
-	hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
+	nf_ct_del_from_dying_or_unconfirmed_list(ct);
 
 	NF_CT_STAT_INC(net, delete);
 	spin_unlock_bh(&nf_conntrack_lock);
@@ -244,9 +286,7 @@ static void nf_ct_delete_from_lists(struct nf_conn *ct)
 	 * Otherwise we can get spurious warnings. */
 	NF_CT_STAT_INC(net, delete_list);
 	clean_from_lists(ct);
-	/* add this conntrack to the dying list */
-	hlist_nulls_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
-			     &net->ct.dying);
+	nf_ct_add_to_dying_list(ct);
 	spin_unlock_bh(&nf_conntrack_lock);
 }
 
@@ -467,15 +507,22 @@ EXPORT_SYMBOL_GPL(nf_conntrack_hash_check_insert);
 /* deletion from this larval template list happens via nf_ct_put() */
 void nf_conntrack_tmpl_insert(struct net *net, struct nf_conn *tmpl)
 {
+	struct ct_pcpu *pcpu;
+
 	__set_bit(IPS_TEMPLATE_BIT, &tmpl->status);
 	__set_bit(IPS_CONFIRMED_BIT, &tmpl->status);
 	nf_conntrack_get(&tmpl->ct_general);
 
-	spin_lock_bh(&nf_conntrack_lock);
+	/* add this conntrack to the (per cpu) tmpl list */
+	local_bh_disable();
+	tmpl->cpu = smp_processor_id();
+	pcpu = per_cpu_ptr(nf_ct_net(tmpl)->ct.pcpu_lists, tmpl->cpu);
+
+	spin_lock(&pcpu->lock);
 	/* Overload tuple linked list to put us in template list. */
 	hlist_nulls_add_head_rcu(&tmpl->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
-				 &net->ct.tmpl);
-	spin_unlock_bh(&nf_conntrack_lock);
+				 &pcpu->tmpl);
+	spin_unlock_bh(&pcpu->lock);
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_tmpl_insert);
 
@@ -546,8 +593,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 		    zone == nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)))
 			goto out;
 
-	/* Remove from unconfirmed list */
-	hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
+	nf_ct_del_from_dying_or_unconfirmed_list(ct);
 
 	/* Timer relative to confirmation time, not original
 	   setting time, otherwise we'd get timer wrap in
@@ -879,10 +925,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
 
 	/* Now it is inserted into the unconfirmed list, bump refcount */
 	nf_conntrack_get(&ct->ct_general);
-
-	/* Overload tuple linked list to put us in unconfirmed list. */
-	hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
-		       &net->ct.unconfirmed);
+	nf_ct_add_to_unconfirmed_list(ct);
 
 	spin_unlock_bh(&nf_conntrack_lock);
 
@@ -1254,6 +1297,7 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data),
 	struct nf_conntrack_tuple_hash *h;
 	struct nf_conn *ct;
 	struct hlist_nulls_node *n;
+	int cpu;
 
 	spin_lock_bh(&nf_conntrack_lock);
 	for (; *bucket < net->ct.htable_size; (*bucket)++) {
@@ -1265,12 +1309,19 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data),
 				goto found;
 		}
 	}
-	hlist_nulls_for_each_entry(h, n, &net->ct.unconfirmed, hnnode) {
-		ct = nf_ct_tuplehash_to_ctrack(h);
-		if (iter(ct, data))
-			set_bit(IPS_DYING_BIT, &ct->status);
-	}
 	spin_unlock_bh(&nf_conntrack_lock);
+
+	for_each_possible_cpu(cpu) {
+		struct ct_pcpu *pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu);
+
+		spin_lock_bh(&pcpu->lock);
+		hlist_nulls_for_each_entry(h, n, &pcpu->unconfirmed, hnnode) {
+			ct = nf_ct_tuplehash_to_ctrack(h);
+			if (iter(ct, data))
+				set_bit(IPS_DYING_BIT, &ct->status);
+		}
+		spin_unlock_bh(&pcpu->lock);
+	}
 	return NULL;
 found:
 	atomic_inc(&ct->ct_general.use);
@@ -1323,14 +1374,19 @@ static void nf_ct_release_dying_list(struct net *net)
 	struct nf_conntrack_tuple_hash *h;
 	struct nf_conn *ct;
 	struct hlist_nulls_node *n;
+	int cpu;
 
-	spin_lock_bh(&nf_conntrack_lock);
-	hlist_nulls_for_each_entry(h, n, &net->ct.dying, hnnode) {
-		ct = nf_ct_tuplehash_to_ctrack(h);
-		/* never fails to remove them, no listeners at this point */
-		nf_ct_kill(ct);
+	for_each_possible_cpu(cpu) {
+		struct ct_pcpu *pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu);
+
+		spin_lock_bh(&pcpu->lock);
+		hlist_nulls_for_each_entry(h, n, &pcpu->dying, hnnode) {
+			ct = nf_ct_tuplehash_to_ctrack(h);
+			/* never fails to remove them, no listeners at this point */
+			nf_ct_kill(ct);
+		}
+		spin_unlock_bh(&pcpu->lock);
 	}
-	spin_unlock_bh(&nf_conntrack_lock);
 }
 
 static int untrack_refs(void)
@@ -1417,6 +1473,7 @@ i_see_dead_people:
 		kmem_cache_destroy(net->ct.nf_conntrack_cachep);
 		kfree(net->ct.slabname);
 		free_percpu(net->ct.stat);
+		free_percpu(net->ct.pcpu_lists);
 	}
 }
 
@@ -1629,37 +1686,43 @@ void nf_conntrack_init_end(void)
 
 int nf_conntrack_init_net(struct net *net)
 {
-	int ret;
+	int ret = -ENOMEM;
+	int cpu;
 
 	atomic_set(&net->ct.count, 0);
-	INIT_HLIST_NULLS_HEAD(&net->ct.unconfirmed, UNCONFIRMED_NULLS_VAL);
-	INIT_HLIST_NULLS_HEAD(&net->ct.dying, DYING_NULLS_VAL);
-	INIT_HLIST_NULLS_HEAD(&net->ct.tmpl, TEMPLATE_NULLS_VAL);
-	net->ct.stat = alloc_percpu(struct ip_conntrack_stat);
-	if (!net->ct.stat) {
-		ret = -ENOMEM;
+
+	net->ct.pcpu_lists = alloc_percpu(struct ct_pcpu);
+	if (!net->ct.pcpu_lists)
 		goto err_stat;
+
+	for_each_possible_cpu(cpu) {
+		struct ct_pcpu *pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu);
+
+		spin_lock_init(&pcpu->lock);
+		INIT_HLIST_NULLS_HEAD(&pcpu->unconfirmed, UNCONFIRMED_NULLS_VAL);
+		INIT_HLIST_NULLS_HEAD(&pcpu->dying, DYING_NULLS_VAL);
+		INIT_HLIST_NULLS_HEAD(&pcpu->tmpl, TEMPLATE_NULLS_VAL);
 	}
 
+	net->ct.stat = alloc_percpu(struct ip_conntrack_stat);
+	if (!net->ct.stat)
+		goto err_pcpu_lists;
+
 	net->ct.slabname = kasprintf(GFP_KERNEL, "nf_conntrack_%p", net);
-	if (!net->ct.slabname) {
-		ret = -ENOMEM;
+	if (!net->ct.slabname)
 		goto err_slabname;
-	}
 
 	net->ct.nf_conntrack_cachep = kmem_cache_create(net->ct.slabname,
 							sizeof(struct nf_conn), 0,
 							SLAB_DESTROY_BY_RCU, NULL);
 	if (!net->ct.nf_conntrack_cachep) {
 		printk(KERN_ERR "Unable to create nf_conn slab cache\n");
-		ret = -ENOMEM;
 		goto err_cache;
 	}
 
 	net->ct.htable_size = nf_conntrack_htable_size;
 	net->ct.hash = nf_ct_alloc_hashtable(&net->ct.htable_size, 1);
 	if (!net->ct.hash) {
-		ret = -ENOMEM;
 		printk(KERN_ERR "Unable to create nf_conntrack_hash\n");
 		goto err_hash;
 	}
@@ -1701,6 +1764,8 @@ err_cache:
 	kfree(net->ct.slabname);
 err_slabname:
 	free_percpu(net->ct.stat);
+err_pcpu_lists:
+	free_percpu(net->ct.pcpu_lists);
 err_stat:
 	return ret;
 }
diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c
index 974a2a4..27d9302 100644
--- a/net/netfilter/nf_conntrack_helper.c
+++ b/net/netfilter/nf_conntrack_helper.c
@@ -396,6 +396,7 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me,
 	const struct hlist_node *next;
 	const struct hlist_nulls_node *nn;
 	unsigned int i;
+	int cpu;
 
 	/* Get rid of expectations */
 	for (i = 0; i < nf_ct_expect_hsize; i++) {
@@ -414,8 +415,14 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me,
 	}
 
 	/* Get rid of expecteds, set helpers to NULL. */
-	hlist_nulls_for_each_entry(h, nn, &net->ct.unconfirmed, hnnode)
-		unhelp(h, me);
+	for_each_possible_cpu(cpu) {
+		struct ct_pcpu *pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu);
+
+		spin_lock_bh(&pcpu->lock);
+		hlist_nulls_for_each_entry(h, nn, &pcpu->unconfirmed, hnnode)
+			unhelp(h, me);
+		spin_unlock_bh(&pcpu->lock);
+	}
 	for (i = 0; i < net->ct.htable_size; i++) {
 		hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode)
 			unhelp(h, me);
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 47e9369..4ac8ce6 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -1137,50 +1137,65 @@ static int ctnetlink_done_list(struct netlink_callback *cb)
 }
 
 static int
-ctnetlink_dump_list(struct sk_buff *skb, struct netlink_callback *cb,
-		    struct hlist_nulls_head *list)
+ctnetlink_dump_list(struct sk_buff *skb, struct netlink_callback *cb, bool dying)
 {
-	struct nf_conn *ct, *last;
+	struct nf_conn *ct, *last = NULL;
 	struct nf_conntrack_tuple_hash *h;
 	struct hlist_nulls_node *n;
 	struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh);
 	u_int8_t l3proto = nfmsg->nfgen_family;
 	int res;
+	int cpu;
+	struct hlist_nulls_head *list;
+	struct net *net = sock_net(skb->sk);
 
 	if (cb->args[2])
 		return 0;
 
-	spin_lock_bh(&nf_conntrack_lock);
-	last = (struct nf_conn *)cb->args[1];
-restart:
-	hlist_nulls_for_each_entry(h, n, list, hnnode) {
-		ct = nf_ct_tuplehash_to_ctrack(h);
-		if (l3proto && nf_ct_l3num(ct) != l3proto)
+	if (cb->args[0] == nr_cpu_ids)
+		return 0;
+
+	for (cpu = cb->args[0]; cpu < nr_cpu_ids; cpu++) {
+		struct ct_pcpu *pcpu;
+
+		if (!cpu_possible(cpu))
 			continue;
-		if (cb->args[1]) {
-			if (ct != last)
+
+		pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu);
+		spin_lock_bh(&pcpu->lock);
+		last = (struct nf_conn *)cb->args[1];
+		list = dying ? &pcpu->dying : &pcpu->unconfirmed;
+restart:
+		hlist_nulls_for_each_entry(h, n, list, hnnode) {
+			ct = nf_ct_tuplehash_to_ctrack(h);
+			if (l3proto && nf_ct_l3num(ct) != l3proto)
 				continue;
-			cb->args[1] = 0;
-		}
-		rcu_read_lock();
-		res = ctnetlink_fill_info(skb, NETLINK_CB(cb->skb).portid,
-					  cb->nlh->nlmsg_seq,
-					  NFNL_MSG_TYPE(cb->nlh->nlmsg_type),
-					  ct);
-		rcu_read_unlock();
-		if (res < 0) {
-			nf_conntrack_get(&ct->ct_general);
-			cb->args[1] = (unsigned long)ct;
-			goto out;
+			if (cb->args[1]) {
+				if (ct != last)
+					continue;
+				cb->args[1] = 0;
+			}
+			rcu_read_lock();
+			res = ctnetlink_fill_info(skb, NETLINK_CB(cb->skb).portid,
+						  cb->nlh->nlmsg_seq,
+						  NFNL_MSG_TYPE(cb->nlh->nlmsg_type),
+						  ct);
+			rcu_read_unlock();
+			if (res < 0) {
+				nf_conntrack_get(&ct->ct_general);
+				cb->args[1] = (unsigned long)ct;
+				spin_unlock_bh(&pcpu->lock);
+				goto out;
+			}
 		}
+		if (cb->args[1]) {
+			cb->args[1] = 0;
+			goto restart;
+		} else
+			cb->args[2] = 1;
+		spin_unlock_bh(&pcpu->lock);
 	}
-	if (cb->args[1]) {
-		cb->args[1] = 0;
-		goto restart;
-	} else
-		cb->args[2] = 1;
 out:
-	spin_unlock_bh(&nf_conntrack_lock);
 	if (last)
 		nf_ct_put(last);
 
@@ -1190,9 +1205,7 @@ out:
 static int
 ctnetlink_dump_dying(struct sk_buff *skb, struct netlink_callback *cb)
 {
-	struct net *net = sock_net(skb->sk);
-
-	return ctnetlink_dump_list(skb, cb, &net->ct.dying);
+	return ctnetlink_dump_list(skb, cb, true);
 }
 
 static int
@@ -1214,9 +1227,7 @@ ctnetlink_get_ct_dying(struct sock *ctnl, struct sk_buff *skb,
 static int
 ctnetlink_dump_unconfirmed(struct sk_buff *skb, struct netlink_callback *cb)
 {
-	struct net *net = sock_net(skb->sk);
-
-	return ctnetlink_dump_list(skb, cb, &net->ct.unconfirmed);
+	return ctnetlink_dump_list(skb, cb, false);
 }
 
 static int
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 22/38] netfilter: avoid race with exp->master ct
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (20 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 21/38] netfilter: conntrack: spinlock per cpu to protect special lists Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 23/38] netfilter: conntrack: separate expect locking from nf_conntrack_lock Pablo Neira Ayuso
                   ` (16 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Jesper Dangaard Brouer <brouer@redhat.com>

Preparation for disconnecting the nf_conntrack_lock from the
expectations code.  Once the nf_conntrack_lock is lifted, a race
condition is exposed.

The expectation's master conntrack, exp->master, can race with
delete operations, as the refcnt increment happens too late in
init_conntrack().  The race is against other CPUs invoking
->destroy() (destroy_conntrack()), or nf_ct_delete() (via timeout
or early_drop()).

Avoid this race in nf_ct_find_expectation() by using atomic_inc_not_zero(),
and checking if nf_ct_is_dying() (path via nf_ct_delete()).
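
The idiom in isolation, as a sketch (the in-tree change lives in
nf_ct_find_expectation(), shown in the hunk below; the wrapper name is
made up):

        /* Take a reference only if the conntrack is not already being
         * destroyed; a refcount of zero means destruction is under way. */
        static struct nf_conn *ct_get_unless_dying(struct nf_conn *ct)
        {
                if (unlikely(nf_ct_is_dying(ct) ||
                             !atomic_inc_not_zero(&ct->ct_general.use)))
                        return NULL;    /* lost the race against deletion */
                return ct;
        }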

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_conntrack_core.c   |    2 +-
 net/netfilter/nf_conntrack_expect.c |   14 ++++++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 289b279..92d5977 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -902,6 +902,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
 			 ct, exp);
 		/* Welcome, Mr. Bond.  We've been expecting you... */
 		__set_bit(IPS_EXPECTED_BIT, &ct->status);
+		/* exp->master safe, refcnt bumped in nf_ct_find_expectation */
 		ct->master = exp->master;
 		if (exp->helper) {
 			help = nf_ct_helper_ext_add(ct, exp->helper,
@@ -916,7 +917,6 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
 #ifdef CONFIG_NF_CONNTRACK_SECMARK
 		ct->secmark = exp->master->secmark;
 #endif
-		nf_conntrack_get(&ct->master->ct_general);
 		NF_CT_STAT_INC(net, expect_new);
 	} else {
 		__nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);
diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c
index da2f84f..f02805e 100644
--- a/net/netfilter/nf_conntrack_expect.c
+++ b/net/netfilter/nf_conntrack_expect.c
@@ -155,6 +155,18 @@ nf_ct_find_expectation(struct net *net, u16 zone,
 	if (!nf_ct_is_confirmed(exp->master))
 		return NULL;
 
+	/* Avoid race with other CPUs, that for exp->master ct, is
+	 * about to invoke ->destroy(), or nf_ct_delete() via timeout
+	 * or early_drop().
+	 *
+	 * The atomic_inc_not_zero() check tells:  If that fails, we
+	 * know that the ct is being destroyed.  If it succeeds, we
+	 * can be sure the ct cannot disappear underneath.
+	 */
+	if (unlikely(nf_ct_is_dying(exp->master) ||
+		     !atomic_inc_not_zero(&exp->master->ct_general.use)))
+		return NULL;
+
 	if (exp->flags & NF_CT_EXPECT_PERMANENT) {
 		atomic_inc(&exp->use);
 		return exp;
@@ -162,6 +174,8 @@ nf_ct_find_expectation(struct net *net, u16 zone,
 		nf_ct_unlink_expect(exp);
 		return exp;
 	}
+	/* Undo exp->master refcnt increase, if del_timer() failed */
+	nf_ct_put(exp->master);
 
 	return NULL;
 }
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 23/38] netfilter: conntrack: seperate expect locking from nf_conntrack_lock
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (21 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 22/38] netfilter: avoid race with exp->master ct Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 24/38] netfilter: conntrack: remove central spinlock nf_conntrack_lock Pablo Neira Ayuso
                   ` (15 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Jesper Dangaard Brouer <brouer@redhat.com>

Netfilter expectations are protected with the same lock as conntrack
entries (nf_conntrack_lock).  This patch splits expectation locking out
into its own lock (nf_conntrack_expect_lock).
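
As a rough userspace analogy (assumed names, not the kernel code), giving
expectations their own lock means expectation updates no longer serialize
against unrelated conntrack work:

#include <pthread.h>

static pthread_mutex_t conntrack_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t expect_lock = PTHREAD_MUTEX_INITIALIZER;

static void expect_update(void)
{
        pthread_mutex_lock(&expect_lock);       /* conntrack_lock stays free */
        /* ... add/remove entries on the expectation lists ... */
        pthread_mutex_unlock(&expect_lock);
}

static void conntrack_update(void)
{
        pthread_mutex_lock(&conntrack_lock);    /* independent of expect_lock */
        /* ... touch the conntrack hash table ... */
        pthread_mutex_unlock(&conntrack_lock);
}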

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_conntrack_core.h |    2 +
 net/netfilter/nf_conntrack_core.c         |   60 ++++++++++++++++-------------
 net/netfilter/nf_conntrack_expect.c       |   20 +++++-----
 net/netfilter/nf_conntrack_h323_main.c    |    4 +-
 net/netfilter/nf_conntrack_helper.c       |   22 +++++------
 net/netfilter/nf_conntrack_netlink.c      |   32 +++++++--------
 net/netfilter/nf_conntrack_sip.c          |    8 ++--
 7 files changed, 79 insertions(+), 69 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index 15308b8..d12a631 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -79,4 +79,6 @@ print_tuple(struct seq_file *s, const struct nf_conntrack_tuple *tuple,
 
 extern spinlock_t nf_conntrack_lock ;
 
+extern spinlock_t nf_conntrack_expect_lock;
+
 #endif /* _NF_CONNTRACK_CORE_H */
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 92d5977..4cdf1ad 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -63,6 +63,9 @@ EXPORT_SYMBOL_GPL(nfnetlink_parse_nat_setup_hook);
 DEFINE_SPINLOCK(nf_conntrack_lock);
 EXPORT_SYMBOL_GPL(nf_conntrack_lock);
 
+__cacheline_aligned_in_smp DEFINE_SPINLOCK(nf_conntrack_expect_lock);
+EXPORT_SYMBOL_GPL(nf_conntrack_expect_lock);
+
 unsigned int nf_conntrack_htable_size __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_htable_size);
 
@@ -247,9 +250,6 @@ destroy_conntrack(struct nf_conntrack *nfct)
 	NF_CT_ASSERT(atomic_read(&nfct->use) == 0);
 	NF_CT_ASSERT(!timer_pending(&ct->timeout));
 
-	/* To make sure we don't get any weird locking issues here:
-	 * destroy_conntrack() MUST NOT be called with a write lock
-	 * to nf_conntrack_lock!!! -HW */
 	rcu_read_lock();
 	l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct));
 	if (l4proto && l4proto->destroy)
@@ -257,17 +257,18 @@ destroy_conntrack(struct nf_conntrack *nfct)
 
 	rcu_read_unlock();
 
-	spin_lock_bh(&nf_conntrack_lock);
+	local_bh_disable();
 	/* Expectations will have been removed in clean_from_lists,
 	 * except TFTP can create an expectation on the first packet,
 	 * before connection is in the list, so we need to clean here,
-	 * too. */
+	 * too.
+	 */
 	nf_ct_remove_expectations(ct);
 
 	nf_ct_del_from_dying_or_unconfirmed_list(ct);
 
 	NF_CT_STAT_INC(net, delete);
-	spin_unlock_bh(&nf_conntrack_lock);
+	local_bh_enable();
 
 	if (ct->master)
 		nf_ct_put(ct->master);
@@ -851,7 +852,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
 	struct nf_conn_help *help;
 	struct nf_conntrack_tuple repl_tuple;
 	struct nf_conntrack_ecache *ecache;
-	struct nf_conntrack_expect *exp;
+	struct nf_conntrack_expect *exp = NULL;
 	u16 zone = tmpl ? nf_ct_zone(tmpl) : NF_CT_DEFAULT_ZONE;
 	struct nf_conn_timeout *timeout_ext;
 	unsigned int *timeouts;
@@ -895,30 +896,35 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
 				 ecache ? ecache->expmask : 0,
 			     GFP_ATOMIC);
 
-	spin_lock_bh(&nf_conntrack_lock);
-	exp = nf_ct_find_expectation(net, zone, tuple);
-	if (exp) {
-		pr_debug("conntrack: expectation arrives ct=%p exp=%p\n",
-			 ct, exp);
-		/* Welcome, Mr. Bond.  We've been expecting you... */
-		__set_bit(IPS_EXPECTED_BIT, &ct->status);
-		/* exp->master safe, refcnt bumped in nf_ct_find_expectation */
-		ct->master = exp->master;
-		if (exp->helper) {
-			help = nf_ct_helper_ext_add(ct, exp->helper,
-						    GFP_ATOMIC);
-			if (help)
-				rcu_assign_pointer(help->helper, exp->helper);
-		}
+	local_bh_disable();
+	if (net->ct.expect_count) {
+		spin_lock(&nf_conntrack_expect_lock);
+		exp = nf_ct_find_expectation(net, zone, tuple);
+		if (exp) {
+			pr_debug("conntrack: expectation arrives ct=%p exp=%p\n",
+				 ct, exp);
+			/* Welcome, Mr. Bond.  We've been expecting you... */
+			__set_bit(IPS_EXPECTED_BIT, &ct->status);
+			/* exp->master safe, refcnt bumped in nf_ct_find_expectation */
+			ct->master = exp->master;
+			if (exp->helper) {
+				help = nf_ct_helper_ext_add(ct, exp->helper,
+							    GFP_ATOMIC);
+				if (help)
+					rcu_assign_pointer(help->helper, exp->helper);
+			}
 
 #ifdef CONFIG_NF_CONNTRACK_MARK
-		ct->mark = exp->master->mark;
+			ct->mark = exp->master->mark;
 #endif
 #ifdef CONFIG_NF_CONNTRACK_SECMARK
-		ct->secmark = exp->master->secmark;
+			ct->secmark = exp->master->secmark;
 #endif
-		NF_CT_STAT_INC(net, expect_new);
-	} else {
+			NF_CT_STAT_INC(net, expect_new);
+		}
+		spin_unlock(&nf_conntrack_expect_lock);
+	}
+	if (!exp) {
 		__nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);
 		NF_CT_STAT_INC(net, new);
 	}
@@ -927,7 +933,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
 	nf_conntrack_get(&ct->ct_general);
 	nf_ct_add_to_unconfirmed_list(ct);
 
-	spin_unlock_bh(&nf_conntrack_lock);
+	local_bh_enable();
 
 	if (exp) {
 		if (exp->expectfn)
diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c
index f02805e..f87e8f6 100644
--- a/net/netfilter/nf_conntrack_expect.c
+++ b/net/netfilter/nf_conntrack_expect.c
@@ -66,9 +66,9 @@ static void nf_ct_expectation_timed_out(unsigned long ul_expect)
 {
 	struct nf_conntrack_expect *exp = (void *)ul_expect;
 
-	spin_lock_bh(&nf_conntrack_lock);
+	spin_lock_bh(&nf_conntrack_expect_lock);
 	nf_ct_unlink_expect(exp);
-	spin_unlock_bh(&nf_conntrack_lock);
+	spin_unlock_bh(&nf_conntrack_expect_lock);
 	nf_ct_expect_put(exp);
 }
 
@@ -191,12 +191,14 @@ void nf_ct_remove_expectations(struct nf_conn *ct)
 	if (!help)
 		return;
 
+	spin_lock_bh(&nf_conntrack_expect_lock);
 	hlist_for_each_entry_safe(exp, next, &help->expectations, lnode) {
 		if (del_timer(&exp->timeout)) {
 			nf_ct_unlink_expect(exp);
 			nf_ct_expect_put(exp);
 		}
 	}
+	spin_unlock_bh(&nf_conntrack_expect_lock);
 }
 EXPORT_SYMBOL_GPL(nf_ct_remove_expectations);
 
@@ -231,12 +233,12 @@ static inline int expect_matches(const struct nf_conntrack_expect *a,
 /* Generally a bad idea to call this: could have matched already. */
 void nf_ct_unexpect_related(struct nf_conntrack_expect *exp)
 {
-	spin_lock_bh(&nf_conntrack_lock);
+	spin_lock_bh(&nf_conntrack_expect_lock);
 	if (del_timer(&exp->timeout)) {
 		nf_ct_unlink_expect(exp);
 		nf_ct_expect_put(exp);
 	}
-	spin_unlock_bh(&nf_conntrack_lock);
+	spin_unlock_bh(&nf_conntrack_expect_lock);
 }
 EXPORT_SYMBOL_GPL(nf_ct_unexpect_related);
 
@@ -349,7 +351,7 @@ static int nf_ct_expect_insert(struct nf_conntrack_expect *exp)
 	setup_timer(&exp->timeout, nf_ct_expectation_timed_out,
 		    (unsigned long)exp);
 	helper = rcu_dereference_protected(master_help->helper,
-					   lockdep_is_held(&nf_conntrack_lock));
+					   lockdep_is_held(&nf_conntrack_expect_lock));
 	if (helper) {
 		exp->timeout.expires = jiffies +
 			helper->expect_policy[exp->class].timeout * HZ;
@@ -409,7 +411,7 @@ static inline int __nf_ct_expect_check(struct nf_conntrack_expect *expect)
 	}
 	/* Will be over limit? */
 	helper = rcu_dereference_protected(master_help->helper,
-					   lockdep_is_held(&nf_conntrack_lock));
+					   lockdep_is_held(&nf_conntrack_expect_lock));
 	if (helper) {
 		p = &helper->expect_policy[expect->class];
 		if (p->max_expected &&
@@ -436,7 +438,7 @@ int nf_ct_expect_related_report(struct nf_conntrack_expect *expect,
 {
 	int ret;
 
-	spin_lock_bh(&nf_conntrack_lock);
+	spin_lock_bh(&nf_conntrack_expect_lock);
 	ret = __nf_ct_expect_check(expect);
 	if (ret <= 0)
 		goto out;
@@ -444,11 +446,11 @@ int nf_ct_expect_related_report(struct nf_conntrack_expect *expect,
 	ret = nf_ct_expect_insert(expect);
 	if (ret < 0)
 		goto out;
-	spin_unlock_bh(&nf_conntrack_lock);
+	spin_unlock_bh(&nf_conntrack_expect_lock);
 	nf_ct_expect_event_report(IPEXP_NEW, expect, portid, report);
 	return ret;
 out:
-	spin_unlock_bh(&nf_conntrack_lock);
+	spin_unlock_bh(&nf_conntrack_expect_lock);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(nf_ct_expect_related_report);
diff --git a/net/netfilter/nf_conntrack_h323_main.c b/net/netfilter/nf_conntrack_h323_main.c
index 70866d1..3a3a60b 100644
--- a/net/netfilter/nf_conntrack_h323_main.c
+++ b/net/netfilter/nf_conntrack_h323_main.c
@@ -1476,7 +1476,7 @@ static int process_rcf(struct sk_buff *skb, struct nf_conn *ct,
 		nf_ct_refresh(ct, skb, info->timeout * HZ);
 
 		/* Set expect timeout */
-		spin_lock_bh(&nf_conntrack_lock);
+		spin_lock_bh(&nf_conntrack_expect_lock);
 		exp = find_expect(ct, &ct->tuplehash[dir].tuple.dst.u3,
 				  info->sig_port[!dir]);
 		if (exp) {
@@ -1486,7 +1486,7 @@ static int process_rcf(struct sk_buff *skb, struct nf_conn *ct,
 			nf_ct_dump_tuple(&exp->tuple);
 			set_expect_timeout(exp, info->timeout);
 		}
-		spin_unlock_bh(&nf_conntrack_lock);
+		spin_unlock_bh(&nf_conntrack_expect_lock);
 	}
 
 	return 0;
diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c
index 27d9302..29bd704 100644
--- a/net/netfilter/nf_conntrack_helper.c
+++ b/net/netfilter/nf_conntrack_helper.c
@@ -250,16 +250,14 @@ out:
 }
 EXPORT_SYMBOL_GPL(__nf_ct_try_assign_helper);
 
+/* appropiate ct lock protecting must be taken by caller */
 static inline int unhelp(struct nf_conntrack_tuple_hash *i,
 			 const struct nf_conntrack_helper *me)
 {
 	struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(i);
 	struct nf_conn_help *help = nfct_help(ct);
 
-	if (help && rcu_dereference_protected(
-			help->helper,
-			lockdep_is_held(&nf_conntrack_lock)
-			) == me) {
+	if (help && rcu_dereference_raw(help->helper) == me) {
 		nf_conntrack_event(IPCT_HELPER, ct);
 		RCU_INIT_POINTER(help->helper, NULL);
 	}
@@ -284,17 +282,17 @@ static LIST_HEAD(nf_ct_helper_expectfn_list);
 
 void nf_ct_helper_expectfn_register(struct nf_ct_helper_expectfn *n)
 {
-	spin_lock_bh(&nf_conntrack_lock);
+	spin_lock_bh(&nf_conntrack_expect_lock);
 	list_add_rcu(&n->head, &nf_ct_helper_expectfn_list);
-	spin_unlock_bh(&nf_conntrack_lock);
+	spin_unlock_bh(&nf_conntrack_expect_lock);
 }
 EXPORT_SYMBOL_GPL(nf_ct_helper_expectfn_register);
 
 void nf_ct_helper_expectfn_unregister(struct nf_ct_helper_expectfn *n)
 {
-	spin_lock_bh(&nf_conntrack_lock);
+	spin_lock_bh(&nf_conntrack_expect_lock);
 	list_del_rcu(&n->head);
-	spin_unlock_bh(&nf_conntrack_lock);
+	spin_unlock_bh(&nf_conntrack_expect_lock);
 }
 EXPORT_SYMBOL_GPL(nf_ct_helper_expectfn_unregister);
 
@@ -399,13 +397,14 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me,
 	int cpu;
 
 	/* Get rid of expectations */
+	spin_lock_bh(&nf_conntrack_expect_lock);
 	for (i = 0; i < nf_ct_expect_hsize; i++) {
 		hlist_for_each_entry_safe(exp, next,
 					  &net->ct.expect_hash[i], hnode) {
 			struct nf_conn_help *help = nfct_help(exp->master);
 			if ((rcu_dereference_protected(
 					help->helper,
-					lockdep_is_held(&nf_conntrack_lock)
+					lockdep_is_held(&nf_conntrack_expect_lock)
 					) == me || exp->helper == me) &&
 			    del_timer(&exp->timeout)) {
 				nf_ct_unlink_expect(exp);
@@ -413,6 +412,7 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me,
 			}
 		}
 	}
+	spin_unlock_bh(&nf_conntrack_expect_lock);
 
 	/* Get rid of expecteds, set helpers to NULL. */
 	for_each_possible_cpu(cpu) {
@@ -423,10 +423,12 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me,
 			unhelp(h, me);
 		spin_unlock_bh(&pcpu->lock);
 	}
+	spin_lock_bh(&nf_conntrack_lock);
 	for (i = 0; i < net->ct.htable_size; i++) {
 		hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode)
 			unhelp(h, me);
 	}
+	spin_unlock_bh(&nf_conntrack_lock);
 }
 
 void nf_conntrack_helper_unregister(struct nf_conntrack_helper *me)
@@ -444,10 +446,8 @@ void nf_conntrack_helper_unregister(struct nf_conntrack_helper *me)
 	synchronize_rcu();
 
 	rtnl_lock();
-	spin_lock_bh(&nf_conntrack_lock);
 	for_each_net(net)
 		__nf_conntrack_helper_unregister(me, net);
-	spin_unlock_bh(&nf_conntrack_lock);
 	rtnl_unlock();
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_helper_unregister);
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 4ac8ce6..be4d1b0 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -1376,14 +1376,14 @@ ctnetlink_change_helper(struct nf_conn *ct, const struct nlattr * const cda[])
 					    nf_ct_protonum(ct));
 	if (helper == NULL) {
 #ifdef CONFIG_MODULES
-		spin_unlock_bh(&nf_conntrack_lock);
+		spin_unlock_bh(&nf_conntrack_expect_lock);
 
 		if (request_module("nfct-helper-%s", helpname) < 0) {
-			spin_lock_bh(&nf_conntrack_lock);
+			spin_lock_bh(&nf_conntrack_expect_lock);
 			return -EOPNOTSUPP;
 		}
 
-		spin_lock_bh(&nf_conntrack_lock);
+		spin_lock_bh(&nf_conntrack_expect_lock);
 		helper = __nf_conntrack_helper_find(helpname, nf_ct_l3num(ct),
 						    nf_ct_protonum(ct));
 		if (helper)
@@ -1821,9 +1821,9 @@ ctnetlink_new_conntrack(struct sock *ctnl, struct sk_buff *skb,
 	err = -EEXIST;
 	ct = nf_ct_tuplehash_to_ctrack(h);
 	if (!(nlh->nlmsg_flags & NLM_F_EXCL)) {
-		spin_lock_bh(&nf_conntrack_lock);
+		spin_lock_bh(&nf_conntrack_expect_lock);
 		err = ctnetlink_change_conntrack(ct, cda);
-		spin_unlock_bh(&nf_conntrack_lock);
+		spin_unlock_bh(&nf_conntrack_expect_lock);
 		if (err == 0) {
 			nf_conntrack_eventmask_report((1 << IPCT_REPLY) |
 						      (1 << IPCT_ASSURED) |
@@ -2152,9 +2152,9 @@ ctnetlink_nfqueue_parse(const struct nlattr *attr, struct nf_conn *ct)
 	if (ret < 0)
 		return ret;
 
-	spin_lock_bh(&nf_conntrack_lock);
+	spin_lock_bh(&nf_conntrack_expect_lock);
 	ret = ctnetlink_nfqueue_parse_ct((const struct nlattr **)cda, ct);
-	spin_unlock_bh(&nf_conntrack_lock);
+	spin_unlock_bh(&nf_conntrack_expect_lock);
 
 	return ret;
 }
@@ -2709,13 +2709,13 @@ ctnetlink_del_expect(struct sock *ctnl, struct sk_buff *skb,
 		}
 
 		/* after list removal, usage count == 1 */
-		spin_lock_bh(&nf_conntrack_lock);
+		spin_lock_bh(&nf_conntrack_expect_lock);
 		if (del_timer(&exp->timeout)) {
 			nf_ct_unlink_expect_report(exp, NETLINK_CB(skb).portid,
 						   nlmsg_report(nlh));
 			nf_ct_expect_put(exp);
 		}
-		spin_unlock_bh(&nf_conntrack_lock);
+		spin_unlock_bh(&nf_conntrack_expect_lock);
 		/* have to put what we 'get' above.
 		 * after this line usage count == 0 */
 		nf_ct_expect_put(exp);
@@ -2724,7 +2724,7 @@ ctnetlink_del_expect(struct sock *ctnl, struct sk_buff *skb,
 		struct nf_conn_help *m_help;
 
 		/* delete all expectations for this helper */
-		spin_lock_bh(&nf_conntrack_lock);
+		spin_lock_bh(&nf_conntrack_expect_lock);
 		for (i = 0; i < nf_ct_expect_hsize; i++) {
 			hlist_for_each_entry_safe(exp, next,
 						  &net->ct.expect_hash[i],
@@ -2739,10 +2739,10 @@ ctnetlink_del_expect(struct sock *ctnl, struct sk_buff *skb,
 				}
 			}
 		}
-		spin_unlock_bh(&nf_conntrack_lock);
+		spin_unlock_bh(&nf_conntrack_expect_lock);
 	} else {
 		/* This basically means we have to flush everything*/
-		spin_lock_bh(&nf_conntrack_lock);
+		spin_lock_bh(&nf_conntrack_expect_lock);
 		for (i = 0; i < nf_ct_expect_hsize; i++) {
 			hlist_for_each_entry_safe(exp, next,
 						  &net->ct.expect_hash[i],
@@ -2755,7 +2755,7 @@ ctnetlink_del_expect(struct sock *ctnl, struct sk_buff *skb,
 				}
 			}
 		}
-		spin_unlock_bh(&nf_conntrack_lock);
+		spin_unlock_bh(&nf_conntrack_expect_lock);
 	}
 
 	return 0;
@@ -2981,11 +2981,11 @@ ctnetlink_new_expect(struct sock *ctnl, struct sk_buff *skb,
 	if (err < 0)
 		return err;
 
-	spin_lock_bh(&nf_conntrack_lock);
+	spin_lock_bh(&nf_conntrack_expect_lock);
 	exp = __nf_ct_expect_find(net, zone, &tuple);
 
 	if (!exp) {
-		spin_unlock_bh(&nf_conntrack_lock);
+		spin_unlock_bh(&nf_conntrack_expect_lock);
 		err = -ENOENT;
 		if (nlh->nlmsg_flags & NLM_F_CREATE) {
 			err = ctnetlink_create_expect(net, zone, cda,
@@ -2999,7 +2999,7 @@ ctnetlink_new_expect(struct sock *ctnl, struct sk_buff *skb,
 	err = -EEXIST;
 	if (!(nlh->nlmsg_flags & NLM_F_EXCL))
 		err = ctnetlink_change_expect(exp, cda);
-	spin_unlock_bh(&nf_conntrack_lock);
+	spin_unlock_bh(&nf_conntrack_expect_lock);
 
 	return err;
 }
diff --git a/net/netfilter/nf_conntrack_sip.c b/net/netfilter/nf_conntrack_sip.c
index 466410e..4c3ba1c 100644
--- a/net/netfilter/nf_conntrack_sip.c
+++ b/net/netfilter/nf_conntrack_sip.c
@@ -800,7 +800,7 @@ static int refresh_signalling_expectation(struct nf_conn *ct,
 	struct hlist_node *next;
 	int found = 0;
 
-	spin_lock_bh(&nf_conntrack_lock);
+	spin_lock_bh(&nf_conntrack_expect_lock);
 	hlist_for_each_entry_safe(exp, next, &help->expectations, lnode) {
 		if (exp->class != SIP_EXPECT_SIGNALLING ||
 		    !nf_inet_addr_cmp(&exp->tuple.dst.u3, addr) ||
@@ -815,7 +815,7 @@ static int refresh_signalling_expectation(struct nf_conn *ct,
 		found = 1;
 		break;
 	}
-	spin_unlock_bh(&nf_conntrack_lock);
+	spin_unlock_bh(&nf_conntrack_expect_lock);
 	return found;
 }
 
@@ -825,7 +825,7 @@ static void flush_expectations(struct nf_conn *ct, bool media)
 	struct nf_conntrack_expect *exp;
 	struct hlist_node *next;
 
-	spin_lock_bh(&nf_conntrack_lock);
+	spin_lock_bh(&nf_conntrack_expect_lock);
 	hlist_for_each_entry_safe(exp, next, &help->expectations, lnode) {
 		if ((exp->class != SIP_EXPECT_SIGNALLING) ^ media)
 			continue;
@@ -836,7 +836,7 @@ static void flush_expectations(struct nf_conn *ct, bool media)
 		if (!media)
 			break;
 	}
-	spin_unlock_bh(&nf_conntrack_lock);
+	spin_unlock_bh(&nf_conntrack_expect_lock);
 }
 
 static int set_expected_rtp_rtcp(struct sk_buff *skb, unsigned int protoff,
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 24/38] netfilter: conntrack: remove central spinlock nf_conntrack_lock
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (22 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 23/38] netfilter: conntrack: seperate expect locking from nf_conntrack_lock Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 25/38] netfilter: nft_hash: bug fixes and resizing Pablo Neira Ayuso
                   ` (14 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Jesper Dangaard Brouer <brouer@redhat.com>

nf_conntrack_lock is a monolithic lock and suffers from huge contention
on current generation servers (8 or more cores/threads).

perf shows the lock contention clearly on the base kernel:

-  72.56%  ksoftirqd/6  [kernel.kallsyms]    [k] _raw_spin_lock_bh
   - _raw_spin_lock_bh
      + 25.33% init_conntrack
      + 24.86% nf_ct_delete_from_lists
      + 24.62% __nf_conntrack_confirm
      + 24.38% destroy_conntrack
      + 0.70% tcp_packet
+   2.21%  ksoftirqd/6  [kernel.kallsyms]    [k] fib_table_lookup
+   1.15%  ksoftirqd/6  [kernel.kallsyms]    [k] __slab_free
+   0.77%  ksoftirqd/6  [kernel.kallsyms]    [k] inet_getpeer
+   0.70%  ksoftirqd/6  [nf_conntrack]       [k] nf_ct_delete
+   0.55%  ksoftirqd/6  [ip_tables]          [k] ipt_do_table

This patch changes conntrack locking and provides a huge performance
improvement.  A SYN-flood attack was tested on a 24-core E5-2695v2 (ES) with
10Gbit/s ixgbe (using the trafgen tool):

 Base kernel:   810,405 new conntrack/sec
 After patch: 2,233,876 new conntrack/sec

Note that other flood attacks (SYN+ACK or ACK) can easily be deflected using:
 # iptables -A INPUT -m state --state INVALID -j DROP
 # sysctl -w net/netfilter/nf_conntrack_tcp_loose=0

Use an array of hashed spinlocks to protect insertions/deletions of
conntracks into the hash table. 1024 spinlocks seem to give good
results, at minimal cost (4KB of memory). Due to the lockdep maximum lock
depth, 1024 becomes 8 if CONFIG_LOCKDEP=y.

The hash resize is a bit tricky, because we need to take all locks in
the array. A seqcount_t is used to synchronize the hash table users
with the resizing process.
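
A minimal userspace sketch of the two building blocks (a hashed lock array
plus a generation count used to detect concurrent resizes); the names and
sizes below are illustrative assumptions, not the kernel implementation:

#include <pthread.h>
#include <stdatomic.h>

#define LOCKS 1024                      /* the patch uses 8 when lockdep is on */

static pthread_spinlock_t locks[LOCKS];
static atomic_uint generation;          /* bumped around every table resize */

static void locks_init(void)
{
        for (int i = 0; i < LOCKS; i++)
                pthread_spin_init(&locks[i], PTHREAD_PROCESS_PRIVATE);
}

/* Take the two bucket locks in a fixed order so that CPUs grabbing
 * (h1, h2) and (h2, h1) cannot deadlock; equal buckets lock once. */
static void double_lock(unsigned int h1, unsigned int h2)
{
        h1 %= LOCKS;
        h2 %= LOCKS;
        if (h1 > h2) {
                unsigned int tmp = h1;
                h1 = h2;
                h2 = tmp;
        }
        pthread_spin_lock(&locks[h1]);
        if (h1 != h2)
                pthread_spin_lock(&locks[h2]);
}

/* Lookups snapshot the generation, compute their hashes, take the locks,
 * and retry from scratch if a resize ran in between. */
static int generation_changed(unsigned int snapshot)
{
        return atomic_load(&generation) != snapshot;
}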

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_conntrack_core.h |    7 +-
 include/net/netns/conntrack.h             |    2 +
 net/netfilter/nf_conntrack_core.c         |  219 +++++++++++++++++++++--------
 net/netfilter/nf_conntrack_helper.c       |   12 +-
 net/netfilter/nf_conntrack_netlink.c      |   15 +-
 5 files changed, 188 insertions(+), 67 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index d12a631..cc0c188 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -77,7 +77,12 @@ print_tuple(struct seq_file *s, const struct nf_conntrack_tuple *tuple,
             const struct nf_conntrack_l3proto *l3proto,
             const struct nf_conntrack_l4proto *proto);
 
-extern spinlock_t nf_conntrack_lock ;
+#ifdef CONFIG_LOCKDEP
+# define CONNTRACK_LOCKS 8
+#else
+# define CONNTRACK_LOCKS 1024
+#endif
+extern spinlock_t nf_conntrack_locks[CONNTRACK_LOCKS];
 
 extern spinlock_t nf_conntrack_expect_lock;
 
diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
index c6a8994..773cce3 100644
--- a/include/net/netns/conntrack.h
+++ b/include/net/netns/conntrack.h
@@ -5,6 +5,7 @@
 #include <linux/list_nulls.h>
 #include <linux/atomic.h>
 #include <linux/netfilter/nf_conntrack_tcp.h>
+#include <linux/seqlock.h>
 
 struct ctl_table_header;
 struct nf_conntrack_ecache;
@@ -90,6 +91,7 @@ struct netns_ct {
 	int			sysctl_checksum;
 
 	unsigned int		htable_size;
+	seqcount_t		generation;
 	struct kmem_cache	*nf_conntrack_cachep;
 	struct hlist_nulls_head	*hash;
 	struct hlist_head	*expect_hash;
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 4cdf1ad..5d1e7d1 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -60,12 +60,60 @@ int (*nfnetlink_parse_nat_setup_hook)(struct nf_conn *ct,
 				      const struct nlattr *attr) __read_mostly;
 EXPORT_SYMBOL_GPL(nfnetlink_parse_nat_setup_hook);
 
-DEFINE_SPINLOCK(nf_conntrack_lock);
-EXPORT_SYMBOL_GPL(nf_conntrack_lock);
+__cacheline_aligned_in_smp spinlock_t nf_conntrack_locks[CONNTRACK_LOCKS];
+EXPORT_SYMBOL_GPL(nf_conntrack_locks);
 
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(nf_conntrack_expect_lock);
 EXPORT_SYMBOL_GPL(nf_conntrack_expect_lock);
 
+static void nf_conntrack_double_unlock(unsigned int h1, unsigned int h2)
+{
+	h1 %= CONNTRACK_LOCKS;
+	h2 %= CONNTRACK_LOCKS;
+	spin_unlock(&nf_conntrack_locks[h1]);
+	if (h1 != h2)
+		spin_unlock(&nf_conntrack_locks[h2]);
+}
+
+/* return true if we need to recompute hashes (in case hash table was resized) */
+static bool nf_conntrack_double_lock(struct net *net, unsigned int h1,
+				     unsigned int h2, unsigned int sequence)
+{
+	h1 %= CONNTRACK_LOCKS;
+	h2 %= CONNTRACK_LOCKS;
+	if (h1 <= h2) {
+		spin_lock(&nf_conntrack_locks[h1]);
+		if (h1 != h2)
+			spin_lock_nested(&nf_conntrack_locks[h2],
+					 SINGLE_DEPTH_NESTING);
+	} else {
+		spin_lock(&nf_conntrack_locks[h2]);
+		spin_lock_nested(&nf_conntrack_locks[h1],
+				 SINGLE_DEPTH_NESTING);
+	}
+	if (read_seqcount_retry(&net->ct.generation, sequence)) {
+		nf_conntrack_double_unlock(h1, h2);
+		return true;
+	}
+	return false;
+}
+
+static void nf_conntrack_all_lock(void)
+{
+	int i;
+
+	for (i = 0; i < CONNTRACK_LOCKS; i++)
+		spin_lock_nested(&nf_conntrack_locks[i], i);
+}
+
+static void nf_conntrack_all_unlock(void)
+{
+	int i;
+
+	for (i = 0; i < CONNTRACK_LOCKS; i++)
+		spin_unlock(&nf_conntrack_locks[i]);
+}
+
 unsigned int nf_conntrack_htable_size __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_htable_size);
 
@@ -280,15 +328,28 @@ destroy_conntrack(struct nf_conntrack *nfct)
 static void nf_ct_delete_from_lists(struct nf_conn *ct)
 {
 	struct net *net = nf_ct_net(ct);
+	unsigned int hash, reply_hash;
+	u16 zone = nf_ct_zone(ct);
+	unsigned int sequence;
 
 	nf_ct_helper_destroy(ct);
-	spin_lock_bh(&nf_conntrack_lock);
-	/* Inside lock so preempt is disabled on module removal path.
-	 * Otherwise we can get spurious warnings. */
-	NF_CT_STAT_INC(net, delete_list);
+
+	local_bh_disable();
+	do {
+		sequence = read_seqcount_begin(&net->ct.generation);
+		hash = hash_conntrack(net, zone,
+				      &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
+		reply_hash = hash_conntrack(net, zone,
+					   &ct->tuplehash[IP_CT_DIR_REPLY].tuple);
+	} while (nf_conntrack_double_lock(net, hash, reply_hash, sequence));
+
 	clean_from_lists(ct);
+	nf_conntrack_double_unlock(hash, reply_hash);
+
 	nf_ct_add_to_dying_list(ct);
-	spin_unlock_bh(&nf_conntrack_lock);
+
+	NF_CT_STAT_INC(net, delete_list);
+	local_bh_enable();
 }
 
 static void death_by_event(unsigned long ul_conntrack)
@@ -372,8 +433,6 @@ nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
  * Warning :
  * - Caller must take a reference on returned object
  *   and recheck nf_ct_tuple_equal(tuple, &h->tuple)
- * OR
- * - Caller must lock nf_conntrack_lock before calling this function
  */
 static struct nf_conntrack_tuple_hash *
 ____nf_conntrack_find(struct net *net, u16 zone,
@@ -467,14 +526,18 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct)
 	struct nf_conntrack_tuple_hash *h;
 	struct hlist_nulls_node *n;
 	u16 zone;
+	unsigned int sequence;
 
 	zone = nf_ct_zone(ct);
-	hash = hash_conntrack(net, zone,
-			      &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
-	reply_hash = hash_conntrack(net, zone,
-				   &ct->tuplehash[IP_CT_DIR_REPLY].tuple);
 
-	spin_lock_bh(&nf_conntrack_lock);
+	local_bh_disable();
+	do {
+		sequence = read_seqcount_begin(&net->ct.generation);
+		hash = hash_conntrack(net, zone,
+				      &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
+		reply_hash = hash_conntrack(net, zone,
+					   &ct->tuplehash[IP_CT_DIR_REPLY].tuple);
+	} while (nf_conntrack_double_lock(net, hash, reply_hash, sequence));
 
 	/* See if there's one in the list already, including reverse */
 	hlist_nulls_for_each_entry(h, n, &net->ct.hash[hash], hnnode)
@@ -493,14 +556,15 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct)
 	/* The caller holds a reference to this object */
 	atomic_set(&ct->ct_general.use, 2);
 	__nf_conntrack_hash_insert(ct, hash, reply_hash);
+	nf_conntrack_double_unlock(hash, reply_hash);
 	NF_CT_STAT_INC(net, insert);
-	spin_unlock_bh(&nf_conntrack_lock);
-
+	local_bh_enable();
 	return 0;
 
 out:
+	nf_conntrack_double_unlock(hash, reply_hash);
 	NF_CT_STAT_INC(net, insert_failed);
-	spin_unlock_bh(&nf_conntrack_lock);
+	local_bh_enable();
 	return -EEXIST;
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_hash_check_insert);
@@ -540,6 +604,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	enum ip_conntrack_info ctinfo;
 	struct net *net;
 	u16 zone;
+	unsigned int sequence;
 
 	ct = nf_ct_get(skb, &ctinfo);
 	net = nf_ct_net(ct);
@@ -552,31 +617,37 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 		return NF_ACCEPT;
 
 	zone = nf_ct_zone(ct);
-	/* reuse the hash saved before */
-	hash = *(unsigned long *)&ct->tuplehash[IP_CT_DIR_REPLY].hnnode.pprev;
-	hash = hash_bucket(hash, net);
-	reply_hash = hash_conntrack(net, zone,
-				   &ct->tuplehash[IP_CT_DIR_REPLY].tuple);
+	local_bh_disable();
+
+	do {
+		sequence = read_seqcount_begin(&net->ct.generation);
+		/* reuse the hash saved before */
+		hash = *(unsigned long *)&ct->tuplehash[IP_CT_DIR_REPLY].hnnode.pprev;
+		hash = hash_bucket(hash, net);
+		reply_hash = hash_conntrack(net, zone,
+					   &ct->tuplehash[IP_CT_DIR_REPLY].tuple);
+
+	} while (nf_conntrack_double_lock(net, hash, reply_hash, sequence));
 
 	/* We're not in hash table, and we refuse to set up related
-	   connections for unconfirmed conns.  But packet copies and
-	   REJECT will give spurious warnings here. */
+	 * connections for unconfirmed conns.  But packet copies and
+	 * REJECT will give spurious warnings here.
+	 */
 	/* NF_CT_ASSERT(atomic_read(&ct->ct_general.use) == 1); */
 
 	/* No external references means no one else could have
-	   confirmed us. */
+	 * confirmed us.
+	 */
 	NF_CT_ASSERT(!nf_ct_is_confirmed(ct));
 	pr_debug("Confirming conntrack %p\n", ct);
-
-	spin_lock_bh(&nf_conntrack_lock);
-
 	/* We have to check the DYING flag inside the lock to prevent
 	   a race against nf_ct_get_next_corpse() possibly called from
 	   user context, else we insert an already 'dead' hash, blocking
 	   further use of that particular connection -JM */
 
 	if (unlikely(nf_ct_is_dying(ct))) {
-		spin_unlock_bh(&nf_conntrack_lock);
+		nf_conntrack_double_unlock(hash, reply_hash);
+		local_bh_enable();
 		return NF_ACCEPT;
 	}
 
@@ -618,8 +689,9 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	 * stores are visible.
 	 */
 	__nf_conntrack_hash_insert(ct, hash, reply_hash);
+	nf_conntrack_double_unlock(hash, reply_hash);
 	NF_CT_STAT_INC(net, insert);
-	spin_unlock_bh(&nf_conntrack_lock);
+	local_bh_enable();
 
 	help = nfct_help(ct);
 	if (help && help->helper)
@@ -630,8 +702,9 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	return NF_ACCEPT;
 
 out:
+	nf_conntrack_double_unlock(hash, reply_hash);
 	NF_CT_STAT_INC(net, insert_failed);
-	spin_unlock_bh(&nf_conntrack_lock);
+	local_bh_enable();
 	return NF_DROP;
 }
 EXPORT_SYMBOL_GPL(__nf_conntrack_confirm);
@@ -674,39 +747,48 @@ EXPORT_SYMBOL_GPL(nf_conntrack_tuple_taken);
 
 /* There's a small race here where we may free a just-assured
    connection.  Too bad: we're in trouble anyway. */
-static noinline int early_drop(struct net *net, unsigned int hash)
+static noinline int early_drop(struct net *net, unsigned int _hash)
 {
 	/* Use oldest entry, which is roughly LRU */
 	struct nf_conntrack_tuple_hash *h;
 	struct nf_conn *ct = NULL, *tmp;
 	struct hlist_nulls_node *n;
-	unsigned int i, cnt = 0;
+	unsigned int i = 0, cnt = 0;
 	int dropped = 0;
+	unsigned int hash, sequence;
+	spinlock_t *lockp;
 
-	rcu_read_lock();
-	for (i = 0; i < net->ct.htable_size; i++) {
+	local_bh_disable();
+restart:
+	sequence = read_seqcount_begin(&net->ct.generation);
+	hash = hash_bucket(_hash, net);
+	for (; i < net->ct.htable_size; i++) {
+		lockp = &nf_conntrack_locks[hash % CONNTRACK_LOCKS];
+		spin_lock(lockp);
+		if (read_seqcount_retry(&net->ct.generation, sequence)) {
+			spin_unlock(lockp);
+			goto restart;
+		}
 		hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash],
 					 hnnode) {
 			tmp = nf_ct_tuplehash_to_ctrack(h);
-			if (!test_bit(IPS_ASSURED_BIT, &tmp->status))
+			if (!test_bit(IPS_ASSURED_BIT, &tmp->status) &&
+			    !nf_ct_is_dying(tmp) &&
+			    atomic_inc_not_zero(&tmp->ct_general.use)) {
 				ct = tmp;
+				break;
+			}
 			cnt++;
 		}
 
-		if (ct != NULL) {
-			if (likely(!nf_ct_is_dying(ct) &&
-				   atomic_inc_not_zero(&ct->ct_general.use)))
-				break;
-			else
-				ct = NULL;
-		}
+		hash = (hash + 1) % net->ct.htable_size;
+		spin_unlock(lockp);
 
-		if (cnt >= NF_CT_EVICTION_RANGE)
+		if (ct || cnt >= NF_CT_EVICTION_RANGE)
 			break;
 
-		hash = (hash + 1) % net->ct.htable_size;
 	}
-	rcu_read_unlock();
+	local_bh_enable();
 
 	if (!ct)
 		return dropped;
@@ -755,7 +837,7 @@ __nf_conntrack_alloc(struct net *net, u16 zone,
 
 	if (nf_conntrack_max &&
 	    unlikely(atomic_read(&net->ct.count) > nf_conntrack_max)) {
-		if (!early_drop(net, hash_bucket(hash, net))) {
+		if (!early_drop(net, hash)) {
 			atomic_dec(&net->ct.count);
 			net_warn_ratelimited("nf_conntrack: table full, dropping packet\n");
 			return ERR_PTR(-ENOMEM);
@@ -1304,18 +1386,24 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data),
 	struct nf_conn *ct;
 	struct hlist_nulls_node *n;
 	int cpu;
+	spinlock_t *lockp;
 
-	spin_lock_bh(&nf_conntrack_lock);
 	for (; *bucket < net->ct.htable_size; (*bucket)++) {
-		hlist_nulls_for_each_entry(h, n, &net->ct.hash[*bucket], hnnode) {
-			if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL)
-				continue;
-			ct = nf_ct_tuplehash_to_ctrack(h);
-			if (iter(ct, data))
-				goto found;
+		lockp = &nf_conntrack_locks[*bucket % CONNTRACK_LOCKS];
+		local_bh_disable();
+		spin_lock(lockp);
+		if (*bucket < net->ct.htable_size) {
+			hlist_nulls_for_each_entry(h, n, &net->ct.hash[*bucket], hnnode) {
+				if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL)
+					continue;
+				ct = nf_ct_tuplehash_to_ctrack(h);
+				if (iter(ct, data))
+					goto found;
+			}
 		}
+		spin_unlock(lockp);
+		local_bh_enable();
 	}
-	spin_unlock_bh(&nf_conntrack_lock);
 
 	for_each_possible_cpu(cpu) {
 		struct ct_pcpu *pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu);
@@ -1331,7 +1419,8 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data),
 	return NULL;
 found:
 	atomic_inc(&ct->ct_general.use);
-	spin_unlock_bh(&nf_conntrack_lock);
+	spin_unlock(lockp);
+	local_bh_enable();
 	return ct;
 }
 
@@ -1532,12 +1621,16 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 	if (!hash)
 		return -ENOMEM;
 
+	local_bh_disable();
+	nf_conntrack_all_lock();
+	write_seqcount_begin(&init_net.ct.generation);
+
 	/* Lookups in the old hash might happen in parallel, which means we
 	 * might get false negatives during connection lookup. New connections
 	 * created because of a false negative won't make it into the hash
-	 * though since that required taking the lock.
+	 * though since that required taking the locks.
 	 */
-	spin_lock_bh(&nf_conntrack_lock);
+
 	for (i = 0; i < init_net.ct.htable_size; i++) {
 		while (!hlist_nulls_empty(&init_net.ct.hash[i])) {
 			h = hlist_nulls_entry(init_net.ct.hash[i].first,
@@ -1554,7 +1647,10 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 
 	init_net.ct.htable_size = nf_conntrack_htable_size = hashsize;
 	init_net.ct.hash = hash;
-	spin_unlock_bh(&nf_conntrack_lock);
+
+	write_seqcount_end(&init_net.ct.generation);
+	nf_conntrack_all_unlock();
+	local_bh_enable();
 
 	nf_ct_free_hashtable(old_hash, old_size);
 	return 0;
@@ -1576,7 +1672,10 @@ EXPORT_SYMBOL_GPL(nf_ct_untracked_status_or);
 int nf_conntrack_init_start(void)
 {
 	int max_factor = 8;
-	int ret, cpu;
+	int i, ret, cpu;
+
+	for (i = 0; i < ARRAY_SIZE(nf_conntrack_locks); i++)
+		spin_lock_init(&nf_conntrack_locks[i]);
 
 	/* Idea from tcp.c: use 1/16384 of memory.  On i386: 32MB
 	 * machine has 512 buckets. >= 1GB machines have 16384 buckets. */
diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c
index 29bd704..5b3eae7 100644
--- a/net/netfilter/nf_conntrack_helper.c
+++ b/net/netfilter/nf_conntrack_helper.c
@@ -423,12 +423,16 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me,
 			unhelp(h, me);
 		spin_unlock_bh(&pcpu->lock);
 	}
-	spin_lock_bh(&nf_conntrack_lock);
+	local_bh_disable();
 	for (i = 0; i < net->ct.htable_size; i++) {
-		hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode)
-			unhelp(h, me);
+		spin_lock(&nf_conntrack_locks[i % CONNTRACK_LOCKS]);
+		if (i < net->ct.htable_size) {
+			hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode)
+				unhelp(h, me);
+		}
+		spin_unlock(&nf_conntrack_locks[i % CONNTRACK_LOCKS]);
 	}
-	spin_unlock_bh(&nf_conntrack_lock);
+	local_bh_enable();
 }
 
 void nf_conntrack_helper_unregister(struct nf_conntrack_helper *me)
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index be4d1b0..8d778a9 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -764,14 +764,23 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
 	struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh);
 	u_int8_t l3proto = nfmsg->nfgen_family;
 	int res;
+	spinlock_t *lockp;
+
 #ifdef CONFIG_NF_CONNTRACK_MARK
 	const struct ctnetlink_dump_filter *filter = cb->data;
 #endif
 
-	spin_lock_bh(&nf_conntrack_lock);
 	last = (struct nf_conn *)cb->args[1];
+
+	local_bh_disable();
 	for (; cb->args[0] < net->ct.htable_size; cb->args[0]++) {
 restart:
+		lockp = &nf_conntrack_locks[cb->args[0] % CONNTRACK_LOCKS];
+		spin_lock(lockp);
+		if (cb->args[0] >= net->ct.htable_size) {
+			spin_unlock(lockp);
+			goto out;
+		}
 		hlist_nulls_for_each_entry(h, n, &net->ct.hash[cb->args[0]],
 					 hnnode) {
 			if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL)
@@ -803,16 +812,18 @@ restart:
 			if (res < 0) {
 				nf_conntrack_get(&ct->ct_general);
 				cb->args[1] = (unsigned long)ct;
+				spin_unlock(lockp);
 				goto out;
 			}
 		}
+		spin_unlock(lockp);
 		if (cb->args[1]) {
 			cb->args[1] = 0;
 			goto restart;
 		}
 	}
 out:
-	spin_unlock_bh(&nf_conntrack_lock);
+	local_bh_enable();
 	if (last)
 		nf_ct_put(last);
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 25/38] netfilter: nft_hash: bug fixes and resizing
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (23 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 24/38] netfilter: conntrack: remove central spinlock nf_conntrack_lock Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 26/38] netfilter: nf_tables: clean up nf_tables_trans_add() argument order Pablo Neira Ayuso
                   ` (13 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Patrick McHardy <kaber@trash.net>

The hash set type is very broken and was never meant to be merged in this
state. Missing RCU synchronization on element removal, leaked chain
refcounts when used as a verdict map, races during lookups and a fixed table
size are probably just some of the problems. Luckily, it is currently
never chosen by the kernel when the rbtree type is also available.

Rewrite it to be usable.

The new implementation supports automatic hash table resizing using RCU,
based on Paul McKenney's and Josh Triplett's algorithm "Optimized Resizing
For RCU-Protected Hash Tables" described in [1].

Resizing doesn't require a second list head in the elements; it works by
choosing a hash function that remaps elements to a predictable set of buckets,
only resizing by integral factors, and:

- during expansion: linking new buckets to the old bucket that contains
  elements for any of the new buckets, thereby creating imprecise chains,
  then incrementally separating the elements until the new buckets only
  contain elements that hash directly to them.

- during shrinking: linking the hash chains of all old buckets that hash
  to the same new bucket to form a single chain.

Expansion requires at most as many grace periods as there are elements in the
longest hash chain; shrinking requires a single grace period.

Due to the requirement of having hash chains/elements linked to multiple
buckets during resizing, homemade singly linked lists are used instead of
the existing list helpers, which don't support this in a clean fashion.
As a side effect, the amount of memory required per element is reduced by
one pointer.

Expansion is triggered when the load factor exceeds 75%, shrinking when
it goes below 30%. Both operations are allowed to fail and
will be retried on the next insertion or removal if their respective
conditions still hold.

[1] http://dl.acm.org/citation.cfm?id=2002181.2002192
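
As a small illustrative check (a sketch, not the nft_hash code itself): with
power-of-two table sizes and h & (size - 1) as the bucket function, an element
in old bucket b can only land in bucket b or b + old_size when the table
doubles, which is what makes the incremental unzipping above possible; the
thresholds match the description:

#include <assert.h>
#include <stdint.h>

static unsigned int bucket(uint32_t hash, unsigned int size)
{
        return hash & (size - 1);       /* size must be a power of two */
}

static void check_remap(uint32_t hash, unsigned int old_size)
{
        unsigned int b  = bucket(hash, old_size);
        unsigned int nb = bucket(hash, old_size * 2);

        /* after doubling, the element lands in b or b + old_size */
        assert(nb == b || nb == b + old_size);
}

/* Resize policy from the changelog: grow above 75% load, shrink below 30%. */
static int should_expand(unsigned int elements, unsigned int size)
{
        return elements > size / 4 * 3;
}

static int should_shrink(unsigned int elements, unsigned int size)
{
        return elements < size * 3 / 10;
}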

Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nft_hash.c |  260 ++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 214 insertions(+), 46 deletions(-)

diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
index 3d3f8fc..6a1acde 100644
--- a/net/netfilter/nft_hash.c
+++ b/net/netfilter/nft_hash.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2008-2009 Patrick McHardy <kaber@trash.net>
+ * Copyright (c) 2008-2014 Patrick McHardy <kaber@trash.net>
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
@@ -18,17 +18,29 @@
 #include <linux/netfilter/nf_tables.h>
 #include <net/netfilter/nf_tables.h>
 
+#define NFT_HASH_MIN_SIZE	4
+
 struct nft_hash {
-	struct hlist_head	*hash;
-	unsigned int		hsize;
+	struct nft_hash_table __rcu	*tbl;
+};
+
+struct nft_hash_table {
+	unsigned int			size;
+	unsigned int			elements;
+	struct nft_hash_elem __rcu	*buckets[];
 };
 
 struct nft_hash_elem {
-	struct hlist_node	hnode;
-	struct nft_data		key;
-	struct nft_data		data[];
+	struct nft_hash_elem __rcu	*next;
+	struct nft_data			key;
+	struct nft_data			data[];
 };
 
+#define nft_hash_for_each_entry(i, head) \
+	for (i = nft_dereference(head); i != NULL; i = nft_dereference(i->next))
+#define nft_hash_for_each_entry_rcu(i, head) \
+	for (i = rcu_dereference(head); i != NULL; i = rcu_dereference(i->next))
+
 static u32 nft_hash_rnd __read_mostly;
 static bool nft_hash_rnd_initted __read_mostly;
 
@@ -38,7 +50,7 @@ static unsigned int nft_hash_data(const struct nft_data *data,
 	unsigned int h;
 
 	h = jhash(data->data, len, nft_hash_rnd);
-	return ((u64)h * hsize) >> 32;
+	return h & (hsize - 1);
 }
 
 static bool nft_hash_lookup(const struct nft_set *set,
@@ -46,11 +58,12 @@ static bool nft_hash_lookup(const struct nft_set *set,
 			    struct nft_data *data)
 {
 	const struct nft_hash *priv = nft_set_priv(set);
+	const struct nft_hash_table *tbl = rcu_dereference(priv->tbl);
 	const struct nft_hash_elem *he;
 	unsigned int h;
 
-	h = nft_hash_data(key, priv->hsize, set->klen);
-	hlist_for_each_entry(he, &priv->hash[h], hnode) {
+	h = nft_hash_data(key, tbl->size, set->klen);
+	nft_hash_for_each_entry_rcu(he, tbl->buckets[h]) {
 		if (nft_data_cmp(&he->key, key, set->klen))
 			continue;
 		if (set->flags & NFT_SET_MAP)
@@ -60,19 +73,148 @@ static bool nft_hash_lookup(const struct nft_set *set,
 	return false;
 }
 
-static void nft_hash_elem_destroy(const struct nft_set *set,
-				  struct nft_hash_elem *he)
+static void nft_hash_tbl_free(const struct nft_hash_table *tbl)
 {
-	nft_data_uninit(&he->key, NFT_DATA_VALUE);
-	if (set->flags & NFT_SET_MAP)
-		nft_data_uninit(he->data, set->dtype);
-	kfree(he);
+	if (is_vmalloc_addr(tbl))
+		vfree(tbl);
+	else
+		kfree(tbl);
+}
+
+static struct nft_hash_table *nft_hash_tbl_alloc(unsigned int nbuckets)
+{
+	struct nft_hash_table *tbl;
+	size_t size;
+
+	size = sizeof(*tbl) + nbuckets * sizeof(tbl->buckets[0]);
+	tbl = kzalloc(size, GFP_KERNEL | __GFP_REPEAT | __GFP_NOWARN);
+	if (tbl == NULL)
+		tbl = vzalloc(size);
+	if (tbl == NULL)
+		return NULL;
+	tbl->size = nbuckets;
+
+	return tbl;
+}
+
+static void nft_hash_chain_unzip(const struct nft_set *set,
+				 const struct nft_hash_table *ntbl,
+				 struct nft_hash_table *tbl, unsigned int n)
+{
+	struct nft_hash_elem *he, *last, *next;
+	unsigned int h;
+
+	he = nft_dereference(tbl->buckets[n]);
+	if (he == NULL)
+		return;
+	h = nft_hash_data(&he->key, ntbl->size, set->klen);
+
+	/* Find last element of first chain hashing to bucket h */
+	last = he;
+	nft_hash_for_each_entry(he, he->next) {
+		if (nft_hash_data(&he->key, ntbl->size, set->klen) != h)
+			break;
+		last = he;
+	}
+
+	/* Unlink first chain from the old table */
+	RCU_INIT_POINTER(tbl->buckets[n], last->next);
+
+	/* If end of chain reached, done */
+	if (he == NULL)
+		return;
+
+	/* Find first element of second chain hashing to bucket h */
+	next = NULL;
+	nft_hash_for_each_entry(he, he->next) {
+		if (nft_hash_data(&he->key, ntbl->size, set->klen) != h)
+			continue;
+		next = he;
+		break;
+	}
+
+	/* Link the two chains */
+	RCU_INIT_POINTER(last->next, next);
+}
+
+static int nft_hash_tbl_expand(const struct nft_set *set, struct nft_hash *priv)
+{
+	struct nft_hash_table *tbl = nft_dereference(priv->tbl), *ntbl;
+	struct nft_hash_elem *he;
+	unsigned int i, h;
+	bool complete;
+
+	ntbl = nft_hash_tbl_alloc(tbl->size * 2);
+	if (ntbl == NULL)
+		return -ENOMEM;
+
+	/* Link new table's buckets to first element in the old table
+	 * hashing to the new bucket.
+	 */
+	for (i = 0; i < ntbl->size; i++) {
+		h = i < tbl->size ? i : i - tbl->size;
+		nft_hash_for_each_entry(he, tbl->buckets[h]) {
+			if (nft_hash_data(&he->key, ntbl->size, set->klen) != i)
+				continue;
+			RCU_INIT_POINTER(ntbl->buckets[i], he);
+			break;
+		}
+	}
+	ntbl->elements = tbl->elements;
+
+	/* Publish new table */
+	rcu_assign_pointer(priv->tbl, ntbl);
+
+	/* Unzip interleaved hash chains */
+	do {
+		/* Wait for readers to use new table/unzipped chains */
+		synchronize_rcu();
+
+		complete = true;
+		for (i = 0; i < tbl->size; i++) {
+			nft_hash_chain_unzip(set, ntbl, tbl, i);
+			if (tbl->buckets[i] != NULL)
+				complete = false;
+		}
+	} while (!complete);
+
+	nft_hash_tbl_free(tbl);
+	return 0;
+}
+
+static int nft_hash_tbl_shrink(const struct nft_set *set, struct nft_hash *priv)
+{
+	struct nft_hash_table *tbl = nft_dereference(priv->tbl), *ntbl;
+	struct nft_hash_elem __rcu **pprev;
+	unsigned int i;
+
+	ntbl = nft_hash_tbl_alloc(tbl->size / 2);
+	if (ntbl == NULL)
+		return -ENOMEM;
+
+	for (i = 0; i < ntbl->size; i++) {
+		ntbl->buckets[i] = tbl->buckets[i];
+
+		for (pprev = &ntbl->buckets[i]; *pprev != NULL;
+		     pprev = &nft_dereference(*pprev)->next)
+			;
+		RCU_INIT_POINTER(*pprev, tbl->buckets[i + ntbl->size]);
+	}
+	ntbl->elements = tbl->elements;
+
+	/* Publish new table */
+	rcu_assign_pointer(priv->tbl, ntbl);
+	synchronize_rcu();
+
+	nft_hash_tbl_free(tbl);
+	return 0;
 }
 
 static int nft_hash_insert(const struct nft_set *set,
 			   const struct nft_set_elem *elem)
 {
 	struct nft_hash *priv = nft_set_priv(set);
+	struct nft_hash_table *tbl = nft_dereference(priv->tbl);
 	struct nft_hash_elem *he;
 	unsigned int size, h;
 
@@ -91,33 +233,66 @@ static int nft_hash_insert(const struct nft_set *set,
 	if (set->flags & NFT_SET_MAP)
 		nft_data_copy(he->data, &elem->data);
 
-	h = nft_hash_data(&he->key, priv->hsize, set->klen);
-	hlist_add_head_rcu(&he->hnode, &priv->hash[h]);
+	h = nft_hash_data(&he->key, tbl->size, set->klen);
+	RCU_INIT_POINTER(he->next, tbl->buckets[h]);
+	rcu_assign_pointer(tbl->buckets[h], he);
+	tbl->elements++;
+
+	/* Expand table when exceeding 75% load */
+	if (tbl->elements > tbl->size / 4 * 3)
+		nft_hash_tbl_expand(set, priv);
+
 	return 0;
 }
 
+static void nft_hash_elem_destroy(const struct nft_set *set,
+				  struct nft_hash_elem *he)
+{
+	nft_data_uninit(&he->key, NFT_DATA_VALUE);
+	if (set->flags & NFT_SET_MAP)
+		nft_data_uninit(he->data, set->dtype);
+	kfree(he);
+}
+
 static void nft_hash_remove(const struct nft_set *set,
 			    const struct nft_set_elem *elem)
 {
-	struct nft_hash_elem *he = elem->cookie;
+	struct nft_hash *priv = nft_set_priv(set);
+	struct nft_hash_table *tbl = nft_dereference(priv->tbl);
+	struct nft_hash_elem *he, __rcu **pprev;
 
-	hlist_del_rcu(&he->hnode);
+	pprev = elem->cookie;
+	he = nft_dereference((*pprev));
+
+	RCU_INIT_POINTER(*pprev, he->next);
+	synchronize_rcu();
 	kfree(he);
+	tbl->elements--;
+
+	/* Shrink table beneath 30% load */
+	if (tbl->elements < tbl->size * 3 / 10 &&
+	    tbl->size > NFT_HASH_MIN_SIZE)
+		nft_hash_tbl_shrink(set, priv);
 }
 
 static int nft_hash_get(const struct nft_set *set, struct nft_set_elem *elem)
 {
 	const struct nft_hash *priv = nft_set_priv(set);
+	const struct nft_hash_table *tbl = nft_dereference(priv->tbl);
+	struct nft_hash_elem __rcu * const *pprev;
 	struct nft_hash_elem *he;
 	unsigned int h;
 
-	h = nft_hash_data(&elem->key, priv->hsize, set->klen);
-	hlist_for_each_entry(he, &priv->hash[h], hnode) {
-		if (nft_data_cmp(&he->key, &elem->key, set->klen))
+	h = nft_hash_data(&elem->key, tbl->size, set->klen);
+	pprev = &tbl->buckets[h];
+	nft_hash_for_each_entry(he, tbl->buckets[h]) {
+		if (nft_data_cmp(&he->key, &elem->key, set->klen)) {
+			pprev = &he->next;
 			continue;
+		}
 
-		elem->cookie = he;
-		elem->flags  = 0;
+		elem->cookie = (void *)pprev;
+		elem->flags = 0;
 		if (set->flags & NFT_SET_MAP)
 			nft_data_copy(&elem->data, he->data);
 		return 0;
@@ -129,12 +304,13 @@ static void nft_hash_walk(const struct nft_ctx *ctx, const struct nft_set *set,
 			  struct nft_set_iter *iter)
 {
 	const struct nft_hash *priv = nft_set_priv(set);
+	const struct nft_hash_table *tbl = nft_dereference(priv->tbl);
 	const struct nft_hash_elem *he;
 	struct nft_set_elem elem;
 	unsigned int i;
 
-	for (i = 0; i < priv->hsize; i++) {
-		hlist_for_each_entry(he, &priv->hash[i], hnode) {
+	for (i = 0; i < tbl->size; i++) {
+		nft_hash_for_each_entry(he, tbl->buckets[i]) {
 			if (iter->count < iter->skip)
 				goto cont;
 
@@ -161,43 +337,35 @@ static int nft_hash_init(const struct nft_set *set,
 			 const struct nlattr * const tb[])
 {
 	struct nft_hash *priv = nft_set_priv(set);
-	unsigned int cnt, i;
+	struct nft_hash_table *tbl;
 
 	if (unlikely(!nft_hash_rnd_initted)) {
 		get_random_bytes(&nft_hash_rnd, 4);
 		nft_hash_rnd_initted = true;
 	}
 
-	/* Aim for a load factor of 0.75 */
-	// FIXME: temporarily broken until we have set descriptions
-	cnt = 100;
-	cnt = cnt * 4 / 3;
-
-	priv->hash = kcalloc(cnt, sizeof(struct hlist_head), GFP_KERNEL);
-	if (priv->hash == NULL)
+	tbl = nft_hash_tbl_alloc(NFT_HASH_MIN_SIZE);
+	if (tbl == NULL)
 		return -ENOMEM;
-	priv->hsize = cnt;
-
-	for (i = 0; i < cnt; i++)
-		INIT_HLIST_HEAD(&priv->hash[i]);
-
+	RCU_INIT_POINTER(priv->tbl, tbl);
 	return 0;
 }
 
 static void nft_hash_destroy(const struct nft_set *set)
 {
 	const struct nft_hash *priv = nft_set_priv(set);
-	const struct hlist_node *next;
-	struct nft_hash_elem *elem;
+	const struct nft_hash_table *tbl = nft_dereference(priv->tbl);
+	struct nft_hash_elem *he, *next;
 	unsigned int i;
 
-	for (i = 0; i < priv->hsize; i++) {
-		hlist_for_each_entry_safe(elem, next, &priv->hash[i], hnode) {
-			hlist_del(&elem->hnode);
-			nft_hash_elem_destroy(set, elem);
+	for (i = 0; i < tbl->size; i++) {
+		for (he = nft_dereference(tbl->buckets[i]); he != NULL;
+		     he = next) {
+			next = nft_dereference(he->next);
+			nft_hash_elem_destroy(set, he);
 		}
 	}
-	kfree(priv->hash);
+	kfree(tbl);
 }
 
 static struct nft_set_ops nft_hash_ops __read_mostly = {
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 26/38] netfilter: nf_tables: clean up nf_tables_trans_add() argument order
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (24 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 25/38] netfilter: nft_hash: bug fixes and resizing Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 27/38] netfilter: nf_tables: restore context for expression destructors Pablo Neira Ayuso
                   ` (12 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Patrick McHardy <kaber@trash.net>

The context argument logically comes first, which is how every other
function dealing with contexts takes it.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_tables_api.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index f25d011..611afc0 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -1557,7 +1557,7 @@ static void nf_tables_rule_destroy(struct nft_rule *rule)
 static struct nft_expr_info *info;
 
 static struct nft_rule_trans *
-nf_tables_trans_add(struct nft_rule *rule, const struct nft_ctx *ctx)
+nf_tables_trans_add(struct nft_ctx *ctx, struct nft_rule *rule)
 {
 	struct nft_rule_trans *rupd;
 
@@ -1683,7 +1683,7 @@ static int nf_tables_newrule(struct sock *nlsk, struct sk_buff *skb,
 
 	if (nlh->nlmsg_flags & NLM_F_REPLACE) {
 		if (nft_rule_is_active_next(net, old_rule)) {
-			repl = nf_tables_trans_add(old_rule, &ctx);
+			repl = nf_tables_trans_add(&ctx, old_rule);
 			if (repl == NULL) {
 				err = -ENOMEM;
 				goto err2;
@@ -1706,7 +1706,7 @@ static int nf_tables_newrule(struct sock *nlsk, struct sk_buff *skb,
 			list_add_rcu(&rule->list, &chain->rules);
 	}
 
-	if (nf_tables_trans_add(rule, &ctx) == NULL) {
+	if (nf_tables_trans_add(&ctx, rule) == NULL) {
 		err = -ENOMEM;
 		goto err3;
 	}
@@ -1735,7 +1735,7 @@ nf_tables_delrule_one(struct nft_ctx *ctx, struct nft_rule *rule)
 {
 	/* You cannot delete the same rule twice */
 	if (nft_rule_is_active_next(ctx->net, rule)) {
-		if (nf_tables_trans_add(rule, ctx) == NULL)
+		if (nf_tables_trans_add(ctx, rule) == NULL)
 			return -ENOMEM;
 		nft_rule_disactivate_next(ctx->net, rule);
 		return 0;
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 27/38] netfilter: nf_tables: restore context for expression destructors
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (25 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 26/38] netfilter: nf_tables: clean up nf_tables_trans_add() argument order Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 28/38] netfilter: nf_tables: restore notifications for anonymous set destruction Pablo Neira Ayuso
                   ` (11 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Patrick McHardy <kaber@trash.net>

In order to fix set destruction notifications and get rid of unnecessary
members in private data structures, pass the context to expressions'
destructor functions again.

To do so, replace various members in the nft_rule_trans structure
with the full context.
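
A tiny sketch of the refactoring pattern with made-up types (not the actual
nf_tables structures): embed the whole context in the transaction record
instead of copying selected fields:

struct rule;                            /* opaque for this sketch */

struct ctx {
        int family;
        const void *table;
        const void *chain;
};

struct rule_trans {
        struct ctx ctx;                 /* full context replaces per-field copies */
        struct rule *rule;
};

static void trans_init(struct rule_trans *t, const struct ctx *ctx,
                       struct rule *rule)
{
        t->ctx = *ctx;                  /* one struct assignment instead of N members */
        t->rule = rule;
}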

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_tables.h |   13 ++++---------
 net/netfilter/nf_tables_api.c     |   34 +++++++++++++++++-----------------
 net/netfilter/nft_compat.c        |    4 ++--
 net/netfilter/nft_ct.c            |    3 ++-
 net/netfilter/nft_immediate.c     |    3 ++-
 net/netfilter/nft_log.c           |    3 ++-
 net/netfilter/nft_lookup.c        |    3 ++-
 7 files changed, 31 insertions(+), 32 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 5af56da..e6bc14d 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -289,7 +289,8 @@ struct nft_expr_ops {
 	int				(*init)(const struct nft_ctx *ctx,
 						const struct nft_expr *expr,
 						const struct nlattr * const tb[]);
-	void				(*destroy)(const struct nft_expr *expr);
+	void				(*destroy)(const struct nft_ctx *ctx,
+						   const struct nft_expr *expr);
 	int				(*dump)(struct sk_buff *skb,
 						const struct nft_expr *expr);
 	int				(*validate)(const struct nft_ctx *ctx,
@@ -343,19 +344,13 @@ struct nft_rule {
  *	struct nft_rule_trans - nf_tables rule update in transaction
  *
  *	@list: used internally
+ *	@ctx: rule context
  *	@rule: rule that needs to be updated
- *	@chain: chain that this rule belongs to
- *	@table: table for which this chain applies
- *	@nlh: netlink header of the message that contain this update
- *	@family: family expressesed as AF_*
  */
 struct nft_rule_trans {
 	struct list_head		list;
+	struct nft_ctx			ctx;
 	struct nft_rule			*rule;
-	const struct nft_chain		*chain;
-	const struct nft_table		*table;
-	const struct nlmsghdr		*nlh;
-	u8				family;
 };
 
 static inline struct nft_expr *nft_expr_first(const struct nft_rule *rule)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 611afc0..2c10c3f 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -1253,10 +1253,11 @@ err1:
 	return err;
 }
 
-static void nf_tables_expr_destroy(struct nft_expr *expr)
+static void nf_tables_expr_destroy(const struct nft_ctx *ctx,
+				   struct nft_expr *expr)
 {
 	if (expr->ops->destroy)
-		expr->ops->destroy(expr);
+		expr->ops->destroy(ctx, expr);
 	module_put(expr->ops->type->owner);
 }
 
@@ -1536,7 +1537,8 @@ err:
 	return err;
 }
 
-static void nf_tables_rule_destroy(struct nft_rule *rule)
+static void nf_tables_rule_destroy(const struct nft_ctx *ctx,
+				   struct nft_rule *rule)
 {
 	struct nft_expr *expr;
 
@@ -1546,7 +1548,7 @@ static void nf_tables_rule_destroy(struct nft_rule *rule)
 	 */
 	expr = nft_expr_first(rule);
 	while (expr->ops && expr != nft_expr_last(rule)) {
-		nf_tables_expr_destroy(expr);
+		nf_tables_expr_destroy(ctx, expr);
 		expr = nft_expr_next(expr);
 	}
 	kfree(rule);
@@ -1565,11 +1567,8 @@ nf_tables_trans_add(struct nft_ctx *ctx, struct nft_rule *rule)
 	if (rupd == NULL)
 	       return NULL;
 
-	rupd->chain = ctx->chain;
-	rupd->table = ctx->table;
+	rupd->ctx = *ctx;
 	rupd->rule = rule;
-	rupd->family = ctx->afi->family;
-	rupd->nlh = ctx->nlh;
 	list_add_tail(&rupd->list, &ctx->net->nft.commit_list);
 
 	return rupd;
@@ -1721,7 +1720,7 @@ err3:
 		kfree(repl);
 	}
 err2:
-	nf_tables_rule_destroy(rule);
+	nf_tables_rule_destroy(&ctx, rule);
 err1:
 	for (i = 0; i < n; i++) {
 		if (info[i].ops != NULL)
@@ -1831,10 +1830,10 @@ static int nf_tables_commit(struct sk_buff *skb)
 		 */
 		if (nft_rule_is_active(net, rupd->rule)) {
 			nft_rule_clear(net, rupd->rule);
-			nf_tables_rule_notify(skb, rupd->nlh, rupd->table,
-					      rupd->chain, rupd->rule,
-					      NFT_MSG_NEWRULE, 0,
-					      rupd->family);
+			nf_tables_rule_notify(skb, rupd->ctx.nlh,
+					      rupd->ctx.table, rupd->ctx.chain,
+					      rupd->rule, NFT_MSG_NEWRULE, 0,
+					      rupd->ctx.afi->family);
 			list_del(&rupd->list);
 			kfree(rupd);
 			continue;
@@ -1842,9 +1841,10 @@ static int nf_tables_commit(struct sk_buff *skb)
 
 		/* This rule is in the past, get rid of it */
 		list_del_rcu(&rupd->rule->list);
-		nf_tables_rule_notify(skb, rupd->nlh, rupd->table, rupd->chain,
+		nf_tables_rule_notify(skb, rupd->ctx.nlh,
+				      rupd->ctx.table, rupd->ctx.chain,
 				      rupd->rule, NFT_MSG_DELRULE, 0,
-				      rupd->family);
+				      rupd->ctx.afi->family);
 	}
 
 	/* Make sure we don't see any packet traversing old rules */
@@ -1852,7 +1852,7 @@ static int nf_tables_commit(struct sk_buff *skb)
 
 	/* Now we can safely release unused old rules */
 	list_for_each_entry_safe(rupd, tmp, &net->nft.commit_list, list) {
-		nf_tables_rule_destroy(rupd->rule);
+		nf_tables_rule_destroy(&rupd->ctx, rupd->rule);
 		list_del(&rupd->list);
 		kfree(rupd);
 	}
@@ -1881,7 +1881,7 @@ static int nf_tables_abort(struct sk_buff *skb)
 	synchronize_rcu();
 
 	list_for_each_entry_safe(rupd, tmp, &net->nft.commit_list, list) {
-		nf_tables_rule_destroy(rupd->rule);
+		nf_tables_rule_destroy(&rupd->ctx, rupd->rule);
 		list_del(&rupd->list);
 		kfree(rupd);
 	}
diff --git a/net/netfilter/nft_compat.c b/net/netfilter/nft_compat.c
index 82cb823..8a779be 100644
--- a/net/netfilter/nft_compat.c
+++ b/net/netfilter/nft_compat.c
@@ -192,7 +192,7 @@ err:
 }
 
 static void
-nft_target_destroy(const struct nft_expr *expr)
+nft_target_destroy(const struct nft_ctx *ctx, const struct nft_expr *expr)
 {
 	struct xt_target *target = expr->ops->data;
 
@@ -379,7 +379,7 @@ err:
 }
 
 static void
-nft_match_destroy(const struct nft_expr *expr)
+nft_match_destroy(const struct nft_ctx *ctx, const struct nft_expr *expr)
 {
 	struct xt_match *match = expr->ops->data;
 
diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index e59b08f..65a2c7b 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -321,7 +321,8 @@ static int nft_ct_init(const struct nft_ctx *ctx,
 	return 0;
 }
 
-static void nft_ct_destroy(const struct nft_expr *expr)
+static void nft_ct_destroy(const struct nft_ctx *ctx,
+			   const struct nft_expr *expr)
 {
 	struct nft_ct *priv = nft_expr_priv(expr);
 
diff --git a/net/netfilter/nft_immediate.c b/net/netfilter/nft_immediate.c
index f169501..810385e 100644
--- a/net/netfilter/nft_immediate.c
+++ b/net/netfilter/nft_immediate.c
@@ -70,7 +70,8 @@ err1:
 	return err;
 }
 
-static void nft_immediate_destroy(const struct nft_expr *expr)
+static void nft_immediate_destroy(const struct nft_ctx *ctx,
+				  const struct nft_expr *expr)
 {
 	const struct nft_immediate_expr *priv = nft_expr_priv(expr);
 	return nft_data_uninit(&priv->data, nft_dreg_to_type(priv->dreg));
diff --git a/net/netfilter/nft_log.c b/net/netfilter/nft_log.c
index 26c5154..10cfb15 100644
--- a/net/netfilter/nft_log.c
+++ b/net/netfilter/nft_log.c
@@ -74,7 +74,8 @@ static int nft_log_init(const struct nft_ctx *ctx,
 	return 0;
 }
 
-static void nft_log_destroy(const struct nft_expr *expr)
+static void nft_log_destroy(const struct nft_ctx *ctx,
+			    const struct nft_expr *expr)
 {
 	struct nft_log *priv = nft_expr_priv(expr);
 
diff --git a/net/netfilter/nft_lookup.c b/net/netfilter/nft_lookup.c
index bb4ef4c..953978e 100644
--- a/net/netfilter/nft_lookup.c
+++ b/net/netfilter/nft_lookup.c
@@ -89,7 +89,8 @@ static int nft_lookup_init(const struct nft_ctx *ctx,
 	return 0;
 }
 
-static void nft_lookup_destroy(const struct nft_expr *expr)
+static void nft_lookup_destroy(const struct nft_ctx *ctx,
+			       const struct nft_expr *expr)
 {
 	struct nft_lookup *priv = nft_expr_priv(expr);
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 28/38] netfilter: nf_tables: restore notifications for anonymous set destruction
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (26 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 27/38] netfilter: nf_tables: restore context for expression destructors Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 29/38] netfilter: nft_ct: remove family from struct nft_ct Pablo Neira Ayuso
                   ` (10 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Patrick McHardy <kaber@trash.net>

Since we have the context available again, we can restore notifications
for destruction of anonymous sets.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_tables_api.c |    3 +--
 net/netfilter/nft_lookup.c    |    2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 2c10c3f..33045a5 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -2442,8 +2442,7 @@ err1:
 static void nf_tables_set_destroy(const struct nft_ctx *ctx, struct nft_set *set)
 {
 	list_del(&set->list);
-	if (!(set->flags & NFT_SET_ANONYMOUS))
-		nf_tables_set_notify(ctx, set, NFT_MSG_DELSET);
+	nf_tables_set_notify(ctx, set, NFT_MSG_DELSET);
 
 	set->ops->destroy(set);
 	module_put(set->ops->owner);
diff --git a/net/netfilter/nft_lookup.c b/net/netfilter/nft_lookup.c
index 953978e..7fd2bea 100644
--- a/net/netfilter/nft_lookup.c
+++ b/net/netfilter/nft_lookup.c
@@ -94,7 +94,7 @@ static void nft_lookup_destroy(const struct nft_ctx *ctx,
 {
 	struct nft_lookup *priv = nft_expr_priv(expr);
 
-	nf_tables_unbind_set(NULL, priv->set, &priv->binding);
+	nf_tables_unbind_set(ctx, priv->set, &priv->binding);
 }
 
 static int nft_lookup_dump(struct sk_buff *skb, const struct nft_expr *expr)
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 29/38] netfilter: nft_ct: remove family from struct nft_ct
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (27 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 28/38] netfilter: nf_tables: restore notifications for anonymous set destruction Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 30/38] netfilter: nft_nat: fix family validation Pablo Neira Ayuso
                   ` (9 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Patrick McHardy <kaber@trash.net>

Since we have the context available during destruction again, we can
remove the family from the private structure.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nft_ct.c |    9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index 65a2c7b..bd0d41e 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -24,11 +24,10 @@
 struct nft_ct {
 	enum nft_ct_keys	key:8;
 	enum ip_conntrack_dir	dir:8;
-	union{
+	union {
 		enum nft_registers	dreg:8;
 		enum nft_registers	sreg:8;
 	};
-	uint8_t			family;
 };
 
 static void nft_ct_get_eval(const struct nft_expr *expr,
@@ -316,17 +315,13 @@ static int nft_ct_init(const struct nft_ctx *ctx,
 	if (err < 0)
 		return err;
 
-	priv->family = ctx->afi->family;
-
 	return 0;
 }
 
 static void nft_ct_destroy(const struct nft_ctx *ctx,
 			   const struct nft_expr *expr)
 {
-	struct nft_ct *priv = nft_expr_priv(expr);
-
-	nft_ct_l3proto_module_put(priv->family);
+	nft_ct_l3proto_module_put(ctx->afi->family);
 }
 
 static int nft_ct_get_dump(struct sk_buff *skb, const struct nft_expr *expr)
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 30/38] netfilter: nft_nat: fix family validation
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (28 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 29/38] netfilter: nft_ct: remove family from struct nft_ct Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 31/38] netfilter: connlimit: factor hlist search into new function Pablo Neira Ayuso
                   ` (8 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Patrick McHardy <kaber@trash.net>

The family in the NAT expression is basically completely useless since
we have it available during runtime anyway. Nevertheless it is used to
decide the NAT family, so at least validate it properly. As we don't
support cross-family NAT, it needs to match the family of the table the
expression exists in.

Unfortunately we can't remove it completely since we need to dump it for
userspace (*sigh*), so at least reduce the memory waste.

Additionally clean up the module init function by removing useless
temporary variables.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nft_nat.c |   22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/net/netfilter/nft_nat.c b/net/netfilter/nft_nat.c
index d3b1ffe..a0195d2 100644
--- a/net/netfilter/nft_nat.c
+++ b/net/netfilter/nft_nat.c
@@ -31,8 +31,8 @@ struct nft_nat {
 	enum nft_registers      sreg_addr_max:8;
 	enum nft_registers      sreg_proto_min:8;
 	enum nft_registers      sreg_proto_max:8;
-	int                     family;
-	enum nf_nat_manip_type  type;
+	enum nf_nat_manip_type  type:8;
+	u8			family;
 };
 
 static void nft_nat_eval(const struct nft_expr *expr,
@@ -88,6 +88,7 @@ static int nft_nat_init(const struct nft_ctx *ctx, const struct nft_expr *expr,
 			const struct nlattr * const tb[])
 {
 	struct nft_nat *priv = nft_expr_priv(expr);
+	u32 family;
 	int err;
 
 	if (tb[NFTA_NAT_TYPE] == NULL)
@@ -107,9 +108,12 @@ static int nft_nat_init(const struct nft_ctx *ctx, const struct nft_expr *expr,
 	if (tb[NFTA_NAT_FAMILY] == NULL)
 		return -EINVAL;
 
-	priv->family = ntohl(nla_get_be32(tb[NFTA_NAT_FAMILY]));
-	if (priv->family != AF_INET && priv->family != AF_INET6)
-		return -EINVAL;
+	family = ntohl(nla_get_be32(tb[NFTA_NAT_FAMILY]));
+	if (family != AF_INET && family != AF_INET6)
+		return -EAFNOSUPPORT;
+	if (family != ctx->afi->family)
+		return -EOPNOTSUPP;
+	priv->family = family;
 
 	if (tb[NFTA_NAT_REG_ADDR_MIN]) {
 		priv->sreg_addr_min = ntohl(nla_get_be32(
@@ -202,13 +206,7 @@ static struct nft_expr_type nft_nat_type __read_mostly = {
 
 static int __init nft_nat_module_init(void)
 {
-	int err;
-
-	err = nft_register_expr(&nft_nat_type);
-	if (err < 0)
-		return err;
-
-	return 0;
+	return nft_register_expr(&nft_nat_type);
 }
 
 static void __exit nft_nat_module_exit(void)
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 31/38] netfilter: connlimit: factor hlist search into new function
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (29 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 30/38] netfilter: nft_nat: fix family validation Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 32/38] netfilter: connlimit: improve packet-to-closed-connection logic Pablo Neira Ayuso
                   ` (7 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Florian Westphal <fw@strlen.de>

Simplifies followup patch that introduces separate locks for each of
the hash slots.

Reviewed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/xt_connlimit.c |   49 ++++++++++++++++++++++++++++--------------
 1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index c40b269..6988818 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -92,30 +92,24 @@ same_source_net(const union nf_inet_addr *addr,
 	}
 }
 
-static int count_them(struct net *net,
-		      struct xt_connlimit_data *data,
-		      const struct nf_conntrack_tuple *tuple,
-		      const union nf_inet_addr *addr,
-		      const union nf_inet_addr *mask,
-		      u_int8_t family)
+static int count_hlist(struct net *net,
+		       struct hlist_head *head,
+		       const struct nf_conntrack_tuple *tuple,
+		       const union nf_inet_addr *addr,
+		       const union nf_inet_addr *mask,
+		       u_int8_t family)
 {
 	const struct nf_conntrack_tuple_hash *found;
 	struct xt_connlimit_conn *conn;
 	struct hlist_node *n;
 	struct nf_conn *found_ct;
-	struct hlist_head *hash;
 	bool addit = true;
 	int matches = 0;
 
-	if (family == NFPROTO_IPV6)
-		hash = &data->iphash[connlimit_iphash6(addr, mask)];
-	else
-		hash = &data->iphash[connlimit_iphash(addr->ip & mask->ip)];
-
 	rcu_read_lock();
 
 	/* check the saved connections */
-	hlist_for_each_entry_safe(conn, n, hash, node) {
+	hlist_for_each_entry_safe(conn, n, head, node) {
 		found    = nf_conntrack_find_get(net, NF_CT_DEFAULT_ZONE,
 						 &conn->tuple);
 		found_ct = NULL;
@@ -166,13 +160,38 @@ static int count_them(struct net *net,
 			return -ENOMEM;
 		conn->tuple = *tuple;
 		conn->addr = *addr;
-		hlist_add_head(&conn->node, hash);
+		hlist_add_head(&conn->node, head);
 		++matches;
 	}
 
 	return matches;
 }
 
+static int count_them(struct net *net,
+		      struct xt_connlimit_data *data,
+		      const struct nf_conntrack_tuple *tuple,
+		      const union nf_inet_addr *addr,
+		      const union nf_inet_addr *mask,
+		      u_int8_t family)
+{
+	struct hlist_head *hhead;
+	int count;
+	u32 hash;
+
+	if (family == NFPROTO_IPV6)
+		hash = connlimit_iphash6(addr, mask);
+	else
+		hash = connlimit_iphash(addr->ip & mask->ip);
+
+	hhead = &data->iphash[hash];
+
+	spin_lock_bh(&data->lock);
+	count = count_hlist(net, hhead, tuple, addr, mask, family);
+	spin_unlock_bh(&data->lock);
+
+	return count;
+}
+
 static bool
 connlimit_mt(const struct sk_buff *skb, struct xt_action_param *par)
 {
@@ -202,10 +221,8 @@ connlimit_mt(const struct sk_buff *skb, struct xt_action_param *par)
 			  iph->daddr : iph->saddr;
 	}
 
-	spin_lock_bh(&info->data->lock);
 	connections = count_them(net, info->data, tuple_ptr, &addr,
 	                         &info->mask, par->family);
-	spin_unlock_bh(&info->data->lock);
 
 	if (connections < 0)
 		/* kmalloc failed, drop it entirely */
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 32/38] netfilter: connlimit: improve packet-to-closed-connection logic
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (30 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 31/38] netfilter: connlimit: factor hlist search into new function Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 33/38] netfilter: connlimit: move insertion of new element out of count function Pablo Neira Ayuso
                   ` (6 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Florian Westphal <fw@strlen.de>

Instead of freeing the entry from our list and then adding
it back again in the 'packet to closing connection' case, just keep the
matching entry around.  Also drop the found_ct != NULL test as
nf_ct_tuplehash_to_ctrack is just container_of().

Reviewed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/xt_connlimit.c |   23 ++++++++---------------
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index 6988818..d4c6db1 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -112,29 +112,22 @@ static int count_hlist(struct net *net,
 	hlist_for_each_entry_safe(conn, n, head, node) {
 		found    = nf_conntrack_find_get(net, NF_CT_DEFAULT_ZONE,
 						 &conn->tuple);
-		found_ct = NULL;
+		if (found == NULL) {
+			hlist_del(&conn->node);
+			kfree(conn);
+			continue;
+		}
 
-		if (found != NULL)
-			found_ct = nf_ct_tuplehash_to_ctrack(found);
+		found_ct = nf_ct_tuplehash_to_ctrack(found);
 
-		if (found_ct != NULL &&
-		    nf_ct_tuple_equal(&conn->tuple, tuple) &&
-		    !already_closed(found_ct))
+		if (nf_ct_tuple_equal(&conn->tuple, tuple)) {
 			/*
 			 * Just to be sure we have it only once in the list.
 			 * We should not see tuples twice unless someone hooks
 			 * this into a table without "-p tcp --syn".
 			 */
 			addit = false;
-
-		if (found == NULL) {
-			/* this one is gone */
-			hlist_del(&conn->node);
-			kfree(conn);
-			continue;
-		}
-
-		if (already_closed(found_ct)) {
+		} else if (already_closed(found_ct)) {
 			/*
 			 * we do not care about connections which are
 			 * closed already -> ditch it
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 33/38] netfilter: connlimit: move insertion of new element out of count function
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (31 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 32/38] netfilter: connlimit: improve packet-to-closed-connection logic Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 34/38] netfilter: connlimit: use kmem_cache for conn objects Pablo Neira Ayuso
                   ` (5 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Florian Westphal <fw@strlen.de>

Allows easier code-reuse in followup patches.

Reviewed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/xt_connlimit.c |   38 +++++++++++++++++++++++---------------
 1 file changed, 23 insertions(+), 15 deletions(-)

diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index d4c6db1..0220d40 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -97,13 +97,12 @@ static int count_hlist(struct net *net,
 		       const struct nf_conntrack_tuple *tuple,
 		       const union nf_inet_addr *addr,
 		       const union nf_inet_addr *mask,
-		       u_int8_t family)
+		       u_int8_t family, bool *addit)
 {
 	const struct nf_conntrack_tuple_hash *found;
 	struct xt_connlimit_conn *conn;
 	struct hlist_node *n;
 	struct nf_conn *found_ct;
-	bool addit = true;
 	int matches = 0;
 
 	rcu_read_lock();
@@ -126,7 +125,7 @@ static int count_hlist(struct net *net,
 			 * We should not see tuples twice unless someone hooks
 			 * this into a table without "-p tcp --syn".
 			 */
-			addit = false;
+			*addit = false;
 		} else if (already_closed(found_ct)) {
 			/*
 			 * we do not care about connections which are
@@ -146,20 +145,22 @@ static int count_hlist(struct net *net,
 
 	rcu_read_unlock();
 
-	if (addit) {
-		/* save the new connection in our list */
-		conn = kmalloc(sizeof(*conn), GFP_ATOMIC);
-		if (conn == NULL)
-			return -ENOMEM;
-		conn->tuple = *tuple;
-		conn->addr = *addr;
-		hlist_add_head(&conn->node, head);
-		++matches;
-	}
-
 	return matches;
 }
 
+static bool add_hlist(struct hlist_head *head,
+		      const struct nf_conntrack_tuple *tuple,
+		      const union nf_inet_addr *addr)
+{
+	struct xt_connlimit_conn *conn = kmalloc(sizeof(*conn), GFP_ATOMIC);
+	if (conn == NULL)
+		return false;
+	conn->tuple = *tuple;
+	conn->addr = *addr;
+	hlist_add_head(&conn->node, head);
+	return true;
+}
+
 static int count_them(struct net *net,
 		      struct xt_connlimit_data *data,
 		      const struct nf_conntrack_tuple *tuple,
@@ -170,6 +171,7 @@ static int count_them(struct net *net,
 	struct hlist_head *hhead;
 	int count;
 	u32 hash;
+	bool addit = true;
 
 	if (family == NFPROTO_IPV6)
 		hash = connlimit_iphash6(addr, mask);
@@ -179,7 +181,13 @@ static int count_them(struct net *net,
 	hhead = &data->iphash[hash];
 
 	spin_lock_bh(&data->lock);
-	count = count_hlist(net, hhead, tuple, addr, mask, family);
+	count = count_hlist(net, hhead, tuple, addr, mask, family, &addit);
+	if (addit) {
+		if (add_hlist(hhead, tuple, addr))
+			count++;
+		else
+			count = -ENOMEM;
+	}
 	spin_unlock_bh(&data->lock);
 
 	return count;
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 34/38] netfilter: connlimit: use kmem_cache for conn objects
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (32 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 33/38] netfilter: connlimit: move insertion of new element out of count function Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 35/38] netfilter: Convert uses of __constant_<foo> to <foo> Pablo Neira Ayuso
                   ` (4 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Florian Westphal <fw@strlen.de>

We might allocate thousands of these (one object per connection).
Use a distinct kmem cache to permit simple tracking of how many
objects are currently used by the connlimit match via sysfs.
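
For example (an illustrative aside, not part of the patch; the exact
paths depend on the slab allocator configuration and slab merging), the
live object count of the new cache can be inspected with:

  grep xt_connlimit_conn /proc/slabinfo
  cat /sys/kernel/slab/xt_connlimit_conn/objects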

Reviewed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/xt_connlimit.c |   24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index 0220d40..a8eaabb 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -44,6 +44,7 @@ struct xt_connlimit_data {
 };
 
 static u_int32_t connlimit_rnd __read_mostly;
+static struct kmem_cache *connlimit_conn_cachep __read_mostly;
 
 static inline unsigned int connlimit_iphash(__be32 addr)
 {
@@ -113,7 +114,7 @@ static int count_hlist(struct net *net,
 						 &conn->tuple);
 		if (found == NULL) {
 			hlist_del(&conn->node);
-			kfree(conn);
+			kmem_cache_free(connlimit_conn_cachep, conn);
 			continue;
 		}
 
@@ -133,7 +134,7 @@ static int count_hlist(struct net *net,
 			 */
 			nf_ct_put(found_ct);
 			hlist_del(&conn->node);
-			kfree(conn);
+			kmem_cache_free(connlimit_conn_cachep, conn);
 			continue;
 		}
 
@@ -152,7 +153,9 @@ static bool add_hlist(struct hlist_head *head,
 		      const struct nf_conntrack_tuple *tuple,
 		      const union nf_inet_addr *addr)
 {
-	struct xt_connlimit_conn *conn = kmalloc(sizeof(*conn), GFP_ATOMIC);
+	struct xt_connlimit_conn *conn;
+
+	conn = kmem_cache_alloc(connlimit_conn_cachep, GFP_ATOMIC);
 	if (conn == NULL)
 		return false;
 	conn->tuple = *tuple;
@@ -285,7 +288,7 @@ static void connlimit_mt_destroy(const struct xt_mtdtor_param *par)
 	for (i = 0; i < ARRAY_SIZE(info->data->iphash); ++i) {
 		hlist_for_each_entry_safe(conn, n, &hash[i], node) {
 			hlist_del(&conn->node);
-			kfree(conn);
+			kmem_cache_free(connlimit_conn_cachep, conn);
 		}
 	}
 
@@ -305,12 +308,23 @@ static struct xt_match connlimit_mt_reg __read_mostly = {
 
 static int __init connlimit_mt_init(void)
 {
-	return xt_register_match(&connlimit_mt_reg);
+	int ret;
+	connlimit_conn_cachep = kmem_cache_create("xt_connlimit_conn",
+					   sizeof(struct xt_connlimit_conn),
+					   0, 0, NULL);
+	if (!connlimit_conn_cachep)
+		return -ENOMEM;
+
+	ret = xt_register_match(&connlimit_mt_reg);
+	if (ret != 0)
+		kmem_cache_destroy(connlimit_conn_cachep);
+	return ret;
 }
 
 static void __exit connlimit_mt_exit(void)
 {
 	xt_unregister_match(&connlimit_mt_reg);
+	kmem_cache_destroy(connlimit_conn_cachep);
 }
 
 module_init(connlimit_mt_init);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 35/38] netfilter: Convert uses of __constant_<foo> to <foo>
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (33 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 34/38] netfilter: connlimit: use kmem_cache for conn objects Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 36/38] netfilter: connlimit: use keyed locks Pablo Neira Ayuso
                   ` (3 subsequent siblings)
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Joe Perches <joe@perches.com>

The use of __constant_<foo> has been unnecessary for quite a while now.

Make these uses consistent with the rest of the kernel.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/ipset/pfxlen.c |    4 ++--
 net/netfilter/xt_AUDIT.c     |    4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/ipset/pfxlen.c b/net/netfilter/ipset/pfxlen.c
index 4f29fa9..04d15fd 100644
--- a/net/netfilter/ipset/pfxlen.c
+++ b/net/netfilter/ipset/pfxlen.c
@@ -7,8 +7,8 @@
 
 #define E(a, b, c, d) \
 	{.ip6 = { \
-		__constant_htonl(a), __constant_htonl(b), \
-		__constant_htonl(c), __constant_htonl(d), \
+		htonl(a), htonl(b), \
+		htonl(c), htonl(d), \
 	} }
 
 /*
diff --git a/net/netfilter/xt_AUDIT.c b/net/netfilter/xt_AUDIT.c
index 3228d7f..4973cbd 100644
--- a/net/netfilter/xt_AUDIT.c
+++ b/net/netfilter/xt_AUDIT.c
@@ -146,11 +146,11 @@ audit_tg(struct sk_buff *skb, const struct xt_action_param *par)
 
 		if (par->family == NFPROTO_BRIDGE) {
 			switch (eth_hdr(skb)->h_proto) {
-			case __constant_htons(ETH_P_IP):
+			case htons(ETH_P_IP):
 				audit_ip4(ab, skb);
 				break;
 
-			case __constant_htons(ETH_P_IPV6):
+			case htons(ETH_P_IPV6):
 				audit_ip6(ab, skb);
 				break;
 			}
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 36/38] netfilter: connlimit: use keyed locks
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (34 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 35/38] netfilter: Convert uses of __constant_<foo> to <foo> Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:54   ` David Laight
  2014-03-17 14:00   ` Eric Dumazet
  2014-03-17 12:42 ` [PATCH 37/38] netfilter: connlimit: make same_source_net signed Pablo Neira Ayuso
                   ` (2 subsequent siblings)
  38 siblings, 2 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Florian Westphal <fw@strlen.de>

connlimit currently suffers from spinlock contention, example for
4-core system with rps enabled:

+  20.84%   ksoftirqd/2  [kernel.kallsyms] [k] _raw_spin_lock_bh
+  20.76%   ksoftirqd/1  [kernel.kallsyms] [k] _raw_spin_lock_bh
+  20.42%   ksoftirqd/0  [kernel.kallsyms] [k] _raw_spin_lock_bh
+   6.07%   ksoftirqd/2  [nf_conntrack]    [k] ____nf_conntrack_find
+   6.07%   ksoftirqd/1  [nf_conntrack]    [k] ____nf_conntrack_find
+   5.97%   ksoftirqd/0  [nf_conntrack]    [k] ____nf_conntrack_find
+   2.47%   ksoftirqd/2  [nf_conntrack]    [k] hash_conntrack_raw
+   2.45%   ksoftirqd/0  [nf_conntrack]    [k] hash_conntrack_raw
+   2.44%   ksoftirqd/1  [nf_conntrack]    [k] hash_conntrack_raw

This may allow parallel lookup/insert/delete if the entry is hashed to
another slot.  With the patch:

+  20.95%  ksoftirqd/0  [nf_conntrack] [k] ____nf_conntrack_find
+  20.50%  ksoftirqd/1  [nf_conntrack] [k] ____nf_conntrack_find
+  20.27%  ksoftirqd/2  [nf_conntrack] [k] ____nf_conntrack_find
+   5.76%  ksoftirqd/1  [nf_conntrack] [k] hash_conntrack_raw
+   5.39%  ksoftirqd/2  [nf_conntrack] [k] hash_conntrack_raw
+   5.35%  ksoftirqd/0  [nf_conntrack] [k] hash_conntrack_raw
+   2.00%  ksoftirqd/1  [kernel.kallsyms] [k] __rcu_read_unlock

Improved rx processing rate from ~35 kpps to ~50 kpps.

Reviewed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/xt_connlimit.c |   26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index a8eaabb..ad290cc 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -31,6 +31,9 @@
 #include <net/netfilter/nf_conntrack_tuple.h>
 #include <net/netfilter/nf_conntrack_zones.h>
 
+#define CONNLIMIT_SLOTS	256
+#define CONNLIMIT_LOCK_SLOTS	32
+
 /* we will save the tuples of all connections we care about */
 struct xt_connlimit_conn {
 	struct hlist_node		node;
@@ -39,8 +42,8 @@ struct xt_connlimit_conn {
 };
 
 struct xt_connlimit_data {
-	struct hlist_head	iphash[256];
-	spinlock_t		lock;
+	struct hlist_head	iphash[CONNLIMIT_SLOTS];
+	spinlock_t		locks[CONNLIMIT_LOCK_SLOTS];
 };
 
 static u_int32_t connlimit_rnd __read_mostly;
@@ -48,7 +51,8 @@ static struct kmem_cache *connlimit_conn_cachep __read_mostly;
 
 static inline unsigned int connlimit_iphash(__be32 addr)
 {
-	return jhash_1word((__force __u32)addr, connlimit_rnd) & 0xFF;
+	return jhash_1word((__force __u32)addr,
+			    connlimit_rnd) % CONNLIMIT_SLOTS;
 }
 
 static inline unsigned int
@@ -61,7 +65,8 @@ connlimit_iphash6(const union nf_inet_addr *addr,
 	for (i = 0; i < ARRAY_SIZE(addr->ip6); ++i)
 		res.ip6[i] = addr->ip6[i] & mask->ip6[i];
 
-	return jhash2((u32 *)res.ip6, ARRAY_SIZE(res.ip6), connlimit_rnd) & 0xFF;
+	return jhash2((u32 *)res.ip6, ARRAY_SIZE(res.ip6),
+		       connlimit_rnd) % CONNLIMIT_SLOTS;
 }
 
 static inline bool already_closed(const struct nf_conn *conn)
@@ -183,7 +188,7 @@ static int count_them(struct net *net,
 
 	hhead = &data->iphash[hash];
 
-	spin_lock_bh(&data->lock);
+	spin_lock_bh(&data->locks[hash % CONNLIMIT_LOCK_SLOTS]);
 	count = count_hlist(net, hhead, tuple, addr, mask, family, &addit);
 	if (addit) {
 		if (add_hlist(hhead, tuple, addr))
@@ -191,7 +196,7 @@ static int count_them(struct net *net,
 		else
 			count = -ENOMEM;
 	}
-	spin_unlock_bh(&data->lock);
+	spin_unlock_bh(&data->locks[hash % CONNLIMIT_LOCK_SLOTS]);
 
 	return count;
 }
@@ -227,7 +232,6 @@ connlimit_mt(const struct sk_buff *skb, struct xt_action_param *par)
 
 	connections = count_them(net, info->data, tuple_ptr, &addr,
 	                         &info->mask, par->family);
-
 	if (connections < 0)
 		/* kmalloc failed, drop it entirely */
 		goto hotdrop;
@@ -268,7 +272,9 @@ static int connlimit_mt_check(const struct xt_mtchk_param *par)
 		return -ENOMEM;
 	}
 
-	spin_lock_init(&info->data->lock);
+	for (i = 0; i < ARRAY_SIZE(info->data->locks); ++i)
+		spin_lock_init(&info->data->locks[i]);
+
 	for (i = 0; i < ARRAY_SIZE(info->data->iphash); ++i)
 		INIT_HLIST_HEAD(&info->data->iphash[i]);
 
@@ -309,6 +315,10 @@ static struct xt_match connlimit_mt_reg __read_mostly = {
 static int __init connlimit_mt_init(void)
 {
 	int ret;
+
+	BUILD_BUG_ON(CONNLIMIT_LOCK_SLOTS > CONNLIMIT_SLOTS);
+	BUILD_BUG_ON((CONNLIMIT_SLOTS % CONNLIMIT_LOCK_SLOTS) != 0);
+
 	connlimit_conn_cachep = kmem_cache_create("xt_connlimit_conn",
 					   sizeof(struct xt_connlimit_conn),
 					   0, 0, NULL);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 37/38] netfilter: connlimit: make same_source_net signed
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (35 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 36/38] netfilter: connlimit: use keyed locks Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 12:42 ` [PATCH 38/38] netfilter: connlimit: use rbtree for per-host conntrack obj storage Pablo Neira Ayuso
  2014-03-17 19:19 ` [PATCH 00/38] Netfilter/IPVS updates for net-next David Miller
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Florian Westphal <fw@strlen.de>

same_source_net() currently returns 1 if they're the same.  Make it work
like memcmp/strcmp so it can be used as an rbtree search function.
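
As an illustration only (a minimal sketch, not taken from the series,
and using the xt_connlimit_rb node type that the next patch introduces;
the helper name climit_find is made up), a <0 / 0 / >0 comparator lets
an rbtree search walk the tree directly:

static struct xt_connlimit_rb *
climit_find(struct rb_root *root, const union nf_inet_addr *addr,
            const union nf_inet_addr *mask, u_int8_t family)
{
        struct rb_node *node = root->rb_node;

        while (node) {
                struct xt_connlimit_rb *rbconn;
                int diff;

                rbconn = rb_entry(node, struct xt_connlimit_rb, node);
                diff = same_source_net(addr, mask, &rbconn->addr, family);
                if (diff < 0)
                        node = node->rb_left;   /* key sorts before this node */
                else if (diff > 0)
                        node = node->rb_right;  /* key sorts after this node */
                else
                        return rbconn;          /* same source network */
        }
        return NULL;    /* caller would insert a new node here */
}

The real tree walk lives in count_tree() in the following patch.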

Reviewed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/xt_connlimit.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index ad290cc..dc5207f 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -78,13 +78,14 @@ static inline bool already_closed(const struct nf_conn *conn)
 		return 0;
 }
 
-static inline unsigned int
+static int
 same_source_net(const union nf_inet_addr *addr,
 		const union nf_inet_addr *mask,
 		const union nf_inet_addr *u3, u_int8_t family)
 {
 	if (family == NFPROTO_IPV4) {
-		return (addr->ip & mask->ip) == (u3->ip & mask->ip);
+		return ntohl(addr->ip & mask->ip) -
+		       ntohl(u3->ip & mask->ip);
 	} else {
 		union nf_inet_addr lh, rh;
 		unsigned int i;
@@ -94,7 +95,7 @@ same_source_net(const union nf_inet_addr *addr,
 			rh.ip6[i] = u3->ip6[i] & mask->ip6[i];
 		}
 
-		return memcmp(&lh.ip6, &rh.ip6, sizeof(lh.ip6)) == 0;
+		return memcmp(&lh.ip6, &rh.ip6, sizeof(lh.ip6));
 	}
 }
 
@@ -143,7 +144,7 @@ static int count_hlist(struct net *net,
 			continue;
 		}
 
-		if (same_source_net(addr, mask, &conn->addr, family))
+		if (same_source_net(addr, mask, &conn->addr, family) == 0)
 			/* same source network -> be counted! */
 			++matches;
 		nf_ct_put(found_ct);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 38/38] netfilter: connlimit: use rbtree for per-host conntrack obj storage
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (36 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 37/38] netfilter: connlimit: make same_source_net signed Pablo Neira Ayuso
@ 2014-03-17 12:42 ` Pablo Neira Ayuso
  2014-03-17 19:19 ` [PATCH 00/38] Netfilter/IPVS updates for net-next David Miller
  38 siblings, 0 replies; 47+ messages in thread
From: Pablo Neira Ayuso @ 2014-03-17 12:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Florian Westphal <fw@strlen.de>

With the current match design, every invocation of the connlimit_match
function means we have to perform (number_of_conntracks % 256) lookups
in the conntrack table [ to perform GC/delete stale entries ].
This is also the reason why ____nf_conntrack_find() in perf top has
> 20% cpu time per core.

This patch changes the storage to rbtree which cuts down the number of
ct objects that need testing.

When looking up a new tuple, we only test the connections of the host
objects we visit while searching for the wanted host/network (or
the leaf we need to insert at).

The slot count is reduced to 32.  Increasing the slot count doesn't
speed things up much because of the rbtree's nature.

before patch (50kpps rx, 10kpps tx):
+  20.95%  ksoftirqd/0  [nf_conntrack] [k] ____nf_conntrack_find
+  20.50%  ksoftirqd/1  [nf_conntrack] [k] ____nf_conntrack_find
+  20.27%  ksoftirqd/2  [nf_conntrack] [k] ____nf_conntrack_find
+   5.76%  ksoftirqd/1  [nf_conntrack] [k] hash_conntrack_raw
+   5.39%  ksoftirqd/2  [nf_conntrack] [k] hash_conntrack_raw
+   5.35%  ksoftirqd/0  [nf_conntrack] [k] hash_conntrack_raw

after (90kpps, 51kpps tx):
+  17.24%       swapper  [nf_conntrack]    [k] ____nf_conntrack_find
+   6.60%   ksoftirqd/2  [nf_conntrack]    [k] ____nf_conntrack_find
+   2.73%       swapper  [nf_conntrack]    [k] hash_conntrack_raw
+   2.36%       swapper  [xt_connlimit]    [k] count_tree

Obvious disadvantages compared to the previous version are the increase
in code complexity and the increased memory cost.

Partially based on Eric Dumazet's fq scheduler.

Reviewed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/xt_connlimit.c |  224 +++++++++++++++++++++++++++++++++---------
 1 file changed, 177 insertions(+), 47 deletions(-)

diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index dc5207f..458464e 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -19,6 +19,7 @@
 #include <linux/jhash.h>
 #include <linux/slab.h>
 #include <linux/list.h>
+#include <linux/rbtree.h>
 #include <linux/module.h>
 #include <linux/random.h>
 #include <linux/skbuff.h>
@@ -31,8 +32,9 @@
 #include <net/netfilter/nf_conntrack_tuple.h>
 #include <net/netfilter/nf_conntrack_zones.h>
 
-#define CONNLIMIT_SLOTS	256
+#define CONNLIMIT_SLOTS		32
 #define CONNLIMIT_LOCK_SLOTS	32
+#define CONNLIMIT_GC_MAX_NODES	8
 
 /* we will save the tuples of all connections we care about */
 struct xt_connlimit_conn {
@@ -41,12 +43,20 @@ struct xt_connlimit_conn {
 	union nf_inet_addr		addr;
 };
 
+struct xt_connlimit_rb {
+	struct rb_node node;
+	struct hlist_head hhead; /* connections/hosts in same subnet */
+	union nf_inet_addr addr; /* search key */
+};
+
 struct xt_connlimit_data {
-	struct hlist_head	iphash[CONNLIMIT_SLOTS];
+	struct rb_root climit_root4[CONNLIMIT_SLOTS];
+	struct rb_root climit_root6[CONNLIMIT_SLOTS];
 	spinlock_t		locks[CONNLIMIT_LOCK_SLOTS];
 };
 
 static u_int32_t connlimit_rnd __read_mostly;
+static struct kmem_cache *connlimit_rb_cachep __read_mostly;
 static struct kmem_cache *connlimit_conn_cachep __read_mostly;
 
 static inline unsigned int connlimit_iphash(__be32 addr)
@@ -99,19 +109,33 @@ same_source_net(const union nf_inet_addr *addr,
 	}
 }
 
-static int count_hlist(struct net *net,
-		       struct hlist_head *head,
-		       const struct nf_conntrack_tuple *tuple,
-		       const union nf_inet_addr *addr,
-		       const union nf_inet_addr *mask,
-		       u_int8_t family, bool *addit)
+static bool add_hlist(struct hlist_head *head,
+		      const struct nf_conntrack_tuple *tuple,
+		      const union nf_inet_addr *addr)
+{
+	struct xt_connlimit_conn *conn;
+
+	conn = kmem_cache_alloc(connlimit_conn_cachep, GFP_ATOMIC);
+	if (conn == NULL)
+		return false;
+	conn->tuple = *tuple;
+	conn->addr = *addr;
+	hlist_add_head(&conn->node, head);
+	return true;
+}
+
+static unsigned int check_hlist(struct net *net,
+				struct hlist_head *head,
+				const struct nf_conntrack_tuple *tuple,
+				bool *addit)
 {
 	const struct nf_conntrack_tuple_hash *found;
 	struct xt_connlimit_conn *conn;
 	struct hlist_node *n;
 	struct nf_conn *found_ct;
-	int matches = 0;
+	unsigned int length = 0;
 
+	*addit = true;
 	rcu_read_lock();
 
 	/* check the saved connections */
@@ -144,30 +168,114 @@ static int count_hlist(struct net *net,
 			continue;
 		}
 
-		if (same_source_net(addr, mask, &conn->addr, family) == 0)
-			/* same source network -> be counted! */
-			++matches;
 		nf_ct_put(found_ct);
+		length++;
 	}
 
 	rcu_read_unlock();
 
-	return matches;
+	return length;
 }
 
-static bool add_hlist(struct hlist_head *head,
-		      const struct nf_conntrack_tuple *tuple,
-		      const union nf_inet_addr *addr)
+static void tree_nodes_free(struct rb_root *root,
+			    struct xt_connlimit_rb *gc_nodes[],
+			    unsigned int gc_count)
+{
+	struct xt_connlimit_rb *rbconn;
+
+	while (gc_count) {
+		rbconn = gc_nodes[--gc_count];
+		rb_erase(&rbconn->node, root);
+		kmem_cache_free(connlimit_rb_cachep, rbconn);
+	}
+}
+
+static unsigned int
+count_tree(struct net *net, struct rb_root *root,
+	   const struct nf_conntrack_tuple *tuple,
+	   const union nf_inet_addr *addr, const union nf_inet_addr *mask,
+	   u8 family)
 {
+	struct xt_connlimit_rb *gc_nodes[CONNLIMIT_GC_MAX_NODES];
+	struct rb_node **rbnode, *parent;
+	struct xt_connlimit_rb *rbconn;
 	struct xt_connlimit_conn *conn;
+	unsigned int gc_count;
+	bool no_gc = false;
+
+ restart:
+	gc_count = 0;
+	parent = NULL;
+	rbnode = &(root->rb_node);
+	while (*rbnode) {
+		int diff;
+		bool addit;
+
+		rbconn = container_of(*rbnode, struct xt_connlimit_rb, node);
+
+		parent = *rbnode;
+		diff = same_source_net(addr, mask, &rbconn->addr, family);
+		if (diff < 0) {
+			rbnode = &((*rbnode)->rb_left);
+		} else if (diff > 0) {
+			rbnode = &((*rbnode)->rb_right);
+		} else {
+			/* same source network -> be counted! */
+			unsigned int count;
+			count = check_hlist(net, &rbconn->hhead, tuple, &addit);
+
+			tree_nodes_free(root, gc_nodes, gc_count);
+			if (!addit)
+				return count;
+
+			if (!add_hlist(&rbconn->hhead, tuple, addr))
+				return 0; /* hotdrop */
+
+			return count + 1;
+		}
+
+		if (no_gc || gc_count >= ARRAY_SIZE(gc_nodes))
+			continue;
+
+		/* only used for GC on hhead, retval and 'addit' ignored */
+		check_hlist(net, &rbconn->hhead, tuple, &addit);
+		if (hlist_empty(&rbconn->hhead))
+			gc_nodes[gc_count++] = rbconn;
+	}
+
+	if (gc_count) {
+		no_gc = true;
+		tree_nodes_free(root, gc_nodes, gc_count);
+		/* tree_node_free before new allocation permits
+		 * allocator to re-use newly free'd object.
+		 *
+		 * This is a rare event; in most cases we will find
+		 * existing node to re-use. (or gc_count is 0).
+		 */
+		goto restart;
+	}
+
+	/* no match, need to insert new node */
+	rbconn = kmem_cache_alloc(connlimit_rb_cachep, GFP_ATOMIC);
+	if (rbconn == NULL)
+		return 0;
 
 	conn = kmem_cache_alloc(connlimit_conn_cachep, GFP_ATOMIC);
-	if (conn == NULL)
-		return false;
+	if (conn == NULL) {
+		kmem_cache_free(connlimit_rb_cachep, rbconn);
+		return 0;
+	}
+
 	conn->tuple = *tuple;
 	conn->addr = *addr;
-	hlist_add_head(&conn->node, head);
-	return true;
+	rbconn->addr = *addr;
+
+	INIT_HLIST_HEAD(&rbconn->hhead);
+	hlist_add_head(&conn->node, &rbconn->hhead);
+
+	rb_link_node(&rbconn->node, parent, rbnode);
+	rb_insert_color(&rbconn->node, root);
+	return 1;
 }
 
 static int count_them(struct net *net,
@@ -177,26 +285,22 @@ static int count_them(struct net *net,
 		      const union nf_inet_addr *mask,
 		      u_int8_t family)
 {
-	struct hlist_head *hhead;
+	struct rb_root *root;
 	int count;
 	u32 hash;
-	bool addit = true;
 
-	if (family == NFPROTO_IPV6)
+	if (family == NFPROTO_IPV6) {
 		hash = connlimit_iphash6(addr, mask);
-	else
+		root = &data->climit_root6[hash];
+	} else {
 		hash = connlimit_iphash(addr->ip & mask->ip);
-
-	hhead = &data->iphash[hash];
+		root = &data->climit_root4[hash];
+	}
 
 	spin_lock_bh(&data->locks[hash % CONNLIMIT_LOCK_SLOTS]);
-	count = count_hlist(net, hhead, tuple, addr, mask, family, &addit);
-	if (addit) {
-		if (add_hlist(hhead, tuple, addr))
-			count++;
-		else
-			count = -ENOMEM;
-	}
+
+	count = count_tree(net, root, tuple, addr, mask, family);
+
 	spin_unlock_bh(&data->locks[hash % CONNLIMIT_LOCK_SLOTS]);
 
 	return count;
@@ -212,7 +316,7 @@ connlimit_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	const struct nf_conntrack_tuple *tuple_ptr = &tuple;
 	enum ip_conntrack_info ctinfo;
 	const struct nf_conn *ct;
-	int connections;
+	unsigned int connections;
 
 	ct = nf_ct_get(skb, &ctinfo);
 	if (ct != NULL)
@@ -233,7 +337,7 @@ connlimit_mt(const struct sk_buff *skb, struct xt_action_param *par)
 
 	connections = count_them(net, info->data, tuple_ptr, &addr,
 	                         &info->mask, par->family);
-	if (connections < 0)
+	if (connections == 0)
 		/* kmalloc failed, drop it entirely */
 		goto hotdrop;
 
@@ -276,28 +380,44 @@ static int connlimit_mt_check(const struct xt_mtchk_param *par)
 	for (i = 0; i < ARRAY_SIZE(info->data->locks); ++i)
 		spin_lock_init(&info->data->locks[i]);
 
-	for (i = 0; i < ARRAY_SIZE(info->data->iphash); ++i)
-		INIT_HLIST_HEAD(&info->data->iphash[i]);
+	for (i = 0; i < ARRAY_SIZE(info->data->climit_root4); ++i)
+		info->data->climit_root4[i] = RB_ROOT;
+	for (i = 0; i < ARRAY_SIZE(info->data->climit_root6); ++i)
+		info->data->climit_root6[i] = RB_ROOT;
 
 	return 0;
 }
 
-static void connlimit_mt_destroy(const struct xt_mtdtor_param *par)
+static void destroy_tree(struct rb_root *r)
 {
-	const struct xt_connlimit_info *info = par->matchinfo;
 	struct xt_connlimit_conn *conn;
+	struct xt_connlimit_rb *rbconn;
 	struct hlist_node *n;
-	struct hlist_head *hash = info->data->iphash;
-	unsigned int i;
+	struct rb_node *node;
 
-	nf_ct_l3proto_module_put(par->family);
+	while ((node = rb_first(r)) != NULL) {
+		rbconn = container_of(node, struct xt_connlimit_rb, node);
 
-	for (i = 0; i < ARRAY_SIZE(info->data->iphash); ++i) {
-		hlist_for_each_entry_safe(conn, n, &hash[i], node) {
-			hlist_del(&conn->node);
+		rb_erase(node, r);
+
+		hlist_for_each_entry_safe(conn, n, &rbconn->hhead, node)
 			kmem_cache_free(connlimit_conn_cachep, conn);
-		}
+
+		kmem_cache_free(connlimit_rb_cachep, rbconn);
 	}
+}
+
+static void connlimit_mt_destroy(const struct xt_mtdtor_param *par)
+{
+	const struct xt_connlimit_info *info = par->matchinfo;
+	unsigned int i;
+
+	nf_ct_l3proto_module_put(par->family);
+
+	for (i = 0; i < ARRAY_SIZE(info->data->climit_root4); ++i)
+		destroy_tree(&info->data->climit_root4[i]);
+	for (i = 0; i < ARRAY_SIZE(info->data->climit_root6); ++i)
+		destroy_tree(&info->data->climit_root6[i]);
 
 	kfree(info->data);
 }
@@ -326,9 +446,18 @@ static int __init connlimit_mt_init(void)
 	if (!connlimit_conn_cachep)
 		return -ENOMEM;
 
+	connlimit_rb_cachep = kmem_cache_create("xt_connlimit_rb",
+					   sizeof(struct xt_connlimit_rb),
+					   0, 0, NULL);
+	if (!connlimit_rb_cachep) {
+		kmem_cache_destroy(connlimit_conn_cachep);
+		return -ENOMEM;
+	}
 	ret = xt_register_match(&connlimit_mt_reg);
-	if (ret != 0)
+	if (ret != 0) {
 		kmem_cache_destroy(connlimit_conn_cachep);
+		kmem_cache_destroy(connlimit_rb_cachep);
+	}
 	return ret;
 }
 
@@ -336,6 +465,7 @@ static void __exit connlimit_mt_exit(void)
 {
 	xt_unregister_match(&connlimit_mt_reg);
 	kmem_cache_destroy(connlimit_conn_cachep);
+	kmem_cache_destroy(connlimit_rb_cachep);
 }
 
 module_init(connlimit_mt_init);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* RE: [PATCH 36/38] netfilter: connlimit: use keyed locks
  2014-03-17 12:42 ` [PATCH 36/38] netfilter: connlimit: use keyed locks Pablo Neira Ayuso
@ 2014-03-17 12:54   ` David Laight
  2014-03-17 14:26     ` Florian Westphal
  2014-03-17 14:00   ` Eric Dumazet
  1 sibling, 1 reply; 47+ messages in thread
From: David Laight @ 2014-03-17 12:54 UTC (permalink / raw)
  To: 'Pablo Neira Ayuso', netfilter-devel; +Cc: davem, netdev

From: Pablo Neira Ayuso
> +#define CONNLIMIT_SLOTS	256
> +#define CONNLIMIT_LOCK_SLOTS	32

You might want to make these explicitly unsigned to ensure
the divisions are unsigned (so can be done with a simple mask).

	David
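
A minimal sketch of the point above (illustrative only, not part of the
series; the helper name is made up): with an unsigned power-of-two
divisor the compiler is free to reduce the modulus to a mask.

#define CONNLIMIT_LOCK_SLOTS    32u     /* note the unsigned suffix */

static inline unsigned int connlimit_lock_slot(u32 hash)
{
        /* with an unsigned power-of-two divisor this can be compiled
         * down to: hash & (CONNLIMIT_LOCK_SLOTS - 1)
         */
        return hash % CONNLIMIT_LOCK_SLOTS;
}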




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 36/38] netfilter: connlimit: use keyed locks
  2014-03-17 12:42 ` [PATCH 36/38] netfilter: connlimit: use keyed locks Pablo Neira Ayuso
  2014-03-17 12:54   ` David Laight
@ 2014-03-17 14:00   ` Eric Dumazet
  2014-03-17 14:23     ` Florian Westphal
  2014-03-18 13:46     ` Jesper Dangaard Brouer
  1 sibling, 2 replies; 47+ messages in thread
From: Eric Dumazet @ 2014-03-17 14:00 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, davem, netdev

On Mon, 2014-03-17 at 13:42 +0100, Pablo Neira Ayuso wrote:
> From: Florian Westphal <fw@strlen.de>
> 
> connlimit currently suffers from spinlock contention, example for
> 4-core system with rps enabled:

> +#define CONNLIMIT_SLOTS	256
> +#define CONNLIMIT_LOCK_SLOTS	32



32 spinlocks use 2 cache lines (assuming 4 bytes per spinlock, and 64
bytes cache lines)

So I guess this probably should be increased to have less false sharing.

Note: This can be done later, I do not want to block this patch series
at all!

I believe this hash table of spinlocks could be global, not in each
struct xt_connlimit_data.
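
As a rough sketch of one possible follow-up (illustrative only, not
part of this series; the struct name is made up), each lock slot could
be padded to its own cache line, whether kept per instance or made
global as suggested above:

/* one lock per cache line, trading some memory for less false sharing */
struct connlimit_lock {
        spinlock_t      lock;
} ____cacheline_aligned_in_smp;

static struct connlimit_lock connlimit_locks[CONNLIMIT_LOCK_SLOTS];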

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 36/38] netfilter: connlimit: use keyed locks
  2014-03-17 14:00   ` Eric Dumazet
@ 2014-03-17 14:23     ` Florian Westphal
  2014-03-18 13:46     ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 47+ messages in thread
From: Florian Westphal @ 2014-03-17 14:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Pablo Neira Ayuso, netfilter-devel, davem, netdev

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2014-03-17 at 13:42 +0100, Pablo Neira Ayuso wrote:
> > From: Florian Westphal <fw@strlen.de>
> > 
> > connlimit currently suffers from spinlock contention, example for
> > 4-core system with rps enabled:
> 
> > +#define CONNLIMIT_SLOTS	256
> > +#define CONNLIMIT_LOCK_SLOTS	32
> 
> 32 spinlocks use 2 cache lines (assuming 4 bytes per spinlock, and 64
> bytes cache lines)
> 
> So I guess this probably should be increased to have less false sharing.

True, Jesper pointed out the same thing to me.

> I believe this hash table of spinlocks could be global, not in each
> struct xt_connlimit_data.

Good point.  Indeed, this can be global.  I did not increase it
since more locks than tree slots is illegal (need exclusive access to
each rbtree at this time). I guess we could align it or increase the number
of lock and rbtree slots (after moving lock slots out of connlimit
data).

Thanks for the heads-up Eric, I'll see about addressing this later this
week if noone beats me to it.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 36/38] netfilter: connlimit: use keyed locks
  2014-03-17 12:54   ` David Laight
@ 2014-03-17 14:26     ` Florian Westphal
  2014-03-17 14:40       ` David Laight
  0 siblings, 1 reply; 47+ messages in thread
From: Florian Westphal @ 2014-03-17 14:26 UTC (permalink / raw)
  To: David Laight; +Cc: 'Pablo Neira Ayuso', netfilter-devel, davem, netdev

David Laight <David.Laight@ACULAB.COM> wrote:
> From: Pablo Neira Ayuso
> > +#define CONNLIMIT_SLOTS	256
> > +#define CONNLIMIT_LOCK_SLOTS	32
> 
> You might want to make these explicitly unsigned to ensure
> the divisions are unsigned (so can be done with a simple mask).

I can do this, but I checked that at least on my machine gcc already
emits an 'add' instruction to reduce the hash value.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH 36/38] netfilter: connlimit: use keyed locks
  2014-03-17 14:26     ` Florian Westphal
@ 2014-03-17 14:40       ` David Laight
  0 siblings, 0 replies; 47+ messages in thread
From: David Laight @ 2014-03-17 14:40 UTC (permalink / raw)
  To: 'Florian Westphal'
  Cc: 'Pablo Neira Ayuso', netfilter-devel, davem, netdev

From: Florian Westphal
> David Laight <David.Laight@ACULAB.COM> wrote:
> > From: Pablo Neira Ayuso
> > > +#define CONNLIMIT_SLOTS	256
> > > +#define CONNLIMIT_LOCK_SLOTS	32
> >
> > You might want to make these explicitly unsigned to ensure
> > the divisions are unsigned (so can be done with a simple mask).
> 
> I can do  this but I checked that at least on my machine gcc already
> emits a 'add' instruction to reduce the hash value.

I think you meant 'and' :-)
Yes, the hash itself is probably unsigned, so it forces an unsigned modulus.
It just seemed better to force it from the other argument as well.

	David




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 00/38] Netfilter/IPVS updates for net-next
  2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
                   ` (37 preceding siblings ...)
  2014-03-17 12:42 ` [PATCH 38/38] netfilter: connlimit: use rbtree for per-host conntrack obj storage Pablo Neira Ayuso
@ 2014-03-17 19:19 ` David Miller
  38 siblings, 0 replies; 47+ messages in thread
From: David Miller @ 2014-03-17 19:19 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev

From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Mon, 17 Mar 2014 13:42:20 +0100

> The following patchset contains Netfilter/IPVS updates for net-next,
> most relevantly they are:
 ...
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git master
> 
> These changes should merge cleanly without conflicts to your net-next tree.

Pulled, thanks a lot Pablo.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 36/38] netfilter: connlimit: use keyed locks
  2014-03-17 14:00   ` Eric Dumazet
  2014-03-17 14:23     ` Florian Westphal
@ 2014-03-18 13:46     ` Jesper Dangaard Brouer
  2014-03-18 14:01       ` Eric Dumazet
  1 sibling, 1 reply; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2014-03-18 13:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: brouer, Pablo Neira Ayuso, netfilter-devel, davem, netdev

On Mon, 17 Mar 2014 07:00:08 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Mon, 2014-03-17 at 13:42 +0100, Pablo Neira Ayuso wrote:
> > From: Florian Westphal <fw@strlen.de>
> > 
> > connlimit currently suffers from spinlock contention, example for
> > 4-core system with rps enabled:
> 
> > +#define CONNLIMIT_SLOTS	256
> > +#define CONNLIMIT_LOCK_SLOTS	32
> 
> 
> 
> 32 spinlocks use 2 cache lines (assuming 4 bytes per spinlock, and 64
> bytes cache lines)

Hehe, I actually also pointed this out during my internal review, but
we never got around to fixing this.

> So I guess this probably should be increased to have less false sharing.
> 
> Note: This can be done later, I do not want to block this patch series at
> all !

Yes, lets fix it up later.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 36/38] netfilter: connlimit: use keyed locks
  2014-03-18 13:46     ` Jesper Dangaard Brouer
@ 2014-03-18 14:01       ` Eric Dumazet
  0 siblings, 0 replies; 47+ messages in thread
From: Eric Dumazet @ 2014-03-18 14:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Pablo Neira Ayuso, netfilter-devel, davem, netdev

On Tue, 2014-03-18 at 14:46 +0100, Jesper Dangaard Brouer wrote:

> Hehe, I actually also pointed this out during my internal review, but
> we never got around to fixing it.

BTW, sizeof(spinlock_t) can be 2 bytes if NR_CPUS is below 128 on x86.

So the whole hash table fits in a single cache line.
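
A throwaway compile-time check of that arithmetic (the lock size and
cache line size below are assumptions, not measured):

/* sizecheck.c -- hypothetical, only illustrates the arithmetic:
 * 32 locks * 2 bytes = 64 bytes, i.e. exactly one cache line. */
#include <assert.h>

#define LOCK_SLOTS	32
#define SPINLOCK_BYTES	2	/* assumed sizeof(spinlock_t): x86, NR_CPUS < 128 */
#define CACHELINE	64

static_assert(LOCK_SLOTS * SPINLOCK_BYTES <= CACHELINE,
	      "lock table no longer fits in a single cache line");

int main(void) { return 0; }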



^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2014-03-18 14:01 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-17 12:42 [PATCH 00/38] Netfilter/IPVS updates for net-next Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 01/38] netfilter: remove double colon Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 02/38] netfilter: xt_ipcomp: Use ntohs to ease sparse warning Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 03/38] netfilter: nft_ct: labels get support Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 04/38] netfilter: ip_set: rename nfnl_dereference()/nfnl_set() Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 05/38] netfilter: nfnetlink: add rcu_dereference_protected() helpers Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 06/38] netfilter: nf_tables: add nft_dereference() macro Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 07/38] netfilter: nf_tables: accept QUEUE/DROP verdict parameters Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 08/38] netfilter: nfnetlink_log: remove unused code Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 09/38] netfilter: nf_tables: add optional user data area to rules Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 10/38] netfilter: ipset: Follow manual page behavior for SET target on list:set Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 11/38] netfilter: ipset: Add hash: fix coccinelle warnings Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 12/38] netfilter: ipset: add hash:ip,mark data type to ipset Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 13/38] netfilter: ipset: add markmask for hash:ip,mark data type Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 14/38] netfilter: ipset: Prepare the kernel for create option flags when no extension is needed Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 15/38] netfilter: ipset: kernel: uapi: fix MARKMASK attr ABI breakage Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 16/38] netfilter: ipset: move registration message to init from net_init Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 17/38] netfilter: ipset: add forceadd kernel support for hash set types Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 18/38] sections, ipvs: Remove useless __read_mostly for ipvs genl_ops Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 19/38] ipvs: Reduce checkpatch noise in ip_vs_lblc.c Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 20/38] netfilter: trivial code cleanup and doc changes Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 21/38] netfilter: conntrack: spinlock per cpu to protect special lists Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 22/38] netfilter: avoid race with exp->master ct Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 23/38] netfilter: conntrack: seperate expect locking from nf_conntrack_lock Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 24/38] netfilter: conntrack: remove central spinlock nf_conntrack_lock Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 25/38] netfilter: nft_hash: bug fixes and resizing Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 26/38] netfilter: nf_tables: clean up nf_tables_trans_add() argument order Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 27/38] netfilter: nf_tables: restore context for expression destructors Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 28/38] netfilter: nf_tables: restore notifications for anonymous set destruction Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 29/38] netfilter: nft_ct: remove family from struct nft_ct Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 30/38] netfilter: nft_nat: fix family validation Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 31/38] netfilter: connlimit: factor hlist search into new function Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 32/38] netfilter: connlimit: improve packet-to-closed-connection logic Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 33/38] netfilter: connlimit: move insertion of new element out of count function Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 34/38] netfilter: connlimit: use kmem_cache for conn objects Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 35/38] netfilter: Convert uses of __constant_<foo> to <foo> Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 36/38] netfilter: connlimit: use keyed locks Pablo Neira Ayuso
2014-03-17 12:54   ` David Laight
2014-03-17 14:26     ` Florian Westphal
2014-03-17 14:40       ` David Laight
2014-03-17 14:00   ` Eric Dumazet
2014-03-17 14:23     ` Florian Westphal
2014-03-18 13:46     ` Jesper Dangaard Brouer
2014-03-18 14:01       ` Eric Dumazet
2014-03-17 12:42 ` [PATCH 37/38] netfilter: connlimit: make same_source_net signed Pablo Neira Ayuso
2014-03-17 12:42 ` [PATCH 38/38] netfilter: connlimit: use rbtree for per-host conntrack obj storage Pablo Neira Ayuso
2014-03-17 19:19 ` [PATCH 00/38] Netfilter/IPVS updates for net-next David Miller
