All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/10] ipset patches for nf-next
@ 2014-11-24 20:46 Jozsef Kadlecsik
  2014-11-24 20:46 ` [PATCH 01/10] netfilter: ipset: Support updating extensions when the set is full Jozsef Kadlecsik
                   ` (9 more replies)
  0 siblings, 10 replies; 17+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-24 20:46 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

Hi Pablo,

Please consider to apply the next series of patches for ipset:

- Fix parallel resizing and listing of the same set: when adding elements
  and listing of the same set were executed parallel, listing could start
  to list the original set (before resizing) and continue with the new one.
- Styles warned by checkpatch.pl fixed
- Introduce RCU in all set types instead of rwlock per set. The patch was
  performance tested by Jesper Dangaard Brouer:

  Generator: sending 12.2Mpps (tx:12264083 pps)
  Drop performance in "raw" with ipset: 8Mpps
  Drop performance in "raw" with ipset with RCU-locking: 11.3Mpps
- Remove rbtree from hash:net,iface for RCU locking
- Explicitly add padding elements to hash:net,net and hash:net,port,
  because the elements must be u32 sized for the used hash function.
- Allocate the proper size of memory when /0 networks are supported
- Simplify cidr handling for hash:*net* types (cleaning it up for RCU)
- Indicate explicitly when /0 networks are supported
- Alignment problem between 64bit kernel 32bit userspace fixed by
  introducing a new set match revision, reported by Sven-Haegar Koch
- Support updating element extensions when the set is full (fixes
  netfilter bugzilla id 880)

You can pull the changes from

        git://blackhole.kfki.hu/nf-next master

The iptables part of the new set match functionality
can be found in the iptables git tree, in the ipset
branch.

Thanks,
Jozsef
============================================================================
The following changes since commit beacd3e8ef237e077c8707395440813feef16d3f:

  netfilter: nfnetlink_log: Make use of pr_fmt where applicable (2014-11-20 14:09:01 +0100)

are available in the git repository at:

  git://blackhole.kfki.hu/nf-next master

for you to fetch changes up to 4743864b182ce10346e48c03bee5189e8951f460:

  netfilter: ipset: Fix parallel resizing and listing of the same set (2014-11-24 21:22:26 +0100)

----------------------------------------------------------------
Jozsef Kadlecsik (10):
      netfilter: ipset: Support updating extensions when the set is full
      netfilter: ipset: Alignment problem between 64bit kernel 32bit userspace
      netfilter: ipset: Indicate when /0 networks are supported
      netfilter: ipset: Simplify cidr handling for hash:*net* types
      netfilter: ipset: Allocate the proper size of memory when /0 networks are supported
      netfilter: ipset: Explicitly add padding elements to hash:net,net and hash:net,port,net
      netfilter: ipset: Remove rbtree from hash:net,iface in order to run under RCU
      netfilter: ipset: Introduce RCU in all set types instead of rwlock per set
      netfilter: ipset: styles warned by checkpatch.pl fixed
      netfilter: ipset: Fix parallel resizing and listing of the same set

 include/linux/netfilter/ipset/ip_set.h         | 102 ++--
 include/linux/netfilter/ipset/ip_set_timeout.h |  39 +-
 include/uapi/linux/netfilter/ipset/ip_set.h    |   8 +-
 include/uapi/linux/netfilter/xt_set.h          |  13 +-
 net/netfilter/ipset/ip_set_bitmap_gen.h        |   8 +-
 net/netfilter/ipset/ip_set_bitmap_ipmac.c      |   2 +-
 net/netfilter/ipset/ip_set_core.c              |  72 +--
 net/netfilter/ipset/ip_set_hash_gen.h          | 636 +++++++++++++++----------
 net/netfilter/ipset/ip_set_hash_ipportnet.c    |   2 +
 net/netfilter/ipset/ip_set_hash_net.c          |   4 +-
 net/netfilter/ipset/ip_set_hash_netiface.c     | 159 +------
 net/netfilter/ipset/ip_set_hash_netnet.c       |   4 +
 net/netfilter/ipset/ip_set_hash_netport.c      |   2 +
 net/netfilter/ipset/ip_set_hash_netportnet.c   |   4 +
 net/netfilter/ipset/ip_set_list_set.c          | 386 +++++++--------
 net/netfilter/xt_set.c                         |  80 +++-
 net/sched/em_ipset.c                           |   5 +-
 17 files changed, 842 insertions(+), 684 deletions(-)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 01/10] netfilter: ipset: Support updating extensions when the set is full
  2014-11-24 20:46 [PATCH 00/10] ipset patches for nf-next Jozsef Kadlecsik
@ 2014-11-24 20:46 ` Jozsef Kadlecsik
  2014-11-24 20:46 ` [PATCH 02/10] netfilter: ipset: Alignment problem between 64bit kernel 32bit userspace Jozsef Kadlecsik
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-24 20:46 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

When the set was full (hash type and maxelem reached), it was not
possible to update the extension part of already existing elements.
The patch removes this limitation. (Fixes netfilter bugzilla id 880.)

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 net/netfilter/ipset/ip_set_hash_gen.h | 40 +++++++++++++++--------------------
 1 file changed, 17 insertions(+), 23 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
index fee7c64e..a12ee04 100644
--- a/net/netfilter/ipset/ip_set_hash_gen.h
+++ b/net/netfilter/ipset/ip_set_hash_gen.h
@@ -633,29 +633,6 @@ mtype_add(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 	bool flag_exist = flags & IPSET_FLAG_EXIST;
 	u32 key, multi = 0;
 
-	if (h->elements >= h->maxelem && SET_WITH_FORCEADD(set)) {
-		rcu_read_lock_bh();
-		t = rcu_dereference_bh(h->table);
-		key = HKEY(value, h->initval, t->htable_bits);
-		n = hbucket(t,key);
-		if (n->pos) {
-			/* Choosing the first entry in the array to replace */
-			j = 0;
-			goto reuse_slot;
-		}
-		rcu_read_unlock_bh();
-	}
-	if (SET_WITH_TIMEOUT(set) && h->elements >= h->maxelem)
-		/* FIXME: when set is full, we slow down here */
-		mtype_expire(set, h, NLEN(set->family), set->dsize);
-
-	if (h->elements >= h->maxelem) {
-		if (net_ratelimit())
-			pr_warn("Set %s is full, maxelem %u reached\n",
-				set->name, h->maxelem);
-		return -IPSET_ERR_HASH_FULL;
-	}
-
 	rcu_read_lock_bh();
 	t = rcu_dereference_bh(h->table);
 	key = HKEY(value, h->initval, t->htable_bits);
@@ -680,6 +657,23 @@ mtype_add(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 		    j != AHASH_MAX(h) + 1)
 			j = i;
 	}
+	if (h->elements >= h->maxelem && SET_WITH_FORCEADD(set) && n->pos) {
+		/* Choosing the first entry in the array to replace */
+		j = 0;
+		goto reuse_slot;
+	}
+	if (SET_WITH_TIMEOUT(set) && h->elements >= h->maxelem)
+		/* FIXME: when set is full, we slow down here */
+		mtype_expire(set, h, NLEN(set->family), set->dsize);
+
+	if (h->elements >= h->maxelem) {
+		if (net_ratelimit())
+			pr_warn("Set %s is full, maxelem %u reached\n",
+				set->name, h->maxelem);
+		ret = -IPSET_ERR_HASH_FULL;
+		goto out;
+	}
+
 reuse_slot:
 	if (j != AHASH_MAX(h) + 1) {
 		/* Fill out reused slot */
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 02/10] netfilter: ipset: Alignment problem between 64bit kernel 32bit userspace
  2014-11-24 20:46 [PATCH 00/10] ipset patches for nf-next Jozsef Kadlecsik
  2014-11-24 20:46 ` [PATCH 01/10] netfilter: ipset: Support updating extensions when the set is full Jozsef Kadlecsik
@ 2014-11-24 20:46 ` Jozsef Kadlecsik
  2014-11-24 20:46 ` [PATCH 03/10] netfilter: ipset: Indicate when /0 networks are supported Jozsef Kadlecsik
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-24 20:46 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

Sven-Haegar Koch reported the issue:

sims:~# iptables -A OUTPUT -m set --match-set testset src -j ACCEPT
iptables: Invalid argument. Run `dmesg' for more information.

In syslog:
x_tables: ip_tables: set.3 match: invalid size 48 (kernel) != (user) 32

which was introduced by the counter extension in ipset.

The patch fixes the alignment issue with introducing a new set match
revision with the fixed underlying 'struct ip_set_counter_match'
structure.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 include/uapi/linux/netfilter/ipset/ip_set.h |  8 +++-
 include/uapi/linux/netfilter/xt_set.h       | 13 ++++-
 net/netfilter/xt_set.c                      | 73 +++++++++++++++++++++++++++--
 3 files changed, 88 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/netfilter/ipset/ip_set.h b/include/uapi/linux/netfilter/ipset/ip_set.h
index ca03119..5ab4e60 100644
--- a/include/uapi/linux/netfilter/ipset/ip_set.h
+++ b/include/uapi/linux/netfilter/ipset/ip_set.h
@@ -256,11 +256,17 @@ enum {
 	IPSET_COUNTER_GT,
 };
 
-struct ip_set_counter_match {
+/* Backward compatibility for set match v3 */
+struct ip_set_counter_match0 {
 	__u8 op;
 	__u64 value;
 };
 
+struct ip_set_counter_match {
+	__aligned_u64 value;
+	__u8 op;
+};
+
 /* Interface to iptables/ip6tables */
 
 #define SO_IP_SET		83
diff --git a/include/uapi/linux/netfilter/xt_set.h b/include/uapi/linux/netfilter/xt_set.h
index d6a1df1..d4e0234 100644
--- a/include/uapi/linux/netfilter/xt_set.h
+++ b/include/uapi/linux/netfilter/xt_set.h
@@ -66,8 +66,8 @@ struct xt_set_info_target_v2 {
 
 struct xt_set_info_match_v3 {
 	struct xt_set_info match_set;
-	struct ip_set_counter_match packets;
-	struct ip_set_counter_match bytes;
+	struct ip_set_counter_match0 packets;
+	struct ip_set_counter_match0 bytes;
 	__u32 flags;
 };
 
@@ -81,4 +81,13 @@ struct xt_set_info_target_v3 {
 	__u32 timeout;
 };
 
+/* Revision 4 match */
+
+struct xt_set_info_match_v4 {
+	struct xt_set_info match_set;
+	struct ip_set_counter_match packets;
+	struct ip_set_counter_match bytes;
+	__u32 flags;
+};
+
 #endif /*_XT_SET_H*/
diff --git a/net/netfilter/xt_set.c b/net/netfilter/xt_set.c
index 5732cd6..c3f2638 100644
--- a/net/netfilter/xt_set.c
+++ b/net/netfilter/xt_set.c
@@ -157,7 +157,7 @@ set_match_v1_destroy(const struct xt_mtdtor_param *par)
 /* Revision 3 match */
 
 static bool
-match_counter(u64 counter, const struct ip_set_counter_match *info)
+match_counter0(u64 counter, const struct ip_set_counter_match0 *info)
 {
 	switch (info->op) {
 	case IPSET_COUNTER_NONE:
@@ -192,14 +192,60 @@ set_match_v3(const struct sk_buff *skb, struct xt_action_param *par)
 	if (!(ret && opt.cmdflags & IPSET_FLAG_MATCH_COUNTERS))
 		return ret;
 
-	if (!match_counter(opt.ext.packets, &info->packets))
+	if (!match_counter0(opt.ext.packets, &info->packets))
 		return 0;
-	return match_counter(opt.ext.bytes, &info->bytes);
+	return match_counter0(opt.ext.bytes, &info->bytes);
 }
 
 #define set_match_v3_checkentry	set_match_v1_checkentry
 #define set_match_v3_destroy	set_match_v1_destroy
 
+/* Revision 4 match */
+
+static bool
+match_counter(u64 counter, const struct ip_set_counter_match *info)
+{
+	switch (info->op) {
+	case IPSET_COUNTER_NONE:
+		return true;
+	case IPSET_COUNTER_EQ:
+		return counter == info->value;
+	case IPSET_COUNTER_NE:
+		return counter != info->value;
+	case IPSET_COUNTER_LT:
+		return counter < info->value;
+	case IPSET_COUNTER_GT:
+		return counter > info->value;
+	}
+	return false;
+}
+
+static bool
+set_match_v4(const struct sk_buff *skb, CONST struct xt_action_param *par)
+{
+	const struct xt_set_info_match_v4 *info = par->matchinfo;
+	ADT_OPT(opt, par->family, info->match_set.dim,
+		info->match_set.flags, info->flags, UINT_MAX);
+	int ret;
+
+	if (info->packets.op != IPSET_COUNTER_NONE ||
+	    info->bytes.op != IPSET_COUNTER_NONE)
+		opt.cmdflags |= IPSET_FLAG_MATCH_COUNTERS;
+
+	ret = match_set(info->match_set.index, skb, par, &opt,
+			info->match_set.flags & IPSET_INV_MATCH);
+
+	if (!(ret && opt.cmdflags & IPSET_FLAG_MATCH_COUNTERS))
+		return ret;
+
+	if (!match_counter(opt.ext.packets, &info->packets))
+		return 0;
+	return match_counter(opt.ext.bytes, &info->bytes);
+}
+
+#define set_match_v4_checkentry	set_match_v1_checkentry
+#define set_match_v4_destroy	set_match_v1_destroy
+
 /* Revision 0 interface: backward compatible with netfilter/iptables */
 
 static unsigned int
@@ -573,6 +619,27 @@ static struct xt_match set_matches[] __read_mostly = {
 		.destroy	= set_match_v3_destroy,
 		.me		= THIS_MODULE
 	},
+	/* new revision for counters support: update, match */
+	{
+		.name		= "set",
+		.family		= NFPROTO_IPV4,
+		.revision	= 4,
+		.match		= set_match_v4,
+		.matchsize	= sizeof(struct xt_set_info_match_v4),
+		.checkentry	= set_match_v4_checkentry,
+		.destroy	= set_match_v4_destroy,
+		.me		= THIS_MODULE
+	},
+	{
+		.name		= "set",
+		.family		= NFPROTO_IPV6,
+		.revision	= 4,
+		.match		= set_match_v4,
+		.matchsize	= sizeof(struct xt_set_info_match_v4),
+		.checkentry	= set_match_v4_checkentry,
+		.destroy	= set_match_v4_destroy,
+		.me		= THIS_MODULE
+	},
 };
 
 static struct xt_target set_targets[] __read_mostly = {
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 03/10] netfilter: ipset: Indicate when /0 networks are supported
  2014-11-24 20:46 [PATCH 00/10] ipset patches for nf-next Jozsef Kadlecsik
  2014-11-24 20:46 ` [PATCH 01/10] netfilter: ipset: Support updating extensions when the set is full Jozsef Kadlecsik
  2014-11-24 20:46 ` [PATCH 02/10] netfilter: ipset: Alignment problem between 64bit kernel 32bit userspace Jozsef Kadlecsik
@ 2014-11-24 20:46 ` Jozsef Kadlecsik
  2014-11-24 20:46 ` [PATCH 04/10] netfilter: ipset: Simplify cidr handling for hash:*net* types Jozsef Kadlecsik
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-24 20:46 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 net/netfilter/ipset/ip_set_hash_gen.h      | 2 +-
 net/netfilter/ipset/ip_set_hash_netiface.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
index a12ee04..9428fa5 100644
--- a/net/netfilter/ipset/ip_set_hash_gen.h
+++ b/net/netfilter/ipset/ip_set_hash_gen.h
@@ -156,7 +156,7 @@ hbucket_elem_add(struct hbucket *n, u8 ahash_max, size_t dsize)
 
 #define SET_HOST_MASK(family)	(family == AF_INET ? 32 : 128)
 
-#ifdef IP_SET_HASH_WITH_MULTI
+#ifdef IP_SET_HASH_WITH_NET0
 #define NLEN(family)		(SET_HOST_MASK(family) + 1)
 #else
 #define NLEN(family)		SET_HOST_MASK(family)
diff --git a/net/netfilter/ipset/ip_set_hash_netiface.c b/net/netfilter/ipset/ip_set_hash_netiface.c
index 35dd358..758b002 100644
--- a/net/netfilter/ipset/ip_set_hash_netiface.c
+++ b/net/netfilter/ipset/ip_set_hash_netiface.c
@@ -115,6 +115,7 @@ iface_add(struct rb_root *root, const char **iface)
 #define IP_SET_HASH_WITH_NETS
 #define IP_SET_HASH_WITH_RBTREE
 #define IP_SET_HASH_WITH_MULTI
+#define IP_SET_HASH_WITH_NET0
 
 #define STREQ(a, b)	(strcmp(a, b) == 0)
 
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 04/10] netfilter: ipset: Simplify cidr handling for hash:*net* types
  2014-11-24 20:46 [PATCH 00/10] ipset patches for nf-next Jozsef Kadlecsik
                   ` (2 preceding siblings ...)
  2014-11-24 20:46 ` [PATCH 03/10] netfilter: ipset: Indicate when /0 networks are supported Jozsef Kadlecsik
@ 2014-11-24 20:46 ` Jozsef Kadlecsik
  2014-11-24 20:46 ` [PATCH 05/10] netfilter: ipset: Allocate the proper size of memory when /0 networks are supported Jozsef Kadlecsik
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-24 20:46 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 net/netfilter/ipset/ip_set_hash_gen.h | 56 +++++++++++++++++------------------
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
index 9428fa5..8ef9135 100644
--- a/net/netfilter/ipset/ip_set_hash_gen.h
+++ b/net/netfilter/ipset/ip_set_hash_gen.h
@@ -147,11 +147,17 @@ hbucket_elem_add(struct hbucket *n, u8 ahash_max, size_t dsize)
 #else
 #define __CIDR(cidr, i)		(cidr)
 #endif
+
+/* cidr + 1 is stored in net_prefixes to support /0 */
+#define SCIDR(cidr, i)		(__CIDR(cidr, i) + 1)
+
 #ifdef IP_SET_HASH_WITH_NETS_PACKED
-/* When cidr is packed with nomatch, cidr - 1 is stored in the entry */
-#define CIDR(cidr, i)		(__CIDR(cidr, i) + 1)
+/* When cidr is packed with nomatch, cidr - 1 is stored in the data entry */
+#define GCIDR(cidr, i)		(__CIDR(cidr, i) + 1)
+#define NCIDR(cidr)		(cidr)
 #else
-#define CIDR(cidr, i)		(__CIDR(cidr, i))
+#define GCIDR(cidr, i)		(__CIDR(cidr, i))
+#define NCIDR(cidr)		(cidr - 1)
 #endif
 
 #define SET_HOST_MASK(family)	(family == AF_INET ? 32 : 128)
@@ -292,24 +298,22 @@ mtype_add_cidr(struct htype *h, u8 cidr, u8 nets_length, u8 n)
 	int i, j;
 
 	/* Add in increasing prefix order, so larger cidr first */
-	for (i = 0, j = -1; i < nets_length && h->nets[i].nets[n]; i++) {
+	for (i = 0, j = -1; i < nets_length && h->nets[i].cidr[n]; i++) {
 		if (j != -1)
 			continue;
 		else if (h->nets[i].cidr[n] < cidr)
 			j = i;
 		else if (h->nets[i].cidr[n] == cidr) {
-			h->nets[i].nets[n]++;
+			h->nets[cidr - 1].nets[n]++;
 			return;
 		}
 	}
 	if (j != -1) {
-		for (; i > j; i--) {
+		for (; i > j; i--)
 			h->nets[i].cidr[n] = h->nets[i - 1].cidr[n];
-			h->nets[i].nets[n] = h->nets[i - 1].nets[n];
-		}
 	}
 	h->nets[i].cidr[n] = cidr;
-	h->nets[i].nets[n] = 1;
+	h->nets[cidr - 1].nets[n] = 1;
 }
 
 static void
@@ -320,16 +324,12 @@ mtype_del_cidr(struct htype *h, u8 cidr, u8 nets_length, u8 n)
 	for (i = 0; i < nets_length; i++) {
 	        if (h->nets[i].cidr[n] != cidr)
 	                continue;
-                if (h->nets[i].nets[n] > 1 || i == net_end ||
-                    h->nets[i + 1].nets[n] == 0) {
-                        h->nets[i].nets[n]--;
+		h->nets[cidr -1].nets[n]--;
+		if (h->nets[cidr -1].nets[n] > 0)
                         return;
-                }
-                for (j = i; j < net_end && h->nets[j].nets[n]; j++) {
+		for (j = i; j < net_end && h->nets[j].cidr[n]; j++)
 		        h->nets[j].cidr[n] = h->nets[j + 1].cidr[n];
-		        h->nets[j].nets[n] = h->nets[j + 1].nets[n];
-                }
-                h->nets[j].nets[n] = 0;
+		h->nets[j].cidr[n] = 0;
                 return;
 	}
 }
@@ -486,7 +486,7 @@ mtype_expire(struct ip_set *set, struct htype *h, u8 nets_length, size_t dsize)
 				pr_debug("expired %u/%u\n", i, j);
 #ifdef IP_SET_HASH_WITH_NETS
 				for (k = 0; k < IPSET_NET_COUNT; k++)
-					mtype_del_cidr(h, CIDR(data->cidr, k),
+					mtype_del_cidr(h, SCIDR(data->cidr, k),
 						       nets_length, k);
 #endif
 				ip_set_ext_destroy(set, data);
@@ -680,9 +680,9 @@ reuse_slot:
 		data = ahash_data(n, j, set->dsize);
 #ifdef IP_SET_HASH_WITH_NETS
 		for (i = 0; i < IPSET_NET_COUNT; i++) {
-			mtype_del_cidr(h, CIDR(data->cidr, i),
+			mtype_del_cidr(h, SCIDR(data->cidr, i),
 				       NLEN(set->family), i);
-			mtype_add_cidr(h, CIDR(d->cidr, i),
+			mtype_add_cidr(h, SCIDR(d->cidr, i),
 				       NLEN(set->family), i);
 		}
 #endif
@@ -699,7 +699,7 @@ reuse_slot:
 		data = ahash_data(n, n->pos++, set->dsize);
 #ifdef IP_SET_HASH_WITH_NETS
 		for (i = 0; i < IPSET_NET_COUNT; i++)
-			mtype_add_cidr(h, CIDR(d->cidr, i), NLEN(set->family),
+			mtype_add_cidr(h, SCIDR(d->cidr, i), NLEN(set->family),
 				       i);
 #endif
 		h->elements++;
@@ -760,7 +760,7 @@ mtype_del(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 		h->elements--;
 #ifdef IP_SET_HASH_WITH_NETS
 		for (j = 0; j < IPSET_NET_COUNT; j++)
-			mtype_del_cidr(h, CIDR(d->cidr, j), NLEN(set->family),
+			mtype_del_cidr(h, SCIDR(d->cidr, j), NLEN(set->family),
 				       j);
 #endif
 		ip_set_ext_destroy(set, data);
@@ -821,15 +821,15 @@ mtype_test_cidrs(struct ip_set *set, struct mtype_elem *d,
 	u8 nets_length = NLEN(set->family);
 
 	pr_debug("test by nets\n");
-	for (; j < nets_length && h->nets[j].nets[0] && !multi; j++) {
+	for (; j < nets_length && h->nets[j].cidr[0] && !multi; j++) {
 #if IPSET_NET_COUNT == 2
 		mtype_data_reset_elem(d, &orig);
-		mtype_data_netmask(d, h->nets[j].cidr[0], false);
-		for (k = 0; k < nets_length && h->nets[k].nets[1] && !multi;
+		mtype_data_netmask(d, NCIDR(h->nets[j].cidr[0]), false);
+		for (k = 0; k < nets_length && h->nets[k].cidr[1] && !multi;
 		     k++) {
-			mtype_data_netmask(d, h->nets[k].cidr[1], true);
+			mtype_data_netmask(d, NCIDR(h->nets[k].cidr[1]), true);
 #else
-		mtype_data_netmask(d, h->nets[j].cidr[0]);
+		mtype_data_netmask(d, NCIDR(h->nets[j].cidr[0]));
 #endif
 		key = HKEY(d, h->initval, t->htable_bits);
 		n = hbucket(t, key);
@@ -877,7 +877,7 @@ mtype_test(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 	/* If we test an IP address and not a network address,
 	 * try all possible network sizes */
 	for (i = 0; i < IPSET_NET_COUNT; i++)
-		if (CIDR(d->cidr, i) != SET_HOST_MASK(set->family))
+		if (GCIDR(d->cidr, i) != SET_HOST_MASK(set->family))
 			break;
 	if (i == IPSET_NET_COUNT) {
 		ret = mtype_test_cidrs(set, d, ext, mext, flags);
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 05/10] netfilter: ipset: Allocate the proper size of memory when /0 networks are supported
  2014-11-24 20:46 [PATCH 00/10] ipset patches for nf-next Jozsef Kadlecsik
                   ` (3 preceding siblings ...)
  2014-11-24 20:46 ` [PATCH 04/10] netfilter: ipset: Simplify cidr handling for hash:*net* types Jozsef Kadlecsik
@ 2014-11-24 20:46 ` Jozsef Kadlecsik
  2014-11-24 20:46 ` [PATCH 06/10] netfilter: ipset: Explicitly add padding elements to hash:net,net and hash:net,port,net Jozsef Kadlecsik
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-24 20:46 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 net/netfilter/ipset/ip_set_hash_gen.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
index 8ef9135..974ff38 100644
--- a/net/netfilter/ipset/ip_set_hash_gen.h
+++ b/net/netfilter/ipset/ip_set_hash_gen.h
@@ -1101,8 +1101,7 @@ IPSET_TOKEN(HTYPE, _create)(struct net *net, struct ip_set *set,
 
 	hsize = sizeof(*h);
 #ifdef IP_SET_HASH_WITH_NETS
-	hsize += sizeof(struct net_prefixes) *
-		(set->family == NFPROTO_IPV4 ? 32 : 128);
+	hsize += sizeof(struct net_prefixes) * NLEN(set->family);
 #endif
 	h = kzalloc(hsize, GFP_KERNEL);
 	if (!h)
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 06/10] netfilter: ipset: Explicitly add padding elements to hash:net,net and hash:net,port,net
  2014-11-24 20:46 [PATCH 00/10] ipset patches for nf-next Jozsef Kadlecsik
                   ` (4 preceding siblings ...)
  2014-11-24 20:46 ` [PATCH 05/10] netfilter: ipset: Allocate the proper size of memory when /0 networks are supported Jozsef Kadlecsik
@ 2014-11-24 20:46 ` Jozsef Kadlecsik
  2014-11-24 20:46 ` [PATCH 07/10] netfilter: ipset: Remove rbtree from hash:net,iface in order to run under RCU Jozsef Kadlecsik
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-24 20:46 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

The elements must be u32 sized for the used hash function.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 net/netfilter/ipset/ip_set_hash_netnet.c     | 2 ++
 net/netfilter/ipset/ip_set_hash_netportnet.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/net/netfilter/ipset/ip_set_hash_netnet.c b/net/netfilter/ipset/ip_set_hash_netnet.c
index da00284..ea8772a 100644
--- a/net/netfilter/ipset/ip_set_hash_netnet.c
+++ b/net/netfilter/ipset/ip_set_hash_netnet.c
@@ -46,6 +46,7 @@ struct hash_netnet4_elem {
 		__be64 ipcmp;
 	};
 	u8 nomatch;
+	u8 padding;
 	union {
 		u8 cidr[2];
 		u16 ccmp;
@@ -271,6 +272,7 @@ hash_netnet4_uadt(struct ip_set *set, struct nlattr *tb[],
 struct hash_netnet6_elem {
 	union nf_inet_addr ip[2];
 	u8 nomatch;
+	u8 padding;
 	union {
 		u8 cidr[2];
 		u16 ccmp;
diff --git a/net/netfilter/ipset/ip_set_hash_netportnet.c b/net/netfilter/ipset/ip_set_hash_netportnet.c
index b8053d6..bfaa94c 100644
--- a/net/netfilter/ipset/ip_set_hash_netportnet.c
+++ b/net/netfilter/ipset/ip_set_hash_netportnet.c
@@ -53,6 +53,7 @@ struct hash_netportnet4_elem {
 		u8 cidr[2];
 		u16 ccmp;
 	};
+	u16 padding;
 	u8 nomatch:1;
 	u8 proto;
 };
@@ -324,6 +325,7 @@ struct hash_netportnet6_elem {
 		u8 cidr[2];
 		u16 ccmp;
 	};
+	u16 padding;
 	u8 nomatch:1;
 	u8 proto;
 };
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 07/10] netfilter: ipset: Remove rbtree from hash:net,iface in order to run under RCU
  2014-11-24 20:46 [PATCH 00/10] ipset patches for nf-next Jozsef Kadlecsik
                   ` (5 preceding siblings ...)
  2014-11-24 20:46 ` [PATCH 06/10] netfilter: ipset: Explicitly add padding elements to hash:net,net and hash:net,port,net Jozsef Kadlecsik
@ 2014-11-24 20:46 ` Jozsef Kadlecsik
  2014-11-24 20:46 ` [PATCH 08/10] netfilter: ipset: Introduce RCU in all set types instead of rwlock per set Jozsef Kadlecsik
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-24 20:46 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 net/netfilter/ipset/ip_set_hash_netiface.c | 156 ++++-------------------------
 1 file changed, 17 insertions(+), 139 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_hash_netiface.c b/net/netfilter/ipset/ip_set_hash_netiface.c
index 758b002..d8a53ec 100644
--- a/net/netfilter/ipset/ip_set_hash_netiface.c
+++ b/net/netfilter/ipset/ip_set_hash_netiface.c
@@ -13,7 +13,6 @@
 #include <linux/skbuff.h>
 #include <linux/errno.h>
 #include <linux/random.h>
-#include <linux/rbtree.h>
 #include <net/ip.h>
 #include <net/ipv6.h>
 #include <net/netlink.h>
@@ -36,88 +35,14 @@ MODULE_AUTHOR("Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>");
 IP_SET_MODULE_DESC("hash:net,iface", IPSET_TYPE_REV_MIN, IPSET_TYPE_REV_MAX);
 MODULE_ALIAS("ip_set_hash:net,iface");
 
-/* Interface name rbtree */
-
-struct iface_node {
-	struct rb_node node;
-	char iface[IFNAMSIZ];
-};
-
-#define iface_data(n)	(rb_entry(n, struct iface_node, node)->iface)
-
-static void
-rbtree_destroy(struct rb_root *root)
-{
-	struct iface_node *node, *next;
-
-	rbtree_postorder_for_each_entry_safe(node, next, root, node)
-		kfree(node);
-
-	*root = RB_ROOT;
-}
-
-static int
-iface_test(struct rb_root *root, const char **iface)
-{
-	struct rb_node *n = root->rb_node;
-
-	while (n) {
-		const char *d = iface_data(n);
-		int res = strcmp(*iface, d);
-
-		if (res < 0)
-			n = n->rb_left;
-		else if (res > 0)
-			n = n->rb_right;
-		else {
-			*iface = d;
-			return 1;
-		}
-	}
-	return 0;
-}
-
-static int
-iface_add(struct rb_root *root, const char **iface)
-{
-	struct rb_node **n = &(root->rb_node), *p = NULL;
-	struct iface_node *d;
-
-	while (*n) {
-		char *ifname = iface_data(*n);
-		int res = strcmp(*iface, ifname);
-
-		p = *n;
-		if (res < 0)
-			n = &((*n)->rb_left);
-		else if (res > 0)
-			n = &((*n)->rb_right);
-		else {
-			*iface = ifname;
-			return 0;
-		}
-	}
-
-	d = kzalloc(sizeof(*d), GFP_ATOMIC);
-	if (!d)
-		return -ENOMEM;
-	strcpy(d->iface, *iface);
-
-	rb_link_node(&d->node, p, n);
-	rb_insert_color(&d->node, root);
-
-	*iface = d->iface;
-	return 0;
-}
-
 /* Type specific function prefix */
 #define HTYPE		hash_netiface
 #define IP_SET_HASH_WITH_NETS
-#define IP_SET_HASH_WITH_RBTREE
 #define IP_SET_HASH_WITH_MULTI
 #define IP_SET_HASH_WITH_NET0
 
 #define STREQ(a, b)	(strcmp(a, b) == 0)
+#define IFNAMCPY(a, b)	strncpy(a, b, IFNAMSIZ)
 
 /* IPv4 variant */
 
@@ -136,7 +61,7 @@ struct hash_netiface4_elem {
 	u8 cidr;
 	u8 nomatch;
 	u8 elem;
-	const char *iface;
+	char iface[IFNAMSIZ];
 };
 
 /* Common functions */
@@ -150,7 +75,7 @@ hash_netiface4_data_equal(const struct hash_netiface4_elem *ip1,
 	       ip1->cidr == ip2->cidr &&
 	       (++*multi) &&
 	       ip1->physdev == ip2->physdev &&
-	       ip1->iface == ip2->iface;
+	       STREQ(ip1->iface, ip2->iface);
 }
 
 static inline int
@@ -223,7 +148,6 @@ hash_netiface4_kadt(struct ip_set *set, const struct sk_buff *skb,
 		.elem = 1,
 	};
 	struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
-	int ret;
 
 	if (e.cidr == 0)
 		return -EINVAL;
@@ -233,8 +157,8 @@ hash_netiface4_kadt(struct ip_set *set, const struct sk_buff *skb,
 	ip4addrptr(skb, opt->flags & IPSET_DIM_ONE_SRC, &e.ip);
 	e.ip &= ip_set_netmask(e.cidr);
 
-#define IFACE(dir)	(par->dir ? par->dir->name : NULL)
-#define PHYSDEV(dir)	(nf_bridge->dir ? nf_bridge->dir->name : NULL)
+#define IFACE(dir)	(par->dir ? par->dir->name : "")
+#define PHYSDEV(dir)	(nf_bridge->dir ? nf_bridge->dir->name : "")
 #define SRCDIR		(opt->flags & IPSET_DIM_TWO_SRC)
 
 	if (opt->cmdflags & IPSET_FLAG_PHYSDEV) {
@@ -243,26 +167,15 @@ hash_netiface4_kadt(struct ip_set *set, const struct sk_buff *skb,
 
 		if (!nf_bridge)
 			return -EINVAL;
-		e.iface = SRCDIR ? PHYSDEV(physindev) : PHYSDEV(physoutdev);
+		IFNAMCPY(e.iface,
+			 SRCDIR ? PHYSDEV(physindev) : PHYSDEV(physoutdev));
 		e.physdev = 1;
-#else
-		e.iface = NULL;
 #endif
 	} else
-		e.iface = SRCDIR ? IFACE(in) : IFACE(out);
+		IFNAMCPY(e.iface, SRCDIR ? IFACE(in) : IFACE(out));
 
-	if (!e.iface)
+	if (strlen(e.iface) == 0)
 		return -EINVAL;
-	ret = iface_test(&h->rbtree, &e.iface);
-	if (adt == IPSET_ADD) {
-		if (!ret) {
-			ret = iface_add(&h->rbtree, &e.iface);
-			if (ret)
-				return ret;
-		}
-	} else if (!ret)
-		return ret;
-
 	return adtfn(set, &e, &ext, &opt->ext, opt->cmdflags);
 }
 
@@ -275,7 +188,6 @@ hash_netiface4_uadt(struct ip_set *set, struct nlattr *tb[],
 	struct hash_netiface4_elem e = { .cidr = HOST_MASK, .elem = 1 };
 	struct ip_set_ext ext = IP_SET_INIT_UEXT(set);
 	u32 ip = 0, ip_to = 0, last;
-	char iface[IFNAMSIZ];
 	int ret;
 
 	if (unlikely(!tb[IPSET_ATTR_IP] ||
@@ -302,18 +214,7 @@ hash_netiface4_uadt(struct ip_set *set, struct nlattr *tb[],
 		if (e.cidr > HOST_MASK)
 			return -IPSET_ERR_INVALID_CIDR;
 	}
-
-	strcpy(iface, nla_data(tb[IPSET_ATTR_IFACE]));
-	e.iface = iface;
-	ret = iface_test(&h->rbtree, &e.iface);
-	if (adt == IPSET_ADD) {
-		if (!ret) {
-			ret = iface_add(&h->rbtree, &e.iface);
-			if (ret)
-				return ret;
-		}
-	} else if (!ret)
-		return ret;
+	IFNAMCPY(e.iface, nla_data(tb[IPSET_ATTR_IFACE]));
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
@@ -372,7 +273,7 @@ struct hash_netiface6_elem {
 	u8 cidr;
 	u8 nomatch;
 	u8 elem;
-	const char *iface;
+	char iface[IFNAMSIZ];
 };
 
 /* Common functions */
@@ -386,7 +287,7 @@ hash_netiface6_data_equal(const struct hash_netiface6_elem *ip1,
 	       ip1->cidr == ip2->cidr &&
 	       (++*multi) &&
 	       ip1->physdev == ip2->physdev &&
-	       ip1->iface == ip2->iface;
+	       STREQ(ip1->iface, ip2->iface);
 }
 
 static inline int
@@ -464,7 +365,6 @@ hash_netiface6_kadt(struct ip_set *set, const struct sk_buff *skb,
 		.elem = 1,
 	};
 	struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
-	int ret;
 
 	if (e.cidr == 0)
 		return -EINVAL;
@@ -480,25 +380,15 @@ hash_netiface6_kadt(struct ip_set *set, const struct sk_buff *skb,
 
 		if (!nf_bridge)
 			return -EINVAL;
-		e.iface = SRCDIR ? PHYSDEV(physindev) : PHYSDEV(physoutdev);
+		IFNAMCPY(e.iface,
+			 SRCDIR ? PHYSDEV(physindev) : PHYSDEV(physoutdev));
 		e.physdev = 1;
-#else
-		e.iface = NULL;
 #endif
 	} else
-		e.iface = SRCDIR ? IFACE(in) : IFACE(out);
+		IFNAMCPY(e.iface, SRCDIR ? IFACE(in) : IFACE(out));
 
-	if (!e.iface)
+	if (strlen(e.iface) == 0)
 		return -EINVAL;
-	ret = iface_test(&h->rbtree, &e.iface);
-	if (adt == IPSET_ADD) {
-		if (!ret) {
-			ret = iface_add(&h->rbtree, &e.iface);
-			if (ret)
-				return ret;
-		}
-	} else if (!ret)
-		return ret;
 
 	return adtfn(set, &e, &ext, &opt->ext, opt->cmdflags);
 }
@@ -507,11 +397,9 @@ static int
 hash_netiface6_uadt(struct ip_set *set, struct nlattr *tb[],
 		   enum ipset_adt adt, u32 *lineno, u32 flags, bool retried)
 {
-	struct hash_netiface *h = set->data;
 	ipset_adtfn adtfn = set->variant->adt[adt];
 	struct hash_netiface6_elem e = { .cidr = HOST_MASK, .elem = 1 };
 	struct ip_set_ext ext = IP_SET_INIT_UEXT(set);
-	char iface[IFNAMSIZ];
 	int ret;
 
 	if (unlikely(!tb[IPSET_ATTR_IP] ||
@@ -541,17 +429,7 @@ hash_netiface6_uadt(struct ip_set *set, struct nlattr *tb[],
 		return -IPSET_ERR_INVALID_CIDR;
 	ip6_netmask(&e.ip, e.cidr);
 
-	strcpy(iface, nla_data(tb[IPSET_ATTR_IFACE]));
-	e.iface = iface;
-	ret = iface_test(&h->rbtree, &e.iface);
-	if (adt == IPSET_ADD) {
-		if (!ret) {
-			ret = iface_add(&h->rbtree, &e.iface);
-			if (ret)
-				return ret;
-		}
-	} else if (!ret)
-		return ret;
+	IFNAMCPY(e.iface, nla_data(tb[IPSET_ATTR_IFACE]));
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 08/10] netfilter: ipset: Introduce RCU in all set types instead of rwlock per set
  2014-11-24 20:46 [PATCH 00/10] ipset patches for nf-next Jozsef Kadlecsik
                   ` (6 preceding siblings ...)
  2014-11-24 20:46 ` [PATCH 07/10] netfilter: ipset: Remove rbtree from hash:net,iface in order to run under RCU Jozsef Kadlecsik
@ 2014-11-24 20:46 ` Jozsef Kadlecsik
  2014-11-26 13:14   ` Pablo Neira Ayuso
  2014-11-24 20:46 ` [PATCH 09/10] netfilter: ipset: styles warned by checkpatch.pl fixed Jozsef Kadlecsik
  2014-11-24 20:46 ` [PATCH 10/10] netfilter: ipset: Fix parallel resizing and listing of the same set Jozsef Kadlecsik
  9 siblings, 1 reply; 17+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-24 20:46 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

Performance is tested by Jesper Dangaard Brouer:

Simple drop in FORWARD
~~~~~~~~~~~~~~~~~~~~

Dropping via simple iptables net-mask match::

 iptables -t raw -N simple || iptables -t raw -F simple
 iptables -t raw -I simple  -s 198.18.0.0/15 -j DROP
 iptables -t raw -D PREROUTING -j simple
 iptables -t raw -I PREROUTING -j simple

Drop performance in "raw": 11.3Mpps

Generator: sending 12.2Mpps (tx:12264083 pps)

Drop via original ipset in RAW table
~~~~~~~~~~~~~~~~~~~~~~~~~

Create a set with lots of elements::
 sudo ./ipset destroy test
 echo "create test hash:ip hashsize 65536" > test.set
 for x in `seq 0 255`; do
    for y in `seq 0 255`; do
        echo "add test 198.18.$x.$y" >> test.set
    done
 done
 sudo ./ipset restore < test.set

Dropping via ipset::

 iptables -t raw -F
 iptables -t raw -N net198 || iptables -t raw -F net198
 iptables -t raw -I net198 -m set --match-set test src -j DROP
 iptables -t raw -I PREROUTING -j net198

Drop performance in "raw" with ipset: 8Mpps

Perf report numbers ipset drop in "raw"::

 +   24.65%  ksoftirqd/1  [ip_set]           [k] ip_set_test
 -   21.42%  ksoftirqd/1  [kernel.kallsyms]  [k] _raw_read_lock_bh
    - _raw_read_lock_bh
       + 99.88% ip_set_test
 -   19.42%  ksoftirqd/1  [kernel.kallsyms]  [k] _raw_read_unlock_bh
    - _raw_read_unlock_bh
       + 99.72% ip_set_test
 +    4.31%  ksoftirqd/1  [ip_set_hash_ip]   [k] hash_ip4_kadt
 +    2.27%  ksoftirqd/1  [ixgbe]            [k] ixgbe_fetch_rx_buffer
 +    2.18%  ksoftirqd/1  [ip_tables]        [k] ipt_do_table
 +    1.81%  ksoftirqd/1  [ip_set_hash_ip]   [k] hash_ip4_test
 +    1.61%  ksoftirqd/1  [kernel.kallsyms]  [k] __netif_receive_skb_core
 +    1.44%  ksoftirqd/1  [kernel.kallsyms]  [k] build_skb
 +    1.42%  ksoftirqd/1  [kernel.kallsyms]  [k] ip_rcv
 +    1.36%  ksoftirqd/1  [kernel.kallsyms]  [k] __local_bh_enable_ip
 +    1.16%  ksoftirqd/1  [kernel.kallsyms]  [k] dev_gro_receive
 +    1.09%  ksoftirqd/1  [kernel.kallsyms]  [k] __rcu_read_unlock
 +    0.96%  ksoftirqd/1  [ixgbe]            [k] ixgbe_clean_rx_irq
 +    0.95%  ksoftirqd/1  [kernel.kallsyms]  [k] __netdev_alloc_frag
 +    0.88%  ksoftirqd/1  [kernel.kallsyms]  [k] kmem_cache_alloc
 +    0.87%  ksoftirqd/1  [xt_set]           [k] set_match_v3
 +    0.85%  ksoftirqd/1  [kernel.kallsyms]  [k] inet_gro_receive
 +    0.83%  ksoftirqd/1  [kernel.kallsyms]  [k] nf_iterate
 +    0.76%  ksoftirqd/1  [kernel.kallsyms]  [k] put_compound_page
 +    0.75%  ksoftirqd/1  [kernel.kallsyms]  [k] __rcu_read_lock

Drop via ipset in RAW table with RCU-locking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With RCU locking, the RW-lock is gone.

Drop performance in "raw" with ipset with RCU-locking: 11.3Mpps

Performance-tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 include/linux/netfilter/ipset/ip_set.h         |  82 +++-
 include/linux/netfilter/ipset/ip_set_timeout.h |  39 +-
 net/netfilter/ipset/ip_set_bitmap_gen.h        |   8 +-
 net/netfilter/ipset/ip_set_bitmap_ipmac.c      |   2 +-
 net/netfilter/ipset/ip_set_core.c              |  35 +-
 net/netfilter/ipset/ip_set_hash_gen.h          | 547 +++++++++++++++----------
 net/netfilter/ipset/ip_set_list_set.c          | 386 ++++++++---------
 7 files changed, 614 insertions(+), 485 deletions(-)

diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
index f1606fa..418d360 100644
--- a/include/linux/netfilter/ipset/ip_set.h
+++ b/include/linux/netfilter/ipset/ip_set.h
@@ -113,10 +113,10 @@ struct ip_set_comment {
 };
 
 struct ip_set_skbinfo {
-	u32 skbmark;
-	u32 skbmarkmask;
-	u32 skbprio;
-	u16 skbqueue;
+	u32 __rcu skbmark;
+	u32 __rcu skbmarkmask;
+	u32 __rcu skbprio;
+	u16 __rcu skbqueue;
 };
 
 struct ip_set;
@@ -223,7 +223,7 @@ struct ip_set {
 	/* The name of the set */
 	char name[IPSET_MAXNAMELEN];
 	/* Lock protecting the set data */
-	rwlock_t lock;
+	spinlock_t lock;
 	/* References to the set */
 	u32 ref;
 	/* The core set type */
@@ -322,30 +322,72 @@ ip_set_update_counter(struct ip_set_counter *counter,
 	}
 }
 
+/* RCU-safe assign value */
+#define IP_SET_RCU_ASSIGN(ptr, value)	\
+do {					\
+	smp_wmb();			\
+	*(ptr) = value;			\
+} while (0)
+
+static inline void
+ip_set_rcu_assign_ulong(unsigned long *v, unsigned long value)
+{
+	IP_SET_RCU_ASSIGN(v, value);
+}
+
+static inline void
+ip_set_rcu_assign_u32(u32 *v, u32 value)
+{
+	IP_SET_RCU_ASSIGN(v, value);
+}
+
+static inline void
+ip_set_rcu_assign_u16(u16 *v, u16 value)
+{
+	IP_SET_RCU_ASSIGN(v, value);
+}
+
+static inline void
+ip_set_rcu_assign_u8(u8 *v, u8 value)
+{
+	IP_SET_RCU_ASSIGN(v, value);
+}
+
+#define ip_set_rcu_deref(t)		\
+	rcu_dereference_index_check(t,	\
+		rcu_read_lock_held() || rcu_read_lock_bh_held())
+
 static inline void
 ip_set_get_skbinfo(struct ip_set_skbinfo *skbinfo,
 		      const struct ip_set_ext *ext,
 		      struct ip_set_ext *mext, u32 flags)
 {
-		mext->skbmark = skbinfo->skbmark;
-		mext->skbmarkmask = skbinfo->skbmarkmask;
-		mext->skbprio = skbinfo->skbprio;
-		mext->skbqueue = skbinfo->skbqueue;
+		mext->skbmark = ip_set_rcu_deref(skbinfo->skbmark);
+		mext->skbmarkmask = ip_set_rcu_deref(skbinfo->skbmarkmask);
+		mext->skbprio = ip_set_rcu_deref(skbinfo->skbprio);
+		mext->skbqueue = ip_set_rcu_deref(skbinfo->skbqueue);
 }
 static inline bool
 ip_set_put_skbinfo(struct sk_buff *skb, struct ip_set_skbinfo *skbinfo)
 {
+	u32 skbmark, skbmarkmask, skbprio;
+	u16 skbqueue;
+
+	skbmark = ip_set_rcu_deref(skbinfo->skbmark);
+	skbmarkmask = ip_set_rcu_deref(skbinfo->skbmarkmask);
+	skbprio = ip_set_rcu_deref(skbinfo->skbprio);
+	skbqueue = ip_set_rcu_deref(skbinfo->skbqueue);
 	/* Send nonzero parameters only */
-	return ((skbinfo->skbmark || skbinfo->skbmarkmask) &&
+	return ((skbmark || skbmarkmask) &&
 		nla_put_net64(skb, IPSET_ATTR_SKBMARK,
-			      cpu_to_be64((u64)skbinfo->skbmark << 32 |
-					  skbinfo->skbmarkmask))) ||
-	       (skbinfo->skbprio &&
+			      cpu_to_be64((u64)skbmark << 32 |
+					  skbmarkmask))) ||
+	       (skbprio &&
 	        nla_put_net32(skb, IPSET_ATTR_SKBPRIO,
-			      cpu_to_be32(skbinfo->skbprio))) ||
-	       (skbinfo->skbqueue &&
+			      cpu_to_be32(skbprio))) ||
+	       (skbqueue &&
 	        nla_put_net16(skb, IPSET_ATTR_SKBQUEUE,
-			     cpu_to_be16(skbinfo->skbqueue)));
+			     cpu_to_be16(skbqueue)));
 
 }
 
@@ -353,10 +395,10 @@ static inline void
 ip_set_init_skbinfo(struct ip_set_skbinfo *skbinfo,
 		    const struct ip_set_ext *ext)
 {
-	skbinfo->skbmark = ext->skbmark;
-	skbinfo->skbmarkmask = ext->skbmarkmask;
-	skbinfo->skbprio = ext->skbprio;
-	skbinfo->skbqueue = ext->skbqueue;
+	ip_set_rcu_assign_u32(&skbinfo->skbmark, ext->skbmark);
+	ip_set_rcu_assign_u32(&skbinfo->skbmarkmask, ext->skbmarkmask);
+	ip_set_rcu_assign_u32(&skbinfo->skbprio, ext->skbprio);
+	ip_set_rcu_assign_u16(&skbinfo->skbqueue, ext->skbqueue);
 }
 
 static inline bool
diff --git a/include/linux/netfilter/ipset/ip_set_timeout.h b/include/linux/netfilter/ipset/ip_set_timeout.h
index 83c2f9e..9e30031 100644
--- a/include/linux/netfilter/ipset/ip_set_timeout.h
+++ b/include/linux/netfilter/ipset/ip_set_timeout.h
@@ -40,38 +40,47 @@ ip_set_timeout_uget(struct nlattr *tb)
 }
 
 static inline bool
-ip_set_timeout_test(unsigned long timeout)
+__ip_set_timeout_expired(unsigned long t)
 {
-	return timeout == IPSET_ELEM_PERMANENT ||
-	       time_is_after_jiffies(timeout);
+	return t != IPSET_ELEM_PERMANENT && time_is_before_jiffies(t);
+}
+
+static inline bool
+ip_set_timeout_expired_rcu(unsigned long *timeout)
+{
+	unsigned long t = ip_set_rcu_deref(*timeout);
+
+	return __ip_set_timeout_expired(t);
 }
 
 static inline bool
 ip_set_timeout_expired(unsigned long *timeout)
 {
-	return *timeout != IPSET_ELEM_PERMANENT &&
-	       time_is_before_jiffies(*timeout);
+	return __ip_set_timeout_expired(*timeout);
 }
 
 static inline void
-ip_set_timeout_set(unsigned long *timeout, u32 t)
+ip_set_timeout_set(unsigned long *timeout, u32 value)
 {
-	if (!t) {
-		*timeout = IPSET_ELEM_PERMANENT;
-		return;
-	}
+	unsigned long t;
+
+	if (!value)
+		return ip_set_rcu_assign_ulong(timeout, IPSET_ELEM_PERMANENT);
 
-	*timeout = msecs_to_jiffies(t * 1000) + jiffies;
-	if (*timeout == IPSET_ELEM_PERMANENT)
+	t = msecs_to_jiffies(value * 1000) + jiffies;
+	if (t == IPSET_ELEM_PERMANENT)
 		/* Bingo! :-) */
-		(*timeout)--;
+		t--;
+	ip_set_rcu_assign_ulong(timeout, t);
 }
 
 static inline u32
 ip_set_timeout_get(unsigned long *timeout)
 {
-	return *timeout == IPSET_ELEM_PERMANENT ? 0 :
-		jiffies_to_msecs(*timeout - jiffies)/1000;
+	unsigned long t = ip_set_rcu_deref(*timeout);
+
+	return t == IPSET_ELEM_PERMANENT ? 0 :
+		jiffies_to_msecs(t - jiffies)/1000;
 }
 
 #endif	/* __KERNEL__ */
diff --git a/net/netfilter/ipset/ip_set_bitmap_gen.h b/net/netfilter/ipset/ip_set_bitmap_gen.h
index 6f024a8..8d919cc 100644
--- a/net/netfilter/ipset/ip_set_bitmap_gen.h
+++ b/net/netfilter/ipset/ip_set_bitmap_gen.h
@@ -124,7 +124,7 @@ mtype_test(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 	if (ret <= 0)
 		return ret;
 	if (SET_WITH_TIMEOUT(set) &&
-	    ip_set_timeout_expired(ext_timeout(x, set)))
+	    ip_set_timeout_expired_rcu(ext_timeout(x, set)))
 		return 0;
 	if (SET_WITH_COUNTER(set))
 		ip_set_update_counter(ext_counter(x, set), ext, mext, flags);
@@ -216,7 +216,7 @@ mtype_list(const struct ip_set *set,
 #ifdef IP_SET_BITMAP_STORED_TIMEOUT
 		     mtype_is_filled((const struct mtype_elem *) x) &&
 #endif
-		     ip_set_timeout_expired(ext_timeout(x, set))))
+		     ip_set_timeout_expired_rcu(ext_timeout(x, set))))
 			continue;
 		nested = ipset_nest_start(skb, IPSET_ATTR_DATA);
 		if (!nested) {
@@ -260,7 +260,7 @@ mtype_gc(unsigned long ul_set)
 
 	/* We run parallel with other readers (test element)
 	 * but adding/deleting new entries is locked out */
-	read_lock_bh(&set->lock);
+	spin_lock_bh(&set->lock);
 	for (id = 0; id < map->elements; id++)
 		if (mtype_gc_test(id, map, set->dsize)) {
 			x = get_ext(set, map, id);
@@ -269,7 +269,7 @@ mtype_gc(unsigned long ul_set)
 				ip_set_ext_destroy(set, x);
 			}
 		}
-	read_unlock_bh(&set->lock);
+	spin_unlock_bh(&set->lock);
 
 	map->gc.expires = jiffies + IPSET_GC_PERIOD(set->timeout) * HZ;
 	add_timer(&map->gc);
diff --git a/net/netfilter/ipset/ip_set_bitmap_ipmac.c b/net/netfilter/ipset/ip_set_bitmap_ipmac.c
index 8610474..5e20c26 100644
--- a/net/netfilter/ipset/ip_set_bitmap_ipmac.c
+++ b/net/netfilter/ipset/ip_set_bitmap_ipmac.c
@@ -134,7 +134,7 @@ bitmap_ipmac_add_timeout(unsigned long *timeout,
 		if (e->ether)
 			ip_set_timeout_set(timeout, t);
 		else
-			*timeout = t;
+			ip_set_rcu_assign_ulong(timeout, t);
 	}
 	return 0;
 }
diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index 912e5a0..9fb2610 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -217,6 +217,7 @@ ip_set_type_register(struct ip_set_type *type)
 		 type->revision_min, type->revision_max);
 unlock:
 	ip_set_type_unlock();
+	synchronize_rcu();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(ip_set_type_register);
@@ -496,16 +497,16 @@ ip_set_test(ip_set_id_t index, const struct sk_buff *skb,
 	    !(opt->family == set->family || set->family == NFPROTO_UNSPEC))
 		return 0;
 
-	read_lock_bh(&set->lock);
+	rcu_read_lock_bh();
 	ret = set->variant->kadt(set, skb, par, IPSET_TEST, opt);
-	read_unlock_bh(&set->lock);
+	rcu_read_unlock_bh();
 
 	if (ret == -EAGAIN) {
 		/* Type requests element to be completed */
 		pr_debug("element must be completed, ADD is triggered\n");
-		write_lock_bh(&set->lock);
+		spin_lock_bh(&set->lock);
 		set->variant->kadt(set, skb, par, IPSET_ADD, opt);
-		write_unlock_bh(&set->lock);
+		spin_unlock_bh(&set->lock);
 		ret = 1;
 	} else {
 		/* --return-nomatch: invert matched element */
@@ -535,9 +536,9 @@ ip_set_add(ip_set_id_t index, const struct sk_buff *skb,
 	    !(opt->family == set->family || set->family == NFPROTO_UNSPEC))
 		return -IPSET_ERR_TYPE_MISMATCH;
 
-	write_lock_bh(&set->lock);
+	spin_lock_bh(&set->lock);
 	ret = set->variant->kadt(set, skb, par, IPSET_ADD, opt);
-	write_unlock_bh(&set->lock);
+	spin_unlock_bh(&set->lock);
 
 	return ret;
 }
@@ -558,9 +559,9 @@ ip_set_del(ip_set_id_t index, const struct sk_buff *skb,
 	    !(opt->family == set->family || set->family == NFPROTO_UNSPEC))
 		return -IPSET_ERR_TYPE_MISMATCH;
 
-	write_lock_bh(&set->lock);
+	spin_lock_bh(&set->lock);
 	ret = set->variant->kadt(set, skb, par, IPSET_DEL, opt);
-	write_unlock_bh(&set->lock);
+	spin_unlock_bh(&set->lock);
 
 	return ret;
 }
@@ -845,7 +846,7 @@ ip_set_create(struct sock *ctnl, struct sk_buff *skb,
 	set = kzalloc(sizeof(struct ip_set), GFP_KERNEL);
 	if (!set)
 		return -ENOMEM;
-	rwlock_init(&set->lock);
+	spin_lock_init(&set->lock);
 	strlcpy(set->name, name, IPSET_MAXNAMELEN);
 	set->family = family;
 	set->revision = revision;
@@ -1024,9 +1025,9 @@ ip_set_flush_set(struct ip_set *set)
 {
 	pr_debug("set: %s\n",  set->name);
 
-	write_lock_bh(&set->lock);
+	spin_lock_bh(&set->lock);
 	set->variant->flush(set);
-	write_unlock_bh(&set->lock);
+	spin_unlock_bh(&set->lock);
 }
 
 static int
@@ -1327,9 +1328,9 @@ dump_last:
 				goto next_set;
 			/* Fall through and add elements */
 		default:
-			read_lock_bh(&set->lock);
+			rcu_read_lock_bh();
 			ret = set->variant->list(set, skb, cb);
-			read_unlock_bh(&set->lock);
+			rcu_read_unlock_bh();
 			if (!cb->args[IPSET_CB_ARG0])
 				/* Set is done, proceed with next one */
 				goto next_set;
@@ -1407,9 +1408,9 @@ call_ad(struct sock *ctnl, struct sk_buff *skb, struct ip_set *set,
 	bool eexist = flags & IPSET_FLAG_EXIST, retried = false;
 
 	do {
-		write_lock_bh(&set->lock);
+		spin_lock_bh(&set->lock);
 		ret = set->variant->uadt(set, tb, adt, &lineno, flags, retried);
-		write_unlock_bh(&set->lock);
+		spin_unlock_bh(&set->lock);
 		retried = true;
 	} while (ret == -EAGAIN &&
 		 set->variant->resize &&
@@ -1589,9 +1590,9 @@ ip_set_utest(struct sock *ctnl, struct sk_buff *skb,
 			     set->type->adt_policy))
 		return -IPSET_ERR_PROTOCOL;
 
-	read_lock_bh(&set->lock);
+	rcu_read_lock_bh();
 	ret = set->variant->uadt(set, tb, IPSET_TEST, NULL, 0, 0);
-	read_unlock_bh(&set->lock);
+	rcu_read_unlock_bh();
 	/* Userspace can't trigger element to be re-added */
 	if (ret == -EAGAIN)
 		ret = 1;
diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
index 974ff38..f889d98 100644
--- a/net/netfilter/ipset/ip_set_hash_gen.h
+++ b/net/netfilter/ipset/ip_set_hash_gen.h
@@ -10,9 +10,10 @@
 
 #include <linux/rcupdate.h>
 #include <linux/jhash.h>
+#include <linux/types.h>
 #include <linux/netfilter/ipset/ip_set_timeout.h>
-#ifndef rcu_dereference_bh
-#define rcu_dereference_bh(p)	rcu_dereference(p)
+#ifndef rcu_dereference_bh_check
+#define rcu_dereference_bh_check(p, c)	rcu_dereference_bh(p)
 #endif
 
 #define rcu_dereference_bh_nfnl(p)	rcu_dereference_bh_check(p, 1)
@@ -20,9 +21,7 @@
 /* Hashing which uses arrays to resolve clashing. The hash table is resized
  * (doubled) when searching becomes too long.
  * Internally jhash is used with the assumption that the size of the
- * stored data is a multiple of sizeof(u32). If storage supports timeout,
- * the timeout field must be the last one in the data structure - that field
- * is ignored when computing the hash key.
+ * stored data is a multiple of sizeof(u32).
  *
  * Readers and resizing
  *
@@ -36,6 +35,8 @@
 #define AHASH_INIT_SIZE			4
 /* Max number of elements to store in an array block */
 #define AHASH_MAX_SIZE			(3*AHASH_INIT_SIZE)
+/* Max muber of elements in the array block when tuned */
+#define AHASH_MAX_TUNED			64
 
 /* Max number of elements can be tuned */
 #ifdef IP_SET_HASH_WITH_MULTI
@@ -53,7 +54,7 @@ tune_ahash_max(u8 curr, u32 multi)
 	/* Currently, at listing one hash bucket must fit into a message.
 	 * Therefore we have a hard limit here.
 	 */
-	return n > curr && n <= 64 ? n : curr;
+	return n > curr && n <= AHASH_MAX_TUNED ? n : curr;
 }
 #define TUNE_AHASH_MAX(h, multi)	\
 	((h)->ahash_max = tune_ahash_max((h)->ahash_max, multi))
@@ -64,18 +65,21 @@ tune_ahash_max(u8 curr, u32 multi)
 
 /* A hash bucket */
 struct hbucket {
-	void *value;		/* the array of the values */
+	struct rcu_head rcu;	/* for call_rcu_bh */
+	/* Which positions are used in the array */
+	DECLARE_BITMAP(used, AHASH_MAX_TUNED);
 	u8 size;		/* size of the array */
 	u8 pos;			/* position of the first free entry */
-};
+	unsigned char value[0];	/* the array of the values */
+} __attribute__ ((aligned));
 
 /* The hash table: the table size stored here in order to make resizing easy */
 struct htable {
 	u8 htable_bits;		/* size of hash table == 2^htable_bits */
-	struct hbucket bucket[0]; /* hashtable buckets */
+	struct hbucket * __rcu bucket[0]; /* hashtable buckets */
 };
 
-#define hbucket(h, i)		(&((h)->bucket[i]))
+#define hbucket(h, i)		((h)->bucket[i])
 
 #ifndef IPSET_NET_COUNT
 #define IPSET_NET_COUNT		1
@@ -83,8 +87,8 @@ struct htable {
 
 /* Book-keeping of the prefixes added to the set */
 struct net_prefixes {
-	u32 nets[IPSET_NET_COUNT]; /* number of elements per cidr */
-	u8 cidr[IPSET_NET_COUNT];  /* the different cidr values in the set */
+	u32 nets[IPSET_NET_COUNT]; /* number of elements for this cidr */
+	u8 cidr[IPSET_NET_COUNT];  /* the cidr value */
 };
 
 /* Compute the hash table size */
@@ -97,11 +101,11 @@ htable_size(u8 hbits)
 	if (hbits > 31)
 		return 0;
 	hsize = jhash_size(hbits);
-	if ((((size_t)-1) - sizeof(struct htable))/sizeof(struct hbucket)
+	if ((((size_t)-1) - sizeof(struct htable))/sizeof(struct hbucket *)
 	    < hsize)
 		return 0;
 
-	return hsize * sizeof(struct hbucket) + sizeof(struct htable);
+	return hsize * sizeof(struct hbucket *) + sizeof(struct htable);
 }
 
 /* Compute htable_bits from the user input parameter hashsize */
@@ -110,6 +114,7 @@ htable_bits(u32 hashsize)
 {
 	/* Assume that hashsize == 2^htable_bits */
 	u8 bits = fls(hashsize - 1);
+
 	if (jhash_size(bits) != hashsize)
 		/* Round up to the first 2^n value */
 		bits = fls(hashsize);
@@ -117,30 +122,6 @@ htable_bits(u32 hashsize)
 	return bits;
 }
 
-static int
-hbucket_elem_add(struct hbucket *n, u8 ahash_max, size_t dsize)
-{
-	if (n->pos >= n->size) {
-		void *tmp;
-
-		if (n->size >= ahash_max)
-			/* Trigger rehashing */
-			return -EAGAIN;
-
-		tmp = kzalloc((n->size + AHASH_INIT_SIZE) * dsize,
-			      GFP_ATOMIC);
-		if (!tmp)
-			return -ENOMEM;
-		if (n->size) {
-			memcpy(tmp, n->value, n->size * dsize);
-			kfree(n->value);
-		}
-		n->value = tmp;
-		n->size += AHASH_INIT_SIZE;
-	}
-	return 0;
-}
-
 #ifdef IP_SET_HASH_WITH_NETS
 #if IPSET_NET_COUNT > 1
 #define __CIDR(cidr, i)		(cidr[i])
@@ -280,9 +261,6 @@ struct htype {
 #ifdef IP_SET_HASH_WITH_NETMASK
 	u8 netmask;		/* netmask value for subnets to store */
 #endif
-#ifdef IP_SET_HASH_WITH_RBTREE
-	struct rb_root rbtree;
-#endif
 #ifdef IP_SET_HASH_WITH_NETS
 	struct net_prefixes nets[0]; /* book-keeping of prefixes */
 #endif
@@ -324,8 +302,8 @@ mtype_del_cidr(struct htype *h, u8 cidr, u8 nets_length, u8 n)
 	for (i = 0; i < nets_length; i++) {
 	        if (h->nets[i].cidr[n] != cidr)
 	                continue;
-		h->nets[cidr -1].nets[n]--;
-		if (h->nets[cidr -1].nets[n] > 0)
+		h->nets[cidr - 1].nets[n]--;
+		if (h->nets[cidr - 1].nets[n] > 0)
                         return;
 		for (j = i; j < net_end && h->nets[j].cidr[n]; j++)
 		        h->nets[j].cidr[n] = h->nets[j + 1].cidr[n];
@@ -341,15 +319,18 @@ mtype_ahash_memsize(const struct htype *h, const struct htable *t,
 		    u8 nets_length, size_t dsize)
 {
 	u32 i;
-	size_t memsize = sizeof(*h)
-			 + sizeof(*t)
+	struct hbucket *n;
+	size_t memsize = sizeof(*h) + sizeof(*t);
+
 #ifdef IP_SET_HASH_WITH_NETS
-			 + sizeof(struct net_prefixes) * nets_length
+	memsize += sizeof(struct net_prefixes) * nets_length;
 #endif
-			 + jhash_size(t->htable_bits) * sizeof(struct hbucket);
-
-	for (i = 0; i < jhash_size(t->htable_bits); i++)
-		memsize += t->bucket[i].size * dsize;
+	for (i = 0; i < jhash_size(t->htable_bits); i++) {
+		n = rcu_dereference_bh(hbucket(t, i));
+		if (n == NULL)
+			continue;
+		memsize += sizeof(struct hbucket) + n->size * dsize;
+	}
 
 	return memsize;
 }
@@ -364,7 +345,8 @@ mtype_ext_cleanup(struct ip_set *set, struct hbucket *n)
 	int i;
 
 	for (i = 0; i < n->pos; i++)
-		ip_set_ext_destroy(set, ahash_data(n, i, set->dsize));
+		if (test_bit(i, n->used))
+			ip_set_ext_destroy(set, ahash_data(n, i, set->dsize));
 }
 
 /* Flush a hash type of set: destroy all elements */
@@ -376,16 +358,16 @@ mtype_flush(struct ip_set *set)
 	struct hbucket *n;
 	u32 i;
 
-	t = rcu_dereference_bh_nfnl(h->table);
+	t = h->table;
 	for (i = 0; i < jhash_size(t->htable_bits); i++) {
 		n = hbucket(t, i);
-		if (n->size) {
-			if (set->extensions & IPSET_EXT_DESTROY)
-				mtype_ext_cleanup(set, n);
-			n->size = n->pos = 0;
-			/* FIXME: use slab cache */
-			kfree(n->value);
-		}
+		if (n == NULL)
+			continue;
+		if (set->extensions & IPSET_EXT_DESTROY)
+			mtype_ext_cleanup(set, n);
+		/* FIXME: use slab cache */
+		rcu_assign_pointer(hbucket(t, i), NULL);
+		kfree_rcu(n, rcu);
 	}
 #ifdef IP_SET_HASH_WITH_NETS
 	memset(h->nets, 0, sizeof(struct net_prefixes) * NLEN(set->family));
@@ -402,12 +384,12 @@ mtype_ahash_destroy(struct ip_set *set, struct htable *t, bool ext_destroy)
 
 	for (i = 0; i < jhash_size(t->htable_bits); i++) {
 		n = hbucket(t, i);
-		if (n->size) {
-			if (set->extensions & IPSET_EXT_DESTROY && ext_destroy)
-				mtype_ext_cleanup(set, n);
-			/* FIXME: use slab cache */
-			kfree(n->value);
-		}
+		if (n == NULL)
+			continue;
+		if (set->extensions & IPSET_EXT_DESTROY && ext_destroy)
+			mtype_ext_cleanup(set, n);
+		/* FIXME: use slab cache */
+		kfree(n);
 	}
 
 	ip_set_free(t);
@@ -422,10 +404,7 @@ mtype_destroy(struct ip_set *set)
 	if (set->extensions & IPSET_EXT_TIMEOUT)
 		del_timer_sync(&h->gc);
 
-	mtype_ahash_destroy(set, rcu_dereference_bh_nfnl(h->table), true);
-#ifdef IP_SET_HASH_WITH_RBTREE
-	rbtree_destroy(&h->rbtree);
-#endif
+	mtype_ahash_destroy(set, h->table, true);
 	kfree(h);
 
 	set->data = NULL;
@@ -470,17 +449,22 @@ mtype_expire(struct ip_set *set, struct htype *h, u8 nets_length, size_t dsize)
 	struct htable *t;
 	struct hbucket *n;
 	struct mtype_elem *data;
-	u32 i;
-	int j;
+	u32 i, j, d;
 #ifdef IP_SET_HASH_WITH_NETS
 	u8 k;
 #endif
 
-	rcu_read_lock_bh();
-	t = rcu_dereference_bh(h->table);
+	BUG_ON(!spin_is_locked(&set->lock));
+	t = h->table;
 	for (i = 0; i < jhash_size(t->htable_bits); i++) {
 		n = hbucket(t, i);
-		for (j = 0; j < n->pos; j++) {
+		if (n == NULL)
+			continue;
+		for (j = 0, d = 0; j < n->pos; j++) {
+			if (!test_bit(j, n->used)) {
+				d++;
+				continue;
+			}
 			data = ahash_data(n, j, dsize);
 			if (ip_set_timeout_expired(ext_timeout(data, set))) {
 				pr_debug("expired %u/%u\n", i, j);
@@ -490,29 +474,32 @@ mtype_expire(struct ip_set *set, struct htype *h, u8 nets_length, size_t dsize)
 						       nets_length, k);
 #endif
 				ip_set_ext_destroy(set, data);
-				if (j != n->pos - 1)
-					/* Not last one */
-					memcpy(data,
-					       ahash_data(n, n->pos - 1, dsize),
-					       dsize);
-				n->pos--;
+				clear_bit(j, n->used);
 				h->elements--;
+				d++;
 			}
 		}
-		if (n->pos + AHASH_INIT_SIZE < n->size) {
-			void *tmp = kzalloc((n->size - AHASH_INIT_SIZE)
-					    * dsize,
-					    GFP_ATOMIC);
+		if (d >= AHASH_INIT_SIZE) {
+			struct hbucket *tmp = kzalloc(sizeof(struct hbucket) +
+					(n->size - AHASH_INIT_SIZE) * dsize,
+					GFP_ATOMIC);
 			if (!tmp)
 				/* Still try to delete expired elements */
 				continue;
-			n->size -= AHASH_INIT_SIZE;
-			memcpy(tmp, n->value, n->size * dsize);
-			kfree(n->value);
-			n->value = tmp;
+			tmp->size = n->size - AHASH_INIT_SIZE;
+			for (j = 0, d = 0; j < n->pos; j++) {
+				if (!test_bit(j, n->used))
+					continue;
+				data = ahash_data(n, j, dsize);
+				memcpy(tmp->value + d * dsize, data, dsize);
+				set_bit(j, tmp->used);
+				d++;
+			}
+			tmp->pos = d;
+			rcu_assign_pointer(hbucket(t, i), tmp);
+			kfree_rcu(n, rcu);
 		}
 	}
-	rcu_read_unlock_bh();
 }
 
 static void
@@ -522,9 +509,9 @@ mtype_gc(unsigned long ul_set)
 	struct htype *h = set->data;
 
 	pr_debug("called\n");
-	write_lock_bh(&set->lock);
+	spin_lock_bh(&set->lock);
 	mtype_expire(set, h, NLEN(set->family), set->dsize);
-	write_unlock_bh(&set->lock);
+	spin_unlock_bh(&set->lock);
 
 	h->gc.expires = jiffies + IPSET_GC_PERIOD(set->timeout) * HZ;
 	add_timer(&h->gc);
@@ -537,75 +524,110 @@ static int
 mtype_resize(struct ip_set *set, bool retried)
 {
 	struct htype *h = set->data;
-	struct htable *t, *orig = rcu_dereference_bh_nfnl(h->table);
-	u8 htable_bits = orig->htable_bits;
+	struct htable *t, *orig;
+	u8 htable_bits;
+	size_t dsize = set->dsize;
 #ifdef IP_SET_HASH_WITH_NETS
 	u8 flags;
+	unsigned char _tmp[dsize];
+	struct mtype_elem *tmp = (struct mtype_elem *) &_tmp;
 #endif
 	struct mtype_elem *data;
 	struct mtype_elem *d;
 	struct hbucket *n, *m;
-	u32 i, j;
+	u32 i, j, key;
 	int ret;
 
-	/* Try to cleanup once */
-	if (SET_WITH_TIMEOUT(set) && !retried) {
-		i = h->elements;
-		write_lock_bh(&set->lock);
-		mtype_expire(set, set->data, NLEN(set->family), set->dsize);
-		write_unlock_bh(&set->lock);
-		if (h->elements < i)
-			return 0;
-	}
+	rcu_read_lock_bh();
+	orig = rcu_dereference_bh_nfnl(h->table);
+	htable_bits = orig->htable_bits;
+	rcu_read_unlock_bh();
 
 retry:
 	ret = 0;
 	htable_bits++;
-	pr_debug("attempt to resize set %s from %u to %u, t %p\n",
-		 set->name, orig->htable_bits, htable_bits, orig);
 	if (!htable_bits) {
 		/* In case we have plenty of memory :-) */
 		pr_warn("Cannot increase the hashsize of set %s further\n",
 			set->name);
-		return -IPSET_ERR_HASH_FULL;
+		ret = -IPSET_ERR_HASH_FULL;
+		goto out;
+	}
+	t = ip_set_alloc(htable_size(htable_bits));
+	if (!t) {
+		ret = -ENOMEM;
+		goto out;
 	}
-	t = ip_set_alloc(sizeof(*t)
-			 + jhash_size(htable_bits) * sizeof(struct hbucket));
-	if (!t)
-		return -ENOMEM;
 	t->htable_bits = htable_bits;
 
-	read_lock_bh(&set->lock);
+	spin_lock_bh(&set->lock);
+	orig = h->table;
+	pr_debug("attempt to resize set %s from %u to %u, t %p\n",
+		 set->name, orig->htable_bits, htable_bits, orig);
 	for (i = 0; i < jhash_size(orig->htable_bits); i++) {
 		n = hbucket(orig, i);
+		if (n == NULL)
+			continue;
 		for (j = 0; j < n->pos; j++) {
-			data = ahash_data(n, j, set->dsize);
+			if (!test_bit(j, n->used))
+				continue;
+			data = ahash_data(n, j, dsize);
 #ifdef IP_SET_HASH_WITH_NETS
+			/* We have readers running parallel with us,
+			 * so the live data cannot be modified.
+			 */
 			flags = 0;
+			memcpy(tmp, data, dsize);
+			data = tmp;
 			mtype_data_reset_flags(data, &flags);
 #endif
-			m = hbucket(t, HKEY(data, h->initval, htable_bits));
-			ret = hbucket_elem_add(m, AHASH_MAX(h), set->dsize);
-			if (ret < 0) {
-#ifdef IP_SET_HASH_WITH_NETS
-				mtype_data_reset_flags(data, &flags);
-#endif
-				read_unlock_bh(&set->lock);
-				mtype_ahash_destroy(set, t, false);
-				if (ret == -EAGAIN)
-					goto retry;
-				return ret;
+			key = HKEY(data, h->initval, htable_bits);
+			m = hbucket(t, key);
+			if (!m) {
+				m = kzalloc(sizeof(struct hbucket) +
+					    AHASH_INIT_SIZE * dsize,
+					    GFP_ATOMIC);
+				if (!m)
+					ret = -ENOMEM;
+				m->size = AHASH_INIT_SIZE;
+				hbucket(t, key) = m;
+			} else if (m->pos >= m->size) {
+				struct hbucket *tmp;
+
+				if (m->size >= AHASH_MAX(h))
+					ret = -EAGAIN;
+				else {
+					tmp = kzalloc(sizeof(struct hbucket) +
+						(m->size + AHASH_INIT_SIZE)
+						* dsize,
+						GFP_ATOMIC);
+					if (!tmp)
+						ret = -ENOMEM;
+				}
+				if (ret < 0) {
+					spin_unlock_bh(&set->lock);
+					mtype_ahash_destroy(set, t, false);
+					if (ret == -EAGAIN)
+						goto retry;
+					return ret;
+				}
+				memcpy(tmp, m, sizeof(struct hbucket) +
+					       m->size * dsize);
+				tmp->size = m->size + AHASH_INIT_SIZE;
+				kfree(m);
+				m = hbucket(t, key) = tmp;
 			}
-			d = ahash_data(m, m->pos++, set->dsize);
-			memcpy(d, data, set->dsize);
+			d = ahash_data(m, m->pos, dsize);
+			memcpy(d, data, dsize);
+			set_bit(m->pos++, m->used);
 #ifdef IP_SET_HASH_WITH_NETS
 			mtype_data_reset_flags(d, &flags);
 #endif
 		}
 	}
-
 	rcu_assign_pointer(h->table, t);
-	read_unlock_bh(&set->lock);
+
+	spin_unlock_bh(&set->lock);
 
 	/* Give time to other readers of the set */
 	synchronize_rcu_bh();
@@ -615,6 +637,10 @@ retry:
 	mtype_ahash_destroy(set, orig, false);
 
 	return 0;
+
+out:
+	spin_unlock_bh(&set->lock);
+	return ret;
 }
 
 /* Add an element to a hash and update the internal counters when succeeded,
@@ -627,17 +653,49 @@ mtype_add(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 	struct htable *t;
 	const struct mtype_elem *d = value;
 	struct mtype_elem *data;
-	struct hbucket *n;
-	int i, ret = 0;
-	int j = AHASH_MAX(h) + 1;
+	struct hbucket *n, *old = ERR_PTR(-ENOENT);
+	int i, j = -1;
 	bool flag_exist = flags & IPSET_FLAG_EXIST;
+	bool deleted = false, forceadd = false, reuse = false;
 	u32 key, multi = 0;
 
-	rcu_read_lock_bh();
-	t = rcu_dereference_bh(h->table);
+	if (h->elements >= h->maxelem) {
+		if (SET_WITH_TIMEOUT(set))
+			/* FIXME: when set is full, we slow down here */
+			mtype_expire(set, h, NLEN(set->family), set->dsize);
+		if (h->elements >= h->maxelem && SET_WITH_FORCEADD(set))
+			forceadd = true;
+	}
+
+	t = h->table;
 	key = HKEY(value, h->initval, t->htable_bits);
 	n = hbucket(t, key);
+	if (n == NULL) {
+		if (forceadd) {
+			if (net_ratelimit())
+				pr_warn("Set %s is full, maxelem %u reached\n",
+					set->name, h->maxelem);
+			return -IPSET_ERR_HASH_FULL;
+		} else if (h->elements >= h->maxelem)
+			goto set_full;
+		old = NULL;
+		n = kzalloc(sizeof(struct hbucket) +
+			    AHASH_INIT_SIZE * set->dsize,
+			    GFP_ATOMIC);
+		if (n == NULL)
+			return -ENOMEM;
+		n->size = AHASH_INIT_SIZE;
+		goto copy_elem;
+	}
 	for (i = 0; i < n->pos; i++) {
+		if (!test_bit(i, n->used)) {
+			/* Reuse first deleted entry */
+			if (j == -1) {
+				deleted = reuse = true;
+				j = i;
+			}
+			continue;
+		}
 		data = ahash_data(n, i, set->dsize);
 		if (mtype_data_equal(data, d, &multi)) {
 			if (flag_exist ||
@@ -645,85 +703,92 @@ mtype_add(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 			     ip_set_timeout_expired(ext_timeout(data, set)))) {
 				/* Just the extensions could be overwritten */
 				j = i;
-				goto reuse_slot;
-			} else {
-				ret = -IPSET_ERR_EXIST;
-				goto out;
-			}
+				goto overwrite_extensions;
+			} else
+				return -IPSET_ERR_EXIST;
 		}
 		/* Reuse first timed out entry */
 		if (SET_WITH_TIMEOUT(set) &&
 		    ip_set_timeout_expired(ext_timeout(data, set)) &&
-		    j != AHASH_MAX(h) + 1)
+		    j == -1) {
 			j = i;
+			reuse = true;
+		}
 	}
-	if (h->elements >= h->maxelem && SET_WITH_FORCEADD(set) && n->pos) {
-		/* Choosing the first entry in the array to replace */
-		j = 0;
-		goto reuse_slot;
-	}
-	if (SET_WITH_TIMEOUT(set) && h->elements >= h->maxelem)
-		/* FIXME: when set is full, we slow down here */
-		mtype_expire(set, h, NLEN(set->family), set->dsize);
-
-	if (h->elements >= h->maxelem) {
-		if (net_ratelimit())
-			pr_warn("Set %s is full, maxelem %u reached\n",
-				set->name, h->maxelem);
-		ret = -IPSET_ERR_HASH_FULL;
-		goto out;
-	}
-
-reuse_slot:
-	if (j != AHASH_MAX(h) + 1) {
-		/* Fill out reused slot */
+	if (reuse || forceadd) {
 		data = ahash_data(n, j, set->dsize);
+		if (!deleted) {
 #ifdef IP_SET_HASH_WITH_NETS
-		for (i = 0; i < IPSET_NET_COUNT; i++) {
-			mtype_del_cidr(h, SCIDR(data->cidr, i),
-				       NLEN(set->family), i);
-			mtype_add_cidr(h, SCIDR(d->cidr, i),
-				       NLEN(set->family), i);
-		}
+			for (i = 0; i < IPSET_NET_COUNT; i++)
+				mtype_del_cidr(h, SCIDR(data->cidr, i),
+					       NLEN(set->family), i);
 #endif
-		ip_set_ext_destroy(set, data);
-	} else {
-		/* Use/create a new slot */
+			ip_set_ext_destroy(set, data);
+			h->elements--;
+		}
+		goto copy_data;
+	}
+	if (h->elements >= h->maxelem)
+		goto set_full;
+	/* Create a new slot */
+	if (n->pos >= n->size) {
 		TUNE_AHASH_MAX(h, multi);
-		ret = hbucket_elem_add(n, AHASH_MAX(h), set->dsize);
-		if (ret != 0) {
-			if (ret == -EAGAIN)
-				mtype_data_next(&h->next, d);
-			goto out;
+		if (n->size >= AHASH_MAX(h)) {
+			/* Trigger rehashing */
+			mtype_data_next(&h->next, d);
+			return -EAGAIN;
 		}
-		data = ahash_data(n, n->pos++, set->dsize);
+		old = n;
+		n = kzalloc(sizeof(struct hbucket) +
+			    (old->size + AHASH_INIT_SIZE) * set->dsize,
+			    GFP_ATOMIC);
+		if (!n)
+			return -ENOMEM;
+		memcpy(n, old, sizeof(struct hbucket) +
+		       old->size * set->dsize);
+		n->size = old->size + AHASH_INIT_SIZE;
+	}
+
+copy_elem:
+	j = n->pos++;
+	data = ahash_data(n, j, set->dsize);
+copy_data:
+	h->elements++;
 #ifdef IP_SET_HASH_WITH_NETS
-		for (i = 0; i < IPSET_NET_COUNT; i++)
-			mtype_add_cidr(h, SCIDR(d->cidr, i), NLEN(set->family),
-				       i);
+	for (i = 0; i < IPSET_NET_COUNT; i++)
+		mtype_add_cidr(h, SCIDR(d->cidr, i),
+			       NLEN(set->family), i);
 #endif
-		h->elements++;
-	}
 	memcpy(data, d, sizeof(struct mtype_elem));
+overwrite_extensions:
 #ifdef IP_SET_HASH_WITH_NETS
 	mtype_data_set_flags(data, flags);
 #endif
-	if (SET_WITH_TIMEOUT(set))
-		ip_set_timeout_set(ext_timeout(data, set), ext->timeout);
 	if (SET_WITH_COUNTER(set))
 		ip_set_init_counter(ext_counter(data, set), ext);
 	if (SET_WITH_COMMENT(set))
 		ip_set_init_comment(ext_comment(data, set), ext);
 	if (SET_WITH_SKBINFO(set))
 		ip_set_init_skbinfo(ext_skbinfo(data, set), ext);
+	/* Must come last */
+	if (SET_WITH_TIMEOUT(set))
+		ip_set_timeout_set(ext_timeout(data, set), ext->timeout);
+	set_bit(j, n->used);
+	if (old != ERR_PTR(-ENOENT)) {
+		rcu_assign_pointer(hbucket(t, key), n);
+		if (old)
+			kfree_rcu(old, rcu);
+	}
 
-out:
-	rcu_read_unlock_bh();
-	return ret;
+	return 0;
+set_full:
+	if (net_ratelimit())
+		pr_warn("Set %s is full, maxelem %u reached\n",
+			set->name, h->maxelem);
+	return -IPSET_ERR_HASH_FULL;
 }
 
-/* Delete an element from the hash: swap it with the last element
- * and free up space if possible.
+/* Delete an element from the hash and free up space if possible.
  */
 static int
 mtype_del(struct ip_set *set, void *value, const struct ip_set_ext *ext,
@@ -734,29 +799,31 @@ mtype_del(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 	const struct mtype_elem *d = value;
 	struct mtype_elem *data;
 	struct hbucket *n;
-	int i, ret = -IPSET_ERR_EXIST;
-#ifdef IP_SET_HASH_WITH_NETS
-	u8 j;
-#endif
+	int i, j, k, ret = -IPSET_ERR_EXIST;
 	u32 key, multi = 0;
+	size_t dsize = set->dsize;
 
-	rcu_read_lock_bh();
-	t = rcu_dereference_bh(h->table);
+	t = h->table;
 	key = HKEY(value, h->initval, t->htable_bits);
 	n = hbucket(t, key);
-	for (i = 0; i < n->pos; i++) {
-		data = ahash_data(n, i, set->dsize);
+	if (!n)
+		goto out;
+	for (i = 0, k = 0; i < n->pos; i++) {
+		if (!test_bit(i, n->used)) {
+			k++;
+			continue;
+		}
+		data = ahash_data(n, i, dsize);
 		if (!mtype_data_equal(data, d, &multi))
 			continue;
 		if (SET_WITH_TIMEOUT(set) &&
 		    ip_set_timeout_expired(ext_timeout(data, set)))
 			goto out;
-		if (i != n->pos - 1)
-			/* Not last one */
-			memcpy(data, ahash_data(n, n->pos - 1, set->dsize),
-			       set->dsize);
 
-		n->pos--;
+		ret = 0;
+		clear_bit(i, n->used);
+		if (i + 1 == n->pos)
+			n->pos--;
 		h->elements--;
 #ifdef IP_SET_HASH_WITH_NETS
 		for (j = 0; j < IPSET_NET_COUNT; j++)
@@ -764,25 +831,37 @@ mtype_del(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 				       j);
 #endif
 		ip_set_ext_destroy(set, data);
-		if (n->pos + AHASH_INIT_SIZE < n->size) {
-			void *tmp = kzalloc((n->size - AHASH_INIT_SIZE)
-					    * set->dsize,
-					    GFP_ATOMIC);
-			if (!tmp) {
-				ret = 0;
+
+		for (; i < n->pos; i++) {
+			if (!test_bit(i, n->used))
+				k++;
+		}
+		if (n->pos == 0 && k == 0) {
+			rcu_assign_pointer(hbucket(t, key), NULL);
+			kfree_rcu(n, rcu);
+		} else if (k >= AHASH_INIT_SIZE) {
+			struct hbucket *tmp = kzalloc(sizeof(struct hbucket) +
+					(n->size - AHASH_INIT_SIZE) * dsize,
+					GFP_ATOMIC);
+			if (!tmp)
 				goto out;
+			tmp->size = n->size - AHASH_INIT_SIZE;
+			for (j = 0, k = 0; j < n->pos; j++) {
+				if (!test_bit(j, n->used))
+					continue;
+				data = ahash_data(n, j, dsize);
+				memcpy(tmp->value + k * dsize, data, dsize);
+				set_bit(j, tmp->used);
+				k++;
 			}
-			n->size -= AHASH_INIT_SIZE;
-			memcpy(tmp, n->value, n->size * set->dsize);
-			kfree(n->value);
-			n->value = tmp;
+			tmp->pos = k;
+			rcu_assign_pointer(hbucket(t, key), tmp);
+			kfree_rcu(n, rcu);
 		}
-		ret = 0;
 		goto out;
 	}
 
 out:
-	rcu_read_unlock_bh();
 	return ret;
 }
 
@@ -832,8 +911,12 @@ mtype_test_cidrs(struct ip_set *set, struct mtype_elem *d,
 		mtype_data_netmask(d, NCIDR(h->nets[j].cidr[0]));
 #endif
 		key = HKEY(d, h->initval, t->htable_bits);
-		n = hbucket(t, key);
+		n =  rcu_dereference_bh(hbucket(t, key));
+		if (n == NULL)
+			continue;
 		for (i = 0; i < n->pos; i++) {
+			if (!test_bit(i, n->used))
+				continue;
 			data = ahash_data(n, i, set->dsize);
 			if (!mtype_data_equal(data, d, &multi))
 				continue;
@@ -871,7 +954,6 @@ mtype_test(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 	int i, ret = 0;
 	u32 key, multi = 0;
 
-	rcu_read_lock_bh();
 	t = rcu_dereference_bh(h->table);
 #ifdef IP_SET_HASH_WITH_NETS
 	/* If we test an IP address and not a network address,
@@ -886,8 +968,14 @@ mtype_test(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 #endif
 
 	key = HKEY(d, h->initval, t->htable_bits);
-	n = hbucket(t, key);
+	n = rcu_dereference_bh(hbucket(t, key));
+	if (n == NULL) {
+		ret = 0;
+		goto out;
+	}
 	for (i = 0; i < n->pos; i++) {
+		if (!test_bit(i, n->used))
+			continue;
 		data = ahash_data(n, i, set->dsize);
 		if (mtype_data_equal(data, d, &multi) &&
 		    !(SET_WITH_TIMEOUT(set) &&
@@ -897,7 +985,6 @@ mtype_test(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 		}
 	}
 out:
-	rcu_read_unlock_bh();
 	return ret;
 }
 
@@ -909,15 +996,19 @@ mtype_head(struct ip_set *set, struct sk_buff *skb)
 	const struct htable *t;
 	struct nlattr *nested;
 	size_t memsize;
+	u8 htable_bits;
 
+	rcu_read_lock_bh();
 	t = rcu_dereference_bh_nfnl(h->table);
 	memsize = mtype_ahash_memsize(h, t, NLEN(set->family), set->dsize);
+	htable_bits = t->htable_bits;
+	rcu_read_unlock_bh();
 
 	nested = ipset_nest_start(skb, IPSET_ATTR_DATA);
 	if (!nested)
 		goto nla_put_failure;
 	if (nla_put_net32(skb, IPSET_ATTR_HASHSIZE,
-			  htonl(jhash_size(t->htable_bits))) ||
+			  htonl(jhash_size(htable_bits))) ||
 	    nla_put_net32(skb, IPSET_ATTR_MAXELEM, htonl(h->maxelem)))
 		goto nla_put_failure;
 #ifdef IP_SET_HASH_WITH_NETMASK
@@ -947,26 +1038,32 @@ mtype_list(const struct ip_set *set,
 	   struct sk_buff *skb, struct netlink_callback *cb)
 {
 	const struct htype *h = set->data;
-	const struct htable *t = rcu_dereference_bh_nfnl(h->table);
+	const struct htable *t;
 	struct nlattr *atd, *nested;
 	const struct hbucket *n;
 	const struct mtype_elem *e;
 	u32 first = cb->args[IPSET_CB_ARG0];
 	/* We assume that one hash bucket fills into one page */
 	void *incomplete;
-	int i;
+	int i, ret = 0;
 
 	atd = ipset_nest_start(skb, IPSET_ATTR_ADT);
 	if (!atd)
 		return -EMSGSIZE;
+
 	pr_debug("list hash set %s\n", set->name);
+	t = rcu_dereference_bh_nfnl(h->table);
 	for (; cb->args[IPSET_CB_ARG0] < jhash_size(t->htable_bits);
 	     cb->args[IPSET_CB_ARG0]++) {
 		incomplete = skb_tail_pointer(skb);
-		n = hbucket(t, cb->args[IPSET_CB_ARG0]);
+		n = rcu_dereference_bh(hbucket(t, cb->args[IPSET_CB_ARG0]));
 		pr_debug("cb->arg bucket: %lu, t %p n %p\n",
 			 cb->args[IPSET_CB_ARG0], t, n);
+		if (n == NULL)
+			continue;
 		for (i = 0; i < n->pos; i++) {
+			if (!test_bit(i, n->used))
+				continue;
 			e = ahash_data(n, i, set->dsize);
 			if (SET_WITH_TIMEOUT(set) &&
 			    ip_set_timeout_expired(ext_timeout(e, set)))
@@ -977,7 +1074,8 @@ mtype_list(const struct ip_set *set,
 			if (!nested) {
 				if (cb->args[IPSET_CB_ARG0] == first) {
 					nla_nest_cancel(skb, atd);
-					return -EMSGSIZE;
+					ret = -EMSGSIZE;
+					goto out;
 				} else
 					goto nla_put_failure;
 			}
@@ -992,7 +1090,7 @@ mtype_list(const struct ip_set *set,
 	/* Set listing finished */
 	cb->args[IPSET_CB_ARG0] = 0;
 
-	return 0;
+	goto out;
 
 nla_put_failure:
 	nlmsg_trim(skb, incomplete);
@@ -1000,10 +1098,11 @@ nla_put_failure:
 		pr_warn("Can't list set %s: one bucket does not fit into a message. Please report it!\n",
 			set->name);
 		cb->args[IPSET_CB_ARG0] = 0;
-		return -EMSGSIZE;
-	}
-	ipset_nest_end(skb, atd);
-	return 0;
+		ret = -EMSGSIZE;
+	} else
+		ipset_nest_end(skb, atd);
+out:
+	return ret;
 }
 
 static int
diff --git a/net/netfilter/ipset/ip_set_list_set.c b/net/netfilter/ipset/ip_set_list_set.c
index f8f6828..323115a 100644
--- a/net/netfilter/ipset/ip_set_list_set.c
+++ b/net/netfilter/ipset/ip_set_list_set.c
@@ -9,6 +9,7 @@
 
 #include <linux/module.h>
 #include <linux/ip.h>
+#include <linux/rculist.h>
 #include <linux/skbuff.h>
 #include <linux/errno.h>
 
@@ -27,6 +28,8 @@ MODULE_ALIAS("ip_set_list:set");
 
 /* Member elements  */
 struct set_elem {
+	struct rcu_head rcu;
+	struct list_head list;
 	ip_set_id_t id;
 };
 
@@ -41,12 +44,9 @@ struct list_set {
 	u32 size;		/* size of set list array */
 	struct timer_list gc;	/* garbage collection */
 	struct net *net;	/* namespace */
-	struct set_elem members[0]; /* the set members */
+	struct list_head members; /* the set members */
 };
 
-#define list_set_elem(set, map, id)	\
-	(struct set_elem *)((void *)(map)->members + (id) * (set)->dsize)
-
 static int
 list_set_ktest(struct ip_set *set, const struct sk_buff *skb,
 	       const struct xt_action_param *par,
@@ -54,17 +54,14 @@ list_set_ktest(struct ip_set *set, const struct sk_buff *skb,
 {
 	struct list_set *map = set->data;
 	struct set_elem *e;
-	u32 i, cmdflags = opt->cmdflags;
+	u32 cmdflags = opt->cmdflags;
 	int ret;
 
 	/* Don't lookup sub-counters at all */
 	opt->cmdflags &= ~IPSET_FLAG_MATCH_COUNTERS;
 	if (opt->cmdflags & IPSET_FLAG_SKIP_SUBCOUNTER_UPDATE)
 		opt->cmdflags &= ~IPSET_FLAG_SKIP_COUNTER_UPDATE;
-	for (i = 0; i < map->size; i++) {
-		e = list_set_elem(set, map, i);
-		if (e->id == IPSET_INVALID_ID)
-			return 0;
+	list_for_each_entry_rcu(e, &map->members, list) {
 		if (SET_WITH_TIMEOUT(set) &&
 		    ip_set_timeout_expired(ext_timeout(e, set)))
 			continue;
@@ -91,13 +88,9 @@ list_set_kadd(struct ip_set *set, const struct sk_buff *skb,
 {
 	struct list_set *map = set->data;
 	struct set_elem *e;
-	u32 i;
 	int ret;
 
-	for (i = 0; i < map->size; i++) {
-		e = list_set_elem(set, map, i);
-		if (e->id == IPSET_INVALID_ID)
-			return 0;
+	list_for_each_entry_rcu(e, &map->members, list) {
 		if (SET_WITH_TIMEOUT(set) &&
 		    ip_set_timeout_expired(ext_timeout(e, set)))
 			continue;
@@ -115,13 +108,9 @@ list_set_kdel(struct ip_set *set, const struct sk_buff *skb,
 {
 	struct list_set *map = set->data;
 	struct set_elem *e;
-	u32 i;
 	int ret;
 
-	for (i = 0; i < map->size; i++) {
-		e = list_set_elem(set, map, i);
-		if (e->id == IPSET_INVALID_ID)
-			return 0;
+	list_for_each_entry_rcu(e, &map->members, list) {
 		if (SET_WITH_TIMEOUT(set) &&
 		    ip_set_timeout_expired(ext_timeout(e, set)))
 			continue;
@@ -138,110 +127,65 @@ list_set_kadt(struct ip_set *set, const struct sk_buff *skb,
 	      enum ipset_adt adt, struct ip_set_adt_opt *opt)
 {
 	struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
+	int ret = -EINVAL;
 
+	rcu_read_lock();
 	switch (adt) {
 	case IPSET_TEST:
-		return list_set_ktest(set, skb, par, opt, &ext);
+		ret = list_set_ktest(set, skb, par, opt, &ext);
+		break;
 	case IPSET_ADD:
-		return list_set_kadd(set, skb, par, opt, &ext);
+		ret = list_set_kadd(set, skb, par, opt, &ext);
+		break;
 	case IPSET_DEL:
-		return list_set_kdel(set, skb, par, opt, &ext);
+		ret = list_set_kdel(set, skb, par, opt, &ext);
+		break;
 	default:
 		break;
 	}
-	return -EINVAL;
-}
+	rcu_read_unlock();
 
-static bool
-id_eq(const struct ip_set *set, u32 i, ip_set_id_t id)
-{
-	const struct list_set *map = set->data;
-	const struct set_elem *e;
-
-	if (i >= map->size)
-		return 0;
-
-	e = list_set_elem(set, map, i);
-	return !!(e->id == id &&
-		 !(SET_WITH_TIMEOUT(set) &&
-		   ip_set_timeout_expired(ext_timeout(e, set))));
+	return ret;
 }
 
-static int
-list_set_add(struct ip_set *set, u32 i, struct set_adt_elem *d,
-	     const struct ip_set_ext *ext)
-{
-	struct list_set *map = set->data;
-	struct set_elem *e = list_set_elem(set, map, i);
-
-	if (e->id != IPSET_INVALID_ID) {
-		if (i == map->size - 1) {
-			/* Last element replaced: e.g. add new,before,last */
-			ip_set_put_byindex(map->net, e->id);
-			ip_set_ext_destroy(set, e);
-		} else {
-			struct set_elem *x = list_set_elem(set, map,
-							   map->size - 1);
-
-			/* Last element pushed off */
-			if (x->id != IPSET_INVALID_ID) {
-				ip_set_put_byindex(map->net, x->id);
-				ip_set_ext_destroy(set, x);
-			}
-			memmove(list_set_elem(set, map, i + 1), e,
-				set->dsize * (map->size - (i + 1)));
-			/* Extensions must be initialized to zero */
-			memset(e, 0, set->dsize);
-		}
-	}
-
-	e->id = d->id;
-	if (SET_WITH_TIMEOUT(set))
-		ip_set_timeout_set(ext_timeout(e, set), ext->timeout);
-	if (SET_WITH_COUNTER(set))
-		ip_set_init_counter(ext_counter(e, set), ext);
-	if (SET_WITH_COMMENT(set))
-		ip_set_init_comment(ext_comment(e, set), ext);
-	if (SET_WITH_SKBINFO(set))
-		ip_set_init_skbinfo(ext_skbinfo(e, set), ext);
-	return 0;
-}
+/* Userspace interfaces: we are protected by the nfnl mutex */
 
-static int
-list_set_del(struct ip_set *set, u32 i)
+static void
+__list_set_del(struct ip_set *set, struct set_elem *e)
 {
 	struct list_set *map = set->data;
-	struct set_elem *e = list_set_elem(set, map, i);
 
 	ip_set_put_byindex(map->net, e->id);
+	/* We may call it, because we don't have a to be destroyed
+	 * extension which is used by the kernel.
+	 */
 	ip_set_ext_destroy(set, e);
+	kfree_rcu(e, rcu);
+}
 
-	if (i < map->size - 1)
-		memmove(e, list_set_elem(set, map, i + 1),
-			set->dsize * (map->size - (i + 1)));
+static inline void
+list_set_del(struct ip_set *set, struct set_elem *e)
+{
+	list_del_rcu(&e->list);
+	__list_set_del(set, e);
+}
 
-	/* Last element */
-	e = list_set_elem(set, map, map->size - 1);
-	e->id = IPSET_INVALID_ID;
-	return 0;
+static inline void
+list_set_replace(struct ip_set *set, struct set_elem *e, struct set_elem *old)
+{
+	list_replace_rcu(&old->list, &e->list);
+	__list_set_del(set, old);
 }
 
 static void
 set_cleanup_entries(struct ip_set *set)
 {
 	struct list_set *map = set->data;
-	struct set_elem *e;
-	u32 i = 0;
+	struct set_elem *e, *n;
 
-	while (i < map->size) {
-		e = list_set_elem(set, map, i);
-		if (e->id != IPSET_INVALID_ID &&
-		    ip_set_timeout_expired(ext_timeout(e, set)))
-			list_set_del(set, i);
-			/* Check element moved to position i in next loop */
-		else
-			i++;
-	}
+	list_for_each_entry_safe(e, n, &map->members, list)
+		if (ip_set_timeout_expired(ext_timeout(e, set)))
+			list_set_del(set, e);
 }
 
 static int
@@ -250,31 +194,45 @@ list_set_utest(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 {
 	struct list_set *map = set->data;
 	struct set_adt_elem *d = value;
-	struct set_elem *e;
-	u32 i;
+	struct set_elem *e, *next, *prev = NULL;
 	int ret;
 
-	for (i = 0; i < map->size; i++) {
-		e = list_set_elem(set, map, i);
-		if (e->id == IPSET_INVALID_ID)
-			return 0;
-		else if (SET_WITH_TIMEOUT(set) &&
-			 ip_set_timeout_expired(ext_timeout(e, set)))
+	list_for_each_entry(e, &map->members, list) {
+		if (SET_WITH_TIMEOUT(set) &&
+		    ip_set_timeout_expired(ext_timeout(e, set)))
 			continue;
-		else if (e->id != d->id)
+		else if (e->id != d->id) {
+			prev = e;
 			continue;
+		}
 
 		if (d->before == 0)
-			return 1;
-		else if (d->before > 0)
-			ret = id_eq(set, i + 1, d->refid);
-		else
-			ret = i > 0 && id_eq(set, i - 1, d->refid);
+			ret = 1;
+		else if (d->before > 0) {
+			next = list_next_entry(e, list);
+			ret = !list_is_last(&e->list, &map->members) &&
+			      next->id == d->refid;
+		} else
+			ret = prev != NULL && prev->id == d->refid;
 		return ret;
 	}
 	return 0;
 }
 
+static void
+list_set_init_extensions(struct ip_set *set, const struct ip_set_ext *ext,
+			 struct set_elem *e)
+{
+	if (SET_WITH_COUNTER(set))
+		ip_set_init_counter(ext_counter(e, set), ext);
+	if (SET_WITH_COMMENT(set))
+		ip_set_init_comment(ext_comment(e, set), ext);
+	if (SET_WITH_SKBINFO(set))
+		ip_set_init_skbinfo(ext_skbinfo(e, set), ext);
+	/* Update timeout last */
+	if (SET_WITH_TIMEOUT(set))
+		ip_set_timeout_set(ext_timeout(e, set), ext->timeout);
+}
 
 static int
 list_set_uadd(struct ip_set *set, void *value, const struct ip_set_ext *ext,
@@ -282,60 +240,82 @@ list_set_uadd(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 {
 	struct list_set *map = set->data;
 	struct set_adt_elem *d = value;
-	struct set_elem *e;
+	struct set_elem *e, *n, *prev, *next;
 	bool flag_exist = flags & IPSET_FLAG_EXIST;
-	u32 i, ret = 0;
 
 	if (SET_WITH_TIMEOUT(set))
 		set_cleanup_entries(set);
 
-	/* Check already added element */
-	for (i = 0; i < map->size; i++) {
-		e = list_set_elem(set, map, i);
-		if (e->id == IPSET_INVALID_ID)
-			goto insert;
-		else if (e->id != d->id)
+	/* Find where to add the new entry */
+	n = prev = next = NULL;
+	list_for_each_entry(e, &map->members, list) {
+		if (SET_WITH_TIMEOUT(set) &&
+		    ip_set_timeout_expired(ext_timeout(e, set)))
 			continue;
-
-		if ((d->before > 1 && !id_eq(set, i + 1, d->refid)) ||
-		    (d->before < 0 &&
-		     (i == 0 || !id_eq(set, i - 1, d->refid))))
-			/* Before/after doesn't match */
+		else if (d->id == e->id)
+			n = e;
+		else if (d->before == 0 || e->id != d->refid)
+			continue;
+		else if (d->before > 0)
+			next = e;
+		else
+			prev = e;
+	}
+	/* Re-add already existing element */
+	if (n) {
+		if ((d->before > 0 && !next) ||
+		    (d->before < 0 && !prev))
 			return -IPSET_ERR_REF_EXIST;
 		if (!flag_exist)
-			/* Can't re-add */
 			return -IPSET_ERR_EXIST;
 		/* Update extensions */
-		ip_set_ext_destroy(set, e);
+		ip_set_ext_destroy(set, n);
+		list_set_init_extensions(set, ext, n);
 
-		if (SET_WITH_TIMEOUT(set))
-			ip_set_timeout_set(ext_timeout(e, set), ext->timeout);
-		if (SET_WITH_COUNTER(set))
-			ip_set_init_counter(ext_counter(e, set), ext);
-		if (SET_WITH_COMMENT(set))
-			ip_set_init_comment(ext_comment(e, set), ext);
-		if (SET_WITH_SKBINFO(set))
-			ip_set_init_skbinfo(ext_skbinfo(e, set), ext);
 		/* Set is already added to the list */
 		ip_set_put_byindex(map->net, d->id);
 		return 0;
 	}
-insert:
-	ret = -IPSET_ERR_LIST_FULL;
-	for (i = 0; i < map->size && ret == -IPSET_ERR_LIST_FULL; i++) {
-		e = list_set_elem(set, map, i);
-		if (e->id == IPSET_INVALID_ID)
-			ret = d->before != 0 ? -IPSET_ERR_REF_EXIST
-				: list_set_add(set, i, d, ext);
-		else if (e->id != d->refid)
-			continue;
-		else if (d->before > 0)
-			ret = list_set_add(set, i, d, ext);
-		else if (i + 1 < map->size)
-			ret = list_set_add(set, i + 1, d, ext);
+	/* Add new entry */
+	if (d->before == 0) {
+		/* Append  */
+		n = list_empty(&map->members) ? NULL :
+		    list_last_entry(&map->members, struct set_elem, list);
+	} else if (d->before > 0) {
+		/* Insert after next element */
+		if (!list_is_last(&next->list, &map->members))
+			n = list_next_entry(next, list);
+	} else {
+		/* Insert before prev element */
+		if (prev->list.prev != &map->members)
+			n = list_prev_entry(prev, list);
 	}
-
-	return ret;
+	/* Can we replace a timed out entry? */
+	if (n != NULL &&
+	    !(SET_WITH_TIMEOUT(set) &&
+	      ip_set_timeout_expired(ext_timeout(n, set))))
+		n =  NULL;
+
+	e = kzalloc(set->dsize, GFP_KERNEL);
+	if (!e)
+		return -ENOMEM;
+	e->id = d->id;
+	INIT_LIST_HEAD(&e->list);
+	list_set_init_extensions(set, ext, e);
+	if (n)
+		list_set_replace(set, e, n);
+	else if (next)
+		list_add_tail_rcu(&e->list, &next->list);
+	else if (prev)
+		list_add_rcu(&e->list, &prev->list);
+	else
+		list_add_tail_rcu(&e->list, &map->members);
+	spin_unlock_bh(&set->lock);
+
+	synchronize_rcu_bh();
+
+	spin_lock_bh(&set->lock);
+	return 0;
 }
 
 static int
@@ -344,32 +324,30 @@ list_set_udel(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 {
 	struct list_set *map = set->data;
 	struct set_adt_elem *d = value;
-	struct set_elem *e;
-	u32 i;
-
-	for (i = 0; i < map->size; i++) {
-		e = list_set_elem(set, map, i);
-		if (e->id == IPSET_INVALID_ID)
-			return d->before != 0 ? -IPSET_ERR_REF_EXIST
-					      : -IPSET_ERR_EXIST;
-		else if (SET_WITH_TIMEOUT(set) &&
-			 ip_set_timeout_expired(ext_timeout(e, set)))
+	struct set_elem *e, *next, *prev = NULL;
+
+	list_for_each_entry(e, &map->members, list) {
+		if (SET_WITH_TIMEOUT(set) &&
+		    ip_set_timeout_expired(ext_timeout(e, set)))
 			continue;
-		else if (e->id != d->id)
+		else if (e->id != d->id) {
+			prev = e;
 			continue;
+		}
 
-		if (d->before == 0)
-			return list_set_del(set, i);
-		else if (d->before > 0) {
-			if (!id_eq(set, i + 1, d->refid))
+		if (d->before > 0) {
+			next = list_next_entry(e, list);
+			if (list_is_last(&e->list, &map->members) ||
+			    next->id != d->refid)
 				return -IPSET_ERR_REF_EXIST;
-			return list_set_del(set, i);
-		} else if (i == 0 || !id_eq(set, i - 1, d->refid))
-			return -IPSET_ERR_REF_EXIST;
-		else
-			return list_set_del(set, i);
+		} else if (d->before < 0) {
+			if (prev == NULL || prev->id != d->refid)
+				return -IPSET_ERR_REF_EXIST;
+		}
+		list_set_del(set, e);
+		return 0;
 	}
-	return -IPSET_ERR_EXIST;
+	return d->before != 0 ? -IPSET_ERR_REF_EXIST : -IPSET_ERR_EXIST;
 }
 
 static int
@@ -410,6 +388,7 @@ list_set_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 f = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		e.before = f & IPSET_FLAG_BEFORE;
 	}
 
@@ -447,27 +426,26 @@ static void
 list_set_flush(struct ip_set *set)
 {
 	struct list_set *map = set->data;
-	struct set_elem *e;
-	u32 i;
-
-	for (i = 0; i < map->size; i++) {
-		e = list_set_elem(set, map, i);
-		if (e->id != IPSET_INVALID_ID) {
-			ip_set_put_byindex(map->net, e->id);
-			ip_set_ext_destroy(set, e);
-			e->id = IPSET_INVALID_ID;
-		}
-	}
+	struct set_elem *e, *n;
+
+	list_for_each_entry_safe(e, n, &map->members, list)
+		list_set_del(set, e);
 }
 
 static void
 list_set_destroy(struct ip_set *set)
 {
 	struct list_set *map = set->data;
+	struct set_elem *e, *n;
 
 	if (SET_WITH_TIMEOUT(set))
 		del_timer_sync(&map->gc);
-	list_set_flush(set);
+	list_for_each_entry_safe(e, n, &map->members, list) {
+		list_del(&e->list);
+		ip_set_put_byindex(map->net, e->id);
+		ip_set_ext_destroy(set, e);
+		kfree(e);
+	}
 	kfree(map);
 
 	set->data = NULL;
@@ -478,6 +456,11 @@ list_set_head(struct ip_set *set, struct sk_buff *skb)
 {
 	const struct list_set *map = set->data;
 	struct nlattr *nested;
+	struct set_elem *e;
+	u32 n = 0;
+
+	list_for_each_entry(e, &map->members, list)
+		n++;
 
 	nested = ipset_nest_start(skb, IPSET_ATTR_DATA);
 	if (!nested)
@@ -485,7 +468,7 @@ list_set_head(struct ip_set *set, struct sk_buff *skb)
 	if (nla_put_net32(skb, IPSET_ATTR_SIZE, htonl(map->size)) ||
 	    nla_put_net32(skb, IPSET_ATTR_REFERENCES, htonl(set->ref - 1)) ||
 	    nla_put_net32(skb, IPSET_ATTR_MEMSIZE,
-			  htonl(sizeof(*map) + map->size * set->dsize)))
+			  htonl(sizeof(*map) + n * set->dsize)))
 		goto nla_put_failure;
 	if (unlikely(ip_set_put_flags(skb, set)))
 		goto nla_put_failure;
@@ -502,18 +485,20 @@ list_set_list(const struct ip_set *set,
 {
 	const struct list_set *map = set->data;
 	struct nlattr *atd, *nested;
-	u32 i, first = cb->args[IPSET_CB_ARG0];
-	const struct set_elem *e;
+	u32 i = 0, first = cb->args[IPSET_CB_ARG0];
+	struct set_elem *e;
 
 	atd = ipset_nest_start(skb, IPSET_ATTR_ADT);
 	if (!atd)
 		return -EMSGSIZE;
-	for (; cb->args[IPSET_CB_ARG0] < map->size;
-	     cb->args[IPSET_CB_ARG0]++) {
-		i = cb->args[IPSET_CB_ARG0];
-		e = list_set_elem(set, map, i);
-		if (e->id == IPSET_INVALID_ID)
-			goto finish;
+	list_for_each_entry(e, &map->members, list) {
+		if (i == first)
+			break;
+		i++;
+	}
+
+	list_for_each_entry_from(e, &map->members, list) {
+		i++;
 		if (SET_WITH_TIMEOUT(set) &&
 		    ip_set_timeout_expired(ext_timeout(e, set)))
 			continue;
@@ -532,7 +517,7 @@ list_set_list(const struct ip_set *set,
 			goto nla_put_failure;
 		ipset_nest_end(skb, nested);
 	}
-finish:
+
 	ipset_nest_end(skb, atd);
 	/* Set listing finished */
 	cb->args[IPSET_CB_ARG0] = 0;
@@ -544,6 +529,7 @@ nla_put_failure:
 		cb->args[IPSET_CB_ARG0] = 0;
 		return -EMSGSIZE;
 	}
+	cb->args[IPSET_CB_ARG0] = i - 1;
 	ipset_nest_end(skb, atd);
 	return 0;
 }
@@ -580,9 +566,9 @@ list_set_gc(unsigned long ul_set)
 	struct ip_set *set = (struct ip_set *) ul_set;
 	struct list_set *map = set->data;
 
-	write_lock_bh(&set->lock);
+	spin_lock_bh(&set->lock);
 	set_cleanup_entries(set);
-	write_unlock_bh(&set->lock);
+	spin_unlock_bh(&set->lock);
 
 	map->gc.expires = jiffies + IPSET_GC_PERIOD(set->timeout) * HZ;
 	add_timer(&map->gc);
@@ -606,24 +592,16 @@ static bool
 init_list_set(struct net *net, struct ip_set *set, u32 size)
 {
 	struct list_set *map;
-	struct set_elem *e;
-	u32 i;
 
-	map = kzalloc(sizeof(*map) +
-		      min_t(u32, size, IP_SET_LIST_MAX_SIZE) * set->dsize,
-		      GFP_KERNEL);
+	map = kzalloc(sizeof(*map), GFP_KERNEL);
 	if (!map)
 		return false;
 
 	map->size = size;
 	map->net = net;
+	INIT_LIST_HEAD(&map->members);
 	set->data = map;
 
-	for (i = 0; i < size; i++) {
-		e = list_set_elem(set, map, i);
-		e->id = IPSET_INVALID_ID;
-	}
-
 	return true;
 }
 
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 09/10] netfilter: ipset: styles warned by checkpatch.pl fixed
  2014-11-24 20:46 [PATCH 00/10] ipset patches for nf-next Jozsef Kadlecsik
                   ` (7 preceding siblings ...)
  2014-11-24 20:46 ` [PATCH 08/10] netfilter: ipset: Introduce RCU in all set types instead of rwlock per set Jozsef Kadlecsik
@ 2014-11-24 20:46 ` Jozsef Kadlecsik
  2014-11-24 20:46 ` [PATCH 10/10] netfilter: ipset: Fix parallel resizing and listing of the same set Jozsef Kadlecsik
  9 siblings, 0 replies; 17+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-24 20:46 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 include/linux/netfilter/ipset/ip_set.h       |  7 ++++---
 net/netfilter/ipset/ip_set_core.c            | 11 ++++++++---
 net/netfilter/ipset/ip_set_hash_ipportnet.c  |  2 ++
 net/netfilter/ipset/ip_set_hash_net.c        |  4 +++-
 net/netfilter/ipset/ip_set_hash_netiface.c   |  2 ++
 net/netfilter/ipset/ip_set_hash_netnet.c     |  2 ++
 net/netfilter/ipset/ip_set_hash_netport.c    |  2 ++
 net/netfilter/ipset/ip_set_hash_netportnet.c |  2 ++
 net/netfilter/xt_set.c                       |  7 +++++++
 net/sched/em_ipset.c                         |  5 ++++-
 10 files changed, 36 insertions(+), 8 deletions(-)

diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
index 418d360..b0cfe42 100644
--- a/include/linux/netfilter/ipset/ip_set.h
+++ b/include/linux/netfilter/ipset/ip_set.h
@@ -325,6 +325,7 @@ ip_set_update_counter(struct ip_set_counter *counter,
 /* RCU-safe assign value */
 #define IP_SET_RCU_ASSIGN(ptr, value)	\
 do {					\
+	/* Safe assign numeric types */ \
 	smp_wmb();			\
 	*(ptr) = value;			\
 } while (0)
@@ -383,10 +384,10 @@ ip_set_put_skbinfo(struct sk_buff *skb, struct ip_set_skbinfo *skbinfo)
 			      cpu_to_be64((u64)skbmark << 32 |
 					  skbmarkmask))) ||
 	       (skbprio &&
-	        nla_put_net32(skb, IPSET_ATTR_SKBPRIO,
+		nla_put_net32(skb, IPSET_ATTR_SKBPRIO,
 			      cpu_to_be32(skbprio))) ||
 	       (skbqueue &&
-	        nla_put_net16(skb, IPSET_ATTR_SKBQUEUE,
+		nla_put_net16(skb, IPSET_ATTR_SKBQUEUE,
 			     cpu_to_be16(skbqueue)));
 
 }
@@ -585,7 +586,7 @@ ip_set_put_extensions(struct sk_buff *skb, const struct ip_set *set,
 
 		if (nla_put_net32(skb, IPSET_ATTR_TIMEOUT,
 			htonl(active ? ip_set_timeout_get(timeout)
-				: *timeout)))
+			      : *timeout)))
 			return -EMSGSIZE;
 	}
 	if (SET_WITH_COUNTER(set) &&
diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index 9fb2610..957ec91 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -390,6 +390,7 @@ ip_set_get_extensions(struct ip_set *set, struct nlattr *tb[],
 		      struct ip_set_ext *ext)
 {
 	u64 fullmark;
+
 	if (tb[IPSET_ATTR_TIMEOUT]) {
 		if (!(set->extensions & IPSET_EXT_TIMEOUT))
 			return -IPSET_ERR_TIMEOUT;
@@ -903,7 +904,7 @@ ip_set_create(struct sock *ctnl, struct sk_buff *skb,
 			/* Wraparound */
 			goto cleanup;
 
-		list = kzalloc(sizeof(struct ip_set *) * i, GFP_KERNEL);
+		list = kcalloc(i, sizeof(struct ip_set *), GFP_KERNEL);
 		if (!list)
 			goto cleanup;
 		/* nfnl mutex is held, both lists are valid */
@@ -1179,6 +1180,7 @@ static int
 ip_set_dump_done(struct netlink_callback *cb)
 {
 	struct ip_set_net *inst = (struct ip_set_net *)cb->args[IPSET_CB_NET];
+
 	if (cb->args[IPSET_CB_ARG0]) {
 		pr_debug("release set %s\n",
 			 ip_set(inst, cb->args[IPSET_CB_INDEX])->name);
@@ -1216,7 +1218,7 @@ dump_init(struct netlink_callback *cb, struct ip_set_net *inst)
 
 	/* cb->args[IPSET_CB_NET]:	net namespace
 	 *         [IPSET_CB_DUMP]:	dump single set/all sets
-	 *         [IPSET_CB_INDEX]: 	set index
+	 *         [IPSET_CB_INDEX]:	set index
 	 *         [IPSET_CB_ARG0]:	type specific
 	 */
 
@@ -1235,6 +1237,7 @@ dump_init(struct netlink_callback *cb, struct ip_set_net *inst)
 
 	if (cda[IPSET_ATTR_FLAGS]) {
 		u32 f = ip_set_get_h32(cda[IPSET_ATTR_FLAGS]);
+
 		dump_type |= (f << 16);
 	}
 	cb->args[IPSET_CB_NET] = (unsigned long)inst;
@@ -1864,6 +1867,7 @@ ip_set_sockfn_get(struct sock *sk, int optval, void __user *user, int *len)
 	if (*op < IP_SET_OP_VERSION) {
 		/* Check the version at the beginning of operations */
 		struct ip_set_req_version *req_version = data;
+
 		if (req_version->version != IPSET_PROTOCOL) {
 			ret = -EPROTO;
 			goto done;
@@ -1965,7 +1969,7 @@ ip_set_net_init(struct net *net)
 	if (inst->ip_set_max >= IPSET_INVALID_ID)
 		inst->ip_set_max = IPSET_INVALID_ID - 1;
 
-	list = kzalloc(sizeof(struct ip_set *) * inst->ip_set_max, GFP_KERNEL);
+	list = kcalloc(inst->ip_set_max, sizeof(struct ip_set *), GFP_KERNEL);
 	if (!list)
 		return -ENOMEM;
 	inst->is_deleted = 0;
@@ -2003,6 +2007,7 @@ static int __init
 ip_set_init(void)
 {
 	int ret = nfnetlink_subsys_register(&ip_set_netlink_subsys);
+
 	if (ret != 0) {
 		pr_err("ip_set: cannot register with nfnetlink.\n");
 		return ret;
diff --git a/net/netfilter/ipset/ip_set_hash_ipportnet.c b/net/netfilter/ipset/ip_set_hash_ipportnet.c
index b6012ad..3d07f93 100644
--- a/net/netfilter/ipset/ip_set_hash_ipportnet.c
+++ b/net/netfilter/ipset/ip_set_hash_ipportnet.c
@@ -224,6 +224,7 @@ hash_ipportnet4_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
 			flags |= (IPSET_FLAG_NOMATCH << 16);
 	}
@@ -485,6 +486,7 @@ hash_ipportnet6_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
 			flags |= (IPSET_FLAG_NOMATCH << 16);
 	}
diff --git a/net/netfilter/ipset/ip_set_hash_net.c b/net/netfilter/ipset/ip_set_hash_net.c
index 6b3ac10..cb95064 100644
--- a/net/netfilter/ipset/ip_set_hash_net.c
+++ b/net/netfilter/ipset/ip_set_hash_net.c
@@ -173,6 +173,7 @@ hash_net4_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
 			flags |= (IPSET_FLAG_NOMATCH << 16);
 	}
@@ -180,7 +181,7 @@ hash_net4_uadt(struct ip_set *set, struct nlattr *tb[],
 	if (adt == IPSET_TEST || !tb[IPSET_ATTR_IP_TO]) {
 		e.ip = htonl(ip & ip_set_hostmask(e.cidr));
 		ret = adtfn(set, &e, &ext, &ext, flags);
-		return ip_set_enomatch(ret, flags, adt, set) ? -ret:
+		return ip_set_enomatch(ret, flags, adt, set) ? -ret :
 		       ip_set_eexist(ret, flags) ? 0 : ret;
 	}
 
@@ -348,6 +349,7 @@ hash_net6_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
 			flags |= (IPSET_FLAG_NOMATCH << 16);
 	}
diff --git a/net/netfilter/ipset/ip_set_hash_netiface.c b/net/netfilter/ipset/ip_set_hash_netiface.c
index d8a53ec..d7bdb06 100644
--- a/net/netfilter/ipset/ip_set_hash_netiface.c
+++ b/net/netfilter/ipset/ip_set_hash_netiface.c
@@ -218,6 +218,7 @@ hash_netiface4_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_PHYSDEV)
 			e.physdev = 1;
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
@@ -433,6 +434,7 @@ hash_netiface6_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_PHYSDEV)
 			e.physdev = 1;
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
diff --git a/net/netfilter/ipset/ip_set_hash_netnet.c b/net/netfilter/ipset/ip_set_hash_netnet.c
index ea8772a..ea4d06b 100644
--- a/net/netfilter/ipset/ip_set_hash_netnet.c
+++ b/net/netfilter/ipset/ip_set_hash_netnet.c
@@ -204,6 +204,7 @@ hash_netnet4_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
 			flags |= (IPSET_FLAG_NOMATCH << 16);
 	}
@@ -432,6 +433,7 @@ hash_netnet6_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
 			flags |= (IPSET_FLAG_NOMATCH << 16);
 	}
diff --git a/net/netfilter/ipset/ip_set_hash_netport.c b/net/netfilter/ipset/ip_set_hash_netport.c
index c0ddb58..bc7a1c2 100644
--- a/net/netfilter/ipset/ip_set_hash_netport.c
+++ b/net/netfilter/ipset/ip_set_hash_netport.c
@@ -215,6 +215,7 @@ hash_netport4_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
 			flags |= (IPSET_FLAG_NOMATCH << 16);
 	}
@@ -436,6 +437,7 @@ hash_netport6_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
 			flags |= (IPSET_FLAG_NOMATCH << 16);
 	}
diff --git a/net/netfilter/ipset/ip_set_hash_netportnet.c b/net/netfilter/ipset/ip_set_hash_netportnet.c
index bfaa94c..f98d3f6 100644
--- a/net/netfilter/ipset/ip_set_hash_netportnet.c
+++ b/net/netfilter/ipset/ip_set_hash_netportnet.c
@@ -239,6 +239,7 @@ hash_netportnet4_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
 			flags |= (IPSET_FLAG_NOMATCH << 16);
 	}
@@ -515,6 +516,7 @@ hash_netportnet6_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
 			flags |= (IPSET_FLAG_NOMATCH << 16);
 	}
diff --git a/net/netfilter/xt_set.c b/net/netfilter/xt_set.c
index c3f2638..e9dcd81 100644
--- a/net/netfilter/xt_set.c
+++ b/net/netfilter/xt_set.c
@@ -52,6 +52,7 @@ static bool
 set_match_v0(const struct sk_buff *skb, struct xt_action_param *par)
 {
 	const struct xt_set_info_match_v0 *info = par->matchinfo;
+
 	ADT_OPT(opt, par->family, info->match_set.u.compat.dim,
 		info->match_set.u.compat.flags, 0, UINT_MAX);
 
@@ -114,6 +115,7 @@ static bool
 set_match_v1(const struct sk_buff *skb, struct xt_action_param *par)
 {
 	const struct xt_set_info_match_v1 *info = par->matchinfo;
+
 	ADT_OPT(opt, par->family, info->match_set.dim,
 		info->match_set.flags, 0, UINT_MAX);
 
@@ -178,6 +180,7 @@ static bool
 set_match_v3(const struct sk_buff *skb, struct xt_action_param *par)
 {
 	const struct xt_set_info_match_v3 *info = par->matchinfo;
+
 	ADT_OPT(opt, par->family, info->match_set.dim,
 		info->match_set.flags, info->flags, UINT_MAX);
 	int ret;
@@ -252,6 +255,7 @@ static unsigned int
 set_target_v0(struct sk_buff *skb, const struct xt_action_param *par)
 {
 	const struct xt_set_info_target_v0 *info = par->targinfo;
+
 	ADT_OPT(add_opt, par->family, info->add_set.u.compat.dim,
 		info->add_set.u.compat.flags, 0, UINT_MAX);
 	ADT_OPT(del_opt, par->family, info->del_set.u.compat.dim,
@@ -324,6 +328,7 @@ static unsigned int
 set_target_v1(struct sk_buff *skb, const struct xt_action_param *par)
 {
 	const struct xt_set_info_target_v1 *info = par->targinfo;
+
 	ADT_OPT(add_opt, par->family, info->add_set.dim,
 		info->add_set.flags, 0, UINT_MAX);
 	ADT_OPT(del_opt, par->family, info->del_set.dim,
@@ -392,6 +397,7 @@ static unsigned int
 set_target_v2(struct sk_buff *skb, const struct xt_action_param *par)
 {
 	const struct xt_set_info_target_v2 *info = par->targinfo;
+
 	ADT_OPT(add_opt, par->family, info->add_set.dim,
 		info->add_set.flags, info->flags, info->timeout);
 	ADT_OPT(del_opt, par->family, info->del_set.dim,
@@ -418,6 +424,7 @@ static unsigned int
 set_target_v3(struct sk_buff *skb, const struct xt_action_param *par)
 {
 	const struct xt_set_info_target_v3 *info = par->targinfo;
+
 	ADT_OPT(add_opt, par->family, info->add_set.dim,
 		info->add_set.flags, info->flags, info->timeout);
 	ADT_OPT(del_opt, par->family, info->del_set.dim,
diff --git a/net/sched/em_ipset.c b/net/sched/em_ipset.c
index 5b4a4ef..c3dba5e 100644
--- a/net/sched/em_ipset.c
+++ b/net/sched/em_ipset.c
@@ -44,6 +44,7 @@ static int em_ipset_change(struct net *net, void *data, int data_len,
 static void em_ipset_destroy(struct tcf_ematch *em)
 {
 	const struct xt_set_info *set = (const void *) em->data;
+
 	if (set) {
 		ip_set_nfnl_put(em->net, set->index);
 		kfree((void *) em->data);
@@ -70,7 +71,9 @@ static int em_ipset_match(struct sk_buff *skb, struct tcf_ematch *em,
 		acpar.family = NFPROTO_IPV6;
 		if (!pskb_network_may_pull(skb, sizeof(struct ipv6hdr)))
 			return 0;
-		/* doesn't call ipv6_find_hdr() because ipset doesn't use thoff, yet */
+		/* doesn't call ipv6_find_hdr() because ipset doesn't use
+		 * thoff, yet
+		 */
 		acpar.thoff = sizeof(struct ipv6hdr);
 		break;
 	default:
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 10/10] netfilter: ipset: Fix parallel resizing and listing of the same set
  2014-11-24 20:46 [PATCH 00/10] ipset patches for nf-next Jozsef Kadlecsik
                   ` (8 preceding siblings ...)
  2014-11-24 20:46 ` [PATCH 09/10] netfilter: ipset: styles warned by checkpatch.pl fixed Jozsef Kadlecsik
@ 2014-11-24 20:46 ` Jozsef Kadlecsik
  9 siblings, 0 replies; 17+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-24 20:46 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

When elements added to a hash:* type of set and resizing triggered,
parallel listing could start to list the original set (before resizing)
and "continue" with listing the new set. Fix it by references and
using the original hash table for listing. Therefore the destroying
the original hash table may happen from the resizing or listing functions.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 include/linux/netfilter/ipset/ip_set.h | 13 ++++++----
 net/netfilter/ipset/ip_set_core.c      | 30 +++++++++++++----------
 net/netfilter/ipset/ip_set_hash_gen.h  | 44 ++++++++++++++++++++++++++++++----
 3 files changed, 65 insertions(+), 22 deletions(-)

diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
index b0cfe42..42c6ce0 100644
--- a/include/linux/netfilter/ipset/ip_set.h
+++ b/include/linux/netfilter/ipset/ip_set.h
@@ -176,6 +176,9 @@ struct ip_set_type_variant {
 	/* List elements */
 	int (*list)(const struct ip_set *set, struct sk_buff *skb,
 		    struct netlink_callback *cb);
+	/* Keep listing private when resizing runs parallel */
+	void (*uref)(struct ip_set *set, struct netlink_callback *cb,
+		     bool start);
 
 	/* Return true if "b" set is the same as "a"
 	 * according to the create set parameters */
@@ -423,12 +426,12 @@ ip_set_init_counter(struct ip_set_counter *counter,
 
 /* Netlink CB args */
 enum {
-	IPSET_CB_NET = 0,
-	IPSET_CB_DUMP,
-	IPSET_CB_INDEX,
-	IPSET_CB_ARG0,
+	IPSET_CB_NET = 0,	/* net namespace */
+	IPSET_CB_DUMP,		/* dump single set/all sets */
+	IPSET_CB_INDEX,		/* set index */
+	IPSET_CB_PRIVATE,	/* set private data */
+	IPSET_CB_ARG0,		/* type specific */
 	IPSET_CB_ARG1,
-	IPSET_CB_ARG2,
 };
 
 /* register and unregister set references */
diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index 957ec91..5dafbb7 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -1179,13 +1179,16 @@ ip_set_swap(struct sock *ctnl, struct sk_buff *skb,
 static int
 ip_set_dump_done(struct netlink_callback *cb)
 {
-	struct ip_set_net *inst = (struct ip_set_net *)cb->args[IPSET_CB_NET];
-
 	if (cb->args[IPSET_CB_ARG0]) {
-		pr_debug("release set %s\n",
-			 ip_set(inst, cb->args[IPSET_CB_INDEX])->name);
-		__ip_set_put_byindex(inst,
-			(ip_set_id_t) cb->args[IPSET_CB_INDEX]);
+		struct ip_set_net *inst =
+			(struct ip_set_net *)cb->args[IPSET_CB_NET];
+		ip_set_id_t index = (ip_set_id_t) cb->args[IPSET_CB_INDEX];
+		struct ip_set *set = ip_set(inst, index);
+
+		if (set->variant->uref)
+			set->variant->uref(set, cb, false);
+		pr_debug("release set %s\n", set->name);
+		__ip_set_put_byindex(inst, index);
 	}
 	return 0;
 }
@@ -1216,12 +1219,6 @@ dump_init(struct netlink_callback *cb, struct ip_set_net *inst)
 	nla_parse(cda, IPSET_ATTR_CMD_MAX,
 		  attr, nlh->nlmsg_len - min_len, ip_set_setname_policy);
 
-	/* cb->args[IPSET_CB_NET]:	net namespace
-	 *         [IPSET_CB_DUMP]:	dump single set/all sets
-	 *         [IPSET_CB_INDEX]:	set index
-	 *         [IPSET_CB_ARG0]:	type specific
-	 */
-
 	if (cda[IPSET_ATTR_SETNAME]) {
 		struct ip_set *set;
 
@@ -1329,6 +1326,8 @@ dump_last:
 				goto release_refcount;
 			if (dump_flags & IPSET_FLAG_LIST_HEADER)
 				goto next_set;
+			if (set->variant->uref)
+				set->variant->uref(set, cb, true);
 			/* Fall through and add elements */
 		default:
 			rcu_read_lock_bh();
@@ -1345,6 +1344,8 @@ dump_last:
 		dump_type = DUMP_LAST;
 		cb->args[IPSET_CB_DUMP] = dump_type | (dump_flags << 16);
 		cb->args[IPSET_CB_INDEX] = 0;
+		if (set && set->variant->uref)
+			set->variant->uref(set, cb, false);
 		goto dump_last;
 	}
 	goto out;
@@ -1359,7 +1360,10 @@ next_set:
 release_refcount:
 	/* If there was an error or set is done, release set */
 	if (ret || !cb->args[IPSET_CB_ARG0]) {
-		pr_debug("release set %s\n", ip_set(inst, index)->name);
+		set = ip_set(inst, index);
+		if (set->variant->uref)
+			set->variant->uref(set, cb, false);
+		pr_debug("release set %s\n", set->name);
 		__ip_set_put_byindex(inst, index);
 		cb->args[IPSET_CB_ARG0] = 0;
 	}
diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
index f889d98..6ac789b 100644
--- a/net/netfilter/ipset/ip_set_hash_gen.h
+++ b/net/netfilter/ipset/ip_set_hash_gen.h
@@ -75,6 +75,8 @@ struct hbucket {
 
 /* The hash table: the table size stored here in order to make resizing easy */
 struct htable {
+	atomic_t ref;		/* References for resizing */
+	atomic_t uref;		/* References for dumping */
 	u8 htable_bits;		/* size of hash table == 2^htable_bits */
 	struct hbucket * __rcu bucket[0]; /* hashtable buckets */
 };
@@ -184,6 +186,7 @@ htable_bits(u32 hashsize)
 #undef mtype_del
 #undef mtype_test_cidrs
 #undef mtype_test
+#undef mtype_uref
 #undef mtype_expire
 #undef mtype_resize
 #undef mtype_head
@@ -225,6 +228,7 @@ htable_bits(u32 hashsize)
 #define mtype_del		IPSET_TOKEN(MTYPE, _del)
 #define mtype_test_cidrs	IPSET_TOKEN(MTYPE, _test_cidrs)
 #define mtype_test		IPSET_TOKEN(MTYPE, _test)
+#define mtype_uref		IPSET_TOKEN(MTYPE, _uref)
 #define mtype_expire		IPSET_TOKEN(MTYPE, _expire)
 #define mtype_resize		IPSET_TOKEN(MTYPE, _resize)
 #define mtype_head		IPSET_TOKEN(MTYPE, _head)
@@ -562,6 +566,8 @@ retry:
 
 	spin_lock_bh(&set->lock);
 	orig = h->table;
+	atomic_set(&orig->ref, 1);
+	atomic_inc(&orig->uref);
 	pr_debug("attempt to resize set %s from %u to %u, t %p\n",
 		 set->name, orig->htable_bits, htable_bits, orig);
 	for (i = 0; i < jhash_size(orig->htable_bits); i++) {
@@ -605,6 +611,8 @@ retry:
 						ret = -ENOMEM;
 				}
 				if (ret < 0) {
+					atomic_set(&orig->ref, 0);
+					atomic_dec(&orig->uref);
 					spin_unlock_bh(&set->lock);
 					mtype_ahash_destroy(set, t, false);
 					if (ret == -EAGAIN)
@@ -634,7 +642,11 @@ retry:
 
 	pr_debug("set %s resized from %u (%p) to %u (%p)\n", set->name,
 		 orig->htable_bits, orig, t->htable_bits, t);
-	mtype_ahash_destroy(set, orig, false);
+	/* If there's nobody else dumping the table, destroy it */
+	if (atomic_dec_and_test(&orig->uref)) {
+		pr_debug("Table destroy by resize %p\n", orig);
+		mtype_ahash_destroy(set, orig, false);
+	}
 
 	return 0;
 
@@ -1032,12 +1044,35 @@ nla_put_failure:
 	return -EMSGSIZE;
 }
 
+/* Make possible to run dumping parallel with resizing */
+static void
+mtype_uref(struct ip_set *set, struct netlink_callback *cb, bool start)
+{
+	struct htype *h = set->data;
+	struct htable *t;
+
+	if (start) {
+		rcu_read_lock_bh();
+		t = rcu_dereference_bh_nfnl(h->table);
+		atomic_inc(&t->uref);
+		cb->args[IPSET_CB_PRIVATE] = (unsigned long) t;
+		rcu_read_unlock_bh();
+	} else if (cb->args[IPSET_CB_PRIVATE]) {
+		t = (struct htable *) cb->args[IPSET_CB_PRIVATE];
+		if (atomic_dec_and_test(&t->uref) && atomic_read(&t->ref)) {
+			/* Resizing didn't destroy the hash table */
+			pr_debug("Table destroy by dump: %p\n", t);
+			mtype_ahash_destroy(set, t, false);
+		}
+		cb->args[IPSET_CB_PRIVATE] = 0;
+	}
+}
+
 /* Reply a LIST/SAVE request: dump the elements of the specified set */
 static int
 mtype_list(const struct ip_set *set,
 	   struct sk_buff *skb, struct netlink_callback *cb)
 {
-	const struct htype *h = set->data;
 	const struct htable *t;
 	struct nlattr *atd, *nested;
 	const struct hbucket *n;
@@ -1052,11 +1087,11 @@ mtype_list(const struct ip_set *set,
 		return -EMSGSIZE;
 
 	pr_debug("list hash set %s\n", set->name);
-	t = rcu_dereference_bh_nfnl(h->table);
+	t = (const struct htable *) cb->args[IPSET_CB_PRIVATE];
 	for (; cb->args[IPSET_CB_ARG0] < jhash_size(t->htable_bits);
 	     cb->args[IPSET_CB_ARG0]++) {
 		incomplete = skb_tail_pointer(skb);
-		n = rcu_dereference_bh(hbucket(t, cb->args[IPSET_CB_ARG0]));
+		n = hbucket(t, cb->args[IPSET_CB_ARG0]);
 		pr_debug("cb->arg bucket: %lu, t %p n %p\n",
 			 cb->args[IPSET_CB_ARG0], t, n);
 		if (n == NULL)
@@ -1126,6 +1161,7 @@ static const struct ip_set_type_variant mtype_variant = {
 	.flush	= mtype_flush,
 	.head	= mtype_head,
 	.list	= mtype_list,
+	.uref	= mtype_uref,
 	.resize	= mtype_resize,
 	.same_set = mtype_same_set,
 };
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 08/10] netfilter: ipset: Introduce RCU in all set types instead of rwlock per set
  2014-11-24 20:46 ` [PATCH 08/10] netfilter: ipset: Introduce RCU in all set types instead of rwlock per set Jozsef Kadlecsik
@ 2014-11-26 13:14   ` Pablo Neira Ayuso
  2014-11-26 14:58     ` Jozsef Kadlecsik
  0 siblings, 1 reply; 17+ messages in thread
From: Pablo Neira Ayuso @ 2014-11-26 13:14 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: netfilter-devel

On Mon, Nov 24, 2014 at 09:46:50PM +0100, Jozsef Kadlecsik wrote:
> Performance is tested by Jesper Dangaard Brouer:
> 
> Simple drop in FORWARD
> ~~~~~~~~~~~~~~~~~~~~
> 
> Dropping via simple iptables net-mask match::
> 
>  iptables -t raw -N simple || iptables -t raw -F simple
>  iptables -t raw -I simple  -s 198.18.0.0/15 -j DROP
>  iptables -t raw -D PREROUTING -j simple
>  iptables -t raw -I PREROUTING -j simple
> 
> Drop performance in "raw": 11.3Mpps
> 
> Generator: sending 12.2Mpps (tx:12264083 pps)
> 
> Drop via original ipset in RAW table
> ~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> Create a set with lots of elements::
>  sudo ./ipset destroy test
>  echo "create test hash:ip hashsize 65536" > test.set
>  for x in `seq 0 255`; do
>     for y in `seq 0 255`; do
>         echo "add test 198.18.$x.$y" >> test.set
>     done
>  done
>  sudo ./ipset restore < test.set
> 
> Dropping via ipset::
> 
>  iptables -t raw -F
>  iptables -t raw -N net198 || iptables -t raw -F net198
>  iptables -t raw -I net198 -m set --match-set test src -j DROP
>  iptables -t raw -I PREROUTING -j net198
> 
> Drop performance in "raw" with ipset: 8Mpps
> 
> Perf report numbers ipset drop in "raw"::
> 
>  +   24.65%  ksoftirqd/1  [ip_set]           [k] ip_set_test
>  -   21.42%  ksoftirqd/1  [kernel.kallsyms]  [k] _raw_read_lock_bh
>     - _raw_read_lock_bh
>        + 99.88% ip_set_test
>  -   19.42%  ksoftirqd/1  [kernel.kallsyms]  [k] _raw_read_unlock_bh
>     - _raw_read_unlock_bh
>        + 99.72% ip_set_test
>  +    4.31%  ksoftirqd/1  [ip_set_hash_ip]   [k] hash_ip4_kadt
>  +    2.27%  ksoftirqd/1  [ixgbe]            [k] ixgbe_fetch_rx_buffer
>  +    2.18%  ksoftirqd/1  [ip_tables]        [k] ipt_do_table
>  +    1.81%  ksoftirqd/1  [ip_set_hash_ip]   [k] hash_ip4_test
>  +    1.61%  ksoftirqd/1  [kernel.kallsyms]  [k] __netif_receive_skb_core
>  +    1.44%  ksoftirqd/1  [kernel.kallsyms]  [k] build_skb
>  +    1.42%  ksoftirqd/1  [kernel.kallsyms]  [k] ip_rcv
>  +    1.36%  ksoftirqd/1  [kernel.kallsyms]  [k] __local_bh_enable_ip
>  +    1.16%  ksoftirqd/1  [kernel.kallsyms]  [k] dev_gro_receive
>  +    1.09%  ksoftirqd/1  [kernel.kallsyms]  [k] __rcu_read_unlock
>  +    0.96%  ksoftirqd/1  [ixgbe]            [k] ixgbe_clean_rx_irq
>  +    0.95%  ksoftirqd/1  [kernel.kallsyms]  [k] __netdev_alloc_frag
>  +    0.88%  ksoftirqd/1  [kernel.kallsyms]  [k] kmem_cache_alloc
>  +    0.87%  ksoftirqd/1  [xt_set]           [k] set_match_v3
>  +    0.85%  ksoftirqd/1  [kernel.kallsyms]  [k] inet_gro_receive
>  +    0.83%  ksoftirqd/1  [kernel.kallsyms]  [k] nf_iterate
>  +    0.76%  ksoftirqd/1  [kernel.kallsyms]  [k] put_compound_page
>  +    0.75%  ksoftirqd/1  [kernel.kallsyms]  [k] __rcu_read_lock
> 
> Drop via ipset in RAW table with RCU-locking
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> With RCU locking, the RW-lock is gone.
> 
> Drop performance in "raw" with ipset with RCU-locking: 11.3Mpps
> 
> Performance-tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
> ---
>  include/linux/netfilter/ipset/ip_set.h         |  82 +++-
>  include/linux/netfilter/ipset/ip_set_timeout.h |  39 +-
>  net/netfilter/ipset/ip_set_bitmap_gen.h        |   8 +-
>  net/netfilter/ipset/ip_set_bitmap_ipmac.c      |   2 +-
>  net/netfilter/ipset/ip_set_core.c              |  35 +-
>  net/netfilter/ipset/ip_set_hash_gen.h          | 547 +++++++++++++++----------
>  net/netfilter/ipset/ip_set_list_set.c          | 386 ++++++++---------
>  7 files changed, 614 insertions(+), 485 deletions(-)
> 
> diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
> index f1606fa..418d360 100644
> --- a/include/linux/netfilter/ipset/ip_set.h
> +++ b/include/linux/netfilter/ipset/ip_set.h
> @@ -113,10 +113,10 @@ struct ip_set_comment {
>  };
>  
>  struct ip_set_skbinfo {
> -	u32 skbmark;
> -	u32 skbmarkmask;
> -	u32 skbprio;
> -	u16 skbqueue;
> +	u32 __rcu skbmark;
> +	u32 __rcu skbmarkmask;
> +	u32 __rcu skbprio;
> +	u16 __rcu skbqueue;
>  };
>  
>  struct ip_set;
> @@ -223,7 +223,7 @@ struct ip_set {
>  	/* The name of the set */
>  	char name[IPSET_MAXNAMELEN];
>  	/* Lock protecting the set data */
> -	rwlock_t lock;
> +	spinlock_t lock;
>  	/* References to the set */
>  	u32 ref;
>  	/* The core set type */
> @@ -322,30 +322,72 @@ ip_set_update_counter(struct ip_set_counter *counter,
>  	}
>  }
>  
> +/* RCU-safe assign value */
> +#define IP_SET_RCU_ASSIGN(ptr, value)	\
> +do {					\
> +	smp_wmb();			\
> +	*(ptr) = value;			\
> +} while (0)
> +
> +static inline void
> +ip_set_rcu_assign_ulong(unsigned long *v, unsigned long value)
> +{
> +	IP_SET_RCU_ASSIGN(v, value);
> +}
> +
> +static inline void
> +ip_set_rcu_assign_u32(u32 *v, u32 value)
> +{
> +	IP_SET_RCU_ASSIGN(v, value);
> +}
> +
> +static inline void
> +ip_set_rcu_assign_u16(u16 *v, u16 value)
> +{
> +	IP_SET_RCU_ASSIGN(v, value);
> +}
> +
> +static inline void
> +ip_set_rcu_assign_u8(u8 *v, u8 value)
> +{
> +	IP_SET_RCU_ASSIGN(v, value);
> +}

No questions regarding numbers, but I would like to see some
explanation on the RCU approach that you're implementing in this
patch. Thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 08/10] netfilter: ipset: Introduce RCU in all set types instead of rwlock per set
  2014-11-26 13:14   ` Pablo Neira Ayuso
@ 2014-11-26 14:58     ` Jozsef Kadlecsik
  2014-11-26 15:41       ` Florian Westphal
  0 siblings, 1 reply; 17+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-26 14:58 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Wed, 26 Nov 2014, Pablo Neira Ayuso wrote:

> > diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
> > index f1606fa..418d360 100644
> > --- a/include/linux/netfilter/ipset/ip_set.h
> > +++ b/include/linux/netfilter/ipset/ip_set.h
> > @@ -113,10 +113,10 @@ struct ip_set_comment {
> >  };
> >  
> >  struct ip_set_skbinfo {
> > -	u32 skbmark;
> > -	u32 skbmarkmask;
> > -	u32 skbprio;
> > -	u16 skbqueue;
> > +	u32 __rcu skbmark;
> > +	u32 __rcu skbmarkmask;
> > +	u32 __rcu skbprio;
> > +	u16 __rcu skbqueue;
> >  };
> >  
> >  struct ip_set;
> > @@ -223,7 +223,7 @@ struct ip_set {
> >  	/* The name of the set */
> >  	char name[IPSET_MAXNAMELEN];
> >  	/* Lock protecting the set data */
> > -	rwlock_t lock;
> > +	spinlock_t lock;
> >  	/* References to the set */
> >  	u32 ref;
> >  	/* The core set type */
> > @@ -322,30 +322,72 @@ ip_set_update_counter(struct ip_set_counter *counter,
> >  	}
> >  }
> >  
> > +/* RCU-safe assign value */
> > +#define IP_SET_RCU_ASSIGN(ptr, value)	\
> > +do {					\
> > +	smp_wmb();			\
> > +	*(ptr) = value;			\
> > +} while (0)
> > +
> > +static inline void
> > +ip_set_rcu_assign_ulong(unsigned long *v, unsigned long value)
> > +{
> > +	IP_SET_RCU_ASSIGN(v, value);
> > +}
> > +
> > +static inline void
> > +ip_set_rcu_assign_u32(u32 *v, u32 value)
> > +{
> > +	IP_SET_RCU_ASSIGN(v, value);
> > +}
> > +
> > +static inline void
> > +ip_set_rcu_assign_u16(u16 *v, u16 value)
> > +{
> > +	IP_SET_RCU_ASSIGN(v, value);
> > +}
> > +
> > +static inline void
> > +ip_set_rcu_assign_u8(u8 *v, u8 value)
> > +{
> > +	IP_SET_RCU_ASSIGN(v, value);
> > +}
> 
> No questions regarding numbers, but I would like to see some
> explanation on the RCU approach that you're implementing in this
> patch. Thanks.

There are four types in the patch :-)

- bitmap set types: bit setting is atomic, so nothing required 
- list set type: normal list rcu
- hash set types: the pointer manipulations of the hash buckets
  are RCU protected; the cidr maintenance is changed and ordered
  so that adding/deleting does not break the list of cidrs.
- elem extensions: the manipulations of the extensions are 
  either atomic operations or integer settings. The inline functions
  above provide those RCU safe integer settings: they are the same
  as rcu_assign_pointer(), but that one can be used for pointers only.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 08/10] netfilter: ipset: Introduce RCU in all set types instead of rwlock per set
  2014-11-26 14:58     ` Jozsef Kadlecsik
@ 2014-11-26 15:41       ` Florian Westphal
  2014-11-26 16:01         ` Jozsef Kadlecsik
  0 siblings, 1 reply; 17+ messages in thread
From: Florian Westphal @ 2014-11-26 15:41 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Pablo Neira Ayuso, netfilter-devel

Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> wrote:
> On Wed, 26 Nov 2014, Pablo Neira Ayuso wrote:
> 
> > > diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
> > > index f1606fa..418d360 100644
> > > --- a/include/linux/netfilter/ipset/ip_set.h
> > > +++ b/include/linux/netfilter/ipset/ip_set.h
> > > @@ -113,10 +113,10 @@ struct ip_set_comment {
> > >  };
> > >  
> > >  struct ip_set_skbinfo {
> > > -	u32 skbmark;
> > > -	u32 skbmarkmask;
> > > -	u32 skbprio;
> > > -	u16 skbqueue;
> > > +	u32 __rcu skbmark;
> > > +	u32 __rcu skbmarkmask;
> > > +	u32 __rcu skbprio;
> > > +	u16 __rcu skbqueue;
> > >  };

This looks weird...
What is rcu protecting here?

> > > +#define IP_SET_RCU_ASSIGN(ptr, value)	\
> > > +do {					\
> > > +	smp_wmb();			\

Whats this barrier for?

> > > +	*(ptr) = value;			\
> > > +} while (0)

> > > +static inline void
> > > +ip_set_rcu_assign_ulong(unsigned long *v, unsigned long value)
> > > +{
> > > +	IP_SET_RCU_ASSIGN(v, value);
> > > +}
> > > +
> > > +static inline void
> > > +ip_set_rcu_assign_u32(u32 *v, u32 value)
> > > +{
> > > +	IP_SET_RCU_ASSIGN(v, value);
> > > +}
> > > +
> > > +static inline void
> > > +ip_set_rcu_assign_u16(u16 *v, u16 value)
> > > +{
> > > +	IP_SET_RCU_ASSIGN(v, value);
> > > +}
> > > +
> > > +static inline void
> > > +ip_set_rcu_assign_u8(u8 *v, u8 value)
> > > +{
> > > +	IP_SET_RCU_ASSIGN(v, value);
> > > +}

These macros look dodgy...  Why are they needed?

> - elem extensions: the manipulations of the extensions are 
>   either atomic operations or integer settings. The inline functions
>   above provide those RCU safe integer settings: they are the same
>   as rcu_assign_pointer(), but that one can be used for pointers only.

Yes, because it doesn't make sense to me to have something NOT for
pointers.

rcu_assign_pointer() is there to make sure that when you "publish" the
memory whose address is given, all the content is visible to all the
cpus, this is also why it has the barrier.

I don't see why you'd need a special function otherwise, unless you
want some explicit documentation as to where you're changing/updating
content of structures protected by rcu.  But, in those sections you
already need to hold some type of lock, so I don't think its needed.

Cheers,
Florian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 08/10] netfilter: ipset: Introduce RCU in all set types instead of rwlock per set
  2014-11-26 15:41       ` Florian Westphal
@ 2014-11-26 16:01         ` Jozsef Kadlecsik
  2014-11-26 17:18           ` Pablo Neira Ayuso
  0 siblings, 1 reply; 17+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-26 16:01 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Pablo Neira Ayuso, netfilter-devel

On Wed, 26 Nov 2014, Florian Westphal wrote:

> Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> wrote:
> > On Wed, 26 Nov 2014, Pablo Neira Ayuso wrote:
> > 
> > > > diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
> > > > index f1606fa..418d360 100644
> > > > --- a/include/linux/netfilter/ipset/ip_set.h
> > > > +++ b/include/linux/netfilter/ipset/ip_set.h
> > > > @@ -113,10 +113,10 @@ struct ip_set_comment {
> > > >  };
> > > >  
> > > >  struct ip_set_skbinfo {
> > > > -	u32 skbmark;
> > > > -	u32 skbmarkmask;
> > > > -	u32 skbprio;
> > > > -	u16 skbqueue;
> > > > +	u32 __rcu skbmark;
> > > > +	u32 __rcu skbmarkmask;
> > > > +	u32 __rcu skbprio;
> > > > +	u16 __rcu skbqueue;
> > > >  };
> 
> This looks weird...
> What is rcu protecting here?
> 
> > > > +#define IP_SET_RCU_ASSIGN(ptr, value)	\
> > > > +do {					\
> > > > +	smp_wmb();			\
> 
> Whats this barrier for?
> 
> > > > +	*(ptr) = value;			\
> > > > +} while (0)
> 
> > > > +static inline void
> > > > +ip_set_rcu_assign_ulong(unsigned long *v, unsigned long value)
> > > > +{
> > > > +	IP_SET_RCU_ASSIGN(v, value);
> > > > +}
> > > > +
> > > > +static inline void
> > > > +ip_set_rcu_assign_u32(u32 *v, u32 value)
> > > > +{
> > > > +	IP_SET_RCU_ASSIGN(v, value);
> > > > +}
> > > > +
> > > > +static inline void
> > > > +ip_set_rcu_assign_u16(u16 *v, u16 value)
> > > > +{
> > > > +	IP_SET_RCU_ASSIGN(v, value);
> > > > +}
> > > > +
> > > > +static inline void
> > > > +ip_set_rcu_assign_u8(u8 *v, u8 value)
> > > > +{
> > > > +	IP_SET_RCU_ASSIGN(v, value);
> > > > +}
> 
> These macros look dodgy...  Why are they needed?
> 
> > - elem extensions: the manipulations of the extensions are 
> >   either atomic operations or integer settings. The inline functions
> >   above provide those RCU safe integer settings: they are the same
> >   as rcu_assign_pointer(), but that one can be used for pointers only.
> 
> Yes, because it doesn't make sense to me to have something NOT for
> pointers.
> 
> rcu_assign_pointer() is there to make sure that when you "publish" the
> memory whose address is given, all the content is visible to all the
> cpus, this is also why it has the barrier.
> 
> I don't see why you'd need a special function otherwise, unless you
> want some explicit documentation as to where you're changing/updating
> content of structures protected by rcu.  But, in those sections you
> already need to hold some type of lock, so I don't think its needed.

Hm, you are both right - I clean up this part as it's totally unnecessary, 
these are not pointers, even indirectly.

Thanks!

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 08/10] netfilter: ipset: Introduce RCU in all set types instead of rwlock per set
  2014-11-26 16:01         ` Jozsef Kadlecsik
@ 2014-11-26 17:18           ` Pablo Neira Ayuso
  2014-11-27  8:23             ` Jozsef Kadlecsik
  0 siblings, 1 reply; 17+ messages in thread
From: Pablo Neira Ayuso @ 2014-11-26 17:18 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Florian Westphal, netfilter-devel

On Wed, Nov 26, 2014 at 05:01:39PM +0100, Jozsef Kadlecsik wrote:
> On Wed, 26 Nov 2014, Florian Westphal wrote:
> > 
> > I don't see why you'd need a special function otherwise, unless you
> > want some explicit documentation as to where you're changing/updating
> > content of structures protected by rcu.  But, in those sections you
> > already need to hold some type of lock, so I don't think its needed.
> 
> Hm, you are both right - I clean up this part as it's totally unnecessary, 
> these are not pointers, even indirectly.

Thanks.

Please, I'd appreciate if you send smaller batches for nf-next for
easier review. So instead of large batches, more frequent pull
requests if possible.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 08/10] netfilter: ipset: Introduce RCU in all set types instead of rwlock per set
  2014-11-26 17:18           ` Pablo Neira Ayuso
@ 2014-11-27  8:23             ` Jozsef Kadlecsik
  0 siblings, 0 replies; 17+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-27  8:23 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter-devel

On Wed, 26 Nov 2014, Pablo Neira Ayuso wrote:

> > Hm, you are both right - I clean up this part as it's totally unnecessary, 
> > these are not pointers, even indirectly.
> 
> Thanks.
> 
> Please, I'd appreciate if you send smaller batches for nf-next for
> easier review. So instead of large batches, more frequent pull
> requests if possible.

Sure! I'll split up the RCU patch into multiple parts too when resending 
the patchset.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2014-11-27  8:22 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-24 20:46 [PATCH 00/10] ipset patches for nf-next Jozsef Kadlecsik
2014-11-24 20:46 ` [PATCH 01/10] netfilter: ipset: Support updating extensions when the set is full Jozsef Kadlecsik
2014-11-24 20:46 ` [PATCH 02/10] netfilter: ipset: Alignment problem between 64bit kernel 32bit userspace Jozsef Kadlecsik
2014-11-24 20:46 ` [PATCH 03/10] netfilter: ipset: Indicate when /0 networks are supported Jozsef Kadlecsik
2014-11-24 20:46 ` [PATCH 04/10] netfilter: ipset: Simplify cidr handling for hash:*net* types Jozsef Kadlecsik
2014-11-24 20:46 ` [PATCH 05/10] netfilter: ipset: Allocate the proper size of memory when /0 networks are supported Jozsef Kadlecsik
2014-11-24 20:46 ` [PATCH 06/10] netfilter: ipset: Explicitly add padding elements to hash:net,net and hash:net,port,net Jozsef Kadlecsik
2014-11-24 20:46 ` [PATCH 07/10] netfilter: ipset: Remove rbtree from hash:net,iface in order to run under RCU Jozsef Kadlecsik
2014-11-24 20:46 ` [PATCH 08/10] netfilter: ipset: Introduce RCU in all set types instead of rwlock per set Jozsef Kadlecsik
2014-11-26 13:14   ` Pablo Neira Ayuso
2014-11-26 14:58     ` Jozsef Kadlecsik
2014-11-26 15:41       ` Florian Westphal
2014-11-26 16:01         ` Jozsef Kadlecsik
2014-11-26 17:18           ` Pablo Neira Ayuso
2014-11-27  8:23             ` Jozsef Kadlecsik
2014-11-24 20:46 ` [PATCH 09/10] netfilter: ipset: styles warned by checkpatch.pl fixed Jozsef Kadlecsik
2014-11-24 20:46 ` [PATCH 10/10] netfilter: ipset: Fix parallel resizing and listing of the same set Jozsef Kadlecsik

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.