netdev.vger.kernel.org archive mirror
* [PATCH 4.14 0/4] netfilter: xt_connlimit: backport upstream fixes for race in connection counting
@ 2019-01-02 17:00 Mauricio Faria de Oliveira
  2019-01-02 17:00 ` [PATCH 4.14 1/4] netfilter: xt_connlimit: don't store address in the conn nodes Mauricio Faria de Oliveira
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Mauricio Faria de Oliveira @ 2019-01-02 17:00 UTC (permalink / raw)
  To: stable, netdev, Florian Westphal
  Cc: Alakesh Haloi, nivedita.singhvi, Pablo Neira Ayuso,
	Jozsef Kadlecsik, David S. Miller, Yi-Hung Wei

Recently, Alakesh Haloi reported the following issue [1] with stable/4.14:

  """
  An iptable rule like the following on a multicore systems will result in
  accepting more connections than set in the rule.

  iptables  -A INPUT -p tcp -m tcp --syn --dport 7777 -m connlimit \
        --connlimit-above 2000 --connlimit-mask 0 -j DROP
  """

And proposed a fix that is not in Linus's tree. The discussion went on to
confirm whether the issue was still reproducible with mainline/nf.git tip,
and to either identify the upstream fix or re-submit the non-upstream fix.

Alakesh eventually was able to test with upstream, and reported that the
issue was still reproducible [2].
On that, our findings diverge, at least in my test environment:

First, I verified that the suggested mainline fix for the issue [3] indeed
fixes it, by testing with it applied and reverted on v4.18, a clean revert.
(The issue is reproducible with the commit reverted).

Then, with a consistent reproducer, I moved to nf.git, with HEAD on commit
a007232 ("netfilter: nf_conncount: fix argument order to find_next_bit"),
and the issue was not reproducible (even with 20+ threads on client side,
the number Alakesh reported to achieve 2150+ connections [4], and I tried
spreading the network interface IRQ affinity over more and more CPUs too.)

Either way, the suggested mainline fix does actually fix the issue in 4.14
for at least one environment. So, it might well be the case that Alakesh's
test environment has differences/subtleties that lead to more connections
accepted, and more commits are needed for that particular environment type.

But for now, with one bare-metal environment (24-core server, 4-core client)
verified, I thought of submitting the patches for review/comments/testing,
then looking for additional fixes for that environment separately.

The fix is PATCH 4/4, and PATCHes 1-3/4 are helpers for a cleaner backport.
All backports are simple, and essentially consist of refreshing context
lines and using older struct/file names.

Reviews from netfilter maintainers are much appreciated, as I've no previous
experience in this area, and although the backports look simple and build/run
correctly, there's usually stuff that only more experienced people may notice.

Thanks,
Mauricio

Links:
=====

  [1] https://www.spinics.net/lists/stable/msg270040.html
  [2] https://www.spinics.net/lists/stable/msg273669.html
  [3] https://www.spinics.net/lists/stable/msg271300.html
  [4] https://www.spinics.net/lists/stable/msg273669.html

Test-case:
=========

 - v4.14.91 (original): client achieves 2000+ connections (6000 target)
                        with 3 threads.

    server # iptables -F
    server # iptables -A INPUT -p tcp -m tcp --syn --dport 7777 -m connlimit --connlimit-above 2000 --connlimit-mask 0 -j DROP 

    server # iptables -L
    Chain INPUT (policy ACCEPT)
    target     prot opt source               destination         
    DROP       tcp  --  anywhere             anywhere             tcp dpt:7777 flags:FIN,SYN,RST,ACK/SYN #conn src/0 > 2000

    Chain FORWARD (policy ACCEPT)
    target     prot opt source               destination         

    Chain OUTPUT (policy ACCEPT)
    target     prot opt source               destination         

    server # ulimit -SHn 65000
    server # ruby server.rb
    <... listening ...>


    client # ulimit -SHn 65000
    client # ruby client.rb 10.230.56.100 7777 6000 3
    Connecting to ["10.230.56.100"]:7777 6000 times with 3
    1
    2
    3
    <...>
    2000
    <...>
    6000
    Target reached. Thread finishing
    6001
    Target reached. Thread finishing
    6002
    Target reached. Thread finishing
    Threads done. 6002 connections
    press enter to exit

 - v4.14.91 + patches: client only achieved 2000 connections.

    server #  (same procedure)

    client #  (same procedure)

    Connecting to ["10.230.56.100"]:7777 6000 times with 3
    1
    2
    3
    <...>
    2000
    <... blocked for a while...>
    failed to create connection: Connection timed out - connect(2) for "10.230.56.100" port 7777
    failed to create connection: Connection timed out - connect(2) for "10.230.56.100" port 7777
    failed to create connection: Connection timed out - connect(2) for "10.230.56.100" port 7777
    Threads done. 2000 connections
    press enter to exit

Florian Westphal (2):
  netfilter: xt_connlimit: don't store address in the conn nodes
  netfilter: nf_conncount: fix garbage collection confirm race

Pablo Neira Ayuso (1):
  netfilter: nf_conncount: expose connection list interface

Yi-Hung Wei (1):
  netfilter: nf_conncount: Fix garbage collection with zones

 include/net/netfilter/nf_conntrack_count.h | 15 +++++
 net/netfilter/xt_connlimit.c               | 99 +++++++++++++++++++++++-------
 2 files changed, 91 insertions(+), 23 deletions(-)
 create mode 100644 include/net/netfilter/nf_conntrack_count.h

-- 
2.7.4


* [PATCH 4.14 1/4] netfilter: xt_connlimit: don't store address in the conn nodes
  2019-01-02 17:00 [PATCH 4.14 0/4] netfilter: xt_connlimit: backport upstream fixes for race in connection counting Mauricio Faria de Oliveira
@ 2019-01-02 17:00 ` Mauricio Faria de Oliveira
  2019-01-02 17:00 ` [PATCH 4.14 2/4] netfilter: nf_conncount: expose connection list interface Mauricio Faria de Oliveira
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Mauricio Faria de Oliveira @ 2019-01-02 17:00 UTC (permalink / raw)
  To: stable, netdev, Florian Westphal
  Cc: Alakesh Haloi, nivedita.singhvi, Pablo Neira Ayuso,
	Jozsef Kadlecsik, David S. Miller, Yi-Hung Wei

From: Florian Westphal <fw@strlen.de>

commit ce49480dba8666cba0106e8e31a942c9ce4c438a upstream.

Only stored, never read.  This is a leftover from commit 7d08487777c8
("netfilter: connlimit: use rbtree for per-host conntrack obj storage"),
which added the rbtree node struct that stores the address instead.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

[mfo: backport: refresh context lines and use older symbol/file names:
 - nf_conncount.c -> xt_connlimit.c.
   - nf_conncount_rb -> xt_connlimit_rb
   - nf_conncount_tuple -> xt_connlimit_conn
  - additionally, remove the add_hlist() 'addr' parameter that isn't used and removed
    later upstream with commit 625c556118f3 ("netfilter: connlimit: split xt_connlimit
    into front and backend") in the rename from 'xt_connlimit.c' to 'nf_conncount.c',
    a big refactor, so do it here, while still here in this related patch.]
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
---
 net/netfilter/xt_connlimit.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index ffa8eec..79d4151 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -46,7 +46,6 @@
 struct xt_connlimit_conn {
 	struct hlist_node		node;
 	struct nf_conntrack_tuple	tuple;
-	union nf_inet_addr		addr;
 };
 
 struct xt_connlimit_rb {
@@ -116,8 +115,7 @@ same_source_net(const union nf_inet_addr *addr,
 }
 
 static bool add_hlist(struct hlist_head *head,
-		      const struct nf_conntrack_tuple *tuple,
-		      const union nf_inet_addr *addr)
+		      const struct nf_conntrack_tuple *tuple)
 {
 	struct xt_connlimit_conn *conn;
 
@@ -125,7 +123,6 @@ static bool add_hlist(struct hlist_head *head,
 	if (conn == NULL)
 		return false;
 	conn->tuple = *tuple;
-	conn->addr = *addr;
 	hlist_add_head(&conn->node, head);
 	return true;
 }
@@ -231,7 +228,7 @@ count_tree(struct net *net, struct rb_root *root,
 			if (!addit)
 				return count;
 
-			if (!add_hlist(&rbconn->hhead, tuple, addr))
+			if (!add_hlist(&rbconn->hhead, tuple))
 				return 0; /* hotdrop */
 
 			return count + 1;
@@ -270,7 +267,6 @@ count_tree(struct net *net, struct rb_root *root,
 	}
 
 	conn->tuple = *tuple;
-	conn->addr = *addr;
 	rbconn->addr = *addr;
 
 	INIT_HLIST_HEAD(&rbconn->hhead);
-- 
2.7.4


* [PATCH 4.14 2/4] netfilter: nf_conncount: expose connection list interface
  2019-01-02 17:00 [PATCH 4.14 0/4] netfilter: xt_connlimit: backport upstream fixes for race in connection counting Mauricio Faria de Oliveira
  2019-01-02 17:00 ` [PATCH 4.14 1/4] netfilter: xt_connlimit: don't store address in the conn nodes Mauricio Faria de Oliveira
@ 2019-01-02 17:00 ` Mauricio Faria de Oliveira
  2019-01-02 17:00 ` [PATCH 4.14 3/4] netfilter: nf_conncount: Fix garbage collection with zones Mauricio Faria de Oliveira
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Mauricio Faria de Oliveira @ 2019-01-02 17:00 UTC (permalink / raw)
  To: stable, netdev, Florian Westphal
  Cc: Alakesh Haloi, nivedita.singhvi, Pablo Neira Ayuso,
	Jozsef Kadlecsik, David S. Miller, Yi-Hung Wei

From: Pablo Neira Ayuso <pablo@netfilter.org>

commit 5e5cbc7b23eaf13e18652c03efbad5be6995de6a upstream.

This patch provides an interface to maintain the list of connections and
the lookup function to obtain the number of connections in the list.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

[mfo: backport: refresh context lines and use older symbol/file names:
 - nf_conntrack_count.h: new file, add include guards.
 - nf_conncount.c -> xt_connlimit.c.
   - nf_conncount_rb -> xt_connlimit_rb
   - nf_conncount_tuple -> xt_connlimit_conn
   - conncount_rb_cachep -> connlimit_rb_cachep
   - conncount_conn_cachep -> connlimit_conn_cachep]
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
---
 include/net/netfilter/nf_conntrack_count.h | 14 ++++++++++++
 net/netfilter/xt_connlimit.c               | 36 +++++++++++++++++++-----------
 2 files changed, 37 insertions(+), 13 deletions(-)
 create mode 100644 include/net/netfilter/nf_conntrack_count.h

diff --git a/include/net/netfilter/nf_conntrack_count.h b/include/net/netfilter/nf_conntrack_count.h
new file mode 100644
index 0000000..54e43b8
--- /dev/null
+++ b/include/net/netfilter/nf_conntrack_count.h
@@ -0,0 +1,14 @@
+#ifndef _NF_CONNTRACK_COUNT_H
+#define _NF_CONNTRACK_COUNT_H
+
+unsigned int nf_conncount_lookup(struct net *net, struct hlist_head *head,
+				 const struct nf_conntrack_tuple *tuple,
+				 const struct nf_conntrack_zone *zone,
+				 bool *addit);
+
+bool nf_conncount_add(struct hlist_head *head,
+		      const struct nf_conntrack_tuple *tuple);
+
+void nf_conncount_cache_free(struct hlist_head *hhead);
+
+#endif
diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index 79d4151..7af5875 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -114,7 +114,7 @@ same_source_net(const union nf_inet_addr *addr,
 	}
 }
 
-static bool add_hlist(struct hlist_head *head,
+bool nf_conncount_add(struct hlist_head *head,
 		      const struct nf_conntrack_tuple *tuple)
 {
 	struct xt_connlimit_conn *conn;
@@ -126,12 +126,12 @@ static bool add_hlist(struct hlist_head *head,
 	hlist_add_head(&conn->node, head);
 	return true;
 }
+EXPORT_SYMBOL_GPL(nf_conncount_add);
 
-static unsigned int check_hlist(struct net *net,
-				struct hlist_head *head,
-				const struct nf_conntrack_tuple *tuple,
-				const struct nf_conntrack_zone *zone,
-				bool *addit)
+unsigned int nf_conncount_lookup(struct net *net, struct hlist_head *head,
+				 const struct nf_conntrack_tuple *tuple,
+				 const struct nf_conntrack_zone *zone,
+				 bool *addit)
 {
 	const struct nf_conntrack_tuple_hash *found;
 	struct xt_connlimit_conn *conn;
@@ -176,6 +176,7 @@ static unsigned int check_hlist(struct net *net,
 
 	return length;
 }
+EXPORT_SYMBOL_GPL(nf_conncount_lookup);
 
 static void tree_nodes_free(struct rb_root *root,
 			    struct xt_connlimit_rb *gc_nodes[],
@@ -222,13 +223,15 @@ count_tree(struct net *net, struct rb_root *root,
 		} else {
 			/* same source network -> be counted! */
 			unsigned int count;
-			count = check_hlist(net, &rbconn->hhead, tuple, zone, &addit);
+
+			count = nf_conncount_lookup(net, &rbconn->hhead, tuple,
+						    zone, &addit);
 
 			tree_nodes_free(root, gc_nodes, gc_count);
 			if (!addit)
 				return count;
 
-			if (!add_hlist(&rbconn->hhead, tuple))
+			if (!nf_conncount_add(&rbconn->hhead, tuple))
 				return 0; /* hotdrop */
 
 			return count + 1;
@@ -238,7 +241,7 @@ count_tree(struct net *net, struct rb_root *root,
 			continue;
 
 		/* only used for GC on hhead, retval and 'addit' ignored */
-		check_hlist(net, &rbconn->hhead, tuple, zone, &addit);
+		nf_conncount_lookup(net, &rbconn->hhead, tuple, zone, &addit);
 		if (hlist_empty(&rbconn->hhead))
 			gc_nodes[gc_count++] = rbconn;
 	}
@@ -378,11 +381,19 @@ static int connlimit_mt_check(const struct xt_mtchk_param *par)
 	return 0;
 }
 
-static void destroy_tree(struct rb_root *r)
+void nf_conncount_cache_free(struct hlist_head *hhead)
 {
 	struct xt_connlimit_conn *conn;
-	struct xt_connlimit_rb *rbconn;
 	struct hlist_node *n;
+
+	hlist_for_each_entry_safe(conn, n, hhead, node)
+		kmem_cache_free(connlimit_conn_cachep, conn);
+}
+EXPORT_SYMBOL_GPL(nf_conncount_cache_free);
+
+static void destroy_tree(struct rb_root *r)
+{
+	struct xt_connlimit_rb *rbconn;
 	struct rb_node *node;
 
 	while ((node = rb_first(r)) != NULL) {
@@ -390,8 +401,7 @@ static void destroy_tree(struct rb_root *r)
 
 		rb_erase(node, r);
 
-		hlist_for_each_entry_safe(conn, n, &rbconn->hhead, node)
-			kmem_cache_free(connlimit_conn_cachep, conn);
+		nf_conncount_cache_free(&rbconn->hhead);
 
 		kmem_cache_free(connlimit_rb_cachep, rbconn);
 	}
-- 
2.7.4


* [PATCH 4.14 3/4] netfilter: nf_conncount: Fix garbage collection with zones
  2019-01-02 17:00 [PATCH 4.14 0/4] netfilter: xt_connlimit: backport upstream fixes for race in connection counting Mauricio Faria de Oliveira
  2019-01-02 17:00 ` [PATCH 4.14 1/4] netfilter: xt_connlimit: don't store address in the conn nodes Mauricio Faria de Oliveira
  2019-01-02 17:00 ` [PATCH 4.14 2/4] netfilter: nf_conncount: expose connection list interface Mauricio Faria de Oliveira
@ 2019-01-02 17:00 ` Mauricio Faria de Oliveira
  2019-01-02 17:00 ` [PATCH 4.14 4/4] netfilter: nf_conncount: fix garbage collection confirm race Mauricio Faria de Oliveira
  2019-01-02 17:17 ` [PATCH 4.14 0/4] netfilter: xt_connlimit: backport upstream fixes for race in connection counting Florian Westphal
  4 siblings, 0 replies; 9+ messages in thread
From: Mauricio Faria de Oliveira @ 2019-01-02 17:00 UTC (permalink / raw)
  To: stable, netdev, Florian Westphal
  Cc: Alakesh Haloi, nivedita.singhvi, Pablo Neira Ayuso,
	Jozsef Kadlecsik, David S. Miller, Yi-Hung Wei

From: Yi-Hung Wei <yihung.wei@gmail.com>

commit 21ba8847f857028dc83a0f341e16ecc616e34740 upstream.

Currently, we use check_hlist() for garbage collection. However, we
use the ‘zone’ from the counted entry to query the existence of
existing entries in the hlist. This could be wrong when they are in
different zones, and this patch fixes this issue.

Fixes: e59ea3df3fc2 ("netfilter: xt_connlimit: honor conntrack zone if available")
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

[mfo: backport: refresh context lines and use older symbol/file names, note hunk 5:
 - nf_conncount.c -> xt_connlimit.c
   - nf_conncount_rb -> xt_connlimit_rb
   - nf_conncount_tuple -> xt_connlimit_conn
   - hunk 5: remove check for non-NULL 'tuple', that isn't required as it's introduced
     by upstream commit 35d8deb80 ("netfilter: conncount: Support count only use case")
     which addresses nf_conncount_count() that does not exist yet -- it's introduced by
     upstream commit 625c556118f3 ("netfilter: connlimit: split xt_connlimit into front
     and backend"), a refactor change.
 - nft_connlimit.c -> removed, not used/doesn't exist yet.]
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
---
 include/net/netfilter/nf_conntrack_count.h |  3 ++-
 net/netfilter/xt_connlimit.c               | 13 +++++++++----
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_count.h b/include/net/netfilter/nf_conntrack_count.h
index 54e43b8..4b71a2f 100644
--- a/include/net/netfilter/nf_conntrack_count.h
+++ b/include/net/netfilter/nf_conntrack_count.h
@@ -7,7 +7,8 @@ unsigned int nf_conncount_lookup(struct net *net, struct hlist_head *head,
 				 bool *addit);
 
 bool nf_conncount_add(struct hlist_head *head,
-		      const struct nf_conntrack_tuple *tuple);
+		      const struct nf_conntrack_tuple *tuple,
+		      const struct nf_conntrack_zone *zone);
 
 void nf_conncount_cache_free(struct hlist_head *hhead);
 
diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index 7af5875..ab1f849 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -46,6 +46,7 @@
 struct xt_connlimit_conn {
 	struct hlist_node		node;
 	struct nf_conntrack_tuple	tuple;
+	struct nf_conntrack_zone	zone;
 };
 
 struct xt_connlimit_rb {
@@ -115,7 +116,8 @@ same_source_net(const union nf_inet_addr *addr,
 }
 
 bool nf_conncount_add(struct hlist_head *head,
-		      const struct nf_conntrack_tuple *tuple)
+		      const struct nf_conntrack_tuple *tuple,
+		      const struct nf_conntrack_zone *zone)
 {
 	struct xt_connlimit_conn *conn;
 
@@ -123,6 +125,7 @@ bool nf_conncount_add(struct hlist_head *head,
 	if (conn == NULL)
 		return false;
 	conn->tuple = *tuple;
+	conn->zone = *zone;
 	hlist_add_head(&conn->node, head);
 	return true;
 }
@@ -143,7 +146,7 @@ unsigned int nf_conncount_lookup(struct net *net, struct hlist_head *head,
 
 	/* check the saved connections */
 	hlist_for_each_entry_safe(conn, n, head, node) {
-		found = nf_conntrack_find_get(net, zone, &conn->tuple);
+		found = nf_conntrack_find_get(net, &conn->zone, &conn->tuple);
 		if (found == NULL) {
 			hlist_del(&conn->node);
 			kmem_cache_free(connlimit_conn_cachep, conn);
@@ -152,7 +155,8 @@ unsigned int nf_conncount_lookup(struct net *net, struct hlist_head *head,
 
 		found_ct = nf_ct_tuplehash_to_ctrack(found);
 
-		if (nf_ct_tuple_equal(&conn->tuple, tuple)) {
+		if (nf_ct_tuple_equal(&conn->tuple, tuple) &&
+		    nf_ct_zone_equal(found_ct, zone, zone->dir)) {
 			/*
 			 * Just to be sure we have it only once in the list.
 			 * We should not see tuples twice unless someone hooks
@@ -231,7 +235,7 @@ count_tree(struct net *net, struct rb_root *root,
 			if (!addit)
 				return count;
 
-			if (!nf_conncount_add(&rbconn->hhead, tuple))
+			if (!nf_conncount_add(&rbconn->hhead, tuple, zone))
 				return 0; /* hotdrop */
 
 			return count + 1;
@@ -270,6 +274,7 @@ count_tree(struct net *net, struct rb_root *root,
 	}
 
 	conn->tuple = *tuple;
+	conn->zone = *zone;
 	rbconn->addr = *addr;
 
 	INIT_HLIST_HEAD(&rbconn->hhead);
-- 
2.7.4
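
The zone mix-up addressed by this patch can be illustrated with a small
user-space model. This is only a sketch under stated assumptions, not the
kernel code: 'struct conn_key', 'table_has()' and the two 'survivors_*()'
helpers are hypothetical names, zones and tuples are reduced to plain
integers, and a flat array stands in for the conntrack table. It compares
a list walk that looks every saved entry up in the caller's zone with one
that uses the zone stored alongside each entry.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified connection key: (zone id, tuple id). */
struct conn_key {
	uint16_t zone;
	uint32_t tuple;
};

/* Toy stand-in for a conntrack table lookup. */
bool table_has(const struct conn_key *table, size_t n,
	       uint16_t zone, uint32_t tuple)
{
	for (size_t i = 0; i < n; i++)
		if (table[i].zone == zone && table[i].tuple == tuple)
			return true;
	return false;
}

/* Count saved entries that survive the walk when every lookup uses the
 * caller's zone: entries that live in another zone look "gone".
 */
size_t survivors_caller_zone(const struct conn_key *table, size_t n,
			     const struct conn_key *saved, size_t m,
			     uint16_t caller_zone)
{
	size_t kept = 0;

	for (size_t i = 0; i < m; i++)
		if (table_has(table, n, caller_zone, saved[i].tuple))
			kept++;
	return kept;
}

/* Same walk, but each entry is looked up in the zone saved with it. */
size_t survivors_entry_zone(const struct conn_key *table, size_t n,
			    const struct conn_key *saved, size_t m)
{
	size_t kept = 0;

	for (size_t i = 0; i < m; i++)
		if (table_has(table, n, saved[i].zone, saved[i].tuple))
			kept++;
	return kept;
}
```

With two live connections in two different zones, the first walk would
drop one of them while the second keeps both.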


* [PATCH 4.14 4/4] netfilter: nf_conncount: fix garbage collection confirm race
  2019-01-02 17:00 [PATCH 4.14 0/4] netfilter: xt_connlimit: backport upstream fixes for race in connection counting Mauricio Faria de Oliveira
                   ` (2 preceding siblings ...)
  2019-01-02 17:00 ` [PATCH 4.14 3/4] netfilter: nf_conncount: Fix garbage collection with zones Mauricio Faria de Oliveira
@ 2019-01-02 17:00 ` Mauricio Faria de Oliveira
  2019-01-02 17:06   ` Florian Westphal
  2019-01-02 17:17 ` [PATCH 4.14 0/4] netfilter: xt_connlimit: backport upstream fixes for race in connection counting Florian Westphal
  4 siblings, 1 reply; 9+ messages in thread
From: Mauricio Faria de Oliveira @ 2019-01-02 17:00 UTC (permalink / raw)
  To: stable, netdev, Florian Westphal
  Cc: Alakesh Haloi, nivedita.singhvi, Pablo Neira Ayuso,
	Jozsef Kadlecsik, David S. Miller, Yi-Hung Wei

From: Florian Westphal <fw@strlen.de>

commit b36e4523d4d56e2595e28f16f6ccf1cd6a9fc452 upstream.

Yi-Hung Wei and Justin Pettit found a race in the garbage collection scheme
used by nf_conncount.

When doing list walk, we lookup the tuple in the conntrack table.
If the lookup fails we remove this tuple from our list because
the conntrack entry is gone.

This is the common cause, but it turns out it's not the only one.
The list entry could have been created just before by another cpu, i.e. the
conntrack entry might not yet have been inserted into the global hash.

To avoid this, we introduce a timestamp and the owning cpu.
If the entry appears to be stale, evict only if:
 1. The current cpu is the one that added the entry, or,
 2. The timestamp is older than two jiffies

The second constraint allows GC to be taken over by other
cpu too (e.g. because a cpu was offlined or napi got moved to another
cpu).

We can't pretend the 'doubtful' entry wasn't in our list.
Instead, when we don't find an entry, indicate via IS_ERR
that the entry was removed ('did not exist') or withheld
('might-be-unconfirmed').

This most likely also fixes a xt_connlimit imbalance earlier reported by
Dmitry Andrianov.

Cc: Dmitry Andrianov <dmitry.andrianov@alertme.com>
Reported-by: Justin Pettit <jpettit@vmware.com>
Reported-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

[mfo: backport: refresh context lines and use older symbol/file names:
 - nf_conncount.c -> xt_connlimit.c.
   - nf_conncount_rb -> xt_connlimit_rb
   - nf_conncount_tuple -> xt_connlimit_conn
   - conncount_conn_cachep -> connlimit_conn_cachep]
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
---
 net/netfilter/xt_connlimit.c | 52 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 47 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index ab1f849..913b86ef 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -47,6 +47,8 @@ struct xt_connlimit_conn {
 	struct hlist_node		node;
 	struct nf_conntrack_tuple	tuple;
 	struct nf_conntrack_zone	zone;
+	int				cpu;
+	u32				jiffies32;
 };
 
 struct xt_connlimit_rb {
@@ -126,11 +128,42 @@ bool nf_conncount_add(struct hlist_head *head,
 		return false;
 	conn->tuple = *tuple;
 	conn->zone = *zone;
+	conn->cpu = raw_smp_processor_id();
+	conn->jiffies32 = (u32)jiffies;
 	hlist_add_head(&conn->node, head);
 	return true;
 }
 EXPORT_SYMBOL_GPL(nf_conncount_add);
 
+static const struct nf_conntrack_tuple_hash *
+find_or_evict(struct net *net, struct xt_connlimit_conn *conn)
+{
+	const struct nf_conntrack_tuple_hash *found;
+	unsigned long a, b;
+	int cpu = raw_smp_processor_id();
+	__s32 age;
+
+	found = nf_conntrack_find_get(net, &conn->zone, &conn->tuple);
+	if (found)
+		return found;
+	b = conn->jiffies32;
+	a = (u32)jiffies;
+
+	/* conn might have been added just before by another cpu and
+	 * might still be unconfirmed.  In this case, nf_conntrack_find()
+	 * returns no result.  Thus only evict if this cpu added the
+	 * stale entry or if the entry is older than two jiffies.
+	 */
+	age = a - b;
+	if (conn->cpu == cpu || age >= 2) {
+		hlist_del(&conn->node);
+		kmem_cache_free(connlimit_conn_cachep, conn);
+		return ERR_PTR(-ENOENT);
+	}
+
+	return ERR_PTR(-EAGAIN);
+}
+
 unsigned int nf_conncount_lookup(struct net *net, struct hlist_head *head,
 				 const struct nf_conntrack_tuple *tuple,
 				 const struct nf_conntrack_zone *zone,
@@ -138,18 +171,27 @@ unsigned int nf_conncount_lookup(struct net *net, struct hlist_head *head,
 {
 	const struct nf_conntrack_tuple_hash *found;
 	struct xt_connlimit_conn *conn;
-	struct hlist_node *n;
 	struct nf_conn *found_ct;
+	struct hlist_node *n;
 	unsigned int length = 0;
 
 	*addit = true;
 
 	/* check the saved connections */
 	hlist_for_each_entry_safe(conn, n, head, node) {
-		found = nf_conntrack_find_get(net, &conn->zone, &conn->tuple);
-		if (found == NULL) {
-			hlist_del(&conn->node);
-			kmem_cache_free(connlimit_conn_cachep, conn);
+		found = find_or_evict(net, conn);
+		if (IS_ERR(found)) {
+			/* Not found, but might be about to be confirmed */
+			if (PTR_ERR(found) == -EAGAIN) {
+				length++;
+				if (!tuple)
+					continue;
+
+				if (nf_ct_tuple_equal(&conn->tuple, tuple) &&
+				    nf_ct_zone_id(&conn->zone, conn->zone.dir) ==
+				    nf_ct_zone_id(zone, zone->dir))
+					*addit = false;
+			}
 			continue;
 		}
 
-- 
2.7.4
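
The eviction rule described in the commit message above can be sketched
in user space as follows. This is an illustration only, not the kernel
implementation: 'struct entry', 'enum verdict' and 'decide()' are
hypothetical names, and the conntrack lookup result, the current CPU and
the current jiffies value are passed in as plain arguments. The age is
computed here as an unsigned 32-bit difference, which is an assumption of
this sketch rather than what the diff above does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a list entry: only the fields the rule needs. */
struct entry {
	int cpu;            /* CPU that added the entry */
	uint32_t jiffies32; /* low 32 bits of jiffies at insertion time */
};

enum verdict {
	VERDICT_FOUND, /* conntrack lookup succeeded, entry stays */
	VERDICT_EVICT, /* stale: remove it (the patch returns -ENOENT) */
	VERDICT_KEEP,  /* maybe unconfirmed: keep and count it (-EAGAIN) */
};

enum verdict decide(const struct entry *e, bool found_in_conntrack,
		    int this_cpu, uint32_t now)
{
	uint32_t age;

	if (found_in_conntrack)
		return VERDICT_FOUND;

	/* Unsigned subtraction stays correct across a 32-bit wraparound. */
	age = now - e->jiffies32;

	/* Evict only if this CPU added the entry or it is at least two
	 * jiffies old; otherwise another CPU may still be about to
	 * confirm the connection.
	 */
	if (e->cpu == this_cpu || age >= 2)
		return VERDICT_EVICT;

	return VERDICT_KEEP;
}
```

An entry added by another CPU less than two jiffies ago is the only case
that yields VERDICT_KEEP; it still counts towards the list length.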


* Re: [PATCH 4.14 4/4] netfilter: nf_conncount: fix garbage collection confirm race
  2019-01-02 17:00 ` [PATCH 4.14 4/4] netfilter: nf_conncount: fix garbage collection confirm race Mauricio Faria de Oliveira
@ 2019-01-02 17:06   ` Florian Westphal
  2019-01-02 17:07     ` Mauricio Faria de Oliveira
  0 siblings, 1 reply; 9+ messages in thread
From: Florian Westphal @ 2019-01-02 17:06 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: stable, netdev, Florian Westphal, Alakesh Haloi,
	nivedita.singhvi, Pablo Neira Ayuso, Jozsef Kadlecsik,
	David S. Miller, Yi-Hung Wei

Mauricio Faria de Oliveira <mfo@canonical.com> wrote:
> +static const struct nf_conntrack_tuple_hash *
> +find_or_evict(struct net *net, struct xt_connlimit_conn *conn)
> +{
> +	const struct nf_conntrack_tuple_hash *found;
> +	unsigned long a, b;
> +	int cpu = raw_smp_processor_id();
> +	__s32 age;

This needs to be 'u32'.  Alternatively, also backport

4cd273bb91b3001f623 ("netfilter: nf_conncount: don't skip eviction when
age is negative").
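
A minimal user-space sketch of the likely concern behind this comment,
judging by the title of the referenced commit: with a signed age, a very
large timestamp delta turns negative, so the 'age >= 2' test fails and
the stale entry is not evicted. 'stale_signed()' and 'stale_unsigned()'
are hypothetical helpers, not kernel functions, and the signed variant
assumes the usual two's-complement wrapping on conversion (as done by
GCC and Clang; the C standard leaves that conversion
implementation-defined).

```c
#include <stdbool.h>
#include <stdint.h>

/* Eviction test with a signed 32-bit age. */
bool stale_signed(uint32_t now, uint32_t added)
{
	/* Assumes modulo-2^32 conversion, as on GCC and Clang. */
	int32_t age = (int32_t)(now - added);

	return age >= 2;
}

/* Same test with an unsigned 32-bit age. */
bool stale_unsigned(uint32_t now, uint32_t added)
{
	uint32_t age = now - added;

	return age >= 2;
}
```

For small deltas both variants agree; they differ once the delta reaches
2^31 jiffies, where only the unsigned variant still reports the entry as
stale.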


* Re: [PATCH 4.14 4/4] netfilter: nf_conncount: fix garbage collection confirm race
  2019-01-02 17:06   ` Florian Westphal
@ 2019-01-02 17:07     ` Mauricio Faria de Oliveira
  0 siblings, 0 replies; 9+ messages in thread
From: Mauricio Faria de Oliveira @ 2019-01-02 17:07 UTC (permalink / raw)
  To: Florian Westphal
  Cc: stable, netdev, Alakesh Haloi, Nivedita Singhvi,
	Pablo Neira Ayuso, Jozsef Kadlecsik, David S. Miller,
	Yi-Hung Wei

On Wed, Jan 2, 2019 at 3:06 PM Florian Westphal <fw@strlen.de> wrote:
>
> Mauricio Faria de Oliveira <mfo@canonical.com> wrote:
> > +static const struct nf_conntrack_tuple_hash *
> > +find_or_evict(struct net *net, struct xt_connlimit_conn *conn)
> > +{
> > +     const struct nf_conntrack_tuple_hash *found;
> > +     unsigned long a, b;
> > +     int cpu = raw_smp_processor_id();
> > +     __s32 age;
>
> This needs to be 'u32'.  Alternatively, also backport
>
> 4cd273bb91b3001f623 ("netfilter: nf_conncount: don't skip eviction when
> age is negative").

Sure, will pick that up.  Waiting for more comments before sending a v2.


-- 
Mauricio Faria de Oliveira


* Re: [PATCH 4.14 0/4] netfilter: xt_connlimit: backport upstream fixes for race in connection counting
  2019-01-02 17:00 [PATCH 4.14 0/4] netfilter: xt_connlimit: backport upstream fixes for race in connection counting Mauricio Faria de Oliveira
                   ` (3 preceding siblings ...)
  2019-01-02 17:00 ` [PATCH 4.14 4/4] netfilter: nf_conncount: fix garbage collection confirm race Mauricio Faria de Oliveira
@ 2019-01-02 17:17 ` Florian Westphal
  2019-01-02 19:52   ` Mauricio Faria de Oliveira
  4 siblings, 1 reply; 9+ messages in thread
From: Florian Westphal @ 2019-01-02 17:17 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: stable, netdev, Florian Westphal, Alakesh Haloi,
	nivedita.singhvi, Pablo Neira Ayuso, Jozsef Kadlecsik,
	David S. Miller, Yi-Hung Wei

Mauricio Faria de Oliveira <mfo@canonical.com> wrote:
> Recently, Alakesh Haloi reported the following issue [1] with stable/4.14:
> 
>   """
>   An iptable rule like the following on a multicore systems will result in
>   accepting more connections than set in the rule.
> 
>   iptables  -A INPUT -p tcp -m tcp --syn --dport 7777 -m connlimit \
>         --connlimit-above 2000 --connlimit-mask 0 -j DROP
>   """
> 
> And proposed a fix that is not in Linus's tree. The discussion went on to
> confirm whether the issue was still reproducible with mainline/nf.git tip,
> and to either identify the upstream fix or re-submit the non-upstream fix.
> 
> Alakesh eventually was able to test with upstream, and reported that the
> issue was still reproducible [2].
> On that, our findings diverge, at least in my test environment:
> 
> First, I verified that the suggested mainline fix for the issue [3] indeed
> fixes it, by testing with it applied and reverted on v4.18, a clean revert.
> (The issue is reproducible with the commit reverted).
> 
> Then, with a consistent reproducer, I moved to nf.git, with HEAD on commit
> a007232 ("netfilter: nf_conncount: fix argument order to find_next_bit"),
> and the issue was not reproducible (even with 20+ threads on client side,
> the number Alakesh reported to achieve 2150+ connections [4], and I tried
> spreading the network interface IRQ affinity over more and more CPUs too.)
> 
> Either way, the suggested mainline fix does actually fix the issue in 4.14
> for at least one environment. So, it might well be the case that Alakesh's
> test environment has differences/subtleties that lead to more connections
> accepted, and more commits are needed for that particular environment type.

nf_conncount has a design flaw that is only closed in nf.git/net.git
at the time of this writing, so results with earlier kernels (including
4.20) might just fail with different bugs.

4.14 doesn't have those problems, so I think this series (aside from the
nit in patch 4/4) indeed should fix the issue reported.

> But for now, with one bare-metal environment (24-core server, 4-core client)
> verified, I thought of submitting the patches for review/comments/testing,
> then looking for additional fixes for that environment separately.

4.14 should be good after this afaics.

Thanks a lot for doing this backport and the detailed testing
information.


* Re: [PATCH 4.14 0/4] netfilter: xt_connlimit: backport upstream fixes for race in connection counting
  2019-01-02 17:17 ` [PATCH 4.14 0/4] netfilter: xt_connlimit: backport upstream fixes for race in connection counting Florian Westphal
@ 2019-01-02 19:52   ` Mauricio Faria de Oliveira
  0 siblings, 0 replies; 9+ messages in thread
From: Mauricio Faria de Oliveira @ 2019-01-02 19:52 UTC (permalink / raw)
  To: Florian Westphal
  Cc: stable, netdev, Alakesh Haloi, Nivedita Singhvi,
	Pablo Neira Ayuso, Jozsef Kadlecsik, David S. Miller,
	Yi-Hung Wei

Florian,

On Wed, Jan 2, 2019 at 3:17 PM Florian Westphal <fw@strlen.de> wrote:
>
> Mauricio Faria de Oliveira <mfo@canonical.com> wrote:
<snip>
> > Either way, the suggested mainline fix does actually fix the issue in 4.14
> > for at least one environment. So, it might well be the case that Alakesh's
> > test environment has differences/subtleties that lead to more connections
> > accepted, and more commits are needed for that particular environment type.
>
> nf_conncount has a design flaw that is only closed in nf.git/net.git
> at the time of this writing, so results with earlier kernels (including
> 4.20) might just fail with different bugs.
>
> 4.14 doesn't have those problems, so I think this series (aside from the
> nit in patch 4/4) indeed should fix the issue reported.

Thanks for mentioning that. It offers some relief about the different
results observed.

> > But for now, with one bare-metal environment (24-core server, 4-core client)
> > verified, I thought of submitting the patches for review/comments/testing,
> > then looking for additional fixes for that environment separately.
>
> 4.14 should be good after this afaics.
>
> Thanks a lot for doing this backport and the detailed testing
> information.

Thank you a lot for your quick and careful review.
I'll build/test/submit a PATCH v2 series (with that fix to patch 4/4) shortly.

cheers,

-- 
Mauricio Faria de Oliveira

