[RFC nf-next 0/4] netfilter: conntrack: allow insertion of clashing entries

netfilter-devel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC nf-next 0/4] netfilter: conntrack: allow insertion of clashing entries
@ 2020-01-08 13:44 Florian Westphal
  2020-01-08 13:44 ` [RFC nf-next 1/4] netfilter: conntrack: remove two args from resolve_clash Florian Westphal
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Florian Westphal @ 2020-01-08 13:44 UTC (permalink / raw)
  To: netfilter-devel

This series allows conntrack to insert a duplicate conntrack entry
if the reply direction doesn't result in a clash with a different
original connection.

Background:

kubernetes creates load-balancing rules for DNS using
-m statistics, e.g.:
-p udp --dport 53 -m statistics --mode random ... -j DNAT --to-destination x
-p udp --dport 53 -m statistics --mode random ... -j DNAT --to-destination y

When the resolver sends an A and AAAA request back-to-back from
different threads on the same socket, this has a high chance of a connection
tracking clash at insertion time.

This in turn results in a drop of the clashing udp packet which then
results in a 5 second DNS timeout.

The clash cannot be resolved with the current logic because the
two conntracks entries have different NAT transformations, the first one
from s:highport to x.53, the second from s:highport to y.53.

One solution is to change rules to use a consistent mapping, e.g.
using -m cluster or nftables 'jhash' expression.  This would cause
the A and AAAA requests coming from same socket to match the same
rule and thus share the same NAT information.

This change adds a second clash resolution/drop avoidance step:
A clashing entry will be added anyway provided the reply direction
is unique.

Because this results in duplicate conntrack entries for the original
direction, this comes with strings attached:
1. The clashed conntrack entry will only be around for 3 seconds
2. The clashed entry will still fail to be inserted if hash
   chain grew too large.

This entire series isn't nice but so far I did not find a better
solution.

Comments welcome.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC nf-next 1/4] netfilter: conntrack: remove two args from resolve_clash
  2020-01-08 13:44 [RFC nf-next 0/4] netfilter: conntrack: allow insertion of clashing entries Florian Westphal
@ 2020-01-08 13:44 ` Florian Westphal
  2020-01-08 13:44 ` [RFC nf-next 2/4] netfilter: conntrack: place confirm-bit setting in a helper Florian Westphal
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Florian Westphal @ 2020-01-08 13:44 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

ctinfo is whats taken from the skb, i.e.
ct = nf_ct_get(skb, &ctinfo).

We do not pass 'ct' and instead re-fetch it from the skb.
Just do the same for both netns and ctinfo.

Also add a comment on what clash resolution is supposed to do.
While at it, one indent level can be removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_core.c | 69 +++++++++++++++++++++++--------
 1 file changed, 51 insertions(+), 18 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index f4c4b467c87e..c2be039c40f0 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -894,31 +894,64 @@ static void nf_ct_acct_merge(struct nf_conn *ct, enum ip_conntrack_info ctinfo,
 	}
 }
 
-/* Resolve race on insertion if this protocol allows this. */
+/**
+ * nf_ct_resolve_clash - attempt to handle clash without packet drop
+ *
+ * @skb: skb that causes the clash
+ * @h: tuplehash of the clashing entry already in table
+ *
+ * A conntrack entry can be inserted to the connection tracking table
+ * if there is no existing entry with an identical tuple.
+ *
+ * If there is one, @skb (and the assocated, unconfirmed conntrack) has
+ * to be dropped.  In case @skb is retransmitted, next conntrack lookup
+ * will find the already-existing entry.
+ *
+ * The major problem with such packet drop is the extra delay added by
+ * the packet loss -- it will take some time for a retransmit to occur
+ * (or the sender to time out when waiting for a reply).
+ *
+ * This function attempts to handle the situation without packet drop.
+ *
+ * If @skb has no NAT transformation or if the colliding entries are
+ * exactly the same, only the to-be-confirmed conntrack entry is discarded
+ * and @skb is associated with the conntrack entry already in the table.
+ *
+ * Returns NF_DROP if the clash could not be resolved.
+ */
 static __cold noinline int
-nf_ct_resolve_clash(struct net *net, struct sk_buff *skb,
-		    enum ip_conntrack_info ctinfo,
-		    struct nf_conntrack_tuple_hash *h)
+nf_ct_resolve_clash(struct sk_buff *skb, struct nf_conntrack_tuple_hash *h)
 {
 	/* This is the conntrack entry already in hashes that won race. */
 	struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
 	const struct nf_conntrack_l4proto *l4proto;
-	enum ip_conntrack_info oldinfo;
-	struct nf_conn *loser_ct = nf_ct_get(skb, &oldinfo);
+	enum ip_conntrack_info ctinfo;
+	struct nf_conn *loser_ct;
+	struct net *net;
+
+	loser_ct = nf_ct_get(skb, &ctinfo);
 
 	l4proto = nf_ct_l4proto_find(nf_ct_protonum(ct));
-	if (l4proto->allow_clash &&
-	    !nf_ct_is_dying(ct) &&
-	    atomic_inc_not_zero(&ct->ct_general.use)) {
-		if (((ct->status & IPS_NAT_DONE_MASK) == 0) ||
-		    nf_ct_match(ct, loser_ct)) {
-			nf_ct_acct_merge(ct, ctinfo, loser_ct);
-			nf_conntrack_put(&loser_ct->ct_general);
-			nf_ct_set(skb, ct, oldinfo);
-			return NF_ACCEPT;
-		}
-		nf_ct_put(ct);
+	if (!l4proto->allow_clash)
+		goto drop;
+
+	if (nf_ct_is_dying(ct))
+		goto drop;
+
+	if (!atomic_inc_not_zero(&ct->ct_general.use))
+		goto drop;
+
+	if (((ct->status & IPS_NAT_DONE_MASK) == 0) ||
+	    nf_ct_match(ct, loser_ct)) {
+		nf_ct_acct_merge(ct, ctinfo, loser_ct);
+		nf_conntrack_put(&loser_ct->ct_general);
+		nf_ct_set(skb, ct, ctinfo);
+		return NF_ACCEPT;
 	}
+
+	nf_ct_put(ct);
+drop:
+	net = nf_ct_net(loser_ct);
 	NF_CT_STAT_INC(net, drop);
 	return NF_DROP;
 }
@@ -1036,7 +1069,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 
 out:
 	nf_ct_add_to_dying_list(ct);
-	ret = nf_ct_resolve_clash(net, skb, ctinfo, h);
+	ret = nf_ct_resolve_clash(skb, h);
 dying:
 	nf_conntrack_double_unlock(hash, reply_hash);
 	NF_CT_STAT_INC(net, insert_failed);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC nf-next 2/4] netfilter: conntrack: place confirm-bit setting in a helper
  2020-01-08 13:44 [RFC nf-next 0/4] netfilter: conntrack: allow insertion of clashing entries Florian Westphal
  2020-01-08 13:44 ` [RFC nf-next 1/4] netfilter: conntrack: remove two args from resolve_clash Florian Westphal
@ 2020-01-08 13:44 ` Florian Westphal
  2020-01-08 13:44 ` [RFC nf-next 3/4] netfilter: conntrack: split resolve_clash function Florian Westphal
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Florian Westphal @ 2020-01-08 13:44 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

... so it can be re-used from clash resolution in followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_core.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index c2be039c40f0..63ded7873141 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -894,6 +894,19 @@ static void nf_ct_acct_merge(struct nf_conn *ct, enum ip_conntrack_info ctinfo,
 	}
 }
 
+static void __nf_conntrack_insert_prepare(struct nf_conn *ct)
+{
+	struct nf_conn_tstamp *tstamp;
+
+	atomic_inc(&ct->ct_general.use);
+	ct->status |= IPS_CONFIRMED;
+
+	/* set conntrack timestamp, if enabled. */
+	tstamp = nf_conn_tstamp_find(ct);
+	if (tstamp)
+		tstamp->start = ktime_get_real_ns();
+}
+
 /**
  * nf_ct_resolve_clash - attempt to handle clash without packet drop
  *
@@ -965,7 +978,6 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	struct nf_conntrack_tuple_hash *h;
 	struct nf_conn *ct;
 	struct nf_conn_help *help;
-	struct nf_conn_tstamp *tstamp;
 	struct hlist_nulls_node *n;
 	enum ip_conntrack_info ctinfo;
 	struct net *net;
@@ -1042,13 +1054,8 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	   setting time, otherwise we'd get timer wrap in
 	   weird delay cases. */
 	ct->timeout += nfct_time_stamp;
-	atomic_inc(&ct->ct_general.use);
-	ct->status |= IPS_CONFIRMED;
 
-	/* set conntrack timestamp, if enabled. */
-	tstamp = nf_conn_tstamp_find(ct);
-	if (tstamp)
-		tstamp->start = ktime_get_real_ns();
+	__nf_conntrack_insert_prepare(ct);
 
 	/* Since the lookup is lockless, hash insertion must be done after
 	 * starting the timer and setting the CONFIRMED bit. The RCU barriers
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC nf-next 3/4] netfilter: conntrack: split resolve_clash function
  2020-01-08 13:44 [RFC nf-next 0/4] netfilter: conntrack: allow insertion of clashing entries Florian Westphal
  2020-01-08 13:44 ` [RFC nf-next 1/4] netfilter: conntrack: remove two args from resolve_clash Florian Westphal
  2020-01-08 13:44 ` [RFC nf-next 2/4] netfilter: conntrack: place confirm-bit setting in a helper Florian Westphal
@ 2020-01-08 13:44 ` Florian Westphal
  2020-01-08 13:45 ` [RFC nf-next 4/4] netfilter: conntrack: allow insertion of duplicate/clashing entries Florian Westphal
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Florian Westphal @ 2020-01-08 13:44 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

Followup patch will need a helper function with the 'clashing entries
refer to the identical tuple in both directions' resolution logic.

This patch will add another resolve_clash helper where loser_ct must
not be added to the dying list because it will be inserted into the
table.

Therefore this also moves the stat counters and dying-list insertion
of the losing ct.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_core.c | 58 ++++++++++++++++++++++---------
 1 file changed, 41 insertions(+), 17 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 63ded7873141..2987b2d72b6b 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -907,6 +907,39 @@ static void __nf_conntrack_insert_prepare(struct nf_conn *ct)
 		tstamp->start = ktime_get_real_ns();
 }
 
+static int __nf_ct_resolve_clash(struct sk_buff *skb,
+				 struct nf_conntrack_tuple_hash *h)
+{
+	/* This is the conntrack entry already in hashes that won race. */
+	struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
+	enum ip_conntrack_info ctinfo;
+	struct nf_conn *loser_ct;
+
+	loser_ct = nf_ct_get(skb, &ctinfo);
+
+	if (nf_ct_is_dying(ct))
+		return NF_DROP;
+
+	if (!atomic_inc_not_zero(&ct->ct_general.use))
+		return NF_DROP;
+
+	if (((ct->status & IPS_NAT_DONE_MASK) == 0) ||
+	    nf_ct_match(ct, loser_ct)) {
+		struct net *net = nf_ct_net(ct);
+
+		nf_ct_acct_merge(ct, ctinfo, loser_ct);
+		nf_ct_add_to_dying_list(loser_ct);
+		nf_conntrack_put(&loser_ct->ct_general);
+		nf_ct_set(skb, ct, ctinfo);
+
+		NF_CT_STAT_INC(net, insert_failed);
+		return NF_ACCEPT;
+	}
+
+	nf_ct_put(ct);
+	return NF_DROP;
+}
+
 /**
  * nf_ct_resolve_clash - attempt to handle clash without packet drop
  *
@@ -941,31 +974,23 @@ nf_ct_resolve_clash(struct sk_buff *skb, struct nf_conntrack_tuple_hash *h)
 	enum ip_conntrack_info ctinfo;
 	struct nf_conn *loser_ct;
 	struct net *net;
+	int ret;
 
 	loser_ct = nf_ct_get(skb, &ctinfo);
+	net = nf_ct_net(loser_ct);
 
 	l4proto = nf_ct_l4proto_find(nf_ct_protonum(ct));
 	if (!l4proto->allow_clash)
 		goto drop;
 
-	if (nf_ct_is_dying(ct))
-		goto drop;
-
-	if (!atomic_inc_not_zero(&ct->ct_general.use))
-		goto drop;
-
-	if (((ct->status & IPS_NAT_DONE_MASK) == 0) ||
-	    nf_ct_match(ct, loser_ct)) {
-		nf_ct_acct_merge(ct, ctinfo, loser_ct);
-		nf_conntrack_put(&loser_ct->ct_general);
-		nf_ct_set(skb, ct, ctinfo);
-		return NF_ACCEPT;
-	}
+	ret = __nf_ct_resolve_clash(skb, h);
+	if (ret == NF_ACCEPT)
+		return ret;
 
-	nf_ct_put(ct);
 drop:
-	net = nf_ct_net(loser_ct);
+	nf_ct_add_to_dying_list(loser_ct);
 	NF_CT_STAT_INC(net, drop);
+	NF_CT_STAT_INC(net, insert_failed);
 	return NF_DROP;
 }
 
@@ -1034,6 +1059,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 
 	if (unlikely(nf_ct_is_dying(ct))) {
 		nf_ct_add_to_dying_list(ct);
+		NF_CT_STAT_INC(net, insert_failed);
 		goto dying;
 	}
 
@@ -1075,11 +1101,9 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	return NF_ACCEPT;
 
 out:
-	nf_ct_add_to_dying_list(ct);
 	ret = nf_ct_resolve_clash(skb, h);
 dying:
 	nf_conntrack_double_unlock(hash, reply_hash);
-	NF_CT_STAT_INC(net, insert_failed);
 	local_bh_enable();
 	return ret;
 }
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC nf-next 4/4] netfilter: conntrack: allow insertion of duplicate/clashing entries
  2020-01-08 13:44 [RFC nf-next 0/4] netfilter: conntrack: allow insertion of clashing entries Florian Westphal
                   ` (2 preceding siblings ...)
  2020-01-08 13:44 ` [RFC nf-next 3/4] netfilter: conntrack: split resolve_clash function Florian Westphal
@ 2020-01-08 13:45 ` Florian Westphal
  2020-01-13 14:04 ` [RFC nf-next 0/4] netfilter: conntrack: allow insertion of clashing entries Florian Westphal
  2020-01-13 23:53 ` Florian Westphal
  5 siblings, 0 replies; 12+ messages in thread
From: Florian Westphal @ 2020-01-08 13:45 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

This patch further relaxes the need to drop an skb due to a clash with
an existing conntrack entry.

Current clash resolution handles the case where the clash occurs
between two identical entries (distinct nf_conn objects with same
tuples).

Also allow this collision type:
         Original                   Reply
10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.6:5353
10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.7:5353

This frequently happens with DNS resolvers that send A and AAAA queries
back-to-back when NAT rules are present that cause packets to get
different DNAT transformations applied, for example:

-m statistics --mode random ... -j DNAT --dnat-to 10.0.0.6:5353
-m statistics --mode random ... -j DNAT --dnat-to 10.0.0.7:5353

In this case the A or AAAA query is dropped which incurs a costly
delay during name resolution.

This change makes it so that when the 2nd colliding packet is received,
we insert the colliding conntrack anyway, provided the reply direction
is still unique.

The original direction gets inserted after the 'existing' entry, i.e.
it will only be found in the reply case.

Example:

CPU A:						CPU B:
1.  10.2.3.4:42 -> 10.8.8.8:53 (A)
2.                                              10.2.3.4:42 -> 10.8.8.8:53 (AAAA)
3.  Apply DNAT, reply changed to 10.0.0.6
4.                                              10.2.3.4:42 -> 10.8.8.8:53 (AAAA)
5.                                              Apply DNAT, reply changed to 10.0.0.7
6. confirm/commit to conntrack table, no collisions
7.                                              commit clashing entry

Reply comes in:

10.2.3.4:42 <- 10.0.0.6:5353 (A)
 -> Finds a conntrack, DNAT is reversed & packet forwarded to 10.2.3.4:42
10.2.3.4:42 <- 10.0.0.7:5353 (AAAA)
 -> Finds a conntrack, DNAT is reversed & packet forwarded to 10.2.3.4:42

In case of a retransmit from ORIGINAL dir, all further packets will get
the DNAT transformation to 10.0.0.6.

Also, the clashing entry will get a fixed timeout, we want to expire it
early of possible.

Example entries:
udp 29 src=192.168.7.10 dst=10.8.8.8 sport=7843 dport=53 src=127.0.0.1 dst=192.168.7.10 sport=1053 dport=7843 [ASSURED] use=1
udp  2 src=192.168.7.10 dst=10.8.8.8 sport=7843 dport=53 src=127.0.0.1 dst=192.168.7.10 sport=1054 dport=7843 [ASSURED] use=1

The 2nd entry has a fixed timeout and will always expire, even if it
continues to receive data.

I tried to come up with other solutions but this is the only one that
doesn't involve changes to the ruleset (the other solution is to make
sure that all packets from same origin get the same NAT transformation
e.g. via "-m cluster" instead of "-m statistics".

nft has the 'jhash' expression which can be used instead of 'numgen'.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_core.c | 87 ++++++++++++++++++++++++++++++-
 1 file changed, 85 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 2987b2d72b6b..97c00fdc7ecf 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -940,11 +940,82 @@ static int __nf_ct_resolve_clash(struct sk_buff *skb,
 	return NF_DROP;
 }
 
+/**
+ * nf_ct_resolve_clash_harder - attempt to insert clashing conntrack entry
+ *
+ * @skb: skb that causes the collision
+ * @hash: hash slot for original direction
+ * @hash_reply: hash slot for reply direction
+ *
+ * Called when origin or reply direction had a clash.
+ * The skb can be handled without packet drop provided the
+ * reply direction has no clash or the clashing and existing entries
+ * have the same tuples in both directions.
+ *
+ * Caller must hold conntrack table locks for both slots to prevent
+ * concurrent updates.
+ *
+ * Returns NF_DROP if the clash could not be handled.
+ */
+static int nf_ct_resolve_clash_harder(struct sk_buff *skb,
+				      u32 hash, u32 reply_hash)
+{
+	struct nf_conn *loser_ct = (struct nf_conn *)skb_nfct(skb);
+	const struct nf_conntrack_zone *zone;
+	struct nf_conntrack_tuple_hash *h;
+	struct hlist_nulls_node *n;
+	unsigned int len;
+	struct net *net;
+
+	zone = nf_ct_zone(loser_ct);
+	net = nf_ct_net(loser_ct);
+
+	/* Reply direction must never result in a clash, unless both origin
+	 * and reply tuples are identical.
+	 */
+	hlist_nulls_for_each_entry(h, n, &nf_conntrack_hash[reply_hash], hnnode) {
+		if (nf_ct_key_equal(h,
+				    &loser_ct->tuplehash[IP_CT_DIR_REPLY].tuple,
+				    zone, net)) {
+			/* Clash in reply direction: OK if origin is same too. */
+			return __nf_ct_resolve_clash(skb, h);
+		}
+	}
+
+	/* No clash in reply, so its clashing in original direction
+	 * (else, this function would not have been called).
+	 *
+	 * Also check hash chain length before we allow insertion.
+	 */
+	len = 0;
+	hlist_nulls_for_each_entry(h, n, &nf_conntrack_hash[hash], hnnode) {
+		if (len++ > 8u)
+			return NF_DROP;
+	}
+
+	/* We want the clashing entry to go away soon */
+	loser_ct->timeout = nfct_time_stamp + HZ * 3u;
+	loser_ct->status |= IPS_FIXED_TIMEOUT;
+
+	__nf_conntrack_insert_prepare(loser_ct);
+	/* ... use add_tail_rcu for ORIGINAL dir: we want lookups
+	 * to find the entry already in table, not loser_ct.
+	 */
+	hlist_nulls_add_tail_rcu(&loser_ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
+				 &nf_conntrack_hash[hash]);
+
+	hlist_nulls_add_head_rcu(&loser_ct->tuplehash[IP_CT_DIR_REPLY].hnnode,
+				 &nf_conntrack_hash[reply_hash]);
+	return NF_ACCEPT;
+}
+
 /**
  * nf_ct_resolve_clash - attempt to handle clash without packet drop
  *
  * @skb: skb that causes the clash
  * @h: tuplehash of the clashing entry already in table
+ * @hash: hash slot for original direction
+ * @hash_reply: hash slot for reply direction
  *
  * A conntrack entry can be inserted to the connection tracking table
  * if there is no existing entry with an identical tuple.
@@ -963,10 +1034,18 @@ static int __nf_ct_resolve_clash(struct sk_buff *skb,
  * exactly the same, only the to-be-confirmed conntrack entry is discarded
  * and @skb is associated with the conntrack entry already in the table.
  *
+ * Failing that, the new, unconfirmed conntrack is still added to the table
+ * provided that the collision only occurs in the ORIGINAL direction.
+ * The new entry will be added after the existing one in the hash list,
+ * so packets in the ORIGINAL direction will continue to match the existing
+ * entry.  The new entry will also have a fixed timeout so it expires --
+ * due to the collision, it will not see bidirectional traffic.
+ *
  * Returns NF_DROP if the clash could not be resolved.
  */
 static __cold noinline int
-nf_ct_resolve_clash(struct sk_buff *skb, struct nf_conntrack_tuple_hash *h)
+nf_ct_resolve_clash(struct sk_buff *skb, struct nf_conntrack_tuple_hash *h,
+		    u32 hash, u32 reply_hash)
 {
 	/* This is the conntrack entry already in hashes that won race. */
 	struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
@@ -987,6 +1066,10 @@ nf_ct_resolve_clash(struct sk_buff *skb, struct nf_conntrack_tuple_hash *h)
 	if (ret == NF_ACCEPT)
 		return ret;
 
+	ret = nf_ct_resolve_clash_harder(skb, hash, reply_hash);
+	if (ret == NF_ACCEPT)
+		return ret;
+
 drop:
 	nf_ct_add_to_dying_list(loser_ct);
 	NF_CT_STAT_INC(net, drop);
@@ -1101,7 +1184,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	return NF_ACCEPT;
 
 out:
-	ret = nf_ct_resolve_clash(skb, h);
+	ret = nf_ct_resolve_clash(skb, h, hash, reply_hash);
 dying:
 	nf_conntrack_double_unlock(hash, reply_hash);
 	local_bh_enable();
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC nf-next 0/4] netfilter: conntrack: allow insertion of clashing entries
  2020-01-08 13:44 [RFC nf-next 0/4] netfilter: conntrack: allow insertion of clashing entries Florian Westphal
                   ` (3 preceding siblings ...)
  2020-01-08 13:45 ` [RFC nf-next 4/4] netfilter: conntrack: allow insertion of duplicate/clashing entries Florian Westphal
@ 2020-01-13 14:04 ` Florian Westphal
  2020-01-13 23:53 ` Florian Westphal
  5 siblings, 0 replies; 12+ messages in thread
From: Florian Westphal @ 2020-01-13 14:04 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

Florian Westphal <fw@strlen.de> wrote:
> This series allows conntrack to insert a duplicate conntrack entry
> if the reply direction doesn't result in a clash with a different
> original connection.
> 
> Background:
> 
> kubernetes creates load-balancing rules for DNS using
> -m statistics, e.g.:
> -p udp --dport 53 -m statistics --mode random ... -j DNAT --to-destination x
> -p udp --dport 53 -m statistics --mode random ... -j DNAT --to-destination y
> 
> When the resolver sends an A and AAAA request back-to-back from
> different threads on the same socket, this has a high chance of a connection
> tracking clash at insertion time.
> 
> This in turn results in a drop of the clashing udp packet which then
> results in a 5 second DNS timeout.

I'd really like to get feedback for this patch set.

If its deemed unacceptable thats OK, at least I can then tell users they
must change their rulesets to make this work.

If someone has alternative ideas on how to resolve this I'd be
interested as well.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC nf-next 0/4] netfilter: conntrack: allow insertion of clashing entries
  2020-01-08 13:44 [RFC nf-next 0/4] netfilter: conntrack: allow insertion of clashing entries Florian Westphal
                   ` (4 preceding siblings ...)
  2020-01-13 14:04 ` [RFC nf-next 0/4] netfilter: conntrack: allow insertion of clashing entries Florian Westphal
@ 2020-01-13 23:53 ` Florian Westphal
  2020-01-14 21:14   ` Kadlecsik József
  2020-01-16 11:19   ` Pablo Neira Ayuso
  5 siblings, 2 replies; 12+ messages in thread
From: Florian Westphal @ 2020-01-13 23:53 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

Florian Westphal <fw@strlen.de> wrote:
> This entire series isn't nice but so far I did not find a better
> solution.

I did consider getting rid of the unconfirmed list, but this is also
problematic.

At allocation time we do not know what kind of NAT transformations
will be applied by the ruleset, i.e. we'd need another locking step to
move the entries to the right location in the hash table.

Same if the skb is dropped: we need to lock the conntrack table again to
delete the newly added entry -- this isn't needed right now because the
conntrack is only on the percpu unconfirmed list in this case.

This is also a problem because of conntrack events, we would have to
seperate insertion and notification, else we'd flood userspace for every
conntrack we create in case of a packet drop flood.

Other solutions are:
1. use a ruleset that assigns the same nat mapping for both A and AAAA
   requests, or,
2. steer all packets that might have this problem (i.e. udp dport 53) to
    the same cpu core.

Yet another solution would be a variation of this patch set:

1. Only add the reply direction to the table (i.e. conntrack -L won't show
   the duplicated entry).
2. Add a new conntrack flag for the duplicate that guarantees the
   conntrack is removed immediately when first reply packet comes in.
   This would also have the effect that the conntrack can never be
   assured, i.e. the "hidden duplicates" are always early-dropable if
   conntrack table gets full.
3. change event code to never report such duplicates to userspace.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC nf-next 0/4] netfilter: conntrack: allow insertion of clashing entries
  2020-01-13 23:53 ` Florian Westphal
@ 2020-01-14 21:14   ` Kadlecsik József
  2020-01-14 22:21     ` Florian Westphal
  2020-01-16 11:19   ` Pablo Neira Ayuso
  1 sibling, 1 reply; 12+ messages in thread
From: Kadlecsik József @ 2020-01-14 21:14 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

Hi Florian,

On Tue, 14 Jan 2020, Florian Westphal wrote:

> Florian Westphal <fw@strlen.de> wrote:
> > This entire series isn't nice but so far I did not find a better
> > solution.
> 
> I did consider getting rid of the unconfirmed list, but this is also
> problematic.
> 
> At allocation time we do not know what kind of NAT transformations
> will be applied by the ruleset, i.e. we'd need another locking step to
> move the entries to the right location in the hash table.
> 
> Same if the skb is dropped: we need to lock the conntrack table again to
> delete the newly added entry -- this isn't needed right now because the
> conntrack is only on the percpu unconfirmed list in this case.
> 
> This is also a problem because of conntrack events, we would have to
> seperate insertion and notification, else we'd flood userspace for every
> conntrack we create in case of a packet drop flood.
> 
> Other solutions are:
> 1. use a ruleset that assigns the same nat mapping for both A and AAAA
>    requests, or,
> 2. steer all packets that might have this problem (i.e. udp dport 53) to
>     the same cpu core.
> 
> Yet another solution would be a variation of this patch set:
> 
> 1. Only add the reply direction to the table (i.e. conntrack -L won't show
>    the duplicated entry).
> 2. Add a new conntrack flag for the duplicate that guarantees the
>    conntrack is removed immediately when first reply packet comes in.
>    This would also have the effect that the conntrack can never be
>    assured, i.e. the "hidden duplicates" are always early-dropable if
>    conntrack table gets full.
> 3. change event code to never report such duplicates to userspace.

Somehow my general feeling is that all proposed fixes in conntrack could 
in some cases break other non single-request - single-response UDP 
applications.

Reading about the kubernetes issue as far as I see

a. When the pods run glibc based systems, the issue could easily be
   fixed by configuring the real DNS server IP addresses in the pods
   resolv.conf files with "options single-request single-request-reopen" 
   enabled. 
b. When the pods run musl based systems, there's no such a solution
   because the main musl developer refused to implement the required
   RES_SNGLKUP and RES_SNGLKUPREOP options in musl.

However, I think there's a general already available solution in iptables: 
force the same DNAT mapping for the packets of the same socket by the 
HMARK target. Something like this:

-t raw -p udp --dport 53 -j HMARK --hmark-tuple src,sport \
	--hmark-mod 1 --hmark-offset 10 --hmark-rnd 0xdeafbeef
-t nat -p udp --dport 53 -m state --state NEW -m mark --mark 10 -j DNAT ..
-t nat -p udp --dport 53 -m state --state NEW -m mark --mark 11 -j DNAT ..

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC nf-next 0/4] netfilter: conntrack: allow insertion of clashing entries
  2020-01-14 21:14   ` Kadlecsik József
@ 2020-01-14 22:21     ` Florian Westphal
  2020-01-15  7:58       ` Kadlecsik József
  0 siblings, 1 reply; 12+ messages in thread
From: Florian Westphal @ 2020-01-14 22:21 UTC (permalink / raw)
  To: Kadlecsik József; +Cc: Florian Westphal, netfilter-devel

Kadlecsik József <kadlec@blackhole.kfki.hu> wrote:
> However, I think there's a general already available solution in iptables: 
> force the same DNAT mapping for the packets of the same socket by the 
> HMARK target. Something like this:
> 
> -t raw -p udp --dport 53 -j HMARK --hmark-tuple src,sport \
> 	--hmark-mod 1 --hmark-offset 10 --hmark-rnd 0xdeafbeef
> -t nat -p udp --dport 53 -m state --state NEW -m mark --mark 10 -j DNAT ..
> -t nat -p udp --dport 53 -m state --state NEW -m mark --mark 11 -j DNAT ..

Yes, HMARK and -m cluster both work, nft has jhash expression.
So we already have alternatives to provide consistent nat mappings.

I doubt that kubernetes will change their rulesets, however.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC nf-next 0/4] netfilter: conntrack: allow insertion of clashing entries
  2020-01-14 22:21     ` Florian Westphal
@ 2020-01-15  7:58       ` Kadlecsik József
  0 siblings, 0 replies; 12+ messages in thread
From: Kadlecsik József @ 2020-01-15  7:58 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 1440 bytes --]

On Tue, 14 Jan 2020, Florian Westphal wrote:

> Kadlecsik József <kadlec@blackhole.kfki.hu> wrote:
> > However, I think there's a general already available solution in iptables: 
> > force the same DNAT mapping for the packets of the same socket by the 
> > HMARK target. Something like this:
> > 
> > -t raw -p udp --dport 53 -j HMARK --hmark-tuple src,sport \
> > 	--hmark-mod 1 --hmark-offset 10 --hmark-rnd 0xdeafbeef
> > -t nat -p udp --dport 53 -m state --state NEW -m mark --mark 10 -j DNAT ..
> > -t nat -p udp --dport 53 -m state --state NEW -m mark --mark 11 -j DNAT ..
> 
> Yes, HMARK and -m cluster both work, nft has jhash expression.
> So we already have alternatives to provide consistent nat mappings.
> 
> I doubt that kubernetes will change their rulesets, however.

That'd be sad - those rules are surely not carved in stone...

[By the way, I'd go even further and leave out DNAT completely: put the 
real nameservers into resolv.conf and be done with it. musl connects them 
parallel anyway and glibc supports the rotate options for ages. What else 
would remain for DNAT? "Hide" the real IP addresses of the name servers? 
That's just pointless.]

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC nf-next 0/4] netfilter: conntrack: allow insertion of clashing entries
  2020-01-13 23:53 ` Florian Westphal
  2020-01-14 21:14   ` Kadlecsik József
@ 2020-01-16 11:19   ` Pablo Neira Ayuso
  2020-01-16 11:37     ` Florian Westphal
  1 sibling, 1 reply; 12+ messages in thread
From: Pablo Neira Ayuso @ 2020-01-16 11:19 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On Tue, Jan 14, 2020 at 12:53:09AM +0100, Florian Westphal wrote:
> Florian Westphal <fw@strlen.de> wrote:
> > This entire series isn't nice but so far I did not find a better
> > solution.
> 
> I did consider getting rid of the unconfirmed list, but this is also
> problematic.

Another proposal:

I think the percpu unconfirmed list should become a hashtable.

From resolve_normal_ct(), if __nf_conntrack_find_get() returns NULL,
this can fall back to make a rcu lookless lookup on the unconfirmed
hashtable.

From nf_nat_inet_fn(), grab a nat spinlock only if the conntrack is
unconfirmed (slow path) to make sure that the packet winning race to
enter nf_nat_inet_fn() takes the time to set up NAT properly.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC nf-next 0/4] netfilter: conntrack: allow insertion of clashing entries
  2020-01-16 11:19   ` Pablo Neira Ayuso
@ 2020-01-16 11:37     ` Florian Westphal
  0 siblings, 0 replies; 12+ messages in thread
From: Florian Westphal @ 2020-01-16 11:37 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter-devel

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Tue, Jan 14, 2020 at 12:53:09AM +0100, Florian Westphal wrote:
> > Florian Westphal <fw@strlen.de> wrote:
> > > This entire series isn't nice but so far I did not find a better
> > > solution.
> > 
> > I did consider getting rid of the unconfirmed list, but this is also
> > problematic.
> 
> Another proposal:
> 
> I think the percpu unconfirmed list should become a hashtable.
> 
> From resolve_normal_ct(), if __nf_conntrack_find_get() returns NULL,
> this can fall back to make a rcu lookless lookup on the unconfirmed
> hashtable.

Unconfirmed entries can't be processed in parallel.

I.e., we can detect the race (there is an unconfirmed entry in the
unconfirmed table matching original tuple), but we can't do anything about
it until that result has been confirmed.

We could allow processing unconfirmed entries in parallel but it will
require taking ct->lock in all places that add/change extensions.

And we would need to do so unconditionally.

At this point I'm considering to send patches to iptables and kernel
that document that -m statistics can't be used for udp NAT and
another patch to add a l4 hash mode to -m cluster, I don't like this
series either and all alternatives are even worse.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-01-16 11:37 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-08 13:44 [RFC nf-next 0/4] netfilter: conntrack: allow insertion of clashing entries Florian Westphal
2020-01-08 13:44 ` [RFC nf-next 1/4] netfilter: conntrack: remove two args from resolve_clash Florian Westphal
2020-01-08 13:44 ` [RFC nf-next 2/4] netfilter: conntrack: place confirm-bit setting in a helper Florian Westphal
2020-01-08 13:44 ` [RFC nf-next 3/4] netfilter: conntrack: split resolve_clash function Florian Westphal
2020-01-08 13:45 ` [RFC nf-next 4/4] netfilter: conntrack: allow insertion of duplicate/clashing entries Florian Westphal
2020-01-13 14:04 ` [RFC nf-next 0/4] netfilter: conntrack: allow insertion of clashing entries Florian Westphal
2020-01-13 23:53 ` Florian Westphal
2020-01-14 21:14   ` Kadlecsik József
2020-01-14 22:21     ` Florian Westphal
2020-01-15  7:58       ` Kadlecsik József
2020-01-16 11:19   ` Pablo Neira Ayuso
2020-01-16 11:37     ` Florian Westphal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).