linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2] udp reuseport: fix packet of same flow hashed to different socket
@ 2016-06-08  3:40 Su Xuemin
  2016-06-08 15:43 ` Eric Dumazet
  0 siblings, 1 reply; 4+ messages in thread
From: Su Xuemin @ 2016-06-08  3:40 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev
  Cc: linux-kernel, suxm

From: "Su, Xuemin" <suxm@chinanetcenter.com>

There is a corner case in which udp packets belonging to a same
flow are hashed to different socket when hslot->count changes from 10
to 11:

1) When hslot->count <= 10, __udp_lib_lookup() searches udp_table->hash,
and always passes 'daddr' to udp_ehashfn().

2) When hslot->count > 10, __udp_lib_lookup() searches udp_table->hash2,
but may pass 'INADDR_ANY' to udp_ehashfn() if the sockets are bound to
INADDR_ANY instead of some specific addr.

That means when hslot->count changes from 10 to 11, the hash calculated by
udp_ehashfn() is also changed, and the udp packets belonging to a same
flow will be hashed to different socket.

This is easily reproduced:
1) Create 10 udp sockets and bind all of them to 0.0.0.0:40000.
2) From the same host send udp packets to 127.0.0.1:40000, record the
socket index which receives the packets.
3) Create 1 more udp socket and bind it to 0.0.0.0:44096. The number 44096
is 40000 + UDP_HASH_SIZE(4096), this makes the new socket put into the
same hslot as the aformentioned 10 sockets, and makes the hslot->count
change from 10 to 11.
4) From the same host send udp packets to 127.0.0.1:40000, and the socket
index which receives the packets will be different from the one received
in step 2.
This should not happen as the socket bound to 0.0.0.0:44096 should not
change the behavior of the sockets bound to 0.0.0.0:40000.

The fix here is that when searching udp_table->hash, if the socket
supports reuseport, pass inet_sk(sk)->inet_rcv_saddr to udp_ehashfn()
instead of daddr. When the sockets are bound to some specific addr,
inet_sk(sk)->inet_rcv_saddr should equal to daddr, and when the sockets
are bould to INADDR_ANY, this will pass INADDR_ANY to udp_ehashfn() as
what is done when searching udp_table->hash2.

It's the same case for IPv6, and this patch also fixes that.

Signed-off-by: Su, Xuemin <suxm@chinanetcenter.com>
---
The patch v1 does not fix the code in IPv6. Thank Eric Dumazet for
pointing that.
And I use this tree to generate this patch, hope it's correct:
  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

 net/ipv4/udp.c | 4 +++-
 net/ipv6/udp.c | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index d56c055..57c38f6 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -577,7 +577,9 @@ begin:
 		if (score > badness) {
 			reuseport = sk->sk_reuseport;
 			if (reuseport) {
-				hash = udp_ehashfn(net, daddr, hnum,
+				hash = udp_ehashfn(net,
+						   inet_sk(sk)->inet_rcv_saddr,
+						   hnum,
 						   saddr, sport);
 				result = reuseport_select_sock(sk, hash, skb,
 							sizeof(struct udphdr));
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 2da1896..41ca493 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -290,7 +290,9 @@ begin:
 		if (score > badness) {
 			reuseport = sk->sk_reuseport;
 			if (reuseport) {
-				hash = udp6_ehashfn(net, daddr, hnum,
+				hash = udp6_ehashfn(net,
+						    &sk->sk_v6_rcv_saddr,
+						    hnum,
 						    saddr, sport);
 				result = reuseport_select_sock(sk, hash, skb,
 							sizeof(struct udphdr));
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] udp reuseport: fix packet of same flow hashed to different socket
  2016-06-08  3:40 [PATCH v2] udp reuseport: fix packet of same flow hashed to different socket Su Xuemin
@ 2016-06-08 15:43 ` Eric Dumazet
  2016-06-09  1:43   ` Su Xuemin
  0 siblings, 1 reply; 4+ messages in thread
From: Eric Dumazet @ 2016-06-08 15:43 UTC (permalink / raw)
  To: Su Xuemin
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev, linux-kernel

On Wed, 2016-06-08 at 11:40 +0800, Su Xuemin wrote:
> From: "Su, Xuemin" <suxm@chinanetcenter.com>
> 
> There is a corner case in which udp packets belonging to a same
> flow are hashed to different socket when hslot->count changes from 10
> to 11:

> It's the same case for IPv6, and this patch also fixes that.
> 
> Signed-off-by: Su, Xuemin <suxm@chinanetcenter.com>
> ---
> The patch v1 does not fix the code in IPv6. Thank Eric Dumazet for
> pointing that.
> And I use this tree to generate this patch, hope it's correct:
>   git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> 

I am not convinced this is the right way to fix the issue.

Fact that you did not change udp4_lib_lookup2() is telling me something.


1) If host has 100 different IP, and 10 sockets bound to 0.0.0.0:53 
2) If we receive datagrams from SRCIP:srcport send to IP{1..100}:53
will all be hashed on same slot.

See the hash used in soreuseport logic as an equivalent of a 4-tuple
hash really, not a 3-tuple one.

I would try instead :

(Looks like we missed this when commit 1223c67c093 ("udp: fix for
unicast RX path optimization") was reviewed/merged.

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index d56c0559b477cb96ce78c9b9b7dacc3109594f3a..5847fe48f17d64bee32551e9dd42024a8e65fb10 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -563,7 +563,7 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 				goto begin;
 
 			result = udp4_lib_lookup2(net, saddr, sport,
-						  htonl(INADDR_ANY), hnum, dif,
+						  daddr, hnum, dif,
 						  hslot2, slot2, skb);
 		}
 		return result;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 2da1896af93496cc58762c67b4cd4b4f42924901..501b5563234d0921eff9f1e50155db76f27b3837 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -277,7 +277,7 @@ struct sock *__udp6_lib_lookup(struct net *net,
 				goto begin;
 
 			result = udp6_lib_lookup2(net, saddr, sport,
-						  &in6addr_any, hnum, dif,
+						  daddr, hnum, dif,
 						  hslot2, slot2, skb);
 		}
 		return result;

 

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] udp reuseport: fix packet of same flow hashed to different socket
  2016-06-08 15:43 ` Eric Dumazet
@ 2016-06-09  1:43   ` Su Xuemin
  2016-06-09  3:11     ` Eric Dumazet
  0 siblings, 1 reply; 4+ messages in thread
From: Su Xuemin @ 2016-06-09  1:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev, linux-kernel

On Wed 2016-06-08 at 08:43 -0700, Eric Dumazet wrote:
> I am not convinced this is the right way to fix the issue.
> 
> Fact that you did not change udp4_lib_lookup2() is telling me something.
> 
> 
> 1) If host has 100 different IP, and 10 sockets bound to 0.0.0.0:53 
> 2) If we receive datagrams from SRCIP:srcport send to IP{1..100}:53
> will all be hashed on same slot.
> 
> See the hash used in soreuseport logic as an equivalent of a 4-tuple
> hash really, not a 3-tuple one.
> 

This is my understanding of __udp4_lib_lookup(), see the comments below,
please kindly review it.
The problem of the current code is:
  In Path 1, we pass daddr to udp_ehashfn() inside udp4_lib_lookup2().
  In Path 2, we pass 0.0.0.0 to udp_ehashfn() inside udp4_lib_lookup2().
  In Path 3, we pass daddr to udp_ehashfn(), even if inet->inet_rcv_saddr
is 0.0.0.0.
  The hash value returned by udp_ehashfn() is used as a random seed to
select socket from the sockets of a hslot. In Path 2 and Path 3, this
hash value is different, so we will select different sockets.
  Pass inet->inet_rcv_saddr to udp_ehashfn() in Path 3 will make the
logic consistent in all these three Pathes.
--------
        ...
        if (hslot->count > 10) {
                /* Xuemin:
                 * for udptable->hash2[], sockets bound to specific ip and
                 * sockets bound to 0.0.0.0 may be placed in different hslot,
                 * so we may have to search two hslots. */
                ...
                /* Xuemin Path 1:
                 * for udptable->hash2[], firstly try to find the socket
                 * bound to the specific daddr. */
                result = udp4_lib_lookup2(net, saddr, sport,
                                          daddr, hnum, dif,
                                          hslot2, slot2, skb);
                if (!result) {
                        /* Xuemin Path 2:
                         * for udptable->hash2[], if we can not find the
                         * socket bound to the specific daddr, we then
                         * try to find the socket bound to 0.0.0.0. */
                        result = udp4_lib_lookup2(net, saddr, sport,
                                                  htonl(INADDR_ANY), hnum, dif,
                                                  hslot2, slot2, skb);
                }
                return result;
        }
        /* Xuemin:
         * for udptable->hash[], all sockets bound to the same
         * port(no matter they are bound to specific ip or to 0.0.0.0)
         * are placed in the same hslot, so we only need to search one
         * hslot. */
begin:
        result = NULL;
        badness = 0;
        sk_for_each_rcu(sk, &hslot->head) {
                score = compute_score(sk, net, saddr, hnum, sport,
                                      daddr, dport, dif);
                if (score > badness) {
                        reuseport = sk->sk_reuseport;
                        if (reuseport) {
                                /* Xuemin Path 3:
                                 * when inet_sk(sk)->inet_rcv_saddr is 0.0.0.0,
                                 * we should not pass daddr here. */
                                hash = udp_ehashfn(net, daddr, hnum,
                                                   saddr, sport);
                                result = reuseport_select_sock(sk, hash, skb,
                                                        sizeof(struct udphdr));

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] udp reuseport: fix packet of same flow hashed to different socket
  2016-06-09  1:43   ` Su Xuemin
@ 2016-06-09  3:11     ` Eric Dumazet
  0 siblings, 0 replies; 4+ messages in thread
From: Eric Dumazet @ 2016-06-09  3:11 UTC (permalink / raw)
  To: Su Xuemin
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev, linux-kernel

On Thu, 2016-06-09 at 09:43 +0800, Su Xuemin wrote:
> On Wed 2016-06-08 at 08:43 -0700, Eric Dumazet wrote:
> > I am not convinced this is the right way to fix the issue.
> > 
> > Fact that you did not change udp4_lib_lookup2() is telling me something.
> > 
> > 
> > 1) If host has 100 different IP, and 10 sockets bound to 0.0.0.0:53 
> > 2) If we receive datagrams from SRCIP:srcport send to IP{1..100}:53
> > will all be hashed on same slot.
> > 
> > See the hash used in soreuseport logic as an equivalent of a 4-tuple
> > hash really, not a 3-tuple one.
> > 
> 
> This is my understanding of __udp4_lib_lookup(), see the comments below,
> please kindly review it.
> The problem of the current code is:
>   In Path 1, we pass daddr to udp_ehashfn() inside udp4_lib_lookup2().
>   In Path 2, we pass 0.0.0.0 to udp_ehashfn() inside udp4_lib_lookup2().
>   In Path 3, we pass daddr to udp_ehashfn(), even if inet->inet_rcv_saddr
> is 0.0.0.0.

My patch always pass daddr, and never 0.0.0.0 which makes little sense.

If a socket is bound to 0.0.0.0, then daddr will match it anyway.

I wrote this code, and clearly I was wrong.

The only moment we should use 0.0.0.0 is to get the hash bucket, but
really the lookup should use all the available keys.

Your patch can not be backported to stable versions because socket is
not stable (SLAB_DESTROY_BY_RCU rules) meaning that
inet_sk(sk)->inet_rcv_saddr could contain garbage.

daddr on the other hand is stable, since it is on the caller stack or
incoming packet (IPv6) and can not change under us.

We do not want to compute a hash based on garbage.

Have you tried my patch ?

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2016-06-09  3:11 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-08  3:40 [PATCH v2] udp reuseport: fix packet of same flow hashed to different socket Su Xuemin
2016-06-08 15:43 ` Eric Dumazet
2016-06-09  1:43   ` Su Xuemin
2016-06-09  3:11     ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).