All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v1 net] tcp: Refine SYN handling for PAWS.
@ 2023-03-27 23:06 Kuniyuki Iwashima
  2023-03-28  7:14 ` Jason Xing
  2023-03-28  8:13 ` Eric Dumazet
  0 siblings, 2 replies; 6+ messages in thread
From: Kuniyuki Iwashima @ 2023-03-27 23:06 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

Our Network Load Balancer (NLB) [0] has multiple nodes with different
IP addresses, and each node forwards TCP flows from clients to backend
targets.  NLB has an option to preserve the client's source IP address
and port when routing packets to backend targets.

When a client connects to two different NLB nodes, they may select the
same backend target.  Then, if the client has used the same source IP
and port, the two flows at the backend side will have the same 4-tuple.

While testing around such cases, I saw these sequences on the backend
target.

IP 10.0.0.215.60000 > 10.0.3.249.10000: Flags [S], seq 2819965599, win 62727, options [mss 8365,sackOK,TS val 1029816180 ecr 0,nop,wscale 7], length 0
IP 10.0.3.249.10000 > 10.0.0.215.60000: Flags [S.], seq 3040695044, ack 2819965600, win 62643, options [mss 8961,sackOK,TS val 1224784076 ecr 1029816180,nop,wscale 7], length 0
IP 10.0.0.215.60000 > 10.0.3.249.10000: Flags [.], ack 1, win 491, options [nop,nop,TS val 1029816181 ecr 1224784076], length 0
IP 10.0.0.215.60000 > 10.0.3.249.10000: Flags [S], seq 2681819307, win 62727, options [mss 8365,sackOK,TS val 572088282 ecr 0,nop,wscale 7], length 0
IP 10.0.3.249.10000 > 10.0.0.215.60000: Flags [.], ack 1, win 490, options [nop,nop,TS val 1224794914 ecr 1029816181,nop,nop,sack 1 {4156821004:4156821005}], length 0

It seems to be working correctly, but the last ACK was generated by
tcp_send_dupack() and PAWSEstab was increased.  This is because the
second connection has a smaller timestamp than the first one.

In this case, we should send a challenge ACK instead of a dup ACK and
increase the correct counter to rate-limit it properly.

Let's check the SYN bit after the PAWS tests to avoid adding unnecessary
overhead for most packets.

Link: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html [0]
Link: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-target-groups.html#client-ip-preservation [1]
Fixes: 0c24604b68fc ("tcp: implement RFC 5961 4.2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/tcp_input.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cc072d2cfcd8..89fca4c18530 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5714,6 +5714,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 	    tp->rx_opt.saw_tstamp &&
 	    tcp_paws_discard(sk, skb)) {
 		if (!th->rst) {
+			if (unlikely(th->syn))
+				goto syn_challenge;
 			NET_INC_STATS(sock_net(sk), LINUX_MIB_PAWSESTABREJECTED);
 			if (!tcp_oow_rate_limited(sock_net(sk), skb,
 						  LINUX_MIB_TCPACKSKIPPEDPAWS,
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v1 net] tcp: Refine SYN handling for PAWS.
  2023-03-27 23:06 [PATCH v1 net] tcp: Refine SYN handling for PAWS Kuniyuki Iwashima
@ 2023-03-28  7:14 ` Jason Xing
  2023-03-28  8:13 ` Eric Dumazet
  1 sibling, 0 replies; 6+ messages in thread
From: Jason Xing @ 2023-03-28  7:14 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Kuniyuki Iwashima, netdev

On Tue, Mar 28, 2023 at 7:18 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> Our Network Load Balancer (NLB) [0] has multiple nodes with different
> IP addresses, and each node forwards TCP flows from clients to backend
> targets.  NLB has an option to preserve the client's source IP address
> and port when routing packets to backend targets.
>
> When a client connects to two different NLB nodes, they may select the
> same backend target.  Then, if the client has used the same source IP
> and port, the two flows at the backend side will have the same 4-tuple.

It rarely happens based on my knowledge (at least no one reports such
an issue in my company), but indeed it can happen.

>
> While testing around such cases, I saw these sequences on the backend
> target.
>
> IP 10.0.0.215.60000 > 10.0.3.249.10000: Flags [S], seq 2819965599, win 62727, options [mss 8365,sackOK,TS val 1029816180 ecr 0,nop,wscale 7], length 0
> IP 10.0.3.249.10000 > 10.0.0.215.60000: Flags [S.], seq 3040695044, ack 2819965600, win 62643, options [mss 8961,sackOK,TS val 1224784076 ecr 1029816180,nop,wscale 7], length 0
> IP 10.0.0.215.60000 > 10.0.3.249.10000: Flags [.], ack 1, win 491, options [nop,nop,TS val 1029816181 ecr 1224784076], length 0
> IP 10.0.0.215.60000 > 10.0.3.249.10000: Flags [S], seq 2681819307, win 62727, options [mss 8365,sackOK,TS val 572088282 ecr 0,nop,wscale 7], length 0
> IP 10.0.3.249.10000 > 10.0.0.215.60000: Flags [.], ack 1, win 490, options [nop,nop,TS val 1224794914 ecr 1029816181,nop,nop,sack 1 {4156821004:4156821005}], length 0
>
> It seems to be working correctly, but the last ACK was generated by
> tcp_send_dupack() and PAWSEstab was increased.  This is because the
> second connection has a smaller timestamp than the first one.
>
> In this case, we should send a challenge ACK instead of a dup ACK and
> increase the correct counter to rate-limit it properly.
>
> Let's check the SYN bit after the PAWS tests to avoid adding unnecessary
> overhead for most packets.
>
> Link: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html [0]
> Link: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-target-groups.html#client-ip-preservation [1]
> Fixes: 0c24604b68fc ("tcp: implement RFC 5961 4.2")
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>

A long time ago, a similar case reported in 2012 was handled in commit
e37158991701(tcp: refine SYN handling in tcp_validate_incoming) by
Eric.
I believe we also can do this here during the paws check phrase.

It looks good to me, please feel free to add:
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

Thanks!

> ---
>  net/ipv4/tcp_input.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index cc072d2cfcd8..89fca4c18530 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5714,6 +5714,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
>             tp->rx_opt.saw_tstamp &&
>             tcp_paws_discard(sk, skb)) {
>                 if (!th->rst) {
> +                       if (unlikely(th->syn))
> +                               goto syn_challenge;
>                         NET_INC_STATS(sock_net(sk), LINUX_MIB_PAWSESTABREJECTED);
>                         if (!tcp_oow_rate_limited(sock_net(sk), skb,
>                                                   LINUX_MIB_TCPACKSKIPPEDPAWS,
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v1 net] tcp: Refine SYN handling for PAWS.
  2023-03-27 23:06 [PATCH v1 net] tcp: Refine SYN handling for PAWS Kuniyuki Iwashima
  2023-03-28  7:14 ` Jason Xing
@ 2023-03-28  8:13 ` Eric Dumazet
  2023-03-28 16:41   ` Kuniyuki Iwashima
  1 sibling, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2023-03-28  8:13 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Kuniyuki Iwashima, netdev

On Mon, Mar 27, 2023 at 4:06 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> Our Network Load Balancer (NLB) [0] has multiple nodes with different
> IP addresses, and each node forwards TCP flows from clients to backend
> targets.  NLB has an option to preserve the client's source IP address
> and port when routing packets to backend targets.
>
> When a client connects to two different NLB nodes, they may select the
> same backend target.  Then, if the client has used the same source IP
> and port, the two flows at the backend side will have the same 4-tuple.
>
> While testing around such cases, I saw these sequences on the backend
> target.
>
> IP 10.0.0.215.60000 > 10.0.3.249.10000: Flags [S], seq 2819965599, win 62727, options [mss 8365,sackOK,TS val 1029816180 ecr 0,nop,wscale 7], length 0
> IP 10.0.3.249.10000 > 10.0.0.215.60000: Flags [S.], seq 3040695044, ack 2819965600, win 62643, options [mss 8961,sackOK,TS val 1224784076 ecr 1029816180,nop,wscale 7], length 0
> IP 10.0.0.215.60000 > 10.0.3.249.10000: Flags [.], ack 1, win 491, options [nop,nop,TS val 1029816181 ecr 1224784076], length 0
> IP 10.0.0.215.60000 > 10.0.3.249.10000: Flags [S], seq 2681819307, win 62727, options [mss 8365,sackOK,TS val 572088282 ecr 0,nop,wscale 7], length 0
> IP 10.0.3.249.10000 > 10.0.0.215.60000: Flags [.], ack 1, win 490, options [nop,nop,TS val 1224794914 ecr 1029816181,nop,nop,sack 1 {4156821004:4156821005}], length 0
>
> It seems to be working correctly, but the last ACK was generated by
> tcp_send_dupack() and PAWSEstab was increased.  This is because the
> second connection has a smaller timestamp than the first one.
>
> In this case, we should send a challenge ACK instead of a dup ACK and
> increase the correct counter to rate-limit it properly.

OK, but this seems about the same thing to me. A challenge ACK is a dup ACK ?

It is not clear why it matters, because most probably both ACK make no
sense for the sender ?

>
> Let's check the SYN bit after the PAWS tests to avoid adding unnecessary
> overhead for most packets.
>
> Link: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html [0]
> Link: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-target-groups.html#client-ip-preservation [1]
> Fixes: 0c24604b68fc ("tcp: implement RFC 5961 4.2")

The core of the change was to not send an RST anymore.
I did not change part of the code which was not sending an RST :)



> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  net/ipv4/tcp_input.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index cc072d2cfcd8..89fca4c18530 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5714,6 +5714,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
>             tp->rx_opt.saw_tstamp &&
>             tcp_paws_discard(sk, skb)) {
>                 if (!th->rst) {
> +                       if (unlikely(th->syn))
> +                               goto syn_challenge;
>                         NET_INC_STATS(sock_net(sk), LINUX_MIB_PAWSESTABREJECTED);
>                         if (!tcp_oow_rate_limited(sock_net(sk), skb,
>                                                   LINUX_MIB_TCPACKSKIPPEDPAWS,
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v1 net] tcp: Refine SYN handling for PAWS.
  2023-03-28  8:13 ` Eric Dumazet
@ 2023-03-28 16:41   ` Kuniyuki Iwashima
  2023-03-28 17:48     ` Eric Dumazet
  0 siblings, 1 reply; 6+ messages in thread
From: Kuniyuki Iwashima @ 2023-03-28 16:41 UTC (permalink / raw)
  To: edumazet; +Cc: davem, kuba, kuni1840, kuniyu, netdev, pabeni

From:   Eric Dumazet <edumazet@google.com>
Date:   Tue, 28 Mar 2023 01:13:09 -0700
> On Mon, Mar 27, 2023 at 4:06=E2=80=AFPM Kuniyuki Iwashima <kuniyu@amazon.co=
> m> wrote:
> >
> > Our Network Load Balancer (NLB) [0] has multiple nodes with different
> > IP addresses, and each node forwards TCP flows from clients to backend
> > targets.  NLB has an option to preserve the client's source IP address
> > and port when routing packets to backend targets.
> >
> > When a client connects to two different NLB nodes, they may select the
> > same backend target.  Then, if the client has used the same source IP
> > and port, the two flows at the backend side will have the same 4-tuple.
> >
> > While testing around such cases, I saw these sequences on the backend
> > target.
> >
> > IP 10.0.0.215.60000 > 10.0.3.249.10000: Flags [S], seq 2819965599, win 62=
> 727, options [mss 8365,sackOK,TS val 1029816180 ecr 0,nop,wscale 7], length=
>  0
> > IP 10.0.3.249.10000 > 10.0.0.215.60000: Flags [S.], seq 3040695044, ack 2=
> 819965600, win 62643, options [mss 8961,sackOK,TS val 1224784076 ecr 102981=
> 6180,nop,wscale 7], length 0
> > IP 10.0.0.215.60000 > 10.0.3.249.10000: Flags [.], ack 1, win 491, option=
> s [nop,nop,TS val 1029816181 ecr 1224784076], length 0
> > IP 10.0.0.215.60000 > 10.0.3.249.10000: Flags [S], seq 2681819307, win 62=
> 727, options [mss 8365,sackOK,TS val 572088282 ecr 0,nop,wscale 7], length =
> 0
> > IP 10.0.3.249.10000 > 10.0.0.215.60000: Flags [.], ack 1, win 490, option=
> s [nop,nop,TS val 1224794914 ecr 1029816181,nop,nop,sack 1 {4156821004:4156=
> 821005}], length 0
> >
> > It seems to be working correctly, but the last ACK was generated by
> > tcp_send_dupack() and PAWSEstab was increased.  This is because the
> > second connection has a smaller timestamp than the first one.
> >
> > In this case, we should send a challenge ACK instead of a dup ACK and
> > increase the correct counter to rate-limit it properly.
> 
> OK, but this seems about the same thing to me. A challenge ACK is a dup ACK=
>  ?

Yes, I thought you distinguished them with the counter, and I just
followed it :)

e37158991701 ("tcp: refine SYN handling in tcp_validate_incoming")

> 
> It is not clear why it matters, because most probably both ACK make no
> sense for the sender ?

Yes, but it's tricky that which counter is increased depends on the
two timestamp.


> 
> >
> > Let's check the SYN bit after the PAWS tests to avoid adding unnecessary
> > overhead for most packets.
> >
> > Link: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/int=
> roduction.html [0]
> > Link: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/loa=
> d-balancer-target-groups.html#client-ip-preservation [1]
> > Fixes: 0c24604b68fc ("tcp: implement RFC 5961 4.2")
> 
> The core of the change was to not send an RST anymore.
> I did not change part of the code which was not sending an RST :)

I see.
Should I replace the tag with add 'CC: stable # backport ver', or
respin for net-next without the tag ?

Thanks,
Kuniyuki


> 
> 
> 
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >  net/ipv4/tcp_input.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index cc072d2cfcd8..89fca4c18530 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -5714,6 +5714,8 @@ static bool tcp_validate_incoming(struct sock *sk, =
> struct sk_buff *skb,
> >             tp->rx_opt.saw_tstamp &&
> >             tcp_paws_discard(sk, skb)) {
> >                 if (!th->rst) {
> > +                       if (unlikely(th->syn))
> > +                               goto syn_challenge;
> >                         NET_INC_STATS(sock_net(sk), LINUX_MIB_PAWSESTABRE=
> JECTED);
> >                         if (!tcp_oow_rate_limited(sock_net(sk), skb,
> >                                                   LINUX_MIB_TCPACKSKIPPED=
> PAWS,
> > --
> > 2.30.2
> >

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v1 net] tcp: Refine SYN handling for PAWS.
  2023-03-28 16:41   ` Kuniyuki Iwashima
@ 2023-03-28 17:48     ` Eric Dumazet
  2023-03-28 18:30       ` Kuniyuki Iwashima
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2023-03-28 17:48 UTC (permalink / raw)
  To: Kuniyuki Iwashima; +Cc: davem, kuba, kuni1840, netdev, pabeni

On Tue, Mar 28, 2023 at 6:41 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:

> I see.
> Should I replace the tag with add 'CC: stable # backport ver', or
> respin for net-next without the tag ?

Yes, net-next should be a better target I think, but no hard feelings
if this makes your life easier.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v1 net] tcp: Refine SYN handling for PAWS.
  2023-03-28 17:48     ` Eric Dumazet
@ 2023-03-28 18:30       ` Kuniyuki Iwashima
  0 siblings, 0 replies; 6+ messages in thread
From: Kuniyuki Iwashima @ 2023-03-28 18:30 UTC (permalink / raw)
  To: edumazet; +Cc: davem, kuba, kuni1840, kuniyu, netdev, pabeni

From:   Eric Dumazet <edumazet@google.com>
Date:   Tue, 28 Mar 2023 19:48:03 +0200
> On Tue, Mar 28, 2023 at 6:41=E2=80=AFPM Kuniyuki Iwashima <kuniyu@amazon.co=
> m> wrote:
> 
> > I see.
> > Should I replace the tag with add 'CC: stable # backport ver', or
> > respin for net-next without the tag ?
> 
> Yes, net-next should be a better target I think, but no hard feelings
> if this makes your life easier.

Sure, I'll post it for net-next.


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-03-28 18:30 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-27 23:06 [PATCH v1 net] tcp: Refine SYN handling for PAWS Kuniyuki Iwashima
2023-03-28  7:14 ` Jason Xing
2023-03-28  8:13 ` Eric Dumazet
2023-03-28 16:41   ` Kuniyuki Iwashima
2023-03-28 17:48     ` Eric Dumazet
2023-03-28 18:30       ` Kuniyuki Iwashima

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.