All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next 0/2] tcp: tcp_v4_err() changes
@ 2024-04-17 16:57 Eric Dumazet
  2024-04-17 16:57 ` [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err() Eric Dumazet
  2024-04-17 16:57 ` [PATCH net-next 2/2] tcp: no longer abort SYN_SENT when receiving some ICMP (II) Eric Dumazet
  0 siblings, 2 replies; 23+ messages in thread
From: Eric Dumazet @ 2024-04-17 16:57 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Neal Cardwell, Dragos Tatulea, eric.dumazet, Eric Dumazet

First patch fixes a bug in tcp_v4_err().

Fixing this old bug allows us to bring back a patch
that we had to revert. Hopefully linux TCP can better
meet RFC 1122 requirements.

Eric Dumazet (2):
  tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  tcp: no longer abort SYN_SENT when receiving some ICMP (II)

 net/ipv4/tcp_ipv4.c | 9 ++++++++-
 net/ipv6/tcp_ipv6.c | 9 ++++++---
 2 files changed, 14 insertions(+), 4 deletions(-)

-- 
2.44.0.683.g7961c838ac-goog


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  2024-04-17 16:57 [PATCH net-next 0/2] tcp: tcp_v4_err() changes Eric Dumazet
@ 2024-04-17 16:57 ` Eric Dumazet
  2024-04-17 17:08   ` Maciej Żenczykowski
                     ` (2 more replies)
  2024-04-17 16:57 ` [PATCH net-next 2/2] tcp: no longer abort SYN_SENT when receiving some ICMP (II) Eric Dumazet
  1 sibling, 3 replies; 23+ messages in thread
From: Eric Dumazet @ 2024-04-17 16:57 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Neal Cardwell, Dragos Tatulea, eric.dumazet,
	Eric Dumazet, Maciej Żenczykowski, Willem de Bruijn,
	Shachar Kagan

Blamed commit claimed in its changelog that the new functionality
was guarded by IP_RECVERR/IPV6_RECVERR :

    Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
    enable this feature, and that the error message is only queued
    while in SYN_SNT state.

This was true only for IPv6, because ipv6_icmp_error() has
the following check:

if (!inet6_test_bit(RECVERR6, sk))
    return;

Other callers check IP_RECVERR by themselves, it is unclear
if we could factorize these checks in ip_icmp_error()

For stable backports, I chose to add the missing check in tcp_v4_err()

We think this missing check was the root cause for commit
0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
some ICMP") breakage, leading to a revert.

Many thanks to Dragos Tatulea for conducting the investigations.

As Jakub said :

    The suspicion is that SSH sees the ICMP report on the socket error queue
    and tries to connect() again, but due to the patch the socket isn't
    disconnected, so it gets EALREADY, and throws its hands up...

    The error bubbles up to Vagrant which also becomes unhappy.

    Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?

Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Cc: Dragos Tatulea <dtatulea@nvidia.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Shachar Kagan <skagan@nvidia.com>
---
 net/ipv4/tcp_ipv4.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 88c83ac4212957f19efad0f967952d2502bdbc7f..a717db99972d977a64178d7ed1109325d64a6d51 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -602,7 +602,8 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
 		if (fastopen && !fastopen->sk)
 			break;
 
-		ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
+		if (inet_test_bit(RECVERR, sk))
+			ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
 
 		if (!sock_owned_by_user(sk)) {
 			WRITE_ONCE(sk->sk_err, err);
-- 
2.44.0.683.g7961c838ac-goog


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH net-next 2/2] tcp: no longer abort SYN_SENT when receiving some ICMP (II)
  2024-04-17 16:57 [PATCH net-next 0/2] tcp: tcp_v4_err() changes Eric Dumazet
  2024-04-17 16:57 ` [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err() Eric Dumazet
@ 2024-04-17 16:57 ` Eric Dumazet
  2024-04-18  3:24   ` Jason Xing
  1 sibling, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2024-04-17 16:57 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Neal Cardwell, Dragos Tatulea, eric.dumazet, Eric Dumazet

Notes:

 - A prior version of this patch in commit
   0a8de364ff7a ("tcp: no longer abort SYN_SENT when
   receiving some ICMP") had to be reverted.

 - We found the root cause, and fixed it in prior patch
   in the series.

 - Many thanks to Dragos Tatulea !

Currently, non fatal ICMP messages received on behalf
of SYN_SENT sockets do call tcp_ld_RTO_revert()
to implement RFC 6069, but immediately call tcp_done(),
thus aborting the connect() attempt.

This violates RFC 1122 following requirement:

4.2.3.9  ICMP Messages
...
          o    Destination Unreachable -- codes 0, 1, 5

                 Since these Unreachable messages indicate soft error
                 conditions, TCP MUST NOT abort the connection, and it
                 SHOULD make the information available to the
                 application.

This patch makes sure non 'fatal' ICMP[v6] messages do not
abort the connection attempt.

It enables RFC 6069 for SYN_SENT sockets as a result.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
---
 net/ipv4/tcp_ipv4.c | 6 ++++++
 net/ipv6/tcp_ipv6.c | 9 ++++++---
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a717db99972d977a64178d7ed1109325d64a6d51..4b50d09f84b9ae7fec98a71bedf39594ab85e5ea 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -482,6 +482,7 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
 	const int code = icmp_hdr(skb)->code;
 	struct sock *sk;
 	struct request_sock *fastopen;
+	bool harderr = false;
 	u32 seq, snd_una;
 	int err;
 	struct net *net = dev_net(skb->dev);
@@ -555,6 +556,7 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
 		goto out;
 	case ICMP_PARAMETERPROB:
 		err = EPROTO;
+		harderr = true;
 		break;
 	case ICMP_DEST_UNREACH:
 		if (code > NR_ICMP_UNREACH)
@@ -579,6 +581,7 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
 		}
 
 		err = icmp_err_convert[code].errno;
+		harderr = icmp_err_convert[code].fatal;
 		/* check if this ICMP message allows revert of backoff.
 		 * (see RFC 6069)
 		 */
@@ -605,6 +608,9 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
 		if (inet_test_bit(RECVERR, sk))
 			ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
 
+		if (!harderr)
+			break;
+
 		if (!sock_owned_by_user(sk)) {
 			WRITE_ONCE(sk->sk_err, err);
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index bb7c3caf4f8536dabdcb3dbe7c90aff9c8985c90..f31527f0a13dee9488ce72834d89524e83f61e5f 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -382,7 +382,7 @@ static int tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	struct tcp_sock *tp;
 	__u32 seq, snd_una;
 	struct sock *sk;
-	bool fatal;
+	bool harderr;
 	int err;
 
 	sk = __inet6_lookup_established(net, net->ipv4.tcp_death_row.hashinfo,
@@ -403,9 +403,9 @@ static int tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 		return 0;
 	}
 	seq = ntohl(th->seq);
-	fatal = icmpv6_err_convert(type, code, &err);
+	harderr = icmpv6_err_convert(type, code, &err);
 	if (sk->sk_state == TCP_NEW_SYN_RECV) {
-		tcp_req_err(sk, seq, fatal);
+		tcp_req_err(sk, seq, harderr);
 		return 0;
 	}
 
@@ -490,6 +490,9 @@ static int tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 
 		ipv6_icmp_error(sk, skb, err, th->dest, ntohl(info), (u8 *)th);
 
+		if (!harderr)
+			break;
+
 		if (!sock_owned_by_user(sk)) {
 			WRITE_ONCE(sk->sk_err, err);
 			sk_error_report(sk);		/* Wake people up to see the error (see connect in sock.c) */
-- 
2.44.0.683.g7961c838ac-goog


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  2024-04-17 16:57 ` [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err() Eric Dumazet
@ 2024-04-17 17:08   ` Maciej Żenczykowski
  2024-04-18  3:22   ` Jason Xing
  2024-04-18  8:02   ` Paolo Abeni
  2 siblings, 0 replies; 23+ messages in thread
From: Maciej Żenczykowski @ 2024-04-17 17:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, netdev,
	Neal Cardwell, Dragos Tatulea, eric.dumazet, Willem de Bruijn,
	Shachar Kagan

On Wed, Apr 17, 2024 at 9:58 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Blamed commit claimed in its changelog that the new functionality
> was guarded by IP_RECVERR/IPV6_RECVERR :
>
>     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
>     enable this feature, and that the error message is only queued
>     while in SYN_SNT state.
>
> This was true only for IPv6, because ipv6_icmp_error() has
> the following check:
>
> if (!inet6_test_bit(RECVERR6, sk))
>     return;
>
> Other callers check IP_RECVERR by themselves, it is unclear
> if we could factorize these checks in ip_icmp_error()
>
> For stable backports, I chose to add the missing check in tcp_v4_err()
>
> We think this missing check was the root cause for commit
> 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> some ICMP") breakage, leading to a revert.
>
> Many thanks to Dragos Tatulea for conducting the investigations.
>
> As Jakub said :
>
>     The suspicion is that SSH sees the ICMP report on the socket error queue
>     and tries to connect() again, but due to the patch the socket isn't
>     disconnected, so it gets EALREADY, and throws its hands up...
>
>     The error bubbles up to Vagrant which also becomes unhappy.
>
>     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
>
> Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> Cc: Dragos Tatulea <dtatulea@nvidia.com>
> Cc: Maciej Żenczykowski <maze@google.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Shachar Kagan <skagan@nvidia.com>
> ---
>  net/ipv4/tcp_ipv4.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 88c83ac4212957f19efad0f967952d2502bdbc7f..a717db99972d977a64178d7ed1109325d64a6d51 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -602,7 +602,8 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
>                 if (fastopen && !fastopen->sk)
>                         break;
>
> -               ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> +               if (inet_test_bit(RECVERR, sk))
> +                       ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
>
>                 if (!sock_owned_by_user(sk)) {
>                         WRITE_ONCE(sk->sk_err, err);
> --
> 2.44.0.683.g7961c838ac-goog
>

Reviewed-by: Maciej Żenczykowski <maze@google.com>

Makes sense to me.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  2024-04-17 16:57 ` [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err() Eric Dumazet
  2024-04-17 17:08   ` Maciej Żenczykowski
@ 2024-04-18  3:22   ` Jason Xing
  2024-04-18  6:45     ` Eric Dumazet
  2024-04-18  8:02   ` Paolo Abeni
  2 siblings, 1 reply; 23+ messages in thread
From: Jason Xing @ 2024-04-18  3:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, netdev,
	Neal Cardwell, Dragos Tatulea, eric.dumazet,
	Maciej Żenczykowski, Willem de Bruijn, Shachar Kagan

On Thu, Apr 18, 2024 at 12:59 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Blamed commit claimed in its changelog that the new functionality
> was guarded by IP_RECVERR/IPV6_RECVERR :
>
>     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
>     enable this feature, and that the error message is only queued
>     while in SYN_SNT state.
>
> This was true only for IPv6, because ipv6_icmp_error() has
> the following check:
>
> if (!inet6_test_bit(RECVERR6, sk))
>     return;
>
> Other callers check IP_RECVERR by themselves, it is unclear
> if we could factorize these checks in ip_icmp_error()
>
> For stable backports, I chose to add the missing check in tcp_v4_err()
>
> We think this missing check was the root cause for commit
> 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> some ICMP") breakage, leading to a revert.
>
> Many thanks to Dragos Tatulea for conducting the investigations.
>
> As Jakub said :
>
>     The suspicion is that SSH sees the ICMP report on the socket error queue
>     and tries to connect() again, but due to the patch the socket isn't
>     disconnected, so it gets EALREADY, and throws its hands up...
>
>     The error bubbles up to Vagrant which also becomes unhappy.
>
>     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
>
> Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> Cc: Dragos Tatulea <dtatulea@nvidia.com>
> Cc: Maciej Żenczykowski <maze@google.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Shachar Kagan <skagan@nvidia.com>

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

I wonder if we're supposed to move this check into ip_icmp_error()
like ipv6_icmp_error() does, because I notice one caller
rxrpc_encap_err_rcv() without checking RECVERR  bit reuses the ICMP
error logic which is introduced in commit b6c66c4324e7 ("rxrpc: Use
the core ICMP/ICMP6 parsers'')?

Or should it be a follow-up patch (moving it inside of
ip_icmp_error()) to handle the rxrpc case and also prevent future
misuse for other people?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 2/2] tcp: no longer abort SYN_SENT when receiving some ICMP (II)
  2024-04-17 16:57 ` [PATCH net-next 2/2] tcp: no longer abort SYN_SENT when receiving some ICMP (II) Eric Dumazet
@ 2024-04-18  3:24   ` Jason Xing
  0 siblings, 0 replies; 23+ messages in thread
From: Jason Xing @ 2024-04-18  3:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, netdev,
	Neal Cardwell, Dragos Tatulea, eric.dumazet

On Thu, Apr 18, 2024 at 12:59 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Notes:
>
>  - A prior version of this patch in commit
>    0a8de364ff7a ("tcp: no longer abort SYN_SENT when
>    receiving some ICMP") had to be reverted.
>
>  - We found the root cause, and fixed it in prior patch
>    in the series.
>
>  - Many thanks to Dragos Tatulea !
>
> Currently, non fatal ICMP messages received on behalf
> of SYN_SENT sockets do call tcp_ld_RTO_revert()
> to implement RFC 6069, but immediately call tcp_done(),
> thus aborting the connect() attempt.
>
> This violates RFC 1122 following requirement:
>
> 4.2.3.9  ICMP Messages
> ...
>           o    Destination Unreachable -- codes 0, 1, 5
>
>                  Since these Unreachable messages indicate soft error
>                  conditions, TCP MUST NOT abort the connection, and it
>                  SHOULD make the information available to the
>                  application.
>
> This patch makes sure non 'fatal' ICMP[v6] messages do not
> abort the connection attempt.
>
> It enables RFC 6069 for SYN_SENT sockets as a result.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Tested-by: Dragos Tatulea <dtatulea@nvidia.com>

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

Finally!!

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  2024-04-18  3:22   ` Jason Xing
@ 2024-04-18  6:45     ` Eric Dumazet
  2024-04-18  6:53       ` Jason Xing
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2024-04-18  6:45 UTC (permalink / raw)
  To: Jason Xing
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, netdev,
	Neal Cardwell, Dragos Tatulea, eric.dumazet,
	Maciej Żenczykowski, Willem de Bruijn, Shachar Kagan

On Thu, Apr 18, 2024 at 5:23 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, Apr 18, 2024 at 12:59 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > Blamed commit claimed in its changelog that the new functionality
> > was guarded by IP_RECVERR/IPV6_RECVERR :
> >
> >     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
> >     enable this feature, and that the error message is only queued
> >     while in SYN_SNT state.
> >
> > This was true only for IPv6, because ipv6_icmp_error() has
> > the following check:
> >
> > if (!inet6_test_bit(RECVERR6, sk))
> >     return;
> >
> > Other callers check IP_RECVERR by themselves, it is unclear
> > if we could factorize these checks in ip_icmp_error()
> >
> > For stable backports, I chose to add the missing check in tcp_v4_err()
> >
> > We think this missing check was the root cause for commit
> > 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> > some ICMP") breakage, leading to a revert.
> >
> > Many thanks to Dragos Tatulea for conducting the investigations.
> >
> > As Jakub said :
> >
> >     The suspicion is that SSH sees the ICMP report on the socket error queue
> >     and tries to connect() again, but due to the patch the socket isn't
> >     disconnected, so it gets EALREADY, and throws its hands up...
> >
> >     The error bubbles up to Vagrant which also becomes unhappy.
> >
> >     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
> >
> > Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> > Cc: Dragos Tatulea <dtatulea@nvidia.com>
> > Cc: Maciej Żenczykowski <maze@google.com>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Neal Cardwell <ncardwell@google.com>
> > Cc: Shachar Kagan <skagan@nvidia.com>
>
> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
>
> I wonder if we're supposed to move this check into ip_icmp_error()
> like ipv6_icmp_error() does, because I notice one caller
> rxrpc_encap_err_rcv() without checking RECVERR  bit reuses the ICMP
> error logic which is introduced in commit b6c66c4324e7 ("rxrpc: Use
> the core ICMP/ICMP6 parsers'')?

I tried to focus on the TCP issues, and to have a stable candidate for patch #1.

The refactoring can wait.

>
> Or should it be a follow-up patch (moving it inside of
> ip_icmp_error()) to handle the rxrpc case and also prevent future
> misuse for other people?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  2024-04-18  6:45     ` Eric Dumazet
@ 2024-04-18  6:53       ` Jason Xing
  0 siblings, 0 replies; 23+ messages in thread
From: Jason Xing @ 2024-04-18  6:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, netdev,
	Neal Cardwell, Dragos Tatulea, eric.dumazet,
	Maciej Żenczykowski, Willem de Bruijn, Shachar Kagan

On Thu, Apr 18, 2024 at 2:45 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Apr 18, 2024 at 5:23 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Thu, Apr 18, 2024 at 12:59 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > Blamed commit claimed in its changelog that the new functionality
> > > was guarded by IP_RECVERR/IPV6_RECVERR :
> > >
> > >     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
> > >     enable this feature, and that the error message is only queued
> > >     while in SYN_SNT state.
> > >
> > > This was true only for IPv6, because ipv6_icmp_error() has
> > > the following check:
> > >
> > > if (!inet6_test_bit(RECVERR6, sk))
> > >     return;
> > >
> > > Other callers check IP_RECVERR by themselves, it is unclear
> > > if we could factorize these checks in ip_icmp_error()
> > >
> > > For stable backports, I chose to add the missing check in tcp_v4_err()
> > >
> > > We think this missing check was the root cause for commit
> > > 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> > > some ICMP") breakage, leading to a revert.
> > >
> > > Many thanks to Dragos Tatulea for conducting the investigations.
> > >
> > > As Jakub said :
> > >
> > >     The suspicion is that SSH sees the ICMP report on the socket error queue
> > >     and tries to connect() again, but due to the patch the socket isn't
> > >     disconnected, so it gets EALREADY, and throws its hands up...
> > >
> > >     The error bubbles up to Vagrant which also becomes unhappy.
> > >
> > >     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
> > >
> > > Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> > > Cc: Dragos Tatulea <dtatulea@nvidia.com>
> > > Cc: Maciej Żenczykowski <maze@google.com>
> > > Cc: Willem de Bruijn <willemb@google.com>
> > > Cc: Neal Cardwell <ncardwell@google.com>
> > > Cc: Shachar Kagan <skagan@nvidia.com>
> >
> > Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
> >
> > I wonder if we're supposed to move this check into ip_icmp_error()
> > like ipv6_icmp_error() does, because I notice one caller
> > rxrpc_encap_err_rcv() without checking RECVERR  bit reuses the ICMP
> > error logic which is introduced in commit b6c66c4324e7 ("rxrpc: Use
> > the core ICMP/ICMP6 parsers'')?
>
> I tried to focus on the TCP issues, and to have a stable candidate for patch #1.
>
> The refactoring can wait.

Got it. It's clear.

After this patch is applied, I can adjust a little bit (only by moving
it into ip_icmp_error()).

Thanks,
Jason

>
> >
> > Or should it be a follow-up patch (moving it inside of
> > ip_icmp_error()) to handle the rxrpc case and also prevent future
> > misuse for other people?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  2024-04-17 16:57 ` [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err() Eric Dumazet
  2024-04-17 17:08   ` Maciej Żenczykowski
  2024-04-18  3:22   ` Jason Xing
@ 2024-04-18  8:02   ` Paolo Abeni
  2024-04-18  8:03     ` Eric Dumazet
  2 siblings, 1 reply; 23+ messages in thread
From: Paolo Abeni @ 2024-04-18  8:02 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski
  Cc: netdev, Neal Cardwell, Dragos Tatulea, eric.dumazet,
	Maciej Żenczykowski, Willem de Bruijn, Shachar Kagan

Hi,

On Wed, 2024-04-17 at 16:57 +0000, Eric Dumazet wrote:
> Blamed commit claimed in its changelog that the new functionality
> was guarded by IP_RECVERR/IPV6_RECVERR :
> 
>     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
>     enable this feature, and that the error message is only queued
>     while in SYN_SNT state.
> 
> This was true only for IPv6, because ipv6_icmp_error() has
> the following check:
> 
> if (!inet6_test_bit(RECVERR6, sk))
>     return;
> 
> Other callers check IP_RECVERR by themselves, it is unclear
> if we could factorize these checks in ip_icmp_error()
> 
> For stable backports, I chose to add the missing check in tcp_v4_err()
> 
> We think this missing check was the root cause for commit
> 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> some ICMP") breakage, leading to a revert.
> 
> Many thanks to Dragos Tatulea for conducting the investigations.
> 
> As Jakub said :
> 
>     The suspicion is that SSH sees the ICMP report on the socket error queue
>     and tries to connect() again, but due to the patch the socket isn't
>     disconnected, so it gets EALREADY, and throws its hands up...
> 
>     The error bubbles up to Vagrant which also becomes unhappy.
> 
>     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
> 
> Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> Cc: Dragos Tatulea <dtatulea@nvidia.com>
> Cc: Maciej Żenczykowski <maze@google.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Shachar Kagan <skagan@nvidia.com>
> ---
>  net/ipv4/tcp_ipv4.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 88c83ac4212957f19efad0f967952d2502bdbc7f..a717db99972d977a64178d7ed1109325d64a6d51 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -602,7 +602,8 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
>  		if (fastopen && !fastopen->sk)
>  			break;
>  
> -		ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> +		if (inet_test_bit(RECVERR, sk))
> +			ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
>  
>  		if (!sock_owned_by_user(sk)) {
>  			WRITE_ONCE(sk->sk_err, err);

We have a fcnal-test.sh self-test failure:

https://netdev.bots.linux.dev/contest.html?branch=net-next-2024-04-18--06-00&test=fcnal-test-sh

that I suspect are related to this patch (or the following one): the
test case creates a TCP connection on loopback and this is the only
patchseries touching the related code, included in the relevant patch
burst.

Could you please have a look?

Thanks!

Paolo


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  2024-04-18  8:02   ` Paolo Abeni
@ 2024-04-18  8:03     ` Eric Dumazet
  2024-04-18  9:26       ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2024-04-18  8:03 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: David S . Miller, Jakub Kicinski, netdev, Neal Cardwell,
	Dragos Tatulea, eric.dumazet, Maciej Żenczykowski,
	Willem de Bruijn, Shachar Kagan

On Thu, Apr 18, 2024 at 10:02 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hi,
>
> On Wed, 2024-04-17 at 16:57 +0000, Eric Dumazet wrote:
> > Blamed commit claimed in its changelog that the new functionality
> > was guarded by IP_RECVERR/IPV6_RECVERR :
> >
> >     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
> >     enable this feature, and that the error message is only queued
> >     while in SYN_SNT state.
> >
> > This was true only for IPv6, because ipv6_icmp_error() has
> > the following check:
> >
> > if (!inet6_test_bit(RECVERR6, sk))
> >     return;
> >
> > Other callers check IP_RECVERR by themselves, it is unclear
> > if we could factorize these checks in ip_icmp_error()
> >
> > For stable backports, I chose to add the missing check in tcp_v4_err()
> >
> > We think this missing check was the root cause for commit
> > 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> > some ICMP") breakage, leading to a revert.
> >
> > Many thanks to Dragos Tatulea for conducting the investigations.
> >
> > As Jakub said :
> >
> >     The suspicion is that SSH sees the ICMP report on the socket error queue
> >     and tries to connect() again, but due to the patch the socket isn't
> >     disconnected, so it gets EALREADY, and throws its hands up...
> >
> >     The error bubbles up to Vagrant which also becomes unhappy.
> >
> >     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
> >
> > Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> > Cc: Dragos Tatulea <dtatulea@nvidia.com>
> > Cc: Maciej Żenczykowski <maze@google.com>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Neal Cardwell <ncardwell@google.com>
> > Cc: Shachar Kagan <skagan@nvidia.com>
> > ---
> >  net/ipv4/tcp_ipv4.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index 88c83ac4212957f19efad0f967952d2502bdbc7f..a717db99972d977a64178d7ed1109325d64a6d51 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -602,7 +602,8 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
> >               if (fastopen && !fastopen->sk)
> >                       break;
> >
> > -             ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > +             if (inet_test_bit(RECVERR, sk))
> > +                     ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> >
> >               if (!sock_owned_by_user(sk)) {
> >                       WRITE_ONCE(sk->sk_err, err);
>
> We have a fcnal-test.sh self-test failure:
>
> https://netdev.bots.linux.dev/contest.html?branch=net-next-2024-04-18--06-00&test=fcnal-test-sh
>
> that I suspect are related to this patch (or the following one): the
> test case creates a TCP connection on loopback and this is the only
> patchseries touching the related code, included in the relevant patch
> burst.
>
> Could you please have a look?

Sure, thanks Paolo !

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  2024-04-18  8:03     ` Eric Dumazet
@ 2024-04-18  9:26       ` Eric Dumazet
  2024-04-18  9:58         ` Paolo Abeni
  2024-04-18 17:56         ` David Ahern
  0 siblings, 2 replies; 23+ messages in thread
From: Eric Dumazet @ 2024-04-18  9:26 UTC (permalink / raw)
  To: Paolo Abeni, David Ahern
  Cc: David S . Miller, Jakub Kicinski, netdev, Neal Cardwell,
	Dragos Tatulea, eric.dumazet, Maciej Żenczykowski,
	Willem de Bruijn, Shachar Kagan

On Thu, Apr 18, 2024 at 10:03 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Apr 18, 2024 at 10:02 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > Hi,
> >
> > On Wed, 2024-04-17 at 16:57 +0000, Eric Dumazet wrote:
> > > Blamed commit claimed in its changelog that the new functionality
> > > was guarded by IP_RECVERR/IPV6_RECVERR :
> > >
> > >     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
> > >     enable this feature, and that the error message is only queued
> > >     while in SYN_SNT state.
> > >
> > > This was true only for IPv6, because ipv6_icmp_error() has
> > > the following check:
> > >
> > > if (!inet6_test_bit(RECVERR6, sk))
> > >     return;
> > >
> > > Other callers check IP_RECVERR by themselves, it is unclear
> > > if we could factorize these checks in ip_icmp_error()
> > >
> > > For stable backports, I chose to add the missing check in tcp_v4_err()
> > >
> > > We think this missing check was the root cause for commit
> > > 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> > > some ICMP") breakage, leading to a revert.
> > >
> > > Many thanks to Dragos Tatulea for conducting the investigations.
> > >
> > > As Jakub said :
> > >
> > >     The suspicion is that SSH sees the ICMP report on the socket error queue
> > >     and tries to connect() again, but due to the patch the socket isn't
> > >     disconnected, so it gets EALREADY, and throws its hands up...
> > >
> > >     The error bubbles up to Vagrant which also becomes unhappy.
> > >
> > >     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
> > >
> > > Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> > > Cc: Dragos Tatulea <dtatulea@nvidia.com>
> > > Cc: Maciej Żenczykowski <maze@google.com>
> > > Cc: Willem de Bruijn <willemb@google.com>
> > > Cc: Neal Cardwell <ncardwell@google.com>
> > > Cc: Shachar Kagan <skagan@nvidia.com>
> > > ---
> > >  net/ipv4/tcp_ipv4.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index 88c83ac4212957f19efad0f967952d2502bdbc7f..a717db99972d977a64178d7ed1109325d64a6d51 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -602,7 +602,8 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
> > >               if (fastopen && !fastopen->sk)
> > >                       break;
> > >
> > > -             ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > > +             if (inet_test_bit(RECVERR, sk))
> > > +                     ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > >
> > >               if (!sock_owned_by_user(sk)) {
> > >                       WRITE_ONCE(sk->sk_err, err);
> >
> > We have a fcnal-test.sh self-test failure:
> >
> > https://netdev.bots.linux.dev/contest.html?branch=net-next-2024-04-18--06-00&test=fcnal-test-sh
> >
> > that I suspect are related to this patch (or the following one): the
> > test case creates a TCP connection on loopback and this is the only
> > patchseries touching the related code, included in the relevant patch
> > burst.
> >
> > Could you please have a look?
>
> Sure, thanks Paolo !

First patch is fine, I see no failure from fcnal-test.sh (as I would expect)

For the second one, I am not familiar enough with this very slow test
suite (all these "sleep 1" ... oh well)

I guess "failing tests" depended on TCP connect() to immediately abort
on one ICMP message,
depending on old kernel behavior.

I do not know how to launch a subset of the tests, and trace these.

"./fcnal-test.sh -t ipv4_tcp" alone takes more than 9 minutes [1] in a
VM running a non debug kernel :/

David, do you have an idea how to proceed ?

Thanks.

[1]
Tests passed: 134
Tests failed:   0

real 9m33.085s
user 0m40.159s
sys 0m30.098s

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  2024-04-18  9:26       ` Eric Dumazet
@ 2024-04-18  9:58         ` Paolo Abeni
  2024-04-18 10:15           ` Eric Dumazet
  2024-04-18 17:56         ` David Ahern
  1 sibling, 1 reply; 23+ messages in thread
From: Paolo Abeni @ 2024-04-18  9:58 UTC (permalink / raw)
  To: Eric Dumazet, David Ahern
  Cc: David S . Miller, Jakub Kicinski, netdev, Neal Cardwell,
	Dragos Tatulea, eric.dumazet, Maciej Żenczykowski,
	Willem de Bruijn, Shachar Kagan

On Thu, 2024-04-18 at 11:26 +0200, Eric Dumazet wrote:
> On Thu, Apr 18, 2024 at 10:03 AM Eric Dumazet <edumazet@google.com> wrote:
> > 
> > On Thu, Apr 18, 2024 at 10:02 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > 
> > > Hi,
> > > 
> > > On Wed, 2024-04-17 at 16:57 +0000, Eric Dumazet wrote:
> > > > Blamed commit claimed in its changelog that the new functionality
> > > > was guarded by IP_RECVERR/IPV6_RECVERR :
> > > > 
> > > >     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
> > > >     enable this feature, and that the error message is only queued
> > > >     while in SYN_SNT state.
> > > > 
> > > > This was true only for IPv6, because ipv6_icmp_error() has
> > > > the following check:
> > > > 
> > > > if (!inet6_test_bit(RECVERR6, sk))
> > > >     return;
> > > > 
> > > > Other callers check IP_RECVERR by themselves, it is unclear
> > > > if we could factorize these checks in ip_icmp_error()
> > > > 
> > > > For stable backports, I chose to add the missing check in tcp_v4_err()
> > > > 
> > > > We think this missing check was the root cause for commit
> > > > 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> > > > some ICMP") breakage, leading to a revert.
> > > > 
> > > > Many thanks to Dragos Tatulea for conducting the investigations.
> > > > 
> > > > As Jakub said :
> > > > 
> > > >     The suspicion is that SSH sees the ICMP report on the socket error queue
> > > >     and tries to connect() again, but due to the patch the socket isn't
> > > >     disconnected, so it gets EALREADY, and throws its hands up...
> > > > 
> > > >     The error bubbles up to Vagrant which also becomes unhappy.
> > > > 
> > > >     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
> > > > 
> > > > Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> > > > Cc: Dragos Tatulea <dtatulea@nvidia.com>
> > > > Cc: Maciej Żenczykowski <maze@google.com>
> > > > Cc: Willem de Bruijn <willemb@google.com>
> > > > Cc: Neal Cardwell <ncardwell@google.com>
> > > > Cc: Shachar Kagan <skagan@nvidia.com>
> > > > ---
> > > >  net/ipv4/tcp_ipv4.c | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > > index 88c83ac4212957f19efad0f967952d2502bdbc7f..a717db99972d977a64178d7ed1109325d64a6d51 100644
> > > > --- a/net/ipv4/tcp_ipv4.c
> > > > +++ b/net/ipv4/tcp_ipv4.c
> > > > @@ -602,7 +602,8 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
> > > >               if (fastopen && !fastopen->sk)
> > > >                       break;
> > > > 
> > > > -             ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > > > +             if (inet_test_bit(RECVERR, sk))
> > > > +                     ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > > > 
> > > >               if (!sock_owned_by_user(sk)) {
> > > >                       WRITE_ONCE(sk->sk_err, err);
> > > 
> > > We have a fcnal-test.sh self-test failure:
> > > 
> > > https://netdev.bots.linux.dev/contest.html?branch=net-next-2024-04-18--06-00&test=fcnal-test-sh
> > > 
> > > that I suspect are related to this patch (or the following one): the
> > > test case creates a TCP connection on loopback and this is the only
> > > patchseries touching the related code, included in the relevant patch
> > > burst.
> > > 
> > > Could you please have a look?
> > 
> > Sure, thanks Paolo !
> 
> First patch is fine, I see no failure from fcnal-test.sh (as I would expect)
> 
> For the second one, I am not familiar enough with this very slow test
> suite (all these "sleep 1" ... oh well)

@David, some of them could be replaced with loopy_wait calls

> I guess "failing tests" depended on TCP connect() to immediately abort
> on one ICMP message,
> depending on old kernel behavior.
> 
> I do not know how to launch a subset of the tests, and trace these.
> 
> "./fcnal-test.sh -t ipv4_tcp" alone takes more than 9 minutes [1] in a
> VM running a non debug kernel :/
> 
> David, do you have an idea how to proceed ?

One very dumb thing I do in that cases is commenting out the other
tests, something alike (completely untested!):
---
diff --git a/tools/testing/selftests/net/fcnal-test.sh b/tools/testing/selftests/net/fcnal-test.sh
index 386ebd829df5..494932aa99b2 100755
--- a/tools/testing/selftests/net/fcnal-test.sh
+++ b/tools/testing/selftests/net/fcnal-test.sh
@@ -1186,6 +1186,7 @@ ipv4_tcp_novrf()
 {
 	local a
 
+if false; then
 	#
 	# server tests
 	#
@@ -1271,6 +1272,7 @@ ipv4_tcp_novrf()
 		log_test_addr ${a} $? 1 "Device server, unbound client, local connection"
 	done
 
+fi
 	a=${NSA_IP}
 	log_start
 	run_cmd nettest -s &
@@ -1487,12 +1489,14 @@ ipv4_tcp()
 	set_sysctl net.ipv4.tcp_l3mdev_accept=0
 	ipv4_tcp_novrf
 	log_subsection "tcp_l3mdev_accept enabled"
+if false; then
 	set_sysctl net.ipv4.tcp_l3mdev_accept=1
 	ipv4_tcp_novrf
 
 	log_subsection "With VRF"
 	setup "yes"
 	ipv4_tcp_vrf
+fi
 }
 
 ################################################################################


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  2024-04-18  9:58         ` Paolo Abeni
@ 2024-04-18 10:15           ` Eric Dumazet
  2024-04-18 10:22             ` Eric Dumazet
  2024-04-18 17:46             ` David Ahern
  0 siblings, 2 replies; 23+ messages in thread
From: Eric Dumazet @ 2024-04-18 10:15 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: David Ahern, David S . Miller, Jakub Kicinski, netdev,
	Neal Cardwell, Dragos Tatulea, eric.dumazet,
	Maciej Żenczykowski, Willem de Bruijn, Shachar Kagan

On Thu, Apr 18, 2024 at 11:58 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Thu, 2024-04-18 at 11:26 +0200, Eric Dumazet wrote:
> > On Thu, Apr 18, 2024 at 10:03 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Thu, Apr 18, 2024 at 10:02 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Wed, 2024-04-17 at 16:57 +0000, Eric Dumazet wrote:
> > > > > Blamed commit claimed in its changelog that the new functionality
> > > > > was guarded by IP_RECVERR/IPV6_RECVERR :
> > > > >
> > > > >     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
> > > > >     enable this feature, and that the error message is only queued
> > > > >     while in SYN_SNT state.
> > > > >
> > > > > This was true only for IPv6, because ipv6_icmp_error() has
> > > > > the following check:
> > > > >
> > > > > if (!inet6_test_bit(RECVERR6, sk))
> > > > >     return;
> > > > >
> > > > > Other callers check IP_RECVERR by themselves, it is unclear
> > > > > if we could factorize these checks in ip_icmp_error()
> > > > >
> > > > > For stable backports, I chose to add the missing check in tcp_v4_err()
> > > > >
> > > > > We think this missing check was the root cause for commit
> > > > > 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> > > > > some ICMP") breakage, leading to a revert.
> > > > >
> > > > > Many thanks to Dragos Tatulea for conducting the investigations.
> > > > >
> > > > > As Jakub said :
> > > > >
> > > > >     The suspicion is that SSH sees the ICMP report on the socket error queue
> > > > >     and tries to connect() again, but due to the patch the socket isn't
> > > > >     disconnected, so it gets EALREADY, and throws its hands up...
> > > > >
> > > > >     The error bubbles up to Vagrant which also becomes unhappy.
> > > > >
> > > > >     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
> > > > >
> > > > > Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > > Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> > > > > Cc: Dragos Tatulea <dtatulea@nvidia.com>
> > > > > Cc: Maciej Żenczykowski <maze@google.com>
> > > > > Cc: Willem de Bruijn <willemb@google.com>
> > > > > Cc: Neal Cardwell <ncardwell@google.com>
> > > > > Cc: Shachar Kagan <skagan@nvidia.com>
> > > > > ---
> > > > >  net/ipv4/tcp_ipv4.c | 3 ++-
> > > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > > > index 88c83ac4212957f19efad0f967952d2502bdbc7f..a717db99972d977a64178d7ed1109325d64a6d51 100644
> > > > > --- a/net/ipv4/tcp_ipv4.c
> > > > > +++ b/net/ipv4/tcp_ipv4.c
> > > > > @@ -602,7 +602,8 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
> > > > >               if (fastopen && !fastopen->sk)
> > > > >                       break;
> > > > >
> > > > > -             ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > > > > +             if (inet_test_bit(RECVERR, sk))
> > > > > +                     ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > > > >
> > > > >               if (!sock_owned_by_user(sk)) {
> > > > >                       WRITE_ONCE(sk->sk_err, err);
> > > >
> > > > We have a fcnal-test.sh self-test failure:
> > > >
> > > > https://netdev.bots.linux.dev/contest.html?branch=net-next-2024-04-18--06-00&test=fcnal-test-sh
> > > >
> > > > that I suspect are related to this patch (or the following one): the
> > > > test case creates a TCP connection on loopback and this is the only
> > > > patchseries touching the related code, included in the relevant patch
> > > > burst.
> > > >
> > > > Could you please have a look?
> > >
> > > Sure, thanks Paolo !
> >
> > First patch is fine, I see no failure from fcnal-test.sh (as I would expect)
> >
> > For the second one, I am not familiar enough with this very slow test
> > suite (all these "sleep 1" ... oh well)
>
> @David, some of them could be replaced with loopy_wait calls
>
> > I guess "failing tests" depended on TCP connect() to immediately abort
> > on one ICMP message,
> > depending on old kernel behavior.
> >
> > I do not know how to launch a subset of the tests, and trace these.
> >
> > "./fcnal-test.sh -t ipv4_tcp" alone takes more than 9 minutes [1] in a
> > VM running a non debug kernel :/
> >
> > David, do you have an idea how to proceed ?
>
> One very dumb thing I do in that cases is commenting out the other
> tests, something alike (completely untested!):
> ---
> diff --git a/tools/testing/selftests/net/fcnal-test.sh b/tools/testing/selftests/net/fcnal-test.sh
> index 386ebd829df5..494932aa99b2 100755
> --- a/tools/testing/selftests/net/fcnal-test.sh
> +++ b/tools/testing/selftests/net/fcnal-test.sh
> @@ -1186,6 +1186,7 @@ ipv4_tcp_novrf()
>  {
>         local a
>
> +if false; then
>         #
>         # server tests
>         #
> @@ -1271,6 +1272,7 @@ ipv4_tcp_novrf()
>                 log_test_addr ${a} $? 1 "Device server, unbound client, local connection"
>         done
>
> +fi
>         a=${NSA_IP}
>         log_start
>         run_cmd nettest -s &
> @@ -1487,12 +1489,14 @@ ipv4_tcp()
>         set_sysctl net.ipv4.tcp_l3mdev_accept=0
>         ipv4_tcp_novrf
>         log_subsection "tcp_l3mdev_accept enabled"
> +if false; then
>         set_sysctl net.ipv4.tcp_l3mdev_accept=1
>         ipv4_tcp_novrf
>
>         log_subsection "With VRF"
>         setup "yes"
>         ipv4_tcp_vrf
> +fi
>  }

Thanks Paolo

I found that the following patch is fixing the issue for me.

diff --git a/tools/testing/selftests/net/nettest.c
b/tools/testing/selftests/net/nettest.c
index cd8a580974480212b45d86f35293b77f3d033473..ff25e53024ef6d4101f251c8a8a5e936e44e280f
100644
--- a/tools/testing/selftests/net/nettest.c
+++ b/tools/testing/selftests/net/nettest.c
@@ -1744,6 +1744,7 @@ static int connectsock(void *addr, socklen_t
alen, struct sock_args *args)
        if (args->bind_test_only)
                goto out;

+       set_recv_attr(sd, args->version);
        if (connect(sd, addr, alen) < 0) {
                if (errno != EINPROGRESS) {
                        log_err_errno("Failed to connect to remote host");

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  2024-04-18 10:15           ` Eric Dumazet
@ 2024-04-18 10:22             ` Eric Dumazet
  2024-04-18 10:36               ` Eric Dumazet
  2024-04-18 17:46             ` David Ahern
  1 sibling, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2024-04-18 10:22 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: David Ahern, David S . Miller, Jakub Kicinski, netdev,
	Neal Cardwell, Dragos Tatulea, eric.dumazet,
	Maciej Żenczykowski, Willem de Bruijn, Shachar Kagan

On Thu, Apr 18, 2024 at 12:15 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Apr 18, 2024 at 11:58 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On Thu, 2024-04-18 at 11:26 +0200, Eric Dumazet wrote:
> > > On Thu, Apr 18, 2024 at 10:03 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Thu, Apr 18, 2024 at 10:02 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On Wed, 2024-04-17 at 16:57 +0000, Eric Dumazet wrote:
> > > > > > Blamed commit claimed in its changelog that the new functionality
> > > > > > was guarded by IP_RECVERR/IPV6_RECVERR :
> > > > > >
> > > > > >     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
> > > > > >     enable this feature, and that the error message is only queued
> > > > > >     while in SYN_SNT state.
> > > > > >
> > > > > > This was true only for IPv6, because ipv6_icmp_error() has
> > > > > > the following check:
> > > > > >
> > > > > > if (!inet6_test_bit(RECVERR6, sk))
> > > > > >     return;
> > > > > >
> > > > > > Other callers check IP_RECVERR by themselves, it is unclear
> > > > > > if we could factorize these checks in ip_icmp_error()
> > > > > >
> > > > > > For stable backports, I chose to add the missing check in tcp_v4_err()
> > > > > >
> > > > > > We think this missing check was the root cause for commit
> > > > > > 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> > > > > > some ICMP") breakage, leading to a revert.
> > > > > >
> > > > > > Many thanks to Dragos Tatulea for conducting the investigations.
> > > > > >
> > > > > > As Jakub said :
> > > > > >
> > > > > >     The suspicion is that SSH sees the ICMP report on the socket error queue
> > > > > >     and tries to connect() again, but due to the patch the socket isn't
> > > > > >     disconnected, so it gets EALREADY, and throws its hands up...
> > > > > >
> > > > > >     The error bubbles up to Vagrant which also becomes unhappy.
> > > > > >
> > > > > >     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
> > > > > >
> > > > > > Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> > > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > > > Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> > > > > > Cc: Dragos Tatulea <dtatulea@nvidia.com>
> > > > > > Cc: Maciej Żenczykowski <maze@google.com>
> > > > > > Cc: Willem de Bruijn <willemb@google.com>
> > > > > > Cc: Neal Cardwell <ncardwell@google.com>
> > > > > > Cc: Shachar Kagan <skagan@nvidia.com>
> > > > > > ---
> > > > > >  net/ipv4/tcp_ipv4.c | 3 ++-
> > > > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > > > > index 88c83ac4212957f19efad0f967952d2502bdbc7f..a717db99972d977a64178d7ed1109325d64a6d51 100644
> > > > > > --- a/net/ipv4/tcp_ipv4.c
> > > > > > +++ b/net/ipv4/tcp_ipv4.c
> > > > > > @@ -602,7 +602,8 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
> > > > > >               if (fastopen && !fastopen->sk)
> > > > > >                       break;
> > > > > >
> > > > > > -             ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > > > > > +             if (inet_test_bit(RECVERR, sk))
> > > > > > +                     ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > > > > >
> > > > > >               if (!sock_owned_by_user(sk)) {
> > > > > >                       WRITE_ONCE(sk->sk_err, err);
> > > > >
> > > > > We have a fcnal-test.sh self-test failure:
> > > > >
> > > > > https://netdev.bots.linux.dev/contest.html?branch=net-next-2024-04-18--06-00&test=fcnal-test-sh
> > > > >
> > > > > that I suspect are related to this patch (or the following one): the
> > > > > test case creates a TCP connection on loopback and this is the only
> > > > > patchseries touching the related code, included in the relevant patch
> > > > > burst.
> > > > >
> > > > > Could you please have a look?
> > > >
> > > > Sure, thanks Paolo !
> > >
> > > First patch is fine, I see no failure from fcnal-test.sh (as I would expect)
> > >
> > > For the second one, I am not familiar enough with this very slow test
> > > suite (all these "sleep 1" ... oh well)
> >
> > @David, some of them could be replaced with loopy_wait calls
> >
> > > I guess "failing tests" depended on TCP connect() to immediately abort
> > > on one ICMP message,
> > > depending on old kernel behavior.
> > >
> > > I do not know how to launch a subset of the tests, and trace these.
> > >
> > > "./fcnal-test.sh -t ipv4_tcp" alone takes more than 9 minutes [1] in a
> > > VM running a non debug kernel :/
> > >
> > > David, do you have an idea how to proceed ?
> >
> > One very dumb thing I do in that cases is commenting out the other
> > tests, something alike (completely untested!):
> > ---
> > diff --git a/tools/testing/selftests/net/fcnal-test.sh b/tools/testing/selftests/net/fcnal-test.sh
> > index 386ebd829df5..494932aa99b2 100755
> > --- a/tools/testing/selftests/net/fcnal-test.sh
> > +++ b/tools/testing/selftests/net/fcnal-test.sh
> > @@ -1186,6 +1186,7 @@ ipv4_tcp_novrf()
> >  {
> >         local a
> >
> > +if false; then
> >         #
> >         # server tests
> >         #
> > @@ -1271,6 +1272,7 @@ ipv4_tcp_novrf()
> >                 log_test_addr ${a} $? 1 "Device server, unbound client, local connection"
> >         done
> >
> > +fi
> >         a=${NSA_IP}
> >         log_start
> >         run_cmd nettest -s &
> > @@ -1487,12 +1489,14 @@ ipv4_tcp()
> >         set_sysctl net.ipv4.tcp_l3mdev_accept=0
> >         ipv4_tcp_novrf
> >         log_subsection "tcp_l3mdev_accept enabled"
> > +if false; then
> >         set_sysctl net.ipv4.tcp_l3mdev_accept=1
> >         ipv4_tcp_novrf
> >
> >         log_subsection "With VRF"
> >         setup "yes"
> >         ipv4_tcp_vrf
> > +fi
> >  }
>
> Thanks Paolo
>
> I found that the following patch is fixing the issue for me.
>
> diff --git a/tools/testing/selftests/net/nettest.c
> b/tools/testing/selftests/net/nettest.c
> index cd8a580974480212b45d86f35293b77f3d033473..ff25e53024ef6d4101f251c8a8a5e936e44e280f
> 100644
> --- a/tools/testing/selftests/net/nettest.c
> +++ b/tools/testing/selftests/net/nettest.c
> @@ -1744,6 +1744,7 @@ static int connectsock(void *addr, socklen_t
> alen, struct sock_args *args)
>         if (args->bind_test_only)
>                 goto out;
>
> +       set_recv_attr(sd, args->version);
>         if (connect(sd, addr, alen) < 0) {
>                 if (errno != EINPROGRESS) {
>                         log_err_errno("Failed to connect to remote host");

When tracing nettest we now have EHOSTUNREACH

3343  setsockopt(3, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0 <0.000210>
3343  setsockopt(3, SOL_SOCKET, SO_BINDTODEVICE, "eth1\0", 5) = 0 <0.000170>
3343  setsockopt(3, SOL_IP, IP_PKTINFO, [1], 4) = 0 <0.000161>
3343  setsockopt(3, SOL_IP, IP_RECVERR, [1], 4) = 0 <0.000181>
3343  connect(3, {sa_family=AF_INET, sin_port=htons(12345),
sin_addr=inet_addr("172.16.2.1")}, 16) = -1 EINPROGRESS (Operation now
in progress) <0.000874>
3343  pselect6(1024, NULL, [3], NULL, {tv_sec=5, tv_nsec=0}, NULL) = 1
(out [3], left {tv_sec=1, tv_nsec=930762080}) <3.069673>
3343  getsockopt(3, SOL_SOCKET, SO_ERROR, [EHOSTUNREACH], [4]) = 0 <0.000270>

As mentioned in net/ipv4/icmp.c :
 RFC 1122: 3.2.2.1 States that NET_UNREACH, HOST_UNREACH and SR_FAILED
MUST be considered 'transient errs'.

Maybe another way to fix nettest would be to change wait_for_connect()
to pass a non NULL fdset in 4th argument of select()

select(FD_SETSIZE, NULL, &wfd, NULL /* here */, tv);

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  2024-04-18 10:22             ` Eric Dumazet
@ 2024-04-18 10:36               ` Eric Dumazet
  0 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2024-04-18 10:36 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: David Ahern, David S . Miller, Jakub Kicinski, netdev,
	Neal Cardwell, Dragos Tatulea, eric.dumazet,
	Maciej Żenczykowski, Willem de Bruijn, Shachar Kagan

On Thu, Apr 18, 2024 at 12:22 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Apr 18, 2024 at 12:15 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, Apr 18, 2024 at 11:58 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > >
> > > On Thu, 2024-04-18 at 11:26 +0200, Eric Dumazet wrote:
> > > > On Thu, Apr 18, 2024 at 10:03 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > On Thu, Apr 18, 2024 at 10:02 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > On Wed, 2024-04-17 at 16:57 +0000, Eric Dumazet wrote:
> > > > > > > Blamed commit claimed in its changelog that the new functionality
> > > > > > > was guarded by IP_RECVERR/IPV6_RECVERR :
> > > > > > >
> > > > > > >     Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
> > > > > > >     enable this feature, and that the error message is only queued
> > > > > > >     while in SYN_SNT state.
> > > > > > >
> > > > > > > This was true only for IPv6, because ipv6_icmp_error() has
> > > > > > > the following check:
> > > > > > >
> > > > > > > if (!inet6_test_bit(RECVERR6, sk))
> > > > > > >     return;
> > > > > > >
> > > > > > > Other callers check IP_RECVERR by themselves, it is unclear
> > > > > > > if we could factorize these checks in ip_icmp_error()
> > > > > > >
> > > > > > > For stable backports, I chose to add the missing check in tcp_v4_err()
> > > > > > >
> > > > > > > We think this missing check was the root cause for commit
> > > > > > > 0a8de364ff7a ("tcp: no longer abort SYN_SENT when receiving
> > > > > > > some ICMP") breakage, leading to a revert.
> > > > > > >
> > > > > > > Many thanks to Dragos Tatulea for conducting the investigations.
> > > > > > >
> > > > > > > As Jakub said :
> > > > > > >
> > > > > > >     The suspicion is that SSH sees the ICMP report on the socket error queue
> > > > > > >     and tries to connect() again, but due to the patch the socket isn't
> > > > > > >     disconnected, so it gets EALREADY, and throws its hands up...
> > > > > > >
> > > > > > >     The error bubbles up to Vagrant which also becomes unhappy.
> > > > > > >
> > > > > > >     Can we skip the call to ip_icmp_error() for non-fatal ICMP errors?
> > > > > > >
> > > > > > > Fixes: 45af29ca761c ("tcp: allow traceroute -Mtcp for unpriv users")
> > > > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > > > > Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
> > > > > > > Cc: Dragos Tatulea <dtatulea@nvidia.com>
> > > > > > > Cc: Maciej Żenczykowski <maze@google.com>
> > > > > > > Cc: Willem de Bruijn <willemb@google.com>
> > > > > > > Cc: Neal Cardwell <ncardwell@google.com>
> > > > > > > Cc: Shachar Kagan <skagan@nvidia.com>
> > > > > > > ---
> > > > > > >  net/ipv4/tcp_ipv4.c | 3 ++-
> > > > > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > > > > > index 88c83ac4212957f19efad0f967952d2502bdbc7f..a717db99972d977a64178d7ed1109325d64a6d51 100644
> > > > > > > --- a/net/ipv4/tcp_ipv4.c
> > > > > > > +++ b/net/ipv4/tcp_ipv4.c
> > > > > > > @@ -602,7 +602,8 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
> > > > > > >               if (fastopen && !fastopen->sk)
> > > > > > >                       break;
> > > > > > >
> > > > > > > -             ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > > > > > > +             if (inet_test_bit(RECVERR, sk))
> > > > > > > +                     ip_icmp_error(sk, skb, err, th->dest, info, (u8 *)th);
> > > > > > >
> > > > > > >               if (!sock_owned_by_user(sk)) {
> > > > > > >                       WRITE_ONCE(sk->sk_err, err);
> > > > > >
> > > > > > We have a fcnal-test.sh self-test failure:
> > > > > >
> > > > > > https://netdev.bots.linux.dev/contest.html?branch=net-next-2024-04-18--06-00&test=fcnal-test-sh
> > > > > >
> > > > > > that I suspect are related to this patch (or the following one): the
> > > > > > test case creates a TCP connection on loopback and this is the only
> > > > > > patchseries touching the related code, included in the relevant patch
> > > > > > burst.
> > > > > >
> > > > > > Could you please have a look?
> > > > >
> > > > > Sure, thanks Paolo !
> > > >
> > > > First patch is fine, I see no failure from fcnal-test.sh (as I would expect)
> > > >
> > > > For the second one, I am not familiar enough with this very slow test
> > > > suite (all these "sleep 1" ... oh well)
> > >
> > > @David, some of them could be replaced with loopy_wait calls
> > >
> > > > I guess "failing tests" depended on TCP connect() to immediately abort
> > > > on one ICMP message,
> > > > depending on old kernel behavior.
> > > >
> > > > I do not know how to launch a subset of the tests, and trace these.
> > > >
> > > > "./fcnal-test.sh -t ipv4_tcp" alone takes more than 9 minutes [1] in a
> > > > VM running a non debug kernel :/
> > > >
> > > > David, do you have an idea how to proceed ?
> > >
> > > One very dumb thing I do in that cases is commenting out the other
> > > tests, something alike (completely untested!):
> > > ---
> > > diff --git a/tools/testing/selftests/net/fcnal-test.sh b/tools/testing/selftests/net/fcnal-test.sh
> > > index 386ebd829df5..494932aa99b2 100755
> > > --- a/tools/testing/selftests/net/fcnal-test.sh
> > > +++ b/tools/testing/selftests/net/fcnal-test.sh
> > > @@ -1186,6 +1186,7 @@ ipv4_tcp_novrf()
> > >  {
> > >         local a
> > >
> > > +if false; then
> > >         #
> > >         # server tests
> > >         #
> > > @@ -1271,6 +1272,7 @@ ipv4_tcp_novrf()
> > >                 log_test_addr ${a} $? 1 "Device server, unbound client, local connection"
> > >         done
> > >
> > > +fi
> > >         a=${NSA_IP}
> > >         log_start
> > >         run_cmd nettest -s &
> > > @@ -1487,12 +1489,14 @@ ipv4_tcp()
> > >         set_sysctl net.ipv4.tcp_l3mdev_accept=0
> > >         ipv4_tcp_novrf
> > >         log_subsection "tcp_l3mdev_accept enabled"
> > > +if false; then
> > >         set_sysctl net.ipv4.tcp_l3mdev_accept=1
> > >         ipv4_tcp_novrf
> > >
> > >         log_subsection "With VRF"
> > >         setup "yes"
> > >         ipv4_tcp_vrf
> > > +fi
> > >  }
> >
> > Thanks Paolo
> >
> > I found that the following patch is fixing the issue for me.
> >
> > diff --git a/tools/testing/selftests/net/nettest.c
> > b/tools/testing/selftests/net/nettest.c
> > index cd8a580974480212b45d86f35293b77f3d033473..ff25e53024ef6d4101f251c8a8a5e936e44e280f
> > 100644
> > --- a/tools/testing/selftests/net/nettest.c
> > +++ b/tools/testing/selftests/net/nettest.c
> > @@ -1744,6 +1744,7 @@ static int connectsock(void *addr, socklen_t
> > alen, struct sock_args *args)
> >         if (args->bind_test_only)
> >                 goto out;
> >
> > +       set_recv_attr(sd, args->version);
> >         if (connect(sd, addr, alen) < 0) {
> >                 if (errno != EINPROGRESS) {
> >                         log_err_errno("Failed to connect to remote host");
>
> When tracing nettest we now have EHOSTUNREACH
>
> 3343  setsockopt(3, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0 <0.000210>
> 3343  setsockopt(3, SOL_SOCKET, SO_BINDTODEVICE, "eth1\0", 5) = 0 <0.000170>
> 3343  setsockopt(3, SOL_IP, IP_PKTINFO, [1], 4) = 0 <0.000161>
> 3343  setsockopt(3, SOL_IP, IP_RECVERR, [1], 4) = 0 <0.000181>
> 3343  connect(3, {sa_family=AF_INET, sin_port=htons(12345),
> sin_addr=inet_addr("172.16.2.1")}, 16) = -1 EINPROGRESS (Operation now
> in progress) <0.000874>
> 3343  pselect6(1024, NULL, [3], NULL, {tv_sec=5, tv_nsec=0}, NULL) = 1
> (out [3], left {tv_sec=1, tv_nsec=930762080}) <3.069673>
> 3343  getsockopt(3, SOL_SOCKET, SO_ERROR, [EHOSTUNREACH], [4]) = 0 <0.000270>
>
> As mentioned in net/ipv4/icmp.c :
>  RFC 1122: 3.2.2.1 States that NET_UNREACH, HOST_UNREACH and SR_FAILED
> MUST be considered 'transient errs'.
>
> Maybe another way to fix nettest would be to change wait_for_connect()
> to pass a non NULL fdset in 4th argument of select()
>
> select(FD_SETSIZE, NULL, &wfd, NULL /* here */, tv);

This change in wait_for_connect() does not help.

I am guessing set_recv_attr(sd, args->version); is what we need.

I am running all ./fcnal-test.sh tests to make sure everything is green.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  2024-04-18 10:15           ` Eric Dumazet
  2024-04-18 10:22             ` Eric Dumazet
@ 2024-04-18 17:46             ` David Ahern
  2024-04-18 17:47               ` Eric Dumazet
  1 sibling, 1 reply; 23+ messages in thread
From: David Ahern @ 2024-04-18 17:46 UTC (permalink / raw)
  To: Eric Dumazet, Paolo Abeni
  Cc: David S . Miller, Jakub Kicinski, netdev, Neal Cardwell,
	Dragos Tatulea, eric.dumazet, Maciej Żenczykowski,
	Willem de Bruijn, Shachar Kagan

On 4/18/24 4:15 AM, Eric Dumazet wrote:
> 
> Thanks Paolo
> 
> I found that the following patch is fixing the issue for me.
> 
> diff --git a/tools/testing/selftests/net/nettest.c
> b/tools/testing/selftests/net/nettest.c
> index cd8a580974480212b45d86f35293b77f3d033473..ff25e53024ef6d4101f251c8a8a5e936e44e280f
> 100644
> --- a/tools/testing/selftests/net/nettest.c
> +++ b/tools/testing/selftests/net/nettest.c
> @@ -1744,6 +1744,7 @@ static int connectsock(void *addr, socklen_t
> alen, struct sock_args *args)
>         if (args->bind_test_only)
>                 goto out;
> 
> +       set_recv_attr(sd, args->version);
>         if (connect(sd, addr, alen) < 0) {
>                 if (errno != EINPROGRESS) {
>                         log_err_errno("Failed to connect to remote host");

You have a kernel patch that makes a test fail, and your solution is
changing userspace? The tests are examples of userspace applications and
how they can use APIs, so if the patch breaks a test it is by definition
breaking userspace which is not allowed.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  2024-04-18 17:46             ` David Ahern
@ 2024-04-18 17:47               ` Eric Dumazet
  2024-04-18 18:02                 ` David Ahern
  2024-04-18 18:09                 ` Jakub Kicinski
  0 siblings, 2 replies; 23+ messages in thread
From: Eric Dumazet @ 2024-04-18 17:47 UTC (permalink / raw)
  To: David Ahern
  Cc: Paolo Abeni, David S . Miller, Jakub Kicinski, netdev,
	Neal Cardwell, Dragos Tatulea, eric.dumazet,
	Maciej Żenczykowski, Willem de Bruijn, Shachar Kagan

On Thu, Apr 18, 2024 at 7:46 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 4/18/24 4:15 AM, Eric Dumazet wrote:
> >
> > Thanks Paolo
> >
> > I found that the following patch is fixing the issue for me.
> >
> > diff --git a/tools/testing/selftests/net/nettest.c
> > b/tools/testing/selftests/net/nettest.c
> > index cd8a580974480212b45d86f35293b77f3d033473..ff25e53024ef6d4101f251c8a8a5e936e44e280f
> > 100644
> > --- a/tools/testing/selftests/net/nettest.c
> > +++ b/tools/testing/selftests/net/nettest.c
> > @@ -1744,6 +1744,7 @@ static int connectsock(void *addr, socklen_t
> > alen, struct sock_args *args)
> >         if (args->bind_test_only)
> >                 goto out;
> >
> > +       set_recv_attr(sd, args->version);
> >         if (connect(sd, addr, alen) < 0) {
> >                 if (errno != EINPROGRESS) {
> >                         log_err_errno("Failed to connect to remote host");
>
> You have a kernel patch that makes a test fail, and your solution is
> changing userspace? The tests are examples of userspace applications and
> how they can use APIs, so if the patch breaks a test it is by definition
> breaking userspace which is not allowed.

I think the userspace program relied on a bug added in linux in 2020

Jakub, I will stop trying to push the patches, this is a lost battle.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  2024-04-18  9:26       ` Eric Dumazet
  2024-04-18  9:58         ` Paolo Abeni
@ 2024-04-18 17:56         ` David Ahern
  1 sibling, 0 replies; 23+ messages in thread
From: David Ahern @ 2024-04-18 17:56 UTC (permalink / raw)
  To: Eric Dumazet, Paolo Abeni
  Cc: David S . Miller, Jakub Kicinski, netdev, Neal Cardwell,
	Dragos Tatulea, eric.dumazet, Maciej Żenczykowski,
	Willem de Bruijn, Shachar Kagan

On 4/18/24 3:26 AM, Eric Dumazet wrote:
> For the second one, I am not familiar enough with this very slow test
> suite (all these "sleep 1" ... oh well)
> 
> I guess "failing tests" depended on TCP connect() to immediately abort
> on one ICMP message,
> depending on old kernel behavior.
> 
> I do not know how to launch a subset of the tests, and trace these.
> 
> "./fcnal-test.sh -t ipv4_tcp" alone takes more than 9 minutes [1] in a
> VM running a non debug kernel :/
> 
> David, do you have an idea how to proceed ?
> 
> Thanks.
> 
> [1]
> Tests passed: 134
> Tests failed:   0
> 
> real 9m33.085s
> user 0m40.159s
> sys 0m30.098s

The test suite was started in 2013 and has evolved to cover the many
permutations of APIs -- setsockopts, cmsg, VRF, routing, ip rules,
netfilter, etc. There are a lot of combinations. They verify not just
control path or socket bind succeeds, but data path works as expected
which means do a connect and packet transfer.

Some years back nettest gained support to change namespaces and run both
client and server in a single instance. I started converting the tests
to use that capability and remove the sleeps, but it did not speed up
the tests in any significant way.

The tests are in blocks and the blocks can be split out to separate
scripts or kept in one to retain the common setup code and launched in
parallel. Splitting out to any lower level does not make sense.

If someone wants to pick up the task of splitting the tests or running
them in parallel, please do. I have zero time right now.  That the suite
detects changes in kernel behavior shows it is doing what it is designed
to do and proves its value.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  2024-04-18 17:47               ` Eric Dumazet
@ 2024-04-18 18:02                 ` David Ahern
  2024-04-18 18:09                   ` Eric Dumazet
  2024-04-18 18:09                 ` Jakub Kicinski
  1 sibling, 1 reply; 23+ messages in thread
From: David Ahern @ 2024-04-18 18:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Paolo Abeni, David S . Miller, Jakub Kicinski, netdev,
	Neal Cardwell, Dragos Tatulea, eric.dumazet,
	Maciej Żenczykowski, Willem de Bruijn, Shachar Kagan

On 4/18/24 11:47 AM, Eric Dumazet wrote:
> I think the userspace program relied on a bug added in linux in 2020

which test - when was it added? nettest.c and the fcnal script were
merged in 2019.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  2024-04-18 17:47               ` Eric Dumazet
  2024-04-18 18:02                 ` David Ahern
@ 2024-04-18 18:09                 ` Jakub Kicinski
  2024-04-18 18:14                   ` Eric Dumazet
  2024-04-18 20:20                   ` David Ahern
  1 sibling, 2 replies; 23+ messages in thread
From: Jakub Kicinski @ 2024-04-18 18:09 UTC (permalink / raw)
  To: Eric Dumazet, David Ahern
  Cc: Paolo Abeni, David S . Miller, netdev, Neal Cardwell,
	Dragos Tatulea, eric.dumazet, Maciej Żenczykowski,
	Willem de Bruijn, Shachar Kagan

On Thu, 18 Apr 2024 19:47:51 +0200 Eric Dumazet wrote:
> > You have a kernel patch that makes a test fail, and your solution is
> > changing userspace? The tests are examples of userspace applications and
> > how they can use APIs, so if the patch breaks a test it is by definition
> > breaking userspace which is not allowed.  

Tests are often overly sensitive to kernel behavior, while this is
obviously a red flag it's not an automatic nack. The more tests we
have the more often we'll catch tiny changes. A lot of tests started
flaking with 6.9 because of the optimizations in the timer subsystem.
You know where I'm going with this..

> I think the userspace program relied on a bug added in linux in 2020
> 
> Jakub, I will stop trying to push the patches, this is a lost battle.

If you have the patches ready - please post them.
I'm happy to take the blame if they actually regress something in 
the wild :(

We're pursuing this because real application suffer real problems
when routing changes cause transient ICMP errors. Users read the RFC
and then come shouting at us that Linux is buggy.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  2024-04-18 18:02                 ` David Ahern
@ 2024-04-18 18:09                   ` Eric Dumazet
  0 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2024-04-18 18:09 UTC (permalink / raw)
  To: David Ahern
  Cc: Paolo Abeni, David S . Miller, Jakub Kicinski, netdev,
	Neal Cardwell, Dragos Tatulea, eric.dumazet,
	Maciej Żenczykowski, Willem de Bruijn, Shachar Kagan

On Thu, Apr 18, 2024 at 8:02 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 4/18/24 11:47 AM, Eric Dumazet wrote:
> > I think the userspace program relied on a bug added in linux in 2020
>
> which test - when was it added? nettest.c and the fcnal script were
> merged in 2019.

The test relies on connect() being aborted by EHOSTUNREACH

This is in complete violation with RFC

Even our own net.ipv4/icmp.c states this clearly.

 RFC 1122: 3.2.2.1 States that NET_UNREACH, HOST_UNREACH and SR_FAILED
MUST be considered 'transient errs'.

Your program wants to use RECVERR instead of depending on a bug that
we are trying very hard to solve.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  2024-04-18 18:09                 ` Jakub Kicinski
@ 2024-04-18 18:14                   ` Eric Dumazet
  2024-04-18 20:20                   ` David Ahern
  1 sibling, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2024-04-18 18:14 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Ahern, Paolo Abeni, David S . Miller, netdev,
	Neal Cardwell, Dragos Tatulea, eric.dumazet,
	Maciej Żenczykowski, Willem de Bruijn, Shachar Kagan

On Thu, Apr 18, 2024 at 8:09 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 18 Apr 2024 19:47:51 +0200 Eric Dumazet wrote:
> > > You have a kernel patch that makes a test fail, and your solution is
> > > changing userspace? The tests are examples of userspace applications and
> > > how they can use APIs, so if the patch breaks a test it is by definition
> > > breaking userspace which is not allowed.
>
> Tests are often overly sensitive to kernel behavior, while this is
> obviously a red flag it's not an automatic nack. The more tests we
> have the more often we'll catch tiny changes. A lot of tests started
> flaking with 6.9 because of the optimizations in the timer subsystem.
> You know where I'm going with this..
>
> > I think the userspace program relied on a bug added in linux in 2020
> >
> > Jakub, I will stop trying to push the patches, this is a lost battle.
>
> If you have the patches ready - please post them.
> I'm happy to take the blame if they actually regress something in
> the wild :(

The series with the 2 patches has been posted already.

The remaining part is a nettest.c fix, that David is not happy with.

diff --git a/tools/testing/selftests/net/nettest.c
b/tools/testing/selftests/net/nettest.c
index cd8a580974480212b45d86f35293b77f3d033473..23d56797900f19890923028af0b7ba162d9d5794
100644
--- a/tools/testing/selftests/net/nettest.c
+++ b/tools/testing/selftests/net/nettest.c
@@ -1744,6 +1744,8 @@ static int connectsock(void *addr, socklen_t
alen, struct sock_args *args)
        if (args->bind_test_only)
                goto out;

+       set_recv_attr(sd, args->version);
+
        if (connect(sd, addr, alen) < 0) {
                if (errno != EINPROGRESS) {
                        log_err_errno("Failed to connect to remote host");

>
> We're pursuing this because real application suffer real problems
> when routing changes cause transient ICMP errors. Users read the RFC
> and then come shouting at us that Linux is buggy.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err()
  2024-04-18 18:09                 ` Jakub Kicinski
  2024-04-18 18:14                   ` Eric Dumazet
@ 2024-04-18 20:20                   ` David Ahern
  1 sibling, 0 replies; 23+ messages in thread
From: David Ahern @ 2024-04-18 20:20 UTC (permalink / raw)
  To: Jakub Kicinski, Eric Dumazet
  Cc: Paolo Abeni, David S . Miller, netdev, Neal Cardwell,
	Dragos Tatulea, eric.dumazet, Maciej Żenczykowski,
	Willem de Bruijn, Shachar Kagan

On 4/18/24 12:09 PM, Jakub Kicinski wrote:
> On Thu, 18 Apr 2024 19:47:51 +0200 Eric Dumazet wrote:
>>> You have a kernel patch that makes a test fail, and your solution is
>>> changing userspace? The tests are examples of userspace applications and
>>> how they can use APIs, so if the patch breaks a test it is by definition
>>> breaking userspace which is not allowed.  
> 
> Tests are often overly sensitive to kernel behavior, while this is

That test script is a fairly comprehensive sweep of uAPIs and kernel
behavior. Its sole job is to detect user visible changes and breakage
from patches. It has done its job here. Do not shoot or criticize the
messenger because you do not like the message.


> obviously a red flag it's not an automatic nack. The more tests we
> have the more often we'll catch tiny changes. A lot of tests started
> flaking with 6.9 because of the optimizations in the timer subsystem.
> You know where I'm going with this..
> 
>> I think the userspace program relied on a bug added in linux in 2020
>>
>> Jakub, I will stop trying to push the patches, this is a lost battle.
> 
> If you have the patches ready - please post them.
> I'm happy to take the blame if they actually regress something in 
> the wild :(

And because of this test suite you are making a conscious decision to
merge a patch that is making a user visible change. That is part of the
tough decisions a maintainer has to make; I have no problem with that.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2024-04-18 20:20 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-17 16:57 [PATCH net-next 0/2] tcp: tcp_v4_err() changes Eric Dumazet
2024-04-17 16:57 ` [PATCH net-next 1/2] tcp: conditionally call ip_icmp_error() from tcp_v4_err() Eric Dumazet
2024-04-17 17:08   ` Maciej Żenczykowski
2024-04-18  3:22   ` Jason Xing
2024-04-18  6:45     ` Eric Dumazet
2024-04-18  6:53       ` Jason Xing
2024-04-18  8:02   ` Paolo Abeni
2024-04-18  8:03     ` Eric Dumazet
2024-04-18  9:26       ` Eric Dumazet
2024-04-18  9:58         ` Paolo Abeni
2024-04-18 10:15           ` Eric Dumazet
2024-04-18 10:22             ` Eric Dumazet
2024-04-18 10:36               ` Eric Dumazet
2024-04-18 17:46             ` David Ahern
2024-04-18 17:47               ` Eric Dumazet
2024-04-18 18:02                 ` David Ahern
2024-04-18 18:09                   ` Eric Dumazet
2024-04-18 18:09                 ` Jakub Kicinski
2024-04-18 18:14                   ` Eric Dumazet
2024-04-18 20:20                   ` David Ahern
2024-04-18 17:56         ` David Ahern
2024-04-17 16:57 ` [PATCH net-next 2/2] tcp: no longer abort SYN_SENT when receiving some ICMP (II) Eric Dumazet
2024-04-18  3:24   ` Jason Xing

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.