Netdev Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH net-next] icmp: support rfc 4884
@ 2020-06-29 16:57 Willem de Bruijn
  2020-06-29 20:34 ` Tom Herbert
  2020-06-29 21:14 ` Eric Dumazet
  0 siblings, 2 replies; 16+ messages in thread
From: Willem de Bruijn @ 2020-06-29 16:57 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

ICMP messages may include an extension structure after the original
datagram. RFC 4884 standardized this behavior.

It introduces an explicit original datagram length field in the ICMP
header to delineate the original datagram from the extension struct.

Return this field when reading an ICMP error from the error queue.

ICMPv6 by default already returns the entire 32-bit part of the header
that includes this field by default. For consistency, do the exact
same for ICMP. So far it only returns mtu on ICMP_FRAG_NEEDED and gw
on ICMP_PARAMETERPROB.

For backwards compatibility, make this optional, set by setsockopt
SOL_IP/IP_RECVERR_RFC4884. For API documentation and feature test, see
https://github.com/wdebruij/kerneltools/blob/master/tests/recv_icmp.c

Alternative implementation to reading from the skb in ip_icmp_error
is to pass the field from icmp_unreach, again analogous to ICMPv6. But
this would require changes to every $proto_err() callback, which for
ICMP_FRAG_NEEDED pass the u32 info arg to a pmtu update function.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/net/inet_sock.h |  1 +
 include/uapi/linux/in.h |  1 +
 net/ipv4/ip_sockglue.c  | 12 ++++++++++++
 3 files changed, 14 insertions(+)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index a7ce00af6c44..a3702d1d4875 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -225,6 +225,7 @@ struct inet_sock {
 				mc_all:1,
 				nodefrag:1;
 	__u8			bind_address_no_port:1,
+				recverr_rfc4884:1,
 				defer_connect:1; /* Indicates that fastopen_connect is set
 						  * and cookie exists so we defer connect
 						  * until first data frame is written
diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
index 8533bf07450f..3d0d8231dc19 100644
--- a/include/uapi/linux/in.h
+++ b/include/uapi/linux/in.h
@@ -123,6 +123,7 @@ struct in_addr {
 #define IP_CHECKSUM	23
 #define IP_BIND_ADDRESS_NO_PORT	24
 #define IP_RECVFRAGSIZE	25
+#define IP_RECVERR_RFC4884	26
 
 /* IP_MTU_DISCOVER values */
 #define IP_PMTUDISC_DONT		0	/* Never send DF frames */
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 84ec3703c909..525140e3947c 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -398,6 +398,9 @@ void ip_icmp_error(struct sock *sk, struct sk_buff *skb, int err,
 	if (!skb)
 		return;
 
+	if (inet_sk(sk)->recverr_rfc4884)
+		info = ntohl(icmp_hdr(skb)->un.gateway);
+
 	serr = SKB_EXT_ERR(skb);
 	serr->ee.ee_errno = err;
 	serr->ee.ee_origin = SO_EE_ORIGIN_ICMP;
@@ -755,6 +758,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 	case IP_RECVORIGDSTADDR:
 	case IP_CHECKSUM:
 	case IP_RECVFRAGSIZE:
+	case IP_RECVERR_RFC4884:
 		if (optlen >= sizeof(int)) {
 			if (get_user(val, (int __user *) optval))
 				return -EFAULT;
@@ -914,6 +918,11 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 		if (!val)
 			skb_queue_purge(&sk->sk_error_queue);
 		break;
+	case IP_RECVERR_RFC4884:
+		if (val != 0 && val != 1)
+			goto e_inval;
+		inet->recverr_rfc4884 = val;
+		break;
 	case IP_MULTICAST_TTL:
 		if (sk->sk_type == SOCK_STREAM)
 			goto e_inval;
@@ -1588,6 +1597,9 @@ static int do_ip_getsockopt(struct sock *sk, int level, int optname,
 	case IP_RECVERR:
 		val = inet->recverr;
 		break;
+	case IP_RECVERR_RFC4884:
+		val = inet->recverr_rfc4884;
+		break;
 	case IP_MULTICAST_TTL:
 		val = inet->mc_ttl;
 		break;
-- 
2.27.0.212.ge8ba1cc988-goog


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] icmp: support rfc 4884
  2020-06-29 16:57 [PATCH net-next] icmp: support rfc 4884 Willem de Bruijn
@ 2020-06-29 20:34 ` Tom Herbert
  2020-06-29 20:55   ` Willem de Bruijn
  2020-06-29 21:14 ` Eric Dumazet
  1 sibling, 1 reply; 16+ messages in thread
From: Tom Herbert @ 2020-06-29 20:34 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Linux Kernel Network Developers, David S. Miller, Willem de Bruijn

On Mon, Jun 29, 2020 at 12:23 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> From: Willem de Bruijn <willemb@google.com>
>
> ICMP messages may include an extension structure after the original
> datagram. RFC 4884 standardized this behavior.
>
> It introduces an explicit original datagram length field in the ICMP
> header to delineate the original datagram from the extension struct.
>
> Return this field when reading an ICMP error from the error queue.
>
> ICMPv6 by default already returns the entire 32-bit part of the header
> that includes this field by default. For consistency, do the exact
> same for ICMP. So far it only returns mtu on ICMP_FRAG_NEEDED and gw
> on ICMP_PARAMETERPROB.
>
> For backwards compatibility, make this optional, set by setsockopt
> SOL_IP/IP_RECVERR_RFC4884. For API documentation and feature test, see
> https://github.com/wdebruij/kerneltools/blob/master/tests/recv_icmp.c
>
> Alternative implementation to reading from the skb in ip_icmp_error
> is to pass the field from icmp_unreach, again analogous to ICMPv6. But
> this would require changes to every $proto_err() callback, which for
> ICMP_FRAG_NEEDED pass the u32 info arg to a pmtu update function.
>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---
>  include/net/inet_sock.h |  1 +
>  include/uapi/linux/in.h |  1 +
>  net/ipv4/ip_sockglue.c  | 12 ++++++++++++
>  3 files changed, 14 insertions(+)
>
> diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
> index a7ce00af6c44..a3702d1d4875 100644
> --- a/include/net/inet_sock.h
> +++ b/include/net/inet_sock.h
> @@ -225,6 +225,7 @@ struct inet_sock {
>                                 mc_all:1,
>                                 nodefrag:1;
>         __u8                    bind_address_no_port:1,
> +                               recverr_rfc4884:1,
>                                 defer_connect:1; /* Indicates that fastopen_connect is set
>                                                   * and cookie exists so we defer connect
>                                                   * until first data frame is written
> diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
> index 8533bf07450f..3d0d8231dc19 100644
> --- a/include/uapi/linux/in.h
> +++ b/include/uapi/linux/in.h
> @@ -123,6 +123,7 @@ struct in_addr {
>  #define IP_CHECKSUM    23
>  #define IP_BIND_ADDRESS_NO_PORT        24
>  #define IP_RECVFRAGSIZE        25
> +#define IP_RECVERR_RFC4884     26
>
>  /* IP_MTU_DISCOVER values */
>  #define IP_PMTUDISC_DONT               0       /* Never send DF frames */
> diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
> index 84ec3703c909..525140e3947c 100644
> --- a/net/ipv4/ip_sockglue.c
> +++ b/net/ipv4/ip_sockglue.c
> @@ -398,6 +398,9 @@ void ip_icmp_error(struct sock *sk, struct sk_buff *skb, int err,
>         if (!skb)
>                 return;
>
> +       if (inet_sk(sk)->recverr_rfc4884)
> +               info = ntohl(icmp_hdr(skb)->un.gateway);
> +
Willem,

Doesn't this assume that all received ICMP errors received on the
socket use the extended RFC4884 format once the option is set on the
socket?

I think what we might need to do this properly is to switch on the
ICMP Type/Code to determine if the format is RFC4884. If it is then,
we can figure out where the appropriate info for the ICMP error is.
Good example of this is draft-ietf-6man-icmp-limits-08 currently on
the RFC editor queue. A code for destination unreachable is added for
"aggregate header limit exceeded". An RFC4884 format is used to
contain an ICMP extension that includes a pointer to the offending
byte (unlike parameter problem, destination unreachable doesn't have
Pointer in ICMP header).  So in this case it makes sense that the
kernel returns the Pointer in extended ICMP as the info.

Tom


>         serr = SKB_EXT_ERR(skb);
>         serr->ee.ee_errno = err;
>         serr->ee.ee_origin = SO_EE_ORIGIN_ICMP;
> @@ -755,6 +758,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
>         case IP_RECVORIGDSTADDR:
>         case IP_CHECKSUM:
>         case IP_RECVFRAGSIZE:
> +       case IP_RECVERR_RFC4884:
>                 if (optlen >= sizeof(int)) {
>                         if (get_user(val, (int __user *) optval))
>                                 return -EFAULT;
> @@ -914,6 +918,11 @@ static int do_ip_setsockopt(struct sock *sk, int level,
>                 if (!val)
>                         skb_queue_purge(&sk->sk_error_queue);
>                 break;
> +       case IP_RECVERR_RFC4884:
> +               if (val != 0 && val != 1)
> +                       goto e_inval;
> +               inet->recverr_rfc4884 = val;
> +               break;
>         case IP_MULTICAST_TTL:
>                 if (sk->sk_type == SOCK_STREAM)
>                         goto e_inval;
> @@ -1588,6 +1597,9 @@ static int do_ip_getsockopt(struct sock *sk, int level, int optname,
>         case IP_RECVERR:
>                 val = inet->recverr;
>                 break;
> +       case IP_RECVERR_RFC4884:
> +               val = inet->recverr_rfc4884;
> +               break;
>         case IP_MULTICAST_TTL:
>                 val = inet->mc_ttl;
>                 break;
> --
> 2.27.0.212.ge8ba1cc988-goog
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] icmp: support rfc 4884
  2020-06-29 20:34 ` Tom Herbert
@ 2020-06-29 20:55   ` Willem de Bruijn
  0 siblings, 0 replies; 16+ messages in thread
From: Willem de Bruijn @ 2020-06-29 20:55 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Willem de Bruijn, Linux Kernel Network Developers, David S. Miller

On Mon, Jun 29, 2020 at 4:34 PM Tom Herbert <tom@herbertland.com> wrote:
>
> On Mon, Jun 29, 2020 at 12:23 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > From: Willem de Bruijn <willemb@google.com>
> >
> > ICMP messages may include an extension structure after the original
> > datagram. RFC 4884 standardized this behavior.
> >
> > It introduces an explicit original datagram length field in the ICMP
> > header to delineate the original datagram from the extension struct.
> >
> > Return this field when reading an ICMP error from the error queue.
> >
> > ICMPv6 by default already returns the entire 32-bit part of the header
> > that includes this field by default. For consistency, do the exact
> > same for ICMP. So far it only returns mtu on ICMP_FRAG_NEEDED and gw
> > on ICMP_PARAMETERPROB.
> >
> > For backwards compatibility, make this optional, set by setsockopt
> > SOL_IP/IP_RECVERR_RFC4884. For API documentation and feature test, see
> > https://github.com/wdebruij/kerneltools/blob/master/tests/recv_icmp.c
> >
> > Alternative implementation to reading from the skb in ip_icmp_error
> > is to pass the field from icmp_unreach, again analogous to ICMPv6. But
> > this would require changes to every $proto_err() callback, which for
> > ICMP_FRAG_NEEDED pass the u32 info arg to a pmtu update function.
> >
> > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > ---
> >  include/net/inet_sock.h |  1 +
> >  include/uapi/linux/in.h |  1 +
> >  net/ipv4/ip_sockglue.c  | 12 ++++++++++++
> >  3 files changed, 14 insertions(+)
> >
> > diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
> > index a7ce00af6c44..a3702d1d4875 100644
> > --- a/include/net/inet_sock.h
> > +++ b/include/net/inet_sock.h
> > @@ -225,6 +225,7 @@ struct inet_sock {
> >                                 mc_all:1,
> >                                 nodefrag:1;
> >         __u8                    bind_address_no_port:1,
> > +                               recverr_rfc4884:1,
> >                                 defer_connect:1; /* Indicates that fastopen_connect is set
> >                                                   * and cookie exists so we defer connect
> >                                                   * until first data frame is written
> > diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
> > index 8533bf07450f..3d0d8231dc19 100644
> > --- a/include/uapi/linux/in.h
> > +++ b/include/uapi/linux/in.h
> > @@ -123,6 +123,7 @@ struct in_addr {
> >  #define IP_CHECKSUM    23
> >  #define IP_BIND_ADDRESS_NO_PORT        24
> >  #define IP_RECVFRAGSIZE        25
> > +#define IP_RECVERR_RFC4884     26
> >
> >  /* IP_MTU_DISCOVER values */
> >  #define IP_PMTUDISC_DONT               0       /* Never send DF frames */
> > diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
> > index 84ec3703c909..525140e3947c 100644
> > --- a/net/ipv4/ip_sockglue.c
> > +++ b/net/ipv4/ip_sockglue.c
> > @@ -398,6 +398,9 @@ void ip_icmp_error(struct sock *sk, struct sk_buff *skb, int err,
> >         if (!skb)
> >                 return;
> >
> > +       if (inet_sk(sk)->recverr_rfc4884)
> > +               info = ntohl(icmp_hdr(skb)->un.gateway);
> > +
> Willem,
>
> Doesn't this assume that all received ICMP errors received on the
> socket use the extended RFC4884 format once the option is set on the
> socket?
>
> I think what we might need to do this properly is to switch on the
> ICMP Type/Code to determine if the format is RFC4884. If it is then,
> we can figure out where the appropriate info for the ICMP error is.
> Good example of this is draft-ietf-6man-icmp-limits-08 currently on
> the RFC editor queue. A code for destination unreachable is added for
> "aggregate header limit exceeded". An RFC4884 format is used to
> contain an ICMP extension that includes a pointer to the offending
> byte (unlike parameter problem, destination unreachable doesn't have
> Pointer in ICMP header).  So in this case it makes sense that the
> kernel returns the Pointer in extended ICMP as the info.

This implementation matches the existing behavior of ICMPv6.

You're right that the RFC 4884 length field is relevant only for a
selection of ICMP types. For other types most of the bits are
reserved. I don't see any issue with exposing those. Anything
more fine grained will have to be extended for each
additional feature such as the one you mention. I prefer to just make
this information available once and for all. It really helped that it
was already available on v6, for instance.




> Tom
>
>
> >         serr = SKB_EXT_ERR(skb);
> >         serr->ee.ee_errno = err;
> >         serr->ee.ee_origin = SO_EE_ORIGIN_ICMP;
> > @@ -755,6 +758,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
> >         case IP_RECVORIGDSTADDR:
> >         case IP_CHECKSUM:
> >         case IP_RECVFRAGSIZE:
> > +       case IP_RECVERR_RFC4884:
> >                 if (optlen >= sizeof(int)) {
> >                         if (get_user(val, (int __user *) optval))
> >                                 return -EFAULT;
> > @@ -914,6 +918,11 @@ static int do_ip_setsockopt(struct sock *sk, int level,
> >                 if (!val)
> >                         skb_queue_purge(&sk->sk_error_queue);
> >                 break;
> > +       case IP_RECVERR_RFC4884:
> > +               if (val != 0 && val != 1)
> > +                       goto e_inval;
> > +               inet->recverr_rfc4884 = val;
> > +               break;
> >         case IP_MULTICAST_TTL:
> >                 if (sk->sk_type == SOCK_STREAM)
> >                         goto e_inval;
> > @@ -1588,6 +1597,9 @@ static int do_ip_getsockopt(struct sock *sk, int level, int optname,
> >         case IP_RECVERR:
> >                 val = inet->recverr;
> >                 break;
> > +       case IP_RECVERR_RFC4884:
> > +               val = inet->recverr_rfc4884;
> > +               break;
> >         case IP_MULTICAST_TTL:
> >                 val = inet->mc_ttl;
> >                 break;
> > --
> > 2.27.0.212.ge8ba1cc988-goog
> >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] icmp: support rfc 4884
  2020-06-29 16:57 [PATCH net-next] icmp: support rfc 4884 Willem de Bruijn
  2020-06-29 20:34 ` Tom Herbert
@ 2020-06-29 21:14 ` Eric Dumazet
  2020-06-29 21:30   ` Willem de Bruijn
  2020-06-30  0:30   ` Tom Herbert
  1 sibling, 2 replies; 16+ messages in thread
From: Eric Dumazet @ 2020-06-29 21:14 UTC (permalink / raw)
  To: Willem de Bruijn, netdev; +Cc: davem, Willem de Bruijn



On 6/29/20 9:57 AM, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> ICMP messages may include an extension structure after the original
> datagram. RFC 4884 standardized this behavior.
> 
> It introduces an explicit original datagram length field in the ICMP
> header to delineate the original datagram from the extension struct.
> 
> Return this field when reading an ICMP error from the error queue.

RFC mentions a 'length' field of 8 bits, your patch chose to export the whole
second word of icmp header.

Why is this field mapped to a prior one (icmp_hdr(skb)->un.gateway) ?

Should we add an element in the union to make this a little bit more explicit/readable ?

diff --git a/include/uapi/linux/icmp.h b/include/uapi/linux/icmp.h
index 5589eeb791ca580bb182e1dc38c05eab1c75adb9..427ed5a6765316a4c1e2fa06f3b6618447c01564 100644
--- a/include/uapi/linux/icmp.h
+++ b/include/uapi/linux/icmp.h
@@ -76,6 +76,7 @@ struct icmphdr {
                __be16  sequence;
        } echo;
        __be32  gateway;
+       __be32  second_word; /* RFC 4884 4.[123] : <unused:8>,<length:8>,<mtu:16> */
        struct {
                __be16  __unused;
                __be16  mtu;



> 
> ICMPv6 by default already returns the entire 32-bit part of the header
> that includes this field by default. For consistency, do the exact
> same for ICMP. So far it only returns mtu on ICMP_FRAG_NEEDED and gw
> on ICMP_PARAMETERPROB.
> 
> For backwards compatibility, make this optional, set by setsockopt
> SOL_IP/IP_RECVERR_RFC4884. For API documentation and feature test, see
> https://github.com/wdebruij/kerneltools/blob/master/tests/recv_icmp.c
> 
> Alternative implementation to reading from the skb in ip_icmp_error
> is to pass the field from icmp_unreach, again analogous to ICMPv6. But
> this would require changes to every $proto_err() callback, which for
> ICMP_FRAG_NEEDED pass the u32 info arg to a pmtu update function.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---
>  include/net/inet_sock.h |  1 +
>  include/uapi/linux/in.h |  1 +
>  net/ipv4/ip_sockglue.c  | 12 ++++++++++++
>  3 files changed, 14 insertions(+)
> 
> diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
> index a7ce00af6c44..a3702d1d4875 100644
> --- a/include/net/inet_sock.h
> +++ b/include/net/inet_sock.h
> @@ -225,6 +225,7 @@ struct inet_sock {
>  				mc_all:1,
>  				nodefrag:1;
>  	__u8			bind_address_no_port:1,
> +				recverr_rfc4884:1,
>  				defer_connect:1; /* Indicates that fastopen_connect is set
>  						  * and cookie exists so we defer connect
>  						  * until first data frame is written
> diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
> index 8533bf07450f..3d0d8231dc19 100644
> --- a/include/uapi/linux/in.h
> +++ b/include/uapi/linux/in.h
> @@ -123,6 +123,7 @@ struct in_addr {
>  #define IP_CHECKSUM	23
>  #define IP_BIND_ADDRESS_NO_PORT	24
>  #define IP_RECVFRAGSIZE	25
> +#define IP_RECVERR_RFC4884	26
>  
>  /* IP_MTU_DISCOVER values */
>  #define IP_PMTUDISC_DONT		0	/* Never send DF frames */
> diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
> index 84ec3703c909..525140e3947c 100644
> --- a/net/ipv4/ip_sockglue.c
> +++ b/net/ipv4/ip_sockglue.c
> @@ -398,6 +398,9 @@ void ip_icmp_error(struct sock *sk, struct sk_buff *skb, int err,
>  	if (!skb)
>  		return;
>  
> +	if (inet_sk(sk)->recverr_rfc4884)
> +		info = ntohl(icmp_hdr(skb)->un.gateway);

ntohl(icmp_hdr(skb)->un.second_word);

> +
>  	serr = SKB_EXT_ERR(skb);
>  	serr->ee.ee_errno = err;
>  	serr->ee.ee_origin = SO_EE_ORIGIN_ICMP;
> @@ -755,6 +758,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
>  	case IP_RECVORIGDSTADDR:
>  	case IP_CHECKSUM:
>  	case IP_RECVFRAGSIZE:
> +	case IP_RECVERR_RFC4884:
>  		if (optlen >= sizeof(int)) {
>  			if (get_user(val, (int __user *) optval))
>  				return -EFAULT;
> @@ -914,6 +918,11 @@ static int do_ip_setsockopt(struct sock *sk, int level,
>  		if (!val)
>  			skb_queue_purge(&sk->sk_error_queue);
>  		break;
> +	case IP_RECVERR_RFC4884:
> +		if (val != 0 && val != 1)
> +			goto e_inval;
> +		inet->recverr_rfc4884 = val;
> +		break;
>  	case IP_MULTICAST_TTL:
>  		if (sk->sk_type == SOCK_STREAM)
>  			goto e_inval;
> @@ -1588,6 +1597,9 @@ static int do_ip_getsockopt(struct sock *sk, int level, int optname,
>  	case IP_RECVERR:
>  		val = inet->recverr;
>  		break;
> +	case IP_RECVERR_RFC4884:
> +		val = inet->recverr_rfc4884;
> +		break;
>  	case IP_MULTICAST_TTL:
>  		val = inet->mc_ttl;
>  		break;
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] icmp: support rfc 4884
  2020-06-29 21:14 ` Eric Dumazet
@ 2020-06-29 21:30   ` Willem de Bruijn
  2020-06-29 23:06     ` Eric Dumazet
  2020-06-30  0:30   ` Tom Herbert
  1 sibling, 1 reply; 16+ messages in thread
From: Willem de Bruijn @ 2020-06-29 21:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willem de Bruijn, Network Development, David Miller

On Mon, Jun 29, 2020 at 5:15 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 6/29/20 9:57 AM, Willem de Bruijn wrote:
> > From: Willem de Bruijn <willemb@google.com>
> >
> > ICMP messages may include an extension structure after the original
> > datagram. RFC 4884 standardized this behavior.
> >
> > It introduces an explicit original datagram length field in the ICMP
> > header to delineate the original datagram from the extension struct.
> >
> > Return this field when reading an ICMP error from the error queue.
>
> RFC mentions a 'length' field of 8 bits, your patch chose to export the whole
> second word of icmp header.
>
> Why is this field mapped to a prior one (icmp_hdr(skb)->un.gateway) ?
>
> Should we add an element in the union to make this a little bit more explicit/readable ?
>
> diff --git a/include/uapi/linux/icmp.h b/include/uapi/linux/icmp.h
> index 5589eeb791ca580bb182e1dc38c05eab1c75adb9..427ed5a6765316a4c1e2fa06f3b6618447c01564 100644
> --- a/include/uapi/linux/icmp.h
> +++ b/include/uapi/linux/icmp.h
> @@ -76,6 +76,7 @@ struct icmphdr {
>                 __be16  sequence;
>         } echo;
>         __be32  gateway;
> +       __be32  second_word; /* RFC 4884 4.[123] : <unused:8>,<length:8>,<mtu:16> */
>         struct {
>                 __be16  __unused;
>                 __be16  mtu;

Okay. How about a variant of the existing struct frag?

@@ -80,6 +80,11 @@ struct icmphdr {
                __be16  __unused;
                __be16  mtu;
        } frag;
+       struct {
+               __u8    __unused;
+               __u8    length;
+               __be16  mtu;
+       } rfc_4884;
        __u8    reserved[4];
   } un;

 };

>
>
> >
> > ICMPv6 by default already returns the entire 32-bit part of the header
> > that includes this field by default. For consistency, do the exact
> > same for ICMP. So far it only returns mtu on ICMP_FRAG_NEEDED and gw
> > on ICMP_PARAMETERPROB.
> >
> > For backwards compatibility, make this optional, set by setsockopt
> > SOL_IP/IP_RECVERR_RFC4884. For API documentation and feature test, see
> > https://github.com/wdebruij/kerneltools/blob/master/tests/recv_icmp.c
> >
> > Alternative implementation to reading from the skb in ip_icmp_error
> > is to pass the field from icmp_unreach, again analogous to ICMPv6. But
> > this would require changes to every $proto_err() callback, which for
> > ICMP_FRAG_NEEDED pass the u32 info arg to a pmtu update function.
> >
> > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > ---
> >  include/net/inet_sock.h |  1 +
> >  include/uapi/linux/in.h |  1 +
> >  net/ipv4/ip_sockglue.c  | 12 ++++++++++++
> >  3 files changed, 14 insertions(+)
> >
> > diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
> > index a7ce00af6c44..a3702d1d4875 100644
> > --- a/include/net/inet_sock.h
> > +++ b/include/net/inet_sock.h
> > @@ -225,6 +225,7 @@ struct inet_sock {
> >                               mc_all:1,
> >                               nodefrag:1;
> >       __u8                    bind_address_no_port:1,
> > +                             recverr_rfc4884:1,
> >                               defer_connect:1; /* Indicates that fastopen_connect is set
> >                                                 * and cookie exists so we defer connect
> >                                                 * until first data frame is written
> > diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
> > index 8533bf07450f..3d0d8231dc19 100644
> > --- a/include/uapi/linux/in.h
> > +++ b/include/uapi/linux/in.h
> > @@ -123,6 +123,7 @@ struct in_addr {
> >  #define IP_CHECKSUM  23
> >  #define IP_BIND_ADDRESS_NO_PORT      24
> >  #define IP_RECVFRAGSIZE      25
> > +#define IP_RECVERR_RFC4884   26
> >
> >  /* IP_MTU_DISCOVER values */
> >  #define IP_PMTUDISC_DONT             0       /* Never send DF frames */
> > diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
> > index 84ec3703c909..525140e3947c 100644
> > --- a/net/ipv4/ip_sockglue.c
> > +++ b/net/ipv4/ip_sockglue.c
> > @@ -398,6 +398,9 @@ void ip_icmp_error(struct sock *sk, struct sk_buff *skb, int err,
> >       if (!skb)
> >               return;
> >
> > +     if (inet_sk(sk)->recverr_rfc4884)
> > +             info = ntohl(icmp_hdr(skb)->un.gateway);
>
> ntohl(icmp_hdr(skb)->un.second_word);
>
> > +
> >       serr = SKB_EXT_ERR(skb);
> >       serr->ee.ee_errno = err;
> >       serr->ee.ee_origin = SO_EE_ORIGIN_ICMP;
> > @@ -755,6 +758,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
> >       case IP_RECVORIGDSTADDR:
> >       case IP_CHECKSUM:
> >       case IP_RECVFRAGSIZE:
> > +     case IP_RECVERR_RFC4884:
> >               if (optlen >= sizeof(int)) {
> >                       if (get_user(val, (int __user *) optval))
> >                               return -EFAULT;
> > @@ -914,6 +918,11 @@ static int do_ip_setsockopt(struct sock *sk, int level,
> >               if (!val)
> >                       skb_queue_purge(&sk->sk_error_queue);
> >               break;
> > +     case IP_RECVERR_RFC4884:
> > +             if (val != 0 && val != 1)
> > +                     goto e_inval;
> > +             inet->recverr_rfc4884 = val;
> > +             break;
> >       case IP_MULTICAST_TTL:
> >               if (sk->sk_type == SOCK_STREAM)
> >                       goto e_inval;
> > @@ -1588,6 +1597,9 @@ static int do_ip_getsockopt(struct sock *sk, int level, int optname,
> >       case IP_RECVERR:
> >               val = inet->recverr;
> >               break;
> > +     case IP_RECVERR_RFC4884:
> > +             val = inet->recverr_rfc4884;
> > +             break;
> >       case IP_MULTICAST_TTL:
> >               val = inet->mc_ttl;
> >               break;
> >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] icmp: support rfc 4884
  2020-06-29 21:30   ` Willem de Bruijn
@ 2020-06-29 23:06     ` Eric Dumazet
  2020-06-30  0:36       ` Tom Herbert
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2020-06-29 23:06 UTC (permalink / raw)
  To: Willem de Bruijn, Eric Dumazet; +Cc: Network Development, David Miller



On 6/29/20 2:30 PM, Willem de Bruijn wrote:
> On Mon, Jun 29, 2020 at 5:15 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>
>>
>> On 6/29/20 9:57 AM, Willem de Bruijn wrote:
>>> From: Willem de Bruijn <willemb@google.com>
>>>
>>> ICMP messages may include an extension structure after the original
>>> datagram. RFC 4884 standardized this behavior.
>>>
>>> It introduces an explicit original datagram length field in the ICMP
>>> header to delineate the original datagram from the extension struct.
>>>
>>> Return this field when reading an ICMP error from the error queue.
>>
>> RFC mentions a 'length' field of 8 bits, your patch chose to export the whole
>> second word of icmp header.
>>
>> Why is this field mapped to a prior one (icmp_hdr(skb)->un.gateway) ?
>>
>> Should we add an element in the union to make this a little bit more explicit/readable ?
>>
>> diff --git a/include/uapi/linux/icmp.h b/include/uapi/linux/icmp.h
>> index 5589eeb791ca580bb182e1dc38c05eab1c75adb9..427ed5a6765316a4c1e2fa06f3b6618447c01564 100644
>> --- a/include/uapi/linux/icmp.h
>> +++ b/include/uapi/linux/icmp.h
>> @@ -76,6 +76,7 @@ struct icmphdr {
>>                 __be16  sequence;
>>         } echo;
>>         __be32  gateway;
>> +       __be32  second_word; /* RFC 4884 4.[123] : <unused:8>,<length:8>,<mtu:16> */
>>         struct {
>>                 __be16  __unused;
>>                 __be16  mtu;
> 
> Okay. How about a variant of the existing struct frag?
> 
> @@ -80,6 +80,11 @@ struct icmphdr {
>                 __be16  __unused;
>                 __be16  mtu;
>         } frag;
> +       struct {
> +               __u8    __unused;
> +               __u8    length;
> +               __be16  mtu;
> +       } rfc_4884;
>         __u8    reserved[4];
>    } un;
> 

Sure, but my point was later in the code :

>>> +     if (inet_sk(sk)->recverr_rfc4884)
>>> +             info = ntohl(icmp_hdr(skb)->un.gateway);
>>
>> ntohl(icmp_hdr(skb)->un.second_word);

If you leave there "info = ntohl(icmp_hdr(skb)->un.gateway)" it is a bit hard for someone
reading linux kernel code to understand why we do this.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] icmp: support rfc 4884
  2020-06-29 21:14 ` Eric Dumazet
  2020-06-29 21:30   ` Willem de Bruijn
@ 2020-06-30  0:30   ` Tom Herbert
  1 sibling, 0 replies; 16+ messages in thread
From: Tom Herbert @ 2020-06-30  0:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willem de Bruijn, Linux Kernel Network Developers,
	David S. Miller, Willem de Bruijn

On Mon, Jun 29, 2020 at 2:15 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 6/29/20 9:57 AM, Willem de Bruijn wrote:
> > From: Willem de Bruijn <willemb@google.com>
> >
> > ICMP messages may include an extension structure after the original
> > datagram. RFC 4884 standardized this behavior.
> >
> > It introduces an explicit original datagram length field in the ICMP
> > header to delineate the original datagram from the extension struct.
> >
> > Return this field when reading an ICMP error from the error queue.
>
> RFC mentions a 'length' field of 8 bits, your patch chose to export the whole
> second word of icmp header.
>
> Why is this field mapped to a prior one (icmp_hdr(skb)->un.gateway) ?
>
> Should we add an element in the union to make this a little bit more explicit/readable ?
>
Yes, that makes sense. We should also define the structure for the
ICMP extension header, the structure for the ICMP Extension Objects,
and then each particular object can define its own structure. Also, a
function to get a pointer to the extension header (at offset 8*length
from the start of the packet data). Sorry, Willem, this is probably
more than you bargained for in posting the patch ;-)

Tom


> diff --git a/include/uapi/linux/icmp.h b/include/uapi/linux/icmp.h
> index 5589eeb791ca580bb182e1dc38c05eab1c75adb9..427ed5a6765316a4c1e2fa06f3b6618447c01564 100644
> --- a/include/uapi/linux/icmp.h
> +++ b/include/uapi/linux/icmp.h
> @@ -76,6 +76,7 @@ struct icmphdr {
>                 __be16  sequence;
>         } echo;
>         __be32  gateway;
> +       __be32  second_word; /* RFC 4884 4.[123] : <unused:8>,<length:8>,<mtu:16> */
>         struct {
>                 __be16  __unused;
>                 __be16  mtu;
>
>
>
> >
> > ICMPv6 by default already returns the entire 32-bit part of the header
> > that includes this field by default. For consistency, do the exact
> > same for ICMP. So far it only returns mtu on ICMP_FRAG_NEEDED and gw
> > on ICMP_PARAMETERPROB.
> >
> > For backwards compatibility, make this optional, set by setsockopt
> > SOL_IP/IP_RECVERR_RFC4884. For API documentation and feature test, see
> > https://github.com/wdebruij/kerneltools/blob/master/tests/recv_icmp.c
> >
> > Alternative implementation to reading from the skb in ip_icmp_error
> > is to pass the field from icmp_unreach, again analogous to ICMPv6. But
> > this would require changes to every $proto_err() callback, which for
> > ICMP_FRAG_NEEDED pass the u32 info arg to a pmtu update function.
> >
> > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > ---
> >  include/net/inet_sock.h |  1 +
> >  include/uapi/linux/in.h |  1 +
> >  net/ipv4/ip_sockglue.c  | 12 ++++++++++++
> >  3 files changed, 14 insertions(+)
> >
> > diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
> > index a7ce00af6c44..a3702d1d4875 100644
> > --- a/include/net/inet_sock.h
> > +++ b/include/net/inet_sock.h
> > @@ -225,6 +225,7 @@ struct inet_sock {
> >                               mc_all:1,
> >                               nodefrag:1;
> >       __u8                    bind_address_no_port:1,
> > +                             recverr_rfc4884:1,
> >                               defer_connect:1; /* Indicates that fastopen_connect is set
> >                                                 * and cookie exists so we defer connect
> >                                                 * until first data frame is written
> > diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
> > index 8533bf07450f..3d0d8231dc19 100644
> > --- a/include/uapi/linux/in.h
> > +++ b/include/uapi/linux/in.h
> > @@ -123,6 +123,7 @@ struct in_addr {
> >  #define IP_CHECKSUM  23
> >  #define IP_BIND_ADDRESS_NO_PORT      24
> >  #define IP_RECVFRAGSIZE      25
> > +#define IP_RECVERR_RFC4884   26
> >
> >  /* IP_MTU_DISCOVER values */
> >  #define IP_PMTUDISC_DONT             0       /* Never send DF frames */
> > diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
> > index 84ec3703c909..525140e3947c 100644
> > --- a/net/ipv4/ip_sockglue.c
> > +++ b/net/ipv4/ip_sockglue.c
> > @@ -398,6 +398,9 @@ void ip_icmp_error(struct sock *sk, struct sk_buff *skb, int err,
> >       if (!skb)
> >               return;
> >
> > +     if (inet_sk(sk)->recverr_rfc4884)
> > +             info = ntohl(icmp_hdr(skb)->un.gateway);
>
> ntohl(icmp_hdr(skb)->un.second_word);
>
> > +
> >       serr = SKB_EXT_ERR(skb);
> >       serr->ee.ee_errno = err;
> >       serr->ee.ee_origin = SO_EE_ORIGIN_ICMP;
> > @@ -755,6 +758,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
> >       case IP_RECVORIGDSTADDR:
> >       case IP_CHECKSUM:
> >       case IP_RECVFRAGSIZE:
> > +     case IP_RECVERR_RFC4884:
> >               if (optlen >= sizeof(int)) {
> >                       if (get_user(val, (int __user *) optval))
> >                               return -EFAULT;
> > @@ -914,6 +918,11 @@ static int do_ip_setsockopt(struct sock *sk, int level,
> >               if (!val)
> >                       skb_queue_purge(&sk->sk_error_queue);
> >               break;
> > +     case IP_RECVERR_RFC4884:
> > +             if (val != 0 && val != 1)
> > +                     goto e_inval;
> > +             inet->recverr_rfc4884 = val;
> > +             break;
> >       case IP_MULTICAST_TTL:
> >               if (sk->sk_type == SOCK_STREAM)
> >                       goto e_inval;
> > @@ -1588,6 +1597,9 @@ static int do_ip_getsockopt(struct sock *sk, int level, int optname,
> >       case IP_RECVERR:
> >               val = inet->recverr;
> >               break;
> > +     case IP_RECVERR_RFC4884:
> > +             val = inet->recverr_rfc4884;
> > +             break;
> >       case IP_MULTICAST_TTL:
> >               val = inet->mc_ttl;
> >               break;
> >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] icmp: support rfc 4884
  2020-06-29 23:06     ` Eric Dumazet
@ 2020-06-30  0:36       ` Tom Herbert
  2020-06-30  2:18         ` Willem de Bruijn
  0 siblings, 1 reply; 16+ messages in thread
From: Tom Herbert @ 2020-06-30  0:36 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willem de Bruijn, Network Development, David Miller

On Mon, Jun 29, 2020 at 4:07 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 6/29/20 2:30 PM, Willem de Bruijn wrote:
> > On Mon, Jun 29, 2020 at 5:15 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >>
> >>
> >>
> >> On 6/29/20 9:57 AM, Willem de Bruijn wrote:
> >>> From: Willem de Bruijn <willemb@google.com>
> >>>
> >>> ICMP messages may include an extension structure after the original
> >>> datagram. RFC 4884 standardized this behavior.
> >>>
> >>> It introduces an explicit original datagram length field in the ICMP
> >>> header to delineate the original datagram from the extension struct.
> >>>
> >>> Return this field when reading an ICMP error from the error queue.
> >>
> >> RFC mentions a 'length' field of 8 bits, your patch chose to export the whole
> >> second word of icmp header.
> >>
> >> Why is this field mapped to a prior one (icmp_hdr(skb)->un.gateway) ?
> >>
> >> Should we add an element in the union to make this a little bit more explicit/readable ?
> >>
> >> diff --git a/include/uapi/linux/icmp.h b/include/uapi/linux/icmp.h
> >> index 5589eeb791ca580bb182e1dc38c05eab1c75adb9..427ed5a6765316a4c1e2fa06f3b6618447c01564 100644
> >> --- a/include/uapi/linux/icmp.h
> >> +++ b/include/uapi/linux/icmp.h
> >> @@ -76,6 +76,7 @@ struct icmphdr {
> >>                 __be16  sequence;
> >>         } echo;
> >>         __be32  gateway;
> >> +       __be32  second_word; /* RFC 4884 4.[123] : <unused:8>,<length:8>,<mtu:16> */
> >>         struct {
> >>                 __be16  __unused;
> >>                 __be16  mtu;
> >
> > Okay. How about a variant of the existing struct frag?
> >
> > @@ -80,6 +80,11 @@ struct icmphdr {
> >                 __be16  __unused;
> >                 __be16  mtu;
> >         } frag;
> > +       struct {
> > +               __u8    __unused;
> > +               __u8    length;
> > +               __be16  mtu;
> > +       } rfc_4884;
> >         __u8    reserved[4];
> >    } un;
> >
>
> Sure, but my point was later in the code :
>
> >>> +     if (inet_sk(sk)->recverr_rfc4884)
> >>> +             info = ntohl(icmp_hdr(skb)->un.gateway);
> >>
> >> ntohl(icmp_hdr(skb)->un.second_word);
>
> If you leave there "info = ntohl(icmp_hdr(skb)->un.gateway)" it is a bit hard for someone
> reading linux kernel code to understand why we do this.
>
It's also potentially problematic. The other bits are Unused, which
isn't the same thing as necessarily being zero. Userspace might assume
that info is the length without checking its bounded.

>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] icmp: support rfc 4884
  2020-06-30  0:36       ` Tom Herbert
@ 2020-06-30  2:18         ` Willem de Bruijn
  2020-06-30 13:57           ` Willem de Bruijn
  0 siblings, 1 reply; 16+ messages in thread
From: Willem de Bruijn @ 2020-06-30  2:18 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, Willem de Bruijn, Network Development, David Miller

On Mon, Jun 29, 2020 at 8:37 PM Tom Herbert <tom@herbertland.com> wrote:
>
> On Mon, Jun 29, 2020 at 4:07 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> >
> >
> > On 6/29/20 2:30 PM, Willem de Bruijn wrote:
> > > On Mon, Jun 29, 2020 at 5:15 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > >>
> > >>
> > >>
> > >> On 6/29/20 9:57 AM, Willem de Bruijn wrote:
> > >>> From: Willem de Bruijn <willemb@google.com>
> > >>>
> > >>> ICMP messages may include an extension structure after the original
> > >>> datagram. RFC 4884 standardized this behavior.
> > >>>
> > >>> It introduces an explicit original datagram length field in the ICMP
> > >>> header to delineate the original datagram from the extension struct.
> > >>>
> > >>> Return this field when reading an ICMP error from the error queue.
> > >>
> > >> RFC mentions a 'length' field of 8 bits, your patch chose to export the whole
> > >> second word of icmp header.
> > >>
> > >> Why is this field mapped to a prior one (icmp_hdr(skb)->un.gateway) ?
> > >>
> > >> Should we add an element in the union to make this a little bit more explicit/readable ?
> > >>
> > >> diff --git a/include/uapi/linux/icmp.h b/include/uapi/linux/icmp.h
> > >> index 5589eeb791ca580bb182e1dc38c05eab1c75adb9..427ed5a6765316a4c1e2fa06f3b6618447c01564 100644
> > >> --- a/include/uapi/linux/icmp.h
> > >> +++ b/include/uapi/linux/icmp.h
> > >> @@ -76,6 +76,7 @@ struct icmphdr {
> > >>                 __be16  sequence;
> > >>         } echo;
> > >>         __be32  gateway;
> > >> +       __be32  second_word; /* RFC 4884 4.[123] : <unused:8>,<length:8>,<mtu:16> */
> > >>         struct {
> > >>                 __be16  __unused;
> > >>                 __be16  mtu;
> > >
> > > Okay. How about a variant of the existing struct frag?
> > >
> > > @@ -80,6 +80,11 @@ struct icmphdr {
> > >                 __be16  __unused;
> > >                 __be16  mtu;
> > >         } frag;
> > > +       struct {
> > > +               __u8    __unused;
> > > +               __u8    length;
> > > +               __be16  mtu;
> > > +       } rfc_4884;
> > >         __u8    reserved[4];
> > >    } un;
> > >
> >
> > Sure, but my point was later in the code :
> >
> > >>> +     if (inet_sk(sk)->recverr_rfc4884)
> > >>> +             info = ntohl(icmp_hdr(skb)->un.gateway);
> > >>
> > >> ntohl(icmp_hdr(skb)->un.second_word);
> >
> > If you leave there "info = ntohl(icmp_hdr(skb)->un.gateway)" it is a bit hard for someone
> > reading linux kernel code to understand why we do this.
> >
> It's also potentially problematic. The other bits are Unused, which
> isn't the same thing as necessarily being zero. Userspace might assume
> that info is the length without checking its bounded.

It shouldn't. The icmp type and code are passed in sock_extended_err
as ee_type and ee_code. So it can demultiplex the meaning of the rest
of the icmp header.

It just needs access to the other 32-bits, which indeed are context
sensitive. It makes more sense to me to let userspace demultiplex this
in one place, rather than demultiplex in the kernel and define a new,
likely no simpler, data structure to share with userspace.

Specific to RFC 4884, the 8-bit length field coexists with the
16-bit mtu field in case of ICMP_FRAG_NEEDED, so we cannot just pass
the first as ee_info in RFC 4884 mode. sock_extended_err additionally
has ee_data, but after that we're out of fields, too, so this approach
is not very future proof to additional ICMP extensions.

On your previous point, it might be useful to define struct rfc_4884
equivalent outside struct icmphdr, so that an application can easily
cast to that. RFC 4884 itself does not define any extension objects.
That is out of scope there, and in my opinion, here. Again, better
left to userspace. Especially because as it describes, it standardized
the behavior after observing non-compliant, but existing in the wild,
proprietary extension variants. Users may have to change how they
interpret the fields based on what they have deployed.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] icmp: support rfc 4884
  2020-06-30  2:18         ` Willem de Bruijn
@ 2020-06-30 13:57           ` Willem de Bruijn
  2020-06-30 16:16             ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Willem de Bruijn @ 2020-06-30 13:57 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Eric Dumazet, Network Development, David Miller

On Mon, Jun 29, 2020 at 10:19 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Mon, Jun 29, 2020 at 8:37 PM Tom Herbert <tom@herbertland.com> wrote:
> >
> > On Mon, Jun 29, 2020 at 4:07 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > >
> > >
> > >
> > > On 6/29/20 2:30 PM, Willem de Bruijn wrote:
> > > > On Mon, Jun 29, 2020 at 5:15 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > >>
> > > >>
> > > >>
> > > >> On 6/29/20 9:57 AM, Willem de Bruijn wrote:
> > > >>> From: Willem de Bruijn <willemb@google.com>
> > > >>>
> > > >>> ICMP messages may include an extension structure after the original
> > > >>> datagram. RFC 4884 standardized this behavior.
> > > >>>
> > > >>> It introduces an explicit original datagram length field in the ICMP
> > > >>> header to delineate the original datagram from the extension struct.
> > > >>>
> > > >>> Return this field when reading an ICMP error from the error queue.
> > > >>
> > > >> RFC mentions a 'length' field of 8 bits, your patch chose to export the whole
> > > >> second word of icmp header.
> > > >>
> > > >> Why is this field mapped to a prior one (icmp_hdr(skb)->un.gateway) ?
> > > >>
> > > >> Should we add an element in the union to make this a little bit more explicit/readable ?
> > > >>
> > > >> diff --git a/include/uapi/linux/icmp.h b/include/uapi/linux/icmp.h
> > > >> index 5589eeb791ca580bb182e1dc38c05eab1c75adb9..427ed5a6765316a4c1e2fa06f3b6618447c01564 100644
> > > >> --- a/include/uapi/linux/icmp.h
> > > >> +++ b/include/uapi/linux/icmp.h
> > > >> @@ -76,6 +76,7 @@ struct icmphdr {
> > > >>                 __be16  sequence;
> > > >>         } echo;
> > > >>         __be32  gateway;
> > > >> +       __be32  second_word; /* RFC 4884 4.[123] : <unused:8>,<length:8>,<mtu:16> */
> > > >>         struct {
> > > >>                 __be16  __unused;
> > > >>                 __be16  mtu;
> > > >
> > > > Okay. How about a variant of the existing struct frag?
> > > >
> > > > @@ -80,6 +80,11 @@ struct icmphdr {
> > > >                 __be16  __unused;
> > > >                 __be16  mtu;
> > > >         } frag;
> > > > +       struct {
> > > > +               __u8    __unused;
> > > > +               __u8    length;
> > > > +               __be16  mtu;
> > > > +       } rfc_4884;
> > > >         __u8    reserved[4];
> > > >    } un;
> > > >
> > >
> > > Sure, but my point was later in the code :
> > >
> > > >>> +     if (inet_sk(sk)->recverr_rfc4884)
> > > >>> +             info = ntohl(icmp_hdr(skb)->un.gateway);
> > > >>
> > > >> ntohl(icmp_hdr(skb)->un.second_word);
> > >
> > > If you leave there "info = ntohl(icmp_hdr(skb)->un.gateway)" it is a bit hard for someone
> > > reading linux kernel code to understand why we do this.
> > >
> > It's also potentially problematic. The other bits are Unused, which
> > isn't the same thing as necessarily being zero. Userspace might assume
> > that info is the length without checking its bounded.
>
> It shouldn't. The icmp type and code are passed in sock_extended_err
> as ee_type and ee_code. So it can demultiplex the meaning of the rest
> of the icmp header.
>
> It just needs access to the other 32-bits, which indeed are context
> sensitive. It makes more sense to me to let userspace demultiplex this
> in one place, rather than demultiplex in the kernel and define a new,
> likely no simpler, data structure to share with userspace.
>
> Specific to RFC 4884, the 8-bit length field coexists with the
> 16-bit mtu field in case of ICMP_FRAG_NEEDED, so we cannot just pass
> the first as ee_info in RFC 4884 mode. sock_extended_err additionally
> has ee_data, but after that we're out of fields, too, so this approach
> is not very future proof to additional ICMP extensions.
>
> On your previous point, it might be useful to define struct rfc_4884
> equivalent outside struct icmphdr, so that an application can easily
> cast to that. RFC 4884 itself does not define any extension objects.
> That is out of scope there, and in my opinion, here. Again, better
> left to userspace. Especially because as it describes, it standardized
> the behavior after observing non-compliant, but existing in the wild,
> proprietary extension variants. Users may have to change how they
> interpret the fields based on what they have deployed.

As this just shares the raw icmp header data, I should probably
change the name to something less specific to RFC 4884.

Since it would also help with decoding other extensions, such as
the one you mention in  draft-ietf-6man-icmp-limits-08.

Unfortunately I cannot simply reserve IP_RECVERR with integer 2.
Perhaps IP_RECVERR_EXINFO.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] icmp: support rfc 4884
  2020-06-30 13:57           ` Willem de Bruijn
@ 2020-06-30 16:16             ` Eric Dumazet
  2020-06-30 16:41               ` Tom Herbert
  2020-06-30 16:41               ` Willem de Bruijn
  0 siblings, 2 replies; 16+ messages in thread
From: Eric Dumazet @ 2020-06-30 16:16 UTC (permalink / raw)
  To: Willem de Bruijn, Tom Herbert
  Cc: Eric Dumazet, Network Development, David Miller



On 6/30/20 6:57 AM, Willem de Bruijn wrote:
> On Mon, Jun 29, 2020 at 10:19 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>>
>> On Mon, Jun 29, 2020 at 8:37 PM Tom Herbert <tom@herbertland.com> wrote:
>>>
>>> On Mon, Jun 29, 2020 at 4:07 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> On 6/29/20 2:30 PM, Willem de Bruijn wrote:
>>>>> On Mon, Jun 29, 2020 at 5:15 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 6/29/20 9:57 AM, Willem de Bruijn wrote:
>>>>>>> From: Willem de Bruijn <willemb@google.com>
>>>>>>>
>>>>>>> ICMP messages may include an extension structure after the original
>>>>>>> datagram. RFC 4884 standardized this behavior.
>>>>>>>
>>>>>>> It introduces an explicit original datagram length field in the ICMP
>>>>>>> header to delineate the original datagram from the extension struct.
>>>>>>>
>>>>>>> Return this field when reading an ICMP error from the error queue.
>>>>>>
>>>>>> RFC mentions a 'length' field of 8 bits, your patch chose to export the whole
>>>>>> second word of icmp header.
>>>>>>
>>>>>> Why is this field mapped to a prior one (icmp_hdr(skb)->un.gateway) ?
>>>>>>
>>>>>> Should we add an element in the union to make this a little bit more explicit/readable ?
>>>>>>
>>>>>> diff --git a/include/uapi/linux/icmp.h b/include/uapi/linux/icmp.h
>>>>>> index 5589eeb791ca580bb182e1dc38c05eab1c75adb9..427ed5a6765316a4c1e2fa06f3b6618447c01564 100644
>>>>>> --- a/include/uapi/linux/icmp.h
>>>>>> +++ b/include/uapi/linux/icmp.h
>>>>>> @@ -76,6 +76,7 @@ struct icmphdr {
>>>>>>                 __be16  sequence;
>>>>>>         } echo;
>>>>>>         __be32  gateway;
>>>>>> +       __be32  second_word; /* RFC 4884 4.[123] : <unused:8>,<length:8>,<mtu:16> */
>>>>>>         struct {
>>>>>>                 __be16  __unused;
>>>>>>                 __be16  mtu;
>>>>>
>>>>> Okay. How about a variant of the existing struct frag?
>>>>>
>>>>> @@ -80,6 +80,11 @@ struct icmphdr {
>>>>>                 __be16  __unused;
>>>>>                 __be16  mtu;
>>>>>         } frag;
>>>>> +       struct {
>>>>> +               __u8    __unused;
>>>>> +               __u8    length;
>>>>> +               __be16  mtu;
>>>>> +       } rfc_4884;
>>>>>         __u8    reserved[4];
>>>>>    } un;
>>>>>
>>>>
>>>> Sure, but my point was later in the code :
>>>>
>>>>>>> +     if (inet_sk(sk)->recverr_rfc4884)
>>>>>>> +             info = ntohl(icmp_hdr(skb)->un.gateway);
>>>>>>
>>>>>> ntohl(icmp_hdr(skb)->un.second_word);
>>>>
>>>> If you leave there "info = ntohl(icmp_hdr(skb)->un.gateway)" it is a bit hard for someone
>>>> reading linux kernel code to understand why we do this.
>>>>
>>> It's also potentially problematic. The other bits are Unused, which
>>> isn't the same thing as necessarily being zero. Userspace might assume
>>> that info is the length without checking its bounded.
>>
>> It shouldn't. The icmp type and code are passed in sock_extended_err
>> as ee_type and ee_code. So it can demultiplex the meaning of the rest
>> of the icmp header.
>>
>> It just needs access to the other 32-bits, which indeed are context
>> sensitive. It makes more sense to me to let userspace demultiplex this
>> in one place, rather than demultiplex in the kernel and define a new,
>> likely no simpler, data structure to share with userspace.
>>
>> Specific to RFC 4884, the 8-bit length field coexists with the
>> 16-bit mtu field in case of ICMP_FRAG_NEEDED, so we cannot just pass
>> the first as ee_info in RFC 4884 mode. sock_extended_err additionally
>> has ee_data, but after that we're out of fields, too, so this approach
>> is not very future proof to additional ICMP extensions.
>>
>> On your previous point, it might be useful to define struct rfc_4884
>> equivalent outside struct icmphdr, so that an application can easily
>> cast to that. RFC 4884 itself does not define any extension objects.
>> That is out of scope there, and in my opinion, here. Again, better
>> left to userspace. Especially because as it describes, it standardized
>> the behavior after observing non-compliant, but existing in the wild,
>> proprietary extension variants. Users may have to change how they
>> interpret the fields based on what they have deployed.
> 
> As this just shares the raw icmp header data, I should probably
> change the name to something less specific to RFC 4884.
> 
> Since it would also help with decoding other extensions, such as
> the one you mention in  draft-ietf-6man-icmp-limits-08.
> 
> Unfortunately I cannot simply reserve IP_RECVERR with integer 2.
> Perhaps IP_RECVERR_EXINFO.
> 

Perhaps let the icmp header as is, but provides the extra information
as an explicit ancillary message in ip_recv_error() ?

Really this is all about documentation and providing stable API.

Possible alternative would be to add an union over ee_pad

Legacy applications would always get 0 for ee_pad/ee_length, while
applications enabling IP_RECVERR_RFC4884 would access the wire value.


diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index ca5cb3e3c6df745aa4c886ba7b4f88179fa22d86..509a5a6ccc555705ef867d7580553d289d559786 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -10,7 +10,10 @@ struct sock_extended_err {
        __u8    ee_origin;
        __u8    ee_type;
        __u8    ee_code;
-       __u8    ee_pad;
+       union {
+               __u8    ee_pad;
+               __u8    ee_length; /* RFC 4884 (see IP_RECVERR_RFC4884) */
+       };
        __u32   ee_info;
        __u32   ee_data;
 };

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] icmp: support rfc 4884
  2020-06-30 16:16             ` Eric Dumazet
@ 2020-06-30 16:41               ` Tom Herbert
  2020-06-30 16:52                 ` Willem de Bruijn
  2020-06-30 16:41               ` Willem de Bruijn
  1 sibling, 1 reply; 16+ messages in thread
From: Tom Herbert @ 2020-06-30 16:41 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willem de Bruijn, Network Development, David Miller

On Tue, Jun 30, 2020 at 9:16 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 6/30/20 6:57 AM, Willem de Bruijn wrote:
> > On Mon, Jun 29, 2020 at 10:19 PM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> >>
> >> On Mon, Jun 29, 2020 at 8:37 PM Tom Herbert <tom@herbertland.com> wrote:
> >>>
> >>> On Mon, Jun 29, 2020 at 4:07 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 6/29/20 2:30 PM, Willem de Bruijn wrote:
> >>>>> On Mon, Jun 29, 2020 at 5:15 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 6/29/20 9:57 AM, Willem de Bruijn wrote:
> >>>>>>> From: Willem de Bruijn <willemb@google.com>
> >>>>>>>
> >>>>>>> ICMP messages may include an extension structure after the original
> >>>>>>> datagram. RFC 4884 standardized this behavior.
> >>>>>>>
> >>>>>>> It introduces an explicit original datagram length field in the ICMP
> >>>>>>> header to delineate the original datagram from the extension struct.
> >>>>>>>
> >>>>>>> Return this field when reading an ICMP error from the error queue.
> >>>>>>
> >>>>>> RFC mentions a 'length' field of 8 bits, your patch chose to export the whole
> >>>>>> second word of icmp header.
> >>>>>>
> >>>>>> Why is this field mapped to a prior one (icmp_hdr(skb)->un.gateway) ?
> >>>>>>
> >>>>>> Should we add an element in the union to make this a little bit more explicit/readable ?
> >>>>>>
> >>>>>> diff --git a/include/uapi/linux/icmp.h b/include/uapi/linux/icmp.h
> >>>>>> index 5589eeb791ca580bb182e1dc38c05eab1c75adb9..427ed5a6765316a4c1e2fa06f3b6618447c01564 100644
> >>>>>> --- a/include/uapi/linux/icmp.h
> >>>>>> +++ b/include/uapi/linux/icmp.h
> >>>>>> @@ -76,6 +76,7 @@ struct icmphdr {
> >>>>>>                 __be16  sequence;
> >>>>>>         } echo;
> >>>>>>         __be32  gateway;
> >>>>>> +       __be32  second_word; /* RFC 4884 4.[123] : <unused:8>,<length:8>,<mtu:16> */
> >>>>>>         struct {
> >>>>>>                 __be16  __unused;
> >>>>>>                 __be16  mtu;
> >>>>>
> >>>>> Okay. How about a variant of the existing struct frag?
> >>>>>
> >>>>> @@ -80,6 +80,11 @@ struct icmphdr {
> >>>>>                 __be16  __unused;
> >>>>>                 __be16  mtu;
> >>>>>         } frag;
> >>>>> +       struct {
> >>>>> +               __u8    __unused;
> >>>>> +               __u8    length;
> >>>>> +               __be16  mtu;
> >>>>> +       } rfc_4884;
> >>>>>         __u8    reserved[4];
> >>>>>    } un;
> >>>>>
> >>>>
> >>>> Sure, but my point was later in the code :
> >>>>
> >>>>>>> +     if (inet_sk(sk)->recverr_rfc4884)
> >>>>>>> +             info = ntohl(icmp_hdr(skb)->un.gateway);
> >>>>>>
> >>>>>> ntohl(icmp_hdr(skb)->un.second_word);
> >>>>
> >>>> If you leave there "info = ntohl(icmp_hdr(skb)->un.gateway)" it is a bit hard for someone
> >>>> reading linux kernel code to understand why we do this.
> >>>>
> >>> It's also potentially problematic. The other bits are Unused, which
> >>> isn't the same thing as necessarily being zero. Userspace might assume
> >>> that info is the length without checking its bounded.
> >>
> >> It shouldn't. The icmp type and code are passed in sock_extended_err
> >> as ee_type and ee_code. So it can demultiplex the meaning of the rest
> >> of the icmp header.
> >>
> >> It just needs access to the other 32-bits, which indeed are context
> >> sensitive. It makes more sense to me to let userspace demultiplex this
> >> in one place, rather than demultiplex in the kernel and define a new,
> >> likely no simpler, data structure to share with userspace.
> >>
> >> Specific to RFC 4884, the 8-bit length field coexists with the
> >> 16-bit mtu field in case of ICMP_FRAG_NEEDED, so we cannot just pass
> >> the first as ee_info in RFC 4884 mode. sock_extended_err additionally
> >> has ee_data, but after that we're out of fields, too, so this approach
> >> is not very future proof to additional ICMP extensions.
> >>
> >> On your previous point, it might be useful to define struct rfc_4884
> >> equivalent outside struct icmphdr, so that an application can easily
> >> cast to that. RFC 4884 itself does not define any extension objects.
> >> That is out of scope there, and in my opinion, here. Again, better
> >> left to userspace. Especially because as it describes, it standardized
> >> the behavior after observing non-compliant, but existing in the wild,
> >> proprietary extension variants. Users may have to change how they
> >> interpret the fields based on what they have deployed.
> >
> > As this just shares the raw icmp header data, I should probably
> > change the name to something less specific to RFC 4884.
> >
> > Since it would also help with decoding other extensions, such as
> > the one you mention in  draft-ietf-6man-icmp-limits-08.
> >
> > Unfortunately I cannot simply reserve IP_RECVERR with integer 2.
> > Perhaps IP_RECVERR_EXINFO.
> >
>
> Perhaps let the icmp header as is, but provides the extra information
> as an explicit ancillary message in ip_recv_error() ?
>
> Really this is all about documentation and providing stable API.
>
Actually, I think we may have a subtle bug here.

RFC4884 appends the ICMP extension to the original datagram. The RFC
describes backwards compatibility in section 5. To be backwards
compatible with legacy implementations that don't know how to parse
the extension, and in particular to find the length of the original
datagram in the data, the requirement is that at the original datagram
is at least 128 bytes long and it seems to assume no ICMP application
need to parse beyond that. But parsing beyond 128 bytes is possible,
consider that the original datagram was UDP/IPv6 with an extension
header such that the UDP header is offset beyond 128 bytes in the
packet. If we don't take this into account, the UDP ports for socket
lookup would be incorrect.

To fix this, we could check the Length field per the types that
support extensions as described in RFC4884. If it's non-zero then
assume the extension is present, so before processing the original
datagram, e.g. performing socket lookup, trim the skb to the length of
the orignal datagram.

Tom

> Possible alternative would be to add an union over ee_pad
>
> Legacy applications would always get 0 for ee_pad/ee_length, while
> applications enabling IP_RECVERR_RFC4884 would access the wire value.
>
>
> diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
> index ca5cb3e3c6df745aa4c886ba7b4f88179fa22d86..509a5a6ccc555705ef867d7580553d289d559786 100644
> --- a/include/uapi/linux/errqueue.h
> +++ b/include/uapi/linux/errqueue.h
> @@ -10,7 +10,10 @@ struct sock_extended_err {
>         __u8    ee_origin;
>         __u8    ee_type;
>         __u8    ee_code;
> -       __u8    ee_pad;
> +       union {
> +               __u8    ee_pad;
> +               __u8    ee_length; /* RFC 4884 (see IP_RECVERR_RFC4884) */
> +       };
>         __u32   ee_info;
>         __u32   ee_data;
>  };

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] icmp: support rfc 4884
  2020-06-30 16:16             ` Eric Dumazet
  2020-06-30 16:41               ` Tom Herbert
@ 2020-06-30 16:41               ` Willem de Bruijn
  1 sibling, 0 replies; 16+ messages in thread
From: Willem de Bruijn @ 2020-06-30 16:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willem de Bruijn, Tom Herbert, Network Development, David Miller

On Tue, Jun 30, 2020 at 12:16 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 6/30/20 6:57 AM, Willem de Bruijn wrote:
> > On Mon, Jun 29, 2020 at 10:19 PM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> >>
> >> On Mon, Jun 29, 2020 at 8:37 PM Tom Herbert <tom@herbertland.com> wrote:
> >>>
> >>> On Mon, Jun 29, 2020 at 4:07 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 6/29/20 2:30 PM, Willem de Bruijn wrote:
> >>>>> On Mon, Jun 29, 2020 at 5:15 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 6/29/20 9:57 AM, Willem de Bruijn wrote:
> >>>>>>> From: Willem de Bruijn <willemb@google.com>
> >>>>>>>
> >>>>>>> ICMP messages may include an extension structure after the original
> >>>>>>> datagram. RFC 4884 standardized this behavior.
> >>>>>>>
> >>>>>>> It introduces an explicit original datagram length field in the ICMP
> >>>>>>> header to delineate the original datagram from the extension struct.
> >>>>>>>
> >>>>>>> Return this field when reading an ICMP error from the error queue.
> >>>>>>
> >>>>>> RFC mentions a 'length' field of 8 bits, your patch chose to export the whole
> >>>>>> second word of icmp header.
> >>>>>>
> >>>>>> Why is this field mapped to a prior one (icmp_hdr(skb)->un.gateway) ?
> >>>>>>
> >>>>>> Should we add an element in the union to make this a little bit more explicit/readable ?
> >>>>>>
> >>>>>> diff --git a/include/uapi/linux/icmp.h b/include/uapi/linux/icmp.h
> >>>>>> index 5589eeb791ca580bb182e1dc38c05eab1c75adb9..427ed5a6765316a4c1e2fa06f3b6618447c01564 100644
> >>>>>> --- a/include/uapi/linux/icmp.h
> >>>>>> +++ b/include/uapi/linux/icmp.h
> >>>>>> @@ -76,6 +76,7 @@ struct icmphdr {
> >>>>>>                 __be16  sequence;
> >>>>>>         } echo;
> >>>>>>         __be32  gateway;
> >>>>>> +       __be32  second_word; /* RFC 4884 4.[123] : <unused:8>,<length:8>,<mtu:16> */
> >>>>>>         struct {
> >>>>>>                 __be16  __unused;
> >>>>>>                 __be16  mtu;
> >>>>>
> >>>>> Okay. How about a variant of the existing struct frag?
> >>>>>
> >>>>> @@ -80,6 +80,11 @@ struct icmphdr {
> >>>>>                 __be16  __unused;
> >>>>>                 __be16  mtu;
> >>>>>         } frag;
> >>>>> +       struct {
> >>>>> +               __u8    __unused;
> >>>>> +               __u8    length;
> >>>>> +               __be16  mtu;
> >>>>> +       } rfc_4884;
> >>>>>         __u8    reserved[4];
> >>>>>    } un;
> >>>>>
> >>>>
> >>>> Sure, but my point was later in the code :
> >>>>
> >>>>>>> +     if (inet_sk(sk)->recverr_rfc4884)
> >>>>>>> +             info = ntohl(icmp_hdr(skb)->un.gateway);
> >>>>>>
> >>>>>> ntohl(icmp_hdr(skb)->un.second_word);
> >>>>
> >>>> If you leave there "info = ntohl(icmp_hdr(skb)->un.gateway)" it is a bit hard for someone
> >>>> reading linux kernel code to understand why we do this.
> >>>>
> >>> It's also potentially problematic. The other bits are Unused, which
> >>> isn't the same thing as necessarily being zero. Userspace might assume
> >>> that info is the length without checking its bounded.
> >>
> >> It shouldn't. The icmp type and code are passed in sock_extended_err
> >> as ee_type and ee_code. So it can demultiplex the meaning of the rest
> >> of the icmp header.
> >>
> >> It just needs access to the other 32-bits, which indeed are context
> >> sensitive. It makes more sense to me to let userspace demultiplex this
> >> in one place, rather than demultiplex in the kernel and define a new,
> >> likely no simpler, data structure to share with userspace.
> >>
> >> Specific to RFC 4884, the 8-bit length field coexists with the
> >> 16-bit mtu field in case of ICMP_FRAG_NEEDED, so we cannot just pass
> >> the first as ee_info in RFC 4884 mode. sock_extended_err additionally
> >> has ee_data, but after that we're out of fields, too, so this approach
> >> is not very future proof to additional ICMP extensions.
> >>
> >> On your previous point, it might be useful to define struct rfc_4884
> >> equivalent outside struct icmphdr, so that an application can easily
> >> cast to that. RFC 4884 itself does not define any extension objects.
> >> That is out of scope there, and in my opinion, here. Again, better
> >> left to userspace. Especially because as it describes, it standardized
> >> the behavior after observing non-compliant, but existing in the wild,
> >> proprietary extension variants. Users may have to change how they
> >> interpret the fields based on what they have deployed.
> >
> > As this just shares the raw icmp header data, I should probably
> > change the name to something less specific to RFC 4884.
> >
> > Since it would also help with decoding other extensions, such as
> > the one you mention in  draft-ietf-6man-icmp-limits-08.
> >
> > Unfortunately I cannot simply reserve IP_RECVERR with integer 2.
> > Perhaps IP_RECVERR_EXINFO.
> >
>
> Perhaps let the icmp header as is, but provides the extra information
> as an explicit ancillary message in ip_recv_error() ?
>
> Really this is all about documentation and providing stable API.

Understood. Of course happy to discuss alternatives, as it does set
things in stone.

>
> Possible alternative would be to add an union over ee_pad
>
> Legacy applications would always get 0 for ee_pad/ee_length, while
> applications enabling IP_RECVERR_RFC4884 would access the wire value.

And leave __u32 ee_data free for other uses.

I find it much more intuitive to just unconditionally pass the 32 bit
data that an application may need to be able to decode any ICMP
message (along with ee_type + ee_code), rather than start defining
ee_pad + ee_data fields in context dependent ways.

As for ICMP_FRAG_NEEDED now 24 of the 32 bits are defined, something
will inevitably find a use for the remaining 8 bits, and then we need
another kernel feature.

Also, if going down this path I will have to add the same for IPv6,
while it already exposes all this information userspace needs in
ee_info.

That said, if consensus is that the kernel should make more of an
effort to return this data in a structured form, and it is limited to
32 bits overall, the ee_pad/ee_len union for this case has my
preference. CMSG parsing adds a lot of boilerplate to each
application.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] icmp: support rfc 4884
  2020-06-30 16:41               ` Tom Herbert
@ 2020-06-30 16:52                 ` Willem de Bruijn
  2020-06-30 18:57                   ` Tom Herbert
  0 siblings, 1 reply; 16+ messages in thread
From: Willem de Bruijn @ 2020-06-30 16:52 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, Willem de Bruijn, Network Development, David Miller

On Tue, Jun 30, 2020 at 12:41 PM Tom Herbert <tom@herbertland.com> wrote:
>
> On Tue, Jun 30, 2020 at 9:16 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> >
> >
> > On 6/30/20 6:57 AM, Willem de Bruijn wrote:
> > > On Mon, Jun 29, 2020 at 10:19 PM Willem de Bruijn
> > > <willemdebruijn.kernel@gmail.com> wrote:
> > >>
> > >> On Mon, Jun 29, 2020 at 8:37 PM Tom Herbert <tom@herbertland.com> wrote:
> > >>>
> > >>> On Mon, Jun 29, 2020 at 4:07 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > >>>>
> > >>>>
> > >>>>
> > >>>> On 6/29/20 2:30 PM, Willem de Bruijn wrote:
> > >>>>> On Mon, Jun 29, 2020 at 5:15 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> On 6/29/20 9:57 AM, Willem de Bruijn wrote:
> > >>>>>>> From: Willem de Bruijn <willemb@google.com>
> > >>>>>>>
> > >>>>>>> ICMP messages may include an extension structure after the original
> > >>>>>>> datagram. RFC 4884 standardized this behavior.
> > >>>>>>>
> > >>>>>>> It introduces an explicit original datagram length field in the ICMP
> > >>>>>>> header to delineate the original datagram from the extension struct.
> > >>>>>>>
> > >>>>>>> Return this field when reading an ICMP error from the error queue.
> > >>>>>>
> > >>>>>> RFC mentions a 'length' field of 8 bits, your patch chose to export the whole
> > >>>>>> second word of icmp header.
> > >>>>>>
> > >>>>>> Why is this field mapped to a prior one (icmp_hdr(skb)->un.gateway) ?
> > >>>>>>
> > >>>>>> Should we add an element in the union to make this a little bit more explicit/readable ?
> > >>>>>>
> > >>>>>> diff --git a/include/uapi/linux/icmp.h b/include/uapi/linux/icmp.h
> > >>>>>> index 5589eeb791ca580bb182e1dc38c05eab1c75adb9..427ed5a6765316a4c1e2fa06f3b6618447c01564 100644
> > >>>>>> --- a/include/uapi/linux/icmp.h
> > >>>>>> +++ b/include/uapi/linux/icmp.h
> > >>>>>> @@ -76,6 +76,7 @@ struct icmphdr {
> > >>>>>>                 __be16  sequence;
> > >>>>>>         } echo;
> > >>>>>>         __be32  gateway;
> > >>>>>> +       __be32  second_word; /* RFC 4884 4.[123] : <unused:8>,<length:8>,<mtu:16> */
> > >>>>>>         struct {
> > >>>>>>                 __be16  __unused;
> > >>>>>>                 __be16  mtu;
> > >>>>>
> > >>>>> Okay. How about a variant of the existing struct frag?
> > >>>>>
> > >>>>> @@ -80,6 +80,11 @@ struct icmphdr {
> > >>>>>                 __be16  __unused;
> > >>>>>                 __be16  mtu;
> > >>>>>         } frag;
> > >>>>> +       struct {
> > >>>>> +               __u8    __unused;
> > >>>>> +               __u8    length;
> > >>>>> +               __be16  mtu;
> > >>>>> +       } rfc_4884;
> > >>>>>         __u8    reserved[4];
> > >>>>>    } un;
> > >>>>>
> > >>>>
> > >>>> Sure, but my point was later in the code :
> > >>>>
> > >>>>>>> +     if (inet_sk(sk)->recverr_rfc4884)
> > >>>>>>> +             info = ntohl(icmp_hdr(skb)->un.gateway);
> > >>>>>>
> > >>>>>> ntohl(icmp_hdr(skb)->un.second_word);
> > >>>>
> > >>>> If you leave there "info = ntohl(icmp_hdr(skb)->un.gateway)" it is a bit hard for someone
> > >>>> reading linux kernel code to understand why we do this.
> > >>>>
> > >>> It's also potentially problematic. The other bits are Unused, which
> > >>> isn't the same thing as necessarily being zero. Userspace might assume
> > >>> that info is the length without checking its bounded.
> > >>
> > >> It shouldn't. The icmp type and code are passed in sock_extended_err
> > >> as ee_type and ee_code. So it can demultiplex the meaning of the rest
> > >> of the icmp header.
> > >>
> > >> It just needs access to the other 32-bits, which indeed are context
> > >> sensitive. It makes more sense to me to let userspace demultiplex this
> > >> in one place, rather than demultiplex in the kernel and define a new,
> > >> likely no simpler, data structure to share with userspace.
> > >>
> > >> Specific to RFC 4884, the 8-bit length field coexists with the
> > >> 16-bit mtu field in case of ICMP_FRAG_NEEDED, so we cannot just pass
> > >> the first as ee_info in RFC 4884 mode. sock_extended_err additionally
> > >> has ee_data, but after that we're out of fields, too, so this approach
> > >> is not very future proof to additional ICMP extensions.
> > >>
> > >> On your previous point, it might be useful to define struct rfc_4884
> > >> equivalent outside struct icmphdr, so that an application can easily
> > >> cast to that. RFC 4884 itself does not define any extension objects.
> > >> That is out of scope there, and in my opinion, here. Again, better
> > >> left to userspace. Especially because as it describes, it standardized
> > >> the behavior after observing non-compliant, but existing in the wild,
> > >> proprietary extension variants. Users may have to change how they
> > >> interpret the fields based on what they have deployed.
> > >
> > > As this just shares the raw icmp header data, I should probably
> > > change the name to something less specific to RFC 4884.
> > >
> > > Since it would also help with decoding other extensions, such as
> > > the one you mention in  draft-ietf-6man-icmp-limits-08.
> > >
> > > Unfortunately I cannot simply reserve IP_RECVERR with integer 2.
> > > Perhaps IP_RECVERR_EXINFO.
> > >
> >
> > Perhaps let the icmp header as is, but provides the extra information
> > as an explicit ancillary message in ip_recv_error() ?
> >
> > Really this is all about documentation and providing stable API.
> >
> Actually, I think we may have a subtle bug here.
>
> RFC4884 appends the ICMP extension to the original datagram. The RFC
> describes backwards compatibility in section 5. To be backwards
> compatible with legacy implementations that don't know how to parse
> the extension, and in particular to find the length of the original
> datagram in the data, the requirement is that at the original datagram
> is at least 128 bytes long and it seems to assume no ICMP application
> need to parse beyond that. But parsing beyond 128 bytes is possible,
> consider that the original datagram was UDP/IPv6 with an extension
> header such that the UDP header is offset beyond 128 bytes in the
> packet. If we don't take this into account, the UDP ports for socket
> lookup would be incorrect.
>
> To fix this, we could check the Length field per the types that
> support extensions as described in RFC4884. If it's non-zero then
> assume the extension is present, so before processing the original
> datagram, e.g. performing socket lookup, trim the skb to the length of
> the orignal datagram.

This is somewhat orthogonal to this patch.

You are suggesting proactively protecting applications that do not
know how to parse RFC 4884 compliant packets by truncating
those unless the application sets the new setsockopt?

I'm afraid that might break legacy applications that currently do
expect extension objects. They already can in case of ICMPv6.
And unfortunately they can try in an unsafe manner by parsing
the original datagram for ICMPv4, too.

ICMPv6 differs from ICMP in that it suggests forwarding as much
of the "invoking" packet as possible without exceeding the min
MTU (1280B).

RFC 4884 mentions the minimum 128 B for ICMP but not for
ICMPv6. It specifically keeps the wording about forwarding as
much as possible.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] icmp: support rfc 4884
  2020-06-30 16:52                 ` Willem de Bruijn
@ 2020-06-30 18:57                   ` Tom Herbert
  2020-06-30 19:27                     ` Willem de Bruijn
  0 siblings, 1 reply; 16+ messages in thread
From: Tom Herbert @ 2020-06-30 18:57 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Eric Dumazet, Network Development, David Miller

On Tue, Jun 30, 2020 at 9:52 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Tue, Jun 30, 2020 at 12:41 PM Tom Herbert <tom@herbertland.com> wrote:
> >
> > On Tue, Jun 30, 2020 at 9:16 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > >
> > >
> > >
> > > On 6/30/20 6:57 AM, Willem de Bruijn wrote:
> > > > On Mon, Jun 29, 2020 at 10:19 PM Willem de Bruijn
> > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > >>
> > > >> On Mon, Jun 29, 2020 at 8:37 PM Tom Herbert <tom@herbertland.com> wrote:
> > > >>>
> > > >>> On Mon, Jun 29, 2020 at 4:07 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On 6/29/20 2:30 PM, Willem de Bruijn wrote:
> > > >>>>> On Mon, Jun 29, 2020 at 5:15 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On 6/29/20 9:57 AM, Willem de Bruijn wrote:
> > > >>>>>>> From: Willem de Bruijn <willemb@google.com>
> > > >>>>>>>
> > > >>>>>>> ICMP messages may include an extension structure after the original
> > > >>>>>>> datagram. RFC 4884 standardized this behavior.
> > > >>>>>>>
> > > >>>>>>> It introduces an explicit original datagram length field in the ICMP
> > > >>>>>>> header to delineate the original datagram from the extension struct.
> > > >>>>>>>
> > > >>>>>>> Return this field when reading an ICMP error from the error queue.
> > > >>>>>>
> > > >>>>>> RFC mentions a 'length' field of 8 bits, your patch chose to export the whole
> > > >>>>>> second word of icmp header.
> > > >>>>>>
> > > >>>>>> Why is this field mapped to a prior one (icmp_hdr(skb)->un.gateway) ?
> > > >>>>>>
> > > >>>>>> Should we add an element in the union to make this a little bit more explicit/readable ?
> > > >>>>>>
> > > >>>>>> diff --git a/include/uapi/linux/icmp.h b/include/uapi/linux/icmp.h
> > > >>>>>> index 5589eeb791ca580bb182e1dc38c05eab1c75adb9..427ed5a6765316a4c1e2fa06f3b6618447c01564 100644
> > > >>>>>> --- a/include/uapi/linux/icmp.h
> > > >>>>>> +++ b/include/uapi/linux/icmp.h
> > > >>>>>> @@ -76,6 +76,7 @@ struct icmphdr {
> > > >>>>>>                 __be16  sequence;
> > > >>>>>>         } echo;
> > > >>>>>>         __be32  gateway;
> > > >>>>>> +       __be32  second_word; /* RFC 4884 4.[123] : <unused:8>,<length:8>,<mtu:16> */
> > > >>>>>>         struct {
> > > >>>>>>                 __be16  __unused;
> > > >>>>>>                 __be16  mtu;
> > > >>>>>
> > > >>>>> Okay. How about a variant of the existing struct frag?
> > > >>>>>
> > > >>>>> @@ -80,6 +80,11 @@ struct icmphdr {
> > > >>>>>                 __be16  __unused;
> > > >>>>>                 __be16  mtu;
> > > >>>>>         } frag;
> > > >>>>> +       struct {
> > > >>>>> +               __u8    __unused;
> > > >>>>> +               __u8    length;
> > > >>>>> +               __be16  mtu;
> > > >>>>> +       } rfc_4884;
> > > >>>>>         __u8    reserved[4];
> > > >>>>>    } un;
> > > >>>>>
> > > >>>>
> > > >>>> Sure, but my point was later in the code :
> > > >>>>
> > > >>>>>>> +     if (inet_sk(sk)->recverr_rfc4884)
> > > >>>>>>> +             info = ntohl(icmp_hdr(skb)->un.gateway);
> > > >>>>>>
> > > >>>>>> ntohl(icmp_hdr(skb)->un.second_word);
> > > >>>>
> > > >>>> If you leave there "info = ntohl(icmp_hdr(skb)->un.gateway)" it is a bit hard for someone
> > > >>>> reading linux kernel code to understand why we do this.
> > > >>>>
> > > >>> It's also potentially problematic. The other bits are Unused, which
> > > >>> isn't the same thing as necessarily being zero. Userspace might assume
> > > >>> that info is the length without checking its bounded.
> > > >>
> > > >> It shouldn't. The icmp type and code are passed in sock_extended_err
> > > >> as ee_type and ee_code. So it can demultiplex the meaning of the rest
> > > >> of the icmp header.
> > > >>
> > > >> It just needs access to the other 32-bits, which indeed are context
> > > >> sensitive. It makes more sense to me to let userspace demultiplex this
> > > >> in one place, rather than demultiplex in the kernel and define a new,
> > > >> likely no simpler, data structure to share with userspace.
> > > >>
> > > >> Specific to RFC 4884, the 8-bit length field coexists with the
> > > >> 16-bit mtu field in case of ICMP_FRAG_NEEDED, so we cannot just pass
> > > >> the first as ee_info in RFC 4884 mode. sock_extended_err additionally
> > > >> has ee_data, but after that we're out of fields, too, so this approach
> > > >> is not very future proof to additional ICMP extensions.
> > > >>
> > > >> On your previous point, it might be useful to define struct rfc_4884
> > > >> equivalent outside struct icmphdr, so that an application can easily
> > > >> cast to that. RFC 4884 itself does not define any extension objects.
> > > >> That is out of scope there, and in my opinion, here. Again, better
> > > >> left to userspace. Especially because as it describes, it standardized
> > > >> the behavior after observing non-compliant, but existing in the wild,
> > > >> proprietary extension variants. Users may have to change how they
> > > >> interpret the fields based on what they have deployed.
> > > >
> > > > As this just shares the raw icmp header data, I should probably
> > > > change the name to something less specific to RFC 4884.
> > > >
> > > > Since it would also help with decoding other extensions, such as
> > > > the one you mention in  draft-ietf-6man-icmp-limits-08.
> > > >
> > > > Unfortunately I cannot simply reserve IP_RECVERR with integer 2.
> > > > Perhaps IP_RECVERR_EXINFO.
> > > >
> > >
> > > Perhaps let the icmp header as is, but provides the extra information
> > > as an explicit ancillary message in ip_recv_error() ?
> > >
> > > Really this is all about documentation and providing stable API.
> > >
> > Actually, I think we may have a subtle bug here.
> >
> > RFC4884 appends the ICMP extension to the original datagram. The RFC
> > describes backwards compatibility in section 5. To be backwards
> > compatible with legacy implementations that don't know how to parse
> > the extension, and in particular to find the length of the original
> > datagram in the data, the requirement is that at the original datagram
> > is at least 128 bytes long and it seems to assume no ICMP application
> > need to parse beyond that. But parsing beyond 128 bytes is possible,
> > consider that the original datagram was UDP/IPv6 with an extension
> > header such that the UDP header is offset beyond 128 bytes in the
> > packet. If we don't take this into account, the UDP ports for socket
> > lookup would be incorrect.
> >
> > To fix this, we could check the Length field per the types that
> > support extensions as described in RFC4884. If it's non-zero then
> > assume the extension is present, so before processing the original
> > datagram, e.g. performing socket lookup, trim the skb to the length of
> > the orignal datagram.
>
> This is somewhat orthogonal to this patch.
>
> You are suggesting proactively protecting applications that do not
> know how to parse RFC 4884 compliant packets by truncating
> those unless the application sets the new setsockopt?
>
No, the bug is a bit more insidious. The problem is that it's possible
to misdirect an ICMP error to the wrong socket. If that can happen and
leads to adverse side effects then it's a bug. Grant it, in normal
conditions this is probably exceeding rare, but we can't discount the
possibility and this sounds like the sort of thing that could be used
in some twisted DOS attack.

> I'm afraid that might break legacy applications that currently do
> expect extension objects. They already can in case of ICMPv6.
> And unfortunately they can try in an unsafe manner by parsing
> the original datagram for ICMPv4, too.
>
I suspect there's more legacy applications that don't understand the
extensions so the risk of misinterpreting the packet data with
extensions exists for them.

> ICMPv6 differs from ICMP in that it suggests forwarding as much
> of the "invoking" packet as possible without exceeding the min
> MTU (1280B).
>
> RFC 4884 mentions the minimum 128 B for ICMP but not for
> ICMPv6. It specifically keeps the wording about forwarding as
> much as possible.

Out of curiosity, are you implementing this patch for a specific use
case in mind or just for completeness?

Thanks,
Tom

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next] icmp: support rfc 4884
  2020-06-30 18:57                   ` Tom Herbert
@ 2020-06-30 19:27                     ` Willem de Bruijn
  0 siblings, 0 replies; 16+ messages in thread
From: Willem de Bruijn @ 2020-06-30 19:27 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Willem de Bruijn, Eric Dumazet, Network Development, David Miller

On Tue, Jun 30, 2020 at 2:57 PM Tom Herbert <tom@herbertland.com> wrote:
>
> On Tue, Jun 30, 2020 at 9:52 AM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > On Tue, Jun 30, 2020 at 12:41 PM Tom Herbert <tom@herbertland.com> wrote:
> > >
> > > On Tue, Jun 30, 2020 at 9:16 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > >
> > > >
> > > >
> > > > On 6/30/20 6:57 AM, Willem de Bruijn wrote:
> > > > > On Mon, Jun 29, 2020 at 10:19 PM Willem de Bruijn
> > > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > >>
> > > > >> On Mon, Jun 29, 2020 at 8:37 PM Tom Herbert <tom@herbertland.com> wrote:
> > > > >>>
> > > > >>> On Mon, Jun 29, 2020 at 4:07 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> On 6/29/20 2:30 PM, Willem de Bruijn wrote:
> > > > >>>>> On Mon, Jun 29, 2020 at 5:15 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> On 6/29/20 9:57 AM, Willem de Bruijn wrote:
> > > > >>>>>>> From: Willem de Bruijn <willemb@google.com>
> > > > >>>>>>>
> > > > >>>>>>> ICMP messages may include an extension structure after the original
> > > > >>>>>>> datagram. RFC 4884 standardized this behavior.
> > > > >>>>>>>
> > > > >>>>>>> It introduces an explicit original datagram length field in the ICMP
> > > > >>>>>>> header to delineate the original datagram from the extension struct.
> > > > >>>>>>>
> > > > >>>>>>> Return this field when reading an ICMP error from the error queue.
> > > > >>>>>>
> > > > >>>>>> RFC mentions a 'length' field of 8 bits, your patch chose to export the whole
> > > > >>>>>> second word of icmp header.
> > > > >>>>>>
> > > > >>>>>> Why is this field mapped to a prior one (icmp_hdr(skb)->un.gateway) ?
> > > > >>>>>>
> > > > >>>>>> Should we add an element in the union to make this a little bit more explicit/readable ?
> > > > >>>>>>
> > > > >>>>>> diff --git a/include/uapi/linux/icmp.h b/include/uapi/linux/icmp.h
> > > > >>>>>> index 5589eeb791ca580bb182e1dc38c05eab1c75adb9..427ed5a6765316a4c1e2fa06f3b6618447c01564 100644
> > > > >>>>>> --- a/include/uapi/linux/icmp.h
> > > > >>>>>> +++ b/include/uapi/linux/icmp.h
> > > > >>>>>> @@ -76,6 +76,7 @@ struct icmphdr {
> > > > >>>>>>                 __be16  sequence;
> > > > >>>>>>         } echo;
> > > > >>>>>>         __be32  gateway;
> > > > >>>>>> +       __be32  second_word; /* RFC 4884 4.[123] : <unused:8>,<length:8>,<mtu:16> */
> > > > >>>>>>         struct {
> > > > >>>>>>                 __be16  __unused;
> > > > >>>>>>                 __be16  mtu;
> > > > >>>>>
> > > > >>>>> Okay. How about a variant of the existing struct frag?
> > > > >>>>>
> > > > >>>>> @@ -80,6 +80,11 @@ struct icmphdr {
> > > > >>>>>                 __be16  __unused;
> > > > >>>>>                 __be16  mtu;
> > > > >>>>>         } frag;
> > > > >>>>> +       struct {
> > > > >>>>> +               __u8    __unused;
> > > > >>>>> +               __u8    length;
> > > > >>>>> +               __be16  mtu;
> > > > >>>>> +       } rfc_4884;
> > > > >>>>>         __u8    reserved[4];
> > > > >>>>>    } un;
> > > > >>>>>
> > > > >>>>
> > > > >>>> Sure, but my point was later in the code :
> > > > >>>>
> > > > >>>>>>> +     if (inet_sk(sk)->recverr_rfc4884)
> > > > >>>>>>> +             info = ntohl(icmp_hdr(skb)->un.gateway);
> > > > >>>>>>
> > > > >>>>>> ntohl(icmp_hdr(skb)->un.second_word);
> > > > >>>>
> > > > >>>> If you leave there "info = ntohl(icmp_hdr(skb)->un.gateway)" it is a bit hard for someone
> > > > >>>> reading linux kernel code to understand why we do this.
> > > > >>>>
> > > > >>> It's also potentially problematic. The other bits are Unused, which
> > > > >>> isn't the same thing as necessarily being zero. Userspace might assume
> > > > >>> that info is the length without checking its bounded.
> > > > >>
> > > > >> It shouldn't. The icmp type and code are passed in sock_extended_err
> > > > >> as ee_type and ee_code. So it can demultiplex the meaning of the rest
> > > > >> of the icmp header.
> > > > >>
> > > > >> It just needs access to the other 32-bits, which indeed are context
> > > > >> sensitive. It makes more sense to me to let userspace demultiplex this
> > > > >> in one place, rather than demultiplex in the kernel and define a new,
> > > > >> likely no simpler, data structure to share with userspace.
> > > > >>
> > > > >> Specific to RFC 4884, the 8-bit length field coexists with the
> > > > >> 16-bit mtu field in case of ICMP_FRAG_NEEDED, so we cannot just pass
> > > > >> the first as ee_info in RFC 4884 mode. sock_extended_err additionally
> > > > >> has ee_data, but after that we're out of fields, too, so this approach
> > > > >> is not very future proof to additional ICMP extensions.
> > > > >>
> > > > >> On your previous point, it might be useful to define struct rfc_4884
> > > > >> equivalent outside struct icmphdr, so that an application can easily
> > > > >> cast to that. RFC 4884 itself does not define any extension objects.
> > > > >> That is out of scope there, and in my opinion, here. Again, better
> > > > >> left to userspace. Especially because as it describes, it standardized
> > > > >> the behavior after observing non-compliant, but existing in the wild,
> > > > >> proprietary extension variants. Users may have to change how they
> > > > >> interpret the fields based on what they have deployed.
> > > > >
> > > > > As this just shares the raw icmp header data, I should probably
> > > > > change the name to something less specific to RFC 4884.
> > > > >
> > > > > Since it would also help with decoding other extensions, such as
> > > > > the one you mention in  draft-ietf-6man-icmp-limits-08.
> > > > >
> > > > > Unfortunately I cannot simply reserve IP_RECVERR with integer 2.
> > > > > Perhaps IP_RECVERR_EXINFO.
> > > > >
> > > >
> > > > Perhaps let the icmp header as is, but provides the extra information
> > > > as an explicit ancillary message in ip_recv_error() ?
> > > >
> > > > Really this is all about documentation and providing stable API.
> > > >
> > > Actually, I think we may have a subtle bug here.
> > >
> > > RFC4884 appends the ICMP extension to the original datagram. The RFC
> > > describes backwards compatibility in section 5. To be backwards
> > > compatible with legacy implementations that don't know how to parse
> > > the extension, and in particular to find the length of the original
> > > datagram in the data, the requirement is that at the original datagram
> > > is at least 128 bytes long and it seems to assume no ICMP application
> > > need to parse beyond that. But parsing beyond 128 bytes is possible,
> > > consider that the original datagram was UDP/IPv6 with an extension
> > > header such that the UDP header is offset beyond 128 bytes in the
> > > packet. If we don't take this into account, the UDP ports for socket
> > > lookup would be incorrect.
> > >
> > > To fix this, we could check the Length field per the types that
> > > support extensions as described in RFC4884. If it's non-zero then
> > > assume the extension is present, so before processing the original
> > > datagram, e.g. performing socket lookup, trim the skb to the length of
> > > the orignal datagram.
> >
> > This is somewhat orthogonal to this patch.
> >
> > You are suggesting proactively protecting applications that do not
> > know how to parse RFC 4884 compliant packets by truncating
> > those unless the application sets the new setsockopt?
> >
> No, the bug is a bit more insidious. The problem is that it's possible
> to misdirect an ICMP error to the wrong socket. If that can happen and
> leads to adverse side effects then it's a bug. Grant it, in normal
> conditions this is probably exceeding rare, but we can't discount the
> possibility and this sounds like the sort of thing that could be used
> in some twisted DOS attack.

You mean the kernel itself might accidentally read an extension
object as if it is part of original datagram headers? Or applications?
The kernel only uses the outer packet for socket lookup e.g., in
__udp4_lib_err.

Does my point about the differences between ICMP and ICMPv6
on response size address that? If this is an ICMPv6 specific issue
due to extension headers.

> > I'm afraid that might break legacy applications that currently do
> > expect extension objects. They already can in case of ICMPv6.
> > And unfortunately they can try in an unsafe manner by parsing
> > the original datagram for ICMPv4, too.
> >
> I suspect there's more legacy applications that don't understand the
> extensions so the risk of misinterpreting the packet data with
> extensions exists for them.

Fair point. But we still cannot change established behavior.

> > ICMPv6 differs from ICMP in that it suggests forwarding as much
> > of the "invoking" packet as possible without exceeding the min
> > MTU (1280B).
> >
> > RFC 4884 mentions the minimum 128 B for ICMP but not for
> > ICMPv6. It specifically keeps the wording about forwarding as
> > much as possible.
>
> Out of curiosity, are you implementing this patch for a specific use
> case in mind or just for completeness?

I do have a specific user for this.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, back to index

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-29 16:57 [PATCH net-next] icmp: support rfc 4884 Willem de Bruijn
2020-06-29 20:34 ` Tom Herbert
2020-06-29 20:55   ` Willem de Bruijn
2020-06-29 21:14 ` Eric Dumazet
2020-06-29 21:30   ` Willem de Bruijn
2020-06-29 23:06     ` Eric Dumazet
2020-06-30  0:36       ` Tom Herbert
2020-06-30  2:18         ` Willem de Bruijn
2020-06-30 13:57           ` Willem de Bruijn
2020-06-30 16:16             ` Eric Dumazet
2020-06-30 16:41               ` Tom Herbert
2020-06-30 16:52                 ` Willem de Bruijn
2020-06-30 18:57                   ` Tom Herbert
2020-06-30 19:27                     ` Willem de Bruijn
2020-06-30 16:41               ` Willem de Bruijn
2020-06-30  0:30   ` Tom Herbert

Netdev Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/netdev/0 netdev/git/0.git
	git clone --mirror https://lore.kernel.org/netdev/1 netdev/git/1.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 netdev netdev/ https://lore.kernel.org/netdev \
		netdev@vger.kernel.org
	public-inbox-index netdev

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.netdev


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git