netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH ipsec/vti 0/2] Fragmentation of IPv4 in VTI
@ 2019-03-17 23:37 Bram Yvahk
  2019-03-17 23:37 ` [PATCH ipsec/vti 1/2] vti: fragment IPv4 packets when DF bit is not set Bram Yvahk
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Bram Yvahk @ 2019-03-17 23:37 UTC (permalink / raw)
  To: steffen.klassert, herbert, davem; +Cc: netdev

We've experienced an issue with VTI when the path-mtu is smaller than the size
of the "client" packet.

What happens: IPv4 packet from the client (i.e. another system in the LAN)
attempts to transmit some data; IPv4 header shows that 'DF' bit is not set but
still the client receives ICMPv4 "need-to-frag" message [which the client does
not expect and ignores].

Example: $ ping -s 1300 -M dont -c5 192.168.235.2
    PING 192.168.235.3 (192.168.235.3) 1300(1328) bytes of data.
    From 192.168.236.254 icmp_seq=1 Frag needed and DF set (mtu = 1214)
    From 192.168.236.254 icmp_seq=2 Frag needed and DF set (mtu = 1214)
    From 192.168.236.254 icmp_seq=3 Frag needed and DF set (mtu = 1214)
    From 192.168.236.254 icmp_seq=4 Frag needed and DF set (mtu = 1214)
    From 192.168.236.254 icmp_seq=5 Frag needed and DF set (mtu = 1214)

    --- 192.168.235.3 ping statistics ---
    5 packets transmitted, 0 received, +5 errors, 100% packet loss, time 3999ms

This is addressed in the first patch of this series. (I don't fully like some
parts of this patch but I'll submit my remarks as a reply to the patch)

When testing the changes with an IPv6 tunnel (and with vti6) another issue/edge
case popped up. Assume: client sends a IP packet with a size of 1500 bytes.
After encrypting this packet [and wrapping it in ESP] it obviously is too large
to transmit as a single IPv6 packet and therefor two fragmented IPv6 packets
are transmitted. However: when the packet size is larger then the (not yet
known) path-mtu then a 'ICMPV6_PKT_TOOBIG' is received.
The problem: the ip6_vti error handler ('vti6_err') does not expect that the
original packet was a fragmented IPv6 packet --> it does not process the
'PKT_TOOBIG' and continues in sending packets that are too big.

This is addressed in the second patch.

Bram Yvahk (2):
  vti: fragment IPv4 packets when DF bit is not set
  vti6: process icmp msg when IPv6 is fragmented

 net/ipv4/ip_vti.c  | 43 +++++++++++++++++++---------
 net/ipv6/ip6_vti.c | 83 +++++++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 92 insertions(+), 34 deletions(-)

-- 
2.7.0


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH ipsec/vti 1/2] vti: fragment IPv4 packets when DF bit is not set
  2019-03-17 23:37 [PATCH ipsec/vti 0/2] Fragmentation of IPv4 in VTI Bram Yvahk
@ 2019-03-17 23:37 ` Bram Yvahk
  2019-03-17 23:52   ` Bram Yvahk
  2019-03-17 23:37 ` [PATCH ipsec/vti 2/2] vti6: process icmp msg when IPv6 is fragmented Bram Yvahk
  2019-03-21 15:16 ` [PATCH ipsec/vti 0/2] Fragmentation of IPv4 in VTI Steffen Klassert
  2 siblings, 1 reply; 7+ messages in thread
From: Bram Yvahk @ 2019-03-17 23:37 UTC (permalink / raw)
  To: steffen.klassert, herbert, davem; +Cc: netdev

Only send a 'need to frag' ICMP message when the "Don't Fragment" bit
is set. If it's not set then the packet can/will be fragmented.

This fixes sending an 'need to frag' message on a client that did not
set the DF bit, i.e.:

	$ ping -s 1300 -M dont -c5 192.168.235.2
	PING 192.168.235.3 (192.168.235.3) 1300(1328) bytes of data.
	From 192.168.236.254 icmp_seq=1 Frag needed and DF set (mtu = 1214)

Signed-off-by: Bram Yvahk <bram-yvahk@mail.wizbit.be>
---
 net/ipv4/ip_vti.c  | 43 +++++++++++++++++++++++++++--------------
 net/ipv6/ip6_vti.c | 56 +++++++++++++++++++++++++++++++++++++++---------------
 2 files changed, 70 insertions(+), 29 deletions(-)

diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index 68a21bf..5738e44 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -196,6 +196,34 @@ static bool vti_state_check(const struct xfrm_state *x, __be32 dst, __be32 src)
 	return true;
 }
 
+static bool vti_tunnel_check_size(struct sk_buff *skb)
+{
+	int mtu;
+
+	if (skb->protocol == htons(ETH_P_IP)) {
+		if (!(ip_hdr(skb)->frag_off & htons(IP_DF)) || skb->ignore_df)
+			return true;
+	}
+
+	mtu = dst_mtu(skb_dst(skb));
+	if (skb->len > mtu) {
+		skb_dst_update_pmtu(skb, mtu);
+		if (skb->protocol == htons(ETH_P_IP)) {
+			icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
+				  htonl(mtu));
+		} else {
+			if (mtu < IPV6_MIN_MTU)
+				mtu = IPV6_MIN_MTU;
+
+			icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
+		}
+
+		return false;
+	}
+
+	return true;
+}
+
 static netdev_tx_t vti_xmit(struct sk_buff *skb, struct net_device *dev,
 			    struct flowi *fl)
 {
@@ -205,7 +233,6 @@ static netdev_tx_t vti_xmit(struct sk_buff *skb, struct net_device *dev,
 	struct net_device *tdev;	/* Device to other host */
 	int pkt_len = skb->len;
 	int err;
-	int mtu;
 
 	if (!dst) {
 		dev->stats.tx_carrier_errors++;
@@ -233,19 +260,7 @@ static netdev_tx_t vti_xmit(struct sk_buff *skb, struct net_device *dev,
 		goto tx_error;
 	}
 
-	mtu = dst_mtu(dst);
-	if (skb->len > mtu) {
-		skb_dst_update_pmtu(skb, mtu);
-		if (skb->protocol == htons(ETH_P_IP)) {
-			icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
-				  htonl(mtu));
-		} else {
-			if (mtu < IPV6_MIN_MTU)
-				mtu = IPV6_MIN_MTU;
-
-			icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
-		}
-
+	if (!vti_tunnel_check_size(skb)) {
 		dst_release(dst);
 		goto tx_error;
 	}
diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index 8b6eeff..47f178c 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -436,6 +436,46 @@ static bool vti6_state_check(const struct xfrm_state *x,
 }
 
 /**
+ * vti6_tunnel_check_size - check size of packet
+ *   @skb: the outgoing socket buffer
+ *
+ * Description:
+ *   Check if packet is too large (> pmtu)
+ *
+ * Return:
+ *   true if size of packet is ok
+ *   false if packet is too large
+ **/
+static bool vti6_tunnel_check_size(struct sk_buff *skb)
+{
+	int mtu;
+
+	if (skb->protocol == htons(ETH_P_IP)) {
+		if (!(ip_hdr(skb)->frag_off & htons(IP_DF)) || skb->ignore_df)
+			return true;
+	}
+
+	mtu = dst_mtu(skb_dst(skb));
+	if (skb->len > mtu) {
+		skb_dst_update_pmtu(skb, mtu);
+
+		if (skb->protocol == htons(ETH_P_IPV6)) {
+			if (mtu < IPV6_MIN_MTU)
+				mtu = IPV6_MIN_MTU;
+
+			icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
+		} else {
+			icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
+				  htonl(mtu));
+		}
+
+		return false;
+	}
+
+	return true;
+}
+
+/**
  * vti6_xmit - send a packet
  *   @skb: the outgoing socket buffer
  *   @dev: the outgoing tunnel device
@@ -451,7 +491,6 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, struct flowi *fl)
 	struct xfrm_state *x;
 	int pkt_len = skb->len;
 	int err = -1;
-	int mtu;
 
 	if (!dst)
 		goto tx_err_link_failure;
@@ -481,20 +520,7 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, struct flowi *fl)
 		goto tx_err_dst_release;
 	}
 
-	mtu = dst_mtu(dst);
-	if (skb->len > mtu) {
-		skb_dst_update_pmtu(skb, mtu);
-
-		if (skb->protocol == htons(ETH_P_IPV6)) {
-			if (mtu < IPV6_MIN_MTU)
-				mtu = IPV6_MIN_MTU;
-
-			icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
-		} else {
-			icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
-				  htonl(mtu));
-		}
-
+	if (!vti6_tunnel_check_size(skb)) {
 		err = -EMSGSIZE;
 		goto tx_err_dst_release;
 	}
-- 
2.7.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH ipsec/vti 2/2] vti6: process icmp msg when IPv6 is fragmented
  2019-03-17 23:37 [PATCH ipsec/vti 0/2] Fragmentation of IPv4 in VTI Bram Yvahk
  2019-03-17 23:37 ` [PATCH ipsec/vti 1/2] vti: fragment IPv4 packets when DF bit is not set Bram Yvahk
@ 2019-03-17 23:37 ` Bram Yvahk
  2019-03-21 15:16 ` [PATCH ipsec/vti 0/2] Fragmentation of IPv4 in VTI Steffen Klassert
  2 siblings, 0 replies; 7+ messages in thread
From: Bram Yvahk @ 2019-03-17 23:37 UTC (permalink / raw)
  To: steffen.klassert, herbert, davem; +Cc: netdev

In the error function the 'nexthdr' of the (original) IPv6 header
was used to check for which protocol it was.

When the (original) IPv6 packet is fragmented however then nexthdr
is set to 'NEXTHDR_FRAGMENT' and this causes the code to return
early and not process the ICMP error.

Signed-off-by: Bram Yvahk <bram-yvahk@mail.wizbit.be>
---
 net/ipv6/ip6_vti.c | 27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index 47f178c..9582ffd 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -590,7 +590,7 @@ vti6_tnl_xmit(struct sk_buff *skb, struct net_device *dev)
 }
 
 static int vti6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
-		    u8 type, u8 code, int offset, __be32 info)
+		    u8 type, u8 code, int offset, __be32 info, int protocol)
 {
 	__be32 spi;
 	__u32 mark;
@@ -601,7 +601,6 @@ static int vti6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	struct ip_comp_hdr *ipch;
 	struct net *net = dev_net(skb->dev);
 	const struct ipv6hdr *iph = (const struct ipv6hdr *)skb->data;
-	int protocol = iph->nexthdr;
 
 	t = vti6_tnl_lookup(dev_net(skb->dev), &iph->daddr, &iph->saddr);
 	if (!t)
@@ -645,6 +644,24 @@ static int vti6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	return 0;
 }
 
+static int vti6_esp_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+			u8 type, u8 code, int offset, __be32 info)
+{
+	return vti6_err(skb, opt, type, code, offset, info, IPPROTO_ESP);
+}
+
+static int vti6_ah_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+		       u8 type, u8 code, int offset, __be32 info)
+{
+	return vti6_err(skb, opt, type, code, offset, info, IPPROTO_AH);
+}
+
+static int vti6_ipcomp_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+			   u8 type, u8 code, int offset, __be32 info)
+{
+	return vti6_err(skb, opt, type, code, offset, info, IPPROTO_COMP);
+}
+
 static void vti6_link_config(struct ip6_tnl *t, bool keep_mtu)
 {
 	struct net_device *dev = t->dev;
@@ -1189,21 +1206,21 @@ static struct pernet_operations vti6_net_ops = {
 static struct xfrm6_protocol vti_esp6_protocol __read_mostly = {
 	.handler	=	vti6_rcv,
 	.cb_handler	=	vti6_rcv_cb,
-	.err_handler	=	vti6_err,
+	.err_handler	=	vti6_esp_err,
 	.priority	=	100,
 };
 
 static struct xfrm6_protocol vti_ah6_protocol __read_mostly = {
 	.handler	=	vti6_rcv,
 	.cb_handler	=	vti6_rcv_cb,
-	.err_handler	=	vti6_err,
+	.err_handler	=	vti6_ah_err,
 	.priority	=	100,
 };
 
 static struct xfrm6_protocol vti_ipcomp6_protocol __read_mostly = {
 	.handler	=	vti6_rcv,
 	.cb_handler	=	vti6_rcv_cb,
-	.err_handler	=	vti6_err,
+	.err_handler	=	vti6_ipcomp_err,
 	.priority	=	100,
 };
 
-- 
2.7.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH ipsec/vti 1/2] vti: fragment IPv4 packets when DF bit is not set
  2019-03-17 23:37 ` [PATCH ipsec/vti 1/2] vti: fragment IPv4 packets when DF bit is not set Bram Yvahk
@ 2019-03-17 23:52   ` Bram Yvahk
  0 siblings, 0 replies; 7+ messages in thread
From: Bram Yvahk @ 2019-03-17 23:52 UTC (permalink / raw)
  To: steffen.klassert, herbert, davem; +Cc: netdev

Bram Yvahk wrote:
> Only send a 'need to frag' ICMP message when the "Don't Fragment" bit
> is set. If it's not set then the packet can/will be fragmented.
>
> This fixes sending an 'need to frag' message on a client that did not
> set the DF bit, i.e.:
>
>         $ ping -s 1300 -M dont -c5 192.168.235.2
>         PING 192.168.235.3 (192.168.235.3) 1300(1328) bytes of data.
>         From 192.168.236.254 icmp_seq=1 Frag needed and DF set (mtu = 1214)
>
> Signed-off-by: Bram Yvahk <bram-yvahk@mail.wizbit.be>
> ---
>  net/ipv4/ip_vti.c  | 43 +++++++++++++++++++++++++++--------------
>  net/ipv6/ip6_vti.c | 56 +++++++++++++++++++++++++++++++++++++++---------------
>  2 files changed, 70 insertions(+), 29 deletions(-)
>
>   
Parts in the patch that I don't really like:

1) adds almost identical code in ipv4/ip_vti.c and ipv6/ip6_vti.c
    (is there a place to share this code?)

2) 'vti_tunnel_check_size'/'vti6_tunnel_check_size' functions:
   Name and implementation of this function is based on the xfrm code
   but to me "check" in the function name implies that it's (only)
   checking something but in this case it's doing a bit more: it also
updates
   the path-mtu and transmitting an ICMP message.

3) return value of 'vti6_tunnel_check_size':

   In XFRM ('xfrm4_tunnel_check_size'/'xfrm6_tunnel_check_size') return
   value of '0' and '-EMSGSIZE' is used.
  
   I opted not to follow this because of how the code is called in
   'vti6_xmit'. It has an 'err' variable but it is initialized to '-1'.
  
   What I considered:
   - let 'vti6_tunnel_check_size' return 0 and -EMSGSIZE: this matches
     with XFRM but it makes the 'err' variable in 'vti_xmit' a bit
confusing.
     Before calling 'vti6_tunnel_check_size' err=-1 means no error, after
     calling 'vti6_tunnel_check_size' err=0 means no error..
    
   - let 'vti6_tunnel_check_size' return 0 and -EMSGSIZE and use a second
     err variable in vti6_xmit (i.e.
       if ((err2 = vti6_tunnel_check_size(..)) < 0) { err=err2; goto ... }
  
   - let 'vti6_tunnel_check_size' return -1 and -EMSGSIZE: this matches
     with XFRM and works but to me it seems confusing for the function to
     return -1 when there is no error...
    
   - let 'vti6_tunnel_check_size' return a bool and set error in 'vti6_xmit'
  
   - change vti6_xmit to initialize err to 0 (but I've no idea of the impact
     of that)
    
   The least ugly solution to me was the bool so I went with that.
  
4) no error is set in vti_xmit (ip_vti): when the packet is too large it
   doesn't set an error and always returns 'NETDEV_TX_OK'. [This was already
   the case so that behaviour is kept but it doesn't look right...]


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH ipsec/vti 0/2] Fragmentation of IPv4 in VTI
  2019-03-17 23:37 [PATCH ipsec/vti 0/2] Fragmentation of IPv4 in VTI Bram Yvahk
  2019-03-17 23:37 ` [PATCH ipsec/vti 1/2] vti: fragment IPv4 packets when DF bit is not set Bram Yvahk
  2019-03-17 23:37 ` [PATCH ipsec/vti 2/2] vti6: process icmp msg when IPv6 is fragmented Bram Yvahk
@ 2019-03-21 15:16 ` Steffen Klassert
  2019-03-21 18:33   ` Bram Yvahk
  2 siblings, 1 reply; 7+ messages in thread
From: Steffen Klassert @ 2019-03-21 15:16 UTC (permalink / raw)
  To: Bram Yvahk; +Cc: herbert, davem, netdev

On Sun, Mar 17, 2019 at 11:37:55PM +0000, Bram Yvahk wrote:
> We've experienced an issue with VTI when the path-mtu is smaller than the size
> of the "client" packet.
> 
> What happens: IPv4 packet from the client (i.e. another system in the LAN)
> attempts to transmit some data; IPv4 header shows that 'DF' bit is not set but
> still the client receives ICMPv4 "need-to-frag" message [which the client does
> not expect and ignores].
> 
> Example: $ ping -s 1300 -M dont -c5 192.168.235.2
>     PING 192.168.235.3 (192.168.235.3) 1300(1328) bytes of data.
>     From 192.168.236.254 icmp_seq=1 Frag needed and DF set (mtu = 1214)
>     From 192.168.236.254 icmp_seq=2 Frag needed and DF set (mtu = 1214)
>     From 192.168.236.254 icmp_seq=3 Frag needed and DF set (mtu = 1214)
>     From 192.168.236.254 icmp_seq=4 Frag needed and DF set (mtu = 1214)
>     From 192.168.236.254 icmp_seq=5 Frag needed and DF set (mtu = 1214)
> 
>     --- 192.168.235.3 ping statistics ---
>     5 packets transmitted, 0 received, +5 errors, 100% packet loss, time 3999ms

Hm, this works here. Can you show how you setup the vti device?
Some tunnel configuration options (set ttl etc.) force to have
the DF bit set.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH ipsec/vti 0/2] Fragmentation of IPv4 in VTI
  2019-03-21 15:16 ` [PATCH ipsec/vti 0/2] Fragmentation of IPv4 in VTI Steffen Klassert
@ 2019-03-21 18:33   ` Bram Yvahk
  2019-03-22 20:46     ` Bram Yvahk
  0 siblings, 1 reply; 7+ messages in thread
From: Bram Yvahk @ 2019-03-21 18:33 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: herbert, davem, netdev

Steffen Klassert wrote:
> On Sun, Mar 17, 2019 at 11:37:55PM +0000, Bram Yvahk wrote:
>> We've experienced an issue with VTI when the path-mtu is smaller than
the size
>> of the "client" packet.
>>
>> What happens: IPv4 packet from the client (i.e. another system in the
LAN)
>> attempts to transmit some data; IPv4 header shows that 'DF' bit is
not set but
>> still the client receives ICMPv4 "need-to-frag" message [which the
client does
>> not expect and ignores].
>>
>> Example: $ ping -s 1300 -M dont -c5 192.168.235.2
>>     PING 192.168.235.3 (192.168.235.3) 1300(1328) bytes of data.
>>     From 192.168.236.254 icmp_seq=1 Frag needed and DF set (mtu = 1214)
>>     From 192.168.236.254 icmp_seq=2 Frag needed and DF set (mtu = 1214)
>>     From 192.168.236.254 icmp_seq=3 Frag needed and DF set (mtu = 1214)
>>     From 192.168.236.254 icmp_seq=4 Frag needed and DF set (mtu = 1214)
>>     From 192.168.236.254 icmp_seq=5 Frag needed and DF set (mtu = 1214)
>>
>>     --- 192.168.235.3 ping statistics ---
>>     5 packets transmitted, 0 received, +5 errors, 100% packet loss,
time 3999ms
>
> Hm, this works here. Can you show how you setup the vti device?
> Some tunnel configuration options (set ttl etc.) force to have
> the DF bit set.

I will provide these details Tommorow.
What I can say is that ttl was set to inherit.

When testing this there is one important bit - which in hindsight I
should've included in the previous message - the (IPsec) Gateway A
needs to know the path-mtu to (IPsec) Gateway B.

Some ways to accomplish this:
- transmit a ICMP with DF bit set and a larger packet size from
  Gateway A to Gateway B
- ensure the "nopmtudisc" option is *not* set in the xfrm state
  and then let client A transmit a ICMP *with* DF bit set to
  client B. [when "nopmtudisc" is set then all outgoing IPv4 ESP
  packet have the DF bit cleared, when "nopmtudisc" is not set then
  DF bit is copied from the client packet]
 
For testing purposes I recommend to do the ping from Gateway A to
Gateway B. (Otherwise tcpdumps/traffic get a bit more confusing.)

A more in-depth description of what happens:

Setup:
======

|----------|   |-----------|   |-------|   |-----------|   |----------|
| client A |---| Gateway A |---| Hop H |---| Gateway B |---| client B |
------------   |-----------|   |-------|   |-----------|   |----------|

- testing with linux 4.14.95 (setup with more recent kernel is WIP)
- link mtu between client A and Gateway A: 1500
- link mtu between Gateway A and Hop H: 1500
- link mtu between Hop H and Gateway B: 1280
- link mtu between Gateway B and client B: 1500
- path-mtu between Gateway A and Gateway B: 1280
- IPsec tunnel over *IPv4* between Gateway A and Gateway B
- tunneling IPv4 over the IPsec tunnel
- testing with VTI

Scenario:
==========

Before starting it's important to ensure that:
- Gateway A does *not* know the path-mtu to Gateway B
- Client A does *not* know the path-mtu to Gateway B

* Step 1: client A: $ ping -M dont -s 1300 ip_of_client_B
  - IPv4 ICMP packet of client A does not have DF bit set
  - IPv4 ESP packet of Gateway A does not have DF bit set
  - Hop H receives a IPv4 ESP packet that is too large for link-mtu
    between Hop H and Gateway B: it fragments the IPv4 ESP packet.
  - Gateway B receives 2 IPv4 fragmented packets
  - (Client B receives one IPv4 ICMP packet from client A)

* Step 2: Gateway A: $ ping -M do -s 1300 ip_of_gateway_B
  - IPv4 ICMP packet of Gateway A does have DF bit set
  - Gateway A receives a 'need to frag' ICMP from Hop H
 
* Step 3: client A: $ ping -M dont -s 1300 ip_of_client_B
  - IPv4 ICMP packet of client A does not have DF bit set
  - Gateway A: it process this packet in VTI module and detects that
    packet size > path-mtu and then sends a 'need to frag' ICMP to
    client A. [this is the code I patched]
     
=> the critical bit in the above is that Gateway A learns
   the path-mtu to Gateway B. If it doesn't then it keeps
   assuming path-mtu is 1500 and the check in VTI will not
   trigger (since path-mtu of 1500 > packet size)


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH ipsec/vti 0/2] Fragmentation of IPv4 in VTI
  2019-03-21 18:33   ` Bram Yvahk
@ 2019-03-22 20:46     ` Bram Yvahk
  0 siblings, 0 replies; 7+ messages in thread
From: Bram Yvahk @ 2019-03-22 20:46 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: herbert, davem, netdev

Bram Yvahk wrote:
> Steffen Klassert wrote:
>> On Sun, Mar 17, 2019 at 11:37:55PM +0000, Bram Yvahk wrote:
>>> We've experienced an issue with VTI when the path-mtu is smaller than
> the size
>>> of the "client" packet.
>>>
>>> What happens: IPv4 packet from the client (i.e. another system in the
> LAN)
>>> attempts to transmit some data; IPv4 header shows that 'DF' bit is
> not set but
>>> still the client receives ICMPv4 "need-to-frag" message [which the
> client does
>>> not expect and ignores].
>>>
>>> Example: $ ping -s 1300 -M dont -c5 192.168.235.2
>>>     PING 192.168.235.3 (192.168.235.3) 1300(1328) bytes of data.
>>>     From 192.168.236.254 icmp_seq=1 Frag needed and DF set (mtu = 1214)
>>>     From 192.168.236.254 icmp_seq=2 Frag needed and DF set (mtu = 1214)
>>>     From 192.168.236.254 icmp_seq=3 Frag needed and DF set (mtu = 1214)
>>>     From 192.168.236.254 icmp_seq=4 Frag needed and DF set (mtu = 1214)
>>>     From 192.168.236.254 icmp_seq=5 Frag needed and DF set (mtu = 1214)
>>>
>>>     --- 192.168.235.3 ping statistics ---
>>>     5 packets transmitted, 0 received, +5 errors, 100% packet loss,
> time 3999ms
>> Hm, this works here. Can you show how you setup the vti device?
>> Some tunnel configuration options (set ttl etc.) force to have
>> the DF bit set.
>
> I will provide these details Tommorow.
> What I can say is that ttl was set to inherit.
>

vti device is created (on Gateway A) using:
$ ip tun add name vti0 mode vti ikey 1 okey 1 local <ip gateway A>
$ ip link show dev vti0
46: vti0@NONE: <NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN
mode DEFAULT group default qlen 1000
    link/ipip <ip gateway A> brd 0.0.0.0
$ ip tun show name vti0            
vti0: ip/ip remote any local <ip gateway A> ttl inherit key 1
   
[I've also done setup with mtu 1400 - all remains the same]

xfrm state:
src <ip gateway B> dst <ip gateway A>
        proto esp spi 0xcd76a4a9 reqid 16389 mode tunnel
        replay-window 32 flag nopmtudisc af-unspec
        auth-trunc hmac(sha1) 0x08e1ce16b1f7f9039f9cc7421cf61010c029efc3 96
        enc cbc(aes)
0x22c7aacd9680a10a52b0c5670b7d850c35ba17f7c7dc6c963252cdc311b1f4d5
        anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000
src <ip gateway A> dst <ip gateway B>
        proto esp spi 0x8f2988c7 reqid 16389 mode tunnel
        replay-window 32 flag nopmtudisc af-unspec
        auth-trunc hmac(sha1) 0x229bbe490606ddcc6a68332babd498001591c6bf 96
        enc cbc(aes)
0xd598dba419bfc45232580e54d517aae6a77c3328a51ebb3321802b89cc51ae43
        anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000

(same behaviour with/without nopmtudisc; nopmtudisc only makes a
 difference for packets from 'client A' that *do* have the DF bit set)

>
> When testing this there is one important bit - which in hindsight I
> should've included in the previous message - the (IPsec) Gateway A
> needs to know the path-mtu to (IPsec) Gateway B.
>
> Some ways to accomplish this:
> - transmit a ICMP with DF bit set and a larger packet size from
>   Gateway A to Gateway B
> - ensure the "nopmtudisc" option is *not* set in the xfrm state
>   and then let client A transmit a ICMP *with* DF bit set to
>   client B. [when "nopmtudisc" is set then all outgoing IPv4 ESP
>   packet have the DF bit cleared, when "nopmtudisc" is not set then
>   DF bit is copied from the client packet]
> 
> For testing purposes I recommend to do the ping from Gateway A to
> Gateway B. (Otherwise tcpdumps/traffic get a bit more confusing.)
>
> A more in-depth description of what happens:
>
> Setup:
> ======
>
> |----------|   |-----------|   |-------|   |-----------|   |----------|
> | client A |---| Gateway A |---| Hop H |---| Gateway B |---| client B |
> ------------   |-----------|   |-------|   |-----------|   |----------|
>
> - testing with linux 4.14.95 (setup with more recent kernel is WIP)
> - link mtu between client A and Gateway A: 1500
> - link mtu between Gateway A and Hop H: 1500
> - link mtu between Hop H and Gateway B: 1280
> - link mtu between Gateway B and client B: 1500
> - path-mtu between Gateway A and Gateway B: 1280
> - IPsec tunnel over *IPv4* between Gateway A and Gateway B
> - tunneling IPv4 over the IPsec tunnel
> - testing with VTI
>
> Scenario:
> ==========
>
> Before starting it's important to ensure that:
> - Gateway A does *not* know the path-mtu to Gateway B
> - Client A does *not* know the path-mtu to Gateway B

On Gateway A:

$ ip route get <ip of gateway B>
<ip gateway B> via <hop H> dev eth1 src <ip gateway A> uid 0
    cache
=> no mtu shown --> path-mtu not yet known

>
> * Step 1: client A: $ ping -M dont -s 1300 ip_of_client_B
>   - IPv4 ICMP packet of client A does not have DF bit set
>   - IPv4 ESP packet of Gateway A does not have DF bit set
>   - Hop H receives a IPv4 ESP packet that is too large for link-mtu
>     between Hop H and Gateway B: it fragments the IPv4 ESP packet.
>   - Gateway B receives 2 IPv4 fragmented packets
>   - (Client B receives one IPv4 ICMP packet from client A)

tcpdump on Gateway A:
- from client A it receives:
    IP (tos 0x0, ttl 64, id 46797, offset 0, flags [none], proto ICMP
(1), length 1328)
        client_A > client_B: ICMP echo request, id 6855, seq 1, length 1308

- it transmits (to Gateway B):
    IP (tos 0x0, ttl 64, id 10932, offset 0, flags [none], proto ESP
(50), length 1400)
        gateway_A > gateway_B: ESP(spi=0x8f2988c7,seq=0x3), length 1380

tcpdump on Gateway B:
- it receives (from Gateway A):
    IP (tos 0x0, ttl 63, id 10932, offset 0, flags [+], proto ESP (50),
length 1276)
        gateway_A > gateway_B: ESP(spi=0x8f2988c7,seq=0x3), length 1256
    IP (tos 0x0, ttl 63, id 10932, offset 1256, flags [none], proto ESP
(50), length 144)
        gateway_A > gateway_B: ip-proto-50
- it transmits (to client B):
    IP (tos 0x0, ttl 62, id 46797, offset 0, flags [none], proto ICMP
(1), length 1328)
        client_A > client_B: ICMP echo request, id 6855, seq 1, length 1308

=> Hop H fragmented the IPv4 packets. This is expected: DF bit is not
   set on ESP packets and Gateway A does not know path-mtu to Gateway B

>
> * Step 2: Gateway A: $ ping -M do -s 1300 ip_of_gateway_B
>   - IPv4 ICMP packet of Gateway A does have DF bit set
>   - Gateway A receives a 'need to frag' ICMP from Hop H

tcpdump on Gateway A:

- it transmits (local packet - to Gateway B):
    IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1),
length 1328)
        gateway_A > gateway_B: ICMP echo request, id 28176, seq 1,
length 1308
- it receives (from Hop H):
    IP (tos 0xc0, ttl 64, id 52788, offset 0, flags [none], proto ICMP
(1), length 576)
        hop_H > gateway_A: ICMP 1.1.235.254 unreachable - need to frag
(mtu 1280), length 556
            IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP
(1), length 1328)
                gateway_A > gateway_B: ICMP echo request, id 28176, seq
1, length 1308

=> Hop H send need-to-frag mtu. This expected: DF bit is set on ICMP
   packet so Hop H should not fragment.
  
on Gateway A:
$ ip route get <ip of gateway B>
<ip gateway B> via <hop H> dev eth1 src <ip gateway A> uid 0
    cache expires 17sec mtu 1280
=> path-mtu known to be 1280


> * Step 3: client A: $ ping -M dont -s 1300 ip_of_client_B
>   - IPv4 ICMP packet of client A does not have DF bit set
>   - Gateway A: it process this packet in VTI module and detects that
>     packet size > path-mtu and then sends a 'need to frag' ICMP to
>     client A. [this is the code I patched]

tcpdump on Gateway A:
- from client A it receives:
    IP (tos 0x0, ttl 64, id 46798, offset 0, flags [none], proto ICMP
(1), length 1328)
        client_A > client_B: ICMP echo request, id 7063, seq 1, length 1308
       
- it transmits to client A:
    IP (tos 0xc0, ttl 64, id 59290, offset 0, flags [none], proto ICMP
(1), length 576)
        gateway_A > client_A: ICMP client_B unreachable - need to frag
(mtu 1214), length 556
           IP (tos 0x0, ttl 63, id 46798, offset 0, flags [none], proto
ICMP (1), length 1328)
               client_A > client_B: ICMP echo request, id 7063, seq 1,
length 1308

>     
> => the critical bit in the above is that Gateway A learns
>    the path-mtu to Gateway B. If it doesn't then it keeps
>    assuming path-mtu is 1500 and the check in VTI will not
>    trigger (since path-mtu of 1500 > packet size)



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2019-03-22 20:46 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-17 23:37 [PATCH ipsec/vti 0/2] Fragmentation of IPv4 in VTI Bram Yvahk
2019-03-17 23:37 ` [PATCH ipsec/vti 1/2] vti: fragment IPv4 packets when DF bit is not set Bram Yvahk
2019-03-17 23:52   ` Bram Yvahk
2019-03-17 23:37 ` [PATCH ipsec/vti 2/2] vti6: process icmp msg when IPv6 is fragmented Bram Yvahk
2019-03-21 15:16 ` [PATCH ipsec/vti 0/2] Fragmentation of IPv4 in VTI Steffen Klassert
2019-03-21 18:33   ` Bram Yvahk
2019-03-22 20:46     ` Bram Yvahk

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).