[PATCH net-next 0/6] Support PMTU discovery with bridged UDP tunnels

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH net-next 0/6] Support PMTU discovery with bridged UDP tunnels
@ 2020-08-03 20:52 Stefano Brivio
  2020-08-03 20:52 ` [PATCH net-next 1/6] ipv4: route: Ignore output interface in FIB lookup for PMTU route Stefano Brivio
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Stefano Brivio @ 2020-08-03 20:52 UTC (permalink / raw)
  To: David S. Miller
  Cc: Florian Westphal, David Ahern, Aaron Conole, Numan Siddique,
	Jakub Kicinski, Pravin B Shelar, Roopa Prabhu,
	Nikolay Aleksandrov, Lourdes Pedrajas, netdev

Currently, PMTU discovery for UDP tunnels only works if packets are
routed to the encapsulating interfaces, not bridged.

This results from the fact that we generally don't have valid routes
to the senders we can use to relay ICMP and ICMPv6 errors, and makes
PMTU discovery completely non-functional for VXLAN and GENEVE ports of
both regular bridges and Open vSwitch instances.

If the sender is local, and packets are forwarded to the port by a
regular bridge, all it takes is to generate a corresponding route
exception on the encapsulating device. The bridge then finds the route
exception carrying the PMTU value estimate as it forwards frames, and
relays ICMP messages back to the socket of the local sender. Patch 1/6
fixes this case.

If the sender resides on another node, we actually need to reply to
IP and IPv6 packets ourselves and send these ICMP or ICMPv6 errors
back, using the same encapsulating device. Patch 2/6, based on an
original idea by Florian Westphal, adds the needed functionality,
while patches 3/6 and 4/6 add matching support for VXLAN and GENEVE.

Finally, 5/6 and 6/6 introduce selftests for all combinations of
inner and outer IP versions, covering both VXLAN and GENEVE, with
both regular bridges and Open vSwitch instances.

Stefano Brivio (6):
  ipv4: route: Ignore output interface in FIB lookup for PMTU route
  tunnels: PMTU discovery support for directly bridged IP packets
  vxlan: Support for PMTU discovery on directly bridged links
  geneve: Support for PMTU discovery on directly bridged links
  selftests: pmtu.sh: Add tests for bridged UDP tunnels
  selftests: pmtu.sh: Add tests for UDP tunnels handled by Open vSwitch

 drivers/net/bareudp.c               |   5 +-
 drivers/net/geneve.c                |  57 ++++-
 drivers/net/vxlan.c                 |  49 +++-
 include/net/dst.h                   |  10 -
 include/net/ip_tunnels.h            |  88 +++++++
 net/ipv4/ip_tunnel_core.c           | 122 ++++++++++
 net/ipv4/route.c                    |   1 +
 tools/testing/selftests/net/pmtu.sh | 347 +++++++++++++++++++++++++++-
 8 files changed, 650 insertions(+), 29 deletions(-)

-- 
2.27.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH net-next 1/6] ipv4: route: Ignore output interface in FIB lookup for PMTU route
  2020-08-03 20:52 [PATCH net-next 0/6] Support PMTU discovery with bridged UDP tunnels Stefano Brivio
@ 2020-08-03 20:52 ` Stefano Brivio
  2020-08-03 23:30   ` David Ahern
  2020-08-03 20:52 ` [PATCH net-next 2/6] tunnels: PMTU discovery support for directly bridged IP packets Stefano Brivio
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Stefano Brivio @ 2020-08-03 20:52 UTC (permalink / raw)
  To: David S. Miller
  Cc: Florian Westphal, David Ahern, Aaron Conole, Numan Siddique,
	Jakub Kicinski, Pravin B Shelar, Roopa Prabhu,
	Nikolay Aleksandrov, Lourdes Pedrajas, netdev

Currently, processes sending traffic to a local bridge with an
encapsulation device as a port don't get ICMP errors if they exceed
the PMTU of the encapsulated link.

David Ahern suggested this as a hack, but it actually looks like
the correct solution: when we update the PMTU for a given destination
by means of updating or creating a route exception, the encapsulation
might trigger this because of PMTU discovery happening either on the
encapsulation device itself, or its lower layer.

The output interface shouldn't matter, because we already have a
valid destination. Drop the output interface restriction from the
associated route lookup.

For UDP tunnels, we will now have a route exception created for the
encapsulation itself, with a MTU value reflecting its headroom, which
allows a bridge forwarding IP packets originated locally to deliver
errors back to the sending socket.

The behaviour is now consistent with IPv6 and verified with selftests
pmtu_ipv{4,6}_br_{geneve,vxlan}{4,6}_exception introduced later in
this series.

Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 net/ipv4/route.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a01efa062f6b..c14fd8124f57 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1050,6 +1050,7 @@ static void ip_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
 	struct flowi4 fl4;

 	ip_rt_build_flow_key(&fl4, sk, skb);
+	fl4.flowi4_oif = 0;	/* Don't make lookup fail for encapsulations */
 	__ip_rt_update_pmtu(rt, &fl4, mtu);
 }

-- 
2.27.0

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH net-next 2/6] tunnels: PMTU discovery support for directly bridged IP packets
  2020-08-03 20:52 [PATCH net-next 0/6] Support PMTU discovery with bridged UDP tunnels Stefano Brivio
  2020-08-03 20:52 ` [PATCH net-next 1/6] ipv4: route: Ignore output interface in FIB lookup for PMTU route Stefano Brivio
@ 2020-08-03 20:52 ` Stefano Brivio
  2020-08-03 23:44   ` David Ahern
  2020-08-03 20:52 ` [PATCH net-next 3/6] vxlan: Support for PMTU discovery on directly bridged links Stefano Brivio
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Stefano Brivio @ 2020-08-03 20:52 UTC (permalink / raw)
  To: David S. Miller
  Cc: Florian Westphal, David Ahern, Aaron Conole, Numan Siddique,
	Jakub Kicinski, Pravin B Shelar, Roopa Prabhu,
	Nikolay Aleksandrov, Lourdes Pedrajas, netdev

It's currently possible to bridge Ethernet tunnels carrying IP
packets directly to external interfaces without assigning them
addresses and routes on the bridged network itself: this is the case
for UDP tunnels bridged with a standard bridge or by Open vSwitch.

PMTU discovery is currently broken with those configurations, because
the encapsulation effectively decreases the MTU of the link, and
while we are able to account for this using PMTU discovery on the
lower layer, we don't have a way to relay ICMP or ICMPv6 messages
needed by the sender, because we don't have valid routes to it.

On the other hand, as a tunnel endpoint, we can't fragment packets
as a general approach: this is for instance clearly forbidden for
VXLAN by RFC 7348, section 4.3:

   VTEPs MUST NOT fragment VXLAN packets.  Intermediate routers may
   fragment encapsulated VXLAN packets due to the larger frame size.
   The destination VTEP MAY silently discard such VXLAN fragments.

The same paragraph recommends that the MTU over the physical network
accomodates for encapsulations, but this isn't a practical option for
complex topologies, especially for typical Open vSwitch use cases.

Further, it states that:

   Other techniques like Path MTU discovery (see [RFC1191] and
   [RFC1981]) MAY be used to address this requirement as well.

Now, PMTU discovery already works for routed interfaces, we get
route exceptions created by the encapsulation device as they receive
ICMP Fragmentation Needed and ICMPv6 Packet Too Big messages, and
we already rebuild those messages with the appropriate MTU and route
them back to the sender.

Add the missing bits for bridged cases:

- checks in skb_tunnel_check_pmtu() to understand if it's appropriate
  to trigger a reply according to RFC 1122 section 3.2.2 for ICMP and
  RFC 4443 section 2.4 for ICMPv6. This function is already called by
  UDP tunnels

- a new function generating those ICMP or ICMPv6 replies. We can't
  reuse icmp_send() and icmp6_send() as we don't see the sender as a
  valid destination. This doesn't need to be generic, as we don't
  cover any other type of ICMP errors given that we only provide an
  encapsulation function to the sender

While at it, make the MTU check in skb_tunnel_check_pmtu() accurate:
we might receive GSO buffers here, and the passed headroom already
includes the inner MAC length, so we don't have to account for it
a second time (that would imply three MAC headers on the wire, but
there are just two).

This issue became visible while bridging IPv6 packets with 4500 bytes
of payload over GENEVE using IPv4 with a PMTU of 4000. Given the 50
bytes of encapsulation headroom, we would advertise MTU as 3950, and
we would reject fragmented IPv6 datagrams of 3958 bytes size on the
wire. We're exclusively dealing with network MTU here, though, so we
could get Ethernet frames up to 3964 octets in that case.

Based on earlier patch from Florian Westphal.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 drivers/net/bareudp.c     |   5 +-
 drivers/net/geneve.c      |   5 +-
 drivers/net/vxlan.c       |   4 +-
 include/net/dst.h         |  10 ----
 include/net/ip_tunnels.h  |  88 +++++++++++++++++++++++++++
 net/ipv4/ip_tunnel_core.c | 122 ++++++++++++++++++++++++++++++++++++++
 6 files changed, 218 insertions(+), 16 deletions(-)

diff --git a/drivers/net/bareudp.c b/drivers/net/bareudp.c
index 44eb2b1d0416..d2560591c4d6 100644
--- a/drivers/net/bareudp.c
+++ b/drivers/net/bareudp.c
@@ -308,7 +308,7 @@ static int bareudp_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 		return PTR_ERR(rt);
 
 	skb_tunnel_check_pmtu(skb, &rt->dst,
-			      BAREUDP_IPV4_HLEN + info->options_len);
+			      BAREUDP_IPV4_HLEN + info->options_len, false);
 
 	sport = udp_flow_src_port(bareudp->net, skb,
 				  bareudp->sport_min, USHRT_MAX,
@@ -369,7 +369,8 @@ static int bareudp6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 	if (IS_ERR(dst))
 		return PTR_ERR(dst);
 
-	skb_tunnel_check_pmtu(skb, dst, BAREUDP_IPV6_HLEN + info->options_len);
+	skb_tunnel_check_pmtu(skb, dst, BAREUDP_IPV6_HLEN + info->options_len,
+			      false);
 
 	sport = udp_flow_src_port(bareudp->net, skb,
 				  bareudp->sport_min, USHRT_MAX,
diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 017c13acc911..de86b6d82132 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -894,7 +894,7 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 		return PTR_ERR(rt);
 
 	skb_tunnel_check_pmtu(skb, &rt->dst,
-			      GENEVE_IPV4_HLEN + info->options_len);
+			      GENEVE_IPV4_HLEN + info->options_len, false);
 
 	sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
 	if (geneve->cfg.collect_md) {
@@ -955,7 +955,8 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 	if (IS_ERR(dst))
 		return PTR_ERR(dst);
 
-	skb_tunnel_check_pmtu(skb, dst, GENEVE_IPV6_HLEN + info->options_len);
+	skb_tunnel_check_pmtu(skb, dst, GENEVE_IPV6_HLEN + info->options_len,
+			      false);
 
 	sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
 	if (geneve->cfg.collect_md) {
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index a43c97b13924..21ea79f65410 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2714,7 +2714,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 		}
 
 		ndst = &rt->dst;
-		skb_tunnel_check_pmtu(skb, ndst, VXLAN_HEADROOM);
+		skb_tunnel_check_pmtu(skb, ndst, VXLAN_HEADROOM, false);
 
 		tos = ip_tunnel_ecn_encap(RT_TOS(tos), old_iph, skb);
 		ttl = ttl ? : ip4_dst_hoplimit(&rt->dst);
@@ -2754,7 +2754,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 				goto out_unlock;
 		}
 
-		skb_tunnel_check_pmtu(skb, ndst, VXLAN6_HEADROOM);
+		skb_tunnel_check_pmtu(skb, ndst, VXLAN6_HEADROOM, false);
 
 		tos = ip_tunnel_ecn_encap(RT_TOS(tos), old_iph, skb);
 		ttl = ttl ? : ip6_dst_hoplimit(ndst);
diff --git a/include/net/dst.h b/include/net/dst.h
index 852d8fb36ab7..6ae2e625050d 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -535,14 +535,4 @@ static inline void skb_dst_update_pmtu_no_confirm(struct sk_buff *skb, u32 mtu)
 		dst->ops->update_pmtu(dst, NULL, skb, mtu, false);
 }
 
-static inline void skb_tunnel_check_pmtu(struct sk_buff *skb,
-					 struct dst_entry *encap_dst,
-					 int headroom)
-{
-	u32 encap_mtu = dst_mtu(encap_dst);
-
-	if (skb->len > encap_mtu - headroom)
-		skb_dst_update_pmtu_no_confirm(skb, encap_mtu - headroom);
-}
-
 #endif /* _NET_DST_H */
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 36025dea7612..b6f1f161a90f 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -9,6 +9,7 @@
 #include <linux/types.h>
 #include <linux/u64_stats_sync.h>
 #include <linux/bitops.h>
+#include <linux/icmp.h>
 
 #include <net/dsfield.h>
 #include <net/gro_cells.h>
@@ -17,6 +18,7 @@
 #include <net/rtnetlink.h>
 #include <net/lwtunnel.h>
 #include <net/dst_cache.h>
+#include <net/ip.h>
 
 #if IS_ENABLED(CONFIG_IPV6)
 #include <net/ipv6.h>
@@ -420,6 +422,7 @@ void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct sk_buff *skb,
 		   u8 tos, u8 ttl, __be16 df, bool xnet);
 struct metadata_dst *iptunnel_metadata_reply(struct metadata_dst *md,
 					     gfp_t flags);
+int iptunnel_pmtud_build_icmp(struct sk_buff *skb, int mtu);
 
 int iptunnel_handle_offloads(struct sk_buff *skb, int gso_type_mask);
 
@@ -461,6 +464,91 @@ static inline void iptunnel_xmit_stats(struct net_device *dev, int pkt_len)
 	}
 }
 
+/**
+ * skb_tunnel_check_pmtu() - Check, update PMTU and trigger ICMP reply as needed
+ * @skb:	Buffer being sent by encapsulation, L2 headers expected
+ * @encap_dst:	Destination for tunnel encapsulation (outer IP)
+ * @headroom:	Encapsulation header size, bytes
+ * @reply:	Build matching ICMP or ICMPv6 message as a result
+ *
+ * L2 tunnel implementations that can carry IP and can be directly bridged
+ * (currently UDP tunnels) can't always rely on IP forwarding paths to handle
+ * PMTU discovery. In the bridged case, ICMP or ICMPv6 messages need to be built
+ * based on payload and sent back by the encapsulation itself.
+ *
+ * For routable interfaces, we just need to update the PMTU for the destination.
+ *
+ * Return: 0 if ICMP error not needed, length if built, negative value on error
+ */
+static inline int skb_tunnel_check_pmtu(struct sk_buff *skb,
+					struct dst_entry *encap_dst,
+					int headroom, bool reply)
+{
+	u32 mtu = dst_mtu(encap_dst) - headroom;
+
+	if ((skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu)) ||
+	    (!skb_is_gso(skb) && (skb->len - skb_mac_header_len(skb)) <= mtu))
+		return 0;
+
+	skb_dst_update_pmtu_no_confirm(skb, mtu);
+
+	if (!reply || skb->pkt_type == PACKET_HOST)
+		return 0;
+
+	if (skb->protocol == htons(ETH_P_IP) && mtu > 576) {
+		const struct icmphdr *icmph = icmp_hdr(skb);
+		const struct iphdr *iph = ip_hdr(skb);
+
+		if (iph->frag_off != htons(IP_DF) ||
+		    ipv4_is_lbcast(iph->daddr) ||
+		    ipv4_is_multicast(iph->daddr) ||
+		    ipv4_is_zeronet(iph->saddr) ||
+		    ipv4_is_loopback(iph->saddr) ||
+		    ipv4_is_lbcast(iph->saddr) ||
+		    ipv4_is_multicast(iph->saddr) ||
+		    (iph->protocol == IPPROTO_ICMP && icmp_is_err(icmph->type)))
+			return 0;
+
+		return iptunnel_pmtud_build_icmp(skb, mtu);
+	}
+
+#if IS_ENABLED(CONFIG_IPV6)
+	if (skb->protocol == htons(ETH_P_IPV6) && mtu > IPV6_MIN_MTU) {
+		const struct ipv6hdr *ip6h = ipv6_hdr(skb);
+		int stype = ipv6_addr_type(&ip6h->saddr);
+		u8 proto = ip6h->nexthdr;
+		__be16 frag_off;
+		int offset;
+
+		if (stype == IPV6_ADDR_ANY || stype == IPV6_ADDR_MULTICAST ||
+		    stype == IPV6_ADDR_LOOPBACK)
+			return 0;
+
+		offset = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr), &proto,
+					  &frag_off);
+		if (offset < 0 || (frag_off & htons(~0x7)))
+			return 0;
+
+		if (proto == IPPROTO_ICMPV6) {
+			struct icmp6hdr *icmp6h;
+
+			if (!pskb_may_pull(skb, (skb_network_header(skb) +
+						 offset + 1 - skb->data)))
+				return 0;
+
+			icmp6h = (struct icmp6hdr *)(skb_network_header(skb) +
+						     offset);
+			if (icmpv6_is_err(icmp6h->icmp6_type) ||
+			    icmp6h->icmp6_type == NDISC_REDIRECT)
+				return 0;
+		}
+
+		return iptunnel_pmtud_build_icmp(skb, mtu);
+	}
+#endif
+	return 0;
+}
+
 static inline void *ip_tunnel_info_opts(struct ip_tunnel_info *info)
 {
 	return info + 1;
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index f8b419e2475c..54a3dbf7f512 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -184,6 +184,128 @@ int iptunnel_handle_offloads(struct sk_buff *skb,
 }
 EXPORT_SYMBOL_GPL(iptunnel_handle_offloads);
 
+/**
+ * iptunnel_pmtud_build_icmp() - Build ICMP or ICMPv6 error message for PMTUD
+ * @skb:	Original packet with L2 header
+ * @mtu:	MTU value for ICMP or ICMPv6 error
+ *
+ * Return: length on success, negative error code if message couldn't be built.
+ */
+int iptunnel_pmtud_build_icmp(struct sk_buff *skb, int mtu)
+{
+	struct ethhdr eh;
+	int len, err;
+
+	if (skb->protocol == htons(ETH_P_IP))
+		len = ETH_HLEN + sizeof(struct iphdr);
+	else if (IS_ENABLED(CONFIG_IPV6) && skb->protocol == htons(ETH_P_IPV6))
+		len = ETH_HLEN + sizeof(struct ipv6hdr);
+	else
+		return 0;
+
+	if (!pskb_may_pull(skb, len))
+		return -EINVAL;
+
+	skb_copy_bits(skb, skb_mac_offset(skb), &eh, ETH_HLEN);
+	pskb_pull(skb, ETH_HLEN);
+	skb_reset_network_header(skb);
+
+	if (skb->protocol == htons(ETH_P_IP)) {
+		const struct iphdr *iph = ip_hdr(skb);
+		struct icmphdr *icmph;
+		struct iphdr *niph;
+
+		err = pskb_trim(skb, 576 - sizeof(*niph) - sizeof(*icmph));
+		if (err)
+			return err;
+
+		len = skb->len + sizeof(*icmph);
+		err = skb_cow(skb, sizeof(*niph) + sizeof(*icmph) + ETH_HLEN);
+		if (err)
+			return err;
+
+		icmph = skb_push(skb, sizeof(*icmph));
+		*icmph = (struct icmphdr) {
+			.type			= ICMP_DEST_UNREACH,
+			.code			= ICMP_FRAG_NEEDED,
+			.checksum		= 0,
+			.un.frag.__unused	= 0,
+			.un.frag.mtu		= ntohs(mtu),
+		};
+		icmph->checksum = ip_compute_csum(icmph, len);
+		skb_reset_transport_header(skb);
+
+		niph = skb_push(skb, sizeof(*niph));
+		*niph = (struct iphdr) {
+			.ihl			= sizeof(*niph) / 4u,
+			.version 		= 4,
+			.tos 			= 0,
+			.tot_len		= htons(len + sizeof(*niph)),
+			.id			= 0,
+			.frag_off		= htons(IP_DF),
+			.ttl			= iph->ttl,
+			.protocol		= IPPROTO_ICMP,
+			.saddr			= iph->daddr,
+			.daddr			= iph->saddr,
+		};
+		ip_send_check(niph);
+	}
+
+#if IS_ENABLED(CONFIG_IPV6)
+	else if (skb->protocol == htons(ETH_P_IPV6)) {
+		const struct ipv6hdr *ip6h = ipv6_hdr(skb);
+		struct icmp6hdr *icmp6h;
+		struct ipv6hdr *nip6h;
+		__wsum csum;
+
+		err = pskb_trim(skb, IPV6_MIN_MTU -
+				     sizeof(*nip6h) - sizeof(*icmp6h));
+		if (err)
+			return err;
+
+		len = skb->len + sizeof(*icmp6h);
+		err = skb_cow(skb, sizeof(*nip6h) + sizeof(*icmp6h) + ETH_HLEN);
+		if (err)
+			return err;
+
+		icmp6h = skb_push(skb, sizeof(*icmp6h));
+		*icmp6h = (struct icmp6hdr) {
+			.icmp6_type		= ICMPV6_PKT_TOOBIG,
+			.icmp6_code		= 0,
+			.icmp6_cksum		= 0,
+			.icmp6_mtu		= htonl(mtu),
+		};
+		skb_reset_transport_header(skb);
+
+		nip6h = skb_push(skb, sizeof(*nip6h));
+		*nip6h = (struct ipv6hdr) {
+			.priority		= 0,
+			.version		= 6,
+			.flow_lbl		= { 0 },
+			.payload_len		= htons(len),
+			.nexthdr		= IPPROTO_ICMPV6,
+			.hop_limit		= ip6h->hop_limit,
+			.saddr			= ip6h->daddr,
+			.daddr			= ip6h->saddr,
+		};
+
+		csum = csum_partial(icmp6h, len, 0);
+		icmp6h->icmp6_cksum = csum_ipv6_magic(&nip6h->saddr,
+						      &nip6h->daddr, len,
+						      IPPROTO_ICMPV6, csum);
+	}
+#endif
+
+	skb_reset_network_header(skb);
+	skb->ip_summed = CHECKSUM_NONE;
+
+	eth_header(skb, skb->dev, htons(eh.h_proto), eh.h_source, eh.h_dest, 0);
+	skb_reset_mac_header(skb);
+
+	return skb->len;
+}
+EXPORT_SYMBOL(iptunnel_pmtud_build_icmp);
+
 /* Often modified stats are per cpu, other are shared (netdev->stats) */
 void ip_tunnel_get_stats64(struct net_device *dev,
 			   struct rtnl_link_stats64 *tot)
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH net-next 3/6] vxlan: Support for PMTU discovery on directly bridged links
  2020-08-03 20:52 [PATCH net-next 0/6] Support PMTU discovery with bridged UDP tunnels Stefano Brivio
  2020-08-03 20:52 ` [PATCH net-next 1/6] ipv4: route: Ignore output interface in FIB lookup for PMTU route Stefano Brivio
  2020-08-03 20:52 ` [PATCH net-next 2/6] tunnels: PMTU discovery support for directly bridged IP packets Stefano Brivio
@ 2020-08-03 20:52 ` Stefano Brivio
  2020-08-03 23:48   ` David Ahern
  2020-08-03 20:52 ` [PATCH net-next 4/6] geneve: " Stefano Brivio
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Stefano Brivio @ 2020-08-03 20:52 UTC (permalink / raw)
  To: David S. Miller
  Cc: Florian Westphal, David Ahern, Aaron Conole, Numan Siddique,
	Jakub Kicinski, Pravin B Shelar, Roopa Prabhu,
	Nikolay Aleksandrov, Lourdes Pedrajas, netdev

If the interface is a bridge or Open vSwitch port, and we can't
forward a packet because it exceeds the local PMTU estimate,
trigger an ICMP or ICMPv6 reply to the sender, using the same
interface to forward it back.

If metadata collection is enabled, reverse destination and source
addresses, so that Open vSwitch is able to match this packet against
the existing, reverse flow.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 drivers/net/vxlan.c | 49 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 43 insertions(+), 6 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 21ea79f65410..88941f26f851 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2494,7 +2494,8 @@ static struct dst_entry *vxlan6_get_route(struct vxlan_dev *vxlan,
 
 /* Bypass encapsulation if the destination is local */
 static void vxlan_encap_bypass(struct sk_buff *skb, struct vxlan_dev *src_vxlan,
-			       struct vxlan_dev *dst_vxlan, __be32 vni)
+			       struct vxlan_dev *dst_vxlan, __be32 vni,
+			       bool snoop)
 {
 	struct pcpu_sw_netstats *tx_stats, *rx_stats;
 	union vxlan_addr loopback;
@@ -2526,7 +2527,7 @@ static void vxlan_encap_bypass(struct sk_buff *skb, struct vxlan_dev *src_vxlan,
 		goto drop;
 	}
 
-	if (dst_vxlan->cfg.flags & VXLAN_F_LEARN)
+	if ((dst_vxlan->cfg.flags & VXLAN_F_LEARN) && snoop)
 		vxlan_snoop(dev, &loopback, eth_hdr(skb)->h_source, 0, vni);
 
 	u64_stats_update_begin(&tx_stats->syncp);
@@ -2575,7 +2576,7 @@ static int encap_bypass_if_local(struct sk_buff *skb, struct net_device *dev,
 
 			return -ENOENT;
 		}
-		vxlan_encap_bypass(skb, vxlan, dst_vxlan, vni);
+		vxlan_encap_bypass(skb, vxlan, dst_vxlan, vni, true);
 		return 1;
 	}
 
@@ -2611,7 +2612,8 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 		if (vxlan_addr_any(dst)) {
 			if (did_rsc) {
 				/* short-circuited back to local bridge */
-				vxlan_encap_bypass(skb, vxlan, vxlan, default_vni);
+				vxlan_encap_bypass(skb, vxlan, vxlan,
+						   default_vni, true);
 				return;
 			}
 			goto drop;
@@ -2714,7 +2716,24 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 		}
 
 		ndst = &rt->dst;
-		skb_tunnel_check_pmtu(skb, ndst, VXLAN_HEADROOM, false);
+		err = skb_tunnel_check_pmtu(skb, ndst, VXLAN_HEADROOM,
+					    netif_is_bridge_port(dev) ||
+					    netif_is_ovs_port(dev));
+		if (err < 0) {
+			goto tx_error;
+		} else if (err) {
+			if (info) {
+				struct in_addr src, dst;
+
+				src = remote_ip.sin.sin_addr;
+				dst = local_ip.sin.sin_addr;
+				info->key.u.ipv4.src = src.s_addr;
+				info->key.u.ipv4.dst = dst.s_addr;
+			}
+			vxlan_encap_bypass(skb, vxlan, vxlan, vni, false);
+			dst_release(ndst);
+			goto out_unlock;
+		}
 
 		tos = ip_tunnel_ecn_encap(RT_TOS(tos), old_iph, skb);
 		ttl = ttl ? : ip4_dst_hoplimit(&rt->dst);
@@ -2754,7 +2773,25 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 				goto out_unlock;
 		}
 
-		skb_tunnel_check_pmtu(skb, ndst, VXLAN6_HEADROOM, false);
+		err = skb_tunnel_check_pmtu(skb, ndst, VXLAN6_HEADROOM,
+					    netif_is_bridge_port(dev) ||
+					    netif_is_ovs_port(dev));
+		if (err < 0) {
+			goto tx_error;
+		} else if (err) {
+			if (info) {
+				struct in6_addr src, dst;
+
+				src = remote_ip.sin6.sin6_addr;
+				dst = local_ip.sin6.sin6_addr;
+				info->key.u.ipv6.src = src;
+				info->key.u.ipv6.dst = dst;
+			}
+
+			vxlan_encap_bypass(skb, vxlan, vxlan, vni, false);
+			dst_release(ndst);
+			goto out_unlock;
+		}
 
 		tos = ip_tunnel_ecn_encap(RT_TOS(tos), old_iph, skb);
 		ttl = ttl ? : ip6_dst_hoplimit(ndst);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH net-next 4/6] geneve: Support for PMTU discovery on directly bridged links
  2020-08-03 20:52 [PATCH net-next 0/6] Support PMTU discovery with bridged UDP tunnels Stefano Brivio
                   ` (2 preceding siblings ...)
  2020-08-03 20:52 ` [PATCH net-next 3/6] vxlan: Support for PMTU discovery on directly bridged links Stefano Brivio
@ 2020-08-03 20:52 ` Stefano Brivio
  2020-08-03 20:52 ` [PATCH net-next 5/6] selftests: pmtu.sh: Add tests for bridged UDP tunnels Stefano Brivio
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Stefano Brivio @ 2020-08-03 20:52 UTC (permalink / raw)
  To: David S. Miller
  Cc: Florian Westphal, David Ahern, Aaron Conole, Numan Siddique,
	Jakub Kicinski, Pravin B Shelar, Roopa Prabhu,
	Nikolay Aleksandrov, Lourdes Pedrajas, netdev

If the interface is a bridge or Open vSwitch port, and we can't
forward a packet because it exceeds the local PMTU estimate,
trigger an ICMP or ICMPv6 reply to the sender, using the same
interface to forward it back.

If metadata collection is enabled, set destination and source
addresses for the flow as if we were receiving the packet, so that
Open vSwitch can match the ICMP error against the existing
association.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 drivers/net/geneve.c | 58 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 53 insertions(+), 5 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index de86b6d82132..c169367236c1 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -893,8 +893,32 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 	if (IS_ERR(rt))
 		return PTR_ERR(rt);
 
-	skb_tunnel_check_pmtu(skb, &rt->dst,
-			      GENEVE_IPV4_HLEN + info->options_len, false);
+	err = skb_tunnel_check_pmtu(skb, &rt->dst,
+				    GENEVE_IPV4_HLEN + info->options_len,
+				    netif_is_bridge_port(dev) ||
+				    netif_is_ovs_port(dev));
+	if (err < 0) {
+		dst_release(&rt->dst);
+		return err;
+	} else if (err) {
+		struct ip_tunnel_info *info;
+
+		info = skb_tunnel_info(skb);
+		if (info) {
+			info->key.u.ipv4.dst = fl4.saddr;
+			info->key.u.ipv4.src = fl4.daddr;
+		}
+
+		if (!pskb_may_pull(skb, ETH_HLEN)) {
+			dst_release(&rt->dst);
+			return -EINVAL;
+		}
+
+		skb->protocol = eth_type_trans(skb, geneve->dev);
+		netif_rx(skb);
+		dst_release(&rt->dst);
+		return -EMSGSIZE;
+	}
 
 	sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
 	if (geneve->cfg.collect_md) {
@@ -955,8 +979,31 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 	if (IS_ERR(dst))
 		return PTR_ERR(dst);
 
-	skb_tunnel_check_pmtu(skb, dst, GENEVE_IPV6_HLEN + info->options_len,
-			      false);
+	err = skb_tunnel_check_pmtu(skb, dst,
+				    GENEVE_IPV6_HLEN + info->options_len,
+				    netif_is_bridge_port(dev) ||
+				    netif_is_ovs_port(dev));
+	if (err < 0) {
+		dst_release(dst);
+		return err;
+	} else if (err) {
+		struct ip_tunnel_info *info = skb_tunnel_info(skb);
+
+		if (info) {
+			info->key.u.ipv6.dst = fl6.saddr;
+			info->key.u.ipv6.src = fl6.daddr;
+		}
+
+		if (!pskb_may_pull(skb, ETH_HLEN)) {
+			dst_release(dst);
+			return -EINVAL;
+		}
+
+		skb->protocol = eth_type_trans(skb, geneve->dev);
+		netif_rx(skb);
+		dst_release(dst);
+		return -EMSGSIZE;
+	}
 
 	sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
 	if (geneve->cfg.collect_md) {
@@ -1013,7 +1060,8 @@ static netdev_tx_t geneve_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (likely(!err))
 		return NETDEV_TX_OK;
 
-	dev_kfree_skb(skb);
+	if (err != -EMSGSIZE)
+		dev_kfree_skb(skb);
 
 	if (err == -ELOOP)
 		dev->stats.collisions++;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH net-next 5/6] selftests: pmtu.sh: Add tests for bridged UDP tunnels
  2020-08-03 20:52 [PATCH net-next 0/6] Support PMTU discovery with bridged UDP tunnels Stefano Brivio
                   ` (3 preceding siblings ...)
  2020-08-03 20:52 ` [PATCH net-next 4/6] geneve: " Stefano Brivio
@ 2020-08-03 20:52 ` Stefano Brivio
  2020-08-03 20:52 ` [PATCH net-next 6/6] selftests: pmtu.sh: Add tests for UDP tunnels handled by Open vSwitch Stefano Brivio
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Stefano Brivio @ 2020-08-03 20:52 UTC (permalink / raw)
  To: David S. Miller
  Cc: Florian Westphal, David Ahern, Aaron Conole, Numan Siddique,
	Jakub Kicinski, Pravin B Shelar, Roopa Prabhu,
	Nikolay Aleksandrov, Lourdes Pedrajas, netdev

The new tests check that IP and IPv6 packets exceeding the local PMTU
estimate, both locally generated and forwarded by a bridge from
another node, result in the correct route exceptions being created,
and that communication with end-to-end fragmentation over VXLAN and
GENEVE tunnels is now possible as a result of PMTU discovery.

Part of the existing setup functions aren't generic enough to simply
add a namespace and a bridge to the existing routing setup. This
rework is in progress and we can easily shrink this once more generic
topology functions are available.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tools/testing/selftests/net/pmtu.sh | 167 ++++++++++++++++++++++++++--
 1 file changed, 160 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index 77c09cd339c3..ee79c26a37dd 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -59,6 +59,25 @@
 #	Same as pmtu_ipv6_vxlan6_exception, but using a GENEVE tunnel instead of
 #	VXLAN
 #
+# - pmtu_ipv{4,6}_br_vxlan{4,6}_exception
+#	Set up three namespaces, A, B, and C, with routing between A and B over
+#	R1. R2 is unused in these tests. A has a veth connection to C, and is
+#	connected to B via a VXLAN endpoint, which is directly bridged to C.
+#	MTU on the B-R1 link is lower than other MTUs.
+#
+#	Check that both C and A are able to communicate with B over the VXLAN
+#	tunnel, and that PMTU exceptions with the correct values are created.
+#
+#	                  segment a_r1    segment b_r1            b_r1: 4000
+#	                .--------------R1--------------.    everything
+#	   C---veth     A                               B         else: 5000
+#	        ' bridge                                |
+#	            '---- - - - - - VXLAN - - - - - - - '
+#
+# - pmtu_ipv{4,6}_br_geneve{4,6}_exception
+#	Same as pmtu_ipv{4,6}_br_vxlan{4,6}_exception, with a GENEVE tunnel
+#	instead.
+#
 # - pmtu_ipv{4,6}_fou{4,6}_exception
 #	Same as pmtu_ipv4_vxlan4, but using a direct IPv4/IPv6 encapsulation
 #	(FoU) over IPv4/IPv6, instead of VXLAN
@@ -147,6 +166,14 @@ tests="
 	pmtu_ipv6_geneve4_exception	IPv6 over geneve4: PMTU exceptions	1
 	pmtu_ipv4_geneve6_exception	IPv4 over geneve6: PMTU exceptions	1
 	pmtu_ipv6_geneve6_exception	IPv6 over geneve6: PMTU exceptions	1
+	pmtu_ipv4_br_vxlan4_exception	IPv4, bridged vxlan4: PMTU exceptions	1
+	pmtu_ipv6_br_vxlan4_exception	IPv6, bridged vxlan4: PMTU exceptions	1
+	pmtu_ipv4_br_vxlan6_exception	IPv4, bridged vxlan6: PMTU exceptions	1
+	pmtu_ipv6_br_vxlan6_exception	IPv6, bridged vxlan6: PMTU exceptions	1
+	pmtu_ipv4_br_geneve4_exception	IPv4, bridged geneve4: PMTU exceptions	1
+	pmtu_ipv6_br_geneve4_exception	IPv6, bridged geneve4: PMTU exceptions	1
+	pmtu_ipv4_br_geneve6_exception	IPv4, bridged geneve6: PMTU exceptions	1
+	pmtu_ipv6_br_geneve6_exception	IPv6, bridged geneve6: PMTU exceptions	1
 	pmtu_ipv4_fou4_exception	IPv4 over fou4: PMTU exceptions		1
 	pmtu_ipv6_fou4_exception	IPv6 over fou4: PMTU exceptions		1
 	pmtu_ipv4_fou6_exception	IPv4 over fou6: PMTU exceptions		1
@@ -173,10 +200,12 @@ tests="
 
 NS_A="ns-A"
 NS_B="ns-B"
+NS_C="ns-C"
 NS_R1="ns-R1"
 NS_R2="ns-R2"
 ns_a="ip netns exec ${NS_A}"
 ns_b="ip netns exec ${NS_B}"
+ns_c="ip netns exec ${NS_C}"
 ns_r1="ip netns exec ${NS_R1}"
 ns_r2="ip netns exec ${NS_R2}"
 
@@ -239,9 +268,11 @@ routes_nh="
 
 veth4_a_addr="192.168.1.1"
 veth4_b_addr="192.168.1.2"
+veth4_c_addr="192.168.2.10"
 veth4_mask="24"
 veth6_a_addr="fd00:1::a"
 veth6_b_addr="fd00:1::b"
+veth6_c_addr="fd00:2::c"
 veth6_mask="64"
 
 tunnel4_a_addr="192.168.2.1"
@@ -428,7 +459,7 @@ setup_ip6ip6() {
 }
 
 setup_namespaces() {
-	for n in ${NS_A} ${NS_B} ${NS_R1} ${NS_R2}; do
+	for n in ${NS_A} ${NS_B} ${NS_C} ${NS_R1} ${NS_R2}; do
 		ip netns add ${n} || return 1
 
 		# Disable DAD, so that we don't have to wait to use the
@@ -484,6 +515,7 @@ setup_vxlan_or_geneve() {
 	a_addr="${2}"
 	b_addr="${3}"
 	opts="${4}"
+	br_if_a="${5}"
 
 	if [ "${type}" = "vxlan" ]; then
 		opts="${opts} ttl 64 dstport 4789"
@@ -497,12 +529,19 @@ setup_vxlan_or_geneve() {
 	run_cmd ${ns_a} ip link add ${type}_a type ${type} id 1 ${opts_a} remote ${b_addr} ${opts} || return 1
 	run_cmd ${ns_b} ip link add ${type}_b type ${type} id 1 ${opts_b} remote ${a_addr} ${opts}
 
-	run_cmd ${ns_a} ip addr add ${tunnel4_a_addr}/${tunnel4_mask} dev ${type}_a
-	run_cmd ${ns_b} ip addr add ${tunnel4_b_addr}/${tunnel4_mask} dev ${type}_b
+	if [ -n "${br_if_a}" ]; then
+		run_cmd ${ns_a} ip addr add ${tunnel4_a_addr}/${tunnel4_mask} dev ${br_if_a}
+		run_cmd ${ns_a} ip addr add ${tunnel6_a_addr}/${tunnel6_mask} dev ${br_if_a}
+		run_cmd ${ns_a} ip link set ${type}_a master ${br_if_a}
+	else
+		run_cmd ${ns_a} ip addr add ${tunnel4_a_addr}/${tunnel4_mask} dev ${type}_a
+		run_cmd ${ns_a} ip addr add ${tunnel6_a_addr}/${tunnel6_mask} dev ${type}_a
+	fi
 
-	run_cmd ${ns_a} ip addr add ${tunnel6_a_addr}/${tunnel6_mask} dev ${type}_a
+	run_cmd ${ns_b} ip addr add ${tunnel4_b_addr}/${tunnel4_mask} dev ${type}_b
 	run_cmd ${ns_b} ip addr add ${tunnel6_b_addr}/${tunnel6_mask} dev ${type}_b
 
+
 	run_cmd ${ns_a} ip link set ${type}_a up
 	run_cmd ${ns_b} ip link set ${type}_b up
 }
@@ -516,11 +555,27 @@ setup_vxlan4() {
 }
 
 setup_geneve6() {
-	setup_vxlan_or_geneve geneve ${prefix6}:${a_r1}::1 ${prefix6}:${b_r1}::1
+	setup_vxlan_or_geneve geneve ${prefix6}:${a_r1}::1 ${prefix6}:${b_r1}::1 ""
 }
 
 setup_vxlan6() {
-	setup_vxlan_or_geneve vxlan  ${prefix6}:${a_r1}::1 ${prefix6}:${b_r1}::1
+	setup_vxlan_or_geneve vxlan  ${prefix6}:${a_r1}::1 ${prefix6}:${b_r1}::1 ""
+}
+
+setup_bridged_geneve4() {
+	setup_vxlan_or_geneve geneve ${prefix4}.${a_r1}.1  ${prefix4}.${b_r1}.1  "df set" "br0"
+}
+
+setup_bridged_vxlan4() {
+	setup_vxlan_or_geneve vxlan  ${prefix4}.${a_r1}.1  ${prefix4}.${b_r1}.1  "df set" "br0"
+}
+
+setup_bridged_geneve6() {
+	setup_vxlan_or_geneve geneve ${prefix6}:${a_r1}::1 ${prefix6}:${b_r1}::1 "" "br0"
+}
+
+setup_bridged_vxlan6() {
+	setup_vxlan_or_geneve vxlan  ${prefix6}:${a_r1}::1 ${prefix6}:${b_r1}::1 "" "br0"
 }
 
 setup_xfrm() {
@@ -630,6 +685,20 @@ setup_routing() {
 	return 0
 }
 
+setup_bridge() {
+	run_cmd ${ns_a} ip link add br0 type bridge || return 2
+	run_cmd ${ns_a} ip link set br0 up
+
+	run_cmd ${ns_c} ip link add veth_C-A type veth peer name veth_A-C
+	run_cmd ${ns_c} ip link set veth_A-C netns ns-A
+
+	run_cmd ${ns_a} ip link set veth_A-C up
+	run_cmd ${ns_c} ip link set veth_C-A up
+	run_cmd ${ns_c} ip addr add ${veth4_c_addr}/${veth4_mask} dev veth_C-A
+	run_cmd ${ns_c} ip addr add ${veth6_c_addr}/${veth6_mask} dev veth_C-A
+	run_cmd ${ns_a} ip link set veth_A-C master br0
+}
+
 setup() {
 	[ "$(id -u)" -ne 0 ] && echo "  need to run as root" && return $ksft_skip
 
@@ -657,7 +726,7 @@ cleanup() {
 	done
 	tcpdump_pids=
 
-	for n in ${NS_A} ${NS_B} ${NS_R1} ${NS_R2}; do
+	for n in ${NS_A} ${NS_B} ${NS_C} ${NS_R1} ${NS_R2}; do
 		ip netns del ${n} 2> /dev/null
 	done
 }
@@ -892,6 +961,90 @@ test_pmtu_ipv6_geneve6_exception() {
 	test_pmtu_ipvX_over_vxlanY_or_geneveY_exception geneve 6 6
 }
 
+test_pmtu_ipvX_over_bridged_vxlanY_or_geneveY_exception() {
+	type=${1}
+	family=${2}
+	outer_family=${3}
+	ll_mtu=4000
+
+	if [ ${outer_family} -eq 4 ]; then
+		setup namespaces routing bridge bridged_${type}4 || return 2
+		#                      IPv4 header   UDP header   VXLAN/GENEVE header   Ethernet header
+		exp_mtu=$((${ll_mtu} - 20          - 8          - 8                   - 14))
+	else
+		setup namespaces routing bridge bridged_${type}6 || return 2
+		#                      IPv6 header   UDP header   VXLAN/GENEVE header   Ethernet header
+		exp_mtu=$((${ll_mtu} - 40          - 8          - 8                   - 14))
+	fi
+
+	trace "${ns_a}" ${type}_a    "${ns_b}"  ${type}_b \
+	      "${ns_a}" veth_A-R1    "${ns_r1}" veth_R1-A \
+	      "${ns_b}" veth_B-R1    "${ns_r1}" veth_R1-B \
+	      "${ns_a}" br0          "${ns_a}"  veth-A-C  \
+	      "${ns_c}" veth_C-A
+
+	if [ ${family} -eq 4 ]; then
+		ping=ping
+		dst=${tunnel4_b_addr}
+	else
+		ping=${ping6}
+		dst=${tunnel6_b_addr}
+	fi
+
+	# Create route exception by exceeding link layer MTU
+	mtu "${ns_a}"  veth_A-R1 $((${ll_mtu} + 1000))
+	mtu "${ns_a}"  br0       $((${ll_mtu} + 1000))
+	mtu "${ns_a}"  veth_A-C  $((${ll_mtu} + 1000))
+	mtu "${ns_c}"  veth_C-A  $((${ll_mtu} + 1000))
+	mtu "${ns_r1}" veth_R1-A $((${ll_mtu} + 1000))
+	mtu "${ns_b}"  veth_B-R1 ${ll_mtu}
+	mtu "${ns_r1}" veth_R1-B ${ll_mtu}
+
+	mtu "${ns_a}" ${type}_a $((${ll_mtu} + 1000))
+	mtu "${ns_b}" ${type}_b $((${ll_mtu} + 1000))
+
+	run_cmd ${ns_c} ${ping} -q -M want -i 0.1 -c 10 -s $((${ll_mtu} + 500)) ${dst} || return 1
+	run_cmd ${ns_a} ${ping} -q -M want -i 0.1 -w 1  -s $((${ll_mtu} + 500)) ${dst} || return 1
+
+	# Check that exceptions were created
+	pmtu="$(route_get_dst_pmtu_from_exception "${ns_c}" ${dst})"
+	check_pmtu_value ${exp_mtu} "${pmtu}" "exceeding link layer MTU on bridged ${type} interface"
+	pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" ${dst})"
+	check_pmtu_value ${exp_mtu} "${pmtu}" "exceeding link layer MTU on locally bridged ${type} interface"
+}
+
+test_pmtu_ipv4_br_vxlan4_exception() {
+	test_pmtu_ipvX_over_bridged_vxlanY_or_geneveY_exception vxlan  4 4
+}
+
+test_pmtu_ipv6_br_vxlan4_exception() {
+	test_pmtu_ipvX_over_bridged_vxlanY_or_geneveY_exception vxlan  6 4
+}
+
+test_pmtu_ipv4_br_geneve4_exception() {
+	test_pmtu_ipvX_over_bridged_vxlanY_or_geneveY_exception geneve 4 4
+}
+
+test_pmtu_ipv6_br_geneve4_exception() {
+	test_pmtu_ipvX_over_bridged_vxlanY_or_geneveY_exception geneve 6 4
+}
+
+test_pmtu_ipv4_br_vxlan6_exception() {
+	test_pmtu_ipvX_over_bridged_vxlanY_or_geneveY_exception vxlan  4 6
+}
+
+test_pmtu_ipv6_br_vxlan6_exception() {
+	test_pmtu_ipvX_over_bridged_vxlanY_or_geneveY_exception vxlan  6 6
+}
+
+test_pmtu_ipv4_br_geneve6_exception() {
+	test_pmtu_ipvX_over_bridged_vxlanY_or_geneveY_exception geneve 4 6
+}
+
+test_pmtu_ipv6_br_geneve6_exception() {
+	test_pmtu_ipvX_over_bridged_vxlanY_or_geneveY_exception geneve 6 6
+}
+
 test_pmtu_ipvX_over_fouY_or_gueY() {
 	inner_family=${1}
 	outer_family=${2}
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH net-next 6/6] selftests: pmtu.sh: Add tests for UDP tunnels handled by Open vSwitch
  2020-08-03 20:52 [PATCH net-next 0/6] Support PMTU discovery with bridged UDP tunnels Stefano Brivio
                   ` (4 preceding siblings ...)
  2020-08-03 20:52 ` [PATCH net-next 5/6] selftests: pmtu.sh: Add tests for bridged UDP tunnels Stefano Brivio
@ 2020-08-03 20:52 ` Stefano Brivio
  2020-08-03 23:28 ` [PATCH net-next 0/6] Support PMTU discovery with bridged UDP tunnels Florian Westphal
  2020-08-04  1:25 ` David Miller
  7 siblings, 0 replies; 16+ messages in thread
From: Stefano Brivio @ 2020-08-03 20:52 UTC (permalink / raw)
  To: David S. Miller
  Cc: Florian Westphal, David Ahern, Aaron Conole, Numan Siddique,
	Jakub Kicinski, Pravin B Shelar, Roopa Prabhu,
	Nikolay Aleksandrov, Lourdes Pedrajas, netdev

The new tests check that IP and IPv6 packets exceeding the local PMTU
estimate, forwarded by an Open vSwitch instance from another node,
result in the correct route exceptions being created, and that
communication with end-to-end fragmentation, over GENEVE and VXLAN
Open vSwitch ports, is now possible as a result of PMTU discovery.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tools/testing/selftests/net/pmtu.sh | 180 ++++++++++++++++++++++++++++
 1 file changed, 180 insertions(+)

diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index ee79c26a37dd..5401f6fd615d 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -78,6 +78,26 @@
 #	Same as pmtu_ipv{4,6}_br_vxlan{4,6}_exception, with a GENEVE tunnel
 #	instead.
 #
+# - pmtu_ipv{4,6}_ovs_vxlan{4,6}_exception
+#	Set up two namespaces, B, and C, with routing between the init namespace
+#	and B over R1. A and R2 are unused in these tests. The init namespace
+#	has a veth connection to C, and is connected to B via a VXLAN endpoint,
+#	which is handled by Open vSwitch and bridged to C. MTU on the B-R1 link
+#	is lower than other MTUs.
+#
+#	Check that C is able to communicate with B over the VXLAN tunnel, and
+#	that PMTU exceptions with the correct values are created.
+#
+#	                  segment a_r1    segment b_r1            b_r1: 4000
+#	                .--------------R1--------------.    everything
+#	   C---veth    init                             B         else: 5000
+#	        '- ovs                                  |
+#	            '---- - - - - - VXLAN - - - - - - - '
+#
+# - pmtu_ipv{4,6}_ovs_geneve{4,6}_exception
+#	Same as pmtu_ipv{4,6}_ovs_vxlan{4,6}_exception, with a GENEVE tunnel
+#	instead.
+#
 # - pmtu_ipv{4,6}_fou{4,6}_exception
 #	Same as pmtu_ipv4_vxlan4, but using a direct IPv4/IPv6 encapsulation
 #	(FoU) over IPv4/IPv6, instead of VXLAN
@@ -174,6 +194,14 @@ tests="
 	pmtu_ipv6_br_geneve4_exception	IPv6, bridged geneve4: PMTU exceptions	1
 	pmtu_ipv4_br_geneve6_exception	IPv4, bridged geneve6: PMTU exceptions	1
 	pmtu_ipv6_br_geneve6_exception	IPv6, bridged geneve6: PMTU exceptions	1
+	pmtu_ipv4_ovs_vxlan4_exception	IPv4, OVS vxlan4: PMTU exceptions	1
+	pmtu_ipv6_ovs_vxlan4_exception	IPv6, OVS vxlan4: PMTU exceptions	1
+	pmtu_ipv4_ovs_vxlan6_exception	IPv4, OVS vxlan6: PMTU exceptions	1
+	pmtu_ipv6_ovs_vxlan6_exception	IPv6, OVS vxlan6: PMTU exceptions	1
+	pmtu_ipv4_ovs_geneve4_exception	IPv4, OVS geneve4: PMTU exceptions	1
+	pmtu_ipv6_ovs_geneve4_exception	IPv6, OVS geneve4: PMTU exceptions	1
+	pmtu_ipv4_ovs_geneve6_exception	IPv4, OVS geneve6: PMTU exceptions	1
+	pmtu_ipv6_ovs_geneve6_exception	IPv6, OVS geneve6: PMTU exceptions	1
 	pmtu_ipv4_fou4_exception	IPv4 over fou4: PMTU exceptions		1
 	pmtu_ipv6_fou4_exception	IPv6 over fou4: PMTU exceptions		1
 	pmtu_ipv4_fou6_exception	IPv4 over fou6: PMTU exceptions		1
@@ -699,6 +727,66 @@ setup_bridge() {
 	run_cmd ${ns_a} ip link set veth_A-C master br0
 }
 
+setup_ovs_vxlan_or_geneve() {
+	type="${1}"
+	a_addr="${2}"
+	b_addr="${3}"
+
+	if [ "${type}" = "vxlan" ]; then
+		opts="${opts} ttl 64 dstport 4789"
+		opts_b="local ${b_addr}"
+	fi
+
+	run_cmd ovs-vsctl add-port ovs_br0 ${type}_a -- \
+		set interface ${type}_a type=${type} \
+		options:remote_ip=${b_addr} options:key=1 options:csum=true || return 1
+
+	run_cmd ${ns_b} ip link add ${type}_b type ${type} id 1 ${opts_b} remote ${a_addr} ${opts} || return 1
+
+	run_cmd ${ns_b} ip addr add ${tunnel4_b_addr}/${tunnel4_mask} dev ${type}_b
+	run_cmd ${ns_b} ip addr add ${tunnel6_b_addr}/${tunnel6_mask} dev ${type}_b
+
+	run_cmd ${ns_b} ip link set ${type}_b up
+}
+
+setup_ovs_geneve4() {
+	setup_ovs_vxlan_or_geneve geneve ${prefix4}.${a_r1}.1  ${prefix4}.${b_r1}.1
+}
+
+setup_ovs_vxlan4() {
+	setup_ovs_vxlan_or_geneve vxlan  ${prefix4}.${a_r1}.1  ${prefix4}.${b_r1}.1
+}
+
+setup_ovs_geneve6() {
+	setup_ovs_vxlan_or_geneve geneve ${prefix6}:${a_r1}::1 ${prefix6}:${b_r1}::1
+}
+
+setup_ovs_vxlan6() {
+	setup_ovs_vxlan_or_geneve vxlan  ${prefix6}:${a_r1}::1 ${prefix6}:${b_r1}::1
+}
+
+setup_ovs_bridge() {
+	run_cmd ovs-vsctl add-br ovs_br0 || return 2
+	run_cmd ip link set ovs_br0 up
+
+	run_cmd ${ns_c} ip link add veth_C-A type veth peer name veth_A-C
+	run_cmd ${ns_c} ip link set veth_A-C netns 1
+
+	run_cmd         ip link set veth_A-C up
+	run_cmd ${ns_c} ip link set veth_C-A up
+	run_cmd ${ns_c} ip addr add ${veth4_c_addr}/${veth4_mask} dev veth_C-A
+	run_cmd ${ns_c} ip addr add ${veth6_c_addr}/${veth6_mask} dev veth_C-A
+	run_cmd ovs-vsctl add-port ovs_br0 veth_A-C
+
+	# Move veth_A-R1 to init
+	run_cmd ${ns_a} ip link set veth_A-R1 netns 1
+	run_cmd ip addr add ${prefix4}.${a_r1}.1/${veth4_mask} dev veth_A-R1
+	run_cmd ip addr add ${prefix6}:${a_r1}::1/${veth6_mask} dev veth_A-R1
+	run_cmd ip link set veth_A-R1 up
+	run_cmd ip route add ${prefix4}.${b_r1}.1 via ${prefix4}.${a_r1}.2
+	run_cmd ip route add ${prefix6}:${b_r1}::1 via ${prefix6}:${a_r1}::2
+}
+
 setup() {
 	[ "$(id -u)" -ne 0 ] && echo "  need to run as root" && return $ksft_skip
 
@@ -729,6 +817,11 @@ cleanup() {
 	for n in ${NS_A} ${NS_B} ${NS_C} ${NS_R1} ${NS_R2}; do
 		ip netns del ${n} 2> /dev/null
 	done
+
+	ip link del veth_A-C			2>/dev/null
+	ip link del veth_A-R1			2>/dev/null
+	ovs-vsctl --if-exists del-port vxlan_a	2>/dev/null
+	ovs-vsctl --if-exists del-br ovs_br0	2>/dev/null
 }
 
 mtu() {
@@ -1045,6 +1138,93 @@ test_pmtu_ipv6_br_geneve6_exception() {
 	test_pmtu_ipvX_over_bridged_vxlanY_or_geneveY_exception geneve 6 6
 }
 
+test_pmtu_ipvX_over_ovs_vxlanY_or_geneveY_exception() {
+	type=${1}
+	family=${2}
+	outer_family=${3}
+	ll_mtu=4000
+
+	if [ ${outer_family} -eq 4 ]; then
+		setup namespaces routing ovs_bridge ovs_${type}4 || return 2
+		#                      IPv4 header   UDP header   VXLAN/GENEVE header   Ethernet header
+		exp_mtu=$((${ll_mtu} - 20          - 8          - 8                   - 14))
+	else
+		setup namespaces routing ovs_bridge ovs_${type}6 || return 2
+		#                      IPv6 header   UDP header   VXLAN/GENEVE header   Ethernet header
+		exp_mtu=$((${ll_mtu} - 40          - 8          - 8                   - 14))
+	fi
+
+	if [ "${type}" = "vxlan" ]; then
+		tun_a="vxlan_sys_4789"
+	elif [ "${type}" = "geneve" ]; then
+		tun_a="genev_sys_6081"
+	fi
+
+	trace ""        "${tun_a}"  "${ns_b}"  ${type}_b \
+	      ""        veth_A-R1   "${ns_r1}" veth_R1-A \
+	      "${ns_b}" veth_B-R1   "${ns_r1}" veth_R1-B \
+	      ""        ovs_br0     ""         veth-A-C  \
+	      "${ns_c}" veth_C-A
+
+	if [ ${family} -eq 4 ]; then
+		ping=ping
+		dst=${tunnel4_b_addr}
+	else
+		ping=${ping6}
+		dst=${tunnel6_b_addr}
+	fi
+
+	# Create route exception by exceeding link layer MTU
+	mtu ""         veth_A-R1 $((${ll_mtu} + 1000))
+	mtu ""         ovs_br0   $((${ll_mtu} + 1000))
+	mtu ""         veth_A-C  $((${ll_mtu} + 1000))
+	mtu "${ns_c}"  veth_C-A  $((${ll_mtu} + 1000))
+	mtu "${ns_r1}" veth_R1-A $((${ll_mtu} + 1000))
+	mtu "${ns_b}"  veth_B-R1 ${ll_mtu}
+	mtu "${ns_r1}" veth_R1-B ${ll_mtu}
+
+	mtu ""        ${tun_a}  $((${ll_mtu} + 1000))
+	mtu "${ns_b}" ${type}_b $((${ll_mtu} + 1000))
+
+	run_cmd ${ns_c} ${ping} -q -M want -i 0.1 -c 20 -s $((${ll_mtu} + 500)) ${dst} || return 1
+
+	# Check that exceptions were created
+	pmtu="$(route_get_dst_pmtu_from_exception "${ns_c}" ${dst})"
+	check_pmtu_value ${exp_mtu} "${pmtu}" "exceeding link layer MTU on Open vSwitch ${type} interface"
+}
+
+test_pmtu_ipv4_ovs_vxlan4_exception() {
+	test_pmtu_ipvX_over_ovs_vxlanY_or_geneveY_exception vxlan  4 4
+}
+
+test_pmtu_ipv6_ovs_vxlan4_exception() {
+	test_pmtu_ipvX_over_ovs_vxlanY_or_geneveY_exception vxlan  6 4
+}
+
+test_pmtu_ipv4_ovs_geneve4_exception() {
+	test_pmtu_ipvX_over_ovs_vxlanY_or_geneveY_exception geneve 4 4
+}
+
+test_pmtu_ipv6_ovs_geneve4_exception() {
+	test_pmtu_ipvX_over_ovs_vxlanY_or_geneveY_exception geneve 6 4
+}
+
+test_pmtu_ipv4_ovs_vxlan6_exception() {
+	test_pmtu_ipvX_over_ovs_vxlanY_or_geneveY_exception vxlan  4 6
+}
+
+test_pmtu_ipv6_ovs_vxlan6_exception() {
+	test_pmtu_ipvX_over_ovs_vxlanY_or_geneveY_exception vxlan  6 6
+}
+
+test_pmtu_ipv4_ovs_geneve6_exception() {
+	test_pmtu_ipvX_over_ovs_vxlanY_or_geneveY_exception geneve 4 6
+}
+
+test_pmtu_ipv6_ovs_geneve6_exception() {
+	test_pmtu_ipvX_over_ovs_vxlanY_or_geneveY_exception geneve 6 6
+}
+
 test_pmtu_ipvX_over_fouY_or_gueY() {
 	inner_family=${1}
 	outer_family=${2}
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next 0/6] Support PMTU discovery with bridged UDP tunnels
  2020-08-03 20:52 [PATCH net-next 0/6] Support PMTU discovery with bridged UDP tunnels Stefano Brivio
                   ` (5 preceding siblings ...)
  2020-08-03 20:52 ` [PATCH net-next 6/6] selftests: pmtu.sh: Add tests for UDP tunnels handled by Open vSwitch Stefano Brivio
@ 2020-08-03 23:28 ` Florian Westphal
  2020-08-03 23:46   ` David Ahern
  2020-08-04  1:25 ` David Miller
  7 siblings, 1 reply; 16+ messages in thread
From: Florian Westphal @ 2020-08-03 23:28 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: David S. Miller, Florian Westphal, David Ahern, Aaron Conole,
	Numan Siddique, Jakub Kicinski, Pravin B Shelar, Roopa Prabhu,
	Nikolay Aleksandrov, Lourdes Pedrajas, netdev

Stefano Brivio <sbrivio@redhat.com> wrote:
> Currently, PMTU discovery for UDP tunnels only works if packets are
> routed to the encapsulating interfaces, not bridged.
> 
> This results from the fact that we generally don't have valid routes
> to the senders we can use to relay ICMP and ICMPv6 errors, and makes
> PMTU discovery completely non-functional for VXLAN and GENEVE ports of
> both regular bridges and Open vSwitch instances.
> 
> If the sender is local, and packets are forwarded to the port by a
> regular bridge, all it takes is to generate a corresponding route
> exception on the encapsulating device. The bridge then finds the route
> exception carrying the PMTU value estimate as it forwards frames, and
> relays ICMP messages back to the socket of the local sender. Patch 1/6
> fixes this case.
> 
> If the sender resides on another node, we actually need to reply to
> IP and IPv6 packets ourselves and send these ICMP or ICMPv6 errors
> back, using the same encapsulating device. Patch 2/6, based on an
> original idea by Florian Westphal, adds the needed functionality,
> while patches 3/6 and 4/6 add matching support for VXLAN and GENEVE.
> 
> Finally, 5/6 and 6/6 introduce selftests for all combinations of
> inner and outer IP versions, covering both VXLAN and GENEVE, with
> both regular bridges and Open vSwitch instances.

Thanks for taking over and brining this into shape, this looks good to
me.

Given such setups will become easily get stuck on first pmtu update
it would be good to get this applied now, even tough merge window is
already open.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next 1/6] ipv4: route: Ignore output interface in FIB lookup for PMTU route
  2020-08-03 20:52 ` [PATCH net-next 1/6] ipv4: route: Ignore output interface in FIB lookup for PMTU route Stefano Brivio
@ 2020-08-03 23:30   ` David Ahern
  2020-08-04  5:52     ` Stefano Brivio
  0 siblings, 1 reply; 16+ messages in thread
From: David Ahern @ 2020-08-03 23:30 UTC (permalink / raw)
  To: Stefano Brivio, David S. Miller
  Cc: Florian Westphal, Aaron Conole, Numan Siddique, Jakub Kicinski,
	Pravin B Shelar, Roopa Prabhu, Nikolay Aleksandrov,
	Lourdes Pedrajas, netdev

On 8/3/20 2:52 PM, Stefano Brivio wrote:
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index a01efa062f6b..c14fd8124f57 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1050,6 +1050,7 @@ static void ip_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
>  	struct flowi4 fl4;
>  
>  	ip_rt_build_flow_key(&fl4, sk, skb);
> +	fl4.flowi4_oif = 0;	/* Don't make lookup fail for encapsulations */
>  	__ip_rt_update_pmtu(rt, &fl4, mtu);
>  }
>  
> 

Can this be limited to:
	if (skb &&
	    netif_is_bridge_port(skb->dev) || netif_is_ovs_port(skb->dev))
		fl4.flowi4_oif = 0;

I'm not sure we want to reset oif for all MTU updates.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next 2/6] tunnels: PMTU discovery support for directly bridged IP packets
  2020-08-03 20:52 ` [PATCH net-next 2/6] tunnels: PMTU discovery support for directly bridged IP packets Stefano Brivio
@ 2020-08-03 23:44   ` David Ahern
  2020-08-04  5:53     ` Stefano Brivio
  0 siblings, 1 reply; 16+ messages in thread
From: David Ahern @ 2020-08-03 23:44 UTC (permalink / raw)
  To: Stefano Brivio, David S. Miller
  Cc: Florian Westphal, Aaron Conole, Numan Siddique, Jakub Kicinski,
	Pravin B Shelar, Roopa Prabhu, Nikolay Aleksandrov,
	Lourdes Pedrajas, netdev

On 8/3/20 2:52 PM, Stefano Brivio wrote:
> @@ -461,6 +464,91 @@ static inline void iptunnel_xmit_stats(struct net_device *dev, int pkt_len)
>  	}
>  }
>  
> +/**
> + * skb_tunnel_check_pmtu() - Check, update PMTU and trigger ICMP reply as needed
> + * @skb:	Buffer being sent by encapsulation, L2 headers expected
> + * @encap_dst:	Destination for tunnel encapsulation (outer IP)
> + * @headroom:	Encapsulation header size, bytes
> + * @reply:	Build matching ICMP or ICMPv6 message as a result
> + *
> + * L2 tunnel implementations that can carry IP and can be directly bridged
> + * (currently UDP tunnels) can't always rely on IP forwarding paths to handle
> + * PMTU discovery. In the bridged case, ICMP or ICMPv6 messages need to be built
> + * based on payload and sent back by the encapsulation itself.
> + *
> + * For routable interfaces, we just need to update the PMTU for the destination.
> + *
> + * Return: 0 if ICMP error not needed, length if built, negative value on error
> + */
> +static inline int skb_tunnel_check_pmtu(struct sk_buff *skb,
> +					struct dst_entry *encap_dst,
> +					int headroom, bool reply)

Given its size, this is probably better as a function. I believe it can
go into net/ipv4/ip_tunnel_core.c like you have iptunnel_pmtud_build_icmp.


> +{
> +	u32 mtu = dst_mtu(encap_dst) - headroom;
> +
> +	if ((skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu)) ||
> +	    (!skb_is_gso(skb) && (skb->len - skb_mac_header_len(skb)) <= mtu))
> +		return 0;
> +
> +	skb_dst_update_pmtu_no_confirm(skb, mtu);
> +
> +	if (!reply || skb->pkt_type == PACKET_HOST)
> +		return 0;
> +
> +	if (skb->protocol == htons(ETH_P_IP) && mtu > 576) {

I am surprised the 576 does not have an existing macro.

> +		const struct icmphdr *icmph = icmp_hdr(skb);
> +		const struct iphdr *iph = ip_hdr(skb);
> +
> +		if (iph->frag_off != htons(IP_DF) ||
> +		    ipv4_is_lbcast(iph->daddr) ||
> +		    ipv4_is_multicast(iph->daddr) ||
> +		    ipv4_is_zeronet(iph->saddr) ||
> +		    ipv4_is_loopback(iph->saddr) ||
> +		    ipv4_is_lbcast(iph->saddr) ||
> +		    ipv4_is_multicast(iph->saddr) ||
> +		    (iph->protocol == IPPROTO_ICMP && icmp_is_err(icmph->type)))
> +			return 0;
> +
> +		return iptunnel_pmtud_build_icmp(skb, mtu);
> +	}
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +	if (skb->protocol == htons(ETH_P_IPV6) && mtu > IPV6_MIN_MTU) {
> +		const struct ipv6hdr *ip6h = ipv6_hdr(skb);
> +		int stype = ipv6_addr_type(&ip6h->saddr);
> +		u8 proto = ip6h->nexthdr;
> +		__be16 frag_off;
> +		int offset;
> +
> +		if (stype == IPV6_ADDR_ANY || stype == IPV6_ADDR_MULTICAST ||
> +		    stype == IPV6_ADDR_LOOPBACK)
> +			return 0;
> +
> +		offset = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr), &proto,
> +					  &frag_off);
> +		if (offset < 0 || (frag_off & htons(~0x7)))
> +			return 0;
> +
> +		if (proto == IPPROTO_ICMPV6) {
> +			struct icmp6hdr *icmp6h;
> +
> +			if (!pskb_may_pull(skb, (skb_network_header(skb) +
> +						 offset + 1 - skb->data)))
> +				return 0;
> +
> +			icmp6h = (struct icmp6hdr *)(skb_network_header(skb) +
> +						     offset);
> +			if (icmpv6_is_err(icmp6h->icmp6_type) ||
> +			    icmp6h->icmp6_type == NDISC_REDIRECT)
> +				return 0;
> +		}
> +
> +		return iptunnel_pmtud_build_icmp(skb, mtu);
> +	}
> +#endif


separate v4 and v6 code into helpers based on skb->protocol; the mtu
check then becomes part of the version specific helpers.

> +	return 0;
> +}
> +
>  static inline void *ip_tunnel_info_opts(struct ip_tunnel_info *info)
>  {
>  	return info + 1;
> diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
> index f8b419e2475c..54a3dbf7f512 100644
> --- a/net/ipv4/ip_tunnel_core.c
> +++ b/net/ipv4/ip_tunnel_core.c
> @@ -184,6 +184,128 @@ int iptunnel_handle_offloads(struct sk_buff *skb,
>  }
>  EXPORT_SYMBOL_GPL(iptunnel_handle_offloads);
>  
> +/**
> + * iptunnel_pmtud_build_icmp() - Build ICMP or ICMPv6 error message for PMTUD
> + * @skb:	Original packet with L2 header
> + * @mtu:	MTU value for ICMP or ICMPv6 error
> + *
> + * Return: length on success, negative error code if message couldn't be built.
> + */
> +int iptunnel_pmtud_build_icmp(struct sk_buff *skb, int mtu)
> +{
> +	struct ethhdr eh;
> +	int len, err;
> +
> +	if (skb->protocol == htons(ETH_P_IP))
> +		len = ETH_HLEN + sizeof(struct iphdr);
> +	else if (IS_ENABLED(CONFIG_IPV6) && skb->protocol == htons(ETH_P_IPV6))
> +		len = ETH_HLEN + sizeof(struct ipv6hdr);
> +	else
> +		return 0;
> +
> +	if (!pskb_may_pull(skb, len))
> +		return -EINVAL;
> +
> +	skb_copy_bits(skb, skb_mac_offset(skb), &eh, ETH_HLEN);
> +	pskb_pull(skb, ETH_HLEN);
> +	skb_reset_network_header(skb);
> +
> +	if (skb->protocol == htons(ETH_P_IP)) {
> +		const struct iphdr *iph = ip_hdr(skb);
> +		struct icmphdr *icmph;
> +		struct iphdr *niph;
> +
> +		err = pskb_trim(skb, 576 - sizeof(*niph) - sizeof(*icmph));
> +		if (err)
> +			return err;
> +
> +		len = skb->len + sizeof(*icmph);
> +		err = skb_cow(skb, sizeof(*niph) + sizeof(*icmph) + ETH_HLEN);
> +		if (err)
> +			return err;
> +
> +		icmph = skb_push(skb, sizeof(*icmph));
> +		*icmph = (struct icmphdr) {
> +			.type			= ICMP_DEST_UNREACH,
> +			.code			= ICMP_FRAG_NEEDED,
> +			.checksum		= 0,
> +			.un.frag.__unused	= 0,
> +			.un.frag.mtu		= ntohs(mtu),
> +		};
> +		icmph->checksum = ip_compute_csum(icmph, len);
> +		skb_reset_transport_header(skb);
> +
> +		niph = skb_push(skb, sizeof(*niph));
> +		*niph = (struct iphdr) {
> +			.ihl			= sizeof(*niph) / 4u,
> +			.version 		= 4,
> +			.tos 			= 0,
> +			.tot_len		= htons(len + sizeof(*niph)),
> +			.id			= 0,
> +			.frag_off		= htons(IP_DF),
> +			.ttl			= iph->ttl,
> +			.protocol		= IPPROTO_ICMP,
> +			.saddr			= iph->daddr,
> +			.daddr			= iph->saddr,
> +		};
> +		ip_send_check(niph);
> +	}
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +	else if (skb->protocol == htons(ETH_P_IPV6)) {
> +		const struct ipv6hdr *ip6h = ipv6_hdr(skb);
> +		struct icmp6hdr *icmp6h;
> +		struct ipv6hdr *nip6h;
> +		__wsum csum;
> +
> +		err = pskb_trim(skb, IPV6_MIN_MTU -
> +				     sizeof(*nip6h) - sizeof(*icmp6h));
> +		if (err)
> +			return err;
> +
> +		len = skb->len + sizeof(*icmp6h);
> +		err = skb_cow(skb, sizeof(*nip6h) + sizeof(*icmp6h) + ETH_HLEN);
> +		if (err)
> +			return err;
> +
> +		icmp6h = skb_push(skb, sizeof(*icmp6h));
> +		*icmp6h = (struct icmp6hdr) {
> +			.icmp6_type		= ICMPV6_PKT_TOOBIG,
> +			.icmp6_code		= 0,
> +			.icmp6_cksum		= 0,
> +			.icmp6_mtu		= htonl(mtu),
> +		};
> +		skb_reset_transport_header(skb);
> +
> +		nip6h = skb_push(skb, sizeof(*nip6h));
> +		*nip6h = (struct ipv6hdr) {
> +			.priority		= 0,
> +			.version		= 6,
> +			.flow_lbl		= { 0 },
> +			.payload_len		= htons(len),
> +			.nexthdr		= IPPROTO_ICMPV6,
> +			.hop_limit		= ip6h->hop_limit,
> +			.saddr			= ip6h->daddr,
> +			.daddr			= ip6h->saddr,
> +		};
> +
> +		csum = csum_partial(icmp6h, len, 0);
> +		icmp6h->icmp6_cksum = csum_ipv6_magic(&nip6h->saddr,
> +						      &nip6h->daddr, len,
> +						      IPPROTO_ICMPV6, csum);
> +	}
> +#endif
> +
> +	skb_reset_network_header(skb);
> +	skb->ip_summed = CHECKSUM_NONE;
> +
> +	eth_header(skb, skb->dev, htons(eh.h_proto), eh.h_source, eh.h_dest, 0);
> +	skb_reset_mac_header(skb);
> +
> +	return skb->len;
> +}
> +EXPORT_SYMBOL(iptunnel_pmtud_build_icmp);

I think separate v4 and v6 versions would be more readable; the
duplication is mostly skb manipulation.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next 0/6] Support PMTU discovery with bridged UDP tunnels
  2020-08-03 23:28 ` [PATCH net-next 0/6] Support PMTU discovery with bridged UDP tunnels Florian Westphal
@ 2020-08-03 23:46   ` David Ahern
  0 siblings, 0 replies; 16+ messages in thread
From: David Ahern @ 2020-08-03 23:46 UTC (permalink / raw)
  To: Florian Westphal, Stefano Brivio
  Cc: David S. Miller, Aaron Conole, Numan Siddique, Jakub Kicinski,
	Pravin B Shelar, Roopa Prabhu, Nikolay Aleksandrov,
	Lourdes Pedrajas, netdev

On 8/3/20 5:28 PM, Florian Westphal wrote:
> Stefano Brivio <sbrivio@redhat.com> wrote:
>> Currently, PMTU discovery for UDP tunnels only works if packets are
>> routed to the encapsulating interfaces, not bridged.
>>
>> This results from the fact that we generally don't have valid routes
>> to the senders we can use to relay ICMP and ICMPv6 errors, and makes
>> PMTU discovery completely non-functional for VXLAN and GENEVE ports of
>> both regular bridges and Open vSwitch instances.
>>
>> If the sender is local, and packets are forwarded to the port by a
>> regular bridge, all it takes is to generate a corresponding route
>> exception on the encapsulating device. The bridge then finds the route
>> exception carrying the PMTU value estimate as it forwards frames, and
>> relays ICMP messages back to the socket of the local sender. Patch 1/6
>> fixes this case.
>>
>> If the sender resides on another node, we actually need to reply to
>> IP and IPv6 packets ourselves and send these ICMP or ICMPv6 errors
>> back, using the same encapsulating device. Patch 2/6, based on an
>> original idea by Florian Westphal, adds the needed functionality,
>> while patches 3/6 and 4/6 add matching support for VXLAN and GENEVE.
>>
>> Finally, 5/6 and 6/6 introduce selftests for all combinations of
>> inner and outer IP versions, covering both VXLAN and GENEVE, with
>> both regular bridges and Open vSwitch instances.
> 
> Thanks for taking over and brining this into shape, this looks good to
> me.

+1. I'm sure this took quite a bit of your time. Thanks for doing that.
I like this version much better.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next 3/6] vxlan: Support for PMTU discovery on directly bridged links
  2020-08-03 20:52 ` [PATCH net-next 3/6] vxlan: Support for PMTU discovery on directly bridged links Stefano Brivio
@ 2020-08-03 23:48   ` David Ahern
  2020-08-04  5:53     ` Stefano Brivio
  0 siblings, 1 reply; 16+ messages in thread
From: David Ahern @ 2020-08-03 23:48 UTC (permalink / raw)
  To: Stefano Brivio, David S. Miller
  Cc: Florian Westphal, Aaron Conole, Numan Siddique, Jakub Kicinski,
	Pravin B Shelar, Roopa Prabhu, Nikolay Aleksandrov,
	Lourdes Pedrajas, netdev

On 8/3/20 2:52 PM, Stefano Brivio wrote:
> +		err = skb_tunnel_check_pmtu(skb, ndst, VXLAN_HEADROOM,
> +					    netif_is_bridge_port(dev) ||
> +					    netif_is_ovs_port(dev));

you have this check in a few places. Maybe a new helper like
netif_is_any_bridge_port that can check both IFF flags in 1 go.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next 0/6] Support PMTU discovery with bridged UDP tunnels
  2020-08-03 20:52 [PATCH net-next 0/6] Support PMTU discovery with bridged UDP tunnels Stefano Brivio
                   ` (6 preceding siblings ...)
  2020-08-03 23:28 ` [PATCH net-next 0/6] Support PMTU discovery with bridged UDP tunnels Florian Westphal
@ 2020-08-04  1:25 ` David Miller
  7 siblings, 0 replies; 16+ messages in thread
From: David Miller @ 2020-08-04  1:25 UTC (permalink / raw)
  To: sbrivio
  Cc: fw, dsahern, aconole, nusiddiq, kuba, pshelar, roopa, nikolay,
	lu, netdev

From: Stefano Brivio <sbrivio@redhat.com>
Date: Mon,  3 Aug 2020 22:52:08 +0200

> Currently, PMTU discovery for UDP tunnels only works if packets are
> routed to the encapsulating interfaces, not bridged.
> 
> This results from the fact that we generally don't have valid routes
> to the senders we can use to relay ICMP and ICMPv6 errors, and makes
> PMTU discovery completely non-functional for VXLAN and GENEVE ports of
> both regular bridges and Open vSwitch instances.
> 
> If the sender is local, and packets are forwarded to the port by a
> regular bridge, all it takes is to generate a corresponding route
> exception on the encapsulating device. The bridge then finds the route
> exception carrying the PMTU value estimate as it forwards frames, and
> relays ICMP messages back to the socket of the local sender. Patch 1/6
> fixes this case.
> 
> If the sender resides on another node, we actually need to reply to
> IP and IPv6 packets ourselves and send these ICMP or ICMPv6 errors
> back, using the same encapsulating device. Patch 2/6, based on an
> original idea by Florian Westphal, adds the needed functionality,
> while patches 3/6 and 4/6 add matching support for VXLAN and GENEVE.
> 
> Finally, 5/6 and 6/6 introduce selftests for all combinations of
> inner and outer IP versions, covering both VXLAN and GENEVE, with
> both regular bridges and Open vSwitch instances.

Please address the feedback you've received and I will apply this
series, thank you.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next 1/6] ipv4: route: Ignore output interface in FIB lookup for PMTU route
  2020-08-03 23:30   ` David Ahern
@ 2020-08-04  5:52     ` Stefano Brivio
  0 siblings, 0 replies; 16+ messages in thread
From: Stefano Brivio @ 2020-08-04  5:52 UTC (permalink / raw)
  To: David Ahern
  Cc: David S. Miller, Florian Westphal, Aaron Conole, Numan Siddique,
	Jakub Kicinski, Pravin B Shelar, Roopa Prabhu,
	Nikolay Aleksandrov, Lourdes Pedrajas, netdev

On Mon, 3 Aug 2020 17:30:46 -0600
David Ahern <dsahern@gmail.com> wrote:

> On 8/3/20 2:52 PM, Stefano Brivio wrote:
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index a01efa062f6b..c14fd8124f57 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -1050,6 +1050,7 @@ static void ip_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
> >  	struct flowi4 fl4;
> >  
> >  	ip_rt_build_flow_key(&fl4, sk, skb);
> > +	fl4.flowi4_oif = 0;	/* Don't make lookup fail for encapsulations */
> >  	__ip_rt_update_pmtu(rt, &fl4, mtu);
> >  }
> >  
> 
> Can this be limited to:
> 	if (skb &&
> 	    netif_is_bridge_port(skb->dev) || netif_is_ovs_port(skb->dev))
> 		fl4.flowi4_oif = 0;
> 
> I'm not sure we want to reset oif for all MTU updates.

I think that generally speaking we might, because this is about the
*path* MTU after all, so the output interface doesn't look very
relevant.

On the other hand, I couldn't find any other case where this makes a
difference, and I guess it's better to eventually find out about those
other cases if any, rather than fixing things by accident possibly in
the wrong way.

Changed in v2, thanks.

-- 
Stefano


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next 2/6] tunnels: PMTU discovery support for directly bridged IP packets
  2020-08-03 23:44   ` David Ahern
@ 2020-08-04  5:53     ` Stefano Brivio
  0 siblings, 0 replies; 16+ messages in thread
From: Stefano Brivio @ 2020-08-04  5:53 UTC (permalink / raw)
  To: David Ahern
  Cc: David S. Miller, Florian Westphal, Aaron Conole, Numan Siddique,
	Jakub Kicinski, Pravin B Shelar, Roopa Prabhu,
	Nikolay Aleksandrov, Lourdes Pedrajas, netdev

On Mon, 3 Aug 2020 17:44:16 -0600
David Ahern <dsahern@gmail.com> wrote:

> On 8/3/20 2:52 PM, Stefano Brivio wrote:
> > @@ -461,6 +464,91 @@ static inline void iptunnel_xmit_stats(struct net_device *dev, int pkt_len)
> > [...]
> >
> > +static inline int skb_tunnel_check_pmtu(struct sk_buff *skb,
> > +					struct dst_entry *encap_dst,
> > +					int headroom, bool reply)  
> 
> Given its size, this is probably better as a function. I believe it can
> go into net/ipv4/ip_tunnel_core.c like you have iptunnel_pmtud_build_icmp.

Right, moved in v2.

> > [...]
> > +	if (skb->protocol == htons(ETH_P_IP) && mtu > 576) {  
> 
> I am surprised the 576 does not have an existing macro.

I guess that comes from how RFC 791 picks this "512 plus something
reasonable" value. I'll think of a name and propose as a later patch,
it's used in a number of places.

> > [...]
> > +		return iptunnel_pmtud_build_icmp(skb, mtu);
> > +	}
> > +#endif  
> 
> separate v4 and v6 code into helpers based on skb->protocol; the mtu
> check then becomes part of the version specific helpers.

Done.

> > +EXPORT_SYMBOL(iptunnel_pmtud_build_icmp);  
> 
> I think separate v4 and v6 versions would be more readable; the
> duplication is mostly skb manipulation.

Yes, way more readable, changed in v2.

-- 
Stefano


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next 3/6] vxlan: Support for PMTU discovery on directly bridged links
  2020-08-03 23:48   ` David Ahern
@ 2020-08-04  5:53     ` Stefano Brivio
  0 siblings, 0 replies; 16+ messages in thread
From: Stefano Brivio @ 2020-08-04  5:53 UTC (permalink / raw)
  To: David Ahern
  Cc: David S. Miller, Florian Westphal, Aaron Conole, Numan Siddique,
	Jakub Kicinski, Pravin B Shelar, Roopa Prabhu,
	Nikolay Aleksandrov, Lourdes Pedrajas, netdev

On Mon, 3 Aug 2020 17:48:48 -0600
David Ahern <dsahern@gmail.com> wrote:

> On 8/3/20 2:52 PM, Stefano Brivio wrote:
> > +		err = skb_tunnel_check_pmtu(skb, ndst, VXLAN_HEADROOM,
> > +					    netif_is_bridge_port(dev) ||
> > +					    netif_is_ovs_port(dev));  
> 
> you have this check in a few places. Maybe a new helper like
> netif_is_any_bridge_port that can check both IFF flags in 1 go.

Ah, yes, good idea, added to 1/6 v2 as it's where it's needed now.

-- 
Stefano


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-08-04  5:54 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-03 20:52 [PATCH net-next 0/6] Support PMTU discovery with bridged UDP tunnels Stefano Brivio
2020-08-03 20:52 ` [PATCH net-next 1/6] ipv4: route: Ignore output interface in FIB lookup for PMTU route Stefano Brivio
2020-08-03 23:30   ` David Ahern
2020-08-04  5:52     ` Stefano Brivio
2020-08-03 20:52 ` [PATCH net-next 2/6] tunnels: PMTU discovery support for directly bridged IP packets Stefano Brivio
2020-08-03 23:44   ` David Ahern
2020-08-04  5:53     ` Stefano Brivio
2020-08-03 20:52 ` [PATCH net-next 3/6] vxlan: Support for PMTU discovery on directly bridged links Stefano Brivio
2020-08-03 23:48   ` David Ahern
2020-08-04  5:53     ` Stefano Brivio
2020-08-03 20:52 ` [PATCH net-next 4/6] geneve: " Stefano Brivio
2020-08-03 20:52 ` [PATCH net-next 5/6] selftests: pmtu.sh: Add tests for bridged UDP tunnels Stefano Brivio
2020-08-03 20:52 ` [PATCH net-next 6/6] selftests: pmtu.sh: Add tests for UDP tunnels handled by Open vSwitch Stefano Brivio
2020-08-03 23:28 ` [PATCH net-next 0/6] Support PMTU discovery with bridged UDP tunnels Florian Westphal
2020-08-03 23:46   ` David Ahern
2020-08-04  1:25 ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.