All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCHv5 net-next 3/3] sunvnet: generate ICMP PTMUD messages for smaller port MTUs
@ 2014-09-17 20:49 David L Stevens
  2014-09-17 21:13 ` Sergei Shtylyov
  2014-09-17 22:43 ` Sowmini Varadhan
  0 siblings, 2 replies; 7+ messages in thread
From: David L Stevens @ 2014-09-17 20:49 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

This patch sends ICMP and ICMPv6 messages for Path MTU Discovery when a remote
port MTU is smaller than the device MTU. This allows mixing newer VIO protocol
devices that support MTU negotiation with older devices that do not on the
same vswitch. It also allows Linux-Linux LDOMs to use 64K-1 data packets even
though Solaris vswitch is limited to <16K MTU.

Signed-off-by: David L Stevens <david.stevens@oracle.com>
---
 drivers/net/ethernet/sun/sunvnet.c |   37 +++++++++++++++++++++++++++++++++++-
 1 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index 5e64e60..3c4ee18 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -17,6 +17,13 @@
 #include <linux/mutex.h>
 #include <linux/if_vlan.h>
 
+#if IS_ENABLED(CONFIG_IPV6)
+#include <linux/icmpv6.h>
+#endif
+
+#include <net/icmp.h>
+#include <net/route.h>
+
 #include <asm/vio.h>
 #include <asm/ldc.h>
 
@@ -791,8 +798,36 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (unlikely(!port))
 		goto out_dropped;
 
-	if (skb->len > port->rmtu)
+	if (skb->len > port->rmtu) {
+		unsigned long localmtu = port->rmtu - ETH_HLEN;
+
+		if (vio_version_after_eq(&port->vio, 1, 3))
+			localmtu -= VLAN_HLEN;
+
+		if (skb->protocol == htons(ETH_P_IP)) {
+			struct flowi4 fl4;
+			struct rtable *rt = NULL;
+
+			memset(&fl4, 0, sizeof(fl4));
+			fl4.flowi4_oif = dev->ifindex;
+			fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
+			fl4.daddr = ip_hdr(skb)->daddr;
+			fl4.saddr = ip_hdr(skb)->saddr;
+
+			rt = ip_route_output_key(dev_net(dev), &fl4);
+			if (!IS_ERR(rt)) {
+				skb_dst_set(skb, &rt->dst);
+				icmp_send(skb, ICMP_DEST_UNREACH,
+					  ICMP_FRAG_NEEDED,
+					  htonl(localmtu));
+			}
+		}
+#if IS_ENABLED(CONFIG_IPV6)
+		else if (skb->protocol == htons(ETH_P_IPV6))
+			icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, localmtu);
+#endif
 		goto out_dropped;
+	}
 
 	spin_lock_irqsave(&port->vio.lock, flags);
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCHv5 net-next 3/3] sunvnet: generate ICMP PTMUD messages for smaller port MTUs
  2014-09-17 20:49 [PATCHv5 net-next 3/3] sunvnet: generate ICMP PTMUD messages for smaller port MTUs David L Stevens
@ 2014-09-17 21:13 ` Sergei Shtylyov
  2014-09-17 22:03   ` David L Stevens
  2014-09-17 22:43 ` Sowmini Varadhan
  1 sibling, 1 reply; 7+ messages in thread
From: Sergei Shtylyov @ 2014-09-17 21:13 UTC (permalink / raw)
  To: David L Stevens, David Miller; +Cc: netdev

Hello.

On 9/17/2014 11:49 PM, David L Stevens wrote:

> This patch sends ICMP and ICMPv6 messages for Path MTU Discovery when a remote
> port MTU is smaller than the device MTU. This allows mixing newer VIO protocol
> devices that support MTU negotiation with older devices that do not on the
> same vswitch. It also allows Linux-Linux LDOMs to use 64K-1 data packets even
> though Solaris vswitch is limited to <16K MTU.

> Signed-off-by: David L Stevens <david.stevens@oracle.com>
> ---
>   drivers/net/ethernet/sun/sunvnet.c |   37 +++++++++++++++++++++++++++++++++++-
>   1 files changed, 36 insertions(+), 1 deletions(-)

> diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
> index 5e64e60..3c4ee18 100644
> --- a/drivers/net/ethernet/sun/sunvnet.c
> +++ b/drivers/net/ethernet/sun/sunvnet.c
[...]
> @@ -791,8 +798,36 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
>   	if (unlikely(!port))
>   		goto out_dropped;
>
> -	if (skb->len > port->rmtu)
> +	if (skb->len > port->rmtu) {
> +		unsigned long localmtu = port->rmtu - ETH_HLEN;
> +
> +		if (vio_version_after_eq(&port->vio, 1, 3))
> +			localmtu -= VLAN_HLEN;
> +
> +		if (skb->protocol == htons(ETH_P_IP)) {
> +			struct flowi4 fl4;
> +			struct rtable *rt = NULL;
> +
> +			memset(&fl4, 0, sizeof(fl4));
> +			fl4.flowi4_oif = dev->ifindex;
> +			fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
> +			fl4.daddr = ip_hdr(skb)->daddr;
> +			fl4.saddr = ip_hdr(skb)->saddr;
> +
> +			rt = ip_route_output_key(dev_net(dev), &fl4);
> +			if (!IS_ERR(rt)) {
> +				skb_dst_set(skb, &rt->dst);
> +				icmp_send(skb, ICMP_DEST_UNREACH,
> +					  ICMP_FRAG_NEEDED,
> +					  htonl(localmtu));
> +			}
> +		}
> +#if IS_ENABLED(CONFIG_IPV6)

    This #if could be avoided by extending the *if* statement below, no?

> +		else if (skb->protocol == htons(ETH_P_IPV6))
> +			icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, localmtu);
> +#endif
>   		goto out_dropped;
> +	}

WBR, Sergei

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCHv5 net-next 3/3] sunvnet: generate ICMP PTMUD messages for smaller port MTUs
  2014-09-17 21:13 ` Sergei Shtylyov
@ 2014-09-17 22:03   ` David L Stevens
  0 siblings, 0 replies; 7+ messages in thread
From: David L Stevens @ 2014-09-17 22:03 UTC (permalink / raw)
  To: Sergei Shtylyov, David Miller; +Cc: netdev



On 09/17/2014 05:13 PM, Sergei Shtylyov wrote:

>> +        }
>> +#if IS_ENABLED(CONFIG_IPV6)
> 
>    This #if could be avoided by extending the *if* statement below, no?
> 
>> +        else if (skb->protocol == htons(ETH_P_IPV6))
>> +            icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, localmtu);
>> +#endif
>>           goto out_dropped;
>> +    }

I'm not sure I understand your point, but the #if must be there to
avoid a reference to icmpv6_send() which will not be defined if
CONFIG_IPV6=n.

						+-DLS

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCHv5 net-next 3/3] sunvnet: generate ICMP PTMUD messages for smaller port MTUs
  2014-09-17 20:49 [PATCHv5 net-next 3/3] sunvnet: generate ICMP PTMUD messages for smaller port MTUs David L Stevens
  2014-09-17 21:13 ` Sergei Shtylyov
@ 2014-09-17 22:43 ` Sowmini Varadhan
  2014-09-17 23:41   ` David L Stevens
  1 sibling, 1 reply; 7+ messages in thread
From: Sowmini Varadhan @ 2014-09-17 22:43 UTC (permalink / raw)
  To: David L Stevens; +Cc: David Miller, netdev

On (09/17/14 16:49), David L Stevens wrote:
> +
> +			rt = ip_route_output_key(dev_net(dev), &fl4);
> +			if (!IS_ERR(rt)) {

As I've mentioned before, this layering violation makes me uneasy,
so its benefits should be evaluated carefully.  You will typically not be
able to find an rt for packets coming here from any application
that does not itself use/update the FIB, e.g., uspace based packet-injectors (PF_PACKET-based applications, intel dpdk-based uspace stacks etc.)

--Sowmini

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCHv5 net-next 3/3] sunvnet: generate ICMP PTMUD messages for smaller port MTUs
  2014-09-17 22:43 ` Sowmini Varadhan
@ 2014-09-17 23:41   ` David L Stevens
  2014-09-18 19:23     ` Sowmini Varadhan
  0 siblings, 1 reply; 7+ messages in thread
From: David L Stevens @ 2014-09-17 23:41 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: David Miller, netdev



On 09/17/2014 06:43 PM, Sowmini Varadhan wrote:
> On (09/17/14 16:49), David L Stevens wrote:
>> +
>> +			rt = ip_route_output_key(dev_net(dev), &fl4);
>> +			if (!IS_ERR(rt)) {
> 
> As I've mentioned before, this layering violation makes me uneasy,
> so its benefits should be evaluated carefully.  You will typically not be
> able to find an rt for packets coming here from any application
> that does not itself use/update the FIB, e.g., uspace based packet-injectors
> (PF_PACKET-based applications, intel dpdk-based uspace stacks etc.)

I think the configurations where it doesn't work are reasonably rare, and
the default on boot is a 1500-byte MTU for everyone and none of these ICMP
errors will be triggered if that is where all hosts on the vswitch leave it.
You don't have to ever mix MTUs. The alternative in all cases where we send
an ICMP error to make it work is to instead silently drop those packets, all
packets of the same size or larger that we get after. It does nothing different
whatsoever for any configuration that works today. It only allows other configurations
to work also.

A pair of Linux LDOMs can get 8X throughput improvement by raising the MTU to 64K, but
many packets will be *silently* dropped if they go to any other destination that does
not support 64K MTU. Those destinations that don't support 64K MTU include any legacy
Linux running the pre-jumbo code and all Solaris hosts, including the current releases.

With the ICMP errors, the new linux code can interoperate properly with all of them, and
do so at much higher throughput (4-8X) with those that can support higher MTUs.

Also, I wouldn't call it a layering violation. icmp_send() is the external API for
triggering ICMP errors, and we are sending them at the point where we know the next-hop MTU.
It is exactly equivalent to an Ethernet device connected to a switch where the switch
sends useful layer-3 packets (like IGMP queries). In this case, that useful layer 3 info
is remote link MTU data; something not available in ordinary Ethernet.

Also, any PF_PACKET or other applications that bypass the routing tables can still (and
should) receive and process PMTUD packets. If you're sending raw Ethernet frames that are
IP, you should. If you mean non-routable protocols, none of those can be delivered over
any link where these ICMP errors would be sent, anyway. If you try to send an 18000-byte
non-IP packet with a 1500-byte MTU, it will be dropped today, and still dropped with this
patchset.

Large broadcast frames (your comment from another mail) will not work, just as they won't
work today. If you need to do that, you should leave the MTU of the sender below the lowest
MTU attached to the same switch, the way you would for any other Ethernet, and they'll be
fragmented and reassembled.

On the other hand, if you want your TCP and UDP traffic over IPv4 and IPv6 to be 4-8 times
faster than it is today without any other change, you can accept the deficiencies in the
otherwise not-allowed configuration and set the device MTU to 64K even when it's mixed with
other devices that can't do it. If all of them support the higher MTU, none of this code
comes into play. If some of them don't, this code allows them to interoperate; without this
code, all those packets are simply silently dropped and your network can only function at
the level of the least capable attached LDOM.

								+-DLS

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCHv5 net-next 3/3] sunvnet: generate ICMP PTMUD messages for smaller port MTUs
  2014-09-17 23:41   ` David L Stevens
@ 2014-09-18 19:23     ` Sowmini Varadhan
  2014-09-18 20:09       ` David L Stevens
  0 siblings, 1 reply; 7+ messages in thread
From: Sowmini Varadhan @ 2014-09-18 19:23 UTC (permalink / raw)
  To: David L Stevens; +Cc: David Miller, netdev

On 09/17/2014 07:41 PM, David L Stevens wrote:
>
>
> On 09/17/2014 06:43 PM, Sowmini Varadhan wrote:
>> On (09/17/14 16:49), David L Stevens wrote:
>>> +
>>> +			rt = ip_route_output_key(dev_net(dev), &fl4);
>>> +			if (!IS_ERR(rt)) {
>>
>> As I've mentioned before, this layering violation makes me uneasy,
>> so its benefits should be evaluated carefully.  You will typically not be
>> able to find an rt for packets coming here from any application
>> that does not itself use/update the FIB, e.g., uspace based packet-injectors
>> (PF_PACKET-based applications, intel dpdk-based uspace stacks etc.)

>
> A pair of Linux LDOMs can get 8X throughput improvement by raising the MTU to 64K, but
> many packets will be *silently* dropped if they go to any other destination that does
> not support 64K MTU. Those destinations that don't support 64K MTU include any legacy
> Linux running the pre-jumbo code and all Solaris hosts, including the current releases.

by now I am actually quite confused by what the Administrator will see.
If I do "ifconfig -a" or "ip addr", what is the reported mtu of the 
interface?


> Also, I wouldn't call it a layering violation. icmp_send() is the external API for
> triggering ICMP errors, and we are sending them at the point where we know the next-hop MTU.
> It is exactly equivalent to an Ethernet device connected to a switch where the switch
> sends useful layer-3 packets (like IGMP queries). In this case, that useful layer 3 info
> is remote link MTU data; something not available in ordinary Ethernet.

Interesting. So if the Administrator sets up ICMP filters for 
outbound/inbound (at the IP layer), what will be the observed behavior?

--Sowmini

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCHv5 net-next 3/3] sunvnet: generate ICMP PTMUD messages for smaller port MTUs
  2014-09-18 19:23     ` Sowmini Varadhan
@ 2014-09-18 20:09       ` David L Stevens
  0 siblings, 0 replies; 7+ messages in thread
From: David L Stevens @ 2014-09-18 20:09 UTC (permalink / raw)
  To: sowmini.varadhan; +Cc: David Miller, netdev



On 09/18/2014 03:23 PM, Sowmini Varadhan wrote:

> by now I am actually quite confused by what the Administrator will see.
> If I do "ifconfig -a" or "ip addr", what is the reported mtu of the interface?

	The interface MTU is whatever the admin sets it to, between 68 bytes (the IPv4 min)
and 64K-1 (the IPv4 max).
	In cases where packets of interface MTU size cannot be delivered because the LDC
MTU is smaller, instead of silently dropping them, we send the ICMP errors which allow
PMTUD updates per-destination. Subsequent packets will be segmented or fragmented at that
(lower) value for that destination, and use other MTUs up to the interface MTU for other
destinations.

> Interesting. So if the Administrator sets up ICMP filters for outbound/inbound (at the IP layer), what will be the observed behavior?

If an administrator drops PMTUD packets, then TCP won't work, even without this patch set, for any
destinations that cause PMTUD. It's explicitly not optional in IPv6; in IPv4, fragmenting TCP
packets could hide it as long as IP_DF is not set, but the only thing this code could do for
packets too big is to drop them -- exactly what we'd do whether or not we send the ICMP error to
tell the sender what MTU we can send.

								+-DLS

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2014-09-18 20:09 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-17 20:49 [PATCHv5 net-next 3/3] sunvnet: generate ICMP PTMUD messages for smaller port MTUs David L Stevens
2014-09-17 21:13 ` Sergei Shtylyov
2014-09-17 22:03   ` David L Stevens
2014-09-17 22:43 ` Sowmini Varadhan
2014-09-17 23:41   ` David L Stevens
2014-09-18 19:23     ` Sowmini Varadhan
2014-09-18 20:09       ` David L Stevens

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.