* UDP path MTU discovery
@ 2010-03-26  0:02 Glen Turner
  2010-03-26  0:53 ` Rick Jones
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Glen Turner @ 2010-03-26  0:02 UTC (permalink / raw)
  To: netdev

[This is a second attempt to report this bug.]

Path MTU Discovery for UDP underperforms for IPv4 and fails
for IPv6 in Linux for transactional services like DHCP and
RADIUS running on jumbo frame interfaces.

These servers send packets with exponential back-off. UDP
Path MTU Discovery probes for the path MTU each time the
application sends a packet. So if you start with a high
enough interface MTU then the server application backoff
times get huge and the client gives up before the path
MTU is discovered.
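
To put rough numbers on the back-off, here is a minimal sketch in C. It
assumes a retransmit timer that starts at some initial value and doubles
after every loss, and that each lost datagram uncovers at most one
narrower hop. The function name and the timer values are illustrative
and not taken from any particular server.

```c
/*
 * Seconds between the first send and retransmission number n (1-based),
 * assuming the first timeout is `initial` seconds and the timeout doubles
 * after every loss. Hypothetical helper, for illustration only.
 */
unsigned long backoff_elapsed(unsigned long initial, unsigned int n)
{
    unsigned long total = 0;
    unsigned long timeout = initial;

    for (unsigned int i = 0; i < n; i++) {
        total += timeout;   /* wait out the current timer */
        timeout *= 2;       /* exponential back-off */
    }
    return total;
}
```

With a 4 second initial timeout, backoff_elapsed(4, 3) is 28: the third
retransmission only leaves 28 seconds after the first send, by which time
a client may well have given up.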

This differs from TCP, where it is the kernel -- and not
the application -- which organises retransmission. On
receiving an ICMP Fragmentation Needed the kernel can
immediately re-probe the path MTU without waiting for
an exponential timer to expire.

In IPv4 there is a work-around for the server: turn off
Path MTU Discovery and allow routers to fragment the packet
as needed. Looking at the code for the various transactional
servers (ISC DHCP, FreeRADIUS, RADIATOR, radsecproxy) they
all disable Path MTU Discovery on Linux. This workaround has
the side effect of hiding the problem, misleading people into
thinking that UDP Path MTU Discovery actually works for these
transactional servers.
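
For reference, the work-around amounts to one setsockopt() call per
socket. A minimal sketch, assuming a Linux IPv4 UDP socket; the helper
name disable_pmtud_ipv4 is made up for illustration, while
IP_MTU_DISCOVER and IP_PMTUDISC_DONT are the options documented in ip(7):

```c
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Stop setting DF on datagrams sent from this IPv4 socket, so routers
 * may fragment them. Returns 0 on success, -1 with errno set on failure.
 * Hypothetical helper name.
 */
int disable_pmtud_ipv4(int fd)
{
    int val = IP_PMTUDISC_DONT;

    return setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &val, sizeof(val));
}
```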

In IPv6, routers do not fragment packets, so there is no
work-around. Transactional servers which use UDP over IPv6 encounter
exponential backoffs within the application and the client
abandons the transaction. There is no way for the server to
know that the packet was lost due to Path MTU Discovery and
to immediately re-transmit it (without an exponential penalty)
so that the MTU can be probed again.

This can be viewed as a flaw in the RFC and in the sockets API
for which IPv6 has removed the common work-around.

Thank you, Glen

-- 
 Glen Turner
 www.gdt.id.au/~gdt


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: UDP path MTU discovery
  2010-03-26  0:02 UDP path MTU discovery Glen Turner
@ 2010-03-26  0:53 ` Rick Jones
  2010-03-26  3:26   ` David Miller
  2010-03-26  3:24 ` David Miller
  2010-03-28  8:50 ` Andi Kleen
  2 siblings, 1 reply; 34+ messages in thread
From: Rick Jones @ 2010-03-26  0:53 UTC (permalink / raw)
  To: Glen Turner; +Cc: netdev

Glen Turner wrote:
> [This is a second attempt to report this bug.]

Bugzilla somewhere?

> Path MTU Discovery for UDP underperforms for IPv4 and fails for IPv6 in Linux
> for transactional services like DHCP and RADIUS running on jumbo frame
> interfaces.
> 
> These servers send packets with exponential back-off. UDP Path MTU Discovery
> probes for the path MTU each time the application sends a packet. So if you
> start with a high enough interface MTU then the server application backoff 
> times get huge and the client gives up before the path MTU is discovered.
> 
> This differs from TCP, where it is the kernel -- and not the application --
> which organises retransmission. On receiving an ICMP Fragmentation Needed the
> kernel can immediately re-probe the path MTU without waiting for an
> exponential timer to expire.
> 
> In IPv4 there is a work-around for the server, turn off Path MTU Discovery
> and allow routers to fragment the packet as needed. Looking at the code for
> the various transactional servers (ISC DHCP, FreeRADIUS, RADIATOR,
> radsecproxy) they all disable Path MTU Discovery on Linux. This workaround
> has the side effect of hiding the problem, misleading people into thinking
> that UDP Path MTU Discovery actually works for these transactional servers.
> 
> In IPv6 routers do not fragment packets, so there is no work around.
> Transactional servers which use UDP over IPv6 encounter exponential backoffs
> within the application and the client abandons the transaction. There is no
> way for the server to know that the packet was lost due to Path MTU Discovery
> and to immediately re-transmit it (without an exponential penalty) so that
> the MTU can be probed again.
> 
> This can be viewed as a flaw in the RFC and in the sockets API
> for which IPv6 has removed the common work-around.

So, presuming it is indeed a bug what form might a fix take? Are you suggesting 
there should be a way for an application to say "Please let me see/know about 
the ICMP messages?"  Is that option available on other platforms as a 
platform-specific extension?  I don't have the details, but the HP-UX 11i v3 
(11.31) netinet/udp.h file contains these:

#define UDP_RX_ICMP     0x02    /* boolean; get/set ICMP packets reception */
                                 /* Set to 1 if ICMP packets are to be received*/

#define UDP_RX_ICMP6    0x03    /* boolean; get/set ICMPv6 packets reception */
                                 /* Set to 1 if ICMPv6 packets are to be
                                    received */

and it does appear that they are in more places than just HP-UX - there are some 
hits for that for the old Apple Open Transport - which makes sense - it too had 
Mentat origins.

rick jones


* Re: UDP path MTU discovery
  2010-03-26  0:02 UDP path MTU discovery Glen Turner
  2010-03-26  0:53 ` Rick Jones
@ 2010-03-26  3:24 ` David Miller
  2010-03-28  8:41   ` Andi Kleen
  2010-03-28  8:50 ` Andi Kleen
  2 siblings, 1 reply; 34+ messages in thread
From: David Miller @ 2010-03-26  3:24 UTC (permalink / raw)
  To: gdt; +Cc: netdev

From: Glen Turner <gdt@gdt.id.au>
Date: Fri, 26 Mar 2010 10:32:31 +1030

> This differs from TCP, where it is the kernel -- and not
> the application -- which organises retransmission. On
> receiving an ICMP Fragmentation Needed the kernel can
> immediately re-probe the path MTU without waiting for
> an exponential timer to expire.

So the argument is, the kernel TCP does retransmission smart,
userspace UDP apps do it stupidly, so let's turn off the feature
instead of fixing userspace.

Right?

Sorry, fix this correctly in the user apps.  Putting the
blame on UDP path MTU discovery is placing it in the
wrong spot.



* Re: UDP path MTU discovery
  2010-03-26  0:53 ` Rick Jones
@ 2010-03-26  3:26   ` David Miller
  2010-03-26 17:48     ` Rick Jones
  2010-03-31 23:42     ` Glen Turner
  0 siblings, 2 replies; 34+ messages in thread
From: David Miller @ 2010-03-26  3:26 UTC (permalink / raw)
  To: rick.jones2; +Cc: gdt, netdev

From: Rick Jones <rick.jones2@hp.com>
Date: Thu, 25 Mar 2010 17:53:11 -0700

> So, presuming it is indeed a bug what form might a fix take? Are you
> suggesting there should be a way for an application to say "Please let
> me see/know about the ICMP messages?"  Is that option available on
> other platforms as a platform-specific extension?

We already provide this information.

The socket ends up with EMSGSIZE in its error queue, so the next time
the application does I/O it sees that error immediately from the
read/write call and thus knows that path MTU arrived.


* Re: UDP path MTU discovery
  2010-03-26  3:26   ` David Miller
@ 2010-03-26 17:48     ` Rick Jones
  2010-03-31 23:42     ` Glen Turner
  1 sibling, 0 replies; 34+ messages in thread
From: Rick Jones @ 2010-03-26 17:48 UTC (permalink / raw)
  To: David Miller; +Cc: gdt, netdev

David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Thu, 25 Mar 2010 17:53:11 -0700
> 
> 
>>So, presuming it is indeed a bug what form might a fix take? Are you
>>suggesting there should be a way for an application to say "Please let
>>me see/know about the ICMP messages?"  Is that option available on
>>other platforms as a platform-specific extension?
> 
> 
> We already provide this information.
> 
> The socket ends up with EMSGSIZE in its error queue, so the next time
> the application does I/O it sees that error immediately from the
> read/write call and thus knows that path MTU arrived.

A possibly pedantic question, but only when it does I/O, or also when/if it is 
in poll/select?

What distinguishes this EMSGSIZE from a run-of-the-mill EMSGSIZE error such as 
one gets from trying to send a datagram larger than SO_SNDBUF?

That is something that happens all the time in netperf when people forget a -m 
option on UDP_STREAM tests :)  Netperf gets the error and exits.  But supposing 
I wanted to make netperf more sophisticated in that regard - what sort of things 
must it do?  Call getsockopt(SO_SNDBUF) to check the size of the failed send 
against SO_SNDBUF and only then decide if it is an error on this send or an ICMP 
Datagram Too Big arrived indication from a previous send?  I know that netperf 
already has this information, so using it as the example is a bit stretched, but 
let's presume for the moment that netperf just has a socket handed to it from 
"somewhere."

rick jones


* Re: UDP path MTU discovery
  2010-03-26  3:24 ` David Miller
@ 2010-03-28  8:41   ` Andi Kleen
  2010-03-31 23:57     ` Glen Turner
  0 siblings, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2010-03-28  8:41 UTC (permalink / raw)
  To: David Miller; +Cc: gdt, netdev

David Miller <davem@davemloft.net> writes:

> From: Glen Turner <gdt@gdt.id.au>
> Date: Fri, 26 Mar 2010 10:32:31 +1030
>
>> This differs from TCP, where it is the kernel -- and not
>> the application -- which organises retransmission. On
>> receiving an ICMP Fragmentation Needed the kernel can
>> immediately re-probe the path MTU without waiting for
>> an exponential timer to expire.
>
> So the argument is, the kernel TCP does retransmission smart,
> userspace UDP apps do it stupidly, so let's turn off the feature
> instead of fixing userspace.
>
> Right?
>
> Sorry, fix this correctly in the user apps.  Putting the
> blame on UDP path MTU discovery is placing it in the
> wrong spot.

It means though that all IPv6 UDP applications essentially have
to implement path MTU discovery support (which is non-trivial).

Will be likely a long time until they're all fixed.

Seems like a big hole not considered by the IPv6 designers?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: UDP path MTU discovery
  2010-03-26  0:02 UDP path MTU discovery Glen Turner
  2010-03-26  0:53 ` Rick Jones
  2010-03-26  3:24 ` David Miller
@ 2010-03-28  8:50 ` Andi Kleen
  2010-03-29 17:01   ` Rick Jones
  2 siblings, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2010-03-28  8:50 UTC (permalink / raw)
  To: Glen Turner; +Cc: netdev

Glen Turner <gdt@gdt.id.au> writes:

> In IPv6 routers do not fragment packets, so there is no work
> around. Transactional servers which use UDP over IPv6 encounter
> exponential backoffs within the application and the client
> abandons the transaction. There is no way for the server to
> know that the packet was lost due to Path MTU Discovery and
> to immediately re-transmit it (without an exponential penalty)
> so that the MTU can be probed again.

You can still turn path mtu discovery off and Linux will
fragment based on the known path MTU (I believe when
the too-big fragment gets an ICMP back the PMTU gets updated).

However you might lose a few packets in the process until the path MTU
is known, but at least it will stay cached (unless you thrash the
routing cache)

In theory one could probably add some hack in the kernel UDP code
to hold one packet and retransmit it immediately with fragments when
the ICMP comes in. However that would be quite far in behaviour from
traditional UDP and be considered very ugly. It could also mess up
congestion avoidance schemes done by the application. 

Still might be preferable over rewriting zillions of applications?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: UDP path MTU discovery
  2010-03-28  8:50 ` Andi Kleen
@ 2010-03-29 17:01   ` Rick Jones
  2010-03-29 20:14     ` Andi Kleen
  2010-03-31 23:43     ` Glen Turner
  0 siblings, 2 replies; 34+ messages in thread
From: Rick Jones @ 2010-03-29 17:01 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Glen Turner, netdev

> In theory one could probably add some hack in the kernel UDP code
> to hold one packet and retransmit it immediately with fragments when
> the ICMP comes in. However that would be quite far in behaviour from
> traditional UDP and be considered very ugly. It could also mess up
> congestion avoidance schemes done by the application. 
> 
> Still might be preferable over rewriting zillions of applications?

But which of the last N datagrams sent by the application should be retained for 
retransmission?  It could be scores if not hundreds of datagrams depending on 
the behaviour of the application and the latency to the narrow part of the network.

That the IPv6 specification was heavily "influenced" by "the router guys" seems 
increasingly clear...

rick jones


* Re: UDP path MTU discovery
  2010-03-29 17:01   ` Rick Jones
@ 2010-03-29 20:14     ` Andi Kleen
  2010-03-29 20:25       ` Rick Jones
  2010-03-29 20:50       ` Edgar E. Iglesias
  2010-03-31 23:43     ` Glen Turner
  1 sibling, 2 replies; 34+ messages in thread
From: Andi Kleen @ 2010-03-29 20:14 UTC (permalink / raw)
  To: Rick Jones; +Cc: Andi Kleen, Glen Turner, netdev

On Mon, Mar 29, 2010 at 10:01:42AM -0700, Rick Jones wrote:
> >In theory one could probably add some hack in the kernel UDP code
> >to hold one packet and retransmit it immediately with fragments when
> >the ICMP comes in. However that would be quite far in behaviour from
> >traditional UDP and be considered very ugly. It could also mess up
> >congestion avoidance schemes done by the application. 
> >
> >Still might be preferable over rewriting zillions of applications?
> 
> But which of the last N datagrams sent by the application should be 
> retained for retransmission?  It could be scores if not hundreds of 
> datagrams depending on the behaviour of the application and the latency to 
> the narrow part of the network.

Yes, if there's a large window you lose. I guess it would make protocols
like DHCP work at least ("transactional UDP" as the original poster called it)

I don't know if it would fix enough applications to be worth 
implementing. The only way to find out would be to try I guess.
I don't have any better ideas.

> That the IPv6 specification was heavily "influenced" by "the router guys" 
> seems increasingly clear...

Yes it sounds like the IETF didn't completely think that through.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: UDP path MTU discovery
  2010-03-29 20:14     ` Andi Kleen
@ 2010-03-29 20:25       ` Rick Jones
  2010-03-29 20:50       ` Edgar E. Iglesias
  1 sibling, 0 replies; 34+ messages in thread
From: Rick Jones @ 2010-03-29 20:25 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Glen Turner, netdev

Andi Kleen wrote:
> On Mon, Mar 29, 2010 at 10:01:42AM -0700, Rick Jones wrote:

>>But which of the last N datagrams sent by the application should be 
>>retained for retransmission?  It could be scores if not hundreds of 
>>datagrams depending on the behaviour of the application and the latency to 
>>the narrow part of the network.
> 
> 
> Yes, if there's a large window you lose. I guess it would make protocols
> like DHCP work at least ("transactional UDP" as the original poster called it)
> 
> I don't know if it would fix enough applications to be worth 
> implementing. The only way to find out would be to try I guess.
> I don't have any better ideas.

I don't think there are any good solutions that do not require either 
application involvement, or a modification to IPv6.

How about allowing an application to request that (copies of) ICMP(v6) messages 
be made available through the socket?  In that way, the application, which 
ostensibly already has to be keeping track of its sends for its own nefarious 
retransmission porpoises can receive the "signal" just like TCP does and perhaps 
there will be enough in the ICMPv6 message for the application to know which 
message(s) need to be retransmitted.

rick jones


* Re: UDP path MTU discovery
  2010-03-29 20:14     ` Andi Kleen
  2010-03-29 20:25       ` Rick Jones
@ 2010-03-29 20:50       ` Edgar E. Iglesias
  2010-03-29 21:01         ` Rick Jones
  1 sibling, 1 reply; 34+ messages in thread
From: Edgar E. Iglesias @ 2010-03-29 20:50 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Rick Jones, Glen Turner, netdev

On Mon, Mar 29, 2010 at 10:14:31PM +0200, Andi Kleen wrote:
> On Mon, Mar 29, 2010 at 10:01:42AM -0700, Rick Jones wrote:
> > >In theory one could probably add some hack in the kernel UDP code
> > >to hold one packet and retransmit it immediately with fragments when
> > >the ICMP comes in. However that would be quite far in behaviour from
> > >traditional UDP and be considered very ugly. It could also mess up
> > >congestion avoidance schemes done by the application. 
> > >
> > >Still might be preferable over rewriting zillions of applications?
> > 
> > But which of the last N datagrams sent by the application should be 
> > retained for retransmission?  It could be scores if not hundreds of 
> > datagrams depending on the behaviour of the application and the latency to 
> > the narrow part of the network.
> 
> Yes, if there's a large window you lose. I guess it would make protocols
> like DHCP work at least ("transactional UDP" as the original poster called it)
> 
> I don't know if it would fix enough applications to be worth 
> implementing. The only way to find out would be to try I guess.
> I don't have any better ideas.
> 
> > That the IPv6 specification was heavily "influenced" by "the router guys" 
> > seems increasingly clear...
> 
> Yes it sounds like the IETF didn't completely think that through.


Are things really that bad?

These "transactional" IPv6 apps all have the option to stick to 1280
sized datagrams to avoid the problem. If throughput is an issue these
apps will surely benefit from proper PMTUD anyway or?
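
The arithmetic behind the 1280 figure, as a small sketch: 1280 is the
minimum link MTU that RFC 2460 requires every IPv6 path to support, so a
datagram that fits in it needs no discovery. This assumes no extension
headers, so the fixed IPv6 header takes 40 bytes and the UDP header 8.
The function and constant names are illustrative.

```c
#include <stddef.h>

enum {
    V6_HDR_LEN  = 40,    /* fixed IPv6 header */
    UDP_HDR_LEN = 8,     /* UDP header */
    V6_MIN_MTU  = 1280   /* minimum IPv6 link MTU (RFC 2460) */
};

/*
 * Largest UDP payload that fits in one unfragmented IPv6 packet of the
 * given MTU, assuming no extension headers. Returns 0 if the MTU cannot
 * even hold the headers.
 */
size_t udp6_max_payload(size_t mtu)
{
    size_t headers = V6_HDR_LEN + UDP_HDR_LEN;

    return mtu > headers ? mtu - headers : 0;
}
```

udp6_max_payload(V6_MIN_MTU) gives 1232, so an application that never
sends more than 1232 bytes of UDP payload stays inside the minimum MTU.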

Cheers


* Re: UDP path MTU discovery
  2010-03-29 20:50       ` Edgar E. Iglesias
@ 2010-03-29 21:01         ` Rick Jones
  2010-03-29 21:29           ` Eric Dumazet
  0 siblings, 1 reply; 34+ messages in thread
From: Rick Jones @ 2010-03-29 21:01 UTC (permalink / raw)
  To: Edgar E. Iglesias; +Cc: Andi Kleen, Glen Turner, netdev

> Are things really that bad?
> 
> These "transactional" IPv6 apps all have the option to stick to 1280
> sized datagrams to avoid the problem. If throughput is an issue these
> apps will surely benefit from proper PMTUD anyway or?

I would get the alphabet soup completely garbled, but the DNS folks are talking 
about EDNS (?) message sizes upwards of 4096 bytes - encryption/authentication 
and other angels being asked to dance on the head of the DNS pin are asking for 
more and more space in the messages.

So, someone will have to blink somewhere - either DNS will have to go TCP and 
*possibly* take RTT hits there depending on various patch streams, or the IEEE 
will have to sanction jumbo frames and people deploy them widely, or it will 
have to become feasible to actually do the occasional IPv6 datagram 
fragmentation and get a timely retransmission out of a UDP application on a PMTU 
hit.

rick jones


* Re: UDP path MTU discovery
  2010-03-29 21:01         ` Rick Jones
@ 2010-03-29 21:29           ` Eric Dumazet
  2010-03-29 23:38             ` Templin, Fred L
  0 siblings, 1 reply; 34+ messages in thread
From: Eric Dumazet @ 2010-03-29 21:29 UTC (permalink / raw)
  To: Rick Jones; +Cc: Edgar E. Iglesias, Andi Kleen, Glen Turner, netdev

Le lundi 29 mars 2010 à 14:01 -0700, Rick Jones a écrit :

> I would get the alphabet soup completely garbled, but the DNS folks are talking 
> about EDNS (?) message sizes upwards of 4096 bytes - encryption/authentication 
> and other angels being asked to dance on the head of the DNS pin are asking for 
> more and more space in the messages.
> 
> So, someone will have to blink somewhere - either DNS will have to go TCP and 
> *possibly* take RTT hits there depending on various patch streams, or the IEEE 
> will have to sanction jumbo frames and people deploy them widely, or it will 
> have to become feasible to actually do the occasional IPv6 datagram 
> fragmentation and get a timely retransmission out of a UDP application on a PMTU 
> hit.
> 

1) 4096 bytes UDP messages... well...
2) Using regular TCP for DNS servers... well...

I believe some guys were pushing TCPCT (Cookie Transactions) for this
case ( http://tools.ietf.org/html/draft-simpson-tcpct-00.html )

(That is, using an enhanced TCP for long DNS queries... but not only for
DNS...)





* RE: UDP path MTU discovery
  2010-03-29 21:29           ` Eric Dumazet
@ 2010-03-29 23:38             ` Templin, Fred L
  2010-03-30  5:20               ` Andi Kleen
  0 siblings, 1 reply; 34+ messages in thread
From: Templin, Fred L @ 2010-03-29 23:38 UTC (permalink / raw)
  To: Eric Dumazet, Rick Jones
  Cc: Edgar E. Iglesias, Andi Kleen, Glen Turner, netdev



> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf Of Eric Dumazet
> Sent: Monday, March 29, 2010 2:29 PM
> To: Rick Jones
> Cc: Edgar E. Iglesias; Andi Kleen; Glen Turner; netdev@vger.kernel.org
> Subject: Re: UDP path MTU discovery
> 
> Le lundi 29 mars 2010 à 14:01 -0700, Rick Jones a écrit :
> 
> > I would get the alphabet soup completely garbled, but the DNS folks are talking
> > about EDNS (?) message sizes upwards of 4096 bytes - encryption/authentication
> > and other angels being asked to dance on the head of the DNS pin are asking for
> > more and more space in the messages.
> >
> > So, someone will have to blink somewhere - either DNS will have to go TCP and
> > *possibly* take RTT hits there depending on various patch streams, or the IEEE
> > will have to sanction jumbo frames and people deploy them widely, or it will
> > have to become feasible to actually do the occasional IPv6 datagram
> > fragmentation and get a timely retransmission out of a UDP application on a PMTU
> > hit.
> >
> 
> 1) 4096 bytes UDP messages... well...
> 2) Using regular TCP for DNS servers... well...
> 
> I believe some guys were pushing TCPCT (Cookie Transactions) for this
> case ( http://tools.ietf.org/html/draft-simpson-tcpct-00.html )
> 
> (That is, using an enhanced TCP for long DNS queries... but not only for
> DNS...)

IPv4 gets by this by setting DF=0 in the IP header, and
lets the network fragment the packet if necessary. IPv6 can
similarly get by this by having the sending host fragment
the large UDP packet into IPv6 fragments no longer than
1280 bytes each.

But wait! IPv4 hosts are only required to reassemble 576 bytes
at a minimum, and IPv6 hosts are only required to reassemble
1500 bytes at a minimum. Indeed, RFC2460 says:

   "An upper-layer protocol or application that depends on IPv6
   fragmentation to send packets larger than the MTU of a path should
   not send packets larger than 1500 octets unless it has assurance that
   the destination is capable of reassembling packets of that larger
   size."

but it is not clear how the sender can get such "assurance".
In the end, perhaps IPv6 should just do what IPv4 does;
turn off PMTUD and hope for the best?
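
To make the numbers concrete, a rough sketch of the fragmentation
arithmetic. It assumes no extension headers other than the 8-byte
Fragment header, and that every fragment except the last carries a
multiple of 8 bytes, as RFC 2460 requires. All names are illustrative.

```c
#include <stddef.h>

enum {
    FIXED_HDR      = 40,    /* fixed IPv6 header */
    FRAG_HDR       = 8,     /* IPv6 Fragment header */
    UDP_HDR        = 8,     /* UDP header */
    MIN_LINK_MTU   = 1280,  /* minimum IPv6 link MTU */
    MIN_REASSEMBLY = 1500   /* reassembly size every host must accept */
};

/*
 * Number of IPv6 packets needed to carry one UDP datagram with
 * payload_len bytes of payload over a path with the given MTU.
 * Returns 0 for an MTU below the IPv6 minimum.
 */
size_t ipv6_packets_needed(size_t payload_len, size_t mtu)
{
    size_t upper = UDP_HDR + payload_len;   /* fragmentable part */
    size_t per_frag;

    if (mtu < MIN_LINK_MTU)
        return 0;
    if (FIXED_HDR + upper <= mtu)
        return 1;                           /* fits unfragmented */

    /* Data per fragment, rounded down to a multiple of 8 bytes. */
    per_frag = (mtu - FIXED_HDR - FRAG_HDR) & ~(size_t)7;
    return (upper + per_frag - 1) / per_frag;
}

/* Non-zero if the reassembled packet exceeds the guaranteed 1500 octets. */
int exceeds_min_reassembly(size_t payload_len)
{
    return FIXED_HDR + UDP_HDR + payload_len > MIN_REASSEMBLY;
}
```

A 4096-byte payload sent in 1280-byte fragments takes 4 packets, and
exceeds_min_reassembly(4096) is true, which is exactly the case where
the sender has no assurance the receiver can put it back together.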

Fred
fred.l.templin@boeing.com 
 


* Re: UDP path MTU discovery
  2010-03-29 23:38             ` Templin, Fred L
@ 2010-03-30  5:20               ` Andi Kleen
  2010-03-30  6:06                 ` Eric Dumazet
  2010-03-30  6:16                 ` UDP path MTU discovery Edgar E. Iglesias
  0 siblings, 2 replies; 34+ messages in thread
From: Andi Kleen @ 2010-03-30  5:20 UTC (permalink / raw)
  To: Templin, Fred L
  Cc: Eric Dumazet, Rick Jones, Edgar E. Iglesias, Andi Kleen,
	Glen Turner, netdev

On Mon, Mar 29, 2010 at 04:38:49PM -0700, Templin, Fred L wrote:
> > 1) 4096 bytes UDP messages... well...
> > 2) Using regular TCP for DNS servers... well...
> > 
> > I believe some guys were pushing TCPCT (Cookie Transactions) for this
> > case ( http://tools.ietf.org/html/draft-simpson-tcpct-00.html )
> > 
> > (That is, using an enhanced TCP for long DNS queries... but not only for
> > DNS...)
> 
> IPv4 gets by this by setting DF=0 in the IP header, and
> lets the network fragment the packet if necessary. IPv6 can
> similarly get by this by having the sending host fragment
> the large UDP packet into IPv6 fragments no longer than
> 1280 bytes each.

That's true -- in theory the UDP app unwilling/unable to do proper pmtudisc 
could set the path mtu to 1280 + header and still keep path mtu discovery off 
and then just fragment. 

Drawback would be of course suboptimal network use with too small MTUs
in the common case.

Right now there is no socket option to set the path MTU. We
have an IP_MTU option, but it only works for getting the MTU.
That's because the PMTU is in the routing cache entry and shared
by multiple sockets. Presumably one could add a special case
with an MTU in the socket overriding the one in the destination entry.
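
For reference, reading the value looks roughly like this. A minimal
sketch, assuming a Linux IPv4 UDP socket; ip(7) documents IP_MTU as
getsockopt-only and valid only on a connected socket, and the helper
name is illustrative:

```c
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Return the path MTU the kernel currently has cached for the
 * destination of a connected IPv4 socket, or -1 with errno set
 * (ENOTCONN if the socket has no destination yet).
 */
int get_path_mtu(int fd)
{
    int mtu = 0;
    socklen_t len = sizeof(mtu);

    if (getsockopt(fd, IPPROTO_IP, IP_MTU, &mtu, &len) < 0)
        return -1;
    return mtu;
}
```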

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: UDP path MTU discovery
  2010-03-30  5:20               ` Andi Kleen
@ 2010-03-30  6:06                 ` Eric Dumazet
  2010-03-30  6:16                   ` Andi Kleen
  2010-03-30  6:17                   ` UDP path MTU discovery II Andi Kleen
  2010-03-30  6:16                 ` UDP path MTU discovery Edgar E. Iglesias
  1 sibling, 2 replies; 34+ messages in thread
From: Eric Dumazet @ 2010-03-30  6:06 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Templin, Fred L, Rick Jones, Edgar E. Iglesias, Glen Turner, netdev

Le mardi 30 mars 2010 à 07:20 +0200, Andi Kleen a écrit :
> On Mon, Mar 29, 2010 at 04:38:49PM -0700, Templin, Fred L wrote:
> > > 1) 4096 bytes UDP messages... well...
> > > 2) Using regular TCP for DNS servers... well...
> > > 
> > > I believe some guys were pushing TCPCT (Cookie Transactions) for this
> > > case ( http://tools.ietf.org/html/draft-simpson-tcpct-00.html )
> > > 
> > > (That is, using an enhanced TCP for long DNS queries... but not only for
> > > DNS...)
> > 
> > IPv4 gets by this by setting DF=0 in the IP header, and
> > lets the network fragment the packet if necessary. IPv6 can
> > similarly get by this by having the sending host fragment
> > the large UDP packet into IPv6 fragments no longer than
> > 1280 bytes each.
> 
> That's true -- in theory the UDP app unwilling/unable to do proper pmtudisc 
> could set the path mtu to 1280 + header and still keep path mtu discovery off 
> and then just fragment. 
> 
> Drawback would be of course suboptimal network use with too small MTUs
> in the common case.
> 
> Right now there is no right socket option to set the path mtu. We
> have a IP_MTU option, but it only works for getting the MTU.
> That's because the PMTU is in the routing cache entry and shared
> by multiple sockets. Presumably one could add a special case
> with an MTU in the socket overriding the one in the destination entry.

We have the IP_MTU_DISCOVER option with four existing values:



/* IP_MTU_DISCOVER values */
#define IP_PMTUDISC_DONT                0       /* Never send DF frames */
#define IP_PMTUDISC_WANT                1       /* Use per route hints  */
#define IP_PMTUDISC_DO                  2       /* Always DF            */
#define IP_PMTUDISC_PROBE               3       /* Ignore dst pmtu      */

We might add a fifth value (or open full range) and change 

static inline int ip_skb_dst_mtu(struct sk_buff *skb)
{
        struct inet_sock *inet = skb->sk ? inet_sk(skb->sk) : NULL;

        return (inet && inet->pmtudisc == IP_PMTUDISC_PROBE) ?
               skb_dst(skb)->dev->mtu : dst_mtu(skb_dst(skb));
}

->

static inline int ip_skb_dst_mtu(struct sk_buff *skb)
{
	if (skb->sk) {
		struct inet_sock *inet = inet_sk(skb->sk);

		if (inet->pmtudisc > IP_PMTUDISC_PROBE)
			return inet->pmtudisc;
		if (inet->pmtudisc == IP_PMTUDISC_PROBE)
			return skb_dst(skb)->dev->mtu;
	}
	return dst_mtu(skb_dst(skb));
}





* Re: UDP path MTU discovery
  2010-03-30  6:06                 ` Eric Dumazet
@ 2010-03-30  6:16                   ` Andi Kleen
  2010-03-30  6:17                   ` UDP path MTU discovery II Andi Kleen
  1 sibling, 0 replies; 34+ messages in thread
From: Andi Kleen @ 2010-03-30  6:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andi Kleen, Templin, Fred L, Rick Jones, Edgar E. Iglesias,
	Glen Turner, netdev

> > Right now there is no right socket option to set the path mtu. We
> > have a IP_MTU option, but it only works for getting the MTU.
> > That's because the PMTU is in the routing cache entry and shared
> > by multiple sockets. Presumably one could add a special case
> > with an MTU in the socket overriding the one in the destination entry.
> 
> We have IP_MTU_DISCOVER option with four existing values

I think you would just need a sk->pmtu and an option to set/unset it.
Or perhaps just a flag that clamps to 1280? 

Then the existing mtu discover options would be sufficient.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: UDP path MTU discovery
  2010-03-30  5:20               ` Andi Kleen
  2010-03-30  6:06                 ` Eric Dumazet
@ 2010-03-30  6:16                 ` Edgar E. Iglesias
  2010-03-30  6:19                   ` Andi Kleen
  1 sibling, 1 reply; 34+ messages in thread
From: Edgar E. Iglesias @ 2010-03-30  6:16 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Templin, Fred L, Eric Dumazet, Rick Jones, Glen Turner, netdev

On Tue, Mar 30, 2010 at 07:20:44AM +0200, Andi Kleen wrote:
> On Mon, Mar 29, 2010 at 04:38:49PM -0700, Templin, Fred L wrote:
> > > 1) 4096 bytes UDP messages... well...
> > > 2) Using regular TCP for DNS servers... well...
> > > 
> > > I believe some guys were pushing TCPCT (Cookie Transactions) for this
> > > case ( http://tools.ietf.org/html/draft-simpson-tcpct-00.html )
> > > 
> > > (That is, using an enhanced TCP for long DNS queries... but not only for
> > > DNS...)
> > 
> > IPv4 gets by this by setting DF=0 in the IP header, and
> > lets the network fragment the packet if necessary. IPv6 can
> > similarly get by this by having the sending host fragment
> > the large UDP packet into IPv6 fragments no longer than
> > 1280 bytes each.
> 
> That's true -- in theory the UDP app unwilling/unable to do proper pmtudisc 
> could set the path mtu to 1280 + header and still keep path mtu discovery off 
> and then just fragment. 
> 
> Drawback would be of course suboptimal network use with too small MTUs
> in the common case.
>
> Right now there is no right socket option to set the path mtu. We
> have a IP_MTU option, but it only works for getting the MTU.
> That's because the PMTU is in the routing cache entry and shared
> by multiple sockets. Presumably one could add a special case
> with an MTU in the socket overriding the one in the destination entry.

Sorry I'm not following you here.. Why do you need to set the MTU?

IIUC:
UDP is supposed to preserve datagram boundaries, so the sender should,
on seeing an EMSGSIZE, read the PMTU and avoid sending further UDP
packets larger than that. Userspace has control over the UDP datagram
size. If it can, the app will also at this point retransmit any recent
packets that went out larger than the fresh PMTU.

If you don't want to hassle with all of that, the app can stick to
1280 (or I guess for the extreme/lazy cases turn on fragmentation).
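
A minimal sketch of the size arithmetic behind "avoid sending packets larger than the PMTU": the largest UDP payload that fits in one unfragmented packet is the path MTU minus the IP and UDP headers. The helper name max_udp_payload is made up for this illustration, and it assumes no IPv4 options and no IPv6 extension headers.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
IPV6_HEADER = 40  # bytes, fixed IPv6 header, assuming no extension headers
UDP_HEADER = 8    # bytes


def max_udp_payload(pmtu, ipv6):
    """Largest UDP payload that fits in one packet of size pmtu (hypothetical helper)."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    payload = pmtu - ip_header - UDP_HEADER
    if payload < 0:
        raise ValueError("path MTU is smaller than the IP and UDP headers")
    return payload


print(max_udp_payload(1280, ipv6=True))   # 1232, the IPv6 minimum MTU case
print(max_udp_payload(1500, ipv6=False))  # 1472, typical Ethernet over IPv4
```

An application that "sticks to 1280" on IPv6 would, under these assumptions, cap its datagrams at 1232 bytes of payload.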

Cheers

* Re: UDP path MTU discovery II
  2010-03-30  6:06                 ` Eric Dumazet
  2010-03-30  6:16                   ` Andi Kleen
@ 2010-03-30  6:17                   ` Andi Kleen
  1 sibling, 0 replies; 34+ messages in thread
From: Andi Kleen @ 2010-03-30  6:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andi Kleen, Templin, Fred L, Rick Jones, Edgar E. Iglesias,
	Glen Turner, netdev


Hmm, never mind, your scheme of mapping it to the pmtudisc field would
probably work. Except it's a u8 currently; it would need to be u16 or u32.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

* Re: UDP path MTU discovery
  2010-03-30  6:16                 ` UDP path MTU discovery Edgar E. Iglesias
@ 2010-03-30  6:19                   ` Andi Kleen
  2010-03-30  8:20                     ` Edgar E. Iglesias
  2010-03-30 15:58                     ` Templin, Fred L
  0 siblings, 2 replies; 34+ messages in thread
From: Andi Kleen @ 2010-03-30  6:19 UTC (permalink / raw)
  To: Edgar E. Iglesias
  Cc: Andi Kleen, Templin, Fred L, Eric Dumazet, Rick Jones,
	Glen Turner, netdev

> If you don't want to hassle with all of that, the app can stick to
> 1280 (or I guess for the extreme/lazy cases turn on fragmentation)..

See the early mails in this thread. This is about apps that can't
limit themselves to 1280, but still don't want full-blown PMTU discovery.
[They probably should, but it can be a lot of work]

Setting the MTU would allow forcing fragmentation on the sending host
as a workaround similar to IPv4.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

* Re: UDP path MTU discovery
  2010-03-30  6:19                   ` Andi Kleen
@ 2010-03-30  8:20                     ` Edgar E. Iglesias
  2010-03-30 14:12                       ` Andi Kleen
  2010-03-30 15:58                     ` Templin, Fred L
  1 sibling, 1 reply; 34+ messages in thread
From: Edgar E. Iglesias @ 2010-03-30  8:20 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Templin, Fred L, Eric Dumazet, Rick Jones, Glen Turner, netdev

On Tue, Mar 30, 2010 at 08:19:52AM +0200, Andi Kleen wrote:
> > If you don't want to hassle with all of that, the app can stick to
> > 1280 (or I guess for the extreme/lazy cases turn on fragmentation)..
> 
> See the early mails in this thread. This is about apps who can't
> limit themselves to 1280, but still don't want full blown PMTU.
> [They probably should, but it can be a lot of work]
> 
> The MTU would allow to force fragmentation on the sending host
> as a workaround similar to IPv4.

Yes, but I don't see why you need an option with the semantics of setting an MTU.

If a UDP app wants to use fragmentation (for whatever reason), setting
a boolean flag like XXX_PMTUDISC_DONT should be enough. For IPv6 the kernel
will have to work with the real PMTU or stick to 1280 when generating the
fragments. Keep in mind that unlike IPv4, IPv6 has no DF flag. It's up to
the sender to create the fragments.

Where does the application controllable per socket MTU come into the
picture?

Cheers

* Re: UDP path MTU discovery
  2010-03-30  8:20                     ` Edgar E. Iglesias
@ 2010-03-30 14:12                       ` Andi Kleen
  2010-03-30 22:04                         ` Edgar E. Iglesias
  0 siblings, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2010-03-30 14:12 UTC (permalink / raw)
  To: Edgar E. Iglesias
  Cc: Andi Kleen, Templin, Fred L, Eric Dumazet, Rick Jones,
	Glen Turner, netdev

> Where does the application controllable per socket MTU come into the
> picture?

To set the minimum path MTU, so that there is a guarantee that IPv6 routers
(which cannot fragment packets themselves) will never drop the packet.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

* RE: UDP path MTU discovery
  2010-03-30  6:19                   ` Andi Kleen
  2010-03-30  8:20                     ` Edgar E. Iglesias
@ 2010-03-30 15:58                     ` Templin, Fred L
  2010-03-30 16:06                       ` Andi Kleen
  1 sibling, 1 reply; 34+ messages in thread
From: Templin, Fred L @ 2010-03-30 15:58 UTC (permalink / raw)
  To: Andi Kleen, Edgar E. Iglesias
  Cc: Eric Dumazet, Rick Jones, Glen Turner, netdev



> -----Original Message-----
> From: Andi Kleen [mailto:andi@firstfloor.org]
> Sent: Monday, March 29, 2010 11:20 PM
> To: Edgar E. Iglesias
> Cc: Andi Kleen; Templin, Fred L; Eric Dumazet; Rick Jones; Glen Turner; netdev@vger.kernel.org
> Subject: Re: UDP path MTU discovery
> 
> > If you don't want to hassle with all of that, the app can stick to
> > 1280 (or I guess for the extreme/lazy cases turn on fragmentation)..
> 
> See the early mails in this thread. This is about apps who can't
> limit themselves to 1280, but still don't want full blown PMTU.
> [They probably should, but it can be a lot of work]

Right. Some apps may need to send isolated packets that
are larger than the path MTU without invoking path MTU
discovery.
 
> The MTU would allow to force fragmentation on the sending host
> as a workaround similar to IPv4.

Right again. Unlike IPv4, however, IPv6 does not allow
in-the-network fragmentation. So when in doubt, apps
that need to send isolated packets that may violate the
path MTU should really perform host-based fragmentation
with a maximum fragment size of 1280. Isn't there a
socket option "IPV6_USE_MIN_MTU" that apps can use to
force fragmentation on large packets (RFC3542)?

Caveat - the app may have no way of knowing whether
the destination is capable of reassembling fragmented
packets larger than 1500...
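
To make host-based fragmentation at 1280 concrete, here is a rough sketch of how one large UDP datagram would split into IPv6 fragments. It assumes a 40-byte IPv6 header with no other extension headers and an 8-byte Fragment header; ipv6_fragment_sizes is a made-up helper name, not a kernel or library function.

```python
IPV6_HEADER = 40      # fixed IPv6 header, repeated in every fragment
FRAGMENT_HEADER = 8   # IPv6 Fragment extension header
UDP_HEADER = 8        # travels once, inside the fragmentable part


def ipv6_fragment_sizes(udp_payload, mtu=1280):
    """On-wire size of each packet needed to carry one UDP datagram (sketch)."""
    remaining = UDP_HEADER + udp_payload
    if IPV6_HEADER + remaining <= mtu:
        return [IPV6_HEADER + remaining]  # fits without fragmentation
    # Every fragment except the last carries a multiple of 8 data bytes.
    per_fragment = (mtu - IPV6_HEADER - FRAGMENT_HEADER) // 8 * 8
    sizes = []
    while remaining > 0:
        chunk = min(per_fragment, remaining)
        sizes.append(IPV6_HEADER + FRAGMENT_HEADER + chunk)
        remaining -= chunk
    return sizes


print(ipv6_fragment_sizes(4096))  # [1280, 1280, 1280, 456]
```

Under these assumptions a 4096-byte message, the size mentioned earlier in the thread, would leave the host as four packets, and the receiver has to reassemble all four.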

Fred
fred.l.templin@boeing.com

> -Andi
> --
> ak@linux.intel.com -- Speaking for myself only.

* Re: UDP path MTU discovery
  2010-03-30 15:58                     ` Templin, Fred L
@ 2010-03-30 16:06                       ` Andi Kleen
  0 siblings, 0 replies; 34+ messages in thread
From: Andi Kleen @ 2010-03-30 16:06 UTC (permalink / raw)
  To: Templin, Fred L
  Cc: Andi Kleen, Edgar E. Iglesias, Eric Dumazet, Rick Jones,
	Glen Turner, netdev

> Right again. Unlike IPv4, however, IPv6 does not allow
> in-the-network fragmentation. So when in doubt, apps
> that need to send isolated packets that may violate the
> path MTU should really perform host-based fragmentation
> with a maximum fragment size of 1280. Isn't there a
> socket option "IPV6_USE_MIN_MTU" that apps can use to
> force fragmentation on large packets (RFC3542)?

Thanks for the pointer. The option is right now only defined,
but not implemented.  But yes it would help. 

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

* Re: UDP path MTU discovery
  2010-03-30 14:12                       ` Andi Kleen
@ 2010-03-30 22:04                         ` Edgar E. Iglesias
  0 siblings, 0 replies; 34+ messages in thread
From: Edgar E. Iglesias @ 2010-03-30 22:04 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Templin, Fred L, Eric Dumazet, Rick Jones, Glen Turner, netdev

On Tue, Mar 30, 2010 at 04:12:37PM +0200, Andi Kleen wrote:
> > Where does the application controllable per socket MTU come into the
> > picture?
> 
> To set the minimum path MTU so that there is a guarantee that IPv6 routers
> (which are unable to fragment themselves) will never drop it.

Not sure I agree with that kind of solution but that's probably
because of misunderstandings on my side :)

Thanks for explaining.

* Re: UDP path MTU discovery
  2010-03-26  3:26   ` David Miller
  2010-03-26 17:48     ` Rick Jones
@ 2010-03-31 23:42     ` Glen Turner
  2010-03-31 23:51       ` Hagen Paul Pfeifer
  1 sibling, 1 reply; 34+ messages in thread
From: Glen Turner @ 2010-03-31 23:42 UTC (permalink / raw)
  To: David Miller; +Cc: rick.jones2, netdev

On Thu, 2010-03-25 at 20:26 -0700, David Miller wrote:
> We already provide this information.
> 
> The socket ends up with EMSGSIZE in it's error queue, so the next time
> the application does I/O it sees that error immediately from the
> read/write call and thus knows that path MTU arrived.

Thanks David.

Does select() return from its blocking so the application can make
use of this indication immediately, rather than after the
application's exponentially-increasing wait?

Is an incoming ICMP the only cause of EMSGSIZE?  That is, can an
application safely retransmit immediately?

-- 
 Glen Turner
 www.gdt.id.au/~gdt


* Re: UDP path MTU discovery
  2010-03-29 17:01   ` Rick Jones
  2010-03-29 20:14     ` Andi Kleen
@ 2010-03-31 23:43     ` Glen Turner
  2010-04-01  0:55       ` Andi Kleen
  1 sibling, 1 reply; 34+ messages in thread
From: Glen Turner @ 2010-03-31 23:43 UTC (permalink / raw)
  To: Rick Jones; +Cc: Andi Kleen, netdev

On Mon, 2010-03-29 at 10:01 -0700, Rick Jones wrote:

> But which of the last N datagrams sent by the application should be retained for 
> retransmission?  It could be scores if not hundreds of datagrams depending on 
> the behaviour of the application and the latency to the narrow part of the network.

We don't need that sort of exotica from the kernel.  The applications
have to be prepared to retransmit lost packets in any case.

What we need is an API for an instant notification that an ICMP Packet
Too Big message has arrived concerning the socket.

Then the application simply retransmits immediately, without adding
to the exponential backoff penalty which the application maintains.
The application maintains an overall packet-transmitted limit to prevent
a DoS.
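
A minimal sketch of that retransmission policy, purely illustrative: the class and method names are hypothetical and do not come from FreeRADIUS, ISC DHCP, or any other server mentioned here. A timeout doubles the wait as usual, a Packet Too Big notification triggers an immediate resend with the wait unchanged, and both are bounded by one overall packet limit.

```python
class RetransmitState:
    """Per-request retransmission bookkeeping (hypothetical sketch)."""

    def __init__(self, initial_timeout=1.0, max_packets=8):
        self.timeout = initial_timeout  # seconds to wait for a reply
        self.max_packets = max_packets  # overall cap, guards against a DoS
        self.sent = 0

    def record_send(self):
        """Call after every transmission, first send included."""
        self.sent += 1

    def _may_send(self):
        return self.sent < self.max_packets

    def on_timeout(self):
        """No reply in time: apply the usual exponential penalty."""
        self.timeout *= 2
        return self._may_send()

    def on_packet_too_big(self):
        """Path MTU was lowered: resend now, leave the backoff untouched."""
        return self._may_send()
```

Both handlers return True when the caller may transmit again, so the packet cap applies equally to timer-driven and MTU-driven retransmissions.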

From this application behaviour the kernel sees a stream of packets
it can use for UDP Path MTU Discovery (paced at the RTT, so not
contributing to congestion collapse). That stream halts when the
first packet makes it to the end system.

As for David Miller's rant, the applications currently have no choice
but to "do it stupidly" as the kernel doesn't pass enough information
for user space to do it intelligently.  If the kernel passed user space
the same indication as TCP gets, then we could -- and would -- do it
right.

Re-writing the applications to take advantage of the API is no great
shakes -- there aren't many of them, they are written by people with
a good knowledge of networking, but unfortunately they tend to do
important stuff (allocate addresses, serve names, authenticate link
layer access).

It would be nice if the API had some commonality between platforms.
But there's no shortage of #ifdefs already, and one more to make
these applications work well for IPv6 on jumbo frames on the platform
of choice for networking infrastructure would be seen by application
authors as well worthwhile.

Thanks for your consideration,
Glen

-- 
 Glen Turner
 www.gdt.id.au/~gdt


* Re: UDP path MTU discovery
  2010-03-31 23:42     ` Glen Turner
@ 2010-03-31 23:51       ` Hagen Paul Pfeifer
  2010-04-01  0:06         ` Rick Jones
  0 siblings, 1 reply; 34+ messages in thread
From: Hagen Paul Pfeifer @ 2010-03-31 23:51 UTC (permalink / raw)
  To: Glen Turner; +Cc: David Miller, rick.jones2, netdev

* Glen Turner | 2010-04-01 10:12:03 [+1030]:

>Does select() return from its blocking so the application can make
>use of this indication immediately, rather than after the
>application's exponentially-increasing wait?

Yes, poll() will return immediately with POLLERR.

>Is an incoming ICMP the only cause of EMSGSIZE?  That is, can an
>application safely retransmit immediately?

IIRC, yes.
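
A small sketch of what waiting on the socket could look like, so that a pending error cuts the backoff wait short. The helper name wait_for_reply_or_error is hypothetical; the sketch relies only on poll() reporting POLLERR whether or not it was requested, and it does not read the error itself.

```python
import select
import socket


def wait_for_reply_or_error(sock, timeout_s):
    """Return "reply", "error" or "timeout" for one wait period (sketch)."""
    poller = select.poll()
    # Only POLLIN is requested; poll() reports POLLERR regardless.
    poller.register(sock, select.POLLIN)
    for _fd, mask in poller.poll(int(timeout_s * 1000)):
        if mask & select.POLLERR:
            return "error"   # e.g. a pending EMSGSIZE on the socket
        if mask & select.POLLIN:
            return "reply"
    return "timeout"
```

An application would call this with its current backoff interval as timeout_s and treat "error" as a cue to check the socket error and, if it is EMSGSIZE, resend at once.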


Cheers, Hagen

-- 
Hagen Paul Pfeifer <hagen@jauu.net>  ||  http://jauu.net/
Telephone: +49 174 5455209           ||  Key Id: 0x98350C22
Key Fingerprint: 490F 557B 6C48 6D7E 5706 2EA2 4A22 8D45 9835 0C22

* Re: UDP path MTU discovery
  2010-03-28  8:41   ` Andi Kleen
@ 2010-03-31 23:57     ` Glen Turner
  2010-04-01  0:57       ` Andi Kleen
  0 siblings, 1 reply; 34+ messages in thread
From: Glen Turner @ 2010-03-31 23:57 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David Miller, netdev

On Sun, 2010-03-28 at 10:41 +0200, Andi Kleen wrote:

> It means though that all IPv6 UDP applications essentially have
> to implement path mtu discovery support (which is non trivial) 

It is trivial from the application's point of view to let the
kernel find the UDP Path MTU. We just need more information
from the kernel as to when it would like to see those packets
(ie, for performance we'd like to feed in the packet to re-send
as soon as the ICMP Packet Too Big arrives for the previous
packet).

> Will be likely a long time until they're all fixed.

There's no need to make that assumption.  We'd very much like
transactional UDP protocols to work well in advanced networks.
The other choices -- holding down millions of TCP sockets,
or using new protocols (and there are competing proposals) --
don't exactly fill our operations teams with confidence.

We'd very much like to use UDP where we can and something else
where we must.

> Seems like a big hole not considered by the IPv6 designers?

Yeah. The sockets API for IPv6 required an additional feature that
the IETF did not foresee.

-- 
 Glen Turner
 www.gdt.id.au/~gdt


* Re: UDP path MTU discovery
  2010-03-31 23:51       ` Hagen Paul Pfeifer
@ 2010-04-01  0:06         ` Rick Jones
  0 siblings, 0 replies; 34+ messages in thread
From: Rick Jones @ 2010-04-01  0:06 UTC (permalink / raw)
  To: Hagen Paul Pfeifer; +Cc: Glen Turner, David Miller, netdev

Hagen Paul Pfeifer wrote:
> * Glen Turner | 2010-04-01 10:12:03 [+1030]:
> 
> 
>>Does select() return from its blocking so the application can make
>>use of this indication immediately, rather than after the
>>application's exponentially-increasing wait?
> 
> 
> Yes, poll() will return immediately with POLLERR.
> 
> 
>>Is an incoming ICMP the only cause of EMSGSIZE?  That is, can an
>>application safely retransmit immediately?
> 
> 
> IIRC, yes.

Under Linux perhaps, and assuming it can guess which prior send triggered the 
EMSGSIZE, but under HP-UX EMSGSIZE means you tried to send a datagram larger 
than the socket buffer:

tusc src/netperf -t UDP_RR -- -s 1024 -r 60K
...
send(4, 0x4000ee68, 61440, 0) ............................ ERR#218 EMSGSIZE

I've not checked BSD, Solaris or AIX.

On a 2.6.22 kernel where I do the same thing, it returns ENOBUFS instead.

strace src/netperf -H localhost -t UDP_RR -- -s 1024 -r 60K
...
send(4, "netperf\0netperf\0netperf\0netperf\0n"..., 61440, 0) = -1 ENOBUFS (No buffer space available)

Of course the send() manpage on various Linux systems I've tried says:

        EMSGSIZE
               The socket type requires that message be sent atomically, and
               the size of the message to be sent made this impossible.

        ENOBUFS
               The output queue for a network interface was full.  This
               generally indicates that the interface has stopped sending,
               but may be caused by transient congestion.  (Normally, this
               does not occur in Linux.  Packets are just silently dropped
               when a device queue overflows.)

I suppose the manpages are old on that system.  Netperf interprets an ENOBUFS per the 
manpage, and will not exit immediately in a UDP_STREAM test, but will simply 
count the send as failed and try again.  Not sure if it is worth trying to teach 
netperf differently here or not.

rick jones

* Re: UDP path MTU discovery
  2010-03-31 23:43     ` Glen Turner
@ 2010-04-01  0:55       ` Andi Kleen
  2010-04-02  5:41         ` Glen Turner
  0 siblings, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2010-04-01  0:55 UTC (permalink / raw)
  To: Glen Turner; +Cc: Rick Jones, Andi Kleen, netdev

On Thu, Apr 01, 2010 at 10:13:04AM +1030, Glen Turner wrote:
> On Mon, 2010-03-29 at 10:01 -0700, Rick Jones wrote:
> 
> > But which of the last N datagrams sent by the application should be retained for 
> > retransmission?  It could be scores if not hundreds of datagrams depending on 
> > the behaviour of the application and the latency to the narrow part of the network.
> 
> We don't need that sort of exotica from the kernel.  The applications
> have to be prepared to retransmit lost packets in any case.
> 
> What we need is an API for an instant notification that a ICMP Packet
> Too Big message has arrived concerning the socket.

That already exists of course: IP_RECVERR

> As for David Miller's rant, the applications currently have no choice
> but to "do it stupidly" as the kernel doesn't pass enough information
> for user space to do it intelligently.  If the kernel passed user space
> the same indication as TCP gets, then we could -- and would -- do it
> right.

That's wrong. Linux has supported UDP/RAW PMTU discovery for many, many
years.

I have a really old presentation on it (from 2000 or so):

http://halobates.de/net-topics/text33.htm
http://halobates.de/net-topics/text34.htm
http://halobates.de/net-topics/text35.htm
http://halobates.de/net-topics/text36.htm

It's also in the manpages.

However, I suspect it's too much work to change a lot of applications
to use that, so the IPV6_USE_MIN_MTU workaround is probably still needed.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

* Re: UDP path MTU discovery
  2010-03-31 23:57     ` Glen Turner
@ 2010-04-01  0:57       ` Andi Kleen
  0 siblings, 0 replies; 34+ messages in thread
From: Andi Kleen @ 2010-04-01  0:57 UTC (permalink / raw)
  To: Glen Turner; +Cc: Andi Kleen, David Miller, netdev

> > Seems like a big hole not considered by the IPv6 designers?
> 
> Yeah. The sockets API for IPv6 required an additional feature that
> the IETF did not foresee.

Linux (or in this concrete case ANK) did foresee it.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

* Re: UDP path MTU discovery
  2010-04-01  0:55       ` Andi Kleen
@ 2010-04-02  5:41         ` Glen Turner
  2010-04-04 10:25           ` Andi Kleen
  0 siblings, 1 reply; 34+ messages in thread
From: Glen Turner @ 2010-04-02  5:41 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Rick Jones, netdev

On Thu, 2010-04-01 at 02:55 +0200, Andi Kleen wrote:
> > What we need is an API for an instant notification that a ICMP Packet
> > Too Big message has arrived concerning the socket.
> 
> That already exists of course: IP_RECVERR

Hi Andi,

So what should I code?  The suggested EMSGSIZE or your suggestion
of grabbing all returning ICMP and parsing it?  Noting that the
second choice is pretty ugly.  That both seem specific to Linux is
frustrating, but that is life -- adding support for an operating
system seems to inevitably add #ifdefs for this sort of code.

Let me know and I'll code it into FreeRADIUS and radsecproxy and
I'll see how they go with 802.1x requests over IPv6.

Thanks so much for your time, Glen

-- 
 Glen Turner
 www.gdt.id.au/~gdt


* Re: UDP path MTU discovery
  2010-04-02  5:41         ` Glen Turner
@ 2010-04-04 10:25           ` Andi Kleen
  0 siblings, 0 replies; 34+ messages in thread
From: Andi Kleen @ 2010-04-04 10:25 UTC (permalink / raw)
  To: Glen Turner; +Cc: Andi Kleen, Rick Jones, netdev

On Fri, Apr 02, 2010 at 04:11:58PM +1030, Glen Turner wrote:
> On Thu, 2010-04-01 at 02:55 +0200, Andi Kleen wrote:
> > > What we need is an API for an instant notification that a ICMP Packet
> > > Too Big message has arrived concerning the socket.
> > 
> > That already exists of course: IP_RECVERR
> 
> Hi Andi,
> 
> So what should I code?  The suggested EMSGSIZE or your suggestion
> of grabbing all returning ICMP and parsing it?  Noting that the

You don't need to parse any ICMPs; the kernel does that for you.
See the documentation of IP_RECVERR in ip(7). The MTU is in ee_info.

First you need to enable path MTU discovery for the socket
using IP_MTU_DISCOVER.

So you can either keep track of the MTU yourself based on the extended
errors coming out of IP_RECVERR, or ask the kernel using IP_MTU when
the socket is connected, or simply lower the datagram size when you see an
EMSGSIZE. It's also possible to do this with a dummy socket that gets
connected/unconnected.
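
A rough sketch of the first approach (tracking the MTU from the extended errors), written for an IPv4 UDP socket on Linux. The numeric fallbacks are the values from the Linux headers, used because Python's socket module does not expose every constant on every version; the function names are made up for this example, and the struct layout is an assumption based on struct sock_extended_err as documented in ip(7).

```python
import errno
import socket
import struct

# Linux values; fall back to the header numbers if the module lacks the names.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_RECVERR = getattr(socket, "IP_RECVERR", 11)
IP_PMTUDISC_DO = 2
MSG_ERRQUEUE = getattr(socket, "MSG_ERRQUEUE", 0x2000)

# struct sock_extended_err:
#   ee_errno, ee_origin, ee_type, ee_code, ee_pad, ee_info, ee_data
SOCK_EXTENDED_ERR = struct.Struct("=IBBBBII")


def enable_pmtu_errors(sock):
    """Turn on path MTU discovery and the per-socket error queue."""
    sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
    sock.setsockopt(socket.IPPROTO_IP, IP_RECVERR, 1)


def mtu_from_extended_err(cmsg_data):
    """Return ee_info (the new MTU) for an EMSGSIZE error, else None."""
    fields = SOCK_EXTENDED_ERR.unpack_from(cmsg_data)
    ee_errno, ee_info = fields[0], fields[5]
    return ee_info if ee_errno == errno.EMSGSIZE else None


def read_pmtu_from_error_queue(sock):
    """Pop one queued error; return the reported MTU or None (sketch)."""
    try:
        _data, ancdata, _flags, _addr = sock.recvmsg(
            2048, 512, MSG_ERRQUEUE | socket.MSG_DONTWAIT)
    except BlockingIOError:
        return None  # error queue is empty
    for level, ctype, cdata in ancdata:
        if level == socket.IPPROTO_IP and ctype == IP_RECVERR:
            return mtu_from_extended_err(cdata)
    return None
```

The intended use would be to call enable_pmtu_errors() once after creating the socket, then call read_pmtu_from_error_queue() when poll() reports POLLERR and resend with a datagram no larger than the returned MTU minus the headers. An IPv6 socket would need the corresponding IPV6_* options, which this sketch does not cover.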

> second choice is pretty ugly.  That both seem specific to Linux is
> frustrating, but that is life -- adding support for an operating
> system seems to inevitably add #ifdefs for this sort of code.

Well, when the other OSes see the need, they will hopefully add similar
interfaces, with some luck even compatible with the ones in Linux.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


Thread overview: 34+ messages
2010-03-26  0:02 UDP path MTU discovery Glen Turner
2010-03-26  0:53 ` Rick Jones
2010-03-26  3:26   ` David Miller
2010-03-26 17:48     ` Rick Jones
2010-03-31 23:42     ` Glen Turner
2010-03-31 23:51       ` Hagen Paul Pfeifer
2010-04-01  0:06         ` Rick Jones
2010-03-26  3:24 ` David Miller
2010-03-28  8:41   ` Andi Kleen
2010-03-31 23:57     ` Glen Turner
2010-04-01  0:57       ` Andi Kleen
2010-03-28  8:50 ` Andi Kleen
2010-03-29 17:01   ` Rick Jones
2010-03-29 20:14     ` Andi Kleen
2010-03-29 20:25       ` Rick Jones
2010-03-29 20:50       ` Edgar E. Iglesias
2010-03-29 21:01         ` Rick Jones
2010-03-29 21:29           ` Eric Dumazet
2010-03-29 23:38             ` Templin, Fred L
2010-03-30  5:20               ` Andi Kleen
2010-03-30  6:06                 ` Eric Dumazet
2010-03-30  6:16                   ` Andi Kleen
2010-03-30  6:17                   ` UDP path MTU discovery II Andi Kleen
2010-03-30  6:16                 ` UDP path MTU discovery Edgar E. Iglesias
2010-03-30  6:19                   ` Andi Kleen
2010-03-30  8:20                     ` Edgar E. Iglesias
2010-03-30 14:12                       ` Andi Kleen
2010-03-30 22:04                         ` Edgar E. Iglesias
2010-03-30 15:58                     ` Templin, Fred L
2010-03-30 16:06                       ` Andi Kleen
2010-03-31 23:43     ` Glen Turner
2010-04-01  0:55       ` Andi Kleen
2010-04-02  5:41         ` Glen Turner
2010-04-04 10:25           ` Andi Kleen
