linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] icmp: Restore resistence to abnormal messages
@ 2016-11-11 20:20 Vicente Jimenez Aguilar
  2016-11-14 18:36 ` David Miller
  0 siblings, 1 reply; 8+ messages in thread
From: Vicente Jimenez Aguilar @ 2016-11-11 20:20 UTC (permalink / raw)
  To: David S . Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy
  Cc: netdev, linux-kernel, Vicente Jimenez Aguilar

Restore network resistance to abnormal ICMP fragmentation needed messages
with next hop MTU equal to (or exceeding) dropped packet size

Fixes: 46517008e116 ("ipv4: Kill ip_rt_frag_needed().")
Signed-off-by: Vicente Jimenez Aguilar <googuy@gmail.com>
---
 net/ipv4/icmp.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 38abe70..4c90d76 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -773,6 +773,7 @@ static bool icmp_tag_validation(int proto)
 static bool icmp_unreach(struct sk_buff *skb)
 {
 	const struct iphdr *iph;
+	unsigned short old_mtu;
 	struct icmphdr *icmph;
 	struct net *net;
 	u32 info = 0;
@@ -819,6 +820,12 @@ static bool icmp_unreach(struct sk_buff *skb)
 				/* fall through */
 			case 0:
 				info = ntohs(icmph->un.frag.mtu);
+				/* Handle weird case where next hop MTU is
+				 * equal to or exceeding dropped packet size
+				 */
+				old_mtu = ntohs(iph->tot_len);
+				if (info >= old_mtu)
+					info = old_mtu - 2;
 			}
 			break;
 		case ICMP_SR_FAILED:
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] icmp: Restore resistence to abnormal messages
  2016-11-11 20:20 [PATCH] icmp: Restore resistence to abnormal messages Vicente Jimenez Aguilar
@ 2016-11-14 18:36 ` David Miller
  2016-11-15 16:49   ` Vicente Jiménez
  0 siblings, 1 reply; 8+ messages in thread
From: David Miller @ 2016-11-14 18:36 UTC (permalink / raw)
  To: googuy; +Cc: kuznet, jmorris, yoshfuji, kaber, netdev, linux-kernel

From: Vicente Jimenez Aguilar <googuy@gmail.com>
Date: Fri, 11 Nov 2016 21:20:18 +0100

> @@ -819,6 +820,12 @@ static bool icmp_unreach(struct sk_buff *skb)
>  				/* fall through */
>  			case 0:
>  				info = ntohs(icmph->un.frag.mtu);
> +				/* Handle weird case where next hop MTU is
> +				 * equal to or exceeding dropped packet size
> +				 */
> +				old_mtu = ntohs(iph->tot_len);
> +				if (info >= old_mtu)
> +					info = old_mtu - 2;

This isn't something the old code did.

The old code behaved much differently.

In the case where the new mtu was smaller than 68 or larger than
the iph->tot_len value, it would do several things:

1) First it would check for a BSD 4.2 anomaly and subtract old_mtu
   by the IP header length.

2) Second, it would try to guess the intended MTU using the
   mtu_plateau table.

I don't see any code where a subtraction by a fixed constant of 2
occurred.

Nor can I figure out what that might accomplish.  If you really
want to do this, you have to docuement what this 2 means, what
it is accomplishing, and why you have choosen to accomplish it
this way.

Thanks.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] icmp: Restore resistence to abnormal messages
  2016-11-14 18:36 ` David Miller
@ 2016-11-15 16:49   ` Vicente Jiménez
  2016-11-15 16:56     ` David Miller
  0 siblings, 1 reply; 8+ messages in thread
From: Vicente Jiménez @ 2016-11-15 16:49 UTC (permalink / raw)
  To: David Miller
  Cc: Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6602 bytes --]

On Mon, Nov 14, 2016 at 7:36 PM, David Miller <davem@davemloft.net> wrote:
> From: Vicente Jimenez Aguilar <googuy@gmail.com>
> Date: Fri, 11 Nov 2016 21:20:18 +0100
>
>> @@ -819,6 +820,12 @@ static bool icmp_unreach(struct sk_buff *skb)
>>                               /* fall through */
>>                       case 0:
>>                               info = ntohs(icmph->un.frag.mtu);
>> +                             /* Handle weird case where next hop MTU is
>> +                              * equal to or exceeding dropped packet size
>> +                              */
>> +                             old_mtu = ntohs(iph->tot_len);
>> +                             if (info >= old_mtu)
>> +                                     info = old_mtu - 2;
>
> This isn't something the old code did.
>
> The old code behaved much differently.
>
I don't wanted to restore old behavior just fix a strange case that
was handle by this code where the next hop MTU reported by the router
is equal or greater than the actual path MTU. Because router
information is wrong, we need a way to guess a good packet size
ignoring router data. The simplest strategy that avoid odd numbers is
reducing dropped packet size by 2.

> In the case where the new mtu was smaller than 68 or larger than
> the iph->tot_len value, it would do several things:
Larger OR EQUAL (our case) than the iph->tot_len.
>
> 1) First it would check for a BSD 4.2 anomaly and subtract old_mtu
>    by the IP header length.
This only happens when next hop MTU is zero. I don't care about this
case. It is handled at protocol level as the commit that removed that
function says.

>
> 2) Second, it would try to guess the intended MTU using the
>    mtu_plateau table.
>
Old kernel work because they reduced the current MTU by a small amount
using the mtu_plateau.
Not because we have some trouble with headers and we need the common 1492 value.

> I don't see any code where a subtraction by a fixed constant of 2
> occurred.
>
This is the simplest solution I came about. Because packets of size
old_mtu get dropped and suggested next hop MTU is equal or larger, we
just ignore wrong router information and try to get the correct MTU by
reducing current MTU by just a bit. I discarded subtracting by 1 to
avoid odd numbers.
For our case:
 - Old kernel sets MTU to 1492 from previous 1500 bytes and worked.
 - Newer kernel enters infinite loop sending dropped size packets.
 - Patched kernel sets MTU to 1498 from 1500 and go trough this abnormal router.
 - Microsoft Windows also recovers from those strange ICMP messages.

By the way, setting MTU to 1499 also failed to communicate for us.

> Nor can I figure out what that might accomplish.  If you really
> want to do this, you have to docuement what this 2 means, what
> it is accomplishing, and why you have choosen to accomplish it
> this way.
>
> Thanks.

I just wanted to restore the "new_mtu >= old_mtu" check and instead of
reducing the current packet size using a plateau function, decrement
current size by 2 (to avoid odd numbers). This solution solved our
problem setting MTU to 1498 and I think this must converge to any good
MTU value because it reduce packet size until we don't receive any
more ICMP fragmentation needed messages.

We can criticize that this solution could be slow to reach small MTU values.
But I suspect that those abnormal ICMP messages was caused by a router
bug and must be very close to a size that get through them. If
suggested MTU is smaller than current packet size, we set the MTU to
next hop MTU. If those packets that are equal in size to the suggested
one still get dropped, then this code try to overcome that. It had the
strength of being really simple and it must be ideally never be
executed (we are still trying to fix the abnormal router, but we don't
manage it). I suspect that for those weird case, a working size must
be very close to the suggested and it is far better than current
kernel that just take the value in next hop MTU without any check and
enter in a infinite loop sending dropped size packets. TCP connections
stall and finally time out.

I suspect that a router bug get size comparison wrong and the real
path MTU is 1500 (as it is in the other direction). We need to get
around this and currently we just lowered interface MTU at startup for
those devices.

Where do I need to put more explanation?
 - Code comment
 - Commit message

Attached capture file that show the problem: Abnormal ICMP messages
that drops and suggest 1500 bytes.

This is the long explanation that I emailed earlier:

"
In a large corporate network, we spotted this weird ICMP message after
a long troubleshooting. See attached capture file. Those ICMP "network
unreachable - fragmentation needed and don't fragment bit set"
messages are sent by a router that drop 1500 bytes IP packets and fill
the next hop MTU ICMP field with 1500.

Those messages cause the TCP connection to stall but only on newer
kernels. Older kernels set path MTU to 1492 and communicates
successfully.

After checking code and commit history, I spotted how commit
46517008e116 ("ipv4: Kill ip_rt_frag_needed().") from June 2012
changed ICMP messages handling by removing ip_rt_frag_needed function.

The relevant part of the ip_rt_frag_needed function that was removed is:

if (new_mtu < 68 || new_mtu >= old_mtu) {
        /* BSD 4.2 derived systems incorrectly adjust
         * tot_len by the IP header length, and report
         * a zero MTU in the ICMP message.
         */
        if (mtu == 0 &&
            old_mtu >= 68 + (iph->ihl << 2))
                old_mtu -= iph->ihl << 2;
        mtu = guess_mtu(old_mtu);
}

This condition handled the cases when next hop MTU where zero (less
than 68). Now this is handled by the protocol and fixed by commit
68b7107b6298 "ipv4: icmp: Fix pMTU handling for rare case".

But the rarest case when (next hop MTU) new_mtu >= old_mtu (dropped
packet length) was also removed. This commit restores this check.
Instead of using a table lookup like function guess_mtu uses, it just
try to set the path MTU decrementing by 2 bytes the dropped packet
size.

In our case, setting the path MTU to just 1498 (one iteration) worked.
This solution should converge in any case to a good value by small
steps. I don't think there's a need to a more complex solution.

The patched kernel worked perfectly setting the path MTU to 1498 from
the initial default interface value of 1500. This time I don't have a
capture file from inside the affected center, but all received packed
had a maximum size of 1498.
"

-- 
thanks
vicente

[-- Attachment #2: ICMP dropping and suggesting 1500.pcapng --]
[-- Type: application/x-pcapng, Size: 89664 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] icmp: Restore resistence to abnormal messages
  2016-11-15 16:49   ` Vicente Jiménez
@ 2016-11-15 16:56     ` David Miller
  2016-11-15 17:30       ` Florian Westphal
  0 siblings, 1 reply; 8+ messages in thread
From: David Miller @ 2016-11-15 16:56 UTC (permalink / raw)
  To: googuy; +Cc: kuznet, jmorris, yoshfuji, kaber, netdev, linux-kernel

From: Vicente Jiménez <googuy@gmail.com>
Date: Tue, 15 Nov 2016 17:49:43 +0100

> On Mon, Nov 14, 2016 at 7:36 PM, David Miller <davem@davemloft.net> wrote:
>> From: Vicente Jimenez Aguilar <googuy@gmail.com>
>> Date: Fri, 11 Nov 2016 21:20:18 +0100
>>
>>> @@ -819,6 +820,12 @@ static bool icmp_unreach(struct sk_buff *skb)
>>>                               /* fall through */
>>>                       case 0:
>>>                               info = ntohs(icmph->un.frag.mtu);
>>> +                             /* Handle weird case where next hop MTU is
>>> +                              * equal to or exceeding dropped packet size
>>> +                              */
>>> +                             old_mtu = ntohs(iph->tot_len);
>>> +                             if (info >= old_mtu)
>>> +                                     info = old_mtu - 2;
>>
>> This isn't something the old code did.
>>
>> The old code behaved much differently.
>>
> I don't wanted to restore old behavior just fix a strange case that
> was handle by this code where the next hop MTU reported by the router
> is equal or greater than the actual path MTU. Because router
> information is wrong, we need a way to guess a good packet size
> ignoring router data. The simplest strategy that avoid odd numbers is
> reducing dropped packet size by 2.

This whole approach seems arbitrary.

You haven't discussed in any way, what causes this in the first place.
And what about that cause makes simply subtracting by 2 work well or
not.

You have a very locallized, specific, situation on your end you want
to fix.  But we must accept changes that handle things generically and
in a way that would help more than just your specific case.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] icmp: Restore resistence to abnormal messages
  2016-11-15 16:56     ` David Miller
@ 2016-11-15 17:30       ` Florian Westphal
  2016-11-15 19:32         ` Vicente Jiménez
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Westphal @ 2016-11-15 17:30 UTC (permalink / raw)
  To: David Miller
  Cc: googuy, kuznet, jmorris, yoshfuji, kaber, netdev, linux-kernel

David Miller <davem@redhat.com> wrote:
> From: Vicente Jiménez <googuy@gmail.com>
> Date: Tue, 15 Nov 2016 17:49:43 +0100
> 
> > On Mon, Nov 14, 2016 at 7:36 PM, David Miller <davem@davemloft.net> wrote:
> >> From: Vicente Jimenez Aguilar <googuy@gmail.com>
> >> Date: Fri, 11 Nov 2016 21:20:18 +0100
> >>
> >>> @@ -819,6 +820,12 @@ static bool icmp_unreach(struct sk_buff *skb)
> >>>                               /* fall through */
> >>>                       case 0:
> >>>                               info = ntohs(icmph->un.frag.mtu);
> >>> +                             /* Handle weird case where next hop MTU is
> >>> +                              * equal to or exceeding dropped packet size
> >>> +                              */
> >>> +                             old_mtu = ntohs(iph->tot_len);
> >>> +                             if (info >= old_mtu)
> >>> +                                     info = old_mtu - 2;
> >>
> >> This isn't something the old code did.
> >>
> >> The old code behaved much differently.
> >>
> > I don't wanted to restore old behavior just fix a strange case that
> > was handle by this code where the next hop MTU reported by the router
> > is equal or greater than the actual path MTU. Because router
> > information is wrong, we need a way to guess a good packet size
> > ignoring router data. The simplest strategy that avoid odd numbers is
> > reducing dropped packet size by 2.
> 
> This whole approach seems arbitrary.
> 
> You haven't discussed in any way, what causes this in the first place.
> And what about that cause makes simply subtracting by 2 work well or
> not.
> 
> You have a very locallized, specific, situation on your end you want
> to fix.  But we must accept changes that handle things generically and
> in a way that would help more than just your specific case.

FWIW this is similar to the patch I sent a while ago:

https://patchwork.ozlabs.org/patch/493997/

I think in interest of robustness principle ("eat shit and don't die")
one of these changes should go in :-|

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] icmp: Restore resistence to abnormal messages
  2016-11-15 17:30       ` Florian Westphal
@ 2016-11-15 19:32         ` Vicente Jiménez
  2016-11-16  1:14           ` Florian Westphal
  0 siblings, 1 reply; 8+ messages in thread
From: Vicente Jiménez @ 2016-11-15 19:32 UTC (permalink / raw)
  To: Florian Westphal
  Cc: David Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, netdev, linux-kernel

I agree that both patches try to solve the same problem in a very similar way.
Florian Westphal's patch do two more things:
1- add warning with pr_warn_ratelimited. I like this idea. I also
though about adding some message but I have no kernel experience and I
preferred to have just a working solution.
2- Check if the packet size is lower than (536 + 8). I think this is
not necessary because low values (even the zero case) is already
handled by the protocol. Also I don't understand why you choose this
value, it seems to be related to TCP MSS and the compared value is IP
packet size.

Finally, both patches decrement current packet by a value: Mine by 2
and Florian's by 8 bytes. Both arbitrary values. Personally I prefer
to go by small steps. If the small step fails, it just iterate again
and with 4 iterations, my patch also decrement the original value by 8
bytes (4x2).
Basically they are the same but my patch take smaller steps and miss
the warning message.

If David Miller thinks this could be a good addition, I'll add the
warning message to my patch.

We can also discuss the amount to subtract.

On Tue, Nov 15, 2016 at 6:30 PM, Florian Westphal <fw@strlen.de> wrote:
> David Miller <davem@redhat.com> wrote:
>> From: Vicente Jiménez <googuy@gmail.com>
>> Date: Tue, 15 Nov 2016 17:49:43 +0100
>>
>> > On Mon, Nov 14, 2016 at 7:36 PM, David Miller <davem@davemloft.net> wrote:
>> >> From: Vicente Jimenez Aguilar <googuy@gmail.com>
>> >> Date: Fri, 11 Nov 2016 21:20:18 +0100
>> >>
>> >>> @@ -819,6 +820,12 @@ static bool icmp_unreach(struct sk_buff *skb)
>> >>>                               /* fall through */
>> >>>                       case 0:
>> >>>                               info = ntohs(icmph->un.frag.mtu);
>> >>> +                             /* Handle weird case where next hop MTU is
>> >>> +                              * equal to or exceeding dropped packet size
>> >>> +                              */
>> >>> +                             old_mtu = ntohs(iph->tot_len);
>> >>> +                             if (info >= old_mtu)
>> >>> +                                     info = old_mtu - 2;
>> >>
>> >> This isn't something the old code did.
>> >>
>> >> The old code behaved much differently.
>> >>
>> > I don't wanted to restore old behavior just fix a strange case that
>> > was handle by this code where the next hop MTU reported by the router
>> > is equal or greater than the actual path MTU. Because router
>> > information is wrong, we need a way to guess a good packet size
>> > ignoring router data. The simplest strategy that avoid odd numbers is
>> > reducing dropped packet size by 2.
>>
>> This whole approach seems arbitrary.
>>
>> You haven't discussed in any way, what causes this in the first place.
>> And what about that cause makes simply subtracting by 2 work well or
>> not.
>>
>> You have a very locallized, specific, situation on your end you want
>> to fix.  But we must accept changes that handle things generically and
>> in a way that would help more than just your specific case.
>
> FWIW this is similar to the patch I sent a while ago:
>
> https://patchwork.ozlabs.org/patch/493997/
>
> I think in interest of robustness principle ("eat shit and don't die")
> one of these changes should go in :-|



-- 
saludos
vicente

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] icmp: Restore resistence to abnormal messages
  2016-11-15 19:32         ` Vicente Jiménez
@ 2016-11-16  1:14           ` Florian Westphal
  2016-11-17  1:17             ` Vicente Jiménez
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Westphal @ 2016-11-16  1:14 UTC (permalink / raw)
  To: Vicente Jiménez
  Cc: Florian Westphal, David Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev, linux-kernel

Vicente Jiménez <googuy@gmail.com> wrote:
> I agree that both patches try to solve the same problem in a very similar way.
> Florian Westphal's patch do two more things:
> 1- add warning with pr_warn_ratelimited. I like this idea. I also
> though about adding some message but I have no kernel experience and I
> preferred to have just a working solution.

I added this only to show whats happening.

I don't like such printks because end users can't do anything about it.

> 2- Check if the packet size is lower than (536 + 8). I think this is
> not necessary because low values (even the zero case) is already
> handled by the protocol. Also I don't understand why you choose this
> value, it seems to be related to TCP MSS and the compared value is IP
> packet size.

Right, no need for this check.

> Finally, both patches decrement current packet by a value: Mine by 2
> and Florian's by 8 bytes. Both arbitrary values. Personally I prefer
> to go by small steps. If the small step fails, it just iterate again
> and with 4 iterations, my patch also decrement the original value by 8
> bytes (4x2).
> Basically they are the same but my patch take smaller steps and miss
> the warning message.

IIRC I chose 8 because connection recovered faster in my case.

I have not experienced this issue again (I dropped the patch from
my kernel at some point and the connection stalls did not reappear so
this got fixed elsewhere).

I'd just apply your patch, possibly with an additional comment that
says that we're grasping at straws because some middlebox is evidently
feeding bogus pmtu information.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] icmp: Restore resistence to abnormal messages
  2016-11-16  1:14           ` Florian Westphal
@ 2016-11-17  1:17             ` Vicente Jiménez
  0 siblings, 0 replies; 8+ messages in thread
From: Vicente Jiménez @ 2016-11-17  1:17 UTC (permalink / raw)
  To: Florian Westphal
  Cc: David Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, netdev, linux-kernel

On Wed, Nov 16, 2016 at 2:14 AM, Florian Westphal <fw@strlen.de> wrote:
> Vicente Jiménez <googuy@gmail.com> wrote:
>> 1- add warning with pr_warn_ratelimited. I like this idea. I also
>> though about adding some message but I have no kernel experience and I
>> preferred to have just a working solution.
>
> I added this only to show whats happening.
>
> I don't like such printks because end users can't do anything about it.
>
What about using net_dbg_ratelimited macro? it only adds messages if
debug is enabled.

>
>> Finally, both patches decrement current packet by a value: Mine by 2
>> and Florian's by 8 bytes. Both arbitrary values. Personally I prefer
>> to go by small steps. If the small step fails, it just iterate again
>> and with 4 iterations, my patch also decrement the original value by 8
>> bytes (4x2).
>> Basically they are the same but my patch take smaller steps and miss
>> the warning message.
>
> IIRC I chose 8 because connection recovered faster in my case.
>
> I have not experienced this issue again (I dropped the patch from
> my kernel at some point and the connection stalls did not reappear so
> this got fixed elsewhere).
My issue is permanent for now in various locations. We have to
decrease MTU manually on all devices with newer kernels. We don't have
direct access to those abnormal routers because they are managed by a
communication provider that think the network had no problem because
all their Windows machines apparently work perfectly.

-- 
cheers
vicente

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2016-11-17  1:17 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-11 20:20 [PATCH] icmp: Restore resistence to abnormal messages Vicente Jimenez Aguilar
2016-11-14 18:36 ` David Miller
2016-11-15 16:49   ` Vicente Jiménez
2016-11-15 16:56     ` David Miller
2016-11-15 17:30       ` Florian Westphal
2016-11-15 19:32         ` Vicente Jiménez
2016-11-16  1:14           ` Florian Westphal
2016-11-17  1:17             ` Vicente Jiménez

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).