From: shengyong <shengyong1@huawei.com>
To: Calvin Owens <calvinowens@fb.com>, Alex Gartrell <agartrell@fb.com>
Cc: <davem@davemloft.net>, <netdev@vger.kernel.org>,
	<yangyingling@huawei.com>, <steffen.klassert@secunet.com>,
	<hannes@redhat.com>, <lvs-devel@vger.kernel.org>,
	<kernel-team@fb.com>
Subject: Re: Question: should local address be expired when updating PMTU?
Date: Tue, 3 Feb 2015 11:21:17 +0800
Message-ID: <54D03EAD.5060307@huawei.com>
In-Reply-To: <20150203021007.GA1866582@mail.thefacebook.com>



On 2015/2/3 10:10, Calvin Owens wrote:
> On Monday 02/02 at 16:52 -0800, Alex Gartrell wrote:
>> Hello Shengyong,
>>
>>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>>> index b2614b2..b80317a 100644
>>> --- a/net/ipv6/route.c
>>> +++ b/net/ipv6/route.c
>>> @@ -1136,6 +1136,9 @@ static void ip6_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
>>>   {
>>>          struct rt6_info *rt6 = (struct rt6_info*)dst;
>>>
>>> +       if (rt6->rt6i_flags & RTF_LOCAL)
>>> +               return;
>>> +
>>>          dst_confirm(dst);
>>>          if (mtu < dst_mtu(dst) && rt6->rt6i_dst.plen == 128) {
>>>                  struct net *net = dev_net(dst->dev);
>>>
>>> So is this modification correct? Or how can we avoid such expiring? 
>>
>> FWIW, we encountered this problem with IPVS tunneling.  Here's a
>> patch done by Calvin (cc'ed) that fixes my attempted fix for this.
>> We're not particularly proud of this...
>>
>> At a high level, I don't think the RTF_LOCAL check was sufficient,
>> but I didn't investigate deeply enough and hopefully Calvin can say
>> why.
> 
> I honestly didn't spend much time at all finding the underlying cause
> because it appeared to be fixed upstream: on 3.19-rc5 you get all 3
> expected routes after the last step of my repro below.
Hi,
I just ran my test on 3.19.0-rc7, and it seems the local-addr-expired problem
is still not solved there.
> I just really
> needed to get this working at the time, and the gross disgusting
> horrible ugly awful [more negative adjectives] patch included below made
> it work.
> 
> FWIW, the explanation I wrote down in my notes was:
> 
> "The absence of RTF_NONEXTHOP is causing COWs to happen, which are
> always marked as RTF_CACHE. Somehow that's screwing things up in
> rt6_do_redirect()"
> 
> That could be BS though, I don't at all remember how I came to that
> conclusion. 
> 
> (/me resolves to write better notes in the future...)
> 
> Here's how to get the weird behavior on 3.10 (+stable):
> 
> $ sudo ip addr add local 4444::1 dev lo
> ### Now I have 2 routes in /proc/net/ipv6_route, a local and a non-local
> ### Both have the RTF_NONEXTHOP flag set (0x00200000)
> $ sudo ip route add local 4444::1 dev lo
> ### Now I have 3 routes in /proc/net/ipv6_route to 4444::1
> ### Notice the new route does NOT have the RTF_NONEXTHOP flag set
> $ sudo ip addr del local 4444::1 dev lo
> ### Now I just have the one route I created before
> $ sudo ip addr add local 4444::1 dev lo
> ### And now I have 3 routes again
> $ sudo ping6 4444::1
> [blah blah blah successful ping]
> $ sudo ip addr del local 4444::1 dev lo
> $ sudo ip addr add local 4444::1 dev lo
> ### Still have 3 routes
> $ sudo ip addr del local 4444::1 dev lo
> ### Now I just have my one route yet again
> ### Now, *without the address on lo*, talk to it (it works), then re-add it
> $ ping6 4444::1
> [blah blah blah successful ping]
> $ sudo ip addr add local 4444::1 dev lo
> ### Now I only have 2 routes... WAT!?
> ### Notice the LOCAL (0x80000000) route doesn't have the RTF_NONEXTHOP flag set
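
The flag checks in the repro above can be scripted. The following is only a
sketch: it assumes the usual ten-column /proc/net/ipv6_route layout (destination,
prefix length, source, source prefix length, next hop, metric, refcnt, use,
flags, device), takes the flag values from the kernel UAPI headers, and uses
made-up sample lines rather than output captured from the repro. The function
name parse_ipv6_route_line is hypothetical.

```python
# Flag bits as defined in the kernel UAPI headers (linux/route.h and
# linux/ipv6_route.h). Only the bits discussed in this thread are listed.
RTF_FLAGS = {
    0x00000001: "RTF_UP",
    0x00200000: "RTF_NONEXTHOP",
    0x00400000: "RTF_EXPIRES",
    0x01000000: "RTF_CACHE",
    0x80000000: "RTF_LOCAL",
}


def parse_ipv6_route_line(line):
    """Decode one /proc/net/ipv6_route line into the fields used here."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError("expected 10 fields, got %d" % len(fields))
    flags = int(fields[8], 16)
    # Report the known flag names in ascending bit order.
    names = [name for bit, name in sorted(RTF_FLAGS.items()) if flags & bit]
    return {
        "dst": fields[0],
        "plen": int(fields[1], 16),
        "flags": flags,
        "flag_names": names,
        "dev": fields[9],
    }


# Hypothetical sample lines for 4444::1/128 (NOT captured from a real system).
ZERO = "0" * 32
DST = "44440000000000000000000000000001"
SAMPLE = [
    "%s 80 %s 00 %s 00000000 00000001 00000000 80200001       lo" % (DST, ZERO, ZERO),
    "%s 80 %s 00 %s 00000400 00000000 00000000 80000001       lo" % (DST, ZERO, ZERO),
]

if __name__ == "__main__":
    for entry in map(parse_ipv6_route_line, SAMPLE):
        print(entry["dev"], hex(entry["flags"]), entry["flag_names"])
```

On a live system the same function could be fed the lines of
/proc/net/ipv6_route, filtered on the destination of interest, to count the
routes after each step.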
It looks like we are hitting different problems. Here is how I run my test (also
on 3.10 +stable):
      Host only
PC <------------> Virtual Machine
Create and send a packet using scapy:
-----------------------------------
| IPv6 (src=PC-addr, dst=VM-addr) |
|---------------------------------|
|     ICMPv6 (Packet Too Big)     |
|---------------------------------|
| IPv6 (src=VM-addr, dst=VM-addr) |
|---------------------------------|
| ICMPv6 (Neighbor Advertisement) |
-----------------------------------
Then the local address is set to expire. Once it has expired, the VM is
unreachable from the PC side.
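
For readers without scapy, the layering in the diagram above can be sketched
with the Python standard library alone. This is not the original test script:
the addresses (from the 2001:db8::/32 documentation prefix), the MTU value of
1280, the all-zero Neighbor Advertisement flags, and every function name are
assumptions made for illustration. The sketch only builds the bytes; sending
them needs a raw socket and root privileges, which is left out.

```python
import socket
import struct

IPPROTO_ICMPV6 = 58
ICMPV6_PKT_TOOBIG = 2      # ICMPv6 type: Packet Too Big
ND_NEIGHBOR_ADVERT = 136   # ICMPv6 type: Neighbor Advertisement


def ipv6_header(src, dst, payload_len, next_header, hop_limit=64):
    """40-byte IPv6 header: version 6, zero traffic class and flow label."""
    return (struct.pack("!IHBB", 6 << 28, payload_len, next_header, hop_limit)
            + socket.inet_pton(socket.AF_INET6, src)
            + socket.inet_pton(socket.AF_INET6, dst))


def checksum(data):
    """16-bit one's-complement Internet checksum."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def pseudo_header(src, dst, upper_len):
    """IPv6 pseudo-header that the ICMPv6 checksum covers."""
    return (socket.inet_pton(socket.AF_INET6, src)
            + socket.inet_pton(socket.AF_INET6, dst)
            + struct.pack("!I3xB", upper_len, IPPROTO_ICMPV6))


def icmpv6(src, dst, icmp_type, code, body):
    """ICMPv6 message with its checksum filled in."""
    msg = struct.pack("!BBH", icmp_type, code, 0) + body
    csum = checksum(pseudo_header(src, dst, len(msg)) + msg)
    return msg[:2] + struct.pack("!H", csum) + msg[4:]


def build_test_packet(pc_addr, vm_addr, mtu):
    """Outer IPv6 / Packet Too Big / inner IPv6 (VM -> VM) / Neighbor Adv."""
    # NA body: 4 bytes of flags and reserved bits (all zero here, an
    # assumption), followed by the 16-byte target address.
    na_body = struct.pack("!I", 0) + socket.inet_pton(socket.AF_INET6, vm_addr)
    inner_icmp = icmpv6(vm_addr, vm_addr, ND_NEIGHBOR_ADVERT, 0, na_body)
    inner = ipv6_header(vm_addr, vm_addr, len(inner_icmp), IPPROTO_ICMPV6) + inner_icmp
    # Packet Too Big body: 4-byte MTU followed by the invoking packet.
    ptb = icmpv6(pc_addr, vm_addr, ICMPV6_PKT_TOOBIG, 0,
                 struct.pack("!I", mtu) + inner)
    return ipv6_header(pc_addr, vm_addr, len(ptb), IPPROTO_ICMPV6) + ptb


if __name__ == "__main__":
    # Placeholder addresses and MTU, not the values used in the real test.
    packet = build_test_packet("2001:db8::1", "2001:db8::2", 1280)
    print(len(packet), "bytes")
```

The point of the construction is that the embedded packet claims the VM's own
address as both source and destination, so the Packet Too Big is applied to the
route for the VM's local address.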

thanks,
Sheng
> 
> Thanks,
> Calvin
> 
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index f14d49b..c607a42 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -1159,18 +1159,18 @@ static void ip6_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
>>                 }
>>                 dst_metric_set(dst, RTAX_MTU, mtu);
>>
>> -               /* FACEBOOK HACK: We need to not expire local non-expiring
>> -                * routes so that we don't accidentally start blackholing
>> -                * ipvs traffic when we happen to use it locally for
>> -                * healthchecking (see ip_vs_xmit.c --
>> -                * __ip_vs_get_out_rt_v6 invokes update_pmtu if the rt is
>> -                * associated with a socket)
>> -                * Alex Gartrell <agartrell@fb.com>
>> +               /*
>> +                * FACEBOOK HACK: Only expire routes that aren't destined for
>> +                * the loopback interface.
>> +                *
>> +                * This prevents the strange route coalescing that happens when
>> +                * you add an address to the loopback that had a route that had
>> +                * been used when the address didn't exist from getting expired
>> +                * and causing packet loss in shiv.
>>                  */
>> -               if (!(rt6->rt6i_flags & RTF_LOCAL) ||
>> -                   (rt6->rt6i_flags & (RTF_EXPIRES | RTF_CACHE)))
>> -                       rt6_update_expires(
>> -                               rt6, net->ipv6.sysctl.ip6_rt_mtu_expires);
>> +               if (!(dst->dev->flags & IFF_LOOPBACK))
>> +                       rt6_update_expires(rt6,
>> +                               net->ipv6.sysctl.ip6_rt_mtu_expires);
>>         }
>>  }
>>
>>
>> Cheers,
>> -- 
>> Alex Gartrell <agartrell@fb.com>
> 
> .
> 
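
To compare the three conditions quoted in this message, here is a small
user-space model of the "should this route get an expiry?" decision. It is a
sketch, not kernel code: it assumes the caller has already passed the
mtu < dst_mtu(dst) and plen == 128 checks, it ignores that the early return in
the first proposal also skips dst_confirm() and the MTU update, and the function
names are invented. Flag values are taken from the kernel UAPI headers.

```python
RTF_NONEXTHOP = 0x00200000
RTF_EXPIRES = 0x00400000
RTF_CACHE = 0x01000000
RTF_LOCAL = 0x80000000
IFF_LOOPBACK = 0x8


def expires_with_rtf_local_return(rt_flags, dev_flags):
    """Proposed patch: return early for any RTF_LOCAL route."""
    return not (rt_flags & RTF_LOCAL)


def expires_with_first_hack(rt_flags, dev_flags):
    """Removed hunk: skip only local routes that are neither EXPIRES nor CACHE."""
    return bool(not (rt_flags & RTF_LOCAL)
                or (rt_flags & (RTF_EXPIRES | RTF_CACHE)))


def expires_with_loopback_hack(rt_flags, dev_flags):
    """Added hunk: skip every route whose device is a loopback device."""
    return not (dev_flags & IFF_LOOPBACK)


# (description, route flags, device flags); the cases are illustrative only.
CASES = [
    ("local route on lo", RTF_LOCAL | RTF_NONEXTHOP, IFF_LOOPBACK),
    ("local RTF_CACHE clone on lo", RTF_LOCAL | RTF_CACHE, IFF_LOOPBACK),
    ("non-local route on lo", 0, IFF_LOOPBACK),
    ("non-local route on another device", 0, 0),
]

if __name__ == "__main__":
    for name, rt_flags, dev_flags in CASES:
        print(name,
              expires_with_rtf_local_return(rt_flags, dev_flags),
              expires_with_first_hack(rt_flags, dev_flags),
              expires_with_loopback_hack(rt_flags, dev_flags))
```

The second case is where the first hack differs from the other two: a local
route that carries RTF_CACHE still gets an expiry under it, which is consistent
with the note above about copies always being marked RTF_CACHE. The third case
shows that the loopback check is broader than the RTF_LOCAL check.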

