All of lore.kernel.org
 help / color / mirror / Atom feed
From: Wei Wang <weiwan@google.com>
To: Martin Lau <kafai@fb.com>
Cc: Ido Schimmel <idosch@idosch.org>,
	Jesse Hathaway <jesse@mbuki-mvuki.org>,
	Linux Kernel Network Developers <netdev@vger.kernel.org>
Subject: Re: Race condition in route lookup
Date: Sun, 13 Oct 2019 17:23:01 -0700	[thread overview]
Message-ID: <CAEA6p_A_kNA8kVLmVn0e=fp=vx3xpHTTaVrx62NVCDLowVxaog@mail.gmail.com> (raw)
In-Reply-To: <20191012065608.igcba7tcjr4wkfsf@kafai-mbp.dhcp.thefacebook.com>

On Fri, Oct 11, 2019 at 11:56 PM Martin Lau <kafai@fb.com> wrote:
>
> On Fri, Oct 11, 2019 at 10:54:13AM -0700, Wei Wang wrote:
> > On Fri, Oct 11, 2019 at 8:42 AM Ido Schimmel <idosch@idosch.org> wrote:
> > >
> > > On Fri, Oct 11, 2019 at 09:36:51AM -0500, Jesse Hathaway wrote:
> > > > On Thu, Oct 10, 2019 at 3:31 AM Ido Schimmel <idosch@idosch.org> wrote:
> > > > > I think it's working as expected. Here is my theory:
> > > > >
> > > > > If CPU0 is executing both the route get request and forwarding packets
> > > > > through the directly connected interface, then the following can happen:
> > > > >
> > > > > <CPU0, t0> - In process context, per-CPU dst entry cached in the nexthop
> > > > > is found. Not yet dumped to user space
> > > > >
> > > > > <Any CPU, t1> - Routes are added / removed, therefore invalidating the
> > > > > cache by bumping 'net->ipv4.rt_genid'
> > > > >
> > > > > <CPU0, t2> - In softirq, packet is forwarded through the nexthop. The
> > > > > cached dst entry is found to be invalid. Therefore, it is replaced by a
> > > > > newer dst entry. dst_dev_put() is called on old entry which assigns the
> > > > > blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because
> > > > > it is not registered.
> > > > >
> > > > > <CPU0, t3> - After softirq finished executing, your route get request
> > > > > from t0 is resumed and the old dst entry is dumped to user space with
> > > > > ifindex of 0.
> > > > >
> > > > > I tested this on my system using your script to generate the route get
> > > > > requests. I pinned it to the same CPU forwarding packets through the
> > > > > nexthop. To constantly invalidate the cache I created another script
> > > > > that simply adds and removes IP addresses from an interface.
> > > > >
> > > > > If I stop the packet forwarding or the script that invalidates the
> > > > > cache, then I don't see any '*' answers to my route get requests.
> > > >
> > > > Thanks for the reply and analysis Ido, I tested with an additional script which
> > > > adds and deletes a route in a loop, as you also saw this increased the
> > > > frequency of blackhole route replies from the first script.
> > > >
> > > > Questions:
> > > >
> > > > 1. We saw this behavior occurring with TCP connections traversing our routers,
> > > > though I was able to reproduce it with only local route requests on our router.
> > > > Would you expect this same behavior for TCP traffic only in the kernel which
> > > > does not go to userspace?
> > >
> > > Yes, the problem is in the input path where received packets need to be
> > > forwarded.
> > >
> > > >
> > > > 2. These blackhole routes occur even though our main routing table is not
> > > > changing, however a separate route table managed by bird on the Linux router is
> > > > changing. Is this still expected behavior given that the ip-rules and main
> > > > route table used by these route requests are not changing?
> > >
> > > Yes, there is a per-netns counter that is incremented whenever cached
> > > dst entries need to be invalidated. Since it is per-netns it is
> > > incremented regardless of the routing table to which your insert the
> > > route.
> > >
> > > >
> > > > 3. We were previously rejecting these packets with an iptables rule which sent
> > > > an ICMP prohibited message to the sender, this caused TCP connections to break
> > > > with a EHOSTUNREACH, should we be silently dropping these packets instead?
> > > >
> > > > 4. If we should just be dropping these packets, why does the kernel not drop
> > > > them instead of letting them traverse the iptables rules?
> > >
> > > I actually believe the current behavior is a bug that needs to be fixed.
> > > See below.
> > >
> > > >
> > > > > BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that
> > > > > with older kernel versions you'll see 'lo' instead of '*'.
> > > >
> > > > Yes indeed! Thanks for solving that mystery as well, our routers are running
> > > > 5.1, but we upgraded to 5.4-rc2 to determine whether the issue was still
> > > > present in the latest kernel.
> > >
> > > Do you remember when you started seeing this behavior? I think it
> > > started in 4.13 with commit ffe95ecf3a2e ("Merge branch
> > > 'net-remove-dst-garbage-collector-logic'").
> > >
> > > Let me add Wei to see if/how this can be fixed.
> > >
> > > Wei, in case you don't have the original mail with the description of
> > > the problem, it can be found here [1].
> > >
> > > I believe that the issue Jesse is experiencing is the following:
> > >
> > > <CPU A, t0> - Received packet A is forwarded and cached dst entry is
> > > taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set()
> > >
> > > <t1> - Given Jesse has busy routers ("ingesting full BGP routing tables
> > > from multiple ISPs"), route is added / deleted and rt_cache_flush() is
> > > called
> > >
> > > <CPU B, t2> - Received packet B tries to use the same cached dst entry
> > > from t0, but rt_cache_valid() is no longer true and it is replaced in
> > > rt_cache_route() by the newer one. This calls dst_dev_put() on the
> > > original dst entry which assigns the blackhole netdev to 'dst->dev'
> > >
> > > <CPU A, t3> - dst_input(skb) is called on packet A and it is dropped due
> > > to 'dst->dev' being the blackhole netdev
> > >
> > > The following patch "fixes" the problem for me:
> > >
> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > index 42221a12bdda..1c67bdb80fd5 100644
> > > --- a/net/ipv4/route.c
> > > +++ b/net/ipv4/route.c
> > > @@ -1482,7 +1482,6 @@ static bool rt_cache_route(struct fib_nh_common *nhc, struct rtable *rt)
> > >         prev = cmpxchg(p, orig, rt);
> > >         if (prev == orig) {
> > >                 if (orig) {
> > > -                       dst_dev_put(&orig->dst);
> > >                         dst_release(&orig->dst);
> > >                 }
> > >         } else {
> > >
> > > But if this dst entry is cached in some inactive socket and the netdev
> > > on which it took a reference needs to be unregistered, then we can
> > > potentially wait forever. No?
> > >
> > Yes. That's exactly the reason we need to free the dev here.
> > Otherwise as you described, we will see "unregister_netdevice: waiting
> > for xxx to become free. Usage count = x" flushing the screen... Not
> > fun...
> >
> >
> > > I'm thinking that it can be fixed by making 'nhc_rth_input' per-CPU, in
> > > a similar fashion to what Eric did in commit d26b3a7c4b3b ("ipv4: percpu
> > > nh_rth_output cache").
> > >
> > Hmm... Yes... I would think a per-CPU input cache should work for the
> > case above.
> > Another idea is: instead of calling dst_dev_put() in rt_cache_route()
> > to switch out the dev, we call, rt_add_uncached_list() to add this
> > obsolete dst cache to the uncached list. And if the device gets
> > unregistered, rt_flush_dev() takes care of all dst entries in the
> > uncached list. I think that would work too.
> >
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index dc1f510a7c81..ee618d4234ce 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -1482,7 +1482,7 @@ static bool rt_cache_route(struct fib_nh_common
> > *nhc, struct rtable *rt)
> >         prev = cmpxchg(p, orig, rt);
> >         if (prev == orig) {
> >                 if (orig) {
> > -                       dst_dev_put(&orig->dst);
> > +                       rt_add_uncached_list(orig);
> >                         dst_release(&orig->dst);
> >                 }
> >         } else {
> >
> > + Martin for his idea and input.
> The above fix should work and a simple one liner for net.
> percpu may be a too big hammer for bug fix.
> It is only needed for input route?  A comment would be nice.
>
> While reading around, I am puzzling why a rt has to be recreated
> for the same route.  I could be missing something.
>
> I don't recall that is happening to ipv6 route even that tree-branch's
> fn_sernum has changed.
>
> It seems v4 sk has not stored the last lookup rt_genid.
> e.g. __sk_dst_check(sk, 0).  Everyone is sharing the rt->rt_genid
> to check for changes, so the rt must be re-created?
>
I think the reason rt has to be created is v4 code uses per net
rt_genid. So changes to any route in the namespace will invalidate all
other routes. (As David pointed out in his email.) However, v6 code
uses per fib_node fn_sernum, and has a way to only invalidate route
that are affected. (fib6_update_sernum_upto_root())
And v4 code not caching rt_genid seems to be separate issue, I think...


> >
> > > Two questions:
> > >
> > > 1. Do you agree with the above analysis?
> > > 2. Do you have a simpler/better solution in mind?
> > >
> > > Thanks
> > >
> > > [1] https://lore.kernel.org/netdev/CANSNSoVM1Uo106xfJtGpTyXNed8kOL4JiXqf3A1eZHBa7z3=yg@mail.gmail.com/T/#medece9445d617372b4842d44525ef0d3ba1ea083

  reply	other threads:[~2019-10-14  0:23 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-09 16:00 Race condition in route lookup Jesse Hathaway
2019-10-10  8:31 ` Ido Schimmel
2019-10-10  8:46   ` Ido Schimmel
2019-10-11 14:36   ` Jesse Hathaway
2019-10-11 15:42     ` Ido Schimmel
2019-10-11 16:09       ` Jesse Hathaway
2019-10-11 17:54       ` Wei Wang
2019-10-11 18:17         ` Ido Schimmel
2019-10-11 18:25           ` Ido Schimmel
2019-10-11 18:47             ` Wei Wang
2019-10-11 18:52               ` Ido Schimmel
2019-10-11 21:01                 ` Jesse Hathaway
2019-10-11 21:27                 ` David Ahern
2019-10-12  6:56         ` Martin Lau
2019-10-14  0:23           ` Wei Wang [this message]
2019-10-14 17:26             ` Martin Lau
2019-10-15 14:45               ` David Ahern
2019-10-15 16:42                 ` Wei Wang
2019-10-16  6:35                   ` Martin Lau
2019-10-15 14:29         ` Jesse Hathaway
2019-10-15 16:44           ` Wei Wang
2019-10-16  6:39             ` Martin Lau
2019-10-16 16:35               ` Wei Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAEA6p_A_kNA8kVLmVn0e=fp=vx3xpHTTaVrx62NVCDLowVxaog@mail.gmail.com' \
    --to=weiwan@google.com \
    --cc=idosch@idosch.org \
    --cc=jesse@mbuki-mvuki.org \
    --cc=kafai@fb.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.