Re: Race condition in route lookup

From: Martin Lau <kafai@fb.com>
To: Wei Wang <weiwan@google.com>
Cc: Ido Schimmel <idosch@idosch.org>,
	Jesse Hathaway <jesse@mbuki-mvuki.org>,
	Linux Kernel Network Developers <netdev@vger.kernel.org>
Subject: Re: Race condition in route lookup
Date: Mon, 14 Oct 2019 17:26:43 +0000	[thread overview]
Message-ID: <20191014172640.hezqrjpu43oggqjt@kafai-mbp.dhcp.thefacebook.com> (raw)
In-Reply-To: <CAEA6p_A_kNA8kVLmVn0e=fp=vx3xpHTTaVrx62NVCDLowVxaog@mail.gmail.com>

On Sun, Oct 13, 2019 at 05:23:01PM -0700, Wei Wang wrote:
> On Fri, Oct 11, 2019 at 11:56 PM Martin Lau <kafai@fb.com> wrote:
> >
> > On Fri, Oct 11, 2019 at 10:54:13AM -0700, Wei Wang wrote:
> > > On Fri, Oct 11, 2019 at 8:42 AM Ido Schimmel <idosch@idosch.org> wrote:
> > > >
> > > > On Fri, Oct 11, 2019 at 09:36:51AM -0500, Jesse Hathaway wrote:
> > > > > On Thu, Oct 10, 2019 at 3:31 AM Ido Schimmel <idosch@idosch.org> wrote:
> > > > > > I think it's working as expected. Here is my theory:
> > > > > >
> > > > > > If CPU0 is executing both the route get request and forwarding packets
> > > > > > through the directly connected interface, then the following can happen:
> > > > > >
> > > > > > <CPU0, t0> - In process context, per-CPU dst entry cached in the nexthop
> > > > > > is found. Not yet dumped to user space
> > > > > >
> > > > > > <Any CPU, t1> - Routes are added / removed, therefore invalidating the
> > > > > > cache by bumping 'net->ipv4.rt_genid'
> > > > > >
> > > > > > <CPU0, t2> - In softirq, packet is forwarded through the nexthop. The
> > > > > > cached dst entry is found to be invalid. Therefore, it is replaced by a
> > > > > > newer dst entry. dst_dev_put() is called on old entry which assigns the
> > > > > > blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because
> > > > > > it is not registered.
> > > > > >
> > > > > > <CPU0, t3> - After softirq finished executing, your route get request
> > > > > > from t0 is resumed and the old dst entry is dumped to user space with
> > > > > > ifindex of 0.
> > > > > >
> > > > > > I tested this on my system using your script to generate the route get
> > > > > > requests. I pinned it to the same CPU forwarding packets through the
> > > > > > nexthop. To constantly invalidate the cache I created another script
> > > > > > that simply adds and removes IP addresses from an interface.
> > > > > >
> > > > > > If I stop the packet forwarding or the script that invalidates the
> > > > > > cache, then I don't see any '*' answers to my route get requests.
> > > > >
> > > > > Thanks for the reply and analysis Ido, I tested with an additional script which
> > > > > adds and deletes a route in a loop, as you also saw this increased the
> > > > > frequency of blackhole route replies from the first script.
> > > > >
> > > > > Questions:
> > > > >
> > > > > 1. We saw this behavior occurring with TCP connections traversing our routers,
> > > > > though I was able to reproduce it with only local route requests on our router.
> > > > > Would you expect this same behavior for TCP traffic only in the kernel which
> > > > > does not go to userspace?
> > > >
> > > > Yes, the problem is in the input path where received packets need to be
> > > > forwarded.
> > > >
> > > > >
> > > > > 2. These blackhole routes occur even though our main routing table is not
> > > > > changing, however a separate route table managed by bird on the Linux router is
> > > > > changing. Is this still expected behavior given that the ip-rules and main
> > > > > route table used by these route requests are not changing?
> > > >
> > > > Yes, there is a per-netns counter that is incremented whenever cached
> > > > dst entries need to be invalidated. Since it is per-netns it is
> > > > incremented regardless of the routing table to which your insert the
> > > > route.
> > > >
> > > > >
> > > > > 3. We were previously rejecting these packets with an iptables rule which sent
> > > > > an ICMP prohibited message to the sender, this caused TCP connections to break
> > > > > with a EHOSTUNREACH, should we be silently dropping these packets instead?
> > > > >
> > > > > 4. If we should just be dropping these packets, why does the kernel not drop
> > > > > them instead of letting them traverse the iptables rules?
> > > >
> > > > I actually believe the current behavior is a bug that needs to be fixed.
> > > > See below.
> > > >
> > > > >
> > > > > > BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that
> > > > > > with older kernel versions you'll see 'lo' instead of '*'.
> > > > >
> > > > > Yes indeed! Thanks for solving that mystery as well, our routers are running
> > > > > 5.1, but we upgraded to 5.4-rc2 to determine whether the issue was still
> > > > > present in the latest kernel.
> > > >
> > > > Do you remember when you started seeing this behavior? I think it
> > > > started in 4.13 with commit ffe95ecf3a2e ("Merge branch
> > > > 'net-remove-dst-garbage-collector-logic'").
> > > >
> > > > Let me add Wei to see if/how this can be fixed.
> > > >
> > > > Wei, in case you don't have the original mail with the description of
> > > > the problem, it can be found here [1].
> > > >
> > > > I believe that the issue Jesse is experiencing is the following:
> > > >
> > > > <CPU A, t0> - Received packet A is forwarded and cached dst entry is
> > > > taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set()
> > > >
> > > > <t1> - Given Jesse has busy routers ("ingesting full BGP routing tables
> > > > from multiple ISPs"), route is added / deleted and rt_cache_flush() is
> > > > called
> > > >
> > > > <CPU B, t2> - Received packet B tries to use the same cached dst entry
> > > > from t0, but rt_cache_valid() is no longer true and it is replaced in
> > > > rt_cache_route() by the newer one. This calls dst_dev_put() on the
> > > > original dst entry which assigns the blackhole netdev to 'dst->dev'
> > > >
> > > > <CPU A, t3> - dst_input(skb) is called on packet A and it is dropped due
> > > > to 'dst->dev' being the blackhole netdev
> > > >
> > > > The following patch "fixes" the problem for me:
> > > >
> > > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > > index 42221a12bdda..1c67bdb80fd5 100644
> > > > --- a/net/ipv4/route.c
> > > > +++ b/net/ipv4/route.c
> > > > @@ -1482,7 +1482,6 @@ static bool rt_cache_route(struct fib_nh_common *nhc, struct rtable *rt)
> > > >         prev = cmpxchg(p, orig, rt);
> > > >         if (prev == orig) {
> > > >                 if (orig) {
> > > > -                       dst_dev_put(&orig->dst);
> > > >                         dst_release(&orig->dst);
> > > >                 }
> > > >         } else {
> > > >
> > > > But if this dst entry is cached in some inactive socket and the netdev
> > > > on which it took a reference needs to be unregistered, then we can
> > > > potentially wait forever. No?
> > > >
> > > Yes. That's exactly the reason we need to free the dev here.
> > > Otherwise as you described, we will see "unregister_netdevice: waiting
> > > for xxx to become free. Usage count = x" flushing the screen... Not
> > > fun...
> > >
> > >
> > > > I'm thinking that it can be fixed by making 'nhc_rth_input' per-CPU, in
> > > > a similar fashion to what Eric did in commit d26b3a7c4b3b ("ipv4: percpu
> > > > nh_rth_output cache").
> > > >
> > > Hmm... Yes... I would think a per-CPU input cache should work for the
> > > case above.
> > > Another idea is: instead of calling dst_dev_put() in rt_cache_route()
> > > to switch out the dev, we call, rt_add_uncached_list() to add this
> > > obsolete dst cache to the uncached list. And if the device gets
> > > unregistered, rt_flush_dev() takes care of all dst entries in the
> > > uncached list. I think that would work too.
> > >
> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > index dc1f510a7c81..ee618d4234ce 100644
> > > --- a/net/ipv4/route.c
> > > +++ b/net/ipv4/route.c
> > > @@ -1482,7 +1482,7 @@ static bool rt_cache_route(struct fib_nh_common
> > > *nhc, struct rtable *rt)
> > >         prev = cmpxchg(p, orig, rt);
> > >         if (prev == orig) {
> > >                 if (orig) {
> > > -                       dst_dev_put(&orig->dst);
> > > +                       rt_add_uncached_list(orig);
> > >                         dst_release(&orig->dst);
> > >                 }
> > >         } else {
> > >
> > > + Martin for his idea and input.
> > The above fix should work and a simple one liner for net.
> > percpu may be a too big hammer for bug fix.
> > It is only needed for input route?  A comment would be nice.
> >
> > While reading around, I am puzzling why a rt has to be recreated
> > for the same route.  I could be missing something.
> >
> > I don't recall that is happening to ipv6 route even that tree-branch's
> > fn_sernum has changed.
> >
> > It seems v4 sk has not stored the last lookup rt_genid.
> > e.g. __sk_dst_check(sk, 0).  Everyone is sharing the rt->rt_genid
> > to check for changes, so the rt must be re-created?
> >
> I think the reason rt has to be created is v4 code uses per net
> rt_genid. So changes to any route in the namespace will invalidate all
> other routes. (As David pointed out in his email.) However, v6 code
> uses per fib_node fn_sernum, and has a way to only invalidate route
> that are affected. (fib6_update_sernum_upto_root())
> And v4 code not caching rt_genid seems to be separate issue, I think...
Understood that v6 impact is smaller on route changes because there is per
fib6_node fn_sernum.

AFAICT, even for the route that are affected by fib6_update_sernum_upto_root(),
I don't see the RTF_PCPU route is re-created.  v6 sk does
dst_check() => re-lookup the fib6 =>
found the same RTF_PCPU (but does not re-create it) =>
update the sk with new cookie in ip6_dst_store()

> 
> 
> > >
> > > > Two questions:
> > > >
> > > > 1. Do you agree with the above analysis?
> > > > 2. Do you have a simpler/better solution in mind?
> > > >
> > > > Thanks
> > > >
> > > > [1] https://lore.kernel.org/netdev/CANSNSoVM1Uo106xfJtGpTyXNed8kOL4JiXqf3A1eZHBa7z3=yg@mail.gmail.com/T/#medece9445d617372b4842d44525ef0d3ba1ea083