From: Ido Schimmel <idosch@idosch.org>
To: Jesse Hathaway <jesse@mbuki-mvuki.org>
Cc: netdev@vger.kernel.org
Subject: Re: Race condition in route lookup
Date: Thu, 10 Oct 2019 11:46:08 +0300
Message-ID: <20191010084608.GA4730@splinter>
In-Reply-To: <20191010083102.GA1336@splinter>

On Thu, Oct 10, 2019 at 11:31:04AM +0300, Ido Schimmel wrote:
> On Wed, Oct 09, 2019 at 11:00:07AM -0500, Jesse Hathaway wrote:
> > We have been experiencing a route lookup race condition on our internet facing
> > Linux routers. I have been able to reproduce the issue, but would love more
> > help in isolating the cause.
> > 
> > Looking up a route found in the main table returns `*` rather than the directly
> > connected interface about once for every 10-20 million requests. From my
> > reading of the iproute2 source code an asterisk is indicative of the kernel
> > returning an interface index of 0 rather than the correct directly connected
> > interface.
> > 
> > This is reproducible with the following bash snippet on 5.4-rc2:
> > 
> >   $ cat route-race
> >   #!/bin/bash
> > 
> >   # Generate 50 million individual route gets to feed as batch input to `ip`
> >   function ip-cmds() {
> >           route_get='route get 192.168.11.142 from 192.168.180.10 iif vlan180'
> >           for ((i = 0; i < 50000000; i++)); do
> >                   printf '%s\n' "${route_get}"
> >           done
> > 
> >   }
> > 
> >   ip-cmds | ip -d -o -batch - | grep -E 'dev \*' | uniq -c
> > 
> > Example output:
> > 
> >   $ ./route-race
> >         6 unicast 192.168.11.142 from 192.168.180.10 dev * table main
> > \    cache iif vlan180
> > 
> > These routers have multiple routing tables and are ingesting full BGP routing
> > tables from multiple ISPs:
> > 
> >   $ ip route show table all | wc -l
> >   3105543
> > 
> >   $ ip route show table main | wc -l
> >   54
> > 
> > Please let me know what other information I can provide, thanks in advance,
> 
> I think it's working as expected. Here is my theory:
> 
> If CPU0 is executing both the route get request and forwarding packets
> through the directly connected interface, then the following can happen:
> 
> <CPU0, t0> - In process context, per-CPU dst entry cached in the nexthop

Sorry, only the output path is per-CPU. See commit d26b3a7c4b3b ("ipv4:
percpu nh_rth_output cache"). I do see the issue regardless of the CPU
on which I run the route get request.

> is found. Not yet dumped to user space
> 
> <Any CPU, t1> - Routes are added / removed, therefore invalidating the
> cache by bumping 'net->ipv4.rt_genid'
> 
> <CPU0, t2> - In softirq, packet is forwarded through the nexthop. The
> cached dst entry is found to be invalid. Therefore, it is replaced by a
> newer dst entry. dst_dev_put() is called on old entry which assigns the
> blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because
> it is not registered.
> 
> <CPU0, t3> - After softirq finished executing, your route get request
> from t0 is resumed and the old dst entry is dumped to user space with
> ifindex of 0.
> 
> I tested this on my system using your script to generate the route get
> requests. I pinned it to the same CPU forwarding packets through the
> nexthop. To constantly invalidate the cache I created another script
> that simply adds and removes IP addresses from an interface.
> 
> If I stop the packet forwarding or the script that invalidates the
> cache, then I don't see any '*' answers to my route get requests.
> 
> BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that
> with older kernel versions you'll see 'lo' instead of '*'.
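
The second script mentioned above (the one that adds and removes IP
addresses to keep invalidating the cache) was not posted. A minimal
sketch of what such a loop could look like is below. The function names
churn_once and churn, the IP_BIN override, and the interface and address
in the usage note are all assumptions made for illustration, not the
script that was actually used.

```shell
#!/bin/bash
# Sketch of a cache-invalidation loop: repeatedly add and remove an address.
# All names here are illustrative; nothing is taken from the original test.

# Allow the `ip` binary to be overridden (for example with `echo` for a dry run).
IP_BIN="${IP_BIN:-ip}"

# Add and then remove one address on one interface.
churn_once() {
        local dev="$1" addr="$2"
        "${IP_BIN}" address add "${addr}" dev "${dev}"
        "${IP_BIN}" address del "${addr}" dev "${dev}"
}

# Repeat churn_once COUNT times; a COUNT of 0 (the default) loops until
# interrupted.
churn() {
        local dev="$1" addr="$2" count="${3:-0}" i=0
        while ((count == 0 || i < count)); do
                churn_once "${dev}" "${addr}"
                i=$((i + 1))
        done
}
```

After sourcing the file as root, something like
`churn dummy0 198.51.100.1/32` (placeholder interface and address) could
run alongside the route-race script while packets are being forwarded
through the nexthop. Setting IP_BIN=echo first prints the commands
instead of executing them.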
