netdev.vger.kernel.org archive mirror
From: Ido Schimmel <idosch@idosch.org>
To: Jesse Hathaway <jesse@mbuki-mvuki.org>
Cc: netdev@vger.kernel.org
Subject: Re: Race condition in route lookup
Date: Thu, 10 Oct 2019 11:31:02 +0300
Message-ID: <20191010083102.GA1336@splinter>
In-Reply-To: <CANSNSoV1M9stB7CnUcEhsz3FHi4NV_yrBtpYsZ205+rqnvMbvA@mail.gmail.com>

On Wed, Oct 09, 2019 at 11:00:07AM -0500, Jesse Hathaway wrote:
> We have been experiencing a route lookup race condition on our internet facing
> Linux routers. I have been able to reproduce the issue, but would love more
> help in isolating the cause.
> 
> Looking up a route found in the main table returns `*` rather than the directly
> connected interface about once for every 10-20 million requests. From my
> reading of the iproute2 source code an asterisk is indicative of the kernel
> returning an interface index of 0 rather than the correct directly connected
> interface.
> 
> This is reproducible with the following bash snippet on 5.4-rc2:
> 
>   $ cat route-race
>   #!/bin/bash
> 
>   # Generate 50 million individual route gets to feed as batch input to `ip`
>   function ip-cmds() {
>           route_get='route get 192.168.11.142 from 192.168.180.10 iif vlan180'
>           for ((i = 0; i < 50000000; i++)); do
>                   printf '%s\n' "${route_get}"
>           done
> 
>   }
> 
>   ip-cmds | ip -d -o -batch - | grep -E 'dev \*' | uniq -c
> 
> Example output:
> 
>   $ ./route-race
>         6 unicast 192.168.11.142 from 192.168.180.10 dev * table main
> \    cache iif vlan180
> 
> These routers have multiple routing tables and are ingesting full BGP routing
> tables from multiple ISPs:
> 
>   $ ip route show table all | wc -l
>   3105543
> 
>   $ ip route show table main | wc -l
>   54
> 
> Please let me know what other information I can provide, thanks in advance,

I think it's working as expected. Here is my theory:

If CPU0 is executing both the route get request and forwarding packets
through the directly connected interface, then the following can happen:

<CPU0, t0> - In process context, the per-CPU dst entry cached in the
nexthop is found. It has not yet been dumped to user space.

<Any CPU, t1> - Routes are added / removed, therefore invalidating the
cache by bumping 'net->ipv4.rt_genid'

<CPU0, t2> - In softirq, packet is forwarded through the nexthop. The
cached dst entry is found to be invalid. Therefore, it is replaced by a
newer dst entry. dst_dev_put() is called on old entry which assigns the
blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because
it is not registered.

<CPU0, t3> - After the softirq finishes executing, your route get
request from t0 resumes and the old dst entry is dumped to user space
with an ifindex of 0.
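
To make the ordering concrete, here is a toy model of the four steps in
bash. It is only a sketch of the theory above, not kernel code: the
variable names, the entry names dst1/dst2 and the ifindex value 7 are
made up for illustration, and only 'rt_genid', dst_dev_put() and the
ifindex of 0 come from the description above. It needs bash 4.3 or
newer for namerefs.

```shell
#!/bin/bash
# Toy model of the t0..t3 timeline. Illustrative names only, not kernel code.

BLACKHOLE_IFINDEX=0   # the unregistered blackhole netdev has ifindex 0
VLAN_IFINDEX=7        # hypothetical ifindex of the directly connected interface

rt_genid=0            # stands in for net->ipv4.rt_genid
declare -A dst1=([genid]=0 [ifindex]=$VLAN_IFINDEX)
cached=dst1           # per-CPU cache slot in the nexthop (holds an entry name)

# t0: process context finds the cached entry and keeps a reference to it
lookup_ref=$cached

# t1: a route or address change bumps the generation counter
rt_genid=$((rt_genid + 1))

# t2: softirq forwards a packet, finds the cached entry stale and replaces it
declare -n old=$cached
if [[ ${old[genid]} -ne $rt_genid ]]; then
        declare -A dst2=([genid]=$rt_genid [ifindex]=$VLAN_IFINDEX)
        cached=dst2
        old[ifindex]=$BLACKHOLE_IFINDEX   # effect of dst_dev_put() on the old entry
fi

# t3: process context resumes and dumps the entry it found at t0
declare -n seen=$lookup_ref
if [[ ${seen[ifindex]} -eq 0 ]]; then
        dev='*'
else
        dev="if${seen[ifindex]}"
fi
echo "dev $dev"       # prints: dev *
```

If the t1 step is removed, the entry found at t0 is still valid at t2,
nothing is replaced, and the model prints the real interface instead.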

I tested this on my system using your script to generate the route get
requests. I pinned it to the same CPU forwarding packets through the
nexthop. To constantly invalidate the cache I created another script
that simply adds and removes IP addresses from an interface.
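
A rough sketch of such an invalidation script follows. This is not the
exact script I ran: the function name, the IP override variable, the
interface name and the address in the usage lines are placeholders.
It only relies on 'ip address add' and 'ip address del', and needs
root when run for real.

```shell
#!/bin/bash
# Sketch: add and remove an address in a loop so that cached dst entries
# keep getting invalidated. All names here are placeholders.

IP=${IP:-ip}    # set IP=echo to print the commands instead of running them

churn_addr() {
        local dev=$1 addr=$2 rounds=$3 i
        for ((i = 0; i < rounds; i++)); do
                "$IP" address add "$addr" dev "$dev"
                "$IP" address del "$addr" dev "$dev"
        done
}
```

Assuming the function is sourced into a root shell and CPU 0 is the
one forwarding packets through the nexthop, the setup would be roughly:

  churn_addr dummy0 198.51.100.1/32 1000000 &
  taskset -c 0 ./route-race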

If I stop the packet forwarding or the script that invalidates the
cache, then I don't see any '*' answers to my route get requests.

BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that
with older kernel versions you'll see 'lo' instead of '*'.
