netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Race condition in route lookup
@ 2019-10-09 16:00 Jesse Hathaway
  2019-10-10  8:31 ` Ido Schimmel
  0 siblings, 1 reply; 23+ messages in thread
From: Jesse Hathaway @ 2019-10-09 16:00 UTC (permalink / raw)
  To: netdev

We have been experiencing a route lookup race condition on our internet facing
Linux routers. I have been able to reproduce the issue, but would love more
help in isolating the cause.

Looking up a route found in the main table returns `*` rather than the directly
connected interface about once for every 10-20 million requests. From my
reading of the iproute2 source code an asterisk is indicative of the kernel
returning and interface index of 0 rather than the correct directly connected
interface.

This is reproducible with the following bash snippet on 5.4-rc2:

  $ cat route-race
  #!/bin/bash

  # Generate 50 million individual route gets to feed as batch input to `ip`
  function ip-cmds() {
          route_get='route get 192.168.11.142 from 192.168.180.10 iif vlan180'
          for ((i = 0; i < 50000000; i++)); do
                  printf '%s\n' "${route_get}"
          done

  }

  ip-cmds | ip -d -o -batch - | grep -E 'dev \*' | uniq -c

Example output:

  $ ./route-race
        6 unicast 192.168.11.142 from 192.168.180.10 dev * table main
\    cache iif vlan180

These routers have multiple routing tables and are ingesting full BGP routing
tables from multiple ISPs:

  $ ip route show table all | wc -l
  3105543

  $ ip route show table main | wc -l
  54

Please let me know what other information I can provide, thanks in advance,
Jesse Hathaway

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-09 16:00 Race condition in route lookup Jesse Hathaway
@ 2019-10-10  8:31 ` Ido Schimmel
  2019-10-10  8:46   ` Ido Schimmel
  2019-10-11 14:36   ` Jesse Hathaway
  0 siblings, 2 replies; 23+ messages in thread
From: Ido Schimmel @ 2019-10-10  8:31 UTC (permalink / raw)
  To: Jesse Hathaway; +Cc: netdev

On Wed, Oct 09, 2019 at 11:00:07AM -0500, Jesse Hathaway wrote:
> We have been experiencing a route lookup race condition on our internet facing
> Linux routers. I have been able to reproduce the issue, but would love more
> help in isolating the cause.
> 
> Looking up a route found in the main table returns `*` rather than the directly
> connected interface about once for every 10-20 million requests. From my
> reading of the iproute2 source code an asterisk is indicative of the kernel
> returning and interface index of 0 rather than the correct directly connected
> interface.
> 
> This is reproducible with the following bash snippet on 5.4-rc2:
> 
>   $ cat route-race
>   #!/bin/bash
> 
>   # Generate 50 million individual route gets to feed as batch input to `ip`
>   function ip-cmds() {
>           route_get='route get 192.168.11.142 from 192.168.180.10 iif vlan180'
>           for ((i = 0; i < 50000000; i++)); do
>                   printf '%s\n' "${route_get}"
>           done
> 
>   }
> 
>   ip-cmds | ip -d -o -batch - | grep -E 'dev \*' | uniq -c
> 
> Example output:
> 
>   $ ./route-race
>         6 unicast 192.168.11.142 from 192.168.180.10 dev * table main
> \    cache iif vlan180
> 
> These routers have multiple routing tables and are ingesting full BGP routing
> tables from multiple ISPs:
> 
>   $ ip route show table all | wc -l
>   3105543
> 
>   $ ip route show table main | wc -l
>   54
> 
> Please let me know what other information I can provide, thanks in advance,

I think it's working as expected. Here is my theory:

If CPU0 is executing both the route get request and forwarding packets
through the directly connected interface, then the following can happen:

<CPU0, t0> - In process context, per-CPU dst entry cached in the nexthop
is found. Not yet dumped to user space

<Any CPU, t1> - Routes are added / removed, therefore invalidating the
cache by bumping 'net->ipv4.rt_genid'

<CPU0, t2> - In softirq, packet is forwarded through the nexthop. The
cached dst entry is found to be invalid. Therefore, it is replaced by a
newer dst entry. dst_dev_put() is called on old entry which assigns the
blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because
it is not registered.

<CPU0, t3> - After softirq finished executing, your route get request
from t0 is resumed and the old dst entry is dumped to user space with
ifindex of 0.

I tested this on my system using your script to generate the route get
requests. I pinned it to the same CPU forwarding packets through the
nexthop. To constantly invalidate the cache I created another script
that simply adds and removes IP addresses from an interface.

If I stop the packet forwarding or the script that invalidates the
cache, then I don't see any '*' answers to my route get requests.

BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that
with older kernel versions you'll see 'lo' instead of '*'.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-10  8:31 ` Ido Schimmel
@ 2019-10-10  8:46   ` Ido Schimmel
  2019-10-11 14:36   ` Jesse Hathaway
  1 sibling, 0 replies; 23+ messages in thread
From: Ido Schimmel @ 2019-10-10  8:46 UTC (permalink / raw)
  To: Jesse Hathaway; +Cc: netdev

On Thu, Oct 10, 2019 at 11:31:04AM +0300, Ido Schimmel wrote:
> On Wed, Oct 09, 2019 at 11:00:07AM -0500, Jesse Hathaway wrote:
> > We have been experiencing a route lookup race condition on our internet facing
> > Linux routers. I have been able to reproduce the issue, but would love more
> > help in isolating the cause.
> > 
> > Looking up a route found in the main table returns `*` rather than the directly
> > connected interface about once for every 10-20 million requests. From my
> > reading of the iproute2 source code an asterisk is indicative of the kernel
> > returning and interface index of 0 rather than the correct directly connected
> > interface.
> > 
> > This is reproducible with the following bash snippet on 5.4-rc2:
> > 
> >   $ cat route-race
> >   #!/bin/bash
> > 
> >   # Generate 50 million individual route gets to feed as batch input to `ip`
> >   function ip-cmds() {
> >           route_get='route get 192.168.11.142 from 192.168.180.10 iif vlan180'
> >           for ((i = 0; i < 50000000; i++)); do
> >                   printf '%s\n' "${route_get}"
> >           done
> > 
> >   }
> > 
> >   ip-cmds | ip -d -o -batch - | grep -E 'dev \*' | uniq -c
> > 
> > Example output:
> > 
> >   $ ./route-race
> >         6 unicast 192.168.11.142 from 192.168.180.10 dev * table main
> > \    cache iif vlan180
> > 
> > These routers have multiple routing tables and are ingesting full BGP routing
> > tables from multiple ISPs:
> > 
> >   $ ip route show table all | wc -l
> >   3105543
> > 
> >   $ ip route show table main | wc -l
> >   54
> > 
> > Please let me know what other information I can provide, thanks in advance,
> 
> I think it's working as expected. Here is my theory:
> 
> If CPU0 is executing both the route get request and forwarding packets
> through the directly connected interface, then the following can happen:
> 
> <CPU0, t0> - In process context, per-CPU dst entry cached in the nexthop

Sorry, only output path is per-CPU. See commit d26b3a7c4b3b ("ipv4:
percpu nh_rth_output cache"). I indeed see the issue regardless of the
CPU on which I run the route get request.

> is found. Not yet dumped to user space
> 
> <Any CPU, t1> - Routes are added / removed, therefore invalidating the
> cache by bumping 'net->ipv4.rt_genid'
> 
> <CPU0, t2> - In softirq, packet is forwarded through the nexthop. The
> cached dst entry is found to be invalid. Therefore, it is replaced by a
> newer dst entry. dst_dev_put() is called on old entry which assigns the
> blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because
> it is not registered.
> 
> <CPU0, t3> - After softirq finished executing, your route get request
> from t0 is resumed and the old dst entry is dumped to user space with
> ifindex of 0.
> 
> I tested this on my system using your script to generate the route get
> requests. I pinned it to the same CPU forwarding packets through the
> nexthop. To constantly invalidate the cache I created another script
> that simply adds and removes IP addresses from an interface.
> 
> If I stop the packet forwarding or the script that invalidates the
> cache, then I don't see any '*' answers to my route get requests.
> 
> BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that
> with older kernel versions you'll see 'lo' instead of '*'.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-10  8:31 ` Ido Schimmel
  2019-10-10  8:46   ` Ido Schimmel
@ 2019-10-11 14:36   ` Jesse Hathaway
  2019-10-11 15:42     ` Ido Schimmel
  1 sibling, 1 reply; 23+ messages in thread
From: Jesse Hathaway @ 2019-10-11 14:36 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev

On Thu, Oct 10, 2019 at 3:31 AM Ido Schimmel <idosch@idosch.org> wrote:
> I think it's working as expected. Here is my theory:
>
> If CPU0 is executing both the route get request and forwarding packets
> through the directly connected interface, then the following can happen:
>
> <CPU0, t0> - In process context, per-CPU dst entry cached in the nexthop
> is found. Not yet dumped to user space
>
> <Any CPU, t1> - Routes are added / removed, therefore invalidating the
> cache by bumping 'net->ipv4.rt_genid'
>
> <CPU0, t2> - In softirq, packet is forwarded through the nexthop. The
> cached dst entry is found to be invalid. Therefore, it is replaced by a
> newer dst entry. dst_dev_put() is called on old entry which assigns the
> blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because
> it is not registered.
>
> <CPU0, t3> - After softirq finished executing, your route get request
> from t0 is resumed and the old dst entry is dumped to user space with
> ifindex of 0.
>
> I tested this on my system using your script to generate the route get
> requests. I pinned it to the same CPU forwarding packets through the
> nexthop. To constantly invalidate the cache I created another script
> that simply adds and removes IP addresses from an interface.
>
> If I stop the packet forwarding or the script that invalidates the
> cache, then I don't see any '*' answers to my route get requests.

Thanks for the reply and analysis Ido, I tested with an additional script which
adds and deletes a route in a loop, as you also saw this increased the
frequency of blackhole route replies from the first script.

Questions:

1. We saw this behavior occurring with TCP connections traversing our routers,
though I was able to reproduce it with only local route requests on our router.
Would you expect this same behavior for TCP traffic only in the kernel which
does not go to userspace?

2. These blackhole routes occur even though our main routing table is not
changing, however a separate route table managed by bird on the Linux router is
changing. Is this still expected behavior given that the ip-rules and main
route table used by these route requests are not changing?

3. We were previously rejecting these packets with an iptables rule which sent
an ICMP prohibited message to the sender, this caused TCP connections to break
with a EHOSTUNREACH, should we be silently dropping these packets instead?

4. If we should just be dropping these packets, why does the kernel not drop
them instead of letting them traverse the iptables rules?

> BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that
> with older kernel versions you'll see 'lo' instead of '*'.

Yes indeed! Thanks for solving that mystery as well, our routers are running
5.1, but we upgraded to 5.4-rc2 to determine whether the issue was still
present in the latest kernel.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-11 14:36   ` Jesse Hathaway
@ 2019-10-11 15:42     ` Ido Schimmel
  2019-10-11 16:09       ` Jesse Hathaway
  2019-10-11 17:54       ` Wei Wang
  0 siblings, 2 replies; 23+ messages in thread
From: Ido Schimmel @ 2019-10-11 15:42 UTC (permalink / raw)
  To: Jesse Hathaway, weiwan; +Cc: netdev

On Fri, Oct 11, 2019 at 09:36:51AM -0500, Jesse Hathaway wrote:
> On Thu, Oct 10, 2019 at 3:31 AM Ido Schimmel <idosch@idosch.org> wrote:
> > I think it's working as expected. Here is my theory:
> >
> > If CPU0 is executing both the route get request and forwarding packets
> > through the directly connected interface, then the following can happen:
> >
> > <CPU0, t0> - In process context, per-CPU dst entry cached in the nexthop
> > is found. Not yet dumped to user space
> >
> > <Any CPU, t1> - Routes are added / removed, therefore invalidating the
> > cache by bumping 'net->ipv4.rt_genid'
> >
> > <CPU0, t2> - In softirq, packet is forwarded through the nexthop. The
> > cached dst entry is found to be invalid. Therefore, it is replaced by a
> > newer dst entry. dst_dev_put() is called on old entry which assigns the
> > blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because
> > it is not registered.
> >
> > <CPU0, t3> - After softirq finished executing, your route get request
> > from t0 is resumed and the old dst entry is dumped to user space with
> > ifindex of 0.
> >
> > I tested this on my system using your script to generate the route get
> > requests. I pinned it to the same CPU forwarding packets through the
> > nexthop. To constantly invalidate the cache I created another script
> > that simply adds and removes IP addresses from an interface.
> >
> > If I stop the packet forwarding or the script that invalidates the
> > cache, then I don't see any '*' answers to my route get requests.
> 
> Thanks for the reply and analysis Ido, I tested with an additional script which
> adds and deletes a route in a loop, as you also saw this increased the
> frequency of blackhole route replies from the first script.
> 
> Questions:
> 
> 1. We saw this behavior occurring with TCP connections traversing our routers,
> though I was able to reproduce it with only local route requests on our router.
> Would you expect this same behavior for TCP traffic only in the kernel which
> does not go to userspace?

Yes, the problem is in the input path where received packets need to be
forwarded.

> 
> 2. These blackhole routes occur even though our main routing table is not
> changing, however a separate route table managed by bird on the Linux router is
> changing. Is this still expected behavior given that the ip-rules and main
> route table used by these route requests are not changing?

Yes, there is a per-netns counter that is incremented whenever cached
dst entries need to be invalidated. Since it is per-netns it is
incremented regardless of the routing table to which your insert the
route.

> 
> 3. We were previously rejecting these packets with an iptables rule which sent
> an ICMP prohibited message to the sender, this caused TCP connections to break
> with a EHOSTUNREACH, should we be silently dropping these packets instead?
> 
> 4. If we should just be dropping these packets, why does the kernel not drop
> them instead of letting them traverse the iptables rules?

I actually believe the current behavior is a bug that needs to be fixed.
See below.

> 
> > BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that
> > with older kernel versions you'll see 'lo' instead of '*'.
> 
> Yes indeed! Thanks for solving that mystery as well, our routers are running
> 5.1, but we upgraded to 5.4-rc2 to determine whether the issue was still
> present in the latest kernel.

Do you remember when you started seeing this behavior? I think it
started in 4.13 with commit ffe95ecf3a2e ("Merge branch
'net-remove-dst-garbage-collector-logic'").

Let me add Wei to see if/how this can be fixed.

Wei, in case you don't have the original mail with the description of
the problem, it can be found here [1].

I believe that the issue Jesse is experiencing is the following:

<CPU A, t0> - Received packet A is forwarded and cached dst entry is
taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set()

<t1> - Given Jesse has busy routers ("ingesting full BGP routing tables
from multiple ISPs"), route is added / deleted and rt_cache_flush() is
called

<CPU B, t2> - Received packet B tries to use the same cached dst entry
from t0, but rt_cache_valid() is no longer true and it is replaced in
rt_cache_route() by the newer one. This calls dst_dev_put() on the
original dst entry which assigns the blackhole netdev to 'dst->dev'

<CPU A, t3> - dst_input(skb) is called on packet A and it is dropped due
to 'dst->dev' being the blackhole netdev

The following patch "fixes" the problem for me:

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 42221a12bdda..1c67bdb80fd5 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1482,7 +1482,6 @@ static bool rt_cache_route(struct fib_nh_common *nhc, struct rtable *rt)
        prev = cmpxchg(p, orig, rt);
        if (prev == orig) {
                if (orig) {
-                       dst_dev_put(&orig->dst);
                        dst_release(&orig->dst);
                }
        } else {

But if this dst entry is cached in some inactive socket and the netdev
on which it took a reference needs to be unregistered, then we can
potentially wait forever. No?

I'm thinking that it can be fixed by making 'nhc_rth_input' per-CPU, in
a similar fashion to what Eric did in commit d26b3a7c4b3b ("ipv4: percpu
nh_rth_output cache").

Two questions:

1. Do you agree with the above analysis?
2. Do you have a simpler/better solution in mind?

Thanks

[1] https://lore.kernel.org/netdev/CANSNSoVM1Uo106xfJtGpTyXNed8kOL4JiXqf3A1eZHBa7z3=yg@mail.gmail.com/T/#medece9445d617372b4842d44525ef0d3ba1ea083

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-11 15:42     ` Ido Schimmel
@ 2019-10-11 16:09       ` Jesse Hathaway
  2019-10-11 17:54       ` Wei Wang
  1 sibling, 0 replies; 23+ messages in thread
From: Jesse Hathaway @ 2019-10-11 16:09 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: weiwan, netdev

On Fri, Oct 11, 2019 at 10:42 AM Ido Schimmel <idosch@idosch.org> wrote:
> Do you remember when you started seeing this behavior? I think it
> started in 4.13 with commit ffe95ecf3a2e ("Merge branch
> 'net-remove-dst-garbage-collector-logic'").

Unfortunately, our data on when the problem started is a bit fuzzy, but we
definitely started experiencing this issue after upgrading our routers to
4.19.18. We were previously running 4.9 on our routers and we don't believe we
saw the issue on 4.9, but we also do not have definitive data to confirm.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-11 15:42     ` Ido Schimmel
  2019-10-11 16:09       ` Jesse Hathaway
@ 2019-10-11 17:54       ` Wei Wang
  2019-10-11 18:17         ` Ido Schimmel
                           ` (2 more replies)
  1 sibling, 3 replies; 23+ messages in thread
From: Wei Wang @ 2019-10-11 17:54 UTC (permalink / raw)
  To: Ido Schimmel, Martin KaFai Lau
  Cc: Jesse Hathaway, Linux Kernel Network Developers

On Fri, Oct 11, 2019 at 8:42 AM Ido Schimmel <idosch@idosch.org> wrote:
>
> On Fri, Oct 11, 2019 at 09:36:51AM -0500, Jesse Hathaway wrote:
> > On Thu, Oct 10, 2019 at 3:31 AM Ido Schimmel <idosch@idosch.org> wrote:
> > > I think it's working as expected. Here is my theory:
> > >
> > > If CPU0 is executing both the route get request and forwarding packets
> > > through the directly connected interface, then the following can happen:
> > >
> > > <CPU0, t0> - In process context, per-CPU dst entry cached in the nexthop
> > > is found. Not yet dumped to user space
> > >
> > > <Any CPU, t1> - Routes are added / removed, therefore invalidating the
> > > cache by bumping 'net->ipv4.rt_genid'
> > >
> > > <CPU0, t2> - In softirq, packet is forwarded through the nexthop. The
> > > cached dst entry is found to be invalid. Therefore, it is replaced by a
> > > newer dst entry. dst_dev_put() is called on old entry which assigns the
> > > blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because
> > > it is not registered.
> > >
> > > <CPU0, t3> - After softirq finished executing, your route get request
> > > from t0 is resumed and the old dst entry is dumped to user space with
> > > ifindex of 0.
> > >
> > > I tested this on my system using your script to generate the route get
> > > requests. I pinned it to the same CPU forwarding packets through the
> > > nexthop. To constantly invalidate the cache I created another script
> > > that simply adds and removes IP addresses from an interface.
> > >
> > > If I stop the packet forwarding or the script that invalidates the
> > > cache, then I don't see any '*' answers to my route get requests.
> >
> > Thanks for the reply and analysis Ido, I tested with an additional script which
> > adds and deletes a route in a loop, as you also saw this increased the
> > frequency of blackhole route replies from the first script.
> >
> > Questions:
> >
> > 1. We saw this behavior occurring with TCP connections traversing our routers,
> > though I was able to reproduce it with only local route requests on our router.
> > Would you expect this same behavior for TCP traffic only in the kernel which
> > does not go to userspace?
>
> Yes, the problem is in the input path where received packets need to be
> forwarded.
>
> >
> > 2. These blackhole routes occur even though our main routing table is not
> > changing, however a separate route table managed by bird on the Linux router is
> > changing. Is this still expected behavior given that the ip-rules and main
> > route table used by these route requests are not changing?
>
> Yes, there is a per-netns counter that is incremented whenever cached
> dst entries need to be invalidated. Since it is per-netns it is
> incremented regardless of the routing table to which your insert the
> route.
>
> >
> > 3. We were previously rejecting these packets with an iptables rule which sent
> > an ICMP prohibited message to the sender, this caused TCP connections to break
> > with a EHOSTUNREACH, should we be silently dropping these packets instead?
> >
> > 4. If we should just be dropping these packets, why does the kernel not drop
> > them instead of letting them traverse the iptables rules?
>
> I actually believe the current behavior is a bug that needs to be fixed.
> See below.
>
> >
> > > BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that
> > > with older kernel versions you'll see 'lo' instead of '*'.
> >
> > Yes indeed! Thanks for solving that mystery as well, our routers are running
> > 5.1, but we upgraded to 5.4-rc2 to determine whether the issue was still
> > present in the latest kernel.
>
> Do you remember when you started seeing this behavior? I think it
> started in 4.13 with commit ffe95ecf3a2e ("Merge branch
> 'net-remove-dst-garbage-collector-logic'").
>
> Let me add Wei to see if/how this can be fixed.
>
> Wei, in case you don't have the original mail with the description of
> the problem, it can be found here [1].
>
> I believe that the issue Jesse is experiencing is the following:
>
> <CPU A, t0> - Received packet A is forwarded and cached dst entry is
> taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set()
>
> <t1> - Given Jesse has busy routers ("ingesting full BGP routing tables
> from multiple ISPs"), route is added / deleted and rt_cache_flush() is
> called
>
> <CPU B, t2> - Received packet B tries to use the same cached dst entry
> from t0, but rt_cache_valid() is no longer true and it is replaced in
> rt_cache_route() by the newer one. This calls dst_dev_put() on the
> original dst entry which assigns the blackhole netdev to 'dst->dev'
>
> <CPU A, t3> - dst_input(skb) is called on packet A and it is dropped due
> to 'dst->dev' being the blackhole netdev
>
> The following patch "fixes" the problem for me:
>
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 42221a12bdda..1c67bdb80fd5 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1482,7 +1482,6 @@ static bool rt_cache_route(struct fib_nh_common *nhc, struct rtable *rt)
>         prev = cmpxchg(p, orig, rt);
>         if (prev == orig) {
>                 if (orig) {
> -                       dst_dev_put(&orig->dst);
>                         dst_release(&orig->dst);
>                 }
>         } else {
>
> But if this dst entry is cached in some inactive socket and the netdev
> on which it took a reference needs to be unregistered, then we can
> potentially wait forever. No?
>
Yes. That's exactly the reason we need to free the dev here.
Otherwise as you described, we will see "unregister_netdevice: waiting
for xxx to become free. Usage count = x" flushing the screen... Not
fun...


> I'm thinking that it can be fixed by making 'nhc_rth_input' per-CPU, in
> a similar fashion to what Eric did in commit d26b3a7c4b3b ("ipv4: percpu
> nh_rth_output cache").
>
Hmm... Yes... I would think a per-CPU input cache should work for the
case above.
Another idea is: instead of calling dst_dev_put() in rt_cache_route()
to switch out the dev, we call, rt_add_uncached_list() to add this
obsolete dst cache to the uncached list. And if the device gets
unregistered, rt_flush_dev() takes care of all dst entries in the
uncached list. I think that would work too.

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index dc1f510a7c81..ee618d4234ce 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1482,7 +1482,7 @@ static bool rt_cache_route(struct fib_nh_common
*nhc, struct rtable *rt)
        prev = cmpxchg(p, orig, rt);
        if (prev == orig) {
                if (orig) {
-                       dst_dev_put(&orig->dst);
+                       rt_add_uncached_list(orig);
                        dst_release(&orig->dst);
                }
        } else {

+ Martin for his idea and input.

> Two questions:
>
> 1. Do you agree with the above analysis?
> 2. Do you have a simpler/better solution in mind?
>
> Thanks
>
> [1] https://lore.kernel.org/netdev/CANSNSoVM1Uo106xfJtGpTyXNed8kOL4JiXqf3A1eZHBa7z3=yg@mail.gmail.com/T/#medece9445d617372b4842d44525ef0d3ba1ea083

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-11 17:54       ` Wei Wang
@ 2019-10-11 18:17         ` Ido Schimmel
  2019-10-11 18:25           ` Ido Schimmel
  2019-10-12  6:56         ` Martin Lau
  2019-10-15 14:29         ` Jesse Hathaway
  2 siblings, 1 reply; 23+ messages in thread
From: Ido Schimmel @ 2019-10-11 18:17 UTC (permalink / raw)
  To: Wei Wang
  Cc: Martin KaFai Lau, Jesse Hathaway, Linux Kernel Network Developers

On Fri, Oct 11, 2019 at 10:54:13AM -0700, Wei Wang wrote:
> On Fri, Oct 11, 2019 at 8:42 AM Ido Schimmel <idosch@idosch.org> wrote:
> >
> > On Fri, Oct 11, 2019 at 09:36:51AM -0500, Jesse Hathaway wrote:
> > > On Thu, Oct 10, 2019 at 3:31 AM Ido Schimmel <idosch@idosch.org> wrote:
> > > > I think it's working as expected. Here is my theory:
> > > >
> > > > If CPU0 is executing both the route get request and forwarding packets
> > > > through the directly connected interface, then the following can happen:
> > > >
> > > > <CPU0, t0> - In process context, per-CPU dst entry cached in the nexthop
> > > > is found. Not yet dumped to user space
> > > >
> > > > <Any CPU, t1> - Routes are added / removed, therefore invalidating the
> > > > cache by bumping 'net->ipv4.rt_genid'
> > > >
> > > > <CPU0, t2> - In softirq, packet is forwarded through the nexthop. The
> > > > cached dst entry is found to be invalid. Therefore, it is replaced by a
> > > > newer dst entry. dst_dev_put() is called on old entry which assigns the
> > > > blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because
> > > > it is not registered.
> > > >
> > > > <CPU0, t3> - After softirq finished executing, your route get request
> > > > from t0 is resumed and the old dst entry is dumped to user space with
> > > > ifindex of 0.
> > > >
> > > > I tested this on my system using your script to generate the route get
> > > > requests. I pinned it to the same CPU forwarding packets through the
> > > > nexthop. To constantly invalidate the cache I created another script
> > > > that simply adds and removes IP addresses from an interface.
> > > >
> > > > If I stop the packet forwarding or the script that invalidates the
> > > > cache, then I don't see any '*' answers to my route get requests.
> > >
> > > Thanks for the reply and analysis Ido, I tested with an additional script which
> > > adds and deletes a route in a loop, as you also saw this increased the
> > > frequency of blackhole route replies from the first script.
> > >
> > > Questions:
> > >
> > > 1. We saw this behavior occurring with TCP connections traversing our routers,
> > > though I was able to reproduce it with only local route requests on our router.
> > > Would you expect this same behavior for TCP traffic only in the kernel which
> > > does not go to userspace?
> >
> > Yes, the problem is in the input path where received packets need to be
> > forwarded.
> >
> > >
> > > 2. These blackhole routes occur even though our main routing table is not
> > > changing, however a separate route table managed by bird on the Linux router is
> > > changing. Is this still expected behavior given that the ip-rules and main
> > > route table used by these route requests are not changing?
> >
> > Yes, there is a per-netns counter that is incremented whenever cached
> > dst entries need to be invalidated. Since it is per-netns it is
> > incremented regardless of the routing table to which your insert the
> > route.
> >
> > >
> > > 3. We were previously rejecting these packets with an iptables rule which sent
> > > an ICMP prohibited message to the sender, this caused TCP connections to break
> > > with a EHOSTUNREACH, should we be silently dropping these packets instead?
> > >
> > > 4. If we should just be dropping these packets, why does the kernel not drop
> > > them instead of letting them traverse the iptables rules?
> >
> > I actually believe the current behavior is a bug that needs to be fixed.
> > See below.
> >
> > >
> > > > BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that
> > > > with older kernel versions you'll see 'lo' instead of '*'.
> > >
> > > Yes indeed! Thanks for solving that mystery as well, our routers are running
> > > 5.1, but we upgraded to 5.4-rc2 to determine whether the issue was still
> > > present in the latest kernel.
> >
> > Do you remember when you started seeing this behavior? I think it
> > started in 4.13 with commit ffe95ecf3a2e ("Merge branch
> > 'net-remove-dst-garbage-collector-logic'").
> >
> > Let me add Wei to see if/how this can be fixed.
> >
> > Wei, in case you don't have the original mail with the description of
> > the problem, it can be found here [1].
> >
> > I believe that the issue Jesse is experiencing is the following:
> >
> > <CPU A, t0> - Received packet A is forwarded and cached dst entry is
> > taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set()
> >
> > <t1> - Given Jesse has busy routers ("ingesting full BGP routing tables
> > from multiple ISPs"), route is added / deleted and rt_cache_flush() is
> > called
> >
> > <CPU B, t2> - Received packet B tries to use the same cached dst entry
> > from t0, but rt_cache_valid() is no longer true and it is replaced in
> > rt_cache_route() by the newer one. This calls dst_dev_put() on the
> > original dst entry which assigns the blackhole netdev to 'dst->dev'
> >
> > <CPU A, t3> - dst_input(skb) is called on packet A and it is dropped due
> > to 'dst->dev' being the blackhole netdev
> >
> > The following patch "fixes" the problem for me:
> >
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index 42221a12bdda..1c67bdb80fd5 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -1482,7 +1482,6 @@ static bool rt_cache_route(struct fib_nh_common *nhc, struct rtable *rt)
> >         prev = cmpxchg(p, orig, rt);
> >         if (prev == orig) {
> >                 if (orig) {
> > -                       dst_dev_put(&orig->dst);
> >                         dst_release(&orig->dst);
> >                 }
> >         } else {
> >
> > But if this dst entry is cached in some inactive socket and the netdev
> > on which it took a reference needs to be unregistered, then we can
> > potentially wait forever. No?
> >
> Yes. That's exactly the reason we need to free the dev here.
> Otherwise as you described, we will see "unregister_netdevice: waiting
> for xxx to become free. Usage count = x" flushing the screen... Not
> fun...
> 
> 
> > I'm thinking that it can be fixed by making 'nhc_rth_input' per-CPU, in
> > a similar fashion to what Eric did in commit d26b3a7c4b3b ("ipv4: percpu
> > nh_rth_output cache").
> >
> Hmm... Yes... I would think a per-CPU input cache should work for the
> case above.
> Another idea is: instead of calling dst_dev_put() in rt_cache_route()
> to switch out the dev, we call, rt_add_uncached_list() to add this
> obsolete dst cache to the uncached list. And if the device gets
> unregistered, rt_flush_dev() takes care of all dst entries in the
> uncached list. I think that would work too.

It crossed my mind as well, but if the device is not unregistered, then
I believe we can eventually consume all the memory and kill the machine?

> 
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index dc1f510a7c81..ee618d4234ce 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1482,7 +1482,7 @@ static bool rt_cache_route(struct fib_nh_common
> *nhc, struct rtable *rt)
>         prev = cmpxchg(p, orig, rt);
>         if (prev == orig) {
>                 if (orig) {
> -                       dst_dev_put(&orig->dst);
> +                       rt_add_uncached_list(orig);
>                         dst_release(&orig->dst);
>                 }
>         } else {
> 
> + Martin for his idea and input.
> 
> > Two questions:
> >
> > 1. Do you agree with the above analysis?
> > 2. Do you have a simpler/better solution in mind?
> >
> > Thanks
> >
> > [1] https://lore.kernel.org/netdev/CANSNSoVM1Uo106xfJtGpTyXNed8kOL4JiXqf3A1eZHBa7z3=yg@mail.gmail.com/T/#medece9445d617372b4842d44525ef0d3ba1ea083

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-11 18:17         ` Ido Schimmel
@ 2019-10-11 18:25           ` Ido Schimmel
  2019-10-11 18:47             ` Wei Wang
  0 siblings, 1 reply; 23+ messages in thread
From: Ido Schimmel @ 2019-10-11 18:25 UTC (permalink / raw)
  To: Wei Wang
  Cc: Martin KaFai Lau, Jesse Hathaway, Linux Kernel Network Developers

On Fri, Oct 11, 2019 at 09:17:42PM +0300, Ido Schimmel wrote:
> On Fri, Oct 11, 2019 at 10:54:13AM -0700, Wei Wang wrote:
> > On Fri, Oct 11, 2019 at 8:42 AM Ido Schimmel <idosch@idosch.org> wrote:
> > >
> > > On Fri, Oct 11, 2019 at 09:36:51AM -0500, Jesse Hathaway wrote:
> > > > On Thu, Oct 10, 2019 at 3:31 AM Ido Schimmel <idosch@idosch.org> wrote:
> > > > > I think it's working as expected. Here is my theory:
> > > > >
> > > > > If CPU0 is executing both the route get request and forwarding packets
> > > > > through the directly connected interface, then the following can happen:
> > > > >
> > > > > <CPU0, t0> - In process context, per-CPU dst entry cached in the nexthop
> > > > > is found. Not yet dumped to user space
> > > > >
> > > > > <Any CPU, t1> - Routes are added / removed, therefore invalidating the
> > > > > cache by bumping 'net->ipv4.rt_genid'
> > > > >
> > > > > <CPU0, t2> - In softirq, packet is forwarded through the nexthop. The
> > > > > cached dst entry is found to be invalid. Therefore, it is replaced by a
> > > > > newer dst entry. dst_dev_put() is called on old entry which assigns the
> > > > > blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because
> > > > > it is not registered.
> > > > >
> > > > > <CPU0, t3> - After softirq finished executing, your route get request
> > > > > from t0 is resumed and the old dst entry is dumped to user space with
> > > > > ifindex of 0.
> > > > >
> > > > > I tested this on my system using your script to generate the route get
> > > > > requests. I pinned it to the same CPU forwarding packets through the
> > > > > nexthop. To constantly invalidate the cache I created another script
> > > > > that simply adds and removes IP addresses from an interface.
> > > > >
> > > > > If I stop the packet forwarding or the script that invalidates the
> > > > > cache, then I don't see any '*' answers to my route get requests.
> > > >
> > > > Thanks for the reply and analysis Ido, I tested with an additional script which
> > > > adds and deletes a route in a loop, as you also saw this increased the
> > > > frequency of blackhole route replies from the first script.
> > > >
> > > > Questions:
> > > >
> > > > 1. We saw this behavior occurring with TCP connections traversing our routers,
> > > > though I was able to reproduce it with only local route requests on our router.
> > > > Would you expect this same behavior for TCP traffic only in the kernel which
> > > > does not go to userspace?
> > >
> > > Yes, the problem is in the input path where received packets need to be
> > > forwarded.
> > >
> > > >
> > > > 2. These blackhole routes occur even though our main routing table is not
> > > > changing, however a separate route table managed by bird on the Linux router is
> > > > changing. Is this still expected behavior given that the ip-rules and main
> > > > route table used by these route requests are not changing?
> > >
> > > Yes, there is a per-netns counter that is incremented whenever cached
> > > dst entries need to be invalidated. Since it is per-netns it is
> > > incremented regardless of the routing table to which your insert the
> > > route.
> > >
> > > >
> > > > 3. We were previously rejecting these packets with an iptables rule which sent
> > > > an ICMP prohibited message to the sender, this caused TCP connections to break
> > > > with a EHOSTUNREACH, should we be silently dropping these packets instead?
> > > >
> > > > 4. If we should just be dropping these packets, why does the kernel not drop
> > > > them instead of letting them traverse the iptables rules?
> > >
> > > I actually believe the current behavior is a bug that needs to be fixed.
> > > See below.
> > >
> > > >
> > > > > BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that
> > > > > with older kernel versions you'll see 'lo' instead of '*'.
> > > >
> > > > Yes indeed! Thanks for solving that mystery as well, our routers are running
> > > > 5.1, but we upgraded to 5.4-rc2 to determine whether the issue was still
> > > > present in the latest kernel.
> > >
> > > Do you remember when you started seeing this behavior? I think it
> > > started in 4.13 with commit ffe95ecf3a2e ("Merge branch
> > > 'net-remove-dst-garbage-collector-logic'").
> > >
> > > Let me add Wei to see if/how this can be fixed.
> > >
> > > Wei, in case you don't have the original mail with the description of
> > > the problem, it can be found here [1].
> > >
> > > I believe that the issue Jesse is experiencing is the following:
> > >
> > > <CPU A, t0> - Received packet A is forwarded and cached dst entry is
> > > taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set()
> > >
> > > <t1> - Given Jesse has busy routers ("ingesting full BGP routing tables
> > > from multiple ISPs"), route is added / deleted and rt_cache_flush() is
> > > called
> > >
> > > <CPU B, t2> - Received packet B tries to use the same cached dst entry
> > > from t0, but rt_cache_valid() is no longer true and it is replaced in
> > > rt_cache_route() by the newer one. This calls dst_dev_put() on the
> > > original dst entry which assigns the blackhole netdev to 'dst->dev'
> > >
> > > <CPU A, t3> - dst_input(skb) is called on packet A and it is dropped due
> > > to 'dst->dev' being the blackhole netdev
> > >
> > > The following patch "fixes" the problem for me:
> > >
> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > index 42221a12bdda..1c67bdb80fd5 100644
> > > --- a/net/ipv4/route.c
> > > +++ b/net/ipv4/route.c
> > > @@ -1482,7 +1482,6 @@ static bool rt_cache_route(struct fib_nh_common *nhc, struct rtable *rt)
> > >         prev = cmpxchg(p, orig, rt);
> > >         if (prev == orig) {
> > >                 if (orig) {
> > > -                       dst_dev_put(&orig->dst);
> > >                         dst_release(&orig->dst);
> > >                 }
> > >         } else {
> > >
> > > But if this dst entry is cached in some inactive socket and the netdev
> > > on which it took a reference needs to be unregistered, then we can
> > > potentially wait forever. No?
> > >
> > Yes. That's exactly the reason we need to free the dev here.
> > Otherwise as you described, we will see "unregister_netdevice: waiting
> > for xxx to become free. Usage count = x" flushing the screen... Not
> > fun...
> > 
> > 
> > > I'm thinking that it can be fixed by making 'nhc_rth_input' per-CPU, in
> > > a similar fashion to what Eric did in commit d26b3a7c4b3b ("ipv4: percpu
> > > nh_rth_output cache").
> > >
> > Hmm... Yes... I would think a per-CPU input cache should work for the
> > case above.
> > Another idea is: instead of calling dst_dev_put() in rt_cache_route()
> > to switch out the dev, we call, rt_add_uncached_list() to add this
> > obsolete dst cache to the uncached list. And if the device gets
> > unregistered, rt_flush_dev() takes care of all dst entries in the
> > uncached list. I think that would work too.
> 
> It crossed my mind as well, but if the device is not unregistered, then
> I believe we can eventually consume all the memory and kill the machine?

Ha, sorry no. I think this will actually work.

> 
> > 
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index dc1f510a7c81..ee618d4234ce 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -1482,7 +1482,7 @@ static bool rt_cache_route(struct fib_nh_common
> > *nhc, struct rtable *rt)
> >         prev = cmpxchg(p, orig, rt);
> >         if (prev == orig) {
> >                 if (orig) {
> > -                       dst_dev_put(&orig->dst);
> > +                       rt_add_uncached_list(orig);
> >                         dst_release(&orig->dst);
> >                 }
> >         } else {
> > 
> > + Martin for his idea and input.
> > 
> > > Two questions:
> > >
> > > 1. Do you agree with the above analysis?
> > > 2. Do you have a simpler/better solution in mind?
> > >
> > > Thanks
> > >
> > > [1] https://lore.kernel.org/netdev/CANSNSoVM1Uo106xfJtGpTyXNed8kOL4JiXqf3A1eZHBa7z3=yg@mail.gmail.com/T/#medece9445d617372b4842d44525ef0d3ba1ea083

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-11 18:25           ` Ido Schimmel
@ 2019-10-11 18:47             ` Wei Wang
  2019-10-11 18:52               ` Ido Schimmel
  0 siblings, 1 reply; 23+ messages in thread
From: Wei Wang @ 2019-10-11 18:47 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Martin KaFai Lau, Jesse Hathaway, Linux Kernel Network Developers

On Fri, Oct 11, 2019 at 11:25 AM Ido Schimmel <idosch@idosch.org> wrote:
>
> On Fri, Oct 11, 2019 at 09:17:42PM +0300, Ido Schimmel wrote:
> > On Fri, Oct 11, 2019 at 10:54:13AM -0700, Wei Wang wrote:
> > > On Fri, Oct 11, 2019 at 8:42 AM Ido Schimmel <idosch@idosch.org> wrote:
> > > >
> > > > On Fri, Oct 11, 2019 at 09:36:51AM -0500, Jesse Hathaway wrote:
> > > > > On Thu, Oct 10, 2019 at 3:31 AM Ido Schimmel <idosch@idosch.org> wrote:
> > > > > > I think it's working as expected. Here is my theory:
> > > > > >
> > > > > > If CPU0 is executing both the route get request and forwarding packets
> > > > > > through the directly connected interface, then the following can happen:
> > > > > >
> > > > > > <CPU0, t0> - In process context, per-CPU dst entry cached in the nexthop
> > > > > > is found. Not yet dumped to user space
> > > > > >
> > > > > > <Any CPU, t1> - Routes are added / removed, therefore invalidating the
> > > > > > cache by bumping 'net->ipv4.rt_genid'
> > > > > >
> > > > > > <CPU0, t2> - In softirq, packet is forwarded through the nexthop. The
> > > > > > cached dst entry is found to be invalid. Therefore, it is replaced by a
> > > > > > newer dst entry. dst_dev_put() is called on old entry which assigns the
> > > > > > blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because
> > > > > > it is not registered.
> > > > > >
> > > > > > <CPU0, t3> - After softirq finished executing, your route get request
> > > > > > from t0 is resumed and the old dst entry is dumped to user space with
> > > > > > ifindex of 0.
> > > > > >
> > > > > > I tested this on my system using your script to generate the route get
> > > > > > requests. I pinned it to the same CPU forwarding packets through the
> > > > > > nexthop. To constantly invalidate the cache I created another script
> > > > > > that simply adds and removes IP addresses from an interface.
> > > > > >
> > > > > > If I stop the packet forwarding or the script that invalidates the
> > > > > > cache, then I don't see any '*' answers to my route get requests.
> > > > >
> > > > > Thanks for the reply and analysis Ido, I tested with an additional script which
> > > > > adds and deletes a route in a loop, as you also saw this increased the
> > > > > frequency of blackhole route replies from the first script.
> > > > >
> > > > > Questions:
> > > > >
> > > > > 1. We saw this behavior occurring with TCP connections traversing our routers,
> > > > > though I was able to reproduce it with only local route requests on our router.
> > > > > Would you expect this same behavior for TCP traffic only in the kernel which
> > > > > does not go to userspace?
> > > >
> > > > Yes, the problem is in the input path where received packets need to be
> > > > forwarded.
> > > >
> > > > >
> > > > > 2. These blackhole routes occur even though our main routing table is not
> > > > > changing, however a separate route table managed by bird on the Linux router is
> > > > > changing. Is this still expected behavior given that the ip-rules and main
> > > > > route table used by these route requests are not changing?
> > > >
> > > > Yes, there is a per-netns counter that is incremented whenever cached
> > > > dst entries need to be invalidated. Since it is per-netns it is
> > > > incremented regardless of the routing table to which your insert the
> > > > route.
> > > >
> > > > >
> > > > > 3. We were previously rejecting these packets with an iptables rule which sent
> > > > > an ICMP prohibited message to the sender, this caused TCP connections to break
> > > > > with a EHOSTUNREACH, should we be silently dropping these packets instead?
> > > > >
> > > > > 4. If we should just be dropping these packets, why does the kernel not drop
> > > > > them instead of letting them traverse the iptables rules?
> > > >
> > > > I actually believe the current behavior is a bug that needs to be fixed.
> > > > See below.
> > > >
> > > > >
> > > > > > BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that
> > > > > > with older kernel versions you'll see 'lo' instead of '*'.
> > > > >
> > > > > Yes indeed! Thanks for solving that mystery as well, our routers are running
> > > > > 5.1, but we upgraded to 5.4-rc2 to determine whether the issue was still
> > > > > present in the latest kernel.
> > > >
> > > > Do you remember when you started seeing this behavior? I think it
> > > > started in 4.13 with commit ffe95ecf3a2e ("Merge branch
> > > > 'net-remove-dst-garbage-collector-logic'").
> > > >
> > > > Let me add Wei to see if/how this can be fixed.
> > > >
> > > > Wei, in case you don't have the original mail with the description of
> > > > the problem, it can be found here [1].
> > > >
> > > > I believe that the issue Jesse is experiencing is the following:
> > > >
> > > > <CPU A, t0> - Received packet A is forwarded and cached dst entry is
> > > > taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set()
> > > >
> > > > <t1> - Given Jesse has busy routers ("ingesting full BGP routing tables
> > > > from multiple ISPs"), route is added / deleted and rt_cache_flush() is
> > > > called
> > > >
> > > > <CPU B, t2> - Received packet B tries to use the same cached dst entry
> > > > from t0, but rt_cache_valid() is no longer true and it is replaced in
> > > > rt_cache_route() by the newer one. This calls dst_dev_put() on the
> > > > original dst entry which assigns the blackhole netdev to 'dst->dev'
> > > >
> > > > <CPU A, t3> - dst_input(skb) is called on packet A and it is dropped due
> > > > to 'dst->dev' being the blackhole netdev
> > > >
> > > > The following patch "fixes" the problem for me:
> > > >
> > > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > > index 42221a12bdda..1c67bdb80fd5 100644
> > > > --- a/net/ipv4/route.c
> > > > +++ b/net/ipv4/route.c
> > > > @@ -1482,7 +1482,6 @@ static bool rt_cache_route(struct fib_nh_common *nhc, struct rtable *rt)
> > > >         prev = cmpxchg(p, orig, rt);
> > > >         if (prev == orig) {
> > > >                 if (orig) {
> > > > -                       dst_dev_put(&orig->dst);
> > > >                         dst_release(&orig->dst);
> > > >                 }
> > > >         } else {
> > > >
> > > > But if this dst entry is cached in some inactive socket and the netdev
> > > > on which it took a reference needs to be unregistered, then we can
> > > > potentially wait forever. No?
> > > >
> > > Yes. That's exactly the reason we need to free the dev here.
> > > Otherwise as you described, we will see "unregister_netdevice: waiting
> > > for xxx to become free. Usage count = x" flushing the screen... Not
> > > fun...
> > >
> > >
> > > > I'm thinking that it can be fixed by making 'nhc_rth_input' per-CPU, in
> > > > a similar fashion to what Eric did in commit d26b3a7c4b3b ("ipv4: percpu
> > > > nh_rth_output cache").
> > > >
> > > Hmm... Yes... I would think a per-CPU input cache should work for the
> > > case above.
> > > Another idea is: instead of calling dst_dev_put() in rt_cache_route()
> > > to switch out the dev, we call, rt_add_uncached_list() to add this
> > > obsolete dst cache to the uncached list. And if the device gets
> > > unregistered, rt_flush_dev() takes care of all dst entries in the
> > > uncached list. I think that would work too.
> >
> > It crossed my mind as well, but if the device is not unregistered, then
> > I believe we can eventually consume all the memory and kill the machine?
>
> Ha, sorry no. I think this will actually work.
>
When every user releases the dst, it will be removed from the uncached
list. But potentially, if any user of the dst is inactive, the dst
will be in the cached list for a while...

> >
> > >
> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > index dc1f510a7c81..ee618d4234ce 100644
> > > --- a/net/ipv4/route.c
> > > +++ b/net/ipv4/route.c
> > > @@ -1482,7 +1482,7 @@ static bool rt_cache_route(struct fib_nh_common
> > > *nhc, struct rtable *rt)
> > >         prev = cmpxchg(p, orig, rt);
> > >         if (prev == orig) {
> > >                 if (orig) {
> > > -                       dst_dev_put(&orig->dst);
> > > +                       rt_add_uncached_list(orig);
> > >                         dst_release(&orig->dst);
> > >                 }
> > >         } else {
> > >
> > > + Martin for his idea and input.
> > >
> > > > Two questions:
> > > >
> > > > 1. Do you agree with the above analysis?
> > > > 2. Do you have a simpler/better solution in mind?
> > > >
> > > > Thanks
> > > >
> > > > [1] https://lore.kernel.org/netdev/CANSNSoVM1Uo106xfJtGpTyXNed8kOL4JiXqf3A1eZHBa7z3=yg@mail.gmail.com/T/#medece9445d617372b4842d44525ef0d3ba1ea083

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-11 18:47             ` Wei Wang
@ 2019-10-11 18:52               ` Ido Schimmel
  2019-10-11 21:01                 ` Jesse Hathaway
  2019-10-11 21:27                 ` David Ahern
  0 siblings, 2 replies; 23+ messages in thread
From: Ido Schimmel @ 2019-10-11 18:52 UTC (permalink / raw)
  To: Wei Wang, jesse; +Cc: Martin KaFai Lau, Linux Kernel Network Developers

On Fri, Oct 11, 2019 at 11:47:12AM -0700, Wei Wang wrote:
> On Fri, Oct 11, 2019 at 11:25 AM Ido Schimmel <idosch@idosch.org> wrote:
> >
> > On Fri, Oct 11, 2019 at 09:17:42PM +0300, Ido Schimmel wrote:
> > > On Fri, Oct 11, 2019 at 10:54:13AM -0700, Wei Wang wrote:
> > > > On Fri, Oct 11, 2019 at 8:42 AM Ido Schimmel <idosch@idosch.org> wrote:
> > > > >
> > > > > On Fri, Oct 11, 2019 at 09:36:51AM -0500, Jesse Hathaway wrote:
> > > > > > On Thu, Oct 10, 2019 at 3:31 AM Ido Schimmel <idosch@idosch.org> wrote:
> > > > > > > I think it's working as expected. Here is my theory:
> > > > > > >
> > > > > > > If CPU0 is executing both the route get request and forwarding packets
> > > > > > > through the directly connected interface, then the following can happen:
> > > > > > >
> > > > > > > <CPU0, t0> - In process context, per-CPU dst entry cached in the nexthop
> > > > > > > is found. Not yet dumped to user space
> > > > > > >
> > > > > > > <Any CPU, t1> - Routes are added / removed, therefore invalidating the
> > > > > > > cache by bumping 'net->ipv4.rt_genid'
> > > > > > >
> > > > > > > <CPU0, t2> - In softirq, packet is forwarded through the nexthop. The
> > > > > > > cached dst entry is found to be invalid. Therefore, it is replaced by a
> > > > > > > newer dst entry. dst_dev_put() is called on old entry which assigns the
> > > > > > > blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because
> > > > > > > it is not registered.
> > > > > > >
> > > > > > > <CPU0, t3> - After softirq finished executing, your route get request
> > > > > > > from t0 is resumed and the old dst entry is dumped to user space with
> > > > > > > ifindex of 0.
> > > > > > >
> > > > > > > I tested this on my system using your script to generate the route get
> > > > > > > requests. I pinned it to the same CPU forwarding packets through the
> > > > > > > nexthop. To constantly invalidate the cache I created another script
> > > > > > > that simply adds and removes IP addresses from an interface.
> > > > > > >
> > > > > > > If I stop the packet forwarding or the script that invalidates the
> > > > > > > cache, then I don't see any '*' answers to my route get requests.
> > > > > >
> > > > > > Thanks for the reply and analysis Ido, I tested with an additional script which
> > > > > > adds and deletes a route in a loop, as you also saw this increased the
> > > > > > frequency of blackhole route replies from the first script.
> > > > > >
> > > > > > Questions:
> > > > > >
> > > > > > 1. We saw this behavior occurring with TCP connections traversing our routers,
> > > > > > though I was able to reproduce it with only local route requests on our router.
> > > > > > Would you expect this same behavior for TCP traffic only in the kernel which
> > > > > > does not go to userspace?
> > > > >
> > > > > Yes, the problem is in the input path where received packets need to be
> > > > > forwarded.
> > > > >
> > > > > >
> > > > > > 2. These blackhole routes occur even though our main routing table is not
> > > > > > changing, however a separate route table managed by bird on the Linux router is
> > > > > > changing. Is this still expected behavior given that the ip-rules and main
> > > > > > route table used by these route requests are not changing?
> > > > >
> > > > > Yes, there is a per-netns counter that is incremented whenever cached
> > > > > dst entries need to be invalidated. Since it is per-netns it is
> > > > > incremented regardless of the routing table to which your insert the
> > > > > route.
> > > > >
> > > > > >
> > > > > > 3. We were previously rejecting these packets with an iptables rule which sent
> > > > > > an ICMP prohibited message to the sender, this caused TCP connections to break
> > > > > > with a EHOSTUNREACH, should we be silently dropping these packets instead?
> > > > > >
> > > > > > 4. If we should just be dropping these packets, why does the kernel not drop
> > > > > > them instead of letting them traverse the iptables rules?
> > > > >
> > > > > I actually believe the current behavior is a bug that needs to be fixed.
> > > > > See below.
> > > > >
> > > > > >
> > > > > > > BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that
> > > > > > > with older kernel versions you'll see 'lo' instead of '*'.
> > > > > >
> > > > > > Yes indeed! Thanks for solving that mystery as well, our routers are running
> > > > > > 5.1, but we upgraded to 5.4-rc2 to determine whether the issue was still
> > > > > > present in the latest kernel.
> > > > >
> > > > > Do you remember when you started seeing this behavior? I think it
> > > > > started in 4.13 with commit ffe95ecf3a2e ("Merge branch
> > > > > 'net-remove-dst-garbage-collector-logic'").
> > > > >
> > > > > Let me add Wei to see if/how this can be fixed.
> > > > >
> > > > > Wei, in case you don't have the original mail with the description of
> > > > > the problem, it can be found here [1].
> > > > >
> > > > > I believe that the issue Jesse is experiencing is the following:
> > > > >
> > > > > <CPU A, t0> - Received packet A is forwarded and cached dst entry is
> > > > > taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set()
> > > > >
> > > > > <t1> - Given Jesse has busy routers ("ingesting full BGP routing tables
> > > > > from multiple ISPs"), route is added / deleted and rt_cache_flush() is
> > > > > called
> > > > >
> > > > > <CPU B, t2> - Received packet B tries to use the same cached dst entry
> > > > > from t0, but rt_cache_valid() is no longer true and it is replaced in
> > > > > rt_cache_route() by the newer one. This calls dst_dev_put() on the
> > > > > original dst entry which assigns the blackhole netdev to 'dst->dev'
> > > > >
> > > > > <CPU A, t3> - dst_input(skb) is called on packet A and it is dropped due
> > > > > to 'dst->dev' being the blackhole netdev
> > > > >
> > > > > The following patch "fixes" the problem for me:
> > > > >
> > > > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > > > index 42221a12bdda..1c67bdb80fd5 100644
> > > > > --- a/net/ipv4/route.c
> > > > > +++ b/net/ipv4/route.c
> > > > > @@ -1482,7 +1482,6 @@ static bool rt_cache_route(struct fib_nh_common *nhc, struct rtable *rt)
> > > > >         prev = cmpxchg(p, orig, rt);
> > > > >         if (prev == orig) {
> > > > >                 if (orig) {
> > > > > -                       dst_dev_put(&orig->dst);
> > > > >                         dst_release(&orig->dst);
> > > > >                 }
> > > > >         } else {
> > > > >
> > > > > But if this dst entry is cached in some inactive socket and the netdev
> > > > > on which it took a reference needs to be unregistered, then we can
> > > > > potentially wait forever. No?
> > > > >
> > > > Yes. That's exactly the reason we need to free the dev here.
> > > > Otherwise as you described, we will see "unregister_netdevice: waiting
> > > > for xxx to become free. Usage count = x" flushing the screen... Not
> > > > fun...
> > > >
> > > >
> > > > > I'm thinking that it can be fixed by making 'nhc_rth_input' per-CPU, in
> > > > > a similar fashion to what Eric did in commit d26b3a7c4b3b ("ipv4: percpu
> > > > > nh_rth_output cache").
> > > > >
> > > > Hmm... Yes... I would think a per-CPU input cache should work for the
> > > > case above.
> > > > Another idea is: instead of calling dst_dev_put() in rt_cache_route()
> > > > to switch out the dev, we call, rt_add_uncached_list() to add this
> > > > obsolete dst cache to the uncached list. And if the device gets
> > > > unregistered, rt_flush_dev() takes care of all dst entries in the
> > > > uncached list. I think that would work too.
> > >
> > > It crossed my mind as well, but if the device is not unregistered, then
> > > I believe we can eventually consume all the memory and kill the machine?
> >
> > Ha, sorry no. I think this will actually work.
> >
> When every user releases the dst, it will be removed from the uncached
> list. 

Yea, that's the part I missed in my thinking. :(

> But potentially, if any user of the dst is inactive, the dst will be
> in the cached list for a while...

I think this is fine.

Jesse, can you please test Wei's patch?

Thanks

> 
> > >
> > > >
> > > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > > index dc1f510a7c81..ee618d4234ce 100644
> > > > --- a/net/ipv4/route.c
> > > > +++ b/net/ipv4/route.c
> > > > @@ -1482,7 +1482,7 @@ static bool rt_cache_route(struct fib_nh_common
> > > > *nhc, struct rtable *rt)
> > > >         prev = cmpxchg(p, orig, rt);
> > > >         if (prev == orig) {
> > > >                 if (orig) {
> > > > -                       dst_dev_put(&orig->dst);
> > > > +                       rt_add_uncached_list(orig);
> > > >                         dst_release(&orig->dst);
> > > >                 }
> > > >         } else {
> > > >
> > > > + Martin for his idea and input.
> > > >
> > > > > Two questions:
> > > > >
> > > > > 1. Do you agree with the above analysis?
> > > > > 2. Do you have a simpler/better solution in mind?
> > > > >
> > > > > Thanks
> > > > >
> > > > > [1] https://lore.kernel.org/netdev/CANSNSoVM1Uo106xfJtGpTyXNed8kOL4JiXqf3A1eZHBa7z3=yg@mail.gmail.com/T/#medece9445d617372b4842d44525ef0d3ba1ea083

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-11 18:52               ` Ido Schimmel
@ 2019-10-11 21:01                 ` Jesse Hathaway
  2019-10-11 21:27                 ` David Ahern
  1 sibling, 0 replies; 23+ messages in thread
From: Jesse Hathaway @ 2019-10-11 21:01 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: Wei Wang, Martin KaFai Lau, Linux Kernel Network Developers

On Fri, Oct 11, 2019 at 1:52 PM Ido Schimmel <idosch@idosch.org> wrote:
> I think this is fine.
>
> Jesse, can you please test Wei's patch?

I tested with a patched kernel using the same scripts and it does seem to
resolve the issue, thanks!

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-11 18:52               ` Ido Schimmel
  2019-10-11 21:01                 ` Jesse Hathaway
@ 2019-10-11 21:27                 ` David Ahern
  1 sibling, 0 replies; 23+ messages in thread
From: David Ahern @ 2019-10-11 21:27 UTC (permalink / raw)
  To: Ido Schimmel, Wei Wang, jesse
  Cc: Martin KaFai Lau, Linux Kernel Network Developers

On 10/11/19 12:52 PM, Ido Schimmel wrote:
> On Fri, Oct 11, 2019 at 11:47:12AM -0700, Wei Wang wrote:
>> On Fri, Oct 11, 2019 at 11:25 AM Ido Schimmel <idosch@idosch.org> wrote:
>>>
>>> On Fri, Oct 11, 2019 at 09:17:42PM +0300, Ido Schimmel wrote:
>>>> On Fri, Oct 11, 2019 at 10:54:13AM -0700, Wei Wang wrote:
>>>>> On Fri, Oct 11, 2019 at 8:42 AM Ido Schimmel <idosch@idosch.org> wrote:
>>>>>>
>>>>>> On Fri, Oct 11, 2019 at 09:36:51AM -0500, Jesse Hathaway wrote:
>>>>>>> On Thu, Oct 10, 2019 at 3:31 AM Ido Schimmel <idosch@idosch.org> wrote:
>>>>>>>> I think it's working as expected. Here is my theory:
>>>>>>>>
>>>>>>>> If CPU0 is executing both the route get request and forwarding packets
>>>>>>>> through the directly connected interface, then the following can happen:
>>>>>>>>
>>>>>>>> <CPU0, t0> - In process context, per-CPU dst entry cached in the nexthop
>>>>>>>> is found. Not yet dumped to user space
>>>>>>>>
>>>>>>>> <Any CPU, t1> - Routes are added / removed, therefore invalidating the
>>>>>>>> cache by bumping 'net->ipv4.rt_genid'

IPv4 needs a change similar to what was done for IPv6 - only invalidate
dst's for the branches / table affected and not all dst's across the
namespace.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-11 17:54       ` Wei Wang
  2019-10-11 18:17         ` Ido Schimmel
@ 2019-10-12  6:56         ` Martin Lau
  2019-10-14  0:23           ` Wei Wang
  2019-10-15 14:29         ` Jesse Hathaway
  2 siblings, 1 reply; 23+ messages in thread
From: Martin Lau @ 2019-10-12  6:56 UTC (permalink / raw)
  To: Wei Wang; +Cc: Ido Schimmel, Jesse Hathaway, Linux Kernel Network Developers

On Fri, Oct 11, 2019 at 10:54:13AM -0700, Wei Wang wrote:
> On Fri, Oct 11, 2019 at 8:42 AM Ido Schimmel <idosch@idosch.org> wrote:
> >
> > On Fri, Oct 11, 2019 at 09:36:51AM -0500, Jesse Hathaway wrote:
> > > On Thu, Oct 10, 2019 at 3:31 AM Ido Schimmel <idosch@idosch.org> wrote:
> > > > I think it's working as expected. Here is my theory:
> > > >
> > > > If CPU0 is executing both the route get request and forwarding packets
> > > > through the directly connected interface, then the following can happen:
> > > >
> > > > <CPU0, t0> - In process context, per-CPU dst entry cached in the nexthop
> > > > is found. Not yet dumped to user space
> > > >
> > > > <Any CPU, t1> - Routes are added / removed, therefore invalidating the
> > > > cache by bumping 'net->ipv4.rt_genid'
> > > >
> > > > <CPU0, t2> - In softirq, packet is forwarded through the nexthop. The
> > > > cached dst entry is found to be invalid. Therefore, it is replaced by a
> > > > newer dst entry. dst_dev_put() is called on old entry which assigns the
> > > > blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because
> > > > it is not registered.
> > > >
> > > > <CPU0, t3> - After softirq finished executing, your route get request
> > > > from t0 is resumed and the old dst entry is dumped to user space with
> > > > ifindex of 0.
> > > >
> > > > I tested this on my system using your script to generate the route get
> > > > requests. I pinned it to the same CPU forwarding packets through the
> > > > nexthop. To constantly invalidate the cache I created another script
> > > > that simply adds and removes IP addresses from an interface.
> > > >
> > > > If I stop the packet forwarding or the script that invalidates the
> > > > cache, then I don't see any '*' answers to my route get requests.
> > >
> > > Thanks for the reply and analysis Ido, I tested with an additional script which
> > > adds and deletes a route in a loop, as you also saw this increased the
> > > frequency of blackhole route replies from the first script.
> > >
> > > Questions:
> > >
> > > 1. We saw this behavior occurring with TCP connections traversing our routers,
> > > though I was able to reproduce it with only local route requests on our router.
> > > Would you expect this same behavior for TCP traffic only in the kernel which
> > > does not go to userspace?
> >
> > Yes, the problem is in the input path where received packets need to be
> > forwarded.
> >
> > >
> > > 2. These blackhole routes occur even though our main routing table is not
> > > changing, however a separate route table managed by bird on the Linux router is
> > > changing. Is this still expected behavior given that the ip-rules and main
> > > route table used by these route requests are not changing?
> >
> > Yes, there is a per-netns counter that is incremented whenever cached
> > dst entries need to be invalidated. Since it is per-netns it is
> > incremented regardless of the routing table to which your insert the
> > route.
> >
> > >
> > > 3. We were previously rejecting these packets with an iptables rule which sent
> > > an ICMP prohibited message to the sender, this caused TCP connections to break
> > > with a EHOSTUNREACH, should we be silently dropping these packets instead?
> > >
> > > 4. If we should just be dropping these packets, why does the kernel not drop
> > > them instead of letting them traverse the iptables rules?
> >
> > I actually believe the current behavior is a bug that needs to be fixed.
> > See below.
> >
> > >
> > > > BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that
> > > > with older kernel versions you'll see 'lo' instead of '*'.
> > >
> > > Yes indeed! Thanks for solving that mystery as well, our routers are running
> > > 5.1, but we upgraded to 5.4-rc2 to determine whether the issue was still
> > > present in the latest kernel.
> >
> > Do you remember when you started seeing this behavior? I think it
> > started in 4.13 with commit ffe95ecf3a2e ("Merge branch
> > 'net-remove-dst-garbage-collector-logic'").
> >
> > Let me add Wei to see if/how this can be fixed.
> >
> > Wei, in case you don't have the original mail with the description of
> > the problem, it can be found here [1].
> >
> > I believe that the issue Jesse is experiencing is the following:
> >
> > <CPU A, t0> - Received packet A is forwarded and cached dst entry is
> > taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set()
> >
> > <t1> - Given Jesse has busy routers ("ingesting full BGP routing tables
> > from multiple ISPs"), route is added / deleted and rt_cache_flush() is
> > called
> >
> > <CPU B, t2> - Received packet B tries to use the same cached dst entry
> > from t0, but rt_cache_valid() is no longer true and it is replaced in
> > rt_cache_route() by the newer one. This calls dst_dev_put() on the
> > original dst entry which assigns the blackhole netdev to 'dst->dev'
> >
> > <CPU A, t3> - dst_input(skb) is called on packet A and it is dropped due
> > to 'dst->dev' being the blackhole netdev
> >
> > The following patch "fixes" the problem for me:
> >
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index 42221a12bdda..1c67bdb80fd5 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -1482,7 +1482,6 @@ static bool rt_cache_route(struct fib_nh_common *nhc, struct rtable *rt)
> >         prev = cmpxchg(p, orig, rt);
> >         if (prev == orig) {
> >                 if (orig) {
> > -                       dst_dev_put(&orig->dst);
> >                         dst_release(&orig->dst);
> >                 }
> >         } else {
> >
> > But if this dst entry is cached in some inactive socket and the netdev
> > on which it took a reference needs to be unregistered, then we can
> > potentially wait forever. No?
> >
> Yes. That's exactly the reason we need to free the dev here.
> Otherwise as you described, we will see "unregister_netdevice: waiting
> for xxx to become free. Usage count = x" flushing the screen... Not
> fun...
> 
> 
> > I'm thinking that it can be fixed by making 'nhc_rth_input' per-CPU, in
> > a similar fashion to what Eric did in commit d26b3a7c4b3b ("ipv4: percpu
> > nh_rth_output cache").
> >
> Hmm... Yes... I would think a per-CPU input cache should work for the
> case above.
> Another idea is: instead of calling dst_dev_put() in rt_cache_route()
> to switch out the dev, we call, rt_add_uncached_list() to add this
> obsolete dst cache to the uncached list. And if the device gets
> unregistered, rt_flush_dev() takes care of all dst entries in the
> uncached list. I think that would work too.
> 
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index dc1f510a7c81..ee618d4234ce 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1482,7 +1482,7 @@ static bool rt_cache_route(struct fib_nh_common
> *nhc, struct rtable *rt)
>         prev = cmpxchg(p, orig, rt);
>         if (prev == orig) {
>                 if (orig) {
> -                       dst_dev_put(&orig->dst);
> +                       rt_add_uncached_list(orig);
>                         dst_release(&orig->dst);
>                 }
>         } else {
> 
> + Martin for his idea and input.
The above fix should work and a simple one liner for net.
percpu may be a too big hammer for bug fix.
It is only needed for input route?  A comment would be nice.

While reading around, I am puzzling why a rt has to be recreated
for the same route.  I could be missing something.

I don't recall that is happening to ipv6 route even that tree-branch's
fn_sernum has changed.  

It seems v4 sk has not stored the last lookup rt_genid.
e.g. __sk_dst_check(sk, 0).  Everyone is sharing the rt->rt_genid
to check for changes, so the rt must be re-created?

> 
> > Two questions:
> >
> > 1. Do you agree with the above analysis?
> > 2. Do you have a simpler/better solution in mind?
> >
> > Thanks
> >
> > [1] https://lore.kernel.org/netdev/CANSNSoVM1Uo106xfJtGpTyXNed8kOL4JiXqf3A1eZHBa7z3=yg@mail.gmail.com/T/#medece9445d617372b4842d44525ef0d3ba1ea083

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-12  6:56         ` Martin Lau
@ 2019-10-14  0:23           ` Wei Wang
  2019-10-14 17:26             ` Martin Lau
  0 siblings, 1 reply; 23+ messages in thread
From: Wei Wang @ 2019-10-14  0:23 UTC (permalink / raw)
  To: Martin Lau; +Cc: Ido Schimmel, Jesse Hathaway, Linux Kernel Network Developers

On Fri, Oct 11, 2019 at 11:56 PM Martin Lau <kafai@fb.com> wrote:
>
> On Fri, Oct 11, 2019 at 10:54:13AM -0700, Wei Wang wrote:
> > On Fri, Oct 11, 2019 at 8:42 AM Ido Schimmel <idosch@idosch.org> wrote:
> > >
> > > On Fri, Oct 11, 2019 at 09:36:51AM -0500, Jesse Hathaway wrote:
> > > > On Thu, Oct 10, 2019 at 3:31 AM Ido Schimmel <idosch@idosch.org> wrote:
> > > > > I think it's working as expected. Here is my theory:
> > > > >
> > > > > If CPU0 is executing both the route get request and forwarding packets
> > > > > through the directly connected interface, then the following can happen:
> > > > >
> > > > > <CPU0, t0> - In process context, per-CPU dst entry cached in the nexthop
> > > > > is found. Not yet dumped to user space
> > > > >
> > > > > <Any CPU, t1> - Routes are added / removed, therefore invalidating the
> > > > > cache by bumping 'net->ipv4.rt_genid'
> > > > >
> > > > > <CPU0, t2> - In softirq, packet is forwarded through the nexthop. The
> > > > > cached dst entry is found to be invalid. Therefore, it is replaced by a
> > > > > newer dst entry. dst_dev_put() is called on old entry which assigns the
> > > > > blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because
> > > > > it is not registered.
> > > > >
> > > > > <CPU0, t3> - After softirq finished executing, your route get request
> > > > > from t0 is resumed and the old dst entry is dumped to user space with
> > > > > ifindex of 0.
> > > > >
> > > > > I tested this on my system using your script to generate the route get
> > > > > requests. I pinned it to the same CPU forwarding packets through the
> > > > > nexthop. To constantly invalidate the cache I created another script
> > > > > that simply adds and removes IP addresses from an interface.
> > > > >
> > > > > If I stop the packet forwarding or the script that invalidates the
> > > > > cache, then I don't see any '*' answers to my route get requests.
> > > >
> > > > Thanks for the reply and analysis Ido, I tested with an additional script which
> > > > adds and deletes a route in a loop, as you also saw this increased the
> > > > frequency of blackhole route replies from the first script.
> > > >
> > > > Questions:
> > > >
> > > > 1. We saw this behavior occurring with TCP connections traversing our routers,
> > > > though I was able to reproduce it with only local route requests on our router.
> > > > Would you expect this same behavior for TCP traffic only in the kernel which
> > > > does not go to userspace?
> > >
> > > Yes, the problem is in the input path where received packets need to be
> > > forwarded.
> > >
> > > >
> > > > 2. These blackhole routes occur even though our main routing table is not
> > > > changing, however a separate route table managed by bird on the Linux router is
> > > > changing. Is this still expected behavior given that the ip-rules and main
> > > > route table used by these route requests are not changing?
> > >
> > > Yes, there is a per-netns counter that is incremented whenever cached
> > > dst entries need to be invalidated. Since it is per-netns it is
> > > incremented regardless of the routing table to which your insert the
> > > route.
> > >
> > > >
> > > > 3. We were previously rejecting these packets with an iptables rule which sent
> > > > an ICMP prohibited message to the sender, this caused TCP connections to break
> > > > with a EHOSTUNREACH, should we be silently dropping these packets instead?
> > > >
> > > > 4. If we should just be dropping these packets, why does the kernel not drop
> > > > them instead of letting them traverse the iptables rules?
> > >
> > > I actually believe the current behavior is a bug that needs to be fixed.
> > > See below.
> > >
> > > >
> > > > > BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that
> > > > > with older kernel versions you'll see 'lo' instead of '*'.
> > > >
> > > > Yes indeed! Thanks for solving that mystery as well, our routers are running
> > > > 5.1, but we upgraded to 5.4-rc2 to determine whether the issue was still
> > > > present in the latest kernel.
> > >
> > > Do you remember when you started seeing this behavior? I think it
> > > started in 4.13 with commit ffe95ecf3a2e ("Merge branch
> > > 'net-remove-dst-garbage-collector-logic'").
> > >
> > > Let me add Wei to see if/how this can be fixed.
> > >
> > > Wei, in case you don't have the original mail with the description of
> > > the problem, it can be found here [1].
> > >
> > > I believe that the issue Jesse is experiencing is the following:
> > >
> > > <CPU A, t0> - Received packet A is forwarded and cached dst entry is
> > > taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set()
> > >
> > > <t1> - Given Jesse has busy routers ("ingesting full BGP routing tables
> > > from multiple ISPs"), route is added / deleted and rt_cache_flush() is
> > > called
> > >
> > > <CPU B, t2> - Received packet B tries to use the same cached dst entry
> > > from t0, but rt_cache_valid() is no longer true and it is replaced in
> > > rt_cache_route() by the newer one. This calls dst_dev_put() on the
> > > original dst entry which assigns the blackhole netdev to 'dst->dev'
> > >
> > > <CPU A, t3> - dst_input(skb) is called on packet A and it is dropped due
> > > to 'dst->dev' being the blackhole netdev
> > >
> > > The following patch "fixes" the problem for me:
> > >
> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > index 42221a12bdda..1c67bdb80fd5 100644
> > > --- a/net/ipv4/route.c
> > > +++ b/net/ipv4/route.c
> > > @@ -1482,7 +1482,6 @@ static bool rt_cache_route(struct fib_nh_common *nhc, struct rtable *rt)
> > >         prev = cmpxchg(p, orig, rt);
> > >         if (prev == orig) {
> > >                 if (orig) {
> > > -                       dst_dev_put(&orig->dst);
> > >                         dst_release(&orig->dst);
> > >                 }
> > >         } else {
> > >
> > > But if this dst entry is cached in some inactive socket and the netdev
> > > on which it took a reference needs to be unregistered, then we can
> > > potentially wait forever. No?
> > >
> > Yes. That's exactly the reason we need to free the dev here.
> > Otherwise as you described, we will see "unregister_netdevice: waiting
> > for xxx to become free. Usage count = x" flushing the screen... Not
> > fun...
> >
> >
> > > I'm thinking that it can be fixed by making 'nhc_rth_input' per-CPU, in
> > > a similar fashion to what Eric did in commit d26b3a7c4b3b ("ipv4: percpu
> > > nh_rth_output cache").
> > >
> > Hmm... Yes... I would think a per-CPU input cache should work for the
> > case above.
> > Another idea is: instead of calling dst_dev_put() in rt_cache_route()
> > to switch out the dev, we call, rt_add_uncached_list() to add this
> > obsolete dst cache to the uncached list. And if the device gets
> > unregistered, rt_flush_dev() takes care of all dst entries in the
> > uncached list. I think that would work too.
> >
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index dc1f510a7c81..ee618d4234ce 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -1482,7 +1482,7 @@ static bool rt_cache_route(struct fib_nh_common
> > *nhc, struct rtable *rt)
> >         prev = cmpxchg(p, orig, rt);
> >         if (prev == orig) {
> >                 if (orig) {
> > -                       dst_dev_put(&orig->dst);
> > +                       rt_add_uncached_list(orig);
> >                         dst_release(&orig->dst);
> >                 }
> >         } else {
> >
> > + Martin for his idea and input.
> The above fix should work and a simple one liner for net.
> percpu may be a too big hammer for bug fix.
> It is only needed for input route?  A comment would be nice.
>
> While reading around, I am puzzling why a rt has to be recreated
> for the same route.  I could be missing something.
>
> I don't recall that is happening to ipv6 route even that tree-branch's
> fn_sernum has changed.
>
> It seems v4 sk has not stored the last lookup rt_genid.
> e.g. __sk_dst_check(sk, 0).  Everyone is sharing the rt->rt_genid
> to check for changes, so the rt must be re-created?
>
I think the reason rt has to be created is v4 code uses per net
rt_genid. So changes to any route in the namespace will invalidate all
other routes. (As David pointed out in his email.) However, v6 code
uses per fib_node fn_sernum, and has a way to only invalidate route
that are affected. (fib6_update_sernum_upto_root())
And v4 code not caching rt_genid seems to be separate issue, I think...


> >
> > > Two questions:
> > >
> > > 1. Do you agree with the above analysis?
> > > 2. Do you have a simpler/better solution in mind?
> > >
> > > Thanks
> > >
> > > [1] https://lore.kernel.org/netdev/CANSNSoVM1Uo106xfJtGpTyXNed8kOL4JiXqf3A1eZHBa7z3=yg@mail.gmail.com/T/#medece9445d617372b4842d44525ef0d3ba1ea083

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-14  0:23           ` Wei Wang
@ 2019-10-14 17:26             ` Martin Lau
  2019-10-15 14:45               ` David Ahern
  0 siblings, 1 reply; 23+ messages in thread
From: Martin Lau @ 2019-10-14 17:26 UTC (permalink / raw)
  To: Wei Wang; +Cc: Ido Schimmel, Jesse Hathaway, Linux Kernel Network Developers

On Sun, Oct 13, 2019 at 05:23:01PM -0700, Wei Wang wrote:
> On Fri, Oct 11, 2019 at 11:56 PM Martin Lau <kafai@fb.com> wrote:
> >
> > On Fri, Oct 11, 2019 at 10:54:13AM -0700, Wei Wang wrote:
> > > On Fri, Oct 11, 2019 at 8:42 AM Ido Schimmel <idosch@idosch.org> wrote:
> > > >
> > > > On Fri, Oct 11, 2019 at 09:36:51AM -0500, Jesse Hathaway wrote:
> > > > > On Thu, Oct 10, 2019 at 3:31 AM Ido Schimmel <idosch@idosch.org> wrote:
> > > > > > I think it's working as expected. Here is my theory:
> > > > > >
> > > > > > If CPU0 is executing both the route get request and forwarding packets
> > > > > > through the directly connected interface, then the following can happen:
> > > > > >
> > > > > > <CPU0, t0> - In process context, per-CPU dst entry cached in the nexthop
> > > > > > is found. Not yet dumped to user space
> > > > > >
> > > > > > <Any CPU, t1> - Routes are added / removed, therefore invalidating the
> > > > > > cache by bumping 'net->ipv4.rt_genid'
> > > > > >
> > > > > > <CPU0, t2> - In softirq, packet is forwarded through the nexthop. The
> > > > > > cached dst entry is found to be invalid. Therefore, it is replaced by a
> > > > > > newer dst entry. dst_dev_put() is called on old entry which assigns the
> > > > > > blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because
> > > > > > it is not registered.
> > > > > >
> > > > > > <CPU0, t3> - After softirq finished executing, your route get request
> > > > > > from t0 is resumed and the old dst entry is dumped to user space with
> > > > > > ifindex of 0.
> > > > > >
> > > > > > I tested this on my system using your script to generate the route get
> > > > > > requests. I pinned it to the same CPU forwarding packets through the
> > > > > > nexthop. To constantly invalidate the cache I created another script
> > > > > > that simply adds and removes IP addresses from an interface.
> > > > > >
> > > > > > If I stop the packet forwarding or the script that invalidates the
> > > > > > cache, then I don't see any '*' answers to my route get requests.
> > > > >
> > > > > Thanks for the reply and analysis Ido, I tested with an additional script which
> > > > > adds and deletes a route in a loop, as you also saw this increased the
> > > > > frequency of blackhole route replies from the first script.
> > > > >
> > > > > Questions:
> > > > >
> > > > > 1. We saw this behavior occurring with TCP connections traversing our routers,
> > > > > though I was able to reproduce it with only local route requests on our router.
> > > > > Would you expect this same behavior for TCP traffic only in the kernel which
> > > > > does not go to userspace?
> > > >
> > > > Yes, the problem is in the input path where received packets need to be
> > > > forwarded.
> > > >
> > > > >
> > > > > 2. These blackhole routes occur even though our main routing table is not
> > > > > changing, however a separate route table managed by bird on the Linux router is
> > > > > changing. Is this still expected behavior given that the ip-rules and main
> > > > > route table used by these route requests are not changing?
> > > >
> > > > Yes, there is a per-netns counter that is incremented whenever cached
> > > > dst entries need to be invalidated. Since it is per-netns it is
> > > > incremented regardless of the routing table to which your insert the
> > > > route.
> > > >
> > > > >
> > > > > 3. We were previously rejecting these packets with an iptables rule which sent
> > > > > an ICMP prohibited message to the sender, this caused TCP connections to break
> > > > > with a EHOSTUNREACH, should we be silently dropping these packets instead?
> > > > >
> > > > > 4. If we should just be dropping these packets, why does the kernel not drop
> > > > > them instead of letting them traverse the iptables rules?
> > > >
> > > > I actually believe the current behavior is a bug that needs to be fixed.
> > > > See below.
> > > >
> > > > >
> > > > > > BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that
> > > > > > with older kernel versions you'll see 'lo' instead of '*'.
> > > > >
> > > > > Yes indeed! Thanks for solving that mystery as well, our routers are running
> > > > > 5.1, but we upgraded to 5.4-rc2 to determine whether the issue was still
> > > > > present in the latest kernel.
> > > >
> > > > Do you remember when you started seeing this behavior? I think it
> > > > started in 4.13 with commit ffe95ecf3a2e ("Merge branch
> > > > 'net-remove-dst-garbage-collector-logic'").
> > > >
> > > > Let me add Wei to see if/how this can be fixed.
> > > >
> > > > Wei, in case you don't have the original mail with the description of
> > > > the problem, it can be found here [1].
> > > >
> > > > I believe that the issue Jesse is experiencing is the following:
> > > >
> > > > <CPU A, t0> - Received packet A is forwarded and cached dst entry is
> > > > taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set()
> > > >
> > > > <t1> - Given Jesse has busy routers ("ingesting full BGP routing tables
> > > > from multiple ISPs"), route is added / deleted and rt_cache_flush() is
> > > > called
> > > >
> > > > <CPU B, t2> - Received packet B tries to use the same cached dst entry
> > > > from t0, but rt_cache_valid() is no longer true and it is replaced in
> > > > rt_cache_route() by the newer one. This calls dst_dev_put() on the
> > > > original dst entry which assigns the blackhole netdev to 'dst->dev'
> > > >
> > > > <CPU A, t3> - dst_input(skb) is called on packet A and it is dropped due
> > > > to 'dst->dev' being the blackhole netdev
> > > >
> > > > The following patch "fixes" the problem for me:
> > > >
> > > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > > index 42221a12bdda..1c67bdb80fd5 100644
> > > > --- a/net/ipv4/route.c
> > > > +++ b/net/ipv4/route.c
> > > > @@ -1482,7 +1482,6 @@ static bool rt_cache_route(struct fib_nh_common *nhc, struct rtable *rt)
> > > >         prev = cmpxchg(p, orig, rt);
> > > >         if (prev == orig) {
> > > >                 if (orig) {
> > > > -                       dst_dev_put(&orig->dst);
> > > >                         dst_release(&orig->dst);
> > > >                 }
> > > >         } else {
> > > >
> > > > But if this dst entry is cached in some inactive socket and the netdev
> > > > on which it took a reference needs to be unregistered, then we can
> > > > potentially wait forever. No?
> > > >
> > > Yes. That's exactly the reason we need to free the dev here.
> > > Otherwise as you described, we will see "unregister_netdevice: waiting
> > > for xxx to become free. Usage count = x" flushing the screen... Not
> > > fun...
> > >
> > >
> > > > I'm thinking that it can be fixed by making 'nhc_rth_input' per-CPU, in
> > > > a similar fashion to what Eric did in commit d26b3a7c4b3b ("ipv4: percpu
> > > > nh_rth_output cache").
> > > >
> > > Hmm... Yes... I would think a per-CPU input cache should work for the
> > > case above.
> > > Another idea is: instead of calling dst_dev_put() in rt_cache_route()
> > > to switch out the dev, we call, rt_add_uncached_list() to add this
> > > obsolete dst cache to the uncached list. And if the device gets
> > > unregistered, rt_flush_dev() takes care of all dst entries in the
> > > uncached list. I think that would work too.
> > >
> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > index dc1f510a7c81..ee618d4234ce 100644
> > > --- a/net/ipv4/route.c
> > > +++ b/net/ipv4/route.c
> > > @@ -1482,7 +1482,7 @@ static bool rt_cache_route(struct fib_nh_common
> > > *nhc, struct rtable *rt)
> > >         prev = cmpxchg(p, orig, rt);
> > >         if (prev == orig) {
> > >                 if (orig) {
> > > -                       dst_dev_put(&orig->dst);
> > > +                       rt_add_uncached_list(orig);
> > >                         dst_release(&orig->dst);
> > >                 }
> > >         } else {
> > >
> > > + Martin for his idea and input.
> > The above fix should work and a simple one liner for net.
> > percpu may be a too big hammer for bug fix.
> > It is only needed for input route?  A comment would be nice.
> >
> > While reading around, I am puzzling why a rt has to be recreated
> > for the same route.  I could be missing something.
> >
> > I don't recall that is happening to ipv6 route even that tree-branch's
> > fn_sernum has changed.
> >
> > It seems v4 sk has not stored the last lookup rt_genid.
> > e.g. __sk_dst_check(sk, 0).  Everyone is sharing the rt->rt_genid
> > to check for changes, so the rt must be re-created?
> >
> I think the reason rt has to be created is v4 code uses per net
> rt_genid. So changes to any route in the namespace will invalidate all
> other routes. (As David pointed out in his email.) However, v6 code
> uses per fib_node fn_sernum, and has a way to only invalidate route
> that are affected. (fib6_update_sernum_upto_root())
> And v4 code not caching rt_genid seems to be separate issue, I think...
Understood that v6 impact is smaller on route changes because there is per
fib6_node fn_sernum.

AFAICT, even for the route that are affected by fib6_update_sernum_upto_root(),
I don't see the RTF_PCPU route is re-created.  v6 sk does
dst_check() => re-lookup the fib6 =>
found the same RTF_PCPU (but does not re-create it) =>
update the sk with new cookie in ip6_dst_store()

> 
> 
> > >
> > > > Two questions:
> > > >
> > > > 1. Do you agree with the above analysis?
> > > > 2. Do you have a simpler/better solution in mind?
> > > >
> > > > Thanks
> > > >
> > > > [1] https://lore.kernel.org/netdev/CANSNSoVM1Uo106xfJtGpTyXNed8kOL4JiXqf3A1eZHBa7z3=yg@mail.gmail.com/T/#medece9445d617372b4842d44525ef0d3ba1ea083

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-11 17:54       ` Wei Wang
  2019-10-11 18:17         ` Ido Schimmel
  2019-10-12  6:56         ` Martin Lau
@ 2019-10-15 14:29         ` Jesse Hathaway
  2019-10-15 16:44           ` Wei Wang
  2 siblings, 1 reply; 23+ messages in thread
From: Jesse Hathaway @ 2019-10-15 14:29 UTC (permalink / raw)
  To: Wei Wang; +Cc: Ido Schimmel, Martin KaFai Lau, Linux Kernel Network Developers

On Fri, Oct 11, 2019 at 12:54 PM Wei Wang <weiwan@google.com> wrote:
> Hmm... Yes... I would think a per-CPU input cache should work for the
> case above.
> Another idea is: instead of calling dst_dev_put() in rt_cache_route()
> to switch out the dev, we call, rt_add_uncached_list() to add this
> obsolete dst cache to the uncached list. And if the device gets
> unregistered, rt_flush_dev() takes care of all dst entries in the
> uncached list. I think that would work too.
>
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index dc1f510a7c81..ee618d4234ce 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1482,7 +1482,7 @@ static bool rt_cache_route(struct fib_nh_common
> *nhc, struct rtable *rt)
>         prev = cmpxchg(p, orig, rt);
>         if (prev == orig) {
>                 if (orig) {
> -                       dst_dev_put(&orig->dst);
> +                       rt_add_uncached_list(orig);
>                         dst_release(&orig->dst);
>                 }
>         } else {
>

Thanks Wei for your work on this issue,

Any chance this patch will make it into 5.4?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-14 17:26             ` Martin Lau
@ 2019-10-15 14:45               ` David Ahern
  2019-10-15 16:42                 ` Wei Wang
  0 siblings, 1 reply; 23+ messages in thread
From: David Ahern @ 2019-10-15 14:45 UTC (permalink / raw)
  To: Martin Lau, Wei Wang
  Cc: Ido Schimmel, Jesse Hathaway, Linux Kernel Network Developers

On 10/14/19 1:26 PM, Martin Lau wrote:
> 
> AFAICT, even for the route that are affected by fib6_update_sernum_upto_root(),
> I don't see the RTF_PCPU route is re-created.  v6 sk does
> dst_check() => re-lookup the fib6 =>
> found the same RTF_PCPU (but does not re-create it) =>
> update the sk with new cookie in ip6_dst_store()
> 

That's fine. The pcpu cache is per nexthop (fib6_nh) for a specific
gateway/device.

The invalidate forces another lookup for the intended destination after
the change to the fib. If the lookup resolves to the same fib entry and
nexthop, then re-using the same cached dst/rt6_info is ok.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-15 14:45               ` David Ahern
@ 2019-10-15 16:42                 ` Wei Wang
  2019-10-16  6:35                   ` Martin Lau
  0 siblings, 1 reply; 23+ messages in thread
From: Wei Wang @ 2019-10-15 16:42 UTC (permalink / raw)
  To: David Ahern
  Cc: Martin Lau, Ido Schimmel, Jesse Hathaway,
	Linux Kernel Network Developers

On Tue, Oct 15, 2019 at 7:45 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 10/14/19 1:26 PM, Martin Lau wrote:
> >
> > AFAICT, even for the route that are affected by fib6_update_sernum_upto_root(),
> > I don't see the RTF_PCPU route is re-created.  v6 sk does
> > dst_check() => re-lookup the fib6 =>
> > found the same RTF_PCPU (but does not re-create it) =>
> > update the sk with new cookie in ip6_dst_store()
> >
Hmm... That is a good point. Why does v4 need to recreate the dst
cache even though the route itself is not changed?
Now that I think about it, I agree with Martin's previous comment: it
probably is because v4 code does not cache rt->rt_genid into the
socket and every user of the rt is sharing the same rt_genid stored in
the route itself.

>
> That's fine. The pcpu cache is per nexthop (fib6_nh) for a specific
> gateway/device.
>
> The invalidate forces another lookup for the intended destination after
> the change to the fib. If the lookup resolves to the same fib entry and
> nexthop, then re-using the same cached dst/rt6_info is ok.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-15 14:29         ` Jesse Hathaway
@ 2019-10-15 16:44           ` Wei Wang
  2019-10-16  6:39             ` Martin Lau
  0 siblings, 1 reply; 23+ messages in thread
From: Wei Wang @ 2019-10-15 16:44 UTC (permalink / raw)
  To: Jesse Hathaway
  Cc: Ido Schimmel, Martin KaFai Lau, Linux Kernel Network Developers

On Tue, Oct 15, 2019 at 7:29 AM Jesse Hathaway <jesse@mbuki-mvuki.org> wrote:
>
> On Fri, Oct 11, 2019 at 12:54 PM Wei Wang <weiwan@google.com> wrote:
> > Hmm... Yes... I would think a per-CPU input cache should work for the
> > case above.
> > Another idea is: instead of calling dst_dev_put() in rt_cache_route()
> > to switch out the dev, we call, rt_add_uncached_list() to add this
> > obsolete dst cache to the uncached list. And if the device gets
> > unregistered, rt_flush_dev() takes care of all dst entries in the
> > uncached list. I think that would work too.
> >
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index dc1f510a7c81..ee618d4234ce 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -1482,7 +1482,7 @@ static bool rt_cache_route(struct fib_nh_common
> > *nhc, struct rtable *rt)
> >         prev = cmpxchg(p, orig, rt);
> >         if (prev == orig) {
> >                 if (orig) {
> > -                       dst_dev_put(&orig->dst);
> > +                       rt_add_uncached_list(orig);
> >                         dst_release(&orig->dst);
> >                 }
> >         } else {
> >
>
> Thanks Wei for your work on this issue,
>
> Any chance this patch will make it into 5.4?

I can submit the patch to NET branch if everyone agrees with this one liner fix.
Then I believe it will be patched into the next 5.4 release automatically?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-15 16:42                 ` Wei Wang
@ 2019-10-16  6:35                   ` Martin Lau
  0 siblings, 0 replies; 23+ messages in thread
From: Martin Lau @ 2019-10-16  6:35 UTC (permalink / raw)
  To: Wei Wang
  Cc: David Ahern, Ido Schimmel, Jesse Hathaway,
	Linux Kernel Network Developers

On Tue, Oct 15, 2019 at 09:42:49AM -0700, Wei Wang wrote:
> On Tue, Oct 15, 2019 at 7:45 AM David Ahern <dsahern@gmail.com> wrote:
> >
> > On 10/14/19 1:26 PM, Martin Lau wrote:
> > >
> > > AFAICT, even for the route that are affected by fib6_update_sernum_upto_root(),
> > > I don't see the RTF_PCPU route is re-created.  v6 sk does
> > > dst_check() => re-lookup the fib6 =>
> > > found the same RTF_PCPU (but does not re-create it) =>
> > > update the sk with new cookie in ip6_dst_store()
> > >
> Hmm... That is a good point. Why does v4 need to recreate the dst
> cache even though the route itself is not changed?
> Now that I think about it, I agree with Martin's previous comment: it
> probably is because v4 code does not cache rt->rt_genid into the
> socket and every user of the rt is sharing the same rt_genid stored in
> the route itself.
Exactly	:) If no re-create, dst_dev_put() can be avoided.
The root cause is not really related to the global NS rt_genid.
A granular rt_genid may help to reduce the race on dst_dev_put()
but it will still happen.  (that aside, improving the NS rt_genid
would still be great).

Thinking more about it,	this issue should not be limited to input.
I think you fix is right.

> 
> >
> > That's fine. The pcpu cache is per nexthop (fib6_nh) for a specific
> > gateway/device.
> >
> > The invalidate forces another lookup for the intended destination after
> > the change to the fib. If the lookup resolves to the same fib entry and
> > nexthop, then re-using the same cached dst/rt6_info is ok.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-15 16:44           ` Wei Wang
@ 2019-10-16  6:39             ` Martin Lau
  2019-10-16 16:35               ` Wei Wang
  0 siblings, 1 reply; 23+ messages in thread
From: Martin Lau @ 2019-10-16  6:39 UTC (permalink / raw)
  To: Wei Wang; +Cc: Jesse Hathaway, Ido Schimmel, Linux Kernel Network Developers

On Tue, Oct 15, 2019 at 09:44:11AM -0700, Wei Wang wrote:
> On Tue, Oct 15, 2019 at 7:29 AM Jesse Hathaway <jesse@mbuki-mvuki.org> wrote:
> >
> > On Fri, Oct 11, 2019 at 12:54 PM Wei Wang <weiwan@google.com> wrote:
> > > Hmm... Yes... I would think a per-CPU input cache should work for the
> > > case above.
> > > Another idea is: instead of calling dst_dev_put() in rt_cache_route()
> > > to switch out the dev, we call, rt_add_uncached_list() to add this
> > > obsolete dst cache to the uncached list. And if the device gets
> > > unregistered, rt_flush_dev() takes care of all dst entries in the
> > > uncached list. I think that would work too.
> > >
> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > index dc1f510a7c81..ee618d4234ce 100644
> > > --- a/net/ipv4/route.c
> > > +++ b/net/ipv4/route.c
> > > @@ -1482,7 +1482,7 @@ static bool rt_cache_route(struct fib_nh_common
> > > *nhc, struct rtable *rt)
> > >         prev = cmpxchg(p, orig, rt);
> > >         if (prev == orig) {
> > >                 if (orig) {
> > > -                       dst_dev_put(&orig->dst);
> > > +                       rt_add_uncached_list(orig);
> > >                         dst_release(&orig->dst);
> > >                 }
> > >         } else {
> > >
> >
> > Thanks Wei for your work on this issue,
> >
> > Any chance this patch will make it into 5.4?
> 
> I can submit the patch to NET branch if everyone agrees with this one liner fix.
Acked-by: Martin KaFai Lau <kafai@fb.com>

I don't think it is a very critical bug though.  Not sure
how far it should be ported.

> Then I believe it will be patched into the next 5.4 release automatically?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Race condition in route lookup
  2019-10-16  6:39             ` Martin Lau
@ 2019-10-16 16:35               ` Wei Wang
  0 siblings, 0 replies; 23+ messages in thread
From: Wei Wang @ 2019-10-16 16:35 UTC (permalink / raw)
  To: Martin Lau; +Cc: Jesse Hathaway, Ido Schimmel, Linux Kernel Network Developers

On Tue, Oct 15, 2019 at 11:39 PM Martin Lau <kafai@fb.com> wrote:
>
> On Tue, Oct 15, 2019 at 09:44:11AM -0700, Wei Wang wrote:
> > On Tue, Oct 15, 2019 at 7:29 AM Jesse Hathaway <jesse@mbuki-mvuki.org> wrote:
> > >
> > > On Fri, Oct 11, 2019 at 12:54 PM Wei Wang <weiwan@google.com> wrote:
> > > > Hmm... Yes... I would think a per-CPU input cache should work for the
> > > > case above.
> > > > Another idea is: instead of calling dst_dev_put() in rt_cache_route()
> > > > to switch out the dev, we call, rt_add_uncached_list() to add this
> > > > obsolete dst cache to the uncached list. And if the device gets
> > > > unregistered, rt_flush_dev() takes care of all dst entries in the
> > > > uncached list. I think that would work too.
> > > >
> > > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > > index dc1f510a7c81..ee618d4234ce 100644
> > > > --- a/net/ipv4/route.c
> > > > +++ b/net/ipv4/route.c
> > > > @@ -1482,7 +1482,7 @@ static bool rt_cache_route(struct fib_nh_common
> > > > *nhc, struct rtable *rt)
> > > >         prev = cmpxchg(p, orig, rt);
> > > >         if (prev == orig) {
> > > >                 if (orig) {
> > > > -                       dst_dev_put(&orig->dst);
> > > > +                       rt_add_uncached_list(orig);
> > > >                         dst_release(&orig->dst);
> > > >                 }
> > > >         } else {
> > > >
> > >
> > > Thanks Wei for your work on this issue,
> > >
> > > Any chance this patch will make it into 5.4?
> >
> > I can submit the patch to NET branch if everyone agrees with this one liner fix.
> Acked-by: Martin KaFai Lau <kafai@fb.com>
>
> I don't think it is a very critical bug though.  Not sure
> how far it should be ported.
>
Thanks Martin. I am preparing the patch and will send it out soon.

> > Then I believe it will be patched into the next 5.4 release automatically?

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2019-10-16 16:36 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-09 16:00 Race condition in route lookup Jesse Hathaway
2019-10-10  8:31 ` Ido Schimmel
2019-10-10  8:46   ` Ido Schimmel
2019-10-11 14:36   ` Jesse Hathaway
2019-10-11 15:42     ` Ido Schimmel
2019-10-11 16:09       ` Jesse Hathaway
2019-10-11 17:54       ` Wei Wang
2019-10-11 18:17         ` Ido Schimmel
2019-10-11 18:25           ` Ido Schimmel
2019-10-11 18:47             ` Wei Wang
2019-10-11 18:52               ` Ido Schimmel
2019-10-11 21:01                 ` Jesse Hathaway
2019-10-11 21:27                 ` David Ahern
2019-10-12  6:56         ` Martin Lau
2019-10-14  0:23           ` Wei Wang
2019-10-14 17:26             ` Martin Lau
2019-10-15 14:45               ` David Ahern
2019-10-15 16:42                 ` Wei Wang
2019-10-16  6:35                   ` Martin Lau
2019-10-15 14:29         ` Jesse Hathaway
2019-10-15 16:44           ` Wei Wang
2019-10-16  6:39             ` Martin Lau
2019-10-16 16:35               ` Wei Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).