netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [Patch net] ipv4: restore rt->fi for reference counting
@ 2017-05-04 21:54 Cong Wang
  2017-05-07 18:53 ` Eric Dumazet
  2017-05-08 18:35 ` David Miller
  0 siblings, 2 replies; 43+ messages in thread
From: Cong Wang @ 2017-05-04 21:54 UTC (permalink / raw)
  To: netdev; +Cc: andreyknvl, edumazet, Cong Wang

IPv4 dst could use fi->fib_metrics to store metrics but fib_info
itself is refcnt'ed, so without taking a refcnt fi and
fi->fib_metrics could be freed while dst metrics still points to
it. This triggers use-after-free as reported by Andrey twice.

This patch reverts commit 2860583fe840 ("ipv4: Kill rt->fi") to
restore this reference counting. It is a quick fix for -net and
-stable, for -net-next, as Eric suggested, we can consider doing
reference counting for metrics itself instead of relying on fib_info.

IPv6 is very different, it copies or steals the metrics from mx6_config
in fib6_commit_metrics() so probably doesn't need a refcnt.

Decnet has already done the refcnt'ing, see dn_fib_semantic_match().

Fixes: 2860583fe840 ("ipv4: Kill rt->fi")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 include/net/route.h |  1 +
 net/ipv4/route.c    | 18 +++++++++++++++++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/net/route.h b/include/net/route.h
index 2cc0e14..4335eb7 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -69,6 +69,7 @@ struct rtable {
 
 	struct list_head	rt_uncached;
 	struct uncached_list	*rt_uncached_list;
+	struct fib_info		*fi; /* for refcnt to shared metrics */
 };
 
 static inline bool rt_is_input_route(const struct rtable *rt)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 655d9ee..f647310 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1387,6 +1387,11 @@ static void ipv4_dst_destroy(struct dst_entry *dst)
 {
 	struct rtable *rt = (struct rtable *) dst;
 
+	if (rt->fi) {
+		fib_info_put(rt->fi);
+		rt->fi = NULL;
+	}
+
 	if (!list_empty(&rt->rt_uncached)) {
 		struct uncached_list *ul = rt->rt_uncached_list;
 
@@ -1424,6 +1429,16 @@ static bool rt_cache_valid(const struct rtable *rt)
 		!rt_is_expired(rt);
 }
 
+static void rt_init_metrics(struct rtable *rt, struct fib_info *fi)
+{
+	if (fi->fib_metrics != (u32 *)dst_default_metrics) {
+		fib_info_hold(fi);
+		rt->fi = fi;
+	}
+
+	dst_init_metrics(&rt->dst, fi->fib_metrics, true);
+}
+
 static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 			   const struct fib_result *res,
 			   struct fib_nh_exception *fnhe,
@@ -1438,7 +1453,7 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 			rt->rt_gateway = nh->nh_gw;
 			rt->rt_uses_gateway = 1;
 		}
-		dst_init_metrics(&rt->dst, fi->fib_metrics, true);
+		rt_init_metrics(rt, fi);
 #ifdef CONFIG_IP_ROUTE_CLASSID
 		rt->dst.tclassid = nh->nh_tclassid;
 #endif
@@ -1490,6 +1505,7 @@ struct rtable *rt_dst_alloc(struct net_device *dev,
 		rt->rt_gateway = 0;
 		rt->rt_uses_gateway = 0;
 		rt->rt_table_id = 0;
+		rt->fi = NULL;
 		INIT_LIST_HEAD(&rt->rt_uncached);
 
 		rt->dst.output = ip_output;
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-04 21:54 [Patch net] ipv4: restore rt->fi for reference counting Cong Wang
@ 2017-05-07 18:53 ` Eric Dumazet
  2017-05-08 18:35 ` David Miller
  1 sibling, 0 replies; 43+ messages in thread
From: Eric Dumazet @ 2017-05-07 18:53 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, andreyknvl, edumazet

On Thu, 2017-05-04 at 14:54 -0700, Cong Wang wrote:
> IPv4 dst could use fi->fib_metrics to store metrics but fib_info
> itself is refcnt'ed, so without taking a refcnt fi and
> fi->fib_metrics could be freed while dst metrics still points to
> it. This triggers use-after-free as reported by Andrey twice.
> 
> This patch reverts commit 2860583fe840 ("ipv4: Kill rt->fi") to
> restore this reference counting. It is a quick fix for -net and
> -stable, for -net-next, as Eric suggested, we can consider doing
> reference counting for metrics itself instead of relying on fib_info.
> 
> IPv6 is very different, it copies or steals the metrics from mx6_config
> in fib6_commit_metrics() so probably doesn't need a refcnt.
> 
> Decnet has already done the refcnt'ing, see dn_fib_semantic_match().
> 
> Fixes: 2860583fe840 ("ipv4: Kill rt->fi")
> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> Tested-by: Andrey Konovalov <andreyknvl@google.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Acked-by: Eric Dumazet <edumazet@google.com>

Thanks !

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-04 21:54 [Patch net] ipv4: restore rt->fi for reference counting Cong Wang
  2017-05-07 18:53 ` Eric Dumazet
@ 2017-05-08 18:35 ` David Miller
  2017-05-09  0:01   ` Eric Dumazet
  1 sibling, 1 reply; 43+ messages in thread
From: David Miller @ 2017-05-08 18:35 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, andreyknvl, edumazet

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Thu,  4 May 2017 14:54:17 -0700

> IPv4 dst could use fi->fib_metrics to store metrics but fib_info
> itself is refcnt'ed, so without taking a refcnt fi and
> fi->fib_metrics could be freed while dst metrics still points to
> it. This triggers use-after-free as reported by Andrey twice.
> 
> This patch reverts commit 2860583fe840 ("ipv4: Kill rt->fi") to
> restore this reference counting. It is a quick fix for -net and
> -stable, for -net-next, as Eric suggested, we can consider doing
> reference counting for metrics itself instead of relying on fib_info.
> 
> IPv6 is very different, it copies or steals the metrics from mx6_config
> in fib6_commit_metrics() so probably doesn't need a refcnt.
> 
> Decnet has already done the refcnt'ing, see dn_fib_semantic_match().
> 
> Fixes: 2860583fe840 ("ipv4: Kill rt->fi")
> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> Tested-by: Andrey Konovalov <andreyknvl@google.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-08 18:35 ` David Miller
@ 2017-05-09  0:01   ` Eric Dumazet
  2017-05-09  1:22     ` David Miller
  0 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2017-05-09  0:01 UTC (permalink / raw)
  To: David Miller; +Cc: xiyou.wangcong, netdev, andreyknvl, edumazet

On Mon, 2017-05-08 at 14:35 -0400, David Miller wrote:
> From: Cong Wang <xiyou.wangcong@gmail.com>
> Date: Thu,  4 May 2017 14:54:17 -0700
> 
> > IPv4 dst could use fi->fib_metrics to store metrics but fib_info
> > itself is refcnt'ed, so without taking a refcnt fi and
> > fi->fib_metrics could be freed while dst metrics still points to
> > it. This triggers use-after-free as reported by Andrey twice.
> > 
> > This patch reverts commit 2860583fe840 ("ipv4: Kill rt->fi") to
> > restore this reference counting. It is a quick fix for -net and
> > -stable, for -net-next, as Eric suggested, we can consider doing
> > reference counting for metrics itself instead of relying on fib_info.
> > 
> > IPv6 is very different, it copies or steals the metrics from mx6_config
> > in fib6_commit_metrics() so probably doesn't need a refcnt.
> > 
> > Decnet has already done the refcnt'ing, see dn_fib_semantic_match().
> > 
> > Fixes: 2860583fe840 ("ipv4: Kill rt->fi")
> > Reported-by: Andrey Konovalov <andreyknvl@google.com>
> > Tested-by: Andrey Konovalov <andreyknvl@google.com>
> > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> 
> Applied and queued up for -stable, thanks.

Although I now have on latest net tree these messages when I reboot my
test machine.

[  224.085873] unregister_netdevice: waiting for eth0 to become free. Usage count = 43
[  234.106018] unregister_netdevice: waiting for eth0 to become free. Usage count = 43
[  244.366187] unregister_netdevice: waiting for eth0 to become free. Usage count = 43
[  254.626325] unregister_netdevice: waiting for eth0 to become free. Usage count = 43
[  264.706472] unregister_netdevice: waiting for eth0 to become free. Usage count = 43
[  274.736619] unregister_netdevice: waiting for eth0 to become free. Usage count = 43
[  284.766766] unregister_netdevice: waiting for eth0 to become free. Usage count = 43

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-09  0:01   ` Eric Dumazet
@ 2017-05-09  1:22     ` David Miller
  2017-05-09  2:18       ` Eric Dumazet
  0 siblings, 1 reply; 43+ messages in thread
From: David Miller @ 2017-05-09  1:22 UTC (permalink / raw)
  To: eric.dumazet; +Cc: xiyou.wangcong, netdev, andreyknvl, edumazet

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 08 May 2017 17:01:20 -0700

> On Mon, 2017-05-08 at 14:35 -0400, David Miller wrote:
>> From: Cong Wang <xiyou.wangcong@gmail.com>
>> Date: Thu,  4 May 2017 14:54:17 -0700
>> 
>> > IPv4 dst could use fi->fib_metrics to store metrics but fib_info
>> > itself is refcnt'ed, so without taking a refcnt fi and
>> > fi->fib_metrics could be freed while dst metrics still points to
>> > it. This triggers use-after-free as reported by Andrey twice.
>> > 
>> > This patch reverts commit 2860583fe840 ("ipv4: Kill rt->fi") to
>> > restore this reference counting. It is a quick fix for -net and
>> > -stable, for -net-next, as Eric suggested, we can consider doing
>> > reference counting for metrics itself instead of relying on fib_info.
>> > 
>> > IPv6 is very different, it copies or steals the metrics from mx6_config
>> > in fib6_commit_metrics() so probably doesn't need a refcnt.
>> > 
>> > Decnet has already done the refcnt'ing, see dn_fib_semantic_match().
>> > 
>> > Fixes: 2860583fe840 ("ipv4: Kill rt->fi")
>> > Reported-by: Andrey Konovalov <andreyknvl@google.com>
>> > Tested-by: Andrey Konovalov <andreyknvl@google.com>
>> > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>> 
>> Applied and queued up for -stable, thanks.
> 
> Although I now have on latest net tree these messages when I reboot my
> test machine.
> 
> [  224.085873] unregister_netdevice: waiting for eth0 to become free. Usage count = 43

Strange, the refcounting looks quite OK in the patch you're quoting.
I looked over it a few times and cannot figure out a possible cause
there.

I am assuming you are quite confident it is this change?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-09  1:22     ` David Miller
@ 2017-05-09  2:18       ` Eric Dumazet
  2017-05-09  2:35         ` David Miller
  2017-05-09 16:44         ` Cong Wang
  0 siblings, 2 replies; 43+ messages in thread
From: Eric Dumazet @ 2017-05-09  2:18 UTC (permalink / raw)
  To: David Miller; +Cc: xiyou.wangcong, netdev, andreyknvl, edumazet

On Mon, 2017-05-08 at 21:22 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 08 May 2017 17:01:20 -0700
> 
> > On Mon, 2017-05-08 at 14:35 -0400, David Miller wrote:
> >> From: Cong Wang <xiyou.wangcong@gmail.com>
> >> Date: Thu,  4 May 2017 14:54:17 -0700
> >> 
> >> > IPv4 dst could use fi->fib_metrics to store metrics but fib_info
> >> > itself is refcnt'ed, so without taking a refcnt fi and
> >> > fi->fib_metrics could be freed while dst metrics still points to
> >> > it. This triggers use-after-free as reported by Andrey twice.
> >> > 
> >> > This patch reverts commit 2860583fe840 ("ipv4: Kill rt->fi") to
> >> > restore this reference counting. It is a quick fix for -net and
> >> > -stable, for -net-next, as Eric suggested, we can consider doing
> >> > reference counting for metrics itself instead of relying on fib_info.
> >> > 
> >> > IPv6 is very different, it copies or steals the metrics from mx6_config
> >> > in fib6_commit_metrics() so probably doesn't need a refcnt.
> >> > 
> >> > Decnet has already done the refcnt'ing, see dn_fib_semantic_match().
> >> > 
> >> > Fixes: 2860583fe840 ("ipv4: Kill rt->fi")
> >> > Reported-by: Andrey Konovalov <andreyknvl@google.com>
> >> > Tested-by: Andrey Konovalov <andreyknvl@google.com>
> >> > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> >> 
> >> Applied and queued up for -stable, thanks.
> > 
> > Although I now have on latest net tree these messages when I reboot my
> > test machine.
> > 
> > [  224.085873] unregister_netdevice: waiting for eth0 to become free. Usage count = 43
> 
> Strange, the refcounting looks quite OK in the patch you're quoting.
> I looked over it a few times and cannot figure out a possible cause
> there.
> 
> I am assuming you are quite confident it is this change?

At least, reverting the patch resolves the issue for me.

Keeping fib (and their reference to netdev) is apparently too much,
we probably need to implement a refcount on the metrics themselves,
being stand alone objects.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-09  2:18       ` Eric Dumazet
@ 2017-05-09  2:35         ` David Miller
  2017-05-09 16:44         ` Cong Wang
  1 sibling, 0 replies; 43+ messages in thread
From: David Miller @ 2017-05-09  2:35 UTC (permalink / raw)
  To: eric.dumazet; +Cc: xiyou.wangcong, netdev, andreyknvl, edumazet

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 08 May 2017 19:18:22 -0700

> Keeping fib (and their reference to netdev) is apparently too much,
> we probably need to implement a refcount on the metrics themselves,
> being stand alone objects.

I think I see the problem, when we flush the dst cache on device
unregister, we point the dst at loopback but do not flush metrics.

I'm going to revert.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-09  2:18       ` Eric Dumazet
  2017-05-09  2:35         ` David Miller
@ 2017-05-09 16:44         ` Cong Wang
  2017-05-09 16:56           ` Eric Dumazet
  1 sibling, 1 reply; 43+ messages in thread
From: Cong Wang @ 2017-05-09 16:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Linux Kernel Network Developers, Andrey Konovalov,
	Eric Dumazet

On Mon, May 8, 2017 at 7:18 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2017-05-08 at 21:22 -0400, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Mon, 08 May 2017 17:01:20 -0700
>>
>> > On Mon, 2017-05-08 at 14:35 -0400, David Miller wrote:
>> >> From: Cong Wang <xiyou.wangcong@gmail.com>
>> >> Date: Thu,  4 May 2017 14:54:17 -0700
>> >>
>> >> > IPv4 dst could use fi->fib_metrics to store metrics but fib_info
>> >> > itself is refcnt'ed, so without taking a refcnt fi and
>> >> > fi->fib_metrics could be freed while dst metrics still points to
>> >> > it. This triggers use-after-free as reported by Andrey twice.
>> >> >
>> >> > This patch reverts commit 2860583fe840 ("ipv4: Kill rt->fi") to
>> >> > restore this reference counting. It is a quick fix for -net and
>> >> > -stable, for -net-next, as Eric suggested, we can consider doing
>> >> > reference counting for metrics itself instead of relying on fib_info.
>> >> >
>> >> > IPv6 is very different, it copies or steals the metrics from mx6_config
>> >> > in fib6_commit_metrics() so probably doesn't need a refcnt.
>> >> >
>> >> > Decnet has already done the refcnt'ing, see dn_fib_semantic_match().
>> >> >
>> >> > Fixes: 2860583fe840 ("ipv4: Kill rt->fi")
>> >> > Reported-by: Andrey Konovalov <andreyknvl@google.com>
>> >> > Tested-by: Andrey Konovalov <andreyknvl@google.com>
>> >> > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>> >>
>> >> Applied and queued up for -stable, thanks.
>> >
>> > Although I now have on latest net tree these messages when I reboot my
>> > test machine.
>> >
>> > [  224.085873] unregister_netdevice: waiting for eth0 to become free. Usage count = 43
>>
>> Strange, the refcounting looks quite OK in the patch you're quoting.
>> I looked over it a few times and cannot figure out a possible cause
>> there.


Eric, how did you produce it?
I guess it's because of nh_dev which is the only netdevice pointer inside
fib_info. Let me take a deeper look.

>>
>> I am assuming you are quite confident it is this change?
>
> At least, reverting the patch resolves the issue for me.
>
> Keeping fib (and their reference to netdev) is apparently too much,
> we probably need to implement a refcount on the metrics themselves,
> being stand alone objects.

I don't disagree, just that it may need to change too much code which
goes beyond a stable candidate.

Thanks for the bug report!

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-09 16:44         ` Cong Wang
@ 2017-05-09 16:56           ` Eric Dumazet
  2017-05-09 20:56             ` Cong Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2017-05-09 16:56 UTC (permalink / raw)
  To: Cong Wang
  Cc: David Miller, Linux Kernel Network Developers, Andrey Konovalov,
	Eric Dumazet

On Tue, 2017-05-09 at 09:44 -0700, Cong Wang wrote:

> 
> Eric, how did you produce it?
> I guess it's because of nh_dev which is the only netdevice pointer inside
> fib_info. Let me take a deeper look.
> 

Nothing particular, I am using kexec to boot new kernels, and all my
attempts with your patch included demonstrated the issue.

eth0 is a bonding device, it might matter, I do not know.

We also have some tunnels, but unfortunately I can not provide a setup
that you could use on say a VM.

I can send you the .config if this can help

> >>
> >> I am assuming you are quite confident it is this change?
> >
> > At least, reverting the patch resolves the issue for me.
> >
> > Keeping fib (and their reference to netdev) is apparently too much,
> > we probably need to implement a refcount on the metrics themselves,
> > being stand alone objects.
> 
> I don't disagree, just that it may need to change too much code which
> goes beyond a stable candidate.

Well, your choice, but dealing with a full blown fib and its
dependencies look fragile to me.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-09 16:56           ` Eric Dumazet
@ 2017-05-09 20:56             ` Cong Wang
  2017-05-09 22:07               ` Cong Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Cong Wang @ 2017-05-09 20:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Linux Kernel Network Developers, Andrey Konovalov,
	Eric Dumazet

On Tue, May 9, 2017 at 9:56 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2017-05-09 at 09:44 -0700, Cong Wang wrote:
>
>>
>> Eric, how did you produce it?
>> I guess it's because of nh_dev which is the only netdevice pointer inside
>> fib_info. Let me take a deeper look.
>>
>
> Nothing particular, I am using kexec to boot new kernels, and all my
> attempts with your patch included demonstrated the issue.


Interesting. So this happens on your eth0 teardown path, but
I don't see any problem in the related NETDEV_UNREGISTER
notifiers yet, we don't call dst_destroy() on this path and will defer it
to GC, but that should not be delayed for so long?

Wait... if we transfer dst->dev to loopback_dev because we don't
want to block unregister path, then we might have a similar problem
for rt->fi too, fib_info is still referenced by dst, so these nh_dev's still
hold the dev references...

Something like this (just for proof of concept):

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index da449dd..85cd614 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -205,7 +205,7 @@ static void free_fib_info_rcu(struct rcu_head *head)
        struct fib_info *fi = container_of(head, struct fib_info, rcu);

        change_nexthops(fi) {
-               if (nexthop_nh->nh_dev)
+               if (nexthop_nh->nh_dev && !(nexthop_nh->nh_flags & RTNH_F_DEAD))
                        dev_put(nexthop_nh->nh_dev);
                lwtstate_put(nexthop_nh->nh_lwtstate);
                free_nh_exceptions(nexthop_nh);
@@ -1414,8 +1414,10 @@ int fib_sync_down_dev(struct net_device *dev,
unsigned long event, bool force)
                        else if (nexthop_nh->nh_dev == dev &&
                                 nexthop_nh->nh_scope != scope) {
                                switch (event) {
-                               case NETDEV_DOWN:
                                case NETDEV_UNREGISTER:
+                                       dev_put(dev);
+                                       /* fall through */
+                               case NETDEV_DOWN:
                                        nexthop_nh->nh_flags |= RTNH_F_DEAD;
                                        /* fall through */
                                case NETDEV_CHANGE:

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-09 20:56             ` Cong Wang
@ 2017-05-09 22:07               ` Cong Wang
  2017-05-09 22:52                 ` Eric Dumazet
  0 siblings, 1 reply; 43+ messages in thread
From: Cong Wang @ 2017-05-09 22:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Linux Kernel Network Developers, Andrey Konovalov,
	Eric Dumazet

[-- Attachment #1: Type: text/plain, Size: 397 bytes --]

On Tue, May 9, 2017 at 1:56 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> Wait... if we transfer dst->dev to loopback_dev because we don't
> want to block unregister path, then we might have a similar problem
> for rt->fi too, fib_info is still referenced by dst, so these nh_dev's still
> hold the dev references...
>

I finally come up with the attach patch... Do you mind to give it a try?

[-- Attachment #2: ipv4-rt-fi-ref-count.diff --]
[-- Type: text/plain, Size: 2821 bytes --]

commit a431b4d969f617c5ef8711b6bf493199eecec759
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date:   Tue May 9 14:35:10 2017 -0700

    ipv4: restore rt->fi for reference counting, try #2
    
    Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

diff --git a/include/net/route.h b/include/net/route.h
index 2cc0e14..4335eb7 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -69,6 +69,7 @@ struct rtable {
 
 	struct list_head	rt_uncached;
 	struct uncached_list	*rt_uncached_list;
+	struct fib_info		*fi; /* for refcnt to shared metrics */
 };
 
 static inline bool rt_is_input_route(const struct rtable *rt)
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index da449dd..d77c453 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1429,8 +1429,15 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event, bool force)
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 			if (event == NETDEV_UNREGISTER &&
 			    nexthop_nh->nh_dev == dev) {
+				/* This should be fine, we are on unregister
+				 * path so synchronize_net() already waits for
+				 * existing readers. We have to release the
+				 * dev here because dst could still hold this
+				 * fib_info via rt->fi, we can't wait for GC.
+				 */
+				RCU_INIT_POINTER(nexthop_nh->nh_dev, NULL);
+				dev_put(dev);
 				dead = fi->fib_nhs;
-				break;
 			}
 #endif
 		} endfor_nexthops(fi)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 655d9ee..f647310 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1387,6 +1387,11 @@ static void ipv4_dst_destroy(struct dst_entry *dst)
 {
 	struct rtable *rt = (struct rtable *) dst;
 
+	if (rt->fi) {
+		fib_info_put(rt->fi);
+		rt->fi = NULL;
+	}
+
 	if (!list_empty(&rt->rt_uncached)) {
 		struct uncached_list *ul = rt->rt_uncached_list;
 
@@ -1424,6 +1429,16 @@ static bool rt_cache_valid(const struct rtable *rt)
 		!rt_is_expired(rt);
 }
 
+static void rt_init_metrics(struct rtable *rt, struct fib_info *fi)
+{
+	if (fi->fib_metrics != (u32 *)dst_default_metrics) {
+		fib_info_hold(fi);
+		rt->fi = fi;
+	}
+
+	dst_init_metrics(&rt->dst, fi->fib_metrics, true);
+}
+
 static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 			   const struct fib_result *res,
 			   struct fib_nh_exception *fnhe,
@@ -1438,7 +1453,7 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 			rt->rt_gateway = nh->nh_gw;
 			rt->rt_uses_gateway = 1;
 		}
-		dst_init_metrics(&rt->dst, fi->fib_metrics, true);
+		rt_init_metrics(rt, fi);
 #ifdef CONFIG_IP_ROUTE_CLASSID
 		rt->dst.tclassid = nh->nh_tclassid;
 #endif
@@ -1490,6 +1505,7 @@ struct rtable *rt_dst_alloc(struct net_device *dev,
 		rt->rt_gateway = 0;
 		rt->rt_uses_gateway = 0;
 		rt->rt_table_id = 0;
+		rt->fi = NULL;
 		INIT_LIST_HEAD(&rt->rt_uncached);
 
 		rt->dst.output = ip_output;

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-09 22:07               ` Cong Wang
@ 2017-05-09 22:52                 ` Eric Dumazet
  2017-05-09 22:54                   ` Eric Dumazet
  0 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2017-05-09 22:52 UTC (permalink / raw)
  To: Cong Wang
  Cc: David Miller, Linux Kernel Network Developers, Andrey Konovalov,
	Eric Dumazet

On Tue, 2017-05-09 at 15:07 -0700, Cong Wang wrote:
> On Tue, May 9, 2017 at 1:56 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > Wait... if we transfer dst->dev to loopback_dev because we don't
> > want to block unregister path, then we might have a similar problem
> > for rt->fi too, fib_info is still referenced by dst, so these nh_dev's still
> > hold the dev references...
> >
> 
> I finally come up with the attach patch... Do you mind to give it a try?

I will, but this might be delayed by a few hours.

In the mean time, it looks like you could try adding the following to
your .config ;)

CONFIG_IP_ROUTE_MULTIPATH=y

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-09 22:52                 ` Eric Dumazet
@ 2017-05-09 22:54                   ` Eric Dumazet
  2017-05-09 23:09                     ` Eric Dumazet
  0 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2017-05-09 22:54 UTC (permalink / raw)
  To: Cong Wang
  Cc: David Miller, Linux Kernel Network Developers, Andrey Konovalov,
	Eric Dumazet

On Tue, 2017-05-09 at 15:52 -0700, Eric Dumazet wrote:
> On Tue, 2017-05-09 at 15:07 -0700, Cong Wang wrote:
> > On Tue, May 9, 2017 at 1:56 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > Wait... if we transfer dst->dev to loopback_dev because we don't
> > > want to block unregister path, then we might have a similar problem
> > > for rt->fi too, fib_info is still referenced by dst, so these nh_dev's still
> > > hold the dev references...
> > >
> > 
> > I finally come up with the attach patch... Do you mind to give it a try?
> 
> I will, but this might be delayed by a few hours.
> 
> In the mean time, it looks like you could try adding the following to
> your .config ;)
> 
> CONFIG_IP_ROUTE_MULTIPATH=y
> 
> 

+                               /* This should be fine, we are on unregister
+                                * path so synchronize_net() already waits for
+                                * existing readers. We have to release the
+                                * dev here because dst could still hold this
+                                * fib_info via rt->fi, we can't wait for GC.
+                                */
+                               RCU_INIT_POINTER(nexthop_nh->nh_dev, NULL);
+                               dev_put(dev);
                                dead = fi->fib_nhs;

dead = fi->fib_mhs looks wrong if you remove the break; statement ?

-                               break;
                        }

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-09 22:54                   ` Eric Dumazet
@ 2017-05-09 23:09                     ` Eric Dumazet
  2017-05-09 23:35                       ` Cong Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2017-05-09 23:09 UTC (permalink / raw)
  To: Cong Wang
  Cc: David Miller, Linux Kernel Network Developers, Andrey Konovalov,
	Eric Dumazet

On Tue, 2017-05-09 at 15:54 -0700, Eric Dumazet wrote:
> On Tue, 2017-05-09 at 15:52 -0700, Eric Dumazet wrote:
> > On Tue, 2017-05-09 at 15:07 -0700, Cong Wang wrote:
> > > On Tue, May 9, 2017 at 1:56 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > > Wait... if we transfer dst->dev to loopback_dev because we don't
> > > > want to block unregister path, then we might have a similar problem
> > > > for rt->fi too, fib_info is still referenced by dst, so these nh_dev's still
> > > > hold the dev references...
> > > >
> > > 
> > > I finally come up with the attach patch... Do you mind to give it a try?
> > 
> > I will, but this might be delayed by a few hours.
> > 
> > In the mean time, it looks like you could try adding the following to
> > your .config ;)
> > 
> > CONFIG_IP_ROUTE_MULTIPATH=y
> > 
> > 
> 
> +                               /* This should be fine, we are on unregister
> +                                * path so synchronize_net() already waits for
> +                                * existing readers. We have to release the
> +                                * dev here because dst could still hold this
> +                                * fib_info via rt->fi, we can't wait for GC.
> +                                */
> +                               RCU_INIT_POINTER(nexthop_nh->nh_dev, NULL);
> +                               dev_put(dev);
>                                 dead = fi->fib_nhs;
> 
> dead = fi->fib_mhs looks wrong if you remove the break; statement ?
> 
> -                               break;

Also setting nexthop_nh->nh_dev to NULL looks quite dangerous

We have plenty of sites doing :

if (fi->fib_dev)
    x = fi->fib_dev->field

fib_route_seq_show() is one example.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-09 23:09                     ` Eric Dumazet
@ 2017-05-09 23:35                       ` Cong Wang
  2017-05-09 23:50                         ` Eric Dumazet
                                           ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Cong Wang @ 2017-05-09 23:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Linux Kernel Network Developers, Andrey Konovalov,
	Eric Dumazet

On Tue, May 9, 2017 at 4:09 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2017-05-09 at 15:54 -0700, Eric Dumazet wrote:
>> On Tue, 2017-05-09 at 15:52 -0700, Eric Dumazet wrote:
>> > On Tue, 2017-05-09 at 15:07 -0700, Cong Wang wrote:
>> > > On Tue, May 9, 2017 at 1:56 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> > > > Wait... if we transfer dst->dev to loopback_dev because we don't
>> > > > want to block unregister path, then we might have a similar problem
>> > > > for rt->fi too, fib_info is still referenced by dst, so these nh_dev's still
>> > > > hold the dev references...
>> > > >
>> > >
>> > > I finally come up with the attach patch... Do you mind to give it a try?
>> >
>> > I will, but this might be delayed by a few hours.
>> >
>> > In the mean time, it looks like you could try adding the following to
>> > your .config ;)
>> >
>> > CONFIG_IP_ROUTE_MULTIPATH=y
>> >
>> >
>>
>> +                               /* This should be fine, we are on unregister
>> +                                * path so synchronize_net() already waits for
>> +                                * existing readers. We have to release the
>> +                                * dev here because dst could still hold this
>> +                                * fib_info via rt->fi, we can't wait for GC.
>> +                                */
>> +                               RCU_INIT_POINTER(nexthop_nh->nh_dev, NULL);
>> +                               dev_put(dev);
>>                                 dead = fi->fib_nhs;
>>
>> dead = fi->fib_mhs looks wrong if you remove the break; statement ?
>>
>> -                               break;

This statement is only used to ensure we pass the "dead == fi->fib_nhs"
check right below the inner loop, it is fine to keep it without break since
fi is not changed in the inner loop.


>
> Also setting nexthop_nh->nh_dev to NULL looks quite dangerous
>
> We have plenty of sites doing :
>
> if (fi->fib_dev)
>     x = fi->fib_dev->field
>
> fib_route_seq_show() is one example.
>

All of them take RCU read lock, so, as I explained in the code comment,
they all should be fine because of synchronize_net() on unregister path.
Do you see anything otherwise?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-09 23:35                       ` Cong Wang
@ 2017-05-09 23:50                         ` Eric Dumazet
  2017-05-10 16:32                           ` Cong Wang
  2017-05-09 23:51                         ` Eric Dumazet
  2017-05-10  7:38                         ` Julian Anastasov
  2 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2017-05-09 23:50 UTC (permalink / raw)
  To: Cong Wang
  Cc: David Miller, Linux Kernel Network Developers, Andrey Konovalov,
	Eric Dumazet

On Tue, 2017-05-09 at 16:35 -0700, Cong Wang wrote:

> All of them take RCU read lock, so, as I explained in the code comment,
> they all should be fine because of synchronize_net() on unregister path.
> Do you see anything otherwise?

They might take rcu lock, but compiler is still allowed to read
fi->fib_dev multiple times, and crashes might happen.

You will need to audit all code and fix it, using proper
rcu_dereference() or similar code ensuring compiler wont do stupid
things.

Like :

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 1201409ba1dcb18ee028003b065410b87bf4a602..ab69517befbb5f300af785fbb20071a3d1086593 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2666,11 +2666,13 @@ static int fib_route_seq_show(struct seq_file *seq, void *v)
 
 		seq_setwidth(seq, 127);
 
-		if (fi)
+		if (fi) {
+			struct net_device *dev = rcu_dereference(fi->fib_dev);
+
 			seq_printf(seq,
 				   "%s\t%08X\t%08X\t%04X\t%d\t%u\t"
 				   "%d\t%08X\t%d\t%u\t%u",
-				   fi->fib_dev ? fi->fib_dev->name : "*",
+				   dev ? dev->name : "*",
 				   prefix,
 				   fi->fib_nh->nh_gw, flags, 0, 0,
 				   fi->fib_priority,
@@ -2679,13 +2681,13 @@ static int fib_route_seq_show(struct seq_file *seq, void *v)
 				    fi->fib_advmss + 40 : 0),
 				   fi->fib_window,
 				   fi->fib_rtt >> 3);
-		else
+		} else {
 			seq_printf(seq,
 				   "*\t%08X\t%08X\t%04X\t%d\t%u\t"
 				   "%d\t%08X\t%d\t%u\t%u",
 				   prefix, 0, flags, 0, 0, 0,
 				   mask, 0, 0, 0);
-
+		}
 		seq_pad(seq, '\n');
 	}
 

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-09 23:35                       ` Cong Wang
  2017-05-09 23:50                         ` Eric Dumazet
@ 2017-05-09 23:51                         ` Eric Dumazet
  2017-05-10 16:40                           ` Cong Wang
  2017-05-10  7:38                         ` Julian Anastasov
  2 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2017-05-09 23:51 UTC (permalink / raw)
  To: Cong Wang
  Cc: David Miller, Linux Kernel Network Developers, Andrey Konovalov,
	Eric Dumazet

On Tue, 2017-05-09 at 16:35 -0700, Cong Wang wrote:

> This statement is only used to ensure we pass the "dead == fi->fib_nhs"
> check right below the inner loop, it is fine to keep it without break since
> fi is not changed in the inner loop.
> 

So the dead++ above wont end up with (dead > fi->fib_nhs) ?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-09 23:35                       ` Cong Wang
  2017-05-09 23:50                         ` Eric Dumazet
  2017-05-09 23:51                         ` Eric Dumazet
@ 2017-05-10  7:38                         ` Julian Anastasov
  2017-05-10 17:00                           ` Cong Wang
  2 siblings, 1 reply; 43+ messages in thread
From: Julian Anastasov @ 2017-05-10  7:38 UTC (permalink / raw)
  To: Cong Wang
  Cc: Eric Dumazet, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet


	Hello,

On Tue, 9 May 2017, Cong Wang wrote:

> > Also setting nexthop_nh->nh_dev to NULL looks quite dangerous
> >
> > We have plenty of sites doing :
> >
> > if (fi->fib_dev)
> >     x = fi->fib_dev->field
> >
> > fib_route_seq_show() is one example.
> >
> 
> All of them take RCU read lock, so, as I explained in the code comment,
> they all should be fine because of synchronize_net() on unregister path.
> Do you see anything otherwise?

	During NETDEV_UNREGISTER packets for dev should not
be flying but packets for other devs can walk the nexthops
for multipath routes. It is the rcu_barrier before
NETDEV_UNREGISTER_FINAL that allows nh_dev to be set to NULL
during this grace period but there are many places to fix that
assume nh_dev!=NULL.

	But why we leak routes? Is there some place that holds
routes without listening for NETDEV_UNREGISTER? On fib_flush
the infos are unlinked from trees, so after a grace period packets
should not see/hold such infos. If we hold routes somewhere for
long time, problem can happen also for routes with single nexthop.

Regards

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-09 23:50                         ` Eric Dumazet
@ 2017-05-10 16:32                           ` Cong Wang
  0 siblings, 0 replies; 43+ messages in thread
From: Cong Wang @ 2017-05-10 16:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Linux Kernel Network Developers, Andrey Konovalov,
	Eric Dumazet

On Tue, May 9, 2017 at 4:50 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2017-05-09 at 16:35 -0700, Cong Wang wrote:
>
>> All of them take RCU read lock, so, as I explained in the code comment,
>> they all should be fine because of synchronize_net() on unregister path.
>> Do you see anything otherwise?
>
> They might take rcu lock, but compiler is still allowed to read
> fi->fib_dev multiple times, and crashes might happen.
>
> You will need to audit all code and fix it, using proper
> rcu_dereference() or similar code ensuring compiler wont do stupid
> things.
>

Point taken. But without my patch, nh_dev is supposed to be protected
by RCU too, it is freed in a rcu callback and dereferenced like:

struct in_device *in_dev = __in_dev_get_rcu(nh->nh_dev);

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-09 23:51                         ` Eric Dumazet
@ 2017-05-10 16:40                           ` Cong Wang
  0 siblings, 0 replies; 43+ messages in thread
From: Cong Wang @ 2017-05-10 16:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Linux Kernel Network Developers, Andrey Konovalov,
	Eric Dumazet

On Tue, May 9, 2017 at 4:51 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2017-05-09 at 16:35 -0700, Cong Wang wrote:
>
>> This statement is only used to ensure we pass the "dead == fi->fib_nhs"
>> check right below the inner loop, it is fine to keep it without break since
>> fi is not changed in the inner loop.
>>
>
> So the dead++ above wont end up with (dead > fi->fib_nhs) ?

Good point, it could happen, we probably need another boolean to
address this.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-10  7:38                         ` Julian Anastasov
@ 2017-05-10 17:00                           ` Cong Wang
  2017-05-10 19:51                             ` Julian Anastasov
  0 siblings, 1 reply; 43+ messages in thread
From: Cong Wang @ 2017-05-10 17:00 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Eric Dumazet, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet

On Wed, May 10, 2017 at 12:38 AM, Julian Anastasov <ja@ssi.bg> wrote:
>
>         Hello,
>
> On Tue, 9 May 2017, Cong Wang wrote:
>
>> > Also setting nexthop_nh->nh_dev to NULL looks quite dangerous
>> >
>> > We have plenty of sites doing :
>> >
>> > if (fi->fib_dev)
>> >     x = fi->fib_dev->field
>> >
>> > fib_route_seq_show() is one example.
>> >
>>
>> All of them take RCU read lock, so, as I explained in the code comment,
>> they all should be fine because of synchronize_net() on unregister path.
>> Do you see anything otherwise?
>
>         During NETDEV_UNREGISTER packets for dev should not
> be flying but packets for other devs can walk the nexthops
> for multipath routes. It is the rcu_barrier before
> NETDEV_UNREGISTER_FINAL that allows nh_dev to be set to NULL
> during this grace period but there are many places to fix that
> assume nh_dev!=NULL.

Excellent point. Unfortunately NETDEV_UNREGISTER_FINAL is not
yet captured by the fib event notifier.

I think readers are still safe to assume nh_dev!=NULL since we wait
for existing readers, readers coming later can't see it any more. Or
am I missing anything?

>
>         But why we leak routes? Is there some place that holds
> routes without listening for NETDEV_UNREGISTER? On fib_flush
> the infos are unlinked from trees, so after a grace period packets
> should not see/hold such infos. If we hold routes somewhere for
> long time, problem can happen also for routes with single nexthop.

My theory is that dst which holds a refcnt to fib_info (of course, after
this patch) is moved in GC after NETDEV_UNREGISTER but still
referenced somewhere, it therefore holds these nh_dev's, which
in turn hold back to the netdevice which is being unregistered, thus
Eric saw these refcnt leak warnings.

I am not sure if sitting in GC for a long time is problematic or not,
but from the code where we transfer dst->dev to loopback_dev,
it seems this is expected otherwise no need to transfer? But I don't
dig the history though.

Thanks.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-10 17:00                           ` Cong Wang
@ 2017-05-10 19:51                             ` Julian Anastasov
  2017-05-12  0:07                               ` Cong Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Julian Anastasov @ 2017-05-10 19:51 UTC (permalink / raw)
  To: Cong Wang
  Cc: Eric Dumazet, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet


	Hello,

On Wed, 10 May 2017, Cong Wang wrote:

> On Wed, May 10, 2017 at 12:38 AM, Julian Anastasov <ja@ssi.bg> wrote:
> >
> >         During NETDEV_UNREGISTER packets for dev should not
> > be flying but packets for other devs can walk the nexthops
> > for multipath routes. It is the rcu_barrier before
> > NETDEV_UNREGISTER_FINAL that allows nh_dev to be set to NULL
> > during this grace period but there are many places to fix that
> > assume nh_dev!=NULL.
> 
> Excellent point. Unfortunately NETDEV_UNREGISTER_FINAL is not
> yet captured by the fib event notifier.
> 
> I think readers are still safe to assume nh_dev!=NULL since we wait
> for existing readers, readers coming later can't see it any more. Or
> am I missing anything?

	It is not safe because other devs can deliver packets
that still can see the multipath fib info in the FIB nodes.

CPU 1 (unreg dev)               CPU 2 (packet to another dev2)
NETDEV_DOWN => set RTNH_F_DEAD

Now traffic sent via downed
dev should stop

flush_all_backlogs
synchronize_net

Now received traffic for dev
should stop on all CPUs

fib_sync_down_dev
nh_dev = NULL
                                fib_table_lookup: boom in
                                __in_dev_get_rcu(nh->nh_dev),
                                all nh_dev dereferences must be
                                checked. If checked, it should be
                                safe to use this nh_dev due to the
                                rcu_barrier in netdev_run_todo.
                                But only the structure can be accessed,
                                traffic should not go via unreged dev.

fib_flush: unlink fi
from fa_info in
fib_table_flush

                                Now other CPUs will not see this
                                fi on lookups (after fib_table_flush)

netdev_run_todo
 rcu_barrier
                                Now other CPUs should not send
                                packets via this fib info
 NETDEV_UNREGISTER_FINAL

Notes:

- when NETDEV_DOWN is delivered fib_sync_down_dev sets RTNH_F_DEAD
but only for link routes, not for host routes

- fib_table_lookup runs without any memory ordering barriers,
when CPU 2 delivers packet from different dev it can hit
a multipath route with nh_dev set to NULL. Not so often
if RTNH_F_DEAD was set early and properly checked before
dereferencing nh_dev.

> >         But why we leak routes? Is there some place that holds
> > routes without listening for NETDEV_UNREGISTER? On fib_flush
> > the infos are unlinked from trees, so after a grace period packets
> > should not see/hold such infos. If we hold routes somewhere for
> > long time, problem can happen also for routes with single nexthop.
> 
> My theory is that dst which holds a refcnt to fib_info (of course, after
> this patch) is moved in GC after NETDEV_UNREGISTER but still
> referenced somewhere, it therefore holds these nh_dev's, which
> in turn hold back to the netdevice which is being unregistered, thus
> Eric saw these refcnt leak warnings.

	Oh, well, the sockets can hold cached dst.
But if the promise is that rt->fi is used only as
reference to metrics we have to find a way to drop
the dev references at NETDEV_UNREGISTER time. If you
set nh_dev to NULL then all lookups should check it
for != NULL. The sockets will not walk NHs via rt->fi,
i.e. the route lookups will get valid res.fi from trees,
so it may work in this way.

Regards

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-10 19:51                             ` Julian Anastasov
@ 2017-05-12  0:07                               ` Cong Wang
  2017-05-12  1:22                                 ` Cong Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Cong Wang @ 2017-05-12  0:07 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Eric Dumazet, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet

On Wed, May 10, 2017 at 12:51 PM, Julian Anastasov <ja@ssi.bg> wrote:
>         Oh, well, the sockets can hold cached dst.
> But if the promise is that rt->fi is used only as
> reference to metrics we have to find a way to drop
> the dev references at NETDEV_UNREGISTER time. If you
> set nh_dev to NULL then all lookups should check it
> for != NULL. The sockets will not walk NHs via rt->fi,
> i.e. the route lookups will get valid res.fi from trees,
> so it may work in this way.
>

So, if I understand you correctly it is safe to NULL'ing
nh_dev in NETDEV_UNREGISTER_FINAL, right?

If still not, how about transfer nh_dev's to loopback_dev
too in NETDEV_UNREGISTER? Like we transfer dst->dev.

I don't want to touch the fast path to check for NULL, as
it will change more code and slow down performance.

Thanks.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-12  0:07                               ` Cong Wang
@ 2017-05-12  1:22                                 ` Cong Wang
  2017-05-12  4:55                                   ` Eric Dumazet
  2017-05-12  6:39                                   ` Julian Anastasov
  0 siblings, 2 replies; 43+ messages in thread
From: Cong Wang @ 2017-05-12  1:22 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Eric Dumazet, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet

[-- Attachment #1: Type: text/plain, Size: 502 bytes --]

On Thu, May 11, 2017 at 5:07 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> So, if I understand you correctly it is safe to NULL'ing
> nh_dev in NETDEV_UNREGISTER_FINAL, right?
>
> If still not, how about transfer nh_dev's to loopback_dev
> too in NETDEV_UNREGISTER? Like we transfer dst->dev.
>
> I don't want to touch the fast path to check for NULL, as
> it will change more code and slow down performance.

Finally I come up with the attached patch. Please let me know if
I still miss anything.

[-- Attachment #2: ipv4-rt-fi-ref-count.diff --]
[-- Type: text/plain, Size: 4364 bytes --]

commit edc38ecab7101487b35fc9152e166a2670467e49
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date:   Tue May 9 14:35:10 2017 -0700

    ipv4: restore rt->fi for reference counting, try #2
    
    Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 6692c57..0f04f4d 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -392,6 +392,7 @@ int fib_unmerge(struct net *net);
 /* Exported by fib_semantics.c */
 int ip_fib_check_default(__be32 gw, struct net_device *dev);
 int fib_sync_down_dev(struct net_device *dev, unsigned long event, bool force);
+void fib_put_nh_devs(struct net_device *dev);
 int fib_sync_down_addr(struct net_device *dev, __be32 local);
 int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
 
diff --git a/include/net/route.h b/include/net/route.h
index 2cc0e14..4335eb7 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -69,6 +69,7 @@ struct rtable {
 
 	struct list_head	rt_uncached;
 	struct uncached_list	*rt_uncached_list;
+	struct fib_info		*fi; /* for refcnt to shared metrics */
 };
 
 static inline bool rt_is_input_route(const struct rtable *rt)
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 39bd1ed..59b0a1d 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1178,6 +1178,9 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
 		fib_disable_ip(dev, event, true);
 		rt_flush_dev(dev);
 		return NOTIFY_DONE;
+	} else if (event == NETDEV_UNREGISTER_FINAL) {
+		fib_put_nh_devs(dev);
+		return NOTIFY_DONE;
 	}
 
 	in_dev = __in_dev_get_rtnl(dev);
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index da449dd..b85c5bb 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1453,6 +1453,36 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event, bool force)
 	return ret;
 }
 
+/* We have to release these nh_dev here because a dst could still hold a
+ * fib_info via rt->fi, we can't wait for GC, a socket could hold the dst
+ * for a long time.
+ */
+void fib_put_nh_devs(struct net_device *dev)
+{
+	struct fib_info *prev_fi = NULL;
+	unsigned int hash = fib_devindex_hashfn(dev->ifindex);
+	struct hlist_head *head = &fib_info_devhash[hash];
+	struct fib_nh *nh;
+
+	hlist_for_each_entry(nh, head, nh_hash) {
+		struct fib_info *fi = nh->nh_parent;
+
+		if (nh->nh_dev != dev || fi == prev_fi)
+			continue;
+		prev_fi = fi;
+		change_nexthops(fi) {
+			if (nexthop_nh->nh_dev == dev) {
+				/* This should be safe, we are on unregister
+				 * path, after synchronize_net() and
+				 * rcu_barrier(), no one could see this.
+				 */
+				RCU_INIT_POINTER(nexthop_nh->nh_dev, NULL);
+				dev_put(dev);
+			}
+		} endfor_nexthops(fi)
+	}
+}
+
 /* Must be invoked inside of an RCU protected region.  */
 static void fib_select_default(const struct flowi4 *flp, struct fib_result *res)
 {
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 655d9ee..f647310 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1387,6 +1387,11 @@ static void ipv4_dst_destroy(struct dst_entry *dst)
 {
 	struct rtable *rt = (struct rtable *) dst;
 
+	if (rt->fi) {
+		fib_info_put(rt->fi);
+		rt->fi = NULL;
+	}
+
 	if (!list_empty(&rt->rt_uncached)) {
 		struct uncached_list *ul = rt->rt_uncached_list;
 
@@ -1424,6 +1429,16 @@ static bool rt_cache_valid(const struct rtable *rt)
 		!rt_is_expired(rt);
 }
 
+static void rt_init_metrics(struct rtable *rt, struct fib_info *fi)
+{
+	if (fi->fib_metrics != (u32 *)dst_default_metrics) {
+		fib_info_hold(fi);
+		rt->fi = fi;
+	}
+
+	dst_init_metrics(&rt->dst, fi->fib_metrics, true);
+}
+
 static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 			   const struct fib_result *res,
 			   struct fib_nh_exception *fnhe,
@@ -1438,7 +1453,7 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 			rt->rt_gateway = nh->nh_gw;
 			rt->rt_uses_gateway = 1;
 		}
-		dst_init_metrics(&rt->dst, fi->fib_metrics, true);
+		rt_init_metrics(rt, fi);
 #ifdef CONFIG_IP_ROUTE_CLASSID
 		rt->dst.tclassid = nh->nh_tclassid;
 #endif
@@ -1490,6 +1505,7 @@ struct rtable *rt_dst_alloc(struct net_device *dev,
 		rt->rt_gateway = 0;
 		rt->rt_uses_gateway = 0;
 		rt->rt_table_id = 0;
+		rt->fi = NULL;
 		INIT_LIST_HEAD(&rt->rt_uncached);
 
 		rt->dst.output = ip_output;

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-12  1:22                                 ` Cong Wang
@ 2017-05-12  4:55                                   ` Eric Dumazet
  2017-05-12 17:49                                     ` Cong Wang
  2017-05-12  6:39                                   ` Julian Anastasov
  1 sibling, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2017-05-12  4:55 UTC (permalink / raw)
  To: Cong Wang
  Cc: Julian Anastasov, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet

On Thu, 2017-05-11 at 18:22 -0700, Cong Wang wrote:
> On Thu, May 11, 2017 at 5:07 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > So, if I understand you correctly it is safe to NULL'ing
> > nh_dev in NETDEV_UNREGISTER_FINAL, right?
> >
> > If still not, how about transfer nh_dev's to loopback_dev
> > too in NETDEV_UNREGISTER? Like we transfer dst->dev.
> >
> > I don't want to touch the fast path to check for NULL, as
> > it will change more code and slow down performance.
> 
> Finally I come up with the attached patch. Please let me know if
> I still miss anything.

You have not addressed my prior feedback, unless I am mistaken ?

fib_route_seq_show() and others do not expect fi->fib_dev suddenly
becoming NULL. 

RCU contract is more complicated than simply adding rcu grace period
before delete.

Thanks.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-12  1:22                                 ` Cong Wang
  2017-05-12  4:55                                   ` Eric Dumazet
@ 2017-05-12  6:39                                   ` Julian Anastasov
  2017-05-12 17:27                                     ` Cong Wang
  1 sibling, 1 reply; 43+ messages in thread
From: Julian Anastasov @ 2017-05-12  6:39 UTC (permalink / raw)
  To: Cong Wang
  Cc: Eric Dumazet, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet


	Hello,

On Thu, 11 May 2017, Cong Wang wrote:

> On Thu, May 11, 2017 at 5:07 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > So, if I understand you correctly it is safe to NULL'ing
> > nh_dev in NETDEV_UNREGISTER_FINAL, right?
> >
> > If still not, how about transfer nh_dev's to loopback_dev
> > too in NETDEV_UNREGISTER? Like we transfer dst->dev.
> >
> > I don't want to touch the fast path to check for NULL, as
> > it will change more code and slow down performance.
> 
> Finally I come up with the attached patch. Please let me know if
> I still miss anything.

	fib_flush will unlink the FIB infos at NETDEV_UNREGISTER
time, so we can not see them in any hash tables later on
NETDEV_UNREGISTER_FINAL. fib_put_nh_devs() can not work
except if moving NHs in another hash table but that should
not be needed.

	I'm thinking for the following solution which
is a bit hackish:

- on NETDEV_UNREGISTER we want to put the nh_dev references,
so fib_release_info is a good place. But as fib_release_info
is not always called, we will have two places that put
references. We can use such hack:

	- for example, use nh_oif to know if dev_put is
	already called

	- fib_release_info() should set nh_oif to 0 because
	it will now call dev_put without clearing nh_dev

	- free_fib_info_rcu() will not call dev_put if nh_oif is 0:
	if (nexthop_nh->nh_dev && nexthop_nh->nh_oif)
		dev_put(nexthop_nh->nh_dev);

	- as fi is unlinked, there is no chance fib_info_hashfn()
	to use the cleared nh_oif for hashing

	- we expect noone to touch nh_dev fields after fi is
	unlinked, this includes free_fib_info and free_fib_info_rcu

	What we achieve:

- between NETDEV_UNREGISTER and NETDEV_UNREGISTER_FINAL nh_dev
still points to our dev (and not to loopback, I think, this
answers your question in previous email), so route lookups
will not find loopback in the FIB tree. We do not want
some packets to be wrongly rerouted. Even if we don't
hold dev reference, the RCU grace period helps here.

- fib_dev/nh_dev != NULL checks are not needed, so this addresses
Eric's concerns. BTW, fib_route_seq_show actually checks
for fi->fib_dev, may be not in a safe way (lockless_dereference
needed?) but as we don't set nh_dev to NULL this is not
needed.

	What more? What about nh_pcpu_rth_output and
nh_rth_input holding routes? We should think about
them too. I should think more if nh_oif trick can work
for them, may be not because nh_oif is optional...
May be we should purge them somehow?

Regards

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-12  6:39                                   ` Julian Anastasov
@ 2017-05-12 17:27                                     ` Cong Wang
  2017-05-12 20:58                                       ` Cong Wang
  2017-05-12 21:27                                       ` Julian Anastasov
  0 siblings, 2 replies; 43+ messages in thread
From: Cong Wang @ 2017-05-12 17:27 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Eric Dumazet, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet

On Thu, May 11, 2017 at 11:39 PM, Julian Anastasov <ja@ssi.bg> wrote:
>
>         Hello,
>
> On Thu, 11 May 2017, Cong Wang wrote:
>
>> On Thu, May 11, 2017 at 5:07 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> > So, if I understand you correctly it is safe to NULL'ing
>> > nh_dev in NETDEV_UNREGISTER_FINAL, right?
>> >
>> > If still not, how about transfer nh_dev's to loopback_dev
>> > too in NETDEV_UNREGISTER? Like we transfer dst->dev.
>> >
>> > I don't want to touch the fast path to check for NULL, as
>> > it will change more code and slow down performance.
>>
>> Finally I come up with the attached patch. Please let me know if
>> I still miss anything.
>
>         fib_flush will unlink the FIB infos at NETDEV_UNREGISTER
> time, so we can not see them in any hash tables later on
> NETDEV_UNREGISTER_FINAL. fib_put_nh_devs() can not work
> except if moving NHs in another hash table but that should
> not be needed.

Ah, I did miss the hlist_del() in fib_table_flush(), so we just need some
way to link those fib_nh together for NETDEV_UNREGISTER_FINAL,
a linked list should be sufficient, but requires adding list_head to fib_nh.

>
>         I'm thinking for the following solution which
> is a bit hackish:
>
> - on NETDEV_UNREGISTER we want to put the nh_dev references,
> so fib_release_info is a good place. But as fib_release_info
> is not always called, we will have two places that put
> references. We can use such hack:
>
>         - for example, use nh_oif to know if dev_put is
>         already called
>
>         - fib_release_info() should set nh_oif to 0 because
>         it will now call dev_put without clearing nh_dev

Are you sure we are safe to call dev_put() in fib_release_info()
for _all_ paths, especially non-unregister paths? See:

commit e49cc0da7283088c5e03d475ffe2fdcb24a6d5b1
Author: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Date:   Wed May 23 15:39:45 2012 +0000

    ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

But, hmm, very interesting, I always thought dev_put() triggers a
kfree() potentially, but here your suggestion actually leverages the fact
that it is merely a pcpu counter decrease, so for unregister path,
this is just giving refcnt back, which means it is safe as long as
we don't have any parallel unregister?  We should because of RTNL.

I see why you say this is hackish, really it is. ;) We have to ensure
the evil dev_put() is only called on unregister path.

>
>         - free_fib_info_rcu() will not call dev_put if nh_oif is 0:
>         if (nexthop_nh->nh_dev && nexthop_nh->nh_oif)
>                 dev_put(nexthop_nh->nh_dev);
>
>         - as fi is unlinked, there is no chance fib_info_hashfn()
>         to use the cleared nh_oif for hashing
>
>         - we expect noone to touch nh_dev fields after fi is
>         unlinked, this includes free_fib_info and free_fib_info_rcu
>
>         What we achieve:
>
> - between NETDEV_UNREGISTER and NETDEV_UNREGISTER_FINAL nh_dev
> still points to our dev (and not to loopback, I think, this
> answers your question in previous email), so route lookups
> will not find loopback in the FIB tree. We do not want
> some packets to be wrongly rerouted. Even if we don't
> hold dev reference, the RCU grace period helps here.

Thanks to dev_put() for not calling a netdev_freemem(). ;-)

>
> - fib_dev/nh_dev != NULL checks are not needed, so this addresses
> Eric's concerns. BTW, fib_route_seq_show actually checks
> for fi->fib_dev, may be not in a safe way (lockless_dereference
> needed?) but as we don't set nh_dev to NULL this is not
> needed.
>

I think Eric was complaining about the lack of rcu_dereference()
there.

>         What more? What about nh_pcpu_rth_output and
> nh_rth_input holding routes? We should think about
> them too. I should think more if nh_oif trick can work
> for them, may be not because nh_oif is optional...
> May be we should purge them somehow?
>

Or maybe don't touch nh_oif but using a new flag in
nh_flags?? We only need to know if we should call
dev_put() in free_fib_info_rcu().

Again, I am still not sure if it is any better than just
putting these fib_nh into a linked list.

Thanks.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-12  4:55                                   ` Eric Dumazet
@ 2017-05-12 17:49                                     ` Cong Wang
  0 siblings, 0 replies; 43+ messages in thread
From: Cong Wang @ 2017-05-12 17:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Julian Anastasov, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet

On Thu, May 11, 2017 at 9:55 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2017-05-11 at 18:22 -0700, Cong Wang wrote:
>> On Thu, May 11, 2017 at 5:07 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> > So, if I understand you correctly it is safe to NULL'ing
>> > nh_dev in NETDEV_UNREGISTER_FINAL, right?
>> >
>> > If still not, how about transfer nh_dev's to loopback_dev
>> > too in NETDEV_UNREGISTER? Like we transfer dst->dev.
>> >
>> > I don't want to touch the fast path to check for NULL, as
>> > it will change more code and slow down performance.
>>
>> Finally I come up with the attached patch. Please let me know if
>> I still miss anything.
>
> You have not addressed my prior feedback, unless I am mistaken ?
>
> fib_route_seq_show() and others do not expect fi->fib_dev suddenly
> becoming NULL.
>
> RCU contract is more complicated than simply adding rcu grace period
> before delete.
>

If you mean the lack of rcu_dereference() in fib_route_seq_show() ,
yeah, I don't fix it because I think it should be in a separate patch,
it is not this patch which introduces RCU to nh_dev as I explained
previously.

Or you mean anything else?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-12 17:27                                     ` Cong Wang
@ 2017-05-12 20:58                                       ` Cong Wang
  2017-05-12 21:13                                         ` Cong Wang
  2017-05-12 21:27                                       ` Julian Anastasov
  1 sibling, 1 reply; 43+ messages in thread
From: Cong Wang @ 2017-05-12 20:58 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Eric Dumazet, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet

On Fri, May 12, 2017 at 10:27 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> Or maybe don't touch nh_oif but using a new flag in
> nh_flags?? We only need to know if we should call
> dev_put() in free_fib_info_rcu().
>
> Again, I am still not sure if it is any better than just
> putting these fib_nh into a linked list.
>

I am afraid we have to use a linked list, because either a flag or
nh_oif is per nh, but one nh could have multiple different nh_devs...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-12 20:58                                       ` Cong Wang
@ 2017-05-12 21:13                                         ` Cong Wang
  0 siblings, 0 replies; 43+ messages in thread
From: Cong Wang @ 2017-05-12 21:13 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Eric Dumazet, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet

On Fri, May 12, 2017 at 1:58 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Fri, May 12, 2017 at 10:27 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> Or maybe don't touch nh_oif but using a new flag in
>> nh_flags?? We only need to know if we should call
>> dev_put() in free_fib_info_rcu().
>>
>> Again, I am still not sure if it is any better than just
>> putting these fib_nh into a linked list.
>>
>
> I am afraid we have to use a linked list, because either a flag or
> nh_oif is per nh, but one nh could have multiple different nh_devs...

Scratch that. Each fib_nh has one nh_dev...
I am trying to use a flag.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-12 17:27                                     ` Cong Wang
  2017-05-12 20:58                                       ` Cong Wang
@ 2017-05-12 21:27                                       ` Julian Anastasov
  2017-05-15 18:34                                         ` Cong Wang
  1 sibling, 1 reply; 43+ messages in thread
From: Julian Anastasov @ 2017-05-12 21:27 UTC (permalink / raw)
  To: Cong Wang
  Cc: Eric Dumazet, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet


	Hello,

On Fri, 12 May 2017, Cong Wang wrote:

> On Thu, May 11, 2017 at 11:39 PM, Julian Anastasov <ja@ssi.bg> wrote:
> >
> >         fib_flush will unlink the FIB infos at NETDEV_UNREGISTER
> > time, so we can not see them in any hash tables later on
> > NETDEV_UNREGISTER_FINAL. fib_put_nh_devs() can not work
> > except if moving NHs in another hash table but that should
> > not be needed.
> 
> Ah, I did miss the hlist_del() in fib_table_flush(), so we just need some
> way to link those fib_nh together for NETDEV_UNREGISTER_FINAL,
> a linked list should be sufficient, but requires adding list_head to fib_nh.

	It could be a fib_info_devhash_dead hash table, similar
to fib_info_devhash where we can hash NHs for dead fib infos
but then we will need a fib_clntref reference or otherwise last
user can drop the fib info before NETDEV_UNREGISTER_FINAL is called.
It will add more code in fib_sync_down_dev to know when
to call fib_info_hold. But OTOH, it will reduce the work
needed for careful release of the references in fib_release_info.

	So, we have 2 possible cases to consider:

1. route deleted, fib_release_info called, fi held by socket,
NETDEV_UNREGISTER can not see the NHs because they are
unlinked in fib_release_info. dev unreg delayed for long time...

2. NETDEV_UNREGISTER called first, before the NHs are
unlinked.

	Now the main question: is FIB_LOOKUP_NOREF used
everywhere in IPv4? I guess so. If not, it means
someone can walk its res->fi NHs which is bad. I think,
this will delay the unregistration for long time and we
can not solve the problem.

	If yes, free_fib_info() should not use call_rcu.
Instead, fib_release_info() will start RCU callback to
drop everything via a common function for fib_release_info
and free_fib_info. As result, the last fib_info_put will
just need to free fi->fib_metrics and fi.

	Something like:

/* Long living data */
fib_info_destroy:
{
	if (fi->fib_dead == 0) {
		pr_warn("Freeing alive fib_info %p\n", fi);
		return;
	}

	if (fi->fib_metrics != (u32 *) dst_default_metrics)
		kfree(fi->fib_metrics);
	kfree(fi);
}

/* Do not imply RCU grace period anymore, last user is last */
fib_info_put():
{
	if (atomic_dec_and_test(&fi->fib_clntref))
		fib_info_destroy(fi);
}

/* Release everything except long living fields (fib_metrics) */
fib_info_release():
{
        change_nexthops(fi) {
                if (nexthop_nh->nh_dev)
                        dev_put(nexthop_nh->nh_dev);
                lwtstate_put(nexthop_nh->nh_lwtstate);
                free_nh_exceptions(nexthop_nh);
                rt_fibinfo_free_cpus(nexthop_nh->nh_pcpu_rth_output);
                rt_fibinfo_free(&nexthop_nh->nh_rth_input);
        } endfor_nexthops(fi);
}

/* RCU grace period after unlink */
fib_release_info_rcu():
{
	struct fib_info *fi = container_of(head, struct fib_info, rcu);

	fib_info_release(fi);
	fib_info_put(fi);
}

/* Called just on error */
free_fib_info():
{
	fib_info_release(fi);
	fib_info_put(fi);
}

fib_release_info():
{
	...
	fi->fib_dead = 1;
-	fib_info_put(fi);
+	call_rcu(&fi->rcu, fib_release_info_rcu);
}

> >         I'm thinking for the following solution which
> > is a bit hackish:
> >
> > - on NETDEV_UNREGISTER we want to put the nh_dev references,
> > so fib_release_info is a good place. But as fib_release_info
> > is not always called, we will have two places that put
> > references. We can use such hack:
> >
> >         - for example, use nh_oif to know if dev_put is
> >         already called
> >
> >         - fib_release_info() should set nh_oif to 0 because
> >         it will now call dev_put without clearing nh_dev
> 
> Are you sure we are safe to call dev_put() in fib_release_info()
> for _all_ paths, especially non-unregister paths? See:

	Yep, dev_put is safe there...

> commit e49cc0da7283088c5e03d475ffe2fdcb24a6d5b1
> Author: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> Date:   Wed May 23 15:39:45 2012 +0000
> 
>     ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

	...as long as we do not set nh_dev to NULL

> But, hmm, very interesting, I always thought dev_put() triggers a
> kfree() potentially, but here your suggestion actually leverages the fact
> that it is merely a pcpu counter decrease, so for unregister path,
> this is just giving refcnt back, which means it is safe as long as
> we don't have any parallel unregister?  We should because of RTNL.

	Yes, we are in the context of unregistration and
there is a rcu_barrier() that prevents device to be
destroyed while there are RCU users. Refcnt reaching 0 is
not enough to free dev, RCU users must be considered,
they do not get reference.

> I see why you say this is hackish, really it is. ;) We have to ensure
> the evil dev_put() is only called on unregister path.

	Or after a RCU grace period but not later...

> > - fib_dev/nh_dev != NULL checks are not needed, so this addresses
> > Eric's concerns. BTW, fib_route_seq_show actually checks
> > for fi->fib_dev, may be not in a safe way (lockless_dereference
> > needed?) but as we don't set nh_dev to NULL this is not
> > needed.
> >
> 
> I think Eric was complaining about the lack of rcu_dereference()
> there.

	Not needed if we don't set nh_dev to NULL.

> >         What more? What about nh_pcpu_rth_output and
> > nh_rth_input holding routes? We should think about
> > them too. I should think more if nh_oif trick can work
> > for them, may be not because nh_oif is optional...
> > May be we should purge them somehow?
> >
> 
> Or maybe don't touch nh_oif but using a new flag in
> nh_flags?? We only need to know if we should call
> dev_put() in free_fib_info_rcu().

	fib_dead is a better option if we decide to use
such solution: 1=not called by fib_release_info,
2=called by fib_release_info.

Regards

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-12 21:27                                       ` Julian Anastasov
@ 2017-05-15 18:34                                         ` Cong Wang
  2017-05-15 20:37                                           ` Julian Anastasov
  0 siblings, 1 reply; 43+ messages in thread
From: Cong Wang @ 2017-05-15 18:34 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Eric Dumazet, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet

On Fri, May 12, 2017 at 2:27 PM, Julian Anastasov <ja@ssi.bg> wrote:
>         Now the main question: is FIB_LOOKUP_NOREF used
> everywhere in IPv4? I guess so. If not, it means
> someone can walk its res->fi NHs which is bad. I think,
> this will delay the unregistration for long time and we
> can not solve the problem.
>
>         If yes, free_fib_info() should not use call_rcu.
> Instead, fib_release_info() will start RCU callback to
> drop everything via a common function for fib_release_info
> and free_fib_info. As result, the last fib_info_put will
> just need to free fi->fib_metrics and fi.


Yes it is used. But this is a different problem from the
dev refcnt issue, right? I can send a separate patch to
address it.


>> Are you sure we are safe to call dev_put() in fib_release_info()
>> for _all_ paths, especially non-unregister paths? See:
>
>         Yep, dev_put is safe there...
>
>> commit e49cc0da7283088c5e03d475ffe2fdcb24a6d5b1
>> Author: Yanmin Zhang <yanmin_zhang@linux.intel.com>
>> Date:   Wed May 23 15:39:45 2012 +0000
>>
>>     ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
>
>         ...as long as we do not set nh_dev to NULL
>

OK, fair enough, then I think the best solution here is to move
the dev_put() from free_fib_info_rcu() to fib_release_info(),
fib_nh is already removed from hash there anyway.


diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index da449dd..cb712d1 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -205,8 +205,6 @@ static void free_fib_info_rcu(struct rcu_head *head)
        struct fib_info *fi = container_of(head, struct fib_info, rcu);

        change_nexthops(fi) {
-               if (nexthop_nh->nh_dev)
-                       dev_put(nexthop_nh->nh_dev);
                lwtstate_put(nexthop_nh->nh_lwtstate);
                free_nh_exceptions(nexthop_nh);
                rt_fibinfo_free_cpus(nexthop_nh->nh_pcpu_rth_output);
@@ -246,6 +244,14 @@ void fib_release_info(struct fib_info *fi)
                        if (!nexthop_nh->nh_dev)
                                continue;
                        hlist_del(&nexthop_nh->nh_hash);
+                       /* We have to release these nh_dev here because a dst
+                        * could still hold a fib_info via rt->fi, we can't wait
+                        * for GC, a socket could hold the dst for a long time.
+                        *
+                        * This is safe, dev_put() alone does not really free
+                        * the netdevice, we just have to put the refcnt back.
+                        */
+                       dev_put(nexthop_nh->nh_dev);
                } endfor_nexthops(fi)
                fi->fib_dead = 1;
                fib_info_put(fi);


Thanks!

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-15 18:34                                         ` Cong Wang
@ 2017-05-15 20:37                                           ` Julian Anastasov
  2017-05-15 22:13                                             ` Cong Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Julian Anastasov @ 2017-05-15 20:37 UTC (permalink / raw)
  To: Cong Wang
  Cc: Eric Dumazet, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet


	Hello,

On Mon, 15 May 2017, Cong Wang wrote:

> On Fri, May 12, 2017 at 2:27 PM, Julian Anastasov <ja@ssi.bg> wrote:
> >         Now the main question: is FIB_LOOKUP_NOREF used
> > everywhere in IPv4? I guess so. If not, it means
> > someone can walk its res->fi NHs which is bad. I think,
> > this will delay the unregistration for long time and we
> > can not solve the problem.
> >
> >         If yes, free_fib_info() should not use call_rcu.
> > Instead, fib_release_info() will start RCU callback to
> > drop everything via a common function for fib_release_info
> > and free_fib_info. As result, the last fib_info_put will
> > just need to free fi->fib_metrics and fi.
> 
> 
> Yes it is used. But this is a different problem from the
> dev refcnt issue, right? I can send a separate patch to
> address it.

	Any user that does not set FIB_LOOKUP_NOREF
will need nh_dev refcounts. The assumption is that the
NHs are accessed, who knows, may be even after RCU grace
period. As result, we can not use dev_put on NETDEV_UNREGISTER.
So, we should check if there are users that do not
set FIB_LOOKUP_NOREF, at first look, I don't see such ones
for IPv4.

> >> Are you sure we are safe to call dev_put() in fib_release_info()
> >> for _all_ paths, especially non-unregister paths? See:
> >
> >         Yep, dev_put is safe there...
> >
> >> commit e49cc0da7283088c5e03d475ffe2fdcb24a6d5b1
> >> Author: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> >> Date:   Wed May 23 15:39:45 2012 +0000
> >>
> >>     ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
> >
> >         ...as long as we do not set nh_dev to NULL
> >
> 
> OK, fair enough, then I think the best solution here is to move
> the dev_put() from free_fib_info_rcu() to fib_release_info(),
> fib_nh is already removed from hash there anyway.

	free_fib_info still needs to put the references,
that is the reason for the common fib_info_release() in
my example. It happens in fib_create_info() where free_fib_info()
is called. The func names in my example can be corrected,
if needed.

> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index da449dd..cb712d1 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -205,8 +205,6 @@ static void free_fib_info_rcu(struct rcu_head *head)
>         struct fib_info *fi = container_of(head, struct fib_info, rcu);
> 
>         change_nexthops(fi) {
> -               if (nexthop_nh->nh_dev)
> -                       dev_put(nexthop_nh->nh_dev);
>                 lwtstate_put(nexthop_nh->nh_lwtstate);
>                 free_nh_exceptions(nexthop_nh);
>                 rt_fibinfo_free_cpus(nexthop_nh->nh_pcpu_rth_output);
> @@ -246,6 +244,14 @@ void fib_release_info(struct fib_info *fi)
>                         if (!nexthop_nh->nh_dev)
>                                 continue;
>                         hlist_del(&nexthop_nh->nh_hash);
> +                       /* We have to release these nh_dev here because a dst
> +                        * could still hold a fib_info via rt->fi, we can't wait
> +                        * for GC, a socket could hold the dst for a long time.
> +                        *
> +                        * This is safe, dev_put() alone does not really free
> +                        * the netdevice, we just have to put the refcnt back.
> +                        */
> +                       dev_put(nexthop_nh->nh_dev);
>                 } endfor_nexthops(fi)
>                 fi->fib_dead = 1;

	Such solution needs the fib_dead = 1|2 game to
know who dropped the nh_dev reference, fib_release_info (2) or
fib_create_info (1). You can not remove the dev_put calls
from free_fib_info_rcu.

>                 fib_info_put(fi);

Regards

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-15 20:37                                           ` Julian Anastasov
@ 2017-05-15 22:13                                             ` Cong Wang
  2017-05-16  7:46                                               ` Julian Anastasov
  0 siblings, 1 reply; 43+ messages in thread
From: Cong Wang @ 2017-05-15 22:13 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Eric Dumazet, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet

On Mon, May 15, 2017 at 1:37 PM, Julian Anastasov <ja@ssi.bg> wrote:
>         Any user that does not set FIB_LOOKUP_NOREF
> will need nh_dev refcounts. The assumption is that the
> NHs are accessed, who knows, may be even after RCU grace
> period. As result, we can not use dev_put on NETDEV_UNREGISTER.
> So, we should check if there are users that do not
> set FIB_LOOKUP_NOREF, at first look, I don't see such ones
> for IPv4.

I see, although we do have FIB_LOOKUP_NOREF set all the times,
there are other places we hold fib_clntref too, for example
mlxsw_sp_router_fib_event_work(), it actually uses nh_dev too...

So I am afraid moving dev_put() to fib_release_info() is not a solution
here. I have to rethink about it.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-15 22:13                                             ` Cong Wang
@ 2017-05-16  7:46                                               ` Julian Anastasov
  2017-05-16 17:53                                                 ` Cong Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Julian Anastasov @ 2017-05-16  7:46 UTC (permalink / raw)
  To: Cong Wang
  Cc: Eric Dumazet, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet


	Hello,

On Mon, 15 May 2017, Cong Wang wrote:

> On Mon, May 15, 2017 at 1:37 PM, Julian Anastasov <ja@ssi.bg> wrote:
> >         Any user that does not set FIB_LOOKUP_NOREF
> > will need nh_dev refcounts. The assumption is that the
> > NHs are accessed, who knows, may be even after RCU grace
> > period. As result, we can not use dev_put on NETDEV_UNREGISTER.
> > So, we should check if there are users that do not
> > set FIB_LOOKUP_NOREF, at first look, I don't see such ones
> > for IPv4.
> 
> I see, although we do have FIB_LOOKUP_NOREF set all the times,
> there are other places we hold fib_clntref too, for example
> mlxsw_sp_router_fib_event_work(), it actually uses nh_dev too...
> 
> So I am afraid moving dev_put() to fib_release_info() is not a solution
> here. I have to rethink about it.

	At first look, they use fib_info_hold() to get fib_clntref 
reference from places where fib_treeref is not fatally decreased
to 0 but later a work is used which finishes the job. I guess, we
can convert such places to use just a fib_treeref reference.
They can use such new method instead of fib_info_hold:

void fib_treeref_get(struct fib_info *fi)
{
	spin_lock_bh(&fib_info_lock);
	fi->fib_treeref++;
	spin_unlock_bh(&fib_info_lock);
}

They will use fib_release_info() to put the reference. But
on FIB_EVENT_ENTRY_DEL there is a small window where the
scheduled work delays the unlink of fib info from the
hash tables, i.e. there is a risk fib_find_info to reuse
a dead fib info.

	May be we can add a fi->fib_flags & RTNH_F_DEAD
check there but the problem is that it is set also on
NETDEV_DOWN. While attempts to add route to device with
!(dev->flags & IFF_UP) is rejected by fib_check_nh(),
fib_create_info still can create routes when
cfg->fc_scope == RT_SCOPE_HOST. So, RTNH_F_DEAD check
in fib_find_info can avoid the reuse of fib info for
host routes while device is down but not unregistered.
As result, many fib infos can be created instead of one
that is reused. Adding new RTNH_F_* flag to properly handle
this does not look good...

Regards

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-16  7:46                                               ` Julian Anastasov
@ 2017-05-16 17:53                                                 ` Cong Wang
  2017-05-16 18:16                                                   ` Cong Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Cong Wang @ 2017-05-16 17:53 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Eric Dumazet, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet

On Tue, May 16, 2017 at 12:46 AM, Julian Anastasov <ja@ssi.bg> wrote:
>
>         Hello,
>
> On Mon, 15 May 2017, Cong Wang wrote:
>
>> On Mon, May 15, 2017 at 1:37 PM, Julian Anastasov <ja@ssi.bg> wrote:
>> >         Any user that does not set FIB_LOOKUP_NOREF
>> > will need nh_dev refcounts. The assumption is that the
>> > NHs are accessed, who knows, may be even after RCU grace
>> > period. As result, we can not use dev_put on NETDEV_UNREGISTER.
>> > So, we should check if there are users that do not
>> > set FIB_LOOKUP_NOREF, at first look, I don't see such ones
>> > for IPv4.
>>
>> I see, although we do have FIB_LOOKUP_NOREF set all the times,
>> there are other places we hold fib_clntref too, for example
>> mlxsw_sp_router_fib_event_work(), it actually uses nh_dev too...
>>
>> So I am afraid moving dev_put() to fib_release_info() is not a solution
>> here. I have to rethink about it.
>
>         At first look, they use fib_info_hold() to get fib_clntref
> reference from places where fib_treeref is not fatally decreased
> to 0 but later a work is used which finishes the job. I guess, we
> can convert such places to use just a fib_treeref reference.
> They can use such new method instead of fib_info_hold:
>
> void fib_treeref_get(struct fib_info *fi)
> {
>         spin_lock_bh(&fib_info_lock);
>         fi->fib_treeref++;
>         spin_unlock_bh(&fib_info_lock);
> }
>
> They will use fib_release_info() to put the reference. But
> on FIB_EVENT_ENTRY_DEL there is a small window where the
> scheduled work delays the unlink of fib info from the
> hash tables, i.e. there is a risk fib_find_info to reuse
> a dead fib info.
>

Right, that seems risky.

How about adding a check for ->fib_dead in these work's?
If the last treeref is gone, probably it is no longer needed
to continue to do the offloading work.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Patch net] ipv4: restore rt->fi for reference counting
  2017-05-16 17:53                                                 ` Cong Wang
@ 2017-05-16 18:16                                                   ` Cong Wang
       [not found]                                                     ` <1495572267.6465.79.camel@edumazet-glaptop3.roam.corp.google.com>
  0 siblings, 1 reply; 43+ messages in thread
From: Cong Wang @ 2017-05-16 18:16 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Eric Dumazet, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet

On Tue, May 16, 2017 at 10:53 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> How about adding a check for ->fib_dead in these work's?
> If the last treeref is gone, probably it is no longer needed
> to continue to do the offloading work.

Hmm, this doesn't look correct either, we may miss an offloading work
if fib_release_info() comes first.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH net] ipv4: add reference counting to metrics
       [not found]                                                             ` <1495665921.6465.95.camel@edumazet-glaptop3.roam.corp.google.com>
@ 2017-05-25 21:27                                                               ` Eric Dumazet
  2017-05-25 22:25                                                                 ` Julian Anastasov
                                                                                   ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Eric Dumazet @ 2017-05-25 21:27 UTC (permalink / raw)
  To: David Miller
  Cc: Eric Dumazet, Andrey Konovalov, Julian Anastasov, Cong Wang, netdev

From: Eric Dumazet <edumazet@google.com>

Andrey Konovalov reported crashes in ipv4_mtu()

I could reproduce the issue with KASAN kernels, between
10.246.7.151 and 10.246.7.152 :

1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 &

2) At the same time run following loop :
while :
do
 ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
 ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
done


Cong Wang attempted to add back rt->fi in commit
82486aa6f1b9 ("ipv4: restore rt->fi for reference counting")
but this proved to add some issues that were complex to solve.

Instead, I suggested to add a refcount to the metrics themselves,
being a standalone object (in particular, no reference to other objects)

I tried to make this patch as small as possible to ease its backport,
instead of being super clean. Note that we believe that only ipv4 dst
need to take care of the metric refcount. But if this is wrong,
this patch adds the basic infrastructure to extend this to other
families.

Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang
for his efforts on this problem.

Fixes: 2860583fe840 ("ipv4: Kill rt->fi")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
---
 include/net/dst.h        |    8 +++++++-
 include/net/ip_fib.h     |   10 +++++-----
 net/core/dst.c           |   23 ++++++++++++++---------
 net/ipv4/fib_semantics.c |   17 ++++++++++-------
 net/ipv4/route.c         |   10 +++++++++-
 5 files changed, 45 insertions(+), 23 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 049af33da3b6c95897d544670cea65c542317673..cfc0437841665d7ed46a714915c50d723c24901c 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -107,10 +107,16 @@ struct dst_entry {
 	};
 };
 
+struct dst_metrics {
+	u32		metrics[RTAX_MAX];
+	atomic_t	refcnt;
+};
+extern const struct dst_metrics dst_default_metrics;
+
 u32 *dst_cow_metrics_generic(struct dst_entry *dst, unsigned long old);
-extern const u32 dst_default_metrics[];
 
 #define DST_METRICS_READ_ONLY		0x1UL
+#define DST_METRICS_REFCOUNTED		0x2UL
 #define DST_METRICS_FLAGS		0x3UL
 #define __DST_METRICS_PTR(Y)	\
 	((u32 *)((Y) & ~DST_METRICS_FLAGS))
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 6692c5758b332d468f1e0611ecc4f3e03ae03b2b..f7f6aa789c6174c41ca9739206d586c559c1f3a1 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -114,11 +114,11 @@ struct fib_info {
 	__be32			fib_prefsrc;
 	u32			fib_tb_id;
 	u32			fib_priority;
-	u32			*fib_metrics;
-#define fib_mtu fib_metrics[RTAX_MTU-1]
-#define fib_window fib_metrics[RTAX_WINDOW-1]
-#define fib_rtt fib_metrics[RTAX_RTT-1]
-#define fib_advmss fib_metrics[RTAX_ADVMSS-1]
+	struct dst_metrics	*fib_metrics;
+#define fib_mtu fib_metrics->metrics[RTAX_MTU-1]
+#define fib_window fib_metrics->metrics[RTAX_WINDOW-1]
+#define fib_rtt fib_metrics->metrics[RTAX_RTT-1]
+#define fib_advmss fib_metrics->metrics[RTAX_ADVMSS-1]
 	int			fib_nhs;
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 	int			fib_weight;
diff --git a/net/core/dst.c b/net/core/dst.c
index 960e503b5a529a2c4f1866f49c150493ee98d7da..6192f11beec9077de964e2aeff4f78547f08b8da 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -151,13 +151,13 @@ int dst_discard_out(struct net *net, struct sock *sk, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(dst_discard_out);
 
-const u32 dst_default_metrics[RTAX_MAX + 1] = {
+const struct dst_metrics dst_default_metrics = {
 	/* This initializer is needed to force linker to place this variable
 	 * into const section. Otherwise it might end into bss section.
 	 * We really want to avoid false sharing on this variable, and catch
 	 * any writes on it.
 	 */
-	[RTAX_MAX] = 0xdeadbeef,
+	.refcnt = ATOMIC_INIT(1),
 };
 
 void dst_init(struct dst_entry *dst, struct dst_ops *ops,
@@ -169,7 +169,7 @@ void dst_init(struct dst_entry *dst, struct dst_ops *ops,
 	if (dev)
 		dev_hold(dev);
 	dst->ops = ops;
-	dst_init_metrics(dst, dst_default_metrics, true);
+	dst_init_metrics(dst, dst_default_metrics.metrics, true);
 	dst->expires = 0UL;
 	dst->path = dst;
 	dst->from = NULL;
@@ -314,25 +314,30 @@ EXPORT_SYMBOL(dst_release);
 
 u32 *dst_cow_metrics_generic(struct dst_entry *dst, unsigned long old)
 {
-	u32 *p = kmalloc(sizeof(u32) * RTAX_MAX, GFP_ATOMIC);
+	struct dst_metrics *p = kmalloc(sizeof(*p), GFP_ATOMIC);
 
 	if (p) {
-		u32 *old_p = __DST_METRICS_PTR(old);
+		struct dst_metrics *old_p = (struct dst_metrics *)__DST_METRICS_PTR(old);
 		unsigned long prev, new;
 
-		memcpy(p, old_p, sizeof(u32) * RTAX_MAX);
+		atomic_set(&p->refcnt, 1);
+		memcpy(p->metrics, old_p->metrics, sizeof(p->metrics));
 
 		new = (unsigned long) p;
 		prev = cmpxchg(&dst->_metrics, old, new);
 
 		if (prev != old) {
 			kfree(p);
-			p = __DST_METRICS_PTR(prev);
+			p = (struct dst_metrics *)__DST_METRICS_PTR(prev);
 			if (prev & DST_METRICS_READ_ONLY)
 				p = NULL;
+		} else if (prev & DST_METRICS_REFCOUNTED) {
+			if (atomic_dec_and_test(&old_p->refcnt))
+				kfree(old_p);
 		}
 	}
-	return p;
+	BUILD_BUG_ON(offsetof(struct dst_metrics, metrics) != 0);
+	return (u32 *)p;
 }
 EXPORT_SYMBOL(dst_cow_metrics_generic);
 
@@ -341,7 +346,7 @@ void __dst_destroy_metrics_generic(struct dst_entry *dst, unsigned long old)
 {
 	unsigned long prev, new;
 
-	new = ((unsigned long) dst_default_metrics) | DST_METRICS_READ_ONLY;
+	new = ((unsigned long) &dst_default_metrics) | DST_METRICS_READ_ONLY;
 	prev = cmpxchg(&dst->_metrics, old, new);
 	if (prev == old)
 		kfree(__DST_METRICS_PTR(old));
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index da449ddb8cc172bd9091c00057a69a095f98b56d..ad9ad4aab5da7c7d11c3b80edbdfcbdd3d7153fe 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -203,6 +203,7 @@ static void rt_fibinfo_free_cpus(struct rtable __rcu * __percpu *rtp)
 static void free_fib_info_rcu(struct rcu_head *head)
 {
 	struct fib_info *fi = container_of(head, struct fib_info, rcu);
+	struct dst_metrics *m;
 
 	change_nexthops(fi) {
 		if (nexthop_nh->nh_dev)
@@ -213,8 +214,9 @@ static void free_fib_info_rcu(struct rcu_head *head)
 		rt_fibinfo_free(&nexthop_nh->nh_rth_input);
 	} endfor_nexthops(fi);
 
-	if (fi->fib_metrics != (u32 *) dst_default_metrics)
-		kfree(fi->fib_metrics);
+	m = fi->fib_metrics;
+	if (m != &dst_default_metrics && atomic_dec_and_test(&m->refcnt))
+		kfree(m);
 	kfree(fi);
 }
 
@@ -971,11 +973,11 @@ fib_convert_metrics(struct fib_info *fi, const struct fib_config *cfg)
 			val = 255;
 		if (type == RTAX_FEATURES && (val & ~RTAX_FEATURE_MASK))
 			return -EINVAL;
-		fi->fib_metrics[type - 1] = val;
+		fi->fib_metrics->metrics[type - 1] = val;
 	}
 
 	if (ecn_ca)
-		fi->fib_metrics[RTAX_FEATURES - 1] |= DST_FEATURE_ECN_CA;
+		fi->fib_metrics->metrics[RTAX_FEATURES - 1] |= DST_FEATURE_ECN_CA;
 
 	return 0;
 }
@@ -1033,11 +1035,12 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 		goto failure;
 	fib_info_cnt++;
 	if (cfg->fc_mx) {
-		fi->fib_metrics = kzalloc(sizeof(u32) * RTAX_MAX, GFP_KERNEL);
+		fi->fib_metrics = kzalloc(sizeof(*fi->fib_metrics), GFP_KERNEL);
 		if (!fi->fib_metrics)
 			goto failure;
+		atomic_set(&fi->fib_metrics->refcnt, 1);
 	} else
-		fi->fib_metrics = (u32 *) dst_default_metrics;
+		fi->fib_metrics = (struct dst_metrics *)&dst_default_metrics;
 
 	fi->fib_net = net;
 	fi->fib_protocol = cfg->fc_protocol;
@@ -1238,7 +1241,7 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
 	if (fi->fib_priority &&
 	    nla_put_u32(skb, RTA_PRIORITY, fi->fib_priority))
 		goto nla_put_failure;
-	if (rtnetlink_put_metrics(skb, fi->fib_metrics) < 0)
+	if (rtnetlink_put_metrics(skb, fi->fib_metrics->metrics) < 0)
 		goto nla_put_failure;
 
 	if (fi->fib_prefsrc &&
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 655d9eebe43e16a59102edcd3ea4bc177c6b341d..6883b3d4ba8f69de2cb924612d60f5671a219a84 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1385,8 +1385,12 @@ static void rt_add_uncached_list(struct rtable *rt)
 
 static void ipv4_dst_destroy(struct dst_entry *dst)
 {
+	struct dst_metrics *p = (struct dst_metrics *)DST_METRICS_PTR(dst);
 	struct rtable *rt = (struct rtable *) dst;
 
+	if (p != &dst_default_metrics && atomic_dec_and_test(&p->refcnt))
+		kfree(p);
+
 	if (!list_empty(&rt->rt_uncached)) {
 		struct uncached_list *ul = rt->rt_uncached_list;
 
@@ -1438,7 +1442,11 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 			rt->rt_gateway = nh->nh_gw;
 			rt->rt_uses_gateway = 1;
 		}
-		dst_init_metrics(&rt->dst, fi->fib_metrics, true);
+		dst_init_metrics(&rt->dst, fi->fib_metrics->metrics, true);
+		if (fi->fib_metrics != &dst_default_metrics) {
+			rt->dst._metrics |= DST_METRICS_REFCOUNTED;
+			atomic_inc(&fi->fib_metrics->refcnt);
+		}
 #ifdef CONFIG_IP_ROUTE_CLASSID
 		rt->dst.tclassid = nh->nh_tclassid;
 #endif

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH net] ipv4: add reference counting to metrics
  2017-05-25 21:27                                                               ` [PATCH net] ipv4: add reference counting to metrics Eric Dumazet
@ 2017-05-25 22:25                                                                 ` Julian Anastasov
  2017-05-26 17:08                                                                 ` Cong Wang
  2017-05-26 18:58                                                                 ` David Miller
  2 siblings, 0 replies; 43+ messages in thread
From: Julian Anastasov @ 2017-05-25 22:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Eric Dumazet, Andrey Konovalov, Cong Wang, netdev


	Hello,

On Thu, 25 May 2017, Eric Dumazet wrote:

> From: Eric Dumazet <edumazet@google.com>
> 
> Andrey Konovalov reported crashes in ipv4_mtu()
> 
> I could reproduce the issue with KASAN kernels, between
> 10.246.7.151 and 10.246.7.152 :
> 
> 1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 &
> 
> 2) At the same time run following loop :
> while :
> do
>  ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
>  ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
> done
> 
> 
> Cong Wang attempted to add back rt->fi in commit
> 82486aa6f1b9 ("ipv4: restore rt->fi for reference counting")
> but this proved to add some issues that were complex to solve.
> 
> Instead, I suggested to add a refcount to the metrics themselves,
> being a standalone object (in particular, no reference to other objects)
> 
> I tried to make this patch as small as possible to ease its backport,
> instead of being super clean. Note that we believe that only ipv4 dst
> need to take care of the metric refcount. But if this is wrong,
> this patch adds the basic infrastructure to extend this to other
> families.
> 
> Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang
> for his efforts on this problem.
> 
> Fixes: 2860583fe840 ("ipv4: Kill rt->fi")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Andrey Konovalov <andreyknvl@google.com>

	Nice work, thanks!

Reviewed-by: Julian Anastasov <ja@ssi.bg>

> ---
>  include/net/dst.h        |    8 +++++++-
>  include/net/ip_fib.h     |   10 +++++-----
>  net/core/dst.c           |   23 ++++++++++++++---------
>  net/ipv4/fib_semantics.c |   17 ++++++++++-------
>  net/ipv4/route.c         |   10 +++++++++-
>  5 files changed, 45 insertions(+), 23 deletions(-)
> 
> diff --git a/include/net/dst.h b/include/net/dst.h
> index 049af33da3b6c95897d544670cea65c542317673..cfc0437841665d7ed46a714915c50d723c24901c 100644
> --- a/include/net/dst.h
> +++ b/include/net/dst.h
> @@ -107,10 +107,16 @@ struct dst_entry {
>  	};
>  };
>  
> +struct dst_metrics {
> +	u32		metrics[RTAX_MAX];
> +	atomic_t	refcnt;
> +};
> +extern const struct dst_metrics dst_default_metrics;
> +
>  u32 *dst_cow_metrics_generic(struct dst_entry *dst, unsigned long old);
> -extern const u32 dst_default_metrics[];
>  
>  #define DST_METRICS_READ_ONLY		0x1UL
> +#define DST_METRICS_REFCOUNTED		0x2UL
>  #define DST_METRICS_FLAGS		0x3UL
>  #define __DST_METRICS_PTR(Y)	\
>  	((u32 *)((Y) & ~DST_METRICS_FLAGS))
> diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> index 6692c5758b332d468f1e0611ecc4f3e03ae03b2b..f7f6aa789c6174c41ca9739206d586c559c1f3a1 100644
> --- a/include/net/ip_fib.h
> +++ b/include/net/ip_fib.h
> @@ -114,11 +114,11 @@ struct fib_info {
>  	__be32			fib_prefsrc;
>  	u32			fib_tb_id;
>  	u32			fib_priority;
> -	u32			*fib_metrics;
> -#define fib_mtu fib_metrics[RTAX_MTU-1]
> -#define fib_window fib_metrics[RTAX_WINDOW-1]
> -#define fib_rtt fib_metrics[RTAX_RTT-1]
> -#define fib_advmss fib_metrics[RTAX_ADVMSS-1]
> +	struct dst_metrics	*fib_metrics;
> +#define fib_mtu fib_metrics->metrics[RTAX_MTU-1]
> +#define fib_window fib_metrics->metrics[RTAX_WINDOW-1]
> +#define fib_rtt fib_metrics->metrics[RTAX_RTT-1]
> +#define fib_advmss fib_metrics->metrics[RTAX_ADVMSS-1]
>  	int			fib_nhs;
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>  	int			fib_weight;
> diff --git a/net/core/dst.c b/net/core/dst.c
> index 960e503b5a529a2c4f1866f49c150493ee98d7da..6192f11beec9077de964e2aeff4f78547f08b8da 100644
> --- a/net/core/dst.c
> +++ b/net/core/dst.c
> @@ -151,13 +151,13 @@ int dst_discard_out(struct net *net, struct sock *sk, struct sk_buff *skb)
>  }
>  EXPORT_SYMBOL(dst_discard_out);
>  
> -const u32 dst_default_metrics[RTAX_MAX + 1] = {
> +const struct dst_metrics dst_default_metrics = {
>  	/* This initializer is needed to force linker to place this variable
>  	 * into const section. Otherwise it might end into bss section.
>  	 * We really want to avoid false sharing on this variable, and catch
>  	 * any writes on it.
>  	 */
> -	[RTAX_MAX] = 0xdeadbeef,
> +	.refcnt = ATOMIC_INIT(1),
>  };
>  
>  void dst_init(struct dst_entry *dst, struct dst_ops *ops,
> @@ -169,7 +169,7 @@ void dst_init(struct dst_entry *dst, struct dst_ops *ops,
>  	if (dev)
>  		dev_hold(dev);
>  	dst->ops = ops;
> -	dst_init_metrics(dst, dst_default_metrics, true);
> +	dst_init_metrics(dst, dst_default_metrics.metrics, true);
>  	dst->expires = 0UL;
>  	dst->path = dst;
>  	dst->from = NULL;
> @@ -314,25 +314,30 @@ EXPORT_SYMBOL(dst_release);
>  
>  u32 *dst_cow_metrics_generic(struct dst_entry *dst, unsigned long old)
>  {
> -	u32 *p = kmalloc(sizeof(u32) * RTAX_MAX, GFP_ATOMIC);
> +	struct dst_metrics *p = kmalloc(sizeof(*p), GFP_ATOMIC);
>  
>  	if (p) {
> -		u32 *old_p = __DST_METRICS_PTR(old);
> +		struct dst_metrics *old_p = (struct dst_metrics *)__DST_METRICS_PTR(old);
>  		unsigned long prev, new;
>  
> -		memcpy(p, old_p, sizeof(u32) * RTAX_MAX);
> +		atomic_set(&p->refcnt, 1);
> +		memcpy(p->metrics, old_p->metrics, sizeof(p->metrics));
>  
>  		new = (unsigned long) p;
>  		prev = cmpxchg(&dst->_metrics, old, new);
>  
>  		if (prev != old) {
>  			kfree(p);
> -			p = __DST_METRICS_PTR(prev);
> +			p = (struct dst_metrics *)__DST_METRICS_PTR(prev);
>  			if (prev & DST_METRICS_READ_ONLY)
>  				p = NULL;
> +		} else if (prev & DST_METRICS_REFCOUNTED) {
> +			if (atomic_dec_and_test(&old_p->refcnt))
> +				kfree(old_p);
>  		}
>  	}
> -	return p;
> +	BUILD_BUG_ON(offsetof(struct dst_metrics, metrics) != 0);
> +	return (u32 *)p;
>  }
>  EXPORT_SYMBOL(dst_cow_metrics_generic);
>  
> @@ -341,7 +346,7 @@ void __dst_destroy_metrics_generic(struct dst_entry *dst, unsigned long old)
>  {
>  	unsigned long prev, new;
>  
> -	new = ((unsigned long) dst_default_metrics) | DST_METRICS_READ_ONLY;
> +	new = ((unsigned long) &dst_default_metrics) | DST_METRICS_READ_ONLY;
>  	prev = cmpxchg(&dst->_metrics, old, new);
>  	if (prev == old)
>  		kfree(__DST_METRICS_PTR(old));
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index da449ddb8cc172bd9091c00057a69a095f98b56d..ad9ad4aab5da7c7d11c3b80edbdfcbdd3d7153fe 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -203,6 +203,7 @@ static void rt_fibinfo_free_cpus(struct rtable __rcu * __percpu *rtp)
>  static void free_fib_info_rcu(struct rcu_head *head)
>  {
>  	struct fib_info *fi = container_of(head, struct fib_info, rcu);
> +	struct dst_metrics *m;
>  
>  	change_nexthops(fi) {
>  		if (nexthop_nh->nh_dev)
> @@ -213,8 +214,9 @@ static void free_fib_info_rcu(struct rcu_head *head)
>  		rt_fibinfo_free(&nexthop_nh->nh_rth_input);
>  	} endfor_nexthops(fi);
>  
> -	if (fi->fib_metrics != (u32 *) dst_default_metrics)
> -		kfree(fi->fib_metrics);
> +	m = fi->fib_metrics;
> +	if (m != &dst_default_metrics && atomic_dec_and_test(&m->refcnt))
> +		kfree(m);
>  	kfree(fi);
>  }
>  
> @@ -971,11 +973,11 @@ fib_convert_metrics(struct fib_info *fi, const struct fib_config *cfg)
>  			val = 255;
>  		if (type == RTAX_FEATURES && (val & ~RTAX_FEATURE_MASK))
>  			return -EINVAL;
> -		fi->fib_metrics[type - 1] = val;
> +		fi->fib_metrics->metrics[type - 1] = val;
>  	}
>  
>  	if (ecn_ca)
> -		fi->fib_metrics[RTAX_FEATURES - 1] |= DST_FEATURE_ECN_CA;
> +		fi->fib_metrics->metrics[RTAX_FEATURES - 1] |= DST_FEATURE_ECN_CA;
>  
>  	return 0;
>  }
> @@ -1033,11 +1035,12 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
>  		goto failure;
>  	fib_info_cnt++;
>  	if (cfg->fc_mx) {
> -		fi->fib_metrics = kzalloc(sizeof(u32) * RTAX_MAX, GFP_KERNEL);
> +		fi->fib_metrics = kzalloc(sizeof(*fi->fib_metrics), GFP_KERNEL);
>  		if (!fi->fib_metrics)
>  			goto failure;
> +		atomic_set(&fi->fib_metrics->refcnt, 1);
>  	} else
> -		fi->fib_metrics = (u32 *) dst_default_metrics;
> +		fi->fib_metrics = (struct dst_metrics *)&dst_default_metrics;
>  
>  	fi->fib_net = net;
>  	fi->fib_protocol = cfg->fc_protocol;
> @@ -1238,7 +1241,7 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event,
>  	if (fi->fib_priority &&
>  	    nla_put_u32(skb, RTA_PRIORITY, fi->fib_priority))
>  		goto nla_put_failure;
> -	if (rtnetlink_put_metrics(skb, fi->fib_metrics) < 0)
> +	if (rtnetlink_put_metrics(skb, fi->fib_metrics->metrics) < 0)
>  		goto nla_put_failure;
>  
>  	if (fi->fib_prefsrc &&
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 655d9eebe43e16a59102edcd3ea4bc177c6b341d..6883b3d4ba8f69de2cb924612d60f5671a219a84 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1385,8 +1385,12 @@ static void rt_add_uncached_list(struct rtable *rt)
>  
>  static void ipv4_dst_destroy(struct dst_entry *dst)
>  {
> +	struct dst_metrics *p = (struct dst_metrics *)DST_METRICS_PTR(dst);
>  	struct rtable *rt = (struct rtable *) dst;
>  
> +	if (p != &dst_default_metrics && atomic_dec_and_test(&p->refcnt))
> +		kfree(p);
> +
>  	if (!list_empty(&rt->rt_uncached)) {
>  		struct uncached_list *ul = rt->rt_uncached_list;
>  
> @@ -1438,7 +1442,11 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
>  			rt->rt_gateway = nh->nh_gw;
>  			rt->rt_uses_gateway = 1;
>  		}
> -		dst_init_metrics(&rt->dst, fi->fib_metrics, true);
> +		dst_init_metrics(&rt->dst, fi->fib_metrics->metrics, true);
> +		if (fi->fib_metrics != &dst_default_metrics) {
> +			rt->dst._metrics |= DST_METRICS_REFCOUNTED;
> +			atomic_inc(&fi->fib_metrics->refcnt);
> +		}
>  #ifdef CONFIG_IP_ROUTE_CLASSID
>  		rt->dst.tclassid = nh->nh_tclassid;
>  #endif

Regards

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net] ipv4: add reference counting to metrics
  2017-05-25 21:27                                                               ` [PATCH net] ipv4: add reference counting to metrics Eric Dumazet
  2017-05-25 22:25                                                                 ` Julian Anastasov
@ 2017-05-26 17:08                                                                 ` Cong Wang
  2017-05-26 17:13                                                                   ` Eric Dumazet
  2017-05-26 18:58                                                                 ` David Miller
  2 siblings, 1 reply; 43+ messages in thread
From: Cong Wang @ 2017-05-26 17:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Eric Dumazet, Andrey Konovalov, Julian Anastasov, netdev

On Thu, May 25, 2017 at 2:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Andrey Konovalov reported crashes in ipv4_mtu()
>
> I could reproduce the issue with KASAN kernels, between
> 10.246.7.151 and 10.246.7.152 :
>
> 1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 &
>
> 2) At the same time run following loop :
> while :
> do
>  ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
>  ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
> done
>
>
> Cong Wang attempted to add back rt->fi in commit
> 82486aa6f1b9 ("ipv4: restore rt->fi for reference counting")
> but this proved to add some issues that were complex to solve.
>
> Instead, I suggested to add a refcount to the metrics themselves,
> being a standalone object (in particular, no reference to other objects)
>
> I tried to make this patch as small as possible to ease its backport,
> instead of being super clean. Note that we believe that only ipv4 dst
> need to take care of the metric refcount. But if this is wrong,
> this patch adds the basic infrastructure to extend this to other
> families.
>
> Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang
> for his efforts on this problem.
>
> Fixes: 2860583fe840 ("ipv4: Kill rt->fi")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Andrey Konovalov <andreyknvl@google.com>

Nice work!! Thanks for your effort of making it small!

Just one nit below.

> -const u32 dst_default_metrics[RTAX_MAX + 1] = {
> +const struct dst_metrics dst_default_metrics = {
>         /* This initializer is needed to force linker to place this variable
>          * into const section. Otherwise it might end into bss section.
>          * We really want to avoid false sharing on this variable, and catch
>          * any writes on it.
>          */
> -       [RTAX_MAX] = 0xdeadbeef,
> +       .refcnt = ATOMIC_INIT(1),
>  };

The code comment above is no longer needed since
we have to initialize refcnt to 1, instead of merely for const
section.

Anyway,

Acked-by: Cong Wang <xiyou.wangcong@gmail.com>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net] ipv4: add reference counting to metrics
  2017-05-26 17:08                                                                 ` Cong Wang
@ 2017-05-26 17:13                                                                   ` Eric Dumazet
  2017-05-26 17:26                                                                     ` Cong Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2017-05-26 17:13 UTC (permalink / raw)
  To: Cong Wang
  Cc: Eric Dumazet, David Miller, Andrey Konovalov, Julian Anastasov, netdev

On Fri, May 26, 2017 at 10:08 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Thu, May 25, 2017 at 2:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Just one nit below.
>
>> -const u32 dst_default_metrics[RTAX_MAX + 1] = {
>> +const struct dst_metrics dst_default_metrics = {
>>         /* This initializer is needed to force linker to place this variable
>>          * into const section. Otherwise it might end into bss section.
>>          * We really want to avoid false sharing on this variable, and catch
>>          * any writes on it.
>>          */
>> -       [RTAX_MAX] = 0xdeadbeef,
>> +       .refcnt = ATOMIC_INIT(1),
>>  };
>
> The code comment above is no longer needed since
> we have to initialize refcnt to 1, instead of merely for const
> section.

I believe the comment is still needed, because normally we make sure
dst_default_metrics.refcnt is never touched (incremened nor
decremented)

So its value should not really matter ?

I found that ATOMIC_INIT(1) was less ugly than the 0xdeadbeef

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net] ipv4: add reference counting to metrics
  2017-05-26 17:13                                                                   ` Eric Dumazet
@ 2017-05-26 17:26                                                                     ` Cong Wang
  0 siblings, 0 replies; 43+ messages in thread
From: Cong Wang @ 2017-05-26 17:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David Miller, Andrey Konovalov, Julian Anastasov, netdev

On Fri, May 26, 2017 at 10:13 AM, Eric Dumazet <edumazet@google.com> wrote:
> On Fri, May 26, 2017 at 10:08 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> On Thu, May 25, 2017 at 2:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> Just one nit below.
>>
>>> -const u32 dst_default_metrics[RTAX_MAX + 1] = {
>>> +const struct dst_metrics dst_default_metrics = {
>>>         /* This initializer is needed to force linker to place this variable
>>>          * into const section. Otherwise it might end into bss section.
>>>          * We really want to avoid false sharing on this variable, and catch
>>>          * any writes on it.
>>>          */
>>> -       [RTAX_MAX] = 0xdeadbeef,
>>> +       .refcnt = ATOMIC_INIT(1),
>>>  };
>>
>> The code comment above is no longer needed since
>> we have to initialize refcnt to 1, instead of merely for const
>> section.
>
> I believe the comment is still needed, because normally we make sure
> dst_default_metrics.refcnt is never touched (incremened nor
> decremented)
>
> So its value should not really matter ?

Oh, you're right, we always have the check on != &dst_default_metrics.


>
> I found that ATOMIC_INIT(1) was less ugly than the 0xdeadbeef

Of course.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net] ipv4: add reference counting to metrics
  2017-05-25 21:27                                                               ` [PATCH net] ipv4: add reference counting to metrics Eric Dumazet
  2017-05-25 22:25                                                                 ` Julian Anastasov
  2017-05-26 17:08                                                                 ` Cong Wang
@ 2017-05-26 18:58                                                                 ` David Miller
  2 siblings, 0 replies; 43+ messages in thread
From: David Miller @ 2017-05-26 18:58 UTC (permalink / raw)
  To: eric.dumazet; +Cc: edumazet, andreyknvl, ja, xiyou.wangcong, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 25 May 2017 14:27:35 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> Andrey Konovalov reported crashes in ipv4_mtu()
> 
> I could reproduce the issue with KASAN kernels, between
> 10.246.7.151 and 10.246.7.152 :
> 
> 1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 &
> 
> 2) At the same time run following loop :
> while :
> do
>  ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
>  ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
> done
> 
> 
> Cong Wang attempted to add back rt->fi in commit
> 82486aa6f1b9 ("ipv4: restore rt->fi for reference counting")
> but this proved to add some issues that were complex to solve.
> 
> Instead, I suggested to add a refcount to the metrics themselves,
> being a standalone object (in particular, no reference to other objects)
> 
> I tried to make this patch as small as possible to ease its backport,
> instead of being super clean. Note that we believe that only ipv4 dst
> need to take care of the metric refcount. But if this is wrong,
> this patch adds the basic infrastructure to extend this to other
> families.
> 
> Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang
> for his efforts on this problem.
> 
> Fixes: 2860583fe840 ("ipv4: Kill rt->fi")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Andrey Konovalov <andreyknvl@google.com>

Applied, thanks everyone for following through on this bug fix.

And sorry for introducing the problem in the first place :)

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2017-05-26 18:58 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-04 21:54 [Patch net] ipv4: restore rt->fi for reference counting Cong Wang
2017-05-07 18:53 ` Eric Dumazet
2017-05-08 18:35 ` David Miller
2017-05-09  0:01   ` Eric Dumazet
2017-05-09  1:22     ` David Miller
2017-05-09  2:18       ` Eric Dumazet
2017-05-09  2:35         ` David Miller
2017-05-09 16:44         ` Cong Wang
2017-05-09 16:56           ` Eric Dumazet
2017-05-09 20:56             ` Cong Wang
2017-05-09 22:07               ` Cong Wang
2017-05-09 22:52                 ` Eric Dumazet
2017-05-09 22:54                   ` Eric Dumazet
2017-05-09 23:09                     ` Eric Dumazet
2017-05-09 23:35                       ` Cong Wang
2017-05-09 23:50                         ` Eric Dumazet
2017-05-10 16:32                           ` Cong Wang
2017-05-09 23:51                         ` Eric Dumazet
2017-05-10 16:40                           ` Cong Wang
2017-05-10  7:38                         ` Julian Anastasov
2017-05-10 17:00                           ` Cong Wang
2017-05-10 19:51                             ` Julian Anastasov
2017-05-12  0:07                               ` Cong Wang
2017-05-12  1:22                                 ` Cong Wang
2017-05-12  4:55                                   ` Eric Dumazet
2017-05-12 17:49                                     ` Cong Wang
2017-05-12  6:39                                   ` Julian Anastasov
2017-05-12 17:27                                     ` Cong Wang
2017-05-12 20:58                                       ` Cong Wang
2017-05-12 21:13                                         ` Cong Wang
2017-05-12 21:27                                       ` Julian Anastasov
2017-05-15 18:34                                         ` Cong Wang
2017-05-15 20:37                                           ` Julian Anastasov
2017-05-15 22:13                                             ` Cong Wang
2017-05-16  7:46                                               ` Julian Anastasov
2017-05-16 17:53                                                 ` Cong Wang
2017-05-16 18:16                                                   ` Cong Wang
     [not found]                                                     ` <1495572267.6465.79.camel@edumazet-glaptop3.roam.corp.google.com>
     [not found]                                                       ` <CAM_iQpX0X3h4Sf+bHUXdJgBqUTxNat0FBT0PeRpLYWju9ci59Q@mail.gmail.com>
     [not found]                                                         ` <CANn89i+mPR+7-AVO2Dsd=KfO=COOVY42AKjwEs=0=GUCML6HUQ@mail.gmail.com>
     [not found]                                                           ` <CAM_iQpUfLmN3yWsCfpx4ZTptBnuYFNuY5CjBKdwoDpvH5K8P=w@mail.gmail.com>
     [not found]                                                             ` <1495665921.6465.95.camel@edumazet-glaptop3.roam.corp.google.com>
2017-05-25 21:27                                                               ` [PATCH net] ipv4: add reference counting to metrics Eric Dumazet
2017-05-25 22:25                                                                 ` Julian Anastasov
2017-05-26 17:08                                                                 ` Cong Wang
2017-05-26 17:13                                                                   ` Eric Dumazet
2017-05-26 17:26                                                                     ` Cong Wang
2017-05-26 18:58                                                                 ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).