All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net 0/3] net: nexthop: fix refcount issues when replacing groups
@ 2021-11-21 15:24 Nikolay Aleksandrov
  2021-11-21 15:24 ` [PATCH net 1/3] net: ipv6: add fib6_nh_release_dsts stub Nikolay Aleksandrov
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Nikolay Aleksandrov @ 2021-11-21 15:24 UTC (permalink / raw)
  To: netdev; +Cc: idosch, davem, kuba, dsahern, Nikolay Aleksandrov

From: Nikolay Aleksandrov <nikolay@nvidia.com>

Hi,
This set fixes a refcount bug when replacing nexthop groups and
modifying routes. It is complex because the objects look valid when
debugging memory dumps, but we end up having refcount dependency between
unlinked objects which can never be released, so in turn they cannot
free their resources and refcounts. The problem happens because we can
have stale IPv6 per-cpu dsts in nexthops which were removed from a
group. Even though the IPv6 gen is bumped, the dsts won't be released
until traffic passes through them or the nexthop is freed, that can take
arbitrarily long time, and even worse we can create a scenario[1] where it
can never be released. The fix is to release the IPv6 per-cpu dsts of
replaced nexthops after an RCU grace period so no new ones can be
created. To do that we add a new IPv6 stub - fib6_nh_release_dsts, which
is used by the nexthop code only when necessary. We can further optimize
group replacement, but that is more suited for net-next as these patches
would have to be backported to stable releases.

Thanks,
 Nik

[1]
This info is also present in patch 02's commit message.
Initial state:
 $ ip nexthop list
  id 200 via 2002:db8::2 dev bridge.10 scope link onlink
  id 201 via 2002:db8::3 dev bridge scope link onlink
  id 203 group 201/200
 $ ip -6 route
  2001:db8::10 nhid 203 metric 1024 pref medium
     nexthop via 2002:db8::3 dev bridge weight 1 onlink
     nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink

Create rt6_info through one of the multipath legs, e.g.:
 $ taskset -a -c 1  ./pkt_inj 24 bridge.10 2001:db8::10
 (pkt_inj is just a custom packet generator, nothing special)

Then remove that leg from the group by replace (let's assume it is id
200 in this case):
 $ ip nexthop replace id 203 group 201

Now remove the IPv6 route:
 $ ip -6 route del 2001:db8::10/128

The route won't be really deleted due to the stale rt6_info holding 1
refcnt in nexthop id 200.
At this point we have the following reference count dependency:
 (deleted) IPv6 route holds 1 reference over nhid 203
 nh 203 holds 1 ref over id 201
 nh 200 holds 1 ref over the net device and the route due to the stale
 rt6_info

Now to create circular dependency between nh 200 and the IPv6 route, and
also to get a reference over nh 200, restore nhid 200 in the group:
 $ ip nexthop replace id 203 group 201/200

And now we have a permanent circular dependncy because nhid 203 holds a
reference over nh 200 and 201, but the route holds a ref over nh 203 and
is deleted.

To trigger the bug just delete the group (nhid 203):
 $ ip nexthop del id 203

It won't really be deleted due to the IPv6 route dependency, and now we
have 2 unlinked and deleted objects that reference each other: the group
and the IPv6 route. Since the group drops the reference it holds over its
entries at free time (i.e. its own refcount needs to drop to 0) that will
never happen and we get a permanent ref on them, since one of the entries
holds a reference over the IPv6 route it will also never be released.

At this point the dependencies are:
 (deleted, only unlinked) IPv6 route holds reference over group nh 203
 (deleted, only unlinked) group nh 203 holds reference over nh 201 and 200
 nh 200 holds 1 ref over the net device and the route due to the stale
 rt6_info

This is the last point where it can be fixed by running traffic through
nh 200, and specifically through the same CPU so the rt6_info (dst) will
get released due to the IPv6 genid, that in turn will free the IPv6
route, which in turn will free the ref count over the group nh 203.

If nh 200 is deleted at this point, it will never be released due to the
ref from the unlinked group 203, it will only be unlinked:
 $ ip nexthop del id 200
 $ ip nexthop
 $

Now we can never release that stale rt6_info, we have IPv6 route with ref
over group nh 203, group nh 203 with ref over nh 200 and 201, nh 200 with
rt6_info (dst) with ref over the net device and the IPv6 route. All of
these objects are only unlinked, and cannot be released, thus they can't
release their ref counts.

 Message from syslogd@dev at Nov 19 14:04:10 ...
  kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
 Message from syslogd@dev at Nov 19 14:04:20 ...
  kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3


Nikolay Aleksandrov (3):
  net: ipv6: add fib6_nh_release_dsts stub
  net: nexthop: release IPv6 per-cpu dsts when replacing a nexthop group
  selftests: net: fib_nexthops: add test for group refcount imbalance
    bug

 include/net/ip6_fib.h                       |  1 +
 include/net/ipv6_stubs.h                    |  1 +
 net/ipv4/nexthop.c                          | 25 ++++++++-
 net/ipv6/af_inet6.c                         |  1 +
 net/ipv6/route.c                            | 19 +++++++
 tools/testing/selftests/net/fib_nexthops.sh | 56 +++++++++++++++++++++
 6 files changed, 101 insertions(+), 2 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH net 1/3] net: ipv6: add fib6_nh_release_dsts stub
  2021-11-21 15:24 [PATCH net 0/3] net: nexthop: fix refcount issues when replacing groups Nikolay Aleksandrov
@ 2021-11-21 15:24 ` Nikolay Aleksandrov
  2021-11-21 15:24 ` [PATCH net 2/3] net: nexthop: release IPv6 per-cpu dsts when replacing a nexthop group Nikolay Aleksandrov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Nikolay Aleksandrov @ 2021-11-21 15:24 UTC (permalink / raw)
  To: netdev; +Cc: idosch, davem, kuba, dsahern, Nikolay Aleksandrov

From: Nikolay Aleksandrov <nikolay@nvidia.com>

We need a way to release a fib6_nh's per-cpu dsts when replacing
nexthops otherwise we can end up with stale per-cpu dsts which hold net
device references, so add a new IPv6 stub called fib6_nh_release_dsts.
It must be used after an RCU grace period, so no new dsts can be created
through a group's nexthop entry.
Similar to fib6_nh_release it shouldn't be used if fib6_nh_init has failed
so it doesn't need a dummy stub when IPv6 is not enabled.

Fixes: 7bf4796dd099 ("nexthops: add support for replace")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
---
 include/net/ip6_fib.h    |  1 +
 include/net/ipv6_stubs.h |  1 +
 net/ipv6/af_inet6.c      |  1 +
 net/ipv6/route.c         | 19 +++++++++++++++++++
 4 files changed, 22 insertions(+)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index c412dde4d67d..83b8070d1cc9 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -485,6 +485,7 @@ int fib6_nh_init(struct net *net, struct fib6_nh *fib6_nh,
 		 struct fib6_config *cfg, gfp_t gfp_flags,
 		 struct netlink_ext_ack *extack);
 void fib6_nh_release(struct fib6_nh *fib6_nh);
+void fib6_nh_release_dsts(struct fib6_nh *fib6_nh);
 
 int call_fib6_entry_notifiers(struct net *net,
 			      enum fib_event_type event_type,
diff --git a/include/net/ipv6_stubs.h b/include/net/ipv6_stubs.h
index afbce90c4480..45e0339be6fa 100644
--- a/include/net/ipv6_stubs.h
+++ b/include/net/ipv6_stubs.h
@@ -47,6 +47,7 @@ struct ipv6_stub {
 			    struct fib6_config *cfg, gfp_t gfp_flags,
 			    struct netlink_ext_ack *extack);
 	void (*fib6_nh_release)(struct fib6_nh *fib6_nh);
+	void (*fib6_nh_release_dsts)(struct fib6_nh *fib6_nh);
 	void (*fib6_update_sernum)(struct net *net, struct fib6_info *rt);
 	int (*ip6_del_rt)(struct net *net, struct fib6_info *rt, bool skip_notify);
 	void (*fib6_rt_update)(struct net *net, struct fib6_info *rt,
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 0c4da163535a..dab4a047590b 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -1026,6 +1026,7 @@ static const struct ipv6_stub ipv6_stub_impl = {
 	.ip6_mtu_from_fib6 = ip6_mtu_from_fib6,
 	.fib6_nh_init	   = fib6_nh_init,
 	.fib6_nh_release   = fib6_nh_release,
+	.fib6_nh_release_dsts = fib6_nh_release_dsts,
 	.fib6_update_sernum = fib6_update_sernum_stub,
 	.fib6_rt_update	   = fib6_rt_update,
 	.ip6_del_rt	   = ip6_del_rt,
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 3ae25b8ffbd6..42d60c76d30a 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3680,6 +3680,25 @@ void fib6_nh_release(struct fib6_nh *fib6_nh)
 	fib_nh_common_release(&fib6_nh->nh_common);
 }
 
+void fib6_nh_release_dsts(struct fib6_nh *fib6_nh)
+{
+	int cpu;
+
+	if (!fib6_nh->rt6i_pcpu)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		struct rt6_info *pcpu_rt, **ppcpu_rt;
+
+		ppcpu_rt = per_cpu_ptr(fib6_nh->rt6i_pcpu, cpu);
+		pcpu_rt = xchg(ppcpu_rt, NULL);
+		if (pcpu_rt) {
+			dst_dev_put(&pcpu_rt->dst);
+			dst_release(&pcpu_rt->dst);
+		}
+	}
+}
+
 static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
 					      gfp_t gfp_flags,
 					      struct netlink_ext_ack *extack)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH net 2/3] net: nexthop: release IPv6 per-cpu dsts when replacing a nexthop group
  2021-11-21 15:24 [PATCH net 0/3] net: nexthop: fix refcount issues when replacing groups Nikolay Aleksandrov
  2021-11-21 15:24 ` [PATCH net 1/3] net: ipv6: add fib6_nh_release_dsts stub Nikolay Aleksandrov
@ 2021-11-21 15:24 ` Nikolay Aleksandrov
  2021-11-21 17:17   ` Ido Schimmel
  2021-11-21 15:24 ` [PATCH net 3/3] selftests: net: fib_nexthops: add test for group refcount imbalance bug Nikolay Aleksandrov
  2021-11-21 17:55 ` [PATCH net 0/3] net: nexthop: fix refcount issues when replacing groups Ido Schimmel
  3 siblings, 1 reply; 13+ messages in thread
From: Nikolay Aleksandrov @ 2021-11-21 15:24 UTC (permalink / raw)
  To: netdev; +Cc: idosch, davem, kuba, dsahern, Nikolay Aleksandrov

From: Nikolay Aleksandrov <nikolay@nvidia.com>

When replacing a nexthop group, we must release the IPv6 per-cpu dsts of
the removed nexthop entries after an RCU grace period because they
contain references to the nexthop's net device and to the fib6 info.
With specific series of events[1] we can reach net device refcount
imbalance which is unrecoverable.

[1]
 $ ip nexthop list
  id 200 via 2002:db8::2 dev bridge.10 scope link onlink
  id 201 via 2002:db8::3 dev bridge scope link onlink
  id 203 group 201/200
 $ ip -6 route
  2001:db8::10 nhid 203 metric 1024 pref medium
     nexthop via 2002:db8::3 dev bridge weight 1 onlink
     nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink

Create rt6_info through one of the multipath legs, e.g.:
 $ taskset -a -c 1  ./pkt_inj 24 bridge.10 2001:db8::10
 (pkt_inj is just a custom packet generator, nothing special)

Then remove that leg from the group by replace (let's assume it is id
200 in this case):
 $ ip nexthop replace id 203 group 201

Now remove the IPv6 route:
 $ ip -6 route del 2001:db8::10/128

The route won't be really deleted due to the stale rt6_info holding 1
refcnt in nexthop id 200.
At this point we have the following reference count dependency:
 (deleted) IPv6 route holds 1 reference over nhid 203
 nh 203 holds 1 ref over id 201
 nh 200 holds 1 ref over the net device and the route due to the stale
 rt6_info

Now to create circular dependency between nh 200 and the IPv6 route, and
also to get a reference over nh 200, restore nhid 200 in the group:
 $ ip nexthop replace id 203 group 201/200

And now we have a permanent circular dependncy because nhid 203 holds a
reference over nh 200 and 201, but the route holds a ref over nh 203 and
is deleted.

To trigger the bug just delete the group (nhid 203):
 $ ip nexthop del id 203

It won't really be deleted due to the IPv6 route dependency, and now we
have 2 unlinked and deleted objects that reference each other: the group
and the IPv6 route. Since the group drops the reference it holds over its
entries at free time (i.e. its own refcount needs to drop to 0) that will
never happen and we get a permanent ref on them, since one of the entries
holds a reference over the IPv6 route it will also never be released.

At this point the dependencies are:
 (deleted, only unlinked) IPv6 route holds reference over group nh 203
 (deleted, only unlinked) group nh 203 holds reference over nh 201 and 200
 nh 200 holds 1 ref over the net device and the route due to the stale
 rt6_info

This is the last point where it can be fixed by running traffic through
nh 200, and specifically through the same CPU so the rt6_info (dst) will
get released due to the IPv6 genid, that in turn will free the IPv6
route, which in turn will free the ref count over the group nh 203.

If nh 200 is deleted at this point, it will never be released due to the
ref from the unlinked group 203, it will only be unlinked:
 $ ip nexthop del id 200
 $ ip nexthop
 $

Now we can never release that stale rt6_info, we have IPv6 route with ref
over group nh 203, group nh 203 with ref over nh 200 and 201, nh 200 with
rt6_info (dst) with ref over the net device and the IPv6 route. All of
these objects are only unlinked, and cannot be released, thus they can't
release their ref counts.

 Message from syslogd@dev at Nov 19 14:04:10 ...
  kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
 Message from syslogd@dev at Nov 19 14:04:20 ...
  kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3

Fixes: 7bf4796dd099 ("nexthops: add support for replace")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
---
 net/ipv4/nexthop.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 9e8100728d46..a69a9e76f99f 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -1899,15 +1899,36 @@ static void remove_nexthop(struct net *net, struct nexthop *nh,
 /* if any FIB entries reference this nexthop, any dst entries
  * need to be regenerated
  */
-static void nh_rt_cache_flush(struct net *net, struct nexthop *nh)
+static void nh_rt_cache_flush(struct net *net, struct nexthop *nh,
+			      struct nexthop *replaced_nh)
 {
 	struct fib6_info *f6i;
+	struct nh_group *nhg;
+	int i;
 
 	if (!list_empty(&nh->fi_list))
 		rt_cache_flush(net);
 
 	list_for_each_entry(f6i, &nh->f6i_list, nh_list)
 		ipv6_stub->fib6_update_sernum(net, f6i);
+
+	/* if an IPv6 group was replaced, we have to release all old
+	 * dsts to make sure all refcounts are released
+	 */
+	if (!replaced_nh->is_group)
+		return;
+
+	/* new dsts must use only the new nexthop group */
+	synchronize_net();
+
+	nhg = rtnl_dereference(replaced_nh->nh_grp);
+	for (i = 0; i < nhg->num_nh; i++) {
+		struct nh_grp_entry *nhge = &nhg->nh_entries[i];
+		struct nh_info *nhi = rtnl_dereference(nhge->nh->nh_info);
+
+		if (nhi->family == AF_INET6)
+			ipv6_stub->fib6_nh_release_dsts(&nhi->fib6_nh);
+	}
 }
 
 static int replace_nexthop_grp(struct net *net, struct nexthop *old,
@@ -2247,7 +2268,7 @@ static int replace_nexthop(struct net *net, struct nexthop *old,
 		err = replace_nexthop_single(net, old, new, extack);
 
 	if (!err) {
-		nh_rt_cache_flush(net, old);
+		nh_rt_cache_flush(net, old, new);
 
 		__remove_nexthop(net, new, NULL);
 		nexthop_put(new);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH net 3/3] selftests: net: fib_nexthops: add test for group refcount imbalance bug
  2021-11-21 15:24 [PATCH net 0/3] net: nexthop: fix refcount issues when replacing groups Nikolay Aleksandrov
  2021-11-21 15:24 ` [PATCH net 1/3] net: ipv6: add fib6_nh_release_dsts stub Nikolay Aleksandrov
  2021-11-21 15:24 ` [PATCH net 2/3] net: nexthop: release IPv6 per-cpu dsts when replacing a nexthop group Nikolay Aleksandrov
@ 2021-11-21 15:24 ` Nikolay Aleksandrov
  2021-11-21 17:53   ` Ido Schimmel
  2021-11-21 17:55 ` [PATCH net 0/3] net: nexthop: fix refcount issues when replacing groups Ido Schimmel
  3 siblings, 1 reply; 13+ messages in thread
From: Nikolay Aleksandrov @ 2021-11-21 15:24 UTC (permalink / raw)
  To: netdev; +Cc: idosch, davem, kuba, dsahern, Nikolay Aleksandrov

From: Nikolay Aleksandrov <nikolay@nvidia.com>

The new selftest runs a sequence which causes circular refcount
dependency between deleted objects which cannot be released and results
in a netdevice refcount imbalance.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
---
 tools/testing/selftests/net/fib_nexthops.sh | 56 +++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/tools/testing/selftests/net/fib_nexthops.sh b/tools/testing/selftests/net/fib_nexthops.sh
index b5a69ad191b0..48d88a36ae27 100755
--- a/tools/testing/selftests/net/fib_nexthops.sh
+++ b/tools/testing/selftests/net/fib_nexthops.sh
@@ -629,6 +629,59 @@ ipv6_fcnal()
 	log_test $? 0 "Nexthops removed on admin down"
 }
 
+ipv6_grp_refs()
+{
+	run_cmd "$IP link set dev veth1 up"
+	run_cmd "$IP link add veth1.10 link veth1 up type vlan id 10"
+	run_cmd "$IP link add veth1.20 link veth1 up type vlan id 20"
+	run_cmd "$IP -6 addr add 2001:db8:91::1/64 dev veth1.10"
+	run_cmd "$IP -6 addr add 2001:db8:92::1/64 dev veth1.20"
+	run_cmd "$IP -6 neigh add 2001:db8:91::2 lladdr 00:11:22:33:44:55 dev veth1.10"
+	run_cmd "$IP -6 neigh add 2001:db8:92::2 lladdr 00:11:22:33:44:55 dev veth1.20"
+	run_cmd "$IP nexthop add id 100 via 2001:db8:91::2 dev veth1.10"
+	run_cmd "$IP nexthop add id 101 via 2001:db8:92::2 dev veth1.20"
+	run_cmd "$IP nexthop add id 102 group 100"
+	run_cmd "$IP route add 2001:db8:101::1/128 nhid 102"
+
+	# create per-cpu dsts through nh 100
+	run_cmd "ip netns exec me mausezahn -6 veth1.10 -B 2001:db8:101::1 -A 2001:db8:91::1 -c 5 -t tcp "dp=1-1023, flags=syn" >/dev/null 2>&1"
+
+	# remove nh 100 from the group to delete the route potentially leaving
+	# a stale per-cpu dst
+	run_cmd "$IP nexthop replace id 102 group 101"
+	run_cmd "$IP route del 2001:db8:101::1/128"
+
+	# add both nexthops to the group so a reference is taken on them
+	run_cmd "$IP nexthop replace id 102 group 100/101"
+
+	# if the bug exists at this point we have an unlinked IPv6 route
+	# (but not freed due to stale dst) with a reference over the group
+	# so we delete the group which will again only unlink it due to the
+	# route reference
+	run_cmd "$IP nexthop del id 102"
+
+	# delete the nexthop with stale dst, since we have an unlinked
+	# group with a ref to it and an unlinked IPv6 route with ref to the
+	# group, the nh will only be unlinked and not freed so the stale dst
+	# remains forever and we get a net device refcount imbalance
+	run_cmd "$IP nexthop del id 100"
+
+	# if the bug exists this command will hang because the net device
+	# cannot be removed
+	timeout -s KILL 5 ip netns exec me ip link del veth1.10 >/dev/null 2>&1
+
+	# we can't cleanup if the command is hung trying to delete the netdev
+	if [ $? -eq 137 ]; then
+		return 1
+	fi
+
+	# cleanup
+	run_cmd "$IP link del veth1.20"
+	run_cmd "$IP nexthop flush"
+
+	return 0
+}
+
 ipv6_grp_fcnal()
 {
 	local rc
@@ -734,6 +787,9 @@ ipv6_grp_fcnal()
 
 	run_cmd "$IP nexthop add id 108 group 31/24"
 	log_test $? 2 "Nexthop group can not have a blackhole and another nexthop"
+
+	ipv6_grp_refs
+	log_test $? 0 "Nexthop group replace refcounts"
 }
 
 ipv6_res_grp_fcnal()
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH net 2/3] net: nexthop: release IPv6 per-cpu dsts when replacing a nexthop group
  2021-11-21 15:24 ` [PATCH net 2/3] net: nexthop: release IPv6 per-cpu dsts when replacing a nexthop group Nikolay Aleksandrov
@ 2021-11-21 17:17   ` Ido Schimmel
  2021-11-21 17:35     ` Ido Schimmel
  0 siblings, 1 reply; 13+ messages in thread
From: Ido Schimmel @ 2021-11-21 17:17 UTC (permalink / raw)
  To: Nikolay Aleksandrov; +Cc: netdev, davem, kuba, dsahern, Nikolay Aleksandrov

On Sun, Nov 21, 2021 at 05:24:52PM +0200, Nikolay Aleksandrov wrote:
> From: Nikolay Aleksandrov <nikolay@nvidia.com>
> 
> When replacing a nexthop group, we must release the IPv6 per-cpu dsts of
> the removed nexthop entries after an RCU grace period because they
> contain references to the nexthop's net device and to the fib6 info.
> With specific series of events[1] we can reach net device refcount
> imbalance which is unrecoverable.
> 
> [1]
>  $ ip nexthop list
>   id 200 via 2002:db8::2 dev bridge.10 scope link onlink
>   id 201 via 2002:db8::3 dev bridge scope link onlink
>   id 203 group 201/200
>  $ ip -6 route
>   2001:db8::10 nhid 203 metric 1024 pref medium
>      nexthop via 2002:db8::3 dev bridge weight 1 onlink
>      nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink
> 
> Create rt6_info through one of the multipath legs, e.g.:
>  $ taskset -a -c 1  ./pkt_inj 24 bridge.10 2001:db8::10
>  (pkt_inj is just a custom packet generator, nothing special)
> 
> Then remove that leg from the group by replace (let's assume it is id
> 200 in this case):
>  $ ip nexthop replace id 203 group 201
> 
> Now remove the IPv6 route:
>  $ ip -6 route del 2001:db8::10/128
> 
> The route won't be really deleted due to the stale rt6_info holding 1
> refcnt in nexthop id 200.
> At this point we have the following reference count dependency:
>  (deleted) IPv6 route holds 1 reference over nhid 203
>  nh 203 holds 1 ref over id 201
>  nh 200 holds 1 ref over the net device and the route due to the stale
>  rt6_info
> 
> Now to create circular dependency between nh 200 and the IPv6 route, and
> also to get a reference over nh 200, restore nhid 200 in the group:
>  $ ip nexthop replace id 203 group 201/200
> 
> And now we have a permanent circular dependncy because nhid 203 holds a
> reference over nh 200 and 201, but the route holds a ref over nh 203 and
> is deleted.
> 
> To trigger the bug just delete the group (nhid 203):
>  $ ip nexthop del id 203
> 
> It won't really be deleted due to the IPv6 route dependency, and now we
> have 2 unlinked and deleted objects that reference each other: the group
> and the IPv6 route. Since the group drops the reference it holds over its
> entries at free time (i.e. its own refcount needs to drop to 0) that will
> never happen and we get a permanent ref on them, since one of the entries
> holds a reference over the IPv6 route it will also never be released.
> 
> At this point the dependencies are:
>  (deleted, only unlinked) IPv6 route holds reference over group nh 203
>  (deleted, only unlinked) group nh 203 holds reference over nh 201 and 200
>  nh 200 holds 1 ref over the net device and the route due to the stale
>  rt6_info
> 
> This is the last point where it can be fixed by running traffic through
> nh 200, and specifically through the same CPU so the rt6_info (dst) will
> get released due to the IPv6 genid, that in turn will free the IPv6
> route, which in turn will free the ref count over the group nh 203.
> 
> If nh 200 is deleted at this point, it will never be released due to the
> ref from the unlinked group 203, it will only be unlinked:
>  $ ip nexthop del id 200
>  $ ip nexthop
>  $
> 
> Now we can never release that stale rt6_info, we have IPv6 route with ref
> over group nh 203, group nh 203 with ref over nh 200 and 201, nh 200 with
> rt6_info (dst) with ref over the net device and the IPv6 route. All of
> these objects are only unlinked, and cannot be released, thus they can't
> release their ref counts.
> 
>  Message from syslogd@dev at Nov 19 14:04:10 ...
>   kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
>  Message from syslogd@dev at Nov 19 14:04:20 ...
>   kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
> 
> Fixes: 7bf4796dd099 ("nexthops: add support for replace")
> Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
> ---
>  net/ipv4/nexthop.c | 25 +++++++++++++++++++++++--
>  1 file changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
> index 9e8100728d46..a69a9e76f99f 100644
> --- a/net/ipv4/nexthop.c
> +++ b/net/ipv4/nexthop.c
> @@ -1899,15 +1899,36 @@ static void remove_nexthop(struct net *net, struct nexthop *nh,
>  /* if any FIB entries reference this nexthop, any dst entries
>   * need to be regenerated
>   */
> -static void nh_rt_cache_flush(struct net *net, struct nexthop *nh)
> +static void nh_rt_cache_flush(struct net *net, struct nexthop *nh,
> +			      struct nexthop *replaced_nh)
>  {
>  	struct fib6_info *f6i;
> +	struct nh_group *nhg;
> +	int i;
>  
>  	if (!list_empty(&nh->fi_list))
>  		rt_cache_flush(net);
>  
>  	list_for_each_entry(f6i, &nh->f6i_list, nh_list)
>  		ipv6_stub->fib6_update_sernum(net, f6i);
> +
> +	/* if an IPv6 group was replaced, we have to release all old
> +	 * dsts to make sure all refcounts are released
> +	 */

This problem is specific to IPv6 because IPv4 dst entries do not hold
references on routes / FIB info thereby avoiding the circular dependency
described in the commit message?

> +	if (!replaced_nh->is_group)
> +		return;

Does it also make sense to skip the part below if we don't have any IPv6
routes using the nexthop?

> +
> +	/* new dsts must use only the new nexthop group */
> +	synchronize_net();
> +
> +	nhg = rtnl_dereference(replaced_nh->nh_grp);

In replace_nexthop_grp() that precedes this function we assign the new
nh_group to the nexthop shell used by the routes:

rcu_assign_pointer(old->nh_grp, newg);

And the old one that you want to purge is stored in
'replaced_nh->nh_grp':

rcu_assign_pointer(new->nh_grp, oldg);

The need for synchronize_net() above stems from the fact that some CPUs
might still be using 'oldg' via 'old->nh_grp'?

If so, we already have one synchronize_net() in replace_nexthop_grp()
for resilient groups. See:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=563f23b002534176f49524b5ca0e1d94d8906c40

Can we avoid two synchronize_net() per resilient group by removing the
one added here and instead do:

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index a69a9e76f99f..a47ce43ab1ff 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -2002,9 +2002,10 @@ static int replace_nexthop_grp(struct net *net, struct nexthop *old,
 
        rcu_assign_pointer(old->nh_grp, newg);
 
+       /* Make sure concurrent readers are not using 'oldg' anymore. */
+       synchronize_net();
+
        if (newg->resilient) {
-               /* Make sure concurrent readers are not using 'oldg' anymore. */
-               synchronize_net();
                rcu_assign_pointer(oldg->res_table, tmp_table);
                rcu_assign_pointer(oldg->spare->res_table, tmp_table);
        }

> +	for (i = 0; i < nhg->num_nh; i++) {
> +		struct nh_grp_entry *nhge = &nhg->nh_entries[i];
> +		struct nh_info *nhi = rtnl_dereference(nhge->nh->nh_info);
> +
> +		if (nhi->family == AF_INET6)
> +			ipv6_stub->fib6_nh_release_dsts(&nhi->fib6_nh);
> +	}
>  }
>  
>  static int replace_nexthop_grp(struct net *net, struct nexthop *old,
> @@ -2247,7 +2268,7 @@ static int replace_nexthop(struct net *net, struct nexthop *old,
>  		err = replace_nexthop_single(net, old, new, extack);
>  
>  	if (!err) {
> -		nh_rt_cache_flush(net, old);
> +		nh_rt_cache_flush(net, old, new);
>  
>  		__remove_nexthop(net, new, NULL);
>  		nexthop_put(new);
> -- 
> 2.31.1
> 

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH net 2/3] net: nexthop: release IPv6 per-cpu dsts when replacing a nexthop group
  2021-11-21 17:17   ` Ido Schimmel
@ 2021-11-21 17:35     ` Ido Schimmel
  2021-11-21 18:02       ` Nikolay Aleksandrov
  0 siblings, 1 reply; 13+ messages in thread
From: Ido Schimmel @ 2021-11-21 17:35 UTC (permalink / raw)
  To: Nikolay Aleksandrov; +Cc: netdev, davem, kuba, dsahern, Nikolay Aleksandrov

On Sun, Nov 21, 2021 at 07:17:41PM +0200, Ido Schimmel wrote:
> On Sun, Nov 21, 2021 at 05:24:52PM +0200, Nikolay Aleksandrov wrote:
> > From: Nikolay Aleksandrov <nikolay@nvidia.com>
> Can we avoid two synchronize_net() per resilient group by removing the
> one added here and instead do:
> 
> diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
> index a69a9e76f99f..a47ce43ab1ff 100644
> --- a/net/ipv4/nexthop.c
> +++ b/net/ipv4/nexthop.c
> @@ -2002,9 +2002,10 @@ static int replace_nexthop_grp(struct net *net, struct nexthop *old,
>  
>         rcu_assign_pointer(old->nh_grp, newg);
>  
> +       /* Make sure concurrent readers are not using 'oldg' anymore. */
> +       synchronize_net();
> +
>         if (newg->resilient) {
> -               /* Make sure concurrent readers are not using 'oldg' anymore. */
> -               synchronize_net();
>                 rcu_assign_pointer(oldg->res_table, tmp_table);
>                 rcu_assign_pointer(oldg->spare->res_table, tmp_table);
>         }

Discussed this with Nik. It is possible and would be a good cleanup for
net-next. For net it is best to leave synchronize_net() where it is so
that the patch will be easier to backport. Resilient nexthop groups were
only added in 5.13 whereas nexthop objects were added in 5.3

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net 3/3] selftests: net: fib_nexthops: add test for group refcount imbalance bug
  2021-11-21 15:24 ` [PATCH net 3/3] selftests: net: fib_nexthops: add test for group refcount imbalance bug Nikolay Aleksandrov
@ 2021-11-21 17:53   ` Ido Schimmel
  2021-11-21 17:59     ` Nikolay Aleksandrov
  0 siblings, 1 reply; 13+ messages in thread
From: Ido Schimmel @ 2021-11-21 17:53 UTC (permalink / raw)
  To: Nikolay Aleksandrov; +Cc: netdev, davem, kuba, dsahern, Nikolay Aleksandrov

On Sun, Nov 21, 2021 at 05:24:53PM +0200, Nikolay Aleksandrov wrote:
> From: Nikolay Aleksandrov <nikolay@nvidia.com>
> 
> The new selftest runs a sequence which causes circular refcount
> dependency between deleted objects which cannot be released and results
> in a netdevice refcount imbalance.
> 
> Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
> ---
>  tools/testing/selftests/net/fib_nexthops.sh | 56 +++++++++++++++++++++
>  1 file changed, 56 insertions(+)
> 
> diff --git a/tools/testing/selftests/net/fib_nexthops.sh b/tools/testing/selftests/net/fib_nexthops.sh
> index b5a69ad191b0..48d88a36ae27 100755
> --- a/tools/testing/selftests/net/fib_nexthops.sh
> +++ b/tools/testing/selftests/net/fib_nexthops.sh
> @@ -629,6 +629,59 @@ ipv6_fcnal()
>  	log_test $? 0 "Nexthops removed on admin down"
>  }
>  
> +ipv6_grp_refs()
> +{
> +	run_cmd "$IP link set dev veth1 up"
> +	run_cmd "$IP link add veth1.10 link veth1 up type vlan id 10"
> +	run_cmd "$IP link add veth1.20 link veth1 up type vlan id 20"
> +	run_cmd "$IP -6 addr add 2001:db8:91::1/64 dev veth1.10"
> +	run_cmd "$IP -6 addr add 2001:db8:92::1/64 dev veth1.20"
> +	run_cmd "$IP -6 neigh add 2001:db8:91::2 lladdr 00:11:22:33:44:55 dev veth1.10"
> +	run_cmd "$IP -6 neigh add 2001:db8:92::2 lladdr 00:11:22:33:44:55 dev veth1.20"
> +	run_cmd "$IP nexthop add id 100 via 2001:db8:91::2 dev veth1.10"
> +	run_cmd "$IP nexthop add id 101 via 2001:db8:92::2 dev veth1.20"
> +	run_cmd "$IP nexthop add id 102 group 100"
> +	run_cmd "$IP route add 2001:db8:101::1/128 nhid 102"
> +
> +	# create per-cpu dsts through nh 100
> +	run_cmd "ip netns exec me mausezahn -6 veth1.10 -B 2001:db8:101::1 -A 2001:db8:91::1 -c 5 -t tcp "dp=1-1023, flags=syn" >/dev/null 2>&1"

I see that other test cases in this file that are using mausezahn check
that it exists. See ipv4_torture() for example

> +
> +	# remove nh 100 from the group to delete the route potentially leaving
> +	# a stale per-cpu dst

Not sure I understand the comment. Maybe:

"Remove nh 100 from the group. If the bug described in the previous
commit is not fixed, the nexthop continues to cache a per-CPU dst entry
that holds a reference on the IPv6 route."

?

> +	run_cmd "$IP nexthop replace id 102 group 101"
> +	run_cmd "$IP route del 2001:db8:101::1/128"
> +
> +	# add both nexthops to the group so a reference is taken on them
> +	run_cmd "$IP nexthop replace id 102 group 100/101"
> +
> +	# if the bug exists at this point we have an unlinked IPv6 route

I would mention that by "the bug" you are referring to the bug described
in previous commit

> +	# (but not freed due to stale dst) with a reference over the group
> +	# so we delete the group which will again only unlink it due to the
> +	# route reference
> +	run_cmd "$IP nexthop del id 102"
> +
> +	# delete the nexthop with stale dst, since we have an unlinked
> +	# group with a ref to it and an unlinked IPv6 route with ref to the
> +	# group, the nh will only be unlinked and not freed so the stale dst
> +	# remains forever and we get a net device refcount imbalance
> +	run_cmd "$IP nexthop del id 100"
> +
> +	# if the bug exists this command will hang because the net device
> +	# cannot be removed
> +	timeout -s KILL 5 ip netns exec me ip link del veth1.10 >/dev/null 2>&1
> +
> +	# we can't cleanup if the command is hung trying to delete the netdev
> +	if [ $? -eq 137 ]; then
> +		return 1
> +	fi
> +
> +	# cleanup
> +	run_cmd "$IP link del veth1.20"
> +	run_cmd "$IP nexthop flush"
> +
> +	return 0
> +}
> +
>  ipv6_grp_fcnal()
>  {
>  	local rc
> @@ -734,6 +787,9 @@ ipv6_grp_fcnal()
>  
>  	run_cmd "$IP nexthop add id 108 group 31/24"
>  	log_test $? 2 "Nexthop group can not have a blackhole and another nexthop"
> +
> +	ipv6_grp_refs
> +	log_test $? 0 "Nexthop group replace refcounts"
>  }
>  
>  ipv6_res_grp_fcnal()
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net 0/3] net: nexthop: fix refcount issues when replacing groups
  2021-11-21 15:24 [PATCH net 0/3] net: nexthop: fix refcount issues when replacing groups Nikolay Aleksandrov
                   ` (2 preceding siblings ...)
  2021-11-21 15:24 ` [PATCH net 3/3] selftests: net: fib_nexthops: add test for group refcount imbalance bug Nikolay Aleksandrov
@ 2021-11-21 17:55 ` Ido Schimmel
  2021-11-21 18:17   ` Nikolay Aleksandrov
  3 siblings, 1 reply; 13+ messages in thread
From: Ido Schimmel @ 2021-11-21 17:55 UTC (permalink / raw)
  To: Nikolay Aleksandrov; +Cc: netdev, davem, kuba, dsahern, Nikolay Aleksandrov

On Sun, Nov 21, 2021 at 05:24:50PM +0200, Nikolay Aleksandrov wrote:
> From: Nikolay Aleksandrov <nikolay@nvidia.com>
> 
> Hi,
> This set fixes a refcount bug when replacing nexthop groups and
> modifying routes. It is complex because the objects look valid when
> debugging memory dumps, but we end up having refcount dependency between
> unlinked objects which can never be released, so in turn they cannot
> free their resources and refcounts. The problem happens because we can
> have stale IPv6 per-cpu dsts in nexthops which were removed from a
> group. Even though the IPv6 gen is bumped, the dsts won't be released
> until traffic passes through them or the nexthop is freed, that can take
> arbitrarily long time, and even worse we can create a scenario[1] where it
> can never be released. The fix is to release the IPv6 per-cpu dsts of
> replaced nexthops after an RCU grace period so no new ones can be
> created. To do that we add a new IPv6 stub - fib6_nh_release_dsts, which
> is used by the nexthop code only when necessary. We can further optimize
> group replacement, but that is more suited for net-next as these patches
> would have to be backported to stable releases.

Will run regression with these patches tonight and report tomorrow

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net 3/3] selftests: net: fib_nexthops: add test for group refcount imbalance bug
  2021-11-21 17:53   ` Ido Schimmel
@ 2021-11-21 17:59     ` Nikolay Aleksandrov
  0 siblings, 0 replies; 13+ messages in thread
From: Nikolay Aleksandrov @ 2021-11-21 17:59 UTC (permalink / raw)
  To: Ido Schimmel, Nikolay Aleksandrov; +Cc: netdev, davem, kuba, dsahern

On 21/11/2021 19:53, Ido Schimmel wrote:
> On Sun, Nov 21, 2021 at 05:24:53PM +0200, Nikolay Aleksandrov wrote:
>> From: Nikolay Aleksandrov <nikolay@nvidia.com>
>>
>> The new selftest runs a sequence which causes circular refcount
>> dependency between deleted objects which cannot be released and results
>> in a netdevice refcount imbalance.
>>
>> Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
>> ---
>>  tools/testing/selftests/net/fib_nexthops.sh | 56 +++++++++++++++++++++
>>  1 file changed, 56 insertions(+)
>>
>> diff --git a/tools/testing/selftests/net/fib_nexthops.sh b/tools/testing/selftests/net/fib_nexthops.sh
>> index b5a69ad191b0..48d88a36ae27 100755
>> --- a/tools/testing/selftests/net/fib_nexthops.sh
>> +++ b/tools/testing/selftests/net/fib_nexthops.sh
>> @@ -629,6 +629,59 @@ ipv6_fcnal()
>>  	log_test $? 0 "Nexthops removed on admin down"
>>  }
>>  
>> +ipv6_grp_refs()
>> +{
>> +	run_cmd "$IP link set dev veth1 up"
>> +	run_cmd "$IP link add veth1.10 link veth1 up type vlan id 10"
>> +	run_cmd "$IP link add veth1.20 link veth1 up type vlan id 20"
>> +	run_cmd "$IP -6 addr add 2001:db8:91::1/64 dev veth1.10"
>> +	run_cmd "$IP -6 addr add 2001:db8:92::1/64 dev veth1.20"
>> +	run_cmd "$IP -6 neigh add 2001:db8:91::2 lladdr 00:11:22:33:44:55 dev veth1.10"
>> +	run_cmd "$IP -6 neigh add 2001:db8:92::2 lladdr 00:11:22:33:44:55 dev veth1.20"
>> +	run_cmd "$IP nexthop add id 100 via 2001:db8:91::2 dev veth1.10"
>> +	run_cmd "$IP nexthop add id 101 via 2001:db8:92::2 dev veth1.20"
>> +	run_cmd "$IP nexthop add id 102 group 100"
>> +	run_cmd "$IP route add 2001:db8:101::1/128 nhid 102"
>> +
>> +	# create per-cpu dsts through nh 100
>> +	run_cmd "ip netns exec me mausezahn -6 veth1.10 -B 2001:db8:101::1 -A 2001:db8:91::1 -c 5 -t tcp "dp=1-1023, flags=syn" >/dev/null 2>&1"
> 
> I see that other test cases in this file that are using mausezahn check
> that it exists. See ipv4_torture() for example
> 

Indeed, I'll adjust the test

>> +
>> +	# remove nh 100 from the group to delete the route potentially leaving
>> +	# a stale per-cpu dst
> 
> Not sure I understand the comment. Maybe:
> 
> "Remove nh 100 from the group. If the bug described in the previous
> commit is not fixed, the nexthop continues to cache a per-CPU dst entry
> that holds a reference on the IPv6 route."
> 
> ?
> 

Yes, that is the stale per-cpu dst.

>> +	run_cmd "$IP nexthop replace id 102 group 101"
>> +	run_cmd "$IP route del 2001:db8:101::1/128"
>> +
>> +	# add both nexthops to the group so a reference is taken on them
>> +	run_cmd "$IP nexthop replace id 102 group 100/101"
>> +
>> +	# if the bug exists at this point we have an unlinked IPv6 route
> 
> I would mention that by "the bug" you are referring to the bug described
> in previous commit
> 

since there is no commit id yet, I can give a brief description only
I'll may refer to it by subject though

>> +	# (but not freed due to stale dst) with a reference over the group
>> +	# so we delete the group which will again only unlink it due to the
>> +	# route reference
>> +	run_cmd "$IP nexthop del id 102"
>> +
>> +	# delete the nexthop with stale dst, since we have an unlinked
>> +	# group with a ref to it and an unlinked IPv6 route with ref to the
>> +	# group, the nh will only be unlinked and not freed so the stale dst
>> +	# remains forever and we get a net device refcount imbalance
>> +	run_cmd "$IP nexthop del id 100"
>> +
>> +	# if the bug exists this command will hang because the net device
>> +	# cannot be removed
>> +	timeout -s KILL 5 ip netns exec me ip link del veth1.10 >/dev/null 2>&1
>> +
>> +	# we can't cleanup if the command is hung trying to delete the netdev
>> +	if [ $? -eq 137 ]; then
>> +		return 1
>> +	fi
>> +
>> +	# cleanup
>> +	run_cmd "$IP link del veth1.20"
>> +	run_cmd "$IP nexthop flush"
>> +
>> +	return 0
>> +}
>> +
>>  ipv6_grp_fcnal()
>>  {
>>  	local rc
>> @@ -734,6 +787,9 @@ ipv6_grp_fcnal()
>>  
>>  	run_cmd "$IP nexthop add id 108 group 31/24"
>>  	log_test $? 2 "Nexthop group can not have a blackhole and another nexthop"
>> +
>> +	ipv6_grp_refs
>> +	log_test $? 0 "Nexthop group replace refcounts"
>>  }
>>  
>>  ipv6_res_grp_fcnal()
>> -- 
>> 2.31.1
>>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net 2/3] net: nexthop: release IPv6 per-cpu dsts when replacing a nexthop group
  2021-11-21 17:35     ` Ido Schimmel
@ 2021-11-21 18:02       ` Nikolay Aleksandrov
  0 siblings, 0 replies; 13+ messages in thread
From: Nikolay Aleksandrov @ 2021-11-21 18:02 UTC (permalink / raw)
  To: Ido Schimmel, Nikolay Aleksandrov; +Cc: netdev, davem, kuba, dsahern

On 21/11/2021 19:35, Ido Schimmel wrote:
> On Sun, Nov 21, 2021 at 07:17:41PM +0200, Ido Schimmel wrote:
>> On Sun, Nov 21, 2021 at 05:24:52PM +0200, Nikolay Aleksandrov wrote:
>>> From: Nikolay Aleksandrov <nikolay@nvidia.com>
>> Can we avoid two synchronize_net() per resilient group by removing the
>> one added here and instead do:
>>
>> diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
>> index a69a9e76f99f..a47ce43ab1ff 100644
>> --- a/net/ipv4/nexthop.c
>> +++ b/net/ipv4/nexthop.c
>> @@ -2002,9 +2002,10 @@ static int replace_nexthop_grp(struct net *net, struct nexthop *old,
>>  
>>         rcu_assign_pointer(old->nh_grp, newg);
>>  
>> +       /* Make sure concurrent readers are not using 'oldg' anymore. */
>> +       synchronize_net();
>> +
>>         if (newg->resilient) {
>> -               /* Make sure concurrent readers are not using 'oldg' anymore. */
>> -               synchronize_net();
>>                 rcu_assign_pointer(oldg->res_table, tmp_table);
>>                 rcu_assign_pointer(oldg->spare->res_table, tmp_table);
>>         }
> 
> Discussed this with Nik. It is possible and would be a good cleanup for
> net-next. For net it is best to leave synchronize_net() where it is so
> that the patch will be easier to backport. Resilient nexthop groups were
> only added in 5.13 whereas nexthop objects were added in 5.3
> 

Indeed, thank you for the review. I'll send patches for this suggestion and
the IPv6 optimization (that is one of the optimizations I was referring to
in the cover letter) for net-next after net is merged.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net 0/3] net: nexthop: fix refcount issues when replacing groups
  2021-11-21 17:55 ` [PATCH net 0/3] net: nexthop: fix refcount issues when replacing groups Ido Schimmel
@ 2021-11-21 18:17   ` Nikolay Aleksandrov
  2021-11-22  9:48     ` Ido Schimmel
  0 siblings, 1 reply; 13+ messages in thread
From: Nikolay Aleksandrov @ 2021-11-21 18:17 UTC (permalink / raw)
  To: Ido Schimmel, Nikolay Aleksandrov; +Cc: netdev, davem, kuba, dsahern

On 21/11/2021 19:55, Ido Schimmel wrote:
> On Sun, Nov 21, 2021 at 05:24:50PM +0200, Nikolay Aleksandrov wrote:
>> From: Nikolay Aleksandrov <nikolay@nvidia.com>
>>
>> Hi,
>> This set fixes a refcount bug when replacing nexthop groups and
>> modifying routes. It is complex because the objects look valid when
>> debugging memory dumps, but we end up having refcount dependency between
>> unlinked objects which can never be released, so in turn they cannot
>> free their resources and refcounts. The problem happens because we can
>> have stale IPv6 per-cpu dsts in nexthops which were removed from a
>> group. Even though the IPv6 gen is bumped, the dsts won't be released
>> until traffic passes through them or the nexthop is freed, that can take
>> arbitrarily long time, and even worse we can create a scenario[1] where it
>> can never be released. The fix is to release the IPv6 per-cpu dsts of
>> replaced nexthops after an RCU grace period so no new ones can be
>> created. To do that we add a new IPv6 stub - fib6_nh_release_dsts, which
>> is used by the nexthop code only when necessary. We can further optimize
>> group replacement, but that is more suited for net-next as these patches
>> would have to be backported to stable releases.
> 
> Will run regression with these patches tonight and report tomorrow
> 

Thank you, I've prepared v2 with the selftest mausezahn check and will hold
it off to see how the tests would go. Also if any comments show up in the
meantime. :)

By the way I've been running a torture test all day for multiple IPv6 route
forwarding + local traffic through different CPUs while also replacing multiple
nh groups referencing multiple nexthops, so far it looks good.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net 0/3] net: nexthop: fix refcount issues when replacing groups
  2021-11-21 18:17   ` Nikolay Aleksandrov
@ 2021-11-22  9:48     ` Ido Schimmel
  2021-11-22  9:53       ` Nikolay Aleksandrov
  0 siblings, 1 reply; 13+ messages in thread
From: Ido Schimmel @ 2021-11-22  9:48 UTC (permalink / raw)
  To: Nikolay Aleksandrov; +Cc: Nikolay Aleksandrov, netdev, davem, kuba, dsahern

On Sun, Nov 21, 2021 at 08:17:49PM +0200, Nikolay Aleksandrov wrote:
> On 21/11/2021 19:55, Ido Schimmel wrote:
> > On Sun, Nov 21, 2021 at 05:24:50PM +0200, Nikolay Aleksandrov wrote:
> >> From: Nikolay Aleksandrov <nikolay@nvidia.com>
> >>
> >> Hi,
> >> This set fixes a refcount bug when replacing nexthop groups and
> >> modifying routes. It is complex because the objects look valid when
> >> debugging memory dumps, but we end up having refcount dependency between
> >> unlinked objects which can never be released, so in turn they cannot
> >> free their resources and refcounts. The problem happens because we can
> >> have stale IPv6 per-cpu dsts in nexthops which were removed from a
> >> group. Even though the IPv6 gen is bumped, the dsts won't be released
> >> until traffic passes through them or the nexthop is freed, that can take
> >> arbitrarily long time, and even worse we can create a scenario[1] where it
> >> can never be released. The fix is to release the IPv6 per-cpu dsts of
> >> replaced nexthops after an RCU grace period so no new ones can be
> >> created. To do that we add a new IPv6 stub - fib6_nh_release_dsts, which
> >> is used by the nexthop code only when necessary. We can further optimize
> >> group replacement, but that is more suited for net-next as these patches
> >> would have to be backported to stable releases.
> > 
> > Will run regression with these patches tonight and report tomorrow
> > 
> 
> Thank you, I've prepared v2 with the selftest mausezahn check and will hold
> it off to see how the tests would go. Also if any comments show up in the
> meantime. :)
> 
> By the way I've been running a torture test all day for multiple IPv6 route
> forwarding + local traffic through different CPUs while also replacing multiple
> nh groups referencing multiple nexthops, so far it looks good.

Regression looks good. Later today I will also have results from a debug
kernel, but I think it should be fine.

Regarding patch #2, can you add a comment (or edit the commit message)
to explain why the fix is only relevant for IPv4? I made this comment,
but I think it was missed:

"This problem is specific to IPv6 because IPv4 dst entries do not hold
references on routes / FIB info thereby avoiding the circular dependency
described in the commit message?"

Thanks!

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net 0/3] net: nexthop: fix refcount issues when replacing groups
  2021-11-22  9:48     ` Ido Schimmel
@ 2021-11-22  9:53       ` Nikolay Aleksandrov
  0 siblings, 0 replies; 13+ messages in thread
From: Nikolay Aleksandrov @ 2021-11-22  9:53 UTC (permalink / raw)
  To: Ido Schimmel, Nikolay Aleksandrov; +Cc: netdev, davem, kuba, dsahern

On 22/11/2021 11:48, Ido Schimmel wrote:
> On Sun, Nov 21, 2021 at 08:17:49PM +0200, Nikolay Aleksandrov wrote:
>> On 21/11/2021 19:55, Ido Schimmel wrote:
>>> On Sun, Nov 21, 2021 at 05:24:50PM +0200, Nikolay Aleksandrov wrote:
>>>> From: Nikolay Aleksandrov <nikolay@nvidia.com>
>>>>
>>>> Hi,
>>>> This set fixes a refcount bug when replacing nexthop groups and
>>>> modifying routes. It is complex because the objects look valid when
>>>> debugging memory dumps, but we end up having refcount dependency between
>>>> unlinked objects which can never be released, so in turn they cannot
>>>> free their resources and refcounts. The problem happens because we can
>>>> have stale IPv6 per-cpu dsts in nexthops which were removed from a
>>>> group. Even though the IPv6 gen is bumped, the dsts won't be released
>>>> until traffic passes through them or the nexthop is freed, that can take
>>>> arbitrarily long time, and even worse we can create a scenario[1] where it
>>>> can never be released. The fix is to release the IPv6 per-cpu dsts of
>>>> replaced nexthops after an RCU grace period so no new ones can be
>>>> created. To do that we add a new IPv6 stub - fib6_nh_release_dsts, which
>>>> is used by the nexthop code only when necessary. We can further optimize
>>>> group replacement, but that is more suited for net-next as these patches
>>>> would have to be backported to stable releases.
>>>
>>> Will run regression with these patches tonight and report tomorrow
>>>
>>
>> Thank you, I've prepared v2 with the selftest mausezahn check and will hold
>> it off to see how the tests would go. Also if any comments show up in the
>> meantime. :)
>>
>> By the way I've been running a torture test all day for multiple IPv6 route
>> forwarding + local traffic through different CPUs while also replacing multiple
>> nh groups referencing multiple nexthops, so far it looks good.
> 
> Regression looks good. Later today I will also have results from a debug
> kernel, but I think it should be fine.
> 
> Regarding patch #2, can you add a comment (or edit the commit message)
> to explain why the fix is only relevant for IPv4? I made this comment,
> but I think it was missed:
> 

I saw it, I've updated the commit msg to reflect why IPv4 isn't affected.

> "This problem is specific to IPv6 because IPv4 dst entries do not hold
> references on routes / FIB info thereby avoiding the circular dependency
> described in the commit message?"
> > Thanks!
> 

Cheers,
 Nik



^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-11-22  9:54 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-21 15:24 [PATCH net 0/3] net: nexthop: fix refcount issues when replacing groups Nikolay Aleksandrov
2021-11-21 15:24 ` [PATCH net 1/3] net: ipv6: add fib6_nh_release_dsts stub Nikolay Aleksandrov
2021-11-21 15:24 ` [PATCH net 2/3] net: nexthop: release IPv6 per-cpu dsts when replacing a nexthop group Nikolay Aleksandrov
2021-11-21 17:17   ` Ido Schimmel
2021-11-21 17:35     ` Ido Schimmel
2021-11-21 18:02       ` Nikolay Aleksandrov
2021-11-21 15:24 ` [PATCH net 3/3] selftests: net: fib_nexthops: add test for group refcount imbalance bug Nikolay Aleksandrov
2021-11-21 17:53   ` Ido Schimmel
2021-11-21 17:59     ` Nikolay Aleksandrov
2021-11-21 17:55 ` [PATCH net 0/3] net: nexthop: fix refcount issues when replacing groups Ido Schimmel
2021-11-21 18:17   ` Nikolay Aleksandrov
2021-11-22  9:48     ` Ido Schimmel
2021-11-22  9:53       ` Nikolay Aleksandrov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.