netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access
@ 2021-09-28 12:54 Antoine Tenart
  2021-09-28 12:54 ` [RFC PATCH net-next 1/9] net-sysfs: try not to restart the syscall if it will fail eventually Antoine Tenart
                   ` (10 more replies)
  0 siblings, 11 replies; 25+ messages in thread
From: Antoine Tenart @ 2021-09-28 12:54 UTC (permalink / raw)
  To: davem, kuba
  Cc: Antoine Tenart, pabeni, gregkh, ebiederm, stephen, herbert,
	juri.lelli, netdev

Hello,

[Feel free to Cc anyone interested in this]

Please read this before looking at the patches: they are possible
improvements or fixes, that might or might not be acceptable, and there
might be other ways to improve the situation. But we at least wanted to
provide some pointers. Patch 1 is a possible workaround and the rest of
the series is an attempt at fixing this; the two are not necessarily
linked. More below.

We had a report creating pods had performance issues. This was seen when
using -rt kernel but we know it also happens on non-rt kernel (although
the performance impact is smaller). The issue is worse as multiple pods
are created, e.g. at boot time. The issue came down to userspace
spinning on net sysfs reads when virtual Ethernet pair devices were
created or moved: systemd-udevd, NetworkManager & others would wait for
events and trigger internal functions reading attributes such as
'phys_port_id' depending on 1) their implementation 2) the distro or
user configuration (udev rules for example). Tests showed the spin also
happens for a single veth pair creation (on both -rt and non-rt).

What made those syscalls to spin is the following construction (which is
found a lot in net sysfs and sysctl code):

  if (!rtnl_trylock())
          return restart_syscall();

Even with low lock contention if a syscall fails to take the rtnl lock
it goes back to VFS and spins in userspace, which has a huge impact
(compared to using rtnl_lock). The above construction is however there
for a good reason: it was introduced[1] years ago as a workaround to
deadlocks in net[2][3], as the initial issues were (and still are) not
the nice type.

Fixing the issue described here is simple, replacing rtnl_trylock &
restart_syscall with an rtnl_lock is enough. However the initial issues
have then to be fixed for the kernel to work properly.

First, a partial workaround is described in patch 1 ("net-sysfs: try not
to restart the syscall if it will fail eventually"). It is not a fix,
far from being perfect, nor it can be applied for all attributes. But it
does help a lot in most cases as the 'phys_*' attributes are read by
default by systemd and NM (and probably others) when adding or moving
interfaces; although those attributes are not always implemented (or not
at all for many virtual interfaces including veth) and eventually fail.

Then, to understand what could be done to fix this properly we need to
understand what are the initial deadlock issues the trylock/restart
construction fixed. An explanation is done in the initial thread[2];
here is a tl;dr: there are two ABBA deadlocks, between the net device
unregistration and the sysfs or sysctl unregistration (waiting for files
to be unused). Both can be seen as:

              A                            B

   unregister_netdevice_many         sysfs/sysctl access
   rtnl_lock                         sysfs/sysctl lock/refcount
				     rtnl_lock
   sysfs/sysctl drains files
   => waits for B                    => waits for A

We'll focus on net sysfs here[4]. One way to fix ABBA deadlocks is to
invert two locks:

- Looking at thread A, it doesn't seem OK to release and take back the
  rtnl lock in the sysfs draining logic (plus this would make the net
  device unregistration split, which would introduce other issues).

- Looking at thread B, we could take the rtnl lock before the sysfs
  refcount. But that needs to be done one layer up, as we can't access
  sysfs (kernfs) nodes without increasing their refcount first. In the
  end this would mean layers violation, lots of added helpers (there are
  multiple levels of indirections here) to access the rtnl lock. Or
  having it hardcoded in a non-net part. All this didn't looked good.

Another possibility would be to split the rtnl lock. That would be
great, some work already was done, but this is reasonably not for the
near future (if ever doable).

In the end we thought about doing the sysfs drain outside the rtnl lock.
The net device unregistration is already done in two parts:
unregister_netdevice_many and netdev_run_todo (where part of it drops
the rtnl lock). Moving device_del there does the trick, but another
change needs to be done: the naming collision logic has to be extended
until then. (Otherwise the net device name is free to be used between
unregister_netdevice_many and device_del, allowing a concurrent net
device registration to call device_add with the same name; which does
not end well).[6]

The drawback is this has implications about assumptions we currently
have regarding the net device lifecycle: there would now be a window
between starting the unregistration and running todo where names would
still be reserved. This would not be an "rtnl atomic" operation. I see
two new behaviours at least:

- It might be possible to not see a device with name A while still not
  be able to register a new one with the same name, for a short period
  of time.

- Maybe more problematic: __rtnl_newlink would now be able to fail
  because of naming collision. (The logic currently looks for an
  existing device, and if so uses it. With the extension of the naming
  collision, there would be a window were an interface is not usable
  *and* its name is still reserved).

An attempt at doing the above is provided: all patches expect 1.

Side note: netlink doesn't have the above issues (trylock isn't used
there). I know it is seen as the preferable interface for userspace (and
it allows to group attributes); but sysfs is there, used, and won't be
removed anytime soon (if ever).

Thanks for reading until here! Thoughts? Suggestions are more than
welcomed (either about the patches provided or about other ways to
improve the situation).

Antoine

[1] https://lore.kernel.org/netdev/20090226084924.16cb3e08@nehalam/
[2] https://lore.kernel.org/netdev/49A4D5D5.5090602@trash.net/
[3] https://lore.kernel.org/netdev/m14oyhis31.fsf@fess.ebiederm.org/
[4] The sysctl deadlock also happens when renaming a net device, as
    sysctl needs to go through unregistration/registration in an "rtnl
    atomic" operation (sysctl doesn't support renaming and this might
    not change[5]). It makes sense here to tackle first the net sysfs
    issue: while sysctl can be configured from userspace per-interface
    at device creation time, it is however not always used; sysfs is. (A
    fix for net sysfs could probably applied to sysctl with additional
    changes, making net sysfs a good first step candidate as well).
[5] https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L4139
    This isn't linked to sysctl; but it might give an idea how
    improvements in device renaming support would be welcomed. (Not
    judging in any way).
[6] Alternatively sysfs_create_dir_ns could be modified not to call
    dump_stack on naming collisions. But 1) removing this would not only
    impact net 2) doing it conditionally looks quite invasive. (I think
    naming collision detection is also always done by subsystems and not
    expected to happen in device_*).

Antoine Tenart (9):
  net-sysfs: try not to restart the syscall if it will fail eventually
  net: split unlisting the net device from unlisting its node name
  net: export netdev_name_node_lookup
  bonding: use the correct function to check for netdev name collision
  ppp: use the correct function to check for netdev name collision
  net: use the correct function to check for netdev name collision
  net: delay the removal of the name nodes until run_todo
  net: delay device_del until run_todo
  net-sysfs: remove the use of rtnl_trylock/restart_syscall

 drivers/net/bonding/bond_sysfs.c |  4 +--
 drivers/net/ppp/ppp_generic.c    |  2 +-
 include/linux/netdevice.h        |  2 ++
 net/core/dev.c                   | 40 +++++++++++++++++-----
 net/core/net-sysfs.c             | 59 ++++++++------------------------
 5 files changed, 50 insertions(+), 57 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC PATCH net-next 1/9] net-sysfs: try not to restart the syscall if it will fail eventually
  2021-09-28 12:54 [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access Antoine Tenart
@ 2021-09-28 12:54 ` Antoine Tenart
  2021-09-28 12:54 ` [RFC PATCH net-next 2/9] net: split unlisting the net device from unlisting its node name Antoine Tenart
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Antoine Tenart @ 2021-09-28 12:54 UTC (permalink / raw)
  To: davem, kuba
  Cc: Antoine Tenart, pabeni, gregkh, ebiederm, stephen, herbert,
	juri.lelli, netdev

Due to deadlocks in the networking subsystem spotted 12 years ago[1],
a workaround was put in place[2] to avoid taking the rtnl lock when it
was not available and restarting the syscall (back to VFS, letting
userspace spin). The following construction is found a lot in the net
sysfs and sysctl code:

  if (!rtnl_trylock())
          return restart_syscall();

This can be problematic when multiple userspace threads use such
interfaces in a short period, making them to spin a lot. This happens
for example when adding and moving virtual interfaces: userspace
programs listening on events, such as systemd-udevd and NetworkManager,
do trigger actions reading files in sysfs. It gets worse when a lot of
virtual interfaces are created concurrently, say when creating
containers at boot time.

Returning early without hitting the above pattern when the syscall will
fail eventually does make things better. While it is not a fix for the
issue, it does ease things.

[1] https://lore.kernel.org/netdev/49A4D5D5.5090602@trash.net/
    https://lore.kernel.org/netdev/m14oyhis31.fsf@fess.ebiederm.org/
    and https://lore.kernel.org/netdev/20090226084924.16cb3e08@nehalam/
[2] Rightfully, those deadlocks are *hard* to solve.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
 net/core/net-sysfs.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index f6197774048b..21c3fdeccf20 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -196,6 +196,12 @@ static ssize_t speed_show(struct device *dev,
 	struct net_device *netdev = to_net_dev(dev);
 	int ret = -EINVAL;
 
+	/* The check is also done in __ethtool_get_link_ksettings; this helps
+	 * returning early without hitting the trylock/restart below.
+	 */
+	if (!netdev->ethtool_ops->get_link_ksettings)
+		return -EOPNOTSUPP;
+
 	if (!rtnl_trylock())
 		return restart_syscall();
 
@@ -216,6 +222,12 @@ static ssize_t duplex_show(struct device *dev,
 	struct net_device *netdev = to_net_dev(dev);
 	int ret = -EINVAL;
 
+	/* The check is also done in __ethtool_get_link_ksettings; this helps
+	 * returning early without hitting the trylock/restart below.
+	 */
+	if (!netdev->ethtool_ops->get_link_ksettings)
+		return -EOPNOTSUPP;
+
 	if (!rtnl_trylock())
 		return restart_syscall();
 
@@ -478,6 +490,12 @@ static ssize_t phys_port_id_show(struct device *dev,
 	struct net_device *netdev = to_net_dev(dev);
 	ssize_t ret = -EINVAL;
 
+	/* The check is also done in dev_get_phys_port_id; this helps returning
+	 * early without hitting the trylock/restart below.
+	 */
+	if (!netdev->netdev_ops->ndo_get_phys_port_id)
+		return -EOPNOTSUPP;
+
 	if (!rtnl_trylock())
 		return restart_syscall();
 
@@ -500,6 +518,13 @@ static ssize_t phys_port_name_show(struct device *dev,
 	struct net_device *netdev = to_net_dev(dev);
 	ssize_t ret = -EINVAL;
 
+	/* The checks are also done in dev_get_phys_port_name; this helps
+	 * returning early without hitting the trylock/restart below.
+	 */
+	if (!netdev->netdev_ops->ndo_get_phys_port_name &&
+	    !netdev->netdev_ops->ndo_get_devlink_port)
+		return -EOPNOTSUPP;
+
 	if (!rtnl_trylock())
 		return restart_syscall();
 
@@ -522,6 +547,13 @@ static ssize_t phys_switch_id_show(struct device *dev,
 	struct net_device *netdev = to_net_dev(dev);
 	ssize_t ret = -EINVAL;
 
+	/* The checks are also done in dev_get_phys_port_name; this helps
+	 * returning early without hitting the trylock/restart below.
+	 */
+	if (!netdev->netdev_ops->ndo_get_port_parent_id &&
+	    !netdev->netdev_ops->ndo_get_devlink_port)
+		return -EOPNOTSUPP;
+
 	if (!rtnl_trylock())
 		return restart_syscall();
 
@@ -1226,6 +1258,12 @@ static ssize_t tx_maxrate_store(struct netdev_queue *queue,
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
 
+	/* The check is also done later; this helps returning early without
+	 * hitting the trylock/restart below.
+	 */
+	if (!dev->netdev_ops->ndo_set_tx_maxrate)
+		return -EOPNOTSUPP;
+
 	err = kstrtou32(buf, 10, &rate);
 	if (err < 0)
 		return err;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC PATCH net-next 2/9] net: split unlisting the net device from unlisting its node name
  2021-09-28 12:54 [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access Antoine Tenart
  2021-09-28 12:54 ` [RFC PATCH net-next 1/9] net-sysfs: try not to restart the syscall if it will fail eventually Antoine Tenart
@ 2021-09-28 12:54 ` Antoine Tenart
  2021-09-28 12:54 ` [RFC PATCH net-next 3/9] net: export netdev_name_node_lookup Antoine Tenart
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Antoine Tenart @ 2021-09-28 12:54 UTC (permalink / raw)
  To: davem, kuba
  Cc: Antoine Tenart, pabeni, gregkh, ebiederm, stephen, herbert,
	juri.lelli, netdev

This (mostly [*]) cosmetic patch spits the unlisting of the net device
from the unlisting of its node name. The two unlisting are still done
for now at the same places in the code, keeping the logic.

[*] The two removals are now not done in a single dev_base_lock locking
    section. That is not an issue as insertion/deletion from both lists
    doesn't have to be atomic from the dev_base_lock point of view.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
 net/core/dev.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index fa989ab63f29..2f28b70e5244 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -385,13 +385,21 @@ static void unlist_netdevice(struct net_device *dev)
 	/* Unlink dev from the device chain */
 	write_lock_bh(&dev_base_lock);
 	list_del_rcu(&dev->dev_list);
-	netdev_name_node_del(dev->name_node);
 	hlist_del_rcu(&dev->index_hlist);
 	write_unlock_bh(&dev_base_lock);
 
 	dev_base_seq_inc(dev_net(dev));
 }
 
+static void unlist_netdevice_name(struct net_device *dev)
+{
+	ASSERT_RTNL();
+
+	write_lock_bh(&dev_base_lock);
+	netdev_name_node_del(dev->name_node);
+	write_unlock_bh(&dev_base_lock);
+}
+
 /*
  *	Our notifier list
  */
@@ -11030,6 +11038,7 @@ void unregister_netdevice_many(struct list_head *head)
 	list_for_each_entry(dev, head, unreg_list) {
 		/* And unlink it from device chain. */
 		unlist_netdevice(dev);
+		unlist_netdevice_name(dev);
 
 		dev->reg_state = NETREG_UNREGISTERING;
 	}
@@ -11177,6 +11186,7 @@ int __dev_change_net_namespace(struct net_device *dev, struct net *net,
 
 	/* And unlink it from device chain */
 	unlist_netdevice(dev);
+	unlist_netdevice_name(dev);
 
 	synchronize_net();
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC PATCH net-next 3/9] net: export netdev_name_node_lookup
  2021-09-28 12:54 [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access Antoine Tenart
  2021-09-28 12:54 ` [RFC PATCH net-next 1/9] net-sysfs: try not to restart the syscall if it will fail eventually Antoine Tenart
  2021-09-28 12:54 ` [RFC PATCH net-next 2/9] net: split unlisting the net device from unlisting its node name Antoine Tenart
@ 2021-09-28 12:54 ` Antoine Tenart
  2021-09-28 12:54 ` [RFC PATCH net-next 4/9] bonding: use the correct function to check for netdev name collision Antoine Tenart
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Antoine Tenart @ 2021-09-28 12:54 UTC (permalink / raw)
  To: davem, kuba
  Cc: Antoine Tenart, pabeni, gregkh, ebiederm, stephen, herbert,
	juri.lelli, netdev

Export netdev_name_node_lookup for use outside of net/core/dev.c. Prior
to this __dev_get_by_name was used for both name collision detection and
to retrieve a net device reference. We now want to allow a difference in
behaviour between the two, hence exporting netdev_name_node_lookup. (It
will be used in the next commits).

Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
 include/linux/netdevice.h | 2 ++
 net/core/dev.c            | 5 +++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d79163208dfd..f8dae47b9aa8 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2955,6 +2955,8 @@ struct net_device *__dev_get_by_flags(struct net *net, unsigned short flags,
 struct net_device *dev_get_by_name(struct net *net, const char *name);
 struct net_device *dev_get_by_name_rcu(struct net *net, const char *name);
 struct net_device *__dev_get_by_name(struct net *net, const char *name);
+struct netdev_name_node *netdev_name_node_lookup(struct net *net,
+						 const char *name);
 int dev_alloc_name(struct net_device *dev, const char *name);
 int dev_open(struct net_device *dev, struct netlink_ext_ack *extack);
 void dev_close(struct net_device *dev);
diff --git a/net/core/dev.c b/net/core/dev.c
index 2f28b70e5244..bfe17a264d6c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -279,8 +279,8 @@ static void netdev_name_node_del(struct netdev_name_node *name_node)
 	hlist_del_rcu(&name_node->hlist);
 }
 
-static struct netdev_name_node *netdev_name_node_lookup(struct net *net,
-							const char *name)
+struct netdev_name_node *netdev_name_node_lookup(struct net *net,
+						 const char *name)
 {
 	struct hlist_head *head = dev_name_hash(net, name);
 	struct netdev_name_node *name_node;
@@ -290,6 +290,7 @@ static struct netdev_name_node *netdev_name_node_lookup(struct net *net,
 			return name_node;
 	return NULL;
 }
+EXPORT_SYMBOL(netdev_name_node_lookup);
 
 static struct netdev_name_node *netdev_name_node_lookup_rcu(struct net *net,
 							    const char *name)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC PATCH net-next 4/9] bonding: use the correct function to check for netdev name collision
  2021-09-28 12:54 [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access Antoine Tenart
                   ` (2 preceding siblings ...)
  2021-09-28 12:54 ` [RFC PATCH net-next 3/9] net: export netdev_name_node_lookup Antoine Tenart
@ 2021-09-28 12:54 ` Antoine Tenart
  2021-09-28 12:54 ` [RFC PATCH net-next 5/9] ppp: " Antoine Tenart
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Antoine Tenart @ 2021-09-28 12:54 UTC (permalink / raw)
  To: davem, kuba
  Cc: Antoine Tenart, pabeni, gregkh, ebiederm, stephen, herbert,
	juri.lelli, netdev

netdev_name_node_lookup and __dev_get_by_name have two distinct aims,
one helps in name collision detection, while the other is used to
retrieve a reference to a net device from its name. Here in
bond_create_sysfs we want to check for a name collision, hence use the
correct function.

(The behaviour of the two functions was similar but will change in the
next commits).

Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
 drivers/net/bonding/bond_sysfs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index b9e9842fed94..8260bf941ca3 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -811,8 +811,8 @@ int bond_create_sysfs(struct bond_net *bn)
 	 */
 	if (ret == -EEXIST) {
 		/* Is someone being kinky and naming a device bonding_master? */
-		if (__dev_get_by_name(bn->net,
-				      class_attr_bonding_masters.attr.name))
+		if (netdev_name_node_lookup(bn->net,
+					    class_attr_bonding_masters.attr.name))
 			pr_err("network device named %s already exists in sysfs\n",
 			       class_attr_bonding_masters.attr.name);
 		ret = 0;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC PATCH net-next 5/9] ppp: use the correct function to check for netdev name collision
  2021-09-28 12:54 [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access Antoine Tenart
                   ` (3 preceding siblings ...)
  2021-09-28 12:54 ` [RFC PATCH net-next 4/9] bonding: use the correct function to check for netdev name collision Antoine Tenart
@ 2021-09-28 12:54 ` Antoine Tenart
  2021-09-28 12:54 ` [RFC PATCH net-next 6/9] net: " Antoine Tenart
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Antoine Tenart @ 2021-09-28 12:54 UTC (permalink / raw)
  To: davem, kuba
  Cc: Antoine Tenart, pabeni, gregkh, ebiederm, stephen, herbert,
	juri.lelli, netdev

netdev_name_node_lookup and __dev_get_by_name have two distinct aims,
one helps in name collision detection, while the other is used to
retrieve a reference to a net device from its name. Here in
ppp_unit_register we want to check for a name collision, hence use the
correct function.

(The behaviour of the two functions was similar but will change in the
next commits).

Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
 drivers/net/ppp/ppp_generic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
index fb52cd175b45..c3e2f4da9949 100644
--- a/drivers/net/ppp/ppp_generic.c
+++ b/drivers/net/ppp/ppp_generic.c
@@ -1161,7 +1161,7 @@ static int ppp_unit_register(struct ppp *ppp, int unit, bool ifname_is_set)
 		if (!ifname_is_set) {
 			while (1) {
 				snprintf(ppp->dev->name, IFNAMSIZ, "ppp%i", ret);
-				if (!__dev_get_by_name(ppp->ppp_net, ppp->dev->name))
+				if (!netdev_name_node_lookup(ppp->ppp_net, ppp->dev->name))
 					break;
 				unit_put(&pn->units_idr, ret);
 				ret = unit_get(&pn->units_idr, ppp, ret + 1);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC PATCH net-next 6/9] net: use the correct function to check for netdev name collision
  2021-09-28 12:54 [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access Antoine Tenart
                   ` (4 preceding siblings ...)
  2021-09-28 12:54 ` [RFC PATCH net-next 5/9] ppp: " Antoine Tenart
@ 2021-09-28 12:54 ` Antoine Tenart
  2021-09-28 12:54 ` [RFC PATCH net-next 7/9] net: delay the removal of the name nodes until run_todo Antoine Tenart
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Antoine Tenart @ 2021-09-28 12:54 UTC (permalink / raw)
  To: davem, kuba
  Cc: Antoine Tenart, pabeni, gregkh, ebiederm, stephen, herbert,
	juri.lelli, netdev

netdev_name_node_lookup and __dev_get_by_name have two distinct aims,
one helps in name collision detection, while the other is used to
retrieve a reference to a net device from its name. Fix the callers
checking for a name collision.

(The behaviour of the two functions was similar but will change in the
next commits).

Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
 net/core/dev.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index bfe17a264d6c..02f9d505dbe2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1142,7 +1142,7 @@ static int __dev_alloc_name(struct net *net, const char *name, char *buf)
 	}
 
 	snprintf(buf, IFNAMSIZ, name, i);
-	if (!__dev_get_by_name(net, buf))
+	if (!netdev_name_node_lookup(net, buf))
 		return i;
 
 	/* It is possible to run out of possible slots
@@ -1196,7 +1196,7 @@ static int dev_get_valid_name(struct net *net, struct net_device *dev,
 
 	if (strchr(name, '%'))
 		return dev_alloc_name_ns(net, dev, name);
-	else if (__dev_get_by_name(net, name))
+	else if (netdev_name_node_lookup(net, name))
 		return -EEXIST;
 	else if (dev->name != name)
 		strlcpy(dev->name, name, IFNAMSIZ);
@@ -11164,7 +11164,7 @@ int __dev_change_net_namespace(struct net_device *dev, struct net *net,
 	 * we can use it in the destination network namespace.
 	 */
 	err = -EEXIST;
-	if (__dev_get_by_name(net, dev->name)) {
+	if (netdev_name_node_lookup(net, dev->name)) {
 		/* We get here if we can't use the current device name */
 		if (!pat)
 			goto out;
@@ -11518,7 +11518,7 @@ static void __net_exit default_device_exit(struct net *net)
 
 		/* Push remaining network devices to init_net */
 		snprintf(fb_name, IFNAMSIZ, "dev%d", dev->ifindex);
-		if (__dev_get_by_name(&init_net, fb_name))
+		if (netdev_name_node_lookup(&init_net, fb_name))
 			snprintf(fb_name, IFNAMSIZ, "dev%%d");
 		err = dev_change_net_namespace(dev, &init_net, fb_name);
 		if (err) {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC PATCH net-next 7/9] net: delay the removal of the name nodes until run_todo
  2021-09-28 12:54 [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access Antoine Tenart
                   ` (5 preceding siblings ...)
  2021-09-28 12:54 ` [RFC PATCH net-next 6/9] net: " Antoine Tenart
@ 2021-09-28 12:54 ` Antoine Tenart
  2021-09-28 12:54 ` [RFC PATCH net-next 8/9] net: delay device_del " Antoine Tenart
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Antoine Tenart @ 2021-09-28 12:54 UTC (permalink / raw)
  To: davem, kuba
  Cc: Antoine Tenart, pabeni, gregkh, ebiederm, stephen, herbert,
	juri.lelli, netdev

Keep the node name collision detection working until the last
registration stage, by delaying the removal of the name nodes in
run_todo. This allows to perform unregistration operations being
sensitive to name collisions, in run_todo. As run_todo has sections of
code running without the rtnl lock taken, this will allow to perform
some of those operations not under this lock (when possible).

While we move the removal of the name node until a late unregistration
stage, we still want to avoid returning a net device reference when it's
being unregistered (calling __dev_get_by_name for example). We keep this
logic by setting the node name dev reference to NULL. This follows the
logic of __dev_get_by_name. Altnames are in the same list, they are not
special here.

From now on we have to be strict on the use of __dev_get_by_name vs
netdev_name_node_lookup. One is designed to get the device, the other
one to lookup in the list of currently reserved names. Current users
should have been fixed by previous patches.

One side effect is there is now a window between unregistering the
netdevice and running the todo where names are still reserved and can't
be used for new device creation.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
 net/core/dev.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 02f9d505dbe2..a1eab120bb50 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10611,10 +10611,15 @@ void netdev_run_todo(void)
 		if (dev->needs_free_netdev)
 			free_netdev(dev);
 
-		/* Report a network device has been unregistered */
 		rtnl_lock();
+		unlist_netdevice_name(dev);
+		synchronize_net();
+		netdev_name_node_free(dev->name_node);
+
 		dev_net(dev)->dev_unreg_count--;
 		__rtnl_unlock();
+
+		/* Report a network device has been unregistered */
 		wake_up(&netdev_unregistering_wq);
 
 		/* Free network device */
@@ -11039,7 +11044,12 @@ void unregister_netdevice_many(struct list_head *head)
 	list_for_each_entry(dev, head, unreg_list) {
 		/* And unlink it from device chain. */
 		unlist_netdevice(dev);
-		unlist_netdevice_name(dev);
+
+		/* Unreference the net device from the node name. From this
+		 * point on the node name is only used for naming collision
+		 * detection.
+		 */
+		dev->name_node->dev = NULL;
 
 		dev->reg_state = NETREG_UNREGISTERING;
 	}
@@ -11072,7 +11082,6 @@ void unregister_netdevice_many(struct list_head *head)
 		dev_mc_flush(dev);
 
 		netdev_name_node_alt_flush(dev);
-		netdev_name_node_free(dev->name_node);
 
 		if (dev->netdev_ops->ndo_uninit)
 			dev->netdev_ops->ndo_uninit(dev);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC PATCH net-next 8/9] net: delay device_del until run_todo
  2021-09-28 12:54 [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access Antoine Tenart
                   ` (6 preceding siblings ...)
  2021-09-28 12:54 ` [RFC PATCH net-next 7/9] net: delay the removal of the name nodes until run_todo Antoine Tenart
@ 2021-09-28 12:54 ` Antoine Tenart
  2021-09-29  0:02   ` Jakub Kicinski
  2021-09-28 12:55 ` [RFC PATCH net-next 9/9] net-sysfs: remove the use of rtnl_trylock/restart_syscall Antoine Tenart
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 25+ messages in thread
From: Antoine Tenart @ 2021-09-28 12:54 UTC (permalink / raw)
  To: davem, kuba
  Cc: Antoine Tenart, pabeni, gregkh, ebiederm, stephen, herbert,
	juri.lelli, netdev

Move the deletion of the device from unregister_netdevice_many to
netdev_run_todo and move it outside the rtnl lock.

12 years ago was reported an ABBA deadlock between net-sysfs and the
netdevice unregistration[1]. The issue was the following:

              A                            B

   unregister_netdevice_many         sysfs access
   rtnl_lock                         sysfs refcount
				     rtnl_lock
   drain sysfs files
   => waits for B                    => waits for A

This was avoided thanks to two patches[2][3], which used rtnl_trylock in
net-sysfs and restarted the syscall when the rtnl lock was already
taken. This way kernfs nodes were not blocking the netdevice
unregistration anymore.

This was fine at the time but is now causing some issues: creating and
moving interfaces makes userspace (systemd, NetworkManager or others) to
spin a lot as syscalls are restarted, which has an impact on
performance. This happens for example when creating pods. While
userspace applications could be improved, fixing this in-kernel has the
benefit of fixing the root cause of the issue.

The sysfs removal is done in device_del, and moving it outside of the
rtnl lock does fix the initial deadlock. With that the trylock/restart
logic can be removed in a following-up patch.

[1] https://lore.kernel.org/netdev/49A4D5D5.5090602@trash.net/
(I'm referencing the full thread but the sysfs issue was discussed later
in the thread).
[2] 336ca57c3b4e ("net-sysfs: Use rtnl_trylock in sysfs methods.")
[3] 5a5990d3090b ("net: Avoid race between network down and sysfs")

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
 net/core/dev.c       | 2 ++
 net/core/net-sysfs.c | 2 --
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index a1eab120bb50..d774fbec5d63 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10593,6 +10593,8 @@ void netdev_run_todo(void)
 			continue;
 		}
 
+		device_del(&dev->dev);
+
 		dev->reg_state = NETREG_UNREGISTERED;
 
 		netdev_wait_allrefs(dev);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 21c3fdeccf20..e754f00c117b 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1955,8 +1955,6 @@ void netdev_unregister_kobject(struct net_device *ndev)
 	remove_queue_kobjects(ndev);
 
 	pm_runtime_set_memalloc_noio(dev, false);
-
-	device_del(dev);
 }
 
 /* Create sysfs entries for network device. */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC PATCH net-next 9/9] net-sysfs: remove the use of rtnl_trylock/restart_syscall
  2021-09-28 12:54 [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access Antoine Tenart
                   ` (7 preceding siblings ...)
  2021-09-28 12:54 ` [RFC PATCH net-next 8/9] net: delay device_del " Antoine Tenart
@ 2021-09-28 12:55 ` Antoine Tenart
  2021-10-06  6:45   ` Michal Hocko
  2021-10-06  6:37 ` [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access Michal Hocko
  2021-10-29 14:33 ` Antoine Tenart
  10 siblings, 1 reply; 25+ messages in thread
From: Antoine Tenart @ 2021-09-28 12:55 UTC (permalink / raw)
  To: davem, kuba
  Cc: Antoine Tenart, pabeni, gregkh, ebiederm, stephen, herbert,
	juri.lelli, netdev

The ABBA deadlock avoided by using rtnl_trylock and restart_syscall was
fixed in previous commits, we can now remove the use of this
trylock/restart logic and have net-sysfs operations not spinning when
rtnl is already taken.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
 net/core/net-sysfs.c | 95 +++++++-------------------------------------
 1 file changed, 14 insertions(+), 81 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index e754f00c117b..987b32fd8604 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -90,9 +90,7 @@ static ssize_t netdev_store(struct device *dev, struct device_attribute *attr,
 	if (ret)
 		goto err;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
-
+	rtnl_lock();
 	if (dev_isalive(netdev)) {
 		ret = (*set)(netdev, new);
 		if (ret == 0)
@@ -196,15 +194,7 @@ static ssize_t speed_show(struct device *dev,
 	struct net_device *netdev = to_net_dev(dev);
 	int ret = -EINVAL;
 
-	/* The check is also done in __ethtool_get_link_ksettings; this helps
-	 * returning early without hitting the trylock/restart below.
-	 */
-	if (!netdev->ethtool_ops->get_link_ksettings)
-		return -EOPNOTSUPP;
-
-	if (!rtnl_trylock())
-		return restart_syscall();
-
+	rtnl_lock();
 	if (netif_running(netdev)) {
 		struct ethtool_link_ksettings cmd;
 
@@ -222,15 +212,7 @@ static ssize_t duplex_show(struct device *dev,
 	struct net_device *netdev = to_net_dev(dev);
 	int ret = -EINVAL;
 
-	/* The check is also done in __ethtool_get_link_ksettings; this helps
-	 * returning early without hitting the trylock/restart below.
-	 */
-	if (!netdev->ethtool_ops->get_link_ksettings)
-		return -EOPNOTSUPP;
-
-	if (!rtnl_trylock())
-		return restart_syscall();
-
+	rtnl_lock();
 	if (netif_running(netdev)) {
 		struct ethtool_link_ksettings cmd;
 
@@ -427,9 +409,7 @@ static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
 	if (len >  0 && buf[len - 1] == '\n')
 		--count;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
-
+	rtnl_lock();
 	if (dev_isalive(netdev)) {
 		ret = dev_set_alias(netdev, buf, count);
 		if (ret < 0)
@@ -490,15 +470,7 @@ static ssize_t phys_port_id_show(struct device *dev,
 	struct net_device *netdev = to_net_dev(dev);
 	ssize_t ret = -EINVAL;
 
-	/* The check is also done in dev_get_phys_port_id; this helps returning
-	 * early without hitting the trylock/restart below.
-	 */
-	if (!netdev->netdev_ops->ndo_get_phys_port_id)
-		return -EOPNOTSUPP;
-
-	if (!rtnl_trylock())
-		return restart_syscall();
-
+	rtnl_lock();
 	if (dev_isalive(netdev)) {
 		struct netdev_phys_item_id ppid;
 
@@ -518,16 +490,7 @@ static ssize_t phys_port_name_show(struct device *dev,
 	struct net_device *netdev = to_net_dev(dev);
 	ssize_t ret = -EINVAL;
 
-	/* The checks are also done in dev_get_phys_port_name; this helps
-	 * returning early without hitting the trylock/restart below.
-	 */
-	if (!netdev->netdev_ops->ndo_get_phys_port_name &&
-	    !netdev->netdev_ops->ndo_get_devlink_port)
-		return -EOPNOTSUPP;
-
-	if (!rtnl_trylock())
-		return restart_syscall();
-
+	rtnl_lock();
 	if (dev_isalive(netdev)) {
 		char name[IFNAMSIZ];
 
@@ -547,16 +510,7 @@ static ssize_t phys_switch_id_show(struct device *dev,
 	struct net_device *netdev = to_net_dev(dev);
 	ssize_t ret = -EINVAL;
 
-	/* The checks are also done in dev_get_phys_port_name; this helps
-	 * returning early without hitting the trylock/restart below.
-	 */
-	if (!netdev->netdev_ops->ndo_get_port_parent_id &&
-	    !netdev->netdev_ops->ndo_get_devlink_port)
-		return -EOPNOTSUPP;
-
-	if (!rtnl_trylock())
-		return restart_syscall();
-
+	rtnl_lock();
 	if (dev_isalive(netdev)) {
 		struct netdev_phys_item_id ppid = { };
 
@@ -576,9 +530,7 @@ static ssize_t threaded_show(struct device *dev,
 	struct net_device *netdev = to_net_dev(dev);
 	ssize_t ret = -EINVAL;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
-
+	rtnl_lock();
 	if (dev_isalive(netdev))
 		ret = sprintf(buf, fmt_dec, netdev->threaded);
 
@@ -1214,9 +1166,7 @@ static ssize_t traffic_class_show(struct netdev_queue *queue,
 	if (!netif_is_multiqueue(dev))
 		return -ENOENT;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
-
+	rtnl_lock();
 	index = get_netdev_queue_index(queue);
 
 	/* If queue belongs to subordinate dev use its TC mapping */
@@ -1258,18 +1208,11 @@ static ssize_t tx_maxrate_store(struct netdev_queue *queue,
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
 
-	/* The check is also done later; this helps returning early without
-	 * hitting the trylock/restart below.
-	 */
-	if (!dev->netdev_ops->ndo_set_tx_maxrate)
-		return -EOPNOTSUPP;
-
 	err = kstrtou32(buf, 10, &rate);
 	if (err < 0)
 		return err;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	rtnl_lock();
 
 	err = -EOPNOTSUPP;
 	if (dev->netdev_ops->ndo_set_tx_maxrate)
@@ -1460,8 +1403,7 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue, char *buf)
 
 	index = get_netdev_queue_index(queue);
 
-	if (!rtnl_trylock())
-		return restart_syscall();
+	rtnl_lock();
 
 	/* If queue belongs to subordinate dev use its map */
 	dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
@@ -1507,11 +1449,7 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
 		return err;
 	}
 
-	if (!rtnl_trylock()) {
-		free_cpumask_var(mask);
-		return restart_syscall();
-	}
-
+	rtnl_lock();
 	err = netif_set_xps_queue(dev, mask, index);
 	rtnl_unlock();
 
@@ -1531,9 +1469,7 @@ static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
 
 	index = get_netdev_queue_index(queue);
 
-	if (!rtnl_trylock())
-		return restart_syscall();
-
+	rtnl_lock();
 	tc = netdev_txq_to_tc(dev, index);
 	rtnl_unlock();
 	if (tc < 0)
@@ -1566,10 +1502,7 @@ static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
 		return err;
 	}
 
-	if (!rtnl_trylock()) {
-		bitmap_free(mask);
-		return restart_syscall();
-	}
+	rtnl_lock();
 
 	cpus_read_lock();
 	err = __netif_set_xps_queue(dev, mask, index, XPS_RXQS);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 8/9] net: delay device_del until run_todo
  2021-09-28 12:54 ` [RFC PATCH net-next 8/9] net: delay device_del " Antoine Tenart
@ 2021-09-29  0:02   ` Jakub Kicinski
  2021-09-29  8:26     ` Antoine Tenart
  0 siblings, 1 reply; 25+ messages in thread
From: Jakub Kicinski @ 2021-09-29  0:02 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, pabeni, gregkh, ebiederm, stephen, herbert, juri.lelli, netdev

On Tue, 28 Sep 2021 14:54:59 +0200 Antoine Tenart wrote:
> The sysfs removal is done in device_del, and moving it outside of the
> rtnl lock does fix the initial deadlock. With that the trylock/restart
> logic can be removed in a following-up patch.

> diff --git a/net/core/dev.c b/net/core/dev.c
> index a1eab120bb50..d774fbec5d63 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -10593,6 +10593,8 @@ void netdev_run_todo(void)
>  			continue;
>  		}
>  
> +		device_del(&dev->dev);
> +
>  		dev->reg_state = NETREG_UNREGISTERED;
>  
>  		netdev_wait_allrefs(dev);
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index 21c3fdeccf20..e754f00c117b 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -1955,8 +1955,6 @@ void netdev_unregister_kobject(struct net_device *ndev)
>  	remove_queue_kobjects(ndev);
>  
>  	pm_runtime_set_memalloc_noio(dev, false);
> -
> -	device_del(dev);
>  }
>  
>  /* Create sysfs entries for network device. */

Doesn't this mean there may be sysfs files which are accessible 
for an unregistered netdevice? Isn't the point of having device_del()
under rtnl_lock() to make sure we sysfs handlers can't run on dead
devices?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 8/9] net: delay device_del until run_todo
  2021-09-29  0:02   ` Jakub Kicinski
@ 2021-09-29  8:26     ` Antoine Tenart
  2021-09-29 13:31       ` Jakub Kicinski
  0 siblings, 1 reply; 25+ messages in thread
From: Antoine Tenart @ 2021-09-29  8:26 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, pabeni, gregkh, ebiederm, stephen, herbert, juri.lelli, netdev

Quoting Jakub Kicinski (2021-09-29 02:02:29)
> On Tue, 28 Sep 2021 14:54:59 +0200 Antoine Tenart wrote:
> > The sysfs removal is done in device_del, and moving it outside of the
> > rtnl lock does fix the initial deadlock. With that the trylock/restart
> > logic can be removed in a following-up patch.
> 
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index a1eab120bb50..d774fbec5d63 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -10593,6 +10593,8 @@ void netdev_run_todo(void)
> >                       continue;
> >               }
> >  
> > +             device_del(&dev->dev);
> > +
> >               dev->reg_state = NETREG_UNREGISTERED;
> >  
> >               netdev_wait_allrefs(dev);
> > diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> > index 21c3fdeccf20..e754f00c117b 100644
> > --- a/net/core/net-sysfs.c
> > +++ b/net/core/net-sysfs.c
> > @@ -1955,8 +1955,6 @@ void netdev_unregister_kobject(struct net_device *ndev)
> >       remove_queue_kobjects(ndev);
> >  
> >       pm_runtime_set_memalloc_noio(dev, false);
> > -
> > -     device_del(dev);
> >  }
> >  
> >  /* Create sysfs entries for network device. */
> 
> Doesn't this mean there may be sysfs files which are accessible 
> for an unregistered netdevice?

It would mean having accessible sysfs files for a device in the
NETREG_UNREGISTERING state; NETREG_UNREGISTERED still comes after
device_del. It's a small difference but still important, I think.

You raise a good point. Yes, that would mean accessing attributes of net
devices being unregistered, meaning accessing or modifying unused or
obsolete parameters and data (it shouldn't be garbage data though).
Unlisting those sysfs files without removing them would be better here,
to not expose files when the device is being unregistered while still
allowing pending operations to complete. I don't know if that is doable
in sysfs.

(While I did ran stress tests reading/writing attributes while
unregistering devices, I think I missed an issue with the
netdev_queue_default attributes; which hopefully can be fixed — if the
whole idea is deemed acceptable).

> Isn't the point of having device_del() under rtnl_lock() to make sure
> we sysfs handlers can't run on dead devices?

Hard to say what was the initial point, there is a lot of history here
:) I'm not sure it was done because of a particular reason; IMHO it just
made sense to make this simple without having a good reason not to do
so. And it helped with the naming collision detection.

Thanks!
Antoine

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 8/9] net: delay device_del until run_todo
  2021-09-29  8:26     ` Antoine Tenart
@ 2021-09-29 13:31       ` Jakub Kicinski
  2021-09-29 17:31         ` Antoine Tenart
  2021-10-05 15:21         ` Antoine Tenart
  0 siblings, 2 replies; 25+ messages in thread
From: Jakub Kicinski @ 2021-09-29 13:31 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, pabeni, gregkh, ebiederm, stephen, herbert, juri.lelli, netdev

On Wed, 29 Sep 2021 10:26:35 +0200 Antoine Tenart wrote:
> Quoting Jakub Kicinski (2021-09-29 02:02:29)
> > On Tue, 28 Sep 2021 14:54:59 +0200 Antoine Tenart wrote:  
> > > The sysfs removal is done in device_del, and moving it outside of the
> > > rtnl lock does fix the initial deadlock. With that the trylock/restart
> > > logic can be removed in a following-up patch.  
> >   
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index a1eab120bb50..d774fbec5d63 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -10593,6 +10593,8 @@ void netdev_run_todo(void)
> > >                       continue;
> > >               }
> > >  
> > > +             device_del(&dev->dev);
> > > +
> > >               dev->reg_state = NETREG_UNREGISTERED;
> > >  
> > >               netdev_wait_allrefs(dev);
> > > diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> > > index 21c3fdeccf20..e754f00c117b 100644
> > > --- a/net/core/net-sysfs.c
> > > +++ b/net/core/net-sysfs.c
> > > @@ -1955,8 +1955,6 @@ void netdev_unregister_kobject(struct net_device *ndev)
> > >       remove_queue_kobjects(ndev);
> > >  
> > >       pm_runtime_set_memalloc_noio(dev, false);
> > > -
> > > -     device_del(dev);
> > >  }
> > >  
> > >  /* Create sysfs entries for network device. */  
> > 
> > Doesn't this mean there may be sysfs files which are accessible 
> > for an unregistered netdevice?  
> 
> It would mean having accessible sysfs files for a device in the
> NETREG_UNREGISTERING state; NETREG_UNREGISTERED still comes after
> device_del. It's a small difference but still important, I think.
> 
> You raise a good point. Yes, that would mean accessing attributes of net
> devices being unregistered, meaning accessing or modifying unused or
> obsolete parameters and data (it shouldn't be garbage data though).
> Unlisting those sysfs files without removing them would be better here,
> to not expose files when the device is being unregistered while still
> allowing pending operations to complete. I don't know if that is doable
> in sysfs.

I wonder. Do we somehow remove the queue objects without waiting or are
those also waited on when we remove the device? 'Cause XPS is the part
that jumps out to me - we reset XPS after netdev_unregister_kobject().
Does it mean user can re-instate XPS settings after we thought we
already reset them?

> (While I did ran stress tests reading/writing attributes while
> unregistering devices, I think I missed an issue with the
> netdev_queue_default attributes; which hopefully can be fixed — if the
> whole idea is deemed acceptable).

Well, it's a little wobbly but I think the direction is sane.
It wouldn't feel super clean to add

	if (dev->state != NETREG_REGISTERED)
		goto out;

to the sysfs handlers but maybe it's better than leaving potential
traps for people who are not aware of the intricacies later? Not sure.

> > Isn't the point of having device_del() under rtnl_lock() to make sure
> > we sysfs handlers can't run on dead devices?  
> 
> Hard to say what was the initial point, there is a lot of history here
> :) I'm not sure it was done because of a particular reason; IMHO it just
> made sense to make this simple without having a good reason not to do
> so. And it helped with the naming collision detection.

FWIW the other two pieces of feedback I have is try to avoid the
synchronize_net() in patch 7 and add a new helper for the name
checking, which would return bool. The callers don't have any 
business getting the struct.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 8/9] net: delay device_del until run_todo
  2021-09-29 13:31       ` Jakub Kicinski
@ 2021-09-29 17:31         ` Antoine Tenart
  2021-10-29  9:04           ` Antoine Tenart
  2021-10-05 15:21         ` Antoine Tenart
  1 sibling, 1 reply; 25+ messages in thread
From: Antoine Tenart @ 2021-09-29 17:31 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, pabeni, gregkh, ebiederm, stephen, herbert, juri.lelli, netdev

Quoting Jakub Kicinski (2021-09-29 15:31:26)
> On Wed, 29 Sep 2021 10:26:35 +0200 Antoine Tenart wrote:
> > Quoting Jakub Kicinski (2021-09-29 02:02:29)
> > > On Tue, 28 Sep 2021 14:54:59 +0200 Antoine Tenart wrote:  
> > > > The sysfs removal is done in device_del, and moving it outside of the
> > > > rtnl lock does fix the initial deadlock. With that the trylock/restart
> > > > logic can be removed in a following-up patch.  
> > >   
> > > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > > index a1eab120bb50..d774fbec5d63 100644
> > > > --- a/net/core/dev.c
> > > > +++ b/net/core/dev.c
> > > > @@ -10593,6 +10593,8 @@ void netdev_run_todo(void)
> > > >                       continue;
> > > >               }
> > > >  
> > > > +             device_del(&dev->dev);
> > > > +
> > > >               dev->reg_state = NETREG_UNREGISTERED;
> > > >  
> > > >               netdev_wait_allrefs(dev);
> > > > diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> > > > index 21c3fdeccf20..e754f00c117b 100644
> > > > --- a/net/core/net-sysfs.c
> > > > +++ b/net/core/net-sysfs.c
> > > > @@ -1955,8 +1955,6 @@ void netdev_unregister_kobject(struct net_device *ndev)
> > > >       remove_queue_kobjects(ndev);
> > > >  
> > > >       pm_runtime_set_memalloc_noio(dev, false);
> > > > -
> > > > -     device_del(dev);
> > > >  }
> > > >  
> > > >  /* Create sysfs entries for network device. */  
> > > 
> > > Doesn't this mean there may be sysfs files which are accessible 
> > > for an unregistered netdevice?  
> > 
> > It would mean having accessible sysfs files for a device in the
> > NETREG_UNREGISTERING state; NETREG_UNREGISTERED still comes after
> > device_del. It's a small difference but still important, I think.
> > 
> > You raise a good point. Yes, that would mean accessing attributes of net
> > devices being unregistered, meaning accessing or modifying unused or
> > obsolete parameters and data (it shouldn't be garbage data though).
> > Unlisting those sysfs files without removing them would be better here,
> > to not expose files when the device is being unregistered while still
> > allowing pending operations to complete. I don't know if that is doable
> > in sysfs.
> 
> I wonder. Do we somehow remove the queue objects without waiting or are
> those also waited on when we remove the device? 'Cause XPS is the part
> that jumps out to me - we reset XPS after netdev_unregister_kobject().
> Does it mean user can re-instate XPS settings after we thought we
> already reset them?

This should be possible yes (and not really wanted).

> Well, it's a little wobbly but I think the direction is sane.
> It wouldn't feel super clean to add
> 
>         if (dev->state != NETREG_REGISTERED)
>                 goto out;
> 
> to the sysfs handlers but maybe it's better than leaving potential
> traps for people who are not aware of the intricacies later? Not sure.

Agreed, that doesn't feel super clean, but would be quite nice to have
for users (and e.g. would also help in the XPS case). Having a wrapper
should be possible, to minimize the impact and make it a bit better.

> > (While I did ran stress tests reading/writing attributes while
> > unregistering devices, I think I missed an issue with the
> > netdev_queue_default attributes; which hopefully can be fixed — if the
> > whole idea is deemed acceptable).

I had a quick look about queue attributes, their removal should also be
done in run_todo (that's easy). However the queues can be updated in
flight (while holding the rtnl lock) and the error paths[1][2] do drain
sysfs files (in kobject_put).

We can't release the rtnl lock here. It should be possible to delay this
outside the rtnl lock (in the global workqueue) but as the kobject are
embedded in the queues, we might need to have them live outside to allow
async releases while a net device (and ->_rx/->_tx) is being freed[3].
That adds to the complexity...

[1] https://elixir.bootlin.com/linux/latest/source/net/core/net-sysfs.c#L1662
[2] https://elixir.bootlin.com/linux/latest/source/net/core/net-sysfs.c#L1067
[3] Or having a dedicated workqueue and draining it.

> > > Isn't the point of having device_del() under rtnl_lock() to make sure
> > > we sysfs handlers can't run on dead devices?  
> > 
> > Hard to say what was the initial point, there is a lot of history here
> > :) I'm not sure it was done because of a particular reason; IMHO it just
> > made sense to make this simple without having a good reason not to do
> > so. And it helped with the naming collision detection.
> 
> FWIW the other two pieces of feedback I have is try to avoid the
> synchronize_net() in patch 7

I wasn't too happy in adding a call to this. However the node name list
is rcu protected and to make sure all CPUs see the removal before
freeing the node name a call to synchronize_net (synchronize_rcu) is
needed. That being said I think we can just call kfree_rcu instead of
netdev_name_node_free (kfree) here.

> add a new helper for the name checking, which would return bool. The
> callers don't have any business getting the struct.

Good idea. Also I think the replacement of __dev_get_by_name by a new
wrapper might be good even outside of this series (in case the series is
delayed / reworked heavily / etc).

Thanks!
Antoine

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 8/9] net: delay device_del until run_todo
  2021-09-29 13:31       ` Jakub Kicinski
  2021-09-29 17:31         ` Antoine Tenart
@ 2021-10-05 15:21         ` Antoine Tenart
  2021-10-05 18:34           ` Jakub Kicinski
  1 sibling, 1 reply; 25+ messages in thread
From: Antoine Tenart @ 2021-10-05 15:21 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, pabeni, gregkh, ebiederm, stephen, herbert, juri.lelli, netdev

Quoting Jakub Kicinski (2021-09-29 15:31:26)
> 
> Well, it's a little wobbly but I think the direction is sane.

> FWIW the other two pieces of feedback I have is try to avoid the
> synchronize_net() in patch 7 and add a new helper for the name
> checking, which would return bool. The callers don't have any 
> business getting the struct.

I'll work on an RFC v2 including modifications discussed in this thread
(especially the ones raised about queues attributes; investigating if it
can be fixed). I might send the patches about the name checking helper
separately to reduce the size of the series, as I think they bring value
outside of it.

(In the meantime suggestions or reviews from others are still welcomed).

BTW, what are your thoughts on patch 1? It is not strictly linked to the
others (or to other solutions that might arise).

Thanks!
Antoine

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 8/9] net: delay device_del until run_todo
  2021-10-05 15:21         ` Antoine Tenart
@ 2021-10-05 18:34           ` Jakub Kicinski
  0 siblings, 0 replies; 25+ messages in thread
From: Jakub Kicinski @ 2021-10-05 18:34 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, pabeni, gregkh, ebiederm, stephen, herbert, juri.lelli, netdev

On Tue, 05 Oct 2021 17:21:59 +0200 Antoine Tenart wrote:
> BTW, what are your thoughts on patch 1? It is not strictly linked to the
> others (or to other solutions that might arise).

IMO perfectly reasonable, just needs a standalone repost.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access
  2021-09-28 12:54 [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access Antoine Tenart
                   ` (8 preceding siblings ...)
  2021-09-28 12:55 ` [RFC PATCH net-next 9/9] net-sysfs: remove the use of rtnl_trylock/restart_syscall Antoine Tenart
@ 2021-10-06  6:37 ` Michal Hocko
  2021-10-06  7:59   ` Antoine Tenart
  2021-10-29 14:33 ` Antoine Tenart
  10 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2021-10-06  6:37 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, kuba, pabeni, gregkh, ebiederm, stephen, herbert,
	juri.lelli, netdev, Jiri Bohac

On Tue 28-09-21 14:54:51, Antoine Tenart wrote:
> Hello,

Hi,
thanks for posting this. Coincidentally we have come across a similar
problem as well just recently.

> What made those syscalls to spin is the following construction (which is
> found a lot in net sysfs and sysctl code):
> 
>   if (!rtnl_trylock())
>           return restart_syscall();

One of our customer is using Prometeus (https://github.com/prometheus/prometheus)
for monitoring and they have noticed that running several instances of
node-exporter can lead to a high CPU utilization. After some
investigation it has turned out that most instances are busy looping on
on of the sysfs files while one instance is processing sysfs speed file
for mlx driver which performs quite a convoluted set of operations (send
commands to the lower layers via workqueues) to talk to the device to
get the information.

The problem becomes more visible with more instance of node-exporter
running at parallel. This results in some monitoring alarms at the said
machine because the high CPU utilization is not expected.

I would appreciate if you CC me on next versions of this patchset.

Thankis for working on this!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 9/9] net-sysfs: remove the use of rtnl_trylock/restart_syscall
  2021-09-28 12:55 ` [RFC PATCH net-next 9/9] net-sysfs: remove the use of rtnl_trylock/restart_syscall Antoine Tenart
@ 2021-10-06  6:45   ` Michal Hocko
  2021-10-06  8:03     ` Antoine Tenart
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2021-10-06  6:45 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, kuba, pabeni, gregkh, ebiederm, stephen, herbert,
	juri.lelli, netdev

On Tue 28-09-21 14:55:00, Antoine Tenart wrote:
> The ABBA deadlock avoided by using rtnl_trylock and restart_syscall was
> fixed in previous commits, we can now remove the use of this
> trylock/restart logic and have net-sysfs operations not spinning when
> rtnl is already taken.

Shouldn't those lock be interruptible or killable at least? As mentioned
in other reply we are seeing multiple a contention on some sysfs file
because mlx driver tends to do some heavy lifting in its speed callback
so it would be great to be able to interact with waiters during that
time.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access
  2021-10-06  6:37 ` [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access Michal Hocko
@ 2021-10-06  7:59   ` Antoine Tenart
  2021-10-06  8:35     ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: Antoine Tenart @ 2021-10-06  7:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: davem, kuba, pabeni, gregkh, ebiederm, stephen, herbert,
	juri.lelli, netdev, Jiri Bohac

Quoting Michal Hocko (2021-10-06 08:37:47)
> On Tue 28-09-21 14:54:51, Antoine Tenart wrote:
> thanks for posting this. Coincidentally we have come across a similar
> problem as well just recently.
> 
> > What made those syscalls to spin is the following construction (which is
> > found a lot in net sysfs and sysctl code):
> > 
> >   if (!rtnl_trylock())
> >           return restart_syscall();
> 
> One of our customer is using Prometeus (https://github.com/prometheus/prometheus)
> for monitoring and they have noticed that running several instances of
> node-exporter can lead to a high CPU utilization. After some
> investigation it has turned out that most instances are busy looping on
> on of the sysfs files while one instance is processing sysfs speed file
> for mlx driver which performs quite a convoluted set of operations (send
> commands to the lower layers via workqueues) to talk to the device to
> get the information.
> 
> The problem becomes more visible with more instance of node-exporter
> running at parallel. This results in some monitoring alarms at the said
> machine because the high CPU utilization is not expected.
> 
> I would appreciate if you CC me on next versions of this patchset.

Sure, will do!

Nice to see this can help others. Any help on (extensively) testing is
welcomed :-)

Thanks,
Antoine

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 9/9] net-sysfs: remove the use of rtnl_trylock/restart_syscall
  2021-10-06  6:45   ` Michal Hocko
@ 2021-10-06  8:03     ` Antoine Tenart
  2021-10-06  8:55       ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: Antoine Tenart @ 2021-10-06  8:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: davem, kuba, pabeni, gregkh, ebiederm, stephen, herbert,
	juri.lelli, netdev

Quoting Michal Hocko (2021-10-06 08:45:58)
> On Tue 28-09-21 14:55:00, Antoine Tenart wrote:
> > The ABBA deadlock avoided by using rtnl_trylock and restart_syscall was
> > fixed in previous commits, we can now remove the use of this
> > trylock/restart logic and have net-sysfs operations not spinning when
> > rtnl is already taken.
> 
> Shouldn't those lock be interruptible or killable at least? As mentioned
> in other reply we are seeing multiple a contention on some sysfs file
> because mlx driver tends to do some heavy lifting in its speed callback
> so it would be great to be able to interact with waiters during that
> time.

Haven't thought much about this, but it should be possible to use
rtnl_lock_killable. I think this should be a patch on its own with its
own justification though (to help bisecting). But that is easy to do
once the trylock logic is gone.

Thanks,
Antoine

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access
  2021-10-06  7:59   ` Antoine Tenart
@ 2021-10-06  8:35     ` Michal Hocko
  0 siblings, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2021-10-06  8:35 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, kuba, pabeni, gregkh, ebiederm, stephen, herbert,
	juri.lelli, netdev, Jiri Bohac

On Wed 06-10-21 09:59:54, Antoine Tenart wrote:
[...]
> Nice to see this can help others.

I find the timing amusing because this behavior was there for years just
hitting us really hard just recently.

> Any help on (extensively) testing is welcomed :-)

We can help with that for sure.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 9/9] net-sysfs: remove the use of rtnl_trylock/restart_syscall
  2021-10-06  8:03     ` Antoine Tenart
@ 2021-10-06  8:55       ` Michal Hocko
  0 siblings, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2021-10-06  8:55 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, kuba, pabeni, gregkh, ebiederm, stephen, herbert,
	juri.lelli, netdev

On Wed 06-10-21 10:03:41, Antoine Tenart wrote:
> Quoting Michal Hocko (2021-10-06 08:45:58)
> > On Tue 28-09-21 14:55:00, Antoine Tenart wrote:
> > > The ABBA deadlock avoided by using rtnl_trylock and restart_syscall was
> > > fixed in previous commits, we can now remove the use of this
> > > trylock/restart logic and have net-sysfs operations not spinning when
> > > rtnl is already taken.
> > 
> > Shouldn't those lock be interruptible or killable at least? As mentioned
> > in other reply we are seeing multiple a contention on some sysfs file
> > because mlx driver tends to do some heavy lifting in its speed callback
> > so it would be great to be able to interact with waiters during that
> > time.
> 
> Haven't thought much about this, but it should be possible to use
> rtnl_lock_killable. I think this should be a patch on its own with its
> own justification though (to help bisecting). But that is easy to do
> once the trylock logic is gone.

Agreed

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 8/9] net: delay device_del until run_todo
  2021-09-29 17:31         ` Antoine Tenart
@ 2021-10-29  9:04           ` Antoine Tenart
  0 siblings, 0 replies; 25+ messages in thread
From: Antoine Tenart @ 2021-10-29  9:04 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, pabeni, gregkh, ebiederm, stephen, herbert, juri.lelli,
	netdev, mhocko

Quoting Antoine Tenart (2021-09-29 19:31:56)
> Quoting Jakub Kicinski (2021-09-29 15:31:26)
> > On Wed, 29 Sep 2021 10:26:35 +0200 Antoine Tenart wrote:
> 
> > > (While I did ran stress tests reading/writing attributes while
> > > unregistering devices, I think I missed an issue with the
> > > netdev_queue_default attributes; which hopefully can be fixed — if the
> > > whole idea is deemed acceptable).
> 
> I had a quick look about queue attributes, their removal should also be
> done in run_todo (that's easy). However the queues can be updated in
> flight (while holding the rtnl lock) and the error paths[1][2] do drain
> sysfs files (in kobject_put).
> 
> We can't release the rtnl lock here. It should be possible to delay this
> outside the rtnl lock (in the global workqueue) but as the kobject are
> embedded in the queues, we might need to have them live outside to allow
> async releases while a net device (and ->_rx/->_tx) is being freed[3].
> That adds to the complexity...
> 
> [1] https://elixir.bootlin.com/linux/latest/source/net/core/net-sysfs.c#L1662
> [2] https://elixir.bootlin.com/linux/latest/source/net/core/net-sysfs.c#L1067
> [3] Or having a dedicated workqueue and draining it.

I got back to this and while all other suggestions where easy enough to
get right, handling the queue sysfs files was not... The explanation is
the same for Tx and Rx queues.

Sysfs queue files are special: in addition to being created at device
registration and removed at unregistration, they also can be
reconfigured (added & removed) in-flight. This is done under the rtnl
lock. So for those sysfs files the trylock/restart construction also
helps in not hitting a deadlock while removing queues in-flight.

To make this work here, I had to delay the queue removal outside the
rtnl lock. As the queues are statically allocated in net_device->_tx, I
made them dynamically allocated to allow delaying their removal outside
the rtnl lock (in a workqueue for example).

This worked for allowing the removal of queues not to hit the ABBA
deadlock. (Extra logic to drain the queues before device removal might
be needed too). But this introduced an issue as naming collision between
queues was now possible (if we tried registering new queues while old
ones weren't unregistered yet). This is because the unregistration was
delayed, and for this to work the reconfiguration of queues should be
atomic under the rtnl lock; which is exactly what the changes to not hit
the ABBA deadlock requires... There are contradicting goals here.

This is not really fixable IMHO. First we would need to add a naming
collision logic for queues only, but then we don't have the same
two-step unregistration logic as we have for net devices. And failing on
queues reconfigurations for this doesn't seem acceptable to me (this has
a lot of implications, many "users" can request queues registration &
unregistration). Plus the complexity starts to be quite noticeable,
which doesn't help maintenance.

The above looks like a dead end to me. I tried several approaches to
better handle the queues in sysfs, but couldn't find a solution not
hitting an issue one way or another.

I have however a few other ideas, that may or may not be acceptable.
I'll start a dedicated answer to this thread to discuss those.

Thanks,
Antoine

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access
  2021-09-28 12:54 [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access Antoine Tenart
                   ` (9 preceding siblings ...)
  2021-10-06  6:37 ` [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access Michal Hocko
@ 2021-10-29 14:33 ` Antoine Tenart
  2021-10-29 15:45   ` Stephen Hemminger
  10 siblings, 1 reply; 25+ messages in thread
From: Antoine Tenart @ 2021-10-29 14:33 UTC (permalink / raw)
  To: davem, kuba
  Cc: pabeni, gregkh, ebiederm, stephen, herbert, juri.lelli, netdev, mhocko

With the approach taken in this thread not going too well[1], what next?
I think we should discuss other possibilities and gather some ideas.
Below are some early thoughts, that might not be acceptable.

1) Making an rtnl_lock helper that returns when needed

The idea would be to replace rtnl_trylock/restart_syscall by this
helper, which would try to grab the rtnl lock and return when needed.
Something like the following:

  static rtnl_lock_ifalive(const struct net_device *dev)
  {
          while (!rtnl_trylock()) {
                  if (!dev_isalive(dev))
                          return -EINVAL;

                  /* Special case for queue files */
                  if (dev->drain_sysfs_queues)
                          return restart_syscall();

                  /* something not to have the CPU spinning */
          }
  }

One way not to have the CPU spinning is to sleep, let's say with
`usleep_range(500, 1000);` (range to be defined properly). The
disadvantage is on net device unregistration as we might need to wait
for all those loops to return first. (It's a trade-off though, we're not
restarting syscalls over and over in other situations). Or would there
be something better?

Possible improvements:
- Add an overall timeout and restart the syscall if we hit it, to have
  an upper bound.
- Make it interruptible, check for need_resched, etc.

Note that this approach could work for sysctl files as well; looping
over all devices in a netns to make the checks.

2) Interrupt rtnl_lock when in the unregistration paths

I'm wondering if using mutex_lock_interruptible in problematic areas
(sysfs, sysctl), keeping track of their tasks and interrupting them in
the unregistration paths would work and be acceptable. On paper this
looks like a solution with not much overhead and not too invasive to
implement. But is it acceptable? (Are there some side effects we really
don't want?).

Note that this would need some thinking to make it safe against sysfs
accesses between the tasks interruption and the sysfs files draining
(refcount? another lock?).

3) Other ideas?

Also, I'm not sure about the -rt implications of all the above.

Thanks,
Antoine

[1] https://lore.kernel.org/netdev/163549826664.3523.4140191764737040064@kwain/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access
  2021-10-29 14:33 ` Antoine Tenart
@ 2021-10-29 15:45   ` Stephen Hemminger
  0 siblings, 0 replies; 25+ messages in thread
From: Stephen Hemminger @ 2021-10-29 15:45 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, kuba, pabeni, gregkh, ebiederm, herbert, juri.lelli,
	netdev, mhocko

On Fri, 29 Oct 2021 16:33:05 +0200
Antoine Tenart <atenart@kernel.org> wrote:

> With the approach taken in this thread not going too well[1], what next?
> I think we should discuss other possibilities and gather some ideas.
> Below are some early thoughts, that might not be acceptable.
> 
> 1) Making an rtnl_lock helper that returns when needed
> 
> The idea would be to replace rtnl_trylock/restart_syscall by this
> helper, which would try to grab the rtnl lock and return when needed.
> Something like the following:
> 
>   static rtnl_lock_ifalive(const struct net_device *dev)
>   {
>           while (!rtnl_trylock()) {
>                   if (!dev_isalive(dev))
>                           return -EINVAL;
> 
>                   /* Special case for queue files */
>                   if (dev->drain_sysfs_queues)
>                           return restart_syscall();
> 
>                   /* something not to have the CPU spinning */
>           }
>   }
> 
> One way not to have the CPU spinning is to sleep, let's say with
> `usleep_range(500, 1000);` (range to be defined properly). The
> disadvantage is on net device unregistration as we might need to wait
> for all those loops to return first. (It's a trade-off though, we're not
> restarting syscalls over and over in other situations). Or would there
> be something better?
> 
> Possible improvements:
> - Add an overall timeout and restart the syscall if we hit it, to have
>   an upper bound.
> - Make it interruptible, check for need_resched, etc.
> 
> Note that this approach could work for sysctl files as well; looping
> over all devices in a netns to make the checks.
> 
> 2) Interrupt rtnl_lock when in the unregistration paths
> 
> I'm wondering if using mutex_lock_interruptible in problematic areas
> (sysfs, sysctl), keeping track of their tasks and interrupting them in
> the unregistration paths would work and be acceptable. On paper this
> looks like a solution with not much overhead and not too invasive to
> implement. But is it acceptable? (Are there some side effects we really
> don't want?).
> 
> Note that this would need some thinking to make it safe against sysfs
> accesses between the tasks interruption and the sysfs files draining
> (refcount? another lock?).
> 
> 3) Other ideas?
> 
> Also, I'm not sure about the -rt implications of all the above.
> 
> Thanks,
> Antoine
> 
> [1] https://lore.kernel.org/netdev/163549826664.3523.4140191764737040064@kwain/

As the originator of net-sysfs and the trylock, I appreciate your attempts
to do something better. The best solution may actually not be just in the
networking code but go all the way back up to restart_syscall() itself.


Spinning a CPU itself is not a bad thing, see spin locks.
The problem is if the spin causes the active CPU to not make progress.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2021-10-29 15:45 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-28 12:54 [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access Antoine Tenart
2021-09-28 12:54 ` [RFC PATCH net-next 1/9] net-sysfs: try not to restart the syscall if it will fail eventually Antoine Tenart
2021-09-28 12:54 ` [RFC PATCH net-next 2/9] net: split unlisting the net device from unlisting its node name Antoine Tenart
2021-09-28 12:54 ` [RFC PATCH net-next 3/9] net: export netdev_name_node_lookup Antoine Tenart
2021-09-28 12:54 ` [RFC PATCH net-next 4/9] bonding: use the correct function to check for netdev name collision Antoine Tenart
2021-09-28 12:54 ` [RFC PATCH net-next 5/9] ppp: " Antoine Tenart
2021-09-28 12:54 ` [RFC PATCH net-next 6/9] net: " Antoine Tenart
2021-09-28 12:54 ` [RFC PATCH net-next 7/9] net: delay the removal of the name nodes until run_todo Antoine Tenart
2021-09-28 12:54 ` [RFC PATCH net-next 8/9] net: delay device_del " Antoine Tenart
2021-09-29  0:02   ` Jakub Kicinski
2021-09-29  8:26     ` Antoine Tenart
2021-09-29 13:31       ` Jakub Kicinski
2021-09-29 17:31         ` Antoine Tenart
2021-10-29  9:04           ` Antoine Tenart
2021-10-05 15:21         ` Antoine Tenart
2021-10-05 18:34           ` Jakub Kicinski
2021-09-28 12:55 ` [RFC PATCH net-next 9/9] net-sysfs: remove the use of rtnl_trylock/restart_syscall Antoine Tenart
2021-10-06  6:45   ` Michal Hocko
2021-10-06  8:03     ` Antoine Tenart
2021-10-06  8:55       ` Michal Hocko
2021-10-06  6:37 ` [RFC PATCH net-next 0/9] Userspace spinning on net-sysfs access Michal Hocko
2021-10-06  7:59   ` Antoine Tenart
2021-10-06  8:35     ` Michal Hocko
2021-10-29 14:33 ` Antoine Tenart
2021-10-29 15:45   ` Stephen Hemminger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).