* [PATCH net-next v4] rtnetlink: Support fine-grained netdevice bulk deletion
From: Lahav Schlesinger @ 2021-12-02 17:45 UTC (permalink / raw)
  To: netdev; +Cc: kuba, dsahern

At large scale, some routers are required to support tens of thousands
of devices at once, both physical and virtual (e.g. loopbacks, tunnels,
VRFs, etc.).
At times such routers are required to delete a massive number of devices
at once, such as when a factory reset is performed on the router (causing
a deletion of all devices), when a configuration is restored after an
upgrade, or upon a request from an operator.

Currently there are two ways of deleting devices using netlink:
1. Deleting a single device (either by ifindex using ifinfomsg::ifi_index,
or by name using IFLA_IFNAME)
2. Deleting all devices that belong to a group (using IFLA_GROUP)

Deleting devices one-by-one has poor performance on a large number of
devices compared to "group deletion":
after all devices are handled, netdev_run_todo() is called, which
calls rcu_barrier() to finish any outstanding RCU callbacks that were
registered during the deletion of the devices, then waits until the
refcount of all the devices drops to 0, and then performs the final
cleanups.

However, rcu_barrier() is a very costly operation, with each call
taking on the order of tens of milliseconds.

When deleting a large number of devices one-by-one, rcu_barrier()
will be called once for each device being deleted.
As an example, the following benchmark deletes 10K loopback devices,
all of which are UP and have only an IPv6 LLA configured:

1. Deleting one-by-one using 1 thread  : 243 seconds
2. Deleting one-by-one using 10 threads: 70 seconds
3. Deleting one-by-one using 50 threads: 54 seconds
4. Deleting all using "group deletion" : 30 seconds

Note that even though the deletion logic takes place under the rtnl
lock, the call to rcu_barrier() happens outside the lock, which is why
using multiple threads yields some improvement.

However, while "group deletion" is the fastest, it is not suited for
deleting a large number of arbitrary devices that are not known ahead
of time. Furthermore, moving a large number of devices into a group is
also a costly operation.

This patch adds support for passing an arbitrary list of ifindexes of
devices to delete, using a new IFLA_IFINDEX attribute (a single message
may contain multiple instances of this attribute).
This gives more fine-grained control over which devices to delete,
while still resulting in rcu_barrier() being called only once.
Indeed, the timings of using this new API to delete 10K devices are
the same as those of the existing "group" deletion.

Signed-off-by: Lahav Schlesinger <lschlesinger@drivenets.com>
---
v3 -> v4
 - Change single IFLA_IFINDEX_LIST into multiple IFLA_IFINDEX
 - Fail if passing both IFLA_GROUP and at least one IFLA_IFINDEX

v2 -> v3
 - Rename 'ifindex_list' to 'ifindices', and pass it as int*
 - Clamp 'ops' variable in second loop.

v1 -> v2
 - Unset 'len' of IFLA_IFINDEX_LIST in policy.
 - Use __dev_get_by_index() instead of n^2 loop.
 - Return -ENODEV if any ifindex is not present.
 - Saved devices in an array.
 - Fix formatting.

 include/uapi/linux/if_link.h |  1 +
 net/core/rtnetlink.c         | 68 ++++++++++++++++++++++++++++++++++++
 2 files changed, 69 insertions(+)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index eebd3894fe89..68fcde9c0c5e 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -348,6 +348,7 @@ enum {
 	IFLA_PARENT_DEV_NAME,
 	IFLA_PARENT_DEV_BUS_NAME,
 
+	IFLA_IFINDEX,
 	__IFLA_MAX
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index fd030e02f16d..9d804866fe72 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1880,6 +1880,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_PROTO_DOWN_REASON] = { .type = NLA_NESTED },
 	[IFLA_NEW_IFINDEX]	= NLA_POLICY_MIN(NLA_S32, 1),
 	[IFLA_PARENT_DEV_NAME]	= { .type = NLA_NUL_STRING },
+	[IFLA_IFINDEX]		= { .type = NLA_S32 },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -3050,6 +3051,65 @@ static int rtnl_group_dellink(const struct net *net, int group)
 	return 0;
 }
 
+static int rtnl_ifindex_dellink(struct net *net, struct nlattr *head, int len,
+				struct netlink_ext_ack *extack)
+{
+	int i = 0, num_devices = 0, rem;
+	struct net_device **dev_list;
+	const struct nlattr *nla;
+	LIST_HEAD(list_kill);
+	int ret;
+
+	nla_for_each_attr(nla, head, len, rem) {
+		if (nla_type(nla) == IFLA_IFINDEX)
+			num_devices++;
+	}
+
+	dev_list = kmalloc_array(num_devices, sizeof(*dev_list), GFP_KERNEL);
+	if (!dev_list)
+		return -ENOMEM;
+
+	nla_for_each_attr(nla, head, len, rem) {
+		const struct rtnl_link_ops *ops;
+		struct net_device *dev;
+		int ifindex;
+
+		if (nla_type(nla) != IFLA_IFINDEX)
+			continue;
+
+		ifindex = nla_get_s32(nla);
+		ret = -ENODEV;
+		dev = __dev_get_by_index(net, ifindex);
+		if (!dev) {
+			NL_SET_ERR_MSG_ATTR(extack, nla, "Unknown ifindex");
+			goto out_free;
+		}
+
+		ret = -EOPNOTSUPP;
+		ops = dev->rtnl_link_ops;
+		if (!ops || !ops->dellink) {
+			NL_SET_ERR_MSG_ATTR(extack, nla, "Device cannot be deleted");
+			goto out_free;
+		}
+
+		dev_list[i++] = dev;
+	}
+
+	for (i = 0; i < num_devices; i++) {
+		struct net_device *dev = dev_list[i];
+
+		dev->rtnl_link_ops->dellink(dev, &list_kill);
+	}
+
+	unregister_netdevice_many(&list_kill);
+
+	ret = 0;
+
+out_free:
+	kfree(dev_list);
+	return ret;
+}
+
 int rtnl_delete_link(struct net_device *dev)
 {
 	const struct rtnl_link_ops *ops;
@@ -3093,6 +3153,11 @@ static int rtnl_dellink(struct sk_buff *skb, struct nlmsghdr *nlh,
 			return PTR_ERR(tgt_net);
 	}
 
+	if (tb[IFLA_GROUP] && tb[IFLA_IFINDEX]) {
+		NL_SET_ERR_MSG(extack, "Can't pass both IFLA_GROUP and IFLA_IFINDEX");
+		return -EOPNOTSUPP;
+	}
+
 	err = -EINVAL;
 	ifm = nlmsg_data(nlh);
 	if (ifm->ifi_index > 0)
@@ -3102,6 +3167,9 @@ static int rtnl_dellink(struct sk_buff *skb, struct nlmsghdr *nlh,
 				   tb[IFLA_ALT_IFNAME], NULL);
 	else if (tb[IFLA_GROUP])
 		err = rtnl_group_dellink(tgt_net, nla_get_u32(tb[IFLA_GROUP]));
+	else if (tb[IFLA_IFINDEX])
+		err = rtnl_ifindex_dellink(tgt_net, nlmsg_attrdata(nlh, sizeof(*ifm)),
+					   nlmsg_attrlen(nlh, sizeof(*ifm)), extack);
 	else
 		goto out;
 
-- 
2.25.1



* Re: [PATCH net-next v4] rtnetlink: Support fine-grained netdevice bulk deletion
From: Jakub Kicinski @ 2021-12-04  1:06 UTC (permalink / raw)
  To: dsahern, Nicolas Dichtel; +Cc: Lahav Schlesinger, netdev

On Thu,  2 Dec 2021 19:45:02 +0200 Lahav Schlesinger wrote:
> At large scale, some routers are required to support tens of thousands
> of devices at once, both physical and virtual (e.g. loopbacks, tunnels,
> VRFs, etc.).
> At times such routers are required to delete a massive number of devices
> at once, such as when a factory reset is performed on the router (causing
> a deletion of all devices), when a configuration is restored after an
> upgrade, or upon a request from an operator.

LGTM, David, Nicolas, ack?

https://lore.kernel.org/all/20211202174502.28903-1-lschlesinger@drivenets.com/


* Re: [PATCH net-next v4] rtnetlink: Support fine-grained netdevice bulk deletion
From: Nikolay Aleksandrov @ 2021-12-04 10:15 UTC (permalink / raw)
  To: Lahav Schlesinger, netdev; +Cc: kuba, dsahern

On 02/12/2021 19:45, Lahav Schlesinger wrote:
> [...]

I like the idea, but what happens if the same device is present twice or
more times? I mean, are you sure it is safe to call the dellink method of
all device types multiple times with the same device?

Cheers,
 Nik



* Re: [PATCH net-next v4] rtnetlink: Support fine-grained netdevice bulk deletion
From: David Ahern @ 2021-12-04 16:18 UTC (permalink / raw)
  To: Nikolay Aleksandrov, Lahav Schlesinger, netdev; +Cc: kuba

On 12/4/21 3:15 AM, Nikolay Aleksandrov wrote:
> 
> I like the idea, but what happens if the same device is present twice or more times?
> I mean are you sure it is safe to call dellink method of all device types multiple
> times with the same device?
> 

Very good point. Can't rely on dellink callbacks being able to handle
multiple calls.


* Re: [PATCH net-next v4] rtnetlink: Support fine-grained netdevice bulk deletion
From: Lahav Schlesinger @ 2021-12-05  8:28 UTC (permalink / raw)
  To: Nikolay Aleksandrov; +Cc: netdev, kuba, dsahern

On Sat, Dec 04, 2021 at 12:15:19PM +0200, Nikolay Aleksandrov wrote:
> I like the idea, but what happens if the same device is present twice or more times?
> I mean are you sure it is safe to call dellink method of all device types multiple
> times with the same device?
>
> Cheers,
>  Nik
>

Thanks for catching this. I initially went over a few dellink()
functions and they all seemed to be re-entrant, but evidently I missed
some...


* Re: [PATCH net-next v4] rtnetlink: Support fine-grained netdevice bulk deletion
From: Ido Schimmel @ 2021-12-05  9:53 UTC (permalink / raw)
  To: Lahav Schlesinger; +Cc: netdev, kuba, dsahern

On Thu, Dec 02, 2021 at 07:45:02PM +0200, Lahav Schlesinger wrote:
> [...]
>
> When deleting a large number of devices one-by-one, rcu_barrier()
> will be called once for each device being deleted.
> As an example, the following benchmark deletes 10K loopback devices,
> all of which are UP and have only an IPv6 LLA configured:
>
> 1. Deleting one-by-one using 1 thread  : 243 seconds
> 2. Deleting one-by-one using 10 threads: 70 seconds
> 3. Deleting one-by-one using 50 threads: 54 seconds
> 4. Deleting all using "group deletion" : 30 seconds
>
> [...]
>
> However, while "group deletion" is the fastest, it is not suited for
> deleting a large number of arbitrary devices that are not known ahead
> of time. Furthermore, moving a large number of devices into a group is
> also a costly operation.

These are the numbers I get in a VM running on my laptop.

Moving 16k dummy netdevs to a group:

# time -p ip -b group.batch 
real 1.91
user 0.04
sys 0.27

Deleting the group:

# time -p ip link del group 10
real 6.15
user 0.00
sys 3.02

IMO, these numbers do not justify a new API. Also, your user space can
be taught to create all the netdevs in the same group to begin with:

# ip link add name dummy1 group 10 type dummy
# ip link show dev dummy1
10: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group 10 qlen 1000
    link/ether 12:b6:7d:ff:48:99 brd ff:ff:ff:ff:ff:ff

Moreover, unlike the list API that is specific to deletion, the group
API also lets you batch set operations:

# ip link set group 10 mtu 2000
# ip link show dev dummy1
10: dummy1: <BROADCAST,NOARP> mtu 2000 qdisc noop state DOWN mode
DEFAULT group 10 qlen 1000
    link/ether 12:b6:7d:ff:48:99 brd ff:ff:ff:ff:ff:ff

If you are using namespaces, then during "factory reset" you can delete
the namespace which should trigger batch deletion of the netdevs inside
it.



* Re: [PATCH net-next v4] rtnetlink: Support fine-grained netdevice bulk deletion
From: Lahav Schlesinger @ 2021-12-05 12:11 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev, kuba, dsahern

On Sun, Dec 05, 2021 at 11:53:03AM +0200, Ido Schimmel wrote:
> On Thu, Dec 02, 2021 at 07:45:02PM +0200, Lahav Schlesinger wrote:
> > [...]
>
> These are the numbers I get in a VM running on my laptop.
>
> Moving 16k dummy netdevs to a group:
>
> # time -p ip -b group.batch
> real 1.91
> user 0.04
> sys 0.27
>
> Deleting the group:
>
> # time -p ip link del group 10
> real 6.15
> user 0.00
> sys 3.02
>

Hi Ido, in which state were the dummy devices in your tests before you
deleted them/changed their group?
When they are DOWN I get similar numbers to yours (16k devices):

# time ip -b group_16000_batch
real	0m0.640s
user	0m0.152s
sys	0m0.478s

# time ip link delete group 100
real	0m5.324s
user	0m0.017s
sys	0m4.991s

But when the devices are in state UP, I get:

# time ip -b group_16000_batch
real	0m48.605s
user	0m0.218s
sys	0m48.244s

# time ip link delete group 100
real	1m13.219s
user	0m0.010s
sys	1m9.117s

And for completeness, setting the devices to DOWN prior to deleting them
takes as long overall as deleting them while they're UP in the first
place.

Also, while this is probably a minuscule issue, changing the group of
tens of thousands of interfaces will result in a storm of netlink events
that will make any userspace program listening for link events spend
time handling them.  This results in twice as many events compared to
directly deleting the devices.

> IMO, these numbers do not justify a new API. Also, your user space can
> be taught to create all the netdevs in the same group to begin with:
>
> # ip link add name dummy1 group 10 type dummy
> # ip link show dev dummy1
> 10: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group 10 qlen 1000
>     link/ether 12:b6:7d:ff:48:99 brd ff:ff:ff:ff:ff:ff
>
> Moreover, unlike the list API that is specific to deletion, the group
> API also lets you batch set operations:
>
> # ip link set group 10 mtu 2000
> # ip link show dev dummy1
> 10: dummy1: <BROADCAST,NOARP> mtu 2000 qdisc noop state DOWN mode
> DEFAULT group 10 qlen 1000
>     link/ether 12:b6:7d:ff:48:99 brd ff:ff:ff:ff:ff:ff

The list API can be extended to support other operations as well
(similar to group set operations, we can call do_setlink() for each
device specified in an IFLA_IFINDEX).
I didn't implement it in this patch because we don't have a use for it
currently.

>
> If you are using namespaces, then during "factory reset" you can delete
> the namespace which should trigger batch deletion of the netdevs inside
> it.
>

In some scenarios we are required to delete only a subset of devices
(e.g. when a physical link goes DOWN, we need to delete all the VLANs
and any tunnels configured on that device).  Furthermore, a user is
allowed to load a new configuration in which they delete only some of
the devices (e.g. delete all of the loopbacks in the system), while not
touching the other devices.

> >
> > This patch adds support for passing an arbitrary list of ifindex of
> > devices to delete with a new IFLA_IFINDEX attribute. A single message
> > may contain multiple instances of this attribute).
> > This gives a more fine-grained control over which devices to delete,
> > while still resulting in rcu_barrier() being called only once.
> > Indeed, the timings of using this new API to delete 10K devices is
> > the same as using the existing "group" deletion.
> >
> > Signed-off-by: Lahav Schlesinger <lschlesinger@drivenets.com>


* Re: [PATCH net-next v4] rtnetlink: Support fine-grained netdevice bulk deletion
From: Ido Schimmel @ 2021-12-05 13:49 UTC (permalink / raw)
  To: Lahav Schlesinger; +Cc: netdev, kuba, dsahern

On Sun, Dec 05, 2021 at 02:11:00PM +0200, Lahav Schlesinger wrote:
> [...]
>
> Hi Ido, in which state were the dummy devices in your tests before you
> deleted them/changed their group?
> When they are DOWN I get similar numbers to yours (16k devices):
> 
> # time ip -b group_16000_batch
> real	0m0.640s
> user	0m0.152s
> sys	0m0.478s
> 
> # time ip link delete group 100
> real	0m5.324s
> user	0m0.017s
> sys	0m4.991s
> 
> But when the devices are in state UP, I get:
> 
> # time ip -b group_16000_batch
> real	0m48.605s
> user	0m0.218s
> sys	0m48.244s
> 
> # time ip link delete group 100
> real	1m13.219s
> user	0m0.010s
> sys	1m9.117s
> 
> And for completeness, setting the devices to DOWN prior to deleting them
> is as fast as deleting them in the first place while they're UP.
> 
> Also, while this is probably a minuscule issue, changing the group of
> tens of thousands of interfaces will result in a storm of netlink events
> that will make any userspace program listening for link events spend
> time handling them.  This results in twice as many events compared to
> directly deleting the devices.

Yes, in my setup the netdevs were down. Looking at the code, I think the
reason for the 75x increase in latency is the fact that netlink
notifications are not generated when the netdev is down. See
netdev_state_change().

In your specific case, it is quite useless for the kernel to generate
16k notifications when moving the netdevs to a group since the entire
reason they are moved to a group is so that they could be deleted in a
batch.

I assume that there are other use cases where having the kernel suppress
notifications can be useful. Did you consider adding such a flag to the
request? I think such a mechanism is more generic/useful than an ad-hoc
API to delete a list of netdevs and should allow you to utilize the
existing group deletion mechanism.

> 
> > IMO, these numbers do not justify a new API. Also, your user space can
> > be taught to create all the netdevs in the same group to begin with:
> >
> > # ip link add name dummy1 group 10 type dummy
> > # ip link show dev dummy1
> > 10: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group 10 qlen 1000
> >     link/ether 12:b6:7d:ff:48:99 brd ff:ff:ff:ff:ff:ff
> >
> > Moreover, unlike the list API that is specific to deletion, the group
> > API also lets you batch set operations:
> >
> > # ip link set group 10 mtu 2000
> > # ip link show dev dummy1
> > 10: dummy1: <BROADCAST,NOARP> mtu 2000 qdisc noop state DOWN mode
> > DEFAULT group 10 qlen 1000
> >     link/ether 12:b6:7d:ff:48:99 brd ff:ff:ff:ff:ff:ff
> 
> The list API can be extended to support other operations as well
> (similar to group set operations, we can call do_setlink() for each
> device specified in an IFLA_IFINDEX).
> I didn't implement it in this patch because we don't have a use for it
> currently.
> 
> >
> > If you are using namespaces, then during "factory reset" you can delete
> > the namespace which should trigger batch deletion of the netdevs inside
> > it.
> >
> 
> In some scenarios we are required to delete only a subset of devices
> (e.g. when a physical link becomes DOWN, we need to delete all the VLANs
> and any tunnels configured on that device).  Furthermore, a user is
> allowed to load a new configuration in which they delete only some of the
> devices (e.g. delete all of the loopbacks in the system), while not
> touching the other devices.
> 
> > >
> > > This patch adds support for passing an arbitrary list of ifindices of
> > > devices to delete with a new IFLA_IFINDEX attribute. A single message
> > > may contain multiple instances of this attribute.
> > > This gives more fine-grained control over which devices to delete,
> > > while still resulting in rcu_barrier() being called only once.
> > > Indeed, the timings of using this new API to delete 10K devices are
> > > the same as using the existing "group" deletion.
> > >
> > > Signed-off-by: Lahav Schlesinger <lschlesinger@drivenets.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v4] rtnetlink: Support fine-grained netdevice bulk deletion
  2021-12-05 13:49     ` Ido Schimmel
@ 2021-12-05 15:05       ` Lahav Schlesinger
  2021-12-06  8:07         ` Nicolas Dichtel
  2021-12-07  3:56       ` David Ahern
  1 sibling, 1 reply; 11+ messages in thread
From: Lahav Schlesinger @ 2021-12-05 15:05 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev, kuba, dsahern

On Sun, Dec 05, 2021 at 03:49:59PM +0200, Ido Schimmel wrote:
> On Sun, Dec 05, 2021 at 02:11:00PM +0200, Lahav Schlesinger wrote:
> > On Sun, Dec 05, 2021 at 11:53:03AM +0200, Ido Schimmel wrote:
> > > On Thu, Dec 02, 2021 at 07:45:02PM +0200, Lahav Schlesinger wrote:
> > > > Under large scale, some routers are required to support tens of thousands
> > > > of devices at once, both physical and virtual (e.g. loopbacks, tunnels,
> > > > vrfs, etc).
> > > > At times such routers are required to delete massive amounts of devices
> > > > at once, such as when a factory reset is performed on the router (causing
> > > > a deletion of all devices), or when a configuration is restored after an
> > > > upgrade, or as a request from an operator.
> > > >
> > > > Currently there are 2 means of deleting devices using Netlink:
> > > > 1. Deleting a single device (either by ifindex using ifinfomsg::ifi_index,
> > > > or by name using IFLA_IFNAME)
> > > > 2. Deleting all devices that belong to a group (using IFLA_GROUP)
> > > >
> > > > Deletion of devices one-by-one has poor performance at a large scale
> > > > of devices compared to "group deletion":
> > > > After all devices are handled, netdev_run_todo() is called, which
> > > > calls rcu_barrier() to finish any outstanding RCU callbacks that were
> > > > registered during the deletion of the devices, then waits until the
> > > > refcount of all the devices is 0, then performs final cleanups.
> > > >
> > > > However, calling rcu_barrier() is a very costly operation, with each
> > > > call taking on the order of tens of milliseconds.
> > > >
> > > > When deleting a large number of devices one-by-one, rcu_barrier()
> > > > will be called for each device being deleted.
> > > > As an example, the following benchmark deletes 10K loopback devices,
> > > > all of which are UP and with only an IPv6 LLA configured:
> > > >
> > > > 1. Deleting one-by-one using 1 thread  : 243 seconds
> > > > 2. Deleting one-by-one using 10 threads: 70 seconds
> > > > 3. Deleting one-by-one using 50 threads: 54 seconds
> > > > 4. Deleting all using "group deletion": 30 seconds
> > > >
> > > > Note that even though the deletion logic takes place under the rtnl
> > > > lock, since the call to rcu_barrier() is outside the lock we gain
> > > > some improvements.
> > > >
> > > > But, while "group deletion" is the fastest, it is not suited for
> > > > deleting a large number of arbitrary devices which are unknown ahead
> > > > of time. Furthermore, moving a large number of devices to a group is
> > > > also a costly operation.
> > >
> > > These are the numbers I get in a VM running on my laptop.
> > >
> > > Moving 16k dummy netdevs to a group:
> > >
> > > # time -p ip -b group.batch
> > > real 1.91
> > > user 0.04
> > > sys 0.27
> > >
> > > Deleting the group:
> > >
> > > # time -p ip link del group 10
> > > real 6.15
> > > user 0.00
> > > sys 3.02
> > >
> >
> > Hi Ido, in your tests, in which state were the dummy devices before
> > deleting/changing the group?
> > When they are DOWN I get similar numbers to yours (16k devices):
> >
> > # time ip -b group_16000_batch
> > real  0m0.640s
> > user  0m0.152s
> > sys   0m0.478s
> >
> > # time ip link delete group 100
> > real  0m5.324s
> > user  0m0.017s
> > sys   0m4.991s
> >
> > But when the devices are in state UP, I get:
> >
> > # time ip -b group_16000_batch
> > real  0m48.605s
> > user  0m0.218s
> > sys   0m48.244s
> >
> > # time ip link delete group 100
> > real  1m13.219s
> > user  0m0.010s
> > sys   1m9.117s
> >
> > And for completeness, setting the devices DOWN prior to deleting them
> > takes as long overall as deleting them directly while they're UP.
> >
> > Also, while this is probably a minuscule issue, changing the group of
> > tens of thousands of interfaces will result in a storm of netlink
> > events, forcing any userspace program listening for link events to
> > spend time handling them.  This will result in twice as many events
> > compared to deleting the devices directly.
>
> Yes, in my setup the netdevs were down. Looking at the code, I think the
> reason for the 75x increase in latency is the fact that netlink
> notifications are not generated when the netdev is down. See
> netdev_state_change().
>
> In your specific case, it is quite useless for the kernel to generate
> 16k notifications when moving the netdevs to a group since the entire
> reason they are moved to a group is so that they could be deleted in a
> batch.
>
> I assume that there are other use cases where having the kernel suppress
> notifications can be useful. Did you consider adding such a flag to the
> request? I think such a mechanism is more generic/useful than an ad-hoc
> API to delete a list of netdevs and should allow you to utilize the
> existing group deletion mechanism.

I think having an API to suppress kernel notifications will be abused by
userspace and introduce hard-to-debug bugs, e.g. some program will
incorrectly set this flag when it shouldn't (on the premise that this
flag will "make things faster") and inadvertently break other programs
that depend on the notifications to function.

Furthermore, such ability can be used by malicious programs to hide
their activity, effectively masking themselves from Intrusion Detection
Systems that monitor these notifications.
Allowing this flag _only_ for group change can still be used maliciously,
e.g. a malicious program can covertly change the group of a device to a
"management group" (something that will usually be detected by an IDS),
and depend on a trusted program that performs bulk group operations on
said group to change some configuration of that device.

The last point might be a bit of a stretch, but I imagine there are more
creative scenarios where such a flag could be used maliciously.

>
> >
> > > IMO, these numbers do not justify a new API. Also, your user space can
> > > be taught to create all the netdevs in the same group to begin with:
> > >
> > > # ip link add name dummy1 group 10 type dummy
> > > # ip link show dev dummy1
> > > 10: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group 10 qlen 1000
> > >     link/ether 12:b6:7d:ff:48:99 brd ff:ff:ff:ff:ff:ff
> > >
> > > Moreover, unlike the list API that is specific to deletion, the group
> > > API also lets you batch set operations:
> > >
> > > # ip link set group 10 mtu 2000
> > > # ip link show dev dummy1
> > > 10: dummy1: <BROADCAST,NOARP> mtu 2000 qdisc noop state DOWN mode
> > > DEFAULT group 10 qlen 1000
> > >     link/ether 12:b6:7d:ff:48:99 brd ff:ff:ff:ff:ff:ff
> >
> > The list API can be extended to support other operations as well
> > (similar to group set operations, we can call do_setlink() for each
> > device specified in an IFLA_IFINDEX).
> > I didn't implement it in this patch because we don't have a use for it
> > currently.
> >
> > >
> > > If you are using namespaces, then during "factory reset" you can delete
> > > the namespace which should trigger batch deletion of the netdevs inside
> > > it.
> > >
> >
> > In some scenarios we are required to delete only a subset of devices
> > (e.g. when a physical link becomes DOWN, we need to delete all the VLANs
> > and any tunnels configured on that device).  Furthermore, a user is
> > allowed to load a new configuration in which they delete only some of the
> > devices (e.g. delete all of the loopbacks in the system), while not
> > touching the other devices.
> >
> > > >
> > > > This patch adds support for passing an arbitrary list of ifindices of
> > > > devices to delete with a new IFLA_IFINDEX attribute. A single message
> > > > may contain multiple instances of this attribute.
> > > > This gives more fine-grained control over which devices to delete,
> > > > while still resulting in rcu_barrier() being called only once.
> > > > Indeed, the timings of using this new API to delete 10K devices are
> > > > the same as using the existing "group" deletion.
> > > >
> > > > Signed-off-by: Lahav Schlesinger <lschlesinger@drivenets.com>
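
[As a rough illustration of the proposed request format, the sketch below
builds an RTM_DELLINK message carrying one IFLA_IFINDEX attribute per
device. This is a hedged sketch: the IFLA_IFINDEX attribute comes from
this unmerged patch, and its numeric value (57) is an assumption, not a
mainline constant; the nlmsghdr/ifinfomsg/rtattr layouts follow the
standard netlink headers.]

```python
import struct

# Constants from <linux/netlink.h> / <linux/rtnetlink.h>
RTM_DELLINK = 17
NLM_F_REQUEST = 0x1
NLM_F_ACK = 0x4
IFLA_IFINDEX = 57  # hypothetical: the attribute was only proposed in this patch

def rtattr(attr_type, payload):
    """Encode a struct rtattr header plus payload, padded to 4 bytes."""
    length = 4 + len(payload)
    pad = (4 - length % 4) % 4
    return struct.pack("=HH", length, attr_type) + payload + b"\x00" * pad

def build_bulk_dellink(ifindexes, seq=1):
    """Sketch of a single RTM_DELLINK request listing several devices,
    one IFLA_IFINDEX attribute per ifindex, as the patch proposes."""
    ifi = struct.pack("=BBHiII", 0, 0, 0, 0, 0, 0)  # struct ifinfomsg, zeroed
    attrs = b"".join(rtattr(IFLA_IFINDEX, struct.pack("=i", idx))
                     for idx in ifindexes)
    payload = ifi + attrs
    # struct nlmsghdr: len, type, flags, seq, pid
    nlh = struct.pack("=IHHII", 16 + len(payload), RTM_DELLINK,
                      NLM_F_REQUEST | NLM_F_ACK, seq, 0)
    return nlh + payload

msg = build_bulk_dellink([10, 11, 12])
# 16-byte nlmsghdr + 16-byte ifinfomsg + 3 * 8-byte attributes
print(len(msg))  # → 56
```

The kernel would then process all listed devices under one rtnl hold and
pay the rcu_barrier() cost once, which is the whole point of the patch.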

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v4] rtnetlink: Support fine-grained netdevice bulk deletion
  2021-12-05 15:05       ` Lahav Schlesinger
@ 2021-12-06  8:07         ` Nicolas Dichtel
  0 siblings, 0 replies; 11+ messages in thread
From: Nicolas Dichtel @ 2021-12-06  8:07 UTC (permalink / raw)
  To: Lahav Schlesinger, Ido Schimmel; +Cc: netdev, kuba, dsahern

On 05/12/2021 at 16:05, Lahav Schlesinger wrote:
[snip]
>> In your specific case, it is quite useless for the kernel to generate
>> 16k notifications when moving the netdevs to a group since the entire
>> reason they are moved to a group is so that they could be deleted in a
>> batch.
>>
>> I assume that there are other use cases where having the kernel suppress
>> notifications can be useful. Did you consider adding such a flag to the
>> request? I think such a mechanism is more generic/useful than an ad-hoc
>> API to delete a list of netdevs and should allow you to utilize the
>> existing group deletion mechanism.
> 
> I think having an API to suppress kernel notifications will be abused by
> userspace and introduce hard-to-debug bugs, e.g. some program will
> incorrectly set this flag when it shouldn't (on the premise that this
> flag will "make things faster") and inadvertently break other programs
> that depend on the notifications to function.
+1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next v4] rtnetlink: Support fine-grained netdevice bulk deletion
  2021-12-05 13:49     ` Ido Schimmel
  2021-12-05 15:05       ` Lahav Schlesinger
@ 2021-12-07  3:56       ` David Ahern
  1 sibling, 0 replies; 11+ messages in thread
From: David Ahern @ 2021-12-07  3:56 UTC (permalink / raw)
  To: Ido Schimmel, Lahav Schlesinger; +Cc: netdev, kuba

On 12/5/21 6:49 AM, Ido Schimmel wrote:
> In your specific case, it is quite useless for the kernel to generate
> 16k notifications when moving the netdevs to a group since the entire
> reason they are moved to a group is so that they could be deleted in a
> batch.
> 
> I assume that there are other use cases where having the kernel suppress
> notifications can be useful. Did you consider adding such a flag to the
> request? I think such a mechanism is more generic/useful than an ad-hoc
> API to delete a list of netdevs and should allow you to utilize the
> existing group deletion mechanism.

I do agree that we do not want to allow netdev changes without userspace
notifications.

It would be nice to re-use the group delete API, but really there is no
guaranteed, reserved group value that can be used for a temporary
assignment followed by the group delete call.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-12-07  3:56 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-02 17:45 [PATCH net-next v4] rtnetlink: Support fine-grained netdevice bulk deletion Lahav Schlesinger
2021-12-04  1:06 ` Jakub Kicinski
2021-12-04 10:15 ` Nikolay Aleksandrov
2021-12-04 16:18   ` David Ahern
2021-12-05  8:28   ` Lahav Schlesinger
2021-12-05  9:53 ` Ido Schimmel
2021-12-05 12:11   ` Lahav Schlesinger
2021-12-05 13:49     ` Ido Schimmel
2021-12-05 15:05       ` Lahav Schlesinger
2021-12-06  8:07         ` Nicolas Dichtel
2021-12-07  3:56       ` David Ahern
