* [PATCH net-next 0/2] netns: uevent performance tweaks
@ 2018-04-18 15:21 Christian Brauner
  2018-04-18 15:21 ` [PATCH net-next 1/2] netns: restrict uevents Christian Brauner
  2018-04-18 15:21 ` [PATCH net-next 2/2] netns: isolate seqnums to use per-netns locks Christian Brauner
  0 siblings, 2 replies; 10+ messages in thread
From: Christian Brauner @ 2018-04-18 15:21 UTC (permalink / raw)
  To: ebiederm, davem, netdev, linux-kernel
  Cc: avagin, ktkhai, serge, gregkh, Christian Brauner

Hey,

This series deals with a bunch of performance improvements when sending out
uevents that have been extensively discussed here:
https://lkml.org/lkml/2018/4/10/592

- Only record uevent sockets from network namespaces owned by the
  initial user namespace in the global uevent socket list.
  Eric, this is the exact patch we agreed upon in
  https://lkml.org/lkml/2018/4/10/592.
  **A very detailed rationale is present in the commit message for
    [PATCH 1/2] netns: restrict uevents**
- Decouple the locking for network namespaces in the global uevent socket
  list from the locking for network namespaces not in the global uevent
  socket list.
  **A very detailed rationale is present in the commit message for
    [PATCH 2/2] netns: isolate seqnums to use per-netns locks**

Thanks!
Christian

Christian Brauner (2):
  netns: restrict uevents
  netns: isolate seqnums to use per-netns locks

 include/linux/kobject.h     |   3 -
 include/net/net_namespace.h |   3 +
 kernel/ksysfs.c             |   3 +-
 lib/kobject_uevent.c        | 118 ++++++++++++++++++++++++++++--------
 net/core/net_namespace.c    |  13 ++++
 5 files changed, 110 insertions(+), 30 deletions(-)

-- 
2.17.0


* [PATCH net-next 1/2] netns: restrict uevents
  2018-04-18 15:21 [PATCH net-next 0/2] netns: uevent performance tweaks Christian Brauner
@ 2018-04-18 15:21 ` Christian Brauner
  2018-04-18 15:21 ` [PATCH net-next 2/2] netns: isolate seqnums to use per-netns locks Christian Brauner
  1 sibling, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2018-04-18 15:21 UTC (permalink / raw)
  To: ebiederm, davem, netdev, linux-kernel
  Cc: avagin, ktkhai, serge, gregkh, Christian Brauner

commit 07e98962fa77 ("kobject: Send hotplug events in all network
namespaces") enabled sending hotplug events into all network namespaces
back in 2010.
Over time the set of uevents that get sent into all network namespaces has
shrunk a little. We have now reached the point where hotplug events for all
devices that carry a namespace tag are filtered according to that
namespace. Specifically, they are filtered whenever the namespace tag of
the kobject does not match the namespace tag of the netlink socket. One
example is network devices. Uevents for network devices only show up in
the network namespaces these devices are moved to or created in.

However, any uevent for a kobject that does not have a namespace tag
associated with it will not be filtered and we will broadcast it into all
network namespaces. This behavior stopped making sense when user namespaces
were introduced.

This patch restricts uevents to the initial user namespace for a couple of
reasons that have been extensively discussed on the mailing list [1].
- Thundering herd:
  Broadcasting uevents into all network namespaces introduces significant
  overhead.
  All processes running in non-initial user namespaces that listen to
  uevents will end up responding to uevents that are meaningless to them,
  mainly because non-initial user namespaces cannot easily manage devices
  unless a privileged host process helps them out. This means that there
  will be a thundering herd of activity when there shouldn't be any.
- Uevents from non-root users are already filtered in userspace:
  Uevents are filtered by userspace in a user namespace because the
  received uid != 0. Instead, the uid associated with the event will be
  65534 == "nobody" because the global root uid is not mapped.
  This means we can safely, and without introducing regressions, modify
  the kernel to not send uevents into any network namespace whose owning
  user namespace is not the initial user namespace, because we know that
  userspace will ignore the message because of the uid anyway (see the
  listener sketch after this list). I have a) verified that this is true
  for every udev implementation out there and b) verified that this
  behavior has been present in all udev implementations from the very
  beginning.
- Removing needless overhead/Increasing performance:
  Currently, the uevent socket for each network namespace is added to the
  global variable uevent_sock_list. The list itself needs to be protected
  by a mutex. So every time a uevent is generated the mutex is taken on
  the list. The mutex is held *from the creation of the uevent (memory
  allocation, string creation, etc.) until all uevent sockets have been
  handled*. This is aggravated by the fact that for each uevent socket
  that has listeners the mc_list must be walked as well, which means we're
  talking O(n^2) here. Given that a standard Linux workload usually has
  quite a lot of network namespaces and - in the face of containers - a
  lot of user namespaces, this quickly becomes a performance problem (see
  "Thundering herd" above). By just recording uevent sockets of network
  namespaces that are owned by the initial user namespace we significantly
  increase performance in this codepath.
- Injecting uevents:
  There's a valid argument that containers might be interested in receiving
  device events especially if they are delegated to them by a privileged
  userspace process. One prime example is SR-IOV enabled devices, which
  are explicitly designed to be handed off to other users such as VMs or
  containers.
  This use-case can now be correctly handled since
  commit 692ec06d7c92 ("netns: send uevent messages"). This commit
  introduced the ability to send uevents from userspace. As such we can let
  a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
  the network namespace of the netlink socket) userspace process decide
  which uevents should be sent. This removes the need to blindly
  broadcast uevents into all user namespaces and provides a performant and
  safe solution to this problem.
- Filtering logic:
  This patch filters by *owning user namespace of the network namespace a
  given task resides in* and not by the user namespace of the task per
  se. This means that if the user namespace of a given task is unshared
  but the network namespace is kept and is owned by the initial user
  namespace, a listener opening the uevent socket in that network
  namespace can still receive uevents.
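
To illustrate the userspace-side uid check referenced above, here is a
minimal, hypothetical listener sketch (not udev's actual code): it binds
to the kernel uevent multicast group, requests SCM_CREDENTIALS, and drops
any message whose sender uid is not 0 - which, in a user namespace
without a root mapping, is every kernel uevent (the uid shows up as 65534):

#define _GNU_SOURCE
#include <linux/netlink.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_nl addr = {
		.nl_family = AF_NETLINK,
		.nl_groups = 1,	/* kernel uevent multicast group */
	};
	char buf[8192], cbuf[CMSG_SPACE(sizeof(struct ucred))];
	int one = 1, fd;

	fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT);
	if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		return 1;
	/* ask for SCM_CREDENTIALS so the sender uid can be inspected */
	setsockopt(fd, SOL_SOCKET, SO_PASSCRED, &one, sizeof(one));

	for (;;) {
		struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
		struct msghdr msg = {
			.msg_iov = &iov, .msg_iovlen = 1,
			.msg_control = cbuf, .msg_controllen = sizeof(cbuf),
		};
		struct ucred *cred = NULL;
		struct cmsghdr *cmsg;
		ssize_t len;

		len = recvmsg(fd, &msg, 0);
		if (len <= 0)
			continue;

		for (cmsg = CMSG_FIRSTHDR(&msg); cmsg;
		     cmsg = CMSG_NXTHDR(&msg, cmsg))
			if (cmsg->cmsg_level == SOL_SOCKET &&
			    cmsg->cmsg_type == SCM_CREDENTIALS)
				cred = (struct ucred *)CMSG_DATA(cmsg);

		/* in a userns without a root mapping this is 65534, not 0 */
		if (!cred || cred->uid != 0)
			continue;

		/* only print the "ACTION@devpath" header of the uevent */
		printf("%.*s\n", (int)len, buf);
	}
}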

[1]: https://lkml.org/lkml/2018/4/4/739
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 lib/kobject_uevent.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 15ea216a67ce..f5f5038787ac 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -703,9 +703,13 @@ static int uevent_net_init(struct net *net)
 
 	net->uevent_sock = ue_sk;
 
-	mutex_lock(&uevent_sock_mutex);
-	list_add_tail(&ue_sk->list, &uevent_sock_list);
-	mutex_unlock(&uevent_sock_mutex);
+	/* Restrict uevents to initial user namespace. */
+	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
+		mutex_lock(&uevent_sock_mutex);
+		list_add_tail(&ue_sk->list, &uevent_sock_list);
+		mutex_unlock(&uevent_sock_mutex);
+	}
+
 	return 0;
 }
 
@@ -713,9 +717,11 @@ static void uevent_net_exit(struct net *net)
 {
 	struct uevent_sock *ue_sk = net->uevent_sock;
 
-	mutex_lock(&uevent_sock_mutex);
-	list_del(&ue_sk->list);
-	mutex_unlock(&uevent_sock_mutex);
+	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
+		mutex_lock(&uevent_sock_mutex);
+		list_del(&ue_sk->list);
+		mutex_unlock(&uevent_sock_mutex);
+	}
 
 	netlink_kernel_release(ue_sk->sk);
 	kfree(ue_sk);
-- 
2.17.0


* [PATCH net-next 2/2] netns: isolate seqnums to use per-netns locks
  2018-04-18 15:21 [PATCH net-next 0/2] netns: uevent performance tweaks Christian Brauner
  2018-04-18 15:21 ` [PATCH net-next 1/2] netns: restrict uevents Christian Brauner
@ 2018-04-18 15:21 ` Christian Brauner
  2018-04-18 16:55   ` Eric W. Biederman
  2018-04-23  2:39   ` kbuild test robot
  1 sibling, 2 replies; 10+ messages in thread
From: Christian Brauner @ 2018-04-18 15:21 UTC (permalink / raw)
  To: ebiederm, davem, netdev, linux-kernel
  Cc: avagin, ktkhai, serge, gregkh, Christian Brauner

Now that it's possible to have a different set of uevents in different
network namespaces, per-network namespace uevent sequence numbers are
introduced. This increases performance as locking is now restricted to the
network namespace affected by the uevent rather than locking everything.

Since commit 692ec06d7c92 ("netns: send uevent messages") network
namespaces not owned by the initial user namespace can be sent uevents by
a sufficiently privileged userspace process.
In order to send a uevent into a network namespace not owned by the
initial user namespace we currently still need to take the *global mutex*
that locks the uevent socket list, even though the list *only contains
the sockets of network namespaces owned by the initial user namespace*.
This needs to be done because the uevent counter is a global variable.
Taking the global lock is performance-sensitive since a user on the host
can spawn a pool of n processes that each create their own new user and
network namespace and then inject uevents in parallel into each of these
network namespaces. This can have a significant performance impact for the
host's udevd since it means that there can be a lot of delay between a
device being added and the corresponding uevent being sent out and
available for processing by udevd. It also means that each network
namespace not owned by the initial user namespace which userspace has sent
a uevent to will need to wait until the lock becomes available.

Implementation:
This patch gives each network namespace its own uevent sequence number.
Each network namespace not owned by the initial user namespace receives its
own mutex. The struct uevent_sock is opaque to callers outside of
lib/kobject_uevent.c, so the mutex *can* only ever be, and *is* only ever,
accessed in lib/kobject_uevent.c. In that file it is clearly documented
which lock has to be taken. All network
namespaces owned by the initial user namespace will still share the same
lock since they are all served sequentially via the uevent socket list.
This decouples the locking and ensures that the host retrieves uevents as
fast as possible even if there are a lot of uevents injected into network
namespaces not owned by the initial user namespace.  In addition, each
network namespace not owned by the initial user namespace does not have to
wait on any other network namespace not sharing the same user namespace.
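
As a small usage illustration (a sketch, not part of this patch): with
per-netns sequence numbers, the existing /sys/kernel/uevent_seqnum file
reports the counter of the network namespace the reading task resides in,
so a process can observe its own namespace's counter like this:

#include <stdio.h>

int main(void)
{
	unsigned long long seqnum;
	FILE *f = fopen("/sys/kernel/uevent_seqnum", "r");

	if (!f)
		return 1;
	if (fscanf(f, "%llu", &seqnum) == 1)
		printf("uevent seqnum in this netns: %llu\n", seqnum);
	fclose(f);
	return 0;
}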

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 include/linux/kobject.h     |   3 --
 include/net/net_namespace.h |   3 ++
 kernel/ksysfs.c             |   3 +-
 lib/kobject_uevent.c        | 100 ++++++++++++++++++++++++++++--------
 net/core/net_namespace.c    |  13 +++++
 5 files changed, 98 insertions(+), 24 deletions(-)

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index 7f6f93c3df9c..776391aea247 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -36,9 +36,6 @@
 extern char uevent_helper[];
 #endif
 
-/* counter to tag the uevent, read only except for the kobject core */
-extern u64 uevent_seqnum;
-
 /*
  * The actions here must match the index to the string array
  * in lib/kobject_uevent.c
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 47e35cce3b64..e4e171b1ba69 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -85,6 +85,8 @@ struct net {
 	struct sock		*genl_sock;
 
 	struct uevent_sock	*uevent_sock;		/* uevent socket */
+	/* counter to tag the uevent, read only except for the kobject core */
+	u64                     uevent_seqnum;
 
 	struct list_head 	dev_base_head;
 	struct hlist_head 	*dev_name_head;
@@ -189,6 +191,7 @@ extern struct list_head net_namespace_list;
 
 struct net *get_net_ns_by_pid(pid_t pid);
 struct net *get_net_ns_by_fd(int fd);
+u64 get_ns_uevent_seqnum_by_vpid(void);
 
 #ifdef CONFIG_SYSCTL
 void ipx_register_sysctl(void);
diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
index 46ba853656f6..83264edcecda 100644
--- a/kernel/ksysfs.c
+++ b/kernel/ksysfs.c
@@ -19,6 +19,7 @@
 #include <linux/sched.h>
 #include <linux/capability.h>
 #include <linux/compiler.h>
+#include <net/net_namespace.h>
 
 #include <linux/rcupdate.h>	/* rcu_expedited and rcu_normal */
 
@@ -33,7 +34,7 @@ static struct kobj_attribute _name##_attr = \
 static ssize_t uevent_seqnum_show(struct kobject *kobj,
 				  struct kobj_attribute *attr, char *buf)
 {
-	return sprintf(buf, "%llu\n", (unsigned long long)uevent_seqnum);
+	return sprintf(buf, "%llu\n", (unsigned long long)get_ns_uevent_seqnum_by_vpid());
 }
 KERNEL_ATTR_RO(uevent_seqnum);
 
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index f5f5038787ac..796fd502c227 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -29,21 +29,38 @@
 #include <net/net_namespace.h>
 
 
-u64 uevent_seqnum;
 #ifdef CONFIG_UEVENT_HELPER
 char uevent_helper[UEVENT_HELPER_PATH_LEN] = CONFIG_UEVENT_HELPER_PATH;
 #endif
 
+/*
+ * Size a buffer needs to be in order to hold the largest possible sequence
+ * number stored in a u64 including \0 byte: 2^64 - 1 = 21 chars.
+ */
+#define SEQNUM_BUFSIZE (sizeof("SEQNUM=") + 21)
 struct uevent_sock {
 	struct list_head list;
 	struct sock *sk;
+	/*
+	 * This mutex protects uevent sockets and the uevent counter of
+	 * network namespaces *not* owned by init_user_ns.
+	 * For network namespaces owned by init_user_ns this lock is *not*
+	 * valid instead the global uevent_sock_mutex must be used!
+	 */
+	struct mutex sk_mutex;
 };
 
 #ifdef CONFIG_NET
 static LIST_HEAD(uevent_sock_list);
 #endif
 
-/* This lock protects uevent_seqnum and uevent_sock_list */
+/*
+ * This mutex protects uevent sockets and the uevent counter of network
+ * namespaces owned by init_user_ns.
+ * For network namespaces not owned by init_user_ns this lock is *not*
+ * valid instead the network namespace specific sk_mutex in struct
+ * uevent_sock must be used!
+ */
 static DEFINE_MUTEX(uevent_sock_mutex);
 
 /* the strings here must match the enum in include/linux/kobject.h */
@@ -253,6 +270,22 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
 
 	return 0;
 }
+
+static bool can_hold_seqnum(const struct kobj_uevent_env *env, size_t len)
+{
+	if (env->envp_idx >= ARRAY_SIZE(env->envp)) {
+		WARN(1, KERN_ERR "Failed to append sequence number. "
+		     "Too many uevent variables\n");
+		return false;
+	}
+
+	if ((env->buflen + len) > UEVENT_BUFFER_SIZE) {
+		WARN(1, KERN_ERR "Insufficient space to append sequence number\n");
+		return false;
+	}
+
+	return true;
+}
 #endif
 
 #ifdef CONFIG_UEVENT_HELPER
@@ -308,18 +341,22 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 
 	/* send netlink message */
 	list_for_each_entry(ue_sk, &uevent_sock_list, list) {
+		/* bump sequence number */
+		u64 seqnum = ++sock_net(ue_sk->sk)->uevent_seqnum;
 		struct sock *uevent_sock = ue_sk->sk;
+		char buf[SEQNUM_BUFSIZE];
 
 		if (!netlink_has_listeners(uevent_sock, 1))
 			continue;
 
 		if (!skb) {
-			/* allocate message with the maximum possible size */
+			/* calculate header length */
 			size_t len = strlen(action_string) + strlen(devpath) + 2;
 			char *scratch;
 
+			/* allocate message with the maximum possible size */
 			retval = -ENOMEM;
-			skb = alloc_skb(len + env->buflen, GFP_KERNEL);
+			skb = alloc_skb(len + env->buflen + SEQNUM_BUFSIZE, GFP_KERNEL);
 			if (!skb)
 				continue;
 
@@ -327,11 +364,24 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 			scratch = skb_put(skb, len);
 			sprintf(scratch, "%s@%s", action_string, devpath);
 
+			/* add env */
 			skb_put_data(skb, env->buf, env->buflen);
 
 			NETLINK_CB(skb).dst_group = 1;
 		}
 
+		/* prepare netns seqnum */
+		retval = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu", seqnum);
+		if (retval < 0 || retval >= SEQNUM_BUFSIZE)
+			continue;
+		retval++;
+
+		if (!can_hold_seqnum(env, retval))
+			continue;
+
+		/* append netns seqnum */
+		skb_put_data(skb, buf, retval);
+
 		retval = netlink_broadcast_filtered(uevent_sock, skb_get(skb),
 						    0, 1, GFP_KERNEL,
 						    kobj_bcast_filter,
@@ -339,6 +389,9 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 		/* ENOBUFS should be handled in userspace */
 		if (retval == -ENOBUFS || retval == -ESRCH)
 			retval = 0;
+
+		/* remove netns seqnum */
+		skb_trim(skb, env->buflen);
 	}
 	consume_skb(skb);
 #endif
@@ -510,14 +563,7 @@ int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
 	}
 
 	mutex_lock(&uevent_sock_mutex);
-	/* we will send an event, so request a new sequence number */
-	retval = add_uevent_var(env, "SEQNUM=%llu", (unsigned long long)++uevent_seqnum);
-	if (retval) {
-		mutex_unlock(&uevent_sock_mutex);
-		goto exit;
-	}
-	retval = kobject_uevent_net_broadcast(kobj, env, action_string,
-					      devpath);
+	retval = kobject_uevent_net_broadcast(kobj, env, action_string, devpath);
 	mutex_unlock(&uevent_sock_mutex);
 
 #ifdef CONFIG_UEVENT_HELPER
@@ -605,17 +651,18 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
 EXPORT_SYMBOL_GPL(add_uevent_var);
 
 #if defined(CONFIG_NET)
-static int uevent_net_broadcast(struct sock *usk, struct sk_buff *skb,
+static int uevent_net_broadcast(struct uevent_sock *ue_sk, struct sk_buff *skb,
 				struct netlink_ext_ack *extack)
 {
-	/* u64 to chars: 2^64 - 1 = 21 chars */
-	char buf[sizeof("SEQNUM=") + 21];
+	struct sock *usk = ue_sk->sk;
+	char buf[SEQNUM_BUFSIZE];
 	struct sk_buff *skbc;
 	int ret;
 
 	/* bump and prepare sequence number */
-	ret = snprintf(buf, sizeof(buf), "SEQNUM=%llu", ++uevent_seqnum);
-	if (ret < 0 || (size_t)ret >= sizeof(buf))
+	ret = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu",
+		       ++sock_net(ue_sk->sk)->uevent_seqnum);
+	if (ret < 0 || ret >= SEQNUM_BUFSIZE)
 		return -ENOMEM;
 	ret++;
 
@@ -668,9 +715,15 @@ static int uevent_net_rcv_skb(struct sk_buff *skb, struct nlmsghdr *nlh,
 		return -EPERM;
 	}
 
-	mutex_lock(&uevent_sock_mutex);
-	ret = uevent_net_broadcast(net->uevent_sock->sk, skb, extack);
-	mutex_unlock(&uevent_sock_mutex);
+	if (net->user_ns == &init_user_ns)
+		mutex_lock(&uevent_sock_mutex);
+	else
+		mutex_lock(&net->uevent_sock->sk_mutex);
+	ret = uevent_net_broadcast(net->uevent_sock, skb, extack);
+	if (net->user_ns == &init_user_ns)
+		mutex_unlock(&uevent_sock_mutex);
+	else
+		mutex_unlock(&net->uevent_sock->sk_mutex);
 
 	return ret;
 }
@@ -708,6 +761,13 @@ static int uevent_net_init(struct net *net)
 		mutex_lock(&uevent_sock_mutex);
 		list_add_tail(&ue_sk->list, &uevent_sock_list);
 		mutex_unlock(&uevent_sock_mutex);
+	} else {
+		/*
+		 * Uevent sockets and counters for network namespaces
+		 * not owned by the initial user namespace have their
+		 * own mutex.
+		 */
+		mutex_init(&ue_sk->sk_mutex);
 	}
 
 	return 0;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a11e03f920d3..2f914804ef73 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -618,6 +618,19 @@ struct net *get_net_ns_by_pid(pid_t pid)
 }
 EXPORT_SYMBOL_GPL(get_net_ns_by_pid);
 
+u64 get_ns_uevent_seqnum_by_vpid(void)
+{
+	pid_t cur_pid;
+	struct net *net;
+
+	cur_pid = task_pid_vnr(current);
+	net = get_net_ns_by_pid(cur_pid);
+	if (IS_ERR(net))
+		return 0;
+
+	return net->uevent_seqnum;
+}
+
 static __net_init int net_ns_net_init(struct net *net)
 {
 #ifdef CONFIG_NET_NS
-- 
2.17.0


* Re: [PATCH net-next 2/2] netns: isolate seqnums to use per-netns locks
  2018-04-18 15:21 ` [PATCH net-next 2/2] netns: isolate seqnums to use per-netns locks Christian Brauner
@ 2018-04-18 16:55   ` Eric W. Biederman
  2018-04-18 21:52     ` Christian Brauner
  2018-04-23  2:39   ` kbuild test robot
  1 sibling, 1 reply; 10+ messages in thread
From: Eric W. Biederman @ 2018-04-18 16:55 UTC (permalink / raw)
  To: Christian Brauner
  Cc: davem, netdev, linux-kernel, avagin, ktkhai, serge, gregkh

Christian Brauner <christian.brauner@ubuntu.com> writes:

> Now that it's possible to have a different set of uevents in different
> network namespaces, per-network namespace uevent sequence numbers are
> introduced. This increases performance as locking is now restricted to the
> network namespace affected by the uevent rather than locking
> everything.

Numbers please.  I personally expect that the netlink mc_list issues
will swamp any benefit you get from this.

Eric


* Re: [PATCH net-next 2/2] netns: isolate seqnums to use per-netns locks
  2018-04-18 16:55   ` Eric W. Biederman
@ 2018-04-18 21:52     ` Christian Brauner
  2018-04-20 13:56       ` Christian Brauner
  0 siblings, 1 reply; 10+ messages in thread
From: Christian Brauner @ 2018-04-18 21:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: davem, netdev, linux-kernel, avagin, ktkhai, serge, gregkh

On Wed, Apr 18, 2018 at 11:55:52AM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@ubuntu.com> writes:
> 
> > Now that it's possible to have a different set of uevents in different
> > network namespaces, per-network namespace uevent sequence numbers are
> > introduced. This increases performance as locking is now restricted to the
> > network namespace affected by the uevent rather than locking
> > everything.
> 
> Numbers please.  I personally expect that the netlink mc_list issues
> will swamp any benefit you get from this.

I don't see how this would be the case. The gist of this is:
Every time you send a uevent into a network namespace *not* owned by
init_user_ns you currently *have* to take the mutex that protects
uevent_sock_list, effectively blocking the host from processing uevents
even though
- the uevent you're receiving might be totally different from the
  uevent that you're sending
- the uevent socket of the non-init_user_ns owned network namespace
  isn't even recorded in the list.

The other argument is that we now have properly isolated network
namespaces with respect to uevents such that each netns can have its own
set of uevents. This can either happen because a sufficiently privileged
userspace process sends a netns uevents that are dedicated only to that
specific netns. Or - and this *has been true for a long time* - because
network devices are *properly namespaced*, meaning a uevent for such a
network device is *tied to a network namespace*. In both cases global
uevent sequence numbering will be absolutely misleading. For example,
whenever you create e.g. a new veth device in a new network namespace it
shouldn't be accounted against the initial network namespace but *only*
against the network namespace that has that device added to it.

Thanks!
Christian


* Re: [PATCH net-next 2/2] netns: isolate seqnums to use per-netns locks
  2018-04-18 21:52     ` Christian Brauner
@ 2018-04-20 13:56       ` Christian Brauner
  2018-04-20 16:16         ` Christian Brauner
  0 siblings, 1 reply; 10+ messages in thread
From: Christian Brauner @ 2018-04-20 13:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: davem, netdev, linux-kernel, avagin, ktkhai, serge, gregkh

On Wed, Apr 18, 2018 at 11:52:47PM +0200, Christian Brauner wrote:
> On Wed, Apr 18, 2018 at 11:55:52AM -0500, Eric W. Biederman wrote:
> > Christian Brauner <christian.brauner@ubuntu.com> writes:
> > 
> > > Now that it's possible to have a different set of uevents in different
> > > network namespaces, per-network namespace uevent sequence numbers are
> > > introduced. This increases performance as locking is now restricted to the
> > > network namespace affected by the uevent rather than locking
> > > everything.
> > 
> > Numbers please.  I personally expect that the netlink mc_list issues
> > will swamp any benefit you get from this.
> 
> I wouldn't see how this would be the case. The gist of this is:
> Everytime you send a uevent into a network namespace *not* owned by
> init_user_ns you currently *have* to take mutex_lock(uevent_sock_list)
> effectively blocking the host from processing uevents even though
> - the uevent you're receiving might be totally different from the
>   uevent that you're sending
> - the uevent socket of the non-init_user_ns owned network namespace
>   isn't even recorded in the list.
> 
> The other argument is that we now have properly isolated network
> namespaces wrt to uevents such that each netns can have its own set of
> uevents. This can either happen by a sufficiently privileged userspace
> process sending it uevents that are only dedicated to that specific
> netns. Or - and this *has been true for a long time* - because network
> devices are *properly namespaced*. Meaning a uevent for that network
> device is *tied to a network namespace*. For both cases the uevent
> sequence numbering will be absolutely misleading. For example, whenever
> you create e.g. a new veth device in a new network namespace it
> shouldn't be accounted against the initial network namespace but *only*
> against the network namespace that has that device added to it.

Eric, I did the testing. Here's what I did:

I compiled two 4.17-rc1 Kernels:
- one with per netns uevent seqnums with decoupled locking
- one without per netns uevent seqnums with decoupled locking

# Testcase 1:
Only Injecting Uevents into network namespaces not owned by the initial user
namespace.
- created 1000 new user namespace + network namespace pairs
- opened a uevent listener in each of those namespace pairs
- injected uevents into each of those network namespaces 10,000 times meaning
  10,000,000 (10 million) uevents were injected. (The high number of
  uevent injections should get rid of a lot of jitter.)
- Calculated the mean transaction time.
- *without* uevent sequence number namespacing:
  67 μs
- *with* uevent sequence number namespacing:
  55 μs
- makes a difference of 12 μs

# Testcase 2:
Injecting Uevents into network namespaces not owned by the initial user
namespace and network namespaces owned by the initial user namespace.
- created 500 new user namespace + network namespace pairs
- created 500 new network namespace pairs
- opened a uevent listener in each of those namespace pairs
- injected uevents into each of those network namespaces 10,000 times meaning
  10,000,000 (10 million) uevents were injected. (The high number of
  uevent injections should get rid of a lot of jitter.)
- Calculated the mean transaction time.
- *without* uevent sequence number namespacing:
  572 μs
- *with* uevent sequence number namespacing:
  514 μs
- makes a difference of 58 μs

So there's a performance gain. The third case would be to create a bunch
of hanging processes that send SIGSTOP to themselves but do not actually
open a uevent socket in their respective namespaces, and then inject
uevents into them. I expect an even bigger performance benefit there
since the rtnl_table_lock() isn't hit in this case because there are no
listeners.
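
For reference, here is a rough sketch of the kind of injection loop used
above (not the exact benchmark code; the "ACTION@devpath" payload format
is assumed to mirror kernel-generated uevents): unshare into a fresh
user + network namespace pair and time synthetic uevent sends to the
kernel uevent socket:

#define _GNU_SOURCE
#include <linux/netlink.h>
#include <sched.h>
#include <stdio.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	/* nl_pid == 0 addresses the kernel's per-netns uevent socket */
	struct sockaddr_nl kernel = { .nl_family = AF_NETLINK };
	const char msg[] = "add@/devices/virtual/benchmark\0"
			   "ACTION=add\0"
			   "DEVPATH=/devices/virtual/benchmark\0"
			   "SUBSYSTEM=benchmark";
	struct timespec start, end;
	int fd, i;

	/* new user + network namespace pair owned by that user namespace */
	if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0)
		return 1;

	fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT);
	if (fd < 0)
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < 10000; i++)
		if (sendto(fd, msg, sizeof(msg), 0,
			   (struct sockaddr *)&kernel, sizeof(kernel)) < 0)
			return 1;
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("mean per-uevent send time: %.2f us\n",
	       ((end.tv_sec - start.tv_sec) * 1e9 +
		(end.tv_nsec - start.tv_nsec)) / 10000 / 1e3);

	close(fd);
	return 0;
}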

Christian


* Re: [PATCH net-next 2/2] netns: isolate seqnums to use per-netns locks
  2018-04-20 13:56       ` Christian Brauner
@ 2018-04-20 16:16         ` Christian Brauner
  2018-04-21 15:49           ` Christian Brauner
  0 siblings, 1 reply; 10+ messages in thread
From: Christian Brauner @ 2018-04-20 16:16 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: davem, netdev, linux-kernel, avagin, ktkhai, serge, gregkh

On Fri, Apr 20, 2018 at 03:56:28PM +0200, Christian Brauner wrote:
> On Wed, Apr 18, 2018 at 11:52:47PM +0200, Christian Brauner wrote:
> > On Wed, Apr 18, 2018 at 11:55:52AM -0500, Eric W. Biederman wrote:
> > > Christian Brauner <christian.brauner@ubuntu.com> writes:
> > > 
> > > > Now that it's possible to have a different set of uevents in different
> > > > network namespaces, per-network namespace uevent sequence numbers are
> > > > introduced. This increases performance as locking is now restricted to the
> > > > network namespace affected by the uevent rather than locking
> > > > everything.
> > > 
> > > Numbers please.  I personally expect that the netlink mc_list issues
> > > will swamp any benefit you get from this.
> > 
> > I wouldn't see how this would be the case. The gist of this is:
> > Everytime you send a uevent into a network namespace *not* owned by
> > init_user_ns you currently *have* to take mutex_lock(uevent_sock_list)
> > effectively blocking the host from processing uevents even though
> > - the uevent you're receiving might be totally different from the
> >   uevent that you're sending
> > - the uevent socket of the non-init_user_ns owned network namespace
> >   isn't even recorded in the list.
> > 
> > The other argument is that we now have properly isolated network
> > namespaces wrt to uevents such that each netns can have its own set of
> > uevents. This can either happen by a sufficiently privileged userspace
> > process sending it uevents that are only dedicated to that specific
> > netns. Or - and this *has been true for a long time* - because network
> > devices are *properly namespaced*. Meaning a uevent for that network
> > device is *tied to a network namespace*. For both cases the uevent
> > sequence numbering will be absolutely misleading. For example, whenever
> > you create e.g. a new veth device in a new network namespace it
> > shouldn't be accounted against the initial network namespace but *only*
> > against the network namespace that has that device added to it.
> 
> Eric, I did the testing. Here's what I did:
> 
> I compiled two 4.17-rc1 Kernels:
> - one with per netns uevent seqnums with decoupled locking
> - one without per netns uevent seqnums with decoupled locking
> 
> # Testcase 1:
> Only Injecting Uevents into network namespaces not owned by the initial user
> namespace.
> - created 1000 new user namespace + network namespace pairs
> - opened a uevent listener in each of those namespace pairs
> - injected uevents into each of those network namespaces 10,000 times meaning
>   10,000,000 (10 million) uevents were injected. (The high number of
>   uevent injections should get rid of a lot of jitter.)
> - Calculated the mean transaction time.
> - *without* uevent sequence number namespacing:
>   67 μs
> - *with* uevent sequence number namespacing:
>   55 μs
> - makes a difference of 12 μs
> 
> # Testcase 2:
> Injecting Uevents into network namespaces not owned by the initial user
> namespace and network namespaces owned by the initial user namespace.
> - created 500 new user namespace + network namespace pairs
> - created 500 new network namespace pairs
> - opened a uevent listener in each of those namespace pairs
> - injected uevents into each of those network namespaces 10,000 times meaning
>   10,000,000 (10 million) uevents were injected. (The high number of
>   uevent injections should get rid of a lot of jitter.)
> - Calculated the mean transaction time.
> - *without* uevent sequence number namespacing:
>   572 μs
> - *with* uevent sequence number namespacing:
>   514 μs
> - makes a difference of 58 μs
> 
> So there's performance gain. The third case would be to create a bunch
> of hanging processes that send SIGSTOP to themselves but do not actually
> open a uevent socket in their respective namespaces and then inject
> uevents into them. I expect there to be an even more performance
> benefits since the rtnl_table_lock() isn't hit in this case because
> there are no listeners.

I did the third test case as well:
- created 500 new user namespace + network namespace pairs *without
  uevent listeners*
- created 500 new network namespace pairs *without uevent listeners*
- injected uevents into each of those network namespaces 10,000 times meaning
  10,000,000 (10 million) uevents were injected. (The high number of
  uevent injections should get rid of a lot of jitter.)
- Calculated the mean transaction time.
- *without* uevent sequence number namespacing:
  206 μs
- *with* uevent sequence number namespacing:
  163 μs
- makes a difference of 43 μs

So this test case shows a performance improvement as well.

Thanks!
Christian


* Re: [PATCH net-next 2/2] netns: isolate seqnums to use per-netns locks
  2018-04-20 16:16         ` Christian Brauner
@ 2018-04-21 15:49           ` Christian Brauner
  0 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2018-04-21 15:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: davem, netdev, linux-kernel, avagin, ktkhai, serge, gregkh

On Fri, Apr 20, 2018 at 06:16:44PM +0200, Christian Brauner wrote:
> On Fri, Apr 20, 2018 at 03:56:28PM +0200, Christian Brauner wrote:
> > On Wed, Apr 18, 2018 at 11:52:47PM +0200, Christian Brauner wrote:
> > > On Wed, Apr 18, 2018 at 11:55:52AM -0500, Eric W. Biederman wrote:
> > > > Christian Brauner <christian.brauner@ubuntu.com> writes:
> > > > 
> > > > > Now that it's possible to have a different set of uevents in different
> > > > > network namespaces, per-network namespace uevent sequence numbers are
> > > > > introduced. This increases performance as locking is now restricted to the
> > > > > network namespace affected by the uevent rather than locking
> > > > > everything.
> > > > 
> > > > Numbers please.  I personally expect that the netlink mc_list issues
> > > > will swamp any benefit you get from this.
> > > 
> > > I wouldn't see how this would be the case. The gist of this is:
> > > Everytime you send a uevent into a network namespace *not* owned by
> > > init_user_ns you currently *have* to take mutex_lock(uevent_sock_list)
> > > effectively blocking the host from processing uevents even though
> > > - the uevent you're receiving might be totally different from the
> > >   uevent that you're sending
> > > - the uevent socket of the non-init_user_ns owned network namespace
> > >   isn't even recorded in the list.
> > > 
> > > The other argument is that we now have properly isolated network
> > > namespaces wrt to uevents such that each netns can have its own set of
> > > uevents. This can either happen by a sufficiently privileged userspace
> > > process sending it uevents that are only dedicated to that specific
> > > netns. Or - and this *has been true for a long time* - because network
> > > devices are *properly namespaced*. Meaning a uevent for that network
> > > device is *tied to a network namespace*. For both cases the uevent
> > > sequence numbering will be absolutely misleading. For example, whenever
> > > you create e.g. a new veth device in a new network namespace it
> > > shouldn't be accounted against the initial network namespace but *only*
> > > against the network namespace that has that device added to it.
> > 
> > Eric, I did the testing. Here's what I did:
> > 
> > I compiled two 4.17-rc1 Kernels:
> > - one with per netns uevent seqnums with decoupled locking
> > - one without per netns uevent seqnums with decoupled locking
> > 
> > # Testcase 1:
> > Only Injecting Uevents into network namespaces not owned by the initial user
> > namespace.
> > - created 1000 new user namespace + network namespace pairs
> > - opened a uevent listener in each of those namespace pairs
> > - injected uevents into each of those network namespaces 10,000 times meaning
> >   10,000,000 (10 million) uevents were injected. (The high number of
> >   uevent injections should get rid of a lot of jitter.)
> > - Calculated the mean transaction time.
> > - *without* uevent sequence number namespacing:
> >   67 μs
> > - *with* uevent sequence number namespacing:
> >   55 μs
> > - makes a difference of 12 μs
> > 
> > # Testcase 2:
> > Injecting Uevents into network namespaces not owned by the initial user
> > namespace and network namespaces owned by the initial user namespace.
> > - created 500 new user namespace + network namespace pairs
> > - created 500 new network namespace pairs
> > - opened a uevent listener in each of those namespace pairs
> > - injected uevents into each of those network namespaces 10,000 times meaning
> >   10,000,000 (10 million) uevents were injected. (The high number of
> >   uevent injections should get rid of a lot of jitter.)
> > - Calculated the mean transaction time.
> > - *without* uevent sequence number namespacing:
> >   572 μs
> > - *with* uevent sequence number namespacing:
> >   514 μs
> > - makes a difference of 58 μs
> > 
> > So there's performance gain. The third case would be to create a bunch
> > of hanging processes that send SIGSTOP to themselves but do not actually
> > open a uevent socket in their respective namespaces and then inject
> > uevents into them. I expect there to be an even more performance
> > benefits since the rtnl_table_lock() isn't hit in this case because
> > there are no listeners.
> 
> I did the third test-case as well so:
> - created 500 new user namespace + network namespace pairs *without
>   uevent listeners*
> - created 500 new network namespace pairs *without uevent listeners*
> - injected uevents into each of those network namespaces 10,000 times meaning
>   10,000,000 (10 million) uevents were injected. (The high number of
>   uevent injections should get rid of a lot of jitter.)
> - Calculated the mean transaction time.
> - *without* uevent sequence number namespacing:
>   206 μs
> - *with* uevent sequence number namespacing:
>   163 μs
> - makes a difference of 43 μs
> 
> So this test-case shows performance improvement as well.

Just for fun, I did a simple statistical analysis using t-tests and they
all show significant differences at alpha level 0.001 (which I chose
because 0.05 seemed a bit too lax).

Testcase 1:
        Welch Two Sample t-test

data:  x1 and y1
t = 405.16, df = 18883000, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 12.14949 12.26761
sample estimates:
mean of x mean of y
 68.48594  56.27739

Testcase 2:
        Welch Two Sample t-test

data:  x2 and y2
t = 38.685, df = 19682000, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 55.10630 60.98815
sample estimates:
mean of x mean of y
 572.9684  514.9211

Testcase 3:
        Welch Two Sample t-test

data:  x3 and y3
t = 58.37, df = 17711000, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 41.77860 44.68178
sample estimates:
mean of x mean of y
 207.2632  164.0330
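
(For reference, the Welch statistic reported above is computed as
t = (mean_x - mean_y) / sqrt(s_x^2/n_x + s_y^2/n_y), with the degrees of
freedom taken from the Welch-Satterthwaite approximation; x are the
per-uevent timings without and y those with per-netns seqnums.)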

Thanks!
Christian


* Re: [PATCH net-next 2/2] netns: isolate seqnums to use per-netns locks
  2018-04-18 15:21 ` [PATCH net-next 2/2] netns: isolate seqnums to use per-netns locks Christian Brauner
  2018-04-18 16:55   ` Eric W. Biederman
@ 2018-04-23  2:39   ` kbuild test robot
  2018-04-23 10:12     ` Christian Brauner
  1 sibling, 1 reply; 10+ messages in thread
From: kbuild test robot @ 2018-04-23  2:39 UTC (permalink / raw)
  To: Christian Brauner
  Cc: kbuild-all, ebiederm, davem, netdev, linux-kernel, avagin,
	ktkhai, serge, gregkh, Christian Brauner

[-- Attachment #1: Type: text/plain, Size: 966 bytes --]

Hi Christian,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Christian-Brauner/netns-uevent-performance-tweaks/20180420-013717
config: alpha-alldefconfig (attached as .config)
compiler: alpha-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=alpha 

All errors (new ones prefixed by >>):

   kernel/ksysfs.o: In function `uevent_seqnum_show':
>> (.text+0x18c): undefined reference to `get_ns_uevent_seqnum_by_vpid'
   (.text+0x19c): undefined reference to `get_ns_uevent_seqnum_by_vpid'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6435 bytes --]


* Re: [PATCH net-next 2/2] netns: isolate seqnums to use per-netns locks
  2018-04-23  2:39   ` kbuild test robot
@ 2018-04-23 10:12     ` Christian Brauner
  0 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2018-04-23 10:12 UTC (permalink / raw)
  To: kbuild-all
  Cc: ebiederm, davem, netdev, linux-kernel, avagin, ktkhai, serge, gregkh

On Mon, Apr 23, 2018 at 10:39:50AM +0800, kbuild test robot wrote:
> Hi Christian,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on net-next/master]
> 
> url:    https://github.com/0day-ci/linux/commits/Christian-Brauner/netns-uevent-performance-tweaks/20180420-013717
> config: alpha-alldefconfig (attached as .config)
> compiler: alpha-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=alpha 
> 
> All errors (new ones prefixed by >>):
> 
>    kernel/ksysfs.o: In function `uevent_seqnum_show':
> >> (.text+0x18c): undefined reference to `get_ns_uevent_seqnum_by_vpid'
>    (.text+0x19c): undefined reference to `get_ns_uevent_seqnum_by_vpid'

Right, this happens when CONFIG_NET=n. I am about to send out the second
version of the patch that includes all of the test results in the commit
message and also accounts for kernels compiled without CONFIG_NET.
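
A common way to handle this kind of link error (sketched here; the actual
v2 fix may differ) is a !CONFIG_NET stub next to the declaration in
include/net/net_namespace.h so kernel/ksysfs.c always has a definition to
link against:

#ifdef CONFIG_NET
u64 get_ns_uevent_seqnum_by_vpid(void);
#else
static inline u64 get_ns_uevent_seqnum_by_vpid(void)
{
	/* no per-netns uevent counters without CONFIG_NET */
	return 0;
}
#endif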

Thanks!
Christian

> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

