* [PATCH net-next 0/2 v5] netns: uevent filtering @ 2018-04-29 10:44 Christian Brauner 2018-04-29 10:44 ` [PATCH net-next 1/2 v5] uevent: add alloc_uevent_skb() helper Christian Brauner ` (2 more replies) 0 siblings, 3 replies; 10+ messages in thread From: Christian Brauner @ 2018-04-29 10:44 UTC (permalink / raw) To: ebiederm, davem, netdev, linux-kernel Cc: avagin, ktkhai, serge, gregkh, Christian Brauner Hey everyone, This is the new approach to uevent filtering as discussed (see the threads in [1], [2], and [3]). It only contains *non-functional changes*. This series deals with with fixing up uevent filtering logic: - uevent filtering logic is simplified - locking time on uevent_sock_list is minimized - tagged and untagged kobjects are handled in separate codepaths - permissions for userspace are fixed for network device uevents in network namespaces owned by non-initial user namespaces Udev is now able to see those events correctly which it wasn't before. For example, moving a physical device into a network namespace not owned by the initial user namespaces before gave: root@xen1:~# udevadm --debug monitor -k calling: monitor monitor will print the received events for: KERNEL - the kernel uevent sender uid=65534, message ignored sender uid=65534, message ignored sender uid=65534, message ignored sender uid=65534, message ignored sender uid=65534, message ignored and now after the discussion and solution in [3] correctly gives: root@xen1:~# udevadm --debug monitor -k calling: monitor monitor will print the received events for: KERNEL - the kernel uevent KERNEL[625.301042] add /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/enp1s0f1 (net) KERNEL[625.301109] move /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/enp1s0f1 (net) KERNEL[625.301138] move /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/eth1 (net) KERNEL[655.333272] remove /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/eth1 (net) Thanks! Christian [1]: https://lkml.org/lkml/2018/4/4/739 [2]: https://lkml.org/lkml/2018/4/26/767 [3]: https://lkml.org/lkml/2018/4/26/738 Christian Brauner (2): uevent: add alloc_uevent_skb() helper netns: restrict uevents lib/kobject_uevent.c | 178 ++++++++++++++++++++++++++++++------------- 1 file changed, 126 insertions(+), 52 deletions(-) -- 2.17.0 ^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH net-next 1/2 v5] uevent: add alloc_uevent_skb() helper 2018-04-29 10:44 [PATCH net-next 0/2 v5] netns: uevent filtering Christian Brauner @ 2018-04-29 10:44 ` Christian Brauner 2018-04-29 10:44 ` [PATCH net-next 2/2 v5] netns: restrict uevents Christian Brauner 2018-04-30 15:55 ` [PATCH net-next 0/2 v5] netns: uevent filtering Eric W. Biederman 2 siblings, 0 replies; 10+ messages in thread From: Christian Brauner @ 2018-04-29 10:44 UTC (permalink / raw) To: ebiederm, davem, netdev, linux-kernel Cc: avagin, ktkhai, serge, gregkh, Christian Brauner This patch adds alloc_uevent_skb() in preparation for follow up patches. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> --- v4->v5: * patch unchanged v3->v4: * non-functional changes: initialize some variables again explicitly to make it obvious to readers that they are correctly set v2->v3: * new approach: patch added v1->v2: * different approach in different patchset v0->v1: * different approach in different patchset --- lib/kobject_uevent.c | 47 ++++++++++++++++++++++++++++++++------------ 1 file changed, 34 insertions(+), 13 deletions(-) diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c index 15ea216a67ce..649bf60a9440 100644 --- a/lib/kobject_uevent.c +++ b/lib/kobject_uevent.c @@ -22,6 +22,7 @@ #include <linux/socket.h> #include <linux/skbuff.h> #include <linux/netlink.h> +#include <linux/uidgid.h> #include <linux/uuid.h> #include <linux/ctype.h> #include <net/sock.h> @@ -296,6 +297,38 @@ static void cleanup_uevent_env(struct subprocess_info *info) } #endif +#ifdef CONFIG_NET +static struct sk_buff *alloc_uevent_skb(struct kobj_uevent_env *env, + const char *action_string, + const char *devpath) +{ + struct netlink_skb_parms *parms; + struct sk_buff *skb = NULL; + char *scratch; + size_t len; + + /* allocate message with maximum possible size */ + len = strlen(action_string) + strlen(devpath) + 2; + skb = alloc_skb(len + env->buflen, GFP_KERNEL); + if (!skb) + return NULL; + + /* add header */ + scratch = skb_put(skb, len); + sprintf(scratch, "%s@%s", action_string, devpath); + + skb_put_data(skb, env->buf, env->buflen); + + parms = &NETLINK_CB(skb); + parms->creds.uid = GLOBAL_ROOT_UID; + parms->creds.gid = GLOBAL_ROOT_GID; + parms->dst_group = 1; + parms->portid = 0; + + return skb; +} +#endif + static int kobject_uevent_net_broadcast(struct kobject *kobj, struct kobj_uevent_env *env, const char *action_string, @@ -314,22 +347,10 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj, continue; if (!skb) { - /* allocate message with the maximum possible size */ - size_t len = strlen(action_string) + strlen(devpath) + 2; - char *scratch; - retval = -ENOMEM; - skb = alloc_skb(len + env->buflen, GFP_KERNEL); + skb = alloc_uevent_skb(env, action_string, devpath); if (!skb) continue; - - /* add header */ - scratch = skb_put(skb, len); - sprintf(scratch, "%s@%s", action_string, devpath); - - skb_put_data(skb, env->buf, env->buflen); - - NETLINK_CB(skb).dst_group = 1; } retval = netlink_broadcast_filtered(uevent_sock, skb_get(skb), -- 2.17.0 ^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH net-next 2/2 v5] netns: restrict uevents 2018-04-29 10:44 [PATCH net-next 0/2 v5] netns: uevent filtering Christian Brauner 2018-04-29 10:44 ` [PATCH net-next 1/2 v5] uevent: add alloc_uevent_skb() helper Christian Brauner @ 2018-04-29 10:44 ` Christian Brauner 2019-06-14 22:49 ` Dmitry Torokhov 2018-04-30 15:55 ` [PATCH net-next 0/2 v5] netns: uevent filtering Eric W. Biederman 2 siblings, 1 reply; 10+ messages in thread From: Christian Brauner @ 2018-04-29 10:44 UTC (permalink / raw) To: ebiederm, davem, netdev, linux-kernel Cc: avagin, ktkhai, serge, gregkh, Christian Brauner commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces") enabled sending hotplug events into all network namespaces back in 2010. Over time the set of uevents that get sent into all network namespaces has shrunk. We have now reached the point where hotplug events for all devices that carry a namespace tag are filtered according to that namespace. Specifically, they are filtered whenever the namespace tag of the kobject does not match the namespace tag of the netlink socket. Currently, only network devices carry namespace tags (i.e. network namespace tags). Hence, uevents for network devices only show up in the network namespace such devices are created in or moved to. However, any uevent for a kobject that does not have a namespace tag associated with it will not be filtered and we will broadcast it into all network namespaces. This behavior stopped making sense when user namespaces were introduced. This patch simplifies and fixes couple of things: - Split codepath for sending uevents by kobject namespace tags: 1. Untagged kobjects - uevent_net_broadcast_untagged(): Untagged kobjects will be broadcast into all uevent sockets recorded in uevent_sock_list, i.e. into all network namespacs owned by the intial user namespace. 2. Tagged kobjects - uevent_net_broadcast_tagged(): Tagged kobjects will only be broadcast into the network namespace they were tagged with. Handling of tagged kobjects in 2. does not cause any semantic changes. This is just splitting out the filtering logic that was handled by kobj_bcast_filter() before. Handling of untagged kobjects in 1. will cause a semantic change. The reasons why this is needed and ok have been discussed in [1]. Here is a short summary: - Userspace ignores uevents from network namespaces that are not owned by the intial user namespace: Uevents are filtered by userspace in a user namespace because the received uid != 0. Instead the uid associated with the event will be 65534 == "nobody" because the global root uid is not mapped. This means we can safely and without introducing regressions modify the kernel to not send uevents into all network namespaces whose owning user namespace is not the initial user namespace because we know that userspace will ignore the message because of the uid anyway. I have a) verified that is is true for every udev implementation out there b) that this behavior has been present in all udev implementations from the very beginning. - Thundering herd: Broadcasting uevents into all network namespaces introduces significant overhead. All processes that listen to uevents running in non-initial user namespaces will end up responding to uevents that will be meaningless to them. Mainly, because non-initial user namespaces cannot easily manage devices unless they have a privileged host-process helping them out. This means that there will be a thundering herd of activity when there shouldn't be any. - Removing needless overhead/Increasing performance: Currently, the uevent socket for each network namespace is added to the global variable uevent_sock_list. The list itself needs to be protected by a mutex. So everytime a uevent is generated the mutex is taken on the list. The mutex is held *from the creation of the uevent (memory allocation, string creation etc. until all uevent sockets have been handled*. This is aggravated by the fact that for each uevent socket that has listeners the mc_list must be walked as well which means we're talking O(n^2) here. Given that a standard Linux workload usually has quite a lot of network namespaces and - in the face of containers - a lot of user namespaces this quickly becomes a performance problem (see "Thundering herd" above). By just recording uevent sockets of network namespaces that are owned by the initial user namespace we significantly increase performance in this codepath. - Injecting uevents: There's a valid argument that containers might be interested in receiving device events especially if they are delegated to them by a privileged userspace process. One prime example are SR-IOV enabled devices that are explicitly designed to be handed of to other users such as VMs or containers. This use-case can now be correctly handled since commit 692ec06d7c92 ("netns: send uevent messages"). This commit introduced the ability to send uevents from userspace. As such we can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of the network namespace of the netlink socket) userspace process make a decision what uevents should be sent. This removes the need to blindly broadcast uevents into all user namespaces and provides a performant and safe solution to this problem. - Filtering logic: This patch filters by *owning user namespace of the network namespace a given task resides in* and not by user namespace of the task per se. This means if the user namespace of a given task is unshared but the network namespace is kept and is owned by the initial user namespace a listener that is opening the uevent socket in that network namespace can still listen to uevents. - Fix permission for tagged kobjects: Network devices that are created or moved into a network namespace that is owned by a non-initial user namespace currently are send with INVALID_{G,U}ID in their credentials. This means that all current udev implementations in userspace will ignore the uevent they receive for them. This has lead to weird bugs whereby new devices showing up in such network namespaces were not recognized and did not get IPs assigned etc. This patch adjusts the permission to the appropriate {g,u}id in the respective user namespace. This way udevd is able to correctly handle such devices. - Simplify filtering logic: do_one_broadcast() already ensures that only listeners in mc_list receive uevents that have the same network namespace as the uevent socket itself. So the filtering logic in kobj_bcast_filter is not needed (see [3]). This patch therefore removes kobj_bcast_filter() and replaces netlink_broadcast_filtered() with the simpler netlink_broadcast() everywhere. [1]: https://lkml.org/lkml/2018/4/4/739 [2]: https://lkml.org/lkml/2018/4/26/767 [3]: https://lkml.org/lkml/2018/4/26/738 Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> --- v4->v5: * non-functional changes v3->v4: * patch unchanged v2->v3: * new approach: patch added v1->v2: * old approach: different patchset v0->v1: * old approach: different patchset --- lib/kobject_uevent.c | 137 ++++++++++++++++++++++++++++++------------- 1 file changed, 95 insertions(+), 42 deletions(-) diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c index 649bf60a9440..63d0816ab23b 100644 --- a/lib/kobject_uevent.c +++ b/lib/kobject_uevent.c @@ -232,30 +232,6 @@ int kobject_synth_uevent(struct kobject *kobj, const char *buf, size_t count) return r; } -#ifdef CONFIG_NET -static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data) -{ - struct kobject *kobj = data, *ksobj; - const struct kobj_ns_type_operations *ops; - - ops = kobj_ns_ops(kobj); - if (!ops && kobj->kset) { - ksobj = &kobj->kset->kobj; - if (ksobj->parent != NULL) - ops = kobj_ns_ops(ksobj->parent); - } - - if (ops && ops->netlink_ns && kobj->ktype->namespace) { - const void *sock_ns, *ns; - ns = kobj->ktype->namespace(kobj); - sock_ns = ops->netlink_ns(dsk); - return sock_ns != ns; - } - - return 0; -} -#endif - #ifdef CONFIG_UEVENT_HELPER static int kobj_usermode_filter(struct kobject *kobj) { @@ -327,17 +303,14 @@ static struct sk_buff *alloc_uevent_skb(struct kobj_uevent_env *env, return skb; } -#endif -static int kobject_uevent_net_broadcast(struct kobject *kobj, - struct kobj_uevent_env *env, - const char *action_string, - const char *devpath) +static int uevent_net_broadcast_untagged(struct kobj_uevent_env *env, + const char *action_string, + const char *devpath) { - int retval = 0; -#if defined(CONFIG_NET) struct sk_buff *skb = NULL; struct uevent_sock *ue_sk; + int retval = 0; /* send netlink message */ list_for_each_entry(ue_sk, &uevent_sock_list, list) { @@ -353,19 +326,93 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj, continue; } - retval = netlink_broadcast_filtered(uevent_sock, skb_get(skb), - 0, 1, GFP_KERNEL, - kobj_bcast_filter, - kobj); + retval = netlink_broadcast(uevent_sock, skb_get(skb), 0, 1, + GFP_KERNEL); /* ENOBUFS should be handled in userspace */ if (retval == -ENOBUFS || retval == -ESRCH) retval = 0; } consume_skb(skb); -#endif + return retval; } +static int uevent_net_broadcast_tagged(struct sock *usk, + struct kobj_uevent_env *env, + const char *action_string, + const char *devpath) +{ + struct user_namespace *owning_user_ns = sock_net(usk)->user_ns; + struct sk_buff *skb = NULL; + int ret = 0; + + skb = alloc_uevent_skb(env, action_string, devpath); + if (!skb) + return -ENOMEM; + + /* fix credentials */ + if (owning_user_ns != &init_user_ns) { + struct netlink_skb_parms *parms = &NETLINK_CB(skb); + kuid_t root_uid; + kgid_t root_gid; + + /* fix uid */ + root_uid = make_kuid(owning_user_ns, 0); + if (uid_valid(root_uid)) + parms->creds.uid = root_uid; + + /* fix gid */ + root_gid = make_kgid(owning_user_ns, 0); + if (gid_valid(root_gid)) + parms->creds.gid = root_gid; + } + + ret = netlink_broadcast(usk, skb, 0, 1, GFP_KERNEL); + /* ENOBUFS should be handled in userspace */ + if (ret == -ENOBUFS || ret == -ESRCH) + ret = 0; + + return ret; +} +#endif + +static int kobject_uevent_net_broadcast(struct kobject *kobj, + struct kobj_uevent_env *env, + const char *action_string, + const char *devpath) +{ + int ret = 0; + +#ifdef CONFIG_NET + const struct kobj_ns_type_operations *ops; + const struct net *net = NULL; + + ops = kobj_ns_ops(kobj); + if (!ops && kobj->kset) { + struct kobject *ksobj = &kobj->kset->kobj; + if (ksobj->parent != NULL) + ops = kobj_ns_ops(ksobj->parent); + } + + /* kobjects currently only carry network namespace tags and they + * are the only tag relevant here since we want to decide which + * network namespaces to broadcast the uevent into. + */ + if (ops && ops->netlink_ns && kobj->ktype->namespace) + if (ops->type == KOBJ_NS_TYPE_NET) + net = kobj->ktype->namespace(kobj); + + if (!net) + ret = uevent_net_broadcast_untagged(env, action_string, + devpath); + else + ret = uevent_net_broadcast_tagged(net->uevent_sock->sk, env, + action_string, devpath); +#endif + + return ret; +} + static void zap_modalias_env(struct kobj_uevent_env *env) { static const char modalias_prefix[] = "MODALIAS="; @@ -724,9 +771,13 @@ static int uevent_net_init(struct net *net) net->uevent_sock = ue_sk; - mutex_lock(&uevent_sock_mutex); - list_add_tail(&ue_sk->list, &uevent_sock_list); - mutex_unlock(&uevent_sock_mutex); + /* Restrict uevents to initial user namespace. */ + if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) { + mutex_lock(&uevent_sock_mutex); + list_add_tail(&ue_sk->list, &uevent_sock_list); + mutex_unlock(&uevent_sock_mutex); + } + return 0; } @@ -734,9 +785,11 @@ static void uevent_net_exit(struct net *net) { struct uevent_sock *ue_sk = net->uevent_sock; - mutex_lock(&uevent_sock_mutex); - list_del(&ue_sk->list); - mutex_unlock(&uevent_sock_mutex); + if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) { + mutex_lock(&uevent_sock_mutex); + list_del(&ue_sk->list); + mutex_unlock(&uevent_sock_mutex); + } netlink_kernel_release(ue_sk->sk); kfree(ue_sk); -- 2.17.0 ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH net-next 2/2 v5] netns: restrict uevents 2018-04-29 10:44 ` [PATCH net-next 2/2 v5] netns: restrict uevents Christian Brauner @ 2019-06-14 22:49 ` Dmitry Torokhov 2019-06-16 11:50 ` Eric W. Biederman 0 siblings, 1 reply; 10+ messages in thread From: Dmitry Torokhov @ 2019-06-14 22:49 UTC (permalink / raw) To: Christian Brauner Cc: Eric W. Biederman, David S. Miller, netdev, lkml, avagin, ktkhai, Serge E. Hallyn, Greg Kroah-Hartman Hi Christian, On Sun, Apr 29, 2018 at 3:45 AM Christian Brauner <christian.brauner@ubuntu.com> wrote: > > commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces") >abhishekbh@google.com > enabled sending hotplug events into all network namespaces back in 2010. > Over time the set of uevents that get sent into all network namespaces has > shrunk. We have now reached the point where hotplug events for all devices > that carry a namespace tag are filtered according to that namespace. > Specifically, they are filtered whenever the namespace tag of the kobject > does not match the namespace tag of the netlink socket. > Currently, only network devices carry namespace tags (i.e. network > namespace tags). Hence, uevents for network devices only show up in the > network namespace such devices are created in or moved to. > > However, any uevent for a kobject that does not have a namespace tag > associated with it will not be filtered and we will broadcast it into all > network namespaces. This behavior stopped making sense when user namespaces > were introduced. > > This patch simplifies and fixes couple of things: > - Split codepath for sending uevents by kobject namespace tags: > 1. Untagged kobjects - uevent_net_broadcast_untagged(): > Untagged kobjects will be broadcast into all uevent sockets recorded > in uevent_sock_list, i.e. into all network namespacs owned by the > intial user namespace. > 2. Tagged kobjects - uevent_net_broadcast_tagged(): > Tagged kobjects will only be broadcast into the network namespace they > were tagged with. > Handling of tagged kobjects in 2. does not cause any semantic changes. > This is just splitting out the filtering logic that was handled by > kobj_bcast_filter() before. > Handling of untagged kobjects in 1. will cause a semantic change. The > reasons why this is needed and ok have been discussed in [1]. Here is a > short summary: > - Userspace ignores uevents from network namespaces that are not owned by > the intial user namespace: > Uevents are filtered by userspace in a user namespace because the > received uid != 0. Instead the uid associated with the event will be > 65534 == "nobody" because the global root uid is not mapped. > This means we can safely and without introducing regressions modify the > kernel to not send uevents into all network namespaces whose owning > user namespace is not the initial user namespace because we know that > userspace will ignore the message because of the uid anyway. > I have a) verified that is is true for every udev implementation out > there b) that this behavior has been present in all udev > implementations from the very beginning. Unfortunately udev is not the only consumer of uevents, for example on Android there is healthd that also consumes uevents, and this particular change broke Android running in a container on Chrome OS. Can this be reverted? Or, if we want to keep this, how can containers that use separate user namespace still listen to uevents? Thanks. -- Dmitry ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net-next 2/2 v5] netns: restrict uevents 2019-06-14 22:49 ` Dmitry Torokhov @ 2019-06-16 11:50 ` Eric W. Biederman 2019-06-16 16:50 ` Christian Brauner 2019-06-16 16:50 ` Dmitry Torokhov 0 siblings, 2 replies; 10+ messages in thread From: Eric W. Biederman @ 2019-06-16 11:50 UTC (permalink / raw) To: Dmitry Torokhov Cc: Christian Brauner, David S. Miller, netdev, lkml, avagin, ktkhai, Serge E. Hallyn, Greg Kroah-Hartman Dmitry Torokhov <dmitry.torokhov@gmail.com> writes: > Hi Christian, > > On Sun, Apr 29, 2018 at 3:45 AM Christian Brauner > <christian.brauner@ubuntu.com> wrote: >> >> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces") >>abhishekbh@google.com >> enabled sending hotplug events into all network namespaces back in 2010. >> Over time the set of uevents that get sent into all network namespaces has >> shrunk. We have now reached the point where hotplug events for all devices >> that carry a namespace tag are filtered according to that namespace. >> Specifically, they are filtered whenever the namespace tag of the kobject >> does not match the namespace tag of the netlink socket. >> Currently, only network devices carry namespace tags (i.e. network >> namespace tags). Hence, uevents for network devices only show up in the >> network namespace such devices are created in or moved to. >> >> However, any uevent for a kobject that does not have a namespace tag >> associated with it will not be filtered and we will broadcast it into all >> network namespaces. This behavior stopped making sense when user namespaces >> were introduced. >> >> This patch simplifies and fixes couple of things: >> - Split codepath for sending uevents by kobject namespace tags: >> 1. Untagged kobjects - uevent_net_broadcast_untagged(): >> Untagged kobjects will be broadcast into all uevent sockets recorded >> in uevent_sock_list, i.e. into all network namespacs owned by the >> intial user namespace. >> 2. Tagged kobjects - uevent_net_broadcast_tagged(): >> Tagged kobjects will only be broadcast into the network namespace they >> were tagged with. >> Handling of tagged kobjects in 2. does not cause any semantic changes. >> This is just splitting out the filtering logic that was handled by >> kobj_bcast_filter() before. >> Handling of untagged kobjects in 1. will cause a semantic change. The >> reasons why this is needed and ok have been discussed in [1]. Here is a >> short summary: >> - Userspace ignores uevents from network namespaces that are not owned by >> the intial user namespace: >> Uevents are filtered by userspace in a user namespace because the >> received uid != 0. Instead the uid associated with the event will be >> 65534 == "nobody" because the global root uid is not mapped. >> This means we can safely and without introducing regressions modify the >> kernel to not send uevents into all network namespaces whose owning >> user namespace is not the initial user namespace because we know that >> userspace will ignore the message because of the uid anyway. >> I have a) verified that is is true for every udev implementation out >> there b) that this behavior has been present in all udev >> implementations from the very beginning. > > Unfortunately udev is not the only consumer of uevents, for example on > Android there is healthd that also consumes uevents, and this > particular change broke Android running in a container on Chrome OS. > Can this be reverted? Or, if we want to keep this, how can containers > that use separate user namespace still listen to uevents? The code has been in the main tree for over a year so at a minimum reverting this has the real chance of causing a regression for folks like lxc. I don't think Android running in a container on Chrome OS was even available when this change was merged. So I don't think this falls under the ordinary no regression rules. I may be wrong but I think this is a case of developing new code on an old kernel and developing a dependence on a bug that had already been fixed in newer kernels. I know Christian did his best to reach out to everyone when this change came through, so only getting a bug report over a year after the code was merged is concerning. That said uevents should be completely useless in a user namespace except as letting you know something happened. Is that what healthd is using them for? One solution would be to tweak the container userspace on ChromeOS to listen to the uevents outside the container and to relay them into the Android container. Eric ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net-next 2/2 v5] netns: restrict uevents 2019-06-16 11:50 ` Eric W. Biederman @ 2019-06-16 16:50 ` Christian Brauner 2019-06-16 17:14 ` Dmitry Torokhov 2019-06-16 16:50 ` Dmitry Torokhov 1 sibling, 1 reply; 10+ messages in thread From: Christian Brauner @ 2019-06-16 16:50 UTC (permalink / raw) To: Eric W. Biederman, Dmitry Torokhov Cc: David S. Miller, netdev, lkml, avagin, ktkhai, Serge E. Hallyn, Greg Kroah-Hartman On Sun, Jun 16, 2019 at 06:50:20AM -0500, Eric W. Biederman wrote: > Dmitry Torokhov <dmitry.torokhov@gmail.com> writes: > > > Hi Christian, > > > > On Sun, Apr 29, 2018 at 3:45 AM Christian Brauner > > <christian.brauner@ubuntu.com> wrote: > >> > >> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces") > >>abhishekbh@google.com > >> enabled sending hotplug events into all network namespaces back in 2010. > >> Over time the set of uevents that get sent into all network namespaces has > >> shrunk. We have now reached the point where hotplug events for all devices > >> that carry a namespace tag are filtered according to that namespace. > >> Specifically, they are filtered whenever the namespace tag of the kobject > >> does not match the namespace tag of the netlink socket. > >> Currently, only network devices carry namespace tags (i.e. network > >> namespace tags). Hence, uevents for network devices only show up in the > >> network namespace such devices are created in or moved to. > >> > >> However, any uevent for a kobject that does not have a namespace tag > >> associated with it will not be filtered and we will broadcast it into all > >> network namespaces. This behavior stopped making sense when user namespaces > >> were introduced. > >> > >> This patch simplifies and fixes couple of things: > >> - Split codepath for sending uevents by kobject namespace tags: > >> 1. Untagged kobjects - uevent_net_broadcast_untagged(): > >> Untagged kobjects will be broadcast into all uevent sockets recorded > >> in uevent_sock_list, i.e. into all network namespacs owned by the > >> intial user namespace. > >> 2. Tagged kobjects - uevent_net_broadcast_tagged(): > >> Tagged kobjects will only be broadcast into the network namespace they > >> were tagged with. > >> Handling of tagged kobjects in 2. does not cause any semantic changes. > >> This is just splitting out the filtering logic that was handled by > >> kobj_bcast_filter() before. > >> Handling of untagged kobjects in 1. will cause a semantic change. The > >> reasons why this is needed and ok have been discussed in [1]. Here is a > >> short summary: > >> - Userspace ignores uevents from network namespaces that are not owned by > >> the intial user namespace: > >> Uevents are filtered by userspace in a user namespace because the > >> received uid != 0. Instead the uid associated with the event will be > >> 65534 == "nobody" because the global root uid is not mapped. > >> This means we can safely and without introducing regressions modify the > >> kernel to not send uevents into all network namespaces whose owning > >> user namespace is not the initial user namespace because we know that > >> userspace will ignore the message because of the uid anyway. > >> I have a) verified that is is true for every udev implementation out > >> there b) that this behavior has been present in all udev > >> implementations from the very beginning. > > > > Unfortunately udev is not the only consumer of uevents, for example on > > Android there is healthd that also consumes uevents, and this > > particular change broke Android running in a container on Chrome OS. > > Can this be reverted? Or, if we want to keep this, how can containers > > that use separate user namespace still listen to uevents? > > The code has been in the main tree for over a year so at a minimum > reverting this has the real chance of causing a regression for > folks like lxc. > > I don't think Android running in a container on Chrome OS was even > available when this change was merged. So I don't think this falls > under the ordinary no regression rules. > > I may be wrong but I think this is a case of developing new code on an > old kernel and developing a dependence on a bug that had already been > fixed in newer kernels. I know Christian did his best to reach out to > everyone when this change came through, so only getting a bug report > over a year after the code was merged is concerning. > > That said uevents should be completely useless in a user namespace > except as letting you know something happened. Is that what healthd > is using them for? > > > One solution would be to tweak the container userspace on ChromeOS to > listen to the uevents outside the container and to relay them into the > Android container. Thanks for jumping in Eric! Welcome back. :) Hey Dmitry, Crostini on ChromeOS is making heavy use of this patchset and of LXD. So reverting this almost 1.5 years after the fact will regress all of Google's ChromeOS Crostini users, and all LXD/LXC users. LXD and Crostini by using LXD (through Concierge/Tremplin etc. [2]) are using this whole series e.g. when hotplugging usb devices into the container. When a usb hotplug happens, LXD will receive the relevant uevent and will forward it to the container. Any process listening on a uevent socket inside the container will now be able to see it. Now, to talk briefly about solutions: From what I gather from talking to the ChromeOS guys and from your ChromeOS bugtracker and recent patchsets to ARC you are moving your Android workloads into Crostini? So like Eric said this seems like a new feature you're implementing. If you need to be able to listen to uevents inside of a user namespace and plan on using Crostini going forward then you can have Crostini forward battery uevents to the container. The logic for doing this can be found in the LXD codebase (cf. [3]). It's pretty simple. If you want to go this route I'm happy to guide you. :) Note, both options are a version of what Eric suggested in his last paragraph! What astonishes me is that healthd couldn't have possibly received battery uevents for a long time even if Android already was run in user namespaces prior to the new feature you're working on and the healthd I see in master is not even using uevents anymore (cf. [8]) but rather is using sysfs it seems. :) Before that healthd was using (cf. [7]) uevent_kernel_multicast_recv() | -> uevent_kernel_multicast_uid_recv the latter containing the check if (cred->uid != 0) { /* ignoring netlink message from non-root user */ goto out; } Before my patchset here the uevents sent out came with cred->uid == INVALID_UID and so healthd never received those events until very recently. And I can tell you exactly when it started receiving those events as I reported the removal of this check as a bug against Android before this patchset was ever merged and before a version of Android without this check was released (cf. [6]). :) While we're at it, removing this check is strange. Why would you have any core tool of yours listen to uevents that do not come from the root user? Especially when it comes from INVALID_UID. That's what I tried to tell you in [6]. This patchset also fixes a real information leak. Netlink is a socket and having those sockets respect network namespaces like all others makes sense. Especially to allow containers to do their own thing and to avoid information leakage. Fyi, when I wrote this patch I informed the ChromeOS guys about it multiple times as early as March/April 2018. At least two times the PM in person (last time we talked about this was during FOSDEM because ChromeOS was already reling on this). I also mentioned this feature on the official ChromeOS bugtracker here [1] in April 2018 since ChromeOS could use it. So that this comes up as a problem almost 1.5 years later is a bit surprising. :) Additionally, I gave a presentation about this feature at LPC 2018 with Google folks from Android around as well (cf. [4]). :) Thanks! Christian [1]: https://bugs.chromium.org/p/chromium/issues/detail?id=831850#c2 [2]: https://bugs.chromium.org/p/chromium/issues/detail?id=956288 [3]: https://github.com/lxc/lxd [4]: https://www.youtube.com/watch?v=yI_1cuhoDgA&t=327s [5]: https://android.googlesource.com/platform/system/core.git/+/android-4.2.2_r1/libcutils/uevent.c#77 [6]: https://issuetracker.google.com/issues/77764945 [7]: https://android.googlesource.com/platform/system/core/+/android-4.4_r1/healthd/healthd.cpp#135 [8]: https://android.googlesource.com/platform/system/core/+/master/healthd/BatteryMonitor.cpp ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net-next 2/2 v5] netns: restrict uevents 2019-06-16 16:50 ` Christian Brauner @ 2019-06-16 17:14 ` Dmitry Torokhov 0 siblings, 0 replies; 10+ messages in thread From: Dmitry Torokhov @ 2019-06-16 17:14 UTC (permalink / raw) To: Christian Brauner Cc: Eric W. Biederman, David S. Miller, netdev, lkml, avagin, ktkhai, Serge E. Hallyn, Greg Kroah-Hartman Hi Christian, On Sun, Jun 16, 2019 at 9:50 AM Christian Brauner <christian.brauner@canonical.com> wrote: > > Hey Dmitry, > > Crostini on ChromeOS is making heavy use of this patchset and of LXD. So > reverting this almost 1.5 years after the fact will regress all of > Google's ChromeOS Crostini users, and all LXD/LXC users. > > LXD and Crostini by using LXD (through Concierge/Tremplin etc. [2]) are > using this whole series e.g. when hotplugging usb devices into the > container. > > When a usb hotplug happens, LXD will receive the relevant uevent and > will forward it to the container. Any process listening on a uevent > socket inside the container will now be able to see it. > > Now, to talk briefly about solutions: > From what I gather from talking to the ChromeOS guys and from your > ChromeOS bugtracker and recent patchsets to ARC you are moving your > Android workloads into Crostini? So like Eric said this seems like a new > feature you're implementing. No, I am talking about ARC, not Crostini here. > > If you need to be able to listen to uevents inside of a user namespace > and plan on using Crostini going forward then you can have Crostini > forward battery uevents to the container. The logic for doing this can > be found in the LXD codebase (cf. [3]). It's pretty simple. If you want > to go this route I'm happy to guide you. :) > Note, both options are a version of what Eric suggested in his last > paragraph! > > What astonishes me is that healthd couldn't have possibly received > battery uevents for a long time even if Android already was run in user > namespaces prior to the new feature you're working on and the healthd I > see in master is not even using uevents anymore (cf. [8]) but rather is > using sysfs it seems. :) > Before that healthd was using (cf. [7]) > > uevent_kernel_multicast_recv() > | > -> uevent_kernel_multicast_uid_recv > > the latter containing the check > > if (cred->uid != 0) { > /* ignoring netlink message from non-root user */ > goto out; > } > > Before my patchset here the uevents sent out came with cred->uid == INVALID_UID > and so healthd never received those events until very recently. I see. OK, let me try digging into this to figure out what exactly changed. Thanks. -- Dmitry ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net-next 2/2 v5] netns: restrict uevents 2019-06-16 11:50 ` Eric W. Biederman 2019-06-16 16:50 ` Christian Brauner @ 2019-06-16 16:50 ` Dmitry Torokhov 1 sibling, 0 replies; 10+ messages in thread From: Dmitry Torokhov @ 2019-06-16 16:50 UTC (permalink / raw) To: Eric W. Biederman Cc: Christian Brauner, David S. Miller, netdev, lkml, avagin, ktkhai, Serge E. Hallyn, Greg Kroah-Hartman Hi Eric, On Sun, Jun 16, 2019 at 4:50 AM Eric W. Biederman <ebiederm@xmission.com> wrote: > > Dmitry Torokhov <dmitry.torokhov@gmail.com> writes: > > > Hi Christian, > > > > On Sun, Apr 29, 2018 at 3:45 AM Christian Brauner > > <christian.brauner@ubuntu.com> wrote: > >> > >> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces") > >>abhishekbh@google.com > >> enabled sending hotplug events into all network namespaces back in 2010. > >> Over time the set of uevents that get sent into all network namespaces has > >> shrunk. We have now reached the point where hotplug events for all devices > >> that carry a namespace tag are filtered according to that namespace. > >> Specifically, they are filtered whenever the namespace tag of the kobject > >> does not match the namespace tag of the netlink socket. > >> Currently, only network devices carry namespace tags (i.e. network > >> namespace tags). Hence, uevents for network devices only show up in the > >> network namespace such devices are created in or moved to. > >> > >> However, any uevent for a kobject that does not have a namespace tag > >> associated with it will not be filtered and we will broadcast it into all > >> network namespaces. This behavior stopped making sense when user namespaces > >> were introduced. > >> > >> This patch simplifies and fixes couple of things: > >> - Split codepath for sending uevents by kobject namespace tags: > >> 1. Untagged kobjects - uevent_net_broadcast_untagged(): > >> Untagged kobjects will be broadcast into all uevent sockets recorded > >> in uevent_sock_list, i.e. into all network namespacs owned by the > >> intial user namespace. > >> 2. Tagged kobjects - uevent_net_broadcast_tagged(): > >> Tagged kobjects will only be broadcast into the network namespace they > >> were tagged with. > >> Handling of tagged kobjects in 2. does not cause any semantic changes. > >> This is just splitting out the filtering logic that was handled by > >> kobj_bcast_filter() before. > >> Handling of untagged kobjects in 1. will cause a semantic change. The > >> reasons why this is needed and ok have been discussed in [1]. Here is a > >> short summary: > >> - Userspace ignores uevents from network namespaces that are not owned by > >> the intial user namespace: > >> Uevents are filtered by userspace in a user namespace because the > >> received uid != 0. Instead the uid associated with the event will be > >> 65534 == "nobody" because the global root uid is not mapped. > >> This means we can safely and without introducing regressions modify the > >> kernel to not send uevents into all network namespaces whose owning > >> user namespace is not the initial user namespace because we know that > >> userspace will ignore the message because of the uid anyway. > >> I have a) verified that is is true for every udev implementation out > >> there b) that this behavior has been present in all udev > >> implementations from the very beginning. > > > > Unfortunately udev is not the only consumer of uevents, for example on > > Android there is healthd that also consumes uevents, and this > > particular change broke Android running in a container on Chrome OS. > > Can this be reverted? Or, if we want to keep this, how can containers > > that use separate user namespace still listen to uevents? > > The code has been in the main tree for over a year so at a minimum > reverting this has the real chance of causing a regression for > folks like lxc. > > I don't think Android running in a container on Chrome OS was even > available when this change was merged. So I don't think this falls > under the ordinary no regression rules. > > I may be wrong but I think this is a case of developing new code on an > old kernel and developing a dependence on a bug that had already been > fixed in newer kernels. No, this is not quite the case. We have been shipping Android on Chrome OS since 2016, the concept of running Android in a container definitely predates these series of patches. > I know Christian did his best to reach out to > everyone when this change came through, so only getting a bug report > over a year after the code was merged is concerning. This only proves that it is hard to change userspace-visible behavior as one can't really know who might be using the interfaces and for what reason. Again, udev is not the only consumer of uevents; as fat as I know Android does not use udev and there are other users of uevents as well. For example, libusb can be compiled to listen to uevents directly. > > That said uevents should be completely useless in a user namespace > except as letting you know something happened. Is that what healthd > is using them for? Yes, that is one of the use cases. Appearance of AC power supply can be used to adjust system behavior, for example. Thanks. -- Dmitry ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net-next 0/2 v5] netns: uevent filtering 2018-04-29 10:44 [PATCH net-next 0/2 v5] netns: uevent filtering Christian Brauner 2018-04-29 10:44 ` [PATCH net-next 1/2 v5] uevent: add alloc_uevent_skb() helper Christian Brauner 2018-04-29 10:44 ` [PATCH net-next 2/2 v5] netns: restrict uevents Christian Brauner @ 2018-04-30 15:55 ` Eric W. Biederman 2018-05-01 14:23 ` David Miller 2 siblings, 1 reply; 10+ messages in thread From: Eric W. Biederman @ 2018-04-30 15:55 UTC (permalink / raw) To: Christian Brauner Cc: davem, netdev, linux-kernel, avagin, ktkhai, serge, gregkh Christian Brauner <christian.brauner@ubuntu.com> writes: > Hey everyone, > > This is the new approach to uevent filtering as discussed (see the > threads in [1], [2], and [3]). It only contains *non-functional > changes*. > > This series deals with with fixing up uevent filtering logic: > - uevent filtering logic is simplified > - locking time on uevent_sock_list is minimized > - tagged and untagged kobjects are handled in separate codepaths > - permissions for userspace are fixed for network device uevents in > network namespaces owned by non-initial user namespaces > Udev is now able to see those events correctly which it wasn't before. > For example, moving a physical device into a network namespace not > owned by the initial user namespaces before gave: > > root@xen1:~# udevadm --debug monitor -k > calling: monitor > monitor will print the received events for: > KERNEL - the kernel uevent > > sender uid=65534, message ignored > sender uid=65534, message ignored > sender uid=65534, message ignored > sender uid=65534, message ignored > sender uid=65534, message ignored > > and now after the discussion and solution in [3] correctly gives: > > root@xen1:~# udevadm --debug monitor -k > calling: monitor > monitor will print the received events for: > KERNEL - the kernel uevent > > KERNEL[625.301042] add /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/enp1s0f1 (net) > KERNEL[625.301109] move /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/enp1s0f1 (net) > KERNEL[625.301138] move /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/eth1 (net) > KERNEL[655.333272] remove /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/eth1 (net) > > Thanks! > Christian > > [1]: https://lkml.org/lkml/2018/4/4/739 > [2]: https://lkml.org/lkml/2018/4/26/767 > [3]: https://lkml.org/lkml/2018/4/26/738 Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > Christian Brauner (2): > uevent: add alloc_uevent_skb() helper > netns: restrict uevents > > lib/kobject_uevent.c | 178 ++++++++++++++++++++++++++++++------------- > 1 file changed, 126 insertions(+), 52 deletions(-) Eric ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net-next 0/2 v5] netns: uevent filtering 2018-04-30 15:55 ` [PATCH net-next 0/2 v5] netns: uevent filtering Eric W. Biederman @ 2018-05-01 14:23 ` David Miller 0 siblings, 0 replies; 10+ messages in thread From: David Miller @ 2018-05-01 14:23 UTC (permalink / raw) To: ebiederm Cc: christian.brauner, netdev, linux-kernel, avagin, ktkhai, serge, gregkh From: ebiederm@xmission.com (Eric W. Biederman) Date: Mon, 30 Apr 2018 10:55:55 -0500 > Christian Brauner <christian.brauner@ubuntu.com> writes: > >> Hey everyone, >> >> This is the new approach to uevent filtering as discussed (see the >> threads in [1], [2], and [3]). It only contains *non-functional >> changes*. ... > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Series applied, thanks everyone. ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2019-06-16 17:14 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-04-29 10:44 [PATCH net-next 0/2 v5] netns: uevent filtering Christian Brauner 2018-04-29 10:44 ` [PATCH net-next 1/2 v5] uevent: add alloc_uevent_skb() helper Christian Brauner 2018-04-29 10:44 ` [PATCH net-next 2/2 v5] netns: restrict uevents Christian Brauner 2019-06-14 22:49 ` Dmitry Torokhov 2019-06-16 11:50 ` Eric W. Biederman 2019-06-16 16:50 ` Christian Brauner 2019-06-16 17:14 ` Dmitry Torokhov 2019-06-16 16:50 ` Dmitry Torokhov 2018-04-30 15:55 ` [PATCH net-next 0/2 v5] netns: uevent filtering Eric W. Biederman 2018-05-01 14:23 ` David Miller
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).