linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next] netns: filter uevents correctly
@ 2018-04-04 19:48 Christian Brauner
  2018-04-04 20:30 ` [PATCH net] " Christian Brauner
  2018-04-05 13:01 ` [PATCH net-next] " Kirill Tkhai
  0 siblings, 2 replies; 26+ messages in thread
From: Christian Brauner @ 2018-04-04 19:48 UTC (permalink / raw)
  To: ebiederm, davem, gregkh, netdev, linux-kernel
  Cc: avagin, ktkhai, serge, Christian Brauner

commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")

enabled sending hotplug events into all network namespaces back in 2010.
Over time the set of uevents that get sent into all network namespaces has
shrunk. We have now reached the point where hotplug events for all devices
that carry a namespace tag are filtered according to that namespace.

Specifically, they are filtered whenever the namespace tag of the kobject
does not match the namespace tag of the netlink socket. One example are
network devices. Uevents for network devices only show up in the network
namespaces these devices are moved to or created in.

However, any uevent for a kobject that does not have a namespace tag
associated with it will not be filtered and we will *try* to broadcast it
into all network namespaces.

The original patchset was written in 2010 before user namespaces were a
thing. With the introduction of user namespaces sending out uevents became
partially isolated as they were filtered by user namespaces:

net/netlink/af_netlink.c:do_one_broadcast()

if (!net_eq(sock_net(sk), p->net)) {
        if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
                return;

        if (!peernet_has_id(sock_net(sk), p->net))
                return;

        if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
                             CAP_NET_BROADCAST))
        j       return;
}

The file_ns_capable() check will check whether the caller had
CAP_NET_BROADCAST at the time of opening the netlink socket in the user
namespace of interest. This check is fine in general but seems insufficient
to me when paired with uevents. The reason is that devices always belong to
the initial user namespace so uevents for kobjects that do not carry a
namespace tag should never be sent into another user namespace. This has
been the intention all along. But there's one case where this breaks,
namely if a new user namespace is created by root on the host and an
identity mapping is established between root on the host and root in the
new user namespace. Here's a reproducer:

 sudo unshare -U --map-root
 udevadm monitor -k
 # Now change to initial user namespace and e.g. do
 modprobe kvm
 # or
 rmmod kvm

will allow the non-initial user namespace to retrieve all uevents from the
host. This seems very anecdotal given that in the general case user
namespaces do not see any uevents and also can't really do anything useful
with them.

Additionally, it is now possible to send uevents from userspace. As such we
can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
namespace of the network namespace of the netlink socket) userspace process
make a decision what uevents should be sent.

This makes me think that we should simply ensure that uevents for kobjects
that do not carry a namespace tag are *always* filtered by user namespace
in kobj_bcast_filter(). Specifically:
- If the owning user namespace of the uevent socket is not init_user_ns the
  event will always be filtered.
- If the network namespace the uevent socket belongs to was created in the
  initial user namespace but was opened from a non-initial user namespace
  the event will be filtered as well.
Put another way, uevents for kobjects not carrying a namespace tag are now
always only sent to the initial user namespace. The regression potential
for this is near to non-existent since user namespaces can't really do
anything with interesting devices.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 lib/kobject_uevent.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 15ea216a67ce..cb98cddb6e3b 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
 		return sock_ns != ns;
 	}
 
-	return 0;
+	/*
+	 * The kobject does not carry a namespace tag so filter by user
+	 * namespace below.
+	 */
+	if (sock_net(dsk)->user_ns != &init_user_ns)
+		return 1;
+
+	/* Check if socket was opened from non-initial user namespace. */
+	return sk_user_ns(dsk) != &init_user_ns;
 }
 #endif
 
-- 
2.15.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH net] netns: filter uevents correctly
  2018-04-04 19:48 [PATCH net-next] netns: filter uevents correctly Christian Brauner
@ 2018-04-04 20:30 ` Christian Brauner
  2018-04-04 22:38   ` Eric W. Biederman
  2018-04-05 13:01 ` [PATCH net-next] " Kirill Tkhai
  1 sibling, 1 reply; 26+ messages in thread
From: Christian Brauner @ 2018-04-04 20:30 UTC (permalink / raw)
  To: ebiederm, davem, gregkh, netdev, linux-kernel; +Cc: avagin, ktkhai, serge

On Wed, Apr 04, 2018 at 09:48:57PM +0200, Christian Brauner wrote:
> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> 
> enabled sending hotplug events into all network namespaces back in 2010.
> Over time the set of uevents that get sent into all network namespaces has
> shrunk. We have now reached the point where hotplug events for all devices
> that carry a namespace tag are filtered according to that namespace.
> 
> Specifically, they are filtered whenever the namespace tag of the kobject
> does not match the namespace tag of the netlink socket. One example are
> network devices. Uevents for network devices only show up in the network
> namespaces these devices are moved to or created in.
> 
> However, any uevent for a kobject that does not have a namespace tag
> associated with it will not be filtered and we will *try* to broadcast it
> into all network namespaces.
> 
> The original patchset was written in 2010 before user namespaces were a
> thing. With the introduction of user namespaces sending out uevents became
> partially isolated as they were filtered by user namespaces:
> 
> net/netlink/af_netlink.c:do_one_broadcast()
> 
> if (!net_eq(sock_net(sk), p->net)) {
>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>                 return;
> 
>         if (!peernet_has_id(sock_net(sk), p->net))
>                 return;
> 
>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>                              CAP_NET_BROADCAST))
>         j       return;
> }
> 
> The file_ns_capable() check will check whether the caller had
> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> namespace of interest. This check is fine in general but seems insufficient
> to me when paired with uevents. The reason is that devices always belong to
> the initial user namespace so uevents for kobjects that do not carry a
> namespace tag should never be sent into another user namespace. This has
> been the intention all along. But there's one case where this breaks,
> namely if a new user namespace is created by root on the host and an
> identity mapping is established between root on the host and root in the
> new user namespace. Here's a reproducer:
> 
>  sudo unshare -U --map-root
>  udevadm monitor -k
>  # Now change to initial user namespace and e.g. do
>  modprobe kvm
>  # or
>  rmmod kvm
> 
> will allow the non-initial user namespace to retrieve all uevents from the
> host. This seems very anecdotal given that in the general case user
> namespaces do not see any uevents and also can't really do anything useful
> with them.
> 
> Additionally, it is now possible to send uevents from userspace. As such we
> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> namespace of the network namespace of the netlink socket) userspace process
> make a decision what uevents should be sent.
> 
> This makes me think that we should simply ensure that uevents for kobjects
> that do not carry a namespace tag are *always* filtered by user namespace
> in kobj_bcast_filter(). Specifically:
> - If the owning user namespace of the uevent socket is not init_user_ns the
>   event will always be filtered.
> - If the network namespace the uevent socket belongs to was created in the
>   initial user namespace but was opened from a non-initial user namespace
>   the event will be filtered as well.
> Put another way, uevents for kobjects not carrying a namespace tag are now
> always only sent to the initial user namespace. The regression potential
> for this is near to non-existent since user namespaces can't really do
> anything with interesting devices.
> 
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

That was supposed to be [PATCH net] not [PATCH net-next] which is
obviously closed. Sorry about that.

Christian

> ---
>  lib/kobject_uevent.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> index 15ea216a67ce..cb98cddb6e3b 100644
> --- a/lib/kobject_uevent.c
> +++ b/lib/kobject_uevent.c
> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>  		return sock_ns != ns;
>  	}
>  
> -	return 0;
> +	/*
> +	 * The kobject does not carry a namespace tag so filter by user
> +	 * namespace below.
> +	 */
> +	if (sock_net(dsk)->user_ns != &init_user_ns)
> +		return 1;
> +
> +	/* Check if socket was opened from non-initial user namespace. */
> +	return sk_user_ns(dsk) != &init_user_ns;
>  }
>  #endif
>  
> -- 
> 2.15.1
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net] netns: filter uevents correctly
  2018-04-04 20:30 ` [PATCH net] " Christian Brauner
@ 2018-04-04 22:38   ` Eric W. Biederman
  2018-04-05  1:27     ` Christian Brauner
  2018-04-05  1:35     ` Christian Brauner
  0 siblings, 2 replies; 26+ messages in thread
From: Eric W. Biederman @ 2018-04-04 22:38 UTC (permalink / raw)
  To: Christian Brauner
  Cc: davem, gregkh, netdev, linux-kernel, avagin, ktkhai, serge

Christian Brauner <christian.brauner@canonical.com> writes:

> On Wed, Apr 04, 2018 at 09:48:57PM +0200, Christian Brauner wrote:
>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>> 
>> enabled sending hotplug events into all network namespaces back in 2010.
>> Over time the set of uevents that get sent into all network namespaces has
>> shrunk. We have now reached the point where hotplug events for all devices
>> that carry a namespace tag are filtered according to that namespace.
>> 
>> Specifically, they are filtered whenever the namespace tag of the kobject
>> does not match the namespace tag of the netlink socket. One example are
>> network devices. Uevents for network devices only show up in the network
>> namespaces these devices are moved to or created in.
>> 
>> However, any uevent for a kobject that does not have a namespace tag
>> associated with it will not be filtered and we will *try* to broadcast it
>> into all network namespaces.
>> 
>> The original patchset was written in 2010 before user namespaces were a
>> thing. With the introduction of user namespaces sending out uevents became
>> partially isolated as they were filtered by user namespaces:
>> 
>> net/netlink/af_netlink.c:do_one_broadcast()
>> 
>> if (!net_eq(sock_net(sk), p->net)) {
>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>>                 return;
>> 
>>         if (!peernet_has_id(sock_net(sk), p->net))
>>                 return;
>> 
>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>>                              CAP_NET_BROADCAST))
>>         j       return;
>> }
>> 
>> The file_ns_capable() check will check whether the caller had
>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
>> namespace of interest. This check is fine in general but seems insufficient
>> to me when paired with uevents. The reason is that devices always belong to
>> the initial user namespace so uevents for kobjects that do not carry a
>> namespace tag should never be sent into another user namespace. This has
>> been the intention all along. But there's one case where this breaks,
>> namely if a new user namespace is created by root on the host and an
>> identity mapping is established between root on the host and root in the
>> new user namespace. Here's a reproducer:
>> 
>>  sudo unshare -U --map-root
>>  udevadm monitor -k
>>  # Now change to initial user namespace and e.g. do
>>  modprobe kvm
>>  # or
>>  rmmod kvm
>> 
>> will allow the non-initial user namespace to retrieve all uevents from the
>> host. This seems very anecdotal given that in the general case user
>> namespaces do not see any uevents and also can't really do anything useful
>> with them.
>> 
>> Additionally, it is now possible to send uevents from userspace. As such we
>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
>> namespace of the network namespace of the netlink socket) userspace process
>> make a decision what uevents should be sent.
>> 
>> This makes me think that we should simply ensure that uevents for kobjects
>> that do not carry a namespace tag are *always* filtered by user namespace
>> in kobj_bcast_filter(). Specifically:
>> - If the owning user namespace of the uevent socket is not init_user_ns the
>>   event will always be filtered.
>> - If the network namespace the uevent socket belongs to was created in the
>>   initial user namespace but was opened from a non-initial user namespace
>>   the event will be filtered as well.
>> Put another way, uevents for kobjects not carrying a namespace tag are now
>> always only sent to the initial user namespace. The regression potential
>> for this is near to non-existent since user namespaces can't really do
>> anything with interesting devices.
>> 
>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
>
> That was supposed to be [PATCH net] not [PATCH net-next] which is
> obviously closed. Sorry about that.

This does not appear to be a fix.
This looks like feature work.
The motivation appears to be that looks wrong let's change it.

So let's please leave this for when net-next opens again so we can
have time to fully consider a change in semantics.

Thank you,
Eric

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net] netns: filter uevents correctly
  2018-04-04 22:38   ` Eric W. Biederman
@ 2018-04-05  1:27     ` Christian Brauner
  2018-04-06  2:02       ` David Miller
  2018-04-05  1:35     ` Christian Brauner
  1 sibling, 1 reply; 26+ messages in thread
From: Christian Brauner @ 2018-04-05  1:27 UTC (permalink / raw)
  To: Eric W. Biederman, David Miller
  Cc: Greg KH, Netdev, Linux Kernel Mailing List, Andrei Vagin,
	Kirill Tkhai, Serge Hallyn

[-- Attachment #1: Type: text/plain, Size: 4853 bytes --]

On Thu, Apr 5, 2018, 00:39 Eric W. Biederman <ebiederm@xmission.com> wrote:

> Christian Brauner <christian.brauner@canonical.com> writes:
>
> > On Wed, Apr 04, 2018 at 09:48:57PM +0200, Christian Brauner wrote:
> >> commit 07e98962fa77 ("kobject: Send hotplug events in all network
> namespaces")
> >>
> >> enabled sending hotplug events into all network namespaces back in 2010.
> >> Over time the set of uevents that get sent into all network namespaces
> has
> >> shrunk. We have now reached the point where hotplug events for all
> devices
> >> that carry a namespace tag are filtered according to that namespace.
> >>
> >> Specifically, they are filtered whenever the namespace tag of the
> kobject
> >> does not match the namespace tag of the netlink socket. One example are
> >> network devices. Uevents for network devices only show up in the network
> >> namespaces these devices are moved to or created in.
> >>
> >> However, any uevent for a kobject that does not have a namespace tag
> >> associated with it will not be filtered and we will *try* to broadcast
> it
> >> into all network namespaces.
> >>
> >> The original patchset was written in 2010 before user namespaces were a
> >> thing. With the introduction of user namespaces sending out uevents
> became
> >> partially isolated as they were filtered by user namespaces:
> >>
> >> net/netlink/af_netlink.c:do_one_broadcast()
> >>
> >> if (!net_eq(sock_net(sk), p->net)) {
> >>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >>                 return;
> >>
> >>         if (!peernet_has_id(sock_net(sk), p->net))
> >>                 return;
> >>
> >>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >>                              CAP_NET_BROADCAST))
> >>         j       return;
> >> }
> >>
> >> The file_ns_capable() check will check whether the caller had
> >> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> >> namespace of interest. This check is fine in general but seems
> insufficient
> >> to me when paired with uevents. The reason is that devices always
> belong to
> >> the initial user namespace so uevents for kobjects that do not carry a
> >> namespace tag should never be sent into another user namespace. This has
> >> been the intention all along. But there's one case where this breaks,
> >> namely if a new user namespace is created by root on the host and an
> >> identity mapping is established between root on the host and root in the
> >> new user namespace. Here's a reproducer:
> >>
> >>  sudo unshare -U --map-root
> >>  udevadm monitor -k
> >>  # Now change to initial user namespace and e.g. do
> >>  modprobe kvm
> >>  # or
> >>  rmmod kvm
> >>
> >> will allow the non-initial user namespace to retrieve all uevents from
> the
> >> host. This seems very anecdotal given that in the general case user
> >> namespaces do not see any uevents and also can't really do anything
> useful
> >> with them.
> >>
> >> Additionally, it is now possible to send uevents from userspace. As
> such we
> >> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> >> namespace of the network namespace of the netlink socket) userspace
> process
> >> make a decision what uevents should be sent.
> >>
> >> This makes me think that we should simply ensure that uevents for
> kobjects
> >> that do not carry a namespace tag are *always* filtered by user
> namespace
> >> in kobj_bcast_filter(). Specifically:
> >> - If the owning user namespace of the uevent socket is not init_user_ns
> the
> >>   event will always be filtered.
> >> - If the network namespace the uevent socket belongs to was created in
> the
> >>   initial user namespace but was opened from a non-initial user
> namespace
> >>   the event will be filtered as well.
> >> Put another way, uevents for kobjects not carrying a namespace tag are
> now
> >> always only sent to the initial user namespace. The regression potential
> >> for this is near to non-existent since user namespaces can't really do
> >> anything with interesting devices.
> >>
> >> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> >
> > That was supposed to be [PATCH net] not [PATCH net-next] which is
> > obviously closed. Sorry about that.
>
> This does not appear to be a fix.
> This looks like feature work.
> The motivation appears to be that looks wrong let's change it.
>

Hm, sure looks like a bug to me though. But I'm happy to leave this for
net-next. I don't want to rush things if this change in semantics is
non-trivial.

David, is it ok to queue this or would you prefer I resend when net-next
reopens?


> So let's please leave this for when net-next opens again so we can
> have time to fully consider a change in semantics.
>

Sure. Any thoughts on whether this approach makes sense?

Thanks!
Christian


> Thank you,
> Eric
>

[-- Attachment #2: Type: text/html, Size: 6722 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net] netns: filter uevents correctly
  2018-04-04 22:38   ` Eric W. Biederman
  2018-04-05  1:27     ` Christian Brauner
@ 2018-04-05  1:35     ` Christian Brauner
  1 sibling, 0 replies; 26+ messages in thread
From: Christian Brauner @ 2018-04-05  1:35 UTC (permalink / raw)
  To: Eric W. Biederman, davem
  Cc: Christian Brauner, gregkh, netdev, linux-kernel, avagin, ktkhai, serge

On Wed, Apr 04, 2018 at 05:38:02PM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
> 
> > On Wed, Apr 04, 2018 at 09:48:57PM +0200, Christian Brauner wrote:
> >> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >> 
> >> enabled sending hotplug events into all network namespaces back in 2010.
> >> Over time the set of uevents that get sent into all network namespaces has
> >> shrunk. We have now reached the point where hotplug events for all devices
> >> that carry a namespace tag are filtered according to that namespace.
> >> 
> >> Specifically, they are filtered whenever the namespace tag of the kobject
> >> does not match the namespace tag of the netlink socket. One example are
> >> network devices. Uevents for network devices only show up in the network
> >> namespaces these devices are moved to or created in.
> >> 
> >> However, any uevent for a kobject that does not have a namespace tag
> >> associated with it will not be filtered and we will *try* to broadcast it
> >> into all network namespaces.
> >> 
> >> The original patchset was written in 2010 before user namespaces were a
> >> thing. With the introduction of user namespaces sending out uevents became
> >> partially isolated as they were filtered by user namespaces:
> >> 
> >> net/netlink/af_netlink.c:do_one_broadcast()
> >> 
> >> if (!net_eq(sock_net(sk), p->net)) {
> >>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >>                 return;
> >> 
> >>         if (!peernet_has_id(sock_net(sk), p->net))
> >>                 return;
> >> 
> >>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >>                              CAP_NET_BROADCAST))
> >>         j       return;
> >> }
> >> 
> >> The file_ns_capable() check will check whether the caller had
> >> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> >> namespace of interest. This check is fine in general but seems insufficient
> >> to me when paired with uevents. The reason is that devices always belong to
> >> the initial user namespace so uevents for kobjects that do not carry a
> >> namespace tag should never be sent into another user namespace. This has
> >> been the intention all along. But there's one case where this breaks,
> >> namely if a new user namespace is created by root on the host and an
> >> identity mapping is established between root on the host and root in the
> >> new user namespace. Here's a reproducer:
> >> 
> >>  sudo unshare -U --map-root
> >>  udevadm monitor -k
> >>  # Now change to initial user namespace and e.g. do
> >>  modprobe kvm
> >>  # or
> >>  rmmod kvm
> >> 
> >> will allow the non-initial user namespace to retrieve all uevents from the
> >> host. This seems very anecdotal given that in the general case user
> >> namespaces do not see any uevents and also can't really do anything useful
> >> with them.
> >> 
> >> Additionally, it is now possible to send uevents from userspace. As such we
> >> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> >> namespace of the network namespace of the netlink socket) userspace process
> >> make a decision what uevents should be sent.
> >> 
> >> This makes me think that we should simply ensure that uevents for kobjects
> >> that do not carry a namespace tag are *always* filtered by user namespace
> >> in kobj_bcast_filter(). Specifically:
> >> - If the owning user namespace of the uevent socket is not init_user_ns the
> >>   event will always be filtered.
> >> - If the network namespace the uevent socket belongs to was created in the
> >>   initial user namespace but was opened from a non-initial user namespace
> >>   the event will be filtered as well.
> >> Put another way, uevents for kobjects not carrying a namespace tag are now
> >> always only sent to the initial user namespace. The regression potential
> >> for this is near to non-existent since user namespaces can't really do
> >> anything with interesting devices.
> >> 
> >> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> >
> > That was supposed to be [PATCH net] not [PATCH net-next] which is
> > obviously closed. Sorry about that.
> 
> This does not appear to be a fix.
> This looks like feature work.
> The motivation appears to be that looks wrong let's change it.

Hm, it looked like an oversight an therefore seems like a bug which is
why I thought would be a good candidate for net. Recent patches to the
semantics here just make it more obvious and provide a better argument
to fix it in the current release rather then defer it to the next one.
But I'm happy to leave this for net-next. I don't want to rush things if
this change in semantics is not trivial enough. For the record, I'm
merely fixing/expanding on the current status quo.

David, is it ok to queue this or would you prefer I resend when net-next
reopens?

> 
> So let's please leave this for when net-next opens again so we can
> have time to fully consider a change in semantics.

Sure, if we agree that this is the way to go I'm happy too.
Is your issue just with when we merge it and you disagree from a
technical perspective? That wasn't entirely obvious from your previous
mail. :)

Thanks!
Christian

> 
> Thank you,
> Eric

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next] netns: filter uevents correctly
  2018-04-04 19:48 [PATCH net-next] netns: filter uevents correctly Christian Brauner
  2018-04-04 20:30 ` [PATCH net] " Christian Brauner
@ 2018-04-05 13:01 ` Kirill Tkhai
  2018-04-05 14:07   ` Christian Brauner
  1 sibling, 1 reply; 26+ messages in thread
From: Kirill Tkhai @ 2018-04-05 13:01 UTC (permalink / raw)
  To: Christian Brauner, ebiederm, davem, gregkh, netdev, linux-kernel
  Cc: avagin, serge

On 04.04.2018 22:48, Christian Brauner wrote:
> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> 
> enabled sending hotplug events into all network namespaces back in 2010.
> Over time the set of uevents that get sent into all network namespaces has
> shrunk. We have now reached the point where hotplug events for all devices
> that carry a namespace tag are filtered according to that namespace.
> 
> Specifically, they are filtered whenever the namespace tag of the kobject
> does not match the namespace tag of the netlink socket. One example are
> network devices. Uevents for network devices only show up in the network
> namespaces these devices are moved to or created in.
> 
> However, any uevent for a kobject that does not have a namespace tag
> associated with it will not be filtered and we will *try* to broadcast it
> into all network namespaces.
> 
> The original patchset was written in 2010 before user namespaces were a
> thing. With the introduction of user namespaces sending out uevents became
> partially isolated as they were filtered by user namespaces:
> 
> net/netlink/af_netlink.c:do_one_broadcast()
> 
> if (!net_eq(sock_net(sk), p->net)) {
>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>                 return;
> 
>         if (!peernet_has_id(sock_net(sk), p->net))
>                 return;
> 
>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>                              CAP_NET_BROADCAST))
>         j       return;
> }
> 
> The file_ns_capable() check will check whether the caller had
> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> namespace of interest. This check is fine in general but seems insufficient
> to me when paired with uevents. The reason is that devices always belong to
> the initial user namespace so uevents for kobjects that do not carry a
> namespace tag should never be sent into another user namespace. This has
> been the intention all along. But there's one case where this breaks,
> namely if a new user namespace is created by root on the host and an
> identity mapping is established between root on the host and root in the
> new user namespace. Here's a reproducer:
> 
>  sudo unshare -U --map-root
>  udevadm monitor -k
>  # Now change to initial user namespace and e.g. do
>  modprobe kvm
>  # or
>  rmmod kvm
> 
> will allow the non-initial user namespace to retrieve all uevents from the
> host. This seems very anecdotal given that in the general case user
> namespaces do not see any uevents and also can't really do anything useful
> with them.
> 
> Additionally, it is now possible to send uevents from userspace. As such we
> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> namespace of the network namespace of the netlink socket) userspace process
> make a decision what uevents should be sent.
> 
> This makes me think that we should simply ensure that uevents for kobjects
> that do not carry a namespace tag are *always* filtered by user namespace
> in kobj_bcast_filter(). Specifically:
> - If the owning user namespace of the uevent socket is not init_user_ns the
>   event will always be filtered.
> - If the network namespace the uevent socket belongs to was created in the
>   initial user namespace but was opened from a non-initial user namespace
>   the event will be filtered as well.
> Put another way, uevents for kobjects not carrying a namespace tag are now
> always only sent to the initial user namespace. The regression potential
> for this is near to non-existent since user namespaces can't really do
> anything with interesting devices.
> 
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---
>  lib/kobject_uevent.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> index 15ea216a67ce..cb98cddb6e3b 100644
> --- a/lib/kobject_uevent.c
> +++ b/lib/kobject_uevent.c
> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>  		return sock_ns != ns;
>  	}
>  
> -	return 0;
> +	/*
> +	 * The kobject does not carry a namespace tag so filter by user
> +	 * namespace below.
> +	 */
> +	if (sock_net(dsk)->user_ns != &init_user_ns)
> +		return 1;
> +
> +	/* Check if socket was opened from non-initial user namespace. */
> +	return sk_user_ns(dsk) != &init_user_ns;
>  }
>  #endif

So, this prohibits to listen events of all devices except network-related
in containers? If it's so, I don't think it's a good solution. Uevents is not
net-devices-only related interface and it's used for all devices in system.
People may want to delegate block devices to nested user_ns, for example.

Better we should think about something like "generic device <-> user_ns" connection,
and to filter events by this user_ns.

Thanks,
Kirill

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next] netns: filter uevents correctly
  2018-04-05 13:01 ` [PATCH net-next] " Kirill Tkhai
@ 2018-04-05 14:07   ` Christian Brauner
  2018-04-05 14:26     ` Kirill Tkhai
  0 siblings, 1 reply; 26+ messages in thread
From: Christian Brauner @ 2018-04-05 14:07 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: ebiederm, davem, gregkh, netdev, linux-kernel, avagin, serge

On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
> On 04.04.2018 22:48, Christian Brauner wrote:
> > commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> > 
> > enabled sending hotplug events into all network namespaces back in 2010.
> > Over time the set of uevents that get sent into all network namespaces has
> > shrunk. We have now reached the point where hotplug events for all devices
> > that carry a namespace tag are filtered according to that namespace.
> > 
> > Specifically, they are filtered whenever the namespace tag of the kobject
> > does not match the namespace tag of the netlink socket. One example are
> > network devices. Uevents for network devices only show up in the network
> > namespaces these devices are moved to or created in.
> > 
> > However, any uevent for a kobject that does not have a namespace tag
> > associated with it will not be filtered and we will *try* to broadcast it
> > into all network namespaces.
> > 
> > The original patchset was written in 2010 before user namespaces were a
> > thing. With the introduction of user namespaces sending out uevents became
> > partially isolated as they were filtered by user namespaces:
> > 
> > net/netlink/af_netlink.c:do_one_broadcast()
> > 
> > if (!net_eq(sock_net(sk), p->net)) {
> >         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >                 return;
> > 
> >         if (!peernet_has_id(sock_net(sk), p->net))
> >                 return;
> > 
> >         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >                              CAP_NET_BROADCAST))
> >         j       return;
> > }
> > 
> > The file_ns_capable() check will check whether the caller had
> > CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> > namespace of interest. This check is fine in general but seems insufficient
> > to me when paired with uevents. The reason is that devices always belong to
> > the initial user namespace so uevents for kobjects that do not carry a
> > namespace tag should never be sent into another user namespace. This has
> > been the intention all along. But there's one case where this breaks,
> > namely if a new user namespace is created by root on the host and an
> > identity mapping is established between root on the host and root in the
> > new user namespace. Here's a reproducer:
> > 
> >  sudo unshare -U --map-root
> >  udevadm monitor -k
> >  # Now change to initial user namespace and e.g. do
> >  modprobe kvm
> >  # or
> >  rmmod kvm
> > 
> > will allow the non-initial user namespace to retrieve all uevents from the
> > host. This seems very anecdotal given that in the general case user
> > namespaces do not see any uevents and also can't really do anything useful
> > with them.
> > 
> > Additionally, it is now possible to send uevents from userspace. As such we
> > can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> > namespace of the network namespace of the netlink socket) userspace process
> > make a decision what uevents should be sent.
> > 
> > This makes me think that we should simply ensure that uevents for kobjects
> > that do not carry a namespace tag are *always* filtered by user namespace
> > in kobj_bcast_filter(). Specifically:
> > - If the owning user namespace of the uevent socket is not init_user_ns the
> >   event will always be filtered.
> > - If the network namespace the uevent socket belongs to was created in the
> >   initial user namespace but was opened from a non-initial user namespace
> >   the event will be filtered as well.
> > Put another way, uevents for kobjects not carrying a namespace tag are now
> > always only sent to the initial user namespace. The regression potential
> > for this is near to non-existent since user namespaces can't really do
> > anything with interesting devices.
> > 
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > ---
> >  lib/kobject_uevent.c | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> > index 15ea216a67ce..cb98cddb6e3b 100644
> > --- a/lib/kobject_uevent.c
> > +++ b/lib/kobject_uevent.c
> > @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >  		return sock_ns != ns;
> >  	}
> >  
> > -	return 0;
> > +	/*
> > +	 * The kobject does not carry a namespace tag so filter by user
> > +	 * namespace below.
> > +	 */
> > +	if (sock_net(dsk)->user_ns != &init_user_ns)
> > +		return 1;
> > +
> > +	/* Check if socket was opened from non-initial user namespace. */
> > +	return sk_user_ns(dsk) != &init_user_ns;
> >  }
> >  #endif
> 
> So, this prohibits to listen events of all devices except network-related
> in containers? If it's so, I don't think it's a good solution. Uevents is not

No, this is not correct: As it is right now *without my patch* no
non-initial user namespace is receiving *any uevents* but those
specifically namespaced such as those for network devices. This patch
doesn't change that at all. The commit message outlines this in detail
how this comes about.
There is only one case where this currently breaks and this is as I
outlined explicitly in my commit message when you create a new user
namespace and map container(0) -> host(0). This patch fixes this.

> net-devices-only related interface and it's used for all devices in system.
> People may want to delegate block devices to nested user_ns, for example.

That's fine but that's why I added uevent injection in a previous patch
series: I repeat no non-initial user namespace will by default receive
uevents.

Thanks!
Christian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next] netns: filter uevents correctly
  2018-04-05 14:07   ` Christian Brauner
@ 2018-04-05 14:26     ` Kirill Tkhai
  2018-04-05 14:41       ` Christian Brauner
  0 siblings, 1 reply; 26+ messages in thread
From: Kirill Tkhai @ 2018-04-05 14:26 UTC (permalink / raw)
  To: Christian Brauner
  Cc: ebiederm, davem, gregkh, netdev, linux-kernel, avagin, serge

On 05.04.2018 17:07, Christian Brauner wrote:
> On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
>> On 04.04.2018 22:48, Christian Brauner wrote:
>>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>>>
>>> enabled sending hotplug events into all network namespaces back in 2010.
>>> Over time the set of uevents that get sent into all network namespaces has
>>> shrunk. We have now reached the point where hotplug events for all devices
>>> that carry a namespace tag are filtered according to that namespace.
>>>
>>> Specifically, they are filtered whenever the namespace tag of the kobject
>>> does not match the namespace tag of the netlink socket. One example are
>>> network devices. Uevents for network devices only show up in the network
>>> namespaces these devices are moved to or created in.
>>>
>>> However, any uevent for a kobject that does not have a namespace tag
>>> associated with it will not be filtered and we will *try* to broadcast it
>>> into all network namespaces.
>>>
>>> The original patchset was written in 2010 before user namespaces were a
>>> thing. With the introduction of user namespaces sending out uevents became
>>> partially isolated as they were filtered by user namespaces:
>>>
>>> net/netlink/af_netlink.c:do_one_broadcast()
>>>
>>> if (!net_eq(sock_net(sk), p->net)) {
>>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>>>                 return;
>>>
>>>         if (!peernet_has_id(sock_net(sk), p->net))
>>>                 return;
>>>
>>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>>>                              CAP_NET_BROADCAST))
>>>         j       return;
>>> }
>>>
>>> The file_ns_capable() check will check whether the caller had
>>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
>>> namespace of interest. This check is fine in general but seems insufficient
>>> to me when paired with uevents. The reason is that devices always belong to
>>> the initial user namespace so uevents for kobjects that do not carry a
>>> namespace tag should never be sent into another user namespace. This has
>>> been the intention all along. But there's one case where this breaks,
>>> namely if a new user namespace is created by root on the host and an
>>> identity mapping is established between root on the host and root in the
>>> new user namespace. Here's a reproducer:
>>>
>>>  sudo unshare -U --map-root
>>>  udevadm monitor -k
>>>  # Now change to initial user namespace and e.g. do
>>>  modprobe kvm
>>>  # or
>>>  rmmod kvm
>>>
>>> will allow the non-initial user namespace to retrieve all uevents from the
>>> host. This seems very anecdotal given that in the general case user
>>> namespaces do not see any uevents and also can't really do anything useful
>>> with them.
>>>
>>> Additionally, it is now possible to send uevents from userspace. As such we
>>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
>>> namespace of the network namespace of the netlink socket) userspace process
>>> make a decision what uevents should be sent.
>>>
>>> This makes me think that we should simply ensure that uevents for kobjects
>>> that do not carry a namespace tag are *always* filtered by user namespace
>>> in kobj_bcast_filter(). Specifically:
>>> - If the owning user namespace of the uevent socket is not init_user_ns the
>>>   event will always be filtered.
>>> - If the network namespace the uevent socket belongs to was created in the
>>>   initial user namespace but was opened from a non-initial user namespace
>>>   the event will be filtered as well.
>>> Put another way, uevents for kobjects not carrying a namespace tag are now
>>> always only sent to the initial user namespace. The regression potential
>>> for this is near to non-existent since user namespaces can't really do
>>> anything with interesting devices.
>>>
>>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
>>> ---
>>>  lib/kobject_uevent.c | 10 +++++++++-
>>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>>> index 15ea216a67ce..cb98cddb6e3b 100644
>>> --- a/lib/kobject_uevent.c
>>> +++ b/lib/kobject_uevent.c
>>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>>>  		return sock_ns != ns;
>>>  	}
>>>  
>>> -	return 0;
>>> +	/*
>>> +	 * The kobject does not carry a namespace tag so filter by user
>>> +	 * namespace below.
>>> +	 */
>>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
>>> +		return 1;
>>> +
>>> +	/* Check if socket was opened from non-initial user namespace. */
>>> +	return sk_user_ns(dsk) != &init_user_ns;
>>>  }
>>>  #endif
>>
>> So, this prohibits to listen events of all devices except network-related
>> in containers? If it's so, I don't think it's a good solution. Uevents is not
> 
> No, this is not correct: As it is right now *without my patch* no
> non-initial user namespace is receiving *any uevents* but those
> specifically namespaced such as those for network devices. This patch
> doesn't change that at all. The commit message outlines this in detail
> how this comes about.
> There is only one case where this currently breaks and this is as I
> outlined explicitly in my commit message when you create a new user
> namespace and map container(0) -> host(0). This patch fixes this.

Could you please point the place, where non-initial user namespaces are filtered?
I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
Now it will return 1 sometimes.

>> net-devices-only related interface and it's used for all devices in system.
>> People may want to delegate block devices to nested user_ns, for example.
> 
> That's fine but that's why I added uevent injection in a previous patch
> series: I repeat no non-initial user namespace will by default receive
> uevents.

Thanks,
Kirill

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next] netns: filter uevents correctly
  2018-04-05 14:26     ` Kirill Tkhai
@ 2018-04-05 14:41       ` Christian Brauner
  2018-04-06  3:59         ` Eric W. Biederman
  0 siblings, 1 reply; 26+ messages in thread
From: Christian Brauner @ 2018-04-05 14:41 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Christian Brauner, ebiederm, davem, gregkh, netdev, linux-kernel,
	avagin, serge

On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
> On 05.04.2018 17:07, Christian Brauner wrote:
> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
> >> On 04.04.2018 22:48, Christian Brauner wrote:
> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >>>
> >>> enabled sending hotplug events into all network namespaces back in 2010.
> >>> Over time the set of uevents that get sent into all network namespaces has
> >>> shrunk. We have now reached the point where hotplug events for all devices
> >>> that carry a namespace tag are filtered according to that namespace.
> >>>
> >>> Specifically, they are filtered whenever the namespace tag of the kobject
> >>> does not match the namespace tag of the netlink socket. One example are
> >>> network devices. Uevents for network devices only show up in the network
> >>> namespaces these devices are moved to or created in.
> >>>
> >>> However, any uevent for a kobject that does not have a namespace tag
> >>> associated with it will not be filtered and we will *try* to broadcast it
> >>> into all network namespaces.
> >>>
> >>> The original patchset was written in 2010 before user namespaces were a
> >>> thing. With the introduction of user namespaces sending out uevents became
> >>> partially isolated as they were filtered by user namespaces:
> >>>
> >>> net/netlink/af_netlink.c:do_one_broadcast()
> >>>
> >>> if (!net_eq(sock_net(sk), p->net)) {
> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >>>                 return;
> >>>
> >>>         if (!peernet_has_id(sock_net(sk), p->net))
> >>>                 return;
> >>>
> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >>>                              CAP_NET_BROADCAST))
> >>>         j       return;
> >>> }
> >>>
> >>> The file_ns_capable() check will check whether the caller had
> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> >>> namespace of interest. This check is fine in general but seems insufficient
> >>> to me when paired with uevents. The reason is that devices always belong to
> >>> the initial user namespace so uevents for kobjects that do not carry a
> >>> namespace tag should never be sent into another user namespace. This has
> >>> been the intention all along. But there's one case where this breaks,
> >>> namely if a new user namespace is created by root on the host and an
> >>> identity mapping is established between root on the host and root in the
> >>> new user namespace. Here's a reproducer:
> >>>
> >>>  sudo unshare -U --map-root
> >>>  udevadm monitor -k
> >>>  # Now change to initial user namespace and e.g. do
> >>>  modprobe kvm
> >>>  # or
> >>>  rmmod kvm
> >>>
> >>> will allow the non-initial user namespace to retrieve all uevents from the
> >>> host. This seems very anecdotal given that in the general case user
> >>> namespaces do not see any uevents and also can't really do anything useful
> >>> with them.
> >>>
> >>> Additionally, it is now possible to send uevents from userspace. As such we
> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> >>> namespace of the network namespace of the netlink socket) userspace process
> >>> make a decision what uevents should be sent.
> >>>
> >>> This makes me think that we should simply ensure that uevents for kobjects
> >>> that do not carry a namespace tag are *always* filtered by user namespace
> >>> in kobj_bcast_filter(). Specifically:
> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
> >>>   event will always be filtered.
> >>> - If the network namespace the uevent socket belongs to was created in the
> >>>   initial user namespace but was opened from a non-initial user namespace
> >>>   the event will be filtered as well.
> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
> >>> always only sent to the initial user namespace. The regression potential
> >>> for this is near to non-existent since user namespaces can't really do
> >>> anything with interesting devices.
> >>>
> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> >>> ---
> >>>  lib/kobject_uevent.c | 10 +++++++++-
> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >>> index 15ea216a67ce..cb98cddb6e3b 100644
> >>> --- a/lib/kobject_uevent.c
> >>> +++ b/lib/kobject_uevent.c
> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >>>  		return sock_ns != ns;
> >>>  	}
> >>>  
> >>> -	return 0;
> >>> +	/*
> >>> +	 * The kobject does not carry a namespace tag so filter by user
> >>> +	 * namespace below.
> >>> +	 */
> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
> >>> +		return 1;
> >>> +
> >>> +	/* Check if socket was opened from non-initial user namespace. */
> >>> +	return sk_user_ns(dsk) != &init_user_ns;
> >>>  }
> >>>  #endif
> >>
> >> So, this prohibits to listen events of all devices except network-related
> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
> > 
> > No, this is not correct: As it is right now *without my patch* no
> > non-initial user namespace is receiving *any uevents* but those
> > specifically namespaced such as those for network devices. This patch
> > doesn't change that at all. The commit message outlines this in detail
> > how this comes about.
> > There is only one case where this currently breaks and this is as I
> > outlined explicitly in my commit message when you create a new user
> > namespace and map container(0) -> host(0). This patch fixes this.
> 
> Could you please point the place, where non-initial user namespaces are filtered?
> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
> Now it will return 1 sometimes.

Oh sure, it's in the commit message though. The callchain is
lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
net/netlink/af_netlink.c:do_one_broadcast():

This codepiece will check whether the openened socket holds
CAP_NET_BROADCAST in the user namespace of the target network namespace
which it won't because we don't have device namespaces and all devices
belong to the initial set of namespaces.

        if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
                             CAP_NET_BROADCAST))
        j       return;

The only exception are network devices but they will pass the preceeding
if (!net_eq(sock_net(sk), p->net)) for any netlink uevent socket opened
in the same network namespace.

> 
> >> net-devices-only related interface and it's used for all devices in system.
> >> People may want to delegate block devices to nested user_ns, for example.
> > 
> > That's fine but that's why I added uevent injection in a previous patch
> > series: I repeat no non-initial user namespace will by default receive
> > uevents.
> 
> Thanks,
> Kirill

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net] netns: filter uevents correctly
  2018-04-05  1:27     ` Christian Brauner
@ 2018-04-06  2:02       ` David Miller
  0 siblings, 0 replies; 26+ messages in thread
From: David Miller @ 2018-04-06  2:02 UTC (permalink / raw)
  To: christian.brauner
  Cc: ebiederm, gregkh, netdev, linux-kernel, avagin, ktkhai, serge

From: Christian Brauner <christian.brauner@canonical.com>
Date: Thu, 05 Apr 2018 01:27:16 +0000

> David, is it ok to queue this or would you prefer I resend when net-next
> reopens?

This definitely needs more discussion, and after the discussion some
further clarification in the commit log message based upon that
discussion if we move forward with this change at all.

Thank you.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next] netns: filter uevents correctly
  2018-04-05 14:41       ` Christian Brauner
@ 2018-04-06  3:59         ` Eric W. Biederman
  2018-04-06 13:07           ` Christian Brauner
  2018-04-09 15:46           ` Christian Brauner
  0 siblings, 2 replies; 26+ messages in thread
From: Eric W. Biederman @ 2018-04-06  3:59 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge

Christian Brauner <christian.brauner@canonical.com> writes:

> On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
>> On 05.04.2018 17:07, Christian Brauner wrote:
>> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
>> >> On 04.04.2018 22:48, Christian Brauner wrote:
>> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>> >>>
>> >>> enabled sending hotplug events into all network namespaces back in 2010.
>> >>> Over time the set of uevents that get sent into all network namespaces has
>> >>> shrunk. We have now reached the point where hotplug events for all devices
>> >>> that carry a namespace tag are filtered according to that namespace.
>> >>>
>> >>> Specifically, they are filtered whenever the namespace tag of the kobject
>> >>> does not match the namespace tag of the netlink socket. One example are
>> >>> network devices. Uevents for network devices only show up in the network
>> >>> namespaces these devices are moved to or created in.
>> >>>
>> >>> However, any uevent for a kobject that does not have a namespace tag
>> >>> associated with it will not be filtered and we will *try* to broadcast it
>> >>> into all network namespaces.
>> >>>
>> >>> The original patchset was written in 2010 before user namespaces were a
>> >>> thing. With the introduction of user namespaces sending out uevents became
>> >>> partially isolated as they were filtered by user namespaces:
>> >>>
>> >>> net/netlink/af_netlink.c:do_one_broadcast()
>> >>>
>> >>> if (!net_eq(sock_net(sk), p->net)) {
>> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>> >>>                 return;
>> >>>
>> >>>         if (!peernet_has_id(sock_net(sk), p->net))
>> >>>                 return;
>> >>>
>> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >>>                              CAP_NET_BROADCAST))
>> >>>         j       return;
>> >>> }
>> >>>
>> >>> The file_ns_capable() check will check whether the caller had
>> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
>> >>> namespace of interest. This check is fine in general but seems insufficient
>> >>> to me when paired with uevents. The reason is that devices always belong to
>> >>> the initial user namespace so uevents for kobjects that do not carry a
>> >>> namespace tag should never be sent into another user namespace. This has
>> >>> been the intention all along. But there's one case where this breaks,
>> >>> namely if a new user namespace is created by root on the host and an
>> >>> identity mapping is established between root on the host and root in the
>> >>> new user namespace. Here's a reproducer:
>> >>>
>> >>>  sudo unshare -U --map-root
>> >>>  udevadm monitor -k
>> >>>  # Now change to initial user namespace and e.g. do
>> >>>  modprobe kvm
>> >>>  # or
>> >>>  rmmod kvm
>> >>>
>> >>> will allow the non-initial user namespace to retrieve all uevents from the
>> >>> host. This seems very anecdotal given that in the general case user
>> >>> namespaces do not see any uevents and also can't really do anything useful
>> >>> with them.
>> >>>
>> >>> Additionally, it is now possible to send uevents from userspace. As such we
>> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
>> >>> namespace of the network namespace of the netlink socket) userspace process
>> >>> make a decision what uevents should be sent.
>> >>>
>> >>> This makes me think that we should simply ensure that uevents for kobjects
>> >>> that do not carry a namespace tag are *always* filtered by user namespace
>> >>> in kobj_bcast_filter(). Specifically:
>> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
>> >>>   event will always be filtered.
>> >>> - If the network namespace the uevent socket belongs to was created in the
>> >>>   initial user namespace but was opened from a non-initial user namespace
>> >>>   the event will be filtered as well.
>> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
>> >>> always only sent to the initial user namespace. The regression potential
>> >>> for this is near to non-existent since user namespaces can't really do
>> >>> anything with interesting devices.
>> >>>
>> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
>> >>> ---
>> >>>  lib/kobject_uevent.c | 10 +++++++++-
>> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
>> >>>
>> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> >>> index 15ea216a67ce..cb98cddb6e3b 100644
>> >>> --- a/lib/kobject_uevent.c
>> >>> +++ b/lib/kobject_uevent.c
>> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>> >>>  		return sock_ns != ns;
>> >>>  	}
>> >>>  
>> >>> -	return 0;
>> >>> +	/*
>> >>> +	 * The kobject does not carry a namespace tag so filter by user
>> >>> +	 * namespace below.
>> >>> +	 */
>> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
>> >>> +		return 1;
>> >>> +
>> >>> +	/* Check if socket was opened from non-initial user namespace. */
>> >>> +	return sk_user_ns(dsk) != &init_user_ns;
>> >>>  }
>> >>>  #endif
>> >>
>> >> So, this prohibits to listen events of all devices except network-related
>> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
>> > 
>> > No, this is not correct: As it is right now *without my patch* no
>> > non-initial user namespace is receiving *any uevents* but those
>> > specifically namespaced such as those for network devices. This patch
>> > doesn't change that at all. The commit message outlines this in detail
>> > how this comes about.
>> > There is only one case where this currently breaks and this is as I
>> > outlined explicitly in my commit message when you create a new user
>> > namespace and map container(0) -> host(0). This patch fixes this.
>> 
>> Could you please point the place, where non-initial user namespaces are filtered?
>> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
>> Now it will return 1 sometimes.
>
> Oh sure, it's in the commit message though. The callchain is
> lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
> nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
> net/netlink/af_netlink.c:do_one_broadcast():
>
> This codepiece will check whether the openened socket holds
> CAP_NET_BROADCAST in the user namespace of the target network namespace
> which it won't because we don't have device namespaces and all devices
> belong to the initial set of namespaces.
>
>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>                              CAP_NET_BROADCAST))
>         j       return;
>

The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
on their socket and has had someone with the appropriate privileges
assign a peerrnetid.

All of which is to say that file_ns_capable is not nearly as applicable
as it might be, and if you can pass the other two checks I think it is
pointless (because the peernet attributes are not generated for
kobj_uevents) but valid to receive events from outside your network
namespace.


I might be missing something but I don't see anything excluding network
namespaces owned by !init_user_ns excluded from the kobject_uevent
logic.

The uevent_sock_list has one entry per network namespace. And
kobject_uevent_net_broacast appears to walk each one.

I had a memory of filtering uevent messages and I had a memory
that the netlink_has_listeners had a per network namespace effect.
Neither seems true from my inspection of the code tonight.

If we are not filtering ordinary uevents at least at the user namespace
level that does seem to be at least a nuisance.


Christian can you dig a little deeper into this.  I have a feeling that
there are some real efficiency improvements that we could make to
kobject_uevent_net_broadcast if nothing else.

Perhaps you could see where uevents are broadcast by poking
the sysfs uevent of an existing device, and triggering a retransmit.

Eric

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next] netns: filter uevents correctly
  2018-04-06  3:59         ` Eric W. Biederman
@ 2018-04-06 13:07           ` Christian Brauner
  2018-04-06 14:45             ` Eric W. Biederman
  2018-04-09 15:46           ` Christian Brauner
  1 sibling, 1 reply; 26+ messages in thread
From: Christian Brauner @ 2018-04-06 13:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge

On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
> 
> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
> >> On 05.04.2018 17:07, Christian Brauner wrote:
> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >> >>>
> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
> >> >>> Over time the set of uevents that get sent into all network namespaces has
> >> >>> shrunk. We have now reached the point where hotplug events for all devices
> >> >>> that carry a namespace tag are filtered according to that namespace.
> >> >>>
> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
> >> >>> does not match the namespace tag of the netlink socket. One example are
> >> >>> network devices. Uevents for network devices only show up in the network
> >> >>> namespaces these devices are moved to or created in.
> >> >>>
> >> >>> However, any uevent for a kobject that does not have a namespace tag
> >> >>> associated with it will not be filtered and we will *try* to broadcast it
> >> >>> into all network namespaces.
> >> >>>
> >> >>> The original patchset was written in 2010 before user namespaces were a
> >> >>> thing. With the introduction of user namespaces sending out uevents became
> >> >>> partially isolated as they were filtered by user namespaces:
> >> >>>
> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
> >> >>>
> >> >>> if (!net_eq(sock_net(sk), p->net)) {
> >> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >> >>>                 return;
> >> >>>
> >> >>>         if (!peernet_has_id(sock_net(sk), p->net))
> >> >>>                 return;
> >> >>>
> >> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >>>                              CAP_NET_BROADCAST))
> >> >>>         j       return;
> >> >>> }
> >> >>>
> >> >>> The file_ns_capable() check will check whether the caller had
> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> >> >>> namespace of interest. This check is fine in general but seems insufficient
> >> >>> to me when paired with uevents. The reason is that devices always belong to
> >> >>> the initial user namespace so uevents for kobjects that do not carry a
> >> >>> namespace tag should never be sent into another user namespace. This has
> >> >>> been the intention all along. But there's one case where this breaks,
> >> >>> namely if a new user namespace is created by root on the host and an
> >> >>> identity mapping is established between root on the host and root in the
> >> >>> new user namespace. Here's a reproducer:
> >> >>>
> >> >>>  sudo unshare -U --map-root
> >> >>>  udevadm monitor -k
> >> >>>  # Now change to initial user namespace and e.g. do
> >> >>>  modprobe kvm
> >> >>>  # or
> >> >>>  rmmod kvm
> >> >>>
> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
> >> >>> host. This seems very anecdotal given that in the general case user
> >> >>> namespaces do not see any uevents and also can't really do anything useful
> >> >>> with them.
> >> >>>
> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> >> >>> namespace of the network namespace of the netlink socket) userspace process
> >> >>> make a decision what uevents should be sent.
> >> >>>
> >> >>> This makes me think that we should simply ensure that uevents for kobjects
> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
> >> >>> in kobj_bcast_filter(). Specifically:
> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
> >> >>>   event will always be filtered.
> >> >>> - If the network namespace the uevent socket belongs to was created in the
> >> >>>   initial user namespace but was opened from a non-initial user namespace
> >> >>>   the event will be filtered as well.
> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
> >> >>> always only sent to the initial user namespace. The regression potential
> >> >>> for this is near to non-existent since user namespaces can't really do
> >> >>> anything with interesting devices.
> >> >>>
> >> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> >> >>> ---
> >> >>>  lib/kobject_uevent.c | 10 +++++++++-
> >> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
> >> >>>
> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644
> >> >>> --- a/lib/kobject_uevent.c
> >> >>> +++ b/lib/kobject_uevent.c
> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >> >>>  		return sock_ns != ns;
> >> >>>  	}
> >> >>>  
> >> >>> -	return 0;
> >> >>> +	/*
> >> >>> +	 * The kobject does not carry a namespace tag so filter by user
> >> >>> +	 * namespace below.
> >> >>> +	 */
> >> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
> >> >>> +		return 1;
> >> >>> +
> >> >>> +	/* Check if socket was opened from non-initial user namespace. */
> >> >>> +	return sk_user_ns(dsk) != &init_user_ns;
> >> >>>  }
> >> >>>  #endif
> >> >>
> >> >> So, this prohibits to listen events of all devices except network-related
> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
> >> > 
> >> > No, this is not correct: As it is right now *without my patch* no
> >> > non-initial user namespace is receiving *any uevents* but those
> >> > specifically namespaced such as those for network devices. This patch
> >> > doesn't change that at all. The commit message outlines this in detail
> >> > how this comes about.
> >> > There is only one case where this currently breaks and this is as I
> >> > outlined explicitly in my commit message when you create a new user
> >> > namespace and map container(0) -> host(0). This patch fixes this.
> >> 
> >> Could you please point the place, where non-initial user namespaces are filtered?
> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
> >> Now it will return 1 sometimes.
> >
> > Oh sure, it's in the commit message though. The callchain is
> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
> > net/netlink/af_netlink.c:do_one_broadcast():
> >
> > This codepiece will check whether the openened socket holds
> > CAP_NET_BROADCAST in the user namespace of the target network namespace
> > which it won't because we don't have device namespaces and all devices
> > belong to the initial set of namespaces.
> >
> >         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >                              CAP_NET_BROADCAST))
> >         j       return;
> >
> 
> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
> on their socket and has had someone with the appropriate privileges
> assign a peerrnetid.
> 
> All of which is to say that file_ns_capable is not nearly as applicable
> as it might be, and if you can pass the other two checks I think it is
> pointless (because the peernet attributes are not generated for
> kobj_uevents) but valid to receive events from outside your network
> namespace.
> 
> 
> I might be missing something but I don't see anything excluding network
> namespaces owned by !init_user_ns excluded from the kobject_uevent
> logic.
> 
> The uevent_sock_list has one entry per network namespace. And
> kobject_uevent_net_broacast appears to walk each one.

Yeah, it definitely does.

> 
> I had a memory of filtering uevent messages and I had a memory
> that the netlink_has_listeners had a per network namespace effect.
> Neither seems true from my inspection of the code tonight.

Yeah, I drew the same conclusion.

> 
> If we are not filtering ordinary uevents at least at the user namespace
> level that does seem to be at least a nuisance.

This patch would filter based on user namespace and bump it from "we're
accidently doing the right thing" to "we're doing the right on purpose"
and without accidently leaking uevents in some corner cases.
Using kobj_bcast_filter() for this seems like the right thing to do.
This sounds like you don't disagree with the approach I'm taking here?

On a sidenote, it also very much feels like we're leaking information if
we're not filtering based on user namespaces on purpose. Non-initial
user namespaces should not be able to receive information about device
attribues simply via netlink uevent messages. At least not *trivially*
[1] by opening a netlink uevent socket.

> 
> 
> Christian can you dig a little deeper into this.  I have a feeling that
> there are some real efficiency improvements that we could make to
> kobject_uevent_net_broadcast if nothing else.

Yeah, for sure! I already started doing this. This patch here is
basically a preparatory patch for that work which is on my todo.

Christian

> 
> Perhaps you could see where uevents are broadcast by poking
> the sysfs uevent of an existing device, and triggering a retransmit.
> 
> Eric
> 
>                                                      
[1]: I mean they technically could very likely still try to gather the
     same info by constantly parsing all of sysfs and try to detect new
     PCI paths/directories and so on but it shouldn't be as easy as
     opening a netlink uevent socket.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next] netns: filter uevents correctly
  2018-04-06 13:07           ` Christian Brauner
@ 2018-04-06 14:45             ` Eric W. Biederman
  2018-04-06 16:07               ` Christian Brauner
  0 siblings, 1 reply; 26+ messages in thread
From: Eric W. Biederman @ 2018-04-06 14:45 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge

Christian Brauner <christian.brauner@canonical.com> writes:

> On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
>> Christian Brauner <christian.brauner@canonical.com> writes:
>> 
>> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
>> >> On 05.04.2018 17:07, Christian Brauner wrote:
>> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
>> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
>> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>> >> >>>
>> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
>> >> >>> Over time the set of uevents that get sent into all network namespaces has
>> >> >>> shrunk. We have now reached the point where hotplug events for all devices
>> >> >>> that carry a namespace tag are filtered according to that namespace.
>> >> >>>
>> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
>> >> >>> does not match the namespace tag of the netlink socket. One example are
>> >> >>> network devices. Uevents for network devices only show up in the network
>> >> >>> namespaces these devices are moved to or created in.
>> >> >>>
>> >> >>> However, any uevent for a kobject that does not have a namespace tag
>> >> >>> associated with it will not be filtered and we will *try* to broadcast it
>> >> >>> into all network namespaces.
>> >> >>>
>> >> >>> The original patchset was written in 2010 before user namespaces were a
>> >> >>> thing. With the introduction of user namespaces sending out uevents became
>> >> >>> partially isolated as they were filtered by user namespaces:
>> >> >>>
>> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
>> >> >>>
>> >> >>> if (!net_eq(sock_net(sk), p->net)) {
>> >> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>> >> >>>                 return;
>> >> >>>
>> >> >>>         if (!peernet_has_id(sock_net(sk), p->net))
>> >> >>>                 return;
>> >> >>>
>> >> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >> >>>                              CAP_NET_BROADCAST))
>> >> >>>         j       return;
>> >> >>> }
>> >> >>>
>> >> >>> The file_ns_capable() check will check whether the caller had
>> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
>> >> >>> namespace of interest. This check is fine in general but seems insufficient
>> >> >>> to me when paired with uevents. The reason is that devices always belong to
>> >> >>> the initial user namespace so uevents for kobjects that do not carry a
>> >> >>> namespace tag should never be sent into another user namespace. This has
>> >> >>> been the intention all along. But there's one case where this breaks,
>> >> >>> namely if a new user namespace is created by root on the host and an
>> >> >>> identity mapping is established between root on the host and root in the
>> >> >>> new user namespace. Here's a reproducer:
>> >> >>>
>> >> >>>  sudo unshare -U --map-root
>> >> >>>  udevadm monitor -k
>> >> >>>  # Now change to initial user namespace and e.g. do
>> >> >>>  modprobe kvm
>> >> >>>  # or
>> >> >>>  rmmod kvm
>> >> >>>
>> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
>> >> >>> host. This seems very anecdotal given that in the general case user
>> >> >>> namespaces do not see any uevents and also can't really do anything useful
>> >> >>> with them.
>> >> >>>
>> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
>> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
>> >> >>> namespace of the network namespace of the netlink socket) userspace process
>> >> >>> make a decision what uevents should be sent.
>> >> >>>
>> >> >>> This makes me think that we should simply ensure that uevents for kobjects
>> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
>> >> >>> in kobj_bcast_filter(). Specifically:
>> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
>> >> >>>   event will always be filtered.
>> >> >>> - If the network namespace the uevent socket belongs to was created in the
>> >> >>>   initial user namespace but was opened from a non-initial user namespace
>> >> >>>   the event will be filtered as well.
>> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
>> >> >>> always only sent to the initial user namespace. The regression potential
>> >> >>> for this is near to non-existent since user namespaces can't really do
>> >> >>> anything with interesting devices.
>> >> >>>
>> >> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
>> >> >>> ---
>> >> >>>  lib/kobject_uevent.c | 10 +++++++++-
>> >> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
>> >> >>>
>> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644
>> >> >>> --- a/lib/kobject_uevent.c
>> >> >>> +++ b/lib/kobject_uevent.c
>> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>> >> >>>  		return sock_ns != ns;
>> >> >>>  	}
>> >> >>>  
>> >> >>> -	return 0;
>> >> >>> +	/*
>> >> >>> +	 * The kobject does not carry a namespace tag so filter by user
>> >> >>> +	 * namespace below.
>> >> >>> +	 */
>> >> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
>> >> >>> +		return 1;
>> >> >>> +
>> >> >>> +	/* Check if socket was opened from non-initial user namespace. */
>> >> >>> +	return sk_user_ns(dsk) != &init_user_ns;
>> >> >>>  }
>> >> >>>  #endif
>> >> >>
>> >> >> So, this prohibits to listen events of all devices except network-related
>> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
>> >> > 
>> >> > No, this is not correct: As it is right now *without my patch* no
>> >> > non-initial user namespace is receiving *any uevents* but those
>> >> > specifically namespaced such as those for network devices. This patch
>> >> > doesn't change that at all. The commit message outlines this in detail
>> >> > how this comes about.
>> >> > There is only one case where this currently breaks and this is as I
>> >> > outlined explicitly in my commit message when you create a new user
>> >> > namespace and map container(0) -> host(0). This patch fixes this.
>> >> 
>> >> Could you please point the place, where non-initial user namespaces are filtered?
>> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
>> >> Now it will return 1 sometimes.
>> >
>> > Oh sure, it's in the commit message though. The callchain is
>> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
>> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
>> > net/netlink/af_netlink.c:do_one_broadcast():
>> >
>> > This codepiece will check whether the openened socket holds
>> > CAP_NET_BROADCAST in the user namespace of the target network namespace
>> > which it won't because we don't have device namespaces and all devices
>> > belong to the initial set of namespaces.
>> >
>> >         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >                              CAP_NET_BROADCAST))
>> >         j       return;
>> >
>> 
>> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
>> on their socket and has had someone with the appropriate privileges
>> assign a peerrnetid.
>> 
>> All of which is to say that file_ns_capable is not nearly as applicable
>> as it might be, and if you can pass the other two checks I think it is
>> pointless (because the peernet attributes are not generated for
>> kobj_uevents) but valid to receive events from outside your network
>> namespace.
>> 
>> 
>> I might be missing something but I don't see anything excluding network
>> namespaces owned by !init_user_ns excluded from the kobject_uevent
>> logic.
>> 
>> The uevent_sock_list has one entry per network namespace. And
>> kobject_uevent_net_broacast appears to walk each one.
>
> Yeah, it definitely does.
>
>> 
>> I had a memory of filtering uevent messages and I had a memory
>> that the netlink_has_listeners had a per network namespace effect.
>> Neither seems true from my inspection of the code tonight.
>
> Yeah, I drew the same conclusion.
>
>> 
>> If we are not filtering ordinary uevents at least at the user namespace
>> level that does seem to be at least a nuisance.
>
> This patch would filter based on user namespace and bump it from "we're
> accidently doing the right thing" to "we're doing the right on purpose"
> and without accidently leaking uevents in some corner cases.
> Using kobj_bcast_filter() for this seems like the right thing to do.
> This sounds like you don't disagree with the approach I'm taking here?

How are we accidentally filtering?  If we can not describe how the code
currently works to achieve something we don't understand it well enough
to change it.



> On a sidenote, it also very much feels like we're leaking information if
> we're not filtering based on user namespaces on purpose. Non-initial
> user namespaces should not be able to receive information about device
> attribues simply via netlink uevent messages. At least not *trivially*
> [1] by opening a netlink uevent socket.

This is not at all about isolation.  Devices don't belong to user
namespaces.  This is about it being useless/silly to send events
to those sockets as the receiver could not do anything with the uevent.
At a practical level there should be no receivers.  Plus performance
issues.  At least my memory is that any unprivileged user on the system
is allowed to listen to those events.

At least at one point udev listening and reacting to events in every
network namespace in containers was a significant slow down on the
system.

The report was that adding netlink_has_listeners greatly sped up uevent
sending.   That testing appears faulty.  I think the real gain was
userspace not listening to uevents in every container.

Which means uevent injection for containers has some real technical
challenges to overcome before it can be wide spread.

>> Christian can you dig a little deeper into this.  I have a feeling that
>> there are some real efficiency improvements that we could make to
>> kobject_uevent_net_broadcast if nothing else.
>
> Yeah, for sure! I already started doing this. This patch here is
> basically a preparatory patch for that work which is on my todo.

Limiting the network namespaces on uevent_sock_list to just network
namespaces owned by the initial user namespaces would be a lot better
than your patch.


At a practical level I am concerned what will happen when we stand up a
uevent listener aka udev in say 100 or maybe 1000 unprivileged
containers on a system.  History suggests the current data structures
won't scale to the problem, and the fixes that were put in place kernel
side only did not change anything.  It was only the userspace fixes
that made a difference.

Which suggests either improving the isolation between network namespaces
for netlink multicast sockets or putting:


	if (!net_eq(sock_net(sk), p->net)) {
		if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
			return;

		if (!peernet_has_id(sock_net(sk), p->net))
			return;

		if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
				     CAP_NET_BROADCAST))
			return;
	}

In a filter and having an appropriate filter so that
netlink_broadcast_filtered only needs to be called once from
kobject_uevent_net_broadcast.

It is more difficult but in practice I expect we have far more to gain
by fixing the mc_list and the netlink_has_listeners function to be per
network namespace basis.  Handling the NETLINK_F_LISTEN_ALL_NSID will
be trickier in that case but the current semantics appear correct
for that case.

But before we do anything we have to test and see if the assertion
that we don't make it to netlink listeners in network namespaces
that are outside of the init_user_ns is true.   If it is true we need to
find the code that makes it true.

Eric

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next] netns: filter uevents correctly
  2018-04-06 14:45             ` Eric W. Biederman
@ 2018-04-06 16:07               ` Christian Brauner
  2018-04-06 16:48                 ` Eric W. Biederman
  0 siblings, 1 reply; 26+ messages in thread
From: Christian Brauner @ 2018-04-06 16:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge

On Fri, Apr 06, 2018 at 09:45:41AM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
> 
> > On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> 
> >> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
> >> >> On 05.04.2018 17:07, Christian Brauner wrote:
> >> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
> >> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
> >> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >> >> >>>
> >> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
> >> >> >>> Over time the set of uevents that get sent into all network namespaces has
> >> >> >>> shrunk. We have now reached the point where hotplug events for all devices
> >> >> >>> that carry a namespace tag are filtered according to that namespace.
> >> >> >>>
> >> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
> >> >> >>> does not match the namespace tag of the netlink socket. One example are
> >> >> >>> network devices. Uevents for network devices only show up in the network
> >> >> >>> namespaces these devices are moved to or created in.
> >> >> >>>
> >> >> >>> However, any uevent for a kobject that does not have a namespace tag
> >> >> >>> associated with it will not be filtered and we will *try* to broadcast it
> >> >> >>> into all network namespaces.
> >> >> >>>
> >> >> >>> The original patchset was written in 2010 before user namespaces were a
> >> >> >>> thing. With the introduction of user namespaces sending out uevents became
> >> >> >>> partially isolated as they were filtered by user namespaces:
> >> >> >>>
> >> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
> >> >> >>>
> >> >> >>> if (!net_eq(sock_net(sk), p->net)) {
> >> >> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >> >> >>>                 return;
> >> >> >>>
> >> >> >>>         if (!peernet_has_id(sock_net(sk), p->net))
> >> >> >>>                 return;
> >> >> >>>
> >> >> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >> >>>                              CAP_NET_BROADCAST))
> >> >> >>>         j       return;
> >> >> >>> }
> >> >> >>>
> >> >> >>> The file_ns_capable() check will check whether the caller had
> >> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> >> >> >>> namespace of interest. This check is fine in general but seems insufficient
> >> >> >>> to me when paired with uevents. The reason is that devices always belong to
> >> >> >>> the initial user namespace so uevents for kobjects that do not carry a
> >> >> >>> namespace tag should never be sent into another user namespace. This has
> >> >> >>> been the intention all along. But there's one case where this breaks,
> >> >> >>> namely if a new user namespace is created by root on the host and an
> >> >> >>> identity mapping is established between root on the host and root in the
> >> >> >>> new user namespace. Here's a reproducer:
> >> >> >>>
> >> >> >>>  sudo unshare -U --map-root
> >> >> >>>  udevadm monitor -k
> >> >> >>>  # Now change to initial user namespace and e.g. do
> >> >> >>>  modprobe kvm
> >> >> >>>  # or
> >> >> >>>  rmmod kvm
> >> >> >>>
> >> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
> >> >> >>> host. This seems very anecdotal given that in the general case user
> >> >> >>> namespaces do not see any uevents and also can't really do anything useful
> >> >> >>> with them.
> >> >> >>>
> >> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
> >> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> >> >> >>> namespace of the network namespace of the netlink socket) userspace process
> >> >> >>> make a decision what uevents should be sent.
> >> >> >>>
> >> >> >>> This makes me think that we should simply ensure that uevents for kobjects
> >> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
> >> >> >>> in kobj_bcast_filter(). Specifically:
> >> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
> >> >> >>>   event will always be filtered.
> >> >> >>> - If the network namespace the uevent socket belongs to was created in the
> >> >> >>>   initial user namespace but was opened from a non-initial user namespace
> >> >> >>>   the event will be filtered as well.
> >> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
> >> >> >>> always only sent to the initial user namespace. The regression potential
> >> >> >>> for this is near to non-existent since user namespaces can't really do
> >> >> >>> anything with interesting devices.
> >> >> >>>
> >> >> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> >> >> >>> ---
> >> >> >>>  lib/kobject_uevent.c | 10 +++++++++-
> >> >> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
> >> >> >>>
> >> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644
> >> >> >>> --- a/lib/kobject_uevent.c
> >> >> >>> +++ b/lib/kobject_uevent.c
> >> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >> >> >>>  		return sock_ns != ns;
> >> >> >>>  	}
> >> >> >>>  
> >> >> >>> -	return 0;
> >> >> >>> +	/*
> >> >> >>> +	 * The kobject does not carry a namespace tag so filter by user
> >> >> >>> +	 * namespace below.
> >> >> >>> +	 */
> >> >> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
> >> >> >>> +		return 1;
> >> >> >>> +
> >> >> >>> +	/* Check if socket was opened from non-initial user namespace. */
> >> >> >>> +	return sk_user_ns(dsk) != &init_user_ns;
> >> >> >>>  }
> >> >> >>>  #endif
> >> >> >>
> >> >> >> So, this prohibits to listen events of all devices except network-related
> >> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
> >> >> > 
> >> >> > No, this is not correct: As it is right now *without my patch* no
> >> >> > non-initial user namespace is receiving *any uevents* but those
> >> >> > specifically namespaced such as those for network devices. This patch
> >> >> > doesn't change that at all. The commit message outlines this in detail
> >> >> > how this comes about.
> >> >> > There is only one case where this currently breaks and this is as I
> >> >> > outlined explicitly in my commit message when you create a new user
> >> >> > namespace and map container(0) -> host(0). This patch fixes this.
> >> >> 
> >> >> Could you please point the place, where non-initial user namespaces are filtered?
> >> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
> >> >> Now it will return 1 sometimes.
> >> >
> >> > Oh sure, it's in the commit message though. The callchain is
> >> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
> >> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
> >> > net/netlink/af_netlink.c:do_one_broadcast():
> >> >
> >> > This codepiece will check whether the openened socket holds
> >> > CAP_NET_BROADCAST in the user namespace of the target network namespace
> >> > which it won't because we don't have device namespaces and all devices
> >> > belong to the initial set of namespaces.
> >> >
> >> >         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >                              CAP_NET_BROADCAST))
> >> >         j       return;
> >> >
> >> 
> >> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
> >> on their socket and has had someone with the appropriate privileges
> >> assign a peerrnetid.
> >> 
> >> All of which is to say that file_ns_capable is not nearly as applicable
> >> as it might be, and if you can pass the other two checks I think it is
> >> pointless (because the peernet attributes are not generated for
> >> kobj_uevents) but valid to receive events from outside your network
> >> namespace.
> >> 
> >> 
> >> I might be missing something but I don't see anything excluding network
> >> namespaces owned by !init_user_ns excluded from the kobject_uevent
> >> logic.
> >> 
> >> The uevent_sock_list has one entry per network namespace. And
> >> kobject_uevent_net_broacast appears to walk each one.
> >
> > Yeah, it definitely does.
> >
> >> 
> >> I had a memory of filtering uevent messages and I had a memory
> >> that the netlink_has_listeners had a per network namespace effect.
> >> Neither seems true from my inspection of the code tonight.
> >
> > Yeah, I drew the same conclusion.
> >
> >> 
> >> If we are not filtering ordinary uevents at least at the user namespace
> >> level that does seem to be at least a nuisance.
> >
> > This patch would filter based on user namespace and bump it from "we're
> > accidently doing the right thing" to "we're doing the right on purpose"
> > and without accidently leaking uevents in some corner cases.
> > Using kobj_bcast_filter() for this seems like the right thing to do.
> > This sounds like you don't disagree with the approach I'm taking here?
> 
> How are we accidentally filtering?  If we can not describe how the code
> currently works to achieve something we don't understand it well enough
> to change it.

I absolutely agree and without doing a lot of semantics here I just
meant that so far this all worked on accident that doesn't presuppose
that we understand why it worked.

> 
> 
> 
> > On a sidenote, it also very much feels like we're leaking information if
> > we're not filtering based on user namespaces on purpose. Non-initial
> > user namespaces should not be able to receive information about device
> > attribues simply via netlink uevent messages. At least not *trivially*
> > [1] by opening a netlink uevent socket.
> 
> This is not at all about isolation.  Devices don't belong to user

Sure, but there's still a difference between an unprivileged host user
listening to uevents and user namespaces that they might have delegated.

> namespaces.  This is about it being useless/silly to send events
> to those sockets as the receiver could not do anything with the uevent.

Yes, indeed unless a sufficiently privileged process explicitly decides
that a given uevent should be sent.

> At a practical level there should be no receivers.  Plus performance
> issues.  At least my memory is that any unprivileged user on the system
> is allowed to listen to those events.

Any unprivileged user is allowed to listen to uevents if they have
net_broadcast in the user namespace the uevent socket was opened in;
unless I'm misreading.

> 
> At least at one point udev listening and reacting to events in every
> network namespace in containers was a significant slow down on the
> system.

Sure, because we're holding a global lock on the list of all uevent
sockets and then go on to walk all of mc_list for each of those sockets
when broadcasting to guarantee that uevents are correctly ordered:

	mutex_lock(&uevent_sock_mutex);
	/* we will send an event, so request a new sequence number */
	retval = add_uevent_var(env, "SEQNUM=%llu", (unsigned long long)++uevent_seqnum);
	if (retval) {
		mutex_unlock(&uevent_sock_mutex);
		goto exit;
	}
	retval = kobject_uevent_net_broadcast(kobj, env, action_string,
					      devpath);
	mutex_unlock(&uevent_sock_mutex);

which means that the time it takes for uevents to be sent out increases
drastically the more listeners you have in more network namespaces. Even
for just one network namespace you have quite a lot:

ss -e -f netlink -a | grep uevent

then just look at those that you can clearly associate with a single
program I typically get ~20 listeners.

> 
> The report was that adding netlink_has_listeners greatly sped up uevent
> sending.   That testing appears faulty.  I think the real gain was
> userspace not listening to uevents in every container.

Banal point but the truth is probably not userspace changes in general
but that most massive container deployments are app containers. So
there's no udevd.

> 
> Which means uevent injection for containers has some real technical
> challenges to overcome before it can be wide spread.
> 
> >> Christian can you dig a little deeper into this.  I have a feeling that
> >> there are some real efficiency improvements that we could make to
> >> kobject_uevent_net_broadcast if nothing else.
> >
> > Yeah, for sure! I already started doing this. This patch here is
> > basically a preparatory patch for that work which is on my todo.
> 
> Limiting the network namespaces on uevent_sock_list to just network
> namespaces owned by the initial user namespaces would be a lot better
> than your patch.

I had this idea as well but I decided against it because of namespace
tag carrying kobjects. The only kobjects that fall into this category
are network devices so I need to think through this again and more
thoroughly.

> 
> 
> At a practical level I am concerned what will happen when we stand up a
> uevent listener aka udev in say 100 or maybe 1000 unprivileged
> containers on a system.  History suggests the current data structures
> won't scale to the problem, and the fixes that were put in place kernel
> side only did not change anything.  It was only the userspace fixes
> that made a difference.
> 
> Which suggests either improving the isolation between network namespaces
> for netlink multicast sockets or putting:
> 
> 
> 	if (!net_eq(sock_net(sk), p->net)) {
> 		if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> 			return;
> 
> 		if (!peernet_has_id(sock_net(sk), p->net))
> 			return;
> 
> 		if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> 				     CAP_NET_BROADCAST))
> 			return;
> 	}
> 
> In a filter and having an appropriate filter so that
> netlink_broadcast_filtered only needs to be called once from
> kobject_uevent_net_broadcast.

Maybe, but I'd like to try and find a solution that let's us wipe the
uevent socket list.

> 
> It is more difficult but in practice I expect we have far more to gain
> by fixing the mc_list and the netlink_has_listeners function to be per
> network namespace basis.  Handling the NETLINK_F_LISTEN_ALL_NSID will
> be trickier in that case but the current semantics appear correct
> for that case.

Yeah, I'll dig into this.

> 
> But before we do anything we have to test and see if the assertion
> that we don't make it to netlink listeners in network namespaces
> that are outside of the init_user_ns is true.   If it is true we need to
> find the code that makes it true.

Yeah, I'm on this as well.

Thanks!
Christian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next] netns: filter uevents correctly
  2018-04-06 16:07               ` Christian Brauner
@ 2018-04-06 16:48                 ` Eric W. Biederman
  0 siblings, 0 replies; 26+ messages in thread
From: Eric W. Biederman @ 2018-04-06 16:48 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge

Christian Brauner <christian.brauner@canonical.com> writes:

>> At a practical level there should be no receivers.  Plus performance
>> issues.  At least my memory is that any unprivileged user on the system
>> is allowed to listen to those events.
>
> Any unprivileged user is allowed to listen to uevents if they have
> net_broadcast in the user namespace the uevent socket was opened in;
> unless I'm misreading.

I believe you are.

This code in do_one_broadcast.

	if (!net_eq(sock_net(sk), p->net)) {
		if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
			return;

		if (!peernet_has_id(sock_net(sk), p->net))
			return;

		if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
				     CAP_NET_BROADCAST))
			return;
	}

Used to just be:
	if (!net_eq(sock_net(sk), p->net))
        	return;

Which makes sense when you have a shared hash table and a shared mc_list
between network namespaces.

There is a non-container use of network namespaces where you just need
different contexts were ip addresses can overlap etc.  In that
configuration where a single program is mananging multiple network
namespaces being able to listen to rtnetlink events in all of them is an
advantage.

For that case a special socket option NETLINK_F_LISTEN_ALL_NSID was
added that allowed one socket to listen for events from multiple network
namespaces.

If we rework the code in af_netlink.c that matters.  However for just
understanding uevents you can assume there are no sockets with
NETLINK_F_LISTEN_ALL_NSID set.

Eric

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next] netns: filter uevents correctly
  2018-04-06  3:59         ` Eric W. Biederman
  2018-04-06 13:07           ` Christian Brauner
@ 2018-04-09 15:46           ` Christian Brauner
  2018-04-09 23:21             ` Eric W. Biederman
  1 sibling, 1 reply; 26+ messages in thread
From: Christian Brauner @ 2018-04-09 15:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge

On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
> 
> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
> >> On 05.04.2018 17:07, Christian Brauner wrote:
> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >> >>>
> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
> >> >>> Over time the set of uevents that get sent into all network namespaces has
> >> >>> shrunk. We have now reached the point where hotplug events for all devices
> >> >>> that carry a namespace tag are filtered according to that namespace.
> >> >>>
> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
> >> >>> does not match the namespace tag of the netlink socket. One example are
> >> >>> network devices. Uevents for network devices only show up in the network
> >> >>> namespaces these devices are moved to or created in.
> >> >>>
> >> >>> However, any uevent for a kobject that does not have a namespace tag
> >> >>> associated with it will not be filtered and we will *try* to broadcast it
> >> >>> into all network namespaces.
> >> >>>
> >> >>> The original patchset was written in 2010 before user namespaces were a
> >> >>> thing. With the introduction of user namespaces sending out uevents became
> >> >>> partially isolated as they were filtered by user namespaces:
> >> >>>
> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
> >> >>>
> >> >>> if (!net_eq(sock_net(sk), p->net)) {
> >> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >> >>>                 return;
> >> >>>
> >> >>>         if (!peernet_has_id(sock_net(sk), p->net))
> >> >>>                 return;
> >> >>>
> >> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >>>                              CAP_NET_BROADCAST))
> >> >>>         j       return;
> >> >>> }
> >> >>>
> >> >>> The file_ns_capable() check will check whether the caller had
> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> >> >>> namespace of interest. This check is fine in general but seems insufficient
> >> >>> to me when paired with uevents. The reason is that devices always belong to
> >> >>> the initial user namespace so uevents for kobjects that do not carry a
> >> >>> namespace tag should never be sent into another user namespace. This has
> >> >>> been the intention all along. But there's one case where this breaks,
> >> >>> namely if a new user namespace is created by root on the host and an
> >> >>> identity mapping is established between root on the host and root in the
> >> >>> new user namespace. Here's a reproducer:
> >> >>>
> >> >>>  sudo unshare -U --map-root
> >> >>>  udevadm monitor -k
> >> >>>  # Now change to initial user namespace and e.g. do
> >> >>>  modprobe kvm
> >> >>>  # or
> >> >>>  rmmod kvm
> >> >>>
> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
> >> >>> host. This seems very anecdotal given that in the general case user
> >> >>> namespaces do not see any uevents and also can't really do anything useful
> >> >>> with them.
> >> >>>
> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> >> >>> namespace of the network namespace of the netlink socket) userspace process
> >> >>> make a decision what uevents should be sent.
> >> >>>
> >> >>> This makes me think that we should simply ensure that uevents for kobjects
> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
> >> >>> in kobj_bcast_filter(). Specifically:
> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
> >> >>>   event will always be filtered.
> >> >>> - If the network namespace the uevent socket belongs to was created in the
> >> >>>   initial user namespace but was opened from a non-initial user namespace
> >> >>>   the event will be filtered as well.
> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
> >> >>> always only sent to the initial user namespace. The regression potential
> >> >>> for this is near to non-existent since user namespaces can't really do
> >> >>> anything with interesting devices.
> >> >>>
> >> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> >> >>> ---
> >> >>>  lib/kobject_uevent.c | 10 +++++++++-
> >> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
> >> >>>
> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644
> >> >>> --- a/lib/kobject_uevent.c
> >> >>> +++ b/lib/kobject_uevent.c
> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >> >>>  		return sock_ns != ns;
> >> >>>  	}
> >> >>>  
> >> >>> -	return 0;
> >> >>> +	/*
> >> >>> +	 * The kobject does not carry a namespace tag so filter by user
> >> >>> +	 * namespace below.
> >> >>> +	 */
> >> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
> >> >>> +		return 1;
> >> >>> +
> >> >>> +	/* Check if socket was opened from non-initial user namespace. */
> >> >>> +	return sk_user_ns(dsk) != &init_user_ns;
> >> >>>  }
> >> >>>  #endif
> >> >>
> >> >> So, this prohibits to listen events of all devices except network-related
> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
> >> > 
> >> > No, this is not correct: As it is right now *without my patch* no
> >> > non-initial user namespace is receiving *any uevents* but those
> >> > specifically namespaced such as those for network devices. This patch
> >> > doesn't change that at all. The commit message outlines this in detail
> >> > how this comes about.
> >> > There is only one case where this currently breaks and this is as I
> >> > outlined explicitly in my commit message when you create a new user
> >> > namespace and map container(0) -> host(0). This patch fixes this.
> >> 
> >> Could you please point the place, where non-initial user namespaces are filtered?
> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
> >> Now it will return 1 sometimes.
> >
> > Oh sure, it's in the commit message though. The callchain is
> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
> > net/netlink/af_netlink.c:do_one_broadcast():
> >
> > This codepiece will check whether the openened socket holds
> > CAP_NET_BROADCAST in the user namespace of the target network namespace
> > which it won't because we don't have device namespaces and all devices
> > belong to the initial set of namespaces.
> >
> >         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >                              CAP_NET_BROADCAST))
> >         j       return;
> >
> 
> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
> on their socket and has had someone with the appropriate privileges
> assign a peerrnetid.
> 
> All of which is to say that file_ns_capable is not nearly as applicable
> as it might be, and if you can pass the other two checks I think it is
> pointless (because the peernet attributes are not generated for
> kobj_uevents) but valid to receive events from outside your network
> namespace.
> 
> 
> I might be missing something but I don't see anything excluding network
> namespaces owned by !init_user_ns excluded from the kobject_uevent
> logic.
> 
> The uevent_sock_list has one entry per network namespace. And
> kobject_uevent_net_broacast appears to walk each one.
> 
> I had a memory of filtering uevent messages and I had a memory
> that the netlink_has_listeners had a per network namespace effect.
> Neither seems true from my inspection of the code tonight.
> 
> If we are not filtering ordinary uevents at least at the user namespace
> level that does seem to be at least a nuisance.
> 
> 
> Christian can you dig a little deeper into this.  I have a feeling that
> there are some real efficiency improvements that we could make to
> kobject_uevent_net_broadcast if nothing else.
> 
> Perhaps you could see where uevents are broadcast by poking
> the sysfs uevent of an existing device, and triggering a retransmit.

@Eric, so I did some intensive testing over the weekend and forget everything I
said before. Uevents are not filtered by the kernel at the moment. This is
currently - apart from network devices - a pure userspace thing. Specifically,
anyone  on the host can listen to all uevents from everywhere. It's neither
filtered by user nor by network namespace. The fact that I used

udevadm --debug monitor

to test my prior hypothesis was what led me to believe that uevents are already
correctly filtered.
Instead, what is actually happening is that every udev implementation out there
is discarding uevents that were send by uids != 0 in the CMSG_DATA.
Specifically,

- legacy standalone udevd:
  https://git.kernel.org/pub/scm/linux/hotplug/udev.git/snapshot/udev-062.tar.gz
- eudevd
  https://github.com/gentoo/eudev/blob/6f630d32bf494a457171b3f99e329592497bf271/src/libudev/libudev-monitor.c#L645
- systemd-udevd
  https://github.com/systemd/systemd/blob/e89ab7f219a399ab719c78cf43c07c0da60bd151/src/libudev/libudev-monitor.c#L656
- ueventd (Android)
  https://android.googlesource.com/platform/system/core.git/+/android-8.1.0_r22/libcutils/uevent.c#81

For all of those I could trace this behavior back to the first released
version. (To be precise, for legacy udevd that eventually became systemd-udevd
I could trace it back to the first version that is still available on
git.kernel.org which is 062. Since eudevd is a fork of systemd-udevd it is
trivially true that it has the same behavior from the beginning.)
Android filters uevents in the same way but removed that check on January 8
2018 for what I think is invalid reasoning. The good news is that this is only
in their master branch. It hasn't even made it into an rc release for Android 8
yet. I filed a bug against Android and offered them a fix if they agree.

In any case, userspace udev is not making use of uevents at all right now since
any uid != 0 events are **explicitly** discarded.
The fact that you receive uevents for

sudo unshare -U --map-root -n
udevadm --debug monitor

is simply explained by the fact that container(0) <=> host(0) at which point
the uid in CMSG_DATA will be 0 in the new user namespace and udev will not
discard it.
The use case for receiving uevents in containers/user namespaces is definitely
there but that's what the uevent injection patch series was for that we merged.
This is a much safer and saner solution.
The fact that all udev implementations filter uevents by uid != 0 very much
seems like a security mechanism in userspace that we probably should provide by
isolating uevents based on user and/or network namespaces.

Christian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next] netns: filter uevents correctly
  2018-04-09 15:46           ` Christian Brauner
@ 2018-04-09 23:21             ` Eric W. Biederman
  2018-04-10 14:35               ` Christian Brauner
  0 siblings, 1 reply; 26+ messages in thread
From: Eric W. Biederman @ 2018-04-09 23:21 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge

Christian Brauner <christian.brauner@canonical.com> writes:

> On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
>> Christian Brauner <christian.brauner@canonical.com> writes:
>> 
>> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
>> >> On 05.04.2018 17:07, Christian Brauner wrote:
>> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
>> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
>> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>> >> >>>
>> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
>> >> >>> Over time the set of uevents that get sent into all network namespaces has
>> >> >>> shrunk. We have now reached the point where hotplug events for all devices
>> >> >>> that carry a namespace tag are filtered according to that namespace.
>> >> >>>
>> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
>> >> >>> does not match the namespace tag of the netlink socket. One example are
>> >> >>> network devices. Uevents for network devices only show up in the network
>> >> >>> namespaces these devices are moved to or created in.
>> >> >>>
>> >> >>> However, any uevent for a kobject that does not have a namespace tag
>> >> >>> associated with it will not be filtered and we will *try* to broadcast it
>> >> >>> into all network namespaces.
>> >> >>>
>> >> >>> The original patchset was written in 2010 before user namespaces were a
>> >> >>> thing. With the introduction of user namespaces sending out uevents became
>> >> >>> partially isolated as they were filtered by user namespaces:
>> >> >>>
>> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
>> >> >>>
>> >> >>> if (!net_eq(sock_net(sk), p->net)) {
>> >> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>> >> >>>                 return;
>> >> >>>
>> >> >>>         if (!peernet_has_id(sock_net(sk), p->net))
>> >> >>>                 return;
>> >> >>>
>> >> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >> >>>                              CAP_NET_BROADCAST))
>> >> >>>         j       return;
>> >> >>> }
>> >> >>>
>> >> >>> The file_ns_capable() check will check whether the caller had
>> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
>> >> >>> namespace of interest. This check is fine in general but seems insufficient
>> >> >>> to me when paired with uevents. The reason is that devices always belong to
>> >> >>> the initial user namespace so uevents for kobjects that do not carry a
>> >> >>> namespace tag should never be sent into another user namespace. This has
>> >> >>> been the intention all along. But there's one case where this breaks,
>> >> >>> namely if a new user namespace is created by root on the host and an
>> >> >>> identity mapping is established between root on the host and root in the
>> >> >>> new user namespace. Here's a reproducer:
>> >> >>>
>> >> >>>  sudo unshare -U --map-root
>> >> >>>  udevadm monitor -k
>> >> >>>  # Now change to initial user namespace and e.g. do
>> >> >>>  modprobe kvm
>> >> >>>  # or
>> >> >>>  rmmod kvm
>> >> >>>
>> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
>> >> >>> host. This seems very anecdotal given that in the general case user
>> >> >>> namespaces do not see any uevents and also can't really do anything useful
>> >> >>> with them.
>> >> >>>
>> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
>> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
>> >> >>> namespace of the network namespace of the netlink socket) userspace process
>> >> >>> make a decision what uevents should be sent.
>> >> >>>
>> >> >>> This makes me think that we should simply ensure that uevents for kobjects
>> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
>> >> >>> in kobj_bcast_filter(). Specifically:
>> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
>> >> >>>   event will always be filtered.
>> >> >>> - If the network namespace the uevent socket belongs to was created in the
>> >> >>>   initial user namespace but was opened from a non-initial user namespace
>> >> >>>   the event will be filtered as well.
>> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
>> >> >>> always only sent to the initial user namespace. The regression potential
>> >> >>> for this is near to non-existent since user namespaces can't really do
>> >> >>> anything with interesting devices.
>> >> >>>
>> >> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
>> >> >>> ---
>> >> >>>  lib/kobject_uevent.c | 10 +++++++++-
>> >> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
>> >> >>>
>> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644
>> >> >>> --- a/lib/kobject_uevent.c
>> >> >>> +++ b/lib/kobject_uevent.c
>> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>> >> >>>  		return sock_ns != ns;
>> >> >>>  	}
>> >> >>>  
>> >> >>> -	return 0;
>> >> >>> +	/*
>> >> >>> +	 * The kobject does not carry a namespace tag so filter by user
>> >> >>> +	 * namespace below.
>> >> >>> +	 */
>> >> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
>> >> >>> +		return 1;
>> >> >>> +
>> >> >>> +	/* Check if socket was opened from non-initial user namespace. */
>> >> >>> +	return sk_user_ns(dsk) != &init_user_ns;
>> >> >>>  }
>> >> >>>  #endif
>> >> >>
>> >> >> So, this prohibits to listen events of all devices except network-related
>> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
>> >> > 
>> >> > No, this is not correct: As it is right now *without my patch* no
>> >> > non-initial user namespace is receiving *any uevents* but those
>> >> > specifically namespaced such as those for network devices. This patch
>> >> > doesn't change that at all. The commit message outlines this in detail
>> >> > how this comes about.
>> >> > There is only one case where this currently breaks and this is as I
>> >> > outlined explicitly in my commit message when you create a new user
>> >> > namespace and map container(0) -> host(0). This patch fixes this.
>> >> 
>> >> Could you please point the place, where non-initial user namespaces are filtered?
>> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
>> >> Now it will return 1 sometimes.
>> >
>> > Oh sure, it's in the commit message though. The callchain is
>> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
>> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
>> > net/netlink/af_netlink.c:do_one_broadcast():
>> >
>> > This codepiece will check whether the openened socket holds
>> > CAP_NET_BROADCAST in the user namespace of the target network namespace
>> > which it won't because we don't have device namespaces and all devices
>> > belong to the initial set of namespaces.
>> >
>> >         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >                              CAP_NET_BROADCAST))
>> >         j       return;
>> >
>> 
>> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
>> on their socket and has had someone with the appropriate privileges
>> assign a peerrnetid.
>> 
>> All of which is to say that file_ns_capable is not nearly as applicable
>> as it might be, and if you can pass the other two checks I think it is
>> pointless (because the peernet attributes are not generated for
>> kobj_uevents) but valid to receive events from outside your network
>> namespace.
>> 
>> 
>> I might be missing something but I don't see anything excluding network
>> namespaces owned by !init_user_ns excluded from the kobject_uevent
>> logic.
>> 
>> The uevent_sock_list has one entry per network namespace. And
>> kobject_uevent_net_broacast appears to walk each one.
>> 
>> I had a memory of filtering uevent messages and I had a memory
>> that the netlink_has_listeners had a per network namespace effect.
>> Neither seems true from my inspection of the code tonight.
>> 
>> If we are not filtering ordinary uevents at least at the user namespace
>> level that does seem to be at least a nuisance.
>> 
>> 
>> Christian can you dig a little deeper into this.  I have a feeling that
>> there are some real efficiency improvements that we could make to
>> kobject_uevent_net_broadcast if nothing else.
>> 
>> Perhaps you could see where uevents are broadcast by poking
>> the sysfs uevent of an existing device, and triggering a retransmit.
>
> @Eric, so I did some intensive testing over the weekend and forget everything I
> said before. Uevents are not filtered by the kernel at the moment. This is
> currently - apart from network devices - a pure userspace thing. Specifically,
> anyone  on the host can listen to all uevents from everywhere. It's neither
> filtered by user nor by network namespace. The fact that I used
>
> udevadm --debug monitor
>
> to test my prior hypothesis was what led me to believe that uevents are already
> correctly filtered.
> Instead, what is actually happening is that every udev implementation out there
> is discarding uevents that were send by uids != 0 in the CMSG_DATA.
> Specifically,

Yes.  I remember something of the sort.  I believe udev also filters to
ensure that the netlink port id == 0 to ensure the message came from
the kernel and was not spoofed in any way.

> - legacy standalone udevd:
>   https://git.kernel.org/pub/scm/linux/hotplug/udev.git/snapshot/udev-062.tar.gz
> - eudevd
>   https://github.com/gentoo/eudev/blob/6f630d32bf494a457171b3f99e329592497bf271/src/libudev/libudev-monitor.c#L645
> - systemd-udevd
>   https://github.com/systemd/systemd/blob/e89ab7f219a399ab719c78cf43c07c0da60bd151/src/libudev/libudev-monitor.c#L656
> - ueventd (Android)
>   https://android.googlesource.com/platform/system/core.git/+/android-8.1.0_r22/libcutils/uevent.c#81
>
> For all of those I could trace this behavior back to the first released
> version. (To be precise, for legacy udevd that eventually became systemd-udevd
> I could trace it back to the first version that is still available on
> git.kernel.org which is 062. Since eudevd is a fork of systemd-udevd it is
> trivially true that it has the same behavior from the beginning.)
> Android filters uevents in the same way but removed that check on January 8
> 2018 for what I think is invalid reasoning. The good news is that this is only
> in their master branch. It hasn't even made it into an rc release for Android 8
> yet. I filed a bug against Android and offered them a fix if they agree.
>
> In any case, userspace udev is not making use of uevents at all right now since
> any uid != 0 events are **explicitly** discarded.
> The fact that you receive uevents for
>
> sudo unshare -U --map-root -n
> udevadm --debug monitor
>
> is simply explained by the fact that container(0) <=> host(0) at which point
> the uid in CMSG_DATA will be 0 in the new user namespace and udev will not
> discard it.
> The use case for receiving uevents in containers/user namespaces is definitely
> there but that's what the uevent injection patch series was for that we merged.
> This is a much safer and saner solution.
> The fact that all udev implementations filter uevents by uid != 0 very much
> seems like a security mechanism in userspace that we probably should provide by
> isolating uevents based on user and/or network namespaces.

So in summary.  Uevents are filtered in a user namespace (by userspace)
because the received uid != 0.  It instead == 65534 == "nobody" because
the global root uid is not mapped.

Which means that we can modify the kernel to not send to all network
namespaces whose user_ns != &init_user_ns because we know that userspace
will ignore the message because of the uid anyway.  Which means when
net-next reopens you can send that patch.  But please base it on just
not including network namespaces in the list, as that is much more
efficient than adding more conditions to the filter.

Thank you for tracking down what is going on.

Eric

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next] netns: filter uevents correctly
  2018-04-09 23:21             ` Eric W. Biederman
@ 2018-04-10 14:35               ` Christian Brauner
  2018-04-10 15:04                 ` Eric W. Biederman
  0 siblings, 1 reply; 26+ messages in thread
From: Christian Brauner @ 2018-04-10 14:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge

On Mon, Apr 09, 2018 at 06:21:31PM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
> 
> > On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> 
> >> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
> >> >> On 05.04.2018 17:07, Christian Brauner wrote:
> >> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
> >> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
> >> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >> >> >>>
> >> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
> >> >> >>> Over time the set of uevents that get sent into all network namespaces has
> >> >> >>> shrunk. We have now reached the point where hotplug events for all devices
> >> >> >>> that carry a namespace tag are filtered according to that namespace.
> >> >> >>>
> >> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
> >> >> >>> does not match the namespace tag of the netlink socket. One example are
> >> >> >>> network devices. Uevents for network devices only show up in the network
> >> >> >>> namespaces these devices are moved to or created in.
> >> >> >>>
> >> >> >>> However, any uevent for a kobject that does not have a namespace tag
> >> >> >>> associated with it will not be filtered and we will *try* to broadcast it
> >> >> >>> into all network namespaces.
> >> >> >>>
> >> >> >>> The original patchset was written in 2010 before user namespaces were a
> >> >> >>> thing. With the introduction of user namespaces sending out uevents became
> >> >> >>> partially isolated as they were filtered by user namespaces:
> >> >> >>>
> >> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
> >> >> >>>
> >> >> >>> if (!net_eq(sock_net(sk), p->net)) {
> >> >> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >> >> >>>                 return;
> >> >> >>>
> >> >> >>>         if (!peernet_has_id(sock_net(sk), p->net))
> >> >> >>>                 return;
> >> >> >>>
> >> >> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >> >>>                              CAP_NET_BROADCAST))
> >> >> >>>         j       return;
> >> >> >>> }
> >> >> >>>
> >> >> >>> The file_ns_capable() check will check whether the caller had
> >> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> >> >> >>> namespace of interest. This check is fine in general but seems insufficient
> >> >> >>> to me when paired with uevents. The reason is that devices always belong to
> >> >> >>> the initial user namespace so uevents for kobjects that do not carry a
> >> >> >>> namespace tag should never be sent into another user namespace. This has
> >> >> >>> been the intention all along. But there's one case where this breaks,
> >> >> >>> namely if a new user namespace is created by root on the host and an
> >> >> >>> identity mapping is established between root on the host and root in the
> >> >> >>> new user namespace. Here's a reproducer:
> >> >> >>>
> >> >> >>>  sudo unshare -U --map-root
> >> >> >>>  udevadm monitor -k
> >> >> >>>  # Now change to initial user namespace and e.g. do
> >> >> >>>  modprobe kvm
> >> >> >>>  # or
> >> >> >>>  rmmod kvm
> >> >> >>>
> >> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
> >> >> >>> host. This seems very anecdotal given that in the general case user
> >> >> >>> namespaces do not see any uevents and also can't really do anything useful
> >> >> >>> with them.
> >> >> >>>
> >> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
> >> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> >> >> >>> namespace of the network namespace of the netlink socket) userspace process
> >> >> >>> make a decision what uevents should be sent.
> >> >> >>>
> >> >> >>> This makes me think that we should simply ensure that uevents for kobjects
> >> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
> >> >> >>> in kobj_bcast_filter(). Specifically:
> >> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
> >> >> >>>   event will always be filtered.
> >> >> >>> - If the network namespace the uevent socket belongs to was created in the
> >> >> >>>   initial user namespace but was opened from a non-initial user namespace
> >> >> >>>   the event will be filtered as well.
> >> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
> >> >> >>> always only sent to the initial user namespace. The regression potential
> >> >> >>> for this is near to non-existent since user namespaces can't really do
> >> >> >>> anything with interesting devices.
> >> >> >>>
> >> >> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> >> >> >>> ---
> >> >> >>>  lib/kobject_uevent.c | 10 +++++++++-
> >> >> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
> >> >> >>>
> >> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644
> >> >> >>> --- a/lib/kobject_uevent.c
> >> >> >>> +++ b/lib/kobject_uevent.c
> >> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >> >> >>>  		return sock_ns != ns;
> >> >> >>>  	}
> >> >> >>>  
> >> >> >>> -	return 0;
> >> >> >>> +	/*
> >> >> >>> +	 * The kobject does not carry a namespace tag so filter by user
> >> >> >>> +	 * namespace below.
> >> >> >>> +	 */
> >> >> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
> >> >> >>> +		return 1;
> >> >> >>> +
> >> >> >>> +	/* Check if socket was opened from non-initial user namespace. */
> >> >> >>> +	return sk_user_ns(dsk) != &init_user_ns;
> >> >> >>>  }
> >> >> >>>  #endif
> >> >> >>
> >> >> >> So, this prohibits to listen events of all devices except network-related
> >> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
> >> >> > 
> >> >> > No, this is not correct: As it is right now *without my patch* no
> >> >> > non-initial user namespace is receiving *any uevents* but those
> >> >> > specifically namespaced such as those for network devices. This patch
> >> >> > doesn't change that at all. The commit message outlines this in detail
> >> >> > how this comes about.
> >> >> > There is only one case where this currently breaks and this is as I
> >> >> > outlined explicitly in my commit message when you create a new user
> >> >> > namespace and map container(0) -> host(0). This patch fixes this.
> >> >> 
> >> >> Could you please point the place, where non-initial user namespaces are filtered?
> >> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
> >> >> Now it will return 1 sometimes.
> >> >
> >> > Oh sure, it's in the commit message though. The callchain is
> >> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
> >> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
> >> > net/netlink/af_netlink.c:do_one_broadcast():
> >> >
> >> > This codepiece will check whether the openened socket holds
> >> > CAP_NET_BROADCAST in the user namespace of the target network namespace
> >> > which it won't because we don't have device namespaces and all devices
> >> > belong to the initial set of namespaces.
> >> >
> >> >         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >                              CAP_NET_BROADCAST))
> >> >         j       return;
> >> >
> >> 
> >> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
> >> on their socket and has had someone with the appropriate privileges
> >> assign a peerrnetid.
> >> 
> >> All of which is to say that file_ns_capable is not nearly as applicable
> >> as it might be, and if you can pass the other two checks I think it is
> >> pointless (because the peernet attributes are not generated for
> >> kobj_uevents) but valid to receive events from outside your network
> >> namespace.
> >> 
> >> 
> >> I might be missing something but I don't see anything excluding network
> >> namespaces owned by !init_user_ns excluded from the kobject_uevent
> >> logic.
> >> 
> >> The uevent_sock_list has one entry per network namespace. And
> >> kobject_uevent_net_broacast appears to walk each one.
> >> 
> >> I had a memory of filtering uevent messages and I had a memory
> >> that the netlink_has_listeners had a per network namespace effect.
> >> Neither seems true from my inspection of the code tonight.
> >> 
> >> If we are not filtering ordinary uevents at least at the user namespace
> >> level that does seem to be at least a nuisance.
> >> 
> >> 
> >> Christian can you dig a little deeper into this.  I have a feeling that
> >> there are some real efficiency improvements that we could make to
> >> kobject_uevent_net_broadcast if nothing else.
> >> 
> >> Perhaps you could see where uevents are broadcast by poking
> >> the sysfs uevent of an existing device, and triggering a retransmit.
> >
> > @Eric, so I did some intensive testing over the weekend and forget everything I
> > said before. Uevents are not filtered by the kernel at the moment. This is
> > currently - apart from network devices - a pure userspace thing. Specifically,
> > anyone  on the host can listen to all uevents from everywhere. It's neither
> > filtered by user nor by network namespace. The fact that I used
> >
> > udevadm --debug monitor
> >
> > to test my prior hypothesis was what led me to believe that uevents are already
> > correctly filtered.
> > Instead, what is actually happening is that every udev implementation out there
> > is discarding uevents that were send by uids != 0 in the CMSG_DATA.
> > Specifically,
> 
> Yes.  I remember something of the sort.  I believe udev also filters to
> ensure that the netlink port id == 0 to ensure the message came from
> the kernel and was not spoofed in any way.
> 
> > - legacy standalone udevd:
> >   https://git.kernel.org/pub/scm/linux/hotplug/udev.git/snapshot/udev-062.tar.gz
> > - eudevd
> >   https://github.com/gentoo/eudev/blob/6f630d32bf494a457171b3f99e329592497bf271/src/libudev/libudev-monitor.c#L645
> > - systemd-udevd
> >   https://github.com/systemd/systemd/blob/e89ab7f219a399ab719c78cf43c07c0da60bd151/src/libudev/libudev-monitor.c#L656
> > - ueventd (Android)
> >   https://android.googlesource.com/platform/system/core.git/+/android-8.1.0_r22/libcutils/uevent.c#81
> >
> > For all of those I could trace this behavior back to the first released
> > version. (To be precise, for legacy udevd that eventually became systemd-udevd
> > I could trace it back to the first version that is still available on
> > git.kernel.org which is 062. Since eudevd is a fork of systemd-udevd it is
> > trivially true that it has the same behavior from the beginning.)
> > Android filters uevents in the same way but removed that check on January 8
> > 2018 for what I think is invalid reasoning. The good news is that this is only
> > in their master branch. It hasn't even made it into an rc release for Android 8
> > yet. I filed a bug against Android and offered them a fix if they agree.
> >
> > In any case, userspace udev is not making use of uevents at all right now since
> > any uid != 0 events are **explicitly** discarded.
> > The fact that you receive uevents for
> >
> > sudo unshare -U --map-root -n
> > udevadm --debug monitor
> >
> > is simply explained by the fact that container(0) <=> host(0) at which point
> > the uid in CMSG_DATA will be 0 in the new user namespace and udev will not
> > discard it.
> > The use case for receiving uevents in containers/user namespaces is definitely
> > there but that's what the uevent injection patch series was for that we merged.
> > This is a much safer and saner solution.
> > The fact that all udev implementations filter uevents by uid != 0 very much
> > seems like a security mechanism in userspace that we probably should provide by
> > isolating uevents based on user and/or network namespaces.
> 
> So in summary.  Uevents are filtered in a user namespace (by userspace)
> because the received uid != 0.  It instead == 65534 == "nobody" because
> the global root uid is not mapped.

Exactly.

> 
> Which means that we can modify the kernel to not send to all network
> namespaces whose user_ns != &init_user_ns because we know that userspace
> will ignore the message because of the uid anyway.  Which means when

Yes.

> net-next reopens you can send that patch.  But please base it on just
> not including network namespaces in the list, as that is much more
> efficient than adding more conditions to the filter.

I'll send a patch out once net-next reopens. I'll also make sure to
inform all udev userspace implementations of the change. It won't affect
them but it is nice for them to know that they're safer now.

Something like this (Proper commit message and so on will be added once
I sent this out.):

>From 68d3c27435520cd600874999b6d9d17572854a7a Mon Sep 17 00:00:00 2001
From: Christian Brauner <christian.brauner@ubuntu.com>
Date: Tue, 10 Apr 2018 11:56:49 +0200
Subject: [PATCH] netns: restrict uevents to initial user namespace

/* Here'll be a useful commit message describing in detail what's
 * happening once I sent it to net-next. */

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 lib/kobject_uevent.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 15ea216a67ce..22a2c1a98b8f 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -703,9 +703,16 @@ static int uevent_net_init(struct net *net)
 
 	net->uevent_sock = ue_sk;
 
-	mutex_lock(&uevent_sock_mutex);
-	list_add_tail(&ue_sk->list, &uevent_sock_list);
-	mutex_unlock(&uevent_sock_mutex);
+	/*
+	 * Only sent uevents to uevent sockets that are located in network
+	 * namespaces owned by the initial user namespace.
+	 */
+	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
+		mutex_lock(&uevent_sock_mutex);
+		list_add_tail(&ue_sk->list, &uevent_sock_list);
+		mutex_unlock(&uevent_sock_mutex);
+	}
+
 	return 0;
 }
 
@@ -713,9 +720,11 @@ static void uevent_net_exit(struct net *net)
 {
 	struct uevent_sock *ue_sk = net->uevent_sock;
 
-	mutex_lock(&uevent_sock_mutex);
-	list_del(&ue_sk->list);
-	mutex_unlock(&uevent_sock_mutex);
+	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
+		mutex_lock(&uevent_sock_mutex);
+		list_del(&ue_sk->list);
+		mutex_unlock(&uevent_sock_mutex);
+	}
 
 	netlink_kernel_release(ue_sk->sk);
 	kfree(ue_sk);
-- 
2.15.1

> 
> Thank you for tracking down what is going on.

Sure! Thanks for properly poking at this.

Christian

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next] netns: filter uevents correctly
  2018-04-10 14:35               ` Christian Brauner
@ 2018-04-10 15:04                 ` Eric W. Biederman
  2018-04-11  9:09                   ` Christian Brauner
  0 siblings, 1 reply; 26+ messages in thread
From: Eric W. Biederman @ 2018-04-10 15:04 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge

Christian Brauner <christian.brauner@canonical.com> writes:

> On Mon, Apr 09, 2018 at 06:21:31PM -0500, Eric W. Biederman wrote:
>> Christian Brauner <christian.brauner@canonical.com> writes:
>> 
>> > On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
>> >> Christian Brauner <christian.brauner@canonical.com> writes:
>> >> 
>> >> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
>> >> >> On 05.04.2018 17:07, Christian Brauner wrote:
>> >> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
>> >> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
>> >> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>> >> >> >>>
>> >> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
>> >> >> >>> Over time the set of uevents that get sent into all network namespaces has
>> >> >> >>> shrunk. We have now reached the point where hotplug events for all devices
>> >> >> >>> that carry a namespace tag are filtered according to that namespace.
>> >> >> >>>
>> >> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
>> >> >> >>> does not match the namespace tag of the netlink socket. One example are
>> >> >> >>> network devices. Uevents for network devices only show up in the network
>> >> >> >>> namespaces these devices are moved to or created in.
>> >> >> >>>
>> >> >> >>> However, any uevent for a kobject that does not have a namespace tag
>> >> >> >>> associated with it will not be filtered and we will *try* to broadcast it
>> >> >> >>> into all network namespaces.
>> >> >> >>>
>> >> >> >>> The original patchset was written in 2010 before user namespaces were a
>> >> >> >>> thing. With the introduction of user namespaces sending out uevents became
>> >> >> >>> partially isolated as they were filtered by user namespaces:
>> >> >> >>>
>> >> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
>> >> >> >>>
>> >> >> >>> if (!net_eq(sock_net(sk), p->net)) {
>> >> >> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>> >> >> >>>                 return;
>> >> >> >>>
>> >> >> >>>         if (!peernet_has_id(sock_net(sk), p->net))
>> >> >> >>>                 return;
>> >> >> >>>
>> >> >> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >> >> >>>                              CAP_NET_BROADCAST))
>> >> >> >>>         j       return;
>> >> >> >>> }
>> >> >> >>>
>> >> >> >>> The file_ns_capable() check will check whether the caller had
>> >> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
>> >> >> >>> namespace of interest. This check is fine in general but seems insufficient
>> >> >> >>> to me when paired with uevents. The reason is that devices always belong to
>> >> >> >>> the initial user namespace so uevents for kobjects that do not carry a
>> >> >> >>> namespace tag should never be sent into another user namespace. This has
>> >> >> >>> been the intention all along. But there's one case where this breaks,
>> >> >> >>> namely if a new user namespace is created by root on the host and an
>> >> >> >>> identity mapping is established between root on the host and root in the
>> >> >> >>> new user namespace. Here's a reproducer:
>> >> >> >>>
>> >> >> >>>  sudo unshare -U --map-root
>> >> >> >>>  udevadm monitor -k
>> >> >> >>>  # Now change to initial user namespace and e.g. do
>> >> >> >>>  modprobe kvm
>> >> >> >>>  # or
>> >> >> >>>  rmmod kvm
>> >> >> >>>
>> >> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
>> >> >> >>> host. This seems very anecdotal given that in the general case user
>> >> >> >>> namespaces do not see any uevents and also can't really do anything useful
>> >> >> >>> with them.
>> >> >> >>>
>> >> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
>> >> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
>> >> >> >>> namespace of the network namespace of the netlink socket) userspace process
>> >> >> >>> make a decision what uevents should be sent.
>> >> >> >>>
>> >> >> >>> This makes me think that we should simply ensure that uevents for kobjects
>> >> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
>> >> >> >>> in kobj_bcast_filter(). Specifically:
>> >> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
>> >> >> >>>   event will always be filtered.
>> >> >> >>> - If the network namespace the uevent socket belongs to was created in the
>> >> >> >>>   initial user namespace but was opened from a non-initial user namespace
>> >> >> >>>   the event will be filtered as well.
>> >> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
>> >> >> >>> always only sent to the initial user namespace. The regression potential
>> >> >> >>> for this is near to non-existent since user namespaces can't really do
>> >> >> >>> anything with interesting devices.
>> >> >> >>>
>> >> >> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
>> >> >> >>> ---
>> >> >> >>>  lib/kobject_uevent.c | 10 +++++++++-
>> >> >> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
>> >> >> >>>
>> >> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> >> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644
>> >> >> >>> --- a/lib/kobject_uevent.c
>> >> >> >>> +++ b/lib/kobject_uevent.c
>> >> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>> >> >> >>>  		return sock_ns != ns;
>> >> >> >>>  	}
>> >> >> >>>  
>> >> >> >>> -	return 0;
>> >> >> >>> +	/*
>> >> >> >>> +	 * The kobject does not carry a namespace tag so filter by user
>> >> >> >>> +	 * namespace below.
>> >> >> >>> +	 */
>> >> >> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
>> >> >> >>> +		return 1;
>> >> >> >>> +
>> >> >> >>> +	/* Check if socket was opened from non-initial user namespace. */
>> >> >> >>> +	return sk_user_ns(dsk) != &init_user_ns;
>> >> >> >>>  }
>> >> >> >>>  #endif
>> >> >> >>
>> >> >> >> So, this prohibits to listen events of all devices except network-related
>> >> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
>> >> >> > 
>> >> >> > No, this is not correct: As it is right now *without my patch* no
>> >> >> > non-initial user namespace is receiving *any uevents* but those
>> >> >> > specifically namespaced such as those for network devices. This patch
>> >> >> > doesn't change that at all. The commit message outlines this in detail
>> >> >> > how this comes about.
>> >> >> > There is only one case where this currently breaks and this is as I
>> >> >> > outlined explicitly in my commit message when you create a new user
>> >> >> > namespace and map container(0) -> host(0). This patch fixes this.
>> >> >> 
>> >> >> Could you please point the place, where non-initial user namespaces are filtered?
>> >> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
>> >> >> Now it will return 1 sometimes.
>> >> >
>> >> > Oh sure, it's in the commit message though. The callchain is
>> >> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
>> >> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
>> >> > net/netlink/af_netlink.c:do_one_broadcast():
>> >> >
>> >> > This codepiece will check whether the openened socket holds
>> >> > CAP_NET_BROADCAST in the user namespace of the target network namespace
>> >> > which it won't because we don't have device namespaces and all devices
>> >> > belong to the initial set of namespaces.
>> >> >
>> >> >         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >> >                              CAP_NET_BROADCAST))
>> >> >         j       return;
>> >> >
>> >> 
>> >> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
>> >> on their socket and has had someone with the appropriate privileges
>> >> assign a peerrnetid.
>> >> 
>> >> All of which is to say that file_ns_capable is not nearly as applicable
>> >> as it might be, and if you can pass the other two checks I think it is
>> >> pointless (because the peernet attributes are not generated for
>> >> kobj_uevents) but valid to receive events from outside your network
>> >> namespace.
>> >> 
>> >> 
>> >> I might be missing something but I don't see anything excluding network
>> >> namespaces owned by !init_user_ns excluded from the kobject_uevent
>> >> logic.
>> >> 
>> >> The uevent_sock_list has one entry per network namespace. And
>> >> kobject_uevent_net_broacast appears to walk each one.
>> >> 
>> >> I had a memory of filtering uevent messages and I had a memory
>> >> that the netlink_has_listeners had a per network namespace effect.
>> >> Neither seems true from my inspection of the code tonight.
>> >> 
>> >> If we are not filtering ordinary uevents at least at the user namespace
>> >> level that does seem to be at least a nuisance.
>> >> 
>> >> 
>> >> Christian can you dig a little deeper into this.  I have a feeling that
>> >> there are some real efficiency improvements that we could make to
>> >> kobject_uevent_net_broadcast if nothing else.
>> >> 
>> >> Perhaps you could see where uevents are broadcast by poking
>> >> the sysfs uevent of an existing device, and triggering a retransmit.
>> >
>> > @Eric, so I did some intensive testing over the weekend and forget everything I
>> > said before. Uevents are not filtered by the kernel at the moment. This is
>> > currently - apart from network devices - a pure userspace thing. Specifically,
>> > anyone  on the host can listen to all uevents from everywhere. It's neither
>> > filtered by user nor by network namespace. The fact that I used
>> >
>> > udevadm --debug monitor
>> >
>> > to test my prior hypothesis was what led me to believe that uevents are already
>> > correctly filtered.
>> > Instead, what is actually happening is that every udev implementation out there
>> > is discarding uevents that were send by uids != 0 in the CMSG_DATA.
>> > Specifically,
>> 
>> Yes.  I remember something of the sort.  I believe udev also filters to
>> ensure that the netlink port id == 0 to ensure the message came from
>> the kernel and was not spoofed in any way.
>> 
>> > - legacy standalone udevd:
>> >   https://git.kernel.org/pub/scm/linux/hotplug/udev.git/snapshot/udev-062.tar.gz
>> > - eudevd
>> >   https://github.com/gentoo/eudev/blob/6f630d32bf494a457171b3f99e329592497bf271/src/libudev/libudev-monitor.c#L645
>> > - systemd-udevd
>> >   https://github.com/systemd/systemd/blob/e89ab7f219a399ab719c78cf43c07c0da60bd151/src/libudev/libudev-monitor.c#L656
>> > - ueventd (Android)
>> >   https://android.googlesource.com/platform/system/core.git/+/android-8.1.0_r22/libcutils/uevent.c#81
>> >
>> > For all of those I could trace this behavior back to the first released
>> > version. (To be precise, for legacy udevd that eventually became systemd-udevd
>> > I could trace it back to the first version that is still available on
>> > git.kernel.org which is 062. Since eudevd is a fork of systemd-udevd it is
>> > trivially true that it has the same behavior from the beginning.)
>> > Android filters uevents in the same way but removed that check on January 8
>> > 2018 for what I think is invalid reasoning. The good news is that this is only
>> > in their master branch. It hasn't even made it into an rc release for Android 8
>> > yet. I filed a bug against Android and offered them a fix if they agree.
>> >
>> > In any case, userspace udev is not making use of uevents at all right now since
>> > any uid != 0 events are **explicitly** discarded.
>> > The fact that you receive uevents for
>> >
>> > sudo unshare -U --map-root -n
>> > udevadm --debug monitor
>> >
>> > is simply explained by the fact that container(0) <=> host(0) at which point
>> > the uid in CMSG_DATA will be 0 in the new user namespace and udev will not
>> > discard it.
>> > The use case for receiving uevents in containers/user namespaces is definitely
>> > there but that's what the uevent injection patch series was for that we merged.
>> > This is a much safer and saner solution.
>> > The fact that all udev implementations filter uevents by uid != 0 very much
>> > seems like a security mechanism in userspace that we probably should provide by
>> > isolating uevents based on user and/or network namespaces.
>> 
>> So in summary.  Uevents are filtered in a user namespace (by userspace)
>> because the received uid != 0.  It instead == 65534 == "nobody" because
>> the global root uid is not mapped.
>
> Exactly.
>
>> 
>> Which means that we can modify the kernel to not send to all network
>> namespaces whose user_ns != &init_user_ns because we know that userspace
>> will ignore the message because of the uid anyway.  Which means when
>
> Yes.
>
>> net-next reopens you can send that patch.  But please base it on just
>> not including network namespaces in the list, as that is much more
>> efficient than adding more conditions to the filter.
>
> I'll send a patch out once net-next reopens. I'll also make sure to
> inform all udev userspace implementations of the change. It won't affect
> them but it is nice for them to know that they're safer now.

The real danger is in a user namespace or in a container really is too
many daemons responding to events will generate a thundering hurd of
activity when there is really nothing to do.

> Something like this (Proper commit message and so on will be added once
> I sent this out.):

Exactly.

I would make the comment say something like: "ignore all but the initial
user namespace".

Eric


> From 68d3c27435520cd600874999b6d9d17572854a7a Mon Sep 17 00:00:00 2001
> From: Christian Brauner <christian.brauner@ubuntu.com>
> Date: Tue, 10 Apr 2018 11:56:49 +0200
> Subject: [PATCH] netns: restrict uevents to initial user namespace
>
> /* Here'll be a useful commit message describing in detail what's
>  * happening once I sent it to net-next. */
>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---
>  lib/kobject_uevent.c | 21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> index 15ea216a67ce..22a2c1a98b8f 100644
> --- a/lib/kobject_uevent.c
> +++ b/lib/kobject_uevent.c
> @@ -703,9 +703,16 @@ static int uevent_net_init(struct net *net)
>  
>  	net->uevent_sock = ue_sk;
>  
> -	mutex_lock(&uevent_sock_mutex);
> -	list_add_tail(&ue_sk->list, &uevent_sock_list);
> -	mutex_unlock(&uevent_sock_mutex);
> +	/*
> +	 * Only sent uevents to uevent sockets that are located in network
> +	 * namespaces owned by the initial user namespace.
> +	 */
> +	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> +		mutex_lock(&uevent_sock_mutex);
> +		list_add_tail(&ue_sk->list, &uevent_sock_list);
> +		mutex_unlock(&uevent_sock_mutex);
> +	}
> +
>  	return 0;
>  }
>  
> @@ -713,9 +720,11 @@ static void uevent_net_exit(struct net *net)
>  {
>  	struct uevent_sock *ue_sk = net->uevent_sock;
>  
> -	mutex_lock(&uevent_sock_mutex);
> -	list_del(&ue_sk->list);
> -	mutex_unlock(&uevent_sock_mutex);
> +	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> +		mutex_lock(&uevent_sock_mutex);
> +		list_del(&ue_sk->list);
> +		mutex_unlock(&uevent_sock_mutex);
> +	}
>  
>  	netlink_kernel_release(ue_sk->sk);
>  	kfree(ue_sk);

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next] netns: filter uevents correctly
  2018-04-10 15:04                 ` Eric W. Biederman
@ 2018-04-11  9:09                   ` Christian Brauner
  2018-04-11 16:40                     ` Eric W. Biederman
  0 siblings, 1 reply; 26+ messages in thread
From: Christian Brauner @ 2018-04-11  9:09 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge

On Tue, Apr 10, 2018 at 10:04:46AM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
> 
> > On Mon, Apr 09, 2018 at 06:21:31PM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> 
> >> > On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
> >> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> >> 
> >> >> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
> >> >> >> On 05.04.2018 17:07, Christian Brauner wrote:
> >> >> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
> >> >> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
> >> >> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >> >> >> >>>
> >> >> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
> >> >> >> >>> Over time the set of uevents that get sent into all network namespaces has
> >> >> >> >>> shrunk. We have now reached the point where hotplug events for all devices
> >> >> >> >>> that carry a namespace tag are filtered according to that namespace.
> >> >> >> >>>
> >> >> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
> >> >> >> >>> does not match the namespace tag of the netlink socket. One example are
> >> >> >> >>> network devices. Uevents for network devices only show up in the network
> >> >> >> >>> namespaces these devices are moved to or created in.
> >> >> >> >>>
> >> >> >> >>> However, any uevent for a kobject that does not have a namespace tag
> >> >> >> >>> associated with it will not be filtered and we will *try* to broadcast it
> >> >> >> >>> into all network namespaces.
> >> >> >> >>>
> >> >> >> >>> The original patchset was written in 2010 before user namespaces were a
> >> >> >> >>> thing. With the introduction of user namespaces sending out uevents became
> >> >> >> >>> partially isolated as they were filtered by user namespaces:
> >> >> >> >>>
> >> >> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
> >> >> >> >>>
> >> >> >> >>> if (!net_eq(sock_net(sk), p->net)) {
> >> >> >> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >> >> >> >>>                 return;
> >> >> >> >>>
> >> >> >> >>>         if (!peernet_has_id(sock_net(sk), p->net))
> >> >> >> >>>                 return;
> >> >> >> >>>
> >> >> >> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >> >> >>>                              CAP_NET_BROADCAST))
> >> >> >> >>>         j       return;
> >> >> >> >>> }
> >> >> >> >>>
> >> >> >> >>> The file_ns_capable() check will check whether the caller had
> >> >> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> >> >> >> >>> namespace of interest. This check is fine in general but seems insufficient
> >> >> >> >>> to me when paired with uevents. The reason is that devices always belong to
> >> >> >> >>> the initial user namespace so uevents for kobjects that do not carry a
> >> >> >> >>> namespace tag should never be sent into another user namespace. This has
> >> >> >> >>> been the intention all along. But there's one case where this breaks,
> >> >> >> >>> namely if a new user namespace is created by root on the host and an
> >> >> >> >>> identity mapping is established between root on the host and root in the
> >> >> >> >>> new user namespace. Here's a reproducer:
> >> >> >> >>>
> >> >> >> >>>  sudo unshare -U --map-root
> >> >> >> >>>  udevadm monitor -k
> >> >> >> >>>  # Now change to initial user namespace and e.g. do
> >> >> >> >>>  modprobe kvm
> >> >> >> >>>  # or
> >> >> >> >>>  rmmod kvm
> >> >> >> >>>
> >> >> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
> >> >> >> >>> host. This seems very anecdotal given that in the general case user
> >> >> >> >>> namespaces do not see any uevents and also can't really do anything useful
> >> >> >> >>> with them.
> >> >> >> >>>
> >> >> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
> >> >> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> >> >> >> >>> namespace of the network namespace of the netlink socket) userspace process
> >> >> >> >>> make a decision what uevents should be sent.
> >> >> >> >>>
> >> >> >> >>> This makes me think that we should simply ensure that uevents for kobjects
> >> >> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
> >> >> >> >>> in kobj_bcast_filter(). Specifically:
> >> >> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
> >> >> >> >>>   event will always be filtered.
> >> >> >> >>> - If the network namespace the uevent socket belongs to was created in the
> >> >> >> >>>   initial user namespace but was opened from a non-initial user namespace
> >> >> >> >>>   the event will be filtered as well.
> >> >> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
> >> >> >> >>> always only sent to the initial user namespace. The regression potential
> >> >> >> >>> for this is near to non-existent since user namespaces can't really do
> >> >> >> >>> anything with interesting devices.
> >> >> >> >>>
> >> >> >> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> >> >> >> >>> ---
> >> >> >> >>>  lib/kobject_uevent.c | 10 +++++++++-
> >> >> >> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
> >> >> >> >>>
> >> >> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> >> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644
> >> >> >> >>> --- a/lib/kobject_uevent.c
> >> >> >> >>> +++ b/lib/kobject_uevent.c
> >> >> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >> >> >> >>>  		return sock_ns != ns;
> >> >> >> >>>  	}
> >> >> >> >>>  
> >> >> >> >>> -	return 0;
> >> >> >> >>> +	/*
> >> >> >> >>> +	 * The kobject does not carry a namespace tag so filter by user
> >> >> >> >>> +	 * namespace below.
> >> >> >> >>> +	 */
> >> >> >> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
> >> >> >> >>> +		return 1;
> >> >> >> >>> +
> >> >> >> >>> +	/* Check if socket was opened from non-initial user namespace. */
> >> >> >> >>> +	return sk_user_ns(dsk) != &init_user_ns;
> >> >> >> >>>  }
> >> >> >> >>>  #endif
> >> >> >> >>
> >> >> >> >> So, this prohibits to listen events of all devices except network-related
> >> >> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
> >> >> >> > 
> >> >> >> > No, this is not correct: As it is right now *without my patch* no
> >> >> >> > non-initial user namespace is receiving *any uevents* but those
> >> >> >> > specifically namespaced such as those for network devices. This patch
> >> >> >> > doesn't change that at all. The commit message outlines this in detail
> >> >> >> > how this comes about.
> >> >> >> > There is only one case where this currently breaks and this is as I
> >> >> >> > outlined explicitly in my commit message when you create a new user
> >> >> >> > namespace and map container(0) -> host(0). This patch fixes this.
> >> >> >> 
> >> >> >> Could you please point the place, where non-initial user namespaces are filtered?
> >> >> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
> >> >> >> Now it will return 1 sometimes.
> >> >> >
> >> >> > Oh sure, it's in the commit message though. The callchain is
> >> >> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
> >> >> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
> >> >> > net/netlink/af_netlink.c:do_one_broadcast():
> >> >> >
> >> >> > This codepiece will check whether the openened socket holds
> >> >> > CAP_NET_BROADCAST in the user namespace of the target network namespace
> >> >> > which it won't because we don't have device namespaces and all devices
> >> >> > belong to the initial set of namespaces.
> >> >> >
> >> >> >         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >> >                              CAP_NET_BROADCAST))
> >> >> >         j       return;
> >> >> >
> >> >> 
> >> >> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
> >> >> on their socket and has had someone with the appropriate privileges
> >> >> assign a peerrnetid.
> >> >> 
> >> >> All of which is to say that file_ns_capable is not nearly as applicable
> >> >> as it might be, and if you can pass the other two checks I think it is
> >> >> pointless (because the peernet attributes are not generated for
> >> >> kobj_uevents) but valid to receive events from outside your network
> >> >> namespace.
> >> >> 
> >> >> 
> >> >> I might be missing something but I don't see anything excluding network
> >> >> namespaces owned by !init_user_ns excluded from the kobject_uevent
> >> >> logic.
> >> >> 
> >> >> The uevent_sock_list has one entry per network namespace. And
> >> >> kobject_uevent_net_broacast appears to walk each one.
> >> >> 
> >> >> I had a memory of filtering uevent messages and I had a memory
> >> >> that the netlink_has_listeners had a per network namespace effect.
> >> >> Neither seems true from my inspection of the code tonight.
> >> >> 
> >> >> If we are not filtering ordinary uevents at least at the user namespace
> >> >> level that does seem to be at least a nuisance.
> >> >> 
> >> >> 
> >> >> Christian can you dig a little deeper into this.  I have a feeling that
> >> >> there are some real efficiency improvements that we could make to
> >> >> kobject_uevent_net_broadcast if nothing else.
> >> >> 
> >> >> Perhaps you could see where uevents are broadcast by poking
> >> >> the sysfs uevent of an existing device, and triggering a retransmit.
> >> >
> >> > @Eric, so I did some intensive testing over the weekend and forget everything I
> >> > said before. Uevents are not filtered by the kernel at the moment. This is
> >> > currently - apart from network devices - a pure userspace thing. Specifically,
> >> > anyone  on the host can listen to all uevents from everywhere. It's neither
> >> > filtered by user nor by network namespace. The fact that I used
> >> >
> >> > udevadm --debug monitor
> >> >
> >> > to test my prior hypothesis was what led me to believe that uevents are already
> >> > correctly filtered.
> >> > Instead, what is actually happening is that every udev implementation out there
> >> > is discarding uevents that were send by uids != 0 in the CMSG_DATA.
> >> > Specifically,
> >> 
> >> Yes.  I remember something of the sort.  I believe udev also filters to
> >> ensure that the netlink port id == 0 to ensure the message came from
> >> the kernel and was not spoofed in any way.
> >> 
> >> > - legacy standalone udevd:
> >> >   https://git.kernel.org/pub/scm/linux/hotplug/udev.git/snapshot/udev-062.tar.gz
> >> > - eudevd
> >> >   https://github.com/gentoo/eudev/blob/6f630d32bf494a457171b3f99e329592497bf271/src/libudev/libudev-monitor.c#L645
> >> > - systemd-udevd
> >> >   https://github.com/systemd/systemd/blob/e89ab7f219a399ab719c78cf43c07c0da60bd151/src/libudev/libudev-monitor.c#L656
> >> > - ueventd (Android)
> >> >   https://android.googlesource.com/platform/system/core.git/+/android-8.1.0_r22/libcutils/uevent.c#81
> >> >
> >> > For all of those I could trace this behavior back to the first released
> >> > version. (To be precise, for legacy udevd that eventually became systemd-udevd
> >> > I could trace it back to the first version that is still available on
> >> > git.kernel.org which is 062. Since eudevd is a fork of systemd-udevd it is
> >> > trivially true that it has the same behavior from the beginning.)
> >> >
> >> > In any case, userspace udev is not making use of uevents at all right now since
> >> > any uid != 0 events are **explicitly** discarded.
> >> > The fact that you receive uevents for
> >> >
> >> > sudo unshare -U --map-root -n
> >> > udevadm --debug monitor
> >> >
> >> > is simply explained by the fact that container(0) <=> host(0) at which point
> >> > the uid in CMSG_DATA will be 0 in the new user namespace and udev will not
> >> > discard it.
> >> > The use case for receiving uevents in containers/user namespaces is definitely
> >> > there but that's what the uevent injection patch series was for that we merged.
> >> > This is a much safer and saner solution.
> >> > The fact that all udev implementations filter uevents by uid != 0 very much
> >> > seems like a security mechanism in userspace that we probably should provide by
> >> > isolating uevents based on user and/or network namespaces.
> >> 
> >> So in summary.  Uevents are filtered in a user namespace (by userspace)
> >> because the received uid != 0.  It instead == 65534 == "nobody" because
> >> the global root uid is not mapped.
> >
> > Exactly.
> >
> >> 
> >> Which means that we can modify the kernel to not send to all network
> >> namespaces whose user_ns != &init_user_ns because we know that userspace
> >> will ignore the message because of the uid anyway.  Which means when
> >
> > Yes.
> >
> >> net-next reopens you can send that patch.  But please base it on just
> >> not including network namespaces in the list, as that is much more
> >> efficient than adding more conditions to the filter.
> >
> > I'll send a patch out once net-next reopens. I'll also make sure to
> > inform all udev userspace implementations of the change. It won't affect
> > them but it is nice for them to know that they're safer now.
> 
> The real danger is in a user namespace or in a container really is too
> many daemons responding to events will generate a thundering hurd of
> activity when there is really nothing to do.
> 
> > Something like this (Proper commit message and so on will be added once
> > I sent this out.):
> 
> Exactly.
> 
> I would make the comment say something like: "ignore all but the initial
> user namespace".

Yeah, agreed.
But I think the patch is not complete. To guarantee that no non-initial
user namespace actually receives uevents we need to:
1. only sent uevents to uevent sockets that are located in network
   namespaces that are owned by init_user_ns
2. filter uevents that are sent to sockets in mc_list that have opened a
   uevent socket that is owned by init_user_ns *from* a
   non-init_user_ns

We account for 1. by only recording uevent sockets in the global uevent
socket list who are owned by init_user_ns.
But to account for 2. we need to filter by the user namespace who owns
the socket in mc_list. So in addition to that we also need to slightly
change the filter logic in kobj_bcast_filter() I think:

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 22a2c1a98b8f..064d7d29ace5 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
 		return sock_ns != ns;
 	}
 
-	return 0;
+ 	/* Check if socket was opened from non-initial user namespace. */
+ 	return sk_user_ns(dsk) != &init_user_ns;
 }
 #endif
 

But correct me if I'm wrong.

Christian

> 
> Eric
> 
> 
> > From 68d3c27435520cd600874999b6d9d17572854a7a Mon Sep 17 00:00:00 2001
> > From: Christian Brauner <christian.brauner@ubuntu.com>
> > Date: Tue, 10 Apr 2018 11:56:49 +0200
> > Subject: [PATCH] netns: restrict uevents to initial user namespace
> >
> > /* Here'll be a useful commit message describing in detail what's
> >  * happening once I sent it to net-next. */
> >
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > ---
> >  lib/kobject_uevent.c | 21 +++++++++++++++------
> >  1 file changed, 15 insertions(+), 6 deletions(-)
> >
> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> > index 15ea216a67ce..22a2c1a98b8f 100644
> > --- a/lib/kobject_uevent.c
> > +++ b/lib/kobject_uevent.c
> > @@ -703,9 +703,16 @@ static int uevent_net_init(struct net *net)
> >  
> >  	net->uevent_sock = ue_sk;
> >  
> > -	mutex_lock(&uevent_sock_mutex);
> > -	list_add_tail(&ue_sk->list, &uevent_sock_list);
> > -	mutex_unlock(&uevent_sock_mutex);
> > +	/*
> > +	 * Only sent uevents to uevent sockets that are located in network
> > +	 * namespaces owned by the initial user namespace.
> > +	 */
> > +	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> > +		mutex_lock(&uevent_sock_mutex);
> > +		list_add_tail(&ue_sk->list, &uevent_sock_list);
> > +		mutex_unlock(&uevent_sock_mutex);
> > +	}
> > +
> >  	return 0;
> >  }
> >  
> > @@ -713,9 +720,11 @@ static void uevent_net_exit(struct net *net)
> >  {
> >  	struct uevent_sock *ue_sk = net->uevent_sock;
> >  
> > -	mutex_lock(&uevent_sock_mutex);
> > -	list_del(&ue_sk->list);
> > -	mutex_unlock(&uevent_sock_mutex);
> > +	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
> > +		mutex_lock(&uevent_sock_mutex);
> > +		list_del(&ue_sk->list);
> > +		mutex_unlock(&uevent_sock_mutex);
> > +	}
> >  
> >  	netlink_kernel_release(ue_sk->sk);
> >  	kfree(ue_sk);

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next] netns: filter uevents correctly
  2018-04-11  9:09                   ` Christian Brauner
@ 2018-04-11 16:40                     ` Eric W. Biederman
  2018-04-11 17:03                       ` Christian Brauner
  0 siblings, 1 reply; 26+ messages in thread
From: Eric W. Biederman @ 2018-04-11 16:40 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge

Christian Brauner <christian.brauner@canonical.com> writes:

> On Tue, Apr 10, 2018 at 10:04:46AM -0500, Eric W. Biederman wrote:
>> Christian Brauner <christian.brauner@canonical.com> writes:
>> 
>> > On Mon, Apr 09, 2018 at 06:21:31PM -0500, Eric W. Biederman wrote:
>> >> Christian Brauner <christian.brauner@canonical.com> writes:
>> >> 
>> >> > On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
>> >> >> Christian Brauner <christian.brauner@canonical.com> writes:
>> >> >> 
>> >> >> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
>> >> >> >> On 05.04.2018 17:07, Christian Brauner wrote:
>> >> >> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
>> >> >> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
>> >> >> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>> >> >> >> >>>
>> >> >> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
>> >> >> >> >>> Over time the set of uevents that get sent into all network namespaces has
>> >> >> >> >>> shrunk. We have now reached the point where hotplug events for all devices
>> >> >> >> >>> that carry a namespace tag are filtered according to that namespace.
>> >> >> >> >>>
>> >> >> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
>> >> >> >> >>> does not match the namespace tag of the netlink socket. One example are
>> >> >> >> >>> network devices. Uevents for network devices only show up in the network
>> >> >> >> >>> namespaces these devices are moved to or created in.
>> >> >> >> >>>
>> >> >> >> >>> However, any uevent for a kobject that does not have a namespace tag
>> >> >> >> >>> associated with it will not be filtered and we will *try* to broadcast it
>> >> >> >> >>> into all network namespaces.
>> >> >> >> >>>
>> >> >> >> >>> The original patchset was written in 2010 before user namespaces were a
>> >> >> >> >>> thing. With the introduction of user namespaces sending out uevents became
>> >> >> >> >>> partially isolated as they were filtered by user namespaces:
>> >> >> >> >>>
>> >> >> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
>> >> >> >> >>>
>> >> >> >> >>> if (!net_eq(sock_net(sk), p->net)) {
>> >> >> >> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>> >> >> >> >>>                 return;
>> >> >> >> >>>
>> >> >> >> >>>         if (!peernet_has_id(sock_net(sk), p->net))
>> >> >> >> >>>                 return;
>> >> >> >> >>>
>> >> >> >> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >> >> >> >>>                              CAP_NET_BROADCAST))
>> >> >> >> >>>         j       return;
>> >> >> >> >>> }
>> >> >> >> >>>
>> >> >> >> >>> The file_ns_capable() check will check whether the caller had
>> >> >> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
>> >> >> >> >>> namespace of interest. This check is fine in general but seems insufficient
>> >> >> >> >>> to me when paired with uevents. The reason is that devices always belong to
>> >> >> >> >>> the initial user namespace so uevents for kobjects that do not carry a
>> >> >> >> >>> namespace tag should never be sent into another user namespace. This has
>> >> >> >> >>> been the intention all along. But there's one case where this breaks,
>> >> >> >> >>> namely if a new user namespace is created by root on the host and an
>> >> >> >> >>> identity mapping is established between root on the host and root in the
>> >> >> >> >>> new user namespace. Here's a reproducer:
>> >> >> >> >>>
>> >> >> >> >>>  sudo unshare -U --map-root
>> >> >> >> >>>  udevadm monitor -k
>> >> >> >> >>>  # Now change to initial user namespace and e.g. do
>> >> >> >> >>>  modprobe kvm
>> >> >> >> >>>  # or
>> >> >> >> >>>  rmmod kvm
>> >> >> >> >>>
>> >> >> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
>> >> >> >> >>> host. This seems very anecdotal given that in the general case user
>> >> >> >> >>> namespaces do not see any uevents and also can't really do anything useful
>> >> >> >> >>> with them.
>> >> >> >> >>>
>> >> >> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
>> >> >> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
>> >> >> >> >>> namespace of the network namespace of the netlink socket) userspace process
>> >> >> >> >>> make a decision what uevents should be sent.
>> >> >> >> >>>
>> >> >> >> >>> This makes me think that we should simply ensure that uevents for kobjects
>> >> >> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
>> >> >> >> >>> in kobj_bcast_filter(). Specifically:
>> >> >> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
>> >> >> >> >>>   event will always be filtered.
>> >> >> >> >>> - If the network namespace the uevent socket belongs to was created in the
>> >> >> >> >>>   initial user namespace but was opened from a non-initial user namespace
>> >> >> >> >>>   the event will be filtered as well.
>> >> >> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
>> >> >> >> >>> always only sent to the initial user namespace. The regression potential
>> >> >> >> >>> for this is near to non-existent since user namespaces can't really do
>> >> >> >> >>> anything with interesting devices.
>> >> >> >> >>>
>> >> >> >> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
>> >> >> >> >>> ---
>> >> >> >> >>>  lib/kobject_uevent.c | 10 +++++++++-
>> >> >> >> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
>> >> >> >> >>>
>> >> >> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> >> >> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644
>> >> >> >> >>> --- a/lib/kobject_uevent.c
>> >> >> >> >>> +++ b/lib/kobject_uevent.c
>> >> >> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>> >> >> >> >>>  		return sock_ns != ns;
>> >> >> >> >>>  	}
>> >> >> >> >>>  
>> >> >> >> >>> -	return 0;
>> >> >> >> >>> +	/*
>> >> >> >> >>> +	 * The kobject does not carry a namespace tag so filter by user
>> >> >> >> >>> +	 * namespace below.
>> >> >> >> >>> +	 */
>> >> >> >> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
>> >> >> >> >>> +		return 1;
>> >> >> >> >>> +
>> >> >> >> >>> +	/* Check if socket was opened from non-initial user namespace. */
>> >> >> >> >>> +	return sk_user_ns(dsk) != &init_user_ns;
>> >> >> >> >>>  }
>> >> >> >> >>>  #endif
>> >> >> >> >>
>> >> >> >> >> So, this prohibits to listen events of all devices except network-related
>> >> >> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
>> >> >> >> > 
>> >> >> >> > No, this is not correct: As it is right now *without my patch* no
>> >> >> >> > non-initial user namespace is receiving *any uevents* but those
>> >> >> >> > specifically namespaced such as those for network devices. This patch
>> >> >> >> > doesn't change that at all. The commit message outlines this in detail
>> >> >> >> > how this comes about.
>> >> >> >> > There is only one case where this currently breaks and this is as I
>> >> >> >> > outlined explicitly in my commit message when you create a new user
>> >> >> >> > namespace and map container(0) -> host(0). This patch fixes this.
>> >> >> >> 
>> >> >> >> Could you please point the place, where non-initial user namespaces are filtered?
>> >> >> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
>> >> >> >> Now it will return 1 sometimes.
>> >> >> >
>> >> >> > Oh sure, it's in the commit message though. The callchain is
>> >> >> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
>> >> >> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
>> >> >> > net/netlink/af_netlink.c:do_one_broadcast():
>> >> >> >
>> >> >> > This codepiece will check whether the openened socket holds
>> >> >> > CAP_NET_BROADCAST in the user namespace of the target network namespace
>> >> >> > which it won't because we don't have device namespaces and all devices
>> >> >> > belong to the initial set of namespaces.
>> >> >> >
>> >> >> >         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >> >> >                              CAP_NET_BROADCAST))
>> >> >> >         j       return;
>> >> >> >
>> >> >> 
>> >> >> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
>> >> >> on their socket and has had someone with the appropriate privileges
>> >> >> assign a peerrnetid.
>> >> >> 
>> >> >> All of which is to say that file_ns_capable is not nearly as applicable
>> >> >> as it might be, and if you can pass the other two checks I think it is
>> >> >> pointless (because the peernet attributes are not generated for
>> >> >> kobj_uevents) but valid to receive events from outside your network
>> >> >> namespace.
>> >> >> 
>> >> >> 
>> >> >> I might be missing something but I don't see anything excluding network
>> >> >> namespaces owned by !init_user_ns excluded from the kobject_uevent
>> >> >> logic.
>> >> >> 
>> >> >> The uevent_sock_list has one entry per network namespace. And
>> >> >> kobject_uevent_net_broacast appears to walk each one.
>> >> >> 
>> >> >> I had a memory of filtering uevent messages and I had a memory
>> >> >> that the netlink_has_listeners had a per network namespace effect.
>> >> >> Neither seems true from my inspection of the code tonight.
>> >> >> 
>> >> >> If we are not filtering ordinary uevents at least at the user namespace
>> >> >> level that does seem to be at least a nuisance.
>> >> >> 
>> >> >> 
>> >> >> Christian can you dig a little deeper into this.  I have a feeling that
>> >> >> there are some real efficiency improvements that we could make to
>> >> >> kobject_uevent_net_broadcast if nothing else.
>> >> >> 
>> >> >> Perhaps you could see where uevents are broadcast by poking
>> >> >> the sysfs uevent of an existing device, and triggering a retransmit.
>> >> >
>> >> > @Eric, so I did some intensive testing over the weekend and forget everything I
>> >> > said before. Uevents are not filtered by the kernel at the moment. This is
>> >> > currently - apart from network devices - a pure userspace thing. Specifically,
>> >> > anyone  on the host can listen to all uevents from everywhere. It's neither
>> >> > filtered by user nor by network namespace. The fact that I used
>> >> >
>> >> > udevadm --debug monitor
>> >> >
>> >> > to test my prior hypothesis was what led me to believe that uevents are already
>> >> > correctly filtered.
>> >> > Instead, what is actually happening is that every udev implementation out there
>> >> > is discarding uevents that were send by uids != 0 in the CMSG_DATA.
>> >> > Specifically,
>> >> 
>> >> Yes.  I remember something of the sort.  I believe udev also filters to
>> >> ensure that the netlink port id == 0 to ensure the message came from
>> >> the kernel and was not spoofed in any way.
>> >> 
>> >> > - legacy standalone udevd:
>> >> >   https://git.kernel.org/pub/scm/linux/hotplug/udev.git/snapshot/udev-062.tar.gz
>> >> > - eudevd
>> >> >   https://github.com/gentoo/eudev/blob/6f630d32bf494a457171b3f99e329592497bf271/src/libudev/libudev-monitor.c#L645
>> >> > - systemd-udevd
>> >> >   https://github.com/systemd/systemd/blob/e89ab7f219a399ab719c78cf43c07c0da60bd151/src/libudev/libudev-monitor.c#L656
>> >> > - ueventd (Android)
>> >> >   https://android.googlesource.com/platform/system/core.git/+/android-8.1.0_r22/libcutils/uevent.c#81
>> >> >
>> >> > For all of those I could trace this behavior back to the first released
>> >> > version. (To be precise, for legacy udevd that eventually became systemd-udevd
>> >> > I could trace it back to the first version that is still available on
>> >> > git.kernel.org which is 062. Since eudevd is a fork of systemd-udevd it is
>> >> > trivially true that it has the same behavior from the beginning.)
>> >> >
>> >> > In any case, userspace udev is not making use of uevents at all right now since
>> >> > any uid != 0 events are **explicitly** discarded.
>> >> > The fact that you receive uevents for
>> >> >
>> >> > sudo unshare -U --map-root -n
>> >> > udevadm --debug monitor
>> >> >
>> >> > is simply explained by the fact that container(0) <=> host(0) at which point
>> >> > the uid in CMSG_DATA will be 0 in the new user namespace and udev will not
>> >> > discard it.
>> >> > The use case for receiving uevents in containers/user namespaces is definitely
>> >> > there but that's what the uevent injection patch series was for that we merged.
>> >> > This is a much safer and saner solution.
>> >> > The fact that all udev implementations filter uevents by uid != 0 very much
>> >> > seems like a security mechanism in userspace that we probably should provide by
>> >> > isolating uevents based on user and/or network namespaces.
>> >> 
>> >> So in summary.  Uevents are filtered in a user namespace (by userspace)
>> >> because the received uid != 0.  It instead == 65534 == "nobody" because
>> >> the global root uid is not mapped.
>> >
>> > Exactly.
>> >
>> >> 
>> >> Which means that we can modify the kernel to not send to all network
>> >> namespaces whose user_ns != &init_user_ns because we know that userspace
>> >> will ignore the message because of the uid anyway.  Which means when
>> >
>> > Yes.
>> >
>> >> net-next reopens you can send that patch.  But please base it on just
>> >> not including network namespaces in the list, as that is much more
>> >> efficient than adding more conditions to the filter.
>> >
>> > I'll send a patch out once net-next reopens. I'll also make sure to
>> > inform all udev userspace implementations of the change. It won't affect
>> > them but it is nice for them to know that they're safer now.
>> 
>> The real danger is in a user namespace or in a container really is too
>> many daemons responding to events will generate a thundering hurd of
>> activity when there is really nothing to do.
>> 
>> > Something like this (Proper commit message and so on will be added once
>> > I sent this out.):
>> 
>> Exactly.
>> 
>> I would make the comment say something like: "ignore all but the initial
>> user namespace".
>
> Yeah, agreed.
> But I think the patch is not complete. To guarantee that no non-initial
> user namespace actually receives uevents we need to:
> 1. only sent uevents to uevent sockets that are located in network
>    namespaces that are owned by init_user_ns
> 2. filter uevents that are sent to sockets in mc_list that have opened a
>    uevent socket that is owned by init_user_ns *from* a
>    non-init_user_ns
>
> We account for 1. by only recording uevent sockets in the global uevent
> socket list who are owned by init_user_ns.
> But to account for 2. we need to filter by the user namespace who owns
> the socket in mc_list. So in addition to that we also need to slightly
> change the filter logic in kobj_bcast_filter() I think:
>
> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> index 22a2c1a98b8f..064d7d29ace5 100644
> --- a/lib/kobject_uevent.c
> +++ b/lib/kobject_uevent.c
> @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>  		return sock_ns != ns;
>  	}
>  
> -	return 0;
> + 	/* Check if socket was opened from non-initial user namespace. */
> + 	return sk_user_ns(dsk) != &init_user_ns;
>  }
>  #endif
>  
>
> But correct me if I'm wrong.

You are worrying about NETLINK_LISTEN_ALL_NSID sockets.  That has
permissions and an explicit opt-in to receiving packets from multiple
network namespaces.

There is a netlink CMSG that tells you which network namespace the
packet comes from.  Something any sane person will use if they set
NETLINK_LISTEN_ALL_NSID.

There is no problem of confusion.  There is no problem of permissions.
So we don't need to worry about preventing NETLINK_LISTEN_ALL_NSID
listeners from seeing the uevents.

Eric

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next] netns: filter uevents correctly
  2018-04-11 16:40                     ` Eric W. Biederman
@ 2018-04-11 17:03                       ` Christian Brauner
  2018-04-11 18:37                         ` Eric W. Biederman
  0 siblings, 1 reply; 26+ messages in thread
From: Christian Brauner @ 2018-04-11 17:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge

On Wed, Apr 11, 2018 at 11:40:14AM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
> 
> > On Tue, Apr 10, 2018 at 10:04:46AM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> 
> >> > On Mon, Apr 09, 2018 at 06:21:31PM -0500, Eric W. Biederman wrote:
> >> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> >> 
> >> >> > On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
> >> >> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> >> >> 
> >> >> >> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
> >> >> >> >> On 05.04.2018 17:07, Christian Brauner wrote:
> >> >> >> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
> >> >> >> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
> >> >> >> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >> >> >> >> >>>
> >> >> >> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
> >> >> >> >> >>> Over time the set of uevents that get sent into all network namespaces has
> >> >> >> >> >>> shrunk. We have now reached the point where hotplug events for all devices
> >> >> >> >> >>> that carry a namespace tag are filtered according to that namespace.
> >> >> >> >> >>>
> >> >> >> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
> >> >> >> >> >>> does not match the namespace tag of the netlink socket. One example are
> >> >> >> >> >>> network devices. Uevents for network devices only show up in the network
> >> >> >> >> >>> namespaces these devices are moved to or created in.
> >> >> >> >> >>>
> >> >> >> >> >>> However, any uevent for a kobject that does not have a namespace tag
> >> >> >> >> >>> associated with it will not be filtered and we will *try* to broadcast it
> >> >> >> >> >>> into all network namespaces.
> >> >> >> >> >>>
> >> >> >> >> >>> The original patchset was written in 2010 before user namespaces were a
> >> >> >> >> >>> thing. With the introduction of user namespaces sending out uevents became
> >> >> >> >> >>> partially isolated as they were filtered by user namespaces:
> >> >> >> >> >>>
> >> >> >> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
> >> >> >> >> >>>
> >> >> >> >> >>> if (!net_eq(sock_net(sk), p->net)) {
> >> >> >> >> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >> >> >> >> >>>                 return;
> >> >> >> >> >>>
> >> >> >> >> >>>         if (!peernet_has_id(sock_net(sk), p->net))
> >> >> >> >> >>>                 return;
> >> >> >> >> >>>
> >> >> >> >> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >> >> >> >>>                              CAP_NET_BROADCAST))
> >> >> >> >> >>>         j       return;
> >> >> >> >> >>> }
> >> >> >> >> >>>
> >> >> >> >> >>> The file_ns_capable() check will check whether the caller had
> >> >> >> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> >> >> >> >> >>> namespace of interest. This check is fine in general but seems insufficient
> >> >> >> >> >>> to me when paired with uevents. The reason is that devices always belong to
> >> >> >> >> >>> the initial user namespace so uevents for kobjects that do not carry a
> >> >> >> >> >>> namespace tag should never be sent into another user namespace. This has
> >> >> >> >> >>> been the intention all along. But there's one case where this breaks,
> >> >> >> >> >>> namely if a new user namespace is created by root on the host and an
> >> >> >> >> >>> identity mapping is established between root on the host and root in the
> >> >> >> >> >>> new user namespace. Here's a reproducer:
> >> >> >> >> >>>
> >> >> >> >> >>>  sudo unshare -U --map-root
> >> >> >> >> >>>  udevadm monitor -k
> >> >> >> >> >>>  # Now change to initial user namespace and e.g. do
> >> >> >> >> >>>  modprobe kvm
> >> >> >> >> >>>  # or
> >> >> >> >> >>>  rmmod kvm
> >> >> >> >> >>>
> >> >> >> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
> >> >> >> >> >>> host. This seems very anecdotal given that in the general case user
> >> >> >> >> >>> namespaces do not see any uevents and also can't really do anything useful
> >> >> >> >> >>> with them.
> >> >> >> >> >>>
> >> >> >> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
> >> >> >> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> >> >> >> >> >>> namespace of the network namespace of the netlink socket) userspace process
> >> >> >> >> >>> make a decision what uevents should be sent.
> >> >> >> >> >>>
> >> >> >> >> >>> This makes me think that we should simply ensure that uevents for kobjects
> >> >> >> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
> >> >> >> >> >>> in kobj_bcast_filter(). Specifically:
> >> >> >> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
> >> >> >> >> >>>   event will always be filtered.
> >> >> >> >> >>> - If the network namespace the uevent socket belongs to was created in the
> >> >> >> >> >>>   initial user namespace but was opened from a non-initial user namespace
> >> >> >> >> >>>   the event will be filtered as well.
> >> >> >> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
> >> >> >> >> >>> always only sent to the initial user namespace. The regression potential
> >> >> >> >> >>> for this is near to non-existent since user namespaces can't really do
> >> >> >> >> >>> anything with interesting devices.
> >> >> >> >> >>>
> >> >> >> >> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> >> >> >> >> >>> ---
> >> >> >> >> >>>  lib/kobject_uevent.c | 10 +++++++++-
> >> >> >> >> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
> >> >> >> >> >>>
> >> >> >> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> >> >> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644
> >> >> >> >> >>> --- a/lib/kobject_uevent.c
> >> >> >> >> >>> +++ b/lib/kobject_uevent.c
> >> >> >> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >> >> >> >> >>>  		return sock_ns != ns;
> >> >> >> >> >>>  	}
> >> >> >> >> >>>  
> >> >> >> >> >>> -	return 0;
> >> >> >> >> >>> +	/*
> >> >> >> >> >>> +	 * The kobject does not carry a namespace tag so filter by user
> >> >> >> >> >>> +	 * namespace below.
> >> >> >> >> >>> +	 */
> >> >> >> >> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
> >> >> >> >> >>> +		return 1;
> >> >> >> >> >>> +
> >> >> >> >> >>> +	/* Check if socket was opened from non-initial user namespace. */
> >> >> >> >> >>> +	return sk_user_ns(dsk) != &init_user_ns;
> >> >> >> >> >>>  }
> >> >> >> >> >>>  #endif
> >> >> >> >> >>
> >> >> >> >> >> So, this prohibits to listen events of all devices except network-related
> >> >> >> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
> >> >> >> >> > 
> >> >> >> >> > No, this is not correct: As it is right now *without my patch* no
> >> >> >> >> > non-initial user namespace is receiving *any uevents* but those
> >> >> >> >> > specifically namespaced such as those for network devices. This patch
> >> >> >> >> > doesn't change that at all. The commit message outlines this in detail
> >> >> >> >> > how this comes about.
> >> >> >> >> > There is only one case where this currently breaks and this is as I
> >> >> >> >> > outlined explicitly in my commit message when you create a new user
> >> >> >> >> > namespace and map container(0) -> host(0). This patch fixes this.
> >> >> >> >> 
> >> >> >> >> Could you please point the place, where non-initial user namespaces are filtered?
> >> >> >> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
> >> >> >> >> Now it will return 1 sometimes.
> >> >> >> >
> >> >> >> > Oh sure, it's in the commit message though. The callchain is
> >> >> >> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
> >> >> >> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
> >> >> >> > net/netlink/af_netlink.c:do_one_broadcast():
> >> >> >> >
> >> >> >> > This codepiece will check whether the openened socket holds
> >> >> >> > CAP_NET_BROADCAST in the user namespace of the target network namespace
> >> >> >> > which it won't because we don't have device namespaces and all devices
> >> >> >> > belong to the initial set of namespaces.
> >> >> >> >
> >> >> >> >         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >> >> >                              CAP_NET_BROADCAST))
> >> >> >> >         j       return;
> >> >> >> >
> >> >> >> 
> >> >> >> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
> >> >> >> on their socket and has had someone with the appropriate privileges
> >> >> >> assign a peerrnetid.
> >> >> >> 
> >> >> >> All of which is to say that file_ns_capable is not nearly as applicable
> >> >> >> as it might be, and if you can pass the other two checks I think it is
> >> >> >> pointless (because the peernet attributes are not generated for
> >> >> >> kobj_uevents) but valid to receive events from outside your network
> >> >> >> namespace.
> >> >> >> 
> >> >> >> 
> >> >> >> I might be missing something but I don't see anything excluding network
> >> >> >> namespaces owned by !init_user_ns excluded from the kobject_uevent
> >> >> >> logic.
> >> >> >> 
> >> >> >> The uevent_sock_list has one entry per network namespace. And
> >> >> >> kobject_uevent_net_broacast appears to walk each one.
> >> >> >> 
> >> >> >> I had a memory of filtering uevent messages and I had a memory
> >> >> >> that the netlink_has_listeners had a per network namespace effect.
> >> >> >> Neither seems true from my inspection of the code tonight.
> >> >> >> 
> >> >> >> If we are not filtering ordinary uevents at least at the user namespace
> >> >> >> level that does seem to be at least a nuisance.
> >> >> >> 
> >> >> >> 
> >> >> >> Christian can you dig a little deeper into this.  I have a feeling that
> >> >> >> there are some real efficiency improvements that we could make to
> >> >> >> kobject_uevent_net_broadcast if nothing else.
> >> >> >> 
> >> >> >> Perhaps you could see where uevents are broadcast by poking
> >> >> >> the sysfs uevent of an existing device, and triggering a retransmit.
> >> >> >
> >> >> > @Eric, so I did some intensive testing over the weekend and forget everything I
> >> >> > said before. Uevents are not filtered by the kernel at the moment. This is
> >> >> > currently - apart from network devices - a pure userspace thing. Specifically,
> >> >> > anyone  on the host can listen to all uevents from everywhere. It's neither
> >> >> > filtered by user nor by network namespace. The fact that I used
> >> >> >
> >> >> > udevadm --debug monitor
> >> >> >
> >> >> > to test my prior hypothesis was what led me to believe that uevents are already
> >> >> > correctly filtered.
> >> >> > Instead, what is actually happening is that every udev implementation out there
> >> >> > is discarding uevents that were send by uids != 0 in the CMSG_DATA.
> >> >> > Specifically,
> >> >> 
> >> >> Yes.  I remember something of the sort.  I believe udev also filters to
> >> >> ensure that the netlink port id == 0 to ensure the message came from
> >> >> the kernel and was not spoofed in any way.
> >> >> 
> >> >> > - legacy standalone udevd:
> >> >> >   https://git.kernel.org/pub/scm/linux/hotplug/udev.git/snapshot/udev-062.tar.gz
> >> >> > - eudevd
> >> >> >   https://github.com/gentoo/eudev/blob/6f630d32bf494a457171b3f99e329592497bf271/src/libudev/libudev-monitor.c#L645
> >> >> > - systemd-udevd
> >> >> >   https://github.com/systemd/systemd/blob/e89ab7f219a399ab719c78cf43c07c0da60bd151/src/libudev/libudev-monitor.c#L656
> >> >> > - ueventd (Android)
> >> >> >   https://android.googlesource.com/platform/system/core.git/+/android-8.1.0_r22/libcutils/uevent.c#81
> >> >> >
> >> >> > For all of those I could trace this behavior back to the first released
> >> >> > version. (To be precise, for legacy udevd that eventually became systemd-udevd
> >> >> > I could trace it back to the first version that is still available on
> >> >> > git.kernel.org which is 062. Since eudevd is a fork of systemd-udevd it is
> >> >> > trivially true that it has the same behavior from the beginning.)
> >> >> >
> >> >> > In any case, userspace udev is not making use of uevents at all right now since
> >> >> > any uid != 0 events are **explicitly** discarded.
> >> >> > The fact that you receive uevents for
> >> >> >
> >> >> > sudo unshare -U --map-root -n
> >> >> > udevadm --debug monitor
> >> >> >
> >> >> > is simply explained by the fact that container(0) <=> host(0) at which point
> >> >> > the uid in CMSG_DATA will be 0 in the new user namespace and udev will not
> >> >> > discard it.
> >> >> > The use case for receiving uevents in containers/user namespaces is definitely
> >> >> > there but that's what the uevent injection patch series was for that we merged.
> >> >> > This is a much safer and saner solution.
> >> >> > The fact that all udev implementations filter uevents by uid != 0 very much
> >> >> > seems like a security mechanism in userspace that we probably should provide by
> >> >> > isolating uevents based on user and/or network namespaces.
> >> >> 
> >> >> So in summary.  Uevents are filtered in a user namespace (by userspace)
> >> >> because the received uid != 0.  It instead == 65534 == "nobody" because
> >> >> the global root uid is not mapped.
> >> >
> >> > Exactly.
> >> >
> >> >> 
> >> >> Which means that we can modify the kernel to not send to all network
> >> >> namespaces whose user_ns != &init_user_ns because we know that userspace
> >> >> will ignore the message because of the uid anyway.  Which means when
> >> >
> >> > Yes.
> >> >
> >> >> net-next reopens you can send that patch.  But please base it on just
> >> >> not including network namespaces in the list, as that is much more
> >> >> efficient than adding more conditions to the filter.
> >> >
> >> > I'll send a patch out once net-next reopens. I'll also make sure to
> >> > inform all udev userspace implementations of the change. It won't affect
> >> > them but it is nice for them to know that they're safer now.
> >> 
> >> The real danger is in a user namespace or in a container really is too
> >> many daemons responding to events will generate a thundering hurd of
> >> activity when there is really nothing to do.
> >> 
> >> > Something like this (Proper commit message and so on will be added once
> >> > I sent this out.):
> >> 
> >> Exactly.
> >> 
> >> I would make the comment say something like: "ignore all but the initial
> >> user namespace".
> >
> > Yeah, agreed.
> > But I think the patch is not complete. To guarantee that no non-initial
> > user namespace actually receives uevents we need to:
> > 1. only sent uevents to uevent sockets that are located in network
> >    namespaces that are owned by init_user_ns
> > 2. filter uevents that are sent to sockets in mc_list that have opened a
> >    uevent socket that is owned by init_user_ns *from* a
> >    non-init_user_ns
> >
> > We account for 1. by only recording uevent sockets in the global uevent
> > socket list who are owned by init_user_ns.
> > But to account for 2. we need to filter by the user namespace who owns
> > the socket in mc_list. So in addition to that we also need to slightly
> > change the filter logic in kobj_bcast_filter() I think:
> >
> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> > index 22a2c1a98b8f..064d7d29ace5 100644
> > --- a/lib/kobject_uevent.c
> > +++ b/lib/kobject_uevent.c
> > @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >  		return sock_ns != ns;
> >  	}
> >  
> > -	return 0;
> > + 	/* Check if socket was opened from non-initial user namespace. */
> > + 	return sk_user_ns(dsk) != &init_user_ns;
> >  }
> >  #endif
> >  
> >
> > But correct me if I'm wrong.
> 
> You are worrying about NETLINK_LISTEN_ALL_NSID sockets.  That has
> permissions and an explicit opt-in to receiving packets from multiple
> network namespaces.

I don't think that's what I'm talking about unless that is somehow the
default for NETLINK_KOBJECT_UEVENT sockets. What I'm worried about is
doing

unshare -U --map-root

then opening a NETLINK_KOBJECT_UEVENT socket and starting to listen to
uevents. Imho, this should not be possible because I'm in a
non-init_user_ns. But currently I'm able to - even with the patch to
come - since the uevent socket in the kernel was created when init_net
was created and hence is *owned* by the init_user_ns which means it is
in the list of uevent sockets. Here's a demo of what I mean:

https://asciinema.org/a/175632

Christian

> 
> There is a netlink CMSG that tells you which network namespace the
> packet comes from.  Something any sane person will use if they set
> NETLINK_LISTEN_ALL_NSID.
> 
> There is no problem of confusion.  There is no problem of permissions.
> So we don't need to worry about preventing NETLINK_LISTEN_ALL_NSID
> listeners from seeing the uevents.
> 
> Eric

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next] netns: filter uevents correctly
  2018-04-11 17:03                       ` Christian Brauner
@ 2018-04-11 18:37                         ` Eric W. Biederman
  2018-04-11 18:57                           ` Christian Brauner
  0 siblings, 1 reply; 26+ messages in thread
From: Eric W. Biederman @ 2018-04-11 18:37 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge

Christian Brauner <christian.brauner@canonical.com> writes:

> On Wed, Apr 11, 2018 at 11:40:14AM -0500, Eric W. Biederman wrote:
>> Christian Brauner <christian.brauner@canonical.com> writes:
>> > Yeah, agreed.
>> > But I think the patch is not complete. To guarantee that no non-initial
>> > user namespace actually receives uevents we need to:
>> > 1. only sent uevents to uevent sockets that are located in network
>> >    namespaces that are owned by init_user_ns
>> > 2. filter uevents that are sent to sockets in mc_list that have opened a
>> >    uevent socket that is owned by init_user_ns *from* a
>> >    non-init_user_ns
>> >
>> > We account for 1. by only recording uevent sockets in the global uevent
>> > socket list who are owned by init_user_ns.
>> > But to account for 2. we need to filter by the user namespace who owns
>> > the socket in mc_list. So in addition to that we also need to slightly
>> > change the filter logic in kobj_bcast_filter() I think:
>> >
>> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> > index 22a2c1a98b8f..064d7d29ace5 100644
>> > --- a/lib/kobject_uevent.c
>> > +++ b/lib/kobject_uevent.c
>> > @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>> >  		return sock_ns != ns;
>> >  	}
>> >  
>> > -	return 0;
>> > + 	/* Check if socket was opened from non-initial user namespace. */
>> > + 	return sk_user_ns(dsk) != &init_user_ns;
>> >  }
>> >  #endif
>> >  
>> >
>> > But correct me if I'm wrong.
>> 
>> You are worrying about NETLINK_LISTEN_ALL_NSID sockets.  That has
>> permissions and an explicit opt-in to receiving packets from multiple
>> network namespaces.
>
> I don't think that's what I'm talking about unless that is somehow the
> default for NETLINK_KOBJECT_UEVENT sockets. What I'm worried about is
> doing
>
> unshare -U --map-root
>
> then opening a NETLINK_KOBJECT_UEVENT socket and starting to listen to
> uevents. Imho, this should not be possible because I'm in a
> non-init_user_ns. But currently I'm able to - even with the patch to
> come - since the uevent socket in the kernel was created when init_net
> was created and hence is *owned* by the init_user_ns which means it is
> in the list of uevent sockets. Here's a demo of what I mean:
>
> https://asciinema.org/a/175632

Why do you care about this case?

Everyone is allowed to listen to the uevent netlink sockets with or
without user namespaces.  So there are no permission issues, and
this is not even a data information leak.

If you don't want programs in your user namespace to have access you
will be able to unshare the network namespace.

Eric

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next] netns: filter uevents correctly
  2018-04-11 18:37                         ` Eric W. Biederman
@ 2018-04-11 18:57                           ` Christian Brauner
  2018-04-11 19:16                             ` Eric W. Biederman
  0 siblings, 1 reply; 26+ messages in thread
From: Christian Brauner @ 2018-04-11 18:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge

On Wed, Apr 11, 2018 at 01:37:18PM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
> 
> > On Wed, Apr 11, 2018 at 11:40:14AM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> > Yeah, agreed.
> >> > But I think the patch is not complete. To guarantee that no non-initial
> >> > user namespace actually receives uevents we need to:
> >> > 1. only sent uevents to uevent sockets that are located in network
> >> >    namespaces that are owned by init_user_ns
> >> > 2. filter uevents that are sent to sockets in mc_list that have opened a
> >> >    uevent socket that is owned by init_user_ns *from* a
> >> >    non-init_user_ns
> >> >
> >> > We account for 1. by only recording uevent sockets in the global uevent
> >> > socket list who are owned by init_user_ns.
> >> > But to account for 2. we need to filter by the user namespace who owns
> >> > the socket in mc_list. So in addition to that we also need to slightly
> >> > change the filter logic in kobj_bcast_filter() I think:
> >> >
> >> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> > index 22a2c1a98b8f..064d7d29ace5 100644
> >> > --- a/lib/kobject_uevent.c
> >> > +++ b/lib/kobject_uevent.c
> >> > @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >> >  		return sock_ns != ns;
> >> >  	}
> >> >  
> >> > -	return 0;
> >> > + 	/* Check if socket was opened from non-initial user namespace. */
> >> > + 	return sk_user_ns(dsk) != &init_user_ns;
> >> >  }
> >> >  #endif
> >> >  
> >> >
> >> > But correct me if I'm wrong.
> >> 
> >> You are worrying about NETLINK_LISTEN_ALL_NSID sockets.  That has
> >> permissions and an explicit opt-in to receiving packets from multiple
> >> network namespaces.
> >
> > I don't think that's what I'm talking about unless that is somehow the
> > default for NETLINK_KOBJECT_UEVENT sockets. What I'm worried about is
> > doing
> >
> > unshare -U --map-root
> >
> > then opening a NETLINK_KOBJECT_UEVENT socket and starting to listen to
> > uevents. Imho, this should not be possible because I'm in a
> > non-init_user_ns. But currently I'm able to - even with the patch to
> > come - since the uevent socket in the kernel was created when init_net
> > was created and hence is *owned* by the init_user_ns which means it is
> > in the list of uevent sockets. Here's a demo of what I mean:
> >
> > https://asciinema.org/a/175632
> 
> Why do you care about this case?

It's not so much that I care about this case since any workload that
wants to run a separate udevd will have to unshare the network namespace
and the user namespace for it to make complete sense.
What I do care about is that the two of us are absolutely in the clear
about what semantics we are going to expose to userspace and it seems
that we were talking past each other wrt to this "corner case".
For userspace, it needs to be very clear that the intention is to filter
by *owning user namespace of the network namespace a given task resides
in* and not by user namespace of the task per se. This is what this
corner case basically shows, I think.

Christian

> 
> Everyone is allowed to listen to the uevent netlink sockets with or
> without user namespaces.  So there are no permission issues, and
> this is not even a data information leak.
> 
> If you don't want programs in your user namespace to have access you
> will be able to unshare the network namespace.
> 
> Eric

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next] netns: filter uevents correctly
  2018-04-11 18:57                           ` Christian Brauner
@ 2018-04-11 19:16                             ` Eric W. Biederman
  2018-04-11 19:57                               ` Christian Brauner
  0 siblings, 1 reply; 26+ messages in thread
From: Eric W. Biederman @ 2018-04-11 19:16 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge

Christian Brauner <christian.brauner@canonical.com> writes:

> On Wed, Apr 11, 2018 at 01:37:18PM -0500, Eric W. Biederman wrote:
>> Christian Brauner <christian.brauner@canonical.com> writes:
>> 
>> > On Wed, Apr 11, 2018 at 11:40:14AM -0500, Eric W. Biederman wrote:
>> >> Christian Brauner <christian.brauner@canonical.com> writes:
>> >> > Yeah, agreed.
>> >> > But I think the patch is not complete. To guarantee that no non-initial
>> >> > user namespace actually receives uevents we need to:
>> >> > 1. only sent uevents to uevent sockets that are located in network
>> >> >    namespaces that are owned by init_user_ns
>> >> > 2. filter uevents that are sent to sockets in mc_list that have opened a
>> >> >    uevent socket that is owned by init_user_ns *from* a
>> >> >    non-init_user_ns
>> >> >
>> >> > We account for 1. by only recording uevent sockets in the global uevent
>> >> > socket list who are owned by init_user_ns.
>> >> > But to account for 2. we need to filter by the user namespace who owns
>> >> > the socket in mc_list. So in addition to that we also need to slightly
>> >> > change the filter logic in kobj_bcast_filter() I think:
>> >> >
>> >> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> >> > index 22a2c1a98b8f..064d7d29ace5 100644
>> >> > --- a/lib/kobject_uevent.c
>> >> > +++ b/lib/kobject_uevent.c
>> >> > @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>> >> >  		return sock_ns != ns;
>> >> >  	}
>> >> >  
>> >> > -	return 0;
>> >> > + 	/* Check if socket was opened from non-initial user namespace. */
>> >> > + 	return sk_user_ns(dsk) != &init_user_ns;
>> >> >  }
>> >> >  #endif
>> >> >  
>> >> >
>> >> > But correct me if I'm wrong.
>> >> 
>> >> You are worrying about NETLINK_LISTEN_ALL_NSID sockets.  That has
>> >> permissions and an explicit opt-in to receiving packets from multiple
>> >> network namespaces.
>> >
>> > I don't think that's what I'm talking about unless that is somehow the
>> > default for NETLINK_KOBJECT_UEVENT sockets. What I'm worried about is
>> > doing
>> >
>> > unshare -U --map-root
>> >
>> > then opening a NETLINK_KOBJECT_UEVENT socket and starting to listen to
>> > uevents. Imho, this should not be possible because I'm in a
>> > non-init_user_ns. But currently I'm able to - even with the patch to
>> > come - since the uevent socket in the kernel was created when init_net
>> > was created and hence is *owned* by the init_user_ns which means it is
>> > in the list of uevent sockets. Here's a demo of what I mean:
>> >
>> > https://asciinema.org/a/175632
>> 
>> Why do you care about this case?
>
> It's not so much that I care about this case since any workload that
> wants to run a separate udevd will have to unshare the network namespace
> and the user namespace for it to make complete sense.
> What I do care about is that the two of us are absolutely in the clear
> about what semantics we are going to expose to userspace and it seems
> that we were talking past each other wrt to this "corner case".
> For userspace, it needs to be very clear that the intention is to filter
> by *owning user namespace of the network namespace a given task resides
> in* and not by user namespace of the task per se. This is what this
> corner case basically shows, I think.

If this is just a clarification of semantics then yes this is a
productive question.  I almost agree with your definition above.

I would make the definition very simple.  Uevents will not be broadcast
via netlink in a network namespace where net->user_ns != &init_user_ns,
with the exception of uevents for network devices in that network
namespace.

The existing filtering by the sending uid and verifying that it is uid 0
gives a little more room to filter if we want (as udev & friends will
ignore the uevent), but I don't see the point.

Eric

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next] netns: filter uevents correctly
  2018-04-11 19:16                             ` Eric W. Biederman
@ 2018-04-11 19:57                               ` Christian Brauner
  0 siblings, 0 replies; 26+ messages in thread
From: Christian Brauner @ 2018-04-11 19:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge

On Wed, Apr 11, 2018 at 02:16:23PM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
> 
> > On Wed, Apr 11, 2018 at 01:37:18PM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> 
> >> > On Wed, Apr 11, 2018 at 11:40:14AM -0500, Eric W. Biederman wrote:
> >> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> >> > Yeah, agreed.
> >> >> > But I think the patch is not complete. To guarantee that no non-initial
> >> >> > user namespace actually receives uevents we need to:
> >> >> > 1. only sent uevents to uevent sockets that are located in network
> >> >> >    namespaces that are owned by init_user_ns
> >> >> > 2. filter uevents that are sent to sockets in mc_list that have opened a
> >> >> >    uevent socket that is owned by init_user_ns *from* a
> >> >> >    non-init_user_ns
> >> >> >
> >> >> > We account for 1. by only recording uevent sockets in the global uevent
> >> >> > socket list who are owned by init_user_ns.
> >> >> > But to account for 2. we need to filter by the user namespace who owns
> >> >> > the socket in mc_list. So in addition to that we also need to slightly
> >> >> > change the filter logic in kobj_bcast_filter() I think:
> >> >> >
> >> >> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> >> > index 22a2c1a98b8f..064d7d29ace5 100644
> >> >> > --- a/lib/kobject_uevent.c
> >> >> > +++ b/lib/kobject_uevent.c
> >> >> > @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >> >> >  		return sock_ns != ns;
> >> >> >  	}
> >> >> >  
> >> >> > -	return 0;
> >> >> > + 	/* Check if socket was opened from non-initial user namespace. */
> >> >> > + 	return sk_user_ns(dsk) != &init_user_ns;
> >> >> >  }
> >> >> >  #endif
> >> >> >  
> >> >> >
> >> >> > But correct me if I'm wrong.
> >> >> 
> >> >> You are worrying about NETLINK_LISTEN_ALL_NSID sockets.  That has
> >> >> permissions and an explicit opt-in to receiving packets from multiple
> >> >> network namespaces.
> >> >
> >> > I don't think that's what I'm talking about unless that is somehow the
> >> > default for NETLINK_KOBJECT_UEVENT sockets. What I'm worried about is
> >> > doing
> >> >
> >> > unshare -U --map-root
> >> >
> >> > then opening a NETLINK_KOBJECT_UEVENT socket and starting to listen to
> >> > uevents. Imho, this should not be possible because I'm in a
> >> > non-init_user_ns. But currently I'm able to - even with the patch to
> >> > come - since the uevent socket in the kernel was created when init_net
> >> > was created and hence is *owned* by the init_user_ns which means it is
> >> > in the list of uevent sockets. Here's a demo of what I mean:
> >> >
> >> > https://asciinema.org/a/175632
> >> 
> >> Why do you care about this case?
> >
> > It's not so much that I care about this case since any workload that
> > wants to run a separate udevd will have to unshare the network namespace
> > and the user namespace for it to make complete sense.
> > What I do care about is that the two of us are absolutely in the clear
> > about what semantics we are going to expose to userspace and it seems
> > that we were talking past each other wrt to this "corner case".
> > For userspace, it needs to be very clear that the intention is to filter
> > by *owning user namespace of the network namespace a given task resides
> > in* and not by user namespace of the task per se. This is what this
> > corner case basically shows, I think.
> 
> If this is just a clarification of semantics then yes this is a
> productive question.  I almost agree with your definition above.
> 
> I would make the definition very simple.  Uevents will not be broadcast
> via netlink in a network namespace where net->user_ns != &init_user_ns,
> with the exception of uevents for network devices in that network
> namespace.

Well, for the sake of posterity :) I should add that I'd prefer we'd add
what I suggested above:

-	return 0;
+ 	/* Check if socket was opened from non-initial user namespace. */
+ 	return sk_user_ns(dsk) != &init_user_ns;
 }

to slam the door shut once and for all for all non-init_user_ns
namespaces because it *seems* like the cleanest solution: uevents are
owned by init_user_ns; period. Because it is the only user namespace
that can do anything interesting with them *by default*.
But what we have now right now with my upcoming patch is at least
sufficient and safe.

Christian

> 
> The existing filtering by the sending uid and verifying that it is uid 0
> gives a little more room to filter if we want (as udev & friends will
> ignore the uevent), but I don't see the point.
> 
> Eric

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2018-04-11 19:57 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-04 19:48 [PATCH net-next] netns: filter uevents correctly Christian Brauner
2018-04-04 20:30 ` [PATCH net] " Christian Brauner
2018-04-04 22:38   ` Eric W. Biederman
2018-04-05  1:27     ` Christian Brauner
2018-04-06  2:02       ` David Miller
2018-04-05  1:35     ` Christian Brauner
2018-04-05 13:01 ` [PATCH net-next] " Kirill Tkhai
2018-04-05 14:07   ` Christian Brauner
2018-04-05 14:26     ` Kirill Tkhai
2018-04-05 14:41       ` Christian Brauner
2018-04-06  3:59         ` Eric W. Biederman
2018-04-06 13:07           ` Christian Brauner
2018-04-06 14:45             ` Eric W. Biederman
2018-04-06 16:07               ` Christian Brauner
2018-04-06 16:48                 ` Eric W. Biederman
2018-04-09 15:46           ` Christian Brauner
2018-04-09 23:21             ` Eric W. Biederman
2018-04-10 14:35               ` Christian Brauner
2018-04-10 15:04                 ` Eric W. Biederman
2018-04-11  9:09                   ` Christian Brauner
2018-04-11 16:40                     ` Eric W. Biederman
2018-04-11 17:03                       ` Christian Brauner
2018-04-11 18:37                         ` Eric W. Biederman
2018-04-11 18:57                           ` Christian Brauner
2018-04-11 19:16                             ` Eric W. Biederman
2018-04-11 19:57                               ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).