All of lore.kernel.org
 help / color / mirror / Atom feed
From: Christian Brauner <christian.brauner@canonical.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: David Miller <davem@davemloft.net>,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	avagin@virtuozzo.com, ktkhai@virtuozzo.com, serge@hallyn.com,
	gregkh@linuxfoundation.org
Subject: Re: [PATCH net-next 1/2 v2] netns: restrict uevents
Date: Fri, 27 Apr 2018 10:41:58 +0200	[thread overview]
Message-ID: <20180427084157.GA29044@gmail.com> (raw)
In-Reply-To: <87vacdbi58.fsf@xmission.com>

On Thu, Apr 26, 2018 at 07:35:47PM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
> 
> > On Thu, Apr 26, 2018 at 12:10:30PM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> 
> >> > On Thu, Apr 26, 2018 at 11:47:19AM -0500, Eric W. Biederman wrote:
> >> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> >> 
> >> >> > On Tue, Apr 24, 2018 at 06:00:35PM -0500, Eric W. Biederman wrote:
> >> >> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> >> >> 
> >> >> >> > On Wed, Apr 25, 2018, 00:41 Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> >> >> >
> >> >> >> >  Bah. This code is obviously correct and probably wrong.
> >> >> >> >
> >> >> >> >  How do we deliver uevents for network devices that are outside of the
> >> >> >> >  initial user namespace? The kernel still needs to deliver those.
> >> >> >> >
> >> >> >> >  The logic to figure out which network namespace a device needs to be
> >> >> >> >  delivered to is is present in kobj_bcast_filter. That logic will almost
> >> >> >> >  certainly need to be turned inside out. Sign not as easy as I would
> >> >> >> >  have hoped.
> >> >> >> >
> >> >> >> > My first patch that we discussed put additional filtering logic into kobj_bcast_filter for that very reason. But I can move that logic
> >> >> >> > out and come up with a new patch.
> >> >> >> 
> >> >> >> I may have mis-understood.
> >> >> >> 
> >> >> >> I heard and am still hearing additional filtering to reduce the places
> >> >> >> the packet is delievered.
> >> >> >> 
> >> >> >> I am saying something needs to change to increase the number of places
> >> >> >> the packet is delivered.
> >> >> >> 
> >> >> >> For the special class of devices that kobj_bcast_filter would apply to
> >> >> >> those need to be delivered to netowrk namespaces  that are no longer on
> >> >> >> uevent_sock_list.
> >> >> >> 
> >> >> >> So the code fundamentally needs to split into two paths.  Ordinary
> >> >> >> devices that use uevent_sock_list.  Network devices that are just
> >> >> >> delivered in their own network namespace.
> >> >> >> 
> >> >> >> netlink_broadcast_filtered gets to go away completely.
> >> >> >
> >> >> > The split *might* make sense but I think you're wrong about removing the
> >> >> > kobj_bcast_filter. The current filter doesn't operate on the uevent
> >> >> > socket in uevent_sock_list itself it rather operates on the sockets in
> >> >> > mc_list. And if socket in mc_list can have a different network namespace
> >> >> > then the uevent_socket itself then your way won't work. That's why my
> >> >> > original patch added additional filtering in there. The way I see it we
> >> >> > need something like:
> >> >> 
> >> >> We already filter the sockets in the mc_list by network namespace.
> >> >
> >> > Oh really? That's good to know. I haven't found where in the code this
> >> > actually happens. I thought that when netlink_bind() is called anyone
> >> > could register themselves in mc_list.
> >> 
> >> The code in af_netlink.c does:
> >> > static void do_one_broadcast(struct sock *sk,
> >> > 				    struct netlink_broadcast_data *p)
> >> > {
> >> > 	struct netlink_sock *nlk = nlk_sk(sk);
> >> > 	int val;
> >> > 
> >> > 	if (p->exclude_sk == sk)
> >> > 		return;
> >> > 
> >> > 	if (nlk->portid == p->portid || p->group - 1 >= nlk->ngroups ||
> >> > 	    !test_bit(p->group - 1, nlk->groups))
> >> > 		return;
> >> > 
> >> > 	if (!net_eq(sock_net(sk), p->net)) {
> >>             ^^^^^^^^^^^^ Here
> >> > 		if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >> > 			return;
> >>                 ^^^^^^^^^^^ Here
> >> > 
> >> > 		if (!peernet_has_id(sock_net(sk), p->net))
> >> > 			return;
> >> > 
> >> > 		if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> > 				     CAP_NET_BROADCAST))
> >> > 			return;
> >> > 	}
> >> 
> >> Which if you are not a magic NETLINK_F_LISTEN_ALL_NSID socket filters
> >> you out if you are the wrong network namespace.
> >> 
> >> 
> >> >> When a packet is transmitted with netlink_broadcast it is only
> >> >> transmitted within a single network namespace.
> >> >> 
> >> >> Even in the case of a NETLINK_F_LISTEN_ALL_NSID socket the skb is tagged
> >> >> with it's source network namespace so no confusion will result, and the
> >> >> permission checks have been done to make it safe. So you can safely
> >> >> ignore that case.  Please ignore that case.  It only needs to be
> >> >> considered if refactoring af_netlink.c
> >> >> 
> >> >> When I added netlink_broadcast_filtered I imagined that we would need
> >> >> code that worked across network namespaces that worked for different
> >> >> namespaces.   So it looked like we would need the level of granularity
> >> >> that you can get with netlink_broadcast_filtered.  It turns out we don't
> >> >> and that it was a case of over design.  As the only split we care about
> >> >> is per network namespace there is no need for
> >> >> netlink_broadcast_filtered.
> >> >> 
> >> >> > init_user_ns_broadcast_filtered(uevent_sock_list, kobj_bcast_filter);
> >> >> > user_ns_broadcast_filtered(uevent_sock_list,kobj_bcast_filter);
> >> >> >
> >> >> > The question that remains is whether we can rely on the network
> >> >> > namespace information we can gather from the kobject_ns_type_operations
> >> >> > to decide where we want to broadcast that event to. So something
> >> >> > *like*:
> >> >> 
> >> >> We can.  We already do.  That is what kobj_bcast_filter implements.
> >> >> 
> >> >> > 	ops = kobj_ns_ops(kobj);
> >> >> > 	if (!ops && kobj->kset) {
> >> >> > 		struct kobject *ksobj = &kobj->kset->kobj;
> >> >> > 		if (ksobj->parent != NULL)
> >> >> > 			ops = kobj_ns_ops(ksobj->parent);
> >> >> > 	}
> >> >> >
> >> >> > 	if (ops && ops->netlink_ns && kobj->ktype->namespace)
> >> >> > 		if (ops->type == KOBJ_NS_TYPE_NET)
> >> >> > 			net = kobj->ktype->namespace(kobj);
> >> >> 
> >> >> Please note the only entry in the enumeration in the kobj_ns_type
> >> >> enumeration other than KOBJ_NS_TYPE_NONE is KOBJ_NS_TYPE_NET.  So the
> >> >> check for ops->type in this case is redundant.
> >> >
> >> > Yes, I know the reason for doing it explicitly is to block the case
> >> > where kobjects get tagged with other namespaces. So we'd need to be
> >> > vigilant should that ever happen but fine.
> >> 
> >> It is fine to keep the check.
> >> 
> >> I was intending to point out that it is much more likely that we remove
> >> the enumeration and remove some of the extra abstraction, than another
> >> namespace is implemented there.
> >> 
> >> >> That is something else that could be simplifed.  At the time it was the
> >> >> necessary to get the sysfs changes merged.
> >> >> 
> >> >> > 	if (!net || net->user_ns == &init_user_ns)
> >> >> > 		ret = init_user_ns_broadcast(env, action_string, devpath);
> >> >> > 	else
> >> >> > 		ret = user_ns_broadcast(net->uevent_sock->sk, env,
> >> >> > 					action_string, devpath);
> >> >> 
> >> >> Almost.
> >> >> 
> >> >> 	if (!net)
> >> >>         	kobject_uevent_net_broadcast(kobj, env, action_string,
> >> >>         					dev_path);
> >> >> 	else
> >> >>         	netlink_broadcast(net->uevent_sock->sk, skb, 0, 1, GFP_KERNEL);
> >> >> 
> >> >> 
> >> >> I am handwaving to get the skb in the netlink_broadcast case but that
> >> >> should be enough for you to see what I am thinking.
> >> >
> >> > I have added a helper alloc_uevent_skb() that can be used in both cases.
> >> >
> >> > static struct sk_buff *alloc_uevent_skb(struct kobj_uevent_env *env,
> >> > 					const char *action_string,
> >> > 					const char *devpath)
> >> > {
> >> > 	struct sk_buff *skb = NULL;
> >> > 	char *scratch;
> >> > 	size_t len;
> >> >
> >> > 	/* allocate message with maximum possible size */
> >> > 	len = strlen(action_string) + strlen(devpath) + 2;
> >> > 	skb = alloc_skb(len + env->buflen, GFP_KERNEL);
> >> > 	if (!skb)
> >> > 		return NULL;
> >> >
> >> > 	/* add header */
> >> > 	scratch = skb_put(skb, len);
> >> > 	sprintf(scratch, "%s@%s", action_string, devpath);
> >> >
> >> > 	skb_put_data(skb, env->buf, env->buflen);
> >> >
> >> > 	NETLINK_CB(skb).dst_group = 1;
> >> >
> >> > 	return skb;
> >> > }
> >> >
> >> >> 
> >> >> My only concern with the above is that we almost certainly need to fix
> >> >> the credentials on the skb so that userspace does not drop the packet
> >
> > I guess we simply want:
> > 	if (user_ns != &init_user_ns) {
> > 		NETLINK_CB(skb).creds.uid = (kuid_t)0;
> > 		NETLINK_CB(skb).creds.gid = kgid_t)0;
> > 	}
> 
> I believe the above is what we already have.
> 
> > instead of the more complicated and - imho wrong:
> >
> > 	if (user_ns != &init_user_ns) {
> > 		/* fix credentials for udev running in user namespace */
> > 		kuid_t uid = NETLINK_CB(skb).creds.uid;
> > 		kgid_t gid = NETLINK_CB(skb).creds.gid;
> > 		NETLINK_CB(skb).creds.uid = from_kuid_munged(user_ns, uid);
> > 		NETLINK_CB(skb).creds.gid = from_kgid_munged(user_ns, gid);
> > 	}
> 
> The above is most definitely wrong as we store kuids and kgids in
> "NETLINK_CB(skb).creds".
> 
> I am pretty certain what we want is:
> 	kuid_t root_uid = make_kuid(net->user_ns, 0);
>         kgid_g root_gid = make_kgid(net->user_ns, 0);

Thanks! I looked at user_namespace.c which contained map_id_down() which
is the function that I wanted and remembered from a prior patchset of
mine but they weren't exported. I didn't spot make_k{g,u}id() which are
wrapping those. These are the droids^H^H^H^H^H^Hfunctions I was looking
for!

>         if (!uid_valid(root_uid))
>         	root_uid = GLOBAL_ROOT_UID;
> 	if (!gid_valid(root_gid))
>         	root_gid = GLOBAL_ROOT_GID;
> 	NETLINK_CB(skb).creds.uid = root_uid;
>         NETLINK_CB(skb).creds.gid = root_gid;
> 
> Let's be careful and only fix this for the networking uevents please.
> We want the other onces to just go away.

This is already handled by the

if (!net)
       handle_untagged_uevents()
else
       handle_taggged_uevents()

The else branch will only every contain network devices as to my
knowledge no other kernel devices are currently tagged.

Thanks!
Christian

> 
> The networking uevents we have to fix or they will be gone completely.
> 
> Eric

  reply	other threads:[~2018-04-27  8:41 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-04-24 20:43 [PATCH net-next 0/2 v2] netns: uevent performance tweaks Christian Brauner
2018-04-24 20:43 ` [PATCH net-next 1/2 v2] netns: restrict uevents Christian Brauner
2018-04-24 21:54   ` Eric W. Biederman
2018-04-24 21:54     ` Eric W. Biederman
2018-04-24 22:40   ` Eric W. Biederman
2018-04-24 22:40     ` Eric W. Biederman
2018-04-24 22:47     ` Christian Brauner
2018-04-24 23:00       ` Eric W. Biederman
2018-04-24 23:00         ` Eric W. Biederman
2018-04-26 16:13         ` Christian Brauner
2018-04-26 16:47           ` Eric W. Biederman
2018-04-26 16:47             ` Eric W. Biederman
2018-04-26 17:03             ` Christian Brauner
2018-04-26 17:10               ` Eric W. Biederman
2018-04-26 17:10                 ` Eric W. Biederman
2018-04-26 21:27                 ` Christian Brauner
2018-04-27  0:35                   ` Eric W. Biederman
2018-04-27  0:35                     ` Eric W. Biederman
2018-04-27  8:41                     ` Christian Brauner [this message]
2018-04-24 22:54     ` Christian Brauner
2018-04-24 20:43 ` [PATCH net-next 2/2 v2] netns: isolate seqnums to use per-netns locks Christian Brauner
2018-04-24 21:52   ` Eric W. Biederman
2018-04-24 22:20     ` Christian Brauner
2018-04-24 22:52       ` Eric W. Biederman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180427084157.GA29044@gmail.com \
    --to=christian.brauner@canonical.com \
    --cc=avagin@virtuozzo.com \
    --cc=davem@davemloft.net \
    --cc=ebiederm@xmission.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=ktkhai@virtuozzo.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=serge@hallyn.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.