Re: [libvirt] opening tap devices that are created in a container

From: Roman Mohr <rmohr@redhat.com>
To: Martin Kletzander <mkletzan@redhat.com>
Cc: fabiand@sni.github.map.fastly.net, libvir-list@redhat.com,
	netdev@vger.kernel.org, jbaron@akamai.com, ebiederm@xmission.com,
	davem@davemloft.net, laine@laine.org
Subject: Re: [libvirt] opening tap devices that are created in a container
Date: Tue, 17 Jul 2018 13:58:21 +0200
Message-ID: <CALDPj7v-bmAWXWAVBC5ALtEc0fdDKO9=dnHSOscMPGUL221J1Q@mail.gmail.com>
In-Reply-To: <20180711101005.GA13392@wheatley>


On Wed, Jul 11, 2018 at 12:10 PM <nert@wheatley> wrote:

> On Mon, Jul 09, 2018 at 05:00:49PM -0400, Jason Baron wrote:
> >
> >
> >On 07/08/2018 02:01 AM, Martin Kletzander wrote:
> >> On Thu, Jul 05, 2018 at 06:24:20PM +0200, Roman Mohr wrote:
> >>> On Thu, Jul 5, 2018 at 4:20 PM Jason Baron <jbaron@akamai.com> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Opening tap devices, such as macvtap, that are created in containers is
> >>>> problematic because the interface for opening tap devices is via
> >>>> /dev/tapNN and devtmpfs is not typically mounted inside a container, as
> >>>> it's not namespace aware. It is possible to do a mknod() in the
> >>>> container once the tap devices are created. However, since the tap
> >>>> devices are created dynamically, it's not possible to allow access to
> >>>> certain major/minor numbers a priori, since we don't know what these are
> >>>> going to be. In addition, it's desirable to not allow the mknod capability
> >>>> in containers. This behavior, I think, is somewhat inconsistent with the
> >>>> tuntap driver, where one can create tuntap devices inside a container by
> >>>> first opening /dev/net/tun and then using them by supplying the tuntap
> >>>> device name via the ioctl(TUNSETIFF). And since TUNSETIFF validates the
> >>>> network namespace, one is limited to opening network devices that belong
> >>>> to your current network namespace.
> >>>>
> >>>> Here are some options to this issue, that I wanted to get feedback
> >>>> about, and just wondering if anybody else has run into this.
> >>>>
> >>>> 1)
> >>>>
> >>>> Don't create the tap device, such as macvtap, in the container. Instead,
> >>>> create the tap device outside of the container and then move it into the
> >>>> desired container network namespace. In addition, do a mknod() for the
> >>>> corresponding /dev/tapNN device from outside the container before doing
> >>>> chroot().
> >>>>
> >>>> This solution still doesn't allow tap devices to be created inside the
> >>>> container. Thus, in the case of kubevirt, which runs libvirtd inside of
> >>>> a container, it would mean changing libvirtd to open existing tap
> >>>> devices (as opposed to the current behavior of creating new ones). This
> >>>> would not require any kernel changes, but as mentioned seems
> >>>> inconsistent with the tuntap interface.
> >>>>
> >>>
> >>> For KubeVirt, apart from how exactly the device ends up in the container,
> >>> I would want to pursue a way where all network preparations which require
> >>> privileges happen from a privileged process *outside* of the container,
> >>> like CNI solutions do it. They run outside, have privileges and then
> >>> create devices in the right network/mount namespace or move them there.
> >>> The final goal for KubeVirt is that our pod with the qemu process is
> >>> completely unprivileged and privileged setup happens from outside.
> >>>
> >>> As a consequence, and depending on which route Dan pursues with the
> >>> restructured libvirt, I would assume that either a privileged
> >>> libvirtd-part outside of containers creates the devices by entering the
> >>> right namespaces, or that libvirt in the container can consume pre-created
> >>> tun/tap devices, like qemu.
> >>>
> >>
> >> That would be nice, but as far as I understand there will always be a need
> >> for some privileges if you want to use a tap device.  It's nice that CNI
> >> does that and all the containers can run unprivileged, but that's because
> >> they do not open the tap device and they do not do any privileged
> >> operations on it.  But QEMU needs to.  So the only way would be passing an
> >> opened fd to the container or opening the tap device there and making the
> >> fd usable for one process in the container.  Is this already supported for
> >> some type of containers in some way?
> >>
> >> Martin
> >
> >Hi,
> >
> >So another option here, call it #3, is to pass open fds via unix sockets.
> >If there are privileged operations that QEMU is trying to do with the fd
> >though, how will opening it first and then passing it to an unprivileged
> >QEMU address that? Is the opener doing those operations first?
> >
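
For reference, option #3 could look roughly like the sketch below. It is
only an illustration of SCM_RIGHTS fd passing over a unix socket, not
what libvirt or KubeVirt actually do; the helper names send_fd() and
recv_fd() are made up for this mail.

```python
import array
import socket


def send_fd(sock, fd):
    """Send one open file descriptor over a connected AF_UNIX socket."""
    # At least one byte of ordinary data has to accompany the control message.
    ancillary = (socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))
    sock.sendmsg([b"F"], [ancillary])


def recv_fd(sock):
    """Receive one file descriptor sent with send_fd()."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Drop any trailing partial integer before decoding.
            usable = len(data) - (len(data) % fds.itemsize)
            fds.frombytes(data[:usable])
    if not fds:
        raise RuntimeError("no file descriptor received")
    return fds[0]
```

The privileged side would open the device and call send_fd(); the
unprivileged side gets a new descriptor referring to the same open file
from recv_fd().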
>
> Sorry for the confusion, but QEMU is not doing any privileged operations.
> I got confused by the fact that anyone can open and do a R/W on a tap
> device.  But it looks like that's on purpose.  No capabilities are needed
> for opening /dev/net/tun and calling ioctl(TUNSETIFF) with an existing name
> and then doing R/W operations on it.  It just works.
>
> Correct me if I'm wrong, but to sum it all up, the only things that we need
> to figure out (which might possibly be solved by ideas in the other thread)
> are:
>
> tap:
> - Existence of /dev/net/tun
> - Having permissions to open it (0666 by default, shouldn't be a big deal)
> - Knowing the device name
>
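
A minimal sketch of the tap case, assuming the device already exists in
our network namespace and that nothing else (e.g. an owner set on the
device) gets in the way. The TUNSETIFF value assumes the common ioctl
encoding used on x86 and arm, and the 40-byte ifreq layout assumes
64-bit Linux; build_ifreq() and open_existing_tap() are names I made up.

```python
import fcntl
import os
import struct

# Values from <linux/if_tun.h> (assumed: x86/arm ioctl encoding).
TUNSETIFF = 0x400454CA
IFF_TAP = 0x0002
IFF_NO_PI = 0x1000
IFNAMSIZ = 16


def build_ifreq(name, flags):
    """Pack a struct ifreq: 16-byte name, short flags, then padding."""
    encoded = name.encode("ascii")
    if len(encoded) >= IFNAMSIZ:
        raise ValueError("interface name too long: %r" % name)
    # 16 + 2 + 22 = 40 bytes, the size of struct ifreq on 64-bit Linux.
    return struct.pack("16sH22x", encoded, flags)


def open_existing_tap(name):
    """Open /dev/net/tun and attach the fd to the tap device `name`."""
    fd = os.open("/dev/net/tun", os.O_RDWR)
    try:
        fcntl.ioctl(fd, TUNSETIFF, build_ifreq(name, IFF_TAP | IFF_NO_PI))
    except OSError:
        os.close(fd)
        raise
    return fd
```

The returned fd can then be read and written, or handed to qemu.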
> macvtap:
> - Existence of /dev/tapXX
> - Having permissions to open /dev/tapXX
> - One of the following:
>   - Knowing the device name (and being able to translate it using a
>     netlink socket)
>   - Knowing the device index
>
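
And a sketch of the macvtap case, assuming the /dev/tapXX node is already
present in the container and that XX is the interface index. I use
socket.if_nametoindex() for the name lookup here instead of talking
netlink directly; it resolves the name in the caller's network
namespace. macvtap_device_path() and open_macvtap() are made-up names.

```python
import os
import socket


def macvtap_device_path(ifindex):
    """Return the character device path for a macvtap interface index."""
    if ifindex <= 0:
        raise ValueError("invalid interface index: %r" % ifindex)
    return "/dev/tap%d" % ifindex


def open_macvtap(name):
    """Translate the device name to its index and open /dev/tapXX."""
    # Raises OSError if no such interface exists in this namespace.
    ifindex = socket.if_nametoindex(name)
    # Assumption: the device node was created for us beforehand.
    return os.open(macvtap_device_path(ifindex), os.O_RDWR)
```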
> The rest should be an implementation detail.
>
> Am I right?  Did I miss anything?

At least for the KubeVirt use case, those sound like the things we would
need in order to handle the networking setup the same way the Container
Network Interface implementations handle it in k8s.

Best Regards,
Roman
