All of lore.kernel.org
 help / color / mirror / Atom feed
From: Roman Mohr <rmohr@redhat.com>
To: Martin Kletzander <mkletzan@redhat.com>
Cc: fabiand@sni.github.map.fastly.net, libvir-list@redhat.com,
	netdev@vger.kernel.org, jbaron@akamai.com, ebiederm@xmission.com,
	davem@davemloft.net, laine@laine.org
Subject: Re: [libvirt] opening tap devices that are created in a container
Date: Tue, 17 Jul 2018 13:58:21 +0200	[thread overview]
Message-ID: <CALDPj7v-bmAWXWAVBC5ALtEc0fdDKO9=dnHSOscMPGUL221J1Q@mail.gmail.com> (raw)
In-Reply-To: <20180711101005.GA13392@wheatley>


[-- Attachment #1.1: Type: text/plain, Size: 5460 bytes --]

On Wed, Jul 11, 2018 at 12:10 PM <nert@wheatley> wrote:

> On Mon, Jul 09, 2018 at 05:00:49PM -0400, Jason Baron wrote:
> >
> >
> >On 07/08/2018 02:01 AM, Martin Kletzander wrote:
> >> On Thu, Jul 05, 2018 at 06:24:20PM +0200, Roman Mohr wrote:
> >>> On Thu, Jul 5, 2018 at 4:20 PM Jason Baron <jbaron@akamai.com> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Opening tap devices, such as macvtap, that are created in containers
> is
> >>>> problematic because the interface for opening tap devices is via
> >>>> /dev/tapNN and devtmpfs is not typically mounted inside a container as
> >>>> its not namespace aware. It is possible to do a mknod() in the
> >>>> container, once the tap devices are created, however, since the tap
> >>>> devices are created dynamically its not possible to apriori allow
> access
> >>>> to certain major/minor numbers, since we don't know what these are
> going
> >>>> to be. In addition, its desirable to not allow the mknod capability in
> >>>> containers. This behavior, I think is somewhat inconsistent with the
> >>>> tuntap driver where one can create tuntap devices inside a container
> by
> >>>> first opening /dev/net/tun and then using them by supplying the tuntap
> >>>> device name via the ioctl(TUNSETIFF). And since TUNSETIFF validates
> the
> >>>> network namespace, one is limited to opening network devices that
> belong
> >>>> to your current network namespace.
> >>>>
> >>>> Here are some options to this issue, that I wanted to get feedback
> >>>> about, and just wondering if anybody else has run into this.
> >>>>
> >>>> 1)
> >>>>
> >>>> Don't create the tap device, such as macvtap in the container.
> Instead,
> >>>> create the tap device outside of the container and then move it into
> the
> >>>> desired container network namespace. In addition, do a mknod() for the
> >>>> corresponding /dev/tapNN device from outside the container before
> doing
> >>>> chroot().
> >>>>
> >>>> This solution still doesn't allow tap devices to be created inside the
> >>>> container. Thus, in the case of kubevirt, which runs libvirtd inside
> of
> >>>> a container, it would mean changing libvirtd to open existing tap
> >>>> devices (as opposed to the current behavior of creating new ones).
> This
> >>>> would not require any kernel changes, but as mentioned seems
> >>>> inconsistent with the tuntap interface.
> >>>>
> >>>
> >>> For KubeVirt, apart from how exactly the device ends up in the
> >>> container, I
> >>> would want to pursue a way where all network preparations which require
> >>> privileges happens from a privileged process *outside* of the
> container.
> >>> Like CNI solutions do it. They run outside, have privileges and then
> >>> create
> >>> devices in the right network/mount namespace or move them there. The
> >>> final
> >>> goal for KubeVirt is that our pod with the qemu process is completely
> >>> unprivileged and privileged setup happens from outside.
> >>>
> >>> As a consequence, and depending on which route Dan pursues with the
> >>> restructured libvirt, I would assume that either a privileged
> >>> libvirtd-part
> >>> outside of containers creates the devices by entering the right
> >>> namespaces,
> >>> or that libvirt in the container can consume pre-created tun/tap
> devices,
> >>> like qemu.
> >>>
> >>
> >> That would be nice, but as far as I understand there will always be a
> >> need for
> >> some privileges if you want to use a tap device.  It's nice that CNI
> >> does that
> >> and all the containers can run unprivileged, but that's because they do
> >> not open
> >> the tap device and they do not do any privileged operations on it.  But
> >> QEMU
> >> needs to.  So the only way would be passing an opened fd to the
> >> container or
> >> opening the tap device there and making the fd usable for one process in
> >> the
> >> container.  Is this already supported for some type of containers in
> >> some way?
> >>
> >> Martin
> >
> >Hi,
> >
> >So another option here call it #3 is to pass open fds via unix sockets.
> >If there are privileged operations that QEMU is trying to do with the fd
> >though, how will opening it first and then passing it to an unprivileged
> >QEMU address that? Is the opener doing those operations first?
> >
>
> Sorry for the confusion, but QEMU is not doing any privileged operations.
> I got
> confused by the fact that anyone can open and do a R/W on a tap device.
> But it
> looks like that's on purpose.  No capabilities are needed for opening
> /dev/net/tun and calling ioctl(TUNSETIFF) with existing name and then
> doing R/W
> operations on it.  It just works.
>
> Correct me if I'm wrong, but to sum it all up, the only things that we
> need to
> figure out (which might possibly be solved by ideas in the other thread)
> are:
>
> tap:
> - Existence of /dev/net/tun
> - Having permissions to open it (0666 by default, shouldn't be a nig deal)
> - Knowing the device name
>
> macvtap:
> - Existence of /dev/tapXX
> - Having permissions to open /dev/tapXX
> - One of the following:
>   - Knowing the device name (and being able to translate it using a
> netlink socket)
>   - Knowing the the device index
>
> The rest should be an implementation detail.
>
> Am I right?  Did I miss anything?


At least from the KubeVirt use-case that sounds to be the things which we
would need to solve the networking setup in a similar way like the
Container Network Interface implementations solve the setup in k8s.

Best Regards,
Roman

[-- Attachment #1.2: Type: text/html, Size: 6899 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



      parent reply	other threads:[~2018-07-17 11:58 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-07-05 14:20 opening tap devices that are created in a container Jason Baron
2018-07-05 16:10 ` Daniel P. Berrangé
2018-07-09 20:56   ` Jason Baron
2018-07-10  8:46     ` Daniel P. Berrangé
2018-07-05 16:24 ` [libvirt] " Roman Mohr
2018-07-08  6:01   ` Martin Kletzander
2018-07-09 21:00     ` Jason Baron
2018-07-10  8:47       ` Daniel P. Berrangé
2018-07-17 11:45       ` Martin Kletzander
     [not found]       ` <20180711101005.GA13392@wheatley>
2018-07-12  3:33         ` Jason Baron
2018-07-17 11:58         ` Roman Mohr [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CALDPj7v-bmAWXWAVBC5ALtEc0fdDKO9=dnHSOscMPGUL221J1Q@mail.gmail.com' \
    --to=rmohr@redhat.com \
    --cc=davem@davemloft.net \
    --cc=ebiederm@xmission.com \
    --cc=fabiand@sni.github.map.fastly.net \
    --cc=jbaron@akamai.com \
    --cc=laine@laine.org \
    --cc=libvir-list@redhat.com \
    --cc=mkletzan@redhat.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.