From mboxrd@z Thu Jan 1 00:00:00 1970
From: Roman Mohr
Subject: Re: [libvirt] opening tap devices that are created in a container
Date: Thu, 5 Jul 2018 18:24:20 +0200
Message-ID:
References: <6a8d7673-0ed7-5920-cc3a-d5d68dbc547c@akamai.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============4509458833130845418=="
Cc: fabiand@sni.github.map.fastly.net, libvir-list@redhat.com, netdev@vger.kernel.org, ebiederm@xmission.com, davem@davemloft.net
To: jbaron@akamai.com
Return-path:
In-Reply-To: <6a8d7673-0ed7-5920-cc3a-d5d68dbc547c@akamai.com>
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Sender: libvir-list-bounces@redhat.com
Errors-To: libvir-list-bounces@redhat.com
List-Id: netdev.vger.kernel.org

--===============4509458833130845418==
Content-Type: multipart/alternative; boundary="000000000000658f31057042fb07"

--000000000000658f31057042fb07
Content-Type: text/plain; charset="UTF-8"

On Thu, Jul 5, 2018 at 4:20 PM Jason Baron wrote:

> Hi,
>
> Opening tap devices, such as macvtap, that are created in containers is
> problematic because the interface for opening tap devices is via
> /dev/tapNN and devtmpfs is not typically mounted inside a container as
> it's not namespace aware. It is possible to do a mknod() in the
> container, once the tap devices are created, however, since the tap
> devices are created dynamically it's not possible to a priori allow access
> to certain major/minor numbers, since we don't know what these are going
> to be. In addition, it's desirable to not allow the mknod capability in
> containers. This behavior, I think is somewhat inconsistent with the
> tuntap driver where one can create tuntap devices inside a container by
> first opening /dev/net/tun and then using them by supplying the tuntap
> device name via the ioctl(TUNSETIFF). And since TUNSETIFF validates the
> network namespace, one is limited to opening network devices that belong
> to your current network namespace.
>
> Here are some options to this issue, that I wanted to get feedback
> about, and just wondering if anybody else has run into this.
>
> 1)
>
> Don't create the tap device, such as macvtap in the container. Instead,
> create the tap device outside of the container and then move it into the
> desired container network namespace. In addition, do a mknod() for the
> corresponding /dev/tapNN device from outside the container before doing
> chroot().
>
> This solution still doesn't allow tap devices to be created inside the
> container. Thus, in the case of kubevirt, which runs libvirtd inside of
> a container, it would mean changing libvirtd to open existing tap
> devices (as opposed to the current behavior of creating new ones). This
> would not require any kernel changes, but as mentioned seems
> inconsistent with the tuntap interface.
>

For KubeVirt, apart from how exactly the device ends up in the container,
I would want to pursue a way where all network preparations which require
privileges happen from a privileged process *outside* of the container,
as CNI solutions do. They run outside, have privileges and then create
devices in the right network/mount namespace or move them there. The final
goal for KubeVirt is that our pod with the qemu process is completely
unprivileged and privileged setup happens from outside.

As a consequence, and depending on which route Dan pursues with the
restructured libvirt, I would assume that either a privileged libvirtd-part
outside of containers creates the devices by entering the right namespaces,
or that libvirt in the container can consume pre-created tun/tap devices,
like qemu.

Best Regards,
Roman

>
> 2)
>
> Add a new kernel interface for tap devices similar to how /dev/net/tun
> currently works. It might be nice to use TUNSETIFF for tap devices, but
> because tap devices have different fops they can't be easily switched
> after open(). So the suggestion is a new ioctl (TUNGETFDBYNAME?), where
> the tap device name is supplied and a new fd (distinct from the fd
> returned by the open of /dev/net/tun) is returned as an output field as
> part of the new ioctl parameter.
>
> It may not make sense to have this new ioctl call for /dev/net/tun since
> it's really about opening a tap device, so it may make sense to introduce
> it as part of a new device, such as /dev/net/tap. This new ioctl could
> be used for macvtap and ipvtap (or any tap device). I think it might
> also improve performance for tuntap devices themselves, if they are
> opened this way since currently all tun operations such as read() and
> write() take a reference count on the underlying tuntap device, since it
> can be changed via TUNSETIFF. I tested this interface out, so I can
> provide the kernel changes if that's helpful for clarification.
>
> Thanks,
>
> -Jason
>

--000000000000658f31057042fb07
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable


On Thu, Jul 5, 2018 at 4:20 PM Jason Baron <jbaron@akamai.com> wrote:
Hi,

Opening tap devices, such as macvtap, that are created in containers is
problematic because the interface for opening tap devices is via
/dev/tapNN and devtmpfs is not typically mounted inside a container as
it's not namespace aware. It is possible to do a mknod() in the
container, once the tap devices are created, however, since the tap
devices are created dynamically it's not possible to a priori allow access
to certain major/minor numbers, since we don't know what these are going
to be. In addition, it's desirable to not allow the mknod capability in
containers. This behavior, I think is somewhat inconsistent with the
tuntap driver where one can create tuntap devices inside a container by
first opening /dev/net/tun and then using them by supplying the tuntap
device name via the ioctl(TUNSETIFF). And since TUNSETIFF validates the
network namespace, one is limited to opening network devices that belong to your current network namespace.
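
For reference, the tuntap flow described above can be sketched roughly as follows. This is a minimal Python illustration, not code from libvirt or KubeVirt: the helper names `build_ifreq` and `open_tuntap` are made up for the example, the constants are the values from linux/if_tun.h, and the 40-byte `struct ifreq` layout assumes a 64-bit Linux build.

```python
import fcntl
import os
import struct

# Values from <linux/if_tun.h>.
TUNSETIFF = 0x400454CA   # _IOW('T', 202, int)
IFF_TAP = 0x0002         # layer-2 (Ethernet) device
IFF_NO_PI = 0x1000       # no packet-information header
IFNAMSIZ = 16


def build_ifreq(name: str, flags: int) -> bytes:
    """Pack a struct ifreq: 16-byte name, 16-bit flags, padding to 40 bytes.

    The 40-byte size is an assumption that holds on 64-bit Linux.
    """
    encoded = name.encode("ascii")
    if len(encoded) >= IFNAMSIZ:
        raise ValueError("interface name must be shorter than IFNAMSIZ")
    return struct.pack("16sH22x", encoded, flags)


def open_tuntap(name: str, flags: int = IFF_TAP | IFF_NO_PI) -> int:
    """Open /dev/net/tun and attach the fd to the named device.

    The kernel resolves the name in the caller's network namespace.
    """
    fd = os.open("/dev/net/tun", os.O_RDWR)
    try:
        fcntl.ioctl(fd, TUNSETIFF, build_ifreq(name, flags))
    except OSError:
        os.close(fd)
        raise
    return fd
```

Something like `open_tuntap("tap0")` only succeeds where /dev/net/tun is available and the caller has sufficient privileges. The point of the sketch is that a single well-known device node is enough, and no per-device major/minor number has to be known in advance.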

Here are some options to this issue, that I wanted to get feedback
about, and just wondering if anybody else has run into this.

1)

Don't create the tap device, such as macvtap in the container. Instead,
create the tap device outside of the container and then move it into the
desired container network namespace. In addition, do a mknod() for the
corresponding /dev/tapNN device from outside the container before doing
chroot().
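
The mknod() step in option 1 could look something like the sketch below. It assumes the privileged helper outside the container already knows the interface index (macvtap names its node tap<ifindex>) and has read the "major:minor" string from the device's sysfs `dev` attribute; where exactly that attribute lives is left to the caller. `parse_dev`, `tap_node_path` and `make_tap_node` are hypothetical names used only for this example.

```python
import os
import stat


def parse_dev(text: str) -> tuple[int, int]:
    """Parse a sysfs-style "major:minor" string such as "241:12"."""
    major_text, minor_text = text.strip().split(":")
    return int(major_text), int(minor_text)


def tap_node_path(rootfs: str, ifindex: int) -> str:
    """Path of the /dev/tapNN node inside the container's root filesystem."""
    return os.path.join(rootfs, "dev", f"tap{ifindex}")


def make_tap_node(rootfs: str, ifindex: int, dev_text: str,
                  mode: int = 0o600) -> str:
    """Create the character device node from outside the container.

    Needs CAP_MKNOD in the calling (privileged) process; the container
    itself never gets that capability.
    """
    major, minor = parse_dev(dev_text)
    path = tap_node_path(rootfs, ifindex)
    os.mknod(path, mode | stat.S_IFCHR, os.makedev(major, minor))
    return path
```

The helper would be run after the tap device exists, since only then are the index and the device number known.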

This solution still doesn't allow tap devices to be created inside the
container. Thus, in the case of kubevirt, which runs libvirtd inside of
a container, it would mean changing libvirtd to open existing tap
devices (as opposed to the current behavior of creating new ones). This
would not require any kernel changes, but as mentioned seems
inconsistent with the tuntap interface.

For KubeVirt, apart from how exactly the device ends up in the container, I would want to pursue a way where all network preparations which require privileges happen from a privileged process *outside* of the container, as CNI solutions do. They run outside, have privileges and then create devices in the right network/mount namespace or move them there. The final goal for KubeVirt is that our pod with the qemu process is completely unprivileged and privileged setup happens from outside.

As a consequence, and depending on which route Dan pursues with the restructured libvirt, I would assume that either a privileged libvirtd-part outside of containers creates the devices by entering the right namespaces, or that libvirt in the container can consume pre-created tun/tap devices, like qemu.
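
Entering the right namespace from a privileged helper, as described above, could be sketched like this. It is an illustration only: `netns_path` and `run_in_netns` are hypothetical helpers, the container is assumed to be identified by the PID of one of its processes, the helper is assumed to be single-threaded, and setns(2) is called through ctypes so the sketch does not depend on a recent Python.

```python
import ctypes
import os

CLONE_NEWNET = 0x40000000  # from <sched.h>


def netns_path(pid: int) -> str:
    """Path to the network namespace handle of a process."""
    return f"/proc/{pid}/ns/net"


def _setns(fd: int) -> None:
    """Thin wrapper around setns(2); raises OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.setns(fd, CLONE_NEWNET) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def run_in_netns(pid: int, action):
    """Run action() inside the network namespace of `pid`, then switch back.

    Requires sufficient privileges (CAP_SYS_ADMIN) in the caller.
    """
    original = os.open(netns_path(os.getpid()), os.O_RDONLY)
    try:
        target = os.open(netns_path(pid), os.O_RDONLY)
        try:
            _setns(target)
            try:
                return action()
            finally:
                _setns(original)  # always return to our own namespace
        finally:
            os.close(target)
    finally:
        os.close(original)
```

The `action` callback would be whatever creates or configures the device. The unprivileged pod only ever sees the finished result.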

Best Regards,
Roman

2)

Add a new kernel interface for tap devices similar to how /dev/net/tun
currently works. It might be nice to use TUNSETIFF for tap devices, but
because tap devices have different fops they can't be easily switched after open(). So the suggestion is a new ioctl (TUNGETFDBYNAME?), where
the tap device name is supplied and a new fd (distinct from the fd
returned by the open of /dev/net/tun) is returned as an output field as
part of the new ioctl parameter.
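
To make the shape of the proposed call concrete, here is a purely hypothetical sketch. The ioctl is only a proposal in this thread, so everything specific is an assumption for illustration: the argument layout (a 16-byte name followed by an int output field), the helper names, and the request number, which the caller has to supply because none is defined here. Only /dev/net/tap and the name-in, fd-out idea come from the text above.

```python
import fcntl
import os
import struct

IFNAMSIZ = 16
# Hypothetical argument layout: char name[IFNAMSIZ]; int fd (output).
_ARG = struct.Struct(f"{IFNAMSIZ}si")


def pack_request(name: str) -> bytearray:
    """Build the mutable ioctl argument with the fd field preset to -1."""
    encoded = name.encode("ascii")
    if len(encoded) >= IFNAMSIZ:
        raise ValueError("interface name must be shorter than IFNAMSIZ")
    return bytearray(_ARG.pack(encoded, -1))


def unpack_fd(arg: bytearray) -> int:
    """Read the output fd field back out of the argument buffer."""
    return _ARG.unpack(arg)[1]


def open_tap_by_name(name: str, request: int,
                     control_path: str = "/dev/net/tap") -> int:
    """Ask the (hypothetical) control device for an fd to the named tap.

    `request` stands in for the proposed TUNGETFDBYNAME number.
    """
    control_fd = os.open(control_path, os.O_RDWR)
    try:
        arg = pack_request(name)
        fcntl.ioctl(control_fd, request, arg)  # kernel fills in arg in place
        return unpack_fd(arg)
    finally:
        os.close(control_fd)  # the returned fd is distinct from this one
```

Closing the control fd after the call mirrors the point in the proposal that the returned fd is a new one, separate from the fd used to issue the ioctl.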

It may not make sense to have this new ioctl call for /dev/net/tun since
it's really about opening a tap device, so it may make sense to introduce
it as part of a new device, such as /dev/net/tap. This new ioctl could
be used for macvtap and ipvtap (or any tap device). I think it might
also improve performance for tuntap devices themselves, if they are
opened this way since currently all tun operations such as read() and
write() take a reference count on the underlying tuntap device, since it can be changed via TUNSETIFF. I tested this interface out, so I can
provide the kernel changes if that's helpful for clarification.

Thanks,

-Jason
--000000000000658f31057042fb07--

--===============4509458833130845418==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

--===============4509458833130845418==--