From mboxrd@z Thu Jan  1 00:00:00 1970
From: Roman Mohr <rmohr@redhat.com>
Subject: Re: [libvirt] opening tap devices that are created in a container
Date: Thu, 5 Jul 2018 18:24:20 +0200
Message-ID: <CALDPj7tWaHLe4kfhyCwPk0zHawOEULYFOQ2sX-Y3wQX7ba+HEw@mail.gmail.com>
References: <6a8d7673-0ed7-5920-cc3a-d5d68dbc547c@akamai.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============4509458833130845418=="
Cc: fabiand@sni.github.map.fastly.net, libvir-list@redhat.com,
	netdev@vger.kernel.org, ebiederm@xmission.com, davem@davemloft.net
To: jbaron@akamai.com
Return-path: <libvir-list-bounces@redhat.com>
In-Reply-To: <6a8d7673-0ed7-5920-cc3a-d5d68dbc547c@akamai.com>
List-Unsubscribe: <https://www.redhat.com/mailman/options/libvir-list>,
	<mailto:libvir-list-request@redhat.com?subject=unsubscribe>
List-Archive: <https://www.redhat.com/archives/libvir-list>
List-Post: <mailto:libvir-list@redhat.com>
List-Help: <mailto:libvir-list-request@redhat.com?subject=help>
List-Subscribe: <https://www.redhat.com/mailman/listinfo/libvir-list>,
	<mailto:libvir-list-request@redhat.com?subject=subscribe>
Sender: libvir-list-bounces@redhat.com
Errors-To: libvir-list-bounces@redhat.com
List-Id: netdev.vger.kernel.org

--===============4509458833130845418==
Content-Type: multipart/alternative; boundary="000000000000658f31057042fb07"

--000000000000658f31057042fb07
Content-Type: text/plain; charset="UTF-8"

On Thu, Jul 5, 2018 at 4:20 PM Jason Baron <jbaron@akamai.com> wrote:

> Hi,
>
> Opening tap devices, such as macvtap, that are created in containers is
> problematic because the interface for opening tap devices is via
> /dev/tapNN, and devtmpfs is not typically mounted inside a container as
> it's not namespace aware. It is possible to do a mknod() in the
> container once the tap devices are created. However, since the tap
> devices are created dynamically, it's not possible to allow access to
> certain major/minor numbers a priori, since we don't know what these
> are going to be. In addition, it's desirable to not allow the mknod
> capability in containers. This behavior, I think, is somewhat
> inconsistent with the tuntap driver, where one can create tuntap devices
> inside a container by first opening /dev/net/tun and then use them by
> supplying the tuntap device name via the ioctl(TUNSETIFF). And since
> TUNSETIFF validates the network namespace, one is limited to opening
> network devices that belong to one's current network namespace.
>
> Here are some options for this issue that I wanted to get feedback on.
> I am also wondering whether anybody else has run into this.
>
> 1)
>
> Don't create the tap device, such as macvtap, in the container. Instead,
> create the tap device outside of the container and then move it into the
> desired container network namespace. In addition, do a mknod() for the
> corresponding /dev/tapNN device from outside the container before doing
> chroot().
>
> This solution still doesn't allow tap devices to be created inside the
> container. Thus, in the case of KubeVirt, which runs libvirtd inside of
> a container, it would mean changing libvirtd to open existing tap
> devices (as opposed to the current behavior of creating new ones). This
> would not require any kernel changes but, as mentioned, seems
> inconsistent with the tuntap interface.
>
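
Option 1 could be sketched roughly as follows from the privileged side.
This is only an illustration under a few assumptions: the NN in
/dev/tapNN is taken to be the interface index of the macvtap device,
and make_tap_node(), tap_node_path(), parse_dev() and the container
root path are hypothetical names, not anything libvirt or KubeVirt
provides. The "major:minor" string has to be looked up on the host
after the device is created; where exactly it is read from is
deliberately left out here.

```python
import os
import stat
from typing import Tuple


def parse_dev(text: str) -> Tuple[int, int]:
    """Parse a 'major:minor' string into a pair of integers."""
    major, minor = text.strip().split(":")
    return int(major), int(minor)


def tap_node_path(container_root: str, ifindex: int) -> str:
    """Path of the tap character device inside the container's rootfs.

    Assumes the node is named tap<ifindex>.
    """
    return os.path.join(container_root, "dev", "tap%d" % ifindex)


def make_tap_node(container_root: str, ifindex: int, dev_text: str) -> str:
    """Create the device node from outside the container (needs CAP_MKNOD)."""
    major, minor = parse_dev(dev_text)
    path = tap_node_path(container_root, ifindex)
    # Character device, accessible to the owner only.
    os.mknod(path, stat.S_IFCHR | 0o600, os.makedev(major, minor))
    return path
```

The container itself then never needs the mknod capability; it only
opens a node that already exists.
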

For KubeVirt, regardless of how exactly the device ends up in the
container, I would want to pursue an approach where all network
preparation that requires privileges happens in a privileged process
*outside* of the container, the way CNI solutions do it: they run
outside, have privileges, and then either create devices in the right
network/mount namespace or move them there. The final goal for KubeVirt
is that our pod with the qemu process is completely unprivileged and
all privileged setup happens from outside.

As a consequence, and depending on which route Dan pursues with the
restructured libvirt, I would assume that either a privileged part of
libvirtd outside of the containers creates the devices by entering the
right namespaces, or that libvirt in the container can consume
pre-created tun/tap devices, like qemu does.
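
For the plain tuntap case, consuming a pre-created device might look
roughly like the sketch below. It assumes the tap device already exists
in the current network namespace and that the caller is allowed to
attach to it (for example, it was created persistent with a matching
owner). open_existing_tap() and build_ifreq() are hypothetical helper
names; the constants are the values from <linux/if_tun.h> on common
architectures such as x86 and arm, and the TUNSETIFF number differs on
some others. This does not cover macvtap, which is still opened via
/dev/tapNN.

```python
import fcntl
import os
import struct

# Values from <linux/if_tun.h>; TUNSETIFF is _IOW('T', 202, int).
TUNSETIFF = 0x400454CA
IFF_TAP = 0x0002
IFF_NO_PI = 0x1000
IFNAMSIZ = 16


def build_ifreq(name: str, flags: int) -> bytes:
    """Pack a struct ifreq: 16-byte name, short flags, zero padding."""
    encoded = name.encode("ascii")
    if len(encoded) >= IFNAMSIZ:
        raise ValueError("interface name too long: %r" % name)
    # 40 bytes matches sizeof(struct ifreq) on 64-bit Linux.
    return struct.pack("16sH22x", encoded, flags)


def open_existing_tap(name: str) -> int:
    """Attach to an existing tap device by name and return its fd."""
    fd = os.open("/dev/net/tun", os.O_RDWR)
    try:
        fcntl.ioctl(fd, TUNSETIFF, build_ifreq(name, IFF_TAP | IFF_NO_PI))
    except OSError:
        os.close(fd)
        raise
    return fd
```

The returned fd could then be handed to qemu, so nothing inside the pod
has to create a device.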

Best Regards,
Roman


>
> 2)
>
> Add a new kernel interface for tap devices similar to how /dev/net/tun
> currently works. It might be nice to use TUNSETIFF for tap devices, but
> because tap devices have different fops they can't be easily switched
> after open(). So the suggestion is a new ioctl (TUNGETFDBYNAME?), where
> the tap device name is supplied and a new fd (distinct from the fd
> returned by the open of /dev/net/tun) is returned as an output field as
> part of the new ioctl parameter.
>
> It may not make sense to have this new ioctl call for /dev/net/tun since
> it's really about opening a tap device, so it may make sense to introduce
> it as part of a new device, such as /dev/net/tap. This new ioctl could
> be used for macvtap and ipvtap (or any tap device). I think it might
> also improve performance for tuntap devices themselves if they are
> opened this way, since currently all tun operations such as read() and
> write() take a reference count on the underlying tuntap device, because
> it can be changed via TUNSETIFF. I tested this interface out, so I can
> provide the kernel changes if that's helpful for clarification.
>
> Thanks,
>
> -Jason
>

--000000000000658f31057042fb07--


--===============4509458833130845418==--