From mboxrd@z Thu Jan  1 00:00:00 1970
From: Martin Kletzander <mkletzan@redhat.com>
Subject: Re: [libvirt] opening tap devices that are created in a container
Date: Sun, 8 Jul 2018 08:01:52 +0200
Message-ID: <20180708060152.GB20206@wheatley>
References: <6a8d7673-0ed7-5920-cc3a-d5d68dbc547c@akamai.com>
 <CALDPj7tWaHLe4kfhyCwPk0zHawOEULYFOQ2sX-Y3wQX7ba+HEw@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;
        protocol="application/pgp-signature"; boundary="3lcZGd9BuhuYXNfi"
Cc: jbaron@akamai.com, fabiand@sni.github.map.fastly.net,
        libvir-list@redhat.com, netdev@vger.kernel.org,
        ebiederm@xmission.com, davem@davemloft.net
To: Roman Mohr <rmohr@redhat.com>
Return-path: <netdev-owner@vger.kernel.org>
Received: from mx3-rdu2.redhat.com ([66.187.233.73]:54370 "EHLO mx1.redhat.com"
        rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
        id S1751422AbeGHGBz (ORCPT <rfc822;netdev@vger.kernel.org>);
        Sun, 8 Jul 2018 02:01:55 -0400
Content-Disposition: inline
In-Reply-To: <CALDPj7tWaHLe4kfhyCwPk0zHawOEULYFOQ2sX-Y3wQX7ba+HEw@mail.gmail.com>
Sender: netdev-owner@vger.kernel.org
List-ID: <netdev.vger.kernel.org>


--3lcZGd9BuhuYXNfi
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Disposition: inline

On Thu, Jul 05, 2018 at 06:24:20PM +0200, Roman Mohr wrote:
>On Thu, Jul 5, 2018 at 4:20 PM Jason Baron <jbaron@akamai.com> wrote:
>
>> Hi,
>>
>> Opening tap devices, such as macvtap, that are created in containers is
>> problematic because the interface for opening tap devices is via
>> /dev/tapNN and devtmpfs is not typically mounted inside a container as
>> it's not namespace aware. It is possible to do a mknod() in the
>> container once the tap devices are created. However, since the tap
>> devices are created dynamically, it's not possible to allow access a priori
>> to certain major/minor numbers, since we don't know what these are going
>> to be. In addition, it's desirable to not allow the mknod capability in
>> containers. This behavior, I think, is somewhat inconsistent with the
>> tuntap driver where one can create tuntap devices inside a container by
>> first opening /dev/net/tun and then using them by supplying the tuntap
>> device name via the ioctl(TUNSETIFF). And since TUNSETIFF validates the
>> network namespace, one is limited to opening network devices that belong
>> to your current network namespace.
>>
>> Here are some options to this issue, that I wanted to get feedback
>> about, and just wondering if anybody else has run into this.
>>
>> 1)
>>
>> Don't create the tap device, such as macvtap in the container. Instead,
>> create the tap device outside of the container and then move it into the
>> desired container network namespace. In addition, do a mknod() for the
>> corresponding /dev/tapNN device from outside the container before doing
>> chroot().
>>
>> This solution still doesn't allow tap devices to be created inside the
>> container. Thus, in the case of kubevirt, which runs libvirtd inside of
>> a container, it would mean changing libvirtd to open existing tap
>> devices (as opposed to the current behavior of creating new ones). This
>> would not require any kernel changes, but as mentioned seems
>> inconsistent with the tuntap interface.
>>
>
>For KubeVirt, apart from how exactly the device ends up in the container, I
>would want to pursue a way where all network preparations which require
>privileges happen from a privileged process *outside* of the container.
>Like CNI solutions do it. They run outside, have privileges and then create
>devices in the right network/mount namespace or move them there. The final
>goal for KubeVirt is that our pod with the qemu process is completely
>unprivileged and privileged setup happens from outside.
>
>As a consequence, and depending on which route Dan pursues with the
>restructured libvirt, I would assume that either a privileged libvirtd-part
>outside of containers creates the devices by entering the right namespaces,
>or that libvirt in the container can consume pre-created tun/tap devices,
>like qemu.
>

That would be nice, but as far as I understand there will always be a need for
some privileges if you want to use a tap device.  It's nice that CNI does that
and all the containers can run unprivileged, but that's because they neither
open the tap device nor do any privileged operations on it.  But QEMU
needs to.  So the only way would be either passing an opened fd into the
container, or opening the tap device there and making the fd usable by one
process in the container.  Is this already supported for some type of
containers in some way?
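
To illustrate the first option, here is a minimal sketch of handing an already
opened fd to another process over a unix socket (SCM_RIGHTS).  It is only an
assumption of how this could look: a pipe stands in for the tap fd, a
socketpair stands in for whatever channel crosses the container boundary, and
the helper names are made up for the example.

```python
import os
import socket


def send_fd(sock, fd):
    """Send one open file descriptor over a unix socket (SCM_RIGHTS)."""
    # At least one byte of ordinary data has to accompany the ancillary data.
    socket.send_fds(sock, [b"F"], [fd])


def recv_fd(sock):
    """Receive one file descriptor sent with send_fd()."""
    _data, fds, _flags, _addr = socket.recv_fds(sock, 1, 1)
    return fds[0]


def demo():
    # Stand-in for the channel between the privileged helper outside the
    # container and the unprivileged process inside it.
    outside, inside = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    # Stand-in for the tap device: the write end of a pipe plays the role of
    # the fd that only the privileged side is able to open.
    read_end, write_end = os.pipe()

    # The privileged side passes the fd and drops its own copy.
    send_fd(outside, write_end)
    os.close(write_end)

    # The unprivileged side uses the fd without ever opening the device.
    received = recv_fd(inside)
    os.write(received, b"hello")
    os.close(received)

    data = os.read(read_end, 5)
    os.close(read_end)
    outside.close()
    inside.close()
    return data


if __name__ == "__main__":
    print(demo())
```

The receiving side never needs access to the device node or to mknod; it only
needs the socket.  socket.send_fds() and socket.recv_fds() need Python 3.9 or
newer on a Unix system.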

Martin
--3lcZGd9BuhuYXNfi
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: Digital signature

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEiXAnXDYdKAaCyvS1CB/CnyQXht0FAltBqNAACgkQCB/CnyQX
ht3gFw/5AUDlehh+ibkv+bm7Rp1Qt8gQbkJjgRvQZ2zk6W0SiZyauJIKECt2Mfvy
Q8rEf85Ca0TbsmiGRu2xtsYrIjtb0NWI+QEsr8cPI/q99AgChqc8M8zqA5DeNdcV
OBte1gfWeiCdKxXZUKrDTI+AyooMatv07dK2Zxt12/dgLCSpsug2nNUVBIq2mJay
TEK8rHna2XHso5gT/Za0CRAvaf2KoGq46cHr/9sTBdRbLU0oyG0pYcbv2fpwbyRs
qEr+1u4SdwxXzJgEP7w8Bvunl3t4Gg9kSy3LnEd88pgGJVSiASgVVmOOCVAHH2w9
9OEg67BLfqt5MWr4RWbzjrXUz5I/WtwN2BN4SKbVy/ayP+9PpzhsFQKH0OPx1jq7
llcNRPcZzbxLWA/Uf8bhhCH2qjjYhfiVYIkcO9iZe+QOKYgLXmb1CKK4jmSXkhqu
vK5Lmnd2vstZNoBOqMKbBtLSphjLXrpHby5+ftU0zDiQ5a2XpvnnAd9V0hvWkA1m
7qpD1EIa9z43mw9uyWApH9mx7FwFWkac1/Sv0pUIV1AIJOjRst1BzdL6XO7UVNiF
ppRdiEXVE5lCEiynGbp+jh2myMoMLQzkcPX1U0bFzxmtNiC+KG+7SJ0UadjQ+LJt
2k6CJY1FwDY/m3buwYDA9yHv4HlQ5S6PCAz527CQXeXdM+Ixq6Y=
=/VIj
-----END PGP SIGNATURE-----

--3lcZGd9BuhuYXNfi--