From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 3 Jul 2018 13:08:48 -0400
From: Stéphane Graber
To: "Serge E. Hallyn"
Cc: James Bottomley, Tyler Hicks, Linux Containers, Seth Forshee,
 Christian Brauner, linux-fsdevel
Subject: Re: shiftfs status and future development
Message-ID: <20180703170848.GA6828@castiana>
References: <20180614184448.GC30028@ubuntu-xps13>
 <20180615135638.GA29299@mail.hallyn.com>
 <20180615145917.GF30028@ubuntu-xps13>
 <1529118185.4048.46.camel@HansenPartnership.com>
 <20180618134032.GP30028@ubuntu-xps13>
 <1529333819.4021.4.camel@HansenPartnership.com>
 <1530085696.4243.5.camel@HansenPartnership.com>
 <20180703165450.GB22894@mail.hallyn.com>
In-Reply-To: <20180703165450.GB22894@mail.hallyn.com>
Sender: linux-fsdevel-owner@vger.kernel.org

On Tue, Jul 03, 2018 at 11:54:50AM -0500, Serge E. Hallyn wrote:
> Quoting James Bottomley (James.Bottomley@HansenPartnership.com):
> > On Mon, 2018-06-18 at 20:11 +0300, Amir Goldstein wrote:
> > > On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley
> > > wrote:
> > > [...]
> > > > > >  - Does not break inotify
> > > > > >
> > > > > > I don't expect it does, but I haven't checked.
> > > > >
> > > > > I haven't checked either; I'm planning to do so soon. This is a
> > > > > concern that was expressed to me by others, I think because
> > > > > inotify doesn't work with overlayfs.
> > > >
> > > > I think shiftfs does work simply because it doesn't really do
> > > > overlays, so lots of stuff that doesn't work with overlays does
> > > > work with it.
> > > >
> > >
> > > I'm afraid shiftfs suffers from the same problems that the old naive
> > > overlayfs inode implementation suffered from.
> > >
> > > This problem is demonstrated with LTP tests inotify08 inotify09.
> > > shiftfs_new_inode() is called on every lookup, so inotify watch
> > > may be set on an inode object, then dentry is evicted from cache
> > > and then all events on new dentry are not reported on the watched
> > > inode. You will need to implement hashed inodes to solve it.
> > > Can be done as overlay does - hashing by real inode pointer.
> > >
> > > This is just one of those subtle things about stacked fs and there
> > > may be others at present and more in future - if we don't have a
> > > shared code base for the two stacked fs, I wager you are going to end
> > > up "cherry picking" fixes often.
> > >
> > > IMO, an important question to ask is, since both shiftfs and
> > > overlayfs are strongly coupled with container use cases, are there
> > > users that are interested in both layering AND shifting? on the same
> > > "mark"? If the answer is yes, then this may be an argument in favor
> > > of integrating at least some of shiftfs functionality into overlayfs.
> >
> > My container use case is interested in shifting but not layering. Even
> > the docker use case would only mix the two with the overlay graph
> > driver. There seem to be quite a few clouds using non-overlayfs graph
> > drivers (the dm one being the most popular).
> >
> > > Another argument is that shiftfs itself takes the maximum allowed
> > > 2 levels of s_stack_depth for its 2 mounts, so it is actually not
> > > possible with current VFS limitation to combine shiftfs with
> > > overlayfs.
> >
> > That's an artificial, not an inherent, restriction that was introduced
> > to keep the call stack small. It can be increased or even eliminated
> > (although then we'd risk a real run off the end of the kernel stack
> > problem).
> >
> > > This could be solved relatively easily by adding "-o mark" support
> > > to overlayfs and allowing to mount shiftfs also over "marked"
> > > overlayfs inside container.
> >
> > Can we please decide whether the temporary mark, as implemented in the
> > current patch set, or a more permanent security. xattr type
> > mark is preferred for this? It's an important question that's been
> > asked, but we have no resolution on.
>
> I think something permanent is mandatory. Otherwise users may be able
> to induce a reboot into a state where the temp mark isn't made. A
> security. xattr has the problem that an older kernel may not
> know about it.
>
> Two possibilities which have been mentioned before:
>
> 1. just demand that the *source* idmap doesn't start at 0. Ideally it
> would be something like 100k uids starting at 100k. The kernel would
> refuse to do a shiftfs mount if the source idmap includes uid 0.
>
> I suppose the "let-them-shoot-themselves-in-the-foot" approach would be
> to just strongly recommend using such a source uid mapping, but not
> enforce it.
>
> 2. Enforce that the base directory have perms 700 for shiftfs-mount to
> be allowed. So /var/lib/shiftfs/base-rootfs might be root-owned with
> 700 perms. Root then can shiftfs-mount it to /container1 uid-shifted to
> 100000:0:100000. Yes, root could stupidly change the perms later, but
> failing that uid 1000 can never exploit a setuid-root script under
> /var/lib/shiftfs/base-rootfs
>
> -serge

I'm personally in favor of this second approach, and it is what I was
hoping to use for LXD in the future. It's trivial for us to apply that to
the parent directory of the container (so that the rootfs itself can have
a different mode) and is something we're already doing today for
privileged containers anyway (as those have the exact same issue with
setuid binaries).

Having the data on the host shifted to a map which is never used by any
user on the system would make mistakes less likely to happen, but it would
also require you to know ahead of time how many uids/gids you'll need for
all future containers that will use that filesystem. 100k for example
would effectively prevent the use of many network authentication systems
inside the container as those tend to map to the 200k+ uid range.

-- 
Stéphane Graber
Ubuntu developer
http://www.ubuntu.com
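
For illustration only, here is a minimal userspace sketch of the check
described in option 2 of the quoted mail: a directory qualifies only if it
is a real directory, owned by the expected uid, with mode exactly 0700.
The thread proposes that the kernel enforce this at shiftfs mount time;
the function name is_locked_down and its owner_uid parameter are
hypothetical and are not part of shiftfs or LXD.

```python
import os
import stat


def is_locked_down(directory, owner_uid=0):
    """Return True if `directory` is a real directory owned by `owner_uid`
    with mode exactly 0700, so that no other user can traverse into it.

    Hypothetical helper: a userspace approximation of the proposed rule,
    not the kernel-side check discussed in the thread.
    """
    # lstat() does not follow symlinks, so a symlink at this path is rejected.
    st = os.lstat(directory)
    if not stat.S_ISDIR(st.st_mode):
        return False
    if st.st_uid != owner_uid:
        return False
    # S_IMODE keeps the permission bits plus setuid/setgid/sticky.
    return stat.S_IMODE(st.st_mode) == 0o700


if __name__ == "__main__":
    # Path taken from the example in the quoted mail; it may not exist here.
    base = "/var/lib/shiftfs/base-rootfs"
    if os.path.isdir(base):
        print(base, "locked down:", is_locked_down(base))
```

The same function can be pointed at the parent directory of a container
rootfs, which is where the reply above suggests applying the 0700 mode so
that the rootfs itself can keep a different mode.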