Re: [PATCH 0/2] overlayfs: C/R enhancements

From: Amir Goldstein <amir73il@gmail.com>
To: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>,
	Miklos Szeredi <miklos@szeredi.hu>,
	Andrey Vagin <avagin@virtuozzo.com>,
	Konstantin Khorenko <khorenko@virtuozzo.com>,
	Vasiliy Averin <vvs@virtuozzo.com>,
	Kirill Tkhai <ktkhai@virtuozzo.com>,
	overlayfs <linux-unionfs@vger.kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 0/2] overlayfs: C/R enhancements
Date: Fri, 5 Jun 2020 13:21:40 +0300	[thread overview]
Message-ID: <CAOQ4uxhhNx0VxJB=eLoPX+wt15tH3-KLjGuQem4h_R=0nfkAiA@mail.gmail.com> (raw)
In-Reply-To: <fb79be2c-4fc8-5a9d-9b07-e0464fca9c3f@virtuozzo.com>

On Fri, Jun 5, 2020 at 11:41 AM Pavel Tikhomirov
<ptikhomirov@virtuozzo.com> wrote:
>
>
>
> On 6/5/20 5:35 AM, Amir Goldstein wrote:
> > On Fri, Jun 5, 2020 at 12:34 AM Alexander Mikhalitsyn
> > <alexander.mikhalitsyn@virtuozzo.com> wrote:
> >>
> >> Hello,
> >>
> >>> But overlayfs won't accept these "output only" options as input args,
> >> which is a problem.
> >>
> >> Will it be problematic if we simply ignore "lowerdir_mnt_id" and "upperdir_mnt_id" options in ovl_parse_opt()?
> >>
> >
> > That would solve this small problem.
>
> This is not a big problem actually as these options shown in mountinfo
> for overlay had been "output only" forever,
> please see these two examples below:
>
> a) Imagine you've mounted overlay with relative paths and forgot (or you
> never known as you are another user) where your cwd was at the moment of
> mount syscall. - How would you use those options as "input" to create
> the same overlay mount somethere else (bind-mounting not involved)?
>
> b) Imagine you've mounted overlay with absolute paths and someone (other
> user) overmounted lower (upper/workdir) paths for you, all directory
> structure would be the same on overmount but yet files are different. -
> How would you use those options from mountinfo as "input" ones?
>
> We try to make them much closer to "input" ones.

That is not what I meant by "output only"
I meant invalid input options as in EINVAL not ENOENT

>
> Agreed, we should ignore *_mnt_id on mount because paths identify mounts
> at the time of mount call.
>
> >
> >>> Wouldn't it be better for C/R to implement mount options
> >> that overlayfs can parse and pass it mntid and fhandle instead
> >> of paths? >>
> >> Problem is that we need to know on C/R "dump stage" which mounts are used on lower layers and upper layer. Most likely I don't understand something but I can't catch how "mount-time" options will help us.
> >
> > As you already know from inotify/fanotify C/R fhandle is timeless, so
> > there would be no distinction between mount time and dump time.
>
> Pair of fhandle+mnt_id looks an equivalent to path+mnt_id pair, CRIU
> will just need to open fhandle+mnt_id with open_by_handle_at and
> readlink to get path on dump and continue to use path+mnt_id as before.
> (not too common with fhandles but it's my current understanding)
>
> But if you take a look on (a) and (b) again, the regular user does not
> see full information about overlay mount in /proc/pid/mountinfo, they
> can't just take a look on it and understand from there it comes from.
> Resolving fhandle looks like a too hard task for a user.
>

Right, and we need to provide regular user needs in parallel to
providing C/R needs. understood.

> > About mnt_id, your patches will cause the original mount-time mounts to be busy.
> > That is a problem as well.
>
> Children mounts lock parent, open files lock parent. Another analogy is
> a loop device which locks the backing file mount (AFAICS). Anyway one
> can lazy umount, can't they? But I'm not too sure for this one, maybe
> you can share more implications of this problem?
>

Overlayfs mounts are internal not children mounts in the namespace,
so no open files hold reference the mounts in the namespace (AFAICS).

This use case will break:

  mount /dev/vdf /vdf
  mkdir /vdf/{l,u,w} /tmp/m
  mount -t overlay overlay /tmp/m -o
lowerdir=/vdf/l,upperdir=/vdf/u,workdir=/vdf/w
  umount /vdf

Yes users can lazy unmount, the filesystem itself on /dev/vdf is not actually
unmounted, only the /vdf mount goes away from the namespace, but the use case
without lazy unmount will still break.

Maybe its fine since distro/admin needs to opt-in for this change of behavior.
I have to wonder though, why did you add two different config/module
options for this feature? Yes, its two different sub-functionalities, but
which real user (not imaginary one) will really turn on just half the feature?
While at it, you copy pasted the text:
          For more information, see Documentation/filesystems/overlayfs.txt
but there is no more information to be found.

> >
> > I think you should describe the use case is more details.
> > Is your goal to C/R any overlayfs mount that the process has open
> > files on? visible to process
> We wan't to dump a container, not a simple process, if the container
> process has access to some resource CRIU needs to restore this resource.
>
> Imagine the process in container mounts it's own overlay inside
> container, for instance to imulate write access to readonly mount or
> just to implement some snapshots, don't know exact use case. And we want
> to checkpoint/restore this container. (Currently CRIU only supports
> overlay as external mount, e.g. for docker continers docker engine
> pre-creates overlay for us and we just bind from it - it's a different
> case.) If the in-container process creates the in-container mount we
> need to recreate it on restore so that the in-container view of the
> filesystem persists.
>

Understood. but how do you *know* which mounts the container
created and need to be migrated?
Which loop devices the user has created?
As opposed to the ones that docker engine re-created?
It is the found from diff between mountinfo of process and host?

> > For NFS export, we use the persistent descriptor {uuid;fhandle}
> > (a.k.a. struct ovl_fh) to encode
> > an underlying layer object.
> >
> > CRIU can look for an existing mount to a filesystem with uuid as restore stage
> > (or even mount this filesystem) and use open_by_handle_at() to open a
> > path to layer.
>
> On restore we can be on another physical node, so I doubt we have same
> uuid's, sorry I don't fully understand here already.
>

I see, so what about inotify/fanotify?
fhandle and uuid can be looked up/resolved to mnt/path at "dump" time.
The difference between mnt_id/uuid is who keeps the reference on the
mount. If overlayfs provides uuid, then you rely on docker to keep the
reference on the mount and use the reference-less uuid to find the
mount that docker is holding for you.

> > After mounting overlay, that mount to underlying fs can even be discarded.
> >
> > And if this works for you, you don't have to export the layers ovl_fh in
> > /proc/mounts, you can export them in numerous other ways.
> > One way from the top of my head, getxattr on overlay root dir.
> > "trusted.overlay" xattr is anyway a reserved prefix, so "trusted.overlay.layers"
> > for example could work.
>
> Thanks xattr might be a good option, but still don't forget about (a)
> and (b), users like to know all information about mount from
> /proc/pid/mountinfo.
>

Let's stick to your use cases requirements. If you have other use cases
for this functionality lay them out explicitly.

I went to see what losetup does and I see that LOOP_SET_STATUS
ioctl stores a path string that LOOP_GET_STATUS gets back in return.
Does not seem C/R friendly either. Are you not handling loop devices?

It's strange because loop driver keeps an open backing file so
LOOP_GET_STATUS could have returned the uptodate path.

Thanks,
Amir.