($INBOX_DIR/description missing)
 help / color / Atom feed
* Re: [PATCH] ext4: add rename whiteout support for fast commit
       [not found]   ` <CAD+ocbzMv6SyUUZFnBE0gTnHf8yvMFfq6Dm9rdnLXoUrh7gYkg@mail.gmail.com>
@ 2021-03-19  5:51     ` Amir Goldstein
  2021-03-19  8:36       ` Miklos Szeredi
  0 siblings, 1 reply; 4+ messages in thread
From: Amir Goldstein @ 2021-03-19  5:51 UTC (permalink / raw)
  To: harshad shirwadkar; +Cc: Ext4, Theodore Tso, Miklos Szeredi, overlayfs

[adding overlayfs list]

On Fri, Mar 19, 2021 at 3:32 AM harshad shirwadkar
<harshadshirwadkar@gmail.com> wrote:
>
> Thanks for the review Amir.
>
> Sure changing the subject makes sense.
>
> Also, on further discussions on Ext4 conference call, we also thought
> that with this patch, overlayfs customers would not benefit from fast
> commits much if they call renames often. So, in order to really make
> rename whiteout a fast commit compatible operation, we probably would
> need to add support in fast commit to replay a char device creation
> event (since whiteout object is a char device). That would imply, we
> would need to do careful versioning and would need to burn an on-disk
> feature flag.
>
> An alternative to this would be to have a static whiteout object with
> irrelevant nlink count and to have every rename point to that object
> instead. Based on how we decide to implement that, at max only the
> first rename operation would be fast commit incompatible since that's
> when this object would get created. All the further operations would
> be fast commit compatible. The big benefit of this approach is that
> this way we don't have to add support for char device creation in fast
> commit replay code and thus we don't have to worry about versioning.
>

I'm glad to hear that, Harshad.

Please note that creating a static whiteout object on-disk is one possible
implementation option. Not creating any object on-disk may be even better.

I had suggested the static object approach because it should be pretty
simple to implement and add e2fsprogs support for.

However, if we look at the requirements for RENAME_WHITEOUT,
the resulting directory entry does not actually need to point to any
object on-disk at all.

An alternative implementation would be to create a directory entry
with {d_ino = 0, d_type = DT_WHT}. Lookup of this entry will return
a reference to a singleton read-only ext4 whiteout inode object, which
does not reside on disk, so fast commit is irrelevant in that sense.
i_nlink should be handled carefully, but that should be easier from
doing that for a static on-disk object.

I am not sure how userland tools handle DT_WHT, but I see that
other filesystems can emit this value in theory and man rename(2)
claims that BSD uses DT_WHT, so the common tools should be
able to handle it.

As far as overlayfs is concerned:
1. ovl_lookup() will find an IS_WHITEOUT() inode as usual
2. ovl_dir_read_merged() will need this small patch (below) and will
    not access the inode object at all
3. At first, ovl_whiteout() -> vfs_whiteout() can still create a usual chardev
4. Later, we can initiate the overlayfs instance singleton whiteout
    reference in ovl_check_rename_whiteout() and ovl_whiteout() will
    never get -EMLINK when linking this whiteout object

One other challenge is how to handle users trying to make operations
on the upper layer directly (migrating images etc).
As long as the tools still observe the whiteout as a chadev (with stat(2))
then export and import should work fine (creating a real chardev on import).

If there are tools that try to change  inode permissions recursively on the
upper layer (?) there may be a problem with those read-only whiteouts
although the permission of a whiteout is a moot concept.

Thanks,
Amir.

--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -161,7 +161,7 @@ static struct ovl_cache_entry
*ovl_cache_entry_new(struct ovl_readdir_data *rdd,
        if (ovl_calc_d_ino(rdd, p))
                p->ino = 0;
        p->is_upper = rdd->is_upper;
-       p->is_whiteout = false;
+       p->is_whiteout = (d_type == DT_WHT);

        if (d_type == DT_CHR) {
                p->next_maybe_whiteout = rdd->first_maybe_whiteout;

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] ext4: add rename whiteout support for fast commit
  2021-03-19  5:51     ` [PATCH] ext4: add rename whiteout support for fast commit Amir Goldstein
@ 2021-03-19  8:36       ` Miklos Szeredi
  2021-03-19 10:35         ` Amir Goldstein
  0 siblings, 1 reply; 4+ messages in thread
From: Miklos Szeredi @ 2021-03-19  8:36 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: harshad shirwadkar, Ext4, Theodore Tso, overlayfs

On Fri, Mar 19, 2021 at 6:52 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> [adding overlayfs list]
>
> On Fri, Mar 19, 2021 at 3:32 AM harshad shirwadkar
> <harshadshirwadkar@gmail.com> wrote:
> >
> > Thanks for the review Amir.
> >
> > Sure changing the subject makes sense.
> >
> > Also, on further discussions on Ext4 conference call, we also thought
> > that with this patch, overlayfs customers would not benefit from fast
> > commits much if they call renames often. So, in order to really make
> > rename whiteout a fast commit compatible operation, we probably would
> > need to add support in fast commit to replay a char device creation
> > event (since whiteout object is a char device). That would imply, we
> > would need to do careful versioning and would need to burn an on-disk
> > feature flag.
> >
> > An alternative to this would be to have a static whiteout object with
> > irrelevant nlink count and to have every rename point to that object
> > instead. Based on how we decide to implement that, at max only the
> > first rename operation would be fast commit incompatible since that's
> > when this object would get created. All the further operations would
> > be fast commit compatible. The big benefit of this approach is that
> > this way we don't have to add support for char device creation in fast
> > commit replay code and thus we don't have to worry about versioning.
> >
>
> I'm glad to hear that, Harshad.
>
> Please note that creating a static whiteout object on-disk is one possible
> implementation option. Not creating any object on-disk may be even better.

I don't really get it.  What's the advantage of not having an object?

Readdir returning DT_WHT internally might be nice, but I'd be careful
with exporting that to userspace, since it's likely to cause more
problems that it solves.   And on the stat(2) interface adding S_IFWHT
or even worse: ENOENT are really out of the question due to backward
incompatibility with almost every application.

> One other challenge is how to handle users trying to make operations
> on the upper layer directly (migrating images etc).
> As long as the tools still observe the whiteout as a chadev (with stat(2))
> then export and import should work fine (creating a real chardev on import).

Right.

Can't mkfs.ext4 just create the static object?  That sounds to me like
the simplest approach.

Thanks,
Miklos


Thanks,
Miklos
>
> I had suggested the static object approach because it should be pretty
> simple to implement and add e2fsprogs support for.
>
> However, if we look at the requirements for RENAME_WHITEOUT,
> the resulting directory entry does not actually need to point to any
> object on-disk at all.
>
> An alternative implementation would be to create a directory entry
> with {d_ino = 0, d_type = DT_WHT}. Lookup of this entry will return
> a reference to a singleton read-only ext4 whiteout inode object, which
> does not reside on disk, so fast commit is irrelevant in that sense.
> i_nlink should be handled carefully, but that should be easier from
> doing that for a static on-disk object.
>
> I am not sure how userland tools handle DT_WHT, but I see that
> other filesystems can emit this value in theory and man rename(2)
> claims that BSD uses DT_WHT, so the common tools should be
> able to handle it.
>
> As far as overlayfs is concerned:
> 1. ovl_lookup() will find an IS_WHITEOUT() inode as usual
> 2. ovl_dir_read_merged() will need this small patch (below) and will
>     not access the inode object at all
> 3. At first, ovl_whiteout() -> vfs_whiteout() can still create a usual chardev
> 4. Later, we can initiate the overlayfs instance singleton whiteout
>     reference in ovl_check_rename_whiteout() and ovl_whiteout() will
>     never get -EMLINK when linking this whiteout object
>
> If there are tools that try to change  inode permissions recursively on the
> upper layer (?) there may be a problem with those read-only whiteouts
> although the permission of a whiteout is a moot concept.
>
> Thanks,
> Amir.
>
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -161,7 +161,7 @@ static struct ovl_cache_entry
> *ovl_cache_entry_new(struct ovl_readdir_data *rdd,
>         if (ovl_calc_d_ino(rdd, p))
>                 p->ino = 0;
>         p->is_upper = rdd->is_upper;
> -       p->is_whiteout = false;
> +       p->is_whiteout = (d_type == DT_WHT);
>
>         if (d_type == DT_CHR) {
>                 p->next_maybe_whiteout = rdd->first_maybe_whiteout;

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] ext4: add rename whiteout support for fast commit
  2021-03-19  8:36       ` Miklos Szeredi
@ 2021-03-19 10:35         ` Amir Goldstein
  2021-03-19 10:51           ` Miklos Szeredi
  0 siblings, 1 reply; 4+ messages in thread
From: Amir Goldstein @ 2021-03-19 10:35 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: harshad shirwadkar, Ext4, Theodore Tso, overlayfs

On Fri, Mar 19, 2021 at 10:36 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Fri, Mar 19, 2021 at 6:52 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > [adding overlayfs list]
> >
> > On Fri, Mar 19, 2021 at 3:32 AM harshad shirwadkar
> > <harshadshirwadkar@gmail.com> wrote:
> > >
> > > Thanks for the review Amir.
> > >
> > > Sure changing the subject makes sense.
> > >
> > > Also, on further discussions on Ext4 conference call, we also thought
> > > that with this patch, overlayfs customers would not benefit from fast
> > > commits much if they call renames often. So, in order to really make
> > > rename whiteout a fast commit compatible operation, we probably would
> > > need to add support in fast commit to replay a char device creation
> > > event (since whiteout object is a char device). That would imply, we
> > > would need to do careful versioning and would need to burn an on-disk
> > > feature flag.
> > >
> > > An alternative to this would be to have a static whiteout object with
> > > irrelevant nlink count and to have every rename point to that object
> > > instead. Based on how we decide to implement that, at max only the
> > > first rename operation would be fast commit incompatible since that's
> > > when this object would get created. All the further operations would
> > > be fast commit compatible. The big benefit of this approach is that
> > > this way we don't have to add support for char device creation in fast
> > > commit replay code and thus we don't have to worry about versioning.
> > >
> >
> > I'm glad to hear that, Harshad.
> >
> > Please note that creating a static whiteout object on-disk is one possible
> > implementation option. Not creating any object on-disk may be even better.
>
> I don't really get it.  What's the advantage of not having an object?
>
> Readdir returning DT_WHT internally might be nice, but I'd be careful
> with exporting that to userspace, since it's likely to cause more
> problems that it solves.   And on the stat(2) interface adding S_IFWHT
> or even worse: ENOENT are really out of the question due to backward
> incompatibility with almost every application.
>
> > One other challenge is how to handle users trying to make operations
> > on the upper layer directly (migrating images etc).
> > As long as the tools still observe the whiteout as a chadev (with stat(2))
> > then export and import should work fine (creating a real chardev on import).
>
> Right.
>
> Can't mkfs.ext4 just create the static object?  That sounds to me like
> the simplest approach.
>

Sure. I think that was the original intention and it is probably the easier way.

One thing that we will probably need to do is use the RENAME_WHITEOUT
interface as the explicit way to create the shared whiteout instead of using
vfs_whiteout() for filesystems that support RENAME_WHITEOUT
(we check for RENAME_WHITEOUT support anyway).

The only thing that bothered me in moving from per-ovl-instance singleton
to per-ext4-singleton is what happens if someone tries to (say) chown -R
the upper layer or some other offline modification that was working up to
now and seemed to make sense.

Surely, the ext4 singleton whiteout cannot allow modifications like that,
so what do we do about this? Let those scripts fail (if they exist) and
let their owners fix them to skip errors on whiteouts?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] ext4: add rename whiteout support for fast commit
  2021-03-19 10:35         ` Amir Goldstein
@ 2021-03-19 10:51           ` Miklos Szeredi
  0 siblings, 0 replies; 4+ messages in thread
From: Miklos Szeredi @ 2021-03-19 10:51 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: harshad shirwadkar, Ext4, Theodore Tso, overlayfs

On Fri, Mar 19, 2021 at 11:35 AM Amir Goldstein <amir73il@gmail.com> wrote:

> One thing that we will probably need to do is use the RENAME_WHITEOUT
> interface as the explicit way to create the shared whiteout instead of using
> vfs_whiteout() for filesystems that support RENAME_WHITEOUT
> (we check for RENAME_WHITEOUT support anyway).
>
> The only thing that bothered me in moving from per-ovl-instance singleton
> to per-ext4-singleton is what happens if someone tries to (say) chown -R
> the upper layer or some other offline modification that was working up to
> now and seemed to make sense.

Eek.

>
> Surely, the ext4 singleton whiteout cannot allow modifications like that,
> so what do we do about this? Let those scripts fail (if they exist) and
> let their owners fix them to skip errors on whiteouts?

Might try that.  But the no-regressions rule means we'd have to change
that in case it breaks something.

Thanks,
Miklos

> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, back to index

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20210316221921.1124955-1-harshadshirwadkar@gmail.com>
     [not found] ` <CAOQ4uxiD8WGLeSftqL6dOfz_kNp+YSE7qfXYG34Pea4j8G7CxA@mail.gmail.com>
     [not found]   ` <CAD+ocbzMv6SyUUZFnBE0gTnHf8yvMFfq6Dm9rdnLXoUrh7gYkg@mail.gmail.com>
2021-03-19  5:51     ` [PATCH] ext4: add rename whiteout support for fast commit Amir Goldstein
2021-03-19  8:36       ` Miklos Szeredi
2021-03-19 10:35         ` Amir Goldstein
2021-03-19 10:51           ` Miklos Szeredi

($INBOX_DIR/description missing)

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-unionfs/0 linux-unionfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-unionfs linux-unionfs/ https://lore.kernel.org/linux-unionfs \
		linux-unionfs@vger.kernel.org
	public-inbox-index linux-unionfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-unionfs


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git