linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
@ 2020-01-17 12:49 David Howells
  2020-01-17 14:33 ` Trond Myklebust
  2020-01-17 14:47 ` David Howells
  0 siblings, 2 replies; 36+ messages in thread
From: David Howells @ 2020-01-17 12:49 UTC (permalink / raw)
  To: lsf-pc, Amir Goldstein, Omar Sandoval
  Cc: dhowells, Al Viro, Christoph Hellwig, Miklos Szeredi, linux-fsdevel

It may be worth a discussion of whether linkat() could be given a flag to
allow the destination to be replaced or if a new syscall should be made for
this - or whether it should be disallowed entirely.

A set of patches has been posted by Omar Sandoval that makes this possible:

    https://lore.kernel.org/linux-fsdevel/cover.1524549513.git.osandov@fb.com/

though it only includes filesystem support for btrfs.

This could be useful for cachefiles:

	https://lore.kernel.org/linux-fsdevel/3326.1579019665@warthog.procyon.org.uk/

and overlayfs.

David


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-17 12:49 [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination David Howells
@ 2020-01-17 14:33 ` Trond Myklebust
  2020-01-17 15:46   ` Al Viro
  2020-01-17 14:47 ` David Howells
  1 sibling, 1 reply; 36+ messages in thread
From: Trond Myklebust @ 2020-01-17 14:33 UTC (permalink / raw)
  To: osandov, amir73il, dhowells, lsf-pc; +Cc: hch, miklos, linux-fsdevel, viro

On Fri, 2020-01-17 at 12:49 +0000, David Howells wrote:
> It may be worth a discussion of whether linkat() could be given a
> flag to
> allow the destination to be replaced or if a new syscall should be
> made for
> this - or whether it should be disallowed entirely.
> 
> A set of patches has been posted by Omar Sandoval that makes this
> possible:
> 
>     
> https://lore.kernel.org/linux-fsdevel/cover.1524549513.git.osandov@fb.com/
> 
> though it only includes filesystem support for btrfs.
> 
> This could be useful for cachefiles:
> 
> 	
> https://lore.kernel.org/linux-fsdevel/3326.1579019665@warthog.procyon.org.uk/
> 
> and overlayfs.
> 

That seems to me like a "just go ahead and do it if you can justify it"
kind of thing. It has plenty of precedent, and fits easily into the
existing syscall, so why do we need a face-to-face discussion?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-17 12:49 [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination David Howells
  2020-01-17 14:33 ` Trond Myklebust
@ 2020-01-17 14:47 ` David Howells
  2020-01-17 14:56   ` Trond Myklebust
  1 sibling, 1 reply; 36+ messages in thread
From: David Howells @ 2020-01-17 14:47 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: dhowells, osandov, amir73il, lsf-pc, hch, miklos, linux-fsdevel, viro

Trond Myklebust <trondmy@hammerspace.com> wrote:

> That seems to me like a "just go ahead and do it if you can justify it"
> kind of thing. It has plenty of precedent, and fits easily into the
> existing syscall, so why do we need a face-to-face discussion?

Amir said "This sounds like a good topic to be discussed at LSF/MM (hint
hint)"

Also Christoph H is okay with the idea, but suggested it should be a separate
syscall and Al doesn't seem to like it.  Omar posted patches to do this, but
they didn't seem to get anywhere.

David


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-17 14:47 ` David Howells
@ 2020-01-17 14:56   ` Trond Myklebust
  2020-01-17 16:01     ` Al Viro
  0 siblings, 1 reply; 36+ messages in thread
From: Trond Myklebust @ 2020-01-17 14:56 UTC (permalink / raw)
  To: dhowells; +Cc: hch, osandov, miklos, linux-fsdevel, viro, amir73il, lsf-pc

On Fri, 2020-01-17 at 14:47 +0000, David Howells wrote:
> Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> > That seems to me like a "just go ahead and do it if you can justify
> > it"
> > kind of thing. It has plenty of precedent, and fits easily into the
> > existing syscall, so why do we need a face-to-face discussion?
> 
> Amir said "This sounds like a good topic to be discussed at LSF/MM
> (hint
> hint)"
> 
> Also Christoph H is okay with the idea, but suggested it should be a
> separate
> syscall and Al doesn't seem to like it.  Omar posted patches to do
> this, but
> they didn't seem to get anywhere.
> 

It sounds to me like we rather need a meta-topic about "How do we get
simple things done in the Linux fs community?"

It shouldn't take a ticket to Palm Springs to perform something simple
like adding a new flag to a syscall.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-17 14:33 ` Trond Myklebust
@ 2020-01-17 15:46   ` Al Viro
  2020-01-17 16:12     ` Trond Myklebust
  2020-01-17 16:36     ` Omar Sandoval
  0 siblings, 2 replies; 36+ messages in thread
From: Al Viro @ 2020-01-17 15:46 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: osandov, amir73il, dhowells, lsf-pc, hch, miklos, linux-fsdevel

On Fri, Jan 17, 2020 at 02:33:01PM +0000, Trond Myklebust wrote:
> On Fri, 2020-01-17 at 12:49 +0000, David Howells wrote:
> > It may be worth a discussion of whether linkat() could be given a
> > flag to
> > allow the destination to be replaced or if a new syscall should be
> > made for
> > this - or whether it should be disallowed entirely.
> > 
> > A set of patches has been posted by Omar Sandoval that makes this
> > possible:
> > 
> >     
> > https://lore.kernel.org/linux-fsdevel/cover.1524549513.git.osandov@fb.com/
> > 
> > though it only includes filesystem support for btrfs.
> > 
> > This could be useful for cachefiles:
> > 
> > 	
> > https://lore.kernel.org/linux-fsdevel/3326.1579019665@warthog.procyon.org.uk/
> > 
> > and overlayfs.
> > 
> 
> That seems to me like a "just go ahead and do it if you can justify it"
> kind of thing. It has plenty of precedent, and fits easily into the
> existing syscall, so why do we need a face-to-face discussion?

Unfortunately, it does *not* fit easily.  And IMO that's linux-abi fodder more
than anything else.  The problem is in coming up with sane semantics - there's
a plenty of corner cases with that one.  What to do when destination is
a dangling symlink, for example?  Or has something mounted on it (no, saying
"we'll just reject directories" is not enough).  What should happen when
destination is already a hardlink to the same object?

It's less of a horror than rename() would've been, but that's not saying
much.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-17 14:56   ` Trond Myklebust
@ 2020-01-17 16:01     ` Al Viro
  0 siblings, 0 replies; 36+ messages in thread
From: Al Viro @ 2020-01-17 16:01 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: dhowells, hch, osandov, miklos, linux-fsdevel, amir73il, lsf-pc

On Fri, Jan 17, 2020 at 02:56:05PM +0000, Trond Myklebust wrote:

> It sounds to me like we rather need a meta-topic about "How do we get
> simple things done in the Linux fs community?"
> 
> It shouldn't take a ticket to Palm Springs to perform something simple
> like adding a new flag to a syscall.

Sure - adding a new flag is trivial.  Coming up with sane semantics for
it, OTOH, can be rather non-trivial and in this case it is, unfortunately.
"something like link(2), only it tolerates the existing target and
atomically replaces it" does _not_ specify the semantics.  Try to sit
down for a few minutes and come up with the cases when behaviour is
undefined by the above; it won't take longer than that.

We can do it by asking the proponent to come up with full description to
be included into the proposal, then have at it on fsdevel/linux-abi (as
well as security lists).  Doable, but not a small amount of PITA for
original poster and dealing with questions/objections/etc. is certain
to grow a large thread with many branches (and lots of bikeshedding
thrown in) _and_ would include tons of roundtrips, so the latency
(especially early on, while the proposal is still raw) will be a factor.

It's not the question of how to implement it; it's what should it _do_.
And "we'll tweak the behaviour in corner cases later on" is good in
a lot of situations, but not for userland ABI.  I'd been guilty of
such fuckups several times and they are not cheap to fix afterwards ;-/

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-17 15:46   ` Al Viro
@ 2020-01-17 16:12     ` Trond Myklebust
  2020-01-17 16:48       ` Al Viro
  2020-01-17 16:36     ` Omar Sandoval
  1 sibling, 1 reply; 36+ messages in thread
From: Trond Myklebust @ 2020-01-17 16:12 UTC (permalink / raw)
  To: viro; +Cc: hch, osandov, miklos, linux-fsdevel, amir73il, dhowells, lsf-pc

On Fri, 2020-01-17 at 15:46 +0000, Al Viro wrote:
> On Fri, Jan 17, 2020 at 02:33:01PM +0000, Trond Myklebust wrote:
> > On Fri, 2020-01-17 at 12:49 +0000, David Howells wrote:
> > > It may be worth a discussion of whether linkat() could be given a
> > > flag to
> > > allow the destination to be replaced or if a new syscall should
> > > be
> > > made for
> > > this - or whether it should be disallowed entirely.
> > > 
> > > A set of patches has been posted by Omar Sandoval that makes this
> > > possible:
> > > 
> > >     
> > > https://lore.kernel.org/linux-fsdevel/cover.1524549513.git.osandov@fb.com/
> > > 
> > > though it only includes filesystem support for btrfs.
> > > 
> > > This could be useful for cachefiles:
> > > 
> > > 	
> > > https://lore.kernel.org/linux-fsdevel/3326.1579019665@warthog.procyon.org.uk/
> > > 
> > > and overlayfs.
> > > 
> > 
> > That seems to me like a "just go ahead and do it if you can justify
> > it"
> > kind of thing. It has plenty of precedent, and fits easily into the
> > existing syscall, so why do we need a face-to-face discussion?
> 
> Unfortunately, it does *not* fit easily.  And IMO that's linux-abi
> fodder more
> than anything else.  The problem is in coming up with sane semantics
> - there's
> a plenty of corner cases with that one.  What to do when destination
> is
> a dangling symlink, for example?  Or has something mounted on it (no,
> saying
> "we'll just reject directories" is not enough).  What should happen
> when
> destination is already a hardlink to the same object?
> 
> It's less of a horror than rename() would've been, but that's not
> saying
> much.

We already have precedents for all of that when handling bog-standard
open(O_CREAT) (which creates the first link to the file). Yes, there is
the question of choosing whether to implement O_NOFOLLOW semantics or
not, but that should be dictated by the requirements of the use case.

As for the "hard link on top of itself", that case is already well
defined by POSIX to be a null op IIRC.

What in the proposal is requiring new semantics beyond these precedents
already set by open() and link() itself?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-17 15:46   ` Al Viro
  2020-01-17 16:12     ` Trond Myklebust
@ 2020-01-17 16:36     ` Omar Sandoval
  2020-01-17 16:59       ` Al Viro
  1 sibling, 1 reply; 36+ messages in thread
From: Omar Sandoval @ 2020-01-17 16:36 UTC (permalink / raw)
  To: Al Viro
  Cc: Trond Myklebust, amir73il, dhowells, lsf-pc, hch, miklos, linux-fsdevel

On Fri, Jan 17, 2020 at 03:46:57PM +0000, Al Viro wrote:
> On Fri, Jan 17, 2020 at 02:33:01PM +0000, Trond Myklebust wrote:
> > On Fri, 2020-01-17 at 12:49 +0000, David Howells wrote:
> > > It may be worth a discussion of whether linkat() could be given a
> > > flag to
> > > allow the destination to be replaced or if a new syscall should be
> > > made for
> > > this - or whether it should be disallowed entirely.
> > > 
> > > A set of patches has been posted by Omar Sandoval that makes this
> > > possible:
> > > 
> > >     
> > > https://lore.kernel.org/linux-fsdevel/cover.1524549513.git.osandov@fb.com/
> > > 
> > > though it only includes filesystem support for btrfs.
> > > 
> > > This could be useful for cachefiles:
> > > 
> > > 	
> > > https://lore.kernel.org/linux-fsdevel/3326.1579019665@warthog.procyon.org.uk/
> > > 
> > > and overlayfs.
> > > 
> > 
> > That seems to me like a "just go ahead and do it if you can justify it"
> > kind of thing. It has plenty of precedent, and fits easily into the
> > existing syscall, so why do we need a face-to-face discussion?
> 
> Unfortunately, it does *not* fit easily.  And IMO that's linux-abi fodder more
> than anything else.  The problem is in coming up with sane semantics - there's
> a plenty of corner cases with that one.  What to do when destination is
> a dangling symlink, for example?  Or has something mounted on it (no, saying
> "we'll just reject directories" is not enough).  What should happen when
> destination is already a hardlink to the same object?
> 
> It's less of a horror than rename() would've been, but that's not saying
> much.

The semantics I implemented in my series were basically "linkat with
AT_REPLACE replaces the target iff rename would replace the target".
Therefore, symlinks are replaced, not followed, and mountpoints get
EXDEV. In my opinion that's both sane and unsurprising.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-17 16:12     ` Trond Myklebust
@ 2020-01-17 16:48       ` Al Viro
  0 siblings, 0 replies; 36+ messages in thread
From: Al Viro @ 2020-01-17 16:48 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: hch, osandov, miklos, linux-fsdevel, amir73il, dhowells, lsf-pc

On Fri, Jan 17, 2020 at 04:12:28PM +0000, Trond Myklebust wrote:

> > Unfortunately, it does *not* fit easily.  And IMO that's linux-abi
> > fodder more
> > than anything else.  The problem is in coming up with sane semantics
> > - there's
> > a plenty of corner cases with that one.  What to do when destination
> > is
> > a dangling symlink, for example?  Or has something mounted on it (no,
> > saying
> > "we'll just reject directories" is not enough).  What should happen
> > when
> > destination is already a hardlink to the same object?
> > 
> > It's less of a horror than rename() would've been, but that's not
> > saying
> > much.
> 
> We already have precedents for all of that when handling bog-standard
> open(O_CREAT) (which creates the first link to the file). Yes, there is
> the question of choosing whether to implement O_NOFOLLOW semantics or
> not, but that should be dictated by the requirements of the use case.
> 
> As for the "hard link on top of itself", that case is already well
> defined by POSIX to be a null op IIRC.

Where in POSIX does it say anything about it?  It is a null op for
rename, but for link it's EEXIST on the general grounds.

> What in the proposal is requiring new semantics beyond these precedents
> already set by open() and link() itself?

The fact that O_CREAT does not do anything to the existing target,
perhaps?  This, unless I'm seriously misunderstanding the proposal,
should have the preexisting link removed.  Which makes it a lot
more similar to "unlink target, then link source to target, atomically"
than to O_CREAT.

Incidentally,

echo foo >/tmp/foo
echo bar >/tmp/bar
ln /tmp/foo /tmp/foo2
mount --bind /tmp/foo /tmp/bar
echo a >/tmp/bar
cat /tmp/foo2

will print "a" - IOW, O_CREAT in the redirect of that last echo will
	find /tmp/bar
	see it overmounted (by /tmp/foo)
	access /tmp/foo, which happens to be the same file as /tmp/foo2

What would you want that link() variant do in similar situation
(i.e. mount traversal at the end of pathname)?  I can see several
variants of behaviour, none of them too appealing.

What should happen if target is opened by somebody?  I would expect it
to be treated as opened-and-unlinked (with sillyrename if fs requires
that).  Which is where the corner case with target already being a link
to source comes from...

For fuck sake, I'm not being obstructionist - if you (or David, or anyone
else) is willing to come up with sane semantics (I'm _not_ talking about
implementation, VFS or fs data structures, etc. - just the rules describing
what the effect should it have), great, I'll be happy to help with the
implementation side.  As well as poking holes in said proposal (i.e.
asking what should happen in such and such case).

But it's really _not_ as trivial as "do by analogy with O_CREAT".  I don't
have any problem with discussing that over email, but latencies do suck
sometimes (e.g. when discussing autofs ->d_manage() semantics, with
3-way conversation - one participant on US east coast, one in UK, one
on AU west coast), so I understand why David (who'd just had exactly that
lovely experience) might find an idea of doing that face-to-face appealing...

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-17 16:36     ` Omar Sandoval
@ 2020-01-17 16:59       ` Al Viro
  2020-01-17 17:28         ` Omar Sandoval
  0 siblings, 1 reply; 36+ messages in thread
From: Al Viro @ 2020-01-17 16:59 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Trond Myklebust, amir73il, dhowells, lsf-pc, hch, miklos, linux-fsdevel

On Fri, Jan 17, 2020 at 08:36:16AM -0800, Omar Sandoval wrote:

> The semantics I implemented in my series were basically "linkat with
> AT_REPLACE replaces the target iff rename would replace the target".
> Therefore, symlinks are replaced, not followed, and mountpoints get
> EXDEV. In my opinion that's both sane and unsurprising.

Umm...  EXDEV in rename() comes when _parents_ are on different mounts.
rename() over a mountpoint is EBUSY if it has mounts in caller's
namespace, but it succeeds (and detaches all mounts on the victim
in any namespaces) otherwise.

When are you returning EXDEV?  Incidentally, mounts _are_ traversed on
the link source, so what should that variant do when /tmp/foo is
a mountpoint and you feed it "/tmp/foo" both for source and target?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-17 16:59       ` Al Viro
@ 2020-01-17 17:28         ` Omar Sandoval
  2020-01-17 18:17           ` Al Viro
  0 siblings, 1 reply; 36+ messages in thread
From: Omar Sandoval @ 2020-01-17 17:28 UTC (permalink / raw)
  To: Al Viro
  Cc: Trond Myklebust, amir73il, dhowells, lsf-pc, hch, miklos, linux-fsdevel

On Fri, Jan 17, 2020 at 04:59:04PM +0000, Al Viro wrote:
> On Fri, Jan 17, 2020 at 08:36:16AM -0800, Omar Sandoval wrote:
> 
> > The semantics I implemented in my series were basically "linkat with
> > AT_REPLACE replaces the target iff rename would replace the target".
> > Therefore, symlinks are replaced, not followed, and mountpoints get
> > EXDEV. In my opinion that's both sane and unsurprising.
> 
> Umm...  EXDEV in rename() comes when _parents_ are on different mounts.
> rename() over a mountpoint is EBUSY if it has mounts in caller's
> namespace, but it succeeds (and detaches all mounts on the victim
> in any namespaces) otherwise.
> 
> When are you returning EXDEV?

EXDEV was a thinko, the patch does what rename does:


+	if (is_local_mountpoint(new_dentry)) {
+		error = -EBUSY;
+		goto out;
+	}

...

+	if (target) {
+		dont_mount(new_dentry);
+		detach_mounts(new_dentry);
+	}

Anyways, my point is that the rename semantics cover 90% of AT_REPLACE.
Before I resend the patches, I'll write up the documentation and we can
see what other corner cases I missed.

> Incidentally, mounts _are_ traversed on
> the link source, so what should that variant do when /tmp/foo is
> a mountpoint and you feed it "/tmp/foo" both for source and target?

EBUSY and noop both seem reasonable in this case, so we pick one and
document it.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-17 17:28         ` Omar Sandoval
@ 2020-01-17 18:17           ` Al Viro
  2020-01-17 20:22             ` Omar Sandoval
  0 siblings, 1 reply; 36+ messages in thread
From: Al Viro @ 2020-01-17 18:17 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Trond Myklebust, amir73il, dhowells, lsf-pc, hch, miklos, linux-fsdevel

On Fri, Jan 17, 2020 at 09:28:55AM -0800, Omar Sandoval wrote:
> On Fri, Jan 17, 2020 at 04:59:04PM +0000, Al Viro wrote:
> > On Fri, Jan 17, 2020 at 08:36:16AM -0800, Omar Sandoval wrote:
> > 
> > > The semantics I implemented in my series were basically "linkat with
> > > AT_REPLACE replaces the target iff rename would replace the target".
> > > Therefore, symlinks are replaced, not followed, and mountpoints get
> > > EXDEV. In my opinion that's both sane and unsurprising.
> > 
> > Umm...  EXDEV in rename() comes when _parents_ are on different mounts.
> > rename() over a mountpoint is EBUSY if it has mounts in caller's
> > namespace, but it succeeds (and detaches all mounts on the victim
> > in any namespaces) otherwise.
> > 
> > When are you returning EXDEV?
> 
> EXDEV was a thinko, the patch does what rename does:
> 
> 
> +	if (is_local_mountpoint(new_dentry)) {
> +		error = -EBUSY;
> +		goto out;
> +	}
> 
> ...
> 
> +	if (target) {
> +		dont_mount(new_dentry);
> +		detach_mounts(new_dentry);
> +	}
> 
> Anyways, my point is that the rename semantics cover 90% of AT_REPLACE.
> Before I resend the patches, I'll write up the documentation and we can
> see what other corner cases I missed.

OK... rename() has a major difference from linkat(), though, in not
following links (or allowing fd + empty path).  link() is deeply
asymmetric in treatment of pathnames - the first argument is
"pathname describes a filesystem object" and the second - "pathname
descripes an entry in a directory" (the link to be).  rename() is
not - both arguments are pathnames-as-link-specifiers.  And that
affects what is and what is not allowed there, so that'll need
a careful look into.

FWIW, currently linkat() semantics can be described simply enough
	* oldfd/oldname specifies an fs object; symlink traversal is
optional.
	* newfd/newname specifies an entry in some directory.  Directory
must be on the same mount as the object specified by oldname.
Entry must be a normal component (no empty paths allowed, no . or ..
either).  Trailing slashes are not allowed.  There must be no
entry with that name (which automatically implies that trailing
symlinks are not to be followed and there can't be anything
mounted on it).
	* the object specified by oldname must be a non-directory.
	* if the object is not a never-linked-yet anonymous,
it must still have some links.
	* caller must have permission to create links in the
affected directory.
	* append-only and immutables are not allowed (rationale:
they can't be unlinked)
	* filesystem is allowed to fail for any reasons, as with
any operation; a linkat()-specific one is having the link count
overflow, but any generic error is possible (out of memory, IO
error, EPERM-because-I-feel-like-that, etc.)

All checks related to object in question are atomic wrt operations
that add or remove links to that object.  Checks on parent are
atomic wrt operations modifying the parent.  Neither group is
atomic wrt operations modifying the _old_ parent (if there's
any, in the first place).

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-17 18:17           ` Al Viro
@ 2020-01-17 20:22             ` Omar Sandoval
  2020-01-17 22:22               ` Al Viro
  0 siblings, 1 reply; 36+ messages in thread
From: Omar Sandoval @ 2020-01-17 20:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Trond Myklebust, amir73il, dhowells, lsf-pc, hch, miklos, linux-fsdevel

On Fri, Jan 17, 2020 at 06:17:30PM +0000, Al Viro wrote:
> On Fri, Jan 17, 2020 at 09:28:55AM -0800, Omar Sandoval wrote:
> > On Fri, Jan 17, 2020 at 04:59:04PM +0000, Al Viro wrote:
> > > On Fri, Jan 17, 2020 at 08:36:16AM -0800, Omar Sandoval wrote:
> > > 
> > > > The semantics I implemented in my series were basically "linkat with
> > > > AT_REPLACE replaces the target iff rename would replace the target".
> > > > Therefore, symlinks are replaced, not followed, and mountpoints get
> > > > EXDEV. In my opinion that's both sane and unsurprising.
> > > 
> > > Umm...  EXDEV in rename() comes when _parents_ are on different mounts.
> > > rename() over a mountpoint is EBUSY if it has mounts in caller's
> > > namespace, but it succeeds (and detaches all mounts on the victim
> > > in any namespaces) otherwise.
> > > 
> > > When are you returning EXDEV?
> > 
> > EXDEV was a thinko, the patch does what rename does:
> > 
> > 
> > +	if (is_local_mountpoint(new_dentry)) {
> > +		error = -EBUSY;
> > +		goto out;
> > +	}
> > 
> > ...
> > 
> > +	if (target) {
> > +		dont_mount(new_dentry);
> > +		detach_mounts(new_dentry);
> > +	}
> > 
> > Anyways, my point is that the rename semantics cover 90% of AT_REPLACE.
> > Before I resend the patches, I'll write up the documentation and we can
> > see what other corner cases I missed.
> 
> OK... rename() has a major difference from linkat(), though, in not
> following links (or allowing fd + empty path).  link() is deeply
> asymmetric in treatment of pathnames - the first argument is
> "pathname describes a filesystem object" and the second - "pathname
> descripes an entry in a directory" (the link to be).  rename() is
> not - both arguments are pathnames-as-link-specifiers.  And that
> affects what is and what is not allowed there, so that'll need
> a careful look into.
> 
> FWIW, currently linkat() semantics can be described simply enough
> 	* oldfd/oldname specifies an fs object; symlink traversal is
> optional.
> 	* newfd/newname specifies an entry in some directory.  Directory
> must be on the same mount as the object specified by oldname.
> Entry must be a normal component (no empty paths allowed, no . or ..
> either).  Trailing slashes are not allowed.  There must be no
> entry with that name (which automatically implies that trailing
> symlinks are not to be followed and there can't be anything
> mounted on it).
> 	* the object specified by oldname must be a non-directory.
> 	* if the object is not a never-linked-yet anonymous,
> it must still have some links.
> 	* caller must have permission to create links in the
> affected directory.
> 	* append-only and immutables are not allowed (rationale:
> they can't be unlinked)
> 	* filesystem is allowed to fail for any reasons, as with
> any operation; a linkat()-specific one is having the link count
> overflow, but any generic error is possible (out of memory, IO
> error, EPERM-because-I-feel-like-that, etc.)
> 
> All checks related to object in question are atomic wrt operations
> that add or remove links to that object.  Checks on parent are
> atomic wrt operations modifying the parent.  Neither group is
> atomic wrt operations modifying the _old_ parent (if there's
> any, in the first place).

Since manpage troff patches aren't particularly nice to read, here is
what I had in mind in a readable form:

int linkat(int olddirfd, const char *oldpath, int newdirfd,
	   const char *newpath, int flags);

	AT_REPLACE
		If newpath exists, replace it atomically. There is no
		point at which another process attempting to access
		newpath will find it missing.
	
		newpath must not be a directory.
	
		If oldpath and newpath are existing hard links referring
		to the same file, then this does nothing, and returns a
		success status. If newpath is a mount point, then this
		comparison considers the mount point, not the mounted
		file or directory.
	
		If newpath refers to a symbolic link, the link will be
		overwritten.

ERRORS
	EBUSY	AT_REPLACE was specified in flags, newpath is a mount
		point, and the mount point does not refer to the same
		file as oldpath. Or, AT_REPLACE was specified in flags
		and newpath ends with a . or .. component.

	EISDIR	AT_REPLACE was specified in flags and newpath is an
		existing directory.
---

Overall, I think what makes the most sense is that AT_REPLACE does not
change the treatment of oldpath and is as consistent with rename(2) for
the treatment of newpath. Therefore, linkat with AT_RENAME:

* Doesn't traverse mounts on newpath (see below).
* If newpath is a symlink, doesn't follow it.
* Returns EBUSY for trailing . and .. components instead of EEXIST like
  you get without AT_REPLACE.
* Is a noop if linking a file to a hard link of itself.

The hairiest bit is that last point if there are mounts involved. As I
mentioned, we probably want to handle oldpath exactly the same whether
or not AT_REPLACE is used, so we will always traverse mounts and follow
symlinks (modulo AT_SYMLINK_FOLLOW). However, let's consider how rename
handles mounts on newpath:

# cat rename.c
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        if (argc != 3) {
                fprintf(stderr, "%s: SOURCE DEST\n", argv[0]);
                return EXIT_FAILURE;
        }
        if (rename(argv[1], argv[2]) == -1) {
                perror("rename");
                return EXIT_FAILURE;
        }
        return EXIT_SUCCESS;
}
# touch a b
# ln b c
# mount --bind a c
# ./rename b c

This succeeds because rename treats the path as an entry in a directory.
It doesn't traverse the mount on "c", so "c" is still considered the
same file as "b".

# umount c
# mount --bind c a
# ./rename b a
rename: Device or resource busy

Likewise, this fails because the underlying "a" mount point is not the
same file as "b".

So the rules for "same file" are "oldpath follows mounts and symlinks as
appropriate" and "newpath does not follow mounts or symlinks". The
asymmetry is a bit awkward, but I think it's the soundest choice.

Thoughts? Any other gaps here?

Thanks!

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-17 20:22             ` Omar Sandoval
@ 2020-01-17 22:22               ` Al Viro
  2020-01-17 23:54                 ` Omar Sandoval
  0 siblings, 1 reply; 36+ messages in thread
From: Al Viro @ 2020-01-17 22:22 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Trond Myklebust, amir73il, dhowells, lsf-pc, hch, miklos, linux-fsdevel

On Fri, Jan 17, 2020 at 12:22:19PM -0800, Omar Sandoval wrote:

> Since manpage troff patches aren't particularly nice to read, here is
> what I had in mind in a readable form:
> 
> int linkat(int olddirfd, const char *oldpath, int newdirfd,
> 	   const char *newpath, int flags);
> 
> 	AT_REPLACE
> 		If newpath exists, replace it atomically. There is no
> 		point at which another process attempting to access
> 		newpath will find it missing.
> 	
> 		newpath must not be a directory.
> 	
> 		If oldpath and newpath are existing hard links referring
> 		to the same file, then this does nothing, and returns a
> 		success status. If newpath is a mount point, then this
> 		comparison considers the mount point, not the mounted
> 		file or directory.
> 	
> 		If newpath refers to a symbolic link, the link will be
> 		overwritten.
> 
> ERRORS
> 	EBUSY	AT_REPLACE was specified in flags, newpath is a mount
> 		point, and the mount point does not refer to the same
> 		file as oldpath. Or, AT_REPLACE was specified in flags
> 		and newpath ends with a . or .. component.

> 	EISDIR	AT_REPLACE was specified in flags and newpath is an
> 		existing directory.

So are <existing directory>/. or <existing directory>/.., so I wonder why
bother with -EBUSY there.

As for the gaps...
	1) emptypath for new.  Should be an error; EINVAL, probably.
Avoids arseloads of fun cases ("what if newfd/newpath refers to something
unlinked?", etc., etc.)
	2) mountpoint vs. mountpoint in the local namespace.  See the
rationale in that area for unlink()/rmdir()/rename().
	3) permission checks need to be specified
	4) as for the hardlinks, I would be very careful with the language;
if anything, that's probably "if the link specified by newfd/newpath points
to the object specified by oldfd/oldpath..."  Another thing is, as you
could've seen for rename(), such "permissive" clauses tend to give surprising
results.  For example, put the check for "it's a hardlink" early enough,
and you've suddenly gotten a situation when it *can* succeed for directory.
Or on a filesystem that never allowed hardlinks (the latter is probably
unavoidable).
	5) what _really_ needs to be said is that other links to
newpath are unaffected and so are previously opened files.
	6) "atomically" is bloody vague; the explanation you've put there
is not enough, IMO - in addition to "nobody sees it gone in the middle
of operation" there's also "if the operation fails, the target remains
undisturbed" and something along the lines of "if filesystem makes any
promises about the state after dirty shutdown in the middle of rename(),
the same promises should apply here".
	7) how do users tell if filesystem supports that?  And no,
references to pathconf, Cthulhu and other equally delightful entities
are not really welcome.

There might be be more - like I said, it needs to go through
linux-abi, fsdevel and security lists.  The above is all I see right
now, but I might be missing something subtle (or not so subtle).

As for the implementation... VFS side should be reasonably easy (OK,
it'll bloat do_symlinkat() quite a bit, since we won't be able to use
filename_create() for the target in there, but it's not _that_
terrible).  I'd probably prefer a separate method rather than
overloading ->link(), with the old method called when the target
is negative and new - when it's positive - less boilerplate and
harder for an instance to get confused.  They can easily use common
helpers with ->link() instances, so the code duplication won't
be an issue.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-17 22:22               ` Al Viro
@ 2020-01-17 23:54                 ` Omar Sandoval
  2020-01-18  0:47                   ` Al Viro
  0 siblings, 1 reply; 36+ messages in thread
From: Omar Sandoval @ 2020-01-17 23:54 UTC (permalink / raw)
  To: Al Viro
  Cc: Trond Myklebust, amir73il, dhowells, lsf-pc, hch, miklos, linux-fsdevel

On Fri, Jan 17, 2020 at 10:22:12PM +0000, Al Viro wrote:
> On Fri, Jan 17, 2020 at 12:22:19PM -0800, Omar Sandoval wrote:
> 
> > Since manpage troff patches aren't particularly nice to read, here is
> > what I had in mind in a readable form:
> > 
> > int linkat(int olddirfd, const char *oldpath, int newdirfd,
> > 	   const char *newpath, int flags);
> > 
> > 	AT_REPLACE
> > 		If newpath exists, replace it atomically. There is no
> > 		point at which another process attempting to access
> > 		newpath will find it missing.
> > 	
> > 		newpath must not be a directory.
> > 	
> > 		If oldpath and newpath are existing hard links referring
> > 		to the same file, then this does nothing, and returns a
> > 		success status. If newpath is a mount point, then this
> > 		comparison considers the mount point, not the mounted
> > 		file or directory.
> > 	
> > 		If newpath refers to a symbolic link, the link will be
> > 		overwritten.
> > 
> > ERRORS
> > 	EBUSY	AT_REPLACE was specified in flags, newpath is a mount
> > 		point, and the mount point does not refer to the same
> > 		file as oldpath. Or, AT_REPLACE was specified in flags
> > 		and newpath ends with a . or .. component.
> 
> > 	EISDIR	AT_REPLACE was specified in flags and newpath is an
> > 		existing directory.

Thanks for taking a look.

> So are <existing directory>/. or <existing directory>/.., so I wonder why
> bother with -EBUSY there.

This was for consistency with rename, as it happens to check for . and
.. before it checks whether oldpath is a directory. I don't feel
strongly either way.

> As for the gaps...
> 	1) emptypath for new.  Should be an error; EINVAL, probably.
> Avoids arseloads of fun cases ("what if newfd/newpath refers to something
> unlinked?", etc., etc.)

linkat(2) explicitly mentions that AT_EMPTY_PATH only applies to
oldpath, so this falls back to the standard ENOENT documented in
path_resolution(7).

> 	2) mountpoint vs. mountpoint in the local namespace.  See the
> rationale in that area for unlink()/rmdir()/rename().

I'll add that.

Side-note, unlink(2), rmdir(2), and rename(2) should probably mention
this behavior, too...

> 	3) permission checks need to be specified

I believe the only difference here vs standard linkat is that newpath
must not be immutable or append-only?

> 	4) as for the hardlinks, I would be very careful with the language;
> if anything, that's probably "if the link specified by newfd/newpath points
> to the object specified by oldfd/oldpath..."

It seems like there's room for elaborating on the distinction between
path-to-a-file vs. path-to-a-dir-entry in path_resolution(7). In the
meantime, I'll mention it explicitly.

> Another thing is, as you
> could've seen for rename(), such "permissive" clauses tend to give surprising
> results.  For example, put the check for "it's a hardlink" early enough,
> and you've suddenly gotten a situation when it *can* succeed for directory.
> Or on a filesystem that never allowed hardlinks (the latter is probably
> unavoidable).

Good point. Unless I'm missing something, the only way you'll end up
with two paths referring to the same directory/file on a filesystem not
supporting hardlinks is through multiple mounts of the same filesystem,
and I check for the EXDEV case earlier than the same file case. I'll
double-check for other cases.

> 	5) what _really_ needs to be said is that other links to
> newpath are unaffected and so are previously opened files.

Done.

> 	6) "atomically" is bloody vague; the explanation you've put there
> is not enough, IMO - in addition to "nobody sees it gone in the middle
> of operation" there's also "if the operation fails, the target remains
> undisturbed" and something along the lines of "if filesystem makes any
> promises about the state after dirty shutdown in the middle of rename(),
> the same promises should apply here".

Done. I'd be nice if I could say "refer to rename(2) for durability
guarantees", but rename(2) doesn't mention it at all.

> references to pathconf, Cthulhu and other equally delightful entities
> are not really welcome.

EOPNOTSUPP is probably the most helpful.

> There might be be more - like I said, it needs to go through
> linux-abi, fsdevel and security lists.  The above is all I see right
> now, but I might be missing something subtle (or not so subtle).

Here's the updated description. If you don't see any other glaring
problems, I'll update the implementation and send it out to the
appropriate lists.

int linkat(int olddirfd, const char *oldpath, int newdirfd,
	   const char *newpath, int flags);

	AT_REPLACE
		If newpath exists, replace it atomically. There is no
		point at which another process attempting to access
		newpath will find it missing. If newpath exists but the
		operation fails, the original link specified by newpath
		will remain in place. This has the same durability
		guarantees for newpath as rename(2).

		If newpath is replaced, any other hard links referring
		to the file are unaffected. Open file descriptors for
		newpath are also unaffected.
	
		newpath must not be a directory.
	
		If the entry specified by newpath refers to the file
		specified by oldpath, linkat() does nothing and returns
		a success status. Note that this comparison does not
		follow mounts on newpath.

		Otherwise, newpath must not be a mount point in the
		local namespace. If it is a mount point in another
		namespace and the operation succeeds, all mounts are
		detached from newpath in all namespaces, as is the case
		for rename(2), rmdir(2), and unlink(2).
	
		If newpath refers to a symbolic link, the link will be
		overwritten.

ERRORS
	EBUSY	AT_REPLACE was specified in flags, newpath is a mount
		point in the local namespace, and the mount point does
		not refer to the same file as oldpath. Or, AT_REPLACE
		was specified in flags and newpath ends with a . or ..
		component.

	EISDIR	AT_REPLACE was specified in flags and newpath is an
		existing directory.

	EOPNOTSUPP
		AT_REPLACE was specified in flags and the filesystem
		containing newpath does not support AT_REPLACE.

	EPERM	AT_REPLACE was specified and newpath refers to an
		immutable or append-only file.

> As for the implementation... VFS side should be reasonably easy (OK,
> it'll bloat do_symlinkat() quite a bit, since we won't be able to use
> filename_create() for the target in there, but it's not _that_
> terrible).

Based on my previous attempt at it [1], it's not too bad.

> I'd probably prefer a separate method rather than
> overloading ->link(), with the old method called when the target
> is negative and new - when it's positive - less boilerplate and
> harder for an instance to get confused.  They can easily use common
> helpers with ->link() instances, so the code duplication won't
> be an issue.

Agreed, thanks again!

1: https://lore.kernel.org/linux-fsdevel/eac9480f80c689504148b5d658ee4218cc1e421e.1524549513.git.osandov@fb.com/

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-17 23:54                 ` Omar Sandoval
@ 2020-01-18  0:47                   ` Al Viro
  2020-01-18  1:17                     ` Omar Sandoval
  0 siblings, 1 reply; 36+ messages in thread
From: Al Viro @ 2020-01-18  0:47 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Trond Myklebust, amir73il, dhowells, lsf-pc, hch, miklos, linux-fsdevel

On Fri, Jan 17, 2020 at 03:54:44PM -0800, Omar Sandoval wrote:
 
> > 	3) permission checks need to be specified
> 
> I believe the only difference here vs standard linkat is that newpath
> must not be immutable or append-only?

I would bloody hope not - at the very least you want sticky bit on parent
to have effect, same as with rename()/rmdir()/unlink()...

> > references to pathconf, Cthulhu and other equally delightful entities
> > are not really welcome.
> 
> EOPNOTSUPP is probably the most helpful.

Umm...  What would you feed it, though?  You need to get past your
"links to the same file, do nothing" escape...

> Based on my previous attempt at it [1], it's not too bad.

+                       error = may_delete(dir, new_dentry, d_is_dir(old_dentry));                                       

Why bother with d_is_dir(), when you are going to reject directories
anyway?

+       if (dir->i_op->link)                                                                                             
+               error = dir->i_op->link(old_dentry, dir, new_dentry);                                                    
+       else                                                                                                             
+               error = dir->i_op->link2(old_dentry, dir, new_dentry, flags);                                            
+       if (error)                                                                                                       
+               goto out;                                                                                                
+                                                                                                                        

No.  This is completely wrong; just make it ->link_replace() and be done
with that; no extra arguments and *always* the same conditions wrt
positive/negative.  One of the reasons why ->rename() tends to be
ugly (and a source of quite a few bugs over years) are those "if
target is positive/if target is negative" scattered over the instances.

Make the choice conditional upon the positivity of target.

And you don't need to reproduce every quirk of rename() error values.
Really.  Unless you really intend to have userland do a loop of
linkat(2) attempts (a-la mkstemp(3)), followed by rename(2) for
fallback...

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-18  0:47                   ` Al Viro
@ 2020-01-18  1:17                     ` Omar Sandoval
  2020-01-18  2:20                       ` Al Viro
  2020-01-22  9:53                       ` David Howells
  0 siblings, 2 replies; 36+ messages in thread
From: Omar Sandoval @ 2020-01-18  1:17 UTC (permalink / raw)
  To: Al Viro
  Cc: Trond Myklebust, amir73il, dhowells, lsf-pc, hch, miklos, linux-fsdevel

On Sat, Jan 18, 2020 at 12:47:38AM +0000, Al Viro wrote:
> On Fri, Jan 17, 2020 at 03:54:44PM -0800, Omar Sandoval wrote:
>  
> > > 	3) permission checks need to be specified
> > 
> > I believe the only difference here vs standard linkat is that newpath
> > must not be immutable or append-only?
> 
> I would bloody hope not - at the very least you want sticky bit on parent
> to have effect, same as with rename()/rmdir()/unlink()...

Right, I should've reread may_delete(). I'll document that, too.

> > > references to pathconf, Cthulhu and other equally delightful entities
> > > are not really welcome.
> > 
> > EOPNOTSUPP is probably the most helpful.
> 
> Umm...  What would you feed it, though?  You need to get past your
> "links to the same file, do nothing" escape...

I think what you're getting at is that we can make this easier by
failing linkat AT_REPLACE very early if the filesystem doesn't have a
->link_replace(). Namely, if the filesystem doesn't support AT_REPLACE
but we still allow the "same file" or "newpath doesn't exist" cases to
succeed, then feature detection gets annoying.

As long as that's right, then applications can do the usual "try the new
feature or fall back" pattern that they do for fallocate modes and such.

> > Based on my previous attempt at it [1], it's not too bad.
> 
> +                       error = may_delete(dir, new_dentry, d_is_dir(old_dentry));                                       
> 
> Why bother with d_is_dir(), when you are going to reject directories
> anyway?
> 
> +       if (dir->i_op->link)                                                                                             
> +               error = dir->i_op->link(old_dentry, dir, new_dentry);                                                    
> +       else                                                                                                             
> +               error = dir->i_op->link2(old_dentry, dir, new_dentry, flags);                                            
> +       if (error)                                                                                                       
> +               goto out;                                                                                                
> +                                                                                                                        
> 
> No.  This is completely wrong; just make it ->link_replace() and be done
> with that; no extra arguments and *always* the same conditions wrt
> positive/negative.  One of the reasons why ->rename() tends to be
> ugly (and a source of quite a few bugs over years) are those "if
> target is positive/if target is negative" scattered over the instances.
> 
> Make the choice conditional upon the positivity of target.

Yup, you already convinced me that ->link_replace() is better in your
last email.

> And you don't need to reproduce every quirk of rename() error values.
> Really.  Unless you really intend to have userland do a loop of
> linkat(2) attempts (a-la mkstemp(3)), followed by rename(2) for
> fallback...

Understood, thanks. I'll get this all cleaned up and resent next week.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-18  1:17                     ` Omar Sandoval
@ 2020-01-18  2:20                       ` Al Viro
  2020-01-21 23:05                         ` Omar Sandoval
  2020-01-22  9:53                       ` David Howells
  1 sibling, 1 reply; 36+ messages in thread
From: Al Viro @ 2020-01-18  2:20 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Trond Myklebust, amir73il, dhowells, lsf-pc, hch, miklos, linux-fsdevel

On Fri, Jan 17, 2020 at 05:17:34PM -0800, Omar Sandoval wrote:
> > No.  This is completely wrong; just make it ->link_replace() and be done
> > with that; no extra arguments and *always* the same conditions wrt
> > positive/negative.  One of the reasons why ->rename() tends to be
> > ugly (and a source of quite a few bugs over years) are those "if
> > target is positive/if target is negative" scattered over the instances.
> > 
> > Make the choice conditional upon the positivity of target.
> 
> Yup, you already convinced me that ->link_replace() is better in your
> last email.

FWIW, that might be not so simple ;-/  Reason: NFS-like stuff.  Client
sees a negative in cache; the problem is how to decide whether to
tell the server "OK, I want normal link()" vs. "if it turns out that
someone has created it by the time you see the request, give do
a replacing link".  Sure, if could treat ->link() telling you -EEXIST
as "OK, repeat it with ->link_replace(), then", but that's an extra
roundtrip...

Hell knows...  I would really like to avoid any kind of ->atomic_open()
redux ;-/

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-18  2:20                       ` Al Viro
@ 2020-01-21 23:05                         ` Omar Sandoval
  2020-01-22  6:57                           ` Amir Goldstein
  0 siblings, 1 reply; 36+ messages in thread
From: Omar Sandoval @ 2020-01-21 23:05 UTC (permalink / raw)
  To: Al Viro
  Cc: Trond Myklebust, amir73il, dhowells, lsf-pc, hch, miklos, linux-fsdevel

On Sat, Jan 18, 2020 at 02:20:32AM +0000, Al Viro wrote:
> On Fri, Jan 17, 2020 at 05:17:34PM -0800, Omar Sandoval wrote:
> > > No.  This is completely wrong; just make it ->link_replace() and be done
> > > with that; no extra arguments and *always* the same conditions wrt
> > > positive/negative.  One of the reasons why ->rename() tends to be
> > > ugly (and a source of quite a few bugs over years) are those "if
> > > target is positive/if target is negative" scattered over the instances.
> > > 
> > > Make the choice conditional upon the positivity of target.
> > 
> > Yup, you already convinced me that ->link_replace() is better in your
> > last email.
> 
> FWIW, that might be not so simple ;-/  Reason: NFS-like stuff.  Client
> sees a negative in cache; the problem is how to decide whether to
> tell the server "OK, I want normal link()" vs. "if it turns out that
> someone has created it by the time you see the request, give do
> a replacing link".  Sure, if could treat ->link() telling you -EEXIST
> as "OK, repeat it with ->link_replace(), then", but that's an extra
> roundtrip...

So that's a point in favor of ->link(). But then if we overload ->link()
instead of adding ->link_replace() and we want EOPNOTSUPP to fail fast,
we need to add something like FMODE_SUPPORTS_AT_REPLACE.

Some options I see are:

1. Go with ->link_replace() until network filesystem specs support
   AT_REPLACE. That would be a bit of a mess down the line, though.
2. Stick with ->link(), let the filesystem implementations deal with the
   positive targets, and add FMODE_SUPPORTS_AT_REPLACE so that feature
   detection remains easy for userspace.
3. Option 2, but don't bother with FMODE_SUPPORTS_AT_REPLACE.

FWIW, there is precendent for option 3: RENAME_EXCHANGE. That has the
same "files are the same" noop condition, and we don't know whether
RENAME_EXCHANGE is supported until ->rename(). A cursory search shows
that applications using RENAME_EXCHANGE try it and fall back to a
non-atomic exchange on EINVAL. They could do the exact same thing for
AT_REPLACE.

None of it is elegant, but which approach would you hate the least?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-21 23:05                         ` Omar Sandoval
@ 2020-01-22  6:57                           ` Amir Goldstein
  2020-01-22 22:10                             ` Omar Sandoval
  0 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2020-01-22  6:57 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Al Viro, Trond Myklebust, dhowells, lsf-pc, hch, miklos,
	linux-fsdevel, Darrick J. Wong

On Wed, Jan 22, 2020 at 1:05 AM Omar Sandoval <osandov@osandov.com> wrote:
>
> On Sat, Jan 18, 2020 at 02:20:32AM +0000, Al Viro wrote:
> > On Fri, Jan 17, 2020 at 05:17:34PM -0800, Omar Sandoval wrote:
> > > > No.  This is completely wrong; just make it ->link_replace() and be done
> > > > with that; no extra arguments and *always* the same conditions wrt
> > > > positive/negative.  One of the reasons why ->rename() tends to be
> > > > ugly (and a source of quite a few bugs over years) are those "if
> > > > target is positive/if target is negative" scattered over the instances.
> > > >
> > > > Make the choice conditional upon the positivity of target.
> > >
> > > Yup, you already convinced me that ->link_replace() is better in your
> > > last email.
> >
> > FWIW, that might be not so simple ;-/  Reason: NFS-like stuff.  Client
> > sees a negative in cache; the problem is how to decide whether to
> > tell the server "OK, I want normal link()" vs. "if it turns out that
> > someone has created it by the time you see the request, give do
> > a replacing link".  Sure, if could treat ->link() telling you -EEXIST
> > as "OK, repeat it with ->link_replace(), then", but that's an extra
> > roundtrip...
>
> So that's a point in favor of ->link(). But then if we overload ->link()
> instead of adding ->link_replace() and we want EOPNOTSUPP to fail fast,
> we need to add something like FMODE_SUPPORTS_AT_REPLACE.
>
> Some options I see are:
>
> 1. Go with ->link_replace() until network filesystem specs support
>    AT_REPLACE. That would be a bit of a mess down the line, though.
> 2. Stick with ->link(), let the filesystem implementations deal with the
>    positive targets, and add FMODE_SUPPORTS_AT_REPLACE so that feature
>    detection remains easy for userspace.

"detection remains easy..." why is this important?
Do you know of a userspace application that would have a problem checking
if AT_REPLACE works, fall back to whatever and never try it ever again?
Besides, when said application tried to open an O_TMPFILE and fail, it
will have already detected a lot of the unsupported cases.
Sorry for not reading all the thread again, some API questions:
- We intend to allow AT_REPLACE only with O_TMPFILE src. Right?
- Does AT_REPLACE assert that destination is positive? and if so why?
The functionality that is complement to atomic rename would be atomic
link, destination could be positive or negative, but end results will be
that destination is positive with new inode.
With those semantics, ->link_replace() makes much less sense IMO.

> 3. Option 2, but don't bother with FMODE_SUPPORTS_AT_REPLACE.
>
> FWIW, there is precendent for option 3: RENAME_EXCHANGE. That has the
> same "files are the same" noop condition, and we don't know whether
> RENAME_EXCHANGE is supported until ->rename(). A cursory search shows
> that applications using RENAME_EXCHANGE try it and fall back to a
> non-atomic exchange on EINVAL. They could do the exact same thing for
> AT_REPLACE.
>

That sounds like the most reasonable approach to me. Let's not over complicate.
If you find that creates too much generic logic in ->link(), you can take
the approach Darrick employed with generic_remap_file_range_prep() for
filesystems that want to support AT_REPLACE. All other fs just need to check
for valid flags mask, like the ->rename() precedent.

Another side discussion about passing AT_ flags down to filesystems.
Traditionally, that was never done, until AT_STATX_ mixed vfs flags
with filesystem flags on the same AT_ namespace.
Now we have linkat() syscall that can take only AT_ vfs flags and
renameat2() syscall that can take only RENAME_ filesystem flags not
from the AT_ namespace.
I feel that the situation of having AT_REPLACE API along with
RENAME_EXCHANGE and RENAME_NOREPLACE is a bit awkward
and some standardization is in order.

According to include/uapi/linux/fcntl.h, there is no numeric collision
between the RENAME_ flag namepsace and AT_ flags namespace,
although I do find it suspicious that AT_ flags start at 0x100...
Could we define AT_RENAME_xxx RENAME_xxx flags and name the
new flag AT_LINK_REPLACE, so it is a bit more clear that the flag
is specific to link(2) syscall and not vfs generic AT_ flag.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-18  1:17                     ` Omar Sandoval
  2020-01-18  2:20                       ` Al Viro
@ 2020-01-22  9:53                       ` David Howells
  1 sibling, 0 replies; 36+ messages in thread
From: David Howells @ 2020-01-22  9:53 UTC (permalink / raw)
  To: Al Viro
  Cc: dhowells, Omar Sandoval, Trond Myklebust, amir73il, lsf-pc, hch,
	miklos, linux-fsdevel

Al Viro <viro@zeniv.linux.org.uk> wrote:

> FWIW, that might be not so simple ;-/  Reason: NFS-like stuff.  Client
> sees a negative in cache; the problem is how to decide whether to
> tell the server "OK, I want normal link()" vs. "if it turns out that
> someone has created it by the time you see the request, give do
> a replacing link".  Sure, if could treat ->link() telling you -EEXIST
> as "OK, repeat it with ->link_replace(), then", but that's an extra
> roundtrip...

If someone asks for link_replace on a filesystem that doesn't support it or if
it's a network filesystem in which the client does, but the server being
accessed does not, then return an error (say EOPNOTSUPP) and let userspace (or
cachefiles or whatever) handle the fallback?

David


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-22  6:57                           ` Amir Goldstein
@ 2020-01-22 22:10                             ` Omar Sandoval
  2020-01-23  3:47                               ` Al Viro
  0 siblings, 1 reply; 36+ messages in thread
From: Omar Sandoval @ 2020-01-22 22:10 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Al Viro, Trond Myklebust, dhowells, lsf-pc, hch, miklos,
	linux-fsdevel, Darrick J. Wong

On Wed, Jan 22, 2020 at 08:57:01AM +0200, Amir Goldstein wrote:
> On Wed, Jan 22, 2020 at 1:05 AM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > On Sat, Jan 18, 2020 at 02:20:32AM +0000, Al Viro wrote:
> > > On Fri, Jan 17, 2020 at 05:17:34PM -0800, Omar Sandoval wrote:
> > > > > No.  This is completely wrong; just make it ->link_replace() and be done
> > > > > with that; no extra arguments and *always* the same conditions wrt
> > > > > positive/negative.  One of the reasons why ->rename() tends to be
> > > > > ugly (and a source of quite a few bugs over years) are those "if
> > > > > target is positive/if target is negative" scattered over the instances.
> > > > >
> > > > > Make the choice conditional upon the positivity of target.
> > > >
> > > > Yup, you already convinced me that ->link_replace() is better in your
> > > > last email.
> > >
> > > FWIW, that might be not so simple ;-/  Reason: NFS-like stuff.  Client
> > > sees a negative in cache; the problem is how to decide whether to
> > > tell the server "OK, I want normal link()" vs. "if it turns out that
> > > someone has created it by the time you see the request, give do
> > > a replacing link".  Sure, if could treat ->link() telling you -EEXIST
> > > as "OK, repeat it with ->link_replace(), then", but that's an extra
> > > roundtrip...
> >
> > So that's a point in favor of ->link(). But then if we overload ->link()
> > instead of adding ->link_replace() and we want EOPNOTSUPP to fail fast,
> > we need to add something like FMODE_SUPPORTS_AT_REPLACE.
> >
> > Some options I see are:
> >
> > 1. Go with ->link_replace() until network filesystem specs support
> >    AT_REPLACE. That would be a bit of a mess down the line, though.
> > 2. Stick with ->link(), let the filesystem implementations deal with the
> >    positive targets, and add FMODE_SUPPORTS_AT_REPLACE so that feature
> >    detection remains easy for userspace.
> 
> "detection remains easy..." why is this important?

As I mentioned, I don't think it's necessary given the precedent.
However, Al voiced some concern over this earlier in the thread:

---
> > >        7) how do users tell if filesystem supports that?  And no,
> > >references to pathconf, Cthulhu and other equally delightful entities
> > >are not really welcome.
> >
> > EOPNOTSUPP is probably the most helpful.
> 
> Umm...  What would you feed it, though?  You need to get past your
> "links to the same file, do nothing" escape...
---

> Do you know of a userspace application that would have a problem checking
> if AT_REPLACE works, fall back to whatever and never try it ever again?
> 
> Besides, when said application tried to open an O_TMPFILE and fail, it
> will have already detected a lot of the unsupported cases.
> Sorry for not reading all the thread again, some API questions:
> - We intend to allow AT_REPLACE only with O_TMPFILE src. Right?

I wasn't planning on having that restriction. It's not too much effort
for filesystems to support it for normal files, so I wouldn't want to
place an artificial restriction on a useful primitive.

> - Does AT_REPLACE assert that destination is positive? and if so why?

No, it should work like a normal link() if the destination doesn't
exist.

> The functionality that is complement to atomic rename would be atomic
> link, destination could be positive or negative, but end results will be
> that destination is positive with new inode.
> With those semantics, ->link_replace() makes much less sense IMO.
> 
> > 3. Option 2, but don't bother with FMODE_SUPPORTS_AT_REPLACE.
> >
> > FWIW, there is precendent for option 3: RENAME_EXCHANGE. That has the
> > same "files are the same" noop condition, and we don't know whether
> > RENAME_EXCHANGE is supported until ->rename(). A cursory search shows
> > that applications using RENAME_EXCHANGE try it and fall back to a
> > non-atomic exchange on EINVAL. They could do the exact same thing for
> > AT_REPLACE.
> >
> 
> That sounds like the most reasonable approach to me. Let's not over complicate.
> If you find that creates too much generic logic in ->link(), you can take
> the approach Darrick employed with generic_remap_file_range_prep() for
> filesystems that want to support AT_REPLACE. All other fs just need to check
> for valid flags mask, like the ->rename() precedent.
> 
> Another side discussion about passing AT_ flags down to filesystems.
> Traditionally, that was never done, until AT_STATX_ mixed vfs flags
> with filesystem flags on the same AT_ namespace.
> Now we have linkat() syscall that can take only AT_ vfs flags and
> renameat2() syscall that can take only RENAME_ filesystem flags not
> from the AT_ namespace.
> I feel that the situation of having AT_REPLACE API along with
> RENAME_EXCHANGE and RENAME_NOREPLACE is a bit awkward
> and some standardization is in order.
> 
> According to include/uapi/linux/fcntl.h, there is no numeric collision
> between the RENAME_ flag namepsace and AT_ flags namespace,
> although I do find it suspicious that AT_ flags start at 0x100...
> Could we define AT_RENAME_xxx RENAME_xxx flags and name the
> new flag AT_LINK_REPLACE, so it is a bit more clear that the flag
> is specific to link(2) syscall and not vfs generic AT_ flag.

Sure, I'll rename it to AT_LINK_REPLACE. AT_RENAME_xxx is probably for a
separate series.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-22 22:10                             ` Omar Sandoval
@ 2020-01-23  3:47                               ` Al Viro
  2020-01-23  7:16                                 ` Dave Chinner
  2020-01-28 14:35                                 ` David Howells
  0 siblings, 2 replies; 36+ messages in thread
From: Al Viro @ 2020-01-23  3:47 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Amir Goldstein, Trond Myklebust, dhowells, lsf-pc, hch, miklos,
	linux-fsdevel, Darrick J. Wong

On Wed, Jan 22, 2020 at 02:10:03PM -0800, Omar Sandoval wrote:

> > Sorry for not reading all the thread again, some API questions:
> > - We intend to allow AT_REPLACE only with O_TMPFILE src. Right?
> 
> I wasn't planning on having that restriction. It's not too much effort
> for filesystems to support it for normal files, so I wouldn't want to
> place an artificial restriction on a useful primitive.

I'm not sure; that's how we ended up with the unspeakable APIs like
rename(2), after all...

Said that, in this particular case I would suggest allowing it for
any source - what matters is not the _origin_ of that thing, but the
current state.  And linkat() (including the already existing variant)
can change that at any time...

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-23  3:47                               ` Al Viro
@ 2020-01-23  7:16                                 ` Dave Chinner
  2020-01-23  7:47                                   ` Amir Goldstein
  2020-01-28  1:27                                   ` Omar Sandoval
  2020-01-28 14:35                                 ` David Howells
  1 sibling, 2 replies; 36+ messages in thread
From: Dave Chinner @ 2020-01-23  7:16 UTC (permalink / raw)
  To: Al Viro
  Cc: Omar Sandoval, Amir Goldstein, Trond Myklebust, dhowells, lsf-pc,
	hch, miklos, linux-fsdevel, Darrick J. Wong

On Thu, Jan 23, 2020 at 03:47:45AM +0000, Al Viro wrote:
> On Wed, Jan 22, 2020 at 02:10:03PM -0800, Omar Sandoval wrote:
> 
> > > Sorry for not reading all the thread again, some API questions:
> > > - We intend to allow AT_REPLACE only with O_TMPFILE src. Right?
> > 
> > I wasn't planning on having that restriction. It's not too much effort
> > for filesystems to support it for normal files, so I wouldn't want to
> > place an artificial restriction on a useful primitive.
> 
> I'm not sure; that's how we ended up with the unspeakable APIs like
> rename(2), after all...

Yet it is just rename(2) with the serial numbers filed off -
complete with all the same data vs metadata ordering problems that
rename(2) comes along with. i.e. it needs fsync to guarantee data
integrity of the source file before the linkat() call is made.

If we can forsee that users are going to complain that
linkat(AT_REPLACE) using O_TMPFILE files is not atomic because it
leaves zero length files behind after a crash just like rename()
does, then we haven't really improved anything at all...

And, really, I don't think anyone wants another API that requires
multiple fsync calls to use correctly for crash-safe file
replacement, let alone try to teach people who still cant rename a
file safely how to use it....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-23  7:16                                 ` Dave Chinner
@ 2020-01-23  7:47                                   ` Amir Goldstein
  2020-01-24 21:25                                     ` Dave Chinner
  2020-01-28  1:27                                   ` Omar Sandoval
  1 sibling, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2020-01-23  7:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Al Viro, Omar Sandoval, Trond Myklebust, dhowells, lsf-pc, hch,
	miklos, linux-fsdevel, Darrick J. Wong

On Thu, Jan 23, 2020 at 9:16 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Jan 23, 2020 at 03:47:45AM +0000, Al Viro wrote:
> > On Wed, Jan 22, 2020 at 02:10:03PM -0800, Omar Sandoval wrote:
> >
> > > > Sorry for not reading all the thread again, some API questions:
> > > > - We intend to allow AT_REPLACE only with O_TMPFILE src. Right?
> > >
> > > I wasn't planning on having that restriction. It's not too much effort
> > > for filesystems to support it for normal files, so I wouldn't want to
> > > place an artificial restriction on a useful primitive.
> >

I have too many gray hairs each one for implementing a "useful primitive"
that nobody asked for and bare the consequences.
Your introduction to AT_REPLACE uses O_TMPFILE.
I see no other sane use of the interface.

> > I'm not sure; that's how we ended up with the unspeakable APIs like
> > rename(2), after all...
>
> Yet it is just rename(2) with the serial numbers filed off -
> complete with all the same data vs metadata ordering problems that
> rename(2) comes along with. i.e. it needs fsync to guarantee data
> integrity of the source file before the linkat() call is made.
>
> If we can forsee that users are going to complain that
> linkat(AT_REPLACE) using O_TMPFILE files is not atomic because it
> leaves zero length files behind after a crash just like rename()
> does, then we haven't really improved anything at all...
>
> And, really, I don't think anyone wants another API that requires
> multiple fsync calls to use correctly for crash-safe file
> replacement, let alone try to teach people who still cant rename a
> file safely how to use it....
>

Are you suggesting that AT_LINK_REPLACE should have some of
the semantics I posted in this RFC  for AT_ATOMIC_xxx:
https://lore.kernel.org/linux-fsdevel/20190527172655.9287-1-amir73il@gmail.com/

I admit I did not follow up on performance benchmarks that
you asked on that thread, but I also see the value in a simple API
as opposed to the AIO_FSYNC scheme that you proposed.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-23  7:47                                   ` Amir Goldstein
@ 2020-01-24 21:25                                     ` Dave Chinner
  2020-01-31  5:24                                       ` Darrick J. Wong
  0 siblings, 1 reply; 36+ messages in thread
From: Dave Chinner @ 2020-01-24 21:25 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Al Viro, Omar Sandoval, Trond Myklebust, dhowells, lsf-pc, hch,
	miklos, linux-fsdevel, Darrick J. Wong

On Thu, Jan 23, 2020 at 09:47:30AM +0200, Amir Goldstein wrote:
> On Thu, Jan 23, 2020 at 9:16 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Thu, Jan 23, 2020 at 03:47:45AM +0000, Al Viro wrote:
> > > On Wed, Jan 22, 2020 at 02:10:03PM -0800, Omar Sandoval wrote:
> > >
> > > > > Sorry for not reading all the thread again, some API questions:
> > > > > - We intend to allow AT_REPLACE only with O_TMPFILE src. Right?
> > > >
> > > > I wasn't planning on having that restriction. It's not too much effort
> > > > for filesystems to support it for normal files, so I wouldn't want to
> > > > place an artificial restriction on a useful primitive.
> > >
> 
> I have too many gray hairs each one for implementing a "useful primitive"
> that nobody asked for and bare the consequences.
> Your introduction to AT_REPLACE uses O_TMPFILE.
> I see no other sane use of the interface.
> 
> > > I'm not sure; that's how we ended up with the unspeakable APIs like
> > > rename(2), after all...
> >
> > Yet it is just rename(2) with the serial numbers filed off -
> > complete with all the same data vs metadata ordering problems that
> > rename(2) comes along with. i.e. it needs fsync to guarantee data
> > integrity of the source file before the linkat() call is made.
> >
> > If we can forsee that users are going to complain that
> > linkat(AT_REPLACE) using O_TMPFILE files is not atomic because it
> > leaves zero length files behind after a crash just like rename()
> > does, then we haven't really improved anything at all...
> >
> > And, really, I don't think anyone wants another API that requires
> > multiple fsync calls to use correctly for crash-safe file
> > replacement, let alone try to teach people who still cant rename a
> > file safely how to use it....
> >
> 
> Are you suggesting that AT_LINK_REPLACE should have some of
> the semantics I posted in this RFC  for AT_ATOMIC_xxx:
> https://lore.kernel.org/linux-fsdevel/20190527172655.9287-1-amir73il@gmail.com/

Not directly.

All I'm pointing out is that data integrity guarantees of
AT_LINK_REPLACE are yet another aspect of this new feature that
has not yet been specified or documented at all.

And in pointing this out, I'm making an observation that the
rename(2) behaviour which everyone seems to be assuming this
function will replicate is a terrible model to copy/reinvent.

Addressing this problem is something for the people pushing for
AT_LINK_REPLACE to solve, not me....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-23  7:16                                 ` Dave Chinner
  2020-01-23  7:47                                   ` Amir Goldstein
@ 2020-01-28  1:27                                   ` Omar Sandoval
  1 sibling, 0 replies; 36+ messages in thread
From: Omar Sandoval @ 2020-01-28  1:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Al Viro, Amir Goldstein, Trond Myklebust, dhowells, lsf-pc, hch,
	miklos, linux-fsdevel, Darrick J. Wong

On Thu, Jan 23, 2020 at 06:16:39PM +1100, Dave Chinner wrote:
> On Thu, Jan 23, 2020 at 03:47:45AM +0000, Al Viro wrote:
> > On Wed, Jan 22, 2020 at 02:10:03PM -0800, Omar Sandoval wrote:
> > 
> > > > Sorry for not reading all the thread again, some API questions:
> > > > - We intend to allow AT_REPLACE only with O_TMPFILE src. Right?
> > > 
> > > I wasn't planning on having that restriction. It's not too much effort
> > > for filesystems to support it for normal files, so I wouldn't want to
> > > place an artificial restriction on a useful primitive.
> > 
> > I'm not sure; that's how we ended up with the unspeakable APIs like
> > rename(2), after all...
> 
> Yet it is just rename(2) with the serial numbers filed off -
> complete with all the same data vs metadata ordering problems that
> rename(2) comes along with. i.e. it needs fsync to guarantee data
> integrity of the source file before the linkat() call is made.
> 
> If we can forsee that users are going to complain that
> linkat(AT_REPLACE) using O_TMPFILE files is not atomic because it
> leaves zero length files behind after a crash just like rename()
> does, then we haven't really improved anything at all...
> 
> And, really, I don't think anyone wants another API that requires
> multiple fsync calls to use correctly for crash-safe file
> replacement, let alone try to teach people who still cant rename a
> file safely how to use it....

AT_REPLACE would have a leg up in that we can at least mention the (lack
of) integrity guarantees in the man page. It's no suprise that no one
gets rename right when the documentation make no mention of
integrity/durability/crash safety.

Sure, if everyone wants AT_REPLACE to guarantee integrity, we can do
that. But I'm going to start with the simplest proposal that makes no
such guarantees.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-23  3:47                               ` Al Viro
  2020-01-23  7:16                                 ` Dave Chinner
@ 2020-01-28 14:35                                 ` David Howells
  2020-01-31  5:31                                   ` hch
  2020-01-31  8:04                                   ` David Howells
  1 sibling, 2 replies; 36+ messages in thread
From: David Howells @ 2020-01-28 14:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: dhowells, Al Viro, Omar Sandoval, Amir Goldstein,
	Trond Myklebust, lsf-pc, hch, miklos, linux-fsdevel,
	Darrick J. Wong

Dave Chinner <david@fromorbit.com> wrote:

> If we can forsee that users are going to complain that
> linkat(AT_REPLACE) using O_TMPFILE files is not atomic because it
> leaves zero length files behind after a crash just like rename()
> does, then we haven't really improved anything at all...

For my purposes, I would like it to look like the end result of
unlink()+linkat().

I need to avoid losing the metadata in case of disconnected operation.  Now, I
can do it like Miklós suggested and have a bunch of temporary dirs to iterate
round, then create new files there and rename over the target file.

I'm using direct I/O, so I'm assuming I don't need to fsync().

David


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-24 21:25                                     ` Dave Chinner
@ 2020-01-31  5:24                                       ` Darrick J. Wong
  2020-01-31  5:29                                         ` hch
  2020-01-31  7:00                                         ` Amir Goldstein
  0 siblings, 2 replies; 36+ messages in thread
From: Darrick J. Wong @ 2020-01-31  5:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Amir Goldstein, Al Viro, Omar Sandoval, Trond Myklebust,
	dhowells, lsf-pc, hch, miklos, linux-fsdevel

On Sat, Jan 25, 2020 at 08:25:46AM +1100, Dave Chinner wrote:
> On Thu, Jan 23, 2020 at 09:47:30AM +0200, Amir Goldstein wrote:
> > On Thu, Jan 23, 2020 at 9:16 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Thu, Jan 23, 2020 at 03:47:45AM +0000, Al Viro wrote:
> > > > On Wed, Jan 22, 2020 at 02:10:03PM -0800, Omar Sandoval wrote:
> > > >
> > > > > > Sorry for not reading all the thread again, some API questions:
> > > > > > - We intend to allow AT_REPLACE only with O_TMPFILE src. Right?
> > > > >
> > > > > I wasn't planning on having that restriction. It's not too much effort
> > > > > for filesystems to support it for normal files, so I wouldn't want to
> > > > > place an artificial restriction on a useful primitive.
> > > >
> > 
> > I have too many gray hairs each one for implementing a "useful primitive"
> > that nobody asked for and bare the consequences.
> > Your introduction to AT_REPLACE uses O_TMPFILE.
> > I see no other sane use of the interface.
> > 
> > > > I'm not sure; that's how we ended up with the unspeakable APIs like
> > > > rename(2), after all...
> > >
> > > Yet it is just rename(2) with the serial numbers filed off -
> > > complete with all the same data vs metadata ordering problems that
> > > rename(2) comes along with. i.e. it needs fsync to guarantee data
> > > integrity of the source file before the linkat() call is made.
> > >
> > > If we can forsee that users are going to complain that
> > > linkat(AT_REPLACE) using O_TMPFILE files is not atomic because it
> > > leaves zero length files behind after a crash just like rename()
> > > does, then we haven't really improved anything at all...
> > >
> > > And, really, I don't think anyone wants another API that requires
> > > multiple fsync calls to use correctly for crash-safe file
> > > replacement, let alone try to teach people who still cant rename a
> > > file safely how to use it....
> > >
> > 
> > Are you suggesting that AT_LINK_REPLACE should have some of
> > the semantics I posted in this RFC  for AT_ATOMIC_xxx:
> > https://lore.kernel.org/linux-fsdevel/20190527172655.9287-1-amir73il@gmail.com/
> 
> Not directly.
> 
> All I'm pointing out is that data integrity guarantees of
> AT_LINK_REPLACE are yet another aspect of this new feature that
> has not yet been specified or documented at all.
> 
> And in pointing this out, I'm making an observation that the
> rename(2) behaviour which everyone seems to be assuming this
> function will replicate is a terrible model to copy/reinvent.
> 
> Addressing this problem is something for the people pushing for
> AT_LINK_REPLACE to solve, not me....

Or the grumpy maintainer who will have to digest all of this.

Can we update the documentation to admit that many people will probably
want to use this (and rename) as atomic swap operations?

"The filesystem will commit the data and metadata of all files and
directories involved in the link operation to stable storage before the
call returns."

And finally add a flag:

"AT_LINK_EATMYDATA: If specified, the filesystem may opt out of
committing anything to disk."

(Or a prctl(PR_EATMYDATA) to make it process wide, idk.)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-31  5:24                                       ` Darrick J. Wong
@ 2020-01-31  5:29                                         ` hch
  2020-01-31  7:00                                         ` Amir Goldstein
  1 sibling, 0 replies; 36+ messages in thread
From: hch @ 2020-01-31  5:29 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, Amir Goldstein, Al Viro, Omar Sandoval,
	Trond Myklebust, dhowells, lsf-pc, hch, miklos, linux-fsdevel

On Thu, Jan 30, 2020 at 09:24:54PM -0800, Darrick J. Wong wrote:
> Or the grumpy maintainer who will have to digest all of this.
> 
> Can we update the documentation to admit that many people will probably
> want to use this (and rename) as atomic swap operations?
> 
> "The filesystem will commit the data and metadata of all files and
> directories involved in the link operation to stable storage before the
> call returns."

That sounds horrible, because it is so different from any other metadata
operation.  Maybe requiring fsync to stabilize metadata ops wasn't the
best idea ever, but having some varinats of linkat behave from others
is just stupid.  We've been beating the fact that you shall fsync into
peoples heads for years.  No reason to make an exception for an obscure
operation now.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-28 14:35                                 ` David Howells
@ 2020-01-31  5:31                                   ` hch
  2020-01-31  8:04                                   ` David Howells
  1 sibling, 0 replies; 36+ messages in thread
From: hch @ 2020-01-31  5:31 UTC (permalink / raw)
  To: David Howells
  Cc: Dave Chinner, Al Viro, Omar Sandoval, Amir Goldstein,
	Trond Myklebust, lsf-pc, hch, miklos, linux-fsdevel,
	Darrick J. Wong

On Tue, Jan 28, 2020 at 02:35:38PM +0000, David Howells wrote:
> I'm using direct I/O, so I'm assuming I don't need to fsync().

Direct I/O of course requires fsync.  What makes you think different?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-31  5:24                                       ` Darrick J. Wong
  2020-01-31  5:29                                         ` hch
@ 2020-01-31  7:00                                         ` Amir Goldstein
  2020-01-31 20:33                                           ` Omar Sandoval
  1 sibling, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2020-01-31  7:00 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, Al Viro, Omar Sandoval, Trond Myklebust, dhowells,
	lsf-pc, hch, miklos, linux-fsdevel, Jan Kara

On Fri, Jan 31, 2020 at 7:27 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> On Sat, Jan 25, 2020 at 08:25:46AM +1100, Dave Chinner wrote:
> > On Thu, Jan 23, 2020 at 09:47:30AM +0200, Amir Goldstein wrote:
> > > On Thu, Jan 23, 2020 at 9:16 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Thu, Jan 23, 2020 at 03:47:45AM +0000, Al Viro wrote:
> > > > > On Wed, Jan 22, 2020 at 02:10:03PM -0800, Omar Sandoval wrote:
> > > > >
> > > > > > > Sorry for not reading all the thread again, some API questions:
> > > > > > > - We intend to allow AT_REPLACE only with O_TMPFILE src. Right?
> > > > > >
> > > > > > I wasn't planning on having that restriction. It's not too much effort
> > > > > > for filesystems to support it for normal files, so I wouldn't want to
> > > > > > place an artificial restriction on a useful primitive.
> > > > >
> > >
> > > I have too many gray hairs each one for implementing a "useful primitive"
> > > that nobody asked for and bare the consequences.
> > > Your introduction to AT_REPLACE uses O_TMPFILE.
> > > I see no other sane use of the interface.
> > >
> > > > > I'm not sure; that's how we ended up with the unspeakable APIs like
> > > > > rename(2), after all...
> > > >
> > > > Yet it is just rename(2) with the serial numbers filed off -
> > > > complete with all the same data vs metadata ordering problems that
> > > > rename(2) comes along with. i.e. it needs fsync to guarantee data
> > > > integrity of the source file before the linkat() call is made.
> > > >
> > > > If we can forsee that users are going to complain that
> > > > linkat(AT_REPLACE) using O_TMPFILE files is not atomic because it
> > > > leaves zero length files behind after a crash just like rename()
> > > > does, then we haven't really improved anything at all...
> > > >
> > > > And, really, I don't think anyone wants another API that requires
> > > > multiple fsync calls to use correctly for crash-safe file
> > > > replacement, let alone try to teach people who still cant rename a
> > > > file safely how to use it....
> > > >
> > >
> > > Are you suggesting that AT_LINK_REPLACE should have some of
> > > the semantics I posted in this RFC  for AT_ATOMIC_xxx:
> > > https://lore.kernel.org/linux-fsdevel/20190527172655.9287-1-amir73il@gmail.com/
> >
> > Not directly.
> >
> > All I'm pointing out is that data integrity guarantees of
> > AT_LINK_REPLACE are yet another aspect of this new feature that
> > has not yet been specified or documented at all.
> >
> > And in pointing this out, I'm making an observation that the
> > rename(2) behaviour which everyone seems to be assuming this
> > function will replicate is a terrible model to copy/reinvent.
> >
> > Addressing this problem is something for the people pushing for
> > AT_LINK_REPLACE to solve, not me....
>
> Or the grumpy maintainer who will have to digest all of this.
>
> Can we update the documentation to admit that many people will probably
> want to use this (and rename) as atomic swap operations?
>
> "The filesystem will commit the data and metadata of all files and
> directories involved in the link operation to stable storage before the
> call returns."
>
> And finally add a flag:
>
> "AT_LINK_EATMYDATA: If specified, the filesystem may opt out of
> committing anything to disk."
>

I agree with Christoph that this anomaly is not a good idea, but I also
agree with you and Dave that if an operation is meant to be used for
atomic swap, we should make it much harder for users to get it wrong.

To that end, we can define both flags AT_LINK_REPLACE and
AT_ATOMIC in uapi, but also define the combo
AT_LINK_ATOMIC_REPLACE and let the documentation be
very much focused on this usage.

I would like to stress a point, which some who haven't followed [1]
closely might have missed - the purpose of AT_ATOMIC is
*not* to make the new link durable - it is *not* a replacement for fsync
on parent dir. It's purpose is *only* to create a dependency between the
durability of the new link and the new data it can expose.

AFAICT, this means that AT_ATOMIC would be implemented
as filemap_write_and_wait() in xfs/ext4 and probably fdatasync in btrfs.

Really, there is no chance that any user is interested in a non-atomic
replace in that respect, so I am not even sure that we need an explicit
flag for it. As it is, the AT_REPLACE API design would rank poorly as
"The obvious use is wrong."

An explicit AT_ATOMIC flag would help is that we could make the same
semantics also apply to rename(2) with the same flag.

Omar,

I do understand why you say that you want to implement AT_REPLACE
and let someone else take care of (bike shedding) AT_ATOMIC later.
The problem with that approach is that man page will have the
AT_REPLACE documentation spread out without the mention of
AT_ATOMIC_REPLACE and that can generate damage long into the future.

TBH, I don't think there were any actual objections to the final
AT_ATOMIC_DATA proposal (which was Jan's idea by the way).
Dave was claiming that introducing a new API requires proof of improvement,
so he suggested (for the sake of debate) that I compare the performance of
atomic links to using batched AIO_FSYNC with rename/linkat completion
callbacks and I did not follow up on that.

The situation now is different. You are proposing a new API and the improvement
is clear, so the concern is to get the API right, not to show that it
performs better
than another API.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/20190527172655.9287-1-amir73il@gmail.com/

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-28 14:35                                 ` David Howells
  2020-01-31  5:31                                   ` hch
@ 2020-01-31  8:04                                   ` David Howells
  2020-01-31  8:56                                     ` Amir Goldstein
  1 sibling, 1 reply; 36+ messages in thread
From: David Howells @ 2020-01-31  8:04 UTC (permalink / raw)
  To: hch
  Cc: dhowells, Dave Chinner, Al Viro, Omar Sandoval, Amir Goldstein,
	Trond Myklebust, lsf-pc, miklos, linux-fsdevel, Darrick J. Wong

hch@lst.de <hch@lst.de> wrote:

> > I'm using direct I/O, so I'm assuming I don't need to fsync().
> 
> Direct I/O of course requires fsync.  What makes you think different?

I guess that's fair.  Even if the data makes it directly to storage, that's
not a guarantee that the metadata does.

David


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-31  8:04                                   ` David Howells
@ 2020-01-31  8:56                                     ` Amir Goldstein
  0 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2020-01-31  8:56 UTC (permalink / raw)
  To: David Howells
  Cc: hch, Dave Chinner, Al Viro, Omar Sandoval, Trond Myklebust,
	lsf-pc, miklos, linux-fsdevel, Darrick J. Wong

On Fri, Jan 31, 2020 at 10:04 AM David Howells <dhowells@redhat.com> wrote:
>
> hch@lst.de <hch@lst.de> wrote:
>
> > > I'm using direct I/O, so I'm assuming I don't need to fsync().
> >
> > Direct I/O of course requires fsync.  What makes you think different?
>
> I guess that's fair.  Even if the data makes it directly to storage, that's
> not a guarantee that the metadata does.
>

It's not only that. DIO gets data to storage *layer*, but it doesn't
tell storage *layer* to make the data durable (usually with FLUSH/FUA).

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-31  7:00                                         ` Amir Goldstein
@ 2020-01-31 20:33                                           ` Omar Sandoval
  2020-01-31 21:55                                             ` Amir Goldstein
  0 siblings, 1 reply; 36+ messages in thread
From: Omar Sandoval @ 2020-01-31 20:33 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Darrick J. Wong, Dave Chinner, Al Viro, Trond Myklebust,
	dhowells, lsf-pc, hch, miklos, linux-fsdevel, Jan Kara

On Fri, Jan 31, 2020 at 09:00:47AM +0200, Amir Goldstein wrote:
> On Fri, Jan 31, 2020 at 7:27 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >
> > On Sat, Jan 25, 2020 at 08:25:46AM +1100, Dave Chinner wrote:
> > > On Thu, Jan 23, 2020 at 09:47:30AM +0200, Amir Goldstein wrote:
> > > > On Thu, Jan 23, 2020 at 9:16 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > >
> > > > > On Thu, Jan 23, 2020 at 03:47:45AM +0000, Al Viro wrote:
> > > > > > On Wed, Jan 22, 2020 at 02:10:03PM -0800, Omar Sandoval wrote:
> > > > > >
> > > > > > > > Sorry for not reading all the thread again, some API questions:
> > > > > > > > - We intend to allow AT_REPLACE only with O_TMPFILE src. Right?
> > > > > > >
> > > > > > > I wasn't planning on having that restriction. It's not too much effort
> > > > > > > for filesystems to support it for normal files, so I wouldn't want to
> > > > > > > place an artificial restriction on a useful primitive.
> > > > > >
> > > >
> > > > I have too many gray hairs each one for implementing a "useful primitive"
> > > > that nobody asked for and bare the consequences.
> > > > Your introduction to AT_REPLACE uses O_TMPFILE.
> > > > I see no other sane use of the interface.
> > > >
> > > > > > I'm not sure; that's how we ended up with the unspeakable APIs like
> > > > > > rename(2), after all...
> > > > >
> > > > > Yet it is just rename(2) with the serial numbers filed off -
> > > > > complete with all the same data vs metadata ordering problems that
> > > > > rename(2) comes along with. i.e. it needs fsync to guarantee data
> > > > > integrity of the source file before the linkat() call is made.
> > > > >
> > > > > If we can forsee that users are going to complain that
> > > > > linkat(AT_REPLACE) using O_TMPFILE files is not atomic because it
> > > > > leaves zero length files behind after a crash just like rename()
> > > > > does, then we haven't really improved anything at all...
> > > > >
> > > > > And, really, I don't think anyone wants another API that requires
> > > > > multiple fsync calls to use correctly for crash-safe file
> > > > > replacement, let alone try to teach people who still cant rename a
> > > > > file safely how to use it....
> > > > >
> > > >
> > > > Are you suggesting that AT_LINK_REPLACE should have some of
> > > > the semantics I posted in this RFC  for AT_ATOMIC_xxx:
> > > > https://lore.kernel.org/linux-fsdevel/20190527172655.9287-1-amir73il@gmail.com/
> > >
> > > Not directly.
> > >
> > > All I'm pointing out is that data integrity guarantees of
> > > AT_LINK_REPLACE are yet another aspect of this new feature that
> > > has not yet been specified or documented at all.
> > >
> > > And in pointing this out, I'm making an observation that the
> > > rename(2) behaviour which everyone seems to be assuming this
> > > function will replicate is a terrible model to copy/reinvent.
> > >
> > > Addressing this problem is something for the people pushing for
> > > AT_LINK_REPLACE to solve, not me....
> >
> > Or the grumpy maintainer who will have to digest all of this.
> >
> > Can we update the documentation to admit that many people will probably
> > want to use this (and rename) as atomic swap operations?
> >
> > "The filesystem will commit the data and metadata of all files and
> > directories involved in the link operation to stable storage before the
> > call returns."
> >
> > And finally add a flag:
> >
> > "AT_LINK_EATMYDATA: If specified, the filesystem may opt out of
> > committing anything to disk."
> >
> 
> I agree with Christoph that this anomaly is not a good idea, but I also
> agree with you and Dave that if an operation is meant to be used for
> atomic swap, we should make it much harder for users to get it wrong.
> 
> To that end, we can define both flags AT_LINK_REPLACE and
> AT_ATOMIC in uapi, but also define the combo
> AT_LINK_ATOMIC_REPLACE and let the documentation be
> very much focused on this usage.
> 
> I would like to stress a point, which some who haven't followed [1]
> closely might have missed - the purpose of AT_ATOMIC is
> *not* to make the new link durable - it is *not* a replacement for fsync
> on parent dir. It's purpose is *only* to create a dependency between the
> durability of the new link and the new data it can expose.

At the risk of spawning another bikeshed subthread, the naming could be
improved to make that more clear. AT_BARRIER? AT_DEPENDENCY?

> AFAICT, this means that AT_ATOMIC would be implemented
> as filemap_write_and_wait() in xfs/ext4 and probably fdatasync in btrfs.
> 
> Really, there is no chance that any user is interested in a non-atomic
> replace in that respect, so I am not even sure that we need an explicit
> flag for it. As it is, the AT_REPLACE API design would rank poorly as
> "The obvious use is wrong."
> 
> An explicit AT_ATOMIC flag would help is that we could make the same
> semantics also apply to rename(2) with the same flag.
> 
> Omar,
> 
> I do understand why you say that you want to implement AT_REPLACE
> and let someone else take care of (bike shedding) AT_ATOMIC later.

I'm not in much of a rush to get AT_LINK_REPLACE in, so I'd be okay
getting it in together with AT_ATOMIC. I don't think I have the
bandwidth to get yet another VFS interface in, though (since I still
have to get back to RWF_ENCODED, which is a higher priority for me). Are
you planning on picking up AT_ATOMIC any time soon?

> The problem with that approach is that man page will have the
> AT_REPLACE documentation spread out without the mention of
> AT_ATOMIC_REPLACE and that can generate damage long into the future.

On the other hand, I believe I made this fairly clear in my man-page
patch:

+This does not guarantee data integrity;
+see EXAMPLE below for how to use this for crash-safe file replacement with
+.BR O_TMPFILE .

With the fsync() calls in the example.

Thanks!

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination
  2020-01-31 20:33                                           ` Omar Sandoval
@ 2020-01-31 21:55                                             ` Amir Goldstein
  0 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2020-01-31 21:55 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Darrick J. Wong, Dave Chinner, Al Viro, Trond Myklebust,
	dhowells, lsf-pc, hch, miklos, linux-fsdevel, Jan Kara

On Fri, Jan 31, 2020 at 10:33 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> On Fri, Jan 31, 2020 at 09:00:47AM +0200, Amir Goldstein wrote:
> > On Fri, Jan 31, 2020 at 7:27 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > >
> > > On Sat, Jan 25, 2020 at 08:25:46AM +1100, Dave Chinner wrote:
> > > > On Thu, Jan 23, 2020 at 09:47:30AM +0200, Amir Goldstein wrote:
> > > > > On Thu, Jan 23, 2020 at 9:16 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > > >
> > > > > > On Thu, Jan 23, 2020 at 03:47:45AM +0000, Al Viro wrote:
> > > > > > > On Wed, Jan 22, 2020 at 02:10:03PM -0800, Omar Sandoval wrote:
> > > > > > >
> > > > > > > > > Sorry for not reading all the thread again, some API questions:
> > > > > > > > > - We intend to allow AT_REPLACE only with O_TMPFILE src. Right?
> > > > > > > >
> > > > > > > > I wasn't planning on having that restriction. It's not too much effort
> > > > > > > > for filesystems to support it for normal files, so I wouldn't want to
> > > > > > > > place an artificial restriction on a useful primitive.
> > > > > > >
> > > > >
> > > > > I have too many gray hairs each one for implementing a "useful primitive"
> > > > > that nobody asked for and bare the consequences.
> > > > > Your introduction to AT_REPLACE uses O_TMPFILE.
> > > > > I see no other sane use of the interface.
> > > > >
> > > > > > > I'm not sure; that's how we ended up with the unspeakable APIs like
> > > > > > > rename(2), after all...
> > > > > >
> > > > > > Yet it is just rename(2) with the serial numbers filed off -
> > > > > > complete with all the same data vs metadata ordering problems that
> > > > > > rename(2) comes along with. i.e. it needs fsync to guarantee data
> > > > > > integrity of the source file before the linkat() call is made.
> > > > > >
> > > > > > If we can forsee that users are going to complain that
> > > > > > linkat(AT_REPLACE) using O_TMPFILE files is not atomic because it
> > > > > > leaves zero length files behind after a crash just like rename()
> > > > > > does, then we haven't really improved anything at all...
> > > > > >
> > > > > > And, really, I don't think anyone wants another API that requires
> > > > > > multiple fsync calls to use correctly for crash-safe file
> > > > > > replacement, let alone try to teach people who still cant rename a
> > > > > > file safely how to use it....
> > > > > >
> > > > >
> > > > > Are you suggesting that AT_LINK_REPLACE should have some of
> > > > > the semantics I posted in this RFC  for AT_ATOMIC_xxx:
> > > > > https://lore.kernel.org/linux-fsdevel/20190527172655.9287-1-amir73il@gmail.com/
> > > >
> > > > Not directly.
> > > >
> > > > All I'm pointing out is that data integrity guarantees of
> > > > AT_LINK_REPLACE are yet another aspect of this new feature that
> > > > has not yet been specified or documented at all.
> > > >
> > > > And in pointing this out, I'm making an observation that the
> > > > rename(2) behaviour which everyone seems to be assuming this
> > > > function will replicate is a terrible model to copy/reinvent.
> > > >
> > > > Addressing this problem is something for the people pushing for
> > > > AT_LINK_REPLACE to solve, not me....
> > >
> > > Or the grumpy maintainer who will have to digest all of this.
> > >
> > > Can we update the documentation to admit that many people will probably
> > > want to use this (and rename) as atomic swap operations?
> > >
> > > "The filesystem will commit the data and metadata of all files and
> > > directories involved in the link operation to stable storage before the
> > > call returns."
> > >
> > > And finally add a flag:
> > >
> > > "AT_LINK_EATMYDATA: If specified, the filesystem may opt out of
> > > committing anything to disk."
> > >
> >
> > I agree with Christoph that this anomaly is not a good idea, but I also
> > agree with you and Dave that if an operation is meant to be used for
> > atomic swap, we should make it much harder for users to get it wrong.
> >
> > To that end, we can define both flags AT_LINK_REPLACE and
> > AT_ATOMIC in uapi, but also define the combo
> > AT_LINK_ATOMIC_REPLACE and let the documentation be
> > very much focused on this usage.
> >
> > I would like to stress a point, which some who haven't followed [1]
> > closely might have missed - the purpose of AT_ATOMIC is
> > *not* to make the new link durable - it is *not* a replacement for fsync
> > on parent dir. It's purpose is *only* to create a dependency between the
> > durability of the new link and the new data it can expose.
>
> At the risk of spawning another bikeshed subthread, the naming could be
> improved to make that more clear. AT_BARRIER? AT_DEPENDENCY?
>

It crossed my mind a few times, I may have even suggested it, but IMO
this will not help anybody outside the kernel developers community at
getting a better understanding of what the API means.

> > AFAICT, this means that AT_ATOMIC would be implemented
> > as filemap_write_and_wait() in xfs/ext4 and probably fdatasync in btrfs.
> >
> > Really, there is no chance that any user is interested in a non-atomic
> > replace in that respect, so I am not even sure that we need an explicit
> > flag for it. As it is, the AT_REPLACE API design would rank poorly as
> > "The obvious use is wrong."
> >
> > An explicit AT_ATOMIC flag would help is that we could make the same
> > semantics also apply to rename(2) with the same flag.
> >
> > Omar,
> >
> > I do understand why you say that you want to implement AT_REPLACE
> > and let someone else take care of (bike shedding) AT_ATOMIC later.
>
> I'm not in much of a rush to get AT_LINK_REPLACE in, so I'd be okay
> getting it in together with AT_ATOMIC. I don't think I have the
> bandwidth to get yet another VFS interface in, though (since I still
> have to get back to RWF_ENCODED, which is a higher priority for me). Are
> you planning on picking up AT_ATOMIC any time soon?
>

I wasn't planning to. Mind you, there is no actual "work" to do with
AT_ATOMIC - it is only about agreeing on the semantics - the implementation
is trivial. The man page draft I have already posted, but as I said it would
be much more valuable IMO to introduce AT_LINK_ATOMIC_REPLACE
together to man page. I can help with that if needed.

> > The problem with that approach is that man page will have the
> > AT_REPLACE documentation spread out without the mention of
> > AT_ATOMIC_REPLACE and that can generate damage long into the future.
>
> On the other hand, I believe I made this fairly clear in my man-page
> patch:
>
> +This does not guarantee data integrity;
> +see EXAMPLE below for how to use this for crash-safe file replacement with
> +.BR O_TMPFILE .
>
> With the fsync() calls in the example.
>

Heh that is not how humans work ;-)
People don't go reading the entire man page nor follow examples to the letter.
Many get to AT_LINK_REPLACE description and stop reading.
In that case, I prefer that they first get to AT_LINK_ATOMIC_REPLACE.

I don't mind re-posting your patches adding the _ATOMIC_ keyword
and implementation, but it's a very small change, so perhaps you'd
rather do it yourself, or maybe Dave would want to carry this through
if he is in a bigger rush.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2020-01-31 21:55 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-17 12:49 [LSF/MM/BPF TOPIC] Allowing linkat() to replace the destination David Howells
2020-01-17 14:33 ` Trond Myklebust
2020-01-17 15:46   ` Al Viro
2020-01-17 16:12     ` Trond Myklebust
2020-01-17 16:48       ` Al Viro
2020-01-17 16:36     ` Omar Sandoval
2020-01-17 16:59       ` Al Viro
2020-01-17 17:28         ` Omar Sandoval
2020-01-17 18:17           ` Al Viro
2020-01-17 20:22             ` Omar Sandoval
2020-01-17 22:22               ` Al Viro
2020-01-17 23:54                 ` Omar Sandoval
2020-01-18  0:47                   ` Al Viro
2020-01-18  1:17                     ` Omar Sandoval
2020-01-18  2:20                       ` Al Viro
2020-01-21 23:05                         ` Omar Sandoval
2020-01-22  6:57                           ` Amir Goldstein
2020-01-22 22:10                             ` Omar Sandoval
2020-01-23  3:47                               ` Al Viro
2020-01-23  7:16                                 ` Dave Chinner
2020-01-23  7:47                                   ` Amir Goldstein
2020-01-24 21:25                                     ` Dave Chinner
2020-01-31  5:24                                       ` Darrick J. Wong
2020-01-31  5:29                                         ` hch
2020-01-31  7:00                                         ` Amir Goldstein
2020-01-31 20:33                                           ` Omar Sandoval
2020-01-31 21:55                                             ` Amir Goldstein
2020-01-28  1:27                                   ` Omar Sandoval
2020-01-28 14:35                                 ` David Howells
2020-01-31  5:31                                   ` hch
2020-01-31  8:04                                   ` David Howells
2020-01-31  8:56                                     ` Amir Goldstein
2020-01-22  9:53                       ` David Howells
2020-01-17 14:47 ` David Howells
2020-01-17 14:56   ` Trond Myklebust
2020-01-17 16:01     ` Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).