Re: [PATCH v2] vfs: fix link vs. rename race

From: Al Viro <viro@zeniv.linux.org.uk>
To: NeilBrown <neilb@suse.de>
Cc: Miklos Szeredi <mszeredi@redhat.com>,
	Xavier Roche <xavier.roche@algolia.com>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v2] vfs: fix link vs. rename race
Date: Wed, 14 Sep 2022 02:30:53 +0100	[thread overview]
Message-ID: <YyEuzf33chhNXZwD@ZenIV> (raw)
In-Reply-To: <166311448437.20483.2299581036245030695@noble.neil.brown.name>

On Wed, Sep 14, 2022 at 10:14:44AM +1000, NeilBrown wrote:

> But they can race.  The open completes the lookup of /tmp/share-dir and
> holds the dentry, but the rename succeeds with inode_lock(target) in the
> code fragment you provided above before the open() can get the lock.
> By the time open() does get the lock, the dentry it holds will be marked
> S_DEAD and the __lookup_hash() will fail.
> So the open() returns -ENOENT - unexpected.
> 
> Test code below, based on the test code for the link race.
> 
> I wonder if S_DEAD should result in -ESTALE rather than -ENOENT.
> That would cause the various retry_estale() loops to retry the whole
> operation.  I suspect we'd actually need more subtlety than just that
> simple change, but it might be worth exploring.

	I don't think that wording in POSIX prohibits that ENOENT.
It does not (and no UNIX I've ever seen provides) atomicity of all
link traversals involved in pathname resolution.

root@sonny:/tmp# mkdir fubar
root@sonny:/tmp# cd fubar/
root@sonny:/tmp/fubar# rmdir ../fubar
root@sonny:/tmp/fubar# touch ./foo
touch: cannot touch './foo': No such file or directory

Do you agree that this is a reasonable behaviour?  That, BTW, is why
"retry on S_DEAD" is wrong - it can be a persistent state.

Now, replace rmdir ../fubar with rename("/tmp/barf", "/tmp/fubar") and
you'll get pretty much your testcase.  You are asking not just for
atomicity of rename vs. traversal of "fubar" in /tmp - that we already
have.  You are asking for the atomicity of the entire pathname resolution
of open() argument wrt rename().  And that is a much scarier can of worms.

Basically, pathname resolution happens on per-component basis and it's
*NOT* guaranteed that filesystem won't be changed between those or that
we won't observe such modifications.  If you check POSIX, you'll see
that it avoids any such promises - all it says (for directories; non-directory
case has similar verbiage) is this:

	If the directory named by the new argument exists, it shall be removed
	and old renamed to new. In this case, a link named new shall exist
	throughout the renaming operation and shall refer either to the directory
	referred to by new or old before the operation began.

It carefully stays the hell out of any pathname resolution atomicity promises -
all it's talking about is resolution of a single link.  It's enough to
guarantee that rename("old", "new") won't race with mkdir("new") allowing the
latter to succeed, with similar for creat(2), etc. - "new" is promised to
refer to an existing object at all times.  And that's what rename() atomicity
is normally for.  Operation on the link in question will observe either old
or new state; operation that passed through that link on the way to whatever
it will act upon has *also* observed either of those states - and the next
pathname component had been looked up in either old or new directory, but
there is no promise that nothing has happened to that directory since we got
there.