All of lore.kernel.org
 help / color / mirror / Atom feed
From: Al Viro <viro@ZenIV.linux.org.uk>
To: "J. Bruce Fields" <bfields@fieldses.org>
Cc: linux-fsdevel@vger.kernel.org, linux-nfs@vger.kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Jeff Layton <jlayton@kernel.org>
Subject: Re: [RFC][PATCH] nfsd: vfs_mkdir() might succeed leaving dentry negative unhashed
Date: Mon, 14 May 2018 17:47:43 +0100	[thread overview]
Message-ID: <20180514164743.GI30522@ZenIV.linux.org.uk> (raw)
In-Reply-To: <20180514154550.GH30522@ZenIV.linux.org.uk>

On Mon, May 14, 2018 at 04:45:51PM +0100, Al Viro wrote:

> > > 	2) is nfserr_serverfail valid for situation when
> > > directory created by such vfs_mkdir() manages to disappear
> > > under us immediately afterwards?  Or should we return nfserr_stale
> > > instead?
> > 
> > We just got a successful result on the create and the parent's still
> > locked, so if somebody hits this I think we want them reporting a bug,
> > not wasting time trying to find a mistake in their own logic.
> 
> No.  Suppose it's NFS-over-NFS (and yes, I agree that it's a bad idea;
> somebody *has* done that).  Lookup after successful mkdir can legitimately
> fail if it's been removed server-side.
> 
> And we *will* need to allow nfs_mkdir() to return that way in some cases
> (== success with dentry passed to it left unhashed negative), I'm afraid ;-/

Consider the situation when you have NFS reexported - there's server S1,
exporting a filesystem to S2, which reexports it for client C.

On S2, process A does stat foo, gets ENOENT and proceeds to mkdir foo.  
Dentry of foo is passed to nfs_mkdir(), which calls e.g. nfs3_proc_mkdir(),
which calls nfs3_do_create(), which requests S1 to create the damn thing.
Then, after that succeeds, it calls nfs_instantiate().  There we would 
proceed to get the inode and call d_add(dentry, inode).

In the meanwhile, C, having figured out the fhandle S2 would assign to
foo (e.g. having spied upon the traffic, etc.) sends that fhandle to
S2.  nfs_fh_to_dentry() is called by process B on S2 (either knfsd, or,
in case if C==S2 and attacker has done open-by-fhandle - the caller
of open_by_handle(2)).  It gets the inode - the damn thing has been
created on S1 already.  That inode still has no dentries attached to
it (B has just set it up), so d_obtain_alias() creates one and attaches
to it.

Now A gets around to nfs_fhget() and finds the same inode.  Voila -
d_add(dentry, inode) and we are fucked; two dentries over the same
directory inode.  Fun starts very shortly when fs/exportfs/* code
is used by B to reconnect its dentry - the things really get bogus
once that constraint is violated.

The obvious solution would be to replace that d_add() with
	res = d_splice_alias(inode, dentry);
	if (IS_ERR(res)) {
		error = PTR_ERR(res);
		goto out_error;
	}
	dput(res);
	
leaving the dentry passed to nfs_mkdir() unhashed and negative if
we hit that race.  That's fine - the next lookup (e.g. the one
done by exportfs to reconnect the sucker) will find the right
dentry in dcache; it's just that it won't the one passed to
vfs_mkdir().

That's different from the local case - there mkdir gets the inumber,
inserts locked in-core inode with that inumber into icache and
only then proceeds to set on-disk data up.  Anyone guessing the
inumber (and thus the fhandle) will either get -ESTALE (if they
come first, as in this scenario) or find the in-core inode mkdir
is setting up and wait for it to be unlocked, at which point
d_obtain_alias() will already find it attached to dentry passed
to mkdir.

But that critically relies upon the icache search key being known
to mkdir *BEFORE* the on-disk metadata starts looking acceptable.
For NFS we really can't do that - there the key is S1's fhandle
and we don't get that until S1 has created the damn thing.

I don't see any variation of the trick used by local filesystems
that would not have this race; we really can end up with B getting
there first and creating directory inode with dentry attached to
it before A gets through.  Which, AFAICS, leaves only one solution -
let A put the dentry created by B in place of what had been passed
to A (and leave its argument unhashed).  Which is trivial to
implement (see above); however, it means that NFS itself is in
the same class as cgroup - its ->mkdir() may, in some cases,
end up succeeding and leaving its argument unhashed negative.
Note that dcache is perfectly fine after that - we have hashed
positive dentry with the right name and right parent, over the
inode for directory we'd just created; everything's fine, except
that it's not the struct dentry originally passed to vfs_mkdir().

  parent reply	other threads:[~2018-05-14 16:47 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-05-11 21:13 [RFC][PATCH] nfsd: vfs_mkdir() might succeed leaving dentry negative unhashed Al Viro
2018-05-14 15:32 ` J. Bruce Fields
2018-05-14 15:45   ` Al Viro
2018-05-14 16:27     ` J. Bruce Fields
2018-05-14 17:00       ` Al Viro
2018-05-14 17:56         ` J. Bruce Fields
2018-05-14 16:47     ` Al Viro [this message]
2018-05-16 14:09       ` J. Bruce Fields

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180514164743.GI30522@ZenIV.linux.org.uk \
    --to=viro@zeniv.linux.org.uk \
    --cc=bfields@fieldses.org \
    --cc=jlayton@kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.