All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
@ 2014-09-24 18:14 Mikhail Efremov
  2014-09-24 18:55 ` Al Viro
  0 siblings, 1 reply; 37+ messages in thread
From: Mikhail Efremov @ 2014-09-24 18:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Miklos Szeredi, Mikhail Efremov, Linus Torvalds, Alexander Viro,
	linux-fsdevel, stable

Only exchange source and destination filenames
if flags contain RENAME_EXCHANGE.
In case if executable file was running and replaced by
other file /proc/PID/exe should still show correct file name,
not the old name of the file by which it was replaced.

The scenario when this bug manifests itself was like this:
* ALT Linux uses rpm and start-stop-daemon;
* during a package upgrade rpm creates a temporary file
  for an executable to rename it upon successful unpacking;
* start-stop-daemon is run subsequently and it obtains
  the (nonexistant) temporary filename via /proc/PID/exe
  thus failing to identify the running process.

Note that "long" filenames (> DNAiME_INLINE_LEN) are still
exchanged without RENAME_EXCHANGE and this behaviour exists
long enough (should be fixed too apparently).
So this patch is just an interim workaround that restores
behavior for "short" names as it was before changes
introduced by commit da1ce0670c14 ("vfs: add cross-rename").

See https://lkml.org/lkml/2014/9/7/6 for details.

Acked-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: da1ce0670c14 "vfs: add cross-rename"
Signed-off-by: Mikhail Efremov <sem@altlinux.org>
---
 fs/dcache.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 7a5b514..3218570 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2372,7 +2372,8 @@ void dentry_update_name_case(struct dentry *dentry, struct qstr *name)
 }
 EXPORT_SYMBOL(dentry_update_name_case);
 
-static void switch_names(struct dentry *dentry, struct dentry *target)
+static void switch_names(struct dentry *dentry, struct dentry *target,
+			 bool exchange)
 {
 	if (dname_external(target)) {
 		if (dname_external(dentry)) {
@@ -2404,11 +2405,21 @@ static void switch_names(struct dentry *dentry, struct dentry *target)
 			/*
 			 * Both are internal.
 			 */
-			unsigned int i;
-			BUILD_BUG_ON(!IS_ALIGNED(DNAME_INLINE_LEN, sizeof(long)));
-			for (i = 0; i < DNAME_INLINE_LEN / sizeof(long); i++) {
-				swap(((long *) &dentry->d_iname)[i],
-				     ((long *) &target->d_iname)[i]);
+			if (exchange) {
+				unsigned int i;
+
+				BUILD_BUG_ON(!IS_ALIGNED(DNAME_INLINE_LEN,
+							 sizeof(long)));
+				for (i = 0; i < DNAME_INLINE_LEN / sizeof(long);
+									i++) {
+					swap(((long *) &dentry->d_iname)[i],
+					     ((long *) &target->d_iname)[i]);
+				}
+			} else {
+				memcpy(dentry->d_iname, target->d_name.name,
+						target->d_name.len + 1);
+				dentry->d_name.len = target->d_name.len;
+				return;
 			}
 		}
 	}
@@ -2510,7 +2521,7 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 	list_del(&target->d_u.d_child);
 
 	/* Switch the names.. */
-	switch_names(dentry, target);
+	switch_names(dentry, target, exchange);
 	swap(dentry->d_name.hash, target->d_name.hash);
 
 	/* ... and switch the parents */
@@ -2649,7 +2660,7 @@ static void __d_materialise_dentry(struct dentry *dentry, struct dentry *anon)
 
 	dparent = dentry->d_parent;
 
-	switch_names(dentry, anon);
+	switch_names(dentry, anon, false);
 	swap(dentry->d_name.hash, anon->d_name.hash);
 
 	dentry->d_parent = dentry;
-- 
1.8.5.5


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-24 18:14 [PATCH v2] vfs: Don't exchange "short" filenames unconditionally Mikhail Efremov
@ 2014-09-24 18:55 ` Al Viro
  2014-09-24 19:20   ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Al Viro @ 2014-09-24 18:55 UTC (permalink / raw)
  To: Mikhail Efremov
  Cc: linux-kernel, Miklos Szeredi, Linus Torvalds, linux-fsdevel, stable

On Wed, Sep 24, 2014 at 10:14:33PM +0400, Mikhail Efremov wrote:
> Only exchange source and destination filenames
> if flags contain RENAME_EXCHANGE.
> In case if executable file was running and replaced by
> other file /proc/PID/exe should still show correct file name,
> not the old name of the file by which it was replaced.

Yecchhhh...  Applied, but it's very ugly.  Oh, well - regression is
regression, and I don't see a cleaner fix at the moment.  If I don't
manage to come up with anything prettier, to Linus it goes in tonight
pull request ;-/

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-24 18:55 ` Al Viro
@ 2014-09-24 19:20   ` Linus Torvalds
  2014-09-24 19:27     ` Linus Torvalds
  2014-09-24 20:18     ` Al Viro
  0 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2014-09-24 19:20 UTC (permalink / raw)
  To: Al Viro
  Cc: Mikhail Efremov, Linux Kernel Mailing List, Miklos Szeredi,
	linux-fsdevel, stable

On Wed, Sep 24, 2014 at 11:55 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Yecchhhh...  Applied, but it's very ugly.  Oh, well - regression is
> regression, and I don't see a cleaner fix at the moment.  If I don't
> manage to come up with anything prettier, to Linus it goes in tonight
> pull request ;-/

Please don't. That thing is too ugly to exist. It also looks
completely and utterly buggy. There's no way I'm taking it. If
switch-names is suddenly conditional, what the f*ck happens to the
name hash which is unconditionally done with a swap() right
afterwards.

There's no way that patch is correct.

           Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-24 19:20   ` Linus Torvalds
@ 2014-09-24 19:27     ` Linus Torvalds
  2014-09-24 20:18     ` Al Viro
  1 sibling, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2014-09-24 19:27 UTC (permalink / raw)
  To: Al Viro
  Cc: Mikhail Efremov, Linux Kernel Mailing List, Miklos Szeredi,
	linux-fsdevel, stable

On Wed, Sep 24, 2014 at 12:20 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> There's no way that patch is correct.

Something like the appended might be a good idea regardless. And might
have made people notice that the len and the hash go together with the
name.

             Linus

----

    diff --git a/fs/dcache.c b/fs/dcache.c
    index 7a5b51440afa..aa1b045b997c 100644
    --- a/fs/dcache.c
    +++ b/fs/dcache.c
    @@ -2412,7 +2412,7 @@ static void switch_names(struct dentry
*dentry, struct dentry *target)
                            }
                    }
            }
    -       swap(dentry->d_name.len, target->d_name.len);
    +       swap(dentry->d_name.hash_len, target->d_name.hash_len);
     }

     static void dentry_lock_for_move(struct dentry *dentry, struct
dentry *target)
    @@ -2511,7 +2511,6 @@ static void __d_move(struct dentry *dentry,
struct dentry *target,

            /* Switch the names.. */
            switch_names(dentry, target);
    -       swap(dentry->d_name.hash, target->d_name.hash);

            /* ... and switch the parents */
            if (IS_ROOT(dentry)) {
    @@ -2650,7 +2649,6 @@ static void __d_materialise_dentry(struct
dentry *dentry, struct dentry *anon)
            dparent = dentry->d_parent;

            switch_names(dentry, anon);
    -       swap(dentry->d_name.hash, anon->d_name.hash);

            dentry->d_parent = dentry;
            list_del_init(&dentry->d_u.d_child);

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-24 19:20   ` Linus Torvalds
  2014-09-24 19:27     ` Linus Torvalds
@ 2014-09-24 20:18     ` Al Viro
  2014-09-25  4:46       ` Al Viro
  1 sibling, 1 reply; 37+ messages in thread
From: Al Viro @ 2014-09-24 20:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mikhail Efremov, Linux Kernel Mailing List, Miklos Szeredi,
	linux-fsdevel, stable

On Wed, Sep 24, 2014 at 12:20:38PM -0700, Linus Torvalds wrote:
> On Wed, Sep 24, 2014 at 11:55 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Yecchhhh...  Applied, but it's very ugly.  Oh, well - regression is
> > regression, and I don't see a cleaner fix at the moment.  If I don't
> > manage to come up with anything prettier, to Linus it goes in tonight
> > pull request ;-/
> 
> Please don't. That thing is too ugly to exist. It also looks
> completely and utterly buggy. There's no way I'm taking it. If
> switch-names is suddenly conditional, what the f*ck happens to the
> name hash which is unconditionally done with a swap() right
> afterwards.

The sucker's unhashed after that...  And yes, I agree that it's fucking
ugly.  Still looking for saner ways to do that...

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-24 20:18     ` Al Viro
@ 2014-09-25  4:46       ` Al Viro
  2014-09-26 16:44         ` Al Viro
  2014-09-26 20:23         ` Al Viro
  0 siblings, 2 replies; 37+ messages in thread
From: Al Viro @ 2014-09-25  4:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mikhail Efremov, Linux Kernel Mailing List, Miklos Szeredi,
	linux-fsdevel, stable

On Wed, Sep 24, 2014 at 09:18:13PM +0100, Al Viro wrote:
> On Wed, Sep 24, 2014 at 12:20:38PM -0700, Linus Torvalds wrote:
> > On Wed, Sep 24, 2014 at 11:55 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > Yecchhhh...  Applied, but it's very ugly.  Oh, well - regression is
> > > regression, and I don't see a cleaner fix at the moment.  If I don't
> > > manage to come up with anything prettier, to Linus it goes in tonight
> > > pull request ;-/
> > 
> > Please don't. That thing is too ugly to exist. It also looks
> > completely and utterly buggy. There's no way I'm taking it. If
> > switch-names is suddenly conditional, what the f*ck happens to the
> > name hash which is unconditionally done with a swap() right
> > afterwards.
> 
> The sucker's unhashed after that...  And yes, I agree that it's fucking
> ugly.  Still looking for saner ways to do that...

I really wonder if it's possible to get d_rehash() hitting the victim of
(non-exchange) __d_move().  _Then_ this patch (as well as the historical
behaviour it restores, all way back to 2.5, if not 2.3) would, indeed,
be buggy.

I'd probably better sleep on that and finish YACrawlingTFS tomorrow morning
- it's nearly 1am here and I've got only 5 hours of sleep left until it's time
to get the kids up for school ;-/

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-25  4:46       ` Al Viro
@ 2014-09-26 16:44         ` Al Viro
  2014-09-27  4:45           ` Al Viro
  2014-09-26 20:23         ` Al Viro
  1 sibling, 1 reply; 37+ messages in thread
From: Al Viro @ 2014-09-26 16:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mikhail Efremov, Linux Kernel Mailing List, Miklos Szeredi,
	linux-fsdevel, stable

On Thu, Sep 25, 2014 at 05:46:01AM +0100, Al Viro wrote:
> On Wed, Sep 24, 2014 at 09:18:13PM +0100, Al Viro wrote:
> > On Wed, Sep 24, 2014 at 12:20:38PM -0700, Linus Torvalds wrote:
> > > On Wed, Sep 24, 2014 at 11:55 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > >
> > > > Yecchhhh...  Applied, but it's very ugly.  Oh, well - regression is
> > > > regression, and I don't see a cleaner fix at the moment.  If I don't
> > > > manage to come up with anything prettier, to Linus it goes in tonight
> > > > pull request ;-/
> > > 
> > > Please don't. That thing is too ugly to exist. It also looks
> > > completely and utterly buggy. There's no way I'm taking it. If
> > > switch-names is suddenly conditional, what the f*ck happens to the
> > > name hash which is unconditionally done with a swap() right
> > > afterwards.
> > 
> > The sucker's unhashed after that...  And yes, I agree that it's fucking
> > ugly.  Still looking for saner ways to do that...
> 
> I really wonder if it's possible to get d_rehash() hitting the victim of
> (non-exchange) __d_move().  _Then_ this patch (as well as the historical
> behaviour it restores, all way back to 2.5, if not 2.3) would, indeed,
> be buggy.
> 
> I'd probably better sleep on that and finish YACrawlingTFS tomorrow morning
> - it's nearly 1am here and I've got only 5 hours of sleep left until it's time
> to get the kids up for school ;-/

FWIW, the longer I'm looking at that, the more it seems that __d_move() and
__d_materialise_dentry() ought to be merged.  The thing is, both callers of
the latter are followed by the same sequence:
                        write_sequnlock(&rename_lock);
			_d_rehash(anon);
			spin_unlock(&anon->d_lock);
(in one case - verbatim, in another - with goto thrown in before _d_rehash()).
Now, there's no good reason for doing that write_sequnlock() first; both
_d_rehash() and spin_unlock() are fast enough to make contention on
rename_lock a non-issue.  And if we pull those before write_sequnlock() and
into __d_materialise_dentry(dentry, anon), we get rid not only of code
duplication, but of the "it returns with anon->d_lock held" caution as well,
*and* __d_materialise_dentry() becomes very similar to __d_move().

	Note that __d_move(x, y) depends on !IS_ROOT(y), while
__d_materialise_dentry(x, y) assumes (and is obviously guaranteed by
callers) IS_ROOT(y).  IOW, even leaving aside the "is it an exchange" argument
fed to __d_move(), we could distinguish between these guys by IS_ROOT(y)
alone.  And there's a fuckload of duplication between them even now, let alone
after pulling the rehash in.

And d_splice_alias() definitely ought to be merged with d_materialise_unique().
Making it accept hashed dentries had been a mistake - there are _very_ few
places where that happens.  One in d_add_ci() is my mistake (and essentially
a dead code - case-insensitive XFS doesn't get hashed negative dentries,
so this codepath in d_add_ci() never triggers), the one in gfs2_create_inode()
is trivially avoided - it should be "fail with EISDIR if that inode is a
directory, do d_instantiate() otherwise" and one in __gfs2_lookup() can only
be reached for a hashed dentry from gfs2_atomic_open(), which has no business
calling gfs2_lookup() if dentry it got from caller is hashed.  And that's
it - only 3 callsites where we could call d_splice_alias() for hashed dentry.

With those taken care of, we really have no reason not to fold
d_materialise_unique() into d_splice_alias() - just teach the latter to
do __d_unalias() in case when existing directory alias is not IS_ROOT.
d_splice_alias() fails with EIO in that case, mostly since on a local fs
that can only happen due to fs corruption.  No reason not to do __d_unalias()
instead.

What's more, d_add_unique() and d_instantiate_unique() are also very
dubious.  The use of the latter in proc_ns_get_dentry() is complete BS,
since there the dentry passed to it is an IS_ROOT() one we'd just obtained
from d_alloc_pseudo(), so the loop over aliases in __d_instantiate_unique()
has no chance whatsoever to find anything it would like - none of them
could possibly have the same ->d_parent value.  IOW, that call is plain
and simple pessimization of d_instantiate().  Which leaves us with exactly
one caller - d_add_unique().  Which itself has exactly one caller.  This:
        dentry = opendata->dentry;
        if (dentry->d_inode == NULL) {
                /* FIXME: Is this d_drop() ever needed? */
                d_drop(dentry);
                dentry = d_add_unique(dentry, igrab(state->inode));
                if (dentry == NULL) {
                        dentry = opendata->dentry;
                } else if (dentry != ctx->dentry) {
                        dput(ctx->dentry);
                        ctx->dentry = dget(dentry); 
                }
First of all, *is* this d_drop() needed, indeed?  Can that dentry be
already hashed?  FWIW, in case it has been hashed we can't expect any aliases
to satisfy d_instantiate_unique() - for that we would need to have a really
weird state just before d_drop(); positive and negative dentries with the
same parent and the same name, which obviously shouldn't happen.  So in
case when dentry has been hashed when we got there (if that's possible at
all), we might as well just call d_instatiate() and be done with that.
And when it's not hashed...  What's wrong with just using d_materialise_unique()
there?  I'm not familiar enough with the guts of NFS4 (and I've had more than
enough of digging through the really nasty callchains during the last few
days), so some comments from NFS folks would be very welcome...

AFAICS, we have a good chance to fold that bunch of suckers into one
primitive.  Moreover, the same primitive ought to be used through all
->lookup() instances, even on non-exportable filesystems where we are
still using d_add().  The thing is, d_splice_alias() will only hit the
fastpath on those, which does the same thing d_add() does.  That's
the only reason why d_add() is valid there in the first place.  And
checking for that fastpath is cheap - we need to check that inode is
non-NULL anyway (to decide whether we need to take ->i_lock) and we
need to take ->i_lock and at least fetch ->i_alias.next in case when
inode isn't NULL.  The only thing remaining to decide that d_splice_alias()
can take the fastpath is checking that inode isn't a directory one.

That basically leaves d_add() to be used only when something is populating
a synthetic tree.  And even there I'm not sure it makes sense to keep it
a separate primitive - inodes in those remaining callers tend to be
brand-new, which would send d_splice_alias() on the fastpath simply because
->i_alias will be empty.  So in theory we might get away with *all* that
stuff merged into one primitive: d_add/d_splice_alias/d_materialise_unique/
d_add_unique to be used when we want hashed result.  FWIW, all those guys
had started as forked-off variants of d_add(), so we would just be folding
them all back...

As an aside, I really *hate* the way lustre crowd has spewed out the wonders
like ll_d_hlist_empty() and its ilk.  The whole thing is a major eye-bleeder
as it is, but this "If you want to grep for something, you'll just have to
grep for our pointless wrapper separately.  Give us respect, bitch - we are
reeeal special!" kind of attitude is extremely annoying.  Could we please
take all that shite out?  Not drivers/staging/lustre, tempting as it might be,
just those wrappers?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-25  4:46       ` Al Viro
  2014-09-26 16:44         ` Al Viro
@ 2014-09-26 20:23         ` Al Viro
  1 sibling, 0 replies; 37+ messages in thread
From: Al Viro @ 2014-09-26 20:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mikhail Efremov, Linux Kernel Mailing List, Miklos Szeredi,
	linux-fsdevel, stable, Sage Weil

On Thu, Sep 25, 2014 at 05:46:01AM +0100, Al Viro wrote:

> I really wonder if it's possible to get d_rehash() hitting the victim of
> (non-exchange) __d_move().  _Then_ this patch (as well as the historical
> behaviour it restores, all way back to 2.5, if not 2.3) would, indeed,
> be buggy.

More fun: what's going on in ceph_handle_notrace_create()?  AFAICS, this
        struct dentry *result = ceph_lookup(dir, dentry, 0);

        if (result && !IS_ERR(result)) {
                /*
                 * We created the item, then did a lookup, and found
                 * it was already linked to another inode we already
                 * had in our cache (and thus got spliced).  Link our
                 * dentry to that inode, but don't hash it, just in  
                 * case the VFS wants to dereference it.
                 */
                BUG_ON(!result->d_inode);
                d_instantiate(dentry, result->d_inode);
                return 0;
        }
is bogus.  What will happen if server goes nuts and that existing alias picked
by lookup turns out to be a directory?  And while we are at it, what's to
prevent a leak if we ever hit that codepath, directory or no directory?
Sage?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-26 16:44         ` Al Viro
@ 2014-09-27  4:45           ` Al Viro
  2014-09-27 17:56             ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Al Viro @ 2014-09-27  4:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mikhail Efremov, Linux Kernel Mailing List, Miklos Szeredi,
	linux-fsdevel, stable

On Fri, Sep 26, 2014 at 05:44:42PM +0100, Al Viro wrote:

> FWIW, the longer I'm looking at that, the more it seems that __d_move() and
> __d_materialise_dentry() ought to be merged.  The thing is, both callers of
> the latter are followed by the same sequence:
>                         write_sequnlock(&rename_lock);
> 			_d_rehash(anon);
> 			spin_unlock(&anon->d_lock);
> (in one case - verbatim, in another - with goto thrown in before _d_rehash()).
> Now, there's no good reason for doing that write_sequnlock() first; both
> _d_rehash() and spin_unlock() are fast enough to make contention on
> rename_lock a non-issue.  And if we pull those before write_sequnlock() and
> into __d_materialise_dentry(dentry, anon), we get rid not only of code
> duplication, but of the "it returns with anon->d_lock held" caution as well,
> *and* __d_materialise_dentry() becomes very similar to __d_move().
> 
> 	Note that __d_move(x, y) depends on !IS_ROOT(y), while
> __d_materialise_dentry(x, y) assumes (and is obviously guaranteed by
> callers) IS_ROOT(y).  IOW, even leaving aside the "is it an exchange" argument
> fed to __d_move(), we could distinguish between these guys by IS_ROOT(y)
> alone.  And there's a fuckload of duplication between them even now, let alone
> after pulling the rehash in.

It's actually even better.  __d_materialise_dentry() has its arguments in
the wrong order.  Flip them around and it's very nearly an exact copy of
what we do in __d_move() when the first argument is IS_ROOT().  Actually,
that case in __d_move() had been added back in 2002 exactly for the needs
of d_splice_alias().  And it's been unreachable since this Bruce has
switched d_splice_alias() to use of __d_materialise_dentry().  In fact,
both d_splice_alias() and d_materialise_unique() should've been switched
to __d_move() instead.

I've done that massage and removal of __d_materialise_dentry().  The things
look a lot saner now that all dentry moving is done in one place, IMO.  Not to
mention that it trims ~50 lines off fs/dcache.c, driving it down to the 3rd
place in fs/*.c ;-)

Anyway, I've put the branch with fixes + that stuff in vfs.git#for-linus;
I have *NOT* put the "keep names on overwriting rename" horror in there,
but I believe that we will need something of that sort.

Linus, it's a real userland regression.  Yes, it affects only shitty scripts,
and they had never been reliable for long names anyway, but we have broken
real-world userland code by that.  What happened there is that old behaviour
of removal-by-rename wrt d_path() used to be similar to that on unlink.
cd /tmp; touch foo; touch bar
exec 42<bar
mv foo bar
ls -l /proc/self/fd/42
would give you
lr-x------ 1 al al 64 Sep 26 23:44 /proc/self/fd/42 -> /tmp/bar (deleted)
Now it gives "/tmp/foo (deleted)".  The trouble being, the same goes for
/proc/*/exe.  If you upgraded a package and that had replaced a running binary,
you used to be able to find process still running that binary.  After the
upgrade.  And yes, it was an unreliable kludge and the value of that was
in the best case dubious, but there really are people who used to do it.
And it broke after we'd added RENAME_EXCHANGE support.  The old behaviour
of switch_names() in short names case used to be "copy the name/length
of target to source dentry, swap hash keys and piss on hash key of target
not matching target's name anymore - the sucker is unhashed anyway".  For
RENAME_EXCHANGE we certainly needed to swap the names in all cases -
precisely because both suckers are hashed in the end.  And since we had
done that in exchange case, it was simpler to do it in all cases.  Especially
since when either name had been a long one we'd always swapped them, so
anything relying on old behaviour had been unreliable anyway.

Except that it turns out that old behaviour (keep the last component of
victim on normal rename) was relied upon...  That ucking fugly patch
reverts the non-swapping renames to the old behaviour.  I don't like it
any more than you do - it *is* ugly as hell and I still can't swear that
we don't have a bogus codepath leading to d_rehash() of rename() victim
(which would break things very badly with the old behaviour).  Still,
it is a real-world userland regression.

Up to you, but I'm afraid that it's on the "we get to keep supporting that"
side of things.

Anyway, what I've got in vfs.git#for-linus right now seems to survive the
obvious tests so far, still running through full xfstests; please, take
a look at that pile.  The last one should go with you as author - it was
just faster to do the change manually than deal with git-am failure when
applying your patch.  I'll take the headers from your mail and update that
commit before sending the pull request.

Shortlog:
Al Viro (8):
      ufs: deal with nfsd/iget races
      pull rehashing and unlocking the target dentry into __d_materialise_dentry()
      don't open-code d_rehash() in d_materialise_unique()
      __d_move(): fold manipulations with ->d_child/->d_subdirs
      __d_materialise_dentry(): flip the order of arguments
      kill __d_materialise_dentry()
      fold unlocking the children into dentry_unlock_parents_for_move()
      fold swapping ->d_name.hash into switch_names()

Miklos Szeredi (2):
      shmem: fix nlink for rename overwrite directory
      fuse: honour max_read and max_write in direct_io mode

Diffstat:
 fs/dcache.c         |   85 +++++++++++----------------------------------------
 fs/direct-io.c      |    2 +-
 fs/fuse/file.c      |    1 +
 fs/ufs/ialloc.c     |    6 +++-
 fs/ufs/namei.c      |    4 +++
 include/linux/uio.h |    2 +-
 mm/iov_iter.c       |   14 ++++++---
 mm/shmem.c          |    4 ++-
 8 files changed, 42 insertions(+), 76 deletions(-)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-27  4:45           ` Al Viro
@ 2014-09-27 17:56             ` Linus Torvalds
  2014-09-27 18:31               ` Al Viro
  2014-09-28 15:01               ` Mikhail Efremov
  0 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2014-09-27 17:56 UTC (permalink / raw)
  To: Al Viro
  Cc: Mikhail Efremov, Linux Kernel Mailing List, Miklos Szeredi,
	linux-fsdevel, stable

On Fri, Sep 26, 2014 at 9:45 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Linus, it's a real userland regression.

That's not the part I'm arguing against. I think the patch itself was
too ugly to live.

Yes, we had that hack before, but we didn't make it conditional. It
historically was more of a "it's easier to just memcpy the name" than
switch things around. Then that became accidental semantics, and
that's all normal. But then when we make this explicit and
intentional, I really think we should do it *right*, and either
switch() the names around or just copy it.

Having a "switch_names()" function that *neither* switches *nor*
copies, and giving it an argument to decide which, but not even do it
*right*? That's just too f*cking disgusting for words.

So seriously, I think we should:

 - keep the "switch_names()" as is, since it actually does what the
name says it does, and some callers want to statically do exactly
that.

 - make a new "move_name(src,dst)" that explicitly moves names, and
explicitly knows that the source needs to be kept alive (although it
*could* also look at the source dentry refcount to decide whether it
needs to be careful or not). And do it right for the long names too,
not just the inline case.

 - make that __d_move() caller just pick one or the other explicitly.

See what my complaint is? Not this half-assery that used to be a small
random detail, and that the patch makes into an institutionalized and
explicit half-assery.

(And Mikhail - I'm not ragging on you, even if I'm ragging on the
patch. I understand why you did it the way you did, and it makes sense
exactly in the "let's reinstate old hackery" model. I just think we
can and should do better than that, now that the "exchange" vs "move
over" semantics are so explicit).

Al (or Mikhail) - willing to do that extra cleanup? Please?

              Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-27 17:56             ` Linus Torvalds
@ 2014-09-27 18:31               ` Al Viro
  2014-09-27 19:16                 ` Al Viro
  2014-09-28 15:01               ` Mikhail Efremov
  1 sibling, 1 reply; 37+ messages in thread
From: Al Viro @ 2014-09-27 18:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mikhail Efremov, Linux Kernel Mailing List, Miklos Szeredi,
	linux-fsdevel, stable

On Sat, Sep 27, 2014 at 10:56:57AM -0700, Linus Torvalds wrote:

>  - make a new "move_name(src,dst)" that explicitly moves names, and
> explicitly knows that the source needs to be kept alive (although it
> *could* also look at the source dentry refcount to decide whether it
> needs to be careful or not). And do it right for the long names too,
> not just the inline case.

Er?  Source dentry refcount doesn't matter at all - *that* one gets
rehashed, so we must get its name right, no matter what.  The target
one is where it becomes interesting.  And I doubt it's worth bothering
with refcount checks on that either, TBH...

The reason why we really can't get it right for long names without more
serious surgery is that currently the lifetime of external name is
the same as of dentry having that name.  And that makes "just copy them
over, long or not" a problem - we suddenly get two dentries sharing the
external name.  And we hit that under a shitload of spinlocks, so
allocating a separate copy isn't attractive.  Mikhail (IIRC - I hadn't
watched this thread all that closely back then, so I might be misattributing)
had proposed refcounting the external names.  That would work, but we'll
need to get it right.  Sure, we can make that refcount atomic_t; it's not
as if we'll be playing with it a lot.  However, if the *source* name had
been external, we need to drop the refcount on that, and we need to
RCU-delay actual freeing.  Right now the delay comes from RCU-delaying
__d_free(), with some optimisations for dentries that had never been hashed;
this will be somewhat hairier.

> Al (or Mikhail) - willing to do that extra cleanup? Please?

See above.  I can do that, but it will be less trivial than you seem to
expect it to be.  Anyway, the situation right now: vfs.git#for-linus
survives the testing (and I'd like your S-o-b on the last one-liner there;
if you have saner commit message - even better).  That can go in right
now, as far as I'm concerned.

The minimal port of Mikhail's patch on top of that is below; it doesn't
include what you are asking for, it's just an update to the situation past
merging __d_move() and __d_materialise_dentry().

We can get the long name cases right, and I agree that it'll make the
things nicer, but it might take a couple of days to get right.  The thing
I'm concerned about is not screwing DCACHE_RCUACCESS up.

BTW, the comments about being more careful with ->d_name.hash than with
->d_name.name are from back in 2.3.40s; they became obsolete by 2.3.60s,
when we started to unhash the target instead of swapping hash chain positions
followed by d_delete() as we used to do when dcache was first introduced.
It's a really old piece of detritus...

diff --git a/fs/dcache.c b/fs/dcache.c
index 7599d35..cb25a1a 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2372,7 +2372,8 @@ void dentry_update_name_case(struct dentry *dentry, struct qstr *name)
 }
 EXPORT_SYMBOL(dentry_update_name_case);
 
-static void switch_names(struct dentry *dentry, struct dentry *target)
+static void switch_names(struct dentry *dentry, struct dentry *target,
+			 bool exchange)
 {
 	if (dname_external(target)) {
 		if (dname_external(dentry)) {
@@ -2406,6 +2407,12 @@ static void switch_names(struct dentry *dentry, struct dentry *target)
 			 */
 			unsigned int i;
 			BUILD_BUG_ON(!IS_ALIGNED(DNAME_INLINE_LEN, sizeof(long)));
+			if (!exchange) {
+				memcpy(dentry->d_iname, target->d_name.name,
+						target->d_name.len + 1);
+				dentry->d_name.hash_len = target->d_name.hash_len;
+				return;
+			}
 			for (i = 0; i < DNAME_INLINE_LEN / sizeof(long); i++) {
 				swap(((long *) &dentry->d_iname)[i],
 				     ((long *) &target->d_iname)[i]);
@@ -2456,12 +2463,15 @@ static void dentry_unlock_for_move(struct dentry *dentry, struct dentry *target)
  * When switching names, the actual string doesn't strictly have to
  * be preserved in the target - because we're dropping the target
  * anyway. As such, we can just do a simple memcpy() to copy over
- * the new name before we switch.
- *
- * Note that we have to be a lot more careful about getting the hash
- * switched - we have to switch the hash value properly even if it
- * then no longer matches the actual (corrupted) string of the target.
- * The hash value has to match the hash queue that the dentry is on..
+ * the new name before we switch, unless we are going to rehash
+ * it.  Note that if we *do* unhash the target, we are not allowed
+ * to rehash it without giving it a new name/hash key - whether
+ * we swap or overwrite the names here, resulting name won't match
+ * the reality in filesystem; it's only there for d_path() purposes.
+ * Note that all of this is happening under rename_lock, so the
+ * any hash lookup seeing it in the middle of manipulations will
+ * be discarded anyway.  So we do not care what happens to the hash
+ * key in that case.
  */
 /*
  * __d_move - move a dentry
@@ -2507,9 +2517,8 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 			   d_hash(dentry->d_parent, dentry->d_name.hash));
 	}
 
-
 	/* Switch the names.. */
-	switch_names(dentry, target);
+	switch_names(dentry, target, exchange);
 
 	/* ... and switch them in the tree */
 	if (IS_ROOT(dentry)) {

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-27 18:31               ` Al Viro
@ 2014-09-27 19:16                 ` Al Viro
  2014-09-27 19:37                   ` Linus Torvalds
                                     ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Al Viro @ 2014-09-27 19:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mikhail Efremov, Linux Kernel Mailing List, Miklos Szeredi,
	linux-fsdevel, stable

On Sat, Sep 27, 2014 at 07:31:39PM +0100, Al Viro wrote:

> We can get the long name cases right, and I agree that it'll make the
> things nicer, but it might take a couple of days to get right.  The thing
> I'm concerned about is not screwing DCACHE_RCUACCESS up.

FWIW, I suspect that the right approach is to put refcount + rcu_head in
front of external name and do the following:
	* __d_free() checks if we have an external name, gets its containing
structure and does if (atomic_dec_and_test(&name->count)) kfree(name);
	* switch_names() in non-exchange case (I'd probably call it copy_name,
not move_names, but anyway) sets DCACHE_RCUACCESS on target (source has
already gotten it from __d_rehash()), increments refcount on target's name
if external and, if the source old name is external, decrements its refcount
and calls kfree_rcu() if it has hit zero.

AFAICS, it guarantees that we'll schedule an RCU callback on name's rch_head
at most once, that we won't free it while RCU callback on it is scheduled
and we won't free it until a grace period has expired since the last time
it had been referenced by observable dentries.  Do you see any holes in that?


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-27 19:16                 ` Al Viro
@ 2014-09-27 19:37                   ` Linus Torvalds
  2014-09-27 19:39                     ` Linus Torvalds
  2014-09-27 19:45                   ` [PATCH v2] vfs: Don't exchange "short" filenames unconditionally Al Viro
  2014-09-28  7:47                   ` Al Viro
  2 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2014-09-27 19:37 UTC (permalink / raw)
  To: Al Viro
  Cc: Mikhail Efremov, Linux Kernel Mailing List, Miklos Szeredi,
	linux-fsdevel, stable

On Sat, Sep 27, 2014 at 12:16 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> FWIW, I suspect that the right approach is to put refcount + rcu_head in
> front of external name and do the following:

That's actually fancier than I was expecting. I was just thinking
doing a whole new allocation and freeing the old one using RCU. But I
guess you're right, you do need the rcu_head even for that, and once
you start adding fields you might as well just add a refcount too, and
then you don't have the annoyance of a potential memory allocation in
that code.

So your approach is better and doesn't sound too painful at all.

But yeah, I guess we can/should do the trivial ugly thing for 3.17.
Send me a proper pull-request,

                  Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-27 19:37                   ` Linus Torvalds
@ 2014-09-27 19:39                     ` Linus Torvalds
  2014-09-27 19:49                       ` Al Viro
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2014-09-27 19:39 UTC (permalink / raw)
  To: Al Viro
  Cc: Mikhail Efremov, Linux Kernel Mailing List, Miklos Szeredi,
	linux-fsdevel, stable

On Sat, Sep 27, 2014 at 12:37 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But yeah, I guess we can/should do the trivial ugly thing for 3.17.
> Send me a proper pull-request,

.. and if it can happen before tomorrow, that would be good. I had to
give up on my wish to release 3.17 tomorrow due to the current size of
changes anyway, so there will be a rc7 (and my travel inconveniences
be damned), but I'd like this to be in that rc7. Works for you?

             Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-27 19:16                 ` Al Viro
  2014-09-27 19:37                   ` Linus Torvalds
@ 2014-09-27 19:45                   ` Al Viro
  2014-09-28  7:47                   ` Al Viro
  2 siblings, 0 replies; 37+ messages in thread
From: Al Viro @ 2014-09-27 19:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mikhail Efremov, Linux Kernel Mailing List, Miklos Szeredi,
	linux-fsdevel, stable

On Sat, Sep 27, 2014 at 08:16:57PM +0100, Al Viro wrote:
> On Sat, Sep 27, 2014 at 07:31:39PM +0100, Al Viro wrote:
> 
> > We can get the long name cases right, and I agree that it'll make the
> > things nicer, but it might take a couple of days to get right.  The thing
> > I'm concerned about is not screwing DCACHE_RCUACCESS up.
> 
> FWIW, I suspect that the right approach is to put refcount + rcu_head in
> front of external name and do the following:
> 	* __d_free() checks if we have an external name, gets its containing
> structure and does if (atomic_dec_and_test(&name->count)) kfree(name);
> 	* switch_names() in non-exchange case (I'd probably call it copy_name,
> not move_names, but anyway) sets DCACHE_RCUACCESS on target (source has
> already gotten it from __d_rehash()), increments refcount on target's name
> if external and, if the source old name is external, decrements its refcount
> and calls kfree_rcu() if it has hit zero.
> 
> AFAICS, it guarantees that we'll schedule an RCU callback on name's rch_head
> at most once, that we won't free it while RCU callback on it is scheduled
> and we won't free it until a grace period has expired since the last time
> it had been referenced by observable dentries.  Do you see any holes in that?

We probably want to put a union of refcount and rcu_head there, actually...
Gives the right alignment without padding.  As in
struct ext_name {
	union {
		atomic_t count;
		struct rcu_head head;
	};
	char name[0];
};
->count corresponds to the number of dentries that have ->d_name.name
pointing to the sucker's ->name.  And we use ->head only when it reaches
zero in __d_move().  That's 2 words per external name; somewhat unpleasant
on 64bit, but I don't see how to avoid an rcu_head in there...  The cutoff
for external names is 32 bytes on 64bit boxen.  That way we get 16 bytes
of overhead per long-named dentry...  OTOH, we allocate them with kmalloc(),
so it means that 32-character names lead to 64-bytes actual allocation.

Hmmm...  So the old behaviour is
 32--63 	=> 64 byte allocation
 64--95 	=> 96
 96--127	=> 128
and the new one
 32--47 	=> 64 byte allocation
 48--79 	=> 96
 80--111	=> 128
112--127	=> 192
(components longer than 128 characters are definitely too rare to worry about)
IOW, the main worry is about the names in range from 48 to 64 characters;
for those we push the allocation from size-64 to size-96...

Note, BTW, that git hits external name case on everything except 32-bit UP;
a _lot_ of 38-character names there.  And IIRC there had been some plans for
possible replacement of SHA1 with something wider, right?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-27 19:39                     ` Linus Torvalds
@ 2014-09-27 19:49                       ` Al Viro
  2014-09-27 19:55                         ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Al Viro @ 2014-09-27 19:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mikhail Efremov, Linux Kernel Mailing List, Miklos Szeredi,
	linux-fsdevel, stable

On Sat, Sep 27, 2014 at 12:39:02PM -0700, Linus Torvalds wrote:
> On Sat, Sep 27, 2014 at 12:37 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > But yeah, I guess we can/should do the trivial ugly thing for 3.17.
> > Send me a proper pull-request,
> 
> .. and if it can happen before tomorrow, that would be good. I had to
> give up on my wish to release 3.17 tomorrow due to the current size of
> changes anyway, so there will be a rc7 (and my travel inconveniences
> be damned), but I'd like this to be in that rc7. Works for you?

Fine by me.  How about that s-o-b?  Right now it's

commit 4f78a56cd96a3d421852b5a03e10355b0cbe764b
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Wed Sep 24 12:27:39 2014 -0700

    fold swapping ->d_name.hash into switch_names()
    
    and do it along with ->d_name.len there
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

diff --git a/fs/dcache.c b/fs/dcache.c
index 92f7d76..7599d35 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2412,7 +2412,7 @@ static void switch_names(struct dentry *dentry, struct dentry *target)
                        }
                }
        }
-       swap(dentry->d_name.len, target->d_name.len);
+       swap(dentry->d_name.hash_len, target->d_name.hash_len);
 }
 
 static void dentry_lock_for_move(struct dentry *dentry, struct dentry *target)
@@ -2510,7 +2510,6 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 
        /* Switch the names.. */
        switch_names(dentry, target);
-       swap(dentry->d_name.hash, target->d_name.hash);
 
        /* ... and switch them in the tree */
        if (IS_ROOT(dentry)) {



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-27 19:49                       ` Al Viro
@ 2014-09-27 19:55                         ` Linus Torvalds
  2014-09-27 21:48                           ` [git pull] vfs.git for 3.17-rc7 Al Viro
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2014-09-27 19:55 UTC (permalink / raw)
  To: Al Viro
  Cc: Mikhail Efremov, Linux Kernel Mailing List, Miklos Szeredi,
	linux-fsdevel, stable

On Sat, Sep 27, 2014 at 12:49 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Fine by me.  How about that s-o-b?  Right now it's

Sure, add my sign-off too. Not that I feel I even need the credit, so
you could just have done it as yourself. But I'll take it, so..

  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Thanks,
        Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [git pull] vfs.git for 3.17-rc7
  2014-09-27 19:55                         ` Linus Torvalds
@ 2014-09-27 21:48                           ` Al Viro
  0 siblings, 0 replies; 37+ messages in thread
From: Al Viro @ 2014-09-27 21:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mikhail Efremov, Linux Kernel Mailing List, Miklos Szeredi,
	linux-fsdevel, stable

Assorted fixes + unifying __d_move() and __d_materialise_dentry() +
minimal regression fix for d_path() of victims of overwriting rename()
ported on top of that.  Please, pull from

git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus

Shortlog:
Al Viro (7):
      ufs: deal with nfsd/iget races
      pull rehashing and unlocking the target dentry into __d_materialise_dentry()
      don't open-code d_rehash() in d_materialise_unique()
      __d_move(): fold manipulations with ->d_child/->d_subdirs
      __d_materialise_dentry(): flip the order of arguments
      kill __d_materialise_dentry()
      fold unlocking the children into dentry_unlock_parents_for_move()

Linus Torvalds (1):
      fold swapping ->d_name.hash into switch_names()

Mikhail Efremov (1):
      vfs: Don't exchange "short" filenames unconditionally.

Miklos Szeredi (2):
      shmem: fix nlink for rename overwrite directory
      fuse: honour max_read and max_write in direct_io mode

Diffstat:
 fs/dcache.c         |  112 +++++++++++++++++----------------------------------
 fs/direct-io.c      |    2 +-
 fs/fuse/file.c      |    1 +
 fs/ufs/ialloc.c     |    6 ++-
 fs/ufs/namei.c      |    4 ++
 include/linux/uio.h |    2 +-
 mm/iov_iter.c       |   14 ++++---
 mm/shmem.c          |    4 +-
 8 files changed, 60 insertions(+), 85 deletions(-)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-27 19:16                 ` Al Viro
  2014-09-27 19:37                   ` Linus Torvalds
  2014-09-27 19:45                   ` [PATCH v2] vfs: Don't exchange "short" filenames unconditionally Al Viro
@ 2014-09-28  7:47                   ` Al Viro
  2014-09-28 18:05                     ` Al Viro
  2014-09-29 13:16                     ` Paul E. McKenney
  2 siblings, 2 replies; 37+ messages in thread
From: Al Viro @ 2014-09-28  7:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List, linux-fsdevel, Paul E. McKenney

On Sat, Sep 27, 2014 at 08:16:57PM +0100, Al Viro wrote:
> On Sat, Sep 27, 2014 at 07:31:39PM +0100, Al Viro wrote:
> 
> > We can get the long name cases right, and I agree that it'll make the
> > things nicer, but it might take a couple of days to get right.  The thing
> > I'm concerned about is not screwing DCACHE_RCUACCESS up.
> 
> FWIW, I suspect that the right approach is to put refcount + rcu_head in
> front of external name and do the following:
> 	* __d_free() checks if we have an external name, gets its containing
> structure and does if (atomic_dec_and_test(&name->count)) kfree(name);
> 	* switch_names() in non-exchange case (I'd probably call it copy_name,
> not move_names, but anyway) sets DCACHE_RCUACCESS on target (source has
> already gotten it from __d_rehash()), increments refcount on target's name
> if external and, if the source old name is external, decrements its refcount
> and calls kfree_rcu() if it has hit zero.
> 
> AFAICS, it guarantees that we'll schedule an RCU callback on name's rch_head
> at most once, that we won't free it while RCU callback on it is scheduled
> and we won't free it until a grace period has expired since the last time
> it had been referenced by observable dentries.  Do you see any holes in that?

Bugger...  I see one, actually.  Look: d1 and d2 come to share name at some
point.  name has refcount 2.  d1 gets dropped.  OK, we get RCU-delayed
__d_free().  refcount is still 2.  Now somebody starts looking through the
hash chain.  And somebody else renames d2.  __d_move() decrements refcount of
name to 1.  No kfree_rcu().  OK, _now_ __d_free() on d1 happens; it was
RCU-delayed, but that won't wait for rcu_read_lock() done after we'd
done call_rcu().  So __d_free() is called, name refcount reaches 0 and the
name is freed.  Just as the hash lookup has gotten around to looking at what
used to be the name of d2.  Oops...

The root cause, of course, is that we delay decrementing the refcount on
dentry_free() path...  One variant is to rip freeing these suckers out of
__d_free() and have dentry_free() do atomic_dec_and_test + kfree_rcu.
That works (and we don't need to set DCACHE_RCUACCESS in copy_name();
the interesting part of never-hashed case is for short names anyway), but
the cost is twice the number of rcu callbacks on freeing long-name dentries.
Is that really painful enough to care?  If not, something like the following
would do; if it is... well, we could probably do make free_dentry() mark
dentry->d_flags for __d_free() with "free the external name hanging off
this one" in case if refcount decrement has ended up with zero.  Doable,
but more complicated (and somewhat messy, since ->d_lock is not held
at that point; OTOH, we could play with ->d_name instead of ->d_flags...).
Anyway, the delta below might suffice.  Comments?

diff --git a/fs/dcache.c b/fs/dcache.c
index cb25a1a..b52f4ee 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -235,18 +235,34 @@ static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *c
 	return dentry_string_cmp(cs, ct, tcount);
 }
 
+struct ext_name {
+	union {
+		atomic_t count;
+		struct rcu_head head;
+	};
+	unsigned char name[];
+};
+
+static inline struct ext_name *ext_name(struct dentry *dentry)
+{
+	if (likely(!dname_external(dentry)))
+		return NULL;
+	return container_of(dentry->d_name.name, struct ext_name, name[0]);
+}
+
 static void __d_free(struct rcu_head *head)
 {
 	struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);
 
 	WARN_ON(!hlist_unhashed(&dentry->d_alias));
-	if (dname_external(dentry))
-		kfree(dentry->d_name.name);
 	kmem_cache_free(dentry_cache, dentry); 
 }
 
 static void dentry_free(struct dentry *dentry)
 {
+	struct ext_name *p = ext_name(dentry);
+	if (unlikely(p) && likely(atomic_dec_and_test(&p->count)))
+		kfree_rcu(p, head);
 	/* if dentry was never visible to RCU, immediate free is OK */
 	if (!(dentry->d_flags & DCACHE_RCUACCESS))
 		__d_free(&dentry->d_u.d_rcu);
@@ -1438,11 +1454,14 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	 */
 	dentry->d_iname[DNAME_INLINE_LEN-1] = 0;
 	if (name->len > DNAME_INLINE_LEN-1) {
-		dname = kmalloc(name->len + 1, GFP_KERNEL);
-		if (!dname) {
+		size_t size = offsetof(struct ext_name, name) + name->len + 1;
+		struct ext_name *p = kmalloc(size, GFP_KERNEL);
+		if (!p) {
 			kmem_cache_free(dentry_cache, dentry); 
 			return NULL;
 		}
+		atomic_set(&p->count, 1);
+		dname = p->name;
 	} else  {
 		dname = dentry->d_iname;
 	}	
@@ -2372,11 +2391,10 @@ void dentry_update_name_case(struct dentry *dentry, struct qstr *name)
 }
 EXPORT_SYMBOL(dentry_update_name_case);
 
-static void switch_names(struct dentry *dentry, struct dentry *target,
-			 bool exchange)
+static void swap_names(struct dentry *dentry, struct dentry *target)
 {
-	if (dname_external(target)) {
-		if (dname_external(dentry)) {
+	if (unlikely(dname_external(target))) {
+		if (unlikely(dname_external(dentry))) {
 			/*
 			 * Both external: swap the pointers
 			 */
@@ -2392,7 +2410,7 @@ static void switch_names(struct dentry *dentry, struct dentry *target,
 			target->d_name.name = target->d_iname;
 		}
 	} else {
-		if (dname_external(dentry)) {
+		if (unlikely(dname_external(dentry))) {
 			/*
 			 * dentry:external, target:internal.  Give dentry's
 			 * storage to target and make dentry internal
@@ -2407,12 +2425,6 @@ static void switch_names(struct dentry *dentry, struct dentry *target,
 			 */
 			unsigned int i;
 			BUILD_BUG_ON(!IS_ALIGNED(DNAME_INLINE_LEN, sizeof(long)));
-			if (!exchange) {
-				memcpy(dentry->d_iname, target->d_name.name,
-						target->d_name.len + 1);
-				dentry->d_name.hash_len = target->d_name.hash_len;
-				return;
-			}
 			for (i = 0; i < DNAME_INLINE_LEN / sizeof(long); i++) {
 				swap(((long *) &dentry->d_iname)[i],
 				     ((long *) &target->d_iname)[i]);
@@ -2422,6 +2434,22 @@ static void switch_names(struct dentry *dentry, struct dentry *target,
 	swap(dentry->d_name.hash_len, target->d_name.hash_len);
 }
 
+static void copy_name(struct dentry *dentry, struct dentry *target)
+{
+	struct ext_name *old_name = ext_name(dentry);
+	if (unlikely(dname_external(target))) {
+		atomic_inc(&ext_name(target)->count);
+		dentry->d_name = target->d_name;
+	} else {
+		memcpy(dentry->d_iname, target->d_name.name,
+				target->d_name.len + 1);
+		dentry->d_name.name = dentry->d_iname;
+		dentry->d_name.hash_len = target->d_name.hash_len;
+	}
+	if (unlikely(old_name) && likely(atomic_dec_and_test(&old_name->count)))
+		kfree_rcu(old_name, head);
+}
+
 static void dentry_lock_for_move(struct dentry *dentry, struct dentry *target)
 {
 	/*
@@ -2518,7 +2546,10 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 	}
 
 	/* Switch the names.. */
-	switch_names(dentry, target, exchange);
+	if (exchange)
+		swap_names(dentry, target);
+	else
+		copy_name(dentry, target);
 
 	/* ... and switch them in the tree */
 	if (IS_ROOT(dentry)) {

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-27 17:56             ` Linus Torvalds
  2014-09-27 18:31               ` Al Viro
@ 2014-09-28 15:01               ` Mikhail Efremov
  1 sibling, 0 replies; 37+ messages in thread
From: Mikhail Efremov @ 2014-09-28 15:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Mikhail Efremov, Linux Kernel Mailing List,
	Miklos Szeredi, linux-fsdevel, stable

On Sat, 27 Sep 2014 10:56:57 -0700 Linus Torvalds wrote:
> (And Mikhail - I'm not ragging on you, even if I'm ragging on the
> patch. I understand why you did it the way you did, and it makes sense
> exactly in the "let's reinstate old hackery" model. I just think we
> can and should do better than that, now that the "exchange" vs "move
> over" semantics are so explicit).

Well, I can't argue with the fact that this patch is ugly, especially
after all this discussion.
I just hope that the right solution will be implemented soon and this
ugly hack will be dropped.

> Al (or Mikhail) - willing to do that extra cleanup? Please?

I'll definetly take a look at this, but I doubt that I will be able to
implement the right solution soon. Actually this is the first time when
I look at the kernel code so deeply.
I hope that someone with much more knowledge of the kernel code than my
own (Al or Miklos may be) will do that.

-- 
WBR, Mikhail Efremov

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-28  7:47                   ` Al Viro
@ 2014-09-28 18:05                     ` Al Viro
  2014-09-28 21:51                       ` Al Viro
                                         ` (2 more replies)
  2014-09-29 13:16                     ` Paul E. McKenney
  1 sibling, 3 replies; 37+ messages in thread
From: Al Viro @ 2014-09-28 18:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-fsdevel, Paul E. McKenney,
	Mikhail Efremov

On Sun, Sep 28, 2014 at 08:47:47AM +0100, Al Viro wrote:

> The root cause, of course, is that we delay decrementing the refcount on
> dentry_free() path...  One variant is to rip freeing these suckers out of
> __d_free() and have dentry_free() do atomic_dec_and_test + kfree_rcu.
> That works (and we don't need to set DCACHE_RCUACCESS in copy_name();
> the interesting part of never-hashed case is for short names anyway), but
> the cost is twice the number of rcu callbacks on freeing long-name dentries.
> Is that really painful enough to care?  If not, something like the following
> would do; if it is... well, we could probably do make free_dentry() mark
> dentry->d_flags for __d_free() with "free the external name hanging off
> this one" in case if refcount decrement has ended up with zero.  Doable,
> but more complicated (and somewhat messy, since ->d_lock is not held
> at that point; OTOH, we could play with ->d_name instead of ->d_flags...).
> Anyway, the delta below might suffice.  Comments?

Now that I've got some sleep...  It's easy to avoid extra call_rcu, of
course - just have two variants of __d_free(), one freeing the external
name (always called via call_rcu()) and another just freeing the dentry
itself.  So free_dentry() would
	* decrement the refcount on external name and if that has reached
zero - call_rcu the full variant.
	* if there had been no external name or if its refcount has not
reached zero - call_rcu the variant that just frees dentry itself or
just do a direct call of the same if dentry has never been RCU-visible.

One thing that worries me is the barriers that might be needed on assignments
to ->d_name.name.  We should be no worse than we are right now - either RCU
accessors are careful enough with ACCESS_ONCE() and everything's fine,
or they are not, in which case we already have a bug in mainline - swapping
->d_name followed by dput() and freeing of target is no better than
copying ->d_name from target to source followed by kfree_rcu() of what
used to be ->d_name.name of source.  IOW, if RCU lookup could pick
a value of ->d_name.name that got obsolete by d_move() happening
before our read_lock_rcu(), we would be in trouble anyway - it might
already have had its freeing RCU-scheduled and thus not delayed
by read_lock_rcu() done afterwards.  So I think the patch below doesn't
introduce new problems of that sort, but I'd really appreciate if RCU
people would take a look at the situation with barriers in that area.
Are those ACCESS_ONCE() in dentry_cmp() and prepend_name() enough, or
do we need some barriers in switch_names() as well?

Folks, care to review and test the following?

Allow sharing external names after __d_move()

* external dentry names get a small structure prepended to them
(struct ext_name).
* it contains an atomic refcount, matching the number of struct dentry
instances that have ->d_name.name pointing to that ext name.  The
first thing free_dentry() does is decrementing refcount of ext name,
so the instances between the call of free_dentry() and RCU-delayed
actual freeing do not contribute.
* __d_move(x, y, false) makes the name of x equal to the name of y,
external or not.  If y has an external name, extra reference is grabbed
and put into x->d_name.name.  If x used to have an external name, the
reference to it is dropped and, should it reach zero, freeing is scheduled
via kfree_rcu().

All non-RCU accesses to dentry ext name are safe wrt freeing since they
all should happen before free_dentry() is called.  RCU accesses might run
into a dentry seen by free_dentry() or into an old name that got already
dropped by __d_move(); however, in both cases dentry must have been
alive and refer to that name at some point after we'd done rcu_read_lock(),
which means that any freeing must be still pending.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/dcache.c b/fs/dcache.c
index cb25a1a..0cf5aff 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -235,20 +235,44 @@ static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *c
 	return dentry_string_cmp(cs, ct, tcount);
 }
 
+struct ext_name {
+	union {
+		atomic_t count;
+		struct rcu_head head;
+	};
+	unsigned char name[];
+};
+
+static inline struct ext_name *ext_name(struct dentry *dentry)
+{
+	if (likely(!dname_external(dentry)))
+		return NULL;
+	return container_of(dentry->d_name.name, struct ext_name, name[0]);
+}
+
 static void __d_free(struct rcu_head *head)
 {
 	struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);
 
 	WARN_ON(!hlist_unhashed(&dentry->d_alias));
-	if (dname_external(dentry))
-		kfree(dentry->d_name.name);
+	kmem_cache_free(dentry_cache, dentry); 
+}
+
+static void __d_free_ext(struct rcu_head *head)
+{
+	struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);
+	WARN_ON(!hlist_unhashed(&dentry->d_alias));
+	kfree(ext_name(dentry));
 	kmem_cache_free(dentry_cache, dentry); 
 }
 
 static void dentry_free(struct dentry *dentry)
 {
+	struct ext_name *p = ext_name(dentry);
+	if (unlikely(p) && likely(atomic_dec_and_test(&p->count)))
+		call_rcu(&dentry->d_u.d_rcu, __d_free_ext);
 	/* if dentry was never visible to RCU, immediate free is OK */
-	if (!(dentry->d_flags & DCACHE_RCUACCESS))
+	else if (!(dentry->d_flags & DCACHE_RCUACCESS))
 		__d_free(&dentry->d_u.d_rcu);
 	else
 		call_rcu(&dentry->d_u.d_rcu, __d_free);
@@ -1438,11 +1462,14 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	 */
 	dentry->d_iname[DNAME_INLINE_LEN-1] = 0;
 	if (name->len > DNAME_INLINE_LEN-1) {
-		dname = kmalloc(name->len + 1, GFP_KERNEL);
-		if (!dname) {
+		size_t size = offsetof(struct ext_name, name) + name->len + 1;
+		struct ext_name *p = kmalloc(size, GFP_KERNEL);
+		if (!p) {
 			kmem_cache_free(dentry_cache, dentry); 
 			return NULL;
 		}
+		atomic_set(&p->count, 1);
+		dname = p->name;
 	} else  {
 		dname = dentry->d_iname;
 	}	
@@ -2372,11 +2399,10 @@ void dentry_update_name_case(struct dentry *dentry, struct qstr *name)
 }
 EXPORT_SYMBOL(dentry_update_name_case);
 
-static void switch_names(struct dentry *dentry, struct dentry *target,
-			 bool exchange)
+static void swap_names(struct dentry *dentry, struct dentry *target)
 {
-	if (dname_external(target)) {
-		if (dname_external(dentry)) {
+	if (unlikely(dname_external(target))) {
+		if (unlikely(dname_external(dentry))) {
 			/*
 			 * Both external: swap the pointers
 			 */
@@ -2392,7 +2418,7 @@ static void switch_names(struct dentry *dentry, struct dentry *target,
 			target->d_name.name = target->d_iname;
 		}
 	} else {
-		if (dname_external(dentry)) {
+		if (unlikely(dname_external(dentry))) {
 			/*
 			 * dentry:external, target:internal.  Give dentry's
 			 * storage to target and make dentry internal
@@ -2407,12 +2433,6 @@ static void switch_names(struct dentry *dentry, struct dentry *target,
 			 */
 			unsigned int i;
 			BUILD_BUG_ON(!IS_ALIGNED(DNAME_INLINE_LEN, sizeof(long)));
-			if (!exchange) {
-				memcpy(dentry->d_iname, target->d_name.name,
-						target->d_name.len + 1);
-				dentry->d_name.hash_len = target->d_name.hash_len;
-				return;
-			}
 			for (i = 0; i < DNAME_INLINE_LEN / sizeof(long); i++) {
 				swap(((long *) &dentry->d_iname)[i],
 				     ((long *) &target->d_iname)[i]);
@@ -2422,6 +2442,22 @@ static void switch_names(struct dentry *dentry, struct dentry *target,
 	swap(dentry->d_name.hash_len, target->d_name.hash_len);
 }
 
+static void copy_name(struct dentry *dentry, struct dentry *target)
+{
+	struct ext_name *old_name = ext_name(dentry);
+	if (unlikely(dname_external(target))) {
+		atomic_inc(&ext_name(target)->count);
+		dentry->d_name = target->d_name;
+	} else {
+		memcpy(dentry->d_iname, target->d_name.name,
+				target->d_name.len + 1);
+		dentry->d_name.name = dentry->d_iname;
+		dentry->d_name.hash_len = target->d_name.hash_len;
+	}
+	if (unlikely(old_name) && likely(atomic_dec_and_test(&old_name->count)))
+		kfree_rcu(old_name, head);
+}
+
 static void dentry_lock_for_move(struct dentry *dentry, struct dentry *target)
 {
 	/*
@@ -2518,7 +2554,10 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 	}
 
 	/* Switch the names.. */
-	switch_names(dentry, target, exchange);
+	if (exchange)
+		swap_names(dentry, target);
+	else
+		copy_name(dentry, target);
 
 	/* ... and switch them in the tree */
 	if (IS_ROOT(dentry)) {

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-28 18:05                     ` Al Viro
@ 2014-09-28 21:51                       ` Al Viro
  2014-09-29  1:06                         ` [PATCH] missing data dependency barrier in prepend_name() Al Viro
  2014-09-29 15:15                       ` [PATCH v2] vfs: Don't exchange "short" filenames unconditionally Linus Torvalds
  2014-09-29 18:42                       ` Paul E. McKenney
  2 siblings, 1 reply; 37+ messages in thread
From: Al Viro @ 2014-09-28 21:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-fsdevel, Paul E. McKenney,
	Mikhail Efremov

On Sun, Sep 28, 2014 at 07:05:56PM +0100, Al Viro wrote:

> One thing that worries me is the barriers that might be needed on assignments
> to ->d_name.name.  We should be no worse than we are right now - either RCU
> accessors are careful enough with ACCESS_ONCE() and everything's fine,
> or they are not, in which case we already have a bug in mainline - swapping
> ->d_name followed by dput() and freeing of target is no better than
> copying ->d_name from target to source followed by kfree_rcu() of what
> used to be ->d_name.name of source.  IOW, if RCU lookup could pick
> a value of ->d_name.name that got obsolete by d_move() happening
> before our read_lock_rcu(), we would be in trouble anyway - it might
> already have had its freeing RCU-scheduled and thus not delayed
> by read_lock_rcu() done afterwards.  So I think the patch below doesn't
> introduce new problems of that sort, but I'd really appreciate if RCU
> people would take a look at the situation with barriers in that area.
> Are those ACCESS_ONCE() in dentry_cmp() and prepend_name() enough, or
> do we need some barriers in switch_names() as well?

Hmm...  OK, dentry_cmp() is doing something similar to open-coded
rcu_dereference().  prepend_name() does not, and I really wonder if
that's correct...

I'm afraid that the answer is "should've been more careful when switching
d_path() to RCU", but maybe there's something subtle I'm missing there...

I really hate memory ordering rules on alpha ;-/

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] missing data dependency barrier in prepend_name()
  2014-09-28 21:51                       ` Al Viro
@ 2014-09-29  1:06                         ` Al Viro
  0 siblings, 0 replies; 37+ messages in thread
From: Al Viro @ 2014-09-29  1:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-fsdevel, Paul E. McKenney,
	Mikhail Efremov

On Sun, Sep 28, 2014 at 10:51:38PM +0100, Al Viro wrote:
> Hmm...  OK, dentry_cmp() is doing something similar to open-coded
> rcu_dereference().  prepend_name() does not, and I really wonder if
> that's correct...
> 
> I'm afraid that the answer is "should've been more careful when switching
> d_path() to RCU", but maybe there's something subtle I'm missing there...
> 
> I really hate memory ordering rules on alpha ;-/

AFAICS, prepend_name() is broken on SMP alpha.  Disclaimer: I don't have
SMP alpha boxen to reproduce it on.  However, it really looks like the race
is real.

CPU1: d_path() on /mnt/ramfs/<255-character>/foo
CPU2: mv /mnt/ramfs/<255-character> /mnt/ramfs/<63-character>

CPU2 does d_alloc(), which allocates an external name, stores the name there
including terminating NUL, does smp_wmb() and stores its address in
dentry->d_name.name.  It proceeds to d_add(dentry, NULL) and d_move()
old dentry over to that.  ->d_name.name value ends up in that dentry.

In the meanwhile, CPU1 gets to prepend_name() for that dentry.  It fetches
->d_name.name and ->d_name.len; the former ends up pointing to new name
(64-byte kmalloc'ed array), the latter - 255 (length of the old name).
Nothing to force the ordering there, and normally that would be OK, since we'd
run into the terminating NUL and stop.  Except that it's alpha, and we'd need
a data dependency barrier to guarantee that we see that store of NUL
__d_alloc() has done.  In a similar situation dentry_cmp() would survive; it
does explicit smp_read_barrier_depends() after fetching ->d_name.name.
prepend_name() doesn't and it risks walking past the end of kmalloc'ed object
and possibly oops due to taking a page fault in kernel mode.

Cc: stable@vger.kernel.org # 3.12+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/dcache.c b/fs/dcache.c
index cb25a1a..e7484f9 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2810,6 +2810,9 @@ static int prepend(char **buffer, int *buflen, const char *str, int namelen)
  * the beginning of the name. The sequence number check at the caller will
  * retry it again when a d_move() does happen. So any garbage in the buffer
  * due to mismatched pointer and length will be discarded.
+ *
+ * Data dependency barrier is needed to make sure that we see that terminating
+ * NUL.  Alpha strikes again, film at 11...
  */
 static int prepend_name(char **buffer, int *buflen, struct qstr *name)
 {
@@ -2817,6 +2820,8 @@ static int prepend_name(char **buffer, int *buflen, struct qstr *name)
 	u32 dlen = ACCESS_ONCE(name->len);
 	char *p;
 
+	smp_read_barrier_depends();
+
 	*buflen -= dlen + 1;
 	if (*buflen < 0)
 		return -ENAMETOOLONG;

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-28  7:47                   ` Al Viro
  2014-09-28 18:05                     ` Al Viro
@ 2014-09-29 13:16                     ` Paul E. McKenney
  2014-09-29 15:04                       ` Al Viro
  1 sibling, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2014-09-29 13:16 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel

On Sun, Sep 28, 2014 at 08:47:47AM +0100, Al Viro wrote:
> On Sat, Sep 27, 2014 at 08:16:57PM +0100, Al Viro wrote:
> > On Sat, Sep 27, 2014 at 07:31:39PM +0100, Al Viro wrote:
> > 
> > > We can get the long name cases right, and I agree that it'll make the
> > > things nicer, but it might take a couple of days to get right.  The thing
> > > I'm concerned about is not screwing DCACHE_RCUACCESS up.
> > 
> > FWIW, I suspect that the right approach is to put refcount + rcu_head in
> > front of external name and do the following:
> > 	* __d_free() checks if we have an external name, gets its containing
> > structure and does if (atomic_dec_and_test(&name->count)) kfree(name);
> > 	* switch_names() in non-exchange case (I'd probably call it copy_name,
> > not move_names, but anyway) sets DCACHE_RCUACCESS on target (source has
> > already gotten it from __d_rehash()), increments refcount on target's name
> > if external and, if the source old name is external, decrements its refcount
> > and calls kfree_rcu() if it has hit zero.
> > 
> > AFAICS, it guarantees that we'll schedule an RCU callback on name's rch_head
> > at most once, that we won't free it while RCU callback on it is scheduled
> > and we won't free it until a grace period has expired since the last time
> > it had been referenced by observable dentries.  Do you see any holes in that?
> 
> Bugger...  I see one, actually.  Look: d1 and d2 come to share name at some
> point.  name has refcount 2.  d1 gets dropped.  OK, we get RCU-delayed
> __d_free().  refcount is still 2.  Now somebody starts looking through the
> hash chain.  And somebody else renames d2.  __d_move() decrements refcount of
> name to 1.  No kfree_rcu().  OK, _now_ __d_free() on d1 happens; it was
> RCU-delayed, but that won't wait for rcu_read_lock() done after we'd
> done call_rcu().  So __d_free() is called, name refcount reaches 0 and the
> name is freed.  Just as the hash lookup has gotten around to looking at what
> used to be the name of d2.  Oops...

The "textbook" approach is to avoid starting the grace period until
the after the final reference count is dropped (and of course after
the name has been made inaccessible to all readers).  Not sure what
the best way to adapt the current code to this approach (if it is
in fact feasible to begin with), but one approach would be something
like this:

static void __d_free(struct rcu_head *head)
{
	struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);

	WARN_ON(!hlist_unhashed(&dentry->d_alias));
	if (dname_external(dentry) && atomic_dec_return(&dentry->refcnt))
		call_rcu(&dentry->rh, __d_free_name_rcu);
	kmem_cache_free(dentry_cache, dentry); 
}

Of course, this means that the link side has to do something like
atomic_add_unless(&&dentry->refcnt, 1, 0), and create a new name
if the old one is on its way out.  If that is too painful, another
approach is to increment the count for the grace period and decrement
it at the end of the grace period, and to skip the free if non-zero
after that decrement.  And this means adding an rcu_head to the
qstr structure, which might not help memory footprint.

> The root cause, of course, is that we delay decrementing the refcount on
> dentry_free() path...  One variant is to rip freeing these suckers out of
> __d_free() and have dentry_free() do atomic_dec_and_test + kfree_rcu.
> That works (and we don't need to set DCACHE_RCUACCESS in copy_name();
> the interesting part of never-hashed case is for short names anyway), but
> the cost is twice the number of rcu callbacks on freeing long-name dentries.
> Is that really painful enough to care?  If not, something like the following
> would do; if it is... well, we could probably do make free_dentry() mark
> dentry->d_flags for __d_free() with "free the external name hanging off
> this one" in case if refcount decrement has ended up with zero.  Doable,
> but more complicated (and somewhat messy, since ->d_lock is not held
> at that point; OTOH, we could play with ->d_name instead of ->d_flags...).
> Anyway, the delta below might suffice.  Comments?

Odd guesses below...

> diff --git a/fs/dcache.c b/fs/dcache.c
> index cb25a1a..b52f4ee 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -235,18 +235,34 @@ static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *c
>  	return dentry_string_cmp(cs, ct, tcount);
>  }
> 
> +struct ext_name {
> +	union {
> +		atomic_t count;
> +		struct rcu_head head;
> +	};
> +	unsigned char name[];
> +};
> +
> +static inline struct ext_name *ext_name(struct dentry *dentry)
> +{
> +	if (likely(!dname_external(dentry)))
> +		return NULL;
> +	return container_of(dentry->d_name.name, struct ext_name, name[0]);
> +}
> +
>  static void __d_free(struct rcu_head *head)
>  {
>  	struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);
> 
>  	WARN_ON(!hlist_unhashed(&dentry->d_alias));
> -	if (dname_external(dentry))
> -		kfree(dentry->d_name.name);
>  	kmem_cache_free(dentry_cache, dentry); 
>  }
> 
>  static void dentry_free(struct dentry *dentry)
>  {
> +	struct ext_name *p = ext_name(dentry);
> +	if (unlikely(p) && likely(atomic_dec_and_test(&p->count)))
> +		kfree_rcu(p, head);
>  	/* if dentry was never visible to RCU, immediate free is OK */
>  	if (!(dentry->d_flags & DCACHE_RCUACCESS))
>  		__d_free(&dentry->d_u.d_rcu);
> @@ -1438,11 +1454,14 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
>  	 */
>  	dentry->d_iname[DNAME_INLINE_LEN-1] = 0;
>  	if (name->len > DNAME_INLINE_LEN-1) {
> -		dname = kmalloc(name->len + 1, GFP_KERNEL);
> -		if (!dname) {
> +		size_t size = offsetof(struct ext_name, name) + name->len + 1;
> +		struct ext_name *p = kmalloc(size, GFP_KERNEL);
> +		if (!p) {
>  			kmem_cache_free(dentry_cache, dentry); 
>  			return NULL;
>  		}
> +		atomic_set(&p->count, 1);
> +		dname = p->name;
>  	} else  {
>  		dname = dentry->d_iname;
>  	}	
> @@ -2372,11 +2391,10 @@ void dentry_update_name_case(struct dentry *dentry, struct qstr *name)
>  }
>  EXPORT_SYMBOL(dentry_update_name_case);
> 
> -static void switch_names(struct dentry *dentry, struct dentry *target,
> -			 bool exchange)
> +static void swap_names(struct dentry *dentry, struct dentry *target)
>  {
> -	if (dname_external(target)) {
> -		if (dname_external(dentry)) {
> +	if (unlikely(dname_external(target))) {
> +		if (unlikely(dname_external(dentry))) {
>  			/*
>  			 * Both external: swap the pointers
>  			 */
> @@ -2392,7 +2410,7 @@ static void switch_names(struct dentry *dentry, struct dentry *target,
>  			target->d_name.name = target->d_iname;
>  		}
>  	} else {
> -		if (dname_external(dentry)) {
> +		if (unlikely(dname_external(dentry))) {
>  			/*
>  			 * dentry:external, target:internal.  Give dentry's
>  			 * storage to target and make dentry internal
> @@ -2407,12 +2425,6 @@ static void switch_names(struct dentry *dentry, struct dentry *target,
>  			 */
>  			unsigned int i;
>  			BUILD_BUG_ON(!IS_ALIGNED(DNAME_INLINE_LEN, sizeof(long)));
> -			if (!exchange) {
> -				memcpy(dentry->d_iname, target->d_name.name,
> -						target->d_name.len + 1);
> -				dentry->d_name.hash_len = target->d_name.hash_len;
> -				return;
> -			}
>  			for (i = 0; i < DNAME_INLINE_LEN / sizeof(long); i++) {
>  				swap(((long *) &dentry->d_iname)[i],
>  				     ((long *) &target->d_iname)[i]);

Presumably the write_seqcount_begin() prevents confusion from ephemeral
names...

> @@ -2422,6 +2434,22 @@ static void switch_names(struct dentry *dentry, struct dentry *target,
>  	swap(dentry->d_name.hash_len, target->d_name.hash_len);
>  }
> 
> +static void copy_name(struct dentry *dentry, struct dentry *target)
> +{
> +	struct ext_name *old_name = ext_name(dentry);
> +	if (unlikely(dname_external(target))) {
> +		atomic_inc(&ext_name(target)->count);

I might be missing something, but I believe that the above needs to
be atomic_add_unless().

Might also need to be in an RCU read-side critical section, but I am not
so sure about that.  If it is not in an RCU read-side critical section,
I don't see how the kfree_rcu() knows how long to wait before freeing
the name.

> +		dentry->d_name = target->d_name;
> +	} else {
> +		memcpy(dentry->d_iname, target->d_name.name,
> +				target->d_name.len + 1);
> +		dentry->d_name.name = dentry->d_iname;
> +		dentry->d_name.hash_len = target->d_name.hash_len;
> +	}
> +	if (unlikely(old_name) && likely(atomic_dec_and_test(&old_name->count)))
> +		kfree_rcu(old_name, head);
> +}
> +
>  static void dentry_lock_for_move(struct dentry *dentry, struct dentry *target)
>  {
>  	/*
> @@ -2518,7 +2546,10 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
>  	}
> 
>  	/* Switch the names.. */
> -	switch_names(dentry, target, exchange);
> +	if (exchange)
> +		swap_names(dentry, target);
> +	else
> +		copy_name(dentry, target);
> 
>  	/* ... and switch them in the tree */
>  	if (IS_ROOT(dentry)) {
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-29 13:16                     ` Paul E. McKenney
@ 2014-09-29 15:04                       ` Al Viro
  0 siblings, 0 replies; 37+ messages in thread
From: Al Viro @ 2014-09-29 15:04 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel

On Mon, Sep 29, 2014 at 06:16:13AM -0700, Paul E. McKenney wrote:

> The "textbook" approach is to avoid starting the grace period until
> the after the final reference count is dropped (and of course after
> the name has been made inaccessible to all readers).  Not sure what
> the best way to adapt the current code to this approach (if it is
> in fact feasible to begin with), but one approach would be something
> like this:

> static void __d_free(struct rcu_head *head)
> {
> 	struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);
> 
> 	WARN_ON(!hlist_unhashed(&dentry->d_alias));
> 	if (dname_external(dentry) && atomic_dec_return(&dentry->refcnt))
> 		call_rcu(&dentry->rh, __d_free_name_rcu);
> 	kmem_cache_free(dentry_cache, dentry); 
> }
> 
> Of course, this means that the link side has to do something like
> atomic_add_unless(&&dentry->refcnt, 1, 0), and create a new name
> if the old one is on its way out.  If that is too painful, another

What do you mean, "link"?  Rename, perhaps?

> approach is to increment the count for the grace period and decrement
> it at the end of the grace period, and to skip the free if non-zero
> after that decrement.  And this means adding an rcu_head to the
> qstr structure, which might not help memory footprint.

> Presumably the write_seqcount_begin() prevents confusion from ephemeral
> names...

We hold rename_lock, so __d_move() is serialized.  *And* both arguments
are pinned, so they couldn't have reached free_dentry() yet.

> > +static void copy_name(struct dentry *dentry, struct dentry *target)
> > +{
> > +	struct ext_name *old_name = ext_name(dentry);
> > +	if (unlikely(dname_external(target))) {
> > +		atomic_inc(&ext_name(target)->count);
> 
> I might be missing something, but I believe that the above needs to
> be atomic_add_unless().

It would better not; both dentry and target are pinned.

> Might also need to be in an RCU read-side critical section, but I am not
> so sure about that.  If it is not in an RCU read-side critical section,
> I don't see how the kfree_rcu() knows how long to wait before freeing
> the name.

It is a *writer*, not reader.  Readers of ->d_name are either under
rcu_read_lock(), or have hard exclusion wrt that __d_move() one way or another
and hold a reference to dentry, preventing dentry_free().  We really
need to care only about the ones under rcu_read_lock(), so kfree_rcu()
will do.  Non-RCU readers are relying on something like "I'm holding
a reference to dentry and ->i_mutex on the parent directory or
->s_vfs_rename_mutex on the entire filesystem".  Or "I'm holding that
reference and I know that on this filesystem nothing could lead to
d_move() and friends".

Anyway, see followup to that patch; it's equivalent to this one, but instead
of separately delayed freeing of dentry and ext name it makes __d_free()
free just the dentry itself and __d_free_ext() freeing both.  With
free_dentry() choosing which one to call_rcu().

FWIW, the thing I had been worrying about turned out to be a red herring (for
the same reason why RCU list removals don't need barriers - call_rcu() acts
as a barrier, so that reader managing to see the old pointer after
rcu_read_lock() will be seen by call_rcu() from writer, making the callback
delayed until that reader does rcu_read_unlock(); AFAICS, that's guaranteed
by smp_mb() in __call_rcu()).  However, the readers are, indeed, in trouble.
Already.  There's a data dependency - we want to make sure that the string
we observe via ->d_name.name is NUL-terminated.  We have dentry allocation
doing this
        dname[name->len] = 0;

        /* Make sure we always see the terminating NUL character */
        smp_wmb();
        dentry->d_name.name = dname;
and dentry_cmp() (one of the RCU readers) has
        cs = ACCESS_ONCE(dentry->d_name.name);
        smp_read_barrier_depends();
before looking at the string itself.  However, another such reader
(prepend_name()) doesn't - it just does ACCESS_ONCE() and assumes that
store of terminating NUL will be seen.  AFAICS, not guaranteed on Alpha,
so we need the same smp_read_barrier_depends() there.

Another thing: dentry_free() is called only when all non-RCU readers can't
see the dentry.  So that's where we need to decrement the refcount of ext
name, AFAICS.  Same in copy_name() (from __d_move()) where we lose the
reference to old name.  Again, non-RCU readers are not a problem - they
have exclusion wrt __d_move().  For them it's atomic.

Refcount is equal to the number of dentries visible to non-RCU readers with
->d_name.name pointing to that external name.  Freeing isn't scheduled until
that reaches 0 and is done via kfree_rcu() (or, in the later version of patch,
kfree_rcu() on one path and call_rcu() on another)...

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-28 18:05                     ` Al Viro
  2014-09-28 21:51                       ` Al Viro
@ 2014-09-29 15:15                       ` Linus Torvalds
  2014-09-29 15:59                         ` Al Viro
  2014-09-29 18:42                       ` Paul E. McKenney
  2 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2014-09-29 15:15 UTC (permalink / raw)
  To: Al Viro
  Cc: Linux Kernel Mailing List, linux-fsdevel, Paul E. McKenney,
	Mikhail Efremov

On Sun, Sep 28, 2014 at 11:05 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Folks, care to review and test the following?

No testing, but having thought about this some more, I'm personally
getting quite convinced that doing the RCU delaying of the external
name freeing in the __d_free() path is entirely pointless.

So I think the *only* rcu_free() you need is for just the "free old
name" case in copy_name().

In __d_free(), the name pointer has gone through the same grace period
that the dentry pointer itself went through. If it's not safe to free
the external name, then it damn well wouldn't be safe to free the
dentry itself either.

IOW, I think your games in __d_free() are totally unnecessary.

Now you can tell me why I'm wrong.

              Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-29 15:15                       ` [PATCH v2] vfs: Don't exchange "short" filenames unconditionally Linus Torvalds
@ 2014-09-29 15:59                         ` Al Viro
  2014-09-29 16:07                           ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Al Viro @ 2014-09-29 15:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-fsdevel, Paul E. McKenney,
	Mikhail Efremov

On Mon, Sep 29, 2014 at 08:15:51AM -0700, Linus Torvalds wrote:
> On Sun, Sep 28, 2014 at 11:05 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Folks, care to review and test the following?
> 
> No testing, but having thought about this some more, I'm personally
> getting quite convinced that doing the RCU delaying of the external
> name freeing in the __d_free() path is entirely pointless.
> 
> So I think the *only* rcu_free() you need is for just the "free old
> name" case in copy_name().
> 
> In __d_free(), the name pointer has gone through the same grace period
> that the dentry pointer itself went through. If it's not safe to free
> the external name, then it damn well wouldn't be safe to free the
> dentry itself either.
> 
> IOW, I think your games in __d_free() are totally unnecessary.
> 
> Now you can tell me why I'm wrong.

Sure.  Three dentries.  move the first over the second, dput the second.
The name is shared, the second dentry unhashed and the last reference
dropped.  free_dentry(second) does call_rcu(__d_free, ...).

Now RCU lookup starts.  And on another CPU we move the first dentry over
the third one.  copy_name() is called, it decrements the refcount down
to 1 (__d_free() hasn't happened yet) and doesn't schedule any freeing.

RCU lookup sees first dentry, still with the old name (what used to be the
name of the second dentry).

... and __d_free() on the second dentry is called.  call_rcu() has happened
before RCU lookup has done its rcu_read_lock(), so it's not required to
wait.  We decrement the refcount on ext name, it reaches 0 and we kfree()
it.  Right under dentry_cmp().  Oops...

We really need to decrement refcount before RCU-delaying.  IOW, the decrement
should be in dentry_free(), not in __d_free()...

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-29 15:59                         ` Al Viro
@ 2014-09-29 16:07                           ` Linus Torvalds
  2014-09-29 16:27                             ` Al Viro
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2014-09-29 16:07 UTC (permalink / raw)
  To: Al Viro
  Cc: Linux Kernel Mailing List, linux-fsdevel, Paul E. McKenney,
	Mikhail Efremov

On Mon, Sep 29, 2014 at 8:59 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Now RCU lookup starts.  And on another CPU we move the first dentry over
> the third one.  copy_name() is called, it decrements the refcount down
> to 1 (__d_free() hasn't happened yet) and doesn't schedule any freeing.

Ahh. If we were to do *all* of the copy-name name freeing under RCU,
we'd be ok. But we don't. We do the refcount decrement immediately
(and have to, if we want to have the rcu/refcount union). Yeah, good
call, although I find that doubvle rcu grace period for the common
case to be very annoying.

Ugh. I'm starting to be sorry I ever suggested this cleanup. Maybe
we're better off with the horrid hack (which is equivalent to what we
did before) after all.

               Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-29 16:07                           ` Linus Torvalds
@ 2014-09-29 16:27                             ` Al Viro
  2014-09-29 17:54                               ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Al Viro @ 2014-09-29 16:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-fsdevel, Paul E. McKenney,
	Mikhail Efremov

On Mon, Sep 29, 2014 at 09:07:00AM -0700, Linus Torvalds wrote:
> On Mon, Sep 29, 2014 at 8:59 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Now RCU lookup starts.  And on another CPU we move the first dentry over
> > the third one.  copy_name() is called, it decrements the refcount down
> > to 1 (__d_free() hasn't happened yet) and doesn't schedule any freeing.
> 
> Ahh. If we were to do *all* of the copy-name name freeing under RCU,
> we'd be ok. But we don't. We do the refcount decrement immediately
> (and have to, if we want to have the rcu/refcount union). Yeah, good
> call, although I find that doubvle rcu grace period for the common
> case to be very annoying.

See below (or in followup to what you were replying to); it *is* trivial to
avoid double grace periods there.  Look: in free_dentry() we know if we want
the name freed.  So let's have __d_free() for "just free dentry" and
__d_free_ext() for "free dentry and name".  No looking at refcount in the
latter - just choose which one to call_rcu() when free_dentry() does decrement.
It's equivalent from the grace period POV to doing kfree_rcu(), but avoids
double call_rcu(); it just that in case we'd want kfree_rcu() we
schedule the call of a function that frees both.

What we get in free_dentry() is:
	* external name not shared: refcount driven to 0, RCU-delayed
call of "free dentry, free ext name"
	* external name still shared: refcount positive after decrement,
no freeing ext name
	* no external name: no ext name to free
In the last two cases we do what dentry_free() used to do, except that now
__d_free() doesn't even look for ext name.  Just frees the dentry.  If
it never had been hashed - directly called, otherwise - via call_rcu().

Does that look OK for you?

diff --git a/fs/dcache.c b/fs/dcache.c
index cb25a1a..0cf5aff 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -235,20 +235,44 @@ static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *c
 	return dentry_string_cmp(cs, ct, tcount);
 }
 
+struct ext_name {
+	union {
+		atomic_t count;
+		struct rcu_head head;
+	};
+	unsigned char name[];
+};
+
+static inline struct ext_name *ext_name(struct dentry *dentry)
+{
+	if (likely(!dname_external(dentry)))
+		return NULL;
+	return container_of(dentry->d_name.name, struct ext_name, name[0]);
+}
+
 static void __d_free(struct rcu_head *head)
 {
 	struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);
 
 	WARN_ON(!hlist_unhashed(&dentry->d_alias));
-	if (dname_external(dentry))
-		kfree(dentry->d_name.name);
+	kmem_cache_free(dentry_cache, dentry); 
+}
+
+static void __d_free_ext(struct rcu_head *head)
+{
+	struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);
+	WARN_ON(!hlist_unhashed(&dentry->d_alias));
+	kfree(ext_name(dentry));
 	kmem_cache_free(dentry_cache, dentry); 
 }
 
 static void dentry_free(struct dentry *dentry)
 {
+	struct ext_name *p = ext_name(dentry);
+	if (unlikely(p) && likely(atomic_dec_and_test(&p->count)))
+		call_rcu(&dentry->d_u.d_rcu, __d_free_ext);
 	/* if dentry was never visible to RCU, immediate free is OK */
-	if (!(dentry->d_flags & DCACHE_RCUACCESS))
+	else if (!(dentry->d_flags & DCACHE_RCUACCESS))
 		__d_free(&dentry->d_u.d_rcu);
 	else
 		call_rcu(&dentry->d_u.d_rcu, __d_free);
@@ -1438,11 +1462,14 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	 */
 	dentry->d_iname[DNAME_INLINE_LEN-1] = 0;
 	if (name->len > DNAME_INLINE_LEN-1) {
-		dname = kmalloc(name->len + 1, GFP_KERNEL);
-		if (!dname) {
+		size_t size = offsetof(struct ext_name, name) + name->len + 1;
+		struct ext_name *p = kmalloc(size, GFP_KERNEL);
+		if (!p) {
 			kmem_cache_free(dentry_cache, dentry); 
 			return NULL;
 		}
+		atomic_set(&p->count, 1);
+		dname = p->name;
 	} else  {
 		dname = dentry->d_iname;
 	}	
@@ -2372,11 +2399,10 @@ void dentry_update_name_case(struct dentry *dentry, struct qstr *name)
 }
 EXPORT_SYMBOL(dentry_update_name_case);
 
-static void switch_names(struct dentry *dentry, struct dentry *target,
-			 bool exchange)
+static void swap_names(struct dentry *dentry, struct dentry *target)
 {
-	if (dname_external(target)) {
-		if (dname_external(dentry)) {
+	if (unlikely(dname_external(target))) {
+		if (unlikely(dname_external(dentry))) {
 			/*
 			 * Both external: swap the pointers
 			 */
@@ -2392,7 +2418,7 @@ static void switch_names(struct dentry *dentry, struct dentry *target,
 			target->d_name.name = target->d_iname;
 		}
 	} else {
-		if (dname_external(dentry)) {
+		if (unlikely(dname_external(dentry))) {
 			/*
 			 * dentry:external, target:internal.  Give dentry's
 			 * storage to target and make dentry internal
@@ -2407,12 +2433,6 @@ static void switch_names(struct dentry *dentry, struct dentry *target,
 			 */
 			unsigned int i;
 			BUILD_BUG_ON(!IS_ALIGNED(DNAME_INLINE_LEN, sizeof(long)));
-			if (!exchange) {
-				memcpy(dentry->d_iname, target->d_name.name,
-						target->d_name.len + 1);
-				dentry->d_name.hash_len = target->d_name.hash_len;
-				return;
-			}
 			for (i = 0; i < DNAME_INLINE_LEN / sizeof(long); i++) {
 				swap(((long *) &dentry->d_iname)[i],
 				     ((long *) &target->d_iname)[i]);
@@ -2422,6 +2442,22 @@ static void switch_names(struct dentry *dentry, struct dentry *target,
 	swap(dentry->d_name.hash_len, target->d_name.hash_len);
 }
 
+static void copy_name(struct dentry *dentry, struct dentry *target)
+{
+	struct ext_name *old_name = ext_name(dentry);
+	if (unlikely(dname_external(target))) {
+		atomic_inc(&ext_name(target)->count);
+		dentry->d_name = target->d_name;
+	} else {
+		memcpy(dentry->d_iname, target->d_name.name,
+				target->d_name.len + 1);
+		dentry->d_name.name = dentry->d_iname;
+		dentry->d_name.hash_len = target->d_name.hash_len;
+	}
+	if (unlikely(old_name) && likely(atomic_dec_and_test(&old_name->count)))
+		kfree_rcu(old_name, head);
+}
+
 static void dentry_lock_for_move(struct dentry *dentry, struct dentry *target)
 {
 	/*
@@ -2518,7 +2554,10 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 	}
 
 	/* Switch the names.. */
-	switch_names(dentry, target, exchange);
+	if (exchange)
+		swap_names(dentry, target);
+	else
+		copy_name(dentry, target);
 
 	/* ... and switch them in the tree */
 	if (IS_ROOT(dentry)) {

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-29 16:27                             ` Al Viro
@ 2014-09-29 17:54                               ` Linus Torvalds
  2014-09-29 19:04                                 ` Al Viro
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2014-09-29 17:54 UTC (permalink / raw)
  To: Al Viro
  Cc: Linux Kernel Mailing List, linux-fsdevel, Paul E. McKenney,
	Mikhail Efremov

On Mon, Sep 29, 2014 at 9:27 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What we get in free_dentry() is:
>         * external name not shared: refcount driven to 0, RCU-delayed
> call of "free dentry, free ext name"
>         * external name still shared: refcount positive after decrement,
> no freeing ext name
>         * no external name: no ext name to free
> In the last two cases we do what dentry_free() used to do, except that now
> __d_free() doesn't even look for ext name.  Just frees the dentry.  If
> it never had been hashed - directly called, otherwise - via call_rcu().
>
> Does that look OK for you?

Yes. That looks fairly straightforward.

Although please rename that "ext" in ext_name/__d_free_ext() to
something else. "ext" to me says not "external", but "extended". I
think we can just write out "external", like we already do in
"dname_external()".

                   Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-28 18:05                     ` Al Viro
  2014-09-28 21:51                       ` Al Viro
  2014-09-29 15:15                       ` [PATCH v2] vfs: Don't exchange "short" filenames unconditionally Linus Torvalds
@ 2014-09-29 18:42                       ` Paul E. McKenney
  2014-10-01  0:16                         ` Al Viro
  2 siblings, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2014-09-29 18:42 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	Mikhail Efremov

On Sun, Sep 28, 2014 at 07:05:56PM +0100, Al Viro wrote:
> On Sun, Sep 28, 2014 at 08:47:47AM +0100, Al Viro wrote:
> 
> > The root cause, of course, is that we delay decrementing the refcount on
> > dentry_free() path...  One variant is to rip freeing these suckers out of
> > __d_free() and have dentry_free() do atomic_dec_and_test + kfree_rcu.
> > That works (and we don't need to set DCACHE_RCUACCESS in copy_name();
> > the interesting part of never-hashed case is for short names anyway), but
> > the cost is twice the number of rcu callbacks on freeing long-name dentries.
> > Is that really painful enough to care?  If not, something like the following
> > would do; if it is... well, we could probably do make free_dentry() mark
> > dentry->d_flags for __d_free() with "free the external name hanging off
> > this one" in case if refcount decrement has ended up with zero.  Doable,
> > but more complicated (and somewhat messy, since ->d_lock is not held
> > at that point; OTOH, we could play with ->d_name instead of ->d_flags...).
> > Anyway, the delta below might suffice.  Comments?
> 
> Now that I've got some sleep...  It's easy to avoid extra call_rcu, of
> course - just have two variants of __d_free(), one freeing the external
> name (always called via call_rcu()) and another just freeing the dentry
> itself.

Assuming that incrementing the external name's reference count is
atomic_add_unless, I could believe this part.  Or if you have some
locking that makes it impossible to increment the reference count
in any case where there is any risk that it might be decremented
to zero, I guess.

Which might well be the pair of write_seqcount_begin() calls in __d_move(),
now that I look more carefully.  So if the name has to be in use somewhere
before it can be copied, then a copy can only be created if there is at
least one copy that is not currently being removed?  If so, OK.

>          So free_dentry() would
> 	* decrement the refcount on external name and if that has reached
> zero - call_rcu the full variant.
> 	* if there had been no external name or if its refcount has not
> reached zero - call_rcu the variant that just frees dentry itself or
> just do a direct call of the same if dentry has never been RCU-visible.
> 
> One thing that worries me is the barriers that might be needed on assignments
> to ->d_name.name.  We should be no worse than we are right now - either RCU
> accessors are careful enough with ACCESS_ONCE() and everything's fine,
> or they are not, in which case we already have a bug in mainline - swapping
> ->d_name followed by dput() and freeing of target is no better than
> copying ->d_name from target to source followed by kfree_rcu() of what
> used to be ->d_name.name of source.  IOW, if RCU lookup could pick
> a value of ->d_name.name that got obsolete by d_move() happening
> before our read_lock_rcu(), we would be in trouble anyway - it might
> already have had its freeing RCU-scheduled and thus not delayed
> by read_lock_rcu() done afterwards.

Yep!  Of course, the usual advice is to wait until the structure is
inaccessible to readers before starting the grace period.

>                                      So I think the patch below doesn't
> introduce new problems of that sort, but I'd really appreciate if RCU
> people would take a look at the situation with barriers in that area.
> Are those ACCESS_ONCE() in dentry_cmp() and prepend_name() enough, or
> do we need some barriers in switch_names() as well?
> 
> Folks, care to review and test the following?
> 
> Allow sharing external names after __d_move()
> 
> * external dentry names get a small structure prepended to them
> (struct ext_name).
> * it contains an atomic refcount, matching the number of struct dentry
> instances that have ->d_name.name pointing to that ext name.  The
> first thing free_dentry() does is decrementing refcount of ext name,
> so the instances between the call of free_dentry() and RCU-delayed
> actual freeing do not contribute.
> * __d_move(x, y, false) makes the name of x equal to the name of y,
> external or not.  If y has an external name, extra reference is grabbed
> and put into x->d_name.name.  If x used to have an external name, the
> reference to it is dropped and, should it reach zero, freeing is scheduled
> via kfree_rcu().
> 
> All non-RCU accesses to dentry ext name are safe wrt freeing since they
> all should happen before free_dentry() is called.  RCU accesses might run
> into a dentry seen by free_dentry() or into an old name that got already
> dropped by __d_move(); however, in both cases dentry must have been
> alive and refer to that name at some point after we'd done rcu_read_lock(),
> which means that any freeing must be still pending.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> diff --git a/fs/dcache.c b/fs/dcache.c
> index cb25a1a..0cf5aff 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -235,20 +235,44 @@ static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *c
>  	return dentry_string_cmp(cs, ct, tcount);
>  }
> 
> +struct ext_name {
> +	union {
> +		atomic_t count;
> +		struct rcu_head head;
> +	};

The usual trick to simplify handling string lengths is to put the
length in the same structure as the name.  Probably doesn't play well
with your swap operation, though.  I cannot find the code to which
this patch applies, but vaguely recall something about nul termination.

> +	unsigned char name[];
> +};
> +
> +static inline struct ext_name *ext_name(struct dentry *dentry)
> +{
> +	if (likely(!dname_external(dentry)))
> +		return NULL;
> +	return container_of(dentry->d_name.name, struct ext_name, name[0]);
> +}
> +
>  static void __d_free(struct rcu_head *head)
>  {
>  	struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);
> 
>  	WARN_ON(!hlist_unhashed(&dentry->d_alias));
> -	if (dname_external(dentry))
> -		kfree(dentry->d_name.name);
> +	kmem_cache_free(dentry_cache, dentry); 
> +}
> +
> +static void __d_free_ext(struct rcu_head *head)
> +{
> +	struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);
> +	WARN_ON(!hlist_unhashed(&dentry->d_alias));
> +	kfree(ext_name(dentry));
>  	kmem_cache_free(dentry_cache, dentry); 
>  }
> 
>  static void dentry_free(struct dentry *dentry)
>  {
> +	struct ext_name *p = ext_name(dentry);
> +	if (unlikely(p) && likely(atomic_dec_and_test(&p->count)))
> +		call_rcu(&dentry->d_u.d_rcu, __d_free_ext);
>  	/* if dentry was never visible to RCU, immediate free is OK */
> -	if (!(dentry->d_flags & DCACHE_RCUACCESS))
> +	else if (!(dentry->d_flags & DCACHE_RCUACCESS))
>  		__d_free(&dentry->d_u.d_rcu);
>  	else
>  		call_rcu(&dentry->d_u.d_rcu, __d_free);
> @@ -1438,11 +1462,14 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
>  	 */
>  	dentry->d_iname[DNAME_INLINE_LEN-1] = 0;
>  	if (name->len > DNAME_INLINE_LEN-1) {
> -		dname = kmalloc(name->len + 1, GFP_KERNEL);
> -		if (!dname) {
> +		size_t size = offsetof(struct ext_name, name) + name->len + 1;
> +		struct ext_name *p = kmalloc(size, GFP_KERNEL);
> +		if (!p) {
>  			kmem_cache_free(dentry_cache, dentry); 
>  			return NULL;
>  		}
> +		atomic_set(&p->count, 1);
> +		dname = p->name;
>  	} else  {
>  		dname = dentry->d_iname;
>  	}	
> @@ -2372,11 +2399,10 @@ void dentry_update_name_case(struct dentry *dentry, struct qstr *name)
>  }
>  EXPORT_SYMBOL(dentry_update_name_case);
> 
> -static void switch_names(struct dentry *dentry, struct dentry *target,
> -			 bool exchange)
> +static void swap_names(struct dentry *dentry, struct dentry *target)
>  {
> -	if (dname_external(target)) {
> -		if (dname_external(dentry)) {
> +	if (unlikely(dname_external(target))) {
> +		if (unlikely(dname_external(dentry))) {
>  			/*
>  			 * Both external: swap the pointers
>  			 */
> @@ -2392,7 +2418,7 @@ static void switch_names(struct dentry *dentry, struct dentry *target,
>  			target->d_name.name = target->d_iname;
>  		}
>  	} else {
> -		if (dname_external(dentry)) {
> +		if (unlikely(dname_external(dentry))) {
>  			/*
>  			 * dentry:external, target:internal.  Give dentry's
>  			 * storage to target and make dentry internal
> @@ -2407,12 +2433,6 @@ static void switch_names(struct dentry *dentry, struct dentry *target,
>  			 */
>  			unsigned int i;
>  			BUILD_BUG_ON(!IS_ALIGNED(DNAME_INLINE_LEN, sizeof(long)));
> -			if (!exchange) {
> -				memcpy(dentry->d_iname, target->d_name.name,
> -						target->d_name.len + 1);
> -				dentry->d_name.hash_len = target->d_name.hash_len;
> -				return;
> -			}
>  			for (i = 0; i < DNAME_INLINE_LEN / sizeof(long); i++) {
>  				swap(((long *) &dentry->d_iname)[i],
>  				     ((long *) &target->d_iname)[i]);
> @@ -2422,6 +2442,22 @@ static void switch_names(struct dentry *dentry, struct dentry *target,
>  	swap(dentry->d_name.hash_len, target->d_name.hash_len);
>  }
> 
> +static void copy_name(struct dentry *dentry, struct dentry *target)
> +{
> +	struct ext_name *old_name = ext_name(dentry);
> +	if (unlikely(dname_external(target))) {
> +		atomic_inc(&ext_name(target)->count);
> +		dentry->d_name = target->d_name;
> +	} else {
> +		memcpy(dentry->d_iname, target->d_name.name,
> +				target->d_name.len + 1);
> +		dentry->d_name.name = dentry->d_iname;
> +		dentry->d_name.hash_len = target->d_name.hash_len;
> +	}
> +	if (unlikely(old_name) && likely(atomic_dec_and_test(&old_name->count)))
> +		kfree_rcu(old_name, head);
> +}
> +
>  static void dentry_lock_for_move(struct dentry *dentry, struct dentry *target)
>  {
>  	/*
> @@ -2518,7 +2554,10 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
>  	}
> 
>  	/* Switch the names.. */
> -	switch_names(dentry, target, exchange);
> +	if (exchange)
> +		swap_names(dentry, target);
> +	else
> +		copy_name(dentry, target);
> 
>  	/* ... and switch them in the tree */
>  	if (IS_ROOT(dentry)) {
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-29 17:54                               ` Linus Torvalds
@ 2014-09-29 19:04                                 ` Al Viro
  2014-09-29 20:45                                   ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Al Viro @ 2014-09-29 19:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-fsdevel, Paul E. McKenney,
	Mikhail Efremov

On Mon, Sep 29, 2014 at 10:54:59AM -0700, Linus Torvalds wrote:
> On Mon, Sep 29, 2014 at 9:27 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > What we get in free_dentry() is:
> >         * external name not shared: refcount driven to 0, RCU-delayed
> > call of "free dentry, free ext name"
> >         * external name still shared: refcount positive after decrement,
> > no freeing ext name
> >         * no external name: no ext name to free
> > In the last two cases we do what dentry_free() used to do, except that now
> > __d_free() doesn't even look for ext name.  Just frees the dentry.  If
> > it never had been hashed - directly called, otherwise - via call_rcu().
> >
> > Does that look OK for you?
> 
> Yes. That looks fairly straightforward.
> 
> Although please rename that "ext" in ext_name/__d_free_ext() to
> something else. "ext" to me says not "external", but "extended". I
> think we can just write out "external", like we already do in
> "dname_external()".

That we do...  Why is it in dcache.h, BTW?  No users outside of fs/dcache.c
and I can't imagine a valid use for it in a module, let alone out-of-tree
one...

ObAnnoyance: gcc not figuring out that this container_of() isn't going to be
NULL...  Sure, somebody might put (char *)16 into dentry->d_name.name, but...
As it is, it makes for messier code generation; I can work around that, of
course, but it's uglier than it ought to be ;-/

OK, see vfs.git#copy_name.  Warning: it does need testing.  Two commits in
there, one adding data dependency barrier in prepend_name(), another - this
thing.  Comments would be welcome...

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-29 19:04                                 ` Al Viro
@ 2014-09-29 20:45                                   ` Linus Torvalds
  0 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2014-09-29 20:45 UTC (permalink / raw)
  To: Al Viro
  Cc: Linux Kernel Mailing List, linux-fsdevel, Paul E. McKenney,
	Mikhail Efremov

On Mon, Sep 29, 2014 at 12:04 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> "dname_external()".
>
> That we do...  Why is it in dcache.h, BTW?  No users outside of fs/dcache.c
> and I can't imagine a valid use for it in a module, let alone out-of-tree
> one...

Yeah, no, remove the declaration in dcache.h, and make it static to
fs/dcache.c. I don't think there's any possible valid use outside of
dcache.

> ObAnnoyance: gcc not figuring out that this container_of() isn't going to be
> NULL...  Sure, somebody might put (char *)16 into dentry->d_name.name, but...
> As it is, it makes for messier code generation; I can work around that, of
> course, but it's uglier than it ought to be ;-/

You could possibly play games with __attribute__((nonnull)) or
whatever it's called. Although I think you can only annotate arguments
that way (and you do it not by annotating the argument, but by making
it a function attribute, which I think is horrible, but whatever).

              Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-09-29 18:42                       ` Paul E. McKenney
@ 2014-10-01  0:16                         ` Al Viro
  2014-10-02  5:38                           ` Paul E. McKenney
  2014-10-02 10:35                           ` Chuck Ebbert
  0 siblings, 2 replies; 37+ messages in thread
From: Al Viro @ 2014-10-01  0:16 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	Mikhail Efremov

On Mon, Sep 29, 2014 at 11:42:18AM -0700, Paul E. McKenney wrote:

> Assuming that incrementing the external name's reference count is
> atomic_add_unless, I could believe this part.  Or if you have some
> locking that makes it impossible to increment the reference count
> in any case where there is any risk that it might be decremented
> to zero, I guess.
> 
> Which might well be the pair of write_seqcount_begin() calls in __d_move(),
> now that I look more carefully.  So if the name has to be in use somewhere
> before it can be copied, then a copy can only be created if there is at
> least one copy that is not currently being removed?  If so, OK.

Huh?  copy_name() does copy a _reference_, not the name itself.  All the
copying involved is source->d_name.name = target->d_name.name.  And those
are simply unsigned char *.

write_seqcount_begin() is irrelevant here.  Look: all callers of
__d_move(x, y) are holding references both to x and y.  Contributing to
the refcount of dentries themselves, that is, not the names.

That gives exclusion between __d_move() and free_dentry() - the latter cannot
be called until dentry refcount reaches zero.  RCU is completely irrelevant
here.  In fact, no call chain leads to __d_move() under rcu_read_lock().
You must hold the target dentry hard, or it could simply be freed right
under you.

And __d_move() is taking ->d_lock on all dentries involved (in
addition to rename_lock serializing it system-wide).

What could possibly lead to refcount zero being observed on target of
__d_move()?  The history of any dentry is this:
	* it is created by __d_alloc().  Nobody can see it until __d_alloc()
returns.  Dentry refcount (not to be confused with refcount of external
name) is 1.
	* it passes through some (usually - zero) __d_move() calls.
Some - as the first argument, some - as the second one.  All those
calls are serialized by global seqlock - callers must hold rename_lock.
And all of them are done by somebody who is holding a counting reference
to dentries in question.
	* counting references to dentry might be taken and dropped;
eventually refcount reaches zero (under ->d_lock) and no further
counting references can be taken after that.  See __dentry_kill() - the
first thing it does is poisoning the refcount, so that any future
attempt to increment it would fail.  __dentry_kill() (still under ->d_lock
of dentry, ->d_lock of its parent and ->i_lock of its inode) removes
dentry from the tree, from hash and from the alias list of inode;
Then it drops the locks.  At that point the only search structure dentry
might be found in is shrink list; if it's not on such list, free_dentry()
is called immediately, otherwise it's marked so that the code processing
the shrink list in question would, as soon as it gets to that sucker,
remove it from the shrink list and call the same free_dentry().  And that's
the only thing done to such dentry by somebody finding it via a shrink list.
	* once free_dentry() has been reached, dentry can can be only seen
by RCU lookups, and after the grace period ends it gets physically freed.

free_dentry() isn't allowed to overlap __d_move(); to have that happen is
a serious dentry refcounting bug.  No __d_move() is allowed _after_
free_dentry() has been entered, either.  Again, it would take a refcounting
bug for dentries to have that happen - basically, double dput() somewhere.
If that happens, all bets are off, of course - if dentry gets unexpectedly
freed under somebody who has grabbed a reference to it and has not dropped
it yet, we are fucked.

Nothing outside of __d_move() is allowed to change ->d_name.name.  RCU-critical
code is allowed to fetch and dereference it, and such code relies upon
	a) freeing of name seen by somebody who'd done rcu_read_lock() being
delayed until after the matching rcu_read_unlock()
	b) store of terminating NUL done by __d_alloc() (and never overwritten
afterwards) being seen by RCU-critical code that has found the pointer to
that name in dentry->d_name.name

All other code accessing ->d_name.name is required to hold one of the locks
that are held by __d_move() and its callers.  Grabbing any of those leads
to smp_mb() on alpha, which serves as data dependency barrier there, so
we don't need explicit barrier there as we do in RCU-critical places - guarding
NUL will be seen.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-10-01  0:16                         ` Al Viro
@ 2014-10-02  5:38                           ` Paul E. McKenney
  2014-10-02 10:35                           ` Chuck Ebbert
  1 sibling, 0 replies; 37+ messages in thread
From: Paul E. McKenney @ 2014-10-02  5:38 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	Mikhail Efremov

On Wed, Oct 01, 2014 at 01:16:15AM +0100, Al Viro wrote:
> On Mon, Sep 29, 2014 at 11:42:18AM -0700, Paul E. McKenney wrote:
> 
> > Assuming that incrementing the external name's reference count is
> > atomic_add_unless, I could believe this part.  Or if you have some
> > locking that makes it impossible to increment the reference count
> > in any case where there is any risk that it might be decremented
> > to zero, I guess.
> > 
> > Which might well be the pair of write_seqcount_begin() calls in __d_move(),
> > now that I look more carefully.  So if the name has to be in use somewhere
> > before it can be copied, then a copy can only be created if there is at
> > least one copy that is not currently being removed?  If so, OK.
> 
> Huh?  copy_name() does copy a _reference_, not the name itself.  All the
> copying involved is source->d_name.name = target->d_name.name.  And those
> are simply unsigned char *.
> 
> write_seqcount_begin() is irrelevant here.  Look: all callers of
> __d_move(x, y) are holding references both to x and y.  Contributing to
> the refcount of dentries themselves, that is, not the names.
> 
> That gives exclusion between __d_move() and free_dentry() - the latter cannot
> be called until dentry refcount reaches zero.  RCU is completely irrelevant
> here.  In fact, no call chain leads to __d_move() under rcu_read_lock().
> You must hold the target dentry hard, or it could simply be freed right
> under you.
> 
> And __d_move() is taking ->d_lock on all dentries involved (in
> addition to rename_lock serializing it system-wide).
> 
> What could possibly lead to refcount zero being observed on target of
> __d_move()?  The history of any dentry is this:
> 	* it is created by __d_alloc().  Nobody can see it until __d_alloc()
> returns.  Dentry refcount (not to be confused with refcount of external
> name) is 1.
> 	* it passes through some (usually - zero) __d_move() calls.
> Some - as the first argument, some - as the second one.  All those
> calls are serialized by global seqlock - callers must hold rename_lock.
> And all of them are done by somebody who is holding a counting reference
> to dentries in question.
> 	* counting references to dentry might be taken and dropped;
> eventually refcount reaches zero (under ->d_lock) and no further
> counting references can be taken after that.  See __dentry_kill() - the
> first thing it does is poisoning the refcount, so that any future
> attempt to increment it would fail.  __dentry_kill() (still under ->d_lock
> of dentry, ->d_lock of its parent and ->i_lock of its inode) removes
> dentry from the tree, from hash and from the alias list of inode;
> Then it drops the locks.  At that point the only search structure dentry
> might be found in is shrink list; if it's not on such list, free_dentry()
> is called immediately, otherwise it's marked so that the code processing
> the shrink list in question would, as soon as it gets to that sucker,
> remove it from the shrink list and call the same free_dentry().  And that's
> the only thing done to such dentry by somebody finding it via a shrink list.
> 	* once free_dentry() has been reached, dentry can can be only seen
> by RCU lookups, and after the grace period ends it gets physically freed.
> 
> free_dentry() isn't allowed to overlap __d_move(); to have that happen is
> a serious dentry refcounting bug.  No __d_move() is allowed _after_
> free_dentry() has been entered, either.  Again, it would take a refcounting
> bug for dentries to have that happen - basically, double dput() somewhere.
> If that happens, all bets are off, of course - if dentry gets unexpectedly
> freed under somebody who has grabbed a reference to it and has not dropped
> it yet, we are fucked.
> 
> Nothing outside of __d_move() is allowed to change ->d_name.name.  RCU-critical
> code is allowed to fetch and dereference it, and such code relies upon
> 	a) freeing of name seen by somebody who'd done rcu_read_lock() being
> delayed until after the matching rcu_read_unlock()
> 	b) store of terminating NUL done by __d_alloc() (and never overwritten
> afterwards) being seen by RCU-critical code that has found the pointer to
> that name in dentry->d_name.name
> 
> All other code accessing ->d_name.name is required to hold one of the locks
> that are held by __d_move() and its callers.  Grabbing any of those leads
> to smp_mb() on alpha, which serves as data dependency barrier there, so
> we don't need explicit barrier there as we do in RCU-critical places - guarding
> NUL will be seen.

Please accept my apologies for the noise!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-10-01  0:16                         ` Al Viro
  2014-10-02  5:38                           ` Paul E. McKenney
@ 2014-10-02 10:35                           ` Chuck Ebbert
  2014-10-03  2:11                             ` Paul E. McKenney
  1 sibling, 1 reply; 37+ messages in thread
From: Chuck Ebbert @ 2014-10-02 10:35 UTC (permalink / raw)
  To: Al Viro
  Cc: Paul E. McKenney, Linux Kernel Mailing List, linux-fsdevel,
	Mikhail Efremov

On Wed, 1 Oct 2014 01:16:15 +0100
Al Viro <viro@ZenIV.linux.org.uk> wrote:

Can we get the below added somewhere in Documentation/filesystems/ ? I
don't see anything there that covers all this.

> 
> Huh?  copy_name() does copy a _reference_, not the name itself.  All the
> copying involved is source->d_name.name = target->d_name.name.  And those
> are simply unsigned char *.
> 
> write_seqcount_begin() is irrelevant here.  Look: all callers of
> __d_move(x, y) are holding references both to x and y.  Contributing to
> the refcount of dentries themselves, that is, not the names.
> 
> That gives exclusion between __d_move() and free_dentry() - the latter cannot
> be called until dentry refcount reaches zero.  RCU is completely irrelevant
> here.  In fact, no call chain leads to __d_move() under rcu_read_lock().
> You must hold the target dentry hard, or it could simply be freed right
> under you.
> 
> And __d_move() is taking ->d_lock on all dentries involved (in
> addition to rename_lock serializing it system-wide).
> 
> What could possibly lead to refcount zero being observed on target of
> __d_move()?  The history of any dentry is this:
> 	* it is created by __d_alloc().  Nobody can see it until __d_alloc()
> returns.  Dentry refcount (not to be confused with refcount of external
> name) is 1.
> 	* it passes through some (usually - zero) __d_move() calls.
> Some - as the first argument, some - as the second one.  All those
> calls are serialized by global seqlock - callers must hold rename_lock.
> And all of them are done by somebody who is holding a counting reference
> to dentries in question.
> 	* counting references to dentry might be taken and dropped;
> eventually refcount reaches zero (under ->d_lock) and no further
> counting references can be taken after that.  See __dentry_kill() - the
> first thing it does is poisoning the refcount, so that any future
> attempt to increment it would fail.  __dentry_kill() (still under ->d_lock
> of dentry, ->d_lock of its parent and ->i_lock of its inode) removes
> dentry from the tree, from hash and from the alias list of inode;
> Then it drops the locks.  At that point the only search structure dentry
> might be found in is shrink list; if it's not on such list, free_dentry()
> is called immediately, otherwise it's marked so that the code processing
> the shrink list in question would, as soon as it gets to that sucker,
> remove it from the shrink list and call the same free_dentry().  And that's
> the only thing done to such dentry by somebody finding it via a shrink list.
> 	* once free_dentry() has been reached, dentry can can be only seen
> by RCU lookups, and after the grace period ends it gets physically freed.
> 
> free_dentry() isn't allowed to overlap __d_move(); to have that happen is
> a serious dentry refcounting bug.  No __d_move() is allowed _after_
> free_dentry() has been entered, either.  Again, it would take a refcounting
> bug for dentries to have that happen - basically, double dput() somewhere.
> If that happens, all bets are off, of course - if dentry gets unexpectedly
> freed under somebody who has grabbed a reference to it and has not dropped
> it yet, we are fucked.
> 
> Nothing outside of __d_move() is allowed to change ->d_name.name.  RCU-critical
> code is allowed to fetch and dereference it, and such code relies upon
> 	a) freeing of name seen by somebody who'd done rcu_read_lock() being
> delayed until after the matching rcu_read_unlock()
> 	b) store of terminating NUL done by __d_alloc() (and never overwritten
> afterwards) being seen by RCU-critical code that has found the pointer to
> that name in dentry->d_name.name
> 
> All other code accessing ->d_name.name is required to hold one of the locks
> that are held by __d_move() and its callers.  Grabbing any of those leads
> to smp_mb() on alpha, which serves as data dependency barrier there, so
> we don't need explicit barrier there as we do in RCU-critical places - guarding
> NUL will be seen.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
  2014-10-02 10:35                           ` Chuck Ebbert
@ 2014-10-03  2:11                             ` Paul E. McKenney
  0 siblings, 0 replies; 37+ messages in thread
From: Paul E. McKenney @ 2014-10-03  2:11 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Al Viro, Linux Kernel Mailing List, linux-fsdevel, Mikhail Efremov

On Thu, Oct 02, 2014 at 05:35:26AM -0500, Chuck Ebbert wrote:
> On Wed, 1 Oct 2014 01:16:15 +0100
> Al Viro <viro@ZenIV.linux.org.uk> wrote:
> 
> Can we get the below added somewhere in Documentation/filesystems/ ? I
> don't see anything there that covers all this.

More documentation would of course be nice, but the root cause of my
confusion was attempting to give an intelligent review of a significant
change to VFS given a 2-hour chunk of time, and without having spent
enough time getting familiar with VFS.  I would of course need to spend
more like a week or two, or at least several days, going through the
current code.

								Thanx, Paul

> > Huh?  copy_name() does copy a _reference_, not the name itself.  All the
> > copying involved is source->d_name.name = target->d_name.name.  And those
> > are simply unsigned char *.
> > 
> > write_seqcount_begin() is irrelevant here.  Look: all callers of
> > __d_move(x, y) are holding references both to x and y.  Contributing to
> > the refcount of dentries themselves, that is, not the names.
> > 
> > That gives exclusion between __d_move() and free_dentry() - the latter cannot
> > be called until dentry refcount reaches zero.  RCU is completely irrelevant
> > here.  In fact, no call chain leads to __d_move() under rcu_read_lock().
> > You must hold the target dentry hard, or it could simply be freed right
> > under you.
> > 
> > And __d_move() is taking ->d_lock on all dentries involved (in
> > addition to rename_lock serializing it system-wide).
> > 
> > What could possibly lead to refcount zero being observed on target of
> > __d_move()?  The history of any dentry is this:
> > 	* it is created by __d_alloc().  Nobody can see it until __d_alloc()
> > returns.  Dentry refcount (not to be confused with refcount of external
> > name) is 1.
> > 	* it passes through some (usually - zero) __d_move() calls.
> > Some - as the first argument, some - as the second one.  All those
> > calls are serialized by global seqlock - callers must hold rename_lock.
> > And all of them are done by somebody who is holding a counting reference
> > to dentries in question.
> > 	* counting references to dentry might be taken and dropped;
> > eventually refcount reaches zero (under ->d_lock) and no further
> > counting references can be taken after that.  See __dentry_kill() - the
> > first thing it does is poisoning the refcount, so that any future
> > attempt to increment it would fail.  __dentry_kill() (still under ->d_lock
> > of dentry, ->d_lock of its parent and ->i_lock of its inode) removes
> > dentry from the tree, from hash and from the alias list of inode;
> > Then it drops the locks.  At that point the only search structure dentry
> > might be found in is shrink list; if it's not on such list, free_dentry()
> > is called immediately, otherwise it's marked so that the code processing
> > the shrink list in question would, as soon as it gets to that sucker,
> > remove it from the shrink list and call the same free_dentry().  And that's
> > the only thing done to such dentry by somebody finding it via a shrink list.
> > 	* once free_dentry() has been reached, dentry can can be only seen
> > by RCU lookups, and after the grace period ends it gets physically freed.
> > 
> > free_dentry() isn't allowed to overlap __d_move(); to have that happen is
> > a serious dentry refcounting bug.  No __d_move() is allowed _after_
> > free_dentry() has been entered, either.  Again, it would take a refcounting
> > bug for dentries to have that happen - basically, double dput() somewhere.
> > If that happens, all bets are off, of course - if dentry gets unexpectedly
> > freed under somebody who has grabbed a reference to it and has not dropped
> > it yet, we are fucked.
> > 
> > Nothing outside of __d_move() is allowed to change ->d_name.name.  RCU-critical
> > code is allowed to fetch and dereference it, and such code relies upon
> > 	a) freeing of name seen by somebody who'd done rcu_read_lock() being
> > delayed until after the matching rcu_read_unlock()
> > 	b) store of terminating NUL done by __d_alloc() (and never overwritten
> > afterwards) being seen by RCU-critical code that has found the pointer to
> > that name in dentry->d_name.name
> > 
> > All other code accessing ->d_name.name is required to hold one of the locks
> > that are held by __d_move() and its callers.  Grabbing any of those leads
> > to smp_mb() on alpha, which serves as data dependency barrier there, so
> > we don't need explicit barrier there as we do in RCU-critical places - guarding
> > NUL will be seen.
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2014-10-03  2:11 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-24 18:14 [PATCH v2] vfs: Don't exchange "short" filenames unconditionally Mikhail Efremov
2014-09-24 18:55 ` Al Viro
2014-09-24 19:20   ` Linus Torvalds
2014-09-24 19:27     ` Linus Torvalds
2014-09-24 20:18     ` Al Viro
2014-09-25  4:46       ` Al Viro
2014-09-26 16:44         ` Al Viro
2014-09-27  4:45           ` Al Viro
2014-09-27 17:56             ` Linus Torvalds
2014-09-27 18:31               ` Al Viro
2014-09-27 19:16                 ` Al Viro
2014-09-27 19:37                   ` Linus Torvalds
2014-09-27 19:39                     ` Linus Torvalds
2014-09-27 19:49                       ` Al Viro
2014-09-27 19:55                         ` Linus Torvalds
2014-09-27 21:48                           ` [git pull] vfs.git for 3.17-rc7 Al Viro
2014-09-27 19:45                   ` [PATCH v2] vfs: Don't exchange "short" filenames unconditionally Al Viro
2014-09-28  7:47                   ` Al Viro
2014-09-28 18:05                     ` Al Viro
2014-09-28 21:51                       ` Al Viro
2014-09-29  1:06                         ` [PATCH] missing data dependency barrier in prepend_name() Al Viro
2014-09-29 15:15                       ` [PATCH v2] vfs: Don't exchange "short" filenames unconditionally Linus Torvalds
2014-09-29 15:59                         ` Al Viro
2014-09-29 16:07                           ` Linus Torvalds
2014-09-29 16:27                             ` Al Viro
2014-09-29 17:54                               ` Linus Torvalds
2014-09-29 19:04                                 ` Al Viro
2014-09-29 20:45                                   ` Linus Torvalds
2014-09-29 18:42                       ` Paul E. McKenney
2014-10-01  0:16                         ` Al Viro
2014-10-02  5:38                           ` Paul E. McKenney
2014-10-02 10:35                           ` Chuck Ebbert
2014-10-03  2:11                             ` Paul E. McKenney
2014-09-29 13:16                     ` Paul E. McKenney
2014-09-29 15:04                       ` Al Viro
2014-09-28 15:01               ` Mikhail Efremov
2014-09-26 20:23         ` Al Viro

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.